The graph on the left is a zoomed-in display of micro-F1 values, precision, and recall.
Study Context
NLP methods continue to evolve, often to overcome the limitations of existing techniques. Additionally, embedding models require much less text processing and feature engineering to achieve SOTA performance compared to earlier clinical NLP methods.
Motivation for the Study
This persists despite the fact that embedding models are much more generalizable and have the potential to provide state-of-the-art (SOTA) performance on most downstream NLP tasks.
Aim and Scope
Study Significance
A secondary intended outcome of the study is to determine which embedding algorithm(s) and pretraining corpora represent words in a way that best matches clinical judgment. A third intended outcome is to determine which embedding model(s) and pretraining corpora perform best on clinical phenotyping using radiology text.
Overview of the Study
The results of our intrinsic evaluation task clarify which model and pretraining corpus choices are most useful for optimizing word vector representations to match clinician assessment. These results also provide insight into the effect of pretraining custom models, how model and corpus choice influence phenotype classification performance, and which choices are preferable depending on label balance and access to computational resources.
Clinical Text
Ontologies and Knowledge Graphs
Additional heuristics have also been introduced to address the challenges of detecting negations and abbreviations (Chapman et al. 2001; Schwartz and Hearst 2003). Yet, the heterogeneity of clinical text can affect the performance of these tools.
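To make the negation-detection idea concrete, here is a minimal sketch in the spirit of the NegEx heuristic cited above: a finding term is treated as negated when a negation trigger appears within a short window of preceding words. The trigger list, window size, and example sentences are illustrative assumptions, not the published NegEx algorithm.

```python
import re

# Toy trigger list (illustrative only; NegEx's published trigger set is larger
# and includes scope-termination rules).
NEGATION_TRIGGERS = re.compile(r"\b(no|denies|without|negative for)\b", re.IGNORECASE)

def is_negated(sentence: str, term: str, window: int = 5) -> bool:
    """Return True if `term` is preceded by a negation trigger within `window` words."""
    words = re.findall(r"\w+", sentence.lower())
    term_words = term.lower().split()
    for i in range(len(words) - len(term_words) + 1):
        if words[i:i + len(term_words)] == term_words:
            preceding = " ".join(words[max(0, i - window):i])
            if NEGATION_TRIGGERS.search(preceding):
                return True
    return False

print(is_negated("No evidence of pneumothorax.", "pneumothorax"))        # negated
print(is_negated("There is a small left pleural effusion.", "pleural effusion"))
```

Even this toy version hints at why heterogeneous clinical text is hard: trigger vocabularies and negation scopes vary across report styles.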
Count-based word representations
- One-hot encodings
- n-grams
- Term frequency and inverse document frequency
- Dense vector representations
- Limitations of Count-based Methods
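The count-based representations listed above can be sketched in a few lines with scikit-learn; the report-like sentences below are hypothetical examples, not drawn from the study corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus of report-like sentences (hypothetical examples).
docs = [
    "nodule in the right upper lobe",
    "no nodule or mass identified",
    "stable appearance of the left lower lobe",
]

# Raw term counts: each document becomes a sparse bag-of-words vector.
counts = CountVectorizer().fit_transform(docs)

# TF-IDF reweights those counts, down-weighting terms common to many documents.
weighted = TfidfVectorizer().fit_transform(docs)

print(counts.shape, weighted.shape)  # same vocabulary, different weights
```

Both matrices share the same vocabulary axis; only the cell weights differ, which is why TF-IDF inherits the sparsity and similarity limitations discussed next.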
As detailed above, many of the count-based methods for representing words fail to capture information about semantic or syntactic similarity. One of the most common approaches to addressing this in NLP is Singular Value Decomposition (SVD), popularized in this space through Latent Semantic Analysis (LSA), which includes the SVD calculation (Levy, Goldberg and Dagan 2015).
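The LSA recipe can be sketched as a truncated SVD over a term-document count matrix; the documents and rank below are illustrative assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (hypothetical): two about nodules, two about fractures.
docs = [
    "lung nodule seen on chest ct",
    "pulmonary nodule on chest radiograph",
    "fracture of the left femur",
    "femur fracture after fall",
]

# Term-document counts, then a rank-2 SVD: the LSA recipe in miniature.
X = CountVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(X)  # dense, low-rank document representations

print(doc_vectors.shape)  # (4, 2)
```

The low-rank projection places documents that share vocabulary near one another, recovering some of the similarity structure that raw counts miss.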
Non-contextual Word Embeddings
- word2vec
- Beyond word2vec
- Properties of NCWEs
- Augmenting Non-contextual Embeddings with Knowledge Graphs
- Other weighting schemes
Agirre et al. also found that the combination of the two approaches performed better than either alone (Agirre et al. 2009). In addition, neural networks have been shown to outperform LSA in their ability to preserve linear regularities between words (Mikolov, Chen, et al. 2013).
Deep Learning for NLP
Recurrent Neural Networks
Levy, Goldberg, and Dagan find that a few adjustments to the PPMI matrix, namely softening the context distribution and applying singular value decomposition (SVD), can produce word vector representations that yield comparable performance to word2vec and GloVe embeddings on a range of common intrinsic evaluation tasks (Levy, Goldberg and Dagan 2015). Long short-term memory (LSTM) and gated recurrent unit (GRU) models are variants of the traditional RNN that improve the ability to retain long-term information and thus overcome the vanishing gradient problem (Hochreiter and Schmidhuber 1997; Cho et al. 2014; Goodfellow, Bengio and Courville 2016).
Attention Mechanisms
In practice, RNNs can become unstable during training, especially when processing longer input sequences (Bengio, Simard, and Frasconi 1994). Gradient clipping mitigates this by reducing the gradient before applying the gradient descent optimization update during training (Goodfellow, Bengio, and Courville 2016).
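In PyTorch, gradient-norm clipping is a one-line call between the backward pass and the optimizer step; the tiny RNN, data, and loss below are stand-ins for illustration only.

```python
import torch
import torch.nn as nn

# A tiny RNN and one training step with gradient-norm clipping; shapes and
# the loss here are illustrative, not the study's configuration.
model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 20, 8)      # batch of 4 sequences, 20 time steps each
output, _ = model(x)
loss = output.pow(2).mean()    # stand-in loss

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global norm is at most 1.0 before the update;
# the call returns the norm measured before clipping.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Clipping does not change the gradient direction, only its magnitude, which is why it stabilizes training on long sequences without biasing the descent direction.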
Contextual Word Embeddings
- Transformer Architecture
- Tokenization in BERT
- Self-Attention in BERT
- The Introduction of Efficient Transformers
- Limitations of Contextual Word Embeddings
During pre-training, BERT attempts to reconstruct the original token that was masked, as shown in Figure 2.4 (Devlin et al. 2019). The CLS token of the BERT model is a hidden state representation of the entire input string, which has size 768 in the basic BERT model (Devlin et al. 2019).
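The MLM masking procedure described above can be sketched in plain Python: roughly 15% of positions are selected, and of those, 80% become [MASK], 10% a random vocabulary token, and 10% are left unchanged (Devlin et al. 2019). The tokens, seed, and sampling from the sentence's own vocabulary are illustrative assumptions.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """BERT-style masking sketch: select ~15% of positions; of those, 80% become
    [MASK], 10% a random vocabulary token, 10% stay unchanged. The model is then
    trained to reconstruct the original token at each selected position."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))  # toy stand-in for a real wordpiece vocabulary
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok           # the reconstruction target
            roll = rng.random()
            if roll < 0.8:
                masked[i] = "[MASK]"
            elif roll < 0.9:
                masked[i] = rng.choice(vocab)
            # else: keep the original token
    return masked, targets

sentence = "no acute cardiopulmonary process is seen on chest radiograph".split()
masked, targets = mask_tokens(sentence)
print(masked, targets)
```

The targets dictionary is what the pre-training loss is computed against: only the selected positions contribute to the MLM objective.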
Intrinsic Evaluation
Semantic Similarity
Semantic similarity as I use it here refers to the attributive similarity or synonymy of words; words that are synonymous have a high degree of semantic similarity. Semantic similarity is considered a special case of the more general concept of semantic relatedness (Resnik 1995; Turney 2006). For example, the terms scalpel and surgeon or lung and nodule are not synonymous, but are related.
Qualitative Evaluation
Chang et al. used Amazon Mechanical Turk to examine human judgments of word or topic coherence for each task (Chang et al. 2009). In addition to introducing the word2vec algorithm, Mikolov et al. introduced a new word analogy task to illustrate the semantic and syntactic properties of word2vec-trained embeddings (Mikolov, Chen, et al. 2013).
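The analogy test from Mikolov et al. (the 3CosAdd formulation) can be demonstrated on hand-made toy vectors; the embedding values below are fabricated for illustration and carry no linguistic meaning beyond this example.

```python
import numpy as np

# Toy embedding table (hand-made, purely illustrative) for the classic
# analogy: king - man + woman should land nearest to queen.
emb = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "prince": np.array([0.8, 0.9, 0.2]),
}

def analogy(a, b, c):
    """Return the word closest (by cosine) to vec(b) - vec(a) + vec(c),
    excluding the three query words themselves."""
    target = emb[b] - emb[a] + emb[c]
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(emb[w], target))

print(analogy("man", "king", "woman"))  # queen
```

Excluding the query words is important in practice: without it, one of the inputs is often the nearest neighbor of the target vector.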
Quantitative Evaluation
Specifically, BERT-based models include a masked-language modeling (MLM) pre-training strategy as shown in Figure 2.4 (Devlin et al. 2019). Of the 120 term pairs, the 30 pairs with the highest agreement among the coders were then annotated by 9 medical coders and 3 rheumatologists on a 4-point scale (Pedersen et al. 2007).
Extrinsic Evaluation
In addition, the General Language Understanding Evaluation (GLUE) and SuperGLUE benchmarks include a diverse set of general natural language understanding tasks (A. Wang et al.). More recently, the Long Range Arena benchmark was introduced to evaluate and compare the performance of Efficient Transformer models (Tay et al. 2021). Of the 10 tasks in the BLUE benchmark set, only one evaluates document classification performance and involves multi-label classification of PubMed abstracts for hallmarks of cancer (Baker et al. 2015).
Other phenotyping techniques have included rule-based systems, statistical or machine learning-based approaches, and hybrid methods (Shivade et al. 2014). They concluded that there are few robust tools or off-the-shelf models available for phenotyping (Shivade et al. 2014).
Methods
- Study Design
- Text Pre-processing
- Non-contextual embedding models
- Contextual embedding models
- Term pair selection
- Cloze task generation
- Survey administration
- Survey analysis
I used the gensim library implementations of the word2vec and fastText models (Mikolov, Chen, et al. 2013). In this study, spherical embedding models were created using the source code provided by the authors (Meng et al. 2019). I also used a spherical embedding model trained on the English Wikipedia corpus provided by Meng et al. (2019).
I used the transformers library to create a custom tokenizer and for pre-training on the radiology report corpus (Wolf et al. 2020). Each cloze task prediction was generated with each of the 3 BERT-based models under study using the transformers library (Wolf et al. 2020).
Results
- Comparison to Fleischner glossary
- Non-contextual embedding model similarity
- Survey participant characteristics
- Term pair similarity
- Performance on UMNSRS benchmarks
- Cloze task accuracy
Clinical background of the resident, fellow and attending physicians included anesthesiology (1), emergency medicine (1), family medicine (1), internal medicine (5) and radiology (1). The calculated Pearson correlation coefficient (ρp) and Spearman correlation coefficient (ρs) values comparing mean study similarity scores with cosine similarity for each model are shown in Table 3.3. Among the embedding models trained using our X-ray corpus, the word2vec model returned the lowest correlation between human and model similarity scores (ρp 0.34; ρs 0.35).
For the word2vec model trained on English Wikipedia, ρp and ρs were 0.31 and 0.32, respectively; the spherical embedding model trained on Wikipedia returned a ρp of 0.26 and ρs of 0.27. To compare the performance of the word2vec model across training corpora, I present the performance of the previously published algorithm on the modified UMNSRS benchmarks alongside our findings in Table 3.5 (Pakhomov et al. 2016).
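The ρp and ρs values reported above are standard Pearson and Spearman correlations between human ratings and model cosine similarities; the scores below are fabricated toy data, not the study's measurements.

```python
from scipy.stats import pearsonr, spearmanr

# Toy data (hypothetical): mean human similarity ratings and model cosine
# similarities for five term pairs.
human = [3.8, 3.1, 2.5, 1.9, 1.2]
cosine = [0.72, 0.55, 0.58, 0.31, 0.15]

rho_p, _ = pearsonr(human, cosine)   # linear association
rho_s, _ = spearmanr(human, cosine)  # rank-order association
print(round(rho_p, 2), round(rho_s, 2))
```

Spearman depends only on rank order, so it is robust to the fact that human ratings and cosine similarities live on different scales; Pearson additionally rewards a linear relationship.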
Discussion
Similar to my own results, they found that fastText models tend to outperform word2vec models on most term pair similarity benchmarks, including the English Rare Word dataset (Bojanowski et al. 2017). The originally proposed BioWordVec fastText model was trained using PubMed and Medical Subject Headings (MeSH) (Y. Zhang et al. 2019). They also find that increasing the size of the context window improves the correlation between model and human judgment (Agirre et al. 2009).
I attempted to enrich the embedding models using the RadLex ontology, but met with little success as I found limited RadLex term coverage in our corpus, which is a challenge previously noted (Percha et al. 2018). BERT-based models can only accept a maximum sequence input length of 512 tokens (Devlin et al. 2019).
Conclusion
With limited research on model bias in clinical NLP, I argue that this area warrants further investigation, particularly given the evidence of racial and other disparities present in clinical evidence, management, and outcomes (Balderston et al. 2021). Future studies may use methods to generate multi-word terms while still allowing for comparison within the embedding space (Mikolov, Sutskever, et al. 2013; Rose et al. 2010; Campos et al. 2018). Furthermore, the term pair generation method excluded term pairs that were not present in all the nested vocabularies.
Another limitation is that the corpus of radiology reports was primarily limited to CT scans, particularly of the chest. Also, the corpus was curated from a single academic medical center, which limits the size of the corpus and biases the conditions and findings identified in the reports toward those found in the region.
Methods
- Study Design
- Text processing
- Non-contextual embedding models
- Contextual embedding models
- Manual Document Annotation
- Model Setup
- Logistic Regression Models
- Bidirectional LSTM Models
- BERT-based Models
- Efficient Transformer Models
- Statistical Analyses
For binary classification, I used the scikit-learn implementation of stratified k-fold cross-validation via the StratifiedKFold method (Pedregosa et al. 2011). Each of our binary and multi-label classification models was trained using the binary cross-entropy with logits loss implemented in PyTorch (Paszke et al. 2019). Each of our metrics was calculated using the scikit-learn methods for accuracy, precision, recall, and F1 scores (Pedregosa et al. 2011).
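A minimal sketch of the scikit-learn pieces named above; the labels, fold count, and toy predictions are illustrative assumptions, not the study's data or settings.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy imbalanced labels: StratifiedKFold keeps the positive rate roughly
# constant across folds, which matters for rare phenotype labels.
y = np.array([0] * 8 + [1] * 2)
X = np.zeros((10, 1))  # placeholder features

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    print(int(y[test_idx].sum()), "positive(s) in this fold")

# Precision, recall, and F1 on toy predictions, via the scikit-learn methods.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred),
      round(f1_score(y_true, y_pred), 2))
```

With only two positives and two folds, stratification guarantees each test fold sees exactly one positive; a plain KFold could easily place both in the same fold.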
When using the spherical embeddings trained on English Wikipedia at a maximum learning rate of 0.001, the model failed to converge. Binary classification: As with our BERT-based models (Section 4.1.9), I loaded the Longformer equivalent of each of these three models and their tokenizers using the transformers library (Wolf et al. 2020).
Results
- Annotation
- Label Distribution
- Binary Classification
- Multi-label Classification
BiLSTM models: Model performance for each of our bidirectional LSTM (BiLSTM) models is shown in Table 4.3 and Figure 4.2. BERT-based models: Performance measures for each of our three studied BERT-based models are shown in Table 4.4 and Figure 4.3. Efficient Transformer models: Performance measures for each of our Efficient Transformer models are shown in Table 4.5.
Logistic regression models: The results for each of our logistic regression models are presented in Table 4.6 and Figure 4.7. Efficient Transformer models: Performance metrics for each of our Efficient Transformer models are shown in Table 4.9.
Discussion
In Figure 4.12, I compare micro-F1, precision, and recall between the BERT-based models and the Efficient Transformer models. As described in Section 4.1.8, the BiLSTM models using our custom spherical embeddings and the spherical embeddings trained on English Wikipedia converged only after increasing the learning rate. BERT-based models: For binary and multi-label classification, I consistently found that our custom RadBERT model outperformed the base BERT model and the BlueBERT model.
Efficient Transformer models: Our Efficient Transformer models are the Longformer equivalents of the BERT-based models: uncased base BERT, BlueBERT, and our custom RadBERT model. As expected, the BiLSTM models trained faster than the BERT-based and Efficient Transformer models.
Conclusion
A Generalized Natural-Language Text Processor for Clinical Radiology." Journal of the American Medical Informatics Association. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). "Extending a radiology lexicon through contextual patterns in radiology reports." Journal of the American Medical Informatics Association.
CLAMP – a toolkit for efficiently building personalized natural language processing clinical pipelines.” Journal of the American Medical Informatics Association. Identification of Patient Smoking Status from Discharge Medical Records.” Journal of the American Medical Informatics Association.