The graph on the left is a zoomed-in display of micro-F1 values, precision, and recall.
Study Context
NLP methods continue to evolve, often to overcome the limitations of existing techniques. Additionally, embedding models require much less text processing and feature engineering to achieve SOTA performance compared to earlier clinical NLP methods.
Motivation for the Study
This persists despite the fact that embedding models are much more generalizable and have the potential to provide state-of-the-art (SOTA) performance on most downstream NLP tasks.
Aim and Scope
Study Significance
A secondary intended outcome of the study is to determine which embedding algorithm(s) and pretraining corpora represent words in a way that best matches clinical judgment. A third intended outcome is to determine which embedding model(s) and pretraining corpora perform best on clinical phenotyping using radiology text.
Overview of the Study
The results of our intrinsic evaluation task clarify which model and pretraining corpus choices are most useful for optimizing word vector representations to match clinician assessment. These results also provide insight into the effect of pretraining custom models, how model and corpus choice influence phenotype classification performance, and which choices are preferable depending on label balance and access to computational resources.
Clinical Text
Ontologies and Knowledge Graphs
Additional heuristics have also been introduced to address the challenges of detecting negations and abbreviations (Chapman et al. 2001; Schwartz and Hearst 2003). Yet, the heterogeneity of clinical text can affect the performance of these tools.
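To make the negation-detection idea concrete, here is a minimal sketch in the spirit of the NegEx heuristic cited above: a finding term is treated as negated when a negation trigger appears within a short window of preceding words. The trigger list, window size, and example sentences are illustrative assumptions, not the published NegEx algorithm.

```python
import re

# Toy trigger list (illustrative only; NegEx's published trigger set is larger
# and includes scope-termination rules).
NEGATION_TRIGGERS = re.compile(r"\b(no|denies|without|negative for)\b", re.IGNORECASE)

def is_negated(sentence: str, term: str, window: int = 5) -> bool:
    """Return True if `term` is preceded by a negation trigger within `window` words."""
    words = re.findall(r"\w+", sentence.lower())
    term_words = term.lower().split()
    for i in range(len(words) - len(term_words) + 1):
        if words[i:i + len(term_words)] == term_words:
            preceding = " ".join(words[max(0, i - window):i])
            if NEGATION_TRIGGERS.search(preceding):
                return True
    return False

print(is_negated("No evidence of pneumothorax.", "pneumothorax"))        # negated
print(is_negated("There is a small left pleural effusion.", "pleural effusion"))
```

Even this toy version hints at why heterogeneous clinical text is hard: trigger vocabularies and negation scopes vary across report styles.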
Count-based word representations
- One-hot encodings
- n-grams
- Term frequency and inverse document frequency
- Dense vector representations
- Limitations of Count-based Methods
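The count-based representations listed above can be sketched in a few lines with scikit-learn; the report-like sentences below are hypothetical examples, not drawn from the study corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus of report-like sentences (hypothetical examples).
docs = [
    "nodule in the right upper lobe",
    "no nodule or mass identified",
    "stable appearance of the left lower lobe",
]

# Raw term counts: each document becomes a sparse bag-of-words vector.
counts = CountVectorizer().fit_transform(docs)

# TF-IDF reweights those counts, down-weighting terms common to many documents.
weighted = TfidfVectorizer().fit_transform(docs)

print(counts.shape, weighted.shape)  # same vocabulary, different weights
```

Both matrices share the same vocabulary axis; only the cell weights differ, which is why TF-IDF inherits the sparsity and similarity limitations discussed next.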
As detailed above, many of the count-based methods for representing words fail to capture information about semantic or syntactic similarity. One of the most common approaches to addressing this in NLP is Singular Value Decomposition (SVD), popularized in this space through Latent Semantic Analysis (LSA), which includes the SVD calculation (Levy, Goldberg and Dagan 2015).
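The LSA recipe can be sketched as a truncated SVD over a term-document count matrix; the documents and rank below are illustrative assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (hypothetical): two about nodules, two about fractures.
docs = [
    "lung nodule seen on chest ct",
    "pulmonary nodule on chest radiograph",
    "fracture of the left femur",
    "femur fracture after fall",
]

# Term-document counts, then a rank-2 SVD: the LSA recipe in miniature.
X = CountVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(X)  # dense, low-rank document representations

print(doc_vectors.shape)  # (4, 2)
```

The low-rank projection places documents that share vocabulary near one another, recovering some of the similarity structure that raw counts miss.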
Non-contextual Word Embeddings
- word2vec
- Beyond word2vec
- Properties of NCWEs
- Augmenting Non-contextual Embeddings with Knowledge Graphs
- Other weighting schemes
Agirre et al. also found that the combination of the two approaches performed better than either alone (Agirre et al. 2009). In addition, neural networks have been shown to outperform LSA in their ability to preserve linear regularities between words (Mikolov, Chen, et al. 2013).
Deep Learning for NLP
Recurrent Neural Networks
Levy, Goldberg, and Dagan find that a few adjustments to the PPMI matrix, namely softening the context distribution and applying singular value decomposition (SVD), can produce word vector representations that yield comparable performance to word2vec and GloVe embeddings on a range of common intrinsic evaluation tasks (Levy, Goldberg and Dagan 2015). Long short-term memory (LSTM) and gated recurrent unit (GRU) models are variants of the traditional RNN that improve the ability to retain long-term information and thus overcome the vanishing gradient problem (Hochreiter and Schmidhuber 1997; Cho et al. 2014; Goodfellow, Bengio and Courville 2016).
Attention Mechanisms
In practice, RNNs can become unstable during training, especially when processing longer input sequences (Bengio, Simard, and Frasconi 1994). Gradient clipping mitigates this by reducing the gradient before applying the gradient descent optimization update during training (Goodfellow, Bengio, and Courville 2016).
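In PyTorch, gradient-norm clipping is a one-line call between the backward pass and the optimizer step; the tiny RNN, data, and loss below are stand-ins for illustration only.

```python
import torch
import torch.nn as nn

# A tiny RNN and one training step with gradient-norm clipping; shapes and
# the loss here are illustrative, not the study's configuration.
model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 20, 8)      # batch of 4 sequences, 20 time steps each
output, _ = model(x)
loss = output.pow(2).mean()    # stand-in loss

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global norm is at most 1.0 before the update;
# the call returns the norm measured before clipping.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Clipping does not change the gradient direction, only its magnitude, which is why it stabilizes training on long sequences without biasing the descent direction.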
Contextual Word Embeddings
- Transformer Architecture
- Tokenization in BERT
- Self-Attention in BERT
- The Introduction of Efficient Transformers
- Limitations of Contextual Word Embeddings
During pre-training, BERT attempts to reconstruct the original token that was masked, as shown in Figure 2.4 (Devlin et al. 2019). The CLS token of the BERT model is a hidden state representation of the entire input string, which has size 768 in the basic BERT model (Devlin et al. 2019).
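The MLM masking procedure described above can be sketched in plain Python: roughly 15% of positions are selected, and of those, 80% become [MASK], 10% a random vocabulary token, and 10% are left unchanged (Devlin et al. 2019). The tokens, seed, and sampling from the sentence's own vocabulary are illustrative assumptions.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """BERT-style masking sketch: select ~15% of positions; of those, 80% become
    [MASK], 10% a random vocabulary token, 10% stay unchanged. The model is then
    trained to reconstruct the original token at each selected position."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))  # toy stand-in for a real wordpiece vocabulary
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok           # the reconstruction target
            roll = rng.random()
            if roll < 0.8:
                masked[i] = "[MASK]"
            elif roll < 0.9:
                masked[i] = rng.choice(vocab)
            # else: keep the original token
    return masked, targets

sentence = "no acute cardiopulmonary process is seen on chest radiograph".split()
masked, targets = mask_tokens(sentence)
print(masked, targets)
```

The targets dictionary is what the pre-training loss is computed against: only the selected positions contribute to the MLM objective.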
Intrinsic Evaluation
Semantic Similarity
Semantic similarity as I use it here refers to the attributive similarity or synonymy of words; words that are synonymous have a high degree of semantic similarity. Semantic similarity is considered a special case of the more general concept of semantic relatedness (Resnik 1995; Turney 2006). For example, the terms scalpel and surgeon or lung and nodule are not synonymous, but are related.
Qualitative Evaluation
Chang et al. used Amazon Mechanical Turk to examine human judgments of word or topic coherence for each task (Chang et al. 2009). In addition to introducing the word2vec algorithm, Mikolov et al. introduced a new word analogy task to illustrate the semantic and syntactic properties of word2vec-trained embeddings (Mikolov, Chen, et al. 2013).
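The analogy test from Mikolov et al. (the 3CosAdd formulation) can be demonstrated on hand-made toy vectors; the embedding values below are fabricated for illustration and carry no linguistic meaning beyond this example.

```python
import numpy as np

# Toy embedding table (hand-made, purely illustrative) for the classic
# analogy: king - man + woman should land nearest to queen.
emb = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "prince": np.array([0.8, 0.9, 0.2]),
}

def analogy(a, b, c):
    """Return the word closest (by cosine) to vec(b) - vec(a) + vec(c),
    excluding the three query words themselves."""
    target = emb[b] - emb[a] + emb[c]
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(emb[w], target))

print(analogy("man", "king", "woman"))  # queen
```

Excluding the query words is important in practice: without it, one of the inputs is often the nearest neighbor of the target vector.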
Quantitative Evaluation
Specifically, BERT-based models include a masked-language modeling (MLM) pre-training strategy as shown in Figure 2.4 (Devlin et al. 2019). Of the 120 term pairs, the 30 pairs with the highest agreement among the coders were then annotated by 9 medical coders and 3 rheumatologists on a 4-point scale (Pedersen et al. 2007).
Extrinsic Evaluation
In addition, the General Language Understanding Evaluation (GLUE) and SuperGLUE benchmarks include a diverse set of general natural language understanding tasks (A. Wang et al.). More recently, the Long Range Arena benchmark was introduced to evaluate and compare the performance of Efficient Transformer models (Tay et al. 2021). Of the 10 tasks in the BLUE benchmark set, only one evaluates document classification performance and involves multi-label classification of PubMed abstracts for hallmarks of cancer (Baker et al. 2015).
Other phenotyping techniques have included rule-based systems, statistical or machine learning-based approaches, and hybrid methods (Shivade et al. 2014). They concluded that there are few robust tools or off-the-shelf models available for phenotyping (Shivade et al. 2014).
Methods
- Study Design
- Text Pre-processing
- Non-contextual embedding models
- Contextual embedding models
- Term pair selection
- Cloze task generation
- Survey administration
- Survey analysis
I used the gensim library implementations of the word2vec and fastText models (Mikolov, Chen, et al. 2013). In this study, spherical embedding models were created using the source code provided by the authors (Meng et al. 2019). I also used a spherical embedding model trained on the English Wikipedia corpus provided by Meng et al. (2019).
I used the transformers library to create a custom tokenizer and for pre-training on the radiology report corpus (Wolf et al. 2020). Each cloze task prediction was generated with each of the 3 BERT-based models under study using the transformers library (Wolf et al. 2020).
Results
- Comparison to Fleischner glossary
- Non-contextual embedding model similarity
- Survey participant characteristics
- Term pair similarity
- Performance on UMNSRS benchmarks
- Cloze task accuracy
Clinical background of the resident, fellow and attending physicians included anesthesiology (1), emergency medicine (1), family medicine (1), internal medicine (5) and radiology (1). The calculated Pearson correlation coefficient (ρp) and Spearman correlation coefficient (ρs) values comparing mean study similarity scores with cosine similarity for each model are shown in Table 3.3. Among the embedding models trained using our X-ray corpus, the word2vec model returned the lowest correlation between human and model similarity scores (ρp 0.34; ρs 0.35).
For the word2vec model trained on English Wikipedia, ρp and ρs were 0.31 and 0.32, respectively; the spherical embedding model trained on Wikipedia returned a ρp of 0.26 and ρs of 0.27. To compare the performance of the word2vec model across training corpora, I present the performance of the previously published algorithm on the modified UMNSRS benchmarks alongside our findings in Table 3.5 (Pakhomov et al. 2016).
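The ρp and ρs values reported above are standard Pearson and Spearman correlations between human ratings and model cosine similarities; the scores below are fabricated toy data, not the study's measurements.

```python
from scipy.stats import pearsonr, spearmanr

# Toy data (hypothetical): mean human similarity ratings and model cosine
# similarities for five term pairs.
human = [3.8, 3.1, 2.5, 1.9, 1.2]
cosine = [0.72, 0.55, 0.58, 0.31, 0.15]

rho_p, _ = pearsonr(human, cosine)   # linear association
rho_s, _ = spearmanr(human, cosine)  # rank-order association
print(round(rho_p, 2), round(rho_s, 2))
```

Spearman depends only on rank order, so it is robust to the fact that human ratings and cosine similarities live on different scales; Pearson additionally rewards a linear relationship.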
Discussion
Similar to my own results, they found that fastText models tend to outperform word2vec models on most term pair similarity benchmarks, including the English Rare Word dataset (Bojanowski et al. 2017). The originally proposed BioWordVec fastText model was trained using PubMed and Medical Subject Headings (MeSH) (Y. Zhang et al. 2019). They also find that increasing the size of the context window improves the correlation between model and human judgment (Agirre et al. 2009).
I attempted to enrich the embedding models using the RadLex ontology, but met with little success as I found limited RadLex term coverage in our corpus, which is a challenge previously noted (Percha et al. 2018). BERT-based models can only accept a maximum sequence input length of 512 tokens (Devlin et al. 2019).
Conclusion
With limited research on model bias in clinical NLP, I argue that this area warrants further investigation, particularly given the evidence of racial and other disparities present in clinical evidence, management, and outcomes (Balderston et al. 2021). Future studies may use methods to generate multi-word terms while still allowing for comparison within the embedding space (Mikolov, Sutskever, et al. 2013; Rose et al. 2010; Campos et al. 2018). Furthermore, the term pair generation method excluded term pairs that were not present in all the nested vocabularies.
Another limitation is that the corpus of radiology reports was primarily limited to CT scans, particularly of the chest. Also, the corpus was curated from a single academic medical center, which limits the size of the corpus and biases the conditions and findings identified in the reports toward those found in the region.
Methods
- Study Design
- Text processing
- Non-contextual embedding models
- Contextual embedding models
- Manual Document Annotation
- Model Setup
- Logistic Regression Models
- Bidirectional LSTM Models
- BERT-based Models
- Efficient Transformer Models
- Statistical Analyses
For binary classification, I used the scikit-learn implementation of stratified k-fold cross-validation via the StratifiedKFold method (Pedregosa et al. 2011). Each of our binary and multi-label classification models was trained using the binary cross-entropy with logits loss implemented in PyTorch (Paszke et al. 2019). Each of our metrics was calculated using the scikit-learn methods for accuracy, precision, recall, and F1 scores (Pedregosa et al. 2011).
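A minimal sketch of the scikit-learn pieces named above; the labels, fold count, and toy predictions are illustrative assumptions, not the study's data or settings.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy imbalanced labels: StratifiedKFold keeps the positive rate roughly
# constant across folds, which matters for rare phenotype labels.
y = np.array([0] * 8 + [1] * 2)
X = np.zeros((10, 1))  # placeholder features

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    print(int(y[test_idx].sum()), "positive(s) in this fold")

# Precision, recall, and F1 on toy predictions, via the scikit-learn methods.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred),
      round(f1_score(y_true, y_pred), 2))
```

With only two positives and two folds, stratification guarantees each test fold sees exactly one positive; a plain KFold could easily place both in the same fold.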
When using the spherical embeddings trained on English Wikipedia at a maximum learning rate of 0.001, the model failed to converge. Binary classification: As with our BERT-based models (Section 4.1.9), I loaded the Longformer equivalent of each of these three models and their tokenizers using the transformers library (Wolf et al. 2020).
Results
- Annotation
- Label Distribution
- Binary Classification
- Multi-label Classification
BiLSTM models: Model performance for each of our bidirectional LSTM (BiLSTM) models is shown in Table 4.3 and Figure 4.2. BERT-based models: Performance measures for each of our three studied BERT-based models are shown in Table 4.4 and Figure 4.3. Efficient Transformer models: Performance measures for each of our Efficient Transformer models are shown in Table 4.5.
Logistic regression models: The results for each of our logistic regression models are presented in Table 4.6 and Figure 4.7. Efficient Transformer models: Performance metrics for each of our Efficient Transformer models are shown in Table 4.9.
Discussion
In Figure 4.12, I compare micro-F1, precision, and recall between the BERT-based models and the Efficient Transformer models. As described in Section 4.1.8, the BiLSTM models using our custom spherical embeddings and the spherical embeddings trained on English Wikipedia converged only after increasing the learning rate. BERT-based models: For binary and multi-label classification, I consistently found that our custom RadBERT model outperformed the base BERT model and the BlueBERT model.
Efficient Transformer models: Our Efficient Transformer models are the Longformer equivalents of the BERT-based models: uncased base BERT, BlueBERT, and our custom RadBERT model. As expected, the BiLSTM models trained faster than the BERT-based and Efficient Transformer models.
Conclusion
A Generalized Natural-Language Text Processor for Clinical Radiology." Journal of the American Medical Informatics Association. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). "Extending a radiology lexicon through contextual patterns in radiology reports." Journal of the American Medical Informatics Association.
CLAMP – a toolkit for efficiently building personalized natural language processing clinical pipelines.” Journal of the American Medical Informatics Association. Identification of Patient Smoking Status from Discharge Medical Records.” Journal of the American Medical Informatics Association.