Josh Denny led me to study clinical NLP systems and their applications in the medical domain.
Figure: Reading-speed curves plotting the number of words in the annotated sentences against annotation time (in minutes) in the main studies of Random and CAUSE, for user 1 and user 2.
Natural language processing in the medical domain
In the medical field, the rapid increase in the use of clinical notes in EHRs is a strong incentive for the development of clinical NLP [9]. Our final AL-enabled NER system showed better performance than random sampling in a real-world annotation task, demonstrating the potential of AL for clinical NER.
Machine learning-based named entity recognition in clinical text
Training an ML-based NER model learns the patterns that relate sequences of words, together with their features, to their labels. A number of features extracted from the raw text were systematically investigated to improve the NER model [38].
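To make this concrete, the sketch below shows how word-level features of the kind typically used for CRF-based NER might be constructed; the specific feature set, the toy example, and the use of sklearn-crfsuite are illustrative assumptions rather than the exact features investigated in [38].

```python
# Minimal sketch of word-level feature extraction for a CRF-based NER model.
# Assumes sklearn-crfsuite is installed; the feature set is illustrative only.
import sklearn_crfsuite


def word_features(sentence, i):
    """Build a feature dict for the i-th token of a tokenized sentence."""
    word = sentence[i]
    features = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
    }
    # Context features from the previous and next tokens, if any.
    if i > 0:
        features["prev.lower"] = sentence[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(sentence) - 1:
        features["next.lower"] = sentence[i + 1].lower()
    else:
        features["EOS"] = True
    return features


def sentence_features(sentence):
    return [word_features(sentence, i) for i in range(len(sentence))]


# Toy training example with BIO labels for a "problem" entity.
X_train = [sentence_features(["Patient", "denies", "chest", "pain", "."])]
y_train = [["O", "O", "B-problem", "I-problem", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```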
Active learning
- Pool-based active learning framework
- Active learning methods
- Simulated active learning studies
- Active learning in practice
There are many variations of AL (query) algorithms, which can be classified into six main types: uncertainty sampling [50], query by committee (QBC) [51], expected gradient length (EGL) [52], Fisher information [53], expected error reduction (EER) [54], and information density [48]. The labels of the data in the pool were treated as unknown at the start of the AL process.
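A minimal sketch of the pool-based AL loop with least-confidence uncertainty sampling is given below; the classifier, the oracle_label callback, and the batch size are illustrative assumptions rather than the exact configuration used in our studies.

```python
# Minimal sketch of pool-based active learning with least-confidence sampling.
# The model, oracle, and batch size are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression


def least_confidence(model, X_pool):
    """Uncertainty score: 1 - max predicted class probability per instance."""
    probs = model.predict_proba(X_pool)
    return 1.0 - probs.max(axis=1)


def active_learning_loop(X_labeled, y_labeled, X_pool, oracle_label,
                         n_rounds=10, batch_size=5):
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        model.fit(X_labeled, y_labeled)
        if len(X_pool) == 0:
            break
        # Query the instances the current model is least confident about.
        scores = least_confidence(model, X_pool)
        query_idx = np.argsort(scores)[::-1][:batch_size]
        # The oracle (human annotator) provides labels for the queried items.
        new_y = np.array([oracle_label(x) for x in X_pool[query_idx]])
        X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
        y_labeled = np.concatenate([y_labeled, new_y])
        X_pool = np.delete(X_pool, query_idx, axis=0)
    return model
```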
Active learning in biomedical text processing
Applying active learning to assertion classification in clinical text
Specifically, we developed new AL algorithms and applied them to the assertion classification task for concepts in clinical text. The assertion classification task was to assign one of six labels ("absent", "associated with someone else", "conditional", "hypothetical", "possible", and "present") to medical problems identified in clinical documents.
Applying active learning to supervised word sense disambiguation in MEDLINE
Applying active learning to high-throughput phenotyping in electronic health record
Furthermore, AL and feature engineering based on domain knowledge can be combined to develop efficient and generalizable phenotyping methods.
Summary
Figure: Sentence-length curves plotting the number of words versus the number of sentences in the training set.
In the simulation study, we used the number of words in the annotated sentences as the estimated annotation cost.
A Simulation Study of Active Learning Methods for Named Entity Recognition in Clinical Text
Introduction
The goal of AL for NER is to select informative sentences from the pool and thereby reduce annotation costs. Most AL algorithms outperformed the baselines, indicating the promise of AL for NER.
Methods
- Dataset
- Active learning experimental framework
- Uncertainty-based querying algorithms
- Diversity-based querying algorithms
- Baseline algorithms
- Evaluation
In most of our implementations, only the N-best label sequences were considered, since the number of possible label sequences grows exponentially with sentence length. We also extended the N-best set so that it covered most of the highly probable label sequences.
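The sketch below illustrates how an N-best sequence entropy score could be computed from the probabilities of the candidate label sequences; renormalizing over the N-best list is an assumption made for illustration.

```python
# Minimal sketch of N-best sequence entropy for uncertainty-based querying.
# Probabilities are renormalized over the N-best list; this is an assumption.
import math


def nbest_sequence_entropy(nbest_probs):
    """Entropy over the probabilities of the N-best label sequences."""
    total = sum(nbest_probs)
    if total == 0:
        return 0.0
    entropy = 0.0
    for p in nbest_probs:
        q = p / total
        if q > 0:
            entropy -= q * math.log(q)
    return entropy


# A sentence whose N-best sequences have similar probabilities is more
# uncertain (higher entropy) than one dominated by a single sequence.
print(nbest_sequence_entropy([0.30, 0.28, 0.25, 0.17]))  # high uncertainty
print(nbest_sequence_entropy([0.95, 0.03, 0.01, 0.01]))  # low uncertainty
```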
Results
Based on the learning curves in Figures 3 and 4, we also calculated the number of annotated sentences and words needed to reach a fixed F-measure for each method.
Figure: Entity-count curves showing the number of entities versus the number of sentences in the training set.
Discussion
However, selecting sentences with many entities (e.g., Length-concepts) or the longest sentences (e.g., Length-words) could not outperform passive learning in ALC2 score, which we consider an evaluation metric closer to the real-world annotation scenario. These findings suggest that we should be more cautious about the results of simulated AL experiments on clinical NER. As described above, the main limitation of this study is that it is a simulated study of AL for clinical NER.
To assess the true value of AL for clinical NLP, we will need to evaluate it in a real-world setting.
Conclusion
For the user study, the best query algorithm from the simulation was implemented in the system. The initial sentence selection schemes used in the user study were the same as in the simulation. In terms of annotation speed, both users annotated faster in the first two sessions (Sessions 1 and 2) than in the last two sessions (Sessions 3 and 4) under the CAUSE condition (see the green curves in Figure 20).
The same two users participated in the new user study to evaluate random sampling and CAUSE2.
An Active Learning-enabled Annotation System for Building Clinical Named Entity Recognition Models
Introduction
Settles et al. [60] performed a detailed empirical study to assess the benefits of AL in terms of real-world annotation costs, and their analysis concluded that a reduction in the number of annotated sentences required does not guarantee a real reduction in cost. At the back end, the system iteratively trains CRF models based on users' annotations and selects the most useful sentences using the query engine. This querying algorithm, the Clustering And Uncertainty Sampling Engine (CAUSE), served as the query engine in Active LEARNER.
In a user study, we compared the performance of CAUSE with RANDOM (random sampling), which represents passive learning.
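To illustrate this family of methods, the sketch below combines k-means clustering of sentence vectors with per-sentence uncertainty scores so that a queried batch is both uncertain and diverse; it is a simplified illustration, not the exact CAUSE algorithm, and the vectorization and cluster count are assumptions.

```python
# Simplified sketch of combining clustering with uncertainty sampling,
# in the spirit of CAUSE (not the exact algorithm). Sentence vectors and
# uncertainty scores are assumed to be provided by the NER model pipeline.
import numpy as np
from sklearn.cluster import KMeans


def query_diverse_uncertain(sentence_vectors, uncertainty_scores, batch_size):
    """Pick one highly uncertain sentence from each of `batch_size` clusters."""
    kmeans = KMeans(n_clusters=batch_size, n_init=10, random_state=0)
    cluster_ids = kmeans.fit_predict(sentence_vectors)

    selected = []
    for c in range(batch_size):
        members = np.where(cluster_ids == c)[0]
        if len(members) == 0:
            continue
        # Within each cluster, choose the sentence the model is least sure about.
        best = members[np.argmax(uncertainty_scores[members])]
        selected.append(int(best))
    return selected


# Toy usage: 100 sentences represented by 20-dimensional vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 20))
scores = rng.uniform(size=100)
print(query_diverse_uncertain(vectors, scores, batch_size=5))
```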
Methods
- System workflow
- Querying methods (clustering and uncertainty sampling engine)
- The user study
- Study design
- Datasets
- Evaluation
The top sentence in the ranked unlabeled set is queried and displayed on the interface. This design allows a user to continuously annotate the top unlabeled sentence of the ranked list, which is generated by either the current or a previous learning iteration in the learning thread. After the user submits the annotation of the first sentence, the second sentence is shown.
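The sketch below illustrates one way such a workflow could be organized, with a background learning thread that periodically re-ranks the unlabeled pool while the annotation loop always serves the current top-ranked sentence; the threading layout, the placeholder ranking step, and the timing are assumptions, not the actual Active LEARNER implementation.

```python
# Illustrative sketch of an annotation front end backed by a learning thread
# that periodically re-ranks the unlabeled pool. Not the actual Active LEARNER
# implementation; the ranking stub and timing are assumptions.
import threading
import time
import random

pool = [f"sentence {i}" for i in range(20)]   # unlabeled sentences
labeled = []                                   # (sentence, annotation) pairs
ranked = list(pool)                            # current ranking of the pool
lock = threading.Lock()
stop_event = threading.Event()


def retrain_and_rank():
    """Stand-in for retraining the CRF and re-scoring the pool."""
    while not stop_event.is_set():
        with lock:
            random.shuffle(ranked)             # placeholder for model-based ranking
        time.sleep(1.0)                        # assumed retraining interval


def annotate(n_sentences, get_annotation):
    """Serve the current top-ranked sentence to the user, one at a time."""
    for _ in range(n_sentences):
        with lock:
            if not ranked:
                break
            sentence = ranked.pop(0)
        labeled.append((sentence, get_annotation(sentence)))


learner = threading.Thread(target=retrain_and_rank, daemon=True)
learner.start()
annotate(5, get_annotation=lambda s: "O " * len(s.split()))
stop_event.set()
print(labeled)
```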
In the warm-up annotation training, the sentences reviewed were drawn from the independent test set.
Results
- The Active LEARNER system
- Simulated results
- User study results
Table: Summary of the analysis of learning curves for the measurements of annotation performance and the characteristics of the methods in the user study.
The maximum area equals the final cost spent on training (e.g., the number of words in the final training set or the actual annotation time) multiplied by the best possible F-measure. In the simulation, we evaluated the Random, Uncertainty, and CAUSE methods assuming the same cost per word.
The AL process stopped at the point where the training set contained only 7,200 words.
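A minimal sketch of computing an ALC score as the area under the F-measure-versus-cost learning curve, normalized by the maximum area described above, follows; the trapezoidal integration and the best possible F-measure of 1.0 are assumptions for illustration.

```python
# Minimal sketch of an ALC (area under the learning curve) score.
# The maximum area is the final annotation cost times the best possible
# F-measure (assumed to be 1.0 here for illustration).
import numpy as np


def alc_score(costs, f_measures, best_f=1.0):
    """Area under the F-measure vs. cost curve, normalized to [0, 1]."""
    costs = np.asarray(costs, dtype=float)
    f = np.asarray(f_measures, dtype=float)
    # Trapezoidal area under the curve.
    area = float(np.sum((f[1:] + f[:-1]) / 2.0 * np.diff(costs)))
    max_area = float(costs[-1]) * best_f
    return area / max_area


# Cost can be the cumulative number of annotated words or annotation minutes.
words_annotated = [0, 1200, 2400, 3600, 4800, 6000, 7200]
f_scores = [0.00, 0.45, 0.58, 0.64, 0.67, 0.69, 0.70]
print(round(alc_score(words_annotated, f_scores), 3))
```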
Discussion
Figure: Learning curves for the best-performing method in each of four categories, Random, Uncertainty (N-best sequence entropy), CAUSE (CAUSE_nbest), and CAUSE2 (CAUSE_EntityEntropyPerCost), for user 1.
Appendix A shows the details of the statistical analysis based on the Wilcoxon signed-rank test [86].
Example survey items: "Please rate your impression of the average length of the sentences given each week." "Please rate your impression of the clinical relevance of the sentences given each week."
Annotation Time Modeling for Active Learning in Clinical Named Entity Recognition
Introduction
This finding indicates that the CAUSE algorithm is not guaranteed to save actual annotation time for each user compared to random sampling. The simulated results based on learning curves showed that CAUSE2 outperformed CAUSE, uncertainty sampling, and random sampling. Among all AL methods based on the new time estimation model, CAUSE2 was the only algorithm significantly better than random sampling.
The results showed that CAUSE2 globally outperformed random sampling in terms of the area under the learning curve scores for both users.
Methods
- Active learning with annotation time models
- Datasets
- Training dataset for building annotation time models
- Dataset for simulation studies
- Dataset for the user study
- Evaluation
- Evaluation of annotation time models
- Evaluation using the simulation study
- Evaluation by the user study
We also built a baseline linear regression model that predicts annotation time using only the word count as a feature. Since the annotation time for the same sentence varies from one annotator to another, we trained a separate regression model for each annotator, based on that annotator's existing annotated data from the previous study in Chapter 2. In this study, R² was calculated as the square of the Pearson correlation coefficient between the actual and estimated annotation times.
Therefore, we used the predicted annotation time from each user's individual model to generate learning curves that plot the F-measure versus the estimated annotation time.
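As a concrete illustration, the sketch below fits a per-annotator linear annotation-time model, evaluates it with R² defined as the squared Pearson correlation, and uses the predicted time to normalize an informativeness score in the spirit of a cost-aware query such as CAUSE2; the feature set (word count and entity count) and the scoring formula are assumptions, not the exact model or algorithm reported here.

```python
# Illustrative per-annotator annotation-time model and cost-normalized query
# score. Features and formula are assumptions, not the exact CAUSE2 method.
import numpy as np


def fit_time_model(features, times):
    """Ordinary least squares: predicted_time = intercept + w . features."""
    X = np.column_stack([np.ones(len(features)), features])
    coefs, *_ = np.linalg.lstsq(X, np.asarray(times, dtype=float), rcond=None)
    return coefs


def predict_time(coefs, features):
    X = np.column_stack([np.ones(len(features)), features])
    return X @ coefs


def r_squared(actual, predicted):
    """R^2 as the square of the Pearson correlation coefficient."""
    return float(np.corrcoef(actual, predicted)[0, 1] ** 2)


# Toy data: [word count, entity count] per sentence, with observed seconds.
feats = np.array([[8, 1], [15, 3], [22, 4], [30, 6], [12, 2]])
secs = np.array([10.0, 21.0, 30.0, 44.0, 16.0])

coefs = fit_time_model(feats, secs)
pred = predict_time(coefs, feats)
print("R^2:", round(r_squared(secs, pred), 3))

# Cost-normalized querying: informativeness per predicted second of annotation.
informativeness = np.array([0.9, 1.4, 1.1, 2.0, 0.6])   # e.g., entity entropy
scores = informativeness / np.maximum(pred, 1e-6)
print("query order:", np.argsort(scores)[::-1])
```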
Results
- Annotation cost models evaluation results
- Results of the simulation studies
- Results of the user study
These results show that the weights of all three predictors and the intercept are significant in the annotation time models for both users.
Table: For each user and method, ALC scores, F-measures at 120 minutes, and p-values based on the Wilcoxon signed-rank test.
Figure: Learning curves in the new user study of user 2.
Meanwhile, we also simulated the new user study following the design of the simulation study (Section 2.3.2).
The learning curves in Figure 26 (simulation) appeared to be similar to the learning curves in Figure 24 (user study), indicating that the simulation study is a viable way to mimic the user study.
Discussion
Most studies provided evidence supporting AL's promise of annotation cost reduction for various tasks.
Conclusion
Summary of key findings
We also reviewed the literature on AL in a practical setting, which suggested that without proper attention to annotation costs, AL might not differ from random sampling. Preliminary results showed that uncertainty sampling-based algorithms outperformed diversity-based sampling methods and random sampling under two evaluation assumptions: (1) the same cost per sentence, and (2) the same cost per word. This cost-aware model, CAUSE2, was compared in the simulation with CAUSE, uncertainty sampling, and random sampling, using the estimated annotation time as the cost.
To achieve a NER model with 0.70 in F-measure, CAUSE2 reduced labeling time by 26.5% compared to single-user random sampling.
Innovations and contributions
- Innovations
- Contributions
Furthermore, our study demonstrated an informatics tool that genuinely interacts with medical domain experts and represents a novel application at the intersection of biomedical research and information technology. Our new methods, based on state-of-the-art computer science techniques, are generalizable to text processing tasks in open domains. With respect to healthcare, our system can increase the efficiency of building clinical NER systems, thereby facilitating clinical research that uses EHR data.
Our new AL paradigm and system could be one of the big data analytics solutions in healthcare.
Limitations and future work
Our studies have contributed to the fields of biomedical informatics, biomedical NLP, computer science, and healthcare. In the future, we plan to expand the user study to include a larger number of professional users. To reduce the influence of human factors, we plan to develop a new integrated system that combines all query methods and evaluates them simultaneously in a user study.
For the annotation cost model, we will add more predictive variables for annotation time and annotation difficulty, such as the document frequency of concepts and the complexity of the syntactic structure of sentences.
Conclusion
Survey note: a higher score represents your impression that the medical concepts in the sentences were denser (i.e., that the proportion of words you marked as part of a medical concept was higher).
Survey item: "Please rate your impression of the difficulty of annotating the sentences given each week."
Two types of ALC scores for all AL algorithms versus passive learning
A scenario of two most informative sentences that occurred back-to-back when Active
An example of a cluster that contains multiple sentences about prescription
Schedule of the user study
Characteristics (counts of sentences, words, and entities, words per sentence, entities per
Summarization of analysis curves for the measurements of annotation performance of
Annotation counts, speed, and quality comparison in the 120-minute main study
Comparison between Random and CAUSE in ALC score and F-measure of the last
Characteristics of Random and CAUSE in each 120-minute main study from user 1
Distributions of the training data for building annotation time models
Distribution of words and different types of entities in the pool of 29,789 unique
Schedule of the new user study using new data
Statistical analysis for annotation cost model for user 1
Statistical analysis for annotation cost model for user 2
Evaluation of different annotation cost models in R²
ALC scores of both users for different AL methods in the simulation study
Characteristics in average sentence length, entities per sentence, and entity density for
ALC scores, F-measures at the end of 120-minute annotation, and the statistical test
Annotation quantity, speed, and quality comparison in the 120-minute main study for