Josh Denny led me to study clinical NLP systems and their applications in the medical domain.
Figure: Reading-speed curves plotting the number of words in the annotated sentences against annotation time (in minutes) in the main studies of Random and CAUSE, for user 1 and user 2.
Natural language processing in the medical domain
In the medical field, the rapid increase in the use of clinical notes in EHRs is a strong incentive for the development of clinical NLP [9]. Our final AL-enabled NER system showed better performance than random sampling in a real-world annotation task, demonstrating the potential of AL for clinical NER.
Machine learning-based named entity recognition in clinical text
Training an ML-based NER model learns the patterns that relate sequences of words, together with their features, to their labels. A number of features extracted from the raw text were systematically investigated to improve the NER model [38].
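To make this concrete, the sketch below shows how word-level features of the kind typically used for CRF-based NER might be constructed; the specific feature set, the toy example, and the use of sklearn-crfsuite are illustrative assumptions rather than the exact features investigated in [38].

```python
# Minimal sketch of word-level feature extraction for a CRF-based NER model.
# Assumes sklearn-crfsuite is installed; the feature set is illustrative only.
import sklearn_crfsuite


def word_features(sentence, i):
    """Build a feature dict for the i-th token of a tokenized sentence."""
    word = sentence[i]
    features = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
    }
    # Context features from the previous and next tokens, if any.
    if i > 0:
        features["prev.lower"] = sentence[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(sentence) - 1:
        features["next.lower"] = sentence[i + 1].lower()
    else:
        features["EOS"] = True
    return features


def sentence_features(sentence):
    return [word_features(sentence, i) for i in range(len(sentence))]


# Toy training example with BIO labels for a "problem" entity.
X_train = [sentence_features(["Patient", "denies", "chest", "pain", "."])]
y_train = [["O", "O", "B-problem", "I-problem", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```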
Active learning
- Pool-based active learning framework
- Active learning methods
- Simulated active learning studies
- Active learning in practice
There are many variations of AL (query) algorithms, which can be classified into six main types: uncertainty sampling [50], query by committee (QBC) [51], expected gradient length (EGL) [52], Fisher information [53], expected error reduction (EER) [54], and information density [48]. The labels of the data in the pool were treated as unknown at the start of the AL process.
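A minimal sketch of the pool-based AL loop with least-confidence uncertainty sampling is given below; the classifier, the oracle_label callback, and the batch size are illustrative assumptions rather than the exact configuration used in our studies.

```python
# Minimal sketch of pool-based active learning with least-confidence sampling.
# The model, oracle, and batch size are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression


def least_confidence(model, X_pool):
    """Uncertainty score: 1 - max predicted class probability per instance."""
    probs = model.predict_proba(X_pool)
    return 1.0 - probs.max(axis=1)


def active_learning_loop(X_labeled, y_labeled, X_pool, oracle_label,
                         n_rounds=10, batch_size=5):
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        model.fit(X_labeled, y_labeled)
        if len(X_pool) == 0:
            break
        # Query the instances the current model is least confident about.
        scores = least_confidence(model, X_pool)
        query_idx = np.argsort(scores)[::-1][:batch_size]
        # The oracle (human annotator) provides labels for the queried items.
        new_y = np.array([oracle_label(x) for x in X_pool[query_idx]])
        X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
        y_labeled = np.concatenate([y_labeled, new_y])
        X_pool = np.delete(X_pool, query_idx, axis=0)
    return model
```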
Active learning in biomedical text processing
Applying active learning to assertion classification in clinical text
Specifically, we developed new AL algorithms and applied them to the assertion classification task for concepts in clinical text. The assertion classification task was to assign one of six labels ("absent", "associated with someone else", "conditional", "hypothetical", "possible", and "present") to medical problems identified in clinical documents.
Applying active learning to supervised word sense disambiguation in MEDLINE
Applying active learning to high-throughput phenotyping in electronic health record
Furthermore, AL and feature engineering based on domain knowledge can be combined to develop efficient and generalizable phenotyping methods.
Summary
Figure: Sentence-length curves plotting the number of words versus the number of sentences in the training set.
In the simulation study, we used the number of words in the annotated sentences as the estimated annotation cost.
A Simulation Study of Active Learning Methods for Named Entity Recognition in Clinical Text
Introduction
The goal of AL for NER is to select informative sentences from the pool and thereby reduce annotation costs. Most AL algorithms outperformed the baselines, indicating the promise of AL for NER.
Methods
- Dataset
- Active learning experimental framework
- Uncertainty-based querying algorithms
- Diversity-based querying algorithms
- Baseline algorithms
- Evaluation
In most of our implementations, only the N-best label sequences were considered, since the number of possible label sequences grows exponentially with sentence length. We also extended the N-best set so that it covered most of the highly probable label sequences.
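The sketch below illustrates how an N-best sequence entropy score could be computed from the probabilities of the candidate label sequences; renormalizing over the N-best list is an assumption made for illustration.

```python
# Minimal sketch of N-best sequence entropy for uncertainty-based querying.
# Probabilities are renormalized over the N-best list; this is an assumption.
import math


def nbest_sequence_entropy(nbest_probs):
    """Entropy over the probabilities of the N-best label sequences."""
    total = sum(nbest_probs)
    if total == 0:
        return 0.0
    entropy = 0.0
    for p in nbest_probs:
        q = p / total
        if q > 0:
            entropy -= q * math.log(q)
    return entropy


# A sentence whose N-best sequences have similar probabilities is more
# uncertain (higher entropy) than one dominated by a single sequence.
print(nbest_sequence_entropy([0.30, 0.28, 0.25, 0.17]))  # high uncertainty
print(nbest_sequence_entropy([0.95, 0.03, 0.01, 0.01]))  # low uncertainty
```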
Results
Based on the learning curves in Figures 3 and 4, we also calculated the number of annotated sentences and words needed to reach a fixed F-measure for each method.
Figure: Entity-count curves showing the number of entities versus the number of sentences in the training set.
Discussion
However, selecting sentences with many entities (e.g., Length-concepts) or the longest sentences (e.g., Length-words) could not outperform passive learning in ALC2 score, which we consider an evaluation metric closer to the real-world annotation scenario. These findings suggest that we should be more cautious about the results of simulated AL experiments on clinical NER. As described above, the main limitation of this study is that it is a simulated study of AL for clinical NER.
To assess the true value of AL for clinical NLP, we will need to evaluate it in a real-world setting.
Conclusion
For the user study, the best query algorithm from the simulation was implemented in the system. The initial sentence selection schemes used in the user study were the same as in the simulation. In terms of annotation speed, both users annotated faster in the first two sessions (Sessions 1 and 2) than in the last two sessions (Sessions 3 and 4) under the CAUSE condition (see the green curves in Figure 20).
The same two users participated in the new user study to evaluate random sampling and CAUSE2.
An Active Learning-enabled Annotation System for Building Clinical Named Entity Recognition Models
Introduction
Settles et al. [60] performed a detailed empirical study to assess the benefits of AL in terms of real-world annotation costs, and their analysis concluded that a reduction in the number of annotated sentences required does not guarantee a real reduction in cost. At the back end, the system iteratively trains CRF models based on users' annotations and selects the most useful sentences using the query engine. This querying algorithm, the Clustering And Uncertainty Sampling Engine (CAUSE), served as the query engine in Active LEARNER.
In a user study, we compared the performance of CAUSE with RANDOM (random sampling), which represents passive learning.
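To illustrate this family of methods, the sketch below combines k-means clustering of sentence vectors with per-sentence uncertainty scores so that a queried batch is both uncertain and diverse; it is a simplified illustration, not the exact CAUSE algorithm, and the vectorization and cluster count are assumptions.

```python
# Simplified sketch of combining clustering with uncertainty sampling,
# in the spirit of CAUSE (not the exact algorithm). Sentence vectors and
# uncertainty scores are assumed to be provided by the NER model pipeline.
import numpy as np
from sklearn.cluster import KMeans


def query_diverse_uncertain(sentence_vectors, uncertainty_scores, batch_size):
    """Pick one highly uncertain sentence from each of `batch_size` clusters."""
    kmeans = KMeans(n_clusters=batch_size, n_init=10, random_state=0)
    cluster_ids = kmeans.fit_predict(sentence_vectors)

    selected = []
    for c in range(batch_size):
        members = np.where(cluster_ids == c)[0]
        if len(members) == 0:
            continue
        # Within each cluster, choose the sentence the model is least sure about.
        best = members[np.argmax(uncertainty_scores[members])]
        selected.append(int(best))
    return selected


# Toy usage: 100 sentences represented by 20-dimensional vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 20))
scores = rng.uniform(size=100)
print(query_diverse_uncertain(vectors, scores, batch_size=5))
```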
Methods
- System workflow
- Querying methods (clustering and uncertainty sampling engine)
- The user study
- Study design
- Datasets
- Evaluation
The top sentence in the ranked unlabeled set is queried and displayed on the interface. This design allows a user to continuously annotate the top unlabeled sentence of the ranked list, which is generated by either the current or a previous learning iteration in the learning thread. After the user submits the annotation of the first sentence, the second sentence is shown.
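The sketch below illustrates one way such a workflow could be organized, with a background learning thread that periodically re-ranks the unlabeled pool while the annotation loop always serves the current top-ranked sentence; the threading layout, the placeholder ranking step, and the timing are assumptions, not the actual Active LEARNER implementation.

```python
# Illustrative sketch of an annotation front end backed by a learning thread
# that periodically re-ranks the unlabeled pool. Not the actual Active LEARNER
# implementation; the ranking stub and timing are assumptions.
import threading
import time
import random

pool = [f"sentence {i}" for i in range(20)]   # unlabeled sentences
labeled = []                                   # (sentence, annotation) pairs
ranked = list(pool)                            # current ranking of the pool
lock = threading.Lock()
stop_event = threading.Event()


def retrain_and_rank():
    """Stand-in for retraining the CRF and re-scoring the pool."""
    while not stop_event.is_set():
        with lock:
            random.shuffle(ranked)             # placeholder for model-based ranking
        time.sleep(1.0)                        # assumed retraining interval


def annotate(n_sentences, get_annotation):
    """Serve the current top-ranked sentence to the user, one at a time."""
    for _ in range(n_sentences):
        with lock:
            if not ranked:
                break
            sentence = ranked.pop(0)
        labeled.append((sentence, get_annotation(sentence)))


learner = threading.Thread(target=retrain_and_rank, daemon=True)
learner.start()
annotate(5, get_annotation=lambda s: "O " * len(s.split()))
stop_event.set()
print(labeled)
```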
In the warm-up annotation training, the sentences reviewed were drawn from the independent test set.
Results
- The Active LEARNER system
- Simulated results
- User study results
Table: Summary of the analysis of learning curves for the measurements of annotation performance and the characteristics of the methods in the user study.
The maximum area equals the final cost spent on training (e.g., the number of words in the final training set or the actual annotation time) multiplied by the best possible F-measure. In the simulation, we evaluated the Random, Uncertainty, and CAUSE methods assuming the same cost per word.
The AL process stopped at the point where the training set contained only 7,200 words.
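A minimal sketch of computing an ALC score as the area under the F-measure-versus-cost learning curve, normalized by the maximum area described above, follows; the trapezoidal integration and the best possible F-measure of 1.0 are assumptions for illustration.

```python
# Minimal sketch of an ALC (area under the learning curve) score.
# The maximum area is the final annotation cost times the best possible
# F-measure (assumed to be 1.0 here for illustration).
import numpy as np


def alc_score(costs, f_measures, best_f=1.0):
    """Area under the F-measure vs. cost curve, normalized to [0, 1]."""
    costs = np.asarray(costs, dtype=float)
    f = np.asarray(f_measures, dtype=float)
    # Trapezoidal area under the curve.
    area = float(np.sum((f[1:] + f[:-1]) / 2.0 * np.diff(costs)))
    max_area = float(costs[-1]) * best_f
    return area / max_area


# Cost can be the cumulative number of annotated words or annotation minutes.
words_annotated = [0, 1200, 2400, 3600, 4800, 6000, 7200]
f_scores = [0.00, 0.45, 0.58, 0.64, 0.67, 0.69, 0.70]
print(round(alc_score(words_annotated, f_scores), 3))
```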
Discussion
Figure: Learning curves for the best-performing method in each of four categories, Random, Uncertainty (N-best sequence entropy), CAUSE (CAUSE_nbest), and CAUSE2 (CAUSE_EntityEntropyPerCost), for user 1.
Appendix A shows the details of the statistical analysis based on the Wilcoxon signed-rank test [86].
Example survey items: "Please rate your impression of the average length of the sentences given each week." "Please rate your impression of the clinical relevance of the sentences given each week."
Annotation Time Modeling for Active Learning in Clinical Named Entity Recognition
Introduction
This finding indicates that the CAUSE algorithm is not guaranteed to save actual annotation time for each user compared to random sampling. The simulated results based on learning curves showed that CAUSE2 outperformed CAUSE, uncertainty sampling, and random sampling. Among all AL methods based on the new time estimation model, CAUSE2 was the only algorithm significantly better than random sampling.
The results showed that CAUSE2 globally outperformed random sampling in terms of the area under the learning curve scores for both users.
Methods
- Active learning with annotation time models
- Datasets
- Training dataset for building annotation time models
- Dataset for simulation studies
- Dataset for the user study
- Evaluation
- Evaluation of annotation time models
- Evaluation using the simulation study
- Evaluation by the user study
We also built a baseline linear regression model that predicts annotation time using only the word count as a feature. Since the annotation time for the same sentence varies from one annotator to another, we trained a separate regression model for each annotator, based on that annotator's existing annotated data from the previous study in Chapter 2. In this study, R² was calculated as the square of the Pearson correlation coefficient between the actual and estimated annotation times.
Therefore, we used the predicted annotation time from each user's individual model to generate learning curves that plot the F-measure versus the estimated annotation time.
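As a concrete illustration, the sketch below fits a per-annotator linear annotation-time model, evaluates it with R² defined as the squared Pearson correlation, and uses the predicted time to normalize an informativeness score in the spirit of a cost-aware query such as CAUSE2; the feature set (word count and entity count) and the scoring formula are assumptions, not the exact model or algorithm reported here.

```python
# Illustrative per-annotator annotation-time model and cost-normalized query
# score. Features and formula are assumptions, not the exact CAUSE2 method.
import numpy as np


def fit_time_model(features, times):
    """Ordinary least squares: predicted_time = intercept + w . features."""
    X = np.column_stack([np.ones(len(features)), features])
    coefs, *_ = np.linalg.lstsq(X, np.asarray(times, dtype=float), rcond=None)
    return coefs


def predict_time(coefs, features):
    X = np.column_stack([np.ones(len(features)), features])
    return X @ coefs


def r_squared(actual, predicted):
    """R^2 as the square of the Pearson correlation coefficient."""
    return float(np.corrcoef(actual, predicted)[0, 1] ** 2)


# Toy data: [word count, entity count] per sentence, with observed seconds.
feats = np.array([[8, 1], [15, 3], [22, 4], [30, 6], [12, 2]])
secs = np.array([10.0, 21.0, 30.0, 44.0, 16.0])

coefs = fit_time_model(feats, secs)
pred = predict_time(coefs, feats)
print("R^2:", round(r_squared(secs, pred), 3))

# Cost-normalized querying: informativeness per predicted second of annotation.
informativeness = np.array([0.9, 1.4, 1.1, 2.0, 0.6])   # e.g., entity entropy
scores = informativeness / np.maximum(pred, 1e-6)
print("query order:", np.argsort(scores)[::-1])
```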
Results
- Annotation cost models evaluation results
- Results of the simulation studies
- Results of the user study
These results show that the weights of all three predictors and the intercept are significant in the annotation time models for both users.
Table: For each user and method, ALC scores, F-measures at 120 minutes, and p-values based on the Wilcoxon signed-rank test.
Figure: Learning curves in the new user study of user 2.
Meanwhile, we also simulated the new user study following the design of the simulation study (Section 2.3.2).
The learning curves in Figure 26 (simulation) appeared to be similar to the learning curves in Figure 24 (user study), indicating that the simulation study is a viable way to mimic the user study.
Discussion
Most studies provided evidence supporting AL's promise of annotation cost reduction for various tasks.
Conclusion
Summary of key findings
We also reviewed the literature on AL in a practical setting, which suggested that without proper attention to annotation costs, AL might not differ from random sampling. Preliminary results showed that uncertainty sampling-based algorithms outperformed diversity-based sampling methods and random sampling under two evaluation assumptions: (1) the same cost per sentence, and (2) the same cost per word. This cost-aware model, CAUSE2, was compared in the simulation with CAUSE, uncertainty sampling, and random sampling, using the estimated annotation time as the cost.
To achieve a NER model with 0.70 in F-measure, CAUSE2 reduced labeling time by 26.5% compared to single-user random sampling.
Innovations and contributions
- Innovations
- Contributions
Furthermore, our study demonstrated an informatics tool that genuinely interacts with medical domain experts and represents a novel application at the intersection of biomedical research and information technology. Our new methods, based on state-of-the-art computer science techniques, are generalizable to text processing tasks in open domains. With respect to healthcare, our system can increase the efficiency of building clinical NER systems, thereby facilitating clinical research that uses EHR data.
Our new AL paradigm and system could be one of the big data analytics solutions in healthcare.
Limitations and future work
Our studies have contributed to the fields of biomedical informatics, biomedical NLP, computer science, and healthcare. In the future, we plan to expand the user study to include a larger number of professional users. To reduce the influence of human factors, we plan to develop a new integrated system that combines all query methods and evaluates them simultaneously in a user study.
For the annotation cost model, we will add more predictive variables for annotation time and annotation difficulty, such as the document frequency of concepts and the complexity of the syntactic structure of sentences.
Conclusion
Survey note: a higher score represents your impression that the medical concepts in the sentences were denser (i.e., that the proportion of words you marked as part of a medical concept was higher).
Survey item: "Please rate your impression of the difficulty of annotating the sentences given each week."
Two types of ALC scores for all AL algorithms versus passive learning
A scenario of two most informative sentences that occurred back-to-back when Active
An example of a cluster that contains multiple sentences about prescription
Schedule of the user study
Characteristics (counts of sentences, words, and entities, words per sentence, entities per
Summarization of analysis curves for the measurements of annotation performance of
Annotation counts, speed, and quality comparison in the 120-minute main study
Comparison between Random and CAUSE in ALC score and F-measure of the last
Characteristics of Random and CAUSE in each 120-minute main study from user 1
Distributions of the training data for building annotation time models
Distribution of words and different types of entities in the pool of 29,789 unique
Schedule of the new user study using new data
Statistical analysis for annotation cost model for user 1
Statistical analysis for annotation cost model for user 2
Evaluation of different annotation cost models in R²
ALC scores of both users for different AL methods in the simulation study
Characteristics in average sentence length, entities per sentence, and entity density for
ALC scores, F-measures at the end of 120-minute annotation, and the statistical test
Annotation quantity, speed, and quality comparison in the 120-minute main study for