
3. An Active Learning-enabled Annotation System for Building Clinical Named Entity Recognition Models

3.2 Methods

3.2.1.1 System workflow

Initial design: The original plan was to follow the traditional pool-based AL framework [45].

Figure 7 shows the workflow of the initial Active LEARNER design. Once the system starts, the pool of unlabeled data is loaded into memory. At the initial iteration, before any CRF model has been generated, all sentences are randomly ranked. The top sentence in the ranked unlabeled set is queried and displayed on the interface. The annotator then highlights clinical entities in the sentence via the labeling function on the interface. When the user submits the annotated sentence, the labeled set and the unlabeled set are updated and the learning process is activated. Specifically, the learning process includes CRF model encoding based on the current labeled set and sentence ranking by the querying engine. The CRF model encoding is straightforward; however, rebuilding the CRF model can take time as the labeled data set grows. Sentence ranking consists of two steps: 1) CRF model decoding, which makes predictions for each unlabeled sentence based on the current model; and 2) ranking sentences by the querying algorithm, which considers both the probabilistic prediction of each sentence from step 1 and other information about the unlabeled sentences (i.e., clustering results). The learning process is complete when the ranked unlabeled set is updated. The next iteration starts when the annotator begins reading the new top unlabeled sentence on the interface. The program stops when the user clicks the quit button or when a pre-set cutoff time runs out. The drawback of the initial design, however, is that the annotator sometimes has to wait for the next sentence, because the learning process can take time: CRF model encoding/decoding can be slow with large samples.
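The sketch below illustrates this pool-based loop in Python. It is a minimal illustration rather than the system's actual implementation: it assumes the sklearn-crfsuite package, approximates sentence-level confidence from per-token marginals (a common least-confidence variant), and uses a hypothetical annotate callback in place of the annotation interface.

```python
import math
import random

import sklearn_crfsuite

def least_confidence(marginals):
    """Approximate sentence uncertainty as 1 minus the product of the
    highest label probability at each token (token-marginal variant)."""
    logp = sum(math.log(max(token.values())) for token in marginals)
    return 1.0 - math.exp(logp)

def pool_based_al(unlabeled, annotate, max_queries=100):
    """unlabeled: list of (features, tokens) pairs, one per sentence.
    annotate: callback that displays a sentence and returns its labels."""
    labeled_X, labeled_y = [], []
    random.shuffle(unlabeled)                 # initial iteration: random ranking
    for _ in range(max_queries):
        if not unlabeled:
            break
        features, tokens = unlabeled.pop(0)   # query the top-ranked sentence
        labeled_X.append(features)
        labeled_y.append(annotate(tokens))    # user highlights clinical entities
        # Learning: re-encode the CRF on the current labeled set ...
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
        crf.fit(labeled_X, labeled_y)
        # ... then decode every unlabeled sentence and re-rank the pool.
        if unlabeled:
            marginals = crf.predict_marginals([x for x, _ in unlabeled])
            scores = [least_confidence(m) for m in marginals]
            order = sorted(range(len(unlabeled)),
                           key=lambda i: scores[i], reverse=True)
            unlabeled[:] = [unlabeled[i] for i in order]
```

Because the loop rebuilds the model and re-ranks the entire pool inside each iteration, it also makes the delay problem described above concrete: the annotator is blocked until the ranking step finishes.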

  Figure 7. Workflow of Active LEARNER - initial design

Final design: To avoid delays in the workflow, we separate the annotation and learning processes by running two threads in parallel: the annotation thread and the learning thread. The final design workflow is shown in Figure 8.

  Figure 8. Workflow of Active LEARNER - final design

In the annotation thread, the black circle in the figure splits the flow into two sub-flows that run simultaneously. One sub-flow runs back to the ranked unlabeled set and the interface, so the user can read the next sentence on the interface immediately after the annotation of the previous sentence is submitted. The other sub-flow adds the newly annotated sentence to the labeled dataset and pushes the updated labeled set to the learning thread. In the learning thread, the process starts from an activator. A new learning process is activated only if the encoding or querying process in the learning thread is not busy and the number of newly annotated sentences is greater than or equal to a threshold (five in our study), which controls the update frequency. When the learning process is activated, it runs in parallel with the annotation thread and updates the ranked unlabeled set whenever new rankings are generated. This design allows a user to continuously annotate the top unlabeled sentence from the ranked list, which is generated by either the current or the previous learning process in the learning thread. The stop criteria are the same as those described in the initial design.
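A minimal Python sketch of this two-thread design is given below. It is illustrative only: the class and method names are hypothetical stand-ins, retrain_and_rank abstracts the CRF encoding and pool re-ranking, and the threshold of five mirrors the update-frequency control described above.

```python
import threading

UPDATE_THRESHOLD = 5  # newly annotated sentences required before re-training

class TwoThreadLearner:
    def __init__(self, retrain_and_rank):
        self.retrain_and_rank = retrain_and_rank  # CRF encoding + pool re-ranking
        self.lock = threading.Lock()
        self.labeled = []
        self.pending = 0       # annotations since the last activated run
        self.busy = False      # True while encoding/decoding/ranking runs

    def submit_annotation(self, sentence, labels):
        """Annotation thread: returns immediately, so the interface can show
        the next top-ranked sentence without waiting for the learner."""
        with self.lock:
            self.labeled.append((sentence, labels))
            self.pending += 1
            # Activator: start a learning run only if the learner is idle
            # and at least UPDATE_THRESHOLD new sentences have arrived.
            if not self.busy and self.pending >= UPDATE_THRESHOLD:
                self.busy = True
                self.pending = 0
                snapshot = list(self.labeled)
                threading.Thread(target=self._learn, args=(snapshot,),
                                 daemon=True).start()

    def _learn(self, snapshot):
        """Learning thread: runs in parallel with annotation and swaps in
        the new ranked unlabeled set when finished."""
        try:
            self.retrain_and_rank(snapshot)
        finally:
            with self.lock:
                self.busy = False
```

Taking a snapshot of the labeled set under the lock lets annotation continue while the learner trains on a consistent copy, which is the key design choice that removes the waiting time of the initial design.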

To better record and manage the user study, we also integrated additional functions into the Active LEARNER system:

Log Function: We collect and record various types of information during annotation, including 1) user annotation activities, such as marking, changing, or deleting entities; 2) detailed time information, such as the start and end annotation time stamps for every sentence; and 3) model performance information, such as intermediate NER model files and the querying score of each unlabeled sentence at each update, so that we can report the precision, recall, and F-measure of models over time. All of the logging information is analyzed after the annotation task is completed, to provide additional insights into the annotation and learning processes.
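One simple way to capture such events is an append-only log of time-stamped JSON records, as in the sketch below. The log_event helper and its fields are illustrative assumptions, not the system's actual log format.

```python
import json
import time

def log_event(logfile, event_type, **fields):
    """Append one time-stamped annotation event (e.g., entity marked/changed/
    deleted, sentence start/end, querying scores) as a JSON line."""
    record = {"time": time.time(), "event": event_type, **fields}
    logfile.write(json.dumps(record) + "\n")
    logfile.flush()  # keep events even if a session is interrupted

# Example: record when the user submits an annotated sentence.
# with open("annotation.log", "a") as f:
#     log_event(f, "sentence_end", sentence_id=42, n_entities=3)
```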

Session Manager: We divide the entire annotation task into sessions so that users can take a break between sessions. The length of each session can be pre-set on the interface to 15, 30, 45, or 60 minutes. When the user clicks the start button, Session #1 starts and the timer activates. When the session time is up, a pop-up window interrupts the annotation and reminds the user to take a break. After the break, the user can click the "Resume" button to continue the annotation in the next session (e.g., Session #2, Session #3, and so on). The system also automatically saves everything, so the annotation task can be resumed even if it is paused in the middle of a session.
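A hedged sketch of such session control is shown below; the SessionManager class and its callbacks are hypothetical stand-ins for the interface's pop-up and autosave behavior.

```python
import threading

SESSION_MINUTES = (15, 30, 45, 60)  # pre-settable session lengths

class SessionManager:
    def __init__(self, minutes, on_break, save_state):
        assert minutes in SESSION_MINUTES
        self.minutes = minutes
        self.on_break = on_break      # e.g., shows the break pop-up window
        self.save_state = save_state  # autosave so the task can resume
        self.session_no = 0

    def start(self):
        """Begin the next session; also invoked by the "Resume" button."""
        self.session_no += 1
        timer = threading.Timer(self.minutes * 60, self._expire)
        timer.daemon = True
        timer.start()

    def _expire(self):
        self.save_state()                # persist annotations and model state
        self.on_break(self.session_no)   # interrupt and remind the user
```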

Prerequisites for running the Active LEARNER include: (1) the corpus should be pre-processed for tokenization and sentence separation; (2) features for CRF encoding and decoding should be pre-extracted for every sentence; and (3) the entities of interest need to be pre-defined (e.g., problem, treatment, and test).
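A minimal sketch of such pre-processing is given below. The regex-based sentence splitter, tokenizer, and feature set are illustrative placeholders, not the pipeline used in the study.

```python
import re

ENTITY_TYPES = ("problem", "treatment", "test")  # pre-defined entities of interest

def split_sentences(text):
    """Naive sentence separation; a clinical pipeline would use a real splitter."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    return re.findall(r"\w+|[^\w\s]", sentence)

def token_features(tokens, i):
    """One feature dict per token, pre-extracted for CRF encoding/decoding."""
    w = tokens[i]
    return {
        "word.lower": w.lower(),
        "word.isupper": w.isupper(),
        "word.isdigit": w.isdigit(),
        "suffix3": w[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
    }

def preprocess(text):
    """Yield (tokens, features) for every sentence, ready for annotation."""
    for sent in split_sentences(text):
        tokens = tokenize(sent)
        yield tokens, [token_features(tokens, i) for i in range(len(tokens))]
```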
