Clinical Text Mining

Chapter 2 continues with the history of patient records and the languages used in them. Automatically structuring the patient record to make it more readable is another interesting application that will be explained.

Early Work and Review Articles

This chapter introduces the long history of the patient record, from ancient times to the present. It covers the first attempts to describe and classify nature, which formed the first patient records, and the modern paper records with their various headings and sections describing the findings and symptoms, the patient's treatment and ultimately the outcome, followed by the organization of the paper records to make them easy to follow.

The Egyptians and the Greeks

Another important guideline is that the physician should not harm the patient during treatment. These case histories describe the symptoms of a typical patient day by day, together with the outcomes, most of which end in the death of the patient.

Fig. 2.1 Part of the Edwin Smith Papyrus describing in Egyptian hieratic script (a cursive hieroglyph writing) different surgery cases from 1600 BC (Published in Wikipedia)

The Arabs

The Swedes

The first formal record system in Sweden was developed, and systematic medical documentation introduced, in connection with the opening of Serafimerlasarettet (the Seraphim Hospital) in Stockholm in 1752 (Nilsson and Nilsson 2003; Nilsson 2007). In Sweden, the paper-based patient record system was developed and refined until 1980, when computerized patient record systems began to become more common, and it was more or less completely digitized in 2007 (Nilsson 2007).

The Paper Based Patient Record

For hospitalized patients, the patient record contains daily notes of treatment status and progress. These notes are usually taken by the nurses who also care for the patient on a daily basis.

Fig. 2.2 Description of the parts of a Swedish medical record from 1943. A 6-year-old boy was hit by a car and sustained a fracture of his femur

Greek and Latin Used in the Patient Record

Summary of the History of the Patient Record

Images or other third-party material in this chapter are covered under this chapter's Creative Commons license, unless otherwise noted in the credit line for the material. If the material is not covered by the chapter's Creative Commons license and your intended use is not permitted by law or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Reading and Retrieving Efficiency of Patient Records

Give the user positional structure so that the user can find new information but easily navigate back to the original position and still have a link to a reference. The events are marked and include, for example, problems, allergies, symptoms, diagnoses and drug prescriptions. The user can click on an event to get more information, see Fig. 3.2.

Natural Language Processing on Clinical Text

LifeLines is a prototype patient record system where the physician can see important events in a timeline in the patient record (Plaisant et al. 1998). This is of course for clinical research reasons, but can also be used by hospital management (Wang et al. 2011).

Electronic Patient Record System

Many of the previous ways of keeping track of the paper file have disappeared in the electronic patient record system. However, there has been research on how to improve ways to browse patient data.

Different User Groups

Summary

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate whether changes have been made.

This chapter will describe characteristics of patient records compared to other types of text, including a comparison of the characteristics of patient records written in different languages, the number of spelling errors compared to other types of text, syntactic differences, word choice, abbreviations, acronyms, compounds and compound construction, negation expressions, and also speculative keywords and factual expressions in clinical text.

Fig. 3.2 Screenshot of the Lifelines (Lifelines, http://www.cs.umd.edu/hcil/lifelines/

Patient Records

In general, patient records are written by highly skilled doctors and nurses using domain-specific terms.

Pathology Reports

Patient records are written under time pressure, and patient record systems do not include any spelling correction (or grammar checking) because such a function is difficult to build for the complicated, non-standard vocabulary used within healthcare. Therefore, clinical text contains non-standard abbreviations, domain-specific acronyms and incomplete sentences without a subject, meaning that the patient is not mentioned, only his or her status.

Spelling Errors in Clinical Text

In Table 4.1 there are some examples of misspellings in the Swedish text of the patient record and their correct spelling, together with the corresponding misspelled English version and the correctly spelled English word.

Abbreviations

Nizamuddin and Dalianis (2014) studied the Stockholm EPR PHI Corpus, which is a subset of the Stockholm EPR Corpus, and found 2.7% abbreviations. Regarding the ambiguity of clinical abbreviations, there are two studies: Liu et al. (2001) found that 33% of abbreviations in English clinical text are highly ambiguous, and Lövestam et al. (2014) analyzed 40 different abbreviations in Swedish dietetic notes from a subset of the Stockholm EPR Corpus written by three professions: dietitians, nurses and doctors.

Acronyms

Assertions

Negations

In a study by Chapman et al. (2001), more than half of the expressions in US radiology reports were found to contain negations. Physicians' notes contain more negations than nursing narratives related to the patient's day-to-day health care.

Speculation and Factuality

In Swedish, see Section 4.7.2, the texts of several clinical units were studied under the heading assessment and it was found that negated sentences or expressions comprised 13.5% of the texts (910 negated sentences out of a total of 6640 sentences) (Dalianis and Skeppstedt 2010). Definitely Negative expressions, and 12.2% were in the middle of the scale (Possibly Positive and Possibly Negative) while 47.6% of the expressions were confirmed as Definitely Positive in the final version of the corpus.

Table taken from Table 1 in Velupillai et al. (2011) © 2012 with permission from IOS Press.

Clinical Corpora Available

English Clinical Corpora Available

The BioScope corpus and the THYME corpus are two other well-known clinical corpora written in English. The clinical corpora in English are de-identified with respect to sensitive identifiers such as personal names, telephone numbers, etc.

Swedish Clinical Corpora

Relationships such as indications, adverse drug events, ADE outcome and ADE cause (these relationships will be explained in Section 10.2). The descriptions are in Swedish, but can be understood since the annotation classes are in English and there are numerical values for the number of classes.

Clinical Corpora in Other Languages than Swedish

Another Danish clinical corpus containing 323,122 patient health records used for de-identification (Pantazos et al. 2016). An Italian clinical corpus containing 23,695 patient records used for entity extraction and definition of semantic relationships (Attardi et al. 2015).

Summary

A Norwegian clinical corpus containing 7741 patient records comprising a total of 1,133,223 unstructured EHR text documents used for the identification of cancer patients (Jensen et al. 2017). A Spanish clinical corpus, the IXAMed corpus from Galdakao-Usansolo Hospital, collected during 2008-2012 and containing 141,800 patient records (Pérez et al. 2017).

International Statistical Classification of Diseases

International Classification of Diseases

There is a separate version of the ICD, the International Classification of Diseases for Oncology (ICD-O-3), which is also used to code pathology reports for cancer. Topography describes the anatomical site of origin, where the tumor is located in the body, and morphology describes the cell type (histology), the stage or behavior of the tumor (malignant or benign), and the number of tumors or metastases.

Systematized Nomenclature of Medicine: Clinical Terms

SNOMED CT is a clinical, hierarchical terminology containing medical terms and their associations as well as synonyms, including more than 320,000 terms. ICD-10 has a longer history than SNOMED CT and is widely used and known, while SNOMED CT is less well known.

Fig. 5.2 Hierarchy of the SNOMED CT code for Pneumonia using the IHTSDO SNOMED CT Browser (IHTSDO SNOMED CT Browser, http://browser.ihtsdotools.org/?perspective=

Medical Subject Headings (MeSH)

Unified Medical Language Systems (UMLS)

In this example, the numeric encoding of the MeSH descriptors (MeSH-pneumonia entry, https://www.nlm.nih.gov/cgi/mesh/2016/MB_cgi?mode=&.

Anatomical Therapeutic Chemical Classification (ATC)

Different Standards for Interoperability

Health Level 7 (HL7)

OpenEHR

Mapping and Expanding Terminologies

Summary of Medical Classifications and Terminologies

First, the scientific basis for evaluating all information retrieval systems, called the Cranfield paradigm, will be described. An example of a common task for retrieving information from electronic patient records will be presented.

Qualitative and Quantitative Evaluation

Subsequently, different evaluation concepts such as precision, recall, F-score, development, training and evaluation sets and k-fold cross-validation are described. This chapter also discusses manual annotation and inter-annotator agreement, annotation tools such as BRAT, and the gold standard.

The Cranfield Paradigm

Voorhees (2001) elaborates on the Cranfield paradigm and argues that this is the only way to evaluate information retrieval systems, as objective manual evaluation is very costly and can also be very inaccurate. The Cranfield paradigm has spawned the Text REtrieval Conference (TREC) and the Cross-Language Evaluation Forum (CLEF), where large collections of controlled documents along with questions on specific topics are used to evaluate information retrieval.

Metrics

The F-score is defined as the weighted harmonic mean of precision and recall, depending on the weighting function β, see Formula 6.3. Sensitivity (same as recall) measures the proportion of positives that are correctly identified (e.g., the percentage of sick people correctly identified as having the condition).

Annotation

Specificity measures the proportion of negatives that are correctly identified as negatives, that is, as not having the condition. Accuracy is another measure, defined as the proportion of true positive and true negative cases among all cases examined.
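
The measures described in this section can be made concrete with a small sketch. The function below computes precision, recall (sensitivity), specificity, accuracy and the F-score from the four confusion-matrix counts. The function name, the argument names and the example counts are illustrative assumptions, and the F-score is written in the standard weighted harmonic mean form with weight β, which is assumed rather than copied from the formula referenced in the text.

```python
def evaluation_metrics(tp, fp, tn, fn, beta=1.0):
    """Compute common evaluation metrics from confusion-matrix counts.

    tp, fp, tn, fn: true/false positives and true/false negatives.
    beta: weight of recall relative to precision in the F-score.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # same as sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0

    # Standard F-beta: weighted harmonic mean of precision and recall.
    b2 = beta ** 2
    denominator = b2 * precision + recall
    f_score = (1 + b2) * precision * recall / denominator if denominator else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_score": f_score,
    }


# Hypothetical counts, for illustration only.
example = evaluation_metrics(tp=8, fp=2, tn=85, fn=5)
```

With β = 1 precision and recall are weighted equally; a value above 1 favours recall, which may be preferable when missing a true case is more costly than raising a false alarm.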

Inter-Annotator Agreement (IAA)

Confidence and Statistical Significance Testing

The more times an algorithm outperforms another algorithm, the more reliable the results. If big data is considered, statistical significance testing is not really useful since there is so much data that the results will always be significant.

Annotation Tools

Gold Standard

Summary of Evaluation Metrics and Annotation

This chapter will describe the basics of text processing and provide an overview of standard methods or techniques: pre-processing of texts, such as tokenization and text segmentation; sentence-based methods, such as part-of-speech tagging and syntactic analysis or parsing; and semantic analysis, such as named entity recognition, negation detection, relation extraction, temporal processing and anaphora resolution.

Definitions

In general, the same building blocks used for ordinary texts can also be used for clinical text processing. The term text mining is also used in health informatics which mostly means the use of rule-based methods to process clinical or biomedical text.

Segmentation and Tokenisation

Morphological Processing

Lemmatisation
Stemming
Compound Splitting (Decompounding)
Abbreviation Detection and Expansion
Spell Checking and Spelling Error Correction
Part-of-Speech Tagging (POS Tagging)

Clinical text contains a proportion of abbreviations ranging from 3% to 10% of the entire text, see Section 4.4. Another method is a rule-based heuristic (or rule-of-thumb) method based on the form of abbreviations in clinical text, such as words that meet one of the following criteria (Xu et al. 2007):

Syntactical Analysis

Shallow Parsing (Chunking)

Shallow parsing (also called chunking or "light parsing") is something between POS tagging and full parsing. A shallow parser or chunker detects constituents in sentences in the form of noun phrases or verb phrases.
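
As a rough illustration of what a chunker does, the sketch below groups already POS-tagged tokens into noun phrases. It is a minimal hand-written example, not the method of any particular tool: the function name is hypothetical, the tags are assumed to follow the Penn Treebank convention (DT, JJ, NN), and only noun phrases of the form optional determiner, any adjectives, one or more nouns are detected.

```python
def chunk_noun_phrases(tagged):
    """Return noun-phrase chunks from a list of (token, POS tag) pairs.

    Assumed pattern: optional determiner (DT), any number of adjectives
    (JJ*), followed by at least one noun (NN*).
    """
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                 # optional determiner
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):   # adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):   # nouns
            k += 1
        if k > j:                                # at least one noun found
            chunks.append([token for token, _ in tagged[i:k]])
            i = k
        else:                                    # no chunk starts here
            i += 1
    return chunks


# Hypothetical tagged sentence, for illustration only.
sentence = [("The", "DT"), ("patient", "NN"), ("has", "VBZ"),
            ("a", "DT"), ("severe", "JJ"), ("headache", "NN")]
noun_phrases = chunk_noun_phrases(sentence)
```

A verb-phrase chunker could be sketched in the same way with a pattern over verb tags; real chunkers are usually trained or use richer grammars.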

Grammar Tools

The DCG grammar (and Prolog) can be thought of as a set of theorems, with the lexical items as facts, to be proved using a theorem prover. Another advantage of DCG is that it can be easily extended to produce a syntax tree that can be used to perform operations.

Semantic Analysis and Concept Extraction

Named Entity Recognition
Negation Detection
Factuality Detection
Relative Processing (Family History)
Temporal Processing
Relation Extraction
Anaphora Resolution

HeidelTime was adapted to English clinical text labeling in the i2b2 challenge and achieved the best results of all systems (Sun et al. 2013a). SVM produced the best results for the relation extraction task for clinical text in the 2010 i2b2/VA challenge (Uzuner et al. 2011).

Fig. 7.4 Diagram of clinical entity recognition systems: Comparison to a number of previous clinical NER studies

Summary of Basic Building Blocks for Clinical Text

Various machine learning approaches such as topic modeling, distributional semantics, and clustering will be presented. The results of rule-based systems and machine learning will be explained.

Rule-Based Methods

Regular Expressions

Therefore, to find all Swedish personal identity numbers in a file called "personnummer.txt", you need to type the following command in the Bash shell of the Linux operating system. Regular expressions can be used to find and replace personal identity numbers, telephone numbers and email addresses, which follow regular patterns and are easily identified, for example for de-identification purposes, see Section 9.4.
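
The same idea can be sketched in Python. The pattern below is an assumption about the surface form of Swedish personal identity numbers (six or eight date digits, an optional "-" or "+" separator, and four final digits); it does not validate dates or check digits, and it will also match other ten-digit strings such as some telephone numbers. The names PNR_PATTERN, find_personal_numbers and mask_personal_numbers, the placeholder and the example text are all hypothetical.

```python
import re

# Assumed surface form: (YY)YYMMDD, optional "-" or "+", then NNNN.
PNR_PATTERN = re.compile(r"\b(?:\d{2})?\d{6}[-+]?\d{4}\b")


def find_personal_numbers(text):
    """Return all substrings that look like personal identity numbers."""
    return PNR_PATTERN.findall(text)


def mask_personal_numbers(text, placeholder="<PNR>"):
    """Replace every matching substring with a placeholder."""
    return PNR_PATTERN.sub(placeholder, text)


# Hypothetical example text, for illustration only.
note = "Pat. 19121212-1212 inkom, anhörig 850101-1234."
found = find_personal_numbers(note)
masked = mask_personal_numbers(note)
```

To process a file such as the "personnummer.txt" mentioned above, the text could first be read with `open(...).read()` and then passed to either function.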

Machine Learning-Based Methods

Features and Feature Selection
Active Learning
Pre-Annotation with Revision or Machine
Clustering
Topic Modelling
Distributional Semantics
Association Rules

Supervised methods require time-consuming manual annotation of data to be used for training (and evaluation). Features represent certain aspects of the training data and are used as input to machine learning tools.
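
One of the simplest feature representations is a bag of words, where each document becomes a vector of word counts over a fixed vocabulary. The sketch below is only an illustration of that idea; the function names and the two toy documents are assumptions and are not taken from any corpus or tool discussed here.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect the sorted set of lower-cased tokens over all documents."""
    return sorted({token.lower() for doc in documents for token in doc})


def bag_of_words(tokens, vocabulary):
    """Turn one tokenised document into a vector of word counts."""
    counts = Counter(token.lower() for token in tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical tokenised documents, for illustration only.
documents = [["Fever", "and", "cough"], ["No", "fever"]]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(doc, vocabulary) for doc in documents]
```

Each vector, paired with its manually assigned label, would then form one training example for a supervised learner.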

Fig. 8.3 The interface of the Weka toolkit. In this example classifying patient records for healthcare associated infections taken from Ehrentraut et al

Explaining and Understanding the Results Produced

Computational Linguistic Modules for Clinical Text

NLP Tools: UIMA, GATE, NLTK etc

Summary of Computational Methods for Text Analysis

Various open tools for clinical text mining such as cTAKES, NLTK, GATE and Stanford CoreNLP were mentioned. The whole process of accessing electronic health records for research is complicated and requires certain steps.

Ethical Permission

These records contain valuable information about symptoms and conditions, rationales used to determine the patient's diagnosis and treatment, as well as any side effects the patient may have experienced. However, the free text of the patient record sometimes includes information that can identify the patient, such as telephone numbers of relatives (for example, the telephone number of the patient's wife, Mary). In these cases the patient can be considered identifiable, and therefore the patient must either consent to be part of the research project or have the identifiable parts removed.

Social Security Number

After obtaining ethical clearance for the research, the researcher must have access to anonymized patient data to conduct the research. Accessing patient records can be technically cumbersome if there is no easy way to extract the data from the electronic patient record system.

Safe Storage

Automatic De-Identification of Patient Records

Density of PHI in Electronic Patient Record Text

Regarding the density of PHI in the free text of patient data we can observe studies such as Douglass et al. In total 4423 instances of PHI were found which equates to 2.5% PHI of the total amount of tokens.

Table 9.2 Types and numbers of annotated tokens in the Stockholm EPR PHI Corpus

Pseudonymisation of Electronic Patient Records

The rest of the phone number was replaced with the same number of random digits. Personal names were replaced with real names sourced from the US Census Bureau at a frequency above 144 (or 0.002% of the data).
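
The telephone number replacement can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: how many leading digits are preserved is not given here, so the `keep` parameter, the function name and the example number are hypothetical.

```python
import random


def pseudonymise_phone(number, keep=3, rng=None):
    """Keep the first `keep` digits and replace the remaining digits
    with the same number of random digits. Spaces, hyphens and other
    non-digit characters are left in place so the layout is preserved.
    """
    if rng is None:
        rng = random.Random()
    result = []
    digits_seen = 0
    for char in number:
        if char.isdigit():
            digits_seen += 1
            if digits_seen <= keep:
                result.append(char)                      # preserved prefix
            else:
                result.append(str(rng.randrange(10)))    # random surrogate
        else:
            result.append(char)
    return "".join(result)


# Hypothetical number; a seeded generator makes the run repeatable.
surrogate = pseudonymise_phone("08-123 45 67", keep=2, rng=random.Random(0))
```

Because the surrogate has the same length and layout as the original, the pseudonymised text stays realistic for later processing steps.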

Re-Identification and Privacy

State-of-the-art de-identification systems achieve recall rates of 95-97%. Using the hiding in plain sight (HIPS) technique described by Carrell et al., the residual identifiers can be effectively masked without improving the de-identification system. Almgren and Pavlov knew the format of sensitive hand-marked data in electronic patient records that needed to be evaluated, in this case clinical entities in Swedish (disorder and finding, pharmaceutical drug, and body structure).

Summary of Ethics and Privacy of Patient Records

We sent the software code and trained models to an authorized person with access to sensitive data in a black box. In general, the highest density of sensitive PHI is in assessment and social affairs and discharge summaries of patient records.

Detection and Prediction of Healthcare Associated

Healthcare Associated Infections (HAIs)

Part of the definition of an HAI is that the patient must be admitted to the hospital for more than 48 hours before the infection can be defined as an HAI. Medical (or care) episodes are daily notes and data entered into the patient's file about the patient's treatment and status.

Fig. 10.1 If a patient (Patient 2) is discharged from one clinical unit and admitted to another within 24 h and the whole period is more than 48 h then that patient is considered to be admitted for the whole care episode and can therefore be analysed for H
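
The admission rule in the caption above can be sketched in code. The example below merges consecutive stays of one patient into a care episode when the gap between discharge and the next admission is at most 24 hours, and then checks whether the episode lasts more than 48 hours. The function names, the tuple representation of a stay and the example timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def merge_care_episodes(stays, max_gap=timedelta(hours=24)):
    """Merge (admitted, discharged) stays of one patient into episodes.

    Two stays belong to the same episode when the next admission occurs
    within `max_gap` of the previous discharge.
    """
    episodes = []
    for start, end in sorted(stays):
        if episodes and start - episodes[-1][1] <= max_gap:
            previous_start, previous_end = episodes[-1]
            episodes[-1] = (previous_start, max(previous_end, end))
        else:
            episodes.append((start, end))
    return episodes


def eligible_for_hai_analysis(episode, min_length=timedelta(hours=48)):
    """An episode qualifies when it lasts longer than `min_length`."""
    start, end = episode
    return end - start > min_length


# Hypothetical stays at two clinical units, for illustration only.
stays = [
    (datetime(2020, 1, 1, 8), datetime(2020, 1, 2, 10)),   # unit A, 26 h
    (datetime(2020, 1, 2, 20), datetime(2020, 1, 4, 9)),   # unit B, 10 h later
]
episodes = merge_care_episodes(stays)
```

In this example neither stay alone would need to exceed 48 hours; it is the merged episode of 73 hours that makes the patient eligible.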

Detecting and Predicting HAI

Two machine learning algorithms, Support Vector Machine (SVM) and Random Forest (RF), in the Weka toolkit were applied to the annotated Stockholm EPR Detect-HAI Corpus. A rule-based approach using Swedish clinical text to detect urinary tract infections was investigated by Tanushi et al. (2014); both patient record data and text were used.

Table 10.1 Statistics for the Stockholm EPR Detect-HAI Corpus

Commercial HAI Surveillance Systems

Note that the evaluated results are from a different domain than system development. Some systems also use information from the text in the patient's record.

Detection of Adverse Drug Events (ADEs)

Adverse Drug Events (ADEs)
Resources for Adverse Drug Event Detection
Passive Surveillance of ADEs
Active Surveillance of ADEs
Approaches for ADE Detection

A side effect is hopefully mild and probably known to the doctor, but is also related to the pharmacological properties of the drug. In the review article by Warrer et al. (2012), the authors focused on text mining-based warning systems for adverse drug events.

Table 10.2 ICD-10 diagnosis codes for adverse drug events

Suicide Prevention by Mining Electronic Patient Records

One observation was that most relationships are cross-entity, spanning more than 10 sentences from the disease entity to the drug entity, in cases of positive ADR events. There are still relatively few articles in this area, using free text in patient records to detect suicide risk, but hopefully research in this area will grow soon.

Mining Pathology Reports for Diagnostic Tests

The Case of the Cancer Registry of Norway

The first study, by Singh et al. (2015), used 25 pathology reports for prostate cancer written in free text in Norwegian. The second study, by Dahl et al. (2016), used the same 25 free-text pathology reports for prostate cancer as Singh et al.

The Medical Text Extraction (Medtex) System

The text contains descriptions of nine biopsies, four from the left side and five from the right side. (Published in Weegar and Dalianis (2015). The translation into English and excerpts of the free text are added in this publication.)

Mining for Cancer Symptoms

Text Summarisation and Translation of Patient Record

Summarising the Patient Record
Other Approaches in Summarising the Patient
Summarising Medical Scientific Text
Simplification of the Patient Record for Laypeople

One of the first systems for creating an automatic summary or abstract was described in Luhn (1958). Related to summarizing the patient record is simplifying the patient record for the lay reader.

Fig. 10.6 Example of automatic discharge summary creation. Redundant information is removed and high scoring information is added to the beginning of the summary from highest to lowest, low scoring information G, F and H, is excluded
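
The selection step described in the caption above can be sketched as follows. The sketch assumes that each piece of information already has a relevance score; how the scores are computed is not covered here. Redundancy is approximated by exact matching after lower-casing and whitespace normalisation, which is a simplifying assumption, and the function name, threshold and example items are hypothetical.

```python
def build_summary(scored_items, threshold):
    """Select summary content from (text, score) pairs.

    1. Drop redundant items (same text after simple normalisation).
    2. Exclude items scoring below `threshold`.
    3. Order the rest from highest to lowest score.
    """
    seen = set()
    unique = []
    for text, score in scored_items:
        key = " ".join(text.lower().split())
        if key not in seen:                  # keep first occurrence only
            seen.add(key)
            unique.append((text, score))

    kept = [item for item in unique if item[1] >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [text for text, _ in kept]


# Hypothetical scored items, for illustration only.
items = [("A", 0.9), ("B", 0.4), ("a", 0.9), ("C", 0.7), ("G", 0.1)]
summary = build_summary(items, threshold=0.3)
```

A real system would need a softer notion of redundancy, for example similarity between sentences, rather than exact matching.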

ICD-10 Diagnosis Code Assignment and Validation

Natural Language Generation from SNOMED CT

SNOMED CT is a very complex hierarchical medical terminology that contains many separate pieces of information with the aim of describing a disorder, its cause, symptoms and in which body part the disorder occurs. See for example the IHTSDO SNOMED CT browser and its description of scarlet fever in Fig. 10.9.

Search Cohort Selection and Similar Patient Cases

Comorbidities
Information Retrieval from Electronic Patient
Search Engine Solr
Supporting the Clinician in an Emergency
Incident Reporting
Hypothesis Generation
Practical Use of SNOMED CT
ICD-10 and SNOMED CT Code Mapping
Analysing the Patient’s Speech
MYCIN and Clinical Decision Support
IBM Watson Health

All teams participating in the shared task had to sign confidentiality agreements due to the potential sensitivity of patient data. In the first study (Lee et al. 2013), called "A survey of SNOMED CT implementations", the authors contacted over 50 SNOMED CT users in February 2012, which resulted in 14 interviews for 13 different implementations in over eight countries.

Fig. 10.12 Screenshot of Radsearch, a tool for extracting cohorts of patients with a certain medical condition in radiology

Summary of Applications of Clinical Text Mining

Both tools and datasets can be found and downloaded in these networks for use in clinical text mining. The shared tasks use clinical text sets made available to the research community.

Conferences, Workshops and Journals

Workshops such as the International Workshop on Health Text Mining and Information Analysis (Louhi) and Biomedical Natural Language Processing (BioNLP), held at both the Association for Computational Linguistics (ACL) and the Recent Advances in Natural Language Processing (RANLP) conferences, are dedicated to biomedical and clinical text mining. For scientific journals, there are a large number of venues such as Journal of Biomedical Informatics, Artificial Intelligence in Medicine, Journal of Biomedical Semantics, International Journal of Medical Informatics, Yearbook of Medical Informatics, BMC Medical Informatics and Decision Making, Journal of the American Medical Informatics Association, Health Informatics Journal, Studies in Health Technology and Informatics and many more.

Summary of Networks and Shared Tasks in Clinical Text

Negation detection is important since many of the symptoms in the clinical text are negated in the reasoning process undertaken by the physician to find the patient's disorder. The book concluded with a presentation of a number of research networks and shared tasks performed in clinical text mining.

Outcomes
