Chapter 2 continues with the history of patient records and the languages used in them. Automatically structuring the patient record to make it more readable is another interesting application that will be explained.
Early Work and Review Articles
This chapter introduces the long history of the patient record, from ancient times to the present: from the first attempts to describe and classify nature, which formed the first patient records, to the modern paper records with their various headings and sections describing findings and symptoms, the patient's treatment and ultimately the outcome, followed by the organization of the paper records to make them easy to follow.
The Egyptians and the Greeks
In addition, another important guideline is that the physician should not harm the patient while he or she is being treated. These case histories describe symptoms day by day for a typical patient and the outcomes, most of which lead to the death of the patient.
The Arabs
The Swedes
The first formal record system in Sweden was developed, and systematic medical documentation was introduced, in connection with the opening of Serafimerlasarettet (Seraphim Hospital) in Stockholm in 1752 (Nilsson and Nilsson 2003; Nilsson 2007). In Sweden, the paper-based patient record system was developed and refined until 1980, when computerized patient record systems began to become more common, and the patient record was more or less completely digitized by 2007 (Nilsson 2007).
The Paper Based Patient Record
For hospitalized patients, the patient record contains daily notes of treatment status and progress. These notes are usually taken by the nurses who also care for the patient on a daily basis.
Greek and Latin Used in the Patient Record
Summary of the History of the Patient Record
Images or other third-party material in this chapter are covered by the chapter's Creative Commons license, unless otherwise noted in the credit line for the material. If the material is not covered by the chapter's Creative Commons license and your intended use is not permitted by law or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Reading and Retrieving Efficiency of Patient Records
Give the user positional structure so that the user can find new information but easily navigate back to the original position and still have a link to a reference. The events are marked and include, for example, problems, allergies, symptoms, diagnoses and drug prescriptions. The user can click on an event to get more information, see Fig. 3.2.
Natural Language Processing on Clinical Text
LifeLines is a prototype patient record system where the physician can see important events in a timeline in the patient record (Plaisant et al. 1998). This is of course for clinical research reasons, but can also be used by hospital management (Wang et al. 2011).
Electronic Patient Record System
Many of the previous ways of keeping track of the paper file have disappeared in the electronic patient record system. However, there has been research on how to improve ways to browse patient data.
Different User Groups
Summary
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate whether changes have been made.
This chapter will describe characteristics of patient records compared to other types of text, including a comparison of characteristics of patient records written in different languages, the number of spelling errors compared to other types of text, syntactic differences, word choice, abbreviations, acronyms, compounds and compound construction, negation expressions, and speculative cues and factual expressions in clinical text.
Patient Records
In general, patient records are written by highly skilled doctors and nurses using domain-specific terms.
Pathology Reports
Patient records are written under time pressure, and patient record systems do not include spelling correction or grammar checking, because such functions are difficult to build for the complicated, non-standard vocabulary used within healthcare. Therefore, clinical text contains non-standard abbreviations, domain-specific acronyms and incomplete sentences without a subject, in which the patient is not mentioned, only his or her status.
Spelling Errors in Clinical Text
In Table 4.1 there are some examples of misspellings in the Swedish text of the patient record and their correct spelling, together with the corresponding misspelled English version and the correctly spelled English word.
Abbreviations
Nizamuddin and Dalianis (2014) studied the Stockholm EPR PHI Corpus, which is a subset of the Stockholm EPR Corpus, and found 2.7% abbreviations. Regarding the ambiguity of clinical abbreviations, there are two studies: Liu et al. (2001) found that 33% of abbreviations in English clinical text are highly ambiguous, and Lövestam et al. (2014) analyzed 40 different abbreviations in Swedish dietetic notes from a subset of the Stockholm EPR Corpus written by three professions: dietitians, nurses and doctors.
Acronyms
Assertions
Negations
In a study by Chapman et al. (2001), more than half of the expressions in US radiology reports were found to contain negations. Physicians' notes contain more negations than nursing narratives related to the patient's day-to-day health care.
Speculation and Factuality
For Swedish (see Section 4.7.2), texts from several clinical units written under the heading Assessment were studied, and negated sentences or expressions were found to comprise 13.5% of the texts (910 negated sentences out of a total of 6640 sentences) (Dalianis and Skeppstedt 2010). Definitely Negative expressions, and 12.2% were in the middle of the scale (Possibly Positive and Possibly Negative), while 47.6% of the expressions were confirmed as Definitely Positive in the final version of the corpus.
Clinical Corpora Available
English Clinical Corpora Available
BioScope Corpus and Thyme corpus are two other well-known clinical corpora written in English. The clinical corpora in English are de-identified with respect to sensitive identifiers such as personal names, telephone numbers, etc.
Swedish Clinical Corpora
Relationships such as indications, adverse drug events, ADE outcome and ADE cause (these relationships will be explained in Section 10.2). The descriptions are in Swedish, but can be understood since the annotation classes are in English and there are numerical values for the number of classes.
Clinical Corpora in Other Languages than Swedish
Another Danish clinical corpus, containing 323,122 patient health records, was used for de-identification (Pantazos et al. 2016). An Italian clinical corpus containing 23,695 patient records was used for entity extraction and definition of semantic relationships (Attardi et al. 2015).
Summary
A Norwegian clinical corpus containing 7741 patient records, comprising a total of 1,133,223 unstructured EHR text documents, was used for the identification of cancer patients (Jensen et al. 2017). A Spanish clinical corpus, the IXAMed corpus from Galdakao-Usansolo Hospital, was collected during 2008-2012 and contains 141,800 patient records (Pérez et al. 2017).
International Statistical Classification of Diseases
International Classification of Diseases
There is a separate version of the ICD, the International Classification of Diseases for Oncology (ICD-O-3), which is also used to code pathology reports for cancer. Topography describes the anatomical site of origin, that is, where the tumor is located in the body, and morphology describes the cell type (histology), the stage or behavior of the tumor (malignant or benign), and the number of tumors or metastases.
Systematized Nomenclature of Medicine: Clinical Terms
SNOMED CT is a clinical, hierarchical terminology containing medical terms and their associations as well as synonyms, including more than 320,000 terms. ICD-10 has a longer history than SNOMED CT and is widely used and known, while SNOMED CT is less well known.
Medical Subject Headings (MeSH)
Unified Medical Language Systems (UMLS)
In this example, the numeric encoding of the MeSH descriptors (MeSH-pneumonia entry, https://www.nlm.nih.gov/cgi/mesh/2016/MB_cgi?mode=&.
Anatomical Therapeutic Chemical Classification (ATC)
Different Standards for Interoperability
Health Level 7 (HL7)
OpenEHR
Mapping and Expanding Terminologies
Summary of Medical Classifications and Terminologies
First, the scientific basis for evaluating all information retrieval systems, called the Cranfield paradigm, will be described. An example of a common task for retrieving information from electronic patient records will be presented.
Qualitative and Quantitative Evaluation
Subsequently, different evaluation concepts such as precision, recall, F-score, development, training and evaluation sets and k-fold cross-validation are described. This chapter also discusses manual annotation and inter-annotator agreement, annotation tools such as BRAT, and the gold standard.
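As an illustration of k-fold cross-validation, the following minimal sketch splits a data set of n items into k folds, each of which serves once as the evaluation set while the remaining items are used for training. The function name `k_fold_splits` and the contiguous, unshuffled fold layout are assumptions made for this example; practical toolkits usually shuffle or stratify the data first.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    # Spread the remainder so that fold sizes differ by at most one item.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Example: 10 annotated documents evaluated with 3-fold cross-validation.
splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Every item ends up in exactly one evaluation fold, so the scores from the k runs can be averaged into a single estimate.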
The Cranfield Paradigm
Voorhees (2001) elaborates on the Cranfield paradigm and argues that this is the only way to evaluate information retrieval systems, as objective manual evaluation is very costly and can also be very inaccurate. The Cranfield paradigm has given rise to the Text Retrieval Conference (TREC) and the Cross-Language Evaluation Forum (CLEF), where large controlled document collections along with questions on specific topics are used to evaluate information retrieval.
Metrics
The F-score is defined as the weighted harmonic mean of precision and recall, where the weighting is set by β, see Formula 6.3. Sensitivity (the same as recall) measures the proportion of positives correctly identified (e.g., the percentage of sick people correctly identified as having the condition).
Annotation
Specificity measures the proportion of negatives that are correctly identified as negatives, that is, as not having the condition. Accuracy is another measure, defined as the proportion of true positives and true negatives among all cases.
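The measures above can be sketched from the four cells of a confusion matrix. This is a minimal illustration, not the book's own code: the function name `evaluation_metrics` and the example counts are assumptions, and the F-score uses the standard form F_β = (1 + β²) · P · R / (β² · P + R), which is assumed to correspond to Formula 6.3.

```python
def evaluation_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute common evaluation measures from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # also called sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    b2 = beta * beta
    denominator = b2 * precision + recall
    # Weighted harmonic mean of precision and recall.
    f_score = (1 + b2) * precision * recall / denominator if denominator else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_score": f_score,
    }


# Made-up counts: 40 true positives, 10 false positives,
# 20 false negatives and 30 true negatives.
scores = evaluation_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With β = 1 precision and recall are weighted equally; a β above 1 weights recall more heavily, and a β below 1 favours precision.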
Inter-Annotator Agreement (IAA)
Confidence and Statistical Significance Testing
The more times an algorithm outperforms another algorithm, the more reliable the results. If big data is considered, statistical significance testing is less useful, since with so much data even very small differences will almost always be significant.
Annotation Tools
Gold Standard
Summary of Evaluation Metrics and Annotation
This chapter will describe the basics of text processing and provide an overview of standard methods or techniques: pre-processing of texts, such as tokenization and text segmentation; sentence-based methods, such as part-of-speech tagging and syntactic analysis or parsing; and semantic analysis, such as named entity recognition, negation detection, relation extraction, temporal processing and anaphora resolution.
Definitions
In general, the same building blocks used for ordinary texts can also be used for clinical text processing. The term text mining is also used in health informatics which mostly means the use of rule-based methods to process clinical or biomedical text.
Segmentation and Tokenisation
Morphological Processing
- Lemmatisation
- Stemming
- Compound Splitting (Decompounding)
- Abbreviation Detection and Expansion
- Spell Checking and Spelling Error Correction
- Part-of-Speech Tagging (POS Tagging)
Clinical text contains a proportion of abbreviations ranging from 3% to 10% of the entire text, see Section 4.4. Another method is a rule-based heuristic (or rule-of-thumb) method based on the form of abbreviations in clinical text, which selects words that meet certain form-based criteria (Xu et al. 2007).
Syntactical Analysis
Shallow Parsing (Chunking)
Shallow parsing (also called chunking or "light parsing") is something between POS tagging and full parsing. A shallow parser or chunker detects constituents in sentences in the form of noun phrases or verb phrases.
Grammar Tools
A DCG (and Prolog) can be thought of as a set of theorems, with the lexical items as facts, to be proved using a theorem prover. Another advantage of DCG is that it can easily be extended to produce a syntax tree that can be used to perform operations.
Semantic Analysis and Concept Extraction
- Named Entity Recognition
- Negation Detection
- Factuality Detection
- Relative Processing (Family History)
- Temporal Processing
- Relation Extraction
- Anaphora Resolution
HeidelTime was adapted to English clinical text labeling in the i2b2 challenge and achieved the best results of all systems (Sun et al. 2013a). SVM produced the best results for the relation extraction task for clinical text in the 2010 i2b2/VA challenge (Uzuner et al. 2011).
Summary of Basic Building Blocks for Clinical Text
Various machine learning approaches such as topic modeling, distributional semantics, and clustering will be presented. The results of rule-based systems and machine learning will be explained.
Rule-Based Methods
Regular Expressions
Therefore, to find all Swedish personal identity numbers in a file called "personnummer.txt", a regular-expression search can be run from the Bash shell on Linux. Regular expressions can be used to find and replace personal identity numbers, telephone numbers and email addresses, which have a regular form and are easily identified, for example for de-identification purposes, see Section 9.4.
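A regular expression for this kind of task might look like the following minimal Python sketch. The pattern is an assumption made for illustration: it accepts six or eight date digits, a "-" or "+" separator and four final digits, and it validates neither the date nor the check digit. The sample text and the placeholder string are made up.

```python
import re

# Assumed pattern: optional two-digit century, YYMMDD, separator, four digits.
PNR_PATTERN = re.compile(r"\b(?:\d{2})?\d{6}[-+]\d{4}\b")


def find_personal_numbers(text):
    """Return all substrings that look like Swedish personal identity numbers."""
    return PNR_PATTERN.findall(text)


def mask_personal_numbers(text, placeholder="<PERSONNUMMER>"):
    """Replace every matching number with a placeholder, e.g. for de-identification."""
    return PNR_PATTERN.sub(placeholder, text)


# Fictitious example text, not taken from any patient record.
sample = "Pat. 19121212-1212 inkom akut. Maka 640823-3234 kontaktad."
found = find_personal_numbers(sample)
masked = mask_personal_numbers(sample)
print(found)
print(masked)
```

The same search-and-replace idea carries over to telephone numbers and email addresses, each with its own pattern.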
Machine Learning-Based Methods
- Features and Feature Selection
- Active Learning
- Pre-Annotation with Revision or Machine
- Clustering
- Topic Modelling
- Distributional Semantics
- Association Rules
Supervised methods require time-consuming manual annotation of the data to be used for training (and evaluation). Features represent certain aspects of the training data and are used as input to machine learning tools.
Explaining and Understanding the Results Produced
Computational Linguistic Modules for Clinical Text
NLP Tools: UIMA, GATE, NLTK etc
Summary of Computational Methods for Text Analysis
Various open tools for clinical text mining such as cTakes, NLTK, GATE and Stanford Core NLP were mentioned. The whole process of accessing electronic health records for research is complicated and requires certain steps.
Ethical Permission
These records contain valuable information about symptoms and conditions, the rationales used to determine the patient's diagnosis and treatment, as well as any side effects the patient may have experienced. However, the free text of the patient record sometimes includes information that can identify the patient, such as telephone numbers of relatives (for example, the telephone number of the patient's wife, Mary). In these cases the patient can be considered identifiable, and therefore the patient must either consent to be part of the research project or have the identifiable parts removed.
Social Security Number
After obtaining ethical clearance for the research, the researcher must have access to anonymized patient data to conduct the research. Accessing patient records can be technically cumbersome if there is no easy way to extract the data from the electronic patient record system.
Safe Storage
Automatic De-Identification of Patient Records
Density of PHI in Electronic Patient Record Text
Regarding the density of PHI in the free text of patient data we can observe studies such as Douglass et al. In total 4423 instances of PHI were found which equates to 2.5% PHI of the total amount of tokens.
Pseudonymisation of Electronic Patient Records
The rest of the phone number was replaced with the same number of random numbers. Personal names were replaced with real names sourced from the US Census Bureau at a frequency above 144 (or 0.002% of the data).
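The telephone-number replacement described above can be sketched as follows. This is an illustrative assumption rather than the original system: the function name `pseudonymise_phone`, the number of leading digits kept (`keep_digits`) and the example number are all hypothetical.

```python
import random


def pseudonymise_phone(number, keep_digits=3, rng=None):
    """Keep the first `keep_digits` digits and replace the rest with random digits.

    Separators such as '-' and spaces are kept, so the surrogate has the same
    length and layout as the original number.
    """
    if rng is None:
        rng = random.Random()
    result = []
    digits_seen = 0
    for ch in number:
        if ch.isdigit():
            digits_seen += 1
            if digits_seen <= keep_digits:
                result.append(ch)
            else:
                result.append(str(rng.randrange(10)))
        else:
            result.append(ch)
    return "".join(result)


# Fictitious number; a seeded generator makes the run repeatable.
original = "617-555-0142"
surrogate = pseudonymise_phone(original, keep_digits=3, rng=random.Random(0))
print(original, "->", surrogate)
```

Keeping the layout means the surrogate still reads as a plausible telephone number in the running text.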
Re-Identification and Privacy
State-of-the-art de-identification systems achieve recall rates of 95-97%. Using the technique described by Carrell et al., Hiding In Plain Sight (HIPS), the remaining residual identifiers can be effectively masked without improving the de-identification system. Almgren and Pavlov knew the format of the sensitive, manually annotated data in the electronic patient records that was to be used for evaluation, in this case clinical entities in Swedish (disorder and finding, pharmaceutical drug, and body structure).
Summary of Ethics and Privacy of Patient Records
We sent the software code and trained models to an authorized person with access to sensitive data in a black box. In general, the highest density of sensitive PHI is in assessment and social affairs and discharge summaries of patient records.
Detection and Prediction of Healthcare Associated Infections
Healthcare Associated Infections (HAIs)
Part of the definition of an HAI is that the patient must be admitted to the hospital for more than 48 hours before the infection can be defined as an HAI. Medical (or care) episodes are daily notes and data entered into the patient's file about the patient's treatment and status.
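The 48-hour criterion mentioned above can be expressed as a simple timing check. The sketch below covers only that part of the definition and is not a clinical rule; the function name `meets_hai_time_criterion` and the timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta

HAI_THRESHOLD = timedelta(hours=48)


def meets_hai_time_criterion(admitted_at, infection_onset):
    """Return True if the infection began more than 48 hours after admission."""
    return infection_onset - admitted_at > HAI_THRESHOLD


# Fictitious timestamps for one care episode.
admitted = datetime(2020, 1, 1, 8, 0)
late_onset = datetime(2020, 1, 4, 8, 0)    # 72 hours after admission
early_onset = datetime(2020, 1, 2, 8, 0)   # 24 hours after admission

print(meets_hai_time_criterion(admitted, late_onset))
print(meets_hai_time_criterion(admitted, early_onset))
```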
Detecting and Predicting HAI
Two machine learning algorithms, Support Vector Machine (SVM) and Random Forest (RF), in the Weka toolkit were applied to the annotated Stockholm EPR Detect-HAI Corpus. A rule-based approach using Swedish clinical text to detect urinary tract infections was investigated by Tanushi et al. (2014), using both patient record data and text.
Commercial HAI Surveillance Systems
Note that the evaluated results are from a different domain than system development. Some systems also use information from the text in the patient's record.
Detection of Adverse Drug Events (ADEs)
- Adverse Drug Events (ADEs)
- Resources for Adverse Drug Event Detection
- Passive Surveillance of ADEs
- Active Surveillance of ADEs
- Approaches for ADE Detection
A side effect is hopefully mild and probably known to the doctor, but is also related to the pharmacological properties of the drug. In the review article by Warrer et al. (2012), the authors focused on text mining-based warning systems for adverse drug events.
Suicide Prevention by Mining Electronic Patient Records
One observation was that most relationships are cross-entity, spanning more than 10 sentences from the disease entity to the drug entity, in cases of positive ADR events. There are still relatively few articles in this area, using free text in patient records to detect suicide risk, but hopefully research in this area will grow soon.
Mining Pathology Reports for Diagnostic Tests
The Case of the Cancer Registry of Norway
The first study was by Singh et al. (2015), using 25 pathology reports for prostate cancer written in free text in Norwegian. The second study, by Dahl et al. (2016), used the same 25 free-text pathology reports for prostate cancer as Singh et al.
The Medical Text Extraction (Medtex) System
The text contains descriptions of 9 biopsies, four from the left side and five from the right side. (Published in Weegar and Dalianis (2015). The translation into English and excerpts of the free text are added in this publication.)
Mining for Cancer Symptoms
Text Summarisation and Translation of Patient Record
- Summarising the Patient Record
- Other Approaches in Summarising the Patient
- Summarising Medical Scientific Text
- Simplification of the Patient Record for Laypeople
One of the first systems for creating an automatic summary or abstract was described in Luhn (1958). Related to summarizing the patient record is simplifying the patient record for the lay reader.
ICD-10 Diagnosis Code Assignment and Validation
Natural Language Generation from SNOMED CT
SNOMED CT is a very complex hierarchical medical terminology that contains many separate pieces of information with the aim of describing a disorder, its cause, symptoms and in which body part the disorder occurs. See for example the IHTSDO SNOMED CT browser and its description of scarlet fever in Fig. 10.9.
Search Cohort Selection and Similar Patient Cases
- Comorbidities
- Information Retrieval from Electronic Patient
- Search Engine Solr
- Supporting the Clinician in an Emergency
- Incident Reporting
- Hypothesis Generation
- Practical Use of SNOMED CT
- ICD-10 and SNOMED CT Code Mapping
- Analysing the Patient’s Speech
- MYCIN and Clinical Decision Support
- IBM Watson Health
All teams participating in the shared task had to sign confidentiality agreements due to the potential sensitivity of patient data. In the first study (Lee et al. 2013), called "A survey of SNOMED CT implementations", the authors contacted over 50 SNOMED CT users in February 2012, which resulted in 14 interviews covering 13 different implementations in over eight countries.
Summary of Applications of Clinical Text Mining
Both tools and datasets can be found and downloaded in these networks for use in clinical text mining. The shared tasks use clinical text sets made available to the research community.
Conferences, Workshops and Journals
Workshops such as the International Workshop on Medical Text Mining and Information Analysis (Louhi) and Biomedical Natural Language Processing (BioNLP), held at both the Association for Computational Linguistics (ACL) and the Recent Advances in Natural Language Processing (RANLP) conferences, are dedicated to biomedical and clinical text mining. For scientific journals, there are a large number of venues, such as Journal of Biomedical Informatics, Artificial Intelligence in Medicine, Journal of Biomedical Semantics, International Journal of Medical Informatics, Yearbook of Medical Informatics, BMC Medical Informatics and Decision Making, Journal of the American Medical Informatics Association, Health Informatics Journal, Studies in Health Technology and Informatics and many more.
Summary of Networks and Shared Tasks in Clinical Text
Negation detection is important since many of the symptoms in the clinical text are negated in the reasoning process undertaken by the physician to find the patient's disorder. The book concluded with a presentation of a number of research networks and shared tasks performed in clinical text mining.
Outcomes