The English dictionary is used to convert word tokens into correct candidate words, while the proprietary dictionary is used as a guide to select only the words that are meaningful in the domain-specific task. The practicality of the proposed approach was demonstrated on the task of recognizing the list of ingredients printed on the cover of packaged foods. The word tokens are converted into correct words by processes that use these dictionaries. The results of the combined approaches are reliable: the system returns an accurate list of ingredients, free of useless characters and non-essential components.
Introduction
Research Objectives
The OCR result is enhanced by the post-processing method attached to the OCR.
Literature Review
- Review of Optical Character Recognition (OCR)
- Review of The Tesseract OCR Engine
- Review of OCR Post-Processing
- Review of Related Work
- Eatable
- Food Allergy Scanner
Table 1.1 below compares the usage and other characteristics of several of the most widely used recent OCR engines [11]. Tesseract is an open source engine recognized as one of the most accurate OCR engines, alongside Google Cloud Vision, ABBYY FineReader, and others [15]. The input to the OCR process is an image that has been pre-processed to change its form (its size and perspective).
The voting technique [19] used in OCR has two approaches. In the first, the image data is first pre-processed with various image filters. The proposed approach can be categorized as a domain-based retrieval approach [21]: since the kind of data to be extracted is known in advance, a proprietary dictionary can be built to correct the OCR output. In addition, it causes a reaction in the patient's body, but does not involve the patient's immune system.
Some ingredients can trigger symptoms even when only a small amount containing the allergen is consumed. Both apps share the same method, which uses allergens entered by other users as input to locate the store. Grocery products and nearby stores then appear as search results.
Both applications use product barcodes to search and retrieve product information from an online food database available on the Internet.
System Design
Get Image of Product’s Ingredients
- Client-Server Architecture
For this system, the image of the product is sent by the user through a mobile application. On the application, the user's information, disease details secured in the database and the uploaded image can be accessed by the system. To use the system, the image must be uploaded to the application before the image is sent and processed on the back end.
First of all, users register their personal information, including name, age, gender and, most importantly, their disease conditions, on the server by using the client systems, which are usually smartphones. When a user with an illness wants to know whether a packaged food is safe to consume, the user takes a photo of the product cover on which the ingredients list is printed. The image of the product's ingredient list is then sent to the server for processing.
The server, for its part, first extracts each ingredient from the image of the list and then evaluates the safety of the food by comparing each individual ingredient in the food with the ingredients that are harmful for the user's illness. In order to make a correct decision about the safety of the product, it is crucial to extract the exact ingredient list from the image.
Perform OCR
- OCR and Its Pre-Processing
OCR technology performs a number of pre-processing steps before the main process, which in turn includes segmentation, feature extraction and classification. A skewed image directly affects OCR line segmentation and reduces its accuracy. To correct skewed text, it is necessary to first detect the skewed text block in the image, then calculate the angle of rotation, and finally rotate the image to correct the skew.
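The skew-correction steps can be illustrated with a minimal sketch. It assumes the text block has already been detected and reduced to a handful of (x, y) points along one text baseline; a real system would work on image pixels with an image-processing library. The function names `estimate_skew_degrees` and `rotate_points` are hypothetical and chosen only for this illustration.

```python
import math

def estimate_skew_degrees(points):
    """Estimate the skew angle of a text line from (x, y) baseline points.

    Fits a least-squares line through the points and returns the angle of
    that line relative to the horizontal axis, in degrees.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return math.degrees(math.atan2(sxy, sxx))

def rotate_points(points, degrees):
    """Rotate points about the origin by the given angle in degrees."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c - y * s, x * s + y * c) for x, y in points]

# Hypothetical baseline with a slope of 0.1 (illustrative data only)
baseline = [(x, 0.1 * x + 5.0) for x in range(0, 100, 10)]
angle = estimate_skew_degrees(baseline)
# Rotating by the negative of the estimated angle levels the baseline
deskewed = rotate_points(baseline, -angle)
```

After the correction, all points in `deskewed` share the same y value, which is what a level text line looks like in this simplified model.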
After these pre-processing steps, segmentation divides the digital image into multiple segments (sets of regions) [20]. Segmentation is used on text-based images to retrieve specific information from the entire image. The features produced by feature extraction should be independent of font attributes such as type, size, style, tilt and rotation, and should be able to describe complex, distorted and broken characters effectively.
Post-Processing of OCR
- Tokenization
- Extract Correct Ingredients Using Dictionaries
Tokenization is one of the natural language processing tasks and is widely used in computer science. It is also an important part of lexical analysis. Sentence segmentation, which is another crucial step, is carried out based on punctuation marks such as '.', '?' and '!', because they mark sentence boundaries. In this study, the presence of non-alphanumeric characters in a word is one of the indicators of an unwanted word.
Symbols such as '%' and '$', punctuation marks such as '!' and '?', hyphens, commas and periods are all removed, because this system does not require any information other than the names of the ingredients. Another important step for the tokenizer in this system is multi-word tokenization, which is necessary because the name of an ingredient can consist of more than one word, for example "green tea", "sour cream" or "maple syrup". The system uses the word tokens as they are, without further processing such as stemming or lemmatization.
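The cleaning and multi-word steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the list `MULTI_WORD_NAMES` is hypothetical, only two-word names are merged, and lowercasing plus the '_' joiner are choices made for this sketch rather than a description of the actual implementation.

```python
import re

# Hypothetical multi-word ingredient names; a real list would come from
# the ingredient dictionary.
MULTI_WORD_NAMES = {("green", "tea"), ("sour", "cream"), ("maple", "syrup")}

def tokenize_ingredients(ocr_text, multi_word_names=MULTI_WORD_NAMES):
    """Strip symbols, digits and punctuation, then merge known two-word names."""
    # Lowercase the text and replace every run of non-letters with a space
    cleaned = re.sub(r"[^a-z]+", " ", ocr_text.lower())
    words = cleaned.split()
    tokens = []
    i = 0
    while i < len(words):
        # Merge two adjacent words when they form a known ingredient name
        if i + 1 < len(words) and (words[i], words[i + 1]) in multi_word_names:
            tokens.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

tokens = tokenize_ingredients("INGREDIENTS: Green Tea (2%), sugar, Sour-Cream!")
# tokens == ["ingredients", "green_tea", "sugar", "sour_cream"]
```

Because hyphens are treated like any other removed character, "Sour-Cream" is first split into two words and then merged back into a single token.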
These dictionaries play a very important role in this system because they drive the post-processing of this approach. Most of the packaged food ingredient names in the proprietary dictionary were collected from the Internet, and the dictionary can be updated at any time. The proprietary dictionary is used as a guide to select the right ingredients from the candidate ingredients obtained by the OCR process. In this study, the system takes the word tokens from the previous process, tokenization, and then extracts candidate ingredients using an English dictionary to ensure that the retrieved word tokens are meaningful words with the correct spelling.
The candidates are then processed using the proprietary dictionary as a guide to ensure that only the correct ingredients are selected and unnecessary words are dropped from the ingredient list queue.
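The two-dictionary selection can be sketched as below. The source does not state how tokens are matched against the English dictionary, so the use of `difflib.get_close_matches` for spelling correction is an assumption of this sketch, as are the tiny word lists, the function name `extract_ingredients` and the `cutoff` value.

```python
from difflib import get_close_matches

# Hypothetical, tiny stand-ins for the two dictionaries
ENGLISH_WORDS = ["sugar", "salt", "water", "milk", "contains", "wheat"]
INGREDIENT_DICTIONARY = {"sugar", "salt", "milk", "wheat", "water"}

def extract_ingredients(tokens,
                        english_words=ENGLISH_WORDS,
                        ingredient_dictionary=INGREDIENT_DICTIONARY,
                        cutoff=0.75):
    """Correct tokens with the English list, then keep only known ingredients."""
    candidates = []
    for token in tokens:
        # Step 1: map each token to its closest English word, if any is close enough
        match = get_close_matches(token, english_words, n=1, cutoff=cutoff)
        if match:
            candidates.append(match[0])
    # Step 2: keep only candidates registered in the proprietary dictionary
    return [word for word in candidates if word in ingredient_dictionary]

result = extract_ingredients(["contains", "sugor", "xq", "wheat", "water"])
# result == ["sugar", "wheat", "water"]
```

In this example "contains" is a valid English word but not a registered ingredient, so it is dropped in the second step, while the noise token "xq" is already discarded in the first.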
Search Harmful Ingredients for The User using Database
The database is built primarily to record user data and diseases. It is also used for string matching between the ingredients obtained from the image and the harmful ingredients associated with the patient's diseases. The users table contains several columns that are populated with user information, including the user's food intolerances.
The 'disease_ingredients' table contains a list of the disease names and the ingredients that are harmful for each disease. Once the list of ingredients of the product is obtained in text format, the server first looks up the user's diseases in the users table and retrieves the ingredients that are harmful to the user. Then the system compares each ingredient in the product ingredient list with the harmful ingredients.
If one or more of the ingredients obtained correspond to the ingredients of the chosen disease, the user is informed that the product is harmful to the user.
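The lookup and comparison can be sketched with an in-memory SQLite database. The table names follow the text, but the column names, the sample rows, the single disease per user and the function `find_harmful_ingredients` are all assumptions made for illustration; the source does not specify the schema or the database engine.

```python
import sqlite3

# In-memory database with hypothetical columns and illustrative sample rows
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, disease TEXT);
    CREATE TABLE disease_ingredients (disease TEXT PRIMARY KEY, ingredients TEXT);
    INSERT INTO users VALUES (1, 'Alice', 'lactose intolerance');
    INSERT INTO disease_ingredients VALUES ('lactose intolerance', 'milk,cream,butter');
""")

def find_harmful_ingredients(conn, user_id, product_ingredients):
    """Return the product ingredients that are harmful for the given user."""
    row = conn.execute(
        "SELECT d.ingredients FROM users u "
        "JOIN disease_ingredients d ON d.disease = u.disease "
        "WHERE u.id = ?",
        (user_id,),
    ).fetchone()
    if row is None:
        return []
    # The harmful ingredients are stored as one comma-separated string
    harmful = {name.strip().lower() for name in row[0].split(",")}
    return [item for item in product_ingredients if item.lower() in harmful]

found = find_harmful_ingredients(conn, 1, ["sugar", "milk", "cocoa"])
# found == ["milk"]
```

A non-empty result means the product would be reported as harmful to the user; an empty list means no registered harmful ingredient was matched.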
Notify Result to the user
Overall Explanation
- Pre-Processing of OCR
- Post-Processing of OCR
Then the proprietary dictionary is used to select only the ingredients that are registered in it. Assuming that the system obtains the correct ingredients from the image, the list of ingredients is sent to the database for string matching to find any ingredients that are harmful according to the user's data. The raw OCR result is not suitable for further processing because it includes many unnecessary elements such as symbols and numbers.
The tokenization of the OCR result is implemented with one of the NLTK modules, the nltk.tokenize.API module. Taking the result of the OCR, the system uses the tokenizer to remove unwanted characters and numbers and to convert uppercase letters to lowercase. In a tokenized multi-word result, the words are joined by '_' to indicate that there are spaces between them.
From the tokenizer result, the tokens are first matched against the English dictionary to extract correctly spelled ingredient candidates. The candidates are then used as input and processed through the proprietary dictionary of ingredients to extract the right ingredients. As a side note, since 'ingredients' is not recognized by the proprietary dictionary of ingredients, it is ignored and removed from the queue of actual ingredient results.
The system then proceeds to collect the proprietary dictionary result to be added to the list of correct ingredients before progressing further.
Database of the System
The ingredients for each disease are placed in a single row and separated by commas to mark the boundary between one ingredient and the next.
System Prototype using Android Studio
Wikipedia, (2019, December) Optical character recognition, Retrieved from https://en.wikipedia.org/wiki/Optical_character_recognition.
(2019, April) OCR (Optical Character Recognition), Retrieved from https://searchcontentmanagement.techtarget.com/definition/OCR-optical-character-recognition.
Wikipedia, (2019, October) Comparison of OCR software, Retrieved from https://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software.
(2018, Feb) Build your own optical character recognition, Retrieved from https://medium.com/@balaajip/optical-character-recognition-99aba2dad314.
1997). Expert opinion), Retrieved from https://www.mayoclinic.org/diseases-conditions/food-allergy/expert-answers/food-allergy/faq-20058538.
Retrieved from https://www.healthline.com/health/allergies/food-allergy-sensitivity-difference#food-allergies
26. ww.healthline.com/health/lupus/diet-tips.
(2018, May) Foods That Can Trigger Asthma Attacks, Retrieved from https://www.webmd.com/asthma/guide/food-allergies-and-asthma#1.
Food Allergy Research and Education, (2019, June) About Anaphylaxis, Retrieved from https://www.foodallergy.org/life-with-food-allergies/anaphylaxis/about-anaphylaxis.
Techopedia, (2019) Tokenization, Retrieved from https://www.
(2013, January) The Art of Tokenization, Retrieved from https://www.ibm.com/developerworks/community/blogs/nlp/entry/tokenization?lang=en.
(2013) Multi-Word Tokenization for Natural Language Processing, University of Stuttgart, pp 113-117.
Conclusion