
International Journal of Recent Advances in Engineering & Technology (IJRAET)

ISSN (Online): 2347 - 2812, Volume-1, Issue - 2, 2013 62

A WHITE PAPER ON CHALLENGES AND ACCOMPLISHMENTS IN THE FIELD OF NATURAL LANGUAGE PROCESSING

Avneesh Kaur G. Bhatia Government Engineering College Email : [email protected]

Abstract- This paper provides a gist of the natural language processing field and its plethora of challenges and accomplishments. The paper aims at summarizing the scope that remains to be explored and the various beneficial applications of such a system.

Index Terms—Idiom recognition, NLP, Parsing, Spam Detection, Part-of-speech tagging.

I. INTRODUCTION

Natural language processing falls under the rubric of Artificial Intelligence, which is a sub-field of computer science.

Natural language, as the term suggests, is the language spoken or written by humans in their daily interactions with each other. It is far more complex and sophisticated than the programming language commands used by humans to interact with computers for programming and calculations.

As computers become increasingly vital to mankind for the acquisition, storage, analysis, transmission and transformation of information, it is all the more desirable that they be equipped with the ability to understand and generate information expressed in natural languages. The dominance of natural language as a means of communication in all spheres of human interaction drives the need to utilize it as the medium of human-computer interaction as well.

The principal aim of a natural language processing system is to enable its user to interact with and address the computer as he or she would a human colleague. This goal is not easy to reach. "Understanding" language means, among other things, knowing what concepts a word or phrase stands for and knowing how to link those concepts together in a meaningful way. It is ironic that natural language, the symbol system that is easiest for humans to learn and use, is the hardest for a computer to master.

Long after machines have proven capable of inverting large matrices with speed and grace, they still fail to master the basics of our spoken and written languages.

II. CHALLENGES

Building a system that can generate responses in keeping with human conversation is a stupendous task. The challenges arise from the highly ambiguous nature of natural languages. The major challenges of any natural language processing system, whether text based or speech recognition based, can be categorized as follows:

A. Recognition and deciphering, followed by adaptability:

The first task that the system should be able to successfully accomplish is that of recognizing the primary language of communication. It is necessary that the system recognizes and responds in the language that the user has used.

Adaptability is required when one system communicates with multiple users who are, in turn, using different languages. The system must be capable of responding to each individual in the language in which that individual began the conversation, and of switching between languages at a reasonable speed. [1]
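
As an illustration of the recognition step, the following minimal sketch guesses the user's language by counting overlaps with small stopword lists and then picks a reply in that language. The word lists, the three languages and the function names `identify_language` and `respond` are assumptions made for this example; a practical system would need far richer language profiles.

```python
# Tiny illustrative stopword lists (assumption: three languages, a handful
# of words each; real language profiles are far larger).
STOPWORDS = {
    "english": {"the", "is", "and", "of", "to", "in", "what", "you"},
    "french": {"le", "la", "est", "et", "de", "un", "que", "vous"},
    "spanish": {"el", "la", "es", "y", "de", "un", "que", "usted"},
}

# Hypothetical canned replies, one per recognised language.
REPLIES = {
    "english": "How can I help you?",
    "french": "Comment puis-je vous aider ?",
    "spanish": "¿Cómo puedo ayudarle?",
    "unknown": "Sorry, I could not recognise your language.",
}


def identify_language(text):
    """Return the language whose stopword list overlaps most with the text."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    # No overlap at all means this sketch cannot decide.
    return best if scores[best] > 0 else "unknown"


def respond(text):
    """Reply in the language the user initiated the conversation in."""
    return REPLIES[identify_language(text)]


print(identify_language("le chat est sur la table"))
print(respond("what is the time"))
```

Because the language is re-identified for every message, the same function can serve several users who each write in a different language.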

B. Parsing and Semantic Ambiguity Resolution

Ambiguity in languages is pervasive. Users don’t usually use punctuation as much as they should.

As a result, more than one parse tree can exist for a single sentence, depending on the interpretation. Also, the user’s and the system’s interpretations might differ, leading to conflicts and an improper response on the system’s behalf. [2]

A practically usable, domain-independent parser of unrestricted text still remains elusive.
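
To make the notion of multiple parse trees concrete, the sketch below counts the trees a toy grammar assigns to the sentence "Flying planes can be dangerous", using a CYK-style chart. The grammar, the category names and the function `count_parses` are hypothetical and cover only this one sentence; they are not taken from any real parser.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form: (parent, left, right).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "Adj", "N"),       # "flying planes" = planes that are flying
    ("NP", "Ger", "N"),       # "flying planes" = the act of flying planes
    ("VP", "Modal", "Pred"),
    ("Pred", "Cop", "Adj"),
]

# Hypothetical lexicon; "flying" is deliberately given two categories.
LEXICON = {
    "flying": ["Adj", "Ger"],
    "planes": ["N"],
    "can": ["Modal"],
    "be": ["Cop"],
    "dangerous": ["Adj"],
}


def count_parses(words, start="S"):
    """Count the distinct parse trees rooted in `start` with a CYK chart."""
    n = len(words)
    # chart[i][j] maps a category to the number of trees over words[i:j].
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            chart[i][i + 1][category] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # every split point of words[i:j]
                for parent, left, right in BINARY_RULES:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n][start]


print(count_parses("flying planes can be dangerous".split()))  # prints 2
```

The two trees differ only in how "flying planes" is analysed, which is exactly the choice the system has to make before it can respond.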

C. Accuracy of Analysis

A fundamental point of difference between a Natural Language Processing system and conventional software is that there exists no specific set of rules, methods or algorithms that can help measure the accuracy of the NLP system. [3] The system’s interpretation and analysis capability define its accuracy. As these properties depend on a fundamental understanding of the language, which in turn is human dependent, the accuracy of such systems cannot be measured with precision.

D. Testing the system

Natural language processing systems can never be fully tested for every possible input, simply because there are innumerable ways of combining the vast number of words available in each language into sentences. The list of test cases is inexhaustible and effectively infinite.

Also, the system can never guarantee to provide all and only correct results. Differences in interpretation and semantic ambiguities are the major reasons why performance cannot be guaranteed.
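
A rough calculation shows why exhaustive testing is out of reach. The sketch below counts every word sequence up to a given length, which is an upper bound on the number of sentences since most sequences are ungrammatical. The vocabulary size of 10,000 and the maximum length of 10 words are illustrative assumptions, not figures from this paper.

```python
def count_word_sequences(vocabulary_size, max_length):
    """Number of distinct word sequences of length 1 to max_length."""
    return sum(vocabulary_size ** n for n in range(1, max_length + 1))


# Assumed figures for illustration only.
total = count_word_sequences(vocabulary_size=10_000, max_length=10)
print(f"{total:.3e} candidate inputs")
```

Even if only a tiny fraction of these sequences were valid sentences, the number of test cases would remain far beyond what any test suite could cover.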

E. Efficiency of Translation[4]

Consider a scenario where the user asks the system to translate a piece from one language to another. The system must decipher both the languages and translate the piece into the target language with its tone and semantics intact.

This is an enormous challenge, as the system has to reproduce the exact sentiment of the input piece, which is a far more complex procedure than simply replacing the words of the input language with words of similar meaning in the target language. On the system’s part, a considerable amount of intelligent judgment is required to re-create the grammar, lexicons, rule-sets, etc.

F. Flexibility

Like conventional software, an NLP system must be flexible enough to analyze and process different formats. For example, a parser primarily developed to process and summarize online news articles should also be able to process e-mails. [5]

G. Slang and idiom recognition

Not every user can be expected to converse in the pure form of a language. Users tend to use current slang and idioms in their daily conversation with other humans, and this behavior may carry over into their interaction with the system. Slang and idioms can also affect the tone and overall meaning of a sentence or of the entire conversation. This leads to changes in the semantic labeling and the parse tree, requiring the system to arrive at an accurate interpretation of the interaction. [6]
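
One simple way to approach idiom recognition is a phrase table that maps each known idiom to a literal paraphrase before parsing. The sketch below assumes such a table; the three entries and the function name `normalise_idioms` are hypothetical, and a fixed table cannot keep up with newly coined slang.

```python
import re

# Hypothetical idiom table mapping each idiom to a literal paraphrase.
IDIOM_MEANINGS = {
    "a piece of cake": "very easy",
    "under the weather": "unwell",
    "hit the sack": "go to bed",
}


def normalise_idioms(sentence):
    """Replace each known idiom in the sentence with its literal meaning."""
    result = sentence
    # Longest idioms first, so a longer phrase is not pre-empted by a
    # shorter one that happens to be contained in it.
    for idiom in sorted(IDIOM_MEANINGS, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(idiom) + r"\b", re.IGNORECASE)
        result = pattern.sub(IDIOM_MEANINGS[idiom], result)
    return result


print(normalise_idioms("The exam was a piece of cake"))
```

After this substitution the sentence can be handed to the ordinary parser, which no longer has to interpret the idiom word by word.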

H. Multi-lingual words in the same sentence or in a continuing dialog by a single user

It is highly probable that the user might incorporate words of different languages in the same sentence. As an example, the user might ask the system: what is ‘chidiya’ called in English? The system must first be able to decipher that it has to translate the Hindi word for a bird into English.

Also, sometimes the user may not be able to converse in fluent English and may therefore keep using words from his or her native language to form the query. In this case, the major challenge for the system is to decide which language to respond in, or whether to respond in both. The goal is to interact with the user in a way that eases the conversation and provides an accurate answer to the user’s query. [7]

III. RESPONSE GENERATION

The major challenge is how to respond to the user. The system must first accept the query and identify the actual question being asked. The system then searches for the answer in its database or on the web. After the answer has been found, the system must formulate a semantically appropriate response containing the answer, in the same language the user used to ask the question. [8]
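
The steps described above (accept the query, identify the question, look up the answer, formulate the response) can be sketched as a small pipeline. The in-memory knowledge base, the question pattern and the function names are assumptions for illustration only; the sketch handles English questions alone and omits the web search and language-matching steps.

```python
import re

# Hypothetical in-memory "database" mapping a topic to a fact.
KNOWLEDGE_BASE = {
    "boiling point of water": "100 degrees Celsius at sea level",
    "chemical symbol for gold": "Au",
}


def extract_topic(query):
    """Strip the question wrapper to isolate what is being asked about."""
    cleaned = query.lower().strip().rstrip("?!. ")
    match = re.match(r"(?:what|which|who) (?:is|are|was) (?:the )?(.+)", cleaned)
    return match.group(1) if match else cleaned


def generate_response(query):
    """Look the topic up and wrap the answer in a full sentence."""
    topic = extract_topic(query)
    answer = KNOWLEDGE_BASE.get(topic)
    if answer is None:
        return f"Sorry, I could not find an answer about '{topic}'."
    return f"The {topic} is {answer}."


print(generate_response("What is the boiling point of water?"))
```

Even in this toy form, the hard part is visible: any question that does not fit the single assumed pattern falls through to the fallback reply.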

I. Measurement of progress and productivity of the system

While the system is being developed, its progress must be measured and compared against a timeline. Since any language contains a stupendous number of words, grammatical constructions and possible sentences, the system has to account for each interpretation along with all the words in the language.

These factors determine the productivity of the system.

Currently, there are no software engineering parameters to measure the productivity of NLP systems. [9] So determining the time required to build the system and the correctness of its responses is as yet an unsolved aspect of NLP system development.

J. Reusability

Among the NLP systems being developed for various languages, the amount of reuse is low. This is partly due to the high degree of difficulty in integration. On the other hand, the higher the reuse rate, the higher the flexibility.

One way is to use componentization, i.e. all input handling should be separated from the core parsing and translating functionality. But this approach faces several barriers to being applied successfully, such as lack of knowledge of existing components, mismatch between component properties and project requirements, software platform incompatibility, etc. [10]

IV. PROGRESS SO FAR AND TARGETS AHEAD

K. Almost solved problems

• Spam detection: This is the most accomplished task of all. We can easily observe the effects of spam detection while checking our mail. Most of the time the system successfully sorts incoming mail into the inbox and the spam folder. Though the problem is not fully solved, more than 90% efficiency has been achieved.

• Part-of-speech tagging: Ex. Colorless green ideas sleep furiously. Here "colorless" and "green" are adjectives describing the noun "ideas", "sleep" is the verb and "furiously" is an adverb.

• Named entity recognition: Ex. Einstein met with UN officials in Princeton. Here Einstein is a person, UN (United Nations) is an organization and Princeton is the name of a place. The system must be able to classify the nouns as names of places, things, people, organizations, etc.
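
The spam detection task listed above is commonly approached with a naive Bayes classifier, and the following is a minimal sketch of that technique with add-one smoothing. It is not the method of any particular mail provider; the class name `NaiveBayesSpamFilter` and the four training messages are invented for illustration.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Multinomial naive Bayes over whitespace tokens, add-one smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def _log_score(self, tokens, label):
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        # Log prior; assumes each label has at least one training message.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for token in tokens:
            count = self.word_counts[label][token]
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        return score

    def classify(self, text):
        tokens = text.lower().split()
        return max(self.LABELS, key=lambda label: self._log_score(tokens, label))


# Invented training messages, for illustration only.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize claim now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday with the team", "ham")

print(spam_filter.classify("claim your free money"))
print(spam_filter.classify("team meeting on monday"))
```

With only four training messages the result is merely indicative; the accuracy figures quoted for real filters depend on far larger training sets.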

L. Making good progress

• Sentiment Analysis: Ex. Best food in the city. → Like! One hour waiting time. → Dislike! This requires the system to be able to distinguish emotions based on adjectives. For instance, "best" implies a positive sentiment and "waiting" indicates a negative one. The system must differentiate between the two and then respond with a like or dislike.

• Coreference Resolution: Ex. Carter told John he shouldn’t run. Here Carter is telling John that he (John) should not run. The coreference must be resolved so that "he" refers to John and not Carter. In sentences that lack punctuation or contain gerunds or other such grammatical anomalies, the system must generate a response by resolving the coreference according to the most plausible interpretation.

• Word sense disambiguation: Ex. I need new batteries for my mouse. Here the batteries are needed for a computer mouse and not the animal. The system must be able to determine the meaning of a word from the context in which it is used.

• Parsing: Ex. Flying planes can be dangerous. Here the dual interpretation leads to the creation of two different parse trees. Is the more plausible interpretation that the pilot is at risk, or that the danger is to people on the ground? Should the word "can" be analyzed as a verb or as a noun? Which of the many possible meanings of "plane" is relevant? Depending on context, "plane" could refer to, among other things, an airplane, a geometric object, or a woodworking tool. How much and what sort of context needs to be brought to bear on these questions in order to adequately disambiguate the sentence? The decision of which parse tree to use for further processing is a crucial one.

• Machine translation: Ex. Translating a Chinese sentence into a corresponding English sentence. This requires finding appropriate words in the target language and forming a sentence whose grammar and tone are faithful to the input.

• Information extraction: Ex. You are invited to our party happening at The Plaza on May 2 at 8:30 in the night. Do come! Information extraction involves making a calendar reminder from the above by marking

Place: The Plaza
Time: 8:30 p.m.
Date: May 2, 2013

The major challenge here is to extract and interpret the data correctly. For example, "in the night" indicates that the time should be suffixed with p.m.
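
The calendar-reminder example above can be approximated with a few regular expressions. The patterns and the function `extract_event` below are assumptions tailored to that one invitation and would miss most other phrasings. The year is not stated in the text, so it is passed in as a `default_year` parameter, which is also an assumption of this sketch.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")


def extract_event(text, default_year):
    """Pull place, date and time out of an invitation-style sentence."""
    event = {}
    # Place: a run of capitalised words following "at".
    place = re.search(r"\bat ([A-Z]\w*(?: [A-Z]\w*)*)", text)
    if place:
        event["place"] = place.group(1)
    # Date: a month name followed by a day number.
    date = re.search(rf"\b({MONTHS}) (\d{{1,2}})\b", text)
    if date:
        event["date"] = f"{date.group(1)} {date.group(2)}, {default_year}"
    # Time: hh:mm, optionally followed by a part-of-day phrase.
    time = re.search(
        r"\b(\d{1,2}):(\d{2})(?: in the (morning|afternoon|evening|night))?",
        text,
    )
    if time:
        hour, minute, part_of_day = time.groups()
        if part_of_day is None:
            suffix = ""
        elif part_of_day == "morning":
            suffix = " a.m."
        else:
            # Simplification: times after midnight "in the night" are
            # also labelled p.m. by this sketch.
            suffix = " p.m."
        event["time"] = f"{hour}:{minute}{suffix}"
    return event


invitation = ("You are invited to our party happening at The Plaza "
              "on May 2 at 8:30 in the night. Do come!")
print(extract_event(invitation, default_year=2013))
```

The fragility of these patterns illustrates why information extraction is listed as only making good progress rather than as solved.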

M. Still really difficult to accomplish / Far from achieved

• Intelligible Question answering: Ex. User asks: How effective is Paracetamol in reducing fever in patients with Malaria? The system must first decipher that Paracetamol is a medicine and Malaria is a disease. It should then search for a factual answer and respond in dialog form, with a brief description justifying its opinion. Ex. System: No! Please don’t use Paracetamol for Malaria. The medicine is supposed to be used for mild fevers. For malaria medicine, consult your doctor!

• Paraphrasing: Ex. "Arcelor acquired TMT yesterday." and "TMT has been taken over by Arcelor." The problem with paraphrasing is to generate the paraphrase of the given sentence in such a way that the two sentences have the same meaning.

• Summarization: Ex. HUL share is up. Infosys’s price jumped. Housing prices rose. System summary: Economy is good. The system must be able to interpret the aggregate of a situation from some headlines, articles or sentences.

• Interactive Dialog (similar to chat): Ex. User: Where is Despicable Me playing in English in Ahmedabad?

System: PVR, Acropolis mall at 10 a.m. and 9.30 p.m. Want a ticket?

This is the toughest challenge to achieve. It is a difficult task to generate a sensible and conversational response to the user’s subjective query.

V. APPLICATION OF A NATURAL LANGUAGE PROCESSING SYSTEM

The ultimate goal of a natural language processing system is to allow the user to address the computer as though addressing another person. The major applications that can derive maximum benefit from such a system are listed below:

(i) Ease of computer operation for physically challenged users.

(ii) Math puzzle solving systems.

(iii) Tweet analyzer systems.

(iv) Computers acting as planners and/or personal assistants.

(v) If gesture and voice recognition based systems can be developed, then many menial day-to-day tasks related to computer operation can be automated. For example, browser tabs could be switched or web searches conducted on command, without doing so manually.

(vi) Summarizing news and books.

(vii) More accurate forecasting.

REFERENCES

[1] Ralph Weischedel (Chairperson, BBN Systems and Technologies Corporation), Jaime Carbonell (Carnegie-Mellon University), Barbara Grosz (Harvard University), Wendy Lehnert (University of Massachusetts), Raymond Perrault (SRI International) and Robert Wilensky (University of California), "White Paper on Natural Language Processing".

[2] Jochen L. Leidner (School of Informatics, University of Edinburgh), "Current Issues in Software Engineering for Natural Language Processing".


