


Component-Based Evaluation for Question Answering

8.5 Conclusion

As we have discussed in this chapter, the development of common interchange formats for language processing modules in the JAVELIN project (Lin et al. 2005; Mitamura et al. 2007; Shima et al. 2006) led to the use of common schemas in the NTCIR IR4QA embedded task (Mitamura et al. 2008), which we believe is the first example of a common QA evaluation using a shared data schema and automatic combination runs. Although it is expensive to use human evaluators to judge all possible combinations of systems, automatic metrics (such as ROUGE) can be used to find novel combinations that appear to perform as well as or better than the state of the art; this subset of novel systems can then be evaluated by humans. In the OAQA project (which followed JAVELIN at CMU), development participants began to create gold-standard datasets that include expected outputs for all stages in the QA pipeline, not just the final answer (Garduño et al. 2013). This allowed precise automatic evaluation and more effective error analysis, leading to the development of high-performance QA incorporating hundreds of different strategies in real time (IBM Watson) (Ferrucci et al. 2010). The OAQA approach was also used to evaluate and optimize several multi-strategy QA systems, some of which achieved state-of-the-art performance on the TREC Genomics datasets (2006 and 2007) (Yang et al. 2013) and BioASQ tasks (2015–2018) (Chandu et al. 2017; Yang et al. 2015, 2016).
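The screening idea above can be illustrated with a minimal sketch: score each candidate combination run with a cheap automatic metric (here a simple ROUGE-1 recall, not the official ROUGE toolkit) and forward only the top-scoring runs for expensive human judgment. The run names and texts are invented for illustration.

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams covered by the candidate (ROUGE-1 recall)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    total = sum(ref.values())
    return overlap / total if total else 0.0

def screen_combinations(runs, reference, top_k=2):
    """Rank combination runs (name -> output text) by ROUGE-1 recall
    against a gold reference; only the top_k go to human evaluators."""
    scored = sorted(runs.items(),
                    key=lambda kv: rouge1_recall(kv[1], reference),
                    reverse=True)
    return scored[:top_k]
```

In practice the reference would be a set of gold nuggets per question and the score averaged over the question set; the principle, cheap automatic triage before human evaluation, is the same.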

Although academic datasets in the QA field have recently focused on specific parts of the QA task, such as answer sentence and answer span selection (Rajpurkar et al. 2016, 2018), which can be solved by a single deep learning or neural architecture, systems that achieve state-of-the-art performance on messy, real-world datasets (such as Jeopardy! or Yahoo! Answers) must employ a multi-strategy approach. For example, neural QA components were combined with classic information retrieval algorithms (e.g., BM25) to achieve the best automatic QA system performance on the TREC LiveQA task (2015–2017) (Wang and Nyberg 2015a, b, 2016, 2017), which was based on a Yahoo! Answers community QA dataset. It is our expectation that a path to more general QA performance will be found by upholding the tradition of multi-strategy, multi-component evaluations pioneered by NTCIR. In our most recent work, we have tried to extend the state of the art in neural QA by performing comparative evaluations of different neural QA architectures across QA datasets (Wadhwa et al. 2018a), and we expect that future work will also focus on how to curate the most challenging (and realistic) datasets for real-world QA tasks (Wadhwa et al. 2018c).
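A minimal sketch of the kind of lexical–neural combination described above: a textbook Okapi BM25 score over candidate answers is interpolated with a neural relevance score. This is an illustration of the general technique, not the actual TREC LiveQA system; the neural scores here are stand-in values, and the weighting scheme is an assumption.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 scores for one query over a small candidate-answer pool."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    df = Counter(w for d in tokenized for w in set(d))  # document frequencies
    n = len(tokenized)
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = math.log(1 + (n - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (
                tf[w] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def hybrid_rank(query, docs, neural_scores, alpha=0.5):
    """Rank candidates by interpolating normalized BM25 with neural scores;
    alpha weights the lexical side."""
    bm25 = bm25_scores(query, docs)
    hi = max(bm25) or 1.0  # avoid division by zero when no term overlaps
    fused = [alpha * (s / hi) + (1 - alpha) * ns
             for s, ns in zip(bm25, neural_scores)]
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
```

The design point is that the two signals fail differently: BM25 rewards exact term overlap, while a neural scorer can credit paraphrases, so their interpolation is typically more robust than either alone.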

Acknowledgements We would like to thank the editors and the past organizers and participants of the NTCIR ACLIA QA tasks. Special thanks go to Hideki Shima, who worked on CMU’s JAVELIN QA system to develop the component-based evaluation and helped to organize the ACLIA tasks.

We also thank the other students and staff who contributed to the JAVELIN, ACLIA, OAQA, and LiveQA projects.

124 T. Mitamura and E. Nyberg

References

Andrews C (2011) IBM announces eight universities contributing to the Watson computing system’s development. PR Newswire. https://tinyurl.com/yxsmx8q5

Asahara M, Matsumoto Y (2003) Japanese named entity extraction with redundant morphological analysis. In: Proceedings of the 2003 human language technology conference of the North American chapter of the association for computational linguistics, pp 8–15. https://www.aclweb.org/anthology/N03-1002

Chandu KR, Naik A, Chandrasekar A, Yang Z, Gupta N, Nyberg E (2017) Tackling biomedical text summarization: OAQA at BioASQ 5B. In: Cohen KB, Demner-Fushman D, Ananiadou S, Tsujii J (eds) BioNLP 2017. Association for Computational Linguistics, Vancouver, Canada, pp 58–66. https://doi.org/10.18653/v1/W17-2307

Ferrucci D, Lally A, Verspoor K, Nyberg E (2009a) Unstructured information management architecture (UIMA) version 1.0. OASIS Standard. OASIS

Ferrucci D, Nyberg E, Allan J, Barker K, Brown E, Chu-Carroll J, Ciccolo A, Duboue P, Fan J, Gondek D et al (2009b) Towards the open advancement of question answering systems. IBM Res Rep, IBM, Armonk, NY

Ferrucci D, Brown E, Chu-Carroll J, Fan J, Gondek D, Kalyanpur AA, Lally A, Murdock JW, Nyberg E, Prager J et al (2010) Building Watson: an overview of the DeepQA project. AI Magazine 31(3):59–79

Garduño E, Yang Z, Maiberg A, McCormack C, Fang Y, Nyberg E (2013) CSE framework: a UIMA-based distributed system for configuration space exploration. In: Klügl P, Eckart de Castilho R, Tomanek K (eds) Proceedings of the 3rd workshop on unstructured information management architecture, vol 1038. CEUR-WS.org, CEUR Workshop, Darmstadt, Germany, pp 14–17. http://ceur-ws.org/Vol-1038/paper_10.pdf

Lin F, Shima H, Wang M, Mitamura T (2005) CMU JAVELIN system for NTCIR5 CLQA1. In: Kando N (ed) Proceedings of the fifth NTCIR workshop meeting on evaluation of information access technologies: information retrieval, question answering and cross-lingual information access, NTCIR-5. National Center of Sciences, National Institute of Informatics (NII), Tokyo, Japan. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/CLQA/NTCIR5-CLQA1-LinF.pdf

Lin J, Demner-Fushman D (2006) Will pyramids built of nuggets topple over? In: Proceedings of the human language technology conference of the NAACL, main conference. Association for Computational Linguistics, New York, USA, pp 383–390. https://www.aclweb.org/anthology/N06-1049

Mitamura T, Lin F, Shima H, Wang M, Ko J, Betteridge J, Bilotti MW, Schlaikjer AH, Nyberg E (2007) JAVELIN III: cross-lingual question answering from Japanese and Chinese documents. In: Kando N (ed) Proceedings of the 6th NTCIR workshop meeting on evaluation of information access technologies: information retrieval, question answering and cross-lingual information access, NTCIR-6. National Center of Sciences, National Institute of Informatics (NII), Tokyo, Japan. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings6/NTCIR/51.pdf

Mitamura T, Nyberg E, Shima H, Kato T, Mori T, Lin C, Song R, Lin C, Sakai T, Ji D, Kando N (2008) Overview of the NTCIR-7 ACLIA tasks: advanced cross-lingual information access. In: Kando N (ed) Proceedings of the 7th NTCIR workshop meeting on evaluation of information access technologies: information retrieval, question answering and cross-lingual information access, NTCIR-7. National Center of Sciences, National Institute of Informatics (NII), Tokyo, Japan. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings7/pdf/NTCIR7/C1/CCLQA/01-NTCIR7-OV-CCLQA-MitamuraT.pdf

Rajpurkar P, Zhang J, Lopyrev K, Liang P (2016) SQuAD: 100,000+ questions for machine comprehension of text. arXiv:1606.05250


Rajpurkar P, Jia R, Liang P (2018) Know what you don’t know: unanswerable questions for SQuAD. arXiv:1806.03822

Shima H, Wang M, Lin F, Mitamura T (2006) Modular approach to error analysis and evaluation for multilingual question answering. In: Calzolari N, Choukri K, Gangemi A, Maegaard B, Mariani J, Odijk J, Tapias D (eds) Proceedings of the fifth international conference on language resources and evaluation, LREC 2006. European Language Resources Association (ELRA), Genoa, Italy, pp 1143–1146. http://www.lrec-conf.org/proceedings/lrec2006/pdf/782_pdf.pdf

Voorhees EM (2003) Overview of the TREC 2003 question answering track. In: Voorhees EM, Buckland LP (eds) Proceedings of the twelfth text retrieval conference, TREC 2003, vol Special Publication 500-255. National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA, pp 54–68. http://trec.nist.gov/pubs/trec12/papers/QA.OVERVIEW.pdf

Voorhees EM (2004) Overview of the TREC 2004 question answering track. In: Voorhees EM, Buckland LP (eds) Proceedings of the thirteenth text retrieval conference, TREC 2004, vol Special Publication 500-261. National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA. http://trec.nist.gov/pubs/trec13/papers/QA.OVERVIEW.pdf

Wadhwa S, Chandu KR, Nyberg E (2018a) Comparative analysis of neural QA models on SQuAD. http://arxiv.org/abs/1806.06972

Wadhwa S, Chandu KR, Nyberg E (2018b) Comparative analysis of neural QA models on SQuAD. In: Choi E, Seo M, Chen D, Jia R, Berant J (eds) Proceedings of the workshop on machine reading for question answering @ ACL 2018. Association for Computational Linguistics, Melbourne, Australia, pp 89–97. https://doi.org/10.18653/v1/W18-2610

Wadhwa S, Embar V, Grabmair M, Nyberg E (2018c) Towards inference-oriented reading comprehension: parallelQA. http://arxiv.org/abs/1805.03830

Wang D, Nyberg E (2015a) CMU OAQA at TREC 2015 LiveQA: discovering the right answer with clues. In: Voorhees EM, Ellis A (eds) Proceedings of the twenty-fourth text retrieval conference, TREC 2015, vol Special Publication 500-319. National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA. http://trec.nist.gov/pubs/trec24/papers/oaqa-QA.pdf

Wang D, Nyberg E (2015b) A long short-term memory model for answer sentence selection in question answering. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing of the Asian federation of natural language processing, ACL 2015, July 26–31, 2015, Beijing, China, Volume 2: Short Papers. The Association for Computational Linguistics, pp 707–712. http://aclweb.org/anthology/P/P15/P15-2116.pdf

Wang D, Nyberg E (2016) CMU OAQA at TREC 2016 LiveQA: an attentional neural encoder-decoder approach for answer ranking. In: Voorhees EM, Ellis A (eds) Proceedings of the twenty-fifth text retrieval conference, TREC 2016, Gaithersburg, Maryland, USA, November 15–18, 2016, vol Special Publication 500-321. National Institute of Standards and Technology (NIST). http://trec.nist.gov/pubs/trec25/papers/CMU-OAQA-QA.pdf

Wang D, Nyberg E (2017) CMU OAQA at TREC 2017 LiveQA: a neural dual entailment approach for question paraphrase identification. In: Voorhees EM, Ellis A (eds) Proceedings of the twenty-sixth text retrieval conference, TREC 2017, Gaithersburg, Maryland, USA, November 15–17, 2017, vol Special Publication 500-324. National Institute of Standards and Technology (NIST). https://trec.nist.gov/pubs/trec26/papers/CMU-OAQA-QA.pdf

Yang Z (2017) Analytics meta learning. PhD thesis, Carnegie Mellon University, Pittsburgh, PA

Yang Z, Garduño E, Fang Y, Maiberg A, McCormack C, Nyberg E (2013) Building optimal information systems automatically: configuration space exploration for biomedical information systems. In: He Q, Iyengar A, Nejdl W, Pei J, Rastogi R (eds) 22nd ACM international conference on information and knowledge management, CIKM’13, San Francisco, CA, USA, October 27 – November 1, 2013. ACM, pp 1421–1430. https://doi.org/10.1145/2505515.2505692


Yang Z, Gupta N, Sun X, Xu D, Zhang C, Nyberg E (2015) Learning to answer biomedical factoid & list questions: OAQA at BioASQ 3B. In: Cappellato L, Ferro N, Jones GJF, SanJuan E (eds) Working notes of CLEF 2015 conference and labs of the evaluation forum, Toulouse, France, September 8–11, 2015. CEUR-WS.org, CEUR Workshop Proceedings, vol 1391. http://ceur-ws.org/Vol-1391/114-CR.pdf

Yang Z, Zhou Y, Nyberg E (2016) Learning to answer biomedical questions: OAQA at BioASQ 4B. In: Proceedings of the fourth BioASQ workshop. Association for Computational Linguistics, Berlin, Germany, pp 23–37. https://doi.org/10.18653/v1/W16-3104

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
