Why Do Masked Neural Language Models Still Need Semantic Knowledge in Question Answering?


Pre-trained language models have been widely used to solve various natural language processing tasks. Recently, several studies have attempted to estimate how much knowledge is captured in masked neural language models (MNLMs). However, a recent study reveals that the models do not accurately grasp semantic knowledge even while exhibiting superhuman performance.

In this work, we empirically verify that questions requiring semantic knowledge are still challenging for masked neural language models to solve in question answering. We therefore suggest a possible solution that injects semantic knowledge from external repositories into masked neural language models. Semantic knowledge is known to be an important factor for understanding and inferring natural language in such tasks.

As studied in [23], MNLMs rely on pseudo-statistical patterns rather than an understanding of semantic knowledge, even though they show outstanding performance in various tasks. Questions that require semantic knowledge thus remain difficult for MNLMs to solve, despite their outstanding performance in QA. For our second hypothesis, we suggest a way to use semantic knowledge from external knowledge sources in MNLMs.

Text Representation

Transformer

In contrast, the Transformer is based solely on attention mechanisms, which avoids the limitations of RNNs and CNNs. The input sequence of words is mapped to a sequence of vectors using learned embeddings. Because the Transformer relies solely on attention, without recurrence or convolution, positional information must be injected explicitly.

The positional encodings and the input embeddings can simply be added because they have the same dimension. In [1], sine and cosine functions are used, although there is a wide choice of positional encodings [30]. Specifically, word embeddings are linearly projected onto query, key, and value vectors of size d/h, where h is the number of attention heads and d is the dimension of the word embeddings.
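
The sinusoidal encodings mentioned above can be written down in a few lines. The following is a minimal sketch (using NumPy, which is a tooling choice of this example rather than something prescribed by the paper) of how such encodings are computed and added to word embeddings of the same dimension:

    import numpy as np

    def sinusoidal_positional_encoding(max_len, d_model):
        """Sinusoidal positional encodings in the style of Vaswani et al. [1]."""
        positions = np.arange(max_len)[:, None]                # (max_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model / 2)
        angles = positions / np.power(10000.0, dims / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
        pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
        return pe

    # Same dimension as the embeddings, so the two are simply added.
    embeddings = np.random.randn(10, 512)                      # 10 tokens, d = 512
    inputs = embeddings + sinusoidal_positional_encoding(10, 512)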

Query, key, and value weight matrices are trained to transform the input embeddings into query, key, and value vectors. To calculate the attention weights, the dot product of the query and key vectors is scaled (divided by the square root of d_k) and a softmax function is applied. Following multi-head attention, feed-forward networks are applied position-wise (the same network is applied identically at every position).
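
As a concrete illustration of the scaled dot-product attention just described, the sketch below computes softmax(QK^T / sqrt(d_k))V for a single head, with randomly initialized matrices standing in for the learned query, key, and value projections:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)     # scaled dot products of queries and keys
        weights = softmax(scores, axis=-1)  # attention weights; each row sums to 1
        return weights @ V                  # weighted sum of value vectors

    # Random stand-ins for the learned query/key/value weight matrices.
    seq_len, d_model, d_k = 5, 512, 64
    X = np.random.randn(seq_len, d_model)
    W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
    output = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)  # shape (5, 64)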

The input and output vectors of each sub-layer are summed through a residual connection and then layer-normalized. Linear layers and a softmax function are used to convert the decoder output into predicted probabilities of the next word.
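
The residual-plus-normalization step can be sketched in the same style; the layer normalization below is simplified (no learned gain and bias), so it is an illustration rather than the exact sub-layer used in [1]:

    import numpy as np

    def layer_norm(x, eps=1e-6):
        # Simplified layer normalization without learned gain/bias parameters.
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    def add_and_norm(x, sublayer):
        """Residual connection followed by layer normalization."""
        return layer_norm(x + sublayer(x))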

Figure 1: The overall architecture of Transformer. Reprinted from Vaswani et al. NIPS 2017:5998-6008 [1].

Masked Neural Language Models

Detailed information, such as the number of parameters and the size of the pre-training data, is described in Table 1. The input is represented as a pair of text segments A_1, ..., A_N and B_1, ..., B_M, where each word token is mapped to its corresponding WordPiece embedding [37]. BERT is pre-trained with approximately 16 GB of text corpora, composed of English Wikipedia and BookCorpus [38].
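
As an illustration of this pair-of-segments input format, the sketch below uses the Hugging Face transformers library (a tooling choice of this example, not something stated in the paper) to show the WordPiece tokens and segment ids that BERT receives:

    # Requires the Hugging Face `transformers` package.
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # A pair of text segments A and B, e.g. a question and a context passage.
    encoded = tokenizer("Why can birds fly?", "Birds have wings and hollow bones.")
    print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
    # ['[CLS]', 'why', 'can', 'birds', 'fly', '?', '[SEP]', 'birds', 'have', ...]
    print(encoded["token_type_ids"])   # 0 for segment A tokens, 1 for segment B tokens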

RoBERTa is trained on single sequences of tokens instead of using the next sentence prediction objective. Its pre-training data is about 160 GB, composed of the CommonCrawl News dataset [40], the OpenWebText corpus [41], and the STORIES corpus [42] in addition to the BERT pre-training data. ALBERT, in turn, uses a reduced word embedding dimension and shares parameters across layers to reduce complexity.

In addition, ALBERT is pre-trained to predict sentence order instead of using the next sentence prediction objective. ALBERT1 uses 16 GB of data, the same as BERT, while ALBERT2 uses 160 GB of data, the same as RoBERTa. There have been several studies investigating whether MNLMs can recall the semantic knowledge contained in their training data.
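
A rough back-of-the-envelope sketch of why the factorized embedding helps is shown below; the sizes are illustrative, not the figures from Table 1:

    # Illustrative sizes: vocabulary V, hidden size H, factorized embedding size E.
    V, H, E = 30000, 4096, 128

    full_embedding = V * H         # BERT-style V x H embedding table
    factorized = V * E + E * H     # ALBERT-style V x E table plus E x H projection

    print(f"V x H         = {full_embedding:,}")   # 122,880,000 parameters
    print(f"V x E + E x H = {factorized:,}")       # 4,364,288 parameters
    # Cross-layer parameter sharing further reuses one set of block weights for all layers.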

Specifically, [22] introduces language model analysis, which is a way to test MNLMs by querying semantic knowledge with a masked word. For example, 'birds can fly' can be converted to 'birds can [MASK]' to test whether the model remembers that birds can fly. In particular, the models fail to capture the meaning of negation, which people can easily recognize in many cases.
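
A minimal probe in this spirit can be run with the transformers fill-mask pipeline; the exact query format of [22] differs, so this is only an approximation of the setup:

    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    # "Birds can fly" becomes a cloze query with the final word masked out.
    for prediction in fill_mask("Birds can [MASK]."):
        print(prediction["token_str"], round(prediction["score"], 3))

    # Negated variant: a model that ignores negation tends to propose similar fillers.
    for prediction in fill_mask("Birds cannot [MASK]."):
        print(prediction["token_str"], round(prediction["score"], 3))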

Therefore, we hypothesize that questions requiring semantic knowledge will still be difficult for MNLMs to solve despite their superior performance in QA tasks.

Table 1: Structures of MNLMs.

Question Answering

ConceptNet

Question-context similarity is measured as the cosine similarity between TF-IDF weighted unigram vectors of the question and the context (an example context: "In 1698, Thomas Savery patented a steam pump that used steam in direct contact with the water being pumped."). Although MNLMs have shown outstanding performance in various tasks, their semantic knowledge is incomplete [23].

Extending this to QA, we try to find out whether the models understand semantic knowledge or not. We assume that a match between question and context is related to the amount of semantic knowledge required. For example, if a question and a context are similar, or contain many overlapping words, they are less likely to require semantic knowledge.
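
The similarity measure described above can be computed as follows (a sketch using scikit-learn; the second question is an invented low-overlap example, not one taken from the dataset):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def question_context_similarity(question, context):
        vectorizer = TfidfVectorizer(ngram_range=(1, 1))   # TF-IDF weighted unigrams
        vectors = vectorizer.fit_transform([question, context])
        return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

    context = ("In 1698, Thomas Savery patented a steam pump that used steam "
               "in direct contact with the water being pumped.")
    print(question_context_similarity("What did Thomas Savery patent in 1698?", context))
    print(question_context_similarity("Which device raises water using vapour?", context))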

On the other hand, if a question and its context are not similar, they are more likely to require semantic knowledge. We observe that the first question closely resembles the context, so the clues to the answer appear explicitly in the context. The second question, in contrast, does not closely match the context and therefore requires additional clues that are not explicitly given.

To solve the question, we must conclude, for example, that Russia is a country, and that “spreading the disease” can be paraphrased into “receiving the disease.” Therefore, low similarity questions are more likely to require implicit cues, such as semantic knowledge, which affects the difficulty of the questions. In addition, we try a different but more detailed analysis method, namely manually classifying the questions into six types based on the reasoning skills required to solve the questions.

We then analyze which types of questions are difficult to solve in order to test whether questions requiring semantic knowledge are challenging or not. For the analysis, we use the largest model of each MNLM type trained on the QA dataset SQuAD 2.0 [47]: BERT-large, ALBERT1-xlarge, ALBERT2-xlarge, and RoBERTa-large.
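
For reference, such fine-tuned QA models can be queried as shown below; the checkpoint name is a publicly available SQuAD 2.0 model chosen only for illustration, not necessarily one of the checkpoints evaluated in this work:

    from transformers import pipeline

    # A RoBERTa model fine-tuned on SQuAD 2.0 (example checkpoint from the model hub).
    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

    result = qa(question="What did Thomas Savery patent in 1698?",
                context="In 1698, Thomas Savery patented a steam pump that used steam "
                        "in direct contact with the water being pumped.")
    print(result["answer"], result["score"])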

Table 2: Examples of easy and hard questions on the has answer questions.

Correlation between Similarity and Difficulty of Questions

Categorizing Questions based on Reasoning Types

A question can be assigned multiple category labels, except that the with-semantic-variation and without-semantic-variation types are mutually exclusive.

Therefore, questions that require logical knowledge, which is a type of semantic knowledge, are still the most difficult for the models to solve. In the examples, red words indicate the relation or relative-pronoun pattern of a knowledge triple, and turquoise words indicate the object term of the triple.

Therefore, we suggest a possible solution that injects semantic knowledge, especially commonsense knowledge, from external repositories into the models. First, we suggest a manual knowledge integration method, where a person manually integrates external knowledge into the models.
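
A minimal sketch of such manual integration is shown below. The two relation templates are illustrative stand-ins for the 37 templates in Table 6, the example context sentence is invented, and prepending the verbalized triple to the context is an assumed integration strategy rather than the paper's exact procedure:

    # Illustrative templates; the paper's Table 6 defines one template per ConceptNet relation.
    TEMPLATES = {
        "CapableOf": "{subject} can {object}.",
        "IsA": "{subject} is a {object}.",
    }

    def verbalize(subject, relation, obj):
        return TEMPLATES[relation].format(subject=subject, object=obj)

    def integrate(context, triples):
        """Prepend verbalized knowledge triples so the QA model can read them as text."""
        knowledge = " ".join(verbalize(s, r, o) for s, r, o in triples)
        return knowledge + " " + context

    # Echoing the earlier example: the model needs to know that Russia is a country.
    print(integrate("The disease spread quickly across Russia.",
                    [("Russia", "IsA", "country")]))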

Table 3: Examples and descriptions for each type of the has answer questions in SQuAD 2.0.

Manual Knowledge Integration

Furthermore, we find that 244 out of 684 questions of the semantic variation type have matching data in ConceptNet. We see that some of them are correctly predicted by all models after applying our method. This implies that the use of external knowledge can be a possible direction for overcoming the limitations of existing MNLMs in capturing semantic knowledge, especially commonsense knowledge.
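
Whether a question has matching data in ConceptNet can be checked, for instance, through the public ConceptNet 5 REST API; this endpoint is a convenience for illustration and not necessarily how the lookup was performed in this work:

    import requests

    def conceptnet_edges(term, limit=5):
        """Fetch a few (relation, start, end) triples for an English term from ConceptNet."""
        url = f"http://api.conceptnet.io/c/en/{term.lower().replace(' ', '_')}"
        response = requests.get(url, params={"limit": limit}, timeout=10)
        response.raise_for_status()
        for edge in response.json().get("edges", []):
            yield edge["rel"]["label"], edge["start"]["label"], edge["end"]["label"]

    for relation, start, end in conceptnet_edges("bird"):
        print(f"{start} --{relation}--> {end}")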

Table 6: The templates for 37 relations in ConceptNet.

Automatic Knowledge Integration

Figure 3: The overall architecture of the automatic knowledge integration model.

References

Renshaw, "MCTest: A challenge dataset for the open-domain machine comprehension of text," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2013.
Linzen, "Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
Singh, "Beyond accuracy: Behavioral testing of NLP models with CheckList," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
Yan, "Axiomatic attribution for deep networks," in Proceedings of the 34th International Conference on Machine Learning, 2017.
Tsvetkov, "Explaining black box predictions and unveiling data artifacts through influence functions," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
Smith, "Linguistic knowledge and transferability of contextual representations," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
Pavlick, "BERT rediscovers the classical NLP pipeline," in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2019.
Manning, "A structural probe for finding syntax in word representations," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
Miller, "Language models as knowledge bases?" in Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing, 2019.
Schütze, "Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly," in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2020.
Havasi, "ConceptNet 5.5: An open multilingual graph of general knowledge," in Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
Weston, "Key-value memory networks for directly reading documents," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.
Berant, "CommonsenseQA: A question answering challenge targeting commonsense knowledge," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
Liang, "Know what you don't know: Unanswerable questions for SQuAD," in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2018.
Liang, "SQuAD: 100,000+ questions for machine comprehension of text," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016.

Figures and Tables

Figure 1: The overall architecture of Transformer. Reprinted from Vaswani et al. NIPS 2017:5998-6008 [1].
Table 1: Structures of MNLMs.
Table 2: Examples of easy and hard questions on the has answer questions.
Figure 2: Results on the correlation between similarity and difficulty of questions.
