Digital Object Identifier 10.1109/ACCESS.2023.3346952
BERT-NAR-BERT: A Non-Autoregressive Pre-Trained Sequence-to-Sequence Model Leveraging BERT Checkpoints
MOHAMMAD GOLAM SOHRAB1, MASAKI ASADA1, MATĪSS RIKTERS1, AND MAKOTO MIWA1,2
1Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
2Toyota Technological Institute, Nagoya 468-8511, Japan
Corresponding author: Mohammad Golam Sohrab ([email protected])
This work was supported by the New Energy and Industrial Technology Development Organization (NEDO) under Project JPNP20006.
ABSTRACT We introduce BERT-NAR-BERT (BnB) – a pre-trained non-autoregressive sequence-to-sequence model, which employs BERT as the backbone for the encoder and decoder for natural language understanding and generation tasks. During pre-training and fine-tuning with BERT-NAR-BERT, two challenging aspects are addressed by adopting the length classification and connectionist temporal classification models to control the output length of BnB. We evaluate it using a standard natural language understanding benchmark, GLUE, and three generation tasks – abstractive summarization, question generation, and machine translation. Our results show substantial improvements in inference speed (on average 10x faster) with only a small loss in output quality compared to our direct autoregressive baseline, the BERT2BERT model. Our code is publicly released on GitHub (https://github.com/aistairc/BERT-NAR-BERT) under the Apache 2.0 License.
INDEX TERMS Abstractive summarisation, language understanding, machine translation, natural language processing, non-autoregressive modelling.
I. INTRODUCTION
Sequence-to-sequence (S2S) models have recently been widely used for natural language processing problems.
The S2S architecture was first introduced in the field of machine translation (MT) [1] and later used for pre-trained generative language models (LMs), such as BART [2], T5 [3], Optimus [4], and BERT2BERT [5]. These models usually adopt an autoregressive (AR) decoding strategy to generate texts from left to right, token by token. AR decoding can perform high-quality inference, but it has the limitation that it cannot decode tokens in parallel and requires more time and computational cost for inference. Given the growing demand for real-time applications of S2S models, it is important to investigate approaches that can speed up inference.
For fast inference, non-autoregressive (NAR) decoding has been widely studied since it can predict target tokens in all positions simultaneously and independently during inference [6]. The decoding thus can increase the inference
speed compared to AR decoding, but it can lead to degradation of generation performance. Several S2S-based NAR methods [7], [8] have been proposed to approach the performance of AR methods.
Among AR models, Rothe et al. [5] proposed BERT2BERT (B2B), which leveraged publicly available Pre-trained Language Model (PLM) parameter checkpoints such as BERT [9] and GPT-2 [10] and showed state-of-the-art performance on several generation tasks by utilizing released parameters for an S2S AR model without new large-scale pre-training. For NAR models, however, existing studies performed pre-training from scratch, and the effectiveness of using PLM checkpoints for NAR decoding has not been investigated.
In this paper, we propose a novel S2S NAR model based on existing Transformer [11] architectures to allow parameter initialization from publicly available PLM checkpoints.
Specifically, we extend the B2B model to build a NAR S2S model using BERT as the backbone for both encoder and decoder models. The NAR modeling allows fast decoding and generation of longer texts. Our model architecture is similar to that of BioNART [12], which is targeted towards
the biomedical domain. Using BERT as the backbone, we can start training with reliable parameters by loading the pre-trained BERT checkpoints. In addition, unlike the B2B model, we perform one epoch of additional pre-training starting from the BERT checkpoints and investigate the effectiveness of this additional pre-training.¹ We adopt the Length Classification (LC) [13] or Connectionist Temporal Classification (CTC) [14] models to control the output length of BnB. We call our NAR model BERT-NAR-BERT, or BnB, where the NAR (n) part specifies that the model is non-autoregressive and the BERT on each side of ‘-’ represents that the model is sequence-to-sequence.
We fine-tune and evaluate the BnB model on a standard natural language understanding benchmark, GLUE, and three generation tasks – abstractive summarization, question generation, and machine translation. We compare its performance with several AR models and recent state-of-the-art NAR models.
Our contributions are summarized as follows:
• We propose a novel NAR S2S method with encoders and decoders, each compatible with publicly available PLMs, with additional pre-training and output length modeling.
• Our S2S BnB model is 10.7x faster than its AR counterpart, i.e., our direct baseline BERT2BERT model. On average, our model allows 17x faster decoding than autoregressive models and 7.7x faster decoding than semi-autoregressive models.
• We show that leveraging BERT checkpoints for our NAR model is effective in improving performance.
It improves the ROUGE-L scores by 1.9 and 4.8 points on the summarization and question generation tasks, respectively.
II. RELATED WORK
Pre-trained Language Models such as GPT-2 [10], XLNet [15], and XLM [16] are neural networks trained on large-scale datasets that can be fine-tuned on problem-specific data. They became widely adopted after BERT [9], which reported SOTA results for 11 NLP tasks. Our BnB PLM is likewise trained on a large-scale dataset and further fine-tuned on 12 task-specific NLP tasks, 10 of which overlap with BERT.
Li et al. [4] proposed the first large-scale variational autoencoder (VAE) language model, Optimus. They connect a BERT encoder and a GPT-2 decoder using a universal latent embedding space. The model is first pre-trained on a large text corpus and then fine-tuned for various language generation and understanding tasks. It achieves SOTA on VAE language modeling benchmarks. While the general idea of our work is similar, there are several core differences from this paper.
Our model does not have a VAE and instead of the GPT-2 decoder we use the same BERT as in the encoder. Optimus is also autoregressive, while ours is non-autoregressive.
Besides, the Optimus authors also reported that the model's inconsistency results in different tokenisation between input and output, as the encoder is based on BERT and the decoder on GPT-2. In contrast, the input and output tokenisation are consistent in our model because the same BERT architecture is used on both sides.
¹In this paper, we use the term additional pre-training for pre-training our BnB model. We load the pre-trained PLM checkpoints and do not pre-train the PLMs from scratch.
Rothe et al. [5] developed Transformer-based sequence-to-sequence models by describing several combinations of model initialization that include BERT2BERT, a BERT-initialized encoder paired with a BERT-initialized AR decoder.
Our implementation of BnB is similar, except for the main differences of having a length prediction model, a latent representation from the encoder output layer, and a NAR decoder. The NAR decoder can decode tokens in parallel, which drastically reduces the inference computational cost.
NAR models have recently been investigated in NLP tasks due to their efficiency. For example, Gu et al. [6] introduced a NAR model for MT based on the Transformer [11]. This reduced latency by 90% and achieved competitive output quality with only a slight decrease in translation performance compared to similar-sized AR models. Gu and Kong [17] further minimized the gap with AR models, achieving SOTA performance on several MT benchmarks. Li et al. [18] introduced ELMER, which leverages an early exit technique that enables token generation at different layers: if there is sufficient confidence to generate a token at a lower layer, the model is allowed to exit and make the prediction without passing the representation to upper layers. Experimental results show that ELMER outperforms other NAR models and narrows the performance gap with AR models. The major limitation of ELMER is that the model needs to define an appropriate early exit strategy. During pre-training, the model utilizes layer permutation language modeling, but in fine-tuning it needs to design a different early-exit strategy to decide at which layer the model will exit. During pre-training and fine-tuning, our model follows the general architecture, where early exit is not an option; instead, tokens are passed through all layers of BnB.
Sun et al. [19] proposed a grafting approach, where they stack multiple sets of BERT encoder layers and autoregressive GPT decoder layers, and freeze selected encoder or decoder parameters during fine-tuning. This approach leads to strong gains in output quality; however, they do not compare model sizes, training time, or inference time directly with previous work. While freezing some parameters during fine-tuning can optimize resource usage at that stage, during inference the model would still be twice as slow due to the doubled parameter size. Compared to this work, our model follows non-autoregressive generation, as opposed to the autoregressive manner, and maintains strong output quality while vastly improving inference speed.
III. BERT-NAR-BERT FRAMEWORK
In this section, we propose BERT-NAR-BERT (BnB), a non-autoregressive S2S model; an overview of BnB is shown in Figure 1. We first briefly introduce the BERT2BERT model and then describe the differences in model architecture and our training approaches for NAR modeling. Later, we describe the model architecture of BnB and how we leverage BERT checkpoints for our NAR model. Then we describe our additional pre-training scheme for NAR language modeling.
FIGURE 1. The S2S BERT-NAR-BERT (BnB) architecture. The + sign next to the encoder box indicates that the input embeddings are the sum of the token, position, and type embeddings, while the + sign next to the decoder box indicates the sum of the position, type, and latent embeddings.
A. BERT2BERT MODEL
The BERT2BERT (B2B) [5] model is an autoregressive S2S model using PLM checkpoints. All weights of the encoder and decoder are initialized from the public BERT checkpoints except for the cross-attention, which is randomly initialized. Since the model is autoregressive, a task-specific beam size is set for generation during decoding. Our model extends this model into a NAR model with additional pre-training and output length modeling.
We consider B2B as our direct AR baseline model.
B. MODEL ARCHITECTURE
The BnB model is composed of a multi-layer Transformer-based encoder and decoder, in which the embedding layer and the stack of transformer layers are initialized with BERT [9].
To leverage the expressive power of existing pre-trained BERT models, we initialize our encoder and decoder parts with the pre-trained BERT parameters. We denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A.
1) BNB ENCODER
The encoder part of BnB has the same architecture as the BERT model.² The BnB model first feeds the source input sequence X = x_1, ..., x_n to the BnB encoder layer. In the BnB embedding layers, the input representation is constructed by summing the corresponding token (X), position (P), and type (T) embeddings. The embeddings are fed into the BERT self-attention and feed-forward layers. The hidden representation of the final layer, h, is passed to the subsequent layer to obtain latent representations.
²We share the same encoder for both BERT2BERT and BERT-NAR-BERT, where we modify the decoder part by leveraging the latent representations and length classification for NAR generation.
2) LATENT REPRESENTATIONS
We construct the latent representation z = W_E h + b based on the token-level representation from the encoder hidden state h, where z ∈ R^P is a P-dimensional vector and W_E ∈ R^{P×H} is the weight matrix. A visualization of the BERT-NAR-BERT encoder and the latent representations can be seen in the left part of Figure 1.
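For illustration, this latent projection amounts to a single linear layer applied token-wise to the encoder hidden states. The following is a minimal PyTorch sketch under assumed dimensions (hidden size H = 768 and latent size P = 8, the value mentioned in the Limitations section); the class name and shapes are illustrative rather than our exact implementation.

```python
import torch
import torch.nn as nn

H, P = 768, 8  # BERT-base hidden size; latent size of 8 (assumed, see Limitations)

class LatentProjection(nn.Module):
    """Token-wise linear map z = W_E h + b from encoder hidden states to latent vectors."""
    def __init__(self, hidden_size: int = H, latent_size: int = P):
        super().__init__()
        self.proj = nn.Linear(hidden_size, latent_size)  # holds W_E and b

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, src_len, H) -> z: (batch, src_len, P)
        return self.proj(h)

# Example: project the final-layer hidden states of a dummy batch.
h = torch.randn(2, 16, H)
z = LatentProjection()(h)
print(z.shape)  # torch.Size([2, 16, 8])
```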
3) BNB DECODER
The decoder part is also based on the BERT architecture, and we can directly initialize the decoder with the pre-trained BERT model. Following our direct baseline BERT2BERT model, the cross-attention mechanism is adopted, and the encoder hidden representation of the final layer h is used for cross-attention. Our model differs from the BERT2BERT model in its input representation and attention masks, which enable NAR decoding. In AR decoding, all target tokens are fed into the decoder with customized attention masks that prevent the decoder from seeing future tokens during training. Then, in inference, the predicted token is fed to the decoder autoregressively. In our BnB decoder, the input representation is constructed without providing any target tokens. The input representation is constructed by summing the corresponding position (P) and type (T) embeddings and the latent embedding z from the encoder. The attention masks are normal masks that give access to all future tokens. The resulting decoder output representations of the final layer are fed to the subsequent generation layer. A visualization of the BERT-NAR-BERT decoder can be seen in the right part of Figure 1. Figure 2 shows the decoder architectural difference between our non-autoregressive model and the direct baseline autoregressive model.
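To make the contrast with AR decoding concrete, the sketch below assembles NAR decoder inputs from position and type embeddings plus the latent embeddings, and uses a full (non-causal) attention mask. The helper name, the linear map from the latent size back to the hidden size, and the mean-pooling of the latent sequence are simplifying assumptions for illustration, not our exact implementation.

```python
import torch
import torch.nn as nn

H, P, MAX_LEN = 768, 8, 512      # hidden size, latent size (assumed), maximum length

position_emb = nn.Embedding(MAX_LEN, H)
type_emb = nn.Embedding(2, H)
latent_to_hidden = nn.Linear(P, H)  # maps latent z back to the hidden size (assumed)

def build_nar_decoder_inputs(z: torch.Tensor, tgt_len: int):
    """Decoder inputs without any target tokens: position + type + latent embeddings."""
    batch = z.size(0)
    positions = torch.arange(tgt_len).unsqueeze(0).expand(batch, -1)   # (batch, tgt_len)
    token_types = torch.zeros(batch, tgt_len, dtype=torch.long)
    # For simplicity we pool the latent sequence; the real model may align it differently.
    latent = latent_to_hidden(z).mean(dim=1, keepdim=True)             # (batch, 1, H)
    inputs = position_emb(positions) + type_emb(token_types) + latent  # broadcast over length
    # Full attention mask: every position may attend to every other (no causal masking).
    attention_mask = torch.ones(batch, tgt_len, dtype=torch.long)
    return inputs, attention_mask

z = torch.randn(2, 16, P)                   # latent vectors from the encoder
inputs, mask = build_nar_decoder_inputs(z, tgt_len=20)
print(inputs.shape, mask.shape)             # (2, 20, 768) (2, 20)
```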
FIGURE 2. The decoder architecture of the non-autoregressive (NAR) BERT-NAR-BERT (BnB) model vs. our direct baseline autoregressive (AR) BERT2BERT (B2B) model. Inference speedup is computed on both CPU and GPU over 2,000 sentences from the WMT16 test data.
C. ADDITIONAL PRE-TRAINING
This section describes the two unsupervised pre-training objectives of the BERT-NAR-BERT. Figure 3 shows the pre-training objectives of BnB.
FIGURE 3. Masked and permutation LM objective functions of BnB.
1) MASKED LANGUAGE MODELING
This is the self-supervised learning strategy adopted in BERT: we randomly mask the tokens in each sequence, and all masked tokens are predicted in a non-autoregressive manner.
2) PERMUTATION LANGUAGE MODELING
We randomly permute tokens in a sequence and predict the original order of the tokens, which is inspired by the idea in XLNet [15].
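For illustration, the two objectives differ only in how their training examples are built: masked LM hides a fraction of the tokens and predicts them, while permutation LM shuffles the tokens and predicts the original order. The sketch below is a simplified, assumed data-preparation routine (the 15% masking probability matches our pre-training setting in Section IV; the helper names are illustrative).

```python
import random

MASK_TOKEN = "[MASK]"

def make_masked_lm_example(tokens, mask_prob=0.15):
    """Masked LM: replace a fraction of tokens with [MASK]; targets are the original tokens."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            targets.append(tok)       # predict the original token, non-autoregressively
        else:
            inputs.append(tok)
            targets.append(None)      # no loss on unmasked positions
    return inputs, targets

def make_permutation_lm_example(tokens):
    """Permutation LM: shuffle the tokens; the target is the sequence in its original order."""
    permuted = tokens[:]
    random.shuffle(permuted)
    return permuted, tokens           # input: permuted sequence, target: original sequence

tokens = "the quick brown fox jumps over the lazy dog".split()
print(make_masked_lm_example(tokens))
print(make_permutation_lm_example(tokens))
```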
D. MODELING OUTPUT LENGTH
To control the generation output length of BnB, we implement two length prediction models, namely 1) Length Classification [13] and 2) CTC [14], which implicitly determines the target length from the token alignment.
1) LENGTH CLASSIFICATION
The length classification formulates the length prediction as a classification task and utilizes the latent representations to predict the target length:
\[
p_\theta(y \mid z) = \sum_{l} p_\theta(y, l \mid z) = p_\theta(y, l_y \mid z) = p_\theta(y \mid z, l_y)\, p_\theta(l_y \mid z), \tag{1}
\]
where $l_y$ denotes the length of $y$, which is the gold length during training, and $p_\theta(l_y \mid z) = L_L$ is the length predictor that predicts the length of the target sentence $y$. Once the sentence length is predicted, the model predicts each output token with a token-level categorical cross-entropy loss.
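A minimal sketch of such a length classifier is given below; it assumes the latent representations are mean-pooled and fed to a softmax over candidate target lengths, trained with cross-entropy against the gold length l_y. The class name and dimensions are illustrative, not our exact implementation.

```python
import torch
import torch.nn as nn

class LengthClassifier(nn.Module):
    """Predicts p(l_y | z) as a categorical distribution over target lengths."""
    def __init__(self, latent_size: int = 8, max_length: int = 512):
        super().__init__()
        self.classifier = nn.Linear(latent_size, max_length)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, src_len, latent_size); pool over source positions, then classify.
        pooled = z.mean(dim=1)
        return self.classifier(pooled)            # (batch, max_length) length logits

length_head = LengthClassifier()
z = torch.randn(4, 16, 8)
logits = length_head(z)
gold_lengths = torch.tensor([12, 20, 7, 33])      # gold target lengths l_y used in training
length_loss = nn.CrossEntropyLoss()(logits, gold_lengths)
predicted_lengths = logits.argmax(dim=-1)         # lengths used at inference time
print(length_loss.item(), predicted_lengths)
```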
2) CONNECTIONIST TEMPORAL CLASSIFICATION
In contrast to LC, we also adopt the CTC [14] considering its superior performance and flexibility for latent alignment.
The CTC loss is computed independently after the decoder, replacing the length classification and cross-entropy losses.
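A minimal sketch using PyTorch's built-in CTC loss is shown below; it assumes the decoder emits vocabulary logits (index 0 reused as the blank symbol) over an output sequence longer than the target, so that CTC can marginalize over monotonic alignments without a separate length classifier. The tensor shapes and the blank index are illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab_size, blank_id = 30522, 0      # BERT vocabulary size; blank index 0 (assumed)
ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)

batch, dec_len, tgt_len = 4, 60, 40  # decoder length is set longer than the target length
logits = torch.randn(batch, dec_len, vocab_size)          # decoder output logits
log_probs = logits.log_softmax(dim=-1).transpose(0, 1)    # CTC expects (T, N, C)

targets = torch.randint(1, vocab_size, (batch, tgt_len))  # gold token ids (no blanks)
input_lengths = torch.full((batch,), dec_len, dtype=torch.long)
target_lengths = torch.full((batch,), tgt_len, dtype=torch.long)

# CTC sums over all monotonic alignments between decoder positions and target tokens,
# so the target length is determined implicitly by the alignment.
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```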
E. POSSIBLE MODEL COMPONENT COMBINATIONS
Table 1 shows the possible model component combinations of the BERT-NAR-BERT model. Additional pre-training is performed with either of two objectives: masked or permutation LM. For pre-training, BnB is initialized with random values or publicly available PLM checkpoints. For fine-tuning, model parameters are initialized with random values, public parameter checkpoints, or the resulting additionally pre-trained parameters. For both additional pre-training and fine-tuning, we choose LC or CTC for modeling the output lengths.
IV. EXPERIMENTS
A. PRE-TRAINING: DATA AND TASK SETTINGS
We pre-train our BERT-NAR-BERT, which we call additional pre-training, by loading the PLM checkpoints and training on a sequence-to-sequence task.
1) ADDITIONAL PRE-TRAINING
The additional pre-training procedure follows the existing literature on PLM pre-training. The entire Wikipedia [20] data dump, version 20220301.en from Huggingface datasets,³ is used for pre-training without any data filtering scheme. As a result, 6,458,670 input texts, which may include multiple sequences, are truncated to a maximum sequence length of 512, and 3.2B tokens are used for our additional pre-training.
³https://huggingface.co/datasets/wikipedia
TABLE 1. Possible model components for additional pre-training and fine-tuning on BERT-NAR-BERT. Bold font options are adopted as our default options.
B. FINE-TUNING: DATA AND TASK SETTINGS
For fine-tuning, we describe the data and task settings of the benchmark downstream tasks including GLUE, abstractive summarization, question generation, and machine translation.
We initialize all the downstream tasks with the additional pre-training checkpoints generated from BnB.
1) LANGUAGE UNDERSTANDING
We consider the General Language Understanding Evaluation (GLUE) benchmark [21], which consists of nine general language understanding tasks. We employ the Optimus evaluation script⁴ to evaluate the scores and, following Optimus, select the best performance among different runs to report all scores.
2) ABSTRACTIVE SUMMARIZATION
Abstractive text summarization aims to produce a short version of a document while preserving its salient information content.
We evaluate the models on the BBC extreme (XSum) dataset [22], a news summarization dataset containing 227K pairs of news articles and single-sentence summaries. The evaluation metric is ROUGE [23], including ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L). We adopted the Google Research reimplementation of ROUGE.⁵
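For reference, this reimplementation is distributed as the rouge-score Python package; a minimal usage sketch with toy strings (not our actual system outputs) is shown below.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "the cat sat on the mat"
prediction = "a cat was sitting on the mat"
scores = scorer.score(reference, prediction)       # dict of precision/recall/F1 per metric
print(scores["rougeL"].fmeasure)
```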
We also perform a latency comparison. We evaluate the time required to generate all the samples in the validation set with the same machine settings for our BnB and ELMER [18] and calculate the latency ratios. We include the reported latency values of other existing models from ELMER.
3) QUESTION GENERATION
SQuAD v1.1 [24] is a dataset created for machine reading comprehension. The dataset contains 98K {passage, question, answer} triples. We use these data as a question generation dataset, in which a model receives an answer and a passage and generates the corresponding question. We follow the same train/validation/test data split as Du et al. [25].
The evaluation metrics are ROUGE-L and BLEU-4 (B-4).
We follow the same settings as the abstractive summarization task for latency comparison.
4) MACHINE TRANSLATION
We evaluate our models using two popular benchmark datasets from the WMT shared tasks on news translation: English (EN) ↔ German (DE) data from WMT 2014 and English ↔ Romanian (RO) data from WMT 2016. We also experiment with distilled versions of these datasets, generated by vanilla Transformer models [11] trained on the original data.
⁴https://github.com/ChunyuanLI/Optimus
⁵https://github.com/google-research/google-research/tree/master/rouge
We load the WMT datasets from Huggingface datasets⁶,⁷ and use them directly to train the models without filtering, back-translation, or any other kind of synthetic data generation.
We evaluate the performance by computing BLEU [26] scores using sacreBLEU [27].
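A minimal scoring sketch with the sacrebleu package, using toy sentences rather than our actual outputs, is shown below.

```python
# pip install sacrebleu
import sacrebleu

hypotheses = ["The cat sits on the mat.", "He reads a book."]
references = [["The cat is sitting on the mat.", "He is reading a book."]]  # one reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)  # corpus-level BLEU over all sentences
print(bleu.score)
```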
C. EXPERIMENTAL SETTINGS
Our implementation is based on the encoder-decoder model (i.e., EncoderDecoderModel) in Huggingface Transformers [28], in combination with their BERT model.
We supplement the Huggingface Transformers framework with a non-autoregressive version of the encoder-decoder.
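As a rough illustration of this starting point, Huggingface's EncoderDecoderModel can tie two public BERT checkpoints into an S2S model (autoregressive by default), as in the sketch below; the non-autoregressive decoding and length prediction described in Section III are our additions on top of this and are not shown here.

```python
from transformers import BertTokenizerFast, EncoderDecoderModel

# Initialize both encoder and decoder from public BERT checkpoints;
# the cross-attention weights are newly (randomly) initialized.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-cased", "bert-base-cased"
)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

src = tokenizer("A source sentence to encode.", return_tensors="pt")
labels = tokenizer("A target sentence.", return_tensors="pt").input_ids

# Toy forward pass; labels are shifted internally to build the decoder inputs.
outputs = model(input_ids=src.input_ids, attention_mask=src.attention_mask, labels=labels)
print(outputs.loss)
```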
D. PRE-TRAINING BNB
For additional pre-training, we split the text from the text field and ignored the title field in the English Wikipedia. We consider a document-level rather than a shuffled sentence-level corpus by loading the data from Huggingface. During additional pre-training, we initialize BnB with bert-base-cased and update all the parameters of BnB's encoder and decoder using Wikipedia, where the checkpoints are saved for both the encoder and decoder.
We set the masking probability to 15% for masked LM and the permutation probability to 50% for permutation LM.
E. FINE-TUNING BNB
For GLUE, following the fine-tuning settings in [4] and [9], we use learning rates of [2, 3, 4, 5] × 10^-5 with different seeds, following the Optimus setting. We train the model for three epochs on the MNLI, QQP, QNLI, and SST-2 datasets and ten epochs on the comparatively small datasets: CoLA, STS-B, MRPC, and RTE.
For abstractive summarization, we set the number of training epochs to 100 and adopt early stopping. Our BnB is first initialized with the bert-base-cased pre-trained parameters, and all the parameters are fine-tuned on the labeled XSum data. Later, BnB is initialized with the additional pre-training parameters (bnb-base-cased), obtained with the masked and permutation LM objectives, and fine-tuned in the same way.
For question generation, the input is formatted as {answer [SEP] passage}, and the model generates the question sequence. BnB initialized with bert-base-cased serves as the baseline model, and BnB initialized with our additional pre-training parameters (bnb-base-cased) is compared with the top published works reported by Li et al. [18].
⁶https://huggingface.co/datasets/wmt14
⁷https://huggingface.co/datasets/wmt16
TABLE 2. Comparison of OpenAI GPT, BERT, Optimus, and BnB on the validation set of GLUE. ACC and P.C. stand for accuracy and Pearson correlation coefficient, respectively. The OpenAI GPT scores are reported by Devlin et al. [9]. Bold and underlined scores denote the best and second-best results.
TABLE 3. Performance comparison on the XSum and SQuAD v1.1 datasets. R-1/2/L and B-4 stand for ROUGE-1/2/L and BLEU-4, respectively. Bold and underlined scores denote the best and second-best results within NAR models. For the latency values of existing methods, we include the values reported by Li et al. [18].
For machine translation, we compare the scores with autoregressive baselines – Transformer MT models and BERT2BERT, initialized with random parameters or pre-trained multilingual BERT. We compare random initialization of parameters with initialization from mBERT, as well as knowledge distillation [29], which has previously proven to be highly beneficial for NAR MT models [30].
V. RESULTS
We evaluate BnB on language understanding and generation tasks. We report the results of the language understanding and generation tasks using BnB with parallel decoding, in comparison to other models, in Tables 2, 3, and 4.
A. LANGUAGE UNDERSTANDING
To fine-tune the model on GLUE tasks, we apply our pre-trained BnB model to the nine downstream tasks.
In Table 2, we compare our fully non-autoregressive BnB approach with the OpenAI GPT⁸ [31], encoder-based BERT, and VAE-based Optimus models. We cannot compare BERT-NAR-BERT with the BERT2BERT [5] model, as that work skips the language understanding tasks. Moreover, the BERT2BERT pre-trained checkpoint is not publicly available, which further limits our ability to conduct the language understanding tasks with the BERT2BERT model. In this table, we compare BnB trained with the permutation LM objective. Our best BnB model with the permutation LM objective outperforms the OpenAI GPT [31] and BERT models on averaged scores, and shows comparable results with respect to Optimus. The Optimus model follows a sentence-level pre-training procedure, which allows training the model for longer than document-level pre-training. The GLUE benchmark consists of nine datasets; because of the problematic nature of the WNLI dataset, the BERT [9] literature ignores it. We evaluate the WNLI dataset and achieve 56.3% accuracy, the same score as reported for Optimus. We also notice the problematic nature of WNLI, where the score remains unchanged even when the dataset is fine-tuned with different learning rates and seeds.
⁸The first GPT version, since its model parameters are comparable with those of the proposed BnB model.
B. LANGUAGE GENERATION
For language generation based on BnB, we consider three tasks: 1) abstractive summarization, 2) question generation, and 3) machine translation.
1) SUMMARIZATION AND QUESTION GENERATION
Table 3 shows the performance comparison of BnB on the XSum and SQuAD v1.1 datasets for the summarization and question generation tasks, respectively. Our best model outperforms all the semi-autoregressive (semi-AR) and most of the non-autoregressive (NAR) approaches, and closely competes with the AR approaches. Under the NAR setting, we achieve the second-best results on the XSum and SQuAD v1.1 datasets. Our model outperforms ELMER-Hard [18] on the XSum and SQuAD v1.1 datasets, while ELMER-Soft [18] retains the state-of-the-art (SOTA) results. ELMER-Hard and ELMER-Soft denote fine-tuning ELMER [18] with hard and soft early exit strategies, respectively. Unlike our BnB, which needs to determine the length by integrating an extra length prediction model, ELMER dynamically adjusts the output length by emitting an end token (i.e., [EOS]) at any position with early exit strategies. Our BnB could also integrate an early exit strategy by adopting the ELMER-Soft idea.
2) MACHINE TRANSLATION
Results of the machine translation experiments are summarized in Table 4. The results show that our model can compete with the baseline Transformer and B2B models⁹ after knowledge distillation. This is expected, as the BnBs are compared with autoregressive baseline translation models.
TABLE 4. Machine translation experiment results in BLEU scores of training a vanilla Transformer model, BERT2BERT (B2B), and BERT-NAR-BERT (BnB) initialized with multilingual BERT (mBERT); trained on original and distilled (dist.) data from WMT 2014 (German) and WMT 2016 (Romanian). Bold and underlined scores denote the best and second-best results.
C. INFERENCE SPEEDUP
We also compare the differences in inference speed between the models. While training speed is mostly similar for all, BnB can generate output 17x faster on average due to its non-autoregressive nature (tested on NVIDIA V100 and A100 GPUs), as shown in the Latency column of Table 3.
For machine translation, translating 2,000 sentences of the WMT16 test data with BnB takes 87 seconds on a GPU and 587 seconds on a CPU. The same took 234 seconds on a GPU and 1,234 seconds on a CPU for the equivalent direct B2B baseline model. Following the BERT architecture, our BERT-NAR-BERT consists of 110M parameters each for the encoder and the decoder, with 12 layers (L = 12), a hidden size of 768 (H = 768), and 12 self-attention heads (A = 12). Our direct baseline BERT2BERT model follows the same parameters, 220M in total for the encoder and decoder models.
⁹Transformer and B2B scores are from our implementation of the models using Huggingface Transformers. Rothe et al. [5] reported uncased BLEU scores on WMT14 of 30.1 for EN-DE and 32.7 for DE-EN.
VI. ABLATION STUDY
In this section, we further conduct ablation studies to investigate the different objectives, hyper-parameters, and additional pre-training strategies of BnB. Generally, training language models with different objectives and hyper-parameters is computationally expensive; therefore, we show the differences in the performance of BnB to compare the settings in two scenarios: 1) judging the effect of additional pre-training with the masked and permutation LM objectives, and 2) selecting the better model from different length prediction models.
TABLE 5. Comparison of the parameter initialization for BnB. Our baseline model BnB is initialized with bert-base-cased (bBERT), while the models with autoencoder (AE), masked, and permutation (perm.) LM objectives are initialized with our additional pre-training (AP), i.e., bnb-base-cased.
The scores in subscript denote the improvements of our models over the baseline models.
TABLE 6. Comparison of the length prediction mechanism for NAR decoding between Length Classification (LC) and Connectionist Temporal Classification (CTC).
A. EFFECT OF ADDITIONAL PRE-TRAINING
Table 5 shows the effect of additional pre-training of BnB on the summarization dataset. The results show that BnB with additional pre-training, using either the masked or permutation LM objective, outperforms our baseline models.
B. ABLATION ON LENGTH PREDICTION
Fully NAR models need a target sequence length prediction model to generate all the tokens simultaneously.
We adopt the Length Classification and Connectionist Temporal Classification models described in Section III to facilitate this in BnB. In Table 6, we fine-tune BnB with the LC and CTC mechanisms on the XSum dataset. Table 6 shows that the CTC-based model outperforms the LC-based model.
C. EFFECT OF FEW-SHOT LEARNING
We also conduct a few-shot experiment by selecting 1%, 10%, 30%, 50%, and 70% of the data of each task in the GLUE benchmark. As shown in Figure 4, our pre-trained BnB model shows competitive results even with only 1% of the data from each task. As the data scale increases to 30%, the performance gap with respect to 100% of the data becomes negligible.
FIGURE 4. Few-shot learning over the GLUE dataset. 1%, 10%, 30%, 50%, and 70% of the data are selected to conduct the experiments. CoLA, STS-B, MRPC, and RTE are relatively small datasets in comparison to MNLI, QQP, QNLI, and SST-2.
VII. ERROR ANALYSIS ON GENERATION TASKS
A. ABSTRACTIVE SUMMARIZATION
Table 7 shows randomly selected generated examples from the XSum dataset. In the first example, our BnB successfully generated a summary stating that Yorkshire Diamonds have appointed Paul Grayson as their head coach, but our model produced incorrect modifiers such as ‘‘Super League champions’’ and ‘‘former’’.
In the second example, our BnB successfully generated the text ‘‘A man has been arrested on suspicion of murder after a man was’’ as in the gold reference, but the rest of the sentence contains grammatical errors, and the model wrongly generated the word ‘‘Wolverhampton’’, which is not mentioned in the source text. On the other hand, the AR decoding baseline BART generated a sentence with almost the same meaning as the gold reference.
B. MACHINE TRANSLATION
We randomly sampled several examples to superficially inspect the types of errors or differences in the output of our model compared to our baselines. In the evaluation, we mainly looked for recurring patterns of differences in the translations between the models and strong differences from the reference.
Table 8 shows that occasionally the BLEU score suffers due to paraphrasing of the same concepts, e.g., ‘‘rocket attacks’’
instead of ‘‘missile strikes’’. In this case, ‘‘rocket attacks’’
is a direct literal translation of the German compound word
‘‘Raketenangriffe’’, but not the correct English term for the given context. Person names like ‘‘Volodymyr Zelenskyy’’
are also a common struggle for any MT model, especially if not commonly found in training data. Location names like ‘‘Kyiv’’ may have other spellings, such as ‘‘Kiev’’ in this example, where the former is the romanized official Ukrainian name for the city, and the latter used to be the traditional English name for the city up to 2014, when many Western media outlets switched to using the romanized Ukrainian name after the outbreak of the Russo-Ukrainian War.
Table 9 shows an example of English to German translation with variations on how the words ‘‘ticket checker’’ and ‘‘tickets’’
are translated. Both models seem not to have chosen the correct terms for the context; however, the meaning remains generally the same.
The final example in Table 10 exhibits some of the most common problems that we noticed in BnB output during the evaluation. While both models translate the German word ‘‘Blatt’’ very literally into ‘‘sheet’’ or ‘‘leaf’’ instead of the figuratively correct translation ‘‘magazine’’, BnB seems to struggle with longer inputs overall and sometimes generates shortened or even cut-off translations. The example also exhibits the occasional hiccup of repeated words or phrases.
TABLE 7. An example of abstractive summarization output comparing the baseline BART model output to our BnB. The main differences between the model outputs are underlined.
TABLE 8. An example of German to English translation output comparing the baseline Transformer model output to our BnB. The main differences between the model outputs are underlined.
TABLE 9. An example of English to German translation output comparing the baseline Transformer model output to our BnB. The main differences between the model outputs are underlined.
TABLE 10. An example of German to English translation output where the BnB model struggles with some concepts and generates shortened output. The main differences between the model outputs are underlined.
VIII. DISCUSSION
We present BERT-NAR-BERT, a novel, simple, and easy-to-implement large-scale non-autoregressive pre-trained S2S model that leverages BERT as the backbone of both the encoder and decoder models. We demonstrate strong performance on language understanding with the nine GLUE tasks and on three language generation tasks: summarization, question generation, and machine translation.
BnB allows much faster decoding without a noticeable sacrifice in comparison to the fully autoregressive approaches, and exhibits faster decoding with better performance compared to semi-AR approaches. It is also notable that BnB demonstrates faster decoding than existing NAR models, except ELMER [18]. This is expected because: 1) BnB (220M total parameters) is a relatively larger model than ELMER (140M total parameters); and 2) BnB simultaneously predicts the tokens only at the last layer, like most NAR models, whereas ELMER predicts at different layers via its early exit technique.
Li et al. [18] reported that an important concern for ELMER is defining an early exit technique. In pre-training, ELMER follows layer permutation language modeling, but during fine-tuning the model needs to design an effective and suitable method to decide at which layer to exit and predict tokens. Our BnB does not rely on the early exit technique but predicts the tokens from the final output layer.
In our future work, we will adopt the early exit technique to observe the performance over long-form text generation along with a length classification model.
Furthermore, training and evaluation of language models with the possible components stated in Table 1 is computationally very expensive. Therefore, our ablation studies on 1) the effect of additional pre-training using masked and permutation LM objectives and 2) different length prediction methods help us choose the best learning parameters for BnB while considering a smaller number of training steps.
IX. CONCLUSION
This paper introduces BERT-NAR-BERT, an efficient non-autoregressive S2S model that outperforms baselines on most of the summarization and question generation tasks. It remains competitive in output quality when evaluated on machine translation tasks, and our model trained on distilled data shows improvement over the baseline approaches.
Furthermore, we find that using pre-trained BERT models as the encoder and decoder along with CTC for length prediction and knowledge distillation for machine translation helps improve the performance of language generation tasks.
We also separately evaluate BnB on the GLUE tasks and show strong language understanding performance across the nine GLUE tasks.
Despite the overall good performance, our error analysis did highlight some recurring flaws in the output, which in most cases seemed to be difficulties in dealing with less common named entities (NE). This could potentially be addressed by incorporating dedicated NE processing either in the data preparation pipeline or in model training.
In future work, we plan to experiment with replacing the BERT models with other pre-trained language models that can be used as encoders/decoders, as well as running broader evaluations on other S2S NLP tasks. Besides, large language models (LLMs) are computationally very expensive, as they require many more dedicated GPUs and more processing power than standard deep learning models. We will address LLMs and decoder-only models in future work.
LIMITATIONS
There are several limitations of the current BnB model. First, since BERT is the backbone for the encoder and the decoder of BERT-NAR-BERT, the input sequence is limited to 512 tokens in length. In this work, we considered using a document-level corpus by loading the dataset directly from Huggingface.
Therefore, during pre-training with Wikipedia and fine-tuning
on summarization datasets, the longer sequences are truncated, as BERT limits the maximum input to 512 tokens. One may further train the models for a longer period with a sentence-level corpus of Wikipedia to achieve better context representations.
Second, the BERT-NAR-BERT model introduces a token-level latent space that connects the encoder and decoder of the model. Since hyper-parameter tuning on large language models is computationally very costly, we chose a latent size of 8 without conducting any empirical studies on selecting the latent size.
Furthermore, the usual limitations of neural S2S generation models also apply to ours: they tend to struggle with generating less common words and phrases, such as the ones highlighted in our error analysis.
ETHICS STATEMENT
We use only publicly available datasets and relatively low compute amounts while conducting our experiments to enable reproducibility. We do not perform any studies on other humans or animals in this research.
ACKNOWLEDGMENT
The authors thank all the anonymous reviewers for their insightful comments.
REFERENCES
[1] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, ‘‘Learning phrase representations using RNN encoder–decoder for statistical machine translation,’’ in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Doha, Qatar, 2014, pp. 1724–1734.
[2] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, ‘‘BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,’’ in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, 2020, pp. 7871–7880.
[3] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, ‘‘Exploring the limits of transfer learning with a unified text-to-text transformer,’’ J. Mach. Learn. Res., vol. 21, no. 1, pp. 5485–5551, 2020.
[4] C. Li, X. Gao, Y. Li, B. Peng, X. Li, Y. Zhang, and J. Gao, ‘‘Optimus: Organizing sentences via pre-trained modeling of a latent space,’’ in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2020, pp. 4678–4699.
[5] S. Rothe, S. Narayan, and A. Severyn, ‘‘Leveraging pre-trained checkpoints for sequence generation tasks,’’ Trans. Assoc. Comput. Linguistics, vol. 8, pp. 264–280, Dec. 2020.
[6] J. Gu, J. Bradbury, C. Xiong, V. O. K. Li, and R. Socher, ‘‘Non-autoregressive neural machine translation,’’ in Proc. Int. Conf. Learn. Represent., 2018, pp. 1–13.
[7] J. Lee, E. Mansimov, and K. Cho, ‘‘Deterministic non-autoregressive neural sequence modeling by iterative refinement,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., Brussels, Belgium, 2018, pp. 1173–1182.
[8] W. Qi, Y. Gong, J. Jiao, Y. Yan, W. Chen, D. Liu, K. Tang, H. Li, J. Chen, R. Zhang, M. Zhou, and N. Duan, ‘‘BANG: Bridging autoregressive and non-autoregressive generation with large scale pretraining,’’ in Proc. 38th Int. Conf. Mach. Learn., M. M. T. Zhang, Ed., 2021, pp. 8630–8639.
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ‘‘BERT: Pre-training of deep bidirectional transformers for language understanding,’’ in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., vol. 1, Minneapolis, MN, USA, Jun. 2019, pp. 4171–4186.
[10] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, ‘‘Language models are unsupervised multitask learners,’’ OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–11.
[12] M. Asada and M. Miwa, ‘‘BioNART: A biomedical non-AutoRegressive transformer for natural language generation,’’ in Proc. 22nd Workshop Biomed. Natural Lang. Process. BioNLP Shared Tasks, Toronto, ON, Canada, 2023, pp. 369–376.
[13] R. Shu, J. Lee, H. Nakayama, and K. Cho, ‘‘Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior,’’ in Proc. AAAI Conf. Artif. Intell., vol. 34, 2020, pp. 8846–8853.
[14] J. Libovický and J. Helcl, ‘‘End-to-end non-autoregressive neural machine translation with connectionist temporal classification,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., Brussels, Belgium, 2018, pp. 3016–3021.
[15] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, ‘‘XLNet: Generalized autoregressive pretraining for language understanding,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 5753–5763.
[16] A. Conneau and G. Lample, ‘‘Cross-lingual language model pretraining,’’ in Proc. Neural Inf. Process. Syst., vol. 32, 2019, pp. 7059–7069.
[17] J. Gu and X. Kong, ‘‘Fully non-autoregressive neural machine translation: Tricks of the trade,’’ in Proc. IJCNLP, Aug. 2021, pp. 120–133.
[18] J. Li, T. Tang, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, ‘‘ELMER: A non-autoregressive pre-trained language model for efficient and effective text generation,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., Abu Dhabi, United Arab Emirates, 2022, pp. 1044–1058.
[19] Z. Sun, M. Wang, and L. Li, ‘‘Multilingual translation via grafting pre-trained language models,’’ in Proc. EMNLP, Punta Cana, Dominican Republic, Nov. 2021, pp. 2735–2747.
[20] Wikimedia Foundation. (2023). Wikimedia Downloads. Accessed: Sep. 19, 2023. [Online]. Available: https://dumps.wikimedia.org
[21] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, ‘‘GLUE: A multi-task benchmark and analysis platform for natural language understanding,’’ in Proc. EMNLP, Brussels, Belgium, Nov. 2018, pp. 353–355.
[22] S. Narayan, S. B. Cohen, and M. Lapata, ‘‘Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., Brussels, Belgium, 2018, pp. 1797–1807.
[23] C.-Y. Lin, ‘‘ROUGE: A package for automatic evaluation of summaries,’’ in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74–81.
[24] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, ‘‘SQuAD: 100,000+ questions for machine comprehension of text,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., 2016, pp. 2383–2392.
[25] X. Du, J. Shao, and C. Cardie, ‘‘Learning to ask: Neural question generation for reading comprehension,’’ in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, Vancouver, BC, Canada, 2017, pp. 1342–1352.
[26] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, ‘‘BLEU: A method for automatic evaluation of machine translation,’’ in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics, Philadelphia, PA, USA, 2001, pp. 311–318.
[27] M. Post, ‘‘A call for clarity in reporting BLEU scores,’’ in Proc. 3rd Conf. Mach. Transl., Res. Papers, Brussels, Belgium, 2018, pp. 186–191.
[28] T. Wolf et al., ‘‘Transformers: State-of-the-art natural language processing,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., Syst. Demonstrations, Oct. 2020, pp. 38–45.
[29] G. Hinton, O. Vinyals, and J. Dean, ‘‘Distilling the knowledge in a neural network,’’ in Proc. NIPS Deep Learn. Workshop, 2014, pp. 1–9.
[30] Y. Kim and A. M. Rush, ‘‘Sequence-level knowledge distillation,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., Austin, TX, USA, 2016, pp. 1317–1327.
[31] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, ‘‘Improving language understanding by generative pre-training,’’ OpenAI Blog, 2018.
[32] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu, ‘‘MASS: Masked sequence to sequence pre-training for language generation,’’ in Proc. ICML, vol. 97, K. C. R. Salakhutdinov, Ed., 2019, pp. 5926–5936.
[33] W. Qi, Y. Yan, Y. Gong, D. Liu, N. Duan, J. Chen, R. Zhang, and M. Zhou, ‘‘ProphetNet: Predicting future n-gram for sequence-to-sequence pre-training,’’ in Findings of the Association for Computational Linguistics: EMNLP 2020, Nov. 2020, pp. 2401–2410.
[34] M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer, ‘‘Mask-predict: Parallel decoding of conditional masked language models,’’ in Proc. Conf. Empirical Methods Natural Lang. Process. 9th Int. Joint Conf. Natural Lang. Process. (EMNLP-IJCNLP), 2019, pp. 6112–6121.
[35] J. Gu, C. Wang, and J. Zhao, ‘‘Levenshtein transformer,’’ in Advances in Neural Information Processing Systems, vol. 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Red Hook, NY, USA: Curran Associates, 2019.
MOHAMMAD GOLAM SOHRAB received the Doctor of Engineering degree from the Department of Information Science and Intelligent Systems, The University of Tokushima, in 2013.
He is currently a Senior Researcher with the Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Japan. Previously, he was a Postdoctoral Fellow with the Computational Intelligence Laboratory, Toyota Technological Institute (TTI). Before that, he was an E-Commerce Analyst and a Software Developer with subsidiary companies in the USA and Canada, respectively. He has participated in several biomedical shared tasks and data mining challenges, achieving significant ranks. His research interests include deep learning, machine learning, language models, data/text mining, information extraction, hierarchical text classification, big data, bio-informatics, extreme multi-label classification, and natural language processing. He has received the Outstanding Paper Award from IEEE GCCE 2021, the Best Paper Award from the IEEE Cloud Computing and Intelligent Systems conference, and the International Research Exchange Award from The University of Tokushima in recognition of outstanding research for the accomplishment of the master's degree.
MASAKI ASADA received the Ph.D. degree from the Toyota Technological Institute, in 2022.
He is currently a Researcher with the Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan. His research interests include deep learning, natural language processing on biomedical texts, and generative language models.
MATĪSS RIKTERS received the Ph.D. degree in hybrid methods for machine translation from the University of Latvia, in 2019. He is currently a Researcher with the Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan. Prior to AIST, he was a Postdoctoral Researcher with The University of Tokyo and a Researcher/Developer with Tilde. His main field of study is natural language processing (NLP), and machine translation (MT) in particular. His work has mainly focused on the improvement of MT for low-resource and morphologically rich languages.
More recently his focus has expanded towards context-aware MT, efficient models, and other areas of NLP, such as sentiment analysis and social network analysis.
MAKOTO MIWA received the Ph.D. degree from The University of Tokyo, in 2008. He is currently an Associate Professor with the Toyota Technological Institute, Nagoya, Japan, and an Invited Researcher with the National Institute of Advanced Industrial Science and Technology, Tokyo, Japan. His research interests include natural language processing, deep learning, and information extraction.