
A Deep Learning Approach For Bengali Text Summarization

Academic year: 2023


This thesis entitled “A Deep Learning Approach for Bengali Text Summarization”, submitted by Amit Kumer Sorker to the Department of Computer Science and Engineering, Daffodil International University, has been accepted as satisfactory in partial fulfillment of the requirements for the degree of M.Sc. in Computer Science and Engineering and approved for style and content. Department of Computer Science and Engineering, Faculty of Science and Information Technology, Daffodil International University. I also certify that neither this project nor any part of it has been submitted elsewhere for the award of any degree or diploma.

I would like to thank my supervisor Mr. Abdus Sattar for his constant encouragement to complete this substantial research work on the Bengali language; he provided important resources and data for this Bengali-language study. I also thank Syed Akhter Hossain for his valuable help in carrying out research on the Bengali language.

However, automatic text summarization, already available for many other languages, has not yet been developed for Bengali. Expanding the tools and technology of the Bengali language is the fundamental goal of this research. In this experimental work I have tried to produce an automatic text summarizer for the Bengali language.

By the end, however, I have laid the groundwork for automatic text summarization in the Bengali language.

Introduction


The main idea of summarization is to focus on the key phrases and capture the essential meaning that connects the text. An abstractive text summarizer is a summarization approach that finds the significant parts of given text documents; optionally, the generated summary may contain words that are present in the original documents or not. In this way, the summarizer condenses long text, reduces the size of the document, and keeps only the central information by removing redundant content.

So the basic focus of this research is to build an automatic abstractive text summarizer for the Bengali language and thereby enrich the Bengali NLP toolkit. Even in this era, the tools and technology for the Bengali language are not as rich as for other languages, although the vast majority of substantive problems can be addressed with NLP tools and techniques.

In this research paper we therefore show how to process the Bengali language and build an abstractive text summarizer for it. This helps reduce the size of a document and provides a familiar summary of the record.

Research Questions

A familiar and fluent synopsis helps people easily grasp the meaning of a long text. Most essential NLP techniques so far target other languages, for example English, French, and Chinese. For Bengali text, by contrast, only a couple of models have been built, which is not sufficient.

To deal with this issue, one needs to use the Unicode representation of those characters and symbols. Most existing research work and tools for the Bengali language use extractive text summarization rather than abstractive summarization. Likewise, many researchers and engineers are not willing to share their data and resources publicly.

The trained model then works in the backend of a framework, for example a web or mobile application. In this research, we present a machine learning technique for abstractive Bengali text summarization and describe the key steps for building an automatic Bengali summarization model.

Report Layout

Background Studies

  • Introduction
  • Research Summary
  • Scope of the problem
  • Challenges

The encoder takes a vector of the text's sentences as input, and the decoder generates the most likely output from the vector groups. The encoder compresses the input sequence into a fixed-length vector, and the decoder produces the most relevant sequence from that encoded representation. In this part of the work, we use sequence-to-sequence learning to create an abstractive summary of the text.
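The fixed-length encoding idea can be illustrated with a toy recurrent cell. This is a minimal sketch with untrained unit weights standing in for learned parameters, not the network actually used in the thesis:

```python
import math

def rnn_encode(inputs, hidden_size=4):
    # Fold a sequence of input vectors into one fixed-length state vector:
    # h_t = tanh(x_t + h_{t-1}), element-wise, with unit weights.
    h = [0.0] * hidden_size
    for x in inputs:
        h = [math.tanh(xi + hi) for xi, hi in zip(x, h)]
    return h  # fixed-length vector, regardless of input length
```

However long the input sequence, the returned state always has `hidden_size` entries, which is exactly the property the decoder relies on.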

The dataset therefore contains two segments: one is the Bengali text and the other is its corresponding summary. In the preprocessing stage, the text is split from scratch, Bengali contractions are expanded, and stop words are removed. I have used a pre-trained word vector file for Bengali text that is publicly available on the web.
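The preprocessing steps (splitting, expanding contractions, removing stop words) might look like the following sketch. The contraction table and stop-word list here are tiny invented samples, not the actual resources used in the thesis:

```python
import re

# Hypothetical sample entries -- the real lists are collected from the web.
CONTRACTIONS = {"ড.": "ডক্টর"}       # e.g. "Dr." -> "Doctor"
STOP_WORDS = {"এবং", "ও", "কিন্তু"}   # e.g. "and", "and", "but"

def preprocess(text):
    # Expand Bengali contractions so the model sees the full word form.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Drop Bengali digits, then everything outside the Bengali block.
    text = re.sub(r"[\u09E6-\u09EF]", " ", text)
    text = re.sub(r"[^\u0980-\u09FF\s]", " ", text)
    # Split on whitespace and remove stop words.
    return [w for w in text.split() if w not in STOP_WORDS]
```

Applied to a short phrase, this yields a clean token list ready for word-vector lookup.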

The word vectors are the input of the encoder, and the most significant word vectors from the decoder are the output of the model. As text summarization is a new research area in Bengali NLP, the method is developed step by step. This research work uses NMT to summarize Bengali text from short Bengali input sequences.
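Pretrained vectors in the common word2vec text format ("word v1 v2 …") can be loaded as below. The thesis only says its Bengali vector file is available on the web, so the file format and sample entries here are assumptions:

```python
# Sample data in word2vec text format; the real file would be read from disk.
SAMPLE_VEC = """\
বাংলা 0.1 0.2 0.3
ভাষা 0.4 0.5 0.6
"""

def load_word_vectors(lines):
    # Map each word to its embedding vector, skipping malformed lines.
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

embeddings = load_word_vectors(SAMPLE_VEC.splitlines())
```

For a real file one would pass `open("bn_vectors.vec", encoding="utf-8")` (a hypothetical filename) instead of the sample lines.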

In the dataset, the texts are not very long but are sufficient for summarization, while the summaries span only a few sentences. Composing summaries of long or arbitrary Bengali texts therefore remains an open research area. A few public datasets exist, but only a little research work has so far been done using them.

After collecting the dataset, producing a summary for each text is another difficult task. In the preprocessing step, the raw text must be encoded to serve as model input. For languages such as English, libraries provide functions to remove stop words from text.

Research Methodology

  • Research Subject and Instrumentation
  • Data Collection
  • a) Split
  • b) Contractions
  • d) Remove stop words
  • e) Clean content & summary
  • Implementation Requirements
  • b) Word embedding
  • c) Encoder & Decoder
  • d) Seq2Seq model
  • Descriptive Analysis
  • Summary

The machine does not understand the short form of a word, so the full meaning of the short form must be spelled out. Removing extra spaces, whitespace, English characters, accented forms, and Bengali digits from the text follows the standard cleaning rules used in our evaluation. Likewise, from the beginning, I collected a complete Bengali stop-word list from the web.

After completing the previous steps, the text and summary sequences are clean. The input sequence 𝑥1, 𝑥2, …, 𝑥P is drawn from a vocabulary of size V. The model creates the output sequence y1, y2, …, yS, where S < P; that is, the summary sequence is shorter than the input text document. These word vectors are used as the input of the model, which then outputs the related words.

The input text is fed to the encoder, where each input is a word-vector sequence. The decoder takes the encoded sequence and produces the output text from the most significant content. If we consider y the target sequence of sentences, the model maximizes the likelihood of y given the input word-vector sequence x.
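The maximum-likelihood training described here can be written out explicitly. This is the standard sequence-to-sequence formulation, not reproduced from the thesis itself:

```latex
p(y_1,\dots,y_S \mid x_1,\dots,x_P) \;=\; \prod_{t=1}^{S} p\left(y_t \mid y_1,\dots,y_{t-1},\, c\right),
\qquad
\theta^{*} \;=\; \arg\max_{\theta} \sum_{(x,y)} \log p_{\theta}(y \mid x)
```

where c is the fixed-length vector produced by the encoder and training maximizes the log-likelihood over all text–summary pairs in the dataset.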

Since we use an RNN for the Bengali language, Bengali input of any length is fed to the model through the encoder. Each sequence uses tokens to mark its start and end points. These special tokens are used when processing the sequence in the encoder and decoder.

In the encoder, when the end token of the input sequence appears, the sequence is automatically terminated. Likewise in the decoder, the end token marks the end of the output sequence. After encoding finishes, an instruction must tell the decoder to begin generating.
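The start/end token handling can be sketched as follows. The token names `<sos>` and `<eos>` are assumptions, since the thesis only speaks of "special tokens":

```python
SOS, EOS = "<sos>", "<eos>"   # assumed token names

def add_special_tokens(summary_tokens):
    # The decoder input starts with <sos>, which tells the decoder to begin;
    # the decoder target ends with <eos>, which tells the model when to stop.
    decoder_input = [SOS] + summary_tokens
    decoder_target = summary_tokens + [EOS]
    return decoder_input, decoder_target
```

During inference, generation simply halts as soon as the model emits `<eos>`.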

From that point, I convert each token into a vocabulary index to prepare the model's input. The probability of each word is then determined by the weight values and the embedding values of the text.
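Turning the decoder's weight values into word probabilities is conventionally done with a softmax over the vocabulary logits; a minimal, numerically stabilized version (a generic sketch, not code from the thesis):

```python
import math

def softmax(logits):
    # One logit per vocabulary word; subtracting the max keeps exp() from
    # overflowing, then normalizing yields next-word probabilities.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The decoder picks the highest-probability word (greedy decoding) or samples from this distribution at each step.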

Figure 3.1.1 Content summarization process flow.

CHAPTER 5

Impact on Environment

Most text summarization methods follow one of two distinct approaches, known as abstractive and extractive. The few older tools created for the Bangla language appear of little practical use from an application perspective. In the proposed approach, basic extraction review is applied with the newly proposed model, and many Bangla text processing rules are obtained from heuristics.

In evaluating this procedure, the framework shows good accuracy, measured against human-produced summaries and other Bangla text summarization tools.

Ethical Aspects

Sustainability

  • Data collection from social-media sources; Stage 2: Summarize the gathered data
  • Collect word2vec; Stage 4: Data preprocessing
  • Load the pre-trained embedding; Stage 7: Add special tokens
  • Build the sequence-to-sequence model; Stage 10: Model training
  • Check the results and analyze the machine's response
    • Recommendations
    • Implications for Further Study

The main concern of this research work is the creation and dissemination of Bengali NLP research in this area. I used Bengali text as input to our model, which produced as output a summary, likewise in Bengali. From the very beginning I built a model for English text, then adapted this model to Bengali text.

If the sequence length grows too large, the model does not work properly. A strong word-to-vector mapping should be produced to handle text-related issues. In any case, the main point is that the model can create an abstractive summary for the Bengali language.

This is a milestone for Bengali NLP that supports future research work. In the next phase of my work, I will expand the dataset and its summaries to improve the model's performance. I will try to build further text summarization models to find the best-performing summarizer for Bengali.

So far I am working only with short texts; for long sequences, a Bengali-language summarizer is still required. This model has some limitations, for example it handles only limited sequence lengths, and the dataset is not sufficient. Either way, the groundwork is laid for the next stage of development.

In this work, the future direction is to expand the dataset of Bengali text. Likewise, building the system into a web or mobile application depends significantly on the future of artificial intelligence.


Figures

Figure 2.1.1: Demo figure of text summarization.
Figure 3.1.1: Content summarization process flow.
Table 1: Content processing example
Table 2: Different types of data in the dataset
