
Speech and Language Processing


We start with tokenization and preprocessing, as well as useful algorithms such as computing edit distance, and then move on to the tasks of classification, logistic regression, and neural networks, working through feedforward networks, recurrent networks, and then transformers.

CHAPTER

Regular Expressions

  • Basic Regular Expression Patterns
  • Disjunction, Grouping, and Precedence
  • A Simple Example
  • More Operators
  • A More Complex Example
  • Substitution, Capture Groups, and ELIZA

For example, the Unix command-line tool grep takes a regular expression and returns every line of the input document that matches the expression. In the following examples, we generally underline the exact part of the string that matches the regular expression and show only the first match. We will show regular expressions delimited by slashes, but note that slashes are not part of the regular expressions.

You can also use square brackets to specify what a single character cannot be, using the caret ^. A question mark can be thought of as "zero or one instance of the previous character". The Kleene star (*) means "zero or more occurrences of the immediately preceding character or regular expression".
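Here is a minimal sketch of these operators, using Python's re module rather than grep; the example strings are invented for illustration.

```python
import re

# [^ ] negates a character class: match any character that is not an upper-case letter
print(re.findall(r'[^A-Z]', 'aBc'))            # ['a', 'c']

# ? means zero or one instance of the previous character
print(re.findall(r'colou?r', 'color colour'))  # ['color', 'colour']

# The Kleene star * means zero or more occurrences of the preceding expression
print(re.findall(r'ba*!', 'b! ba! baaa!'))     # ['b!', 'ba!', 'baaa!']
```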

One very important special character is the period (/./), a wildcard expression that matches any single character (except a carriage return), as shown in Fig. 2.6: /beg.n/ matches any character between beg and n, as in begin, beg'n, and begun (Figure 2.6, "Using the period to specify any character"). For example, suppose we are looking for the pattern "the Xer they were, the Xer they will be", where we want to constrain the two X's to be the same string.
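A capture group plus a backreference can express that constraint. Here is a minimal sketch in Python's re module; the test sentences are invented for illustration:

```python
import re

# (.*) captures X; \1 is a backreference requiring the second X to be the same string
pattern = r'the (.*)er they were, the \1er they will be'

print(re.search(pattern, 'the bigger they were, the bigger they will be'))  # matches
print(re.search(pattern, 'the bigger they were, the faster they will be'))  # None
```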

Cascades of such substitutions were the mechanism behind the early chatbot ELIZA, as in this dialogue fragment:

ELIZA1: IN WHAT WAY
ELIZA2: CAN YOU THINK OF A SPECIFIC EXAMPLE
User3: Well, my boyfriend made me come here
ELIZA3: YOUR BOYFRIEND MADE YOU COME HERE
User4: He says I'm depressed much of the time
ELIZA4: I AM SORRY TO HEAR YOU ARE DEPRESSED
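A minimal sketch of one ELIZA-style substitution rule using re.sub and a capture group; the single rule is invented for illustration, whereas the real ELIZA used a long cascade of such rules:

```python
import re

def eliza_turn(utterance):
    """Apply one illustrative ELIZA-style substitution rule."""
    # Reflect "my X" back as "YOUR X"; the capture group carries X into the response.
    response = re.sub(r'.*my (.*)', r'YOUR \1', utterance, flags=re.IGNORECASE)
    # The real ELIZA also swapped pronouns (me -> you, etc.); omitted here for brevity.
    return response.upper()

print(eliza_turn('Well, my boyfriend made me come here'))
# -> YOUR BOYFRIEND MADE ME COME HERE
```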

  • Lookahead Assertions
  • Words
  • Corpora
  • Text Normalization
    • Unix Tools for Crude Tokenization and Normalization
    • Word Tokenization
    • Byte-Pair Encoding for Tokenization
    • Word Normalization, Lemmatization and Stemming
    • Sentence Segmentation
  • Minimum Edit Distance
    • The Minimum Edit Distance Algorithm
  • Summary
  • N-Grams
  • Evaluating Language Models
    • Perplexity
  • Sampling sentences from a language model
  • Generalization and Zeros
    • Unknown Words
  • Smoothing
    • Laplace Smoothing
    • Add-k smoothing
    • Backoff and Interpolation
  • Huge Language Models and Stupid Backoff
  • Advanced: Kneser-Ney Smoothing
    • Absolute Discounting
    • Kneser-Ney Discounting
  • Advanced: Perplexity’s Relation to Entropy
  • Summary
  • Naive Bayes Classifiers
  • Training the Naive Bayes Classifier
  • Worked example
  • Optimizing for Sentiment Analysis
  • Naive Bayes for other text classification tasks
  • Naive Bayes as a Language Model
  • Evaluation: Precision, Recall, F-measure
    • Evaluating with more than two classes
  • Test sets and Cross-validation

Another measure of the number of words in the language is the number of lemmas rather than word types. To represent the probability of a particular random variable X_i taking on the value "the", or P(X_i = "the"), we will use the simplification P(the). Let's look at a general equation for this n-gram approximation to the conditional probability of the next word in a sequence.
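The approximation can be stated as follows (restating the standard formula, with N the n-gram order):

$$P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1})$$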

Let's say 0 occurs 91 times in the training set and each of the other digits occurs 1 time each. So a lower perplexity tells us that a language model is a better predictor of the words in the test set. Note that any kind of knowledge of the test set can make the perplexity artificially low.
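For reference, perplexity is the inverse probability of the test set W = w_1 w_2 ... w_N, normalized by the number of words (a restatement of the standard definition):

$$\text{perplexity}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}$$

For instance, under a unigram model trained on the skewed digit data above, a test set consisting only of 0s would have perplexity (0.91^N)^{-1/N} = 1/0.91 ≈ 1.10.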

The expectation that the random number will fall in the larger interval of one of the frequent words (the, of, a) is much higher than that it will fall in the smaller interval of one of the rare words (polyphonic). Second, if the probability of any word in the test set is 0, the probability of the entire test set is 0. A better n-gram model is one that assigns a higher probability to the test data, and perplexity is a normalized version of the test set probability.
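That interval-based sampling can be sketched as follows; this is a minimal illustration with a tiny made-up unigram distribution, not the book's code:

```python
import random

# Toy unigram probabilities (invented for illustration); frequent words get larger intervals
unigram = {'the': 0.4, 'of': 0.3, 'a': 0.2, 'polyphonic': 0.1}

def sample_word(probs):
    """Pick a word by seeing which cumulative-probability interval a random number falls in."""
    r = random.random()
    cumulative = 0.0
    for word, p in probs.items():
        cumulative += p
        if r < cumulative:
            return word
    return word  # guard against floating-point rounding

print([sample_word(unigram) for _ in range(10)])
```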

The cross entropy is defined in the limit as the length of the sequence of observed words goes to infinity.
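Stated as a formula (a restatement of the standard definition for a model m of a stationary ergodic process):

$$H(m) = \lim_{n \to \infty} -\frac{1}{n} \log m(w_1 w_2 \ldots w_n)$$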


  • Statistical Significance Testing
    • The Paired Bootstrap Test
  • Avoiding Harms in Classification
  • Summary
  • The sigmoid function
  • Classification with Logistic Regression
    • Sentiment Classification
    • Other classification tasks and features
    • Processing many examples at once
    • Choosing a classifier
  • Multinomial logistic regression
    • Softmax
    • Applying softmax in logistic regression
    • Features in Multinomial Logistic Regression
  • Learning in Logistic Regression
  • The cross-entropy loss function
  • Gradient Descent
    • The Gradient for Logistic Regression
    • The Stochastic Gradient Descent Algorithm
    • Working through an example
    • Mini-batch training
  • Regularization
  • Learning in Multinomial Logistic Regression
  • Interpreting models
  • Advanced: Deriving the Gradient Equation
  • Summary
  • Lexical Semantics
  • Vector Semantics
  • Words and Vectors
  • Cosine for measuring similarity
  • TF-IDF: Weighing terms in the vector
  • Pointwise Mutual Information (PMI)
  • Applications of the tf-idf or PPMI vector models
  • Word2vec
  • Visualizing Embeddings
  • Semantic properties of embeddings

Indeed, logistic regression is one of the most important analytical tools in the social and natural sciences. In the rest of the book we will represent such sums using dot product notation. If we apply the sigmoid to the sum of the weighted features, we get a number between 0 and 1.
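In that dot-product notation, the weighted sum and the sigmoid applied to it are (restating the standard formulation):

$$z = \mathbf{w} \cdot \mathbf{x} + b, \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$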

For each class k, the value ŷ_k will be the classifier's estimate of the probability p(y_k = 1 | x). We will use this kind of notation in our description of the CRF in Chapter 8. The gradient of a function of many variables is a vector that points in the direction of the greatest increase in the function.
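The per-class estimates ŷ_k mentioned above come from the softmax function; restating the standard definition for a score vector z = (z_1, ..., z_K):

$$\hat{y}_k = \text{softmax}(z_k) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$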

Fig. 5.5 shows a visualization of the value of a 2-dimensional gradient vector taken at the red point. For example, suppose you didn't know the meaning of the word ongchoi (a recent loanword from Cantonese), but you see it in the following contexts. Later in the chapter we will introduce some of the components of this vector comparison process: tf-idf term weighting and the cosine similarity metric.
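As a preview of the cosine similarity metric, here is a minimal NumPy sketch; the two toy vectors are invented for illustration:

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity: the dot product of v and w normalized by their vector lengths."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Toy count vectors for two words (invented numbers, e.g. counts over a few context words)
v = np.array([1.0, 6.0, 2.0])
w = np.array([2.0, 5.0, 1.0])
print(cosine(v, w))  # close to 1.0 for similar directions, 0.0 for orthogonal vectors
```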

So for each of these (w, c_pos) training instances we will create k negative examples, each consisting of the target w plus a "noise word" c_neg. The goal is to maximize the similarity of the target word, context word pairs (w, c_pos) drawn from the positive examples, and to minimize the similarity of the (w, c_neg) pairs drawn from the negative examples.
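For a single (w, c_pos) pair with k noise words, this objective corresponds to the standard skip-gram with negative sampling loss (σ is the sigmoid):

$$L_{CE} = -\left[\log \sigma(\mathbf{c}_{pos} \cdot \mathbf{w}) + \sum_{i=1}^{k} \log \sigma(-\mathbf{c}_{neg_i} \cdot \mathbf{w})\right]$$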


  • Computational linguistic studies
  • Bias and Embeddings
  • Evaluating Vector Models
  • Summary
  • Units
  • The XOR problem
    • The solution: neural networks
  • Feedforward Neural Networks
    • More details on feedforward networks
  • Feedforward networks for NLP: Classification
  • Feedforward Neural Language Modeling
    • Forward inference in the neural language model
  • Training Neural Nets
    • Loss function
    • Computing the Gradient
    • Computation Graphs
    • Backward differentiation on computation graphs
    • More details on learning
  • Training the neural language model
  • Summary
  • Part-of-Speech Tagging
  • Named Entities and Named Entity Tagging
  • HMM Part-of-Speech Tagging
    • Markov Chains
    • The Hidden Markov Model
    • The components of an HMM tagger
    • HMM tagging as decoding
    • The Viterbi Algorithm
    • Working through an example
  • Conditional Random Fields (CRFs)
    • Features in a CRF POS Tagger
    • Features for CRF Named Entity Recognizers
    • Inference and Training for CRFs
  • Evaluation of Named Entity Recognition
  • Further Details
    • Rule-based Methods
    • POS Tagging for Morphologically Rich Languages
  • Summary
  • Recurrent Neural Networks
    • Inference in RNNs
    • Training
  • RNNs as Language Models
    • Forward Inference in an RNN language model
    • Training an RNN language model
    • Weight Tying
  • RNNs for other NLP tasks
    • Sequence Labeling

We include some suitable intermediate variables: the summation output, z, and the sigmoid output, a. In other words, we can view the hidden layer of the network as a representation of the input. The role of the output layer is to take this new representation and compute a final output.
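A minimal NumPy sketch of that forward pass for a 2-layer feedforward network, keeping the intermediate variables z and a explicit; the layer sizes and random weights are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input vector
W = rng.normal(size=(4, 3))       # hidden-layer weights
b = rng.normal(size=4)            # hidden-layer bias
U = rng.normal(size=(2, 4))       # output-layer weights

z = W @ x + b                     # summation output at the hidden layer
a = sigmoid(z)                    # sigmoid output: the hidden representation of the input
y = U @ a                         # the output layer maps this representation to a final output
print(y)
```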

In logistic regression, for each observation we can directly compute the derivative of the loss function with respect to an individual weight w or bias b. The computation of the gradient requires the partial derivative of the loss function with respect to each parameter. The simplest use of computation graphs is to compute the value of the function with some given inputs.

The importance of the computation graph stems from the backward pass, which is used to compute the derivatives we need for the weight update. We give the beginning of the computation, calculating the derivative of the loss function L with respect to z, that is, ∂L/∂z (leaving the rest of the computation as an exercise for the reader). Continuing the backward computation of the gradients (next passing the gradients over b[2]1 and the two product nodes, and so on, back to all the teal nodes) is left as an exercise for the reader.
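A minimal sketch of such a forward-then-backward computation on a tiny graph, L = (a*x + b - y)^2, worked by hand with the chain rule; the variable names and numbers are invented for illustration, not the book's example:

```python
# Forward pass on a tiny computation graph: z = a*x + b, L = (z - y)**2
a, x, b, y = 2.0, 3.0, 1.0, 5.0
z = a * x + b          # z = 7.0
d = z - y              # d = 2.0
L = d ** 2             # L = 4.0

# Backward pass: apply the chain rule from L back toward the parameters
dL_dd = 2 * d          # dL/dd = 2(z - y) = 4.0
dL_dz = dL_dd * 1.0    # d = z - y, so dd/dz = 1
dL_da = dL_dz * x      # dz/da = x
dL_db = dL_dz * 1.0    # dz/db = 1
print(dL_dz, dL_da, dL_db)   # 4.0 12.0 4.0
```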

We don't want to learn separate weight matrices to map each of the previous 3 words to the projection layer. Neural language models use a neural network as a probabilistic classifier to calculate the probability of the next word given the previous words. Each cell keeps the probability of the best path so far and a pointer to the previous cell along that path.
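That cell-by-cell bookkeeping is the core of the Viterbi algorithm; here is a minimal sketch with made-up tag, transition, and emission probabilities (not the numbers from the book's Janet example):

```python
import numpy as np

# Toy HMM: 2 tags, 3 observed words; all probabilities invented for illustration
start = np.array([0.8, 0.2])                  # P(tag at t=0)
trans = np.array([[0.7, 0.3], [0.4, 0.6]])    # trans[i, j] = P(tag j | tag i)
emit = np.array([[0.5, 0.4, 0.1],             # emit[i, o] = P(word o | tag i)
                 [0.1, 0.3, 0.6]])
obs = [0, 1, 2]                               # observed word indices

T, N = len(obs), len(start)
viterbi = np.zeros((N, T))                    # best-path probability for each cell
backptr = np.zeros((N, T), dtype=int)         # pointer to the previous cell on that path

viterbi[:, 0] = start * emit[:, obs[0]]
for t in range(1, T):
    for j in range(N):
        scores = viterbi[:, t - 1] * trans[:, j] * emit[j, obs[t]]
        viterbi[j, t] = scores.max()
        backptr[j, t] = scores.argmax()

# Follow the backpointers from the best final cell to recover the best tag sequence
best = [int(viterbi[:, T - 1].argmax())]
for t in range(T - 1, 0, -1):
    best.append(int(backptr[best[-1], t]))
print(list(reversed(best)))
```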

Most of the cells in the column are zero, since the word Janet cannot be any of these tags. Later taggers explicitly introduced the use of the hidden Markov model (Kupiec 1992; Weischedel et al. 1993; Schütze and Singer 1994). See Householder (1995) for historical notes on parts of speech, and Sampson (1987) and Garside et al. (1997) on the origin of the Brown and other tagsets.

That is, the activation value of the hidden layer depends on the current input as well as the activation value of the hidden layer from the previous time step. This link augments the input to the computation at the hidden layer with the value of the hidden layer from the previous time step.
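In equations (restating the standard simple-RNN formulation, with g and f the hidden and output activation functions):

$$\mathbf{h}_t = g(\mathbf{U}\mathbf{h}_{t-1} + \mathbf{W}\mathbf{x}_t), \qquad \mathbf{y}_t = f(\mathbf{V}\mathbf{h}_t)$$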

