
Speech and Language Processing


We start with tokenization and preprocessing, as well as useful algorithms such as computing edit distance, and then move on to the tasks of classification, logistic regression, and neural networks, working through feedforward networks, recurrent networks, and then transformers.

CHAPTER

Regular Expressions

  • Basic Regular Expression Patterns
  • Disjunction, Grouping, and Precedence
  • A Simple Example
  • More Operators
  • A More Complex Example
  • Substitution, Capture Groups, and ELIZA

For example, the Unix command-line tool grep takes a regular expression and returns every line of the input document that matches the expression. In the following examples, we generally underline the exact part of the string that matches the regular expression and show only the first match. We will show regular expressions delimited by slashes, but note that slashes are not part of the regular expressions.

You can also use square brackets to specify what a single character cannot be, using the caret ^. A question mark can be thought of as "zero or one instance of the previous character". The Kleene star (*) means "zero or more occurrences of the immediately preceding character or regular expression".
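Here is a minimal sketch of these operators, using Python's re module rather than grep; the example strings are invented for illustration.

```python
import re

# [^ ] negates a character class: match any character that is not an upper-case letter
print(re.findall(r'[^A-Z]', 'aBc'))            # ['a', 'c']

# ? means zero or one instance of the previous character
print(re.findall(r'colou?r', 'color colour'))  # ['color', 'colour']

# The Kleene star * means zero or more occurrences of the preceding expression
print(re.findall(r'ba*!', 'b! ba! baaa!'))     # ['b!', 'ba!', 'baaa!']
```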

One very important special character is the period (/./), a wildcard expression that matches any single character (except a carriage return), as shown in Fig. 2.6: /beg.n/ matches any character between beg and n, as in begin, beg'n, and begun (Figure 2.6, "Using the period to specify any character"). For example, suppose we are looking for the pattern "the Xer they were, the Xer they will be", where we want to constrain the two X's to be the same string.
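A capture group plus a backreference can express that constraint. Here is a minimal sketch in Python's re module; the test sentences are invented for illustration:

```python
import re

# (.*) captures X; \1 is a backreference requiring the second X to be the same string
pattern = r'the (.*)er they were, the \1er they will be'

print(re.search(pattern, 'the bigger they were, the bigger they will be'))  # matches
print(re.search(pattern, 'the bigger they were, the faster they will be'))  # None
```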

Cascades of such substitutions were the mechanism behind the early chatbot ELIZA, as in this dialogue fragment:

ELIZA1: IN WHAT WAY
ELIZA2: CAN YOU THINK OF A SPECIFIC EXAMPLE
User3: Well, my boyfriend made me come here
ELIZA3: YOUR BOYFRIEND MADE YOU COME HERE
User4: He says I'm depressed much of the time
ELIZA4: I AM SORRY TO HEAR YOU ARE DEPRESSED
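A minimal sketch of one ELIZA-style substitution rule using re.sub and a capture group; the single rule is invented for illustration, whereas the real ELIZA used a long cascade of such rules:

```python
import re

def eliza_turn(utterance):
    """Apply one illustrative ELIZA-style substitution rule."""
    # Reflect "my X" back as "YOUR X"; the capture group carries X into the response.
    response = re.sub(r'.*my (.*)', r'YOUR \1', utterance, flags=re.IGNORECASE)
    # The real ELIZA also swapped pronouns (me -> you, etc.); omitted here for brevity.
    return response.upper()

print(eliza_turn('Well, my boyfriend made me come here'))
# -> YOUR BOYFRIEND MADE ME COME HERE
```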

  • Lookahead Assertions
  • Words
  • Corpora
  • Text Normalization
    • Unix Tools for Crude Tokenization and Normalization
    • Word Tokenization
    • Byte-Pair Encoding for Tokenization
    • Word Normalization, Lemmatization and Stemming
    • Sentence Segmentation
  • Minimum Edit Distance
    • The Minimum Edit Distance Algorithm
  • Summary
  • N-Grams
  • Evaluating Language Models
    • Perplexity
  • Sampling sentences from a language model
  • Generalization and Zeros
    • Unknown Words
  • Smoothing
    • Laplace Smoothing
    • Add-k smoothing
    • Backoff and Interpolation
  • Huge Language Models and Stupid Backoff
  • Advanced: Kneser-Ney Smoothing
    • Absolute Discounting
    • Kneser-Ney Discounting
  • Advanced: Perplexity’s Relation to Entropy
  • Summary
  • Naive Bayes Classifiers
  • Training the Naive Bayes Classifier
  • Worked example
  • Optimizing for Sentiment Analysis
  • Naive Bayes for other text classification tasks
  • Naive Bayes as a Language Model
  • Evaluation: Precision, Recall, F-measure
    • Evaluating with more than two classes
  • Test sets and Cross-validation

Another measure of the number of words in the language is the number of lemmas rather than word types. To represent the probability of a particular random variable X_i taking on the value "the", or P(X_i = "the"), we will use the simplification P(the). Let's look at a general equation for this n-gram approximation to the conditional probability of the next word in a sequence.
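The approximation can be stated as follows (restating the standard formula, with N the n-gram order):

$$P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1})$$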

Let's say 0 occurs 91 times in the training set and each of the other digits occurs 1 time each. So a lower perplexity tells us that a language model is a better predictor of the words in the test set. Note that any kind of knowledge of the test set can make the perplexity artificially low.
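For reference, perplexity is the inverse probability of the test set W = w_1 w_2 ... w_N, normalized by the number of words (a restatement of the standard definition):

$$\text{perplexity}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}$$

For instance, under a unigram model trained on the skewed digit data above, a test set consisting only of 0s would have perplexity (0.91^N)^{-1/N} = 1/0.91 ≈ 1.10.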

The expectation that the random number will fall in the larger interval of one of the frequent words (the, of, a) is much higher than that it will fall in the smaller interval of one of the rare words (polyphonic). Second, if the probability of any word in the test set is 0, the probability of the entire test set is 0. A better n-gram model is one that assigns a higher probability to the test data, and perplexity is a normalized version of the test set probability.
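That interval-based sampling can be sketched as follows; this is a minimal illustration with a tiny made-up unigram distribution, not the book's code:

```python
import random

# Toy unigram probabilities (invented for illustration); frequent words get larger intervals
unigram = {'the': 0.4, 'of': 0.3, 'a': 0.2, 'polyphonic': 0.1}

def sample_word(probs):
    """Pick a word by seeing which cumulative-probability interval a random number falls in."""
    r = random.random()
    cumulative = 0.0
    for word, p in probs.items():
        cumulative += p
        if r < cumulative:
            return word
    return word  # guard against floating-point rounding

print([sample_word(unigram) for _ in range(10)])
```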

The cross entropy is defined in the limit as the length of the sequence of observed words goes to infinity.
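Stated as a formula (a restatement of the standard definition for a model m of a stationary ergodic process):

$$H(m) = \lim_{n \to \infty} -\frac{1}{n} \log m(w_1 w_2 \ldots w_n)$$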


  • Statistical Significance Testing
    • The Paired Bootstrap Test
  • Avoiding Harms in Classification
  • Summary
  • The sigmoid function
  • Classification with Logistic Regression
    • Sentiment Classification
    • Other classification tasks and features
    • Processing many examples at once
    • Choosing a classifier
  • Multinomial logistic regression
    • Softmax
    • Applying softmax in logistic regression
    • Features in Multinomial Logistic Regression
  • Learning in Logistic Regression
  • The cross-entropy loss function
  • Gradient Descent
    • The Gradient for Logistic Regression
    • The Stochastic Gradient Descent Algorithm
    • Working through an example
    • Mini-batch training
  • Regularization
  • Learning in Multinomial Logistic Regression
  • Interpreting models
  • Advanced: Deriving the Gradient Equation
  • Summary
  • Lexical Semantics
  • Vector Semantics
  • Words and Vectors
  • Cosine for measuring similarity
  • TF-IDF: Weighing terms in the vector
  • Pointwise Mutual Information (PMI)
  • Applications of the tf-idf or PPMI vector models
  • Word2vec
  • Visualizing Embeddings
  • Semantic properties of embeddings

Indeed, logistic regression is one of the most important analytical tools in the social and natural sciences. In the rest of the book we will represent such sums using dot product notation. If we apply the sigmoid to the sum of the weighted features, we get a number between 0 and 1.
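In that dot-product notation, the weighted sum and the sigmoid applied to it are (restating the standard formulation):

$$z = \mathbf{w} \cdot \mathbf{x} + b, \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$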

For each class k, the value ŷ_k will be the classifier's estimate of the probability p(y_k = 1 | x). We will use this kind of notation in our description of the CRF in Chapter 8. The gradient of a function of many variables is a vector that points in the direction of the greatest increase in the function.
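The per-class estimates ŷ_k mentioned above come from the softmax function; restating the standard definition for a score vector z = (z_1, ..., z_K):

$$\hat{y}_k = \text{softmax}(z_k) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$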

Fig. 5.5 shows a visualization of the value of a 2-dimensional gradient vector taken at the red point. For example, suppose you didn't know the meaning of the word ongchoi (a recent loanword from Cantonese), but you see it in the following contexts. Later in the chapter we will introduce some of the components of this vector comparison process: tf-idf term weighting and the cosine similarity metric.
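As a preview of the cosine similarity metric, here is a minimal NumPy sketch; the two toy vectors are invented for illustration:

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity: the dot product of v and w normalized by their vector lengths."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Toy count vectors for two words (invented numbers, e.g. counts over a few context words)
v = np.array([1.0, 6.0, 2.0])
w = np.array([2.0, 5.0, 1.0])
print(cosine(v, w))  # close to 1.0 for similar directions, 0.0 for orthogonal vectors
```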

So for each of these (w, c_pos) training instances we will create k negative examples, each consisting of the target w plus a "noise word" c_neg. The goal is to maximize the similarity of the target word, context word pairs (w, c_pos) drawn from the positive examples, and to minimize the similarity of the (w, c_neg) pairs drawn from the negative examples.
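For a single (w, c_pos) pair with k noise words, this objective corresponds to the standard skip-gram with negative sampling loss (σ is the sigmoid):

$$L_{CE} = -\left[\log \sigma(\mathbf{c}_{pos} \cdot \mathbf{w}) + \sum_{i=1}^{k} \log \sigma(-\mathbf{c}_{neg_i} \cdot \mathbf{w})\right]$$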


  • Computational linguistic studies
  • Bias and Embeddings
  • Evaluating Vector Models
  • Summary
  • Units
  • The XOR problem
    • The solution: neural networks
  • Feedforward Neural Networks
    • More details on feedforward networks
  • Feedforward networks for NLP: Classification
  • Feedforward Neural Language Modeling
    • Forward inference in the neural language model
  • Training Neural Nets
    • Loss function
    • Computing the Gradient
    • Computation Graphs
    • Backward differentiation on computation graphs
    • More details on learning
  • Training the neural language model
  • Summary
  • Part-of-Speech Tagging
  • Named Entities and Named Entity Tagging
  • HMM Part-of-Speech Tagging
    • Markov Chains
    • The Hidden Markov Model
    • The components of an HMM tagger
    • HMM tagging as decoding
    • The Viterbi Algorithm
    • Working through an example
  • Conditional Random Fields (CRFs)
    • Features in a CRF POS Tagger
    • Features for CRF Named Entity Recognizers
    • Inference and Training for CRFs
  • Evaluation of Named Entity Recognition
  • Further Details
    • Rule-based Methods
    • POS Tagging for Morphologically Rich Languages
  • Summary
  • Recurrent Neural Networks
    • Inference in RNNs
    • Training
  • RNNs as Language Models
    • Forward Inference in an RNN language model
    • Training an RNN language model
    • Weight Tying
  • RNNs for other NLP tasks
    • Sequence Labeling

We include some suitable intermediate variables: the summation output, z, and the sigmoid output, a. In other words, we can view the hidden layer of the network as a representation of the input. The role of the output layer is to take this new representation and compute a final output.
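A minimal NumPy sketch of that forward pass for a 2-layer feedforward network, keeping the intermediate variables z and a explicit; the layer sizes and random weights are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input vector
W = rng.normal(size=(4, 3))       # hidden-layer weights
b = rng.normal(size=4)            # hidden-layer bias
U = rng.normal(size=(2, 4))       # output-layer weights

z = W @ x + b                     # summation output at the hidden layer
a = sigmoid(z)                    # sigmoid output: the hidden representation of the input
y = U @ a                         # the output layer maps this representation to a final output
print(y)
```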

In logistic regression, for each observation we can directly compute the derivative of the loss function with respect to an individual weight w or bias b. The computation of the gradient requires the partial derivative of the loss function with respect to each parameter. The simplest use of computation graphs is to compute the value of the function with some given inputs.

The importance of the computation graph stems from the backward pass, which is used to compute the derivatives we need for the weight update. We give the beginning of the computation, calculating the derivative of the loss function L with respect to z, that is, ∂L/∂z (leaving the rest of the computation as an exercise for the reader). Continuing the backward computation of the gradients (next passing the gradients over b[2]1 and the two product nodes, and so on, back to all the teal nodes) is left as an exercise for the reader.
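A minimal sketch of such a forward-then-backward computation on a tiny graph, L = (a*x + b - y)^2, worked by hand with the chain rule; the variable names and numbers are invented for illustration, not the book's example:

```python
# Forward pass on a tiny computation graph: z = a*x + b, L = (z - y)**2
a, x, b, y = 2.0, 3.0, 1.0, 5.0
z = a * x + b          # z = 7.0
d = z - y              # d = 2.0
L = d ** 2             # L = 4.0

# Backward pass: apply the chain rule from L back toward the parameters
dL_dd = 2 * d          # dL/dd = 2(z - y) = 4.0
dL_dz = dL_dd * 1.0    # d = z - y, so dd/dz = 1
dL_da = dL_dz * x      # dz/da = x
dL_db = dL_dz * 1.0    # dz/db = 1
print(dL_dz, dL_da, dL_db)   # 4.0 12.0 4.0
```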

We don't want to learn separate weight matrices to map each of the previous 3 words to the projection layer. Neural language models use a neural network as a probabilistic classifier to calculate the probability of the next word given the previous words. Each cell keeps the probability of the best path so far and a pointer to the previous cell along that path.
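That cell-by-cell bookkeeping is the core of the Viterbi algorithm; here is a minimal sketch with made-up tag, transition, and emission probabilities (not the numbers from the book's Janet example):

```python
import numpy as np

# Toy HMM: 2 tags, 3 observed words; all probabilities invented for illustration
start = np.array([0.8, 0.2])                  # P(tag at t=0)
trans = np.array([[0.7, 0.3], [0.4, 0.6]])    # trans[i, j] = P(tag j | tag i)
emit = np.array([[0.5, 0.4, 0.1],             # emit[i, o] = P(word o | tag i)
                 [0.1, 0.3, 0.6]])
obs = [0, 1, 2]                               # observed word indices

T, N = len(obs), len(start)
viterbi = np.zeros((N, T))                    # best-path probability for each cell
backptr = np.zeros((N, T), dtype=int)         # pointer to the previous cell on that path

viterbi[:, 0] = start * emit[:, obs[0]]
for t in range(1, T):
    for j in range(N):
        scores = viterbi[:, t - 1] * trans[:, j] * emit[j, obs[t]]
        viterbi[j, t] = scores.max()
        backptr[j, t] = scores.argmax()

# Follow the backpointers from the best final cell to recover the best tag sequence
best = [int(viterbi[:, T - 1].argmax())]
for t in range(T - 1, 0, -1):
    best.append(int(backptr[best[-1], t]))
print(list(reversed(best)))
```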

Most of the cells in the column are zero, since the word Janet cannot be any of these tags. Later taggers explicitly introduced the use of the hidden Markov model (Kupiec 1992; Weischedel et al. 1993; Schütze and Singer 1994). See Householder (1995) for historical notes on parts of speech, and Sampson (1987) and Garside et al. (1997) on the origin of the Brown and other tagsets.

That is, the activation value of the hidden layer depends on the current input as well as the activation value of the hidden layer from the previous time step. This link augments the input to the computation at the hidden layer with the value of the hidden layer from the previous time step.
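In equations (restating the standard simple-RNN formulation, with g and f the hidden and output activation functions):

$$\mathbf{h}_t = g(\mathbf{U}\mathbf{h}_{t-1} + \mathbf{W}\mathbf{x}_t), \qquad \mathbf{y}_t = f(\mathbf{V}\mathbf{h}_t)$$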

