Document Information
- Authors:
 - Julia Silge
 - David Robinson

- Editorial and Production Credits:
 - Nicole Tache, Editor
 - Nicholas Adams, Production Editor
 - Sonia Saruba, Copyeditor
 - Charles Roumeliotis, Proofreader
 - WordCo Indexing Services, Inc., Indexer
 - David Futato, Interior Designer
 - Karen Montgomery, Cover Designer
 - Rebecca Demarest, Illustrator

- Publisher: O’Reilly Media
- Subject: Text Mining
- Title: Text Mining with R: A Tidy Approach
- Type: book
- Year: 2017
- City: Sebastopol
Document Summary
I. The Tidy Text Format
This section introduces the core concept of tidy text, emphasizing its alignment with tidy data principles.  It defines the tidy text format as a table where each row represents a single token (typically a word), contrasting this with alternative representations like strings or document-term matrices.  The importance of this structure for efficient data manipulation and integration with popular R packages like dplyr, tidyr, and ggplot2 is highlighted. The section also previews the use of the unnest_tokens() function for converting text data into the tidy text format and introduces tibbles as a modern and efficient data frame class in R.
1.1 Contrasting Tidy Text with Other Data Structures
This subsection compares the tidy text format to other common ways of storing text data, namely strings, corpus objects, and document-term matrices. It explains the advantages of the tidy text format in terms of ease of manipulation and integration with the tidyverse. The limitations of alternative formats—such as difficulties in performing operations like filtering or counting word frequencies—are contrasted with the streamlined approach enabled by tidy text. This sets the stage for understanding the benefits of adopting the tidy approach for text mining.
1.2 The unnest_tokens Function
This subsection details the unnest_tokens() function, a core component of the tidytext package. It provides a step-by-step illustration of how this function transforms a character vector or data frame containing text into a tidy text format, where each row represents a single token.  The function's arguments (output column name and input column name) are clearly explained.  The default behavior of converting tokens to lowercase and removing punctuation is discussed, along with the option to disable the lowercase conversion. The role of the tokenizers package in the tokenization process is mentioned.
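As a minimal illustration of the call pattern described here, a sketch in R; the two-line text is invented purely for demonstration:

```r
library(dplyr)
library(tidytext)

# A tiny invented example: two lines of text in a data frame (tibble)
text_df <- tibble(line = 1:2,
                  text = c("The tidy text format keeps one token per row.",
                           "Tokens are usually single words."))

# unnest_tokens(output column, input column): one word per row,
# lowercased and with punctuation stripped by default
# (pass to_lower = FALSE to keep the original case)
text_df %>%
  unnest_tokens(word, text)
```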
1.3 Tidying the Works of Jane Austen
This subsection demonstrates the application of the unnest_tokens() function to a real-world dataset: the works of Jane Austen. The janeaustenr package is introduced as a source of literary text data.  The process involves creating a data frame, pre-processing the text using mutate() to add linenumber and chapter information, then utilizing unnest_tokens() to convert the text into the tidy format. The process of removing stop words using anti_join() and the stop_words dataset is detailed, along with the subsequent use of count() to determine word frequencies and the creation of a visualization using ggplot2 to display the most frequent words. This showcases the practical application of tidy text principles and the integration of various tidyverse packages.
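A condensed sketch of the workflow just described, following the functions the subsection names; the chapter-detection regex and the frequency cutoff of 600 are the illustrative choices used in the book's example:

```r
library(dplyr)
library(stringr)
library(tidytext)
library(janeaustenr)
library(ggplot2)

# Annotate each line with its line number and (regex-detected) chapter,
# then tokenize into words and remove stop words
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Most frequent words across the novels, shown as a bar chart
tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col()
```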
1.4 The gutenbergr Package
This subsection introduces the gutenbergr package, providing a method to access public domain works from Project Gutenberg. It explains how to download books using gutenberg_download() and emphasizes the package's capabilities beyond simple downloading, such as accessing metadata to find works of interest.  The integration of gutenbergr with the tidy text workflow is demonstrated by providing a brief example using the downloaded texts for further analysis.  A link to the package tutorial is provided for more in-depth learning.
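A brief sketch of the download step; the Gutenberg IDs below are illustrative (they correspond to H.G. Wells novels used later in the chapter), and gutenberg_works() or the gutenberg_metadata dataset can be used to look up other IDs:

```r
library(dplyr)
library(gutenbergr)
library(tidytext)

# Download works by their Project Gutenberg IDs and tidy them
tidy_hgwells <- gutenberg_download(c(35, 36)) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

tidy_hgwells %>%
  count(word, sort = TRUE)
```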
1.5 Word Frequencies
This subsection expands on word frequency analysis by comparing word frequencies across different authors (Jane Austen, H.G. Wells, and the Brontë sisters). The process of downloading texts with gutenbergr and converting them into the tidy format is shown again. Word frequencies are calculated using count(), and the results are visualized with ggplot2 to compare the authors. The use of bind_rows() from dplyr together with spread() and gather() from tidyr to prepare the data for effective visualization and comparison is explained. The importance of str_extract() from stringr for handling specific textual features is highlighted, and a correlation test is performed to quantify the similarity of word frequencies between the different authors.
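A sketch of the comparison, reusing tidy_books and tidy_hgwells from the sketches above; tidy_bronte is assumed to have been built from the Brontë sisters' Gutenberg texts in the same way, and spread()/gather() follow the book's code (newer tidyr would use pivot_wider()/pivot_longer()):

```r
library(dplyr)
library(tidyr)
library(stringr)

frequency <- bind_rows(mutate(tidy_bronte,  author = "Brontë Sisters"),
                       mutate(tidy_hgwells, author = "H.G. Wells"),
                       mutate(tidy_books,   author = "Jane Austen")) %>%
  # strip the underscores Project Gutenberg uses to mark emphasis
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  spread(author, proportion) %>%
  gather(author, proportion, `Brontë Sisters`:`H.G. Wells`)

# How closely do the Brontës' word frequencies track Jane Austen's?
cor.test(data = frequency[frequency$author == "Brontë Sisters", ],
         ~ proportion + `Jane Austen`)
```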
1.6 Summary
This section summarizes the key concepts covered in Chapter 1. It reiterates the importance of tidy text format for efficient text mining, highlighting its compatibility with standard tidy tools and its applicability to various text analysis tasks, including removing stop words and calculating word frequencies. The adaptability of the one-token-per-row framework to different units of text (n-grams, sentences, etc.) is emphasized, setting the stage for the following chapters.
II. Sentiment Analysis with Tidy Data
Chapter 2 focuses on sentiment analysis, explaining how to programmatically assess the emotional content of text. The approach presented leverages the tidy text format and integrates seamlessly with the tidyverse. The chapter highlights the importance of treating the text as a collection of individual words and summing the sentiment contributions of each word to determine the overall sentiment of a piece of text. This approach contrasts with other, more complex, natural language processing (NLP) techniques.
2.1 The sentiments Dataset
This subsection introduces the sentiments dataset from the tidytext package, which contains several sentiment lexicons.  The three main lexicons—AFINN, Bing, and NRC—are described in detail, explaining their different scoring mechanisms (numeric scores, binary categories) and the types of sentiments they measure. The methods used to create and validate these lexicons are briefly discussed.  The limitations of using these lexicons with text from different eras or domains are acknowledged.
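The lexicons can be pulled into tidy form with get_sentiments(); note that in current tidytext releases the AFINN and NRC lexicons are fetched through the textdata package and may prompt for download on first use:

```r
library(tidytext)

# Each lexicon is a tidy data frame with one word per row
get_sentiments("bing")   # word + positive/negative label
get_sentiments("afinn")  # word + integer score (roughly -5 to 5)
get_sentiments("nrc")    # word + emotion/sentiment category (joy, fear, ...)
```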
2.2 Sentiment Analysis with Inner Join
This section demonstrates sentiment analysis using inner joins.  The process is shown using the NRC lexicon to find the most common joy words in Jane Austen's Emma. It explains how to prepare text data using unnest_tokens() and anti_join() for stop word removal.  The inner_join() function is presented as the core operation for linking words in the text with sentiment scores from the lexicon. The use of count() for counting word frequencies is emphasized, showcasing the efficiency of the tidyverse approach.
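A sketch of the inner-join workflow for the joy-words-in-Emma example; the tokenized frame is rebuilt here (without stop-word removal) so the block stands on its own:

```r
library(dplyr)
library(stringr)
library(tidytext)
library(janeaustenr)

# Tokenize the novels into one word per row, tracking line and chapter
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

# Words the NRC lexicon tags as "joy"
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

# inner_join() keeps only the words that appear in the joy list
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy, by = "word") %>%
  count(word, sort = TRUE)
```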
2.3 Comparing the Three Sentiment Dictionaries
This subsection compares the results obtained from using the three different sentiment lexicons (AFINN, Bing, and NRC) on Jane Austen's Pride and Prejudice. The process of calculating sentiment scores using inner_join() with different lexicons is demonstrated.  The differences in scoring mechanisms (numeric vs. binary) are addressed, and the method for aggregating sentiment scores across sections of the text using integer division (%/%) is explained. The results are visualized using ggplot2, showing how the sentiment trajectories differ between the lexicons while maintaining similar relative patterns throughout the novel.
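A sketch of the three-lexicon comparison on Pride and Prejudice, building on the tidy_books frame from the sketch above; the 80-line chunk size is the one used in the book, and the AFINN score column is named value in current tidytext releases:

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(ggplot2)

pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")

# AFINN is numeric, so net sentiment per 80-line chunk is a sum of scores
afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")

# Bing and NRC are binary, so net sentiment is positive minus negative counts
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing"), by = "word") %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive", "negative")),
               by = "word") %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

# Compare the three sentiment trajectories through the novel
bind_rows(afinn, bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
```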
2.4 Most Common Positive and Negative Words
This section focuses on identifying the most common positive and negative words in a text using sentiment lexicons.  It explains the process of using inner_join() to merge the tidy text data with the sentiment lexicon and using count() to determine the frequency of positive and negative words. The result is a ranked list of the most frequent words associated with each sentiment. This provides a quantitative measure of the emotional tone of the analyzed text.
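A sketch using the Bing lexicon and the tidy_books frame from above; slice_max() is a newer dplyr idiom standing in for the book's top_n():

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Count how often each word appears with each Bing sentiment label
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)

# Top contributors to positive and negative sentiment
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y")
```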
2.5 Wordclouds
This subsection delves into the creation of word clouds to visualize sentiment analysis results. The process of preparing the data for word cloud generation is described, emphasizing the use of word frequencies obtained through sentiment analysis. The visual nature of word clouds in representing the most frequent words associated with different sentiments is discussed. This approach is highlighted as a tool for quickly understanding the emotional content of the text.
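A sketch of both cloud types, again reusing tidy_books; reshape2::acast() is one way to build the word-by-sentiment matrix that comparison.cloud() expects:

```r
library(dplyr)
library(tidytext)
library(wordcloud)
library(reshape2)

# A plain word cloud of the most common non-stop words
tidy_books %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

# A comparison cloud: positive vs. negative words, sized by frequency
tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"), max.words = 100)
```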
2.6 Looking at Units Beyond Just Words
This subsection extends sentiment analysis beyond single words to larger units such as sentences or paragraphs. Considering larger chunks of text captures contextual information and nuance, which can help mitigate issues such as negation or sarcasm that are missed when words are scored in isolation. It also notes that the appropriate chunk size for sentiment analysis depends on the specific type of text being examined.
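A short sketch of tokenizing by units other than single words; prideprejudice is a character vector shipped with janeaustenr, and the chapter-splitting regex is illustrative:

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

# Split into sentences rather than words
austen_sentences <- tibble(text = prideprejudice) %>%
  unnest_tokens(sentence, text, token = "sentences")

# Or split on a regex, e.g. into chapters
austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex",
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()
```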
2.7 Summary
This section summarizes the key concepts of Chapter 2.  It reiterates the process of using tidy data principles for sentiment analysis, demonstrating how inner_join() connects text data with sentiment lexicons for calculating overall sentiment.  The use of multiple sentiment lexicons and the visualization of results using ggplot2 are highlighted.  The importance of selecting appropriate sentiment lexicons based on text type and domain is emphasized, and the limitations of dictionary-based methods are acknowledged.
III. Analyzing Word and Document Frequency: tf-idf
Chapter 3 introduces tf-idf (term frequency–inverse document frequency), a statistical measure used to identify words that are particularly important to a specific document within a collection of documents. This chapter explains the calculation and interpretation of tf-idf, demonstrating its use in identifying key terms that distinguish one document from others.
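A minimal sketch of the tf-idf calculation with bind_tf_idf(), treating each Austen novel as a "document"; the choice of corpus here is purely illustrative:

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

# Word counts per novel, then tf-idf with the novel as the document unit
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

book_words %>%
  bind_tf_idf(word, book, n) %>%
  arrange(desc(tf_idf))
```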
IV. Relationships Between Words: N-grams and Correlations
Chapter 4 delves into the analysis of word relationships, focusing on n-grams (sequences of n words) and word correlations.  It covers methods for tokenizing text into n-grams and analyzing the resulting frequencies. The use of bigrams (2-word sequences) to provide context in sentiment analysis is presented.  Visualizations using ggraph are used to illustrate the relationships between words in a network graph format. Techniques using the widyr package for counting and correlating pairs of words are also explained.
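A sketch of both techniques on the Austen texts: bigram counts after stop-word filtering, and widyr's pairwise_cor() over 10-line sections of one novel. The section size and the n() >= 20 cutoff are illustrative choices:

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(janeaustenr)
library(widyr)

# Tokenize into bigrams, then separate the two words to filter stop words
bigram_counts <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)

# Correlations between words that co-occur within 10-line sections
word_cors <- austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE)
```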
V. Converting to and from Nontidy Formats
Chapter 5 addresses the practical issue of working with text data in various formats. It presents methods for converting between tidy text formats and other common text representations, such as document-term matrices and corpus objects from the tm and quanteda packages.  The techniques described allow for seamless integration of tidy text methods with other text analysis tools and workflows.
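A sketch of the round trip between the two representations: cast_dtm() builds a document-term matrix from tidy counts, and tidy() converts it back to one row per document-term pair (the tm package supplies the DocumentTermMatrix class):

```r
library(dplyr)
library(tidytext)
library(janeaustenr)
library(tm)  # provides the DocumentTermMatrix class

# Tidy counts -> document-term matrix
austen_dtm <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word) %>%
  cast_dtm(book, word, n)

# Document-term matrix -> tidy one-row-per-(document, term, count) form
tidy(austen_dtm)
```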
VI. Topic Modeling
Chapter 6 explores topic modeling, a technique for discovering underlying themes or topics within a collection of documents. It explains the Latent Dirichlet Allocation (LDA) method and uses the tidy() function to interpret and visualize the output. This provides a structured approach to understanding the latent topics within a corpus.
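A sketch using the AssociatedPress document-term matrix that ships with topicmodels (the dataset used in the book's example); k = 2 and the seed are arbitrary illustrative choices:

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# A document-term matrix of AP news articles, bundled with topicmodels
data("AssociatedPress")

# Fit a two-topic LDA model
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))

# Per-topic word probabilities (beta) in tidy form;
# matrix = "gamma" would give per-document topic proportions instead
ap_topics <- tidy(ap_lda, matrix = "beta")

ap_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, desc(beta))
```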
VII. Case Study: Comparing Twitter Archives
This chapter presents a case study applying the techniques learned in previous chapters to analyze Twitter data. It demonstrates how to retrieve, process, and analyze Twitter archives to compare the tweeting habits of two individuals, highlighting the practical applications of tidy text mining.
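A hedged sketch of the ingestion and tokenization step; the file name and column names are assumptions about a downloaded Twitter archive, and the splitting regex is a simplified version of the kind of pattern the case study uses to keep hashtags and mentions intact:

```r
library(dplyr)
library(readr)
library(lubridate)
library(stringr)
library(tidytext)

# "tweets.csv" is a hypothetical downloaded Twitter archive; the column
# names (text, timestamp) are assumptions about its layout
tweets <- read_csv("tweets.csv") %>%
  mutate(timestamp = ymd_hms(timestamp))

# Tokenize with a regex that keeps #hashtags and @mentions intact
tidy_tweets <- tweets %>%
  filter(!str_detect(text, "^RT")) %>%                      # drop retweets
  mutate(text = str_remove_all(text, "https?://\\S+")) %>%  # drop URLs
  unnest_tokens(word, text, token = "regex",
                pattern = "[^A-Za-z\\d#@']") %>%
  filter(!word %in% stop_words$word)

tidy_tweets %>%
  count(word, sort = TRUE)
```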
VIII. Case Study: Mining NASA Metadata
Chapter 8 presents another case study focusing on analyzing metadata from NASA datasets. It showcases the process of wrangling and tidying JSON data, performing exploratory analysis, identifying word co-occurrences, and constructing networks to visualize relationships between keywords and descriptions. This example illustrates the use of tidy text mining for data exploration and knowledge discovery in a large, complex dataset.
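A sketch of the wrangling step; the URL is the NASA metadata endpoint referenced in the book, and the field names (title, description) are assumptions about the JSON structure, which may change over time:

```r
library(jsonlite)
library(dplyr)
library(tidytext)

# Download NASA's dataset metadata as parsed JSON
metadata <- fromJSON("https://data.nasa.gov/data.json")

# Tidy the free-text descriptions and tokenize them
nasa_desc <- tibble(title = metadata$dataset$title,
                    desc  = metadata$dataset$description) %>%
  unnest_tokens(word, desc) %>%
  anti_join(stop_words, by = "word")

nasa_desc %>%
  count(word, sort = TRUE)
```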
IX. Case Study: Analyzing Usenet Text
This final case study uses Usenet messages to demonstrate a comprehensive text mining workflow. It covers preprocessing techniques, word frequency analysis, tf-idf calculation, topic modeling, sentiment analysis, and n-gram analysis, showcasing the integrated use of the methods and tools presented throughout the book.
Document References
- Essentials of Programming Languages (Hal Abelson)
- cleanNLP: A Tidy Data Model for Natural Language Processing (Taylor B. Arnold)
- quanteda: Quantitative Analysis of Textual Data (Kenneth Benoit and Paul Nulty)
- Text Mining Infrastructure in R (Ingo Feinerer, Kurt Hornik, and David Meyer)
- When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks (Tim Loughran and Bill McDonald)