
(1)

Hands On Classification & Clustering

Hadaiq Rolis Sanabila

[email protected]

Natural Language Processing and Text Mining

(2)

CLASSIFICATION USING NLTK


(3)

Preparation

• Download nltk.corpus -> names

– Go to the Python interpreter in CMD

• import nltk

• nltk.download()

• choose the names corpora

• find: C:\Users\MIC-UI\AppData\Roaming\nltk_data\corpora
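The same download can be scripted instead of using the interactive downloader; a minimal sketch ('names' is the standard NLTK identifier for this corpus):

import nltk

# fetch the names corpus straight into nltk_data/corpora, no GUI needed
nltk.download('names')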

(4)

Classification

• Prepare the feature function:

def gender_features(word):
    features = {}
    features['last_letter'] = word[-1]
    return features
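A quick check of what the feature extractor returns (sketch; the example name is arbitrary):

# each name is mapped to a one-entry feature dictionary holding its last letter
print(gender_features('Shrek'))   # {'last_letter': 'k'}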

(5)

Classification (cont.)

• Prepare the corpus:

from nltk.corpus import names

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

# print(labeled_names)

(6)

Classification (cont.)


• Shuffle the data (avoid bias):

import random

random.shuffle(labeled_names)
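If you want the shuffle (and therefore the train/test split) to be reproducible between runs, you can seed the generator first; a small sketch with an arbitrary seed value:

# fixing the seed makes random.shuffle produce the same order every run
random.seed(1234)
random.shuffle(labeled_names)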

(7)

Classification (cont.)

• Divide the data into a training set and a test set:

from nltk.classify import apply_features

train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])

# print(test_set)
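apply_features builds the feature sets on demand, so the whole corpus is never held in memory as feature dictionaries. An equivalent eager version using plain lists (sketch):

# same split, materialised as ordinary lists of (features, label) pairs
train_set = [(gender_features(n), g) for (n, g) in labeled_names[500:]]
test_set = [(gender_features(n), g) for (n, g) in labeled_names[:500]]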

(8)

Classification (cont.)


• Train the classifier

import nltk

classifier = nltk.NaiveBayesClassifier.train(train_set)

(9)

Classification (cont.)

• Test the classifier

print(classifier.classify(gender_features('Neo')))
print(classifier.classify(gender_features('Trinity')))
print(nltk.classify.accuracy(classifier, test_set))

classifier.show_most_informative_features(5)

(10)

Classification Using Own Dataset


• Copy the dataset (folder “kompas”) to:

– C:\Users\MIC-UI\AppData\Roaming\nltk_data\corpora

• Check the dataset:

– nltk.data.find('corpora/kompas')

• Download nltk.corpus -> stopwords

– copy the file stop_indo to:

• C:\Users\MIC-UI\AppData\Roaming\nltk_data\corpora\stopwords
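A small sketch for verifying both copies from the Python prompt (nltk.data.find raises a LookupError when a resource is not on NLTK's data path):

import nltk

# both calls print the resolved path if the folders/files were copied correctly
print(nltk.data.find('corpora/kompas'))
print(nltk.data.find('corpora/stopwords/stop_indo'))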

(11)

Classification Using Own Dataset (Cont.)

• Import the required library

import string

from itertools import chain

from nltk.corpus import stopwords

from nltk.probability import FreqDist

from nltk.classify import NaiveBayesClassifier as nbc

from nltk.corpus import CategorizedPlaintextCorpusReader

import nltk

(12)

Classification Using Own Dataset (Cont.)


• Prepare the corpus:

mydir = 'C:/Users/MIC-UI/AppData/Roaming/nltk_data/corpora/kompas'

mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt', cat_pattern=r'(ekonomi|kesehatan|olahraga|properti|travel)/.*', encoding='ascii')

stop = stopwords.words('stop_indo')

documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
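A quick sanity check (sketch): each entry in documents pairs the filtered tokens of one file with its category, which comes from the directory name.

# inspect the first document: (token list, category label)
tokens, category = documents[0]
print(category, tokens[:10])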

(13)

Classification Using Own Dataset (Cont.)

• Select the features:

word_features = FreqDist(chain(*[i for i, j in documents]))

# FreqDist keys cannot be sliced in Python 3; take the 100 most frequent words instead
word_features = [w for w, _ in word_features.most_common(100)]

(14)

Classification Using Own Dataset (Cont.)


• Divide the data into a training set and a test set:

numtrain = int(len(documents) * 90 / 100)

train_set = [({i: (i in tokens) for i in word_features}, tag)
             for tokens, tag in documents[:numtrain]]
test_set = [({i: (i in tokens) for i in word_features}, tag)
            for tokens, tag in documents[numtrain:]]

(15)

Classification Using Own Dataset (Cont.)

• Train and test the classifier:

classifier = nbc.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))

classifier.show_most_informative_features(5)
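Optionally, the trained classifier can be saved so it does not need to be retrained every session; a sketch using the standard library (the file name is an arbitrary choice):

import pickle

# serialise the NLTK classifier object to disk for later reuse
with open('kompas_nb.pickle', 'wb') as f:
    pickle.dump(classifier, f)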

(16)

Hints:

1. Lower-casing: The entire email is converted into lower case, so that capitalization is ignored (e.g., IndIcaTE is treated the same as Indicate).

2. Stripping HTML: All HTML tags are removed from the emails. Many emails come with HTML formatting; we remove all the HTML tags so that only the content remains.

3. Normalizing URLs: All URLs are replaced with the text “httpaddr”.

4. Normalizing Email Addresses: All email addresses are replaced with the text “emailaddr”.

5. Normalizing Numbers: All numbers are replaced with the text “number”.

6. Word Stemming: Words are reduced to their stemmed form. For example,

“discount”, “discounts”, “discounted” and “discounting” are all replaced with “discount”.

7. Removal of non-words: Non-words and punctuation have been removed. All whitespace (tabs, newlines, spaces) has been trimmed to a single space character.

8. Try to get representative words from each class as features

Task

Update your model (especially the feature selection process) to get higher precision & recall.

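Since the task asks for higher precision and recall, it helps to measure them per class rather than only overall accuracy. A minimal sketch using nltk.metrics, assuming the classifier and test_set from the previous slides:

import collections
from nltk.metrics import precision, recall

# group test items by their true label and by their predicted label
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(test_set):
    refsets[label].add(i)
    testsets[classifier.classify(feats)].add(i)

# per-class precision and recall, computed from the two index sets
for label in sorted(refsets):
    print(label, 'precision:', precision(refsets[label], testsets[label]))
    print(label, 'recall:', recall(refsets[label], testsets[label]))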

(17)

CLUSTERING USING NLTK

(18)

Preparation


• Copy the dataset (folder “kompas_clustering”) to:

– C:\Users\MIC-UI\AppData\Roaming\nltk_data\corpora

• Check the dataset:

– nltk.data.find('corpora/kompas_clustering')

(19)

Clustering

• Prepare the function

def process_text(text, stem=False):
    tokens = word_tokenize(text)
    return tokens
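The stem flag is not used yet; a sketch of one way to wire it up with the PorterStemmer imported on the next slide (Porter stemming targets English, so this is only illustrative here):

from nltk import word_tokenize
from nltk.stem import PorterStemmer

def process_text(text, stem=False):
    tokens = word_tokenize(text)
    if stem:
        # reduce each token to its Porter stem when stemming is requested
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens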

(20)

Clustering (cont.)


• Prepare the required library

import string

import collections

from nltk import word_tokenize

from nltk.stem import PorterStemmer

from nltk.corpus import stopwords

from sklearn.cluster import KMeans

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import HashingVectorizer

from pprint import pprint

(21)

Clustering (cont.)

• Prepare the clustering function

def cluster_texts(texts, clusters=3):
    vectorizer = HashingVectorizer(tokenizer=process_text,
                                   stop_words=stopwords.words('stop_indo'),
                                   lowercase=True)
    tfidf_model = vectorizer.fit_transform(texts)
    km_model = KMeans(n_clusters=clusters, n_init=100)
    km_model.fit(tfidf_model)
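The TfidfVectorizer imported earlier can be swapped in for HashingVectorizer with the same parameters if real tf-idf weights are preferred; a sketch of that variant of the two vectorizer lines:

# drop-in alternative: tf-idf weighting instead of feature hashing
vectorizer = TfidfVectorizer(tokenizer=process_text,
                             stop_words=stopwords.words('stop_indo'),
                             lowercase=True)
tfidf_model = vectorizer.fit_transform(texts)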

(22)

Clustering (cont.)


• Prepare the clustering function

def cluster_texts(texts, clusters=3):
    ...
    clustering = collections.defaultdict(list)
    for idx, label in enumerate(km_model.labels_):
        clustering[label].append(idx)
    return clustering

(23)

Clustering (cont.)

• Prepare the Main function

if __name__ == "__main__":
    contents = list()

    import os
    rootdir = 'C:/Users/…/nltk_data/corpora/kompas_clustering/'
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            f = open(rootdir + file, 'r')
            lines = f.readlines()
            f.close()
            tmp = str()
            for line in lines:
                tmp += line.strip() + " "
            contents.append(tmp)
    articles = contents
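The slide stops after collecting the articles; a sketch (an assumption about the intended next step) of finishing the main block by clustering them and printing which file indices land in each cluster:

    # group the collected articles into 3 clusters and list the indices per cluster
    clusters = cluster_texts(articles, clusters=3)
    pprint(dict(clusters))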

(24)

Hints:

1. Lower-casing: The entire email is converted into lower case, so that capitalization is ignored (e.g., IndIcaTE is treated the same as Indicate).

2. Stripping HTML: All HTML tags are removed from the emails. Many emails come with HTML formatting; we remove all the HTML tags so that only the content remains.

3. Normalizing URLs: All URLs are replaced with the text “httpaddr”.

4. Normalizing Email Addresses: All email addresses are replaced with the text “emailaddr”.

5. Normalizing Numbers: All numbers are replaced with the text “number”.

6. Word Stemming: Words are reduced to their stemmed form. For example,

“discount”, “discounts”, “discounted” and “discounting” are all replaced with “discount”.

7. Removal of non-words: Non-words and punctuation have been removed. All whitespace (tabs, newlines, spaces) has been trimmed to a single space character.

Task

Update your model to get better clusters.


(25)

Thank you
