A Fruitful Data Mining Using Orange

CLEAR December 2013

This article is a brief introduction to the Orange data mining tool and its use with Python scripts. Have fun!

Data mining is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. It has important applications in fields such as medical science and business intelligence. Many open source and proprietary software tools are available for data mining applications and research. Orange (http://orange.biolab.si) is a general-purpose, open source machine learning and data mining tool. It includes a set of components for data preprocessing, feature scoring and filtering, modeling, model evaluation, and exploration techniques.

The rest of this article discusses Python scripting with the Orange package.

Install Orange:

To build and install Orange, you can use the setup.py script in the root orange directory (this requires GCC, Python, and the numpy development headers). Details are available on the Orange download page (http://orange.biolab.si/download/).

Test the installation

After installing Orange, type the following in the Python interactive shell.

>>> import Orange
>>> Orange.version.version
'2.6a2.dev-a55510d'

If this produces no errors or warnings, Orange and Python are properly installed and you are ready to continue. The version string on your machine may differ from the one shown here.
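The same check can be done from a script rather than the interactive shell. The following is a minimal sketch: the helper name orange_version is hypothetical, and it only reads the Orange.version.version attribute shown above, returning None when Orange cannot be imported or the attribute is missing.

```python
def orange_version():
    """Return Orange's version string, or None if it cannot be determined."""
    try:
        import Orange
    except ImportError:
        # Orange is not installed for this Python interpreter.
        return None
    # Mirrors the interactive check: Orange.version.version
    return getattr(getattr(Orange, "version", None), "version", None)


installed = orange_version()
if installed is None:
    print("Orange could not be imported; check the installation.")
else:
    print("Orange version: %s" % installed)
```

A script can call this helper first and stop early with a clear message instead of failing later with an ImportError.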

Manu Madhavan

Data Input

Orange can read files in its native format and in other data formats. The native tab-delimited format starts with a line of feature (attribute) names, followed by a line giving their types (continuous, discrete, or string). The third line contains meta information identifying dependent features (class), irrelevant features (ignore), or meta features (meta).
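To make the three header lines concrete, here is a small sketch that builds a tab-delimited file in memory and reads its header back using only the Python standard library, not Orange's own reader. The data set, the feature names, and the helper read_header are all hypothetical. The sketch is written for Python 3.

```python
import csv
import io

# Hypothetical toy data set laid out in the three-line header format.
TOY_TAB = (
    "age\tastigmatic\ttear_rate\tpatient_id\tlenses\n"   # line 1: feature names
    "discrete\tdiscrete\tdiscrete\tstring\tdiscrete\n"   # line 2: feature types
    "\t\t\tmeta\tclass\n"                                # line 3: class/ignore/meta flags
    "young\tno\tnormal\tp01\tsoft\n"                     # data rows follow
    "young\tyes\treduced\tp02\tnone\n"
)


def read_header(text):
    """Split tab-delimited text into per-feature header info and data rows."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    names, types, flags = rows[0], rows[1], rows[2]
    header = [
        {"name": n, "type": t, "flag": f}
        for n, t, f in zip(names, types, flags)
    ]
    return header, rows[3:]


header, records = read_header(TOY_TAB)
class_features = [h["name"] for h in header if h["flag"] == "class"]
print(class_features)
print(len(records))
```

Here the lenses column is flagged as the class, patient_id is a meta feature, and the remaining columns are ordinary features with an empty flag.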

You may download lenses.tab to a target directory and open a Python shell there.

>>> import Orange
>>> data = Orange.data.Table("lenses")

Data mining using Orange-python

Orange's Python interface can be used for data mining tasks such as classification, prediction, clustering, and learning.

Here I illustrate how Orange can be used for classification.

For classification, you need a training data set and a test data set. Data readable by Orange is stored in a .tab file and loaded as a table. The table contains the features of each instance and its class value. For training data, the class value is the actual class to which the feature vector belongs. For the test data set, the class value is absent and has to be predicted by a learning function. Orange has a wide variety of learning functions, such as kNN, least mean square error, naive Bayes, and logistic/linear regression. The following sample code uses the kNN method to predict the class values of the test data set.

Let the training data set be stored in trainset.tab and the test data set in testset.tab. The following script prints the predicted class of each instance in the test data.

>>> train = Orange.data.Table("trainset")
>>> test = Orange.data.Table("testset")
>>> learner = Orange.classification.knn.kNNLearner()
>>> classifier = learner(train)
>>> for i in range(len(test)):
...     print i, classifier(test[i])

This prints the predicted class of each instance in the test set. You can verify the results by comparing them with those obtained from other learning tools.
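For readers curious about what a k-nearest-neighbour classifier does, the following pure-Python sketch shows the basic idea: find the k training instances closest to a query and return their majority class. It is an illustration only, not Orange's implementation. The Hamming distance, the helper names, and the toy data are assumptions made for this example.

```python
from collections import Counter


def hamming(a, b):
    """Count the positions at which two discrete feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """Return the majority class among the k training instances nearest to query.

    train is a list of (features, label) pairs.
    """
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (age, astigmatic, tear_rate) -> lens type.
train = [
    (("young", "no", "normal"), "soft"),
    (("young", "yes", "normal"), "hard"),
    (("old", "no", "reduced"), "none"),
    (("old", "yes", "reduced"), "none"),
    (("young", "no", "reduced"), "none"),
]

# Test instances have no class value; it is predicted from the neighbours.
test = [("old", "no", "reduced"), ("young", "no", "normal")]
predictions = [knn_predict(train, features, k=1) for features in test]
print(predictions)
```

With k=1 each test instance simply takes the class of its single closest training instance. Larger values of k smooth the prediction by letting several neighbours vote.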