CLEAR December 2013 Page 25
This article is a brief introduction to the Orange data
mining tool and its use with Python scripts. Have fun.
Data Mining is the computational process of
discovering patterns in large data sets involving
methods at the intersection of artificial
intelligence, machine learning, statistics, and
database systems. It has applications in areas
such as medical science and business
intelligence. Many open source and
proprietary software tools are available for data
mining applications and research. Orange
(http://orange.biolab.si) is a general-purpose,
open source machine learning and data mining
tool. It includes a set of components for data
preprocessing, feature scoring and filtering,
modeling, model evaluation, and exploration
techniques.
This article discusses Python scripting with the
Orange package.
Install Orange:
To build and install Orange from source, you can
use the setup.py script in the root Orange directory
(this requires GCC, Python, and the numpy
development headers). Details are available on
the Orange download page
(http://orange.biolab.si/download/).
Test the installation
After installing Orange, type the following in the
Python interactive shell.
>>> import Orange
>>> Orange.version.version
'2.6a2.dev-a55510d'
If this produces no errors or warnings (the version
string will vary with your installation), Orange and
Python are properly installed and you are ready
to continue.
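The same check can be wrapped in a small script, which is convenient when testing several machines. This is only a sketch: the helper name orange_version is made up for illustration, and it assumes the version string is exposed as Orange.version.version, as in the session above. If that attribute is missing, it reports "unknown" instead of failing.

```python
import importlib


def orange_version():
    """Return Orange's version string, or None if Orange cannot be imported."""
    try:
        orange = importlib.import_module("Orange")
    except ImportError:
        return None
    # Assumption: the version lives at Orange.version.version, as in the
    # interactive session shown in the article. Fall back to "unknown"
    # rather than raising if the attribute is not there.
    version = getattr(getattr(orange, "version", None), "version", None)
    return version if version is not None else "unknown"


if __name__ == "__main__":
    found = orange_version()
    if found is None:
        print("Orange is not installed or not on the Python path")
    else:
        print("Orange version: %s" % found)
```

Returning None instead of raising lets a larger script decide for itself how to react to a missing installation.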
A Fruitful Data Mining Using ORANGE
Manu Madhavan
Data Input
Orange can read files in its native format and in
several other data formats. A native-format file is
tab-delimited. The first line lists the feature
(attribute) names and the second line gives their
types (continuous, discrete, or string). The third
line contains meta information that identifies the
dependent feature (class), irrelevant features
(ignore), or meta features (meta).
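To make the three-line header concrete, the sketch below writes a tiny file in this layout using only the Python standard library and then reads the header back. The column names, values, and the file name toy.tab are invented for illustration and are not taken from lenses.tab. The helpers write_tab and read_header are hypothetical and do not call Orange.

```python
import os
import tempfile

# Hypothetical toy data, invented for illustration only.
NAMES = ["age", "tear_rate", "patient_id", "lenses"]
TYPES = ["continuous", "discrete", "string", "discrete"]
FLAGS = ["", "", "meta", "class"]  # an empty flag marks an ordinary feature
ROWS = [
    ["25", "normal", "p1", "soft"],
    ["61", "reduced", "p2", "none"],
]


def write_tab(path, names, types, flags, rows):
    """Write the three header lines followed by tab-separated data rows."""
    with open(path, "w") as handle:
        for record in [names, types, flags] + rows:
            handle.write("\t".join(record) + "\n")


def read_header(path):
    """Return one (name, type, flag) tuple per column."""
    with open(path) as handle:
        header = [handle.readline().rstrip("\n").split("\t") for _ in range(3)]
    names, types, flags = header
    return list(zip(names, types, flags))


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "toy.tab")
    write_tab(path, NAMES, TYPES, FLAGS, ROWS)
    for column in read_header(path):
        print("%s: type=%s flag=%s" % column)
```

A file written this way should be loadable with Orange.data.Table, but that step is left out here so the sketch runs without Orange installed.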
You may download lenses.tab to a target
directory and open a Python shell there.
>>> import Orange
>>> data = Orange.data.Table("lenses")
Data mining using Orange-python
Orange's Python interface can be used for data
mining tasks such as classification,
prediction, clustering, and learning.
Here I illustrate how Orange can be used
for classification.
For classification, you need a training data set
and a test data set. Data readable by Orange
methods is stored in a .tab file and loaded as a
table. The table contains the features of
the data and the class value. For training data, the
class value is the actual class to which the
feature vector belongs. For test data, the class
value is absent and has to be predicted by a
classifier built by a learning function. Orange
has a wide variety of learning functions, such as
kNN, Least Mean Square Error, Naive Bayes, and
logistic/linear regression. The following
sample code uses the kNN method to predict the
class value of the test data set.
Suppose the training data set is stored in trainset.tab
and the test data set in testset.tab. The
following script prints the predicted class of each
test instance.
>>> train = Orange.data.Table("trainset")
>>> test = Orange.data.Table("testset")
>>> learner = Orange.classification.knn.kNNLearner()
>>> classifier = learner(train)
>>> for i in range(len(test)):
...     print i, classifier(test[i])
This prints the predicted class of each vector in the
test set. You can verify the results by comparing
them with those obtained from other learning tools.
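For readers curious about what a k-nearest-neighbour classifier does, the sketch below implements the basic idea in plain Python: find the k training vectors closest to the query and take a majority vote over their classes. It is not Orange's implementation. It assumes purely numeric features, Euclidean distance, and unweighted voting. The function names and the toy data are invented for illustration.

```python
from collections import Counter
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Predict a label for query by majority vote of its k nearest neighbours.

    train is a list of (features, label) pairs; query is a feature list.
    """
    neighbours = sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy training set with two well-separated classes.
TRAIN = [
    ([1.0, 1.0], "a"), ([1.2, 0.8], "a"), ([0.9, 1.1], "a"),
    ([5.0, 5.0], "b"), ([5.2, 4.9], "b"), ([4.8, 5.1], "b"),
]

if __name__ == "__main__":
    for query in ([1.1, 1.0], [5.0, 5.1]):
        print("%s -> %s" % (query, knn_predict(TRAIN, query)))
```

A hand-written baseline like this is one simple way to cross-check the classes that another tool predicts on small numeric data sets.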