C HAPTER 7
4. Introduction to WEKA
It is being developed by Waikato University in New Zealand. WEKA which stands for “Waikato Environment for Knowledge Analysis”, is a comprehensive collection of machine learning algorithms employed in data mining tasks. WEKA is coded in java and is open source software released under the GNU General Public License. It can be run on Windows, Macintosh, Linux operating systems and almost all platforms. By connecting to databases via the Java Database Connectivity (JDBC) driver, it can treat a query consequence and store the results of the transaction in the databases.
WEKA is kept up-to-date with commendable efforts by its developers to include even the latest algorithms in the field of data mining. The current algorithms are included in the form of plug-in packages, and users can access and install the packages through the package management system. Thanks to its diversity of algorithms, it paves the way for users to solve their problems with different and up-to-date methods and to compare the solution methods without the demand for code writing.
The GUI Chooser shown in Figure 1, which welcomes the user when WEKA starts, allows switching between five interactive interfaces. The Explorer is the basic section that contains the tools for the algorithms used to examine and analyze a dataset and visualization. The Experimenter provides performance statistics by benchmarking different classifiers or filters applied to the problem. In addition, advanced users can distribute the computational load to multiple machines by using a Java remote method invocation. The Knowledge Flow is an interface which serves to establish learning models in the form of a data stream by combining graphical blocks representing data sources, preprocessing tools, learning algorithms, evaluation methods and visualization modules. The Workbench is a stand binding all graphical interfaces within a single window in which the appearance of applications and plug-ins can be customized. Even if the interactive interfaces fulfill the need for many problems, in case they are inadequate for advanced analysis, the SimpleCLI which is a text based coding section completes the task. The SimpleCLI is also advantageous in terms of memory consumption.
4.1 Attribute-Relation File Format
Attribute-Relation File Format (ARFF) developed for WEKA is a text-based dataset file consisting of two distinct sections called HEADER and DATA as shown in the Figure 2 below. The Header Section contains information lines about the source and content of the data set, which are written after “%” character, in order to inform the user; however, the comment lines are not interpreted by WEKA. In addition to the description lines, @relation refers to the descriptive name of the data set and this information is displayed in the Current Relation Field in the perspective of Preprocess when the data set is loaded.
The last part of the Header lists attributes with their types exposing the data structure. @attribute refers to the attribute name
108 Logistics 4.0: Digital Transformation of Supply Chain Management
Fig. 1: WEKA GUI Chooser.
followed by its type definition or nominal values. Type definition consists of numeric, string, date and relational. Curly Brackets, commas and spaces are used composing nominal values. The data section that is started with the @data tag contains instances on each row where it is mandatory that the sequence corresponds to the above definition and sequence declared by
@attribute. For unknown attribute value question mark is available instead of its value.
4.2 Explorer
It is the most commonly used interface in which tasks operated in the data mining process are provided to the user in six perspectives which are preprocess, classify, cluster, associate, select attributes, visualize. Initially, the data discovery to be analyzed is started on the preprocessing perspective, because the other perspectives will invisible unless a data set is properly loaded in the preprocessing section.
4.2.1 Preprocess
The data to be processed can be obtained from a file, a URL address, a database source or can be created artificially by means of DataGenerators. Furthermore, an interior editor is also provided for manual manipulation of data. This first step window performs two important tasks on the data set. The former provides information, such as the number of instances, the number of attributes, and statistical information for each attribute, to grasp the structure of the data set.
The latter provides a variety of filters implementing processes such as cleansing, transforming, integrating, reducing and discretizing. In the Filter Panel, filter parameters can be assigned in the TextField as text-based or by selecting options on the GenericObjectEditor, the visual interface that appears when left-clicking. Changes made to the data can be canceled or saved for later use. Preprocess Screen can be viewed in Figure 3.
4.2.2 Classify
Numerous classifiers accessed by clicking the Classifier Button are organized according to key approaches as shown in Figure 4. The ones of the provided algorithms compatible with the dataset loaded in the previous step are visible and others are invisible in the list. The parameters of the selected algorithm can be edited with its TextField or GenericObjectEditor, as in classifiers counterpart. Four types of methods are presented to evaluate classifier performance.
1. Use training set: The classifier is tested with the data set used in its training.
2. Supplied test set: The classifier is tested with an exterior data set that is not used in its training.
3. Cross Validation: The data set is subdivided into groups and each one is held for testing, while others are used for training.
4. Percentage Split: The data set is subdivided into a training set and a test set based on a user-defined percentage.
% Popular dataset related to the problem
% of play game depending on weather conditions.
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no sunny,80,90,TRUE,no overcast,83,86,FALSE,yes rainy,70,96,FALSE,yes rainy,68,80,FALSE,yes rainy,65,70,TRUE,no overcast,64,65,TRUE,yes sunny,72,95,FALSE,no sunny,69,70,FALSE,yes rainy,75,80,FALSE,yes sunny,75,70,TRUE,yes overcast,72,90,TRUE,yes overcast,81,75,FALSE,yes rainy,71,91,TRUE,no
Fig. 2: ARFF File.
Measurement of classifier performance can be elaborated using additional evaluation options and specific evaluation metrics. These extra options are invoked via the more options button on the test panel. Classifier Output Panel is the area where the results of the training and test operations are explained. The structure of the data set, the learning scheme and the test statistics are presented in detail here. As for the Result List Panel, it holds a list of results for each classification attempt. Through this panel, the user can compare the results of classification experiments, graphically review the results, and also store them.
4.2.3 Cluster
The Clusterer Button brings up the list of clustering schemas. Similar to classifiers perspective, the parameters of the chosen algorithm can be edited with its TextField or GenericObjectEditor. The Ignore Attributes Button which throws undesired attributes out is located under the cluster mode panel. The Store Cluster option determining whether the clustering results will be visualized is productive for data sets requiring enormous memory usage. Four methods are present for evaluating the clustering performance:
110 Logistics 4.0: Digital Transformation of Supply Chain Management
Fig. 3: Preprocess.
Fig. 4: Classify.
1. Use training set: The training set is classified according to the clusters obtained and the number of instances per cluster is calculated.
2. Supplied test set: The boundaries of the clusters can be evaluated on a separate test data.
3. Percentage split: The data set is split into two parts as the training set and the test set, considering a user-defined certain percentage. The clusters generated using the training segment are evaluated with the test segment.
4. Classes to clusters evaluation: Clustering is assessed by taking into account predefined classes in the data set and results are represented in the confusion matrix.
The Result List and Clusterer Output Panels are no different from those of the classify perspective shown in Figure 5.
Fig. 5: Cluster.
4.2.4 Associate
The algorithms in this perspective shown in Figure 6 reveal the association rules among attributes in a data set. The algorithms are preferred from the Associator Field. The parameters of the algorithms can be edited with Textfield or GenericObjectEditor. Once the Start Button is clicked, the rules obtained are listed in Associator Output Field.
Fig. 6: Associate.
4.2.5 Select attributes
Attribute Evaluator and Search Method are used to establish which attributes are furthest convenient for classification or prediction. The attribute evaluator assigns which extraction method will be used, while the search method assigns what search approach will be performed. The whole data set can be handled in the evaluation, as well as cross-validation. In the Attribute Selection Output field, the selected attributes and their associated statistics are output. Figure 7 is a screenshot of Select Attributes.
112 Logistics 4.0: Digital Transformation of Supply Chain Management
Fig. 7: Select Attributes.
4.2.6 Visualize
Two-dimensional graphs in which the distribution of attributes in the data set can be displayed are accessed through the visualize panel shown in Figure 8. Graphs can be constituted with user-defined attributes or data instances. Arrangements regarding the appearance of the graphs such as color and size can be made.
Fig. 8: Visualize.
4.3 The Experimenter
Whereas Explorer can only actuate schemes individually, more comprehensive experiments can be implemented by Experimenter facilities. As distinct from Explorer, a data set can be processed by using a number of algorithms and besides more than one data set can be processed with one or more algorithms in the application. Consequently, the user has the opportunity to analyze the performance of algorithms run on the data sets. An experiment consists of three steps entitled Setup, Run, and Analyze.
4.3.1 Setup
Setup, shown in Figure 9, has two views including simplified and advanced option. Whereas Simple has a simpler display for the user, Advanced offers access to all options. The test results can be stored in several alternative recording environments such as ARFF file, CSV file or JDBC database. Storing the results in the database is more advantageous in terms of time consumption for experiments broken or to expand them. Just as the classification counterpart, the cross-validation or percentage split techniques can be used for testing and training process. It is important to repeat the training to generate more reliable results. In the Iteration Control Field, the number of repetitions is set and while working multiple algorithms on multiple data sets, it is preferred whether the data sets or algorithms are handled initially. In the Datasets and Algorithms Panels, once the data sets and learning schemes examined are selected, they become ready to work. It is also possible to load and store the settings of algorithms that require multiple parameter settings.
As for the advanced interface, the Result Generator Panel has been added to allow the user to determine the result generators, which is the detailed equivalent of the experiment type in the simple view. Apart from the Result Generator, there is a Distribute Experiment Panel that distributes the processing load to the other nodes in the network. A database server, computers and properly generated remote engine policy file are required to perform this feature.
Fig. 9: Setup.
4.3.2 Run
The experiment is launched by clicking the Start Button in the Run Tab shown in Figure 10. After selected learning schemes have been employed on the data sets, a message stating that the operations were completed without error should be received. The results are stored in the specified file path in the Result Destination Field on the Setup Tab.
4.3.3 Analyse
If already saved, the experiment is accessed from a file or database source; otherwise, clicking the Experiment Button will bind the experiment that has just been finalized on the Run Tab. Perform Test is the button which generates detailed statistics, yet the test configuration must primarily be done by selecting the options in the Configure Test Field. Numerous criteria are proposed to assess the performance of the learning schemas. These criteria can be viewed in Figure 11.
1. T-Test: Measures whether there is a significant difference between the averages of the user groups.
2. Select Rows and Cols: Assigns the criteria to the rows and columns of the result table.
3. Comparison Field: Selects the type of statistics to compare.
4. Significance: Specifies confidence threshold
5. Sorting (asc) by: Sets the sorting criteria of table rows.
114 Logistics 4.0: Digital Transformation of Supply Chain Management
Fig. 10: Run.
Fig. 11: Analyse.
6. Test Base: Uses to change the baseline scheme.
7. Displayed Columns: Some of the items selected in “Select rows and cols” can be taken away from the result table.
However, the test base cannot be discarded from the table.
8. Show Std. Deviations: Adds standard deviation information to the result table.
9. Output Format: Provides utility tools as follows for editing the output format. Precision can be specified for Mean and std deviation. A row representing the average of each column can be added. Plain Text, CSV, HTML, LaTeX, GNUPlot and Significance Only can be specified as output formats. Using the advanced setup option, in addition to those mentioned, all adjustable properties of the output matrix can be particularized.
4.4 The KnowledgeFlow
It is an application where all the data mining methods mentioned in the Explorer section are served in iconic form. The block functions representing the operation processes are associated with link nodes on the edges of their symbols and thereby composing a flowchart executing the work. Unlike Explorer; In the KnowledgeFlow shown in Figure 12, both
Fig. 12: KnowledgeFlow Environment.
continuous learning and components can be applied sequentially, e.g., the data can be passed through multiple filters. In addition, each of the components is executed as a separate thread.
To summarize the usage of this perspective, filters, classifiers, clusters, association rules and attribute selectors, as well as other tools, are located in folders named after them in the Design Field built into the left edge of the perspective. These components are placed on a layout canvas by drag and drop in the order required by the data mining tasks and thereafter the parameters of the components can be edited by double-clicking or selecting configure option from the shortcut menu.
As for general handling in a data flow, the DataSources tools are used to obtain the dataset from a data source or can be generated through the DataGenerators tools. Evaluation tools are used for the determination of the column that holds the class information in the dataset, the approach to which training and test data will be obtained, e.g., split or cross-validation and the criteria for evaluating test phase. Using the components in the visualization category, the evaluation results can be represented as text or chart. By means of DataSinks components, a data set subdivision, a trained model, a chart and text- based information can be recorded. Extra tools are also available to manage the data flow, in the flow and tools categories.
4.5 The Workbench
The Workbench added with WEKA version 3.8 brings together the perspectives described so far under the same roof. As viewed in Figure 13, on the Workbench, each perspective has its own tab located at the top of the layout. The user can define the settings in the perspectives, such as initial settings, default values and appearance. Apart from these adjustments, there is no difference in the functionality of the perspectives from those previously described. These settings can be accessed with the gear illustrated button located to the left of the perspectives.
As an example in the following screenshot shown in Figure 14, the settings which are some initial and default values belong to the clustering options, are shown. It is also a pleasing alternative to leave the text and background colors on the output panel to the user’s preference.
4.6 SimpleCLI
Java packages running behind interactive interfaces can be activated with coding via Weka’s command-line interface.
Help lists the main commands of SimpleCLI.
1. capabilities <classname> <args>: Lists the capabilities of the specified class. If the class is a weka.core.OptionHandler then trailing options after the classname will be set as well.
2. cls: Clears the output area.
3. echo msg: Outputs a message.
116 Logistics 4.0: Digital Transformation of Supply Chain Management
Fig. 13: Workbench.
Fig. 14: Workbench Settings.
4. exit: Exits the SimpleCLI program.
5. help [command1] [command2] [...]: Outputs the help for the specified command or, if omitted, for all commands.
6. history: Prints all issued commands.
7. java <classname> <args>: Lists the capabilities of the specified class. If the class is a weka.core.OptionHandler then trailing options after the classname will be set as well.
8. kill: Kills the running job, if any.
9. script <script_file>: Executes commands from a script file.
10. set [name=value]: Sets a variable. If no key=value pair is given all current variables are listed.
11. unset name: Removes a variable.
Weka has a hierarchical Java package structure. Namely, a classifier is contained in a classifiers subpackage at the higher level, which is grouped according to the approach method; the classifiers subpackage is contained in classifiers
Fig. 15: SimpleCLI.
package which is at the much higher level, and the classifiers package is contained in weka package which is at the top- level. Like IBk> Lazy> Classifiers> Weka. Java packages are called with the java <classname> <args> command. In the example, k-Nearest Neighbour is run on iris.arff dataset with default parameters. The parameter “- t” refers to the dataset portion to be used for training.
As shown in the Figure 15, the Tab key is functional as a command complement. In the example, the Tab key lists matching packages/commands after “java.weka.a” and “java.weka.c”. If the command is composed until the classifier name, a description of both the general parameters used in each classifier and the classifier-specific parameters is displayed.
Detailed information on the schemes, algorithms and parameters of the packages can be purchased from the WEKA documentation pages (Bouckaert et al. 2018).
5. Classification of Supply Chain Data by Using WEKA