CSC/ECE 517 Fall 2013/ch1 1w46 ka: Difference between revisions


Revision as of 02:52, 7 October 2013

Weka and Ruby

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes. Weka is written in Java; however, its libraries can also be used from Ruby. One route is to install Java and the Rjb bridge and obtain the Weka library; another is JRuby, which runs on the JVM and can call Java classes directly. This page uses JRuby, as illustrated below.

Clustering Data using Weka from JRuby

JRuby provides easy access to Java classes and methods, and Weka is no exception. The following program builds a simple k-means clusterer on a supplied input file and then prints the assigned cluster for each data instance. The import statements at the top (written with include_class in older JRuby releases and java_import in newer ones) let us refer to the API classes by their short names. When assigning each instance to a cluster, we must watch for the exception thrown if no assignment can be made. Finally, notice that the filename is passed as a command-line parameter: the parameters after the name of the JRuby program are packaged into ARGV in the usual Ruby style. Assuming weka.jar, jruby.jar, and your program are in the same folder, a sample program is shown below:

 # Weka scripting from JRuby
 require "java"
 require "weka"   # loads weka.jar
 # java_import replaces include_class, which newer JRuby releases no longer provide
 java_import "java.io.FileReader"
 java_import "weka.clusterers.SimpleKMeans"
 java_import "weka.core.Instances"
 
 # load data file (path given as the first command-line argument)
 file = FileReader.new ARGV[0]
 data = Instances.new file
 
 # create the model
 kmeans = SimpleKMeans.new
 kmeans.buildClusterer data
 
 # print out the built model
 puts kmeans
 
 # display the cluster for each instance
 data.numInstances.times do |i|
   cluster = "UNKNOWN"
   begin
     cluster = kmeans.clusterInstance(data.instance(i))
   rescue java.lang.Exception
     # leave cluster as "UNKNOWN" when no assignment can be made
   end
   puts "#{data.instance(i)},#{cluster}"
 end
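For readers who have not met k-means before, the idea behind the clusterer can be sketched in a few lines of plain Ruby. This is only an illustration under simplifying assumptions (one numeric attribute, starting centroids supplied by the caller); it is not how Weka's SimpleKMeans is implemented, and the method name kmeans_1d and the sample readings are made up for the example.

```ruby
# Minimal 1-D k-means sketch (illustration only, not Weka's implementation).
# points:    array of floats to cluster
# centroids: array of starting centroid values, one per cluster
def kmeans_1d(points, centroids, max_iterations = 100)
  assignments = []
  max_iterations.times do
    # Assignment step: index of the nearest centroid for each point.
    new_assignments = points.map do |p|
      centroids.each_index.min_by { |c| (p - centroids[c]).abs }
    end
    break if new_assignments == assignments # nothing moved, so we are done
    assignments = new_assignments

    # Update step: move each centroid to the mean of its members.
    centroids = centroids.each_index.map do |c|
      members = points.each_index.select { |i| assignments[i] == c }.map { |i| points[i] }
      members.empty? ? centroids[c] : members.sum / members.size
    end
  end
  [centroids, assignments]
end

# Made-up temperature readings, printed as "value,cluster" like the program above.
temperatures = [64.0, 65.0, 68.0, 80.0, 83.0, 85.0]
centroids, clusters = kmeans_1d(temperatures, [64.0, 85.0])
temperatures.zip(clusters) { |t, c| puts "#{t},#{c}" }
```

The loop alternates between assigning points to the nearest centroid and recomputing centroids until the assignments stop changing, which is the same kind of iteration the "Number of iterations" line in a Weka model summary counts.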

The Weka API makes it easy to pass in a data file. Weka can read data in a number of formats, including ARFF and CSV; the Instances constructor used above expects ARFF. When run on the weather.arff example (in Weka's 'data' folder), the output begins with the following model summary, followed by one line per instance giving its assigned cluster:

 Number of iterations: 3
 Within cluster sum of squared errors: 16.237456311387238
 Missing values globally replaced with mean/mode
 
 Cluster centroids:
                            Cluster#
 Attribute      Full Data          0          1
                     (14)        (9)        (5)
 
 outlook            sunny      sunny   overcast
 temperature      73.5714    75.8889       69.4
 humidity         81.6429    84.1111       77.2
 windy              FALSE      FALSE       TRUE
 play                 yes        yes        yes

Advantages of using Weka from JRuby

One advantage of using a language like JRuby to talk to Weka is that we have more control over how our data is constructed and passed to the machine-learning algorithms. A good starting point is constructing our own set of instances rather than reading them directly from a file. There are some quirks to Weka's construction of a set of Instances. In particular, each attribute must be defined through an instance of the Attribute class. This class gives a string name to the attribute and, if the attribute is nominal, also holds a vector of the nominal values. Each instance can then be constructed and added to the growing set of Instances. A dataset built by hand in this way can then be passed to one of Weka's learning algorithms.
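The by-hand construction of a set of Instances described under "Advantages of using Weka from JRuby" might be sketched as follows. The attribute list, the rows, the relation name "by_hand", and the helper encode_row are assumptions made for the example. The Weka calls assume Weka 3.7 or later, where DenseInstance exists and nominal values are passed as a java.util.List; older releases use FastVector and Instance instead. The Weka section is guarded so that it only runs under JRuby with weka.jar available.

```ruby
# Sketch: building a small dataset by hand instead of reading a file.
# The attributes, rows, and helper name below are illustrative assumptions.

# Each attribute is [name, nominal_values]; nil marks a numeric attribute.
ATTRS = [
  ["outlook",     %w[sunny overcast rainy]],
  ["temperature", nil],
  ["windy",       %w[TRUE FALSE]]
]

ROWS = [
  ["sunny",    85.0, "FALSE"],
  ["overcast", 83.0, "FALSE"],
  ["rainy",    65.0, "TRUE"]
]

# Weka stores every value as a double; a nominal value is stored as the
# index of its label in the attribute's list of values.
def encode_row(attrs, row)
  attrs.zip(row).map do |(name, labels), value|
    next value.to_f if labels.nil?
    index = labels.index(value)
    raise ArgumentError, "#{value.inspect} is not a value of #{name}" if index.nil?
    index.to_f
  end
end

# The Weka part only runs under JRuby (assumes Weka 3.7+ and weka.jar loadable).
if RUBY_ENGINE == "jruby"
  require "java"
  require "weka"
  java_import "java.util.ArrayList"
  java_import "weka.core.Attribute"
  java_import "weka.core.DenseInstance"
  java_import "weka.core.Instances"

  attributes = ArrayList.new
  ATTRS.each do |name, labels|
    if labels.nil?
      attributes.add(Attribute.new(name))          # numeric attribute
    else
      values = ArrayList.new
      labels.each { |label| values.add(label) }
      attributes.add(Attribute.new(name, values))  # nominal attribute
    end
  end

  # Empty dataset with the declared attributes, then one instance per row.
  data = Instances.new("by_hand", attributes, ROWS.size)
  ROWS.each do |row|
    data.add(DenseInstance.new(1.0, encode_row(ATTRS, row).to_java(:double)))
  end
  puts data
end
```

Keeping the encoding step in plain Ruby makes the main quirk visible: a nominal value such as "overcast" reaches Weka as the position of that label in the attribute's value list, not as a string.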