Quick intro to Weka

Weka (http://www.cs.waikato.ac.nz/ml/weka/) is data mining software from the University of Waikato. In Slovenia, the Bioinformatics Laboratory has also developed the well-known software Orange (http://orange.biolab.si/). Both tools have a GUI and a library for programmatic access. The main difference is that Weka is Java-based and Orange is Python-based.

Here I will give a short example of how to use Weka from Java. The Java file is accessible here: Weka.java. All you need to do is put weka.jar on the classpath, then compile and run Weka.java (of course, you need a C:\temp folder, or you can choose another one).

For classification problems we normally have to identify the features first. In Weka, the standard attribute types are numeric, nominal, string, date and relation. A relation attribute can hold a whole dataset. Some functions for data preprocessing are also available. Here we define some attributes:

//1.define attributes
Attribute attr = new Attribute("my-numeric");
//nominal attribute with the values value_0 ... value_9
FastVector myNomVals = new FastVector();
for (int i=0; i<10; i++)
    myNomVals.addElement("value_" + i);
Attribute attr1 = new Attribute("my-nominal", myNomVals);
//a null value list marks a string attribute
Attribute attr2 = new Attribute("my-string", (FastVector)null);
Attribute attr3 = new Attribute("my-date", "dd-MM-yyyy");
//whole relation can also be an attr
//Attribute attr4 = new Attribute("my-relation", new Instances(...));

Once we have the attributes, we can form the dataset, a.k.a. the relation (reading from and writing to files will come later):

//2.create dataset
FastVector attrs = new FastVector();
for (Attribute a : new Attribute[] {attr, attr1, attr2, attr3})
    attrs.addElement(a);
Instances dataset = new Instances("my_dataset", attrs, 0);

Now we have defined the relation structure. There are several ways to fill the dataset, and here we present a few of them:

//3.add instances
//first instance: fill a double[] and look the attributes up by name
//(Weka 3.6 API; in Weka 3.7+ Instance is an interface, use DenseInstance there)
double[] attValues = new double[dataset.numAttributes()];
attValues[0] = 55;
attValues[1] = dataset.attribute("my-nominal").indexOfValue("value_5");
attValues[2] = dataset.attribute("my-string").addStringValue("Slavko");
attValues[3] = dataset.attribute("my-date").parseDate("7-6-1987");
dataset.add(new Instance(1.0, attValues));
//second instance: look the attributes up by index, numeric value is missing
attValues = new double[dataset.numAttributes()];
attValues[0] = Instance.missingValue();
attValues[1] = dataset.attribute(1).indexOfValue("value_9");
attValues[2] = dataset.attribute(2).addStringValue("Marinka");
attValues[3] = dataset.attribute(3).parseDate("23-4-1989");
dataset.add(new Instance(1.0, attValues));
//third instance: set the values through the Attribute objects
Instance example = new Instance(4);
example.setValue(attr, 16);
example.setValue(attr1, "value_7");
example.setValue(attr2, "Mirko");
example.setValue(attr3, attr3.parseDate("1-1-1988"));
dataset.add(example);
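The snippet above stores every value as a double, whatever the attribute type. As a rough illustration of that idea, here is a plain-Java sketch that does not use Weka at all: a nominal value becomes the position of its label in the declared value list, a date becomes milliseconds since the epoch, and a missing value becomes NaN. The class InstanceEncodingSketch and all of its helper methods are hypothetical names made up for this example, not part of the Weka API.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only (no Weka classes): how one row ends up as a double[].
class InstanceEncodingSketch {

    // A missing value is represented as NaN.
    static final double MISSING = Double.NaN;

    // Hypothetical helper: builds the labels value_0 ... value_9 used in the post.
    static List<String> nominalValues() {
        List<String> values = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            values.add("value_" + i);
        }
        return values;
    }

    // A nominal value is stored as the position of its label in the declared list.
    static double encodeNominal(List<String> declared, String label) {
        return declared.indexOf(label); // -1 if the label was never declared
    }

    // Reading it back means using the stored double as a list index again.
    static String decodeNominal(List<String> declared, double stored) {
        return declared.get((int) stored);
    }

    // A date is stored as milliseconds since the epoch, parsed with the given pattern.
    static double encodeDate(String pattern, String text) {
        try {
            return new SimpleDateFormat(pattern).parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse date: " + text, e);
        }
    }

    public static void main(String[] args) {
        List<String> declared = nominalValues();
        double[] row = {
            55,                                   // numeric value, stored as is
            encodeNominal(declared, "value_5"),   // nominal value, stored as index
            MISSING,                              // missing value
            encodeDate("dd-MM-yyyy", "7-6-1987")  // date, stored as milliseconds
        };
        System.out.println(Arrays.toString(row));
        System.out.println(decodeNominal(declared, row[1]));
    }
}
```

The same index-to-label lookup is what you need later when a classifier hands back a double instead of a class name.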

Up to here we have the dataset in memory. We can use it (the class attribute still needs to be set), print it to stdout, or save it to a file:

//4.output dataset
System.out.println(dataset);
//5.save dataset
String file = "C:\\temp\\weka_test.arff";
ArffSaver saver = new ArffSaver();
saver.setInstances(dataset);
saver.setFile(new File(file));
saver.writeBatch();
//6.read dataset
ArffLoader loader = new ArffLoader();
loader.setFile(new File(file));
dataset = loader.getDataSet();

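As we have one string attribute, we need to preprocess it properly, because very few classifiers support string attributes. We can do this with filters, for example by turning the string into word attributes with StringToWordVector (StringToNominal would change it to a nominal attribute instead):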

//7.preprocess strings (almost no classifier supports them)
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(dataset); //must be called before useFilter
dataset = Filter.useFilter(dataset, filter);
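To give a feeling for what a word-vector filter does, here is a simplified plain-Java sketch that does not use Weka: it collects a vocabulary from all strings and then turns each string into a 0/1 vector that says which words are present. This is only the basic idea under my own assumptions (whitespace tokenizing, presence flags); it is not Weka's implementation, and WordVectorSketch, vocabulary and toVector are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Simplified illustration of the word-vector idea (not Weka's implementation).
class WordVectorSketch {

    // Sorted list of all distinct whitespace-separated words in the documents.
    static List<String> vocabulary(List<String> docs) {
        Set<String> words = new TreeSet<>();
        for (String doc : docs) {
            words.addAll(Arrays.asList(doc.trim().split("\\s+")));
        }
        return new ArrayList<>(words);
    }

    // One numeric value per vocabulary word: 1.0 if the word occurs in doc, else 0.0.
    static double[] toVector(List<String> vocab, String doc) {
        Set<String> present = new HashSet<>(Arrays.asList(doc.trim().split("\\s+")));
        double[] vector = new double[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            vector[i] = present.contains(vocab.get(i)) ? 1.0 : 0.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Slavko", "Marinka", "Mirko");
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (String doc : docs) {
            System.out.println(doc + " -> " + Arrays.toString(toVector(vocab, doc)));
        }
    }
}
```

Each string is replaced by as many numeric columns as there are distinct words, which is why the attribute layout of the dataset changes after filtering.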

We have the data. The next step is building a classifier. Weka contains many well-known classifiers, such as naive Bayes, decision trees and perceptrons. I like SVMs and I use LibSVM with Weka. Weka already has a built-in LibSVM wrapper, so the only thing you need to do is add libsvm.jar to the classpath and use LibSVM as the classifier instance.

Another very easy task is saving classifiers and reading them back. The only thing to be aware of is the class index! You must set it before training the classifier. The best practice is to always make the class attribute the last one.

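//8.build classifier
//the class index must be set first; using my-nominal as the class is only an assumption for this example
dataset.setClassIndex(dataset.attribute("my-nominal").index());
Classifier classifier = new J48();
classifier.buildClassifier(dataset);
//9.save classifier (note: this reuses 'file', so the ARFF saved above gets overwritten)
OutputStream os = new FileOutputStream(file);
ObjectOutputStream objectOutputStream = new ObjectOutputStream(os);
objectOutputStream.writeObject(classifier);
objectOutputStream.close();
//10. read classifier back
InputStream is = new FileInputStream(file);
ObjectInputStream objectInputStream = new ObjectInputStream(is);
classifier = (Classifier) objectInputStream.readObject();
objectInputStream.close();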
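Saving and reading back a classifier is ordinary Java serialization, so it can be tried without Weka. Here is a minimal sketch that round-trips a small Serializable object through ObjectOutputStream and ObjectInputStream, using try-with-resources so the streams are always closed. ToyModel is a hypothetical stand-in for a trained classifier, and the sketch writes to a byte array instead of a file to stay self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Plain-Java illustration of the save/read-back round trip (no Weka classes).
class SerializationSketch {

    // Hypothetical stand-in for a trained model.
    static class ToyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double threshold;

        ToyModel(double threshold) {
            this.threshold = threshold;
        }
    }

    // Serializes the object; closing the stream flushes everything into the byte array.
    static byte[] save(Serializable model) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the object back; the caller casts it to the expected type.
    static Object load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] stored = save(new ToyModel(0.5));
        ToyModel restored = (ToyModel) load(stored);
        System.out.println(restored.threshold);
    }
}
```

For a real model you would swap the byte array streams for FileOutputStream and FileInputStream, as in the snippet above.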

Usually we need to know how good the classifications are. Weka supports a number of evaluation tools, such as cross-validation (CV) and various measures. Here we will resample our dataset, split it into a train set and a test set, and output some results.

//resample if needed (resample draws with replacement; dataset.randomize(...) only shuffles)
dataset = dataset.resample(new Random(42));
//split to 70:30 learn and test set
double percent = 70.0;
int trainSize = (int) Math.round(dataset.numInstances() * percent / 100);
int testSize = dataset.numInstances() - trainSize;
Instances train = new Instances(dataset, 0, trainSize);
Instances test = new Instances(dataset, trainSize, testSize);
//retrain on the train set only, so the model is not fitted on the test part
classifier.buildClassifier(train);
//do eval
Evaluation eval = new Evaluation(train); //trainset
eval.evaluateModel(classifier, test); //testset
System.out.println(eval.toSummaryString());
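The split arithmetic is easy to check on its own. Here is a plain-Java sketch, without Weka, that uses the same rounding as the snippet above and cuts a list into a train part and a test part. Note that it shuffles the data instead of resampling with replacement, so every element lands in exactly one part. TrainTestSplitSketch and its methods are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Plain-Java illustration of a percentage split (no Weka classes).
class TrainTestSplitSketch {

    // Same rounding as in the post: percent of the instances go to the train set.
    static int trainSize(int numInstances, double percent) {
        return (int) Math.round(numInstances * percent / 100);
    }

    // Shuffles a copy of the data with a fixed seed and returns [train, test].
    static <T> List<List<T>> split(List<T> data, double percent, long seed) {
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = trainSize(shuffled.size(), percent);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            data.add(i);
        }
        List<List<Integer>> parts = split(data, 70.0, 42L);
        System.out.println("train size: " + parts.get(0).size());
        System.out.println("test size: " + parts.get(1).size());
    }
}
```

With very small datasets, such as the three instances built in this post, the test part ends up with a single instance, so the evaluation numbers say very little.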

When classifying new instances, we must remember to translate the classifier's result into a class attribute value: for classification, it returns only the index of the value!

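//classified result value: map the returned index back to the class label
String label = test.classAttribute().value((int) classifier.classifyInstance(test.instance(0)));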

I hope this example was useful to you. I tried to show how to use Weka for some quick tasks.

5 thoughts on “Quick intro to Weka”

  1. SAN

    Thanks for the code.

    But I got an error in
    new Instance(1.0, attValues),
    because Instance is an interface. How can I create an object for that?
    I am using the latest weka.jar.

  2. SAN

    Sorry dude. I got the solution. I used the source code, so that was the only problem. Now I am using the jar and there is no error.
