I am trying to use the Weka API from a Java class. I have split my data for 10-fold cross-validation and then binarized it using several different thresholds.
However, I am new to the Weka API, so I am not sure that what I have done is right. I am getting ROC values, but I am not sure they are correct. Here is my code:
    // Runs once per fold; j is the index of the current fold
    for (int i = 0; i < threshold.length; i++)
    {
        // Deep copy of learnSet
        ArrayList<ArrayList<int[]>> learnSetCopy = new ArrayList<>();
        for (ArrayList<int[]> innerList : learnSet)
        {
            ArrayList<int[]> innerCopy = new ArrayList<>();
            for (int[] array : innerList)
            {
                int[] arrayCopy = Arrays.copyOf(array, array.length);
                innerCopy.add(arrayCopy);
            }
            learnSetCopy.add(innerCopy);
        }
        // Deep copy of validSet
        ArrayList<ArrayList<int[]>> validSetCopy = new ArrayList<>();
        for (ArrayList<int[]> innerList : validSet)
        {
            ArrayList<int[]> innerCopy = new ArrayList<>();
            for (int[] array : innerList)
            {
                int[] arrayCopy = Arrays.copyOf(array, array.length);
                innerCopy.add(arrayCopy);
            }
            validSetCopy.add(innerCopy);
        }
        // Binarize the chemical-protein interaction values
        binarizeCpiAttributes(learnSetCopy, threshold[i]);
        // Generate the ARFF files to be run through Weka
        generateARFF(fileName + "LearningThreshold" + threshold[i] + "Fold" + j + ".arff", attributeNames, learnSetCopy);
        binarizeCpiAttributes(validSetCopy, threshold[i]);
        generateARFF(fileName + "ValidThreshold" + threshold[i] + "Fold" + j + ".arff", attributeNames, validSetCopy);
        // Load the learning and validation ARFF files as Instances
        Instances learningInstances = DataSource.read(fileName + "LearningThreshold" + threshold[i] + "Fold" + j + ".arff");
        Instances validInstances = DataSource.read(fileName + "ValidThreshold" + threshold[i] + "Fold" + j + ".arff");
        // Use the last attribute as the class label if none is set
        if (learningInstances.classIndex() == -1)
        {
            learningInstances.setClassIndex(learningInstances.numAttributes() - 1);
        }
        if (validInstances.classIndex() == -1)
        {
            validInstances.setClassIndex(validInstances.numAttributes() - 1);
        }
        RandomForest cls = new RandomForest();
        String[] options = {
            "-P", "100",
            "-I", "100",
            "-num-slots", "1",
            "-K", "0",
            "-M", "1.0",
            "-V", "0.001",
            "-S", "1"
        };
        cls.setOptions(options);
        // Train on the learning fold, then evaluate on the validation fold
        cls.buildClassifier(learningInstances);
        Evaluation eval = new Evaluation(learningInstances);
        eval.evaluateModel(cls, validInstances);
        // ROC area for the class value at index 1
        System.out.println("Area under ROC curve: " + eval.areaUnderROC(1));
        System.out.println("Processing with Threshold: " + threshold[i]);
    }
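The two deep-copy loops at the top of the loop body do the same work, so they could probably be factored into one helper. The following is only a sketch: `DeepCopyUtil` and `deepCopy` are hypothetical names of my own, and it assumes `learnSet` and `validSet` are both of type `ArrayList<ArrayList<int[]>>`, as the code above suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Weka or of my original code.
final class DeepCopyUtil {
    private DeepCopyUtil() {}

    // Returns a copy in which every inner list and every int[] is a new object,
    // so that binarizing the copy never changes the original data.
    static ArrayList<ArrayList<int[]>> deepCopy(List<ArrayList<int[]>> source) {
        ArrayList<ArrayList<int[]>> copy = new ArrayList<>(source.size());
        for (ArrayList<int[]> innerList : source) {
            ArrayList<int[]> innerCopy = new ArrayList<>(innerList.size());
            for (int[] array : innerList) {
                innerCopy.add(Arrays.copyOf(array, array.length));
            }
            copy.add(innerCopy);
        }
        return copy;
    }
}
```

With such a helper, each copy would reduce to a single call, for example `DeepCopyUtil.deepCopy(learnSet)`.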
Here is an example of the output I am getting. I expected the values to be higher, which makes me question whether what I have done is correct:
Area under ROC curve: 0.6000602772754672
Processing with Threshold: 0.4
Area under ROC curve: 0.5848854731766124
Processing with Threshold: 0.5
Area under ROC curve: 0.594831223628692
Processing with Threshold: 0.6
Area under ROC curve: 0.560051235684147
Processing with Threshold: 0.7
Is what I am doing here correct, and is `eval.areaUnderROC(1)` the right way to get the ROC area, or am I getting some other value from the Weka API?
I have tried to read through the Weka documentation, but I got confused in places.
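For reference, this is my understanding of what the ROC area should measure: the probability that a randomly chosen positive instance gets a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes that by hand so the result can be compared against Weka's value. It is only an illustration with made-up scores; `AucSketch` and `auc` are hypothetical names and not part of the Weka API.

```java
// Hypothetical sanity check, independent of Weka.
final class AucSketch {
    private AucSketch() {}

    // Pairwise (Mann-Whitney style) AUC: the fraction of positive/negative pairs
    // in which the positive instance has the higher score; ties count as half.
    static double auc(double[] scores, boolean[] isPositive) {
        if (scores.length != isPositive.length) {
            throw new IllegalArgumentException("scores and labels differ in length");
        }
        double wins = 0.0;
        long pairs = 0;
        for (int p = 0; p < scores.length; p++) {
            if (!isPositive[p]) continue;
            for (int n = 0; n < scores.length; n++) {
                if (isPositive[n]) continue;
                pairs++;
                if (scores[p] > scores[n]) {
                    wins += 1.0;
                } else if (scores[p] == scores[n]) {
                    wins += 0.5;
                }
            }
        }
        // AUC is undefined when one of the two classes is missing
        return pairs == 0 ? Double.NaN : wins / pairs;
    }

    public static void main(String[] args) {
        // Made-up scores for the positive class and their true labels
        double[] scores = {0.9, 0.8, 0.4, 0.3};
        boolean[] labels = {true, false, true, false};
        System.out.println(auc(scores, labels)); // prints 0.75
    }
}
```

In my setup the scores would presumably be the class-1 probabilities from `cls.distributionForInstance(...)` for each validation instance. If the hand-computed value is close to what `eval.areaUnderROC(1)` reports, the low numbers would point to the data or the binarization rather than to how I am calling the API.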