How do I convert text to a MOA Instance?

I'm doing incremental learning in MOA for a text application. This requires creating an Instance object that represents the text numerically, such as TF-IDF scores for every stemmed word in the vocabulary. My MOA version is 2019.05.0.
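To make the target representation concrete, here is a minimal sketch in plain Java (no MOA or Weka) of the kind of vector I mean. The class and method names (TfIdfSketch, vocabulary, tfIdf) are made up for illustration, the tokens are assumed to be stemmed already, and the weighting (raw term count times ln(N / df)) is just one common TF-IDF variant, not something MOA prescribes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of MOA or Weka.
class TfIdfSketch {

    // Builds a sorted vocabulary over all documents (tokens assumed already stemmed).
    static List<String> vocabulary(List<List<String>> docs) {
        TreeSet<String> vocab = new TreeSet<>();
        for (List<String> doc : docs) {
            vocab.addAll(doc);
        }
        return new ArrayList<>(vocab);
    }

    // Returns one dense vector for docs.get(docIndex), with one slot per vocabulary term.
    // Assumed weighting: tf = raw count in the document, idf = ln(N / df).
    static double[] tfIdf(List<List<String>> docs, List<String> vocab, int docIndex) {
        double[] vector = new double[vocab.size()];
        List<String> doc = docs.get(docIndex);
        for (int i = 0; i < vocab.size(); i++) {
            String term = vocab.get(i);
            int tf = Collections.frequency(doc, term);
            int df = 0; // number of documents containing the term
            for (List<String> d : docs) {
                if (d.contains(term)) {
                    df++;
                }
            }
            // df is at least 1 because the vocabulary is built from these documents.
            vector[i] = tf * Math.log((double) docs.size() / df);
        }
        return vector;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("stream", "learn", "stream"),
                Arrays.asList("learn", "text"));
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab + " -> " + Arrays.toString(tfIdf(docs, vocab, 0)));
    }
}
```

Each double[] produced this way is what I would ultimately like to wrap in a MOA Instance, together with a class label.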

I looked for text processing tools in MOA, but couldn't find any.

I saw that Weka has a StringToWordVector class, so I decided to try that. Weka's classes aren't the same as MOA's, but there's a class called WekaToSamoaInstanceConverter, so I thought I could create Weka Instances, run them through StringToWordVector, and convert the result to MOA Instances. Maybe this is the wrong track, or maybe it's the right track and I'm missing something in my code.

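import com.yahoo.labs.samoa.instances.Instances;
import com.yahoo.labs.samoa.instances.WekaToSamoaInstanceConverter;
import weka.core.converters.TextDirectoryLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public static Instances convertDirectoryToInstances(String directory) throws Exception {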
    //Create an object that reads training or test files from a directory.
    //In the future, I'll want to add one file at a time. That's not the part I'm worried about at the moment.
    TextDirectoryLoader loader = new TextDirectoryLoader();
    String[] options = new String[] {"-dir", directory, "-charset", "UTF-8"};
    loader.setOptions(options);
    loader.getStructure();

    //Create Weka Instances that represent unprocessed text.
    weka.core.Instances plainTextInstances = loader.getDataSet();

    //A StringToWordVector is a Filter that converts text to word vectors.
    //I'm not using any bells and whistles for this example, so I expect each Instance to be a set of terms in the document.
    StringToWordVector stringToWordVector = new StringToWordVector();
    stringToWordVector.setInputFormat(plainTextInstances);
    weka.core.Instances wekaWordVectors = Filter.useFilter(plainTextInstances, stringToWordVector);

    //A MOA Instance is different from a Weka Instance, so we need to convert them.
    WekaToSamoaInstanceConverter converter = new WekaToSamoaInstanceConverter();

    //This is what fails.
    Instances moaWordVectors = converter.samoaInstances(wekaWordVectors);
    return moaWordVectors;
}

wekaWordVectors.size() equals the number of files in the subdirectories, which is what I expect.

The call to samoaInstances() fails. Line 220 calls locateIndex(0). There is no class at 0, so that returns -1. This -1 is then used as an array index, so I get an ArrayIndexOutOfBoundsException. I don't know what class 0 means, but I know that an ArrayIndexOutOfBoundsException means I did something wrong.
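To show the failure pattern I think I'm hitting, here is a simplified model in plain Java. It is not MOA's actual code: SparseLookupSketch and its methods are hypothetical, and I'm only assuming that the converter works on a sparse layout where unstored indices yield -1 from a lookup like locateIndex.

```java
import java.util.Arrays;

// Simplified model of a sparse vector; NOT MOA's actual implementation.
class SparseLookupSketch {
    private final int[] indices;   // attribute indices that are stored, in ascending order
    private final double[] values; // values[i] belongs to attribute indices[i]

    SparseLookupSketch(int[] indices, double[] values) {
        this.indices = indices;
        this.values = values;
    }

    // Returns the position of the attribute index in the stored arrays, or -1 if it is not stored.
    int locateIndex(int index) {
        int pos = Arrays.binarySearch(indices, index);
        return pos >= 0 ? pos : -1;
    }

    // Uses the lookup result directly; throws ArrayIndexOutOfBoundsException when it is -1.
    double unguardedValue(int index) {
        return values[locateIndex(index)];
    }

    // Treats an unstored attribute as 0.0 instead of indexing with -1.
    double guardedValue(int index) {
        int pos = locateIndex(index);
        return pos < 0 ? 0.0 : values[pos];
    }

    public static void main(String[] args) {
        SparseLookupSketch row = new SparseLookupSketch(new int[] {2, 5}, new double[] {1.0, 3.0});
        System.out.println(row.guardedValue(0));
        try {
            row.unguardedValue(0);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unguarded lookup failed: " + e.getMessage());
        }
    }
}
```

If something like this is going on, attribute 0 is simply not stored in the row being converted, but I can't tell from the stack trace alone whether that is a problem with my data or with how I'm calling the converter.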
