I am implementing sentiment analysis. When I tokenized my dataset.arff to use it as training data, I noticed that the "Positive" label is missing from the output.
The header of the dataset.arff file:
@relation SentimentAnalysis
@attribute text string
@attribute classValues {Positive,Negative,Neutral,'Very positive','Very negative'}
@data
'some text',Positive
'some text',Negative
'some text',Neutral
[...]
The tokenized data:
@relation 'SentimentAnalysis-weka.filters.unsupervised.attribute.StringToWordVector-R1-W5000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-stopwords-handlerweka.core.stopwords.Null-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'
@attribute classValues {Positive,Negative,Neutral,'Very positive','Very negative'}
All the words in the text are then declared as numeric @attribute entries.
After that comes the @data part:
{0 Negative,20 1,220 1,228 [...] }
{0 Neutral,22 1,169 [...] }
{22 1,169 1,272 [...] }
########################################
These are just a few example lines; I have many more like them. I do not understand why Positive is missing.
Here is my code for the tokenizer:
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.core.converters.ArffSaver;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public static void Tokenizer(){
    ArffLoader loader = new ArffLoader();
    String filename = "dataset.arff";
    File file = new File(filename);
    // checking if the file exists
    if (file.exists()){
        try {
            loader.setFile(file);
            Instances data = loader.getDataSet();
            // use the last attribute (classValues) as the class if none is set
            if (data.classIndex() == -1){
                data.setClassIndex(data.numAttributes() - 1);
            }
            // -R 1: convert the first attribute (text); -W 5000: words to keep
            String[] options = new String[] {"-R", "1", "-W", "5000", "-prune-rate", "-1.0"};
            StringToWordVector filter = new StringToWordVector();
            filter.setOptions(options);
            filter.setInputFormat(data);
            Instances filteredData = Filter.useFilter(data, filter);
            // the filter produces sparse instances, so the saved file is sparse ARFF
            File tokenizedFile = new File("tokenizeDataSet.arff");
            ArffSaver saver = new ArffSaver();
            saver.setInstances(filteredData);
            saver.setFile(tokenizedFile);
            saver.writeBatch();
            System.out.println("tokenized text saved properly");
        } catch (Exception e) {
            System.out.println("Error in reloading the Arff file:" + e.getMessage());
            e.printStackTrace();
        }
    }
}
I was expecting the Positive label to appear in the tokenized data set too!
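They weren't lost. The sparse ARFF format omits every value whose internal representation is 0, and a nominal value is stored internally as its index in the attribute's label list. Positive is the first label declared for classValues (index 0), so it is simply not written. A row with no entry for attribute 0, such as {22 1,169 1,272 [...] }, therefore has the class Positive. See the Weka wiki on the sparse ARFF format.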
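To make the rule concrete, here is a minimal sketch in plain Java (no Weka dependency) of how a reader of sparse ARFF can recover the class label. The class `SparseArffDemo` and its methods are hypothetical names made up for this illustration, not part of Weka, and the rows are shortened stand-ins for the ones in the question. It assumes values contain no commas and that the class attribute sits at index 0, as in the tokenized file above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SparseArffDemo {

    // Parses one sparse ARFF row such as "{0 Negative,20 1,220 1}" into a map
    // from attribute index to raw value text.
    // Simplifying assumption: no value contains a comma.
    static Map<Integer, String> parseSparseRow(String row) {
        Map<Integer, String> values = new LinkedHashMap<>();
        String body = row.trim();
        body = body.substring(1, body.length() - 1).trim(); // strip '{' and '}'
        if (body.isEmpty()) {
            return values;
        }
        for (String entry : body.split(",")) {
            // each entry is "<index> <value>"; the value may be quoted
            String[] parts = entry.trim().split("\\s+", 2);
            values.put(Integer.parseInt(parts[0]), parts[1]);
        }
        return values;
    }

    // Removes surrounding single quotes, e.g. 'Very positive' -> Very positive.
    static String unquote(String raw) {
        if (raw.length() >= 2 && raw.startsWith("'") && raw.endsWith("'")) {
            return raw.substring(1, raw.length() - 1);
        }
        return raw;
    }

    // Returns the class label of a sparse row. If the class attribute has no
    // entry, its stored value is 0, which means the FIRST declared label.
    static String classLabel(String row, int classIndex, List<String> labels) {
        String raw = parseSparseRow(row).get(classIndex);
        if (raw == null) {
            return labels.get(0);
        }
        return unquote(raw);
    }

    public static void main(String[] args) {
        // label order copied from the @attribute classValues declaration
        List<String> labels = List.of(
                "Positive", "Negative", "Neutral", "Very positive", "Very negative");

        System.out.println(classLabel("{0 Negative,20 1,220 1}", 0, labels)); // Negative
        System.out.println(classLabel("{0 Neutral,22 1,169 1}", 0, labels));  // Neutral
        System.out.println(classLabel("{22 1,169 1,272 1}", 0, labels));      // Positive
    }
}
```

Weka applies the same rule when it loads the sparse file, so the Positive instances should come back intact for training. If a dense file is preferred for readability, one option is to run the data through Weka's `weka.filters.unsupervised.instance.SparseToNonSparse` filter before saving.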