Missing labels in tokenized data set

Asked by Lorenzo Pombini on 10 March 2023 at 12:49

I am implementing sentiment analysis. When I tokenized my dataset.arff to use it as training data, I noticed that the "Positive" label is missing from the output.

The head of the dataset.arff file:

@relation SentimentAnalysis

@attribute text string
@attribute classValues {Positive,Negative,Neutral,'Very positive','Very negative'}

@data
'some text',Positive
'some text',Negative
'some text',Neutral
[...]

The tokenized data:


@relation 'SentimentAnalysis-weka.filters.unsupervised.attribute.StringToWordVector-R1-W5000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-stopwords-handlerweka.core.stopwords.Null-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'

@attribute classValues {Positive,Negative,Neutral,'Very positive','Very negative'}

All the words in the text are then declared as numeric attributes (@attribute ... numeric).

Then comes the @data part:

{0 Negative,20 1,220 1,228 [...] }
{0 Neutral,22 1,169 [...] }
{22 1,169 1,272 [...] }

########################################

These are just a few example lines; there are many more like them. I do not understand why Positive is missing.

Here is my code for the tokenizer:

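// Imports needed at the top of the enclosing class file
import java.io.File;

import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.core.converters.ArffSaver;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public static void Tokenizer() {
    File file = new File("dataset.arff");

    // Nothing to do if the input file does not exist
    if (!file.exists()) {
        System.out.println("File not found: " + file.getPath());
        return;
    }

    try {
        ArffLoader loader = new ArffLoader();
        loader.setFile(file);
        Instances data = loader.getDataSet();

        // Use the last attribute (classValues) as the class if none is set
        if (data.classIndex() == -1) {
            data.setClassIndex(data.numAttributes() - 1);
        }

        // Turn attribute 1 (text) into word attributes
        String[] options = new String[] {"-R", "1", "-W", "5000", "-prune-rate", "-1.0"};
        StringToWordVector filter = new StringToWordVector();
        filter.setOptions(options);
        filter.setInputFormat(data);

        // The filtered instances are sparse, so the saved file uses the sparse ARFF format
        Instances filteredData = Filter.useFilter(data, filter);

        ArffSaver saver = new ArffSaver();
        saver.setInstances(filteredData);
        saver.setFile(new File("tokenizeDataSet.arff"));
        saver.writeBatch();

        System.out.println("tokenized text saved properly");
    } catch (Exception e) {
        System.out.println("Error while tokenizing the ARFF file: " + e.getMessage());
        e.printStackTrace();
    }
}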

I was expecting the Positive label to appear in the tokenized data set too!


Answered by fracpete on 10 March 2023 at 20:03

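They weren't lost. The sparse ARFF format omits every value whose internal representation is 0, and for a nominal attribute the first declared label (here Positive) has index 0, so it is simply not written. A row with no entry for attribute 0, such as {22 1,169 1,272 [...] }, therefore belongs to the Positive class. See the Weka wiki on the sparse ARFF format.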
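To make the rule concrete, here is a minimal plain-Java sketch (no Weka dependency) of how a reader can recover the class label from a sparse row. The class name SparseRowDemo and the helper nominalValue are hypothetical, the rows are shortened stand-ins modelled on the ones in the question, and the parsing is deliberately simplified: it assumes well-formed rows and no commas inside quoted values.

```java
import java.util.Arrays;
import java.util.List;

public class SparseRowDemo {

    /**
     * Returns the label of the nominal attribute at attrIndex for one sparse
     * ARFF row of the form "{index value,index value,...}".
     * If the row has no entry for attrIndex, the value was omitted because its
     * internal index is 0, which corresponds to the first declared label.
     */
    static String nominalValue(String sparseRow, int attrIndex, List<String> labels) {
        String body = sparseRow.trim();
        // Drop the surrounding braces
        body = body.substring(1, body.length() - 1).trim();
        if (!body.isEmpty()) {
            for (String entry : body.split(",")) {
                // Each entry is "<attribute index> <value>"
                String[] parts = entry.trim().split("\\s+", 2);
                if (Integer.parseInt(parts[0]) == attrIndex) {
                    String value = parts[1].trim();
                    // Strip single quotes around labels such as 'Very positive'
                    if (value.length() >= 2 && value.startsWith("'") && value.endsWith("'")) {
                        value = value.substring(1, value.length() - 1);
                    }
                    return value;
                }
            }
        }
        // No explicit entry: the omitted value is the first label
        return labels.get(0);
    }

    public static void main(String[] args) {
        // Label order taken from the @attribute classValues declaration
        List<String> labels = Arrays.asList(
                "Positive", "Negative", "Neutral", "Very positive", "Very negative");

        // Shortened, hypothetical rows in the style of the tokenized file
        String[] rows = {
            "{0 Negative,20 1,220 1}",
            "{0 Neutral,22 1,169 1}",
            "{22 1,169 1,272 1}"
        };
        for (String row : rows) {
            System.out.println(row + " -> " + nominalValue(row, 0, labels));
        }
    }
}
```

The third row has no entry for attribute 0, so the sketch reports Positive for it. If you would rather see every value written out, Weka also provides the weka.filters.unsupervised.instance.SparseToNonSparse filter, which could be applied to the filtered data before saving; whether the larger dense file is worth it depends on your data.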
