I am trying to run a Pig Latin script that computes TF-IDF values for a book dataset.
My input file, bookdataset.txt, contains the following lines:
Document 1: "The quick brown fox jumps over the lazy dog."
Document 2: "A quick brown dog jumps over the lazy fox."
Document 3: "The lazy cat sleeps all day."
Document 4: "A lazy dog is a happy dog."
My Pig script is as follows:
-- Load the book dataset
documents = LOAD 'input/book_dataset.txt' USING TextLoader AS (line:chararray);
-- Tokenize the documents into words
tokenized_documents = FOREACH documents GENERATE FLATTEN(TOKENIZE(REPLACE(LOWER(line), '[^a-zA-Z0-9\s]', ''))) AS word, line;
-- Compute the term frequency (TF) for each word in each document
word_counts = GROUP tokenized_documents BY (word, line);
word_tf = FOREACH word_counts GENERATE group.word AS word, group.line AS line, COUNT(tokenized_documents) AS tf;
-- Compute the document frequency (DF) for each word
word_df = FOREACH (GROUP word_tf BY word) GENERATE group AS word, COUNT(word_tf) AS df;
-- Compute the number of unique documents
num_documents = DISTINCT tokenized_documents.line;
num_documents_count = FOREACH (GROUP num_documents ALL) GENERATE COUNT(num_documents) AS num_docs;
-- Compute the inverse document frequency (IDF) for each word
word_idf = FOREACH word_df GENERATE word, LOG((double)(num_docs.$0) / (double)df) AS idf;
-- Join TF and IDF to calculate TF-IDF for each word in each document
word_tf_idf = JOIN word_tf BY word LEFT OUTER, word_idf BY word;
tf_idf = FOREACH word_tf_idf GENERATE word_tf::line AS document, word_tf::word AS word, word_tf::tf * word_idf::idf AS tf_idf;
-- Group TF-IDF values by document and store the result
grouped_tf_idf = GROUP tf_idf BY document;
final_tf_idf = FOREACH grouped_tf_idf GENERATE group AS document, tf_idf;
-- Store the result
STORE final_tf_idf INTO 'output/tf_idf_values' USING PigStorage();
When I run the script above, it fails with an error on the following line:
num_documents = DISTINCT tokenized_documents.line;
I suspect the problem is my use of the DISTINCT operator. The error message is:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 20, column 44> mismatched input '.' expecting SEMI_COLON
How can I resolve this?
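For reference, here is a minimal sketch of one possible fix, not a verified solution. It relies on the fact that DISTINCT operates on a relation alias rather than on a projected field such as tokenized_documents.line, so the column is projected with FOREACH first. The alias doc_lines is a hypothetical name that does not appear in the script above. The scalar reference num_documents_count.num_docs in the IDF step is also an assumed correction, since num_docs is a field name and not a relation.

```pig
-- Project the document column first; DISTINCT expects a relation alias
-- (doc_lines is a hypothetical alias introduced for this sketch)
doc_lines = FOREACH documents GENERATE line;
num_documents = DISTINCT doc_lines;

-- Count the distinct documents (unchanged from the original script)
num_documents_count = FOREACH (GROUP num_documents ALL) GENERATE COUNT(num_documents) AS num_docs;

-- Assumed correction: refer to the single-row relation as a scalar
-- instead of num_docs.$0
word_idf = FOREACH word_df GENERATE word, LOG((double)num_documents_count.num_docs / (double)df) AS idf;
```

The remaining statements (the JOIN, the TF-IDF product, and the STORE) would stay as they are under this sketch.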