LookupError when running an .ipynb file on Kaggle

I am facing this error while running the code on Kaggle; it works fine on my local PC. Here is the link to the notebook on Kaggle, in case you would like to look at the details: Kaggle

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Custom tokenizer that removes stopwords and applies lemmatization
def customtokenize(text):  # `text` rather than `str`, which shadows the built-in
    # Split the string into tokens
    tokens = nltk.word_tokenize(text)
    # Filter out English stopwords
    nostop = [token for token in tokens if token not in stopwords.words('english')]
    # Lemmatize the remaining tokens
    lemmatized = [lemmatizer.lemmatize(word) for word in nostop]
    return lemmatized
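
For reference, this is roughly what the tokenizer should return once the corpora are available (a minimal sketch of the expected output; note that stopwords.words('english') is all lowercase, so the capitalized 'The' passes the filter):

print(customtokenize("The cats are running"))
# expected: ['The', 'cat', 'running']  ('are' dropped as a stopword, 'cats' lemmatized to 'cat')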

from sklearn.feature_extraction.text import TfidfVectorizer

# Build a TF-IDF Vectorizer model
vectorizer = TfidfVectorizer(tokenizer=customtokenize)

# Transform feature input to TF-IDF
tfidf = vectorizer.fit_transform(spam_messages)
---------------------------------------------------------------------------
LookupError                               Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/nltk/corpus/util.py:80, in LazyCorpusLoader.__load(self)
     79 except LookupError as e:
---> 80     try: root = nltk.data.find('{}/{}'.format(self.subdir, zip_name))
     81     except LookupError: raise e

File /opt/conda/lib/python3.10/site-packages/nltk/data.py:653, in find(resource_name, paths)
    652 resource_not_found = '\n%s\n%s\n%s' % (sep, msg, sep)
--> 653 raise LookupError(resource_not_found)

LookupError: 
**********************************************************************
  Resource 'corpora/wordnet.zip/wordnet/.zip/' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/root/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

During handling of the above exception, another exception occurred:

LookupError                               Traceback (most recent call last)
Cell In[3], line 20
     17 vectorizer = TfidfVectorizer(tokenizer=customtokenize)
     19 #Transform feature input to TF-IDF
---> 20 tfidf=vectorizer.fit_transform(spam_messages)
     21 #Convert TF-IDF to numpy array
     22 tfidf_array = tfidf.toarray()
File /opt/conda/lib/python3.10/site-packages/sklearn/feature_extraction/text.py:2133, in TfidfVectorizer.fit_transform(self, raw_documents, y)
   2126 self._check_params()
   2127 self._tfidf = TfidfTransformer(
   2128     norm=self.norm,
   2129     use_idf=self.use_idf,
   2130     smooth_idf=self.smooth_idf,
   2131     sublinear_tf=self.sublinear_tf,
   2132 )
-> 2133 X = super().fit_transform(raw_documents)
   2134 self._tfidf.fit(X)
   2135 # X is already a transformed view of raw_documents so
   2136 # we set copy to False
LookupError: 
**********************************************************************
  Resource 'corpora/wordnet' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/root/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

How can I resolve this? The code runs fine in a Jupyter notebook on my local PC, but on Kaggle it throws this error. Here is the cell where I install NLTK and download the resources, followed by its output:

!pip install nltk  # install the Natural Language Toolkit (NLTK) via pip

import nltk

# Download the NLTK resources: stopwords are common words like "and" and
# "the" that are often removed in text processing, and punkt is the model
# used for tokenization, i.e. breaking text into words.
nltk.download('stopwords')
nltk.download('punkt')

# Import the stopword lists shipped with NLTK; removing these common words
# lets the analysis focus on meaningful words.
from nltk.corpus import stopwords

# Download WordNet, a lexical database of the English language. WordNet is
# often used for lemmatization, the process of reducing words to their base
# or root form.
nltk.download('wordnet')

# Create a lemmatizer backed by WordNet to standardize and normalize the text.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

Requirement already satisfied: nltk in /opt/conda/lib/python3.10/site-packages (3.2.4)
Requirement already satisfied: six in /opt/conda/lib/python3.10/site-packages (from nltk) (1.16.0)
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
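
Judging from this output, the downloads land in /usr/share/nltk_data (which is on the search path), yet NLTK 3.2.4 still cannot locate the corpus; the odd 'corpora/wordnet.zip/wordnet/.zip/' path in the error suggests the loader is tripping over the zipped archive. A sketch of two workarounds that are commonly suggested for this situation on Kaggle, not verified against this exact notebook (the /kaggle/working path is an assumption about a writable directory):

# Option 1: extract the zipped corpus so the loader finds the plain
# directory (assumes the download landed in /usr/share/nltk_data as logged).
import zipfile
with zipfile.ZipFile('/usr/share/nltk_data/corpora/wordnet.zip') as archive:
    archive.extractall('/usr/share/nltk_data/corpora/')

# Option 2: download into a writable directory and add it to NLTK's search path.
import nltk
nltk.download('wordnet', download_dir='/kaggle/working/nltk_data')
nltk.data.path.append('/kaggle/working/nltk_data')

# Upgrading NLTK itself may also help, since the installed 3.2.4 is quite old:
# !pip install -U nltk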