I am facing this error while running my code on Kaggle; it works fine on my local PC. Here is the notebook link on Kaggle if you'd like to look at the details: Kaggle
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Custom tokenizer that removes stopwords and applies lemmatization
def customtokenize(text):
    # Split the string into tokens
    tokens = nltk.word_tokenize(text)
    # Filter out English stopwords
    nostop = [token for token in tokens if token not in stopwords.words('english')]
    # Reduce each remaining token to its base form
    lemmatized = [lemmatizer.lemmatize(word) for word in nostop]
    return lemmatized

from sklearn.feature_extraction.text import TfidfVectorizer

# Build a TF-IDF vectorizer that uses the custom tokenizer
vectorizer = TfidfVectorizer(tokenizer=customtokenize)
# Transform the input messages to TF-IDF features
tfidf = vectorizer.fit_transform(spam_messages)
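On my local machine the tokenizer itself behaves as expected; for example (a quick check, with the exact output depending on the NLTK data installed):

customtokenize("The cats are running faster than the dogs")
# Returns something like: ['The', 'cat', 'running', 'faster', 'dog']
# ('are', 'than', 'the' are dropped as stopwords; 'cats'/'dogs' are lemmatized;
# 'The' survives because the stopword check is case-sensitive)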
On Kaggle, however, the same cell fails with the following traceback:
---------------------------------------------------------------------------
LookupError Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/nltk/corpus/util.py:80, in LazyCorpusLoader.__load(self)
79 except LookupError as e:
---> 80 try: root = nltk.data.find('{}/{}'.format(self.subdir, zip_name))
81 except LookupError: raise e
File /opt/conda/lib/python3.10/site-packages/nltk/data.py:653, in find(resource_name, paths)
652 resource_not_found = '\n%s\n%s\n%s' % (sep, msg, sep)
--> 653 raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource 'corpora/wordnet.zip/wordnet/' not found. Please
use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/root/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**********************************************************************
During handling of the above exception, another exception occurred:
LookupError Traceback (most recent call last)
Cell In[3], line 20
17 vectorizer = TfidfVectorizer(tokenizer=customtokenize)
19 #Transform feature input to TF-IDF
---> 20 tfidf=vectorizer.fit_transform(spam_messages)
21 #Convert TF-IDF to numpy array
22 tfidf_array = tfidf.toarray()
File /opt/conda/lib/python3.10/site-packages/sklearn/feature_extraction/text.py:2133, in TfidfVectorizer.fit_transform(self, raw_documents, y)
2126 self._check_params()
2127 self._tfidf = TfidfTransformer(
2128 norm=self.norm,
2129 use_idf=self.use_idf,
2130 smooth_idf=self.smooth_idf,
2131 sublinear_tf=self.sublinear_tf,
2132 )
-> 2133 X = super().fit_transform(raw_documents)
2134 self._tfidf.fit(X)
2135 # X is already a transformed view of raw_documents so
2136 # we set copy to False
LookupError:
**********************************************************************
Resource 'corpora/wordnet' not found. Please use the NLTK
Downloader to obtain the resource: >>> nltk.download()
Searched in:
- '/root/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**********************************************************************
How can I resolve this? The code runs fine in a Jupyter notebook on my local PC, but on Kaggle it throws this error. Earlier in the notebook I already install NLTK and download the required resources; here is that setup cell and its output:
!pip install nltk  # Install the Natural Language Toolkit (NLTK) via pip.
import nltk

# Download the stopwords and punkt datasets: stopwords are common words
# like "and" or "the" that are usually removed during text processing,
# and punkt provides the models used to tokenize text into words.
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords

# Download WordNet, a lexical database of English used for lemmatization,
# i.e. reducing words to their base or root form.
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
Requirement already satisfied: nltk in /opt/conda/lib/python3.10/site-packages (3.2.4)
Requirement already satisfied: six in /opt/conda/lib/python3.10/site-packages (from nltk) (1.16.0)
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
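In case it helps, this is the sanity check I can run (a minimal sketch using nltk.data.find, the same call that fails in the traceback):

import nltk
print(nltk.data.path)                     # directories NLTK searches for data
print(nltk.data.find('corpora/wordnet'))  # raises LookupError if wordnet is not visible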