Sample table:
| a |
|---|
| ryzen cpu,ryzen 5 5600x,best,amd ryzen,sale |
| cpu,ryzen 9 7800x,available,computer for ryzen,new |
import pandas as pd
from nltk.corpus import stopwords, wordnet

df = pd.DataFrame({'a' : ['ryzen cpu,ryzen 5 5600x,best,amd ryzen,sale',
                          'cpu,ryzen 9 7800x,available,computer for ryzen,new']})
stop = stopwords.words('english')
b = ['best', 'sale', 'new','available']
c = stop + b
x = []
for i in df['a'].str.split(','):
    for j in i:
        if j not in b:
            x.append(j)
print(x)
I am trying to remove the stopwords and the other specific words listed above. The other words are removed, but the stopwords are not.
This is the output I am getting:
['ryzen cpu', 'ryzen 5 5600x', 'amd ryzen', 'cpu', 'ryzen 9 7800x', 'computer for ryzen']
I am also unable to get the result back in table format. I tried the following list comprehension, but it does not work:
df['a'] = df['a'].apply(lambda x: ''.join([j for i in x.split(' , ') for j in i if j not in c]))
df['a']
The output looks completely off: some letters are missing entirely ("ryzen" has become "rzen", "sale" has become "le", and so on):
| a |
|---|
| rzen cpu,rzen 5 5600x,be, rzen,le |
| cpu,rzen 9 7800x,vlble,cpuer fr rzen,new |
Can anyone help me understand what exactly I am doing wrong and how to proceed from here?
The expected output looks something like this:
| a |
|---|
| ryzen cpu,ryzen 5 5600x,amd ryzen |
| cpu,ryzen 9 7800x,computer ryzen |
You're not really using the stop words, only your additional words. Look at this line:
`if j not in b:`
Change `b` to `c` here.
It really, really helps debugging if you name your variables so that the names mean what they contain.
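As for the loops here, the code is very hard to read, so let's go through it step by step:
- `i` is a word in `x` (in the part `for i in x.split(' , ')`)
- `j` is a symbol in `i` (`for j in i`)
- the check on `j` is performed (`if j not in c`)
- the remaining `j` are joined together

That's how you get random symbols in the end. The string never contains `' , '`, so `split` returns the whole string as a single item, `j` walks over its individual characters, and every character that happens to be a one-letter stop word (such as `a`, `s`, `t` or `y`) is dropped. I think the right thing to do is to get rid of the lambda in `apply`, because there's too much logic in it and it's easy to get confused. It may be better to write a separate function and apply it.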
(By the way, don't split by `" , "`, split by `","`!)
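One way the separate function could look is sketched below. It is only an illustration: `clean_cell`, `STOP_WORDS`, `EXTRA_WORDS` and `UNWANTED` are hypothetical names, and `STOP_WORDS` is a tiny hand-written stand-in for `stopwords.words('english')` so the snippet runs without the NLTK data. It splits each cell on `","`, then checks every individual word, which is what lets "for" disappear from "computer for ryzen".

```python
# Assumption: a small stand-in for stopwords.words('english'),
# used here only so the sketch runs without downloading NLTK data.
STOP_WORDS = {"for", "the", "a", "of", "in"}
EXTRA_WORDS = {"best", "sale", "new", "available"}
UNWANTED = STOP_WORDS | EXTRA_WORDS


def clean_cell(cell):
    """Drop unwanted words from one comma-separated cell."""
    kept_phrases = []
    for phrase in cell.split(","):
        # Compare whole words, not characters and not whole phrases.
        words = [w for w in phrase.split() if w.lower() not in UNWANTED]
        if words:  # skip phrases that became empty, e.g. "best"
            kept_phrases.append(" ".join(words))
    return ",".join(kept_phrases)


rows = [
    "ryzen cpu,ryzen 5 5600x,best,amd ryzen,sale",
    "cpu,ryzen 9 7800x,available,computer for ryzen,new",
]
cleaned = [clean_cell(row) for row in rows]
print(cleaned)
```

With the DataFrame from the question, the same function could be applied as `df['a'] = df['a'].apply(clean_cell)`, after replacing the stand-in set with `set(stopwords.words('english'))`.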