Word count script in Python

Can someone please explain why there is a 'b' in front of each word and how to get rid of it? The script returns something like this:

word= b'yesterday,' , count = 3

import urllib.request

current_word = {}
current_count = 0
text = "https://raw.githubusercontent.com/KseniaGiansar/pythonProject2_text/master/yesterday.txt"
request = urllib.request.urlopen(text)
each_word = []
words = None
count = 1
same_words ={}
word = []

# collect words into a list
for line in request:
    #print "Line = " , line
    line_words = line.split()
    for word in line_words:  # looping each line and extracting words
        each_word.append(word)
for words in each_word:
    if words.lower() not in same_words.keys() :
        same_words[words.lower()]=1
    else:
        same_words[words.lower()]=same_words[words.lower()]+1
for each in same_words.keys():
    print("word = ", each, ", count = ",same_words[each])

There are 3 best solutions below

David Meu (best answer)

The b prefix indicates that each item in words is a bytes object, not a str.

urllib.request.urlopen() returns an HTTPResponse object, and iterating over it yields each line as bytes.

To fix this, you can use the .decode() method to convert each line from bytes to a string before splitting it into words.

for line in request:
    line_words = line.decode().split() # decode the bytes object to a string
    for word in line_words:
        each_word.append(word)
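
As a rough sketch of the whole fix, the decoding and counting steps can be combined in one function. The name count_words and the sample list are made up for illustration, and collections.Counter is swapped in for the hand-written dictionary from the question; the list of bytes lines stands in for the HTTP response so the example runs without a network connection.

```python
from collections import Counter


def count_words(lines, encoding="utf-8"):
    """Count case-insensitive words in an iterable of bytes lines."""
    counts = Counter()
    for line in lines:
        # Decode each bytes line to str before splitting,
        # so the keys print without the b'...' prefix.
        counts.update(word.lower() for word in line.decode(encoding).split())
    return counts


# Stand-in for the response; iterating over urlopen(...) also yields bytes lines.
sample = [b"Yesterday, all my troubles\n", b"Oh, I believe in yesterday.\n"]
for word, count in count_words(sample).items():
    print("word =", word, ", count =", count)
```

Punctuation is still attached to the words here (for example "yesterday," and "yesterday." are counted separately), just as in the original script.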

Marc Sances

B-strings in Python are byte strings.

When you are reading from an HTTP request, the response is in bytes, and you should decode it like this:

line_words = line.decode("utf8").split()

Please make sure the encoding you decode with (UTF-8 in my example) matches the charset in the Content-Type header of the response. You can send an Accept-Charset: utf-8 header in the request to ask the server for UTF-8, although many servers ignore this header.
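
One possible way to pick the encoding from the response instead of hard-coding it is sketched below. The helper decode_body and the sample header are hypothetical; the sketch assumes that response.headers from urlopen() behaves like an email.message.Message, which provides get_content_charset().

```python
from email.message import Message


def decode_body(raw, headers, default="utf-8"):
    """Decode raw bytes using the charset from Content-Type, or a default."""
    # get_content_charset() returns None when no charset parameter is present.
    charset = headers.get_content_charset() or default
    return raw.decode(charset)


# Hypothetical headers standing in for response.headers.
headers = Message()
headers["Content-Type"] = "text/plain; charset=iso-8859-1"
text = decode_body(b"caf\xe9", headers)
print(text)
```

With a real response, the same call would look like decode_body(line, request.headers) inside the loop from the question.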

Abdullah Arafat

The b most likely stands for bytes.

I think you can remove the "b" by decoding the bytes into a string with the .decode() method. In this case, add the following line at the top of the outer for loop, before the call to line.split():

line = line.decode("utf-8")

Alternatively, you can remove the b from each word before adding it to the each_word list:

word = word.decode("utf-8")
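
A small sketch comparing the two placements on a made-up sample line (not taken from the actual file) suggests they give the same words for ordinary space-separated text; decoding the whole line simply needs one call instead of one per word.

```python
# A sample bytes line, similar to what iterating over the response yields.
line = b"Yesterday, all my troubles\n"

# Option 1: decode the whole line once, then split it into words.
words_from_line = line.decode("utf-8").split()

# Option 2: split the bytes first, then decode each word.
words_from_words = [word.decode("utf-8") for word in line.split()]

print(words_from_line == words_from_words)   # True
print(line.split()[0], words_from_line[0])   # b'Yesterday,' Yesterday,
```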