Word count script in Python

78 Views Asked by A_K At 16 January 2023 at 11:53

Can someone please explain why there is a 'b' in front of each word and how to get rid of it? The script returns something like this:

word= b'yesterday,' , count = 3

import urllib.request

current_word = {}
current_count = 0
text = "https://raw.githubusercontent.com/KseniaGiansar/pythonProject2_text/master/yesterday.txt"
request = urllib.request.urlopen(text)
each_word = []
words = None
count = 1
same_words ={}
word = []

# collect words into a list
for line in request:
    #print "Line = " , line
    line_words = line.split()
    for word in line_words:  # looping each line and extracting words
        each_word.append(word)
for words in each_word:
    if words.lower() not in same_words.keys() :
        same_words[words.lower()]=1
    else:
        same_words[words.lower()]=same_words[words.lower()]+1
for each in same_words.keys():
    print("word = ", each, ", count = ",same_words[each])

There are 3 best solutions below

David Meu On 16 January 2023 at 11:57 BEST ANSWER

The b prefix indicates that each word is a bytes object rather than a str.

urllib.request.urlopen() returns an HTTP response object, and iterating over it yields each line as bytes.

To fix this, you can use the .decode() method to convert the bytes object to a string before appending it to the list.

for line in request:
    line_words = line.decode().split() # decode the bytes object to a string
    for word in line_words:
        each_word.append(word)
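
For reference, here is a minimal sketch of the whole count with the decode step in place. The count_words helper and the sample lines are my own illustration, not part of the original script; a list of bytes lines stands in for the response so the snippet runs without a network connection.

```python
from collections import Counter

def count_words(lines, encoding="utf-8"):
    """Count case-insensitive words in an iterable of bytes lines."""
    counts = Counter()
    for line in lines:
        # Each line read from the HTTP response is bytes; decode it to str first.
        text = line.decode(encoding)
        counts.update(word.lower() for word in text.split())
    return counts

# Stand-in for the response object: iterating it yields bytes lines.
sample = [b"Yesterday, all my troubles\n", b"yesterday, came suddenly\n"]
for word, count in count_words(sample).items():
    print("word =", word, ", count =", count)
```

With the real URL, the same helper should accept the response directly, for example count_words(urllib.request.urlopen(text)), assuming the file is UTF-8 encoded.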

Marc Sances On 16 January 2023 at 11:58

B-strings in Python are byte strings.
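
A quick way to see the difference, as a small sketch (the variable names raw and text are just for illustration):

```python
raw = b"yesterday,"          # a bytes literal, like what the loop receives
text = raw.decode("utf-8")   # the same characters as a str

print(type(raw).__name__, repr(raw))    # bytes b'yesterday,'
print(type(text).__name__, repr(text))  # str 'yesterday,'
print(raw == text)                      # False: bytes and str never compare equal
```

The b is only part of how Python displays a bytes object; it is not stored in the data itself.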

When you are reading from an HTTP request, the response is in bytes, and you should decode it like this:

line_words = line.decode("utf8").split()

Please make sure the encoding you pass to decode() (UTF-8 in my example) matches the charset in the Content-Type header of the response. You can send an Accept-Charset: utf-8 header in the request to ask the server for UTF-8, though the server is free to ignore it.
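
If you would rather not hard-code the encoding, one option is to read the charset from the headers and fall back to UTF-8 when none is declared. This is only a sketch: decode_body is a hypothetical helper, and the header value below is made up so the example runs offline. It relies on get_content_charset(), which the headers object of a urllib response inherits from email.message.Message.

```python
from email.message import Message

def decode_body(body, headers, default="utf-8"):
    """Decode bytes using the charset declared in Content-Type, if any."""
    # get_content_charset() returns None when no charset parameter is present.
    charset = headers.get_content_charset() or default
    return body.decode(charset)

# Offline stand-in for response.headers (hypothetical header value).
fake_headers = Message()
fake_headers["Content-Type"] = "text/plain; charset=iso-8859-1"

print(decode_body("café".encode("iso-8859-1"), fake_headers))
```

With a real response this would look like decode_body(response.read(), response.headers).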

Abdullah Arafat On 16 January 2023 at 12:00

The b probably means bytes.

I think you can remove the "b" by decoding the bytes into a string with the .decode() method. In this case, you can add the following line as the first statement inside the outer for loop, before line.split() is called:

line = line.decode("utf-8")

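Alternatively, you can remove the b from each word before adding it to the each_word list by doing the following (use one approach or the other, since a str that is already decoded has no .decode() method):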

word = word.decode("utf-8")
