How could I read through a text file in Python and recognize certain tokens/words?


I tried to build a program that would open and read through each line of a .txt file, then recognize and label its components.

I was able to get the program to open the text file and recognize some components, but I have run into trouble with Python strings. The problem seems to be that the program reads the .txt file one character at a time and never builds up a longer string before checking it.

I ended up with a program that just reads and labels every single character of the .txt file.

#open a text file
program = open("input.txt", "r");
strHold =""
x=""

#function to check the strings
def findWord(strIn):
    if(strIn == "input"):
        print("<input>, "+strIn)
        # strIn = ""
        return
        
    elif(strIn == "("):
        print("<lparen>, "+strIn)
        # strIn = ""
        return
    
    elif(strIn == ")"):
        print("<rparen>, "+strIn)
        # strIn = ""
        return
        
    elif(strIn == "="):
        print("<assign_op>, "+strIn)
        # strIn = ""
        return
        
    elif(strIn == "+" or strIn == "-"):
        print("<add_op>, "+strIn)
        # strIn = ""
        return
        
    elif(strIn == "+" or strIn == "-"):
        print("<add_op>, "+strIn)
        # strIn = ""
        return
        
    elif(strIn=="/" or strIn=="*" or strIn == "//" or strIn == "%"):
        print("<mult_op>, "+strIn)
        # strIn = ""
        return
    
    elif(strIn==" " or strIn=="\n"):#check
        x= strIn.isspace()
        # if(x==True):
            # strIn=""
        return
    
    elif(strIn.isnumeric() == True):
        print("<number>, "+strIn)
        # strIn=""
        return
    
    elif(strIn=="output"):
        print("<output>, "+strIn)
        #strIn = ""
        return
    
        
    else:#default is an id??
        print("<id>, "+strIn)
        # strIn=""
        return


#loop through .txt file
for line in program:
    for c in line:
        strHold = strHold+c
        findWord(strHold);
        strHold="" 

There aren't any errors from the code.

The .txt file just looks like:

input(a)
input(b)
input(c)
total = a + b + c  /* get a sum of three inputs */
average = total / 3 /* compute an average */
output(total)
output(average)

The issue is that the current output from the .py program is:

<id>, i
<id>, n
<id>, p
<id>, u
<id>, t
<lparen>, (
<id>, a
<rparen>, )
<id>, i
...

(and so on for every character in the file; I didn't include all of it)

The ideal output would be:

<input>, input
<lparen>, (
<id>, a
<rparen>, )

It seems like a logic error, or a small detail of Python that I've missed. I've been trying to fix it but am now stuck. Does anyone have any suggestions for how to fix this output?

Answer by Alain T.

Your code always processes each line one character at a time: strHold is reset to "" after every call, so findWord only ever receives a single character and can only identify single-character matches.

To fix this, you should either use more sophisticated tools (such as regular expressions) or, if you need to implement this without libraries, allow your findWord function to look at complete substrings from each position on the line.
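
For the regular-expression option, a minimal sketch could look like the following. It is only one possible arrangement, not a definitive tokenizer: the names TOKEN_SPEC, TOKEN_RE and tokenize are made up for this example, the identifier pattern is a guess at what counts as an id, and skipping /* ... */ comments is an assumption, since the question does not say how they should be labeled.

```python
import re

# Token patterns, tried in order; keywords come before the generic identifier.
# All names and patterns here are illustrative assumptions.
TOKEN_SPEC = [
    ("comment",   r"/\*.*?\*/"),      # assumption: /* ... */ comments are skipped
    ("input",     r"\binput\b"),
    ("output",    r"\boutput\b"),
    ("number",    r"\d+"),
    ("id",        r"[A-Za-z_]\w*"),
    ("lparen",    r"\("),
    ("rparen",    r"\)"),
    ("assign_op", r"="),
    ("add_op",    r"[+-]"),
    ("mult_op",   r"//|[*/%]"),       # "//" is listed before "/" so it wins
    ("space",     r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(line):
    """Yield (label, text) pairs for one line, skipping whitespace and comments."""
    for match in TOKEN_RE.finditer(line):
        if match.lastgroup not in ("space", "comment"):
            yield match.lastgroup, match.group()

for label, text in tokenize("input(a)"):
    print(f"<{label}>, {text}")
```

Note that finditer silently skips characters that match none of the patterns, so unexpected symbols would need their own catch-all pattern. Without the re module, the substring approach looks like this: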

for line in program:
    for i in range(len(line)):
        findWord(line[i:])     # give findWord whole substrings

Then, in the findWord function, match prefixes instead of whole strings:

def findWord(strIn):
    if strIn.startswith("input"):
       ...

    if strIn.startswith("output"):
       .... 

For the isnumeric() check, you will get each digit separately if you only look at the first character, unless you look ahead for more digits:

    if strIn[0].isnumeric():
       size = 1
       # extend the match while the next character is also a digit
       while size < len(strIn) and strIn[size].isnumeric():
           size += 1
       print("<number>, "+strIn[:size])
       

Note that this will output overlapping matches for multi-character tokens. You may want your findWord function to return the length of the match so that the calling loop can skip over the already matched sequences.
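
As a rough sketch of that last suggestion, findWord could return a label together with the match length, and the calling loop could advance by that length. The names KEYWORDS, SYMBOLS and scan, the "unknown" label, and the rule for what counts as an identifier are assumptions made for this example. Comments such as /* ... */ are not treated specially here.

```python
KEYWORDS = ("input", "output")
SYMBOLS = [  # longer symbols first so "//" wins over "/"
    ("//", "mult_op"), ("/", "mult_op"), ("*", "mult_op"), ("%", "mult_op"),
    ("+", "add_op"), ("-", "add_op"),
    ("=", "assign_op"), ("(", "lparen"), (")", "rparen"),
]

def findWord(strIn):
    """Return (label, length) for the token at the start of strIn.

    The label is None for whitespace, which the caller skips.
    """
    first = strIn[0]
    if first.isspace():
        return None, 1
    if first.isnumeric():
        size = 1
        while size < len(strIn) and strIn[size].isnumeric():
            size += 1
        return "number", size
    if first.isalpha() or first == "_":
        # read the whole word, then decide whether it is a keyword or an id
        size = 1
        while size < len(strIn) and (strIn[size].isalnum() or strIn[size] == "_"):
            size += 1
        word = strIn[:size]
        return (word if word in KEYWORDS else "id"), size
    for symbol, label in SYMBOLS:
        if strIn.startswith(symbol):
            return label, len(symbol)
    return "unknown", 1  # hypothetical label for unrecognized characters

def scan(line):
    """Collect (label, text) pairs, skipping over each matched token."""
    tokens = []
    i = 0
    while i < len(line):
        label, size = findWord(line[i:])
        if label is not None:
            tokens.append((label, line[i:i + size]))
        i += size  # jump past the match so it is not reported again
    return tokens

for label, text in scan("input(a)"):
    print(f"<{label}>, {text}")
```

Reading the whole word before comparing it with the keywords, rather than using startswith("input") directly, keeps a name such as inputs from being split into input followed by s.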