Split a large text file into multiple files based on a list of keywords in Python


I am new to Python and stuck on my homework. I am trying to split a 10,000-line text file into multiple files based on a list of keywords.

input.txt looks something like this:

 Name: Apple
 Type: Fruits
 Description:...

 Name: Orange
 Type: Fruits
 Description:...

 Name: Yellow
 Type: Colour
 Description:...

 Name: Apple
 Type: Fruits
 Description:...

 Name: Orange
 Type: Fruits
 Description:...

 Name: Yellow
 Type: Colour
 Description:...
 

Keywords:

Apple
Orange
Yellow

Expected output files :

Apple.txt

 Type: Fruits
 Description:

Orange.txt

 Type: Fruits
 Description:

Yellow.txt

 Type: Colour
 Description:

But my current code is only able to split when the key is 'Apple'. I am not sure how to modify it to handle a list of keywords.

key = ['Apple']

outfile = None
fno = 0
lno = 0

with open('input.txt') as infile:
    while line := infile.readline():
        lno += 1
        if outfile is None:
            fno += 1
            outfile = open(f'{fno}.txt', 'w')
        outfile.write(line)
        
        if key in line:
            print(f'"{key}" found in line {lno}')
            outfile.close()
            outfile = None
if outfile:
    outfile.close()

Edit: It should print the first record for each keyword.

There are 2 best solutions below

tripleee On BEST ANSWER

Here is a somewhat more idiomatic version of your code. It does not hardcode a list of keywords; it simply picks up whatever comes after Name:

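seen = set()      # keywords whose first record has already been written
outfile = None    # currently open output file, if any

with open('input.txt') as infile:
    for line in infile:
        if line.startswith(' Name: '):
            # Strip only the newline, so a final line without one is not truncated.
            keyword = line[len(' Name: '):].rstrip('\n')
            if keyword not in seen:
                outfile = open(f'{keyword}.txt', 'w')
                seen.add(keyword)
        if outfile is not None:
            if line.strip() == '':
                # A blank line ends the record.
                outfile.close()
                outfile = None
            else:
                outfile.write(line)
if outfile is not None:
    outfile.close()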

You were never doing anything useful with lno but if you wanted it for some reason, the idiomatic way to get line numbers is

    for lno, line in enumerate(infile, start=1):

Your sample input.txt shows a space at the beginning of each line. If that was incorrectly transcribed, obviously adapt accordingly.
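
If you do want to restrict the output to a fixed list of keywords, as the question asks, the same loop can be adapted. The following is only a minimal sketch: the function name split_by_keywords and its parameters are my own invention, and it assumes that records are separated by blank lines and that the Name: line should be left out of the output, as in the expected output shown in the question.

```python
from pathlib import Path


def split_by_keywords(input_path, keywords, out_dir='.'):
    """Write the first record of each keyword to <out_dir>/<keyword>.txt.

    Hypothetical helper, not part of the original code. Returns the
    keywords for which a file was written, in order of appearance.
    """
    remaining = set(keywords)   # keywords not yet written
    written = []
    outfile = None
    with open(input_path) as infile:
        for line in infile:
            stripped = line.strip()
            if stripped.startswith('Name:'):
                # A new record starts: close the previous output, if any.
                if outfile is not None:
                    outfile.close()
                    outfile = None
                name = stripped[len('Name:'):].strip()
                if name in remaining:
                    remaining.remove(name)
                    written.append(name)
                    outfile = open(Path(out_dir) / f'{name}.txt', 'w')
                continue  # the Name line itself is not copied
            if outfile is not None:
                if stripped == '':
                    # A blank line ends the record.
                    outfile.close()
                    outfile = None
                else:
                    outfile.write(line)
    if outfile is not None:
        outfile.close()
    return written
```

Calling split_by_keywords('input.txt', ['Apple', 'Orange', 'Yellow']) would then create one file per keyword that actually occurs in the input. Because the Name: line is matched after stripping whitespace, this works with or without the leading space.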

blhsing On

Since records are separated by blank lines, a cleaner approach is to read the first line of each record to get its name, then use the two-argument form of iter with '\n' as the sentinel to read the rest of the record. The rest is written to a file named after the record if the name is one of the keywords:

keywords = {'Apple', 'Orange', 'Yellow'}

with open('input.txt') as file:
    # strip() drops the trailing newline so the name can match a keyword
    while keywords and (name := next(file, ': ').split(': ', 1)[-1].strip()):
        rest = list(iter(file.__next__, '\n'))
        if name in keywords:
            keywords.remove(name)
            with open(name + '.txt', 'w') as output:
                output.writelines(rest)

Demo: https://replit.com/@blhsing1/SiennaOrderlyCygwin
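
In case the two-argument form of iter is unfamiliar, here is a small illustration of how it reads one record at a time. It is only a sketch: io.StringIO stands in for the real file, and the sample data is made up.

```python
import io

# In-memory stand-in for input.txt (hypothetical sample data).
sample = io.StringIO(
    'Name: Apple\n'
    'Type: Fruits\n'
    '\n'
    'Name: Yellow\n'
    'Type: Colour\n'
)

# iter(callable, sentinel) keeps calling the callable and yields each
# result until one of them equals the sentinel.
first_line = next(sample)                    # 'Name: Apple\n'
rest = list(iter(sample.__next__, '\n'))     # stops at the blank line

second_line = next(sample)                   # 'Name: Yellow\n'
tail = list(iter(sample.__next__, '\n'))     # stops at end of input

print(rest)
print(tail)
```

The blank line itself is consumed but not included in the result, so the next call to next() starts at the following record.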