How to read a text file partially in Python and join the parts to analyze and plot a histogram efficiently?

I'm having trouble reading a large .txt file. It contains a huge amount of data (88604154 lines, about 2695.79 MB), and I need to analyze the data and then plot a histogram of it.

The problem is that reading that much data takes a very long time, so I thought I could read the data in parts and combine the parts afterwards. After a little searching, I came up with this code:

file_name = '/home/lam/Downloads/C3--Trace--00001.txt'

# Line indices to keep: 1 through 50000 (enumerate counts from 0).
# A range supports fast membership tests, unlike a 50000-element list.
lines_num = range(1, 50001)

with open(file_name, 'r') as fp:
    lines = []
    for i, line in enumerate(fp):
        if i in lines_num:
            lines.append(line.strip())
        elif i > 50001:
            break
# No explicit close() is needed: the with block closes the file.

This gives me a fixed range of lines (for example, lines 1 to 50000), but I want to repeat it about 1775 times to read all the data and then append everything to one list. How can I write a function for this?

Answer by Kushim:

You need to read in chunks until there are no more chunks available:

with open(r"/home/lam/Downloads/C3--Trace--00001.txt", 'r') as src, open("sink.txt", 'w') as sink:
  chunk_size = 1024 * 1024  # in text mode read() counts characters, so roughly 1 MiB per chunk
  while True:
    chunk = src.read(chunk_size)
    if not chunk:
      break
    sink.write(chunk)

Here I read one chunk at a time and write it to another file.

The read function advances the file position automatically, so you don't need to keep track of an index yourself.
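If you'd rather work in blocks of lines than in blocks of characters, `itertools.islice` can pull a fixed number of lines from the open file on each pass. This is only a minimal sketch: the function name `read_in_line_chunks` and the default of 50000 lines per chunk are my own choices for illustration.

```python
from itertools import islice


def read_in_line_chunks(path, lines_per_chunk=50_000):
    """Yield lists of stripped lines, at most lines_per_chunk per list.

    Hypothetical helper: the name and default chunk size are assumptions.
    """
    with open(path, "r") as fp:
        while True:
            # islice continues from the current file position on every pass
            chunk = [line.strip() for line in islice(fp, lines_per_chunk)]
            if not chunk:
                break  # end of file reached
            yield chunk
```

Looping with `for chunk in read_in_line_chunks(file_name):` then handles one block at a time. Calling `lines.extend(chunk)` inside that loop would rebuild the full list, at the cost of holding everything in memory again.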

You could also use the code you shared, but without the break condition:

file_name = "/home/lam/Downloads/C3--Trace--00001.txt"

with open(file_name, 'r') as fp:
    lines = []
    for line in fp:
        lines.append(line.strip())

Example of how to calculate the overall mean line by line, weighting each line's mean by how many numbers that line contains:

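import statistics

means = []
total_nums = 0

with open(r"./info.txt", 'r', newline="\n") as src:
    for line in src:
        if not line.strip():
            continue  # skip blank lines, which int() cannot parse
        # Each line holds comma-separated integers
        nums = [int(num) for num in line.split(",")]
        means.append({"num": len(nums), "mean": statistics.mean(nums)})
        total_nums += len(nums)

# Combine the per-line means, weighting each by its share of all numbers
total_mean = 0
for entry in means:
    total_mean += entry["mean"] * (entry["num"] / total_nums)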
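The same streaming idea works for the histogram: if you fix the bin edges up front, you can add to the bin counts as you read and never hold all the values at once. The sketch below makes assumptions about your data that I can't check: one number per line, and a value range `lo` to `hi` that you already know. The function name and parameters are hypothetical.

```python
def streaming_histogram(lines, lo, hi, n_bins):
    """Count values from an iterable of text lines into equal-width bins.

    Assumes one number per line and a known range [lo, hi] (both are
    assumptions about the data, not something taken from the question).
    """
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        value = float(line)
        if value < lo or value > hi:
            continue  # ignore values outside the chosen range
        # A value equal to hi is placed in the last bin
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts
```

Because a file object is itself an iterable of lines, you can pass `fp` directly from inside a `with open(file_name) as fp:` block. The resulting counts can then be drawn with something like matplotlib's `plt.bar`, which only needs the bin positions and counts rather than the raw data.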