Reading MPEG Transport Stream (binary file) PID values quickly in Python

I have a large MPEG transport stream (.ts) binary file, whose size is normally a multiple of 188 bytes. I'm using Python 3. When I read 188 bytes at a time and parse each packet to get the required value, it is really slow. I have to traverse every 188-byte packet to get its PID value (binary data).

  • By comparison, any professional offline MPEG analyzer lists all PID values and their total counts within 45 seconds for a 5-minute TS file, whereas my program takes more than 10 minutes to get the same result.
  • I don't understand how they can find them so quickly, even if they are written in C or C++.
  • I tried Python multiprocessing, but it doesn't help much. This suggests that my method of parsing and processing each 188 bytes of data is wrong and is causing the huge delay.

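```python
from bitstring import BitStream  # assumed: BitStream comes from the third-party "bitstring" package

cnt = 0
with open(file2, 'rb') as f:
    while True:
        data = f.read(188)  # one transport stream packet
        if len(data) == 0:
            break
        b = BitStream(data)
        ...   # parse b to get the required value
        ...   # and increase count when needed
        ...
        cnt = cnt + 188
        f.seek(cnt)  # redundant: read() has already advanced the file position
```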

There are 2 best solutions below

2
AudioBubble On

It's your code, man.

I tried BitStream for a while too; it's slow.

The cProfile module is your friend.
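As a rough sketch of what that can look like: the snippet below runs a parse loop under cProfile and returns the report as text. The names `parse_packets` and `profile_parse` are made up for illustration, and the input is a small synthetic buffer rather than a real .ts file.

```python
import cProfile
import io
import pstats


def parse_packets(data, packet_size=188):
    """Hypothetical stand-in for the parse loop: tally the PID of each packet."""
    counts = {}
    for off in range(0, len(data) - packet_size + 1, packet_size):
        # PID is the low 13 bits of header bytes 1 and 2
        pid = (data[off + 1] << 8 | data[off + 2]) & 0x1FFF
        counts[pid] = counts.get(pid, 0) + 1
    return counts


def profile_parse(data):
    """Run parse_packets under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = parse_packets(data)
    profiler.disable()
    out = io.StringIO()
    # Show the ten entries with the largest cumulative time.
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
    return result, out.getvalue()


if __name__ == "__main__":
    # Synthetic data: 1000 packets with sync byte 0x47 and PID 0x0100.
    packet = bytes([0x47, 0x01, 0x00, 0x10]) + bytes(184)
    counts, report = profile_parse(packet * 1000)
    print(counts)
    print(report)
```

For a whole script, `python3 -m cProfile -s cumtime your_script.py your_file.ts` gives a similar report without touching the code; the script and file names here are placeholders.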

With pypy3, I can parse 3.7GB of mpegts in 2.9 seconds, single process.

With Go, I can parse 3.7GB in 1.2 seconds.

2
AudioBubble On

You're cool, man. Try it like this:

```python
import sys
from functools import partial

PACKET_SIZE = 188


def do():
    args = sys.argv[1:]
    for arg in args:
        print(f'next file: {arg}')
        pkt_num = 0
        with open(arg, 'rb') as vid:
            # read one 188-byte packet per iteration until EOF (b"")
            for pkt in iter(partial(vid.read, PACKET_SIZE), b""):
                pkt_num += 1
                # PID is the low 13 bits of header bytes 1 and 2
                pid = (pkt[1] << 8 | pkt[2]) & 0x1FFF
                print(f'Packet: {pkt_num} Pid: {pid}', end='\r')


if __name__ == "__main__":
    do()
```

Keep in mind that printing each PID will slow you down; there are over 20 million packets in 3.7 GB of MPEG-TS.
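If what you want is the per-PID totals from the question rather than a line per packet, the loop can tally instead of print. Here is a minimal sketch of that idea, not a benchmarked implementation. The name `count_pids` and the `PACKETS_PER_READ` chunk size are my own choices, and it assumes a regular file whose reads stay aligned to 188-byte packets.

```python
import sys
from collections import Counter

PACKET_SIZE = 188
SYNC_BYTE = 0x47
PACKETS_PER_READ = 10_000  # assumed chunk size; tune to taste


def count_pids(stream):
    """Tally PIDs from a binary stream of 188-byte packets."""
    counts = Counter()
    chunk_size = PACKET_SIZE * PACKETS_PER_READ
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Step through whole packets only; ignore a trailing partial packet.
        usable = len(chunk) - len(chunk) % PACKET_SIZE
        for off in range(0, usable, PACKET_SIZE):
            if chunk[off] != SYNC_BYTE:
                continue  # skip packets that do not start with the sync byte
            pid = (chunk[off + 1] << 8 | chunk[off + 2]) & 0x1FFF
            counts[pid] += 1
    return counts


if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path, "rb") as vid:
            # One summary line per PID instead of one line per packet.
            for pid, n in sorted(count_pids(vid).items()):
                print(f"PID {pid:#06x}: {n}")
```

Reading many packets per call cuts down on the number of `read()` calls, and the output is one line per PID, so printing is no longer the bottleneck.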

a@fumatica:~/threefive$ time pypy3 cli2.py plp0.ts 
next file: plp0.ts
Packet: 20859290 Pid: 1081
real    1m22.976s
user    0m48.331s
sys     0m34.323s

Printing each PID, it takes 1m22.976s.

If I comment out

   #print(f'Packet: {pkt_num} Pid: {pid}', end='\r')

it goes much faster:

a@fumatica:~/threefive$ time pypy3 no_print.py plp0.ts 
next file: plp0.ts

real    0m3.080s
user    0m2.237s
sys     0m0.816s

If I change the print call to

                 print(f'Packet: {pkt_num} Pid: {pid}')

and redirect the output to a file, it takes only about 9 seconds to parse 3.7 GB:

a@fumatica:~/threefive$ time pypy3 cli2.py plp0.ts > out.pids

real    0m9.228s
user    0m7.820s
sys     0m1.229s

a@fumatica:~/threefive$ wc -l out.pids 
20859291 out.pids

Hope that helps you, man.