File retention mechanism for large data storage


Recently I faced a performance problem with MP4 file retention. I have a kind of recorder which saves 1-minute-long MP4 files from multiple RTSP streams. The files are stored on an external drive in a file tree like this:

./recordings/{camera_name}/{YYYY-MM-DD}/{HH-MM}.mp4

Apart from the video files, there are many other files on this drive which are not considered (unless they have the .mp4 extension), as they take up much less space.

The file retention works as follows. Every minute, the Python script responsible for recording checks the external drive's usage level. If the level is above 80%, it scans the whole drive looking for .mp4 files. When the scan is done, it sorts the list of files by creation date and deletes as many of the oldest files as there are cameras.

The part of the code responsible for file retention is shown below.

total, used, free = shutil.disk_usage("/home")
used_percent = int(used / total * 100)
if used_percent > 80:
    logging.info("SSD usage %s. Looking for the oldest files", used_percent)
    try:
        oldest_files = sorted(
            (
                os.path.join(dirname, filename)
                for dirname, dirnames, filenames in os.walk('/home')
                for filename in filenames
                if filename.endswith(".mp4")
            ),
            key=lambda fn: os.stat(fn).st_mtime,
        )[:len(camera_devices)]
        logging.info("Removing %s", oldest_files)
        for oldest_file in oldest_files:
            os.remove(oldest_file)
            logging.info("%s removed", oldest_file)
    except ValueError as e:
        # no files to delete
        pass

(/home is the external drive's mount point)

The problem is that this mechanism used to work like a charm when I used a 256 or 512 GB SSD. Now that I need more space (more cameras and longer storage time), building the file list on a larger SSD (2 to 5 TB now, maybe 8 TB in the future) takes a lot of time. The scanning process takes much longer than 1 minute, which could be addressed by running it less often and extending the list of files to delete. The real problem is that the process itself causes a lot of CPU load (through I/O operations). The performance drop is visible in the whole system: other applications, like some simple computer vision algorithms, run slower, and the CPU load can even cause a kernel panic.

The hardware I work on is the Nvidia Jetson Nano and Xavier NX. Both devices have the performance problem described above.

The question is whether you know of any algorithms or out-of-the-box software for file retention that would work for the case I described. Or maybe there is a way to rewrite my code to make it more reliable and performant?

EDIT:

I was able to lower the os.walk() impact by limiting the space to check. Now I just scan /home/recordings and /home/recognition/, which also shrinks the directory tree for the recursive scan. At the same time, I've added .jpg file checking, so now I look for both .mp4 and .jpg files. The result is much better with this implementation.

However, I need further optimization. I prepared some test cases and ran them on a 1 TB drive which is 80% full (mostly media files). I attached profiler results per case below.
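The @time_measure decorator used in the test cases below is not shown in the question; a minimal sketch of what it might look like (my assumption, written only to match the "Took ... s" lines in the output further down) is:

import time
from functools import wraps

def time_measure(func):
    # Hypothetical helper, not part of the original question: prints the wall-clock
    # time a call takes, matching the "Took ... s" lines in the test output.
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"Took {time.time() - start} s")
        return result
    return wrapper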

@time_measure
def method6():
    paths = [
        "/home/recordings",
        "/home/recognition",
        "/home/recognition/marked_frames",
    ]
    files = []
    for path in paths:
        files.extend((
            os.path.join(dirname, filename)
            for dirname, dirnames, filenames in os.walk(path)
            for filename in filenames
            if (filename.endswith(".mp4") or filename.endswith(".jpg")) and not os.path.islink(os.path.join(dirname, filename))
        ))
    oldest_files = sorted(
        files,
        key=lambda fn: os.stat(fn).st_mtime,
    )
    print(oldest_files[:5])

[profiler results for method6 attached as an image]

@time_measure
def method7():
    ext = [".mp4", ".jpg"]
    paths = [
        "/home/recordings/*/*/*",
        "/home/recognition/*",
        "/home/recognition/marked_frames/*",
    ]
    files = []
    for path in paths:
        files.extend((file for file in glob(path) if not os.path.islink(file) and (file.endswith(".mp4") or file.endswith(".jpg"))))
    oldest_files = sorted(files, key=lambda fn: os.stat(fn).st_mtime)
    print(oldest_files[:5])

[profiler results for method7 attached as an image]

The original implementation on the same data set took ~100 s.

EDIT2

Comparison of @norok2's proposals

I compared them with method6 and method7 from above. I tried several times with similar results.

Testing method7
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 24.73726773262024 s
_________________________
Testing find_oldest
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 34.355509757995605 s
_________________________
Testing find_oldest_cython
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 25.81963086128235 s

[profiler results attached as images: method7 (glob()), iglob(), Cython]


There are 3 answers below.

Answer by Utshaan:

You could use the subprocess module to list all the MP4 files directly with a single external command, instead of looping over all the files in Python.

import os
import subprocess as sb

# `find` does the directory traversal in one external call
# (the Linux equivalent of the Windows `dir /b /s` one-liner)
mp4_files = [f for f in sb.getoutput("find /home -name '*.mp4'").splitlines() if f]
oldest_files = sorted(mp4_files, key=lambda fn: os.stat(fn).st_mtime)[:len(camera_devices)]
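If GNU find is available (a likely assumption on the Jetson's Linux-based OS), the external command can also report each file's modification time itself, which avoids the per-file os.stat() calls in Python entirely; a rough sketch (my addition, not part of the original answer):

import subprocess as sb

# GNU find's -printf prints "<mtime-epoch> <path>" for every match
out = sb.getoutput("find /home -name '*.mp4' -printf '%T@ %p\\n'")
entries = [line.split(" ", 1) for line in out.splitlines() if line]
oldest_entries = sorted(entries, key=lambda e: float(e[0]))[:len(camera_devices)]
oldest_files = [path for _, path in oldest_entries]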
Answer by IamFr0ssT:

A quick optimization would be to not bother checking the file modification time at all and to trust the timestamp encoded in the file path instead.

from datetime import datetime
import re

total, used, free = shutil.disk_usage("/home")
used_percent = int(used / total * 100)
if used_percent > 80:
    logging.info("SSD usage %s. Looking for the oldest files", used_percent)
    try:
        files = []
        for dirname, dirnames, filenames in os.walk('/home/recordings'):
            for filename in filenames:
                # parse the timestamp straight from the path instead of calling os.stat()
                files.append((
                    name := os.path.join(dirname, filename),
                    datetime.strptime(
                        re.search(r'\d{4}-\d{2}-\d{2}/\d{2}-\d{2}', name)[0],
                        "%Y-%m-%d/%H-%M"
                    ),
                ))
        files.sort(key=lambda e: e[1])
        oldest_files = files[:len(camera_devices)]
        logging.info("Removing %s", oldest_files)
        for oldest_file, _ in oldest_files:
            os.remove(oldest_file)
            # logging.info("%s removed", oldest_file)
        logging.info("Removed")
    except ValueError as e:
        # no files to delete
        pass
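For illustration, here is how that path-based timestamp extraction behaves on a path following the question's layout (a standalone snippet; the camera name is made up):

import re
from datetime import datetime

name = "/home/recordings/camera_1/2021-11-22/11-27-30.mp4"  # hypothetical example path
stamp = re.search(r'\d{4}-\d{2}-\d{2}/\d{2}-\d{2}', name)[0]
print(datetime.strptime(stamp, "%Y-%m-%d/%H-%M"))  # -> 2021-11-22 11:27:00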
Answer by norok2:

You could get an extra few percent speed-up on top of your method7() with the following:

import os
import glob


def find_oldest(paths=("*",), exts=(".mp4", ".jpg"), k=5):
    result = [      
        filename
        for path in paths
        for filename in glob.iglob(path)
        if any(filename.endswith(ext) for ext in exts) and not os.path.islink(filename)]
    mtime_idxs = sorted(
        (os.stat(fn).st_mtime, i)
        for i, fn in enumerate(result))
    return [result[mtime_idxs[i][1]] for i in range(k)]

The main improvements are:

  • use iglob instead of glob -- while it may be of comparable speed, it takes significantly less memory, which may help on low-end machines
  • str.endswith() is checked before the presumably more expensive os.path.islink(), which helps reduce the number of such calls thanks to short-circuiting
  • an intermediate list with all the mtimes is produced to minimize the number of os.stat() calls
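For example, with the paths and extensions from the question, find_oldest() could be called like this (usage sketch; k set to the number of cameras, as in the original retention code):

oldest = find_oldest(
    paths=("/home/recordings/*/*/*", "/home/recognition/*", "/home/recognition/marked_frames/*"),
    exts=(".mp4", ".jpg"),
    k=len(camera_devices),
)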

This can be sped up even further with Cython:

%%cython --cplus -c-O3 -c-march=native -a

import os
import glob


cpdef find_oldest_cy(paths=("*",), exts=(".mp4", ".jpg"), k=5):
    result = []
    for path in paths:
        for filename in glob.iglob(path):
            good_ext = False
            for ext in exts:
                if filename.endswith(ext):
                    good_ext = True
                    break
            if good_ext and not os.path.islink(filename):
                result.append(filename)
    mtime_idxs = []
    for i, fn in enumerate(result):
        mtime_idxs.append((os.stat(fn).st_mtime, i))
    mtime_idxs.sort()
    return [result[mtime_idxs[i][1]] for i in range(k)]

My tests, run on files generated as follows:

def gen_files(n, exts=("mp4", "jpg", "txt"), filename="somefile", content="content"):
    for i in range(n):
        ext = exts[i % len(exts)]
        with open(f"{filename}{i}.{ext}", "w") as f:
            f.write(content)


gen_files(10_000)

produce the following:

funcs = find_oldest_OP, find_oldest, find_oldest_cy


timings = []
base = funcs[0]()
for func in funcs:
    res = func()
    is_good = base == res
    timed = %timeit -r 8 -n 4 -q -o func()
    timing = timed.best * 1e3
    timings.append(timing if is_good else None)
    print(f"{func.__name__:>24}  {is_good}  {timing:10.3f} ms")
#           find_oldest_OP  True      81.074 ms
#              find_oldest  True      70.994 ms
#           find_oldest_cy  True      64.335 ms

find_oldest_OP is the following, based on method7() from the OP:

def find_oldest_OP(paths=("*",), exts=(".mp4", ".jpg"), k=5):
    files = []
    for path in paths:
        files.extend(
            (file for file in glob.glob(path)
            if not os.path.islink(file) and any(file.endswith(ext) for ext in exts)))
    oldest_files = sorted(files, key=lambda fn: os.stat(fn).st_mtime)
    return oldest_files[:k]

The Cython version seems to give roughly a 20% reduction in execution time over the OP's version.