Nested dataclass introduces a memory leak, but gc.get_objects() has constant length


Hi community,
I had a problem with leaking memory in some code of mine and posted a question about it on this board (Issue with python memory management). Through some inspection I found what I had to change in my code to stop the memory from leaking, but I still do not understand the fundamentals behind it.
A small example for my problem looks like this:

import gc
from dataclasses import dataclass

import psutil


class SomeClass:
    def __init__(self, *args):
        self.args = args

    @staticmethod
    def from_str(some_str):
        # do some stuff with some_str
        return SomeClass(*SomeClass.disassemble(some_str).to_tuple())

    @staticmethod
    def disassemble(some_str):
        @dataclass
        class StringAttributes:
            attr1: str

            def to_tuple(self):
                return (v for v in self.__dict__.values())

        return StringAttributes(some_str[0])


def some_function(strings: set[str]) -> set[SomeClass]:
    return {SomeClass.from_str(s) for s in strings}


counter = 0
process = psutil.Process()
while True:
    a = some_function({str(counter)})
    print(
        f"\n"
        f"Date processed: {counter}\n"
        f"Memory consumption:\n"
        f"Resident memory (MB): {process.memory_info().rss / 1024 ** 2}\n"
        f"Virtual memory (MB): {process.memory_info().vms / 1024 ** 2}\n"
        f"Object count: {len(gc.get_objects())}"
    )
    gc.collect()
    counter += 1

Running this script leads to small increments in the virtual memory consumption at irregular iteration intervals, e.g.

Date processed: 1552
Memory consumption:
Resident memory (MB): 15.375
Virtual memory (MB): 231.78515625
Object count: 13488

Date processed: 1553
Memory consumption:
Resident memory (MB): 15.75
Virtual memory (MB): 231.984375
Object count: 13488

The count of objects tracked by the garbage collector, however, stays constant at every iteration.
When I change the code to move the StringAttributes dataclass outside of the disassemble method, the memory consumption is constant throughout all iterations, as far as I have observed:
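One thing I did verify (just an illustrative sketch, reusing the nested definition from the example above): every call to disassemble builds a brand-new class object, so the instances returned from different calls do not share a type.

```python
from dataclasses import dataclass


class SomeClass:
    @staticmethod
    def disassemble(some_str):
        # same nested definition as in the example above
        @dataclass
        class StringAttributes:
            attr1: str

            def to_tuple(self):
                return (v for v in self.__dict__.values())

        return StringAttributes(some_str[0])


a = SomeClass.disassemble("x")
b = SomeClass.disassemble("y")
# each call defines a fresh class object, so the types differ
print(type(a) is type(b))  # False
print(type(a).__qualname__)  # SomeClass.disassemble.<locals>.StringAttributes
```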

@dataclass
class StringAttributes:
    attr1: str

    def to_tuple(self):
        return (v for v in self.__dict__.values())


class SomeClass:
    def __init__(self, *args):
        self.args = args

    @staticmethod
    def from_str(some_str):
        # do some stuff with some_str
        return SomeClass(*SomeClass.disassemble(some_str).to_tuple())

    @staticmethod
    def disassemble(some_str):

        return StringAttributes(some_str[0])


def some_function(strings: set[str]) -> set[SomeClass]:
    return {SomeClass.from_str(s) for s in strings}

Now this change in the code layout solves my problem, but it leaves me puzzled:

  • Why does this solve my problem, i.e. what exactly stays in the process's memory when I define the dataclass in a nested way? My guess is that the problem is returning an instance of the dataclass from the disassemble method, which moves it outside the scope where its class is defined.
  • Why is the increase in virtual memory not reflected in the length of the list of objects tracked by the garbage collector, or in the list of unreachable objects (which I take to be gc.garbage)?
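Regarding the second point: a minimal check (again just a sketch, with the nested definition reduced to a plain function) suggests the nested class itself is collectable once its last instance is gone, which would be consistent with the flat object count I see.

```python
import gc
import weakref
from dataclasses import dataclass


def disassemble(some_str):
    # simplified stand-in for the nested definition in the question
    @dataclass
    class StringAttributes:
        attr1: str

    return StringAttributes(some_str[0])


inst = disassemble("x")
cls_ref = weakref.ref(type(inst))
del inst
# after the last instance dies, the class is only kept alive by
# reference cycles (e.g. via its own __mro__), which gc can break
gc.collect()
print(cls_ref() is None)  # True: the nested class was collected
```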