Addressing Resource Exhaustion and Connection Leakage in Python Monitoring Scripts


When managing a server environment, especially one that runs monitoring scripts, it is crucial to ensure efficient resource utilization and proper handling of connections. This post describes the architecture, functionality, and challenges of a Python monitoring-script setup; specifically, it addresses resource exhaustion and connection leakage on a Windows Desktop 2019 server with substantial hardware specifications.

Script Hosting Environment:

The Python scripts discussed here are hosted on a robust server provided by Netcup, featuring 32 GB of RAM, 12 cores, and 24 threads. The server runs Windows Desktop 2019 with Python 3.11.1.

Functionality of the Scripts:

These Python scripts function primarily as monitors, running continuously to keep track of various HTTPS endpoints with a delay of approximately 60 seconds between iterations. The monitors include error handling and retry up to three times when a request fails. To access protected endpoints, the scripts use the tls-client library in addition to the requests library, ensuring proper management of HTTPS requests, including closing connections and sessions to release resources efficiently. There are also Discord bots implemented with discord.py, which run through proxies to work around rate limiting from the Discord API caused by multiple bots running simultaneously on the same server.
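
As a rough illustration of that request pattern, a single monitored request with retries and guaranteed session cleanup might look like the following sketch, written with plain requests (the URL, timeout, and function name are placeholders, not taken from the actual scripts):

import time
import requests

MAX_RETRIES = 3
DELAY_SECONDS = 60  # approximate delay between monitor iterations

def check_endpoint(url, proxies=None):
    """Fetch one endpoint, retrying up to MAX_RETRIES and always closing the session."""
    for attempt in range(1, MAX_RETRIES + 1):
        session = requests.Session()
        try:
            response = session.get(url, proxies=proxies, timeout=15)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            print(f"Attempt {attempt}/{MAX_RETRIES} failed: {exc}")
        finally:
            # Closing the session releases its pooled TCP connections back to the OS.
            session.close()
    return None

if __name__ == "__main__":
    while True:
        data = check_endpoint("https://example.com/api/health")  # placeholder URL
        # ...compare against stored state and update the database here...
        time.sleep(DELAY_SECONDS)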

Script Structure:

The scripts use threading and multiprocessing to handle the substantial volume of requests efficiently: a ThreadPoolExecutor runs tasks concurrently within each process, while multiprocessing manages several pools of processes for parallel execution. Each monitor class is structured similarly and is initialized from a Main.py file. Below is a sample structure of a monitor class (there are approximately 30 such classes):

from utilities import *
from .modules import *

from multiprocessing import Pool
from concurrent.futures import ThreadPoolExecutor
import time
import traceback  # used by traceback.format_exc() in the exception handler below
import numpy as np
import sys
sys.dont_write_bytecode = True

class ClassicShoes:
    table = "aboutyou"   # database table / site identifier for this monitor
    pools = 2            # number of worker processes per monitoring pass

    def __init__(self) -> None:
        self.logstail_token = DatabaseManager().get_logstail(self.table)
        self.products = []

        # Main monitoring loop: run a pass, sleep for the configured delay, repeat.
        while True:
            try:
                self.start_monitoring()
                time.sleep(DatabaseManager().get_delay(self.table))
                Helper.clear_console()
            except Exception as e:
                Logger(self.logstail_token).error(self.table, f"__init__ Exception - {e}", extra=Helper.extra_log(exception=traceback.format_exc()))

    def get_products(self):
        # Fetch the (region, pid) pairs to monitor from the database.
        return DatabaseManager().get_region_pids(self.table)

    def executor(self, products_subset):
        # Run request_site concurrently for every (region, pid) pair in this subset.
        with ThreadPoolExecutor() as executor:
            executor.map(Monitor().request_site, *zip(*products_subset))
            executor.shutdown(wait=True)  # redundant: the with-block already waits on exit

    def start_monitoring(self, products=None):
        self.products = self.get_products() if not products else products
        # Split the product list into one chunk per worker process.
        products_split = np.array_split(self.products, self.pools)

        with Pool(processes=self.pools) as pool:
            pool.map(self.executor, products_split)
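
For context, each monitor class is launched from Main.py. The original Main.py is not shown here, so the module paths, the class list, and the one-process-per-monitor layout below are assumptions; this is only a sketch of how roughly 30 such classes might be started:

from multiprocessing import Process
import sys
sys.dont_write_bytecode = True

# Hypothetical imports: each module is assumed to expose one monitor class like ClassicShoes above.
from monitors.classic_shoes import ClassicShoes
# from monitors.other_store import OtherStore
# ...

MONITORS = [ClassicShoes]  # in reality around 30 classes

if __name__ == "__main__":
    processes = []
    for monitor_cls in MONITORS:
        # Instantiating the class starts its infinite monitoring loop (see __init__ above),
        # so each monitor gets its own process.
        p = Process(target=monitor_cls)
        p.start()
        processes.append(p)
    for p in processes:
        p.join()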

Below is a sample of the function that sends requests to an endpoint:

def request_site(self, region, pid):
    request_attempts = 0
    while request_attempts < self.max_retries:
        try:
            session = RequestManager().init_session("chrome_112")
            start = time.time()
            proxy = ProxyManager().get_proxy()
            response = RequestManager().get_request(
                session=session,
                url=f"https://api-cloud.shoes.de/v1/{pid}",
                params={
                    'region': region,  # key name assumed; the original snippet mixed set and dict syntax here
                    'with': 'categories'
                },
                proxy=proxy
            )
            execution_time = time.time() - start

            # Handle the response here: retry after a failed request, otherwise
            # update the database if there are changes and return.

        except Exception as e:
            # Handle the exception (e.g., log it, apply retry logic).
            request_attempts += 1  # count the failed attempt so the retry cap is enforced

The scripts retrieve parameters from a database to customize requests, create the executors, and manage the pools. This approach was adopted to address memory-leak issues encountered when using threading alone, and to ensure that system resources are properly released.

Encountered Error:

After approximately 20 hours of execution, the server experiences a crash attributed to an accumulation of open ports that are not properly closed or released by the system. The issue manifests as abnormal CPU usage during idle periods and unresponsive RDP access. Additionally, the Discord bots print connection-failure errors, which points to possible saturation of socket buffer space or queues.

Conclusion:

Based on the data I have, I believe that the accumulation of open ports eventually blocks the system: once it can no longer send requests, it freezes. Because the scripts cannot send any more requests, they keep looping endlessly, driving CPU and RAM usage to 100%. The server does not shut down; it simply goes into overload. I can't figure out how to handle this situation.

[Image: CPU stats of my server over a 24h range]

[Image: Network stats of my server over a 24h range]

I tried to narrow it down by excluding individual scripts that might be causing the problem, but the issue simply recurred after a longer period. I also tried to write a script that kept track of all the open and closed ports, which turned up nothing.
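
For reference, a tracking script along those lines could look roughly like this (an illustrative sketch assuming the psutil package is available, not the original script; the sampling interval and process filter are placeholders):

import time
from collections import Counter

import psutil  # assumed available: pip install psutil

def snapshot():
    """Count TCP connections by state and report handle counts of Python processes."""
    states = Counter(conn.status for conn in psutil.net_connections(kind="tcp"))
    print("TCP connections by state:", dict(states))

    for proc in psutil.process_iter(["pid", "name"]):
        if proc.info["name"] and "python" in proc.info["name"].lower():
            try:
                # num_handles() is Windows-only; it counts open OS handles (including sockets).
                print(proc.info["pid"], proc.info["name"], "handles:", proc.num_handles())
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass

if __name__ == "__main__":
    while True:
        snapshot()
        time.sleep(60)  # sample once per minute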
