How to Estimate Theoretical Execution Time for Dynamic Data Generation in PostgreSQL Function?

39 Views Asked by At

I have developed a PostgreSQL function that dynamically generates user data with random values for specified fields. The function includes nested calls to another function for password generation. I conducted performance tests with varying parameters, such as the number of records and password lengths, to measure the execution time.

Here's the main code snippet:

CREATE OR REPLACE FUNCTION generate_users(count INT)
RETURNS TABLE (
    first_name_new TEXT,
    last_name_new TEXT,
    email TEXT,
    gender TEXT,
    dob DATE,
    password TEXT
) AS $$
DECLARE
    first_names_arr TEXT[] = ARRAY(SELECT names_csv.first_name FROM names_csv);
    last_names_arr TEXT[] = ARRAY(SELECT names_csv.last_name FROM names_csv);
    i INT;
BEGIN
    FOR i IN 1..count LOOP
        RETURN QUERY
        SELECT
            -- generate a random first name:
            first_names_arr[floor(random() * array_length(first_names_arr, 1) + 1)],
            -- generate a random last name:
            last_names_arr[floor(random() * array_length(last_names_arr, 1) + 1)],
            -- generate a random email:
            first_names_arr[floor(random() * array_length(first_names_arr, 1) + 1)] || '_' || last_names_arr[floor(random() * array_length(last_names_arr, 1) + 1)] || '_' || EXTRACT(YEAR FROM (NOW() - (floor(random() * 3650) || ' days')::INTERVAL)) || '@example.com',
            -- generate random sex:
            CASE WHEN random() < 0.5 THEN 'f' ELSE 'm' END,
            -- generate a random date of birth:
            CAST(NOW() - (floor(random() * 3650) || ' days')::INTERVAL AS DATE),
            generate_password(20);
    END LOOP;
END;
$$ LANGUAGE plpgsql;

And here's the generate_password() function:

CREATE OR REPLACE FUNCTION generate_password(length INT)
    RETURNS TEXT AS $$
    DECLARE
        chars TEXT = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
        result TEXT = '';
        i INT;
    BEGIN
        FOR i IN 1..length LOOP
            result := result || substr(chars::text, floor(random() * length(chars) + 1)::int, 1);
        END LOOP;
        RETURN result;
    END;
    $$ LANGUAGE plpgsql;

I came across a theoretical formula for estimating execution time in PostgreSQL, but it seems inadequate as it only considers the number of rows and doesn't account for dynamic data generation and the complexity of the function. The formula is as follows:

enter image description here

(N must be greater than 1000)

However, the practical results I obtained from my tests:

Practical

do not align with the theoretical values from this formula:

theoretical

(for N = 5,000; 10,000; 50,000 and 100,000 respectively)

Moreover, the formula appears to be primarily focused on insertion scenarios and doesn't consider dynamic data generation.

My question is: How can I develop a more accurate theoretical estimation for the execution time of my PostgreSQL function, taking into account dynamic data generation, the number of columns, and the values generated within each cell? Additionally, are there any existing methodologies or tools that are better suited for estimating the performance of functions involving dynamic data generation in PostgreSQL?

0

There are 0 best solutions below