How to Efficiently Parallelize an AES-CTR PRNG Implementation in C Using Pthreads for Multi-Core Utilization?

I am working on improving the nwipe tool, specifically by implementing an AES-CTR PRNG using AES-128 in counter mode to generate high-quality random numbers for securely wiping HDDs and SSDs. The original implementation runs on a single core, and I am trying to parallelize it using pthreads to utilize all available CPU cores. However, my attempt at parallelization has resulted in a significant performance drop, and I'm seeking advice on how to correct this.

Here's the single-core implementation that works correctly but only utilizes one core:

int nwipe_aes_ctr_prng_read(NWIPE_PRNG_READ_SIGNATURE) {
    u8* restrict bufpos = buffer;
    size_t words = count / SIZE_OF_AES_CTR_PRNG;

    for(size_t ii = 0; ii < words; ++ii) {
        aes_ctr_prng_genrand_uint128_to_buf((aes_ctr_state_t*) *state, bufpos);
        bufpos += 16; // Move to the next block
    }

    // Handle remaining bytes if count is not a multiple of SIZE_OF_AES_CTR_PRNG
    const size_t remain = count % SIZE_OF_AES_CTR_PRNG;
    if(remain > 0) {
        unsigned char temp_output[16]; // Temporary buffer for the last block
        aes_ctr_prng_genrand_uint128_to_buf((aes_ctr_state_t*) *state, temp_output);
        memcpy(bufpos, temp_output, remain);
    }

    return 0; // Success
}

My attempt to parallelize it with pthreads is below. Instead of speeding things up, it reduced throughput from 200 MB/s to around 15 MB/s:

typedef struct {
    aes_ctr_state_t* state;
    u8* buffer;
    size_t start;
    size_t end;
} prng_thread_arg_t;

void* nwipe_aes_ctr_prng_read_thread(void* arg) {
    prng_thread_arg_t* thread_arg = (prng_thread_arg_t*)arg;
    aes_ctr_state_t* state = thread_arg->state;
    u8* buffer = thread_arg->buffer + thread_arg->start;
    size_t words = (thread_arg->end - thread_arg->start) / SIZE_OF_AES_CTR_PRNG;

    for(size_t ii = 0; ii < words; ++ii) {
        aes_ctr_prng_genrand_uint128_to_buf(state, buffer);
        buffer += SIZE_OF_AES_CTR_PRNG;
    }

    return NULL;
}

int nwipe_aes_ctr_prng_read(NWIPE_PRNG_READ_SIGNATURE) {
    int num_threads = 8; // Adjustable based on requirements
    pthread_t threads[num_threads];
    prng_thread_arg_t thread_args[num_threads];

    size_t total_words = count / SIZE_OF_AES_CTR_PRNG;
    size_t words_per_thread = total_words / num_threads;

    for(int i = 0; i < num_threads; i++) {
        size_t start = i * words_per_thread * SIZE_OF_AES_CTR_PRNG;
        size_t end = (i + 1) * words_per_thread * SIZE_OF_AES_CTR_PRNG;

        if(i == num_threads - 1) {
            end = total_words * SIZE_OF_AES_CTR_PRNG; // Correct end calculation
        }

        thread_args[i].state = (aes_ctr_state_t*)*state;
        thread_args[i].buffer = buffer;
        thread_args[i].start = start;
        thread_args[i].end = end;

        pthread_create(&threads[i], NULL, nwipe_aes_ctr_prng_read_thread, &thread_args[i]);
    }

    for(int i = 0; i < num_threads; i++) {
        pthread_join(threads[i], NULL);
    }

    // Remaining bytes handling omitted for brevity
    return 0;
}

Both versions use the following function to generate the numbers:

void aes_ctr_prng_genrand_uint128_to_buf(aes_ctr_state_t* state, unsigned char* bufpos) {
    CRYPTO_ctr128_encrypt(bufpos, bufpos, 16, &state->aes_key, state->ivec, state->ecount, &state->num, (block128_f) AES_encrypt);
    next_state(state);
}

Question: What could be the cause of the performance drop when parallelizing with pthreads, and how can I efficiently use all cores for the AES-CTR PRNG implementation?

I appreciate any insights or suggestions you may have. Thank you!
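
Note on one possible direction: in the threaded version above, every thread receives the same aes_ctr_state_t pointer, so all eight threads update the same ivec, ecount and num concurrently. That is a data race, and the contention on that shared state, together with creating and joining eight threads on every read call, is a plausible (though unmeasured) cause of the slowdown. CTR mode itself does not need shared state: block i depends only on the key and on counter + i, so each thread can work from its own starting counter on a disjoint slice of the buffer. Below is a minimal sketch of that partitioning. It is not the nwipe code: toy_block_encrypt is a hypothetical, non-cryptographic stand-in for AES so that the example compiles without OpenSSL, and ctr_job_t, ctr_worker and ctr_fill_parallel are names invented for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16
#define MAX_THREADS 64

/* Hypothetical stand-in for a keyed block cipher such as AES-128.
 * NOT cryptographically secure; it only keeps the sketch self-contained. */
static void toy_block_encrypt(uint64_t key, uint64_t counter, uint8_t out[BLOCK_SIZE]) {
    uint64_t x = counter ^ key;
    for (int half = 0; half < 2; ++half) {
        /* splitmix64-style mixing, one 64-bit word per half block */
        x += 0x9E3779B97F4A7C15ULL;
        uint64_t z = x;
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
        z ^= z >> 31;
        memcpy(out + 8 * half, &z, 8);
    }
}

/* Per-thread job: everything is private to the thread, nothing is shared. */
typedef struct {
    uint64_t key;
    uint64_t first_counter; /* counter value for this thread's first block */
    uint8_t* out;           /* start of this thread's slice of the buffer */
    size_t blocks;          /* number of 16-byte blocks to produce */
} ctr_job_t;

static void* ctr_worker(void* arg) {
    ctr_job_t* job = arg;
    for (size_t i = 0; i < job->blocks; ++i) {
        toy_block_encrypt(job->key, job->first_counter + i, job->out + i * BLOCK_SIZE);
    }
    return NULL;
}

/* Fills `blocks` 16-byte blocks of `buf`, splitting the counter range
 * [start_counter, start_counter + blocks) into disjoint per-thread ranges.
 * Returns 0 on success, -1 if a thread could not be created. */
int ctr_fill_parallel(uint64_t key, uint64_t start_counter,
                      uint8_t* buf, size_t blocks, int nthreads) {
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    pthread_t tids[MAX_THREADS];
    ctr_job_t jobs[MAX_THREADS];
    size_t per = blocks / (size_t) nthreads;
    size_t extra = blocks % (size_t) nthreads;
    size_t done = 0;
    int started = 0;
    int rc = 0;

    for (int t = 0; t < nthreads; ++t) {
        /* The first `extra` threads take one additional block. */
        size_t n = per + ((size_t) t < extra ? 1 : 0);
        jobs[t] = (ctr_job_t){ key, start_counter + done, buf + done * BLOCK_SIZE, n };
        done += n;
        if (pthread_create(&tids[t], NULL, ctr_worker, &jobs[t]) != 0) {
            rc = -1;
            break;
        }
        ++started;
    }
    for (int t = 0; t < started; ++t) {
        pthread_join(tids[t], NULL);
    }
    return rc;
}
```

Because each block depends only on the key and its counter value, the output should be identical for any thread count, which makes it easy to check against a single-threaded run. With a real AES-CTR state, the same idea would mean giving each thread its own copy of the state with the counter advanced to its slice. Creating threads on every call is still expensive for small buffers, so a pool of long-lived worker threads, or larger requests per call, would probably be needed before this pays off.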
