Storing heterogeneous data contiguously in memory as a sequence of chars


I have a large number of strings and some data associated with each string. For simplicity, let's assume that the data is an int for each string, so that I have an std::vector<std::tuple<std::string, int>>. I want to try to store this data contiguously in memory with a single heap allocation. I will not need to worry about adding or deleting strings in the future.

A simple example

Constructing an std::string generally requires a heap allocation (unless the string is short enough for the small-string optimization), and accessing the chars of the std::string requires a dereference. If I have a bunch of strings, I may make better use of memory by storing all of the strings in one std::string and storing each string's starting index and size as separate variables. If I want, I could try to store the starting index and size within the std::string itself.
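The single-buffer idea above can be sketched roughly as follows. This is only an illustration, assuming C++17 (for std::string_view); the names PackedStrings and Span are hypothetical and not part of any library. The index table is kept in a separate vector here, so this version uses two allocations rather than one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: all characters live in one std::string, and a side
// table records where each original string starts and how long it is.
class PackedStrings {
public:
    explicit PackedStrings(const std::vector<std::string>& strings) {
        std::size_t total = 0;
        for (const auto& s : strings) total += s.size();
        buffer_.reserve(total);          // one allocation for all characters
        spans_.reserve(strings.size());  // one allocation for the index table
        for (const auto& s : strings) {
            spans_.push_back({buffer_.size(), s.size()});
            buffer_ += s;
        }
    }

    std::size_t size() const { return spans_.size(); }

    // Returns a non-owning view into the shared buffer. The view is only
    // valid while this object is alive and has not been moved from.
    std::string_view operator[](std::size_t i) const {
        assert(i < spans_.size());
        return std::string_view(buffer_).substr(spans_[i].offset, spans_[i].length);
    }

private:
    struct Span {
        std::size_t offset;  // starting index inside buffer_
        std::size_t length;  // number of chars
    };
    std::string buffer_;
    std::vector<Span> spans_;
};
```

Looking up string i then costs one index into the table plus one offset into the buffer, instead of following a separate pointer per std::string.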

Back to my problem

One idea I had was to store everything in an std::string or std::vector<char>. Each entry of the std::vector<std::tuple<std::string, int>> would be laid out in memory like this:

  1. length of next string (int or size_t)
  2. sequence of chars representing the string (chars)
  3. some number of zero chars for correct int alignment (chars)
  4. data (int)

This requires being able to interpret a sequence of chars as an int. There have been questions about this before, but it seems to me that trying to do this can result in undefined behavior. I believe that I can help this slightly by checking the sizeof(int).
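One way to read and write an int inside a char buffer without running into the aliasing problem is std::memcpy, which is well defined for trivially copyable types. The sketch below assumes C++17 and stores the length as an int; the helper names (append_int, read_int, pack, read_entry) are hypothetical. Because std::memcpy does not need the source bytes to be aligned, this version leaves out the padding from step 3 of the layout, which is a deliberate simplification rather than the layout exactly as listed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical helper: append the object representation of an int.
inline void append_int(std::vector<char>& buf, int value) {
    char bytes[sizeof(int)];
    std::memcpy(bytes, &value, sizeof(int));
    buf.insert(buf.end(), bytes, bytes + sizeof(int));
}

// Hypothetical helper: copy sizeof(int) bytes back out into a real int.
inline int read_int(const std::vector<char>& buf, std::size_t pos) {
    assert(pos + sizeof(int) <= buf.size());
    int value;
    std::memcpy(&value, buf.data() + pos, sizeof(int));
    return value;
}

// Layout per entry: [int length][length chars][int data], no padding.
inline std::vector<char> pack(const std::vector<std::tuple<std::string, int>>& entries) {
    std::size_t total = 0;
    for (const auto& e : entries) total += 2 * sizeof(int) + std::get<0>(e).size();
    std::vector<char> buf;
    buf.reserve(total);  // the single heap allocation
    for (const auto& e : entries) {
        const std::string& s = std::get<0>(e);
        append_int(buf, static_cast<int>(s.size()));
        buf.insert(buf.end(), s.begin(), s.end());
        append_int(buf, std::get<1>(e));
    }
    return buf;
}

// Decodes the entry that starts at pos and advances pos to the next entry.
inline std::pair<std::string_view, int> read_entry(const std::vector<char>& buf,
                                                   std::size_t& pos) {
    const int len = read_int(buf, pos);
    pos += sizeof(int);
    assert(len >= 0 && pos + static_cast<std::size_t>(len) <= buf.size());
    std::string_view text(buf.data() + pos, static_cast<std::size_t>(len));
    pos += static_cast<std::size_t>(len);
    const int data = read_int(buf, pos);
    pos += sizeof(int);
    return {text, data};
}
```

Entries can only be walked in order with this layout, since each entry's position depends on the lengths before it; random access would need a separate offset table.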

Another option I have is to create a union:

union CharInt {
    char some_chars[sizeof(int)];
    int data;
};

Here, I would need to be careful that the number of chars per int is determined at compile time from sizeof(int). I would then store an std::vector<CharInt>. This seems more "C++" than using reinterpret_cast. One downside is that a string longer than sizeof(int) chars spans several CharInts, so accessing a char in the second CharInt requires an additional pointer addition (the pointer to the CharInt + 1). This cost still seems small relative to the benefit of making everything contiguous.
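As a rough sketch of the compile-time checks and the cell arithmetic described here, something like the following could be used. The helpers cells_for_chars, append_chars and char_at are hypothetical names introduced for illustration, and the code assumes that each cell is only ever read through the member that was last written.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

union CharInt {
    char some_chars[sizeof(int)];
    int data;
};

// Fail the build if the union is not exactly one int wide.
static_assert(sizeof(CharInt) == sizeof(int), "CharInt should be one int wide");
static_assert(alignof(CharInt) == alignof(int), "CharInt should align like int");

// Number of CharInt cells needed to hold n chars (rounded up).
constexpr std::size_t cells_for_chars(std::size_t n) {
    return (n + sizeof(int) - 1) / sizeof(int);
}

// Appends the chars of s as a run of cells; unused trailing chars stay zero
// because resize() value-initializes the new cells.
inline void append_chars(std::vector<CharInt>& cells, const std::string& s) {
    const std::size_t first = cells.size();
    cells.resize(first + cells_for_chars(s.size()));
    for (std::size_t k = 0; k < s.size(); ++k) {
        cells[first + k / sizeof(int)].some_chars[k % sizeof(int)] = s[k];
    }
}

// Reads char k of a run of char cells starting at cell index first.
// The division picks the cell, the remainder picks the char inside it.
inline char char_at(const std::vector<CharInt>& cells, std::size_t first, std::size_t k) {
    assert(first + k / sizeof(int) < cells.size());
    return cells[first + k / sizeof(int)].some_chars[k % sizeof(int)];
}
```

The division and remainder in char_at are the "additional pointer addition" mentioned above: the cell index moves forward once every sizeof(int) chars.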

Is this the better option? Are there other options available? Are there pitfalls I need to account for using the union method?

Edit:

I wanted to provide clarity about how CharInt would be used. I provided an example below:

#include <iostream>
#include <string>
#include <vector>


class CharIntTest {
public:
    CharIntTest() {
        my_trie.push_back(CharInt{ 42 });
        std::string example_string{ "this is a long string" };
        my_trie.push_back(CharInt{ example_string, 5 });
        my_trie.push_back(CharInt{ 106 });
    }

    int GetFirstInt() {
        return my_trie[0].an_int;
    }

    char GetFirstChar() {
        return my_trie[1].some_chars[0];
    }

    char GetSecondChar() {
        return my_trie[1].some_chars[1];
    }

    int GetSecondInt() {
        return my_trie[2].an_int;
    }

private:

    union CharInt {
        // Here I would need to be careful that I insert exactly sizeof(int) chars;
        // the four initializers below assume sizeof(int) == 4.
        CharInt(const std::string& s, int index) : some_chars{ s[index], s[index+1], s[index+2], s[index+3]} {
        }

        CharInt(int i) : an_int{ i } {
        }

        char some_chars[sizeof(int)];
        int an_int;
    };

    std::vector<CharInt> my_trie;

};

Note that I do not access the first or third CharInts as though they were chars. I do not access the second CharInt as though it were an int. Here is the main:

int main() {
    CharIntTest tester{};

    std::cout << tester.GetFirstInt() << "\n";
    std::cout << tester.GetFirstChar() << "\n";
    std::cout << tester.GetSecondChar() << "\n";
    std::cout << tester.GetSecondInt();
}

which produces the desired output:

42
i
s
106