I'm working on a C++ application that operates on a fairly large data set (tens of GBs). It's compiled for Linux and Windows. The data must be persistent (i.e. it needs to survive an app restart), and it would be nice to have a transaction mode, where a data update, which may include many different modifications, is either committed all at once or rolled back automatically.
I had several ideas at the very beginning, but in the end it was decided to use an SQLite DB for the data storage, so that the data structures I need (arrays, linked lists, binary trees and B-trees) are emulated via the DB engine, with appropriate indexes and table relations.
Now I'm planning to rewrite this. The basic idea is to work with a file mapping. I'm well aware of all the benefits SQLite has, but for my specific use case, mapping all my data structures directly into memory should be superior. I mean using proper algorithms on my data structures (such as accessing data in O(1) where applicable) and avoiding copying.
In a naive approach one would create a file, map it (with R/W protection), and then read/write the mapped memory directly. But this has the following drawbacks:
- Atomicity of modifications. It doesn't sound like a good idea to give the OS full control over modifying the underlying file when my app modifies the mapped pages. If the app is forcibly closed during modifications (or there's a power outage), the data would be left in an inconsistent state.
- Auto-resize. I wouldn't like to preallocate the file mapping at its maximum size; I'd like to grow it on demand.
So I thought about the following idea. The mapping would normally be opened with read-only protection (or at least the app should never write to the mapping directly as-is). The app can access all the data structures in read-only mode.
Now, when the app wants to make modifications, it should call an appropriate function and tell it that it's going to modify a specific address + offset. The implementation would create a persistent copy of the pages being modified on first access (analogous, in some sense, to the SQLite journal file). When the app is done with the modifications and wants to "commit" the changes, it would just erase the journal file. If, on the other hand, the modifications are to be reverted, the app would read the journal file and restore the pages in the original file mapping.
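To make the scheme concrete, here is a minimal sketch of the copy-on-first-write journal. It is only an illustration under stated assumptions: a `std::vector` stands in for the mapped file, a `std::map` stands in for the journal file, the page size is assumed to be 4096 bytes, and all names (`JournaledRegion`, `begin_write`, and so on) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>
#include <vector>

// Hypothetical sketch: in-memory stand-ins replace the mapping and the journal file.
class JournaledRegion {
public:
    static constexpr std::size_t kPageSize = 4096;  // assumed page size

    explicit JournaledRegion(std::size_t pages) : data_(pages * kPageSize, 0) {}

    // Read-only access, the only kind the app uses outside of a modification.
    const unsigned char* read(std::size_t offset) const {
        assert(offset < data_.size());
        return data_.data() + offset;
    }

    // Announce a write to [offset, offset + size). Each page touched for the
    // first time since the last commit/rollback is copied into the journal.
    unsigned char* begin_write(std::size_t offset, std::size_t size) {
        assert(size > 0 && offset + size <= data_.size());
        const std::size_t first = offset / kPageSize;
        const std::size_t last = (offset + size - 1) / kPageSize;
        for (std::size_t page = first; page <= last; ++page) {
            if (journal_.find(page) == journal_.end()) {
                const unsigned char* src = data_.data() + page * kPageSize;
                journal_.emplace(page, std::vector<unsigned char>(src, src + kPageSize));
            }
        }
        return data_.data() + offset;
    }

    // Commit: the new contents stay, so the saved originals are discarded.
    void commit() { journal_.clear(); }

    // Rollback: copy every saved original page back over the modified one.
    void rollback() {
        for (const auto& entry : journal_) {
            std::memcpy(data_.data() + entry.first * kPageSize,
                        entry.second.data(), kPageSize);
        }
        journal_.clear();
    }

    std::size_t journaled_pages() const { return journal_.size(); }

private:
    std::vector<unsigned char> data_;                            // stand-in for the mapped file
    std::map<std::size_t, std::vector<unsigned char>> journal_;  // stand-in for the journal file
};
```

With this sketch, a write of 10 bytes at offset 4090 journals two pages, and `rollback()` restores both. A real implementation would presumably also have to flush the journal to disk before the first in-place write, and flush the data before deleting the journal; the sketch leaves both out.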
This is regarding consistency of the data.
Now, regarding auto-grow. This seems trickier to achieve. Ideally I'd want to reserve a large part of the virtual address space, with the only accessible pages being those that correspond to the current file size. But, as far as I understand, neither the Linux nor the Windows API works that way. Windows would automatically enlarge the file to match the requested file mapping size; Linux would always read zeroes beyond the end of the file and ignore writes there (according to the docs).
So, to increase the file mapping size I'll need to close/unmap the existing mapping, then grow the file, and then re-map it. Alternatively, I could add another mapping for the new portion of the file.
This, however, doesn't guarantee that the mapped address stays the same. I'd very much like a guarantee that after extending the file mapping its virtual address remains the same, so that if I already hold pointers to objects that are supposed to be unaffected by the changes, I can keep using them safely.
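For reference, the unmap, grow, re-map sequence could look roughly like the sketch below. It is POSIX-only (the Windows version would go through `UnmapViewOfFile`, `CreateFileMapping` and `MapViewOfFile` instead), the `Mapping` struct and both function names are hypothetical, and the mapping is read-only as in the scheme above. Since `mmap` is called without an address hint, nothing ties the new address to the old one.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical bookkeeping for the current mapping.
struct Mapping {
    void* addr = nullptr;
    std::size_t size = 0;
};

// Open (or create and truncate) the backing file; returns -1 on failure.
int open_backing_file(const char* path) {
    return open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
}

// Grow the file to new_size and replace the old mapping with a larger one.
// Returns false on failure; in that case m may be left empty.
bool grow_mapping(int fd, Mapping& m, std::size_t new_size) {
    assert(new_size > m.size);
    // 1. Extend the file so every page of the new mapping is backed by it.
    if (ftruncate(fd, static_cast<off_t>(new_size)) != 0) return false;
    // 2. Drop the old mapping; pointers into it are invalid from here on.
    if (m.addr != nullptr) {
        if (munmap(m.addr, m.size) != 0) return false;
        m = Mapping{};
    }
    // 3. Map the whole file again; the kernel picks the address.
    void* p = mmap(nullptr, new_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return false;
    m.addr = p;
    m.size = new_size;
    return true;
}
```

Comparing `m.addr` before and after a call to `grow_mapping` shows whether the address happened to survive, but the sketch gives no guarantee either way, which is exactly the problem.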
I know that it's possible to specify a preferred address for the file mapping, but how likely is it to be respected by the OS?
Is there a better alternative to what I'm trying to achieve?