I need to bulk-load all entities in a table. (They need to be in memory rather than loaded as-needed, for high-speed on-demand graph-traversal algorithms.)
I need to parallelize this for speed in loading. So, I want to run multiple queries in parallel threads, each pulling approx. 800 entities from the database.
QuerySplitter serves this purpose, but we are running on Flexible Environment and so are using the Appengine SDK rather than the Client libraries.
MapReduce has been mentioned, but that is not aimed at simply loading data into memory. Memcache is somewhat relevant, but for high-speed access I need all these objects in a dense network in the RAM of my own app's JVM.
MultiQueryBuilder might do this; it can run parts of a query in parallel.
Whichever of these three approaches (or some other) is used, the hardest part is defining filters, or some other form of splits, that roughly partition the table (the Kind) into chunks of about 800 entities. I would like to create filters that say "entities 1 through 800", "801 through 1600", ..., but I know that is impractical. So, how does one do it?
I solved a similar problem by partitioning the entities into random groups.
I added a float property to each datastore entity and assigned it a random number between 0 and 1 every time I saved the entity. Then, when launching the N threads to do the work on the datastore entities, I had each thread work over a query of 1/N of the entities. For example, thread 0 would handle all entities whose random property fell between 0 and 1/N, thread 1 would handle all entities whose random property fell between 1/N and 2/N, and so on.

The downside is that this is not entirely deterministic, and you need to add a new property to your datastore entities. The upside is that it easily scales to millions of entities and many threads, and you generally get an even distribution of work across the threads.
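A minimal sketch of the range arithmetic, assuming a hypothetical `randomKey` double property on each entity (the class and method names here are illustrative, not part of any SDK; the actual datastore query is shown only in comments since it needs the App Engine jars):

```java
import java.util.Random;

public class RandomPartition {
    private static final Random RNG = new Random();

    // Called on every save: stamp the entity with a random value in [0, 1).
    // With the App Engine low-level API this would look something like
    //   entity.setProperty("randomKey", RNG.nextDouble());
    public static double newRandomKey() {
        return RNG.nextDouble();
    }

    // The half-open range [lo, hi) of randomKey values that thread i of n owns.
    public static double[] rangeFor(int i, int n) {
        return new double[] { (double) i / n, (double) (i + 1) / n };
    }

    public static void main(String[] args) {
        int n = 4; // number of parallel loader threads
        for (int i = 0; i < n; i++) {
            double[] r = rangeFor(i, n);
            // Each thread would run a query filtered to its own slice, e.g.
            // (App Engine SDK sketch, property name "randomKey" assumed):
            //   Query q = new Query("MyKind").setFilter(
            //       CompositeFilterOperator.and(
            //           new Query.FilterPredicate("randomKey",
            //               Query.FilterOperator.GREATER_THAN_OR_EQUAL, r[0]),
            //           new Query.FilterPredicate("randomKey",
            //               Query.FilterOperator.LESS_THAN, r[1])));
            System.out.printf("thread %d -> [%.2f, %.2f)%n", i, r[0], r[1]);
        }
    }
}
```

With ~800 entities per chunk as the target, you would pick N ≈ (total entity count) / 800; the slices will not be exactly 800 each, but for uniformly random keys they come out roughly even.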