I have a use case where I need to count the number of rows in Bigtable that match a rowkey prefix. I am using the Google Cloud Bigtable Java client. With my current implementation the API takes well over 3 minutes to count 15M records, and on some days it could be 50M. I am looking to optimize the query or find a better approach.
I have only 1 node in my sandbox, running on HDD storage; I plan to switch to SSD and add more nodes, but I also want the query itself to perform better.
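For reference, the current implementation boils down to a single sequential scan over the prefix, roughly like this simplified sketch (projectId, instanceId, tableId and prefix are placeholders):

// Simplified sketch of the slow path: one sequential scan, counting rows.
// The value().strip() filter drops cell values since only row keys matter.
Filters.Filter strip = Filters.FILTERS.value().strip();
Query query = Query.create(tableId).prefix(prefix).filter(strip);

long count = 0;
try (BigtableDataClient client = BigtableDataClient.create(projectId, instanceId)) {
    for (Row row : client.readRows(query)) {
        count++;
    }
}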
Update:
import com.google.api.gax.rpc.ResponseObserver;
import com.google.api.gax.rpc.StreamController;
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.Filters;
import com.google.cloud.bigtable.data.v2.models.KeyOffset;
import com.google.cloud.bigtable.data.v2.models.Query;
import com.google.cloud.bigtable.data.v2.models.Row;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;

// Limit parallelism of concurrent shard scans
Semaphore semaphore = new Semaphore(100);

// Use a strip filter as we just need the row keys, not the cell values
Filters.Filter stripFilter = Filters.FILTERS.value().strip();
Query myQuery = Query.create(tableId).prefix(prefix).filter(stripFilter);

try (BigtableDataClient dataClient = BigtableDataClient.create(projectId, instanceId)) {
    // Sample tablet boundaries and split the scan into parallel shards
    List<KeyOffset> keyOffsets = dataClient.sampleRowKeysCallable().call(tableId);
    List<Query> queryShards = myQuery.shard(keyOffsets);

    CountDownLatch taskTracker = new CountDownLatch(queryShards.size());
    List<Throwable> errors = Collections.synchronizedList(new ArrayList<>());
    AtomicLong totalCount = new AtomicLong();

    for (Query subQuery : queryShards) {
        semaphore.acquire();
        dataClient.readRowsAsync(subQuery, new ResponseObserver<Row>() {
            long subCount = 0;

            @Override
            public void onStart(StreamController controller) {}

            @Override
            public void onResponse(Row response) {
                subCount++;
            }

            @Override
            public void onError(Throwable t) {
                errors.add(t);
                taskTracker.countDown();
                semaphore.release();
            }

            @Override
            public void onComplete() {
                totalCount.addAndGet(subCount);
                taskTracker.countDown();
                semaphore.release();
            }
        });
    }
    taskTracker.await();

    // Surface failures instead of silently returning a partial count
    if (!errors.isEmpty()) {
        throw new RuntimeException(errors.size() + " shard(s) failed", errors.get(0));
    }
} catch (IOException e) {
    throw new RuntimeException(e);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    throw new RuntimeException(e);
}
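One more query-side tweak I am considering (an assumption on my part, not something I have benchmarked yet): chain a cells-per-row limit in front of the strip filter, so the server emits at most one empty cell per row instead of one per column:

// Hypothetical variant of the filter above: keep at most one cell per row,
// then strip its value, so the stream carries little more than row keys.
Filters.Filter countFilter = Filters.FILTERS.chain()
    .filter(Filters.FILTERS.limit().cellsPerRow(1))
    .filter(Filters.FILTERS.value().strip());
Query myQuery = Query.create(tableId).prefix(prefix).filter(countFilter);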