How to include SQL select statement in dsbulk unload command

Asked by Rahul Diggi on 21 June 2022 at 07:29

I have a huge orderhistory table in Cassandra with data going back to 2013, but I want to unload only the last 12 months of orderhistory data. I currently use the command below, which unloads all the data starting from 2013 and stores it in the path data/json/customer_data/orderhistory/data. How do I modify it so that each time I run it, it selects only the last 12 months of data?

dsbulk unload -k customer_data -t crawlsiteidentifiedpages -h '172.xx.xx.xxx' \
  -c json -url data/json/customer_data/orderhistory/data

There are 2 best solutions below

Alex Ott On 21 June 2022 at 09:36

You need to remove the -k and -t options and use the -query option instead, as described in the documentation, like:

dsbulk unload -query 'select * from ks.table where <your condition>'

To make sure that the unload is parallelized, your condition should include a part like and token(pkcol) > :start and token(pkcol) <= :end, where pkcol is the name of the partition key column (if you have multiple partition key columns, specify them comma-separated).
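
As a rough sketch of the comma-separated form, the script below builds such a query for a table with two partition key columns and prints the resulting dsbulk command for review. The names ks.orders, region and customer_id are hypothetical placeholders, not taken from the question; the host and output path are copied from the question's command. Any extra condition would be appended with AND.

```shell
#!/usr/bin/env bash
# Hypothetical keyspace, table and partition key columns; adjust to your schema.
# Single quotes keep :start and :end literal: they are the placeholders from the
# answer above that DSBulk uses to split the read by token range.
query='SELECT * FROM ks.orders WHERE token(region, customer_id) > :start AND token(region, customer_id) <= :end'

# The table is qualified with its keyspace in the query, so -k and -t are dropped.
cmd=(dsbulk unload -h '172.xx.xx.xxx' -query "$query"
     -c json -url data/json/customer_data/orderhistory/data)

# Print the command for review; to execute it, run: "${cmd[@]}"
printf '%q ' "${cmd[@]}"
echo
```

Both token() calls must list the partition key columns in the same order as in the table definition.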

adutra On 21 June 2022 at 09:54

Instead of -t crawlsiteidentifiedpages you should use -query and provide the SELECT query, e.g.:

-query "SELECT * FROM crawlsiteidentifiedpages WHERE token(pk) > :start and token(pk) <= :end and date > maxTimeuuid('2021-06-21+0000') ALLOW FILTERING"

A few remarks:

I assume your table has one partition key column pk and one clustering column date of type timeuuid – please adjust the actual query accordingly.
The WHERE restriction token(pk) > :start and token(pk) <= :end allows DSBulk to parallelize the operation and improves performance.
The WHERE restriction date > maxTimeuuid('2021-06-21+0000') is where the magic happens and allows you to select only the last 12 months of data.
Unfortunately, you also need to add ALLOW FILTERING to this type of query, otherwise Cassandra will reject the query.
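
Since the question asks for the 12-month window to move with each run, the cutoff date can be computed at run time instead of being hard-coded. A minimal sketch, assuming GNU date (the -d option; BSD/macOS date uses different flags) and the same assumed columns pk and date as above:

```shell
#!/usr/bin/env bash
# Cutoff date 12 months before today, in UTC, e.g. 2021-06-21 (GNU date assumed).
cutoff=$(date -u -d '12 months ago' +%Y-%m-%d)

# pk and date are the assumed column names from the remarks above.
# :start and :end are left literal; only ${cutoff} is expanded by the shell.
query="SELECT * FROM crawlsiteidentifiedpages WHERE token(pk) > :start AND token(pk) <= :end AND date > maxTimeuuid('${cutoff}+0000') ALLOW FILTERING"

# The table is not keyspace-qualified in the query, so -k is kept here (assumption:
# customer_data is the keyspace, as in the question's command).
cmd=(dsbulk unload -k customer_data -h '172.xx.xx.xxx' -query "$query"
     -c json -url data/json/customer_data/orderhistory/data)

# Print the command for review; to execute it, run: "${cmd[@]}"
printf '%q ' "${cmd[@]}"
echo
```

Scheduling this script (for example from cron) would then unload a window that always ends at the current date.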
