I have a huge orderhistory table in Cassandra with data going back to 2013, but I only want to unload the last 12 months of it. The command below unloads everything since 2013 and stores it under data/json/customer_data/orderhistory/data. How do I modify it so that each run selects only the last 12 months of data?
dsbulk unload -k customer_data -t crawlsiteidentifiedpages -h '172.xx.xx.xxx' \
-c json -url data/json/customer_data/orderhistory/data
You need to remove the -k and -t options and instead use the -query option, as described in the documentation, to supply a SELECT statement with your condition.
To make sure that the unload is parallelized, make sure that your condition includes a part like
and token(pkcol) > :start and token(pkcol) <= :end
where pkcol is the name of the partition key column (if you have multiple partition key columns, specify them comma-separated).
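As a rough sketch of how this could be put together, the script below computes a cutoff date 12 months back and passes it to dsbulk through -query. Several things here are assumptions and not taken from the question: order_date is a hypothetical name for the date column, pkcol stands in for the real partition key, the table name is copied from the original command (swap in orderhistory if that is the actual table), and the date arithmetic relies on GNU date. ALLOW FILTERING is included on the assumption that the date column is not otherwise restrictable in this query; whether it is needed depends on the schema.

```shell
#!/usr/bin/env bash
# Sketch only: order_date and pkcol are hypothetical column names,
# and TABLE is copied from the command in the question.
TABLE="customer_data.crawlsiteidentifiedpages"

# Print the date 12 months before today as YYYY-MM-DD.
# Assumes GNU date; BSD/macOS date would need: date -v-12m +%Y-%m-%d
cutoff_date() {
  date -d '12 months ago' +%Y-%m-%d
}

# Build the CQL query for the cutoff date given as $1.
# The token(...) range with :start and :end lets dsbulk split the unload
# into parallel token-range reads.
build_query() {
  printf "SELECT * FROM %s WHERE token(pkcol) > :start AND token(pkcol) <= :end AND order_date >= '%s' ALLOW FILTERING" \
    "$TABLE" "$1"
}

# Run the unload with -query in place of -k and -t.
unload_last_12_months() {
  dsbulk unload -h '172.xx.xx.xxx' -c json \
    -url data/json/customer_data/orderhistory/data \
    -query "$(build_query "$(cutoff_date)")"
}
```

After sourcing the script, calling unload_last_12_months runs the export; because the cutoff is recomputed on every call, each run covers a rolling 12-month window. Running build_query with a fixed date first is a cheap way to check the generated CQL before pointing it at the cluster.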