Is there a way to resume from where COPY TO failed?


I am using Cassandra's COPY command (docs can be found at https://docs.datastax.com/en/cql-oss/3.x/cql/cql_reference/cqlshCopy.html) to export a large table to CSV, and I have even larger ones I need to export after this.

The command I used is:

COPY my_table_name TO 'my_table_name.csv' 

After running for 12 hours (and creating a 289GB file) I got the following error:

Error for (3598295844520231142, 3615644561192297385): ReadFailure - Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 1 failures" info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'} (permanently given up after 349000 rows and 1 attempts)
Exported 1280 ranges out of 1281 total ranges, some records might be missing

I would like to know if there is a way to continue from the point of failure. The error shows "Exported 1280 ranges out of 1281 total ranges"; is there a way to specify just the last range so I don't have to export the entire table again?

The docs mention BEGINTOKEN and ENDTOKEN, but I'm not clear on what those mean or whether they can help me.

Is there perhaps a more robust method to export a table?

3 Answers

clunven (accepted answer)

The COPY command should only be used for tables of up to about 1 million records (rule of thumb). The reason is that it is fairly slow (single-threaded Python) and not very robust, even though it is easy to use.

There are free tools out there to achieve the same thing; I am thinking especially of DSBulk. To create a CSV, use the unload command (see the DSBulk documentation).

This tool gives you many options for the export and a checkpoint mechanism to restart from where you stopped. The output is split into multiple CSV files, which are easier to move around later on.

dsbulk unload -url ~/data-export -k ks1 -t table1
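
For the table in the question, a run might look like the sketch below (the contact point and keyspace name are assumptions, since the question doesn't give them; run dsbulk unload --help to see the options available in your version):

dsbulk unload -h 127.0.0.1 -k my_keyspace -t my_table_name \
       -url ~/my_table_name-export -header true

The export is split across several CSV files in the -url directory, and each run also writes its logs (and, in versions that support checkpointing, the checkpoint files) into a per-operation directory under ./logs/.
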
Aaron

Usually, no, there isn't a way to resume a COPY operation.

The docs mention BEGINTOKEN and ENDTOKEN but I'm not clear on what those mean if they can help me.

Error for (3598295844520231142, 3615644561192297385): ReadFailure

However, in your case, cqlsh did give you the token range which had the failure. You could use the BEGINTOKEN and ENDTOKEN options to COPY just that token range:

COPY my_table_name TO 'my_table_name.csv' WITH BEGINTOKEN=3598295844520231142 AND ENDTOKEN=3615644561192297385
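
If you want to keep the original 289GB file intact, a safer sketch (assuming the export was run with COPY's default of no header row) is to write the missing range to a separate file and append it afterwards:

COPY my_table_name TO 'my_table_name_last_range.csv' WITH BEGINTOKEN=3598295844520231142 AND ENDTOKEN=3615644561192297385

# then, outside cqlsh, append the recovered rows to the original export
cat my_table_name_last_range.csv >> my_table_name.csv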

But otherwise, Cédrick is right: DSBulk is a much more reliable method for exporting data from and importing data into Cassandra.

Erick Ramirez

The cqlsh COPY commands are intended for non-production workloads, particularly when you're trying things out. They are development tools intended for exporting/loading up to a few thousand records on small clusters.

If you think about it, exporting a table requires a full table scan, so if there are (a) lots of records and/or (b) lots of nodes, then (c) you'll either overload your cluster or cause cqlsh to time out.

Furthermore, the cqlsh COPY commands do not have the ability to retry an operation from where it last failed, so the export has to start over and re-process all the records.

The DataStax Bulk Loader (DSBulk) utility was created specifically for this reason. DSBulk can export data from a Cassandra cluster to CSV or JSON format, but its main feature is bulk-loading data in CSV or JSON format.

DSBulk breaks up a table into small segments to be processed by multiple threads in parallel so it can distribute the load efficiently across nodes in the cluster.
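
As an illustration (the keyspace name and output directory are assumptions), a round trip with the default CSV connector is just an unload followed by a load:

dsbulk unload -k my_keyspace -t my_table_name -url ./my_table_name-export
dsbulk load   -k my_keyspace -t my_table_name -url ./my_table_name-export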

The DSBulk documentation includes references and examples to help you get started quickly.

Note that DSBulk has a "resume" feature that allows a failed operation to be restarted from where it left off using checkpoint files. For details, see the checkpoint sections of the DSBulk Logging options.

And finally, DSBulk is open-source so it's free to use. Cheers!