distcp - copy data from cloudera hdfs to cloud storage


I am trying to replicate data between HDFS and my GCP Cloud Storage bucket. This is not a one-time copy. After the first copy, I want to copy only new and updated files, and if files are deleted on-prem they should be deleted from Cloud Storage as well.

However, I found that snapshot-diff-based copy does not work when the target is cloud storage.

Is it even possible to do this kind of sync?

The -update flag doesn't seem to work with Cloud Storage either: it copies all the files even when nothing has changed.

Command

hadoop distcp --conf hdfs.conf -update -delete hdfs:///tmp/test_distcp gs://onpremhadoopfiles-123/

Command with snapshot diff

hadoop distcp --conf test.conf -update -diff  test_distcp test_distcp_new  hdfs:///tmp/test_distcp gs://xxxx-123/

Error


Jul 29, 2022 9:56:31 AM com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase configure
WARNING: No working directory configured, using default: 'gs://onpremhadoopfiles-123/'
22/07/29 09:56:32 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, ignoreFailures=false, overwrite=false, append=false, useDiff=true, useRdiff=false, fromSnapshot=test_distcp, toSnapshot=test_distcp_new, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=0.0, copyStrategy='uniformsize', preserveStatus=[BLOCKSIZE], atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[hdfs:/tmp/test_distcp], targetPath=gs://xxx-123/, filtersFile='null', blocksPerChunk=0, copyBufferSize=8192, verboseLog=false}, sourcePaths=[hdfs:/tmp/test_distcp], targetPathExists=true, preserveRawXattrsfalse
22/07/29 09:56:32 INFO client.RMProxy: Connecting to ResourceManager at xxx.xxx.com/xx.xx.xx.x:8032
22/07/29 09:56:33 ERROR tools.DistCp: Exception encountered
java.lang.IllegalArgumentException: The FileSystems needs to be DistributedFileSystem for using snapshot-diff-based distcp
        at org.apache.hadoop.tools.DistCpSync.preSyncCheck(DistCpSync.java:98)
        at org.apache.hadoop.tools.DistCpSync.sync(DistCpSync.java:149)
        at org.apache.hadoop.tools.DistCp.prepareFileListing(DistCp.java:88)
        at org.apache.hadoop.tools.DistCp.createAndSubmitJob(DistCp.java:205)
        at org.apache.hadoop.tools.DistCp.execute(DistCp.java:182)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:153)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:432)


There are 2 best solutions below

5
Ayuush Saxena On

Snapshot-diff-based DistCp is only possible when both the source and the target file system support snapshot operations. GCP Cloud Storage doesn't support snapshots, so you can't use snapshot-based sync here; that is what the IllegalArgumentException in your log is reporting.

The same behaviour can still be achieved with the -update and -delete options of distcp. Used together, they will:

  • Copy the files that aren't present on the target.
  • Overwrite the files that differ in size or checksum. (Use the -skipcrccheck option if either the source or the target doesn't support checksums, or if there is a valid reason the checksums can't match, such as different encryption zones.)
  • Delete the files on the target that no longer exist on the source. (This is the -delete option.)
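
As a minimal sketch of the options above, the sync could be wrapped in a small shell function. The function name sync_hdfs_to_gcs and the DRY_RUN switch are hypothetical helpers added for illustration; the source and target paths are the ones from the question. Including -skipcrccheck assumes that HDFS and Cloud Storage checksums are not comparable in your setup, so files are compared by size only.

```shell
# Sketch: one-way HDFS -> GCS sync with DistCp, without snapshot diffs.
# sync_hdfs_to_gcs and DRY_RUN are illustrative names, not part of Hadoop.
sync_hdfs_to_gcs() {
  local src="$1" dst="$2"
  # -update        copy only files that are missing or differ on the target
  # -delete        remove target files that no longer exist on the source
  # -skipcrccheck  skip the checksum comparison (assumption: the two
  #                file systems cannot compare checksums with each other)
  # If DRY_RUN is set to a non-empty value, the command is printed, not run.
  ${DRY_RUN:+echo} hadoop distcp -update -delete -skipcrccheck "$src" "$dst"
}

# Preview the command without touching either file system.
DRY_RUN=1 sync_hdfs_to_gcs hdfs:///tmp/test_distcp gs://onpremhadoopfiles-123/
```

Calling the function without DRY_RUN runs the real job. Because -delete removes anything on the target that is not on the source, it is probably safer to point the target at a dedicated prefix rather than the bucket root, and to check the dry-run output first.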

You can find more details in the official documentation, under the sub-heading "DistCp and Object Stores": https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html#Appendix

0
Yogesh Patel On

hadoop distcp -direct -pr -update -delete hdfs:///tmp/source/ gs://gcs-data-bucket/tmp/target/

The above command worked for me.