I am trying to replicate data between HDFS and GCP Cloud Storage. This is not a one-time copy: after the first copy, I want to copy only new and updated files, and files deleted on-prem should also be deleted from Cloud Storage.
However, I found that snapshot-diff-based copy does not work when the target is cloud storage.
Is it even possible to do this sync?
The -update flag doesn't seem to work with Cloud Storage either: it copies all the files even when nothing has changed.
Command
hadoop distcp --conf hdfs.conf -update -delete hdfs:///tmp/test_distcp gs://onpremhadoopfiles-123/
Command with snapshot diff
hadoop distcp --conf test.conf -update -diff test_distcp test_distcp_new hdfs:///tmp/test_distcp gs://xxxx-123/
Error
Jul 29, 2022 9:56:31 AM com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase configure
WARNING: No working directory configured, using default: 'gs://onpremhadoopfiles-123/'
22/07/29 09:56:32 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, ignoreFailures=false, overwrite=false, append=false, useDiff=true, useRdiff=false, fromSnapshot=test_distcp, toSnapshot=test_distcp_new, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=0.0, copyStrategy='uniformsize', preserveStatus=[BLOCKSIZE], atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[hdfs:/tmp/test_distcp], targetPath=gs://xxx-123/, filtersFile='null', blocksPerChunk=0, copyBufferSize=8192, verboseLog=false}, sourcePaths=[hdfs:/tmp/test_distcp], targetPathExists=true, preserveRawXattrsfalse
22/07/29 09:56:32 INFO client.RMProxy: Connecting to ResourceManager at xxx.xxx.com/xx.xx.xx.x:8032
22/07/29 09:56:33 ERROR tools.DistCp: Exception encountered
java.lang.IllegalArgumentException: The FileSystems needs to be DistributedFileSystem for using snapshot-diff-based distcp
at org.apache.hadoop.tools.DistCpSync.preSyncCheck(DistCpSync.java:98)
at org.apache.hadoop.tools.DistCpSync.sync(DistCpSync.java:149)
at org.apache.hadoop.tools.DistCp.prepareFileListing(DistCp.java:88)
at org.apache.hadoop.tools.DistCp.createAndSubmitJob(DistCp.java:205)
at org.apache.hadoop.tools.DistCp.execute(DistCp.java:182)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:153)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:432)
Snapshot-diff-based DistCp is only possible when both the source and the target file systems support snapshot operations; as the exception says, both need to be DistributedFileSystem (HDFS). GCP Cloud Storage doesn't support snapshots, so you can't use snapshot-based sync here.
However, the same behaviour can be achieved with the -update and -delete options of distcp.
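As a rough sketch of that approach, the recurring sync could be wrapped in a small script. The paths and config file are the ones from the question; run_sync and the DRY_RUN switch are hypothetical names made up for this example. It defaults to only printing the command, because -delete removes files on the target. Note that -delete is only accepted together with -update or -overwrite.

```shell
#!/bin/sh
# Sketch of a repeatable HDFS -> Cloud Storage sync with -update and -delete.
# SRC, DST and CONF come from the question; adjust them for your cluster.
set -eu

SRC="hdfs:///tmp/test_distcp"
DST="gs://onpremhadoopfiles-123/"
CONF="hdfs.conf"

# DRY_RUN is a hypothetical switch for this sketch:
# 1 (default) only prints the command, 0 actually runs it.
DRY_RUN="${DRY_RUN:-1}"

# -update copies files that are missing or differ on the target;
# -delete removes target files that no longer exist on the source.
run_sync() {
  echo "hadoop distcp --conf $CONF -update -delete $1 $2"
  if [ "$DRY_RUN" = "0" ]; then
    hadoop distcp --conf "$CONF" -update -delete "$1" "$2"
  fi
}

run_sync "$SRC" "$DST"
```

With DRY_RUN=0 the script performs the copy; running it on a schedule (for example from cron) would give the ongoing sync without needing snapshots on the target.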
You can find more details in the official documentation, under the sub-heading "DistCp and Object Stores": https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html#Appendix