Does rclone support HDFS TDE?

What is the problem you are having with rclone?

I am trying to copy HDFS data between 2 clusters.
The source cluster has TDE enabled; the destination HDFS cluster has no TDE.
rclone copies the files to the destination cluster, but they are useless.

What is your rclone version (output from rclone version)

rclone v1.56.2
- os/version: redhat 7.8 (64 bit)
- os/kernel: 3.10.0-1127.el7.x86_64 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.16.8
- go/linking: static
- go/tags: none

Which cloud storage system are you using? (eg Google Drive)

on-premise HDFS

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone copy san_prod:/tests/data las_prod:/tests/data

The rclone config contents with secrets removed.

[las_prod]
type = hdfs
namenode = lassdppr01hdp01.las.ssnsgs.net:8020
username = hdfs

[san_prod]
type = hdfs
namenode = sansdppr01hdp01.san.ssnsgs.net:8020
username = hdfs

A log from the command with the -vv flag

: [0132] root@lassdppr01nag01:rclone # ; ./rclone -vv  copy san_prod:/tests/data las_prod:/tests/data
2021/10/15 01:36:14 DEBUG : rclone: Version "v1.56.2" starting with parameters ["./rclone" "-vv" "copy" "san_prod:/tests/data" "las_prod:/tests/data"]
2021/10/15 01:36:14 DEBUG : Creating backend with remote "san_prod:/tests/data"
2021/10/15 01:36:14 DEBUG : Using config file from "/root/.config/rclone/rclone.conf"
2021/10/15 01:36:15 DEBUG : Creating backend with remote "las_prod:/tests/data"
2021/10/15 01:36:15 DEBUG : hdfs://lassdppr01hdp01.las.ssnsgs.net:8020: list [/tests/data]
2021/10/15 01:36:15 DEBUG : hdfs://sansdppr01hdp01.san.ssnsgs.net:8020: list [/tests/data]
2021/10/15 01:36:15 DEBUG : hdfs://lassdppr01hdp01.las.ssnsgs.net:8020: Waiting for checks to finish
2021/10/15 01:36:15 DEBUG : hdfs://lassdppr01hdp01.las.ssnsgs.net:8020: Waiting for transfers to finish
2021/10/15 01:36:15 DEBUG : hdfs://sansdppr01hdp01.san.ssnsgs.net:8020: open [/tests/data/20211002-62664aaf-b1b0-42f0-bf0a-2276b22236d9.csv.gz]
2021/10/15 01:36:15 DEBUG : hdfs://lassdppr01hdp01.las.ssnsgs.net:8020: update [/tests/data/20211002-62664aaf-b1b0-42f0-bf0a-2276b22236d9.csv.gz]
2021/10/15 01:36:15 INFO  : 20211002-62664aaf-b1b0-42f0-bf0a-2276b22236d9.csv.gz: Copied (new)
2021/10/15 01:36:15 INFO  :
Transferred:        5.214Ki / 5.214 KiByte, 100%, 0 Byte/s, ETA -
Transferred:            1 / 1, 100%
Elapsed time:         0.1s

2021/10/15 01:36:15 DEBUG : 5 go routines active

In what way are they useless?

The files on the destination HDFS cluster are corrupt.
I can't unzip a gzip file and I can't open a text file.

In the source HDFS cluster these files are in an "encryption zone". See Hadoop TDE
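For anyone unfamiliar with TDE: an encryption zone is an HDFS directory whose contents are transparently encrypted with a key managed by the Hadoop KMS. As a sketch, you can inspect a cluster's zones with the hdfs crypto tool (run as the HDFS superuser); the key name below is just a placeholder:

```shell
# List all configured encryption zones and their key names
hdfs crypto -listZones

# For reference, a zone is created by binding an (empty) directory to a KMS key:
#   hdfs crypto -createZone -keyName myKey -path /prod1
```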

Even when I try to copy files by prepending the paths with /.reserved/raw/, the files are not readable in the destination cluster.

Here is a full example of what I mean:

In the source HDFS cluster I copy a file into my /prod1 encryption zone:

: [2243] hdfs@sansdppr01hdp05:rclone $ ; hdfs dfs -ls /prod1/raw/test
Found 1 items
-rw-r--r--   3 hdfs hadoop   14899752 2021-10-15 22:43 /prod1/raw/test/rclone.tgz
: [SAN_PROD01 Hadoop] ;
: [2244] hdfs@sansdppr01hdp05:rclone $ ; hdfs dfs -copyToLocal  /prod1/raw/test/rclone.tgz .
: [SAN_PROD01 Hadoop] ;
: [2244] hdfs@sansdppr01hdp05:rclone $ ; tar tvzf rclone.tgz
-rw-r--r-- hdfs/hadoop    1131 2021-10-13 23:17 git-log.txt
-rwxr-xr-x hdfs/hadoop 43847680 2021-10-13 23:17 rclone
-rw-r--r-- hdfs/hadoop  1455517 2021-10-13 23:17 rclone.1
-rw-r--r-- hdfs/hadoop  1575634 2021-10-13 23:17 README.html
-rw-r--r-- hdfs/hadoop  1278833 2021-10-13 23:17 README.txt

Then I run rclone to copy this dir to my destination HDFS cluster:

: [2246] hdfs@sansdppr01hdp05:rclone $ ; ./rclone -vv copy san_prod:/.reserved/raw/prod1/raw/test las_prod:/.reserved/raw/prod1/raw/test
2021/10/15 22:48:04 DEBUG : rclone: Version "v1.56.2" starting with parameters ["./rclone" "-vv" "copy" "san_prod:/.reserved/raw/prod1/raw/test" "las_prod:/.reserved/raw/prod1/raw/test"]
2021/10/15 22:48:04 DEBUG : Creating backend with remote "san_prod:/.reserved/raw/prod1/raw/test"
2021/10/15 22:48:04 DEBUG : Using config file from "/home/hdfs/.config/rclone/rclone.conf"
2021/10/15 22:48:04 DEBUG : Creating backend with remote "las_prod:/.reserved/raw/prod1/raw/test"
2021/10/15 22:48:04 DEBUG : hdfs://lassdppr01hdp01.las.ssnsgs.net:8020: list [/.reserved/raw/prod1/raw/test]
2021/10/15 22:48:04 DEBUG : hdfs://sansdppr01hdp01.san.ssnsgs.net:8020: list [/.reserved/raw/prod1/raw/test]
2021/10/15 22:48:04 DEBUG : hdfs://lassdppr01hdp01.las.ssnsgs.net:8020: Waiting for checks to finish
2021/10/15 22:48:04 DEBUG : hdfs://lassdppr01hdp01.las.ssnsgs.net:8020: Waiting for transfers to finish
2021/10/15 22:48:04 DEBUG : hdfs://sansdppr01hdp01.san.ssnsgs.net:8020: open [/.reserved/raw/prod1/raw/test/rclone.tgz]
2021/10/15 22:48:04 DEBUG : hdfs://lassdppr01hdp01.las.ssnsgs.net:8020: update [/.reserved/raw/prod1/raw/test/rclone.tgz]
2021/10/15 22:48:05 INFO  : rclone.tgz: Copied (new)
2021/10/15 22:48:05 INFO  :
Transferred:       14.210Mi / 14.210 MiByte, 100%, 0 Byte/s, ETA -
Transferred:            1 / 1, 100%
Elapsed time:         0.5s

2021/10/15 22:48:05 DEBUG : 5 go routines active

The tgz file in the destination cluster:

: [2241] hdfs@lassdppr01hdp05:~ $ ; hdfs dfs -ls /prod1/raw/test
Found 1 items
-rw-r--r--   3 hdfs hadoop   14899752 2021-10-15 22:43 /prod1/raw/test/rclone.tgz
: [SLDP: LAS_PROD01] ;
: [2249] hdfs@lassdppr01hdp05:~ $ ; hdfs dfs -copyToLocal  /prod1/raw/test/rclone.tgz .
: [SLDP: LAS_PROD01] ;
: [2249] hdfs@lassdppr01hdp05:~ $ ; tar tvzf rclone.tgz

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

I don't know much about this backend either.

Perhaps @ncw or @ivandeex might have a thought as I was looking through the feature request and I think they both were involved. I do not see @urykhy on the forums.

I suspect you are right about it not supporting TDE.

Can you open a new issue on GitHub so we can ask the experts there about it?

Thanks

I don't know what TDE is, so I can't comment. Let the gurus decide.

I also wanted to mention that copying HDFS files from an "unencrypted" area in the source cluster to the destination cluster works fine (there is no data corruption).

For example:

  1. All files under /prod1 are in an "encryption zone".
  2. First I copied the contents of /prod1/raw/test to /unencrypted/raw/test using
     the hadoop distcp command (which preserves timestamps, permissions, ACLs, etc.).
  3. Then I ran rclone copy san_prod:/unencrypted/raw/test las_prod:/prod1/raw/test.
  4. This achieves my goal of mirroring data across the 2 HDFS clusters.
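The two-step workaround can be sketched as commands (paths as in my example; the exact distcp preservation flags may need adjusting for your setup):

```shell
# 1. Stage the encrypted-zone data into an unencrypted path on the SOURCE
#    cluster. distcp reads through the NameNode, so it writes decrypted bytes.
hadoop distcp -p /prod1/raw/test /unencrypted/raw/test

# 2. Mirror the staged (plaintext) copy into the DESTINATION cluster; if the
#    target path is an encryption zone there, HDFS re-encrypts on write.
./rclone copy san_prod:/unencrypted/raw/test las_prod:/prod1/raw/test
```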

So this is probably a feature request for rclone to support Hadoop TDE.
It would be nice if the documentation at least mentioned this limitation.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.