20 Petabytes in 1 million files across 60 nodes

What is the problem you are having with rclone?

I need to copy about 20 Petabytes from a cluster of web servers to an S3-compatible bucket. Access to the web servers requires a custom auth header, they do not support HEAD requests, and we need to retain the Last-Modified time as metadata.

Run the command 'rclone version' and share the full output of the command.

rclone v1.62.2

  • os/version: ubuntu 18.04 (64 bit)

  • os/kernel: 4.15.0-144-generic (x86_64)

  • os/type: linux

  • os/arch: amd64

  • go/version: go1.20.2

  • go/linking: static

  • go/tags: none

Which cloud storage system are you using? (eg Google Drive)

I'm using MinIO.

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone -vv copyurl --header 'X-Http-Auth: DEADBEEFF00DCAFE' --http-no-head -M http://node1.lan/foo/bar/file.dat MinIO:new_bucket/foo/bar/file.dat

This command works but I don’t want to call it 1 million times!

Is there a shared library I can link into Python, or a REST API we can use to queue jobs and have them run concurrently? Not all 1 million at the same time, just a few at a time from node1…nodeNN.

You can use operations/copyurl from the API to do this.

You can either run rclone rcd to run rclone as an API server and use REST calls on that, or you can use rclone as a python library (which effectively calls the API but in-process).
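For example, here is a minimal sketch of the REST route from Python, assuming rclone rcd is running locally with --rc-no-auth on the default port 5572 (the helper name and the core/version check are just illustrative):

    import requests

    RC_URL = "http://127.0.0.1:5572"   # default rclone rcd address

    def rc(method: str, **params):
        # POST one rc call, e.g. rc("operations/copyurl", fs=..., remote=..., url=...)
        resp = requests.post(f"{RC_URL}/{method}", json=params)
        resp.raise_for_status()
        return resp.json()

    # Quick check that the server is reachable.
    print(rc("core/version"))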


I would just run rclone copy from the local storage on each node to the S3 bucket. That way each node uploads its own on-disk files to S3 at the same time.

Then there is no need for headers, and you keep the modified time.
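For example, something along these lines on each node (the source path and the transfer count are just placeholders):

rclone copy /var/www/data MinIO:new_bucket --transfers 16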

I’m having a difficult time finding documentation that explains how to call copyurl through the API. Can you provide a reference/URL? Or is it more of a trial and error process?

As for the Python library, I used apt-get to install rclone; where is the library installed by default? It does not appear in /usr/local/lib or /usr/lib. Must I build it myself?

Thanks for your help

Yes, worst case I log in to each host and run it locally.

Hi,

hope this is not too off-topic, but I would try to run rclone locally.

That way, as an additional type of file verification, rclone calculates the MD5 hash of the source file and compares it against the destination file.

For the REST API, my best guess is that:
rclone -vv copyurl --header 'X-Http-Auth: DEADBEEFF00DCAFE' --http-no-head -M http://node1.lan/foo/bar/file.dat MinIO:new_bucket/foo/bar/file.dat

And

rclone rc operations/copyurl fs=MinIO: remote:new_bucket/foo/bar/file.dat autoFilename:true url:http://node1.lan/foo/bar/file.dat MinIO:new_bucket/foo/bar/file.dat

Should be equivalent, but the custom header is obviously missing, and I can only guess that large files in large queues may need synchronous or asynchronous handling; how those parameters are passed through is a bit confusing.

That looks about right.

You pass through global flags (of which --header is one) using _config

For --header you want something like this

    _config='{"Headers": [{"Key": "X-Http-Auth","Value": "DEADBEEFF00DCAFE"}]}'

Pass the _async flag to run the job asynchronously; the call then returns a job ID immediately instead of waiting for the result.

You'll need to do this if the job takes longer than the HTTP timeout (30s I think).

If you don't care about the result of the job you don't need to do anything else.

If you want to know about the result of the job then you need to poll job/status or use job/list.

So a scheduling algorithm for you might be

  • call job/list to collect current job statuses
  • note jobs in error (finished=true, error=true)
  • if there are fewer than 10/100/whatever jobs active (finished=false) then add another job with _async=true (see the sketch after this list)
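A rough Python sketch of that loop, assuming a local rcd started with --rc-no-auth on the default port; the remote name, bucket, header value, URL list and concurrency limit are all placeholders:

    import time
    import requests

    RC_URL = "http://127.0.0.1:5572"   # local "rclone rcd --rc-no-auth"
    MAX_ACTIVE = 10                    # arbitrary concurrency limit
    HEADER = {"Headers": [{"Key": "X-Http-Auth", "Value": "DEADBEEFF00DCAFE"}]}

    def rc(method, **params):
        # POST one rc call as JSON and return the decoded reply
        resp = requests.post(f"{RC_URL}/{method}", json=params)
        resp.raise_for_status()
        return resp.json()

    def unfinished_jobs():
        # job/list gives the job ids; job/status says whether each one has finished
        ids = rc("job/list")["jobids"]
        return [i for i in ids if not rc("job/status", jobid=i)["finished"]]

    urls = ["http://node1.lan/foo/bar/file.dat"]   # placeholder work queue

    for url in urls:
        while len(unfinished_jobs()) >= MAX_ACTIVE:
            time.sleep(1)              # wait for a slot to free up
        job = rc("operations/copyurl",
                 fs="MinIO:new_bucket",
                 remote="foo/bar/" + url.rsplit("/", 1)[1],  # object key
                 url=url,
                 _async=True,
                 _config=HEADER)
        print("queued job", job["jobid"], "for", url)

    # Afterwards, poll job/status for each job id and note any with error=true.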

I see that the syntax of my command was not correct; parameter assignment uses "=" not ":" on the command line.

After fixing that, I see the process responds with a 500 error while attempting what appears to be a PUT request with invalid parameters. Is this a PUT against the S3 target or the source web server?

Here is the command issued and response:

rclone rc operations/copyurl fs=MinIO: remote=new_bucket/foo/bar/ autoFilename=true url=http://node1/foo/bar/file.dat _config='{"Headers": [{"Key": "X-Http-Auth", "Value": "DEADBEEFF00DCAFE"}]}'

2023/05/24 07:28:18 Failed to rc: failed to read rc response: 500 Internal Server Error: {
        "error": "InvalidParameter: 1 validation error(s) found.\n- minimum field size of 1, PutObjectInput.Key.\n",
        "input": {
                "_config": "{\"Headers\": [{\"Key\": \"X-Http-Auth\", \"Value\": \"DEADBEEFF00DCAFE\"}]}",
                "autoFilename": "true",
                "fs": "MinIO:",
                "remote": "new_bucket/foo/bar/",
                "url": "http://node1/foo/bar/file.dat"
        },
        "path": "operations/copyurl",
        "status": 500
}

This is the output seen by the rclone agent (running in rcd mode):

rclone -vv rcd --rc-no-auth

2023/05/24 07:26:28 DEBUG : rclone: Version "v1.62.2" starting with parameters ["rclone" "-vv" "rcd" "--rc-no-auth"]

2023/05/24 07:26:28 NOTICE: Serving remote control on http://127.0.0.1:5572/

2023/05/24 07:28:18 DEBUG : rc: "operations/copyurl": with parameters map[_config:{"Headers": [{"Key": "X-Http-Auth", "Value": "DEADBEEFF00DCAFE"}]} autoFilename:true fs:MinIO: remote:new_bucket/foo/bar/ url:http://node1/foo/bar/file.dat]

2023/05/24 07:28:18 DEBUG : Creating backend with remote "MinIO:"

2023/05/24 07:28:18 DEBUG : Using config file from "/home/alex/.config/rclone/rclone.conf"

2023/05/24 07:28:18 DEBUG : Resolving service "s3" region "us-east-1"

2023/05/24 07:28:18 DEBUG : file.dat: File name found in url

2023/05/24 07:28:18 ERROR : file.dat: Post request put error: InvalidParameter: 1 validation error(s) found.

- minimum field size of 1, PutObjectInput.Key.

2023/05/24 07:28:18 ERROR : rc: "operations/copyurl": error: InvalidParameter: 1 validation error(s) found.

- minimum field size of 1, PutObjectInput.Key.

I suspect this will work better if you write it like this

            "fs": "MinIO:new_bucket",
            "remote": "foo/bar/",

Give that a go

Different result, I got an empty response this time:

rclone rc operations/copyurl fs=MinIO:new_bucket remote=foo/bar/ autoFilename=true url=http://node1/foo/bar/file.dat _config='{"Headers": [{"Key": "X-Http-Auth", "Value": "DEADBEEFF00DCAFE"}]}' 
{}

The rclone agent shows:

2023/05/24 07:49:46 DEBUG : rclone: Version "v1.62.2" starting with parameters ["rclone" "-vv" "rcd" "--rc-no-auth"]

2023/05/24 07:49:46 NOTICE: Serving remote control on http://127.0.0.1:5572/

2023/05/24 07:49:58 DEBUG : rc: "operations/copyurl": with parameters map[_config:{"Headers": [{"Key": "X-Http-Auth", "Value": "DEADBEEFF00DCAFE"}]} autoFilename:true fs:MinIO:new_bucket remote:foo/bar/ url:http://node1/foo/bar/file.dat]

2023/05/24 07:49:58 DEBUG : Creating backend with remote "MinIO:new_bucket"

2023/05/24 07:49:58 DEBUG : Using config file from "/home/alex/.config/rclone/rclone.conf"

2023/05/24 07:49:58 DEBUG : Resolving service "s3" region "us-east-1"

2023/05/24 07:49:58 DEBUG : file.dat: File name found in url

2023/05/24 07:49:59 DEBUG : rc: "operations/copyurl": reply map[]: <nil>

That means it worked!

Thank you, yes, that does appear to work, but it is a little disconcerting that it doesn't return a job ID; I'm not sure how we track status with lots of jobs in flight.

I also found that it always returns {} even if I leave the authentication header off, which just means the request to rclone was well formed but the call may ultimately fail. Without a job ID it's not clear how reliable this feature is in production.

You didn't use _async which means the result was synchronous.

If it had failed you would get an error message instead of {}

How odd, it fails silently here and the result is nothing in the remote/destination object store


~$ rclone rc operations/copyurl fs=MinIO:new_bucket remote=foo/bar/ autoFilename=true url=http://node1/foo/bar/file.dat _config='{"Headers": [{"Key": "X-Http-Auth", "Value": "DEADBEEFF00DCAFE"}]}'

{}

~$ rclone rc operations/copyurl fs=MinIO:new_bucket remote=foo/bar/ autoFilename=true url=http://node1/foo/bar/file.dat

{}

~$ rclone -vv lsl MinIO:new_bucket/foo

2023/05/24 12:28:06 DEBUG : rclone: Version "v1.62.2" starting with parameters ["rclone" "-vv" "lsl" "MinIO:new_bucket/foo"]

2023/05/24 12:28:06 DEBUG : Creating backend with remote "MinIO:new_bucket/foo"

2023/05/24 12:28:06 DEBUG : Using config file from "/home/alex/.config/rclone/rclone.conf"

2023/05/24 12:28:06 DEBUG : Resolving service "s3" region "us-east-1"

2023/05/24 12:28:06 DEBUG : 4 go routines active

Try adding -vv --dump bodies to the rclone rcd and run again. You'll then get to see the HTTP transactions and there may be something useful in the DEBUG log also.

It might be that operations/copyurl isn't reporting an error correctly; it certainly should be, though.
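For example, re-running the earlier rcd invocation with the extra flag:

rclone -vv rcd --rc-no-auth --dump bodies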

@ncw Turning on --dump bodies corrupted the terminal with binary data; switching to --dump headers gave me enough information to see what was wrong.

First, someone on my side disabled auth; I suppose they were trying to help me out. :laughing: That explains why there were no errors without the valid headers.

Second, setting autoFilename to true causes rclone to ignore the value assigned to remote, so it writes everything to the bucket without a prefix, as if to the root of the filesystem. To fix this problem I changed the command from:
rclone rc operations/copyurl fs=MinIO:new_bucket remote=foo/bar/ autoFilename=true url=http://node1/foo/bar/file.dat

to:
rclone rc operations/copyurl fs=MinIO:new_bucket remote=foo/bar/file.dat url=http://node1/foo/bar/file.dat

This is very easy to reproduce and should be noted as a side-effect of using these parameters in combination.

Thanks again for your help.

:slight_smile:

You shouldn't need to supply remote at all with autoFilename, but you do, so I guess that is a bug.

You could use something like this

rclone rc operations/copyurl fs=MinIO:new_bucket/foo/bar/ remote= autoFilename=true url=http://node1/foo/bar/file.dat

Note that for efficiency reasons it's best if you don't change the fs= value on each call, so the solution you have already found is the best one.

No worries, and good luck with your 20PiB of data :slight_smile:
