I need to copy about 20 petabytes from a cluster of web servers to an S3-equivalent bucket. Access to the web servers requires a custom auth header, they do not support HEAD requests, and we need to retain the Last-Modified time as metadata.
Run the command 'rclone version' and share the full output of the command.
rclone v1.62.2
os/version: ubuntu 18.04 (64 bit)
os/kernel: 4.15.0-144-generic (x86_64)
os/type: linux
os/arch: amd64
go/version: go1.20.2
go/linking: static
go/tags: none
Which cloud storage system are you using? (eg Google Drive)
MinIO (S3-compatible).
This command works, but I don't want to call it 1 million times!
Is there a shared library I can link into Python, or a REST API we can use to queue jobs and have them run concurrently? Not all 1 million at the same time, just a few at a time from node1…nodeNN.
You can either run rclone rcd to run rclone as an API server and use REST calls on that, or you can use rclone as a python library (which effectively calls the API but in-process).
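For the first route, a minimal sketch (assuming the default listen address localhost:5572 and placeholder credentials; the fs/remote/url values are just examples to substitute):

```
# Start rclone as an API server on one node.
# It listens on localhost:5572 by default; --rc-user/--rc-pass (or --rc-no-auth)
# are required before calls that touch remotes are accepted.
rclone rcd --rc-user myuser --rc-pass mypass -v

# Queue a single transfer by POSTing JSON to the operations/copyurl endpoint.
curl -u myuser:mypass -H 'Content-Type: application/json' \
  -d '{"fs":"MinIO:new_bucket","remote":"foo/bar/file.dat","url":"http://node1.lan/foo/bar/file.dat"}' \
  http://localhost:5572/operations/copyurl
```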
I would just run rclone copy from the local storage in each node to the S3 bucket. That way each node uploads its own on-disk files to S3 at the same time.
Then there is no need for headers and you keep the modified time.
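Roughly, on each node (a sketch only; the local path, bucket prefix and transfer counts are placeholders to tune):

```
# Upload this node's on-disk files directly to the bucket.
# Modified times are preserved as S3 object metadata (X-Amz-Meta-Mtime) by default.
rclone copy /var/www/data MinIO:new_bucket/node1 --transfers 32 --checkers 64 --progress
```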
I’m having a difficult time finding documentation that explains how to call copyurl through the API. Can you provide a reference/URL? Or is it more of a trial and error process?
As for the Python library, I installed via apt-get; where is it installed by default? It does not appear in /usr/local/lib or /usr/lib. Must I build it myself?
For the REST API, my best guess is that:

rclone -vv copyurl --header 'X-Http-Auth: DEADBEEFF00DCAFE' --http-no-head -M http://node1.lan/foo/bar/file.dat MinIO:new_bucket/foo/bar/file.dat

should be equivalent, but there is the obvious problem of the missing custom header, and I can only guess that large files, in large queues, may need sync or async treatment; how those parameters are passed through is a bit confusing.
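If I'm reading the rc docs right, the sync/async choice is passed like any other parameter: adding _async=true to a call makes it return a job ID immediately instead of blocking until the copy finishes. Something along these lines (an untested sketch, reusing the values from the command above and assuming the rc server is reachable):

```
# Queue the copy asynchronously; the call should return {"jobid": <n>} straight away.
rclone rc operations/copyurl fs=MinIO:new_bucket remote=foo/bar/file.dat \
  url=http://node1.lan/foo/bar/file.dat _async=true
```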
I see that the syntax of the command was not correct; parameter assignment uses "=" not ":" on the command line.
After fixing that, the process responds with a 500 error while attempting what appears to be a PUT request with invalid parameters. Is this a PUT against the S3 target or the source web server?
Thank you, yes, that does appear to work, but it is a little disconcerting that it doesn't return a job ID; I'm not sure how we track status with lots of jobs in flight.
I also found that it always returns {} even if I leave the authentication header off, which just means the request to rclone was well formed but the call may ultimately fail. Without a job ID it's not clear how reliable this feature would be in production.
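From what I can tell in the docs, the job ID only comes back when _async=true is set, and it can then be polled; something like this (a sketch, with a made-up job ID):

```
# Check a single queued job by the ID returned from the _async call.
rclone rc job/status jobid=123

# List the IDs of all jobs the server knows about.
rclone rc job/list

# Overall transfer statistics for everything in flight.
rclone rc core/stats
```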
Try adding -vv --dump bodies to the rclone rcd and run again. You'll then get to see the HTTP transactions and there may be something useful in the DEBUG log also.
It might be operations/copyurl isn't reporting an error correctly - it certainly should be though.
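For example (with whatever rc auth flags you are already using; myuser/mypass are placeholders):

```
# Restart the API server with debug logging and the HTTP transactions dumped.
# --dump headers is a quieter alternative if the bodies turn out to be binary.
rclone rcd --rc-user myuser --rc-pass mypass -vv --dump bodies
```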
@ncw Turning on --dump bodies corrupted the terminal with binary data; switching to --dump headers gave me enough information to see what was wrong.
First, someone on my side had disabled auth (I suppose they were trying to help me out), which explains why there were no errors without the valid headers.
Second, setting autoFilename=true causes rclone to ignore the value assigned to remote, so it writes everything to the bucket without a prefix, as if to the root of the filesystem. To fix this I changed the command from:

rclone rc operations/copyurl fs=MinIO:new_bucket remote=foo/bar/ autoFilename=true url=http://node1/foo/bar/file.dat

to:

rclone rc operations/copyurl fs=MinIO:new_bucket remote=foo/bar/file.dat url=http://node1/foo/bar/file.dat
This is very easy to reproduce and should be noted as a side-effect of using these parameters in combination.