Reduce number of S3 Head Requests

I want to sync a folder to an S3 bucket.

Every day I do an rclone copy to copy only the new and modified files, and once a month I do an rclone sync to propagate the deletions as well.

Script that runs every day:

/usr/bin/rclone copy "/volume1/myfolder" "AmazonS3DeepGlacier:mybucket/myfolder" --max-age 24h --no-traverse --ignore-times --exclude "#recycle/" --exclude "@eaDir/" -v --config="/var/services/homes/admin/.config/rclone/rclone.conf" --track-renames

But I notice that for every PUT request there are 2 HEAD requests.

What is the reason for these HEAD requests, and is it possible to avoid them?

One HEAD request is to see whether the file already exists (because of the --no-traverse) and one is to confirm it was uploaded properly.

First make sure you have v1.48 - the latest release.

Depending on exactly how your files are laid out, removing --no-traverse may reduce the number of queries. If all the files you are uploading are in one directory for instance then that will definitely be a win. You'll need to use this in conjunction with --size-only or --checksum for syncing. This will avoid rclone doing a HEAD on the file to read the metadata.
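A sketch of what the daily command could look like with --no-traverse removed and --checksum added (untested; paths, remote name and config location are copied from the script above):

```shell
# Without --no-traverse, rclone LISTs the destination directories it needs
# and compares the MD5 checksums returned in those listings, so it does not
# have to HEAD each candidate file before deciding whether to upload it.
/usr/bin/rclone copy "/volume1/myfolder" "AmazonS3DeepGlacier:mybucket/myfolder" \
  --max-age 24h \
  --checksum \
  --exclude "#recycle/" --exclude "@eaDir/" \
  -v --config="/var/services/homes/admin/.config/rclone/rclone.conf"
```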

Why are you using --ignore-times? That is probably causing rclone to transfer stuff it doesn't need to.

Also --track-renames only works with rclone sync so I suggest you remove that.

You can also consider --update and --use-server-modtime.

When you do the full sync, if you have enough memory then --fast-list will make it run a lot quicker. You want to use --size-only, --checksum or --update --use-server-modtime with that.

I'm using the latest release.

One HEAD request is to see whether the file already exists (because of the --no-traverse)
Is it possible to avoid this request and just copy the file (for new and modified files) to the bucket without checking the existence?

The script for the daily backup I've changed to:

/usr/bin/rclone copy "/volume1/myfolder" "AmazonS3DeepGlacier:mybucket/myfolder" --max-age 24h --no-traverse --ignore-times --exclude "#recycle/**" --exclude "@eaDir/**" -v --config="/var/services/homes/admin/.config/rclone/rclone.conf" --track-renames

The script that runs monthly is:
/usr/bin/rclone sync "/volume1/myfolder" "AmazonS3DeepGlacier:mybucket/myfolder" --fast-list --exclude "#recycle/**" --exclude "@eaDir/**" -v --config="/var/services/homes/admin/.config/rclone/rclone.conf" --checksum --track-renames

One folder I backup to Amazon S3 bucket has 1224 files and 124 directories. (388 MB)
The other folder has 238508 files and 52433 directories. (40 GB)

Every day only a few files (say 10) are changed or added.

Because Amazon charges for requests I would like to limit the requests as much as possible.
If I have, for example, 5 new files I would like to see only 5 PUT requests, and maybe 5 HEAD requests to confirm the files were uploaded properly. But in my case even these 5 HEAD requests are not necessary, because I check the whole folder with the monthly script anyway. So is it possible to avoid all the HEAD requests?

Rclone reads the metadata for the file to see if it needs to copy it. The file might be there already, in which case one HEAD request is much better than an unnecessary upload.

Depending on how often you run the script this might be a situation you are facing.

As I said above, removing --no-traverse and using --size-only or --checksum might use fewer requests, as rclone will list a small number of directories and do no HEAD requests.

I've tested removing --no-traverse and adding --checksum:
/usr/bin/rclone copy "/volume1/myfolder" "AmazonS3DeepGlacier:mybucket/myfolder" --max-age 24h --checksum --exclude "#recycle/**" --exclude "@eaDir/**" -v --config="/var/services/homes/admin/.config/rclone/rclone.conf"

but this is definitely not giving the right results. I added 1 file which needed to be copied to the S3 bucket for my smallest folder (1224 files and 124 directories, 388 MB), but now I have a total of 767 requests:
694 HEAD requests
72 LIST requests
1 PUT request

Version of rclone:
rclone v1.47.0-098-gac4c8d8d-beta

  • os/arch: linux/amd64
  • go version: go1.12.5

How many of those files were less than 24h old?

The -vv flag will show what rclone is doing, which may give you a better idea of what is going on.

1 file was less than 24h old (I've just added 1 new file).

But reading the following link https://forum.rclone.org/t/no-traverse-for-dummies/2992/2, apparently I have to use --no-traverse: without that option, rclone loads the listings of all the remote files before discovering whether a newly added local file needs to be uploaded. With --no-traverse, rclone just checks the newly added file on the remote.

That is correct: with --no-traverse you get exactly one HEAD request per file to check whether it exists, plus one HEAD after the upload to see if the file is OK.

Depending on how the files to be uploaded are laid out not using --no-traverse can be quicker - for example if all the files were in one directory.

I don't understand why this made so many HEAD requests though... Could you run it with -vv --dump headers and post the result somewhere?

You can find the logs here:
https://1drv.ms/u/s!AJ4TaVsx8r2Iged5

Thank you for that. It is definitely doing more HEAD requests than it needs to and I'm not sure why - I'll investigate.

What happens if you try it in two steps, the first step with something like this to get a list of the files which need uploading

/usr/bin/rclone lsf -R --files-only "/volume1/myfolder" --max-age 24h --exclude "#recycle/**" --exclude "@eaDir/**" -v --config="/var/services/homes/admin/.config/rclone/rclone.conf" > file-list

and the second with --files-from to actually transfer them

/usr/bin/rclone copy "/volume1/myfolder" "AmazonS3DeepGlacier:mybucket/myfolder" --files-from file-list --checksum -v --config="/var/services/homes/admin/.config/rclone/rclone.conf"

And you can try adding --no-traverse to the second.
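Combined into one script, the two steps might look like this (a sketch; the temporary file handling with mktemp is my addition, and the paths and config location are taken from the commands above):

```shell
#!/bin/sh
# Step 1: list only the local files changed in the last 24h.
# Step 2: upload exactly those files with --files-from, so rclone never
# has to scan or age-filter the destination.
FILELIST=$(mktemp) || exit 1
trap 'rm -f "$FILELIST"' EXIT

/usr/bin/rclone lsf -R --files-only "/volume1/myfolder" \
  --max-age 24h \
  --exclude "#recycle/**" --exclude "@eaDir/**" \
  --config="/var/services/homes/admin/.config/rclone/rclone.conf" > "$FILELIST"

/usr/bin/rclone copy "/volume1/myfolder" "AmazonS3DeepGlacier:mybucket/myfolder" \
  --files-from "$FILELIST" --checksum -v \
  --config="/var/services/homes/admin/.config/rclone/rclone.conf"
```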

I've been doing some more investigation into this. I've discovered the problem. It is that rclone is doing the age filtering on the source and the destination, so each file considered is using a HEAD request.

My workaround above with --files-from should work to fix the issue.

Another way around this would be to use --use-server-modtime

  --use-server-modtime   Use server modified time instead of object metadata

Can you please make a new issue on GitHub about this and put a link to this page in it? I think this needs fixing properly at some point - I don't think the destination file list should be filtered by age at all, though there are rather a lot of things to consider there!
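A sketch of the daily copy using that flag (same paths as above; untested):

```shell
# --use-server-modtime: take the object's Last-Modified time from the
# bucket listing instead of HEADing each object for its mtime metadata.
# --update: only upload when the local file is newer than that time.
/usr/bin/rclone copy "/volume1/myfolder" "AmazonS3DeepGlacier:mybucket/myfolder" \
  --max-age 24h \
  --update --use-server-modtime \
  --exclude "#recycle/**" --exclude "@eaDir/**" \
  -v --config="/var/services/homes/admin/.config/rclone/rclone.conf"
```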

I've tested it with the following script

/usr/bin/rclone lsf -R --files-only "/volume1/myfolder"  --max-age 24h --exclude "#recycle/**" --exclude "@eaDir/**" -v  > file-list 
/usr/bin/rclone copy "/volume1/myfolder" "AmazonS3DeepGlacier:mybucket/myfolder" --files-from file-list --checksum -v --config="/var/services/homes/admin/.config/rclone/rclone.conf" --log-file="/volume1/homes/admin/rcloneLogs/`date +%Y%m%d_%H%M%S`BackupToAmazonS3DeepGlacier_myfolder.log"

and now I have the minimum number of requests (only 1 HEAD request per PUT request, no longer 2 HEAD requests as before).

I had 9 new files added (in a folder of 124 subdirectories and 1249 files) and with CloudWatch on Amazon I can see I have 20 requests.

  • 1 LIST request for the bucket.
  • 1 HEAD request for the folder myfolder
  • 18 = (9 PUT + 9 HEAD) requests for the 9 files I've added.

Are these 9 HEAD requests the ones to see if the files are OK, or the leading ones before upload?
Is there a possibility to avoid also these 9 HEAD requests?

Great!

Yes

That would require a new flag I think.

At the moment rclone does that check so that:

  • if the file is already there and correct it won't upload it again
  • on some remotes (eg google drive) you'll make a duplicate file if you don't update an existing file

Because I'm using the --max-age option it is not necessary to check whether the file is already there remotely. The script should just upload it, in my opinion. This would save the extra HEAD request.

Can I maybe avoid these extra requests by adding the options --checksum --no-traverse in the second step?

/usr/bin/rclone lsf -R --files-only "/volume1/myfolder"  --max-age 24h --exclude "#recycle/**" --exclude "@eaDir/**" -v  > file-list
/usr/bin/rclone copy "/volume1/FIRST LOOK VOF" "AmazonS3DeepGlacier:mybucket/myfolder" --files-from file-list --checksum --no-traverse  -v --config="/var/services/homes/admin/.config/rclone/rclone.conf" --log-file="/volume1/homes/admin/rcloneLogs/`date +%Y%m%d_%H%M%S`BackupToAmazonS3DeepGlacier_myfolder.log"

If this is not possible what flag would you propose?

To read the checksum it needs to do the HEAD request, so no, that won't help.

I think for that you want --ignore-times - I'm not 100% sure that will avoid the HEAD request - give it a go!

With the option --ignore-times I still get the HEAD request:

/usr/bin/rclone lsf -R --files-only "/volume1/myfolder"  --max-age 24h --exclude "#recycle/**" --exclude "@eaDir/**" -v  > file-list
/usr/bin/rclone copy "/volume1/myfolder" "AmazonS3DeepGlacier:mybucket/myfolder" --files-from file-list --ignore-times -v --config="/var/services/homes/admin/.config/rclone/rclone.conf" --log-file="/volume1/homes/admin/rcloneLogs/`date +%Y%m%d_%H%M%S`BackupToAmazonS3DeepGlacier_myfolder.log"

I get the following PUT and HEAD requests in the Amazon log when I have a modified file:

27d13145810bcc4308eccf87c3baef327f021ba65f117cc670a788ce3410cbf7 mybucket [03/Jul/2019:13:05:58 +0000] 91.178.129.118 arn:aws:iam::779481237381:user/synology 104F22BFD134675D REST.HEAD.OBJECT myfolder/myfile.txt "HEAD /mybucket/myfolder.txt/myfile.txt HTTP/1.1" 200 - - 38190 3 - "-" "rclone/v1.47.0-098-gac4c8d8d-beta" - vOVM6X0wbHjRQjAEy9V9a6uZpfNCmYqFJqM+LS0yfoVwqd31AZlN8Y0TI5qOzVWXLw8LZtWsvP0= SigV4 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader s3.eu-west-1.amazonaws.com TLSv1.2
27d13145810bcc4308eccf87c3baef327f021ba65f117cc670a788ce3410cbf7 mybucket [03/Jul/2019:13:05:58 +0000] 91.178.129.118 arn:aws:iam::779481237381:user/synology C2CD32B8E4654B8F REST.PUT.OBJECT myfolder/myfile.txt "PUT /mybucket/myfolder.txt/myfile.txt?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA3K7FTY6CTINL54NL%2F20190703%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20190703T130558Z&X-Amz-Expires=900&X-Amz-SignedHeaders=content-md5%3Bcontent-type%3Bhost%3Bx-amz-acl%3Bx-amz-meta-mtime%3Bx-amz-storage-class&X-Amz-Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX HTTP/1.1" 200 - - 38190 26 19 "-" "rclone/v1.47.0-098-gac4c8d8d-beta" - evFborV4aiTNir5+5lXT/b8HxRNWF0+rBCIkXl39lFq8awV+AgLsTEP+N8pUigfvH0N5yYhPPXw= SigV4 ECDHE-RSA-AES128-GCM-SHA256 QueryString s3.eu-west-1.amazonaws.com TLSv1.2

Any other idea?

I can't think of any at the moment. To ignore the destination completely would mean that repeating the command would make the upload again which is not desirable in the general case.

I could imagine a flag which did this though..

I did some extra tests, for example adding the --no-traverse option:

/usr/bin/rclone lsf -R --files-only "/volume1/myfolder" --max-age 24h --exclude "#recycle/**" --exclude "@eaDir/**" -v > file-list
/usr/bin/rclone copy "/volume1/myfolder" "AmazonS3DeepGlacier:mybucket/myfolder" --files-from file-list --ignore-times --no-traverse -v --config="/var/services/homes/admin/.config/rclone/rclone.conf" --log-file="/volume1/homes/admin/rcloneLogs/`date +%Y%m%d_%H%M%S`BackupToAmazonS3DeepGlacier_myfolder.log"

but this generates even more HEAD requests: 2 HEAD requests per PUT request.

So if 1 file is changed I get in total 5 requests on Amazon:

  • 1 PUT request + 2 HEAD requests for the file MyFile.xlsm
    1 of these HEAD requests looks for MyFile.xlsm at the root of the bucket; strange that it is looking there ...

  • 1 HEAD request for the bucket (with --no-traverse it is a HEAD request for the bucket instead of the LIST request you get without --no-traverse)

  • 1 HEAD request for the folder

These are the requests at Amazon for 1 modified file:

27d13145810bcc4308eccf87c3baef327f021ba65f117cc670a788ce3410cbf7 firstlookbackupsynology [06/Jul/2019:00:00:03 +0000] 91.178.129.118 arn:aws:iam::779481237381:user/synology 126382B935A2BF1F REST.HEAD.BUCKET - "HEAD /firstlookbackupsynology HTTP/1.1" 200 - - - 2 2 "-" "rclone/v1.47.0-098-gac4c8d8d-beta" - va3uablkmTJxsbMQfw5UI16Mno4j+gNt9RxuXUcfCbf6a57+HRn6TFuyQfNE10wU7HybIFtcYsA= SigV4 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader s3.eu-west-1.amazonaws.com TLSv1.2
27d13145810bcc4308eccf87c3baef327f021ba65f117cc670a788ce3410cbf7 firstlookbackupsynology [06/Jul/2019:00:00:03 +0000] 91.178.129.118 arn:aws:iam::779481237381:user/synology 1811705C01F59F4B REST.HEAD.OBJECT FIRST%2BLOOK%2BVOF/MyFolder/MyFile.xlsm "HEAD /firstlookbackupsynology/FIRST%20LOOK%20VOF/MyFolder/MyFile.xlsm HTTP/1.1" 200 - - 37844 6 - "-" "rclone/v1.47.0-098-gac4c8d8d-beta" - VpF5rBNW7+cf+nqX23QgoTfGRJzNvPziQ3/mgnHRa4zVaYP+ywGiX+PChctE69F641Ddl1ArSKU= SigV4 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader s3.eu-west-1.amazonaws.com TLSv1.2
27d13145810bcc4308eccf87c3baef327f021ba65f117cc670a788ce3410cbf7 firstlookbackupsynology [06/Jul/2019:00:00:03 +0000] 91.178.129.118 arn:aws:iam::779481237381:user/synology 4DCB48D3F09214B1 REST.HEAD.OBJECT FIRST%2BLOOK%2BVOF/MyFile.xlsm "HEAD /firstlookbackupsynology/FIRST%20LOOK%20VOF/MyFile.xlsm HTTP/1.1" 404 NoSuchKey 310 - 9 - "-" "rclone/v1.47.0-098-gac4c8d8d-beta" - xMVi7t+cJoIS00yDTL+wkl3LLXZrXti5WFMMkW5S41B+q65rhDQYfjTfnui8rjpMnDhwk1EWP7A= SigV4 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader s3.eu-west-1.amazonaws.com TLSv1.2
27d13145810bcc4308eccf87c3baef327f021ba65f117cc670a788ce3410cbf7 firstlookbackupsynology [06/Jul/2019:00:00:03 +0000] 91.178.129.118 arn:aws:iam::779481237381:user/synology 9989A7AB857370B6 REST.PUT.OBJECT FIRST%2BLOOK%2BVOF/MyFolder/MyFile.xlsm "PUT /firstlookbackupsynology/FIRST%20LOOK%20VOF/MyFolder/MyFile.xlsm?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA3K7FTY6CTINL54NL%2F20190706%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20190706T000003Z&X-Amz-Expires=900&X-Amz-SignedHeaders=content-md5%3Bcontent-type%3Bhost%3Bx-amz-acl%3Bx-amz-meta-mtime%3Bx-amz-storage-class&X-Amz-Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX HTTP/1.1" 200 - - 37844 32 26 "-" "rclone/v1.47.0-098-gac4c8d8d-beta" - cvXnz7ZXtGku08yNJkBZky7Pr8IG84W7aapX73p9d2vMZ+XrU8IbKOEgEQILOzAKqQMxe3lPYL4= SigV4 ECDHE-RSA-AES128-GCM-SHA256 QueryString s3.eu-west-1.amazonaws.com TLSv1.2
27d13145810bcc4308eccf87c3baef327f021ba65f117cc670a788ce3410cbf7 firstlookbackupsynology [06/Jul/2019:00:00:03 +0000] 91.178.129.118 arn:aws:iam::779481237381:user/synology EF5B76FE7A9D314F REST.HEAD.OBJECT FIRST%2BLOOK%2BVOF "HEAD /firstlookbackupsynology/FIRST%20LOOK%20VOF HTTP/1.1" 404 NoSuchKey 285 - 13 - "-" "rclone/v1.47.0-098-gac4c8d8d-beta" - jNjRF39Oi4VyHcqYqp05DyMv5Kj0w6KvxNnVFuWhr5bi3BnANMSy9KPRYmNf9ygk7jWcr4l4+kc= SigV4 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader s3.eu-west-1.amazonaws.com TLSv1.2

To ignore the destination completely would mean that repeating the command would make the upload again which is not desirable in the general case.

In the case where you first generate a list of newly added/modified files with the first command in the script, it is completely unnecessary to check whether the file exists on Amazon or whether it has been modified, because that first command gives us exactly the files which are new or modified.