Filtering files to copy based on S3 Metadata

What is the problem you are having with rclone?

I don't understand the documentation on Filter Options. I want to filter the files I'm copying based on S3 metadata (Glacier Deep Archive), but the documentation only shows flags for filtering directory listings. Can I filter out the files in S3 that are archived, or do I have to create a text file of filtered results before I can copy what I want from S3? I have never used rclone before; I always used AzCopy, since I'm migrating from AWS to Azure, but that doesn't allow filtering on S3 metadata either, and it fails with a 403 error as soon as it hits an archived file because the file is inaccessible. It seems most utilities are stupid in the same way: they can't skip or filter those files during the copy process.

Run the command 'rclone version' and share the full output of the command.

rclone copy awss3:ts-dfm-customer-documents-qa azblobqa:customerdocsqa --metadata --metadata-include tier=STANDARD --dry-run

Which cloud storage system are you using? (eg Google Drive)

S3 and Azure Blob (copy from S3 to Azure Blob)

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone copy awss3:ts-dfm-customer-documents-qa  azblobqa:customerdocsqa --metadata --metadata-include tier=STANDARD

rclone ls azblobqa:customerdocsqa
rclone ls awss3:ts-dfm-customer-documents-qa

Both ls commands work, so I have no issues connecting to my S3 and Azure Blob storage.

Please run 'rclone config redacted' and share the full output. If you get command not found, please make sure to update rclone.


[awss3]
type = s3
provider = AWS
env_auth = true
access_key_id = XXX
secret_access_key = XXX
region = us-west-2
location_constraint = us-west-2
storage_class = STANDARD

[azblobqa]
type = azureblob
account = XXX
env_auth = true
tenant = XXX
client_id = XXX
client_secret = XXX

I think it is working. The dry run showed the expected output, so I'm now running the command to see if it actually copies the files.

welcome to the forum,

rclone lsf aws01:zork --format=pT

rclone lsf aws01:zork --format=pT --metadata-exclude="tier=DEEP_ARCHIVE"

I'm sorry but I don't understand that answer. You can probably tell I'm not really a Linux user. I want to copy the files, but isn't lsf a switch to list the files (like ls)? I pasted my command in the topic and dry-run showed a list of files, so I ran the command, and it's sitting there doing something, but no output. I'm assuming it's doing something, but you know what happens when you assume...


i ran those commands on windows, not linux.
the same exact commands work on windows, linux, macos, etc...

from rclone docs, "To test filters without risk of damage to data, apply them to rclone ls"
imho, once the filter is safely tested using rclone ls, then rclone copy --dry-run -vv, then rclone copy -vv
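Putting that progression together with the remotes used in this thread, a cautious sequence might look like this (a sketch; the filter value is taken from the original command):

```shell
# 1) Test the metadata filter read-only: list only objects whose
#    tier is STANDARD, without touching any data.
rclone ls awss3:ts-dfm-customer-documents-qa --metadata-include tier=STANDARD

# 2) Rehearse the copy without writing anything.
rclone copy awss3:ts-dfm-customer-documents-qa azblobqa:customerdocsqa \
  --metadata --metadata-include tier=STANDARD --dry-run -vv

# 3) Run the real copy once the dry run looks right.
rclone copy awss3:ts-dfm-customer-documents-qa azblobqa:customerdocsqa \
  --metadata --metadata-include tier=STANDARD -vv
```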

Although I did get it working, performance has been somewhat erratic. The copy process starts well, then degrades to unacceptable levels. I have restarted the process several times with different settings and I am making some progress. It is odd, though. It seems to run in spurts: a set of files or a folder will copy, then it stalls, and after a while it starts again. It isn't CPU- or memory-bound, and at times I get throughput on the network interface approaching 1 GB/s, then nothing. Very weird. My latest attempt uses this command:

rclone copy awss3:ts-dfm-customer-documents-qa azblobqa:customerdocsqa --metadata --metadata-include tier=STANDARD --azureblob-upload-concurrency 64 --transfers 32 --disable-http2 --ignore-checksum --ignore-existing --log-level INFO

It copies then just stops...

Transferred: 26.751 GiB / 26.897 GiB, 99%, 4.013 MiB/s, ETA 37s
Transferred: 32009 / 32011, 100%
Elapsed time: 20m0.0s

  • BulkModeledRateRequest…-82be-06f08c789fad.csv: 21% /116.136Mi, 1.769Mi/s, 51s
  • BulkModeledRateRequest…-82c2-06f08c789fad.csv: 59% /143Mi, 7.498Mi/s, 7s

2024/04/19 19:48:40 INFO : BulkModeledRateRequest/f9338f76-dc9c-ed11-82c2-06f08c789fad.csv: Copied (new)
2024/04/19 19:48:44 INFO : BulkModeledRateRequest/e58beb33-7b45-ed11-82be-06f08c789fad.csv: Copied (new)
2024/04/19 19:49:38 INFO :
Transferred: 26.897 GiB / 26.897 GiB, 100%, 4.013 MiB/s, ETA 0s
Transferred: 32011 / 32011, 100%
Elapsed time: 21m0.0s

2024/04/19 19:50:38 INFO :
Transferred: 26.897 GiB / 26.897 GiB, 100%, 4.013 MiB/s, ETA 0s
Transferred: 32011 / 32011, 100%
Elapsed time: 22m0.0s

2024/04/19 19:51:38 INFO :
Transferred: 26.897 GiB / 26.897 GiB, 100%, 4.013 MiB/s, ETA 0s
Transferred: 32011 / 32011, 100%
Elapsed time: 23m0.0s

need to use DEBUG, not INFO
and for a deeper look, --dump=headers
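Combining those suggestions with the remotes used earlier in the thread, a debug invocation might look like this (a sketch; the log file name is illustrative):

```shell
# Re-run with debug-level logging and dumped HTTP headers to see
# what rclone is doing while the transfer appears stalled.
# Remote names are the ones from this thread; the log file name
# is just an example.
rclone copy awss3:ts-dfm-customer-documents-qa azblobqa:customerdocsqa \
  --metadata --metadata-include tier=STANDARD \
  --log-level DEBUG --dump headers --log-file rclone-debug.log
```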

Just to update you on this: I did successfully copy the data over the weekend for one environment. If I understand correctly, rclone reads the file list from S3 (applying the filter) and then starts the copy process. I believe the output paused for so long because, with the metadata filter I used to ignore Glacier Deep Archive, building the list of files to copy took a considerable amount of time: 95% of the files in one specific folder were archived, and there were a LOT of files in that 2 TB folder. Everything else copied with no issues, although I need to detune the performance a bit since the VM network interface was saturated at times. Other than that, success! I can't thank you guys enough for the help, because AzCopy and Azure Data Factory were really not going to solve the problem. I've become a big fan of rclone basically overnight.
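One way to "detune" a transfer like this is rclone's `--bwlimit` flag, optionally combined with fewer parallel transfers (a sketch; the bandwidth cap and transfer count are illustrative values, not tuned recommendations):

```shell
# Cap total bandwidth and reduce parallelism so the VM's network
# interface isn't saturated. The 500M cap and 16 transfers are
# example values only; adjust to taste.
rclone copy awss3:ts-dfm-customer-documents-qa azblobqa:customerdocsqa \
  --metadata --metadata-include tier=STANDARD \
  --bwlimit 500M --transfers 16 --log-level INFO
```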

And for anyone else copying from S3 to Azure Blob using Private Link/Private Endpoints for Azure Blob storage accounts: put your VM on the same VNet/subnet as the storage account's private endpoint. I was getting up to 1 GB/s network throughput for the entire copy process between private S3 buckets and a private storage account. The next one is 9.5 TB, so I'll find out how long that takes with the same setup.

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.