Rclone sync/copy no progress on one large 641GB file to Azure Blob

What is the problem you are having with rclone?

The copy (and sync) command only shows the Elapsed time increasing; no bytes are transferred, even after waiting for days. Smaller files (e.g. 10GB) do work, although only after about two minutes of Elapsed time have passed.

Run the command 'rclone version' and share the full output of the command.

rclone v1.58.0

  • os/version: freebsd 12.0-release-p4 (64 bit)
  • os/kernel: 12.0-release-p4 (amd64)
  • os/type: freebsd
  • os/arch: amd64
  • go/version: go1.17.8
  • go/linking: static
  • go/tags: none

Which cloud storage system are you using? (eg Google Drive)

Microsoft azureblob

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone copy -vv -P  local:/tank/projects/1/largefile.dat azure_archive:archive/largefile.dat

The rclone config contents with secrets removed.

[local]
type = local

[azure_archive]
type = azureblob
sas_url = https://privatename.blob.core.windows.net/archive001?sp=private

A log from the command with the -vv flag

2022/04/06 09:41:00 DEBUG : rclone: Version "v1.58.0" starting with parameters ["/usr/local/bin/rclone-v1.58.0-freebsd-amd64/rclone" "copy" "-vv" "-P" "local:/tank/projects/1/largefile.dat"]
2022/04/06 09:41:00 DEBUG : Creating backend with remote "local:/tank/projects/1/largefile.dat"
2022/04/06 09:41:00 DEBUG : Using config file from "/home/rclone/.config/rclone/rclone.conf"
2022/04/06 09:41:00 DEBUG : fs cache: adding new entry for parent of "local:/tank/projects/1/largefile.dat", "/tank/projects/1/largefile.dat"
2022/04/06 09:41:00 DEBUG : Creating backend with remote "azure_archive:archive1/largefile.dat"
2022-04-06 09:41:00 DEBUG : largefile.dat: Need to transfer - File not found at Destination
Transferred:              0 B / 641.449 GiB, 0%, 0 B/s, ETA -
Transferred:            0 / 1, 0%
Elapsed time:        22.6s
Transferring:
 *                                  largefile.dat:  0% /641.449Gi, 0/s, -

If you check the system, do you see disk IO? The first part is checksumming the file, so if you have very slow storage I'd imagine that is the bottleneck.

A 10GB file takes me about 20 seconds to sum on slower spinning storage, so 2 minutes seems really long.

felix@gemini:/data$ time md5sum testsum
2dd26c4d4799ebd29fa31e48d49e8e53  testsum

real	0m19.099s
user	0m14.426s
sys	0m4.671s
felix@gemini:/data$ du -sh testsum
10G	testsum

You can disable this checksum calculation with

  --azureblob-disable-checksum   Don't store MD5 checksum with object metadata

It might be worth trying that to see if the transfer starts immediately; if it does, that is definitely the problem.
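
For example, something along these lines (reusing the source path and remote from your original command):

rclone copy -vv -P --azureblob-disable-checksum local:/tank/projects/1/largefile.dat azure_archive:archive/largefile.dat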

How fast is your source disk?

Attached are dtrace syscall counters for a few seconds of the run:

CPU     ID                    FUNCTION:NAME
  1      1                           :BEGIN   UID   PID Command  Path
  getrandom                                                         1
  sigaction                                                         1
  sysarch                                                           1
  thr_new                                                           1
  fstatat                                                           2
  open                                                              2
  sigaltstack                                                       2
  thr_self                                                          3
  mmap                                                              4
  sigprocmask                                                       5
  ioctl                                                            10
  write                                                            11
  fstat                                                            18
  sched_yield                                                      34
  compat11.kevent                                                  41
  _umtx_op                                                        316
  sigreturn                                                       382
  thr_kill                                                        384
  nanosleep                                                      5345
  read                                                          37874
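
Counts like these can be collected with a dtrace one-liner along these lines (an approximation of the tooling, not the exact script that was used):

dtrace -n 'syscall:::entry /execname == "rclone"/ { @[probefunc] = count(); }'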

The rough disk read speed:

# gdd if=/tank/projects/1/largefile.dat of=/dev/null bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.78407 s, 602 MB/s

I think that dd command is giving you some odd numbers: my slow spinning disk reports figures that can't be right, because disk caching is in the mix with that command.

felix@gemini:/data$ dd if=testsum of=/dev/null bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.635587 s, 1.7 GB/s
felix@gemini:/data$ dd if=testsum of=/dev/null bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.464849 s, 2.3 GB/s
felix@gemini:/data$ dd if=testsum of=/dev/null bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.361735 s, 3.0 GB/s
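
If you want to take the page cache out of the picture, GNU dd can do direct IO; something like the following (block size lowered so O_DIRECT alignment isn't a problem):

dd if=testsum of=/dev/null bs=1M count=10240 iflag=direct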

This is more what I'd expect from a slow spinning disk:

felix@gemini:/data$ mount | grep data
/dev/sdd1 on /data type btrfs (rw,relatime,space_cache,subvolid=5,subvol=/)
felix@gemini:/data$ sudo hdparm -Tt /dev/sdd

/dev/sdd:
 Timing cached reads:   38294 MB in  2.00 seconds = 19184.28 MB/sec
 Timing buffered disk reads: 684 MB in  3.01 seconds = 227.46 MB/sec

Thank you; without the checksum on the source, the transfer starts immediately.

I would still like to keep the data transfer secure and reliable, perhaps with some interleaved checksums. Is it possible to tune block sizes of any kind?

It's all really disk IO bound and the read is sequential, so unfortunately there's no real way to defeat a slow disk; you can't tune your way out of that :slight_smile:

Great

Each block is checksummed and, if I remember rightly, there is a checksum of checksums.

What you are missing is storing an MD5 sum as metadata on that large file's object. That comes in useful for bitrot detection (eg with rclone check, which won't work without it) but doesn't make the transfer any more secure.
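
For example, with the MD5 stored you could later verify the upload against the source with something like this (source and destination names assumed from earlier in the thread):

rclone check local:/tank/projects/1 azure_archive:archive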

It's not the fastest storage around, and this environment is under constant load; regardless, reading through the whole file takes a finite amount of time:

time gdd if=largefile.dat of=/dev/null bs=1G
641+1 records in
641+1 records out
688751047612 bytes (689 GB, 641 GiB) copied, 3326.52 s, 207 MB/s

real    55m26.612s
user    0m0.008s
sys     8m33.407s

Could it be that the procedure @ncw called a "checksum of checksums" is not sequential and generates a lot of small IOPS, or even random I/O patterns? Are there any options left that keep a checksum but are less intensive for large files on this somewhat slower data source? (It's spinning rust.)

@ncw chime in if you like - didn't want to skip you :slight_smile:

That should be sequential, though remember rclone can transfer multiple files in parallel depending on the value of --transfers and scan multiple directories in parallel (controlled by --checkers). Often setting --checkers too high is counterproductive with hard disk based systems.

From the above it looks like you are just transferring one file though, so this probably isn't a --checkers or --transfers problem.

If you want to simulate what rclone is doing when it makes the checksum, then use

rclone -P  md5sum largefile.dat

I'd expect that to complete at about disk speed unless you have a really slow CPU, so for a 641 GB file it should take about 55m, not > 24hr.

You could disable checksums and transfer with --min-size 100G (say), then complete the transfer without --min-size and without the checksum disable to hoover up the remaining files. A sketch of that follows.
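
A minimal two-pass sketch, assuming the source directory and remote names used earlier in this thread:

# pass 1: large files only, skipping the MD5 metadata
rclone copy -P --azureblob-disable-checksum --min-size 100G local:/tank/projects/1 azure_archive:archive

# pass 2: everything else, with checksums stored as usual
rclone copy -P local:/tank/projects/1 azure_archive:archive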

Update,

(The manual md5sum takes <1hr.)
Now the single-file transfer has stopped after >18hr, running with the --azureblob-disable-checksum option:

Transferred:      390.637 GiB / 641.449 GiB, 61%, 0 B/s, ETA 0s
Checks:           1466715 / 1476722, 99%
Transferred:            0 / 1, 0%
Elapsed time:  18h47m22.6s
Transferring:
 * largefile.dat: 60% /641.449Gi, 0/s, 0s

Can the partial upload be resumed/continued by restarting, or are any special commands needed?

There's no way to resume a partial upload; there are some feature requests for it, but the feature doesn't exist yet.

Follow-up,

The maximum transferable file size seems to be limited by the default 4MB block size and the maximum of 50000 blocks per blob:

  • 4MB times 50000 = 195 GB maximum per file/object

After adjusting the chunk-size, this becomes:

  • 100MB times 50000 = 4.7 TB as the new limit
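
The chunk size can be raised on the command line, for example with something like this (paths reused from earlier in the thread; the equivalent chunk_size option can also be set in the remote's config section):

rclone copy -vv -P --azureblob-disable-checksum --azureblob-chunk-size 100M local:/tank/projects/1/largefile.dat azure_archive:archive/largefile.dat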

Could this behaviour be documented in the Azure instructions? These sizes may be uncommon now, but they won't be unheard of in the near future.

Thanks!

Thanks for the follow-up.

Can you open a new issue on GitHub about this?

The s3 backend automatically uses a larger chunk size for large files and I think the azure blob backend should do the same.

I wasn't aware of the 50,000 block limit - it is documented here though: Scalability and performance targets for Blob storage - Azure Storage | Microsoft Docs

As requested, the following issue has been created.

Feature request: Autmatically use Larger chunk size for large files to Azure blob (stay below 50000 blocks) · Issue #6115 · rclone/rclone (github.com)

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.