Sync: very high memory usage leading to crash

What is the problem you are having with rclone?

Using the sync command to sync big files from SMB to Backblaze makes rclone use an insanely high amount of memory, eventually crashing the OS (Windows Server 2012).

Run the command 'rclone version' and share the full output of the command.

rclone v1.62.2

  • os/version: Microsoft Windows Server 2012 Standard (64 bit)
  • os/kernel: 6.2.9200.24266 Build 9200.24266.24266 (x86_64)
  • os/type: windows
  • os/arch: amd64
  • go/version: go1.20.2
  • go/linking: static
  • go/tags: cmount

Which cloud storage system are you using? (eg Google Drive)

  • SMB hosted on a QNAP NAS (TS-431P2 running QTS 5.0.1)
  • Backblaze B2

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone.exe sync --stats-log-level NOTICE --stats 1h --transfers 300 --exclude *.temp serveurbackup:bkp Backblaze:azimut-backups

The rclone config contents with secrets removed.

[Backblaze]
type = b2
account = REDACTED
key = REDACTED
hard_delete = true
endpoint = https://api.backblazeb2.com

[serveurbackup]
type = smb
host = REDACTED
user = REDACTED
pass = REDACTED

A log from the command with the -vv flag

I don't have logs with the -vv flag because the process runs for days and I'm worried about the log size. Am I wrong to be?

Log file
2023/05/23 02:24:09 ERROR : Veeam: Entry doesn't belong in directory "Veeam" (too short) - ignoring
2023/05/23 02:24:12 ERROR : Veeam/Archive: Entry doesn't belong in directory "Veeam/Archive" (too short) - ignoring
2023/05/23 02:24:12 ERROR : Veeam/Backup: Entry doesn't belong in directory "Veeam/Backup" (too short) - ignoring
2023/05/23 03:24:05 NOTICE: 
Transferred:   	  121.765 GiB / 3.036 TiB, 4%, 29.462 MiB/s, ETA 1d4h50m
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:       1h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk:  3% /3.010Ti, 29.179Mi/s, 29h7m11s

2023/05/23 04:24:05 NOTICE: 
Transferred:   	  222.752 GiB / 3.036 TiB, 7%, 29.546 MiB/s, ETA 1d3h47m
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:       2h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk:  6% /3.010Ti, 29.394Mi/s, 27h55m46s

2023/05/23 05:24:05 NOTICE: 
Transferred:   	  323.653 GiB / 3.036 TiB, 10%, 30.542 MiB/s, ETA 1d1h56m
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:       3h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk:  9% /3.010Ti, 30.112Mi/s, 26h18m38s

2023/05/23 06:24:05 NOTICE: 
Transferred:   	  425.024 GiB / 3.036 TiB, 14%, 29.023 MiB/s, ETA 1d2h18m
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:       4h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 12% /3.010Ti, 29.640Mi/s, 25h45m23s

2023/05/23 07:24:05 NOTICE: 
Transferred:   	  525.970 GiB / 3.036 TiB, 17%, 32.166 MiB/s, ETA 22h50m28s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:       5h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 16% /3.010Ti, 32.058Mi/s, 22h55m5s

2023/05/23 08:24:05 NOTICE: 
Transferred:   	  626.446 GiB / 3.036 TiB, 20%, 32.396 MiB/s, ETA 21h47m47s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:       6h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 19% /3.010Ti, 32.461Mi/s, 21h45m10s

2023/05/23 09:24:05 NOTICE: 
Transferred:   	  724.036 GiB / 3.036 TiB, 23%, 29.262 MiB/s, ETA 23h10m56s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:       7h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 22% /3.010Ti, 28.945Mi/s, 23h26m10s

2023/05/23 10:24:05 NOTICE: 
Transferred:   	  820.766 GiB / 3.036 TiB, 26%, 28.520 MiB/s, ETA 22h49m16s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:       8h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 25% /3.010Ti, 28.020Mi/s, 23h13m40s

2023/05/23 11:24:05 NOTICE: 
Transferred:   	  919.348 GiB / 3.036 TiB, 30%, 28.986 MiB/s, ETA 21h29m12s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:       9h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 28% /3.010Ti, 28.847Mi/s, 21h35m23s

2023/05/23 12:24:05 NOTICE: 
Transferred:   	 1017.196 GiB / 3.036 TiB, 33%, 25.440 MiB/s, ETA 23h23m14s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:      10h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 32% /3.010Ti, 25.025Mi/s, 23h46m33s

2023/05/23 13:24:05 NOTICE: 
Transferred:   	    1.091 TiB / 3.036 TiB, 36%, 30.411 MiB/s, ETA 18h37m46s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:      11h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 35% /3.010Ti, 30.654Mi/s, 18h28m54s

2023/05/23 14:24:05 NOTICE: 
Transferred:   	    1.188 TiB / 3.036 TiB, 39%, 28.462 MiB/s, ETA 18h54m43s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:      12h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 38% /3.010Ti, 28.336Mi/s, 18h59m46s

2023/05/23 15:24:05 NOTICE: 
Transferred:   	    1.283 TiB / 3.036 TiB, 42%, 25.457 MiB/s, ETA 20h3m42s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:      13h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 41% /3.010Ti, 25.866Mi/s, 19h44m40s

2023/05/23 16:24:05 NOTICE: 
Transferred:   	    1.376 TiB / 3.036 TiB, 45%, 31.428 MiB/s, ETA 15h23m3s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:      14h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 44% /3.010Ti, 31.539Mi/s, 15h19m49s

2023/05/23 17:24:05 NOTICE: 
Transferred:   	    1.473 TiB / 3.036 TiB, 49%, 30.995 MiB/s, ETA 14h41m12s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:      15h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 48% /3.010Ti, 31.159Mi/s, 14h36m33s

2023/05/23 18:24:05 NOTICE: 
Transferred:   	    1.570 TiB / 3.036 TiB, 52%, 26.721 MiB/s, ETA 15h58m55s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:      16h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 51% /3.010Ti, 26.948Mi/s, 15h50m51s

2023/05/23 19:24:05 NOTICE: 
Transferred:   	    1.669 TiB / 3.036 TiB, 55%, 26.972 MiB/s, ETA 14h45m52s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:      17h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 54% /3.010Ti, 26.929Mi/s, 14h47m17s

2023/05/23 20:24:05 NOTICE: 
Transferred:   	    1.768 TiB / 3.036 TiB, 58%, 27.817 MiB/s, ETA 13h16m34s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:      18h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 57% /3.010Ti, 28.391Mi/s, 13h0m28s

2023/05/23 21:24:05 NOTICE: 
Transferred:   	    1.867 TiB / 3.036 TiB, 62%, 27.075 MiB/s, ETA 12h34m21s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:      19h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 61% /3.010Ti, 26.520Mi/s, 12h50m8s

2023/05/23 22:24:05 NOTICE: 
Transferred:   	    1.966 TiB / 3.036 TiB, 65%, 31.176 MiB/s, ETA 9h59m38s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:      20h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 64% /3.010Ti, 31.081Mi/s, 10h1m28s

2023/05/23 23:24:05 NOTICE: 
Transferred:   	    2.067 TiB / 3.036 TiB, 68%, 29.222 MiB/s, ETA 9h39m38s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:      21h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 67% /3.010Ti, 29.053Mi/s, 9h43m1s

2023/05/24 00:24:05 NOTICE: 
Transferred:   	    2.190 TiB / 3.036 TiB, 72%, 26.846 MiB/s, ETA 9h10m32s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:      22h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 71% /3.010Ti, 27.054Mi/s, 9h6m17s

2023/05/24 01:24:05 NOTICE: 
Transferred:   	    2.302 TiB / 3.036 TiB, 76%, 31.209 MiB/s, ETA 6h51m2s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:      23h0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 75% /3.010Ti, 31.145Mi/s, 6h51m53s

2023/05/24 02:24:05 NOTICE: 
Transferred:   	    2.420 TiB / 3.036 TiB, 80%, 61.547 MiB/s, ETA 2h54m55s
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:       1d0.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 79% /3.010Ti, 61.607Mi/s, 2h54m45s

2023/05/24 03:24:45 NOTICE: 
Transferred:   	    2.485 TiB / 3.036 TiB, 82%, 0 B/s, ETA -
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:    1d1h11.9s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 81% /3.010Ti, 0/s, 2562047h47m16s

2023/05/24 04:24:55 NOTICE: 
Transferred:   	    2.485 TiB / 3.036 TiB, 82%, 0 B/s, ETA -
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:    1d2h30.3s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 81% /3.010Ti, 0/s, 2562047h47m16s

2023/05/24 05:24:22 NOTICE: 
Transferred:   	    2.485 TiB / 3.036 TiB, 82%, 0 B/s, ETA -
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:     1d3h9.4s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 81% /3.010Ti, 0/s, 2562047h47m16s

2023/05/24 06:24:19 NOTICE: 
Transferred:   	    2.485 TiB / 3.036 TiB, 82%, 0 B/s, ETA -
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:     1d4h8.2s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 81% /3.010Ti, 0/s, 2562047h47m16s

2023/05/24 07:24:25 NOTICE: 
Transferred:   	    2.485 TiB / 3.036 TiB, 82%, 0 B/s, ETA -
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:     1d5h9.0s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 81% /3.010Ti, 0/s, 2562047h47m16s

2023/05/24 08:24:32 NOTICE: 
Transferred:   	    2.485 TiB / 3.036 TiB, 82%, 0 B/s, ETA -
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:    1d6h15.8s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 81% /3.010Ti, 0/s, 2562047h47m16s

2023/05/24 09:24:28 NOTICE: 
Transferred:   	    2.485 TiB / 3.036 TiB, 82%, 0 B/s, ETA -
Checks:             28371 / 28371, 100%
Transferred:           58 / 59, 98%
Elapsed time:    1d7h12.2s
Transferring:
 * Veeam/VM-WS2016B/VM-WS…-05-09T020730_67F4.vbk: 81% /3.010Ti, 0/s, 2562047h47m16s

More details

I'm running rclone on a Windows Server 2012 host to sync our backups from their storage (via SMB) to Backblaze B2.
One of the backup files is 3+ TB in size, so uploading it takes some time. The host machine has 64 GB of mostly unused memory (hence the --transfers 300 switch, which works fine).
However, from time to time, rclone starts to lose its mind and eats memory as if it were unlimited. From what I see in the logs, something goes wrong with the transfer, the ETA grows to 300 years, then memory usage climbs while CPU stays stable and network activity stops.

Memory usage: rclone starts at about 2am on 05/23, uses a few GB of RAM, then starts eating memory the day after, at about 2am on 05/24. Server rebooted on 05/24 at 9:30am.
Swap usage: not certain why it is so high all the time (left over from previous runs of rclone?)
CPU usage: rather stable the whole time
Network usage: goes up when rclone starts, stable most of the time, does something weird at 2am, drops at 3am


(One big image because new users can only upload one image...)

When this happens, the system goes unresponsive and I have to kill rclone.exe (when I'm lucky enough that RDP still works), or cold reboot the server.

It doesn't look like the communication with the NAS gets interrupted at any point; the NAS's uptime right now is 23 days.

I would at least expect rclone to detect when the transfers start to fail and either retry or stop.

To clarify: do you have a problem with rclone, or with rsync?

My bad, I meant rclone of course; I've edited my message.

This looks like a bug to me. In a sync the memory usage should stay stable.

From your log above, the only file being transferred at the time was the Veeam file, so I'd conjecture that an error occurred and rclone retried a block, but this kicked off some other problem. Hopefully the -vv log will say what happened.

A log with -vv would be really helpful. It would probably tell us what happened to make the transfer go bad.

Could you try to get a memory profile after the memory usage has spiked? You'll need to run rclone with the --rc flag and install the Go tools.
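For reference, fetching the profile typically looks something like this (assuming the rc server is listening on its default localhost:5572):

go tool pprof -text http://localhost:5572/debug/pprof/heap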

I've added -vv and --rc to my commands; I'll keep you posted next time it crashes. Next sync is scheduled for tonight, but it doesn't crash every time.

Great, thanks. If you can get a memory dump that would be great but if not the -vv log will help a lot.

This probably means it is related to an error rclone is retrying. That is my guess!

Hi,

So, stupid me forgot about the memory dump... Here are some logs of what happened anyway. I'll keep syncing with memory profiling enabled and try to dump it next time.

Memory usage (and a little bit of swap) went up at about 2:43am, and swap usage went through the roof at 7:42am (but it could have been earlier; it looks like some monitoring data is missing):

Did you have a log with -vv? If so, post a link here (or, if you prefer, email me at nick@craig-wood.com with a link; put a link to this forum page in the email so I have some context).

Have you seen the link in my message? That is an excerpt of what was logged with -vv. I can send you the whole 13MB log file if you need it.

Sorry, I did miss your link!

Everything appears to be proceeding quite well in the log you sent. There are some errors, but rclone retries and recovers the upload.

One thing to note is that memory usage is proportional to the number of simultaneous transfers.

From the b2 docs

Backblaze recommends that you do lots of transfers simultaneously for maximum speed. In tests from my SSD equipped laptop the optimum setting is about --transfers 32 though higher numbers may be used for a slight speed improvement. The optimum number for you may vary depending on your hardware, how big the files are, how much you want to load your computer, etc. The default of --transfers 4 is definitely too low for Backblaze B2 though.

Note that uploading big files (bigger than 200 MiB by default) will use a 96 MiB RAM buffer by default. There can be at most --transfers of these in use at any moment, so this sets the upper limit on the memory used.

You've got --transfers 300 set, so if you've got 300 files bigger than 200MiB transferring at once then you could be using 300*96MiB = 28.1 GiB of RAM.

So I'd reduce --transfers to a level where you can afford that much RAM, remembering that each transfer will use about 0.1 GiB of RAM.
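As a rough sketch of that calculation in Go (the 96 MiB figure is the default --b2-chunk-size from the docs quoted above; the 32 GiB budget is just a hypothetical example):

package main

import "fmt"

func main() {
	const chunkSize = 96 << 20 // default --b2-chunk-size: one buffer per large-file transfer
	const budget = 32 << 30    // hypothetical RAM budget of 32 GiB
	// Peak buffer memory is roughly --transfers * chunkSize, so the largest
	// safe --transfers for this budget is:
	fmt.Println("max --transfers:", budget/chunkSize) // 341
}

Note this assumes every file uses the default chunk size.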

If that doesn't help then could I see a full log with -vv please?

Hello,

This server has 64GB of mostly unused memory, and I do want the transfers to go as fast as possible. Ideally, I'd like to use the whole 1Gbps of uplink we have.

Anyway, the problem is happening right now. It took some time for RDP to respond, but I've been able to log in to the server.
Here's the memory profile:

PS C:\Users\Administrator> go tool pprof -text http://localhost:5572/debug/pprof/heap
Fetching profile over HTTP from http://localhost:5572/debug/pprof/heap
Saved profile in C:\Users\Administrator\pprof\pprof.rclone.exe.alloc_objects.alloc_space.inuse_objects.inuse_space.003.pb.gz
File: rclone.exe
Build ID: C:\veeam-scripts\rclone.exe2023-03-16 08:59:02 +0100 CET
Type: inuse_space
Time: Jul 5, 2023 at 2:04pm (CEST)
Showing nodes accounting for 158.22GB, 100% of 158.24GB total
Dropped 133 nodes (cum <= 0.79GB)
      flat  flat%   sum%        cum   cum%
  158.22GB   100%   100%   158.22GB   100%  github.com/rclone/rclone/lib/pool.New.func1
         0     0%   100%   158.20GB   100%  github.com/rclone/rclone/backend/b2.(*largeUpload).Upload.func2
         0     0%   100%   158.20GB   100%  github.com/rclone/rclone/backend/b2.(*largeUpload).Upload.func4
         0     0%   100%   158.22GB   100%  github.com/rclone/rclone/lib/pool.(*Pool).Get
         0     0%   100%   158.20GB   100%  golang.org/x/sync/errgroup.(*Group).Go.func1

And the go routines:
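(This is the standard net/http/pprof text dump, available from the same rc server - e.g. http://localhost:5572/debug/pprof/goroutine?debug=1.)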

goroutine profile: total 516
313 @ 0x43ecb6 0x40727d 0x406d78 0x7a93ab 0x7a9745 0x7a9870 0xae639d 0xae9365 0x7c3984 0x471361
#	0x7a93aa	github.com/rclone/rclone/lib/pacer.(*Pacer).beginCall+0x2a		github.com/rclone/rclone/lib/pacer/pacer.go:159
#	0x7a9744	github.com/rclone/rclone/lib/pacer.(*Pacer).call+0x64			github.com/rclone/rclone/lib/pacer/pacer.go:196
#	0x7a986f	github.com/rclone/rclone/lib/pacer.(*Pacer).Call+0x8f			github.com/rclone/rclone/lib/pacer/pacer.go:216
#	0xae639c	github.com/rclone/rclone/backend/b2.(*largeUpload).transferChunk+0x13c	github.com/rclone/rclone/backend/b2/upload.go:208
#	0xae9364	github.com/rclone/rclone/backend/b2.(*largeUpload).Upload.func4.1+0xc4	github.com/rclone/rclone/backend/b2/upload.go:496
#	0x7c3983	golang.org/x/sync/errgroup.(*Group).Go.func1+0x63			golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75

182 @ 0x43ecb6 0x44fa7b 0x44fa52 0x46c026 0x47e725 0xae5c58 0xae5c30 0xae6618 0x7baa18 0x7a9765 0x7a9870 0xae639d 0xae9365 0x7c3984 0x471361
#	0x46c025	sync.runtime_SemacquireMutex+0x25						runtime/sema.go:77
#	0x47e724	sync.(*Mutex).lockSlow+0x164							sync/mutex.go:171
#	0xae5c57	sync.(*Mutex).Lock+0x97								sync/mutex.go:90
#	0xae5c2f	github.com/rclone/rclone/backend/b2.(*largeUpload).getUploadURL+0x6f		github.com/rclone/rclone/backend/b2/upload.go:173
#	0xae6617	github.com/rclone/rclone/backend/b2.(*largeUpload).transferChunk.func1+0x117	github.com/rclone/rclone/backend/b2/upload.go:212
#	0x7baa17	github.com/rclone/rclone/fs.pacerInvoker+0x37					github.com/rclone/rclone/fs/pacer.go:88
#	0x7a9764	github.com/rclone/rclone/lib/pacer.(*Pacer).call+0x84				github.com/rclone/rclone/lib/pacer/pacer.go:197
#	0x7a986f	github.com/rclone/rclone/lib/pacer.(*Pacer).Call+0x8f				github.com/rclone/rclone/lib/pacer/pacer.go:216
#	0xae639c	github.com/rclone/rclone/backend/b2.(*largeUpload).transferChunk+0x13c		github.com/rclone/rclone/backend/b2/upload.go:208
#	0xae9364	github.com/rclone/rclone/backend/b2.(*largeUpload).Upload.func4.1+0xc4		github.com/rclone/rclone/backend/b2/upload.go:496
#	0x7c3983	golang.org/x/sync/errgroup.(*Group).Go.func1+0x63				golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75

4 @ 0x43ecb6 0x44fa7b 0x44fa52 0x46c026 0x47e725 0xae6157 0xae6137 0xae6ca8 0x7baa18 0x7a9765 0x7a9870 0xae639d 0xae9365 0x7c3984 0x471361
#	0x46c025	sync.runtime_SemacquireMutex+0x25						runtime/sema.go:77
#	0x47e724	sync.(*Mutex).lockSlow+0x164							sync/mutex.go:171
#	0xae6156	sync.(*Mutex).Lock+0x56								sync/mutex.go:90
#	0xae6136	github.com/rclone/rclone/backend/b2.(*largeUpload).returnUploadURL+0x36		github.com/rclone/rclone/backend/b2/upload.go:201
#	0xae6ca7	github.com/rclone/rclone/backend/b2.(*largeUpload).transferChunk.func1+0x7a7	github.com/rclone/rclone/backend/b2/upload.go:266
#	0x7baa17	github.com/rclone/rclone/fs.pacerInvoker+0x37					github.com/rclone/rclone/fs/pacer.go:88
#	0x7a9764	github.com/rclone/rclone/lib/pacer.(*Pacer).call+0x84				github.com/rclone/rclone/lib/pacer/pacer.go:197
#	0x7a986f	github.com/rclone/rclone/lib/pacer.(*Pacer).Call+0x8f				github.com/rclone/rclone/lib/pacer/pacer.go:216
#	0xae639c	github.com/rclone/rclone/backend/b2.(*largeUpload).transferChunk+0x13c		github.com/rclone/rclone/backend/b2/upload.go:208
#	0xae9364	github.com/rclone/rclone/backend/b2.(*largeUpload).Upload.func4.1+0xc4		github.com/rclone/rclone/backend/b2/upload.go:496
#	0x7c3983	golang.org/x/sync/errgroup.(*Group).Go.func1+0x63				golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75

1 @ 0x40d09c 0x46c6ef 0x7dff59 0x471361
#	0x46c6ee	os/signal.signal_recv+0x2e	runtime/sigqueue.go:152
#	0x7dff58	os/signal.loop+0x18		os/signal/signal_unix.go:23

1 @ 0x4318f6 0x469f05 0x80ba75 0x80b88d 0x80862b 0x817845 0x818393 0x7780cf 0x779f29 0x87dcb6 0x7780cf 0x1713a3a 0x7780cf 0x1713a3a 0x7780cf 0x87b9f0 0x77b6d6 0x776bf2 0x471361
#	0x469f04	runtime/pprof.runtime_goroutineProfileWithLabels+0x24		runtime/mprof.go:844
#	0x80ba74	runtime/pprof.writeRuntimeProfile+0xb4				runtime/pprof/pprof.go:734
#	0x80b88c	runtime/pprof.writeGoroutine+0x4c				runtime/pprof/pprof.go:694
#	0x80862a	runtime/pprof.(*Profile).WriteTo+0x14a				runtime/pprof/pprof.go:329
#	0x817844	net/http/pprof.handler.ServeHTTP+0x4a4				net/http/pprof/pprof.go:259
#	0x818392	net/http/pprof.Index+0xf2					net/http/pprof/pprof.go:376
#	0x7780ce	net/http.HandlerFunc.ServeHTTP+0x2e				net/http/server.go:2122
#	0x779f28	net/http.(*ServeMux).ServeHTTP+0x148				net/http/server.go:2500
#	0x87dcb5	github.com/go-chi/chi/v5.(*Mux).routeHTTP+0x215			github.com/go-chi/chi/v5@v5.0.8/mux.go:444
#	0x7780ce	net/http.HandlerFunc.ServeHTTP+0x2e				net/http/server.go:2122
#	0x1713a39	github.com/go-chi/chi/v5/middleware.SetHeader.func1.1+0x139	github.com/go-chi/chi/v5@v5.0.8/middleware/content_type.go:13
#	0x7780ce	net/http.HandlerFunc.ServeHTTP+0x2e				net/http/server.go:2122
#	0x1713a39	github.com/go-chi/chi/v5/middleware.SetHeader.func1.1+0x139	github.com/go-chi/chi/v5@v5.0.8/middleware/content_type.go:13
#	0x7780ce	net/http.HandlerFunc.ServeHTTP+0x2e				net/http/server.go:2122
#	0x87b9ef	github.com/go-chi/chi/v5.(*Mux).ServeHTTP+0x30f			github.com/go-chi/chi/v5@v5.0.8/mux.go:90
#	0x77b6d5	net/http.serverHandler.ServeHTTP+0x315				net/http/server.go:2936
#	0x776bf1	net/http.(*conn).serve+0x611					net/http/server.go:1995

1 @ 0x43ecb6 0x40727d 0x406d78 0x7a93ab 0x7a9745 0x7a9870 0xae5e90 0xae6618 0x7baa18 0x7a9765 0x7a9870 0xae639d 0xae9365 0x7c3984 0x471361
#	0x7a93aa	github.com/rclone/rclone/lib/pacer.(*Pacer).beginCall+0x2a			github.com/rclone/rclone/lib/pacer/pacer.go:159
#	0x7a9744	github.com/rclone/rclone/lib/pacer.(*Pacer).call+0x64				github.com/rclone/rclone/lib/pacer/pacer.go:196
#	0x7a986f	github.com/rclone/rclone/lib/pacer.(*Pacer).Call+0x8f				github.com/rclone/rclone/lib/pacer/pacer.go:216
#	0xae5e8f	github.com/rclone/rclone/backend/b2.(*largeUpload).getUploadURL+0x2cf		github.com/rclone/rclone/backend/b2/upload.go:183
#	0xae6617	github.com/rclone/rclone/backend/b2.(*largeUpload).transferChunk.func1+0x117	github.com/rclone/rclone/backend/b2/upload.go:212
#	0x7baa17	github.com/rclone/rclone/fs.pacerInvoker+0x37					github.com/rclone/rclone/fs/pacer.go:88
#	0x7a9764	github.com/rclone/rclone/lib/pacer.(*Pacer).call+0x84				github.com/rclone/rclone/lib/pacer/pacer.go:197
#	0x7a986f	github.com/rclone/rclone/lib/pacer.(*Pacer).Call+0x8f				github.com/rclone/rclone/lib/pacer/pacer.go:216
#	0xae639c	github.com/rclone/rclone/backend/b2.(*largeUpload).transferChunk+0x13c		github.com/rclone/rclone/backend/b2/upload.go:208
#	0xae9364	github.com/rclone/rclone/backend/b2.(*largeUpload).Upload.func4.1+0xc4		github.com/rclone/rclone/backend/b2/upload.go:496
#	0x7c3983	golang.org/x/sync/errgroup.(*Group).Go.func1+0x63				golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75

1 @ 0x43ecb6 0x40727d 0x406d78 0x7e064f 0x471361
#	0x7e064e	github.com/rclone/rclone/lib/atexit.Register.func1.1+0x2e	github.com/rclone/rclone/lib/atexit/atexit.go:45

1 @ 0x43ecb6 0x40727d 0x406d78 0xae951a 0xae9501 0xae94fe 0xae9078 0x7c3984 0x471361
#	0xae9519	github.com/rclone/rclone/lib/pacer.(*TokenDispenser).Get+0x39		github.com/rclone/rclone/lib/pacer/tokens.go:24
#	0xae9500	github.com/rclone/rclone/backend/b2.(*Fs).getBuf+0x20			github.com/rclone/rclone/backend/b2/b2.go:602
#	0xae94fd	github.com/rclone/rclone/backend/b2.(*largeUpload).Upload.func2+0x1d	github.com/rclone/rclone/backend/b2/upload.go:452
#	0xae9077	github.com/rclone/rclone/backend/b2.(*largeUpload).Upload.func4+0xf7	github.com/rclone/rclone/backend/b2/upload.go:468
#	0x7c3983	golang.org/x/sync/errgroup.(*Group).Go.func1+0x63			golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75

1 @ 0x43ecb6 0x40727d 0x406d78 0xb0e4e5 0x471361
#	0xb0e4e4	github.com/rclone/rclone/fs/march.(*March).Run.func2+0x44	github.com/rclone/rclone/fs/march/march.go:208

1 @ 0x43ecb6 0x435257 0x46a389 0x500092 0x501717 0x502418 0x56cb49 0x584225 0x77093f 0x471361
#	0x46a388	internal/poll.runtime_pollWait+0x88		runtime/netpoll.go:306
#	0x500091	internal/poll.(*pollDesc).wait+0x31		internal/poll/fd_poll_runtime.go:84
#	0x501716	internal/poll.execIO+0xf6			internal/poll/fd_windows.go:175
#	0x502417	internal/poll.(*FD).Read+0x2b7			internal/poll/fd_windows.go:436
#	0x56cb48	net.(*netFD).Read+0x28				net/fd_posix.go:55
#	0x584224	net.(*conn).Read+0x44				net/net.go:183
#	0x77093e	net/http.(*connReader).backgroundRead+0x3e	net/http/server.go:674

1 @ 0x43ecb6 0x435257 0x46a389 0x500092 0x501717 0x502418 0x56cb49 0x584225 0x770db1 0x59b13f 0x59bd2f 0x59bf87 0x6ffb19 0x76c15a 0x76c15b 0x77230a 0x77692b 0x471361
#	0x46a388	internal/poll.runtime_pollWait+0x88		runtime/netpoll.go:306
#	0x500091	internal/poll.(*pollDesc).wait+0x31		internal/poll/fd_poll_runtime.go:84
#	0x501716	internal/poll.execIO+0xf6			internal/poll/fd_windows.go:175
#	0x502417	internal/poll.(*FD).Read+0x2b7			internal/poll/fd_windows.go:436
#	0x56cb48	net.(*netFD).Read+0x28				net/fd_posix.go:55
#	0x584224	net.(*conn).Read+0x44				net/net.go:183
#	0x770db0	net/http.(*connReader).Read+0x170		net/http/server.go:782
#	0x59b13e	bufio.(*Reader).fill+0xfe			bufio/bufio.go:106
#	0x59bd2e	bufio.(*Reader).ReadSlice+0x2e			bufio/bufio.go:372
#	0x59bf86	bufio.(*Reader).ReadLine+0x26			bufio/bufio.go:401
#	0x6ffb18	net/textproto.(*Reader).readLineSlice+0x98	net/textproto/reader.go:56
#	0x76c159	net/textproto.(*Reader).ReadLine+0xb9		net/textproto/reader.go:39
#	0x76c15a	net/http.readRequest+0xba			net/http/request.go:1042
#	0x772309	net/http.(*conn).readRequest+0x249		net/http/server.go:990
#	0x77692a	net/http.(*conn).serve+0x34a			net/http/server.go:1920

1 @ 0x43ecb6 0x435257 0x46a389 0x500092 0x501717 0x505f6d 0x5062d6 0x56e9a5 0x58dd85 0x58ca1d 0x77bc45 0x883da8 0x471361
#	0x46a388	internal/poll.runtime_pollWait+0x88			runtime/netpoll.go:306
#	0x500091	internal/poll.(*pollDesc).wait+0x31			internal/poll/fd_poll_runtime.go:84
#	0x501716	internal/poll.execIO+0xf6				internal/poll/fd_windows.go:175
#	0x505f6c	internal/poll.(*FD).acceptOne+0x6c			internal/poll/fd_windows.go:936
#	0x5062d5	internal/poll.(*FD).Accept+0x1d5			internal/poll/fd_windows.go:970
#	0x56e9a4	net.(*netFD).accept+0x64				net/fd_windows.go:139
#	0x58dd84	net.(*TCPListener).accept+0x24				net/tcpsock_posix.go:148
#	0x58ca1c	net.(*TCPListener).Accept+0x3c				net/tcpsock.go:297
#	0x77bc44	net/http.(*Server).Serve+0x384				net/http/server.go:3059
#	0x883da7	github.com/rclone/rclone/lib/http.instance.serve+0xa7	github.com/rclone/rclone/lib/http/server.go:132

1 @ 0x43ecb6 0x44e97e 0x1777f8b 0x471361
#	0x1777f8a	github.com/rclone/rclone/cmd.StartStats.func2+0xea	github.com/rclone/rclone/cmd/cmd.go:372

1 @ 0x43ecb6 0x44e97e 0x8ae4e9 0x471361
#	0x8ae4e8	github.com/rclone/rclone/fs/asyncreader.(*AsyncReader).init.func1+0xe8	github.com/rclone/rclone/fs/asyncreader/asyncreader.go:88

1 @ 0x43ecb6 0x44e97e 0x97c245 0x471361
#	0x97c244	github.com/rclone/rclone/fs/accounting.(*Account).averageLoop+0x104	github.com/rclone/rclone/fs/accounting/accounting.go:212

1 @ 0x43ecb6 0x44e97e 0x981ba5 0x471361
#	0x981ba4	github.com/rclone/rclone/fs/accounting.(*StatsInfo).averageLoop+0x164	github.com/rclone/rclone/fs/accounting/stats.go:328

1 @ 0x43ecb6 0x44e97e 0xd31acd 0x471361
#	0xd31acc	go.opencensus.io/stats/view.(*worker).start+0xac	go.opencensus.io@v0.24.0/stats/view/worker.go:292

1 @ 0x43ecb6 0x44fa7b 0x44fa52 0x46bf07 0x48022b 0x1785ca9 0x1788d3f 0x178a12a 0x178a218 0x195758c 0x1776edb 0x19574e5 0x174dd47 0x174e5dd 0x1778b96 0x1778b8b 0x19659b7 0x43e897 0x471361
#	0x46bf06	sync.runtime_Semacquire+0x26						runtime/sema.go:62
#	0x48022a	sync.(*WaitGroup).Wait+0x4a						sync/waitgroup.go:116
#	0x1785ca8	github.com/rclone/rclone/fs/sync.(*syncCopyMove).stopTransfers+0x68	github.com/rclone/rclone/fs/sync/sync.go:472
#	0x1788d3e	github.com/rclone/rclone/fs/sync.(*syncCopyMove).run+0x33e		github.com/rclone/rclone/fs/sync/sync.go:909
#	0x178a129	github.com/rclone/rclone/fs/sync.runSyncCopyMove+0x229			github.com/rclone/rclone/fs/sync/sync.go:1130
#	0x178a217	github.com/rclone/rclone/fs/sync.Sync+0x77				github.com/rclone/rclone/fs/sync/sync.go:1136
#	0x195758b	github.com/rclone/rclone/cmd/sync.glob..func1.1+0x4b			github.com/rclone/rclone/cmd/sync/sync.go:68
#	0x1776eda	github.com/rclone/rclone/cmd.Run+0x13a					github.com/rclone/rclone/cmd/cmd.go:257
#	0x19574e4	github.com/rclone/rclone/cmd/sync.glob..func1+0xc4			github.com/rclone/rclone/cmd/sync/sync.go:66
#	0x174dd46	github.com/spf13/cobra.(*Command).execute+0x846				github.com/spf13/cobra@v1.6.1/command.go:920
#	0x174e5dc	github.com/spf13/cobra.(*Command).ExecuteC+0x3bc			github.com/spf13/cobra@v1.6.1/command.go:1044
#	0x1778b95	github.com/spf13/cobra.(*Command).Execute+0x75				github.com/spf13/cobra@v1.6.1/command.go:968
#	0x1778b8a	github.com/rclone/rclone/cmd.Main+0x6a					github.com/rclone/rclone/cmd/cmd.go:567
#	0x19659b6	main.main+0x16								github.com/rclone/rclone/rclone.go:14
#	0x43e896	runtime.main+0x1f6							runtime/proc.go:250

1 @ 0x43ecb6 0x44fa7b 0x44fa52 0x46bf07 0x48022b 0x7c37e7 0xae8e25 0xae39cf 0xadd645 0xb23f09 0x178566a 0x471361
#	0x46bf06	sync.runtime_Semacquire+0x26						runtime/sema.go:62
#	0x48022a	sync.(*WaitGroup).Wait+0x4a						sync/waitgroup.go:116
#	0x7c37e6	golang.org/x/sync/errgroup.(*Group).Wait+0x26				golang.org/x/sync@v0.1.0/errgroup/errgroup.go:53
#	0xae8e24	github.com/rclone/rclone/backend/b2.(*largeUpload).Upload+0x504		github.com/rclone/rclone/backend/b2/upload.go:506
#	0xae39ce	github.com/rclone/rclone/backend/b2.(*Object).Update+0x28e		github.com/rclone/rclone/backend/b2/b2.go:1896
#	0xadd644	github.com/rclone/rclone/backend/b2.(*Fs).Put+0x124			github.com/rclone/rclone/backend/b2/b2.go:1037
#	0xb23f08	github.com/rclone/rclone/fs/operations.Copy+0x1c68			github.com/rclone/rclone/fs/operations/operations.go:440
#	0x1785669	github.com/rclone/rclone/fs/sync.(*syncCopyMove).pairCopyOrMove+0x209	github.com/rclone/rclone/fs/sync/sync.go:437

1 @ 0x43ecb6 0x46de7a 0x7a9545 0x471361
#	0x46de79	time.Sleep+0x139							runtime/time.go:195
#	0x7a9544	github.com/rclone/rclone/lib/pacer.(*Pacer).beginCall.func1+0x24	github.com/rclone/rclone/lib/pacer/pacer.go:167

The rclone process is currently using about 500 MB of RAM, but has committed 173 GB of memory, most of which was provided by swap:

I can provide the full logfile if needed, but the memory problem started after Backblaze started answering with 500 and 503 server errors.

What do you think I should do next? I could do more debugging while I'm in that state, or kill rclone and try to sync again.

That trace is very clear - it is the b2 large file upload which is causing the problem.

The goroutines look fine - there are about 300 of them uploading data, and each one should own a 96 MiB chunk of memory, unless the files are more than 10,000 parts long. 10,000 parts of 96 MiB is roughly 931 GiB. So are any of the files bigger than 931 GiB?

If the file is bigger than 931GiB then rclone needs to make new memory chunks, so potentially it can be using much more memory.

You said above that you've got one file of 3 TB. This will need chunks of about 286 MiB, and 300 of those would be about 85 GB.
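As a sketch of where those numbers come from, assuming rclone simply divides the file size by the 10,000-part limit and rounds up (rclone's exact rounding may differ a little):

package main

import "fmt"

func main() {
	const maxParts = 10000               // B2's limit on parts per file
	fileSize := int64(3_000_000_000_000) // a 3 TB file
	// Ceiling division: the smallest chunk size that fits the file in 10,000 parts.
	chunkSize := (fileSize + maxParts - 1) / maxParts
	fmt.Println(chunkSize>>20, "MiB per chunk")                  // 286
	fmt.Println((300*chunkSize)>>20, "MiB across 300 transfers") // 85830, the ~85 GB above
}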

Do you know how many files you've got in the transfer that are bigger than 931 GiB? Potentially each one will be using a lot of memory.

You can use rclone to find this out:

rclone lsl --min-size 931G serveurbackup:bkp

If this is the problem then I need to rethink exactly how we are doing this in the b2 backend.

Interesting. I wonder if there is a memory leak on some error state. I checked and it doesn't look like it so I guess the problem above is responsible.

One thing you could try is using --b2-memory-pool-use-mmap --b2-memory-pool-flush-time 1s, which should help return memory to the OS more snappily. This might be enough to fix the problem for you without me re-working the b2 backend memory management.
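Applied to your original command line, that would look something like this (same remotes and filters as before):

rclone.exe sync --stats-log-level NOTICE --stats 1h --transfers 300 --exclude *.temp --b2-memory-pool-use-mmap --b2-memory-pool-flush-time 1s serveurbackup:bkp Backblaze:azimut-backups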

If you have one, a log file with -vv would be useful. I don't think you need to keep that process around any more though.

Not certain I'm following you correctly on the chunks part. Are we talking memory chunks or upload chunks? Is this documented somewhere?
I only have that one big file, and lots of smaller ones. The big file was the only one still uploading when the error occurred.

Couldn't it be that the memory isn't freed when an error occurs? If the rclone logs are to be trusted, the Backblaze incident lasted for about 25 minutes (22:35-23:00).
The RAM usage starts going up around that time:

Another thing that can be noted from the logs: after Backblaze comes back to life, rclone resumes syncing and loops over data chunks, although no data seems to go out. A new message is logged exactly every 5 minutes:

2023/07/05 14:52:22 DEBUG : Veeam/VM-WS2016B/VM-WS2016BD2023-06-22T020709_67F4.vbk: Sending chunk 1188 length 339738624
2023/07/05 14:57:22 DEBUG : Veeam/VM-WS2016B/VM-WS2016BD2023-06-22T020709_67F4.vbk: Sending chunk 1189 length 339738624
2023/07/05 15:02:22 DEBUG : Veeam/VM-WS2016B/VM-WS2016BD2023-06-22T020709_67F4.vbk: Sending chunk 1190 length 339738624
2023/07/05 15:07:22 DEBUG : Veeam/VM-WS2016B/VM-WS2016BD2023-06-22T020709_67F4.vbk: Sending chunk 1191 length 339738624
2023/07/05 15:12:22 DEBUG : Veeam/VM-WS2016B/VM-WS2016BD2023-06-22T020709_67F4.vbk: Sending chunk 1192 length 339738624
2023/07/05 15:17:22 DEBUG : Veeam/VM-WS2016B/VM-WS2016BD2023-06-22T020709_67F4.vbk: Sending chunk 1193 length 339738624
2023/07/05 15:22:22 DEBUG : Veeam/VM-WS2016B/VM-WS2016BD2023-06-22T020709_67F4.vbk: Sending chunk 1194 length 339738624
2023/07/05 15:27:22 DEBUG : Veeam/VM-WS2016B/VM-WS2016BD2023-06-22T020709_67F4.vbk: Sending chunk 1195 length 339738624
2023/07/05 15:32:22 DEBUG : Veeam/VM-WS2016B/VM-WS2016BD2023-06-22T020709_67F4.vbk: Sending chunk 1196 length 339738624
2023/07/05 15:37:22 DEBUG : Veeam/VM-WS2016B/VM-WS2016BD2023-06-22T020709_67F4.vbk: Sending chunk 1197 length 339738624
2023/07/05 15:42:22 DEBUG : Veeam/VM-WS2016B/VM-WS2016BD2023-06-22T020709_67F4.vbk: Sending chunk 1198 length 339738624
2023/07/05 15:47:22 DEBUG : Veeam/VM-WS2016B/VM-WS2016BD2023-06-22T020709_67F4.vbk: Sending chunk 1199 length 339738624

rclone continues with the next chunk every five minutes, although the last chunk that was successfully uploaded was chunk 975.

I'm sending you an email with the -vv logs.
In the meantime I'm changing my script to include the suggested switches to see if it helps.

Thank you for the logs.

I ran this on the logs

$ grep -P "Sending.*chunk" rclone-2023-07-04.stderr.txt | awk '{print $NF}' | sort | uniq -c | sort -n
      1 18805161
      1 19423764
      1 44371968
      1 59924480
      1 64273038
      1 64273072
      1 64273123
      1 96104448
      2 22069853
      2 64273049
      3 17657837
      4 22290432
   2723 100663296
   6537 339738624

We can ignore any chunk sizes < 100,663,296 as these will still use a buffer of size 100,663,296.

What this means is that for chunk size 100,663,296 there will be 300 (your value of --transfers) buffers of that size in memory.

However there will also be 300 buffers of size 339,738,624 (from transferring your very big file).

This is a total of 132,120,576,000 bytes, which is 123 GiB - and that explains why rclone is using so much memory.
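Spelled out as a quick sketch (buffer sizes taken straight from the histogram above):

package main

import "fmt"

func main() {
	const transfers = 300        // the --transfers setting
	const smallBuf = 100_663_296 // default 96 MiB chunk buffers
	const bigBuf = 339_738_624   // chunk buffers sized for the very big file
	total := int64(transfers) * (smallBuf + bigBuf)
	fmt.Printf("%d bytes = %.0f GiB\n", total, float64(total)/(1<<30)) // 132120576000 bytes = 123 GiB
}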

Rclone is actually using a bit more than 123 GiB - it is using 158 GB. The difference is likely buffers that the garbage collector hasn't recycled yet.

I don't think there is a problem with memory leaks.

The mmap flags I posted above should help with this.

I think the solution is to cut down --transfers. So if you want rclone to use < 64 GB of RAM then you need --transfers 150, maybe a bit less.

Or set --b2-chunk-size 350M and --transfers 180.
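For example, taking the second option and applying it to your original command (values as suggested above):

rclone.exe sync --stats-log-level NOTICE --stats 1h --transfers 180 --b2-chunk-size 350M --exclude *.temp serveurbackup:bkp Backblaze:azimut-backups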

At minimum we need to document this behaviour in the b2 backend docs. So after this text:

Backblaze recommends that you do lots of transfers simultaneously for maximum speed. In tests from my SSD equipped laptop the optimum setting is about --transfers 32 though higher numbers may be used for a slight speed improvement. The optimum number for you may vary depending on your hardware, how big the files are, how much you want to load your computer, etc. The default of --transfers 4 is definitely too low for Backblaze B2 though.

Note that uploading big files (bigger than 200 MiB by default) will use a 96 MiB RAM buffer by default. There can be at most --transfers of these in use at any moment, so this sets the upper limit on the memory used.

Note also that if you upload files bigger than 931 GB (by default) then rclone will need to increase the --b2-chunk-size to above 96 MB in order to fit within the 10,000 chunks per file limit. This potentially increases the memory use by file_size/10,000 * --transfers for each file above 931 GB. To make sure rclone doesn't use too much memory, set --b2-chunk-size to bigger than the size of the largest file / 10,000. For example, if the largest file you are transferring is 3 TB then set --b2-chunk-size 320M, as 3 TB = 3,000,000,000,000 bytes and 3,000,000,000,000 / 10,000 = 300,000,000 = 300 MB. Rclone will then use no more than 320M * --transfers of RAM.

What do you think of that? Does it make sense?
