Rclone memory consumption increasing linearly

I am trying to copy a sample set of files from a local EXT4 filesystem to GCS. I have observed that the total memory consumed by rclone increases linearly as the transfer progresses. I made some changes to the code to disable checks and API calls.

I am using rclone v1.49.3 on 64-bit Ubuntu 18.04 as the source machine.

Setup and changes made:

  • The test data contains around 16,000 small files (10-50 bytes each) distributed evenly across subdirectories nested 5 levels deep.
  • The bucket was created before the transfer is started.
  • The config only contains the service account credentials for GCP.
  • Disabled the metadata fetch call made for --no-traverse (1 API call for each file scanned) by commenting out backend/googlecloudstorage/googlecloudstorage.go L465-472.
  • Disabled pairChecker in fs/sync/sync.go, so whatever file comes into the pipeline is sent on without any checks.
  • Added a memory logger function which logs heap and sys memory usage after every 50 file transfers.

Profiler Function:

// currentMemory and peakMemory are package-level variables holding MiB values.
func CallProfiler(numFiles uint64) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	currentMemory = m.Alloc / 1024 / 1024        // live heap allocations, in MiB
	currentHeapMemory := m.HeapSys / 1024 / 1024 // heap memory obtained from the OS, in MiB
	if currentMemory > peakMemory {
		peakMemory = currentMemory
	}
	fs.Infof(nil, "Profiling after %v files (Peak Usage - %v) (Current Usage - %v) (GoRoutine Count - %v) (HeapSys Usage - %v)", numFiles, peakMemory, currentMemory, runtime.NumGoroutine(), currentHeapMemory)
}

Command:

rclone.exe copy /home/user/testdata gcp:bucket/10000000001 --files-from files.txt --ignore-checksum --no-traverse --config rclone.config -vP

Observation:

Total memory usage increases linearly as the number of files transferred increases. The heap allocation varies in a range of 30-70 MB, but there are spikes where heap usage goes up to 137 MB. Please see the attached log.
See Logs Here
See Memory Profiles Here

I also tried reducing the garbage collection threshold to 20 (export GOGC=20), but there was still no improvement in memory usage.
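
For reference, the same tuning can also be done from inside the process. A minimal sketch using only the Go standard library (not rclone code), equivalent to export GOGC=20:

// Sketch: programmatic equivalent of `export GOGC=20`, useful when
// experimenting from a Go test harness rather than the shell.
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	old := debug.SetGCPercent(20) // same effect as GOGC=20
	fmt.Printf("GC target percentage changed from %d to 20\n", old)

	// Force a collection and return freed pages to the OS so that MemStats
	// reflects live allocations rather than lazily retained heap.
	debug.FreeOSMemory()

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapAlloc=%d MiB HeapSys=%d MiB\n", m.HeapAlloc/1024/1024, m.HeapSys/1024/1024)
}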

What could be the reason for this memory increase?

Hmm, I thought there were some leak fixes after that version.

It may be worthwhile to try it on the latest and see if the pattern is the same.

There definitely was a memory leak fix recently. I don't know the details of it, though. It should be noted in the changelog, but yes, re-testing on the latest version would be advisable.

To try to see what is going on, I plotted the info. It looks plausible that there might be a leak.

Looking at the profiles:

This is the biggest item, but it goes up and down in the profiles:

[profile screenshot]

The only thing I can see consistently going up is this:

[profile screenshot]

If you want to see whether it is that, can you apply this patch and re-run your tests?

diff --git a/backend/local/fadvise_unix.go b/backend/local/fadvise_unix.go
index fce828971..2b394835c 100644
--- a/backend/local/fadvise_unix.go
+++ b/backend/local/fadvise_unix.go
@@ -136,6 +136,8 @@ type fadviseReadCloser struct {
 // make sequential reads faster.
 // See also fadvise.
 func newFadviseReadCloser(o *Object, f *os.File, offset, limit int64) io.ReadCloser {
+	return f // FIXME bypass
+
 	r := fadviseReadCloser{
 		fadvise: newFadvise(o, int(f.Fd()), offset),
 		inner:   f,

I'm not 100% sure this is a leak - it might just be the Go allocator being lazy...
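
One way to tell the two apart, using only runtime.MemStats fields (a sketch, not rclone code): if HeapAlloc falls back down after a GC while HeapSys stays flat, the memory is being retained by the allocator rather than leaked.

// Sketch: distinguishing live heap (HeapAlloc/HeapInuse) from memory the
// runtime has reserved from the OS (HeapSys) or already returned (HeapReleased).
package main

import (
	"fmt"
	"runtime"
)

const mib = 1024 * 1024

func reportHeap(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%s: HeapAlloc=%d HeapInuse=%d HeapIdle=%d HeapReleased=%d HeapSys=%d (MiB)\n",
		label, m.HeapAlloc/mib, m.HeapInuse/mib, m.HeapIdle/mib, m.HeapReleased/mib, m.HeapSys/mib)
}

func main() {
	reportHeap("before")

	buf := make([]byte, 256*mib) // allocate ~256 MiB of heap
	buf[0] = 1                   // touch it so it is not optimised away
	reportHeap("allocated")

	buf = nil // drop the only reference
	_ = buf
	runtime.GC()
	reportHeap("after GC") // HeapAlloc drops; HeapSys typically stays high
}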

There was a file handle leak on retries in v1.49.3, so it would be worth trying v1.49.5, as leaking file handles could explain the problem...

@ncw I tested the following combinations again:

  • v1.49.5
  • patch (mentioned above)
  • v1.49.5 + patch

I am still seeing a linear increase in memory usage in all the cases above.
I also tried splitting the source into batches of 1000 files each (using temp files with the --files-from flag). At any given time only one batch is running (batches are transferred sequentially). The memory should be released after every batch, but usage is still increasing. (Code is given below.)

The next batch is started only when the current batch has finished all its transfers.
Profiling after 3/4/5 files ... is printed when a new batch comes in for processing. Even at this stage, the memory usage increases with the number of files transferred.

// createBatch reads up to batchSize matching records (second CSV column ==
// "False") from the reader and returns their paths relative to sourcePath.
// batchCounter is a package-level variable.
func createBatch(csvReader *csv.Reader, batchSize int, sourcePath string) []string {
	var batchFileCounter int
	var batch []string
	for {
		record, err := csvReader.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			// Stop the batch on a read error rather than dereferencing a nil record.
			fs.Errorf(nil, "failed to read record from ffl.csv: %v", err)
			break
		}
		if record[1] == "False" {
			// Returns the relative path of a file compared to the source directory
			relPath := utils.GetRelativePath(record[0], sourcePath)
			batch = append(batch, relPath)
			batchFileCounter++
		}
		if batchFileCounter == batchSize {
			batchFileCounter = 0
			batchCounter++
			break
		}
	}
	return batch
}

// createTempFFL writes the batch of relative paths to a temporary
// files-from list, one path per line.
func createTempFFL(batch []string, sourcePath string) error {
	file, err := os.Create("temp_ffl.txt")
	if err != nil {
		return err
	}
	defer file.Close()
	for _, path := range batch {
		if _, err := file.WriteString(path + "\n"); err != nil {
			return err
		}
	}
	return nil
}

cmd.Run(false, true, command, func() error {
    err := setTempFFLLocation()
    operations.CallProfiler(1)
    if err != nil {
        return err
    }
    args = append(args, sourcePath)
    reportFile, err := os.Open("ffl.csv")
    operations.CallProfiler(2)
    if err != nil {
        return err
    }
    defer reportFile.Close()
    csvReader := csv.NewReader(bufio.NewReader(reportFile))
    for {
        operations.CallProfiler(3)
        batch := createBatch(csvReader, 1000, sourcePath)
        operations.CallProfiler(4)
        if len(batch) == 0 {
            break
        }
        err = createTempFFL(batch, sourcePath)
        if err != nil {
            return err
        }

        // Free the memory held by the batch slice so it can be collected
        batch = nil

        operations.CallProfiler(5)
        err = rcloneutils.SetFileFromFlag(tempFFLLocation)
        if err != nil {
            return err
        }
        fs.Config.NoTraverse = true
        fs.Config.IgnoreChecksum = true
        if srcFileName == "" {
            err = sync.CopyDir(context.Background(), fdst, fsrc, createEmptySrcDirs)
            if err != nil {
                return err
            }
        } else {
            err = operations.CopyFile(context.Background(), fdst, fsrc, srcFileName, srcFileName)
            if err != nil {
                return err
            }
        }
        fs.Infof(nil, "Processed Batch Counter - %v", batchCounter)
        runtime.GC()
    }
    return os.Remove(tempFFLLocation)
})

Link for Logs and Profiling Screenshots

Your profiles still show newFadvise using lots of memory - I don't know which of the 3 tests these represent?

Can you try with the patch again and make profiles for that?

Also it would probably be useful if you could try a lot more files and see if the memory usage keeps going up or flattens off?

Thank you for all your testing - we'll get to the bottom of it soon I'm sure!

I ran the test after applying the patch. The changes include:

  • Changes from the top description
  • fadvise patch applied
  • v1.49.5 used
  • Split the ffl (files-from list) into batches of 1000, with batches processed sequentially.

I will try with a large source and will soon post the results.

Hmm, can you double check the fadvise patch - it shouldn't show up in the profiles at all, I don't think, unless I did something wrong!

@ncw I double checked and started another copy, and the results were the same. I also ran the test on a large file base (500K+ objects).
Memory Profiles Here

For 500K objects, the total memory allocation was around 1.2 GB and the heap allocation was around 900 MB (without splitting the source into batches).

Here are the changes I applied: DisableChecks.diff
Please check.

Update: I checked for the same issue on older versions and was able to replicate it on v1.45. Testing other versions as well.

The fadvise stuff is missing now, which means the patch was effective.

The usage of everything just seems to go up...

Can you check to see if there is a file handle leak, i.e. whether the number of open files keeps going up?

Can you also try this test on an unpatched rclone just to see if it is your patch causing the problem?

I also tried that on a stock v1.49.5 build without the patch. Similar result - the memory is still increasing.

Can you check the file handles please? lsof on linux - no idea on Windows.
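
If it helps, this is roughly what I mean - a Linux-only sketch (not rclone code) that counts the process's open descriptors from the inside. From the outside, repeating lsof -p <rclone pid> | wc -l every few seconds gives the same signal.

// Sketch: count entries in /proc/self/fd, one per open file descriptor.
package main

import (
	"fmt"
	"os"
	"time"
)

func openFDCount() (int, error) {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	for {
		n, err := openFDCount()
		if err != nil {
			fmt.Fprintln(os.Stderr, "counting file descriptors failed:", err)
			return
		}
		fmt.Printf("open file descriptors: %d\n", n)
		time.Sleep(10 * time.Second)
	}
}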

I managed to replicate this problem in the latest beta. Transferring 10,000 files, the rclone memory usage grows to 250 MB.

According to my testing with 10,000 files, patching out the fadvise code reduces memory use quite a bit, but doesn't fix the problem.

You can reproduce it with a local -> local copy too, so it isn't the GCS backend.

here is local -> gcs

here is local -> local

After a lot of digging I found the leak! (In fact I found two.) The main leak was in the transfer statistics - completed objects weren't being reaped from them. The other, minor leak was the transfer pipe: its memory usage only goes down over time as the pipe is emptied.

I've patched this here

https://beta.rclone.org/branch/v1.49.5-219-gcd501ca9-fix-memory-leak-beta/ (uploaded in 15-30 mins)

which you can find in the fix-memory-leak branch
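
For anyone curious what this class of leak looks like, here is a stripped-down illustration (not the actual rclone accounting code): if every completed transfer stays reachable from the stats, heap use grows with the file count no matter how small each entry is.

// Illustrative only. The leaking shape: a stats collector that appends every
// finished transfer and never reaps the completed ones.
package main

import (
	"fmt"
	"runtime"
)

type transfer struct {
	name string
	buf  []byte // stands in for per-transfer state (timings, buffers, ...)
}

type stats struct {
	completed []*transfer // never reaped: one entry per file, forever
}

func (s *stats) finish(name string) {
	s.completed = append(s.completed, &transfer{name: name, buf: make([]byte, 1024)})
}

func main() {
	var s stats
	var m runtime.MemStats
	for i := 1; i <= 100000; i++ {
		s.finish(fmt.Sprintf("file-%d", i))
		if i%20000 == 0 {
			runtime.GC()
			runtime.ReadMemStats(&m)
			// HeapAlloc keeps climbing because every entry is still reachable.
			fmt.Printf("after %d files: HeapAlloc = %d MiB\n", i, m.HeapAlloc/1024/1024)
		}
	}
}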

I have merged those two fixes to master now, so they will be in v1.50.

@ncw
The memory leak issue seems to be resolved. I have started a transfer for a large dataset (500k objects). After it is completed I will verify the results.

But this introduced a regression: the elapsed time in the progress stats keeps getting reset after every file.

Transferred:        1.113M / 1.113 MBytes, 100%, 0 Bytes/s, ETA -
Errors:                 0
Checks:                 0 / 0, -
Transferred:        25202 / 25202, 100%
Elapsed time:          0s

Verified on Windows as well as Ubuntu.

Hooray!

Well spotted...

I'll work on a better fix.

@ncw

Can you please provide an ETA for v1.50 or for the complete fix?
Thanks

I have merged a better fix into the latest beta.

It was a lot more complicated than I thought it would be! The accounting contained the assumption that we would keep the completed transfers around forever, which was what caused the memory leak.
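
Roughly, the shape of the fix is to fold a completed transfer into aggregate counters and then drop the per-transfer object, rather than keeping every completed transfer around. A sketch of the idea (not the actual accounting code):

// Sketch: completed transfers are folded into running totals and released,
// so memory is bounded by the number of in-flight transfers rather than by
// every file ever copied.
package main

import "fmt"

type transfer struct {
	name string
	size int64
}

type stats struct {
	inProgress map[*transfer]struct{} // bounded by the transfer concurrency
	doneCount  int64                  // aggregates survive after reaping
	doneBytes  int64
}

func newStats() *stats {
	return &stats{inProgress: make(map[*transfer]struct{})}
}

func (s *stats) start(t *transfer) { s.inProgress[t] = struct{}{} }

// finish updates the totals and removes the per-transfer object, so it becomes
// garbage-collectable as soon as the caller drops its reference.
func (s *stats) finish(t *transfer) {
	s.doneCount++
	s.doneBytes += t.size
	delete(s.inProgress, t)
}

func main() {
	s := newStats()
	t := &transfer{name: "a.txt", size: 42}
	s.start(t)
	s.finish(t)
	fmt.Printf("transferred %d files, %d bytes, %d in flight\n", s.doneCount, s.doneBytes, len(s.inProgress))
}

With that, memory stays bounded by the number of concurrent transfers (the --transfers setting) rather than by the total number of files.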

I've fixed that now and would really appreciate your testing!

v1.50 should be released at the weekend or on Monday I think.

I've been seeing this as well, specifically when backing up hierarchies with simply enormous numbers of files (one job processes at least 2 million files). I've moved the beta binary into place and will see how tonight's backup fares.

Thanks for testing @diamondsw - we are approaching a release so the more testing on the beta the better!