Command to copy a provided list of files

Hi everyone!

I have a use case where I need to append new files from a source to a target.

The list of files to copy is determined using an rclone lsf on the source and refined using some other logic coming from elsewhere.

Today I'm using --files-from-raw in combination with the copy command to do so.
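
For context, the current workflow looks roughly like this (remote names and paths are illustrative):

$ rclone lsf --files-only -R source:bucket/prefix > files.txt
$ rclone copy --files-from-raw files.txt source:bucket/prefix target:bucket/prefix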

The problem I have today is that copy will perform a listing on the source to find the files' information, and I'm working with buckets that have tens of millions of files.

Because of this I'm taking a huge performance hit (some listings can take more than 1h) and it also requires more RAM than feels necessary (easily 2-3 GB).

The feature I suggest would be a dedicated command (e.g. copyfiles or similar) that would take a --files-list parameter with content similar to --files-from-raw (see the example invocation after the list below).

This command would:

  • ignore any provided filters (we are given the exact list of files to copy)
  • retrieve the source objects' info using HEAD requests instead of listing
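
A hypothetical invocation (neither the command name nor the flag exists in rclone today, this is just the proposal):

$ rclone copyfiles --files-list files.txt source:bucket/prefix target:bucket/prefix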

I have a first implementation that seems to work - narrowed to my use case - although I have not yet tested it extensively on huge volumes.

The naïve implementation I have modifies the March type along these lines:

func (m *March) Run(ctx context.Context) error {
	if m.ListFiles == "" {
		return m.walk(ctx)
	}
	return m.iterate(ctx)
}

Where walk is the current implementation, iterate would look something like:

func (m *March) iterate(ctx context.Context) error {
	ci := fs.GetConfig(ctx)
	m.init(ctx)

	var files = make(chan string)
	var checkers = ci.Checkers
	var mu sync.Mutex // Protects vars below
	var jobError error
	var errCount int
	var g errgroup.Group

	for i := 0; i < checkers; i++ {
		g.Go(func() error {
			for remote := range files {
				obj, err := m.Fsrc.NewObject(ctx, remote)
				if err != nil {
					mu.Lock()
					// Keep reference only to the first encountered error
					if jobError == nil {
						jobError = err
					}
					errCount++
					mu.Unlock()
				}

				// TODO use m.NoCheckDest to determine if we should check target or not
				//      if we do, we may call Match instead of SrcOnly

				if obj != nil {
					m.Callback.SrcOnly(obj)
				}
			}
			return nil
		})
	}

	err := fs.ForEachLine(m.ListFiles, true, func(remote string) error {
		files <- remote
		return nil
	})
	close(files)
	if err != nil {
		return err
	}

	err = g.Wait()
	if err != nil {
		return err
	}
	if errCount > 1 {
		return fmt.Errorf("march failed with %d error(s): first error: %w", errCount, jobError)
	}
	return jobError
}
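
To flesh out the TODO in the worker loop above, here is a minimal sketch of what could replace the "if obj != nil" block, assuming the Marcher callback signatures from the current march package (Match taking ctx, dst, src) and that NewObject returns fs.ErrorObjectNotFound for a missing file (errors.Is requires the standard errors import):

if obj == nil {
	continue // source lookup failed, error already recorded above
}
if m.NoCheckDest {
	m.Callback.SrcOnly(obj)
	continue
}
dstObj, err := m.Fdst.NewObject(ctx, remote)
switch {
case errors.Is(err, fs.ErrorObjectNotFound):
	m.Callback.SrcOnly(obj) // not on the destination yet, needs copying
case err != nil:
	// record as the first error, same pattern as above
default:
	m.Callback.Match(ctx, dstObj, obj) // present on both sides, let the callback decide
}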

Try adding the --no-traverse parameter and see if that helps

Already using it, but it only helps partially: it prevents the listing of the destination but not the source.

I have sources (S3 and/or GCS) with tens of millions of files in the same "directory", and I have a list of 5k files to copy.

With rclone copy --no-traverse --files-from-raw files.txt it can take 40-45 min to start copying the files because it is stuck listing the millions of files to find the 5k I want to copy.

With my suggestion it starts copying immediately.

When I try using --no-traverse it does HEAD on each individual file in the source and dest - no directory listings are performed.

$ rclone copy /tmp/filexx s3:rclone/filexx --no-traverse --files-from-raw filez -vv --dump headers    --log-file log
$ grep -c HEAD log
73
$ grep -c PUT log
25
$ grep -c GET log # no gets means no directory listings
0

Can you show the command line you are using please? And also describe the source and destination or post a redacted config file.

Which version of rclone are you using? Can you post the output of rclone version?

If you want to see what HTTP verbs rclone is using then use -vv --dump headers

Thanks

Thanks for the hint on how to track how many listing/HEAD requests there are.

I assumed it was doing a source listing because of how long it takes to start copying, and looking at march.go I saw:

	m.srcListDir = m.makeListDir(ctx, m.Fsrc, m.SrcIncludeAll)
	if !m.NoTraverse {
		m.dstListDir = m.makeListDir(ctx, m.Fdst, m.DstIncludeAll)
	}

The sources can be ftp, sftp, azureblob, netstorage, s3, gcs (thanks again rclone :pray: ) but the target is always GCS, and the copy command varies slightly depending on the source.

The command line looks like:

rclone copy \
 --update \
 --buffer-size 8m \
 --checkers 48 \
 --transfers 24 \
 --modify-window 1s \
 --timeout 1800s \
 --no-update-modtime \
 --ignore-checksum \
 --ignore-size \
 --no-gzip-encoding \
 --s3-no-check-bucket \
 --gcs-no-check-bucket \
 --gcs-download-compressed \
 --no-traverse \
 --files-from-raw files.txt \
 source:path/to/source target:path/to/target

I'll re-test locally using grep on the logs to see if I have GET requests and, if not, why it takes so long to start.

For context, I deal a lot with small-file problems where I have to copy millions of tiny files, which is why I reduced the buffer size etc. to keep the overall memory footprint down.

Ok I tried something like this locally:

./rclone copy --config rc.conf --no-traverse --files-from-raw files.txt -vv --dump-headers --log-file copy.log s3:bucket/CDN/ :local:tmp-out

I only have the credentials in the configuration file.

You are right, I do not see listing requests, but I do see 4113 HEAD requests for 2052 files. If I grep for one of the files I can see:

➜  grep "b7a262621ea4" copy.log 
2022/10/30 08:51:36 DEBUG : HEAD /CDN/2022-10-30/1667098495-ff3d6ab4-3388-44b0-b28d-b7a262621ea4.log.gz HTTP/1.1
2022/10/30 08:51:40 DEBUG : HEAD /CDN/2022-10-30/1667098495-ff3d6ab4-3388-44b0-b28d-b7a262621ea4.log.gz HTTP/1.1
2022/10/30 08:52:44 DEBUG : GET /CDN/2022-10-30/1667098495-ff3d6ab4-3388-44b0-b28d-b7a262621ea4.log.gz HTTP/1.1
2022/10/30 08:52:44 DEBUG : 2022-10-30/1667098495-ff3d6ab4-3388-44b0-b28d-b7a262621ea4.log.gz: md5 = ce0771282e44190bca7a9ab34a05cf28 OK
2022/10/30 08:52:44 INFO  : 2022-10-30/1667098495-ff3d6ab4-3388-44b0-b28d-b7a262621ea4.log.gz: Copied (new)

Not quite sure why there are 2 HEAD requests for the same file.

And I'm not quite sure why it can take so much time to start; I'll have to try to reproduce it when I see one.

A quick guess: the first is used to determine if the file needs to be copied and the second is needed to check that the checksum is still the same after the file is copied.

You can verify by adding --ignore-checksum.
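
For example, re-running the earlier test with the flag added (same illustrative paths as before):

$ rclone copy /tmp/filexx s3:rclone/filexx --no-traverse --files-from-raw filez --ignore-checksum -vv --dump headers --log-file log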

Please note the warning:

You should only use it if ... you are sure you might want to transfer potentially corrupted data.

More info here:
https://rclone.org/docs/#ignore-checksum

A quick guess: the first is used to determine if the file needs to be copied and the second is needed to check that the checksum is still the same after the file is copied.

That would be surprising because both HEAD requests happen before the copy is performed, but just to be sure I retried with --ignore-checksum and it did not change anything.

You should only use it if ... you are sure you might want to transfer potentially corrupted data.

In practice I do not have any choice but to ignore both the size and checksum checks:

  • the size check may fail because of gzip encoding done by S3/GCS, where the reported size does not correspond to the size of the file once downloaded

  • I had a case where the checksum check would fail for files on S3 that were perfectly valid, so I was forced to remove that check too.

edit

Looking in the code to see why we do 2 HEAD requests, I stumbled upon --s3-no-head-object, which seems to do pretty much what I want (although it seems we do not have --gcs-no-head-object): start copying the files immediately.

As I know the files within --files-from-raw are to be copied, I do not need to gather the date, size, or anything else, especially since I cannot rely on either the size or checksum to validate the copy.

That still does not explain why there are 2 HEAD requests, though.

edit 2

Ok I think I understand why I see 2 HEAD requests:

  1. We start the marcher, which performs makeListDir and starts doing HEAD requests on all objects

  2. All of them return HTTP 301 because this is not the proper S3 region

  3. The routine to correct the region kicks in

  4. We retry the marcher from the start

If I provide the region, I get the same number of HEAD requests as I have files in my input.

In the case of S3 it might be worth having a routine to ensure we have a valid region before going all in.
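
In the meantime, pinning the region in the remote definition avoids the 301-and-retry round trips (remote name and region here are illustrative):

[s3remote]
type = s3
provider = AWS
region = us-east-1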

edit 3

To sum up, my suggestion is more or less redundant with what we can do today by doing:

rclone copy --no-traverse --files-from-raw files.txt --s3-no-head source:/ target:/

The difference being - it seems - that my suggestion would start to copy right away as it iterates over the provided files, whereas here rclone starts by loading everything into memory before it starts to copy.

But I could go one step further and skip the source's HEAD too, since I know I do not really need it:

rclone copy --no-traverse --files-from-raw files.txt --s3-no-head --s3-no-head-object source:/ target:/

But the current documentation of --no-traverse is a little misleading, because as far as I can tell it only ever talks about skipping the destination listing and never mentions the source, unless I missed it.

That is worth looking into.

Note that rclone runs through the --files-from file making objects for each file in there. That is where the initial HEAD requests come from.

You can make this process speedier by increasing --checkers, as HEAD requests are done --checkers at a time in parallel. I expect this will help quite a bit!
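
For example (the value here is illustrative, tune it to your backend's rate limits):

$ rclone copy --no-traverse --files-from-raw files.txt --checkers 64 source:path/to/source target:path/to/target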

Note also that --no-traverse isn't always a win. On most backends it's quicker to do a listing to find the objects, especially if a few are in the same directory. It depends a lot on the access patterns. Some backends (eg google drive) absolutely hate --no-traverse - that pattern triggers heavy rate limiting for some reason.

Regarding --no-traverse, my bad, it was written somewhere in the filtering part:

If the --no-traverse and --files-from flags are used together an rclone command does not traverse the remote. Instead it addresses each path/file named in the file individually. For each path/file name, that requires typically 1 API call. This can be efficient for a short --files-from list and a remote containing many files.

I could have read the documentation more carefully so as not to miss that part, but it is easy to miss when every other occurrence makes you think this is destination-only.

In Copy command:

See the --no-traverse option for controlling whether rclone lists the destination directory or not. Supplying this option when copying a small number of files into a large destination can speed transfers up greatly.

In usage:

The --no-traverse flag controls whether the destination file system is traversed when using the copy or move commands. --no-traverse is not compatible with sync and will be ignored if you supply it with sync.

I'll close this post because I may be able to do what I want with existing options. Thank you both :pray:
