Efficiently update (copy) only very few files to Google Drive

Background: see my previous question. While the challenge of accelerating rclone copy remains, I am starting a new thread as this is a distinct question. Test case: 10k files on the server, 1 file modified (i.e. touched). The test case is roughly 1/30 the size of my real use case.

I looked into three options. See outputs below for details.

  1. Direct rclone copy from local to the encrypted Google Drive remote - 12 minutes
  2. rclone mount --vfs-cache-mode full, plus rclone copy from local to the mount point - ~45 minutes
  3. A script that identifies the changed files outside of rclone via sqlite, then uploads only those (rclone copy --files-from) - less than 30 seconds

What is the problem you are having with rclone?

Am I missing some flags? Is there a more elegant, built-in way to quickly discover and copy only the modified files?

What is your rclone version (output from rclone version)

v1.55.0

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Linux: Ubuntu 20.04 LTS, 64bit

Which cloud storage system are you using? (eg Google Drive)

Google Drive

The command you were trying to run (eg rclone copy /tmp remote:tmp)

## Option 1 - copy to encryptgdrive - 12 minutes
$ touch /home/xxx/xxx/SourceC/Wein/jahrgaengemodel.cpp
$ date ; ./rclone --version ; ./rclone --progress --skip-links --filter-from ~/.config/rclone/buero-backup.filter copy / encryptgdrive:buero ; date
Sa 03 Apr 2021 13:22:57 CEST
rclone v1.55.0
- os/type: linux
- os/arch: amd64
- go/version: go1.16.2
- go/linking: static
- go/tags: cmount
Transferred:        8.624k / 8.624 kBytes, 100%, 6.442 kBytes/s, ETA 0s
Checks:             10186 / 10186, 100%
Transferred:            1 / 1, 100%
Elapsed time:     11m28.1s
Sa 03 Apr 2021 13:34:26 CEST

## Option 2 - mount --vfs-cache-mode full, then copy from local to mount - ~45 minutes
$ touch /home/xxx/xxx/SourceC/Wein/jahrgaengemodel.cpp
Terminal 1: $ ./rclone mount --vfs-cache-mode full encryptgdrive:buero /mnt/usbdisk/
Terminal 2: $ date ; ./rclone --version ; ./rclone --progress --skip-links --filter-from ~/.config/rclone/buero-backup.filter copy / /mnt/usbdisk/ ; date
Sa 03 Apr 2021 13:56:40 CEST
rclone v1.55.0
- os/type: linux
- os/arch: amd64
- go/version: go1.16.2
- go/linking: static
- go/tags: cmount
Transferred:             0 / 0 Bytes, -, 0 Bytes/s, ETA -
Checks:              3270 / 10186, 32%
Elapsed time:     15m10.5s
>> terminated with Ctrl-C; at 32% of the checks after 15 minutes, a full run extrapolates to roughly 45 minutes

## Option 3 - home-made offline check for modified size or time - <30 seconds
$ sqlite3 ~/Heimnetzwerk/BackupConfig/rclone_testfilelist.db ".schema"
CREATE TABLE localfiles(filesize INTEGER NOT NULL, filetime TEXT NOT NULL, filepath TEXT NOT NULL UNIQUE);
CREATE TABLE remotefiles(filesize INTEGER NOT NULL, filetime TEXT NOT NULL, filepath TEXT NOT NULL UNIQUE);
$ touch /home/xxx/xxx/SourceC/Wein/jahrgaengemodel.cpp
$ date
Sa 03 Apr 2021 13:38:36 CEST
$ ./rclone --version
rclone v1.55.0
- os/type: linux
- os/arch: amd64
- go/version: go1.16.2
- go/linking: static
- go/tags: cmount
$ >&2 echo "clean the database"
clean the database
$ sqlite3 ~/Heimnetzwerk/BackupConfig/rclone_testfilelist.db "DELETE FROM remotefiles; DELETE FROM localfiles;"
$ >&2 echo "load list of remote files (with size and time)"
load list of remote files (with size and time)
$ ./rclone lsf --format "stp" --csv --recursive --files-only encryptgdrive:buero | sqlite3 -csv ~/Heimnetzwerk/BackupConfig/rclone_testfilelist.db ".import '|cat -' remotefiles"
$ >&2 echo "load list of local files (with size and time)"
load list of local files (with size and time)
$ ./rclone --skip-links --filter-from ~/.config/rclone/buero-backup.filter lsf --recursive --csv --files-only --format "stp" / | sqlite3 -csv ~/Heimnetzwerk/BackupConfig/rclone_testfilelist.db ".import '|cat -' localfiles"
$ >&2 echo "identify the files requiring update"
identify the files requiring update
$ sqlite3 ~/Heimnetzwerk/BackupConfig/rclone_testfilelist.db "SELECT l.filepath FROM localfiles l LEFT JOIN remotefiles r ON l.filepath=r.filepath WHERE l.filesize != r.filesize OR r.filesize IS NULL OR l.filetime > r.filetime" > /tmp/changedfiles
$ >&2 echo "upload these files:"
upload these files:
$ >&2 cat /tmp/changedfiles
home/xxx/xxx/SourceC/Wein/jahrgaengemodel.cpp
$ ./rclone copy --skip-links --progress --no-traverse --files-from /tmp/changedfiles / encryptgdrive:buero
Transferred:        8.624k / 8.624 kBytes, 100%, 4.140 kBytes/s, ETA 0s
Checks:                 1 / 1, 100%
Transferred:            1 / 1, 100%
Elapsed time:         4.2s
$ date
Sa 03 Apr 2021 13:39:02 CEST

The rclone config contents with secrets removed.

[mygdrive]
type = drive
scope = drive
export_formats = odt,ods,odp,svg
token = {"access_token":"xxx","token_type":"Bearer","refresh_token":"xxx","expiry":"2021-04-03T14:20:08.76658435+02:00"}
root_folder_id = xxx
client_id = xxx.apps.googleusercontent.com
client_secret = xxx

[encryptgdrive]
type = crypt
remote = mygdrive:Backups
filename_encryption = standard
directory_name_encryption = true
password = xxx
password2 = xxx

A log from the command with the -vv flag

$ ./rclone -vv --log-file /tmp/rclone-direct.log --progress --skip-links --filter-from ~/.config/rclone/buero-backup.filter copy / encryptgdrive:buero
$ cat /tmp/rclone-direct.log
2021/04/03 14:17:16 DEBUG : Using config file from "/home/xxx/.config/rclone/rclone.conf"
2021/04/03 14:17:16 DEBUG : rclone: Version "v1.55.0" starting with parameters ["./rclone" "-vv" "--log-file" "/tmp/rclone.log" "--progress" "--skip-links" "--filter-from" "/home/xxx/.config/rclone/buero-backup.filter" "copy" "/" "encryptgdrive:buero"]
2021/04/03 14:17:16 DEBUG : Creating backend with remote "/"
2021/04/03 14:17:16 DEBUG : local: detected overridden config - adding "{xxx}" suffix to name
2021/04/03 14:17:16 DEBUG : fs cache: renaming cache item "/" to be canonical "local{xxx}:/"
2021/04/03 14:17:16 DEBUG : Creating backend with remote "encryptgdrive:buero"
2021/04/03 14:17:16 DEBUG : Creating backend with remote "mygdrive:Backups/xxx"
2021/04/03 14:17:16 DEBUG : var: Excluded
2021/04/03 14:17:16 DEBUG : proc: Excluded
2021/04/03 14:17:16 DEBUG : cdrom: Excluded
2021/04/03 14:17:16 DEBUG : dev: Excluded
2021/04/03 14:17:16 DEBUG : lost+found: Excluded
2021/04/03 14:17:16 DEBUG : swapfile: Excluded
...
2021/04/03 14:17:22 DEBUG : home/xxx/xxx/SourceC/femtomail/femtomail.c: Size and modification time the same (differ by -286.486µs, within tolerance 1ms)
2021/04/03 14:17:22 DEBUG : home/xxx/xxx/SourceC/femtomail/femtomail.c: Unchanged skipping
2021/04/03 14:17:22 DEBUG : home/xxx/xxx/SourceC/femtomail/README.md: Size and modification time the same (differ by -286.486µs, within tolerance 1ms)
2021/04/03 14:17:22 DEBUG : home/xxx/xxx/SourceC/femtomail/README.md: Unchanged skipping
2021/04/03 14:17:22 DEBUG : home/xxx/xxx/SourceC/femtomail/read_local_mail.sh: Size and modification time the same (differ by -286.486µs, within tolerance 1ms)
2021/04/03 14:17:22 DEBUG : home/xxx/xxx/SourceC/femtomail/read_local_mail.sh: Unchanged skipping
2021/04/03 14:17:22 DEBUG : home/xxx/xxx/SourceC/InputSequence/.git/COMMIT_EDITMSG: Size and modification time the same (differ by -152.867µs, within tolerance 1ms)
2021/04/03 14:17:22 DEBUG : home/xxx/xxx/SourceC/InputSequence/.git/config: Size and modification time the same (differ by -319µs, within tolerance 1ms)
2021/04/03 14:17:22 DEBUG : home/xxx/xxx/SourceC/InputSequence/.git/config: Unchanged skipping
2021/04/03 14:17:22 DEBUG : home/xxx/xxx/SourceC/InputSequence/.git/index: Size and modification time the same (differ by -617.887µs, within tolerance 1ms)
2021/04/03 14:17:22 DEBUG : home/xxx/xxx/SourceC/InputSequence/.git/COMMIT_EDITMSG: Unchanged skipping
2021/04/03 14:17:22 DEBUG : home/xxx/xxx/SourceC/InputSequence/.git/index: Unchanged skipping
...

i have a local source with lots of files that i rclone sync every day.
since i run it every day, rclone only needs to look for local files that have changed in the past 24 hours.
for that, i use --max-age 24h.
then, as needed, i run rclone sync without --max-age.
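
a minimal sketch of that pattern (remote name and paths here are placeholders, not my actual setup):

# daily fast pass: only files modified in the last 24 hours are even considered
rclone sync /path/to/source remote:backup --max-age 24h

# occasional full pass, as needed: catches anything the fast passes missed
rclone sync /path/to/source remote:backup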

Using --max-age is indeed another workaround; thanks for the suggestion. As I cannot guarantee to run a backup at least every 24 hours (or at least every x hours), this option is less attractive to me: I would have to set a rather generous --max-age and would consequently, most of the time, have too many files considered for upload. Alternatively, I could keep track of the time of the last sync and adjust --max-age accordingly (see the sketch below). Or, since my script demonstrates that a fast file-size and timestamp comparison is possible, I might implement that workaround for production use.
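
A rough sketch of the "track the time of the last sync" variant, using a marker file of my own choosing (the marker path is a placeholder; untested):

#!/bin/bash
# derive --max-age from the mtime of a marker file touched after each
# successful sync (marker path is a placeholder)
marker="${HOME}/.cache/fastsync.last"
if [ -f "${marker}" ]; then
   # seconds elapsed since the last successful sync
   age="$(( $(date +%s) - $(stat -c %Y "${marker}") ))s"
else
   age="off"                    # first run: no age limit, scan everything
fi
rclone copy --skip-links --filter-from ~/.config/rclone/buero-backup.filter \
       --max-age "${age}" / encryptgdrive:buero \
   && touch "${marker}"         # record this sync only on success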

Out of curiosity: how long does your weekly sync (without --max-age) run, and for how many files?

another option that works well on windows is using file attributes.
whenever a local file is modified, the archive attribute is set to true.
so scanning for that is very quick, though rclone itself cannot do it.

my use case will not be much help, as i use wasabi, an s3 clone known for hot storage and very quick api calls, whereas gdrive is very slow for lots of small files.

so far your testing has been very informative, so please run a test using --max-age.

Interesting result: even with --max-age 24h it takes 13 minutes to identify (and copy) the changed file. One file was changed (the same file as in the previous tests, always using touch, i.e. same size: 8 kB, same content, but a newer timestamp).

$ date ; ./rclone --version ; ./rclone --skip-links --filter-from ~/.config/rclone/buero-backup.filter --max-age 24h copy / encryptgdrive:buero ; date
Sa 03 Apr 2021 18:15:34 CEST
rclone v1.55.0
- os/type: linux
- os/arch: amd64
- go/version: go1.16.2
- go/linking: static
- go/tags: cmount
Sa 03 Apr 2021 18:28:23 CEST

When looking at the log file (from a separate run, as logging could itself contribute to the slowdown), I see that each file or directory is listed as 'Excluded', irrespective of whether it was skipped due to --filter-from or due to --max-age.

well, rclone has to look at every single local file to figure out what to exclude.

not an expert at this, but perhaps something like (see the sketch below):

  1. rclone lsf / --max-age=24h -R --files-only > changed.files.txt
  2. for each file in changed.files.txt, rclone copy / encryptgdrive:buero --filter-from ~/.config/rclone/buero-backup.filter
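
a rough sketch of that idea, using --files-from for step 2 instead of one rclone copy per file (untested; filter flags borrowed from the earlier runs):

# step 1: quick local-only scan for files modified in the last 24 hours
rclone lsf / --max-age=24h -R --files-only \
       --skip-links --filter-from ~/.config/rclone/buero-backup.filter \
       > changed.files.txt

# step 2: upload exactly those files, without traversing the destination
rclone copy / encryptgdrive:buero --no-traverse --files-from changed.files.txt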

Thanks. I wrote a bash script that determines the list of files to upload by comparing the local and remote listings, following my original idea of using sqlite for the comparison (rather than filtering with --max-age=24h as you suggested, which would not work in my case; see my previous reply).

The script will likely need further tweaking; I will update it as appropriate.

To use the script, copy the text below to a file (I named mine fastsync.sh), make it executable (chmod +x fastsync.sh), and run ./fastsync.sh --help. The script prints usage instructions.

#!/bin/bash
set -o errexit
set -o nounset

### user modifiable ###
# set aliases to use alternative program versions
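# (note: non-interactive bash expands aliases only if 'shopt -s expand_aliases' is set)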
#alias rclone="/opt/rclone/rclone"
#alias tar=
#alias sqlite3=
### end user modifiable ###

usage() {
   local -a -r exitcodes=( "no error" 
                           "commandline error"
                           "unconfigured backend" )
   [ "$1" -gt 0 ] && echo "Exit code $1 - ${exitcodes[$1]}"
   echo -e "
   usage: $( basename "$0" ) [ <flags> ] copy|sync [--] source destination
      Supported flags:
      --help             This usage text
      --progress         Prints stages of script progress to stdout
      -v, -vv, -vvv      Prints debug messages to errout; more v -> more details
      --version          Prints version of this script and of used tools
      --filter-from <filename>
      $( printf "%s\n      " "${supportedflags[@]}" )"
   echo "See rclone documentation for meaning of flags and commands.
   
   2021-04-05 - script released into the public domain. No warranty.
   Send bug reports and requests for additional flags to:
   <software at quantentunnel dot de>
   and mention fastsync in the subject line to bypass spam filter."
   exit $1
}

version() {
   echo "$( basename "$0" ) 1.0.4" # rcs version number (major.2-digit minor)
   rclone version | head -1
   echo -n 'sqlite3 ' && sqlite3 --version
   tar --version | head -1
}

# parse commandline
declare -r -a supportedflags=( --skip-links --dry-run )
declare -i progress=0
declare -i verbose=0
declare -a parameters=( )
declare -a flags=( )
declare filterfrom=''
while [ $# -gt 0 ] ; do
   case "$1" in
      --help)
         usage 0
         ;;
      --progress)
         date
         echo "Parsing and validating commandline ..."
         progress=1
         ;;
      -v)
         verbose=1
         ;;
      -vv)
         verbose=2
         ;;
      -vvv)
         verbose=3
         ;;
      --filter-from)
         shift
         filterfrom="$1"
         ;;
      --version)
         version
         exit 0
         ;;
      --?*)
         [[ " ${supportedflags[*]} " =~ " $1 " ]] || \
            { echo "Unsupported flag: $1" ; usage 1 ; }
         flags+=( "$1" )   # quoted to keep each flag as a single word
         ;;
      --)
         shift
         break
         ;;
      *)
         parameters+=( "$1" )   # quoted: paths may contain spaces
         ;;
   esac
   shift
done         
parameters+=( "$@" )
[ ${verbose} -ge 2 ] && version
[ ${#parameters[@]} -gt 3 ] && \
   echo -e "Too many positional parameters: ${parameters[*]}\n" && usage 1
[ ${#parameters[@]} -lt 3 ] && \
   echo -e "Too few positional parameters: ${parameters[*]}\n" && usage 1
declare -r command="${parameters[0]}"
declare -r source="${parameters[1]}"
declare -r destination="${parameters[2]}"
[ "${command}" = 'copy' ] || [ "${command}" = 'sync' ] || usage 1
for i in {1..2} ; do
   [ ${verbose} -ge 2 ] && [[ -d "${parameters[$i]}" ]] && \
      echo "${parameters[$i]} is a local directory"
   [ ${verbose} -ge 2 ] && rclone listremotes |& \
      grep --regexp="^${parameters[$i]%%:*}:$" > /dev/null && \
      echo "${parameters[$i]} is a configured rclone backend"
   [[ -d "${parameters[$i]}" ]] || \
      rclone listremotes |& \
      grep --regexp="^${parameters[$i]%%:*}:$" >/dev/null || \
      { echo "${parameters[$i]} is unknown"; usage 2; }
   [ ${verbose} -gt 1 ] && \
      >&2 echo "'${parameters[$i]}' (or '${parameters[$i]%%:*}:') is valid"
done
[ ${verbose} -gt 0 ] && \
   >&2 echo "${command} from '${source}' to '${destination}'"
[ ${verbose} -gt 0 ] && \
   >&2 echo -e "progress=${progress} verbose=${verbose} ${flags[*]}
   filter-from='${filterfrom}'"

[ ${progress} -ne 0 ] && echo "Prepare sqlite db as cache ..."
declare -r db="$( mktemp --suffix='.db' --tmpdir fastsync-XXXXXX )"
[ ${verbose} -gt 0 ] && >&2 echo "Metadata cache will be in '${db}'"
sqlite3 -batch "${db}" <<"EOF"
   CREATE TABLE source(filesize INTEGER NOT NULL,
                       filetime TEXT NOT NULL,
                       filepath TEXT NOT NULL UNIQUE);
   CREATE TABLE dest  (filesize INTEGER NOT NULL,
                       filetime TEXT NOT NULL,
                       filepath TEXT NOT NULL UNIQUE);
   CREATE VIEW copy AS          -- files that require source -> dest
      SELECT s.filepath
         FROM source s LEFT JOIN dest d ON s.filepath=d.filepath
         WHERE s.filesize != d.filesize 
            OR d.filesize IS NULL 
            OR s.filetime > d.filetime;
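   -- (the textual filetime comparison above relies on lsf's default
   --  timestamp format, e.g. "2021-04-03 13:22:57", sorting lexicographically)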
   CREATE VIEW sync AS          -- also files that need to be deleted on dest
      SELECT filepath FROM copy
      UNION ALL
      SELECT s.filepath
         FROM dest d LEFT JOIN source s ON d.filepath=s.filepath
         WHERE s.filepath IS NULL;
EOF

[ ${progress} -ne 0 ] && echo "Retrieving metadata from source (${source}) ..."
rclone lsf "${flags[@]}" \
       $( [ -n "${filterfrom}" ] && echo "--filter-from ${filterfrom}" ) \
       --recursive --csv --files-only --format "stp" "${source}" \
       | sqlite3 -csv "${db}" ".import '|cat -' source"
[ ${verbose} -ge 1 ] && \
   >&2 echo "Metadata for $( sqlite3 "${db}" 'SELECT count(*) FROM source' )\
   files retrieved from ${source}"
[ ${verbose} -ge 3 ] && \
   >&2 sqlite3 -separator ' ' -header "${db}" 'SELECT * FROM source'

[ ${progress} -ne 0 ] && \
   echo "Retrieving metadata from destination (${destination}) ..."
rclone lsf --format "stp" --csv --recursive --files-only "${destination}" \
   | sqlite3 -csv "${db}" ".import '|cat -' dest"
[ ${verbose} -ge 1 ] && \
   >&2 echo "Metadata for $( sqlite3 "${db}" 'SELECT count(*) FROM dest' )\
   files retrieved from ${destination}"
[ ${verbose} -ge 3 ] && \
   >&2 sqlite3 -separator ' ' -header "${db}" 'SELECT * FROM dest'

[ ${progress} -ne 0 ] && \
   echo "Identifying the files requiring update for ${command} ..."
declare -r filelist="$( mktemp --suffix='.list' --tmpdir fastsync-XXXXXX )"
sqlite3 "${db}" "SELECT * FROM ${command}" > "${filelist}"
[ ${verbose} -ge 1 ] && \
   >&2 echo "$( sqlite3 "${db}" "SELECT count(*) FROM ${command}" )\
   files require updating (stored in '${filelist}')"
[ ${verbose} -ge 2 ] && \
   >&2 echo -e "For reference:
   $( sqlite3 "${db}" "SELECT count(*) FROM copy" ) for 'rclone copy'
   $( sqlite3 "${db}" "SELECT count(*) FROM sync" ) for 'rclone sync'"

[ ${progress} -ne 0 ] && echo "Executing rclone ${command} ..."
rclone "${command}" "${flags[@]}" --no-traverse --files-from "${filelist}" \
       $( [ ${progress} -ne 0 ] && echo "--progress" ) \
       $( [ ${verbose} -ge 2 ] && echo "-v" ) \
       -- "${source}" "${destination}"

[ ${progress} -ne 0 ] && echo "Cleaning up ..."
[ ${verbose} -ge 2 ] && \
   >&2 echo "Keeping '${db}' and '${filelist}' due to selected verbosity"
[ ${verbose} -lt 2 ] && rm "${db}"
[ ${verbose} -lt 2 ] && rm "${filelist}"
[ ${progress} -ne 0 ] && date
exit 0
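
For reference, an invocation matching my test setup might look like this:

$ ./fastsync.sh --progress -v --skip-links \
     --filter-from ~/.config/rclone/buero-backup.filter \
     copy -- / encryptgdrive:buero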
