What is your cloud service provider? Recommendations may depend on that.
Also, for 1.5 million files - seriously consider archiving some of that. Think about whether you really need to be able to download each file individually, or whether you can tolerate bundling some of them while they sit in storage. When it comes to optimizing a set of many small files, bundling is both the most powerful thing you can do and about the only thing. Uploads of big files can be optimized with settings in rclone (example below), but throughput on very small files depends almost entirely on the limits of the cloud provider - so combining them into single, larger files has massive performance benefits.
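As a very rough sketch, bundling and uploading might look like this (the remote name "remote:" and all paths here are placeholders, not anything from your setup):

tar -czf projects.tar.gz /data/projects/                  # pack thousands of tiny files into one archive
rclone copy projects.tar.gz remote:archive/ -P            # one large upload instead of thousands of small ones
rclone copy /data/bigfiles remote:big --transfers 8 -P    # big files parallelize well with rclone settings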
Clouds usually have massive bandwidth, but loads of tiny files are their bane. Even the most performant services handle them "not so great", and most handle them terribly. In practice this means you make poor use of the bandwidth you actually have - and the transfer takes a long time. It tends to be especially bad on upload, with somewhat more permissive limits on download.
Finally, not all clouds can even store 1.5 million files. Many of them have some sort of hard maximum - or at least a "recommended maximum for good performance".
To give an idea from Gdrive, which is really common - you can start a little over 2 new transfers per second, so that comes out to about 9 days of 24/7 transferring.
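(Rough math: 1,500,000 files / ~2 new transfers per second = 750,000 seconds, which is roughly 8.7 days of non-stop transferring.)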
The maximum file limit on a teamdrive is 400,000 according to Google, with a recommendation of 100,000 or less for optimal performance.
These specs can vary pretty wildly between different providers.
Thank you for the detailed response. I do have some questions.
The provider would be Google Drive.
How could rclone be sure everything is synced when you archive stuff prior to uploading?
How would it know, the next time the rclone sync command runs, whether any files have changed since the last run? Does rclone check the contents of the archives to see whether any file inside has changed, or to verify that really every file is backed up?
I have 2-3 folders (including their subfolders) which I would like to sync in real time (not archived), if that is possible. Well, I mean sync whenever some files change, and back up/move the old versions to a dated backup folder. (I have a script/command for a dated backup solution which I wrote about 8 months ago, but I haven't used it yet because I was not sure what the best settings would be for maximum efficiency.)
The rest of the files could be archived prior to uploading, as long as we can somehow make sure that each time a sync runs, really every single file is backed up.
Rclone does not interact with archived/zipped files any differently than any other file. The archive file has a size, a modtime and a hash just like anything else, and that is what rclone will check. Packing and unpacking the data is left up to you. It is a bit of a hassle, yes... but for many types of data that you rarely access anyway it's a worthy tradeoff.
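Retrieval then also goes through you: pull the archive back down and unpack it locally, for example (remote name and paths are placeholders):

rclone copy remote:archive/projects.tar.gz /tmp/
tar -xzf /tmp/projects.tar.gz -C /data/restore/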
There is some work in progress on rclone remotes that will be able to provide automatic transparent archiving - but that system is not done yet.
Do you mean all of those tiny files will be changing? ... if so then that is kind of a problem. If an archive changes it will effectively have to be re-uploaded in full. This whole strategy is mostly predicated on the assumption that the data is in long-term storage; it is not well suited to data that changes often.
Also, it won't be possible to check individual local files against an archived file in the cloud that contains them - not yet anyway, until a compression remote is made. You can of course check an archive file against an archive file on the cloud, just like with any other file (example below). That would mean keeping the data in an archive file locally too, in other words.
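For example, if the local copy of the archives lives in /data/archive and the cloud copy in remote:archive (placeholder names), a verification is simply:

rclone check /data/archive remote:archive

which compares the size and hash of every archive file on both sides.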
A lot of this really comes down to where the bulk of your files comes from and how they are used, which can vary wildly. If you can share a little more detail it may help me understand the situation better. You can PM me if you prefer to be discreet.
If you want a backup (and don't want to access the files on the drive web interface) then you might consider using something like restic, which can use rclone to save the data to google drive.
That will solve the millions of files problem and incremental backups etc.
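A minimal way to wire restic up to an existing rclone remote looks something like this (assuming the rclone remote is called gdrive: - adjust to your own config):

restic -r rclone:gdrive:backup init            # create the repository once
restic -r rclone:gdrive:backup backup ~/data   # run a de-duplicated, incremental backup
restic -r rclone:gdrive:backup snapshots       # list what is stored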
I use restic+rclone to backup my laptop to a swift object storage system. This is 1.2M files and takes about 3 minutes for an incremental.
I also have an rclone+crypt backup to the same swift cluster using --backup-dir for the incrementals. This takes more like 20 mins for an incremental backup. However the swift cluster is good at lots of small files unlike google drive.
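The --backup-dir variant is roughly this shape (remote name and paths are placeholders):

rclone sync /home/me gdrivecrypt:current --backup-dir gdrivecrypt:old/$(date +%Y-%m-%d)

Files that change or are deleted in current get moved into the dated folder instead of being lost, which gives you simple incrementals. The script I use for the restic backup is below: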
#!/bin/bash
hostname=myhost
email=me@example.com
log=/tmp/restic.log
default_dirs=~
# send output to log and terminal
exec > >(tee -a $log) 2>&1
# A POSIX variable
OPTIND=1 # Reset in case getopts has been used previously in the shell.
prune=0
list=0
backup=1
while getopts "h?pl" opt; do
case "$opt" in
h|\?)
echo "$0 [-h] [-p]"
exit 1
;;
p) prune=1
;;
l) list=1
backup=0
;;
esac
done
shift $((OPTIND-1))
# read the command line arguments
dir=$@
if [ "$dir" = "" ]; then
dir="$default_dirs"
fi
function notify() {
    local when=`date -Is`
    echo "${when} $1"
}
# This means use rclone with the directory remote:{hostname}_backup
export RESTIC_REPOSITORY=rclone:remote:${hostname}_backup/
export RESTIC_PASSWORD=XXX
# lower CPU and disk priority
renice 19 $$
ionice -c 3 -p $$
# restic init
if [ "$backup" = "1" ]; then
notify "restic backup for ${dir} starting for ${hostname}"
# Make the backup
restic backup \
--exclude "/mnt/**" \
--exclude "/.cache/**" \
--one-file-system \
${dir}
restic_exit_status=$?
# mail if failed
if [ $restic_exit_status -ne 0 ];then
notify "restic backup failed"
mail -s "Restic backup failed on ${hostname}" ${email} < ${log}
fi;
notify "backup complete"
fi
# Tidy the old backups if required
if [ "$prune" = "1" ]; then
    restic forget \
        --prune \
        --keep-last 3 \
        --keep-daily 3 \
        --keep-weekly 3 \
        --keep-monthly 12 \
        --keep-yearly 75
    restic_exit_status=$?
    # mail if failed
    if [ $restic_exit_status -ne 0 ]; then
        notify "restic prune failed"
        mail -s "Restic prune failed on ${hostname}" ${email} < ${log}
    fi
    notify "restic prune complete"
fi
# list snapshots
if [ "$list" = "1" ]; then
    restic snapshots
fi
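Saved as, say, backup.sh, it can then be run like this (based on the getopts flags above):

./backup.sh              # back up the default directories (~)
./backup.sh /srv/data    # back up a specific directory instead
./backup.sh -p           # back up, then prune old snapshots
./backup.sh -l           # only list the existing snapshots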
Depending on how you back up:
If you back up using rclone sync to a plain google drive remote, you will be able to use the files in the web interface.
If you back up using rclone sync to a crypted google drive remote, you will not be able to use the files in the web interface. You will be able to use rclone mount though.
If you back up using rclone+restic, you won't be able to use the web interface or rclone mount. However, restic does have its own way of mounting (read-only) its snapshots.
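For the last two cases, the mount commands look roughly like this (mountpoints and remote names are placeholders):

rclone mount gdrivecrypt: /mnt/gdrive --read-only   # browse a crypted remote as a filesystem
restic -r rclone:gdrive:backup mount /mnt/restic    # browse restic snapshots read-only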