What is your cloud service provider? Recommendations may depend on that.
Also, for 1.5 million files - seriously consider archiving some of that. Think about whether you really need to be able to download each file individually, or whether you can tolerate bundling some of them while they sit in storage. When it comes to optimizing a set of many small files, bundling is both the most powerful thing you can do and about the only thing. Uploads of big files can be optimized with settings in rclone (example below), but throughput on very small files depends almost entirely on the limits of the cloud provider - so combining them into single, larger files has massive performance benefits.
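As a very rough sketch, bundling and uploading might look like this (the remote name "remote:" and all paths here are placeholders, not anything from your setup):

tar -czf projects.tar.gz /data/projects/                  # pack thousands of tiny files into one archive
rclone copy projects.tar.gz remote:archive/ -P            # one large upload instead of thousands of small ones
rclone copy /data/bigfiles remote:big --transfers 8 -P    # big files parallelize well with rclone settings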
Clouds usually have massive bandwidth, but loads of tiny files are their bane. Even the most performant services handle them "not so great", and most handle them terribly. In practice this means you make poor use of the bandwidth you actually have - and the transfer takes a long time. It tends to be especially bad on upload, with somewhat more permissive limits on download.
Finally, not all clouds can even store 1.5 million files. Many of them have some sort of hard maximum - or at least a "recommended maximum for good performance".
To give an idea from Gdrive, which is really common - you can start a little over 2 new transfers per second, so that comes out to about 9 days of 24/7 transferring.
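(Rough math: 1,500,000 files / ~2 new transfers per second = 750,000 seconds, which is roughly 8.7 days of non-stop transferring.)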
The maximum file limit on a teamdrive is 400,000 according to Google, with a recommendation of 100,000 or less for optimal performance.
These specs can vary pretty wildly between different providers.
Thank you for the detailed response. I do have some questions.
The provider would be Google Drive.
How could rclone be sure everything is synced when you archive stuff prior to uploading?
How would it know, the next time the rclone sync command runs, whether any files have changed since the last run? Does rclone check the contents of the archives to see whether any file inside has changed, or to verify that really every file is backed up?
I have 2-3 folders (including their subfolders) which I would like to sync in real time (not archived), if that is possible. Well, I mean sync whenever some files change, and back up/move the old versions to a dated backup folder. (I have a script/command for a dated backup solution which I wrote about 8 months ago, but I haven't used it yet because I was not sure what the best settings would be for maximum efficiency.)
The rest of the files could be archived prior to uploading, as long as we can somehow make sure that each time a sync runs, really every single file is backed up.
Rclone does not interact with archived/zipped files any differently than any other file. The archive file has a size, a modtime and a hash just like anything else, and that is what rclone will check. Packing and unpacking the data is left up to you. It is a bit of a hassle, yes... but for many types of data that you rarely access anyway it's a worthy tradeoff.
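Retrieval then also goes through you: pull the archive back down and unpack it locally, for example (remote name and paths are placeholders):

rclone copy remote:archive/projects.tar.gz /tmp/
tar -xzf /tmp/projects.tar.gz -C /data/restore/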
There is some work in progress on rclone remotes that will be able to provide automatic transparent archiving - but that system is not done yet.
Do you mean all of those tiny files will be changing? ... if so then that is kind of a problem. If an archive changes it will effectively have to be re-uploaded in full. This whole strategy is mostly predicated on the assumption that the data is in long-term storage; it is not well suited to data that changes often.
Also, it won't be possible to check individual local files against an archived file in the cloud that contains them - not yet anyway, until a compression remote is made. You can of course check an archive file against an archive file on the cloud, just like with any other file (example below). That would mean keeping the data in an archive file locally too, in other words.
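For example, if the local copy of the archives lives in /data/archive and the cloud copy in remote:archive (placeholder names), a verification is simply:

rclone check /data/archive remote:archive

which compares the size and hash of every archive file on both sides.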
A lot of this really comes down to where the bulk of your files comes from and how they are used, which can vary wildly. If you can share a little more detail it may help me understand the situation better. You can PM me if you prefer to be discreet.
If you want a backup (and don't want to access the files on the drive web interface) then you might consider using something like restic, which can use rclone to save the data to google drive.
That will solve the millions of files problem and incremental backups etc.
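A minimal way to wire restic up to an existing rclone remote looks something like this (assuming the rclone remote is called gdrive: - adjust to your own config):

restic -r rclone:gdrive:backup init            # create the repository once
restic -r rclone:gdrive:backup backup ~/data   # run a de-duplicated, incremental backup
restic -r rclone:gdrive:backup snapshots       # list what is stored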
I use restic+rclone to backup my laptop to a swift object storage system. This is 1.2M files and takes about 3 minutes for an incremental.
I also have an rclone+crypt backup to the same swift cluster using --backup-dir for the incrementals. This takes more like 20 mins for an incremental backup. However the swift cluster is good at lots of small files unlike google drive.
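The --backup-dir variant is roughly this shape (remote name and paths are placeholders):

rclone sync /home/me gdrivecrypt:current --backup-dir gdrivecrypt:old/$(date +%Y-%m-%d)

Files that change or are deleted in current get moved into the dated folder instead of being lost, which gives you simple incrementals. The script I use for the restic backup is below: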
#!/bin/bash
hostname=myhost
email=me@example.com
log=/tmp/restic.log
default_dirs=~
# send output to log and terminal
exec > >(tee -a $log) 2>&1
# A POSIX variable
OPTIND=1 # Reset in case getopts has been used previously in the shell.
prune=0
list=0
backup=1
while getopts "h?pl" opt; do
case "$opt" in
h|\?)
echo "$0 [-h] [-p]"
exit 1
;;
p) prune=1
;;
l) list=1
backup=0
;;
esac
done
shift $((OPTIND-1))
# read the command line arguments
dir=$@
if [ "$dir" = "" ]; then
dir="$default_dirs"
fi
function notify() {
    local when=`date -Is`
    echo "${when} $1"
}
# This means use rclone with the directory remote:{hostname}_backup
export RESTIC_REPOSITORY=rclone:remote:${hostname}_backup/
export RESTIC_PASSWORD=XXX
# lower CPU and disk priority
renice 19 $$
ionice -c 3 -p $$
# restic init
if [ "$backup" = "1" ]; then
notify "restic backup for ${dir} starting for ${hostname}"
# Make the backup
restic backup \
--exclude "/mnt/**" \
--exclude "/.cache/**" \
--one-file-system \
${dir}
restic_exit_status=$?
# mail if failed
if [ $restic_exit_status -ne 0 ];then
notify "restic backup failed"
mail -s "Restic backup failed on ${hostname}" ${email} < ${log}
fi;
notify "backup complete"
fi
# Tidy the old backups if required
if [ "$prune" = "1" ]; then
    restic forget \
        --prune \
        --keep-last 3 \
        --keep-daily 3 \
        --keep-weekly 3 \
        --keep-monthly 12 \
        --keep-yearly 75
    restic_exit_status=$?
    # mail if failed
    if [ $restic_exit_status -ne 0 ]; then
        notify "restic prune failed"
        mail -s "Restic prune failed on ${hostname}" ${email} < ${log}
    fi
    notify "restic prune complete"
fi
# list snapshots
if [ "$list" = "1" ]; then
    restic snapshots
fi
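Saved as, say, backup.sh, it can then be run like this (based on the getopts flags above):

./backup.sh              # back up the default directories (~)
./backup.sh /srv/data    # back up a specific directory instead
./backup.sh -p           # back up, then prune old snapshots
./backup.sh -l           # only list the existing snapshots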
Depending on how you back up:
If you back up using rclone sync to a plain google drive remote, you will be able to use the files in the web interface.
If you back up using rclone sync to a crypted google drive remote, you will not be able to use the files in the web interface. You will be able to use rclone mount though.
If you back up using rclone+restic, you won't be able to use the web interface or rclone mount. However, restic does have its own way of mounting (read-only) its snapshots.
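For the last two cases, the mount commands look roughly like this (mountpoints and remote names are placeholders):

rclone mount gdrivecrypt: /mnt/gdrive --read-only   # browse a crypted remote as a filesystem
restic -r rclone:gdrive:backup mount /mnt/restic    # browse restic snapshots read-only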