I know this has been discussed before, but I figured I'd drop by and see whether the interest level has changed regarding using a local database to track file size/date differences, as opposed to scanning multi-million-file remotes on every sync. I know some of you are using rclone for some very serious workloads, and for those it could reduce sync times by hours.
Maybe it could even come with a --scan-remote flag for an occasional full remote scan, to make sure the database isn't leading to neglected files.
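To make the idea concrete, here is a minimal sketch (not rclone's actual implementation — all names here are hypothetical) of what such a local index might look like: a small SQLite table of (path, size, mtime), and a function that diffs a fresh listing against the stored state so only changed or new files need to be transferred.

```python
import sqlite3

# Hypothetical sketch of a local sync index: store (path, size, mtime)
# so a sync only has to act on entries whose metadata changed, instead
# of treating every file in a multi-million-file remote as unknown.

def open_index(db_path=":memory:"):
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, size INTEGER, mtime INTEGER)"
    )
    return db

def changed_files(db, listing):
    """listing: iterable of (path, size, mtime) tuples.

    Returns paths that are new or whose size/mtime differ from the
    stored state, updating the index as it goes."""
    changed = []
    for path, size, mtime in listing:
        row = db.execute(
            "SELECT size, mtime FROM files WHERE path = ?", (path,)
        ).fetchone()
        if row != (size, mtime):  # None (new file) or metadata mismatch
            changed.append(path)
            db.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                (path, size, mtime),
            )
    db.commit()
    return changed
```

With this, a second sync over an unchanged listing returns an empty list, which is the whole point: no per-file work on the remote unless something actually changed.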
Just curious what y'all think. Wonderful tool, regardless.
Then with respect, please consider this my gentle stoking of the embers of the matter. Nick mentioned that this is something that was started some time ago but then mostly abandoned. I would have thought it would garner more interest, so I figured I'd bring it up again to see if anything has changed.
I am working on a Python wrapper around rclone for backup that does the same kind of thing.
The problem is that rclone is stateless for most operations. This would be anything but stateless and could cause quite a few issues. Notably, you need to deal with cache invalidation, which is notoriously difficult.
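The invalidation problem is that the local state can silently drift from the remote (another client changes files, a sync is interrupted, clocks skew). One simple mitigation, matching the --scan-remote idea above, is to record when the last full scan happened and force one after a maximum age. This is just an illustrative sketch of that policy; the names and threshold are assumptions, not anything rclone does.

```python
import time

# Sketch of one cache-invalidation policy: trust the local index only
# for a bounded time, then fall back to a full remote listing.
FULL_SCAN_MAX_AGE = 24 * 3600  # e.g. force a full remote scan daily

def needs_full_scan(last_full_scan, now=None, max_age=FULL_SCAN_MAX_AGE):
    """True if the cached state is too old to trust on its own.

    last_full_scan: Unix timestamp of the last full scan, or None if
    no full scan has ever completed."""
    if last_full_scan is None:  # never scanned: nothing to trust yet
        return True
    now = time.time() if now is None else now
    return (now - last_full_scan) > max_age
```

This doesn't solve invalidation (nothing fully does), but it bounds how long an out-of-date index can neglect files.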
The reason I think this should be inherent in rclone's code is the whole nature of cloud storage: centrally managed hardware that allots resources to a large customer base, which means providers keep very close track of I/Os, either to maximize availability or to improve monetization. Microsoft business, for example, allows you roughly 30,000 I/Os before throttling you, then allows about 5,000 I/Os every 5 to 15 minutes. That is extremely stingy (especially since their consumer OneDrive allows hundreds of thousands of I/Os before throttling). I believe other cloud services charge you for I/Os, and this is likely to become more common in the future.
If rclone were "self-sufficient" in keeping track of changes, it could sync a multi-million-file remote with a total of 20 I/O requests. It would be magical.