Since many people are migrating data from a particular unnamed provider to a different one, I thought I would share this tip. If you need to transfer massive amounts of data quickly and don't want to bog down your own network or pay for a VPS, you can use Google Colab to run your copy/sync operations. The downside is that Colab executions are limited to 12 hours on the free tier (and, I believe, 24 hours on the pro tier), so the job needs to be manually restarted every 12-24 hours. That said, I'm able to move data via Colab much faster than I otherwise could between the two cloud providers I'm using. See screenshot:
Here's a rough notebook I'm using to help you get set up, though you'll need to modify it for your use case:
1. Create Working Directories
Ensure there is a working directory called "sync_task" with cache and tmp folders inside it.
import os

# Base working directory plus cache and tmp subdirectories
base_dir = os.path.expanduser('~/sync_task')
cache_dir = os.path.join(base_dir, 'cache')
tmp_dir = os.path.join(base_dir, 'tmp')

# Ensure the directories exist
for dir_path in [base_dir, cache_dir, tmp_dir]:
    os.makedirs(dir_path, exist_ok=True)

print("Directories are set up.")
2. Install rclone and Upload Config
Install:
!curl https://rclone.org/install.sh | sudo bash
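To confirm the install worked, you can check the version:

!rclone version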
Upload your rclone config file (you must go through authentication locally first by running rclone config on your own machine, since Colab is not interactive):
from google.colab import files

# Upload the rclone config file
uploaded = files.upload()

# Check that the right file was uploaded and move it into the sync_task directory
for fn in uploaded.keys():
    if "rclone.conf" in fn:
        print('User uploaded rclone config file.')
        !mv "{fn}" ~/sync_task/rclone.conf
        print("rclone configuration moved to ~/sync_task/")
    else:
        print(f'Unexpected file "{fn}". Please upload rclone.conf.')
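Since the config contains your auth tokens, you may also want to tighten its permissions (optional, but cheap insurance on a shared runtime):

# Restrict the config so only the current user can read it
!chmod 600 ~/sync_task/rclone.conf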
Test that both remotes are accessible from Colab (replace remote1 and remote2 with your own remote names):
!rclone lsd remote1: --config ~/sync_task/rclone.conf && rclone lsd remote2: --config ~/sync_task/rclone.conf
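If the test fails, double-check the config file itself. For reference, a two-remote config might look roughly like this; the remote names, types, and token values here are placeholders, and yours will depend on your providers:

[remote1]
type = drive
token = {"access_token":"...","token_type":"Bearer","refresh_token":"...","expiry":"..."}

[remote2]
type = onedrive
token = {"access_token":"...","token_type":"Bearer","refresh_token":"...","expiry":"..."}
drive_id = ...
drive_type = personal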
3. Run Sync Operation
Since Colab can't show live output from a long-running command, we use subprocess to run the operation in the background and then tail the log to keep an eye on progress:
import subprocess

# Define the rclone command with logging.
# Note: --dry-run is included as a safety check; remove it to run the real sync.
command = (
    "rclone sync remote1: remote2: --dry-run"
    " --config ~/sync_task/rclone.conf"
    " -v --log-file ~/sync_task/rclone.log"
    " --transfers 15 --checkers 64 --fast-list"
    " --cache-dir ~/sync_task/cache --temp-dir ~/sync_task/tmp"
)

# Launch rclone in the background so the notebook stays responsive
rclone_process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
Tail the log, and print something at least every 60 seconds (to keep Colab from shutting the session down for inactivity):
import time
from IPython.display import clear_output

# Monitoring intervals
log_check_interval = 10  # seconds
keep_alive_message_interval = 60  # seconds

last_message_time = 0

while True:
    # Check whether the rclone process is still running
    if rclone_process.poll() is not None:
        print("rclone process has finished.")
        break

    clear_output(wait=True)  # Clear the previous output

    # Print the last 30 lines of the log file
    log_content = !tail -n 30 ~/sync_task/rclone.log
    for line in log_content:
        print(line)

    # Print a keep-alive message
    current_time = time.time()
    if current_time - last_message_time > keep_alive_message_interval:
        print("Still syncing...")
        last_message_time = current_time

    # Sleep before checking the log again
    time.sleep(log_check_interval)
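Once the loop exits, it's worth confirming that rclone actually finished cleanly. A quick sketch (an exit code of 0 means success):

# Check rclone's exit code and the end of the log
print(f"rclone exit code: {rclone_process.returncode}")
!tail -n 5 ~/sync_task/rclone.log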
4. Force-Stop Sync Operation
If you need to stop mid-operation for some reason, use the subprocess handle we already have and terminate it:
# Check if the process is still running
if rclone_process.poll() is None:
    rclone_process.terminate()
    print("rclone process terminated.")
else:
    print("rclone process is not running.")
If you have multiple Google accounts, you can run a separate operation on each one simultaneously. Obviously, if you do that, make sure you're not running a sync on the same directory from multiple instances.
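One simple way to keep instances from colliding is to scope each account's sync to a different top-level path. A sketch (the "photos" and "documents" directory names are just placeholders):

# Colab instance 1 (account A): sync only the "photos" path
!rclone sync remote1:photos remote2:photos --config ~/sync_task/rclone.conf -v --log-file ~/sync_task/rclone.log

# Colab instance 2 (account B): sync only the "documents" path
!rclone sync remote1:documents remote2:documents --config ~/sync_task/rclone.conf -v --log-file ~/sync_task/rclone.log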