PSA: Use Google Colab for cloud-to-cloud data transfers

Since many people are transferring data from a particular unnamed provider to a different one, I thought I would share this tip. If you need to move massive amounts of data quickly and don't want to bog down your own network or rent a VPS, you can use Google Colab to run your copy/sync operations. The downside is that Colab sessions are limited to 12 hours on the free tier (and I believe 24 hours on the Pro tier), so the job does need to be restarted manually every 12-24 hours. That said, I'm able to move data via Colab much faster than I otherwise could between the two cloud providers I'm using.

Here's a rough notebook that I'm using to help you get set up, although you will need to modify it for your use case:


1. Create working directories

Ensure there is a working directory called "sync_task" with cache and tmp folders inside it.

import os

base_dir = '~/sync_task'
cache_dir = os.path.join(base_dir, 'cache')
tmp_dir = os.path.join(base_dir, 'tmp')

# Ensure directories exist
for dir_path in [base_dir, cache_dir, tmp_dir]:
    os.makedirs(os.path.expanduser(dir_path), exist_ok=True)

print("Directories are set up.")

2. Install RClone and Upload Config

Install:

!curl https://rclone.org/install.sh | sudo bash
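If you want to double-check the install before continuing, a quick version check is enough:

!rclone version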

Upload your rclone config file (you must complete the remote authentication locally beforehand, since Colab is not interactive):

from google.colab import files

# Upload the rclone config file
uploaded = files.upload()

# Check if the correct file was uploaded and move it to the sync_task directory
for fn in uploaded.keys():
    if "rclone.conf" in fn:
        print('User uploaded rclone config file.')
        # Use the actual uploaded filename in case Colab renamed it (e.g. "rclone (1).conf")
        !mv "{fn}" ~/sync_task/rclone.conf
        print("rclone configuration moved to ~/sync_task/")
    else:
        print(f'Unexpected file "{fn}". Please upload rclone.conf.')
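If you don't have a config yet, create it on your local machine first. These are standard rclone commands (run them locally, not in Colab):

rclone config        # interactive setup; complete the authentication flow for each remote
rclone config file   # prints the path of the rclone.conf file to upload above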

Test to make sure both remotes are accessible from Colab:

!rclone lsd remote1: --config ~/sync_task/rclone.conf && rclone lsd remote2: --config ~/sync_task/rclone.conf
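If either listing fails, it can help to confirm rclone is actually reading the uploaded config file. Keep in mind the output includes your access tokens, so don't paste it anywhere public:

!rclone config show --config ~/sync_task/rclone.conf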

3. Run Sync Operation

Since Colab won't show live output from a long-running command, we use subprocess to launch the operation in the background and then tail the log to keep an eye on progress:

import subprocess

# Define the rclone command with logging (remove --dry-run to perform the actual sync)
command = ("rclone sync remote1: remote2: --dry-run --config ~/sync_task/rclone.conf -v "
           "--log-file ~/sync_task/rclone.log --transfers 15 --checkers 64 --fast-list "
           "--cache-dir ~/sync_task/cache --temp-dir ~/sync_task/tmp")
rclone_process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
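The cell returns immediately because the sync runs in the background. If you want to confirm it actually launched, the Popen handle exposes the process ID:

print(f"rclone started in the background with PID {rclone_process.pid}")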

Tail the log, and ensure some output every 60 seconds (to prevent Colab from shutting it down due to inactivity):

import time
from IPython.display import clear_output

# How often to refresh the log view and how often to print a keep-alive message
log_check_interval = 10  # seconds
keep_alive_message_interval = 60  # seconds

last_message_time = 0

while True:
    # Check if the rclone process is still running
    if rclone_process.poll() is not None:
        print("rclone process has finished.")
        break

    clear_output(wait=True)  # Clear the previous output

    # Print last 30 lines of the log file
    log_content = !tail -n 30 ~/sync_task/rclone.log
    for line in log_content:
        print(line)

    # Print a keep-alive message
    current_time = time.time()
    if current_time - last_message_time > keep_alive_message_interval:
        print("Still syncing...")
        last_message_time = current_time

    # Sleep before checking the log again
    time.sleep(log_check_interval)
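Once the loop reports that the process has finished, it's worth checking whether rclone exited cleanly. A minimal check using the standard subprocess API:

# Collect any remaining output and the exit status (0 means success)
stdout, stderr = rclone_process.communicate()
print("rclone exit code:", rclone_process.returncode)
if rclone_process.returncode != 0:
    # Most of the detail will be in ~/sync_task/rclone.log, but stderr may hold a hint too
    print(stderr.decode(errors="replace"))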

4. Force-Stop Sync Operation

In case you need to stop mid-operation for some reason, terminate the subprocess we started earlier:

# Check if the process is still running
if rclone_process.poll() is None:
    rclone_process.terminate()
    print("rclone process terminated.")
else:
    print("rclone process is not running.")

If you have multiple Google accounts, you can run a separate operation on each one simultaneously. Obviously, if you do that, make sure you're not running a sync on the same directory from multiple instances.


That sounds awesome and exactly what I need. Would I be able to achieve this under Windows as well?

It should be system-agnostic, since everything runs in Google Colab.

All good, just realised where all the code goes 🙂 I'll be back.

This is great, thanks. Do you know if it's possible to schedule the execution so we don't need to log in and run it manually each time?

No, because what we're using Colab for here isn't really what it's meant for.


This is a fantastic solution for transfers and your guide worked flawlessly. Thank you so much!

Is there any way to run multiple commands and logs in the same Colab notebook for multiple remotes? Or do you just have to make a separate notebook for each remote and run them individually?

If you start a copy command and then start the log-tailing cell, can you close the window, or must you leave it open?

Code:

import subprocess

command = "rclone copy DROPv01-10: DRIVEv01: --config ~/sync_task/rclone.conf -v --log-file ~/sync_task/DRIVEv01.log --transfers 15 --checkers 64 --fast-list --cache-dir ~/sync_task/cache --temp-dir ~/sync_task/tmp"

DRIVEv01 = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

import time
from IPython.display import clear_output

log_check_interval = 10  # seconds
keep_alive_message_interval = 60  # seconds

last_message_time = 0

while True:
    if DRIVEv01.poll() is not None:
        print("DRIVEv01 has finished.")
        break

    clear_output(wait=True)

    log_content = !tail -n 21 ~/sync_task/DRIVEv01.log
    for line in log_content:
        print(line)

    current_time = time.time()
    if current_time - last_message_time > keep_alive_message_interval:
        print("Still copying...")
        last_message_time = current_time

    time.sleep(log_check_interval)

Does this bypass the 750 GB limit? Should I include the --drive-stop-on-upload-limit flag so it doesn't keep using resources after the limit is hit?

Theoretically you could open multiple subprocesses within the same notebook, each writing to a different log file, and then find some way to tail all the logs (a rough sketch is below). That said, it gets complicated quickly, and if something goes wrong with one subprocess there's really no way to restart it without also restarting the others. I'd recommend just using a second Google account to run a copy of this notebook with your other remote.
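For completeness, here's a minimal sketch of the multi-subprocess approach. The remote names and log filenames are hypothetical placeholders; each copy writes to its own log, and one loop tails them all:

import subprocess
import time
from IPython.display import clear_output

# Hypothetical remotes and log names; substitute your own
jobs = {
    "job_a": "rclone copy remoteA: remoteB: --config ~/sync_task/rclone.conf -v --log-file ~/sync_task/job_a.log",
    "job_b": "rclone copy remoteC: remoteD: --config ~/sync_task/rclone.conf -v --log-file ~/sync_task/job_b.log",
}
procs = {name: subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
         for name, cmd in jobs.items()}

# Refresh the tails of all logs every 10 seconds until every job has finished
while any(p.poll() is None for p in procs.values()):
    clear_output(wait=True)
    for name in jobs:
        print(f"=== {name} ===")
        log_tail = !tail -n 10 ~/sync_task/{name}.log
        for line in log_tail:
            print(line)
    time.sleep(10)

print("All copy jobs have finished.")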

Also, no, this will not bypass the 750 GB upload limit. If you're doing a Google Drive to Google Drive transfer, you're better off using a free VM instance on Google Cloud, which not only bypasses the upload limit but also does the transfer server-side. Doing it server-side makes things much easier; I once moved ~200 TB from one Google Drive to another in a matter of a few hours by using a VM.
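For reference, the rclone flag that allows server-side copies between two different Drive remotes is --drive-server-side-across-configs. Here's a rough sketch of what you'd run on the VM; the remote names are placeholders, and whether a transfer actually stays server-side depends on the accounts and permissions involved:

rclone copy gdrive_src: gdrive_dst: \
    --drive-server-side-across-configs \
    -v --log-file ~/rclone_server_side.log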


Thank you! Do you have any specific suggestions for a well-built guide/walk-through for setting up a VM on Google Cloud for this?

I need to fill 200 TB on Drive ASAP and my rclone transfers are very slow.

I built a GCP VM to run some Remote1: to Remote1:Data copies, but it doesn't appear to be bypassing the limit.

