This is the end result of what I posted yesterday.
I am intentionally not calling this a "guide" since it is not written as a "how to do this for everyone". It is how I did it. It should help others, though.
If you prefer it inline, here is the automatic markdown export:
Using rclone to "extract" Backblaze Zip Snapshots and Reupload to B2
This is a demonstration of how I use rclone to expand the contents of a Backblaze snapshot on B2 into another B2 bucket.
Questions and Answers
Who is this for?
Me! Seriously, I wrote this for my own recall/notes in the future but I thought I'd share it
To really answer the question, this is for people who want to do something similar and can use this as a guide. It is not a "tool" per se. It is not designed to be an easy or user-friendly process.
I use Python to do it on a VPS. Python is super readable so it should be easy enough to (lightly) customize if you don't know Python. I would say this demonstration is for people who are willing to play around and learn it. It is not turn-key.
Can I use Windows?
No idea! I am using a Debian VPS and my restore was from a macOS backup. I suspect any tool would work.
What software do you need?
You need rclone and FUSE so you can rclone mount. This is not a guide on either of those.
I am also assuming you've already set up rclone with B2 and/or an additional remote.
I also use the awesome tqdm library, but you can ignore that if you don't want it.
Will this cost money?
Yes! You will be downloading from B2 so you pay egress. It is also very inefficient. I have no idea how bad, but I'd imagine it isn't great! So expect to pay for more egress than the data you actually restore.
Why do this vs downloading the entire zip file?
My test restore is small but my main use is for a 200+ GB restore. I want to use my VPS's bandwidth but my VPS is small! (~10 GB free). So while I am paying more for egress than, say, downloading the restore (and especially more than if I were to request a USB drive or download right from Backblaze Personal), it saves me the bandwidth.
What is this for?
Besides just putting it into its own B2 bucket, this process is useful to seed a different backup tool (including rclone, but really any)
Can I filter it?
Yes! There are two places. The first and best is to filter which files you include. The second is with rclone filters, but I do not suggest that since you waste the time and expense of extracting files that never get uploaded.
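A minimal sketch of the first kind of filtering, assuming the entries come from ZipFile.infolist() as they do further down. The keep() function, the directory names it skips, and the tiny in-memory zip are all made up for illustration; swap in whatever rules fit your restore.

```python
import io
from zipfile import ZipFile

# Build a tiny in-memory zip so the sketch runs anywhere (illustration only)
buf = io.BytesIO()
with ZipFile(buf, 'w') as zf:
    zf.writestr('Macintosh HD/Users/me/Documents/notes.txt', 'keep me')
    zf.writestr('Macintosh HD/Users/me/Library/Caches/blob.bin', 'skip me')
    zf.writestr('Macintosh HD/Users/me/Pictures/.DS_Store', 'skip me too')

def keep(info):
    """Hypothetical filter: skip directory entries, caches and .DS_Store files."""
    if info.is_dir():
        return False
    parts = info.filename.split('/')
    if 'Caches' in parts:
        return False
    return parts[-1] != '.DS_Store'

with ZipFile(buf) as zf:
    files = zf.infolist()

# Filtering the listing costs nothing: no file data has been read yet
filtered = [f for f in files if keep(f)]
kept_names = [f.filename for f in filtered]
print(kept_names)
```

Since this only touches the listing, anything filtered out here is never extracted and never downloaded from the mount.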
This could be done better
I bet! Please share. I like learning new things. This is just what I worked out!
    import os,sys
    import shutil
    import time
    import subprocess
    import operator
    import signal
    from pathlib import Path
    from zipfile import ZipFile
    from tqdm import tqdm # This is 3rd party. $ python -m pip install tqdm

    rclone v1.53.3
    - os/arch: linux/amd64
    - go version: go1.15.5
    3.8.3 (default, Jul 2 2020, 16:21:59)
    [GCC 7.3.0]
Mount the restore bucket
Here we mount the restore bucket. Note, do not add any caching unless you have the scratch space. Since my restore is bigger than my free space, I do not! This is basically a super vanilla rclone mount. In fact, when I tested with different advanced options, it failed.
There are two ways to do this. The first is to use a new terminal and create the mount there. That works fine but I will instead do it all within Python and subprocess. With subprocess, the arguments are passed as a list. This is actually really great since you do not have to deal with escaping. And it's easier to comment! If you do run it in a separate terminal, screen is your friend.
    mountdir = Path('~/mount').expanduser()
    mountdir.mkdir(exist_ok=True)
    rclone_remote = 'b2:b2-snapshots-7f7799daad93/' # already set up B2. Found the bucket with `rclone lsf b2:`
    restore_zip = 'bzsnapshot_2020-12-17-07-06-19.zip' # found with `rclone lsf b2:b2-snapshots-7f7799daad93/`

    cmd = ['rclone',
           '-vv', # Optional but may be useful later
           'mount',rclone_remote,str(mountdir),
           '--read-only',]
    stdout,stderr = open('stdout','wb'),open('stderr','wb') # writable in bytes mode. I usually use context managers but I will need this to stay open
    mount_proc = subprocess.Popen(cmd,stdout=stdout,stderr=stderr)
Make sure it mounted. This is optional
    print('Waiting for mount ',flush=True)
    for ii in range(10):
        if os.path.ismount(mountdir):
            break
        if mount_proc.poll() is not None:
            raise ValueError('did not mount')
        time.sleep(1)
        print('.',end='',flush=True)
    else:
        print('ERROR: Mount did not activate. Kill proc and exiting',file=sys.stderr,flush=True)
        mount_proc.kill()
        sys.exit(2)
    print('mounted')
Waiting for mount ......mounted
Browse the Zip
zipfile will not read the entire file in order to get a listing or even some random file inside. Don't believe me? See the bottom!
What we need to do now is get a list of the files and use manual inspection to decide what to cut. Backblaze stores every file under its full path.

    with ZipFile(mountdir/restore_zip) as zf:
        files = zf.infolist() # could also do namelist() but we will want the sizes later
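To help with the manual inspection, here is a rough sketch that totals the uncompressed sizes by leading directory so you can see where the bulk is. The size_by_dir helper and the depth parameter are my own illustration, and the in-memory zip just stands in for the mounted snapshot.

```python
import io
from collections import Counter
from zipfile import ZipFile

# Stand-in archive; with the real thing, `files` is the infolist() of the snapshot
buf = io.BytesIO()
with ZipFile(buf, 'w') as zf:
    zf.writestr('Macintosh HD/Users/me/Documents/a.txt', 'a' * 10)
    zf.writestr('Macintosh HD/Users/me/Documents/b.txt', 'b' * 20)
    zf.writestr('Macintosh HD/Users/me/Pictures/c.jpg', 'c' * 300)

with ZipFile(buf) as zf:
    files = zf.infolist()

def size_by_dir(infos, depth):
    """Sum uncompressed sizes, grouped by the first `depth` path components."""
    totals = Counter()
    for info in infos:
        key = '/'.join(info.filename.split('/')[:depth])
        totals[key] += info.file_size
    return totals

totals = size_by_dir(files, depth=4)
for name, size in totals.most_common():  # largest first
    print(f'{size:>12d}  {name}')
```

Zip entry names always use forward slashes, so splitting on '/' is safe regardless of the OS that made the backup.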
Pick a random file to get the path. We will use this later
Identify and save the prefix you want removed.
restore_prefix = 'Macintosh HD/Users/jwinkMAC/' # We will need this later to reupload. This s
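If you would rather not eyeball the prefix, one option is to let Python suggest a candidate. This is only a sketch: the names list is invented, and in practice you would feed it the filenames from the zip listing and still sanity-check the result by hand.

```python
import posixpath

# Invented example names; zip entries use '/' so posixpath is the right tool
names = [
    'Macintosh HD/Users/me/Documents/a.txt',
    'Macintosh HD/Users/me/Pictures/2020/c.jpg',
    'Macintosh HD/Users/me/notes.md',
]

# Longest directory path shared by every entry, with a trailing slash added
candidate_prefix = posixpath.commonpath(names) + '/'
print(candidate_prefix)
```

Two caveats: with a single entry, commonpath returns the file itself rather than its directory, and one stray file outside your home directory will shorten the result considerably.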
Restore a single file!
This is actually super easy! Just search through files to find the file you want. Let's assume it is still the 1000th file.

    restore_file = files[1000]
    restore_dir = Path('~/restore').expanduser()
    with ZipFile(mountdir/restore_zip) as zf:
        zf.extract(restore_file,path=str(restore_dir))
Inside the zip file is the full prefixed file (from root). I don't want that.

    # Optional. Remove prefix
    src = restore_dir / restore_file.filename
    dst = restore_dir / os.path.relpath(src,restore_dir / restore_prefix)
    dst.parent.mkdir(parents=True,exist_ok=True)
    shutil.move(src,dst)
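An alternative to extract-then-move is to stream the member straight to its final, prefix-free location. This is a sketch rather than what I ran: extract_stripped is a made-up helper, the prefix is a placeholder, and the in-memory zip stands in for the mounted snapshot.

```python
import io
import shutil
import tempfile
from pathlib import Path, PurePosixPath
from zipfile import ZipFile

restore_prefix = 'Macintosh HD/Users/me/'  # placeholder prefix for this sketch

buf = io.BytesIO()
with ZipFile(buf, 'w') as zf:
    zf.writestr(restore_prefix + 'Documents/notes.txt', 'hello')
    zf.writestr(restore_prefix + 'Documents/other.txt', 'world')

def extract_stripped(zf, info, prefix, dest_root):
    """Write one zip member under dest_root with `prefix` removed from its path."""
    # relative_to raises ValueError if the member is not under the prefix
    rel = PurePosixPath(info.filename).relative_to(prefix)
    dst = Path(dest_root) / rel
    dst.parent.mkdir(parents=True, exist_ok=True)
    with zf.open(info) as src, open(dst, 'wb') as out:
        shutil.copyfileobj(src, out)  # streams in chunks, no full read into memory
    return dst

restore_dir = Path(tempfile.mkdtemp())
with ZipFile(buf) as zf:
    # Search the listing for the file we want by the end of its name
    matches = [f for f in zf.infolist() if f.filename.endswith('/notes.txt')]
    restored = extract_stripped(zf, matches[0], restore_prefix, restore_dir)

print(restored.relative_to(restore_dir))
```

Unlike ZipFile.extract, this does not sanitize member names (for example '..' components), so I would only use it on archives I trust.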
Extract and Upload
Now, this could almost certainly use improvement. We will do the following:
- Gather files up to the max batch size. Then for each batch:
  - Delete the restore directory
  - Restore the batched files
  - Do an rclone move (or sync) to push those files
    - Need to make the source at the restore_prefix so we do not keep that junk
Note that we may be able to optimize this by better backfilling the batches, but I am not sure if there is any advantage to reading sequentially, so I will go one file after the other. It may be moot.
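For the curious, here is roughly what "backfilling" could look like: a greedy first-fit pass that drops each file into the first batch that still has room. This is not what I ran. first_fit_batches is a hypothetical helper, and it gives up both the generator behavior and the in-archive read order.

```python
def first_fit_batches(seq, maxsize, key=None):
    """Greedy first-fit: put each item in the first batch with room for it."""
    batches = []  # each entry is [running_total, [items]]
    for item in seq:
        size = key(item) if callable(key) else item
        for batch in batches:
            if batch[0] + size <= maxsize:
                batch[0] += size
                batch[1].append(item)
                break
        else:
            # No existing batch has room (or the item alone exceeds maxsize)
            batches.append([size, [item]])
    return [tuple(items) for _, items in batches]

sizes = [10, 20, 10, 90, 40, 50, 99, 2, 101, 0, 30, 90, 11]
packed = first_fit_batches(sizes, 100)
print(len(packed), packed)
```

On the example list from the group_to_size docstring below, this yields 6 batches where the sequential grouping yields 9. Fewer batches means fewer rclone calls, but whether jumping around the zip hurts read performance over the mount is exactly the part I have not measured.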
    # Tool to gather the files into batches
    def group_to_size(seq,maxsize,key=None):
        """
        Group seq by size up to but not to exceed maxsize (unless a single item does)

        Example:
        >>> list(group_to_size([10,20,10,90,40,50,99,2,101,0,30,90,11],100))
        [(10, 20, 10), (90,), (40, 50), (99,), (2,), (101,), (0, 30), (90,), (11,)]
        """
        s = 0
        curr = []
        for item in seq:
            s0 = key(item) if callable(key) else item
            if curr and s + s0 > maxsize: # Yield if will be pushed over (but never yield an empty batch)
                yield tuple(curr)
                curr = []
                s = 0
            s += s0
            curr.append(item)
        if curr:
            yield tuple(curr) # Anything remaining
    maxsize = 512 * 1024 * 1024 # 512 mb or 536870912 bytes
    # dest_remote = 'b2:mynewbuckets/whatever'
    dest_remote = '/home/jwink3101/restore/tmp/'
    scratch = Path('~/scratch').expanduser().absolute()
    scratch.mkdir(parents=True,exist_ok=True)
    # This is where you can filter stuff
    # filtered = (f for f in files if ...)
    filtered = files # No filter
    batches = group_to_size(filtered,maxsize,key=operator.attrgetter('file_size'))
    with ZipFile(mountdir/restore_zip) as zf:
        for ib,batchfiles in enumerate(batches):
            print('batch',ib,'# files',len(batchfiles))
            # Extract all of the files
            for file in tqdm(batchfiles):
                zf.extract(file,path=str(scratch))
            print('calling rclone')
            cmd = ['rclone',
                   'move', # use move so they get deleted
                   str(scratch / restore_prefix),
                   dest_remote,
                   '--transfers','20', # and/or other flags. all optional.
                   ]
            subprocess.check_call(cmd)
    0%| | 0/185 [00:00<?, ?it/s]
    batch 0 # files 185
    100%|██████████| 185/185 [00:30<00:00, 6.16it/s]
    0%| | 0/146 [00:00<?, ?it/s]
    calling rclone
    batch 1 # files 146
    100%|██████████| 146/146 [00:32<00:00, 4.47it/s]
    0%| | 0/97 [00:00<?, ?it/s]
    calling rclone
    batch 2 # files 97
    100%|██████████| 97/97 [00:29<00:00, 3.24it/s]
    calling rclone
    1%| | 2/187 [00:00<00:12, 15.09it/s]
    batch 3 # files 187
    100%|██████████| 187/187 [00:31<00:00, 5.87it/s]
    0%| | 0/126 [00:00<?, ?it/s]
    calling rclone
    batch 4 # files 126
    100%|██████████| 126/126 [00:28<00:00, 4.39it/s]
    0%| | 0/141 [00:00<?, ?it/s]
    calling rclone
    batch 5 # files 141
    100%|██████████| 141/141 [00:27<00:00, 5.05it/s]
    calling rclone
    0%| | 0/78 [00:00<?, ?it/s]
    batch 6 # files 78
    100%|██████████| 78/78 [00:15<00:00, 4.90it/s]
    0%| | 0/54 [00:00<?, ?it/s]
    calling rclone
    batch 7 # files 54
    100%|██████████| 54/54 [00:29<00:00, 1.84it/s]
    0%| | 0/138 [00:00<?, ?it/s]
    calling rclone
    batch 8 # files 138
    100%|██████████| 138/138 [00:29<00:00, 4.68it/s]
    0%| | 0/730 [00:00<?, ?it/s]
    calling rclone
    batch 9 # files 730
    100%|██████████| 730/730 [00:29<00:00, 24.90it/s]
    0%| | 0/101 [00:00<?, ?it/s]
    calling rclone
    batch 10 # files 101
    100%|██████████| 101/101 [00:30<00:00, 3.30it/s]
    0%| | 0/137 [00:00<?, ?it/s]
    calling rclone
    batch 11 # files 137
    100%|██████████| 137/137 [00:28<00:00, 4.84it/s]
    0%| | 0/29 [00:00<?, ?it/s]
    calling rclone
    batch 12 # files 29
    100%|██████████| 29/29 [00:11<00:00, 2.58it/s]
    calling rclone
    mount_proc.send_signal(signal.SIGINT)
    mount_proc.wait() # Hopefully this works. Otherwise you may need to kill it manually
    stdout.close()
    stderr.close()
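If "hopefully this works" makes you nervous, a slightly more defensive shutdown could look like the sketch below. stop_process is a hypothetical helper, and a sleeping Python child stands in for the rclone mount so it can run without rclone installed.

```python
import signal
import subprocess
import sys

def stop_process(proc, timeout=30):
    """Send SIGINT, wait up to `timeout` seconds, then kill as a last resort."""
    proc.send_signal(signal.SIGINT)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()   # did not exit in time, force it
        proc.wait()   # reap the process so it does not linger as a zombie
    return proc.returncode

# Stand-in for the rclone mount process (assumption: any long-running child will do)
dummy = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])
returncode = stop_process(dummy, timeout=5)
print('exited with', returncode)
```

If a real mount had to be killed, the mount point may be left stale; on Linux, fusermount -u on the mount directory is the usual manual cleanup.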
Python's ZipFile will read into a zip file without reading the entire file. It does need to "seek" in the file, hence the mount, but rclone handles that like a champ.
How do I know I'm not downloading the entire file? Well, you could look at the rclone logs. The other way is to make a file-object that will be verbose about what's going on. Note that ZipFile takes either a filename or a file-like object.
    import io
    class VerboseFile(io.FileIO):
        def read(self,*args,**kwargs):
            print('read',*args,**kwargs)
            r = super(VerboseFile,self).read(*args,**kwargs)
            print(' len:',len(r))
            return r
        def seek(self,*args,**kwargs):
            print('seek',*args,**kwargs)
            return super(VerboseFile,self).seek(*args,**kwargs)
        def close(self,*args,**kwargs):
            print('close')
            return super(VerboseFile,self).close(*args,**kwargs)
Then, instead of

    with ZipFile(mountdir/restore_zip) as zf:
        ...

do

    with ZipFile(VerboseFile(mountdir/restore_zip)) as zf:
        ...

and you'll be able to see everything.
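If you want numbers rather than a log, here is a self-contained sketch of the same idea that counts the bytes actually read. CountingFile is my own illustration, and the roughly 2 MB in-memory archive stands in for the mounted snapshot.

```python
import io
import os
from zipfile import ZipFile, ZIP_STORED

class CountingFile(io.BytesIO):
    """In-memory file that tallies how many bytes are actually read."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bytes_read = 0

    def read(self, *args, **kwargs):
        data = super().read(*args, **kwargs)
        self.bytes_read += len(data)
        return data

# Build an uncompressed archive of 20 files, 100 kB each
raw = io.BytesIO()
with ZipFile(raw, 'w', ZIP_STORED) as zf:
    for i in range(20):
        zf.writestr(f'file{i:02d}.bin', os.urandom(100_000))

counting = CountingFile(raw.getvalue())
total_size = len(raw.getvalue())

with ZipFile(counting) as zf:
    infos = zf.infolist()                  # only needs the central directory
    listing_bytes = counting.bytes_read
    payload = zf.read(infos[7])            # pull one member out of the middle
    one_file_bytes = counting.bytes_read - listing_bytes

print(f'archive: {total_size} bytes')
print(f'read for listing: {listing_bytes} bytes')
print(f'read for one file: {one_file_bytes} bytes')
```

The listing should cost a tiny fraction of the archive, and extracting one member should cost about that member's size plus a small header. Over an rclone mount the actual download may be somewhat larger because of read-ahead and chunking, which is part of why I expect the egress to be inefficient.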