Working with zip files on a mount using Python

jwink3101 · December 17, 2020, 10:27pm

This is the end result of what I posted yesterday.

I am intentionally not calling this a "guide" since it is not written as a "how to do this for everyone". It is how I did it. It should help others, though.

If you prefer it inline, here is the automatic markdown export:

Using rclone to "extract" Backblaze Zip Snapshots and Reupload to B2

This is a ~~guide~~ demonstration of how I use rclone to expand the contents of a Backblaze snapshot on B2 into another B2 bucket.

Questions and Answers

Who is this for?

Me! Seriously, I wrote this for my own recall/notes in the future, but I thought I'd share it.

To really answer the question, this is for people who want to do something similar and can use this as a guide. It is not a "tool" per se. It is not designed to be an easy or user-friendly process.

I use Python to do it on a VPS. Python is super readable so it should be easy enough to (lightly) customize if you don't know Python. I would say this demonstration is for people who are willing to play around and learn it. It is not turn-key.

Can I use Windows?

No idea! I am using a Debian VPS and my restore was from a macOS backup. I suspect any tool would work.

What software do you need?

You need rclone and FUSE so you can rclone mount. This is not a guide on either of those.

I am also assuming you've already set up rclone with B2 and/or an additional remote.

I also use the awesome tqdm library but you can ignore that if you don't want it.

Will this cost money?

Yes! You will be downloading from B2 so you pay egress. It is also very inefficient. I have no idea how bad but I'd imagine it isn't great! So expect to pay for more egress than the data you actually restore.

Why do this vs downloading the entire zip file?

My test restore is small but my main use is for a 200+ GB restore. I want to use my VPS's bandwidth but my VPS is small (~10 GB free)! So while I am paying more for egress than, say, downloading the restore (and especially more than if I were to request a USB drive or download right from Backblaze Personal), it saves me the bandwidth.

What is this for?

Besides just putting it into its own B2 bucket, this process is useful to seed a different backup tool (including rclone, but really any).

Can I filter it?

Yes! There are two places. The first and best is to filter which files you include. The second is with rclone filters but I do not suggest that as you waste the time and expense to extract the files.
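The first kind of filter (choosing which zip entries to extract at all) might look like the sketch below. The `keep` helper, the example paths, and the `.DS_Store` rule are all made up for illustration; the tiny in-memory zip only stands in for the snapshot on the mount so the snippet runs on its own.

```python
import io
from zipfile import ZipFile

def keep(info, prefix, skip_suffixes=('.DS_Store',)):
    """Return True if this ZipInfo entry should be restored (hypothetical rule)."""
    name = info.filename
    if info.is_dir():                 # directory entries carry no data
        return False
    if not name.startswith(prefix):   # only restore what lives under prefix
        return False
    return not name.endswith(skip_suffixes)

# Build a tiny in-memory zip so the sketch runs without a mount
buf = io.BytesIO()
with ZipFile(buf, 'w') as zf:
    zf.writestr('Macintosh HD/Users/me/Documents/a.txt', 'a')
    zf.writestr('Macintosh HD/Users/me/Documents/.DS_Store', 'x')
    zf.writestr('Macintosh HD/Users/me/Music/b.mp3', 'b')

with ZipFile(buf) as zf:
    files = zf.infolist()

filtered = [f for f in files if keep(f, 'Macintosh HD/Users/me/Documents/')]
print([f.filename for f in filtered])
```

Because the decision is made from the listing alone, nothing that gets filtered out is ever read from the remote.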

This could be done better

I bet! Please share. I like learning new things. This is just what I worked out!

import os,sys
import shutil
import time
import subprocess
import operator
import signal
from pathlib import Path
from zipfile import ZipFile

from tqdm import tqdm # This is 3rd party. $ python -m pip install tqdm

print(subprocess.check_output(['rclone','version']).decode())
print(sys.version)

rclone v1.53.3
- os/arch: linux/amd64
- go version: go1.15.5

3.8.3 (default, Jul  2 2020, 16:21:59)
[GCC 7.3.0]

Mount the restore bucket

Here we mount the restore bucket. Note, do not add any caching unless you have the scratch space. Since my restore is bigger than my free space, I do not! This is basically a super vanilla rclone mount. In fact, when I tested with different advanced options, it failed.

There are two ways to do this. The first is to use a new terminal and create the mount there. That works fine but I will instead do it all within Python and subprocess. With subprocess, the arguments are passed as a list. This is actually really great since you do not have to deal with escaping. And it's easier to comment! If you do run it in a separate terminal, screen is your friend.
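To see why the list form avoids escaping, here is a small sketch that does not involve rclone at all. The child process is just Python echoing back its arguments, and the awkward file name is invented for the example.

```python
import shlex
import subprocess
import sys

# A hypothetical argument containing a space and a quote
awkward = "my bucket/it's here.zip"

# The child just echoes back the arguments it received, one per line
child = 'import sys; print("\\n".join(sys.argv[1:]))'
cmd = [sys.executable, '-c', child, awkward, '--flag']

out = subprocess.check_output(cmd).decode().splitlines()
print(out)  # the awkward name arrives as a single, unmodified argument

# For logging only: how the same command would have to be quoted in a shell
print(shlex.join(cmd))
```

Each list element reaches the child as exactly one argument, so no shell quoting is needed. `shlex.join` (Python 3.8+) is only handy if you want to log a copy-pasteable version of the command.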

mountdir = Path('~/mount').expanduser()
mountdir.mkdir(exist_ok=True)

rclone_remote = 'b2:b2-snapshots-7f7799daad93/' # already set up B2. Found the bucket with `rclone lsf b2:`
restore_zip = 'bzsnapshot_2020-12-17-07-06-19.zip' # found with `rclone lsf b2:b2-snapshots-7f7799daad93/`

cmd = ['rclone',
       '-vv', # Optional but may be useful later
       'mount',rclone_remote,str(mountdir),
       '--read-only',]
stdout,stderr = open('stdout','wb'),open('stderr','wb') # writable in bytes mode. I usually use context managers but I will need this to stay open
mount_proc = subprocess.Popen(cmd,stdout=stdout,stderr=stderr)

Make sure it mounted. This is optional

print('Waiting for mount ',flush=True)
for ii in range(10):
    if os.path.ismount(mountdir):
        break
    if mount_proc.poll() is not None:
        raise ValueError('did not mount')
    time.sleep(1)
    print('.',end='',flush=True)
else:
    print('ERROR: Mount did not activate. Killing proc and exiting',file=sys.stderr,flush=True)
    mount_proc.kill()
    sys.exit(2)
print('mounted')

Waiting for mount 
......mounted

Browse the Zip

Python's zipfile will not read the entire file in order to get a listing or even some random file inside. Don't believe me? See the bottom!

What we need to do now is get a list of the files and use manual inspection to decide what to cut. Backblaze stores each file under its full path.

with ZipFile(mountdir/restore_zip) as zf:
    files = zf.infolist() # could also do namelist() but we will want the sizes later

len(files)

Pick a random file to get the path. We will use this later

files[1000].filename

'Macintosh HD/Users/jwinkMAC/PyFiSync/Papers/Sorted/2010/2018/melchers2018structural.pdf'

Identify and save the prefix as you want it removed

restore_prefix = 'Macintosh HD/Users/jwinkMAC/' # We will need this later to reupload. This s

Restore a single file!

This is actually super easy! Just search through files to find the file you want. Let's assume it is still the 1000th file.

restore_file = files[1000]

restore_dir = Path('~/restore').expanduser()
with ZipFile(mountdir/restore_zip) as zf:
    zf.extract(restore_file,path=str(restore_dir))

Inside the zip file is the full prefixed file (from root). I don't want that

# Optional. Remove prefix
src = restore_dir / restore_file.filename
dst = restore_dir / os.path.relpath(src,restore_dir / restore_prefix)
dst.parent.mkdir(parents=True,exist_ok=True)
shutil.move(src,dst)

PosixPath('/home/jwink3101/restore/PyFiSync/Papers/Sorted/2010/2018/melchers2018structural.pdf')
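One thing to watch with `os.path.relpath` is that a file outside the prefix does not raise an error; it returns a path starting with `../`, which would place the file outside the restore directory. A possible alternative, sketched here with a hypothetical `strip_prefix` helper, is `PurePosixPath.relative_to`, which raises `ValueError` instead. Zip entry names always use forward slashes, so the POSIX flavor fits regardless of the OS.

```python
from pathlib import PurePosixPath

def strip_prefix(name, prefix):
    """Return `name` relative to `prefix`, or None if it is not under it."""
    try:
        return PurePosixPath(name).relative_to(prefix)
    except ValueError:
        return None

restore_prefix = 'Macintosh HD/Users/jwinkMAC/'
name = 'Macintosh HD/Users/jwinkMAC/PyFiSync/Papers/Sorted/2010/2018/melchers2018structural.pdf'

print(strip_prefix(name, restore_prefix))
print(strip_prefix('Macintosh HD/Library/x.plist', restore_prefix))  # outside the prefix
```

Entries that come back as `None` can then be skipped or handled separately rather than silently landing somewhere unexpected.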

Extract and Upload

Now, this could almost certainly use improvement. We will do the following:

Gather files up to the max batch size. Then for each batch:
- Restore the batched files into the scratch directory
- Do an rclone move (not sync) to push those files. Using move rather than copy also deletes them from scratch, which clears the space for the next batch
  - Need to make the source at the restore_prefix so we do not keep that junk

Note that we may be able to optimize this by better backfilling the batches, but I am not sure if there are any advantages to sequential reading, so I will go one file after the other. It may be moot.
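For reference, the backfilling idea could be sketched as a first-fit-decreasing packer. This is not what the rest of this post uses; `backfill_batches` is a hypothetical name, and it deliberately gives up the file order of the zip, which may or may not matter for read performance over the mount.

```python
def backfill_batches(seq, maxsize, key=None):
    """First-fit decreasing: put each item (largest first) in the first batch with room."""
    size = key if callable(key) else (lambda x: x)
    batches = []  # each entry is [total_size, [items]]
    for item in sorted(seq, key=size, reverse=True):
        s = size(item)
        for batch in batches:
            if batch[0] + s <= maxsize:
                batch[0] += s
                batch[1].append(item)
                break
        else:
            # No existing batch had room (or the item alone exceeds maxsize)
            batches.append([s, [item]])
    return [tuple(items) for _, items in batches]

sizes = [10, 20, 10, 90, 40, 50, 99, 2, 101, 0, 30, 90, 11]
for b in backfill_batches(sizes, 100):
    print(sum(b), b)
```

On this sample list it needs 6 batches where grouping in order needs 9. Fewer batches means fewer rclone calls, at the price of jumping around inside the zip.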

# Tool to gather the files into batches
def group_to_size(seq,maxsize,key=None):
    """
    Group seq by size up to but not to exceed
    maxsize (unless a single item does)

    Example:
        >>> list(group_to_size([10,20,10,90,40,50,99,2,101,0,30,90,11],100))
        [(10, 20, 10), (90,), (40, 50), (99,), (2,), (101,), (0, 30), (90,), (11,)]

    """
    s = 0
    curr = []
    for item in seq:
        s0 = key(item) if callable(key) else item
        # Yield if this item would push the batch over. The `curr` check
        # avoids yielding an empty batch when the very first item is oversize
        if curr and s + s0 > maxsize:
            yield tuple(curr)
            curr = []
            s = 0
        s += s0
        curr.append(item)
    if curr:
        yield tuple(curr) # Anything remaining

maxsize = 512 * 1024 * 1024 # 512 MiB or 536870912 bytes

# dest_remote = 'b2:mynewbuckets/whatever'
dest_remote = '/home/jwink3101/restore/tmp/'

scratch = Path('~/scratch').expanduser().absolute()
scratch.mkdir(parents=True,exist_ok=True)

# This is where you can filter stuff
# filtered = (f for f in files if ...)

filtered = files # No filter

batches = group_to_size(filtered,maxsize,key=operator.attrgetter('file_size'))
with ZipFile(mountdir/restore_zip) as zf:
    for ib,batchfiles in enumerate(batches):
        print('batch',ib,'# files',len(batchfiles))
        # Extract all of the files
        for file in tqdm(batchfiles):
            zf.extract(file,path=str(scratch))
        
        print('calling rclone')
        
        cmd = ['rclone',
               'move', # use move so they get deleted
               str(scratch / restore_prefix), dest_remote,
               '--transfers','20', # and/or other flags. all optional.
              ]
        subprocess.check_call(cmd)

batch 0 # files 185
100%|██████████| 185/185 [00:30<00:00,  6.16it/s]
calling rclone
batch 1 # files 146
100%|██████████| 146/146 [00:32<00:00,  4.47it/s]
calling rclone
batch 2 # files 97
100%|██████████| 97/97 [00:29<00:00,  3.24it/s]
calling rclone
batch 3 # files 187
100%|██████████| 187/187 [00:31<00:00,  5.87it/s]
calling rclone
batch 4 # files 126
100%|██████████| 126/126 [00:28<00:00,  4.39it/s]
calling rclone
batch 5 # files 141
100%|██████████| 141/141 [00:27<00:00,  5.05it/s]
calling rclone
batch 6 # files 78
100%|██████████| 78/78 [00:15<00:00,  4.90it/s]
calling rclone
batch 7 # files 54
100%|██████████| 54/54 [00:29<00:00,  1.84it/s]
calling rclone
batch 8 # files 138
100%|██████████| 138/138 [00:29<00:00,  4.68it/s]
calling rclone
batch 9 # files 730
100%|██████████| 730/730 [00:29<00:00, 24.90it/s]
calling rclone
batch 10 # files 101
100%|██████████| 101/101 [00:30<00:00,  3.30it/s]
calling rclone
batch 11 # files 137
100%|██████████| 137/137 [00:28<00:00,  4.84it/s]
calling rclone
batch 12 # files 29
100%|██████████| 29/29 [00:11<00:00,  2.58it/s]
calling rclone

Unmount

mount_proc.send_signal(signal.SIGINT)
mount_proc.wait() # Hopefully this works. Otherwise you may need to kill it manually
stdout.close()
stderr.close()
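If the "hopefully this works" part worries you, one option is to wait with a timeout and escalate. The sketch below uses a hypothetical `stop_process` helper and a sleeping Python process as a stand-in for the mount so that it runs without rclone. The 30 second default is an arbitrary choice.

```python
import signal
import subprocess
import sys

def stop_process(proc, timeout=30):
    """Ask `proc` to exit with SIGINT; escalate to kill() if it does not."""
    if proc.poll() is None:               # still running
        proc.send_signal(signal.SIGINT)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                       # last resort
        proc.wait()
    return proc.returncode

# Stand-in for the mount process so the sketch runs without rclone
dummy = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])
print('exit code:', stop_process(dummy, timeout=10))
```

With the real mount you would call `stop_process(mount_proc)` before closing the log files. Note that a hard kill can leave a stale mount point behind; on Linux, `fusermount -u` on the mount directory is the usual way to clean that up.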

Additional Notes

ZipFile

Python's ZipFile will read into a zip file without reading the entire file. It does need to "seek" in the file, hence the mount, but rclone handles that like a champ.

How do I know I'm not downloading the entire file? Well, you could look at the rclone logs. The other way is to make a file object that is verbose about what's going on. Note that ZipFile takes either a filename or a file-like object.

import io
class VerboseFile(io.FileIO):
    """FileIO that prints every read, seek, and close."""
    # Arguments are positional only. Forwarding **kwargs to print()
    # would raise a TypeError for anything print() does not accept
    def read(self,*args):
        print('read',*args)
        r = super().read(*args)
        print('  len:',len(r))
        return r
    def seek(self,*args):
        print('seek',*args)
        return super().seek(*args)
    def close(self):
        print('close')
        return super().close()

Then, instead of

with ZipFile(mountdir/restore_zip) as zf:
    ...

do

with ZipFile(VerboseFile(mountdir/restore_zip)) as zf:
    ...

and you'll be able to see everything.
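If you would rather have a number than a stream of prints, the same idea can count bytes. This is a rough sketch on an in-memory archive, so it runs without a mount; `CountingBytesIO` and the file names are made up for the example, and the archive is stored uncompressed with random data so that sizes are easy to reason about.

```python
import io
import os
from zipfile import ZipFile, ZIP_STORED

class CountingBytesIO(io.BytesIO):
    """BytesIO that tallies how many bytes read() hands back."""
    bytes_read = 0
    def read(self, *args):
        data = super().read(*args)
        self.bytes_read += len(data)
        return data

# Build a ~1 MiB archive of 16 files in memory
raw = io.BytesIO()
with ZipFile(raw, 'w', ZIP_STORED) as zf:
    for i in range(16):
        zf.writestr(f'file{i:02d}.bin', os.urandom(64 * 1024))
total = len(raw.getvalue())

fobj = CountingBytesIO(raw.getvalue())
with ZipFile(fobj) as zf:
    names = zf.namelist()
    after_listing = fobj.bytes_read       # only the directory at the end was read
    zf.read('file07.bin')
    after_one_file = fobj.bytes_read

print('archive size     :', total)
print('read for listing :', after_listing)
print('read for one file:', after_one_file - after_listing)
```

The listing should cost only a tiny fraction of the archive, and pulling one member should cost roughly that member's size plus a small header. Over an rclone mount the actual download may be somewhat larger, since rclone reads in its own chunk sizes.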

asdffdsa · December 17, 2020, 10:39pm

hi,
thanks for the post, i like it. but i think you might want to tweak it

i would change the title and the post, to generalize it.

title does not mention the main point, working with zip files on a mount.
remove focus on b2, as it is really about a mount, really does not matter what the backend is.

working with zip files on a mount using Python

jwink3101 · December 17, 2020, 10:44pm

Good point. Changed the title.

I am not really interested in removing the focus on B2. Again, demonstration and not tutorial. But the point is noted.
