Using mount to read a large zip file (with Python)

I am not sure if this is on topic or off but I am thinking through reading a large zipfile with a read-only mount using Python.

My plan is to use rclone mount with --read-only. I think I can get away with no cache (see below) since it is only reading, although it will be seeking. Part of the problem is that rclone mount with full cache uses sparse files but, at least as I understand it, they are zeroed-out files. (Is that correct? It's how it looks on the file system, but that may be wrong.)
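For anyone following along, here is a minimal sketch of how the mount command described above could be assembled from Python. The remote name and mount point are placeholders, not real paths, and build_mount_command is a hypothetical helper name. The flags used (--read-only and --vfs-cache-mode) are standard rclone mount options.

```python
def build_mount_command(remote, mountpoint, cache_mode="off"):
    """Return the argv list for a read-only rclone mount.

    cache_mode is passed to --vfs-cache-mode; "off" means no file caching.
    """
    return [
        "rclone", "mount", remote, mountpoint,
        "--read-only",
        "--vfs-cache-mode", cache_mode,
    ]

# Placeholder remote and mount point, for illustration only.
cmd = build_mount_command("b2remote:bucket/snapshots", "/mnt/snapshot")
print(" ".join(cmd))
```

The list can be handed to subprocess.Popen to start the mount; rclone mount stays in the foreground until it is stopped, so it is usually run in the background or in a separate terminal.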

And then I will browse and extract via Python's zipfile module.
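Browsing could look roughly like the sketch below. It uses an in-memory archive so it runs anywhere; on the real mount, the path to the zip file would be passed instead. list_members is a hypothetical helper name.

```python
import io
import zipfile

def list_members(zip_source):
    """Return (name, uncompressed_size) for each entry in the archive."""
    with zipfile.ZipFile(zip_source) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]

# Build a small in-memory archive to demonstrate.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("docs/readme.txt", "hello")
    zf.writestr("data/values.bin", b"\x00" * 1024)

members = list_members(demo)
print(members)
```

Listing only needs the archive's directory, not the member data, so it should be cheap even on a slow mount.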

Has anyone else tried this? Does anyone have any experience with these and know what to expect?

BTW, just to avoid an XY problem, I am trying to read one or two files from a large Backblaze Snapshot stored in B2. I am trying to (a) avoid paying egress to download the whole thing and, more importantly, (b) do what I can on a VPS with limited storage.

are you trying to extract a single file that is inside the zipfile?

Yes/No/Maybe

Yes:

This is, at the moment, part of a larger thought experiment. The first goal is to be able to extract a single file, both because I may just want to restore a single file and just to play.

No/Maybe:

Part of the larger goal is to use a Backblaze Personal backup snapshot to seed another backup. Either directly with rclone (will involve some scripting but I would batch a bunch of files to upload) or with some other deduplicating backup (restic, duplicacy, etc).

At the very least, I want a way (even if not particularly efficient) to "extract" a zipfile on B2 without downloading the entire thing at once. This will, of course, still download all of the bytes (and likely with repeats), so it'll cost egress, but I won't need a VPS with 2+ TB of storage.

I am bandwidth limited (both upload and quota), so a VPS would be ideal.
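One way to picture the piecewise extraction is to group entries into batches that fit in the free disk space, then extract, upload, and delete one batch at a time. This is only a sketch under that assumption; batch_members and the 1000-byte limit are made up for the demo.

```python
import io
import zipfile

def batch_members(zf, max_batch_bytes):
    """Yield lists of ZipInfo entries whose total uncompressed size
    stays within max_batch_bytes. An entry larger than the limit
    is yielded in a batch of its own."""
    batch, batch_size = [], 0
    for info in zf.infolist():
        if info.is_dir():
            continue
        if batch and batch_size + info.file_size > max_batch_bytes:
            yield batch
            batch, batch_size = [], 0
        batch.append(info)
        batch_size += info.file_size
    if batch:
        yield batch

# Demo archive: five 400-byte files, batched with a 1000-byte limit.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    for i in range(5):
        zf.writestr(f"f{i}", b"x" * 400)

with zipfile.ZipFile(demo) as zf:
    batches = [[info.filename for info in b] for b in batch_members(zf, 1000)]
print(batches)
```

Each batch can be passed to zf.extractall(target_dir, members=batch), processed, and then removed before the next batch is extracted.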

Additional Playing:

Unrelated to rclone except in that they will play together, I wrote a wrapper in Python so I can see what the ZipFile module is doing. It basically seeks to the end, goes back a few times to find the file listing, and then, if I read a specific file, it seeks there (not directly; there are some jumps, but it still doesn't read the entire file). Honestly, it makes a good case for full cache but, unless I am mistaken, the sparse file format is sparse in that it doesn't download everything but not sparse in that it is a bunch of zeros. That is, can you have an rclone cache of a 2 TB file on a 50 GB disk? (I doubt it!)
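To put a number on that behaviour, here is a small sketch that counts how many bytes ZipFile requests when pulling one member out of a multi-member archive. It uses an in-memory file rather than a mount, and CountingBytesIO is a hypothetical helper, so it only illustrates the access pattern, not rclone itself.

```python
import io
import os
import zipfile

class CountingBytesIO(io.BytesIO):
    """In-memory file that tallies the bytes returned by read()."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bytes_read = 0

    def read(self, *args, **kwargs):
        data = super().read(*args, **kwargs)
        self.bytes_read += len(data)
        return data

# Build an archive of five stored (uncompressed) 200,000-byte members.
member_size = 200_000
plain = io.BytesIO()
with zipfile.ZipFile(plain, "w") as zf:
    for i in range(5):
        zf.writestr(f"file{i}.bin", os.urandom(member_size))
raw = plain.getvalue()
total_size = len(raw)

# Read a single member through the counting wrapper.
counting = CountingBytesIO(raw)
with zipfile.ZipFile(counting) as zf:
    payload = zf.read("file2.bin")

print("archive size:", total_size)
print("bytes read:  ", counting.bytes_read)
```

The bytes read should be roughly one member plus the directory at the end of the archive, well below the archive size.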

Okay, I still would love thoughts and feedback but I did some more playing.

(as an aside, mount with cache didn't do anything on a local file system so I had to make a crypted version so it would be forced to use the cache. Is that expected? Should I file a bug report on that?)

Anyway, first I played with the same Python script on the mounted file system. As expected, the cache file showed (with ls -l) at the full size. But when I run du -h I get the smaller size. So I think I was wrong about it being a bunch of null bytes, and it is actually sparse.
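The same ls -l versus du comparison can be made from Python. This is a sketch that assumes a Unix-like system (st_blocks is not available on Windows) and the usual 512-byte unit for st_blocks; apparent_and_allocated is a hypothetical helper name.

```python
import os
import tempfile

def apparent_and_allocated(path):
    """Return (apparent size, allocated bytes) for a file.

    Apparent size is what ls -l shows; allocated bytes is roughly what du
    reports. A sparse file has allocated bytes well below its apparent size.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512

# Demo: extend an empty temp file to 10 MiB without writing any data.
# Whether this actually creates a hole depends on the file system.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, "wb") as f:
    f.truncate(10 * 1024 * 1024)

apparent, allocated = apparent_and_allocated(demo_path)
print("apparent:", apparent, "allocated:", allocated)
os.remove(demo_path)
```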

At this point, I should mention that I am on a mac but I plan to do this on a Linux VPS. It should be pretty similar though.

Anyway, I then tried something different. My test zip file is about 500 MB. I made a disk image of only 50 MB and used that as the cache!

Well, that seemed to crash rclone! I had to kill -9 it! (Not sure whether it counts as a bug or an edge case. I'm not opposed to opening an issue, but it is likely low priority.)

Finally, I added --vfs-cache-max-size 25M, which stopped rclone from going nuts but made Python unable to read it as a zip file.

If more detail is needed I can provide it. I am not sure this is actually interesting to anyone and seems like an abuse of the mount so it may not even qualify as a bug!

But I will leave with a TL/DR:

  • I may plan my work without caching and accept that it will be inefficient
  • Is there guidance on reading a file larger than the cache capacity? It seems like a bad idea, but can it be done and am I missing something?

yeah, sparse is actually sparse.

that python zipfile, would not use it, would not trust it with valuable backups.
"it currently cannot create an encrypted file.
Decryption is extremely slow as it is implemented in native Python"

i have a python script that, on ms windows, runs rclone, 7zip, fastcopy and enables VSS snapshots.
i have python do a subprocess.run

To be clear I am not using python to do anything but read a zipfile. Encryption is entirely done in rclone. I am just planning to read a zipfile from B2.

At this point, it's mostly been thinking out loud. But now I guess I will just test it. I'll prepare a large restore on Backblaze to B2 and see what I can do. If worst comes to worst, it costs me a bit of egress bandwidth and I crash the VPS, but I've been meaning to destroy it soon anyway.

Thanks for listening. Any feedback or tips would be appreciated

perhaps this
https://github.com/miurahr/py7zr

I sincerely appreciate the help but I am not sure what you're getting at with that suggestion? The zip-file is already created. I know how to read from it without a problem in python. It just comes down to whether or not rclone mount will work well enough with it.

But I really do appreciate the help! (for me and for everyone else in the community)

sorry, i understand now.
what is the size of the zip file?

You need enough cache to have the full file there as that's how the file caching works using sparse files.

If you read the whole thing, the whole file needs enough space to exist.
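As a rough pre-flight check for that worst case, the file size can be compared with the free space on the disk holding the cache. This is a sketch only: cache_can_hold is a hypothetical helper, and the cache directory location is an assumption that depends on how rclone is configured.

```python
import shutil
import tempfile

def cache_can_hold(file_size, cache_dir):
    """Return True if cache_dir's disk has at least file_size bytes free.

    This covers the worst case where the whole file ends up being read.
    """
    free = shutil.disk_usage(cache_dir).free
    return file_size <= free

# Demo against the system temp directory (stand-in for the cache directory).
demo_dir = tempfile.gettempdir()
print(cache_can_hold(25 * 1024**3, demo_dir))
```

Reading only a few members touches only part of the file, so the real requirement can be much smaller than this check suggests.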

i did some quick testing, rclone mount, no vfs cache, read only, worked well but slow, to extract a file.

using cache mode full, rclone only uses the cache for that single file

I also just did a test and it worked well enough!

This is more of a yak-shave to what I really want to do so I have some say. But anyway, I made a 25 GB restore on Backblaze. My VPS has all of 10 GB free, so I knew I would be limited.

I was able to mount the zip file and then use Python to pull out a single file. I only have macOS and Linux, but I wonder if you can use mount with Windows Explorer to read into the zip file? I don't care, but I see that question come up a lot!

BTW, for anyone who wants to play, I suspect there are better ways, but my python code looks like:

# Define a wrapper to see what ZipFile is doing.
# This is 100% optional
import io

class Wrap(io.FileIO):
    """FileIO subclass that logs every read, seek, and close."""

    def read(self, *args, **kwargs):
        # Print the arguments as values; passing **kwargs straight to
        # print() would raise TypeError for unknown keywords.
        print('read', args, kwargs)
        r = super().read(*args, **kwargs)
        print('  len:', len(r))
        return r

    def seek(self, *args, **kwargs):
        print('seek', args, kwargs)
        return super().seek(*args, **kwargs)

    def close(self, *args, **kwargs):
        print('close')
        return super().close(*args, **kwargs)

Then

from zipfile import ZipFile
mounted_file = '/path/to/mount/myfile.zip'
inside_file = 'inside/zip/path.ext' # INSIDE

# Extract
with ZipFile(Wrap(mounted_file)) as zf:
    zf.extract(inside_file)

# This just reads it into memory. Or use this to hash the bytes
with ZipFile(Wrap(mounted_file)) as zf:
    with zf.open(inside_file) as file:
        data = file.read()
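For hashing, a chunked read avoids holding a large member in memory all at once. This is a sketch using an in-memory archive; hash_member, the SHA-256 choice, and the 1 MiB chunk size are my own assumptions, not anything rclone or Backblaze requires.

```python
import hashlib
import io
import zipfile

def hash_member(zf, name, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of one archive member, read in chunks."""
    digest = hashlib.sha256()
    with zf.open(name) as member:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: member.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a small in-memory archive.
payload = b"example data " * 10_000
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("inside/zip/path.ext", payload)

with zipfile.ZipFile(demo) as zf:
    member_hash = hash_member(zf, "inside/zip/path.ext", chunk_size=4096)
print(member_hash)
```

On the mount, the ZipFile would be opened on the mounted path (optionally through the Wrap class above) instead of the in-memory buffer.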

Thanks for the help and/or attention to play along with my testing

i am on windows.