Alternative to Crypt for File Path Blinding

I am looking for thoughts on filename / file path blinding. Whether there's already a solution to this, whether someone has tried it before, etc.

Right now, the options for encryption & blinding seem to favor portability and access over flexibility. Crypt has the option to encrypt filenames section by section. However, this comes with limitations on the length and complexity of filenames.

I am looking for a more flexible solution that will enable me to store files in a blinded manner regardless of what filenames are used locally.

There is a question of where to store the source filename if not in the corresponding remote file's filename (i.e. metadata). In my use case, I do not need to access the remote from multiple locations. Thus, my hunch is it would be feasible to split up the metadata and storage aspects of the file system - to keep the real filename locally in a database and upload the file with a different filename. The remote files would not be useful without the database, but that's not an issue here. As a bonus it would be possible to change the entire file path, avoid storing other metadata with the remote file, etc. because this could be stored separately in the database rather than uploaded as metadata attached to the file.
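As a rough illustration of that split, here is a minimal Go sketch of mapping local paths to random remote names. The function name and the in-memory map are my own; a real implementation would persist the mapping in a database such as SQLite:

```go
package main

import (
	"crypto/rand"
	"encoding/base32"
	"fmt"
)

// blindName returns a random, fixed-length remote name that carries no
// information about the original path. The mapping back to the real
// path must be kept locally; here a plain map stands in for the database.
func blindName() string {
	b := make([]byte, 16) // 128 bits of randomness
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return base32.HexEncoding.WithPadding(base32.NoPadding).EncodeToString(b)
}

func main() {
	db := map[string]string{} // local path -> blinded remote name
	for _, p := range []string{"reports/2020-09/summary.xlsx", "reports/2020-10/summary.xlsx"} {
		db[p] = blindName()
	}
	// Identical base names map to unrelated remote names, and the remote
	// sees a flat namespace with no directory structure.
	for p, n := range db {
		fmt.Printf("%s -> %s\n", p, n)
	}
}
```

As the post says, the remote files would be useless without the database, which is acceptable in this single-location use case.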

There are probably other good solutions to handle this as well. Has anyone seen something similar or something different that would accomplish the same objectives?

Bumping this in the hope that weekday folks have thoughts or suggestions.

So you don't want to use crypt because of the filename length limitations? If not, what's the issue with crypt that you're trying to overcome?

Overcoming the file name length limitation is a big part of it. I cannot control the file names in this use case. I'd also favor something that limits two things mentioned in the documentation -

  • directory structure visible
  • identical file names will have identical uploaded names

Of these, both are potentially important to me, but the second is more critical. I don't want to 'leak' information about cyclical work because a common filename is reused. I suspect, though I didn't explicitly see it in the documentation, that other metadata - such as file creation / modification time - is also leaked (unless it's just not preserved), again providing information like 'every month a file with the same filename appears in a new directory.'

These would be avoided by storing the file metadata separately.

If the remote supports extremely long file names, then the file name limitations probably won't be an issue since the underlying remote supports what you need.

I'm not aware of anything that would help here without writing a new remote to handle the use case or using something like encfs on top of a fuse mount which is very inefficient.

If the remote supports extremely long file names, then the file name limitations probably won't be an issue since the underlying remote supports what you need.

The documentation suggested filenames were limited to ~143 characters. Are you saying that's dependent on the remote? Do you know the calculation behind 143, or how to work it out for another remote? It seems there's more to it than the efficiency and boundaries of base32.

I'm not aware of anything that would help here without writing a new remote to handle the use case or using something like encfs on top of a fuse mount which is very inefficient.

Thanks for the suggestion. I will take a closer look at encfs. I believe it also has a filename length limitation in the default mode, but maybe there is a mode that solves my issue. It does have an option that would solve the same filename in a different directory issue.

When you say it's inefficient - do you know it will be but not how much or is there a performance comparison of encfs on top of a (non-encrypted) rclone mount vs. the crypt backend?

It is dependent on the remote, yes. I think the author was trying to say that if names are limited to 143 characters, they'll be compatible with most remotes, including being able to copy files back down from Google Drive to the local filesystem, since ext4, NTFS, and other local filesystems are limited to 255 characters.
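For what it's worth, here is my reconstruction of the arithmetic behind those limits, assuming (per my reading of the crypt docs, not the source) that names are PKCS#7-padded to 16-byte AES blocks, EME-encrypted (which handles at most 128 blocks), and then base32-encoded without padding:

```go
package main

import "fmt"

// maxCryptNameLen estimates the longest plaintext name crypt can handle
// for a remote whose per-name limit is remoteLimit characters.
func maxCryptNameLen(remoteLimit int) int {
	best := 0
	for ct := 16; ct <= 128*16; ct += 16 { // EME caps input at 128 AES blocks
		encoded := (ct*8 + 4) / 5 // base32 without padding: ceil(8n/5) chars
		if encoded > remoteLimit {
			break
		}
		best = ct
	}
	return best - 1 // PKCS#7 padding consumes at least one byte
}

func main() {
	fmt.Println(maxCryptNameLen(255))  // 143: 255-char local-filesystem limit
	fmt.Println(maxCryptNameLen(7768)) // 2047: drive's limit, capped by EME's 128 blocks
}
```

These match the 143 and 2047 figures that `rclone info` reports for a crypt over local and over drive, which is what makes me think this is the right derivation.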

Well, when rclone was first starting out, I used to use encfs on top of a regular Google Drive mount. There have been significant improvements in rclone since then, so it may have gotten better, but it was pretty dreadfully slow when laid over a Google mount. Nowadays, with chunking and caching, maybe it will work better. YMMV.

If you search for this, you'll find lots of references to lengths on different remotes. Maybe that will help.

https://forum.rclone.org/search?q=rclone%20backend%20length

There are backend commands that can calculate the lengths as well. Be careful using them though as it attempts to create invalid characters and leaves a bunch of garbage behind. :slight_smile:

for example:

ext Crypt:

    rclone info local-crypt:
    2020/09/09 16:05:24 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "position-right-BF-\xbf"
    2020/09/09 16:05:24 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "position-middle-BF-\xbf-"
    2020/09/09 16:05:24 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "position-middle-FE-\xfe-"
    2020/09/09 16:05:24 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "position-right-FE-\xfe"
    2020/09/09 16:05:24 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "\xfe-position-left-FE"
    2020/09/09 16:05:24 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "\xbf-position-left-BF"
    2020/09/09 16:05:24 EME operates on 1 to 128 block-cipher blocks, you passed 513
    2020/09/09 16:05:24 EME operates on 1 to 128 block-cipher blocks, you passed 257
    2020/09/09 16:05:24 EME operates on 1 to 128 block-cipher blocks, you passed 129
    // local-crypt
    stringNeedsEscaping = []rune{
    	'/', '\x00'
    }
    maxFileLength = 143
    canWriteUnnormalized = true
    canReadUnnormalized   = true
    canReadRenormalized   = false
    canStream = true

ext:

    rclone info local:
    2020/09/09 16:04:23 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "position-right-BF-\xbf"
    2020/09/09 16:04:23 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "position-middle-BF-\xbf-"
    2020/09/09 16:04:23 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "position-middle-FE-\xfe-"
    2020/09/09 16:04:23 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "position-right-FE-\xfe"
    2020/09/09 16:04:23 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "\xfe-position-left-FE"
    2020/09/09 16:04:23 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "\xbf-position-left-BF"
    // local
    stringNeedsEscaping = []rune{
    	'/', '\x00'
    }
    maxFileLength = 255
    canWriteUnnormalized = true
    canReadUnnormalized   = true
    canReadRenormalized   = false
    canStream = true

google crypt:

    rclone info zonegd-cryptp:
    2020/09/09 15:55:57 EME operates on 1 to 128 block-cipher blocks, you passed 513
    2020/09/09 15:55:57 EME operates on 1 to 128 block-cipher blocks, you passed 257
    2020/09/09 15:55:57 EME operates on 1 to 128 block-cipher blocks, you passed 129
    // zonegd-cryptp
    stringNeedsEscaping = []rune{
    	'/', '\x00'
    }
    maxFileLength = 2047
    canWriteUnnormalized = true
    canReadUnnormalized   = true
    canReadRenormalized   = false
    canStream = true

google:

    rclone info zonegd:
    // zonegd
    stringNeedsEscaping = []rune{
    	'/', '\a', '\b', '\f', '\n', '\r', '\t', '\u007f', '\v', '\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x0e', '\x0f', '\x10', '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19', '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f'
    }
    maxFileLength = 7768
    canWriteUnnormalized = true
    canReadUnnormalized   = true
    canReadRenormalized   = false
    canStream = true

I'm currently prototyping a new scheme for filename encryption which compresses the file names.

I'm thinking about whether to stop the file names being duplicated. It adds quite a bit of complexity, but it wouldn't require metadata to implement, just caching of the directory listings.

It still leaves the directory structure visible though.


Regarding encFS:

Thanks again for this suggestion.

In normal mode encFS will solve the repeated filename issue (by using filename initialization vector chaining). It will still show the directory structure along with other file metadata. A few downsides:

  • It requires the double mount trick and potential performance impacts.
  • I suspect filename limitations of the host filesystem and the remote will both apply. While some of the remotes support very long names, local filesystems mostly do not, and you are writing to a 'local' mount. Original filenames at the limit of the local filesystem would break, which is a deal breaker.
  • You are copying from whatever your source is to encFS's mounted location with normal tools. This means missing out on some of the rclone functionality (rclone is just in mount mode). Not sure if this is a deal breaker, but I'd certainly like that functionality.

EncFS has a 'reverse' mode that I've not used before but may work better in most remote encrypted backup scenarios. By setting the reverse mount as the source in rclone, you could copy from that directly to the remote using rclone functions. However, filename initialization vector chaining doesn't work in reverse mode, so it won't avoid duplicate filenames.

Other Approaches:

Interesting - I assume you are compressing each name separately not together / with a shared dictionary? What is the caching of directory listings required for? To improve performance or the cache has to be persisted to avoid losing the original file name?

Could you compress (& encrypt) the entire file path as one string to blind the directory structure if you're using a cache for ls? Could this be extended to split the encrypted path into a new directory structure? Breaking it at fixed points in the encrypted string could help with filesystems that support a longer path than filename. For example, 1000 characters of encrypted path data split into 3 levels of 250 character folders and 1 level of 250 character filename.

Thinking through this has also caused me to look at what else is leaked - file size seems to be a concern that's come up before. A round-up option could be implemented in any scheme by padding and storing the original length in a header. A packing approach (discussed as tar-like in some threads) is more easily implemented with some permanent store of metadata (here, which file is in which package).
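A minimal sketch of the round-up-with-header idea. The 8-byte length header and 4 KiB bucket are my own illustrative choices, and in a real scheme the header would sit inside the encrypted stream rather than in the clear:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const bucket = 4096 // pad every file up to a multiple of this (illustrative)

// pad prepends the real length as an 8-byte header and pads the result
// to the next bucket boundary, so the remote only sees rounded sizes.
func pad(data []byte) []byte {
	out := make([]byte, 8+len(data))
	binary.BigEndian.PutUint64(out, uint64(len(data)))
	copy(out[8:], data)
	if rem := len(out) % bucket; rem != 0 {
		out = append(out, make([]byte, bucket-rem)...)
	}
	return out
}

// unpad recovers the original bytes from a padded blob.
func unpad(blob []byte) []byte {
	n := binary.BigEndian.Uint64(blob)
	return blob[8 : 8+n]
}

func main() {
	padded := pad([]byte("monthly report"))
	fmt.Println(len(padded))           // 4096
	fmt.Println(string(unpad(padded))) // monthly report
}
```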

Any thoughts on how much work it would be to create an rclone backend/remote that solves for this situation using the separate metadata database?

Yes that is the plan.

If there isn't a 1:1 mapping of file name <-> encrypted file name then you need to search for the file name in the directory.

This could work and I have considered it. It makes operations like renaming a directory really quite tricky!

Rclone knows the length of the decrypted file by looking at the length of the encrypted file.
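For reference, my understanding of that size relationship from the crypt docs: a 32-byte file header, then 64 KiB plaintext chunks each carrying a 16-byte authenticator. Treat the constants as my reading of the docs, not the source:

```go
package main

import "fmt"

const (
	header    = 32    // file magic + nonce, written once per file
	blockData = 65536 // plaintext bytes per chunk
	blockOver = 16    // authenticator bytes per chunk
)

// encryptedSize returns the on-remote size of a crypt file for a given
// plaintext size: the fixed header plus one authenticator per chunk.
// The inverse of this is how a listing can report decrypted sizes
// without reading any file data.
func encryptedSize(n int64) int64 {
	chunks := n / blockData
	if n%blockData != 0 {
		chunks++
	}
	return header + n + chunks*blockOver
}

func main() {
	fmt.Println(encryptedSize(1))       // 49, matching the docs' 1-byte example
	fmt.Println(encryptedSize(65536))   // 65584: one full chunk
	fmt.Println(encryptedSize(1048576)) // 1048864: 1 MiB plus 16 chunks of overhead
}
```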

If you have to read a header then you will have to read a header out of every file even when listing directories.

So this scheme won't work unless rclone does the full directory caching thing.

Quite a lot of work!

I did experiment with letting rclone export one large chunked file and mounting that with ext3 (or whatever). Which fixes all your issues, but the performance was abysmal and doing loop mounts on FUSE file systems didn't strike me as a particularly reliable way of carrying on!

the crypt is stable and reliable.

and that is quite a lot of work, for perhaps little practical value.

imho, based on forum posts, these features are not needed and are not requested.

directory structure visible
identical file names will have identical uploaded names

these complex features could add a lot of bugs for little return.
a bug in crypt code could be fatal.

and most importantly, how would we handle backwards compatibility???

perhaps create a new backend, crypt2, to add just these overlay features.
have crypt2 call out to the current crypt backend, for the bulk of code.

Quick brainstorm - what functions are critical to implement for an overlay remote like this?

User Functions:
-Initialize file system - setup underlying backend / remote, database to cache everything, encryption options
-List files on remote - database / cache lookup. How does rclone handle globs / pattern matching? Exact filenames are easy but there will not be an underlying filesystem to handle this. I would guess crypt in the current state has the same issue and does this internally.
-Copy to remote - encode filename, hash original file, encrypt file & generate hmac, copy (with underlying backend), write metadata to database
-Copy from remote - encode filename in request, copy (from underlying backend), verify hmac & decrypt, verify original hash, combine with metadata from database, write to local
-Validate DB / cache - validate remote files match database (since most listing calls will use the database not the remote; addresses situations where remote files are changed outside of rclone)

Internally requires:
-Encrypt / decrypt file contents - probably similar to existing crypt
-Encode / decode existing file names - database lookup
-Encode / decode new filename - can be random data if reversibility is not required; padding, encryption, base32 encoding, splitting into file path if reversibility is required
-Save / read metadata to database - what metadata does rclone keep: filename, file modification time, file creation time? anything else?

I would assume so. Requiring a local database or cache is a large change - especially large if the ability to access your files is tied to keeping this database safe, less so (but maybe still something that should be broken out) if the database can be recreated by processing the underlying remote. Depends on what you feel would be clearer and easier to discover for users.

The backends don't handle globs - that is done entirely by a higher layer.

Filename, Size, ModTime are the most important 3

Hash is very useful

MimeType, ID etc are some minor ones.

I'm planning to make all rclone objects serializable so I can make a cache layer for all backends.

In the short term, is there an option to not transfer the file modification time to the remote (e.g. use upload time as modification time)?

It looks like this exists for remotes where modification times require an extra API call (--use-server-modtime) and mounts / the caching layer for mounts (--no-modtime). Copying from local drive to local drive actually gives this behavior by default (stat on a newly copied file gives the copy time for all 3 times). Is there a general version that I can use when copying anywhere? The destination for this job is gdrive.

It might be that the option you want is this, so when you read the drive you see the created dates?

  --drive-use-created-date   Use file created date instead of modified date.

Will using this when uploading prevent the modification date on drive from ever being set to the source file's modification date? Or is it something used when reading the drive to get creation date (i.e. upload date) in place of modification date?

I would like to prevent the original file's modification date from being ever copied to drive.

No

Yes

That would need a patch to rclone. You'd patch ModTime in backend/local/local.go to return time.Now()

I am looking for a more permanent way to implement this that changes the upload side so that all reads from local are not impacted. For example, files on the remote should look newer to rclone than files on the local if they were uploaded since they were last modified. Ideally this is done with a flag that can be set in the config file or provided at runtime.

Would patching the drive remote by doing the following accomplish this?

  • Change PutUnchecked at line 2072 in drive.go to modTime := time.Now()
  • Change Update at line 3592 in drive.go to ModifiedTime: time.Now(),
  • Change Update for document objects at line 3616 in drive.go in the same way

Would adding something like no_modtime to the Options struct be sufficient to make rclone read this flag from the config file, if present, and make it available in f.opt and o.fs.opt respectively in these functions?

Is there any other path through which source modification time is sent to drive or a better way to accomplish this?