Alternative to Crypt for File Path Blinding

I am looking for thoughts on filename / file path blinding. Whether there's already a solution to this, whether someone has tried it before, etc.

Right now, the options for encryption & blinding seem to favor portability and access over flexibility. Crypt has the option to encrypt filenames section by section. However, this comes with limitations on the length and complexity of filenames.

I am looking for a more flexible solution that will enable me to store files in a blinded manner regardless of what filenames are used locally.

There is a question of where to store the source filename if not in the corresponding remote file's filename (i.e. metadata). In my use case, I do not need to access the remote from multiple locations. Thus, my hunch is it would be feasible to split up the metadata and storage aspects of the file system - to keep the real filename locally in a database and upload the file with a different filename. The remote files would not be useful without the database, but that's not an issue here. As a bonus it would be possible to change the entire file path, avoid storing other metadata with the remote file, etc. because this could be stored separately in the database rather than uploaded as metadata attached to the file.
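To make the split-metadata idea concrete, here is a rough sketch (purely hypothetical; nothing like this exists in rclone, and the table and function names are mine) of a local index mapping real paths to random upload names:

```python
import secrets
import sqlite3

# Hypothetical local filename index; not an rclone feature.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE IF NOT EXISTS files (real_path TEXT PRIMARY KEY, blind_name TEXT)"
)

def blind_name_for(real_path: str) -> str:
    """Return the random remote name for a local path, creating one if needed."""
    row = db.execute(
        "SELECT blind_name FROM files WHERE real_path = ?", (real_path,)
    ).fetchone()
    if row:
        return row[0]
    name = secrets.token_hex(16)  # random, so repeated local names never match
    db.execute("INSERT INTO files VALUES (?, ?)", (real_path, name))
    db.commit()
    return name
```

Because the remote name is random rather than derived from the local name, two uploads of `monthly/report.pdf` in different directories share nothing visible, and renaming a local path only touches the database.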

There are probably other good solutions to handle this as well. Has anyone seen something similar or something different that would accomplish the same objectives?

Bumping this in the hope that weekday folks have thoughts or suggestions.

So you don't want to use crypt because of the filename length limitations? If not, what's the issue with crypt that you're trying to overcome?

Overcoming the file name length limitation is a big part of it. I cannot control the file names in this use case. I'd also favor something that limits two things mentioned in the documentation -

  • directory structure visible
  • identical file names will have identical uploaded names

Of these, both are potentially important to me, but the second is more critical. I don't want to 'leak' information about cyclical work because a common filename is reused. I suspect, though I didn't see it explicitly in the documentation, that other metadata - such as file creation / modification time - is also leaked (unless it's simply not preserved), again providing information like 'every month a file with the same filename appears in a new directory.'

These would be avoided by storing the file metadata separately.

If the remote supports extremely long file names, then the file name limitations probably won't be an issue since the underlying remote supports what you need.

I'm not aware of anything that would help here without writing a new remote to handle the use case or using something like encfs on top of a fuse mount which is very inefficient.

If the remote supports extremely long file names, then the file name limitations probably won't be an issue since the underlying remote supports what you need.

The documentation suggested filenames were limited to ~143 characters. Are you saying that's dependent on the remote? Do you know the calculation behind 143, or how to work it out for another remote? It seems there's more to it than the efficiency and boundaries of base32.

I'm not aware of anything that would help here without writing a new remote to handle the use case or using something like encfs on top of a fuse mount which is very inefficient.

Thanks for the suggestion. I will take a closer look at encfs. I believe it also has a filename length limitation in the default mode, but maybe there is a mode that solves my issue. It does have an option that would solve the same-filename-in-a-different-directory issue.

When you say it's inefficient - do you know it will be but not how much or is there a performance comparison of encfs on top of a (non-encrypted) rclone mount vs. the crypt backend?

It is dependent on the remote, yes. I think the author was trying to say that a limit of 143 characters would be compatible with most remotes, including being able to copy a file back down from Google Drive to the local filesystem, since ext, NTFS, and others are limited to 255 characters.
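My reading of how the 143 falls out (the function name is mine; the scheme is as I understand crypt's standard filename mode): each name segment is padded by at least one byte to a 16-byte AES block boundary, encrypted with EME (which is capped at 128 blocks), then base32-encoded at 8 output characters per 5 ciphertext bytes. The longest plaintext name that still fits a given remote limit can then be computed:

```python
import math

def max_plain_name_len(remote_limit: int) -> int:
    """Longest plaintext filename whose crypt-obscured name fits remote_limit.

    Assumes crypt's standard mode: pad the name (at least 1 byte) to a
    16-byte block boundary, encrypt with EME (at most 128 blocks), then
    base32-encode at 8 characters per 5 bytes.
    """
    n = min(remote_limit, 128 * 16 - 1)  # EME cap: at most 128 blocks
    while n > 0:
        padded = (n // 16 + 1) * 16          # pad to the next block boundary
        encoded = math.ceil(padded * 8 / 5)  # unpadded base32 length
        if encoded <= remote_limit:
            return n
        n -= 1
    return 0

print(max_plain_name_len(255))   # 143  (a 255-character remote such as ext/NTFS)
print(max_plain_name_len(7768))  # 2047 (hits the 128-block EME cap first)
```

Those two results match the `maxFileLength` values that `rclone info` reports below for a local crypt (143) and a Google Drive crypt (2047).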

Well. When rclone was first starting up, I used to use encfs on top of a regular Google Drive mount. There have been significant improvements in rclone since then, so it may have gotten better, but it was pretty dreadfully slow when laid over a Google mount. Nowadays, with chunking and caching, maybe it will work better. YMMV.

If you search for this, you'll find lots of references to lengths on different remotes. Maybe that will help.

https://forum.rclone.org/search?q=rclone%20backend%20length

There are backend commands that can calculate the lengths as well. Be careful using them though as it attempts to create invalid characters and leaves a bunch of garbage behind. :slight_smile:

for example:

ext Crypt:

    rclone info local-crypt:
    2020/09/09 16:05:24 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "position-right-BF-\xbf"
    2020/09/09 16:05:24 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "position-middle-BF-\xbf-"
    2020/09/09 16:05:24 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "position-middle-FE-\xfe-"
    2020/09/09 16:05:24 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "position-right-FE-\xfe"
    2020/09/09 16:05:24 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "\xfe-position-left-FE"
    2020/09/09 16:05:24 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "\xbf-position-left-BF"
    2020/09/09 16:05:24 EME operates on 1 to 128 block-cipher blocks, you passed 513
    2020/09/09 16:05:24 EME operates on 1 to 128 block-cipher blocks, you passed 257
    2020/09/09 16:05:24 EME operates on 1 to 128 block-cipher blocks, you passed 129
    // local-crypt
    stringNeedsEscaping = []rune{
    	'/', '\x00'
    }
    maxFileLength = 143
    canWriteUnnormalized = true
    canReadUnnormalized   = true
    canReadRenormalized   = false
    canStream = true

ext:

    rclone info local:
    2020/09/09 16:04:23 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "position-right-BF-\xbf"
    2020/09/09 16:04:23 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "position-middle-BF-\xbf-"
    2020/09/09 16:04:23 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "position-middle-FE-\xfe-"
    2020/09/09 16:04:23 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "position-right-FE-\xfe"
    2020/09/09 16:04:23 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "\xfe-position-left-FE"
    2020/09/09 16:04:23 NOTICE: Local file system at /tmp: Replacing invalid UTF-8 characters in "\xbf-position-left-BF"
    // local
    stringNeedsEscaping = []rune{
    	'/', '\x00'
    }
    maxFileLength = 255
    canWriteUnnormalized = true
    canReadUnnormalized   = true
    canReadRenormalized   = false
    canStream = true

google crypt:

    rclone info zonegd-cryptp:
    2020/09/09 15:55:57 EME operates on 1 to 128 block-cipher blocks, you passed 513
    2020/09/09 15:55:57 EME operates on 1 to 128 block-cipher blocks, you passed 257
    2020/09/09 15:55:57 EME operates on 1 to 128 block-cipher blocks, you passed 129
    // zonegd-cryptp
    stringNeedsEscaping = []rune{
    	'/', '\x00'
    }
    maxFileLength = 2047
    canWriteUnnormalized = true
    canReadUnnormalized   = true
    canReadRenormalized   = false
    canStream = true

google:

    rclone info zonegd:
    // zonegd
    stringNeedsEscaping = []rune{
    	'/', '\a', '\b', '\f', '\n', '\r', '\t', '\u007f', '\v', '\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x0e', '\x0f', '\x10', '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19', '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f'
    }
    maxFileLength = 7768
    canWriteUnnormalized = true
    canReadUnnormalized   = true
    canReadRenormalized   = false
    canStream = true

I'm currently prototyping a new scheme for filename encryption which compresses the file names.

I'm thinking about whether to stop the file names being duplicated. It adds quite a bit of complexity, but it wouldn't require metadata to implement, just caching of the directory listings.

It still leaves the directory structure visible though.


Regarding encFS:

Thanks again for this suggestion.

In normal mode, encFS will solve the repeated-filename issue (by using filename initialization vector chaining). It will still show the directory structure along with other file metadata. A few downsides:

  • It requires the double-mount trick, with potential performance impacts.
  • I suspect the filename limitations of both the host filesystem and the remote will apply. While some remotes support very long names, local filesystems mostly do not, and you are writing to a 'local' mount. Original filenames at the limit of the local filesystem would break, which is a deal breaker.
  • You are copying from whatever your source is to encFS's mounted location with normal tools. This means missing out on some of the rclone functionality (rclone is just in mount mode). Not sure if this is a deal breaker, but I'd certainly like that functionality.

EncFS has a 'reverse' mode that I've not used before but may work better in most remote encrypted backup scenarios. By setting the reverse mount as the source in rclone, you could copy from that directly to the remote using rclone functions. However, filename initialization vector chaining doesn't work in reverse mode, so it won't avoid duplicate filenames.

Other Approaches:

Interesting - I assume you are compressing each name separately, not together / with a shared dictionary? What is the caching of directory listings required for - to improve performance, or does the cache have to be persisted to avoid losing the original file names?

Could you compress (& encrypt) the entire file path as one string to blind the directory structure if you're using a cache for ls? Could this be extended to split the encrypted path into a new directory structure? Breaking it at fixed points in the encrypted string could help with filesystems that support a longer path than filename. For example, 1000 characters of encrypted path data split into 3 levels of 250 character folders and 1 level of 250 character filename.
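To make the splitting idea concrete, here is a sketch (the 250-character width and the function names are arbitrary, mine alone): since the breakpoints are at fixed offsets, the split is trivially reversible by dropping the separators.

```python
def split_blinded_path(blob: str, width: int = 250) -> str:
    """Break one long encrypted-path string into fixed-width path segments,
    for filesystems whose per-name limit is much shorter than their
    total-path limit."""
    segments = [blob[i:i + width] for i in range(0, len(blob), width)]
    return "/".join(segments)

def join_blinded_path(path: str) -> str:
    """Reverse the split: the breakpoints are fixed, so just drop the slashes."""
    return path.replace("/", "")
```

A 1000-character encrypted path would come out as four 250-character segments, i.e. three directory levels plus a filename, as described above.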

Thinking through this has also caused me to look at what else is leaked - file size seems to be a concern that's come up before. A round up option could be implemented in any scheme by padding and storing the original length in a header. A packing approach (discussed as tar-like in some threads) is more easily implemented with some permanent store of metadata (here which file is in which package).
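The round-up idea can be sketched in a few lines (the 4 KiB bucket and the 8-byte big-endian length header are illustrative choices of mine, not from any existing scheme):

```python
import struct

BUCKET = 4096  # illustrative round-up granularity

def pad_to_bucket(data: bytes) -> bytes:
    """Prefix the true length, then zero-pad so the stored size only
    reveals which size bucket the file falls into."""
    framed = struct.pack(">Q", len(data)) + data
    return framed + b"\0" * ((-len(framed)) % BUCKET)

def unpad(blob: bytes) -> bytes:
    """Recover the original bytes using the 8-byte length header."""
    (n,) = struct.unpack(">Q", blob[:8])
    return blob[8 : 8 + n]
```

The cost is that a plain directory listing no longer reveals the true size; it has to come from reading the header or from a cached listing.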

Any thoughts on how much work it would be to create an rclone backend/remote that solves for this situation using the separate metadata database?

Yes that is the plan.

If there isn't a 1:1 mapping of file name <-> encrypted file name then you need to search for the file name in the directory.

This could work and I have considered it. It makes operations like renaming a directory really quite tricky!

Rclone knows the length of the decrypted file by looking at the length of the encrypted file.

If you have to read a header then you will have to read a header out of every file even when listing directories.

So this scheme won't work unless rclone does the full directory caching thing.

Quite a lot of work!

I did experiment with letting rclone export one large chunked file and mounting that with ext3 (or whatever). Which fixes all your issues, but the performance was abysmal and doing loop mounts on FUSE file systems didn't strike me as a particularly reliable way of carrying on!

the crypt is stable and reliable.

and that is quite a lot of work, for perhaps little practical value.

imho, based on forum posts, these features are not needed and are not requested.

  • directory structure visible
  • identical file names will have identical uploaded names

these complex features could add a lot of bugs for little return.
a bug in crypt code could be fatal.

and most importantly, how would we handle backwards compatibility???

perhaps create a new backend, crypt2, to add just these overlay features.
have crypt2 call out to the current crypt backend, for the bulk of code.