Base32768 to compress filename length

Max-Sum · December 3, 2019, 10:23am

Crypt overlay now uses base32 to encode the filename resulting in a large overhead. I found that some remotes would count Unicode numbers instead of ascii so may be we can use qntm/base32768 to compress the total filename length.

ncw · December 3, 2019, 11:35am

Interesting idea

Some remotes are case insensitive - does it take into account capital letters?

Some remotes do count characters, but I think most just count the length in UTF-8, eg s3

The name for a key is a sequence of Unicode characters whose UTF-8 encoding is at most 1024 bytes long.

One idea might be to switch to base64. Even on a case insensitive remote the chance of a collision is quite low since the minimum length is 16 bytes before encoding...

16 bytes encodes to 22 bytes base64. So there are 2^22 possible collisions (codewords which are the same string when made lowercase) out of a possible 2^(8*16) codewords, ie a chance of any two codewords colliding is 2^-106 which is very small! So even if you have 2^32 files in a directory then this will leave chance of a single collision at roughly 2^-106 * 2^64 = 2^-42 which is still very small...

Max-Sum · December 3, 2019, 12:17pm

Some remotes are case insensitive - does it take into account capital letters?

I didn't look into detail but it doesn't seem to use latin characters.

Use base64 for UTF-8 counted fs might be good idea. For Onedrive, it counts unicode chararcters so base32768 might be better. Maybe we can add an advance option for this?

ncw · December 3, 2019, 1:34pm

I guess it will depend on how the remote treats the unicode character classes...

Another option would be OK I think! Though I don't want to add too many!

thestigma · December 3, 2019, 4:03pm

Compressing the name length sounds like a great idea. It is not an uncommon problem on some backends, and especially when crypt is also used.

As for the ideal format - couldn't rclone just intelligently decide what to use based on if the given remote is case-sensitive or not? That is information we know up-front after all.

Max-Sum · December 4, 2019, 3:33am

I ported the base32768 to go.

As for the density, base32 converts 5 bytes to 8 characters, while base32768 converts 15 bytes to 8 characters. Since all of base32768's alphabet falls into UTF-16, it takes no more than 3 bytes to express 1 character in UTF-8. So base32768 is always shorter than base32 when no trailing is considered.

And I have a thought for base64, maybe we can use only Upper case character and other characters in U+0000 - U+007F, which takes one bytes in UTF-8(such as !,@,? ...) to make a new alphabet, so that no collision is possible.

Max-Sum · December 4, 2019, 7:20am

I found this:

FPE seems to allow us to encrypt a slice of bytes without padding it while still being safe. So maybe we can have 0 overhead encryption for the filename (though encoding to string still comes with a cost).

ncw · December 4, 2019, 10:39am

Nice!

Needs some tests

There are rather a lot of characters you can't use in various backends so that would need extreme care - see: rclone/lib/encoder/encoder.go at master · rclone/rclone · GitHub

I think I prefer the idea of a base64 mode which is a definite win.

The crypto for the names has to satisfy some relatively unusual properties

the crypto must be 1:1 so any file name must always encrypt to the same string
the crypt shouldn't leak prefixes so if you have two file names with the same prefix the output strings should not have the same prefix

Rule 1 means anything with an IV or a nonce is out. And having a constant nonce/IV is a bad idea. This rule means running in ECB mode.

Rule 2 means that you can't use a shorter block ciper in ECB mode as you'll get repeated prefixes.

I think Rule1 rules out FF1 and FF3.

It would be possible to redesign the filename crypto so it doesn't have those requirements, but it will lead to, for example, having an index file per directory.

It would be possible I think to relax rule 1 at the loss of some efficiency so when rclone was trying to find a file it would have to list the directory (and all the parents) it was in and decrypt all the filenames within to find it.

It might be just as easy at this point to map each file to a uuid and store the uuids in an index file along with the full name, hashes etc.

This could also mask the directory structure at this point too.

If you search the issues you will see ideas for all of these!

Max-Sum · December 4, 2019, 4:02pm

I don't think FF1 doesn't require a nonce/iv. It uses AES-ECB in the underlay but since go doesn't include ecb in standard library CBC with iv filled with zero is used. And FF1 should still fulfill Rule 2 at same time.

Max-Sum · December 5, 2019, 7:49am

I tested FF1 and Base32768 with zero length tweak and 16bytes key.
The result seems to be ok for both rule.

Before Encryption: Hello, 世界 , len: 13
Encryption: [163 99 31 190 251 130 109 40 123 167 56 123 42] , len: 13
Encoding: 砑湏藐䴒樽䌡鲵 , len: 21 , charcount: 7
Decryption: Hello, 世界 , len: 13

Before Encryption: Hello, World , len: 12
Encryption: [95 103 75 174 86 242 226 67 61 20 3 226] , len: 12
Encoding: 嘓祋焾咄䀨癯ɥ , len: 20 , charcount: 7
Decryption: Hello, World , len: 12

Before Encryption: Hello, World , len: 12
Encryption: [95 103 75 174 86 242 226 67 61 20 3 226] , len: 12
Encoding: 嘓祋焾咄䀨癯ɥ , len: 20 , charcount: 7
Decryption: Hello, World , len: 12

ncw · December 5, 2019, 5:08pm

Nice!

I think that using ff1 would be a lower security method of doing the filename encryption but still pretty good. However I'd really like to go for a higher security method...

Max-Sum · December 5, 2019, 6:23pm

I pushed what I tried here. Now it has:

Base32768, Base64
encode to UTF-16 and compare to see if shorter, useful because I uses chinese (compatible)
FF1 encryption

ncw · December 6, 2019, 9:58am

I note from wikipedia that ff1 appears to have patented parts

FF1 is FFX[Radix] "Format-preserving Feistel-based Encryption Mode" which is also in standards processes under ANSI X9 as X9.119 and X9.124. It was submitted to NIST by Mihir Bellare of University of California, San Diego, Phillip Rogaway of University of California, Davis, and Terence Spies of Voltage Security Inc. Test vectors are supplied and parts of it are patented.

Max-Sum · December 9, 2019, 10:55am

FF3 is BPS named after the authors. It was submitted to NIST by Eric Brier, Thomas Peyrin and Jacques Stern of Ingenico, France. Authors declared to NIST that their algorithm is not patented.The HPE Voltage company although claims to own patents also for BPS mode.

So... maybe FF3?

ncw · December 10, 2019, 10:11am

I wonder what this bit means?

Max-Sum · January 27, 2020, 4:31pm

So HPE Voltage claims the patent also applies to FF3. But the authors of FF3 don't agree with them