HTTP upload contribution

Can you suggest an alternative to just running rm -rf ./**/*.upload for the cleanup? I feel like that could be destructive

Other than

  • even more "special" file names
  • keep them in "special" directories

not really...

Ok, so it looks like I have to unify the temporary file scheme.
Currently all backends that implement Concat can use the generic ConcatUploader, which creates a directory at the upload location named after the file with a '.upload' suffix.

S3 has three annoying restrictions: a 5 MB minimum size for fragments, you can't update the metadata of an upload after it has started, and Lifecycle Configurations can only be specified for key prefixes.

It makes more sense, and provides some benefits during cleanup, to have a single '.uploads' folder that contains all temporary files in a directory structure mirroring the remote root. The actual folder path can be configurable per backend, which avoids the possibility of file name collisions. This also has the benefit that cleanup policies can be applied to a single folder rather than having to scan through the entire remote directory structure, and it satisfies the prefix requirement of S3's lifecycle policies even if the configured path is not at the root.
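
For concreteness, roughly how the two layouts would map a remote path to its temporary location; the helper names and the '.uploads' folder name are purely illustrative:

```go
package main

import (
	"fmt"
	"path"
)

// Per-file scheme used by the generic ConcatUploader: each upload gets its
// own directory next to the final object.
func perFileUploadDir(remote string) string {
	return remote + ".upload" // "docs/report.pdf" -> "docs/report.pdf.upload"
}

// Proposed shared scheme: one configurable folder mirroring the remote root,
// so cleanup and lifecycle rules only need to target a single prefix.
func sharedUploadDir(uploadsRoot, remote string) string {
	return path.Join(uploadsRoot, remote) // -> ".uploads/docs/report.pdf"
}

func main() {
	fmt.Println(perFileUploadDir("docs/report.pdf"))
	fmt.Println(sharedUploadDir(".uploads", "docs/report.pdf"))
}
```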

We can skirt around the 5 MB fragment minimum by simply not using multipart upload for the fragments themselves: PUT each fragment in the temporary folder, then use a multipart upload with UploadPartCopy to concatenate everything at the end. The pricing shouldn't be any different and I see no downside to this. You mentioned earlier in this thread that this could be a slow operation; would it be any different from completing a normal multipart upload?
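
For illustration, here is a rough sketch of that flow against the aws-sdk-go v1 client, assuming the fragments have already been PUT as ordinary objects under the temporary prefix. Error handling, retries and CopySource URL-encoding are glossed over, and none of the names are from the actual backend code:

```go
package upload

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3iface"
)

// concatFragments drives a multipart upload whose parts are server-side
// copies of fragments that were previously PUT under the temporary prefix.
// Sketch only: error handling, retries and CopySource encoding are simplified.
func concatFragments(svc s3iface.S3API, bucket, finalKey string, fragmentKeys []string) error {
	mpu, err := svc.CreateMultipartUpload(&s3.CreateMultipartUploadInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(finalKey),
	})
	if err != nil {
		return err
	}
	var parts []*s3.CompletedPart
	for i, fragKey := range fragmentKeys { // fragments in their final order
		res, err := svc.UploadPartCopy(&s3.UploadPartCopyInput{
			Bucket:     aws.String(bucket),
			Key:        aws.String(finalKey),
			UploadId:   mpu.UploadId,
			PartNumber: aws.Int64(int64(i + 1)),
			CopySource: aws.String(bucket + "/" + fragKey),
		})
		if err != nil {
			return err
		}
		parts = append(parts, &s3.CompletedPart{
			ETag:       res.CopyPartResult.ETag,
			PartNumber: aws.Int64(int64(i + 1)),
		})
	}
	_, err = svc.CompleteMultipartUpload(&s3.CompleteMultipartUploadInput{
		Bucket:          aws.String(bucket),
		Key:             aws.String(finalKey),
		UploadId:        mpu.UploadId,
		MultipartUpload: &s3.CompletedMultipartUpload{Parts: parts},
	})
	return err
}
```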

The metadata issue goes away as well because of the way ConcatUploader stores the total size in the upload directory. I looked at storing this information in the object metadata, but it became complicated because the final size is not provided when the upload initiates. (Edit: it turns out the docs I was reading are out of date and S3 also supports tags along with prefixes.)

Thoughts?

Edit: I went ahead assuming you had no objections to any of the above and finished up the S3 implementation. Using the ConcatUploader really is ideal for any backend that can efficiently concatenate files.

Does it make sense that ResumableCleanup() is a no-op for S3, given S3's lifecycle policies?
For other backends, would it make sense for ResumableCleanup() to be called from the command line by some cron job, basically acting as a stand-in for tmpwatch that guarantees files are cleaned up according to the upload logic?
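
For illustration, a tmpwatch-style sweep over the shared uploads folder might look something like the sketch below. The function name, the age-based cut-off, and the import path are all assumptions rather than anything that exists in the code:

```go
package upload

import (
	"context"
	"time"

	"github.com/rclone/rclone/fs"
)

// sweepUploads removes chunk files older than cutoff from the shared uploads
// folder, recursing into its subdirectories. Purely illustrative: the name,
// the age-based policy, and the error handling are all assumptions.
func sweepUploads(ctx context.Context, f fs.Fs, dir string, cutoff time.Time) error {
	entries, err := f.List(ctx, dir)
	if err != nil {
		return err
	}
	for _, entry := range entries {
		switch e := entry.(type) {
		case fs.Object:
			if e.ModTime(ctx).Before(cutoff) {
				_ = e.Remove(ctx) // best effort: drop stale chunks
			}
		case fs.Directory:
			if err := sweepUploads(ctx, f, e.Remote(), cutoff); err != nil {
				return err
			}
		}
	}
	return nil
}
```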

I need to make Items non-optional for the fs.NewDir interface. How bad an idea is this?

Yes, having all the temporaries in one folder makes cleanup much easier.

Most of the backends have a minimum chunk size. I think assembling a normal multipart upload is much quicker than doing the UploadPartCopy approach, since you'll need one operation per part to copy the uploaded parts into a new object.

I think it should do the cleanup. Perhaps we can wire it into rclone cleanup.

Very few backends support Items without actually counting the number of files in the directory - that is why it is optional.


I am not sure what you mean? UploadPart also requires an operation per part to upload the content body.

Should it do it instead of having a lifecycle policy applied to the bucket?

Can it be converted to a getter?

I was thinking this:

  • UploadPart takes one operation per chunk (the upload)
  • Upload then UploadPartCopy takes two operations per chunk (the PUT of the fragment plus the copy)

Does that make sense?

I don't think it is rclone's job to apply lifecycle policies. They certainly won't work with all the S3 compatible providers.

So I think maybe suggest a lifecycle policy in the manual, but have rclone cleanup do the actual cleanup.

Something like that!

You can create your own objects which satisfy the Directory interface.

Ah, I see. That would require falling back to another backend (generally 'local') while buffering the data. I'll look into enabling both modes as I need a minio S3 store to actually be my fallback.

Ok, I'll remove the policy and implement the cleanup logic.

The ConcatUploader requires that the backend populate the Items field. This is necessary to handle the situation where a client decides to upload more than 10,000 chunks and I need to limit any one folder to that quantity. Either the Directory interface needs to require that Items be populated (lazily or otherwise), or every backend needs to provide a function that takes a Directory and returns a count.

I see... You'll have to count the items manually I think (with the List call).
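
As a minimal sketch of that idea, a Directory could be wrapped so that Items() is derived lazily from a List call. The wrapper type, its constructor, and the import path are hypothetical and assume the current shape of the fs package:

```go
package upload

import (
	"context"

	"github.com/rclone/rclone/fs"
)

// countedDir wraps an existing fs.Directory and fills in Items() by listing
// the directory on first use. The type and constructor are hypothetical,
// not part of rclone.
type countedDir struct {
	fs.Directory       // promoted methods satisfy the rest of the interface
	f     fs.Fs        // backend used for the lazy count
	items int64        // cached count of direct children; -1 = not counted yet
}

func newCountedDir(f fs.Fs, d fs.Directory) *countedDir {
	return &countedDir{Directory: d, f: f, items: -1}
}

// Items shadows the embedded method and counts lazily with a List call.
func (d *countedDir) Items() int64 {
	if d.items >= 0 {
		return d.items
	}
	entries, err := d.f.List(context.Background(), d.Remote())
	if err != nil {
		return -1 // keep the "unknown" convention on error
	}
	d.items = int64(len(entries))
	return d.items
}
```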

Counting manually should be done by the fs and not the uploader. I made the necessary modification in this commit; hopefully that makes sense to you.

I have also finished implementing the fallback uploader. Everything is now at the point where we need to go through all of the backends and implement ResumableUploader.

I would appreciate it if you could review the overall structure before I write up all the high-level documentation and tests. There are also some TODO items and bug fixes that I would rather sort out once I know there aren't going to be large refactors of the code so far.

One thing I have done is require that Concat() in the Concatenator interface must not fail for any set of objects passed to it.
The function is responsible for working around any limitations of the backend, even if this means falling back to the new ConcatReader I have implemented.

Concat() in the S3 backend will aggressively succeed regardless of the cost of the operation. It is up to the caller to hand the function a set of objects that do not trigger the fallback mechanisms if they want to optimize the cost of the operation. This is ultimately what the fallback uploader tries to do before calling Concat().
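
To make the contract concrete, the sketch below shows the shape of a Concat that tries a cheap server-side path first and otherwise streams the parts through a single reader; io.MultiReader stands in for the ConcatReader mentioned above, and every helper name here is a placeholder rather than the real implementation:

```go
package upload

import (
	"context"
	"errors"
	"io"
	"time"

	"github.com/rclone/rclone/fs"
	"github.com/rclone/rclone/fs/object"
)

// Placeholders for a backend's cheap server-side path (e.g. UploadPartCopy on S3).
func canConcatServerSide(f fs.Fs, srcs []fs.Object) bool { return false }
func concatServerSide(ctx context.Context, f fs.Fs, dst string, srcs []fs.Object) (fs.Object, error) {
	return nil, errors.New("not implemented in this sketch")
}

// concat illustrates the "Concat must not fail" contract: try the cheap
// server-side path, and fall back to streaming the parts through a single
// reader when the backend's limits rule that out.
func concat(ctx context.Context, f fs.Fs, dst string, srcs []fs.Object) (fs.Object, error) {
	if canConcatServerSide(f, srcs) {
		if obj, err := concatServerSide(ctx, f, dst, srcs); err == nil {
			return obj, nil
		}
		// Fall through to the slow path instead of returning the error.
	}
	var (
		readers []io.Reader
		closers []io.Closer
		size    int64
	)
	defer func() {
		for _, c := range closers {
			_ = c.Close()
		}
	}()
	for _, src := range srcs {
		rc, err := src.Open(ctx)
		if err != nil {
			return nil, err
		}
		readers = append(readers, rc)
		closers = append(closers, rc)
		size += src.Size()
	}
	info := object.NewStaticObjectInfo(dst, time.Now(), size, true, nil, f)
	return f.Put(ctx, io.MultiReader(readers...), info)
}
```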