HTTP upload contribution

tus does not support non-sequential uploads. I initially wanted to support that for future protocols, but it is too problematic.

Is the issue that the resulting Object won't be ready before the Response returns?

The service handler manages the server URL to remote URL mapping, and the fs manages the remote URL to backend URL mapping. Would it be possible to include a ?upload_id=abc123 query parameter in the remote URL when necessary? This way the fs retains control over the remote to backend URL mapping, which in Google's case is a very different URL from the final one. It will also keep the upload stateless, as this query parameter can be propagated to the service URL.
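
For example (every URL here is hypothetical, including the Google-style backend URL, just to illustrate the three layers):

server URL:  https://rclone.example.com/files/photo.jpg?upload_id=abc123
remote URL:  photo.jpg?upload_id=abc123
backend URL: https://www.googleapis.com/upload/drive/v3/files?uploadType=resumable&upload_id=abc123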

The interface would have to return a path which is rewritten by the service handler before being returned to the upload handler:

type ResumableUploader interface {
	ResumableUpload(path string) (Uploader, string, error) // Start or continue an existing upload to path, returning the upload path
	ResumableCleanup() error // Clean up expired incomplete uploads
}

I would really like to avoid creating a temporary mapping between resources. It creates the issue of generating ids and all of the subtle complexity of doing that in a distributed manner.

Great. Sequential is much easier.

No, it just takes a while, that is all!

Isn't that just another way of saying the service manager will remember the ID?

Here is the protocol for starting an upload with tus:

Request:

POST /files HTTP/1.1
Host: tus.example.org
Content-Length: 0
Upload-Length: 100
Tus-Resumable: 1.0.0
Upload-Metadata: filename d29ybGRfZG9taW5hdGlvbl9wbGFuLnBkZg==,is_confidential

Response:

HTTP/1.1 201 Created
Location: https://tus.example.org/files/24e533e02ec3bc40c387f1a0e460e216
Tus-Resumable: 1.0.0

Note the Location contains a server-defined URL... That could contain the upload ID and the filename quite easily.

I see what you mean. However, those IDs are generated by the backends themselves, so someone has to keep track of them. If we can get them into the Location URL then the client will keep track of them...

There is no issue for backends that generate upload IDs, but we would have to manage ID generation for backends that don't. It would be better if the service handler treated the query as semi-opaque and just forwarded it to the fs. Then the fs can stuff any information it needs into the returned URL without needing explicit support for each parameter in the service handler code. Creating the concept of an ID, rather than just using the service/remote path to uniquely identify the upload, is unnecessary.

I was assuming that if the backend doesn't generate IDs we just use the file path as an ID.

I don't think we can get away from needing an ID to identify a multipart upload. However, it looks like we can pass that straight to the client for the client to remember. The backend can stuff anything it wants into that ID (the actual ID from the cloud storage, plus a path and an expiry, serialized and encoded in base64) and it will be opaque to everything else.
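
Something like this sketch - the field set and the JSON + base64 encoding are just assumptions to show the idea; any serialization would do:

package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"time"
)

// uploadID is whatever state the backend wants to stuff into the token.
type uploadID struct {
	CloudID string    `json:"id"`     // the actual ID from the cloud storage
	Path    string    `json:"path"`   // the destination path
	Expiry  time.Time `json:"expiry"` // when the incomplete upload expires
}

// encodeUploadID serializes the state so the client (and everything in
// between) can treat it as an opaque token.
func encodeUploadID(id uploadID) (string, error) {
	raw, err := json.Marshal(id)
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(raw), nil
}

// decodeUploadID recovers the backend's state from the token.
func decodeUploadID(token string) (id uploadID, err error) {
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return id, err
	}
	err = json.Unmarshal(raw, &id)
	return id, err
}

func main() {
	token, _ := encodeUploadID(uploadID{
		CloudID: "24e533e02ec3bc40c387f1a0e460e216",
		Path:    "user/data/filename",
		Expiry:  time.Now().Add(24 * time.Hour),
	})
	fmt.Println(token) // safe to embed in the Location URL
}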

Shouldn't we try to have the url returned from the POST be the url that will eventually point to the final object? Any upload query parameters can be ignored during a GET.

I am not clear on what you have in mind for the url containing the ID.
We should be careful not to put any requirements on the structure of the path, as we want this to be compatible with a variety of service URL schemes.

That isn't the way these schemes normally work. There is normally a separate URL for the upload. Does the tus spec say anything about this?

That would only be used for the upload. I think that is the way it is supposed to work but I may be mistaken.

You'd fetch the file back with its filename glued onto the base URL using a GET request.

POST is responsible for returning the URL for the uploaded resource. If that isn't the final url for the file then there is no way for the client to access the file after upload without somehow searching for it. The URL POSTed to also doesn't have to be a sub-path of the URL returned.

Automatically routed based on user credential:
Request: POST /upload
Response: Location: /user/data/filename

A url scheme for concatenated uploads:
Request: POST /user/data/
Response: Location: /user/data/filename/1

A 1:1 filesystem to url mapping that handles many many uploads:
Request: POST /uploads/
Response: Location: /uploads/fi/le/filename

The service handler or the fs can decide to place the upload anywhere it wants.

tus doesn't specify anything for the url scheme but it does provide an optional GET handler. I am not sure if it handles resumable downloads.

See also https://github.com/rclone/rclone/issues/3151

The docs refer to the value of the Location: header as the "upload URL"

I don't think it is the URL that you are supposed to GET the file from - it doesn't say that in the spec. In fact there is no mention of GET in the spec at all!

Though it could be, easily enough, and we could stick the extra metadata in a URL parameter.

I've implemented a lot of uploaders for cloud backends and the "upload URL" is a common concept meaning "upload your data here". You usually get the final URL when the upload has finished.

Update: I looked at the JavaScript code and it does look like you are expected to download the file from the upload URL.

There doesn't seem to be a way of saying "I've finished uploading to this file"? It looks like (correct me if I'm wrong) you can add to any file at any time - is that correct? If so that will be a problem.

I don't understand why you would want your file uploaded to some random place decided by the upload handler?

Where is that - I haven't found that.

I note that tus doesn't support uploads where you don't know the size in advance. This is a bit of a limitation :frowning:

I think a straightforward POST (with a form) or PUT (with headers) would be a lot easier to implement than tus and would do for 99% of the use cases of uploading. You could use it with curl or via JavaScript.
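
For example with curl (the server address is hypothetical; -T makes curl do a PUT of the file to that URL):

curl -T file.bin http://localhost:8080/user/data/file.bin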

It isn't in the spec, but is in the implementation: https://godoc.org/github.com/tus/tusd/pkg/handler#UnroutedHandler.GetFile

Once the server receives the last byte the upload is considered complete. This is why the protocol requires transmitting the total size at some point.
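
For example, with the 100 byte upload created earlier, a final PATCH like this would complete it (headers as per the tus 1.0.0 spec):

Request:

PATCH /files/24e533e02ec3bc40c387f1a0e460e216 HTTP/1.1
Host: tus.example.org
Content-Type: application/offset+octet-stream
Content-Length: 30
Upload-Offset: 70
Tus-Resumable: 1.0.0

[bytes 70-99 of the file]

Response:

HTTP/1.1 204 No Content
Upload-Offset: 100
Tus-Resumable: 1.0.0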

It depends on the service and what the path represents. An HTTP client should also have no say in the upload location, for security and collision handling.

tus supports uploads where you only need to know the size eventually, like on the last PATCH.
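
That is the creation-defer-length extension, if I am reading the spec right: you create the upload with Upload-Defer-Length: 1 instead of Upload-Length, and then include Upload-Length on a later PATCH once you know it:

POST /files HTTP/1.1
Host: tus.example.org
Content-Length: 0
Upload-Defer-Length: 1
Tus-Resumable: 1.0.0

...

PATCH /files/24e533e02ec3bc40c387f1a0e460e216 HTTP/1.1
Host: tus.example.org
Content-Type: application/offset+octet-stream
Content-Length: 30
Upload-Offset: 70
Upload-Length: 100
Tus-Resumable: 1.0.0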

If you are satisfied, can I begin implementing this using query parameters?

I think that will work!

Go ahead, but be aware that I have a major backlog of pull requests to review and not much time at the moment! Pull requests take a long time to review properly.

We are going to need to tweak the interface a bit. Using ? as a token in the path won't work if the backend allows ? in file names. I was thinking of passing the entire url.URL object in place of the path, or URL encoding the path, but both options are clunky.

I propose the following change using url.Values:

type ResumableUploader interface {
	ResumableUpload(path string, options url.Values) (Uploader, string, url.Values, error) // Start or continue an existing upload to path, returning the rewritten path and options
	ResumableCleanup() error // Clean up expired incomplete uploads
}
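
Roughly how I imagine the service handler side using it (startResumable and the leading "/" are illustrative, not existing rclone code; assumes the interface above plus import "net/url"):

func startResumable(ru ResumableUploader, remote string, query url.Values) (string, error) {
	// Forward the raw query semi-opaquely to the fs.
	_, newPath, newQuery, err := ru.ResumableUpload(remote, query)
	if err != nil {
		return "", err
	}
	// Rebuild the upload URL; the fs can stuff whatever it needs
	// into the returned path and query parameters.
	u := url.URL{Path: "/" + newPath, RawQuery: newQuery.Encode()}
	return u.String(), nil
}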

I am not sure if I am happy with pulling in a type from the url package for this interface.
Should I use an existing rclone type, or should I just redeclare the type:

type TransactionOptions url.Values
or
type TransactionOptions map[string][]string

Can you think of anything better than TransactionOptions?

URL encoding the path is what most cloud storage systems do. They have the same problem. You need a little care encoding the path, but url.URL has enough tools in it to let you do it.
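
For example, round-tripping a path with a ? in it needs nothing beyond the standard library - a quick sketch:

package main

import (
	"fmt"
	"net/url"
)

func main() {
	remote := "dir/file with questionmark?"
	// Escape for use as a URL path - "?" becomes %3F so it can't be
	// mistaken for the start of the query string.
	escaped := (&url.URL{Path: remote}).EscapedPath()
	fmt.Println(escaped) // dir/file%20with%20questionmark%3F

	// And decode back to rclone standard form.
	decoded, err := url.PathUnescape(escaped)
	if err != nil {
		panic(err)
	}
	fmt.Println(decoded) // dir/file with questionmark?
}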

I don't really want to pollute the Resumable interface with url.Values which will mean nothing to anything except the tus uploader.

Will this be the only interface that expects the remote path to be url encoded?

It is certainly the only one at the moment.

Isn't that an issue?
Should the other interfaces be changed, or should we compromise and add an extra parameter to this one?

I don't think any of the internal interfaces to rclone should care - they should be presented with "file with questionmark?" and work just fine.

The external interfaces might - for instance if you use rclone serve http you'll see rclone will URL encode and decode the ? for you, but I see that as the job of the external interface to get the URLs back to rclone standard format.

I managed to implement a generic Uploader for any Fs that implements Concatenator.

I am not sure whether it is better to specify a single directory to contain all pending uploads or to create each upload in its destination location. A single directory makes it easier to clean up dead uploads but complicates collision avoidance. What do you think of storing the paths to the temporary files in boltdb? Cleanup can then just scan the boltdb list of pending uploads and their modtimes and delete accordingly - there is a sketch below. If the boltdb is ever lost then a user can just manually delete the temporary files.
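
A rough sketch of that cleanup scan (the "pending" bucket name, the RFC 3339 modtime values, and the remove callback that deletes the temporary file are all assumptions):

package main

import (
	"time"

	bolt "go.etcd.io/bbolt"
)

// cleanup scans the pending uploads recorded in boltdb and removes any
// whose modtime is older than maxAge.
func cleanup(db *bolt.DB, maxAge time.Duration, remove func(path string) error) error {
	return db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("pending"))
		if b == nil {
			return nil // nothing pending
		}
		c := b.Cursor()
		for k, v := c.First(); k != nil; k, v = c.Next() {
			modTime, err := time.Parse(time.RFC3339, string(v))
			if err == nil && time.Since(modTime) <= maxAge {
				continue // still live
			}
			// Delete the temporary file first, then drop the record.
			if err := remove(string(k)); err != nil {
				continue // leave the record; retry on the next sweep
			}
			if err := c.Delete(); err != nil {
				return err
			}
		}
		return nil
	})
}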

I am going to take a swing at the S3 implementation now.

:slight_smile:

I'd probably say create it in the destination location.

I'd prefer not to have local state if we can avoid it.

OK!