A few months ago tardigrade was renamed to storj (see introducing-storj-dcs-decentralized-cloud-storage-for-developers on the Storj blog).
To avoid confusion I would like to create a PR to rename tardigrade to storj, but I am wondering what the backward compatibility practices of rclone are:
Is it fine to rename the full package and the CLI arguments? (It would break existing clients.)
Or is it better to create a new storj provider, share the code, and eventually deprecate the old tardigrade connector?
Or should we keep the existing (tardigrade) CLI parameters and update only the help texts/docs?
Eventually (sometime in the future), remove all code and documentation for Tardigrade.
Benefits:
StorJ users get a better rclone backend due to the higher number of users (and developers) on the S3 backend.
rclone complexity is slightly reduced by having fewer specialised backends. This eases maintenance which typically leads to increased overall robustness/stability.
Drawbacks:
None that I can see (with my very limited knowledge of Tardigrade/StorJ)
I missed a (small) performance test after the functional backend test to check if there are performance differences - both StorJ and rclone may have changed since the Tardigrade backend was made.
Do we have any (new or old) performance tests/data showing the benefits of using the specific backend instead of the S3 backend?
I can ask around, but based on the architecture, S3 is expected to be slower.
Using the S3 protocol requires a running storj/gateway-(mt|st) server, which receives the REST calls and transforms them into storj-specific RPC calls. The rclone backend uses the same RPC calls (using the same original client library), so it should be significantly faster.
I guess it depends on the limiting component in the setup. If your speed is limited by your local system resources or network bandwidth, then it will be faster to use the S3 protocol than the StorJ protocol (assuming the S3 gateway has sufficient resources and bandwidth).
This is due to the StorJ storage architecture and illustrated by these examples:
If you upload a 100 MB file, then you only need to encrypt and upload approx. 101 MB using the S3 protocol, whereas you would need to encrypt and upload approx. 340 MB using the StorJ protocol.
If you download a 100 MB file, then you only need to download and decrypt approx. 101 MB using the S3 protocol, whereas you would need to download and decrypt approx. 130 MB using the StorJ protocol.
Also, the S3 protocol only uses one TCP connection per transfer (--transfers), whereas the StorJ protocol uses a minimum of 110 for each upload and 35 for each download, thus requiring significantly more system resources. Reference: https://rclone.org/tardigrade/#known-issues
Is my understanding correct or am I missing something?
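To make the comparison above concrete, here is a rough sketch of the arithmetic. The multipliers (1.01, 3.40, 1.30) and connection counts (1, 110, 35) are simply the approximate figures quoted in this post, not measurements; the function and dictionary names are made up for illustration.

```python
# Approximate figures taken from the post above; they are estimates, not benchmarks.
PROTOCOLS = {
    "s3":     {"up": 1.01, "down": 1.01, "conn_up": 1,   "conn_down": 1},
    "native": {"up": 3.40, "down": 1.30, "conn_up": 110, "conn_down": 35},
}


def wire_cost(protocol, file_mb, direction, transfers=1):
    """Return (approx. MB on the wire, minimum TCP connections)
    for `transfers` parallel files of `file_mb` MB each."""
    p = PROTOCOLS[protocol]
    total_mb = round(file_mb * p[direction] * transfers, 1)
    connections = p["conn_" + direction] * transfers
    return total_mb, connections


if __name__ == "__main__":
    # transfers=4 mirrors rclone's default --transfers value.
    for proto in PROTOCOLS:
        for direction in ("up", "down"):
            mb, conns = wire_cost(proto, 100, direction, transfers=4)
            print(f"{proto:6} {direction:4} {mb:8.1f} MB  {conns:4d} connections")
```

With four parallel 100 MB uploads this gives roughly 404 MB over 4 connections for S3 versus 1360 MB over at least 440 connections for the native protocol, which is the resource difference I am asking about.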
Greetings, @dominick here with Storj. When using the storj (Native) backend, the encryption and erasure coding occur client side, which is more intensive on local compute and results in a 2.68x upload multiplier due to erasure coding.
Use our native integration pattern to take advantage of client-side encryption as well as to achieve the best possible download performance. Uploads will be erasure-coded locally, thus a 1GB upload will result in 2.68GB of data being uploaded to storage nodes across the network.
Use this pattern for
The strongest security
The best download speed
Using the S3 backend for uploading is faster, as the encryption and erasure coding occur on our edge services. The disadvantage is that you have to share the encryption key with us, as we encrypt for you.
Use our S3 compatible Hosted Gateway integration pattern to increase upload performance and reduce the load on your systems and network. Uploads will be encrypted and erasure-coded server-side, thus a 1GB upload will result in only 1GB of data being uploaded to storage nodes across the network.
Use this pattern for
Reduced upload time
Reduction in network load
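As a rough illustration of what the two patterns mean for upload time on a bandwidth-limited connection, the sketch below applies the 2.68x and 1x figures from above. The 100 Mbit/s uplink is a purely hypothetical value, and the calculation assumes the network is the only bottleneck (it ignores CPU cost, protocol overhead, and parallelism).

```python
def upload_seconds(file_gb, uplink_mbit_s, expansion):
    """Estimate upload time in seconds for a network-bound transfer.

    file_gb       -- size of the original file in GB (10^9 bytes)
    uplink_mbit_s -- available upload bandwidth in Mbit/s (hypothetical input)
    expansion     -- bytes sent per byte stored (2.68 native, 1.0 hosted gateway)
    """
    bits_on_wire = file_gb * 8e9 * expansion
    return bits_on_wire / (uplink_mbit_s * 1e6)


if __name__ == "__main__":
    uplink = 100  # Mbit/s, assumed for illustration only
    for name, expansion in (("native", 2.68), ("hosted S3 gateway", 1.0)):
        secs = upload_seconds(1, uplink, expansion)
        print(f"{name}: about {secs:.0f} s for a 1GB upload")
```

On such a link the native pattern would need about 2.68 times as long for the same file, so the trade-off only favors native uploads when bandwidth is plentiful or key custody matters more than speed.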
Ideally users could choose between the "Native" and "Hosted S3" options. I can't post links here, but we have a recent write-up on performance. Search for "hotrodding decentralized storage", which should bring up my post on the Storj Community.
Perhaps, but at the expense of significantly increased usage of local computer and network resources - or speed.
The other possibility is to use the rclone crypt backend on top of the S3 backend to StorJ. It may well have enough security for the average rcloner.
I read your Storj post and didn't find any measurements to support this claim, and I suspect you assume a local computer having resources and network connectivity like your S3 edge computers - e.g. being directly connected to the internet backbone.
I would expect the S3 protocol to be the fastest due to the multipliers of the StorJ protocol, and this is partly supported by this statement from your StorJ post: “Ultimately your compute, ram, or internet connection is likely to be the gating factor”
Do you have a performance comparison of rclone using the S3 protocol vs the StorJ protocol on an average computer with an average ISP network connection?
You are absolutely correct. The only reason to upload via the storj backend is so you can be the custodian of your keys. It might not be most people's first choice, but it's a nice option to support for the more security conscious.
Over the weekend we did some testing and found that the rclone backend outperformed ours, but it required the file to be split using a utility and transferred with --transfers values around 128. We saw around 3000Mb/s via our backend with a parallelism of 192 and 6000Mb/s when using rclone and --transfers 128. The key is the load on our network: when downloading with our backend, it skirts our edge services (GatewayMT) and connects directly to the nodes holding the segments.
Really, I'd love to see our product implemented like it is today: natively (currently referred to as tardigrade in rclone) and under the S3 universal choices via the s3/rclone backend. Even better would be to set defaults, like our 64MB block size, when selecting us from the compatible S3 services.
Thank you for the time you have spent reviewing this!
If I understand correctly, the only remaining question is whether to keep supporting the tardigrade/storj native backend or to focus on the S3 integration (4th option vs the others).
To me, it seems better to keep both options:
They have different resource usage guarantees (with enough available resources - like a backup from a server - it can be better/faster to use storj/tardigrade)
Different security guarantees (sharing key or not)
And since tardigrade is included as a backend today, I think it should be maintained for backward compatibility anyway... Renaming it just makes the usage less confusing
But I think it would be a good idea to improve the documentation of the tardigrade/storj backend by explaining the different guarantees and advantages/disadvantages discussed above, to help users choose. (TL;DR: on edge use s3; on a server OR if security is important use tardigrade/storj)
I think we are taking activities and discussions in the wrong order.
Let me briefly recap status as seen from my perspective:
The S3 protocol seems to be the best option for most uses of Storj, except for the very security-conscious who are prepared to pay the price in equipment and/or speed.
I therefore propose the following sequence of your Storj activities:
Make a GitHub issue to discuss the overall approach and plan
Confirm full Storj S3 compatibility by running the automated S3 backend test against an S3 TestDrive having Storj as endpoint.
Make a performance comparison of the S3 and tardigrade backends to collect information on the expected resource usage and speed on representative use cases using identical equipment.
Make a pull request to update the tardigrade (and S3) backend documentation with this information.
Make a pull request to rename the tardigrade backend to Storj - or to deprecate it in favor of Storj's own client (https://www.storj.io/integrations/uplink-cli). Approach to be based on the results and agreements obtained in the previous steps.
@ncw Please correct me if you see something missing/mistaken