Avoiding re-uploads with sftp

What is the problem you are having with rclone?

It's working fine - except that it keeps re-uploading files. Which isn't really a surprise as the files are being regenerated. So while the hashes would stay the same, the modification (and even creation) time changes.

I am now searching for a way to optimize the upload.

Since it's an sftp-only target there is no way to check the hash on the target (unless there is an rclone ssh subsystem that I don't know of). That's why I have the hash check disabled. I was hoping there is some mode that uploads an index that could be used to check against the source?

What is your rclone version (output from rclone version)

rclone v1.52.2
- os/arch: darwin/amd64
- go version: go1.14.4

Which OS you are using and how many bits (eg Windows 7, 64 bit)

macOS 10.15.3 transferring to Ubuntu 19.10

Which cloud storage system are you using? (eg Google Drive)

sftp with key auth and nologin shell.

The command you were trying to run (eg rclone copy /tmp remote:tmp)

	RCLONE_CONFIG_SITE_DISABLE_HASHCHECK=true \
	RCLONE_CONFIG_SITE_TYPE=sftp \
	RCLONE_CONFIG_SITE_HOST=$(DEPLOY_HOST) \
	RCLONE_CONFIG_SITE_USER=deploy \
	rclone sync -v dist site:$(DEPLOY_PATH)

The rclone config contents with secrets removed.

no config besides the env variables

A log from the command with the -vv flag

2020/07/22 17:54:54 DEBUG : rclone: Version "v1.52.2" starting with parameters ["rclone" "sync" "-vv" "dist" "site:foo.com/doc"]
2020/07/22 18:00:29 NOTICE: Config file "~/.config/rclone/rclone.conf" not found - using defaults
2020/07/22 17:54:54 DEBUG : fs cache: renaming cache item "dist" to be canonical "/site/dist"
2020/07/22 17:54:58 DEBUG : sftp://deploy@x.x.x.x:22/foo.com/doc: New connection 192.168.178.21:52947->x.x.x.x:22 to "SSH-2.0-OpenSSH_8.0p1 Ubuntu-6build1"

2020/07/22 17:54:59 DEBUG : robots.txt: Modification times differ by -21m50.222770633s: 2020-07-22 17:54:54.222770633 +0200 CEST, 2020-07-22 17:33:04 +0200 CEST
2020/07/22 17:54:59 DEBUG : index.html: Modification times differ by -21m50.267267973s: 2020-07-22 17:54:54.267267973 +0200 CEST, 2020-07-22 17:33:04 +0200 CEST

2020/07/22 17:55:00 DEBUG : sftp://deploy@x.x.x.x:22/foo.com/doc: New connection 192.168.178.21:52948->x.x.x.x:22 to "SSH-2.0-OpenSSH_8.0p1 Ubuntu-6build1"
2020/07/22 17:55:00 DEBUG : sftp://deploy@x.x.x.x:22/foo.com/doc: New connection 192.168.178.21:52949->x.x.x.x:22 to "SSH-2.0-OpenSSH_8.0p1 Ubuntu-6build1"
2020/07/22 17:55:00 DEBUG : sftp://deploy@x.x.x.x:22/foo.com/doc: New connection 192.168.178.21:52950->x.x.x.x:22 to "SSH-2.0-OpenSSH_8.0p1 Ubuntu-6build1"
2020/07/22 17:55:00 DEBUG : sftp://deploy@x.x.x.x:22/foo.com/doc: New connection 192.168.178.21:52951->x.x.x.x:22 to "SSH-2.0-OpenSSH_8.0p1 Ubuntu-6build1"
2020/07/22 17:55:00 DEBUG : sftp://deploy@x.x.x.x:22/foo.com/doc: New connection 192.168.178.21:52952->x.x.x.x:22 to "SSH-2.0-OpenSSH_8.0p1 Ubuntu-6build1"
2020/07/22 17:55:00 DEBUG : sftp://deploy@x.x.x.x:22/foo.com/doc: New connection 192.168.178.21:52953->x.x.x.x:22 to "SSH-2.0-OpenSSH_8.0p1 Ubuntu-6build1"
2020/07/22 17:55:01 DEBUG : sftp://deploy@x.x.x.x:22/foo.com/doc: New connection 192.168.178.21:52954->x.x.x.x:22 to "SSH-2.0-OpenSSH_8.0p1 Ubuntu-6build1"
2020/07/22 17:55:01 DEBUG : sftp://deploy@x.x.x.x:22/foo.com/doc: New connection 192.168.178.21:52955->x.x.x.x:22 to "SSH-2.0-OpenSSH_8.0p1 Ubuntu-6build1"
2020/07/22 17:55:01 DEBUG : sftp://deploy@x.x.x.x:22/foo.com/doc: New connection 192.168.178.21:52956->x.x.x.x:22 to "SSH-2.0-OpenSSH_8.0p1 Ubuntu-6build1"
2020/07/22 17:55:01 DEBUG : sftp://deploy@x.x.x.x:22/foo.com/doc: New connection 192.168.178.21:52957->x.x.x.x:22 to "SSH-2.0-OpenSSH_8.0p1 Ubuntu-6build1"
2020/07/22 17:55:01 DEBUG : sftp://deploy@x.x.x.x:22/foo.com/doc: New connection 192.168.178.21:52958->x.x.x.x:22 to "SSH-2.0-OpenSSH_8.0p1 Ubuntu-6build1"
...
2020/07/22 17:55:02 DEBUG : kontakt/index.html: Modification times differ by -21m50.268337075s: 2020-07-22 17:54:54.268337075 +0200 CEST, 2020-07-22 17:33:04 +0200 CEST
...
2020/07/22 17:55:03 INFO  : robots.txt: Copied (replaced existing)
2020/07/22 17:55:03 INFO  : index.html: Copied (replaced existing)
2020/07/22 17:55:07 INFO  : kontakt/index.html: Copied (replaced existing)
...

When I enable hashing I also see

2020/07/22 18:03:09 DEBUG : Saving config "md5sum_command" = "none" in section "site" of the config file
2020/07/22 18:03:09 DEBUG : Saving config "sha1sum_command" = "none" in section "site" of the config file

The issue is that you turned off checksums, so if the size or modification time differs, it's going to re-upload.

With checksums, it just updates the mod time on the file:

textere@seraphim ~ % sudo touch /etc/hosts
Password:
textere@seraphim ~ % ls -al /etc/hosts
-rw-r--r--  1 root  wheel  213 Jul 22 12:22 /etc/hosts
textere@seraphim ~ % rclone copy /etc/hosts SFTP: -vv
2020/07/22 12:22:22 DEBUG : rclone: Version "v1.52.0" starting with parameters ["rclone" "copy" "/etc/hosts" "SFTP:" "-vv"]
2020/07/22 12:22:22 DEBUG : Using config file from "/Users/textere/Documents/rclone.conf"
2020/07/22 12:22:22 DEBUG : fs cache: renaming cache item "/etc/hosts" to be canonical "/etc"
2020/07/22 12:22:22 DEBUG : sftp://felix@192.168.1.30:22/: New connection 192.168.1.152:57683->192.168.1.30:22 to "SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.1"
2020/07/22 12:22:23 DEBUG : hosts: Modification times differ by -6689h42m47.828537s: 2020-07-22 12:22:17.828537 -0400 EDT, 2019-10-17 18:39:30 -0400 EDT
2020/07/22 12:22:23 DEBUG : sftp cmd = hosts
2020/07/22 12:22:23 DEBUG : sftp output = "a3f51a033f988bc3c16d343ac53bb25f  hosts\n"
2020/07/22 12:22:23 DEBUG : sftp hash = "a3f51a033f988bc3c16d343ac53bb25f"
2020/07/22 12:22:23 DEBUG : hosts: MD5 = a3f51a033f988bc3c16d343ac53bb25f OK
2020/07/22 12:22:23 INFO  : hosts: Updated modification time in destination
2020/07/22 12:22:23 DEBUG : hosts: Unchanged skipping
2020/07/22 12:22:23 INFO  :
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Checks:                 1 / 1, 100%
Elapsed time:         0.0s

2020/07/22 12:22:23 DEBUG : 13 go routines active

@Animosity022 Please also see the second part of the log with hashing on. There is no md5sum or sha1sum available on the server. I assume those are only available with full ssh access.

No, it just needs to know where the md5sum command is located on the server.

And it will also need to run those commands - unless I am missing something.
Here is the sshd config to give a bit more context.

Match User deploy
  ChrootDirectory /srv/deploy
  ForceCommand internal-sftp
  AllowAgentForwarding no
  AllowTcpForwarding no
  X11Forwarding no

Yes, the user on the server needs to be able to execute the command in the path. Normally, you don't have to configure anything, as on most distributions it isn't locked down.

If you locked it down, you'd need to make the command executable by that user if you want to use checksums.

My fresh Ubuntu VM with a normal install and SSH set up with no configuration changes works without issue on checksums. If you turn off checksums, you have other options, as by default it'll look at size and modification time.

You can use --size-only and only check size if you are unable to use checksums.

In the end, you need some way to compare the files on the source and destination to decide whether they are the same, so checksum and size are the options you have to go with.

Of course it will work when there is full ssh access - which is the default. But in this scenario I want to avoid giving full ssh access.

Just checking the size is too weak an indicator.

I was hoping there was a mode where rclone could upload and compare a manifest. But I guess the only option is to fiddle with the ssh config to allow the execution of the hash commands then, or live with the re-uploads.

Bummer, but thanks.

How would that work? What would it do?

Are you doing a restricted shell? How are you restricting access? You can just add the checksum command to the shell. There are a few examples of it, like this:

How would that work? What would it do?

Making the assumption that the manifest will always be updated on updates, there could be e.g. a single .index file that lists all files and their hashes. This could all be done on the client. This file would have to be downloaded first to check what has changed.
That would be a naive implementation for it.

Are you doing a restricted shell? How are you restricting access?

See above. I provided the sshd snippet.
But thanks for the link to the article.

I think this is going to be hard to work around without hashes. As you've noted --size-only is a pretty weak check - it is better than nothing though.

By this I think you mean: could rclone upload the hashes to the sftp server and then download them again to check against?

You can use the chunker backend to do this. You'd set the chunk_size very large (assuming you don't want chunked files) and then set the hash_type to md5all or sha1all meaning you want a hash of all files. This will store a bit of metadata per file with the hash in.
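As a rough sketch of that setup, in the same environment-variable style as the original command: the remote name `sitehash` and the `100G` chunk size are arbitrary choices for illustration, and `DEPLOY_HOST` and `DEPLOY_PATH` are assumed to be set in the shell. It has not been run against a real server.

```shell
# Underlying sftp remote, as in the original command.
export RCLONE_CONFIG_SITE_TYPE=sftp
export RCLONE_CONFIG_SITE_HOST="${DEPLOY_HOST}"
export RCLONE_CONFIG_SITE_USER=deploy
export RCLONE_CONFIG_SITE_DISABLE_HASHCHECK=true

# Hypothetical chunker remote "sitehash" wrapping the sftp target.
export RCLONE_CONFIG_SITEHASH_TYPE=chunker
export RCLONE_CONFIG_SITEHASH_REMOTE="site:${DEPLOY_PATH}"
export RCLONE_CONFIG_SITEHASH_CHUNK_SIZE=100G   # large, so no file should get split
export RCLONE_CONFIG_SITEHASH_HASH_TYPE=md5all  # ask chunker to keep an MD5 for every file

# Sync through the chunker remote instead of the bare sftp one.
rclone sync -v dist sitehash:
```

Whether the extra metadata changes how the files appear on the server is worth checking in the chunker docs before pointing this at a public web root.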

You could also do it manually by running rclone md5sum locally, uploading that to the server then downloading it and diffing it.
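The manual approach could look roughly like the sketch below. It assumes the `site:` remote is already configured, that `DEPLOY_PATH` is set in the environment, and that the manifest is stored on the server as `.index`; the function names and temp-file names are made up for illustration. `changed_files` only compares two md5sum-style manifests, and `deploy` wires it to rclone and is untested against a real server.

```shell
#!/bin/sh
# Sketch only: names and paths below are assumptions, not rclone conventions.

# changed_files OLD NEW
# Print every path in NEW whose MD5 is missing from, or different in, OLD.
# Both files use md5sum layout: 32 hex chars, two spaces, then the path.
changed_files() {
    awk 'FILENAME == ARGV[1] { old[substr($0, 35)] = substr($0, 1, 32); next }
         old[substr($0, 35)] != substr($0, 1, 32) { print substr($0, 35) }' "$1" "$2"
}

# deploy: upload only the files whose hash changed, then refresh the manifest.
# Assumes a configured "site:" remote and DEPLOY_PATH in the environment.
deploy() {
    tmp=$(mktemp -d) || return 1
    rclone md5sum dist > "$tmp/new.md5" || return 1
    # First run: no manifest on the server yet, so start from an empty one.
    rclone copyto "site:${DEPLOY_PATH}/.index" "$tmp/old.md5" || : > "$tmp/old.md5"
    changed_files "$tmp/old.md5" "$tmp/new.md5" > "$tmp/changed.txt"
    rclone copy -v --files-from "$tmp/changed.txt" dist "site:${DEPLOY_PATH}" || return 1
    rclone copyto "$tmp/new.md5" "site:${DEPLOY_PATH}/.index"
}
```

You would call `deploy` in place of `rclone sync`. Unlike `sync`, this sketch does not delete files that have disappeared from `dist`, and it trusts the manifest rather than what is actually on the server.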

Pretty much, yes. Along the lines of having an index as:

 .index:
 53c4e0c7ba9ab420362fd67deabe2b80  index.html
 4123c3cad74036160bdebcd26bb4146b  css/styles.css

The chunker backend sounds pretty good - but IIUC it will have an impact on the file names. Which is prohibitive for this use case.

The manual diffing could be an option of course. I was just hoping I didn't have to write that myself :slight_smile:

I thought the data files had their normal names and there was a sidecar md5 file but I may be wrong about that!

:smiley:

What sort of files are they BTW? Maybe there is a different way to solve this.

That would be nice! I might have misread the docs.

What sort of files are they BTW? Maybe there is a different way to solve this.

Just a website. If I move the videos to a CDN it might not be so bad. But I was also wondering about this in general.
