Force Download & Hash

lostheli · November 9, 2020, 9:20pm

There's a fair amount of discussion on hashes, but I could not find an answer to my specific question nor figure it out myself. Looking for a way to force rclone to download and hash files from a remote. These files will then be validated through separate steps.

Functions I have considered:

Hashsum is the logical choice but does not appear to have a force download option
Check does have a force download option but does not save the hash it computes. It also does not appear to download the files and hash them when the comparison directory is empty (the machine running this does not have access to the originals)
Rclone cat and pipe to a hash utility works for a single file but does not work for multiple files. I have many files and would not want to call rclone separately for each
Mount is not an option in this situation, so I cannot mount then run a separate utility
Copy then hash is also not an option here

This seems like it should be simple because check is doing what I want to do internally and hashsum has the output format that I'm looking for. Am I missing something simple like a global force flag for hashsum or an output option for check that would allow me to do this?

ncw · November 11, 2020, 6:20pm

Indeed it doesn't... It could potentially though

Check doesn't compute the hash, it compares the files byte by byte.

OK

I think this would need a --download option implemented for hashsum (and friends md5sum and sha1sum).

Is this something you'd like to work on?

lostheli · November 13, 2020, 4:53pm

Sure, I am happy to work on adding this.

I've done an initial review of some related existing functions and how it might be implemented. Becoming familiar with the existing code base and unpacking how it works has been slow for me - is there big picture documentation on organization and design intent and/or process flows of key processes that I might have missed? If not, I'll continue reading the code and comments.

Interested in feedback and advice on the following assumptions and questions:

Currently hashsum passes requests through several steps to the backends. This makes sense when leveraging the ability of backends to provide hashes. Here we will be creating hashes in a manner consistent across all backends; the backend must simply provide the file. Therefore, we will handle requests with the download flag in operations and continue passing through other requests to the backed.
** operations.HashLister and operations.HashListerBase64, could be modified to do this. It may be better to combine them into a single method with an optional base64 encoding step to prevent too much duplication of code. operations.Md5sum and operations.Sha1sum will also need to pass through the download flag.
** The corresponding functions in cmd would need to be modified to accept the download flag and send it as an input to the operations functions.
When the download flag is set, we want to accomplish something roughly equivalent to creating a multihasher, catting the file into the multihasher, and then printing the hash in the same way hashsum does.
For efficiency, we should be able to download and hash multiple files simultaneously. In most cases, transfer speed will be lower than computation speed and multiple transfers can improve transfer speed. Even in computation constrained cases, multiple CPU threads may be available and we must work on multiple files in parallel to leverage them when computing hashes in most cases. However, I see a limitation with the simple approach above given how the existing functions are implemented:
** If I understand the code in cat correctly, we cannot simply reuse the cat function for the download. Looking at line 1088 to 1091 in operations, it appears the download takes place while the mutex on a share writer is held. Handing cat a multihasher as its io.writer may work but would limit this to a single simultaneous download.
** We could implement a function similar to cat that creates a hasher then copies a file to it such that each instance of this function has its own hasher and a mutex on the hasher is not required. I am less certain, but it appears ListFn itself may also be limited to a single file at a time. The comments have a disclaimer that the creation of the file list is done in parallel. However, the execution of the passed function is done in ForObject, which is a sequential for loop. ListFn may be intended only for cases where the function execution time is negligible or parallel processing is otherwise unimportant.
What then is the best way to walk the remote applying our transfer and hash function to multiple files in parallel while respecting config.transfers or config.checkers? Is there a function designed for applying a long running function to each file of a single remote or does this 'solo march' or 'parallel ListFn' need to be created as part of this?
** Check and syncCopyMove use March. However, this is explicitly designed to walk two remotes in parallel and the existence of a destination appears integral to the code. In our case, we do not have a second remote.
** Check appears to determine the level of parallelism based on config.checkers even if transfers are required by the download option. I am unsure what happens if config.transfers is less than half of config.checkers or in the special case where config.transfers is 1. Given a forced download and hash function must both transfer and do local work - if that's what checkers is intended to limit - propose we limit to the lower of the two.
** Depending on how this map (application of download and hash function to each file) is done, there may be a choice between creating a new hasher for each file or creating a pool of hashers equal to the number of transfers. The first option is likely simpler but may incur greater overhead - I am not sure how significant the overhead for creating a hasher is.

ncw · November 15, 2020, 12:26pm

Great!

There is a code outline here: rclone/CONTRIBUTING.md at master · rclone/rclone · GitHub

There isn't overall flow documentation though - that would be nice.

That seems reasonable. I think you'd end up passing the download flag all the way to the hashSum function and doing the read the hash / download the file at that point.

I would call NewReopen to get a handle which will re-open itself on failures, then use io.Copy to copy the data into a multihasher. No neet to use Cat here

The ListFn while it lists in parallel, is serialised with a mutex, so you will only ever get one called at once.

I would do this work in two steps.

Do the first step without parallelism and get it working. Don't forget the unit tests.

Then to implement parallelism you would do something like this (untested)

func HashLister(ctx context.Context, ht hash.Type, f fs.Fs, w io.Writer) error {
	concurrencyControl := make(chan struct{}, fs.Config.Transfers)
	var wg sync.WaitGroup
	err := ListFn(ctx, f, func(o fs.Object) {
		concurrencyControl <- struct{}{} // get a concurrency slot
		wg.Add(1)
		go func() {
			defer func() {
				wg.Done()
				<-concurrencyControl // return the concurrency slot
			}()
			sum, _ := hashSum(ctx, ht, o)
			syncFprintf(w, "%*s  %s\n", hash.Width(ht), sum, o.Remote())
		}()
	})
	wg.Wait()
	return err
}

Note that the output is synchonised already so no need for anything else.

This doesn't guarantee the ordering of the output, but it was never guaranteed anyway.

So in summary

add a download flag to operations.hashSum and chase the flag additions upwards, including into the command line flags
combine operations.HashLister and operations.HashListerBase64 with a flag
add concurrency to operations.HashLister as above

You could add the --download and --base64 flags to hashsum, md5sum and sha1sum - maybe like the way it was done with check and cryptcheck

lostheli · November 23, 2020, 5:48pm

I've implemented the single threaded version of HashSum and HashLister as suggested. There are several comments where I am uncertain about the best way to implement in this code base mainly due to differences between the methods I used as references.

// hashSum returns the human readable hash for ht passed in.  This may
// be UNSUPPORTED or ERROR. If it isn't returning a valid hash it will
// return an error.
func hashSum(ctx context.Context, ht hash.Type, downloadFlag bool, o fs.Object) (string, error) {
	var sum string
	var err error

	// If downloadFlag is true, download and hash the file.
	// If downloadFlag is false, call o.Hash asking the backend for the hash
	if downloadFlag {
		
		// Setup accounting as NewTransfer.  Same approach used in Cat
		tr := accounting.Stats(ctx).NewTransfer(o)
		defer func() {
			tr.Done(ctx, err)
		}()

		// Setup download options.  Similar approach to Cat.
		// We do not have a start / end, so use only fs.Config.DownloadHeaders.
		// Unsure if copying to a new OpenOption object is required.
		// Copy has a very similar setup but includes hashOption
		var options []fs.OpenOption
		for _, option := range fs.Config.DownloadHeaders {
			options = append(options, option)
		}

		// Using NewReOpen.  Error handling seems to be done differently in each function.  This version is from Cat
		in, err := NewReOpen(ctx, o, fs.Config.LowLevelRetries, options...)
		if err != nil {
			err = fs.CountError(err)
			fs.Errorf(o, "Failed to open: %v", err)
			return "ERROR", err
		}
		in = tr.Account(ctx, in).WithBuffer() // account and buffer the transfer

		// Unclear if we need to close in.
		// Copy, which uses NewReOpen does.
		// Cat, rcat, and other methods use their own readCloser but don't appear to close
		// If the error does not occur until close, we log the error but return the hash as okay
		defer func() {
			err = in.Close()
			if err != nil {
				err = fs.CountError(err)
				fs.Errorf(o, "Failed to close: %v", err)
			}
		}()

		// Create a multihasher to receive the transfer
		// Should use hash.StreamTypes here instead of creating the hasher and copying?
		// Not doing fs.CountError or fs.Errorf here because other functions that create MultiHashers do not.  Should we?
		hasher, err := hash.NewMultiHasherTypes(hash.NewHashSet(ht))
		if err != nil {
			return "UNSUPPORTED", err
		}

		// Using fs.CountError based on Cat
		// StreamTypes does not do this, if copy fails it just returns the error
		_, err = io.Copy(hasher, in)
		if err != nil {
			err = fs.CountError(err)
			fs.Errorf(o, "Failed to copy file to hasher: %v", err)
			return "ERROR", err
		}

		// hasher.Sum returns an error, but I don't believe it's possible to error here.
		// It appears the only error is unsupported.
		// If NewMultiHasherTypes succeeds and ht has not changed, we should not receive this error.
		// Sum returns a byte array, and we want to return a hex string, so encode
		byteSum, err := hasher.Sum(ht)
		if err != nil {
			return "ERROR", err
		}
		sum = hex.EncodeToString(byteSum)
	} else {
		// If not downloadFlag, do the original version of HashSum

		tr := accounting.Stats(ctx).NewCheckingTransfer(o)
		defer func() {
			tr.Done(ctx, err)
		}()

		sum, err = o.Hash(ctx, ht)
		if err == hash.ErrUnsupported {
			return "UNSUPPORTED", err
		} else if err != nil {
			// This was previously at the debug level.
			// Changed to the error level to be consistent with the above
			fs.Errorf(o, "Failed to read %v: %v", ht, err)
			return "ERROR", err
		}
	}

And HashLister:

// HashLister does an md5sum equivalent for the hash type passed in
// Updated to handle both standard hex encoding and base64
// HashListerBase64 appears to have an odd case for non nil errors where the sum is not base64 encoded but the width is set based on base64 encoding
// In this new function, the standard width is instead used for the outputBase64 == True && err != nil case
func HashLister(ctx context.Context, ht hash.Type, outputBase64 bool, downloadFlag bool, f fs.Fs, w io.Writer) error {
	return ListFn(ctx, f, func(o fs.Object) {
		sum, err := hashSum(ctx, ht, downloadFlag, o)
		if outputBase64 && err == nil {
			hexBytes, _ := hex.DecodeString(sum)
			sum = base64.URLEncoding.EncodeToString(hexBytes)
			width := base64.URLEncoding.EncodedLen(hash.Width(ht) / 2)
			syncFprintf(w, "%*s  %s\n", width, sum, o.Remote())
		} else {
			syncFprintf(w, "%*s  %s\n", hash.Width(ht), sum, o.Remote())	
		}
	})
}

I have modified the HashSum unit tests as follows. This test is passing. Is more required here?

func TestHashSums(t *testing.T) {
	ctx := context.Background()
	r := fstest.NewRun(t)
	defer r.Finalise()
	file1 := r.WriteBoth(ctx, "potato2", "------------------------------------------------------------", t1)
	file2 := r.WriteBoth(ctx, "empty space", "-", t2)

	fstest.CheckItems(t, r.Fremote, file1, file2)

	// MD5 Sum

	var buf bytes.Buffer
	err := operations.HashLister(ctx, hash.MD5, false, true, r.Fremote, &buf)
	require.NoError(t, err)
	res := buf.String()
	if !strings.Contains(res, "336d5ebc5436534e61d16e63ddfca327  empty space\n") &&
		!strings.Contains(res, "                     UNSUPPORTED  empty space\n") &&
		!strings.Contains(res, "                                  empty space\n") {
		t.Errorf("empty space missing: %q", res)
	}
	if !strings.Contains(res, "d6548b156ea68a4e003e786df99eee76  potato2\n") &&
		!strings.Contains(res, "                     UNSUPPORTED  potato2\n") &&
		!strings.Contains(res, "                                  potato2\n") {
		t.Errorf("potato2 missing: %q", res)
	}

	// SHA1 Sum

	buf.Reset()
	err = operations.HashLister(ctx, hash.SHA1, false, false, r.Fremote, &buf)
	require.NoError(t, err)
	res = buf.String()
	if !strings.Contains(res, "3bc15c8aae3e4124dd409035f32ea2fd6835efc9  empty space\n") &&
		!strings.Contains(res, "                             UNSUPPORTED  empty space\n") &&
		!strings.Contains(res, "                                          empty space\n") {
		t.Errorf("empty space missing: %q", res)
	}
	if !strings.Contains(res, "9dc7f7d3279715991a22853f5981df582b7f9f6d  potato2\n") &&
		!strings.Contains(res, "                             UNSUPPORTED  potato2\n") &&
		!strings.Contains(res, "                                          potato2\n") {
		t.Errorf("potato2 missing: %q", res)
	}

	// QuickXorHash Sum

	buf.Reset()
	var ht hash.Type
	err = ht.Set("QuickXorHash")
	require.NoError(t, err)
	err = operations.HashLister(ctx, ht, false, true, r.Fremote, &buf)
	require.NoError(t, err)
	res = buf.String()
	if !strings.Contains(res, "2d00000000000000000000000100000000000000  empty space\n") &&
		!strings.Contains(res, "                             UNSUPPORTED  empty space\n") &&
		!strings.Contains(res, "                                          empty space\n") {
		t.Errorf("empty space missing: %q", res)
	}
	if !strings.Contains(res, "4001dad296b6b4a52d6d694b67dad296b6b4a52d  potato2\n") &&
		!strings.Contains(res, "                             UNSUPPORTED  potato2\n") &&
		!strings.Contains(res, "                                          potato2\n") {
		t.Errorf("potato2 missing: %q", res)
	}

	// QuickXorHash Sum with Base64 Encoded

	buf.Reset()
	err = operations.HashLister(ctx, ht, true, false, r.Fremote, &buf)
	require.NoError(t, err)
	res = buf.String()
	if !strings.Contains(res, "LQAAAAAAAAAAAAAAAQAAAAAAAAA=  empty space\n") &&
		!strings.Contains(res, "                 UNSUPPORTED  empty space\n") &&
		!strings.Contains(res, "                              empty space\n") {
		t.Errorf("empty space missing: %q", res)
	}
	if !strings.Contains(res, "QAHa0pa2tKUtbWlLZ9rSlra0pS0=  potato2\n") &&
		!strings.Contains(res, "                 UNSUPPORTED  potato2\n") &&
		!strings.Contains(res, "                              potato2\n") {
		t.Errorf("potato2 missing: %q", res)
	}
}

I have redirected the commands md5sum and sha1sum to HashLister. I have implemented the download flag and base64 flag (where required) for those two commands as well as hashsum and dbhashsum. I have removed operations.Md5sum, operations.Sha1sum, and operations.HashListerBase64.

I wanted to replace the commands other than hashsum with an alias to hashsum + the appropriate argument to avoid implementing the same flags multiple times. I could not determine how to do this.
Are there unit tests for the commands?

ncw · November 23, 2020, 9:14pm

Well done

Sure

Probably not.

All good

I don't know how to do that.

What you can do is export a function which adds the flags from hashsum and use it in md5sum/sha1sum

The commands are supposed to be thin veneers over the function in fs/operations so the unit tests by and large go in there.

Some of the commands have unit tests.

What they could really do with is some tests using using something like this but that is a job for another day!

I suggest you make a pull request - GitHub it is a much better platform for reviewing code - I can't really tell which code is new and which is changed from the code pastes above.

You can attach comments to parts of the PR that you want help on.

system · January 23, 2021, 5:14pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.