Simple copy (from remote to local) operations with internetarchive backend result in two identical /metadata/ reads for the whole item:
NewFs -> Fs.NewObject -> listAllUnconstrained -> requestMetadata
operations.moveOrCopyFile -> Fs.NewObject -> listAllUnconstrained -> requestMetadata
This is an expensive operation and results in aggressive ratelimiting, especially because the requests arrive immediately one after the other and count towards burst limits.
When reading large items with many files sequentially, this cost is amortized fairly well. However, the notion of an "item" in IA infrastructure is of somewhat lower rank than a "bucket" elsewhere: we often need to read many small items (with high top-level parallelism, by necessity) and accordingly run into metadata read amplification and ratelimiting.
From what I understand, for most backends the two calls are distinct because NewFs reads the bucket metadata and moveOrCopyFile reads the object metadata; in this case, both calls in effect pull the same bucket metadata.
Is it possible to squirrel away the item metadata between the calls (in ctx perhaps, or on the Fs which is passed to moveOrCopyFile) such that we don't re-request it? This may present some cache invalidation challenges, but there are many read-only cases where this would be a big help, and as we know remote updates with internetarchive are extremely not realtime.