Hi everyone!
I have a use case where I need to append new files from a source to a target.
The list of files to copy is built from the output of `rclone lsf` on the source and refined using some other logic coming from elsewhere.
Today I'm using `--files-from-raw` in combination with the `copy` command to do so.
The problem I have today is that `copy` performs a listing on the source to find the information for each file to copy, and I'm working with buckets that hold tens of millions of files.
Because of this I'm taking a huge performance hit (some listings take more than an hour), and it also requires more RAM than I feel is necessary (easily 2-3 GB).
The feature I suggest would be a dedicated command (e.g. `copyfiles` or similar) that would take a `--files-list` parameter with content similar to `--files-from-raw`.
This command would:
- ignore any provided filters (the exact list of files to copy is already given)
- retrieve each source object's info using HEAD requests instead of listing
I have a first implementation that seems to work, narrowed to my use case, although I have not yet tested it extensively on huge volumes.
The naïve implementation I have modifies the `March` struct by doing something like:
func (m *March) Run(ctx context.Context) error {
	// Without a file list, fall back to the existing directory walk
	if m.ListFiles == "" {
		return m.walk(ctx)
	}
	return m.iterate(ctx)
}
Where `walk` is the current implementation. `iterate` would look something like:
func (m *March) iterate(ctx context.Context) error {
	ci := fs.GetConfig(ctx)
	m.init(ctx)
	files := make(chan string)
	checkers := ci.Checkers

	var mu sync.Mutex // Protects vars below
	var jobError error
	var errCount int

	var g errgroup.Group
	for i := 0; i < checkers; i++ {
		g.Go(func() error {
			for remote := range files {
				// Look up each file individually instead of listing the source
				obj, err := m.Fsrc.NewObject(ctx, remote)
				if err != nil {
					mu.Lock()
					// Keep reference only to the first encountered error
					if jobError == nil {
						jobError = err
					}
					errCount++
					mu.Unlock()
				}
				// TODO use m.NoCheckDest to determine if we should check target or not
				// if we do, we may call Match instead of SrcOnly
				if obj != nil {
					m.Callback.SrcOnly(obj)
				}
			}
			return nil
		})
	}

	// Feed the workers with one remote path per line of the list
	err := fs.ForEachLine(m.ListFiles, true, func(remote string) error {
		files <- remote
		return nil
	})
	close(files)
	// Always wait for the workers so none is still running when we return
	waitErr := g.Wait()
	if err != nil {
		// Propagate errors from reading the list
		return err
	}
	if waitErr != nil {
		return waitErr
	}
	if errCount > 1 {
		return fmt.Errorf("march failed with %d error(s): first error: %w", errCount, jobError)
	}
	return jobError
}