Can't list 1 directory in bucket so sync doesn't work properly

What is the problem you are having with rclone?

We use sync to backup objects from ceph to minio but I noticed that a bucket with over 1 million objects in ceph only had 40 thousand in minio. On investigation I found that one folder wasn't being synced.

The directory uploads/app contains several subdirectories, 2013 to 2021, but this command outputs nothing (this is the folder that's not being synced)

rclone lsd backup-rgw-wilxite:shine-oms/uploads/app/

Strangely though if we use this command it returns all of the files in the sub directories

rclone ls backup-rgw-wilxite:shine-oms/uploads/app/

To test further I copied a file to this directory

rclone copy hello.txt backup-rgw-wilxite:shine-oms/uploads/app/

and then tried to list it but got empty output

rclone ls backup-rgw-wilxite:shine-oms/uploads/app/hello.txt

I was able to list it with s3cmd using the same credentials

[root@rook-ceph-tools-6759cb4bd6-68hzp /]# s3cmd ls --no-ssl --host=${AWS_HOST} --host-bucket= s3://shine-oms/uploads/app/hello.txt
2021-12-27 21:28           12  s3://shine-oms/uploads/app/hello.txt

Although I also can't list the directory in s3cmd

[root@rook-ceph-tools-6759cb4bd6-68hzp /]# s3cmd ls --no-ssl --host=${AWS_HOST} --host-bucket= s3://shine-oms/uploads/app/         
2019-11-01 16:57            0  s3://shine-oms/uploads/app/

I realise that this means the problem is with the ceph bucket and not rclone, but I have run out of ideas for fixing it. I tried this in ceph to fix the indexes, but although it completed successfully it didn't resolve the problem

radosgw-admin bucket check --check-objects --fix --bucket=shine-oms --debug_rgw=10

Before I ran this command in ceph I did a sync of each of the subdirectories to minio to make sure we wouldn't lose any data, and this worked fine

rclone sync --use-mmap --checkers 128 --transfers 24 -P backup-rgw-tango:shine-oms/uploads/app/2013 minio-tango:shine-oms/uploads/app/2013
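The per-subdirectory workaround above can be scripted. This is a minimal sketch in Python that only builds the command lines and does not run them. It reuses the flags and remote names from the command above and the 2013 to 2021 range mentioned earlier; the function name `per_year_sync_commands` is made up for illustration.

```python
import shlex


def per_year_sync_commands(src_remote, dst_remote, bucket, base_path, years):
    """Build one `rclone sync` argument list per year subdirectory.

    Flags are copied from the command in this post; nothing is executed here.
    """
    commands = []
    for year in years:
        path = f"{bucket}/{base_path}/{year}"
        commands.append([
            "rclone", "sync", "--use-mmap",
            "--checkers", "128", "--transfers", "24", "-P",
            f"{src_remote}:{path}",
            f"{dst_remote}:{path}",
        ])
    return commands


if __name__ == "__main__":
    # Assumed values: remotes and paths as shown in this post.
    for argv in per_year_sync_commands(
        "backup-rgw-tango", "minio-tango", "shine-oms", "uploads/app",
        range(2013, 2022),
    ):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review them before pasting them into a shell or passing each list to `subprocess.run`.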

What is your rclone version (output from rclone version)

rclone v1.55.1-DEV

  • os/type: linux
  • os/arch: amd64
  • go/version: go1.14.12
  • go/linking: dynamic
  • go/tags: none

Which cloud storage system are you using? (eg Google Drive)

Ceph & Minio

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone sync --use-mmap --checkers 128 --transfers 24 -P backup-rgw-tango:shine-oms minio-tango:shine-oms

The rclone config contents with secrets removed.

[backup-rgw-tango]
type = s3
provider = Ceph
env_auth = false
access_key_id = <removed>
secret_access_key = <removed>
endpoint = <removed>
acl = private

[minio-tango]
type = s3
provider = Minio
env_auth = false
access_key_id = <removed>
secret_access_key = <removed>
endpoint = <removed>
acl = private

A log from the command with the -vv flag

This is the list command as opposed to the sync one as I don't want to run sync again until this issue is fixed

rclone lsd backup-rgw-wilxite:shine-oms/uploads/app/ -vv
2021/12/29 18:34:58 DEBUG : Using config file from "/root/.config/rclone/rclone.conf"
2021/12/29 18:34:58 DEBUG : rclone: Version "v1.55.1-DEV" starting with parameters ["rclone" "lsd" "backup-rgw-wilxite:shine-oms/uploads/app/" "-vv"]
2021/12/29 18:34:58 DEBUG : Creating backend with remote "backup-rgw-wilxite:shine-oms/uploads/app/"
2021/12/29 18:34:58 DEBUG : fs cache: renaming cache item "backup-rgw-wilxite:shine-oms/uploads/app/" to be canonical "backup-rgw-wilxite:shine-oms/uploads/app"
2021/12/29 18:34:59 DEBUG : 3 go routines active

hello and welcome to the forum,

you should update to the latest stable, v1.57.0

could be a permissions issue

i would assume that ceph itself, has a tool to view files in ceph.
make sure that works first.

Thank you so much for getting back to me.

Got the same result with updated version of rclone

{sarah@saz-asus}$ rclone version
rclone v1.57.0
- os/version: ubuntu 21.10 (64 bit)
- os/kernel: 5.13.0-051300-generic (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.17.2
- go/linking: static
- go/tags: none
{sarah@saz-asus}$ rclone lsd backup-rgw-wilxite:shine-oms/uploads/app/
{sarah@saz-asus}$ rclone ls backup-rgw-wilxite:shine-oms/uploads/app/hello.txt
{sarah@saz-asus}$ 

I don't believe there is a way to just list a directory directly in ceph but a list of the bucket contains all 1.6 million objects as it does in rclone

{sarah@saz-asus}$ rclone ls backup-rgw-wilxite:shine-oms | wc -l
1601546

40 thousand of the objects are outside of the /uploads/app path and they get synced correctly.
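One way to see which paths are short on the destination is to compare per-directory object counts from two saved `rclone ls` listings. A rough sketch in Python, assuming each line of `rclone ls` output is a size followed by a single space and the path (as in the outputs in this thread); the function names and the grouping depth are hypothetical choices, not anything rclone provides.

```python
from collections import Counter


def count_by_prefix(ls_lines, depth=2):
    """Count objects per leading directory path in `rclone ls` style lines."""
    counts = Counter()
    for line in ls_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        # Each line looks like "<size> <path>" with the size right-aligned.
        _size, _, path = line.lstrip().partition(" ")
        parts = path.split("/")
        # Group by at most `depth` directory components; "." means bucket root.
        prefix = "/".join(parts[:-1][:depth]) or "."
        counts[prefix] += 1
    return counts


def missing_prefixes(src_counts, dst_counts):
    """Return {prefix: (source_count, destination_count)} where dest is short."""
    return {
        prefix: (n, dst_counts.get(prefix, 0))
        for prefix, n in src_counts.items()
        if dst_counts.get(prefix, 0) < n
    }
```

For example, save `rclone ls backup-rgw-wilxite:shine-oms` and the matching minio listing to two text files, open each, pass the file objects to `count_by_prefix`, and feed both results to `missing_prefixes`.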

Just in case it helps this shows we can list the directory above

{sarah@saz-asus}$ rclone lsd backup-rgw-wilxite:shine-oms/uploads/
           0 2021-12-29 19:14:55        -1 app
           0 2021-12-29 19:14:55        -1 audit
           0 2021-12-29 19:14:55        -1 claims
           0 2021-12-29 19:14:55        -1 email

and the same in s3cmd

[root@rook-ceph-tools-6759cb4bd6-68hzp /]# s3cmd ls --no-ssl --host=${AWS_HOST} --host-bucket= s3://shine-oms/uploads/    
                          DIR  s3://shine-oms/uploads/app/
                          DIR  s3://shine-oms/uploads/audit/
                          DIR  s3://shine-oms/uploads/claims/
                          DIR  s3://shine-oms/uploads/email/
2019-11-01 16:57            0  s3://shine-oms/uploads/

well, not sure that is correct, as these two commands return different sets of folders
rclone lsd backup-rgw-wilxite:shine-oms/uploads/
and
s3cmd ls --no-ssl --host=${AWS_HOST} --host-bucket= s3://shine-oms/uploads/

there are five entries listed, the last one s3://shine-oms/uploads/ looks like a dir but is really a file of zero bytes, correct? and that one is not listed with rclone.

in s3, only objects exist, directories are just an abstraction.
some tools will create and use empty dir but rclone does not like that.
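To make the "directories are just an abstraction" point concrete, here is a simplified Python model of an S3-style list call with a prefix and a delimiter. It is a sketch of the general listing semantics only, with made-up sample keys. It does not talk to ceph, and it does not explain why the original bucket returns nothing.

```python
def list_with_delimiter(keys, prefix="", delimiter="/"):
    """Simplified model of an S3 list request with Prefix and Delimiter.

    Returns (contents, common_prefixes): keys directly under the prefix,
    and the rolled-up "directories" one level below it.
    """
    contents = []
    common_prefixes = []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        idx = rest.find(delimiter)
        if idx == -1:
            # No further delimiter: a plain object, including a zero-byte
            # marker whose key is exactly the prefix itself.
            contents.append(key)
        else:
            rolled_up = prefix + rest[:idx + len(delimiter)]
            if rolled_up not in common_prefixes:
                common_prefixes.append(rolled_up)
    return contents, common_prefixes


if __name__ == "__main__":
    # Hypothetical keys shaped like the ones in this thread.
    sample = [
        "uploads/",
        "uploads/app/",
        "uploads/app/2013/10/a.pdf",
        "uploads/app/hello.txt",
        "uploads/audit/x.log",
    ]
    print(list_with_delimiter(sample, "uploads/app/"))
```

In this model a zero-byte key such as `uploads/app/` comes back as an ordinary object when its own prefix is listed, which matches the timestamped zero-size line in the s3cmd output, while `2013/` is returned as a common prefix.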

I believe s3cmd is just showing the directory being listed, this is example of one of the directories in /uploads/app with s3cmd

[root@rook-ceph-tools-6759cb4bd6-68hzp /]# s3cmd ls --no-ssl --host=${AWS_HOST} --host-bucket= s3://shine-oms/uploads/app/2013
                          DIR  s3://shine-oms/uploads/app/2013/
[root@rook-ceph-tools-6759cb4bd6-68hzp /]# s3cmd ls --no-ssl --host=${AWS_HOST} --host-bucket= s3://shine-oms/uploads/app/2013/
                          DIR  s3://shine-oms/uploads/app/2013/10/
                          DIR  s3://shine-oms/uploads/app/2013/11/
                          DIR  s3://shine-oms/uploads/app/2013/12/
2019-11-01 16:58            0  s3://shine-oms/uploads/app/2013/

And with rclone

{sarah@saz-asus}$ rclone lsd backup-rgw-wilxite:shine-oms/uploads/app/2013
           0 2021-12-29 19:24:56        -1 10
           0 2021-12-29 19:24:56        -1 11
           0 2021-12-29 19:24:56        -1 12

This is an example from another bucket where an object was created as the directory path and rclone just ignores it, so I don't believe that is the problem

{sarah@saz-asus}$ rclone lsd backup-rgw-wilxite:support-your-school
           0 2021-12-29 19:33:34        -1 uploads
{sarah@saz-asus}$ rclone lsd backup-rgw-wilxite:support-your-school/uploads
2021/12/29 19:33:41 ERROR : : Entry doesn't belong in directory "" (same as directory) - ignoring
           0 2021-12-29 19:33:41        -1 downloads
           0 2021-12-29 19:33:41        -1 establishments
           0 2021-12-29 19:33:41        -1 files
           0 2021-12-29 19:33:41        -1 img
           0 2021-12-29 19:33:41        -1 stockImg
           0 2021-12-29 19:33:41        -1 tmp

For info the ceph user in backup-rgw-wilxite is a system user so it has permission to all buckets and objects

@asdffdsa thanks for trying to help, guess I'll go with the last resort which is to create a new bucket and copy the data across.

Thanks again, really appreciate your efforts :slight_smile:

Just for info as I really didn't expect this...

I created a new bucket and ran this expecting to have to run additional commands for every sub directory in app

rclone sync --use-mmap --checkers 128 --transfers 48 --fast-list -P backup-rgw-wilxite:shine-oms shine-rgw-wilxite:shine-system

BUT it is syncing everything!!

rclone lsd backup-rgw-wilxite:shine-system/uploads/app
           0 2022-01-01 13:00:42        -1 2013
           0 2022-01-01 13:00:42        -1 2014
           0 2022-01-01 13:00:42        -1 2015
           0 2022-01-01 13:00:42        -1 2016
           0 2022-01-01 13:00:42        -1 2017
           0 2022-01-01 13:00:42        -1 2018
           0 2022-01-01 13:00:42        -1 2019
           0 2022-01-01 13:00:42        -1 2020
           0 2022-01-01 13:00:42        -1 2021
           0 2022-01-01 13:00:42        -1 tmp

So why is rclone sync working between 2 Ceph buckets but not Ceph and Minio??

Using rclone lsd on the original bucket still returns nothing so there is obviously a problem but would love to know what caused it!!

me too, i had this exact issue, i would need to know.

note: not an expert with s3cmd and i could be misinterpreting your output

tldr: i think this example summarizes all my points

s3cmd ls s3://zork/
                          DIR  s3://zork/01/
2022-01-01 16:57            0  s3://zork/01

=---------------------------------------------------------=

i still go back to the issue of why s3://shine-oms/uploads/app/ looks like a file, not a directory

  • has a time stamp
  • has a size of zero

=-------------------------------------------------=

i still do not believe that is correct.
but if you are correct, i cannot see the logic of listing a root dir and including the root dir in the list?

and the fact that s3cmd ls --no-ssl --host=${AWS_HOST} --host-bucket= s3://shine-oms/uploads/app/2013/
shows 2019-11-01 16:58 0 s3://shine-oms/uploads/app/2013/
and
rclone lsd backup-rgw-wilxite:shine-oms/uploads/app/2013 does not
might be further evidence for my point that the object s3://shine-oms/uploads/app/2013/ is considered as a file by rclone.

so i would like to see the output of this command, to check whether 2013 is listed as a file
s3cmd ls --no-ssl --host=${AWS_HOST} --host-bucket= s3://shine-oms/uploads/app/2013 --max-depth=1

and the equivalent output from ceph command line and ceph GUI.

Yeah you are absolutely right it can't be just showing the directory when it has a timestamp from the past.

I tried your command but it failed with invalid option, s3cmd only lists at depth 1 by default so guess that's why there is no equivalent command for it.

[root@rook-ceph-tools-6759cb4bd6-68hzp /]# s3cmd ls --no-ssl --host=${AWS_HOST} --host-bucket= s3://shine-oms/uploads/app/2013 --max-depth=1
Usage: s3cmd [options] COMMAND [parameters]

s3cmd: error: no such option: --max-depth

This bucket was created in 2019 which matches up with the timestamp. I believe the import routine we used must have created the empty file. To prove this I did a list of the 2020 one (obviously created after the import) from s3cmd and that didn't have the empty file

[root@rook-ceph-tools-6759cb4bd6-68hzp /]# s3cmd ls --no-ssl --host=${AWS_HOST} --host-bucket= s3://shine-oms/uploads/app/2020/
                          DIR  s3://shine-oms/uploads/app/2020/01/
                          DIR  s3://shine-oms/uploads/app/2020/02/
                          DIR  s3://shine-oms/uploads/app/2020/03/
                          DIR  s3://shine-oms/uploads/app/2020/04/
                          DIR  s3://shine-oms/uploads/app/2020/05/
                          DIR  s3://shine-oms/uploads/app/2020/06/
                          DIR  s3://shine-oms/uploads/app/2020/07/
                          DIR  s3://shine-oms/uploads/app/2020/08/
                          DIR  s3://shine-oms/uploads/app/2020/09/
                          DIR  s3://shine-oms/uploads/app/2020/10/
                          DIR  s3://shine-oms/uploads/app/2020/11/
                          DIR  s3://shine-oms/uploads/app/2020/12/

This still doesn't explain the rclone sync issues, as the uploads directory also has an empty file and yet this one is synced correctly

[root@rook-ceph-tools-6759cb4bd6-68hzp /]# s3cmd ls --no-ssl --host=${AWS_HOST} --host-bucket= s3://shine-oms/uploads/
                          DIR  s3://shine-oms/uploads/app/
                          DIR  s3://shine-oms/uploads/audit/
                          DIR  s3://shine-oms/uploads/claims/
                          DIR  s3://shine-oms/uploads/email/
2019-11-01 16:57            0  s3://shine-oms/uploads/

Also doesn't explain why sync to another ceph bucket works but the sync to minio doesn't

good, we are making progress

well, everything in s3 is an object; files and directories are abstractions.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html
so to make life easy for users of the aws s3 website, amazon needed a hierarchical structure to emulate a typical file system.
so to emulate a directory, amazon would create a zero byte object, with a name that ends with a forward slash.
2019-11-01 16:58 0 s3://shine-oms/uploads/app/2013/

most apps do that themselves, cloudberry explorer, s3browser and so on.

rclone is not fully compatible with that.
for example, https://rclone.org/commands/rclone_mount/#limitations
"The bucket based remotes (e.g. Swift, S3, Google Compute Storage, B2, Hubic) do not support the concept of empty directories, so empty directories will have a tendency to disappear once they fall out of the directory cache."

Might want to check out:

Emtpy directory markers for S3 and GCS backends by Jancis · Pull Request #5323 · rclone/rclone (github.com)

i was just about to search for that and update this topic.

so thanks for saving me a few minutes, now i can spend my time and talent on my (long term) income.

Thanks so much for your time, and sorry we've taken up so much of it. I do understand that files and directories in s3 are abstractions, but since rclone has the 'list directories' command 'lsd' I thought it was fair to speak in these terms. It's also worth noting that s3 uses these 'fake' directories for indexing. We have 1 bucket with 17 million objects on one level and I haven't yet figured out how to back this up. Neither rclone nor s3cmd copes with this as far as I've found, or maybe it's ceph that doesn't; I think I will need to get the client to change their structure!! Any tips welcome :wink:

The issue is not empty directories disappearing, both ceph and rclone see the objects inside the path but rclone doesn't sync them to minio

I think this issue has been a little bit sidetracked by s3cmd, which is not a tool we have ever used to add objects to buckets or retrieve them. I did test it as a possibility for doing our backups but we went with rclone, and the only reason I added it in earlier posts was to try and help debug.

We use rclone and SDK libraries for Go, Rust & PHP but it is possible this bucket was accessed by s3browser, I don't use windows so my client tool is always rclone but I believe some of my colleagues do.

I'll speak to them when they are back in work tomorrow and once we switch the client to the new bucket then I can run some possibly destructive tests on the old bucket without worrying!

sure, i understand that is not the case, as that link was about rclone mount.
just pointing out that rclone can/does have issue with empty directories.
and perhaps other tools as well.

the most i have used/tested is 1,000,000 objects using wasabi, an s3 clone known for hot storage,
in my testing and based on forum posts, nothing is faster.
this post might be of use.
https://forum.rclone.org/t/fastest-way-to-check-for-changes-in-2-5-million-files/25957/11

Thanks, will have a proper look tomorrow, really appreciate the help

Thanks, I'll investigate wasabi tomorrow, not heard of that before but look forward to trying it and :crossed_fingers:
