Finding latest file in public AWS bucket

I am a novice with rclone but I use it for simple tasks. I work with weather data (NEXRAD) and I am having an issue finding data, that's why I am here.

The command I use to see the data is:
rclone lsf chunks:unidata-nexrad-level2-chunks/KEWX/

That will list directories 0-999, such as:

C:\super>rclone lsf chunks:unidata-nexrad-level2-chunks/KEWX/
1/
10/
100/
101/
102/
103/
104/
105/
106/
etc, etc

So, the data I need is always being written in one of these folders. Is there a way to know which folder has the latest data being written? Each folder holds the files for just one scan of the sky, and then the radar moves to a new folder to write the next scan. Therefore I only have a few moments to determine which folder # the data is in.

Is there a way to list by latest file written inside unidata-nexrad-level2-chunks/KEWX/? This would give me the folder #, I would assume.

Any help is appreciated!

I'm not an AWS guy, but I know that on Google there isn't anything that records whether a folder/directory has been written to. I doubt that metadata exists in AWS, but I could be wrong as well.

In Linux, most things have atime (access time) by default turned on so you can usually figure it out.

felix@gemini:/home$ stat /etc/hosts
  File: /etc/hosts
  Size: 130       	Blocks: 8          IO Block: 4096   regular file
Device: 812h/2066d	Inode: 6689650     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-07-09 09:04:33.547816863 -0400
Modify: 2021-06-10 08:55:10.399842342 -0400
Change: 2021-06-10 08:55:10.399842342 -0400
 Birth: -

But my FUSE mount doesn't keep or write that data:

felix@gemini:/GD$ stat mounted
  File: mounted
  Size: 243       	Blocks: 1          IO Block: 4096   regular file
Device: 34h/52d	Inode: 16502406546232425784  Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/   felix)   Gid: ( 1000/   felix)
Access: 2019-05-19 17:33:38.599000000 -0400
Modify: 2019-05-19 17:33:38.599000000 -0400
Change: 2019-05-19 17:33:38.599000000 -0400
 Birth: -

What you might be able to do is use some filters to look for the latest 'new' files and perhaps figure out the directory from there:

https://rclone.org/filtering/#max-age-don-t-transfer-any-file-older-than-this

An example would be a file that's an hour old or less:

felix@gemini:/GD$ rclone lsl gcrypt: --max-depth 1 --max-age 1h
      130 2021-07-10 01:07:48.630000000 hosts
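
To make the "figure out the directory from there" step a bit more concrete, here is a rough Python sketch. It assumes you have captured the text printed by rclone lsl (size, date, time and path on each line, as in the example above) and that the timestamps keep that fixed-width format, so they can be compared as plain strings. The function name newest_folder and the sample paths are made up for illustration. As far as I know, the age filter is applied after the listing is fetched, so this only helps with picking the folder and not with listing speed.

```python
def newest_folder(lsl_output):
    """Return (folder, timestamp) for the newest file in `rclone lsl` output.

    Each line is expected to look like: "<size> <date> <time> <path>".
    Returns None if no line can be parsed.
    """
    best = None
    for line in lsl_output.splitlines():
        # size, date, time, path (the path itself may contain spaces)
        parts = line.split(None, 3)
        if len(parts) != 4 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        _size, date, time, path = parts
        stamp = f"{date} {time}"  # fixed-width, so string order matches time order
        folder = path.split("/", 1)[0] if "/" in path else ""
        if best is None or stamp > best[1]:
            best = (folder, stamp)
    return best


if __name__ == "__main__":
    # Hypothetical listing; real text would come from the rclone lsl command.
    sample = (
        "     1024 2021-07-10 16:41:02.000000000 104/file-a\n"
        "     2048 2021-07-10 16:46:40.000000000 105/file-b\n"
        "      512 2021-07-10 16:44:15.000000000 104/file-c\n"
    )
    print(newest_folder(sample))
```

A file at the top level (like hosts above) comes back with an empty folder name.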

Thank you. I am playing around with your advice, but still no luck. I can list every file in all 999 folders but that takes an insanely long time. Stumped.

What are you running to get that? How long is insanely long?

If I run this:
rclone lsl chunks:unidata-nexrad-level2-chunks/KEWX/

It starts listing every single file, with its latest time, in all 999 folders. It never finishes (well, I am sure it does, but it takes longer than 5 minutes; by the time the listing finishes, the radar is already writing files into a new folder).

You probably want to at least add --fast-list and see if that helps. The challenge is that rclone doesn't keep any data, so you are having to fetch full listings each time.

There is a request to add a local metadata cache; that's in the works and would be great for situations like this. I'd avoid the cache backend, which would otherwise be a good fit, since it's sadly a bit buggy.

Doesn’t Windows change the parent directory date when it writes to it? You could use rclone lsd to get just the directories, I think, and then parse the time?

To animosity022 - I tried --fast-list with no luck. It still takes a very long time, I would suspect over 10 minutes to list everything.

alstrandsr - I tried the lsd command just now, and it's super fast. But it only gives the following output (truncated so as not to list all 999 here):

           0 2021-07-10 11:46:42        -1 1
           0 2021-07-10 11:46:42        -1 10
           0 2021-07-10 11:46:42        -1 100
           0 2021-07-10 11:46:42        -1 101
           0 2021-07-10 11:46:42        -1 102
           0 2021-07-10 11:46:42        -1 103
           0 2021-07-10 11:46:42        -1 104
           0 2021-07-10 11:46:42        -1 105
           0 2021-07-10 11:46:42        -1 106
           0 2021-07-10 11:46:42        -1 107

The timestamps are all exactly the same, and I know only one directory is being written to at a time. The active directory changes every few minutes. It looks like it's showing the time on my local PC when I executed the command (11:46, it's 11:50 now).

Wonderful:-(

I am a Linux dude so bear with me. Files have an access time, creation time, modified time, etc. Can you use a Windows command, possibly dir with some option, to see the other times? If one gives you what you want, you could add that to your script that runs rclone.

Also look at the options for lsd, if they are documented. Maybe it can be configured.

Windows or Linux doesn't matter, as it's pointing to a cloud remote and not a local file system. The remote has to support access time, and I'm not aware of any that do.

I've been keeping up with the comments and looking at documentation, still no luck with this. Being so new to rclone makes it harder for me.

Sorry, when I saw C:\ I thought you were copying from a local system.

Is the "cloud remote" an S3-compatible bucket? If not, I would look at the command-line tools from your provider to analyze dates. This is not so much an rclone issue as a source provider question.

Files in an S3-compatible bucket carry a "last modified date" which could be used, but I have no idea what capabilities your source data has. If you search all files remotely, yes, it will be slow. Here is an example of the data S3 can provide about an object.

{
    "Contents": [
        {
            "Key": "dir1/ft1.txt",
            "LastModified": "2021-07-10T20:28:18+00:00",
            "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
            "Size": 0,
            "StorageClass": "STANDARD"
        }
    ]
}

The rclone lsd command is essentially an aws s3 ls without recursion.

AWS bucket contents are stored by full name. A parent directory is known as a prefix. The prefix doesn't have a date the way a directory on a Windows or Linux system does.
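
Since the date lives on the object and not on the prefix, one way to get "the latest folder" is to pick the newest object and read the prefix off its key. Below is a minimal Python sketch, assuming you already have JSON shaped like the example above (a "Contents" list with "Key" and "LastModified"). The names parent_prefix and newest_object, and the second sample entry, are my own inventions for illustration.

```python
import json
from datetime import datetime


def parent_prefix(key):
    """Return everything before the last "/" in an object key ("" if none)."""
    return key.rsplit("/", 1)[0] if "/" in key else ""


def newest_object(list_objects_json):
    """Return (key, parent prefix) of the most recently modified object, or None."""
    contents = json.loads(list_objects_json).get("Contents") or []
    if not contents:
        return None
    # LastModified is an ISO 8601 string with a UTC offset, as in the example.
    newest = max(contents, key=lambda obj: datetime.fromisoformat(obj["LastModified"]))
    return newest["Key"], parent_prefix(newest["Key"])


if __name__ == "__main__":
    sample = json.dumps({"Contents": [
        {"Key": "dir1/ft1.txt", "LastModified": "2021-07-10T20:28:18+00:00"},
        # Hypothetical second object, added only to have something to compare.
        {"Key": "dir2/ft2.txt", "LastModified": "2021-07-10T20:31:05+00:00"},
    ]})
    print(newest_object(sample))
```

This still needs a listing that covers every object, so it does not remove the cost of the full listing; it only shows how the prefix falls out of the key.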

You may be one level down from where you should be. Your source data should have an API to get you the current file(s). If not, you may have to explore how to get that and then figure out if one of the rclone filters will give you the parent prefix.

Also -- if the source is a real-time system you probably want to get the next-to-last item so you don't try and copy something partial that is being created.
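
The next-to-last idea could look something like this in Python. It is only a sketch: it assumes you already have (key, timestamp) pairs with ISO 8601 timestamps like the LastModified value above, and the function name next_to_last_prefix and the sample keys are hypothetical.

```python
from datetime import datetime


def next_to_last_prefix(objects):
    """Given (key, iso_timestamp) pairs, return the second most recently written prefix.

    Returns None when fewer than two prefixes are present.
    """
    latest = {}  # prefix -> newest timestamp seen under that prefix
    for key, stamp in objects:
        prefix = key.rsplit("/", 1)[0] if "/" in key else ""
        when = datetime.fromisoformat(stamp)
        if prefix not in latest or when > latest[prefix]:
            latest[prefix] = when
    # Newest prefix first; the one after it should no longer be changing.
    ranked = sorted(latest, key=latest.get, reverse=True)
    return ranked[1] if len(ranked) >= 2 else None


if __name__ == "__main__":
    # Hypothetical keys and times, for illustration only.
    objects = [
        ("103/d", "2021-07-10T20:15:00+00:00"),
        ("104/a", "2021-07-10T20:20:00+00:00"),
        ("104/c", "2021-07-10T20:24:00+00:00"),
        ("105/b", "2021-07-10T20:28:18+00:00"),
    ]
    print(next_to_last_prefix(objects))
```

Skipping the newest prefix trades a few minutes of latency for not copying a scan that is still being written.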

Good luck.
