I’ve proven it with a product we use called Atempo Ada. When it comes to millions of small files, the CPU in each ECS node handling the transactions is much more of a bottleneck than the bandwidth.
As for “that is how the OS does it”, that’s presumably only because that’s how you’ve written it. If you run the same nslookup from the command line 4 times in a row, you will get 4 different IPs. So rclone would need to support this behaviour: give me all the IPs for the destination DNS record, and randomly choose n of them, where n is the number of transfers requested.
Does that make sense?