It sounds like we've found the root cause of most of the speed issue.
Setting --multi-thread-chunk-size 4G should make the code do more or less the same thing as it used to.
I think increasing --multi-thread-streams and decreasing --multi-thread-chunk-size should make it run faster. You could try --multi-thread-chunk-size 1G and --multi-thread-streams 16 and see what difference that makes. You should find just using --multi-thread-streams 16 on its own works quite well too.
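
To compare the three suggestions side by side, something like the sketch below might help. It assumes the tool is rclone and that the transfer is an `rclone copy`; the source and destination paths are placeholders, and `try_copy` is a hypothetical helper, none of which come from this thread. As written it only prints each command line, so nothing is transferred until the `echo` is removed.

```shell
#!/bin/sh
# Placeholder paths (assumptions): replace with your real source and destination.
SRC="${SRC:-/path/to/source}"
DST="${DST:-remote:path}"

# Hypothetical helper: prints the command line for a given set of extra flags.
# Remove the leading "echo" to actually run the copy.
try_copy() {
    echo rclone copy --progress "$@" "$SRC" "$DST"
}

# 1. Large chunks: should behave more or less like the old code.
try_copy --multi-thread-chunk-size 4G

# 2. Smaller chunks and more streams.
try_copy --multi-thread-chunk-size 1G --multi-thread-streams 16

# 3. More streams only, chunk size left at its default.
try_copy --multi-thread-streams 16
```

Running each variant against the same file and noting the transfer rate should show which setting makes the difference.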