What could explain why its performance is so finicky? The theory of having 8 banks with one in conflict makes a lot of sense (7 of 8 is about 88%), but why is it so far from that in the real world?
Is that 33% typical for bidirectional ports, given that the read/write pattern of real code isn't very symmetrical?
The quoted peaks and "real-world" performance figures in the DF article are not described very well and the measurement methods are not disclosed.
The quoted scenario from the source is heavy on memory traffic and in theory it was supposed to be an illustration of a good use case, so I'm not sure why it would be this twitchy.
The fact that the documentation doesn't label the interface as 2x what is in the diagram is curious, because there's nothing wrong with giving peak figures that assume no banking conflicts and a frequently unrealistic 1:1 read/write ratio. Leaving that label off is usually a sign that the top number only applies to a more restricted use case. The lack of detail on the measurement method also means we can't rule out a range of common errors in benchmarking complex memory pipelines, which will try to prefetch, buffer, and coalesce whatever they can.
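To put rough numbers on those two assumptions, here is a back-of-the-envelope sketch. It is a toy model of my own, not anything from the documentation or the article: it assumes a port that can move the same peak amount in each direction at once, and it treats a bank conflict as simply taking that bank out of service. The function names are made up for illustration.

```python
def bank_conflict_fraction(banks: int, conflicted: int) -> float:
    """Fraction of peak left when `conflicted` of `banks` banks are unavailable.

    Assumption: every bank contributes equally and a conflicted bank
    contributes nothing.
    """
    if banks <= 0 or not 0 <= conflicted <= banks:
        raise ValueError("need banks > 0 and 0 <= conflicted <= banks")
    return (banks - conflicted) / banks


def duplex_fraction(read_share: float) -> float:
    """Fraction of the combined read+write peak reachable for a traffic mix.

    Assumption: reads and writes each have their own equal-width path, so
    the busier direction saturates first and caps total throughput.
    `read_share` is the fraction of all traffic that is reads (0.0 to 1.0).
    """
    if not 0.0 <= read_share <= 1.0:
        raise ValueError("read_share must be between 0 and 1")
    return 1.0 / (2.0 * max(read_share, 1.0 - read_share))


print(bank_conflict_fraction(8, 1))  # 0.875, the ~88% figure
print(duplex_fraction(0.5))          # 1.0, the ideal 1:1 mix
print(duplex_fraction(1.0))          # 0.5, pure reads use half the combined peak
```

Under this model anything other than an even read/write mix falls below the doubled peak, with a 3:1 mix landing at about two thirds of it. Real hardware will not follow a model this simple, which is part of why the measurement method matters.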
The picture from the leaks is an incomplete one, so I'm waiting for something with more detail, preferably something more direct than an anonymous source whose claims are passed along second and third hand, with non-technical parties trying to interpret them before passing them on.