The small difference between GDDR5 and DDR3 quite surprises me. Are these the best cases or what? I was expecting the difference to be higher than a mere 20%, which, once we factor in the grand total ((cache accesses + DRAM accesses) / total accesses), should hardly affect the final result...
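To put rough numbers on that "grand total" point, here is a minimal sketch of a blended average latency. The hit rate and every latency in it are made-up illustrative values, not measurements from either console, and the platforms are labelled A and B on purpose.

```python
# Illustrative only: the hit rate and latencies are assumptions, not measured data.

def average_latency_ns(hit_rate, cache_ns, dram_ns):
    """Blend cache and DRAM latency by how often each path is taken."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * dram_ns

CACHE_NS = 10.0     # assumed latency of an access served from cache
DRAM_A_NS = 100.0   # assumed DRAM latency on platform A
DRAM_B_NS = 120.0   # assumed DRAM latency on platform B (20% higher)
HIT_RATE = 0.95     # assumed fraction of accesses served from cache

avg_a = average_latency_ns(HIT_RATE, CACHE_NS, DRAM_A_NS)
avg_b = average_latency_ns(HIT_RATE, CACHE_NS, DRAM_B_NS)
overall_gap = (avg_b - avg_a) / avg_a

print(f"A: {avg_a:.1f} ns, B: {avg_b:.1f} ns, overall gap: {overall_gap:.1%}")
```

With these assumed inputs the 20% DRAM gap shrinks to roughly 7% of the blended figure, and it shrinks further as the hit rate rises.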
I believe latency numbers provided for an architecture tend to focus on best or good cases. It's almost trivial to force a hit to a closed bank or nail the bus with read/write turnaround penalties. There are a few potential spots with bank grouping that might inject a few extra cycles for GDDR5 in scenarios DDR3 wouldn't need to worry about, but in general the datasheets for modules with either tech tend to keep the wall clock times in the same band.
The common case should be that the memory subsystem moves heaven and earth to make DRAM hit the good case, but therein lies the dark art of memory controller tech.
At least for this comparison, both platforms have the same technology pool to draw from.
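As a rough illustration of how far the good and bad bank cases sit apart, here is a sketch using assumed DDR3-1600 timings of 11-11-11. Real modules vary by speed bin, and GDDR5 parts have their own timing sets. It covers only device-side time to first data, not controller queueing, bus turnaround, or the burst transfer itself.

```python
# Illustrative timings: DDR3-1600 at 11-11-11 is an assumption for this sketch.
CLOCK_MHZ = 800.0   # command clock for DDR3-1600 (data rate is double this)
T_CL = 11           # CAS latency, in clock cycles
T_RCD = 11          # activate-to-read delay, in clock cycles
T_RP = 11           # row precharge time, in clock cycles

def to_ns(cycles, clock_mhz=CLOCK_MHZ):
    """Convert DRAM clock cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz

row_hit_ns = to_ns(T_CL)                       # target row is already open
closed_bank_ns = to_ns(T_RCD + T_CL)           # bank precharged, row must be activated
row_conflict_ns = to_ns(T_RP + T_RCD + T_CL)   # a different row is open and must be closed first

print(f"row hit:      {row_hit_ns:.2f} ns")
print(f"closed bank:  {closed_bank_ns:.2f} ns")
print(f"row conflict: {row_conflict_ns:.2f} ns")
```

Under these assumptions the worst case costs three times the best case, which is why a quoted latency figure depends so heavily on which case was measured.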
I do not think it was even supposed to. HSA was more about offering a way to (in theory) avoid memory copies by allowing fast memory remapping/sharing, and adding CPU<->GPU intercommunication/event support.
Or are you referring to the way their MCT+DCT are tuned for APUs?
The latency numbers for APUs were a marked rise over their predecessors. In general, besides a possible small improvement from Llano to Trinity, the memory subsystem has gotten worse every chance AMD has had to improve its tech.
Looking at the PS4 numbers, we see an uncore and on-chip interconnect that could take 100 or more cycles to handle an access (meaning just the snoop, queueing, and notification of the DRAM controller).
If we assume for the sake of argument that the PS4 CPUs run at 1.6GHz, 100 cycles is about 62ns, and anything past 160 cycles puts us at 100ns or more devoted to the chip figuring out whether an L2 miss needs to go to memory.
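For reference, converting between cycle counts and wall-clock time at the assumed 1.6GHz clock is a single division or multiplication. A minimal sketch, with function names of my own choosing:

```python
def cycles_to_ns(cycles, clock_ghz):
    """One cycle at f GHz lasts 1/f nanoseconds."""
    return cycles / clock_ghz

def ns_to_cycles(ns, clock_ghz):
    """Number of cycles that fit into the given number of nanoseconds."""
    return ns * clock_ghz

CLOCK_GHZ = 1.6  # the clock assumed for the sake of argument above

ns_for_100_cycles = cycles_to_ns(100, CLOCK_GHZ)
cycles_for_100_ns = ns_to_cycles(100, CLOCK_GHZ)

print(f"100 cycles at {CLOCK_GHZ} GHz = {ns_for_100_cycles:.1f} ns")
print(f"100 ns at {CLOCK_GHZ} GHz = {cycles_for_100_ns:.0f} cycles")
```

So at this clock a 100-cycle trip through the uncore is about 62.5ns, and a 100ns trip corresponds to about 160 cycles.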
The Xbox One is a little better, possibly because the uncore clocks are higher or because their tweaks there have shaved some time off. Another possibility, given that Kaveri has adopted some of the same bus elements as Orbis (and has surprisingly bad latencies), is that Durango benefited latency-wise by stopping earlier on AMD's progression of worsening architectural latency.
That uncore overhead is enough time for 2-3 full memory accesses if we go by some decent desktop cores.