And how much did that reduce LATENCY by? How often is an ALU stalled? How much will HBM reduce the time that an ALU is stalled for, thus increasing utilization? How often in a frame are you limited by the worst-case memory fetch?
As a layman, it doesn't compute to me that you need hundreds of GB/s of bandwidth, and yet being able to deliver that data at something like half the latency of GDDR5 won't bring an improvement across the entire GPU pipeline. GPUs have a very limited ability to hide latency, which in turn has to have an effect on utilization.
I've tried cross-referencing data between the various architectures in the console forum:
Xbox One (Durango) Technical hardware investigation
The DRAM itself seems to be a minority contributor to the cost of memory access, especially for AMD's architectures (many of these are APUs). CPU cache comparisons put the external DRAM device contribution at 60-80 CPU cycles, or 30-40 cycles in terms of the GPU's slower clock speed.
Using the GPU clock regime, a probable best case would be an on-die pool with dedicated read and write paths, which is what the ESRAM for the Xbox One is. That's 250+ GPU cycles for DDR3 versus 125+ for ESRAM, once the additive latencies from the caches you have to miss through are included. I suspect that even if HBM is an improvement, it may have difficulty dropping latency below that of an on-die SRAM pool.
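To put rough numbers on that, here is a quick back-of-the-envelope sketch using only the cycle counts quoted above. The 2x CPU-to-GPU clock ratio is an assumption implied by the 60-80 to 30-40 conversion, the 250 and 125 totals are the lower bounds of the "250+" and "125+" figures, and the function names are made up for illustration. It asks what happens to total latency if the DRAM device portion alone were halved, which is the optimistic case raised in the question.

```python
# Back-of-the-envelope check using the cycle counts quoted in the post.
# Assumption: the CPU clock is about 2x the GPU clock (implied by 60-80 -> 30-40).
CPU_TO_GPU_CLOCK_RATIO = 2.0

# Lower bounds of the quoted "250+" and "125+" figures, in GPU cycles.
DDR3_TOTAL = 250
ESRAM_TOTAL = 125


def to_gpu_cycles(cpu_cycles, ratio=CPU_TO_GPU_CLOCK_RATIO):
    """Convert a latency in CPU cycles to GPU cycles."""
    return cpu_cycles / ratio


def dram_share(dram_gpu_cycles, total_gpu_cycles):
    """Fraction of the total miss latency spent in the DRAM device itself."""
    return dram_gpu_cycles / total_gpu_cycles


def total_after_dram_scaling(total_gpu_cycles, dram_gpu_cycles, dram_scale):
    """Total latency if only the DRAM device portion is scaled by dram_scale."""
    return total_gpu_cycles - dram_gpu_cycles * (1.0 - dram_scale)


if __name__ == "__main__":
    for cpu_cycles in (60, 80):
        dram = to_gpu_cycles(cpu_cycles)
        share = dram_share(dram, DDR3_TOTAL)
        halved = total_after_dram_scaling(DDR3_TOTAL, dram, 0.5)
        print(f"DRAM device {dram:.0f} GPU cycles: {share:.0%} of {DDR3_TOTAL}; "
              f"halving it gives {halved:.0f} total vs {ESRAM_TOTAL} for ESRAM")
```

With these inputs the DRAM device accounts for roughly 12-16% of the miss latency, and halving only that portion leaves the total at 230-235 GPU cycles, still well above the ESRAM figure. This treats the DDR3 number as the baseline and ignores any savings in the controller or bus, so it is a sanity check rather than a prediction.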
Because HBM is DRAM on a read/write bus, it will still need to deal with the same coalescing requirements and the various latencies and penalties GDDR5 has for irregular traffic. Possibly, the larger number of buses will give the memory controller more leeway in working around turnaround penalties. Maybe the slower bus and shorter wires of the interposer can reduce the bus turnaround penalties, at least as seen by the lower-clocked controller.
HBM has some high-level similarities in its banking (not that DRAM gives that much leeway for this) that point to AMD not wanting a memory technology too different from what has come before in GDDR5. Given that general similarity to the DRAM it's replacing, I think it probably won't beat the ESRAM in latency, and the ESRAM itself shows up as a nice but not earth-shattering improvement for the GPU from a vector memory standpoint.