My interpretation of the engineer's statements on this is that 102 GB/s was a preliminary minimum figure (with no qualifications) that they communicated to developers and outside groups before most of the design had been evaluated. When MS specced 16 ROPs, they also specced 102 GB/s of bandwidth from the eSRAM.
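As a back-of-the-envelope check (my own arithmetic, not from the interview), that figure lines up neatly with 16 ROPs each doing a blended, read-modify-write 32-bit pixel per clock at the originally targeted 800 MHz:

```python
# Sketch of how 16 ROPs map onto 102.4 GB/s, assuming the
# pre-upclock 800 MHz GPU target and RGBA8 blended output.
rops = 16
bytes_per_pixel = 4      # RGBA8 color
rw_factor = 2            # blending = read + write per pixel
clock_hz = 800e6         # original 800 MHz GPU clock target

bandwidth = rops * bytes_per_pixel * rw_factor * clock_hz
print(bandwidth / 1e9)   # 102.4 GB/s
```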
The fact that AMD delivered something vastly more capable is simply a happy turn of events.
My strong suspicion is that those involved with the design knew enough about on-die memories and interconnects in general to expect more, but they had no good reason to say so, as that would have meant committing to details of an implementation that had not been fully specced.
I do also wonder how much more expensive it would have been had they asked for double the I/O to the same scratch memory. It somewhat looks like they were never going to accommodate MSAA bandwidth requirements, or even FP16, both of which would make 32MB feel that much smaller to work with anyway.
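To put rough numbers on that (my assumptions: a 1080p color target, ignoring depth/stencil and tiling overhead), FP16 or 4x MSAA quickly eats into the 32MB pool:

```python
# Rough footprint math for a 1920x1080 color render target.
# These are my own illustrative numbers, not from the interview.
width, height = 1920, 1080
mb = 1024 * 1024

rgba8  = width * height * 4    # 32-bit color
fp16   = width * height * 8    # 64-bit FP16 RGBA color
msaa4x = rgba8 * 4             # 4x MSAA color, uncompressed

print(rgba8 / mb)    # ~7.9 MB
print(fp16 / mb)     # ~15.8 MB
print(msaa4x / mb)   # ~31.6 MB -- nearly the whole 32 MB pool
```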
Decisions, decisions...
It's on-die memory, but per the DF interview it's a memory scratchpad that is accessed identically to main memory, post whatever page table setup and mapping to the hardware is done at allocation.
Accesses get routed through a crossbar setup with no additional special handling from the code, hence the ability to split portions of a target across both pools.
I'm speculating at this point, but I wonder if the 16 ROPs and their peak bandwidth faced a design bottleneck with the eSRAM's crossbar requirements.
I think, from the die shots and interviews, that Durango has doubled up on crossbar blocks relative to Orbis in order to service this comparatively generic memory access capability for the GPU memory pipeline.
Having 32 ROPs, or expanding the general memory access bandwidth for 16, would mean plugging even more into AMD's crufty uncore, of which I have taken an increasingly jaundiced view of late. Since the eSRAM's bus is sized to match the ROPs so well, keeping things on-die in this manner raises complexity in a way that a relatively straight shot over a wider Garlic bus to juiced-up off-die GDDR might not.
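For comparison, the kind of "straight shot" I mean: a wide off-die GDDR5 interface at the publicly reported Orbis configuration reaches comparable peak bandwidth without any crossbar gymnastics:

```python
# Peak bandwidth of a wide off-die GDDR5 interface, using the
# publicly reported Orbis figures as the reference point.
bus_width_bits = 256
gbps_per_pin = 5.5        # GDDR5 per-pin data rate

bandwidth_gbs = bus_width_bits * gbps_per_pin / 8
print(bandwidth_gbs)      # 176.0 GB/s
```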
That level of on-die bandwidth is doable, but potentially not under the constraints that the accesses be generic, implemented cheaply, and built on AMD's bus setup and design capabilities.
The lack of any mention of significant latency benefits from the eSRAM, and the lousy latency numbers for AMD's memory accesses in general (from all appearances, uniformly and disconcertingly bad across all current APUs, Kaveri included), make me think that on-die memory could have brought real benefits, had it not been shoehorned into tech as old as Llano.