Understanding XB1's internal memory bandwidth *spawn

Slightly tangential question about the esram... how does a dual ported design compare to a single ported of twice the bus width in terms of transistor counts, IO pins, suitability for future process node shrinks, etc.?
There are probably no true dual-ported SRAM cells in that 32 MB array. It is just banked, and as long as there are no conflicts, a certain number of parallel accesses can be done. That keeps the overhead low. How they implemented the SRAM interface is another question. As they probably transfer 64 bytes per access (the cache line size), the total bandwidth is between 128 and 256 bytes per cycle, and MS appears to indicate the 32 MB consists of four chunks (which one can probably access in parallel), I would wager that the SRAM interface exposes 4 ports (2 read + 2 write). Having 2 dedicated ports for each (instead of 4 general ones) probably cuts down a bit on the effort for the interface and SRAM controller, but exactly how much, no idea. Probably enough that MS considered 4 general ports not worth the effort, given that the most bandwidth-demanding stuff often requires both reading and writing.
IO pins in the usual sense of the word don't exist. The large number of data lines (probably 2048) is the same in both cases; just the SRAM controller gets simpler. I'm not aware that this would be a factor for process shrinks, but I have to admit that this is outside my field.
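
To make the speculation concrete, here is a minimal sketch (Python, purely illustrative) of such an interface: four chunks behind 2 read + 2 write ports, where a request only issues if a port is free and its chunk isn't already busy that cycle. The port counts and the address-to-chunk mapping are assumptions from the paragraph above, not anything MS has confirmed.

```python
# Sketch of the speculated interface: four 8 MB chunks behind 2 read +
# 2 write ports, 64-byte accesses. Port counts and the chunk mapping
# are assumptions from the post above, not confirmed hardware.
READ_PORTS, WRITE_PORTS = 2, 2

def issue(requests):
    """requests: list of (kind, chunk), kind 'R' or 'W'.
    Returns the subset that can go out in a single cycle."""
    ports = {'R': READ_PORTS, 'W': WRITE_PORTS}
    busy_chunks, granted = set(), []
    for kind, chunk in requests:
        if ports[kind] > 0 and chunk not in busy_chunks:
            ports[kind] -= 1
            busy_chunks.add(chunk)
            granted.append((kind, chunk))
    return granted

# 2 reads + 2 writes to four different chunks co-issue:
print(issue([('R', 0), ('R', 1), ('W', 2), ('W', 3)]))
# A third read (no read port left) or a write to a busy chunk waits:
print(issue([('R', 0), ('R', 1), ('R', 2), ('W', 0)]))
```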
 
I guess, as sebbbi says above, situations with heavy alpha blending or overdraw are likely to be more efficient on eSRAM than on similarly specified GDDR5. Could XB1 even be able to outperform a high-end GPU like Tahiti in these situations?

Question: alpha blending is a heavy case of read-modify-write, so do the BIOS and memory controllers in a PC GPU optimize such operations when switching from DDR3 to GDDR5, e.g. via "Write Data Masking"?
 
Question: alpha blending is a heavy case of read-modify-write, so do the BIOS and memory controllers in a PC GPU optimize such operations when switching from DDR3 to GDDR5, e.g. via "Write Data Masking"?
That's probably not used too often as the ROPs read and write the render target in tiles to/from the ROP caches. These tiles are far larger than the burst length to a DRAM channel. The read-modify-write operations are done within the ROP caches (offering plenty of internal bandwidth for that and also MSAA).
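
As a hedged illustration of that point, here is a toy sketch of blending done entirely within a ROP cache tile; the tile size and blend equation are just examples, not the actual hardware path:

```python
# Toy model: the ROPs fetch a whole render-target tile into the ROP
# cache, do the read-modify-write (blending) on-chip, and only write
# the finished tile back to DRAM in full bursts. Tile size and the
# blend formula are illustrative assumptions.
def blend_tile_in_rop_cache(dst_tile, src_tile, alpha):
    # Classic "src over dst" alpha blend, per pixel, entirely in cache.
    return [s * alpha + d * (1.0 - alpha)
            for s, d in zip(src_tile, dst_tile)]

dst = [0.2] * 64   # a tile read from DRAM into the ROP cache
src = [0.9] * 64   # incoming fragment colors
dst = blend_tile_in_rop_cache(dst, src, alpha=0.5)  # RMW stays on-chip
# Only now is the whole tile written back to DRAM, so per-byte write
# masking on the DRAM bus rarely comes into play.
```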
 
There are probably no true dual-ported SRAM cells in that 32 MB array. It is just banked, and as long as there are no conflicts, a certain number of parallel accesses can be done.

Actually, the way it works is that conflicts are arbitrated, which adds latency in a banked multiport design. After all, simultaneous read/write means accessing the same memory blocks at the same time, so having no conflicts would actually mean it's not accessing the same blocks.

In practice, having true dual ports on every memory cell is expensive (cost, energy, transistors). It would be ideal if this is how it's done, though it seems less likely.

If we work on the assumption that the banked memory is needed for the design, then it's a banked multiport design, in which there's an arbitrator/crossbar between the R/W ports and the banked memory. The trade-off is added latency in access time.
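
A toy arbiter (Python, with my own assumptions) makes that trade-off visible: each cycle at most one access per bank is granted, and conflicting requests queue up, which is exactly the added access latency:

```python
# Toy arbiter for a banked multiport design: each cycle grants at most
# one access per bank; conflicting requests wait. Bank count and the
# one-grant-per-bank rule are illustrative assumptions.
from collections import deque

def cycles_to_drain(bank_ids):
    pending = deque(bank_ids)
    cycles = 0
    while pending:
        granted, waiting = set(), deque()
        for b in pending:
            if b in granted:
                waiting.append(b)   # conflict: wait for a later cycle
            else:
                granted.add(b)      # first access to this bank wins
        pending = waiting
        cycles += 1
    return cycles

print(cycles_to_drain([0, 1, 2, 3]))  # no conflicts: 1 cycle
print(cycles_to_drain([0, 0, 0, 0]))  # all one bank: 4 cycles of latency
```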

However, if we revisit the DRAM in the Hot Chips slides, the 8 GB of DDR3 is divided into 4 x 2 GB, so does that imply the DRAM is banked as well? Probably not, though a banked design might be the case for the eSRAM.
 
The reality is a bit more complicated than that. DRAM is heavily optimized for localized or linear accesses with writes and reads not being mixed together. Internally, the DRAM is heavily subdivided, slower, and it can't keep everything at the ready at all times. It also incurs a penalty whenever it has to switch from reads to writes.

The memory subsystem tries very hard to schedule accesses so that they hit as few bank and turnaround penalties as possible, but this isn't simple to do with other constraints like latency and balancing service to multiple clients.

Ideally, the eSRAM could dispense with all of this, and gladly take any mix that works within the bounds of its read and write ports.
However, the peak numbers and articles on the subject suggest that for various reasons there are at least some banking and timing considerations that make the ideal unreachable. The physical speed of the SRAM and the lack of an external bus probably mean that the perceived latency hierarchy is "flatter" than it would be if you were spamming a GDDR bus with reads and writes with poor locality.

This is where I assume the hinted advantages the eSRAM has for certain operations come in, where the access pattern starts interspersing reads with writes, or there is poor locality.
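
A back-of-the-envelope model of that turnaround effect, with placeholder numbers rather than real DRAM parameters, shows how much the ordering of accesses matters:

```python
# Toy model: every switch between reads and writes stalls the bus for a
# fixed turnaround penalty. Burst and penalty lengths are placeholders.
def efficiency(ops, burst_cycles=4, turnaround_cycles=7):
    busy = stalled = 0
    last = None
    for op in ops:                        # op is 'R' or 'W'
        if last is not None and op != last:
            stalled += turnaround_cycles  # pay the turnaround
        busy += burst_cycles
        last = op
    return busy / (busy + stalled)

print(efficiency(['R'] * 8 + ['W'] * 8))  # one switch: ~0.90 of peak
print(efficiency(['R', 'W'] * 8))         # alternating: ~0.38 of peak
```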


Hi, I signed up for this forum because I really wanted to ask one of the most knowledgeable forum userbases about what Boyd Multerer had to say during the architecture panel.
Boyd Multerer had this to say:
[image: xbox_sram_cache2.jpg (Boyd Multerer's quoted remarks)]

As SRAM is multiple times more costly in die real estate than eDRAM, wouldn't that suggest some really big benefit to going with the former (SRAM)?

And he's suggesting something different for current-generation GPUs vs last generation (I assume by generation he means console generation, e.g. X1xxx/7xxx circa 2004/5).

Would the use of eSRAM as opposed to eDRAM suggest a different purpose for the 4x8 MB cache, unlike eDRAM, which was used for frame buffering, post-processing and ROP performance improvements (I might be wrong about the ROP improvements)?
 
Hot Chips slides say 204 *shrug* http://images.thisisxbox.com/2013/08/XBO_diagram_WM.jpg

Which fits with the earlier 7/8 multiplier throughout.

I'd trust that for now over some off-the-cuff Penello post on GAF. Which, btw, I don't think he went and asked any "technical fellow" about; IIRC he was just going with the crowd and possibly made a mistake. He must have thought, "well, it must be 2X, that makes sense". And it was in reply to gaffers stating it should be 218 based on the erroneous 2x109 idea. I should look up the post but don't have time right now.

If they're saying 218 now, it's new.

I had already posted it...

http://forum.beyond3d.com/showpost.php?p=1783345&postcount=248

"people at the office" corrected him for "writing the wrong number". Seems like people in Microsoft are telling him 218 is the correct number now. Until we hear otherwise I would surmise that's the correct number now.

Tommy McClain
 
That was the original figure Digital Foundry reported. After the clock update, people calculated it would change to 204, and then the Hot Chips presentation seemed to confirm that. In both cases it appeared to be information from MS, which is why people have been so confused by the 7/8ths discrepancy for so long. So we have multiple sources from MS consistently using one calculation for the peak, and only Penello, a non-technical guy who was confused about a lot of this stuff, saying it should actually be a straight doubling. AFAIK he has never reconfirmed the 218 figure with anyone after people asked what was going on with the math. I assume he thought he had made a typo himself, and didn't realize the same figure was being reported in multiple places.
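
For what it's worth, the arithmetic behind those figures is easy to reconstruct. The 128-bytes-per-cycle path per direction matches the public figures; the 7/8 write duty cycle is the inferred part this thread keeps circling around:

```python
# Sketch of the 192/204/218 arithmetic. The 128-byte path per direction
# matches public figures; the 7/8 write duty is an inference, not spec.
# GB = 1e9 bytes here.
def esram_peaks(clock_ghz, bytes_per_cycle=128, write_duty=7/8):
    one_way = clock_ghz * bytes_per_cycle   # GB/s, read-only peak
    combined = one_way * (1 + write_duty)   # reads every cycle, writes 7 of 8
    doubled = one_way * 2                   # the naive straight doubling
    return one_way, combined, doubled

print(esram_peaks(0.800))  # (102.4, 192.0, 204.8) -> DF's original 192
print(esram_peaks(0.853))  # (109.2, 204.7, 218.4) -> 204 vs Penello's 218
```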
 
Any idea what the penalty for a switch from a read to a write is?
The number varies. In this case, there can be latencies inherent to the device or the bus between the controller and device.

The model in question can have different numbers, and it can have programmable settings that can change the math as well.
http://www.hynix.com/datasheet/pdf/graphics/H5GQ1H24AFR(Rev1.0).pdf

The write-to-read delay is 5 nanoseconds of internal turnaround time. 5.5 Gbps memory would make that roughly 7 clocks.

For a switch from reads to writes:
[CLmrs+(BL/4)+2-WLmrs]*tCK

CAS latency can range from 5-20 cycles, BL is 8, Write Latency can be from 4-7 cycles.
The middle of the road CL numbers would put a penalty roughly around that of the write to read penalty, although fiddling with numbers can make things better or worse.
Since the data clock is twice that of the command clock, and the data bus is DDR, that's 28 data transfers that don't happen every time modes switch.

More than this, the pdf shows a massive number of variables, restrictions, and options that a merely functional memory controller must manage, not to mention what would go into a good one.
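
Plugging the datasheet's numbers in reproduces the figures above. The CL/WL values here are examples picked from within the documented ranges, not the XB1's actual settings:

```python
# GDDR5 turnaround arithmetic from the Hynix datasheet quoted above.
# CL and WL are example values within the documented ranges.
data_rate_gbps = 5.5
tck_ns = 4 / data_rate_gbps      # command clock period; CK = data rate / 4

wtr_clocks = 5 / tck_ns          # 5 ns write-to-read turnaround: ~6.9, "roughly 7"

CL, BL, WL = 10, 8, 7            # CAS latency, burst length, write latency
rtw_clocks = CL + BL // 4 + 2 - WL   # read-to-write: [CL + BL/4 + 2 - WL] clocks

# The data clock is 2x the command clock and the bus is DDR, so each
# command clock carries 4 transfers: ~7 lost clocks = ~28 transfers.
print(round(wtr_clocks), rtw_clocks, round(wtr_clocks) * 4)   # 7 7 28
```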

So would it be fair to say that you still need a 50/50 split of reads/writes to achieve maximum bandwidth utilization of the XB1 eSRAM, but that in such a use case GDDR5 would likely be less efficient because of the switch penalties you mentioned?
The inferred ideal is 50/50 for the eSRAM, and that does seem like the most straightforward interpretation. The optimal mix of type and locality hasn't actually been disclosed, however.
GDDR5 can do well enough if the mix is heavily in favor of one operation or the other, with the additional caveat that many systems can give an edge to read performance, because automatic functions like hardware prefetching work for reads but can't help with writes.
It also does best with accesses that don't jump around to closed banks. The eSRAM's situation with regards to where the accesses hit relative to each other isn't known, although it sounds like it should have an easier time with it.
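
As a toy model of that mix question (port widths taken from the public figures, arbitration and banking ignored):

```python
# Toy model: achievable eSRAM bandwidth vs the read share of traffic.
# Each direction capped at ~109 GB/s; writes assumed to land only 7 of
# every 8 cycles. Real arbitration/banking effects are ignored.
READ_PEAK = 109.0              # GB/s (853 MHz x 128 bytes)
WRITE_PEAK = READ_PEAK * 7 / 8

def achievable_bw(read_fraction):
    limits = []
    if read_fraction > 0:
        limits.append(READ_PEAK / read_fraction)
    if read_fraction < 1:
        limits.append(WRITE_PEAK / (1 - read_fraction))
    return min(limits)

for r in (1.0, 0.75, 0.53, 0.5, 0.25):
    print(f"{r:.0%} reads -> {achievable_bw(r):.0f} GB/s")
# 100% -> 109, 75% -> 145, 53% -> ~203 (near the 204 peak), 50% -> 191,
# 25% -> 127. The maximum sits at a slightly read-heavy, near-50/50 mix.
```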

If that's correct (please just blast my theory out of the water if it's not), then would you care to venture an estimate of how a typical game's workload may best fit either model at a high level (I understand at a low level there would be a mix), or could it vary greatly from one model to the other depending on the game engine?
I'm not sure what would pass for typical for modern games, and I haven't run across analysis of that sort for games.
I am intrigued by the prospect of less regular accesses being sped up by the eSRAM. I wouldn't say irregular accesses because it does seem to have a decently coarse granularity, going by the width of its ports and its apparent alignment in bandwidth to the ROPs.

The Edge article hinted at certain operations that have higher write needs and less ALU and texture work available to compensate.
Going a little far afield, I'm curious if measurements of performance will show that the eSRAM can speed up the setup process for intermediate buffers and render targets. Comparisons between various forms of deferred lighting or shading show that there is overhead for tile setup that is eventually swamped by the total amount of work in larger data sets.
That could open up the possibility for more buffers that can be made smaller and used in more complicated ways because their overheads are smaller.
On the other hand, once the setup is complete and the goal is to power through a lot of generated data, the other memory setup could have an edge.

Different phases of the same graphics pipeline could favor one setup over the other, if the programmer is able to delve deep enough to uncover the differences.
 
However, if we revisit the DRAM in the Hot Chips slides, the 8 GB of DDR3 is divided into 4 x 2 GB, so does that imply the DRAM is banked as well? Probably not, though a banked design might be the case for the eSRAM.
Of course DRAM is banked internally. Each typical GDDR5 device (chip) usually supports 16 banks (the smaller capacities 8, iirc); I would have to look it up for the DDR3 in the XB1's case. Banking is not used to access the chips in parallel (the external interface doesn't afford this), but to enable a faster switch between banks, i.e. to reduce the access latency.
Generally, I would very much expect that the XB1 interleaves its address space between the four memory channels in about the same way it is done between the four 8 MB chunks of eSRAM. The interleaving stride is just open to speculation; it could be larger than the cache-line size.
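
A minimal sketch of what that interleaving could look like (the 256-byte stride is purely an assumed example, per the caveat above):

```python
# Channel interleaving sketch: consecutive stride-sized blocks rotate
# across the four channels. The 256-byte stride is an assumption; the
# post above notes it could well be larger than the cache-line size.
from collections import Counter

CHANNELS = 4

def channel_of(address, stride=256):
    return (address // stride) % CHANNELS

# A linear walk in 64-byte cache lines spreads evenly over channels,
# with four consecutive lines per channel before rotating:
walk = [channel_of(a) for a in range(0, 4096, 64)]
print(Counter(walk))   # Counter({0: 16, 1: 16, 2: 16, 3: 16})
```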
 

Thanks for the info, that's much clearer now. I guess there really is no simplifying this and it's going to come down to how developers are able to leverage the strengths and avoid the weaknesses of both setups.

I realise we need to understand more about the precise quirks of its operation, but from what I'm gathering so far, the eSRAM with its concurrent read/write capability is sounding more capable than I originally gave it credit for.

I'm getting the impression now that the XB1 may genuinely come out with a bandwidth/performance advantage as a result of its eSRAM use over a single pool of high-speed GDDR5 - presumably at a cost to ease of development.
 
I'm getting the impression now that the XB1 may genuinely come out with a bandwidth/performance advantage as a result of its eSRAM use over a single pool of high-speed GDDR5 - presumably at a cost to ease of development.

If that is the case, I wonder if this 'problem' will need to be solved on a case-by-case basis, or whether MS/XBO developers will quickly come up with best practices that can be applied across the board to get 80% of the advantage, with the last 20% left for those who really want to optimize for their engine/game.
 
Of course DRAM is banked internally. Each typical GDDR5 device (chip) usually supports 16 banks (the smaller capacities 8, iirc); I would have to look it up for the DDR3 in the XB1's case. Banking is not used to access the chips in parallel (the external interface doesn't afford this), but to enable a faster switch between banks, i.e. to reduce the access latency.
Generally, I would very much expect that the XB1 interleaves its address space between the four memory channels in about the same way it is done between the four 8 MB chunks of eSRAM. The interleaving stride is just open to speculation; it could be larger than the cache-line size.

Er, I meant to say that having DRAM banked doesn't imply that it's banked multiport, or vice versa, in the context of the banked eSRAM.
 
Thanks for the info, that's much clearer now. I guess there really is no simplifying this and it's going to come down to how developers are able to leverage the strengths and avoid the weaknesses of both setups.

I realise we need to understand more about the precise quirks of its operation, but from what I'm gathering so far, the eSRAM with its concurrent read/write capability is sounding more capable than I originally gave it credit for.

I'm getting the impression now that the XB1 may genuinely come out with a bandwidth/performance advantage as a result of its eSRAM use over a single pool of high-speed GDDR5 - presumably at a cost to ease of development.

I wouldn't be too sure of that. I think it will be like Intel's Iris Pro 5200: the embedded RAM will just help the GPU reach around the same performance it would reach if it were connected to GDDR5 instead of DDR3.
 
I wouldn't be too sure of that. I think it will be like Intel's Iris Pro 5200: the embedded RAM will just help the GPU reach around the same performance it would reach if it were connected to GDDR5 instead of DDR3.

Would you care to elaborate on what led you to this conclusion?
 
Would you care to elaborate on what led you to this conclusion?

I didn't come to a conclusion; I said that's the way I think it will be: not an advantage over a single pool of GDDR5, but an alternative to a single pool of GDDR5.
 
I didn't come to a conclusion; I said that's the way I think it will be: not an advantage over a single pool of GDDR5, but an alternative to a single pool of GDDR5.

But isn't Haswell's eDRAM more like a hardware-managed L4 cache than the software-managed eSRAM in the X1?
 