The ESRAM in Durango as a possible performance aid

The question relates to the particulars of how those links work in relation to the eSRAM, which isn't covered in those documents.
Some of the glaring performance deficiencies will hopefully have improved since Llano.
 
I never noticed before that it has 170 GB/s read but only 102 GB/s write (implying that it cannot write to both memory pools at the same time). Interesting.

Some sites write 102 Gbit/s and not 102 Gbyte/s

It's the same old discussion about how you write bit vs. byte. I'm used to a capital "B" for byte.

Maybe I'm stating the obvious (kicking in an open door, as we say in Dutch), but 102 Gbit/s divided by 64 bits = 1.6 GHz... close to the actual clock speed of the CPU cores.

(assuming there's a 64-bit on-chip bus ... going to an array of single-port SRAM)
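Just to make that back-of-the-envelope explicit (the 64-bit bus width is purely my assumption, and it only works if the figure really is Gbit/s rather than GByte/s):

```python
# Speculative check: if the 102 figure is Gbit/s over an assumed 64-bit
# on-chip bus, the implied clock lands near the CPU core clock.
bus_width_bits = 64                  # assumed on-chip bus width
bandwidth_gbit = 102.4               # the quoted figure, read as Gbit/s
clock_ghz = bandwidth_gbit / bus_width_bits
print(f"{clock_ghz:.1f} GHz")        # 1.6 GHz, same as the CPU cores
```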

BTW... 32 MB of SRAM at 28 nm is, according to some info on the web:

26 mm² without routing overhead, approx. 36 mm² with routing overhead (75% utilisation). So that's approx. 10% of the total die?
I would assume there's some repair overhead and test logic to compensate for yield losses, so maybe closer to 40 mm² is more realistic.
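For anyone who wants to redo that estimate, here's the arithmetic; the bitcell size and the total die size are assumptions pulled from public 28 nm figures, not anything Durango-specific:

```python
# Rough 32 MB SRAM area estimate at 28 nm (all inputs are assumptions).
bitcell_um2 = 0.097                      # assumed 6T SRAM bitcell area at 28 nm
bits        = 32 * 1024 * 1024 * 8       # 32 MB
raw_mm2     = bits * bitcell_um2 / 1e6   # ~26 mm^2 without routing
routed_mm2  = raw_mm2 / 0.75             # ~35 mm^2 at 75% array utilisation
die_mm2     = 360                        # assumed total SoC die size
print(round(raw_mm2), round(routed_mm2), f"{routed_mm2 / die_mm2:.0%}")
# -> 26 35 10%
```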
 
I never noticed before that it has 170 GB/s read but only 102 GB/s write (implying that it cannot write to both memory pools at the same time). Interesting.

The 102GB/s write bandwidth is likely determined by the 16 ROPs writing a maximum of 8 bytes per cycle at 800MHz.

Both ROPs and texture units can read data so the requirement for read bandwidth is always going to be larger.
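If that 8-bytes-per-ROP-per-clock assumption holds, the arithmetic is simply:

```python
# ROP-limited write bandwidth, assuming 8 bytes per ROP per clock.
rops            = 16
bytes_per_clock = 8       # assumption: peak colour write per ROP
clock_ghz       = 0.8     # 800 MHz GPU clock
print(rops * bytes_per_clock * clock_ghz)   # 102.4 GB/s
```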

Cheers
 
The 102GB/s write bandwidth is likely determined by the 16 ROPs writing a maximum of 8 bytes per cycle at 800MHz.

Both ROPs and texture units can read data so the requirement for read bandwidth is always going to be larger.

Cheers

Are you implying that the SRAM can only be accessed from the GPU side?

Thought it was both, but the numbers do match up:

GPU: 800 MHz x 8 bytes x 16 ROPs = ~102 GB/s

CPU: 64-bit data bus x 1.6 GHz (or 64-bit read bus + 64-bit write bus x 800 MHz) = ~102 Gbit/s
 
Are you implying that the SRAM can only be accessed from the GPU side?

Thought it was both, but the numbers do match up:

GPU: 800 MHz x 8 bytes x 16 ROPs = ~102 GB/s

CPU: 64-bit data bus x 1.6 GHz (or 64-bit read bus + 64-bit write bus x 800 MHz) = ~102 Gbit/s

I don't remember reading that the CPU has a data bus that goes to the eSRAM; it looks like it has to go through the north bridge, with each CPU module (there are 2 modules) having a max speed of 20.8 GB/s read/write. Or are you talking about something else? I can't tell.
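A rough tally of the two paths, using only the figures quoted in this exchange (so treat them as assumptions rather than confirmed specs):

```python
# GPU-side write path vs. CPU-side path through the north bridge,
# using the numbers quoted above (assumptions, not confirmed specs).
gpu_write_gb_s  = 16 * 8 * 0.8         # 16 ROPs x 8 B/clk x 800 MHz = 102.4 GB/s
cpu_module_gb_s = 20.8                 # quoted per-module north bridge limit
cpu_total_gb_s  = 2 * cpu_module_gb_s  # two CPU modules
print(gpu_write_gb_s, cpu_total_gb_s)  # 102.4 vs 41.6 GB/s
```

If the per-module figure is right, the CPU path would top out well below 102 GB/s, which fits the idea that the write limit is a GPU-side one.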
 
You should read the post by sebbbi. The low latency of the eSRAM, even if it isn't the main reason for including it, has additional benefits for algorithms and effects that rely on memory coherency and low latency.

The problem is: how low is its latency compared to the DDR3, and to what extent is it a scratchpad?
Does the current understanding of the eSRAM fall safely under sebbbi's criteria for bandwidth savings?
 
From AnandTech:

If it’s used as a cache, the embedded SRAM should significantly cut down on GPU memory bandwidth requests which will give the GPU much more bandwidth than the 256-bit DDR3-2133 memory interface would otherwise imply. Depending on how the eSRAM is managed, it’s very possible that the Xbox One could have comparable effective memory bandwidth to the PlayStation 4. If the eSRAM isn’t managed as a cache however, this all gets much more complicated.

There are merits to both approaches. Sony has the most present-day-GPU-centric approach to its memory subsystem: give the GPU a wide and fast GDDR5 interface and call it a day. It’s well understood and simple to manage. The downsides? High speed GDDR5 isn’t the most power efficient, and Sony is now married to a more costly memory technology for the life of the PlayStation 4.


Microsoft’s approach leaves some questions about implementation, and is potentially more complex to deal with depending on that implementation. Microsoft specifically called out its 8GB of memory as being “power friendly”, a nod to the lower power operation of DDR3-2133 compared to 5.5GHz GDDR5 used in the PS4. There are also cost benefits. DDR3 is presently cheaper than GDDR5 and that gap should remain over time (although 2133MHz DDR3 is by no means the cheapest available). The 32MB of embedded SRAM is costly, but SRAM scales well with smaller processes. Microsoft probably figures it can significantly cut down the die area of the eSRAM at 20nm and by 14/16nm it shouldn’t be a problem at all.
Even if Microsoft can’t deliver the same effective memory bandwidth as Sony, it also has fewer GPU execution resources - it’s entirely possible that the Xbox One’s memory bandwidth demands will be inherently lower to begin with.
 
Does XBox One have ROPs?

eDRAM in the Xbox 360 was there to support fixed-function hardware, i.e. the ROPs.

Since it's quite possible to write a pixel shader that doesn't output pixels (instead it simply reads and writes memory directly), it's possible this architecture is ROP-less.

This would be so cool.
 
Does XBox One have ROPs?

eDRAM in the Xbox 360 was there to support fixed-function hardware, i.e. the ROPs.

Since it's quite possible to write a pixel shader that doesn't output pixels (instead it simply reads and writes memory directly), it's possible this architecture is ROP-less.

This would be so cool.

Yes, 8 of them, for a peak bandwidth of 102GB/s.
 
isn't it 16?


Output

Pixel shading output goes through the DB and CB before being written to the depth/stencil and color render targets. Logically, these buffers represent screenspace arrays, with one value per sample. Physically, implementation of these buffers is much more complex, and involves a number of optimizations in hardware.
Both depth and color are stored in compressed formats. The purpose of compression is to save bandwidth, not memory, and, in fact, compressed render targets actually require slightly more memory than their uncompressed analogues. Compressed render targets provide for certain types of fast-path rendering. A clear operation, for example, is much faster in the presence of compression, because the GPU does not need to explicitly write the clear value to every sample. Similarly, for relatively large triangles, MSAA rendering to a compressed color buffer can run at nearly the same rate as non-MSAA rendering.
For performance reasons, it is important to keep depth and color data compressed as much as possible. Some examples of operations which can destroy compression are:
  • Rendering highly tessellated geometry
  • Heavy use of alpha-to-mask (sometimes called alpha-to-coverage)
  • Writing to depth or stencil from a pixel shader
  • Running the pixel shader per-sample (using the SV_SampleIndex semantic)
  • Sourcing the depth or color buffer as a texture in-place and then resuming use as a render target
Both the DB and the CB have substantial caches on die, and all depth and color operations are performed locally in the caches. Access to these caches is faster than access to ESRAM. For this reason, the peak GPU pixel rate can be larger than what raw memory throughput would indicate. The caches are not large enough, however, to fit entire render targets. Therefore, rendering that is localized to a particular area of the screen is more efficient than scattered rendering.

4 DB and 4 CB

http://www.vgleaks.com/durango-gpu-2/3/
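To put a rough number on the "clears are much faster" point in that excerpt, here's a generic size comparison; the tile size and per-tile metadata are illustrative assumptions, not Durango specifics:

```python
# Why compressed (fast) clears help: an explicit clear writes every pixel,
# a fast clear only updates per-tile metadata. Sizes are generic examples.
width, height   = 1920, 1080
bytes_per_pixel = 4                                   # RGBA8 colour target
tile_pixels     = 8 * 8                               # assumed tile size
full_clear_mb   = width * height * bytes_per_pixel / 1e6
meta_clear_mb   = (width * height / tile_pixels) * 1 / 1e6   # ~1 byte per tile
print(f"{full_clear_mb:.1f} MB vs {meta_clear_mb:.2f} MB")   # ~8.3 MB vs ~0.03 MB
```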
 
The whole system reservation article from Kotaku made me think about the memory bandwidth.

Doesn't the system also need to reserve some bandwidth from the main RAM?
I guess the game gets the same share as with the GPU, roughly 90%.

So I think 60 GB/s for the game seems reasonable?
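Rough numbers behind that guess (the DDR3 peak follows from the 256-bit DDR3-2133 interface mentioned in the AnandTech excerpt; the 90% share for the game is my assumption):

```python
# Peak DDR3 bandwidth and an assumed 90% share left for the game.
bus_bytes = 256 / 8                   # 256-bit interface
rate_gt_s = 2.133                     # DDR3-2133
ddr3_peak = bus_bytes * rate_gt_s     # ~68.3 GB/s
print(ddr3_peak * 0.90)               # ~61 GB/s, close to the 60 GB/s guess
```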

Edit: argh wrong thread...
 
Say we have 100 GB/s eDRAM with 1/10th the latency (say 1ns vs 10ns) of 100 GB/s GDDR5.

Where can low latency be useful with both frame rendering and with GPGPU? Could someone cite specific examples? I.e., what tasks would benefit?

I just can't imagine that the memory subsystem of the Xbox One can't outperform the PS4's GDDR5 in any metric. With graphics, bandwidth is far and away the more important resource to have, but there has to be some significant scenario where the 32 MB of eSRAM holds an advantage. Is there really no reason for Microsoft to choose DDR3 + eSRAM over GDDR5 other than cost and power/heat?
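The only toy example I can come up with myself is a chain of dependent reads (pointer chasing through a linked structure, or an irregular GPGPU access pattern), which is paced almost entirely by latency rather than bandwidth. Using the hypothetical 1 ns vs 10 ns from above (not real part numbers):

```python
# Toy model: N dependent reads are latency-bound, so total time scales
# with latency. Latencies are the hypothetical 1 ns vs 10 ns from above.
dependent_reads = 1_000_000
low_lat_ns, high_lat_ns = 1.0, 10.0
print(dependent_reads * low_lat_ns / 1e6, "ms vs",
      dependent_reads * high_lat_ns / 1e6, "ms")   # 1.0 ms vs 10.0 ms
```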

Disclaimer: from here on out I ramble a bit. If you are able to answer the questions above, it'd be much appreciated. The following isn't as important to me.
________________________________________

Microsoft may be assuming that the cost of 32 MB of eSRAM is going to decrease significantly as it moves to 20 nm and 14 nm in the coming years. There has been an awful lot of noise about 20 nm and 14 nm offering little improvement, no improvement, or even a regression in cost per transistor compared to 28 nm. I've heard that 14 nm is of particular concern, because the R&D and wafer costs will rise to the point that only the very largest companies will be able to profit, leading to some crazy semiconductor mass extinction event. Perhaps SRAM scales well enough to overcome the relatively higher cost? Nvidia's claims of cost regression on TSMC's 20 nm process are presumably based on average GPU transistor cost, while SRAM should fare better.

GloFo's a bit of an oddball, though. They're moving to gate last with 20nm, which may help them (or their customers) in the cost department. Am I wrong on this? I understand gate last means lower density, but the higher yield would mean lower cost. I suppose it could end up meaning higher cost if the yield improvement is not large enough to counteract the density hit, or it could result in no ground being made in the cost department at all. Performance will move forward, of course, but I can't help but wonder whether it's density or yield that wins out.

Their 14nm process is also some hybrid contraption. Does anyone know how their decision to shrink the transistors while standing pat on the interconnect will turn out? Which would benefit more, cost or the electrical performance? I really wish I understood more about the subject, but to me it seems like their decision would result in lowered cost, while performance would not be moving forward much.

Does anyone know how the cost of 32 MB of eSRAM on a 28 nm process + 8 GB of DDR3-2133 compares to 8 GB of GDDR5? Which is the more expensive memory solution on today's market and manufacturing processes? I doubt a measly 32 MB of memory is enough to fully negate the cost gap between DDR3 and GDDR5. I wish there were more transparency when it came to part costs, so there would be some way to estimate how much Microsoft is saving by choosing eSRAM.

Anand stated that the eSRAM would be relatively inexpensive in the long run, thanks to node advancements. He also stated that Sony is married to the more expensive GDDR5, but is there any reason why Sony couldn't get GDDR5 that's been ported to a newer process? I'd imagine that their volume would be high enough to warrant such a move. I suppose it'd be up to either Samsung or Hynix to conduct the R&D for it, though.

One final bit: obviously one of the biggest reasons consoles work so well is because they have a fixed set of hardware. Is there not any room for performance to improve over each console's respective lifespan? Surely something simple like increasing the clock speed wouldn't throw things off, would it?

If we ignore the cost of doing so, would moving to DDR4 break compatibility between the Xbox One and a theoretical Xbox DDR4? If not, I suppose Sony could theoretically implement GDDR6. Would stacked DRAM require a revised memory controller? I'm sure GDDR6 and stacked DRAM would be hilarious overkill, but it's fun to imagine.

How about doubling the eSRAM size? I've seen some questioning of the usefulness of such a small frame buffer. I believe I saw some criticism of Haswell GT3e's 128 MB eDRAM as well. What would 64 MB allow us to do, where 32 MB would fall short? 128 MB? 256 MB? At what point does the advantage disappear, and we simply have too much memory to do anything interesting with?
 
Lay person here. But isn't the latency difference between eDRAM/eSRAM and off-chip GDDR5/DDR3 much larger than is being discussed?

The latency of video DRAM (the time it takes to service a memory request) can be the same regardless of whether it's in your PC or in a server hundreds of miles away. However, a memory request from your PC to a server isn't going to be met as quickly as a memory request to your PC's own memory. The temporal latency may be the same, but the spatial latency involved is drastically different.

Isn't on-chip memory much faster simply because data has a shorter distance to travel, as well as the ability to service requests faster (requiring fewer cycles)?
 
Lay person here. But isn't the latency difference between eDRAM/eSRAM and off-chip GDDR5/DDR3 much larger than is being discussed?

The latency of video DRAM (the time it takes to service a memory request) can be the same regardless of whether it's in your PC or in a server hundreds of miles away. However, a memory request from your PC to a server isn't going to be met as quickly as a memory request to your PC's own memory. The temporal latency may be the same, but the spatial latency involved is drastically different.

Isn't on-chip memory much faster simply because data has a shorter distance to travel, as well as the ability to service requests faster (requiring fewer cycles)?

I haven't heard the terms spatial and temporal latency before. Maybe you're thinking of latency vs bandwidth?

Propagation delay does play a role in adding to memory latency but it's very small. Around 0.1ns per cm at most. The memory on PS4 shouldn't be more than a few cm away from the APU so I don't think it'll make a big difference.
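As a scale check (the 5 cm trace length and 40 ns total access latency are just illustrative assumptions):

```python
# Propagation delay vs. total DRAM access latency (illustrative numbers).
prop_ns_per_cm  = 0.1    # quoted upper bound above
trace_cm        = 5      # assumed distance from APU to GDDR5
dram_latency_ns = 40     # assumed total access latency
flight_ns = prop_ns_per_cm * trace_cm
print(flight_ns, "ns out of ~", dram_latency_ns, "ns")   # ~1% of the total
```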
 