The eSRAM in Durango as a possible performance aid

So, at best, how much space would the eSRAM take up on the APU?

Well, I could not get further than the assumption of 28nm TSMC bulk and a 6T structure.

That led me to a Chipworks SEM image of a 6T SRAM cell on Tahiti.

So that is as far as I could go, and it rests on the assumption that these two things are correct: "28nm TSMC bulk" and "6T structure".



Can't seem to quickly find the free download anymore:

http://www.chipworks.com/en/technic...15821.jpg?file=2012/12/figure-4-amd215821.jpg

https://chipworks.secure.force.com/catalog/ProductDetails?sku=AMD-215-0821060&viewState=DetailView



Others can give a better answer, I think, since they know much more about what the total eSRAM block might take up in space. (Perhaps from their knowledge and/or experience with SRAM blocks in other designs. I do not know what extra is added to the 6T SRAM to turn it into 'eSRAM'; I don't know why it has the 'e'.)
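For what it's worth, here is the kind of back-of-the-envelope arithmetic I mean, as a C snippet. Every input is an assumption, not something confirmed here: the 32 MiB capacity is the commonly leaked Durango figure, the ~0.127 µm² bitcell is a number often cited for TSMC 28nm, and the periphery factor is a guess.

```c
#include <stdio.h>

int main(void)
{
    /* All inputs are assumptions: a 32 MiB array, a ~0.127 um^2
     * 6T bitcell, and a 1.5x factor for sense amps, decoders,
     * redundancy and wiring around the raw cell array. */
    const double bits     = 32.0 * 1024 * 1024 * 8; /* 32 MiB in bits  */
    const double cell_um2 = 0.127;                  /* area per bitcell */
    const double periph   = 1.5;                    /* overhead factor  */

    double raw_mm2   = bits * cell_um2 / 1e6;       /* 1 mm^2 = 1e6 um^2 */
    double total_mm2 = raw_mm2 * periph;

    printf("raw cell area : %.1f mm^2\n", raw_mm2);   /* ~34 mm^2 */
    printf("with periphery: %.1f mm^2\n", total_mm2); /* ~51 mm^2 */
    return 0;
}
```

Swap in whatever cell area the Chipworks Tahiti image actually shows and the estimate moves proportionally.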
 
It may be a matter of what wasn't added to the SRAM pool to make it a cache like the other pools on the chip.

I have questions about what kind of replacement policy it may or may not have, whether it can be snooped, how shaders write to it, and how much it can self-manage.

Most of the leaked details show a lot of explicit mapping and data movement via DMA. If the eSRAM were a proper last-level cache in a normal hierarchy, why would this be so heavily emphasized?
It may not be called a cache because it doesn't act like one.
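To make that distinction concrete, here is a sketch in C. The dma_copy/dma_wait names are invented stand-ins for whatever the move-engine interface actually looks like; the point is only the difference in programming model, not any real API.

```c
#include <string.h>
#include <stddef.h>

/* Invented stand-ins for a move-engine interface; a real DMA
 * engine would run asynchronously alongside the GPU. */
static void dma_copy(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
static void dma_wait(void) { /* block until the transfer completes */ }

/* Cache-like model: just dereference main memory and let the
 * hardware decide what stays resident in the fast array. */
static float read_as_cache(const float *buf, size_t i)
{
    return buf[i];
}

/* Scratchpad-like model, matching the leaked emphasis on explicit
 * mapping: stage the data into eSRAM first, then work on it. */
static float read_as_scratchpad(float *esram, const float *dram,
                                size_t count, size_t i)
{
    dma_copy(esram, dram, count * sizeof *dram); /* explicit movement */
    dma_wait();                                  /* ensure it landed  */
    return esram[i];
}

int main(void)
{
    float dram[4] = {1, 2, 3, 4}, esram[4];
    return (int)(read_as_cache(dram, 0) + read_as_scratchpad(esram, dram, 4, 1));
}
```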

It would still need to be called something at that point, and being an embedded array of SRAM with functions reminiscent of the 360's eDRAM is suggestive of the new label.
 
PRT for that type of memory would be interesting. No need to tile manually.
Partial Resident Rendertarget ...
 
It's just an unrealistic but beautiful setup I have in mind. Say we have a virtual memory range for the embedded RAM, around 1GB, mapped to main memory. Whenever a page fault occurs, the move engines automagically swap the faulting page from main memory into the embedded memory via a default or a custom handler, and some least-recently-used (LRU) page back to main memory. If the page references a PRT, then the PRT handler would take care of making the content available.
This is effectively a programmable two-level cache hierarchy. If the access pattern on the embedded RAM is local and predictable, it should be possible to hide the quite long latency of handling such an "exception".
If the handler programs can be complex, it could be possible to implement a transaction scheme: the handlers could decide to record page faults and preempt or abort the shader that wants to work on them; later, when some pages become old enough, it could swap those recorded pages in and revive the related shaders. So the MMU handler acts like a custom scheduler, driven by memory access patterns.
Quite crazy, but beautiful. :)
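A minimal sketch of that scheme in C, assuming a 64 KiB page size and a software-visible fault hook; every name and constant here is invented for illustration, since nothing like this is confirmed for Durango.

```c
#include <stdio.h>

#define ESRAM_PAGES 512   /* assumed: 32 MiB of embedded RAM / 64 KiB pages */

typedef struct {
    long     virt_page;   /* which virtual page currently lives in this slot */
    unsigned last_use;    /* timestamp driving the LRU eviction choice */
} slot_t;

static slot_t   slots[ESRAM_PAGES];
static unsigned now_tick;

/* The custom handler the idea imagines: on a miss in the embedded
 * RAM, pick the least-recently-used slot, write it back to main
 * memory, and pull the faulting page in via the move engines. */
static int handle_fault(long virt_page)
{
    int victim = 0;
    for (int i = 1; i < ESRAM_PAGES; i++)
        if (slots[i].last_use < slots[victim].last_use)
            victim = i;

    /* In hardware these would be asynchronous move-engine
     * transfers; a PRT-backed page would go through the PRT
     * handler instead of a plain copy. */
    printf("evict page %ld from slot %d, load page %ld\n",
           slots[victim].virt_page, victim, virt_page);

    slots[victim].virt_page = virt_page;
    slots[victim].last_use  = ++now_tick;
    return victim;
}

int main(void)
{
    for (int i = 0; i < ESRAM_PAGES; i++)
        slots[i].virt_page = -1;      /* start with an empty array */
    handle_fault(1234);               /* simulate two misses */
    handle_fault(5678);
    return 0;
}
```

The transaction variant would have handle_fault record the miss and suspend the shader instead of copying immediately, resuming it once the page has been brought in.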
 
If that is reasonably doable, AMD should look no further to improve the performance of its own APUs.
 
I have a question: now that the eSRAM is at like 196GB/s, what kind of real-world benefits will we see? Higher framerates? Higher resolution? More effects? How will the higher bandwidth help with the tiled rendering stuff MS showed off at Build?
 
Because who needs a friggin' GPU.
 
I guess I should say more stable framerates when under stress? Like when there's a lot happening on screen at once? I'm still confused how it went from 102GB/s to 196GB/s, but overall it's a good thing, right?
 

It was 196GB/s only in a very specific theoretical situation; it was 133GB/s in real-world scenarios.

It's still a nice boost, though.
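For what it's worth, the quoted numbers line up with a simple model, shown here as C arithmetic. The 128-bytes-per-cycle-per-direction interface at 800 MHz is an assumption, chosen only because it reproduces the original 102GB/s one-way figure; the rest follows if reads and writes can overlap.

```c
#include <stdio.h>

int main(void)
{
    /* Assumed interface: 128 bytes per cycle per direction at
     * 800 MHz. These parameters are a guess that happens to
     * reproduce the originally quoted one-way figure. */
    const double bytes_per_cycle = 128.0;
    const double clock_hz        = 800e6;

    double one_way  = bytes_per_cycle * clock_hz / 1e9; /* ~102.4 GB/s */
    double combined = 2.0 * one_way;                    /* ~204.8 GB/s */

    printf("one direction         : %.1f GB/s\n", one_way);
    printf("read + write combined : %.1f GB/s\n", combined);

    /* On this reading, ~196GB/s would be a peak with reads and
     * writes overlapped on most cycles, and ~133GB/s what real
     * access patterns were reported to sustain. */
    return 0;
}
```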
 

Why would that page-fault setup be crazy/unrealistic?
 

Making hardware exception handling work for real is one of the very dark arts. That's true for both the provider and the programmer. Proving it works as intended is a requirement: not just run-and-see, but logically. And because the scope of the suggestion is big, with a huge in-flight processor state, that's very difficult to achieve.

I tend to be overly conservative in estimates about complexity and the time needed for software implementations (because of the unknowns you don't know about yet), so maybe I misjudge this one and it's a cakewalk. :)
 
It has not been confirmed, but the information that came with it stated that it was 133GB/s in real-world scenarios, and that's only under specific conditions as well.

Do they say what scenarios those would be? The Xbox One is seemingly a lot more powerful than people originally gave it credit for.
 
No, there's no real understanding behind the reported BW increase. The only explanation we got, from DF, made it sound impossible: getting more BW than the bus is actually capable of carrying.
 
I know I'm pretty much uninformed on how any of this stuff works, so bear with me. But I thought the eSRAM's bandwidth was based on the clocks of the GPU it was embedded on, when talking about the Xbox One. Someone said bidirectional, so that means sending and receiving, right? So is it 133GB/s both ways, or is it 66.5GB/s each way?
 
From what I understand, the game/program has to use GPU cycles for the move engines. This would make porting a bit more complicated. No idea how granular the process is, or if it plays havoc with the GPU cache.
 