The eSRAM in Durango as a possible performance aid

So, at best, how much space would the eSRAM take up on the APU?

Well, I could not get further than the assumption of 28nm TSMC bulk and a 6T structure.

That led me to a Chipworks SEM image of a 6T SRAM cell on Tahiti.

So that is as far as I could go, and it rests on the assumption that these two things are correct: "28nm TSMC bulk" and "6T structure".



Can't seem to quickly find the free download anymore:

http://www.chipworks.com/en/technic...15821.jpg?file=2012/12/figure-4-amd215821.jpg

https://chipworks.secure.force.com/catalog/ProductDetails?sku=AMD-215-0821060&viewState=DetailView



Others can give a better answer, I think, since they know much more about what the total eSRAM block might take up in space. (Perhaps from their knowledge and/or experience with SRAM blocks in other designs. I do not know what extra is added to the 6T SRAM to turn it into 'eSRAM'; I don't know why it has the 'e'.)
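For what it's worth, here is the kind of back-of-the-envelope arithmetic I mean, as a C snippet. Every input is an assumption, not something confirmed here: the 32 MiB capacity is the commonly leaked Durango figure, the ~0.127 µm² bitcell is a number often cited for TSMC 28nm, and the periphery factor is a guess.

```c
#include <stdio.h>

int main(void)
{
    /* All inputs are assumptions: a 32 MiB array, a ~0.127 um^2
     * 6T bitcell, and a 1.5x factor for sense amps, decoders,
     * redundancy and wiring around the raw cell array. */
    const double bits     = 32.0 * 1024 * 1024 * 8; /* 32 MiB in bits  */
    const double cell_um2 = 0.127;                  /* area per bitcell */
    const double periph   = 1.5;                    /* overhead factor  */

    double raw_mm2   = bits * cell_um2 / 1e6;       /* 1 mm^2 = 1e6 um^2 */
    double total_mm2 = raw_mm2 * periph;

    printf("raw cell area : %.1f mm^2\n", raw_mm2);   /* ~34 mm^2 */
    printf("with periphery: %.1f mm^2\n", total_mm2); /* ~51 mm^2 */
    return 0;
}
```

Swap in whatever cell area the Chipworks Tahiti image actually shows and the estimate moves proportionally.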
 
It may be a matter of what wasn't added to the SRAM pool to make it a cache like the other pools on the chip.

I have questions about what kind of replacement policy it may or may not have, whether it can be snooped, how shaders write to it, and how much it can self-manage.

Most of the leaked details show a lot of explicit mapping and data movement via DMA. If the eSRAM were a proper last-level cache in a normal hierarchy, why would this be so heavily emphasized?
It may not be called a cache because it doesn't act like one.
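To make that distinction concrete, here is a sketch in C. The dma_copy/dma_wait names are invented stand-ins for whatever the move-engine interface actually looks like; the point is only the difference in programming model, not any real API.

```c
#include <string.h>
#include <stddef.h>

/* Invented stand-ins for a move-engine interface; a real DMA
 * engine would run asynchronously alongside the GPU. */
static void dma_copy(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
static void dma_wait(void) { /* block until the transfer completes */ }

/* Cache-like model: just dereference main memory and let the
 * hardware decide what stays resident in the fast array. */
static float read_as_cache(const float *buf, size_t i)
{
    return buf[i];
}

/* Scratchpad-like model, matching the leaked emphasis on explicit
 * mapping: stage the data into eSRAM first, then work on it. */
static float read_as_scratchpad(float *esram, const float *dram,
                                size_t count, size_t i)
{
    dma_copy(esram, dram, count * sizeof *dram); /* explicit movement */
    dma_wait();                                  /* ensure it landed  */
    return esram[i];
}

int main(void)
{
    float dram[4] = {1, 2, 3, 4}, esram[4];
    return (int)(read_as_cache(dram, 0) + read_as_scratchpad(esram, dram, 4, 1));
}
```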

It would still need to be called something at that point, and being an embedded array of SRAM with functions reminiscent of the 360's eDRAM is suggestive of the new label.
 
PRT for that type of memory would be interesting. No need to tile manually.
Partial Resident Rendertarget ...
 
It's just an unrealistic but beautiful setup I have in mind. Say we have a virtual memory range for the embedded RAM, around 1GB, mapped to main memory. Whenever a page fault occurs, the move engines automagically swap the faulting page from main memory into the embedded memory via a default or a custom handler, and some least-recently-used (LRU) page back to main memory. If the page references a PRT, then the PRT handler would take care of making the content available.
This is effectively a programmable two-level cache hierarchy. If the access pattern on the embedded RAM is local and predictable, it should be possible to hide the quite long latency of handling such an "exception".
If the handler programs can be complex, it could be possible to implement a transaction scheme: the handlers could decide to record page faults and preempt or abort the shader that wants to work on them; later, when some pages become old enough, it could swap those recorded pages in and revive the related shaders. So the MMU handler acts like a custom scheduler, driven by memory access patterns.
Quite crazy, but beautiful. :)
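A minimal sketch of that scheme in C, assuming a 64 KiB page size and a software-visible fault hook; every name and constant here is invented for illustration, since nothing like this is confirmed for Durango.

```c
#include <stdio.h>

#define ESRAM_PAGES 512   /* assumed: 32 MiB of embedded RAM / 64 KiB pages */

typedef struct {
    long     virt_page;   /* which virtual page currently lives in this slot */
    unsigned last_use;    /* timestamp driving the LRU eviction choice */
} slot_t;

static slot_t   slots[ESRAM_PAGES];
static unsigned now_tick;

/* The custom handler the idea imagines: on a miss in the embedded
 * RAM, pick the least-recently-used slot, write it back to main
 * memory, and pull the faulting page in via the move engines. */
static int handle_fault(long virt_page)
{
    int victim = 0;
    for (int i = 1; i < ESRAM_PAGES; i++)
        if (slots[i].last_use < slots[victim].last_use)
            victim = i;

    /* In hardware these would be asynchronous move-engine
     * transfers; a PRT-backed page would go through the PRT
     * handler instead of a plain copy. */
    printf("evict page %ld from slot %d, load page %ld\n",
           slots[victim].virt_page, victim, virt_page);

    slots[victim].virt_page = virt_page;
    slots[victim].last_use  = ++now_tick;
    return victim;
}

int main(void)
{
    for (int i = 0; i < ESRAM_PAGES; i++)
        slots[i].virt_page = -1;      /* start with an empty array */
    handle_fault(1234);               /* simulate two misses */
    handle_fault(5678);
    return 0;
}
```

The transaction variant would have handle_fault record the miss and suspend the shader instead of copying immediately, resuming it once the page has been brought in.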
 
If that is reasonably doable, AMD should look no further to improve the performance of its own APUs.
 
I have a question: now that the eSRAM is at like 196GB/s, what kind of real-world benefits will we see? Higher framerates? Higher resolution? More effects? How will the higher bandwidth help with the tiled rendering stuff MS showed off at Build?
 
Because who needs a friggin' GPU.
 
I guess I should say more stable framerates when under stress? Like when there's a lot happening on screen at once? I'm still confused how it went from 102GB/s to 196GB/s, but overall it's a good thing, right?
 

It was 196GB/s only in a very specific theoretical situation; it was 133GB/s in real-world scenarios.

It's still a nice boost, though.
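For what it's worth, the quoted numbers line up with a simple model, shown here as C arithmetic. The 128-bytes-per-cycle-per-direction interface at 800 MHz is an assumption, chosen only because it reproduces the original 102GB/s one-way figure; the rest follows if reads and writes can overlap.

```c
#include <stdio.h>

int main(void)
{
    /* Assumed interface: 128 bytes per cycle per direction at
     * 800 MHz. These parameters are a guess that happens to
     * reproduce the originally quoted one-way figure. */
    const double bytes_per_cycle = 128.0;
    const double clock_hz        = 800e6;

    double one_way  = bytes_per_cycle * clock_hz / 1e9; /* ~102.4 GB/s */
    double combined = 2.0 * one_way;                    /* ~204.8 GB/s */

    printf("one direction         : %.1f GB/s\n", one_way);
    printf("read + write combined : %.1f GB/s\n", combined);

    /* On this reading, ~196GB/s would be a peak with reads and
     * writes overlapped on most cycles, and ~133GB/s what real
     * access patterns were reported to sustain. */
    return 0;
}
```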
 

Why would that page-fault setup be crazy/unrealistic?
 

Making hardware exception handling work for real is one of the very dark arts. That's true for both the provider and the programmer. Proving it works as intended is a requirement: not just run-and-see, but logically. And because the scope of the suggestion is big, with a huge in-flight processor state, that's very difficult to achieve.

I tend to be overly conservative in estimates about complexity and the time needed for software implementations (because of the unknowns you don't know about yet), so maybe I misjudge this one and it's a cakewalk. :)
 
It has not been confirmed, but the information that came with it stated that it was 133GB/s in real-world scenarios, and that's only under specific conditions as well.

Do they say what scenarios those would be? The Xbox One is seemingly a lot more powerful than people originally gave it credit for.
 
No, there's no real understanding behind the reported BW increase. The only explanation we got, from DF, made it sound impossible: getting more BW than the bus is actually capable of carrying.
 
I know I'm pretty much uninformed on how any of this stuff works, so bear with me. But I thought the eSRAM's bandwidth was based on the clocks of the GPU it was embedded on, when talking about the Xbox One. Someone said bidirectional, so that means sending and receiving, right? So is it 133GB/s both ways, or is it 66.5GB/s each way?
 
From what I understand, the game/program has to use GPU cycles for the move engines. This would make porting a bit more complicated. No idea how granular the process is, or if it plays havoc with the GPU cache.
 