The pros and cons of eDRAM/ESRAM in next-gen

And now the obvious question :p - why not increase the size of both the L2s and the ROP caches? Or is it that any practical size increase wouldn't be anywhere near as useful as just having the scratchpad?
Nvidia did exactly this with Maxwell. Kepler (GK107) had 256KB L2 and Maxwell (GM107) has 2MB (8x increase). This helps both performance (BW savings, latency savings) and power efficiency (L2 traffic is much more power efficient than main memory traffic).
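
As a rough illustration of the bandwidth/power point (the hit rates and energy-per-byte figures below are made-up placeholders for illustration, not GK107/GM107 measurements): a bigger L2 converts would-be DRAM traffic into on-chip traffic.

```python
# Rough sketch: a larger L2 turns would-be DRAM traffic into on-chip traffic.
# Hit rates and pJ/byte figures are illustrative assumptions only.
def dram_traffic_and_energy(total_gb, l2_hit_rate,
                            pj_per_byte_l2=2.0, pj_per_byte_dram=20.0):
    """Return (GB that still hit DRAM, joules spent moving the data)."""
    dram_gb = total_gb * (1.0 - l2_hit_rate)
    l2_gb = total_gb * l2_hit_rate
    energy_j = (l2_gb * pj_per_byte_l2 + dram_gb * pj_per_byte_dram) * 1e9 * 1e-12
    return dram_gb, energy_j

for label, hit_rate in (("small L2", 0.30), ("8x larger L2", 0.60)):
    dram_gb, energy_j = dram_traffic_and_energy(100.0, hit_rate)
    print(f"{label}: {dram_gb:.0f} GB reaches DRAM, ~{energy_j:.2f} J for data movement")
```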

Cache logic (and tags) takes a considerable amount of die space. The question thus becomes: which is better, a (2x+) larger manual scratchpad or a smaller automatic cache? For PC you obviously want the automated cache, since PC graphics APIs don't expose exotic manual scratchpads. For a console, the manual scratchpad is likely the better choice (as developers are directly targeting the hardware). Obviously both choices have their trade-offs.
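
To put a very rough number on the tag cost (the 64-byte line, 16-way associativity, 40-bit physical address and 2 state bits are all assumptions, and this counts only the tag bits, not the comparators, LRU or controller logic; smaller line sizes push the overhead up a lot):

```python
import math

# Back-of-the-envelope tag-array overhead relative to the data array.
# Every parameter here is an illustrative assumption, not a real GPU config.
def tag_overhead(cache_bytes, line_bytes=64, ways=16,
                 phys_addr_bits=40, state_bits=2):
    lines = cache_bytes // line_bytes
    sets = lines // ways
    index_bits = int(math.log2(sets))
    offset_bits = int(math.log2(line_bytes))
    tag_bits_per_line = phys_addr_bits - index_bits - offset_bits + state_bits
    return (tag_bits_per_line * lines) / (cache_bytes * 8)

for size_mb in (2, 32):
    pct = tag_overhead(size_mb * 2**20) * 100
    print(f"{size_mb} MB cache: ~{pct:.1f}% extra bits just for tags")
```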
 
Nvidia did exactly this with Maxwell. Kepler (GK107) had 256KB L2 and Maxwell (GM107) has 2MB (8x increase). This helps both performance (BW savings, latency savings) and power efficiency (L2 traffic is much more power efficient than main memory traffic).

Indeed... I suppose that's a question for AMD engineers, or maybe it was simply a time constraint (no time to design* or test the ideal setup) and/or perhaps MS wasn't as open to it (very strict design requirements) :?:

*not sure how simple it'd be to just increase the L2 or ROP caches while maintaining the high bandwidth there.

Then we have the Hawaii GPU, which seems to be more or less a super-sized Tahiti - the external GPU bandwidth didn't really scale that much (320 vs 288GB/s) compared to the increase in shader/tex & ROP counts. It makes me more curious about how a Tahiti-class GPU at Hawaii size would do, filled in with more cache instead...

Perhaps their internal tests showed otherwise? Questions...
 
Fill rate testers usually render big full-screen quads that are too large to fit into any caches (a 1080p 4x16-bit HDR render target is 16 MB). In a realistic scenario, however, particles (the biggest overdraw case in most games) are clumped into smallish groups on screen. Thus particle blending benefits a lot from large ROP caches.
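
Quick sanity check on the 16 MB figure (four 16-bit channels at 1080p is 8 bytes per pixel):

```python
# 1080p render target with four 16-bit channels (e.g. RGBA16F) = 8 bytes/pixel.
width, height, bytes_per_pixel = 1920, 1080, 4 * 2
size_bytes = width * height * bytes_per_pixel
print(f"{size_bytes / 2**20:.1f} MiB")  # ~15.8 MiB, i.e. roughly 16 MB
```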

The programmer can also order particle rendering in a way that benefits more from ROP caches (spatial sorting). This can double the achieved fill rate (because of reduced BW cost).
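
A minimal sketch of that kind of spatial sort (not any engine's actual code; the 32-pixel tile size and the Morton ordering are my own assumptions): group particles by screen tile along a Z-order curve, so consecutive blends keep hitting the same ROP cache lines.

```python
def morton_2d(x, y):
    """Interleave the bits of two 16-bit tile coordinates (Z-order curve)."""
    code = 0
    for bit in range(16):
        code |= ((x >> bit) & 1) << (2 * bit)
        code |= ((y >> bit) & 1) << (2 * bit + 1)
    return code

def sort_particles_for_rop_locality(particles, tile_size=32):
    """Order (screen_x, screen_y, payload) particles so nearby tiles render back to back."""
    return sorted(particles, key=lambda p: morton_2d(int(p[0]) // tile_size,
                                                     int(p[1]) // tile_size))

# Two clumps end up grouped by screen locality instead of submission order.
particles = [(30, 40, "a"), (900, 500, "b"), (35, 45, "c"), (905, 510, "d")]
print([p[2] for p in sort_particles_for_rop_locality(particles)])  # ['a', 'c', 'b', 'd']
```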
 
I may be wrong, but the measurements made by hardware.fr don't seem to validate that scenario:
http://www.hardware.fr/articles/916-4/performances-theoriques-pixels.html
Fillrate and fillrate with blending are lower than the figures for the competing AMD cards.

hm... wouldn't it be more useful to compare the effect of the cache vs nV's other chips? The tex fillrate on 750Ti seems to hit closer to theoretical max compared to GTX660 and GTX650.

~ result vs theoretical Gtex/s
750Ti - 42 vs 43 - ~97%
660 - 62 vs 83 - ~75%
650 - 25 vs 34 - ~74%

for D32F in particular:
750Ti - 36 vs 43 - ~84%
660 - 50 vs 83 - ~60%
650 - 19 vs 34 - ~55%

---

Looks like they have half-rate blending with formats other than 32-bit INT. hm... IIRC, wasn't there some hubbub about AMD & FP10 using the 32-bit path whereas nV was doing FP16?

bbiab
 
Fill rate testers usually render big full-screen quads that are too large to fit into any caches (a 1080p 4x16-bit HDR render target is 16 MB). In a realistic scenario, however, particles (the biggest overdraw case in most games) are clumped into smallish groups on screen. Thus particle blending benefits a lot from large ROP caches.

The programmer can also order particle rendering in a way that benefits more from ROP caches (spatial sorting). This can double the achieved fill rate (because of reduced BW cost).
Thanks for the insight. But would that show up in software that hasn't been designed for that specific piece of hardware? Do Nvidia engineers go that far in optimizing things?

hm... wouldn't it be more useful to compare the effect of the cache vs nV's other chips? The tex fillrate on 750Ti seems to hit closer to theoretical max compared to GTX660 and GTX650.

~ result vs theoretical Gtex/s
750Ti - 42 vs 43
660 - 62 vs 83
650 - 25 vs 34

for D32F in particular:
750Ti - 36 vs 43
660 - 50 vs 83
650 - 19 vs 34
Well, it could be the cache or something else. Off the top of my head, there could be a bottleneck somewhere, in which case the result would come from the doubling of the ALU/TEX ratio while the bottleneck remains.

I don't know; I read a couple of discussions about how the cache works in Nvidia GPUs, and in Maxwell more specifically, and it is over my head.
I don't question that the big L2 helps. From the little I understand of the topic, I get that the L2 in Nvidia GPUs is a more "generic" (sorry if the wording is really incorrect) structure than in AMD GPUs.
The ROPs are tied to the L2, and so are the L1s, whereas in AMD's design the RBEs have their own color and Z caches, for example, and the L2 seems tied only to the shader cores, so overall its increased size may affect the design in more than one way.
There are also other improvements in Maxwell outside of the L2 size; it looks "leaner". The shared memory is now only that, with the texture cache being used as the "L1" for compute operations (from what I get, it is not exactly an L1 as in a CPU), so there is more shared memory and some operations are faster. It is really over my head, and it's tough because articles that try to explain the inner workings are not that numerous, and they treat "things" differently between compute and rendering mode (I'm at a loss for words here; what I'm trying to say is that instead of presenting the hardware as a whole and then explaining how each part plays its role while rendering or computing, they present things in a shallow manner when it comes to graphics and go more in depth when it comes to compute, which makes it difficult for a dummy like me to get it all "together"... within the limited scope of my understanding, that is :LOL: ).
 
And now the obvious question :p - why not increase the size of both the L2s and the ROP caches? Or is it that any practical size increase wouldn't be anywhere near as useful as just having the scratchpad?

For the first question: AMD's introduction of GCN indicated that L2 partitions can be 64KB or 128KB in size. It seems likely that existing designs have cache controllers and pipelines that can pick one or the other when implementing a GPU.
L2 slices are directly tied to memory channels, so the number of slices is not flexible for the console scenario.
At best, they could probably hope for doubling a pretty small L2 to something still pretty small.
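
To put a number on "still pretty small" (the 4-channel figure is just an illustrative assumption, not the actual Durango memory configuration):

```python
# If each memory channel owns one L2 slice and slices come in 64 KB or 128 KB,
# total L2 is capped at channels * slice_size. Channel count is an assumption.
channels = 4
for slice_kb in (64, 128):
    print(f"{channels} slices x {slice_kb} KB = {channels * slice_kb} KB total L2")
# Even the doubled slice size only gets you from 256 KB to 512 KB.
```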

If GCN were enhanced, that wouldn't be a stumbling block to larger caches, just an impact on the cache logic or added complexity in how the L2 slices interact with the memory channels.
However, there's a reason I say "without a redesign" so often in threads about AMD's custom APUs and other modern products. Up to a point, avoiding redesigns makes it easier to keep things modular. Unless a client pays enough, fiddling with a few lower-level details just means reimplementing hardware and negating the point of the semi-custom division.
AMD's corporate line is all about "leveraging" existing IP, not improving it.

As for the ROPs:
Expanding those means redesigning AMD's fixed-function elements, violating the "without a redesign" trend. There have been some gradual changes to it, but those ROPs (and significant portions of the graphics domain in AMD's GPUs) would look familiar if they were teleported into GPUs from generations ago.
Given their typical access patterns, it would take a much larger cache to avoid being completely thrashed, plus a hardware redesign, and there isn't that much upside because ROPs are extremely good at being ROPs.
Someone would have to innovate and think of a new thing ROPs can do without compromising their current strengths. They have been used for atomics at times, because read/modify/write is what they do.
 
Actually, a much cheaper/smarter modification would've been to turn the ESRAM into a full-on L3 cache, similar to what Intel did with Iris Pro. It is pretty shocking they didn't think of this, given that (allegedly) the ESRAM is full-blown 6-transistor on-chip memory. It's just missing the cache controller!!!

Not just the cache controller; remember the cache tags as well. My 2 cents would be at least a 25% increase in total bits (or a reduction if you reuse some storage bits for tagging). Even so, cache controllers consume a lot of power.

EDIT: sebbbi, I only now read your comment on cache tags. So for a GPU, what changes in the tagging compared with a CPU?

Also, a colleague of mine worked on cache and scratchpad allocation; maybe someone can benefit from this approach: http://dl.acm.org/citation.cfm?doid=2020876.2020926
 
Fill rate testers usually render big full-screen quads that are too large to fit into any caches (a 1080p 4x16-bit HDR render target is 16 MB). In a realistic scenario, however, particles (the biggest overdraw case in most games) are clumped into smallish groups on screen. Thus particle blending benefits a lot from large ROP caches.

Great info, as always sebbbi. Thanks. I guess that is partially the reason why massive particle systems have become so popular this gen then.
 
I find it interesting that the Ryse developers used a 128MB shadow map ...

Shadow Map Optimization
- Static shadow map
- Generate large shadow map with all static objects once at level load time
- Avoids re-rendering distant static objects every frame
- 8192x8192 16 bit shadow map (128 MB) covering a 1km area of the game world provides sufficient resolution


I guess the Ryse devs used the "overflow" possibility of the ESRAM mentioned by Xbox One's Baker: "The Xbox 360 was the easiest console platform to develop for, it wasn't that hard for our developers to adapt to eDRAM, but there were a number of places where we said, 'gosh, it would sure be nice if an entire render target didn't have to live in eDRAM' and so we fixed that on Xbox One where we have the ability to overflow from ESRAM into DDR3, so the ESRAM is fully integrated into our page tables and so you can kind of mix and match the ESRAM and the DDR memory as you go... From my perspective it's very much an evolution and improvement - a big improvement - over the design we had with the Xbox 360. I'm kind of surprised by all this, quite frankly." ...

Thoughts ?!

(p.s. I'm assuming the esram is being used to store the map)
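
Quick check on the numbers in that slide:

```python
# Sanity check on the Ryse figures: 8192 x 8192 texels at 16 bits each.
texels = 8192 * 8192
print(f"{texels * 2 / 2**20:.0f} MB")           # 128 MB, i.e. 4x the 32 MB of ESRAM
print(f"~{8192 / 1000:.1f} texels per metre")   # resolution over the stated 1 km area
```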
 
I find it interesting that the Ryse developers used a 128MB shadow map ...

Shadow Map Optimization
- Static shadow map
- Generate large shadow map with all static objects once at level load time
- Avoids re-rendering distant static objects every frame
- 8192x8192 16 bit shadow map (128 MB) covering a 1km area of the game world provides sufficient resolution


I guess the Ryse devs used the "overflow" possibility of the ESRAM mentioned by Xbox One's Baker: "The Xbox 360 was the easiest console platform to develop for, it wasn't that hard for our developers to adapt to eDRAM, but there were a number of places where we said, 'gosh, it would sure be nice if an entire render target didn't have to live in eDRAM' and so we fixed that on Xbox One where we have the ability to overflow from ESRAM into DDR3, so the ESRAM is fully integrated into our page tables and so you can kind of mix and match the ESRAM and the DDR memory as you go... From my perspective it's very much an evolution and improvement - a big improvement - over the design we had with the Xbox 360. I'm kind of surprised by all this, quite frankly." ...

Thoughts ?!

(p.s. I'm assuming the esram is being used to store the map)
I remember ERP listing shadow maps as a case of non-bandwidth-limited usage of the ROPs.
If I understood what he meant correctly, under proper usage the limiting factor here would be ROP throughput, in which case I would think they would be better off rendering to main RAM.
 
Yup, that's major surgery. They were already at 5 billion transistors; that would have pushed them well over 7. Does even Titan/GK110 have 7?
How is that simple?

I never said it was simple, but the work involved from a design, testing, and manufacturing perspective would be exactly the same. My assertion is that their performance target was simply set too low. And that a higher spec'd chip wouldn't have so dramatically affected cost as to not outweigh the current negative publicity from being the weaker console.

They have plenty of metrics to know exactly what common engine G-buffer, Z, and color backbuffer usage at next-gen resolution targets would require. So if the decision was made early on that GDDR was out and on-chip cache was in, they shouldn't have knowingly gimped it. They could have easily gone with a separate die (à la 360) or beefed up their current design, keeping everything else the same, including speeds and feeds. I'm quite certain that even moving to 48MB @ 306GB/s with 24 ROPs would have made a significant difference in the current performance delta, especially considering that each X1 ROP is less bandwidth-bound than its PS4 counterpart, and 48MB is just enough for 4x 1080p RTs with color and Z at 32 bits. 64MB and 32 ROPs would have been going the extra mile. As they should have, but you don't run as hard when you think you're at the top of the hill.
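
For reference, the arithmetic behind the "just enough" claim, reading it as four 32-bit colour targets plus one 32-bit depth buffer at 1080p (my reading; tiling/alignment padding ignored):

```python
# Four 32-bit 1080p colour targets plus one 32-bit depth buffer, no padding.
pixels = 1920 * 1080
bytes_per_surface = pixels * 4                 # 32 bits per pixel
total_mb = (4 + 1) * bytes_per_surface / 2**20
print(f"~{total_mb:.1f} MB, which fits in a hypothetical 48 MB")   # ~39.6 MB
```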
 
IMHO the whole problem could have been "solved" with a 2nd layer stacked with EDRAM or ESRAM. That would leave enough room on the APU for a bigger GPU, and the ED/SRAM layer would be cheap and high-yield. At either 20 or 14nm they could then have integrated it into the APU as ESRAM.
If cost optimization was such an issue with the chip, it makes you wonder why they didn't bother with low cost for the rest of the console (case, cooling, and some questionable gimmicks).
 
I never said it was simple, but the work involved from a design, testing, and manufacturing perspective would be exactly the same.
AMD's system architecture is tied up in an evolution of the Llano uncore, and relies on a system crossbar that the eSRAM and memory controllers lie behind.
If my admittedly shaky assumption about the crossbar logic blocks on the Durango die is somewhat correct, the eSRAM's addition may have doubled the investment there, at least.
Doubling the ROPs and doubling the eSRAM controller clients would be a significant increase in on-die complexity, aside from die size increases and a power ceiling we know Durango is quite near.
The minor "free" upclock indicates the chip was somewhere in the neighborhood of the power targets Microsoft set.

And that a higher spec'd chip wouldn't have so dramatically affected cost as to not outweigh the current negative publicity from being the weaker console.
The technical component to this is probably the weakest factor. Price, fear of a Kinect-based beachhead against traditional gaming, and ham-handed corporate decisions and messaging would have counted for more. I believe an attempt at a rational analysis early in the development cycle and before the console stats and gamer reactions to ROP counts and SIMD capabilities would not have predicted this, because much of the blowback on the tech is not rational.

They could have easily gone with a separate die (à la 360) or beefed up their current design, keeping everything else the same, including speeds and feeds.
The initial platform goals had it as an SOC. The makeup of AMD's IP is such that splitting things off of the die would have had serious downsides.
Bandwidth would not be as high going off-die, absent manufacturing and packaging choices that were non-options.

I'm quite certain that even moving to 48MB @ 306GB/s with 24 ROPs would have made a significant difference in the current performance delta.
Easier said or awkwardly pasted with MS Paint than done.

If people want to recalibrate the system, cost, and power parameters as well, then have at it.
It's not a trivial exercise to make calls like this years ahead of time.
Saying X or Y can be done without touching the constraints the design was given is discussing the color of unicorns.


IMHO the whole problem could have been "solved" with a 2nd layer stacked with EDRAM or ESRAM.
Many of this gen's design problems, Xbox or PS4, could be solved by design choices that are not possible.
 
If my admittedly shaky assumption about the crossbar logic blocks on the Durango die is somewhat correct, the eSRAM's addition may have doubled the investment there, at least.

Nah, I doubt that: the eSRAM is directly and quickly accessible from the GPU, not the CPU - the CPU penalties for accessing it are high.
The eSRAM is integrated into the GPU core (far right of the chip), as the chip layout clearly shows.
For sure, however, there must be some investment in the CPU/GPU ties to the NB zone (plus whatever arbitration scheme they used).
 
I think he just alluded to tech that is not yet ready for mass production, like HMC, stacked memory, etc.,
or too costly (like IBM's 32nm process, anything Intel, etc.).
 
Could you elaborate why that wouldn't have been possible?

A chip that comes out this year or so was likely architected at least 3-5 years ago.
How can you safely predict where you will be in five years? You can probably change something, but the company chain of approval is long: you cannot change that MUCH in the meantime.

...unless you go from 4Gb to 8Gb chips of the same RAM, of course.
 
I think he just alluded to tech that is not yet ready for mass production, like HMC, stacked memory, etc.,
or too costly (like IBM's 32nm process, anything Intel, etc.).

Nail on the head. I'm astonished at the amount of commentary that discusses stacked chips as if they were so cheap and common that your fridge's ASIC has been using them for years. It's still pretty exotic and it still costs more than not doing it; so far the only large-scale use has been in the mobile and phone space, where board space is at enough of a premium to justify the cost. The one thing I don't think we can accuse the XB1 design of being is too stingy with board space.

pMax also makes a very valid point: until the last-minute availability of 8Gb GDDR5 chips, the 8GB DDR3 + ESRAM design was looking pretty smart. Alas for MS, swapping RAM chips is a fairly easy step compared to doing a single revision of a CPU design. Not that increasing the ESRAM was on the cards, or that the ESRAM could have been placed off-die (I don't think so anyway, but please correct me if wrong), and with Intel likely to burn their fabs to the ground before they'd manufacture AMD IP, off-chip EDRAM was never open to them (aside from Intel having 0% interest anyway).

Durango is a good design with interesting wrinkles that will provide fodder for interesting discussions for years much like Cell did (but not as waaayyy out there as Cell was).
 