Xbox One (Durango) Technical hardware investigation

SenjutsuSage · Apr 2, 2013

itsmydamnation said:
Now your talking about something completely different from what the physical array is, i.e. some form of S ram or some form of D ram. Now you are talking about a memory hierarchy. Everyone has been talking from a "scratch pad" perspective, which means its not a cache, so you would only get a benefit for an L2 miss if the data is being held on the 32meg *ram block. At this point 28 or 32 cycles, if its S ram or D ram doesn't make a huge difference because either way your saving big against the access latency of the main DDR 3 memory.

If there is an L2 miss and the data is held on the DDR3 the *ram block wont help you, its not a cache.

What about a move engine moving the data from DDR3 to the SRAM? Or is it too tough to manage the data in such a fashion where you could, in more cases than not, usually make certain that the specific data you want to be accessed from the ESRAM is actually present inside before the data is needed? Or maybe the ESRAM is used in a much more limited, but still useful way, where you only use it for a very specific thing consistently.

Check out this powerpoint: Tiled Resources for Xbox 360 and Direct3D 11
http://www.microsoft.com/en-us/download/details.aspx?id=27854

Starting around page 12-16, it seems to describe a method that should be most useful on 32MB of ESRAM. In fact, it practically makes 32MB seem humungous. You have chunks of tiles that all fit within 64KB, and the way it works is that creating a tiled resource simply reserves a virtual memory range without any actual memory allocation. It wouldn't have to be fully resident in order to be used. The method would also allow for greater control over graphics resources. Tiles being 64KB would make managing resources easier also.

And using a residency manager, tile priorities are constantly updated as the camera in a game moves. They would basically be streaming this stuff in. The fine details are things I don't understand, but I'm sure developers do. And I believe tiled resources and ESRAM residency control are among the planned Durango specific DX11.x extensions.

itsmydamnation · Apr 2, 2013

SenjutsuSage said:
What about a move engine moving the data from DDR3 to the SRAM? Or is it too tough to manage the data in such a fashion where you could, in more cases than not, usually make certain that the specific data you want to be accessed from the ESRAM is actually present inside before the data is needed? Or maybe the ESRAM is used in a much more limited, but still useful way, where you only use it for a very specific thing consistently.

The point here is there is no bandwidth multiplier effect like with a cache, all you are doing is trading SOC cost for latency. The move engine can move whatever data in and out of the RAM block as much as it wants via whatever method. The question is how much does that buy you in terms improving average performance. For traditional GPU loads im going to say sweet FA or GPU's would have been rocking much larger and more complex cache hierarchies for a long time. Modern compute shaders and GPGPU would seem to be a primary target but you will still have "traditional" workloads going on at the same time.

What I think you might find is that all other things being equal utilisation might not be that much higher but the time a complex shader takes to run could be significantly faster, that could help reduce frame times but not be picked up in generalised averages.

Grall · Apr 2, 2013

SenjutsuSage said:
You really think it's so inconceivable that Microsoft would do this

Like I said, it would be totally unique to put 32MB SRAM in a consumer device. That's not small potatoes. CPU cores in comparison share 1MB SRAM *per four cores*.

and that it would truly only have a minor performance difference if they did?

It certainly wouldn't have enormous benefits versus eDRAM, that's for sure. 1.5 billion transistors vs. 250 million, it's a big step.

It's like having L3 cache onboard

No, it's not like L3 cache at all unless you actually have caching hardware built-into the chip to make the memory array into a cache. Without that hardware, what you got is a scratchpad which you need to manually pre-load the memory in advance with everything you think you're going to need in the near future (taking in mind the maximum capacity of the memory, of course), juggling and re-arranging assets in a continuous fashion, and that's not very easy to manage on the fly. They both have different pros and cons, but in no way are they alike.

And when you say there will only be a minor performance difference, might you not be overlooking the fact that Microsoft isn't expecting it to carry the entire graphical load, or suddenly make a 1.2 TFLOP GPU perform like a 2 TFLOP part, but to just make certain crucial tasks much faster and cheaper, which when a dev takes a step back and looks at their overall efficiency gains may find it well worth it?

I think that sounds like handwavy fanboy talk TBH. None of what you typed up actually means anything. If a big SRAM array on a GPU was THE solution to efficiency and everything else, you would have seen it on PC video cards ages ago. It's just not that easy, it never is.

That an on-chip memory array can be very useful isn't in question, PS2, Gamecube and also the 360 showed there is merit in such a system - I just doubt it will be SRAM, due to the obvious costs reasons.

SenjutsuSage · Apr 2, 2013

Yea, my cache reference was pretty off, which I suspected, but I figured I might as well toss it out, because I might learn something in the process. Very interesting. So, if I'm understanding you correctly, there may not be all that much difference in general or more traditional workloads, but more complex shaders could be run significantly faster, which may be nice, but no matter what, you still come back around to the more plentiful traditional workloads, which may bring things back down to earth?

Forgive me if I mangled what you intended to say. Either way, you definitely know your stuff.

Yea, Grail, I suspected the cache reference might have been off, and sure it enough it is. That said, isn't the reason it hasn't happened on PC simply because it just isn't as practical in a pc environment where developers can't focus on just that design? In such a case isn't it just easier for AMD and Nvidia to keep doing things as they have been? Not saying SRAM has to be the solution to everything, but might it not be a good thing for developers to have in some cases, perhaps more so than EDRAM? Also, if costs of a pool of SRAM this large are a serious concern, why would AMD or Nvidia, for example, be so quick to want to throw it on their GPUs even if it was helpful, because seeing as how they are dealing with desktop components, they would likely have to somehow find a way to couple all that same versatile SRAM on top of all the GDDR5 or GDDR6 they'll be using in the future, which would make things even more expensive, don't you think? It might be too drastic a change to make in the PC space for now, because of the complexities involved. I suppose it would be easier to get away with on a console.

Shifty Geezer · Apr 2, 2013

Grall said:
It certainly wouldn't have enormous benefits versus eDRAM, that's for sure. 1.5 billion transistors vs. 250 million, it's a big step.

This is the most poignant comparison. Ms could have had 64 MB eDRAM instead for less cost. But there are multiple interpretations. 1) It's not SRAM. 2) The lower latency helps significantly enough to justify the expense. 3) eDRAM would have limited manufacturing to costly processes, costing more in the long run. 4) A mix of two and three, the increased costs versus decreased benefits of eDRAM making the massive SRAM a better choice.

It's hard to see in the numbers we have what the gains of SRAM are though.

Gubbi · Apr 2, 2013

TSMC SRAM cell size for use with logic is just 0.155um² /bit in 28nm, - 1.3 mm² per megabyte. That's 42 mm² for the SRAM cells alone in a 32MB array. Even when you add overhead for buses and power I can't imagine the ESRAM to be bigger than 65mm².

The ESRAM is there for the added bandwidth, it adds cost to the SOC but it enables savings on the order of $40-50 on memory per unit sold. The lower latency is just a bonus.

In future shrinks, MS might be able to use eDRAM with a SRAM interface instead, as well as replace DDR3 with a narrower DDR4 memory subsystem.

I see the ESRAM as a cost reduction feature rather than a performance enhancing one.

Cheers

Shifty Geezer · Apr 2, 2013

Gubbi said:
TSMC SRAM cell size for use with logic is just 0.155um² /bit in 28nm, - 1.3 mm² per megabyte. That's 42 mm² for the SRAM cells alone in a 32MB array. Even when you add overhead for buses and power I can't imagine the ESRAM to be bigger than 65mm².

Thanks. Where do you think choice of eDRAM would add costs? Finding a fabricator and suitable process? Or rather, why would MS choose SRAM over DRAM?

Gubbi · Apr 2, 2013

Shifty Geezer said:
Thanks. Where do you think choice of eDRAM would add costs? Finding a fabricator and suitable process? Or rather, why would MS choose SRAM over DRAM?

It might be one or more of these reasons:
1. eDRAM adds extra steps to manufacturing, so the savings might not be as big as the reduction in die size would indicate (ie. each die takes longer to manufacture and yield is impacted negatively).
2. Optimizing a process for eDRAM might do so at the expense of logic performance.
3. eDRAM might not even be an option on the chosen process.

MS will make the change when it is technically feasible and economically favourable to do so. It might never happen. The 360's eDRAM was never integrated.

Cheers

dobwal · Apr 2, 2013

Here are some issues with eDRAM on chip.

http://www.realworldtech.com/iedm-2010/3/

The overall arrays are roughly 2-4X denser than SRAM (cell size is roughly 4-6X smaller), with 2-3 orders of magnitude improvement in SER. Additionally, there is a slight decrease in active power and a substantial drop in standby power.

However, the capacitor must refresh periodically to retain the data and the access time is substantially slower than SRAM. The high access times are one reason why eDRAM is suitable for large arrays (e.g. a last level cache), since it cannot scale down to the sub-nanosecond times required for high performance SRAMs. The last drawback is perhaps the most problematic – IBM’s eDRAM requires several changes and addition manufacturing steps, including the formation of a deep trench for the capacitor (shown in Figure 3), which impacts cost/yield. While this is hard to quantitatively estimate, the costs are clearly enough of an issue that AMD has forgone using eDRAM. However, IBM’s economics are very different and they can clearly justify the extra silicon cost to reduce overall system cost.

Click to expand...

http://www.gsaglobal.org/events/2012/0416/docs/Iyer_pres.pdf

Slide 13 provides a graph for the break even point of swapping eDRAM for SRAM based on node and the sram replacement area.

Slide 8 is pretty interesting as IBM states that 64mb eDRAM is actually faster than 64mb of SRAM in terms of latency.

liolio · Apr 2, 2013

DrJay24 said:
32-bit performance is 100% higher than 64-bit for GCN, sounds like they are quoting 64-bit performance on Orbis and 32-bit for Durango. They are both GCN parts.

None of that shit (the post you are quoting)makes any sense to me at least.
I would advice anybody to read a presentation about GCN on a reputable website (anandtech or techreport comes to mind).

the whole talk about threads is non sense, most likely from somebody that doesn't get the difference between wavefront, the logical width of those wavefront (64 elements on GCN) the physical width of the SIMD executing those wavefront (16 on GCN, four SIMD per CUs in GCN), the number of threads a CU can keep track of ( Iirc 40, 4 being active at a time), the number of threads the GPU as whole deal with, etc.
Shortly not a word make sense in that post, it is plain fanboy BS at its worse lol

Cjail · Apr 2, 2013

@B Real

That post you quoted is actually form Misterxmedia's blog.
Nothing coming from that blog is worthy of consideration.

3dilettante · Apr 2, 2013

It looks like it's a forum repost from semiaccurate of that blog, since it looks like there's a username pasted in, so it's more indirect.

I would suggest, as a matter of personal preference, that we not source forum posts from there (or anywhere a lot of the time). Most of the posters I'd lend credence to are ones that are also members here.

dobwal · Apr 2, 2013

(((interference))) said:
I've asked if there was any truth to speculation that Durango is designed for TBDR and have been told it's 'wishful thinking'.

Could you ask your source if Durango is a tile rendering design in general? TBDR (correct me if I am wrong) is not an all encompassing term. PowerVR uses TBDR but Mali, Tegra and Adreno (used to be ATI's Imageon line) don't use that design. In fact I've seen them referred to as TBIMR, a hybrid solution.

Even the documents presented by vgleaks hints or makes reference to what could be construed as a tile based design.

http://www.vgleaks.com/durango-gpu-2/3/

Access to these caches is faster than access to ESRAM. For this reason, the peak GPU pixel rate can be larger than what raw memory throughput would indicate. The caches are not large enough, however, to fit entire render targets. Therefore, rendering that is localized to a particular area of the screen is more efficient than scattered rendering.

SenjutsuSage · Apr 2, 2013

The display planes supporting up to 4 image rectangles each, along with each Move Engine supporting tiling and untiling would seem to indicate this as well, wouldn't it?

bkilian · Apr 2, 2013

SenjutsuSage said:
The display planes supporting up to 4 image rectangles each, along with each Move Engine supporting tiling and untiling would seem to indicate this as well, wouldn't it?

Tiling and untiling is just a reference to texture swizzling like this and has nothing to do with tile based rendering. It's purely a way to reorder data in a texture to make it faster to process. The reported rectangles in the display planes did not seem to be related to that either, the supposedly leaked docs mentioned they were a way to reduce bandwidth requirements by only needing to draw parts of the screen.

B Real · Apr 3, 2013

I'll return to the shadows to lurk then, sorry guys

Grall · Apr 3, 2013

dobwal said:
if Durango is a tile rendering design in general?

NO!

How could it be, when it's based on a bog standard AMD APU, with some tweaks here and there?

AMD doesn't have any TBDR tech lying about that can be implemented with a minimum of fuss at a cheap price. Nobody has, except for imgtech, and they're not involved.

SenjutsuSage · Apr 3, 2013

bkilian said:
Tiling and untiling is just a reference to texture swizzling like this and has nothing to do with tile based rendering. It's purely a way to reorder data in a texture to make it faster to process. The reported rectangles in the display planes did not seem to be related to that either, the supposedly leaked docs mentioned they were a way to reduce bandwidth requirements by only needing to draw parts of the screen.

Ahh, okay, I stand corrected. I find the easiest way to learn stuff is to just say exactly what I'm thinking, which is usually wrong, and then someone comes with the correct answer lol.

(((interference))) · Apr 3, 2013

dobwal said:
Could you ask your source if Durango is a tile rendering design in general? TBDR (correct me if I am wrong) is not an all encompassing term. PowerVR uses TBDR but Mali, Tegra and Adreno (used to be ATI's Imageon line) don't use that design. In fact I've seen them referred to as TBIMR, a hybrid solution.

Even the documents presented by vgleaks hints or makes reference to what could be construed as a tile based design.

http://www.vgleaks.com/durango-gpu-2/3/

No, that denial wasn't meant to be so specific - It's not designed for tile based rendering in general.

DrJay24 · Apr 3, 2013

jaysherman said:
Durango: 12 sc's that can do 64 64-bit operations or 64 32-bit operations(similar to ALU function of orbis, but more flexible).

That isn't possible with GCN. Does anyone make a full speed double precision GPU?

Xbox One (Durango) Technical hardware investigation

SenjutsuSage

itsmydamnation

Grall

Invisible Member

SenjutsuSage

Shifty Geezer

uber-Troll!

Gubbi

Shifty Geezer

uber-Troll!

Gubbi

dobwal

liolio

Aquoiboniste

Cjail

Fool

3dilettante

dobwal

SenjutsuSage

bkilian

B Real

Grall

Invisible Member

SenjutsuSage

(((interference)))

DrJay24

Similar threads