Xbox One (Durango) Technical hardware investigation

What about supercomputers? If SRAM can make a 7770 run like a 680gtx (the original point Love_In_Rio is defending), why isn't nVidia adding it to its Tesla range? Or are only MS able to see the benefits and nVidia's going to be kicking itself when they see the incredible performance Durango gets with such cheap silicon?

Not my point.

I was just pointing out that Nvidia and AMD have to worry about compatibility and legacy engines, things that Sony and MS don't have to worry about.
 
Nothing ridiculous. Desktop products have to deal with ancient engines that aren't latency sensitive, and they have to beat the competitor in benchmarks, so more ALUs is the better choice.

And how is that relevant to adding a 32 MB eSRAM cache to Nvidia's professional range of GPUs? Or, for that matter, I'd be interested to know how you think it would hobble modern gaming performance. Bear in mind that Haswell GT3 is incorporating something similar to boost current gaming performance.

Your whole 3x performance speculation in the first post completely ignored the following paragraphs in the article you linked, which described how modern GPUs get around the issue with low-latency caches.

It's not that caches are a perfect solution, but it's also not as if they don't exist at all, which is what your speculation suggests. The eSRAM will probably result in some efficiency increases, as the guys above are suggesting, but to suggest it's going to make 800 MHz CUs behave like 2400 MHz CUs is completely absurd.
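
For what it's worth, a quick back-of-the-envelope with invented but plausible numbers shows why the raw DRAM figure overstates the gain. If, say, 80% of accesses are served by the existing caches at ~20 cycles and only the misses pay the ~300-cycle trip to GDDR, the average comes out nowhere near 300:

average latency ≈ 0.8 × 20 + 0.2 × 300 = 76 cycles

and a good chunk of even that gets hidden by the other wavefronts in flight. The hit rate and latencies here are purely illustrative, not real GCN or Durango figures.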
 
Not my point.

I was just pointing out that Nvidia and AMD have to worry about compatibility and legacy engines, things that Sony and MS don't have to worry about.
Which isn't my point. ;) I don't disagree with you, but the premise that adding SRAM can bring such a marked improvement, and can happen without someone else implementing it (outside the PC space where things are limited by APIs), doesn't make sense. No-one knows GPGPU hardware utilisation better than nVidia. They sell to supercomputers. They'll know exactly where the bottlenecks are and what will alleviate them. Low-latency RAM may be something they can look into, but if a cache can turn a 7770 into a 680gtx in performance, how can nVidia not be aware of that and be pursuing it? Ergo the hope that the SRAM design can dramatically boost the GPU performance is unreasonable. Improve performance, sure, but not as crazily as some are hoping.
 
What about supercomputers? If SRAM can make a 7770 run like a 680gtx (the original point Love_In_Rio is defending), why isn't nVidia adding it to its Tesla range? Or are only MS able to see the benefits and nVidia's going to be kicking itself when they see the incredible performance Durango gets with such cheap silicon?

Real supercomputers have a lot of eSRAM and not many GPUs. I'm not sure how much money Nvidia really makes from that market, but maybe not enough to compete with homogeneous computing beasts or to design specific chips for it. There is a post from sebbi that nails this question; let me look for it.
Here it is:
http://forum.beyond3d.com/showpost.php?p=1654196&postcount=32

Except for the 4-way SMT, he is almost describing Durango.
 
No, I'm asking for an example where the performance of an existing title decreased across a GPU architecture change.

Adding something like embedded memory would be a good example, since it requires code support for any benefit, and it costs silicon that could be used for other things.
Most PC GPUs are optimized for the current popular titles; a friend ran some tests to determine "fast" paths on current GPUs, and his conclusion was that they are optimized to run Crysis 2.
The same would apply if you wanted to, say, increase register storage: if you get a bigger performance increase in current games by increasing ALU count instead, that's what you do.

Now I want to make it clear: I'm not saying embedded RAM is a panacea, or even that in Durango's case it's a significant win, or in fact the reverse.

On Durango I'd take the numbers we have at face value until there is some evidence otherwise (read: games). I also think that even seemingly significant on-paper differences won't necessarily translate to significant visual differences. I'd also be more worried by the reported ROP counts and overall bandwidth than by ALU counts.
 
What about supercomputers? If SRAM can make a 7770 run like a 680gtx (the original point Love_In_Rio is defending), why isn't nVidia adding it to its Tesla range? Or are only MS able to see the benefits and nVidia's going to be kicking itself when they see the incredible performance Durango gets with such cheap silicon?
Nvidia _does_ add low-latency SRAM to its products; the article a couple of pages back showed that the only card with sub-20-cycle access times in its caches was the Nvidia one. AMD had a 300-cycle minimum, even in its cache.

GPGPU is kinda new still, and I don't think manufacturers have followed it through all the way yet. You might start seeing GPUs with much larger high-speed caches coming out in the next few years. Remember also that while low-latency memory will help in certain workloads, if you're using 6T SRAM it's huge, and that space could be used for things that will increase utility across all workloads. It's a trade-off. If the GPU is doing compute on memory-bound operations that are random access, and impossible to cache in the small CU caches, then sure, eSRAM would dramatically speed up those calculations, by up to an order of magnitude depending on latencies (DRAM has latencies on the order of many hundreds of cycles). How often will the GPU be doing calculations that fit that specific use case? I don't think very often, and I bet it could be worked around algorithmically by doing things like structuring your data correctly.
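
To make the "structuring your data correctly" bit concrete, here's a toy CUDA sketch (hypothetical names and sizes, and obviously nothing to do with Durango's actual toolchain): the same per-element maths done over an array-of-structures layout, where every load drags in mostly unwanted bytes, and over a structure-of-arrays layout, where a warp's loads coalesce into a single contiguous cache line. Same arithmetic, roughly 4x difference in DRAM traffic.

[code]
// Toy illustration of data layout vs. memory traffic.
// Assumes any CUDA-capable PC GPU; names and sizes are made up.
#include <cstdio>
#include <cuda_runtime.h>

struct ParticleAoS { float x, y, z, mass; };   // 16 bytes, fields interleaved

// AoS: each thread reads one float out of a 16-byte struct, so a warp's 32
// loads span ~512 bytes and pull in several cache lines of unused data.
__global__ void scaleMassAoS(const ParticleAoS* p, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p[i].mass * 2.0f;
}

// SoA: the mass array is contiguous, so the same 32 loads coalesce into one
// 128-byte transaction that the caches can actually reuse.
__global__ void scaleMassSoA(const float* mass, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = mass[i] * 2.0f;
}

int main()
{
    const int n = 1 << 20;
    ParticleAoS* aos; float* mass; float* out;
    cudaMallocManaged(&aos,  n * sizeof(ParticleAoS));
    cudaMallocManaged(&mass, n * sizeof(float));
    cudaMallocManaged(&out,  n * sizeof(float));
    for (int i = 0; i < n; ++i) { aos[i] = {0.f, 0.f, 0.f, 1.f}; mass[i] = 1.f; }

    scaleMassAoS<<<(n + 255) / 256, 256>>>(aos, out, n);
    cudaDeviceSynchronize();
    printf("AoS: out[0] = %.1f\n", out[0]);

    scaleMassSoA<<<(n + 255) / 256, 256>>>(mass, out, n);
    cudaDeviceSynchronize();
    printf("SoA: out[0] = %.1f\n", out[0]);

    cudaFree(aos); cudaFree(mass); cudaFree(out);
    return 0;
}
[/code]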
 
And how is that relevant to adding a 32 MB eSRAM cache to Nvidia's professional range of GPUs? Or, for that matter, I'd be interested to know how you think it would hobble modern gaming performance. Bear in mind that Haswell GT3 is incorporating something similar to boost current gaming performance.

Your whole 3x performance speculation in the first post completely ignored the following paragraphs in the article you linked, which described how modern GPUs get around the issue with low-latency caches.

It's not that caches are a perfect solution, but it's also not as if they don't exist at all, which is what your speculation suggests. The eSRAM will probably result in some efficiency increases, as the guys above are suggesting, but to suggest it's going to make 800 MHz CUs behave like 2400 MHz CUs is completely absurd.
Did you see the graph in the article? 300 cycles of stalling, on average, when you go to GDDR for data. Do the maths.
 
Adding something like embedded memory would be a good example, since it requires code support for any benefit, and it costs silicon that could be used for other things.
Most PC GPUs are optimized for the current popular titles; a friend ran some tests to determine "fast" paths on current GPUs, and his conclusion was that they are optimized to run Crysis 2.
The same would apply if you wanted to, say, increase register storage: if you get a bigger performance increase in current games by increasing ALU count instead, that's what you do.

Now I want to make it clear: I'm not saying embedded RAM is a panacea, or even that in Durango's case it's a significant win, or in fact the reverse.

On Durango I'd take the numbers we have at face value until there is some evidence otherwise (read: games). I also think that even seemingly significant on-paper differences won't necessarily translate to significant visual differences. I'd also be more worried by the reported ROP counts and overall bandwidth than by ALU counts.

I understand the point about silicon that could be used for other things, but what I'm saying is that the transition from VLIW5 to GCN, or to the tessellation powerhouse Fermi, etc., didn't fundamentally change the individual execution cores so much that previous titles ran like crap. As others have pointed out many times, adding more cache/less latency up front would have been a rather obvious, low-hanging-fruit choice to make if it yielded so much speedup.
 
That's my point. There's no historical case of it happening, so we don't really have a reason to believe it's a credible fear, do we?

I seem to recall there were performance drops in some games going from the TNT 2 Ultra to the original GeForce 256. That was an awfully long time ago, though.
 
It could be a win in the console space without being a win in the GPGPU space. It might not benefit certain code enough to warrant it in the GPGPU space because of data sizes and the corresponding amount of SRAM needed, just as an example; that might not be an issue for a game. If MS had decided to use embedded memory, the extra cost of moving to SRAM might have made sense if they saw it as a significant compute advantage. I'm still not betting it's 6T SRAM; I'm just suggesting that the lack of inclusion on GPGPU PC parts doesn't necessarily exclude a benefit existing.
 
I understand the point about silicon that could be used for other things, but what I'm saying is that the transition from VLIW5 to GCN, or to the tessellation powerhouse Fermi, etc., didn't fundamentally change the individual execution cores so much that previous titles ran like crap. As others have pointed out many times, adding more cache/less latency up front would have been a rather obvious, low-hanging-fruit choice to make if it yielded so much speedup.

But everything is a tradeoff.
Nothing can be considered in isolation, even for seemingly fundamental things like VLIW5 -> GCN: if it was obviously a win to use vectors of scalars, why didn't they just do that?
The answer is always a combination of things: they may not have thought of it, the data they had on existing shader complexity in games may not have supported it, the additional scheduling logic or register pool resources may have been better spent on more VLIW5 ALUs, etc.
It's something of a vicious cycle on PC where games optimize for the hardware and the hardware optimizes for the games.

As was pointed out earlier, Nvidia does have lower-latency caches and as a result seems to get more out of its "flops".
 
I understand the point about silicon that could be used for other things, but what I'm saying is that the transition from VLIW5 to GCN, or to the tessellation powerhouse Fermi, etc., didn't fundamentally change the individual execution cores so much that previous titles ran like crap. As others have pointed out many times, adding more cache/less latency up front would have been a rather obvious, low-hanging-fruit choice to make if it yielded so much speedup.

But cost:performance ratio is a problem, no?

Adding 1-2 billion transistors just as a cache isn't truly helpful, especially if it mainly benefits compute, no?

And we are only talking about having a few CUs work on the eSRAM here, so it's not really viable for the whole GPU... From a hardware maker's perspective they could get similar or greater performance by dedicating that silicon budget to more CUs... But this is a different scenario, where the eSRAM is multipurpose.

*note: I don't know what I'm talking about, just guesstimating
 
It could be a win in the console space without being a win in the GPGPU space. It might not benefit certain code enough to warrant it in the GPGPU space because of data sizes and the corresponding amount of SRAM needed, just as an example; that might not be an issue for a game. If MS had decided to use embedded memory, the extra cost of moving to SRAM might have made sense if they saw it as a significant compute advantage. I'm still not betting it's 6T SRAM; I'm just suggesting that the lack of inclusion on GPGPU PC parts doesn't necessarily exclude a benefit existing.

Didn't we just go through a long discussion about how game tasks aren't latency sensitive but GPGPU is? :runaway:

Do we feel the eSRAM's main point is to compensate for bandwidth, latency, or both?
 
Didn't we just go through a long discussion about how game tasks aren't latency sensitive but GPGPU is? :runaway:

Do we feel the eSRAM's main point is to compensate for bandwidth, latency, or both?

Shaders like cube-map reflections are very latency sensitive. If you have heavy shaders like this mixed with other, lighter shaders, you can hide the latency of the misses by running threads corresponding to the lighter ones while waiting for the texture data. But if your engine is full of heavy shaders, you will hit a point at which your ALUs have no option but to wait for data from main RAM.
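
As a rough illustration (the numbers are invented, not Durango's): if a wavefront only has about 30 cycles of independent ALU work between dependent memory reads, then covering a ~300-cycle miss needs roughly

300 / 30 = 10 wavefronts in flight per SIMD

and heavy shaders that eat a lot of registers may not leave room for that many, which is exactly when the ALUs sit idle.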
 
Did you see the graph in the article? 300 cycles of stalling, on average, when you go to GDDR for data. Do the maths.

As I said, you're ignoring the impact of the caches. Are you always going straight to GDDR for data?

Because if you're not (which you aren't), then you can't simply compare the latency of the eSRAM to GDDR and conclude x times the performance based on that.
 
Yeah, developers will prefetch or even DMA a whole chunk to the cache while they work. If the access misses, another chunk will be loaded into the cache to avoid stalling for every access.

Best if the helper CUs have already prepared the data somewhere, ready for the regular CUs to use.
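
A rough sketch of that prefetch-while-you-work idea, using CUDA shared memory as a stand-in for a small on-chip scratchpad (this is not Durango's eSRAM API, the tile size is made up, and how much the loads actually overlap depends on the compiler and scheduler): the loads for the next tile are issued while the ALUs work on the tile that is already on chip, so the long trip to DRAM isn't paid as a stall on every access.

[code]
// Double-buffered tile staging: prefetch tile t+1 while computing on tile t.
// Assumes a CUDA-capable PC GPU; shared memory stands in for an on-chip
// scratchpad, and all names/sizes here are hypothetical.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 256;   // elements per staged tile; blockDim.x must equal TILE

__global__ void neighbourAvgTiled(const float* in, float* out, int tilesPerBlock)
{
    __shared__ float buf[2][TILE];                 // one tile in use, one in flight
    int base = blockIdx.x * tilesPerBlock * TILE;  // this block's slice of the input

    buf[0][threadIdx.x] = in[base + threadIdx.x];  // preload tile 0
    __syncthreads();

    for (int t = 0; t < tilesPerBlock; ++t) {
        int cur = t & 1, nxt = cur ^ 1;

        // Issue the loads for the NEXT tile first, so they can be in flight...
        if (t + 1 < tilesPerBlock)
            buf[nxt][threadIdx.x] = in[base + (t + 1) * TILE + threadIdx.x];

        // ...while the ALU work runs on the CURRENT, already-staged tile
        // (a neighbour average within the tile, as a stand-in for real work).
        float a = buf[cur][threadIdx.x];
        float b = buf[cur][(threadIdx.x + 1) % TILE];
        out[base + t * TILE + threadIdx.x] = 0.5f * (a + b);

        __syncthreads();   // next tile has fully landed before its buffer is reused
    }
}

int main()
{
    const int blocks = 64, tilesPerBlock = 16;
    const int n = blocks * tilesPerBlock * TILE;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);

    neighbourAvgTiled<<<blocks, TILE>>>(in, out, tilesPerBlock);
    cudaDeviceSynchronize();
    printf("out[0] = %.1f (expect 0.5)\n", out[0]);

    cudaFree(in); cudaFree(out);
    return 0;
}
[/code]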
 