Digital Foundry Article Technical Discussion Archive [2013]

Like the Intel Haswell-E?

I was thinking of the more high-end space, the kind of area where a company wouldn't mind shelling out a large amount of cash for a device that performs ~2x as well.

Also, isn't the main purpose of the Intel Haswell-E's 128MB cache to make up for the anaemic bandwidth of the DDR3?
 

I think you can only get the "2x" performance on a closed system when you control the fixed hardware platform and the API.
 

It wouldn't be that hard to write an API for the eSRAM, and with these systems you decide everything that runs on them, so you should still be able to get the "2x" performance on such a system.

If you assume an eSRAM API and a fixed hardware profile (i.e. you buy what you want and write to it), then it's essentially the same as a closed platform. You just don't decide everything to do with the hardware.

And yet even the supercomputers employing the same GPUs don't use it.
 

The near-infinite combinations of PC hardware, and the requirement from hardcore PC gamers that their card perform well in every game, make this very difficult or even impossible. You'd have to write an API and drivers that could cater to every game ever built (or at least those from the last 5 years), rather than techniques tailored to a specific closed system like a console.

Plus, with the expense of eSRAM in transistor count and die size, it's not something that can trickle down to lower product families the way these manufacturers need it to in order to stay profitable. An esoteric design like the XB1's doesn't work on a number of levels for a mainstream manufacturer like AMD or Nvidia; I think that's pretty obvious.
 

I'm not talking about games or consumer hardware; I'm talking about HPC, where you write code that's low level and specific to the hardware.
 
I think you can only get the "2x" performance on a closed system when you control the fixed hardware platform and the API.
I'm more and more wary of that statement; I'm close to thinking it is not worded properly, most likely for convenience (not that it's a plot of epic proportions... :LOL:).
I read it more and more like: "you can look the same, and sometimes better, with half the hardware".

That is quite different, to me, from stating that you get twice the performance. It's not as if games in the PC realm go unoptimized and driver teams let the hardware run at half its potential.

I guess there are real optimizations, but I think the main advantage is that console games are fine-tuned further than what is doable on PC. One of the main restrictions in the PC realm is that you can't choose the resolution of various render targets and then upscale them.
If you render transparency at half your framebuffer resolution and do the same for various render targets (or for shadows), you don't exactly get "twice the performance"; it's more that you do half the work, or less, on some quite relevant stages of your frame.
I think the same applies to texture filtering: you can apply better filtering "where it shows".

Though I would not call that "optimization", or low-level optimization, or the benefit of a closed platform; it's more that the APIs used in the PC realm don't allow for that level of fine-tuning (I guess because it is not really needed).
Again, that is different from saying that low-level optimization on a closed platform makes a major difference in performance. It is not a ceteris paribus comparison.

If a driver team were to cheat and use lower resolution for some render targets and more selective texture filtering, for example, it would be interesting to see how a conservative PC set-up fares against the current gen; then you would see the benefit of lower-level access to the hardware and real low-level optimizations. I think it would get nowhere near 2x.

Now it seems that MSFT introduced something in their latest DirectX that allows the framebuffer to be scaled dynamically (it is not clear to me whether it is the framebuffer alone or all the matching render targets). That's still not on par with consoles, but it is really more a matter of them choosing to give the option to PC developers than a limitation of the API itself or an advantage of closed platforms and low-level hardware access. Say MSFT wants more PC games on Windows RT (so most likely underpowered systems versus even a conservative low-end gaming rig); they may decide to give developers a bigger bag of tricks.
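
To put rough numbers on the half-resolution point above, here is a quick back-of-the-envelope sketch (Python; the per-stage frame-time shares are made-up illustrative figures, not measurements from any game):
Code:
# Hypothetical share of frame time spent on each stage when everything runs
# at full resolution (illustrative figures only).
frame_share = {"opaque": 0.55, "transparency": 0.25, "post": 0.20}

# Console-style tuning: transparency/particles rendered to a half-width,
# half-height target and upscaled, i.e. 1/4 of the pixels for that stage.
scaled_cost = (frame_share["opaque"]
               + frame_share["transparency"] * 0.25
               + frame_share["post"])

print(f"relative frame cost: {scaled_cost:.2f} of the original")
print(f"effective speed-up:  {1 / scaled_cost:.2f}x")  # ~1.23x here, nowhere near 2x

A real saving on the stages you shrink, but not a blanket 2x, which is exactly the distinction being drawn.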
 
Yes, and as I said, there's no implication there that 'effective use of the eSRAM' isn't merely a reference to utilising the bandwidth advantage afforded by the eSRAM over the narrow 128-bit GDDR5 bus of the 7790. There's nothing in Dave's statement that says anything to me about eSRAM affording some kind of special performance advantage above the additional bandwidth it offers.

An interesting side note to this discussion, though, is that with that statement Dave has effectively admitted that the 7790 is a horribly unbalanced design, capable of being, in his words, "far outstripped" in performance by a weaker GPU with more memory bandwidth.

Clearly, though, that's only going to be the case in bandwidth-restricted scenarios. Where computational performance is the limiting factor, no amount of eSRAM will put the X1 GPU on par with a 7790.
I don't think he would agree with that at all :LOL:
The HD7790 is a great GPU. I used to think that a 192-bit bus would have done it some good, but Dave told us that he was actually one of the people pushing for a 128-bit bus only. Looking at reviews and how the card fares against an HD7850, one has to admit that it does quite well as it is (read: it runs most games at 1080p 30FPS nicely).
 
The context of the thread lends itself to Dave's pronouncement. No one said there would be an increase in computational TFLOP count. That was not the point of the DF article nor of my references, so I'm not sure what point it is you are trying to make.

I had assumed you were implying that eSRAM was offering some kind of efficiency advantage, over and above the additional bandwidth it affords the system, that would make it computationally superior to a 7790. It's easy to see how I came to that conclusion, to be fair, since you implied the X1 will have performance similar to a 7970, which has vastly more bandwidth than is afforded by the eSRAM in the X1.

If I misunderstood, though, then I apologise.

I do still stand by the assertion, though, that while the X1 will likely be faster in bandwidth-limited cases, it will not be as fast as a 7790 in non-bandwidth-limited cases. And such cases do exist. BF3, for example, is, according to the DF article, as fast on the 7790 as it is on a 7850 despite the 7850 having a 60% bandwidth advantage. Thus it stands to reason that while the 7790 may be bandwidth-limited in many games, in this one it is compute-limited, and thus so would be the X1; by extension, the X1 would be slower in BF3 despite its bandwidth advantage.
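
As a quick sanity check on that reasoning, using the usual published specs for the two cards (treat the exact figures as assumptions):
Code:
# HD7790: 896 SPs @ 1.00 GHz, 128-bit GDDR5 @ 6.0 Gbps
# HD7850: 1024 SPs @ 0.86 GHz, 256-bit GDDR5 @ 4.8 Gbps
hd7790 = {"tflops": 1.79, "bw_gbs": 96.0}
hd7850 = {"tflops": 1.76, "bw_gbs": 153.6}

print(f"7850 bandwidth advantage: {hd7850['bw_gbs'] / hd7790['bw_gbs'] - 1:.0%}")   # ~60%
print(f"7850 compute advantage:   {hd7850['tflops'] / hd7790['tflops'] - 1:.0%}")   # ~-2%
# If the two cards tie in a given game despite the ~60% bandwidth gap, that game
# is plausibly compute-limited rather than bandwidth-limited on the 7790.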
 
The article also doesn't address the massive difference in ROPs (PS4 has 2x the ROPs), nor the difference in cache (PS4 has more cache per unit of work (not per CU) on the GPU if you spread a specific load over all the CUs), nor does it address the difference in texturing (PS4 has 1.5x the texture units of the XBONE).

I agree the article is missing many factors needed to make it a valid comparison of X1 and PS4 performance, but I do think Richard is aware of that and does try to point that out in the article. He does ignore the ROP and setup differences between the two consoles, which does invalidate the results a little IMO; however, TMUs are certainly accounted for since they are part of the CUs. Thus his target GPUs have the TMU delta of the consoles as well as the compute delta. I'm not sure what you mean by cache differences, but I'd assume the target GPUs also represent the cache differences correctly, unless there is some difference between the 7790 and 7850 cache structures that I'm not aware of. I'd just assumed they were identical apart from the obvious scaling up of total cache as CUs go up.
 
I don't think he would agree with that at all :LOL:
The HD7790 is a great GPU. I used to think that a 192-bit bus would have done it some good, but Dave told us that he was actually one of the people pushing for a 128-bit bus only. Looking at reviews and how the card fares against an HD7850, one has to admit that it does quite well as it is (read: it runs most games at 1080p 30FPS nicely).

Yes, I probably worded that a little strongly :smile:

Not horribly unbalanced, but certainly bandwidth-starved, given that it computationally matches the 7850 in many areas but lags behind in performance due to the bandwidth disadvantage. In fact it has between 25% and 50% more core power than the X1 GPU (depending on which aspect you measure), and yet Dave claims it would be far outstripped in performance by the X1 GPU when it fully uses its available eSRAM bandwidth. That therefore suggests that you're giving up at least 25-50% of the GPU's potential performance because of bandwidth starvation. As I noted above, though, bandwidth starvation doesn't apply in every circumstance, so I expect Dave's remark was aimed more at situations where bandwidth is a factor, rather than suggesting an across-the-board performance differential.

You can, however, see why that is perfectly acceptable from a product point of view, since if the 7790 weren't bandwidth-starved it would cannibalize 7850 sales. So the way I see it, they had all these fully working Durango dies that could be clocked up to 1GHz which they needed to sell, so they slapped a restrictive memory bus on them so as not to cannibalize higher-end sales, and sold it as the mid-range 7790.
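
For what it's worth, a rough check of the 25-50% "core power" figure above, using the GPU numbers commonly cited in 2013 (12 CUs at 800MHz for the X1 GPU, 14 CUs at 1GHz for the 7790; assumptions rather than official spec sheets):
Code:
hd7790 = {"cus": 14, "clock_ghz": 1.0}
x1_gpu = {"cus": 12, "clock_ghz": 0.8}

def gflops(gpu):
    # GCN: 64 lanes per CU, 2 ops (FMA) per lane per clock
    return gpu["cus"] * 64 * 2 * gpu["clock_ghz"]

def gtexels(gpu):
    # 4 texture units per CU
    return gpu["cus"] * 4 * gpu["clock_ghz"]

print(f"clock advantage:   {hd7790['clock_ghz'] / x1_gpu['clock_ghz'] - 1:.0%}")  # 25%
print(f"ALU advantage:     {gflops(hd7790) / gflops(x1_gpu) - 1:.0%}")            # ~46%
print(f"texture advantage: {gtexels(hd7790) / gtexels(x1_gpu) - 1:.0%}")          # ~46%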
 
I agree the article is missing many factors needed to make it a valid comparison of X1 and PS4 performance, but I do think Richard is aware of that and does try to point that out in the article. He does ignore the ROP and setup differences between the two consoles, which does invalidate the results a little IMO; however, TMUs are certainly accounted for since they are part of the CUs. Thus his target GPUs have the TMU delta of the consoles as well as the compute delta. I'm not sure what you mean by cache differences, but I'd assume the target GPUs also represent the cache differences correctly, unless there is some difference between the 7790 and 7850 cache structures that I'm not aware of. I'd just assumed they were identical apart from the obvious scaling up of total cache as CUs go up.

What I'm getting at is that even though he represented the delta between the GPUs correctly, the comparison will show smaller differences in texturing and cache than the actual consoles would, because he is using cards with so much higher performance that they are unlikely to be as bound by texturing or cache spills etc. as the actual consoles, which have the same delta but comparatively fewer of these resources.

Whilst the comparison is not without merit, IMO it still lacks a lot.
 
You've *successfully* mapped all of the possible performance numbers into a couple of them at most.

Reducing the theoretical performance to two factors, bandwidth and FLOPS, is certainly wrong in my eyes, since I think it's a pretty narrow-sighted view of the actual numbers.

I should have been clearer. When I spoke of computational performance in my post above, I was effectively talking about every aspect of the GPU that isn't memory bandwidth. The fact is that the 7790 is faster than the X1 GPU in every way apart from memory bandwidth (on account of it being basically the same GPU with a 25% higher clock speed and 2 extra CUs). Therefore it's fair to say, IMO, that if the X1 GPU does outperform it, then that's entirely down to bandwidth limitations. Either that, or Dave was just referring to console performance amplification, but that seems a little too obvious for a man of his calibre in the context of the discussion his statement was made in.

Why is Xbox One's GPU weak? :???: There is nothing weak about it, it's going to be utilized almost fully, I am sure.

....

After reading sebbbi's posts about FLOPS and the scratchpad memory, and having watched the Xbox One games in action, all I can say is that the console is a monster, performance-wise.

I didn't say it was weak. It is what it is. It's clearly not aimed at the high-end performance bracket, but it's still going to handle all next-gen console games just fine. Calling it a monster is a bit of a stretch, though. I recall Xenos being referred to as a "shader monster" back in its pre-launch days, and that was certainly justified given that it sported vastly more shader power than the commercially available PC GPUs of its time (before its launch). However, the same claim cannot be applied to any aspect of the X1.

Another question: if bandwidth alone is so important, what's the point of having 32MB of eSRAM instead of using a 256-bit or 512-bit bus, even if they utilized DDR3 memory? ;)

Bandwidth is important in combination with everything else, not on its own. However, the inclusion of the eSRAM is clearly there for the very reason that bandwidth is important. The X1 already has a 256-bit bus to its main memory, and 512-bit would have been prohibitively expensive. Thus the eSRAM was added to expand the overall available bandwidth.
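
The arithmetic behind that, using the commonly reported bus width and data rate (assumptions, not confirmed figures):
Code:
def bus_gbs(width_bits, data_rate_mtps):
    # peak bandwidth of a DRAM bus in GB/s
    return width_bits / 8 * data_rate_mtps / 1000.0

ddr3_256 = bus_gbs(256, 2133)   # ~68.3 GB/s, the reported X1 main memory setup
ddr3_512 = bus_gbs(512, 2133)   # ~136.5 GB/s, but roughly twice the pins/traces
esram    = 102.4                # reported eSRAM figure (one direction)

print(f"256-bit DDR3-2133:     {ddr3_256:6.1f} GB/s")
print(f"512-bit DDR3-2133:     {ddr3_512:6.1f} GB/s")
print(f"256-bit DDR3 + eSRAM:  {ddr3_256 + esram:6.1f} GB/s combined peak")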
 
What I'm getting at is that even though he represented the delta between the GPUs correctly, the comparison will show smaller differences in texturing and cache than the actual consoles would, because he is using cards with so much higher performance that they are unlikely to be as bound by texturing or cache spills etc. as the actual consoles, which have the same delta but comparatively fewer of these resources.

Whilst the comparison is not without merit, IMO it still lacks a lot.

X1 has multiple times more cache (eSRAM) than a 7870 XT or a 7850...
 
X1 has multiple times more cache (eSRAM) than a 7870 XT or a 7850...
I don't think the eSRAM is a cache in the traditional sense of the word.
Exactly. It's a second memory pool of limited size, managed by software. And it doesn't have more bandwidth than the memory interface of an HD7870. It's the supposedly cheaper version, putting the burden of splitting the accesses between the two memory pools (DDR3 or eSRAM) and of managing the contents of the eSRAM on the devs. A cache wouldn't need the dev to do anything, but would require significantly more effort on the hardware side.

My take on the eSRAM in the XB1 is that MS can hope it will significantly limit the advantage the PS4 gets from having twice the number of ROPs, especially as a lot of games will work heavily with 64-bit (4xFP16) pixel formats. My guess, for at least the first round of games, would be that the devs will try to use the eSRAM almost exclusively for the render target(s) *). The eSRAM will provide enough bandwidth for all ROP operations while the texturing reads from the DDR3. It could be that the ROP colour caches are slightly too small to hide the memory latency of 64-bit blending operations in fillrate-limited situations (fillrate tests) in the case of the PS4, so the XB1 may be able to pull even with half the ROPs (assuming they can do full-speed 64-bit blending, which is somewhat implied in one passage of the documentation but is usually not visible in fillrate tests; furthermore, MS's alleged comment to devs about achieving more than 102.4GB/s with 64-bit blending operations also points in exactly this direction). That means the often-measured half-rate blending with 4xFP16 render targets (in cases not limited by memory speed, which requires a downclocked Tahiti with full memory speed to be sure) may be just an effect of the limited latency-hiding capacity caused by the small ROP caches, and may apply only to the PS4, not the XB1.
But frankly, the argument includes quite a few conjectures.

*):
One may also think about other uses if the render target doesn't fill the eSRAM (a simple 8 bytes/pixel render target would leave about 16 MB free for other uses, but how probable is this?). A small pool for PRT, where misses trigger the transfer of tiles from DDR3 to the eSRAM (the move engines could be put to some use here), comes to mind, especially as MS had a demo using just 16MB of RAM for PRT. Or some other smallish buffers used for low-latency data exchange, or something in that direction (if it is very small, one might just use the 64kB GDS for this purpose, or count on the L2). But I would guess the devs will look into such options only after getting the first games out of the door.
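
To put rough numbers on that conjecture (clocks, ROP counts and bandwidth figures are the commonly cited 2013 ones; full-rate 4xFP16 blending and fully overlapped eSRAM reads and writes are exactly the conjectures above, not established facts):
Code:
BYTES_PER_PIXEL = 8   # 4 x FP16 render target
RW_PER_BLEND    = 2   # blending reads the destination pixel and writes it back

def blend_bw_needed_gbs(rops, clock_ghz):
    # bandwidth needed to keep every ROP blending one 64-bit pixel per clock
    return rops * clock_ghz * BYTES_PER_PIXEL * RW_PER_BLEND

xb1_need = blend_bw_needed_gbs(rops=16, clock_ghz=0.8)   # 204.8 GB/s
ps4_need = blend_bw_needed_gbs(rops=32, clock_ghz=0.8)   # 409.6 GB/s

xb1_esram_rw = 102.4 * 2   # only if eSRAM reads and writes really can overlap
ps4_gddr5    = 176.0

print(f"XB1 needs {xb1_need:.1f} GB/s; eSRAM could supply ~{xb1_esram_rw:.1f} GB/s")
print(f"PS4 needs {ps4_need:.1f} GB/s; GDDR5 supplies {ps4_gddr5:.1f} GB/s")
print(f"PS4 bandwidth-limited blend rate: {ps4_gddr5 / (BYTES_PER_PIXEL * RW_PER_BLEND):.1f} Gpix/s "
      f"vs XB1 ROP peak {16 * 0.8:.1f} Gpix/s")

That is the scenario where the XB1 could pull even with half the ROPs; where blending isn't the limiter, this arithmetic says nothing.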
 
Wouldn't a real cache be physically huge for 32MB, I mean something like 150 mm2 or more?
Or can it be a "cheaper" implementation than what we see on CPUs, but still be functionally a cache?
 
Of course we don't know specifically how the eSRAM will be used. Does the fact that MS is using data from profiling certain 360 games point one way or the other as to how the eSRAM will be used?

Using profile information to make the hardware good at the tasks it will be running is pretty much what is done and should be done as much as possible. Until the details are released or we see an account of what conclusions they drew from the data, it doesn't point to any specific conclusion.


Wouldn't a real cache be physically huge for 32MB, I mean something like 150 mm2 or more?
A cache is defined by what it does, not what it is composed of. A significant number of choices can influence the physical dimensions of a memory pool. There are elements of a cache, such as the tags and the additional control logic in the interface, that can add area, but the dominant factor at that capacity is the set of decisions made for the storage arrays, which would be present even if it weren't a cache.

Or can it be a "cheaper" implementation than what we see on CPUs, but still be functionally a cache?
If it can automatically host copies of data present in RAM and add/discard them without software intervention, that's been good enough to call it a cache.
The tiny non-coherent and read-only texture caches of the older generation of GPUs prior to Fermi and GCN were called caches.
 
Of course we don't know specifically how the eSRAM will be used.
We pretty much know it is a second memory pool explicitly managed by the software, i.e. not a cache. The devs have to decide what to put where; nothing is done automatically (beyond the fact that shader code is agnostic regarding physical location, i.e. a shader program doesn't have to know where a memory location it accesses physically resides; memory accesses are routed automatically to the right memory pool using a page table [or, in the simplest version, some aperture]).
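
A toy software model of the distinction, purely for illustration (none of the sizes or policies correspond to any real hardware): the scratchpad needs the developer to schedule the copies, while the cache fills itself on a miss at the cost of tags and eviction logic in hardware.
Code:
RAM = {addr: addr * 2 for addr in range(1024)}   # pretend main memory

class Scratchpad:
    """Second memory pool: software decides what lives here and when."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}                  # addr -> value, filled explicitly

    def upload(self, addrs):             # the *developer* schedules this copy
        addrs = list(addrs)
        assert len(addrs) <= self.capacity
        self.store = {a: RAM[a] for a in addrs}

    def read(self, addr):
        return self.store[addr]          # a miss here is a programming error

class Cache:
    """Same capacity, but it fills and evicts itself on demand."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}
        self.misses = 0

    def read(self, addr):
        if addr not in self.store:       # automatic fill on miss
            self.misses += 1
            if len(self.store) >= self.capacity:
                self.store.pop(next(iter(self.store)))   # naive eviction
            self.store[addr] = RAM[addr]
        return self.store[addr]

sp = Scratchpad(32)
sp.upload(range(32))                     # e.g. "put the render target here"
print(sp.read(5))

c = Cache(32)
print(c.read(5), c.misses)               # transparently fetched from RAM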
 