Does Cell Have Any Other Advantages Over XCPU Other Than FLOPS?

Kryton said:
Perhaps, but it is not their speed that matters; it is loading the SSE values that takes time. Why these overwrite the FPU stack entries I will never know, many are the mysteries of x86 (and why it still exists :devilish: )
Because you are wrong about that.

SSE added 8 new 128-bit registers that required operating system support, so that the registers get saved on a context switch. SSE can be used at the same time as x87 (or MMX/3DNow) code with no switching penalty, because the register files are independent (you don't switch modes when using SSE instructions). In x86_64 there are 16 128-bit registers.

The 8 64-bit MMX/3DNow registers were aliased onto the floating point stack, so no changes were required to operating system code. It was up to the program to ensure that the state of the FPU is correct, and the instructions to change and save that state (such as EMMS) are really expensive.
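
To make the difference concrete, here's a minimal C sketch (GCC-style intrinsics assumed, built with -mmmx -msse; the names are just for illustration). The SSE work in the independent XMM registers mixes freely with ordinary float/x87 maths, while the MMX work in the aliased MM registers needs an explicit EMMS (_mm_empty) before the FPU is touched again:

Code:
#include <stdio.h>
#include <mmintrin.h>   /* MMX intrinsics - MM0-MM7 alias the x87 stack */
#include <xmmintrin.h>  /* SSE intrinsics - the independent XMM registers */

int main(void)
{
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {5.0f, 6.0f, 7.0f, 8.0f};
    float r[4];

    /* SSE: lives in the XMM register file, so it can be interleaved
       with plain x87/float code - no mode switch, no penalty. */
    _mm_storeu_ps(r, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    double plain_fpu = 1.5 * 2.5;       /* ordinary FPU work, unaffected */

    /* MMX: uses MM0-MM7, which alias the x87 register stack, so the
       program must clear that state before doing FPU work again. */
    __m64 m = _mm_add_pi16(_mm_set1_pi16(3), _mm_set1_pi16(4));
    int mmx_low = _mm_cvtsi64_si32(m);  /* pull a result out first */
    _mm_empty();                        /* EMMS - the expensive reset */

    printf("%.1f %.1f %d\n", r[0], plain_fpu, mmx_low & 0xffff);
    return 0;
}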
 
Jawed said:
It's a guess.

In conventional PC GPUs lots of data compression is used to improve the efficacy of memory accesses - 32GB/s of compressed bandwidth might equate to the use of 128GB/s if compression weren't used (I dunno, just a guess.)

Xenos uses a mixed compressed (GPU<->EDRAM, 32GB/s) and uncompressed (ROPs<->EDRAM, 256GB/s) scheme, which makes it fiddly to characterise its effective bandwidth for framebuffer operations in terms that are comparable with RSX (or PC GPUs).

In other words, because the EDRAM uses an uncompressed format, it's not fair to say that it's directly comparable to the bandwidth available to other GPUs, as they use compression.

ATI dropped compression because doing so lowers the transistor count (presumably quite heavily), and with the RAM being embedded the performance will be fantastic anyway.

Jawed

I thought that in PC graphics cards only the transfers of data are compressed, but the framebuffer isn't compressed while you are working on it (AA, HDR, another rendering pass, etc.).
 
Apoc said:
I thought that in PC graphics cards only the transfers of data are compressed, but the framebuffer isn't compressed while you are working on it (AA, HDR, another rendering pass, etc.).

Can you count the frame buffer like this?

If every frame takes 60MB, and there are 60 frames per second, it would take 3.6GB/s of bandwidth?
But the eDRAM only holds 10MB, and if you use tile rendering it can take 40MB? So it actually only frees 2.4GB/s?
 
weaksauce said:
Can you count the frame buffer like this?

If every frame takes 60MB, and there are 60 frames per second, it would take 3.6GB/s of bandwidth?
But the eDRAM only holds 10MB, and if you use tile rendering it can take 40MB? So it actually only frees 2.4GB/s?

If it takes 60MB you'll have to tile it across 6 tiles if the eDRAM can only hold 10MB. I don't know where you got that you always have to tile it across four tiles. But yes, just copying the FB would use 3.6GB/s (if it takes 60MB) WITHOUT compression, plus the additional overhead of resubmitting the geometry that spans two or more tiles.

Correct me if I'm wrong: when doing FB operations on PC graphics cards (like AA, postprocessing, HDR, etc.), the data is always uncompressed, isn't it?
 
I've been talking about the use of AA, as that's the case that uses the most extreme bandwidth.

Compression arises because the amount of data required to represent polygon edge and non-edge pixels is different.

Edge pixels require that each of the AA samples are written. So if a pixel colour falls on two (out of four) of the sampling positions within a pixel, then colour and z/stencil data (8 bytes total) needs to be written to the memory locations corresponding to those two sampling positions. At the same time, the other two samples within the pixel need to be set to the z/stencil values corresponding to the polygon that was already there.

When a non-edge pixel is written, there's no need to fill in all the samples - which is where the "compression" comes from. It's enough to write a single colour/z value.

Depending on the amount of detail on screen, the number of edge pixels will vary. Compression is also going to be affected by the amount of overdraw - e.g. if there's a troll army on the far side of a hill, but the hill is drawn after the trolls (!! not likely!) then a huge number of edge pixels generated in rendering the trolls will be turned into non-edge pixels when the hill is drawn.

So assessing the effectiveness of compression is a guessing game. The effectiveness of the compression depends on the data (detail) and the rendering methods (amount of overdraw).
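
As a rough illustration of why it's a guessing game, here's a small C sketch - the 15% edge-pixel figure is purely made up, not a measurement. It compares storing every 4xAA sample for every pixel against storing all samples only for edge pixels and a single colour/Z value for the rest:

Code:
#include <stdio.h>

/* Rough sketch: how AA "compression" effectiveness depends on the
   fraction of edge pixels. The edge fraction is an illustrative
   assumption, not a measured figure. */
int main(void)
{
    const double pixels = 1280.0 * 720.0;
    const int aa_samples = 4;
    const int bytes_per_sample = 8;        /* 4 colour + 4 Z/stencil */
    const double edge_fraction = 0.15;     /* hypothetical scene */

    /* Uncompressed: every pixel stores every sample. */
    double uncompressed = pixels * aa_samples * bytes_per_sample;

    /* "Compressed": edge pixels store all samples, non-edge pixels
       get away with a single colour/Z value. */
    double compressed = pixels * (edge_fraction * aa_samples * bytes_per_sample
                                  + (1.0 - edge_fraction) * bytes_per_sample);

    printf("uncompressed: %.1f MB\n", uncompressed / (1024 * 1024));
    printf("compressed:   %.1f MB (ratio %.2fx)\n",
           compressed / (1024 * 1024), uncompressed / compressed);
    return 0;
}

With those made-up numbers the ratio comes out at roughly 2.8:1; a busier scene (more edge pixels, more overdraw churn) pulls it back towards 1:1.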

Jawed
 
Apoc said:
If it takes 60MB you'll have to tile it across 6 tiles if the eDRAM can only hold 10MB.

A 720p frame with 4xAA takes:
width x height x AA-samples x (colour+z) bytes
1280 x 720 x 4 x 8
29491200 bytes

Which is 3 tiles.
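
A quick sanity check of that tile count, assuming 10MB means 10 x 1024 x 1024 bytes:

Code:
#include <stdio.h>

/* Back-of-the-envelope check: a 720p 4xAA colour+Z backbuffer
   against 10MB of eDRAM. */
int main(void)
{
    const long frame_bytes = 1280L * 720 * 4 * 8;       /* 29,491,200 */
    const long edram_bytes = 10L * 1024 * 1024;         /* 10,485,760 */
    long tiles = (frame_bytes + edram_bytes - 1) / edram_bytes;  /* round up */
    printf("%ld bytes -> %ld tiles\n", frame_bytes, tiles);      /* prints 3 tiles */
    return 0;
}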

I don't know where you got that you always have to tile it across four tiles. But yes, just copying the FB would use 3.6GB/s (if it takes 60MB) WITHOUT compression, plus the additional overhead of resubmitting the geometry that spans two or more tiles.
The resolution of the backbuffer into the frontbuffer generates:

width x height x red,green,blue x fps
1280 x 720 x 3 x 60
165888000 bytes/s
of uncompressed framebuffer data in main memory.
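
For scale, that's a quick calculation away from GB/s:

Code:
#include <stdio.h>

/* The resolve cost quoted above: a 24-bit RGB 720p frontbuffer
   written out 60 times per second. */
int main(void)
{
    const long bytes_per_second = 1280L * 720 * 3 * 60;  /* 165,888,000 */
    printf("%ld bytes/s (~%.2f GB/s)\n",
           bytes_per_second, bytes_per_second / 1e9);     /* ~0.17 GB/s */
    return 0;
}

In other words, the resolve traffic into main memory is tiny next to the per-sample backbuffer traffic that stays on the daughter die.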

Jawed
 
Jawed said:
A 720p frame with 4xAA takes:
width x height x AA-samples x (colour+z) bytes
1280 x 720 x 4 x 8
29491200 bytes


Which is 3 tiles.


The resolution of the backbuffer into the frontbuffer generates:

width x height x red,green,blue x fps
1280 x 720 x 3 x 60
165888000 bytes/s

of uncompressed framebuffer data in main memory.

Jawed

I know, I was just answering weaksauce, using his example of a 60MB FB.
 
Apoc said:
I know, I was just answering weaksauce, using his example of a 60MB FB.

I don't want to continue off-topic, but this has something to do with how much bandwidth is left for the CPU. :p

Anyhow, about the tile rendering: I really don't know anything about it, but I've been told it can hold up to 40MB when using "tile rendering". :smile:

So the 360 uses "compressed" frames, and the PS3's are uncompressed and much larger and take a lot of bandwidth, or? :???:

Do you guys have any idea how much bandwidth the eDRAM would free up at most? Because this doesn't really sound optimal to me.
 
weaksauce said:
So the 360 uses "compressed" frames, and the PS3's are uncompressed and much larger and take a lot of bandwidth, or? :???:

It's the other way around. Color/Z compression etc. aren't used by the daughter die, but they are used in nVidia chips. There's enough bandwidth on the daughter die that compression isn't really necessary.
 
Titanio said:
It's the other way around. Color/Z compression etc. aren't used by the daughter die, but they are used in nVidia chips. There's enough bandwidth on the daughter die that compression isn't really necessary.

Hm, OK. So what exactly does the 360 win from all this?
 
weaksauce said:
Hm, OK. So what exactly does the 360 win from all this?
Basically it will be much less BW limited for framebuffer operations than the PS3.

RSX has 25GB/s to GPU RAM and 23GB/s to main RAM; Xenos has 22GB/s to the CPU and 32GB/s to the daughter die, and the daughter die has 256GB/s internally.

How much practical bandwidth saving the eDRAM offers can only be guessed at right now - somewhere between 32 and 256GB/s - but regardless of the exact number, it should be much more than the bandwidth available to RSX if used correctly.
 
scooby_dooby said:
Basically it will be much less BW limited for framebuffer operations than the PS3.

RSX has 25GB/s to GPU RAM and 23GB/s to main RAM; Xenos has 22GB/s to the CPU and 32GB/s to the daughter die, and the daughter die has 256GB/s internally.

How much practical bandwidth saving the eDRAM offers can only be guessed at right now - somewhere between 32 and 256GB/s - but regardless of the exact number, it should be much more than the bandwidth available to RSX if used correctly.

No, but the 22.4GB/s is shared with the CPU - it doesn't get all the bandwidth. And I don't get it: it's only 10MB, it can only fit so much.

...
 
weaksauce said:
No, but the 22.4GB/s is shared with the CPU - it doesn't get all the bandwidth. And I don't get it: it's only 10MB, it can only fit so much.

...


Trust me, for simple or transparent fill it will absolutely kill external memory in terms of performance.

And there are still a lot of simple and blended tri's in your average game even in the days of programmable shaders.
 
scooby_dooby said:
RSX has 25GB/s to GPU RAM and 23GB/s to main RAM

It's 22.1GB/s and 25.6GB/s respectively, actually.

Whether it's a win over PS3 depends on how much framebuffer operations consume, obviously. Every game will be different in this regard. The tipping point in favour of X360 is obviously if the framebuffer starts requiring more than 25.6GB/s (or, perhaps more accurately, 25.6GB/s * x, where x is the compression ratio that applies, which again is a variable depending on what you're doing). If it requires less than that, PS3 as a whole will be left with more main memory bandwidth. That, of course, is not accounting for tiling costs which may affect how much of a win it'd be in the former case.
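
As a toy sketch of that tipping point - both the compression ratio and the raw framebuffer traffic below are purely hypothetical placeholders, not estimates for any real game:

Code:
#include <stdio.h>

/* Toy model of the tipping point described above. The compression
   ratio and the raw framebuffer traffic are hypothetical placeholders. */
int main(void)
{
    const double tipping_bw = 25.6;         /* GB/s, the figure used above */
    const double compression_ratio = 2.0;   /* hypothetical */
    const double fb_raw_traffic = 60.0;     /* GB/s uncompressed, hypothetical */

    double fb_cost = fb_raw_traffic / compression_ratio;  /* what RSX actually pays */
    printf("framebuffer costs %.1f GB/s after compression\n", fb_cost);
    if (fb_cost > tipping_bw)
        printf("-> past the tipping point: the eDRAM setup comes out ahead\n");
    else
        printf("-> below it: PS3 keeps more bandwidth free for CPU and GPU\n");
    return 0;
}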

Obviously in the context of X360 alone, it's a win - if it weren't there, that 22.4GB/s to main memory would be looking terribly narrow.

It's just a hunch, but if we consider 3 consumers of bw - CPU, GPU, framebuffer, then I think in many cases the chips may fall like this: PS3 with more bw for CPU and GPU, X360 with more bw for the framebuffer. The setup in PS3 pretty much necessitates that the framebuffer use less than 25.6GB/s, or 22.1GB/s perhaps, so it pretty much has to be this way. That's also conditional on X360's eDram being properly used (proper use unfortunately does not seem to go without saying).
 
Xenos has a guaranteed minimum backbuffer bandwidth of 32GB/s against the guaranteed maximum of 22.4GB/s of RSX.

Where RSX ends, Xenos hasn't even started :oops:

Jawed
 
Jawed said:
Xenos has a guaranteed minimum backbuffer bandwidth of 32GB/s against the guaranteed maximum of 22.4GB/s of RSX.

And I doubt most games will use 22.4GB/s for the framebuffer (they'll be made not to). Which may leave it with a generous excess of bw for the two other components I mentioned, versus X360..

It's all tradeoffs. However in X360's case, to even get its advantages here requires work that may not be universally taken up (and certainly has not been to date)..
 
Titanio said:
It's just a hunch, but if we consider 3 consumers of bw - CPU, GPU, framebuffer, then I think in many cases the chips may fall like this: PS3 with more bw for CPU and GPU, X360 with more bw for the framebuffer.
I'd suggest this is a stab in the dark at the very least, especially since we don't yet know exactly how developers are going to use the chips in the respective systems, nor do we know the full details. For instance, the "Procedural Synthesis" on Xenon should be a fairly low-bandwidth, low-latency method of generating geometry, more or less consuming only FSB bandwidth, thanks to the cache locking between the CPU and graphics chips; until we learn more about the integration of RSX and Cell we don't know if this is going to be possible, or whether such operations would need to be pumped out to one memory or the other before the graphics chip could retrieve the information.

However in X360's case, to even get its advantages here requires work that may not be universally taken up (and certainly has not been to date)..
I think it's been fairly well documented that the early dev systems didn't implement the actual method required for tiling - given the time between final dev systems and shipping games, it's really no surprise that the launch titles are as they are. Should the talk of DOA4 utilising 4x FSAA prove to be true, we can probably begin to understand why the title has been delayed.
 
Dave Baumann said:
I'd suggest this is a stab in the dark at the very least, especially since we don't yet know exactly how developers are going to use the chips in the respective systems, nor do we know the full details.

None of this changes the fact that if RSX uses no more than 22.4GB/s for the framebuffer, as it perhaps may not even be able to do (at least not easily), then the CPU and GPU have at least 25.6GB/s between them. Which is more than X360 will have.

We could discuss potential differences in behaviour that could alter bw usage from those two components' perspectives in both systems, but things get a little flaky then...

Dave Baumann said:
I think it's been fairly well documented that the early dev systems didn't implement the actual method required for tiling - given the time between final dev systems and shipping games, it's really no surprise that the launch titles are as they are. Should the talk of DOA4 utilising 4x FSAA prove to be true, we can probably begin to understand why the title has been delayed.

There remains a question of how third-party multiplatform engines and titles might handle this, though... how significant would engine changes need to be?
 
Titanio said:
None of this changes the fact that if RSX uses no more than 22.4GB/s for the framebuffer, as it perhaps may not even be able to do (at least not easily), then the CPU and GPU have at least 25.6GB/s between them. Which is more than X360 will have.
Right, but you should also point out that the PC counterpart to this card already seems BW-limited in some games when you simulate a 128-bit bus.

Also, in the newest refresh of that card they've bumped up BW by 43%, which suggests it certainly can 'use' all the bandwidth it can get, and shows that Nvidia feels that even with the 256-bit bus at 600MHz the G70 could still gain from a significant increase in bandwidth.

It doesn't bode well for most games not requiring more than 22.4GB/s. It's a closed box, so I'm sure they'll find many very cool workarounds, but it does seem pretty obvious that most devs are going to hit this wall at one point or another - i.e. looking at the GTX 512, it certainly suggests that 22GB/s of bandwidth is not ideal...
 
Titanio said:
None of this changes the fact that if RSX uses no more than 22.4GB/s for the framebuffer, as it perhaps may not even be able to do (at least not easily), then the CPU and GPU have at least 25.6GB/s between them. Which is more than X360 will have.
That suggests that, other than command information, you are thinking that geometry is only ever going to be shoved between the two chips, in which case how much is that going to be valid for, and does RSX have the setup rate for it?

nAo was already suggesting very much against this when the concept of the split memory was first talked about.

There remains a question of how third-party multiplatform engines and titles might handle this, though... how significant would engine changes need to be?
How much of a change is it going to be to get the code working on two radically different CPUs? How different is the code going to be given the two different APIs? How different do the shaders have to be, presumably, given the two different HLSL compilers? How different does the code have to be given the different properties of the graphics chips?

Changes will always have to be made because they are different platforms. Other factors may also come into play, such as what the primary development platform will be for the developers.
 