What is the peak theoretical fill rate of RSX and Xenos at varying levels of AA?

MrWibble said:
Disclaimer:

I was just repeating the information from these publicly shown presentations - I haven't looked inside my latest devkit or anything... I'd probably get shouted at for attacking one with a screwdriver.
*Hulk SMASH*

Okay, I'm going to sit in the corner now.
 
Aaah, now I get it: the dev kits seem to run the graphics card off the "southbridge" side, and the final PS3's RSX acts as a sort of northbridge to its HD?

No wonder the tabloid website got confused; the dev kits don't seem to have a northbridge at all.
 
kimg said:
Aaah, now I get it: the dev kits seem to run the graphics card off the "southbridge" side, and the final PS3's RSX acts as a sort of northbridge to its HD?

No wonder the tabloid website got confused; the dev kits don't seem to have a northbridge at all.

I had to actually go and look up what the difference was - I'm not very familiar with PC motherboard chipsets...

No, the PS3 design doesn't need a northbridge in the PC sense - both RAM and the GPU are interfaced directly to the CPU. If the Wikipedia article on the subject is any more accurate than the EE one was, it seems as though northbridge functionality is being assimilated into the CPU on PC architectures these days too.

It seems to me that integrating the southbridge onto the GPU wouldn't be impossible, but it might complicate the design enough that it won't happen for the first design of the machine. I have no idea what the pin counts are on either chip, but if the southbridge has a significant number of peripheral interfaces, it may be tricky to package it in with an already fairly large GPU. You've also got to factor in the logistics of integrating different companies' logic into a single die - this may be beyond the scope of current agreements between Sony, Toshiba, IBM and NVidia.

Again, I'm not about to prise open my devkit to look, nor would that necessarily be the same layout as a final PS3, but nothing I've heard so far would indicate anything other than them remaining separate chips.
 
kimg said:
Aaah, now I get it: the dev kits seem to run the graphics card off the "southbridge" side, and the final PS3's RSX acts as a sort of northbridge to its HD?

No wonder the tabloid website got confused; the dev kits don't seem to have a northbridge at all.

If you look at one of the IBM Cell presentations from some time ago, it had a diagram that implies integration of the northbridge and southbridge into Cell. It's one of the integration aspects of Cell.
 
kimg said:
Aaah, now I get it: the dev kits seem to run the graphics card off the "southbridge" side, and the final PS3's RSX acts as a sort of northbridge to its HD? No wonder the tabloid website got confused; the dev kits don't seem to have a northbridge at all.
The northbridge has always been integrated into Cell, since Cell is connected directly to XDR RAM. The southbridge could go on Cell, on RSX or on a separate chip. The dev kits probably used a southbridge to link RSX to Cell because a FlexIO interface was not built into the nVidia PCIe cards used in the dev kits.

In the production PS3, if RSX has FlexIO incorporated to communicate with Cell and XDR, then it would be directly connected to Cell via FlexIO as shown, and not through a southbridge. This makes pretty good sense. I hope a proper FlexIO interface between Cell and RSX in the production PS3 means there will be no serious bandwidth or latency limitations for access of XDR by RSX and GDDR by Cell. Is it possible that the latency and bandwidth issues discussed thus far relate to the dev kits rather than the production PS3?
 
SPM, I think that the latency seen in the case in which CELL has to read from GDDR3 directly might be explained by the very long pipeline and much lower clock speed RSX has compared to the CELL processor. The enormous degree of parallelism granted by processing streams of mostly independent elements (Vertices and Fragments) executed in large batches (all running the same Shader program) allows latency hiding for each fragment.

The problem comes when RSX receives a request from CELL: likely the memory operation has to be inserted into RSX's instruction stream and wait until it reaches the proper pipeline stage before initiating the memory request to the GDDR3 memory controller.

It could have been avoided (the same way Xenon/Waternoose reads and writes to the shared GDDR3 pool; we know that in the Xbox 360 architecture the memory controller is inside the GPU, that is, inside Xenos/C1), but I guess it was more efficient/cheaper to keep and optimize the current G7x memory controller and the interface between the core and the memory controller.
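
A toy model of the latency-hiding argument above might look like this (purely illustrative; the cycle counts are made-up assumptions, not real RSX or CELL figures):

def stall_cycles(mem_latency, overlap_work):
    # Cycles actually lost waiting on memory once overlapping work runs out.
    return max(0, mem_latency - overlap_work)

# A large batch of independent fragments has plenty of other work to overlap with:
print(stall_cycles(mem_latency=400, overlap_work=1000))   # 0 cycles stalled
# A lone CPU-initiated read travelling through the GPU pipeline does not:
print(stall_cycles(mem_latency=400, overlap_work=0))      # 400 cycles stalled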

Please, anyone, correct me if I said something (or many things) wrong here, thanks :D.
 
Acert93 said:
Higher resolutions are going to put more strain on the fillrate of both GPUs. I am not sure how you arrive at the conclusion that fillrate is the bottleneck ATI had in mind in those comments, though. Resolution increases affect more than fillrate, and since the rep did not mention it being fillrate-bound, and the architecture indicates this probably would not be the case, I am not inclined to think it is such either. If I had to guess, it would seem to me the ATI comments about performance degrading at much higher resolutions would be related to the small framebuffer size on the Xbox 360.

What resolutions he had in mind (higher than 16x12? 21x15? 10x7?) we don't know, ditto whether he had AA in mind as well, and the difference in FP16 blending and filtering performance on the X1800 versus FP10 on Xenos (and whether that is a bottleneck to memory or ROPs at all), etc. And we don't have any numbers, even from ATI, for the performance hit from 3 tiles to 4, 5, 6, etc. (they claim none at 2 and 1-5% typically for 3).

Ultimately the X1800's memory bandwidth (and realized fillrate) will be contending with buffers, texture and geometry assets, and I believe the fillrate takes a hit with 4xMSAA (but I could be wrong). Worst case scenario, Xenos will always have enough bandwidth for 4 Gigapixel/s fillrate. Anyhow, the only absolutes ATI gave were bandwidth and shader performance. Guessing what bottleneck and what resolutions he had in mind is kind of difficult.

I believe the quote specifically mentioned 1600x1200, which means 2 tiles with no MSAA and either 4 or 8 with 2x/4xMSAA.

So with no FSAA, the tiling hit should be virtually non-existent (if you believe ATI's claims), so what else could be the reason for R520 being faster? HDR format can't be it, because that's a Xenos advantage. Maybe they are assuming 4xMSAA and the 8 tiles would drop Xenos performance below R520 despite your claimed Xenos real-world fill rate advantage, but I can't say I'm convinced of that given how much we are told that tiling has minimal performance impact.

What else is hit by higher resolution? Shaders? Advantage: Xenos. Framebuffer bandwidth? Advantage: Xenos. So I'm thinking fill rate could be a good candidate for R520's claimed superior performance at 1600x1200 and above.
 
HiZ (which can reject many, many pixels very quickly, without any shader work being carried out or using any bandwidth) will be built for higher resolutions on the high-end desktop parts, whereas it will be a little more targeted to the type of resolutions being used in a console part.
 
pjbliverpool said:
Maybe they are assuming 4xMSAA and the 8 tiles would drop Xenos performance below R520 despite your claimed Xenos real-world fill rate advantage

My "claimed"? Look at the numbers. Xenos has a full 4 Gigapixel/s fillrate with the bandwidth to sustain it, and it can reach 16 Gigasamples/s with 4xMSAA, with the bandwidth to sustain that as well.

The X1800XT (if I am doing my math right for 48GB/s) has enough bandwidth to sustain ~5.6 Gigapixel/s. That is in situations where the totality of memory bandwidth is being dedicated to fill rate. No texturing. No geometry. No backbuffers and no framebuffer.
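
Just to put some rough numbers on that (a back-of-the-envelope sketch only; the bytes-per-pixel figure is an assumption, since real framebuffer traffic depends on Z reads/writes, compression and blending):

def sustainable_fillrate_gpix(bandwidth_gb_s, bytes_per_pixel=8.0):
    # Pixels per second the bus could feed if ALL bandwidth went to the framebuffer.
    return (bandwidth_gb_s * 1e9) / bytes_per_pixel / 1e9

print(sustainable_fillrate_gpix(48.0))        # ~6.0 Gpixel/s at 8 bytes/pixel (4 colour + 4 Z)
print(sustainable_fillrate_gpix(48.0, 8.5))   # ~5.6 Gpixel/s if ~8.5 bytes of traffic per pixel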

but I can't say I'm convinced of that given how much we are told that tiling has minimal performance impact.
...
What else is hit by higher resolution?

Tiling.

You mention the negligible hit from ATI's numbers. What they are on record for is no penalty for 2 tiles and a 1-5% penalty for 3 tiles, depending on the scene's geometry complexity. If that is a linear progression (it could be logarithmic), that would be a 6-30% penalty for 8 tiles (1600x1200 with 4xMSAA). It could be significantly more. The X1800 also has an advantage in texturing in some situations.
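
In other words, the extrapolation is simply linear in the number of extra tiles. A small sketch of it (the per-tile cost range is ATI's quoted 1-5%; the linearity itself is the assumption):

def tiling_penalty_range(num_tiles, per_tile_low=0.01, per_tile_high=0.05):
    # ATI: no penalty up to 2 tiles; assume each extra tile costs the 1-5% quoted for the third.
    extra = max(0, num_tiles - 2)
    return extra * per_tile_low, extra * per_tile_high

for tiles in (2, 3, 8):
    lo, hi = tiling_penalty_range(tiles)
    print(f"{tiles} tiles: {lo:.0%} to {hi:.0%}")
# 2 tiles: 0% to 0%, 3 tiles: 1% to 5%, 8 tiles: 6% to 30%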

Anyhow, Xenos has not run the demo. It ran Ruby fine and probably would run ToyShop well as well. We can roughly verify ATI's claims that Xenos has more shader performance and bandwidth. What the rep had in mind regarding what would hold Xenos back at higher resolutions is guesswork.

But my guess is that Xenos is pretty much tailored for 720p as the maximum. The X1800 is more flexible in regards to resolutions. This is one reason the Xenos article mentions we won't see the eDRAM in the PC sector any time soon: it is currently too small to accommodate the various display resolutions.
 
Titanio said:
'G70' is just an architecture's moniker; different implementations with varying ROP counts will have different fillrate numbers, obviously. IIRC, there isn't a hit with 2xAA.


I would argue that G70 is a specific chip: 8 Vertex Shaders, 24 Pixel Shaders, 24 Texture units, and 16 ROPs, while "G7x" is the architecture's moniker. Different implementations of G7x (e.g. G73) have different numbers of Vertex Shaders, Pixel Shaders and ROPs.

RSX is a G7x architecture, not a G70.
 
pjbliverpool said:
[...]which means 2 tiles with no MSAA and either 4 or 8 with 2x/4xMSAA.

One small note: it should be 3 and 6 tiles, respectively.

1600x1200 x (4 bytes color + 4 bytes z/stencil) = ~14.65 MiB
2xMSAA: ~29.3 MiB
4xMSAA: ~58.6 MiB
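
For anyone who wants to check the arithmetic, a small sketch (assuming 4 bytes of color plus 4 bytes of Z/stencil per sample and the 10 MiB of eDRAM):

import math

EDRAM_MIB = 10.0   # Xenos eDRAM size

def framebuffer_tiles(width, height, msaa=1, bytes_per_sample=8):
    # 4 bytes color + 4 bytes Z/stencil per sample
    size_mib = width * height * msaa * bytes_per_sample / (1024 ** 2)
    return size_mib, math.ceil(size_mib / EDRAM_MIB)

for msaa in (1, 2, 4):
    size, tiles = framebuffer_tiles(1600, 1200, msaa)
    print(f"{msaa}xMSAA: {size:.2f} MiB -> {tiles} tiles")
# 1xMSAA: 14.65 MiB -> 2 tiles
# 2xMSAA: 29.30 MiB -> 3 tiles
# 4xMSAA: 58.59 MiB -> 6 tiles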
 
Acert93 said:
My "claimed"? Look at the numbers. Xenos has a full 4 Gigapixel/s fillrate with the bandwidth to sustain it, and it can reach 16 Gigasamples/s with 4xMSAA, with the bandwidth to sustain that as well.

The X1800XT (if I am doing my math right for 48GB/s) has enough bandwidth to sustain ~5.6 Gigapixel/s. That is in situations where the totality of memory bandwidth is being dedicated to fill rate. No texturing. No geometry. No backbuffers and no framebuffer.

Tiling.

You mention the negligible hit from ATI's numbers. What they are on record for is no penalty for 2 tiles and a 1-5% penalty for 3 tiles, depending on the scene's geometry complexity. If that is a linear progression (it could be logarithmic), that would be a 6-30% penalty for 8 tiles (1600x1200 with 4xMSAA). It could be significantly more. The X1800 also has an advantage in texturing in some situations.

Anyhow, Xenos has not run the demo. It ran Ruby fine and probably would run ToyShop well as well. We can roughly verify ATI's claims that Xenos has more shader performance and bandwidth. What the rep had in mind regarding what would hold Xenos back at higher resolutions is guesswork.

But my guess is that Xenos is pretty much tailored for 720p as the maximum. The X1800 is more flexible in regards to resolutions. This is one reason the Xenos article mentions we won't see the eDRAM in the PC sector any time soon: it is currently too small to accommodate the various display resolutions.

At the end of the day, ATI's comments were likely marketing anyway and had little, if any, technical basis. It must be difficult for ATI to simultaneously "big up" its two top-end parts, which are effectively competing with each other - at least for the hearts and minds of their respective fans - when people ask for direct comparisons.

As a side note, I'm not sure how relevant all the bandwidth numbers are when discussing the eDRAM. Going by the numbers, one would assume that PC GPUs are heavily bandwidth constrained in the latest games, with all the fancy framebuffer effects, high res, 4xMSAA and FP16 HDR, and thus you would expect performance to scale almost linearly with memory speed. Yet the 7900GTX is almost universally faster than the 7800GTX512, which has faster memory, and its only advantage is a modest core speed increase. That suggests it is far less bandwidth constrained than the numbers would suggest (the X1800 and X1900, with virtually the same memory bandwidth, are perhaps a better example given that G7x can't handle FP16 HDR and MSAA together).
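
For reference, a quick sketch using the commonly quoted clocks for those two cards (treat the exact figures as approximations) illustrates the point: the 7900GTX actually has slightly less memory bandwidth, yet is the faster card.

# Commonly quoted reference clocks (approximate; treat as assumptions).
cards = {
    # name: (core clock MHz, effective memory clock MHz, bus width in bits)
    "7800 GTX 512": (550, 1700, 256),
    "7900 GTX":     (650, 1600, 256),
}

for name, (core, mem, bus) in cards.items():
    bandwidth_gb_s = mem * 1e6 * (bus / 8) / 1e9
    print(f"{name}: core {core} MHz, ~{bandwidth_gb_s:.1f} GB/s memory bandwidth")
# 7800 GTX 512: core 550 MHz, ~54.4 GB/s
# 7900 GTX:     core 650 MHz, ~51.2 GB/s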
 
G71's ROPs are more capable than G70's - which seems to make G71 able to use bandwidth more effectively.

Jawed
 