PS3 GPU Analysis

Arun

I'm more than tired of the same unrealistic crap being repeated again and again about the PS3 GPU (which I will name NV5A here, just for simplicity's sake). I encourage you to consider the following analysis to be little more than a rumor, although in all honesty it's little more than reliable information with "obvious" speculation filling in the gaps. While in no way perfect, this is still a good order of magnitude more precise and likely than anything else I've seen so far, so read on & enjoy.

Untransformed vertex data, along with potential tessellation information, initially resides in main memory. The SPEs then handle VS and potentially PPP functionality if the programmer bothers to implement it. Whether tessellation is implemented before or after the VS (or both) is thus obviously up to the programmer. In some cases, the PE might assist, although I doubt this would be happening unless the SPEs' branch prediction is truly subpar.
The transformed vertex data is then sent to another memory location. Whether the last frame's vertex data is kept around depends on whether CELL waits for the NV5A to finish with it before transforming the next frame's vertices. Some other tricks, such as an efficient way of checking whether a render call has been fully processed, would make this more efficient. Occlusion Queries already are quite similar to such a technique. An obvious advantage of this technique, and the lack of a direct bus between CELL and the NV5A, is to significantly reduce the difference between static and dynamic vertices, at least for the GPU.
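To illustrate what I mean by "an efficient way of checking whether a render call has been fully processed", here's a rough fence-style sketch. The names and mechanism are entirely made up for illustration; I'm not claiming this is how the PS3 would actually expose it.

Code:
// Hypothetical sketch only: a fence value lets the CPU know when the GPU has finished
// consuming a batch of SPE-transformed vertices, so that memory can safely be reused.
#include <atomic>
#include <cstdint>

struct GpuFence {
    std::atomic<uint32_t> completed{0};  // bumped by the GPU/driver when a batch finishes
    uint32_t submitted = 0;              // bumped by the CPU for each render call it issues
};

// CPU side: submit a render call and remember which fence value corresponds to it.
uint32_t submit_batch(GpuFence& f /*, vertex buffer, render call, ... */) {
    // (imaginary) push the draw command plus a "write fence" command into the command buffer
    return ++f.submitted;
}

// CPU side: has the GPU fully processed that render call yet?
bool batch_done(const GpuFence& f, uint32_t fence_value) {
    return f.completed.load(std::memory_order_acquire) >= fence_value;
}

Occlusion queries, as said above, already behave a lot like this: you ask the GPU to report back once a given chunk of work is done.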
The NV5A then handles clipping, triangle setup, and rasterization. Since the SPEs handle Vertex Shading, most games will begin rendering with a Z-only pass, a la Doom3. Like in Doom3, and unlike in 3DMark03, there won't be any "redundant transformations", as these operations are exclusively done in the SPEs, that is, on the CPU. This implies the cost of a Z-only pass on the PS3 is exclusively an "idle Pixel Shading transistors" one. Bandwidth wouldn't (always) be wasted, as it is shared with CELL, and whatever the NV5A doesn't use, CELL might. In order to minimize power usage and heat dissipation, it is possible that MediaQ-like technology would be used to "switch off" the Pixel Shading transistors during this initial Z-only pass.
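To make the Z-only pass a bit more concrete, here's roughly what it looks like in code. I'm using plain desktop OpenGL purely as a familiar stand-in API; it obviously has nothing to do with whatever library the PS3 will actually expose.

Code:
// Sketch of a Doom3-style Z-only pre-pass (OpenGL used only as a stand-in API).
#include <GL/gl.h>

void draw_scene(bool /*real_shaders*/) { /* the application's draw calls go here */ }

void render_frame() {
    // Pass 1: Z-only. Colour writes off, so the pixel shading hardware is mostly idle,
    // and the SPE-transformed vertices don't need to be transformed a second time.
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
    draw_scene(false);

    // Pass 2: full shading. Only fragments matching the stored depth survive,
    // so expensive pixel shaders never run on occluded geometry.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);
    glDepthFunc(GL_EQUAL);
    draw_scene(true);
}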
The actual framebuffer is in main memory, but it is not out of the question to have a bit of onboard EDRAM for Hierarchical Z. For the technically challenged, Hierarchical Z - which also goes by other names, and which some might attribute to other things - is a technique which minimizes bandwidth usage by keeping a much lower resolution version of the Z buffer, with each pixel "block" having a minimum and maximum Z value. On ATI GPUs, and possibly on NVIDIA ones too, it is also currently the only (known) way for Early Z to work with Alpha Testing, a technique often used on trees for example.
For this task however, EDRAM is only preferable over a traditional cache system if the resolution of this min/max Z buffer is sufficiently high. There's no point using EDRAM for a 128x128 buffer, which would most likely require about 64KB (128x128x(2x16)). On the other hand, there's little point NOT using EDRAM for a 1024x1024 one. Please note that non-1:1 resolutions -could- be used for these buffers, but I'm only using powers of two here to simplify calculations. Personally, I doubt there's any truth to the EDRAM rumors for the NV5A, although it's not out of the question at all, particularly so if the total system bandwidth is indeed limited to 25.6GB/s. But "hoping" for a significant part of the die to be used by EDRAM is ridiculous at best. The real high-resolution framebuffer with Z *and* color information will always be kept in main memory. And it's not like NVIDIA liked EDRAM much anyway, although it is true that Sony and Toshiba have used it extensively in the past.
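Just to make the min/max idea and that 64KB figure concrete, here's a rough sketch; the block layout is an illustrative assumption on my part, not a known NV5A detail.

Code:
// Illustrative min/max hierarchical-Z buffer (assumed layout, smaller-Z-is-closer convention).
#include <cstddef>
#include <cstdint>
#include <vector>

struct HiZBlock { uint16_t zmin, zmax; };      // 2 x 16 bits per pixel "block"

struct HiZBuffer {
    int blocks_x, blocks_y;                    // e.g. 128 x 128 blocks
    std::vector<HiZBlock> blocks;

    // A whole block can be rejected without touching the full-resolution Z buffer
    // if the incoming triangle's nearest depth is already behind the block's farthest pixel.
    bool block_occluded(int bx, int by, uint16_t tri_zmin) const {
        return tri_zmin > blocks[by * blocks_x + bx].zmax;
    }
};

// The 64KB estimate from the text: 128 x 128 blocks x 4 bytes each.
constexpr std::size_t hiz_bytes = 128 * 128 * sizeof(HiZBlock);
static_assert(hiz_bytes == 64 * 1024, "matches the 64KB figure above");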
In such an architecture, it is not out of the question for Pixel Shaded data to be used by CELL and then re-sent to the GPU again. Framebuffer access by the CPU should be fast. Framebuffer post-processing by the CPU might not be out of the question either, although I sincerely doubt the performance of such a technique.
The NV5A architecture, considering the common (but not forced) Z-only pass, tends to focus not only on high "Zixel" throughput, but also on fast Z rejection - in that way it isn't really very different from PC parts, it's just a bit more optimized towards it (also refer to the paragraphs about Hierarchical Z above). Also, you shouldn't expect a mind-blowing featureset. Unlike in the PC market, it's only about gaming, which means wasting transistors on features that aren't usable in realtime just isn't much of an option. That's why you shouldn't expect much of anything above Pixel Shaders 3.0, and that means it's likely to have a fair bit of NV4x influence. But yes, it's still next-generation, as NVIDIA's upcoming high-end GPU is still largely based on the NV40. You shouldn't expect much more of an architectural difference there than between the NV30 and the NV35 for this new part, even though it's likely that, just like ATI did with the R420, NVIDIA is going to rename it NV50 or something along those lines. The originally planned NV50 architecture might have to wait for Microsoft to release Longhorn, or at least for there to be a more official and final release schedule for it. The "new" NV50 is likely to be little more than an NV40 on steroids, though, and it's likely this optimized architecture - and maybe even that new NV50 - will be NVIDIA's high-end for at least another year.
Another important point is the amount of RAM present in the whole system. What most people seem to forget is that the traditional CPU-GPU architecture does have a fair bit of redundancy; textures are often also kept in system memory, even when present in video memory. This COULD be changed thanks to the high GPU->CPU speed of PCI-Express, but I doubt it will be anytime soon, as it requires a fair bit of API changes, and potentially driver ones too. And most obviously, a console properly handling shared memory would thus require a lot less total memory than a PC. Other factors also explain this, such as a lighter OS overhead and the SPEs doing VS work, also reducing buffer redundancy. What I'm trying to say here is that there's no reason in hell the PS3 should have 512MB of high-speed XDR RAM. Yes, it would be an advantage for it to have 512MB, but it wouldn't be the end of the world if it didn't, and although not totally out of the question, it most likely won't have any more than 256MB for obvious economical reasons.
Finally, when it comes to performance, it entirely depends on how many transistors Sony has agreed to put in the design. If, and only if, there were as many transistors in the PS3's GPU as in the Xenon's, it's likely to have higher average pixel shading performance since it won't have to deal with any kind of vertex shading. You have to realize, however, that a 220M-transistor CELL CPU would be quite expensive for a console, and if it were to be in the PS3, it's likely Sony would try to cut costs on other things - that also entirely depends on when the PS3 becomes publicly available, though.

Uttar
 
Uttar said:
I encourage you to consider the following analysis to be little more than a rumor, although in all honestly it's little more than reliable information with "obvious" speculation filling in the gaps.
What if you split rumours from speculation? Then we could enjoy the fill-the-gaps game even more.. :)

Untransformed vertex data, along with potential tessellation information, initially resides in main memory. The SPEs then handle VS and potentially PPP functionality if the programmer bothers to implement it. Whether tessellation is implemented before or after the VS (or both) is thus obviously up to the programmer. In some cases, the PE might assist, although I doubt this would be happening unless the SPEs' branch prediction is truly subpar.
SPEs don't have branch prediction, just branch hints. The compiler makes the 'predictions'.

The transformed vertex data is then sent to another memory location. Whether the last frame's vertex data is kept around depends on whether CELL waits for the NV5A to finish with it before transforming the next frame's vertices.
Sorry... what are you trying to say here? Could you explain it better?

An obvious advantage of this technique, and the lack of a direct bus between CELL and the NV5A, is to significantly reduce the difference between static and dynamic vertices, at least for the GPU.
I don't understand. Why would there be no direct connection??! CELL has something like 80 GB/s to share with external devices.

The NV5A then handles clipping, triangle setup, and rasterization. Since the SPEs handle Vertex Shading, most games will begin rendering with a Z-only pass, a la Doom3. Like in Doom3, and unlike in 3DMark03, there won't be any "redundant transformations", as these operations are exclusively done in the SPEs, that is, on the CPU.
That doesn't mean there will not be any redundant transformations. If there is enough memory (and bandwidth) to cache a full geometry frame, then maybe there will not be any redundant transformations.

This implies the cost of a Z-only pass on the PS3 is exclusively an "idle Pixel Shading transistors" one.
IMHO, this is not true.

The actual framebuffer is in main memory,
'Fact' or speculation?


What I'm trying to say here is that there's no reason in hell the PS3 should have 512MB of high-speed XDR RAM. Yes, it would be an advantage for it to have 512MB, but it wouldn't be the end of the world if it didn't, and although not totally out of the question, it most likely won't have any more than 256MB for obvious economical reasons.
Since CELL has 'just' 2 XDR channels, an eDRAM-less GPU would need RAM of its own in order not to share the same base (25.6 GB/s) external memory bandwidth with the CPU.

ciao,
Marco
 
IF Cell is handling all the vertex shading, could this affect the physics and AI engine side of things, and if so won't that mean a trade-off?
 
Pugger said:
IF Cell is handling all the vertex shading, could this affect the physics and AI engine side of things, and if so won't that mean a trade-off?

This is Cell we're talking about; you've got 8 SPEs working for you. All physics, AI or VS code will be tailor-made to run on the SPEs. So some SPEs will do AI, some will do physics and some VS. They are fairly autonomous. This is not a super 31-stage-pipeline Pentium 4 switching madly between the jobs.
 
Shifty Geezer: Not really, but even if I was, there are more than enough things that would justify that decision. The SPE design is perfect for vertex processing. Yes, a GPU Vertex Shader is too, maybe even more so, but it's less flexible and would limit tessellation possibilities. And there are only so many things those 256GFlops could be used for in games, besides (mostly) Vertex Shading and Physics. As I said, it also has the advantage of accelerating a Z-Only Pass.

nAo: Don't have the time to reply to all of your post now extensively, but here are a few replies to some of your questions/remarks: First, good point about the branch hints/predictions distinction. Regarding the bus remark, I should have made myself clearer; I just meant an external bus, ala AGP/PCI-Express, or one through which textures (or even geometry data) would be sent through. In a shared memory scenario, the memory for this would ideally just be shared between CPU and GPU, so there's no need to send it over a bus. Rendering commands would most likely still be sent through a bus though, although I don't know much of anything about that tbh.
As for redundant transformations, for optimal parallelism between the NV5A and CELL, a full geometry frame would most likely have to be cached anyway. So it doesn't feel very likely at all for it to work in any other way.
As for bandwidth, I don't really see the problem with sharing 25.6GB/s with the CPU. Remember the XBox1 survived just fine with its 64MB of 6.4GB/s bandwidth RAM shared between CPU and GPU. Yes, it's 5 years later, but you shouldn't expect everything to be 5x faster either.
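(For reference, and just spelling out the numbers already mentioned: 25.6GB/s is exactly 4x the XBox1's 6.4GB/s, just as 256MB would be 4x its 64MB.)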

As for separating speculation from the rest, don't expect me to do this, for misc. reasons; don't assume there's more than 30-40% "reliable rumors" stuff in there either, I fear, but the rest is quite far from unfounded dreams; at worst, it most likely could be considered bad extrapolation, I guess.

Uttar
 
So... If I understood everything correctly here.
The TnL is done by the APUs, which might be helped by the PU.
The GPU might, or not, have some eDRAM for the Zbuffer, or maybe as framebuffer; nothing has been ruled out.
The GPU might be SM3.0, or not. It might be a beefed up NV40... Or not.
The whole machine, the PS3, might have 256MB or 512MB of XDRAM.
Regarding performance, the GPU could be good, although it might not be that good... That depends on different unknown variables...

Well, well, in other words, you don't know anything more than we do? And if you do know more, then you don't really spill the beans here, to say the least.
 
If PS3 ships with 25GB/s shared memory for the GPU, then when it ships, it will be two generations behind PC technology. Today's GPUs already exceed 25GB/s *unshared*. By late 2006 or 2007 when PS3 ships, another 18+ months would have passed, and I would expect PCs to sport video bandwidth far greater than the 38GB/s an XTPE can garner.

Simply put, 25GB/s is not enough for a next-gen GPU. It won't be able to handle 720p 4x-8x FSAA, HDR rendering/blending/filtering, and tons of highly detailed textures in huge worlds and lots of render-to-texture ops. It is expected to share this with a high-GFlop stream-oriented CPU. Just how will CELL reach anywhere close to its peak performance on the job it is supposed to do, if it is not allowed to burn through memory streams?
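A rough back-of-the-envelope, using my own purely illustrative numbers: 1280x720 with 4x FSAA is about 3.7M samples; at 8 bytes of FP16 color plus 4 bytes of Z per sample that's a ~44MB framebuffer, and merely writing every sample once per frame at 60fps already costs ~2.7GB/s - before any overdraw, Z reads, blending, render-to-texture passes, texture fetches, or anything CELL wants to do with the rest of the bus.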

I expect CELL to burn through extremely complex maps, AI, and physics simulations, all of which will need gobs of memory.

When the X-Box shipped, it was not two generations behind in GPU bandwidth, even with its UMA architecture.
 
Uttar said:
As for redundant transformations, for optimal parallelism between the NV5A and CELL, a full geometry frame would most likely have to be cached anyway. So it doesn't feel very likely at all for it to work in any other way.
Could you explain why, for optimal parallelism, a full geometry frame would most likely have to be cached?
IMHO that's not the case. The PS3 would work like the PS2 in this regard.
The main CPU copes with frame N+1, the GPU RAMDAC shows frame N-1, while VU1 + GS (or SPEs + nvidia GPU) render frame N. Everything is decoupled or deferred (but not cached). Only one thing is cached: the command buffer (aka DMA display lists).
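Roughly, the loop looks like this; the names are mine and purely illustrative of the decoupling, not any real PS2/PS3 library.

Code:
// Illustrative double-buffered command-list loop: the CPU builds frame N+1 while the
// SPEs + GPU render frame N and scan-out shows frame N-1.
#include <cstddef>

struct CommandBuffer { /* DMA display list for one frame */ };

// Hypothetical engine hooks, stubbed out so the sketch stands alone.
void build_display_list(CommandBuffer&) { /* game logic, culling, command generation */ }
void kick_to_gpu(CommandBuffer&)        { /* SPE transform + GPU rasterisation starts, asynchronously */ }
void wait_for_gpu(CommandBuffer&)       { /* block until that command buffer has been consumed */ }
void flip()                             { /* present the frame that just finished */ }

void run(std::size_t frame_count) {
    CommandBuffer cmd[2];                 // only the command buffers are (double-)buffered
    build_display_list(cmd[0]);           // prime the very first frame
    for (std::size_t n = 0; n < frame_count; ++n) {
        CommandBuffer& rendering = cmd[n % 2];        // frame N
        CommandBuffer& building  = cmd[(n + 1) % 2];  // frame N+1

        kick_to_gpu(rendering);           // SPEs and GPU start on frame N
        build_display_list(building);     // meanwhile the CPU prepares frame N+1
        wait_for_gpu(rendering);
        flip();                           // the viewer was seeing frame N-1 during all of the above
    }
}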
As for bandwidth, I don't really see the problem with sharing 25.6GB/s with the CPU. Remember the XBox1 survived just fine with its 64MB of 6.4GB/s bandwidth RAM shared between CPU and GPU. Yes, it's 5 years later, but you shouldn't expect everything to be 5x faster either.
The ratio between flops and memory bandwidth would be much higher this time.. and this is not good :(
 
This is Cell we're talking about; you've got 8 SPEs working for you. All physics, AI or VS code will be tailor-made to run on the SPEs. So some SPEs will do AI, some will do physics and some VS. They are fairly autonomous. This is not a super 31-stage-pipeline Pentium 4 switching madly between the jobs.

That's exactly what I thought, but depending on how complex physics and AI are for a particular game, do we know how many SPEs will be involved? A similar question goes for vertex shading: how many SPEs could be used in a graphically intensive game? Could it be that a game utilising advanced AI and physics will involve a trade-off with the shaders? How intensive on the SPEs will vertex shading, AI and physics be?
 
1 SPE can do 20 million bezier patches/sec
this is 10 gigapolys/sec with 16*16 tessellation !!!
the other 7 SPEs compute physics and AI :)
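For what it's worth, the arithmetic behind that figure, assuming each patch is diced into a 16x16 grid of quads at 2 triangles per quad: 20,000,000 patches/sec x 16 x 16 x 2 = about 10.24 billion triangles/sec.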
 
G70 is the next generation Nvidia

Does G stand for whizz?


WE FINALLY managed to get the next generation codename for Nvidia's high end card. Actually, the guys at station-drivers.com managed to get a 75.10 driver and read what's inside the driver inf file. These guys were lucky enough to see the new codename, and it's currently called G70. You can check what they have to say, here.
We were able to verify the info and we know that G70 is actually the real codename for Nvidia's next generation high end product. We don't have a clue why Nvidia dropped its NV number nomenclature, but this might actually mean that Nvidia is turning the page and that it will present something out of this world.

The G70 won't be based on any existing NV40 marchitecture and it won't be anything like NV47.5, we mean not just an upgrade to existing marchitecture.

We still don't know when it will surface but we do know it will be SLI capable. So whatever one G70 scores, two G70s will score close to twice as much, theoretically. Nvidia wants to make a lot of noise about SLI as it's alone in this market, at least for now. Later on ATI will jump in with its own version of SLI as well; that will be the second big battle of the current graphic wars.

I expect this card to show up later than R520 but I could be wrong. We have to congratulate Nvidia for keeping this codename secret for so long. µ


http://theinquirer.net/?article=21409
 
G70? Is that the chip name or the board name? Cause if it's the chip, where's the 6th? and the 5th?
I guess it's just the follow up to the GF6800, in which case, no huge surprise that it is called G70. Sounds much better than GF7x00 or whatever else they might have used.

Sounds a bit like a mobile phone but anyway....
 
Jumping numbers to skip 'versions' is fairly standard practice. Look at XBox 360. If people associate numerical progression with performance progression, a jump in the number could imply a jump in performance.

As for SPEs doing all the vertex work, from the sounds of it, if true, XB360 will steam all over PS3. They'll have 2/3 fast cores and as much graphics crunching power as Cell+G70 (or whatever). I'm sure the SPEs can do more than just vertex processing or it's a stupid design for a multi-purpose stream processor. :D I'm hoping for the SPEs to handle realtime audio synthesis, AI and physics to extremes. If it's chugged down with the bread and butter of graphics rendering because they only include a half-baked GPU, the Cell advantage is lost.

Maybe all this Cell talk was to panic MS into including mega-costly hardware which actually outperforms PS3 but which costs so much MS goes bankrupt? :p
 
I don't think this thread is related to Nvidia's GPU naming scheme, so please stay on topic.
 
nAo said:
The ratio between flops and memory bandwidth would be much higher this time.. and this is not good :(
As I said it does depend a fair bit on the timeframe. If the PS3 was to become publicly available in more than 18 months as DemoCoder said, 256MB @ 25.6GB/s shared would indeed be quite unthinkable.
Vysez said:
The whole machine, the PS3, might have 256MB or 512MB of XDRAM.
I never pretended the part on the RAM was fact. This thread is about the actual GPU, and this was just a semi-related note. In my opinion, and that's speculation, it'll have 256MB of XDRAM if released at a pricepoint at or below $299 or if it's released earlier than most expect, and 512MB otherwise.
nAo said:
while VU1 + GS (or SPEs + nvidia GPU) render frame N.
I must have been wrong there, confused a few buffering notions in my reasoning eh, stupid me. Disregard that. So yes, that would also work. However, many games would still be buffering transformed vertex data; I really don't think memory or bandwidth would be a problem. If you cache a Vec3 (or a Vec4 if you need W), you only need 12MB for 1 million vertices. Storing it with only 16 bits/component might work too, but I doubt that'll be supported. For very high polygon scenes, 256MB might be problematic too, though.
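Spelling that estimate out (assuming 32-bit floats): 1,000,000 vertices x 3 components x 4 bytes = 12MB; a Vec4 with W would be 16MB, and 16 bits per component would roughly halve those figures.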

Vysez said:
The TnL is done by the APUs, which might be helped by the PU.
No, I said the PE could help for branch-heavy vertex shaders if the programmer wants to do that, but most of the time it won't even be a good idea at all. The SPEs are what's used 99% of the time for TnL.
Vysez said:
The GPU might, or not, have some eDRAM for the Zbuffer, or maybe as framebuffer; nothing has been ruled out.
You can rule out the color part of the framebuffer being stored anywhere other than in main memory.
Vysez said:
The GPU might be SM3.0, or not. It might be a beefed up NV40... Or not.
It won't feature much more than PS3.0 functionality (if any at all, considering PS3.0 is a flexible standard and doesn't really include many non-PS restrictions).
Vysez said:
Regarding performance, the GPU could be good, although it might not be that good... That depends on different unknown variables...
The GPU is godly. It's a real revolution. It'll kill the competition. That is, if the competition was a 3DFX Voodoo Graphics. My point just is that we don't even have reliable R520 specs to compare it against, so precise GPU performance information wouldn't do us any good. Of course it's gonna be faster than today's high-end PC GPUs, and of course it's going to be clocked at more than 500MHz. But that doesn't tell you much, now does it? :)


Uttar
EDIT: Added a "doesn't" in the PS3.0 response, stupid typo, made no sense at all this way eh
 