I'm more than tired of the same unrealistic crap being repeated again and again about the PS3 GPU (which I will name NV5A here, just for simplicity's sake). I encourage you to consider the following analysis to be little more than a rumor, although in all honesty it's reliable information with "obvious" speculation filling in the gaps. While in no way perfect, it is still a good order of magnitude more precise and plausible than anything else I've seen so far, so read on & enjoy.
Untransformed vertex data, along with potential tessellation information, initially resides in main memory. The SPEs then handle VS and potentially PPP functionality, if the programmer bothers to implement it. Whether tessellation is implemented before or after the VS (or both) is thus obviously up to the programmer. In some cases, the PE might assist, although I doubt that would happen unless the SPEs' branch prediction is truly subpar.
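To make that a bit more concrete, here's a minimal sketch in plain C++ of the kind of batched transform work one SPE could loop over. The types and the scalar loop are my own assumptions, purely for illustration - a real implementation would obviously use the SPE's SIMD instructions and DMA batches sized to fit in local store:

#include <cstddef>

// Hypothetical layout: the real SDK types and any SIMD intrinsics are unknown to me.
struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; };   // row-major, row i dotted with the input vector

// Plain scalar version of the work one SPE might loop over: read a batch of
// untransformed vertices brought into local store, write transformed ones back out.
void transform_batch(const Mat4& mvp, const Vec4* in, Vec4* out, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) {
        const Vec4 v = in[i];
        out[i].x = mvp.m[0][0]*v.x + mvp.m[0][1]*v.y + mvp.m[0][2]*v.z + mvp.m[0][3]*v.w;
        out[i].y = mvp.m[1][0]*v.x + mvp.m[1][1]*v.y + mvp.m[1][2]*v.z + mvp.m[1][3]*v.w;
        out[i].z = mvp.m[2][0]*v.x + mvp.m[2][1]*v.y + mvp.m[2][2]*v.z + mvp.m[2][3]*v.w;
        out[i].w = mvp.m[3][0]*v.x + mvp.m[3][1]*v.y + mvp.m[3][2]*v.z + mvp.m[3][3]*v.w;
    }
}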
The transformed vertex data is then sent to another memory location. Whether the last frame's vertex data is kept around depends on whether CELL waits for the NV5A to finish consuming it before transforming the next batch of vertices. Some other tricks, such as an efficient way of checking whether a render call has been fully processed, would make this more efficient; Occlusion Queries are already quite similar to such a technique. An obvious advantage of this technique, and of the lack of a direct bus between CELL and the NV5A, is that it significantly reduces the difference between static and dynamic vertices, at least from the GPU's point of view.
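Here's a rough sketch of what such a completion check could look like, in plain C++. None of these names come from any real SDK; I'm just illustrating the double-buffering plus "has the GPU fully processed this render call?" idea, which is very much in the spirit of an Occlusion Query:

#include <atomic>
#include <cstdint>

struct VertexBuffer;   // opaque; transformed vertices written by the SPEs

struct FrameSlot {
    VertexBuffer*              verts = nullptr;
    std::atomic<std::uint32_t> gpu_done{1};   // 1 = GPU is done with this slot (starts free)
};

FrameSlot g_slots[2];   // CELL fills one slot while the GPU consumes the other

void begin_frame(int frame)
{
    FrameSlot& slot = g_slots[frame & 1];

    // Wait (or do other SPE work) until the GPU has signalled it finished reading
    // this slot's vertices from its previous use - the "has this render call been
    // fully processed?" check discussed above.
    while (slot.gpu_done.load(std::memory_order_acquire) == 0) { /* spin or run other jobs */ }
    slot.gpu_done.store(0, std::memory_order_relaxed);

    // ... SPEs now write this frame's transformed vertices into slot.verts ...
    // ... the GPU is then kicked and asked to set slot.gpu_done back to 1 when done ...
}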
The NV5A then handles clipping, triangle setup, and rasterization. Since the SPEs handle Vertex Shading, most games will begin rendering with a Z-only pass, a la Doom3. Like in Doom3, and unlike in 3DMark03, there won't be any "redundant transformations", as these operations are done exclusively on the SPEs, that is, on the CPU. This implies the cost of a Z-only pass on the PS3 is exclusively an "idle Pixel Shading transistors" one. Bandwidth wouldn't (always) be wasted either, as it is shared with CELL, and whatever the NV5A doesn't use, CELL can. In order to minimize power usage and heat dissipation, it is possible that MediaQ-like technology would be used to "switch off" the Pixel Shading transistors during this initial Z-only pass.
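For clarity, here's the submission order such a Z-only pass implies, sketched in C++ with completely made-up types and a stand-in submit() call - this is not any real PS3 or NVIDIA API:

#include <vector>

struct PipelineState {
    bool colorWriteEnable;
    bool depthWriteEnable;
    enum DepthFunc { LESS, EQUAL } depthFunc;
};

struct DrawCall { /* vertex buffer handle, pixel shader handle, ... */ };

// Stand-in for pushing state plus a draw into the command buffer (not defined here).
void submit(const PipelineState& state, const DrawCall& dc);

void render_opaque(const std::vector<DrawCall>& opaque)
{
    // Pass 1: depth only. No colour writes and no pixel shading work, so only the
    // Z/raster ("Zixel") path is busy - the cheap pass described above.
    PipelineState zOnly{ /*colorWriteEnable=*/false, /*depthWriteEnable=*/true, PipelineState::LESS };
    for (const DrawCall& dc : opaque)
        submit(zOnly, dc);

    // Pass 2: full shading with the depth test set to EQUAL, so each visible pixel
    // is shaded at most once and hidden fragments are rejected before shading.
    PipelineState shade{ /*colorWriteEnable=*/true, /*depthWriteEnable=*/false, PipelineState::EQUAL };
    for (const DrawCall& dc : opaque)
        submit(shade, dc);
}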
The actual framebuffer is in main memory, but it is not out of the question to have a bit of onboard EDRAM for Hierarchical Z. For the technically challenged, Hierarchical Z - which also bears other names, and which some might attribute to other things - is a technique that minimizes bandwidth usage by keeping a much lower resolution version of the Z buffer, with each pixel "block" storing a minimum and maximum Z value. On ATI GPUs, and possibly on NVIDIA ones too, it is also currently the only (known) way for Early Z to work with Alpha Testing, a technique often used on trees, for example.
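For those wondering what that actually looks like, here's a small, self-contained C++ sketch of a min/max Hierarchical Z buffer. The 8x8 block size and 16-bit Z values are my own assumptions, purely for illustration:

#include <algorithm>
#include <cstdint>
#include <vector>

struct HiZ {
    int blocksX, blocksY;
    std::vector<std::uint16_t> zMin, zMax;   // one min and one max per 8x8 block
};

HiZ build_hiz(const std::vector<std::uint16_t>& depth, int width, int height)
{
    const int B = 8;
    HiZ h{ width / B, height / B, {}, {} };
    h.zMin.assign(h.blocksX * h.blocksY, 0xFFFF);
    h.zMax.assign(h.blocksX * h.blocksY, 0x0000);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            const int b = (y / B) * h.blocksX + (x / B);
            const std::uint16_t z = depth[y * width + x];
            h.zMin[b] = std::min(h.zMin[b], z);
            h.zMax[b] = std::max(h.zMax[b], z);
        }
    return h;
}

// Conservative early rejection, assuming smaller Z means closer and a "less than"
// depth test: if the nearest Z the primitive can produce in this block is still no
// closer than the farthest Z already stored there, nothing in the block can pass.
bool block_fully_occluded(const HiZ& h, int bx, int by, std::uint16_t primNearestZ)
{
    return primNearestZ >= h.zMax[by * h.blocksX + bx];
}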
For this task, however, EDRAM is only preferable over a traditional cache system if the resolution of this min/max Z buffer is sufficiently high. There's no point using EDRAM for a 128x128 buffer, which would most likely require about 64KB (128x128x(2x16) bits). On the other hand, there's little point NOT using EDRAM for a 1024x1024 one. Please note that non-square resolutions -could- be used for these buffers, but I'm only using powers of 2 here to simplify the calculations. Personally, I doubt there's any truth to the EDRAM rumors for the NV5A, although it's not out of the question at all, particularly if the total system bandwidth is indeed limited to 25.6GB/s. But "hoping" for a significant part of the die to be used by EDRAM is ridiculous at best. The real high-resolution framebuffer, with Z *and* color information, will always be kept in main memory. And it's not like NVIDIA ever liked EDRAM much anyway, although it is true that Sony and Toshiba have used it extensively in the past.
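If you want to check the arithmetic, here's the trivial calculation, again assuming a 16-bit min and a 16-bit max per entry (4 bytes each):

#include <cstdio>

// Bytes needed for a min/max Z buffer of the given block resolution.
unsigned long long hiz_bytes(unsigned w, unsigned h)
{
    return (unsigned long long)w * h * 2 * (16 / 8);
}

int main()
{
    std::printf("128x128:   %llu KB\n", hiz_bytes(128, 128) / 1024);    //   64 KB
    std::printf("1024x1024: %llu KB\n", hiz_bytes(1024, 1024) / 1024);  // 4096 KB = 4 MB
    return 0;
}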
In such an architecture, it is not out of the question for Pixel Shaded data to be used by CELL and then sent back to the GPU. Framebuffer access by the CPU should be fast. Framebuffer post-processing by the CPU might not be out of the question either, although I sincerely doubt the performance of such a technique.
The NV5A architecture, considering the common (but not forced) Z-pass, tends to focus not only on high "Zixel" throughput, but also on fast Z rejection - in that way it isn't really very different from PC parts, it's just a bit more optimized towards it (also refer to the paragraphs about Hierarchical Z above). Also, you shouldn't expect a mind-blowing featureset. Unlike in the PC market, it's only about gaming, which means wasting transistors on features that aren't usable in realtime just isn't much of an option. That's why you shouldn't expect much of anything above Pixel Shaders 3.0, and that means it's likely to have a fair bit of NV4x influence. But yes, it's still next-generation, as NVIDIA's upcoming high-end GPU is still largely based on the NV40. You shouldn't expect much more of an architectural difference there than between the NV30 and the NV35 for this new part, even though it's likely that, just like ATI did with the R420, NVIDIA is going to rename it NV50 or something along those lines. The originally planned NV50 architecture might have to wait for Microsoft to release Longhorn, or at least for there to be a more official and final release schedule for it. The "new" NV50 is unlikely to be much more than an NV40 on steroids, though, and it's likely this optimized architecture - and maybe even that new NV50 - will be NVIDIA's high-end for at least another year.
Another important point is the amount of RAM present in the whole system. What most people seem to forget is that the traditional CPU-GPU architecture does have a fair bit of redundancy; textures are often also kept in system memory, even when present in video memory. This COULD be changed thanks to the high GPU->CPU speed of PCI-Express, but I doubt it will be anytime soon, as it requires a fair bit of API changes, and potentially driver ones too. Most obviously, a console properly handling shared memory would thus require a lot less total memory than a PC. Other factors help too, such as lighter OS overhead and the SPEs doing the VS work, which further reduces buffer redundancy. What I'm trying to say here is that there's no reason in hell the PS3 needs 512MB of high-speed XDR RAM. Yes, it would be an advantage for it to have 512MB, but it wouldn't be the end of the world if it didn't, and although not totally out of the question, it most likely won't have any more than 256MB for obvious economical reasons.
Finally, when it comes to performance, it entirely depends on how many transistors Sony has agreed to put in the design. If, and only if, there were as many transistors in the PS3's GPU as in the Xenon's, it would likely have higher average pixel shading performance, since it won't have to deal with any kind of vertex shading. You have to realize, however, that a 220M-transistor CELL CPU would be quite expensive for a console, and if it were to be in the PS3, it's likely Sony would try to cut costs elsewhere - that also entirely depends on when the PS3 becomes publicly available, though.
Uttar