ihamoitc2005 said:
Question is not if texture is needed but what proportion of total shader cycles is required in typical situation. Maybe a developer with real experience can provide us with this answer.
If there is even a single filtered texture in a shader, then RSX fails to sustain peak theoretical shader operations, even at 100% efficiency. Since textures are needed for useful shading power, we should consider real-world performance to lie somewhere between max texture usage and no texture usage, not at either end, correct? (an average, as you've said). However, looking at R520 and G70, how often does the latter actually gain twice as much speed? Even when there are fewer textures, in how many situations can that second ALU actually be used effectively? Some games, like F.E.A.R., already have a higher than 3:1 ratio of arithmetic to texture ops, yet the lead G70 has matches fairly well with its advantage in fragment pipes × clockrate.
ihamoitc2005 said:
My understanding of Aaron's comments was that bandwidth is insufficient for sufficient fill-rate not just for peak fill-rate but for simply rendering at same quality as Xenos. If you look at his post then it seems this is what he is trying to say. ROP count is secondary to his main statement: "The fill-rate of RSX will be severly limited by the memory bandwidth available."
You misinterpreted, then. He said that half of the ROPs were pointless because it doesn't have the bandwidth to support them. This has nothing to do with sufficient fillrate, but rather peak fillrate (and how realistic and approachable it is). And ROP count is integral to his main statement, since it DEFINES what the fillrate is. Do not separate them. The statement that it will be limited has additional and further implications beyond simply the 8 vs. 16 ROPs, but the main point is that there's no way it has the bandwidth to support 16 ROPs in a realistic situation (i.e., with textures and geometry). Once it's down to 8 ROPs there's still the possibility of running into bandwidth issues, but it is far less severe.
ihamoitc2005 said:
With Heavenly Sword and other real-time demos we know fill-rate is sufficient to equal or be better than reference platform so therefore it is not a limitation. As for extra ROPs, if included, maybe it is for SSAA.
Er... sufficient for a videogame != sufficient for peak fillrate (which is what the original argument stemmed from: Xenos vs. RSX fillrate. In which case, Xenos has exactly the amount of bandwidth it needs to support 4 Gigapixels with 4 samples each at 32bpp: 4 billion pixels × 4 samples each × 64 bits (color + Z/stencil) × 2 (read and write) = 256GB/s.)
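To sanity-check that figure, the arithmetic works out exactly (a quick sketch; the 8 bytes/sample assumes 32-bit color plus 32-bit Z/stencil, as above):

```python
# Back-of-the-envelope check of the Xenos eDRAM bandwidth math above.
pixels_per_sec = 4e9      # 4 Gigapixels/s peak fillrate
samples_per_pixel = 4     # 4x MSAA
bytes_per_sample = 8      # 32-bit color + 32-bit Z/stencil
rw_factor = 2             # read-modify-write traffic
bandwidth_gbps = pixels_per_sec * samples_per_pixel * bytes_per_sample * rw_factor / 1e9
print(bandwidth_gbps)     # -> 256.0
```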
ihamoitc2005 said:
What problems do you feel exist with organization of "nV" pipelines? Real world performance of G70 is excellent and many believe it is extremely effective and proven design.
Not necessarily problems, but it goes against what I see as the preferred method: KISS. ATI, in the PC space, now has a decoupled TMU. The ALU setup consists of just one fully capable Vec3+scalar ALU with a mini-ALU (which, according to them, is only still there because they have a good compiler for this arrangement). Then one ROP for that. Xenos takes it further by fully dissociating the TMUs, removing the mini-ALUs (AFAWK), and keeping ROPs in line with what can realistically be supported.
G70 has two ALUs, two mini-ALUs, does partial-precision normalize, and the first ALU is tied to the texture address processor and is consumed by a texture fetch. You always use the two ALUs (hopefully, though not some of the other hardware), but ultimately, how well does it work when using both for arithmetic ops? Despite the first ALU being upgraded to full MADD capability, there hasn't really been a performance jump. So far it seems inefficient, like a waste of hardware. Xenos' design changes in that regard, as well as unified shaders, just define Xenos as the more elegant solution IMHO.
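To make the dual-issue argument concrete, here's a toy cycle-count model (my own simplification, not how NVIDIA's actual scheduler works): ALU1 issues either a texture op or a math op each cycle, ALU2 issues only math, so texture-heavy shaders leave ALU2 idle:

```python
import math

def g70_pipe_cycles(tex_ops, alu_ops):
    # ALU1 must cover all texture ops; remaining math ops split between
    # ALU1's free slots and ALU2. Toy model only, ignores mini-ALUs etc.
    return max(tex_ops, math.ceil((tex_ops + alu_ops) / 2))

print(g70_pipe_cycles(1, 3))  # F.E.A.R.-like 3:1 math:tex -> 2 (both ALUs busy)
print(g70_pipe_cycles(3, 1))  # texture-heavy 1:3 mix      -> 3 (ALU2 mostly idle)
```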
It's powerful, effective, and proven. Doesn't mean it isn't broken, however. Unified shaders wouldn't exist for either nv or ati, otherwise. They both have part of what is probably the right design, however. Units that can serve as fragment ALUs, vertex ALUs, or TMUs. Of course, the latter might be extremely expensive in transistor costs, preventing GPUs from going in that exact direction (especially if texture growth stays the same while shader growth continues to increase relative to that).
ihamoitc2005 said:
I do not know of which charts you are speaking since I have not seen any with incorrect Xenos info, only correct Xenos info, but there is uncertainty of what is RSX precisely due to some incompatible specs with RSX such as dot-product and what changes to pipeline architecture are made as well as precise method of accessing CELL and XDR.
Mostly-correct info, but it's PR-glossed. It misses the fact that each Xenos ROP can support 4 multisamples, so while RSX's fillrate goes to 17.6 Gsamples at 2xAA, Xenos makes a second jump at 4xAA to 16 Gsamples, making it quite comparable to RSX. And saying that RSX has 48GB/s of bandwidth to Xenos' 22.4, while completely ignoring the 256GB/s for all the bandwidth-consuming parts of framebuffer ops? Heh.
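The fillrate comparison above reduces to ROPs × clock × samples per ROP per clock (using the commonly cited 550MHz/16-ROP figure for RSX and 500MHz/8-ROP for Xenos; treat the RSX numbers as the chart's claim, not confirmed specs):

```python
def gsamples_per_sec(rops, clock_mhz, samples_per_rop_per_clock):
    # Peak multisample fillrate in Gsamples/s
    return rops * clock_mhz * samples_per_rop_per_clock / 1000.0

print(gsamples_per_sec(16, 550, 2))  # RSX at 2xAA   -> 17.6
print(gsamples_per_sec(8, 500, 4))   # Xenos at 4xAA -> 16.0
```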
And why should we toss away the other info that chart presents [2x (Vec4+scalar) + fp16 normalize and all the subsequent data] because of one dot-product number, which isn't entirely dependent on GPU calculations? The reasoning just doesn't make much sense to me. As for the precise method of accessing Cell and XDR, well, I'm waiting for all the nitty-gritty on Xenos too. I don't see that coming, even after the console's release (though it's undoubtedly floating around in one of those white papers that most other people seem to have).
ihamoitc2005 said:
How are you certain that "development time" is restriction on Xenos output quality? Do you have information that it is not memory bandwidth restriction or other hardware limitation such as unified shader inefficiency that causes low output for games such as PGR3 with false 720P with upscaling and 30fps?
Perhaps because it is? PGR3's team wanted to rewrite the engine once they got further hardware. Obviously there were problems or extra room there. And why shouldn't one of the problems be one that we've known about since Dave's article (which it seems you haven't read)? Tiled rendering really needs to be designed in while you're making the game, rather than tacked on at the end (if that would even be realistically possible, considering everything). Without tiling, you're limited to 720p, or lower with ANY level of AA on it. And, as far as everyone has said, that is precisely why PGR3 renders at the lower resolution (which just happens to be the right amount for 2xAA in the eDRAM framebuffer? Heh). Why would it ever be a memory bandwidth restriction?
Sorry, but there is 256GB/s of framebuffer bandwidth. With 2xAA, only half of that can even be used, because fillrate tops out first (at 8 Gsamples/s). So the only possible bottlenecks are fillrate, RAM bandwidth, or something else. Fillrate is unlikely: at 60fps and 720p, that's over 70 pixel writes per output pixel (someone correct me if I'm horribly off). 22.4GB/s to main RAM is a possibility, but if so, then PS3 is in equal trouble, since it has to fit the framebuffer into little over double that. If there's any problem besides tiled rendering, then it's in the computational core. But tell me, why would the unified shaders be inefficient? Why would they have continued development of unified shaders if they couldn't get reasonable performance? And would those performance losses, in the unlikely case they existed, be greater than the performance gains from automatic load balancing?
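The "over 70 pixel writes per output pixel" figure checks out (rough arithmetic, assuming Xenos' 4 Gpixels/s peak and ignoring overdraw patterns):

```python
fillrate = 4e9                     # Xenos peak fillrate, pixels/s
fps, width, height = 60, 1280, 720 # 720p at 60fps
writes_per_output_pixel = fillrate / (fps * width * height)
print(round(writes_per_output_pixel, 1))  # -> 72.3
```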
ihamoitc2005 said:
If Xenos is super efficient and capable as some say then 60fps graphics output for games like PGR3 should be "piece of cake" no? Remember at false-720P resolution, inexperience with tiling method cannot explain poor performance because no tiling is required and eDRAM is enough for entire frame.
You ignore the system as a whole to focus on the GPU, despite earlier saying CELL could help RSX do more vertex processing? What was Carmack's comment on the new CPUs? I believe he said Xenon was comparable to a 1.6GHz OoOE CPU if you just took your old code and put it on there. And how long did PGR3 have with final hardware? There's also the issue that the game uses extremely high-quality textures in a very large world, and Xenos' filtered texture fetch rate is outpaced by the X1800XT and 7800GTX 512MB (peak), which could potentially be a problem. Also, doesn't PGR3 use rendertargets for the cubemaps used in reflections? Since you might use them on more than just one car, these would multiply the number of times you're rendering geometry and such by a huge number. Of course, I don't know exactly how they're doing everything (and if that's correct), or how bad that would be if that's what they're doing.
ihamoitc2005 said:
He is referring to it but he does not know it. Entire purpose of LS design is to avoid "hitting" RAM. He is saying LS based design prevents SPU from "hitting" RAM, I am saying that is a good thing he is saying that is a bad thing.
No. You misunderstand what he's referring to.
ihamoitc2005 said:
Original Xbox had 6.4GB/s unified memory and can perform 720P at 60fps so what is your estimate of PS3 capability and effectiveness of compression with 48GB/s available? Also this is not my proposal, it is PS3 design. CELL can access GDDR3 and GPU can access XDR.
Think of it as 48GB/s unified memory.
Then you remove your fillrate argument presented earlier. Good to hear; it was somewhat out there in the first place. (But it's PS3's design to split the framebuffer in half across two separate memory pools? Just because the GPU can render into XDR memory doesn't mean you want it to, especially for framebuffer ops! Not only that, but last I heard, Xbox had problems with some game with rain in it, due to having far less framebuffer bandwidth than PS2. Particles = fillrate. Remember that.)