Funny how it ended up branded as doubling the clock frequency. You know, as if it wasn't inflated already.
May I kindly suggest you peruse the following pamphlet?
> May I kindly suggest you peruse the following pamphlet?
Thanks, but I have sentiments for this one.
In the end, clock speed is a design choice. Higher clock speeds take more area and consume more power. More functional units also take up more area and consume more power. But more functional units are frequently also less efficient (though graphics processing scales exceedingly well with the number of pipelines, GPGPU will benefit hugely from higher clocks and fewer pipes).
So with each new process, and with each new iteration of GPU technology, exactly which clock speed is most efficient varies.
Personally, I think that the choice of high-speed ALUs on the G80 is going to be the biggest differentiator between the G80 and R600. More than anything else, I think that design choice is going to make the biggest difference (though the other choices will clearly have their own benefits and drawbacks as well).
> Thanks, but I have sentiments for this one.
Wow, you're right. That's a fascinating paper. Thanks for sharing it, and apologies for the sarcastic tone of my earlier reply.
The P4's ALU simply doesn't deserve to be called an "ALU operating at double frequency", because if it did, you'd expect it to: a) process 2 independent instructions just as fast as 2 dependent ones, and b) expose the results of those 2 instructions at the higher clock rate. The P4's ALU does neither.
To answer this question, you should ask yourself why AMD and Intel reached that clock frequency years ago on their CPUs, while GPUs were still in the couple-hundred-MHz range on the same process.
Put simply, it's all about design.
> Yeah. I think it was a smart design choice as well. The silicon obviously can bear higher frequencies than we have seen on GPUs, so the question was: could they design part of the pipeline to hit higher frequencies while being an overall win? I think the fact that G80 has 128 smaller ALUs opened the door for getting their hands dirty and hand-tweaking the layout. The work spent on getting one ALU design to a high frequency benefits the others.
> I am curious how long it takes other parts of the GPU to get this sort of TLC.
Historically it seems there was little point in increasing the clock of the ALUs, since the TMUs, being inline, couldn't keep up without investing a lot of effort in them too. TMUs seem to consume at least as much die area as the ALUs, maybe twice as much.
> Yeah. I think it was a smart design choice as well.
Well, I didn't say it was a smart design choice. I just said that it will, I think, be the biggest reason for performance differences between the parts (whatever those differences turn out to be). Whether or not it was the better design choice remains to be seen (and I don't think it'll be evident immediately upon the release of the R600, either).
When I chatted with NVIDIA about G80, I was told directly that the shader units were full custom (or as close to full custom as anyone has done in the graphics industry). Now, as to what exact level of "custom" they were referring to, I couldn't say.
> I think the fact G80 has 128 smaller ALUs opened the door for getting their hands dirty and hand-tweaking the layout.
Those 128 smaller ALUs are marketing bullcrap. The G80 is still doing the same 16-way SIMD, just organized a bit differently.
> Those 128 smaller ALUs are marketing bullcrap. The G80 is still doing the same 16-way SIMD, just organized a bit differently.
It's not bullcrap, it's AoS vs. SoA. And the ALUs are 8-way now, FYI, not 16-way.
The "same 16-way SIMD" as what?
> It's not bullcrap, it's AoS vs. SoA. And the ALUs are 8-way now, FYI, not 16-way.
Aside from eliminating the need for horizontal ops, such as dot products, I don't see how AoS vs. SoA makes any difference to the ALUs. You're still executing a single instruction in SIMD fashion. Register file access is simplified a bit, though, by eliminating the need for a layer of muxes to support swizzles.
> And I still believe the G80 ALUs are 16-way SIMD. You have 128 execution units total, arranged into 8 processors, each capable of executing a single instruction over 16 pixels. AFAIK there is no way to issue different instructions at smaller than 16-pixel granularity.
Think about 8 ALUs working on 16 pixels over 2 clock cycles.
> Those 128 smaller ALUs are marketing bullcrap. The G80 is still doing the same 16-way SIMD, just organized a bit differently.
A bit differently, huh? So based on the "old" marketing, how many ALUs are in G80 exactly?
> Think about 8 ALUs working on 16 pixels over 2 clock cycles.
Re-reading the B3D G80 review (which is excellent, btw), it seems so, though I can't see any reason why it should be like this. So it looks like 8 processors, each with two 8-way SIMD units clocked at double the control logic, being effectively 16-way SIMD. Can the two 8-way SIMDs be executing different instructions? If not, I can't see any difference from 16-way SIMD. If yes, it would be interesting to know what parts are shared. Constant registers, shader code, anything else?
> Can the two 8-way SIMDs be executing different instructions?
Yes.
> If not, I can't see any difference from 16-way SIMD. If yes, it would be interesting to know what parts are shared. Constant registers, shader code, anything else?
It's likely that at least two different groups of threads can be executed, with different code and some constants (or better, some constants cache).
> Normally, if you hear 128 scalar ALUs you'd think that the processor can be executing 128 different instructions at once, i.e. 128-way superscalar (which would of course require ridiculous amounts of control logic).
Nothing wrong with your definition, but I can't see why you'd require executing different code on each ALU.