G80 Shader Core @ 1.35GHz, How'd They Do That?

May I kindly suggest you peruse the following pamphlet?
Thanks, but I have my own sentiments about this one.

The P4's ALU simply doesn't deserve to be called an "ALU operating at double frequency", because if it did, you'd expect it to: a) process 2 independent instructions just as fast as 2 dependent ones, and b) expose the results of these 2 instructions at the higher clock rate. The P4's ALU does neither.

It would more accurately describe the computational power of the P4 ALU to say that, in some circumstances, the ALU can take 2 dependent instructions, fuse them, and execute them as a monolithic 3-operand instruction. The tricks with the rising clock edge, splitting data into 16-bit halves, etc. are just implementation details. Of course, "double frequency" looked better on paper, and at the peak of the Megahertz Wars it looked like pure pwnage.
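To put that in code, a loose sketch of the idea (plain C, not actual x86 or P4 microcode):

```c
/* Loose sketch of the "fused dependent pair" view of the P4's
 * double-pumped ALU, in plain C; not actual x86 or microcode. */
int fused_add3(int a, int b, int c) {
    int t = a + b;   /* low 16 bits forwarded after half a clock */
    /* the dependent add picks them up on the opposite clock edge,
     * so the pair behaves like one monolithic 3-operand instruction */
    return t + c;
}
```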
 
In the end, clock speed is a design choice. Higher clock speeds take more area and consume more power. More functional units also take up more area and consume more power. But more functional units are frequently also less efficient (though graphics processing scales exceedingly well with the number of pipelines, GPGPU will benefit hugely from higher clocks even at the cost of fewer pipes).

So, with each new process, and with each new iteration of GPU technology, exactly what clock speed is most efficient varies.
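As a back-of-envelope illustration of the power side of that tradeoff, here's the textbook first-order model, with all the numbers invented:

```c
#include <stdio.h>

/* First-order dynamic power model: P = C * V^2 * f. Higher clocks
 * usually also demand higher voltage, so power grows faster than
 * linearly with frequency. All numbers below are invented purely
 * to illustrate the shape of the tradeoff. */
int main(void) {
    double cap = 1.0;                 /* normalized switched capacitance */
    double f_lo = 0.575, v_lo = 1.0;  /* "wide and slow" design point    */
    double f_hi = 1.35,  v_hi = 1.2;  /* "narrow and fast" design point  */
    printf("P(slow) = %.2f\n", cap * v_lo * v_lo * f_lo);  /* 0.58 */
    printf("P(fast) = %.2f\n", cap * v_hi * v_hi * f_hi);  /* 1.94, ~3.4x */
    return 0;
}
```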

Personally, I think the choice of high-speed ALUs on the G80 is going to be the biggest differentiator between the G80 and R600. More than anything else, that design choice is going to make the difference (though other choices will clearly have their own benefits and drawbacks as well).

Well said, Chalnoth. I think the design cycle has something to do with it as well. GPU designers have less time to spend optimizing individual circuits because they are looking for the fastest execution time across as many threads as possible, while CPU designers spend a great deal of time optimizing single-thread performance. I believe G80 had about a four-year development cycle, which might be a bit longer than other designs they have worked on.
 
Thanks, but I have my own sentiments about this one.

The P4's ALU simply doesn't deserve to be called an "ALU operating at double frequency", because if it did, you'd expect it to: a) process 2 independent instructions just as fast as 2 dependent ones, and b) expose the results of these 2 instructions at the higher clock rate. The P4's ALU does neither.
Wow, you're right. That's a fascinating paper. Thanks for sharing this and apologies for the sarcastic tone of my earlier reply.
 
To answer this question, you should ask yourself why AMD and Intel reached that clock frequency years ago on their CPUs, while GPUs were still in the couple-hundred-MHz range on the same process.

Put simply, it's all about design.

Yeah. I think it was a smart design choice as well. The silicon obviously can bear higher frequencies than we have seen on GPUs, so the question was whether they could design part of the pipeline to hit higher frequencies while remaining an overall win. I think the fact that G80 has 128 smaller ALUs opened the door to getting their hands dirty and hand-tweaking the layout. The work spent getting 1 ALU design to a high frequency benefits all the others.

I am curious how long it takes other parts of the GPU to get this sort of TLC.
 
Yeah. I think it was a smart design choice as well. The silicon obviously can bear higher frequencies than we have seen on GPUs, so the question was whether they could design part of the pipeline to hit higher frequencies while remaining an overall win. I think the fact that G80 has 128 smaller ALUs opened the door to getting their hands dirty and hand-tweaking the layout. The work spent getting 1 ALU design to a high frequency benefits all the others.

I am curious how long it takes other parts of the GPU to get this sort of TLC.
Historically, it seems there was little point in increasing the clock of the ALUs, since the TMUs, being inline, couldn't keep up without a lot of effort being invested in them, too. TMUs seem to consume at least as much die area as ALUs, maybe twice as much.

Additionally, every time you increase the pipeline clock, you have to take into account the fact that texture fetch latency (in terms of absolute time) remains ~constant. So if you clock the pipeline faster, you need to introduce more FIFO stages, which costs area. It seems to me that NV4x-G7x were designed with enough FIFO stages to cater for a range of GPUs from craptastic to enthusiast, where texture fetch latency, expressed in terms of pipeline stages, varies by some factor X. I dunno what X is, though, or how close G71 comes to its latency-tolerance ceiling for bilinear texturing.
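Some rough numbers, purely to illustrate the scaling (the latency figure is invented):

```c
#include <stdio.h>

/* If texture fetch latency is roughly constant in absolute time, the
 * number of in-flight FIFO slots needed to hide it scales linearly
 * with the pipeline clock. The 200 ns latency is invented. */
int main(void) {
    double latency_ns = 200.0;
    double clocks_ghz[] = { 0.40, 0.65, 1.35 };
    for (int i = 0; i < 3; i++)
        printf("%.2f GHz -> ~%.0f cycles of latency to hide\n",
               clocks_ghz[i], latency_ns * clocks_ghz[i]);
    return 0;
}
```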

So, it seems to me that NVidia had to wait until the TMUs were decoupled before it could play with the ALU clocks. Then, for trilinear/anisotropic filtering, the relative throughput of the TMUs is such that it's advantageous to double up the filtering section. So, in a sense, the TMUs are "double-clocked", except of course this is achieved by doubling their area.

Jawed
 
Yeah. I think it was a smart design choice as well.
Well, I didn't say it was a smart design choice. I just said that it will be, I think, the biggest reason for any performance difference between the parts (whatever that difference may be). Whether or not it was the better design choice remains to be seen (and I don't think it'll be evident immediately upon the release of the R600, either).
 
When I had chatted with NVIDIA about G80, I was told directly that the shader units were full custom (or as close to full custom as anyone has done in the graphics industry). Now, as to what exact level of "custom" they were referring to, I couldn't say.
 
When I had chatted with NVIDIA about G80, I was told directly that the shader units were full custom (or as close to full custom as anyone has done in the graphics industry). Now, as to what exact level of "custom" they were referring to, I couldn't say.

I've heard that they used their own libraries or whatever.
 
I think the fact that G80 has 128 smaller ALUs opened the door to getting their hands dirty and hand-tweaking the layout.
Those 128 smaller ALUs are marketing bullcrap. The G80 is still doing the same 16-way SIMD, just organized a bit differently.
 
Those 128 smaller ALUs are marketing bullcrap. The G80 is still doing the same 16-way SIMD, just organized a bit differently.
It's not bullcrap, it's AoS vs SoA. And the ALUs are 8-way now, FYI, not 16-way.


Uttar
 
It's not bullcrap, it's AoS vs SoA. And the ALUs are 8-way now, FYI, not 16-way.
Aside from eliminating the need for horizontal ops, such as dot product, I don't see how AoS vs. SoA makes any difference for the ALUs. You're still executing a single instruction in SIMD fashion. Register file access is simplified a bit, though, by eliminating the need for a layer of muxes to support swizzles.

And I still believe the G80 ALUs are 16-way SIMD. You have 128 execution units total, arranged into 8 processors, each capable of executing a single instruction over 16 pixels. AFAIK there is no way to issue different instructions at smaller than 16-pixel granularity. If you're talking about the previous generation, then you're correct, it was (something like) 8-way SIMD thanks to the 1+3/2+2 issuing schemes.
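To illustrate the horizontal-op and swizzle points, here's how I'd picture the two layouts (a conceptual sketch, not G80's actual register file):

```c
#define N 16  /* pixels per batch; illustrative */

struct vec4 { float x, y, z, w; };

/* AoS: one register holds one pixel's vec4, so a dot product is a
 * "horizontal" op combining components within a single register. */
float dot_aos(struct vec4 a, struct vec4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* SoA: one register holds the same component for N pixels, so the
 * same dot product is just ordinary per-pixel scalar math; no
 * cross-lane step, and no swizzle muxes on register file reads. */
void dot_soa(const float ax[N], const float ay[N], const float az[N],
             const float aw[N], const float bx[N], const float by[N],
             const float bz[N], const float bw[N], float out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = ax[i] * bx[i] + ay[i] * by[i]
               + az[i] * bz[i] + aw[i] * bw[i];
}
```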
 
And I still believe the G80 ALUs are 16-way SIMD. You have 128 execution units total, arranged into 8 processors, each capable of executing a single instruction over 16 pixels. AFAIK there is no way to issue different instructions at smaller than 16-pixel granularity.
Think about 8 ALUs working on 16 pixels over 2 clock cycles.
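Something like this, in sketch form (illustrative only, not a hardware description):

```c
#define LANES 8   /* physical ALUs per processor (in this view)  */
#define BATCH 16  /* pixels covered by one issued instruction    */

/* One instruction is issued once at the slower scheduler clock, but
 * each of the 8 ALUs runs it twice at the fast clock, so 16 pixels
 * are covered. Illustrative sketch, not a hardware description. */
void issue_one_instruction(float regs[BATCH]) {
    for (int clk = 0; clk < BATCH / LANES; clk++)   /* 2 hot clocks */
        for (int lane = 0; lane < LANES; lane++)    /* 8 ALUs       */
            regs[clk * LANES + lane] += 1.0f;       /* same op everywhere */
}
```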
 
Well, I was really trying to ask stepz for his definition of an ALU. For example, why are we dividing by 4? Is it just because older architectures were vec4? It can't be based on the SIMD configuration, because there was a time when the entire shader pipeline was a single SIMD array.

He claims that 128 ALUs is just marketing speak; I'm trying to understand why each ALU should not be counted separately.
 
Think about 8 ALUs working on 16 pixels over 2 clock cycles.
Re-reading the B3D G80 review (which is excellent, btw), it seems so, though I can't see any reason why it should be like this. So it looks like 8 processors, each with two 8-way SIMD units clocked at double the control logic, making it effectively 16-way SIMD. Can the two 8-way SIMDs be executing different instructions? If not, I can't see any difference from 16-way SIMD. If yes, it would be interesting to know what parts are shared. Constant registers, shader code, anything else?

trinibwoy: I'd define ALUs like they do in CPU land, by control unit outputs. If two would-be ALUs can't be executing different instructions, then they're not two different ALUs, but a single SIMD ALU. Although I must say that the terminology for talking about SIMD ALUs and their constituent units is pretty fuzzy. But saying that a subunit of a SIMD ALU is a scalar unit is, in my opinion, seriously misleading. Normally, if you hear "128 scalar ALUs" you'd think that the processor can be executing 128 different instructions at once, i.e. 128-way superscalar (which would of course require ridiculous amounts of control logic).
Oh, and for NVidia's previous generation, I'd say 6 processors of 2-way superscalar, 8-way(ish) SIMD. I'm not sure how decoupled the 6 processors are, though. We really need some decent overviews of different GPU microarchitectures without all the marketing buzzwords, using established microprocessor design terminology where possible. The G80 B3D review is a good step in that direction.
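In sketch form, the definition I mean (illustrative C, not any real ISA):

```c
/* Counting ALUs by control-unit outputs, as above; illustrative
 * C, not any real ISA.
 *
 * SIMD: one decoded instruction (one control output) drives every
 * lane, so by this definition the whole array is a single ALU. */
void simd_alu(int op, float r[8]) {
    for (int lane = 0; lane < 8; lane++)
        r[lane] = (op == 0) ? r[lane] + 1.0f : r[lane] * 2.0f;
}

/* True "128 scalar ALUs" would mean an independent op per lane,
 * i.e. one control output each; that's the ridiculous amount of
 * control logic a 128-way superscalar machine would need. */
void superscalar_lanes(const int ops[8], float r[8]) {
    for (int lane = 0; lane < 8; lane++)
        r[lane] = (ops[lane] == 0) ? r[lane] + 1.0f : r[lane] * 2.0f;
}
```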
 
Can the two 8-way SIMDs be executing different instructions?
Yes..
If not, I can't see any difference from 16-way SIMD. If yes, it would be interesting to know what parts are shared. Constant registers, shader code, anything else?
It's likely that at least 2 different groups of threads can be executed, with different code and some constants (or better, a constants cache :) )
Normally, if you hear "128 scalar ALUs" you'd think that the processor can be executing 128 different instructions at once, i.e. 128-way superscalar (which would of course require ridiculous amounts of control logic).
Nothing wrong with your definition, but I can't see why you'd require executing different code on each ALU.
 