G80 Shader Core @ 1.35GHz, How'd They Do That?

Because the SIMD vector size is a really important performance characteristic that shouldn't be omitted. Information-wise, I think it's equivalent to say 16 8-way SIMD ALUs or 128 ALUs in 8-way SIMD configurations, depending on how you define an ALU. I'm used to the microprocessor architecture world's tradition of calling a SIMD unit a single ALU, and I think it'd be confusing to arbitrarily change that convention when talking about GPUs. GPUs are microprocessors too, you know.
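Just to show that the two ways of counting describe the same amount of hardware, here's a toy calculation (a sketch only; the 16 x 8-wide split is the one mentioned above, not an official spec breakdown):

```c
/* Toy comparison of the two counting conventions for the same hardware.
 * The figures are illustrative, not official G80 specs. */
#include <stdio.h>

int main(void) {
    const int simd_units = 16;   /* "microprocessor" convention: one SIMD unit = one ALU */
    const int simd_width = 8;    /* lanes per SIMD unit */
    const int scalar_alus = simd_units * simd_width; /* "marketing" convention: 128 ALUs */

    /* Peak scalar operations per clock come out identical either way;
     * only the granularity label changes. */
    printf("16 x 8-wide view : %d ops/clock\n", simd_units * simd_width);
    printf("128 x scalar view: %d ops/clock\n", scalar_alus);
    return 0;
}
```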
 
Heh, well, it seems your "marketing bullshit" stamp shouldn't only apply to G80. You make a valid point, but if we use prior generations' common definition of an ALU, then Nvidia hasn't really changed anything by saying G80 has 128 of them. What you're proposing seems to go against the GPU convention that we've been used to for a while now.
 
I personally couldn't give a rat's ass if someone calls it a 16- or 128-ALU design, as long as he bothers to define what each unit is capable of.
 
What you're proposing seems to go against the GPU convention that we've been used to for a while now.
Unfortunately, the current convention doesn't apply to G80 in any reasonably accurate way. That's what you get if you use a convention that ignores reality.
 
It doesn't ignore reality. It simply lists things as they are seen from the point of view of the programming model. You could argue it's not ideal from an information perspective, but I wouldn't call it dishonest or anything either.


Uttar
 
Unfortunately, the current convention doesn't apply to G80 in any reasonably accurate way. That's what you get if you use a convention that ignores reality.

Sure, if you want to take the pedantic approach, but from a marketing point of view I see nothing wrong with it. It's not like they're lying based on some layman's definition of an "ALU". I just think it's a bit late to be re-defining GPU processing units based on the SIMD width.

I also disagree that the current convention doesn't apply to G80. We used to look at units on a per-fragment basis. If we do the same for G80, we come up with 128, so I'm not sure what you're pointing at there.

PS: All this renaming is kinda weird - Uttar is now "Arun Demeure"? It's as if he's a real person all of a sudden! :oops: :p
 
PS: All this renaming is kinda weird - Uttar is now "Arun Demeure"? It's as if he's a real person all of a sudden! :oops: :p
:oops: But yes, staff members will now have their real names used as forum nicknames. Oh well, I guess I'll survive... maybe!
 
I also disagree that the current convention doesn't apply to G80. We used to look at units on a per-fragment basis. If we do the same for G80, we come up with 128, so I'm not sure what you're pointing at there.
It depends on what the necessary criteria for applicability of the current convention are. If you ignore the requirement of being useful for performance predictions and only require that some kind of hardware feature be countable by some rough mapping of the convention, then sure, it applies. I maintain that the so-called per-fragment ALU counts of G70 and G80 are incomparable. (What would the per-fragment ALU count of G70 be, anyway?)
I am pointing out that if your metric doesn't have a clear fundamental microarchitectural counterpart, then it is to be expected that you can't use it to do any relevant comparisons between generations.
I was actually originally trying to point out that the execution units (to use a less overloaded term) really aren't any smaller than in the previous generation. In hindsight, I should have been more polite, more verbose, and less controversial. I'm just annoyed at marketing for taking perfectly good established terminology and then using it completely differently. I object to the images that all this talk about a sea of units and 128 ALUs conjures up. In reality, it's still pretty much the same bunch of heavily multithreaded SIMD cores, with a significantly better-organised thread-allocation system, though.
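To illustrate the "heavily multithreaded SIMD core" picture, here's a minimal sketch, assuming an 8-wide unit stepping a batch of fragments through one instruction in lockstep; the batch size, program and data are invented for the example, not taken from G80:

```c
/* Toy lockstep-SIMD model: one 8-lane unit runs the same instruction
 * for 8 fragments per step.  Purely illustrative. */
#include <stdio.h>

#define LANES 8

int main(void) {
    float r0[LANES], r1[LANES];

    /* Each lane holds the registers of a different fragment. */
    for (int lane = 0; lane < LANES; ++lane) {
        r0[lane] = (float)lane;
        r1[lane] = 0.5f;
    }

    /* One "instruction": every lane performs the same MAD in lockstep.
     * From the programming model's point of view each fragment sees a
     * scalar MAD; from the hardware's point of view it is one 8-wide op. */
    for (int lane = 0; lane < LANES; ++lane)
        r0[lane] = r0[lane] * 2.0f + r1[lane];

    for (int lane = 0; lane < LANES; ++lane)
        printf("fragment %d: %.1f\n", lane, r0[lane]);
    return 0;
}
```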
 
It depends on what the necessary criteria for applicability of the current convention are. If you ignore the requirement of being useful for performance predictions and only require that some kind of hardware feature be countable by some rough mapping of the convention, then sure, it applies.

Well, yeah, that's exactly right - in the past we've just "counted" ALUs in this manner without much regard for architectural differences. That's when the flops counting started... (there's a rough back-of-the-envelope comparison at the end of this post).

I maintain that the so-called per-fragment ALU counts of G70 and G80 are incomparable. (What would the per-fragment ALU count of G70 be, anyway?)

Most people/reviewers counted it as 24.

I am pointing out that if your metric doesn't have a clear fundamental microarchitectural counterpart, then it is to be expected that you can't use it to do any relevant comparisons between generations.

So how do you plan to do comparisons between G80 and R600 ALU counts?

I was actually originally trying to point out that the execution units (to use a less overloaded term) really aren't any smaller than in the previous generation.

But how do you reconcile that with the fact that G80 can have 128 different fragments or vertices in flight at a given stage in the pipeline? What you're proposing would be a marketing nightmare - they would be going from 24 to 16. How do you think that would stack up against AMD's 64/96, etc.? Marketing isn't exactly just some annoyance that corrupts the honest engineering truth. It is an integral part of the process.
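To put rough numbers on the flops counting mentioned above, here's a back-of-the-envelope sketch using the figures commonly quoted at the time (24 fragment pipes with two vec4 MAD-capable ALUs at ~430 MHz for G70, 128 scalar units at 1.35 GHz for G80, a MAD counted as two flops); treat these as ballpark spec-sheet-style numbers, not measurements:

```c
/* Back-of-the-envelope peak-flops counting, in the spirit of the spec-sheet
 * numbers discussed above.  Inputs are the commonly quoted figures, not
 * measurements, and G70's vertex shaders are ignored for simplicity. */
#include <stdio.h>

int main(void) {
    /* G70-style counting: 24 fragment pipes x 2 vec4 ALUs x 4 components
     * x 2 flops per MAD, at ~430 MHz. */
    double g70 = 24 * 2 * 4 * 2 * 0.43e9;

    /* G80-style counting: 128 scalar units x 2 flops per MAD, at 1.35 GHz. */
    double g80 = 128 * 2 * 1.35e9;

    printf("G70 pixel-shader peak: ~%.0f GFLOPS\n", g70 / 1e9); /* ~165 */
    printf("G80 shader-core peak : ~%.0f GFLOPS\n", g80 / 1e9); /* ~346 */
    return 0;
}
```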
 
Well, pipeline counts are beside the point, as far as I'm concerned. They just aren't going to be very comparable, since nVidia's going scalar and ATI's going vector. So it'd be best just to go by the benchmarks and be done with it.

And right now, the benchmarks say that the 8800 is really a beastly processor. How will it look when the R600 comes out? Well, it should be very interesting, at the least.
 
Well, pipeline counts are beside the point, as far as I'm concerned. They just aren't going to be very comparable, since nVidia's going scalar and ATI's going vector. So it'd be best just to go by the benchmarks and be done with it.

And right now, the benchmarks say that the 8800 is really a beastly processor. How will it look when the R600 comes out? Well, it should be very interesting, at the least.

Yeah, but you can't put benchmarks on a retail box or spec sheet, which is the issue at the root of this discussion, methinks.
 
Yeah, but you can't put benchmarks on a retail box or spec sheet, which is the issue at the root of this discussion, methinks.
Quite right. But spec sheets have always been full of misleading crap, and I don't see why that should stop now. See, for example, the tradition of using graphics card memory size in the required and recommended specifications for games.
 
I believe Nvidia was able to raise the clock so much mainly due to the fact that the ALUs got simpler: 32-bit scalar instead of 128-bit vec4.

CPUs can achieve higher clocks because they have fewer transistors, so for the same frequency there is less heat dissipation and therefore less power demand.

Frequency depends on the number of transistors: you can have few transistors with a high clock speed, or many transistors with a low clock speed. It depends on what you want to do. If you want more pipelines, each more complex (the GPU approach), then you need more transistors, hence you cannot raise the frequency as much. If you want a few pipelines, each not so complex (the CPU approach), then the transistor count is lower, hence you can raise the frequency much more.
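To illustrate the scalar-vs-vec4 point in the first paragraph above (a sketch of issue granularity only, not of either chip's actual datapath):

```c
/* Illustration of vec4 vs scalar issue.  Not a model of the real hardware;
 * just the difference in per-unit work granularity. */
#include <stdio.h>

typedef struct { float x, y, z, w; } vec4;

/* What a vec4-style ALU does in a single issue: one MAD across 4 components. */
static vec4 vec4_mad(vec4 a, vec4 b, vec4 c) {
    vec4 r = { a.x * b.x + c.x, a.y * b.y + c.y,
               a.z * b.z + c.z, a.w * b.w + c.w };
    return r;
}

/* What a scalar unit does per issue: one component at a time, so the same
 * vec4 MAD takes 4 issues on one unit (or 1 issue spread over 4 units). */
static float scalar_mad(float a, float b, float c) {
    return a * b + c;
}

int main(void) {
    vec4 a = {1, 2, 3, 4}, b = {5, 6, 7, 8}, c = {9, 10, 11, 12};
    vec4 v = vec4_mad(a, b, c);                 /* 1 vec4 issue */

    float sa[4] = {1, 2, 3, 4}, sb[4] = {5, 6, 7, 8}, sc[4] = {9, 10, 11, 12};
    float s[4];
    for (int i = 0; i < 4; ++i)                 /* 4 scalar issues, same work */
        s[i] = scalar_mad(sa[i], sb[i], sc[i]);

    printf("vec4  : %g %g %g %g\n", v.x, v.y, v.z, v.w);
    printf("scalar: %g %g %g %g\n", s[0], s[1], s[2], s[3]);
    return 0;
}
```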
 
I believe Nvidia was able to raise the clock so much mainly due to the fact that the ALUs got simpler: 32-bit scalar instead of 128-bit vec4.

CPUs can achieve higher clocks because they have fewer transistors, so for the same frequency there is less heat dissipation and therefore less power demand.

Frequency depends on the number of transistors: you can have few transistors with a high clock speed, or many transistors with a low clock speed. It depends on what you want to do. If you want more pipelines, each more complex (the GPU approach), then you need more transistors, hence you cannot raise the frequency as much. If you want a few pipelines, each not so complex (the CPU approach), then the transistor count is lower, hence you can raise the frequency much more.

I don't think that's true, but I'll let the chip-design experts detail it further.
To my knowledge, CPU clockspeeds scale much better because the circuit design is much more "hand-tuned" than in a GPU.
That is, GPU designers can't afford the manpower, don't have to include full general-purpose computing capabilities and the respective "baggage" (x86, etc.), have to release products on a tight timetable, and don't enjoy the often 50%+ profit margins provided by CPUs.
 
Inkster, what you said is another reason for the difference in CPU and GPU clockspeeds. Graphics ASIC designers just use a collection of building blocks to build GPUs, while CPU designers kind of start the CPU design from scratch. That's why CPUs have almost three times the design cycle of GPUs.
 
I don't think that's true, but I'll let the chip-design experts detail it further.
To my knowledge, CPU clockspeeds scale much better because the circuit design is much more "hand-tuned" than in a GPU.
That is, GPU designers can't afford the manpower, don't have to include full general-purpose computing capabilities and the respective "baggage" (x86, etc.), have to release products on a tight timetable, and don't enjoy the often 50%+ profit margins provided by CPUs.


You may have a point with the hand-tuning, but I still don't believe it is the main reason.

The increased transistor count (increased power, more difficult to cool) and the increased complexity (more complex interconnects, which lead to signal interference, signal attenuation and signal-delay issues) of GPUs are the main reasons.

These factors mean that even the best "hand tuning" is not capable of delivering a substantial increase in clock speed.
 
There's also the process factor. GPUs thus far have been fabbed on foundry processes that must cover a wide range of customers and customer designs. Timings and drive currents are inferior to the more extreme engineering involved in high-end CPU processes.

Timings could be 2-3 times worse for a foundry process than they are for an Intel or AMD fab at the same geometry.
 