GeForce 9 multi-core GPU (8 cores, 16 scalar stream processors per core)
GeForce 200 multi-core GPU (10 cores, 24 scalar stream processors per core)
Tesla multi-core GPGPU (8 cores, 16 scalar stream processors per core)
http://en.wikipedia.org/wiki/Multi-core
Seems right about the core count, to me. Of course that ignores the transcendental and double-precision units.
EDIT: Should add that this seems right if you assume that the 2 (or 3 in GT200) MAD SIMDs, their associated transcendental units, the double-precision unit (in GT200), and the TMUs are all under the control of a single high-level scheduler that issues instructions to all of these units in parallel.
Jawed
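The cluster layout described in the EDIT above can be sanity-checked with a little arithmetic (a sketch; the 8-lanes-per-MAD-SIMD figure is taken from the discussion, not a vendor spec):

```python
# Stream-processor counts implied by the cluster layout described above.
# Assumes 8 lanes per MAD SIMD.
LANES_PER_SIMD = 8

def stream_processors(clusters, simds_per_cluster):
    return clusters * simds_per_cluster * LANES_PER_SIMD

g80 = stream_processors(8, 2)     # G80: 8 clusters x 2 MAD SIMDs
gt200 = stream_processors(10, 3)  # GT200: 10 clusters x 3 MAD SIMDs
print(g80, gt200)  # 128 240
```

Those totals line up with the marketing "processor" counts, which is presumably why the 240 figure keeps coming up.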
So what's a core to you?
I haven't seen these chips marketed as having 240 or 800 "cores"; instead they're claimed as processing units. So I don't really understand where your comments are coming from.
A collection of one or more ALUs plus registers, control, error reporting, interrupt infrastructure, memory and/or system interface and capable of autonomous operation.
Quote:
CUDA documentation gives pretty good evidence that each of the 2 or 3 units is independent of the others in the cluster. They share a texture unit, but that's about all as far as I can tell. So I'd count G80 as 16 cores and GT200 as 30.

Yeah, my reading of the docs is the same as yours, but if that were the case, why can't I get down to 8-wide dynamic branching granularity rather than 16-wide (G80) or 24-wide (GT200)?
Quote:
Oh I haven't tested the 24 number (don't have a GT200)... I was merely guessing from the architecture. Have you guys run the numbers on this? All I've seen is hints that it's higher on GT200 than G80.

Andrew: Really? If so, I presume that's VS-only, while PS is still 32 - in which case, I guess that means the MUL is fully exposed in the VS, fun! (heh, it'd help if I ever had a GT200 in my hands, I guess)
Oh I haven't tested the 24 number (don't have a GT200)... I was merely guessing from the architecture. Have you guys run the numbers on this? All I've seen is hints that it's higher on GT200 than G80.
Anyways my original question remains... if the 8-wide SIMD units are indeed independent then why can't they branch incoherently? If they can't, then they're really just a wider SIMD array!
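The divergence question above can be illustrated with a toy lockstep-SIMD model (a Python sketch, not real hardware behavior; widths and branch labels are illustrative): with 16-wide branching granularity, two 8-wide halves that take different branches each pay for both paths, whereas independent 8-wide units would not.

```python
# Toy model of lockstep SIMD with predication: every branch path taken
# by at least one lane in a batch costs one pass over the whole batch.
def lane_slots(lane_branches, simd_width):
    """Total lane-slots executed when lanes are grouped into SIMD batches."""
    total = 0
    for i in range(0, len(lane_branches), simd_width):
        batch = lane_branches[i:i + simd_width]
        total += len(set(batch)) * simd_width  # one pass per distinct path
    return total

# 16 lanes: first 8 take branch A, last 8 take branch B.
lanes = ["A"] * 8 + ["B"] * 8

print(lane_slots(lanes, 16))  # 32: the whole 16-wide batch runs both paths
print(lane_slots(lanes, 8))   # 16: each 8-wide batch runs only its own path
```

If the 8-wide units really branched independently, the second number is what you'd observe; measuring the first instead is exactly the "really just a wider SIMD array" behavior described above.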
Quote:
For clocks, I find it far easier to think of everything in terms of half the ALU clock rate and twice the SIMD width. Conceptually, each SIMD processor is 16-wide running at ~650 MHz, even though physically the ALUs are 8-wide running at ~1300 MHz.

Ah, interesting... that makes some amount of sense now, thanks. Certainly thinking about it as running at half the clock rate makes some sense, but that hides the fact that you need twice as many "threads" as you'd think to run at full throughput, no?
Quote:
Ah, interesting... that makes some amount of sense now, thanks. Certainly thinking about it as running at half the clock rate makes some sense, but that hides the fact that you need twice as many "threads" as you'd think to run at full throughput, no?

Well, that's a big part of why we set the warp size to 32. If you know the general hardware configuration and you know what CUDA does at a high level, yes, you'd miss out on the actual thread requirements, but when we say in the documentation "you should really have 128 threads per block at the absolute low end" and set the warp size to twice the effective SIMD width, I think that should balance things out a bit.
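The half-clock view and the warp-size choice above reduce to simple arithmetic (a sketch; the 8-lane width and ~1300 MHz hot clock are taken from the posts above):

```python
# The clocking argument above: 8 physical lanes at the ALU clock behave
# like 16 logical lanes at half that clock, and the warp size is twice
# that effective SIMD width.
PHYSICAL_LANES = 8
ALU_CLOCK_MHZ = 1300                  # ~1300 MHz hot clock (from the post above)

effective_width = PHYSICAL_LANES * 2  # conceptually 16-wide ...
effective_clock = ALU_CLOCK_MHZ // 2  # ... at ~650 MHz
WARP_SIZE = 2 * effective_width       # 32 threads per warp

# One warp instruction therefore occupies the 8 physical lanes for
# 4 ALU clocks, and the documented 128-thread minimum block is 4 warps.
alu_clocks_per_warp = WARP_SIZE // PHYSICAL_LANES
warps_per_min_block = 128 // WARP_SIZE

print(effective_width, effective_clock, WARP_SIZE,
      alu_clocks_per_warp, warps_per_min_block)  # 16 650 32 4 4
```

That factor of two between physical and effective width is exactly the "twice as many threads as you'd think" being discussed.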
Chris Malachowsky said:That first product of ours—I basically designed the entire graphics pipeline myself—where it was a good technical achievement, it was a really shitty product.
The most interesting part is the high-end segment, where the GTX 280 is slated to be replaced by a new chip codenamed GT212, said to be a 45nm or 40nm part.