Nvision 2008

http://en.wikipedia.org/wiki/Multi-core

GeForce 9 multi-core GPU (8 cores, 16 scalar stream processors per core)
GeForce 200 multi-core GPU (10 cores, 24 scalar stream processors per core)
Tesla multi-core GPGPU (8 cores, 16 scalar stream processors per core)
Seems right about the core count, to me. Of course that ignores the transcendental and double-precision units.

EDIT: Should add that this seems right if you assume that the 2 (or 3 in GT200) MAD SIMDs + their associated transcendental units + double-precision (in GT200) + TMUs are all under the control of a single high-level scheduler that issues instructions to all these units in parallel.
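
For reference, a quick arithmetic sketch of how the per-core counts listed above map onto the totals quoted later in this thread (just multiplication, nothing measured; the GPU labels come from the list above):

```cuda
#include <cstdio>

int main()
{
    // Per-core counts from the Wikipedia list above.
    struct { const char *gpu; int cores; int sps_per_core; } parts[] = {
        { "GeForce 9 / Tesla (G80-class)", 8, 16 },   // -> the "128 cores" marketing figure
        { "GeForce 200 (GT200)",          10, 24 },   // -> the "240 cores" marketing figure
    };
    for (const auto &p : parts)
        printf("%-30s %2d cores x %2d SPs = %3d stream processors\n",
               p.gpu, p.cores, p.sps_per_core, p.cores * p.sps_per_core);
    return 0;
}
```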

Jawed
 
http://en.wikipedia.org/wiki/Multi-core


Seems right about the core count, to me. Of course that ignores the transcendental and double-precision units.

EDIT: Should add that this seems right if you assume that the 2 (or 3 in GT200) MAD SIMDs + their associated transcendental units + double-precision (in GT200) + TMUs are all under the control of a single high-level scheduler that issues instructions to all these units in parallel.

Jawed

CUDA documentation gives pretty good evidence that each of the 2 or 3 units is independent of the others in the cluster. They share a texture unit, but that's about all as far as I can tell. So I'd count G80 as 16 cores and GT200 as 30.
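
If you want to check that count directly rather than argue it from the docs, a minimal CUDA device-query sketch does it; on a G80-class part multiProcessorCount should report 16 and on GT200 30 (those expected values are taken from the reading above, not measured here):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // "Multiprocessor" is CUDA's name for the 8-wide unit being discussed here.
    printf("%s: %d multiprocessors, warp size %d\n",
           prop.name, prop.multiProcessorCount, prop.warpSize);
    return 0;
}
```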
 
Trinibwoy found this:

http://forum.beyond3d.com/showpost.php?p=1157238&postcount=207

which puts a different perspective on things, implying that a single scheduler concurrently issues to the MAD and TMU pipes, taking heed of the pipelines' instruction issue rates.

There's extra fun to be had because the TMUs themselves seemingly consist of independent units (LOD/Bias, addressing, fetching, filtering) and so that begs the question "VLIW or superscalar?"

Similar to the question over the ALUs: "VLIW: MAD+transcendental; or superscalar?" which is why Trinibwoy linked that patent.

Jawed
 
So what's a core to you?

A collection of one or more ALUs plus registers, control, error reporting, interrupt infrastructure, memory and/or system interface and capable of autonomous operation. It certainly isn't a single ALU, which is what nVidia and ATI are trying to snowball people with.

But then again, they couldn't make such ridiculous claims if they had to admit they're using 10- or 30-core designs.
 
I haven't seen these suggestions of 240 or 800 "cores"; instead they call them processing units. So I don't really understand where your comments are coming from.
 
I haven't seen these suggestions of 240 or 800 "cores"; instead they call them processing units. So I don't really understand where your comments are coming from.

Nvidia constantly calls them cores. '128 cores', '240 cores' and so on.

http://www.nvidia.com/object/geforce_gtx_280.html - 'Processor cores'
http://www.nvidia.com/object/tesla_c1060.html - 'Processor cores'
http://www.nvidia.com/object/geforce_9800gtx.html - '128 cores'

http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543,00.html - '800 cores'
http://www.nvidia.com/object/io_1213610051114.html
 
A collection of one or more ALUs plus registers, control, error reporting, interrupt infrastructure, memory and/or system interface and capable of autonomous operation.

With that definition I suppose GPUs have zero cores then.
 
Perhaps we should just not play the semantics game. It's clear that the definition of what a core is is somewhat 'fluid'; on the other hand, I wouldn't call NVIDIA's streaming processors cores.
 
CUDA documentation gives pretty good evidence that each of the 2 or 3 units is independent of the others in the cluster. They share a texture unit, but that's about all as far as I can tell. So I'd count G80 as 16 cores and GT200 as 30.
Yeah my reading of the docs is the same as yours, but if that were the case, why can't I get down to 8-wide dynamic branching granularity rather than 16-wide (G80) or 24-wide (GT200)?
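
One way to poke at that from CUDA would be a little divergence micro-benchmark (a rough sketch; the kernel and the group_width parameter are made up for illustration, nothing below comes from NVIDIA's docs):

```cuda
// Every group of `group_width` adjacent threads takes a different path.
// If the 8-wide SIMD units could really branch independently, group_width = 8
// should run about as fast as group_width = 32; if the whole 32-thread warp is
// the branching unit, the narrower widths pay roughly a 2x serialization cost.
__global__ void divergence_probe(float *out, int group_width, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (float)tid;
    if ((threadIdx.x / group_width) & 1) {
        for (int i = 0; i < iters; ++i) v = v * 1.0001f + 0.5f;
    } else {
        for (int i = 0; i < iters; ++i) v = v * 0.9999f - 0.5f;
    }
    out[tid] = v;   // keep the compiler from throwing the work away
}
```

Timing that launch (cudaEvent timers around it) for group_width = 8, 16 and 32 would map out the real granularity.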
 
Andrew: Really? If so, I presume that's VS-only, while PS is still 32 - in which case, I guess that means the MUL is fully exposed in the VS, fun! :) (heh it'd help if I ever had a GT200 in my hands, I guess)
 
Andrew: Really? If so, I presume that's VS-only, while PS is still 32 - in which case, I guess that means the MUL is fully exposed in the VS, fun! :) (heh it'd help if I ever had a GT200 in my hands, I guess)
Oh I haven't tested the 24 number (don't have a GT200)... I was merely guessing from the architecture. Have you guys run the numbers on this? All I've seen is hints that it's higher on GT200 than G80.

Anyways my original question remains... if the 8-wide SIMD units are indeed independent then why can't they branch incoherently? If they can't, then they're really just a wider SIMD array!
 
Oh I haven't tested the 24 number (don't have a GT200)... I was merely guessing from the architecture. Have you guys run the numbers on this? All I've seen is hints that it's higher on GT200 than G80.

Anyways my original question remains... if the 8-wide SIMD units are indeed independent then why can't they branch incoherently? If they can't, then they're really just a wider SIMD array!


On G80 you have two independent SIMD arrays per cluster. Each one issues an instruction to 16 threads over 2 ALU clocks for vertex work, or an instruction to 32 threads over 4 ALU clocks for pixel and CUDA work. (My theory is they still issue an instruction every two clocks to get dual-issue -- which implies there's no dual-issue for vertex work.) So your dynamic branching granularity is 16 or 32 not because the two SIMD units are really one big unit, but because instruction issue is at half of the ALU clock rate.

GT200 is just three of these independent SIMD arrays per cluster. So the branch granularity is the same as on G80. Changing the number of SIMD arrays per SIMD+TEX cluster doesn't affect the branching granularity, just the overall math/tex throughput ratio.

For clocks, I find it far easier to think of everything in terms of half the ALU clock rate and twice the SIMD width. Conceptually, each SIMD processor is 16-wide running at ~650 MHz, even though physically the ALUs are 8-wide running at ~1300 MHz.
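
A tiny numeric sanity check of that equivalence, using the rough figures above (just arithmetic, nothing measured):

```cuda
#include <cstdio>

int main()
{
    const double hot_clock  = 1.3e9;              // ~1300 MHz physical ALU clock
    const int    lanes_hot  = 8;                  // physical SIMD width
    const double base_clock = hot_clock / 2.0;    // ~650 MHz conceptual clock
    const int    lanes_base = lanes_hot * 2;      // 16-wide conceptual SIMD

    // Same per-SM instruction-issue rate either way.
    printf("per-SM lanes*clock: %.1f vs %.1f Gops/s\n",
           lanes_hot * hot_clock / 1e9, lanes_base * base_clock / 1e9);

    // A 32-thread warp therefore takes 32/8 = 4 hot clocks,
    // or equivalently 32/16 = 2 conceptual clocks, per instruction.
    printf("warp of 32: %d hot clocks, %d conceptual clocks\n",
           32 / lanes_hot, 32 / lanes_base);
    return 0;
}
```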
 
For clocks, I find it far easier to think of everything in terms of half the ALU clock rate and twice the SIMD width. Conceptually, each SIMD processor is 16-wide running at ~650 MHz, even though physically the ALUs are 8-wide running at ~1300 MHz.
Ah, interesting... that makes some amount of sense now, thanks :) Certainly thinking about it as running half the clock rate makes some sense, but that hides the fact that you need twice as many "threads" as you'd think to run at full throughput, no?
 
Ah, interesting... that makes some amount of sense now, thanks :) Certainly thinking about it as running half the clock rate makes some sense, but that hides the fact that you need twice as many "threads" as you'd think to run at full throughput, no?
Well, that's a big part of why we set the warp size to 32. If all you knew was the general hardware configuration and what CUDA does at a high level then, yes, you'd miss the actual thread requirements; but when we say in the documentation "you should really have 128 threads per block at the absolute low end" and set the warp size to twice the effective SIMD width, I think that should balance things out a bit.
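
To make that concrete, a minimal launch-configuration sketch along those lines (the kernel and sizes here are illustrative, not from any NVIDIA document):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel; 128 threads per block = 4 warps of 32, i.e. 8x the
// 16-wide "conceptual" SIMD width discussed above, which leaves the scheduler
// some room to hide latency.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void launch(int n, float a, const float *d_x, float *d_y)
{
    const int threads_per_block = 128;   // the documented low end mentioned above
    const int blocks = (n + threads_per_block - 1) / threads_per_block;
    saxpy<<<blocks, threads_per_block>>>(n, a, d_x, d_y);
}
```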

edit: oh this thread really isn't about nvision anymore, oh well!
 
Nvision'08 is already over, but this Arstechnica interview with Nvidia's co-founder Chris Malachowsky is still pretty interesting.
I particularly liked reading about his love/hate relationship with NV1:

Chris Malachowsky said:
That first product of ours—I basically designed the entire graphics pipeline myself—where it was a good technical achievement, it was a really shitty product.

:LOL:
 