Nvision 2008

http://en.wikipedia.org/wiki/Multi-core

GeForce 9 multi-core GPU (8 cores, 16 scalar stream processors per core)
GeForce 200 multi-core GPU (10 cores, 24 scalar stream processors per core)
Tesla multi-core GPGPU (8 cores, 16 scalar stream processors per core)
Seems right about the core count, to me. Of course that ignores the transcendental and double-precision units.

EDIT: Should add that this seems right if you assume that the 2 (or 3 in GT200) MAD SIMDs + their associated transcendental units + double-precision (in GT200) + TMUs are all under the control of a single high-level scheduler that issues instructions to all these units in parallel.
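
For reference, a quick arithmetic sketch of how the per-core counts listed above map onto the totals quoted later in this thread (just multiplication, nothing measured; the GPU labels come from the list above):

```cuda
#include <cstdio>

int main()
{
    // Per-core counts from the Wikipedia list above.
    struct { const char *gpu; int cores; int sps_per_core; } parts[] = {
        { "GeForce 9 / Tesla (G80-class)", 8, 16 },   // -> the "128 cores" marketing figure
        { "GeForce 200 (GT200)",          10, 24 },   // -> the "240 cores" marketing figure
    };
    for (const auto &p : parts)
        printf("%-30s %2d cores x %2d SPs = %3d stream processors\n",
               p.gpu, p.cores, p.sps_per_core, p.cores * p.sps_per_core);
    return 0;
}
```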

Jawed
 
http://en.wikipedia.org/wiki/Multi-core


Seems right about the core count, to me. Of course that ignores the transcendental and double-precision units.

EDIT: Should add that this seems right if you assume that the 2 (or 3 in GT200) MAD SIMDs + their associated transcendental units + double-precision (in GT200) + TMUs are all under the control of a single high-level scheduler that issues instructions to all these units in parallel.

Jawed

CUDA documentation gives pretty good evidence that each of the 2 or 3 units is independent of the others in the cluster. They share a texture unit, but that's about all as far as I can tell. So I'd count G80 as 16 cores and GT200 as 30.
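
If you want to check that count directly rather than argue it from the docs, a minimal CUDA device-query sketch does it; on a G80-class part multiProcessorCount should report 16 and on GT200 30 (those expected values are taken from the reading above, not measured here):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // "Multiprocessor" is CUDA's name for the 8-wide unit being discussed here.
    printf("%s: %d multiprocessors, warp size %d\n",
           prop.name, prop.multiProcessorCount, prop.warpSize);
    return 0;
}
```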
 
Trinibwoy found this:

http://forum.beyond3d.com/showpost.php?p=1157238&postcount=207

which puts a different perspective on things, implying that a single scheduler concurrently issues to the MAD and TMU pipes, taking heed of the pipelines' instruction issue rates.

There's extra fun to be had because the TMUs themselves seemingly consist of independent units (LOD/Bias, addressing, fetching, filtering) and so that begs the question "VLIW or superscalar?"

Similar to the question over the ALUs: "VLIW: MAD+transcendental; or superscalar?" which is why Trinibwoy linked that patent.

Jawed
 
So what's a core to you?

A collection of one or more ALUs plus registers, control, error reporting, interrupt infrastructure, memory and/or system interface and capable of autonomous operation. It certainly isn't a single ALU, which is what nVidia and ATI are trying to snowball people with.

But then again, they couldn't make such ridiculous claims if they had to admit they're using 10- or 30-core designs.
 
I haven't seen these suggestions of 240 or 800 "cores"; instead they call them processing units. So I don't really understand where your comments are coming from.
 
I haven't seen these suggestions of 240 or 800 "cores"; instead they call them processing units. So I don't really understand where your comments are coming from.

Nvidia constantly calls them cores. '128 cores', '240 cores' and so on.

http://www.nvidia.com/object/geforce_gtx_280.html - 'Processor cores'
http://www.nvidia.com/object/tesla_c1060.html - 'Processor cores'
http://www.nvidia.com/object/geforce_9800gtx.html - '128 cores'

http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543,00.html - '800 cores'
http://www.nvidia.com/object/io_1213610051114.html
 
A collection of one or more ALUs plus registers, control, error reporting, interrupt infrastructure, memory and/or system interface and capable of autonomous operation.

With that definition I suppose GPUs have zero cores then.
 
Perhaps we should just not play the semantics game. It's clear that the definition of what a core is is somewhat 'fluid'; on the other hand, I wouldn't call NVIDIA's streaming processors cores.
 
CUDA documentation gives pretty good evidence that each of the 2 or 3 units is independent of the others in the cluster. They share a texture unit, but that's about all as far as I can tell. So I'd count G80 as 16 cores and GT200 as 30.
Yeah my reading of the docs is the same as yours, but if that were the case, why can't I get down to 8-wide dynamic branching granularity rather than 16-wide (G80) or 24-wide (GT200)?
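
One way to poke at that from CUDA would be a little divergence micro-benchmark (a rough sketch; the kernel and the group_width parameter are made up for illustration, nothing below comes from NVIDIA's docs):

```cuda
// Every group of `group_width` adjacent threads takes a different path.
// If the 8-wide SIMD units could really branch independently, group_width = 8
// should run about as fast as group_width = 32; if the whole 32-thread warp is
// the branching unit, the narrower widths pay roughly a 2x serialization cost.
__global__ void divergence_probe(float *out, int group_width, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (float)tid;
    if ((threadIdx.x / group_width) & 1) {
        for (int i = 0; i < iters; ++i) v = v * 1.0001f + 0.5f;
    } else {
        for (int i = 0; i < iters; ++i) v = v * 0.9999f - 0.5f;
    }
    out[tid] = v;   // keep the compiler from throwing the work away
}
```

Timing that launch (cudaEvent timers around it) for group_width = 8, 16 and 32 would map out the real granularity.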
 
Andrew: Really? If so, I presume that's VS-only, while PS is still 32 - in which case, I guess that means the MUL is fully exposed in the VS, fun! :) (heh it'd help if I ever had a GT200 in my hands, I guess)
 
Andrew: Really? If so, I presume that's VS-only, while PS is still 32 - in which case, I guess that means the MUL is fully exposed in the VS, fun! :) (heh it'd help if I ever had a GT200 in my hands, I guess)
Oh I haven't tested the 24 number (don't have a GT200)... I was merely guessing from the architecture. Have you guys run the numbers on this? All I've seen is hints that it's higher on GT200 than G80.

Anyways my original question remains... if the 8-wide SIMD units are indeed independent then why can't they branch incoherently? If they can't, then they're really just a wider SIMD array!
 
Oh I haven't tested the 24 number (don't have a GT200)... I was merely guessing from the architecture. Have you guys run the numbers on this? All I've seen is hints that it's higher on GT200 than G80.

Anyways my original question remains... if the 8-wide SIMD units are indeed independent then why can't they branch incoherently? If they can't, then they're really just a wider SIMD array!


On G80 you have two independent SIMD arrays per cluster. Each one issues an instruction to 16 threads over 2 ALU clocks for vertex work, or an instruction to 32 threads over 4 ALU clocks for pixel and CUDA work. (My theory is they still issue an instruction every two clocks to get dual-issue -- which implies there's no dual-issue for vertex work.) So your dynamic branching granularity is 16 or 32 not because the two SIMD units are really one big unit, but because instruction issue is at half of the ALU clock rate.

GT200 is just three of these independent SIMD arrays per cluster. So the branch granularity is the same as on G80. Changing the number of SIMD arrays per SIMD+TEX cluster doesn't affect the branching granularity, just the overall math/tex throughput ratio.

For clocks, I find it far easier to think of everything in terms of half the ALU clock rate and twice the SIMD width. Conceptually, each SIMD processor is 16-wide running at ~650 MHz, even though physically the ALUs are 8-wide running at ~1300 MHz.
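
A tiny numeric sanity check of that equivalence, using the rough figures above (just arithmetic, nothing measured):

```cuda
#include <cstdio>

int main()
{
    const double hot_clock  = 1.3e9;              // ~1300 MHz physical ALU clock
    const int    lanes_hot  = 8;                  // physical SIMD width
    const double base_clock = hot_clock / 2.0;    // ~650 MHz conceptual clock
    const int    lanes_base = lanes_hot * 2;      // 16-wide conceptual SIMD

    // Same per-SM instruction-issue rate either way.
    printf("per-SM lanes*clock: %.1f vs %.1f Gops/s\n",
           lanes_hot * hot_clock / 1e9, lanes_base * base_clock / 1e9);

    // A 32-thread warp therefore takes 32/8 = 4 hot clocks,
    // or equivalently 32/16 = 2 conceptual clocks, per instruction.
    printf("warp of 32: %d hot clocks, %d conceptual clocks\n",
           32 / lanes_hot, 32 / lanes_base);
    return 0;
}
```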
 
For clocks, I find it far easier to think of everything in terms of half the ALU clock rate and twice the SIMD width. Conceptually, each SIMD processor is 16-wide running at ~650 MHz, even though physically the ALUs are 8-wide running at ~1300 MHz.
Ah, interesting... that makes some amount of sense now, thanks :) Certainly thinking about it as running half the clock rate makes some sense, but that hides the fact that you need twice as many "threads" as you'd think to run at full throughput, no?
 
Ah, interesting... that makes some amount of sense now, thanks :) Certainly thinking about it as running half the clock rate makes some sense, but that hides the fact that you need twice as many "threads" as you'd think to run at full throughput, no?
Well, that's a big part of why we set the warp size to 32. If all you knew was the general hardware configuration and what CUDA does at a high level then, yes, you'd miss the actual thread requirements; but when we say in the documentation "you should really have 128 threads per block at the absolute low end" and set the warp size to twice the effective SIMD width, I think that should balance things out a bit.
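
To make that concrete, a minimal launch-configuration sketch along those lines (the kernel and sizes here are illustrative, not from any NVIDIA document):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel; 128 threads per block = 4 warps of 32, i.e. 8x the
// 16-wide "conceptual" SIMD width discussed above, which leaves the scheduler
// some room to hide latency.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void launch(int n, float a, const float *d_x, float *d_y)
{
    const int threads_per_block = 128;   // the documented low end mentioned above
    const int blocks = (n + threads_per_block - 1) / threads_per_block;
    saxpy<<<blocks, threads_per_block>>>(n, a, d_x, d_y);
}
```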

edit: oh this thread really isn't about nvision anymore, oh well!
 
Nvision'08 is already over, but this Arstechnica interview with Nvidia's co-founder Chris Malachowsky is still pretty interesting.
I particularly liked reading about his love/hate relationship with NV1:

Chris Malachowsky said:
That first product of ours—I basically designed the entire graphics pipeline myself—where it was a good technical achievement, it was a really shitty product.

:LOL:
 