Although, now thinking about it in the context of this whole "real threads" .. umm .. discussion, I wonder if DWF isn't breathing life into the quaint and aging assumption that everyone and their brother is always running the same instruction. Seems like the cost of DWF and MIMD might be similar (the difference between finding 32 runnable "threads" and finding 32 runnable "threads" all at the same PC is ..?), and we'd get more benefit from MIMD, even if we do need larger instruction caches?
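To put that parenthetical in concrete terms, here's a toy software sketch of the two selection problems (purely illustrative, under the assumption that each thread context exposes just a ready flag and a PC; none of the names correspond to real hardware). The MIMD-style pick only needs any 32 ready threads, while the DWF-style pick additionally buckets ready threads by PC and compacts the fullest bucket into a warp.

```python
# Toy model of the two scheduler selection problems. Illustrative only.
from collections import defaultdict

WARP_WIDTH = 32

class ThreadCtx:
    def __init__(self, tid, pc, ready):
        self.tid = tid      # thread id
        self.pc = pc        # current program counter
        self.ready = ready  # operands available / not stalled

def pick_mimd(pool):
    """MIMD-style: any 32 ready threads will do, regardless of PC."""
    ready = [t for t in pool if t.ready]
    return ready[:WARP_WIDTH]

def pick_dwf(pool):
    """DWF-style: bucket ready threads by PC, issue from the fullest bucket."""
    buckets = defaultdict(list)
    for t in pool:
        if t.ready:
            buckets[t.pc].append(t)
    if not buckets:
        return []
    best_pc = max(buckets, key=lambda pc: len(buckets[pc]))
    return buckets[best_pc][:WARP_WIDTH]
```

The extra PC-matching and compaction step is roughly what a DWF scheduler has to pay for on top of plain readiness tracking; full MIMD trades that step for replicated front ends.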
Full MIMD would mean larger instruction caches and a multiplication of the resources at the front end: a physical SIMD of width 16 split into 16 MIMD units would need 16x the decoders, issue ports, and scheduling.
It's not necessarily 16x the hardware, since each of those units could be simpler than the single wide SIMD unit they replace.
Regardless, Fermi is already plenty big.
The primary argument for DWF was that Nvidia's scheduling and register hardware was already oddly complex for what it was doing, and that DWF would be an incremental addition that could yield throughput decently close to what MIMD could offer for the targeted workloads.
This came up in the old G300 speculation thread. It's quite a trip down memory lane to go back there.
A lot of what Fermi turned out to be matched the grumblings at the time, and the apparent die size bore out some of the fears.