NVIDIA Fermi: Architecture discussion

Although, now thinking about it in the context of this whole "real threads" .. umm .. discussion, I wonder if DWF (dynamic warp formation) isn't breathing life into the quaint and aging assumption that everyone and their brother is always running the same instruction. Seems like the cost of DWF and MIMD might be similar (the difference between finding 32 runnable "threads" and finding 32 runnable "threads" all at the same PC is ..?), and we'd get better advantages from MIMD, even if we do need larger instruction caches?

Full MIMD would mean larger instruction caches and multiplying the resources at the front end. Pushing a physical SIMD of width 16 to 16 MIMD units would require 16x the decoders, issue ports, and scheduling.
It's not necessarily 16x the hardware because these are potentially simpler than the more complex SIMD unit.
Regardless, Fermi is already plenty big.

The primary argument for DWF was that Nvidia's scheduling and register hardware was already oddly complex for what it was doing, and DWF was an incremental increase that could yield throughput decently close to what MIMD could offer for the workloads targeted.

This came up in the old G300 speculation thread. It's quite a trip down memory lane to go back there.
Much of what Fermi turned out to be matches the grumblings at the time, and the apparent die size bears out some of the fears.
 
Even from a purely software perspective, though, the proverbial "crack in the armor" comes with shared memory and barrier synchronization, which, given the limitations on how they can be used in the current programming models, give a greatly restricted model of "threads" compared to the usual definition. Even a simple producer/consumer pattern can't easily be modeled directly, since all "threads" in a group are forced to converge at every barrier. So while they can predicate and do other SIMD-like things to appear to execute different code, they cannot go off on arbitrary control flow graphs - at least not with the ability to ever share data inside that control flow (which rather limits the utility, not to mention the abstraction...). Perhaps this limitation will be lifted with Fermi though - we'll see.
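
To make that concrete, here is a minimal CUDA sketch of the producer/consumer case (the kernel name and sizes are made up for illustration):

Code:
// Hypothetical sketch of the producer/consumer point above, assuming a
// 256-thread block. __syncthreads() must be reached by *all* threads of the
// block, so the natural "half produce, half consume" split is not expressible.
__global__ void producer_consumer(int *out)
{
    __shared__ int buffer[128];

    if (threadIdx.x < 128) {
        buffer[threadIdx.x] = threadIdx.x * 2;              // producer half
        // a __syncthreads() here would be a barrier the other half never reaches
    } else {
        // the consumer half wants to wait on the producers, but cannot place
        // an independent barrier here either
        out[threadIdx.x - 128] = buffer[threadIdx.x - 128]; // may read stale data
    }

    // The only legal formulation hoists a single __syncthreads() out of the
    // branch, forcing every "thread" in the group to converge: conceptual lock-step.
}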
 
Even in regular CPU models you end up with some of these problems. Look at the Java Memory Model spec, for example. There, you flush around synchronization primitives, or you go with optimistic concurrency and validate/retry (with the extreme being software transactional memory, which is just being prototyped for .NET).
 
Right, but the existence of both compiler- and processor-level memory barriers is not what's interesting here, as you need them in almost every imperative language that supports multi-threading. The problem is that if you restrict where these things can exist with respect to control flow, and further limit these barriers to - say - a global scope (or at least global with respect to shared memory), then you remove a huge amount of expressiveness that even the Java model still has. You also completely remove the illusion that your "threads" are executing "independently", since they can effectively only make useful progress when running in conceptual lock-step (with predication).

This is standard for SIMD, but non-standard for the use of the term "threads", hence the question of terminology choice. Now I will give NVIDIA et al. props for constantly trying to increase the expressiveness towards the goal of making these things operate as if they truly were independent threads by the traditional definition, but we're still a ways off. Fermi will undoubtedly bring us closer but it remains to be seen by how much.
 
Can atomics be used to implement barrier synchronization of threads that have arbitrarily diverged? If so, then wouldn't the illusion breaking __sync be an implementation detail that is exposed to allow improved performance in some cases, rather than a reason to claim that NV is misusing the term thread?
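
For what it's worth, the thing being asked about would look roughly like this (a sketch only; the names are made up, and the caveat in the comments is the catch):

Code:
// Hypothetical barrier built from atomics alone.
__device__ int g_arrived = 0;

__device__ void sw_barrier(int expected)
{
    atomicAdd(&g_arrived, 1);                    // announce arrival
    while (atomicAdd(&g_arrived, 0) < expected)  // atomic read-back as a poor man's load
        ;                                        // spin until everyone has arrived
}
// Caveat: on SIMT hardware of this era, lanes of the *same* warp that have
// diverged cannot spin-wait on one another (the warp runs one branch path to
// completion before switching), so this is only safe between warps or between
// blocks, and only if all participants are actually resident on the chip.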
 
Making sense of threads/vectors

To me, the most consistent and accurate terminology seems to be that

warp/wavefront size := vector lanes

and

in CUDA speak, (threads/block)/(warp size) := number of threads

  • G80
    has at most 24 hw threads per core, (16 cores overall), where each thread executes 32 wide simd instructions
  • GT200
    has at most 32 hw threads per core, (30 cores overall), where each thread executes 32 wide simd instructions
  • Larrabee
    has at most 4 hw threads per core, (?? cores overall), where each thread executes 16 wide simd instructions. LRB also implements multiple sw threads per hw thread for additional latency hiding
  • Cypress
Cypress has at most ?? hw threads per core, (20 cores overall), where each thread executes 64 wide simd instructions. Needs a lot of ILP in code to reach peak performance.
  • Cell
and Cell has at most 1 hw thread per core, (8 cores overall), where each thread executes 4 wide simd instructions (counting only the SPEs)
  • Nehalem
has at most 2 hw threads per core, (4 cores overall), where each thread executes 4 wide simd instructions. Needs a lot of ILP in code to reach peak performance.
  • Fermi
    has at most 48 hw threads per core, (16 cores overall), where each thread executes 32 wide simd instructions
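
Plugging a couple of the entries above into that definition as a sanity check (an illustrative sketch, using only the numbers quoted in this thread):

Code:
// Back-of-the-envelope check of the terminology above; the 256-thread launch
// configuration is just an example.
const int warp_size          = 32;                              // "vector lanes"
const int cuda_threads_block = 256;                             // threads/block in CUDA speak
const int hw_threads_block   = cuda_threads_block / warp_size;  // = 8 "threads" per block here
// G80:   up to 24 resident hw threads (warps) per core -> 24 * 32 = 768  CUDA "threads" per SM
// Fermi: up to 48                                       -> 48 * 32 = 1536 CUDA "threads" per SM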

ALL chips above allow simd divergence to be handled, but with some performance penalty. Programming models are of course different depending upon vendor.
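
Roughly what that penalty looks like in CUDA terms (a hypothetical sketch; the same idea applies to the other chips in their own models):

Code:
// Both sides of the branch get executed by the warp/wavefront, one after the
// other, with inactive lanes masked off, so peak throughput roughly halves.
// The result is still correct, which is the "allowed, but with a penalty" part.
__global__ void divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;   // even lanes take this path
    else
        data[i] = data[i] + 1.0f;   // odd lanes take this one
}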

Chips traditionally called CPUs expose a SIMD ISA to the programmer, leaving it to them to write SIMD code (or to autovectorizing compilers).

Chips traditionally called GPUs do not expose a SIMD ISA to the programmer. The programmer usually writes scalar code which is vectorized in hardware.
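
Side by side, the contrast looks something like this (a sketch; function names are made up, CPU side shown with SSE intrinsics):

Code:
#include <xmmintrin.h>

// CPU style: the 4-wide SIMD is explicit in the ISA, and the programmer (or
// an autovectorizer) handles the vector width. Assumes n is a multiple of 4.
void scale_cpu(float *x, int n, float s)
{
    __m128 vs = _mm_set1_ps(s);
    for (int i = 0; i < n; i += 4)
        _mm_storeu_ps(x + i, _mm_mul_ps(_mm_loadu_ps(x + i), vs));
}

// GPU style: the source is scalar per "thread"; the 32-wide grouping into
// warps never appears in the code, it happens in hardware.
__global__ void scale_gpu(float *x, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= s;
}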

Some chips (like Cypress, Nehalem, etc.) need a lot of ILP in the code to reach peak performance.
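
In code, "a lot of ILP" just means several independent operations in flight per thread, something like (a hypothetical sketch):

Code:
// Four independent multiply-adds per "thread": on a VLIW part like Cypress
// they can help fill the 5 slots per lane, and on Nehalem they keep the
// out-of-order SSE pipes busy. One long dependent chain would leave units idle.
__global__ void ilp_example(float *x)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    float a = x[i]     * 2.0f + 1.0f;
    float b = x[i + 1] * 2.0f + 1.0f;
    float c = x[i + 2] * 2.0f + 1.0f;
    float d = x[i + 3] * 2.0f + 1.0f;
    x[i] = a;  x[i + 1] = b;  x[i + 2] = c;  x[i + 3] = d;
}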

Vectors vs threads don't have to be mutually exclusive.
What do the specialists of B3D think? :p
 
I got an idea - let's call NVIDIA's threads "thNeads". That should clear up the confusion :mrgreen:

cudaThreads, obviously? As in:
How many cudaThreads does your CPU (cuda processing unit) support?
Further:
ROPs->COPs
TMUs->CMUs
Cache->cudaches
MC->CudamemController
Jen-Hsun Huang->CUDA-Hsun Huang
 
There are kernel threads, userspace threads, pthreads, etc.

How about:

GPUthreads
ASICthreads
Slavespace threads
Evanescent threads
Datumthreads
Pseudothreads
and so on threads
 
Can atomics be used to implement barrier synchronization of threads that have arbitrarily diverged? If so, then wouldn't the illusion breaking __sync be an implementation detail that is exposed to allow improved performance in some cases, rather than a reason to claim that NV is misusing the term thread?

I think that makes sense to me as a general rule. I wouldn't see it being a problem as long as the exposed features exist in addition to the standard functions and are not required to be used.
 
I love how Charlie slips in these zingers with no proof, source or corroborating evidence. Journalism at its finest :D

But everything else he's right about, including that they EOL'd the GT200 parts, something you steadfastly denied across various forums until you couldn't do it anymore.
 