NVIDIA Fermi: Architecture discussion

Jawed - So ATI HW supports interleaved execution of 8 kernels on a single "core"? Does each core get to run 8 different kernels, or is it 8 across the entire chip? Regarding your comments on the NV scheduler - I don't understand the difference between scoreboarding threads and instructions. Do you mean entire warps are now scoreboarded? I'm having a hard time imagining why they would ever have scoreboarded at the level of a single "thread"; can you provide links to any documentation verifying this?
 
You're just saying you need 4X as many ports or banks on a given amount of register file space. Which is really really bad. I applaud you for looking into things like this as it's interesting to think about, but it would be helpful if you thought about the hardware implications.
I absolutely am thinking about the hardware implications. We do not need 4X the ports. I may need fewer, larger banks with my design, but probably not, depending on what the current design is. 64-pixel batches with FP32 granularity means there are only 256 different locations in each register file channel.
It's not how much data you fetch, it's where you fetch from. If you want to fetch 1KB of data from a single register file...you're DOA. Fetching from 2 reg files is still awful...but fetching from 16 different ones might be reasonable.
DOA? Right now batches are 64 pixels, and a 32-bit float is fetched per cycle from each of 4 register file channels (x, y, z, w). That's 1 KB/cycle. After 3 cycles, you get 3 reads per channel, or 12 reads per instruction group.
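To put numbers on that, here's a back-of-envelope sketch in Python, assuming only the 64-pixel batch and 4-channel FP32 layout described above (nothing vendor-specific beyond that):

  # Back-of-envelope for the register fetch figures above.
  # Assumptions: 64-pixel batch, 4 register file channels (x, y, z, w), FP32 operands.
  batch = 64             # pixels per batch
  channels = 4           # register file channels
  op_bytes = 4           # bytes per FP32 operand

  print(batch * channels * op_bytes)   # 1024 bytes -> the "1 KB/cycle" figure

  cycles = 3
  print(cycles * channels)             # 12 operand reads per instruction group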

I do have to confess one oversight that I wish someone would have pointed out earlier. ATI's GPUs allow up to 128 vec4 registers per thread. With 64-pixel batches, that works out to only two batches per SIMD. Thus my eight-active-batch system can't work. :oops:
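For anyone wondering where "only two batches" comes from, a quick sketch - assuming roughly 256 KiB of register file per SIMD, which is the commonly quoted figure, so treat that as an assumption rather than something from the docs:

  # Why 128 vec4 registers per thread limits a SIMD to two 64-pixel batches.
  # Assumption: ~256 KiB of register file per SIMD.
  regs_per_pixel   = 128          # vec4 registers
  bytes_per_reg    = 4 * 4        # vec4 of FP32
  pixels_per_batch = 64

  batch_bytes = regs_per_pixel * bytes_per_reg * pixels_per_batch
  print(batch_bytes // 1024)              # 128 KiB per fully-allocated batch

  simd_regfile = 256 * 1024
  print(simd_regfile // batch_bytes)      # 2 batches fit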

But come on, do we really need to run programs using 512 floats of registers without spilling? Don't NVidia GPUs allow far fewer registers per thread? Still, it probably won't be hard for a scalar design to have two modes of operation, the first as I described, and the second working on two active batches but requiring non-dependent instruction groups, as is currently the case.

I think it would behoove you to understand how register files are designed in GPUs. There is at least one patent that I referenced in my article.
Sorry, I couldn't find it.
 
Which fragments the market even more, making the problem more acute. Genius! :)

Finding a way to prevent your hardware from being commodified is genius. We all know that eventually there'll be something like DirectPhysics, but open standards don't materialize out of thin air. First, independent actors innovate with proprietary solutions and refine them via market feedback. As competitors enter the space, the fragmentation causes issues for developers, until finally a standard, usually synthesized from the existing proprietary solutions, arrives.
 
I guess CUDA is a perfect example of what you've just described.

Just about all "standards" I'm aware of take this route. Those that start out as standards first, without having gone through a proprietary stage, usually fail because they are created on paper by bureaucracy before being tested. TCP/IP is one of the exceptions, although you have to ignore prior research.
 
Also, I'm not sure what to make of the prospect of an OpenCL-extensions arms race, like the OpenGL arms race that's been going on for years now.

There is hope in this regard. The thing with OpenGL was that it drew its abstraction line way too high above the silicon, and as technology progressed, the hardware evolution diverged from the abstraction that OGL was supposed to provide. And the bad guys made sure that OGL would never be able to recover :devilish: - witness 3Dlabs' efforts and the nuking of the object/template-based OGL3.

OCL, otoh, is following hw evolution. OCL 1.0 is pretty much G80, and Cypress is a natural, small extension of that, say OCL 1.1 (due out in 2h09/1h10 btw). Rapid evolution along with hw is the best shot OCL has at surviving the deathmatch that is the extension race.

Yet, I am not discounting the deleterious effects of politics. There is hope, but caution in this regard is necessary.
 
I think HTML5 WHATWG vs W3C is informative here. The W3C is like the ARB: an industry consortium of many players, many of which don't even implement anything. The 'too many cooks' syndrome often acted to significantly delay specs, add useless extensions, or veto others. Progress slowed to a standstill. A parallel 'industry' HTML5 effort was born with basically Mozilla, Apple, Google, and Opera, the four actors with horses in the race (MS recently joined). As a result, in just 2 years, browser capability has massively improved. Why? Because each vendor extends HTML5 with new features all the time, and then they quickly harmonize their implementations.

Extensions are a good thing; they allow quick experimentation and feedback. The breakdown occurs when the players cannot quickly agree on harmonization. Realistically, NVidia, Intel, AMD, Apple, and perhaps Microsoft should handle this, keeping the group small and limited to people who can execute quick implementation changes. Maybe I could see Sony/IBM in the mix, but the group should be kept small. Extensions to OpenCL shouldn't be feared, as long as a short list of players (who are relevant to the market) can sync up extensions quickly.

They're bad when no one can agree, and the platform fragments.
 
Which fragments the market even more, making the problem more acute. Genius! :)

-Charlie
So you're saying a somehow artificially increased demand for graphics horsepower, be it physx, be it eyefinity, be it 3D-stereo, fragments the market even more by doing exactly what?

I agree, all that stuff doesn't help alleviate the basic problem, but it helps both vendors to find some justification for increased graphics performance - which is basically what their business is all about.
 
A few comments and thoughts - first of all your terminology is very very confused, largely due to the fact that all the different GPU guys have bizarre and non-standard terminology. Trying to straighten that out is a pretty heroic effort.
I'm mostly interested in the vector portion of each core, to make a broad comparison of implementation cost.

I'm basing the terminology on Intel's, used in describing Larrabee, the only sane terminology out there.

A core does instruction decoding; a thread has a program counter; and each thread consists of strands that populate SIMD-ALU lanes with work, taking multiple issue cycles to run basic instructions (e.g. a single-precision ADD is 1 issue cycle on Larrabee, 2 on GF100 and 4 on R800).

A kernel is an entire set of instructions that is independent. Though there's a terminological problem relating to scope of execution here: Larrabee seemingly supports 128 kernels (32 cores x 4 kernels per core), NVidia supports 16 kernels (1 per core), while ATI probably supports 8 kernels with any subset of those 8 (1 to 8) per core.

There's also a doubt in my mind over whether ATI truly supports multiple compute kernels per core. R600 supports 8 render states per GPU, where each render state can have a distinct VS, GS, PS etc. It'd be logical that up to 8 compute kernels can be scheduled on a core in R800, but there's only a very vague statement of multiple kernel support so far.


ATI
  1. ATI's cores are not the VLIWs, they are the SIMDs.
  2. It's not really clear how the calls work
  3. ATI has 128KB cache/memory controller
  4. It's a wavefront, not a thread.
  5. The shared memory is ~same as NV, the difference is shmem/vector lane.
  1. Instruction issue in ATI is VLIW (a variable-length VLIW at that). The SIMD is 16 lanes wide. For some reason I forgot to mention that detail. Though it turns out all 3 architectures are 16-wide, so that in itself isn't a point of distinction.
  2. Calls are clearly described in the ISA document - they're static and support recursion through the Sequencer's stack.
  3. The 128KB cache per controller is outside of the cores and since it isn't read/write doesn't really have a meaningful functionality from the point of view of the core.
  4. I'm not interested in "wavefront", it's just a stupid name for thread.
  5. Shared memory is smaller than in GF100 (up to 48KB per 32 threads/1024 strands). There are coding implications in the way LDS operates in comparison with shared memory in NVidia, but I didn't want to go into those.
Intel
  1. There are 16 vector lanes, not threads
  2. The L1D is shared memory
  3. The L1D can be accessed every cycle by the vector pipe
  4. Intel threads != NV or ATI threads
  1. The SIMD is 16 wide and there are 16 strands per thread, so one issue cycle per instruction.
  2. Lines can only be locked in L2, so L1 isn't truly shared memory. GF100 is providing locking in L1 (though granularity of locking is not very exciting).
  3. This is similar to how shared memory works.
  4. Rather meaningless statement. In terms of the vector unit, they are the same in all meaningful senses; Intel merely has only 4 per core. In order to improve latency hiding, programmers are forced to use fibres (a purely software construct, so arbitrary in number) to share a thread's execution allocation - see the sketch below.
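To illustrate what "fibres as a purely software construct" means in practice - this is only a toy Python sketch (the fibre/fetch names are made up, nothing here is Larrabee Native code), but it shows the switch-on-fetch idea:

  # Toy sketch of software fibres: chunks of work interleaved within one
  # hardware thread so that a long-latency fetch issued by one chunk is
  # hidden by compute from another. Purely illustrative; names are hypothetical.
  def run_fibres(fibres):
      pending = list(fibres)
      while pending:
          for f in list(pending):
              try:
                  next(f)                 # run this fibre up to its next fetch point
              except StopIteration:
                  pending.remove(f)       # fibre finished

  def fibre(i):
      for step in range(3):
          # ...issue a gather/texture fetch for (i, step) here...
          yield                           # switch away instead of stalling
          # ...consume the fetched data here...

  run_fibres([fibre(i) for i in range(4)])  # fibre count is arbitrary; it's all software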
I could also have more explicitly compared the various forms of gather and scatter.

Intel has two gather paths (texture units and direct, though the texturing path effectively overloads the cache system/ring-bus - unclear if the TUs are useful for non-texturing fetches) while it appears that the gather path in ATI and NVidia is common through a core-dedicated texture unit (i.e. fetches without filtering). All rates are 16 scalars (32-bit) per clock - though we're waiting to see the LSU clock speed in GF100.

Scatter appears to be 16 per clock in both Intel and NVidia, while in ATI it appears to be 64 across the entire GPU (i.e. only 2 cores can scatter at any time, at the rate of 32 scalars per clock per core). ATI caching appears to be only cursory here (only at the MCs for coalescing - though there's a question mark over how global atomics are implemented and the read-back of global read/write resources in general), whereas Intel and NVidia have dedicated caching per core. Clearly there's not enough off-die bandwidth for all Intel and NVidia cores to scatter into memory simultaneously.
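Taking those per-clock rates at face value (and reading the Intel/NVidia "16 per clock" as per core), the aggregate scatter picture looks roughly like this - a sketch, not a measured figure:

  # Aggregate scatter bytes per clock, assuming the per-clock rates above,
  # 32 Larrabee cores, 16 GF100 cores, 32-bit scalars.
  bytes_per_scalar = 4
  larrabee = 32 * 16 * bytes_per_scalar   # 2048 B/clock
  gf100    = 16 * 16 * bytes_per_scalar   # 1024 B/clock
  cypress  = 64 * bytes_per_scalar        #  256 B/clock, across the whole GPU
  print(larrabee, gf100, cypress)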

I could have included constant and instruction caches in my comparison, but I decided they're probably too small and aren't key comparison points when comparing the implementation cost.

Jawed
 
L1 on NVIDIA is presumably banked (because it's also used as shared memory) so on random scatters which hit cache it would have about 3 times higher throughput (I think, my probability theory is rusty).
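One way to firm that up is a quick Monte Carlo - assuming 16 banks and 16 simultaneous random accesses, with conflicts serializing within a bank (the real bank count and arbitration aren't stated here, so this is only a sketch, and the exact speedup depends on what baseline you compare against):

  # Rough Monte Carlo for random scatters into a banked L1.
  # Assumptions: 16 banks, 16 random accesses per batch, one access per bank per cycle.
  import random

  def cycles_for_batch(accesses=16, banks=16):
      loads = [0] * banks
      for _ in range(accesses):
          loads[random.randrange(banks)] += 1
      return max(loads)                     # conflicting accesses serialize per bank

  trials = 100_000
  avg = sum(cycles_for_batch() for _ in range(trials)) / trials
  print(avg)       # ~3 cycles per 16-access batch on average
  print(16 / avg)  # ~5 effective accesses/cycle vs 1 for a single unbanked port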

PS. Using Johnny-come-lately's definitions over the established ones (AMD/NVIDIA agree on what threads are) makes little sense to me and will generally just cause confusion. Also I don't think Intel's chosen definitions make a lot of semantic sense to begin with.
 
From a software standpoint, my impression is that Larrabee appears to be 128 cores, Nvidia 512, and AMD 320.
The GPUs, through their dedicated scheduling hardware, allow divergent behavior for work units or pixels, each of which, as far as the software is concerned, is an individual core.
Larrabee's emulation of similar behavior through strands and fibers is software-driven so it is software-visible.

At a hardware level, I'd say it's 32, 16, and 20 cores.

The trade-offs look to be interesting.

In terms of operand bandwidth, AMD has 3840 32-bit operands loaded from registers per clock, at peak.
This comes from over 5 MiB of register file.
The LDS adds an extra aggregate 640 KiB.
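For reference, the arithmetic behind those numbers - assuming 20 SIMDs, 16 VLIW units per SIMD with 12 operand reads per clock, 256 KiB of registers and 32 KiB of LDS per SIMD (my reading of the public figures, so treat these as assumptions):

  # Back-of-envelope for the Cypress figures above.
  simds = 20
  print(simds * 16 * 12)        # 3840 32-bit operands per clock
  print(simds * 256 / 1024)     # 5.0 MiB of register file
  print(simds * 32)             # 640 KiB of LDS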

Larrabee has the capacity for 1536 such operands per clock. It clocks much higher, and it would take around 2 GHz to roughly equal Cypress.
The register and L1 resources are small compared to Cypress, but they are backed up by the massive L2.
In aggregate read/write programmable storage, Cypress has something over half the storage of Larrabee, while also being something over half the die size.
At least as far as these two designs go, the differences are in the arrangement rather than the proportion of SRAM to logic.
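A rough cross-check of the "around 2 GHz" remark, assuming 3 register operands per 16-wide instruction on Larrabee and Cypress at 850 MHz (both assumptions on my part):

  # Operand throughput comparison.
  larrabee_per_clk = 32 * 16 * 3          # 1536 operands per clock
  cypress_per_clk  = 3840
  print(larrabee_per_clk * 2.0e9)         # ~3.1e12 operands/s at 2 GHz
  print(cypress_per_clk * 0.85e9)         # ~3.3e12 operands/s at 850 MHz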

Cypress has a huge amount of register file, which offers high bandwidths within the registers, but each SIMD winds up sucking data through a straw when it comes to any transfers beyond that, and writeback in general has a much longer path to take than Larrabee's R/W cache.
Larrabee has lower bandwidth within the L1/reg file area, but has a much more balanced L2-to-compute ratio.

I'm still poring over what Fermi has to offer. I am not sure what the aggregate amount of register file is for the entire chip: is it 128K per core or in total?

Given the size of the chip, this design seems to lean more heavily on the logic side.
The operand bandwidth per clock is what Larrabee would get.
The L1/shared mem bandwidth with 16 L/S units is in aggregate half that of Larrabee because there are only 16 cores.
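To show where that comparison comes from - assuming 16 Fermi cores with 32 lanes and 3 register operands per lane per clock, 16 4-byte load/store accesses per core per clock, and 64 bytes per clock per Larrabee core (all assumptions read off the public block diagrams):

  # Fermi vs Larrabee per-clock figures used above.
  fermi_operands    = 16 * 32 * 3   # 1536 operands/clock, same as Larrabee's 32 * 16 * 3
  fermi_l1_bytes    = 16 * 16 * 4   # 1024 B/clock of L1/shared access
  larrabee_l1_bytes = 32 * 64       # 2048 B/clock
  print(fermi_operands, fermi_l1_bytes, larrabee_l1_bytes)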

Per core:
The Fermi arrangement may be the most flexible of the three, though this depends on these L/S units being able to perform independent accesses (before taking bank conflicts into consideration) for 64 bytes in total from 16 different accesses.
Larrabee works on the granularity of cache lines, with 64 bytes per access.
Cypress can get 4 point-sampled values, though it seems 64 bytes is not to be expected without fetch4; it may lie between Larrabee and Fermi in flexibility.

The global path to memory for Fermi is a little vague.
Larrabee has the 1024-bit ring bus, of unverified implementation.
Cypress has a 1024-bit crossbar for the texture path. The B3D article seems to hint, with the 32-byte cache line sizes, that the best-case bandwidth is a cache line being accessed from each L2 quadrant.
Writes go their own way, apparently.
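Reading the 1024-bit crossbar against the 32-byte line size from the article gives that "one line per quadrant" best case - a sketch, assuming the crossbar moves its full width every clock:

  # Cypress texture crossbar as cache lines per clock.
  crossbar_bytes = 1024 // 8            # 128 B per clock
  line_bytes = 32
  print(crossbar_bytes // line_bytes)   # 4 lines/clock -> one per L2 quadrant at best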

Without knowing more about how Larrabee implements its bus, I'm not certain about the math, but it looks like Larrabee has much higher and more flexible internal bandwidth than Cypress, given the higher clocks and generalized traffic.
The ring bus might inject its own quirks, however.
 
So you're saying a somehow artificially increased demand for graphics horsepower, be it physx, be it eyefinity, be it 3D-stereo, fragments the market even more by doing exactly what?

I agree, all that stuff doesn't help alleviate the basic problem, but it helps both vendors to find some justification for increased graphics performance - which is basically what their business is all about.

Increasing demand is good if done in a standards-compliant way. Doing it in a way that can possibly only serve a small fraction of the market means a dev must decide whether to do the extra work to code for a minority. It may increase demand for GPU power, but only if people code for it.

Proprietary APIs like this make developers say "We will gladly do it if you gladly hand us bags of cash". At least that is what I have seen. :)

-Charlie
 
"high-performance computing" covers tens of billions of dollars of revenue.
 