NVIDIA Fermi: Architecture discussion

They can probably scale this design down to mainstream parts, like they've always done.
Obviously they're showing the big guns to the press, just like they showed the 8800 at launch; an 8600 was available not much later, which was basically just G80 scaled down (with some extra additions even, in that particular case).

I'm sure they can, but ATI has parts in the mainstream segment as well, and the relative yields/price/performance ratios are still going to be in their favor, regardless of how nvidia decides to scale their architecture. Basically, at any given performance point, the GF100 part will probably be more expensive to *make*, just like the G200 vs R770 situation.

You can do many more things with the GF100, and its use in HPC seems pretty awesome, but for mainstream gaming and consumer use, the market's needs can't be met at the same price points without accepting a lower margin. That's generally not a good thing to do if you want to be competitive in that market.
 
G80 was also WAY more than just a DX10 chip, with a large part of the transistor budget dedicated to CUDA (those 3-year-old chips can now run OpenCL and DirectCompute!).
I can't think of a single transistor in G80 that's CUDA specific. GT200 manages to squeeze in some double precision. Maybe I'm just tired and forgetful :???:

Jawed
 
It's a risk, but why is no one postulating that AMD is forsaking graphics by adding DP when it did?
That's simple: AMD put in really cheap DP. Insanely cheap. It isn't very good, either. Just adequate if you're willing to work around it.

Jawed
 
This is true for the peak gflops rate using mad/fma (a 64-bit fma needs all 4 simple alus on rv870; rv770 couldn't do 64-bit fma at all). It is noteworthy, though, that for other instructions, e.g. mul or add, the rate is 2/5 of the single precision rate - so still 544GFlops when using only adds or muls, as long as the compiler can extract pairwise independent ones (GF100 will drop to half the gflops with muls or adds).
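(Spelling out the quoted figures - an illustrative back-of-envelope assuming Cypress/rv870 at 850MHz with 320 VLIW5 units, not something from the post itself:
SP add/mul only: 1600 lanes x 0.85GHz = 1360 GFlops
DP add/mul: 320 units x 2 ops x 0.85GHz = 544 GFlops, i.e. 2/5 of the SP add/mul rate
DP fma peak: 320 units x 1 fma x 2 flops x 0.85GHz = 544 GFlops)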
That's a good point, but GF100 DP is prolly "real man's" DP, not the half-cocked ATI variant.

Subnormals and exception processing on GF100 would appear to be much better than on Larrabee too. Do all these things make GF100 compelling?

Does the lack of SSE compatibility in Larrabee make it a dead duck in HPC?

Jawed
 
It doesn't look to me like nvidia invested a lot of transistors into INT capability. 32-bit muls are half rate according to RWT (64-bit at 1/4 rate, I guess mostly needed because the chip can address more than 4GB of ram?),
32 bit MUL wants to produce a 64-bit result in general - that's still pretty expensive. If it wasn't, we'd have more INT MULs in earlier GPUs.
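A tiny sketch (not from the thread; kernel and names are made up) of the addressing case being alluded to - once a board can hold more than 4GB, the index arithmetic needs the full 64-bit product of its 32-bit operands or it silently wraps:

__global__ void scale_row(double *buf, unsigned row, unsigned row_pitch_elems)
{
    unsigned col = blockIdx.x * blockDim.x + threadIdx.x;
    // row and row_pitch_elems are 32-bit; their product can exceed 2^32 for
    // a big enough 2D array, so it has to be carried out at 64-bit width
    // before being added to the 64-bit base pointer.
    unsigned long long idx = (unsigned long long)row * row_pitch_elems + col;
    buf[idx] *= 2.0;
}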

and you can't issue INT and FP at the same time (on the same pipeline),
I'm not willing to believe that. How would DP function if the bandwidth wasn't there, since it uses both units concurrently?

so in terms of implementation it sounds pretty cheap to me. Still, INT capability is certainly more than decent. Note that rv870 beefed up INT capability too - it can do 4 muls/adds per clock, though they are 24-bit only (of course that should be very cheap in implementation),
We have no details on the ATI 24-bit INT-MUL - is that just the 24 lowest bits? I suspect it's for addressing-type calculations.

Jawed
 
producer/consumer fashion. In GF100 producer/consumer requires either transmitting data amongst cores or using branching within a single kernel to simulate multiple kernels. This latter technique is not efficient. (Hardware DWF would undo this argument.)
What's so inefficient about putting circular append buffers between producer/consumer kernels and branching to consumers when they have full warps? It takes storage, but running strands on Larrabee with only a few active fibers won't be efficient either ... the storage is a necessity.
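For what it's worth, here's a minimal CUDA sketch of that idea (purely illustrative; all names are made up). To stay self-contained it uses a block-local shared-memory queue and a simple two-phase produce-then-consume, rather than a global circular buffer between kernels, but the point is the same: consumers only ever see densely packed work, so warps stay full even when the interesting inputs are sparse.

#define QUEUE_CAP 256   // launch with <= 256 threads per block

__global__ void produce_then_consume(const int *in, int *out, int *out_count, int n)
{
    __shared__ int queue[QUEUE_CAP];
    __shared__ int count;

    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    // Producer: append the "interesting" inputs (here: odd values) to the queue.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n && (in[tid] & 1)) {
        int slot = atomicAdd(&count, 1);   // each thread appends at most once, so slot < blockDim.x
        queue[slot] = in[tid];
    }
    __syncthreads();

    // Consumer: drain the queue with the whole block, warps fully populated.
    for (int i = threadIdx.x; i < count; i += blockDim.x)
        out[atomicAdd(out_count, 1)] = queue[i] * queue[i];   // stand-in for the real consumer kernel
}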
 
To summarize (in no particular order):

1) Real DP, no really, full IEEE dp without apologies

2) function pointers, recursion

3) C++ style new/delete, exception handling

4) a C++ debugger for Visual Studio; dunno if there will be a CUDA Fortran compiler

5) cpu style caches for better performance in irregular workloads

6) more int32 performance than what anybody needs. Why? Why?

7) More shared memory.

8) simultaneous kernel execution; compute, cpu->gpu memcpy and gpu->cpu memcpy can all go in parallel (a rough sketch of the streams side of this follows the list)

9) full ECC, all the way from reg file to off chip ram

10) unified mem space, but memspace must be known at compile time.

11) supports both DDR3 and GDDR5. Why bother with DDR3 if you have spent transistors and effort to hack in ECC on GDDR5?

Have I left out anything?
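Here's the rough sketch of what point 8 looks like from the API side (illustrative only; the kernel, sizes and chunking are made up, and actually getting upload, compute and download to overlap assumes the board has the copy engines for it). With pinned host memory and two streams, the copy-in for one chunk can run alongside the kernel for another and the copy-out of a third:

#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int chunks = 8, n = 1 << 20;
    float *h, *d;
    cudaMallocHost((void **)&h, (size_t)chunks * n * sizeof(float)); // pinned, needed for async copies
    cudaMalloc((void **)&d, (size_t)chunks * n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = s[c % 2];
        // Within a stream the three operations stay ordered; across the two
        // streams they are free to overlap.
        cudaMemcpyAsync(d + (size_t)c * n, h + (size_t)c * n, n * sizeof(float), cudaMemcpyHostToDevice, st);
        scale<<<(n + 255) / 256, 256, 0, st>>>(d + (size_t)c * n, n);
        cudaMemcpyAsync(h + (size_t)c * n, d + (size_t)c * n, n * sizeof(float), cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}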
 
5) cpu style caches for better performance in irregular workloads
It has a read/write cache on the memory bus, very nice ... but that's not really a CPU style architecture. CPUs tend to be read/write and coherent across the entire cache hierarchy.
6) more int32 performance than what anybody needs. Why? Why?
Because they have huge multipliers for DP around anyway, 50% of which will be idle when not doing DP ... no point in sweating the small stuff.
11) supports both DDR3 and GDDR5. Why bother with DDR3 if you have spent transistors and effort to hack in ECC on GDDR5?
Have they even said they would do ECC on GDDR5?
 
My interpretation is that Fermi is TMU- and ROP-less. NVidia's traded them for lots of INT ALUs (512, an insane number by any standard, let's make no mistake).
It's inferior to Larrabee, which has integer vector operations (and an additional scalar one as well).

I don't know how many 32-bit ALU cycles on GF100 a bog standard 8-bit texture result would take through LOD/bias/addressing/decompression/filtering, so it's hard to say whether GF100 has ~2x GT200's texture rate, or 4x, etc.
FP and INT share a data path (and possibly more hardware for all we know) and cannot coissue. Having something like that thrown back on the shader core would effectively end the decoupled texturing that led to such efficiencies in earlier GPUs. I'd worry that shader work couldn't progress until texturing was done.


Intel, by comparison, dumped a relatively tiny unit (rasteriser), so the effective overhead on die is small. Rasterisation rate in Larrabee, e.g. 16 colour/64 Z per clock, is hardly taxing.
How did you derive this rate for Larrabee, particularly the Z rate?

Broadly, anything Larrabee can do, GF100 can do too in terms of programmability. What gives me pause is that on Larrabee multiple kernels per core can work in producer/consumer fashion.
So do one or more threads on the core wait around for the producer to complete and then pick up the work, or is it multiple working threads and then a context switch to pull in a consumer?
 
I'm sure they can, but ATI has parts in the mainstream segment as well, and the relative yields/price/performance ratios are still going to be in their favor, regardless of how nvidia decides to scale their architecture. Basically, at any given performance point, the GF100 part will probably be more expensive to *make*, just like the G200 vs R770 situation.

If G92 vs R670 and G200 vs R770 have taught us anything, it's that what matters isn't how expensive a chip is to make, but how well it performs.
 
Probably not exception handling outright, i.e. with the hardware fully handling the branching/states/etc. with no overhead when exceptions don't occur ... but as long as the flags are there, they should be able to support exceptions with a performance hit (not a huge one either, AFAICS).
 
Nothing about the DP nbody demo was simulated or faked. It was running on real hardware.

Which demo was that? And why was the fluid demo reportedly running on G2xx hardware if there was real hardware to be used?
 