NVIDIA GT200 Rumours & Speculation Thread

DP would be useful for SATs and to filter exponential shadow maps on 'long' ranges without employing log-filtering.
I skimmed Andy's GPU Gems 3 article the other day and noticed it mentioned double precision being useful to ensure numerical stability.
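To make the SAT point concrete, here's a toy host-side sketch (everything in it is mine rather than from the article) of why fp32 runs out of bits:

#include <cstdio>

// A SAT entry is a prefix sum over up to width*height texels, and fp32
// rounding error grows with the magnitude of that running sum. The filter
// result is a *difference* of four large entries, so those absolute errors
// swamp the small signal you actually want.
int main()
{
    const int n = 2048 * 2048;   // texels feeding a far-corner SAT entry
    float  sumF = 0.0f;
    double sumD = 0.0;
    for (int i = 0; i < n; ++i) {
        float texel = 0.5f + 1e-4f * (i % 7);  // arbitrary depth-ish data
        sumF += texel;   // fp32 prefix sum, SAT-style
        sumD += texel;   // fp64 reference
    }
    printf("fp32: %.2f  fp64: %.2f  error: %.2f\n", sumF, sumD, sumD - sumF);
    return 0;
}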
 
C2Q can do a mul and an add in a single cycle? Last I checked they didn't have a MAD/FMA instruction, though I haven't kept up with the latest SSE flavors.
Another point in that respect would be theoretical GFLOPS vs. sustained throughput. I must admit that I don't have much of a clue about this kind of stuff, but am under the impression that GPUs generally get much closer to theoretical peaks than CPUs.
 
I wonder if NVIDIA are gambling that folks will be smart when writing their CUDA code and only use DP where it's really necessary, using SP everywhere else.

78 GFLOPS isn't much; arguably it's not even worth the hassle of porting code to CUDA when £5k gets you a 16-core Opteron box, and fifteen minutes with an OpenMP manual is much simpler and cheaper on the old manpower front.
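If that is the gamble, the canonical pattern would presumably be to keep storage and most of the math in fp32 and promote only the accumulation - a minimal sketch (kernel, names and launch config are all mine):

__global__ void dotMixed(const float* a, const float* b, double* out, int n)
{
    // Hypothetical mixed-precision dot product: fp32 multiplies, fp64 only
    // in the accumulator and reduction, so the lone DP unit per SM sees a
    // small fraction of the work. Assumes one block of 256 threads.
    __shared__ double acc[256];
    double s = 0.0;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        s += (double)(a[i] * b[i]);   // fp32 mul, fp64 accumulate
    acc[threadIdx.x] = s;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            acc[threadIdx.x] += acc[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        *out = acc[0];   // launched as e.g. dotMixed<<<1, 256>>>(a, b, out, n)
}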
 
btw, is this supposed separate DP MAD in addition to the 8 single-precision MADs (thus helping with the single-precision flops too), or replacing one of them?
I don't see how you could make one of the 8 lanes in a SIMD unit really different.

Clearly you can perform a double-precision operation on two SP operands - so there's a theoretical bump in SP performance (yay, 1TFLOP) but I guess the rounding behaviour will be a bit different...
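If the rumoured numbers hold up, the arithmetic even works out: 240 SP ALUs x 3 flops (MAD + MUL) x ~1.3GHz ≈ 933 GFLOPS, and letting the DP unit in each of the 30 SMs co-issue an SP MAD adds 30 x 2 x ~1.3GHz ≈ 78 GFLOPS on top - nudging it just past the 1TFLOP mark (the same 78 GFLOPS as the DP peak, it being the same unit).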

With the throughput on this unit being so low, you'd only want to use it if the result has 8 clocks of latency before being required by another instruction - otherwise it's going to stall other SP operations.

Jawed
 
I don't see how you could make one of the 8 lanes in a SIMD unit really different.
Yes, you're right, this looks odd. It doesn't really make sense that one of the MADs would just do an (implicit) double-precision op while the other 7 do the same single-precision op...

With the throughput on this unit being so low, you'd only want to use it if the result has 8 clocks of latency before being required by another instruction - otherwise it's going to stall other SP operations.
I've got to wonder - it looks like a strange decision to have a unit explicitly for DP ops, instead of changing the SP units (and the control logic, if necessary) to also be able to handle DP ops. I don't know of any other chip that has separate execution units for SP and DP.
 
C2Q can do a mul and an add in a single cycle? Last I checked they didn't have a MAD/FMA instruction, though I haven't kept up with the latest SSE flavors.
Yes, C2Q can do a mul and an add in the same cycle. So can Phenom, P4 and K8... All these chips have separate execution units for FMUL and FADD (P4 and K8 only get half the throughput per clock because their units are only 64-bit wide internally, so 128-bit instructions take 2 cycles).
So while a single MAD would take 2 cycles (because the chip couldn't issue the mul and add parts simultaneously - actually more than 2 cycles, given the dependency and the mul latency), the peak throughput with two separate units for mul and add is indeed the same as with a single MAD unit. In theory mul + add has benefits because the chip can issue independent muls and adds, while a single MAD should be easier for dependency tracking and the like, since only one instruction needs to be issued.
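To sketch it (intrinsics version - the port assignments are the well-known Core 2 ones, the numbers are mine):

#include <xmmintrin.h>

// d = a*b + c on Core 2: no fused op, so this compiles to separate
// MULPS / ADDPS. The mul and the add execute on different ports, so with
// enough independent work in flight the core sustains one 4-wide mul plus
// one 4-wide add per clock = 8 SP flops/clock/core, ~96 GFLOPS for a 3GHz quad.
__m128 madd(__m128 a, __m128 b, __m128 c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}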
 
Another point in that respect would be theoretical GFLOPS vs. sustained throughput. I must admit that I don't have much of a clue about this kind of stuff, but am under the impression that GPUs generally get much closer to theoretical peaks than CPUs.
I think this depends a bit on what you do. CPUs should be able to reach their theoretical performance too with "simple" math kernels, but they often get limited by memory bandwidth (possibly requiring you to rewrite algorithms to be more cache-friendly) - Nehalem should be much better in that area than a Penryn C2Q, but obviously has nowhere near the memory bandwidth of a GTX280. OTOH, GPUs only come close to their peak performance if you can process data in batches equal to (or larger than) the chip's batch size (though actually with one DP unit per shader multiprocessor, the batch size for DP ops should be only 2? Or maybe I'm misunderstanding something here).
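Back-of-envelope for the bandwidth point (numbers mine): a streaming kernel like y[i] = a*x[i] + y[i] does 2 flops per 12 bytes moved, so even at the GTX280's rumoured ~140GB/s you're capped at roughly 23 GFLOPS - a few percent of peak - no matter how many ALUs are idle. That same ratio is what forces the cache-blocking rewrites on the CPU side.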
 
Yes, you're right, this looks odd. It doesn't really make sense that one of the MADs would just do an (implicit) double-precision op while the other 7 do the same single-precision op...


I've got to wonder - it looks like a strange decision to have a unit explicitly for DP ops, instead of changing the SP units (and the control logic, if necessary) to also be able to handle DP ops. I don't know of any other chip that has separate execution units for SP and DP.

Yes, it does have sort of a tacked-on feel, almost as though it were an afterthought...

There is one advantage to having separate DP units though - your SP units aren't tied up so you can still get full rate there (barring setup issues).
 
I skimmed Andy's GPU Gems 3 article the other day and noticed it mentioned double precision being useful to ensure numerical stability.
Incidentally I just prototyped a SAT + EVSMs implementation the other day to judge the feasibility... the good news is that the precision artifacts are on the same order as SAT + standard VSMs... the bad news is the same :)

I may be able to pull off something with mod arithmetic like I did with the int32 stuff in the chapter/demo, but it's not as obvious how to make that work well with the exponential warp. It'd be a really fun thing to play with using doubles.
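For anyone following along, the crux is the warp itself - a rough sketch (the names and the constant handling are mine):

// EVSMs filter moments of exp(c*z) linearly. expf overflows fp32 once
// c*z exceeds ~88, so it's the 8 exponent bits rather than the 24-bit
// mantissa that run out first - hence the talk of "more exponent bits" below.
__device__ float2 evsmWarp(float z, float c)
{
    float ez = expf(c * z);            // warped depth
    return make_float2(ez, ez * ez);   // first and second moments
}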
 
The other day I was actually thinking about using extended precision instead of double precision, since to filter exponential values we already have enough mantissa bits - what we need is more exponent bits. Theoretically, extended precision implemented with half floats might be better than single-precision values (range-wise).
 
The other day I was actually thinking about using extended precision instead of double precision, since to filter exponential values we already have enough mantissa bits - what we need is more exponent bits. Theoretically, extended precision implemented with half floats might be better than single-precision values (range-wise).
Sounds like you could do your own math then, using, say, an int8 for the mantissa and an int24 for the exponent.
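Something along these lines, say - a variation using a float mantissa for the arithmetic rather than an int8 (names and layout are mine, just to show the add, which is all a box filter needs):

struct XFloat { float m; int e; };   // value = m * 2^e; int e gives far more range than fp32's 8 exponent bits

// Add two extended-range values: align the smaller exponent, sum, renormalise.
__device__ XFloat xadd(XFloat a, XFloat b)
{
    if (a.e < b.e) { XFloat t = a; a = b; b = t; }  // a holds the larger exponent
    float m = a.m + ldexpf(b.m, b.e - a.e);         // b.m * 2^(b.e - a.e)
    int e;
    m = frexpf(m, &e);                              // renormalise m into [0.5, 1)
    XFloat r = { m, a.e + e };
    return r;
}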

Jawed
 
Sounds like you could do your own math then, using, say, an int8 for the mantissa and an int24 for the exponent.
If the math is non-linear, though, you can't use hardware filtering (or SATs, for that matter) :(

But yes, G(T)200 ftw ;) That said, my guess is that they won't bother to expose DP until the next major API update. Perhaps they'll add it to the GL extensions, but I dunno whether I'm keen enough to recode everything in GL again! ;)
 