NVIDIA GT200 Rumours & Speculation Thread

Status
Not open for further replies.
GPU-powered supercomputer?

http://www.google.com/translate?u=http%3A%2F%2Fwww.pcinpact.com%2Factu%2Fnews%2F43165-premier-supercalculateur-GPU-France-Tesla.htm&langpair=fr%7Cen&hl=en&ie=UTF8

Translation said:
According to our information, the 1068 processors are indeed eight-core Nehalems, and the 48 modules are Tesla GPU solutions from NVIDIA. These would moreover be a new generation of Tesla, very probably with GT200 chips at the helm, and perhaps even several per module (we imagine two, as in NVIDIA's current Tesla solutions).

In this machine, all the CPUs together are supposed to produce a theoretical 103 TFlops, against 192 TFlops for the GPUs. With 96 GPUs against 1068 CPUs, the domination of the GPU over the CPU in intensive computing is already clear, thanks to its many parallel stream processors.
 
Interesting. I think they're wrong though; those are very likely 3GHz 4-core Nehalems if the delivery is to take place in early 2009. Nehalem-EX is only slated for 2H09. I also think it's 192 GPUs, not 96. It's also much easier to get to the stated GFlop figures that way.
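As a quick sanity check, the arithmetic does line up under those assumptions (3.0 GHz quad-core Nehalems counting 8 SP flops/cycle/core, and ~1 TFLOPS SP per GT200 — both of which are guesses, not confirmed specs):

```python
# Back-of-envelope check of the quoted system-level figures.
# Assumptions: 1068 quad-core 3.0 GHz Nehalems, 8 SP flops/cycle/core
# (4-wide SSE ADD + 4-wide SSE MUL); 192 GPUs at ~1 TFLOPS SP each.
cpus, cores, flops_per_cycle, clock_ghz = 1068, 4, 8, 3.0
cpu_tflops = cpus * cores * flops_per_cycle * clock_ghz / 1000
print(cpu_tflops)  # 102.528 -> matches the quoted ~103 TFlops

gpus, tflops_per_gpu = 192, 1.0
gpu_tflops = gpus * tflops_per_gpu
print(gpu_tflops)  # 192.0 -> matches the quoted 192 TFlops
```

With 96 GPUs instead of 192, the GPU side would only come to ~96 TFlops, which is why 192 fits the stated figure much better.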

It is a rather interesting setup. Clearly they're hedging their bets; even if some users of the supercomputer don't use the GPUs, it'll still be top-notch. In terms of revenue, there's clearly still a lot more going to Intel than to NVIDIA there; so the potential is for that to evolve in the coming years with GPUs having more and more HPC design wins and more and more of each win being GPU-centric. We'll see how that goes.

Cheers for pointing this out, I'll news it later today... :)
 
Presumably this is single precision? We don't expect GT200 to be packing a TFLOP of DP math do we?

And if the assumption is correct about it being a 4-core Nehalem, then that's about 100 GFLOPS per CPU. Isn't that what existing 4-core Penryns peak at under SP as well?
 
Presumably this is single precision? We don't expect GT200 to be packing a TFLOP of DP math do we?
Yeah, this is definitely all SP as far as I can tell.

And if the assumption is correct about it being a 4-core Nehalem, then that's about 100 GFLOPS per CPU. Isn't that what existing 4-core Penryns peak at under SP as well?
That is correct; see David Kanter's analysis of Nehalem at RWT. The execution units are basically identical. With 2 threads/core and a lot more memory bandwidth, real-world performance should be substantially improved though.
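For reference, the per-socket peak math comes out the same way on both chips (a sketch, assuming the usual counting of one 4-wide SSE ADD plus one 4-wide SSE MUL per cycle; the Nehalem clock is a guess):

```python
# Per-socket SP peak, assuming 8 flops/cycle/core (4-wide SSE ADD +
# 4-wide SSE MUL) on both Penryn and Nehalem.
def sp_peak_gflops(cores, clock_ghz, flops_per_cycle=8):
    return cores * flops_per_cycle * clock_ghz

print(sp_peak_gflops(4, 3.2))  # 102.4 -> a 3.2 GHz quad-core Penryn
print(sp_peak_gflops(4, 3.0))  # 96.0  -> an assumed 3.0 GHz quad-core Nehalem
```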
 
Yeah, this is definitely all SP as far as I can tell.

That is correct; see David Kanter's analysis of Nehalem at RWT. The execution units are basically identical. With 2 threads/core and a lot more memory bandwidth, real-world performance should be substantially improved though.

There's been a slight re-organization and optimization of the front-end, as well as the addition of SSE 4.2 into the mix.
 
There's been a slight re-organization and optimization of the front-end, as well as the addition of SSE 4.2 into the mix.
Yeah, unless your definition of a floating point operation is non-standard though, neither of those is going to affect the final GFlop rating. So fair enough, but I did take that into consideration when I focused on execution units only! :)
 
Yeah, unless your definition of a floating point operation is non-standard though, neither of those is going to affect the final GFlop rating. So fair enough, but I did take that into consideration when I focused on execution units only! :)

I disagree. Instruction throughput will increase because of the improved decode capabilities, such as new instructions becoming eligible for micro-op fusion. More instructions decoded means more issued, which should lead to an overall increase in resource utilization, which of course increases efficiency and instruction throughput (or parallelism, if you prefer).
 
The GFLOP rating being used is the maximum theoretical output of the FP hardware at the assumed clocks Nehalem will run at.

No amount of efficiency change is going to affect the peak value.
 
I disagree. Instruction throughput will increase because of the improved decode capabilities, such as new instructions becoming eligible for micro-op fusion. More instructions decoded means more issued, which should lead to an overall increase in resource utilization, which of course increases efficiency and instruction throughput (or parallelism, if you prefer).

Yeah, but that's going to increase actual throughput, not the theoretical maximums being discussed here.

Arun, what is the standard for evaluating CPU Flops anyway? Is it based on 32-bit MADDs or something?
 
I think it's single-precision ADDs + MULs. So just ignore all the 'misc.' stuff like shifting, even though they are theoretically 'operations'. And Shaidar, someone pointed out in the French comment thread that the Top 500 ranking is measured via Linpack, so clearly you're not the only person who looks at it that way... ;) (now I wonder when we'll see a CUDA-accelerated Linpack suite!)
 
Linpack ftw! Cuda/CTM Linpack would be interesting, although it'd be hard to do anything more than SP and Linpack is at least DP IIRC.
 
I think it's single-precision ADDs + MULs. So just ignore all the 'misc.' stuff like shifting, even though they are theoretically 'operations'. And Shaidar, someone pointed out in the French comment thread that the Top 500 ranking is measured via Linpack, so clearly you're not the only person who looks at it that way... ;) (now I wonder when we'll see a CUDA-accelerated Linpack suite!)

Well, in the REAL world of HPC it's DP flops/s, and DP flops/s on Linpack nxn. I hate this PR BS of using SP flops in the HPC world because it's about as accurate as using bogomips!

So for a Top 500 Rpeak you are probably looking at ~50 TFlops based on public info, and an Rmax of ~40 TFlops for the CPU portion, and probably a lot less than that for the GT200 portion.

But even that is wildly optimistic for any real HPC suite of applications.

Aaron Spink
speaking for myself inc.
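Aaron's DP figures check out under plausible assumptions (4 DP flops/cycle/core on quad-core 3.0 GHz Nehalems, and a typical ~80% Linpack efficiency — both assumed here, not stated publicly):

```python
# DP sanity check on the ~50 TFlops Rpeak / ~40 TFlops Rmax estimate.
# Assumptions: 1068 quad-core 3.0 GHz Nehalems, 4 DP flops/cycle/core
# (2-wide SSE ADD + 2-wide SSE MUL); ~80% Linpack efficiency.
cpus, cores, dp_flops_per_cycle, clock_ghz = 1068, 4, 4, 3.0
rpeak_tflops = cpus * cores * dp_flops_per_cycle * clock_ghz / 1000
rmax_tflops = 0.80 * rpeak_tflops
print(rpeak_tflops)  # 51.264 -> "~50 TFlops" Rpeak
print(rmax_tflops)   # ~41    -> "~40 TFlops" Rmax
```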
 
Amazing. So TG Daily can't even read basic text, let alone do a tiny bit of research and realize what the truth is (which is that the Nehalems are 4-core/8-thread Bloomfields and the GPUs are 48x4 GT200s, for a total of 192). And regarding bandwidth, errr, each GPU has 4x the bandwidth of a Bloomfield chip. The bandwidth per GFlop is indeed a bit lower and there's much less cache, but it's not THAT huge of a difference.

So yeah, 'mainstream' websites continue their streak of getting everything wrong on that kind of thing. Yay! And no, I'm not bitter - it's just a bit pitiful.
 
Well, in the REAL world of HPC it's DP flops/s, and DP flops/s on Linpack nxn. I hate this PR BS of using SP flops in the HPC world because it's about as accurate as using bogomips!
haha, agreed. As for the performance estimates, fair enough - I figure you have a lot more knowledge there than I do or nearly everyone else here. I'd still be quite curious to see what a port of Linpack to GT200 via CUDA would look like though, both in terms of implementation and performance.
 
haha, agreed. As for the performance estimates, fair enough - I figure you have a lot more knowledge there than I do or nearly everyone else here. I'd still be quite curious to see what a port of Linpack to GT200 via CUDA would look like though, both in terms of implementation and performance.

For linpack nxn, they should be able to get close to 90% efficiency except for the effects of the roll up portion which could cause them some issues. But on a single chip, I would say anything less than 90% would be reason to shoot the programmers, after all it should basically be playing to every single strength that they have, massive computation, easily blockable, etc. And they can size the matrix so that it completely fits within the GDDR memory.

Now the one problem is that linpack proper is DP and no one knows anywhere close to enough to guess how/if the GT200 has DP support.

Aaron Spink
speaking for myself inc.
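On the "easily blockable" point: the bulk of Linpack's work is matrix-matrix multiplication, which tiles naturally into blocks sized to fit fast memory. A minimal pure-Python sketch of the idea (illustrative only — a real CUDA port would size tiles to shared memory and the GDDR capacity, as Aaron describes):

```python
import random

# Tiled matrix multiply: process bs x bs blocks so that each tile of
# A, B, and C can stay resident in fast memory while it is reused.
def blocked_matmul(A, B, bs=4):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, n, bs):
            for k0 in range(0, n, bs):
                # multiply one bs x bs tile of A by one tile of B
                for i in range(i0, min(i0 + bs, n)):
                    for k in range(k0, min(k0 + bs, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + bs, n)):
                            C[i][j] += a * B[k][j]
    return C

def naive_matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

n = 8
A = [[random.random() for _ in range(n)] for _ in range(n)]
B = [[random.random() for _ in range(n)] for _ in range(n)]
Cb = blocked_matmul(A, B)
Cn = naive_matmul(A, B)
assert all(abs(Cb[i][j] - Cn[i][j]) < 1e-9
           for i in range(n) for j in range(n))
```

The blocked version computes exactly the same result; the payoff on real hardware is data reuse, which is why efficiencies near 90% are achievable when the matrix fits in device memory.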
 
So what's the apparent single-precision FLOP rating of GT200 extracted from this?

Jawed

All numbers appear to be single-precision peak numbers for the GT200, which means ~1 TF (assuming, of course, that the websites got the numbers right). And there is probably no way the machine makes it into the top 10 of the Top 500 list, contrary to what TG Daily says, unless all the reported numbers are way off.
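Taking the reported totals at face value, the implied per-GPU figure falls out directly (the module count of 48x4 and the ALU configuration below are assumptions for illustration, not confirmed specs):

```python
# Implied per-GPU SP rating from the reported system total.
gpus = 48 * 4                # 48 modules, assumed 4 GPUs each = 192
print(192 / gpus)            # 1.0 TFLOPS per GT200

# One hypothetical configuration reaching roughly that figure: scalar
# ALUs each doing a MAD+MUL (3 flops) per clock, e.g. 240 ALUs at
# 1.5 GHz -- both numbers are guesses, not confirmed specs.
print(240 * 3 * 1.5 / 1000)  # ~1.08 TFLOPS
```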
 