NVIDIA Kepler speculation thread

I think the GPU is soldered onto the motherboard in this case.
I don't think so. The X2090 is a card that comes without any fan and so on, but it still has a slot and is obviously screwed to the mainboard.

That's a Tesla X2090:
[Image: cray_xk6_super_x2090_gpu.jpg]


And that's an XK6 board with 4 of them on the right side:
[Image: XK6_blade.bmp]


I see slots there.
 
Most server racks I've seen so far stand side by side and use the accessible front for air intake.

edit: My notebook uses passive cooling and cannot be opened, sorry.
 
Most server racks I've seen so far stand side by side and use the accessible front for air intake.

edit: My notebook uses passive cooling and cannot be opened, sorry.

Aah, you mean the direction of airflow! Actually, what you see there is a blade, so it's probably going vertical ;)
And I would almost expect that ORNL is using the liquid cooling option anyway.

By the way, I think the point where they say that they will populate the second socket in each board with a Kepler GPU could also be a scrambled way of saying that they will populate every second slot with a GPU. I.e. they will populate only half of the maximum amount, so only 48 GPUs per Rack. In an older interview it was claimed that while Titan is planned for 20 PFlop/s, it could deliver up to 30 PFlop/s in the same area footprint (200 cabinets) and it is a matter of funds to scale it to that point.
 
I.e. they will populate only half of the maximum amount, so only 48 GPUs per Rack. In an older interview it was claimed that while Titan is planned for 20 PFlop/s it could deliver up to 30 PFlop/s in the same area footprint (200 cabinets) and it is a matter of funds to scale it to that point.

That said, it would be 1.9+ TFlop/s per GPU, which is on par with nvidia's claimed tripling of DP throughput.
 
By the way, I think the point where they say that they will populate the second socket in each board with a Kepler GPU could also be a scrambled way of saying that they will populate every second slot with a GPU. I.e. they will populate only half of the maximum amount, so only 48 GPUs per Rack.
That's what I was thinking when I read that bit. But 1.8 TFLOPS per Kepler seems a bit on the high side, I think. Or do we have other options on the CPU side? Higher clocks, maybe through turbo?
 
That's what I was thinking when I read that bit. But 1.8 TFLOPS per Kepler seems a bit on the high side, I think. Or do we have other options on the CPU side? Higher clocks, maybe through turbo?
The Titan populated with only half of the GPUs will be less than 20 PFlops. The pdf says 10-20 PFlops. And in the interview (there are several, here is one of them) the 30 PFlops are mentioned for the full system.
As for the size of Titan and its capabilities, Mason said that'll depend on Congress and the funding that becomes available.

"We think we can get to 30 (petaflops) when it's fully built up," he said. "Now how quickly we get there will depend on what happens in the budget discusions and so forth. But it'll be somewhere in the 10 to 30 petaflops (range), depending on the funding and how quickly we can populate these GPU slots."

Is it conceivable that Titan could become a 30-petaflops machine in 2012?

"It's all dependent on money," Mason said.

So one should take the 30 PFlop/s with 19200 GPUs and 19200 16-core BDs:

BD: 8 FPUs * 4-wide FMA * 2 Flops * 2.5 GHz = 160 GFlop/s per CPU => 19200 CPUs = 3.07 PFlop/s
That leaves ~27 PFlop/s for the 19200 GPUs => ~1.4 TFlop/s per GPU.
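
Written out, a quick Python sketch of the same estimate (the node count, clocks and the 30 PFlop/s target are this thread's assumptions, not official specs):

Code:
# Titan as assumed here: 19200 nodes, each with one 16-core Bulldozer
# @ 2.5 GHz and one Kepler GPU, 30 PFlop/s total peak.
nodes = 19200
bd_flops = 8 * 4 * 2 * 2.5e9      # 8 FPU modules * 4-wide FMA * 2 flops * 2.5 GHz
cpu_total = nodes * bd_flops      # ~3.07 PFlop/s from the CPUs
gpu_total = 30e15 - cpu_total     # remainder attributed to the GPUs
print(cpu_total / 1e15, "PFlop/s CPU share")
print(gpu_total / nodes / 1e12, "TFlop/s per GPU")   # ~1.4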

As the 30 PFlop/s were already mentioned some time ago, I would guess the 1.4 TFlop/s figure is about what nvidia projected for it.

The uncertain funding could also be an alternative reason (instead of availability) why the GPU upgrade is scheduled only for the second half of 2012 with an unclear timeline.

Edit:
As we speak of 1.4 TFlop/s: 16 SMs with 64 SPs each and GF104-style multi-issue schedulers (maybe even triple instead of dual issue?) sounds quite feasible to me. And that nvidia plans to run those 1024 SPs @ 700/1400 MHz for the Kepler Teslas appears not to be out of this world.
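
To show where that speculative configuration lands, a small sketch (16 SMs x 64 SPs @ 1.4 GHz hot clock, FMA counted as 2 flops, and a 1:2 DP rate like on the current Fermi Teslas assumed):

Code:
sms, sps_per_sm = 16, 64
hot_clock = 1.4e9                            # speculative shader clock in Hz
sp_peak = sms * sps_per_sm * 2 * hot_clock   # ~2.87 TFlop/s single precision
dp_peak = sp_peak / 2                        # assumed 1:2 DP rate -> ~1.43 TFlop/s
print(sp_peak / 1e12, "TFlop/s SP,", dp_peak / 1e12, "TFlop/s DP")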
 
Given the utilization I've seen on GF104 vs. GF100, I am still not convinced that this is the way to go when looking for the HPC holy grail. Why would Nvidia bet on its second-best horse?
 
Given the utilization I've seen on GF104 vs. GF100, I am still not convinced that this is the way to go when looking for the HPC holy grail. Why would Nvidia bet on its second-best horse?
In the HPL benchmark you have several parallel chains of FMAs in each "thread". The dual or triple issue won't limit you in the slightest. Otherwise CPUs wouldn't be as effective as they are ;)
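
To illustrate the point about parallel FMA chains, a toy Python sketch (just the idea, not how DGEMM is actually written): with independent accumulators there is no serial dependency between consecutive multiply-adds, so a dual- or triple-issue scheduler always finds ready work.

Code:
def dot_two_chains(a, b):
    # two independent accumulator chains; chain 1 never waits on chain 0
    # (toy code: assumes len(a) is even)
    acc0 = acc1 = 0.0
    for i in range(0, len(a) - 1, 2):
        acc0 += a[i] * b[i]          # chain 0
        acc1 += a[i + 1] * b[i + 1]  # chain 1
    return acc0 + acc1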

And considering GCN's massive scheduling bandwidth, an SM with 4 dual-issue schedulers for 4 vec16 ALU blocks, 2 vec16 L/S units, at least 2 vec4 SFU ALUs, and 8 TMUs wouldn't look overspec'ed in comparison either.
 
You know, there was a time (around R600), when I was under the impression that „future workloads” were largely determined by 3DMark performance. Luckily for the gaming side of things, this has changed. I hope no one trips into that pit in the HPC market.

That said, I don't know the HPL bench intimately and probably cannot estimate its importance. But given that most people in the HPC space are likely to run their own code, they might not be blinded by the numbers game as easily as the average Joe.
 
That said, I don't know the HPL bench intimately and probably cannot estimate its importance.
It's the one determining the ordering of the Top500 list ;).
HPL is short for High Performance Linpack and is a portable implementation of the Linpack benchmark for distributed-memory computers (aka HPC clusters).
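
For reference, the Top500 figure falls out of the time to factor one big dense matrix; a minimal sketch, assuming the standard LU operation count (the problem size and runtime below are made up for illustration):

Code:
def hpl_flop_count(n):
    # operation count HPL credits for solving a dense n x n system via LU
    return 2.0 / 3.0 * n ** 3 + 2.0 * n ** 2

n, seconds = 4000000, 20000.0     # hypothetical problem size and runtime
print(hpl_flop_count(n) / seconds / 1e15, "PFlop/s (Rmax-style figure)")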

But given that most people in the HPC space are likely to run their own code, they might not be blinded by the numbers game as easily as the average Joe.
4 dual-issue schedulers for 4 vec16 ALUs can't be worse than 2 single-issue schedulers for 2 vec16 ALUs in GF100.
 
I guess in that case they would claim 37 or 38 PFlop/s peak for the full system, wouldn't they?

Don't know what full system means really. If it's 19200 GPUs then their 3x perf/w claim is bogus unless Kepler pulls far less than 225w.
 
Don't know what full system means really. If it's 19200 GPUs then their 3x perf/w claim is bogus unless Kepler pulls far less than 225w.
That 3x claim has a rather unclear baseline. It was speculated it could be DGEMM performance, where Fermi doesn't reach full efficiency because of register or bandwidth constraints (a larger register file and/or a faster L1 could remedy this, potentially adding up to 40% more performance without increasing the ALU count).

From the (different versions of the) roadmap slides all we know is the claim of up to 6 GFlop/s per Watt. And if we take the 238W(?) TDP (or 225W, doesn't matter) of the current Teslas, we also arrive at ~1.4 TFlop/s for Kepler.
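
The same arithmetic for completeness (the 6 GFlop/s per Watt figure is from the roadmap slides as read above; both TDP values are the ones mentioned in this thread):

Code:
gflops_per_watt = 6.0
for tdp_watts in (225, 238):
    print(tdp_watts, "W ->", gflops_per_watt * tdp_watts / 1000, "TFlop/s")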
 