So what? X2 cards of any sort sell for huge amounts of money, with ludicrous margins to spread all around. Unless your single-chip performance just sucks, in which case you have bigger problems. Nobody needs to optimize high-end cards for cost. Why do you think NV fully specifies the high-end, including the cooling? Because if folks in Taiwan try to cut costs, they will create problems further down the road.
If you look at the strategy for high-end cards, it doesn't involve optimizing for cost. Cooling >130W is quite expensive, and routing that much power means a PCB with many layers, shitloads of caps, VRMs, etc.
What you're basically saying is this: "why earn $300 when you can earn $100". Do you really want to say something like this? Because that's clearly b.s.
You totally missed the big picture. The super high-end of the market that buys a GTX 280 or RV770x2 is minuscule by volume and has almost zero impact on overall profits; it's the halo effect that's useful. GPU vendors make most of their money on pro GPUs and on GPUs in the $100-250 range.
The bolded part isn't true at all, I know this for sure, so everything built on it is wrong as well.
Yes, a single GPU may be more efficient, but only in a narrow and uninteresting sense...
Doing more with fewer resources is uninteresting? Doing things that are impossible on an AFR system is uninteresting? That's certainly an interesting point of view. Maybe we should go back to the Voodoo days, since all that flexibility and programmability is uninteresting?
Flexibility is tricky, since SLI/XF are software-visible hacks that require changing your app.
Nothing is tricky about the inefficiencies of AFR; the tricky part is trying to avoid them, and in doing so you often lose that flexibility.
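To put some made-up numbers on that, here's a toy model (nothing real, just arithmetic) of what inter-frame dependencies do to AFR scaling: every frame that reads a render target produced by the previous frame has to wait for a copy from the other GPU.

```python
# Toy model of 2-GPU AFR; all numbers are invented for illustration.
def afr_speedup(base_ms, dep_fraction, copy_ms):
    """Throughput gain of two-GPU AFR over a single GPU.

    base_ms      -- time for one GPU to render one frame
    dep_fraction -- fraction of frames reading a render target from the previous frame
    copy_ms      -- cost of syncing/copying that target from the other GPU
    """
    per_frame = base_ms + dep_fraction * copy_ms   # each GPU's effective frame time
    return 2 * base_ms / per_frame                 # no dependencies -> exactly 2.0

print(afr_speedup(30.0, 0.0, 12.0))   # 2.0   -- the marketing case
print(afr_speedup(30.0, 0.5, 12.0))   # ~1.67 -- some render-target reuse
print(afr_speedup(30.0, 1.0, 30.0))   # 1.0   -- fully serialized, zero gain
```

Avoiding that copy is exactly where apps start coding around SLI/XF, i.e. losing the flexibility mentioned above.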
How do you define efficiency?
Performance per watt is one way to define it.
Frankly, if you look at good CPU architectures, it's quite easy to see that DP servers are pretty much exactly as efficient as a single-socket server for many workloads and hence are the sweet spot for efficiency (e.g. ~95% scaling).
However, these servers don't use mid-range CPUs to achieve that, and they certainly aren't selling in the mainstream market. Why? If anything, we're seeing the opposite process with CPUs: more cores are getting integrated into one big chip. Have you ever thought about this?
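For reference, here's what that ~95% scaling claim looks like in perf-per-watt terms, with purely assumed numbers (the second socket adds CPU power, but the platform overhead is largely shared):

```python
# Back-of-the-envelope, assumed numbers only: single-socket vs dual-socket (DP).
single_perf, single_watts = 100.0, 250.0        # one CPU (~130 W) + ~120 W platform
dp_perf  = 2 * single_perf * 0.95               # 95% scaling -> 190
dp_watts = 2 * 130.0 + 120.0                    # two CPUs, platform mostly shared

print(single_perf / single_watts)               # 0.40 perf/W
print(dp_perf / dp_watts)                       # 0.50 perf/W -- DP looks great...
# ...but note it gets there with two top-bin CPUs, not two mid-range ones.
```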
GPU workloads are by definition trivially parallel, so it's quite easy to see how a dual-chip approach would be just as efficient, both from a performance and a power/cost standpoint.
Each chip in a dual-chip card carries some logic that's redundant once the two are paired, which means its efficiency will always be lower than a single chip's. And a single chip will always have some algorithms where it beats the dual-chip card because of the limitations of the AFR mGPU scheme.
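A rough way to see the redundancy argument, with invented numbers: compare two small dies under AFR against a hypothetical monolithic chip built from the same units, which needs only one copy of the uncore (display, PCIe, video block, etc.) and pays no AFR tax. Yield is deliberately ignored here, and it cuts the other way, in favour of the small dies.

```python
# Invented numbers; performance of one small die is normalized to 1.0.
def perf_per_mm2_dual(die_mm2, afr_scaling):
    return afr_scaling / (2 * die_mm2)          # 2x the silicon, <2x the performance

def perf_per_mm2_mono(die_mm2, uncore_mm2):
    big_die = 2 * die_mm2 - uncore_mm2          # only one copy of the shared logic
    return 2.0 / big_die                        # ~2x the performance, no AFR tax

dual = perf_per_mm2_dual(260.0, afr_scaling=1.8)    # assume 1.8x AFR scaling
mono = perf_per_mm2_mono(260.0, uncore_mm2=40.0)    # assume 40 mm^2 duplicated per die
print(round(mono / dual, 2))                        # ~1.2: mono gets ~20% more perf/mm^2
```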
Yes, that's interesting. But what's also interesting is having a much more highly optimized card to serve the $100-250 market, where you can kick ass AND make mad money because your die size is way smaller.
So you've saved some bucks on the die and then spent nearly twice as much on the rest of the card. Are you in the green after that? And what if you've missed the sweet spot and even the competitor's single-GPU card is faster than your mGPU card? If you have a GPU faster than the one in your mGPU card, you may be able to build a new mGPU card around it (GTX 295 is an example, although not the best one); if not -- you're truly fucked.
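Toy BOM comparison, every number invented, just to show where the die savings can go:

```python
def card_cost(die_cost, n_dies, board_cost, mem_cost):
    # board_cost covers PCB layers, VRMs, caps, cooler; mem_cost is per GPU
    return n_dies * (die_cost + mem_cost) + board_cost

single_big = card_cost(die_cost=110, n_dies=1, board_cost=60, mem_cost=50)   # 220
dual_small = card_cost(die_cost=60,  n_dies=2, board_cost=90, mem_cost=50)   # 310

print(single_big, dual_small)   # the "cheap die" strategy yields the pricier card
```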
AMD is leaving its high end dangerously open to a possibility like that. Let's say LRB turns out fast and hits 32nm way ahead of NV's and AMD's GPUs. That could mean AMD won't have _any_ answer to LRB in the high end _at all_. NV might be able to create some mGPU solution out of two big dies, but AMD simply won't have a die big enough to build any solution from.
It's a question of having a full line-up, and AMD's line-up is missing the high end at the moment. Where NV will use two chips in the Quadro/Tesla market, AMD might need to use four, with appalling efficiency and costs. That's a possibility you should think about when you're speaking of multi-CPU servers.
And can someone explain to me why ATI earns zero on all these great small GPUs while NV earns nearly the same on that big ugly GT200, now selling in cards for less than $200? I've always had a problem with that pricing argument: it was always kinda "assumed" that RV770 is much better for AMD than GT200/b is for NV from a pricing point of view, but in reality I'm not seeing any results of this "greatness" in AMD's balance sheets -- ATI earned less in 1Q09 than it did in 1Q08, when all they had was RV670 against G92 and G80.
The biggest single advantage of a monolithic GPU is that using multiple GPUs for general-purpose workloads is retarded, because the programming model (i.e. no coherency) sucks ass. NV has to produce large monolithic GPUs to make GPGPU interesting and get sufficient performance gains over a standard dual-socket server.
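To make the "no coherency" point concrete, here's a little sketch where the two "GPUs" are just separate Python lists standing in for separate memory spaces (everything here is made up, it's not any real API): with one device a stencil pass is a single launch over the whole grid; with two devices the application has to do the partitioning and the boundary copies itself.

```python
def blur_rows(rows):
    """Stand-in for a kernel: each output row needs its neighbours above and below."""
    return [[(a + b + c) / 3 for a, b, c in zip(lo, mid, hi)]
            for lo, mid, hi in zip(rows, rows[1:], rows[2:])]

def step_single_gpu(grid):
    return blur_rows(grid)                       # one launch over the whole grid

def step_dual_gpu(grid):
    half = len(grid) // 2
    gpu0, gpu1 = grid[:half], grid[half:]        # the app partitions the data itself
    # No coherent shared memory: each half must be handed the other's boundary
    # row (an explicit copy over the bus) before it can compute its edge rows.
    gpu0 = gpu0 + [gpu1[0]]
    gpu1 = [grid[half - 1]] + gpu1
    return blur_rows(gpu0) + blur_rows(gpu1)

grid = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
assert step_dual_gpu(grid) == step_single_gpu(grid)   # same answer, more plumbing
```

That's the friendliest possible case, and the bookkeeping still lands on the application.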
Nobody has to do anything. NVIDIA is doing what they believe will earn them money, and AMD is doing the same. Whose way is best -- we don't know. But what everyone should consider is that NV's way is essentially nothing more and nothing less than AMD's way plus big GPU dies for the high-end/workstation/server markets. AMD has simply left that market segment.
It's a huge waste of money and engineers' time. Next question?
It's funny that you say this right after you've explained why single big GPUs are necessary after all.