Nvidia GT300 core: Speculation

So how does all this latest mumbo jumbo relate to Nvidia GT300 core Speculation?
Will NVidia change the double-precision configuration to increase performance? A recurring thread of discussion has been the need to do so.

ATI and Larrabee are looking very strong in DP (particularly the latter) so as far as CUDA goes it's interesting to see whether NVidia will beef-up DP in the next generation.

Jawed
 
As I described here:

http://forum.beyond3d.com/showpost.php?p=1282350&postcount=12

unrolling increased ALU utilisation. Additionally, as we've discussed before, NVidia's compiler has to make a decision whether to compile MAD as MAD or split it into MUL + ADD - there are heuristics/models that do this. The aim is to maximise the utilisation of the MUL in the MI ALU.
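For concreteness, here's a minimal CUDA sketch (a hypothetical kernel, not the code from the linked post) of the unrolling point: splitting a reduction into independent accumulators replaces one chain of dependent MADs with several independent ones, which is what lets the ALUs stay busy.

```cuda
// Hypothetical illustration: unrolling by four into independent accumulators
// exposes instruction-level parallelism, so the MADs below can issue back to
// back instead of each waiting on the previous result.
__global__ void dot_unrolled(const float *a, const float *b, float *partial, int n)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int step = gridDim.x * blockDim.x;

    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;

    for (int i = tid; i + 3 * step < n; i += 4 * step) {
        acc0 += a[i]            * b[i];             // MAD
        acc1 += a[i + step]     * b[i + step];      // MAD, independent of acc0
        acc2 += a[i + 2 * step] * b[i + 2 * step];
        acc3 += a[i + 3 * step] * b[i + 3 * step];
    }

    // Tail elements and the cross-thread reduction are omitted for brevity.
    partial[tid] = acc0 + acc1 + acc2 + acc3;
}
```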

I still don't get your angle on this. Let's say Nvidia's stuff benefits from ILP. The point is that it is a lot less dependent on it than the competition. People recognize that it takes work to get the best from Nvidia hardware too. I know early on you were especially annoyed that people underestimated the amount of effort it took to get the best out of Nvidia hardware. Is that angst still at the root of all this negativity?

I've written code for you, in the other thread, to demonstrate this. And quoted SGEMM performance. What do you want, a thesis?
Nothing that fancy. Just examples of actual applications.

It's up to NVidia to deliver a genuine architectural improvement. When a reasonable rumour of such a change appears I'll bear it in mind. Right now NVidia's double-precision is out by an order of magnitude. I won't say it's impossible...
Improvement compared to what? Where are all the fantastic applications taking advantage of AMD's DP advantage? You're arguing from a pretty weak position here.

NVidia hasn't backed up anything, since no comparison has been made. What NVidia has done is delivered an adequate toolset to go along with its hardware. All NVidia's comparisons are solely with CPUs. Often, with single CPU cores running unoptimised code :oops:
So it's Nvidia's fault that AMD hasn't been able to produce anything useful? Wow, sure blame them for that too. What's your measuring stick for this stuff anyway?

Look in the mirror. You don't spend any time questioning what's transpiring in the field, but want to be spoon-fed. It's pretty tedious.
Your inability to recognize that you are constantly criticizing Nvidia when your team isn't even in the game is even more so. What exactly is transpiring in the field? All you're doing right now is making excuses for AMD and trying your best to minimize the value of everything Nvidia and its partners have actually produced. Your position is so untenable it's crazy at this point.

No, optimised for "target x" does not get levelled-out by OpenCL, per se. That was the point of my earlier remarks about SGEMM.
Well we all expect code optimized for particular platforms to run best on those platforms even under OpenCL. But the question raised earlier is whether there will be enough ILP available for AMD's stuff without a lot of fumbling around. You say yes but that's yet to be seen.

I mean, for example, have you noticed there's a company called Havok?

Yep, the one with the mature CPU Physics library? What do they have to do with AMD and/or GPGPU? Oh is the red-dress demo now equivalent to all of PhysX?

I'm not defending, I'm trying to promote a separation of the marketing about GPU capabilities from the architectural capabilities.
So what exactly is your measure of architectural capabilities if it's not the actual applications being produced? You propose that we ignore everything produced using CUDA because it's all simply marketing drivel and compare the architectures based on what? One SGEMM routine? Or is it AMD's immature toolset to blame? Their architecture is actually way more awesome than what Nvidia cobbled together but those darn APIs just got in the way? To be honest I can't tell anymore whether you're cheerleading AMD or demonizing Nvidia.

The algorithm might be tackled differently. For example, the dimensions of the tile of data in NVidia shared memory (the size of data for the extant threads) are constrained by having to wrap around at 16KB (or 32KB, at least, in GT300). This wrap-around constraint is much, much looser in Larrabee. So rather than doing staggered reads like this, something else may be done (e.g. pack/splat data, fine-grained pre-fetching, etc.). Who knows, eh?
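For reference, the kind of shared-memory tile being described looks something like this in CUDA (a minimal sketch; the 16x16 tile size and the lack of bounds checks are simplifying assumptions, chosen so one tile sits comfortably inside the 16KB per multiprocessor):

```cuda
#define TILE 16  // illustrative: a 16x16 float tile is 1KB, well inside 16KB of shared memory

// Hypothetical example: stage a tile of the input in shared memory so the block's
// subsequent accesses hit on-chip storage; the tile's dimensions are exactly what
// the 16KB (or 32KB) wrap-around constraint limits.
__global__ void stage_tile(const float *src, float *dst, int width)
{
    __shared__ float tile[TILE][TILE];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = src[y * width + x];   // coalesced load
    __syncthreads();

    // ... work on tile[][] here ...

    dst[y * width + x] = tile[threadIdx.y][threadIdx.x];
}
```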
Well isn't it interesting that you can recognize the need to do this extra work to get the same benefit of shared memory while at the same time dismissing the attractiveness of the shared-memory approach in the first place?

It's not that you don't make good points when it comes to limitations inherent to Nvidia's approach. You obviously spend a lot of time thinking about this stuff. But it just doesn't make sense to push the "AMD is just as good or better" mantra when they have nothing to show for it. Do you think there's any relationship between the maturity of AMD's software stack and the hardware design? In any case, the proof of the pudding is in the eating, as they say. So no amount of theorizing will ever stand up in the face of actual results.
 
Will NVidia change the double-precision configuration to increase performance? A recurring thread of discussion has been the need to do so.

ATI and Larrabee are looking very strong in DP (particularly the latter) so as far as CUDA goes it's interesting to see whether NVidia will beef-up DP in the next generation.

Jawed

Is the question if NVIDIA will beef-up DP or rather by how much?
 
And that brings up the question whether or not Nvidia is going to stick with separate DP units or will try finding a way to use the smaller, numerous SP units to do the trick.

My take: if they're going to bump DP significantly and are not going to produce different ASICs for the gamer and professional markets, there's not going to be more than one DP unit in its current form per TPC (if such a thing continues to exist, that is). Given AMD's advances on the integration front, they simply cannot afford this luxury.

So the options (IMO) would be either to make the DP unit(s) capable of concurrent operation with the eight SP-SPs and the transcendental unit (maybe promoting the latter to DP to boost performance), or to expand the SP-SPs just enough to do some DP-FMAC as well - maybe in joint operation.
 
I still don't get your angle on this. Let's say Nvidia's stuff benefits from ILP.
There's no if. Whether the degree is important is a separate question.

The point is that it is a lot less dependent on it than the competition.
Yes I said as much in the other thread :rolleyes:

People recognize that it takes work to get the best from Nvidia hardware too. I know early on you were especially annoyed that people underestimated the amount of effort it took to get the best out of Nvidia hardware. Is that angst still at the root of all this negativity?
No, the "it must be alright, it's NVidia" default position that you and others have is tedious. And it's so entrenched that ...

Nothing that fancy. Just examples of actual applications.
You mean like the video conferencing system? I linked that. Or Cyberlink? I forget, is it their video encoder that includes AMD acceleration? I don't know if the medical visualisation stuff is a commercial application or just research.

Improvement compared to what? Where are all the fantastic applications taking advantage of AMD's DP advantage? You're arguing from a pretty weak position here.
Well, DGEMM is one - it's actually useful. But I don't know who's using it. It's boring, you know?

Your inability to recognize that you are constantly criticizing Nvidia when your team isn't even in the game is even more so.
Yes, you're still on that "zero GPGPU penetration" wicket.

What exactly is transpiring in the field? All you're doing right now is making excuses for AMD and trying your best to minimize the value of everything Nvidia and its partners have actually produced. Your position is so untenable it's crazy at this point.
I think you might want to scan through the names of the thread starters:

http://forum.beyond3d.com/forumdisplay.php?f=42

and the subjects of those threads :p

Yep, the one with the mature CPU Physics library? What do they have to do with AMD and/or GPGPU? Oh is the red-dress demo now equivalent to all of PhysX?
Eh? You think they knocked something up the day before GDC? And that's all we'll ever see of it?

Well isn't it interesting that you can recognize the need to do this extra work to get the same benefit of shared memory while at the same time dismissing the attractiveness of the shared-memory approach in the first place?
OK, when did I dismiss shared memory?

For all we know LDS in ATI is no good going forwards and they'll have to do something different. We discussed the utility back here:

http://forum.beyond3d.com/showthread.php?t=53089

Can't tell what the performance is like, except we now know that SGEMM is slowed down on ATI by the use of LDS.

It's not that you don't make good points when it comes to limitations inherent to Nvidia's approach. You obviously spend a lot of time thinking about this stuff. But it just doesn't make sense to push the "AMD is just as good or better" mantra when they have nothing to show for it.
No, I'm just pushing against the "AMD's hardware design is incapable of being competitive" mantra.

For example, by having in-pipe registers there's essentially no register read-after-write latency (there are corner cases, but they're seriously obscure). This is a feature that seems to go back to R300 as far as I can tell. It means that there's less total latency for the scheduler to hide over the lifetime of a shader, in comparison with an architecture that incurs 24 cycles of latency for every register write.

Do you think that makes a difference to the compiler? Do I need to give you a clue?
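Here's a clue, in the form of a hypothetical CUDA-style sketch (a microbenchmark shape, not real measured code): the first loop is one dependent chain, so on hardware with ~24 cycles of register read-after-write latency it stalls unless other warps are resident to fill the gap, while the second loop carries four independent chains the scheduler can interleave. With in-pipe forwarding, as described above, the dependent chain doesn't pay that penalty in the first place.

```cuda
// Hypothetical illustration of register read-after-write latency.
__global__ void raw_latency_demo(float *out, int iters)
{
    float a = out[threadIdx.x];

    // Dependent chain: every MAD reads the result of the previous one, so each
    // iteration has to wait out the register write latency (or be covered by
    // other resident warps).
    for (int i = 0; i < iters; ++i)
        a = a * 1.000001f + 0.5f;

    float b0 = a, b1 = a + 1.0f, b2 = a + 2.0f, b3 = a + 3.0f;

    // Four independent chains: the scheduler always has another MAD to issue
    // while earlier results are still in flight.
    for (int i = 0; i < iters; ++i) {
        b0 = b0 * 1.000001f + 0.5f;
        b1 = b1 * 1.000001f + 0.5f;
        b2 = b2 * 1.000001f + 0.5f;
        b3 = b3 * 1.000001f + 0.5f;
    }

    out[threadIdx.x] = b0 + b1 + b2 + b3;
}
```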

Do you think there's any relationship between the maturity of AMD's software stack and the hardware design? In any case, the proof of the pudding is in the eating, as they say. So no amount of theorizing will ever stand up in the face of actual results.
Well as I pointed out the other day, there are still gotchas in hardware assembly. e.g. if I vectorise the Mandelbrot code to generate multiple points per thread instead of just the one, the fucking hardware compiler insists on solely using .x and runs out of registers because it's not using .y, .z and .w. The compiler doesn't tell me it's run out of registers - it just says "failed". 4 results per thread consumes 32 registers (that's 128 scalars)! All because I use a struct of floats.
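To make the pattern concrete, here's a CUDA-flavoured sketch of the shape of that kernel (an illustration of the pattern only, not the Brook+ code in question):

```cuda
// Hypothetical CUDA-flavoured sketch of "four Mandelbrot points per thread via a
// struct of floats". The complaint above is that AMD's IL compiler maps each of
// these scalars to the .x lane of a separate 4-wide register instead of packing
// them into .x/.y/.z/.w, so register use blows up.
struct Points4 { float p0, p1, p2, p3; };

__device__ int mandel(float cr, float ci, int max_iter)
{
    float zr = 0.0f, zi = 0.0f;
    int it = 0;
    while (zr * zr + zi * zi < 4.0f && it < max_iter) {
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++it;
    }
    return it;
}

__global__ void mandel4(const Points4 *re, const Points4 *im, int4 *out, int max_iter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    Points4 cr = re[i], ci = im[i];

    // Four independent points per thread.
    out[i] = make_int4(mandel(cr.p0, ci.p0, max_iter),
                       mandel(cr.p1, ci.p1, max_iter),
                       mandel(cr.p2, ci.p2, max_iter),
                       mandel(cr.p3, ci.p3, max_iter));
}
```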

There are gotchas on the NVidia side too, such as people complaining that they can't control the optimiser or that register allocation is a black art. Volkov's work has been very influential in shifting perspectives as far as I can tell and I know that NVidia's been putting a lot of effort into making the development experience a lot more like developing for a CPU, e.g. with support for profiling.

NVidia's very much in the "polish" stage of its toolset whereas AMD is still in the "let's make it work" stage.

Jawed
 
So the options (IMO) would be either to make the DP unit(s) capable of concurrent operation with the eight SP-SPs and the transcendental unit (maybe promoting the latter to DP to boost performance), or to expand the SP-SPs just enough to do some DP-FMAC as well - maybe in joint operation.
NVidia promotes the enriched feature set of its double-precision implementation: rounding modes, NaNs and other stuff I forget. Since these features seem to be reasonably advanced, it's then a question of how expensive they are and whether they limit the amount of DP performance that can be implemented.
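In CUDA terms that's roughly the following (a minimal sketch, assuming compute capability 1.3 hardware, i.e. GT200-class DP): directed rounding is exposed through intrinsics and NaNs behave as IEEE values you can test for.

```cuda
// Hypothetical illustration of the double-precision feature set being referred to:
// explicit rounding modes and NaN handling (compute capability 1.3, e.g. GT200).
__global__ void dp_features(const double *a, const double *b, double *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    double x = a[i], y = b[i];

    out[4 * i + 0] = __dadd_rn(x, y);      // add, round to nearest even
    out[4 * i + 1] = __dadd_rz(x, y);      // add, round towards zero
    out[4 * i + 2] = __fma_rn(x, y, 1.0);  // fused multiply-add, single rounding
    out[4 * i + 3] = isnan(x) ? 0.0 : x;   // NaNs are representable and testable
}
```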

Slide 62:

http://s08.idav.ucdavis.edu/luebke-cuda-fundamentals.pdf

The second and third columns are SSE2 and Cell SPE.

It's notable that double precision didn't make it into OpenCL 1.0, it's an extension.

Jawed
 
No, the "it must be alright, it's NVidia" default position that you and others have is tedious. And it's so entrenched that ...

I think you're seeing ghosts, to be honest.
You've attacked me a few times because you thought I was not giving ATi enough credit, or I 'neglected' to also criticize nVidia when I was criticizing some technology of ATi (e.g. their poor Avivo display).
Basically you constantly misinterpret what I say and make ridiculous demands, all apparently stemming from this unrealistic view that you seem to have.
Not everyone criticizing something of ATi is necessarily pro-nVidia. Also, not everyone using nVidia is just using it because nVidia marketing said so.
It's just a simple fact that neither ATi nor nVidia are perfect, so both have flaws in their products, and as such, sometimes criticism is justified, when pointing out these flaws.
Also, both ATi and nVidia do have some merit to their products. It's not all just marketing.

We will play nice with each other from now on, won't we?

The fact that you tried to attack me is ridiculous. I've supported ATi since the Radeon 8500, which I think most will agree was the first GPU that was really competitive with nVidia's offerings. So in a way I was an 'early adopter' of ATi products.
Since then I mostly used ATi products, until GeForce 8800 happened.
My next card will be a DX11 one, and assuming that graphics performance and price remain as competitive as they are today, I will let the choice for my next GPU depend on who has the best OpenCL/DXCS performance. Which is why I think the GPGPU debate is far more interesting than the DX10.1 debate.
 
NVidia promotes the enriched feature set of its double-precision implementation: rounding modes, NaNs and other stuff I forget. Since these features seem to be reasonably advanced, it's then a question of how expensive they are and whether they limit the amount of DP performance that can be implemented.
That's the question, right. Another question would be whether, this time 'round, they'll refrain from canning the "gaming chip" (as I suspect they did a year ago) and go for different chips for the professional and gamer markets - but that also would depend on how much ground CUDA and OpenCL have already gained.

But with the economic crisis and margins being low and all that, I'm rather inclined towards another one-size-fits-all solution.
 
Well as I pointed out the other day, there are still gotchas in hardware assembly.

Well, that is to be expected. It obviously can't undo the brain damage inflicted on the code by the Brook+ IL generator.
e.g. if I vectorise the Mandelbrot code to generate multiple points per thread instead of just the one, the fucking hardware compiler insists on solely using .x and runs out of registers because it's not using .y, .z and .w. The compiler doesn't tell me it's run out of registers - it just says "failed". 4 results per thread consumes 32 registers (that's 128 scalars)! All because I use a struct of floats.

After looking at the R7xx assembly generated by the tools, I am coming around to the view that it may be best for the compiler to think of AMD GPUs as having float registers instead of float4 registers. I mean that instead of thinking of the registers as r0, r1, r2, etc. (like in NV GPUs), think of them as r0.x, r0.y, r0.z, r0.w, r1.x, r1.y, r1.z, r1.w, etc. The huge ILP afforded by the VLIW design should mean long, deeply unrolled kernels would do very well.

I.e., see everything as a scalar register instead of blindly up-converting everything to float4/int4 or double2.
There are gotchas on the NVidia side too, such as people complaining that they can't control the optimiser or that register allocation is a black art. Volkov's work has been very influential in shifting perspectives as far as I can tell and I know that NVidia's been putting a lot of effort into making the development experience a lot more like developing for a CPU, e.g. with support for profiling.

I am curious here: what specific takeaways would you have from Volkov's work? The one I saw was treating the registers as a large block of memory, as they are 4x the size of shared memory.
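For anyone who hasn't read it, the register-blocking idea usually associated with Volkov's SGEMM looks roughly like this (a sketch of the technique under simplifying assumptions - row-major square matrices, n a multiple of 64, 64-thread blocks - not his actual code): each thread keeps a strip of C accumulators in registers, which collectively dwarf the 16KB of shared memory, and only B is staged through shared memory.

```cuda
#define TN 16   // columns of C per thread, also the k-blocking factor
#define TM 64   // threads per block = rows of C per block

// Hypothetical sketch: launch with blockDim.x = 64, gridDim = (n/TM, n/TN).
__global__ void sgemm_regblock(const float *A, const float *B, float *C, int n)
{
    __shared__ float Bs[TN][TN];

    int row  = blockIdx.x * TM + threadIdx.x;   // one row of C per thread
    int col0 = blockIdx.y * TN;                 // first of its 16 columns

    float c[TN];                                // accumulators live in registers
    for (int j = 0; j < TN; ++j) c[j] = 0.0f;

    for (int k = 0; k < n; k += TN) {
        // Cooperatively stage a 16x16 tile of B in shared memory.
        for (int i = threadIdx.x; i < TN * TN; i += TM)
            Bs[i / TN][i % TN] = B[(k + i / TN) * n + col0 + i % TN];
        __syncthreads();

        // Each thread streams its own row of A and updates 16 accumulators.
        for (int kk = 0; kk < TN; ++kk) {
            float a = A[row * n + k + kk];
            for (int j = 0; j < TN; ++j)
                c[j] += a * Bs[kk][j];
        }
        __syncthreads();
    }

    for (int j = 0; j < TN; ++j)
        C[row * n + col0 + j] = c[j];
}
```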

NVidia's very much in the "polish" stage of its toolset whereas AMD is still in the "let's make it work" stage.

Let's hope the lack of improvements on the Brook+ side means a good OpenCL toolset delivered ASAP.
 
That's the question, right. Another question would be whether, this time 'round, they'll refrain from canning the "gaming chip" (as I suspect they did a year ago) and go for different chips for the professional and gamer markets - but that also would depend on how much ground CUDA and OpenCL have already gained.

I'd have been surprised to see different chips even in better times. And now times are especially bad. :(
 
Sure, but there must have been plans, very advanced plans, to have a chip with GT200's capabilities sans DP - Compute Device 1.2 ;)
 
Until now there's been no chip with those caps, and that compute-device revision has already been bested, so I'd think of this chip as "plans" right now.
 
That's the question, right. Another question would be whether, this time 'round, they'll refrain from canning the "gaming chip" (as I suspect they did a year ago) and go for different chips for the professional and gamer markets - but that also would depend on how much ground CUDA and OpenCL have already gained.

But with the economic crisis and margins being low and all that, I'm rather inclined towards another one-size-fits-all solution.

I'm not convinced that there is enough of a professional market to target a chip specifically at that market, with all the NRE required factored in. So unless the chip can also succeed in other markets with higher volumes, I just don't see any of the vendors spinning a chip specifically for the professional market.
 
That was also what made me think (to no avail...) when GT200 arrived: this was the first chip I knew of that had dedicated hardware for a very small audience sitting idle in >95% of the volume sold.

I bet Nvidia utterly regrets this decision by now, given the price war forced upon them by AMD. :D
 
Unlike the tessellation unit on the RV cores.

I've not seen a disclosure of the area penalty imposed by Nvidia's DP hardware.
Comparisons of density for other parts of the chip seem to indicate that if we're looking for culprits for the big die size, the DP hardware isn't the sole offender.
 
Tessellation was used in Viva Pinata, IIRC. But yes, you have a point there, also with the DP not being solely responsible for Nvidia's bad margins. But every additional chip they could have fit on a wafer would be very welcome now, I guess.
 
AFAIK the 360 GPU's tessellator is mostly bottlenecked by its (albeit fast) setup unit. I won't be surprised if the next GPUs from NVIDIA and ATI have, for the first time, more than one setup unit and/or move part of the computation to the shader core (...now that they support double precision...).
 
What's the major reason why setup units are not parallelized? Is it because of die area, or because of (I suspect) concurrency issues?
 