Nvidia GT300 core: Speculation

CarstenS · Apr 11, 2009

I wonder if AMDs extremely dense ALU-logic does have some hidden caveats? I mean, clearly they have very talented people working at it, but in the end they can do no more magic than the next guy.

Maybe this is somehow related?

silent_guy · Apr 11, 2009

Scali said:
Any info on what exactly they moved to DP?
Because I doubt they just thought "Oh, to hell with it, we'll just do a search-replace of "float" -> "double" on the entire sourcecode, and suck up the performance hit".

I'm pretty sure that's exactly how they did it. A 20% or so performance drop is a fair trade-off when your bottleneck is man power.

aaronspink · Apr 11, 2009

Scali said:
Any info on what exactly they moved to DP?
Because I doubt they just thought "Oh, to hell with it, we'll just do a search-replace of "float" -> "double" on the entire sourcecode, and suck up the performance hit".

No idea, probably all related to geometry.

Also, PRMan is not a raytracer.

you should take a look at graphics.pixar.com again. Specifically the paper "Ray Tracing for the Movie Cars" which describes how RenderMan was enhanced to support ray tracing (which the claim is the first program to have ray tracing for scenes of such complexity).

Scali · Apr 11, 2009

aaronspink said:
you should take a look at graphics.pixar.com again. Specifically the paper "Ray Tracing for the Movie Cars" which describes how RenderMan was enhanced to support ray tracing (which the claim is the first program to have ray tracing for scenes of such complexity).

No need, I know all about it.
RenderMan is a REYES renderer. It does support raytracing, but prior to the movie Cars, this was rarely used in any Pixar movies. And even in Cars itself, the raytracing is done only on certain parts of the scene, because of performance issues (apparently man-power isn't the only thing that matters at Pixar, render-time does aswell). It's all in that paper you refer to.

Now, you mentioned A Bug's Life, which predates Cars by about 8 years, so it's still from the 'non-raytracing' era of Pixar movies.
Hence it doesn't really fit into the discussion about raytracing.
So perhaps you need to take a look at graphics.pixar.com again. Have a nice day.

Razor1 · Apr 11, 2009

Jawed said:
There's no proof they run better on NVidia's hardware. The comparison, generally speaking, hasn't been made. CUDA has been an easy choice for a lot of people because it's been shoved under their nose by NVidia, and AMD in its own wisdom is building relationships mostly with commercial partners who don't publish papers and publish "speed-up" graphs with lies about speed-ups based on un-optimised CPU code.

Yep, it's working on you.

When there is no other viable option out there until recently, how is that marketing and how is that being pushed under other's noses? Don't fool yourself and think AMD likes the position its in right now.

aaronspink · Apr 11, 2009

MfA said:
Just for reference, what is the ratio of the areas of say a 8K instruction cache with decoding/control logic for a scalar in order processor and a single cycle througput single precision FP multiplier both running at the same clock?

Depends largely on process and other design constraints. But in general, a SP FP pipeline and a 8k cache + decode/control should be in the same ballpark.

aaronspink · Apr 11, 2009

Scali said:
Now, you mentioned A Bug's Life, which predates Cars by about 8 years, so it's still from the 'non-raytracing' era of Pixar movies.
Hence it doesn't really fit into the discussion about raytracing.
So perhaps you need to take a look at graphics.pixar.com again. Have a nice day.

Yes but it relates to WHY you might want to use DP in a ray tracer because complex scenes with high geometric density start to have issues with single precision. If you can't see how they are related, then have a nice day.

Scali · Apr 11, 2009

aaronspink said:
Yes but it relates to WHY you might want to use DP in a ray tracer because complex scenes with high geometric density start to have issues with single precision. If you can't see how they are related, then have a nice day.

If you read my posts more clearly, I never denied that you may want to use DP in a raytracer. As I said, I *mostly* used SP, not *entirely*. So I also used DP. Doing everything in DP is a different story though.
You yourself pointed out that Cars was the first time scenes with such high complexity were raytraced... which basically proves the point that this is a rather exceptional case.
If you can't comprehend my posts, then have a nice day.

trinibwoy · Apr 11, 2009

Razor1 said:
When there is no other viable option out there until recently, how is that marketing and how is that being pushed under other's noses? Don't fool yourself and think AMD likes the position its in right now.

Whoa! How could I miss that sentence. Jawed, come on man. You really think AMD is making vast inroads with commercial customers behind closed doors and are just oh so wise and humble that they are hiding it from the public and their shareholders? And in the meantime Nvidia is thumping its chest and hyping all the meaningless displays of functioning CUDA applications that people can actually purchase and use today? Again...whoa!! I can't wait the see the fruit of these black-ops operations happening over at AMD

TimothyFarrar · Apr 11, 2009

trinibwoy said:
Is Larrabee's L1 banked? Not sure what you're saying here. With shared memory you get a fast read as long as it's coalesced. So you're going to have a lot more opportunities for one shot reads compared to a traditional cache where everything has to be on the same cache line.

Great point there!

Oops. Looks like I glanced over the LRB slides too fast also and assumed. Seems according to this slide that you are indeed correct LRB's L1$ services only gather/scatter from one cache line per clock,

http://pc.watch.impress.co.jp/docs/2009/0330/kaigai498_p125.jpg

Now if NVidia did go with some kind of dynamic warp formation with GT3xx, then I wonder if they could formulate warps not only on same branch path, but also grouping to lower bank conflicts (to shared memory, or global memory)?

TimothyFarrar · Apr 11, 2009

Jawed said:
Now how much on-die memory does each strand get?

Let's say GT300 has GT200's register file of 64KB but with 32KB of shared memory: that's 96KB / 1024 work-items that's 96 bytes.

Let's say that Larrabee can only use L2 (256KB) to hold work-item registers + local memory, i.e. treating Larrabee's L1 and registers as scratchpad. 256KB / 2048 work-items = 128 bytes each. Obviously it's impossible to get data in and out of a core without it going through L2, so there'll be less than 128 bytes actually available.

Problem I see with this is that LRB L2 latency is not the same as a shared memory or register access. Isn't LRB L2 at least 10 clocks away? So this 96 to 128 byte number isn't apples to apples.

Also we should toss in an extra 8KB of constant space on GT200

aaronspink · Apr 12, 2009

Scali said:
If you read my posts more clearly, I never denied that you may want to use DP in a raytracer. As I said, I *mostly* used SP, not *entirely*. So I also used DP. Doing everything in DP is a different story though.
You yourself pointed out that Cars was the first time scenes with such high complexity were raytraced... which basically proves the point that this is a rather exceptional case.
If you can't comprehend my posts, then have a nice day.

But what is the real slow down for DP? Its pretty easy to do a global search and replace and boom, a 20% slowdown. Done. You assume everything thing is about maximum performance, in reality it is all about tradeoffs. The amount of extra work to determine where you actually need DP is significant, so once you know you need DP some where, why do it half way?

And I would hardly call the entire 3D Studio side of graphics an "exceptional case". They are the ones driving most of the technology from the commercial side.

MfA · Apr 12, 2009

aaronspink said:
Depends largely on process and other design constraints. But in general, a SP FP pipeline and a 8k cache + decode/control should be in the same ballpark.

Lets suppose this is true for a moment ... this would mean you could go for a branch granularity of 4 without making too big an impact on density, since all the other essentials for a single shader have to be present in the same amount regardless of granularity.

Of course the moment you go for 4 the best thing is to just go for ATI style VLIW instead ... lets say the slightly wider decode logic and swizzling increases the size to twice a SP multiplier, still a cost which doesn't rule the approach out outright.

I think the wide granularity in GPUs is more of convenience than necessity. The moment you go for small granularity coordination becomes an utter nightmare, but that is complexity from a design/compilation/scheduling/synchronization standpoint moreso than a mm2 one.

Scali · Apr 12, 2009

aaronspink said:
But what is the real slow down for DP? Its pretty easy to do a global search and replace and boom, a 20% slowdown. Done. You assume everything thing is about maximum performance, in reality it is all about tradeoffs. The amount of extra work to determine where you actually need DP is significant, so once you know you need DP some where, why do it half way?

I disagree. I and many others write our software for maximum performance. Determining what kind of maths you need where is EXACTLY what we do. It's what separates us from the average programmer in India, who can write code for a fraction of the price.
Because we are so experienced at what we do, after we work out an algorithm in general, working out where we need what precision is a negligible amount of extra work.

Aside from that, the argument here is that for nVidia hardware, the slowdown factor is far more than 20%, all the more reason to not just use DP everywhere.
Thing is, we haven't been doing this since yesterday. We started out long ago, when performance on a standard CPU was way different aswell, and SP mattered a lot more than it does today. As such our codebase started out as optimized for SP, and we built our experience from that. So code with mostly SP already exists, and can be re-used for architectures like current NV hardware, which benefits greatly from avoiding DP wherever possible.

aaronspink said:
And I would hardly call the entire 3D Studio side of graphics an "exceptional case". They are the ones driving most of the technology from the commercial side.

So now you think 3D Studio and a Pixar movie are the same thing?
Last time I wrote an exporter for 3DSMAX, all the data (including geometry) was single precision.

jimmyjames123 · Apr 12, 2009

Jawed said:
There's no proof they run better on NVidia's hardware. The comparison, generally speaking, hasn't been made. CUDA has been an easy choice for a lot of people because it's been shoved under their nose by NVidia, and AMD in its own wisdom is building relationships mostly with commercial partners who don't publish papers and publish "speed-up" graphs with lies about speed-ups based on un-optimised CPU code.

Wow, just wow. This has to be one of the most inane things I have read recently. Clients for NVIDIA are having CUDA "shoved under their nose" by NVIDIA? That makes no sense. The clients for NVIDIA are choosing CUDA and NV GPUs because they result in a substantial speed-up in processing time (vs CPUs) based on their own testing! We are talking sometimes 10x, 20x, 50x, 100x or more speed up in real time, often with significantly lower cost and lower power consumption, and that is somehow disingenuous? Wow

How about this: why don't you spend time to write "optimized" CPU code, just to prove NVIDIA and their clients wrong about these massive speed ups using CUDA and NV GPUs. I dare you. No wait, I double dare you

The silliness of these statements is just beyond belief (no pun intended). NVIDIA is spending an incredible amount of time and resources to help clients make efficient use of CUDA and NV GPUs for GPGPU applications, these clients are reporting very real, tangible, HUGE improvements in processing time vs using CPUs, and you simply crap on all that effort and attempt to portray it as marketing tricks, all while pimping the LRB GPU that doesn't even exist yet for consumers and clients. That's BS.

Do you work for Intel Corporation, have you been offered a position there, or have you purchased shares of their stock? It sure sounds like it based on the demeanor of your posts and your intense hatred and bitterness with all things NVIDIA (whether be it NV hardware architecture, software design, marketing, etc)?

Scali · Apr 12, 2009

One thing that nVidia has done, is to create libraries that can be used directly from Matlab.
The researches at my company generally prototype stuff in Matlab, and then later convert it to C++ code.
Thanks to nVidia, they can run the Cuda-accelerated code even in their prototypes, giving them a lot more performance. Then when they want to create a C++ version, they will likely continue using Cuda libraries.
As far as I know, ATi simply doesn't offer this option.
Also, I think arguing that Matlab doesn't have well-optimized libraries for CPU is a bit of a dead-end.

MfA · Apr 12, 2009

jimmyjames123 said:
How about this: why don't you spend time to write "optimized" CPU code, just to prove NVIDIA and their clients wrong about these massive speed ups using CUDA and NV GPUs. I dare you. No wait, I double dare you

Do you have any specific algorithm in mind for this contest?

Scali · Apr 12, 2009

MfA said:
Do you have any specific algorithm in mind for this contest?

I believe Jawed is already working on some image reconstruction algorithm:
http://forum.beyond3d.com/showthread.php?t=53170

Perhaps you can go from there?
Define some parameters within which the algorithm has to operate, and then make a version with Cuda, one with ATi Stream, and perhaps a reference version on CPU?

Arun · Apr 12, 2009

Jawed, do you realize NVIDIA has real revenue for CUDA (several million dollars at least in 2008, ultra-high-margin), while AMD doesn't for their GPGPU solution? You could argue NV dropped the ball when it comes to consumer GPGPU, but trying to defend AMD in HPC is just really dumb - and I'm sure you know better anyway.

But you're right that many CUDA papers aren't being very fair compared to CPUs in terms of optimization, but let's not get ahead of ourselves. We're not talking about 60x speed-ups becoming negligible; we're talking 40x going to 10x probably. And frankly if that's the only point, it's an incredibly backward-looking one because GPU flops in 2H09/1H10 are going to increase so much faster than CPU flops I'm not sure why we're even discussing this. In fact, the fact real-world performance compared to super-optimized already-deployed code isn't always so massive was even mentioned in June 2008 at Editor's Day for the GT200 Tesla. There was a graph for a oil & gas algorithm IIRC, and the performance was only several times higher - but scalability was also much better, and even excluding that cost efficiency was better than just the theoretical performance improvement.

Also, uhhh... for how long have we been discussing G8x? I was pretty damn sure you understood at one point that: a) there are only two ALU lanes, not three; you can't issue a SFU and an extra MUL no matter what. b) These two ALU lanes can be issued with DIFFERENT threads on GT200, efficiency should be ~100% for dependent code that is of the form 'Interp->MAC->MUL->MAC->Interp->...' - it just works! Don't try to imagine problems that aren't there...

Jimmy, it's not fair to say Jawed must work for Intel or must have purchased shares of their stock. I'm not even sure how anything he said in this thread is so much pro-Intel, rather pro-AMD... He's always liked AMD's architectures for many years now, I guess there's nothing wrong with having a slight bias as long as you don't mind being corrected on factual inaccuracies.

Jawed · Apr 12, 2009

trinibwoy said:
Haha, it's not. That was my point. Why doesn't AMD dump some dollars into marketing and get their stuff up and running then?

I dunno, someone needs to ask them.

Sigh, 3-ALU issue? That's a bit dishonest. You fully well know that instructions from different threads can be issued to the ALUs independently and is taken care of by the hardware. Hence the developer doesn't concern himself with that. On the other hand all of AMD's 5 ALUs must be filled from instructions in a single thread so the developer has to ensure there's enough ILP available.

As I described here:

http://forum.beyond3d.com/showpost.php?p=1282350&postcount=12

unrolling increased ALU utilisation. Additionally, as we've discussed before, NVidia's compiler has to make a decision whether to compile MAD as MAD or split it into MUL + ADD - there are heuristics/models that do this. The aim is to maximise the utilisation of the MUL in the MI ALU.

Theoretically the double-precision MAD is also usable for single-precision MAD, ADD, MUL - I don't know if NVidia's compiler tries to optimise SP code across that ALU.

And have you compared them to the corresponding algorithm on AMD's stuff?

That's what I've been doing. Some of my posts take 10+ hours to put together because I've buried my head in other forums, papers, coding, optimisation etc.

Perhaps, but again this is all guesswork. We simply do not have the evidence on AMD's side to support your hypothesis.

It's not guesswork. Optimisation performed by CUDA developers usually includes FLOPs/byte optimisation, which normally means sharing bytes fetched/written and loop-housekeeping across more intense computation. The most important metric in throughput computing is arithmetic intensity, which is the ratio of computation to bytes read/written.

If you're really clever you maximise bandwidth usage and ALU utilisation simultaneously.

Shouldn't it be much easier to simply find such algorithms running on AMD's hardware as proof of concept? Where are they?

I've written code for you, in the other thread, to demonstrate this. And quoted SGEMM performance. What do you want, a thesis?

Huh, I didn't say that. I meant we should be comparing it to the hypothetical GT300. Not to G80 like you've been doing.

It's upto NVidia to deliver a genuine architectural improvement. When a reasonable rumour for such a change appears I'll bear it mind. Right now NVidia's double-precision is out by an order of magnitude. I won't say it's impossible...

Exactly!! So we're pissing into the wind with claims that AMD's hardware would be just as good or better. There's nothing to back it up so just as you hate Nvidia's marketing at least they back it up with results. On the other hand the stuff you're cheerleading hasn't produced anything worthwhile to date.

NVidia hasn't backed-up anything, since no comparison has been made. What NVidia has done is delivered an adequate toolset to go along with its hardware. All NVidia's comparisons are solely with CPUs. Often, with single CPU cores running unoptimised code

Brook+, specifically, is still in alpha as far as I can tell, with documentation aimed at geeks who've warped in from the 1970s and don't know better. I've got no idea whether AMD's OpenCL will be useful this year. Nor whether AMD will put together a decent environment and documentation for coding.

Haha, man I don't have anything against AMD. I just recognize and appreciate the good points of Nvidia's decisions for GPGPU. A perspective that you obviously do not share. What I don't get is that in spite of all of CUDA's success you work hard to point out its apparent flaws and are batting hard for an alternative that has not proven itself. And then blame Nvidia's marketing for the fact.

Look in the mirror. You don't spend any time questioning what's transpiring in the field, but want to be spoon-fed. It's pretty tedious.

OpenCL will actually be an interesting battleground. Because if, as you say, a well designed algorithm should have lots of ILP then anything optimized for Nvidia hardware should fly on AMD's stuff too. Unless by well-designed you mean explicit vec4 packing

No, optimised for "target x" does not get levelled-out by OpenCL, per se. That was the point of my earlier remarks about SGEMM.

As for what kind of arithmetic-intensity is optimal for any algorithm, well one of the interesting techniques that NVidia's brought to the table is "self-tuning" code (though this seems to be an old supercomputing technique, to be fair). There are different approaches to searching the optimisation space so then it becomes a question of whether the right variables for multi-target OpenCL tuning are emplaced.

I expect SPEC benchmarks will be drawn up at some point and then we get into the wonderful world of gamed-benchmarking and performance/watt and all the other stuff that the guys with big computers have been doing. Ah yes, graphics cards playing benchmarking games, they'll feel right at home.

Jawed

Nvidia GT300 core: Speculation

CarstenS

Moderator

silent_guy

aaronspink

Scali

Razor1

aaronspink

aaronspink

Scali

trinibwoy

Meh

TimothyFarrar

TimothyFarrar

aaronspink

MfA

Scali

jimmyjames123

Scali

MfA

Scali

Arun

Unknown.

Jawed

Similar threads