You're doing it again
If CUDA's success is all marketing then why can't AMD even come up with something that makes their stuff shine? Don't you see the fallacy in your argument?
No, you'll have to spell it out. Since when is marketing a measure of technical capability?
Right, because writing code for AMD's hardware would be just as easy.
Perhaps you'd like to explain what it is about NVidia's 3-ALU issue that makes it easier to code for than AMD's 5-ALU issue?
Exactly. I've seen several papers from folks trying to delve into this stuff using CUDA, and none of them ever concern themselves directly with ALU utilization.
I suggest you go re-read them. Any time you see someone evaluating FLOPs they're doing precisely that.
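To make that concrete, a rough worked example (figures from memory, so treat the exact clocks as approximate): a GTX 280 has 240 scalar ALUs at roughly 1.3 GHz, so MAD-only peak is 240 * 1.3 GHz * 2 flops = ~624 GFLOPS (~933 GFLOPS if you also count the co-issued MUL). If a kernel sustains ~312 GFLOPS of MADs, it's keeping the MAD ALUs about 50% busy. Quoting achieved FLOPs against peak is just ALU utilisation by another name.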
When I started reading this stuff I was expecting to find that NVidia's ALU design was a big win, freeing developers from iterations of algorithmic evolution (vectorisation, vector fetches, unrolling) to maximise performance. It's not remotely true.
As you pointed out many times, there are considerations for memory bandwidth, but the packing of VLIW instructions is something that just doesn't come up with CUDA. Yet you keep arguing that it's trivial and is an advantage that should be completely ignored...
No, my argument is really much simpler: once you have built an efficient algorithm there's so much instruction-level parallelism that having a scalar or a VLIW ALU is often neither here nor there.
There are degrees of architecture-specific tuning you can do, like I said earlier: if you have a GPU with crap caches and small per-strand state (NVidia), then you are forced to use shared memory for SGEMM.
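To show what I mean by that, here's a minimal sketch (a made-up kernel, nothing from a real codebase): four independent accumulators in the inner loop give a scalar machine four MADs to keep in flight, and give a VLIW compiler four obvious candidates to pack into one instruction word. The source doesn't change either way.

// Hypothetical dot-product-style kernel, written only to show ILP.
// The four accumulators are independent, so the four MADs per
// iteration can be pipelined back to back on a scalar ALU or packed
// into a VLIW bundle by the compiler; the source code is the same.
__global__ void dot_ilp(const float *a, const float *b, float *out, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

    // Each thread strides through the arrays 4 elements at a time
    // (any tail elements are ignored to keep the sketch short).
    for (int i = tid * 4; i + 3 < n; i += stride * 4) {
        s0 += a[i + 0] * b[i + 0];   // independent MAD
        s1 += a[i + 1] * b[i + 1];   // independent MAD
        s2 += a[i + 2] * b[i + 2];   // independent MAD
        s3 += a[i + 3] * b[i + 3];   // independent MAD
    }

    // Per-thread partial sum; a real kernel would reduce these further.
    out[tid] = s0 + s1 + s2 + s3;
}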
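For anyone wondering what "use shared memory for SGEMM" looks like in practice, here's a bare-bones sketch of the usual tiling trick (a toy 16x16-tile kernel made up for illustration, assuming N is a multiple of the tile size, not anyone's production SGEMM):

#define TILE 16

// Toy C = A * B for square N x N row-major matrices, N assumed to be
// a multiple of TILE. Each block stages a TILE x TILE tile of A and B
// in shared memory so the repeated reuse of each element hits on-chip
// storage instead of going back to DRAM every time.
__global__ void sgemm_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load of one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Inner product over the staged tiles: a MAD per iteration.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = acc;
}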
And remember, NVidia has 3 ALUs, not 1. Of course, another corruption of CUDA-think is that the only ALU whose utilisation matters is the MAD. Yep, marketing-101 for the win.
Games, sure, but we're not discussing those, are we?
As I said earlier, feel free to find the algorithm whose optimal code has nearly no instruction-level parallelism. Everything else is just like graphics: loaded with wodges of ILP.
The primary mitigating factor is the control flow divergence penalty of different architectures. Getting evidence for the penalties in these architectures is really hard.
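Here's the kind of thing I mean, as a minimal made-up kernel: when the branch is data-dependent, some warps take both sides and the hardware serialises the two paths with part of the warp masked off. How much that actually costs (warp versus wavefront width, re-convergence overhead and so on) is exactly what's hard to get solid numbers for.

// Hypothetical kernel: the branch depends on the data, so some warps
// will take both sides. Within such a warp the hardware runs the two
// paths one after the other with the inactive lanes masked off, so the
// cost is roughly the sum of both paths; how bad that is in practice
// depends on the divergence width and per-branch overhead.
__global__ void divergent(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (x[i] > 0.0f)
        y[i] = sqrtf(x[i]);        // "expensive" path
    else
        y[i] = 0.0f;               // cheap path
}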
And we don't know what Larrabee will do compared to upcoming hardware. Comparing it to an architecture from 2006 isn't really a worthwhile exercise, is it?
No, we should just stop talking till GT300 arrives.
CUDA has already proven itself. Where are all the GPGPU algorithms that are running better on AMD's hardware? That's where the burden of proof lies.
There's no proof they run better on NVidia's hardware. The comparison, generally speaking, hasn't been made. CUDA has been an easy choice for a lot of people because it's been shoved under their nose by NVidia, while AMD, in its own wisdom, is building relationships mostly with commercial partners who don't publish papers but do publish "speed-up" graphs with lies about speed-ups based on un-optimised CPU code.
The person I mentioned I was talking with earlier doesn't have anything else enlightening to say, so I've got nothing to add on that particular data point.
Can't blame them. It must be NVidia's marketing smoke screen that's preventing people from seeing them.
Yep, it's working on you.
Before long you'll be ribbing me for suggesting that version 1.0 of AMD's OpenCL isn't optimal; give it a while yet. I suppose that'll be all the proof you need.
Jawed