Apparently, it can change the number of samples used for a single screen-pixel, thus saving work deemed unnecessary by the driver team.
What does Catalyst AI (not) do?
That's probably the point, isn't it? Make people think your 40 TMUs are doing as good a job as the competitor's 80 units. Somewhere you have to cut corners when you want to compete with your 280mm² chip against a die double the size.
Well, more power to them if the final result is the same.
Hopefully level.
I didn't say you did, just that it's very hard to get a good idea of how GPGPU performance compares when you can only compare CUDA to Brook+, with both using a different programming model, different compilers, etc. OpenCL will even the playing field in many respects, just as OpenGL/D3D did for regular graphics tasks.
And so is GT200, so what?
Well, first the obvious:
RV770 is functionally equivalent to G80, a chip that is 2 years older!
"On ATI you get better performance (in absolute terms), and if you use LDS then you get worse performance." I should have added on to the end something like "...get worse performance than without" to clarify that that wasn't a comparison of absolute performance, merely that ATI performance is lower with LDS.You said ATi was slower than nVidia when shared memory is involved.
I agree. AMD held steadfastly to the traditional "stream" view of stream computing, whereas NVidia decided to embrace shared memory. I've said it before: it was a mistake. Apart from that, of course, AMD didn't actually have a marketing strategy for throughput computing - still stuck in thinking "if we build it, someone will use it".
Now I am willing to go as far as stating that shared memory will be a very important tool in many GPGPU algorithms (unlike graphics).
Yep, absolutely zero
Yea, it's a shame ATi doesn't have any actual software out there.
You won't get any argument from me.
And it proves my point that there are GPGPU algorithms that can benefit significantly from fast shared memory.
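For what it's worth, here's a minimal sketch of the kind of kernel being argued about - a block-wise sum reduction staged through shared memory so every input only touches DRAM once. This is my own illustration (names and sizes are arbitrary), not code from the client being discussed:

[code]
// Minimal sketch (my own example): a block-wise sum reduction that stages data
// in shared memory, so each input is read from global memory exactly once and
// all the intermediate arithmetic hits the fast on-chip store instead.
__global__ void blockReduceSum(const float *in, float *out, int n)
{
    extern __shared__ float s[];            // blockDim.x floats, sized at launch
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    s[tid] = (i < n) ? in[i] : 0.0f;        // one global read per thread
    __syncthreads();

    // Tree reduction entirely in shared memory.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = s[0];             // one partial sum per block
}

// Launch example: 256 threads per block, 256 * sizeof(float) bytes of shared memory.
// blockReduceSum<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
[/code]

Without shared memory, those intermediate sums would have to round-trip through global memory or be restructured as multiple passes, which is exactly the gap being debated here.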
So, what's the basis for you saying this?
Question remains: if the ATi client is rewritten to use shared memory, will it once again become competitive?
I wouldn't be surprised if they can't quite close the gap.
Hmm, I'm not the one counting chickens before they've hatched.
Oh please, as if your bias wasn't showing enough already. This remark was completely uncalled for, and makes you look like a silly frustrated fanboy.
And Vantage shows the opposite.
While I won't argue with you about the sticker-speed of theoretical DP throughput (though I'd love to see some real-world numbers on real applications), TEX especially is a different beast, since Catalyst AI saves the TUs quite some (mega)bits of fetching and filtering. Without the filtering stuff, some texturing benchmarks show the expected numbers with respect to frequency and unit count on both architectures.
And so is GT200, so what?
Hmm, I'm not the one counting chickens before they've hatched.
I think OpenCL on Larrabee will be the real eye-opener. But of course its competitiveness is utterly unknown.
What benchmark, or game being benchmarked, has 32-bit uncompressed texture operations?
So that means when you THINK you're benchmarking 32-bit uncompressed texture operations, you're actually doing less.
So with their two-year lead, NVidia is still making GPUs that are uncompetitively large:
2-year gap, as I said.
So, again, why are they bigger? RV770 is functionally equivalent to any of NVidia's CUDA capabilities as far as I can tell. Perhaps you'd like to indicate why they're bigger?
Because that's what they focused on in the past few years, and that's largely the reason why their chips are so much bigger than ATi's.
If you're going to speculate:
Neither am I.
I'm just saying that I *think* nVidia has put more thought into their GPGPU design than ATi in the past few years, which I *think* might give them an advantage in OpenCL. This is just a speculation thread about GT300, remember? That's what I'm doing, speculating.
I haven't said that I'm sure of it, let alone that I would actually put money on it.
So your remark was rather lame and out of line, in my opinion.
... then answer my earlier question: why wouldn't you be surprised? What technical insight do you have?
No, it does mean something:
ATi didn't have shared memory until recently. nVidia had it for years. nVidia is now reaping the benefits.
And it proves my point that there are GPGPU algorithms that can benefit significantly from fast shared memory.
Question remains: if the ATi client is rewritten to use shared memory, will it once again become competitive?
I wouldn't be surprised if they can't quite close the gap.
Well, Intel itself isn't aiming for the stars. They say they aim at midrange performance at introduction. Given the info we have on Larrabee so far, I think that is a reasonable goal.
So I don't expect Larrabee to outperform nVidia and ATi solutions, not even in GPGPU tasks, because I don't quite see any advantages for Larrabee in OpenCL. OpenCL seems to suit nVidia's G80+ architecture just fine.
I don't see how that means anything like this.
If anything, the whole "simultaneous" GT200/RV770 launch thing screams that someone was waiting to see what the other one would put on the market. And from what I know (and logic should tell you), it wasn't AMD.
What benchmark, or game being benchmarked, has 32-bit uncompressed texture operations?
Jawed
SB, that's been a disingenuous comparison from day one. Regardless of whether or not Nvidia was caught by surprise by the 48xx cards, those cards started off in price territory where Nvidia never meant to go with GT200. So you're incorrectly bundling in sales of the 4850, for example, for which the rightful comparison should be G92.
Because currently, it's more marketing than useful?
Not sure why you hate on this approach so much.
AMD does dynamic clause scheduling - it's the instructions that are fixed in VLIW.
It's obviously more flexible and has fewer corner cases than AMD's pre-determined clause scheduling, which maps much better to traditional graphics workloads. Fine-grained scheduling requires more hardware, yes, but it's probably the way of the future.
How does VLIW affect viability?
And as I said above, where are all the compute apps that demonstrate the viability of AMD's approach?
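To make the VLIW point concrete, here's an illustration of my own (not taken from anyone's shader): a 5-wide VLIW unit like RV770's needs independent operations within a single thread to fill its slots, while a scalar design like G80's just needs enough threads in flight.

[code]
// Illustration only (my own example): how instruction-level parallelism inside
// one thread matters to a VLIW design but not to a scalar one.
__global__ void ilpExample(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];

    // Dependent chain: each multiply needs the previous result, so a 5-wide
    // VLIW bundle can only fill one slot here, while a scalar unit simply
    // switches to another thread while it waits.
    float a = x * x;
    float b = a * x;
    float c = b * x;

    // Independent operations: these can be packed into the same VLIW bundle,
    // which is the case the compiler's static bundling relies on finding.
    float p = x + 1.0f;
    float q = x * 2.0f;
    float r = x - 3.0f;

    out[i] = c + p + q + r;
}
[/code]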
And so we should have noticed if this was being done.
Anything which does post-processing likely does - not that the driver could do a format change on you to make that faster. Also if you had a texture as a lookup table into a texture atlas (I think the older DICE presentation had an example of this in a shipped title). There are other examples. Adjusting texture format from uncompressed to compressed can in some cases be really, really bad.
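Some back-of-the-envelope arithmetic on why such a silent format swap is tempting in the first place (my numbers, not a claim about what Catalyst AI actually does): RGBA8 is 32 bits per texel while DXT1 is 4 bits per texel, so recompressing cuts fetch traffic by 8x - at the cost of lossy block compression, which is exactly why it can be disastrous for lookup tables.

[code]
#include <cstdio>

// Back-of-the-envelope fetch-traffic comparison for a 2048x2048 texture
// (illustrative numbers only, not measurements).
int main()
{
    const long long texels  = 2048LL * 2048LL;
    const double rgba8Bytes = texels * 4.0;   // 32 bits per texel, uncompressed
    const double dxt1Bytes  = texels * 0.5;   // 4 bits per texel, block-compressed

    std::printf("RGBA8: %.1f MiB, DXT1: %.1f MiB, ratio: %.0fx\n",
                rgba8Bytes / (1024.0 * 1024.0),
                dxt1Bytes  / (1024.0 * 1024.0),
                rgba8Bytes / dxt1Bytes);      // prints 16.0 MiB, 2.0 MiB, 8x
    return 0;
}
[/code]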
[*] Larrabee programmers will have the choice to use almost any amount of local memory (shared memory in CUDA) per work-item, in comparison with the relatively small and fixed amounts in any GPU - unless the GPUs are radically re-designed.
What's interesting is that only 1 operand per instruction can come from shared memory in GT200. The same rule applies in Larrabee, if we label L1 cache as "shared memory".
There is also a big difference here, as shared memory is just that, and not a cache.
I never understood why it has been called "parallel data cache".
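As a hedged illustration of what that one-operand rule means in practice (my own sketch - the exact instructions emitted are up to the compiler and toolchain): when both sources of an arithmetic operation live in shared memory, one of them has to be staged through a register first.

[code]
// Sketch of the one-shared-operand point above (my own example; actual code
// generation is the compiler's business).
__global__ void sharedOperands(float *out)
{
    __shared__ float s_a[256];
    __shared__ float s_b[256];
    int tid = threadIdx.x;

    s_a[tid] = (float)tid;          // fill shared memory with arbitrary values
    s_b[tid] = (float)tid * 2.0f;
    __syncthreads();

    // On GT200 an ALU instruction can source at most one operand directly from
    // shared memory, so an expression like this needs one of the two values
    // moved into a register before the add can issue.
    float r = s_a[tid] + s_b[tid];

    out[tid] = r;
}
[/code]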
What benchmark, or game being benchmarked, has 32-bit uncompressed texture operations?
... then answer my earlier question: why wouldn't you be surprised? What technical insight do you have?
[*] Larrabee will be competitive purely in terms of FLOPs - and particularly in double-precision - because Larrabee is FLOPs swimming on a sea of cache, with some texturing decorating the edges.
[*] Larrabee programmers will have the choice to use almost any amount of local memory (shared memory in CUDA) per work-item, in comparison with the relatively small and fixed amounts in any GPU - unless the GPUs are radically re-designed. Likely to be 32KB per SIMD in the GPUs (which it seems is what D3D11-CS requires, not 16KB as I originally thought). Though if GT300 is 64 SIMDs with 32KB each, that's going some. Programmers will be freed from the terrors of CUDA occupancy balanced against bytes per thread of shared memory (a rough sketch of that balancing act follows below). I can hear them cheering already.
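The "occupancy balanced against bytes per thread" point boils down to simple arithmetic; here's a rough sketch with made-up kernel numbers (16KB of shared memory per multiprocessor is the G80/GT200 figure, the rest is illustrative):

[code]
#include <cstdio>

// Rough occupancy arithmetic (illustrative): how much shared memory a block
// asks for limits how many blocks a multiprocessor can keep resident.
int main()
{
    const int smemPerSM       = 16 * 1024;  // bytes of shared memory per SM (G80/GT200)
    const int threadsPerBlock = 256;
    const int bytesPerThread  = 32;         // what this hypothetical kernel wants

    const int smemPerBlock    = threadsPerBlock * bytesPerThread;  // 8 KiB
    const int residentBlocks  = smemPerSM / smemPerBlock;          // 2 blocks

    std::printf("%d bytes/block -> %d resident blocks (%d threads) per SM\n",
                smemPerBlock, residentBlocks, residentBlocks * threadsPerBlock);

    // Ask for 64 bytes per thread instead and only one block fits, halving the
    // threads available to hide memory latency - the balancing act in question.
    return 0;
}
[/code]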
Tom Forsyth's presentation said most vector instructions have latencies in the 4-9 cycle range.
There is. The scalar half of the core couldn't
But control flow and instruction sequencing within the VPU appears to be entirely static. It helps that the VPU is a purely scalar ALU from the point of view of a strand. So it seems like there's no need to perform instruction-by-instruction scheduling like NVidia does - though what we don't know yet is the read-after-write register latency in the VPU.