Nvidia GT300 core: Speculation

Assuming 25% utilization over 2 years, one of those ultra-high-margin Tesla blades should be compared against ~5 quad-core blades.
 
Is Larrabee's L1 banked? Not sure what you're saying here. With shared memory you get a fast read as long as it's free of bank conflicts. So you're going to have a lot more opportunities for one-shot reads compared to a traditional cache, where everything has to be on the same cache line.
You're right - if the data is staggered across banks in NVidia shared memory then bandwidth is maintained despite it being "unaligned". It's amusing when the optimum algorithm staggers data by 17 (i.e. leaving memory deliberately unused) in order to maintain banking. Optimisation is rewarding when you discover these little tricks :D
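
To make that concrete, something like the classic padded tile transpose (a rough CUDA sketch with my own names, assuming a square matrix whose width is a multiple of 16 and the 16 banks of G80/GT200 - not anyone's actual kernel):

Code:
// The extra column (17, not 16) is the "stagger" - it wastes 64 bytes of
// shared memory per tile, but it means a column walk touches a different
// bank on every step instead of hammering one bank 16 times.
__global__ void transpose16(const float *in, float *out, int width)
{
    __shared__ float tile[16][17];                        // padded row stride

    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load

    __syncthreads();

    x = blockIdx.y * 16 + threadIdx.x;                    // swap block indices
    y = blockIdx.x * 16 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // column read, conflict-free
}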

I don't know the performance profile of this scenario in Larrabee.

The algorithm might be tackled differently. For example, the dimensions of the tile of data in NVidia shared memory (the size of the data for extant threads) are constrained by having to wrap around at 16KB (or 32KB, at least, in GT300). This wrap-around constraint is much, much looser in Larrabee. So rather than doing staggered reads like this, something else may be done (e.g. pack/splat data, fine-grained pre-fetching, etc.). Who knows, eh?

Jawed
 
That's correct. If you look at the ATI forums, where somebody asked what's the max regs he could use per thread, the reply he got was that it has to be low enough that 2 wavefronts can run simultaneously. I believe you are on those forums, so I'll leave it for you to dig through them. :devilish: So in a way NV needs 1 warp/multiprocessor at the minimum, but AMD needs 2 wavefronts/SIMD
I think 2 wavefronts are the minimum because that's the widest the register file can be split up. One wavefront simply can't address the entire register file, as the addressing logic doesn't support that. 128KB of registers for 64 work items is a lot - that's 2KB, i.e. 128 128-bit registers, per work item.

Jawed
 
NVIDIA's ALU design frees compilers, not developers. There's no question that writing a compiler that generates efficient code for NVIDIA's ALU design is going to be easier than writing such a compiler for AMD's ALU design. I also suspect that writing assembly code in their respective virtual ISAs would be way easier on NVIDIA hardware.
I agree it's easier, but it isn't a pure scalar pipeline so it's not trivial - it's particularly hampered by read-after-write latencies, which increase total per-thread latency (i.e. the compiler doesn't have free rein when weighing instruction-sequencing against register-count).

I think Larrabee's VPU is easier still, because that's truly scalar as far as I can tell and, provided there aren't register read-after-write gotchas, it should be smooth sailing.

But the scalar part of Larrabee's core introduces a question of scheduling dependencies generated across scalar + VPU. So, not quite so smooth sailing, if you intend to juggle cache lines and VPU bit masks on that side of the fence. Too early to say how separable these are and how much compiler grief this causes.

---

I've been playing with some mandelbrot code (partly inspired by the mandelbrot snippet in Abrash's Larrabee presentation). The main loop on AMD is 8 cycles with 30% ALU utilisation, for 7 FLOP (i.e. excluding iteration counting, that's an int) - i.e. 0.875 FLOP per cycle and 15G iterations per second.
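
For reference, the shape of the inner loop I'm counting is roughly this (a CUDA-flavoured sketch with my own naming, not the actual code; the 7 FLOPs are marked):

Code:
// One Mandelbrot iteration: 7 floating-point ops per trip (escape test and
// both MADs included), plus an integer counter that isn't counted as FLOP.
__device__ int mandel(float cr, float ci, int max_iter)
{
    float zr = 0.0f, zi = 0.0f;
    int i = 0;
    while (i < max_iter) {
        float zr2 = zr * zr;              // 1: MUL
        float zi2 = zi * zi;              // 2: MUL
        if (zr2 + zi2 > 4.0f) break;      // 3: ADD (escape test)
        zi = 2.0f * (zr * zi) + ci;       // 4,5: MUL + MAD
        zr = (zr2 - zi2) + cr;            // 6,7: ADD + ADD
        ++i;                              // int, excluded from the count
    }
    return i;
}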

In double precision it's 13 cycles with 52% ALU utilisation, i.e. 0.538 FLOP per cycle and 9.3G iterations per second.

The hardware compiler often doesn't pack double-precision ADDs properly, though. Optimally compiled, it should be 12 cycles.

I count 14 cycles single-precision on Larrabee (0.5 FLOP per cycle) and I reckon double-precision would be 21 cycles (0.33 FLOP per cycle).

That said, I don't buy the argument that the incredible density of AMD's ALUs is just a byproduct of their architecture. If that were the case we would have seen such a density advantage on R600 too, but it wasn't there. I don't know what magic AMD pulled this time, but they certainly did a great job, and whatever they did they are not telling anyone.
Orton bemoaned "libraries" on the subject of R600. Perhaps better libraries are one of the reasons for the density gain.

Jawed
 
I heavily suspect NV isn't going for MIMD, but either way I think most people are massively exaggerating the problem. Based on reliable data for certain variants of PowerVR's SGX, which is a true MIMD architecture with some VLIW, their die size is perfectly acceptable.
You keep saying this but never provide any data.

What percentage of the die is ALU? What's the FLOP/mm2? What happens when you try to run it at 1GHz (i.e. how many pipeline stages need to be added)?

Jawed
 
An option for Nvidia is to drop the interpolation/transcendental logic and do those on the main ALU. With that consideration out of the way they can potentially issue a MAD every two shader clocks instead of every four, allowing for 16-wide SIMDs that execute a 32-thread warp over two shader clocks instead of four. There should be some operand fetch, instruction issue and thread scheduling overhead savings there, assuming it's even feasible to drop the SFUs in the first place. Maybe they could use the savings for dynamic warp formation :)
I think this is what I was suggesting too.

This reminds me, a while back I was making a comparison of NVidia and ATI interpolation rates:

http://forum.beyond3d.com/showpost.php?p=1263916&postcount=503

Bob corrected my interpretation of "attribute" and I didn't immediately realise that this means the ATI interpolation rate is far higher than what I described - 4x higher, in fact. ATI is actually interpolating 128 scalars per clock. This is the same per-clock rate as G80/G92, except NVidia runs interpolation at the ALU clock, ~2x the core clock.

Arguably this matches nicely with NVidia's 2x texturing rate per clock, too (compared with ATI).

But as the ALU:tex ratio increases in NVidia's designs there'll be less reason to increase the interpolation rate. Arguably the crunch time is a fair way off, though.

Jawed
 
I wonder if AMD's extremely dense ALU logic has some hidden caveats? I mean, clearly they have very talented people working on it, but in the end they can do no more magic than the next guy.

Maybe this is somehow related?
I think that's simply reflecting the lack of shared memory.

In N-body Simulation:

http://forum.beyond3d.com/showthread.php?p=1281746#post1281746

which I presume is similar to what's being done with F@H, each body I has forces summed as a result of interactions with the other bodies. When the force between body I and body J is computed, the result can be summed into the forces on both I and J.

The naive N-body algorithm doesn't do this: it computes the force between I and J twice, once when I is the subject of the forces and again, later, when J becomes the subject of the forces.

Without shared memory ATI is stuck with the naive algorithm, I believe, computing each interaction twice.
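
Roughly, the naive kernel looks like this (a hedged CUDA sketch with my own names - here w holds the mass and a softening term avoids the singularity):

Code:
// Naive O(N^2) force kernel: every body walks the whole list, so the (I,J)
// interaction is evaluated twice - once from I's side, once from J's.
// The symmetric version (loop over J > I, add +f to I and -f to J) halves
// the arithmetic, but needs scattered accumulation of partial sums, which
// is where shared memory earns its keep.
__global__ void forces_naive(const float4 *pos, float3 *acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float3 ai = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {
        float dx = pos[j].x - pos[i].x;
        float dy = pos[j].y - pos[i].y;
        float dz = pos[j].z - pos[i].z;
        float d2 = dx*dx + dy*dy + dz*dz + 1e-9f;   // softened distance^2
        float inv_d = rsqrtf(d2);
        float s = pos[j].w * inv_d * inv_d * inv_d; // mass / d^3
        ai.x += s * dx;  ai.y += s * dy;  ai.z += s * dz;
    }
    acc[i] = ai;
}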

Jawed
 
When there was no other viable option out there until recently, how is that marketing, and how is that being pushed under others' noses? Don't fool yourself and think AMD likes the position it's in right now.
CUDA centres of excellence (I think there's 5 of them now) and the free hardware that is credited in dozens of papers both say hello.

Hell, if I was doing research I'd be right in line.

Jawed
 
CUDA centres of excellence (I think there's 5 of them now) and the free hardware that is credited in dozens of papers both say hello.

I get the impression that you somehow want to use the fact that nVidia actively promotes its products as some kind of argument that these products are somehow inferior...
Yes, I use the word "inferior", which implies a comparison with other solutions. That is rather ironic, since those other solutions haven't even reached the same market-ready state yet.

I have no problem with nVidia giving away free hardware, and I certainly don't let it cloud my judgement of their technology and toolset, neither positively nor negatively.
 
Whoa! How could I miss that sentence? Jawed, come on, man. You really think AMD is making vast inroads with commercial customers behind closed doors and is just oh so wise and humble that it's hiding it from the public and its shareholders?
Are you kidding?

I mean, for example, have you noticed there's a company called Havok?

Jawed
 
The problem I see with this is that LRB L2 latency is not the same as a shared memory or register access. Isn't LRB's L2 at least 10 clocks away? So this 96-to-128-byte number isn't apples to apples.
With L1 cache-line locking for stream-through and pre-fetching there's no L2 latency at all.

Also we should toss in an extra 8KB of constant space on GT200 :cool:
64KB shared across the entire GPU, very useful for large wodges of data shared by all threads :cool:
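
For illustration, a minimal CUDA sketch of the sort of thing that constant space is good for (hypothetical coefficient table; the point is that every thread in a warp reads the same entry on each iteration, so it's a single broadcast from the constant cache):

Code:
__constant__ float c_coeffs[1024];   // lives in the 64KB constant space

// Horner evaluation: c_coeffs[k] is uniform across the warp on every trip,
// so each access is one broadcast rather than per-thread fetches.
__global__ void poly_eval(const float *x, float *y, int n, int degree)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = c_coeffs[degree];
    for (int k = degree - 1; k >= 0; --k)
        acc = acc * x[i] + c_coeffs[k];
    y[i] = acc;
}

// Host side: cudaMemcpyToSymbol(c_coeffs, host_coeffs, sizeof(float) * count);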

Jawed
 
I count 14 cycles single-precision on Larrabee (0.5 FLOP per cycle) and I reckon double-precision would be 21 cycles (0.33 FLOP per cycle).
That sample code is just that, sample code. There's no attempt at proper scalar and vector instruction scheduling. The real cycle count would be close to your R700 figure.
 
As far as I know, ATi simply doesn't offer this option.
It's a terrible burden interfacing MATLAB through its C interface into the ACML-GPU library for SGEMM/DGEMM acceleration on ATI.

Oh, wait, you are waiting for AMD's marketing guys to tell you you can do this, aren't you?

Also, I think arguing that Matlab doesn't have well-optimized libraries for CPU is a bit of a dead-end.
Who argued that?

Jawed
 
It's a terrible burden interfacing MATLAB through its C interface into the ACML-GPU library for SGEMM/DGEMM acceleration on ATI.

Not quite the same as nVidia's solution.

Oh, wait, you are waiting for AMD's marketing guys to tell you you can do this, aren't you?

They know better than to try and sell a half-arsed solution like that.

Who argued that?

You did. You said that Cuda was only compared to unoptimized x86 code.
 
Jawed, do you realize NVIDIA has real revenue for CUDA (several million dollars at least in 2008, ultra-high-margin), while AMD doesn't for their GPGPU solution?
Does AMD break out GPGPU revenue?

You could argue NV dropped the ball when it comes to consumer GPGPU, but trying to defend AMD in HPC is just really dumb - and I'm sure you know better anyway.
I'm not defending; I'm trying to promote a separation of the marketing about GPU capabilities from the architectural capabilities.

But you're right that many CUDA papers aren't being very fair to CPUs in terms of optimization, but let's not get ahead of ourselves. We're not talking about 60x speed-ups becoming negligible; we're talking 40x going to 10x, probably. And frankly, if that's the only point, it's an incredibly backward-looking one, because GPU flops in 2H09/1H10 are going to increase so much faster than CPU flops that I'm not sure why we're even discussing this. In fact, the fact that real-world performance compared to super-optimized, already-deployed code isn't always so massive was even mentioned in June 2008 at Editor's Day for the GT200 Tesla. There was a graph for an oil & gas algorithm IIRC, and the performance was only several times higher - but scalability was also much better, and even excluding that, cost efficiency was better than just the theoretical performance improvement.
For astrophysics it seems to me NVidia's selling their stuff too cheap :p I get the impression there's a riot going on out there as the speed-ups are just absurd and GPUs are obnoxiously cheap.

Also, uhhh... for how long have we been discussing G8x? I was pretty damn sure you understood at one point that: a) there are only two ALU lanes, not three; you can't issue a SFU and an extra MUL no matter what.
I'm talking about GT200 - haven't you heard, G80's ancient history.

b) These two ALU lanes can be issued with DIFFERENT threads on GT200, so efficiency should be ~100% for dependent code of the form 'Interp->MAC->MUL->MAC->Interp->...' - it just works! Don't try to imagine problems that aren't there...
Even on G80 unrolling increases throughput, so I think you need to go write some code and test it properly. Hint: dumb sequences of MADs/MULs/transcendentals without any memory operations and/or trivial branching are a waste of time.
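
For what it's worth, the shape of the test I mean is below (a hedged CUDA sketch, my own kernel names): same MAD count in both, but the second keeps four independent chains in flight so the read-after-write latency overlaps. Per the hint, it's still a toy, but it shows why unrolling helps.

Code:
// One dependent MAD chain: each MAD waits on the previous result.
__global__ void mad_serial(float *out, float a, float b, int iters)
{
    float x = threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = x * a + b;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

// Same op count, four independent chains: RAW latencies overlap.
__global__ void mad_unrolled(float *out, float a, float b, int iters)
{
    float x0 = threadIdx.x, x1 = x0 + 1.0f, x2 = x0 + 2.0f, x3 = x0 + 3.0f;
    for (int i = 0; i < iters / 4; ++i) {
        x0 = x0 * a + b;
        x1 = x1 * a + b;
        x2 = x2 * a + b;
        x3 = x3 * a + b;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x0 + x1 + x2 + x3;
}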

Jawed
 
CUDA centres of excellence (I think there's 5 of them now) and the free hardware that is credited in dozens of papers both say hello.

Hell, if I was doing research I'd be right in line.

Jawed


And is that a problem for them, promoting what they have and no other company has? Capability-wise, ATi came out with better GPGPU tech well before nV, but why don't they do anything to promote themselves now? You would think, if your theory were correct and they had a solution more capable than CUDA and G80-and-up tech, they would promote it, right? Or do you just think it's AMD playing dumb so they can blindside nV in the near future? There is no basic logic that would condone a competitor letting the other take a market without even a whimper if they had something capable of being competitive.

It's like TWIMTBP: if AMD had the money to push their tech, if they had the foresight to help their community... but again they made the same mistake. Have they learned from their mistakes? Guess not.
 
That sample code is just that, sample code. There's no attempt at proper scalar and vector instruction scheduling. The real cycle count would be close to your R700 figure.
Can you explain at least what the starting point is? I can't see it :oops:

Jawed
 