You're right - if the data is staggered across banks in NVidia shared memory then bandwidth is maintained despite it being "unaligned". It's amusing when the optimum algorithm staggers data by 17 (i.e. leaving memory deliberately unused) in order to maintain banking. Optimisation is rewarding when you discover these little tricks.

Is Larrabee's L1 banked? Not sure what you're saying here. With shared memory you get a fast read as long as it's coalesced. So you're going to have a lot more opportunities for one-shot reads compared to a traditional cache where everything has to be on the same cache line.
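To make the stagger-by-17 trick concrete, here's a minimal sketch in CUDA terms (a hypothetical kernel using the classic padded-transpose pattern, not code from this thread): a 16x16 tile is given a row stride of 17 so that column accesses land in different banks on 16-bank shared memory.

// Sketch: pad the shared tile to 17 columns so the "wasted" column keeps
// column-order reads conflict-free on 16-bank shared memory. Assumes a
// square matrix whose width is a multiple of 16.
__global__ void transpose_tile(const float *in, float *out, int width)
{
    __shared__ float tile[16][17];              // 17, not 16: the stagger
    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();
    x = blockIdx.y * 16 + threadIdx.x;          // swapped block indices
    y = blockIdx.x * 16 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}

With a stride of 16 the column read tile[threadIdx.x][threadIdx.y] would hit the same bank 16 times; the extra column spreads it across all 16 banks.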
I think 2 wavefronts are the minimum because that's the widest the register file can be split up. One wavefront simply can't address the entire register file as the addressing logic doesn't support that. 128KB of registers for 64 work items is a lot.

That's correct. If you look at the ATI forums where somebody asked what's the max regs he could use per thread, the reply he got was that it has to be low enough that 2 wavefronts can run simultaneously. I believe you are on those forums, so I'll leave it for you to dig through them. So in a way NV needs 1 warp/multiprocessor at the minimum but AMD needs 2 wavefronts/SIMD.
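For what it's worth, the arithmetic lines up neatly if you assume RV770's 256KB per-SIMD register file (16K vec4 GPRs of 128 bits each - my numbers, not from the posts above):

// Back-of-envelope sketch, assuming a 256KB per-SIMD register file and
// 64 work-items per wavefront.
enum {
    kRegFileBytes  = 256 * 1024, // per SIMD
    kMinWavefronts = 2,          // one wavefront can only address half
    kWorkItems     = 64,         // per wavefront
    kVec4GprBytes  = 16          // 4 x 32-bit components
};
// 256KB / 2 = 128KB per wavefront; 128KB / 64 = 2KB per work-item;
// 2KB / 16B = 128 vec4 GPRs per work-item, which is in line with the
// ~128 GPR per-thread addressing limit in the ISA docs.
static const int kMaxGprsPerWorkItem =
    kRegFileBytes / kMinWavefronts / kWorkItems / kVec4GprBytes; // = 128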
I agree it's easier, but it isn't a pure scalar pipeline so it's not trivial, particularly hampered by read-after-write latencies which increase total per-thread latency (i.e. the compiler doesn't have free rein when evaluating instruction-sequencing versus register-count).

NVIDIA's ALU design frees compilers, not developers. There's no question that writing a compiler that generates efficient code for NVIDIA's ALU design is going to be easier than writing such a compiler for AMD's ALU design. I also suspect that writing assembly code in their own respective virtual ISAs would be way easier on NVIDIA hardware.
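A toy CUDA illustration of that trade-off (hypothetical kernels, not anyone's compiler output): splitting one dependent chain into two independent accumulators covers part of the read-after-write latency, at the cost of an extra register.

// One long dependent chain: every MAD must wait for the previous result.
__global__ void dependent_chain(float *out, float x)
{
    float a = x;
    for (int i = 0; i < 256; ++i)
        a = a * 1.0001f + 0.5f;
    out[threadIdx.x] = a;
}

// Same work split across two accumulators: one extra register buys
// instruction-level parallelism that hides part of the RAW latency.
__global__ void two_accumulators(float *out, float x)
{
    float a = x, b = x + 1.0f;
    for (int i = 0; i < 128; ++i) {
        a = a * 1.0001f + 0.5f;
        b = b * 1.0001f + 0.5f;
    }
    out[threadIdx.x] = a + b;
}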
Orton bemoaned "libraries" on the subject of R600. Perhaps better libraries are one of the reasons for the density gain.

That said, I don't buy the argument that the incredible density of AMD's ALUs is just a byproduct of their architecture. If that were the case we would have seen such a density advantage even on R600, and it wasn't there. I don't know what magic AMD pulled this time, but they certainly did a great job, and whatever they did they are not telling anyone.
"SIMD processor and addressing method"

That could be a possibility, but why do you think the addressing logic is the cause?
You keep saying this but never provide any data.

I heavily suspect NV isn't going for MIMD, but either way I think most people are massively exaggerating the problem. Based on reliable data for certain variants of PowerVR's SGX, which is a true MIMD architecture with some VLIW, their die size is perfectly acceptable.
I think this is what I was suggesting too.

An option for Nvidia is to drop the interpolation/transcendental logic and do those on the main ALU. Without that consideration they can potentially issue a MAD every two shader clocks instead of every four, allowing for 16-wide SIMDs that execute a 32-thread warp over two shader clocks instead of four. There should be some operand fetch, instruction issue and thread scheduling overhead savings there, assuming it's even feasible to drop the SFUs in the first place. Maybe they could use the savings for dynamic warp formation.
I think that's simply reflecting the lack of shared memory.

I wonder if AMD's extremely dense ALU logic does have some hidden caveats? I mean, clearly they have very talented people working on it, but in the end they can do no more magic than the next guy.
Maybe this is somehow related?
CUDA centres of excellence (I think there's 5 of them now) and the free hardware that is credited in dozens of papers both say hello.

When there was no other viable option out there until recently, how is that marketing and how is that being pushed under others' noses? Don't fool yourself and think AMD likes the position it's in right now.
Are you kidding?

Whoa! How could I miss that sentence? Jawed, come on man. You really think AMD is making vast inroads with commercial customers behind closed doors and are just oh so wise and humble that they are hiding it from the public and their shareholders?
My stockbroker emailed me recently to say I can no longer trade foreign shares.

Or have you purchased shares of their stock?
With L1 cache-line locking for stream-through and pre-fetching there's no L2 latency at all.

The problem I see with this is that LRB's L2 latency is not the same as a shared memory or register access. Isn't LRB's L2 at least 10 clocks away? So this 96 to 128 byte number isn't apples to apples.
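For illustration, the stream-through idea looks something like this in generic x86/SSE terms (a sketch only - Larrabee's actual prefetch and cache-line-locking instructions are different, and the distance constant is made up):

#include <xmmintrin.h>

// Software-prefetch a fixed distance ahead so the L2-to-L1 fill overlaps
// with compute on earlier elements, hiding the L2 latency.
void scale(const float *in, float *out, int n, float k)
{
    const int kAhead = 16; // elements ahead; tune to cover L2 latency
    for (int i = 0; i < n; ++i) {
        if (i + kAhead < n)
            _mm_prefetch((const char *)(in + i + kAhead), _MM_HINT_T0);
        out[i] = in[i] * k;
    }
}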
64KB shared across the entire GPU, very useful for large wodges of data shared by all threads.

Also we should toss in an extra 8KB of constant space on GT200.
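In CUDA terms the 64KB pool is the __constant__ space. A minimal sketch (hypothetical names):

#include <cuda_runtime.h>

__constant__ float coeffs[16 * 1024];   // 64KB: the whole constant space

// All threads in a warp read the same address, so the value is broadcast
// from the per-SM constant cache rather than fetched per-thread.
__global__ void apply(const float *in, float *out, int n, int which)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float c = coeffs[which];            // uniform address -> broadcast
    if (i < n)
        out[i] = in[i] * c;
}

// Host side: fill the constant bank once, then launch as usual.
void upload(const float *host_coeffs)
{
    cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(coeffs));
}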
That sample code is just that, sample code. There's no attempt at proper scalar and vector instruction scheduling. The real cycle count would be close to your R700 figure.

I count 14 cycles single-precision on Larrabee (0.5 FLOP per cycle) and I reckon double-precision would be 21 cycles (0.33 FLOP per cycle).
It's a terrible burden interfacing MATLAB through its C interface into the ACML-GPU library for SGEMM/DGEMM acceleration on ATI.

As far as I know, ATI simply doesn't offer this option.
Who argued that?

Also, I think arguing that MATLAB doesn't have well-optimized libraries for CPU is a bit of a dead-end.
Oh, wait, you are waiting for AMD's marketing guys to tell you you can do this, aren't you?
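For the record, the glue really is small. A hypothetical MEX gateway might look like the sketch below, assuming ACML-GPU exposes ACML's usual C-style sgemm entry point (argument checks and error handling omitted; both inputs must be single precision):

#include "mex.h"
#include "acml.h"

// C = A * B via ACML-GPU's sgemm. MATLAB arrays are already column-major,
// which is exactly what the BLAS interface expects.
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int m = (int)mxGetM(prhs[0]);   // A is m x k
    int k = (int)mxGetN(prhs[0]);
    int n = (int)mxGetN(prhs[1]);   // B is k x n
    float *A = (float *)mxGetData(prhs[0]);
    float *B = (float *)mxGetData(prhs[1]);
    plhs[0] = mxCreateNumericMatrix(m, n, mxSINGLE_CLASS, mxREAL);
    float *C = (float *)mxGetData(plhs[0]);
    sgemm('N', 'N', m, n, k, 1.0f, A, m, B, k, 0.0f, C, m);
}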
Does AMD break out GPGPU revenue?

Jawed, do you realize NVIDIA has real revenue for CUDA (several million dollars at least in 2008, ultra-high-margin), while AMD doesn't for their GPGPU solution?
I'm not defending, I'm trying to promote a separation of the marketing about GPU capabilities from the architectural capabilities.

You could argue NV dropped the ball when it comes to consumer GPGPU, but trying to defend AMD in HPC is just really dumb - and I'm sure you know better anyway.
For astrophysics it seems to me NVidia's selling their stuff too cheap - I get the impression there's a riot going on out there as the speed-ups are just absurd and GPUs are obnoxiously cheap.

You're right that many CUDA papers aren't being very fair to CPUs in terms of optimization, but let's not get ahead of ourselves. We're not talking about 60x speed-ups becoming negligible; we're talking 40x going to 10x, probably. And frankly if that's the only point, it's an incredibly backward-looking one, because GPU flops in 2H09/1H10 are going to increase so much faster than CPU flops that I'm not sure why we're even discussing this. In fact, the fact that real-world performance compared to super-optimized, already-deployed code isn't always so massive was even mentioned in June 2008 at Editor's Day for the GT200 Tesla. There was a graph for an oil & gas algorithm IIRC, and the performance was only several times higher - but scalability was also much better, and even excluding that, cost efficiency was better than just the theoretical performance improvement.
I'm talking about GT200 - haven't you heard, G80's ancient history.

Also, uhhh... for how long have we been discussing G8x? I was pretty damn sure you understood at one point that: a) there are only two ALU lanes, not three; you can't issue an SFU op and an extra MUL no matter what.
Even on G80 unrolling increases throughput, so I think you need to go write some code and test it properly. Hint: dumb sequences of MADs/MULs/transcendentals without any memory operations and/or trivial branching are a waste of time.

b) These two ALU lanes can be issued with DIFFERENT threads on GT200; efficiency should be ~100% for dependent code of the form 'Interp->MAC->MUL->MAC->Interp->...' - it just works! Don't try to imagine problems that aren't there...
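In that spirit, here's roughly what such a probe looks like (a hypothetical sketch timed with CUDA events; note the hint above - a pure ALU burst like this is synthetic, so treat whatever number it prints with care):

#include <cstdio>
#include <cuda_runtime.h>

// Alternating dependent MAD/MUL chain, the 'MAC->MUL->MAC->...' shape
// being argued about. No memory traffic inside the loop, so the timing
// reflects ALU issue rate only.
__global__ void mad_mul_chain(float *out, float a, float b)
{
    float x = a;
    #pragma unroll 16
    for (int i = 0; i < 4096; ++i) {
        x = x * b + 0.25f;   // MAD on the main lane
        x = x * 1.0001f;     // MUL that could dual-issue on the other lane
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main()
{
    float *out;
    cudaMalloc(&out, 512 * 128 * sizeof(float));
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    mad_mul_chain<<<512, 128>>>(out, 1.0001f, 0.9999f);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("%.3f ms\n", ms);
    cudaFree(out);
    return 0;
}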
CUDA centres of excellence (I think there's 5 of them now) and the free hardware that is credited in dozens of papers both say hello.
Hell, if I was doing research I'd be right in line.
Jawed
Can you explain at least what the starting point is? I can't see it.

That sample code is just that, sample code. There's no attempt at proper scalar and vector instruction scheduling. The real cycle count would be close to your R700 figure.