G80 Architecture from CUDA

CUDA more complex than CELL, yeah.. sure.. fine.. whatever
If you combine the various comments around the net, you will conclude that:
- CUDA is harder than CELL.
- CUDA is very hard to get more than 10% efficiency out of.
- CUDA has advanced memory hierarchy optimization requirements.

No offense intended to any of these people, but I would tend to believe every single one of these statements is a gigantic bunch of bullshit based on my experience... :LOL:
 
However difficult CUDA may be (and it is certainly more difficult than most CPU programming), it is vastly easier to deal with than previous GPU programming paradigms.
 
If you combine the various comments around the net, you will conclude that:
- CUDA is harder than CELL.
- CUDA is very hard to get more than 10% efficiency out of.
- CUDA has advanced memory hierarchy optimization requirements.

No offense intended to any of these people, but I would tend to believe every single one of these statements is a gigantic bunch of bullshit based on my experience...

I have seen your comment on the blog article in question and I disagree somewhat*. I do think the article in question overstates the difficulty somewhat (especially if you come from a GPGPU background), but with CUDA (and probably CTM) you have to manage hardware intricacies that are usually hidden by the graphics driver during normal use, whereas CELL was built as it is to be managed by the programmer. In that sense, CUDA is harder because it exposes plenty of quirks that only driver writers should have to know about.
The 10% efficiency remark is certainly odd, unless he's talking about general purpose code.

*Having only read documents / SDKs on both, no practical experience.
 
If you combine the various comments around the net, you will conclude that:
- CUDA is harder than CELL.
- CUDA is very hard to get more than 10% efficiency out of.
- CUDA has advanced memory hierarchy optimization requirements.

No offense intended to any of these people, but I would tend to believe every single one of these statements is a gigantic bunch of bullshit based on my experience... :LOL:

We have plenty of previous examples of hardware that failed to live up to their early marketing promise, from the i860 to the PS3. CUDA looks set to follow in their footsteps: I expect that it will take vast amounts of work for programmers to get halfway decent performance out of a CUDA application, and that few will achieve more than 10% of theoretical peak performance.

He is saying that manually managing memory between three tiers is going to require great effort on the part of the developer, but is key to performance. Kind of like shaders on NV30 until Nvidia made a proper optimizing compiler. The compiler should have been written to optimize memory accesses. It has nothing to do with scalar/vector arguments or ALU implementation. This could also explain why Nvidia is having problems getting the drivers out the door.
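To make the "three tiers" concrete, here's a minimal sketch of what that juggling looks like in a kernel (a made-up example, not from the article): data is staged from off-chip global memory into on-chip shared memory, then worked on in per-thread registers.
Code:
// Hypothetical illustration of the three memory tiers the compiler/driver
// won't manage for you: per-thread registers, per-multiprocessor shared
// memory, and off-chip global memory. Launched with TILE threads per block.
#define TILE 256

__global__ void reverse_tiles(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];                 // tier 2: on-chip shared memory

    int i = blockIdx.x * TILE + threadIdx.x;     // tier 1: per-thread registers
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // tier 3: off-chip global memory
    __syncthreads();                             // tile now visible to the whole block

    if (i < n)
        out[i] = tile[TILE - 1 - threadIdx.x];   // reuse a value fetched by another thread
}
Getting coalesced global accesses and bank-conflict-free shared-memory patterns on top of this is where the real tuning effort goes.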

Bryan O’Sullivan has a beautiful summary of the present state of NVIDIA’s CUDA. He explains the programming model, along with the many different levels of memory and their restrictions (there are many). I had been quite optimistic in my last post about CUDA (just from taking a quick glance at their source code), but Bryan’s very educated opinion brought me back to earth.

http://arstechnica.com/news.ars/post/20070227-8931.html

I think that O'Sullivan's point at the end about how only people on Wall Street and in the defense sector will love CUDA because they can commit the developer resources to learning it intimately is well-taken, but I don't see this as a criticism of the technology. To switch subjects for a moment and talk about CTM, the fact that AMD/ATI just opened up the assembly language interface to their GPUs and told people "have at it" is essentially an admission that they're currently only pitching it to parties who truly need this kind of performance and are willing to pay for it in programmer time. The same is almost certainly true of CUDA at this stage.

Not sure about your background Arun, but these guys seem to know their stuff. Just because they are not heaping praise on CUDA doesn't make it BS.
 
Ummm isn't CUDA currently the most accessible GPU programming solution? I think Nvidia did themselves a disservice by emphasizing the C-type language elements; now people are expecting to write "Hello World!" in CUDA as easily as they do in Visual Studio :LOL: :LOL:

They're comparing CUDA to the wrong things IMO.
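For reference, a complete CUDA program really isn't far from plain C. A minimal sketch along these lines (illustrative only, not tested against the current beta) is about as involved as it gets before you start tuning:
Code:
// Hypothetical minimal CUDA program: the host side is ordinary C; the new
// parts are the kernel, the <<< >>> launch syntax and the cudaMalloc /
// cudaMemcpy calls. Build with nvcc.
#include <stdio.h>

__global__ void add_one(int *data)
{
    data[threadIdx.x] += 1;
}

int main(void)
{
    int h[4] = {1, 2, 3, 4};
    int *d;

    cudaMalloc((void **)&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    add_one<<<1, 4>>>(d);                 // one block of four threads

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("%d %d %d %d\n", h[0], h[1], h[2], h[3]);
    return 0;
}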
 
Not sure about your background Arun, but these guys seem to know their stuff. Just because they are not heaping praise on CUDA doesn't make it BS.

Do you own a G80 board? Have you tried programming in CUDA? If not, why not? Why go from blogs when real experience is so cheaply had?

Programming in CUDA is what... a £250 purchase plus a free download of a beta compiler + SDK? I have one now, and I'm using it, and it's great. For £250. Despite IBM repeatedly proclaiming their undying love for me and my colleagues and wanting our millions, they have yet to open their wallet and allow me similarly cheap access to any Cell hardware ($17,000 was their best offer), or to a Clearspeed plugged in to one of their Opteron boxen ("you can give us some code and we'll run it for you, maybe" ha ha).

CUDA has many flaws, partly because, flexible as it is, G80 isn't as flexible as a lazy programmer might wish. There are many reasons for this. However, of the possible solutions (CUDA/CTM, Cell, Clearspeed), the GPU-based ones are by far the most accessible to the wider developer community. Therefore they win, regardless of anything else. CUDA isn't finished yet, and neither is G80/90/1xx, but they're here, now, and cheap.
 
BTW, if these kids playing with CELL and CUDA are scared to use local memory (bless them)... they can simply not use it, at least on CUDA, and live happily ever after.
If they want the same power as a GPU with the same flexibility and easy-to-use 'PC programming model', they can wait... forever.
Extra flexibility comes at a cost, and it always will.
 
Do you own a G80 board? Have you tried programming in CUDA? If not, why not? Why go from blogs when real experience is so cheaply had?

Programming in CUDA is what... a £250 purchase plus a free download of a beta compiler + SDK? I have one now, and I'm using it, and it's great. For £250. Despite IBM repeatedly proclaiming their undying love for me and my colleagues and wanting our millions, they have yet to open their wallet and allow me similarly cheap access to any Cell hardware ($17,000 was their best offer), or to a Clearspeed plugged in to one of their Opteron boxen ("you can give us some code and we'll run it for you, maybe" ha ha).

CUDA has many flaws, partly because, flexible as it is, G80 isn't as flexible as a lazy programmer might wish. There are many reasons for this. However, of the possible solutions (CUDA/CTM, Cell, Clearspeed), the GPU-based ones are by far the most accessible to the wider developer community. Therefore they win, regardless of anything else. CUDA isn't finished yet, and neither is G80/90/1xx, but they're here, now, and cheap.

Not sure what this has to do with the complexity of programming the hardware, but I agree that the whole appeal of doing processing on the GPU is exactly what you're saying: cheap, available, and powerful hardware.
 
BTW, if these kids playing with CELL and CUDA are scared to use local memory (bless them)... they can simply not use it, at least on CUDA, and live happily ever after.
If they want the same power as a GPU with the same flexibility and easy-to-use 'PC programming model', they can wait... forever.
Extra flexibility comes at a cost, and it always will.

Even Cell is complicated to program. I read an article just the other day about developers wanting, or getting, access to IBM Cell engineers so they could better utilize the PS3 hardware.
 
Even Cell is complicated to program. I read an article just the other day about developers wanting, or getting, access to IBM Cell engineers so they could better utilize the PS3 hardware.
I have far more programming experience on CELL than on CUDA, but it does not take a rocket scientist to understand that it's much, much easier to set up something running decently fast on the latter platform.
As I already wrote, with CUDA you're not forced to explicitly make use of on-chip memory; it's up to you to do so.
I hope that even an undergraduate student knows the difference between registers and external memory; if he/she is not able to cope with that, I think the problem does not lie with CUDA or CELL or whatever other platform you want to adopt :)
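To illustrate the point, here is a perfectly valid kernel that never touches __shared__ at all (a made-up example). It fetches its neighbours straight from global memory; a tuned version would stage each tile in shared memory once and reuse it, but nothing forces you to.
Code:
// Hypothetical example: CUDA does not force you to use on-chip memory.
// This 3-point average reads its neighbours from global memory every time;
// it works, it is just slower than a shared-memory version would be.
__global__ void smooth_naive(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}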
 
I have far more programming experience on CELL than on CUDA, but it does not take a rocket scientist to understand that it's much, much easier to set up something running decently fast on the latter platform.
As I already wrote, with CUDA you're not forced to explicitly make use of on-chip memory; it's up to you to do so.
I hope that even an undergraduate student knows the difference between registers and external memory; if he/she is not able to cope with that, I think the problem does not lie with CUDA or CELL or whatever other platform you want to adopt :)

I am sure that is true.
 
http://forums.nvidia.com/index.php?showtopic=30042&view=findpost&p=168648

Basically, each multiprocessor (on G80) can support 24 32-thread warps at a time.
Apart from per-thread register usage, I'm not sure if there are other factors that come into play here, or how this count relates to DX9 or D3D10 usage of G80.

It seems to me that a best-case single-instruction latency hiding works out to be 384 ALU clocks (192 core clocks) for a vec4 instruction and 96 ALU clocks (48 core) for a scalar ALU instruction.
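For anyone who wants to check the arithmetic, those figures work out as follows (assuming 8 scalar ALUs per multiprocessor, so a 32-thread warp issues over 4 ALU clocks, and an ALU clock at roughly twice the core clock, which is what the numbers imply):
Code:
#include <stdio.h>

// Back-of-the-envelope check of the figures above. Assumptions (mine):
// 8 scalar ALUs per multiprocessor and ALU clock = 2x core clock.
int main(void)
{
    int warps_per_mp     = 24;
    int threads_per_warp = 32;
    int alus_per_mp      = 8;

    int issue_per_warp = threads_per_warp / alus_per_mp;   // 4 ALU clocks per warp
    int scalar_hide    = warps_per_mp * issue_per_warp;    // 24 * 4 = 96 ALU clocks
    int vec4_hide      = 4 * scalar_hide;                  // 384 ALU clocks

    printf("scalar: %d ALU (%d core) clocks, vec4: %d ALU (%d core) clocks\n",
           scalar_hide, scalar_hide / 2, vec4_hide, vec4_hide / 2);
    return 0;
}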

Jawed
 
I'm kinda confused by that statement Jawed. Wouldn't best case instruction latency in clocks be static regardless of the instruction width? Then the number of instructions needed to hide that latency on G80 would differ based on whether they're vec 1/2/3/4....?

Also, what does "single instruction latency" mean?
 
I'm talking about how many clock cycles of what is effectively "texturing latency" (setting parameters, fetching from memory, filtering) can be hidden by a single instruction.

The number of instructions required to hide a specific amount of latency depends on the vector width of each instruction issued in parallel with the texture operation, yes, agreed.

Jawed
 
I'm talking about how many clock cycles of what is effectively "texturing latency" (setting parameters, fetching from memory, filtering) can be hidden by a single instruction.

Ah, gotcha. But is that going to be a common scenario? Isn't the whole point of threading to have multiple threads/warps and multiple non-dependent instructions per thread/warp available to keep the ALUs busy during I/O?
 
Isn't the whole point of threading to have multiple threads/warps and multiple non-dependent instructions per thread/warp available to keep the ALUs busy during I/O?
Absolutely.

For comparison R5xx has 128 batches (6144 threads) of best-case single-instruction latency hiding, i.e. 512 clocks.

I merely posted this as a fact that was heretofore unknown around these parts.

Jawed
 
The CUDA website was just updated with an XLS spreadsheet for calculating multiprocessor occupancy: http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls

Now the interesting part is click the GPU data tab. There you find:
Code:
GPU:                                            G80
Multiprocessors per GPU                         16
Threads / Warp                                  32
Warps / Multiprocessor                          24
Threads / Multiprocessor                        768
Thread Blocks / Multiprocessor                  8
Total # of 32-bit registers / Multiprocessor    8192
Shared Memory / Multiprocessor (bytes)          16384

Nice little rundown of the main stats. Has anything like this been released publicly before so concisely?
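As a rough illustration of what the spreadsheet does with those numbers (a simplified sketch; the real calculator also applies hardware-specific rounding to register and shared-memory allocations that this ignores), occupancy is just resident warps over the 24-warp maximum:
Code:
#include <stdio.h>

// Simplified occupancy sketch using the per-multiprocessor limits from the
// GPU data tab above and a hypothetical kernel (256 threads/block,
// 10 registers/thread, 4 KiB shared memory per block).
int main(void)
{
    const int max_threads = 768, max_warps = 24, max_blocks = 8;
    const int max_regs = 8192, max_smem = 16384;

    int threads_per_block = 256, regs_per_thread = 10, smem_per_block = 4096;

    int by_threads = max_threads / threads_per_block;                   // 3
    int by_regs    = max_regs / (regs_per_thread * threads_per_block);  // 3
    int by_smem    = max_smem / smem_per_block;                         // 4

    int blocks = by_threads;
    if (by_regs   < blocks) blocks = by_regs;
    if (by_smem   < blocks) blocks = by_smem;
    if (max_blocks < blocks) blocks = max_blocks;                       // 3 resident blocks

    int warps = blocks * (threads_per_block / 32);                      // 24 warps
    printf("occupancy: %d/%d warps = %.0f%%\n",
           warps, max_warps, 100.0 * warps / max_warps);
    return 0;
}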
 