ACM Queue: GPUs, Not Just for Graphics

Rufus

Newcomer
The latest ACM Queue magazine is a full issue about GPUs and the GPU/CPU convergence that is happening. It's an extremely good read, covering both the fundamentals and where people in the know think GPUs are/should be going. It includes:

Interview with Kurt Akeley (SGI, OpenGL) and Pat Hanrahan (both Mike Houston's and Ian Buck's advisor).

GPUs: a closer look - a good overview of how a GPU works, co-authored by our own Mike Houston.

Scalable Parallel Programming with CUDA - Nvidians (including Ian Buck) giving an overview of CUDA.

Data-Parallel Computing - MS Research giving a good overview of different parallel languages and techniques for implementing a data-parallel program.

Future Graphics Architectures - Intel on where they think the future is (hrm...I wonder what they could be basing this article on)

Online magazine: http://mags.acm.org/queue/20080304/
PDF of the magazine: http://mags.acm.org/queue/20080304/data/queue20080304-dl.pdf
Web version: http://acmqueue.org/modules.php?name=Content&pa=list_pages_issues&issue_id=48
 
Oops, I knew I'd forgotten to do something that was pointed out to me quite some time ago! :( (i.e. news it) - cheers Rufus.

The content is very nice indeed overall, but... I really, really hate saying this given how much respect I have for all of these people (and how much is due) - but somehow, much of it struck me as not considering some of the "fundamental" aspects of computation architecture. It's a great start for a discussion or debate, but perhaps not quite ideal as a reference.

I didn't notice any real discussion of the trade-offs involved in SIMD vs MIMD, or of the dynamics of making certain potential bottlenecks, like triangle setup and texture filtering, run in the shader core. Related subjects were discussed and predictions were made, but personally I think there could be plenty of good reasons to disagree with those predictions. Some of the discussion of thread contexts and latency hiding seemed strange, too.

Now, don't get me wrong, I'm not going as far as implying that because there is no advanced discussion of the implications of GPUs' threading architectures on register file architectures (including multi-banking), the content isn't good or valuable. What I'm trying to say is that if I were, say, a top GPU hardware researcher at Intel in 2005 who is now working on Larrabee, this wouldn't have helped me figure out a more efficient design - quite the contrary.

Now, mind you, the target audience isn't a hypothetical engineer from 3 years ago. If I were a CPU programmer or a CPU/DSP/... engineer who is simply curious about GPUs' past, present and future... then this would be truly splendid content: easy to digest, very high signal-to-noise, nicely illustrated, in-depth, and so forth. And clearly that's orders of magnitude nearer the target audience.

Anyway, the ONLY thing I'm saying here is that it might not be a good idea to refer to this as some form of infinitely insightful discussion of GPU architecture and future trends. I don't think that was the authors' goal either, but better safe than sorry. I deeply appreciated the articles and interviews; their quality was really good and they were a pleasure to read, and I'm sure just about everyone would agree with me here. But as I said, I think it's more of a good start for a great discussion than a real reference.

If all this was mind-blowingly obvious to everyone, I apologize - as I said though, better safe than sorry... Or if you disagree, well, that's a good start for a discussion too! ;)
 
If all this was mind-blowingly obvious to everyone, I apologize - as I said though, better safe than sorry... Or if you disagree, well, that's a good start for a discussion too! ;)

Kind of figured an Intel guy would have to at least trash AMD CTM and NVidia's CUDA with some very broad generalizations and no specific evidence.

However if we could read into this he might be confirming that Larrabee is going to have dedicated hardware for texture compression and filtering...
 
I thought the consensus was fixed-function hardware for texture *addressing* and DXTC decompression inside the addressing unit. I didn't notice anything that'd go against that - did I miss something?

Anyway, I've said it before and I'll say it again: the optimal implementation for texture filtering very likely is to handle texture addressing *and* filtering for compressed formats (and basic formats like INT8 *maybe*) in fixed-function hardware, and filtering for everything else in the shader core. I suspect that is the direction at least one IHV will take in the DX11 generation, although I don't have any hard proof of it.
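To make the "filtering in the shader core" half of that concrete, here's a rough CUDA sketch of what manual bilinear over an unfiltered FP32 surface costs in plain ALU work - the raw texel array, the lerp4 helper and the clamp addressing are all made up for illustration, not any IHV's actual path:

```cuda
#include <cuda_runtime.h>

// Illustrative helper: per-channel linear interpolation on a float4.
__device__ float4 lerp4(float4 a, float4 b, float t)
{
    return make_float4(a.x + t * (b.x - a.x), a.y + t * (b.y - a.y),
                       a.z + t * (b.z - a.z), a.w + t * (b.w - a.w));
}

// Manual bilinear over an FP32 texel array: four loads and three lerps,
// all running on the general-purpose ALUs instead of a filtering unit.
__device__ float4 bilinearFP32(const float4 *tex, int w, int h, float u, float v)
{
    float x = u * w - 0.5f, y = v * h - 0.5f;  // texel space, half-texel offset
    int   x0 = (int)floorf(x), y0 = (int)floorf(y);
    float fx = x - (float)x0,  fy = y - (float)y0;

    // Clamp-to-edge addressing, which fixed-function hardware also does for free.
    int x1 = min(x0 + 1, w - 1), y1 = min(y0 + 1, h - 1);
    x0 = max(x0, 0); y0 = max(y0, 0);

    float4 t00 = tex[y0 * w + x0], t10 = tex[y0 * w + x1];
    float4 t01 = tex[y1 * w + x0], t11 = tex[y1 * w + x1];

    return lerp4(lerp4(t00, t10, fx), lerp4(t01, t11, fx), fy);
}
```

For FP32 that's a perfectly reasonable trade; for INT8, it's the waste I complain about below.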

EDIT: Which reminds me of something that struck me in one of the articles: it was claimed that the primary reason for fixed-function hardware is that not everything is parallelizable. That's obviously one reason (especially when parallelism implies SIMD), but I personally consider the difference in unit costs to be at least as important, if not more. You really, really don't want to be using a unit capable of single-cycle FP32/INT32 multiplications to run INT8 code, or even FP32 comparisons for depth tests.

It's a truly ridiculous waste unless that's only a very small part of your workload. I suspect Larrabee will be relatively smart in how it handles that, though, given that its SIMD unit(s) are VLIW, so they can specialize some of it to less compute-intensive tasks. We'll see how it goes...
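For a very rough sense of the cost gap, take the standard first-order approximation that array-multiplier area grows quadratically with operand width (a textbook rule of thumb, not a measured figure):

\[
\frac{A_{\mathrm{mul},32\times 32}}{A_{\mathrm{mul},8\times 8}} \approx \left(\frac{32}{8}\right)^2 = 16
\]

So one 32-bit multiplier is on the order of sixteen 8-bit ones in area, before you even count the exponent and normalization logic an FP32 unit drags along. Crude, but it shows why pushing INT8 filtering through FP32 hardware stings.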
 
However if we could read into this he might be confirming that Larrabee is going to have dedicated hardware for texture compression and filtering...
You don't really need that as long as your local storage is big enough and there are enough threads running. They could also virtualize the processor through a driver. Same thing, really.
 
I thought the consensus was fixed-function hardware for texture *addressing* and DXTC decompression inside the addressing unit. I didn't notice anything that'd go against that - did I miss something?

Anyway, I've said it before and I'll say it again: the optimal implementation for texture filtering very likely is to handle texture addressing *and* filtering for compressed formats (and basic formats like INT8 *maybe*) in fixed-function hardware, and filtering for everything else in the shader core. I suspect that is the direction at least one IHV will take in the DX11 generation, although I don't have any hard proof of it.
All you really need for that is multi-gather over boundaries or many threads. Accumulation (MAD) is nice, but we all expect that to be available anyway.

And you can also turn it around: suspend threads at a texture lookup, load (part of) the texture in local store and do all the lookups for that texture at once. L2 cache could work as well, if it can persist and understands 2D data structures.
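For what it's worth, here's a minimal CUDA-flavoured sketch of that "stage a tile in local store, then do all the lookups at once" idea - the tile size, the single-channel format and the fixed fractional coordinates are all placeholders, and edge handling is omitted:

```cuda
#include <cuda_runtime.h>

#define TILE 16  // made-up tile size for illustration

// Launched with TILE x TILE threads per block: each block cooperatively
// stages a (TILE+1)^2 texture tile into shared memory, then every thread
// does its bilinear lookups from the staged copy, so the filtering
// traffic hits on-chip storage instead of DRAM.
__global__ void filterFromTile(const float *tex, int texW, float *out, int outW)
{
    __shared__ float tile[TILE + 1][TILE + 1];  // +1 row/col for the bilinear footprint

    int gx = blockIdx.x * TILE + threadIdx.x;
    int gy = blockIdx.y * TILE + threadIdx.y;

    // Cooperative load; assumes the texture extends one texel past the grid
    // (real code would clamp or wrap at the edges).
    tile[threadIdx.y][threadIdx.x] = tex[gy * texW + gx];
    if (threadIdx.x == TILE - 1) tile[threadIdx.y][TILE] = tex[gy * texW + gx + 1];
    if (threadIdx.y == TILE - 1) tile[TILE][threadIdx.x] = tex[(gy + 1) * texW + gx];
    if (threadIdx.x == TILE - 1 && threadIdx.y == TILE - 1)
        tile[TILE][TILE] = tex[(gy + 1) * texW + gx + 1];
    __syncthreads();

    // Placeholder fractional coordinates; a real kernel would derive these
    // from the sampling position.
    float fx = 0.5f, fy = 0.5f;
    float a = tile[threadIdx.y][threadIdx.x]
            + fx * (tile[threadIdx.y][threadIdx.x + 1] - tile[threadIdx.y][threadIdx.x]);
    float b = tile[threadIdx.y + 1][threadIdx.x]
            + fx * (tile[threadIdx.y + 1][threadIdx.x + 1] - tile[threadIdx.y + 1][threadIdx.x]);
    out[gy * outW + gx] = a + fy * (b - a);
}
```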
 
All you really need for that is multi-gather over boundaries or many threads. Accumulation (MAD) is nice, but we all expect that to be available anyway.

And you can also turn it around: suspend threads at a texture lookup, load (part of) the texture in local store and do all the lookups for that texture at once. L2 cache could work as well, if it can persist and understands 2D data structures.

Now I understand why Cell pwns for graphics.
 
All you really need for that is multi-gather over boundaries or many threads. Accumulation (MAD) is nice, but we all expect that to be available anyway.

And you can also turn it around: suspend threads at a texture lookup, load (part of) the texture in local store and do all the lookups for that texture at once. L2 cache could work as well, if it can persist and understands 2D data structures.
But that has nothing to do with the fact that using FP32+ units for INT8 filtering is a pretty damn big waste (heck, it has nothing to do with texture *filtering* either). You're using very expensive units, integrated into a very complex scheduling mechanism, to do a very cheap task. And it's not like only 1% of the performance would go in that direction, so no, it's not negligible.

Using your programmable core for triangle setup is great. For blending, it's also okay. For FP32 or even FP16/INT16 filtering, why not! For rasterization, well, if you can make it cheap enough, I won't complain too much. But good design decisions are made by thoughtful analysis, not philosophical choices that consider one approach to be 'good enough' and closer to the company's "vision".
 
EDIT: Which reminds me of something that struck me in one of the articles: it was claimed that the primary reason for fixed-function hardware is that not everything is parallelizable. That's obviously one reason (especially when parallelism implies SIMD), but I personally consider the difference in unit costs to be at least as important, if not more. You really, really don't want to be using a unit capable of single-cycle FP32/INT32 multiplications to run INT8 code, or even FP32 comparisons for depth tests.

It's a truly ridiculous waste unless that's only a very small part of your workload. I suspect Larrabee will be relatively smart in how it handles that, though, given that its SIMD unit(s) are VLIW, so they can specialize some of it to less compute-intensive tasks. We'll see how it goes...
Then again, that's also a chicken-and-egg thing: if you've got abundant general-purpose (scalar) execution units anyway, you tend to be bound by bandwidth or latency before you run out of processing power. In that case it can make more sense to increase the internal storage and/or threads in flight to put those units to good use.

It depends on your target market: if it's only graphics, some specialized units make sense as they reduce the overall transistor budget. But if you go for general purpose, virtualization is the ticket.

Edit: I think this also answers your last post.

Edit2: it would be like a GeForce 8800 that is CPU first and GPU second.
 
I thought the consensus was fixed-function hardware for texture *addressing* and DXTC decompression inside the addressing unit. I didn't notice anything that'd go against that - did I miss something?

"In particular, I expect the following specialization will
continue to exist for graphics architectures:

Texture hardware. Texture addressing and filtering
operations use low-precision (typically 16-bit) values that
are decompressed on the fly from a compressed representation
stored in memory. The amount of data accessed is
large and requires multithreading to deal effectively with
cache misses. These operations are a significant fraction
of the overall rendering cost and benefit enormously from
specialized hardware."

Not sure exactly what to gather :) from this, other than dedicated hardware to decompress textures and filter them.

It seems like it would be better to keep the textures compressed in the cache and decode-and-filter on the fly, rather than decompress from memory into the cache and then do the filtering manually with vector operations. But given Intel's push for low-latency designs, decoding from the cache on every access would add latency, so my guess is that Larrabee will decode into the cache and then filter with a SIMD opcode.

I see a good reason to support 16-bit compressed textures (whatever the future format) and their filtering in hardware. But I can agree that 32-bit int and full-float filtering is a waste to do in hardware (IMO).
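To put the quoted "decompressed on the fly" in concrete terms, here's roughly what a shader core has to do per texel to decode DXT1/BC1 without a dedicated unit. The block layout and palette rules follow the published DXT1 format; the code itself is just an illustrative CUDA sketch:

```cuda
#include <cuda_runtime.h>

// One 64-bit DXT1/BC1 block: two RGB565 endpoints plus sixteen 2-bit
// palette indices covering a 4x4 texel tile.
struct DXT1Block { unsigned short c0, c1; unsigned int indices; };

// Expand RGB565 to RGB888 by bit replication.
__device__ uchar3 rgb565(unsigned short c)
{
    unsigned char r = (c >> 11) & 31, g = (c >> 5) & 63, b = c & 31;
    return make_uchar3((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2));
}

// Decode the texel at (x, y), both in [0, 3], from one block.
__device__ uchar3 decodeDXT1(const DXT1Block &blk, int x, int y)
{
    uchar3 p0 = rgb565(blk.c0), p1 = rgb565(blk.c1);
    int idx = (blk.indices >> (2 * (4 * y + x))) & 3;

    if (blk.c0 > blk.c1) {  // four-colour mode: two interpolated palette entries
        if (idx == 0) return p0;
        if (idx == 1) return p1;
        if (idx == 2) return make_uchar3((2 * p0.x + p1.x) / 3, (2 * p0.y + p1.y) / 3,
                                         (2 * p0.z + p1.z) / 3);
        return make_uchar3((p0.x + 2 * p1.x) / 3, (p0.y + 2 * p1.y) / 3,
                           (p0.z + 2 * p1.z) / 3);
    } else {                // three-colour mode plus transparent black
        if (idx == 0) return p0;
        if (idx == 1) return p1;
        if (idx == 2) return make_uchar3((p0.x + p1.x) / 2, (p0.y + p1.y) / 2,
                                         (p0.z + p1.z) / 2);
        return make_uchar3(0, 0, 0);
    }
}
```

Multiply that by four texels per bilinear sample and it's easy to see why a small dedicated decoder next to the cache pays for itself.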
 
You could use a 32-bit unit to filter an 8- or 16-bit texture if it supports the same operation on parts of the register at byte or word boundaries at the same time, like SSE. You only need to be able to fill it with data that can be in different memory locations, instead of a single longword.

Decompression on the fly can be done with referenced, indirect addressing, which will perform about the same as when it's done through hardware. Depending on the addressing scheme used, it can be done in multiple pipeline stages.
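As a concrete instance of that byte-boundary idea, a single 32-bit integer unit can already average four packed 8-bit texels at once with the classic SWAR trick - a sketch, not any shipping hardware's datapath:

```cuda
// Exact per-byte floor((a + b) / 2) on four packed 8-bit values.
// (a & b) counts the shared bits once, (a ^ b) >> 1 halves the differing
// bits; the 0x7F mask stops the shift from leaking across byte lanes.
__device__ unsigned int avgPacked8(unsigned int a, unsigned int b)
{
    return (a & b) + (((a ^ b) >> 1) & 0x7F7F7F7Fu);
}
```

Nest three of those and you get a (slightly rounding-biased) 2x2 box filter over four RGBA8 texels, which is exactly the kind of cheap INT8 work the dedicated filtering units do today.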
 
Frank: You must be thinking of MMX, not SSE? Anyway, yeah, that's a fair point, although I'd suspect an FP32 unit capable of such a thing is still substantially more expensive than four much lower-precision units. And there's the control overhead of the core too. But I agree the gap obviously wouldn't be as dramatic as if you literally had to filter it in FP32 mode...

Anyway AFAICT, the way G84/G86/G9x works is that texture addressing and filtering hardware for non-FP32 modes is shared (which explains the lower bilinear rates). Clearly that'd also be an intriguing option for Larrabee, although I'm not sure it's the best solution.
 
Frank: You must be thinking of MMX, not SSE?
Yes, my mistake.

Anyway, yeah, that's a fair point, although I'd suspect an FP32 unit capable of such a thing is still substantially more expensive than four much lower-precision units. And there's the control overhead of the core too. But I agree the gap obviously wouldn't be as dramatic as if you literally had to filter it in FP32 mode...
If you do it like that, you can also use them for DSP, and you can handle arbitrarily sized data (like DP) at the same time. If you pack those units in fours, you can do quads as well.

You can even design them as integer, and use one of them for the exponent if you want to do FP. That reduces the control logic needed by quite a bit, but makes them slower at SP. Or you can use two 64-bit units, that both have the FP logic but can work on smaller sizes for integer. Integer is easy.

Anyway AFAICT, the way G84/G86/G9x works is that texture addressing and filtering hardware for non-FP32 modes is shared (which explains the lower bilinear rates). Clearly that'd also be an intriguing option for Larrabee, although I'm not sure it's the best solution.
I think it depends on what market Intel sees for it.
 
;-)

Note that these articles were written for a general technical audience and so we couldn't be as "academic" as some people would have liked. Also note that there are a few small copy errors in the articles. Who can find the "easy" one in the article Kayvon and I did?
 
Yay! Yay! Yay! Yay! happy happy happy!

Thought you of all people would appreciate that :LOL:

Mike, for what it's worth I think you guys nailed it considering your target audience. Some of the other articles are really good too. Is this particular magazine usually this good?
 
Queue is generally pretty good, but they bounce around many different domains and authors, so there is variability. Communications of the ACM is also pretty good. The "special issues" that have many articles in the same domain tend to be the strongest issues.
 
;-)

Note that these articles were written for a general technical audience and so we couldn't be as "academic" as some people would have liked. Also note that there are a few small copy errors in the articles. Who can find the "easy" one in the article Kayvon and I did?
Well, I don't recognize this product: ATI HD 2700XT.
 