GPU vs CPU Architecture Evolution

That's absurd. A texture lookup only requires the coordinates and the sampler index. This sampler index can be extended to samplers from multiple shaders. In other words, if a texture unit can sample different textures within the same shader, it can also sample textures from different shaders. Right?
In theory it could, but I'm pretty sure it doesn't in practice: it still assumes that your shader units are able to context switch very quickly between shaders of different contexts while texture operations of a different context are still in flight. Right now, when you're running CUDA, *all* shader units across the chip have to run and complete(!) the same shader program before the next one can launch. That's two steps away from what you want.

Now it's possible (likely?) that the architecture allows starting up a new shader program as soon as the last one has been launched, but that's still a far cry from switching between contexts all the time. Let's put it this way: if the architecture allowed doing that efficiently, it'd be an incredible amount of wasted area for a feature that nobody uses.

Games have dozens of shaders with many different characteristics. Some perform a Gaussian blur and are completely TEX limited, while for instance vertex shaders are typically purely arithmetic, and particle shaders are ROP limited.
Of course they do, but there'll always be a few shaders that will eat up the majority of the time.

Combined, they utilize the hardware far better than any one of them could on its own.
Even if that's true (which I doubt), you're still ignoring the little detail of switching contexts. It's the key to this whole discussion. And we're not talking about 2 contexts, but 10.

I'm not saying GPUs currently don't support switching contexts: somehow they have to run multiple 3D programs in Windows, but the idea that they have been engineered to do this extremely quickly is doubtful. It's easy to test, BTW: just run 2 demanding 3D programs at the same time and check whether your performance drops more or less than you'd expect.

Thread contexts can be switched cheaply.
How?
 
The fact that ROPs or TEX units may be underutilized during a long math-based shader doesn't mean that you can assign those resources to some other context. After all, they're slave units to the shader.
ROPs are doing work given to them by the shader program, and often they're required to complete that work in the order it's given. For the sake of argument you can think of them as running a kernel, in the sense that vertex shading and pixel shading are independent kernels, i.e. forming a pipeline.
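
A trivial way to picture that framing (purely illustrative, no real hardware structure is implied): one "kernel" produces work that the next consumes in submission order.

```cpp
// Toy sketch of "independent kernels forming a pipeline": a vertex stage
// produces work that a pixel/ROP stage consumes in the order it was submitted.
#include <cstdio>
#include <queue>

struct Primitive { int id; };

int main() {
    std::queue<Primitive> toPixelStage;   // in-order hand-off between stages

    // "Vertex kernel": transforms primitives and hands them on.
    for (int i = 0; i < 4; ++i)
        toPixelStage.push({i});

    // "Pixel kernel" / ROP side: must consume the work in submission order.
    while (!toPixelStage.empty()) {
        std::printf("rasterize primitive %d\n", toPixelStage.front().id);
        toPixelStage.pop();
    }
}
```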

Texturing work is as freely schedulable as ALU. Both types of units are switching context quite freely. Dependencies, load-balancing and scheduling algorithms might put a variety of constraints on scheduling, but they're essentially independent of each other in terms of actual execution.

Jawed
 
Only if the context is small... consider this: the MP in G8X/G9X has 16KB worth of registers and 16KB shared memory. How do you switch that context cheaply?
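
As a back-of-the-envelope figure (the per-MP sizes are the ones quoted above; the multiprocessor count of 16 is an assumption based on a full G80), that's roughly how much live on-chip state a full context switch would have to move:

```cpp
// Rough context-size estimate using the figures quoted above.
// The multiprocessor count (16, as on a full G80) is an assumption.
#include <cstdio>

int main() {
    const int kRegisterFileKB  = 16;   // per MP, as quoted above
    const int kSharedMemKB     = 16;   // per MP, as quoted above
    const int kMultiprocessors = 16;   // assumed: full G80 configuration

    const int perMpKB    = kRegisterFileKB + kSharedMemKB;   // 32 KB
    const int chipWideKB = perMpKB * kMultiprocessors;       // 512 KB

    std::printf("%d KB of live state per MP, ~%d KB chip-wide per context\n",
                perMpKB, chipWideKB);
}
```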
Can't that register file already contain work from multiple shaders? I assumed that was the case, so switching context would be pretty trivial.

Think of a pixel shader and a vertex shader, or two pixel shaders working on different tiles. I can't think of any reason why the shader processor shouldn't be able to switch between them when blocked by TEX or ROP or anything else.
 
Can't that register file already contain work from multiple shaders? I assumed that was the case, so switching context would be pretty trivial.

Think of a pixel shader and a vertex shader, or two pixel shaders working on different tiles. I can't think of any reason why the shader processor shouldn't be able to switch between them when blocked by TEX or ROP or anything else.

They can switch if they are running the same shader. In NVIDIA's case (I believe it's the same in ATI's case), the big register file is partitioned among threads of the same program, so switching threads in this case is trivial.
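
A minimal sketch of that partitioning (the register-file size is the figure quoted above; 10 registers per thread is an arbitrary example): each thread gets a fixed slice, so switching threads is just indexing a different slice, not copying anything.

```cpp
// Sketch of a register file statically partitioned among threads of one
// program. 16 KB of 32-bit registers as quoted above; 10 registers per
// thread is an arbitrary example shader footprint.
#include <cstdio>

int main() {
    const int kRegisters32b    = (16 * 1024) / 4;   // 4096 registers per MP
    const int kRegsPerThread   = 10;                // example shader footprint
    const int kThreadsResident = kRegisters32b / kRegsPerThread;

    // Each thread's registers live at a fixed offset, so switching threads
    // means picking a different base offset, not saving or restoring state.
    int threadId   = 123;
    int baseOffset = threadId * kRegsPerThread;

    std::printf("%d threads fit; thread %d's registers start at r%d\n",
                kThreadsResident, threadId, baseOffset);
}
```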
 
Can't that register file already contain work from multiple shaders? I assumed that was the case, so switching context would be pretty trivial.

Think of a pixel shader and a vertex shader, or two pixel shaders working on different tiles. I can't think of any reason why the shader processor shouldn't be able to switch between them when blocked by TEX or ROP or anything else.
Suppose all the GPRs are currently in use by threads waiting for texture reads; where would you find GPRs to run another shader?
 
ROPs are doing work given to them by the shader program and ...
Thanks for the tutorial, but the issue is not whether it's possible to design ROPs and TEX units for multiple contexts (it's probably trivial, especially for TEX, which should be a relatively straightforward math pipeline: you just tag each operation in flight with a context number). The question is whether you can gain efficiency at the system level by doing so.
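
For what it's worth, the tagging itself really is as simple as it sounds, something like the sketch below (field names and widths are invented for illustration): each in-flight request carries a context ID so the result can be routed back to the right shader state.

```cpp
// Hypothetical shape of an in-flight texture request tagged with a context
// number, as described above. Field names and widths are invented.
#include <cstdint>

struct TexRequest {
    float         u, v;          // texture coordinates
    std::uint8_t  samplerIndex;  // which sampler/texture to read
    std::uint8_t  contextId;     // which context issued the request
    std::uint16_t threadId;      // where to route the filtered result back to
};

int main() {
    // The TEX pipeline itself doesn't care which context a request belongs
    // to; the tag only matters when the result is written back.
    TexRequest r{0.5f, 0.25f, 2, 7, 42};
    return r.contextId;
}
```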

Can't that register file already contain work from multiple shaders? I assumed that was the case, so switching context would be pretty trivial.

Let's assume that it is possible to do fast context switching.

That doesn't mean that doing so would make things more efficient on the whole: no matter what, fixed-size resources are now going to have to be shared among multiple contexts. In SMT, a first-order way to solve this is to double all registers and mux between them. In a GPU, there are many places where you can't, because the storage amounts are much larger.

Some things you'll need to share:
- Texture cache
- Shader program cache or program RAM.
- Shader data cache or data RAM
- L2 caches for the ROPs(?)
- Shader register files
For the last one, you could argue that there's no problem since your latency-hiding ability stays the same, but for the first 4 you will inevitably increase the amount of thrashing. (After all, you said that the chances of multiple contexts running the same shader are low.) E.g. texture caches are really small. You're going to share those with 10 independent contexts? Good luck with that.
- Color compression and Z compression
As I understand it, this is only a bandwidth reduction technique. In fact, it increases the amount of storage by needing on-chip RAMs to store additional info. You're going to share those between 10 processes? Where a single application might nicely fit all its render targets within that RAM, with 10 processes many targets will inevitably spill beyond it, so you lose compression. Ouch, that hurts.
- Z optimizations?
ATI has things like fast Z clears and HiZ. I don't know exactly how they work, but according to their documentation, Z clears are free and HiZ makes it possible to kill pixels instead of rendering them. This suggests once again that there's some kind of limited resource to keep track of them. Let's hope you don't run out of that with your 10 processes.
- Plain old memory buffer
Even on a 2GB GPU, 10 contexts only give you 200MB per context. It is very unlikely that drivers are optimized to share textures between different contexts that run the same game. After all, it's not exactly a common configuration, so why optimize for it? And Vista probably doesn't even allow it anyway. So you're going to blow out of memory and start putting things in main memory.
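
To put rough numbers on the sharing problem (the VRAM figure is from above; the cache sizes are order-of-magnitude guesses on my part, not vendor specs):

```cpp
// Rough per-context share of fixed resources with 10 contexts resident.
// The VRAM figure is from the text above; the cache sizes are
// order-of-magnitude guesses, not published specifications.
#include <cstdio>

int main() {
    const int kContexts   = 10;
    const int kVramMB     = 2048;  // "even on a 2GB GPU"
    const int kTexCacheKB = 8;     // assumed L1 texture cache per cluster
    const int kRopL2KB    = 128;   // assumed ROP-side L2 slice

    std::printf("VRAM per context:          %d MB\n", kVramMB / kContexts);
    std::printf("Texture cache per context: %.1f KB\n", (double)kTexCacheKB / kContexts);
    std::printf("ROP L2 per context:        %.1f KB\n", (double)kRopL2KB / kContexts);
}
```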

There is little evidence that current external memory interfaces are vastly over-dimensioned. Yet each of my points above will inevitably increase bandwidth, often dramatically. Your multi-context machine will be one big pile of memory-starved agents.
 
External memory, if your architecture can spill and fill regs.
Sure, this would gain you working space, but would it increase performance? Consider the memory latencies. The shader is designed to absorb latency from, say, texture reads by having many threads in flight. What would you use to cover the latency caused by spilling?
 
Sure, this would gain you working space, but would it increase performance? Consider the memory latencies. The shader is designed to absorb latency from, say, texture reads by having many threads in flight. What would you use to cover the latency caused by spilling?
Spilling is easy to hide as long as you can buffer writes to memory (and GPUs do it all the time). Register re-fill requires pre-fetching.
 
Spilling is easy to hide as long as you can buffer writes to memory (and GPUs do it all the time). Register re-fill requires pre-fetching.
Yes, GPUs can hide latency from some memory writes, but the amount of data GPUs write to memory as color data, for example, is generally far lower than you would get from spilling. Regarding pre-fetching, where would you store the pre-fetched data? You'd have to wait until a thread, perhaps more than one, was retired (or itself spilled) in order to make room for the incoming data.
 
Think about it again: you don't need to spill the entire set just because you need one more live register than your on-chip memory can afford without decreasing the thread count (which we know is counter-productive below a certain architecture-dependent threshold).
Also, you don't spill and re-fetch to wait for texture reads; you do it to expand your live register set. All you need is write and pre-fetch buffers and a smarter scheduler. I am not saying it's free, but it's definitely doable and it's certainly not rocket science.
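
A toy sketch of that idea (all structures and buffer sizes here are invented, not how any real scheduler works): only the registers beyond the on-chip budget are spilled, stores go through a write buffer, and the refill is requested before the thread is due to run again.

```cpp
// Toy model of partial register spill/fill with buffered writes and
// pre-fetch, as described above. Everything here (structures, sizes,
// scheduling) is invented for illustration.
#include <cstdint>
#include <cstdio>
#include <deque>

struct Spill { int threadId; int reg; std::uint32_t value; };

std::deque<Spill> writeBuffer;   // stores drain to memory in the background
std::deque<int>   prefetchQueue; // refills requested ahead of rescheduling

void spillExtra(int threadId, int reg, std::uint32_t value) {
    // Only the registers that don't fit on chip are spilled; the thread
    // keeps running on its on-chip set in the meantime.
    writeBuffer.push_back({threadId, reg, value});
}

void willRunSoon(int threadId) {
    // A smarter scheduler requests the spilled registers early enough that
    // they are back on chip before the thread is actually issued again.
    prefetchQueue.push_back(threadId);
}

int main() {
    spillExtra(7, /*reg=*/31, 0xdeadbeef);
    willRunSoon(7);
    std::printf("%zu buffered spill(s), %zu pending refill(s)\n",
                writeBuffer.size(), prefetchQueue.size());
}
```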
 
- Color compression and Z compression
As I understand it, this is only a bandwidth reduction technique. In fact, it increases the amount of storage by needing on-chip RAMs to store additional info. You're going to share those between 10 processes? Where a single application might nicely fit all its render targets within that RAM, with 10 processes many targets will inevitably spill beyond it, so you lose compression. Ouch, that hurts.
- Z optimizations?
ATI has things like fast Z clears and HiZ. I don't know exactly how they work, but according to their documentation, Z clears are free and HiZ makes it possible to kill pixels instead of rendering them. This suggests once again that there's some kind of limited resource to keep track of them. Let's hope you don't run out of that with your 10 processes.

Actually, from R600 and up it's no longer a fixed on-chip resource but backed by memory. So you'll get z optimizations on all depth buffers. As of G80 Nvidia still used a fixed on-chip memory though, not sure for later chips.

- Plain old memory buffer
Even on a 2GB GPU, 10 contexts only give you 200MB per context. It is very unlikely that drivers are optimized to share textures between different contexts that run the same game. After all, it's not exactly a common configuration, so why optimize for it? And Vista probably doesn't even allow it anyway. So you're going to blow out of memory and start putting things in main memory.

Actually, you can share resources between applications. But they have to be coded for it.
 
Actually, from R600 and up it's no longer a fixed on-chip resource but backed by memory. So you'll get z optimizations on all depth buffers. As of G80 Nvidia still used a fixed on-chip memory though, not sure for later chips.
Interesting. How many bits do you need per pixel? 1 bit per 4 pixels for Z compression? That would work out to be 60KB for a 1600x1200 render target. Not the end of the world in terms of additional bandwidth.
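
Checking that arithmetic (the 1 bit per 4 pixels is the assumption above, not a known hardware figure):

```cpp
// Back-of-the-envelope check of the ~60KB figure above, assuming 1 bit of
// compression state per 2x2 pixel block (i.e. 1 bit per 4 pixels).
#include <cstdio>

int main() {
    const int width = 1600, height = 1200;
    const int pixels = width * height;   // 1,920,000 pixels
    const int bits   = pixels / 4;       // 1 bit per 4 pixels
    std::printf("~%d KB of compression state\n", bits / 8 / 1024);  // ~58 KB
}
```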
 
Interesting. How many bits do you need per pixel? 1 bit per 4 pixels for Z compression? That would work out to be 60KB for a 1600x1200 render target. Not the end of the world in terms of additional bandwidth.
You need more bits per quad (or super-quads) to store conservative min or max Z if you want proper hi-z early rejection too.
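
For a rough feel of what that adds (the 8x8 tile size and 16-bit precision below are assumptions for illustration, not vendor figures): storing one conservative max Z per tile costs roughly as much again as the compression flags.

```cpp
// Rough HiZ storage estimate: one conservative max-Z value per tile.
// The 8x8 tile size and 16-bit precision are assumed for illustration.
#include <cstdio>

int main() {
    const int width = 1600, height = 1200;
    const int tile  = 8;                                   // assumed 8x8 tiles
    const int tiles = (width / tile) * (height / tile);    // 30,000 tiles
    const int bytesPerTile = 2;                            // assumed 16-bit max Z
    std::printf("~%d KB of HiZ data\n", tiles * bytesPerTile / 1024);  // ~58 KB
}
```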
 
You need more bits per quad (or super-quads) to store conservative min or max Z if you want proper hi-z early rejection too.

Yes, I was thinking regular compression here, not HiZ. Any suggestions where I can find basic info about HiZ? There's surprisingly little to find with basic google searches.
 
Yes, I was thinking regular compression here, not HiZ. Any suggestions where I can find basic info about HiZ? There's surprisingly little to find with basic google searches.
I am afraid I don't know any useful resource.
 
I believe Ned Greene was the first to publish the hierarchical Z method, so you could look for his name if I didn't misspell it. There might have been an R200 document from ATI as well.
 
- Plain old memory buffer
Even on a 2GB GPU, 10 contexts only give you 200MB per context. It is very unlikely that drivers are optimized to share textures between different contexts that run the same game. After all, it's not exactly a common configuration, so why optimize for it? And Vista probably doesn't even allow it anyway. So you're going to blow out of memory and start putting things in main memory.

My question exactly. Hell, how will the driver even know that the same app is running twice so it can share textures and geometry data? Can anybody explain to me why drivers don't run out of video RAM when faced with 10 instances of Crysis? And what about getting CPU limited? Is one CPU enough to drive 10 instances of Crysis simultaneously?

I am pretty confused at the moment regarding this OTOY business. :???:
 
Can anybody explain to me why drivers don't run out of video RAM when faced with 10 instances of Crysis?
Like Humus already said, you can share resources with the existing Direct3D API, but the application has to explicitly make use of it (i.e. check whether the resource was already created by another process and use that handle). I assume they modified CryEngine 2 to get that working.
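
Roughly how that looks with D3D9Ex shared handles on Vista (a sketch; error handling and the IPC that passes the handle between processes are omitted, and whether CryEngine 2 uses exactly this mechanism is my assumption):

```cpp
// Sketch of cross-process resource sharing with D3D9Ex shared handles.
// Assumes Vista+ and D3DPOOL_DEFAULT resources; error handling omitted.
#include <d3d9.h>

// Process A: create a shareable texture and obtain the share handle.
HANDLE CreateSharedTexture(IDirect3DDevice9Ex* dev, IDirect3DTexture9** tex) {
    HANDLE shared = nullptr;  // NULL on input -> create a new shared resource
    dev->CreateTexture(1024, 1024, 1, D3DUSAGE_RENDERTARGET, D3DFMT_A8R8G8B8,
                       D3DPOOL_DEFAULT, tex, &shared);
    return shared;            // hand this value to the other process somehow
}

// Process B: open the same texture by passing the existing handle back in.
// The creation parameters must match those used by process A.
void OpenSharedTexture(IDirect3DDevice9Ex* dev, HANDLE shared,
                       IDirect3DTexture9** tex) {
    dev->CreateTexture(1024, 1024, 1, D3DUSAGE_RENDERTARGET, D3DFMT_A8R8G8B8,
                       D3DPOOL_DEFAULT, tex, &shared);  // non-NULL -> open existing
}
```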
And what about getting CPU limited? Is one CPU enough to drive 10 instances of Crysis simultaneously?
The game doesn't stress the CPU much at all. Also, again, the times the CPU is used more intensely are pretty 'bursty', so you can run many instances before they really start hampering each other. Last but not least, the engine is largely single-threaded, so something like a Core i7 can easily run many instances.
 