Threaded rendering

Humus

With multi-GPU solutions in the form of Crossfire/SLI there are all kinds of scaling problems because of shared resources, cross-frame dependencies etc., plus you get added latency. Or if you choose SFR you waste vertex computation power. Also, with current solutions resources need to be duplicated between the two cards/chips.

I was just thinking, at least for dual-chip cards, what if we would instead let the chips work in a "dual-core" fashion, just like on the CPU, with both chips accessing the same memory. For that to be performant I suppose we would still need twice the bandwidth, although we wouldn't need twice the amount of physical memory, which would be a significant cost saving.

The advantage of this would be that you could schedule independent rendering tasks to run in parallel on the different chips, like you would with different game code tasks on a dual/quad core CPU. On consoles you can already do this for the CPU part of the rendering by using command buffers, although in the end the GPU consumes the commands in sequence. With true threaded GPU-side rendering you wouldn't really need to explicitly build command buffers, at least as long as the number of CPU and GPU rendering threads is the same.

I'm thinking this might scale better than Crossfire/SLI. There are a lot of rendering tasks that are independent of other tasks. For instance you could render each shadow map in a separate thread, reflection/refraction maps, and even post-effects if you let them trail by a frame. Or GPU physics for that matter, which would make more sense if we can break loose from the sequential rendering paradigm. Also, we wouldn't have the additional frame of latency of AFR, or several frames if you add more chips.

For this to work we would need some kind of GPU-side synchronization mechanism so that one GPU thread could wait for results from another, like the GPU equivalent of WaitForMultipleObjects(), except that the wait would only stall the GPU thread; the CPU thread would not have to wait.
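
A minimal sketch of what that could look like from the application's side (every type and function name below is hypothetical, invented purely to illustrate the idea; nothing like this exists in today's APIs):

    // Hypothetical GPU-threading API: one GPU thread renders a shadow map,
    // another waits for it on the GPU without ever blocking the CPU thread.
    struct GpuFence { };                     // signaled by one GPU thread, waited on by another

    struct GpuContext {
        void drawShadowMap()       { /* a batch of draw calls */ }
        void drawMainPass()        { /* consumes the shadow map */ }
        void signal(GpuFence&)     { /* GPU signals when its queued work completes */ }
        void wait(const GpuFence&) { /* GPU-side wait; returns immediately on the CPU */ }
    };

    void renderFrame(GpuContext& gpuThread0, GpuContext& gpuThread1)
    {
        GpuFence shadowsDone;

        // GPU thread 1 renders the shadow map independently.
        gpuThread1.drawShadowMap();
        gpuThread1.signal(shadowsDone);

        // GPU thread 0 waits on the GPU, not on the CPU, before using it.
        gpuThread0.wait(shadowsDone);
        gpuThread0.drawMainPass();
    }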

Thoughts?
 
LOL, someone's been reading about R700 ;)
 
I've been wondering about multiple concurrent contexts for quite a while now. Regardless of the number of GPUs in a system (i.e. 1 or more) the ability to create multiple contexts and have the GPU(s) load-balance and partition resources amongst those contexts seems to be the future of D3D. That's my interpretation, anyway.

Once this framework is in place the developer is responsible for devolving work amongst contexts. Obviously there's the proviso that with a single GPU it's risky to have multiple contexts on the go - yet at the same time it's possible that this will lead to better utilisation if "asymmetric" contexts are running concurrently. e.g. something that's heavy on vertex-rate and fillrate partnered with something that's heavy on texturing rate.
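
As a rough sketch of what pairing such asymmetric contexts might look like from the application side (all names here are invented for illustration, not a real API):

    // Hypothetical multi-context submission: a vertex/fillrate-heavy job and a
    // texturing-heavy job are put on separate contexts so the GPU(s) can
    // load-balance and partition resources between them.
    struct GpuContext {
        void drawShadowCascades() { /* vertex- and fillrate-bound, little texturing */ }
        void drawPostProcessing() { /* texture-fetch bound, very few vertices */ }
        void submit()             { /* queue the recorded work */ }
    };

    void submitFrame(GpuContext& ctxA, GpuContext& ctxB)
    {
        ctxA.drawShadowCascades();
        ctxB.drawPostProcessing();

        // Both submissions are in flight at once; the hardware decides how to
        // share its units between the two contexts.
        ctxA.submit();
        ctxB.submit();
    }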

Separately if you create an application with serialised kernels forming a virtual pipeline (not necessarily graphics) where the computational workloads of the two kernels are imbalanced, then there's a question of buffering the output stream of kernel 1 for the use of kernel 2. This seems to be a question of virtualised resources which traditionally would be allocated on-die (e.g. as post transform vertex cache in the traditional graphics pipeline) but in this case would be cached resources allocated in video memory.
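
As an illustration of such a virtualised serialisation buffer (a CPU-side sketch only; the type and its sizes are made up), kernel 1 could append into a fixed-size ring buffer allocated in video memory and kernel 2 could consume from it, the ring playing the role an on-die FIFO like the post transform vertex cache plays today:

    // Hypothetical virtualised FIFO between two imbalanced kernels: a bounded
    // ring buffer backed by video memory rather than on-die storage.
    #include <cstddef>
    #include <vector>

    struct VirtualFifo {
        std::vector<float> storage;   // stands in for a buffer in video memory
        std::size_t head = 0;         // next slot kernel 1 writes
        std::size_t tail = 0;         // next slot kernel 2 reads

        explicit VirtualFifo(std::size_t capacity) : storage(capacity) {}

        bool push(float v) {                                  // kernel 1 output
            if (head - tail == storage.size()) return false;  // full: producer stalls
            storage[head++ % storage.size()] = v;
            return true;
        }
        bool pop(float& v) {                                  // kernel 2 input
            if (tail == head) return false;                   // empty: consumer stalls
            v = storage[tail++ % storage.size()];
            return true;
        }
    };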

If there are multiple GPUs working on a network of kernels then it seems that each virtualised buffer is available to all GPUs.

Of course the tricky bit is spreading this work around. And as far as the D3D pipeline is concerned, the tricky bit is allowing memory clients to access buffers regardless of their location.

e.g. if post transform vertex cache (something that is normally only on-die) was shared by both GPUs in a system, you'd have something that appeared to be a single GPU. Serialisation buffers like PTVC will tend to be pretty severe bottlenecks in multi-GPU though.

Anyway, the way I see it is that the complete solution consists of multiple concurrent contexts + shared access to resources. The developer can then choose how to scale over multiple GPUs using a mixture of independent tasks + allowing multiple GPUs to load-balance multiple kernels configured in a virtual pipeline.

Jawed
 
LOL, someone's been reading about R700 ;)

Well, I'm checking in on the R7xx speculation thread occasionally, but it's not so much the R700 rumors as my exposure to console development that's got me thinking along these lines.

Separately if you create an application with serialised kernels forming a virtual pipeline (not necessarily graphics) where the computational workloads of the two kernels are imbalanced, then there's a question of buffering the output stream of kernel 1 for the use of kernel 2.

I guess it depends on what level of parallelism you want. I was thinking more in terms of large-scale synchronization primitives issued by the application in between draw calls. I'd be willing to accept that threads running in parallel trying to access the same resources would run into race conditions just like on the CPU. But I suppose it could be desirable to have more fine-grained synchronization where one GPU could for instance wait for a buffer to be filled to a certain point before processing some primitives. For something like StreamOut I think it could make sense. A draw call that's going to draw with the resulting vertices could draw only as vertices arrive in the stream buffer. For things like render targets I'm not sure if that level of synchronization makes sense other than in specific cases.
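
For the StreamOut case, a fine-grained version of that might look something like the sketch below (hypothetical names only; the idea is that the consuming draw is split into chunks, each gated on how far the producing GPU has filled the stream-out buffer):

    // Hypothetical chunked draw that consumes stream-out vertices as they arrive
    // instead of waiting for the whole buffer to be written.
    #include <cstdint>

    struct StreamOutBuffer { std::uint32_t written = 0; };

    struct GpuContext {
        void waitForWatermark(const StreamOutBuffer&, std::uint32_t /*minVertices*/) { /* GPU-side wait */ }
        void draw(std::uint32_t /*firstVertex*/, std::uint32_t /*vertexCount*/)      { /* consume vertices */ }
    };

    void drawAsVerticesArrive(GpuContext& consumer, const StreamOutBuffer& so,
                              std::uint32_t total, std::uint32_t chunk)
    {
        for (std::uint32_t first = 0; first < total; first += chunk) {
            std::uint32_t count = (first + chunk < total) ? chunk : total - first;
            consumer.waitForWatermark(so, first + count); // fine-grained sync point
            consumer.draw(first, count);
        }
    }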

e.g. if post transform vertex cache (something that is normally only on-die) was shared by both GPUs in a system, you'd have something that appeared to be a single GPU.

I don't think sharing the post transform vertex cache makes much sense. Generally I would think both chips would have their own caches for most stuff. I don't think the performance gain of sharing caches on the GPU would be anywhere close to the gain you get on CPUs from sharing caches across cores, both because caches are generally much smaller on GPUs anyway and because data tends to live for a much shorter time, so the chance of another thread gaining anything from having data already in the cache is quite small.

The way I see it the only concern would be correctness. So output caches might make sense to share, like the render backend cache. That would be necessary if you want to be able to render to the same surface from two different threads at the same time, like explicit SFR. Either that, or you have to give the application some knowledge about the surface tiling so that one thread cannot pollute areas that the other thread is rendering to, but I don't think that's likely to ever be considered for DirectX. So the choices are either to accept that there is a race condition if you render to the same resource, or to share the render backend cache. For the post transform vertex cache this is not an issue though, because that's only used locally as an optimization and never written to memory.
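
For the "explicit SFR" case, a minimal sketch of how the application could keep two GPU threads off each other's pixels without any shared render backend cache (again, purely invented names):

    // Hypothetical explicit SFR: the application splits one render target with
    // scissor rectangles so the two GPU threads never write the same pixels,
    // avoiding races by construction rather than by cache sharing.
    struct Rect { int x, y, width, height; };

    struct GpuContext {
        void setScissor(const Rect&) { }
        void drawScene()             { }
    };

    void renderSplitFrame(GpuContext& gpu0, GpuContext& gpu1, int width, int height)
    {
        gpu0.setScissor({0, 0,          width, height / 2});
        gpu0.drawScene();

        gpu1.setScissor({0, height / 2, width, height - height / 2});
        gpu1.drawScene();
    }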
 
My question is, if a product is to come out, say an MCM GPU, would the IHV have the ability to split the workload themselves?
 
As long as there is an order of magnitude difference between PCI-E bandwidth and bandwidth to local memory, it won't make sense to "share" the local memory on each card. With that large a difference, you really want to do as much as possible locally, and copy to the other GPU when necessary.

So something like sharing the post-transform cache doesn't make sense -- it would be far cheaper for the second GPU to just recompute the vertex rather than read the result from the other GPU. And that's ignoring the other one or two orders of magnitude difference between off-chip bandwidth and bandwidth of on-chip FIFOs and caches.

Things like computing shadowmaps on separate GPUs in parallel could work though, but I think you'd want to have the application explicitly say when to copy data between GPUs. So not just inter-GPU synchronization primitives, you'd want inter-GPU copy primitives.
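
A rough sketch of what those two primitives might look like together from the application's point of view (all names invented; the key point is the explicit copy between the producing and consuming GPU):

    // Hypothetical inter-GPU copy + sync primitives: the shadow map is rendered
    // in GPU 1's local memory, explicitly copied into GPU 0's local memory, and
    // only then read, so all of GPU 0's texture fetches stay local.
    struct GpuFence   { };
    struct GpuTexture { };

    struct GpuContext {
        void renderShadowMap(GpuTexture&)               { }
        void copyToPeer(const GpuTexture&, GpuTexture&) { } // explicit cross-GPU transfer
        void signal(GpuFence&)                          { }
        void wait(const GpuFence&)                      { }
        void renderMainPass(const GpuTexture&)          { }
    };

    void renderWithExplicitCopy(GpuContext& gpu0, GpuContext& gpu1,
                                GpuTexture& shadowOnGpu1, GpuTexture& shadowOnGpu0)
    {
        GpuFence copyDone;

        gpu1.renderShadowMap(shadowOnGpu1);
        gpu1.copyToPeer(shadowOnGpu1, shadowOnGpu0); // application decides when data moves
        gpu1.signal(copyDone);

        gpu0.wait(copyDone);                         // GPU-side wait; the CPU never blocks
        gpu0.renderMainPass(shadowOnGpu0);
    }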

Now if you had two GPUs on the same board there might be some interesting things you could do with true memory sharing. But this would need a much higher bandwidth connection than the current PCI-E bridge chip solutions that both 3870X2 and 9800GX2 use.
 
To be honest I cannot see how this will beat AFR performance when your game follows the rules.

Additionally, with the current number of SLI/Crossfire solutions out there I don't expect that many developers are willing to do this kind of extra work for them. It's the same as with CPU multi-cores. Aside from the "interesting technical challenge" aspect, no one really wanted to write multi-core-optimized games. The market situation forced developers into it.

Seeing multiple cores as one is something developers want for CPUs. The GPUs are already there (with some limitations). Therefore making the multiple cores visible again would be a step back.
 
We have been splitting the workload between multiple GPUs for some time now. For some applications, i.e. large data visualization, this is worth the effort.

Unfortunately there is so far little interest from the hardware vendors in decreasing the overhead, but that will hopefully change in the future.

You can find more information here: http://www.equalizergraphics.com/
 
I guess it depends on what level of parallelism you want. I was thinking more in terms of large-scale synchronization primitives issued by the application in between draw calls.
I was just trying to unify both types of parallelism, e.g. with large-scale parallelism you'd want it to scale across multiple GPUs without the developer having to code specifically for a differing number of GPUs.

I'd be willing to accept that threads running in parallel trying to access the same resources would run into race conditions just like on the CPU. But I suppose it could be desirable to have more fine-grained synchronization where one GPU could for instance wait for a buffer to be filled to a certain point before processing some primitives. For something like StreamOut I think it could make sense. A draw call that's going to draw with the resulting vertices could draw only as vertices arrive in the stream buffer. For things like render targets I'm not sure if that level of synchronization makes sense other than in specific cases.
GPUs as they currently stand are multiple processors (multiple independent hardware threads) that are sharing resources. So I'm thinking in terms of expanding the bounds of existing sharing mechanisms.

I don't think sharing the post transform vertex cache makes much sense. Generally I would think both chips would have their own caches for most stuff. I don't think the performance gain of sharing caches on the GPU would be anywhere close to the gain you get on CPUs from sharing caches across cores, both because caches are generally much smaller on GPUs anyway and because data tends to live for a much shorter time, so the chance of another thread gaining anything from having data already in the cache is quite small.
The curious thing about PTVC is that vertices arrive from multiple processors, e.g. in RV670 all four SIMDs can be running vertex shaders, sending their results to PTVC. So there's already a sense in which "sharing" is taking place. With a batch size of 64 vertices in RV670 (I presume) I expect PTVC can hold more than one batch of vertices. But we don't know how much "interleaving" of batches of vertices there is.

Apparently if a vertex is evicted from PTVC too early it is re-submitted for vertex shading again. That implies that vertex shading isn't considered very important as a workload - or that re-calculation is only a rare event :???:

The way I see it the only concern would be correctness. So output caches might make sense to share, like the render backend cache. That would be necessary if you want to be able to render to the same surface from two different threads at the same time, like explicit SFR. Either that, or you have to give the application some knowledge about the surface tiling so that one thread cannot pollute areas that the other thread is rendering to, but I don't think that's likely to ever be considered for DirectX. So the choices are either to accept that there is a race condition if you render to the same resource, or to share the render backend cache. For the post transform vertex cache this is not an issue though, because that's only used locally as an optimization and never written to memory.
Yeah I would expect some kind of tiling to enable multiple processors (e.g. there are 8 processors in a pair of RV670s) to be able to write to a common resource, just like SFR or supertiling.

The old argument against supertiling is that both GPUs have to perform vertex shading for the frame, because they can't share a common PTVC. Additionally, from PTVC through setup and on to rasterisation is serial. Triangles have to be rasterised in strictly serial order, it seems. So there's a bottleneck there on fragment-batch creation. One GPU could be responsible for all rasterisation but it would then have to distribute batches of fragment shading to all processors.

Overall I don't see why two GPUs can't be made to cooperate on shared resources like this as each GPU die consists internally of a number of processors. Obviously there are some hairy bandwidths involved in certain on-die operations, but then we see something like streamout which is able to produce 4KB of data per input vertex and write it off die, supporting batches of 10s of vertices in parallel and able to write to as many as 4 independent streams in parallel. That's a lot of bandwidth and interleaved futzing...

Jawed
 
The curious thing about PTVC is that vertices arrive from multiple processors, e.g. in RV670 all four SIMDs can be running vertex shaders, sending their results to PTVC. So there's already a sense in which "sharing" is taking place.

At least one of the current DX10-generation architectures doesn't have a global PTVC the way you describe it, precisely because having many parallel processors feed into a single shared serial structure is too expensive. And that's with everything on the same die...
 
At least one of the current DX10-generation architectures doesn't have a global PTVC the way you describe it, precisely because having many parallel processors feed into a single shared serial structure is too expensive. And that's with everything on the same die...
:D So, which GPU?

Does this GPU have one setup engine sampling from all these PTVCs? Or does it run multiple setup engines?

Jawed
 
For this to work we would need some kind of GPU-side synchronization mechanism so that one GPU thread could wait for results from another, like the GPU equivalent of WaitForMultipleObjects(), except that the wait would only stall the GPU thread; the CPU thread would not have to wait.

Thoughts?

As a side note, as for SFR, the duplication of work (acceleration structures, etc.) and memory is somewhat the norm for massively parallel ray tracing. The network is just too expensive for any fine-grained sync points. I'm guessing that as the trend towards more parallelism continues, we will get closer and closer to this "supercomputing" model.

Also we know that some GPUs can load-balance between multiple non-dependent draw calls. So perhaps with some API changes it wouldn't be too hard to use current hardware in the way you are describing.
 
Another approach

What you suggest will work, but I believe the disadvantages are greater than the benefits.

I can see at least 3 major problems with your solution:

1. You will need a memory interface chipset just like Intel uses for the CPU.
2. A bus will create overhead and have limited bandwidth compared to a direct memory interface of the same width.
3. No scaling of bandwidth with the number of GPUs. This will be a major performance bottleneck.

But the threaded rendering approach is interesting and should be investigated for the current multi-chip architectures.

It is clear that a better solution for multi-GPU than the existing SLI/Crossfire should be developed. Both the waste of memory capacity and the need for per-game optimization profiles must be fixed. For a programmer it should not matter if a card has 1, 2 or N GPUs.

So I suggest this:

Add HyperTransport links to the existing GPU architecture. Connect the GPUs with the links and use a part of each GPU's memory as a third-level cache.

So for a dual-GPU solution with 1 HT link and 512 MB for each GPU you could have 128 MB as local cache. If you place all the buffers/textures you read from in the cache and divide the buffers you write to among the GPUs, it should be possible to balance the workload. From a total of 1 GB you will have 768 MB as usable memory and in addition have full bandwidth. For a 4-GPU card with 2 GB, 1.5 GB will be usable memory. Really big textures (16k x 16k) cannot fit in the cache and must be loaded over the HT link directly.
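
To make the arithmetic explicit (a small illustrative helper; this reads the proposal as "the cache slice is replicated on every GPU, so only the non-cache portion counts as unique capacity"):

    // Usable memory under the proposed HT-cache scheme: each GPU reserves a
    // cache slice whose contents are broadcast/replicated to all GPUs, so the
    // unique, usable capacity is what remains outside the cache slices.
    #include <cstdio>

    unsigned usableMB(unsigned gpuCount, unsigned localMB, unsigned cacheMB)
    {
        return gpuCount * (localMB - cacheMB);
    }

    int main()
    {
        std::printf("%u MB\n", usableMB(2, 512, 128)); // 768 MB of a 1 GB dual-GPU card
        std::printf("%u MB\n", usableMB(4, 512, 128)); // 1536 MB of a 2 GB quad-GPU card
        return 0;
    }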

If you have 2 HT links in each GPU you could make a ring bus and connect as many GPUs as you want. This will make sense since most of the cache will be the same for all GPUs and you can just broadcast data around the ring. With 3 HT links, each GPU on a 4-GPU card can be directly connected to all the other GPUs.

Even though HyperTransport was made by AMD, it is an open standard all GPU manufacturers could use. Intel is not a member, but Nvidia is. Take a look at http://www.hypertransport.org for more information. One HT 3.0 link with 32 lanes has a bandwidth of 41.6 GB/s, and that should be enough to fill the cache and copy buffers between the GPUs.
 
My question is, if a product is to come out, say an MCM GPU, would the IHV have the ability to split the workload themselves?

Well, they can always do AFR / SFR. Task-based workloads could also be made to work if the driver can figure out what parts are independent of others, for instance based on what render targets you're rendering into.
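
For illustration, such a driver heuristic could track which render targets each draw call writes and reads during the frame (a very rough sketch with invented types; real drivers are obviously far more involved):

    // Hypothetical dependency check: a draw call can be scheduled on the other
    // GPU without synchronization only if it reads nothing that was rendered
    // earlier in the current frame.
    #include <set>
    #include <vector>

    using TargetId = int;

    struct DrawCall {
        TargetId writesTo;                 // render target this draw writes
        std::vector<TargetId> readsFrom;   // render targets it samples as textures
    };

    bool isIndependent(const DrawCall& dc, const std::set<TargetId>& writtenThisFrame)
    {
        for (TargetId t : dc.readsFrom)
            if (writtenThisFrame.count(t))
                return false;              // depends on earlier output, keep it local
        return true;                       // safe to hand to the other GPU
    }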

To be honest I cannot see how this will beat AFR performance when your game follows the rules.

Well, first of all many games don't follow the rules. Inter-frame dependencies happen all the time. Secondly, AFR doesn't scale much beyond 2 chips before latency becomes a huge problem, not to mention the 3 frames in flight limit that D3D imposes.

Additionally, with the current number of SLI/Crossfire solutions out there I don't expect that many developers are willing to do this kind of extra work for them. It's the same as with CPU multi-cores. Aside from the "interesting technical challenge" aspect, no one really wanted to write multi-core-optimized games. The market situation forced developers into it.

Well, pretty much everyone supports multicore these days whether they wanted to or not. If it's the way the industry is heading, then surely developers would follow for GPUs as well.

GPUs as they currently stand are multiple processors (multiple independent hardware threads) that are sharing resources. So I'm thinking in terms of expanding the bounds of existing sharing mechanisms.

Well, that's true. Not that I'm a hardware guy, but I suspect it would be much harder to implement cross-chip synchronization on that level so that two chips would essentially work as if they were a single chip.
 
It would be harder, but not significantly more expensive in manufacturing. If you are going to spend the area on the kind of bandwidth you would need to share framebuffers you might as well spend the effort to make optimal use of it.
 
Would such a threading of several GPUs make it possible to do supporting calculations (GI, particles) on one GPU and the actual rendering on the other?

Am I correct in thinking that having one GPU dedicated to calculating GI, with the other freed up to do the "rest" of the rendering, would allow us to do much more realistic effects than is possible with the current scheme?
 
I was just thinking, at least for dual-chip cards, what if we would instead let the chips work in a "dual-core" fashion, just like on the CPU, with both chips accessing the same memory. For that to be performant I suppose we would still need twice the bandwidth, although we wouldn't need twice the amount of physical memory, which would be a significant cost saving.
It's not the memory BW that's the issue as much as the BW between the GPUs. Making a 50-100 GB/s connection between two GPUs or having a "northbridge" with 100GB/s connections to the RAM and each GPU isn't particularly easy, and the latter would waste gobs of silicon. You need a lot of pins running at a very high speed to get that kind of a connection.
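
To put rough numbers on that (purely illustrative figures, not a claim about any particular part): 100 GB/s is 800 Gbit/s, so even at an aggressive signalling rate per pin you quickly end up with well over a hundred data pins per direction.

    // Back-of-the-envelope pin count for an inter-GPU link. The per-pin rate is
    // an assumed figure (roughly PCI-E 2.0 class signalling), not a spec value.
    #include <cstdio>

    int main()
    {
        const double targetGBs  = 100.0; // desired bandwidth in one direction
        const double gbitPerPin = 5.0;   // assumed signalling rate per pin
        const double pins       = targetGBs * 8.0 / gbitPerPin;
        std::printf("~%.0f data pins per direction\n", pins); // ~160, before encoding overhead
        return 0;
    }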

I do think it will eventually happen, though. Maybe we'll see fibre-optic GPU interconnects in a few generations.
 
Shared L3 seems to be a popular new way of communicating in the CPU world, would that translate?
 
Shared L3 seems to be a popular new way of communicating in the CPU world, would that translate?
Well that's multiple cores on the same die, which in a way we already have on the GPU.

In this thread we're talking about increased efficiency with multiple dies compared to SLI/Crossfire.
 