Quad-GPU Nvidia SLI consumer product feasible?

Mendel

Just wanted some insight on whether this would be possible from a technical standpoint.
I was thinking of what would happen if they used two dual-GPU cards such as GIGABYTE's GV-3D1 or Asus' EN6600GT Dual.

Would it be possible to run each chip as a PCIe x4 device, summing to a total of 16 PCIe lanes so that upcoming motherboards could cope with it, or is there some limitation that makes anything less than x8 PCIe per chip unworkable?
 
The immediate limitation is at the software level, namely the drivers.
 
No, they won't - it would cost too much. Really, the dual-GPU solutions would be less than 1% of current sales due to cost.
 
If the margins were high enough, I'm sure they would. Look at people buying two 6800 Ultras or X850 XTs. They make up a very small share of the market, but they confer bragging rights and usually carry huge margins.
 
possible yes. there are Evans & Sutherland cards with quad ATI VPUs. but I doubt it is feasible for the consumer market. even the very high end of the consumer market.

Dual VPUs, as mentioned above, will have extremely limited penetration also.

But hey, I would buy a quad GPU/VPU card if it cost no more than $1000 and there was software to take advantage of it.
 
Megadrive1988 said:
possible yes. there are Evans & Sutherland cards with quad ATI VPUs. but I doubt it is feasible for the consumer market. even the very high end of the consumer market.

E&S use up to 64 in RenderBeast.

The difference here is that ATI has SuperTiling built into the chip, which automatically takes care of the screen-space division and/or FSAA samples over multiple chips, whereas NVIDIA's current solution doesn't have this. Presently the majority solution for NVIDIA is AFR, so using more chips/boards will introduce further latencies; Split Screen Rendering would work, though. They also can't get the same benefits from scaling up MSAA as ATI because they are limited to 4 sample positions.
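
As a rough illustration of the screen-space division schemes being discussed (a minimal Python sketch; the chip count, tile size and split ratio are arbitrary and not either vendor's actual logic):

NUM_CHIPS = 2
TILE = 32  # supertile edge in pixels, arbitrary

def afr_owner(frame_index):
    # Alternate Frame Rendering: whole frames alternate between chips,
    # adding buffering/latency per extra chip.
    return frame_index % NUM_CHIPS

def split_screen_owner(y, screen_height, split=0.5):
    # Split Screen Rendering: one horizontal split whose ratio the
    # driver has to tune for load balancing.
    return 0 if y < screen_height * split else 1

def supertile_owner(x, y):
    # SuperTiling: a fixed checkerboard of screen tiles, so the load
    # balances statistically with no per-frame driver logic.
    return ((x // TILE) + (y // TILE)) % NUM_CHIPS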

It looks like the ATI solution can also arbitrate the data from the master chip on a board to the slave(s), which is something that I'd like to see from the NVIDIA solution as well.
 
It's technically not pure multisampling if two chips are rendering the same tile, since they will process vertices twice and fetch textures twice. If I did the same thing on a single chip (render an MSAA tile twice, the second time with altered sample positions, then combine) you'd call it hybrid SS/MS. It's an MSxSS method, combining, say, two or more 6xMSAA buffers. I don't see the large gain or any inherent difference in scalability.

The 4x sample positions are not an impediment to scalability either, since between chips you can jitter positions and achieve arbitrary sample distributions.
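
For example (a minimal sketch; the sample positions and per-chip offsets are made up, not any chip's real pattern):

# Both chips use the same fixed 4-sample pattern, but each renders with a
# different sub-pixel jitter; the union of the two buffers gives 8 distinct
# sample positions per pixel.
BASE_PATTERN = [(0.125, 0.375), (0.375, 0.875), (0.625, 0.125), (0.875, 0.625)]
CHIP_JITTER = [(0.0, 0.0), (0.0625, 0.1875)]  # arbitrary per-chip offsets

combined = sorted(set(((sx + jx) % 1.0, (sy + jy) % 1.0)
                      for jx, jy in CHIP_JITTER
                      for sx, sy in BASE_PATTERN))
print(len(combined))  # -> 8 distinct positions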

The market for flight sims and realtime mega video walls is not very big, unless you count military and aerospace customers. The more interesting case for clustered GPUs is offline rendering, where there is unending demand for faster renderfarms, and there NVidia's architecture is no less scalable than ATI's. NVidia's philosophy is to target offline rendering (e.g. Gelato), hence their acquisition of ExLuna, their push for SM3.0, and their investments in the DCC market.
 
It's technically not pure multisampling if two chips are rendering the same tile, since they will process vertices twice and fetch textures twice. If I did the same thing on a single chip (render an MSAA tile twice, the second time with altered sample positions, then combine) you'd call it hybrid SS/MS. It's an MSxSS method, combining, say, two or more 6xMSAA buffers. I don't see the large gain or any inherent difference in scalability.

Vertices are processed, up to the clipping stage at least, per chip on each system regardless of MSAA; the only time this isn't the case (on any solution that we're aware of) is AFR.

As far as the application is concerned it is behaving exactly the same as an MSAA solution, it's just able to go to more samples than can natively be achieved by a single chip - this would be of benefit in a consumer implementation. All of this is catered for at the hardware level, and there is transparency between what is happening on one chip with 4x MSAA and what is happening on four chips with 24x MSAA - the subsample distribution pattern logically works in the same way.

But this doesn’t, of course, preclude a mode whereby the texture subsample center is offset per chip rendering that tile to achieve the same benefits of mixed MS+SS AA.
 
Shaders are executed twice, once on each chip, and 2x the bandwidth is consumed as well. It is true that as far as the app is concerned it's MSAA, but my point is, it doesn't consume less bandwidth or shader ops than just jittering on each chip. The scalability is the same as rendering two MSAA buffers and combining them. The speedup is sublinear. The combination of two GPUs using ATI's method runs slower than a GPU with 2x the fillrate and bandwidth because, as I said, it is not quite correct to view the combined cluster as a true 24xMSAA buffer, since everything is being done twice, which would not happen on a single GPU that supported 24xMSAA. The semantics in terms of texture bandwidth and shader work are not the same.
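
A toy way to put numbers on that point (all figures below are placeholders, just to show where the duplication sits):

# Per-tile cost: two chips each rendering the same tile at 6x MSAA
# vs. a hypothetical single chip with native 12x MSAA.
verts, frags, tex_fetches = 10000, 500000, 1500000  # made-up workload

two_chips_6x_each = {
    "vertex_work": 2 * verts,           # geometry shaded once per chip
    "fragment_shading": 2 * frags,      # pixel shaders run on both chips
    "texture_fetches": 2 * tex_fetches,
}
single_chip_12x = {
    "vertex_work": verts,               # shaded once
    "fragment_shading": frags,          # MSAA still shades once per pixel
    "texture_fetches": tex_fetches,
}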
 
DemoCoder said:
It is true that as far as the app is concerned it's MSAA, but my point is, it doesn't consume less bandwidth or shader ops than just jittering on each chip.

Well, that was a given - that's why I said it's taken care of over multiple chips. I was talking from a subsample-positioning as well as an API and application perspective.

If you want to get into the specifics of the performance implications, then under ATI's and NVIDIA's current solutions there is an incremental performance advantage for ATI, since they have a native 6x sampling rate per chip.

The combination of two GPUs using ATI's method runs slower than a GPU with 2x the fillrate and bandwidth because, as I said, it is not quite correct to view the combined cluster as a true 24xMSAA buffer, since everything is being done twice, which would not happen on a single GPU that supported 24xMSAA.

Well, four times with ATI's current solution, such as the one in the likes of RenderBeast.
 
Hehe, sorry to interrupt, but let's think about a theoretical system that would have one quad-GPU 6200 or two dual-GPU GeForce 6200s (let's consider them special versions that would support SLI).

How much do you think the total cost of these two dual-chip cards, or the one quad-chip card, would be vs. two single-chip 6600 GTs or one dual-chip 6600 GT, and which setup would win in performance?

I'd imagine if Nvidia suddenly found themselves in a situation where they had a huge inventory of 6200 chips, a card like that just might border on being reasonable.

After all, the S in SLI comes from Scalable. If it only scales from 1 to 2, it would be more like NVSLI (NV coming from "Not Very" as well as NVidia) :)

Oh, and wouldn't the next step be dual-core on one chip, then multiple n-core chips on one card?
 
Mendel said:
I'd imagine if Nvidia suddenly found themselves in a situation where they had a huge inventory of 6200 chips, a card like that just might border on being reasonable.

At a silicon implementation level, multiple chips are always going to have an overhead compared to a single chip. I think this is a specific scenario you are talking about, and not really the norm that you would plan for. There is always the argument that an older, more stable process can be used to make two chips compete with a single chip on a newer process – the cost factor that has to be weighed up is whether the increase in die area for the two chips on the old process is offset by the increase in yield from the old process; this was 3dfx's and XGI's argument, and history hasn't looked too favourably on them. The critical balance may be there if/when processes get to the edge of their capabilities in the future.

After all, the S in SLI comes from Scalable. If it only scales from 1 to 2, it would be more like NVSLI (NV coming from "Not Very" as well as NVidia) :)

IMO this is "revision 1". If they are serious about it then more functionality will be built into future hardware to better support the solution, from arbitrating the host data from a single "master" device per board across the other devices on that board, to better ways of apportioning the workload across all the graphics devices that contribute towards the final image within the entire system. With these two elements, at least, I can see a situation where you'll be able to have multiple graphics chips on a single board and multiple boards rendering a single image within the same system.

Oh, and wouldn't the next step be dual-core on one chip, then multiple n-core chips on one card?

Dual core for graphics will not buy you much, since graphics is already inherently very, very scalable, but it will leave you with some of the same unnecessary silicon overhead as rendering across multiple chips/boards. Currently, with graphics, it's always more efficient to pack as many pixel pipelines (well, read: fragment processing capability) into a single chip as you think you feasibly can rather than going multi-core in a single die.

This still presents other issues: under present systems, the more processors you scale up to, the less efficiently you are making use of the hugely expensive RAM you are paying for. Off-screen render targets also present issues.

Under pre-VS3.0 systems it would also have been better (from a silicon utilisation standpoint) to remove the VS from the graphics core on a multi-rendering board and have a separate geometry processing device, as 3dfx were going to do with Voodoo 5 and Rampage, and as 3Dlabs do (although Realizm has the VS in the raster core, just disabled when P20 is used in the Realizm 800 configuration). That device could act as the master, receiving data from the host, doing all the board's geometry processing up to the setup stage, and apportioning the appropriate triangles to each chip on the board, such that geometry processing is only done once for the entire screen per board rather than once for the entire screen per chip (under non-AFR solutions). But elements such as vertex texturing and the greater interaction of VS/PS/GS/Tessellator being talked about for WGF2.0 represent more hurdles to overcome.
 
DaveBaumann said:
Under pre-VS3.0 systems it would also have been better to remove the VS from the graphics core on a multi-rendering board and have a separate geometry processing device. [snip]
But elements such as vertex texturing and the greater interaction of VS/PS/GS/Tessellator being talked about for WGF2.0 represent more hurdles to overcome.

In this respect, wouldn't a unified PS and VS approach have an inherent overall advantage (at least in a hypothetical pre-VS3.0 system)?
 
Do you mean having one chip process purely geometry functions and then one or more doing the fragment processing?
 
DaveBaumann said:
The difference here is that ATI has SuperTiling built into the chip, which automatically takes care of the screen-space division and/or FSAA samples over multiple chips, whereas NVIDIA's current solution doesn't have this. Presently the majority solution for NVIDIA is AFR, so using more chips/boards will introduce further latencies; Split Screen Rendering would work, though. They also can't get the same benefits from scaling up MSAA as ATI because they are limited to 4 sample positions.
The only real difference between split screen, SuperTiling and scan line interleave for that matter is that the latter two "automatically" take care of load balancing, while split screen should need a bit less texture bandwidth because of increased locality.

DaveBaumann said:
But this doesn’t, of course, preclude a mode whereby the texture subsample center is offset per chip rendering that tile to achieve the same benefits of mixed MS+SS AA.
It would be a waste of processing power to not do this.

DaveBaumann said:
Under pre-VS3.0 systems it would also have been better (from a silicon utilisation standpoint) to remove the VS from the graphics core on a multi-rendering board and have a separate geometry processing device, as 3dfx were going to do with Voodoo 5 and Rampage, and as 3Dlabs do (although Realizm has the VS in the raster core, just disabled when P20 is used in the Realizm 800 configuration). That device could act as the master, receiving data from the host, doing all the board's geometry processing up to the setup stage, and apportioning the appropriate triangles to each chip on the board, such that geometry processing is only done once for the entire screen per board rather than once for the entire screen per chip (under non-AFR solutions). But elements such as vertex texturing and the greater interaction of VS/PS/GS/Tessellator being talked about for WGF2.0 represent more hurdles to overcome.
The nice thing about a unified architecture is that you don't need two different chips for such a VSU/VPU setup. You can use one chip to do all the geometry work and send the transformed data to all the other chips, and if the geometry load is low that chip can even perform part of the rendering work.
 
Xmas said:
DaveBaumann said:
The difference here is that ATI has SuperTiling built into the chip, which automatically takes care of the screen-space division and/or FSAA samples over multiple chips, whereas NVIDIA's current solution doesn't have this. Presently the majority solution for NVIDIA is AFR, so using more chips/boards will introduce further latencies; Split Screen Rendering would work, though.
The only real difference between split screen, SuperTiling and scan line interleave for that matter is that the latter two "automatically" take care of load balancing,

Yes, that's one of the differences - there are implementation differences as well, as I illustrated with the MSAA design. The point is that it is already implemented.

However, under NVIDIA's current method, should you start scaling up chips/boards they would either have to increase the software logic on the load balancing to cope with 4 or more chips, or just leave the ratios fixed. There is also the issue that for the game titles that use AFR the latencies would be increased. The reasons NVIDIA have to use AFR are likely to be applicable to a SuperTiling situation too, and ATI will probably have to use AFR as well if they want to see gains outside of very high FSAA usage scenarios, which is one of the reasons why I suspect that we won't see massive multi-chip/board implementations for some time, nor significant development down these lines (at least beyond two).
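
For what it's worth, the kind of per-frame balancing logic a driver would need for split-screen rendering across more chips might look something like this (purely a hypothetical sketch, not NVIDIA's actual scheme):

# Re-apportion scanlines among N chips based on how long each took on the
# previous frame; chips that finished quickly get more work.
def rebalance(frame_times_ms, screen_height):
    weights = [1.0 / t for t in frame_times_ms]
    total = sum(weights)
    heights = [int(screen_height * w / total) for w in weights]
    heights[-1] += screen_height - sum(heights)  # absorb rounding error
    return heights

# Four chips, the last two were slower last frame:
print(rebalance([14.0, 14.0, 18.0, 20.0], 1080))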

while split screen should need a bit less texture bandwidth because of increased locality.

That can be implementation-specific as well, since R300-class boards are already tiled on a per-quad basis in the first place, so their rendering localities are already spread out - SuperTiling gives larger groups of tiles to different processors (which are then tiled further down to each quad), so the locality doesn't change over multiple chips for an R300+ based implementation.

The nice thing about a unified architecture is that you don't need two different chips for such a VSU/VPU setup. You can use one chip to do all the geometry work and send the transformed data to all the other chips, and if the geometry load is low that chip can even perform part of the rendering work.

Complications spring up if you want to pass any rendered data back to the vertex processor from something rendered in the framebuffer.

"What is processing what" would also, at the very least, need to be worked out on a per-frame basis (i.e. the tile / split-screen distribution), so I doubt that the geometry processor could opportunistically "jump in" and start rendering some tiles, since the workload would have already been distributed. You could do some "geometry vs render" load balancing and apportion a number of render tiles to the "geometry processor" dependent on how quickly it finished the previous frame, but then you could introduce bigger hold-ups in some frames, and you also would need to give the same amount of RAM to the geometry processor as the other render chips, which in a consumer implementation you probably wouldn't want to, especially if it wasn't guaranteed to be utilised. It would also be the case that there are still fairly significant portions of the chip that would be sparsely, or not at all, utilised, which probably wouldn't be an effective use of die.
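
To put the "geometry vs render" balancing idea into rough code terms (names and numbers here are hypothetical, just to show the dependency on the previous frame's timing):

# If the geometry chip had idle time last frame, hand it some render tiles
# this frame; otherwise it does geometry only.
def tiles_for_geometry_chip(total_tiles, geom_busy_ms, frame_ms, per_tile_ms):
    idle_ms = max(0.0, frame_ms - geom_busy_ms)
    return min(total_tiles, int(idle_ms / per_tile_ms))

# e.g. a 16.6 ms frame where geometry took 9 ms and a tile costs ~0.5 ms:
print(tiles_for_geometry_chip(256, 9.0, 16.6, 0.5))  # -> 15 tiles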
 
DaveBaumann said:
Complications spring up if you want to pass any rendered data back to the vertex processor from something rendered in the framebuffer.

Yes, but this is a problem you will see with any kind of SFR too. If you split the render-to-texture work across two chips, the results need to be copied together before the vertex shader can use them. If you don't split the work, you have to do it all on one chip and you win nothing at all.
 
DaveBaumann said:
Yes, that's one of the differences - there are implementation differences as well, as I illustrated with the MSAA design. The point is that it is already implemented.
NVidia could also distribute AA samples across multiple chips. However, lacking programmable sample positions, they would need to offset the pixel center, and ATI could get a better pattern for the same reason. OTOH, from a performance POV it is slightly better to have all samples for one pixel processed by one chip.

Complications spring up if you want to pass any rendered data back to the vertex processor from something rendered in the framebuffer.
I don't see why that would be any more complicated than render-to-texture situations. Which you already have to solve today.


"What is processing what" would also, at the very least, need to be worked out on a per-frame basis (i.e. the tile / split-screen distribution), so I doubt that the geometry processor could opportunistically "jump in" and start rendering some tiles, since the workload would have already been distributed. You could do some "geometry vs render" load balancing and apportion a number of render tiles to the "geometry processor" dependent on how quickly it finished the previous frame, but then you could introduce bigger hold-ups in some frames [...]
I don't think it's too bad a bet to rely on the previous frame. But you'd probably need at least a four-chip setup to have the "geometry" chip fully utilized (or use a smaller variant as the geometry chip).

and you also would need to give the same amount of RAM to the geometry processor as the other render chips, which in a consumer implementation you probably wouldn’t want to, especially if it wasn’t guaranteed to be utilised.
Ideally, you'd want every chip to use the same memory. I wonder if it would be feasible to have a "memory hub", basically an external memory controller with a large amount of cache (it would probably be pad-limited anyway, so there's enough space)

It would also be the case that there are still fairly significant portions of the chip that would be sparsely, or not at all, utilised, which probably wouldn't be an effective use of die.
Certainly, but that's true for all multichip configurations. It would be too expensive to design chips specifically for multichip use.
 
OTOH, from a performance POV it is slightly better to have all samples for one pixel processed by one chip.

Of course, but we’re talking about consumer implementations here, so we’re probably always going to be putting up with some sampling limitations.

I don't see why that would be any more complicated than render-to-texture situations. Which you already have to solve today.

Under current situations, on average 50% of the time the render-to-texture results are going to be local to one of the chips - with a separate geometry processor, the likelihood is that they are going to be on another chip 100% of the time.

Ideally, you'd want every chip to use the same memory. I wonder if it would be feasible to have a "memory hub", basically an external memory controller with a large amount of cache (it would probably be pad-limited anyway, so there's enough space)

There is such a thing as multi-ported memory, IIRC, but it's fairly slow. I also seem to recall the older Wildcat boards using shared texture memory (?).

However, what I had in mind with a geometry processor would be something dedicated to the task, such that it had a small vertex cache (or even used system memory over PCI Express as a vertex cache) and only set-up triangles would be distributed to the raster chips.
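
A sketch of what that distribution step might look like, assuming a SuperTiling-style split on the raster side (the tile size, chip count and binning by screen-space bounding box are all my assumptions):

NUM_RASTER_CHIPS = 2
TILE = 32  # screen tile edge in pixels, arbitrary

def chips_for_triangle(bbox):
    # After transform and setup on the geometry chip, forward the triangle
    # only to the raster chips whose tiles its screen-space bounding box
    # overlaps.
    x0, y0, x1, y1 = bbox
    owners = set()
    for ty in range(y0 // TILE, y1 // TILE + 1):
        for tx in range(x0 // TILE, x1 // TILE + 1):
            owners.add((tx + ty) % NUM_RASTER_CHIPS)
    return owners

# A triangle spanning two horizontally adjacent tiles goes to both chips:
print(chips_for_triangle((20, 8, 44, 30)))  # -> {0, 1}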
 