Dual/Multi Chip Consumer Graphics Boards

RAM was a lot more expensive then--and a lot slower. That was the cost barrier, IMO.

a) Cards also didn't need the amounts of onboard RAM that cards need today. Next-generation cards will most likely start with 256MB of RAM, and that for single-chip solutions.

b) How cheap is DDR II RAM today?

I don't see how the cost has changed in that aspect.
 
Xmas said:
SLI only means that the two chips/boards share the framebuffer, so that one renders all odd lines and the other one all even lines. The V5 employs a similar scheme, but with stripes of multiple lines interleaved.

IIRC Metabyte did something similar with two Voodoo Banshees, each rendering half the screen.
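A toy sketch of the two interleaving schemes, in Python (the 32-line stripe height is my own assumption; the actual V5 band height was configurable, IIRC):

def sli_owner(y):
    """Voodoo2-style SLI: even lines go to chip 0, odd lines to chip 1."""
    return y % 2

def v5_owner(y, stripe_height=32, num_chips=2):
    """V5-style striping: bands of consecutive lines, interleaved."""
    return (y // stripe_height) % num_chips

# With 32-line stripes: lines 0-31 -> chip 0, lines 32-63 -> chip 1, and so on.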
 
Makes the 0.13 R350 rumors even more interesting (power reduction), and I sort of wonder how a dual RV350 chip design would perform.

Well, there are quad-R300 setups that have 9.6 Gpixel fill rates, so who would know what a dual R350 chip could do?
 
Jabjabs said:
Makes the 0.13 R350 rumors even more interesting (power reduction), and I sort of wonder how a dual RV350 chip design would perform.

Well, there are quad-R300 setups that have 9.6 Gpixel fill rates, so who would know what a dual R350 chip could do?

Well, I was wondering how a dual RV350 design would perform... there isn't that much question what the performance of a dual R350 would be like (monstrous :p ).

With the increased clock speeds it could offer, if the workload sharing were efficient and key features weren't missing, I think it could be a viable product as an intermediate step between R350 and R400. I'd expect two RV350 chips to be cheaper than one R350 at equivalent clock speeds, so the question would seem to come down to the cost and performance of the "MAXX" board design.
If these conditions are met, it could be a lot more cost effective than re-implementing R300 or R350 for 0.13. OTOH, if the RV350 is very similar to R300 or R350 in design, re-implementing them on 0.13 could be fairly easy.
*shrug* Just more speculation.
 
One of the primary problems with multi-chip architectures is that it is impossible to use two chips as efficiently as just one. Some quick examples:

1. Voodoo5 5500: Shared framebuffer memory, but separate texture memory meant less available memory space and bandwidth than the theoretical values (I believe Voodoo2 SLI had the same problem).

2. ATI Rage Fury MAXX: alternate-frame rendering shared no memory between the chips, resulting in quite a bit less usable memory than was physically on the board, though bandwidth usage was maximized. It tended to stutter without triple buffering (see the sketch after this list).

3. And, finally, moving into a hypothetical future multi-chip architecture, it becomes rather challenging to share vertex processing power between chips.
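A minimal sketch of point 2, purely illustrative (the memory figures are my own assumption for a MAXX-like board):

def afr_owner(frame_index, num_chips=2):
    """Rage Fury MAXX-style AFR: whole frames alternate between chips."""
    return frame_index % num_chips

# Each chip renders out of its own private memory pool, so every texture
# must be duplicated: a hypothetical 2 x 32 MB board stores textures like a
# 32 MB card, even though each chip keeps its full local bandwidth.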

And, of course, the #1 problem with going multi-chip is cost. After all, with the high performance of today's accelerators, who really needs to pay so much more for more performance sooner? You'd likely be spending about twice as much money for not quite twice the performance. The truth is that it's the buses between various system components that are fast becoming the bottlenecks. By moving more and more systems together onto single chips, these bottlenecks can be removed.

That is, eventually, we'll be using SOC (system-on-a-chip) designs not for cost, but for performance.
 
Hmmh... I haven't posted much lately, but this is something I can share my thoughts on...

I believe there's room for multichip implementations (one chip for the cheaper boards, two or more for a true powerhouse) in one special area that isn't yet used in the PC market. And yes, I am talking about eRAM-based implementations.

I don't know how much it is wise to talk about how the Bitboys Axe had rendering implemented so that it scaled easily to multiple chips (or at least that's how I understood it), so I won't go into the exact details.

But think about rendering the framebuffer in tiles instead of scanlines. With this arrangement, it is enough that one tile fits into on-chip memory. The primary chip would have some sort of "tile buffer" holding information about geometry, textures, shader program locations... essentially everything the pixel processing unit needs to render a tile. Each chip does its rendering in eRAM, keeping as much of the needed data there as possible. When a tile is finished, the chip copies it to SDRAM alongside the others, then picks up new data from the tile buffer and starts rendering a new tile (possibly doing a memory copy from SDRAM to bring the needed geometry/textures/shader programs on-chip).

Practically, with this arrangement it is possible to create a multichip rendering array where the number of GPUs affects tile rendering speed almost linearly. The only problem I could see is that the chip that does geometry and tile clipping could become a bottleneck if it is too slow; still, that would affect only geometry speed.
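A minimal sketch of that tile-buffer scheme in Python (the names, the 32-pixel tile size, and the queue structure are my own assumptions, not Bitboys'): as long as every chip just pulls the next tile from a shared buffer, adding chips scales tile throughput almost linearly, until the front end that does the binning saturates.

from queue import Queue, Empty
from threading import Thread

TILE = 32          # assumed tile edge in pixels; one tile must fit in eDRAM
NUM_CHIPS = 4

def build_tile_buffer(width, height):
    """'Primary chip': fill the tile buffer with per-tile work descriptors."""
    q = Queue()
    for ty in range(0, height, TILE):
        for tx in range(0, width, TILE):
            # A real descriptor would carry the binned geometry, texture,
            # and shader-program references this tile needs.
            q.put((tx, ty))
    return q

def chip(chip_id, tile_buffer, sdram):
    """Rendering chip: pull a tile, render it in eRAM, copy it to SDRAM."""
    while True:
        try:
            tx, ty = tile_buffer.get_nowait()
        except Empty:
            return
        # ... render the tile entirely in on-chip eRAM (stubbed) ...
        sdram[(tx, ty)] = chip_id   # copy the finished tile out

sdram = {}
tiles = build_tile_buffer(640, 480)
chips = [Thread(target=chip, args=(i, tiles, sdram)) for i in range(NUM_CHIPS)]
for c in chips:
    c.start()
for c in chips:
    c.join()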
 
I wonder how well the Kyro architecture would work in a multichip configuration. The rasteriser looks ideal for that, simply a tile per chip, but I wonder if there would be big problems with (T&L and) binning?
 
Chalnoth said:
1. Voodoo5 5500: Shared framebuffer memory, but separate texture memory meant less available memory space and bandwidth than the theoretical values (I believe Voodoo2 SLI had the same problem).
This is indeed one of the biggest problems with multichip solutions: texture memory.
The best thing you can do is use a small cache, reading some values in advance (you can determine which ones you need before actually fetching them from memory), but preventing stalls would be very hard.
There could be algorithms to minimize stalls, but 100% efficiency ain't gonna happen.
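A toy sketch of that prefetch idea (entirely my own construction, not any actual hardware design): because texel addresses can be computed a few pixels ahead of when the shader samples them, the cache can be warmed early, and a stall only happens on a miss.

def fetch_from_memory(addr):
    """Stub for a long-latency read over the (shared) memory bus."""
    return addr  # pretend a texel's value is its address

class TexelCache:
    """Tiny cache with naive oldest-first eviction; purely illustrative."""
    def __init__(self, capacity=256):
        self.capacity = capacity
        self.lines = {}  # addr -> texel

    def prefetch(self, addr):
        # Called ahead of time, while the pipeline works on earlier pixels.
        if addr not in self.lines:
            if len(self.lines) >= self.capacity:
                self.lines.pop(next(iter(self.lines)))  # evict oldest entry
            self.lines[addr] = fetch_from_memory(addr)

    def read(self, addr):
        if addr in self.lines:
            return self.lines[addr]      # hit: no stall
        return fetch_from_memory(addr)   # miss: the pipeline stalls here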

3. And, finally, moving into a hypothetical future multi-chip architecture, it becomes rather challenging to share vertex processing power between chips.

But 3dfx fixed that about two years ago with Sage, according to rumors.
The idea was to have one T&L chip and two "pixel" (or zixel :p ) chips for the high end, one T&L chip and one "pixel" chip for the mid range, and only one "pixel" chip for the low end.
It was a very good approach to the problem, IMO. Of course, we might never know if there were any other problems coming from that... :(
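Restating that rumored lineup as data, just to make the scaling obvious (the table and the CPU-fallback note are my own reading of the rumor):

# Rumored Sage lineup: one T&L front end feeding a variable number of
# rasterizer chips.
SAGE_LINEUP = {
    "high-end": {"tnl_chips": 1, "pixel_chips": 2},
    "mid-end":  {"tnl_chips": 1, "pixel_chips": 1},
    "low-end":  {"tnl_chips": 0, "pixel_chips": 1},  # presumably host-CPU T&L
}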

And, of course, the #1 problem with going multi-chip is cost. After all, with the high performance of today's accelerators, who really needs to pay so much more for more performance sooner? You'd likely be spending about twice as much money for not quite twice the performance.

Agreed. But multi-chip for workstations is still an excellent approach.
Remember Quantum3D? The VSA-100 based beast. IIRC, there were 8 VSA-100s on it. Maybe even more.
It cost $20,000, but it could beat a GF4 Ti4600 in fillrate benchmarks, and that two years earlier!
Fact is, you could still have a similar thing today, and it *would* sell to a niche market.

Another reason to go multichip, for nVidia for example, is TSMC's low-k process. It seems to be highly inefficient with more than 100M transistors, due to I don't quite remember what (that's according to MuFu, BTW).
So, unless TSMC finds a miraculous solution, it could be a very good reason.

But still, besides workstations, I don't see multichip solutions anytime soon.


Uttar
 
I believe it was 16 VSA-100s. And there is an advantage to not sharing texture memory... do you have any idea of the memory bandwidth that beast had!? In fill-rate tests I bet it would whip an R350. Too bad the drivers were never finished to enable all of the features (like culling) :(
 
Yep, although Quantum3D's AAlchemy products were advertised with up to 32 VSA-100 chips and 2 GB of video memory, their product page shows a maximum of 16 chips and 1 GB.

However, these are custom built rackmount systems, not AGP video cards, so I'm not sure a comparison to Geforces or Radeons is fair :p

About Kyros doing multichip, I completely forgot it's been done (with two CLX chips and the Elan T&L) in the Naomi 2 arcade board. Have to search for more information on that, I guess...

But now I wonder if the VS/PS 3.0 of DX10 would somehow hamper (or even prevent) deferred rendering, if all texture and other source data must be freely accessible by either the VS or PS at any time (IIUC). Can you then cut the action midway the way the binning & sorting stage does? -- I'm on thin ice here, and this is slightly off topic, though.
 
Ah, interesting: Quantum3D's Independence apparently sports a maximum of 16 Quadro4 chips in one system. So they are indeed continuing with Nvidia chips now. I remember from some interview that Quantum3D participated in the design of the VSA-100 (they were close; a co-founder of 3dfx founded Q3D) to get their requirements into it. I now wonder why that was necessary, if they can just as "easily" make new big-iron products with non-SLI Nvidia chips.
 
Quantum3D's Independence is 2*IA32 + 16*(2*IA32 + Quadro4).
So you have 16 dual-CPU PCs with one Quadro4 each, and a dual-CPU PC to control them. And then some special hardware to keep the Quadro4s' output in sync so they can be blended. In VSA-100 terms, it's a 16-way T-buffer with one 2*IA32+Quadro4 node for each sub-buffer.
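Making that arithmetic concrete (the representation is mine, not Quantum3D's):

MASTER_CPUS = 2          # the controlling dual-IA32 PC
NODES = 16               # one render node per T-buffer sub-buffer
CPUS_PER_NODE = 2        # each node is itself a dual-IA32 PC
GPUS_PER_NODE = 1        # one Quadro4 per node

total_cpus = MASTER_CPUS + NODES * CPUS_PER_NODE   # 34 IA32 CPUs
total_gpus = NODES * GPUS_PER_NODE                 # 16 Quadro4s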

Good when you want to do a 16-way T-buffer. But you can't trade the FSAA for performance (if you think an 8-way T-buffer is enough). There's also a whole lot of redundant IA32 processors, but maybe that's not a big problem since this isn't a price-sensitive niche (and for T-buffer effects other than FSAA, they could be useful).

But I don't think I would call it easy, or even "easy", considering the amount of hardware they needed to throw in around it.
 
Thanks, Basic, for the clarification. Well, I didn't mean downright easy, but at least doable :) Come to think of it now, maybe it was easier with VSA-100 chips, as they could just hook the chips together without CPUs -- at least the configuration was simpler. (Although, as you said, their clientele certainly has deep pockets. Some stuff on www.MetaVR.com indicates there is some occasional price competition, though.)

Apropos, what do you think about deferred renderers (multichip or single) with the freedom of access VS 3.0/PS 3.0 seems to require? Any problems there?
 
What if you could have dual VPUs handling different operations? One for vertex processing and the other for pixel processing. Couldn't you also split the other processes to lighten the workload?

Another consideration is the RV350. It has already proven to be less power-demanding while providing excellent overclocks, being fed solely by the AGP bus. It is already a 0.13u design and DDR-II configured. Bandwidth on the AGP 8x bus is still not fully utilized, so there is plenty of headroom with the current bus. Even more so when you consider this design could easily be adapted to an external unit running off PCI Express.

What do you think? Possible? :?
 
Chalnoth said:
And, of course, the #1 problem with going multi-chip is cost. After all, with the high performance of today's accelerators, who really needs to pay so much more for more performance sooner? You'd likely be spending about twice as much money for not quite twice the performance. The truth is that it's the buses between various system components that are fast becoming the bottlenecks. By moving more and more systems together onto single chips, these bottlenecks can be removed.

That is, eventually, we'll be using SOC (system-on-a-chip) designs not for cost, but for performance.

Performance comes from more transistors, and multi-chip is about the only way you can increase transistor count for a given die size. Although the more mature approach would be to have multiple chips on a single package.
 
Gunhead said:
I wonder how well the Kyro architecture would work in a multichip configuration. The rasteriser looks ideal for that, simply a tile per chip, but I wonder if there would be big problems with (T&L and) binning?

AFAIK there is an arcade solution (Naomi) using multichip. There is a geometry processor acting as a bridge, doing the T&L and binning, and two PowerVR Series 3 chips doing the pixel shading.
 
Hey, speaking of multichip solutions, I came across a small UK company by the name of Colorgraphic at a recent show. They primarily deal in multidisplay solutions, such as video walls and multidesktop systems, and have recently started a partnership with ATI.

Their newest solution, the Xentra GT, can use up to 4 Radeon 9000s for 8-way displays. Interestingly, for power reasons, rather than using desktop 9000s they are using the mobile versions.
 
I believe the CLX chips in NAOMI 2 were designed to be scaled up to 16 chips. Simon F can probably offer some more info on its operation/config.
 