All benches with the Volari Ultra Duo only used one chip (GPU)!

With this dual-chip approach, I wonder what kind of speedup can be expected. I understand a dual CPU only gives 40%-60% more speed (depending on a ton of things), and that graphics lends itself better to a parallel approach. Could we expect an average 70% speed increase (vs. one GPU)?

Anyone care to enlighten me?
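The 40%-60% dual-CPU figure is essentially Amdahl's law: whatever fraction of the work stays serial caps the gain. A minimal sketch (the parallel fractions below are illustrative assumptions, not measured numbers):

```python
def amdahl_speedup(n_chips: int, parallel_fraction: float) -> float:
    """Speedup on n_chips when only parallel_fraction of the work scales."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_chips)

# A dual CPU with 60%-75% parallelizable work lands in the 40%-60% range:
print(amdahl_speedup(2, 0.60))  # ~1.43x (+43%)
print(amdahl_speedup(2, 0.75))  # ~1.60x (+60%)
# Graphics is far closer to embarrassingly parallel:
print(amdahl_speedup(2, 0.95))  # ~1.90x (+90%)
```

By this framing, a ~70% average gain over one GPU corresponds to a bit over 80% of the work scaling; whether a dual-chip card reaches that depends on how it splits the frame.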
 
RussSchultz said:
Transistor count (or die area) is an indication of cost, and hence integral to determining "efficiency".

Equivalent die area should result in equivalent performance (in an ideal world, of course).

Equivalent die area presupposes equivalent featuresets/architectures and/or implementations under all conditions?
 
ByteMe said:
With this dual-chip approach, I wonder what kind of speedup can be expected. I understand a dual CPU only gives 40%-60% more speed (depending on a ton of things), and that graphics lends itself better to a parallel approach. Could we expect an average 70% speed increase (vs. one GPU)?

Anyone care to enlighten me?

It depends on the method the two GPUs utilize, and on a significant number of conditionals.

Here's an old quote from SA:

Highly scalable problems such as 3D graphics and physical simulation should get near-linear improvement in performance with transistor count as well as frequency. As chips specialized for these problems become denser, they should increase in performance much more than CPUs for the same silicon process. This means moving as much performance-sensitive processing as possible from the CPU to special-purpose chips. This is quite a separate reason for special-purpose chips than simply implementing the function directly in hardware so as to apply more transistors to the computations (as mentioned in my previous post). It also applies to functions that require general programmability (but are highly scalable). General programmability does not preclude linear scalability with transistor count; you just need to focus the programmability on problems that are linearly scalable (such as 3D graphics and physical simulation). It makes sense, of course, to implement as many heavily used low-level functions as possible directly in hardware, to apply as many transistors as possible to the problem at hand.

The other major benefit of using special-purpose chips for highly scalable, computation-intensive tasks is the simplification and linear scalability of using multiple chips. This becomes especially true as EDRAM arrives.

The MAXX architecture requires scaling the external memory with the number of chips, as does the scan line (band line) interleave approach that 3dfx used. With memory being such a major cost of a board, and with all those pins and traces to worry about, it is a hard and expensive way to scale chips (requiring large boards and lots of extra power for all that external memory). The MAXX architecture also suffers from input latency problems limiting its scalability (you increase input latency by one frame time with each additional chip). The scan line (band line) method also suffers from caching problems and lack of triangle setup scalability (since each chip must set up the same triangles redundantly).

With EDRAM, the amount of external memory needed goes down as the number of 3d chips increase. In fact, with enough EDRAM, the amount of external memory needed quickly goes to 0. EDRAM based 3d chips are thus ideal for multiple chip implementations. You don't need extra external memory as the chips scale (in fact you can get by with less or none), and the memory bandwidth scales automatically with the number of chips.

To make the maximum use of the EDRAM approach, the chips should be assigned to separate rectangular regions or viewports (sort of like very large tiles). The regions do not have their rendering deferred (although they could of course), they are just viewports. This scaling mechanism automatically scales the computation of everything: vertex shading, triangle setup, pixel operations, etc. It does not create any additional input latency, allows unlimited scalability, and does not require scaling the memory as required by the previously mentioned approaches.

Tilers without EDRAM also scale nicely without needing extra external memory. They are, in fact, the easiest architecture to scale across multiple chips: you just assign the tiles to be rendered to separate chips rather than the same chip. The external memory requirements, while remaining constant, do not drop, however, as they do with EDRAM. The major problem to deal with is scaling the triangle operations as well as the rendering. In this case, combining the multi-chip approach mentioned for EDRAM with tiling solves these issues: you just assign all the tiles in a viewport/region to a particular chip. Everything else is done as above and has the same benefits.

In my mind, the ideal 3D card has 4 sockets and no external memory. You buy the card with one socket populated at the cost of a one-chip card. The chip has 32 MB of EDRAM, so with 1 chip you have a 32 MB card. When you add a second chip, you get a 64 MB card with double the memory bandwidth and double the performance. Those who go all out and add 3 more chips get 128 MB of memory and quadruple the memory bandwidth and performance. Ideally, the chip uses some form of occlusion culling such as tiling, or hierarchical Z-buffering with early Z check, etc. Using the same compatible socket across chip generations would be a nice plus.

In the long run I agree with MFA. Using scene graphs or a similar spatial hierarchy simplifies and solves most of these problems, allowing you to access, transform, light, shade, and render only what is visible. They also simplify the multiple-chip, virtual-texture, and virtual-geometry problems. We will need to wait a bit longer for it to appear in the APIs, though.

There are indeed two problems generally associated with partitioning the screen across multiple chips: load balancing, and distributing the triangles to the correct chip. Both have fairly straightforward, very effective solutions, though I can't mention the specifics here.
Those are some good comments, MFA. However, there is no need to defer rendering and no need for a large buffer. Each chip knows which vertices/triangles to process, without waiting.
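SA's viewport scheme can be sketched very roughly: give each chip a fixed screen strip and route triangles by bounding box. The strip split and names below are my own illustration, not an actual implementation:

```python
SCREEN_W = 1024   # assumed screen width
N_CHIPS = 4
STRIP_W = SCREEN_W // N_CHIPS  # each chip owns one vertical strip

def chips_for_triangle(xs):
    """Return the chips whose strips a triangle's x-extent overlaps."""
    first = max(min(xs) // STRIP_W, 0)
    last = min(max(xs) // STRIP_W, N_CHIPS - 1)
    return list(range(first, last + 1))

print(chips_for_triangle([10, 50, 90]))     # [0] - fits in one strip
print(chips_for_triangle([200, 300, 600]))  # [0, 1, 2] - spans three strips
```

A triangle crossing a strip boundary gets set up on every overlapping chip, which is exactly the load-balancing and triangle-distribution problem mentioned above.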


3dfx used SLI (scan-line interleaving), while ATI used AFR (alternate frame rendering) for the MAXX. XGI's approach closely resembles AFR, since each chip works on independent frames (master chip = frame 1, slave chip = frame 2, master chip = frame 3, etc.; well, it's more complicated, but it should give a picture). IMHO, though, the MAXX and Volari architectures should face pretty similar problems in terms of memory scaling and input latency.
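The two schemes can be contrasted in a toy sketch (hypothetical helper names; real chips do this in hardware and driver logic):

```python
def sli_owner(scanline: int, n_chips: int, band: int = 1) -> int:
    """Scan-line (band-line) interleave: which chip renders a scanline."""
    return (scanline // band) % n_chips

def afr_owner(frame: int, n_chips: int) -> int:
    """Alternate frame rendering: which chip renders a whole frame."""
    return frame % n_chips

# SLI splits work *within* a frame; AFR alternates *between* frames:
print([sli_owner(y, 2, band=2) for y in range(8)])  # [0, 0, 1, 1, 0, 0, 1, 1]
print([afr_owner(f, 2) for f in range(4)])          # [0, 1, 0, 1]
```

AFR's latency cost follows directly: the second chip starts frame 2 before frame 1 is displayed, so each extra chip adds a frame of input latency, whereas all SLI chips work on the current frame.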
 
Ailuros said:
RussSchultz said:
Transistor count (or die area) is an indication of cost, and hence integral to determining "efficiency".

Equivalent die area should result in equivalent performance (in an ideal world, of course).

Equivalent die area presupposes equivalent featuresets/architectures and/or implementations under all conditions?
You wouldn't compare apples and oranges, would you?
 
RussSchultz said:
You wouldn't compare apples and oranges, would you?


Well....

They both have a skin on the outside... seeds on the inside.... 'flesh' on the inside... reproduce naturally in the open....
 
jimbob0i0 said:
RussSchultz said:
You wouldn't compare apples and oranges, would you?


Well....

They both have a skin on the outside... seeds on the inside.... 'flesh' on the inside... reproduce naturally in the open....
Next you'll be telling me they both grow on trees. :rolleyes:
 
ByteMe said:
With this dual-chip approach, I wonder what kind of speedup can be expected. I understand a dual CPU only gives 40%-60% more speed (depending on a ton of things), and that graphics lends itself better to a parallel approach. Could we expect an average 70% speed increase (vs. one GPU)?

Anyone care to enlighten me?

3dfx's SLI was able to get VERY close to 100% dual-chip efficiency (probably average 99%). Easily the best solution for multi-chip 3D.
 
RussSchultz said:
Ailuros said:
RussSchultz said:
Transistor count (or die area) is an indication of cost, and hence integral to determining "efficiency".

Equivalent die area should result in equivalent performance (in an ideal world, of course).

Equivalent die area presupposes equivalent featuresets/architectures and/or implementations under all conditions?
You wouldn't compare apples and oranges, would you?

Of course I wouldn't. And that's exactly why I find any comparison between chips based exclusively on transistor count to be meaningless. That's what my original comment was about.
 
Tagrineth said:
ByteMe said:
With this dual-chip approach, I wonder what kind of speedup can be expected. I understand a dual CPU only gives 40%-60% more speed (depending on a ton of things), and that graphics lends itself better to a parallel approach. Could we expect an average 70% speed increase (vs. one GPU)?

Anyone care to enlighten me?

3dfx's SLI was able to get VERY close to 100% dual-chip efficiency (probably average 99%). Easily the best solution for multi-chip 3D.

On the VSA-100, doubling the chips resulted ONLY in twice the FSAA performance.

His question is quite simple: if you get, say, 100 fps in condition X with one chip, do you get 170 fps with two chips in the same condition? Apart from enabled FSAA, the VSA-100 didn't yield 198 fps vs. 100 fps according to that paradigm, and I fail to see where the supposed 99% efficiency comes from; and yes, I am concentrating exclusively on released and existing products.

Apart from that, SLI is IMHO definitely the better solution compared to AFR.
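The disagreement above is just arithmetic; plugging in the hypothetical fps numbers from the post:

```python
def scaling_efficiency(single_fps: float, multi_fps: float, n_chips: int) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return (multi_fps / single_fps) / n_chips

print(scaling_efficiency(100, 170, 2))  # 0.85 -> the ~70% gain ByteMe asked about
print(scaling_efficiency(100, 198, 2))  # 0.99 -> the claimed "99% efficiency"
```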
 