Opinions needed on this Interview

DaveBaumann said:
The 4 quad board would need to render more metatiles though.
Yes, absolutely! Otherwise the 3-quad board does more work!!! Every quad on the 3-quad board would be called 1.33 times as often as the quads on the 4-quad board!

So, simple round-robin meta-tiling runs counter to load-balancing. You'd want to have the 4 quad board doing 1.33 times as many meta-tiles.

Jawed
 
The following two pictures show the tiles for a 1024x768 screen. Note that these don't use Dave's "meta-tiling", they are simply round-robin tilings of 16x16 pixel areas. Note for Dave, these individual squares are 16x16-pixel tiles, each owned by one quad :)

This first diagram is of a 3-quad board with a 4-quad board. Note every quad has an equal number of tiles (439), except for quad 4 on the X800XT which has 438 tiles:

b3d08.gif


In this configuration we have two X800XTs. Again there's no explicit meta-tiling as such, just round-robin tile ordering. Obviously every quad gets equal billing, 384 tiles each:

b3d09.gif
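The tile counts quoted for the two configurations above can be checked with a few lines of Python. This is only a sketch: it assumes tiles are dealt out to quads in plain raster order, and the function name is made up for illustration.

```python
# Illustrative sketch: count the 16x16 tiles each quad owns when tiles are
# dealt out round-robin. Raster-order dealing is an assumption.
def round_robin_tile_counts(width, height, num_quads, tile=16):
    tiles_x = width // tile   # 64 tiles across at 1024 pixels
    tiles_y = height // tile  # 48 tiles down at 768 pixels
    counts = [0] * num_quads
    for index in range(tiles_x * tiles_y):
        counts[index % num_quads] += 1
    return counts

seven_quads = round_robin_tile_counts(1024, 768, 7)  # 3-quad + 4-quad boards
eight_quads = round_robin_tile_counts(1024, 768, 8)  # two 4-quad boards
print(seven_quads)  # [439, 439, 439, 439, 439, 439, 438]
print(eight_quads)  # [384, 384, 384, 384, 384, 384, 384, 384]
```

There are 3072 tiles in total, which does not divide evenly by 7, so the last quad in the sequence ends up one tile short.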


What's interesting is that even without "meta-tiling", meta-tiles arise. These are areas where a board renders 4 or more adjoining tiles.

It seems to me there's no need to code a "meta-tiling" algorithm, in order to produce meta-tiles. They drop out, anyway.

Jawed
 
Here are the meta-tiles you might use when dealing with 7-quads:

b3d10.gif


Board 1's meta-tile is 24 tiles in size. Board 2's is 32 tiles.

You could make these meta-tiles twice as high (48 and 64 tiles, respectively).

Here is how these meta-tiles fill a 1024x768 screen:

b3d11.gif


Board 1's quads render 384 tiles each, whereas board 2's quads render 480 tiles each (i.e. each quad in board 2 does 25% more work).

Maybe there are better meta-tilings to make the rendering more equal per quad across both boards.

Jawed
 
It seems that everybody here believes that ATI will use supertiling for multichip rendering.

But supertiling will make everything much more complicated. IMHO ATI will use only a single split line and no supertiles.
 
Personally I'm not sold on any implementation yet. However, I'm not sure I buy that it is any more complex than a split: the same issues have to be solved, the hardware is already set up to cater for this mechanism, and they already have the supertiling method implemented in hardware solutions (E&S/SGI).

I guess the main advantage for a single split is render to texture locality.
 
DaveBaumann said:
Personally I'm not sold on any implementation yet. However, I'm not sure I buy that it is any more complex than a split: the same issues have to be solved, the hardware is already set up to cater for this mechanism, and they already have the supertiling method implemented in hardware solutions (E&S/SGI).

I guess the main advantage for a single split is render to texture locality.

Locality in many different parts is a big advantage. But this is not the only reason why I believe that ATI uses a split line. A little trip through the driver brought me to this result.
 
Demirug said:
DaveBaumann said:
Personally I'm not sold on any implementation yet. However, I'm not sure I buy that it is any more complex than a split: the same issues have to be solved, the hardware is already set up to cater for this mechanism, and they already have the supertiling method implemented in hardware solutions (E&S/SGI).

I guess the main advantage for a single split is render to texture locality.

Locality in many different parts is a big advantage. But this is not the only reason why I believe that ATI uses a split line. A little trip through the driver brought me to this result.

Yeah, a couple pages back I tried to lure them over to your other posts on the other thread re this matter, but they weren't having any. :(
 
When rendering to a texture (in non-AFR mode) you can have:
  a) lots of 16x16 pixel tiles spread evenly on both cards
  b) a few (meta-) tiles spread evenly
  c) a half of the texture rendered per card, delineated by a split line
Or you can ignore two-card parallelism:
  d) one card renders the texture
  e) both cards render the texture separately for their own use thereafter
a to d require that the texture is swapped with the other card.

a to d all incur the same fragment workload.

a to c have the same geometry workload.

d incurs extra geometry load on one card (compared with the other card).

e results in no gain in performance for render to texture (although no inter-card rendered texture swap is required at completion, so there's a minor performance gain in that sense).

Jawed
 
geo said:
Yeah, a couple pages back I tried to lure them over to your other posts on the other thread re this matter, but they weren't having any. :(

Tiling methods (meta- or not), generally, don't actually seem to perform any kind of load balancing, per se. At least that's how they seem to me. The tiling is "dumb", relying upon workload density across the frame to "even-out" simply by dishing out work in a round-robin fashion to fine-grained parallel work units.

In other words, at no point in the rendering of a frame does the driver, or the cards, decide to allocate tiles on the basis of workload. Load-balancing purely derives from "the law of averages" inherent in a fairly fine-grained workload split between the cards.

So, in that sense a split-line is equally dumb (horizontal or vertical).

It's worth saying that even with the split line implemented, both cards will still be rendering in tiles. The split line merely creates two huge meta-tiles, one per card.

Jawed
 
Jawed, do your tiling examples assume similar clock speeds? What happens when the 3-quad card has a higher clock than the 4-quad card and performance is not in a 3:4 ratio?
 
trinibwoy said:
Jawed, do your tiling examples assume similar clock speeds?
I don't assume anything about clockspeeds. The primary concept is that every quad (across both cards) is called the same number of times.

This tiling is based (naively) on descriptions of the E&S system - but that system uses identical cards, so there's no clock issue.

What happens when the 3-quad card has a higher clock than the 4-quad card and performance is not in a 3:4 ratio?
The clock speeds of the two cards make a mess, no matter if the two cards have the same number of quads or not :!: I don't know how you can account for clock speeds. This has always been a stumbling block.

As I was saying in my earlier message, ATI's tiling is not actively load-balancing. It relies upon the granularity of the tiles and "averaging" of the workload across the frame between the two cards.

Think of X850XT and X850XTPE - they have the same number of quads each, but one runs about 5% faster than the other. If the screen area is split exactly in half between them then the frame's render time is going to be determined by the slower card. The only saving grace is that it's only taking on half the workload.
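To put rough numbers on the "slower card sets the frame time" point, here is a small sketch. It assumes the work is spread uniformly over the screen and that a card's speed is a single relative number, neither of which holds exactly in practice; `frame_time` is a name invented for this example.

```python
# Illustrative sketch: a frame is finished only when the last card finishes.
def frame_time(shares, speeds):
    """shares: fraction of the frame's work given to each card.
    speeds: relative speed of each card (arbitrary units)."""
    return max(share / speed for share, speed in zip(shares, speeds))

speeds = [1.00, 1.05]  # the roughly 5% gap mentioned above, as an assumption

# Screen split exactly in half: the slower card sets the pace.
equal_split = frame_time([0.5, 0.5], speeds)

# Screen split in proportion to speed: both cards finish together.
total_speed = sum(speeds)
proportional_split = frame_time([s / total_speed for s in speeds], speeds)

print(equal_split)         # 0.5
print(proportional_split)  # about 0.488
```

With an equal split the faster card simply idles for the last few percent of every frame; a proportional split recovers that time only if the workload really is uniform across the screen.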

If Demirug's split-line is able to shift back and forth (let's say it shifts in increments of 16 pixels, equivalent to the side of one tile) then it would be possible for the driver to account for the different performance of the two cards. Of course, what I've just described is SFR.

Unfortunately Demirug is not exactly promoting discussion here, because he hasn't revealed his reasons (beyond the registry setting) for thinking that the split-line is the primary method in MVP :D

So, I dunno what to think about clock speeds. There are all sorts of gotchas here.

One gotcha to consider is the width of the memory bus. Imagine two lower mid-range cards, one with a 128-bit bus and the other with a 256-bit bus. Both have 2 quads and the same clock speed...

In my view, ATI's only opportunity for marketing differentiation against NVidia is to allow for "un-matched card" MVP. If all these gotchas actually prevent that, there's gonna be groans all round :LOL:

Who knows :?:

Jawed

(edited first paragraph to remove false sentence about "equal area halves")
 
Jawed said:
e results in no gain in performance for render to texture (although no inter-card rendered texture swap is required at completion, so there's a minor performance gain in that sense).
Since the bandwidth of the inter-card interface is way lower than that of the local memory interface, I think each card rendering the whole texture is the fastest method in quite a number of cases (like, <5 cycles per pixel maybe).
 
I've just realised that Demirug's split line may be ATI's solution to the unequal quads problem.

The split doesn't actually make two equal halves. It splits the screen in proportion to the number of quads on each board.

So, for example, with X800 Pro and X800XT (3 and 4 quads) you'd split the 48 tile lines of a 1024x768 screen in 3:4 proportion, i.e. the top section for the X800 Pro would consist of 21 lines and the bottom section for the X800XT would consist of 27 lines. Or 20 and 28.

Or if you want to account for the X800 Pro's lower clock speed, 18 and 30, say.

This way the split line is fixed, but takes account of the relative capability of each card. There is no dynamic load balancing, but each board's quads work according to their capability.

b3d12.gif


The X800 Pro's quads each render 427 tiles (except for quad 3, 426) and the X800XT's quads each render 448 tiles.
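The per-quad figures above correspond to the 20/28 split of the 48 tile lines. A short sketch of the arithmetic, assuming each board shares its tiles as evenly as possible among its own quads (the function name is made up for illustration):

```python
# Illustrative sketch: tiles per quad when a fixed split line divides the
# 48 tile lines of a 1024x768 screen (64 tiles per line) between two boards.
def tiles_per_quad(tile_lines, quads, tiles_per_line=64):
    total = tile_lines * tiles_per_line
    base, extra = divmod(total, quads)
    # The first `extra` quads take one tile more than the rest.
    return [base + 1 if q < extra else base for q in range(quads)]

exact_lines_for_pro = 48 * 3 / 7   # about 20.57, so 20 or 21 lines

pro_20 = tiles_per_quad(20, 3)     # X800 Pro, top 20 lines
xt_28 = tiles_per_quad(28, 4)      # X800XT, bottom 28 lines
pro_21 = tiles_per_quad(21, 3)     # the alternative 21/27 split
xt_27 = tiles_per_quad(27, 4)

print(pro_20, xt_28)  # [427, 427, 426] [448, 448, 448, 448]
print(pro_21, xt_27)  # [448, 448, 448] [432, 432, 432, 432]
```

Neither whole-line split gives exactly equal work per quad: 20/28 leaves the X800XT's quads with about 5% more tiles, and 21/27 leaves the X800 Pro's quads with about 4% more.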

Jawed
 
But would it be worth sacrificing dynamic load balancing to accommodate asymmetrical cards? Can nV's SFR mode be fixed to a specific split, to test this? I wonder if dynamic load balancing merely raises average fps, as opposed to also raising the all-important minimum fps?
 
Pete said:
But would it be worth sacrificing dynamic load balancing to accommodate asymmetrical cards?
ATI's tiling, whether small, meta- or split-line, is (in my opinion) static. There is no dynamic element in the load balancing.

To be honest, this is why I think the simple round-robin tiling I presented in:

http://www.beyond3d.com/forum/viewtopic.php?p=504639#504639

is the most flexible. There's no need to dynamically load-balance this, because any "intense" tiles will tend, by the law of averages, to be spread equally across all quads (and therefore both cards). Simply put, it is self-balancing.

The larger you make the areas rendered by one card (using meta-tiles, or using a split line), the less self-balancing the two cards become. But you get an advantage in terms of cache reuse (more chance a texture will be cached by the quad, due to locality of adjacent tiles, per quad) or at the very least some textures will only need to exist in the local memory of one card.
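A toy comparison shows the effect. The sketch below assumes two 4-quad boards, a made-up "intense" rectangle of tiles costing 10x a normal tile, and a fixed horizontal split at the middle of the screen; all of these are illustrative choices, not measurements.

```python
# Illustrative sketch: per-board load for round-robin tiling vs a fixed split
# line when one region of the frame is much more expensive than the rest.
TILES_X, TILES_Y = 64, 48  # 1024x768 in 16x16 tiles

def tile_cost(col, row):
    # Hypothetical hot spot in the lower half of the screen.
    return 10 if 10 <= col <= 28 and 30 <= row <= 41 else 1

def board_loads(owner):
    loads = [0, 0]
    for row in range(TILES_Y):
        for col in range(TILES_X):
            loads[owner(col, row)] += tile_cost(col, row)
    return loads

def round_robin_owner(col, row, quads_per_board=4):
    # Quads 0-3 belong to board 0, quads 4-7 to board 1.
    quad = (row * TILES_X + col) % (2 * quads_per_board)
    return 0 if quad < quads_per_board else 1

def split_line_owner(col, row, split_row=24):
    # Board 0 renders the top half, board 1 the bottom half.
    return 0 if row < split_row else 1

round_robin = board_loads(round_robin_owner)
split_line = board_loads(split_line_owner)
print(round_robin)  # [2616, 2508]
print(split_line)   # [1536, 3588]
```

In this contrived case the round-robin boards end up within about 4% of each other, while the board below the split line carries more than twice the load of the one above it. The sketch ignores the cache and texture locality benefits of larger areas, which is the other side of the trade-off.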

But I can see how the split line would solve the cases where the total number of quads is not a power of 2 (e.g. 7 quads as discussed already or 6 quads such as X700 Pro and X800XL)... It looks nice and simple. It sorta looks like two render targets joined together.

Jawed
 
It just doesn't add up for me. Notwithstanding pipeline count, clockspeed and efficiency differences between different chips, the fixed workload (in terms of screen space) doesn't make sense to me.

The rendering speed will be locked to the slower of the two cards anyway for any given frame. And the workload in the assigned part of the screen can change drastically as the view changes. This could make for dramatic frame-rate fluctuations. Some sort of adaptive load-balancing approach like Nvidia has with SFR still seems like the most sensible solution.

If ATI does allow mixing and matching of different cards they're going to have a hell of a time educating joe public about which combinations are workable. That leads me to think they're going to limit dual-gpu configs to similar cards.
 
trinibwoy said:
It just doesn't add up for me. Notwithstanding pipeline count, clockspeed and efficiency differences between different chips, the fixed workload (in terms of screen space) doesn't make sense to me.
Well that's how the E&S system does 24xAA with up to 32 cards: four cards each delivering 6xAA per tile; effectively an 8-quad screen tiling. There's no dynamic load balancing. There's no need with 16x16 pixel tiles because the variations in the intensity of work done per tile average out. There are thousands of these tiles (3072 at 1024x768 resolution).

Think of it as like having a tug of war. The more men (tiles) in each team, the less difference in overall performance a single man (tile) makes.

Obviously the more un-balanced the two cards the less performance gain you'll get. You'll be getting a maximum of twice the per-quad performance of the slowest card. If your faster card is more than twice as fast per-quad anyway, you're doomed. Note this is not overall card performance, but performance per quad. Unless you're geometry limited...

If Demirug's suggestion of a split line is implemented by ATI, you can salvage the otherwise doomed, badly unbalanced, configurations. But it seems unlikely to me that the split line will be dynamic, because that's SFR.

As a matter of interest, if you compare the theoretical pixel fill rate and AA sampling rates of X700XT and X800XTPE, you find that the per-quad performance of the X800XTPE is only 9% higher than X700XT. At first glance that makes these two cards highly compatible in a dual-card configuration. 9% is nothing.

Tomorrow I'll have a play with some more numbers along these lines, using measured performance rather than theoretical...

Jawed

(edited to add conclusion about per-quad theoretical performance)
 
Jawed, it looks like ATI prefers a vertical split for render targets and no split for R2T. The split position seems to be set as a percentage value. The driver uses this value to calculate the pixel position of the split. To make sure that the split is not inside a tile, the driver moves it to the next tile border if necessary.

A vertical split is better for most games, but it looks like ATI can control this on a per-game basis, like nVidia.
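The percentage-to-pixel calculation described here might look something like the sketch below. It is a guess at the logic, not the driver's code: the 16-pixel tile size is taken from earlier in the thread, "next tile border" is read as rounding up, the split position is treated as an x coordinate, and `split_position` is a hypothetical name.

```python
import math

# Illustrative sketch: turn a split percentage into a pixel position that
# never falls inside a tile.
def split_position(extent_px, percent, tile=16):
    raw = extent_px * percent / 100.0
    # Move to the next tile border if the raw position is inside a tile.
    snapped = math.ceil(raw / tile) * tile
    return min(snapped, extent_px)

print(split_position(1024, 50))  # 512, already on a tile border
print(split_position(1024, 43))  # 448, moved up from 440.32
```

A 43% setting (roughly 3:7) on a 1024-pixel-wide target lands at pixel 440.32, inside the 28th tile column, so it is pushed to the border at 448.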
 