Next-gen specs


Dave
01-Mar-2002, 17:55
I've just been doing some thinking lately about next-gen 3D chips. I'm curious what everyone thinks they should include, not only in general specs, but also advanced features. Now I'm not asking what everyone wants to see, I'm just asking what everyone expects to see. Be specific too, even if it is something at the core architecture.

Saem
01-Mar-2002, 18:36
Hrm, does anyone know how difficult it would be to design a chip that can switch per frame between TBR + OD, IM + OD (HyperZ style) and plain IM? That'd be interesting. If the T&L is being done on the GPU, wouldn't it be able to do a quick and dirty calculation to figure out which would be faster? If it's done on the CPU, it could be supported via an extension, which hopefully wouldn't be hard to implement. If some game has compatibility issues, the user could force a certain mode of operation or stop the card from using whichever mode is causing the problem. I just thought it'd be a neat idea.

I also would like to see greater than or equal to 32-bit internal rendering precision.

I would really really really like to see an OPEN STANDARD for hardware texture compression/decompression (bring back FXT1, if possible, seeing as it's already done) and vertex compression/decompression.

To tell you the truth, right now I'm more interested in bettering older features before "moving" onto new ones, I'm not concerned about shaders. I would like to see fill rate go up by improving the amount/use of bandwidth, I want to see compression schemes that will free up bandwidth to allow better FSAA. I'd also like to see more aggressive filtering methods. As soon as these happen, bring on more features.

Also, what's the feasibility (minimal increase in price and additional performance hit relative to current FSAA methods taking an equal number of samples) of implementing an SS FSAA method that changes the number of samples and their orientation on a 4*4 grid based on depth and location on the triangle? I know Smoothvision comes close to this, but I believe its sampling pattern doesn't adapt to location on the triangle and I don't think it reduces the number of samples based on distance.

arjan de lumens
01-Mar-2002, 22:42
Saem wrote:
Also, what's the feasibility (minimal increase in price and additional performance hit relative to current FSAA methods taking an equal number of samples) of implementing an SS FSAA method that changes the number of samples and their orientation on a 4*4 grid based on depth and location on the triangle? I know Smoothvision comes close to this, but I believe its sampling pattern doesn't adapt to location on the triangle and I don't think it reduces the number of samples based on distance.


I don't see how that could work at all - for normal supersampling operation, you would, for a given pixel, use the same set of sample points regardless of depth. If you don't do this, you end up Z testing and rendering different sets of sample points for different polygons touching the pixel, which would cause rendering errors. In particular, if you have two polygons sharing a common edge on the screen and you don't strictly force the two polygons to use the same sample pattern for each given pixel, you get a rather nasty-looking seam between them - if the sample point set used is not identical for both polygons, you end up with some sample points covered by both polygons and others covered by neither. The only time you can safely decide sample points for a pixel is before you render to it, and at that time you cannot know the depth of the pixel.
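The seam described above can be reproduced with a toy model. This is only a sketch under stated assumptions: one pixel holds four sample slots, a vertical shared edge cuts through it, polygon A covers everything left of the edge and polygon B everything from the edge rightwards. The function name `slot_coverage` and both sample patterns are made up for illustration and do not correspond to any real chip.

```python
def slot_coverage(pattern_a, pattern_b, edge_x):
    """Count, for each sample slot, how many of the two polygons write to it.

    pattern_a / pattern_b: hypothetical sample positions (x, y) inside a unit
    pixel, used when rasterizing polygon A / polygon B respectively.
    Polygon A covers x < edge_x, polygon B covers x >= edge_x.
    """
    hits = []
    for (ax, _ay), (bx, _by) in zip(pattern_a, pattern_b):
        hits.append(int(ax < edge_x) + int(bx >= edge_x))
    return hits


# Two made-up 4-sample patterns on a unit pixel (assumptions, not real hardware).
PATTERN_1 = [(0.125, 0.625), (0.375, 0.125), (0.625, 0.875), (0.875, 0.375)]
PATTERN_2 = [(0.625, 0.125), (0.125, 0.375), (0.875, 0.625), (0.375, 0.875)]

# Same pattern for both polygons: every slot is written exactly once.
same = slot_coverage(PATTERN_1, PATTERN_1, edge_x=0.5)
# Different pattern per polygon: some slots are written twice, some never.
mixed = slot_coverage(PATTERN_1, PATTERN_2, edge_x=0.5)
print(same)
print(mixed)
```

In the mixed case a slot with a count of 0 keeps whatever was behind the two polygons, which is the visible seam; a slot with a count of 2 is resolved by the Z test between samples taken at two different positions.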

Edit: Messed up quote tag

[ This Message was edited by: arjan de lumens on 2002-03-01 23:43 ]

SMarth
01-Mar-2002, 23:15
I expect next-gen graphics processors to give me very high quality anisotropic filtering and pretty good AA quality at a very low cost in speed (0-10%) at 1024*768*32 85Hz under virtually all circumstances.

I would expect core speed to be much faster than what we have today, with much more processing power for geometry and pixels. Extreme numbers mean nothing if they're for a simple transform with a single texture.

I would also expect memory bandwidth to go much higher than today, possibly with a 256-bit memory bus or larger. Integrating the memory chip die near the GPU core die in a single chip package will be great, but I expect it to happen.

Like someone else said, I expect current features to become more usable than they are today. The rest should go with much more flexible and programmable shaders.

Frankly, I'm disappointed with the performance of current high-end graphics cards; they are not worth the money.

nAo
02-Mar-2002, 01:26
According to many rumours floating around on the net we're not going to see a deferred renderer in the upcoming next generation GPUs, so I'm assuming we're stuck on IMRs.

I expect more bandwidth.
By that I mean two things: more raw bandwidth and more usable bandwidth (better efficiency).
More raw bandwidth could be provided in many ways:
1) increasing external data bus width
2) large on-chip memory

Point one could be accomplished by doubling the external data bus from 128 to 256 bits or with multichip configurations. It seems an expensive solution at this time, so I don't expect this. Point two requires non-standard design and cells, plus special foundry libraries and processes. It could be very expensive too if it significantly increases die area.
I believe the only two candidates to release a DX9-compliant part in a short time, ATI and NVIDIA, will not go down these routes.
(I'll be happy if someone tells me I'm wrong.)
I expect them to virtualize every single pool of bits allocated in external memory. They should cache everything on chip, so I see a large die area devoted to SRAM (they should move to a 0.13 micron process, and this should provide some more spare die area to play with). The problem here is to increase data access locality in space (better efficiency on memory) and time (better use of on-chip caches). How? I don't know :smile:
It seems both ATI and NVIDIA use some kind of hierarchical z-buffer, so I expect better use of it. Maybe it could be possible to extract and store information about scene depth and other useful things from a previous frame to make some useful speculation on the current frame.

Obviously we'll have more efficient and effective AA and anisotropic filtering, without a big hit on performance.

The hw will be way more programmable, with more complex (and faster) vertex and pixel shaders. As I wrote in another thread, I believe the pixel pipeline as a concept should be abandoned and replaced with a lot of independent (fully pipelined) functional units and a smart control/dispatch/issue unit (almost like a modern CPU but way more specialized in its tasks).

ciao,
Marco

MfA
02-Mar-2002, 02:03
A 10+ GFLOPS fully programmable geometry engine. Optimized for vertex shaders, but it should be general enough to be able to traverse scenegraphs, tessellate subdivision surfaces etc. Something like Imagine Stream, but with 4 way SIMD fp, or Bops. Give it a dedicated lighting unit for the standard lighting model though. It would probably be best to give it its own mechanism for sampling & filtering displacement maps.

Still thinking about the rasterizer.

Fred
02-Mar-2002, 02:15
Why is everyone so concerned about the speed hit when talking about FSAA and anisotropic filtering?

Raw numbers are obviously more important.

If it's between having a gpu that can do
scene 1 (no fsaa and filtering) at 300 fps
scene 2 (full features on) at 60 fps

and a 2nd gpu
scene 1 80fps
scene 2 60fps

then count me in for the first, even though the second has a much lower percentage hit.
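The arithmetic behind this comparison can be written out as a small sketch. The two GPUs and their frame rates are the hypothetical ones from the post; the helper names `percent_hit` and `frame_time_ms` are made up for illustration.

```python
def percent_hit(fps_features_off, fps_features_on):
    """Relative frame-rate drop, in percent, when FSAA/filtering is enabled."""
    return 100.0 * (fps_features_off - fps_features_on) / fps_features_off


def frame_time_ms(fps):
    """Time spent per frame, in milliseconds."""
    return 1000.0 / fps


# Hypothetical GPUs from the post: frame rate with features off / on.
gpus = {"GPU 1": (300, 60), "GPU 2": (80, 60)}

for name, (fps_off, fps_on) in gpus.items():
    print(name,
          "hit: %.0f%%" % percent_hit(fps_off, fps_on),
          "frame time with features on: %.1f ms" % frame_time_ms(fps_on))
```

Both parts end up at the same frame time with all features on; the percentage hit only differs because the first part is so much faster with the features off.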

pascal
02-Mar-2002, 04:10
Multichip scalable solution

chip One (service chip):
-AGP interface
-dual 400MHz RAMDAC
-2D video control
-fast crossbar switch
-3 HT links to chip Two, Three and Four
-Its own 32MB DDR
-HDTV
-video capture with mpeg-2
-VR interface
-firewire
-some fancy functions

chip Two (3D GPU):
-256bits 300MHz DDR interface
-12MB edram
-fully programmable RISC T&L
-fully programmable multipipe RISC rasterizer
-1 HT link to chip one.

Chip Three and Four: the same as Two

Combine it as you wish :cool:

[ This Message was edited by: pascal on 2002-03-02 05:15 ]

SMarth
02-Mar-2002, 04:37
> Fred

I don't need 300fps, but I do HATE the jumpy, inconsistent frame rates we have today; a high frame rate usually makes those less frequent or noticeable. Of course it would help if some game programmers put a little more effort into managing data flows. But anyway, you don't get that kind of high frame rate with today's complex engines, so if the price for good AA and aniso is too high then it won't be usable.

Besides, I want next-gen to reach a point where image quality features like good AA and very high quality texture filtering are what everybody expects, just like high resolutions, 32 bits and multitexturing are today. What good is 9129341827378253fps with low-res 8-bit non-textured flat-shaded polygons, even if card A does it faster than card B?

I would also like to see constant frame rates with vsync, that's an old dream of mine, but that one goes against the culture of computer games and even if it wasn't the case it's not simple to achieve in an unstable/variable pc environment :sad:



[ This Message was edited by: SMarth on 2002-03-02 05:55 ]

pascal
02-Mar-2002, 05:04
Improving the first post.
Multichip scalable solution

chip One (service/image/geometry chip):
-AGP interface
-fast/low latency crossbar switch with multicast
-4 HT links to chip Two, Three, Four and Five
-Its own 256MB 250MHz DDR 128bits
-Virtual memory management
-fully programmable RISC T&L (geometry)
-HDTV
-fully programmable video/image RISC capable of video capture with mpeg-2
-dual 400MHz RAMDAC
-2D video control
-VR interface
-firewire
-some fancy functions

chip Two, Three, Four and Five (3D rasterizer):
-64MB 128bits 300MHz DDR
-12MB edram
-fully programmable multipipe RISC rasterizer
-stochastic multisampling FSAA.
-1 HT link to chip one.

Starting with one service/geometry chip and two rasterizers.

LittlePenny
02-Mar-2002, 06:01
2D should go where it belongs, and that's on the motherboard. We need dedicated 3D-only cards. I expect to actually see the first attempts at higher bit precision.

pcchen
02-Mar-2002, 07:22
The MB is not a good place for a display chip. It is hard to preserve signal quality on a motherboard.

Blitzkrieg
02-Mar-2002, 08:26
How about a dual Kyro2 at .15 micron?
It would be only 25 million transistors, still very small, sharing one 128-bit DDR bus at 250 core?
Maybe they could also add a programmable T&L and free high-quality aniso. Voila, the perfect GPU :grin:

V3
02-Mar-2002, 09:40
Not exactly a feature, but a chip using eDRAM, like what BitBoys promised, would be a cool way to solve the bandwidth problem.

Higher precision of everything would also be cool.

darkblu
04-Mar-2002, 09:35
pcchen wrote:
The MB is not a good place for a display chip. It is hard to preserve signal quality on a motherboard.

after the DAC, you mean. well, it may not be an issue once the DVI gets fully adopted.

zurich
04-Mar-2002, 13:24
pascal,

wow, sounds like the PS2 :smile:

EE = fully programmable "service chip"
GS = rasterizer w/eDRAM

I know Sony's design didn't work out exactly as well as they hoped, but they were on the right track. Not bad for 1999 :wink:

zurich

pascal
04-Mar-2002, 14:21
Maybe like Rampage.

My 1999 card was the TNT2 16MB :smile:

A multichip solution is probably not viable in the current marketplace. Let's redesign the chip:
- year 2003
- AGP 8X
- 12MB .13 micron edram 20GB/s
- 300MHz DDR 128bits 10GB/s
- 125 million polygons/sec programmable geometry engine
- 4 dual pixel programmable processor pipeline
- 8 stage loopback
- 64bits precision
- DX9 and OpenGL 2.0

Well, it is just a dream :rollseyes:
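As a rough check on the external memory figures in this spec, peak bandwidth follows from clock, bus width and transfers per clock. The sketch below assumes "GB" means 10^9 bytes and that DDR moves two transfers per clock; the helper name `peak_bandwidth_gb_s` is made up for illustration, and real sustained bandwidth would be lower than this theoretical peak.

```python
def peak_bandwidth_gb_s(clock_mhz, bus_bits, ddr=True):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes, assumed)."""
    transfers_per_clock = 2 if ddr else 1   # DDR: data on both clock edges
    bytes_per_transfer = bus_bits / 8
    return clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


# 300MHz DDR on a 128-bit bus, as in the list above.
print(peak_bandwidth_gb_s(300, 128))
# The same memory on a 256-bit bus, as discussed earlier in the thread.
print(peak_bandwidth_gb_s(300, 256))
```

Under these assumptions the 300MHz DDR 128-bit entry works out to 9.6 GB/s, which the list rounds to 10GB/s.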

nooneyouknow
04-Mar-2002, 19:52
Considering Samsung just announced 400 MHz DDR, you may want to update your spec.

pascal
04-Mar-2002, 20:26
It was supposed to be a low cost card :grin:

A high-end version will have something like the 400MHz DDR. :wink:

psurge
05-Mar-2002, 01:18
The geometry engine should be massively parallel with SIMD FUs - something like 16 FMACs, 4 pipelined FDIVs, and 4 pipelined FSQRT units.

Maybe support simultaneous processing of multiple vertex programs at a time (better FU utilization), or at least support for executing multiple independent instructions simultaneously.

Support for more sophisticated control flow, creating/destroying vertices/triangles, displacement mapping, traversing scenegraphs, tessellation, etc...

Output of the geometry processing should go to a special HZ buffer. This buffer would store data ready to render in a spatial hierarchy (very much the way a tiler bins geometry).

The data stored would consist of geometry and associated state information. Geometry entering the buffer could throw out geometry fully occluded by it. Bounding volume occlusion queries would be supported.

The rasterizer would be able to pick geometry to render from the buffer based on required state changes as well as screen location.

The actual buffer would consist of say 4MB eDRAM - a large intelligent cache between the geometry and rendering processors. It would give some of the benefits of tilers without having to go the fully deferred rendering route.

The rasterizer :
- dynamically allocatable TUs and FUs
- high quality anisotropic filtering
- 64bit fp internal precision for everything
- ability to work on pixels from more than one triangle at a time
- more flexible frame buffer format (as in an F-buffer or R-buffer)
- better FSAA (adaptive number of samples per pixel, jittered sample positions)
- FUs separated from the concept of a pixel pipe. in the programmable model, a pixel pipe would basically consist of state specific to a pixel (like a register file and a program counter for instance).

JF_Aidan_Pryde
05-Mar-2002, 04:29
It's a completely different matter between what we are expecting and what we *would rather have.* Most people are listing the latter.

I expect higher order internal precision and better bandwidth management. Overbright lighting will be possible with next generation hardware. Higher numbers of operations and textures per pass is another near certainty. New methods of programmability for shaders are also on the table.

What I'd like to see is much like Gary Tarolli's vision. (I think)

One board with four sockets. You buy the board with one socket populated, each chip has 12MB of EDRAM. By adding another chip, you double the fillrate and bandwidth. Geometry processing is also scaled. You can buy additional chips as you need and you may end up with all four sockets filled and quadruple the original theoretical power. This configuration offers virtually unlimited scalability, no input latency and near linear returns.

fresh
05-Mar-2002, 19:44
On 2002-03-04 14:24, zurich wrote:
pascal,

wow, sounds like the PS2 :smile:

EE = fully programmable "service chip"
GS = rasterizer w/eDRAM

I know Sony's design didn't work out exactly as well as they hoped, but they were on the right track. Not bad for 1999 :wink:

zurich


The PS2 design was locked down in 1997 actually.

LittlePenny
05-Mar-2002, 20:48
JF_Aidan_Pryde wrote:
What I'd like to see is much like Gary Tarolli's vision. (I think)

One board with four sockets. You buy the board with one socket populated, each chip has 12MB of EDRAM. By adding another chip, you double the fillrate and bandwidth. Geometry processing is also scaled. You can buy additional chips as you need and you may end up with all four sockets filled and quadruple the original theoretical power. This configuration offers virtually unlimited scalability, no input latency and near linear returns.


The problem with this is that it still has to deal with Amdahl's law for parallel processing. http://www.umr.edu/~buechler/img038.gif

[ This Message was edited by: LittlePenny on 2002-03-05 21:49 ]

pascal
05-Mar-2002, 21:05
[Guessing mode on]
This is really a paradox, but if we keep a very fast central geometry unit chip with four parallel rasterizer chips (doing heavy stochastic 4x FSAA, 64-tap aniso, 64-bit precision, 8-level multitexture) then:

S = 10% (superguess); P = 90%; N = 4 =>

Speedup = 1 / (0.1 + 0.9/4) ≈ 3.1

This is a good speedup :cool:
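The estimate above is Amdahl's law, speedup = 1 / (S + P/N), with S the serial fraction, P = 1 - S the parallel fraction and N the number of parallel units. A minimal sketch, using the guessed 10% serial fraction from the post (the function name `amdahl_speedup` is made up for illustration):

```python
def amdahl_speedup(serial_fraction, n_units):
    """Amdahl's law: overall speedup when only the parallel part scales."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


# The guessed numbers from the post: S = 10%, N = 1..4 rasterizer chips.
for n in (1, 2, 3, 4):
    print(n, round(amdahl_speedup(0.10, n), 2))

# With S = 10% the speedup can never reach 1 / 0.10 = 10,
# no matter how many rasterizers are added.
print(round(amdahl_speedup(0.10, 1000), 2))
```

The result is quite sensitive to the guessed serial fraction, which is why reducing the serial work (as the multicast idea tries to do) matters more as chips are added.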

edited: some kind of multicast crossbar switch to control multiple rasterizers simultaneously will minimize the serial work like texture, polygons info transfer, etc... :cool:
[Guessing mode off]

I like that :wink:

[ This Message was edited by: pascal on 2002-03-05 22:08 ]

psurge
06-Mar-2002, 01:28
JF, i dunno about the multichip thing. You still have the issue of duplicating textures to each chip's memory, and if you want geometry to scale as well, you need to somehow split the geometry between chips. If each chip is processing a separate portion of the framebuffer, this could mean bursts of activity in one chip while the others are idle. So I seriously doubt that you would see anything close to a 4-fold performance increase...

i listed what i want to see because Dave did ask what we thought "they should include". As far as what I expect, i pretty much second what you said :

more internal precision, gamma correction of all color data in hardware, somewhat more programmability, maybe an increase in the number of vertex/pixel pipes. One thing however - given the problems with .13 and lower processes ATM, i'm thinking that maybe GPU core clock should be doubled (instead of adding more pipes). Intel/AMD are at 1GHz+ on 0.18 aluminum processes, how about targeting 733 or 800MHz on 0.15 for a GPU?

On the memory side of things, i expect a slight increase in bandwidth. Maybe a more advanced/efficient occlusion culling scheme that supports bounding volumes on geometry data, but i doubt it. I don't expect much in the FSAA quality department.

Serge

MfA
06-Mar-2002, 01:45
I don't think I would want the heatsink/fan combo I have on my processor on my graphics card though :)

psurge
06-Mar-2002, 06:21
MfA,

damnit :smile: forgot all about that. Do you think it would be possible to run a 0.15u GPU with, say, 65 million transistors at 800MHz, assuming better packaging (integrated heatspreader), off of AGP power?

i know next to nothing about power/heat related issues - any links to places where i could read up on this stuff?

Thanks,
Serge

JF_Aidan_Pryde
06-Mar-2002, 07:17
I'm not sure on the specifics, but according to SA's post, texture data can be shared and geometry scaling will not be a large problem. The two obvious problems of load balancing and distributing the triangles to the correct chip were said to have straightforward and effective solutions (he refused to elaborate). And I meant 32MB of EDRAM for each chip, not 12. :smile:

Not sure if virtualized textures will help here..

darkblu
06-Mar-2002, 09:11
JF_Aidan_Pryde wrote:
The two obvious problems of load balancing and distributing the triangles to the correct chip were said to have straightforward and effective solutions (he refused to elaborate).

SLI actually yields pretty good load balancing.

MfA
06-Mar-2002, 09:37
SLI, yuck ... to me it always seemed like an exercise in how to fuck up the texture caching.

Multiple scanline or tile interleaving ala BB's to me seems much more palatable.

Psurge, your guess is as good as mine ... my guess is no, power consumption doesn't scale down that fast.

[ This Message was edited by: MfA on 2002-03-06 10:40 ]

darkblu
06-Mar-2002, 09:44
MfA wrote:
Multiple scanline or tile interleaving ala BB's to me seems much more palatable.

not sure what you mean by BB here, but the greater the granularity of the subdivision elements the poorer the load balancing. caching is auxiliary to the problem of load balancing -- a given architecture may not require caching at all.

MfA
06-Mar-2002, 10:40
BitBoys.

As for balancing, a tile is smaller than a scanline ... so if we had infinite input buffer size balancing would actually be better. We don't, but I doubt it's an issue.

Even with unique texturing, texels are used repeatedly; inability to make full use of that fact is hardly an indicator of good balancing IMO. Even with multiple chips, external memory bandwidth is expensive.

Marco

[ This Message was edited by: MfA on 2002-03-06 11:47 ]

darkblu
06-Mar-2002, 11:39
i'm not familiar with any of the bitboys' designs, can't comment. still,

MfA wrote:
As for balancing, a tile is smaller than a scanline ... so if we had infinite input buffer size balancing would actually be better. We don't, but I doubt it's an issue.

by granularity of the subdivision regions i didn't mean their integral area pixel-wise but rather the chance of a primitive to fall completely (or largely) within the bounds of one such subdivision region, thus compromising the load balance. here i put the focus on the coverage of a single primitive, as trying to infer a coverage pattern scene-wise would be considerably (statistically) harder. so back to scanline-vs-tile granularity. in your opinion, which is more likely to occur: a single-pixel-tall, many-pixels-wide polygon, or a polygon which falls within, say, a 32x32 pixel tile?

MfA wrote:
Even with unique texturing, texels are used repeatedly; inability to make full use of that fact is hardly an indicator of good balancing IMO. Even with multiple chips, external memory bandwidth is expensive.

what i said was purely hypothetical, and was meant to imply that caching or the lack thereof is a matter of particular implementation, vs. the subject of load balancing which is a statistical matter. speaking of implementations, nobody ever claimed an SLI architecture should not use a multiported texture cache, providing 100% texel reuse among the scanlines.

ed: quote tags

[ This Message was edited by: darkblu on 2002-03-06 12:40 ]

MfA
06-Mar-2002, 11:53
How does it matter if a tri is completely rendered by a single chip? The other chip just moves on to the next tri; sometimes one will pull ahead, sometimes the other ... that's what buffering is for.

How would a blah-cache help? In one situation both chips are accessing the same texels, which is inefficient, in the other they are mostly accessing different texels. The only option to get around it is for multiple chips to share a cache ... which is clearly not an option.

Marco

PS. 32x32 is huge, think 4x4 or 8x8.

[ This Message was edited by: MfA on 2002-03-06 13:10 ]

JF_Aidan_Pryde
06-Mar-2002, 12:22
Originally Posted by SA:

--------------------------------------

Highly scalable problems such as 3d graphics and physical simulation should get near linear improvement in performance with transistor count as well as frequency. As chips specialized for these problems become denser they should increase in performance much more than CPUs for the same silicon process. This means moving as much performance sensitive processing as possible from the CPU to special purpose chips. This is quite a separate reason for special purpose chips than simply to implement the function directly in hardware so as to be able to apply more transistors to the computations (as mentioned in my previous post). It also applies to functions that require general programmability (but are highly scalable). General programmability does not preclude linear scalability with transistor count. You just need to focus the programmability to handle problems that are linearly scalable (such as 3d graphics and physical simulation). It makes sense of course to implement as many heavily used low level functions as possible directly in hardware to apply as many transistors as possible to the problem at hand.

The other major benefit from using special purpose chips for highly scalable, computation intensive tasks is the simplification and linear scalability of using multiple chips. This becomes especially true as EDRAM arrives.

The MAXX architecture requires scaling the external memory with the number of chips, so does the scan line (band line) interleave approach that 3dfx used. With memory being such a major cost of a board, and with all those pins and traces to worry about, it is a hard and expensive way to scale chips (requiring large boards and lots of extra power for all that external memory). The MAXX architecture also suffers input latency problems limiting its scalability (you increase input latency by one frame time with each additional chip). The scan line (band line) method also suffers from caching problems and lack of triangle setup scalability (since each chip must set up the same triangles redundantly).

With EDRAM, the amount of external memory needed goes down as the number of 3d chips increase. In fact, with enough EDRAM, the amount of external memory needed quickly goes to 0. EDRAM based 3d chips are thus ideal for multiple chip implementations. You don't need extra external memory as the chips scale (in fact you can get by with less or none), and the memory bandwidth scales automatically with the number of chips.

To make the maximum use of the EDRAM approach, the chips should be assigned to separate rectangular regions or viewports (sort of like very large tiles). The regions do not have their rendering deferred (although they could of course), they are just viewports. This scaling mechanism automatically scales the computation of everything: vertex shading, triangle setup, pixel operations, etc. It does not create any additional input latency, allows unlimited scalability, and does not require scaling the memory as required by the previously mentioned approaches.

Tilers without EDRAM also scale nicely without needing extra external memory. They are in fact, the easiest architecture to scale across multiple chips. You just assign the tiles to be rendered to separate chips rather than the same chip. The external memory requirements while remaining constant, do not drop however, as they do with EDRAM. The major problem to deal with is scaling the triangle operations as well as the rendering. In this case, combining the multi-chip approach mentioned for EDRAM with tiling solves these issues. You just assign all the tiles in a viewport/region to a particular chip. Everything else is done as above and has the same benefits.

In my mind, the ideal 3d card has 4 sockets and no external memory. You buy the card with one socket populated at the cost of a one chip card. The chip has 32 MB of EDRAM, so with 1 chip you have a 32MB card. When you add a second chip you get a 64MB card with double the memory bandwidth and double the performance. For those who go all out and decide to add 3 chips, they get 128 MB of memory, and quadruple the memory bandwidth and performance. Ideally, the chip uses some form of occlusion culling such as tiling, or hz buffering with early z check, etc. Using the same compatible socket across chip generations would be a nice plus.

In the long run I agree with MFA. Using scene graphs or a similar spatial hierarchy simplifies and solves most of these problems, including accessing, transforming, lighting, shading, and rendering, only what is visible. They also simplify the multiple chip and virtual texture and virtual geometry problems. We will need to wait a bit longer for it to appear in the APIs though.

There are indeed two problems generally associated with partitioning the screen across multiple chips: load balancing, and distributing the triangles to the correct chip. Both have fairly straightforward, very effective solutions, though I can't mention the specifics here.
Those are some good comments, MFA. However, there is no need to defer rendering and no need for a large buffer. Each chip knows which vertices/triangles to process, without waiting.

----------------------------------------

darkblu
06-Mar-2002, 13:25
MfA wrote:
How does it matter if a tri is completely rendered by a single chip? The other chip just moves on to the next tri; sometimes one will pull ahead, sometimes the other ... that's what buffering is for.

the other chip cannot just move to the next triangle as those two triangles may overlap and hence determinism conflicts could occur, which to be resolved would need imposing artificial latencies to the parallelism (i.e. clock-for-clock offsetting). that's why we need a scheme, which, for each moment T, would keep strict partitioning (read non-overlapping) across the frame regions, so that at said moment T chipN would not spatially overlap with chipN+1. now, last time we were discussing the effectiveness of that frame partitioning, namely scanlines vs. tiles.


MfA wrote:
How would a blah-cache help? In one situation both chips are accessing the same texels, which is inefficient,

it wouldn't be inefficient if such an access could come w/o penalties, and a multi-ported, shared cache could help here.


MfA wrote:
in the other they are mostly accessing different texels. The only option to get around it is for multiple chips to share a cache ... which is clearly not an option.


now, why would that not be an option?


MfA wrote:
PS. 32x32 is huge, think 4x4 or 8x8.

well, i picked 32x32 due to historical powerVR reasons (as i told you i'm not familiar with bitboys' design)

Dave Baumann
06-Mar-2002, 13:34
AFAIK PowerVR is more like 16x32 (or 32x16).

darkblu
06-Mar-2002, 13:40
Dave Baumann wrote:
AFAIK PowerVR is more like 16x32 (or 32x16).

ok, my bad.

MfA
06-Mar-2002, 15:48
Darkblu, we are talking about different forms of tiles I think ... I just meant a static assignment of screen tiles to individual chips (tiles as in a tiled framebuffer, such as nearly every traditional 3D chip uses). So there can be no overlap with subsequent tris on the same chip if it skips it (it's invisible to it).

As for shared caches, such a high bandwidth path takes lots of pins ... reducing redundancy of reads between the chips (through multi-scanline-SLI or tiling) and using the saved pins for more memory bandwidth is a clear win.

darkblu
06-Mar-2002, 16:16
MfA wrote:
Darkblu, we are talking about different forms of tiles I think ... I just meant a static assignment of screen tiles to individual chips [snip] So there can be no overlap with subsequent tris on the same chip if it skips it (it's invisible to it).

exactly from there stems the whole problem with load balancing -- a chip processes only the triangles, and only the portion of those, which fall within its tile/scanline/whatever-subdivision-element-our-framebuffer-divides-into. that's why there's a chance that chip N does more work than chip M, as it's possible that there's higher triangle coverage in its portion of the framebuffer than in chip M's portion. my whole point being that subdividing the framebuffer into scanlines gives better (statistically) load balance in terms of triangle area that each chip has to process, than, say, tiles of 16x16.
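The disagreement in this exchange can be made concrete with a toy count of pixels per chip. This is only a sketch under stated assumptions: an axis-aligned box stands in for a single primitive, two chips split the screen, scanlines are interleaved by `y % n`, and tiles are assigned in a checkerboard. All names (`pixels_per_chip`, `scanline_interleave`, `tile_interleave`) and the assignment schemes are made up for illustration, not taken from any real chip.

```python
def pixels_per_chip(box, assign, n_chips):
    """Count how many pixels of an axis-aligned box land on each chip.

    box: (x0, y0, x1, y1) with exclusive upper bounds, standing in for a primitive.
    assign: function mapping a pixel (x, y) to a chip index.
    """
    x0, y0, x1, y1 = box
    counts = [0] * n_chips
    for y in range(y0, y1):
        for x in range(x0, x1):
            counts[assign(x, y)] += 1
    return counts


def scanline_interleave(n_chips):
    # SLI-style: consecutive scanlines go to consecutive chips.
    return lambda x, y: y % n_chips


def tile_interleave(n_chips, tile):
    # Static checkerboard-style assignment of square screen tiles to chips.
    return lambda x, y: (x // tile + y // tile) % n_chips


primitive = (2, 2, 14, 14)   # a 12x12 pixel box
sli = pixels_per_chip(primitive, scanline_interleave(2), 2)
big_tiles = pixels_per_chip(primitive, tile_interleave(2, 16), 2)
small_tiles = pixels_per_chip(primitive, tile_interleave(2, 4), 2)
print(sli, big_tiles, small_tiles)
```

For this one primitive, scanline interleave splits the work evenly and 16x16 tiles put everything on one chip, which is darkblu's point; 4x4 tiles split it evenly again, which is closer to the tile sizes MfA suggests. It says nothing about whole-scene statistics or about buffering between chips.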

Simon F
06-Mar-2002, 16:46
DarkBlu,
Naomi 2 does the distributed tile method (using 2 rendering chips), and AFAICS, the load balancing was pretty good.

You might get slightly better balancing with a scan line approach but, IMHO, you lose far more due to the decrease in data reuse (i.e. texture caching effectiveness decreases) and the fact that nearly all triangles have to be processed by both rendering chips.
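The second cost mentioned here, that nearly every triangle has to be processed by both chips under scanline interleave, can be counted in the same toy setting. Again a sketch under assumptions: small axis-aligned boxes stand in for triangles, there are two chips, and the tile assignment is a made-up checkerboard of 16x16 tiles. The names `chips_touched` and `total_setups` are hypothetical.

```python
N_CHIPS = 2
TILE = 16


def scanline_chip(x, y):
    # Scanline interleave: row y belongs to chip y mod N_CHIPS.
    return y % N_CHIPS


def tile_chip(x, y):
    # Checkerboard assignment of 16x16 screen tiles (an assumption).
    return (x // TILE + y // TILE) % N_CHIPS


def chips_touched(box, assign):
    """Set of chips that own at least one pixel of the box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return {assign(x, y) for y in range(y0, y1) for x in range(x0, x1)}


def total_setups(boxes, assign):
    """Total per-chip primitive setups: each chip touching a box must set it up."""
    return sum(len(chips_touched(box, assign)) for box in boxes)


# Four small 6x6 primitives, each sitting inside a different 16x16 tile.
boxes = [(0, 0, 6, 6), (16, 0, 22, 6), (0, 16, 6, 22), (16, 16, 22, 22)]
sli_setups = total_setups(boxes, scanline_chip)
tile_setups = total_setups(boxes, tile_chip)
print(sli_setups, tile_setups)
```

In this example every primitive taller than one scanline is set up on both chips under scanline interleave, while each is set up once under tiling. Primitives that straddle tile borders would narrow the gap, and texture cache reuse is not modelled at all.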

darkblu
06-Mar-2002, 17:03
Simon,

i know naomi2 did distributed tiling, and that it was said to be pretty good at that. but that does not change the fact that purely statistically using lower-granularity frame subdivision elements should produce better load balancing. AAMOF, distributing 1 pixel to a chip would produce the best possible load balancing, and that's what multiple pixel pipes on a chip do -- they achieve the optimal load balance* across the chip.

it appears people seem to think of SLI in terms of 3dfx's particular implementation, which had its flaws. if naomi2 could be fairly efficient at doing tiles of, erm 32x16, then i see no reason an SLI architecture should not be able to do the same, only at higher load balance efficiency.



*optimal load balance: when no pixel pipeline stays idle when there are still pixels to be drawn.

MfA
06-Mar-2002, 17:18
Once again I have to retort with the fact that a tile is smaller than a scanline :) Statistically speaking, if we assume an infinite buffer and no stalls, only the number of pixels in the screen divisions determines the distribution of the difference in work between the two chips.

BTW if you use tilers the different chips would not need to render the screen in strict order, if a tile is done you move on to the next tile ... no stalls no hassle.

Marco

[ This Message was edited by: MfA on 2002-03-06 18:24 ]

darkblu
07-Mar-2002, 07:32
MfA wrote:
BTW if you use tilers the different chips would not need to render the screen in strict order, if a tile is done you move on to the next tile ... no stalls no hassle.

ok, under the premise that a chip could move on to a pending tile at any time _and_ tile size is reasonably small then yes, that'd be close to optimal load balance (optimal reached with tiles of 1x1).

Simon F
07-Mar-2002, 08:22
DarkBlu,
I know what you are trying to achieve by insisting on having the smallest possible level of 'unit of work' granularity, but you are ignoring a competing factor.

What is the point of, say, saving 5% of the frame rendering time due to the (possibly better) load balancing of SLI if the relative bandwidth requirements double, as would (typically) happen with the texturing?

Marco wrote:
BTW if you use tilers the different chips would not need to render the screen in strict order, if a tile is done you move on to the next tile ... no stalls no hassle.

I'm not sure that's a good idea because then you'd either have to distribute the database to both chips' local memory or have a shared memory system.
_________________
"Your work is both good and original. Unfortunately the part that is good is not original and the part that is original is not good." - Samuel Johnson


[ This Message was edited by: Simon F on 2002-03-07 09:26 ]

MfA
07-Mar-2002, 09:45
That depends on where the tris have to come from. To the tiler it doesn't really matter whether they were sent to it beforehand or whether it gets them on the fly during rendering.