Next-gen specs

It's a completely different matter, what we are expecting versus what we *would rather have.* Most people are listing the latter.

I expect higher internal precision and better bandwidth management. Overbright lighting will be possible with next-generation hardware. Higher numbers of operations and textures per pass are another near certainty. New methods of programmability for shaders are also on the table.

What I'd like to see is much like Gary Tarolli's vision. (I think)

One board with four sockets. You buy the board with one socket populated; each chip has 12MB of EDRAM. By adding another chip, you double the fillrate and bandwidth. Geometry processing is also scaled. You can buy additional chips as you need them, and you may end up with all four sockets filled and quadruple the original theoretical power. This configuration offers virtually unlimited scalability, no input latency and near-linear returns.
 
On 2002-03-04 14:24, zurich wrote:
pascal,

wow, sounds like the PS2 :smile:

EE = fully programmable "service chip"
GS = rasterizer w/eDRAM

I know Sony's design didn't work out exactly as well as they hoped, but they were on the right track. Not bad for 1999 ;)

zurich

The PS2 design was locked down in 1997 actually.
 
What I'd like to see is much like Gary Tarolli's vision. (I think)

One board with four sockets. You buy the board with one socket populated; each chip has 12MB of EDRAM. By adding another chip, you double the fillrate and bandwidth. Geometry processing is also scaled. You can buy additional chips as you need them, and you may end up with all four sockets filled and quadruple the original theoretical power. This configuration offers virtually unlimited scalability, no input latency and near-linear returns.

The problem with this is that it still runs into Amdahl's law for parallel processing.
[attached image: img038.gif]


 
[Guessing mode on]
This is really a paradox, but if we keep a very fast central geometry unit chip with four parallel rasterizer chips (doing heavy stochastic 4x FSAA, 64-tap aniso, 64-bit precision, 8-level multitexture) then:

S = 10% (serial part, a superguess); P = 90%; N = 4 =>

Speedup = 1 / (0.1 + 0.9/4) ≈ 3.1

This is a good speedup :cool:

edited: some kind of multicast crossbar switch to control multiple rasterizers simultaneously would minimize the serial work (texture and polygon info transfer, etc.) :cool:
[Guessing mode off]
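A minimal sketch of that Amdahl's law estimate (the 10%/90% split is the superguess above; the chip counts are just for illustration):

```python
# Amdahl's law: speedup = 1 / (S + P/N), with P = 1 - S.
def amdahl_speedup(serial_fraction: float, n_chips: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_chips)

for n in (1, 2, 4):
    print(n, "chips ->", round(amdahl_speedup(0.10, n), 2))
# 1 chips -> 1.0, 2 chips -> 1.82, 4 chips -> 3.08
```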

I like that ;)

 
JF, i dunno about the multichip thing. You still have the issue of duplicating textures to each chip's memory, and if you want geometry to scale as well, you need to somehow split the geometry between chips. If each chip is processing a separate portion of the framebuffer, this could mean bursts of activity in one chip while the others are idle. So I seriously doubt that you would see anything close to a 4-fold performance increase...

i listed what i want to see because Dave did ask what we thought "they should include". As far as what I expect, i pretty much second what you said:

more internal precision, gamma correction of all color data in hardware, somewhat more programmability, maybe an increase in the number of vertex/pixel pipes. One thing however - given the problems with 0.13 and lower processes ATM, i'm thinking that maybe the GPU core clock should be doubled (instead of adding more pipes). Intel/AMD are at 1GHz+ on 0.18 aluminum processes, so how about targeting 733 or 800MHz on 0.15 for a GPU?

On the memory side of things, i expect a slight increase in bandwidth. Maybe a more advanced/efficient occlusion culling scheme that supports bounding volumes on geometry data, but i doubt it. I don't expect much in the FSAA quality department.

Serge
 
I don't think I would want the heatsink/fan combo I have on my processor on my graphics card though :)
 
MfA,

damnit :smile: forgot all about that. Do you think it would be possible to run a 0.15u GPU with, say, 65 million transistors at 800MHz, assuming better packaging (integrated heat spreader), off of AGP power?

i know next to nothing about power/heat related issues - any links to places where i could read up on this stuff?

Thanks,
Serge
 
I'm not sure on the specifics, but according to SA's post, texture data can be shared and geometry scaling will not be a large problem. The two obvious problems of load balancing and distributing the triangles to the correct chip were said to have straightforward and effective solutions (he refused to elaborate). And I meant 32MB of EDRAM for each chip, not 12. :smile:

Not sure if virtualized textures will help here...
 
The two obvious problems of load balancing and distributing the triangles to the correct chip were said to have straightforward and effective solutions (he refused to elaborate).

SLI actually yields pretty good load balancing.
 
SLI, yuck ... to me it always seemed like an exercise in how to fuck up the texture caching.

Multiple scanline or tile interleaving ala BB's to me seems much more palatable.
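For concreteness, a toy illustration (assumed chip count, band and tile sizes; not any actual chip's scheme) of these partitionings, written as pixel-to-chip assignment functions:

```python
# Toy pixel -> chip assignment functions for the partitionings under discussion.
NUM_CHIPS = 2   # assumed chip count
BAND = 4        # band height for multi-scanline interleave (assumption)
TILE = 8        # tile size, e.g. 8x8 (assumption)

def chip_sli(x: int, y: int) -> int:
    # Classic SLI: alternate single scanlines between the chips.
    return y % NUM_CHIPS

def chip_band(x: int, y: int) -> int:
    # Multi-scanline (band) interleave: alternate bands of BAND scanlines.
    return (y // BAND) % NUM_CHIPS

def chip_tile(x: int, y: int) -> int:
    # Tile interleave: assign TILE x TILE screen tiles in a checkerboard pattern.
    return ((x // TILE) + (y // TILE)) % NUM_CHIPS
```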

Psurge, your guess is as good as mine ... my guess is no, power consumption doesn't scale down that fast.

 
Multiple scanline or tile interleaving ala BB's to me seems much more palatable.

not sure what you mean by BB here, but the coarser the subdivision elements, the poorer the load balancing. caching is auxiliary to the problem of load balancing -- a given architecture may not require caching at all.
 
BitBoys'.

As for balancing, a tile is smaller than a scanline ... so if we had infinite input buffer size, balancing would actually be better. We don't, but I doubt it's an issue.

Even with unique texturing, texels are used repeatedly; inability to make full use of that fact is hardly an indicator of good balancing IMO. Even with multiple chips, external memory bandwidth is expensive.

Marco

 
i'm not familiar with any of the bitboys' designs, can't comment. still,

As for balancing, a tile is smaller than a scanline ... so if we had infinite input buffer size balancing would actually be better. We dont, but I doubt its an issue.

by granularity of the subdivision regions i didn't mean their integral area pixel-wise, but rather the chance of a primitive falling completely (or largely) within the bounds of one such subdivision region, thus compromising the load balance. here i put the focus on the coverage of a single primitive, as trying to infer a coverage pattern scene-wise would be considerably (statistically) harder. so back to scanline-vs-tile granularity. in your opinion, which is more likely to occur: a single-pixel-tall, many-pixels-wide polygon, or a polygon which falls within, say, a 32x32-pixel tile?

Even with unique texturing, texels are used repeatedly; inability to make full use of that fact is hardly an indicator of good balancing IMO. Even with multiple chips, external memory bandwidth is expensive.

what i said was purely hypothetical, and was meant to imply that caching or the lack thereof is a matter of particular implementation, vs. the subject of load balancing, which is a statistical matter. speaking of implementations, nobody ever claimed an SLI architecture should not use a multiported texture cache, providing 100% texel reuse among the scanlines.

 
How does it matter if a tri is completely rendered by a single chip? The other chip just moves on to the next tri; sometimes one will pull ahead, sometimes the other ... that's what buffering is for.

How would a blah-cache help? In one situation both chips are accessing the same texels, which is inefficient, in the other they are mostly accessing different texels. The only option to get around it is for multiple chips to share a cache ... which is clearly not an option.

Marco

PS. 32x32 is huge, think 4x4 or 8x8.

 
Originally Posted by SA:

--------------------------------------

Highly scalable problems such as 3d graphics and physical simulation should get near linear improvement in performance with transistor count as well as frequency. As chips specialized for these problems become denser they should increase in performance much more than CPUs for the same silicon process. This means moving as much performance sensitive processing as possible from the CPU to special purpose chips. This is quite a separate reason for special purpose chips than simply to implement the function directly in hardware so as to be able to apply more transistors to the computations (as mentioned in my previous post). It also applies to functions that require general programmability (but are highly scalable). General programmability does not preclude linear scalability with transistor count. You just need to focus the programmability to handle problems that are linearly scalable (such as 3d graphics and physical simulation). It makes sense of course to implement as many heavily used low level functions as possible directly in hardware to apply as many transistors as possible to the problem at hand.

The other major benefit from using special purpose chips for highly scalable, computation intensive tasks is the simplification and linear scalability of using multiple chips. This becomes especially true as EDRAM arrives.

The MAXX architecture requires scaling the external memory with the number of chips, so does the scan line (band line) interleave approach that 3dfx used. With memory being such a major cost of a board, and with all those pins and traces to worry about, it is a hard and expensive way to scale chips (requiring large boards and lots of extra power for all that external memory). The MAXX architecture also suffers input latency problems limiting its scalability (you increase input latency by one frame time with each additional chip). The scan line (band line) method also suffers from caching problems and lack of triangle setup scalability (since each chip must set up the same triangles redundantly).

With EDRAM, the amount of external memory needed goes down as the number of 3d chips increase. In fact, with enough EDRAM, the amount of external memory needed quickly goes to 0. EDRAM based 3d chips are thus ideal for multiple chip implementations. You don't need extra external memory as the chips scale (in fact you can get by with less or none), and the memory bandwidth scales automatically with the number of chips.

To make the maximum use of the EDRAM approach, the chips should be assigned to separate rectangular regions or viewports (sort of like very large tiles). The regions do not have their rendering deferred (although they could of course), they are just viewports. This scaling mechanism automatically scales the computation of everything: vertex shading, triangle setup, pixel operations, etc. It does not create any additional input latency, allows unlimited scalablity, and does not require scaling the memory as required by the previously mentioned approaches.

Tilers without EDRAM also scale nicely without needing extra external memory. They are, in fact, the easiest architecture to scale across multiple chips. You just assign the tiles to be rendered to separate chips rather than the same chip. The external memory requirements, while remaining constant, do not drop, however, as they do with EDRAM. The major problem to deal with is scaling the triangle operations as well as the rendering. In this case, combining the multi-chip approach mentioned for EDRAM with tiling solves these issues. You just assign all the tiles in a viewport/region to a particular chip. Everything else is done as above and has the same benefits.

In my mind, the ideal 3d card has 4 sockets and no external memory. You buy the card with one socket populated at the cost of a one chip card. The chip has 32 MB of EDRAM, so with 1 chip you have a 32MB card. When you add a second chip you get a 64MB card with double the memory bandwidth and double the performance. For those who go all out and decide to add 3 chips, they get 128 MB of memory, and quadruple the memory bandwidth and performance. Ideally, the chip uses some form of occlusion culling such as tiling, or hz buffering with early z check, etc. Using the same compatible socket across chip generations would be a nice plus.

In the long run I agree with MFA. Using scene graphs or a similar spatial hierarchy simplifies and solves most of these problems, including accessing, transforming, lighting, shading, and rendering only what is visible. They also simplify the multiple chip and virtual texture and virtual geometry problems. We will need to wait a bit longer for it to appear in the APIs though.

There are indeed two problems generally associated with partitioning the screen across multiple chips: load balancing, and distributing the triangles to the correct chip. Both have fairly straightforward, very effective solutions, though I can't mention the specifics here.
Those are some good comments, MFA. However, there is no need to defer rendering and no need for a large buffer. Each chip knows which vertices/triangles to process, without waiting.

----------------------------------------
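A rough sketch of the viewport/region partitioning described above (the vertical-strip layout, the bounding-box routing test and all numbers are illustrative assumptions on my part, not the unspecified solutions SA alludes to):

```python
# Toy routing of triangles to chips by static screen region (illustrative only).
SCREEN_W, SCREEN_H = 1024, 768
NUM_CHIPS = 4

# One possible static assignment: four vertical strips, one region per chip.
REGION_W = SCREEN_W // NUM_CHIPS
regions = [(i * REGION_W, 0, (i + 1) * REGION_W, SCREEN_H) for i in range(NUM_CHIPS)]

def chips_for_triangle(tri):
    """Return the chips that must process a triangle, given screen-space vertices."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    bb = (min(xs), min(ys), max(xs), max(ys))
    hit = []
    for chip, (x0, y0, x1, y1) in enumerate(regions):
        # Bounding-box overlap test against the chip's region.
        if bb[0] < x1 and bb[2] >= x0 and bb[1] < y1 and bb[3] >= y0:
            hit.append(chip)
    return hit

# Example: a triangle straddling the boundary of regions 0 and 1 goes to both chips.
print(chips_for_triangle([(250, 100), (300, 200), (260, 400)]))  # -> [0, 1]
```

Each chip's vertex shading, setup and pixel work then scales with the number of populated sockets, which is the scaling behaviour SA describes; how the real load balancing is done is exactly the part he declined to elaborate on.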
 
How does it matter if a tri is completely rendered by a single chip? The other chip just moves on to the next tri; sometimes one will pull ahead, sometimes the other ... that's what buffering is for.

the other chip cannot just move on to the next triangle, as those two triangles may overlap and hence determinism conflicts could occur, which, to be resolved, would require imposing artificial latencies on the parallelism (i.e. clock-for-clock offsetting). that's why we need a scheme which, for each moment T, keeps strict partitioning (read: non-overlapping) across the frame regions, so that at said moment T chip N does not spatially overlap with chip N+1. now, last time we were discussing the effectiveness of that frame partitioning, namely scanlines vs. tiles.
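a tiny numeric example of that ordering problem (made-up single-channel colour and alpha values, my own illustration): with standard "over" alpha blending the result depends on submission order, so overlapping work can't be committed in whatever order the chips happen to finish.

```python
# "Over" alpha blending is order dependent, so overlapping triangles must be
# committed in submission order (values below are made up for illustration).
def over(dst, src_color, src_alpha):
    # Classic src-over blend for a single colour channel.
    return src_color * src_alpha + dst * (1.0 - src_alpha)

background = 0.0
a = (0.8, 0.5)   # triangle A: colour 0.8, alpha 0.5
b = (0.2, 0.5)   # triangle B: colour 0.2, alpha 0.5

print(over(over(background, *a), *b))  # A then B -> ~0.30
print(over(over(background, *b), *a))  # B then A -> ~0.45
```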

How would a blah-cache help? In one situation both chips are accessing the same texels, which is inefficient,

it wouldn't be inefficient if such an access could come w/o penalties, and a multi-ported, shared cache could help here.

in the other they are mostly accessing different texels. The only option to get around it is for multiple chips to share a cache ... which is clearly not an option.

now, why would that not be an option?

PS. 32x32 is huge, think 4x4 or 8x8.

well, i picked 32x32 due to historical PowerVR reasons (as i told you, i'm not familiar with the BitBoys' design)
 
Darkblu, we are talking about different forms of tiles I think ... I just meant a static assignment of screen tiles to individual chips (tiles as in a tiled framebuffer, such as nearly every traditional 3D chip uses). So there can be no overlap with subsequent tris on the same chip if it skips it (it's invisible to it).

As for shared caches, such a high-bandwidth path takes lots of pins ... reducing redundancy of reads between the chips (through multi-scanline SLI or tiling) and using the saved pins for more memory bandwidth is a clear win.
 
Darkblu, we are talking about different forms of tiles I think ... I just meant a static assignment of screen tiles to individual chips [snip] So there can be no overlap with subsequent tris on the same chip if it skips it (it's invisible to it).

exactly from there stems the whole problem with load balancing -- a chip processes only the triangles, and only the portion of those, that falls within its tile/scanline/whatever subdivision element the framebuffer divides into. that's why there's a chance that chip N does more work than chip M, as it's possible that there's higher triangle coverage in its portion of the framebuffer than in chip M's portion. my whole point being that subdividing the framebuffer into scanlines gives better (statistically) load balance, in terms of the triangle area each chip has to process, than, say, 16x16 tiles.
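to make the statistical point concrete, a quick Monte Carlo toy model (entirely my own, with arbitrary screen, chip and primitive sizes): drop small 8x8 squares, standing in for small triangles, at random screen positions and measure how far each one's pixel split deviates from a perfect 50/50 between two chips, under scanline interleave versus 16x16 tile interleave.

```python
# Toy Monte Carlo model of per-primitive load balance (my own illustration).
import random

W, H, CHIPS, TILE = 1024, 768, 2, 16

def chip_scanline(x, y):
    return y % CHIPS                              # scanline interleave

def chip_tiled(x, y):
    return ((x // TILE) + (y // TILE)) % CHIPS    # 16x16 tile interleave

def mean_skew(chip_of, trials=5000, size=8):
    # Average |chip 0's share - 0.5| per primitive:
    # 0.0 = every primitive split evenly, 0.5 = every primitive lands on one chip.
    total = 0.0
    for _ in range(trials):
        x0, y0 = random.randrange(W - size), random.randrange(H - size)
        on_chip0 = sum(1 for y in range(y0, y0 + size)
                         for x in range(x0, x0 + size)
                         if chip_of(x, y) == 0)
        total += abs(on_chip0 / (size * size) - 0.5)
    return total / trials

print("scanline interleave skew:", round(mean_skew(chip_scanline), 3))  # ~0.0
print("16x16 tile skew:        ", round(mean_skew(chip_tiled), 3))      # clearly higher
```

whether that per-primitive skew still matters once there's buffering between setup and the rasterizers is, of course, exactly the point MfA disputes above.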
 