Future directions in 3D hardware
Some of the future directions I see 3D hardware heading are:
- High quality AA (16x or more). It will probably be on by default with only a small performance drop. More vendors are likely to use some form of coverage mask AA combined with improved anisotropic filtering. --- timeframe: late 2002 - 2003
- The merging of vertex shaders and pixel shaders into a single, completely programmable, high-level, floating point, 3D hardware language. Some vendors are likely to merge the floating point hardware functionality of pixel processors and vertex processors as well. In this case they would likely allocate the merged set of floating point processors dynamically much like P10 allocates them dynamically for just vertex shaders. --- timeframe: late 2003 - 2004
- Major increases in triangle rate (pushing upward toward 1 billion triangles per second) as well as general floating point processing rate. To reduce the external memory bandwidth requirements of such high triangle rates, on-chip geometry instance caches, on-chip tessellation, and on-chip displacement mapping will be relied upon. --- timeframe: late 2002 - onward
- Further improvements in occlusion culling and hidden surface removal. IMRs will use combinations of hierarchical z buffering, compressed z's, and multiple z checks per pixel. Z occlusion queries will most likely be implemented across a larger number of vendors, and will find their way into DirectX. --- timeframe: late 2002 - 2003
- Improved global illumination support. This includes improved anti-aliased soft shadow maps and spherical harmonic environment maps. --- timeframe: late 2002 - 2003
In the slightly more distant future I see the following happening:
- Programmable hardware support for selective ray tracing (forward or backward). --- timeframe: 2005 - 2006
- Further improvements in the general purpose floating point programmability of the GPU/VPU, allowing much of the physics and collision processing to be off-loaded from the CPU. --- timeframe: 2004 - 2006
- Enough programmability (and large enough 3D hardware programs) that CGI software renderers can run directly on the 3D hardware, reducing rendering times by 1 to 2 orders of magnitude per machine. This would signal the convergence of CGI and real-time 3D. --- timeframe: 2005 - 2006
Given just how friggin large pixel shaders are going to get (floating point yuck-phoey, the only good floating point is block floating point :) mask based approaches seem the only road to high quality edge AA in a short timeframe ... fillrate growth is likely to be worse for the coming generation than even the present one.
My issue with them is that nothing has changed though, they won't take much resources to implement now ... they wouldn't have done so last generation either. The only thing that has changed is that a "major" (well almost :) player has finally gone and done it, if that was all that it took then double damn for Warp5's untimely death.
Laa-Yosh
28-May-2002, 10:01
I wonder which shadowing technique will become the standard - volumetric stencil buffer shadows, or shadow buffers?
The first would limit polygon counts (although a more advanced vertex shader might be able to calculate them a lot faster than the host CPU), and it would require even more computing power and fillrate to do soft shadows with it (like, render shadows for each lightsource 4 times).
The second is the current standard in the CGI industry - shadow maps can be tiled, cached for non-moving lightsources, but need a lot of manual tweaking and cheating to get them right. They also eat memory and get jagged edges when viewed from a short distance - although oversampling can solve this.
IMHO, both methods will need to evolve for a few more years to decide the question - if it can be decided at all. Maybe both will continue to live in different kinds of engines.
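As a rough illustration of how oversampling softens the jagged edges of a shadow map, here is a minimal sketch of percentage-closer filtering, a standard technique that is not named in the thread. The function name `shadow_factor`, the `bias` value, and the list-of-lists depth map are all assumptions made for the example.

```python
def shadow_factor(depth_map, u, v, fragment_depth, bias=0.005, radius=1):
    """Fraction of nearby shadow-map texels that leave the fragment lit.

    depth_map is a 2D list of depths seen from the light; (u, v) is the
    texel the fragment projects to; fragment_depth is its depth from the light.
    """
    h, w = len(depth_map), len(depth_map[0])
    lit = total = 0
    for dv in range(-radius, radius + 1):
        for du in range(-radius, radius + 1):
            # Clamp to the map edge instead of sampling outside it.
            x = min(max(u + du, 0), w - 1)
            y = min(max(v + dv, 0), h - 1)
            # The sample counts as lit if nothing nearer to the light covers it.
            lit += fragment_depth - bias <= depth_map[y][x]
            total += 1
    return lit / total
```

A single lookup returns only 0 or 1, which is where the hard, blocky edge comes from; averaging a small neighbourhood gives intermediate values along the boundary at the cost of more samples per pixel.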
Entropy
28-May-2002, 10:45
A major limitation to be considered is the data path between host CPU and the graphics card. Considering how long we had 2xAGP before 4xAGP came around, and that 4xAGP will have stayed around until 8xAGP enters the scene towards the end of the year, it seems reasonable to assume that 8xAGP will be around for at least a couple of years. Probably more.
So until, say, 2005, that's presumably a bottleneck we'll have to deal with.
To me, that implies that it will become increasingly important to minimize traffic between CPU and GPU, implying generally that techniques that emphasize 'more work with less data' will be favoured.
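To put rough numbers on the bus limit, here is a back-of-the-envelope sketch. It assumes nominal AGP peak rates (about 266 MB/s per "x" multiple), a hypothetical 32-byte vertex, and roughly one new vertex per triangle in a well-stripped mesh. Only the 1 billion triangles per second target comes from the thread; the rest are illustrative assumptions.

```python
# Nominal AGP peak bandwidth per "x" multiple, in MB/s (assumption: peak
# figures; sustained rates are lower in practice).
AGP_MBPS_PER_X = 266.0

def max_triangles_per_second(agp_multiple, bytes_per_vertex=32,
                             vertices_per_triangle=1.0):
    """Upper bound on triangles/s if every vertex crosses the bus once."""
    bytes_per_second = agp_multiple * AGP_MBPS_PER_X * 1e6
    return bytes_per_second / (bytes_per_vertex * vertices_per_triangle)

for mult in (2, 4, 8):
    rate = max_triangles_per_second(mult)
    print(f"AGP {mult}x: about {rate / 1e6:.1f} M triangles/s over the bus")
```

Under these assumptions even 8xAGP falls more than an order of magnitude short of 1 billion triangles per second, which is consistent with the argument for doing more work per byte sent to the card.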
Entropy
Dave Baumann
28-May-2002, 10:49
- Enough programmability (and large enough 3D hardware programs) that CGI software renderers can run directly on the 3D hardware, reducing rendering times by 1 to 2 orders of magnitude per machine. This would signal the convergence of CGI and real-time 3D. --- timeframe: 2005 - 2006
I wonder how realistic this is already with 3Dlabs virtual memory and its capability of addressing up to 16GB of space?
Adaptive shadow maps are the only scalable approach (apart from raytracing) to determining hard shadows IMO, that soft shadows are easier to fudge with shadow maps is just a perk.
DaveBauman, since its peak FP power (for the programmable part, not the blah blah hype part) is below that of high end processors ... not very close.
Dave Baumann
28-May-2002, 11:04
I didn't actually necessarily mean at 2X the speed!
Laa-Yosh
28-May-2002, 11:13
Adaptive shadow maps are the only scalable approach (apart from raytracing) to determining hard shadows IMO, that soft shadows are easier to fudge with shadow maps is just a perk.
I dunno, 4-8 sample slightly blurred shadow maps usually work for me quite well. And they calculate quite fast on my old 866 CPU, even for few million poly scenes, so I suppose that hardware acceleration would make them even faster.
What exactly are you disagreeing with me on?
Laa-Yosh
28-May-2002, 11:47
Hm, maybe I misunderstood your comment...
that soft shadows are easier to fudge with shadow maps is just a perk.
English is not my first language, so I don't really get what you mean here :)
I mean that the fact that you can easily fake soft shadows with shadow maps is an added bonus, but that an even more important feature is that they allow you to leverage all the HSR/occlusion-culling support of future hardware for shadow generation.
SA
I think one point is the number of transistors available to implement the functions you described. The TSMC .13 micron process is not a big improvement compared to the .15 micron process, but they expect a lot from the new .09 micron process (for 2005). See this thread: http://64.246.22.60/~admin61//forum/viewtopic.php?t=984
What about a farm (64 or 128) of highly programmable (with branch), powerful, floating point intensive SIMD processors (like advanced RISC DSPs), some fast buses and multiple caches and maybe some multiple purpose 1T-SRAM inside a large .09 micron process 500MHz chip by 2005? 8)
I have an interesting quote from Nvidia, that I probably can't use until fall..... :(
Mephisto
28-May-2002, 17:36
I have an interesting quote from Nvidia, that I probably can't use until fall..... :(
....
I honestly don't try to comment on something I can't ...
8)
Bogotron
28-May-2002, 17:39
ben6:
Ohhhhh..... I think I speak for all of us when I say: "I hate you!" :D
And if you can't say anything, please don't say anything at all. Makes life much easier for the rest of us. It's evil to taunt like that, ya know? And NDAs suck, as usual. :)
Heh Mephisto, gonna follow me around with that quote .... Seriously though, it's kind of a gray area, and I'm awaiting their response on what I should or can post on my E3 meeting....
Bogotron, except in this instance, it may or may not be NDA material...
Reverend
28-May-2002, 17:45
Regarding ben6 - everyone, just get used to him... he does this in almost every post.
As for the topic at hand :
- AA
SA should be on the money re AA type and timeframe IMO.
- shaders
I see as SA sees it but at a later timeframe. Probably more towards late 2004. API revamp is how I see it as well.
- culling
Needs a bit of work and better dev support. Not as simple as I hope it to be.
Just MHO.
Chalnoth
28-May-2002, 18:09
- The merging of vertex shaders and pixel shaders into a single, completely programmable, high-level, floating point, 3D hardware language.
I seriously doubt it. The rasterization operations are fundamentally different from T&L operations. For example, there is really no need for greater precision than 16 bits per channel in the rasterizer, but a good T&L unit will go up to 64 bits per coordinate (x, y, z, w). Additionally, most T&L operations involve matrix multiplications and other vector ops. Most rasterization ops involve multiplication and addition.
- Programmable hardware support for selective ray tracing (forward or backward). --- timeframe: 2005 - 2006
Supposedly, we may be able to do ray tracing in hardware with the next generation of technology...according to a .pdf file that's been floating around the web recently...but we'll see...
- Further improvements in the general purpose floating point programmability of the GPU/VPU, allowing much of the physics and collision processing to be off-loaded from the CPU. --- timeframe: 2004 - 2006
I think we'll see scene management and skeletal animation systems managed on the GPU first, leading more towards a scene graph programming interface than today's APIs.
Btw, it'll be a long time before PCs will even approach movie-quality computer graphics, for one reason: radiosity lighting. Radiosity is used today for static lights in some games, such as those seen in Quake2/3 (this is why it takes so long to calculate the lights in the levels for these games). It is just far too computation-intensive to run in real-time today, and the processing power required increases with triangle count. I have a strong feeling that we will not see real-time radiosity before we move away from silicon-based processors.
I didn't actually necessarily mean at 2X the speed!
Not to be nitpicky or anything. But isn't an order of magnitude an exponential expansion, not a doubling?
By that definition, an order of magnitude increase in performance will happen in 3-5 years roughly. We double real world performance basically every 1.5 years or so (a bastardization of Moore's law, which states that transistor density, not performance directly, will double every 18 months). See TNT --> GF, GF --> GF3, GF3 --> NV30.
So technically SA, your wish will be granted just by the passage of time, and not necessarily the costly, transistor wise, implementation of running a 3D program directly on the hardware through the use of virtual memory and whatever other hardware techniques the engineers come up with.
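The doubling argument above is easy to check with a few lines of arithmetic. This is only a sketch of the rule of thumb as stated (performance doubling every 1.5 years); the function name is made up for the example.

```python
import math

def years_for_speedup(factor, doubling_period_years=1.5):
    """Years needed to reach `factor`x performance at a fixed doubling period."""
    return math.log2(factor) * doubling_period_years

print(round(years_for_speedup(10), 2))   # one order of magnitude
print(round(years_for_speedup(100), 2))  # two orders of magnitude
```

With a 1.5 year doubling period, one order of magnitude lands at the upper end of the quoted 3-5 year range (just under 5 years), and two orders of magnitude take roughly twice as long.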
Laa-Yosh
28-May-2002, 19:34
Mfa, it was a misunderstanding then, sorry :)
Radiosity, or rather the more common term Global Illumination, is not that often used in CGI. The most that the movie effects industry does is a "skylight pass", sometimes using image based lighting, that will be one layer of a composite consisting of other passes from the same scene. But it's not that common; the most often used form is to bake GI into textures.
The reason is that GI is usually not enough in itself. Lighting can be used to place emphasis on different things in a scene, to hide imperfections in objects and textures. Lighting is also used in real life movies for the above reasons; in fact most of a lighting technician's job is to remove the effects of GI from the set. So GI can add some depth to a scene, but its significance is not that big; I'd rather see better shadows and materials first.
Yeah, GI is usually too computationally expensive even for Hollywood effects (and sometimes, it shows) :-? It's seldom used for casting shadows, with some exceptions. I know the arena in Gladiator had radiosity shadows prerendered with Lightscape and baked into the textures. The only scene I know for sure to use full-blown radiosity is the journey through the waste basket scene in Fight Club - and that's only a couple of seconds long. Hope they'll start using it more as computing power becomes cheaper. You can get some amazing results from it! Found a couple of really nice GI pictures earlier today over at the High End 3D forum (http://www.highend3d.com/boards/showflat.php?Cat=&Board=critiquemyworkpro2&Number= 106339&page=0&view=collapsed&sb=5&o=&fpart=).
Regards / ushac
Colourless
28-May-2002, 20:34
there is really no need for greater precision than 16 bits per channel in the rasterizer
Do not say things like that. Remember the Bill Gates 640kb quote? Your statement is the same sort of thing. Just because you can't see any more bits being required, doesn't mean someone else doesn't. 16 bits can be a rather big limitation, depending on what you want to do. If you want to attempt to generate vectors then 16 bits per chan isn't enough.
DemoCoder
28-May-2002, 22:52
- The merging of vertex shaders and pixel shaders into a single, completely programmable, high-level, floating point, 3D hardware language.
I seriously doubt it. The rasterization operations are fundamentally different from T&L operations. For example, there is really no need for greater precision than 16 bits per channel in the rasterizer, but a good T&L unit will go up to 64 bits per coordinate (x, y, z, w). Additionally, most T&L operations involve matrix multiplications and other vector ops. Most rasterization ops involve multiplication and addition.
He said hardware language, not hardware unit. For example, Stanford's Real Time Shading Language is a Renderman-like shading language where each little piece of code can be tagged based on its granularity -- CONSTANT, PER OBJECT, PER VERTEX, PER FRAGMENT. This high level language unifies vertex and pixel shading into a single high level language.
The compiler takes this language and translates it into separate low level code for per-vertex or per-fragment processors (vertex or pixel shaders) and even multipass where appropriate. The benefit is, the developer is freed from learning multiple *different* assembly languages for different units. MUL is MUL and DP3 is DP3. The only differences are in the precision of the variable types and the granularity of the operations.
Moreover, it is possible to design units that can handle both 64-bit SIMD ops and 16-bit SIMD ops. In 64-bit SIMD mode, the unit can handle 1 op at a time. In 16-bit mode, the 64-bit unit acts like 4 16-bit SIMD units allowing parallel execution. Granted, you don't want to stall vertex processing while rasterization is taking place, there sometimes is a need to process a vertex to determine what to do during per pixel lighting.
I think the Stanford RTSL is a good middle ground by preserving the ability to build fast hardware. The 100% general purpose programmable approach means a return to the days of software rasterization, and I doubt the ability of a GP to keep up with 16x FSAA and 64-tap anisotropic filtering requirements at current speeds.
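The idea of one wide unit acting as four narrow ones can be sketched in software. The example below packs four 16-bit integer lanes into one 64-bit word and adds them in a single pass without letting carries cross lane boundaries. It covers integer lanes only, and the helper names are made up; splitting a floating point unit this way is a harder hardware problem that this sketch does not address.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1
HIGH_BITS = 0x8000800080008000  # top bit of every 16-bit lane
LOW_BITS = 0x7FFF7FFF7FFF7FFF   # all bits except each lane's top bit

def pack(values):
    """Pack four 16-bit values into one 64-bit word, lane 0 lowest."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word

def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]

def add_4x16(a, b):
    """Add four 16-bit lanes at once, wrapping per lane, no cross-lane carry."""
    # Adding only the low 15 bits of each lane cannot carry into the next lane.
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    # Fold each lane's top bits back in with XOR, which never produces a carry.
    return low_sum ^ ((a ^ b) & HIGH_BITS)

result = unpack(add_4x16(pack([1, 0xFFFF, 0x8000, 1234]),
                         pack([2, 1, 0x8000, 4321])))
print(result)
```

The same 64-bit adder does one wide add or four narrow ones; the only extra logic is the masking that breaks the carry chain at lane boundaries.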
Chalnoth
28-May-2002, 23:25
Do not say things like that. Remember the Bill Gates 640kb quote? Your statement is the same sort of thing. Just because you can't see any more bits being required, doesn't mean someone else doesn't. 16 bits can be a rather big limitation, depending on what you want to do. If you want to attempt to generate vectors then 16 bits per chan isn't enough.
No, actually it's not.
You see, professional DACs generally limit out at around 12 bits per channel. I showed earlier over at the nvnews forums that you can actually do in the range of hundreds of passes before 16 bits per channel becomes bad. When you compare this with the fact that there is no need for any color depth loss with internal calculations (internal calculations should always be done at higher precision than framebuffer values), it becomes virtually impossible to make errors from 16 bits per channel filter down to a 12-bit DAC (note that the next-gen cards probably won't have better than a 10-bit DAC, and may even have 8-bit DACs).
Then you have to consider that the calculations I did were based upon an integer pipeline. A floating-point pipeline should naturally be a bit better.
The only exception is in multiplication. With the inclusion of overbright lights, 16 bits per channel may become cumbersome, but I'm fairly certain that the floating-point nature of the upcoming 64-bit color pipelines should eliminate this problem.
Update:
I did some thinking, and a floating-point pipeline would likely have a bit more error in addition operations (presumably no error for operations within the pipeline), and much less error for multiplication operations. Just don't forget that this only affects multipass effects...future hardware that can apply more textures than today's hardware in a single pass won't have much possibility for addition errors to enter the pipeline. The only situation where there will be significant multipass effects will be in the case of multiple layers of complex transparencies.
Chalnoth
28-May-2002, 23:33
Moreover, it is possible to design units that can handle both 64-bit SIMD ops and 16-bit SIMD ops. In 64-bit SIMD mode, the unit can handle 1 op at a time. In 16-bit mode, the 64-bit unit acts like 4 16-bit SIMD units allowing parallel execution. Granted, you don't want to stall vertex processing while rasterization is taking place, there sometimes is a need to process a vertex to determine what to do during per pixel lighting.
I'm not so sure...I could see it easily with integer pipelines, but floating-point pipelines would be a bit more challenging to divide up like this. This is especially true since you want to be especially careful not to decrease accuracy for color ops, meaning you'll need more than 16 bits of accuracy within the pipeline.
I also don't see why it would be very good to combine the two languages into one. They're just so different...but we'll see, obviously...
Bah, I say let the programmers earn their money ... give em a 12-16 bit fixed point architecture with barrel shifters :)
MfA no thanks, we work hard enough ;)
Dave Baumann
28-May-2002, 23:54
future hardware that can apply more textures than today's hardware in a single pass won't have much possibility for addition errors to enter the pipeline.
Not 'that' future either. P10 allows for as many internal loops as required, its limitation being that of the API.
Some things should not depend entirely on the judgement of the PC games development community at large (I'm sure a good deal of them have some experience with fixed point development on good fixed point platforms, not x86, or hardware development ... but I'm assuming the majority don't). I think the ease of programming thing can go overboard; even if pixel shaders have to be written in block floating point for optimal performance (the hardware could emulate floating point plenty fast enough for prototyping), it's not going to take a huge chunk of development.
Of course it might not present a significant savings in die area to go that way because of various factors, in which case there would be no need to not use floating point.
DaveB - are the integer pixel processors in the p10 similar to what MfA is proposing?
Serge
Entropy
29-May-2002, 01:02
Have to agree that it would seem as though fixed-point should be able to do the job. Unfortunately, estimating gate budget for similar performance fixed point vs. floating point is more than I'd dare do.
Any experienced DSPers around here, who remember data back from the fixed to float shift?
Entropy
Schreck
29-May-2002, 01:35
No, actually it's not.
You see, professional DACs generally limit out at around 12 bits per channel. I showed earlier over at the nvnews forums that you can actually do in the range of hundreds of passes before 16 bits per channel becomes bad. When you compare this with the fact that there is no need for any color depth loss with internal calculations (internal calculations should always be done at higher precision than framebuffer values), it becomes virtually impossible to make errors from 16 bits per channel filter down to a 12-bit DAC (note that the next-gen cards probably won't have better than a 10-bit DAC, and may even have 8-bit DACs).
You are thinking in a very limited way. Who's to say that rasterizers should only be manipulating colors? What if you want to do complex math on vectors, then use this to lookup into a huge 4k wide texture? Still think 16-bit is enough?
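A quick way to see the concern about texture lookups is to count the bits. The sketch below assumes a normalized fixed-point texture coordinate, which is an illustrative choice and not something the thread specifies; the function name is hypothetical.

```python
import math

def subtexel_bits(coord_bits, texture_width):
    """Bits left for sub-texel position when a normalized fixed-point
    coordinate of coord_bits spans a texture of texture_width texels."""
    return coord_bits - math.ceil(math.log2(texture_width))

print(subtexel_bits(16, 4096))  # bits left over for filtering weights
print(subtexel_bits(16, 256))
```

Addressing 4096 texels uses 12 of the 16 bits, leaving 4 bits (16 steps) of sub-texel position for filtering, before any error from the math that produced the coordinate is taken into account.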
Dave Baumann
29-May-2002, 02:23
DaveB - are the integer pixel processors in the p10 similar to what MfA is proposing?
I'll confess, I don't know what MfA was proposing! :oops:
demalion
29-May-2002, 02:54
No, actually it's not.
You see, professional DACs generally limit out at around 12 bits per channel. I showed earlier over at the nvnews forums that you can actually do in the range of hundreds of passes before 16 bits per channel becomes bad. When you compare this with the fact that there is no need for any color depth loss with internal calculations (internal calculations should always be done at higher precision than framebuffer values), it becomes virtually impossible to make errors from 16 bits per channel filter down to a 12-bit DAC (note that the next-gen cards probably won't have better than a 10-bit DAC, and may even have 8-bit DACs).
You are thinking in a very limited way. Who's to say that rasterizers should only be manipulating colors? What if you want to do complex math on vectors, then use this to lookup into a huge 4k wide texture? Still think 16-bit is enough?
Wow, I thought I had a handle on some general architecture components, but I can't see why you'd be doing anything like this on a rasterizer. DAC = Digital to Analog Converter, correct? Why would you need more than 16 bits per color component for a texture lookup, and what would you be using that texture look up for at the DAC?
Fundamentally, there is no difference between pixel shading and vertex shading. In both cases it boils down to calculating surface characteristics such as 3d location and lighting effects at a 3d surface. Pixel shading versus vertex shading as such, is an implementation artifact of trying to reduce the computation costs by performing the expensive computations at the vertices and reusing those computations at each of the interior pixels, generally via interpolation. This implementation artifact can be hidden by a single high level language that deals with shading 3d surfaces.
In addition, this opens up a possibility that some vendors may take advantage of. Since both the pixel shaders and vertex shaders implement the same instruction set (at a high level at least), and they both now require floating point of one precision or another, implementing them entirely separately becomes expensively redundant. Therefore, some vendors are likely to look for ways to combine them.
About transistor counts. They drive the whole technology of course. Most of the algorithms were developed long ago. As I mentioned in one of my posts on the old board, performance for 3d graphics chips goes up roughly as the cube of the (inverse of the) feature size. That is, frequencies increase ~linearly, and transistor counts go up by the square of the decrease in feature size. Since graphics hardware applies the additional transistors to the task at hand rather than just housekeeping, their performance improves proportionally to the transistor count times the frequency improvement.
CPUs, however, only improve performance approximately linearly based on the same metric. This is because CPUs get most of their performance improvement from increased frequency and get very little additional benefit from increased transistor count. Additional transistors are usually allocated to more cache, larger buffers, etc. all of which offer only a few percent improvement.
3d graphics chips therefore improve in performance far faster than CPUs and most of this due to increased transistor counts.
The move from .15 micron to .13 micron should therefore provide (.15/.13)^3 or ~1.5 times the overall performance. However, the next generation of hardware will use higher precision computations, which require a significant number of extra transistors. To keep up with performance improvement expectations, 3d vendors will need to rely on techniques other than just process improvements. Those vendors that do a better job of this will fare better this generation. There are of course ways other than process improvements to increase transistor count, and there are ways to improve performance other than just increasing transistor count, such as memory bandwidth improvements and reducing the amount of computation needed.
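The scaling rule of thumb from this post can be written out directly. This sketch simply encodes the stated assumptions (frequency scales linearly with the shrink, transistor count with its square, graphics chips benefit from both, CPUs mostly from frequency); it is not a measured model, and the function names are made up.

```python
def gpu_scaling(old_feature, new_feature):
    """Rule-of-thumb performance gain for a graphics chip after a shrink."""
    shrink = old_feature / new_feature
    frequency_gain = shrink          # roughly linear in feature size
    transistor_gain = shrink ** 2    # density goes as the square
    return frequency_gain * transistor_gain

def cpu_scaling(old_feature, new_feature):
    """Rule-of-thumb gain for a CPU: mostly the frequency term alone."""
    return old_feature / new_feature

for new in (0.13, 0.09):
    print(f".15 -> {new}: GPU x{gpu_scaling(0.15, new):.2f}, "
          f"CPU x{cpu_scaling(0.15, new):.2f}")
```

Under the same rule, the later move to .09 micron would be worth several times the gain of the .15 to .13 step, which fits the expectation that much more is riding on that process.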
Chalnoth
29-May-2002, 03:16
Fundamentally, there is no difference between pixel shading and vertex shading. In both cases it boils down to calculating surface characteristics such as 3d location and lighting effects at a 3d surface. Pixel shading versus vertex shading as such, is an implementation artifact of trying to reduce the computation costs by performing the expensive computations at the vertices and reusing those computations at each of the interior pixels, generally via interpolation. This implementation artifact can be hidden by a single high level language that deals with shading 3d surfaces.
Vertex "shaders" are also used to calculate vertex positions. While I will agree that the lighting portion of T&L in the vertex shader is much more similar to what is done in the rasterizer, the transformations are very different.
OpenGL guy
29-May-2002, 03:17
Not 'that' future either. P10 allows for as many internal loops as required, its limitation being that of the API.
So if I want 10 passes, it can handle it? What about 100? 1,000? 10^100 ( = googol)?
I'm sure there's a limit in there someplace...
Chalnoth
29-May-2002, 03:44
So if I want 10 passes, it can handle it? What about 100? 1,000? 10^100 ( = googol)?
I'm sure there's a limit in there someplace...
If the architecture is sufficiently-generalized, the only limit would be the size of the program that could fit into available memory. I wonder if the P10 allows for vertex programs to spill out of video memory and into system memory? That might be a bit excessive...
Dave Baumann
29-May-2002, 09:16
So if I want 10 passes, it can handle it? What about 100? 1,000? 10^100 ( = googol)?
I'm sure there's a limit in there someplace...
Naturally there will be a practicality limitation due to performance – the more loops required, the more clocks will be used. Having said that, given the stable this chip comes from, not all of their customers will need fast-paced realtime action.
If the architecture is sufficiently-generalized, the only limit would be the size of the program that could fit into available memory. I wonder if the P10 allows for vertex programs to spill out of video memory and into system memory? That might be a bit excessive...
Again, with Virtual memory and the capability of addressing 16GB of space then the program can be huge! This may not be desirable in all situations, but there may be non-realtime apps where it could be useful.
Both vertex shaders and pixel shaders must calculate both the 3d locations and the lighting of the 3d surface. Vertices calculate the 3d surface location by direct transformation, while pixels calculate the 3d surface locations incrementally by interpolation.
The separation into two types of shaders is an implementation artifact. There is no separation into vertices and pixels on a real life physical 3d surface, only the surface. Vertices are 3d surface sample points and the pixels are linearly interpolated between sets of 3 surface sample points. This is very similar to the 2d sampling of textures with 2d bi-linear interpolation.
Ultimately however, for realistic physical simulation, the same types of physical calculations must be performed at each surface point. As this happens, pixel shaders quickly take on more and more of the functionality of vertex shaders, including precision, dynamic range, and instruction set. At some point the separation into two types of shaders becomes redundant, where you are implementing two sets of very sophisticated highly programmable floating point functional units that are capable of doing the same things.
At first the instruction sets will merge into a single high level language. At that point, 3d vendors will be searching for ways to merge the implementations as well.
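The "pixels are interpolated between 3 surface sample points" view can be sketched with barycentric weights. This is a minimal screen-space example with made-up function names; it assumes plain linear interpolation and leaves out the perspective correction real hardware applies.

```python
def barycentric(p, a, b, c):
    """Barycentric weights of 2D point p with respect to triangle a, b, c."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb

def interpolate(p, tri, attrs):
    """Linearly interpolate per-vertex attribute tuples at pixel position p."""
    w = barycentric(p, *tri)
    return tuple(sum(wi * attr[k] for wi, attr in zip(w, attrs))
                 for k in range(len(attrs[0])))

tri = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
colors = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
print(interpolate((1.0, 1.0), tri, colors))
```

Whatever is computed at the three vertices (position, normal, lighting terms) is blended this way for the interior pixels, so anything a pixel shader recomputes per pixel is work the vertex stage would otherwise have approximated by interpolation.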
Chalnoth
29-May-2002, 10:18
Both vertex shaders and pixel shaders must calculate both the 3d locations and the lighting of the 3d surface. Vertices calculate the 3d surface location by direct transformation, while pixels calculate the 3d surface locations incrementally by interpolation.
I think you're thinking of the triangle setup engine, not the rasterizer itself. I don't see why triangle setup should be, or needs to be, made programmable. It's just more work for the developer that really is unnecessary. Of course, it may be beneficial for a professional video card, where somebody may want to implement a special texture filtering algorithm or something, but it's not anything that a game developer should have to deal with.
darkblu
29-May-2002, 10:29
You see, professional DACs generally limit out at around 12 bits per channel. I showed earlier over at the nvnews forums that you can actually do in the range of hundreds of passes before 16 bits per channel becomes bad.
snip
Then you have to consider that the calculations I did were based upon an integer pipeline.
erm, so we have 12 significant bits on the output and 16 bits in the intermediate store (i.e. 4 bits of extra precision, or 1/16 granularity), and you say that you can do "hundreds of passes" before an error appears in the significant bits?
I did some thinking, and a floating-point pipeline would likely have a bit more error in addition operations (presumably no error for operations within the pipeline), and much less error for multiplication operations.
would you care to elaborate on the above? the magnitude of the error in any (sane implementation of) fp math is a function of the difference in magnitude between the domain and range of the particular calculation, on one hand, and the difference in magnitude among the calculation arguments, on the other. if the framebuffer can keep full-precision (whatever it may be) and the fp pipeline maintains maximal (i.e. properly rounded) correctness of the LSBs of both the significand and the exponent till data get to the framebuffer, then i can't see how addition could yield higher error than multiplication. or did i completely miss your point?
Entropy
29-May-2002, 12:33
1. Error propagation is algorithm dependent. As soon as we move into more general programmability, it is impossible to make statements on sufficient accuracy without first defining what will be done.
2. Floating point data trades accuracy for dynamic range. Since you divide your bits into mantissa and exponent + sign bits for both, the potential information represented in the mantissa is significantly reduced compared to fixed point representation. Exactly what floating point format is proposed? How many bits for mantissa and exponent?
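The accuracy versus dynamic range trade-off in point 2 can be seen by comparing two 16-bit formats. The sketch below uses IEEE 754 half precision (1 sign, 5 exponent, 10 mantissa bits) purely as a stand-in, since the thread does not say what format the hardware will use, and compares it against 16-bit unsigned fixed point over [0, 1]. The helper names are made up.

```python
import struct

def to_half(x):
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_fixed16(x):
    """Round x in [0, 1] to 16-bit unsigned fixed point (steps of 1/65535)."""
    return round(x * 65535) / 65535

for x in (0.001, 0.1, 0.9):
    print(f"x={x}: half error {abs(to_half(x) - x):.2e}, "
          f"fixed error {abs(to_fixed16(x) - x):.2e}")

# Outside [0, 1] the fixed format has nothing to offer, the float still works.
print(to_half(1000.0))
```

Near 1.0 the fixed-point value is more accurate because all 16 bits go to the fraction; near zero and above 1.0 the floating point value wins because the exponent moves the available bits to where the value is.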
Entropy
Maybe I'm missing something, it's late here, but aren't we really going to be limiting ideal performance if we start talking about abstracting to the same general implementation for both vertex and pixel shading operations? The math involved is significantly different.
That we use the same language is one thing that is ultimately quite understandable, but sacrificing a little operational redundancy for fairly large gains from hardwiring seems to be a bit overkill
Chalnoth
29-May-2002, 16:10
erm, so we have 12 significant bits on the output and 16 bits in the intermediate store (i.e. 4 bits of extra precision, or 1/16 granularity), and you say that you can do "hundreds of passes" before an error appears in the significant bits?
Basically, 4 bits of error is 16. In a standard pipeline, bits are truncated, meaning that the error for each average is between 0-2 (25% chance 0, 50% chance 1, 25% chance 2). Obviously, in 16 passes, the average error would be 16, going past the 4 bits of error.
If you could design the pipeline to round off instead of truncate (Or use any other method to center the errors about zero), the error becomes a bit less obvious. What actually happens is that it requires of the order N^2 passes to reach the new 'error threshold' value. This has the twofold impact of increasing the number of passes before the average error reaches the 'error threshold,' and it reduces the number of artifacts seen before this threshold is reached. I made a little program to test this particular probability distribution, and the 'error threshold' is reached at 124 passes. Number of passes before 'error threshold' reached with a rounding algorithm:
1 bit : 2
2 bits: 7
3 bits: 33
4 bits: 124
Of course, this is only dealing with averages (0.5 * A + 0.5 * B). Generalized blending is more complex, and harder to calculate accurately, but almost any operation that does not involve multiplication by a number greater than 1 shouldn't produce higher errors. This is, of course, what the floating point values are for.
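A minimal sketch of the truncate-versus-round argument above, in Python. It is not the program mentioned in the post and is not expected to reproduce the pass counts listed there: the per-pass error model (0/1/2 LSBs with 25%/50%/25% probability for truncation, and the same spread shifted to -1/0/+1 for a centered pipeline) is an assumption taken from the description, and "threshold reached" is taken to mean that the mean absolute accumulated error equals 2^bits LSBs. The names `passes_until`, `TRUNCATING` and `CENTERED` are hypothetical.

```python
def passes_until(step_weights, threshold):
    """Return the first pass count at which the mean absolute accumulated
    error reaches `threshold` (measured in LSBs of the intermediate store).

    step_weights maps a per-pass error to an integer weight; the weights
    are relative probabilities (an assumed model, not measured data).
    """
    weight_per_pass = sum(step_weights.values())
    dist = {0: 1}   # exact distribution of the accumulated error (integer weights)
    total = 1       # sum of all weights currently in dist
    passes = 0
    while True:
        passes += 1
        nxt = {}
        # Convolve the current distribution with one more pass of error.
        for err, count in dist.items():
            for step, weight in step_weights.items():
                nxt[err + step] = nxt.get(err + step, 0) + count * weight
        dist = nxt
        total *= weight_per_pass
        # Compare mean |error| against the threshold without leaving integers.
        if sum(abs(err) * count for err, count in dist.items()) >= threshold * total:
            return passes


TRUNCATING = {0: 1, 1: 2, 2: 1}   # one-sided error: 25% / 50% / 25%
CENTERED = {-1: 1, 0: 2, 1: 1}    # same spread, centered about zero

if __name__ == "__main__":
    for bits in (1, 2, 3):
        threshold = 2 ** bits
        print(bits, passes_until(TRUNCATING, threshold), passes_until(CENTERED, threshold))
```

In this model the one-sided error grows linearly, so a threshold of T LSBs is reached after T passes, while the centered error behaves like a random walk and needs on the order of T^2 passes. Other definitions of the threshold (for example a maximum or a percentile instead of the mean) would give different counts.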
would you care to elaborate on the above? the magnitude of the error in any (sane implementation of) fp math is a function of the difference in magnitude between the domain and range of the particular calculation, on one hand, and the difference in magnitude among the calculation arguments, on the other. if the framebuffer can keep full-precision (whatever it may be) and the fp pipeline maintains maximal (i.e. properly rounded) correctness of the LSBs of both the significand and the exponent till data get to the framebuffer, then i can't see how addition could yield higher error than multiplication. or did i completely miss your point?
The more and less I stated above were in relation to an integer pipeline. i.e. floating point has more error when adding than integer, and less error when multiplying than integer.
Anyway, I think we can generally assume that the DACs on these next-gen cards will not be greater than 10 bits. The exponent is the big question, as it will directly determine the max brightness for overbright lights. I would doubt that the exponent will be greater than 4 bits (8 max exponent...8 because 3 bits for value, one for +/-). Of course, the exponent might be as high as 6 bits to make the mantissa more in line with the DAC. I personally doubt the wisdom of doing this, as a max exponent of even 4 (3 bits exponent) would allow for multiples of 16. But, it should be borne out in the specs of the new extensions to OpenGL and in DX9.
One last thing I want you to think about in regards to additions with floating-point values is that if one of the numbers is severely truncated, will that loss ever be seen? After all, there is no such thing as subtraction in 3D graphics. The loss of the lesser value is really no problem as the greater value will have much more effect on the final output anyway.
In closing, I'd like to say that provided these conditions are met, there will never be a need for greater than 16-bits per channel accuracy:
1. Center all errors about zero.
2. Allow at least 2 bits of extra precision in the mantissa (This one may be hard with 16-bit, unfortunately, and may mean we can only get truly good output for 10-bit DACs).
3. Keep number of external passes down by doing more effects in a single pass.
4. Sufficiently high internal precision to make sure no errors happen in single-pass effects (multitexturing, texture filtering, FSAA).
Mintmaster
29-May-2002, 18:10
Wow, I thought I had a handle on some general architecture components, but I can't see why you'd be doing anything like this on a rasterizer. DAC = Digital to Analog Converter, correct? Why would you need more than 16 bits per color component for a texture lookup, and what would you be using that texture look up for at the DAC?
We're talking about pixel shaders here. I agree with Shrek, you can't just be thinking about the DAC. Pixel shaders can be used for all sorts of things, not just pixels that go to the screen.
When you render to textures, you may want to use that texture for a displacement map, where you'll want vertex precision. I'm personally writing a shader where you do some processing on a shadow map (I'll put up a demo when I'm done), and you would really want 32-bits per channel there. I'm stuck with an 8-bit shadow map right now (ugh!). Pixel shaders can also be used to do physics, as in the Tomohide Fur Demo (see ATI's demo site) or NVidia's water simulation paper. Somewhere I saw the idea of rendering the Mandelbrot set using pixel shaders, and it would be tens (maybe hundreds?) of times faster than a CPU could do it.
There is no doubt that 32-bit precision in the pixel shader will be useful. I really do think a unification of the vertex and pixel shaders will eventually be in order.
I liked the idea of a single processor (pixel and vertex) implementation because some games will have more pixel shading at some times, and at other times will have more vertex shading.
I mean a farm of processors may have better load balance (physics, vertex and pixel) and then better overall performance. The key is to keep all processors busy, which means a lot of internal/external bandwidth, some good cache and maybe some 1T-SRAM. Maybe each processor could have its own L1 cache.
- The chip design could be very simple.
- The "small" processor could be highly optimized (design point of view).
- Will need some very good software.
I wish we could see something like that with the .13 micron process, and maybe 2003/2004 8)
Chalnoth
29-May-2002, 21:36
We're talking about pixel shaders here. I agree with Shrek, you can't just be thinking about the DAC. Pixel shaders can be used for all sorts of things, not just pixels that go to the screen.
Well, then, it's simple...perhaps in the future we'll have pipelines that allow for textures that will be manipulated at greater than 16 bits per channel, but the actual framebuffer need never be greater than that (Of course, if some sort of render-to-texture is used for these, the option would have to be there...).
Still, you can't really call 32 bits per channel "128-bit color" as you'd never need it for actual color ops...that would be interesting for PR, wouldn't it?
Actually, now that I think about it, things like this may prove to be reasons to join the pixel and vertex shader languages into one, so that you can use the much higher precision of the vertex shader to operate on certain special textures...
demalion
30-May-2002, 07:15
We're talking about pixel shaders here. I agree with Shrek, you can't just be thinking about the DAC. Pixel shaders can be used for all sorts of things, not just pixels that go to the screen.
But the poster he quoted was just talking about the DAC, not the framebuffer or color calculation precision. You could store at beyond "64 bit" color in the framebuffer, and still only have even a 12 bit dac...the DAC precision has nothing to do with calculations, only with the limitation of the representation of the color data on screen. The way Shreck responded seemed to be contradicting this, though the way you responded fits in with my current understanding.
Colourless
30-May-2002, 07:54
But the poster he quoted was just talking about the DAC, not the framebuffer or color calculation precision. You could store at beyond "64 bit" color in the framebuffer, and still only have even a 12 bit dac...the DAC precision has nothing to do with calculations, only with the limitation of the representation of the color data on screen. The way Shreck responded seemed to be contradicting this, though the way you responded fits in with my current understanding.
But this all started when Chalnoth said "For example, there is really no need for greater precision than 16 bits per channel in the rasterizer". The post said nothing about the DAC.
Simon F
30-May-2002, 09:53
I liked the idea of a single processor (pixel and vertex) implementation because some games will have more pixel shading at some times, and at other times will have more vertex shading.
There are two immediate problems that I see with this.
1: Plumbing: Shipping data around the chip so that it can be used with a centralised processing resource (and making it efficient) would be a nightmare.
2: Cost: To make an (on chip) generalised processor farm do what even medium powered graphics cards currently do, you would have to make it huge. I'm taking a risk of starting a silly debate here but how about this for an analogy?...
You are Henry Ford. You want to build a fleet of cars. You could construct a plethora of "home workshops". The advantage would be that you could also build furniture or whatever else you wanted. The trouble is, nothing would be done very fast.
Alternatively, you could build a specialised car production line, much smaller than the collection of home workshops, and produce a lot of cars rapidly.
demalion
30-May-2002, 10:19
But this all started when Chalnoth said "For example, there is really no need for greater precision than 16 bits per channel in the rasterizer". The post said nothing about the DAC.
OK, with the text that was quoted I lost track of the discussion. My bad. I'm back on track now.
darkblu
30-May-2002, 11:43
Basically, 4 bits of error is 16. In a standard pipeline, bits are truncated, meaning that the error for each average is between 0-2 (25% chance 0, 50% chance 1, 25% chance 2). Obviously, in 16 passes, the average error would be 16, going past the 4 bits of error.
If you could design the pipeline to round off instead of truncate (Or use any other method to center the errors about zero), the error becomes a bit less obvious. What actually happens is that it requires of the order N^2 passes to reach the new 'error threshold' value. This has the twofold impact of increasing the number of passes before the average error reaches the 'error threshold,' and it reduces the number of artifacts seen before this threshold is reached. I made a little program to test this particular probability distribution, and the 'error threshold' is reached at 124 passes. Number of passes before 'error threshold' reached with a rounding algorithm:
1 bit : 2
2 bits: 7
3 bits: 33
4 bits: 124
i'm not sure it's alright to take an average case, as generally one would want to keep a guaranteed error threshold, hence would need to consider the worst case. IOW i'm not sure your gaussian scenario is valid.
let's consider the two cases of 16-bit truncating and 16-bit rounding pipelines with a 12bit DAC (i.e. extra 4 bits in the storage providing granularity of 1/16): worst case errors per single operation/pass would be ERRtrunc -> 1/16 and ERRround -> 1/32, respectively. subsequently, if we take a worst-case error propagation (i.e. worst error with each next pass), it'd take 16 + 1 passes of OP_ADD for the truncating pipeline and 32 + 1 passes of OP_ADD for the rounding pipeline to introduce a unit of error in the significant 12 bits. correct me if i'm wrong.
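The worst-case count above can be checked with a few lines of Python. This is only a sketch of the arithmetic as stated in the post: the function name `worst_case_passes` is hypothetical, the per-pass bounds (one storage LSB for truncation, half an LSB for rounding) are the ones quoted above, and every pass is assumed to contribute its full worst-case error in the same direction.

```python
from fractions import Fraction
from math import ceil


def worst_case_passes(extra_bits, rounding):
    """Passes needed for worst-case per-pass errors to add up to one unit
    of the visible precision, given `extra_bits` guard bits in storage."""
    storage_lsb = Fraction(1, 2 ** extra_bits)   # one storage LSB, in visible units
    # Assumed bound on the error added by a single pass.
    per_pass = storage_lsb / 2 if rounding else storage_lsb
    return ceil(1 / per_pass)


print(worst_case_passes(4, rounding=False))  # truncating pipeline
print(worst_case_passes(4, rounding=True))   # rounding pipeline
```

The "+ 1" in the post can be read as the extra pass needed to strictly exceed one unit, since a truncation error is always slightly less than a full LSB; the sketch only counts the passes needed to reach the bound.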
ed: wording
A major limitation to be considered is the data path between host CPU and the graphics card. Considering how long we had 2xAGP before 4xAGP came around, and has stayed around until 8xAGP will enter the scene towards the end of the year, it seems reasonable to assume that 8xAGP will be around for at least a couple of years. Probably more.
AGP8x is pretty useless as long as there isn't faster system memory. Today's host RAM only delivers a peak bandwidth of 2.7 GB/s. A fast CPU takes away up to 2.1 GB/s, leaving as little as 0.6 GB/s for sending geometry and textures over the AGP.
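For reference, the peak figures in this argument follow from clock rate, transfers per clock and bus width. The sketch below uses nominal values that are assumptions on my part rather than numbers from the post: a 66 MHz, 32-bit AGP bus with 8 transfers per clock, 64-bit DDR333 memory at 166 MHz, and a 64-bit Athlon XP front side bus at 133 MHz double-pumped. The helper name `peak_bandwidth_gb_s` is hypothetical.

```python
def peak_bandwidth_gb_s(clock_mhz, transfers_per_clock, bus_bytes):
    """Theoretical peak bandwidth in GB/s (1 GB = 10^9 bytes)."""
    return clock_mhz * 1e6 * transfers_per_clock * bus_bytes / 1e9


agp8x = peak_bandwidth_gb_s(66.67, 8, 4)     # 32-bit bus, 8 transfers per clock
ddr333 = peak_bandwidth_gb_s(166.67, 2, 8)   # 64-bit bus, double data rate
fsb = peak_bandwidth_gb_s(133.33, 2, 8)      # 64-bit bus, double-pumped
left_for_agp = ddr333 - fsb                  # if the CPU really saturated its bus

print(f"AGP 8x {agp8x:.2f}, DDR333 {ddr333:.2f}, FSB {fsb:.2f}, left {left_for_agp:.2f} GB/s")
```

These are peak numbers only. How much of the front side bus a CPU actually consumes in practice is exactly what is disputed in the replies below.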
Some of the future directions I see 3D hardware heading are:
- Further improvements in occlusion culling and hidden surface removal. IMR's will use combinations of hierarchical z buffering, compressed z's, and multiple z checks per pixel. [...] --- timeframe: late 2002 - 2003
In addition to that, I think it is about time for colorbuffer compression. Moore's law should allow implementing enough logic to do lossless compression/decompression fast enough even for blending.
arjan de lumens
01-Jun-2002, 19:57
AFAIK, a modern CPU will under normal operation not come close to saturating its front side bus (i've heard numbers like 10-30% utilization for most programs), meaning that there in practice will be much more than 0.6 GB/s bandwidth available to the AGP 8X interface - also, the northbridge chip could well be designed to give priority to AGP over the CPU, ensuring full 2.1 GB/s bandwidth to the AGP 8X interface at all times.
Colorbuffer compression is in general too difficult to be worthwhile. Z values tend to vary rather smoothly from one pixel to the next, making compression easy, while color values tend to vary rather erratically, making compression difficult. I'd estimate a ~1.2:1 ratio to be achievable for 16-bit colors, even less for higher color depths (this comes at the cost of making random accesses difficult, negating any bandwidth usage advantage). Multisample buffer compression may be easy, though.
AFAIK, a modern CPU will under normal operation not come close to saturating its front side bus (i've heard numbers like 10-30% utilization for most programs), meaning that there in practice will be much more than 0.6 GB/s bandwidth available to the AGP 8X interface - also, the northbridge chip could well be designed to give priority to AGP over the CPU, ensuring full 2.1 GB/s bandwidth to the AGP 8X interface at all times.
I'm sorry to go OT here, but that's not true. Modern CPUs are very memory bandwidth hungry. Today's Athlon XPs are all memory bandwidth starved; even DDR333/PC2700 with 2.7 GB/s of bandwidth is useless if the FSB can only suck up 2.1 GB/s. The same goes for the P4; it is no wonder it does best on dual channel RDRAM. Also look at Hammer/Opteron: AMD realizes the memory/FSB bottleneck and put the memory controller on die, to give it a better pipe to the memory, and because of its architecture it will fly in SMP (it gains more bandwidth with every CPU you add). AGP8x implementations/performance will be very interesting on Opteron systems.
arjan de lumens
01-Jun-2002, 21:05
[rather OT...]
OK - the 10-30% number was a bit old - IIRC, it came from some guy who used a logic analyzer on a Pentium-II 400's bus and found that its actual bandwidth usage while running programs was around 100 MB/s out of 800 MB/s available bandwidth. Still, I would guess that processors are more limited by memory latency than raw bandwidth (due to the way they just stall on cache misses), in which case the processor will frequently fail to overlap multiple memory accesses and thus be unable to extract more than perhaps ~50% of its theoretical bandwidth. The main reason that the memory controller is moved onchip in the Clawhammer design is that it cuts down memory latency by ~30-50%, not that it gives any additional bandwidth.
[rather OT...]
Still, I would guess that processors are more limited by memory latency than raw bandwidth (due to the way they just stall on cache misses), in which case the processor will frequently fail to overlap multiple memory accesses and thus be unable to extract more than perhaps ~50% of its theoretical bandwidth. The main reason that the memory controller is moved onchip in the Clawhammer design is that it cuts down memory latency by ~30-50%, not that it gives any additional bandwidth.
It does give it more bandwidth (well, indirectly), in that there is no FSB limiting it to 2.1 GB/s like now. The limit will be the real memory bandwidth, not the FSB bottleneck. Latency is an issue; that is why we have prefetch in newer CPUs, which trades a bit of bandwidth for latency. There are several applications which are highly memory/FSB bandwidth limited nowadays: much media (audio/video) work, and Quake3 is also supposed to be memory/FSB limited in low resolutions (the reason why the P4 loves it). That's also why the P4 needs such a high clocked FSB: to shuffle all the data to the pipeline and prevent it from being bandwidth starved. Latency is not such a problem for the P4 (else it would have problems with the higher latency RDRAM).
SimonF,
I totally agree - I think some things in the 3d pipeline should be hardwired.
IMO there should be hardwired units for
- HZ/occlusion culling
- triangle setup
- backface culling/clipping
- rasterization
- Z3 64xFSAA with storage for 4 samples
As for the rest....
A TU is basically a memory address generator and load unit, specific to 1d/2d/3d arrays of data. I don't really know enough to say, but perhaps the texturing units could be made to support more general memory access? Basically a pixel/vertex program running on a VPU would take a stream(s) of input data. It tells its memory units about all the memory areas it will read from (textures, index lists, displacement maps, etc...), the ones it will write to, the ones it wants to read/write. Memory access inside these cacheable memory areas would be general, with optimizations for common access patterns (i.e. texture lookup). The rest of the chip would be made up of identical programmable mini-DSPs.
To use the stream programming model - each program would be a set of kernels operating on either a stream or some fixed size data set (i.e. a primitive program, a vertex program, a pixel program). Some of the kernels are implemented in silicon. The driver would decide which programmable units execute what and how to route data between units efficiently... What the hardwired units are would depend on what type of rendering is being performed (rasterization, ray-tracing, or...?)
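As a very loose software analogy for the kernel/stream idea above (a sketch only, not a description of any real hardware or driver), each kernel can be written as a Python generator that consumes one stream and produces another, with a small scheduler function standing in for the driver that chains them. All names here (`vertex_kernel`, `pixel_kernel`, `run_pipeline`) are hypothetical.

```python
def vertex_kernel(vertices, scale):
    """Per-vertex kernel: scale each 2D position (stand-in for a transform)."""
    for x, y in vertices:
        yield (x * scale, y * scale)


def pixel_kernel(points, color):
    """Per-fragment kernel: snap each point to a pixel and attach a color."""
    for x, y in points:
        yield (round(x), round(y), color)


def run_pipeline(stream, kernels):
    """Stand-in for the driver: route the stream through each kernel in turn."""
    for kernel in kernels:
        stream = kernel(stream)
    return list(stream)


result = run_pipeline(
    [(1, 2), (3, 4)],
    [lambda s: vertex_kernel(s, 2), lambda s: pixel_kernel(s, "red")],
)
print(result)
```

In the design sketched in the post, some of these kernels would be fixed-function silicon and the rest would run on the programmable units; the generator chain only illustrates the data flow, not the scheduling or memory behaviour.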
Would this sort of design be enough for full programmability without sacrificing [too much] speed?
Regards,
Serge
Chalnoth
01-Jun-2002, 22:02
One quick note about AGP8x:
There'd finally be a reason for the nForce's dual-channel DDR SDRAM architecture when integrated video is not in use (Of course, I'm assuming that the next-gen nForce will have the same dual-channel DDR SDRAM interface...which it should).
arjan de lumens
01-Jun-2002, 22:04
It does give it more bandwidth (well, indirectly), in that there is no FSB limiting it to 2.1 GB/s like now. The limit will be the real memory bandwidth, not the FSB bottleneck. Latency is an issue; that is why we have prefetch in newer CPUs, which trades a bit of bandwidth for latency. There are several applications which are highly memory/FSB bandwidth limited nowadays: much media (audio/video) work, and Quake3 is also supposed to be memory/FSB limited in low resolutions (the reason why the P4 loves it). That's also why the P4 needs such a high clocked FSB: to shuffle all the data to the pipeline and prevent it from being bandwidth starved. Latency is not such a problem for the P4 (else it would have problems with the higher latency RDRAM).
There is no technical limitation that holds the EV6 bus of AthlonXP back at 2x133 MHz - AMD has stated that it could scale up to 2x200 MHz without any real problems - so I would still say that latency, not bandwidth, is the main motivation for the onchip controller. On the Pentium4 platform, going from the i850 chipset (PC800 RDRAM) to the i845 chipset (PC133 SDRAM) gives a theoretical bandwidth reduction of 67%, but an actual performance reduction of no more than about 23% in the most bandwidth demanding 3d games (including Quake3; according to this (http://www6.tomshardware.com/mainboard/01q3/010702/index.html) article). And the P4 does have some problems with RDRAM latency - the P4X333 chipset using DDR RAM is in general faster than the i850E chipset, while providing less bandwidth (unless officially unsupported PC1066 RDRAM is used).
Colorbuffer compression is in general too difficult to be worthwhile. Z values tend to vary rather smoothly from one pixel to the next, making compression easy, while color values tend to vary rather erratically, making compression difficult. I'd estimate a ~1.2:1 ratio to be achievable for 16-bit colors, even less for higher color depths (this comes at the cost of making random accesses difficult, negating any bandwidth usage advantage). Multisample buffer compression may be easy, though.
There is a paper (http://www.stanford.edu/~kwkoh/cs448a/compression.pdf) from somebody at Stanford University where he reaches compression ratios of up to 4.65:1 for colorbuffer compression.
x-bit labs did a comparison (http://www.xbitlabs.com/cpu/athlonxp-166/) with an Athlon XP 2000+ between FSB of 2*133 and overclocked to 2*166 using DDR266 and DDR333. Seems it depends on the application...
Regards / ushac
arjan de lumens
01-Jun-2002, 23:29
There is a paper (http://www.stanford.edu/~kwkoh/cs448a/compression.pdf) from somebody at Stanford University where he reaches compression ratios of up to 4.65:1 for colorbuffer compression.
Looks rather suspect to me. They claim to have reached a compression ratio of 3.5:1 using Huffman coding alone. This would imply (assuming 32-bit rgba color) that they use 2.28 bits to encode each color component. Having programmed a Huffman encoder myself, I will say this: The Huffman method just isn't that good on rgba image data. Unless you feed it posterized images or cheat some other, similar, way.
And: I refuse to believe that the compression ratios of PNG and JPEG-LS (state-of-the-art losslessly compressed image file formats) are that easily beaten by that much.
Also, they claim to have done testing with GL traces from 'Quake 4' ..!? Since when has Quake 4 been available?
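To make the 2.28 bits figure and the posterization remark concrete, here is a small Python sketch. It is not the method from the paper: it builds a plain Huffman code over 8-bit color components with no delta predictor, ignores the cost of storing the code table, and uses synthetic data. The function names are hypothetical.

```python
import heapq
from collections import Counter


def bits_per_component(ratio, pixel_bits=32, components=4):
    """Average coded bits per color component implied by a compression ratio."""
    return pixel_bits / ratio / components


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code built from
    the symbol frequencies in `symbols`."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate case: a single symbol
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in a.items()}   # every leaf moves one level down
        merged.update({s: d + 1 for s, d in b.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def compression_ratio(components, bits_per_symbol=8):
    """Uncompressed size divided by Huffman-coded size (code table ignored)."""
    lengths = huffman_code_lengths(components)
    coded_bits = sum(lengths[c] for c in components)
    return bits_per_symbol * len(components) / coded_bits


flat = list(range(256)) * 4                  # every 8-bit value equally likely
posterized = [0, 85, 170, 255] * 256         # only four distinct levels

print(round(bits_per_component(3.5), 2))
print(compression_ratio(flat), compression_ratio(posterized))
```

With evenly spread component values Huffman coding gains nothing, while four equally likely levels compress 4:1, which is the sense in which posterized input flatters the method. Real framebuffer contents fall somewhere in between, and a predictor changes the symbol statistics, so this sketch says nothing about the paper's actual results.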
Mintmaster
02-Jun-2002, 00:26
I'm not agreeing with a generalized processor farm within the graphics chip doing everything - that would defeat the purpose of a graphics chip in the first place. You still need the rasterizer, triangle setup, Z-buffer and occlusion systems, etc.
I (and I believe others in this thread) just think that the shaders part could be shared. Just the brute force math and simple program execution. Eventually they will be doing the same thing anyway (higher precision in the pixel pipeline, textures in the vertex pipeline), so why segregate the resources?
Sure there will be some data routing, but I don't think that'll be a huge problem. Since we're talking about massively parallel calculations, you can easily buffer much of the data to be routed, and then efficiency could be maintained. I'm sure only a select number of polygons on the screens will be using mathematically intensive pixel shaders, so why not use those math units for vertex power when they're sitting idle?
I think it's very logical to combine the vertex and pixel shaders. The unified shader will really just be a processor suited to massively parallel mathematical tasks, and both vertex and pixel processing fall under that category.
How much and what will be shared among vertex and pixel shader implementations is hard to tell at the moment, it is still a bit in the future.
However, the pixel and vertex shader instruction sets are almost certain to merge within the next few releases of DirectX.
This will motivate 3d vendors to search for ways to merge at least portions of the pixel and vertex shaders in hardware. As usual, I expect different hardware vendors to take different approaches. Some might try more general solutions, others more specific and targeted solutions.
I do see the main technology focus staying on 3d hardware programmability for the foreseeable future though. Adding a specific feature such as dot3 bump mapping gives you one more feature, but improving the programmability of the shaders gives you countless additional features.
Currently pixel shaders use very small programs containing only a few instructions. Some Renderman shaders (Renderman in 3d consumer hardware is closer than most people think) require over a thousand similar instructions (all in floating point by the way) and resemble some fairly sizeable C functions. Performing this level of computation is the difference between Renderman level CGI (probably 80% of the CGI rendering market, with Mental Ray and Exluna owning much of the rest) and current realtime 3d graphics. Closing the gap is going to require a lot of transistors dedicated to sophisticated high performance shaders.
Also, they claim to have done testing with GL traces from 'Quake 4' ..!? Since when has Quake 4 been available?
I asked one of the authors about this issue. He said it seems to be a typo and that it was just a "Quake" where they captured OGL streams from.
One of them said Huffman was used because it is easy to implement in hardware and works well with their tile cache approach.
Ailuros
03-Jun-2002, 09:25
Excuse the OT:
ram,
Do you happen to have any links related to research on geometry compression too? I'd be interested to read some. Thanks in advance.
Dave Baumann
03-Jun-2002, 11:14
Ailuros - heres one, with a surprisingly obvious URL:
http://www.3dcompression.com/
3Dlabs have also been talking about some wavelet based geometry compression which I've no idea about.
The Multi-Res Modeling Group (http://www.multires.caltech.edu/pubs/pubs.htm) over at Caltech have a lot of papers on wavelet geometry compression...
Regards / ushac
That compression method is designed so you can send large models over a modem, but not so you can decompress them at high speed with little complexity. Realistically something like Sun's Java3D compression method is what can be expected soonish for 3D hardware.
arjan de lumens
03-Jun-2002, 12:33
I asked one of the authors about this issue. He said it seems to be a typo and that it was just a "Quake" where they captured OGL streams from.
One of them said Huffman was used because it is easy to implement in hardware and works well with their tile cache approach.
I still find the results they claim hard to believe, since their methods are similar to the methods PNG uses (delta predictors+Huffman for both approaches) yet the compression ratio they claim is far beyond what PNG achieves, even though a PNG encoder, unlike a hardware framebuffer compressor, has practically infinite time available to compress the data. Also, Huffman encoding/decoding is a highly serial process which is difficult and really expensive to parallelize (due to variable-length symbols, you cannot even begin to encode/decode a symbol correctly until you are done encoding/decoding all preceding symbols), and so is rather badly suited to hardware implementation.
I am back from my short trip.
Simon F:
There are two immediate problems that I see with this.
1: Plumbing: Shipping data around the chip so that it can be used with a centralised processing resource (and making it efficient) would be a nightmare.
2: Cost: To make an (on chip) generalised processor farm do what even medium powered graphics cards currently do, you would have to make it huge. I'm taking a risk of starting a silly debate here but how about this for an analogy?...
You are Henry Ford. You want to build a fleet of cars. You could construct a plethora of "home workshops". The advantage would be that you could also build furniture or whatever else you wanted. The trouble is, nothing would be done very fast.
Alternatively, you could build a specialised car production line, much smaller than the collection of home workshops, and produce a lot of cars rapidly.
Humm, lets see. Imagine the following scenario:
- A single .09 micron (8 metal layers) die the size of GF4, probably around 250 million transistors.
-edited: Simple but powerful RISC-like SIMD graphics processors, floating point intensive, with 4 million transistors each and a small memory (128KB 1T-SRAM), like a small cell.
- Each 4 cells have a big 512bits bus to share and form a small group
- 8 groups total 32 processors.
- 4MB of 1T-SRAM inside the chip (multiple purposes)
- A big scheduler and dynamic load balance processor.
- A big crossbar in the center of the chip connecting all 8 groups (using the extra metal layers) and the internal 1TSRAM memory and external memory (the external has its own crossbar).
Keeping the processors busy is really a difficult task, but maybe with many buses, local memory and crossbars it could work (I hope).
I will think more later
Crossbars take too much area, the future is on chip switching fabrics.
What is a switching fabric?
It's the circuitry needed for routing data.
So instead of a crossbar or centralised mailbox system (can be a register set, or embedded memory) as means of communication between nodes you give them their own switching fabrics and put them in a network ... the most natural topology of the network being a mesh of course.
I think I understand.
Many possibilities of use:
- Have the multiple processors work like a virtual systolic system with cell to cell communication (using the switch fabric) and minimal centralized communication (a possible bottleneck).
- Have the processors each work on a different tile.
- Some multigroup multicast capability could be useful to transmit data and programs, saving bandwidth and latency too.
It is all programming 8)
mboeller
03-Jun-2002, 20:54
I just searched for polygon-compression too (because of Kristof's comments about Kyro). Here are the links I found:
http://www.gvu.gatech.edu/gvu/people/official/jarek.rossignac/abstracts.html
http://www-grail.usc.edu/pubs.html
http://www.comp.nus.edu.sg/~tulika/publications.htm
http://staff.ncst.ernet.in/~dinesh/research/compression/geomComp.html (with link to Java3D and VRML compression )
http://www.cc.gatech.edu/gvu/modeling/compression.html
arjan de lumens
04-Jun-2002, 01:30
Some thoughts around the sea-of-DSPs approaches that people here suggest:
You still need hardwired texturing units for acceptable 3D performance. Breaking the texturing operation (with trilinear interpolation, perspective division, texture coordinate clamping and scaling, compressed texture unpacking, etc) into a sequence of instructions in a standard RISC/CISC/VLIW instruction set will quickly degrade performance by a factor of 15-30 relative to what is achievable with a similarly clocked hardwired texture mapper. Even SIMD instructions won't help this. One possible solution is to let each DSP have its own texture mapper, accessible through specialized instructions, but this still leaves the problem that the texture mapper, while fully pipelined, has a quite high latency (perhaps ~10-20 clocks) and that you really want it to be utilized as much of the time as possible.
There are numerous small tasks performed in a standard fixed-function renderer (rasterization, setup of gouraud colors/texture coordinates, alpha test, fog, stencil test, dithering, polygon Z offsetting etc) which will quickly add up to a substantial number of instructions needed per pixel if done by DSPs. This overhead, along with texturing, is what kills the 3D performance of a pure sea-of-DSPs approach compared to the traditional hardwired pipeline. Again, SIMD doesn't really help very much.
Distribution of program code to each DSP can be ugly, unless each of them has a large enough instruction cache. Perhaps not a big problem, but caches cost transistors.
Unless you set the DSPs up to function as a systolic array of some sort, the direct interconnects between them probably aren't particularly important for performance. At most, a simple mesh interconnect should suffice for most practical uses. If one DSP produces lots of data that another DSP consumes, it is probably appropriate to let them share a high-bandwidth local memory block.
Off-chip memory access patterns would likely be extremely irregular, producing lots of DRAM page breaks all the time. So a crossbar DRAM controller would be absolutely necessary for half-decent performance. Also, the way the DSPs access memory data must be strictly controlled at all times, otherwise you will run into a total data coherency nightmare.
Stream processing can deal quite well with high latencies. Locality from a global point of view is mostly dependent on the algorithm ... a naive memory interface would jump from here to there, but by accumulating and reordering/batching memory accesses you can remove that problem and only worry about the algorithmic side (a crossbar would not suffice IMO).
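The effect of accumulating and reordering accesses can be sketched in a few lines of Python. This is a toy model under stated assumptions: a single DRAM bank, a hypothetical 2 KB page size, a "page break" counted whenever two consecutive accesses hit different pages, and no read/write ordering hazards. The function names are made up for the example.

```python
def page_breaks(addresses, page_size=2048):
    """Count how often consecutive accesses land in different DRAM pages."""
    pages = [addr // page_size for addr in addresses]
    return sum(1 for prev, cur in zip(pages, pages[1:]) if prev != cur)


def batch_by_page(addresses, page_size=2048):
    """Reorder a window of pending accesses so same-page accesses are adjacent.
    sorted() is stable, so the original order within each page is kept."""
    return sorted(addresses, key=lambda addr: addr // page_size)


# Two streams interleaved naively, as independent units might issue them.
pending = [0, 4096, 64, 4160, 128, 4224]
print(page_breaks(pending), page_breaks(batch_by_page(pending)))
```

The price of batching is latency: a request may wait in the window until its page comes up, which fits the point above that stream processing tolerates high latencies well.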
Ailuros
04-Jun-2002, 02:31
Dave, ushac, mboeller,
Thank you very much indeed. Those should keep me busy a couple of days :o
Here's one on Gamasutra about Dense Meshes (>80K) and how to represent and store them. (Actually talks of Wavelet compression)
http://www.gamasutra.com/features/20020410/brickhill_01.htm
http://www.vrml.org/WorkingGroups/vrml-cbf/cbfwg.htm
You still need hardwired texturing units for acceptable 3D performance. Breaking the texturing operation (with trilinear interpolation, perspective division, texture coordinate clamping and scaling, compressed texture unpacking, etc.) into a sequence of instructions in a standard RISC/CISC/VLIW instruction set will quickly degrade performance by a factor of 15-30 relative to what is achievable with a similarly clocked hardwired texture mapper. Even SIMD instructions won't help this. One possible solution is to let each DSP have its own texture mapper, accessible through specialized instructions, but this still leaves the problem that the texture mapper, while fully pipelined, has a quite high latency (perhaps ~10-20 clocks) and that you really want it to be utilized as much of the time as possible.
It begs the question: at what point will textures cease to be necessary? Texture mapping is an approximation/substitute for geometric detail because the processing power wasn't there. With the advancement in lithography and the huge increase in transistor counts it affords, when, if ever, will textures be replaced by geometry and high-level vertex shading (triangle = pixel in size = sub-pixel accuracy, so why fragment shade?)? Disregarding overdraw, you need to sustain around 80M triangles a second to cover a 1280*1024 screen at 60 Hz. If you can solve the bandwidth and storage problems, when, or again if, will this happen?
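The 80M figure follows from assuming one pixel-sized triangle per screen pixel per frame, with no overdraw. A quick check in Python (the variable names are just for illustration):

```python
width, height, refresh_hz = 1280, 1024, 60

pixels_per_frame = width * height                     # one triangle per pixel
triangles_per_second = pixels_per_frame * refresh_hz  # no overdraw assumed

print(f"{triangles_per_second / 1e6:.1f}M triangles/s")
```

Any overdraw multiplies this number directly, so the real requirement would be a small multiple of it.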
EDIT: Ohh yeah, you can't do the TCU thing because it's a catch-22 scenario. You're drastically increasing transistor count, which means that (assuming you're bound to a set process, say 0.13um) it'll come at the sacrifice of your array, and you lose programmability. Programmable power is directly related to transistor counts, and thus to lithography limits, and as such is ultimately controlled by Moore's law. Unless you can break it through technological advancement or multichip.
If you stop using textures it's time to stop using explicit surface representations altogether, they are tied at the hip ... when you go that far it's time to switch to point-clouds.
It begs the question: at what point will textures cease to be necessary? Texture mapping is an approximation/substitute for geometric detail because the processing power wasn't there
Probably never. There are enough uses for texture mapping that would require a wasteful amount of geometry power to emulate using just geometry that you are unlikely to ever see it go away.
If you ignore real-time raytracing, the only way to get good real-time reflections is with texture mapping. And, even if a good way for doing geometric reflections were discovered, doing blurry reflections purely in object space would be virtually impossible (you could supersample the reflections; however, that would have all the same problems associated with accumulation-buffer based depth of field and motion blur algorithms). On the other hand, MIP mapping textures is an ideal solution to the problem.
I think another part of the problem is a mentality that most (all) programmers share -- don't be wasteful. Even if a machine had infinite resources, and could handle everything in object space, you would still be likely to see image-space algorithms and texturing featured prominently, because they do exactly what you want easily and cheaply. Texturing isn't broken, so we probably shouldn't focus on fixing it.
Some thoughts around the sea-of-DSPs approaches that people here suggest:
You still need hardwired texturing units for acceptable 3D performance. Breaking the texturing operation (with trilinear interpolation, perspective division, texture coordinate clamping and scaling, compressed texture unpacking, etc.) into a sequence of instructions in a standard RISC/CISC/VLIW instruction set will quickly degrade performance by a factor of 15-30 relative to what is achievable with a similarly clocked hardwired texture mapper. Even SIMD instructions won't help this. One possible solution is to let each DSP have its own texture mapper, accessible through specialized instructions, but this still leaves the problem that the texture mapper, while fully pipelined, has a quite high latency (perhaps ~10-20 clocks) and that you really want it to be utilized as much of the time as possible.
There are numerous small tasks performed in a standard fixed-function renderer (rasterization, setup of gouraud colors/texture coordinates, alpha test, fog, stencil test, dithering, polygon Z offsetting etc) which will quickly add up to a substantial number of instructions needed per pixel if done by DSPs. This overhead, along with texturing, is what kills the 3D performance of a pure sea-of-DSPs approach compared to the traditional hardwired pipeline. Again, SIMD doesn't really help very much.
Distribution of program code to each DSP can be ugly, unless each of them has a large enough instruction cache. Perhaps not a big problem, but caches cost transistors.
Unless you set the DSPs up to function as a systolic array of some sort, the direct interconnects between them probably aren't particularly important for performance. At most, a simple mesh interconnect should suffice for most practical uses. If one DSP produces lots of data that another DSP consumes, it is probably appropriate to let them share a high-bandwidth local memory block.
Off-chip memory access patterns would likely be extremely irregular, producing lots of DRAM page breaks all the time. So a crossbar DRAM controller would be absolutely necessary for half-decent performance. Also, the way the DSPs access memory data must be strictly controlled at all times, otherwise you will run into a total data coherency nightmare.
You are right, let me try to address your concerns:
1- It need not be a pure DSP-like RISC. It could be a graphics-specialized RISC (microcoded RISC). Was not the Vérité V1000 a RISC-like processor?
2- Each processor could have a small 1T-SRAM local memory (64KB or 128KB) to store the programs and some data.
3- Tasks could be distributed by a task scheduler processor.
4- It would then be like a graphics pipeline: each processor (or group of processors) would run a specialized program determined by the scheduler.
5- The processors of the pipeline would communicate with each other using a switching fabric. The switching fabric is necessary to give flexibility. Some multicast capability could be useful.
6- One internal 4MB 1T-SRAM with high bandwidth could be used as a large cache for the RISC farm.
Well, I am not a graphics expert but I think some of the people here could think/design something better.
I still find the results they claim hard to believe, since their methods are similar to the methods PNG uses (delta predictors+Huffman for both approaches) yet the compression ratio they claim is far beyond what PNG achieves, even though a PNG encoder, unlike a hardware framebuffer compressor, has practically infinite time available to compress the data.
They tested 24-bit RGB and compressed per color channel. The reason they give for the high compression ratio is that one of their test scenes, the "Atlantis" part, is all shades of blue, therefore they get an unusually high compression ratio for this scene, which results in a higher average. The other scenes gave compression ratios between 2:1 and 3:1.
Also, Huffman encoding/decoding is a highly serial process which is difficult and really expensive to parallelize (due to variable-length symbols; you cannot even begin to encode/decode a symbol correctly until you are done encoding/decoding all preceding symbols), and so is rather badly suited to hardware implementation.
There are other compression algorithms too which should yield similar compression ratios. Nvidia and ATI said they used DDPCM and an algorithm with lossless 4:1 Z-compression. It would be interesting to see what kind of compression you get if using this algorithm for RGB data. I still think color buffer compression is a viable option for the future, even if it is expensive to parallelize. Available silicon space is growing much faster than memory bandwidth, so "wasting" logic even for relatively small gains like a 2:1 to 3:1 compression for RGB is IMO worth it.
You are right, let me try to address your concerns:
1- It need not be a pure DSP-like RISC. It could be a graphics-specialized RISC (microcoded RISC). Was not the Vérité V1000 a RISC-like processor?
It had a RISC core, yes. IIRC it was used for triangle setup stuff and the rendering pipe was hardwired.
Gunhead
04-Jun-2002, 13:32
I agree with gking about textures.
I'd like to add about the "hack" nature of texturing: although, say, a bumpmap emulates "true" geometry and a lightmap emulates "true" lighting, there are other cases where a texture doesn't emulate anything but simply represents the original colouring of the object's surface. So I believe texturing won't need to go away.
And a highly usable (but "simple" to employ) function like render-to-texture would be difficult to replace with something else, I guess?
arjan de lumens
04-Jun-2002, 14:47
They tested 24-bit RGB and compressed per color channel. The reason they give for the high compression ratio is that one of their test scenes, the "Atlantis" part, is all shades of blue, therefore they get an unusually high compression ratio for this scene, which results in a higher average. The other scenes gave compression ratios between 2:1 and 3:1.
For such a scenario as "Atlantis", depending on how smoothly the color changed, compression rates of 4:1 to 10:1 may possibly be attainable; the other scenarios sound substantially harder to get good compression ratios out of, but 2:1 might be attainable with substantial effort.
There are other compression algorithms too which should yield similar compression ratios. Nvidia and ATI said they used DDPCM and an algorithm with lossless 4:1 Z-compression. It would be interesting to see what kind of compression you get if using this algorithm for RGB data. I still think color buffer compression is a viable option for the future, even if it is expensive to parallelize. Available silicon space is growing much faster than memory bandwidth, so "wasting" logic even for relatively small gains like a 2:1 to 3:1 compression for RGB is IMO worth it.
The 4:1 compression NV/ATI claim for Z compression is a best-case number; AFAIK, the typical compression ratio they reach is closer to 2:1 for most practical scenes. DDPCM works well on untextured surfaces or textures with smooth gradients or large uniformly colored patches - for highly detailed textures, it breaks down. If RGB compression is taken into use, I would expect the first methods to be used to be a bit cruder than Huffman, in order to allow better parallelism more easily.
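To make the smooth-versus-detailed contrast concrete, here is a small Python sketch. It assumes DDPCM means storing differences of differences (second-order deltas); the function names and the sample rows are invented for illustration and are not taken from any vendor's scheme.

```python
def deltas(values):
    """First-order differences; the first value is kept unchanged."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def undo_deltas(residuals):
    """Inverse of deltas(): a running sum restores the original values."""
    out, total = [], 0
    for r in residuals:
        total += r
        out.append(total)
    return out

def second_order_residuals(values):
    """Differences of differences, the assumed meaning of DDPCM here."""
    return deltas(deltas(values))

smooth = [10 + 3 * i for i in range(8)]        # a linear gradient along a row
detailed = [12, 200, 37, 180, 5, 251, 90, 3]   # a busy, high-contrast row

smooth_res = second_order_residuals(smooth)
detailed_res = second_order_residuals(detailed)

print(smooth_res)    # zeros after the first two entries
print(detailed_res)  # residuals larger than the 8-bit inputs themselves
```

On the gradient, everything after the first two entries is zero and would need almost no bits. On the detailed row the residuals span a wider range than the original values, so the predictor actively hurts.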
A full parallel Huffman decoder capable of decompressing, say, 256 bits per clock (which is needed if 64-bit color or 8 pipelines are desired) is doable with a technique called parallel prefix computation, but such a circuit takes tens of millions of transistors. You want it? Pay up. :-?
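A toy Python sketch of why the decode is serial and how a prefix-style computation can get around it. The code table and function names are hypothetical, and this only shows the general idea (guess a symbol at every bit offset, then work out which offsets are real symbol starts by pointer doubling); it is not the circuit described above.

```python
# Hypothetical prefix code, chosen only for illustration.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}
DECODE = {bits: sym for sym, bits in CODE.items()}
MAX_LEN = max(len(bits) for bits in DECODE)

def symbol_at(bitstring, pos):
    """Return (symbol, length) for the code word starting at pos, or None."""
    for length in range(1, MAX_LEN + 1):
        sym = DECODE.get(bitstring[pos:pos + length])
        if sym is not None:
            return sym, length
    return None

def decode_serial(bitstring):
    """Classic decode: each start offset depends on all earlier symbols.
    Assumes a well-formed stream."""
    out, pos = [], 0
    while pos < len(bitstring):
        sym, length = symbol_at(bitstring, pos)
        out.append(sym)
        pos += length
    return "".join(out)

def decode_speculative(bitstring):
    """Guess a symbol at every offset, then find the true starts in log steps."""
    n = len(bitstring)
    # Independent per offset, so hardware could do all of these at once.
    guesses = [symbol_at(bitstring, pos) for pos in range(n)]
    # jump[pos]: where the next symbol would start; n is an end sentinel.
    jump = [pos + g[1] if g else n for pos, g in enumerate(guesses)] + [n]
    is_start = [False] * (n + 1)
    is_start[0] = True
    for _ in range(n.bit_length()):          # about log2(n) rounds
        marked = list(is_start)
        for pos in range(n + 1):
            if is_start[pos]:
                marked[jump[pos]] = True
        is_start = marked
        jump = [jump[j] for j in jump]       # double the hop distance
    return "".join(guesses[pos][0] for pos in range(n) if is_start[pos])

message = "abacabaddcb"
bits = "".join(CODE[s] for s in message)
print(decode_serial(bits) == decode_speculative(bits) == message)
```

The cost is visible even in the toy: a candidate decode at every bit offset, most of which get thrown away, plus the doubling network. That is where the transistors go.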