Features the R420 should have had

GameCat

We all know that the R420 is very similar to the R300 feature-wise, bar a few additions like longer shader instruction lengths and 3Dc. I really don't care that it doesn't support shader model 3.0; although texture lookups in vertex shaders would be kind of nice, most of the other features in SM3.0 are more about performance than fancy effects. Floating point blending and filtering is also a nice feature that the R420 lacks, but it isn't THAT big of a deal IMHO since you can do plenty of HDR effects anyway on it with higher precision integer formats if you tweak a little.

There are many features which aren't related to SM3.0 that I do miss however.

* Percentage closer filtering on shadow maps and native support for depth textures. This offers much higher quality shadow maps. With the added power of the latest cards you might want to do your own shadow map filtering to get decent soft shadows, but that really isn't a reason not to support bilinear PCF on shadow maps, since it gives you higher quality for the same cost even when you do your own filtering.

* Partial derivative operators in the pixel shader. This is useful in LOTS of shaders.

Does anyone have any idea why these aren't supported? I can't imagine supporting PCF can be very expensive transistor-wise; you basically need to compare the r texture coordinate against the texels before filtering, and precision doesn't need to be high at all. Is this related to the fact that ATi do texture lookups in parallel with math ops while nvidia have some sort of hybrid unit that does both?
Partial derivative operators are also cheap, basically a subtraction. They do require access to registers on a quad level though. Maybe this causes some concurrency issues? Might be related to why nvidia has a register usage performance hit and ATi does not?

I know too little about graphics hardware architecture to come up with reasons, so I'm counting on you guys :) Of course it might not be expensive or hard to support at all, in which case ATi are simply lazy and we should all badger them about it so they don't make the same mistake next gen ;)
 
When it comes to PCF you can do it yourself inside the pixel shader, not a big deal IMHO. Besides - the way the GF3+ does it does not work with cube maps. It is very hardwired functionality and I am very positive about it NOT being present in the new ATI hw.
Lack of gradient instructions is kind of disappointing and I don't know any easy way to emulate them. I guess they are generally useful for handling aliasing issues in procedural shaders, so I'd put this functionality together with ps3.0 - interesting to researchers, not really useful at the present time.
I think that the lack of fp blending (not filtering - that is again doable in a shader) is VERY disappointing. A lot of extra work will be needed to make a true HDR renderer work on the R420. However, there are plenty of R3xx cards out there and they need to be supported anyway, so that code path has to be written regardless :(. I still have no idea what I am going to do with particle systems, because for them you have to have blending; for other stuff it is just a few extra rendertarget switches - it hurts, but doesn't kill you.
 
You can't do real PCF in the pixel shader without really racking up the instruction length. It'll be a good 5-10 times faster with native support at minimal silicon cost, as GameCat says. The demos on ATI's site don't do PCF, although they claim to. They just do some supersampling.

PCF requires you to weigh 4 comparisons (i.e. 1 or 0 values) by the same weights the 4 nearest texels are weighted when doing bilinear filtering. It can be extended to trilinear too, but you need a way to generate mipmaps, and now I'm getting off topic...
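
In C it looks roughly like this (quick untested sketch, made-up names, no edge clamping or depth bias):
Code:
#include <math.h>

/* Rough sketch of bilinear-weighted PCF for one shadow map lookup.
   shadow_map holds the stored light-space depths, (s, t) is the texel-space
   coordinate, light_depth is the receiver's depth in light space. */
float pcf_bilinear(const float *shadow_map, int width,
                   float s, float t, float light_depth)
{
    int   x0 = (int)floorf(s - 0.5f), y0 = (int)floorf(t - 0.5f);
    float fx = (s - 0.5f) - x0,       fy = (t - 0.5f) - y0;

    /* Compare first (1 = lit, 0 = in shadow)... */
    float c00 = shadow_map[ y0      * width + x0    ] >= light_depth;
    float c10 = shadow_map[ y0      * width + x0 + 1] >= light_depth;
    float c01 = shadow_map[(y0 + 1) * width + x0    ] >= light_depth;
    float c11 = shadow_map[(y0 + 1) * width + x0 + 1] >= light_depth;

    /* ...then blend the four comparison results with the bilinear weights. */
    return (1 - fx) * (1 - fy) * c00 + fx * (1 - fy) * c10
         + (1 - fx) *  fy      * c01 + fx *  fy      * c11;
}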

In any case, PCF is very nice for shadow maps, and GameCat is right. frost_add, you have a good point regarding cubemaps. I think all of Humus' demos require them, and he compares distance squared. I'm not too familiar with NVidia's implementation, but I believe a lot of this has to do with DirectX.

I can only think they didn't add these little, easy-to-implement features because the architecture was so similar to R300, and they didn't want developers alienating their previous generation and supporting NVidia's features (if that makes sense...). ATI has a stronghold with their previous generation. Either that or I'm being way too cynical and they were just lazy in some aspects of R420.

Gradient instructions, I believe, really just need a wire (if I may be overly simplistic). R300 already calculates these values (or very similar ones) per quad when doing texture lookups. It would just be a matter of occupying a texture instruction slot.
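
To be concrete about "just a wire", here is all the math involved, written out as a purely illustrative C model of a 2x2 quad:
Code:
/* Illustrative only, not how any real chip is wired: v[y][x] holds the value
   being differentiated at each pixel of the 2x2 quad. The coarse screen-space
   derivative is just a difference between neighbours in the quad - the same
   kind of difference the texture unit already forms to pick a mip level. */
typedef struct { float v[2][2]; } Quad;

float quad_ddx(const Quad *q, int y) { return q->v[y][1] - q->v[y][0]; }
float quad_ddy(const Quad *q, int x) { return q->v[1][x] - q->v[0][x]; }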

One way of getting a number proportional to the max of the x and y derivatives is to just do a dependent lookup into a texture with different shades of grey in each mip-map (zeckensack suggested this). Trilinear filtering should take care of the rest. I think such a value could be useful in antialiasing algorithms. GameCat, were you thinking of any other application?
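
To spell the trick out a little (rough C sketch, made-up names): every mip level of the probe texture is a constant grey equal to its level number, so a trilinear dependent lookup returns roughly the selected LOD, which is roughly log2 of the larger of the two derivatives of the lookup coordinate.
Code:
#include <stdlib.h>

/* Build the mip chain for the "LOD probe" texture: level i is filled with the
   constant grey i / (num_levels - 1). Levels are packed one after another in
   the returned array; uploading them as an actual texture is left out. */
float *make_lod_probe_mips(int base_size, int num_levels)
{
    size_t total = 0;
    for (int i = 0; i < num_levels; ++i) {
        int s = base_size >> i;
        total += (size_t)s * s;
    }
    float *texels = (float *)malloc(total * sizeof *texels);
    size_t off = 0;
    for (int i = 0; i < num_levels; ++i) {
        int s = base_size >> i;
        float grey = num_levels > 1 ? (float)i / (num_levels - 1) : 0.0f;
        for (int t = 0; t < s * s; ++t)
            texels[off + t] = grey;
        off += (size_t)s * s;
    }
    return texels;
}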
 
Hey I've done PCF filtering in "just" 23 instructions.
To me it looks quite long though...
 
Mintmaster said:
Either that or I'm being way too cynical and they were just lazy in some aspects of R420.

I'm guessing that it's not so much laziness as "other priorities". After all, they have two other major projects going on, the next gen Nintendo and X-Box 2 chips. And if you're going to keep an architecture over a generation then the R300 is definitely a good one to choose :)
 
Well, laziness is a relative term, which I tend to use liberally. I should have followed that line with a ;) or a :D - I don't actually think they're lazy.

As GameCat says, these are pretty easy things to do. They definitely would have made it into R420 if ATI wanted, and it wouldn't cost any significant die area. ATI deserves a break though. They should see plenty of profits this time around, carrying over the momentum they had in the previous year and a half.
 
GameCat said:
Floating point blending and filtering is also a nice feature that the R420 lacks, but it isn't THAT big of a deal IMHO since you can do plenty of HDR effects anyway on it with higher precision integer formats if you tweak a little.

Assuming that the high-precision integer formats support blending, which "the other thread" suggests that they don't.
 
Mintmaster said:
One way of getting a number proportional to the max of the x and y derivatives is to just do a dependent lookup into a texture with different shades of grey in each mip-map (zeckensack suggested this). Trilinear filtering should take care of the rest. I think such a value could be useful in antialiasing algorithms. GameCat, were you thinking of any other application?

Well, I find you often have some shader-derived value that indicates height/displacement and you need a normal for lighting calcs. This is easily solved with partial derivative instructions but takes lots of extra work to do manually, since you basically have to redo work per fragment just to get the values of neighbouring fragments. More specifically, I would personally use this in a cloud shader.
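
Something like this, in rough C terms (made-up names): without gradient instructions the two extra height evaluations have to be done by hand, which is exactly the needless per-fragment work I mean.
Code:
#include <math.h>

typedef struct { float x, y, z; } Vec3;

/* Rough sketch: build a lighting normal from a procedurally derived height.
   With ddx/ddy the two forward differences would come for free per quad;
   here height() has to be re-evaluated twice per fragment instead. */
Vec3 normal_from_height(float (*height)(float, float), float x, float y, float eps)
{
    float dhdx = (height(x + eps, y) - height(x, y)) / eps;
    float dhdy = (height(x, y + eps) - height(x, y)) / eps;
    Vec3  n    = { -dhdx, -dhdy, 1.0f };
    float len  = sqrtf(n.x * n.x + n.y * n.y + n.z * n.z);
    n.x /= len; n.y /= len; n.z /= len;
    return n;
}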

nutball said:
Assuming that the high-precision integer formats support blending, which "the other thread" suggests that they don't.

I think that's a Direct3D issue. It works in OpenGL IIRC. I think one of Humus' demos uses it. But I haven't tried personally, so I might be spreading misinformation. My guess is that they can fudge some blend modes like simple additive, but can't support all modes. Apparently alpha test is broken as well. This would mean they can't report the caps bit in Direct3D but they are free to expose the functionality in OpenGL. If you use the unsupported features in OpenGL you just get a software fallback. They support the accumulation buffer in hardware with a signed 16-bit integer format, so obviously it can be made to work under certain restrictions.
 
GameCat said:
I think that's a Direct3D issue. It works in OpenGL IIRC. I think one of Humus' demos uses it. But I haven't tried personally, so I might be spreading misinformation. My guess is that they can fudge some blend modes like simple additive, but can't support all modes. Apparently alpha test is broken as well. This would mean they can't report the caps bit in Direct3D but they are free to expose the functionality in OpenGL. If you use the unsupported features in OpenGL you just get a software fallback. They support the accumulation buffer in hardware with a signed 16-bit integer format, so obviously it can be made to work under certain restrictions.

The difference between D3D and OpenGL is that Direct3D allows you to do an exact query of the form "Can I do blending with this format?" As far as I know, OGL does not. Not even talking about blend modes ... In D3D there are plenty of caps related to blending. If they can support additive but not other modes, I guess they could still expose it in D3D.
It might be that:
a) they can't do it because of some subtle API issue that could prevent the driver from passing WHQL tests,
b) they don't want to expose a partial feature in D3D - the problem is that developers usually take caps like blend modes for granted, especially recently (because most hardware exposes all of them, at least on 32-bit RTs), or
c) they don't really support it in hardware, and they do a SW fallback all the time.
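
For reference, the kind of exact query I mean looks something like this in D3D9 (writing from memory, so treat it as a sketch; error handling omitted):
Code:
#include <d3d9.h>

/* Sketch: ask D3D9 whether post-pixel-shader blending is supported for a
   render target texture of the given format, e.g. D3DFMT_A16B16G16R16F. */
int can_blend_to(IDirect3D9 *d3d, D3DFORMAT fmt)
{
    HRESULT hr = d3d->lpVtbl->CheckDeviceFormat(
        d3d, D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL,
        D3DFMT_X8R8G8B8,   /* current adapter/display format */
        D3DUSAGE_RENDERTARGET | D3DUSAGE_QUERY_POSTPIXELSHADER_BLENDING,
        D3DRTYPE_TEXTURE, fmt);
    return SUCCEEDED(hr);
}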
 
Mintmaster said:
PCF requires you to weigh 4 comparisons (i.e. 1 or 0 values) by the same weights the 4 nearest texels are weighted when doing bilinear filtering. It can be extended to trilinear too, but you need a way to generate mipmaps, and now I'm getting off topic...

Staying off topic. Look up Rendering Antialiased Shadows with Depth Maps, by Reeves, Salesin and Cook. Siggraph 1987.

PCF should *NOT* be weighted. Look at the c code in Figure 4.
Code:
return(((float) inshadow) / (ns*nt));

This is important. Bilinear Weighted PCF (which some implementors have chosen) can only ever give you *low* *quality* shadow filtering. You can't do several Bilinear Weighted PCF fetches and combine them in any meaningful way. (That holds for Trilinear Weighted PCF as well.)

Why do people like Bilinear Weighted PCF? Because it's "better" in their eyes than single-sampled depth maps. And it's "free" - it costs virtually the same as a single-sampled depth map. (Trilinear Weighted PCF is *not* free. It's not so much the cost of the fetch as the cost of generating the mipmap array. You should *not* GENERATE_MIPMAP depth maps.)

FWIW, you can implement the Percent Closer Filtering Algorithm of Figure 4 in a fragment shader. (Including the random biases if you insist, but they are so overkill, way off topic to go into details why.)

But you have to turn off the Bilinear Weighted PCF to get it right. You have to do the 16 point samples.
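
Something along these lines (quick sketch, bias and jitter left out, using the same kind of made-up shadow_map layout as the sketch earlier in the thread):
Code:
/* Unweighted PCF in the spirit of Figure 4: point-sample an ns x nt grid
   across the filter footprint, count how many samples are in shadow, and
   return that fraction - nothing is weighted by distance to the samples. */
float pcf_unweighted(const float *shadow_map, int width,
                     float s, float t, float light_depth,
                     float radius, int ns, int nt)
{
    int inshadow = 0;
    for (int j = 0; j < nt; ++j)
        for (int i = 0; i < ns; ++i) {
            float ss = s + radius * (2.0f * (i + 0.5f) / ns - 1.0f);
            float tt = t + radius * (2.0f * (j + 0.5f) / nt - 1.0f);
            int   x  = (int)ss, y = (int)tt;      /* nearest texel, no clamp */
            if (shadow_map[y * width + x] < light_depth)
                ++inshadow;
        }
    return (float)inshadow / (ns * nt);           /* same as Figure 4's return */
}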

Yes, it's expensive. Is the quality worth it? In a Homer donut voice - "shadows - mmmmmm".

(
Oooops, OpenGL->DX universal translator:
depth map -> shadow map
GENERATE_MIPMAP -> automipmap
fragment shader -> pixel shader
)

Back to on-topic, I'm afraid I'll just keep listening. Good stuff.

-mr. bill
 
Mintmaster said:
It'll be a good 5-10 times faster with native support at minimal silicon cost, as GameCat says.

Based on what do you keep saying it's minimal silicon cost?

Let's see: shadow mapping usually involves comparing a stored depth value (or values - the projected light depth) with a value provided from the vertex shader (the light depth in camera view). So you have to pass a value from the PS into the filtering logic; actually that would be a different value for each sample taken (pixel), and in addition this should be a pretty high accuracy value (24 or even 32 bits). So on a GF 6800 or ATI R420 that means passing sixteen (16) 24 or 32 bit values all the way through texture coordinate gen and then to the filter logic... that's a lot of dataflow there. Then you need a subtract unit which subtracts the texture samples from the value you passed through (or the other way around, too lazy to figure it out) and then you need to do a compare. You need the subtract and compare for each sample taken; let's say bilinear, that would be 16 pipes x 4 samples = 64 subtract units and 64 compare units, and then you can use the normal bilinear logic... OK, now does that sound sensible for just ONE specific fixed function usage?

I think the real killer is passing the value to compare with from the PS into the filter logic where you do the compare, you'd need a bus of 16x24bits = 384 bits wide or 16x32bits = 512 bits... not to mention extra latency introduced by the subtract/compare units which now sit in your filtering pipeline.

PCF is not used frequently enough to justify all that silicon and fixed, limited-usage logic... or is it?

K-
 
For what it's worth, I do think the increased (infinite?) fragment shader length is significant. Especially if you can trade some of those instruction slots for the missing fixed function stuff.

And I'm sure you can do it with less than 23 instructions if you scale down quality. Most game renderers will let you control shadow quality anyway, if currently shipping games are any indication.

Now the bad thing about dedicated fixed function circuitry is that it just sits idle if you don't use it: you don't gain anything and still have a larger die, which may have further implications on overall performance. With a software solution to a given problem you can make various tradeoffs; you can exchange performance between "it" and other things. The good thing about ff hw, of course, is high performance if you want exactly the thing the ff hw was designed to do. That's all there is to it.

I may be stating the obvious here, but I do think it's a useful angle to look at this current issue. Apparently, somewhere down the line all the lessons learned from "HW T&L", "EMBM" or "matrix palette skinning" have been lost ;)

You'll still be able to accomplish all the effects you want on these chips. And don't tell me that 512 instruction slots is too little, it'll perform like a dog anyway if you get even close to that.
 
GameCat said:
nutball said:
Assuming that the high-precision integer formats support blending, which "the other thread" suggests that they don't.

I think that's a Direct3D issue. It works in OpenGL IIRC. I think one of Humus' demos uses it. But I haven't tried personally, so I might be spreading misinformation.

From this thread:

OpenGL guy said:
The CAPS bits we expose is what the HW supports. I16 and FP16 blending is not supported for R3x0 or R420.

There's the ping-pong buffer trick of course, but that's a PITA to implement. Maybe that's what the demo used :? (perhaps the author would like to comment :))
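
For what it's worth, the ping-pong approach in very rough pseudo-C (all the helper names below are made up, just to show the shape of it):
Code:
/* Very rough sketch of the ping-pong idea with entirely made-up engine hooks,
   declared here only so the shape is clear: since the hardware can't blend
   into an FP16 target, read the previously accumulated buffer as a texture,
   do the "blend" in the pixel shader while writing into the other buffer,
   then swap for the next pass. */
typedef struct Texture Texture;   /* hypothetical FP16 render target handle */
typedef struct Batch   Batch;     /* hypothetical geometry batch            */

void set_render_target(Texture *t);           /* hypothetical engine calls */
void bind_texture(int stage, Texture *t);
void draw_additive_in_shader(const Batch *b); /* shader outputs tex0 + contribution */

void accumulate_pass(Texture *accum[2], int *cur, const Batch *batch)
{
    int src = *cur, dst = 1 - *cur;
    set_render_target(accum[dst]);   /* render into the other buffer     */
    bind_texture(0, accum[src]);     /* read everything summed so far    */
    draw_additive_in_shader(batch);  /* "blend" done by the pixel shader */
    *cur = dst;                      /* swap for the next pass           */
}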
 
Kristof, I really think you're exaggerating the costs.

First of all, the light distance value can go directly from the vertex shader. You don't need a new bus because the value is interpolated just like a texture coordinate. If you have a 2D shadow map, like the GF3+ supports, you only need 2 channels for tex coords, and then you can use another channel for the depth. This additional channel already goes to the texture unit, because 3 coords are needed for cube mapping or 3D textures. Heck, they're probably full 4 channel iterators, because you can use the alpha channel when passing values from VS to PS via texcrd, so even cube map shadow maps should be easy. In any case, no new bus is needed for NVidia's current shadow map support.

Second, you don't have to subtract in order to compare - that's just how ps 2.0 (not ps 2.x) does it to limit arguments in the cmp function. You can directly do binary compare, and make a hierarchical structure to keep it compact. I counted less than 200 gates for an unoptimized 32-bit comparison unit (PM me if you want me to describe my design), so we're talking less than 100,000 transistors for 64 of them in the X800. A drop in the pond, really. The longest path should be much shorter than that found in the adders and multipliers of the filtering, but if not, then either do a look-ahead or just sacrifice one stage of the fifo (that absorbs texture fetch latency) when shadow maps are used, and get a touch less performance.
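
Just to illustrate the kind of hierarchical compare I mean (a behavioural C model, not my actual gate-level design): compare 4-bit chunks in parallel, then let the most significant chunk that differs decide.
Code:
#include <stdint.h>

/* Behavioural model only: the first (most significant) nibble where a and b
   differ decides the result; equal nibbles just pass the decision downward.
   In hardware each nibble comparison is a handful of gates and the combine
   stage is a short tree, which is where the small gate count comes from. */
int depth_greater_equal(uint32_t a, uint32_t b)
{
    int gt = 0, eq = 1;
    for (int i = 7; i >= 0; --i) {            /* walk nibbles, MSB first */
        uint32_t an = (a >> (i * 4)) & 0xFu;
        uint32_t bn = (b >> (i * 4)) & 0xFu;
        if (eq && an > bn) gt = 1;
        if (an != bn)      eq = 0;
    }
    return gt || eq;                          /* a >= b */
}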

You could do just 16-bit comparisons, or only have PCF in half the pipes to reduce this paltry silicon cost even further. I think I was pretty justified in saying the cost is minimal.

True, shadow maps aren't used very often in games, but I'm sure they would be if ATI supported them more. Many games already do a rendering from the light's POV to fake shadow maps or soft shadows (UT2003 does, AFAIK), but just use the texture on the floor or wall. Humus, it seems, prefers shadow maps due to their ease of implementation compared to stencil shadows.

As for FP filtering, only 2 or 4 units are needed given the bandwidth costs. I don't have any figures for you here (blending units are a bit more complex), but given the X800XT has 16 full shader pipes, it should be a relatively small cost. Even if it needs a few million transistors, I definitely think it's worth it in order to get HDR into games, as this can really hurt ATI if NVidia gets developers to jump on board.

I seriously doubt these features were skipped because of a lack of capability or silicon resources. The real reason, alas, is unknown.
 
mrbill said:
PCF should *NOT* be weighted. Look at the c code in Figure 4.
I'm not entirely sure what you're trying to say here, but I was just trying to describe Figure 2. Compare then filter. The ATI demo uses constant filtering weights.

You mention the cost of implementing a huge shader, but I think multiple bilinear PCF samples (especially if sampled from depth maps created with jittered light positions) would be better than doing anything too fancy from a cost/benefit point of view. I think the GeForces let you sample a larger region than 4 neighboring texels, likely with the aid of the texture walking logic used in anisotropic filtering.

Low quality it may be, but keep the cost benefit ratio in mind. 4 bilinear PCF samples from 4 depth maps with a jittered light position can generally be done in 4 cycles. Doing anything better in 4 cycles is not an easy proposition by any means.
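
Roughly what I'm picturing, reusing the pcf_bilinear() sketch posted earlier in the thread:
Code:
float pcf_bilinear(const float *shadow_map, int width,
                   float s, float t, float light_depth);   /* earlier sketch */

/* One bilinear-weighted PCF fetch from each of four shadow maps rendered
   with slightly jittered light positions, averaged - roughly one fetch per
   map, so about 4 cycles of texture work. */
float pcf_four_jittered(const float *maps[4], int width,
                        const float s[4], const float t[4], const float d[4])
{
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        sum += pcf_bilinear(maps[i], width, s[i], t[i], d[i]);
    return 0.25f * sum;
}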
 
AFAIK there is no such thing as a texture unit: you have a texture address generator and a texture filter unit, and in between is the texture cache. The texture filter unit is where your logic has to go; the tex coords only go to the address generator. So you are still adding a bus or extending one.

A simple compare is not good enough either, you need a bias on the depth compares as well... it really isn't all that simple to implement properly. A pure compare is definitely not a good idea due to differences in accuracy - you need a compare with a range, a range which ideally is specified by the API. If you do an equal-type test you'll get lots of artefacts and no-one will use your HW implementation. I guess you could add the bias in the vertex shader though... anyway, I'm not a HW engineer so no real clue, I just don't like it when people go posting that things are easy or have minimal cost without giving some sensible reasoning to back that claim up.
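
For the sake of illustration, the kind of biased test I mean, rather than a pure compare (sketch):
Code:
/* Sketch: bias the stored depth before the test so that limited precision
   and projection error don't make a surface shadow itself ("acne"). */
int lit(float stored_depth, float receiver_depth, float bias)
{
    return receiver_depth <= stored_depth + bias;   /* 1 = lit, 0 = shadowed */
}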

I think, as was already said, the true reason is that you'd be implementing one specific HW feature that is not officially exposed by one of the key APIs, so apps will probably end up implementing it in the PS anyway... which means your HW implementation was wasted. If it's not in the D3D API then it's unlikely a company will implement a feature, since it will most likely go unused and thus results in wasted silicon area.

And the jury still seems to be out on how exactly to do nice soft shadows; it does not seem like true PCF is what most apps are implementing anyway.

K-
 
Mintmaster said:
mrbill said:
PCF should *NOT* be weighted. Look at the c code in Figure 4.
I'm not entirely sure what you're trying to say here, but I was just trying to describe Figure 2. Compare then filter. The ATI demo uses constant filtering weights.
The ATI demo is doing PCF. Constant filtering weights are PCF. The result is proportional to the number of samples passed. If you take four samples, the result should be 0.0, 0.25, 0.50, 0.75 or 1.0 depending on how many samples passed. Period. That's "true PCF."

(But you shouldn't be only taking four samples. You should be taking many more.)

Bilinear weighted PCF is not PCF. If you take four samples, and one passes, the result could be *anywhere* between 0.0-1.0, depending on how close you are to the passed sample. That's just wrong, there is no meaning to being close to the passed sample.

Doing four Bilinear weighted PCF is just compounding the wrong. Yes, you get the sixteen samples. But you noise them up while you combine them.

Is it cheap? Maybe. (I'm not sure if you are suggesting four different depth maps or not. If you are, the cost of generating four depth maps is *not* insignificant.) But even if you keep it "cheap", don't discount the huge tradeoffs you are making. Particularly since there's really no good reason you need to.

-mr. bill
 
Kristof said:
I think, as was already said, the true reason is that you'd be implementing one specific HW feature that is not officially exposed by one of the key APIs, so apps will probably end up implementing it in the PS anyway... which means your HW implementation was wasted. If it's not in the D3D API then it's unlikely a company will implement a feature, since it will most likely go unused and thus results in wasted silicon area.

I think that can be a reason, but if all IHVs followed it 100% of the time, there would be little innovation in HW. Many features come in HW first, then the spec is created/mapped around the feature. If there is perfect alignment in the heavens, sometimes it can be arranged that the HW and the API can be extended at the same time, through careful lobbying, or if a benevolent dictator demands it. But this is certainly not the case for most features in most standards.

It more or less mirrors how you would do it in academia: conceive a new technique. Do an initial implementation (the first HW implementation). Publish a paper (the "extension" for OGL). The technique gets picked up by other players, "commercialized", refined, and a common denominator is agreed upon. -> Standard.

Or if you prefer: Netscape picks up HTML from Berners-Lee. Hacks in new features unilaterally (<TABLE>, <EMBED>). Microsoft licenses Mosaic/Spyglass. Hacks in new unilateral features (<IFRAME>, <OBJECT>). The W3C sits the players down and crafts a new spec (HTML 2.0/3.0). Next revision, the players tweak their implementations to support the new spec.

I don't believe in this idea that spec comes before hardware. The HW features/requirements from a product management and engineering perspective are often designed a year or so before the spec is agreed to and/or finalized.
 