ATI's decision concerning TMUs

dizietsma said:
Club3d 7600GT = £137.01
Club3D X1600XT = £126.00

If 3x the shading power is supposed to future-proof your card, it does not seem to be doing a very good job in current high-end games like FEAR, never mind 1Q2007.

Mind you, the X1600XT is in stock (maybe unsurprisingly), whereas the 7600GT is not, at the place I got the comparison from.

Overclockers seem to have a bit more of a price disparity: 100 quid (+ VAT) for the X1600XT and 120 quid for the 7600GT. I think I'd possibly still tend towards the 7600GT.
 
What I find outwardly satisfying is how, once again, the NV marketing machine is spinning the entire world 180 degrees, towards the exact *opposite* of what the industry is actually delivering now and in the near future.

For years, NV fought 3dfx's "fillrate is king" mantra, even though fillrate at that point in time had the bigger impact on performance. NV kept trying to dispel this fact through the T&L/ALU-performance hype machine, telling everyone to forget about TMUs. Of course, everyone "playing" 3DMark99 and 2000 knew the choice was clear. Those playing Quake, Quake 2 and the other 99.9% of games thought otherwise.

Now that a competitor has the ALU coverage to be looking towards the future direction of TMU vs. ALU requirements coming down the line for games, it seems that suddenly "fillrate is king" is the new push! I'm just glad [H]ardOCP rose to the challenge of trying to debunk the obvious impact this new push already has in games such as Oblivion. "Throw on 16xHQAF!! Compare them apples to oranges!"
 
Humus said:
You mean Parallax Occlusion Mapping? Either way, you're comparing different techniques, so that's not very useful. But I could easily rip out the dynamic branching from, for instance, the ParallaxMapping sample in the ATI SDK. For the distance-function technique, performance dropped from 83 fps to 28 fps on my X1800XL AIW.

For dynamic branching to be faster you don't really need a long shader. The POM or distance-function loop is quite short. A typical shader would be pretty short, while the equivalent non-DB shader could be long if you want to achieve the same quality. Even in simple lighting shaders you'll get a nice speedup by doing an early out on the attenuation. It's not about the length of the shader, but the ratio between the amount of work you save and what you still need to compute.

Well, that is true, but the point I'm making is that there are alternatives to shaders like POM that are just as good and don't need DB. When we start getting to the point where those alternatives aren't there, we are going to be looking at shaders that are quite a bit more expensive. Of course there are places where DB can help right now, but if lighting algorithms start moving to things like ambient occlusion (sorry for the wrong word usage, it was a bit late last night hehe), soft-shadow performance isn't going to rely on DB performance.
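
To make the attenuation early-out concrete, here is a minimal sketch of an SM 3.0 pixel shader that skips the lighting work for out-of-range fragments. It is a reconstruction of the idea, not code from the ATI SDK; names such as lightPos, lightRadius and Albedo are illustrative only.

```hlsl
// Minimal sketch (not the ATI SDK sample): an SM 3.0 pixel shader that uses
// dynamic branching to return early for fragments outside the light's range.
sampler2D Albedo;          // illustrative names, not from any real sample
float3    lightPos;
float     lightRadius;

float4 main(float2 uv       : TEXCOORD0,
            float3 worldPos : TEXCOORD1,
            float3 normal   : TEXCOORD2) : COLOR
{
    float3 toLight = lightPos - worldPos;
    float  distSq  = dot(toLight, toLight);

    // Dynamic branch: out-of-range fragments skip the texture fetch and all
    // of the lighting math below.
    if (distSq > lightRadius * lightRadius)
        return float4(0, 0, 0, 0);

    float3 l     = toLight * rsqrt(distSq);
    float  atten = saturate(1 - distSq / (lightRadius * lightRadius));
    float3 diff  = tex2D(Albedo, uv).rgb * saturate(dot(normalize(normal), l)) * atten;
    return float4(diff, 1);
}
```

The saving is exactly the ratio Humus describes: the branch itself costs almost nothing, while the work it skips (the fetch plus the lighting math) is everything else in the shader.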
 
Razor1, you don't understand. The alternatives are NOT as good. If you make them as good, then they are MUCH slower. That is what Humus is telling you.

The relief mapping that Natasha was talking about (this one) is almost the same as POM, and can use dynamic branching to speed it up also:
Polycarpo et al said:
Beginning at point A, we step along the AB line at increments of d times the length of AB looking for the first point inside the surface (Figure 9). If the graphics card supports shading model 3.0, d varies from fragment to fragment as function of the angle between VD’ and the interpolated surface normal at the fragment.
POM uses the exact same shortcut. On top of that, you can exit the linear search as soon as you find your intersection, just like POM. The only difference between this technique and POM is the binary search step, which is an optimization that can increase artifacts.

The only time these techniques perform equally on NVidia and ATI is when you don't do dynamic branching, and fix the number of texture accesses. In doing so, you either drastically reduce the quality through insufficient sampling artifacts, or you decrease the speed by having so many texture accesses.

There's no magic alternative here. POM, steep parallax mapping, and relief texture mapping are all very similar. All need many texture accesses to avoid artifacts, and all can have vastly improved performance with DB.
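
For reference, the linear search that POM, steep parallax mapping and relief mapping all share looks roughly like the sketch below. It is a rough reconstruction under one common set of conventions, not code from any of the papers or SDK samples; HeightMap, Albedo, numSteps and viewTS are illustrative names.

```hlsl
// Rough sketch of the shared linear-search step, showing where dynamic
// branching helps. Conventions (height vs. depth map, ray direction) vary
// between papers; this follows one common choice.
sampler2D HeightMap;
sampler2D Albedo;
int       numSteps;   // e.g. 16-64; artifacts shrink as this grows

float4 main(float2 uv : TEXCOORD0, float3 viewTS : TEXCOORD1) : COLOR
{
    // March along the view ray in texture space, stepping down through the
    // height field until the ray dips below the stored surface height.
    float2 stepUV = -viewTS.xy / viewTS.z / numSteps;
    float  stepH  = 1.0 / numSteps;
    float  rayH   = 1.0;
    float2 curUV  = uv;

    [loop]
    for (int i = 0; i < numSteps; i++)
    {
        // tex2Dlod avoids gradient problems inside dynamic flow control.
        float surfaceH = tex2Dlod(HeightMap, float4(curUV, 0, 0)).r;

        // Dynamic branch: stop as soon as the intersection is found instead
        // of always paying for the worst-case number of texture fetches.
        if (surfaceH >= rayH)
            break;

        curUV += stepUV;
        rayH  -= stepH;
    }

    return tex2D(Albedo, curUV);
}
```

Without DB you either unroll this to a fixed (large) number of texture fetches, or cut numSteps and live with the undersampling artifacts, which is exactly the quality-versus-fetch-count trade-off described above.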
 
Humus,
I'd like to see you recode your dynamic branching alpha+stencil demo and combine it with SM3.0, and then compare the performance of the two. Maybe implement POM using both techniques.

Here's what Humus had to say back then:
Humus on SM3.0 DB said:
One of the main features of pixel shader 3.0 is that it supports dynamic branching (also called data-dependent branching), something that nVidia of course makes a big deal about. This demo, however, shows a way to implement dynamic branching without pixel shader 3.0, using basic graphics functionality such as alpha testing and stencil testing. Not only is it doable, it also gives every bit of the performance of real dynamic branching.
One scenario where dynamic branching comes in handy is when you have range-limited lights. Whenever a fragment is outside the range of the light, you can skip all the lighting math and just return zero immediately. With pixel shader 3.0 you'd simply implement this with an if-statement. Without pixel shader 3.0, you just move this check to another shader. This shader returns a truth value to alpha. The alpha test then kills all unlit fragments. Surviving fragments will write 1 to stencil. In the next pass you simply draw lighting as usual. The stencil test will remove any fragments that aren't tagged as lit. If the hardware performs the stencil test prior to shading, you will save a lot of shading power. In fact, the shading workload is reduced to the same level as pixel shader 3.0, or in some cases even slightly below, since the alpha test can do a compare that would otherwise have to be performed in the shader.
The result is that early-out speeds things up considerably. In this demo the speed-up is generally in the range of two to four times as fast as without early-out. And with more complex lighting, the gain would have been even larger.
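
For anyone who never saw that demo, the shader side of the trick looks roughly like the sketch below. This is a reconstruction from the description above, not Humus's actual code; the render-state setup (alpha test, stencil write and stencil test) lives on the CPU side and is only noted in comments, and all names are illustrative.

```hlsl
// Two-pass early-out without SM 3.0, per the description above (sketch only).
float3    lightPos;
float     lightRadius;
sampler2D Albedo;

// Pass 1: cheap shader that writes a "truth value" to alpha.
// CPU-side state (not shown): alpha test set to pass only alpha > 0, stencil
// write of 1 on surviving fragments, color writes disabled.
float4 TagLitPS(float3 worldPos : TEXCOORD0) : COLOR
{
    float3 toLight = lightPos - worldPos;
    float  inRange = (dot(toLight, toLight) < lightRadius * lightRadius) ? 1.0 : 0.0;
    return float4(0, 0, 0, inRange);   // alpha test kills out-of-range fragments
}

// Pass 2: the expensive lighting shader.
// CPU-side state (not shown): stencil test EQUAL 1, so untagged fragments are
// rejected before this shader ever runs (assuming early stencil).
float4 LightingPS(float2 uv       : TEXCOORD0,
                  float3 worldPos : TEXCOORD1,
                  float3 normal   : TEXCOORD2) : COLOR
{
    float3 toLight = lightPos - worldPos;
    float3 l       = normalize(toLight);
    float  atten   = saturate(1 - dot(toLight, toLight) / (lightRadius * lightRadius));
    float3 diff    = tex2D(Albedo, uv).rgb * saturate(dot(normalize(normal), l)) * atten;
    return float4(diff, 1);
}
```

The whole point is that pass 1 is cheap, and its result lets early stencil rejection cull the expensive pass-2 fragments before they are ever shaded, which is where the quoted 2-4x speed-up comes from.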
 
DemoCoder said:
And I think the anti-SLI whiners have perhaps missed an important and serendipitous fallout from NVidia pushing SLI -- the potential future of using GPUs for physics processing (and maybe Aureal-like, physics-based audio processing). Without SLI, the only way we'd get physics processors is via a solution like Ageia, which I think will fail, for many reasons.
Do you need to synchronize the second, physics GPU to the first, or to give it x8 or x16 bus speed? Did SLI really push forward discrete physics processing?

I'm not sure why AGEIA will fail where a second GPU won't, unless you're talking about critical mass and SLI's pre-existing userbase propelling physics on GPU faster and further than AGEIA's cards. Even then, are there enough SLI setups in use to significantly improve physics' chances?

I'm not sure why ppl in the 2nd position would accept the IQ hit from assigning the second GPU to physics. If they bought two 3D cards for higher res and more AA in the first place, why would they be happy with dropping down to a more mundane res and AA level but with extra physics? Wouldn't they prefer to add a physics card (or third GPU) and maintain their display detail?

If you're torn between SLI or physics, then 2) becomes very attractive. But if you're looking for bang/buck or max IQ, 2) just looks like an imperfect compromise.

And setting physics up for wildly variable GPU speeds doesn't seem ideal, though that's the story of PC gaming (and the lowest-common-denominator problem is rearing its head with 256MB PhysX cards). But will designing a GPU with an eye toward double duty as a PPU compromise its primary function? Does NV or ATI have the manpower to devote to this new facet -- as much as an AGEIA or a Havok? It already takes months after a title's release to see fully optimized drivers.
 
DemoCoder said:
Humus,
I'd like to see you recode your dynamic branching alpha+stencil demo and combine it with SM3.0, and then compare the performance of the two. Maybe implement POM using both techniques.

I have a dynamic branching demo in the pipeline, which should be done tomorrow, but it's not based on that demo, so it's not comparable. Regarding that technique, though: it's still faster than dynamic branching, even on X1900, in simple cases. You can use the EarlyOut samples in the ATI SDK to compare. It's about 10-25% faster than dynamic branching (depending on what options you choose in the menu). For more complex techniques like POM I expect dynamic branching to be much faster, not to mention a lot more convenient, given the number of extra passes and the extra reads and writes to render targets I'd need.
 
[Edit: I may edit this more as I finish reading the thread; for now I have to go buy kitty formula mix. One thing I was definitely wrong about is that the thread got a LOT more interesting after page two. ;) So, hopefully I didn't diss anyone accidentally. For my next edit, I'm now snipping out lots of what I said because it was already presented better.]

Lol, I basically said a lot of stuff that's been covered better already. I think that it's essentially up to those of us who do expect our graphics cards to last a long time to stand up for that when making purchases. Both companies make reasonably long-lived products, but you really have to pore through the benches and the dev interviews (and maybe be a little lucky) to buy a long-term winner. Sometimes it seems as if, with the "enthusiast" BS running rampant in the marketing, graphics card lifetimes will just get shorter and shorter -- but I doubt it. I think that ultimately, consumers who benefit from purchasing the better implementations of really-used technologies will return to the company that gives it to them. It's not a strange corporate tactic at all -- it's delivering quality goods. What, sadly, I believe ATi (in this case) should be doing more of is getting the news out. Prove that the DB-heavy games are coming. Fund some analyses of how long the R300 has remained a viable card. Pay off some *&%!!ing developers -- we all know it's happening; make it a clear-cut avalanche of games, the way it was when my Kyro II was getting absolutely destroyed by all the T&L games. Regardless, I am not a stockholder, so I'll just keep buying the products that suit my (somewhat less "enthusiastic") needs.

Final edit, I promise: actually, I don't really want ATi to do any of that (perhaps because I'm not a stockholder? ^^; ). The more they can distance themselves from nV's shady techniques and focus on producing well-engineered products, the better.
 
Pete said:
I'm not sure why ppl in the 2nd position would accept the IQ hit from assigning the second GPU to physics. If they bought two 3D cards for higher res and more AA in the first place, why would they be happy with dropping down to a more mundane res and AA level but with extra physics? Wouldn't they prefer to add a physics card (or third GPU) and maintain their display detail?

If you're torn between SLI or physics, then 2) becomes very attractive. But if you're looking for bang/buck or max IQ, 2) just looks like an imperfect compromise.

What if I'm not particularly interested in SLI at all, but am willing to let my older card ride in that second slot for PPU duties when I upgrade? Since it's not really SLI, there doesn't need to be a connection, it doesn't have to be a card of the same generation, etc. Right?
 
_xxx_ said:
Et tu, Brute? :(

What? Who did I betray? Betrayal implies a duty of loyalty... Who is it I'm supposed to be loyal to? Ageia? Errr... why?

I'm not convinced this stuff is going to be such a difference-maker anyway. Anything that lets more people check it out for themselves without making a significant investment to do so (or any at all, if they already have an SLI mobo and an extra card sitting around) sounds like a real good idea to me.
 
We were just talking about this a bit in the IRC chan, and someone mentioned that if the tech demos don't blow them away (and they don't) then it's prolly a lost cause.

My response was that it might not be a "one grand moment" kind of addition, but rather the accumulation of many smaller moments over the course of a game that still adds significantly to the enjoyment of said game.

But I'd still rather try that out for myself without having to shell out for a new dedicated physics card to find out.
 
It's a dead end IMHO. By the time that takes off, we'll have enough processing power in the (multi)CPU/(single)GPU combo to make that obsolete for any kind of normal use.

And I'm also pretty sure that no one will build a game with 1 million boulders rolling down the hill or some such, but maybe a few thousand, which will give us the goods without the need for any specialized HW.
 
Humus said:
It's more like 20-30% faster on average. By the end of its lifetime it will probably be something like 2-3x faster.


You think the X1900XT will be 200-300% faster than the X1800?
That is quite a bold prediction.
 
Pete said:
Do you need to synchronize the second, physics GPU to the first, or to give it x8 or x16 bus speed? Did SLI really push forward discrete physics processing?

I'm not sure why AGEIA will fail where a second GPU won't, unless you're talking about critical mass and SLI's pre-existing userbase propelling physics on GPU faster and further than AGEIA's cards.

From a consumer perspective, your AGEIA card is a bookend or doorstop if a game isn't using much physics, or if a game doesn't support the AGEIA API. On the other hand, with a second GPU, if a game doesn't support it, or isn't a physics heavy game, you still derive value from the second card. Simply put, the GPU can be used as either a renderer or a PPU, whereas the AGEIA PPU is only a PPU and isn't likely to be able to compete with GPUs if it is reprogrammed to do software rasterization.

AGEIA doesn't have the money or resources to keep up with NVidia and ATI in this space either.
 
DemoCoder said:
From a consumer perspective, your AGEIA card is a bookend or doorstop if a game isn't using much physics, or if a game doesn't support the AGEIA API. On the other hand, with a second GPU, if a game doesn't support it, or isn't a physics heavy game, you still derive value from the second card. Simply put, the GPU can be used as either a renderer or a PPU, whereas the AGEIA PPU is only a PPU and isn't likely to be able to compete with GPUs if it is reprogrammed to do software rasterization.

AGEIA doesn't have the money or resources to keep up with NVidia and ATI in this space either.

They have a nice list of games that will use their API. None of that really matters once Microsoft gets its butt in gear and develops a physics API.

Besides, isn't Nvidia's implementation just physics applied to pretty graphics? It doesn't send anything back to the CPU to update the game world, does it?
 
It seems to me that a lot of people are outright dismissing the X1800XT now that the X1900XTX is here. While on paper it does not compare as favorably, in most cases it performs admirably against the competition. Let's take a look at Oblivion performance for a minute. According to FiringSquad's benchmarks, the X1800XT outperforms the GF7900GTX by up to 16% (foliage, HDR, 1280) and comes to parity indoors, while offering lower bandwidth, fillrate (the GTX has 24 TMUs @ 650MHz) and clock speeds, and 50% fewer ALUs as well. It is outperformed by the X1900XTX by 14% in this case, but, more importantly, provides the same gameplay experience (26 vs 29 min framerate). For a part that now costs US$299 (512MB), that is amazing. Of course there are cases where it does not perform as admirably (AOE3, for example), but all in all it has turned into an incredible midrange part.

Whether or not that was ATI's intention (clearing stock, maybe?), the benefit to the consumer is there. I enjoy the technical discussions about architectural efficacy, but let us step back on this occasion and remember the benefits and, dare I say, real-world use that these parts allow us (no, Kyle, "apples to apples" need not apply :)). In the case of Oblivion, it is a question of paying $200 or 66% more for a 15% gain (that being said, the X1900XTX is a shader monster, plus the Hyper-Z gains). But when the discussion sways to performance/dollar or "fps/mm^2", market price should be taken into account.
 
pakotlar said:
It is outperformed by the X1900XTX by 14% in this case, but, more importantly, provides the same gameplay experience (26 vs 29 min framerate).

Well, that's what started all of this in the first place; it's just that you are looking at it as "X1800 compares favorably to X1900 given the price and architectural differences" instead of "X1900 compares poorly to X1800 given the price and architectural differences".

But on its own, at its current price point the R520 512MB is very attractive, especially considering the feature-set and framebuffer advantage it has over its direct competitor, the 7900GT.
 