ATI's decision concerning TMUs

pakotlar · Apr 24, 2006

trinibwoy said:
Well that's what started all of this in the first place, it's just that you are looking at it as "X1800 compares favorably to X1900 given the price and architectural differences" instead of "X1900 compares poorly to X1800 given the price and architectural differences".

But on its own, at its current price point R520 512MB is very attractive especially considering the featureset and framebuffer advantage it has over its direct competitor - 7900GT.

Our expectations are too high.

Pete · Apr 24, 2006

geo said:
What if I'm not particularly interested in SLI at all, but am willing to leave my older second card ride that second slot for PPU duties when I upgrade? Since it's not really SLI there doesn't need to be a connection, doesn't have to be the same generation card, etc. Right?

That's what I was thinking, but this does necessitate a MB with two PEG slots (and possibly a burlier PSU) while also segmenting the PPU market (do you optimize for G7x? R5x0? mainstream, performance, high-end?).

DemoCoder said:
From a consumer perspective, your AGEIA card is a bookend or doorstop if a game isn't using much physics, or if a game doesn't support the AGEIA API. On the other hand, with a second GPU, if a game doesn't support it, or isn't a physics heavy game, you still derive value from the second card.

A second GPU sounds good as a physics card with a big bonus. It still suffers from the issues I mentioned to geo, though. More importantly, will it be as good (fast and/or general) with physics as a PhysX card?

I understand that SLI lets you hedge your bet WRT physics, I'm just not convinced it's a good bet considering SLI's extra costs. Then again, if a PhysX card is going to set you back $250, you're right in that putting that money toward SLI will guarantee you benefits with almost every game, and an SLI system's extra costs (enhanced MB and PSU) are pretty insignificant compare to $250. But I'd like to know how a 7600 or X1600 compares to a PhysX card as a pure PPU. (I'd also note the huge backlash against setting a standard of GPU power greater than Intel's GMA950 in the Ars thread on the new Mac Mini, and that's probably a much lower additional cost than SLI's base cost.)

I'm not familiar with the various physics engines (Ageia, Havok, ?), but are you saying something like PhysX wouldn't be able to accelerate other physics engines, or that Ageia would be disinclined to support other physics APIs?

OK, I just read this article and it cleared some things up for me--namely, Havok FX will be partially accelerated by SM3 cards only. I suppose there's no way to shunt those operations to PhysX (or any other PPU) without at least Havok's approval? The article also shows Ageia as primarily an IHV, in which case you'd think they'd have an incentive to support as many physics APIs as possible, tho that obviously depends on the API owner.

Edit: Sorry for continuing this OT conversation. It's interesting, but it probably belongs here. Heck, I'm probably just rehashing arguments in that thread.

Ailuros · Apr 25, 2006

geo said:
We were just talking about this a bit in the IRC chan, and someone mentioned that if the techdemos don't blow them away (and they don't) then its prolly a lost cause.

My response was it might not be a "one grand moment" kind of addition, but rather the accumulation of many smaller moments over the course of a game that still makes a significant addition to the enjoyment of said game.

But I'd still rather try that out for myself without having to shell out for a new dedicated physics card to find out.

Mods might excuse the further OT, but it seems harder to link that context in the other thread.

I meant something totally different: would any kind of additional physics (whether through a PPU or multi-GPU setup) add only quantity in terms of effects, or increase output quality too. If there are physics effects that are feasable only on either/or above sollution that truly increase output quality then I'm all game; if on the other hand all I'll get is 200 rocks bouncing down a hill compared to 20, then I'm not so sure if I'm really that interested after all.

The techdemo part was in another context; in a techdemo any company can include the maximum possible from any underlying hardware and since it's not an interactive game where other bottlenecks or resources are a consideration you more or less get a best case scenario. Relative techdemos from IHV X or Y or even Ageia don't tell me personally much, until I see added physics in a real game.

Ailuros · Apr 25, 2006

I understand that SLI lets you hedge your bet WRT physics, I'm just not convinced it's a good bet considering SLI's extra costs. Then again, if a PhysX card is going to set you back $250, you're right in that putting that money toward SLI will guarantee you benefits with almost every game, and an SLI system's extra costs (enhanced MB and PSU) are pretty insignificant compare to $250. But I'd like to know how a 7600 or X1600 compares to a PhysX card as a pure PPU.

Here's a quick translation from one of Demirug's posts in the german 3DCenter forums about two different descriptions for the PPU floating around. One speaks of 32 simple Vector4 FPUs and the second one of 4 blocks with 6 FMACs each ( FMAC = Scalar MADD on GPU). First sounds to be able to reach a modern GPU, the latter rather not (I assume those would be high end GPUs). What remains as a benefit is that PhysX being more suitable for stream processing than current GPUs (definitely not though with next generation GPUs).

That's of course concerning only the theoretical part of throughputs and nothing else.

Maintank · Apr 25, 2006

Pete: I believe Aegia also developes and gives away the PhysX physics engine that is complimented by their card. So while you may not have an actual Aegia PhysX PPU, a game may have the engine and scale accordingly using your CPU.

ondaedg · Apr 25, 2006

Ailuros said:
Here's a quick translation from one of Demirug's posts in the german 3DCenter forums about two different descriptions for the PPU floating around. One speaks of 32 simple Vector4 FPUs and the second one of 4 blocks with 6 FMACs each ( FMAC = Scalar MADD on GPU). First sounds to be able to reach a modern GPU, the latter rather not (I assume those would be high end GPUs). What remains as a benefit is that PhysX being more suitable for stream processing than current GPUs (definitely not though with next generation GPUs).

That's of course concerning only the theoretical part of throughputs and nothing else.

I have read a few places that stream processing is ideal for physics calculations hence the potential of the PS3. I wouldn't be surprised if Nvidia and Sony have reached an agreement to offer GPUs with Cell processors to do physics calculations. We could call it a GPPU.

Pete · Apr 26, 2006

Thanks, Ail. Blachford sure paints a rosy picture.

Does R2VB make the R5x0 series more amenable to stream processing?

Maintank, yeah, I've read that you can get PhysX for free if code for the PPU, but is that specific code like Havok FX, or "generic" PhysX code that could be "accelerated" on a dual-core CPU?

Subtlesnake · Apr 26, 2006

Naughty Boy! said:
All your arguments here are crap or tangential, and cannot explain why R580 performs so relatively poorly vs R520.

This is probably just a troll, but in any case the R580 doesn't perform poorly compared to the R520 - the R580 is 20 - 30% faster which is quite reasonable considering the die size has only increased by 20%.

Nvidia has a per-transistor performance advantage because of the decisions ATI made with the R520 - they doubled the number of transistors (160 - 320) for only a 30% increase in performance.

An X1800 XT clocked at 650 MHz couldn't come anywhere near to a 7900 GTX, and yet with only an extra 20% more transistors ATI has done just that, equalling a card with 50% more theoretical texturing performance.

Ailuros · Apr 26, 2006

Pete said:
Thanks, Ail. Blachford sure paints a rosy picture.

I wouldn't call it "rosy"; it's a fine article that doesn't contain any exaggerations. It's natural that the PPU could predictably as a dedicated unit be way more efficient for the tasks it was designed for than a general purpose CPU of whatever kind. Comparing it with an SM3.0 GPU of today though comes down as to what exactly you're going to throw at it and how well it could handle it.

Just for the record this one isn't the best case scenario Demirug meant in his post. That said Blachford mentions D3D10 and as I said I predict that against those the PPU doesn't sound to stand many chances IMHLO.

Let me take the KISS approach though; I could have two alternatives in the future: 1. add a PPU for say $250 or 2. add another GPU for $250 in a multi-GPU config. Do I expect that a multitude of games will be overloaded with tons of added physics? Not really, not at least at first. For all those cases where there aren't any additional physics coming into play, I personally see the second case above to be a better investment since there's inevitably more a user can do with a secondary GPU. One presupposition would be of course that with a given set of added effects the PPU won't waltz all over a secondary GPU with more or less the same cost in terms of performance.

Does R2VB make the R5x0 series more amenable to stream processing?

Call for Humus to better answer that question or anyone more experienced, which might help in some way to get this thread back on it's original track.

What needs to be said again and again that irrelevant what approach or design decisions each IHV have taken this far in its architectures, it is pretty clear that the concentration in the foreseeable future will be much more in the department of arithmetic efficiency/floating point power than multitexturing fillrates. Once NVIDIA de-couples entirely arithmetic from texture OPs, I'm having severe doubts that they'll need as many TMUs as ALUs in their future designs. It doesn't take any wizzard to figure that one out.

Pete · Apr 27, 2006

Last OT post, honest

Ail, when i said "rosy picture," I was linking literally!

The diagrams paint a rosier picture than "4 blocks with 6 FMACs each." They appear to show 4 blocks with 4x6=24 FMACs each. Each VPE block consists of 4 VPUs, and each VPU appears to be capable of processing six FP32 vectors and one integer per clock.

(Can that be called 16 Vec6 units, or am I totally off track?)

That's still inferior to 32 Vec4 units, so not quite the best-case scenario (that you said Demi said Ageia's patent said). And that's not even considering Blachford's reasonable suggestion that one VPE of the four he described may be disabled for yields--though I somewhat doubt that, given the $250+ MSRP.

Ailuros · Apr 27, 2006

Pete said:
Ail, when i said "rosy picture," I was linking literally! The diagrams paint a rosier picture than "4 blocks with 6 FMACs each." They appear to show 4 blocks with 4x6=24 FMACs each. Each VPE block consists of 4 VPUs, and each VPU appears to be capable of processing six FP32 vectors and one integer per clock.

Remember this was a wacky translation that I did on the fly

1 FMAC equals to 1 Scalar MADD, so it's supposed to do 24 MADDs per block and thus 96 MADDs all together, right?

G70 = 24 * 8 MADDs/clock = 192 MADDs/clock and I don't even dare to touch the R580 in comparison. Either my layman math is totally off or I don't see any huge advantage here.

If I haven't misunderstood this whole thing, then I can see that kind of arithmetic throughput even on a X1600XT, at least on paper.

(Can that be called 16 Vec6 units, or am I totally off track?)

No idea. Can you even add integer values?

That's still inferior to 32 Vec4 units, so not quite the best-case scenario (that you said Demi said Ageia's patent said). And that's not even considering Blachford's reasonable suggestion that one VPE of the four he described may be disabled for yields--though I somewhat doubt that, given the $250+ MSRP.

What theoretical yield problems could they have with 125M on low-k 130nm TSMC?

Pete · Apr 27, 2006

I'm a big fan of wacky translations!

No idea about yield problems, I was just repeating Blachford's speculation. If it's to be 500MHz, that's about as fast as we've seen on 130nm, no? I guess the process is mature enough that you can now maximize yields at that speed?

Ailuros · Apr 27, 2006

Not only is the process as mature as it can be nowadays, but chip complexity isn't all that high either.

Humus · Apr 29, 2006

Pete said:
Does R2VB make the R5x0 series more amenable to stream processing?

Well, R2VB allows you to keep the data on GPU and use it directly for rendering without reading back to the CPU. So yes, R2VB makes things more streamable.

Ailuros · Apr 29, 2006

Humus said:
Well, R2VB allows you to keep the data on GPU and use it directly for rendering without reading back to the CPU. So yes, R2VB makes things more streamable.

That said I'd have a long hard laugh if NVIDIA should in the foreseeable future opt for a similar technique.

Bob · Apr 30, 2006

Ailuros said:
That said I'd have a long hard laugh if NVIDIA should in the foreseeable future opt for a similar technique.

Why bother? Render to Texture + Vertex Texture Fetch, which are part of the Direct3D9 specification, provide not only the same "no CPU intervention" feature of R2VB, but are much cleaner and extensible. IMHO, I do not think R2VB will last longer than a generation or two.

Mintmaster · Apr 30, 2006

You're absolutely right, but the problem is that vertex texture fetch is so slow right now as to be nearly unusable. If ATI can convince anyone to implement R2VB in a game, it's quite likely NVidia would implement it, even if the game supports VTF as well. We're talking about an order of magnitude difference in vertex processing speed.

Humus · Apr 30, 2006

Bob said:
IMHO, I do not think R2VB will last longer than a generation or two.

Well, since it's in DX10 I bet it's here to stay.

DemoCoder · Apr 30, 2006

Ailuros said:
That said I'd have a long hard laugh if NVIDIA should in the foreseeable future opt for a similar technique.

NVidia already supports R2VB in OpenGL (via render to texture and EXT_pixel_buffer_object). They just don't support the FourCC ATI Hack. R2VB should not be supported officially using this technique. MS should at it to the DX API itself. But with DX10 and streamout, it should be supported by that general mechanism.

chrisATI · Apr 30, 2006

Bob said:
Why bother?

R2VB can be faster than vertex texture fetching. With R2VB the memory access patterns can be more regular and predictable. Things can get slightly more complicated with VTF because the texture address is computed on the fly in the vertex shader. In the end, they are two different techniques and they both have their strengths and weaknesses. It makes sense to use R2VB in places where it is the more efficient solution.

ATI's decision concerning TMUs

pakotlar

Pete

Moderate Nuisance

Ailuros

Epsilon plus three

Ailuros

Epsilon plus three

Maintank

ondaedg

Pete

Moderate Nuisance

Subtlesnake

Ailuros

Epsilon plus three

Pete

Moderate Nuisance

Ailuros

Epsilon plus three

Pete

Moderate Nuisance

Ailuros

Epsilon plus three

Humus

Crazy coder

Ailuros

Epsilon plus three

Bob

Mintmaster

Humus

Crazy coder

DemoCoder

chrisATI