X1800/7800gt AA comparisons

Jawed said:
I think the other missing ingredient in this discussion is how the NVidia and ATI architectures construct batches.

If a triangle is too small to fill a batch (i.e. there are less quads in the triangle than the nominal batch size for the architecture), does the GPU fill the batch with more triangles (e.g. the succeeding triangles in a mesh)? Or is the empty space in the batch just entirely lost cycles?

Jawed

That's a good question and I would like to know the answer too. The 'recursive rasterization' that I'm using (and I have doubts anyone else is using it, even though Akeley, in those 2001 Stanford course slides, seemed to suggest that NVidia and perhaps others did) allows traversing triangles in parallel and can generate, for the same tile, all the fragments from multiple triangles in a single recursive traversal step. That feature could certainly help to fill tile-based batches if you are rendering closely connected triangles (the usual triangle mesh) that don't overlap.
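To illustrate the idea (this is only a toy sketch of recursive/hierarchical tile traversal, not a claim about how RoOoBo's simulator or any real hardware implements it): the screen is subdivided into tiles recursively, several triangles are tested against each tile at once, and a leaf tile can therefore emit fragments belonging to more than one triangle in the same traversal. All names below are invented.

```python
# Toy recursive rasterizer: a single traversal of the tile hierarchy can emit
# fragments from several triangles for the same tile. Counter-clockwise
# triangles and power-of-two tile sizes are assumed; no tie-breaking fill rule,
# so pixels exactly on a shared edge are counted for both triangles.

def edge(a, b, p):
    # Signed area: >= 0 if p is on the inside (left) of the directed edge a->b.
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def may_cover(tri, x0, y0, size):
    # Conservative reject: the triangle misses the tile if all four tile
    # corners are outside the same edge.
    corners = [(x0, y0), (x0 + size, y0), (x0, y0 + size), (x0 + size, y0 + size)]
    edges = [(tri[0], tri[1]), (tri[1], tri[2]), (tri[2], tri[0])]
    return not any(all(edge(a, b, c) < 0 for c in corners) for a, b in edges)

def rasterize(tris, x0, y0, size, out):
    live = [t for t in tris if may_cover(t, x0, y0, size)]   # triangles still alive in this tile
    if not live:
        return
    if size == 1:                                            # leaf tile: one pixel
        centre = (x0 + 0.5, y0 + 0.5)
        for t in live:
            if all(edge(a, b, centre) >= 0
                   for a, b in ((t[0], t[1]), (t[1], t[2]), (t[2], t[0]))):
                out.append((x0, y0, tuple(map(tuple, t))))   # fragment tagged with its triangle
        return
    half = size // 2
    for dx in (0, half):
        for dy in (0, half):
            rasterize(live, x0 + dx, y0 + dy, half, out)

# Two adjacent triangles forming a quad: one traversal of an 8x8 tile
# produces fragments from both of them.
tri_a = [(0, 0), (6, 0), (0, 6)]
tri_b = [(6, 0), (6, 6), (0, 6)]
fragments = []
rasterize([tri_a, tri_b], 0, 0, 8, fragments)
print(len(fragments), "fragments from", len({t for _, _, t in fragments}), "triangles")
```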
 
Nite_Hawk said:
Seems like it would be incredibly wasteful to not fill the batch, but again, I suppose it depends on how hard it is to fill the batch with triangles from the succeeding mesh. It probably also depends on how much space is left. Do you worry about it if you can only cram one more triangle in?

Speaking of which, how big are the triangle batches? Do we know?

Nite_Hawk

In fragments? It varies a lot, even more so if you are removing fragments before shading with hierarchical Z and early Z. I have never tried to measure the average fragments per triangle batch from a game trace. Imagine particle rendering: lots of triangles and very few fragments.

In triangles I have seen all kinds of batches, from fewer than 10 triangles to tens of thousands, in the game traces that I have briefly skimmed over. But in the end the average must be in the hundreds to thousands if you want good performance from a GPU, as they have a big overhead when starting a new batch (the whole very large GPU pipeline must be filled again) and when changing render states.
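A back-of-the-envelope model of why the average needs to sit in the hundreds to thousands: if every draw command pays something like a full pipeline refill, small batches are dominated by that fixed cost. The cycle counts below are invented purely for illustration, not measured from any GPU.

```python
# Rough cost model: total_cycles = batches * refill_overhead + triangles * per_tri_cost.
# Both constants are made up for illustration only.
REFILL_OVERHEAD = 5000   # cycles to refill a long pipeline / apply new state
PER_TRI_COST = 20        # cycles of useful work per triangle

def efficiency(tris_per_batch, total_tris=1_000_000):
    batches = total_tris / tris_per_batch
    useful = total_tris * PER_TRI_COST
    total = useful + batches * REFILL_OVERHEAD
    return useful / total

for size in (10, 100, 1000, 10000):
    print(f"{size:>6} triangles/batch -> {efficiency(size):.0%} of cycles doing useful work")
```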
 
Well, it seems to me that it would be stupid to build an architecture that can't fill the batch with a set of triangles that use the same texture and pixel shader.
 
Jawed said:
Earlier I was forgetting that these memory devices have a burst length of 8, I think.
The new GDDR3 ones have a configurable burst length of either 4 or 8, while the older, 144-ball ones only allow 4.

NVidia fills batches with quads from multiple triangles. I'm not sure about R300/R420, but I think the 4x4 pixel threads of R520 certainly belong to one triangle each.
 
Chalnoth said:
Well, it seems to me that it would be stupid to build an architecture that can't fill the batch with a set of triangles that use the same texture and pixel shader.

And I keep saying (unless some IHV engineer sounds off and proves me stupid :LOL: ) that I doubt current GPUs pipeline triangles (put another way, OpenGL primitive batches, or what could be called draw commands) with different render states. The graphics program changes some state, sends a draw command for X triangles, and the GPU renders those triangles until the pipes are empty; the graphics program changes the render state again, sends another draw command, and the GPU renders those triangles. I think some overlap between the state changes at the end of one draw command and the start of the next is possible. But having triangles or fragments with different associated render states in the same stage? No way. Why else would state changes and small batches be so costly? And that's ignoring the CPU cycles they consume ...
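The practical consequence on the application side is the familiar advice to sort submissions by render state, so that each state change is followed by one large draw command rather than many small ones. A minimal hypothetical sketch (the state keys and triangle names are made up):

```python
from itertools import groupby

# Each submission: (render_state_key, triangles). Sorting by state key lets us
# merge runs that share a state into a single draw command, so the pipeline
# only has to drain and refill on a real state change.
submissions = [
    ("stone_wall", ["t0", "t1"]),
    ("particles",  ["t2"]),
    ("stone_wall", ["t3", "t4", "t5"]),
    ("particles",  ["t6", "t7"]),
]

draw_commands = []
for state, group in groupby(sorted(submissions, key=lambda s: s[0]), key=lambda s: s[0]):
    tris = [t for _, batch in group for t in batch]
    draw_commands.append((state, tris))   # one state change + one draw per state

for state, tris in draw_commands:
    print(f"SetState({state}); Draw({len(tris)} triangles)")
```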
 
Nite_Hawk said:
Speaking of which, how big are the triangle batches? Do we know?
I've always assumed in R3xx...R4xx that the batch size is the screen-tile size, i.e. 64 quads (256 pixels). Dunno for sure.

In R520 we know the batch size is 4 quads, so the relationship between screen-tiles and batches is relatively soft.

In NV40 it's in the region of 256 quads per fragment-quad, with all four fragment-quad pipelines in 6800U/GT working together (i.e. 4096 pixels in total). In G70 it's 256 quads per fragment-quad, but each fragment-quad is independent - so 1024 pixels per batch.

But there's a wishy-washy factor at play in NVidia architectures that somehow brings the effective batch sizes way down (to about 80% of the sizes I've stated). A real mystery what's going on there...

In G70 I think there are, effectively, 6 concurrent batches working on a large triangle. i.e. if a triangle consists of at least 6 quads (24 pixels), then each of the fragment-quad pipelines will share the workload of rendering the triangle. As far as I can tell NVidia GPUs rasterise each triangle in a round-robin fashion across the available fragment-quads.

Jawed
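Taking the quad counts in this post at face value (a quad being 2x2 pixels), the pixel figures work out as follows. These are the post's estimates, not vendor-confirmed numbers.

```python
# Batch sizes implied by the figures quoted above (4 pixels per quad).
PIXELS_PER_QUAD = 4

configs = {
    "R3xx/R4xx (assumed: one screen tile)": 64,        # quads per batch
    "R520":                                 4,
    "NV40 (4 fragment-quads together)":     256 * 4,
    "G70 (per independent fragment-quad)":  256,
}

for name, quads in configs.items():
    print(f"{name}: {quads} quads = {quads * PIXELS_PER_QUAD} pixels per batch")
```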
 
If each triangle in a mesh has different gradients and texture coordinates from its neighbours (actually I don't really understand this stuff) then that means that the "batch state" in the GPU has to be able to hold the triangle data for multiple triangles.

If that's the case, then that would seem to create a limitation on a GPU's ability to fill a batch with triangles. Sure the limit might be, say, 16 triangles instead of 1, but it still creates problems when rendering distant, fairly high-poly objects, where each triangle is 10 or 20 pixels.

So, are GPUs capable of holding multiple-triangles' data like this?

Jawed
 
Chalnoth said:
Well, duh, because they focused on off-angle surfaces. It really upsets me that nobody focused on off-angle surfaces back when the Radeon 9700 was released and the GeForce4 Ti cards were still beating the pants off of it in anisotropic filtering quality.
Chalnoth, your incessant rants and refusal to accept even the most obvious facts about NV's few glaring shortcomings in the past piss everybody off here.

I showed you many times that the GF4 took an enormous performance hit with anisotropic filtering. It had very little to do with extra samples for off-angle surfaces. NV30 also had angle-independent AF (or very nearly so), and it had a performance hit very similar to ATI's, all else being equal. 90+% of the rendered pixels in a Quake3 demo lie on vertical or horizontal surfaces. Yet we see this from the GF4:
http://graphics.tomshardware.com/graphic/20020206/geforce4-17.html#anisotropic_performance
117 fps at 1024x768 w/ 8xAF; 132 fps at 1600x1200 w/o AF
Fill rate drops to almost 1/3! My 9700P shows a 10-20% hit at most in Q3 with 16x Quality AF. This is an extreme case, but usually the GF4 took around three times the performance drop with AF.
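For reference, the rough arithmetic behind "almost 1/3", treating both frame rates as fill-rate-bound (which is the post's assumption):

```python
# Pixel throughput implied by the Tom's Hardware Quake 3 numbers quoted above.
no_af   = 132 * 1600 * 1200   # pixels/s at 1600x1200 without AF
with_af = 117 * 1024 * 768    # pixels/s at 1024x768 with 8x AF

print(f"without AF: {no_af/1e6:.0f} Mpixels/s")
print(f"with 8x AF: {with_af/1e6:.0f} Mpixels/s")
print(f"ratio: {with_af/no_af:.2f}")   # roughly a third
```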

No one cares if the GF4's quality is a bit better when its performance hit is far higher. It's always been performance first, quality second (up to a certain point, obviously).

Today, graphics cards from ATI and NVidia are often within 20% of each other, which is hard to notice without looking at a graph. Games also use more off-angle surfaces now than back then, since gamers are demanding more varied environments. Hence the focus on AF quality. You being pissed about this says more about your bias than the media's. In fact, even today only HardOCP has stated that it makes a noticeable difference. Some other sites are even writing off ATI's higher-quality AF as insignificant.

Chalnoth said:
Except these two things do not follow. Firstly, I really don't see how you can quantify ATI as doing more "forward thinking." It was, after all, nVidia that was the first to implement a large number of the technologies that we take for granted in 3D graphics now, including anisotropic filtering, FSAA, MSAA, programmable shaders, and hardware geometry processing.
MSAA is a speed optimization, and the GF4 barely outpaced theoretical SSAA (given the same RAMDAC downsampling) - 4xAA reduced fill rate by 70% instead of 75%. Colour compression was the real innovation. The shader hardware in the original Radeon was unbelievably close to DX8 PS1.0. Both had 8 math ops, both had fixed-mode dependent texturing, but the Radeon had a 2x2 matrix multiplication first instead of 3x3. It had three-texture multitexturing instead of four. I worked at ATI, and I am (rather, was) very familiar with the R100/R200 architecture. The vertex shaders were barely changed - R100 just didn't quite meet the spec, which rumour says was changed too late for ATI, so they couldn't call it a programmable vertex shader according to Microsoft. Saying who invented what in any field is often a wash, and realtime graphics is no different.

I'm not agreeing with the statement that ATI is way more innovative or forward-looking than NVidia; rather, I'm saying that these innovations are very evolutionary. Both companies drive each other similarly, especially when you consider how early design decisions are made. If that's what you're saying too, then maybe you shouldn't come off like you're saying ATI is just a follower.

Either way, lay the AF thing to rest. GF4's AF speed was pathetic.
 
Jawed said:
I'm not sure how you conclude that, since R5xx has the same per-quad texture cache organisation as R3xx...R4xx.

The difference in R5xx is that the caches are larger (I think) and fully associative.

Jawed
I was just thinking in terms of the following:
1. Cache hits reduce latency penalties (over misses).
2. The X1K architecture masks latency with thread swapouts.
3. Therefore, cache duplication, even though storage inefficient, would not cause any penalty because any threads needing different cache data would just be scheduled around the latency.

If this does not apply to this situation, my bad.
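A toy model of points 2 and 3 above, with invented miss rates and latencies: once enough threads are resident, a thread that stalls on memory simply sleeps while others issue, so the extra fetches caused by duplicated caches rarely show up as exposed idle cycles.

```python
import random

# Toy latency-hiding model: each cycle the scheduler issues any thread that is
# not waiting on memory. Miss rate, latency and instruction counts are made up.
random.seed(0)
MISS_RATE, MISS_LATENCY, INSTRUCTIONS = 0.1, 100, 200

def idle_fraction(num_threads):
    remaining = [INSTRUCTIONS] * num_threads     # instructions left per thread
    ready_at = [0] * num_threads                 # cycle when each thread can run again
    cycle = idle = 0
    while any(r > 0 for r in remaining):
        runnable = [i for i in range(num_threads) if remaining[i] > 0 and ready_at[i] <= cycle]
        if runnable:
            i = runnable[0]
            remaining[i] -= 1
            if random.random() < MISS_RATE:      # cache miss: thread sleeps, others cover it
                ready_at[i] = cycle + MISS_LATENCY
        else:
            idle += 1                            # nothing to issue: latency is exposed
        cycle += 1
    return idle / cycle

for n in (1, 4, 16, 64):
    print(f"{n:>3} threads -> {idle_fraction(n):.0%} of cycles idle")
```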
 
RoOoBo said:
In triangles I have seen all kinds of batches, from fewer than 10 triangles to tens of thousands, in the game traces that I have briefly skimmed over. But in the end the average must be in the hundreds to thousands if you want good performance from a GPU, as they have a big overhead when starting a new batch (the whole very large GPU pipeline must be filled again) and when changing render states.
Tens of thousands seems a mite large, considering that 10,000 pixels is a 100x100 pixel block. This, of course, may be incentive for looser constraints in what constitutes a batch.
 
Jawed said:
So, are GPUs capable of holding multiple-triangles' data like this?
At least some are.



Mintmaster, seems like both ATI and 3dfx weren't happy about PS1.0 ...
 
Chalnoth said:
Oh, I read those posts, but I don't ever take such claims as being indicative of what we'll see when the drivers are actually released.

Unless they are claims made by nVidia.


http://www.beyond3d.com/forum/showthread.php?p=62174&highlight=driver#post62174

Chalnoth said:
As JC stated, nVidia claims that future compiler improvements in the driver will improve performance (which means, to me, that the drivers need to translate the ARB instructions to NV30 instructions), which makes lots of sense, especially given the DX9 results posted earlier.
 
ERK said:
I was just thinking in terms of the following:
1. Cache hits reduce latency penalties (over misses).
2. The X1K architecture masks latency with thread swapouts.
3. Therefore, cache duplication, even though storage inefficient, would not cause any penalty because any threads needing different cache data would just be scheduled around the latency.

If this does not apply to this situation, my bad.
Sorry, yeah, you're right that's exactly how R520 is able to skirt the issue more effectively.

Jawed
 
Er, you're taking that statement rather out of context.

Edit: give me a moment, I'm trying to figure out the proper context myself.
 
Okay, I think I've got the proper context now.

I was arguing that future driver improvements should improve the performance of shaders in the NV30. I was stating this in part because I had heard that the NV30 was a VLIW design, which is notoriously hard to write compilers for, and in part because this was what nVidia itself was claiming would help.

Now, I believe I've been very consistent over the years in separating how I think performance will change and what people should consider when buying something. I've always stated that you should buy a product for what it can do for you now.

But that doesn't mean I can't speculate as to how performance will change with newer drivers. If I remember correctly, the NV3x did indeed increase its shader performance more than the R3xx over the next year, but it took an entirely new architecture to become truly competitive (the NV4x). This is a prime example of what I was talking about: if you expect driver improvements to save an architecture, you're setting yourself up for failure.

Expectations on how future drivers will improve a video card should only be factored in if you consider the comparison between the cards to otherwise be a wash. An example of this perspective can be seen in how I chose to purchase an SLI motherboard. I never really expected to use SLI, but as I was browsing through the nForce4 motherboards available at the time, attempting to select a layout that I liked, I threw out motherboards until I was within a few dollars of the excellent Asus A8N-SLI motherboard. So, the marginal probability that I would ever make use of SLI made me decide to purchase the SLI motherboard, in a situation where it might have otherwise been a wash.

I think that's the only situation where you'd want to buy a video card based on the promise of improved drivers. If, for example, the X1800 XL was available for the same price as the GeForce 7800 GT, and the XL performed a few percent better (which it may in a month or two) so that the performance was essentially even, it might be a good idea to select the X1800 XL because it's much more likely to improve performance via drivers than the 7800 GT (although I still wouldn't, because linux and OpenGL support are very important to me).
 
Chalnoth said:
Er, you're taking that statement rather out of context. I was arguing a point after the optimizations had been made.

The context is right there for everyone to see. And no optimizations had been made to Doom at the time, but of course it "makes lots of sense". That's the second time you have outright contradicted yourself to put nVidia in a good light and ATI in a bad one.

Reminds me of the last lame response.
http://www.beyond3d.com/forum/showpost.php?p=223030&postcount=438

Chalnoth said:
My previous comment around that time, where I stated essentially that I didn't consider the benefits of the R200 over the NV2x worth buying an R200 for, was based on the fact that, to date, zero games had made use of the higher precision offered by the R200. We all knew that DOOM3 would, but that was still a while away. Even today, very few games make use of the added precision available on the R200.

Please note that this argument (what I will buy) is different, in my mind, from the argument about what I consider to be good for the future of 3D games, and what I consider to be bad for the future of 3D games.
 
Xmas said:
At least some are.
Do you have any concept of the limit on the number of triangles in a batch in NVidia GPUs? Do the triangles have to have the same normal?

I kinda suspect that ATI GPUs are strictly one-triangle.

I seem to remember that the 16x16 size was described as a trade-off between small triangles and cache.

http://www.beyond3d.com/reviews/ati/r420_x800/index.php?p=5

Reducing the tile size allows for higher efficiency with smaller triangles, while larger tiles favour texturing (cache) efficiency.

There was a forum post by one of the ATI guys describing this - but I can't find it.

Jawed
 
Chalnoth said:
Okay, I think I've got the proper context now.

I was arguing that future driver improvements should improve the performance of shaders in the NV30. I was stating this in part because I had heard that the NV30 was a VLIW design, which is notoriously hard to write compilers for, and in part because this was what nVidia itself was claiming would help.

Now, I believe I've been very consistent over the years in separating how I think performance will change and what people should consider when buying something. I've always stated that you should buy a product for what it can do for you now.

But that doesn't mean I can't speculate as to how performance will change with newer drivers. If I remember correctly, the NV3x did indeed increase its shader performance more than the R3xx over the next year, but it took an entirely new architecture to become truly competitive (the NV4x). This is a prime example of what I was talking about: if you expect driver improvements to save an architecture, you're setting yourself up for failure.

Expectations on how future drivers will improve a video card should only be factored in if you consider the comparison between the cards to otherwise be a wash. An example of this perspective can be seen in how I chose to purchase an SLI motherboard. I never really expected to use SLI, but as I was browsing through the nForce4 motherboards available at the time, attempting to select a layout that I liked, I threw out motherboards until I was within a few dollars of the excellent Asus A8N-SLI motherboard. So, the marginal probability that I would ever make use of SLI made me decide to purchase the SLI motherboard, in a situation where it might have otherwise been a wash.

I think that's the only situation where you'd want to buy a video card based on the promise of improved drivers. If, for example, the X1800 XL was available for the same price as the GeForce 7800 GT, and the XL performed a few percent better (which it may in a month or two) so that the performance was essentially even, it might be a good idea to select the X1800 XL because it's much more likely to improve performance via drivers than the 7800 GT (although I still wouldn't, because linux and OpenGL support are very important to me).

Do we really need to go through what you have said about improving nVidia performance with driver improvements? You realize there is a search tool on the forum.

http://www.beyond3d.com/forum/showthread.php?p=97373&highlight=driver#post97373

Chalnoth said:
Yes, and it is most likely that performance will increase more for the NV3x architecture than the R3xx architecture before release.

Remember that the NV3x architecture is newer, and there is certainly much more headroom for improvement in future driver releases.

http://www.beyond3d.com/forum/showthread.php?p=239451&highlight=driver#post239451

Chalnoth said:
Remember that the NV40 has been shown to be more shader-efficient than the R3xx core on immature drivers. If nVidia can get a much better driver compiler by the fall, the NV4x's shader throughput might increase by 50%, giving these cards a huge advantage.

http://www.beyond3d.com/forum/showthread.php?p=239590&highlight=driver#post239590

Chalnoth said:
nVidia is not ATI. The NV4x architecture is quite a bit more complex than the R3xx architecture, and so has much more room to grow with future driver improvements.

For someone that doesn't believe you should consider driver improvements a reason to buy a card, you sure talk about it a lot when it involves nVidia.
 
Jawed said:
Do you have any concept of the limit on the number of triangles in a batch in NVidia GPUs? Do the triangles have to have the same normal?
No, this part of the pipeline has no concept of a surface normal (actually, no part has). The Z-gradients don't have to be identical, and the face register depends on the winding of the vertices.

I don't know how many different triangles can be in a batch, but I would be surprised if it's more than 16.
 
Rys said:
I bet good money that you can't avoid writing about nVidia in any way for the next 50 constructive posts you make on these forums. It's all you talk about, bringing the same stuff up over and over again. A thousand posts making the same tired old points that we've all read a thousand times, that we fully get and understand, but which you persist in bringing up time after time.

Yet another thread derailed by ATI vs NVIDIA chatter, eventually dropping to the depths of days gone by and more derivative and highly boring posts by you about NV30 being shit at DX9, rather than any meaningful discussion on technology of the present and what's going on with current hardware. The topic says X1800 and 7800, not GeForce FX 5800 and two year old stale pissings which you must bring up again and again.

If you must continue on your merry NV30 and NVIDIA-hating way, do it in 3D Graphics Companies and Industry, if anywhere, and keep it out of 3D Technology and Hardware. This forum section is not somewhere for you to post stuff like that.

But really, please just stop it entirely.


I think it would be really, really nice if you'd bother quoting me in context, as you might try at least reading, if not quoting, the remarks I specifically responded to. Instead, you've quoted only my remarks separately, made the mistake of characterizing them in inflammatory terms, while you completely ignore the set of inflammatory, incorrect, utterly inaccurate remarks I responded to in the first place. Remarks not made by me.

Amazing, really, that you'd single out my remarks as if I brought up the topic, uh, which I did not do. Your comments are very sad and utterly out of place. Try reading what I write in context, please.
 