Recent Radeon X1K Memory Controller Improvements in OpenGL with AA

Dave Baumann said:
There's no point. With NV4x/G70's dispatch the quads are likely to be working very closely in region to one another, so an L1/L2 cache design works here - ATI's quads are working on completely different regions, making it much less likely that they'll need to share much texture data between one another.
What about ground textures (e.g. repeating cobbles) - won't there be multiple instances in the texture caches of ATI GPUs?

Jawed
 
Jawed said:
What about ground textures (e.g. repeating cobbles) - won't there be multiple instances in the texture caches of ATI GPUs?

Jawed

But no penalty for this for R5XX series, right? Don't know about previous gen.
 
RoOoBo said:
However, even if tile-based shader work distribution can help with data access (and removes the need for a crossbar to route the quads back to their proper ROPs or MCs), random/round-robin distribution gives better load balancing in the shader units (the checkerboard case, for example :)). The tile algorithm runs the risk, when the queues aren't large enough, of unbalancing the pipelines, or of a slow start for some of the pipelines if the first fragments/triangles miss them. Which is better? It depends.
I think the other missing ingredient in this discussion is how the NVidia and ATI architectures construct batches.

If a triangle is too small to fill a batch (i.e. there are fewer quads in the triangle than the nominal batch size for the architecture), does the GPU fill the batch with more triangles (e.g. the succeeding triangles in a mesh)? Or is the empty space in the batch just entirely lost cycles?

Jawed
 
ERK said:
But no penalty for this for R5XX series, right? Don't know about previous gen.
I'm not sure how you conclude that, since R5xx has the same per-quad texture cache organisation as R3xx...R4xx.

The difference in R5xx is that the caches are larger (I think) and fully associative.

Jawed
 
Jawed said:
I think the other missing ingredient in this discussion is how the NVidia and ATI architectures construct batches.

If a triangle is too small to fill a batch (i.e. there are fewer quads in the triangle than the nominal batch size for the architecture), does the GPU fill the batch with more triangles (e.g. the succeeding triangles in a mesh)? Or is the empty space in the batch just entirely lost cycles?

Jawed

Seems like it would be incredibly wasteful to not fill the batch, but again, I suppose it depends on how hard it is to fill the batch with triangles from the succeeding mesh. It probably also depends on how much space is left. Do you worry about it if you can only cram one more triangle in?

Speaking of which, how big are the triangle batches? Do we know?

Nite_Hawk
 
Jawed said:
I think the other missing ingredient in this discussion is how the NVidia and ATI architectures construct batches.

If a triangle is too small to fill a batch (i.e. there are fewer quads in the triangle than the nominal batch size for the architecture), does the GPU fill the batch with more triangles (e.g. the succeeding triangles in a mesh)? Or is the empty space in the batch just entirely lost cycles?

Jawed

That's a good question and I would like to know the answer too. The 'recursive rasterization' that I'm using (and I doubt anyone else is using, even though Akeley, in those 2001 Stanford course slides, seemed to suggest that NVidia, and perhaps others, used it) allows traversing triangles in parallel, and can generate, for the same tile, all the fragments from multiple triangles in a single recursive traversal step. That feature could certainly help fill tile-based batches if you are rendering closely connected triangles (the usual triangle mesh) that don't overlap.
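To make the idea concrete, here's a minimal Python sketch of hierarchical/recursive rasterization as described above: a tile is tested against the triangle's edge functions and, on overlap, subdivided into four sub-tiles down to pixel level. The function names, tile size, and edge-function convention are illustrative assumptions, not any vendor's actual design.

```python
# Sketch of recursive (hierarchical) rasterization. A tile is kept only
# if it may overlap the triangle; overlapping tiles are split into four
# sub-tiles until single pixels are reached. Illustrative only.

def edge(ax, ay, bx, by, px, py):
    # Signed edge function: >= 0 when (px, py) is on the inside of edge a->b
    # for the vertex ordering used below.
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def tile_overlaps(tri, x, y, size):
    # Conservative test: reject the tile only if all four corners lie
    # strictly outside one edge (edge functions are linear over the tile).
    corners = [(x, y), (x + size, y), (x, y + size), (x + size, y + size)]
    for (ax, ay), (bx, by) in zip(tri, tri[1:] + tri[:1]):
        if all(edge(ax, ay, bx, by, cx, cy) < 0 for cx, cy in corners):
            return False
    return True

def rasterize(tri, x, y, size, out):
    if not tile_overlaps(tri, x, y, size):
        return
    if size == 1:
        # Pixel level: test the pixel centre against all three edges.
        px, py = x + 0.5, y + 0.5
        if all(edge(ax, ay, bx, by, px, py) >= 0
               for (ax, ay), (bx, by) in zip(tri, tri[1:] + tri[:1])):
            out.append((x, y))
        return
    half = size // 2
    for dx in (0, half):
        for dy in (0, half):
            rasterize(tri, x + dx, y + dy, half, out)

# Traverse a 16x16 tile for one right triangle covering its lower-left half.
pixels = []
rasterize([(0, 0), (16, 0), (0, 16)], 0, 0, 16, pixels)
```

Because the traversal is driven by tiles rather than by one triangle's bounding box, the same recursion could in principle emit fragments from several triangles per tile, which is the property pointed to above for filling tile-based batches.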
 
Nite_Hawk said:
Seems like it would be incredibly wasteful to not fill the batch, but again, I suppose it depends on how hard it is to fill the batch with triangles from the succeeding mesh. It probably also depends on how much space is left. Do you worry about it if you can only cram one more triangle in?

Speaking of which, how big are the triangle batches? Do we know?

Nite_Hawk

In fragments? It varies a lot, even more if you are removing fragments before shading with HZ and early Z. I have never tried to measure the average fragments per triangle batch with a game trace. Imagine particle rendering: lots of triangles and very few fragments.

In triangles I have seen all kinds of batches, from fewer than 10 triangles to tens of thousands, in the game traces that I have briefly skimmed over. But in the end the average must be in the hundreds to thousands if you want good performance from a GPU, as they have a big overhead for starting a new batch (the whole very large GPU pipeline must be filled again) and for changing render states.
 
Well, it seems to me that it would be stupid to build an architecture that can't fill the batch with a set of triangles that use the same texture and pixel shader.
 
Jawed said:
Earlier I was forgetting that these memory devices have a burst length of 8, I think.
The new GDDR3 ones have a configurable burst length of either 4 or 8, while the older, 144-ball ones only allow 4.

NVidia fills batches with quads from multiple triangles. I'm not sure about R300/R420, but I think the 4x4 pixel threads of R520 certainly belong to one triangle each.
 
Chalnoth said:
Well, it seems to me that it would be stupid to build an architecture that can't fill the batch with a set of triangles that use the same texture and pixel shader.

And I keep saying (unless I'm proved stupid by some IHV engineer sounding off :LOL: ) that I doubt current GPUs are pipelining triangles (or, put another way, OpenGL primitive batches, which could be called draw commands) with different render states. The graphics program changes some state and sends a draw command for X triangles; the GPU renders those triangles until the pipes are empty; the graphics program changes the render state again and sends another draw command; the GPU renders those triangles. I think some overlap of state changes, and of the end of one draw command with the start of the next, is possible. But triangles or fragments with different associated render states in the same stage? No way. Why else would state changes and small batches be so costly? And that's ignoring the CPU cycles they consume...
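A toy cost model of the behaviour described above: if the pipeline drains between draw commands with different render state, every call pays a fixed refill overhead, so small batches amortise poorly. All the cycle counts are invented for illustration, not measured figures.

```python
# Hypothetical costs: a fixed pipeline-refill penalty per draw call,
# plus a steady-state per-triangle cost. Not real hardware numbers.
PIPELINE_REFILL = 1000   # cycles to refill the drained pipeline
CYCLES_PER_TRI = 2       # steady-state cost per triangle

def frame_cycles(triangles_per_call, num_calls):
    # Each call pays the refill overhead once, then streams its triangles.
    return num_calls * (PIPELINE_REFILL + triangles_per_call * CYCLES_PER_TRI)

total_tris = 100_000
# Same triangle count, split into many small calls vs a few large ones:
small = frame_cycles(10, total_tris // 10)      # 10 triangles per draw call
large = frame_cycles(1000, total_tris // 1000)  # 1000 triangles per draw call
# small = 10000 * 1020 = 10,200,000; large = 100 * 3000 = 300,000
```

Under these made-up numbers the small-batch frame costs 34x more cycles for the same geometry, which is the shape of the argument for averaging hundreds to thousands of triangles per batch.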
 
Nite_Hawk said:
Speaking of which, how big are the triangle batches? Do we know?
I've always assumed in R3xx...R4xx that the batch size is the screen-tile size, i.e. 64 quads (256 pixels). Dunno for sure.

In R520 we know the batch size is 4 quads, so the relationship between screen-tiles and batches is relatively soft.

In NV40 it's in the region of 256 quads per fragment-quad pipeline, with all four pipelines in 6800U/GT working together (i.e. 4096 pixels in total). In G70 it's 256 quads per fragment-quad pipeline, but each pipeline is independent - so 1024 pixels per batch.

But there's a wishy-washy factor at play in NVidia architectures that somehow brings the effective batch sizes way down (to about 80% of the sizes I've stated). A real mystery what's going on there...

In G70 I think there are, effectively, 6 concurrent batches working on a large triangle. i.e. if a triangle consists of at least 6 quads (24 pixels), then each of the fragment-quad pipelines will share the workload of rendering the triangle. As far as I can tell NVidia GPUs rasterise each triangle in a round-robin fashion across the available fragment-quads.

Jawed
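For reference, the figures quoted above reduce to simple arithmetic at 4 pixels per quad; the constants below just restate the numbers in the post.

```python
# Batch sizes in pixels, from the quad counts quoted in the post above.
PIXELS_PER_QUAD = 4

r520_batch = 4 * PIXELS_PER_QUAD            # R520: 4 quads per batch
nv40_batch = 256 * PIXELS_PER_QUAD * 4      # NV40: four quad pipes work together
g70_batch  = 256 * PIXELS_PER_QUAD          # G70: each quad pipe independent
# r520_batch = 16, nv40_batch = 4096, g70_batch = 1024
```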
 
If each triangle in a mesh has different gradients and texture coordinates from its neighbours (actually I don't really understand this stuff) then that means that the "batch state" in the GPU has to be able to hold the triangle data for multiple triangles.

If that's the case, then that would seem to create a limitation on a GPU's ability to fill a batch with triangles. Sure the limit might be, say, 16 triangles instead of 1, but it still creates problems when rendering distant, fairly high-poly objects, where each triangle is 10 or 20 pixels.

So, are GPUs capable of holding multiple-triangles' data like this?

Jawed
 
Chalnoth said:
Well, duh, because they focused on off-angle surfaces. It really upsets me that nobody focused on off-angle surfaces back when the Radeon 9700 was released and the GeForce4 Ti cards were still beating the pants off of it in anisotropic filtering quality.
Chalnoth, your incessant rants and refusal to accept even the most obvious facts about NV's few glaring shortcomings in the past piss everybody off here.

I showed you many times that GF4 took an enormous performance hit with anisotropic filtering. It had very little to do with extra samples for off-angle surfaces. NV30 also had angle-independent AF (or very nearly so), and it had a performance hit very similar to ATI's, all else being equal. 90+% of the rendered pixels in a Quake3 demo are vertical or horizontal. Yet we see this from GF4:
http://graphics.tomshardware.com/graphic/20020206/geforce4-17.html#anisotropic_performance
117 fps at 1024x768 w/ 8xAF; 132 fps at 1600x1200 w/o AF
Fillrate drops to almost 1/3! My 9700P shows a 10-20% hit at most in Q3 with 16x Quality AF. This is an extreme case, but typically the GF4 had 3x the performance drop with AF.
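The "almost 1/3" figure follows directly from the quoted frame rates and resolutions (effective fillrate = pixels per frame x frames per second):

```python
# Effective fillrate from the benchmark numbers quoted above.
af_rate    = 1024 * 768 * 117     # GF4 at 1024x768 with 8x AF
no_af_rate = 1600 * 1200 * 132    # GF4 at 1600x1200 without AF
ratio = af_rate / no_af_rate
# af_rate = 92,012,544 pixels/s; no_af_rate = 253,440,000 pixels/s; ratio ~ 0.36
```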

No one cares if the GF4's quality is a bit better when its performance hit is so much higher. It's always been performance first, quality second (up to a certain point, obviously).

Today, graphics cards from ATI and NVidia are often within 20% of each other, which is hard to notice without looking at a graph. Games also use more off-angle surfaces now than back then, since gamers are demanding more varied environments. Hence the focus on AF quality. You being pissed about this says more about your bias than the media's. In fact, even today only HardOCP has stated that it makes a noticeable difference. Some other sites are even writing off ATI's higher-quality AF as insignificant.

Chalnoth said:
Except these two things do not follow. Firstly, I really don't see how you can characterise ATI as doing more "forward thinking." It was, after all, nVidia that was the first to implement a large number of the technologies we take for granted in 3D graphics now, including anisotropic filtering, FSAA, MSAA, programmable shaders, and hardware geometry processing.
MSAA is a speed optimization, and GF4 barely outpaced theoretical SSAA (given the same RAMDAC downsampling) - 4xAA reduced fillrate by 70% instead of 75%. Colour compression was the real innovation. The shader hardware in the original Radeon was unbelievably close to DX8 PS1.0. Both had 8 math ops, and both had fixed-mode dependent texturing, but the Radeon had a 2x2 matrix multiplication instead of 3x3. It had 3-texture multitexturing instead of 4. I worked at ATI, and I am (rather, was) very familiar with the R100/R200 architecture. The vertex shaders were barely changed - R100 just didn't quite meet the spec, which rumour says was changed too late for ATI, so they couldn't call it a programmable vertex shader according to Microsoft. Saying who invented what in any field is often a wash, and realtime graphics is no different.

I'm not arguing that ATI is way more innovative or forward-looking than NVidia, but rather that these innovations are quite evolutionary. Both companies are driving each other similarly, especially when you consider how early the design decisions are made. If that's what you're saying too, then maybe you shouldn't come off like you're saying ATI is just a follower.

Either way, lay the AF thing to rest. GF4's AF speed was pathetic.
 
Jawed said:
I'm not sure how you conclude that, since R5xx has the same per-quad texture cache organisation as R3xx...R4xx.

The difference in R5xx is that the caches are larger (I think) and fully associative.

Jawed
I was just thinking in terms of the following:
1. Cache hits reduce latency penalties (over misses).
2. The X1K architecture masks latency with thread swapouts.
3. Therefore, cache duplication, even though storage inefficient, would not cause any penalty because any threads needing different cache data would just be scheduled around the latency.

If this does not apply to this situation, my bad.
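Point 2 above can be illustrated with a tiny round-robin scheduler simulation: when a thread stalls on a texture fetch, another ready thread is swapped in, so the ALUs stay busy as long as enough threads are in flight. The latency and work figures are invented for illustration and bear no relation to real R5xx numbers.

```python
# Toy latency-hiding model: threads alternate a few ALU cycles with a
# long texture-fetch stall; a round-robin scheduler swaps stalled
# threads out. All constants are hypothetical.
import collections

MISS_LATENCY = 100   # cycles a thread waits on a texture fetch
ALU_WORK = 4         # cycles of math a thread runs between fetches

def busy_fraction(num_threads, fetches_per_thread=10):
    ready = collections.deque(range(num_threads))
    waiting = {}                     # thread id -> cycle its fetch returns
    remaining = {t: fetches_per_thread for t in range(num_threads)}
    cycle = busy = 0
    while remaining:
        # Wake any threads whose fetches have completed.
        for t, done in list(waiting.items()):
            if done <= cycle:
                del waiting[t]
                ready.append(t)
        if ready:
            t = ready.popleft()
            busy += ALU_WORK         # run this thread's math
            cycle += ALU_WORK
            remaining[t] -= 1
            if remaining[t] == 0:
                del remaining[t]     # thread finished
            else:
                waiting[t] = cycle + MISS_LATENCY  # issue fetch, stall
        else:
            cycle += 1               # every thread stalled: ALUs idle
    return busy / cycle

single = busy_fraction(1)    # one thread: ALUs mostly idle during fetches
many = busy_fraction(64)     # many threads: the latency is fully hidden
```

With enough threads in flight, extra fetch latency (e.g. from duplicated cache contents missing) is simply scheduled around, which is the gist of point 3.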
 
RoOoBo said:
In triangles I have seen all kinds of batches, from fewer than 10 triangles to tens of thousands, in the game traces that I have briefly skimmed over. But in the end the average must be in the hundreds to thousands if you want good performance from a GPU, as they have a big overhead for starting a new batch (the whole very large GPU pipeline must be filled again) and for changing render states.
Tens of thousands seems a mite large, considering that 10,000 pixels is a 100x100 pixel block. This, of course, may be incentive for looser constraints in what constitutes a batch.
 
Jawed said:
So, are GPUs capable of holding multiple-triangles' data like this?
At least some are.



Mintmaster, seems like both ATI and 3dfx weren't happy about PS1.0 ...
 
ERK said:
I was just thinking in terms of the following:
1. Cache hits reduce latency penalties (over misses).
2. The X1K architecture masks latency with thread swapouts.
3. Therefore, cache duplication, even though storage inefficient, would not cause any penalty because any threads needing different cache data would just be scheduled around the latency.

If this does not apply to this situation, my bad.
Sorry, yeah, you're right that's exactly how R520 is able to skirt the issue more effectively.

Jawed
 
Xmas said:
At least some are.
Do you have any concept of the limit on the number of triangles in a batch in NVidia GPUs? Do the triangles have to have the same normal?

I kinda suspect that ATI GPUs are strictly one-triangle.

I seem to remember that the 16x16 size was described as a trade-off between small triangles and cache.

http://www.beyond3d.com/reviews/ati/r420_x800/index.php?p=5

Reducing the tile size allows for higher efficiency with smaller triangles, while larger tiles favour texturing efficiency.

There was a forum post by one of the ATI guys describing this - but I can't find it.

Jawed
 
Jawed said:
Do you have any concept of the limit on the number of triangles in a batch in NVidia GPUs? Do the triangles have to have the same normal?
No, this part of the pipeline has no concept of a surface normal (actually, no part has). The Z-gradients don't have to be identical, and the face register depends on the winding of the vertices.

I don't know how many different triangles can be in a batch, but I would be surprised if it's more than 16.
 
Bob said:
It's on the order of 20 triangles per shader pipe. The triangle normal is irrelevant.

 