Opinions needed on this Interview

DaveBaumann said:
Xmas said:
I don't quite buy that argument. There is such a thing as a L2 texture cache.

I've asked ATI about that before and AFAIK they don't have a L2 cache

Given a < 8 KB size for the primary cache (L1), I can't see the need for a separate L2 for vertex data. So at least for vertex data, I presume that to be true. My own benchmarks see less than a 16KB cache setup for geometry, and not much more per pixel, and no more for a reuse cache (L2). Looking at the cache for pixel data is outside of my knowledge/benches, although I am fairly drunk as I type :p
 
Tweaker said:
Jawed said:
http://www.beyond3d.com/forum/viewtopic.php?p=64708#64708

Load balancing is nearly perfect in most cases (3dmark, for example, is linear with # chips)...

Jawed


DaveBaumann said:
However, certainly in a bench such as 3DMark05, Tiling will be of very little benefit, much like SFR is, and AFR is the best solution for producing the highest score

:?

Dave's quote refers to 3DMark '05, which is known to be geometrically (vertex) bound. In this case, Tiling (or split screen for that matter) will not do much to help, whereas alternate frame would.

Jawed was referencing Sireric's comments from back in late '02, which I'm assuming refer to 3DMark '03, which is primarily pixel shader bound. In this case, Tiling (or split screen) should provide an improvement.
 
Eric was prolly talking about 3DMk03 (or was it 01?). 05 has a huge vertex load so AFR should be the one. Who knows eh?...

Jawed
 
Rys said:
Given a < 8 KB size for the primary cache (L1), I can't see the need for a separate L2 for vertex data. So at least for vertex data, I presume that to be true. My own benchmarks see less than a 16KB cache setup for geometry, and not much more per pixel, and no more for a reuse cache (L2). Looking at the cache for pixel data is outside of my knowledge/benches, although I am fairly drunk as I type :p
I wasn't talking about vertex data, only texture data. Those caches are separate. I guess the post-transform vertex cache is somewhere in between 1 and 3 KiB (16-entry for ATI, 24-entry for NVidia).

NV40 has a L1 texture cache (probably 512 to 1024 bytes) per quad pipeline, as well as a somewhat bigger L2 texture cache that is shared across all quad pipelines.
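That two-level arrangement - tiny per-quad L1s backed by one shared L2 - can be sketched as a toy simulation. The class name, sizes, and lookup logic below are illustrative assumptions, not real NV40 behaviour:

```python
class TextureCacheSim:
    """Toy two-level texture cache: per-quad L1s over one shared L2.
    Purely illustrative; real caches track capacity, lines and eviction."""

    def __init__(self, num_quads=4):
        self.l1 = [set() for _ in range(num_quads)]  # one small L1 per quad pipeline
        self.l2 = set()                              # L2 shared across all quads
        self.memory_fetches = 0                      # trips to video memory

    def fetch(self, quad, texel):
        if texel in self.l1[quad]:
            return                        # L1 hit: nothing leaves the quad
        if texel not in self.l2:
            self.memory_fetches += 1      # missed both levels: hit video memory
            self.l2.add(texel)
        self.l1[quad].add(texel)          # fill the quad's L1 from L2

sim = TextureCacheSim()
sim.fetch(0, (5, 5))   # quad 0: cold miss, costs one memory fetch
sim.fetch(1, (5, 5))   # quad 1: L1 miss, but the shared L2 absorbs it
print(sim.memory_fetches)  # 1
```

This is exactly the saving Xmas describes: a texel needed by several quads is fetched from video memory only once, even when the quads are working on unconnected screen areas.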
 
Hey jb, after reading through this again, I am really curious about your comment here:

You just have to understand that it helps NV as well, as they are going to spin anything off in the best light they can. But it's interesting to read these things and compare them to some of the recent ATI interviews, as the style of the answers is almost night and day

That one kinda left me scratching my head. I mean, PR is PR, no matter who does it, but I don't really get your point here. We see spin from everyone: we have the Quake/Quack affair on one side, and then Huddy's infamous PowerPoint about "Don't push SM 3.0 until we have our part ready" on the other.

Just curious if you had some concrete examples of what you are talking about.
 
Xmas said:
I don't quite buy that argument. There is such a thing as a L2 texture cache. Its purpose is to prevent that a texel that is needed for several quads is fetched multiple times from video memory. So even if a quad pipeline is working on a quad that is not connected to the previous quad, the cache saves valuable bandwidth.
I've never profiled texture cache performance in any way, but I suspect the main reason for the texture cache is to hide memory latency. I doubt they are big enough to see significant reuse. I'd be curious to see some numbers though.
 
3dcgi said:
I've never profiled texture cache performance in any way, but I suspect the main reason for the texture cache is to hide memory latency. I doubt they are big enough to see significant reuse. I'd be curious to see some numbers though.
Without reuse, how much would a cache buy you? Bilinear filtering means most texels are used four times. There is massive reuse, especially when magnifying textures. The cache doesn't need to be big, because the rendering process is optimized to take advantage of locality.
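A back-of-the-envelope sketch of that reuse, assuming roughly 1:1 texel-to-pixel scale over a 16x16 tile (the names and sizes here are illustrative, not hardware values):

```python
# Count how often each texel is touched when bilinearly sampling a
# 16x16 pixel tile at roughly 1:1 texel-to-pixel scale.
from collections import Counter

def bilinear_footprint(u, v):
    """The four texels a bilinear sample at (u, v) reads."""
    x, y = int(u), int(v)
    return [(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)]

touches = Counter()
for py in range(16):
    for px in range(16):
        # offset by half a texel so every sample straddles four texels
        for texel in bilinear_footprint(px + 0.5, py + 0.5):
            touches[texel] += 1

total_reads = sum(touches.values())   # 16 * 16 * 4 = 1024 texel reads
unique_texels = len(touches)          # only 17 * 17 = 289 distinct texels
print(total_reads, unique_texels)
```

Even without magnification, the filter alone means each interior texel is read four times, so a cache holding just a few rows of texels turns roughly three quarters of the reads into hits.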
 
JoshMST said:
We see spin from everyone, and we have the Quake/Quack things on one side, and then Huddy's infamous power point about "Don't push SM 3.0 until we have our part ready"

By the way, that Huddy statement has quotes around it. Is that a direct quote? Unfortunately, I didn't keep that PowerPoint when I had the chance. I remember the original quote being about performance, and, specifically, about branching. Which seems more prescient than anything at this point, at least as regards NV40, though ATI still has to prove its part of the statement with R520.

Edit: Oh, here it is: "Steer people away from flow control in ps3.0 because we expect it to hurt badly. [Also it's the main extra feature on NV40 vs R420 so let's discourage people from using it until R5xx shows up with decent performance...]"

Yeah, a little of both I guess. But it's not really dirty pool unless, of course, flow control doesn't "hurt badly" (and apparently it does)... or unless R520 isn't much better at it.
 
Thanks for that Pete.

What's interesting is how Eric describes how a pipe can have multiple tiles all queued in its FIFO for it to work on. Apart from anything else, this could imply that a tile is only shaded by a single quad. In terms of texture caching this seems like a win to me - you're better off texturing triangles with a minimum of quads, whenever possible. The tiled organisation of quads, in effect, provides for a virtual L2 cache, at least at the level of triangles. Obviously, triangles that are bigger than a tile will chew through more quads.

Eric doesn't really talk about pipes versus quad-pipes, so it's unclear to me if he actually means "quad-pipes" when he says "pipe". But I'm going to assume he means "quad-pipes".

Anyway, Dave is still dodging the question I asked earlier, how big are tiles in a single-card/chip configuration and are the supertiles different in size?

All I'm suggesting is that a single-card/chip configuration uses, say, a 16-pixel square tile and when super-tiling for multi-card/chip configurations is operating, the tiles are still 16-pixels square, but instead of only having, say, 4 quads (X800XT) to assign tiles to, there are now 16-quads to assign tiles to (2x X800XT).

It's worth considering what happens when you mix cards with differing numbers of quads, e.g. X800 Pro with X800XT. With 7 quads the neat pattern that Dave drew in his second diagram doesn't seem such a good fit. Therefore I'm not convinced Dave has got the correct granularity for the tiles (super-tiles) in his diagram and further, I think the organisation he's proposing doesn't lend itself to scalability - e.g. a 3 card/chip solution.

Quad assignment is deterministic within the cards, each acting independently - if each card knows how many quads there are in total, it knows precisely which tiles are destined for its quads - therefore the cards do not need to communicate with each other in order to agree on tile assignment.

For these reasons I think tile granularity runs at the quad level (e.g. a 16x16 pixel tile will require 64 passes through a single quad pipeline) - there's only 1 quad processing pixels in a tile, regardless of whether you have a single-card/chip or a multi-card/chip configuration.
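The pass count behind that claim is simple arithmetic, assuming a quad pipeline shades one 2x2 pixel block per pass:

```python
TILE_W = TILE_H = 16   # tile dimensions in pixels (per the X800 whitepaper)
QUAD_W = QUAD_H = 2    # a quad pipeline shades a 2x2 pixel block per pass

# One quad must step through every 2x2 block in the tile.
passes = (TILE_W // QUAD_W) * (TILE_H // QUAD_H)
print(passes)  # 64
```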

I'll draw a diagram later. Meanwhile feel free to kick my sandcastle over.

Jawed
 
This is how a single card with 3 quads uses tiles:

b3d05.gif


And this is how a pair of cards would use tiles:

b3d06.gif


Each tile is shown as 16x16 pixels. Each tile therefore requires 64 passes through its owning quad to complete rendering.

Jawed
 
Jawed - personally I would say that the quad tile regions are rendered next to each other, e.g.:

Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4

or

Q1 Q2 Q1 Q2 Q1 Q2 Q1 Q2
Q3 Q4 Q3 Q4 Q3 Q4 Q3 Q4

Probably more likely the latter.

Remember the dispatch engine will be dispatching from setup triangles - with such large gaps between each of the regions a quad within a chip is rendering (as in your pictures), a lot of scene geometry is going to need to be buffered for all quads to be active. I would guess it's more likely that the geometry batches are going to be fairly localised, so the smaller the areas between each quad, the less geometry needs to have been set up and buffered in order to keep all the quads busy.

The "SuperTiling" is just a meta-tile scheme that groups the quad tiles together and sections them off between each board.
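Dave's second, finer-grained layout can be generated with a trivial interleave rule (the 2x2 repeating layout is an assumption for illustration; the real hardware pattern isn't documented here):

```python
def quad_for_tile(tx, ty, quads_x=2, quads_y=2):
    """Which quad owns screen tile (tx, ty) under a 2x2 interleave.
    Quads are numbered Q1..Q4 to match the patterns in the post."""
    return (ty % quads_y) * quads_x + (tx % quads_x) + 1

for ty in range(2):
    print(" ".join(f"Q{quad_for_tile(tx, ty)}" for tx in range(8)))
# Q1 Q2 Q1 Q2 Q1 Q2 Q1 Q2
# Q3 Q4 Q3 Q4 Q3 Q4 Q3 Q4
```

Because the rule is a pure function of tile coordinates, any chip that knows the layout can compute tile ownership locally, which is the property the deterministic-assignment argument above relies on.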
 
http://www.beyond3d.com/forum/viewtopic.php?p=389402#389402

Anyway, each pipe has huge load balancing fifos on their inputs, that match up to the tiles that they own. Each pipe is a full MIMD and can operate on different polygons, and, in fact, can be hundreds of polygons off from others. The downside of that is memory coherence of the different pipes. Increasing tile size would improve this, but also requires larger load balancing. Our current setup seems reasonably optimal, but reviewing that, performance wise, is on the list of things to do at some point. We've artificially lowered the size of our load balancing fifos, and never notice a performance difference, so we feel, for current apps, at least, that we are well over-designed.

I interpret the "huge load balancing fifos" to imply that the queues of raster-sectioned triangles ready for the pixel shaders to work on are, collectively, "huge".

Eric sorta seems to imply that a quad owns a tile "Anyway, each pipe has huge load balancing fifos on their inputs, that match up to the tiles that they own." :?:

This would translate into hundreds :?: of entries per quad, with each quad having a separate FIFO. The number of tiles in each quad's FIFO depends on the overall density of triangles across tiles...
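The per-quad FIFO arrangement being inferred here might look something like this toy sketch - the depth, ownership rule, and structure are guesses for illustration, not ATI's actual design:

```python
# Toy model: the setup engine pushes rasterised tile work into a FIFO
# per quad, and each quad drains its own queue independently (MIMD-style,
# so quads can be many triangles apart from one another).
from collections import deque

NUM_QUADS = 4
FIFO_DEPTH = 256   # "huge" is relative; this depth is a pure guess

fifos = [deque(maxlen=FIFO_DEPTH) for _ in range(NUM_QUADS)]

def dispatch(tile_x, tile_y, triangle_id):
    """Setup engine: route a tile's work to the quad that owns that tile."""
    owner = (tile_x + tile_y * 3) % NUM_QUADS   # placeholder ownership rule
    fifos[owner].append((tile_x, tile_y, triangle_id))

dispatch(0, 0, 17)
dispatch(1, 0, 17)
print(fifos[0].popleft())  # (0, 0, 17) - quad 0 pulls its own work
```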

The best case memory accesses arise when triangles fall entirely within a single tile, because only a single quad's L1 cache is consumed by those triangles' textures - rather than having the same textures appear multiple times in separate quads' caches. This appears to be what Eric is referring to when he says:

I could imagine that if you did single pixel triangles in one tile over and over, that performance could drop due to tiling, but memory efficiency would shoot up, so it's unclear that performance overall would be hurt.

Since you can only rasterise triangles once you've "worked them all out" (i.e. found the edges of all the triangles and worked out depth), a fair amount of geometry work has to be completed before shading can start (i.e. a queue). By dividing the frame into tiles, the setup engine (working with the Hierarchical Z unit) works out the rasterisations of all triangles that fall into those tiles, and once a tile is completely rasterised it can be put into the tile queue.

Page 9 is quite explicit about this, now that I've had a rummage:

http://www.ati.com/products/radeonx800/RADEONX800ArchitectureWhitePaper.pdf

The Setup Engine passes each quad pipeline a tile containing part of the current triangle being rendered.

I dare say that page is quite convincing that a tile is owned by a quad. So, together with knowing that a tile is currently 16x16 pixels, it seems conclusive to me that my diagrams hold.

Additionally, if you perform a simple round-robin tile allocation using equal-sized tiles, then each card's setup engine knows which tiles it's going to work on without having to communicate with the other card. So you have a simple mechanism that allows each card to shade pixels in proportion to its quad capacity.

For example, a 2-quad card will only get one third of the tiles if it's working with a 4-quad card, and neither card has to agree with the other about which tiles are its own.
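That proportional round-robin can be sketched as follows; the interleave order is an assumption, since ATI's actual allocation scheme isn't public:

```python
# Deterministic round-robin tile allocation weighted by quad count, so
# each card can compute its own tile set with no inter-card communication.
def owner_of_tile(tile_index, quads_per_card):
    """Return which card owns a tile, cycling one slot per quad."""
    total_quads = sum(quads_per_card)
    slot = tile_index % total_quads
    for card, quads in enumerate(quads_per_card):
        if slot < quads:
            return card
        slot -= quads

# A 2-quad card paired with a 4-quad card: over any run of tiles,
# card 0 ends up with one third and card 1 with two thirds.
owners = [owner_of_tile(i, [2, 4]) for i in range(12)]
print(owners.count(0), owners.count(1))  # 4 8
```

Both cards evaluate the same pure function over the tile index, so each one knows its share up front - the property claimed in the paragraph above.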

Alternatively, in the E&S system, the level of AA required (beyond 6x) determines how many cards share a tile, each card rendering a different AA sampling pattern on the tile.

Of course, if anyone can persuade Eric to give us a more definitive insight, that would be 8)

Jawed
 
Jawed - simple. Find an image on the web (there have been some about) of what happens when you re-enable a non-working quad. There is a fairly fine-grained chequer pattern where you can see the non-working quads; the distribution is fairly close.

Terms such as "huge FIFOs" are relative - huge on-chip FIFOs are probably still fairly small in real storage. To render in a distribution such as yours and effectively balance the load, you'd have to be binning effectively the entirety of the setup triangles, and that ain't gonna happen in a FIFO.

You can begin rendering a triangle as soon as the setup engine spits one out, you don't need to have worked all of them out.
 
JoshMST said:
Hey jb, after reading through this again, I am really curious about your comment here:

Just curious if you had some concrete examples of what you are talking about.

Well take a look at this one:
http://3dcenter.org/artikel/2005/03-31_a_english.php

I know that all of NV's interviews are scrubbed by their PR department, and you can see that in some of what you had. Now please, I am not saying it's bad, wrong, evil, etc. It just is what it is. If you look at the above ATI interview, there seems to be a lot less PR influence. Some have noticed this trend for a while now....

Again Josh, not saying one is better than the other.....
 
Jawed, are you suggesting that a 3 quad card and a 4 quad card would stay as such while working together in ATI's version of SLI? (AVP/AMR/whatever)

I thought the better card was supposed to be downgraded to the level of the worse card... So that in case of x800pro+x800xt both cards would only have 3 quads working, making the situation identical to as if two x800pros were in fact used instead.
 
DaveBaumann said:
You can begin rendering a triangle as soon as the setup engine spits one out, you don't need to have worked all of them out.

No, but you need to have worked out the visible portions of the triangles being shaded - that's what hierarchical Z is all about. The setup engine doesn't issue single triangles; it has to issue the visible portions of triangles within the constraints of the tile being issued. Otherwise you'd be shading vast numbers of pixels that are overdrawn by the next triangle that comes along.

The example on page 9 of the whitepaper shows a 16x16 pixel tile. Grr.

Jawed
 
Mendel said:
Jawed, are you suggesting that a 3 quad card and a 4 quad card would stay as such while working together in ATI's version of SLI? (AVP/AMR/whatever)

Yes.

I thought the better card was supposed to be downgraded to the level of the worse card... So that in case of x800pro+x800xt both cards would only have 3 quads working, making the situation identical to as if two x800pros were in fact used instead.

Sorry, I'm as much in the dark as you are. The "downgrade" idea was invented by The Inquirer as far as I can tell, so who knows?

The only downgrade that makes sense to me is in terms of capabilities (e.g. shader model 2.0b on one card versus 3.0 on the other - so in this case a downgrade to 2.0b seems inevitable).

Clock rates are a more interesting question - if you've got one significantly slower card in a pair then the user can prolly work out for themselves that it just isn't worth using :)

Jawed
 