Opinions needed on this Interview

Well, Josh, I'm sorry if I led you into trouble with NV in answering my questions. I can understand their sensitivity on this issue --it's less clear to me why you don't take responsibility for the context entirely independently of them.

After all, the article absolutely screams "hey folks, we are soon to have two competing multi-card implementations --wouldn't it be fun to compare/contrast them as best we can at this point (given that one of them is not-yet-quite-announced)?".

It's perfectly understandable that NV would be skittish/sensitive about picking questions to answer in that context, but I just don't get why your interest (and ours as readers) can't be made entirely clear.
 
DaveBaumann said:
And the effects of that depend on the architecture at hand, as I said in the reply.

I don't understand how your reply explains away the issue Nvidia raised. If a texture spans two tiles, it must be fetched by both graphics cards, unless ATI has some kind of texture fetch virtualization that allows only *part* of a texture to be fetched. That means extra bus traffic.
 
Oh no Geo, this is not your fault. I really could have handled my responses a lot better and should not have gone skipping down the garden path in a daze. I guess we have all been so itchy lately about the accusations of bias, as well as Charlie's little spiel about how we are all being bought off and are paid mouthpieces for company X, that I really overreacted here in trying to distance myself from being viewed as an "NVIDIA fanboi". Add to that my state of consciousness last night, and it turned out to be one of the dumbest things I have ever posted. Half that stuff wasn't even true (like the MAXX), and the other stuff, while not exactly false, was not entirely true (e.g. they didn't answer all my questions... until I got new revisions of the questions to them).

My head is still spinning.
 
DemoCoder said:
I don't understand how your reply explains away the issue Nvidia raised. If a texture spans two tiles, it must be fetched by both graphics cards, unless ATI has some kind of texture fetch virtualization that allows only *part* of a texture to be fetched. That means extra bus traffic.

Because the tiling division on ATI's chips is already at a finer-grained level per quad than the tiling per board would be - that particular overhead is there already at the chip level. Any cross-chip/board "Super" Tiling would be coarser than the per-quad, chip-level tiling already present.
 
Well, it really is a useful article, Josh (even if it needs an introduction making the context clearer, IMHO :) ), and I don't doubt it is getting a lot of hits. Made the front page of [H]! I hope this little misunderstanding here doesn't make it less enjoyable for you.
 
Yes, but the comparison is not "Supertiling vs tiling on one chip", it's "Supertiling vs SFR", and when Nvidia talks about extra texture fetches, they are talking about each GPU needing to have a copy of the texture. Every time a texture crosses a supertile boundary, that texture must be uploaded to both GPUs, eating up bus bandwidth. This issue has nothing to do with chip architecture at all; it has to do with how the screen is divided between GPUs. Only some kind of texture virtualization on a shared bus could solve it.

Since SFR has fewer boundaries, fewer textures cross the boundary, and therefore the full set of textures in a scene can be partitioned into two or more subsets which are uploaded only to the cards that need them.
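To put a rough number on that argument, here's a 1-D toy model (my own construction with made-up object sizes and a hypothetical 64-pixel supertile width - not anything from NVIDIA's or ATI's documentation) comparing how often an object's textures would be needed on both GPUs under a supertile checkerboard versus a single SFR split:

```python
import random

random.seed(0)
SCREEN = 1024          # screen width in pixels (assumed)
SUPERTILE = 64         # supertile width in pixels (assumed)

def gpus_supertiled(x0, x1):
    """GPUs whose supertiles the horizontal extent [x0, x1] touches (alternating columns)."""
    return {(x // SUPERTILE) % 2 for x in range(x0, x1 + 1)}

def gpus_sfr(x0, x1):
    """GPUs touched when the screen is split once down the middle."""
    gpus = set()
    if x0 < SCREEN // 2:
        gpus.add(0)
    if x1 >= SCREEN // 2:
        gpus.add(1)
    return gpus

# 200 hypothetical objects, each about 100 pixels wide, at random positions
objects = [(x0, x0 + 100) for x0 in (random.randrange(SCREEN - 100) for _ in range(200))]

both_tiled = sum(len(gpus_supertiled(x0, x1)) == 2 for x0, x1 in objects)
both_sfr = sum(len(gpus_sfr(x0, x1)) == 2 for x0, x1 in objects)
print(f"textures needed on both GPUs: supertiling {both_tiled}/200, SFR {both_sfr}/200")
```

Under those assumptions nearly every object straddles a supertile boundary, while only objects near the SFR split line touch both halves - which is the duplication being described. Whether drivers actually exploit that to avoid uploading textures to both cards is exactly what gets questioned below.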
 
DemoCoder said:
Since SFR has fewer boundaries, fewer textures cross the boundary, and therefore the full set of textures in a scene can be partitioned into two or more subsets which are uploaded only to the cards that need them.

Sorry? You're suggesting that each board contains less than the full scene of texture data?
 
DaveBaumann said:
DemoCoder said:
Since SFR has fewer boundaries, fewer textures cross the boundary, and therefore the full set of textures in a scene can be partitioned into two or more subsets which are uploaded only to the cards that need them.

Sorry? You're suggesting that each board contains less than the full scene of texture data?

I've been arguing this precise point for months now and been ignored. :( As I said earlier:

So if all textures end up on both cards (when wouldn't they?...), I don't think it's valid to say this is a bottleneck for supertiling.

Jawed

(edited for clarity)
 
NVIDIA's documentation says that's not the case; the textures are duplicated across both boards. Because you don't know where you'll be looking from one frame to the next, and because of the SFR load balancing, the textures would need to be shifted in and out of each board per frame, which is pretty much unfeasible.
 
That depends on whether we're discussing real-time rendering or off-line rendering. I assumed that super-tiling meant offline rendering, since historically this technique was used for render clusters. Of course, there is no point in doing this in real time, not only because of the unpredictability of rapid scene changes/workload, but also because you'd need access to post-transformed vertices.
 
No, the "SuperTiling" being discussed is ATI's meta-tiling that is widely thought to be the method ATI will use instead of SFR - Eric Demers has mentioned it here before.
 
DaveBaumann said:
DemoCoder said:
I don't understand how your reply explains away the issue Nvidia raised. If a texture spans two tiles, it must be fetched by both graphics cards, unless ATI has some kind of texture fetch virtualization that allows only *part* of a texture to be fetched. That means extra bus traffic.

Because the tiling division on ATI's chips is already at a finer-grained level per quad than the tiling per board would be - that particular overhead is there already at the chip level. Any cross-chip/board "Super" Tiling would be coarser than the per-quad, chip-level tiling already present.
I don't quite buy that argument. There is such a thing as an L2 texture cache. Its purpose is to prevent a texel that is needed by several quads from being fetched multiple times from video memory. So even if a quad pipeline is working on a quad that is not connected to the previous quad, the cache saves valuable bandwidth.

However, if in a two-chip setup each chip is rendering on a different side of some boundary, both chips have to fetch some texels that fall along that boundary. The more boundaries you have, the more such "wasted" memory accesses you have.

This is one of the reasons why 3dfx went from scanline interleave on V2 to stripe interleave on V5.
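A minimal sketch of that effect (my own toy model, not actual hardware cache behaviour): two adjacent 16x16 tiles share some boundary texels, which get fetched from memory once when both tiles run on one chip behind a shared cache, but twice when each tile goes to a different chip with its own cache:

```python
def texels_for_tile(x0, y0, size):
    """Texel addresses a tile touches: one texel per pixel plus a one-texel
    border for bilinear filtering (a simplifying assumption)."""
    return {(x, y) for x in range(x0 - 1, x0 + size + 1)
                   for y in range(y0 - 1, y0 + size + 1)}

def fetches(tiles_per_chip):
    """Count memory fetches when each chip caches the texels it has already read."""
    total = 0
    for tiles in tiles_per_chip:      # one entry per chip
        cache = set()                 # that chip's texture cache
        for tile in tiles:
            misses = tile - cache     # texels this chip hasn't fetched yet
            total += len(misses)
            cache |= misses
    return total

left = texels_for_tile(0, 0, 16)      # two horizontally adjacent 16x16 tiles
right = texels_for_tile(16, 0, 16)

print("one chip, shared cache:", fetches([[left, right]]))
print("two chips, one tile each:", fetches([[left], [right]]))
# The second count is higher: the texels along the shared edge are read twice.
```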


DemoCoder, I don't get what you mean by fetching textures. You mean, from main memory to video memory? Of course all textures need to be completely present in video memory when they are bound. On current architectures without memory virtualization, that is. But the same thing applies to NVidia's SLI solution.
 
Xmas said:
DemoCoder, I don't get what you mean by fetching textures. You mean, from main memory to video memory? Of course all textures need to be completely present in video memory when they are bound. On current architectures without memory virtualization, that is. But the same thing applies to NVidia's SLI solution.

Ya, I meant main memory fetches. I was thinking of a Renderdrive/Gelato-like offline renderer, and how I'd minimize upload and maximize effective memory. Of course, the only way to solve it for real-time is virtualization. I misunderstood what the Nvidia guy was talking about.

Many games seem to be using texture atlases now, so hopefully virtualization becomes standard.
 
Xmas said:
I don't quite buy that argument. There is such a thing as an L2 texture cache.

I've asked ATI about that before and AFAIK they don't have an L2 cache - but then it makes little sense when the work per quad is distributed in regions much larger than 4 pixels anyway (you can see how large by looking at boards with defective quads turned back on - they have quite large chequer-board patterns, and these correspond to a quad's processing region). An L2 cache will operate better when the per-quad workload distribution is on a round-robin basis, as it is then more likely that one of the neighbouring quads is working on the same texture (a la NV4x).
 
Xmas said:
I don't quite buy that argument. There is such a thing as an L2 texture cache.
There is in NVidia hardware. Just out of curiosity, do we know ATI has L1 and L2 caching for textures?...

Its purpose is to prevent a texel that is needed by several quads from being fetched multiple times from video memory. So even if a quad pipeline is working on a quad that is not connected to the previous quad, the cache saves valuable bandwidth.
Hence ATI's super-tiling algorithm, which groups quads of pixels together so that texturing is localised to a quad pipeline. Well, that's my interpretation, anyway. Now an L1 cache with 4-way set association (one per quad pipeline) is all you need.

But, that's a guess, as I don't understand the cache architecture of R300 or later.

Jawed
 
Is the "wasted" fetches really that much anyway? One of the reasons why I overlooked that was because it doesn't seem to be a problem as long as the screen is divided very coursely (in half, in fours, in 8s). Its only if the divisions are small does it seem to waste "alot".
 
Jawed said:
Just out of curiosity, do we know ATI has L1 and L2 caching for textures?...

Read up

Hence ATI's super-tiling algorithm, which groups quads of pixels together so that texturing is localised to a quad pipeline.

All "Super Tiling" will be doing is blocking off some quad regions from one chip/board and making them visible to another.

e.g.:

Single board
[image: 20050420_tile1.jpg]


Dual Chip / Board
[image: 20050420_tile2.jpg]


(Q = Quad, B = Board / Chip)
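If it helps picture the difference in granularity, here's a rough sketch (my own reading of the diagrams above, with assumed tile and supertile sizes - not ATI documentation): each small tile already goes to one of the four quad pipelines, and "SuperTiling" would just hand coarser, checkerboarded groups of those tiles to each board:

```python
# tiles here are the per-quad processing regions (e.g. 16x16 pixels, assumed)
SUPER = 2       # assumed supertile edge length, measured in quad tiles

def quad_for_tile(tx, ty, quads=4):
    """Quad pipeline that processes tile (tx, ty) - a simple interleave."""
    return (tx + ty * 2) % quads

def board_for_tile(tx, ty, boards=2):
    """Board/chip that owns tile (tx, ty) under checkerboarded supertiles."""
    return ((tx // SUPER) + (ty // SUPER)) % boards

for ty in range(4):
    print(" ".join(f"B{board_for_tile(tx, ty)}Q{quad_for_tile(tx, ty)}"
                   for tx in range(8)))
# Each cell prints board then quad for one tile, as in the legend above: the
# B pattern changes every SUPER tiles (coarse), the Q pattern every tile (fine).
```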

DemoCoder said:
Is the "wasted" fetches really that much anyway?

When we're talking about these levels of render performance, not really.
 
DaveBaumann said:
I've asked ATI about that before and AFAIK they don't have an L2 cache - but then it makes little sense when the work per quad is distributed in regions much larger than 4 pixels anyway (you can see how large by looking at boards with defective quads turned back on - they have quite large chequer-board patterns, and these correspond to a quad's processing region). An L2 cache will operate better when the per-quad workload distribution is on a round-robin basis, as it is then more likely that one of the neighbouring quads is working on the same texture (a la NV4x).
Hm, I remember reading some paper recently that mentioned both ATI and NVidia decompress DXTC into the L1 cache. Maybe I misinterpreted that, and ATI really only has small per-quad-pipe caches and no global cache.

The tiles seem to be 16x16. If you apply the one-texel-per-pixel rule of thumb and assume half of the edge texels are fetched twice, you get only about 6% overhead. When using 64x64 supertiles, this really gets insignificant.
I guess this NVidia guy had a much smaller tile size in mind.
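For what it's worth, here's one way to reach those figures (my own arithmetic, reading the assumptions as: one texel per pixel, each tile "owning" two of its four edges, and half of those edge texels being fetched twice):

```python
def duplicate_fetch_overhead(tile_size):
    texels = tile_size * tile_size        # one texel per pixel inside the tile
    owned_edge_texels = 2 * tile_size     # each tile "owns" two of its four edges
    duplicated = owned_edge_texels // 2   # assume half of those are fetched twice
    return duplicated / texels

for size in (16, 64):
    print(f"{size}x{size} tiles: ~{duplicate_fetch_overhead(size):.1%} duplicated fetches")
# 16x16 -> ~6%, 64x64 -> ~1.6%, in line with the figures above.
```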

I guess we have to wait and see whether the less-than-optimal dynamic frame splitting or the less-than-optimal big tiles provide a better load balancing result.
 
Dave, in your first diagram it's unclear how many pixels square each tile is (i.e. you haven't indicated tile boundaries). I think indicating them would make the diagrams clearer. (Guessing you'll be needing diagrams like this for a future article - perhaps ATI will have some they prepared earlier, huh?...)

I haven't a clue how tile sizes would vary depending on the number of chips/boards (1 or 2 etc...) working together. Maybe tile sizes are fixed regardless of configuration.

Meanwhile, here's a nice snippet from way back when:

http://www.beyond3d.com/forum/viewtopic.php?p=64708#64708

Load balancing is nearly perfect in most cases (3dmark, for example, is linear with # chips)...

Jawed
 
Xmas said:
I guess we have to wait and see whether the less-than-optimal dynamic frame splitting or the less-than-optimal big tiles provide a better load balancing result.
It'll also be interesting to see how apps split between supertiling and AFR on MVP.

Obviously with SLI we know how quite a few apps have "chosen" one or the other (SFR or AFR). It'll be interesting to see if the "same" choices are made when running on MVP (equating SFR and supertiling loosely, since AFR is identical across both platforms).

AFR did look like a dead cert for 3DMk05 on MVP, but that December 2002 comment by Eric makes me wonder...

Jawed
 