R580 Architecture Interview
Dave Baumann
21-Jan-2006, 14:51
In our recent "<a href="http://www.beyond3d.com/reviews/ati/r580/" target="_b3dout">R580: ATI Radeon X1900 XTX and Crossfire</a>" article we took a look at the architecture of this new ATI chip and the performance it brings with its interesting configuration. With R580 scaling up the number of pixel shader processors fairly drastically, and yet not similarly scaling either the number of texture inputs or ROP outputs, some may still have the impression that R580 is a fairly imbalanced part.
In an effort to dig further into ATI's thinking behind the design of R580 and gain a greater understanding of the decisions that have led to this process, we decided to put some questions to them. In reply we have answers from Eric Demers, from ATI's Desktop Graphics Engineering group, as well as a few comments from Richard Huddy, who brings in the perspective from their ISV relations group, which has a front line role to play with developers and who plays an important part in shaping future hardware. <a href="http://www.beyond3d.com/reviews/ati/r580/int/">Click here to read more</a>.
Entropy
25-Jan-2006, 12:45
Good article.
Takes a step away from the immediate, and looks at the architectural side. Would have been nice to have a couple of questions and answers that extrapolated into the future as well, (ignoring performance and focussing on architecture), and possibly a question or two that focussed more on the practical aspects - architecture as needed by the marketplace. You did have some Q&A going on there, and there were some answers that made me a bit sceptical, so it would have been interesting with some deeper digging.
For example:
The ATI white paper on X1900 / R580 makes the statement that a part needs to be designed to be best balanced towards the applications that it will be running when it's available. Obviously this is key in the design phase, but with development timescales such as they are it would suggest that you are trying to predict application usage 1-3 years down the line, which, even with the best research, developer contact and developer steering must still end up with an element of "crapshoot" to it?
[Eric Demers] We talk to ISVs to get an idea of what they think will happen in the next few years. As well, we look at shaders for titles coming out next year or two. With both of those pieces of data, we can get a pretty good idea at what will be the newest titles at the introduction time. Those are definitely the types of titles we shoot for, much more so than the older titles (which people will play less of anyway by then). Having said that, there is some guesswork involved as well. But it's also a chicken and egg thing, in that ISVs will tell us what they are doing, but they will also be influenced to designing games with our new technology in mind. If we come out with 3:1 ALU:TEX ratio HW, then designers will tend to add more ALUs for next games, and so it's a mutually influenced evolution.
It's a very valid question, but the answer is dodgy.
* First off, they make a distinct selection in "the types of titles they shoot for". And that selection does NOT seem to be based on volume, because the volume sellers typically aren't particularly graphically demanding.
* Second, the assumption that people play older titles less, (if meaning "less than they play new titles") just isn't correct. The top volume applications are not the latest and greatest. Online statistics and even sales statistics both show this.
* A very large portion of shelf space at retailers is claimed by titles that aren't bleeding edge, for the simple reason that they defend their presence by generating revenue.
* People do not only play their most recent purchase.
So why does he make this claim? Is it truly his perception of reality? Or is he trying to create justification for the X1900? My guess is the latter, because he also makes reference to the chicken and egg problem and how "ISVs will tell us what they are doing, but they will also be influenced to designing games with our new technology in mind". So I guess those statements are geared to influence the perception of the readers (consumers and developers) as to what the future will be.
This begs the question of how representative the X1900 is, or indeed ever will be, when it comes to typical game players. And when it comes to actual games, even the titles used for benchmarking by reviewers, selected for being extraordinary in their graphical requirements, (and therefore by definition atypical and unrepresentative), show only modest gains for the x1900 over the x1800, despite an impressive theoretical factor of three in pixel shader performance.
My answer to Dave Baumann's question would be that yes, at this time ATI missed the window a bit, and ended up with a design that isn't quite optimal in terms of spent transistors, when taking their typical application space into account. This is particularly clear in the weak texture capabilities of the RV530. It would be interesting to know if this was by accident, or if they judge the application space differently from their consumers (actual games played).
Yum! Looking forward to the expressions of joy and new rounds of analysis from Jawed. :lol: A very useful interview.
Tho, re BW limitations, it would have been interesting to get a take on the upcoming move to GDDR4 re the R580 architecture. So that when Richard says "hey, when you're BW-limited, you're BW limited", is that still true re R580 architecture when GDDR4 hits?
* Second, the assumption that people play older titles less, (if meaning "less than they play new titles") just isn't correct. The top volume applications are not the latest and greatest. Online statistics and even sales statistics both show this.
Well, now, let's land that in some actual title names, okay? Which ones did you have in mind? And then let's go look at the FPS of R580 for those titles and you tell me if a reasonable gamer should be feeling put out that R580 isn't doing more for those titles.
Dave Baumann
25-Jan-2006, 15:10
So why does he make this claim? Is it truly his perception of reality? Or is he trying to create justification for the X1900? My guess is the latter, because he also makes reference to the chicken and egg problem and how "ISVs will tell us what they are doing, but they will also be influenced to designing games with our new technology in mind". So I guess those statements are geared to influence the perception of the readers (consumers and developers) as to what the future will be.
This begs the question of how representative the X1900 is, or indeed ever will be, when it comes to typical game players. And when it comes to actual games, even the titles used for benchmarking by reviewers, selected for being extraordinary in their graphical requirements, (and therefore by definition atypical and unrepresentative), show only modest gains for the x1900 over the x1800, despite an impressive theoretical factor of three in pixel shader performance.
X1900 specifically? It's probably not representative of typical gamers in the least - typical gamers aren't spending $600+. I think the point of the situation is that the titles that sell well are the titles that feed off the common base, which boards beyond $200 are really not going to play badly in the first place - the best selling 3D title is probably Sims 2, but there's not really a need to spend massive amounts of transistors accelerating things like that as we're already at the point of diminishing returns. Of course, your $600 user is going to care about Unreal 2007 running well and looking gorgeous, so making an architecture aimed towards that is probably a reasonable thing to do for these high end boards.
In the discussion I had with Richard (which may not have been picked up wholly in the article), the point was that they are concentrating on the development houses whose engines are likely to span more than just their own titles; UE3 is probably a very important case here.
My answer to Dave Baumann's question would be that yes, at this time ATI missed the window a bit, and ended up with a design that isn't quite optimal in terms of spent transistors, when taking their typical application space into account. This is particularly clear in the weak texture capabilities of the RV530. It would be interesting to know if this was by accident, or if they judge the application space differently from their consumers (actual games played).
What's with the "weak texturing capabilities" of RV530? As our testing indicates, when you pair it against its direct predecessor (RV410), despite its lower texture rate, in no gaming instance is it actually slower. At the moment I think RV530 is a better example of the design principles than R580 is.
:cry: I feel the answers were often dodged at the critical parts. Sigh.
[Eric Demers] I'd have to check with the compiler team, but on average, I think we see about 2.3 scalars per instruction being close to the average. Being able to do 2 full scalars (one using VEC and one Scalar) pretty much means that we are pegged out; as well the smaller ALU gets used a lot as well, giving an effective 2~4 scalars per cycle. As well, the average shader instruction (multiple scalars) to texture ratio is around 3 right now, and from what ISVs are telling us, it's likely to increase in the next few years. Consequently, the number of ALUs seems to be hitting the "sweet" spot for new applications (while being slightly underutilized for older apps) as well.
I really don't understand that :oops: The question was about vec/scalar, but there's no reference to vectors in the answer.
[Eric Demers] Right now, they seem to balance at 1:1 (TEX:ROP), but the trend is towards lowering ROPs, in general. The reality is that shading per pixel is increasing, which usually means many ALUs and many textures per pixel, as well as many cycles per pixel. Since we need only 1 ROP per cycle per pixel, effectively, the ROP throughput requirement is going down on new apps. An RV530 is a prime example – It doesn't have more ROP than the R515, but having triple the shading and double the Z, it's around 2x the speed of the R515 in a lot of cases. Finally, with HDR becoming more popular, the BW requirements of these pixels is high, so that the ROP throughput is possibly going down, even though each operation is 2x wider.
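To put rough numbers on the point about ROP demand falling as shaders get longer, here is a minimal sketch. It assumes every finished pixel needs exactly one ROP write and that a fixed number of pixels are shaded in parallel; `pixels_in_parallel` and the cycle counts are illustrative values, not figures from the interview.

```python
def rop_writes_per_clock(pixels_in_parallel: int, shader_cycles_per_pixel: float) -> float:
    """Average ROP writes per clock needed to keep up with the shaders.

    Assumption: each pixel occupies a shader slot for `shader_cycles_per_pixel`
    clocks and then needs exactly one ROP operation.
    """
    if shader_cycles_per_pixel <= 0:
        raise ValueError("shader_cycles_per_pixel must be positive")
    return pixels_in_parallel / shader_cycles_per_pixel


if __name__ == "__main__":
    # Hypothetical: 16 pixels shaded in parallel, with shaders of growing length.
    for cycles in (1, 4, 16):
        demand = rop_writes_per_clock(16, cycles)
        print(f"{cycles:>2} cycles/pixel -> {demand:.1f} ROP writes/clock")
```

Under these assumptions the ROP demand drops in direct proportion to shader length, which is the shape of the argument above; it says nothing about bandwidth per write, which HDR formats push the other way.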
I wonder if a hidden part of this answer is that in future games overdraw will be drastically reduced? I'm thinking of occlusion querying and generally more advanced geometry shader techniques.
Also there's no doubt that GDDR4 is going to make ROPs run faster. Even though the interview seems to proceed as though it doesn't exist. Big SIGH.
[Eric Demers] If you look at apps such as 3dmark05, FEAR, D3/Q4, and many new upcoming titles, you'll see that the RV530 often doubles (or more) the performance against RV515, at the same clock. Same is true for X1900 vs X1800.
Apart from Toyshop, what else runs twice as fast on X1900XTX as X1800XT?
Also, I think RV530 and R580 are gaining around 20% performance through increased texture unit utilisation (20%) and in RV530's case, the double-rate Z is prolly making a huge difference in D3/Q4.
I don't like this answer.
Jawed
What's with the "weak texturing capabilities" of RV530? As our testing indicates, when you pair it against its direct predecessor (RV410), despite its lower texture rate, in no gaming instance is it actually slower. At the moment I think RV530 is a better example of the design principles than R580 is.
This I can wholly agree with, RV530 being less bandwidth constrained in relation to its texturing capability.
And with GDDR4 (presuming R590, at least, gets it) there should be a big bump in texturing performance at the ultra high end before Vista arrives.
I just hope that RV560, 8-1-3-1/2 isn't constrained by a high-end (700-800MHz) 128-bit bus... That would be a real shame. Sadly, it seems very likely - small die, limited pads for the bus, blah blah blah...
Jawed
I just hope that RV560, 8-1-3-1/2 isn't constrained by a high-end (700-800MHz) 128-bit bus... That would be a real shame. Sadly, it seems very likely - small die, limited pads for the bus, blah blah blah...
Doesn't sound too bad, at least if it is 8-1-3-1. Basically half a R580, so half the memory bandwidth of the R580 should be enough, no?
It's a shame that the interview didn't pursue the register file size. I was pretty shocked to discover from the Hexus R580 review that the nominal capacity is 2 FP32s per fragment - (compared with 12 for Xenos as documented by M$ dev stuff). Even though Xenos and R580 actually seemingly have the same register file size (768KB).
In a way that would tend to suggest that R580 is better suited to texture-latency hiding in (older) PC games, which include a lot of texture-intense short shaders where 1 or 2 registers is prolly the maximum. I imagine such "naive" shaders will be relatively rare on XB360.
At the same time, R580 can stretch to very high register counts per fragment (just like Xenos), 8, 10 etc. at the cost of the total number of threads per shader unit (nominal max 128 threads - 10 FP32s means 25 threads instead, with 1200 fragments in flight per shader unit, which is a lot). In this situation I'm going to boldly guess that the larger number of fragments per thread in R580, 48 instead of 16 in R520, helps texturing coherency (less texture cache thrashing), so R580 may be at less of a disadvantage than it appears. This is, after all, why older GPU designs have large batches - to hide texturing latency and make best use of texture cache coherency.
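The thread and register arithmetic in the paragraph above can be checked with a short sketch. The 128 threads, 48 fragments per thread and 2 FP32 registers per fragment come from the post; the 16-byte register size (a 4-component FP32 vector) and the count of 4 shader units are assumptions added here to see whether the total lands near the 768KB figure.

```python
NOMINAL_THREADS = 128      # nominal max threads per shader unit (from the post)
FRAGMENTS_PER_THREAD = 48  # R580 batch size quoted in the post
NOMINAL_REGS = 2           # FP32 registers per fragment at full thread count

# Assumptions, not stated in the post:
BYTES_PER_REG = 16         # one register = 4 x FP32 components
SHADER_UNITS = 4           # number of shader units sharing the total


def register_slots() -> int:
    """Per-fragment register slots available in one shader unit."""
    return NOMINAL_THREADS * FRAGMENTS_PER_THREAD * NOMINAL_REGS


def threads_available(regs_per_fragment: int) -> int:
    """Threads that fit when each fragment uses `regs_per_fragment` registers."""
    fit = register_slots() // (FRAGMENTS_PER_THREAD * regs_per_fragment)
    return min(NOMINAL_THREADS, fit)


def fragments_in_flight(regs_per_fragment: int) -> int:
    return threads_available(regs_per_fragment) * FRAGMENTS_PER_THREAD


def register_file_kib() -> int:
    """Total register file size in KiB under the assumptions above."""
    return register_slots() * BYTES_PER_REG * SHADER_UNITS // 1024


if __name__ == "__main__":
    for regs in (2, 8, 10):
        print(regs, threads_available(regs), fragments_in_flight(regs))
    print("register file:", register_file_kib(), "KiB")
```

With 10 registers per fragment this gives 25 threads and 1200 fragments in flight, matching the figures in the post, and the assumed register size and unit count happen to reproduce 768KB.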
Although, to be honest, Eric's answers all seem to indicate that R580's texturing capability starts off with plenty of disadvantages compared with R520 and he seems to be at a loss as to why texturing in R580 doesn't keel over and die horribly.
The concept of TMU idle time barely crops up. I'd really like a more detailed discussion of TMU scheduling to maximise utilisation...
In theory for every fragment in flight, every ALU operation that isn't matched by a cycle of TMU operation is unduly bottlenecked. As the ALU:TEX ratio increases the only solution is to increase the number of fragments in flight in order to increase the chance that there's a texture operation that can be performed, which is where the register file size comes in, again...
Ah well.
So, after all that, the 2 FP32s per fragment doesn't seem so bad :smile: But I wonder if such a low starting point can be sustained in R600 etc.
(Oh, and I wonder what RV530's register file sizing is.)
Jawed
Doesn't sound too bad, at least if it is 8-1-3-1. Basically half a R580, so half the memory bandwidth of the R580 should be enough, no?
Not if R580's texturing is more bandwidth-limited than RV530.
Jawed
Dave Baumann
25-Jan-2006, 17:09
It's a shame that the interview didn't pursue the register file size. I was pretty shocked to discover from the Hexus R580 review that the nominal capacity is 2 FP32s per fragment
Errrrm, I'm pretty sure its more than two. I would suggest that the application may be an issue here.
Well you and Rys feel free to fight it out :twisted:
Jawed
I'm already seeing a 50:50 bw distribution between ROP and Texture data in games like Doom3 or Quake4. So I doubt you need anything fancy like a deferred renderer removing overdraw to get the same or even distributions more biased to texture data in the future. It escaped me, though, that given that you are limited by the rasterizer Z sampler throughput beyond 2xAA, and that with AA z and color data is very compressible, higher AA modes can actually reduce bandwidth per cycle rather than increase it ... But I haven't done anything on AA yet so I have an excuse for that.
My guess is that the ratio will remain the same while developers keep using fillrate only passes either for stencil or shadow map generation. After all, even now without those passes ROP fillrate and bw requirements are pretty low.
About threads and latency hiding: my experiments show that you don't really need so many active threads (nor even ready threads) if you support true out of order execution of those threads. And you don't get that large a hit if you use 8+ FP registers per fragment when you only have 2 as budget for the maximum possible number of threads. The implementation of the memory already supports open page misses so it shouldn't be that far from real GPU latencies.
sireric
25-Jan-2006, 18:48
Couple of items, in no particular order:
1) R580 texture keel over:
That's ridiculous. The texturing system on R580 should be equivalent to R520. The same pixels are being processed by both, in the same way, with the same memory subsystem and the same caching. I expect the two to be perfectly identical. However, on a given application, since the ALU load is reduced on R580, I expect the overall performance to be increased, and so effects on texture (such as enabling AF) should have a higher % effect on R580 than R520 (and they do), even though the R580 will always be faster, clock per clock than R520. All I was saying is we are moving the bottlenecks around, but nothing is worse. Your statement "Eric's answers all seem to indicate that R580's texturing capability starts off with plenty of disadvantages compared with R520 and he seems to be at a loss as to why texturing in R580 doesn't keel over and die horribly" comes out of nowhere and I certainly don't understand it. Another thought: In fact, since R580 is going to saturate the texture unit (more than R520), I expect R580 to be slightly more efficient on texturing than R520. The granularity increase should not make any difference from a texturing standpoint, since it's the same pixels in the same order as before.
2) Nominal registers per pixel/fragment:
If you look at the GP/GPU analysis, you'll see that for both R520 and R580, performance of the architecture wrt to GPR usage is optimal. There's no sweet spot for 2 GPRs -- The latency hiding is generally the same, regardless of the GPRs in use. Some really bad cases can be constructed, but those I believe are corner cases, and that generally our shader compiler should eliminate the real cases.
3) Vec/Scalar:
Our compiler data indicates that currently, the mixture, per instruction, of vector/scalar is such that roughly 2~3 independent streams are common. That would indicate that VEC3 + Scalar is going to be pegged with 2 streams. If we had a SCALAR+SCALAR+VECTOR, we could do even more work (though not 50% more work). The ratio of each of these instructions (2~3 streams / instruction) to the texture fetch instructions is above 2:1 right now in the newest, gfx intensive apps. From what ISVs tell us, that's likely to grow. That doesn't mean we triple performance on an R580, since the bottleneck shifts. But it means that the overall design is best matched to newest code bases.
4) On ISV selections:
We select ISVs for architectural investigation based on their demonstrated abilities to deliver gfx intensive applications. That's our main focus. We do not pick them based on the most popular game right now (Sims 2 or Civ4) but on those games that push the gfx envelope. Those are our guides for how the future is likely to develop. Those are also the applications we target to be the best at, with the understanding that the other apps are likely to max out the CPU instead of the GFX.
I'm already seeing a 50:50 bw distribution between ROP and Texture data in games like Doom3 or Quake4. So I doubt you need anything fancy like a deferred renderer removing overdraw to get the same or even distributions more biased to texture data in the future.
I was under the impression that HL-2 does a Z-prepass to populate Z, and I'm assuming that this technique will become more popular, particularly as XB360 effectively requires a Z-prepass.
Also (and I admit I don't really understand it), D3D10 has enhanced occlusion query capabilities, which I interpret to mean that not only will the right geometry get tessellated etc., but that mostly the right geometry will ever make it to fragment shading - reducing overdraw.
One topic I've been meaning to start-up for a while is: multiple render targets are slow - why? I get the feeling that they're gonna get an awful lot more use in the future - and I haven't wheedled anything out of DX10 that is specifically targetted towards MRT performance (except that stream out could be useful if the target's datapoints are write-once). Otherwise MRTs seem ROP-bound...
It escaped me, though, that given that you are limited by the rasterizer Z sampler throughput beyond 2xAA, and that with AA z and color data is very compressible, higher AA modes can actually reduce bandwidth per cycle rather than increase it ... But I haven't done anything on AA yet so I have an excuse for that.
It puzzles me why AA comes in 2x loops - except for the fact that AA comes in 2x steps, 2xAA, 4xAA, 6xAA etc. Otherwise, with 4xAA being a "preferred" setting you'da thunk that special-case hardware for 4xAA in one loop would have happened by now.
My guess is that the ratio will remain the same while developers keep using fillrate only passes either for stencil or shadow map generation. After all, even now without those passes ROP fillrate and bw requirements are pretty low.
Really? Even with FP16 blending?
About threads and latency hiding: my experiments show that you don't really need so many active threads (nor even ready threads) if you support true out of order execution of those threads. And you don't get that large a hit if you use 8+ FP registers per fragment when you only have 2 as budget for the maximum possible number of threads. The implementation of the memory already supports open page misses so it shouldn't be that far from real GPU latencies.
I would be interested if you can simulate R520 and R580 running D3 and get the same kind of performance difference between the two that Dave's getting (15% roughly). I can't remember if you've posted an analysis of 1:1 versus 3:1 in a non-unified architecture :cry:
Jawed
One topic I've been meaning to start-up for a while is: multiple render targets are slow - why? I get the feeling that they're gonna get an awful lot more use in the future - and I haven't wheedled anything out of DX10 that is specifically targetted towards MRT performance (except that stream out could be useful if the target's datapoints are write-once). Otherwise MRTs seem ROP-bound...
You are writing up to four (?) 'colors' per fragment. Of course they are slower than outputting a single color. I guess that ROPs only support outputting one 'color' or render target output per cycle. And in any case they would be bw limited if that wasn't the case.
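As a sketch of that guess: if a ROP can retire only one colour output per cycle, the cost of a multiple render target write scales with the number of targets. The `colours_per_cycle` parameter is hypothetical and only there to show what a wider ROP would change; it does not describe any real part.

```python
def rop_cycles_per_fragment(render_targets: int, colours_per_cycle: int = 1) -> int:
    """ROP cycles needed to write one fragment to every bound render target.

    Assumption: the ROP retires at most `colours_per_cycle` colour outputs
    per clock, so the cost is a ceiling division.
    """
    if render_targets < 1 or colours_per_cycle < 1:
        raise ValueError("both arguments must be at least 1")
    return -(-render_targets // colours_per_cycle)  # ceiling division


if __name__ == "__main__":
    for targets in (1, 2, 4):
        print(targets, "targets ->", rop_cycles_per_fragment(targets), "cycles")
```

So four targets cost four ROP cycles per fragment in this model, before any bandwidth limit is considered.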
It puzzles me why AA comes in 2x loops - except for the fact that AA comes in 2x steps, 2xAA, 4xAA, 6xAA etc. Otherwise, with 4xAA being a "preferred" setting you'da thunk that special-case hardware for 4xAA in one loop would have happened by now.
Maybe because above 2xAA per cycle you get bw bound on average with the current memory system. Or perhaps it is the length of the internal datapaths. Even inside a chip there must be a limit to the bus and cache write port widths.
Really? Even with FP16 blending?
Is FP16 blending single cycle right now? In fact even 32 bit RGB blending still seems to be two cycle. It's bandwidth limited, though, because it requires a read and a write (even if each is in a different cycle).
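The read-plus-write cost of blending is easy to put numbers on. This is a rough sketch that ignores compression and caching; the 16 pixels per clock and 650 MHz used in the example are illustrative parameters rather than measured values.

```python
RGBA8_BYTES = 4   # 32-bit colour: 4 channels x 8 bits
FP16_BYTES = 8    # FP16 colour: 4 channels x 16 bits


def blend_traffic_bytes(bytes_per_pixel: int) -> int:
    """Framebuffer traffic per blended pixel: one read plus one write."""
    return 2 * bytes_per_pixel


def blend_bandwidth_gb_s(pixels_per_clock: int, clock_mhz: int, bytes_per_pixel: int) -> float:
    """Uncompressed bandwidth (GB/s) needed to blend at the given rate."""
    # MHz * bytes / 1000 gives GB/s (10^6 / 10^9).
    return pixels_per_clock * clock_mhz * blend_traffic_bytes(bytes_per_pixel) / 1000


if __name__ == "__main__":
    for name, size in (("RGBA8", RGBA8_BYTES), ("FP16", FP16_BYTES)):
        gb_s = blend_bandwidth_gb_s(16, 650, size)
        print(f"{name}: {blend_traffic_bytes(size)} bytes/pixel, {gb_s:.1f} GB/s")
```

Under these assumptions FP16 blending needs twice the traffic of 32-bit blending at the same pixel rate, which is why it tends to be bandwidth limited whatever the ROP cycle count is.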
I would be interested if you can simulate R520 and R580 running D3 and get the same kind of performance difference between the two that Dave's getting (15% roughly). I can't remember if you've posted an analysis of 1:1 versus 3:1 in a non-unified architecture :cry:
I did and I think it had a benefit of 2-4% (no numbers at hand) at 1024 and 8xAF when adding 2x and 3x ALUs (near no improvement from 2x to 3x). But my guess is that it's just coincidence that the benefit is similar to Dave's test at the same resolution, as it should be more CPU limited at that resolution than anything else. If I ever try at 1600 (I hadn't thought that at larger resolutions the performance characteristics may change significantly just because triangles are larger and vertex, setup or batch limited zones become a smaller percentage of the whole frame rendering time; also we don't have any monitor that supports 1600, I think..., in fact they barely support 1024 so ... ;)) then maybe the simulator could surprise me and still get similar differences, but I have my doubts ...
I haven't posted any analysis of 1:1 vs 3:1 (the bit about non-unified doesn't matter for that kind of analysis and I haven't used the non-unified configuration for ages) but in a week or two I will put online an article that compares, just for fun, 3:3 against 3:1.
2) Nominal registers per pixel/fragment:
If you look at the GP/GPU analysis, you'll see that for both R520 and R580, performance of the architecture wrt to GPR usage is optimal. There's no sweet spot for 2 GPRs -- The latency hiding is generally the same, regardless of the GPRs in use. Some really bad cases can be constructed, but those I believe are corner cases, and that generally our shader compiler should eliminate the real cases.
I guess that also depends on what you count as a 'register'. I would think that people should be counting 'live' registers at any point of the shader program (lately I'm also counting interpolated attributes, but that's another discussion) but I guess people go the cheap way ... or maybe those nice tools that I never get to test are already reporting 'true' register usage.
3) Vec/Scalar:
Our compiler data indicates that currently, the mixture, per instruction, of vector/scalar is such that roughly 2~3 independent streams are common. That would indicate that VEC3 + Scalar is going to be pegged with 2 streams. If we had a SCALAR+SCALAR+VECTOR, we could do even more work (though not 50% more work). The ratio of each of these instructions (2~3 streams / instruction) to the texture fetch instructions is above 2:1 right now in the newest, gfx intensive apps. From what ISVs tell us, that's likely to grow. That doesn't mean we triple performance on an R580, since the bottleneck shifts. But it means that the overall design is best matched to newest code bases.
About vector vs scalar instructions I found something rather funny. UT2004, which is using our own library generated shaders for the fixed function, seems to use a rather impressive number of scalar ops compared with Doom3 or Quake4 (we don't have an optimizer that tries to reschedule the original ARB instructions, we just reorder to reduce dependency chains, so real GPUs may have more chances of using the scalar ALUs). I wonder what kind of shaders are we generating :P.
1) R580 texture keel over:
That's ridiculous.
I wasn't suggesting R580 should :!: But you talked about R580's texturing being "logically bottlenecked" yet never described why the bottleneck doesn't actually transpire. It was couched in terms of an ALU-bottleneck in R520 becomes a texturing bottleneck in R580 - yet R580's texturing performance seems better.
I'm just trying to unravel how it is that seemingly texture-intensive games like D3 are, even with AF, actually ALU-limited, which is what you're saying as far as I can tell. And, seemingly, that TMU utilisation is only a side-effect of removing an ALU bottleneck and isn't really dependent on the number of fragments in flight.
Does that mean that texturing is not freely-schedulable? Or is D3's texturing too dependent for it to be a useful comparison game in unravelling ALU:texturing performance?
The texturing system on R580 should be equivalent to R520. The same pixels are being processed by both, in the same way, with the same memory subsystem and the same caching. I expect the two to be perfectly identical. However, on a given application, since the ALU load is reduced on R580, I expect the overall performance to be increased, and so effects on texture (such as enabling AF) should have a higher % effect on R580 than R520 (and they do), even though the R580 will always be faster, clock per clock than R520. All I was saying is we are moving the bottlenecks around, but nothing is worse. Your statement "Eric's answers all seem to indicate that R580's texturing capability starts off with plenty of disadvantages compared with R520 and he seems to be at a loss as to why texturing in R580 doesn't keel over and die horribly" comes out of nowhere and I certainly don't understand it.
I was hoping you'd describe why texturing performs so well on R580, even though 3:1 ALU:TEX makes it seem like it shouldn't :cry:
But apparently R520's texturing in D3, for example, is sub-optimal, and R580 just helps it be where it should be - solely because R520 is ALU-limited.
2) Nominal registers per pixel/fragment:
If you look at the GP/GPU analysis, you'll see that for both R520 and R580, performance of the architecture wrt to GPR usage is optimal. There's no sweet spot for 2 GPRs -- The latency hiding is generally the same, regardless of the GPRs in use. Some really bad cases can be constructed, but those I believe are corner cases, and that generally our shader compiler should eliminate the real cases.
Is 2 FP32s per fragment the limit in order for R5xx to support 128 threads per shader unit?
Presumably the latency-hiding comes out the same because a higher GPR usage implies a higher ALU:TEX ratio?
3) Vec/Scalar:
Our compiler data indicates that currently, the mixture, per instruction, of vector/scalar is such that roughly 2~3 independent streams are common. That would indicate that VEC3 + Scalar is going to be pegged with 2 streams. If we had a SCALAR+SCALAR+VECTOR, we could do even more work (though not 50% more work). The ratio of each of these instructions (2~3 streams / instruction) to the texture fetch instructions is above 2:1 right now in the newest, gfx intensive apps. From what ISVs tell us, that's likely to grow. That doesn't mean we triple performance on an R580, since the bottleneck shifts. But it means that the overall design is best matched to newest code bases.
Thanks, that makes more sense to me.
4) On ISV selections:
We select ISVs for architectural investigation based on their demonstrated abilities to deliver gfx intensive applications. That's our main focus. We do not pick them based on the most popular game right now (Sims 2 or Civ4) but on those games that push the gfx envelope. Those are our guides for how the future is likely to develop. Those are also the applications we target to be the best at, with the understanding that the other apps are likely to max out the CPU instead of the GFX.
Sims2 is more of a graphics hog than I was expecting :lol: :
http://www.hardocp.com/article.html?art=OTUzLDEx
Jawed
I guess that also depends on what you count as a 'register'. I would think that people should be counting 'live' registers at any point of the shader program (lately I'm also counting interpolated attributes, but that's another discussion) but I guess people go the cheap way ... or maybe those nice tools that I never get to test are already reporting 'true' register usage.
I've always assumed that constants for a shader are held once per thread (or once per something "larger" than a thread, whatever you want to call that).
I'd also assumed that the interpolated attributes are effectively constants, too - but well I know I'm on pretty shaky ground there.
Jawed
sireric
25-Jan-2006, 20:52
I wasn't suggesting R580 should :!: But you talked about R580's texturing being "logically bottlenecked" yet never described why the bottleneck doesn't actually transpire. It was couched in terms of an ALU-bottleneck in R520 becomes a texturing bottleneck in R580 - yet R580's texturing performance seems better.
Well, that's correct. The bottleneck shifts from a mixture of ALU and texture to all texture on R580, in general (though not always; some shaders still have higher ALU ratios than 3). It doesn't mean that R580 is worse at texturing than R520, it's just a bigger bottleneck for it; it will always be faster, clock for clock, than R520. We did this shift to match what we are seeing in the newest games, and what is coming next.
I'm just trying to unravel how it is that seemingly texture-intensive games like D3 are, even with AF, actually ALU-limited, which is what you're saying as far as I can tell. And, seemingly, that TMU utilisation is only a side-effect of removing an ALU bottleneck and isn't really dependent on the number of fragments in flight.
Actually, from what I remember, D3 isn't that ALU bound. It's actually reasonably balanced between ALUs and texture, and R520 does very well. I think that R580 is a little stronger, but not that much. The actual D3 core shader isn't very long, and doesn't represent what the latest games are doing, in terms of shading. Nb: Well, since Q4 and probably a bunch of other games are using this same shader, it does represent some of what the newest games are doing; my bad. Though I stand by the fact that the trend for alu:tex is increasing.
Does that mean that texturing is not freely-schedulable? Or is D3's texturing too dependent for it to be a useful comparison game in unravelling ALU:texturing performance?
I'm not sure I understand that question. Texturing is freely schedulable, assuming you've got no dependencies and you have the appropriate resources available.
I was hoping you'd describe why texturing performs so well on R580, even though 3:1 ALU:TEX makes it seem like it shouldn't :cry:
Again, why would texturing be any worse on R580 than R520? It's the same pixels getting the same textures with a similar texture architecture. One should expect, off hand, the performance to be identical. The advantage of the R580, that I mentioned above, is that it's going to be issuing even more texture requests, thus saturating the texture & MC units, and so achieving higher efficiency (i.e. no bubbles). As well, it has a higher MCLK, so it has more BW and can get more texels out, per unit time.
But apparently R520's texturing in D3, for example, is sub-optimal, and R580 just helps it be where it should be - solely because R520 is ALU-limited.
Again, I think D3 isn't very ALU limited. But R580 should get a boost (I need to check some benchmarks to confirm).
Is 2 FP32s per fragment the limit in order for R5xx to support 128 threads per shader unit?
Presumably the latency-hiding comes out the same because a higher GPR usage implies a higher ALU:TEX ratio?
The number of threads is not really the important thing. It's the latency of each thread times (*) the number of active threads. This gives you the latency that can be hidden. Consequently, if a thread has only a few ALU instructions, then you need more threads. If it has many, then it needs fewer threads. As you increase the number of GPRs, you find that you also increase the number of instructions, and so make each thread higher latency and so you reduce the need to have more threads.
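A rough sketch of the relationship described above, using made-up numbers: while one thread sleeps on a texture fetch, every other thread contributes its own ALU work, so the thread count needed to cover the fetch drops as the ALU instructions per fetch go up. The function name, the 200-cycle fetch latency and the one-cycle-per-instruction cost are illustrative assumptions, not R5xx figures.

```python
import math

def threads_needed(fetch_latency_cycles, alu_instrs_per_fetch, cycles_per_alu_instr=1):
    """Toy estimate of threads needed to hide one texture fetch.

    Assumption: each other thread supplies
    alu_instrs_per_fetch * cycles_per_alu_instr cycles of ALU work
    while the fetching thread is asleep.
    """
    work_per_thread = alu_instrs_per_fetch * cycles_per_alu_instr
    # +1 accounts for the thread that is waiting on the fetch itself.
    return math.ceil(fetch_latency_cycles / work_per_thread) + 1

# Hypothetical 200-cycle fetch latency; more ALU work per fetch -> fewer threads.
for alu in (1, 3, 6, 12):
    print(alu, threads_needed(200, alu))
```

In this toy model a shader that uses more GPRs, and therefore tends to have more instructions per fetch, gets by with fewer threads in flight, which is the trade-off described in the post.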
Thanks, that makes more sense to me.
Sims2 is more of a graphics hog than I was expecting :lol: :
http://www.hardocp.com/article.html?art=OTUzLDEx
Jawed
Regretfully, I think that the coding was somewhat suboptimal and that's why it's more demanding than it really should be. But that's a good point; it's probably on the list of internal performance regression apps already.
Lemme try, with numbers plucked from thin air, for the concept.
R520 has a "speed limiter" on ALU at 70fps.
R520 has a "speed limiter" on Tex at 90fps.
R580 has a "speed limiter" on ALU at 100fps.
R580 has a "speed limiter" on Tex at 90fps.
Now, which one is faster at Tex? Answer: It's a tie.
Now, which one is Tex limited? Answer: R580, because it'd go faster if it had more Tex where R520 wouldn't.
Now, which one is faster (assuming no other limiters)? Answer: R580, at 90fps.
So, really what I got out of it is R520 has Tex in "reserve" it doesn't get to use, and this is what obscures (when trying to prove it with benchmarks anyway) the fact that R580 is actually more tex-limited, albeit at a higher overall fps than R520.
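The "speed limiter" picture above amounts to taking the minimum over the per-unit limits. A minimal sketch, reusing the thin-air numbers from the post (the function and dictionary names are just for illustration):

```python
def frame_rate(limits):
    """Return (fps, bottleneck): the frame rate is capped by the slowest limiter."""
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck

# Numbers plucked from thin air, as in the post.
r520 = {"alu": 70, "tex": 90}
r580 = {"alu": 100, "tex": 90}

print(frame_rate(r520))  # (70, 'alu')
print(frame_rate(r580))  # (90, 'tex')
```

Both chips have the same texture limit, yet only the R580 is texture-limited, and it is still the faster of the two.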
Do I get a cookie? :lol:
Or do I get an answer to "what does GDDR4 do, if anything, to all this?"
Actually, from what I remember, D3 isn't that ALU bound. It's actually reasonably balanced between ALUs and texture, and R520 does very well. I think that R580 is a little stronger, but not that much. The actual D3 core shader isn't very long, and doesn't represent what the latest games are doing, in terms of shading. Nb: Well, since Q4 and probably a bunch of other games are using this same shader, it does represent some of what the newest games are doing; my bad. Though I stand by the fact that the trend for alu:tex is increasing.
From my tests I would say that Quake4 seems to have a bit more ALU than Doom3 (that, and there's way less ROP workload and way more vertex workload).
The number of threads is not really the important thing. It's the latency of each thread times (*) the number of active threads. This gives you the latency that can be hidden. Consequently, if a thread has only a few ALU instructions, then you need more threads. If it has many, then it needs fewer threads. As you increase the number of GPRs, you find that you also increase the number of instructions, and so make each thread higher latency and so you reduce the need to have more threads.
So that means that you don't switch threads with every instruction/instruction group/fetch cycle/whatever but only when there are large latencies or dependencies (aka texture fetch) involved? I don't expect an answer though :lol:
sireric
25-Jan-2006, 21:25
So that means that you don't switch threads with every instruction/instruction group/fetch cycle/whatever but only when there are large latencies or dependencies (aka texture fetch) involved? I don't expect an answer though :lol:
It's complicated. We try to keep all our units doing something every cycle. Certainly if a thread requires texture data, it gets put to sleep. That's all I'll say ;-)
Thanks, Eric - I think Dave's performance results for D3 and Q4 (15-20% faster) seemed distractingly out of proportion with the games' supposed bottlenecks and there's been frustratingly little analysis of what's really going on.
And, ahem, it actually paints a rosier future for 3:1 GPUs in the mid-range - if D3 engine can get 15-20% faster when fragment-rate limited even though it sorta "shouldn't" then that's nice.
---
There was a time when I was gung-ho for the freely-schedulable architecture (at least I bored the console folk too) because not only would dynamic branching lose its shackles but texturing would be able to shoot ahead in a most dramatic fashion, with the TMUs spending less time idle (when they could be doing useful work for another thread) :grin: R520 put the dampeners on that, because in all the results it seemed impossible to discern any performance gains solely from the shader units/scheduler - clouded by cache improvements, ROP improvements, memory architecture improvements etc.
So, anyway, it's great to find out that texturing performance is part of the equation - it has benefitted as I was originally expecting way back when.
Now I can go dance in the street. I'm sure it seems like a case of "of course" to you guys, but it's been up and down out here.
Jawed
Entropy
26-Jan-2006, 00:04
4) On ISV selections:
We select ISVs for architectural investigation based on their demonstrated abilities to deliver gfx intensive applications. That's our main focus. We do not pick them based on the most popular game right now (Sims2 or Civ4) but on those games that push the gfx envelope. Those are our guides for how the future is likely to develop. Those are also the applications we target to be the best at, with the understanding that the other apps are likely to max out the CPU instead of the GFX.
Fair enough.
However, here you are back to referring to how you feel the future is likely to develop, but "The ATI white paper on X1900 / R580 makes the statement that a part needs to be designed to be best balanced towards the applications that it will be running when its available."
Fact is, even using the most graphically demanding games available the gain over the X1800 product ranges from a few percent up to roughly 30% in the latest version of Splinter Cell and FEAR. Your very impressive factor of three increase in pixel shading capabilities, plus other tweaks, doesn't have all that much to show for itself. Now if an investment in transistors doesn't pay off all that well, that may or may not mean a whole lot to a part such as the X1900, but it does call into question the trickle down approach of GPU design, as X1600 customers might not buy it to play potential future pixel shader heavy games (at which time it in all likelihood will be insufficient anyway) and might be better served by having gates/cost balanced differently. I've got roughly a hundred paid games on my shelves, 3 or 4 of which I have time to play at all, none of which would benefit from the enhancements offered by the X1900.
Now this may be as you intended it. That would be OK, as there are other valid reasons to produce these cards. You might do wiser to exclude those phrases in your white papers though.
It could also be that you overestimated the impact of pixel shading capabilities at this time. This wouldn't be that strange; as Dave Baumann wrote, it has to be a bit of a crapshoot, and you are in a situation where you know hardware directions and games development trends long before they become product at all, much less dominate the market. The insider perspective is distinctly out of sync with that of the consumer.
So which is it?
BTW, it is appreciated that you share information about policies and technicalities.
From a technical perspective I mostly wonder about just how bandwidth constrained the X1900 and the X1600 are. The numbers make suggestions, but the horse's mouth is always preferable.
Megadrive1988
26-Jan-2006, 00:43
excellent interview - thanks for the effort.
it'll be interesting to see what the design decisions are for the R600, which should be 8-10 months away, and what improvements & additions (and subtractions) are made from the Xenos architecture.
will bandwidth increase dramatically enough (GDDR4, real 512-bit bus) to justify more texture units and more ROPs. maybe not.
sireric
26-Jan-2006, 02:36
Fair enough.
However, here you are back to referring to how you feel the future is likely to develop, but "The ATI white paper on X1900 / R580 makes the statement that a part needs to be designed to be best balanced towards the applications that it will be running when its available."
Fact is, even using the most graphically demanding games available the gain over the X1800 product ranges from a few percent up to roughly 30% in the latest version of Splinter Cell and FEAR. Your very impressive factor of three increase in pixel shading capabilities, plus other tweaks, doesn't have all that much to show for itself. Now if an investment in transistors doesn't pay off all that well, that may or may not mean a whole lot to a part such as the X1900, but it does call into question the trickle down approach of GPU design, as X1600 customers might not buy it to play potential future pixel shader heavy games (at which time it in all likelihood will be insufficient anyway) and might be better served by having gates/cost balanced differently. I've got roughly a hundred paid games on my shelves, 3 or 4 of which I have time to play at all, none of which would benefit from the enhancements offered by the X1900.
Now this may be as you intended it. That would be OK, as there are other valid reasons to produce these cards. You might do wiser to exclude those phrases in your white papers though.
It could also be that you overestimated the impact of pixel shading capabilities at this time. This wouldn't be that strange; as Dave Baumann wrote, it has to be a bit of a crapshoot, and you are in a situation where you know hardware directions and games development trends long before they become product at all, much less dominate the market. The insider perspective is distinctly out of sync with that of the consumer.
So which is it?
BTW, it is appreciated that you share information about policies and technicalities.
From a technical perspective I mostly wonder about just how bandwidth constrained the X1900 and the X1600 are. The numbers make suggestions, but the horse's mouth is always preferable.
Off hand, just looking at Dave's review of the X1900, I get these deltas wrt X1800:
FC plain: 6, 7%
FC HDR: 33, 25%
SC: 36, 30, 28%
Fear: 28, 28, 21%
D3: 15, 14, 13%
Q4: 17, 20, 18%
(NB: the shadow app shows ~2x and some of that is fetch removal. But without 3xALU, that fetch4 would have done nothing, since it would have been ALU bound, so those are real gains for this design)
Take out 5% for clocks or so, I think it's safe to say we see around 10~25% improvement on current popular games at medium resolutions (once Dave gets his 30" Dell going, we should see higher resolutions and more distancing between the results). Given a roughly 20% area increase, that seems like a very fair trade off for me. Older games will still be faster too, but are much more likely to be CPU bound anyway.
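As a rough check of that arithmetic, the sketch below averages the per-game deltas listed above and takes out 5 points for clocks. Subtracting percentage points is a crude stand-in for dividing out the clock ratio, and the dictionary and function names are only for illustration.

```python
# Percentage gains of X1900 over X1800, as listed in the post.
deltas = {
    "FC plain": [6, 7],
    "FC HDR": [33, 25],
    "SC": [36, 30, 28],
    "Fear": [28, 28, 21],
    "D3": [15, 14, 13],
    "Q4": [17, 20, 18],
}

CLOCK_POINTS = 5  # "take out 5% for clocks or so"

def clock_adjusted(gains, clock_points=CLOCK_POINTS):
    """Average gain for one game minus the part attributed to clocks (crude)."""
    return sum(gains) / len(gains) - clock_points

adjusted = {game: round(clock_adjusted(g), 1) for game, g in deltas.items()}
for game, gain in adjusted.items():
    print(f"{game}: ~{gain}%")
```

Apart from plain Far Cry, the adjusted averages fall roughly in the 10~25% band quoted in the post.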
For the newest games with the highest shader content, we see the biggest gains, and that trend is going to continue for newer titles.
My view for a new card is that it should give improvements in older games that aren't cpu bound, but, more importantly, should rule the latest games out and promise something amazing for games soon to be released. I think that the X1900 boards do all of those things.
We work with ISVs to try to predict the future and within a 6 month window, I think we do a pretty good job (again, focusing on gfx intensive games). We've removed a lot of the bottlenecks from the ALU part of the pipe, and moved them to memory bandwidth and texture side. That's what we set out to do (partially due to the fact that we have external control on memory bandwidth) and we got good results out of it.
At the end, in an absolute way, when you tax this card at its most, with the highest resolutions and AA and all the highest quality game settings, then you really see it shine and it delivers on its promises without compromise.
I believe it to be an excellent product and we stand behind our architectural decisions.
TurnDragoZeroV2G
26-Jan-2006, 02:58
Enjoyed reading the interview.
It's a shame that the interview didn't pursue the register file size. I was pretty shocked to discover from the Hexus R580 review that the nominal capacity is 2 FP32s per fragment - (compared with 12 for Xenos as documented by M$ dev stuff). Even though Xenos and R580 actually seemingly have the same register file size (768KB).
I was under the impression that Xenos' total register file was [64 threads*64 fragments*4 registers] + [32 threads*64 vertex*4 registers]. In fact, if you search for that total of registers, you'll find this (http://www.beyond3d.com/forum/showthread.php?t=23775&highlight=24%2C576).
/only has alphavsfinal and gpu overview docs, forgiveness for ignorance begged
dizietsma
26-Jan-2006, 07:38
In one question Dave asked why even some heavily pixel shader bound games do not show the x3 increase in rate, and the interviewee said that some games show a doubling which I guess is saying "yes, I agree they don't show x3"
What's the explanation for the 2/3 possible theoretical maximum being achieved rather than 3/3? Presumably something is pegging it back. Is it just non-optimal coding or is there a physical limitation somewhere in the architecture?
overclocked
26-Jan-2006, 08:31
In one question Dave asked why even some heavily pixel shader bound games do not show the x3 increase in rate, and the interviewee said that some games show a doubling which I guess is saying "yes, I agree they don't show x3"
What's the explanation for the 2/3 possible theoretical maximum being achieved rather than 3/3? Presumably something is pegging it back. Is it just non-optimal coding or is there a physical limitation somewhere in the architecture?
It's still pretty good; for example, if the X1900XT gets double the frames in UE3 compared to the X1800XT, they got double the performance for a ~20% transistor increase. That's how I understood that they got the most performance/price :-) As you said, there are many things that can hamper performance; some of the smart-ass people could probably explain in detail.
I was under the impression that Xenos' total register file was [64 threads*64 fragments*4 registers] + [32 threads*64 vertex*4 registers]. In fact, if you search for that total of registers, you'll find this (http://www.beyond3d.com/forum/showthread.php?t=23775&highlight=24%2C576).
/only has alphavsfinal and gpu overview docs, forgiveness for ignorance begged
One of the PPTs I have is more explicit.
Xenos segments its workload into a maximum of 32 vertex threads (each of 64 vertices) and 64 fragment threads (64 fragments per). The document refers explicitly to 12 registers being the limit, before the number of threads is reduced:
Use fewer r-registers
Number of threads is limited by number of r-registers.
More than 12 r-registers in pixel-shader makes clause boundaries slightly slower.
Using more than 16 is slow. More than 32 makes everything really slow (up to 10x). Avoid it if you can.
Fewer threads also means serialized fetch delays can’t be hidden.
Of course I could be misinterpreting that. I think another document couches the 12-register limit in slightly different terms. Not sure which, now :oops:
Presumably the 12-register limit applies to all threads, vertex or fragment - but with a ceiling of 64 fragment threads, I was just comparing the portion of the register file that's relevant to fragments in both R580 and Xenos.
Jawed
In one question Dave asked why even some heavily pixel shader bound games do not show the x3 increase in rate, and the interviewee said that some games show a doubling which I guess is saying "yes, I agree they don't show x3"
What's the explanation for the 2/3 possible theoretical maximum being achieved rather than 3/3? Presumably something is pegging it back. Is it just non-optimal coding or is there a physical limitation somewhere in the architecture?
A simple example, I guess, might be that some of a shader's ALU instructions are dependent on texture results.
So if a shader has 3 ALU instructions and 1 texture instruction, you might expect to get the full 3:1 speed-up. But if you can only execute one ALU instruction "in parallel" with the texture instruction, because ALU instructions 2 and 3 need the texture result, then the overall average speed-up might come in at around 2x.
I say "in parallel" because the latency-hiding mechanism of thread-switching means that whilst thread 37's ALU instruction-1 is executing, thread 25's TMU instruction is executing. Texture caching means that not all texture operations will take ages (i.e. a lot can complete in 1 cycle). But the cache is only so big, texture-fetch bandwidth is only so much, etc.
Longer shaders can be less texture-dependent - but on the other hand, if a shader performs dependent texturing (i.e. has to calculate or look-up texture coordinates) and immediately after the texture fetch performs other ALU operations on those new texture results, then you're stuck with the worst case and that portion of the shader will be a bottleneck.
Jawed
I wonder if R520 is a victim of ATI's D3 shader replacement. This replaces a texture instruction with some ALU instructions (dunno how many).
Since R520 is slightly ALU-bound in D3, it would be interesting to see if the original shader code (with the extra texture lookup(s) ) actually runs a little faster.
Jawed
sireric
26-Jan-2006, 17:22
I wonder if R520 is a victim of ATI's D3 shader replacement. This replaces a texture instruction with some ALU instructions (dunno how many).
Since R520 is slightly ALU-bound in D3, it would be interesting to see if the original shader code (with the extra texture lookup(s) ) actually runs a little faster.
Jawed
Just turn off Cat AI.
OpenGL guy
26-Jan-2006, 18:16
In one question Dave asked why even some heavily pixel shader bound games do not show the x3 increase in rate, and the interviewee said that
some games show a doubling which I guess is saying "yes, I agree they don't show x3"
What's the explanation for the 2/3 possible theoretical maximum being achieved rather than 3/3? Presumably something is pegging it back. Is it just non-optimal coding or is there a physical limitation somewhere in the architecture?
It's very rare that you will achieve the full theoretical improvement from a change in any piece of hardware. Take F.E.A.R. for example. It uses many shaders, some of which are quite good from an ALU:TEX ratio perspective. However, on R520 at least, the stencil shadow rendering accounts for about half the time spent rendering the scene. (This is easily measured by simply disabling shadows in the game and noticing that your framerate doubles.) Now you may say to yourself "Well, then ATI should have gone with double Z!", however, we already have double Z when AA is enabled. This accounts for much of the fine AA performance the HW achieves in F.E.A.R. when AA is enabled.
Now if you assume that stencil shadows are half the rendering time and shader processing the other half, you'll see that doubling your Z/stencil performance doesn't help as much as tripling your shader power.
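That comparison is the usual Amdahl-style calculation. A minimal sketch, assuming the 50/50 split between stencil shadows and shader processing stated above and ignoring every other limiter (the function name is just for illustration):

```python
def overall_speedup(fraction, factor):
    """Overall speed-up when only `fraction` of the frame time gets `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

double_z = overall_speedup(0.5, 2)       # Z/stencil half runs 2x faster
triple_shader = overall_speedup(0.5, 3)  # shader half runs 3x faster

print(f"double Z/stencil: {double_z:.2f}x")
print(f"triple shader:    {triple_shader:.2f}x")
```

With half the frame time in each, tripling shader power gives about 1.5x overall against about 1.33x for doubling Z/stencil, and neither gets anywhere near its theoretical 2x or 3x.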
Just turn off Cat AI.
Then you turn the trilinear optimizations off too. Renaming the exe is really the only way to see the impact of app-specific optimizations.
Is D3's minimum texture filtering trilinear? Too long now since I played with the demo...
D3 is showing 15% gains on X1900XT over X1800XT under no-AA/no-AF.
Jawed
One of the PPTs I have is more explicit.
Do you have a link? Thanks.
Do you have a link? Thanks.
Unfortunately, no - I haven't seen it in the wild.
Jawed
TurnDragoZeroV2G
28-Jan-2006, 00:07
Heh :grin:
Unfortunately, no - I haven't seen it in the wild.
Jawed
I'm sure we can find someone to host it! :razz:
Is D3's minimum texture filtering trilinear? Too long now since I played with the demo...
D3 is showing 15% gains on X1900XT over X1800XT under no-AA/no-AF.
Jawed
high quality setting uses 8xAF
Let me ask a really basic question then: in the first table of results here:
http://www.beyond3d.com//reviews/ati/r580/index.php?p=17
(weird that double-slash in the middle there!)
is D3 using any trilinear filtering? Would the presence or absence of trilinear with Cat AI on or off be felt as a performance difference?
Or is AI on/off a good test of just the D3-shader replacement, and no more?
Jawed
Let me ask a really basic question then: in the first table of results here:
http://www.beyond3d.com//reviews/ati/r580/index.php?p=17
(weird that double-slash in the middle there!)
is D3 using any trilinear filtering? Would the presence or absence of trilinear with Cat AI on or off be felt as a performance difference?
Or is AI on/off a good test of just the D3-shader replacement, and no more?
Jawed
First of all while d3 uses 8xAF in high quality setting i assume dave disabled it manually in the config file to achieve his noAA/noAF scores , thus only trilinear is used in this case.
If only trilinear is used with cat ai low i doubt you would see a performance difference to ai off as trilinear optimisation is so light.
Mintmaster
30-Jan-2006, 04:22
However, on R520 at least, the stencil shadow rendering accounts for about half the time spent rendering the scene. (This is easily measured by simply disabling shadows in the game and noticing that your framerate doubles.)
Ah, I figured as much. This would explain why X1600 doubles up on the X1300, but I'm surprised that the X1600XT does so well against NVidia's parts. I guess the double Z helps a lot. Does the X1600 have quadruple Z when AA is enabled?
One question, OpenGL guy: Does R5xx have a min-max implementation of hierarchical Z? That is, does changing the z-test mid-scene have any consequences for performance anymore? I always guessed that this is why R520 jumped over R420 by so much in this game, but wasn't sure.
Mintmaster
30-Jan-2006, 04:25
Or is AI on/off a good test of just the D3-shader replacement, and no more?
Jawed
Try just looking at a flat wall where the textures are magnified. Then AF and filtering tricks won't affect your data.
sireric wrote :
Again, why would texturing be any worse on R580 than R520? It's the same pixels getting the same textures with a similar texture architecture. One should expect, off hand, the performance to be identical. The advantage of the R580, that I mentioned above, is that it's going to be issuing even more texture requests, thus saturating the texture & MC units, and so achieving higher efficiency (i.e. no bubbles). As well, it has a higher MCLK, so it has more BW and can get more texels out, per unit time.
On the same exact logic one could say that adding more TMUs would saturate the MC , and so make the MC more efficient with less bubbles .
Why doesn't that logic apply in this case? Is the MC so efficient that adding more TMUs won't change the situation?
sireric
01-Feb-2006, 01:00
sireric wrote :
On the same exact logic one could say that adding more TMUs would saturate the MC , and so make the MC more efficient with less bubbles .
Why doesn't that logic apply in this case? Is the MC so efficient that adding more TMUs won't change the situation?
Very true. It would be more efficient and increase performance. The issue is cost vs. performance. If it improved performance 5~10%, but cost 60M transistors -- How does that compare to 30% improvement in speed for the same cost when increasing ALUs? Now, I'm not saying the TMU cost is the same as the ALU (I'd have to double check, but I think 1 more TMU would have been similar), but when we looked at what we could do with our budget, we decided that the best bang for the buck was to add more ALUs. That gives bigger % gains, and is much more forward looking. Adding more TMUs would simply quickly die, since there's not enough BW to go with it (even if it got more efficient); as well, apps are simply writing shaders with more ALUs as time goes on.
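The cost-vs-performance comparison can be put in numbers with a tiny sketch. It takes the figures in the post at face value (5~10% for more TMUs, 30% for more ALUs) and assumes, as the post only tentatively suggests, that both options cost the same 60M transistors.

```python
def gain_per_mtransistor(gain_percent, cost_mtransistors):
    """Percentage points of performance gained per million transistors spent."""
    return gain_percent / cost_mtransistors

COST_M = 60  # assumed equal cost for both options, per the post's rough estimate

tmu_low = gain_per_mtransistor(5, COST_M)
tmu_high = gain_per_mtransistor(10, COST_M)
alu = gain_per_mtransistor(30, COST_M)

print(f"more TMUs: {tmu_low:.3f} to {tmu_high:.3f} points per M transistors")
print(f"more ALUs: {alu:.3f} points per M transistors")
```

Under those assumptions the ALU option returns three to six times as much per transistor, which is the "bang for the buck" argument in the post.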
At the end, I would love more BW, more TMUs, but I like 3:1 ratios much more from an ALU:Texture standpoint.
Love_In_Rio
01-Feb-2006, 10:12
Sireric, so you think in the future bandwidth will be one of the bottlenecks to take care of. How do you see things evolving to face this problem? Will increasing bus width and memory speed always be enough? Or could some TBDR-like solution be applied one day?
sireric
07-Feb-2006, 23:14
Sireric, so you think in the future bandwidth will be one of the bottlenecks to take care of. How do you see things evolving to face this problem? Will increasing bus width and memory speed always be enough? Or could some TBDR-like solution be applied one day?
BW has been the anchor of 3D architecture development for a long time, and I don't see that changing anytime soon. The rate of BW increase has been somewhat linear, while the increase in firepower has not been quite, since technology has improved fast, and chips have grown. I could see the two stabilizing at some point and both following the technology curve, but that's pure speculation.
As for increasing BW, well, you can do many different things. The "easiest" has been to go with the new technology (DDR2, DDR3, DDR4) as that becomes available. A rough estimate of doubling bw per pin is reasonable, per tech upgrade. As well, the change to 256b a few years ago was a good thing that allowed a more balanced situation; increasing that can be done as well, though the cost there is more complex. The final way is to increase compression. That's a very good thing. There are many algorithms on texture, and many ways for rasters as well; the future is very open there.
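The first two levers (per-pin data rate and bus width) multiply together. A back-of-the-envelope sketch, where the 1.5 Gbit/s per pin starting point is a hypothetical figure and not a spec for any particular board:

```python
def peak_bandwidth_gb_s(bus_width_bits, gbit_per_pin):
    """Theoretical peak memory bandwidth in GB/s: pins times per-pin rate, over 8 bits per byte."""
    return bus_width_bits * gbit_per_pin / 8

base = peak_bandwidth_gb_s(256, 1.5)        # hypothetical 256b bus at 1.5 Gbit/s per pin
faster_pins = peak_bandwidth_gb_s(256, 3.0) # per-pin rate doubled by a tech upgrade
wider_bus = peak_bandwidth_gb_s(512, 1.5)   # bus width doubled instead

print(base, faster_pins, wider_bus)
```

Either change doubles the theoretical peak. Compression, the third lever, does not raise this figure; it reduces how many bytes need to cross the bus in the first place.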
As for chunk architectures, they do have their costs as well, but the sheer raster cost can be reduced (texture cost doesn't really go down). At the cost of potentially large on-chip storage or even BW increase for binning off chip. Nothing comes for free.
As well, the change to 256b a few years ago was a good thing that allowed a more balanced situation; increasing that can be done as well, though the cost there is more complex. The final way is to increase compression. That's a very good thing. There are many algorithms on texture, and many ways for rasters as well; the future is very open there.
:shock:
Every other "competent source" poo-poos this very heavily!
Is 384 a useful possibility, do you think?
sireric
08-Feb-2006, 04:26
:shock:
Every other "competent source" poo-poos this very heavily!
Is 384 a useful possibility, do you think?
Well, more can be a good thing, but it does cost. If it costs more than it's worth, then it's not worth it :-)
In a classic renderer, it's generally a good idea to link heavy bandwidth clients to dedicated memory channels. This prevents heavy clients from stepping on each other's feet. If you don't, then you need a full crossbar to route data/requests from clients to their destination channel. Consequently, though not always, it's good to link the number of channels and the heavy hitting clients, such as a ROP or a Z or even a texture unit (though that's not quite so easy, since texture is often a gather operation, but there are ways to minimize this). So, having an odd number of channels, say 5x64b channels (i.e. 384b) would suggest something aligned to that, say 20 pixels at the ROP level, and possibly some multiple of 5 for the Z. Such a beast could be constructed (a case for 3 has been made, with some of our X800 pro falling into that category), though pure powers of two are often favored, for sheer simplicity. On the other hand, you could simply put a larger MC to do the routing. All things are possible :-)
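The channel arithmetic can be sketched as follows. The 4 ROP pixels per channel is inferred from the "20 pixels for 5 channels" example in the post and is otherwise an assumption. Note that with 64b channels, five channels come to 320b, while a 384b bus would correspond to six.

```python
def bus_layout(num_channels, channel_bits=64, rop_pixels_per_channel=4):
    """Total bus width and a ROP width aligned to the number of memory channels."""
    return {
        "channels": num_channels,
        "bus_bits": num_channels * channel_bits,
        "rop_pixels": num_channels * rop_pixels_per_channel,
    }

for n in (3, 4, 5, 6):
    print(bus_layout(n))
```

Any channel count works in this arithmetic; powers of two just keep the client counts on both sides simpler, as the post says.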
Okay, that's fair. I'm not going to run off shouting to the four corners of our world that Sireric has promised >256b bus. :lol:
I do find it newsworthy however that you aren't ruling it out, as I really have had the impression that "informed opinion" considered it highly unlikely.
Or are you just trying to avoid a "David Kirk moment" for folks to point at later, no matter how unlikely? :razz:
sireric
08-Feb-2006, 18:26
Okay, that's fair. I'm not going to run off shouting to the four corners of our world that Sireric has promised >256b bus. :lol:
I do find it newsworthy however that you aren't ruling it out, as I really have had the impression that "informed opinion" considered it highly unlikely.
I doubt anyone would rule it out completely, but it is costly.
Or are you just trying to avoid a "David Kirk moment" for folks to point at later, no matter how unlikely? :razz:
Well, I know I've said dumb things and will continue to do that in the future. That's life.
Well, more can be a good thing, but it does cost. If it costs more than it's worth, then it's not worth it :-)
In a classic renderer, it's generally a good idea to link heavy bandwidth clients to dedicated memory channels. This prevents heavy clients from stepping on each other's feet. If you don't, then you need a full crossbar to route data/requests from clients to their destination channel. Consequently, though not always, it's good to link the number of channels and the heavy hitting clients, such as a ROP or a Z or even a texture unit (though that's not quite so easy, since texture is often a gather operation, but there are ways to minimize this). So, having an odd number of channels, say 5x64b channels (i.e. 384b) would suggest something aligned to that, say 20 pixels at the ROP level, and possibly some multiple of 5 for the Z. Such a beast could be constructed (a case for 3 has been made, with some of our X800 pro falling into that category), though pure powers of two are often favored, for sheer simplicity. On the other hand, you could simply put a larger MC to do the routing. All things are possible :-)
In the case of a non power of two configuration you could specialize in subgroups of power of two buses. Let's say 4 for texture/rop and 1 for vertex data (in the case vertex data ever required so much bw and memory).
In ATI's quad core based architecture with 4/8 channels, the most obvious approach seems to be to distribute the channels per quad core for the ROPs and 'hope' that texture accesses to the 'wrong' channels don't hurt too much. At least that is what I'm currently implementing in the simulator (vertex data ends up accessing only channel 0 most of the time, as the OpenGL library/driver is currently aligning data buffers at 4 KB).
If you specialize the usage of the memory channels too much you may end up hurting your maximum fillrate (even if it's only for blending), as you can't access the whole GPU bw. But in the end, other than z/stencil passes (which are compressed and require way less bw), you don't really need peak fillrate. Now that I remember, I saw some strange numbers when testing fillrates with our 9600 and 9800 which may indicate that only 1/2? and 3/4? of the peak bw was dedicated to ROP. If I had more time I would try to test that.
And our simulator is currently limited to simulating memory at the GPU frequency, so it's a bit more bw starved than modern high end GPUs.
sireric
09-Feb-2006, 00:34
In the case of a non power of two configuration you could specialize in subgroups of power of two buses. Let's say 4 for texture/rop and 1 for vertex data (in the case vertex data ever required so much bw and memory).
In ATI's quad core based architecture with 4/8 channels, the most obvious approach seems to be to distribute the channels per quad core for the ROPs and 'hope' that texture accesses to the 'wrong' channels don't hurt too much. At least that is what I'm currently implementing in the simulator (vertex data ends up accessing only channel 0 most of the time, as the OpenGL library/driver is currently aligning data buffers at 4 KB).
If you specialize the usage of the memory channels too much you may end up hurting your maximum fillrate (even if it's only for blending), as you can't access the whole GPU bw. But in the end, other than z/stencil passes (which are compressed and require way less bw), you don't really need peak fillrate. Now that I remember, I saw some strange numbers when testing fillrates with our 9600 and 9800 which may indicate that only 1/2? and 3/4? of the peak bw was dedicated to ROP. If I had more time I would try to test that.
And our simulator is currently limited to simulating memory at the GPU frequency, so it's a bit more bw starved than modern high end GPUs.
No, we haven't dedicated channels to any one client in the HW, though SW can remap buffers and such to specific channels if they so desire. We usually design to have more BW available than is consumed by a single client; if checking on ROP BW, you need to see if you are hitting the theoretical pipe limit, the BW limit or some other case that prevents peak from being achieved. You can't easily tell that from outside (though we can :-)).
But in general, you want to put heavy hitters such as Z & ROPs into dedicated channels that don't interfere, and where you can maximize coherence. Then distribute gather clients such as texture and vertex fetch over the whole channel space. In fact, given the ROP/Z surfaces are usually fixed, but the gather surfaces are huge and variable, it really makes sense to distribute them. Having said that, you can imagine a case where a texture can pound on a single channel, reducing the overall performance drastically, since it will only receive 1/n of the total BW, and the ROP/Z using that channel will get derailed. Proper texture alignment and distribution is essential.
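The "1/n of the total BW" case can be sketched with a toy address-to-channel mapping. This assumes simple round-robin striping of fixed-size blocks across channels; the 4 channels and 256-byte stripe are illustrative values only, not the real memory controller's mapping.

```python
from collections import Counter

# Toy model: memory is striped across channels in fixed-size blocks.
# N_CHANNELS and STRIPE_BYTES are hypothetical values for illustration.
N_CHANNELS = 4
STRIPE_BYTES = 256


def channel_of(addr: int, n_channels: int, stripe_bytes: int) -> int:
    """Channel that owns a byte address under round-robin striping."""
    return (addr // stripe_bytes) % n_channels


def channel_histogram(start: int, stride: int, count: int,
                      n_channels: int, stripe_bytes: int) -> list:
    """Count how many of `count` strided accesses land on each channel."""
    hits = Counter(
        channel_of(start + i * stride, n_channels, stripe_bytes)
        for i in range(count)
    )
    return [hits.get(c, 0) for c in range(n_channels)]


# Stride equal to one stripe: accesses rotate evenly over all channels.
well_spread = channel_histogram(0, STRIPE_BYTES, 64, N_CHANNELS, STRIPE_BYTES)

# Stride equal to the full channel pitch: every access hits the same channel,
# so this client sees only 1/n of the total bandwidth.
aliased = channel_histogram(0, N_CHANNELS * STRIPE_BYTES, 64,
                            N_CHANNELS, STRIPE_BYTES)

if __name__ == "__main__":
    print("stride = stripe:       ", well_spread)
    print("stride = channel pitch:", aliased)
```

In this toy mapping the second pattern sends all 64 accesses to channel 0, which is the situation where the ROP/Z traffic sharing that channel would also suffer.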
Mintmaster
10-Feb-2006, 08:35
Having said that, you can imagine a case where a texture can pound on a single channel, reducing the overall performance drastically, since it will only receive 1/n of the total BW, and the ROP/Z using that channel will get derailed. Proper texture alignment and distribution is essential.
Does that happen much? I would expect that some fine-grained striping would take care of that, and with several quads working simultaneously you'd have to be pretty unlucky to have textures pounding on one channel.
Oh, one more question along the line of geo's comments. How many more pins did you need to go from 4x64-bit to 8x32-bit? Is pin count (by that I mean connections to the silicon) the main problem in moving to a wider bus?
sireric
10-Feb-2006, 18:18
Does that happen much? I would expect that some fine-grained striping would take care of that, and with several quads working simultaneously you'd have to be pretty unlucky to have textures pounding on one channel.
Generally, no. But it does happen at times, and certainly one can construct such a case. It's usually related to minification on a texture with an addressing pattern that would match your channel pitch. There's no way to detect this however, so there's nothing you can do to prevent it (you'd have to change your address mapping on the fly, per texture, based on HW feedback, which isn't generally possible).
Oh, one more question along the line of geo's comments. How many more pins did you need to go from 4x64-bit to 8x32-bit? Is pin count (by that I mean connections to the silicon) the main problem in moving to a wider bus?
Well, you at least need the extra addressing pins. That also depends on the total amount of physical memory you would want to address in either config. Off hand, just counting quickly in my head, I would guess 30% increase or so, but I would really need to go calculate (I might be way off).
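For anyone who wants to play with the pin-count estimate, here is a back-of-the-envelope sketch. It assumes the data pins stay at 256 in both configurations and that each channel needs its own set of address/command pins; the per-channel count passed in below is a placeholder, not a real package figure, and power, ground, strobe and clock pins are ignored.

```python
def total_pins(channels: int, data_bits_per_channel: int,
               addr_cmd_pins_per_channel: int) -> int:
    """Data plus address/command pins, ignoring power, strobes and clocks."""
    return channels * (data_bits_per_channel + addr_cmd_pins_per_channel)


def pin_increase(addr_cmd_pins_per_channel: int) -> float:
    """Fractional pin increase going from 4x64-bit to 8x32-bit channels."""
    before = total_pins(4, 64, addr_cmd_pins_per_channel)
    after = total_pins(8, 32, addr_cmd_pins_per_channel)
    return (after - before) / before


if __name__ == "__main__":
    # 24 address/command pins per channel is a hypothetical value.
    print(f"{pin_increase(24):.1%} more pins")
```

The result depends entirely on the assumed address/command pin count per channel, so treat it as a way to explore the trade-off rather than as a figure for any real part.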