Upcoming ATI Radeon GPUs (45/40nm)

CarstenS · Jul 18, 2008

ShaidarHaran said:
The increase in shader processor count is irrelevant, since RV670 was already tex-bound and thus incapable of utilizing all of its SPs in most situations. It is the "base" increase in TMU count to 40 that has finally allowed the SPs to stretch their legs a bit more with a higher utilization rate (i.e. they're not sitting idle waiting on texture lookups).

I am one of the many not able to follow your line of argument.

If you say the increase to 40 TMUs has allowed the SPs to stretch their legs regardless of their own proportional increase, then as I understand it you're basically proposing that we're seeing all the glory of 320 SPs and 40 TMUs in action, leaving the other 480 SPs sitting idle?

fellix said:
Well, after some brief testing with 3DMark's single-texture feature, it looks like GDDR5 pretty much feeds the RBEs' blenders to their near-theoretical maximum, that is ~11300 of 12000 MPix/s (16*750MHz). Some 200MHz more on the memclock yields ~11500 MPix/s.

On the other hand, G200's rates are way off what its mammoth array of 32 ROPs should deliver.

Of course - those results make perfect sense. I'm getting about 70 percent of max out of my GTX 280's ROPs in this particular test because it is in essence more of a bandwidth test. That's why this particular rate has not increased much more than linearly compared to G80 and downright sucks if you compare it to RV770 XT.

OTOH the 115 GBytes per second of bandwidth only have to sustain a maximum of 12 GPixels per second, whereas the 150ish GBytes per second in GT200 have to feed a theoretical 22 GPixels per second. I am not surprised they're not able to do that.

edit:
Actually, I think it's quite an interesting comparison, albeit off-topic. Taking fillrates and bandwidths into account, I'm getting about 37 percent more fillrate out of only 28 percent more bandwidth.

edit2:
Just re-ran some tests. With the same bandwidth-per-pixel ratio as your 4870 (a very healthy 9.6 bytes per pixel if I am not mistaken) - meaning about 480 MHz core clock for mem running at 1150 MHz - I'm getting about 96.9 percent of ST-fill.
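
For anyone who wants to check the arithmetic, here is a rough Python sketch using only the figures quoted in this post and in fellix's. The helper names are made up for illustration, and the 512-bit GDDR3 bus on the GTX 280 (two transfers per memory clock) is an assumption that is not stated above.

```python
def peak_fill_mpix(rops, core_mhz):
    """Theoretical single-texture fill rate in MPix/s (one pixel per ROP per clock)."""
    return rops * core_mhz


def bytes_per_pixel(bandwidth_gb_s, fill_gpix_s):
    """Memory bandwidth available per pixel when running at peak fill rate."""
    return bandwidth_gb_s / fill_gpix_s


# HD 4870 (RV770 XT): 16 ROPs at 750 MHz, roughly 115 GB/s
rv770_peak = peak_fill_mpix(16, 750)              # 12000 MPix/s
rv770_efficiency = 11300 / rv770_peak             # measured / theoretical
rv770_bpp = bytes_per_pixel(115.0, rv770_peak / 1000)

# GTX 280 as quoted: ~150 GB/s feeding a theoretical ~22 GPix/s
gt200_bpp = bytes_per_pixel(150.0, 22.0)

# "edit2" setup: 32 ROPs at 480 MHz core, memory at 1150 MHz.
# Assumption: 512-bit bus, 2 transfers per clock (GDDR3).
gt200_bw = 1150e6 * 2 * (512 // 8) / 1e9          # GB/s
gt200_matched_bpp = bytes_per_pixel(gt200_bw, peak_fill_mpix(32, 480) / 1000)

print(f"RV770 efficiency:      {rv770_efficiency:.1%}")
print(f"RV770 bytes/pixel:     {rv770_bpp:.2f}")
print(f"GT200 bytes/pixel:     {gt200_bpp:.2f}")
print(f"GT200 @480/1150 B/pix: {gt200_matched_bpp:.2f}")
```

Under those assumptions the downclocked GTX 280 lands on the same bytes-per-pixel ratio as the 4870, which is the point of the re-run.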

mboeller · Jul 18, 2008

3dilettante said:
Neat. I think such a product would be an interesting data point.

Then wait for RV730. 16TMU's + 320 SP's; at least according to the rumours.

ShaidarHaran · Jul 18, 2008

Pantagruel's Friend said:
I've read through this thread, and I still don't understand one thing: why are you so sure that the gaming performance increase came from the TEX boost? What if it was shader bound all the way?

I'm not saying the entire perf. increase observed with RV770 is due solely to the increase in TMU count, I'm saying it has been the primary contributor in non-AA or heavy shadowing scenarios. It's what has allowed the shader core to stretch its legs.

3dilettante · Jul 18, 2008

mboeller said:
Then wait for RV730. 16TMU's + 320 SP's; at least according to the rumours.

But that didn't change anything. The ratio is still 4:1.
The SP count and TMU count match R600, so what we'd be able to tease out would be what effect every other enhancement had outside of ALU and TEX unit increases.

We could see whether the reduced per TMU power of RV770 is compensated for by the reorganization of the caches and memory.

ShaidarHaran · Jul 18, 2008

CarstenS said:
I am one of the many not able to follow your line of argument.

If you say the increase to 40 TMUs has allowed the SPs to stretch their legs regardless of their own proportional increase, then as I understand it you're basically proposing that we're seeing all the glory of 320 SPs and 40 TMUs in action, leaving the other 480 SPs sitting idle?

Well I certainly didn't say the additional 480 SPs are idle, what I said was the original 320 SPs are now fully capable of being utilized. Who knows how many of the additional SPs are also used? I don't, (yet) but I'm hoping the other thread I've created will bring about answers soon enough.

ShaidarHaran · Jul 18, 2008

3dilettante said:
But that didn't change anything. The ratio is still 4:1.
The SP count and TMU count match R600, so what we'd be able to tease out would be what effect every other enhancement had outside of ALU and TEX unit increases.

We could see whether the reduced per TMU power of RV770 is compensated for by the reorganization of the caches and memory.

Now if Wavey would just leak some RV730 perf. numbers we'd be able to end this discussion right here and now

Well, at least we could isolate the perf gains from uarch enhancements.

Pantagruel's Friend · Jul 18, 2008

ShaidarHaran said:
I'm not saying the entire perf. increase observed with RV770 is due solely to the increase in TMU count, I'm saying it has been the primary contributor in non-AA or heavy shadowing scenarios. It's what has allowed the shader core to stretch its legs.

Yes, and I'm doubting exactly that. What if the additional 480 ALUs are the key contributor to the extra performance?
I'm thinking this since I've played a bit around with GPU perfstudio and Crysis - that experiment indicated that Crysis is quite far from being texture bound. I know it's no proof, but then, I've never seen any proof that other games were TEX bound, either.

3dilettante · Jul 18, 2008

I'm not sure per-cycle utilization is the best determining factor, if I'm now interpreting what Dave said to me earlier correctly.

My argument is based more on wall-clock time, and the relative speedup between RV670 and RV770.
Utilization can actually be unchanged, since the unit counts have been raised.

My argument is that ALU-heavy stretches have already had a significant amount of speedup with the introduction of R600, while texturing improvements were weaker.
As a result, the fraction of time taken per frame that we'd attribute to ALU limits was smaller, while the texture time was longer.

Doubling both, even if it halved the time for each side, would mean a smaller absolute decrease for the ALU component.

To state numerically (not real units, not real workload, just illustrative):

10 time units for ALU-limited work and 10 for TEX-limited pre R600 on some workload, so a total of 20 units.

R600 comes along, now it's something like 2 units for ALU, 8 for texture.
RV770 comes along and doubles everything.
1 unit ALU, 4 for TEX.
Absolute improvement for ALU: 1 unit
Absolute improvement for TEX: 4 units
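
The same illustration as a small Python sketch. The time units are the made-up ones from the example above, not measurements, and the assumption that doubling a unit type exactly halves its share of frame time is part of the illustration too.

```python
# (ALU-limited time, TEX-limited time) per frame, in arbitrary units
stages = {
    "pre-R600": (10.0, 10.0),
    "R600": (2.0, 8.0),
}

# RV770 doubles both unit types; assume each component's time simply halves.
alu, tex = stages["R600"]
stages["RV770"] = (alu / 2, tex / 2)

alu_saved = stages["R600"][0] - stages["RV770"][0]
tex_saved = stages["R600"][1] - stages["RV770"][1]
tex_share = tex_saved / (alu_saved + tex_saved)

for name, (a, t) in stages.items():
    print(f"{name:9s} ALU={a:4.1f} TEX={t:4.1f} total={a + t:4.1f}")
print(f"saved: ALU={alu_saved}, TEX={tex_saved}, TEX share of gain={tex_share:.0%}")
```

Both sides see the same 2x relative speedup, but most of the wall-clock time saved comes from the side that was taking longer to begin with.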

ShaidarHaran · Jul 18, 2008

Pantagruel's Friend said:
Yes, and I'm doubting exactly that. What if the additional 480 ALUs are the key contributor to the extra performance?

They're not. There are very few workloads that are ALU-bound on RV670, because the utilization rate is rather low in the average case.

Pantagruel's Friend said:
I'm thinking this since I've played a bit around with GPU perfstudio and Crysis - that experiment indicated that Crysis is quite far from being texture bound. I know it's no proof, but then, I've never seen any proof that other games were TEX bound, either.

I imagine games that give you "ultra" texture quality options would be very much texture-bound.

ShaidarHaran · Jul 18, 2008

3dilettante said:
I'm not sure per-cycle utilization is the best determining factor, if I'm now interpreting what Dave said to me earlier correctly.

My argument is based more on wall-clock time, and the relative speedup between RV670 and RV770.
Utilization can actually be unchanged, since the unit counts have been raised.

My argument is that ALU-heavy stretches have already had a significant amount of speedup with the introduction of R600, while texturing improvements were weaker.
As a result, the fraction of time taken per frame that we'd attribute to ALU limits was smaller, while the texture time was longer.

Yes, it's all relative. ALU perf. was already top-notch with that many ALUs, despite the low utilization rates, whereas texturing performance was severely lacking IN COMPARISON, causing shaders to stall during tex lookups.

3dilettante said:
Doubling both, even if it halved the time for each side, would mean a smaller absolute decrease for the ALU component.

To state numerically (not real units, not real workload, just illustrative):

10 time units for ALU-limited work and 10 for TEX-limited pre R600 on some workload, so a total of 20 units.

R600 comes along, now it's something like 2 units for ALU, 8 for texture.
RV770 comes along and doubles everything.
1 unit ALU, 4 for TEX.
Absolute improvement for ALU: 1 unit
Absolute improvement for TEX: 4 units

Thanks again 3d, you're doing a much better job of arguing my point than I am.

OpenGL guy · Jul 18, 2008

ShaidarHaran said:
I imagine games that give you "ultra" texture quality options would be very much texture-bound.

That usually only increases the size of the textures, not how many texture lookups are done per shader. In other words, an increase in texture bandwidth usage.

Freak'n Big Panda · Jul 18, 2008

OpenGL guy said:
That usually only increases the size of the textures, not how many texture lookups are done per shader. In other words, an increase in texture bandwidth usage.

Yeah those 'ultra' quality texture settings do not add any work for the TF units.

ShaidarHaran said:
They're not. There are very few workloads that are ALU-bound on RV670, because the utilization rate is rather low in the average case.

Did you measure this? I'd really like some proof for this statement. I think R600's performance woes are far more complex than just 'not enough tex'.... Jawed mentioned that z rate might be more limiting than filtering op/s and Dave heavily suggested that a lack of texture filtering performance was not the pivotal bottleneck with R600. Of course if this had been the case we would have seen a 400ALU 48TMU RV770 which would have been cheaper to produce and performed almost identically to a 800/40 ver. ATI's DX10 arch is actually fairly balanced given RV770's stellar performance.

Pantagruel's informal work with Crysis also suggests that R600 is not tex bound. Can you find any application that is tex bound to back up your claims?

ShaidarHaran · Jul 18, 2008

Freak'n Big Panda said:
Yeah those 'ultra' quality texture settings do not add any work for the TF units.

Does anyone know this for a fact? I see an assertion but no evidence...

Freak'n Big Panda said:
Did you measure this? I'd really like some proof for this statement.

I'm afraid I cannot directly link you to any numbers ATM, but the other thread I've started should bear fruit before too long.

Freak'n Big Panda said:
I think R600's performance woes are far more complex than just 'not enough tex'....

I never said that. I said the following a few posts up:

me said:
I'm not saying the entire perf. increase observed with RV770 is due solely to the increase in TMU count, I'm saying it has been the primary contributor in non-AA or heavy shadowing scenarios. It's what has allowed the shader core to stretch its legs.

AlexV · Jul 18, 2008

ShaidarHaran said:
Does anyone know this for a fact?

Yes.

Pantagruel's Friend · Jul 18, 2008

ShaidarHaran said:
They're not. There are very few workloads that are ALU-bound on RV670, because the utilization rate is rather low in the average case.

Yes I also think it's low, but why? If the key factor is low ILP (which I suspect), then adding a lot of extra superscalar units does help.

ShaidarHaran said:
I imagine games that give you "ultra" texture quality options would be very much texture-bound.

I doubt that (on the same ground as OpenGL guy) - but I can't prove it :smile:
Also, Crysis is supposed to be using a lot of different maps that consume TF resources, not only textures.

Oh, and one more thing: from Freak'n Big Panda's post I realized I forgot to mention the HW I ran the Crysis test on. It wasn't an R600, it was an RV630. It's a different ALU:TEX ratio - I don't think it matters much, but I don't want to mislead anyone.

Freak'n Big Panda · Jul 18, 2008

ShaidarHaran said:
Does anyone know this for a fact? I see an assertion but no evidence...

Everything you've said in this thread has been an assertion without evidence

Neeyik · Jul 19, 2008

ShaidarHaran said:
Does anyone know this for a fact? I see an assertion but no evidence...

There shouldn't be a reason as to why higher resolution textures stress the TF units more, as the filtering instructions will still be the same, as will the number of texels fetched and blended, regardless of what size textures are being sampled. It's texture cache hits/misses that're probably going to be the main thing that high res textures affect (and therefore bandwidth, by proxy).
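
A minimal Python sketch of that distinction. The function names are invented for illustration, and the 4 bytes per texel (uncompressed RGBA8) is an assumption; shipping games mostly use compressed formats, which changes the byte counts but not the shape of the argument.

```python
def texels_per_sample(filter_mode):
    """Texels fetched and blended for one filtered sample.

    This depends on the filter, not on the dimensions of the texture.
    """
    return {"point": 1, "bilinear": 4, "trilinear": 8}[filter_mode]


def mip_chain_bytes(size, bytes_per_texel=4):
    """Memory footprint of a square texture plus its full mip chain."""
    total = 0
    while size >= 1:
        total += size * size * bytes_per_texel
        size //= 2
    return total


for size in (1024, 2048):
    print(
        f"{size}x{size}: {texels_per_sample('trilinear')} texels per trilinear sample, "
        f"{mip_chain_bytes(size) / 2**20:.1f} MiB footprint"
    )
```

Doubling the texture dimensions leaves the per-sample filtering work unchanged while roughly quadrupling the data that has to fit in cache or come across the memory bus.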

ShaidarHaran · Jul 19, 2008

Morgoth the Dark Enemy said:
Yes.

Freak'n Big Panda said:
Everything you've said in this thread has been an assertion without evidence

The following is sufficient.

Neeyik said:
There shouldn't be a reason as to why higher resolution textures stress the TF units more, as the filtering instructions will still be the same, as will the number of texels fetched and blended, regardless of what size textures are being sampled. It's texture cache hits/misses that're probably going to be the main thing that high res textures affect (and therefore bandwidth, by proxy).

Thanks Neeyik.

Dave Baumann · Jul 19, 2008

3dilettante said:
I'm not sure what you mean by that, can you clarify?
The ALUs can still do work while the TMUs are doing something, right?
In a TEX-limited shader, the ALUs may not be doing much, but by virtue of there being more than double the number of TEX units, the chip can make sure that it can churn through those stretches of the application faster.
For math-limited shaders, the TMU idles when the ALUs are busy.

In each case (R600/RV770) the organization is such that there are multiple individual execution "cores" (SIMDs). Each SIMD is organised such that there are 4 texture units associated with 80 SPs (or 16 pixel / vertex processing units). Each SIMD will execute a "thread", or a batch (which will have ALU and texture workloads in it).

If we take a SIMD in isolation then, yes, TEX and ALU instructions will execute simultaneously. However if the batch contains instructions that are primarily texture heavy then the ALU's in that SIMD will see lower utilization over the course of executing that batch to completion; if the batch is ALU heavy then that SIMD may experience lower utilization of the texture units.

Because the SIMDs are individual, separate "cores", if one is experiencing low ALU utilization because it's texture limited then it can't just steal texture resources from another SIMD in order to speed up the execution of that batch - it has to eat it and stick with the resources it has until the thread is completed.

So, on a per batch basis the overall execution doesn't alter in terms of cycles it takes to execute because the SIMD's are fixed in terms of the execution resources they have.

Cycle for cycle the speed of execution of an individual batch (in pure int32 tex / ALU instruction execution) does not change. All that does change is that RV770 is executing 2.5x the number of batches at the same time as R600 is. Effectively the highway went from 4 lanes to 10 (but there's no lane swapping!).

If the number of texture units per SIMD had changed then that would have altered the execution rate of a batch in some way.
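
A toy Python model of the above, as a sketch only. The 80 SPs and 4 texture units per SIMD and the 4 vs. 10 SIMD counts come from this post; everything else is a simplifying assumption made for illustration: the function names, the batch sizes, and treating ALU and TEX work as perfectly overlapped so that the slower side sets the batch time. It ignores latency hiding, clocks and caches entirely.

```python
import math


def batch_cycles(alu_ops, tex_ops, alus_per_simd=80, tmus_per_simd=4):
    """Cycles for one SIMD to finish one batch.

    ALU and TEX work overlap within a SIMD, so the slower side sets the pace.
    """
    return max(alu_ops / alus_per_simd, tex_ops / tmus_per_simd)


def workload_cycles(num_batches, simds, alu_ops, tex_ops):
    """Batches are spread across SIMDs; a SIMD cannot borrow units from a neighbour."""
    rounds = math.ceil(num_batches / simds)
    return rounds * batch_cycles(alu_ops, tex_ops)


# Hypothetical texture-heavy batch: 800 ALU ops and 100 TEX ops.
per_batch = batch_cycles(800, 100)            # identical on both chips
r600 = workload_cycles(40, 4, 800, 100)       # 4 SIMDs
rv770 = workload_cycles(40, 10, 800, 100)     # 10 SIMDs

print(f"cycles per batch: {per_batch}")
print(f"R600: {r600} cycles, RV770: {rv770} cycles, speedup {r600 / rv770}x")
```

In this model the per-batch cycle count is the same on both chips and the whole gain comes from running more batches side by side; changing `tmus_per_simd` is the only knob here that would alter how long an individual texture-limited batch takes.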

hoom · Jul 19, 2008

I think (apart from the unit shrinking & AA changes which themselves are very impressive) the real big story with RV770 is in the cache changes, both that they get their own dedicated bandwidth and I think also the separation of the vertex cache.

I get the impression that the R6x0 vertex usage was either directly causing cache thrashing or indirectly by taking up too much space & causing TEX to operate out of an effectively smaller cache.
Fits with some peculiar benchmarks where R6x0 would throw out huge vertex performance but fall over as soon as some other work got thrown in.
Also the various ATI guys round here have repeatedly mentioned the work on cache usage modelling/the changes made, gotta be some reason for that
