AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 lineup of GPUs?

  • Within 1 or 2 weeks

    Votes: 1 0.6%
  • Within a month

    Votes: 5 3.2%
  • Within a couple of months

    Votes: 28 18.1%
  • Very late this year

    Votes: 52 33.5%
  • Not until next year

    Votes: 69 44.5%

  • Total voters
    155
  • Poll closed.
Is it known where the register file physically resides?
In the die shots there appear to be 17 shiny/bright patches per cluster that are all the same size. 16 are placed at the corners of "squares", so in the central area of adjoining clusters it appears as if three sets of 4 of these patches are joined together in lumps.

That's my interpretation, anyway.

Jawed
 
That's what I've assumed. Reducing their size wouldn't help anything, though.

Why would the darker areas of logic between the SRAM patches shrink?
There would just be empty die area in the SIMD array if the reg file was shrunk without a commensurate redesign of the SIMD hardware around it.
 
A summary memo was published on SRAM cell density at various semiconductor manufacturers, and the figures for TSMC's 45/40nm nodes compared quite well, so I guess the whole ALU array should shrink even more.
 
That's what I've assumed. Reducing their size wouldn't help anything, though.

Why would the darker areas of logic between the SRAM patches shrink?
There would just be empty die area in the SIMD array if the reg file was shrunk without a commensurate redesign of the SIMD hardware around it.
I tend to think you're right.

I'm just puzzled why, according to GPUSA, the assembly and register allocation for RV730 differs from RV770 - and why there's a similar pattern for R6xx GPUs.

Also, wasn't there a time when Rightmark's SM4 shaders performed fantastically badly on ATI? I'm wondering if the compiler has been through some drastic alterations, but what we're seeing in GPUSA is some regression in the case of RV730 compilation.

Or, perhaps it's a tweak based on the ALU:TEX ratio of the hardware. But the performance reported, in my view, implies that RV730 and RV770 are both running the same code - so I'm doubtful about trusting GPUSA.

It's a shame that ixbt reports its results in fps and not pixels per second.

Jawed
 
I'm just puzzled why, according to GPUSA, the assembly and register allocation for RV730 differs from RV770 - and why there's a similar pattern for R6xx GPUs.

Or, perhaps it's a tweak based on the ALU:TEX ratio of the hardware. But the performance reported, in my view, implies that RV730 and RV770 are both running the same code - so I'm doubtful about trusting GPUSA.
GPUSA has it right.
 
That's what I've assumed. Reducing their size wouldn't help anything, though.

Why would the darker areas of logic between the SRAM patches shrink?
There would just be empty die area in the SIMD array if the reg file was shrunk without a commensurate redesign of the SIMD hardware around it.
You can change memories and other logic without a significant redesign. Just resynthesize and place and route. The SIMDs in 770 and 730 have different widths so they are different designs that also have a lot of similarities. The most straightforward way to handle the differences is to relayout the SIMD.
 
So why doesn't the performance scaling (RV730->RV770 and Fire->Mineral) indicated by GPUSA match ixbt's results?
For Fire->Mineral, do you know for sure that the same number of pixels are being shaded on the screen? I remember you making a similar mistake with ShaderMark before, where you forgot that only a sphere on the screen had the applied shader.

For RV730->RV770, the scaling looks right to me. I see 278 for the 4670 and 684 for the 4870. Direct scaling of the former would lead you to think the latter would score 672 fps. Maybe the lower-instruction, higher-register shader that was compiled for RV770 didn't quite give the expected increase, but it did give some improvement, so it was worth it.
 
For Fire->Mineral, do you know for sure that the same number of pixels are being shaded on the screen? I remember you making a similar mistake with ShaderMark before, where you forgot that only a sphere on the screen had the applied shader.
Yes, I was crestfallen, not being familiar with Shadermark :cry: ...

I can't run these tests but rummaging:

http://www.ixbt.com/video3/rightmark2.shtml

they've used the Stanford Bunny. SIGH. What's the matter with a full screen quad?

For RV730->RV770, the scaling looks right to me. I see 278 for the 4670 and 684 for the 4870. Direct scaling of the former would lead you to think the latter would score 672 fps.
278*2.5=695.
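
For anyone following the arithmetic, here is a rough sketch of where both numbers come from. It assumes performance scales with ALU lanes times core clock, and the lane counts and clocks (320 lanes at 750 MHz for the 4670, 800 lanes for the 4870) are taken from public specs rather than from ixbt's data. The function name is just for illustration.

```python
# Naive ALU-throughput scaling from HD 4670 (RV730) to HD 4870 (RV770).
# Assumed specs: 320 ALU lanes at 750 MHz vs 800 ALU lanes at 750 MHz.
def scaled_fps(base_fps, base_alus, base_mhz, target_alus, target_mhz):
    """Scale a frame rate by the ratio of ALU lanes times core clock."""
    return base_fps * (target_alus / base_alus) * (target_mhz / base_mhz)

fps_4670 = 278  # ixbt figure quoted above

# With the actual 750 MHz clock the ratio is exactly 2.5.
correct = scaled_fps(fps_4670, 320, 750, 800, 750)
# With a mistaken 725 MHz clock the estimate drops to about 672.
mistaken = scaled_fps(fps_4670, 320, 750, 800, 725)

print(f"4870 at 750 MHz: {correct:.0f} fps")
print(f"4870 at 725 MHz: {mistaken:.0f} fps")
```

So the 672 figure is what you get if the 4870 is taken to run at 725 MHz instead of 750 MHz.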

Maybe the lower-instruction, higher-register shader that was compiled for RV770 didn't quite give the expected increase, but it did give some improvement, so it was worth it.
But why won't RV730 benefit from this? Thus opening up the question, is it due simply to some interaction with ALU:TEX or is it because RV730 has less register file per ALU?

Jawed
 
they've used the Stanford Bunny. SIGH. What's the matter with a full screen quad?
Because everyone loves bunnies! It's rather amusing that they didn't use the bunny for the fur shader. :LOL:

278*2.5=695.
Oops. I thought the 4870 was clocked at 725 MHz, not 750.

Anyway, the 4670 won't be 2.5 times as fast drawing the background and there will be some per-frame overhead. Assume the latter is 100 microseconds, the 4870 took 75 microseconds to draw the background, and the 4670 took twice as long. Then the bunny was shaded 2.60x faster on the 4870. The overhead may be even longer.
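
The estimate above can be reproduced with a few lines. This is only a sketch of the reasoning: the 100 microsecond overhead and the 75/150 microsecond background times are the guesses from this post, not measurements, and the function name is made up for the example.

```python
def shading_speedup(fps_slow, fps_fast, overhead_us, bg_slow_us, bg_fast_us):
    """Ratio of time spent shading the object, after removing fixed costs."""
    frame_slow_us = 1e6 / fps_slow  # total frame time on the slower card
    frame_fast_us = 1e6 / fps_fast  # total frame time on the faster card
    shade_slow_us = frame_slow_us - overhead_us - bg_slow_us
    shade_fast_us = frame_fast_us - overhead_us - bg_fast_us
    return shade_slow_us / shade_fast_us

raw_ratio = 684 / 278  # what the fps numbers show directly
speedup = shading_speedup(278, 684, overhead_us=100, bg_slow_us=150, bg_fast_us=75)

print(f"raw fps ratio:   {raw_ratio:.2f}x")
print(f"bunny shading:   {speedup:.2f}x")
```

The fixed costs take a bigger share of the faster card's frame, which is why the shading-only ratio comes out higher than the raw fps ratio.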

But why won't RV730 benefit from this? Thus opening up the question, is it due simply to some interaction with ALU:TEX or is it because RV730 has less register file per ALU?
I'm guessing the latter, or maybe it's just a carryover from texture limited shader optimizations.

You may recall that compute-limited shaders let you get away with less register space (I hope you learned that from the program I made for you ;) ). However, in order for RV730 to be 80% as fast as RV770 in all texture limited shaders (as per the TMU count), it needs 80% of the total register space of RV770. Even if per-ALU space is the same, it only has 40%, so it would benefit more from low register use.
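
A back-of-envelope version of that argument, as a sketch. It assumes 40 TMUs and 800 ALU lanes for RV770 against 32 TMUs and 320 ALU lanes for RV730, and equal register space per ALU lane, which is exactly the open question here. The "per-thread budget" line is one way to read the consequence, not something taken from the compiler.

```python
# Assumed unit counts (public specs); per-ALU register space assumed equal.
rv770 = {"alus": 800, "tmus": 40}
rv730 = {"alus": 320, "tmus": 32}

# Texture throughput RV730 could reach relative to RV770.
tex_ratio = rv730["tmus"] / rv770["tmus"]

# Total register space relative to RV770, if per-ALU space is the same.
reg_ratio = rv730["alus"] / rv770["alus"]

# To keep the same number of threads in flight per TMU (same latency hiding),
# each thread on RV730 would have to fit in this fraction of the registers.
per_thread_budget = reg_ratio / tex_ratio

print(f"texture throughput ratio: {tex_ratio:.0%}")
print(f"register space ratio:     {reg_ratio:.0%}")
print(f"per-thread budget:        {per_thread_budget:.0%}")
```

Under these assumptions a texture-bound shader on RV730 has roughly half the register budget per thread before latency hiding starts to suffer.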

Again, this only holds true for texture bound shaders, so such an optimization isn't needed for compute bound shaders like the two you're talking about. Nonetheless, it's possible that the shader compiler isn't tuned to this degree.
 
I've found an "anomaly" when comparing the performance of RV730 with RV770.

3DMark Vantage's Perlin Noise test is generally considered a purely ALU test. It's a test that was used at several sites to debug the performance issues with HD4830: 560 v 640 ALU lanes.

But performance on HD4670 is wildly short of expected:

http://www.techreport.com/articles.x/15559/5
http://www.pcgameshardware.com/aid,..._4830_New_AMD_graphics_card_reviewed/?page=12
http://www.elitebastards.com/cms/in...sk=view&id=617&Itemid=27&limit=1&limitstart=3

HD4670 - ~ 14.5fps
HD4850 - ~ 45.3fps

HD4670 is, theoretically, 48% of HD4850. That should amount to 21.7fps. So, why is performance about 67% of theoretical?
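
The numbers behind that, as a quick sketch. The lane counts and clocks (320 lanes at 750 MHz for HD4670, 800 lanes at 625 MHz for HD4850) are assumed from public specs, and the fps values are the approximate ones listed above.

```python
# (ALU lanes, core clock in MHz), assumed from public specs.
hd4670 = (320, 750)
hd4850 = (800, 625)

fps_4670 = 14.5  # approximate Perlin Noise results quoted above
fps_4850 = 45.3

# Theoretical ALU throughput of HD4670 relative to HD4850.
theoretical_ratio = (hd4670[0] * hd4670[1]) / (hd4850[0] * hd4850[1])

expected_fps = fps_4850 * theoretical_ratio  # what pure ALU scaling predicts
achieved = fps_4670 / expected_fps           # fraction of that actually seen

print(f"theoretical ratio: {theoretical_ratio:.0%}")
print(f"expected fps:      {expected_fps:.1f}")
print(f"achieved:          {achieved:.0%} of theoretical")
```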

A screenshot in the 3DMark Vantage reviewer's guide indicates that it's a full screen shader. Unfortunately I can't install Vantage as I'm on XP, so I can't rummage to get at the code itself.

Is it doing lots of render target blending?

Strangely, there's a note on the PCGH page from AMD stating:

The driver isn't optimized for the lower number of SIMD cores yet.

This note relates to HD4830's lower core count in comparison with HD4850/HD4870 (and it appears to be a speculative comment by AMD at around the time AMD wasn't sure why benchmark anomalies with HD4830 were arising).

But, could this statement also imply that driver (compiler) adjustments are required for the 8-core (cluster) count in RV730? i.e. that core count is a determinant of compiler algorithm? In addition to, or instead of ALU:TEX ratio?

Jawed
 
Strangely, there's a note on the PCGH page from AMD stating:



This note relates to HD4830's lower core count in comparison with HD4850/HD4870 (and it appears to be a speculative comment by AMD at around the time AMD wasn't sure why benchmark anomalies with HD4830 were arising).

But, could this statement also imply that driver (compiler) adjustments are required for the 8-core (cluster) count in RV730? i.e. that core count is a determinant of compiler algorithm? In addition to, or instead of ALU:TEX ratio?

Jawed

That statement was referring to the "old" version of the test which was done with the misconfigured HD4830, which had only 560 Shaders enabled. Sorry, but someone must have forgotten to update that.


P.S.:
This supposed driver optimization was the thing that led to my question about the physical location of the register file earlier. I was wondering whether the RF was not 'attached' to the SIMD-core as a whole but rather to the individual Vec5-units, with possible access to other units' RFs as needed. If that were the case, then every SIMD-core based on the RV730 architecture would boast only half as much register space as one in RV770, but the chip as a whole (with 320 shaders active) would have as many KiB as you'd expect. Of course, the driver would also have to take this into account when assigning work and distributing it more evenly.
 
That statement was referring to the "old" version of the test which was done with the misconfigured HD4830, which had only 560 Shaders enabled. Sorry, but someone must have forgotten to update that.
Well I'm confused. Someone from AMD was referring to the driver as not being optimised for HD4830's cores, and this could be the reason for the performance shortfall. Why would it matter which number of cores? As long as the number is less than 10, then there's a suggestion the driver needs optimisation.

I know it's tenuous - I'm just wondering if there's a possibility this indicates another variable that affects shader compilation.

Jawed
 
Well I'm confused. Someone from AMD was referring to the driver as not being optimised for HD4830's cores, and this could be the reason for the performance shortfall. Why would it matter which number of cores? As long as the number is less than 10, then there's a suggestion the driver needs optimisation.
To be honest: "teh driver not being fully optimized yet" is one of the standard excuses you get to hear if you see subpar performance anywhere. :) But nonetheless, with respect to the comparative RV730 results, there could be more to it in this case - read on.


I know it's tenuous - I'm just wondering if there's a possibility this indicates another variable that affects shader compilation.

Jawed
Please see my "P.S."
 
P.S.:
This supposed driver optimization was the thing that led to my question about the physical location of the register file earlier.
Aha!

I was wondering whether the RF was not 'attached' to the SIMD-core as a whole but rather to the individual Vec5-units, with possible access to other units' RFs as needed.
As far as I can tell each RF is private to a set of 5 ALU lanes. Each RF consists of 4 banks, each bank being 256 32-bit values. Or 64 128-bit values, if you prefer - this is, as far as I can tell, how the patent organises the RF, with scalars (32-bits) being accessible even though the RF is addressed on 128-bit boundaries.
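
Taking those figures at face value, the capacity works out as below. This is just a sketch of the arithmetic; the bank count and bank depth are the numbers from this post ("as far as I can tell"), not a confirmed hardware spec, and the constant names are made up for the example.

```python
# Figures as described in the post above (unconfirmed, from the patent reading).
BANKS_PER_RF = 4
SCALARS_PER_BANK = 256   # 32-bit values per bank
BYTES_PER_SCALAR = 4     # 32 bits
SCALARS_PER_VEC4 = 4     # one 128-bit register holds four 32-bit scalars

vec4_per_bank = SCALARS_PER_BANK // SCALARS_PER_VEC4
bytes_per_bank = SCALARS_PER_BANK * BYTES_PER_SCALAR
bytes_per_rf = BANKS_PER_RF * bytes_per_bank
vec4_per_rf = BANKS_PER_RF * vec4_per_bank

print(f"128-bit registers per bank: {vec4_per_bank}")
print(f"bytes per bank:             {bytes_per_bank}")
print(f"bytes per RF (5 lanes):     {bytes_per_rf}")
print(f"128-bit registers per RF:   {vec4_per_rf}")
```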

If that were the case, then every SIMD-core based on the RV730 architecture would boast only half as much register space as one in RV770, but the chip as a whole (with 320 shaders active) would have as many KiB as you'd expect.
I can't quite work out what you're saying here.

Jawed
 
All I was trying to say was: If RF is - as you call it - private to a Vec5, then each RV730-SIMD has only half as much RF as RV770 (and maybe even RV670) because it only consists of 40 ALUs/8 Vec5.

But since it has a full-blown Quad-TMU, it cannot, for example, hide latency as well as the bigger chips. That has to be taken into account by the driver - if that's at all possible.
 
All I was trying to say was: If RF is - as you call it - private to a Vec5, then each RV730-SIMD has only half as much RF as RV770 (and maybe even RV670) because it only consists of 40 ALUs/8 Vec5.

But since it has a full-blown Quad-TMU, it cannot, for example, hide latency as well as the bigger chips. That has to be taken into account by the driver - if that's at all possible.
Yes, agreed the driver needs to be aware of this.

But the question I'm raising is solely related to the number of cores (clusters: ALUs + TUs), not ALUs within a core or ALU:TEX.

We have previously discussed the ratio interpolator:TEX - RV730 is better off in this respect than RV770. So is HD4830 when compared with HD4850/70. This may be all that's relevant...

Jawed
 