AMD: R9xx Speculation

While we can't use GFLOPS to compare a GeForce against a Radeon, we can use it to compare two different Radeons.
When doing so, a Radeon HD 6850 (1488 GFLOPS) ~ HD 5770 (1360 GFLOPS), and an HD 6870 (2016 GFLOPS) ~ HD 5850 (2088 GFLOPS).
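
For what it's worth, those figures follow straight from shader count x 2 FLOPs per MAD x clock, assuming the usually quoted reference clocks and SP counts (so treat the 68xx numbers as speculation like everything else here):

Code:
    HD 6850:  960 SPs x 2 FLOPs (MAD) x 775 MHz = 1488 GFLOPS
    HD 5770:  800 SPs x 2 FLOPs (MAD) x 850 MHz = 1360 GFLOPS
    HD 6870: 1120 SPs x 2 FLOPs (MAD) x 900 MHz = 2016 GFLOPS
    HD 5850: 1440 SPs x 2 FLOPs (MAD) x 725 MHz = 2088 GFLOPS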

The problem with these comparisons is that for the 2xxx-5xxx series Radeons, the published FLOPS numbers are inflated and are only achievable when multiplying by constants. (Or do those constants have to be stored into a register first? That would mean those numbers are ALWAYS inflated and always false.)

There are not enough register read ports to read the inputs for 5 MAD operations per cycle per VLIW processor. So even though there are 5 MAD units, it cannot execute 5 MAD operations simultaneously.


The 6xxx series has probably kept the number of register ports intact while removing one MAD unit, meaning it does not have the register port bottleneck and all the MAD units can really execute MAD operations in parallel.
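
Putting rough numbers on that (the 12-reads-per-cycle figure comes up again further down the thread, so treat it as the working assumption here):

Code:
    VLIW-5: 5 MADs x 3 source operands = 15 reads  >  12 register reads per cycle
    VLIW-4: 4 MADs x 3 source operands = 12 reads  =  12 register reads per cycle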
 


Yes. Looks like the reference 6870 will be a lot cheaper than the 470. Preorder prices typically aren't below MSRP, and Asus products with custom features are typically gawd awful expensive. I wouldn't be surprised to see 6870's under $229 next week, assuming this place isn't just trying to generate hits and actually intends to sell products at that price.
 
While we can't use GFLOPS to compare a GeForce against a Radeon, we can use it to compare two different Radeons.
When doing so, a Radeon HD 6850 (1488 GFLOPS) ~ HD 5770 (1360 GFLOPS), and an HD 6870 (2016 GFLOPS) ~ HD 5850 (2088 GFLOPS).
Not really - if the 5D->4D transition is a fact, 68XX will gain up to 20% efficiency from that alone.

E: what hkultala said.
 
Overclocked and custom-cooled HD 6870 on pre-order for $270? Well, I think that spells trouble for the GTX 470.
The OC is so mild that I'd call it non-overclocked - any perf diff due to it could easily be classified as within the margin of error... Seriously, 1.5% overclocked? Granted, it doesn't show the memory clock, but since it isn't mentioned I have to assume the memory uses the reference clock.

The problem with these comparisons is that for the 2xxx-5xxx series Radeons, the published FLOPS numbers are inflated and are only achievable when multiplying by constants. (Or do those constants have to be stored into a register first? That would mean those numbers are ALWAYS inflated and always false.)

There are not enough register read ports to read the inputs for 5 MAD operations per cycle per VLIW processor. So even though there are 5 MAD units, it cannot execute 5 MAD operations simultaneously.
That isn't exactly true; it depends on the instruction sequence. For instance, if you had 5 MADs and some source registers are shared between several of them, it works just fine.
Besides, that FLOP number is fairly meaningless for graphics anyway; a typical shader doesn't consist only of MADs, and other instructions require fewer source arguments. The bigger problem is definitely trying to find 5 instructions which can be executed in parallel because of dependencies; the smaller problem is the read port limit.

The 6xxx series has probably kept the number of register ports intact while removing one MAD unit, meaning it does not have the register port bottleneck and all the MAD units can really execute MAD operations in parallel.
I think I've lost track of what is now assumed to be the shader organization of Barts... But yes, obviously VLIW-4 should achieve higher utilization than VLIW-5, both because it's easier to extract independent instructions and because the read ports would never be a bottleneck (if the read port count stayed the same).
 
There are not enough register read ports to read the inputs for 5 MAD operations per cycle per VLIW processor. So even though there are 5 MAD units, it cannot execute 5 MAD operations simultaneously.

Code:
    06 ALU: ADDR(166) CNT(124) KCACHE0(CB1:0-15) 
         43  x: MULADD_e    T1.x,  R12.x,  R4.w,  R20.w      
             y: MULADD_e    ____,  R12.x,  R4.z,  R20.z      
             z: MULADD_e    T2.z,  R12.x,  R4.y,  R20.y      
             w: MULADD_e    T0.w,  R12.x,  R4.x,  R20.x      
         44  x: MULADD_e    ____,  R13.x,  R4.z,  R22.z      VEC_201 
             y: MULADD_e    T2.y,  R13.x,  R4.y,  R22.y      VEC_201 
             z: MULADD_e    T0.z,  R13.x,  R4.x,  R22.x      VEC_201 
             w: MULADD_e    T1.w,  R13.x,  R4.w,  R22.w      VEC_201 
             t: MULADD_e    R10.y,  R12.y,  R6.z,  PV43.y      VEC_102 
         45  x: MULADD_e    ____,  R0.x,  R4.z,  R24.z      VEC_201 
             y: MULADD_e    T1.y,  R0.x,  R4.y,  R24.y      VEC_201 
             z: MULADD_e    T1.z,  R0.x,  R4.x,  R24.x      VEC_201 
             w: MULADD_e    T3.w,  R0.x,  R4.w,  R24.w      VEC_201 
             t: MULADD_e    R10.x,  R13.y,  R6.z,  PV44.x      VEC_102 
         46  x: MULADD_e    R11.x,  R12.x,  R5.x,  R27.x      VEC_201 
             y: MULADD_e    ____,  R12.x,  R5.z,  R27.z      VEC_201 
             z: MULADD_e    T3.z,  R12.x,  R5.y,  R27.y      VEC_201 
             w: MULADD_e    T2.w,  R12.x,  R5.w,  R27.w      VEC_201 
             t: MULADD_e    R9.x,  R0.y,  R6.z,  PV45.x      VEC_102 
         47  x: MULADD_e    T2.x,  R13.x,  R5.x,  R21.x      VEC_201 
             y: MULADD_e    T0.y,  R13.x,  R5.w,  R21.w      VEC_201 
             z: MULADD_e    ____,  R13.x,  R5.z,  R21.z      VEC_201 
             w: MULADD_e    R11.w,  R13.x,  R5.y,  R21.y      VEC_201 
             t: MULADD_e    R2.w,  R12.y,  R3.z,  PV46.y      VEC_102 
         48  x: MULADD_e    T3.x,  R0.x,  R5.x,  R23.x      VEC_201 
             y: MULADD_e    T3.y,  R0.x,  R5.w,  R23.w      VEC_201 
             z: MULADD_e    ____,  R0.x,  R5.z,  R23.z      VEC_201 
             w: MULADD_e    R7.w,  R0.x,  R5.y,  R23.y      VEC_201 
             t: MULADD_e    R11.z,  R13.y,  R3.z,  PV47.z      VEC_102 
         49  x: MULADD_e    T0.x,  R15.x,  R4.x,  R17.x      VEC_201 
             y: MULADD_e    R4.y,  R15.x,  R4.w,  R17.w      VEC_201 
             z: MULADD_e    ____,  R15.x,  R4.z,  R17.z      VEC_201 
             w: MULADD_e    R4.w,  R15.x,  R4.y,  R17.y      VEC_201 
             t: MULADD_e    R4.z,  R0.y,  R3.z,  PV48.z      VEC_102 
         50  x: MULADD_e    R5.x,  R15.x,  R5.x,  R26.x      VEC_201 
             y: MULADD_e    R5.y,  R15.x,  R5.w,  R26.w      VEC_201 
             z: MULADD_e    ____,  R15.x,  R5.z,  R26.z      VEC_201 
             w: MULADD_e    R5.w,  R15.x,  R5.y,  R26.y      VEC_201 
             t: MULADD_e    R5.z,  R15.y,  R6.z,  PV49.z      VEC_102 
         51  x: MULADD_e    T0.x,  R12.y,  R6.y,  T2.z      
             y: MULADD_e    ____,  R12.y,  R6.w,  T1.x      
             z: MULADD_e    T2.z,  R15.y,  R3.z,  PV50.z      VEC_201 
             w: MULADD_e    T0.w,  R12.y,  R6.x,  T0.w      
             t: MULADD_e    T1.x,  R15.y,  R6.x,  T0.x      
         52  x: MULADD_e    R4.x,  R12.z,  R8.w,  PV51.y      VEC_021 
             y: MULADD_e    T2.y,  R13.y,  R6.y,  T2.y      VEC_201 
             z: MULADD_e    T0.z,  R13.y,  R6.x,  T0.z      VEC_201 
             w: MULADD_e    ____,  R13.y,  R6.w,  T1.w      VEC_201 
             t: MULADD_e    T2.x,  R13.y,  R3.x,  T2.x      VEC_120 
         53  x: MULADD_e    R7.x,  R13.z,  R8.w,  PV52.w      VEC_021 
             y: MULADD_e    T1.y,  R0.y,  R6.y,  T1.y      VEC_201 
             z: MULADD_e    T1.z,  R0.y,  R6.x,  T1.z      VEC_201 
             w: MULADD_e    T3.w,  R0.y,  R6.w,  T3.w      VEC_201 
             t: MULADD_e    T3.x,  R0.y,  R3.x,  T3.x      VEC_120 
         54  x: MULADD_e    R11.x,  R12.y,  R3.y,  T3.z      VEC_210 
             y: MULADD_e    R9.y,  R12.y,  R3.x,  R11.x      VEC_201 
             z: MULADD_e    T0.z,  R12.y,  R3.w,  T2.w      VEC_201 
             w: MULADD_e    T1.w,  R13.y,  R3.y,  R11.w      
             t: MULADD_e    T3.z,  R13.z,  R8.x,  T0.z      VEC_021 
         55  x: MULADD_e    R5.x,  R15.y,  R3.x,  R5.x      VEC_201 
             y: MULADD_e    ____,  R13.y,  R3.w,  T0.y      VEC_021 
             z: NOP         ____      
             w: MULADD_e    T0.w,  R12.z,  R8.x,  T0.w      VEC_021 
             t: MULADD_e    T1.z,  R0.z,  R8.x,  T1.z      VEC_021 
         56  x: MULADD_e    R9.x,  R13.z,  R8.z,  R10.x      VEC_201 
             y: MULADD_e    T3.y,  R0.y,  R3.w,  T3.y      VEC_120 
             z: MULADD_e    R10.z,  R0.z,  R8.z,  R9.x      VEC_102 
             w: MULADD_e    R7.w,  R0.y,  R3.y,  R7.w      VEC_120 
             t: MULADD_e    R2.y,  R13.z,  R1.w,  PV55.y      
         57  x: MULADD_e    T2.x,  R12.z,  R1.z,  R2.w      VEC_021 
             y: MULADD_e    T0.y,  R15.y,  R6.w,  R4.y      VEC_120 
             z: MULADD_e    R7.z,  R13.z,  R1.x,  T2.x      VEC_120 
             w: MULADD_e    T2.w,  R15.y,  R6.y,  R4.w      VEC_120 
             t: SETGT_DX10  R0.x,  KC0[4].x,  R25.x      
         58  x: MULADD_e    R3.x,  R12.z,  R1.w,  T0.z      
             y: MULADD_e    R3.y,  R15.y,  R3.w,  R5.y      VEC_120 
             z: MULADD_e    R3.z,  R0.z,  R1.x,  T3.x      VEC_120 
             w: MULADD_e    R3.w,  R15.y,  R3.y,  R5.w      VEC_120 
         59  x: MULADD_e    T0.x,  R12.z,  R8.y,  T0.x      
             y: MULADD_e    T2.y,  R13.z,  R8.y,  T2.y      VEC_210 
             z: MULADD_e    T0.z,  R12.z,  R8.z,  R10.y      
             t: MULADD_e    R20.x,  R12.w,  R14.x,  T0.w      
         60  x: MULADD_e    T1.x,  R15.z,  R8.x,  T1.x      VEC_201 
             y: MULADD_e    T1.y,  R0.z,  R8.y,  T1.y      
             z: MULADD_e    R9.z,  R12.z,  R1.x,  R9.y      VEC_120 
             w: MULADD_e    T3.w,  R0.z,  R8.w,  T3.w      
             t: MULADD_e    T2.w,  R15.z,  R8.y,  T2.w      
         61  x: MULADD_e    T3.x,  R0.z,  R1.y,  R7.w      
             y: MULADD_e    T3.y,  R12.z,  R1.y,  R11.x      VEC_102 
             z: MULADD_e    R6.z,  R0.z,  R1.w,  T3.y      
             w: MULADD_e    R7.w,  R13.z,  R1.y,  T1.w      VEC_210 
         62  x: MULADD_e    R17.x,  R15.w,  R14.x,  T1.x      VEC_201 
             y: MULADD_e    R20.y,  R12.w,  R14.y,  T0.x      VEC_102 
             z: MULADD_e    R11.z,  R13.z,  R1.z,  R11.z      
             t: MULADD_e    R17.y,  R15.w,  R14.y,  T2.w      
         63  z: MULADD_e    R4.z,  R0.z,  R1.z,  R4.z      
         64  x: MULADD_e    ____,  R15.z,  R1.x,  R5.x      
             y: MULADD_e    R8.y,  R15.z,  R8.w,  T0.y      
             z: MULADD_e    R5.z,  R15.z,  R8.z,  R5.z      
             w: MULADD_e    ____,  R15.z,  R1.y,  R3.w      
         65  x: MULADD_e    R26.x,  R15.w,  R16.x,  PV64.x      
             y: MULADD_e    R1.y,  R15.z,  R1.w,  R3.y      
             z: MULADD_e    R1.z,  R15.z,  R1.z,  T2.z      
             t: MULADD_e    R26.y,  R15.w,  R16.y,  PV64.w      
         66  x: MULADD_e    R22.x,  R13.w,  R14.x,  T3.z      VEC_210 
             y: MULADD_e    R22.y,  R13.w,  R14.y,  T2.y      VEC_201 
             z: MULADD_e    R20.z,  R12.w,  R14.z,  T0.z      
             w: MULADD_e    R20.w,  R12.w,  R14.w,  R4.x      
             t: MULADD_e    R22.z,  R13.w,  R14.z,  R9.x      
         67  x: MULADD_e    R24.x,  R0.w,  R14.x,  T1.z      VEC_210 
             y: MULADD_e    R24.y,  R0.w,  R14.y,  T1.y      VEC_201 
             z: NOP         ____      
             w: MULADD_e    R22.w,  R13.w,  R14.w,  R7.x      
             t: MULADD_e    R24.z,  R0.w,  R14.z,  R10.z      VEC_120 
         68  x: MULADD_e    R23.x,  R0.w,  R16.x,  R3.z      VEC_210 
             y: MULADD_e    R23.y,  R0.w,  R16.y,  T3.x      VEC_210 
             z: NOP         ____      
             w: MULADD_e    R24.w,  R0.w,  R14.w,  T3.w      VEC_201 
             t: MULADD_e    R23.z,  R0.w,  R16.z,  R4.z      VEC_120 
         69  x: MULADD_e    R27.x,  R12.w,  R16.x,  R9.z      VEC_201 
             y: MULADD_e    R27.y,  R12.w,  R16.y,  T3.y      VEC_201 
             z: MULADD_e    R27.z,  R12.w,  R16.z,  T2.x      VEC_201 
             w: MULADD_e    R23.w,  R0.w,  R16.w,  R6.z      
             t: MULADD_e    R27.w,  R12.w,  R16.w,  R3.x      VEC_120
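
Counting the source operands of bundle 44 above by hand (my reading of the listing, using the simple read-port model discussed further down):

Code:
    bundle 44: 5 MULADDs    = 15 source operand slots
    distinct GPR reads      : R13.x, R4.xyzw, R22.xyzw, R12.y, R6.z = 11 values (<= 12)
    pipeline register read  : PV43.y (slot y result of bundle 43)   =  1 value

So operand reuse (R13.x feeds four slots) plus the PV register is exactly what lets all 5 slots fill up with MULADDs here.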
 
The problem with these comparisons is that for the 2xxx-5xxx series Radeons, the published FLOPS numbers are inflated and are only achievable when multiplying by constants. (Or do those constants have to be stored into a register first? That would mean those numbers are ALWAYS inflated and always false.)

There are not enough register read ports to read the inputs for 5 MAD operations per cycle per VLIW processor. So even though there are 5 MAD units, it cannot execute 5 MAD operations simultaneously.
That is not entirely true. While it is correct that you can read at most 12 32-bit values from the register file (which would be enough for only four 3-operand instructions like MAD), you also have the "pipeline registers" PV and PS, which can deliver up to 5 additional 32-bit values (the results of the previous instruction's 5 slots are stored there). Furthermore, up to 4 32-bit values can be encoded directly in each VLIW instruction. So it is very well possible to saturate the 5 slots with MADs. The register read port restriction is only a hard limit if you have basically no dependencies in the instruction stream, which is surely not very common. You easily get more than 4 slots populated on usual code working with float4, for instance.
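
In other words, the rough operand budget per VLIW instruction under that model:

Code:
    needed for 5 MADs : 5 x 3                                          = 15 operands
    available         : 12 (register file) + 5 (PV/PS) + 4 (literals)  = 21 values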

Interestingly, a lack of independent instructions can limit the throughput to less than 5 operations per VLIW instruction, while a bit of dependency (which you always have) makes it possible to pack more than 4 operations into one VLIW instruction.

Edit:
Too slow! Jawed just used the SKA to prove it is possible. Was probably faster than writing so many words of explanation ;)
 
Edit:
Too slow! Jawed just used the SKA to prove it is possible. Was probably faster than writing so many words of explanation ;)
Yes, I have a naive OpenCL matrix-matrix multiplication kernel lying around. This should get seriously faster with VLIW-4, provided the texture cache system can keep up :cool:
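
In case anyone wants to play along, a naive kernel of that sort looks roughly like this (a textbook sketch, not the actual kernel mentioned above; one work-item per output element, square row-major matrices, no local-memory blocking):

Code:
    // Naive matrix-matrix multiply: C = A * B, all N x N, row-major.
    __kernel void matmul_naive(__global const float *A,
                               __global const float *B,
                               __global float       *C,
                               const int             N)
    {
        const int col = get_global_id(0);           // column of C
        const int row = get_global_id(1);           // row of C
        if (row >= N || col >= N)
            return;

        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc = mad(A[row * N + k], B[k * N + col], acc);   // should compile to MULADDs

        C[row * N + col] = acc;
    }

Whether the compiler actually packs that inner loop into full VLIW bundles is exactly the kind of thing the SKA output above is handy for checking.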

Have you seen my PM?
 
The OC is so mild that I'd call it non-overclocked - any perf diff due to it could easily be classified as within the margin of error... Seriously, 1.5% overclocked? Granted, it doesn't show the memory clock, but since it isn't mentioned I have to assume the memory uses the reference clock.

If the reference design is clocked at 900MHz as is believed, yes, that's a pretty inconsequential improvement, but ASUS still had to run a battery of tests after doing it, which costs money.

In the end, overclocking the GPU by 1.5% is probably almost as expensive as doing it by ~8%, which is more typical.

Anyway, the point is that unless this pre-order offer is complete FUD, we can probably expect the HD 6870 around $249, maybe even less.
 
MSI HD 6870 spotted...
[attached image: msi68702210.png]


EyeFinity IV this time. :smile:
 
Reading between the lines, no-x seems to imply that Barts is just an improved Juniper (or, maybe better, a condensed Cypress) design (i.e. something like Juniper + 4 SIMDs + 16 ROPs + a 128-bit wider memory interface + some NI tweaks).

That would translate to a chip with 1120 SPs (2*7 SIMDs with 16*5 Evergreen shaders each), 56 TMUs, 32 ROPs and a 256-bit memory interface. All we know with respect to die size and rough performance levels would probably fit in with this when clocked @ 900MHz.
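
Just to sanity-check that against the FLOPS figures from earlier in the thread (my arithmetic, assuming 4 TMUs per SIMD as on Evergreen):

Code:
    SPs   : 2 x 7 SIMDs x 16 x 5 ALUs = 1120
    TMUs  : 14 SIMDs x 4 TMUs         = 56
    GFLOPS: 1120 x 2 x 900 MHz        = 2016   (matches the HD 6870 number quoted above)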

As a result, Cayman would arguably turn out as the only "real" (4D-VLIW) NI GPU after all ...

~390mm² would have been roughly ~250mm² @ 32nm, maybe ~260mm² due to some planned adjustments for the "new" process node, now possibly removed for the 40nm realization of the same architecture?
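
(Assuming die area scales roughly with the square of the linear shrink, that back-of-the-envelope checks out: 390mm² x (32/40)² = 390 x 0.64 ≈ 250mm².)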

Maybe they decided to "port" only their sweet-spot (@32nm) design back to 40nm - and went with a more time-efficient, hybrid design for Barts?

I don't know, but no-x seems to know something - so let's talk about his hints.
 
Reading between the lines, no-x seems to imply that Barts is just an improved Juniper (or, maybe better, a condensed Cypress) design (i.e. something like Juniper + 4 SIMDs + 16 ROPs + a 128-bit wider memory interface + some NI tweaks).

That would translate to a chip with 1120 SPs (2*7 SIMDs with 16*5 Evergreen shaders each), 56 TMUs, 32 ROPs and a 256-bit memory interface. All we know with respect to die size and rough performance levels would probably fit in with this.

As a result, Cayman would probably turn out as the only "real" (4D-VLIW) NI GPU after all ...

~390mm² would have been roughly ~250mm² @ 32nm, maybe ~260mm² due to some planned adjustments for the "new" process node, now possibly removed for the 40nm realization of the same architecture?

Maybe they decided to "port" only their sweet-spot (@32nm) design back to 40nm - and went with a hybrid design for Barts?

I don't know, but no-x seems to know something - so let's talk about his hints.

sorry if this seems a bit out of line .. but I couldn't help myself

"Barts is just an improved Juniper (or maybe better: condensed Cypress) design (i.e. something like Juniper + 4SIMDs + 16ROPs + 128bit more interface + some NI tweaks)."

if Barts is JUST .. then you go on about doubling the memory interface, adding 50% more ROPs, reconfiguring the SIMD layout and throwing in some "NI tweaks".. it's like saying that Fermi was JUST an improved G80... quite dismissive, to say the least.
 
If the reference design is clocked at 900MHz as is believed, yes, that's a pretty inconsequential improvement, but ASUS still had to run a battery of tests after doing it, which costs money.
I suspect for such a minimal overclock you could easily get away without any additional tests. Those chips are guaranteed to run at some frequency at some temperature, and they need some margin in testing. The frequency increase is so tiny that it is very likely within the safety margin at which it "should" still run. Or, if they don't want to rely on that, and say the spec calls for it to work at, for instance, 95 degrees at that frequency, they could simply adjust the cooling (with a better cooling solution or simply a more aggressive fan setting) so it doesn't exceed 90 degrees. I really doubt they do additional testing for such a tiny overclock (which would likely be the reason the overclock is that small in the first place, unless AMD forbids more). This is a consumer product after all, and doesn't need to meet mil-spec qualification.
 
As to the :(, yes, it's absolutely a slap in the face that it's called 68xx. One nice thing about AMD moving to names rather than numbering schemes for their cards is that it's easier for them to obscure the fact that Barts would have been named RV940 and Cayman would have been RV970 had they used the traditional naming scheme.

Either way, from everything revealed thus far, 68xx is a big FU to your everyday consumer. And I can only imagine every card going down the line will be similar, offering little to no performance improvement over the cards it replaces. Yay for PR. :p

I'm just waiting to see final performance numbers for 68xx in reviews. If it's similar to what's been revealed thus far, I won't be buying any AMD products for 1 or 2 generations.

Regards,
SB

Do not say you haven't been warned! :D

(forget about 6700 and focus on the Barts part in that post)
 
"Barts is just an improved Juniper (or maybe better: condensed Cypress) design (i.e. something like Juniper + 4SIMDs + 16ROPs + 128bit more interface + some NI tweaks)."

if Barts is JUST .. then you go on about doubling the memory interface, adding 50% more ROPs, reconfiguring the SIMD layout and throwing in some "NI tweaks".. it's like saying that Fermi was JUST an improved G80... quite dismissive, to say the least.
You got me wrong there. I didn't mean "just" in the sense of "not worth the bother", but "just" in terms of "not entirely derived from the NI arch".

Irrespective of specs, Barts seems to iron out all the major deficiencies of the EG architecture - and to actually pull off GTX 470 performance @ 230mm² is nothing short of an amazing achievement.

It's just that ... if they actually managed to achieve that performance level with an EG/NI hybrid ... an entirely NI-based Cayman chip could probably improve even further on Barts' already stellar perf/mm² ...

Feel free to tell me to keep my hopes down, though :D
 