AMD: R9xx Speculation

HD6950 = 2x Barts at 40nm, coming before Xmas
HD6970 = 2x Cayman at 28nm, coming next year.

The driver names still state this:
235,ANTILLES PRO (671C),NI CAYMAN
236,ANTILLES XT (671D),NI CAYMAN

Not to mention that 28nm should be only for the NI [insert geographically northern island here] -series
 
I thought the 256-bit interface was pretty much confirmed? Seems like a waste with only 16 ROPs (granted, that's exactly the ratio Redwood is using), unless the ROPs have been improved otherwise (Z fillrate?).
I agree that a 256-bit interface is very likely, but I don't agree it would be wasted on 16 ROPs. Look at RV790: average gaming performance of RV870 is only 50% higher despite it having +100% ROPs, so RV870 is in no way limited by ROPs. Bart doesn't need extreme 8x MSAA performance (it isn't a high-end part but a mainstream product), and its quality wouldn't be judged on tessellation benchmarks, which can benefit from higher Z-performance.

It is possible that Bart will have 32 ROPs, I can be mistaken... But I can't ignore that ATi stuck with 16 ROPs for a good six years. They doubled it alongside GDDR5 in 2009 (16 ROPs for mainstream / 32 for high-end), but that doesn't mean they will double it with every half-generation now, does it?
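
Just to put rough numbers on that (a back-of-envelope sketch; clocks are from memory, so treat them as approximate):

[code]
# Rough pixel fillrate comparison, assuming 1 pixel per ROP per clock.
# RV790 (HD 4890): 16 ROPs @ 850 MHz; RV870 (HD 5870): 32 ROPs @ 850 MHz.
def pixel_fillrate_gpix(rops, clock_mhz):
    """Peak colour fillrate in Gpixels/s."""
    return rops * clock_mhz / 1000.0

rv790 = pixel_fillrate_gpix(16, 850)   # 13.6 Gpix/s
rv870 = pixel_fillrate_gpix(32, 850)   # 27.2 Gpix/s
print(f"RV790: {rv790:.1f} Gpix/s, RV870: {rv870:.1f} Gpix/s "
      f"(+{(rv870 / rv790 - 1) * 100:.0f}% fillrate vs ~+50% average gaming performance)")
[/code]

Fillrate doubled, average performance went up about half as much, which is the "not ROP-limited" point.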
 
The driver names still state this:
235,ANTILLES PRO (671C),NI CAYMAN
236,ANTILLES XT (671D),NI CAYMAN

Not to mention that 28nm should be only for the NI [insert geographically northern island here] -series


What a mess... :rolleyes: What do you mean by "28 nm... only for NI"? I thought the HD 6000 series is codenamed NI and the next one is something else. But if the rumoured codenames Ibiza, Cozumel and Kauai are right, then it's not logical that they are SI. :rolleyes:
 
While that is true, a particular bottleneck for trilinear and especially anisotropic filtering is already the L1 cache bandwidth itself. It is designed to deliver a peak bandwidth just sufficient for bilinear filtering. While [strike]ATI[/strike] AMD GPUs show exceptionally efficient use of the theoretically available bandwidth in texturing tests (up to 99% or so), it is simply not enough when turning to more complicated filtering algorithms. The attempts to limit the performance loss (by using fewer texture samples) are quite infamous, as they can lead to texture shimmering. I guess it would be good to retain the ratio of 16 units (but now only 4-slot units) to 4 TMUs, as this would slightly increase the texturing power (and L1 bandwidth) available per shader instruction. Maybe this is already enough to improve the filter quality. Nvidia went with more filtering hardware and L1 cache bandwidth per texturing unit. Apropos Nvidia, the L2 cache bandwidth on GF100 is just half that of Cypress.
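
To illustrate the quoted point with standard filter tap counts (the 4-texels-per-TMU-per-clock L1 figure is an assumption, taken from the "just sufficient for bilinear" remark above, not a spec):

[code]
# Rough texel-fetch budget per filtered sample.
L1_TEXELS_PER_CLOCK = 4  # assumed per-TMU L1 delivery rate (enough for bilinear)

def clocks_per_sample(filter_taps):
    """Clocks a TMU needs if it is purely L1-bandwidth limited."""
    return filter_taps / L1_TEXELS_PER_CLOCK

filters = {
    "bilinear": 4,                  # 2x2 footprint, one mip level
    "trilinear": 8,                 # 2x2 footprint on two mip levels
    "2:1 aniso (trilinear)": 16,    # two trilinear probes along the axis
    "16:1 aniso (worst case)": 128, # up to sixteen trilinear probes
}
for name, taps in filters.items():
    print(f"{name:24s}: {taps:3d} taps -> {clocks_per_sample(taps):5.1f} clocks")
[/code]

So anything beyond bilinear is already multi-cycle on the assumed L1 rate, which is where the sample-reduction tricks (and the shimmering) come from.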
I think ATI holds texels in L1 in uncompressed form, whereas NVidia holds them compressed.

Also the CUDA and Fermi Update presentation linked by Trinibwoy, when describing L1, is referring to the SM L1. The TMU L1 is separate and of lower bandwidth (I've not seen a figure, merely that one of the guides says reads through texture cache are possible with Fermi but suffer reduced bandwidth).

I wonder if a revised texturing architecture:

http://forum.beyond3d.com/showpost.php?p=1468605&postcount=1608

will be key.

One nice thing about the 5-slot -> 4-slot change is that the improvements will be most pronounced in situations with low ALU utilization, where they are needed most. The potential drawbacks are mainly compensated by sheer unit-count growth, and they show up in areas with high ALU utilization (the actual solution for the transcendental stuff is still pending), where ATI GPUs have a huge advantage either way.
Another, very minor, element is the disparity between register file bandwidth and ALU capability. Only 12 distinct operands can appear in one instruction cycle (excluding literals), short of the 15 that 5 MADs could theoretically demand.
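
A trivial sanity check on that disparity (just the arithmetic, not how the hardware actually schedules anything):

[code]
# 5 VLIW slots each issuing a MAD would want 3 source operands apiece,
# but only 12 distinct GPR operands can be read per instruction (excluding literals).
MAX_DISTINCT_GPR_READS = 12
slots = 5              # x, y, z, w, t
operands_per_mad = 3   # a * b + c

needed = slots * operands_per_mad            # 15
shortfall = needed - MAX_DISTINCT_GPR_READS  # 3
print(f"5 MADs want {needed} source operands; at least {shortfall} must be "
      f"shared, literals, or forwarded results (e.g. PV/PS) from the previous group.")
[/code]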

Similarly, operations (involving an address) against LDS are limited to two per cycle. If these cause a serialisation in the code, then fewer lanes will be wasted while that serialisation is in effect.
 
What a mess... :rolleyes: What do you mean by "28 nm... only for NI"? I thought the HD 6000 series is codenamed NI and the next one is something else. But if the rumoured codenames Ibiza, Cozumel and Kauai are right, then it's not logical that they are SI. :rolleyes:

I mean that the current codenames for the HD6-family are NI [island here]
All the islands in the current lineup are geographically southern ones (in the Caribbean), so my logic would translate that meaning to
"NI family, southern islands"
While the 28nm chips would be
"NI family, northern islands"

The problem comes from the island names for the other NI/SI lineup, if they're indeed correct: Cozumel and Kauai are around the same latitude as the current SI-lineup names, but Ibiza is notably further north. I don't have a map at hand showing the equator etc. to say exactly whether Cozumel and Kauai are a bit further north too, but Cozumel seems to be a bit further south than Turks and Caicos.
 
Ok, I promise to stop with the guesstimations after this post - but CaymanX2 @40nm still sounds like a very bad idea to me: They already had to lower clocks significantly on the stock 5970 cards in order to stay within reasonable power limits - and Cypress was a 330mm2/180W GPU ...

With Cayman arguably being near the 400mm2 die-size mark as well as near to the 200W power mark, well, it's up to AMD to go with a pair of lower-clocked, HUGE, rare and expensive Caymans or a pair of high-clocked, approximately RV770 sized, high-yielding Barts.

If they were Nvidia, they'd certainly go for the brute force option - but I just expect the inventor of the sweet-spot strategy to know better than this :p



The PCI-e power specifications and dual >300mm2 GPUs basically don't seem to get along very well. If Barts actually turns out to be approximately as fast as Cypress clock-for-clock while maintaining a smaller die size and lower power consumption, they could probably clock two of them @ HD 5870 levels (850MHz+) without breaking the 300W barrier. (850-725)*100/725 then results in nearly a 20% performance gain over the HD 5970 ... a card with two Caymans @ 700MHz (?) arguably won't do a whole lot better than that, but would probably cost twice as much to make.
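
The arithmetic spelled out (assuming performance scales linearly with clock and that Barts really is Cypress-like clock-for-clock, both assumptions of the argument above):

[code]
# Hypothetical dual-Barts card at HD 5870 clocks vs the actual HD 5970 clock.
hd5970_clock = 725    # MHz, actual HD 5970 engine clock
barts_x2_clock = 850  # MHz, assumed clock for a dual-Barts card

gain = (barts_x2_clock - hd5970_clock) / hd5970_clock * 100
print(f"~{gain:.0f}% higher clock than HD 5970")  # ~17%
[/code]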

The important thing is that Barts should turn out WAY faster than GF104 when clocked properly ... the only thing they need to do is to BEAT the upcoming GF104 X2 card - by how much doesn't really matter.

I understand your wattage worries, but what if AMD is aiming for a PCIe 3.0 validation? PCIe 3.0 specs should already be finalised.

Is there a chance that the 6000 series would be PCIe 3.0 and if yes, could the PCIe 3.0 specs allow for more than 300W per card?

TBH, I dunno how this stuff gets decided. Just wondering! :?:
 
I understand your wattage worries, but what if AMD is aiming for a PCIe 3.0 validation? PCIe 3.0 specs should already be finalised.

Is there a chance that the 6000 series would be PCIe 3.0 and if yes, could the PCIe 3.0 specs allow for more than 300W per card?

TBH, I dunno how this stuff gets decided. Just wondering! :?:

I never heard any plans that they were. It would also put quite a limit on your possible user/consumer base to release a card that needs new components abiding by a newly released specification, though we are talking about the ultra-enthusiast segment, where it's usually "money is no object".

All I could find, in the non-confidential info, is this:

PCI-SIG said:
Q15: Will PCIe 3.0 enable greater power delivery to cards?
A15: The PCIe Card Electromechanical (CEM) 3.0 specification will consolidate all previous form factor power delivery specifications, including the 150W and the 300W specifications.

http://www.pcisig.com/news_room/faqs/pcie3.0_faq/
 
I also don't remember seeing anything on upcoming platforms being PCI Express 3.0. Or did I miss something?
 
Another, very minor, element is the disparity between register file bandwidth and ALU capability. Only 12 distinct operands can appear in one instruction cycle (excluding literals), short of the 15 that 5 MADs could theoretically demand.
If the rumors turn out to be true, this change may be somewhat more significant. The T unit has hung off of the register read network like a remora or leech, albeit one that has somehow cross-bred with a swiss army knife, making it a rather useful hanger-on.
Dropping the extra consumer from the network would lop off an extra path from each part of the network.

From an aesthetic standpoint, one would wonder why not go in and junk some of the baroque register read logic in general, if they were already going in to streamline things anyway.

Another plus is that if there is a more symmetric VLIW scheme, it would be easier for the compiler to work with, but that would depend on how the responsibilities are assigned per lane.
 
If the rumors turn out to be true, this change may be somewhat more significant. The T unit has hung off of the register read network like a remora or leech, albeit one that has somehow cross-bred with a swiss army knife, making it a rather useful hanger-on.

Maybe it's not that useful, especially if a lot of standard workloads don't use it that much? It seems to me that this redesign is all about raising utilisation. Any losses are made up (and then some if numbers are to be believed) by the increased number of units.

I'm guessing that AMD have profiled this and decided that this is the way to go.
 
If the rumors turn out to be true, this change may be somewhat more significant. The T unit has hung off of the register read network like a remora or leech, albeit one that has somehow cross-bred with a swiss army knife, making it a rather useful hanger-on.
Dropping the extra consumer from the network would lop off an extra path from each part of the network.
Yes, this is part of my original argument about area savings. Hardly massive, though.

From an aesthetic standpoint, one would wonder why not go in and junk some of the baroque register read logic in general, if they were already going in to streamline things anyway.
You've totally lost me there.

Another plus is that if there is a more symmetric VLIW scheme, it would be easier for the compiler to work with, but that would depend on how the responsibilities are assigned per lane.
I said that back then too.

Anyway, I'm still cautious about this whole thing - it amounts to a big change, after all.
 
Yes, this is part of my original argument about area savings. Hardly massive, though.
It does save a data path from each of the three GPRx values, remove 3 of the widest multiplexers, and drop two constant accesses.
The area might be modest, but it was sitting at a pretty crucial part of the pipeline, so there could be power and clock speed benefits.

You've totally lost me there.
RV770 has 4 register memories that produce per instruction 3 4-wide GPR values that are then subdivided out to each unit, subject to a host of restrictions and special cases.
I presume the complexity of the process is needed to keep the port count low and it is more apparent to software because the VLIW architecture precludes the complexity hiding offered by an operand collector.

If things were designed just to be "pretty", it would look neater if register access for NI were just "receive register identifier, read register".
 
RV770 has 4 register memories that produce per instruction 3 4-wide GPR values that are then subdivided out to each unit, subject to a host of restrictions and special cases.
I presume the complexity of the process is needed to keep the port count low and it is more apparent to software because the VLIW architecture precludes the complexity hiding offered by an operand collector.
You're forgetting that the ALUs aren't the only RF clients.

If things were designed just to be "pretty", it would look neater if register access for NI were just "receive register identifier, read register".
Feel free to show how you'd achieve 16 bytes from each of 12 different addresses in 3 physical clocks with some "pretty and simple" scheme...
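
For what it's worth, a back-of-envelope on those numbers, assuming each of the four banks can deliver one 16-byte read per clock (my reading of the scheme described above, not a spec):

[code]
# Why 12 reads in 3 clocks maps neatly onto a 4-banked register file.
banks = 4
reads_per_bank_per_clock = 1  # assumed
bytes_per_read = 16
clocks = 3

operands = banks * reads_per_bank_per_clock * clocks  # 12 reads, exactly the budget
bandwidth = operands * bytes_per_read                  # 192 bytes per instruction
print(f"{operands} x {bytes_per_read}B reads in {clocks} clocks -> {bandwidth} B/instr,")
print(f"but only if the compiler spreads operands at most "
      f"{reads_per_bank_per_clock * clocks} per bank - hence the packing rules.")
[/code]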
 
You're forgetting that the ALUs aren't the only RF clients.
For the purposes of determining register read scheduling for each ALU instruction bundle, they are the primary read clients.
As far as the documented ISA description of the process goes, they are the only clients in the logic for determining what the code can source in a given instruction.

Feel free to show how you'd achieve 16 bytes from each of 12 different addresses in 3 physical clocks with some "pretty and simple" scheme...

The Itanium 2 integer register file can provide 12 operands in a cycle and, as a bonus, it can provide 8 writes.
It is physically possible, but the heavily ported file would probably lead to a reduction in the overall capacity of a GPU register file, and there would be power concerns.

It is, from an aesthetic standpoint, much less ugly than what is revealed to software on the GPU.
 
So you're suggesting for the sake of "software aesthetics", while they're having an efficiency drive they should throw in way more ports?
 
So you're suggesting for the sake of "software aesthetics", while they're having an efficiency drive they should throw in way more ports?
I am not suggesting they should, as I am aware that there are physical realities that intrude.

My comment was that since AMD is allegedly removing one source of ugliness in the ALU layout, it would look nicer if they went ahead and did something about smoothing over the register read rules.
As far as graphics functions are concerned, the software-visible papering over of a 4-banked register memory is survivable.
Despite this, it is a nasty implementation detail to reveal to software, and it does restrict what the compiler can pack in a cycle, however uncommon in graphics we can expect stretches of code dominated by references to just one bank to be.

It is not something I would miss if it were worked around, and I doubt that it would be tolerated forever in the Fusion era, assuming AMD is truly serious about advancing the level of integration.
 
Back to the rumour mill and expected performance gains. Here is what I found. :D

The Cayman (Caiman) is a bit ferocious... *Yikes*

[image: Caiman crocodilus, Costa Rica]


DirectX 9/10 performance: +30% to +50%
Tessellation performance: +200% to +300%


Sorry, 265586888, for using your post. ;)

http://semiaccurate.com/forums/showthread.php?t=2308&page=9
 
I guess we'll see about the numbers.
As far as tessellation goes, if the somewhat puzzling throughput numbers for the tessellator in some benchmarks were improved to 1 triangle per cycle, that would be one bottleneck that would scale by 2-3x.
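
A rough sketch of that, assuming the current peak is somewhere around one triangle every 2-3 clocks (an assumption based on those puzzling benchmark numbers, not a documented figure):

[code]
# Tessellated triangle setup rate at Cypress-like engine clock.
clock_mhz = 850  # Cypress engine clock

for clocks_per_tri in (3, 2, 1):
    rate = clock_mhz / clocks_per_tri
    print(f"1 tri / {clocks_per_tri} clocks -> {rate:.0f} Mtris/s")
# Going from 1/3 or 1/2 of a triangle per clock to a full triangle per clock
# is exactly a 2-3x gain on that particular bottleneck.
[/code]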
 
I am not suggesting they should, as I am aware that there are physical realities that intrude.

My comment was that since AMD is allegedly removing one source of ugliness in the ALU layout, it would look nicer if they went ahead and did something about smoothing over the register read rules.
I think it would be relatively easy to make it "disappear" on the software side if, instead of making these fetches illegal, the hw would recognize that itself and simply fetch operands over more cycles as necessary (and stall execution). That would be more like traditional ISAs, where everything is usually allowed but you get penalties for doing odd things the hw can't really do. But since this isn't really a fixed ISA, baking the logic into the compiler instead looks quite natural to me.
Though if the 4 banks are a problem but Itanium-style fetch isn't efficient (the Itanium on the whole didn't seem to make efficient use of its transistors), maybe something like 2 banks with twice the read ports (per bank) would be an option? Might make sense if the 4D arrangement is more like 2+2? And beef up the logic for operand fetching to try to optimize the fetch order itself in case of conflicts (and stall if not possible - after all, with 2 banks and twice the read ports that would be one additional clock max). Or maybe it doesn't make sense :).
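
A toy sketch of that serialise-on-conflict idea (bank and port counts are just parameters here, nothing reflects real hardware):

[code]
from collections import Counter
from math import ceil

def fetch_cycles(operand_banks, ports_per_bank=1):
    """Cycles needed to read all operands when each bank has a fixed port count."""
    per_bank = Counter(operand_banks)
    return max(ceil(n / ports_per_bank) for n in per_bank.values())

# 12 operands nicely spread over 4 banks: 3 cycles, no stall.
print(fetch_cycles([0, 1, 2, 3] * 3))               # -> 3
# Same 12 operands crowded into 2 banks: 6 cycles with 1 port per bank...
print(fetch_cycles([0, 1] * 6))                      # -> 6
# ...but back to 3 cycles with 2 read ports per bank, as suggested above.
print(fetch_cycles([0, 1] * 6, ports_per_bank=2))    # -> 3
[/code]

The hardware would just need the counting logic plus a stall path, which is roughly what the next reply is weighing against an operand collector.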
 
I think it would be relatively easy to make it "disappear" on the software side if, instead of making these fetches illegal, the hw would recognize that itself and simply fetch operands over more cycles as necessary (and stall execution). That would be more like traditional ISAs, where everything is usually allowed but you get penalties for doing odd things the hw can't really do.
Transparent handling of hazards is not easily reconciled with a VLIW or VLIW-like architecture. The chip does recognize certain instances, but this is probably weighed against the transistor cost.
If the scheduler could recognize that there were bank or source conflicts, it would need hardware units that would "collect" said operands, which would sound oddly familiar.

Though if the 4 banks are a problem but Itanium-style fetch isn't efficient (the Itanium on the whole didn't seem to make efficient use of its transistors), maybe something like 2 banks with twice the read ports (per bank) would be an option? Might make sense if the 4D arrangement is more like 2+2? And beef up the logic for operand fetching to try to optimize the fetch order itself in case of conflicts (and stall if not possible - after all, with 2 banks and twice the read ports that would be one additional clock max). Or maybe it doesn't make sense :).

Ports could be added, but if the VLIW scheme is in place they would need to be exposed in the instructions. The register files would be physically larger as a result of the additional ports, which could require a drop in capacity.
Having the hardware schedule itself around hazards would need a change in the whole execution scheme.

It may not be worth it for a GPU, but the low-level programming model could pose challenges if AMD merges units for some future Fusion chip.
 