AMD: R9xx Speculation

I translated the above as: "A note: recently tapeout whole series of RV9xx, all are 40nm, doesn't exist 28nm hardware, to strengthen TES, to increase Blu-ray decoding quality, 3DMark Vantage not much increase"

Is the part in bold correct "tapeout whole series"? Supposedly there are 3 chips in Hectoncheires(?sp aarrggh)

Just following up on the previous post... it's spelt "Hecatonchires", of which there are 3 members: Briareus, Cottus and Gyges.

Interestingly, Briareus ~ "sea goat" has another name, "Aigaion", or "Aegean" in English.

Looking at the NI codenames - Mykonos, Ibiza and Cozumel. Mykonos is an island in the Aegean Sea; it would be nice if the two chips were somehow related, to make things a bit easier to remember.

Sadly I suspect AMD's intention is the opposite. :cry:

Edit: Misremembered the strings Gipsel found in the driver: it should be Kauai, not Mykonos, so there's no obvious relationship between the two series.
 
Could someone please explain how ATI designs are denser with transistors but use less power than Nvidia designs? Is that considered an example of superior engineering? I would assume it is, but how do they do it?

Also,
N.I. = rv1070 (new design, new node, next year), and
Hecatonchires = rv970 (refresh, same node, this year)
?
 
Could someone please explain how ATI designs are denser with transistors but use less power than Nvidia designs? Is that considered an example of superior engineering? I would assume it is, but how do they do it?

It comes down essentially to control overhead per ALU. Nvidia effectively has control overhead per 16 ALUs (8 in G80), IIRC, while ATI has control overhead per 80 ALUs. This has an impact not only in the pipeline but also in the register file and any bypass networks.
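
A rough way to see the amortisation effect (the per-scheduler cost below is a made-up placeholder, purely for illustration, not a real area or power figure):

# Hedged sketch: per-ALU control overhead shrinks as more ALUs share one scheduler.
# control_cost is an arbitrary placeholder, not a real number for either vendor.

def overhead_per_alu(alus_per_scheduler, control_cost=4.0):
    """Control overhead attributed to each ALU, in arbitrary 'ALU-equivalents'."""
    return control_cost / alus_per_scheduler

for label, group in [("G80-style", 8), ("GT200-style", 16), ("ATI-style", 80)]:
    print(f"{label}: {overhead_per_alu(group):.3f} ALU-equivalents of control per ALU")

Whatever the real cost per scheduler is, dividing it over 80 ALUs instead of 16 makes it a much smaller fraction of the total.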
 
The memory controllers and miscellaneous doodads would seem to fit the somewhat nebulous uncore (not sure what counts as core in a Cypress-style GPU).
The L2 cache seems tied closely enough to the memory controllers, so that might change.

The shader arrays are tightly linked to the TMU, and LDS, and the GDS has to interface with them. That seems to cap what can be done about those.
The scheduling hardware is also tightly linked to the current SIMD structure and instruction support.
ROPs interact with memory and shader writeback, but what point is there in modifying them if the shaders that feed them don't change?

Perhaps whatever is on the other side of the setup engine can be fiddled with. The setup engine feeds into the scheduler, so it is one step removed from the shaders. Would that count as uncore, or at least notcore? (the latter not being a serious word)
 
The memory controllers and miscellaneous doodads would seem to fit the somewhat nebulous uncore (not sure what counts as core in a Cypress-style GPU).
The L2 cache seems tied closely enough to the memory controllers, so that might change.

The shader arrays are tightly linked to the TMU, and LDS, and the GDS has to interface with them. That seems to cap what can be done about those.
The scheduling hardware is also tightly linked to the current SIMD structure and instruction support.
ROPs interact with memory and shader writeback, but what point is there in modifying them if the shaders that feed them don't change?

Perhaps whatever is on the other side of the setup engine can be fiddled with. The setup engine feeds into the scheduler, so it is one step removed from the shaders. Would that count as uncore, or at least notcore? (the latter not being a serious word)

Exactly what Lixian said. :smile: Setup/raster + tess + cache. And since these were Fermi's architectural advantages, then without a B1 it seems the 6870 should be able to beat a 480.
 
Exactly what Lixian said. :smile: Setup/raster + tess + cache. And since these were Fermi's architectural advantages, then without a B1 it seems the 6870 should be able to beat a 480.

I guess then that might add credence to the idea that there will be a B1 variant of G100? (If the above is true.)

However, early information can often be false information, and if ATI are going through GloFo then the regular rumour mill doesn't exactly apply here.
 
I wonder what will happen to the leaks from the ATI side should they switch to GF. Are the fab workers in Germany/NY any less loose-lipped?
 
I wonder what will happen to the leaks from the ATI side should they switch to GF. Are the fab workers in Germany/NY any less loose-lipped?

I would say that the standard rumour mill isn't as developed in Germany as it is in Taiwan etc., because there hasn't been any reason to really point our noses in that direction for rumours and tidbits.
 
With TSMC making a mess of 40nm, and 28nm causing concerns, it does make you wonder whether AMD and Nvidia will be thinking more and more about GlobalFoundries or others.
 
With TSMC making a mess of 40nm, and 28nm causing concerns, it does make you wonder whether AMD and Nvidia will be thinking more and more about GlobalFoundries or others.

http://www.semiaccurate.com/2009/08/18/nvidia-takes-huge-risk/
Betting people say that TSMC will eat the costs and Nvidia will be a GlobalFoundries customer at the earliest opportunity anyway.

http://www.semiaccurate.com/2010/03/30/atis-next-generation-outed/

The lead-off parts were due to come on TSMC's 28nm process, which is set for Q1/2011, followed by derivatives on GlobalFoundries' 28nm process.

After 40/32/28, can't say I blame nv/ati. :???:
 
The memory controllers and miscellaneous doodads would seem to fit the somewhat nebulous uncore (not sure what counts as core in a Cypress-style GPU).
The L2 cache seems tied closely enough to the memory controllers, so that might change.

The shader arrays are tightly linked to the TMU, and LDS, and the GDS has to interface with them. That seems to cap what can be done about those.
The scheduling hardware is also tightly linked to the current SIMD structure and instruction support.
ROPs interact with memory and shader writeback, but what point is there in modifying them if the shaders that feed them don't change?

Perhaps whatever is on the other side of the setup engine can be fiddled with. The setup engine feeds into the scheduler, so it is one step removed from the shaders. Would that count as uncore, or at least notcore? (the latter not being a serious word)

They could also raise the ALU clocks separately, like Nvidia did long ago. ROPs and TMUs are likely bandwidth-bound, but the ALUs usually aren't.
It's quite hard to push the whole chip to 1.2 GHz, but not if you only raised the SP clocks to 1.2 GHz.
Something like 900 MHz for the whole GPU and 1.2 GHz for the SPs (a 1.3333 multiplier). Just this mild frequency increase could get roughly a third more ALU throughput out of the same 1600 SPs for the same die area.
But it surely has some penalties to keep two clock domains on the chip.
edit: Actually, what's the disadvantage of keeping two or more clock domains on the chip?
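
A back-of-the-envelope on that, assuming the 1600 SPs each do one multiply-add per clock (peak ALU numbers only; ROPs, TMUs and bandwidth would stay at the core clock, so real-world gains would be smaller):

# Hedged sketch: peak MAD throughput at the two proposed clocks.
SPS = 1600            # Cypress-class shader count, same die area assumed
FLOPS_PER_SP = 2      # one multiply-add per clock

def peak_tflops(clock_ghz):
    return SPS * FLOPS_PER_SP * clock_ghz / 1000.0

base = peak_tflops(0.9)    # whole chip at 900 MHz
boost = peak_tflops(1.2)   # separate 1.2 GHz shader domain
print(f"900 MHz: {base:.2f} TFLOPS, 1.2 GHz: {boost:.2f} TFLOPS, gain: {boost / base - 1:.0%}")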
 
They could also raise the ALU clocks separately, like Nvidia did long ago. ROPs and TMUs are likely bandwidth-bound, but the ALUs usually aren't.
It's quite hard to push the whole chip to 1.2 GHz, but not if you only raised the SP clocks to 1.2 GHz.
Something like 900 MHz for the whole GPU and 1.2 GHz for the SPs (a 1.3333 multiplier). Just this mild frequency increase could get roughly a third more ALU throughput out of the same 1600 SPs for the same die area.
But it surely has some penalties to keep two clock domains on the chip.
edit: Actually, what's the disadvantage of keeping two or more clock domains on the chip?

1) Having multiple clocks that are not multiples of each other is problematic for data integrity; it needs additional buffers and synchronisation logic.

2) These increase the latency of data going through the clock-domain boundary.

3) Clock distribution becomes more complex when there are multiple different clock signals.

4) Control logic becomes more complex when everything cannot be counted in simple clock cycles.
 
1) Having multiple clocks that are not multiples of each other is problematic for data integrity; it needs additional buffers and synchronisation logic.

2) These increase the latency of data going through the clock-domain boundary.

3) Clock distribution becomes more complex when there are multiple different clock signals.

4) Control logic becomes more complex when everything cannot be counted in simple clock cycles.

The 9xxx Nvidias had a 2.5 multiplier, so my 1.3333 was a bad example :LOL:. But of course keeping the scalar SPs in sync could be much easier than the vector ones.
 
It's looking more and more like the 5870 is bandwidth-limited at high resolutions, judging by the 2GB edition benchmarks.

Do you guys think ATI will go with a wider bus for the refresh?
 
The 9xxx Nvidias had a 2.5 multiplier, so my 1.3333 was a bad example :LOL:. But of course keeping the scalar SPs in sync could be much easier than the vector ones.

2.5 is also a fractional multiplier, so it's equally bad (or maybe worse, as it requires multiplying/dividing by 5, whereas 4/3 requires multiplying/dividing by only 3 and 4).

2 is an easy one. 4 is easy. 8 is easy. 3, 5 and 6 are a bit more difficult, but still easier than 2.5 or 1.33.
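
A small illustration of the point, looking at how often the two clock edges line up for a given shader:core ratio (ratios taken from the discussion above, expressed in lowest terms):

# Hedged sketch: with a p/q shader:core ratio in lowest terms, the edges only
# realign every q core cycles / p shader cycles, so nothing can be counted in
# simple whole cycles across the boundary unless q == 1.

from fractions import Fraction

for ratio in (Fraction(2), Fraction(4), Fraction(5, 2), Fraction(4, 3)):
    p, q = ratio.numerator, ratio.denominator
    print(f"ratio {ratio}: edges align every {q} core cycle(s) / {p} shader cycle(s)")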
 
It's looking more and more like the 5870 is bandwidth-limited at high resolutions, judging by the 2GB edition benchmarks.

Do you guys think ATI will go with a wider bus for the refresh?
According to the B3D benchmarks, the 5870 is not bandwidth limited.

I'm pretty sure ATI won't use a weird bus like 384-bit, and a 512-bit bus would cost too much die space. I think they will stay with a 256-bit bus this year.
 
According to the B3D benchmarks, the 5870 is not bandwidth limited.

I'm pretty sure ATI won't use a weird bus like 384-bit, and a 512-bit bus would cost too much die space. I think they will stay with a 256-bit bus this year.

If the GPU was designed for a 256-bit bus and around 4.8 GHz effective memory clocks, then just increasing the memory clock won't show much of an increase. You have the same 32 ROPs on the same 8x32-bit controllers and buffers.
As the caches are now several hundred GB/s, and the GPU is designed around a given bus width, bandwidth, ROP count and buffers, you won't see much difference from memory clock changes alone. At least that's my theory. :oops:
I just want to say that you could see a much bigger difference with a 384-bit bus if the whole GPU were designed around it.
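
Just to put rough numbers behind that (the 4.8 Gbps per pin is taken from the post above; the 384-bit line is there only for comparison, not a claim about any actual part):

# Hedged sketch: peak memory bandwidth = bus width * per-pin data rate.
def bandwidth_gbs(bus_bits, data_rate_gbps):
    return bus_bits / 8 * data_rate_gbps   # GB/s

for bits in (256, 384):
    print(f"{bits}-bit @ 4.8 Gbps: {bandwidth_gbs(bits, 4.8):.1f} GB/s")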
 