AMD: R7xx Speculation

Status
Not open for further replies.
This, for me, seems to be the only reason ATI pushed through with its design of R600/R700: not because of the problems it faced this year, but because of the future potential of the ring-bus controller.

That's tantamount to suggesting that AMD/ATI is doing its R&D in public.
 
I think that is basically the design decision behind R600.
Although the scale and utilization of the controller in R600 is out of proportion, that same controller would fit snugly in a situation where it controls multiple bus stops, both internal and external.

This, for me, seems to be the only reason ATI pushed through with its design of R600/R700: not because of the problems it faced this year, but because of the future potential of the ring-bus controller.

The fact that ATI went this route (including shader-based AA) suggests to me that they want as few dependencies as possible on on-die logic that would make any single "processor" bulky, or dependent on the other processors.

I have no idea how well it would scale to the low end, but 670's recent 8xAA numbers suggest its potential under high res/high AA is enormous.

I pretty much agree with everything here. They seem to be preparing for this so far. It looks like a great plan, and a big risk at the same time. Let's hope for the best.

Again, I think this is very close to the truth. Very good speculation.
 
So we can kind of expect to see the 512bit bus again?

What's kind of interesting is that "2" chips will be used for the mid range, perhaps called the HD 4670. But if each chip has roughly the same performance as RV670 (if Fudo is right), that would be some pretty crazy performance for a mid-range part. Even if the chips perform more like R580 than RV670, that would still be pretty impressive.

Scalability is what interests me the most here. A four-chip 512-bit part could perhaps be scaled down to a 2-chip 256-bit SKU, and further down to 1 chip at 128-bit.
 
I think it'd be safer to assume each chip is somewhere in the region of RV630.

Jawed

Yep. Let's hope that's not true though, as 4x RV630 isn't much of a step up from where they are now, let alone where NV will be at when R700 is released. Also if true, we can likely look forward to yet another 16 ROP architecture from AMD. :rolleyes: They'd better be clocking those cores at >1GHz or they're in for a world of hurt.
 
I think it'd be safer to assume each chip is somewhere in the region of RV630.

Jawed

Actually, the more I think about it, Jawed, you may be right. It doesn't sound far-fetched at all. Assuming the 4 chips scale as one, specs close to RV630 level, or perhaps even RV635, may actually make for a nice fast GPU.

So what sounds about right for a 72mm2, 300 million transistor chip? Maybe two RBEs, two TEUs, 32 SPs? One RBE, two TEUs, 24 SPs?
 
Actually, the more I think about it, Jawed, you may be right. It doesn't sound far-fetched at all. Assuming the 4 chips scale as one, specs close to RV630 level, or perhaps even RV635, may actually make for a nice fast GPU.

So what sounds about right for a 72mm2, 300 million transistor chip? Maybe two RBEs, two TEUs, 32 SPs? One RBE, two TEUs, 24 SPs?
128-bit bus, 8 TUs, 4 RBEs, 120 SPs ~150mm2 on 65nm. On 45nm it'd be about 75mm2 :p

So, four of those would add up to 512-bit bus, 32 TUs, 16 RBEs, 480 SPs.
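For anyone who wants to play with the numbers, the arithmetic above can be sketched in a few lines of Python. The per-chip figures are just the guesses from this post, not confirmed specs, and `Chip`, `shrink_area` and `combine` are made-up names for illustration. The shrink assumes ideal scaling (area goes with the square of the feature size), which real processes rarely achieve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Guessed per-chip resources (speculation from the thread, not a spec)."""
    bus_bits: int
    tus: int
    rbes: int
    sps: int


def shrink_area(area_mm2: float, old_nm: float, new_nm: float) -> float:
    """Ideal shrink: area scales with the square of the feature size."""
    return area_mm2 * (new_nm / old_nm) ** 2


def combine(chip: Chip, count: int) -> Chip:
    """Add up the resources of `count` identical chips, assuming they scale as one."""
    return Chip(
        bus_bits=chip.bus_bits * count,
        tus=chip.tus * count,
        rbes=chip.rbes * count,
        sps=chip.sps * count,
    )


rv630_like = Chip(bus_bits=128, tus=8, rbes=4, sps=120)

if __name__ == "__main__":
    # Roughly 150mm2 on 65nm, shrunk to 45nm under the ideal-scaling assumption.
    print(f"45nm area: {shrink_area(150.0, 65, 45):.1f} mm2")
    print(combine(rv630_like, 4))
```

Under ideal scaling the shrink comes out a little under the "about 75mm2" figure, and close to the rumoured 72mm2.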

I know, RV630 is 390M transistors. Something's gotta give... Prolly TUs, and with them some SPs (otherwise the ALU:TU ratio goes "too high") :cry:

Jawed
 
Perhaps the 300M number was a rough sketch, or perhaps ATI cut some other redundant or useless parts... maybe one being tessellation.
 
I know, RV630 is 390M transistors. Something's gotta give... Prolly TUs, and with them some SPs (otherwise the ALU:TU ratio goes "too high") :cry:
You don't need UVD and I/O on each of these cores, so you can shave some transistor budget off that way.
 
You don't need UVD and I/O on each of these cores, so you can shave some transistor budget off that way.

I think that there will be 1 master core containing UVD and I/O, and slave cores without them. So, the value part might be only the master core, and the mainstream part might be composed of the master core and a single slave core.
 
I think that there will be 1 master core containing UVD and I/O, and slave cores without them. So, the value part might be only the master core, and the mainstream part might be composed of the master core and a single slave core.

That rather goes against the whole scalability and yield concept though... More likely to see these things on a separate chip on the PCB.
 
The low-end part needs full-speed video-decode and should be able to deal with a full-width PCIe connection. So, why would one care about scaling those things?

Anyway, if this is really what R700 is, then I hope the master core is able to delegate work to slave dies (i.e. that you get similar scaling characteristics to monolithic designs), rather than this AFR crud. With a 4 tile design, AFR and a 60fps game, you'd have the input lag of a 15fps framerate :cry:.
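The input lag claim is easy to check with a back-of-the-envelope sketch. It assumes perfect AFR scaling, with each chip rendering a whole frame on its own and no extra queueing in the driver; the function names are made up for illustration.

```python
def per_chip_fps(output_fps: float, chips: int) -> float:
    """With AFR, each chip only finishes every `chips`-th frame."""
    return output_fps / chips


def afr_frame_latency_ms(output_fps: float, chips: int) -> float:
    """Time one chip spends on its own frame, in milliseconds."""
    return 1000.0 / per_chip_fps(output_fps, chips)


if __name__ == "__main__":
    # 4 chips at 60fps output: each chip works at a 15fps pace,
    # so every frame is in flight for about 66.7 ms instead of 16.7 ms.
    print(per_chip_fps(60, 4))
    print(round(afr_frame_latency_ms(60, 4), 1))
    print(round(afr_frame_latency_ms(60, 1), 1))
```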
 
A master + slave cores setup would be unnecessarily messy IMO, and would ruin easy scalability. If all "cores" were the same, then you wouldn't have to worry about matching "good" masters with an appropriate number of "good" slaves. And when you have to order thousands of wafers at a time (I'm assuming), if either masters or slaves have a particularly unlucky bad batch, you've just given yourself potentially hundreds if not thousands of useless master or slave cores.

A rather more elegant solution IMO...

Would be one controller/dispatcher chip that contains all features that don't require massive parallelism. For example: UVD, RAMDAC, thread dispatcher, etc...

Then you have multiple cores specializing in all those parallel tasks that can benefit from more units.

Each part would then be smaller allowing for more chips or more redundancy to improve yields and greater granularity.
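To put a rough number on the yield argument, here is a sketch using the simple Poisson yield model (yield = exp(-area x defect density)). The defect density and die areas are made-up placeholders, not anything AMD or a foundry has published, and the model ignores packaging yield, wafer edge losses and any on-die redundancy.

```python
import math


def poisson_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Fraction of dies expected to be defect-free under the Poisson model."""
    return math.exp(-area_mm2 * defects_per_mm2)


def silicon_per_good_product(die_area_mm2: float, dies_per_product: int,
                             defects_per_mm2: float) -> float:
    """Wafer area spent per sellable product when only tested-good dies are packaged."""
    good_fraction = poisson_yield(die_area_mm2, defects_per_mm2)
    return dies_per_product * die_area_mm2 / good_fraction


if __name__ == "__main__":
    d0 = 0.005  # hypothetical defects per mm2, purely illustrative
    mono = silicon_per_good_product(300.0, 1, d0)   # one big 300mm2 die
    multi = silicon_per_good_product(75.0, 4, d0)   # four 75mm2 dies
    print(f"monolithic: {mono:.0f} mm2 of wafer per good product")
    print(f"4 x small:  {multi:.0f} mm2 of wafer per good product")
```

The gain comes from testing dies before packaging: the chance that four particular small dies are all good is the same as for one big die of the same total area, but a bad small die only throws away a quarter of the silicon.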

This would also allow for the whole to act as one big monolithic GPU, thus doing away with the need for AFR or other multi-GPU algorithms. However, how much latency would there be? And would it be possible to effectively hide it?

Then again I'm not a chip designer so all this might be virtually impossible or impractical for one reason or another. :p

Regards,
SB
 
Ah, Fudzilla.

R700 Fudofacts:
1) 300M transistors per cell
2) Same performance as RV670.

Am I the only one who thinks the two do not add up?
 
A rather more elegant solution IMO...

Would be one controller/dispatcher chip that contains all features that don't require massive parallelism. For example: UVD, RAMDAC, thread dispatcher, etc...

Then you have multiple cores specializing in all those parallel tasks that can benefit from more units.

Each part would then be smaller allowing for more chips or more redundancy to improve yields and greater granularity.

This would also allow for the whole to act as one big monolithic GPU, thus doing away with the need for AFR or other multi-GPU algorithms. However, how much latency would there be? And would it be possible to effectively hide it?

Then again I'm not a chip designer so all this might be virtually impossible or impractical for one reason or another. :p

Regards,
SB

That's what I'm thinking. Think of it as the current clusters placed off the die, but effectively still on the same package.
Although data now has to travel through the external bus, as I mentioned before, I think the ring-bus controller will be the essential part in this setup.

Remember http://beyond3d.com/content/reviews/16/5
R600 is a unified, fully-threaded, self load-balancing shading architecture, that complies with and exceeds the specification for DirectX Shader Model 4.0. The major design goals of the chip are high ALU throughput and maximum latency hiding

and
http://beyond3d.com/content/reviews/16/9
Indeed the massively threaded nature of the sampler hardware is a means to hide latency, using the memory controller to make sure the shader core -- as a client of the sampler array via the MC -- is kept busy. Remember we said that the shader core would sleep threads while waiting on sampler data? The two schedulers work together to make sure that's the case, since a trip to a DRAM to open a page and get data, then feed it through the logic for filtering can be hundreds of clocks of wait time. You don't want to stall the shader core for that, at all.

It is basically essential for the system to work this way if it does not want to fall back to "DieFire" (CrossFire on a die).
 
SB, but now you are stuck with multiple chips even for the low end, which strikes me as less than ideal, or with a separate low-end design, which supposedly is one of the things this type of design is trying to avoid. Also, I don't understand why matching good slaves and masters is such a problem: what you describe requires exactly the same, it's just that your master has no ALUs/TMUs.

The fact that one can mix and match cores with different leakage/power characteristics to reach an overall frequency-at-a-certain-power target for the final package is one of the few things (yield being the other) that to me seems a convincing reason to go with this kind of approach (*). So even with completely homogeneous cores, one might want to invest some thought into how cores are matched up.
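As a toy illustration of that matching idea, here is a sketch that pairs the leakiest die with the least leaky one so every two-die package lands near the same total. The leakage numbers and the `pair_dies` helper are hypothetical, and real binning would also have to account for frequency, voltage and more than two dies per package.

```python
def pair_dies(leakage_w: list[float]) -> list[tuple[float, float]]:
    """Pair lowest-leakage dies with highest-leakage dies (simple greedy matching)."""
    if len(leakage_w) % 2:
        raise ValueError("need an even number of dies to form pairs")
    ordered = sorted(leakage_w)
    half = len(ordered) // 2
    # i-th best die goes with the i-th worst die.
    return [(ordered[i], ordered[-1 - i]) for i in range(half)]


if __name__ == "__main__":
    measured = [5.0, 1.0, 4.0, 2.0]  # hypothetical per-die leakage in watts
    for low, high in pair_dies(measured):
        print(f"package: {low} W + {high} W = {low + high} W")
```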

That said, I'm also not a HW engineer.
 