AMD: R7xx Speculation

Ah, Fudzilla.

R700 Fudofacts:
1) 300 Transistors per cell
2) Same performance as RV670.

Am I the only one who thinks the two do not add up?

There's been rumbles from multiple directions that the roadmap may be changing. I'm no longer sure at all that the first R7 gen part will end in double aught.
 
I think that's a very small number of lanes. What it does is communicate with the SiS southbridge AFAIK, so it's <=4x. Heck, it might even be 1x! I'm not sure how much more expensive 16x would be though (does all the control logic have to be duplicated?)
The I/O Controller is capable of 500MB/s in both directions simultaneously, i.e. 1GB/s:


That is 4x PCI Express lanes, since each one is 250MB/s.

I would expect pretty much all the logic would be duplicated per lane, since they're all logically independent - i.e. a 16-lane setup could be doing 16 different PCI Express transfers concurrently.
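
For anyone who wants to sanity-check the lane maths, here's a quick Python sketch using the commonly quoted PCIe 1.x figure of 250MB/s per lane per direction. The link widths are just examples to compare against the 500MB/s-each-way figure in the diagram above, not anything confirmed.

Code:
# Rough PCIe 1.x link bandwidth sanity check.
# Assumption: 250 MB/s usable bandwidth per lane, per direction (2.5 GT/s with 8b/10b encoding).
PER_LANE_MB_S = 250  # per direction

def link_bandwidth(lanes):
    """Return (per-direction, aggregate) bandwidth in MB/s for an n-lane link."""
    per_direction = lanes * PER_LANE_MB_S
    aggregate = 2 * per_direction  # both directions running simultaneously
    return per_direction, aggregate

for lanes in (1, 2, 4, 16):
    per_dir, total = link_bandwidth(lanes)
    print(f"x{lanes:<2}: {per_dir} MB/s per direction, {total} MB/s aggregate")

# Compare these against the I/O controller figure quoted above
# (500 MB/s each way, 1 GB/s total) to estimate the link width.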

Jawed
 
There's been rumbles from multiple directions that the roadmap may be changing. I'm no longer sure at all that the first R7 gen part will end in double aught.

As in R700 itself has been re-designed/replaced with R7x0, or we'll be seeing value parts first?
 
I was exactly as precise as I meant to be. :) Tho one might additionally note that it isn't necessarily an either/or.
 
I was exactly as precise as I meant to be. :) Tho one might additionally note that it isn't necessarily an either/or.
Considering how far behind NVidia they are, one would hope they'd bring stuff forward...

I don't know if we ever had a timescale for 45nm. I think it's fair to say pretty much everyone has been surprised at how rapidly 55nm has come to pass (less than 6 months after the first 65nm ATI GPU - though you might argue it should have been a 10/11 month gap if the schedules came out as planned).

So, 45nm being brought forward? Pushed back?

Does anyone believe 45nm in Q2/Q3 2008?

Jawed
 
As in R700 itself has been re-designed/replaced with R7x0, or we'll be seeing value parts first?
R700 will be postponed. ATi will launch R720 (exactly RV670 doubled + higher clocks), which will be quite fast, then R820 (same number of processing units, SM-next support, higher clocks) and R880 (3 times more processing units, a very successful design). The finished R700 will be introduced under the R900 code-name and beaten by nVidia's solution 6 months before its launch. Or...?
 
Does anyone believe 45nm in Q2/Q3 2008?
I could believe it in Q3, yes. TSMC's 45nm process was ready in September 2007, while their 65nm process was in May 2006. Look at the timeframes on RV610/RV630 and you can guess the rest. That would be incredibly aggressive though, as the adoption rate for 45nm seems to be even slower than 65nm in other markets. So this would look even slightly more aggressive than RV610/RV630, although far from unimaginable.

FWIW, I'm expecting NVIDIA's first 45nm handheld chip to tape-out in late Q2. Of course, that one includes more analogue and some RF AFAIK, which tends to slow things down. I'm really not expecting NV to have any discrete PC GPU on 45nm before 2009, but we'll see - I wouldn't completely exclude the possibility either.

EDIT: And good catch on that being a PCI Express 4x link!
 
R700 will be postponed. ATi will launch R720 (exactly RV670 doubled + higher clocks), which will be quite fast, then R820 (same number of processing units, SM-next support, higher clocks) and R880 (3 times more processing units, a very successful design). The finished R700 will be introduced under the R900 code-name and beaten by nVidia's solution 6 months before its launch. Or...?

Sounds like a synopsis of R360->R420->R520->R580 with the codenames changed :p
 
Why is larger than 512 bits impractical? The traditional problem with 512 bits was getting that many connections to a single die.
I'm sure it's technically possible. I'm thinking more about commercial practicality, at least in the next 2 or 3 years or so. It took quite a while to go from 256 to 512 bits, and even that has not proven to be all that necessary. With GDDR5 doubling bandwidth from what it is now, I don't think GPU designers are going to push for wider than 512 bits any time soon.

But I do wonder if there'll be much calling for such a narrow bus. In theory the performance of a GPU such as RV610 (soon to be RV620) will be falling off the bottom of the chart in a couple of years.
I think we'll see stripped down stuff for a long time to come. GDDR5@64-bits: 40GB/s? Way faster than anything I've had!
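
For reference, that 40GB/s falls straight out of the usual bandwidth arithmetic. The Python sketch below assumes a ~5Gbps per-pin rate for GDDR5, which is speculation at this point rather than a confirmed spec, and the other rates are just illustrative.

Code:
# Peak memory bandwidth = bus width (bits) / 8 * per-pin data rate (Gbps)
def peak_bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Peak bandwidth in GB/s for a given bus width and per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps

# Assumed per-pin rates: GDDR3/4 around 2 Gbps today, GDDR5 rumoured at ~4-5 Gbps.
print(peak_bandwidth_gb_s(64, 5.0))    # 64-bit GDDR5 @ 5 Gbps  -> 40.0 GB/s
print(peak_bandwidth_gb_s(256, 2.0))   # 256-bit GDDR3 @ 2 Gbps -> 64.0 GB/s (roughly RV670 territory)
print(peak_bandwidth_gb_s(512, 5.0))   # 512-bit GDDR5 @ 5 Gbps -> 320.0 GB/s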

Supposedly one of the benefits of the ring-bus was the removal of hotspots.
My opinion on this hasn't changed. ;)
The notion that you have to spread stuff around a die for thermal reasons has never registered on my personal radar. Now, GPUs have unusually large power consumption, but even then you'd expect their problems to filter through to others as everybody moves up the technology ramp.

Just google "synopsys hot spots". You'll find a bunch of stuff about lithography hot spots, routing hot spots, and power rail hot spots: all fixable with small localized changes.

I could find 1 article that's very relevant. But have a look at this: "For one thing, the engineers will consider reducing the impact of hotspots by attaching the die directly to a high thermal-conductivity heat spreader, such as a copper plate." They are talking here about very low cost packages where heat removal is indeed more of a problem because plastic isn't the best conductor in the world. GPUs have used heat spreaders for years.

Further down the article: "In such cases, it's best to distribute them relatively evenly over the die while still avoiding the corners and/or edges." Maybe the ring bus controllers shouldn't have been (or aren't) located at the edges after all...

I still don't see how something pedestrian like a blob with interconnect and a lot of parallel FIFO's (with only 1 working at a time) is of larger concern wrt thermal behavior than a dense core with hundreds of adders and multipliers that are all glitching like crazy and moving data each and every clock cycle. It just doesn't make any sense.

Seemingly smaller chips are easier to clock higher, which I presume is a hotspot issue.
I doubt it. It may be for thermal reasons at the system level, but that's something different entirely.

I think the wide-spread fine-grained redundancy makes that moot.
Unless ATI found the holy grail of redundant random logic, I think they're very much complementary. Until then, if you have to choose one, block-level redundancy is more efficient.
 
I'm sure it's technically possible. I'm thinking more about commercial practicality, at least in the next 2 or 3 years or so. It took quite a while to go from 256 to 512 bits, and even that has not proven to be all that necessary. With GDDR5 doubling bandwidth from what it is now, I don't think GPU designers are going to push for wider than 512 bits any time soon.
Overall I agree - but from the point of view of what GDDR5 brings.

It'll be interesting to see if R7xx chips each have a 64-bit or 128-bit memory bus. Chip-to-chip data transfer may turn out to be more important - i.e. the priority may lie in dedicating more die area to moving data between chips. The ring bus effectively posits this, being 2x the bandwidth of the entire VRAM bus.

I think we'll see stripped down stuff for a long time to come. GDDR5@64-bits: 40GB/s? Way faster than anything I've had!
Yeah, we may also be at 32nm or beyond before we see GDDR5 deployed on $50 discrete graphics cards (if they even exist).

My opinion on this hasn't changed. ;)
8x32-bit memory channels on one side of a crossbar connecting to 4x L2s, 4x ROPs, 4x SIMDs. I guess you can cascade crossbars. Wouldn't that crossbar be quite a hotspot?
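
As a rough illustration of why a full crossbar gets awkward (routing-wise if nothing else), here's a back-of-envelope Python sketch. The port counts come from the hypothetical configuration above and the 256-bit path is just an assumed number.

Code:
# Back-of-envelope crossbar wiring estimate (illustrative numbers only).
def crossbar_crosspoints(mem_ports, client_ports, bus_width_bits):
    """Crosspoints and total switched signal count for a full crossbar."""
    crosspoints = mem_ports * client_ports
    switched_signals = crosspoints * bus_width_bits
    return crosspoints, switched_signals

# Assumed: 8 memory channels on one side, 12 clients (4 L2s + 4 ROPs + 4 SIMDs)
# on the other, with a 256-bit data path per port.
xp, signals = crossbar_crosspoints(8, 12, 256)
print(xp, "crosspoints,", signals, "switched bus bits")  # 96 crosspoints, 24576 bits

Lots of wires converging on one spot, in other words - whether that translates into a thermal hotspot is exactly the question being argued here.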

Also, even if that's irrelevant, ATI still wants to go to a fully distributed, virtualised, memory system - a "ring" looks pretty useful when you're trying to connect 2, 3, 4 or more chips.

The notion that you have to spread stuff around a die for thermal reasons has never registered on my personal radar. Now, GPUs have unusually large power consumption, but even then you'd expect their problems to filter through to others as everybody moves up the technology ramp.
I suspect this is where the 6-monthly re-implementation schedule comes in. To keep that schedule they have to cut corners, one of which is "off the shelf".

Just google "synopsys hot spots". You'll find a bunch of stuff about lithography hot spots, routing hot spots, and power rail hot spots: all fixable with small localized changes.

I could find 1 article that's very relevant. But have a look at this: "For one thing, the engineers will consider reducing the impact of hotspots by attaching the die directly to a high thermal-conductivity heat spreader, such as a copper plate." They are talking here about very low cost packages where heat removal is indeed more of a problem because plastic isn't the best conductor in the world. GPUs have used heat spreaders for years.
I interpret that to mean that GPUs are years ahead in terms of encountering these problems.

Further down the article: "In such cases, it's best to distribute them relatively evenly over the die while still avoiding the corners and/or edges." Maybe the ring bus controllers shouldn't have been (or aren't) located at the edges after all...
Ah, but are the ring bus controllers hotspots? What the ring bus replaces (i.e. some kind of crossbar hierarchy) would have, supposedly, produced a hot-spot constrained GPU.

I still don't see how something pedestrian like a blob with interconnect and a lot of parallel FIFO's (with only 1 working at a time) is of larger concern wrt thermal behavior than a dense core with hundreds of adders and multipliers that are all glitching like crazy and moving data each and every clock cycle. It just doesn't make any sense.
I'm all out of ideas.

I doubt it. It may be for thermal reasons at the system level, but that's something different entirely.
Well here's some recent stuff:

Semiconductor device with integrated heat spreader

THERMAL MANAGEMENT DEVICE FOR MULTIPLE HEAT PRODUCING DEVICES

ATI seems to spend a fair amount of effort on solving heat problems generally. I think the second of those is for stuff you'd find in a mobile device. Then again, maybe these are just land-grab patent applications.

Unless ATI found the holy grail of redundant random logic, I think they're very much complementary. Until then, if you have to choose one, block-level redundancy is more efficient.
GPUs are "embarrassingly parallel" - they should be pretty amenable to fine-grained redundancy. Especially on the scale we're seeing now. I think ATI's general model of having one die for one or two SKUs, per performance category (with 4 or 5 categories altogether), drives them towards fine-grained redundancy. Apart from RV560/570, every other "block level redundant" die in recent years has only existed as some kind of end-of-life part (XL/GTO, X1900GT/X1950GT, HD2900GT, that kind of thing).
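
To make the fine-grained trade-off a bit more concrete, here's a toy Python yield model; the defect density, array area and lane count are made up purely for illustration, not anything from ATI.

Code:
import math

# Toy Poisson yield model: yield = exp(-defect_density * area).
D = 0.4           # assumed defects per cm^2 (illustrative)
array_area = 1.0  # assumed cm^2 of shader array, split into identical lanes
lanes = 16        # physical lanes, one of which is spare

def block_yield(area_cm2):
    return math.exp(-D * area_cm2)

# No redundancy: the whole array must be defect-free.
y_none = block_yield(array_area)

# Fine-grained redundancy: at most one of the 16 lanes may be bad.
lane_y = block_yield(array_area / lanes)
y_fine = lane_y**lanes + lanes * lane_y**(lanes - 1) * (1 - lane_y)

print(f"no redundancy:        {y_none:.3f}")   # ~0.67
print(f"one spare lane of {lanes}: {y_fine:.3f}")  # ~0.94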

Jawed
 
Well here's some recent stuff:

Semiconductor device with integrated heat spreader

THERMAL MANAGEMENT DEVICE FOR MULTIPLE HEAT PRODUCING DEVICES

ATI seems to spend a fair amount of effort on solving heat problems generally. I think the second of those is for stuff you'd find in a mobile device. Then again, maybe these are just land-grab patent applications.

Actually, I know several people working on/experimenting with the integration of heat pipes on-chip. The problem is that the materials needed (electrically non-conductive but thermally conductive) are special enough that you will not see them in commercial applications any time soon. Maybe the first ones to use them will be military/space. For the rest, heat pipes outside the chip are basically a common thing these days, it seems. :)
 
GPUs are "embarrassingly parallel" - they should be pretty amenable to fine-grained redundancy.
Errr, no. GPUs are massively parallel in terms of *processing* logic, but not in terms of *control* logic. Here is a list of systems in a DX10 GPU that are *not* amenable to fine-grained redundancy:
- All I/O & display analogue.
- Video decoder hardware.
- PCI Express controller.
- Memory controller.
- Input assembly*.
- Clipping/Culling.
- Triangle Setup*.
- Rasterization.
- Global VS/PS arbitration.
- Texture addressing*.
- ALU Scheduler.
*: Depends slightly on implementation or only part of the process may be fine-grained (i.e. not enough)...

That's NOT a small list, and it's NOT a negligible part of the GPU. Good luck implementing fine-grained redundancy for any of those things, unless your definition of fine-grained is to duplicate things (increasing area by 25% to 100%) and not use them!
 
Errr, no. GPUs are massively parallel in terms of *processing* logic, but not in terms of *control* logic. Here is a list of systems in a DX10 GPU that are *not* amenable to fine-grained redundancy:
- All I/O & display analogue.
- Video decoder hardware.
- PCI Express controller.
- Memory controller.
- Input assembly*.
- Clipping/Culling.
- Triangle Setup*.
- Rasterization.
- Global VS/PS arbitration.
- Texture addressing*.
- ALU Scheduler.
*: Depends slightly on implementation or only part of the process may be fine-grained (i.e. not enough)...
Which of those items are amenable to block-level redundancy? Nearly everything you've identified is a global, one-off, functional item.

What proportions of the die are these two sets (your list, and the block-level redundant functions)?

Jawed
 
I still don't see how something pedestrian like a blob with interconnect and a lot of parallel FIFO's (with only 1 working at a time) is of larger concern wrt thermal behavior than a dense core with hundreds of adders and multipliers that are all glitching like crazy and moving data each and every clock cycle. It just doesn't make any sense.

I was under the impression that the concern for hotspots was not the MC itself but rather the "memory bus" wires surrounding it. As I understand it, those "buses" run at the DRAM frequency, and that frequency is way higher than the core frequency.

Jawed said:
RV610 is also incapable of 8xMSAA.

Is there a discussion of why RV610 is incapable of 8x MSAA? I couldn't find one.
 
Is there a discussion of why RV610 is incapable of 8x MSAA? I couldn't find one.
I presume it's nothing more than a simplification of the hardware, particularly as performance would be poor except in fairly old games. 6.4GB/s and 12.8GB/s are the bandwidths, apparently, down there in the realms I barely think about :LOL:

Jawed
 
I was under the impression that the concern for hotspots was not the MC itself but rather the "memory bus" wires surrounding it. As I understand it, those "buses" run at the DRAM frequency, and that frequency is way higher than the core frequency.

No, there's no way the bus itself is a problem.
Think about it: you just have a small blob of buffers or FFs every 200um or so. The wires don't even glitch during transitions.
Compare this with thousands of densely packed glitching combinatorial gates of a multiplier/adder/other random logic.
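
For what it's worth, the standard dynamic power expression makes the same point. Every capacitance, activity and frequency number in the Python sketch below is invented purely for illustration.

Code:
# Dynamic power: P = activity * C * V^2 * f (switching power only, leakage ignored).
def dynamic_power_w(activity, cap_farads, volts, freq_hz):
    return activity * cap_farads * volts**2 * freq_hz

V = 1.1  # assumed supply voltage

# A repeated bus wire: modest capacitance, at most one clean transition per cycle.
bus_wire = dynamic_power_w(activity=0.5, cap_farads=1e-12, volts=V, freq_hz=1.0e9)

# The same area full of combinatorial logic: far more switched capacitance,
# plus glitching pushing the effective activity well above one transition per cycle.
logic_blob = dynamic_power_w(activity=2.0, cap_farads=20e-12, volts=V, freq_hz=0.75e9)

print(f"bus wire  : {bus_wire*1e3:.2f} mW")   # ~0.6 mW
print(f"logic blob: {logic_blob*1e3:.2f} mW")  # ~36 mW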
 
Ah, but are the ring bus controllers hotspots? What the ring bus replaces (i.e. some kind of crossbar hierarchy) would have, supposedly, produced a hot-spot constrained GPU.
Crossbars can be problematic due to routing issues. Lots of wires converging in small places... which results in low cell placement density. Exactly the kind of environment where you'd expect thermal hot spots not to be a problem. I think it's a complete non-issue.
 