The LAST R600 Rumours & Speculation Thread

ATI have commented in the past about the spread of cards released, and that as time goes on there would be more steps from low end to high end. So going for 512-bit at the high end doesn't mean that the low end won't be 64-bit, but rather that you can now have 64-, 128-, 256- and 512-bit cards. This means it should be easier to have a price point for all pockets, IMHO.

True enough, but I think there's another aspect that comes into play.

ATi's die sizes have been pretty huge in the past for what they were capable of vs the competition, and I have to believe a larger bus takes up a fair bit of space. I think RV630, for example, is 128-bit to keep it as cheap as possible to produce, even at 65nm, even at the cost of some performance it may or may not need. If the <80mm2 rumor is true, and it competes with the 8600 series, which is also supposedly 128-bit, no doubt that chip is a huge winner for ATi, as mainstream is where the $ are. The 8600 series chip could possibly be twice as big... putting ATi in the situation Nvidia was in last gen, with the ability to chop prices if need be, while also being able to create 'x2'-style cards if need be to compete.

That brings me to my next question... when will we see a 256-bit R600 part, if that is indeed the sweet spot? I always thought parts would follow the R5xx scheme, with RV630 being 1/4 R600, and RV660/670 following as 2x RV630 (with the same bus) and 3/4 R600 respectively, but I also expected RV630 to have a 256-bit bus... so I dunno. I guess RV660 is still the best candidate, though... perhaps 256-bit and 1/2 R600... but where does that leave RV670? 512-bit? 256-bit? 384-bit? :p
 
At first thought, it seems unlikely to me that AMD will produce "odd" bus configurations such as 384-bit, because the number of memory channels it implies doesn't sit well with the ring-bus architecture.

At the low-end, this changes though - if RV515 is any guide, the low-end dies are too small to justify using a ring-bus. Perhaps that'll be true in RV6xx. So a 64-bit memory bus isn't "odd" in the same way that a 384-bit bus is.

As far as I can tell the 512-bit external memory bus in R600 will consist of 16x 32-bit channels, where each ring-stop "commands" 2 channels. So, that's 8 ring-stops. Well, that's based on "doubling-up" R580...

So a half-R600 GPU would have 4 ring-stops, 8x 32-bit channels, 256-bit total external bus.

Then a quarter-R600 GPU would have 2 ring-stops, 4x 32-bit channels, 128-bit total external bus.
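
Just to make that arithmetic concrete, here's a toy sketch (Python, purely illustrative - the 2-channels-per-stop figure is only my "doubling-up R580" reading from above):

Code:
# Toy arithmetic for the halving scheme above: each ring-stop
# "commands" 2x 32-bit channels, and derivatives halve the stop count.
CHANNEL_WIDTH = 32     # bits per external memory channel
CHANNELS_PER_STOP = 2  # assumed, per the "doubling-up R580" reading

for name, stops in [("R600", 8), ("half-R600", 4), ("quarter-R600", 2)]:
    channels = stops * CHANNELS_PER_STOP
    print(f"{name}: {stops} stops, {channels}x 32-bit, "
          f"{channels * CHANNEL_WIDTH}-bit bus")
# R600: 8 stops, 16x 32-bit, 512-bit bus
# half-R600: 4 stops, 8x 32-bit, 256-bit bus
# quarter-R600: 2 stops, 4x 32-bit, 128-bit bus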

---

ATI GPUs seem to tile both textures and render targets across all memory channels. i.e. in order to fetch a block of a render target from memory (say 16x16 pixels), a request is sent to all memory channels concurrently.
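
In toy form it might look like this (the 64-byte burst size is my assumption, just to illustrate the interleaving):

Code:
# Purely illustrative: round-robin a render-target block's bursts over
# all memory channels, so fetching one block touches every channel at once.
def channel_for_burst(burst_index, num_channels):
    return burst_index % num_channels

# A 16x16-pixel block at 8 bytes/pixel is 2048 bytes; with an assumed
# 64-byte burst that's 32 bursts, i.e. 2 per channel on a 16-channel part.
bursts = (16 * 16 * 8) // 64
hits = [channel_for_burst(i, 16) for i in range(bursts)]
assert all(hits.count(ch) == 2 for ch in range(16))  # perfectly balanced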

R5xx GPUs with a ring-bus all seem to have four ring-stops, and they access memory through either a 256- or 128-bit bus. I might argue that four ring-stops underlie the ring-bus architecture: the symmetry and common memory tilings it produces, regardless of GPU capability, might be crucial in the overall design of the memory system.

So it makes me wonder whether AMD will implement all R6xx GPUs with four ring-stops (or use a crossbar on the smaller GPUs). If that's so, then that would mean R600 has 4x 32-bit channels per ring-stop. I could argue: if 2x 32-bit channels works, why not 4x?

Then you could ask, why not 3x 32-bit channels per ring-stop, for a total of 384-bit? Erm... dunno :LOL: Like pretty much everything with R600, it's a wilderness.
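
For what it's worth, the raw arithmetic doesn't object to 3x at all - fixing four ring-stops (my speculation above) and varying the channels per stop:

Code:
# Total external width is just stops x channels-per-stop x 32.
for cps in (2, 3, 4):
    print(f"{cps}x 32-bit per stop -> {4 * cps * 32}-bit bus")
# 2x -> 256-bit, 3x -> 384-bit, 4x -> 512-bit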

Jawed
 
At first thought, it seems unlikely to me that AMD will produce "odd" bus configurations such as 384-bit, because the number of memory channels it implies doesn't sit well with the ring-bus architecture.
Is that, perhaps, the answer to the question "why 512 bits?" Could it be simply that 256 bits weren't quite enough, but that there was no number between 256 and 512 that was compatible with the ring-bus, so they had no option but to go all the way to 512? :oops:
 
I could be wrong, but I don't see why the number of ring stops has to be tied to the number of memory channels at all. I would think the number of stops is more a function of the size of the chip/length of the wires. You only want to add stops if you have to, since adding stops increases latency. The number of stops, or the number of channels per stop, does not have to be 2^n.
 
Is that, perhaps, the answer to the question "why 512 bits?" Could it be simply that 256 bits weren't quite enough, but that there was no number between 256 and 512 that was compatible with the ring-bus, so they had no option but to go all the way to 512? :oops:

I see no reason whatsoever why the ring-bus wouldn't work in a 384-bit configuration.

Why 512 bits? Maybe just because it was technically feasible and they ran with it? Not necessarily an unreasonable thing to do if you want absolutely the best performance and don't know what the competition is up to.
 
I could be wrong, but I don't see why the number of ring stops has to be tied to the number of memory channels at all. I would think the number of stops is more a function of the size of the chip/length of the wires. You only want to add stops if you have to, since adding stops increases latency. The number of stops, or the number of channels per stop, does not have to be 2^n.

There are 2 kinds of stops: I assume the discussion here is about nodes where data can enter the ring. Those are different than repeaters that are placed in between nodes to meet timing. Obviously, both will increase latency.

The number of stops indeed doesn't have to be a power of 2: R520 has 1 additional stop for PCIe traffic.
 
Is that, perhaps, the answer to the question "why 512 bits?" Could it be simply that 256 bits weren't quite enough, but that there was no number between 256 and 512 that was compatible with the ring-bus, so they had no option but to go all the way to 512? :oops:
Well, before "512-bit is confirmed", I was arguing that since ATI worked on GDDR4 (which will eventually end up almost 2x as fast as GDDR3, in theory - even if it starts out only 10% faster) they would hardly also do a 512-bit bus at the same time.

GDDR4's early incarnations would theoretically have been enough to reach into the 60-100GB/s range? With ATI closely involved in the development of GDDR4, you'd sorta hope they'd know what would be realisable in the time frame? So why do 512-bit at the same time? Against that, if R600 was, like Vista, supposed to be a year earlier (and perhaps 90nm?), 512-bit was crucial because GDDR4 would be coming later?
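
Back of the envelope, reading those figures as the DDR base clock (so 2 transfers per cycle; the clocks below are illustrative, not confirmed):

Code:
# Rough bandwidth: (bus bits / 8) bytes per transfer, 2 transfers per clock.
def gb_per_s(bus_bits, mem_clock_ghz):
    return (bus_bits / 8) * mem_clock_ghz * 2

print(gb_per_s(256, 1.0))  # early GDDR4-ish: 64.0 GB/s
print(gb_per_s(256, 1.4))  # faster GDDR4: 89.6 GB/s
print(gb_per_s(512, 1.0))  # the 512-bit route: 128.0 GB/s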

Perhaps ATI had 512-bit as a backup plan? Pretty outrageous backup plan. I dunno.

If R600's ROPs are >2.5x as capable as R580's with ~2x the bandwidth, for example, that might be evidence the bandwidth is useful.

On the other hand, G80 can do some fancy things with only ~86GB/s, stuff that R580 struggles with (i.e. G80 seems proportionally better). Perhaps we'll get a clearer idea of the bandwidth effectiveness of G80 when B3D's performance analysis appears - the wildly higher pipeline counts in G80 make it hard to discern what's going on. (Another interesting comparison point is R580 CrossFire...)

Jawed
 
There are 2 kinds of stops: I assume the discussion here is about nodes where data can enter the ring. Those are different than repeaters that are placed in between nodes to meet timing. Obviously, both will increase latency.

The number of stops indeed doesn't have to be a power of 2: R520 has 1 additional stop for PCIe traffic.

Ah thanks, so there are some signal propagation stages between the stops? As in it takes more than 1 clock for data to travel between one stop to the next? Just want to clarify. I previously imagined it as 1 clock to get from one stop to the next.
 
Ah thanks, so there are some signal propagation stages between the stops? As in it takes more than 1 clock for data to travel between one stop to the next? Just want to clarify. I previously imagined it as 1 clock to get from one stop to the next.

Yes. That's how I think it is anyway. ;)

But it's not unreasonable: it's pretty much impossible to span multi-mm distances without re-flopping (re-registering the signal through flip-flops along the way).
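
A toy model of what that does to latency (stage counts invented, one ring direction only, just to make the repeater point concrete):

Code:
# Hypothetical: each hop between adjacent stops costs 1 cycle at the stop
# plus 2 repeater flops along the wire, i.e. 3 cycles/hop (numbers made up).
STAGES_PER_HOP = 3

def ring_latency(src, dst, num_stops):
    hops = (dst - src) % num_stops  # travelling one way round the ring
    return hops * STAGES_PER_HOP

print(ring_latency(0, 4, 8))  # halfway round an 8-stop ring: 12 cycles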
 
I could be wrong, but I don't see why the number of ring stops has to be tied to the number of memory channels at all. I would think the number of stops is more a function of the size of the chip/length of the wires. You only want to add stops if you have to, since adding stops increases latency. The number of stops, or the number of channels per stop, does not have to be 2^n.
What's going on with RV570 and RV560?:

http://www.beyond3d.com/forum/showpost.php?p=886671&postcount=67

http://www.beyond3d.com/forum/showpost.php?p=886678&postcount=68

?

Jawed
 
Sure, but I'm not seeing that in April, are you? Maybe in the summer, or fall.

Erm, funnily enough, Nvidia has done a lot to surprise us. First with an early G70, which made ATI jump at trying to get R580 out (of course R520 had PCB problems which 'delayed' it), then an early fully unified-shader G80 which is early enough and fast enough to last until G81 (or whatever they call it), so I won't be surprised if we see a quicker G8x within 3 months of the release of R600.
 

I never really understood that either when it was originally posted. Dave's response didn't seem to make sense, although I wrote it off as just being uneducated in the matter. I assumed 4x32 external, 4x[32x2] internal.

On a tangent, is there some reason 12x32 (12x[32x2]) would not work? I understand it would probably require a strange RAM setup for proper equal-length traces, but would the ring bus be able to function with that setup, rather than only in quantities of 2 (i.e. could it handle 3)? If this was answered in a former post, my apologies.

I only ask because a 384-bit 3/4 R600 on 65nm would be a pretty exciting part to look forward to in H2 rather than one that would likely end up with a 256-bit bus (as I doubt it would end up with 512-bit, although possible.)

I suppose it all comes down to how much the architecture is tied together scalability-wise. Are the shaders/bus independent or connected in a certain scale...I suppose no one knows that yet though. :p

I won't be surprised if we see a quicker G8x within 3 months of the release of R600.
I concur. I think the plan all-along has been to release the 65nm G8x a couple/few months after R600, whenever it shows up.
 
I concur. I think the plan all-along has been to release the 65nm G8x a couple/few months after R600, whenever it shows up.
I think you'll be very disappointed if you're expecting a 65nm G80 shrink that quickly. So far, we've heard of no tape-out, and I imagine that they'll need to perform at least one respin. Also, I've been told that NVIDIA targets 100 days between tape-out and launch. Finally, don't forget that porting to 65nm from the 80GT process (that G80 uses) is not trivial, and AMD will probably have a better time of it with R600 (which uses the 80HS process).

I'd rather see something more interesting than a basic die shrink anyway :D
 
I think you'll be very disappointed if you're expecting a 65nm G80 shrink that quickly. So far, we've heard of no tape-out, and I imagine that they'll need to perform at least one respin. Also, I've been told that NVIDIA targets 100 days between tape-out and launch. Finally, don't forget that porting to 65nm from the 80GT process (that G80 uses) is not trivial, and AMD will probably have a better time of it with R600 (which uses the 80HS process).

I'd rather see something more interesting than a basic die shrink anyway :D

80GT ?
I think you meant 90GT, no ?
 
I never really understood that either when it was originally posted. Dave's response didn't seem to make sense, although I wrote it off as just being uneducated in the matter. I assumed 4x32 external, 4x[32x2] internal.
I was hoping someone would decode it, too.

On a tangent, is there some reason 12x32 (12x[32x2]) would not work? I understand it would probably require a strange RAM setup for proper equal-length traces, but would the ring bus be able to function with that setup, rather than only in quantities of 2 (i.e. could it handle 3)? If this was answered in a former post, my apologies.
I think the fundamental problem is that you're better off arranging memory in powers of 2 - 3s don't map very well to most of the data formats of textures or render targets.

e.g. a non-AA render target is 8 bytes per pixel (4 colour, 3 Z, 1 stencil) so you could put 2 bytes on each of 4 channels. Or you could take 16 pixels and put 4 on each channel. etc. What happens when you've got 2xAA, or 6xAA? etc.
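
A quick way to see the power-of-2 point, using the 8-bytes-per-sample figure above:

Code:
# Bytes per pixel at various AA levels, split over 3 vs 4 channels.
# 4 always divides evenly; 3 mostly doesn't.
for samples in (1, 2, 4, 6):        # no AA, 2xAA, 4xAA, 6xAA
    bpp = 8 * samples
    for channels in (3, 4):
        r = bpp % channels
        note = "even" if r == 0 else f"{r} byte(s) left over"
        print(f"{samples}x: {bpp}B over {channels} channels -> {note}")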

I once saw a patent talk about memory tiling for graphics, but I've lost track of it.

G80 seems to use memory differently though. I've talked before about crossbars, caches and memory controllers - G80 and R5xx are quite different. It's my hypothesis that the ring-bus architecture uses memory channels symmetrically, while G80 uses a crossbar between L1 and L2 caches, seemingly making it less reliant on concurrent symmetric access to all memory channels.

As far as trace lengths go, it prolly only matters that the traces are equalised per memory chip (or set of memory chips sharing some subset of control lines). I suspect that each bus, to each separate memory chip, is adaptive - i.e. it tunes itself to the conditions it finds, and therefore doesn't care about the other chips. Another topic that's fairly dark.

I only ask because a 384-bit 3/4 R600 on 65nm would be a pretty exciting part to look forward to in H2 rather than one that would likely end up with a 256-bit bus (as I doubt it would end up with 512-bit, although possible.)
I dare say there's a trade between memory speed and bus width. e.g. 256-bit at 1.4GHz is 90GB/s. The way GDDR speeds ramp over its lifetime is a bit of a mystery to me, I dunno how soon 1.4GHz chips will be priced/available for midrange cards.

I suppose it all comes down to how much the architecture is tied together scalability-wise. Are the shaders/bus independent or connected in a certain scale...I suppose no one knows that yet though. :p
Generally, I'm extrapolating from what I observe to be the symmetry of R5xx's ring-bus. But as I said earlier, it's hard to argue against 3x32-bit channels per ring-stop, so I'm just hoping someone comes along with some insights...

Jawed
 
I don't think the ring bus uses memory more symmetrically or anything like that. IIRC, according to the R520 article, one of the reasons was to reduce hot spots from a centralized memory controller. Another reason, I think, is that it gets quite complicated to build a huge crossbar with many channels: with every channel added, the complexity grows rapidly (roughly quadratically for a full crossbar), and complexity limits speed. Also, the R520 memory channels are independent, even if they are on the same stop. I don't see any problems with data alignment which would prevent each ring stop from having 3 memory channels.
 
Erm, funnily enough, Nvidia has done a lot to surprise us. First with an early G70, which made ATI jump at trying to get R580 out (of course R520 had PCB problems which 'delayed' it)

I don't think G70 was any great surprise. Nice part and all, but no great surprise. If R520 had been out in April as expected, it'd have been a fine competitive situation. R580 was actually a bit late from the original schedule because of R520 being very late.
 
I don't think G70 was any great surprise. Nice part and all, but no great surprise. If R520 had been out in April as expected, it'd have been a fine competitive situation. R580 was actually a bit late from the original schedule because of R520 being very late.
Well, for some, G70's configuration was somewhat of a surprise :oops: But yeah, timescales put an extra twist on things.
 
Well, for some, G70's configuration was somewhat of a surprise :oops:

For G70? Or G71? G70 was the first high-end single-GPU bump up from NV40, wasn't it? (Yes, yes, SLI jumped in there... I know.) Roughly 14 months after NV40? I wouldn't say we all had it nailed, but there was no ZOMG factor anywhere on the order of G80, and the timing was hardly speedy (ATI having X850'ed 7 months before). Beefed up the second ALU, an unequal number of shader quads vs ROP quads (a capability the 6600GT had already previewed, I think), and transparency AA. All in all, I'd say decently within the standard deviation for a refresh a year out from the original family release. The big thing at the time is we were all a bit surprised they'd go that large a die on 110nm, as I recall (which seems pretty quaint now!).

I mean, if there was a strategic surprise by NV to point at in the last couple years before G80, it wouldn't be G70 that I'd point at. . . it'd be SLI.
 