AMD: R7xx Speculation

I don't get your math. If I multiply the stuff you mentioned, I come up with 4TB/s... That said, I don't understand the calculation either - why 16 texels per pixel? Shouldn't that be 4 for bilinear? In any case, I suspect even under somewhat bad conditions you'd usually only have 1 or so; bilinear (with mipmaps) tends to be perfect for texture caches. That still gives 128GB/s - meaning the chip doesn't have enough bandwidth for this anyway. Though DXT1 textures would only use 8GB/s, and DXT5 only 16GB/s...
I suspect for really good performance you'd want half the memory bandwidth as aggregate link bandwidth, with all textures split up (with some tiling pattern) between the two chips - meaning each chip would still have the same memory bandwidth as a single chip configuration (aside from pathological cases where all texture accesses from a chip go to the memory of the other chip). Though if you assume texture fetch doesn't consume that much bandwidth (after all, your ROPs probably want some too, and as said with compressed formats it should be much lower) maybe something like one fourth the bandwidth instead of half could be enough...
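A back-of-the-envelope sketch of that argument - the per-chip bandwidth and the texture share of total traffic below are just round numbers assumed for illustration, not specs of any real part:

[code]
# Rough sketch of the link-bandwidth argument (all figures are illustrative
# assumptions, not specs of any real chip).

mem_bw_per_chip = 100e9     # assumed local memory bandwidth per chip, bytes/s

# With textures tiled evenly across both chips' memory, roughly half of each
# chip's texture fetches land in the other chip's memory.
remote_fraction = 0.5

# Upper bound: texture fetches could in principle soak up all the local bandwidth.
link_bw_upper = mem_bw_per_chip * remote_fraction

# More likely: compressed textures plus ROP/vertex traffic mean texture fetches
# only account for, say, half the bandwidth, so only about a quarter crosses the link.
texture_share = 0.5
link_bw_lower = mem_bw_per_chip * texture_share * remote_fraction

print(f"link bandwidth if textures are bandwidth-bound: {link_bw_upper / 1e9:.0f} GB/s per chip")
print(f"link bandwidth if textures take ~half the traffic: {link_bw_lower / 1e9:.0f} GB/s per chip")
[/code]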

I think you answered your own question when you mentioned texture compression.

I mean, I don't know of any games using uncompressed textures all the time. How would they fit all that stuff on a DVD or a few CDs?
 
Yeah I saw those but weren't they made up by some Chinese website? ATi is obviously moving in this direction but I haven't seen anything indicating that we will see it in R700.

There's one simple reason for that: such a high level of inter-die integration would probably require significant architectural change. R600 was definitely an attempt at single-die supremacy, so I'm not expecting anything along these lines until AMD's next architecture rolls out.

Sort of.

Except R600 was to lay the groundwork for R700. R700 at the time was planned as a multi-chip solution.

One way to interpret this is that R600's ringbus could be considered the groundwork for inter-GPU communication and memory access when using multiple GPUs as a single monolithic (to the OS and any programs accessing it) GPU.

Or there may be other bits and bobs there that were used solely to lay the groundwork for future multi-GPU products.

Only time will tell where things go. And there's always the chance that any prior work done in hopes of producing a workable (i.e. non-AFR) multi-GPU that appears and behaves as a monolithic GPU will turn out to be a dead end.

Regards,
SB
 
A shared memory pool would utilize DMA and the existing memory interface infrastructure(s) present on each GPU. No additional hardware or separate traces would need to be run (beyond what is necessary to enable dual GPUs on a PCB, that is).
So what would connect them? I don't think making each chip a drop on the other's memory bus is realistic.
 
So what would connect them? I don't think making each chip a drop on the other's memory bus is realistic.

This is my point. Nothing needs to connect them! A shared memory pool (and a driver with great load balancing ability) is "all" that's necessary.
 
I don't get your math. If I multiply the stuff you mentioned, I come up with 4TB/s...
:oops: Sigh, I even attempted the calculation more than once, though now I'm getting 512GB/s :???:

That said, I don't understand the calculation either - why 16 texels per pixel? Shouldn't that be 4 for bilinear? In any case, I suspect even under somewhat bad conditions you'd usually only have 1 or so; bilinear (with mipmaps) tends to be perfect for texture caches. That still gives 128GB/s - meaning the chip doesn't have enough bandwidth for this anyway. Though DXT1 textures would only use 8GB/s, and DXT5 only 16GB/s...
ARGH, I meant 4 texels per pixel, not 16.

I was trying to come up with the worst-case texturing bandwidth: assuming that fp16 texels aren't compressed, and considering bilinear filtering with minification of at least 50% and no mipmapping, so that every pixel is filtered from 4 distinct texels. Anyway, it was a total bust :oops:
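For what it's worth, here's the arithmetic laid out; the 16 Gsamples/s fetch rate is only an assumption picked so the results line up with the figures quoted in this thread, not a real spec:

[code]
# Back-of-the-envelope texture bandwidth. The sample rate is an assumption
# chosen for illustration.

samples_per_sec = 16e9      # assumed bilinear filtering results per second

def texture_bw(bytes_per_texel, texels_from_memory_per_sample):
    """Bandwidth needed to feed the filter units, in GB/s."""
    return samples_per_sec * texels_from_memory_per_sample * bytes_per_texel / 1e9

# Worst case: uncompressed RGBA fp16 (8 bytes/texel), ~50% minification with
# no mipmapping, so every bilinear result touches 4 texels the cache has never
# seen before.
print(texture_bw(8, 4))     # -> 512 GB/s

# Friendlier case: mipmapped bilinear with well-behaved caches, roughly one
# new texel per result.
print(texture_bw(8, 1))     # -> 128 GB/s

# Compressed formats: DXT1 is 0.5 bytes/texel, DXT5 is 1 byte/texel.
print(texture_bw(0.5, 1))   # -> 8 GB/s
print(texture_bw(1, 1))     # -> 16 GB/s
[/code]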

I suspect for really good performance you'd want half the memory bandwidth as aggregate link bandwidth, with all textures split up (with some tiling pattern) between the two chips - meaning each chip would still have the same memory bandwidth as a single chip configuration (aside from pathological cases where all texture accesses from a chip go to the memory of the other chip).
Yeah I agree, textures would need tiling to help with load-balancing. The pathological case is possible even with a single GPU (i.e. reading all texels from just one memory channel instead of from all four, say).
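As a toy illustration of the sort of tiling I mean - the tile size and channel count below are pure assumptions, nothing known about R6xx/R7xx:

[code]
# Toy model: interleave texture tiles across two GPUs and, within each GPU,
# across its memory channels, so a linear walk over a texture never hammers
# a single channel (the single-GPU pathological case).

from collections import Counter

TILE = 64          # texels per tile edge (assumed)
CHANNELS = 4       # memory channels per GPU (assumed)

def owner(tx, ty):
    """Return (gpu, channel) owning the tile at tile coordinates (tx, ty)."""
    gpu = (tx + ty) & 1                     # checkerboard split between the two chips
    channel = (tx // 2 + ty) % CHANNELS     # spread tiles over each chip's channels
    return gpu, channel

# A horizontal walk across a texture touches both GPUs and all channels.
hits = Counter(owner(x // TILE, 0) for x in range(4096))
print(hits)
[/code]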

Though if you assume texture fetch doesn't consume that much bandwidth (after all, your ROPs probably want some too, and as said with compressed formats it should be much lower) maybe something like one fourth the bandwidth instead of half could be enough...
I assume that the RBEs are assigned to screen-space tiles so that they only use GPU-attached, not foreign, memory. So there should be no cross-GPU traffic relating to colour/Z/stencil operations. The only render-target-related cross-GPU traffic should occur when stitching together the tiles from the constituent GPUs.
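Putting rough numbers on that stitching step - the resolution, frame rate and pixel size here are assumed purely for illustration:

[code]
# Toy estimate of the cross-GPU render-target traffic implied above. Since
# each GPU's RBEs only ever touch the screen-space tiles that GPU owns, the
# only cross-GPU render-target traffic is stitching the final image together.

width, height = 1920, 1200    # assumed display resolution
bytes_per_pixel = 4           # assumed 32-bit colour, ignoring Z/stencil
fps = 60                      # assumed frame rate

frame_bytes = width * height * bytes_per_pixel
stitch_bytes = frame_bytes // 2      # each GPU holds roughly half the tiles

print(f"cross-GPU stitch traffic: {stitch_bytes * fps / 1e9:.2f} GB/s")
# -> ~0.28 GB/s, negligible next to the texture-fetch figures above
[/code]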

Jawed
 
This is my point. Nothing needs to connect them! A shared memory pool (and a driver with great load balancing ability) is "all" that's necessary.
A shared memory pool doesn't become shared by putting it on the PCB without any traces to it. Both need connections to the memory, either directly or through a bridge. As I said, putting both graphics chips on the same memory bus is probably not realistic.
 
A shared memory pool doesn't become shared by putting it on the PCB without any traces to it. Both need connections to the memory, either directly or through a bridge. As I said, putting both graphics chips on the same memory bus is probably not realistic.

I knew there was no way to phrase this with 100% technical accuracy. Please refer to previous posts in which I clearly mentioned trace routing ;)

A good example of a texture that is not compressed is an HDR (fp16) render target that's waiting to be post-processed (e.g. tonemapped).
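To put numbers on how big such a surface is compared with DXT-compressed textures of the same dimensions (the resolution is assumed just for illustration):

[code]
# Footprint of an uncompressed RGBA fp16 surface versus DXT-compressed
# textures of the same dimensions.

width, height = 1920, 1200    # assumed resolution
texels = width * height

print(f"RGBA16F: {texels * 8 / 2**20:.1f} MiB")     # 8 bytes/texel   -> ~17.6 MiB
print(f"DXT5:    {texels * 1 / 2**20:.1f} MiB")     # 1 byte/texel    -> ~2.2 MiB
print(f"DXT1:    {texels * 0.5 / 2**20:.1f} MiB")   # 0.5 bytes/texel -> ~1.1 MiB
[/code]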

Jawed

Notice I said "all the time" ;) I didn't mean uncompressed textures are never used, just that they are used with less frequency than their compressed brethren.
 
One way to interpret this is that R600's ringbus could be considered the groundwork for inter-GPU communication and memory access when using multiple GPUs as a single monolithic (to the OS and any programs accessing it) GPU.
The fully distributed memory system, based on a ring bus, appears to be fundamental to sharing memory across multiple chips. There is no single functional block that arbitrates reads or writes.

The partially distributed memory system of R5xx, with distributed reads but centralised writes, relies upon the central portion of the memory system to route writes to the requisite memory channel.

So I see the evolution from R5xx to R6xx as being about multi-chip as well as reducing the routing pain associated with a centralised block.
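Here's a toy model of what 'fully distributed' means to me - the stop count and address interleave are made up and none of this reflects actual R6xx internals - but it shows why a ring with no central arbiter looks like it could be extended across two chips:

[code]
# Toy model of a distributed, ring-based memory system: every requester and
# every memory channel sits on a ring stop, and a request simply hops around
# the ring to the stop that owns the target address. No central block sits in
# the path, which is what would make stretching the ring across two dies
# plausible. Stop count, interleave granularity and addresses are assumptions.

NUM_STOPS = 8      # assumed ring stops (could span two dies)

def owning_stop(address):
    """Assumed interleave: the address selects the memory channel / ring stop."""
    return (address >> 8) % NUM_STOPS       # 256-byte interleave granularity

def ring_hops(src_stop, address):
    """Hops a request travels on a bidirectional ring to reach its channel."""
    dst = owning_stop(address)
    clockwise = (dst - src_stop) % NUM_STOPS
    return min(clockwise, NUM_STOPS - clockwise)

# A texture unit at stop 1 reading a few addresses: requests fan out across
# the whole ring instead of funnelling through one central block.
for addr in (0x0000, 0x0400, 0x1200, 0x3F00):
    print(f"addr {addr:#06x} -> stop {owning_stop(addr)}, {ring_hops(1, addr)} hops")
[/code]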

In theory R700 is the full multi-chip implementation of this distributed memory system, but I have to admit I'm not gung-ho about it coming to fruition. Is multi-chip shared-memory just another Fast-14 kind of saga that'll run and run?

Jawed
 
I knew there was no way to phrase this with 100% technical accuracy. Please refer to previous posts in which I clearly mentioned trace routing ;)
Which again brings me back to what connects the two: are you seriously suggesting putting them on the same memory bus?
 
Notice I said "all the time" ;) I didn't mean uncompressed textures are never used, just that they are used with less frequency than their compressed brethren.
"all the time" is a completely pointless caveat, since I was trying to determine the worst-case :rolleyes:

Jawed
 
Which again brings me back to what connects the two: are you seriously suggesting putting them on the same memory bus?

Isn't this the point of the ring bus anyway?

"all the time" is a completely pointless caveat, since I was trying to determine the worst-case :rolleyes:

Jawed

Misunderstanding then. I didn't realize it was the worst case you were after. It wasn't until your 2nd post that you said as much.
 
Interesting. I will try to summarize:

So ShaidarHaran is suggesting a single memory bus with 2 GPUs, like the current C2D (with a shared cache for both cores) or maybe like multi-CPU architectures prior to the Athlon 64.

Others are suggesting a NUMA architecture like the Athlon 64, and the rest of the bunch (the boring and pessimistic view :p) thinks they will change nothing and R700 will be the same crappy AFR on a stick.
 
So, the crux of the question is, how to connect the two rings so that data from either GPU's memory can get anywhere?

Jawed

Cannot each GPU simply have traces routed to each memory bank? Why the need for extra logic? Sure it simplifies PCB design, but it introduces latency and likely becomes a bottleneck to throughput as well.

Obviously if R700 is an MCM (which I'm not saying it is) that would simplify matters greatly.
 
Interesting. I will try to summarize:

So ShaidarHaran is suggesting a single memory bus with 2 GPUs, like the current C2D (with a shared cache for both cores) or maybe like multi-CPU architectures prior to the Athlon 64.

Others are suggesting a NUMA architecture like the Athlon 64, and the rest of the bunch (the boring and pessimistic view :p) thinks they will change nothing and R700 will be the same crappy AFR on a stick.

Precisely. I actually thought of mentioning the C2D as an example, but it does have a crossbar for the purpose of cross-die communication (and thus sharing of data in the L2), so it's not a perfect example.
 
Cannot each GPU simply have traces routed to each memory bank? Why the need for extra logic? Sure it simplifies PCB design, but it introduces latency and likely becomes a bottleneck to throughput as well.

Obviously if R700 is an MCM (which I'm not saying it is) that would simplify matters greatly.
Are you suggesting that both GPUs have traces to all memory chips :oops: :?:

GDDR is point-to-point; I don't see how that could work.

Jawed
 
Yes, a shared ring is the likely approach here.
Somehow I don't see a shared ring making it all the way from one end of the PCB to the other, if the cooler pic that was leaked recently is to be trusted. That would pointlessly complicate clock management IMHO.
 