AMD: R7xx Speculation

no-X · May 2, 2008

Silent_Buddha said:
However, is GPU rendering actually latency tolerant enough to be able to deal with that?

Wouldn't it be sufficient to share memory just for textures to avoid their duplicity? Inter-die interconnection could be simpler.

Jawed · May 2, 2008

no-X said:
Wouldn't it be sufficient to share memory just for textures to avoid their duplicity? Inter-die interconnection could be simpler.

More and more graphics algorithms generate textures (e.g. shadow maps) or create vertex buffers (e.g. on-GPU animation/skinning) so they've got to bite the bullet.

Jawed

Silent_Buddha · May 2, 2008

Thanks for the replies both of you.

Realize I'm just an armchair onlooker.

If I'm understanding then. The latency for data to travel from say GPU1 to GPU2 is fairly insignificant compared to the latency required to actually access the data in say GPU1's connected NUMA pool of memory?

In other words, if GPU2 had to access data residing in GPU1's NUMA pool of memory, the latency involved isn't significantly higher than if GPU1 was accessing it?

All this, assuming IF something like this gets implemented.

Now another question. Would that be easier to implement or better than say some sort of GPU "northbridge" which connects to each GPU and then in turn connects to memory. Similar to how Intel CPU's still access memory in a multi CPU system?

I'm assuming that once you scale past 2 GPUs on a package, a NUMA architecture would be the obvious choice?

Regards,
SB

3dilettante · May 2, 2008

Silent_Buddha said:
If I'm understanding then. The latency for data to travel from say GPU1 to GPU2 is fairly insignificant compared to the latency required to actually access the data in say GPU1's connected NUMA pool of memory?

The trip between the GPUs would be measured in nanoseconds, which isn't much for a design meant to tolerate hundreds of cycles of latency.
The trip from memory controller to memory and back is the same with or without NUMA, so it's not an advantage for either.

The unknown is the contention for the memory controller, and the ring bus.
Ideal latencies for the ring bus and memory controller in a light-load situation are much better than when they are congested.

If both GPUs are congested equally, then the difference is muted, since the GPUs could load their memory controllers and ring buses equally even if they didn't share data.
If one GPU is slammed with 2 GPUs' worth of traffic, performance suffers worse, as one GPU's controller and ringbus take a nose dive and limits the entire system, whereas a more load-balanced alternative could have gotten by without the traffic jam.

Multiple hops beyond two sockets can change things, as more traffic and more hops can add up quite a bit, depending on implementation.

Silent_Buddha · May 2, 2008

Thanks 3dilettante. I've got a much better grasp of it now.

Would implmention of NUMA require a significant transistor budget? And would it be something that could be implemented into the existing memory controller or would it have to be, say, another level of memory control?

I believe I've read that the memory controllers for R(v)6xx are programmable (to what degree I have no idea). Is it possible for them to be configured to handle load balancing of the 2 memory pools? And if so, I'd imagine they would have to at least be beefed up a bit in order to handle the additional duties?

Regards,
SB

Jawed · May 2, 2008

Silent_Buddha said:
Thanks for the replies both of you. Realize I'm just an armchair onlooker.

Crappy old cane chair here...

If I'm understanding then. The latency for data to travel from say GPU1 to GPU2 is fairly insignificant compared to the latency required to actually access the data in say GPU1's connected NUMA pool of memory?

For texels, say, I think latency in R600 consists of:

generate request - e.g. cache system has to generate a real address based on paged virtual memory system
transmit request to memory controllers, which means constructing a packet that travels around the ring bus, contending with all other ring bus traffic and therefore subject to prioritisation- as well as distance-related latency
memory controllers have to queue and prioritise requests received
each memory controller issues a command to its DRAMs, trying to take account of the read-write/page/row status of the DRAMs, i.e. trying to minimise the differences in successive commands as difference=latency
DRAMs take time to turn-around the requested data, consisting of setting up addresses on pages/rows, fetching data and organising it for transmission
the memory controllers then have to submit returned data to the ring bus, again subject to packetisation/prioritisation/workload
texels arrive in L2
texels need to be transmitted from L2 into the various L1 lines ready to be used for actual filtering

So between GPUs 1 and 2 you have a longer ring-bus trip. The interface that joins the two GPUs prolly doesn't need to be terribly complex as the ring-bus itself already enforces all the prioritisations and packetisations required to transmit data.

The key thing about R600 is that the memory controllers are fully distributed and slaved to the cache controllers. This works because the cache controllers keep each other fully up-to-date on the allocation and cache-status of pages of memory, which basically means that the memory controllers are reduced to interfacing with DRAM. Though I think it's worth pointing out that there's prolly only one of each type of cache controller in R600, one for texel L2, one for read-write cache, one for colour/Z/stencil cache etc.

But this does mean that in a 2 GPU system there will be additional ring bus traffic caused by cache controllers talking to each other.

Well, that's my interpretation, anyway.

In other words, if GPU2 had to access data residing in GPU1's NUMA pool of memory, the latency involved isn't significantly higher than if GPU1 was accessing it?

Yep. And, I'm presuming that GPU1 could fulfill texel requests out of its own L2 if they happen to be there, say.

Now another question. Would that be easier to implement or better than say some sort of GPU "northbridge" which connects to each GPU and then in turn connects to memory. Similar to how Intel CPU's still access memory in a multi CPU system?

My pet theory is that ATI started on the ring bus idea because they wanted to go NUMA.

Jawed

nicolasb · May 2, 2008

The Inquirer is talking up ATI's prospects: http://www.theinquirer.net/gb/inquirer/news/2008/05/02/summer-bring-gpu-war

This section caught my eye:

You saw a little of that with the 3870X2, but the bridge was a simple PCIe switch. The real magic this time is a bridge that shares memory, GDDR5 in this case. Yup, you will have 2 GPUs with one set of memory.

This simplifies designs, lowers chip cost, and speeds time to market. You get two full variants for the design cost of 1.25, and you are on the happy end of the cost/area curve for fabbing silicon. While the early word on GT200 is that it is again 500mm^2+, ATI will have 2x chips that are much smaller, which translates into a huge cost advantage.

The other nice thing is that the bridge should keep the GPUs hidden from the system. This has a disadvantage of hard-wiring in the Crossfire modes leaving a little performance on the table, but when you have two of them in the system, it looks like two GPUs, not four. One look at the 1 -> 2 -> 4 scaling rates will show what a win that is.

Mat3 · May 2, 2008

The other nice thing is that the bridge should keep the GPUs hidden from the system.

So if true, does it mean they'll automatically work on the same frame, using their tiling to divide it up, presumably?

Sound_Card · May 2, 2008

Do we have a "Kentsfield" GPU here? I still find it fishy that ATi brought out a 512bit bus when R600 clearly did not need it. Perhaps serving a greater purpose further down the line and it's implementation on R600 was a lab test for say.

It's a great technological accomplishment. It allowed for the best cost per bandwidth to be achieved on our products. Now that we've proven it can be done, we can certainly decide to use this weapon again, as required. But I won't comment on future products ;-)

http://beyond3d.com/content/interviews/39/5

Basically, if you look at the architecture of any modern GPU, R5xx/6xx or G80, it comprises pretty modular units connected by a big interconnect. Imagine if the interconnect was more distributed like say an Opteron and HT, you could have four small chips instead of one big one.

http://www.theinquirer.net/en/inquirer/news/2006/11/17/big-gpus-are-set-to-die

4870 X2 is an interesting version. AMD did not send out any specs to its partners and it is expected the board will be a bit more than just a 3870 X2 two RV770 GPUs. ATI is said to be making some changes,

http://www.tomshardware.com/news/ati-radeon-4800,5223.html

Maybe I'm looking too hard into it, but for sure, R700 does not sound like a typical crossfire stick the way things are coming across.

Sound_Card · May 2, 2008

Jawed said:
My pet theory is that ATI started on the ring bus idea because they wanted to go NUMA.

Jawed

That would be something for sure.

nate240 · May 2, 2008

I remember this pic coming up a while ago. Isn't this kinda what you guys are discussing now?

Link

Karma Police · May 2, 2008

If the R700 is going to be seen as one GPU by the OS, then does that mean you could have up to 4 R700's (8 GPU's) in a single system (despite the impracticality)?

Lukfi · May 2, 2008

R700 having shared memory? Well that certainly would be something. Although I can't help but being skeptical. Not that it wouldn't make sense, it really does if multi-GPU is the future, but it would be a fundamental change... remember AMD's trying to play safe.

AlexV · May 2, 2008

Karma Police said:
If the R700 is going to be seen as one GPU by the OS, then does that mean you could have up to 4 R700's (8 GPU's) in a single system (despite the impracticality)?

No.

Lukfi · May 2, 2008

Why not? Technically, it would be possible on a motherboard with 4 PCIe x16 slots. You just couldn't run them in CrossFire.

ECH · May 2, 2008

It's odd after all this time they haven't announced a release date yet. Unless they plan on releasing July/August 2008 time frame...

Jawed · May 2, 2008

nate240 said:
I remember this pic coming up a while ago. Isn't this kinda what you guys are discussing now?

Link

Yeah.

But, sadly, it seems AFR is going to spoil the party.

Jawed

Sound_Card · May 2, 2008

Jawed said:
Yeah.

But, sadly, it seems AFR is going to spoil the party.

Jawed

Would AFR benifit from a closer proximity? How do we know it's AFR?

Jawed · May 3, 2008

Sound_Card said:
Would AFR benifit from a closer proximity? How do we know it's AFR?

http://www.techreport.com/articles.x/14284/2

While there are vague noises about some other alternative to AFR, I'm not going to put any store in them. AFR appears to be the focus.

Separately, I'm utterly bemused by the shirking of SuperAA for example - surely a good way to get some 4xMSAA performance...

Jawed

rwolf · May 3, 2008

Jawed said:
Yeah.

But, sadly, it seems AFR is going to spoil the party.

Jawed

Why? Think of all the common stuff you can share. It is significant. You avoid all sorts of duplication, memory allocation, loading objects, computing geometry etc.

AMD: R7xx Speculation

no-X

Jawed

Silent_Buddha

3dilettante

Silent_Buddha

Jawed

nicolasb

Mat3

Sound_Card

Sound_Card

nate240

Karma Police

Lukfi

AlexV

Heteroscedasticitate

Lukfi

ECH

Jawed

Sound_Card

Jawed

rwolf

Rock Star

Similar threads