R6xx Speculation - Implementation of Ring Bus for Multi chip interconnect.

Rys · Nov 1, 2007

PCI Express might not turn out to be the best bus implementation for inter-die comms where both dies are on the same package. I'd imagine the bus would be custom, wouldn't have to be one of the usual widths thrown around like 512 or 1024-bit, and could well be multi-tier/channel, even between just two GPUs.

_xxx_ · Nov 2, 2007

Custom bus for sure, I'll bet on it.

turtle · Nov 2, 2007

So...Does this come into play? Lends some weird kinda credence to that phony-looking rumor.

"In GDDR5 the transmitted data is secured via CRC (cycle redundancy check) using an algorithm that is well established within high quality communication environments like ATM networks," reads a Qimonda white paper on GDDR5 (PDF available here). "The algorithm detects all single and double errors with 100% probability."

This morning, Qimonda promised it could achieve a 3x performance boost over GDDR3 currently being shipped by, among other factors, doubling their bandwidth to 20 GB/sec per module. That "per module" is important, as it points to the possibility of simultaneous writes to multiple 512 Mb modules by multiple cores (Is anyone at Sony listening?).

One of Qimonda's early customers could be Intel, one of whose stealth projects is a GPU/CPU combo processor code-named Larabee. Engineers expect to see the first samples of this chip late next year, and although it would put Intel on a par with AMD in the graphics department, analysts are actually seeing it as an "nVidia killer." Larabee would reportedly require GDDR5.

http://www.betanews.com/article/New_Qimonda_GDDR5_Memory_Promises_3x_Performance_of_GDDR3/1193935014
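
To make the CRC claim in the Qimonda quote above concrete, here is a minimal Python sketch. It assumes the 8-bit polynomial x^8 + x^2 + x + 1 (0x07) used for ATM header error control and a 72-bit protected block; both are assumptions for illustration, not figures taken from the quoted article. The helper names (crc8, flip_bits, undetected_errors) are hypothetical.

```python
from itertools import combinations

def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, MSB first, zero initial value (polynomial is an assumption)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc

def flip_bits(data: bytes, positions) -> bytes:
    """Return a copy of data with the given bit positions inverted."""
    buf = bytearray(data)
    for p in positions:
        buf[p // 8] ^= 1 << (p % 8)
    return bytes(buf)

def undetected_errors(data: bytes, max_flips: int = 2) -> int:
    """Count error patterns of up to max_flips bits that leave the CRC unchanged."""
    good = crc8(data)
    nbits = len(data) * 8
    missed = 0
    for k in range(1, max_flips + 1):
        for pos in combinations(range(nbits), k):
            if crc8(flip_bits(data, pos)) == good:
                missed += 1
    return missed

block = bytes(range(9))  # hypothetical 72-bit block
print("undetected 1- and 2-bit errors:", undetected_errors(block))
```

Under these assumptions the exhaustive search finds no single or double bit error that slips past the checksum, which is consistent with the "100% probability" wording for a block this short.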

cadaveca · Nov 2, 2007

Um, yeah, multiple cores means that they can effectively split the datapath from 32-bit to 2x 16-bit channels, for double-core (that's GDDR5 cores, BTW) designs, not double-GPU designs (Qimonda refers to this as "clamshell"). Feel free to read the whitepaper on Qimonda's website to confirm, but here it is:

Graphics system designers expect GDDR5 standard to offer high flexibility in terms of frame buffer and bandwidth variation.
GDDR5 supports this need for flexibility in an outstanding way with its clamshell mode. The clamshell mode allows 32
controller-I/Os to be shared between two GDDR5 components. In clamshell mode each GDDR5 DRAM’s interface is reduced
to 16 I/Os. 32 controller I/Os can, therefore, be populated with two GDDR5 DRAMs, while DQ’s are single loaded and the
address and command bus is shared between the two components. Operation in clamshell mode has no impact on
system bandwidth.
Example: System configurations with 512M GDDR5 device using a controller with 256 bit interface:
A) 8pcs of 512M GDDR5 in standard mode ➔ Frame buffer: 512 MB
B) 16pcs of 512M GDDR5 in clamshell mode ➔ Frame buffer: 1 GB
Every GDDR5 component supports the clam shell mode. In this way, multiple frame-buffer variants can be built up using
only one component type which drastically reduces the number of different inventory positions and increases flexibility in
a very dynamic market environment.

http://www.qimonda-news.com/download/Qimonda_GDDR5_whitepaper.pdf
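
The whitepaper's two example configurations can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the excerpt above; the function name frame_buffer_config is hypothetical, and the 32 versus 16 I/Os per device are taken from the quoted text.

```python
def frame_buffer_config(bus_width_bits: int, device_mbit: int, clamshell: bool):
    """Return (device count, frame buffer in MB) for a given controller width."""
    ios_per_device = 16 if clamshell else 32    # clamshell halves each device's interface
    devices = bus_width_bits // ios_per_device
    total_mbyte = devices * device_mbit // 8    # megabits to megabytes
    return devices, total_mbyte

for clamshell in (False, True):
    devices, mbyte = frame_buffer_config(256, 512, clamshell)
    mode = "clamshell" if clamshell else "standard"
    print(f"{mode}: {devices} devices, {mbyte} MB")
```

The controller width stays at 256 bits in both cases, which is why the excerpt says system bandwidth is unaffected while capacity doubles.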

Per module is important only because each module supports both 32-bit and 16-bit datapaths, and obviously that bandwidth is far different when in 16-bit. Also of note is that the whitepaper states 5GB/sec per pin, not 20GB/sec per module.

Scalable clock frequency and data rate
GDDR5 allows the system to dynamically scale the memory I/O data rate according to the workload. The I/O data rate of
GDDR5 can be gaplessly varied from 5 Gbps down to 200 Mbps (50 MHz clock frequency). Towards lower frequencies the
PLL is turned off for additional power saving.

Gotta love people that think they can speedread, but can't. They could NOT have got it more wrong.
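
For reference, the per-pin and per-module figures in this exchange can be related with simple arithmetic. The sketch below assumes the 5 Gbps per-pin rate from the whitepaper excerpt and 32 data I/Os per device in standard mode (16 in clamshell); the function names are hypothetical.

```python
def module_bandwidth_gbytes(per_pin_gbps: float, data_pins: int = 32) -> float:
    """Peak device bandwidth in GB/s: per-pin rate times pin count, bits to bytes."""
    return per_pin_gbps * data_pins / 8

def bits_per_clock(data_rate_mbps: float, clock_mhz: float) -> float:
    """Bits transferred per pin per clock cycle."""
    return data_rate_mbps / clock_mhz

print(module_bandwidth_gbytes(5.0))       # standard mode, 32 I/Os
print(module_bandwidth_gbytes(5.0, 16))   # clamshell mode, 16 I/Os
print(bits_per_clock(200, 50))            # low end of the scaling range
```

Under those assumptions, 5 Gbps per pin across 32 pins works out to 20 GB/s per device, and 10 GB/s per device in clamshell mode.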

Jawed · Nov 2, 2007

The ATI patent applications on stuff for what I'm guessing are GDDR5 are interesting, though I'm not sure they're entirely relevant to this thread:

ASYMMETRICAL IO METHOD AND SYSTEM

WRITE DATA MASK METHOD AND SYSTEM

PROGRAMMABLE PREAMBLE SYSTEM AND METHOD

This is slightly more relevant:

GRAPHICS-PROCESSING SYSTEM AND METHOD OF BROADCASTING WRITE REQUESTS TO MULTIPLE GRAPHICS DEVICES

but still not really on topic.

Jawed

turtle · Nov 2, 2007

Thanks for hitting the beat and clarifying.

AlNom · Nov 2, 2007

Hm... I get errors going to those links (IE or Firefox).

:?:

turtle · Nov 2, 2007

Yep, me too.

BRiT · Nov 2, 2007

Weird. The first set of links didn't work for me. As soon as I did a search of my own, the original links started working.

If they're not working for you, try a QuickSearch first, then click on the original links.

turtle · Nov 2, 2007

Yes sir...weird, I clicked yours (before your edit) and it worked.

Yeah, it's working now. Must have been down? Weird. They were just B3D'ed.

Patents are once again very enlightening. Thanks gentlemen.

2900guy · Nov 2, 2007

Jawed said:
The ATI patent applications on stuff for what I'm guessing are GDDR5 are interesting, though I'm not sure they're entirely relevant to this thread:

ASYMMETRICAL IO METHOD AND SYSTEM

WRITE DATA MASK METHOD AND SYSTEM

PROGRAMMABLE PREAMBLE SYSTEM AND METHOD

This is slightly more relevant:

GRAPHICS-PROCESSING SYSTEM AND METHOD OF BROADCASTING WRITE REQUESTS TO MULTIPLE GRAPHICS DEVICES

but still not really on topic.

Jawed

Here's a copy of the entire PDF, with diagrams, for easy reading:

http://www.savefile.com/files/1165668

hoom · Nov 2, 2007

Rys said:
PCI Express might not turn out to be the best bus implementation for inter-die comms where both dies are on the same package. I'd imagine the bus would be custom

Maybe not the best but would probably be the cheapest & easiest to implement & with excellent compatibility in both drivers & hardware with other Crossfire modes.

Or... Does anyone know where the existing Crossfire bridge comes on & off the existing R5x0 & R6x0?
If it's got a direct link into the ring bus other than through the PCIe I/O, that would presumably be a prime candidate for linking an MCM.

Xmas · Nov 2, 2007

cadaveca said:
Per module is important only because each module supports both 32-bit and 16-bit datapaths, and obviously that bandwidth is far different when in 16-bit. Also of note is that the whitepaper states 5GB/sec per pin, not 20GB/sec per module.

Can someone with a bit of knowledge in memory technology explain the advantages of having two separate half-width datapaths with a shared address/command bus over sharing both address/command and data busses?

satein · Nov 2, 2007

Jawed said:
The ATI patent applications on stuff for what I'm guessing are GDDR5 are interesting, though I'm not sure they're entirely relevant to this thread:

ASYMMETRICAL IO METHOD AND SYSTEM

WRITE DATA MASK METHOD AND SYSTEM

PROGRAMMABLE PREAMBLE SYSTEM AND METHOD

This is slightly more relevant:

GRAPHICS-PROCESSING SYSTEM AND METHOD OF BROADCASTING WRITE REQUESTS TO MULTIPLE GRAPHICS DEVICES

but still not really on topic.

Jawed

Sorry, my post may be irrelevant, but the INQ reports that GDDR5 samples are shipping now!
http://www.theinquirer.net/gb/inquirer/news/2007/11/02/gddr5-samples-ship

The INQ said:
...
New to GDDR5 is the concept of cyclic redundancy checks, long used in comms and storage technology. Since there is so much data traveling through the extra bandwidth GDDR5 brings, checking it for consistency appears to be a key requirement.

Qimonda reckons that there will be a 3x improvement in performance over current GDDR3 memory, and bandwidth will be up to 20GB/sec. Crucially, the spec supports simultaneous writes to multiple modules, opening the possibility of parallel writes for extra speed...

So it may help support the ATI patents you found!

3dilettante · Nov 2, 2007

Here's a random possibility for a multi-die solution: something like the Smithfield P4 or what they do for large CMOS sensors that exceed reticle size.

Take two cores like RV670 and cut them out of the wafer in pairs.

Either the bus is a quick drop from one core down into the package and back up again, or they could stitch them together by linking them with data lines passing through one of the chip's metalization layers.

That's even less complicated than a dedicated MCM bus spec, the connections could be treated as a slightly slower connection off of a T stop on the ring bus.

For yields, it would be possible to fuse off the connections if one core is bad, or even saw the dies singly and fuse off the connections if you want single cores.

Mintmaster · Nov 2, 2007

Xmas said:
Can someone with a bit of knowledge in memory technology explain the advantages of having two separate half-width datapaths with a shared address/command bus over sharing both address/command and data busses?

I assume you're comparing two busses with a single shared, double width bus in the latter, right? I guess it's less complexity when 90% of the time you need to transfer data both ways anyway, and you care about worst case speed more than average case.

Latency is probably a tad less too, since instead of streaming a burst of data one way and then the other way, both can happen simultaneously, so the first piece of data arrives quicker.
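
The latency point can be illustrated with a toy timing model. This is only a sketch under stated assumptions: two equal bursts of 256 bits, a 32-bit shared bus versus two 16-bit halves, one transfer per cycle, and no turnaround or command overhead. The function names are hypothetical.

```python
def shared_bus(burst_bits: int, bus_width: int = 32):
    """Two bursts take turns on one full-width bus.
    Returns (start cycle of A, start cycle of B, cycle when both are done)."""
    cycles = -(-burst_bits // bus_width)   # ceiling division
    return 0, cycles, 2 * cycles

def split_bus(burst_bits: int, bus_width: int = 32):
    """Each burst gets its own half-width bus and both start immediately."""
    cycles = -(-burst_bits // (bus_width // 2))
    return 0, 0, cycles

print("shared:", shared_bus(256))
print("split: ", split_bus(256))
```

In this model both arrangements finish at the same cycle, but on the split bus the second burst delivers its first data without waiting for the first burst to complete.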

Mintmaster · Nov 2, 2007

Jawed said:
Also, an on-die bus has an advantage over a memory-chip (off-die) bus: addressing and data lines are not needed, it's all multiplexed. This means less pads to implement the on-MCM connections than connections to memory.

Yup, just like RSX connects to the XDR through Cell, or the Xenos parent-daughter connections. Addressing still consumes some of the bandwidth, but the longer the burst lengths are, the less relevant it becomes.
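
The burst-length point can be put as a simple ratio. The sketch below assumes a multiplexed link where each transfer spends a fixed number of beats on address/command before the data beats; the one-beat overhead and the function name are hypothetical.

```python
def payload_efficiency(burst_beats: int, overhead_beats: int = 1) -> float:
    """Fraction of link time carrying data when address/command shares the wires."""
    return burst_beats / (burst_beats + overhead_beats)

for burst in (1, 2, 4, 8, 16, 32):
    print(f"burst of {burst:2d} beats: {payload_efficiency(burst):.1%} payload")
```

As the burst grows, the fixed addressing cost is spread over more data beats, so efficiency approaches but never reaches 100%.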

cadaveca · Nov 2, 2007

Xmas said:
Can someone with a bit of knowledge in memory technology explain the advantages of having two separate half-width datapaths with a shared address/command bus over sharing both address/command and data busses?

As I quoted from the whitepaper, this allows configurations of both 512 MB and 1024 MB using the same bus interface, keeping in mind what Mintmaster said about burst length. Yes, there may be a penalty in bandwidth, but if your controller is designed the right way, this loss of bandwidth is almost a moot point.

However, when you take into account that the trace lengths for these ICs do not need to be exactly the same (as most GDDR implementations require now), this could possibly allow for a single interconnect to connect to two different buffers, so I guess that it might be possible to use this for multi-GPU work, but only with shared buffers. Each "node" could potentially write to two different ICs...and each IC could potentially have an interconnect from TWO NODES (16-bit from each), although I question the usefulness of this.

turtle · Nov 2, 2007

cadaveca said:
As I quoted from the whitepaper, this allows configurations of both 512 MB and 1024 MB using the same bus interface, keeping in mind what Mintmaster said about burst length. Yes, there may be a penalty in bandwidth, but if your controller is designed the right way, this loss of bandwidth is almost a moot point.

However, when you take into account that the trace lengths for these ICs do not need to be exactly the same (as most GDDR implementations require now), this could possibly allow for a single interconnect to connect to two different buffers, so I guess that it might be possible to use this for multi-GPU work, but only with shared buffers. Each "node" could potentially write to two different ICs...and each IC could potentially have an interconnect from TWO NODES (16-bit from each), although I question the usefulness of this.

I fail to see how it isn't useful. Say you have two GPUs, or rather two cores, say 64x4 or 32x8 each, with each 64-bit/32-bit bus connected to four/two memory chips, and each GPU given a 16-bit route to a shared buffer. How could that be bad? Sure, 16-bit is a slower connection, but at speeds like 3200 MHz, for example (and surely higher, as GDDR4 can hit at least 3200 MHz+), it could still be 6.4 GB/s to each chip, right? Wouldn't that give each GPU a rate of greater than 100 GB/s (like R600)? That x2 hardly sounds bad, although granted the memory would be shared.

As usual, I must be missing something about the finer intricacies of a memory bus.
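
The numbers in this post can be checked with a short calculation. This sketch assumes an effective data rate of 3.2 Gbps per pin (the "3200 MHz" figure), a 16-bit link from each GPU to each chip, and 16 chips per GPU so that the links add up to a 256-bit interface; the chip count and the function name are assumptions for illustration.

```python
def link_bandwidth_gbytes(width_bits: int, gbps_per_pin: float) -> float:
    """Peak bandwidth of one link in GB/s."""
    return width_bits * gbps_per_pin / 8

per_chip = link_bandwidth_gbytes(16, 3.2)   # one 16-bit clamshell link
chips_per_gpu = 256 // 16                   # hypothetical: 256-bit controller
per_gpu = per_chip * chips_per_gpu

print(f"per chip: {per_chip:.1f} GB/s")
print(f"per GPU:  {per_gpu:.1f} GB/s over {chips_per_gpu} chips")
```

Under those assumptions each GPU would see roughly 102 GB/s of peak bandwidth, in the same range as R600, though the calculation ignores any contention between the two GPUs for the same chip.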

cadaveca · Nov 2, 2007

I assume addressing will be the issue. With the "clamshell" mode, two ICs share the same 32-bit address space, split into two. What happens when both memory controllers want to address the same IC at the same time?
