FB-DIMM style architecture for GPUs?

DemoCoder

Veteran
I wonder if the IHVs are considering a memory architecture that utilizes standard GDDRn behind a serial-to-parallel chip, which might let them increase throughput per pin and reduce routing requirements for the PCBs. Latency will increase, but if they know the memory architecture they are targeting, they can tune the GPU to hide the latency (if it isn't that bad).

The idea is to avoid custom Rambus-style $$$ approaches: use off-the-shelf memory modules like FB-DIMM, but go with a serial packet interface instead.
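For a rough feel of the tradeoff, here's a minimal Python sketch of the pin budget; every rate and overhead in it is an illustrative assumption, not a real spec:

```python
# Very rough pin-budget sketch of the idea above: keep conventional GDDR
# behind serializer chips and talk to the GPU over fewer, faster lanes.
# All rates and overheads below are illustrative assumptions, not real specs.

bus_width_bits = 256        # conventional parallel GDDR data bus
gddr_pin_rate  = 2.0e9      # bits/s per data pin (assumed)
lane_rate      = 6.4e9      # bits/s per serial lane (assumed, FB-DIMM-ish)
encoding_eff   = 0.8        # fraction of lane rate left after packet overhead

target_bw = bus_width_bits * gddr_pin_rate          # ~512 Gbit/s aggregate
lanes     = target_bw / (lane_rate * encoding_eff)  # ~100 lanes

# Serial lanes are usually differential (2 pins each), so ~100 lanes is
# ~200 pins -- not a huge saving over 256 single-ended data pins, which is
# roughly the objection raised later in the thread.
print(round(lanes))
```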
 
What started you down that road?

I've been wondering if the bloody next-gen chips are going to be big enough to make 512-bit a technical possibility (whether it's an economic win would be the secondary question). Some people have been talking about 400mm2. If you can do 256-bit at less than 200mm2 (G71), then I don't see why you couldn't do 512-bit on twice that. Though it would also be fair to ask if they intend to stay at that kind of size, and if not, whether they'd be willing to spend the marketing muscle to "gee whiz" 512-bit now, and then the marketing muscle later to "meh" it if die sizes come down and it's no longer practical.

Of course, if they are closer to 300mm2, then flush the above in all likelihood.
 
While Rambus had a somewhat flawed market strategy, the idea is still valid. All the latest developments confirm that serialization of buses is the future (SATA, PCI-E, FB-DIMM, etc.), and I bet all the major players are considering this scenario, although they would probably integrate the Advanced Memory Buffer controller right on board.

An eDRAM-based framebuffer would be even better for latency, but it looks like there's a curse on any PC chip that utilizes it, as indicated by the Rendition V4400 and the many BitBoys projects...
 
DmitryKo said:
An eDRAM-based framebuffer would be even better for latency, but it looks like there's a curse on any PC chip that utilizes it, as indicated by the Rendition V4400 and the many BitBoys projects...
I'm personally interested in Z-RAM in the 65 or 55nm era. Basically, it's much, much more dense than eDRAM, and often faster in practice too. And unlike eDRAM, it doesn't lag by one process generation at some fabs. The one "disadvantage"?

It's based on a side effect of SOI, so you need to manufacture your chips on an SOI process. That results in higher performance and lower power and heat, but also higher costs. And TSMC currently isn't the biggest fan of SOI, having only recently introduced it and with very little apparent enthusiasm.

Uttar
 
If more bandwidth is required wouldn't it be much simpler to increase the bit prefetch in the DRAM arrays and use quad or octuple data signaling per pin?
 
If more bandwidth is required wouldn't it be much simpler to increase the bit prefetch in the DRAM arrays and use quad or octuple data signaling per pin?
What if you're stuck designing around an older DRAM standard which doesn't help you, both in core DRAM rate (core clock or prefetch) and pin signal rate? The only feasible solution if you want massive bandwidth is to make the bus wider.
 
An eDRAM-based framebuffer would be even better for latency, but it looks like there's a curse on any PC chip that utilizes it, as indicated by the Rendition V4400 and the many BitBoys projects...
...which really isn't a curse per se, but had to do solely with the extremely limited number of partners who had fabs capable of producing an ASIC with CMOS + eDRAM on one and the same die, because the libraries back then were all different and not cross-compatible with "standard" CMOS designs. There is also more than one implementation of eDRAM in existence (Infineon's differs from NEC's, for example), which also limited their capacity back then. It's the designs that were flawed, because they didn't take the high risks and feasibility into account, not eDRAM per se.

ATI took a very smart approach in that they kept it very flexible: they put the eDRAM outside their main logic die (which isn't a standard approach), which basically leaves them free to either shrink it on NEC's next-gen UX7LSD in the future or integrate it on the same die as process technology advances further, though I doubt the latter.

In that context, I don't think eDRAM will ever go "mainstream" in the sense of being used en masse on desktop GPUs. With Z-RAM on the horizon, there are just too many positives speaking in its favour. It's already validated at 90nm (TSMC and Freescale), and its density grows with process technology, so it's already more feasible at 90nm and becomes even more so in the future, which gives you the advantage of a smaller die area and some overall die cost improvements. However, it always requires SOI to work (roughly 10% higher costs, but offset by performance and yields), so we'll have to see how IHVs cope with it, if at all, because there isn't that much SOI expertise in the GPU (IHV and semi) world.
 
I wonder if the IHVs are considering a memory architecture that utilizes standard GDDRn behind a serial-to-parallel chip which might let them increase throughput per pin and reduce routing requirements for the PCBs. Latency will increase, but if they know the memory architecture they are targeting, they can tune the GPU to hide the latency (if it isn't that bad)
Such a conversion will deliver the worst of both worlds. It will give you all the routing headaches between the DRAM and the converter chips, only worse because the converter won't have much logic to begin with and will hence be entirely pad/package limited.

It will also give you some latency overhead.

This really only gets interesting if you integrate the serial interface directly into the DRAM chips (or packages).

You can do something like this on a pluggable module as long as you have space for the traces, because you'll simplify the path over the mobo which is more critical to begin with, but a graphics card isn't a base board plus modules. It's just one seamless design. Different rules apply.
 
It will be interesting if a GPU maker takes the approach that ATI did with Xenos, but using Z-RAM instead. The problem, of course, is that to avoid tiling with 4xMSAA at 1080p you need a framebuffer in the 64MB range. Of course, Z-RAM is denser than DRAM and will scale, so maybe this will be a good tradeoff in the future: dedicate some logic area to Z-RAM to substantially increase framebuffer bandwidth.

Seems like Xenos would be a good test case for ATI to identify real-world bottlenecks and issues with small/fast framebuffers, so hopefully they will explore this in future desktop GPUs when the tech becomes feasible.
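As a quick sanity check of that 64MB figure (assuming uncompressed 32-bit colour plus 32-bit Z/stencil per sample; real hardware compresses, so treat it as an upper bound):

```python
# Rough framebuffer sizing for the 1080p + 4xMSAA case mentioned above.
width, height    = 1920, 1080
samples          = 4        # 4x MSAA
bytes_per_sample = 4 + 4    # assumed RGBA8 colour + 32-bit Z/stencil

size_mib = width * height * samples * bytes_per_sample / 2**20
print(size_mib)             # ~63 MiB, i.e. the quoted "64MB range"
```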
 
If you can do 256-bit at less than 200mm2 (G71), then I don't see why you couldn't do 512-bit on twice that.

Mathematics. ;)
Doubling the area of your die only increases the perimeter by about 40%, since the perimeter scales with the square root of the area (sqrt(2) ≈ 1.41). The IO pads usually still sit along the edge of the die; the pads in the interior are mainly used to supply power.
 
Ah, that's interesting. Maybe I should go back and look at the move from 64-bit to 128-bit, to compare with the move from 128-bit to 256-bit re die size.
 
And following on from what geo said - errr, why is the smallest 128-bit chip at 100mm² (RV515), when the smallest 256-bit chip is at less than 200mm² (G71)? Sure, you can claim that it's really the absolute limit for G71, and that 128-bit could work on an 80mm² chip, but even that isn't 4x... Unless you claim all chips limited to 64-bit could in fact support 128-bit if the pins were added, and that it's just a cost-reduction thing because they didn't think any AIB would want such a configuration, basically?

Uttar
 
And following on from what geo said - errr, why is the smallest 128-bit chip at 100mm² (RV515), when the smallest 256-bit chip is at less than 200mm² (G71)? Sure, you can claim that it's really the absolute limit for G71, and that 128-bit could work on an 80mm² chip, but even that isn't 4x... Unless you claim all chips limited to 64-bit could in fact support 128-bit if the pins were added, and that it's just a cost-reduction thing because they didn't think any AIB would want such a configuration, basically?

In addition to not scaling linearly with the size of the die, there's also an initial offset that must be overcome: even with no external memory interface, you'd still need real estate on the side of the die for PCI/AGP/PCIe, video DACs, PLLs, test pins etc. All of those together can easily account for, say, 200 equivalent IO pins. Adding 30 pins or so per 64 bits for address and control(*) means going from 128 to 256 is really going from 158 to 276. But with the 200 additional IO pins, it's more like going from 358 to 476 pins: a 32% pin increase instead of 100%. In terms of area, 1.32^2 = 1.76. Less than double the size of the die, which corresponds nicely with the 200mm2.

Going from 256 to 512 means going from 286 to 572 interface pins, which with the overhead is really going from 486 to 752 pins, or a 55% overall pin increase. 1.55^2 = 2.39. If a current die needs 200mm2 for a 256-bit bus, you'd need a die of 478mm2 to fit your 512 bits.

(All rough estimates, of course, but you get the idea.)

(*) 30 pins: assuming shared controls per 64 bits. The ATI marketing material for the 580 claims unique control per 32-bit DRAM, so the number of additional pins would be higher.
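For anyone who wants to play with the numbers, here's the same back-of-the-envelope estimate in Python, reusing the rough pin totals from the post above (the 200-pin overhead and per-channel control counts are ballpark assumptions, not measured figures):

```python
# Pad-limited scaling model: pins sit on the die perimeter, so the total pin
# count scales with the perimeter and the die area with its square.

def scaled_area(base_area_mm2, base_pins, new_pins):
    """Die area needed if the pad count sets the die perimeter."""
    return base_area_mm2 * (new_pins / base_pins) ** 2

# 128-bit (RV515-class, ~100mm2) -> 256-bit, using the pin totals above
print(scaled_area(100, 358, 476))   # ~177 mm^2, in line with G71's <200mm2

# 256-bit (~200mm2) -> 512-bit
print(scaled_area(200, 486, 752))   # ~479 mm^2, well above the naive 400mm2
```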
 
I wonder if the IHVs are considering a memory architecture that utilizes standard GDDRn behind a serial-to-parallel chip, which might let them increase throughput per pin and reduce routing requirements for the PCBs. Latency will increase, but if they know the memory architecture they are targeting, they can tune the GPU to hide the latency (if it isn't that bad).

The idea is to avoid custom Rambus-style $$$ approaches: use off-the-shelf memory modules like FB-DIMM, but go with a serial packet interface instead.

Well, GDDR4 gets us up to 2.5 Gbits per second per wire (pin). While I know of some technology that can perhaps double that (such as Rambus), there's no way it could multiply the BW by 32 at this time, assuming a x32 memory device. As well, you would need a serializer on the output side (easy, as it's on the ASIC), but the receiving side is really hard -- i.e. a receiver per DRAM chip; that would be much more costly, and probably a significant routing problem.

While routing density on the board can be an issue, the board costs and densities are generally not the biggest issue.
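To put rough numbers on that, here's a small sketch; the "doubled" serial rate is an assumption taken from the post, not any real part's spec:

```python
# Why a single serial link can't replace the 32 data pins of a x32 GDDR4
# device, per the figures in the post above.

gddr4_pin_rate_gbps = 2.5                 # per data pin (from the post)
pins_per_device     = 32                  # x32 GDDR4 device
serial_lane_gbps    = 5.0                 # assumed "perhaps double" rate

device_bw_gbps = gddr4_pin_rate_gbps * pins_per_device   # 80 Gbit/s per device
lanes_needed   = device_bw_gbps / serial_lane_gbps       # 16 lanes to break even

print(device_bw_gbps, lanes_needed)
# ...and that's before packet/encoding overhead or the cost of a receiver
# per DRAM chip.
```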
 
I brought up the possibility of using XDR or the like long ago as a necessity for the future. We're running out of pins, after all...
 
I brought up the possibility of using XDR or the like long ago as a necessity for the future. We're running out of pins, after all...

Rambus clearly says they are targeting graphics, so I would assume they at least were allowed to make their pitch to ATI/NV re desktop. Be interesting to know what kind of reception it was given, and on what grounds.
 