A GPU's main problem is that it's a high-power device, drawing far more than consumer CPUs - feeding the core with power puts the squeeze on the pins available for I/O.
The one or more hub chips should be low power, so the balance of I/O to power should keep the size of those chips down. Secondly, the GPU<->hub I/O needs fewer pins than a GDDR interface does, which means more pins for power and/or a smaller GPU.
GDDR5 doesn't even use differential signalling. NVidia can make an entirely proprietary interconnect, and do what it likes to get the signalling it wants.
The alternative is multiple small hub chips, e.g. 4x 128-bit. The perimeter is 19mm of GDDR5 and 10mm of GPU connection per chip, ~30mm in total. The area would be, say, 19mm² of GDDR5, 10mm² of GPU connection and perhaps 7mm² of MC on 55nm, meaning the entire chip is <40mm². Each hub chip would be something like 4x11mm, say.
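To make that back-of-envelope arithmetic easy to check, here's a quick sketch using the figures above (all of them assumptions from the post, not measured numbers):

```python
# Rough hub-chip sizing, using the figures quoted above (all assumptions).
gddr5_edge_mm = 19       # die edge needed for the 128-bit GDDR5 interface
gpu_link_edge_mm = 10    # die edge needed for the GPU<->hub link
gddr5_area_mm2 = 19      # pad/PHY area, GDDR5 side
gpu_link_area_mm2 = 10   # pad/PHY area, GPU side
mc_area_mm2 = 7          # memory-controller logic on 55nm

io_edge_mm = gddr5_edge_mm + gpu_link_edge_mm                    # ~29-30 mm of I/O edge
content_mm2 = gddr5_area_mm2 + gpu_link_area_mm2 + mc_area_mm2   # ~36 mm² of content

# A 4 x 11 mm rectangle gives 2*(4+11) = 30 mm of edge and 44 mm² of die,
# which comfortably wraps the ~36 mm² of content estimated above.
print(io_edge_mm, content_mm2, 2 * (4 + 11), 4 * 11)
```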
Then there's the question of whether the ROPs go on the hub chip too, making room for vastly more ALUs. The Tesla (or should that be Fermi?) variant would then have hubs that are ROP-less and support ECC.
Why would you need such high speeds? I thought the whole point was that for an equivalent effective width you need fewer physical pins when using a proprietary differential signalling bus. So a 512-bit custom bus will actually be much smaller than a 512-bit GDDR5 interface. But it can run at the same 4.5GT/s speed as GDDR5 does; it'll just take up less space.
Differential signaling requires 2x the number of pins/wires per signal compared to single-ended. If you ran the custom differential interface at the same frequency, you would need ~2x the number of wires and pins for the signaling. You need to run the differential interconnect significantly faster (~2x) just to break even on pin count.
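A minimal sketch of the pin-count arithmetic behind that break-even claim, using the 512-bit/4.5 Gb/s figures from this discussion as the example:

```python
# Signal-pin count needed for a given bandwidth, single-ended vs differential.
def signal_pins(total_gbps, per_lane_gbps, differential):
    lanes = total_gbps / per_lane_gbps               # one lane per bit stream
    return int(lanes) * (2 if differential else 1)   # a differential pair costs two pins

target_gbps = 512 * 4.5                                     # 512-bit bus at 4.5 Gb/s/pin
print(signal_pins(target_gbps, 4.5, differential=False))    # 512: single-ended, GDDR5-style
print(signal_pins(target_gbps, 4.5, differential=True))     # 1024: same rate, twice the pins
print(signal_pins(target_gbps, 9.0, differential=True))     # 512: 2x the rate only breaks even
print(signal_pins(target_gbps, 18.0, differential=True))    # 256: ~4x the rate halves the pins
```

On that arithmetic the proprietary link only starts saving pins once it runs well past GDDR5 rates, which is the point being made.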
If that's the case then that patent only makes sense if you're trying to access a lot of very slow memory. Probably a dead end.
The basic premise of the patent was an IC that converted a parallel DRAM interface to a proprietary serial one.
Is this distinct from the FB-DIMM concept, other than the fact that Nvidia pasted "GPU" over the section that would have been a CPU?
Yeah, parasitic power is unquestionably an issue. If hubs are used, that might mean that laptop chips (e.g. 128-bit or 64-bit memory bus) don't use hubs, and that it's reserved for 256/512-bit chips.

The hub chips would actually have fairly high power density. High speed I/Os aren't known for being low power, and what you are essentially doing is doubling the number of high power I/Os on these hub chips. So you are actually looking at the range of 25-50W of purely parasitic power being added to the design.
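For a sense of where a figure in that range could come from, here's a crude energy-per-bit estimate; the 10-20 pJ/bit values are purely illustrative assumptions, not measurements of any real interface:

```python
# Crude estimate of the parasitic power an extra high-speed hop could add.
# The energy-per-bit figures below are illustrative assumptions only.
bandwidth_gb_per_s = 288                  # e.g. 512 bits x 4.5 Gb/s per pin / 8
bits_per_second = bandwidth_gb_per_s * 8e9

for pj_per_bit in (10, 20):               # assumed I/O energy for the extra hop
    watts = bits_per_second * pj_per_bit * 1e-12
    print(f"{pj_per_bit} pJ/bit -> ~{watts:.0f} W of added I/O power")   # ~23 W and ~46 W
```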
~30mm of I/O would require a ~8mm x 8mm hub chip. 64mm² x 4 is a lot of area.
The sideport on RV770 is the same bandwidth as PCI Express - it is seemingly nothing more than a 16-lane version 2 PCI Express port dedicated to inter-GPU communication. It's 38% of the area of the main PCI Express port, which appears to be a direct result of the less demanding physical environment required for the connection. That seems to imply substantially lower power consumption, too.

There are real limits. Considering the frequencies of GDDR5, it would be fully bleeding edge to do an interconnect at 2x GDDR5 frequencies, and it would likely require differential signaling, which means you are right back where you started as far as pin counts. For it to make sense, you would need to run the GPU<>HUB signaling at a minimum of 4x GDDR5 frequencies, a range still in the very early stages of advanced research at this point.
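For scale, a rough comparison of a PCI Express 2.0 x16-class sideport against a wide GDDR5 memory interface, using the 512-bit/4.5 Gb/s numbers from this thread:

```python
# Bandwidth scale: a PCIe-2.0-x16-class sideport vs a wide GDDR5 memory interface.
pcie2_lane_gb_per_s = 5 * (8 / 10) / 8          # 5 GT/s with 8b/10b coding -> 0.5 GB/s per lane
sideport_gb_per_s = 16 * pcie2_lane_gb_per_s    # x16 link: ~8 GB/s per direction

gddr5_gb_per_s = 512 * 4.5 / 8                  # 512-bit bus at 4.5 Gb/s/pin -> 288 GB/s

print(sideport_gb_per_s, gddr5_gb_per_s, gddr5_gb_per_s / sideport_gb_per_s)   # 8.0 288.0 36.0
```

A sideport-class link is fine for inter-GPU traffic, but a GPU<->hub link has to carry full memory bandwidth, which is the gap behind the "real limits" point above.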
You mean volumes aren't there for an ECC variant? If the scalable memory buffer is viable for Nehalem-EX, then why isn't an ECC-specific version viable for NVidia? Why should NVidia's volumes over 3 years, say, be painfully expensive? Particularly if, say, a 16GB ECC board can sell for $8000.

The volumes aren't there.
You seem to be suggesting that this kind of power will require a square chip, 8x8mm, as opposed to the rectangular 11x4mm that I was suggesting. And that stacked-I/O, as MfA was suggesting, wouldn't be an option, either.
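Making the square-versus-rectangle point explicit, with the same illustrative dimensions:

```python
# Edge length vs die area for the two hub-chip shapes being debated.
def edge_and_area(w_mm, h_mm):
    return 2 * (w_mm + h_mm), w_mm * h_mm

print(edge_and_area(8, 8))     # (32, 64): ~30mm of I/O fits, but costs 64 mm² of die
print(edge_and_area(4, 11))    # (30, 44): nearly the same I/O edge with ~30% less silicon
```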
The I/O for hub communication should benefit substantially from a constrained physical environment, too, in comparison with GDDR5 I/O. I guess some pins in GDDR5 that are power/ground would be traded for signal pins in a differential configuration too.
Just to put some numbers to it ... what does an extra substrate layer add to the cost of an AMB sized chip?

Stacked-I/O requires more package layers resulting in higher cost and worse signaling.
Yeah, it seems the pricing would enforce this as a solution for 512-bit GPUs only, which are clearly meant to leave NVidia at >$100 a piece.

Because the Nvidia solution is going to have to sell for <$100 total for the vast majority of the volume! There are different realities for something with an ASP of ~$100 and something with an ASP of >$1000.
GDDR5 is currently available in 6Gbps form, and is supposed to reach ~7Gbps.

It's going to have to do something. Given the physical constraints they might be able to push GDDR5 to around 6 or so GT/s, but then they'll likely be out of signaling margin. A reasonable solution is probably to go to something like a 5-7 GT/s SiBi signaling technology.
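For reference, the peak bandwidths those data rates would imply on a few example bus widths:

```python
# Peak bandwidth implied by the quoted GDDR5-class data rates (example bus widths).
for bus_bits in (256, 384, 512):
    for rate_gbps in (5.0, 6.0, 7.0):
        print(f"{bus_bits}-bit @ {rate_gbps} Gb/s/pin = {bus_bits * rate_gbps / 8:.0f} GB/s")
```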
SINCE WE BROKE the news about Nvidia's GT300 and its huge die size, more about the family has come to light. It will have some ankle-biting offspring in short order.
Yes, the oft-delayed monster is going to have children, four of them. For the sharp-eyed, this is more than Nvidia has done at any time in the past. While none of them have taped out yet, our moles say that Nvidia usually waits until they get the first silicon back to see if there are any changes that need to be made to subsequent designs. With mask sets running into the 7 figures, this is what you call a smart move.
If the hub is used to make the GPU smaller (not even sure if this would be by a significant degree), there'd be some power saving arising from the smaller GPU.
I am inclined to agree that this concept may have arisen in an era that didn't contemplate GDDR5, e.g. when GDDR4 signalled "not much performance gain" over GDDR3, or that it was purely defensive. The sheer performance of GDDR5 may undo this.
Will GDDR5 go differential in a couple of years' time?...
Jawed
Part of the GDDR5 interface is dedicated lines for error detection. Detected errors cause a re-transmission attempt and may be used as a signifier to kick off re-training to adapt to varying voltages/temperatures.

If you have ECC, I wonder if you need to worry about your MC<>DRAM interconnect a bit more and use something more robust than GDDRx?
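The protection being described is link-level detection with retry rather than end-to-end correction, which is easy to illustrate with a toy CRC check. The CRC-8 polynomial 0x07 below is the one commonly cited for GDDR5's EDC, but treat the burst size and framing as illustrative assumptions:

```python
# Toy link-level error detection: CRC over a burst, retry on mismatch.
def crc8(data: bytes, poly: int = 0x07) -> int:   # poly = x^8 + x^2 + x + 1
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

burst = bytes(range(32))                      # pretend this is one write burst
sent_crc = crc8(burst)

corrupted = bytearray(burst)
corrupted[5] ^= 0x10                          # single-bit error on the wire
assert crc8(bytes(corrupted)) != sent_crc     # detected -> the controller retries,
                                              # but nothing here corrects data at rest
```

Detection plus retry covers the wire; ECC in the controller or DRAM is what covers the stored data, which is the distinction the question above is poking at.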
GDDR5 was likely designed with a notion that bit errors are more acceptable than they are in the CPU/GPGPU world.
Yeah, that's a serious problem.

GDDR5 was designed to be single ended for backwards compat. with GDDR3/4. I suspect GDDR6 may need diff. signaling.
However, figuring out how to support both high performance/high price DRAM and lower price DRAM for midrange/low-end systems on the same memory controller is very tricky.
I suspect the basic electrical differences between single-ended and differential will be a problem when looking at the interfacing from pads to pins and the types of pins. Unless there was a way to configure some pad/pin combinations in the interface to switch from being power/ground for single-ended to signalling for differential.

Maybe once GDDR5 is low price, the spec could be extended to be differential... and then diff. GDDR5 could be the 'cheap' DRAM for a GDDR6 memory controller. But I don't know whether anyone would like that, or if it's remotely feasible when you consider the actual trade-offs involved.
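Purely as a thought experiment on that pad-reuse idea (nothing here reflects a real interface definition), a sketch of a pad map that reassigns alternate power/ground positions to the complement legs of differential pairs in a hypothetical differential mode:

```python
# Hypothetical dual-mode pad map: single-ended mode uses alternate pads as
# power/ground returns; the imagined differential mode reuses those positions
# for the inverted leg of each pair. Purely illustrative.
from enum import Enum

class PadRole(Enum):
    SIGNAL = "signal"               # DQ, or the true leg of a pair
    POWER_GROUND = "power/ground"   # supply/return in single-ended mode
    COMPLEMENT = "diff complement"  # inverted leg in differential mode

def pad_roles(num_pads: int, differential: bool):
    roles = []
    for i in range(num_pads):
        if i % 2 == 0:
            roles.append(PadRole.SIGNAL)
        else:
            roles.append(PadRole.COMPLEMENT if differential else PadRole.POWER_GROUND)
    return roles

print([r.value for r in pad_roles(8, differential=False)])
print([r.value for r in pad_roles(8, differential=True)])
```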