Nvidia GT300 core: Speculation

Is that the main cause of the quick demise of GDDR4? I was under the impression that it was more a mix of various factors, including the lack of support from nVidia, a relatively short lifetime before GDDR5's introduction, and higher-than-predicted scaling of GDDR3. BTW, I was surprised to find that quite a few HD4670 and HD4850 cards were released with GDDR4 memory; I thought that after R6xx ATi/AMD would have removed support for it from the memory controllers, but apparently they didn't.

The differentials to support GDDR4 over GDDR3 are minimal. The reality is that GDDR4 was really just an enhanced variant of the basic GDDR3 tech.
 
These arbitrary shapes allow a reduction in the ball count, too, e.g. a power pin on the substrate can feed multiple pads on the chip surface, instead of requiring that each pad has a dedicated ball. Additionally the solder reflow process enforces a minimum spacing between balls, so by reducing the count of balls required to deliver ground and power connections, you can move the pads closer together. Power isn't a problem because the under-bump metallisation layer can be beefy, considerably more so than metal layers within the chip.
Does reducing the ball count mean that each ball individually is tasked with providing more amperage?
I don't know the behavior of solder bumps, but the delivery of a large amount of clean current (even as voltage is decreased) was cited as one reason why pinouts have increased so much. If there is not a 1:1 relationship, is there a possible stressor on the solder, like extra heat?
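For a feel of the numbers, here's a back-of-the-envelope sketch: if the package power is fixed, fewer power/ground balls means proportionally more current per ball. The power, voltage and ball counts below are made-up illustrative figures, not anything from a real GPU.

Code:
# Illustrative only: assumed package power, supply voltage and ball counts.
package_power_w = 150.0          # assumed power delivered through the package
core_voltage_v = 1.1             # assumed core supply voltage
total_current_a = package_power_w / core_voltage_v

for power_balls in (600, 400, 200):   # hypothetical power-ball counts
    per_ball = total_current_a / power_balls
    print(f"{power_balls} power balls -> {per_ball:.2f} A per ball")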

Apart from the fact that memory performance is desperately behind other technology scalings, no, there's no reason for the cycle to get faster.
The DRAM industry has been content to treat performance as a secondary or tertiary concern to capacity and affordability so far.

What boutique memory was Qimonda betting the farm on?
GDDR5, well, part of the farm.
There was also the buried word line technology they were working on, which probably would have been far more significant in helping it survive bankruptcy than the GDDR5 standard that other manufacturers were already making.

Now that Intel has woken up and discovered the need for efficient memory interfaces on consumer CPUs (as core counts race towards oblivion), maybe there'll be a new dance floor.
The criteria for efficiency in that environment are different.
It's also that the interests of a memory manufacturer don't mesh with the interests of a memory consumer.
If GDDR5 were something Qimonda had more exclusive control over, it would have been more of an asset for possible insolvency investors.
Since bigger players were already ramping it, the value was less.
AMD's interest was in getting a wide manufacturing base to lower cost and as such would have restricted the margins of the manufacturers.
 
I don't think Rambus has any announced design wins. I think my main point was that they can use a bog-standard 65nm process and a pretty simple board to hit 12.8 Gbps.

GDDR5 only ships at pretty low frequencies now, although there have probably been some high speed demos.
Well, that means that until XDR2 ships, that remains a fairly vacuous comparison. If XDR2 first ships at 12.8 Gbps, then great. Do Rambus technologies typically ramp in speed over their lifetime? Or do they enter with a big bang at full speed?

It seems like the goal was to protect reading instructions stored in memory, not data. That would complicate ECC if the writes cannot be assumed to be reliable...but not necessarily a fatal flaw. It would all depend on the probability of a write error compared to the probability of a SER in the DRAMs.
I dare say this asymmetry should make anyone with even an ounce of engineering in them wince. It may be that the drive strength of the GPU is much better than the drive strength of the memory chip - that would allow me to turn the wince off.

However, what I'm really wondering is what BER is considered acceptable and within spec for GDDR5 and what HW can achieve. Usually this is expressed as BER<N*10^-k for k>11.
It seems only AMD and NVidia, currently, could care about this, so we'll prolly never know.
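To put a BER spec in perspective, here's a rough sketch of what a 10^-12-class error rate means at GDDR5-like aggregate rates. The per-pin rate and bus width below are assumptions for illustration, not published figures.

Code:
# Assumed figures for illustration; real parts and boards will differ.
data_rate_gbps = 4.0     # assumed per-pin data rate, Gbit/s
bus_width_bits = 256     # assumed bus width
ber = 1e-12              # assumed target bit error rate

bits_per_hour = data_rate_gbps * 1e9 * bus_width_bits * 3600
print(f"~{bits_per_hour * ber:.0f} expected bit errors per hour at BER {ber:g}")

Even at a BER below 10^-12 the aggregate error count is non-trivial, which presumably motivates the link-level error detection discussed above.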

Errors are highly non-linear WRT transmit rate, getting exponentially worse at high speed.

Since DDR transmits data at pretty low rates, error detection actually isn't needed there.

FBD needed it since it was designed for servers, and it's rather high speed.
Will Intel publish target BER for Nehalem EX's memory system?... Or is it easier to simply use redundant hardware/computation when you're building a monster cluster?...

Jawed
 
It seems only AMD and NVidia, currently, could care about this, so we'll prolly never know.
I'm pretty sure this information is available to customers if they simply ask. You might need to sign something, sure, but....

Will Intel publish target BER for Nehalem EX's memory system?... Or is it easier to simply use redundant hardware/computation when you're building a monster cluster?...
Same again, I think.
 
They have efficient interfaces, they are just tuned for very different workloads. GDDR5 would do kind of poorly for TPC-C...partially because you usually want as much memory as possible.
I don't really know anything about making TPC-C go fast, but how about using 2GB of GDDR5 as L4 cache for your 64GB of slow DDR3 per socket? :p
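Purely as a toy model of that idea, the win from such an "L4" depends heavily on hit rate. All bandwidth numbers and hit rates below are assumptions, not measured figures for any real system.

Code:
# Toy model of average bandwidth with a GDDR5 "L4" in front of commodity DDR3.
gddr5_bw_gbs = 100.0   # assumed GDDR5 cache bandwidth, GB/s
ddr3_bw_gbs = 20.0     # assumed DDR3 main-memory bandwidth, GB/s

for hit_rate in (0.5, 0.8, 0.95):
    # bytes served from the cache go at GDDR5 speed, misses at DDR3 speed
    effective = 1.0 / (hit_rate / gddr5_bw_gbs + (1.0 - hit_rate) / ddr3_bw_gbs)
    print(f"hit rate {hit_rate:.0%}: ~{effective:.0f} GB/s effective")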

Jawed
 
They basically designed the spec. They basically committed to a contract manufacturing order with the DRAM vendors. There is no politics or luck involved, just normal biz. It's not like they didn't have the majority of the control.
While I believe one makes one's own luck, there are limits.

Apart from anything else, it seems RV770 came out right on the first silicon - if AMD was planning for second silicon launch then that means RV770 was "early" in comparison with GDDR5. It's also worth remembering that GDDR5 was constrained to the extent that X2 and 1GB HD4870 models appeared later. Though I'll happily agree some of that's likely to be merely engineer-bandwidth.

Obviously we can't tell how much slack there was in the RV770 and GDDR5 projects, nor how the two ramped once HD4870 launched. In theory NVidia could have been using GDDR5 as early as winter 2008 as NVidia had been evaluating GDDR5 as a possible solution on GPUs older than GT2xx.

Jawed
 
Does reducing the ball count mean that each ball individually is tasked with providing more amperage?
I don't know the behavior of solder bumps, but the delivery of a large amount of clean current (even as voltage is decreased) was cited as one reason why pinouts have increased so much. If there is not a 1:1 relationship, is there a possible stressor on the solder, like extra heat?
I don't know the physics of solder bumps, either.

As to current limitations, I suspect that the pads and RDLs/vias are the bottleneck. It seems these are much smaller than the solder balls, but the physics need to be understood :???:

The patent application merely says that UBM "provides a relatively efficient delivery of current and thus relatively small ohmic losses". Nothing about heat. I certainly would expect each ball to suffer with extra heat from within the die. Additionally eutectic balls and pads have lower current thresholds than the older designs.

The DRAM industry has been content to treat performance as a secondary or tertiary concern to capacity and affordability so far.
I dare say once consumer CPUs got cache, DRAM had an excuse. My main point is that now that core counts in consumer CPUs are on the cusp of rising rapidly, bandwidth per pin needs to catch up. It's all very well going 4 or 8 socket in HPC, but the opposite happens in the consumer space, particularly when graphics moves onto the CPU package/die and shares memory bus.
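A trivial sketch of the squeeze, assuming socket bandwidth stays roughly flat while core counts keep doubling. The 25.6 GB/s figure is just an assumed dual-channel DDR3-class number, not a specific product spec.

Code:
# Illustrative only: assumed flat socket bandwidth, rising core counts.
socket_bw_gbs = 25.6   # assumed dual-channel DDR3-class socket bandwidth

for cores in (2, 4, 8, 16):
    print(f"{cores:2d} cores: {socket_bw_gbs / cores:.1f} GB/s per core")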

Jawed
 
It's not too bad; dual-channel DDR3 provides up to the same bandwidth as Cell's XDR, and it's going to be standard on the low end. All motherboards have been 2x64-bit for a while. Not terrible for a low-end integrated GPU.

Socket 1366, 3x64-bit, is overprovisioned for bandwidth, to make room for six-core processors. Then we'll get yet another memory transition. I've always felt the memory transitions come before the processors that use them (dual-channel DDR before dual cores, dual-channel DDR2 before the quads, etc.). Except for that bandwidth-guzzling Pentium 4 :)
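For reference, the arithmetic behind those bandwidth figures as a quick sketch; the transfer rates used are example speed grades, and Cell's XDR figure of 25.6 GB/s is the commonly quoted one.

Code:
def channel_bw_gbs(channels, width_bits, transfers_per_sec):
    # GB/s = channels * (bits / 8) * transfers per second
    return channels * (width_bits / 8) * transfers_per_sec / 1e9

print(f"dual-channel DDR3-1600:   {channel_bw_gbs(2, 64, 1.6e9):.1f} GB/s")
print(f"triple-channel DDR3-1333: {channel_bw_gbs(3, 64, 1.333e9):.1f} GB/s")
# Cell's XDR is usually quoted at 25.6 GB/s, so dual-channel DDR3 is in the same ballpark.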
 
Does reducing the ball count mean that each ball individually is tasked with providing more amperage?
I don't know the behavior of solder bumps, but the delivery of a large amount of clean current (even as voltage is decreased) was cited as one reason why pinouts have increased so much. If there is not a 1:1 relationship, is there a possible stressor on the solder, like extra heat?
Not sure if that's even what you mean, but short of melting or physically straining and cracking the metal, heat doesn't affect conductivity here to any relevant degree, nor does it impact the capacitance or inductance of the line. Of course there are materials with high thermal coefficients, but they aren't used for solder.

Heat can have significant impact on conductivity within the semiconductor itself, and it's a big factor in the life expectancy of nearby electrolytic capacitors.

I once learned the (macroscopic) rule of thumb that the maximum "safe" current density for copper wiring is around 5 A/mm², but I'm sure chip wiring routinely exceeds that.
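Applying that rule of thumb to a couple of cross-sections, just to get a feel for the magnitudes (the areas are arbitrary examples):

Code:
# The ~5 A/mm^2 rule of thumb for copper, applied to example cross-sections.
MAX_J_A_PER_MM2 = 5.0

for area_mm2 in (1.0, 0.1, 0.01):
    safe_current = MAX_J_A_PER_MM2 * area_mm2
    print(f"{area_mm2} mm^2 cross-section -> ~{safe_current:.2f} A 'safe' current")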
 
I apologize if it has been posted elsewhere but PCGH has a quite interesting interview with Bill Dally:

http://www.pcgameshardware.com/aid,...chnology-DirectX-11-and-Intels-Larrabee/News/

Well, they're actually pretty good, so it's hard to find fault with them. But there's always room for improvement. But I think it's not about wanting, but about opportunities to make them even better. The areas where there are opportunities to make them even better are mostly in the memory system. I think that we're increasingly becoming limited by memory bandwidth on both the graphics and the compute side. And I think there's an opportunity, from the hundreds of processors we're at today to the thousands of cores we're gonna be at in the near future, to build more robust memory hierarchies on chip to make better use of the off-chip bandwidth.

Interesting...
 
I find myself agreeing with a lot of those points (not that my agreement lends the opinions of someone with his credentials any more weight; if anything it's the other way around).
Off-die bandwidth isn't scaling and it has inherent inflexibilities on a physical level that aren't getting better either.
Any transaction that stays on-die saves orders of magnitude in latency and power.
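As a very rough illustration of the power side of that argument: the pJ/bit figures below are assumed ballpark values for short on-die wires versus an external DRAM interface, not measurements from any specific chip.

Code:
# Assumed ballpark energy-per-bit figures, purely for illustration.
onchip_pj_per_bit = 1.0     # assumed: short on-die wire
offchip_pj_per_bit = 20.0   # assumed: driving an external DRAM interface

bytes_moved = 1e9           # 1 GB of traffic
for name, pj in (("on-die", onchip_pj_per_bit), ("off-chip", offchip_pj_per_bit)):
    joules = bytes_moved * 8 * pj * 1e-12
    print(f"{name}: ~{joules:.3f} J per GB moved")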

That there will be increasingly robust hierarchies makes sense, and the trend was already in evidence, anyway.
A fully flat, fast, and generalized communications system would be nice, but such things don't scale.

The trade-off between high local or specialized demand and limited global capability is an interesting thing to work with.
 
With GDDR6 nowhere in sight, maybe XDR2 will take off?
Memory speed is a bit like HDDs, not keeping up with the pace of CPU development. Maybe we need an SSD-style leap in improvement for memory?
 
Irrespective of NVIDIA, the fundamental difference between GDDR4 and GDDR5 was that there was only one DRAM vendor signed up to produce GDDR4, whereas we already knew there were three vendors geared up for GDDR5.

So as of last Thursday the 3 vendors making GDDR5 are:


Elpida will take time to get going; they are also ramping aggressively on DDR3 (40% of Q3 wafer input, according to DRAMeXchange)...so hypothetically, if you were launching products in a month or so, it probably wouldn't be a good idea to rely on them.
 
I'll go waaaaaaaayy out of my expertise level and suggest that some of the GDDR6 conversation has to do with expected GP-GPU memory access schemes rather than standard GPU memory schemes... and may have fallen into the OT realm of "where is Nvidia going with all this memory hierarchy discussion..."
 
I apologize if it has been posted elsewhere but PCGH has a quite interesting interview with Bill Dally:

http://www.pcgameshardware.com/aid,...chnology-DirectX-11-and-Intels-Larrabee/News/



Interesting...

It's very amusing to hear Dally talk about a solution in search of a problem, since it seems to me like several of his past companies have fallen into that category.

I'm not sure that a SW renderer is really that much less efficient (power-wise) than a hardware one. I haven't seen any hard numbers one way or another and it really depends on the workload. I wouldn't trust the 20X he mentioned; I bet it's far lower on average.

Anyway, Bill's really known for his emphasis on bandwidth at the expense of latency...and that kind of thought process is a better fit at the national labs or NV than at AMD or Intel.

Increasing bandwidth is a tricky problem, bounded by pins, power, area and software. The general goal is to trade something cheap (transistors) for more bandwidth.

You can try using compression, although that hasn't worked too well in the past for generalized data (it's great for textures, obviously). Caches help quite a bit as well for many workloads, larger register files for others. In theory, TBDR is good here...and the low end seems to bear that out (PowerVR's success). Re-ordering and coalescing in the memory controller is a major avenue as well, although that adds tremendous design complexity.
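As a toy illustration of the reordering/coalescing idea: merge requests that fall in the same burst so they cost one DRAM access instead of several. The 64-byte burst size and the addresses below are arbitrary assumptions, and real controllers obviously do far more than this.

Code:
# Simplified sketch: collapse requests that hit the same (assumed) 64-byte burst.
def coalesce(addresses, burst_bytes=64):
    return sorted({addr // burst_bytes for addr in addresses})

requests = [0x1000, 0x1008, 0x1038, 0x2000, 0x1010, 0x2020]
bursts = coalesce(requests)
print(f"{len(requests)} requests -> {len(bursts)} DRAM bursts: "
      + ", ".join(hex(b * 64) for b in bursts))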

There are some other tricks I believe NV has up their sleeve, but I'd rather discuss that in an article.

DK
 
Tiling helps for the heavy lifting of first hits and shadow rays to some extent (textures have only trivial reuse regardless). Soon that won't be a heavy lift any more though ... raytracing and GI algorithms with 3D datastructures are always going to play havoc with caches. Caching helps, but the need for bandwidth is not going to slow down IMO.
 
Tiling helps for the heavy lifting of first hits and shadow rays to some extent (textures have only trivial reuse regardless). Soon that won't be a heavy lift any more though ... raytracing and GI algorithms with 3D datastructures are always going to play havoc with caches. Caching helps, but the need for bandwidth is not going to slow down IMO.

I can think of ways to make tiling work for secondary rays, but it requires sorting in between bounces.

DK
 
I can think of ways to make tiling work for secondary rays, but it requires sorting in between bounces.

DK
If you don't plan to sort secondary rays to extract locality you shouldn't even bother to try to run that stuff on a wide SIMD architecture.
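A minimal sketch of that sort-between-bounces idea: bin secondary rays by the grid cell their origins fall into, so a wide SIMD batch sees coherent data. The grid resolution and the data layout here are arbitrary choices, not anyone's actual scheme.

Code:
from collections import defaultdict

def bin_rays(rays, cell_size=4.0):
    """rays: list of (origin_xyz, direction_xyz) tuples; returns coherent batches."""
    bins = defaultdict(list)
    for origin, direction in rays:
        cell = tuple(int(c // cell_size) for c in origin)
        bins[cell].append((origin, direction))
    # each bin is a batch of rays likely to touch similar geometry/data
    return list(bins.values())

rays = [((0.5, 1.0, 2.0), (0, 0, 1)), ((0.7, 1.2, 2.1), (0, 1, 0)),
        ((9.0, 9.0, 9.0), (1, 0, 0))]
for batch in bin_rays(rays):
    print(f"batch of {len(batch)} rays")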
 
What are the chances that the "GT300" that taped out in July is actually a 32nm chip? The rumours from earlier in the year could relate to a 40nm chip, and NVidia is now planning on making a swift change to 32nm, a bit like the originally planned swift change from 65nm to 55nm for GT200 last year?

If so, could a 32nm "GT300" arrive February/March/April time? NVidia could be planning for 1 re-spin?

How soon could a 32nm GPU be here, is TSMC shipshape on 32nm? Would NVidia really bust a gut to get a monster chip out on 32nm ahead of the littler chips?

Jawed
 