Nvidia GT300 core: Speculation

Given that RV770 is ~ on par with GT200 and the most conservative rumors peg AMD's performance part at 1.5x RV770, the only way Nvidia can save face is to hit at least 3x GT200 performance. If they go another round where AMD's X2 (or X4 :D) part totally dominates their big chip, then they'll be in more trouble, considering they (seemingly) don't have anything in the performance/mainstream segment on the way to stem the bleeding there.
 
This time around NVidia gets to benefit from the fact that GDDR5 gives double the bandwidth for less than double the area (of I/O and MCs). So per-ROP performance should improve with the resulting extra area to spend, e.g. with enhanced compression. ROPs could well double in count.

There's also still the big unknown of whether NVidia will use off-die memory hubs, which could save another substantial chunk of area.

Jawed
 
Given that RV770 is ~ on par with GT200 and the most conservative rumors peg AMD's performance part at 1.5x RV770, the only way Nvidia can save face is to hit at least 3x GT200 performance. If they go another round where AMD's X2 (or X4 :D) part totally dominates their big chip, then they'll be in more trouble, considering they (seemingly) don't have anything in the performance/mainstream segment on the way to stem the bleeding there.

If NV gets trapped from below and above (as you mentioned), it would be problematic.

However, I suspect that it was a bit too late to really impact GT300 floor plan much by the time that RV770 showed up. You simply cannot make changes very late in the design cycle for most chips - it's easier for GPUs than CPUs, but not by a whole lot. I know where the 'feature freeze' point is for a CPU, but I don't really know for a GPU.

It would be interesting to talk to someone at NV or ATI about the whole design process and timeline from start to finish.

Anyway, it seems to me like GT200 was essentially a 32 SM device that had 2 SMs cut out due to reticle limits. So if they can optimize die area a little more, 64 SMs seems quite reasonable for GT300. It also depends on how many memory pipelines they have. Currently, they seem to be following a trend of increasing the ratio of SM:TMUs, by putting more SMs in each TPC (from 2-->3).

RE: two chip solution

I don't see why NV would want to put any part of the memory interface on a separate chip. That adds a lot of latency and power and doesn't save much. It's not clear what the goal would be...

Their GPU is already huge, which means you have room for a lot of pins...why go and make two huge chips with high pin counts?

Let's just say another interconnect and chip would add 10ns of latency to the memory access. That's approximately 150 cycles and now you need to re-adjust all the internal buffering in the chip to make them deeper. That adds area and power and makes scheduling trickier...not good.

The memory controllers in GPUs are very complicated and do a lot of read/write coalescing and scheduling. Doing that off-die doesn't make much sense to me. You really need to be able to get all the commands and addresses to the memory controllers as fast as possible for them to perform well, and adding a 150 cycle delay (or even 70 cycles if you are thinking slow clock) will really hurt that a lot.

Also, the only way to really save pins on:

GPU-->MC-->DRAM

Is to have the GPU-->MC interconnect be substantially faster than the MC-->DRAM interconnect (which is GDDR5 running at 3.6-4.5GT/s or more). Even using something advanced like Hypertransport could only get you to ~5-6GT/s, which isn't a whole lot faster. You'd need to be running those pins at something like 9GT/s, which people don't do over copper at the moment.

Also, given the width of the memory interface (256b+), your MC chip will definitely be pad limited and will end up quite bloated and using expensive silicon (you can't do GDDR5 on 90nm).

I just don't see why you'd ever want a separate memory controller for a GPU; it does not make sense. We went over this before, and I couldn't see any reason for it to be attractive then, and I don't see any reason now.

Reading patents to find out what products are doing in the future is a really complicated business, especially since all these companies file a lot of patents that aren't used for products.



@nAo - it's true that FLOPs aren't all that matter. The ease of use of the architecture matters a lot and NV has invested a great deal there. The real question is how efficient ATI's and NV's drivers and the JIT compilers contained in them are.

My understanding is that ATI's is highly efficient for graphics, but not so good for GPGPU. NV is optimizing a bit more for GPGPU and is vastly more efficient there...but at the cost of worse FLOP/mm2 and FLOP/w.
 
The AMBs don't seem to add a lot of cost to FB-DIMMs, and they didn't need 45nm to hit 5GHz signalling speeds either. That said, memory hubs are ridiculous and I suspect the patent is an offshoot from a project they started when they were scared of Rambus.
 
The AMBs don't seem to add a lot of cost to FB-DIMMs, and they didn't need 45nm to hit 5GHz signalling speeds either. That said, memory hubs are ridiculous and I suspect the patent is an offshoot from a project they started when they were scared of Rambus.

You can't really draw parallels between FBD and a GDDRx based system.

As I pointed out, GDDRx already runs at near FBD speeds (3.6-4.5Gbps). FBD only worked because it used a very narrow high speed interconnect and was able to send equivalent traffic using fewer pins (FBD was 3.2Gbps with a memory technology that ran at ~0.8Gbps so they could use far fewer pins). To reduce the pin count for GDDRx, you'd need an interconnect that runs ~4X faster (so 10-15gbps), which doesn't exist.
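
Put as rough numbers (everything here is either a figure from above or an illustrative assumption, nothing measured), a minimal sketch of that comparison:

Code:
// Back-of-the-envelope only: FBD could narrow the bus because its link ran
// ~4x faster than the DDR2 behind the AMB; GDDR5 already runs at near-FBD
// link speeds, so pulling the same trick would need a ~4x faster link.
#include <cstdio>

int main()
{
    const double fbd_link_gts = 3.2;   // FBD channel rate, GT/s
    const double ddr2_gts     = 0.8;   // DRAM rate behind the AMB, GT/s
    const double gddr5_gts    = 3.6;   // low end of the GDDR5 range quoted above

    const double ratio = fbd_link_gts / ddr2_gts;   // ~4x: the pin saving FBD exploited
    std::printf("FBD link:DRAM ratio ~%.0fx\n", ratio);
    std::printf("equivalent link over GDDR5: ~%.1f GT/s\n", gddr5_gts * ratio);  // ~14 GT/s
    return 0;
}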

Also, AMB only works because it improves the SI for high capacity systems when you have many DRAMs on a bus. GDDRx is only one DRAM deep IIRC, so you can't improve the SI at all...

Although if you wanted to, you could use such a technology to get a lot of DRAM in one card. That'd be horribly expensive though.

David
 
I was pointing out that for whatever reason the cost of AMBs isn't that high, even if you double it to account for using the same speed on both ends.
 
However, I suspect that it was a bit too late to really impact GT300 floor plan much by the time that RV770 showed up. You simply cannot make changes very late in the design cycle for most chips - it's easier for GPUs than CPUs, but not by a whole lot. I know where the 'feature freeze' point is for a CPU, but I don't really know for a GPU.
I suspect the freeze point is quite late for GPUs - G80 gives every impression of having 128-bits of extra memory bus and corresponding ROPs tacked on. G100 seems to have been cancelled and GT200 came in its place, with a ~7 month delay. The latter case is interesting because CUDA documentation during 2007 outlined future chip capabilities that GT200 doesn't provide.

I think it's reasonable to assume that feature blocks (e.g. ALUs, TMUs) can be fairly independent of each other - hence the ability to have an architecture that optionally features double-precision (both ATI and NVidia do this). Connecting up various counts of these units is obviously tricky when combinatorial explosions happen - but we've got no ready way to assess where the pain barriers are there.

I do think NVidia is closer to pain though, with 8 MCs connecting to the 10 clusters in GT200 compared with 4 MCs connecting to 10 clusters in RV770.

Their GPU is already huge, which means you have room for a lot of pins...why go and make two huge chips with high pin counts?
A GPU's main problem is that it's a high-power device, much more so than consumer CPUs - feeding the core with power puts the squeeze on the pins available for I/O.

The one or more hub chips should be low power, so the balance of I/O : power should keep the size of that chip down. Secondly, the GPU<->hub I/O needs fewer pins than a GDDR interface, which means more pins for power and/or a smaller GPU.

Let's just say another interconnect and chip would add 10ns of latency to the memory access. That's approximately 150 cycles and now you need to re-adjust all the internal buffering in the chip to make them deeper. That adds area and power and makes scheduling trickier...not good.
NVidia's architecture "already copes" with >30% variation in ALU : DDR, e.g. GTX275 is 1404:1134 and 9800GTX is 1836:1100

http://www.gpureview.com/show_cards.php?card1=609&card2=571

So if typical worst-case latency is equivalent to 500 cycles on GTX275, then it's equivalent to 674 cycles on 9800GTX. Indeed, GTX275 has a larger register file per multiprocessor (double), so should benefit considerably from this difference.
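
Spelling that arithmetic out (clocks as in the comparison linked above; the 500-cycle figure is the assumption, not a measurement):

Code:
// If DRAM latency is a fixed number of memory-clock cycles, its cost in
// shader (ALU) cycles scales with each chip's ALU:memory clock ratio.
#include <cstdio>

int main()
{
    const double gtx275_alu = 1404.0, gtx275_mem = 1134.0;   // MHz
    const double g92_alu    = 1836.0, g92_mem    = 1100.0;   // 9800GTX, MHz
    const double lat_gtx275 = 500.0;                         // assumed worst case, ALU cycles

    const double lat_9800gtx =
        lat_gtx275 * (g92_alu / g92_mem) / (gtx275_alu / gtx275_mem);
    std::printf("~%.0f ALU cycles on 9800GTX\n", lat_9800gtx);   // ~674
    return 0;
}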

Apart from anything else, particularly in graphics, the texture cache is designed to make most texture operations suffer much less than worst-case latency.

The memory controllers in GPUs are very complicated and do a lot of read/write coalescing and scheduling. Doing that off-die doesn't make much sense to me. You really need to be able to get all the commands and addresses to the memory controllers as fast as possible for them to perform well, and adding a 150 cycle delay (or even 70 cycles if you are thinking slow clock) will really hurt that a lot.
Agreed, increasing the queue length for the MCs due to the extra latency is problematic.
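
The rule of thumb behind that is roughly Little's law - outstanding requests ~ bandwidth x latency - so every extra nanosecond forces deeper queues. A quick sketch with purely illustrative numbers:

Code:
// Illustrative only: how many requests must be in flight to keep a
// memory interface busy, before and after adding extra hop latency.
#include <cstdio>

int main()
{
    const double bw_bytes_per_ns = 150.0;   // ~150 GB/s, i.e. 150 bytes per ns (illustrative)
    const double request_bytes   = 64.0;    // one burst per request (illustrative)
    const double latency_ns[2]   = {400.0, 500.0};   // without / with an extra 100 ns hop, say

    for (int i = 0; i < 2; ++i) {
        const double in_flight = bw_bytes_per_ns * latency_ns[i] / request_bytes;
        std::printf("%.0f ns latency -> ~%.0f requests in flight\n",
                    latency_ns[i], in_flight);
    }
    return 0;
}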

Also, the only way to really save pins on:

GPU-->MC-->DRAM

Is to have the GPU-->MC interconnect be substantially faster than the MC-->DRAM interconnect (which is GDDR5 running at 3.6-4.5GT/s or more). Even using something advanced like Hypertransport could only get you to ~5-6GT/s, which isn't a whole lot faster. You'd need to be running those pins at something like 9GT/s, which people don't do over copper at the moment.
GDDR5 doesn't even use differential signalling. NVidia can make an entirely proprietary interconnect, and do what it likes to get the signalling it wants.

Also, given the width of the memory interface (256b+), your MC chip will definitely be pad limited and will end up quite bloated and using expensive silicon (you can't do GDDR5 on 90nm).
I'll happily admit to having no characterisation of MC power. Clearly there's a bit of it that's hitting some crazy clocks. GDDR5 chips are ~5W each though, aren't they, and they're running their I/O at similar crazy clocks.

The 256-bit GDDR5 MCs on RV770 appear to be about 13mm², so say 15mm² for 512-bit of MC on 40nm. 512-bit of GDDR5 would take ~75mm². Say the I/O for the GPU is 40mm². So about 130mm² all told.

But the perimeter the I/O occupies could be 75mm for the GDDR5 and say 40mm for the GPU connection. 115mm of perimeter requires a monster chip, so it wouldn't be practical.

The alternative is multiple small hub chips, e.g. 4x 128-bit. The perimeter is 19mm of GDDR5 and 10mm of GPU connection per chip = ~30mm. The area would be say 19mm² of GDDR5, 10mm² of GPU connection and say 7mm² of MC on 55nm, meaning the entire chip is <40mm². Each hub chip would be something like 4x11mm, say.
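
Putting those estimates side by side (all figures are the guesses above, nothing measured):

Code:
// One 512-bit hub versus four 128-bit hubs, using the estimates above.
#include <cstdio>

int main()
{
    // single 512-bit hub (40nm MC estimate)
    const double single_area_mm2 = 75.0 + 40.0 + 15.0;   // GDDR5 + GPU link + MC, ~130
    const double single_perim_mm = 75.0 + 40.0;          // ~115 mm: needs a monster die

    // one of four 128-bit hubs (55nm MC estimate)
    const double hub_area_mm2 = 19.0 + 10.0 + 7.0;       // ~36, i.e. <40 mm^2
    const double hub_perim_mm = 19.0 + 10.0;             // ~30 mm, e.g. a ~4x11 mm die

    std::printf("single hub: ~%.0f mm^2, ~%.0f mm of I/O perimeter\n",
                single_area_mm2, single_perim_mm);
    std::printf("4x 128-bit hubs: ~%.0f mm^2 each, ~%.0f mm of I/O perimeter each\n",
                hub_area_mm2, hub_perim_mm);
    return 0;
}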

Then there's the question of whether the ROPs go on the hub chip too, making room for vastly more ALUs :p The Tesla (or should that be Fermi?) variant then has hubs that are ROP-less and support ECC.

I just don't see why you'd ever want a separate memory controller for a GPU; it does not make sense. We went over this before, and I couldn't see any reason for it to be attractive then, and I don't see any reason now.
It frees the set of GPUs you build from having to implement DDR2/GDDR3/GDDR5 interfaces, since it's the hub chips that are now specific to those types. It allows you to put more than 4GB of memory on a 512-bit bus. It allows you to build a hub chip dedicated to ECC.

Reading patents to find out what products are doing in the future is a really complicated business, especially since all these companies file a lot of patents that aren't used for products.
And GPUs seem to be more fraught than CPUs.

@nAo - it's true that FLOPs aren't all that matter. The ease of use of the architecture matters a lot and NV has invested a great deal there. The real question is how efficient ATI's and NV's drivers and the JIT compilers contained in them are.
I've seen a report of 25-minute compilation time for a CUDA kernel :p I've personally experienced multi-minute compilation time for a crazy Brook+ kernel.

My understanding is that ATI's is highly efficient for graphics, but not so good for GPGPU. NV is optimizing a bit more for GPGPU and is vastly more efficient there...but at the cost of worse FLOP/mm2 and FLOP/w.
NVidia sacrificing for GPGPU is prolly the right description. Certainly not to the tune of being vastly more efficient.

Jawed
 
Why would you need perimeter I/O on a memory hub? The logic is much simpler so just using area I/O shouldn't be much of a problem.
 
Why would you need perimeter I/O on a memory hub? The logic is much simpler so just using area I/O shouldn't be much of a problem.
Could be - I was being pessimistic. I was going to mention the double-stacked I/O that we see in GT215 for the PCI Express port as an example of non-perimeter usage, but decided to keep it simple.

The other side of the coin is what's the right bit-width of hub, e.g. a 128-bit hub serves 3 different GPU chips in varying counts.

Jawed
 
Know of anything in CUDA other than .surf space that didn't make it into the GT200 arch?
We discussed the changes ages ago; I imagine a search would root them out, eventually.

---

Have you noticed that the performance optimisation section of the programming guide refers to a future change in the structure of shared memory? I'm looking at the 2.2-beta version, which describes avoiding bank conflicts when storing arrays of doubles and says that the suggested technique will be slower on future hardware.
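
For reference, the workaround being described looks something like the sketch below, written for compute 1.x hardware (16 banks, 32 bits wide, accessed per half-warp); the kernel names and block size are made up:

Code:
// Straight 64-bit shared-memory loads cause 2-way bank conflicts on 16-bank
// hardware; the guide's suggestion is to split each double into HI/LO 32-bit
// halves so every thread touches only one bank.
#include <cuda_runtime.h>

#define N 256   // threads per block (illustrative)

__global__ void doubles_direct(const double *in, double *out)
{
    __shared__ double s[N];          // each element straddles two 32-bit banks
    const int tid = threadIdx.x;
    s[tid] = in[tid];
    __syncthreads();
    out[tid] = s[tid] * 2.0;         // 2-way conflict per half-warp on GT200-class (16-bank) parts
}

__global__ void doubles_split(const double *in, double *out)
{
    __shared__ int s_hi[N];          // one 32-bit bank access per thread
    __shared__ int s_lo[N];
    const int tid = threadIdx.x;
    const double d = in[tid];
    s_hi[tid] = __double2hiint(d);
    s_lo[tid] = __double2loint(d);
    __syncthreads();
    out[tid] = __hiloint2double(s_hi[tid], s_lo[tid]) * 2.0;   // conflict-free with 16 banks,
                                                               // but no longer a win on future parts
}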

I guess that means shared memory will have 32 banks, corresponding with the doubling that'll be required for D3D11.

Jawed
 
Even using something advanced like Hypertransport could only get you to ~5-6GT/s, which isn't a whole lot faster. You'd need to be running those pins at something like 9GT/s, which people don't do over copper at the moment.

To reduce the pin count for GDDRx, you'd need an interconnect that runs ~4X faster (so 10-15gbps), which doesn't exist.

Why would you need such high speeds? I thought the whole point was that for an equivalent effective width you need fewer physical pins when using a proprietary differential signalling bus. So a 512-bit custom bus will actually be much smaller than a 512-bit GDDR5 interface. But it can run at the same 4.5GT/s speed as GDDR5 does; it'll just take up less space.

I guess that means shared memory will have 32 banks, corresponding with the doubling that'll be required for D3D11.

Hmmm, are you sure? I wouldn't think you'd need an increase in the number of banks due to an increase in capacity. The only reason for the number of banks to increase is if the fetch width increases. And the only reason for that to increase is if the shader clock multiplier is now 4x or the SIMD width has doubled. The latter would fit in with the ideas we've tossed around in the past about 16 wide SIMDs (which is one way for them to significantly increase compute density). Of course the whole SFU scheduling setup would have to change too unless warp size is also doubled.
 
I guess that means shared memory will have 32 banks, corresponding with the doubling that'll be required for D3D11.

Yes, however they have been noting from early on that half-warp optimizations will not apply to future hardware. So switching to 32 banks would not be much of a surprise.

I'm really wondering what changes CUDA 3.0 will bring.

Something like the ability to run more than one program per core, ie MPMD (but NOT DWF), just doesn't seem like a major change. We can already do this somewhat with coherent dynamic branching in a kernel (latest NVidia raytracing paper shows advantages with issuing only enough threads to fill the GPU and having warps pull their own work).
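
The "warps pull their own work" pattern being referred to is roughly the persistent-threads idea; a minimal sketch (function and variable names are mine, not from the paper):

Code:
// Launch only enough blocks to fill the GPU; each block then fetches work
// item indices from a global counter until the queue is drained.
#include <cuda_runtime.h>

__global__ void persistent_kernel(float *out, int num_items, int *next_item)
{
    __shared__ int item;                        // index fetched by thread 0 for the whole block
    for (;;) {
        if (threadIdx.x == 0)
            item = atomicAdd(next_item, 1);     // grab the next work item
        __syncthreads();
        if (item >= num_items)
            return;                             // queue empty: the block retires
        if (threadIdx.x == 0)
            out[item] = (float)item;            // stand-in for the real per-item work
        __syncthreads();                        // finish before thread 0 refetches
    }
}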

I'm still holding out a little hope that we get with CUDA 3.0 an interface to buffer between different kernels without going out to GDDR. And I don't mean the trivial stuff like simply keeping intermediate results in shared memory...
 
Hardware queues would be great, IMHO. I think that pipeline parallelism is the next logical step for GPUs to evolve into. With the dynamically sized workloads coming in because of tessellation, I would normally have thought that both of them would go down this route, but hey, they have some really smart people working for them.
 
Hmmm, are you sure? I wouldn't think you'd need an increase in the number of banks due to an increase in capacity. The only reason for the number of banks to increase is if the fetch width increases. And the only reason for that to increase is if the shader clock multiplier is now 4x or the SIMD width has doubled. The latter would fit in with the ideas we've tossed around in the past about 16 wide SIMDs (which is one way for them to significantly increase compute density). Of course the whole SFU scheduling setup would have to change too unless warp size is also doubled.
No, I'm not sure - it's a guess based on the fact that an array of doubles takes 64 bits per element, and that their suggestion to split this into two arrays, one of HI and one of LO 32-bit elements, would be slower in the new architecture.

It appears to imply that doubles would suffer no bank conflicts, which would result either from a doubling in bank count or a doubling in bits per bank.

As to how to implement a doubling of capacity, by increasing the bank count you keep the timings of bank operations unchanged. The timing of a bank is basically defined by its size. Though it's worth pointing out that in NVidia's architecture the relationship between the shared memory and ALU clocks is pretty loose - it's never 1:2 precisely.

If the bits per bank are doubled then more bandwidth would be wasted in gathers from shared memory. This is analogous to having 16-bit data elements stored in shared memory in current GPUs - you'll end up fetching a lot of data that doesn't get used.

You could say that only the GPU that supports computation of doubles needs "64-bits per bank", but then most computations are single precision and that would be wasteful. Though you might argue this inherently allows for 2 operands per clock to come from shared memory, as opposed to the single operand currently supported. That would prolly be quite welcome.

To be honest, I haven't thought this through in any great detail though. There's so many combinations.

Given that this example relates to double precision arithmetic, it could also be taken to imply that throughput in GT300 would be substantially higher, e.g. double.

Jawed
 
http://www.eetimes.com/news/latest/showArticle.jhtml;?articleID=218900011

So, GT300 should be around 5x the performance of G80 (3 years).

Jawed


What bunk.

The only reason that GPUs have had the performance improvements they've had in the past is increased die sizes and increased power envelopes. Both of these are maxed out now, so improvements will basically be process dependent. That means they will scale with standard silicon scaling.

A new process comes online roughly every 18-24 months providing ~2x logic density and ~20-30% transistor switching frequency improvement. The problem is that 2x logic density at the same frequency likely blows past the power limits, which implies that the increase in transistor performance will likely be applied to offset frequency degradation due to voltage scaling to meet the power envelope. Which implies a performance improvement roughly capped at around 50-55% every year. Or roughly 3.5-4x every 3 years.
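
The compounding in that estimate, spelled out (nothing here beyond the assumptions stated above):

Code:
// ~50-55% per year compounded over three years.
#include <cstdio>
#include <cmath>

int main()
{
    const double per_year[2] = {1.50, 1.55};
    for (int i = 0; i < 2; ++i)
        std::printf("%.0f%%/year over 3 years -> %.1fx\n",
                    (per_year[i] - 1.0) * 100.0, std::pow(per_year[i], 3.0));  // ~3.4x to ~3.7x
    return 0;
}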
 
Something like the ability to run more than one program per core, ie MPMD (but NOT DWF), just doesn't seem like a major change.
I presume that multiple kernels (i.e. VS and PS) do this already, so the big question is what's holding back the architecture from making this more fluid? There's potential for mess and fragmentation in the register file with more than two types of kernels, as each type will have a different warp-register alignment (due to count of registers allocated per thread). There's instruction cache thrashing - though I can't really see the big deal there with instruction issue being so infrequent. What else?

We can already do this somewhat with coherent dynamic branching in a kernel (latest NVidia raytracing paper shows advantages with issuing only enough threads to fill the GPU and having warps pull their own work).
This is where, you could argue, higher-level abstractions come in. It's a bloody mess in the meantime. It's also a bit like the uber-shader problem of D3D10, though not really the same.

I'm still holding out a little hope that we get with CUDA 3.0 an interface to buffer between different kernels without going out to GDDR. And I don't mean the trivial stuff like simply keeping intermediate results in shared memory...
Small on-die queues that are accessible by all clusters seem like a reasonable idea, but there's still the fundamental issue of sizing and enqueue/dequeue rates. The very tight bounds on these in VS and PS have made it work well so far and we've already discussed at length ATI's solution to the "arbitrariness" of GS output, which is off die, while I'm guessing NVidia uses shared memory for this.

If the GPUs get cached gather/scatter support, then maybe these queues can ride on top of that.

Jawed
 