However, I suspect it was a bit too late to really impact GT300's floor plan much by the time RV770 showed up. You simply cannot make changes very late in the design cycle for most chips - it's easier for GPUs than CPUs, but not by a whole lot. I know where the 'feature freeze' point is for a CPU, but I don't really know for a GPU.
I suspect the freeze point is quite late for GPUs - G80 gives every impression of having 128 bits of extra memory bus and the corresponding ROPs tacked on. G100 seems to have been cancelled and GT200 came in its place, with a ~7-month delay. The latter case is interesting because CUDA documentation during 2007 outlined future chip capabilities that GT200 doesn't provide.
I think it's reasonable to assume that feature blocks (e.g. ALUs, TMUs) can be fairly independent of each other - hence the ability to have an architecture that optionally features double-precision (both ATI and NVidia do this). Connecting up varying counts of these units is obviously tricky once the combinatorial explosion kicks in - but we've got no ready way to assess where the pain barriers are.
I do think NVidia is closer to pain though, with 8 MCs connecting to the 10 clusters in GT200 compared with 4 MCs connecting to 10 clusters in RV770.
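As a rough illustration (my own back-of-envelope, with the simplifying assumption that every cluster needs a route to every MC), the crossbar routing burden scales with the product of the two counts:

```python
# Toy sketch: if every cluster can reach every memory controller, a full
# crossbar needs roughly MCs x clusters routes, so GT200 carries about
# twice RV770's routing burden for the same cluster count.

def crossbar_routes(mcs: int, clusters: int) -> int:
    """Cluster<->MC routes in a full crossbar."""
    return mcs * clusters

for name, mcs, clusters in [("GT200", 8, 10), ("RV770", 4, 10)]:
    print(f"{name}: {mcs} MCs x {clusters} clusters = {crossbar_routes(mcs, clusters)} routes")
```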
Their GPU is already huge, which means you have room for a lot of pins...why go and make two huge chips with high pin counts?
A GPU's main problem is that it's a high-power part, much higher power than consumer CPUs - feeding the core with power puts the squeeze on the pins left for I/O.
The hub chip (or chips) should be low power, so the I/O : power balance should keep the size of that chip down. Secondly, the GPU<->hub I/O needs fewer pins than implementing GDDR does, which means more pins for power and/or a smaller GPU.
Let's just say another interconnect and chip would add ~100ns of latency to the memory access. That's approximately 150 cycles at shader clock, and now you need to re-adjust all the internal buffering in the chip to make it deeper. That adds area and power and makes scheduling trickier...not good.
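As a back-of-envelope sketch of that buffering cost (my own numbers, roughly GTX275-class, just to show the Little's-law relationship: in-flight data = bandwidth x latency):

```python
# Sketch only: every extra nanosecond of memory latency needs matching
# extra in-flight data (and hence buffering) to keep the bus busy.

extra_latency_ns = 100     # assumed hub round-trip penalty (as above)
shader_clock_ghz = 1.5     # ~GT200-class shader clock
mem_bandwidth_gbs = 127    # ~GTX275: 448-bit GDDR3 @ 1134 MHz

extra_cycles = extra_latency_ns * shader_clock_ghz           # ~150 cycles
extra_inflight_bytes = mem_bandwidth_gbs * extra_latency_ns  # GB/s * ns = bytes
print(f"~{extra_cycles:.0f} extra cycles, ~{extra_inflight_bytes/1024:.1f} KiB more buffering")
```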
NVidia's architecture "already copes" with >30% variation in the ALU : DDR clock ratio, e.g. GTX275 is 1404:1134 MHz and 9800GTX is 1836:1100 MHz:
http://www.gpureview.com/show_cards.php?card1=609&card2=571
So if the typical worst-case latency is equivalent to 500 ALU cycles on GTX275, the same latency is equivalent to ~674 cycles on 9800GTX. Indeed, GTX275 has a larger register file per multiprocessor (double), so it should benefit considerably from this difference.
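The scaling behind that 500 -> ~674 figure (my assumption: the worst-case latency is fixed when measured against the memory clock, so its cost in ALU cycles grows with the ALU : DDR ratio) works out like this:

```python
# Sketch: latency that is fixed in memory-clock terms costs more ALU cycles
# as the ALU : DDR clock ratio rises.

def alu_cycle_latency(base_cycles: float, base_ratio: float, new_ratio: float) -> float:
    return base_cycles * new_ratio / base_ratio

gtx275_ratio = 1404 / 1134   # ALU MHz : DDR MHz
g92_ratio    = 1836 / 1100   # 9800GTX
print(round(alu_cycle_latency(500, gtx275_ratio, g92_ratio)))   # ~674
```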
Apart from anything else, particularly in graphics, the texture cache is designed to make most texture operations suffer much less than worst-case latency.
The memory controllers in GPUs are very complicated and do a lot of read/write coalescing and scheduling. Doing that off-die doesn't make much sense to me. You really need to be able to get all the commands and addresses to the memory controllers as fast as possible for them to perform well, and adding a 150-cycle delay (or even 70 cycles if you are thinking of a slow clock) will really hurt that a lot.
Agreed, increasing the queue length for the MCs due to the extra latency is problematic.
Also, the only way to really save pins on:
GPU-->MC-->DRAM
is to have the GPU-->MC interconnect run substantially faster than the MC-->DRAM interconnect (which is GDDR5 at 3.6-4.5 GT/s or more). Even something advanced like HyperTransport could only get you to ~5-6 GT/s, which isn't a whole lot faster. You'd need to be running those pins at something like 9 GT/s, which people don't do over copper at the moment.
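A quick sketch of that pin arithmetic (my own numbers, counting data pins only): for a fixed bandwidth the pin count scales inversely with the per-pin rate, so a worthwhile saving over 4.5 GT/s GDDR5 really does mean pushing towards ~9 GT/s:

```python
# Sketch: data pins needed to carry a fixed bandwidth at various per-pin rates.

gddr5_pins = 256                              # data pins on a 256-bit bus
gddr5_rate = 4.5                              # GT/s per pin
bandwidth_gbs = gddr5_pins * gddr5_rate / 8   # ~144 GB/s

for link_rate in (4.5, 6.0, 9.0):             # GDDR5, ~HyperTransport-class, 2x target
    pins = bandwidth_gbs * 8 / link_rate
    print(f"{link_rate:>4} GT/s -> {pins:.0f} GPU->hub data pins")
```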
GDDR5 doesn't even use differential signalling. NVidia can make an entirely proprietary interconnect, and do what it likes to get the signalling it wants.
Also, given the width of the memory interface (256b+), your MC chip will definitely be pad limited and will end up quite bloated and using expensive silicon (you can't do GDDR5 on 90nm).
I'll happily admit to having no characterisation of MC power. Clearly there's a bit of it that's hitting some crazy clocks. GDDR5 chips are ~5W each though, aren't they, and they're running their I/O at similarly crazy clocks.
The 256-bit GDDR5 MCs on RV770 appear to be about 13mm², so say 15mm² for 512-bit of MC on 40nm. 512-bit of GDDR5 would take ~75mm². Say the I/O for the GPU is 40mm². So about 130mm² all told.
But the perimeter the I/O occupies could be 75mm for the GDDR5 and say 40mm for the GPU connection. 115mm of perimeter requires a monster chip, so it wouldn't be practical.
The alternative is multiple small hub chips, e.g. 4x 128-bit. The perimeter is 19mm of GDDR5 and 10mm of GPU connection per chip, so ~30mm. The area would be roughly 19mm² of GDDR5, 10mm² of GPU connection and say 7mm² of MC on 55nm, meaning the entire chip is <40mm². Each hub chip would be something like 4x11mm.
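Redoing that 4x 128-bit hub arithmetic as a quick sketch (all figures are the rough estimates above, not measured values):

```python
# Per 128-bit hub chip, using the rough figures from the text.

gddr5_edge_mm, gpu_link_edge_mm = 19, 10        # mm of die edge for the I/O
gddr5_area, gpu_link_area, mc_area = 19, 10, 7  # mm^2; MC assumed on 55nm

perimeter_mm = gddr5_edge_mm + gpu_link_edge_mm  # ~29mm of I/O edge needed
area_mm2 = gddr5_area + gpu_link_area + mc_area  # ~36mm^2 per hub chip

# A 4x11mm die offers 2*(4+11) = 30mm of edge and 44mm^2 of area,
# so the I/O just fits and there's area to spare.
print(perimeter_mm, area_mm2)
```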
Then there's the question of whether the ROPs go on the hub chip too, making room on the GPU for vastly more ALUs.
The Tesla (or should that be Fermi?) variant then has hubs that are ROP-less and support ECC.
I just don't see why you'd ever want a separate memory controller for a GPU; it does not make sense. We went over this before, and I couldn't see any reason for it to be attractive then, and I don't see any reason now.
It frees the set of GPUs you build from having to implement DDR2/GDDR3/GDDR5 interfaces - the hub chips become specific to those memory types instead. It allows you to put more than 4GB of memory on a 512-bit bus. It allows you to build a hub chip dedicated to ECC.
Reading patents to find out what products are doing in the future is a really complicated business, especially since all these companies file a lot of patents that aren't used for products.
And GPUs seem to be more fraught than CPUs.
@nAo - it's true that FLOPs aren't all that matter. The ease of use of the architecture matters a lot and NV has invested a great deal there. The real question is how efficient ATI's and NV's drivers are, along with the JIT compilers contained in them.
I've seen a report of a 25-minute compilation time for a CUDA kernel.
I've personally experienced multi-minute compilation time for a crazy Brook+ kernel.
My understanding is that ATI's is highly efficient for graphics, but not so good for GPGPU. NV is optimizing a bit more for GPGPU and is vastly more efficient there...but at the cost of worse FLOP/mm² and FLOP/W.
NVidia sacrificing some efficiency for GPGPU is prolly the right description - but certainly not to the tune of being vastly more efficient there.
Jawed