Will Nvidia return to a 512-bit memory bus for GF110 / GTX 580?

That would be really sweet. I'm tired of seeing both Nvidia and AMD/ATI go to a 512-bit bus (R600, GT200) and then pull back. I know it's for cost and die-size reasons, but the tech enthusiast in me wants to see progression, not regression. I know faster GDDR5 RAM makes a 256-bit or 384-bit (or 448-bit) bus okay for last-gen and current-gen GPUs, but eventually both companies are going to need 512-bit at the high end again.
 
The tech enthusiast in me wants a 512-bit bus with high-end XDR2, but that isn't going to happen. Plus, unlike previous generations, bandwidth demands don't seem to be growing at such a large rate this generation.
 
What good is a 512-bit interface? The HD 2900 XT was slower than the 8800 GT, and GT200 was only about 5-10% faster than the HD 4890. The memory controller is so complex that if you instead spend the space on additional ALUs/TMUs plus a 256/384-bit interface, the resulting performance is better...
 
I don't know... according to AnandTech, Barts has a "simple" GDDR5 memory controller... maybe a "simple" low-speed 512-bit GDDR5 MC >> a "complex" high-speed 256-bit GDDR5 MC.
 
That would be really sweet. I'm tired of seeing both Nvidia and AMD/ATI go to a 512-bit bus (R600, GT200) and then pull back. I know it's for cost and die-size reasons, but the tech enthusiast in me wants to see progression, not regression. I know faster GDDR5 RAM makes a 256-bit or 384-bit (or 448-bit) bus okay for last-gen and current-gen GPUs, but eventually both companies are going to need 512-bit at the high end again.

Yep, as no-X said, what use is a 512-bit interface if it isn't going to be utilised? The only advantage is that you can pack huge amounts of memory onto professional cards. Right now they offer 3 GB and 6 GB variants of GF100. With a 512-bit bus they could offer 4 GB and 8 GB. And of course, using a power-of-two bus you can avoid uncommon memory sizes like 1280 MB (GTX 470), 896 MB (GTX 260), 640 MB (8800 GTS) or 768 MB (GTX 460), etc.
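For illustration, here's a quick sketch of that arithmetic in Python (assuming the usual board layout of one 32-bit memory chip per channel; the chip capacities are just common GDDR3/GDDR5 sizes):

```python
# Back-of-envelope: with one 32-bit chip per channel, bus width fixes the
# chip count, and chip count times per-chip capacity fixes the memory size.
def memory_size_mb(bus_width_bits, chip_capacity_mbit):
    chips = bus_width_bits // 32               # one chip per 32-bit channel
    return chips * chip_capacity_mbit // 8     # total megabytes

for bus in (256, 320, 384, 448, 512):
    print(bus, [memory_size_mb(bus, cap) for cap in (512, 1024, 2048)])
# 448-bit -> 14 chips -> 896 MB (GTX 260); 320-bit -> 640 MB (8800 GTS)
# or 1280 MB (GTX 470); 512-bit -> 16 chips -> clean 1/2/4 GB steps.
```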

And by increasing bandwidth efficiency, the need to move to a bigger memory bus is negated. Look at RV730, i.e. the Radeon HD 4670, which outperformed R600 with about a quarter of its memory bandwidth (of course this doesn't take into account the fact that R600 didn't need that much bandwidth anyway, but you get my point).

I don't know... according to AnandTech, Barts has a "simple" GDDR5 memory controller... maybe a "simple" low-speed 512-bit GDDR5 MC >> a "complex" high-speed 256-bit GDDR5 MC.

I'll quote dkanter's post from another thread: the higher the bus width, the more internal plumbing you need leading up to it. Plus the PCB becomes more complex, as a 512-bit bus uses 16 memory chips versus 12 in the case of 384-bit.
 
You guys are making big assumptions that 512-bit bandwidth wouldn't be utilized. It was overkill for R600, but some games have proven to be >30% BW-limited on the HD 4850 and Cypress, so doubling BW would give you up to a 20% framerate increase there. That's huge, and you'd need at least 50% more SPs to achieve that.

For high-end cards, cost won't increase that much. You need twice the number of memory chips, but they're half the capacity. You need a more complex PCB, but that cost isn't very relevant for $400+ cards.
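A worked version of that estimate, as a minimal sketch (assuming simply that the bandwidth-limited fraction of frame time scales inversely with bandwidth while the rest stays fixed):

```python
# If a fraction f of frame time is bandwidth-bound, scaling bandwidth by
# bw_scale shrinks only that portion; the rest of the frame is untouched.
def framerate_gain(f_bw_limited, bw_scale):
    new_time = (1 - f_bw_limited) + f_bw_limited / bw_scale
    return 1 / new_time - 1

for f in (0.30, 0.35):
    print(f"{f:.0%} BW-limited, 2x BW -> +{framerate_gain(f, 2.0):.1%} fps")
# ~17.6% and ~21.2%: "up to 20%" for games that are >30% BW-limited.
```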
 
I remember when the graphics industry first made the leap to cards with a 256-bit bus. It was 2002: the 3Dlabs P10 VPU (76M transistors) and the Matrox Parhelia-512 (80M transistors) both had 256-bit buses, and both (or at least one of them) launched before ATI's R300. Neither of those two cards could take full advantage of the 256-bit bus because, IIRC, neither had bandwidth-saving techniques (any type of hidden surface removal, Z-culling or whatever); only the R300 did (and later the much-delayed NV30 / GeForce FX).

How many cards have used a 512-bit bus?
I count:
*R600 / Radeon HD 2xxx
*GT200 / GeForce GTX 280

There may be more but I can't think of any.
 
How many cards have used a 512-bit bus?
I count:
*R600 / Radeon HD 2xxx
*GT200 / GeForce GTX 280

There may be more but I can't think of any.

+ GT200b / GTX 285

that's all...

here's my list of GPUs sorted by memory interface and die size:
64bit GPUs:

GT218 (GeForce G210): 57 mm²
RV810 (Radeon HD5400): 59 mm²
RV620 (Radeon HD34x0?): 67 mm²
G72 (GeForce 7300): 77 mm²
RV610 (Radeon HD2400): 82 mm²
RV505 (Radeon X1300LE/CE): ca. 85 mm²
G98 (GeForce 8400): 86 mm²
NV44/44a (GeForce 6200/6500/7100): 110 mm²


128bit GPUs:

RV515 (Radeon X1300): 100 mm²
GT216 (GeForce GT220): 100 mm²
Xenos (Xbox 360, 65nm): 100 mm² (guess)
RV830 (Radeon HD5500): 104 mm²
G86 (GeForce 8500): 117 mm²
RV635 (Radeon HD3650): 120 mm²
G73 (GeForce 7600): 127 mm²
GT215 (GeForce GT240): 139 mm²
RV740 (Radeon HD4770): 140 mm²
G96 (GeForce 9500): 144 mm²
RV530 (Radeon X1600): 150 mm²
RV630 (Radeon HD2600): 150 mm²
NV43 (GeForce 6600): 150 mm²
RV410 (Radeon X700): 156 mm²
MCP7A (GeForce 9300m/9400m): 160 mm²
G84 (GeForce 8600): 169 mm²
RV840 (Radeon HD5700): 170 mm²
Xenos (Xbox 360, 90nm): ca. 186 mm²


192bit GPUs:

GF106 (GeForce GTS45x): 238 mm²


256bit GPUs:

Parhelia: 180 mm²
G94b (GeForce 9600GT): 190 mm²
RV670 (Radeon HD38x0): 192 mm²
G71 (GeForce 7900GTX): 196 mm²
NV42 (GeForce 6800GS): 213 mm²
R300 (Radeon 9700): 218 mm²
G94 (GeForce 9600GT): 225 mm²
RV570 (Radeon X1950PRO): 230 mm²
R430 (Radeon X800XL): 240 mm²
Barts (Radeon HD6800): 255 mm²
RV770 (Radeon HD4800): 256 mm²
G92b (GeForce GTS250): 264 mm²
R420 (Radeon X800XT): 281 mm²
RV790 (Radeon HD4890): 282 mm²
NV40 (GeForce 6800Ultra): 287 mm²
R520 (Radeon X1800): 288 mm²
RV870 (Radeon HD5800): 331 mm²
G92 (GeForce 8800GT): 334 mm²
G70 (GeForce 7800GTX): 334 mm²
R580 (Radeon X1900XT): 342 mm²
GF104 (GeForce GTX460): 367 mm²


384bit GPUs:

G80 (GeForce 8800GTX): 484 mm²
GF100 (GeForce GTX480): 530 mm²


512bit GPUs:

R600 (Radeon HD2900): 420 mm²
G200b (GeForce GTX285): 480 mm²
G200 (GeForce GTX280): 576 mm²
 
You guys are making big assumptions that 512-bit bandwidth wouldn't be utilized. It was overkill for R600, but some games have proven to be >30% BW-limited on the HD 4850 and Cypress, so doubling BW would give you up to a 20% framerate increase there. That's huge, and you'd need at least 50% more SPs to achieve that.

For high-end cards, cost won't increase that much. You need twice the number of memory chips, but they're half the capacity. You need a more complex PCB, but that cost isn't very relevant for $400+ cards.

Rumors point to a 512-bit bus width for GF110 and a 2 GB frame buffer.

Assuming the GPU needs roughly 230 GB/s of bandwidth: on a 512-bit bus with 900 MHz GDDR5 you're already there, while on a 256-bit bus you'd need 1800 MHz GDDR5. Any possible availability problems for the latter aside, what would it cost compared to the first scenario, especially for 2 GB frame buffers per GPU?

Of course the middle-of-the-road answer to the above would be 1200 MHz on a 384-bit bus, but on the other hand they'd probably win a few percentage points of performance from the additional 512 MB of memory at extreme resolutions with high AA sample counts.
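The arithmetic behind those three scenarios, as a quick sketch (GDDR5 is quad-pumped, so the effective data rate is four times the base memory clock):

```python
# GB/s = (bus width in bits / 8) * effective Gbps per pin,
# where effective Gbps = base memory clock in MHz * 4 / 1000 for GDDR5.
def gddr5_bandwidth_gbs(bus_width_bits, mem_clock_mhz):
    effective_gbps_per_pin = mem_clock_mhz * 4 / 1000
    return (bus_width_bits / 8) * effective_gbps_per_pin

for bus, clk in ((512, 900), (384, 1200), (256, 1800)):
    print(f"{bus}-bit @ {clk} MHz -> {gddr5_bandwidth_gbs(bus, clk):.1f} GB/s")
# all three configurations land at the same ~230.4 GB/s
```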
 
Rumors point to a 512-bit bus width for GF110 and a 2 GB frame buffer.

And how large would that GF110 have to be just to support a 512-bit bus alongside an increased number of SPs clocked high enough to saturate it? The rumor mill says it should be a GF104 derivative, and GF104 is nothing more than the cancelled GT212 from when Fermi was announced, now adapted for the DX11 pipeline instead of the original DX10.1. It doesn't just share the same kind of "simple" SPs with the GT200 series, but TMU performance as well.

And GF104 proved to be a better gaming part than GF100. So my reasoning is that we won't see a 512-bit GDDR5 memory controller on GF110; I don't even believe GF110 will saturate Cayman's memory bandwidth if they share the same 256-bit 6 Gbps setup.

Also, I'd bet they would do a much better job redesigning the poorly performing Fermi GDDR5 MC than widening it, which would only push the consumption of an already power-hungry MC to new heights. The better way, in just the opposite direction :), would be to do as AMD did with Barts: use an improved version of the 256-bit MC that satisfies the chip's needs. IMO, GF110 shares the same MC as GF104 without a redesign (width aside).
 
Assuming the GPU needs roughly 230 GB/s of bandwidth: on a 512-bit bus with 900 MHz GDDR5 you're already there, while on a 256-bit bus you'd need 1800 MHz GDDR5. Any possible availability problems for the latter aside, what would it cost compared to the first scenario, especially for 2 GB frame buffers per GPU?
You can't ever say that a GPU needs X bandwidth. The optimal balance of BW, setup, ROPs, SIMDs, etc. changes for each triangle of the scene.

For BW, the best model is the one I described. Workloads are usually pretty starkly BW-limited when they are, so linear models (i.e. scene time = X / GPUclk + Y / BW) extrapolate well. If you can crank out pixels as fast as Cypress can, then you'll be 15-35% BW-limited in games at 154 GB/s, depending on the title. So moving up to 230 GB/s will buy you 5-11% performance, and moving up to 308 GB/s will buy you 11-21%.

I really don't know what the total costs are. I do know that the other way to get 20% performance (adding ALUs/setup along with the more complex crossbars and support logic to feed them) is not cheap.
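To make that extrapolation concrete, here's a minimal sketch of the linear model; the only inputs are the quoted 15-35% bandwidth-limited fractions at the 154 GB/s baseline, so exact per-title numbers will differ:

```python
# scene_time = X / GPUclk + Y / BW: hold the GPU-side term fixed and scale
# only the bandwidth-bound fraction f, measured at the 154 GB/s baseline.
def fps_gain(f_at_base, base_bw, new_bw):
    t = (1 - f_at_base) + f_at_base * base_bw / new_bw
    return 1 / t - 1

for new_bw in (230, 308):
    lo, hi = fps_gain(0.15, 154, new_bw), fps_gain(0.35, 154, new_bw)
    print(f"{new_bw} GB/s: +{lo:.0%} to +{hi:.0%}")
# ~5-13% at 230 GB/s, ~8-21% at 308 GB/s: close to the ranges quoted above.
```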
 
512 CUDA cores would mean GF100-type groupings of 32 rather than GF104-type groupings of 48?

Yes, but they could make a combo solution :D

16 SMs like GF100, each with 48 cores like GF104, making the rumored 768-core GPU.
 
I think the most likely configurations are a GF100-like GPU with all SMs enabled (64 TMUs, 512 SPs, 384-bit) and better power consumption, or GF104 × 1.5 (96 TMUs, 576 SPs, 384-bit).
 
And how large would that GF110 have to be just to support a 512-bit bus alongside an increased number of SPs clocked high enough to saturate it?

Depends on which markets the chip is actually meant for.

The rumor mill says it should be a GF104 derivative
GF104's SMs are 3×16 with no DP support. According to AnandTech, each GF104 SM (48 SPs, dual scheduler, 8 TMUs, no DP) is roughly 25% bigger than a GF100 SM (32 SPs, 4 TMUs, DP at 1/2 rate).

When I hear 512 SPs my mind obviously doesn't go to a 3×16 but to a 2×16 configuration. The possibility that each SM could this time contain 8 TMUs doesn't justify in the least any supposed "GF104 derivative".

the cancelled GT212 from when Fermi was announced, now adapted for the DX11 pipeline instead of the original DX10.1.
I'd love to read that horseshit story in detail, explaining how that is even possible in a sensible way. The possibility that NV might have diverted GT212 resources into a GF104 development team when they killed that project has absolutely nothing to do with the nonsense rumor above. And while we're at it, guess what: the vaporware GT212 (DX10.1) had neither 2×16 nor 3×16 SMs, if that helps.

It doesn't just share the same kind of "simple" SPs with the GT200 series, but TMU performance as well.
There's one major difference between GT200 and GF1xx TMUs in case you haven't noticed yet: in the latter, the TMUs sit in the SMs. Other than that, in all likelihood GF110 might have 8 TMUs/SM, but the TMU count neither has anything to do with GT200 nor makes the chip any sort of GF104 "derivative".

If it really contains 8 TMUs/SM, it would mean that this was NV's only other option to increase performance by a noticeable notch compared to today's GTX 480.

And GF104 proved to be a better gaming part than GF100. So my reasoning is that we won't see a 512-bit GDDR5 memory controller on GF110; I don't even believe GF110 will saturate Cayman's memory bandwidth if they share the same 256-bit 6 Gbps setup.
With roughly 110% more texel fill-rate than a GTX 480, I'd expect a healthy increase over the latter's raw bandwidth. Has NVIDIA typically gone the high-frequency RAM route in the recent past, or is it just a weird coincidence that G80 was on a 384-bit bus, G92 on 256-bit and GT200 on 512-bit? Convince me why a 512-bit bus was a necessity on GT200 and why they didn't bank on a 384-bit bus instead, and then it will be a lot easier to reach common ground on that one. As brute-force as the 512-bit scenario might sound, it is rather typical for NVIDIA, and no, it obviously isn't the most "elegant" or efficient solution, but that's beside the point.

Also, I'd bet they would do a much better job redesigning the poorly performing Fermi GDDR5 MC than widening it, which would only push the consumption of an already power-hungry MC to new heights. The better way, in just the opposite direction :), would be to do as AMD did with Barts: use an improved version of the 256-bit MC that satisfies the chip's needs. IMO, GF110 shares the same MC as GF104 without a redesign (width aside).
See above. What's historically more likely? And was the MC on G8x/9x/2x0 broken too?

You can't ever say that a GPU needs X bandwidth. The optimal balance of BW, setup, ROPs, SIMDs, etc. changes for each triangle of the scene.

I personally obviously can't, but I'd imagine that each IHV's engineers run a long sequence of simulations well before release to find the "golden bandwidth spot" for each architecture.

For BW, the best model is the one I described. Workloads are usually pretty starkly BW-limited when they are, so linear models (i.e. scene time = X / GPUclk + Y / BW) extrapolate well. If you can crank out pixels as fast as Cypress can, then you'll be 15-35% BW-limited in games at 154 GB/s, depending on the title. So moving up to 230 GB/s will buy you 5-11% performance, and moving up to 308 GB/s will buy you 11-21%.

I really don't know what the total costs are. I do know that the other way to get 20% performance (adding ALUs/setup along with the more complex crossbars and support logic to feed them) is not cheap.

Some rumors want GF110 to have 8 TMUs/SM. With 512 SPs that would give you 128 TMUs, which could put you in the region of a roughly 110% texel fill-rate increase over the GTX 480. That fill-rate increase won't buy them anywhere near as much of a performance increase, of course, but my layman's imagination suggests that such a fill-rate increase would also need a strong bandwidth increase.

And I'm only thinking of a 512-bit bus because the rumors of a 2 GB framebuffer keep repeating lately. What I can't figure out for the world is what they'd do with 64 ROPs, if the ROPs stay the same at 8 per partition. If the entire chip should still have 4 raster units capable of rasterizing 8 pixels/clock each (32 pixels/clock in total), the hypothetical additional 32 ROPs sound like vast overkill.
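For what it's worth, here are the two ratios in question as a back-of-envelope sketch (the GTX 480 baseline of 15 SMs × 4 TMUs is from public specs; the 16 SM × 8 TMU and 64 ROP figures are the rumored numbers, not confirmed):

```python
# Texel fill-rate: rumored 16 SMs x 8 TMUs vs. GTX 480's 15 SMs x 4 TMUs.
gtx480_tmus = 15 * 4                   # 60 TMUs
rumored_tmus = 16 * 8                  # 128 TMUs
print(f"texel fill-rate increase: +{rumored_tmus / gtx480_tmus - 1:.0%}")

# ROPs vs. raster: 4 raster units x 8 pixels/clock feeding 64 ROPs
# (8 per partition x 8 partitions on a hypothetical 512-bit bus).
raster_pixels_per_clk = 4 * 8
rumored_rops = 64
print(f"raster can feed {raster_pixels_per_clk / rumored_rops:.0%} of the ROPs per clock")
```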
 