NVIDIA shows signs ... [2008 - 2017]

Yeah, and what I was trying to say is that if R700 is already 2x256b GDDR5 plus a bridge, then a single 512b GDDR5 interface should theoretically take the same power or slightly less. And they can even get away with 1GB of GDDR5 versus 2GB for R700 (and 2xRV870, I would assume), which is also on a single PCB. So I wouldn't worry too much about the memory power, personally.

Oh, I missed the R700 bit; I was thinking RV770, single die to single die. GT200 vs RV770 is either 512b GDDR3 vs 256b GDDR3, or vs 256b GDDR5. All I was saying is that if GT300 goes to GDDR5, power use will go up. That's it.

Yes, but naturally that's not my point. My point is they can set the clock speed (and therefore the voltage) at whatever they need to hit a ~300W TDP. If the question is "who has the performance crown", then throwing more transistors at the problem at a lower voltage is not a bad way to make sure you achieve that within your thermal/power limit. Of course, that is a secondary factor next to overall design efficiency.
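For what it's worth, the first-order version of that trade-off (illustrative only, ignoring leakage, no GT300/RV870 numbers implied) is just P_{dyn} \approx N \cdot \alpha C V^2 f with throughput \propto N \cdot f. So doubling the unit count N while dropping V and f by ~20% lands you at roughly the same dynamic power (2 \times 0.8^2 \times 0.8 \approx 1.02) with about 1.6x the throughput, as long as the design isn't leakage- or area-limited.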

My point is that if they have to downclock, they likely won't hit 2x performance. ATI has much more room to maneuver in that regard.

-Charlie
 
The religious side is that NV has been banging the GPGPU drum for years now, and skewing their chips that way for years.
You're overestimating the GPGPU die size impact by at least one order of magnitude. This is the same kind of reasoning that says RV770 is a great product because it was designed to be a small chip. No. It's a great product because it's a really great architecture done by a great team, and they did an awesome job in the back-end to achieve great transistor density. It wouldn't magically be less efficient if you scaled it up, unless you hit triangle setup bottlenecks, etc. (or had to go for lower clock speeds because of intra-chip variability).

The largest overhead for GPGPU on G8x/G9x/GT2xx is probably the 16KB of shared memory and, in GT200, the decision to go for a 64KB register file, which was probably more beneficial to perf/mm² for GPGPU than for 3D. Apparently, RV770 had shared memory too, and last I heard it wasn't a failure!

Now, if you're saying the overhead for GPGPU-only functionality on GT300 will increase: probably, but so will the HPC revenue it gets amortized over, and AMD will obviously carry more of that overhead than in the G92 vs RV670 generation too, for OpenCL. If you're assuming the overhead will increase by an insane amount, then everything I've heard indicates to me you're wrong. Of course, what I heard could also be wrong, so it's my word versus yours at this point.
 
Because of the D3D11 compute shader, there's practically no "GPGPU overhead" in the next GPUs. CS should be a big deal, e.g. in geometry and post-processing, and with it able to work on D3D10 GPUs in a not-too-limited fashion, it should have a huge impact on the "market" for GPGPU and its accessibility to graphics programmers.

It seems likely GT300 will push beyond the GPGPU capabilities of D3D11, but I would argue those are fundamental architecture questions, such as task-level parallelism: work that needs doing anyway, so the question is when to introduce it and whether having it early brings any advantages.

Jawed
 
Oh, I missed the R700 bit; I was thinking RV770, single die to single die. GT200 vs RV770 is either 512b GDDR3 vs 256b GDDR3, or vs 256b GDDR5. All I was saying is that if GT300 goes to GDDR5, power use will go up. That's it.
Okay looking back at the quotes, I realize I should have quoted the original thing you said to be clear. You said:
ATI's power budget takes GDDR5 into account, NV's doesn't, so another black mark for NV. How much do you think the rumored 512b GDDR5 will consume?
My point was basically "we know how much 512b GDDR5 will consume, just look at R700, increase it for higher memory clock speeds and reduce it by whatever the PCIe bridge takes. And maybe also 1GB vs 2GB". So even assuming (unlikely) that NV engineers are too dumb to look at a GDDR5 specsheet, it's not hard to see that it's not such a big deal. Either way this is just a detail and it's best we don't focus too much on it...
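In symbols, the estimate I'm describing is simply P_{mem}(\text{GT300}) \approx P_{mem}(\text{R700}) \cdot \frac{f_{mem,new}}{f_{mem,old}} - P_{bridge} - \Delta P_{2GB \to 1GB}, where every term on the right can be read off a shipping R700 board or a GDDR5 datasheet; the exact wattages don't matter for the argument, which is why I'd rather not dwell on it.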

With regard to GT300 in general, Charlie, the problem is that you made a whole bunch of bizarre claims in your GT300 article that are almost certainly horribly wrong. And that's before we consider the non-GT300 tidbits you got wrong, such as:
- Process node density. The node numbers are marketing: 55nm is not 0.72x the area of 65nm, it's 0.81x. The real density figures are available publicly and you should only look at those. Density from one full node to another full node is not fully comparable, but you should at least be looking at the kgates/mm² and SRAM cell size numbers (see the quick arithmetic after this list).
- "The shrink, GT206/GT200b" - I can tell you with 100% certainty that GT206 got canned and had nothing to do with GT200b. However, it still exists in another form.
- GT212 wasn't a GT200 shrink; it had a 384-bit GDDR5 bus, and that's just the beginning. You have had the right codenames and part of the story of what happened for some time, which is much more than the other rumor sites/leakers, but you don't have the right details.
- "Nvidia chipmaking of late has been laughably bad. GT200 was slated for November of 2007 and came out in May or so in 2008, two quarters late." - it was late to tape out, but it was fast from tape-out to availability (faster than RV770). G80 was aimed at 2005 for a long time but only taped out in 2006; does that make it a failure? For failures, look at G98/MCP78/MCP79 instead.
- etc.
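To spell out the density point from the first bullet: 0.72x is just the naive linear shrink squared, (55/65)^2 \approx 0.72, whereas the published kgates/mm² and SRAM cell sizes for the 65nm-to-55nm half-node work out to roughly 0.81x the area, and that is the number you should actually be using.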

Then we get to GT300:
- "The basic structure of GT300 is the same as Larrabee" - if that's true, you need scalar units for control/routing logic. That would probably be one of the most important things to mention...
- "Then the open question will be how wide the memory interface will have to be to support a hugely inefficient GPGPU strategy. That code has to be loaded, stored and flushed, taking bandwidth and memory." - Uhh... wow. Do you realize how low instruction bandwidth naturally is for parallel tasks? It never gets mentioned in the context of, say, Larrabee because it's absolutely negligible.
- "There is no dedicated tesselator" - so your theory is that Henry Moreton decided to apply for early retirement? That's... interesting. Just like it was normal to expect G80 to be subpar for the geometry shader, it is insane to think GT300 will be subpar at tesselation.
- " ATI has dedicated hardware where applicable, Nvidia has general purpose shaders roped into doing things far less efficiently" - since this would also apply to DX10 tasks, your theory seems to be that every NV engineer is a retard and even their "DXn gen must be the fastest DXn-1 chip" mentality has faded away.

If I took the time to list all this, it's so that I don't feel compelled to reply in this thread again too often. It's important to understand that it's difficult to take you seriously about NV's DX11 generation when that is your main article on it, and I wasn't even exhaustive in my list of issues with it, nor did I include the details I know. Their DX11 generation is more generalized (duh), but not in the direction/way you think, not by as much as you think, and not for the reasons you think. Anyway, enough posting for me for now! :)
 
_Full-body geek slam_

<Kelso voice on>
BUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUURN!
<Kelso voice off>
/me waits for the reply
 
If you can throw transistors at a chip to lower absolute power, that costs die area. ATI is in a much better position than NV there, they can expand the die a lot if necessary. NV can't.

That's true. However, at this point we don't know what RV870 or GT300 look like in terms of power efficiency for a given level of performance.

It didn't pan out that way, but the architecture was likely already decided.

I don't know. I think objectively they've come a hell of a long way since CUDA first dropped. We've gone from essentially nothing to having GPUs in supercomputer clusters in less than two years. That's quite a feat, especially for a proprietary solution.
 
People like to suggest Atom+Chipset neatly soldered to a board that's ready to go is < $X but where's the evidence?

In this case I think common sense rules. Bundle pricing obviously should come in under the sum of the parts. But all of the noise points to Atom pricing being not much lower. Sure, it's just noise, but where there's smoke...

And even if that's the case, if 99% of Intel's production is this pair of chips soldered to a board, then it would actually cost extra to unbundle.

It's not a question of unbundling. The only defensible argument would be if exhaustive testing was only done on the Atom+chipset combo and additional resources would be required for Atom only tests. However, I highly doubt Intel doesn't know an Atom CPU is fully functional before sending it down the line for integration with the chipset. In which case it costs them nothing to skip that integration and sell the CPU on its own.

I'm not trying to deny the possibility of venal behaviour by Intel, just that NVidia's at least as capable of it, so why take sides?

It's not taking sides per se. But two wrongs don't make a right. And unrelated issues like these certainly don't cancel each other out.
 
In addition to this, it appears NVidia will be moving the space-hungry DDR physical I/O off-die onto a hub chip. That could save a wodge of space. If the ROPs end up on that chip (sort of like a bastard offspring of Xenos's EDRAM daughter die), then that's a monster amount of space saved on-die.

That then makes for a monster dense Tesla chip, with no messing with ROP functionality as it is partnered by a dedicated memory hub chip that doesn't have ROPs on it.

So instead of GT200 being ~25% area ALUs, GT300 could be 80% area ALUs (rest being texturing, control, video, IO-crud). This would be quite an eye-opener :D

Indeed, a bit genius.

Ironic: if I'm right about architecture then NVidia's asymmetric dual-chip configuration is superior to ATI's symmetric one. GT300 will be a whole lot slimmer than expected while having the grunt to take on 2xATI, without any of that AFR bullshit.

Jawed

What makes you think they are doing this? Moving your I/Os to another chip doesn't necessarily solve that problem at all, and introduces quite a few more. Not that it's impossible, but I'm very curious to hear your reasoning.

I don't know what the mm²/Gbps and mW/Gbps figures are for GDDR5, but for them to have a win, they'd need to use an interconnect between the "GPU" and the "GPU memory controller" that is way better (in both of those metrics) than GDDR5.

If they have 200GB/s of memory bandwidth (very reasonable), what interconnect will move 200GB/s between two chips, how much area and power will it use? That power and area is all extra, since with a normal architecture it'd just be GDDR5<-->DRAM. Now you have GPU<-->magic interconnect<-->GDDR5<-->DRAM.
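To put a rough number on it (with a purely hypothetical lane rate): 200GB/s each way is 1.6Tb/s, so at, say, 5Gb/s per differential pair that's 1600 / 5 = 320 lanes per direction, plus the SerDes area and power to drive them, and all of that sits on top of the GDDR5 PHY the hub chip still needs.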

Moreover, it seems like that would do rather catastrophic things to latency for atomics, which goes against NV's goal of focusing on GPGPU.

So I'm curious:
1. What interconnect would you imagine they use?
2. What exactly would they move to the 2nd "GPU memory controller"?
3. Can you elaborate why you think they might go this route?

DK
 
That then makes for a monster dense Tesla chip, with no messing with ROP functionality as it is partnered by a dedicated memory hub chip that doesn't have ROPs on it.


Umm... :rolleyes:

1) Then the Tesla chip won't be able to do any atomic operations.

2) And why on earth would a Tesla chip have video stuff?

EDIT: One more snag here. Then both your compute chip and your NVIO3 chip will need at least a 512-bit memory interface, imposing a rather large lower bound on the die size (via pad limits).
 
GDDR5<-->DRAM. Now you have GPU<-->magic interconnect<-->GDDR5<-->DRAM.

This has me confused. What do you mean by GDDR5 and DRAM being different? Does GDDR5 here refer to the memory controller?
 
Charlie,

Why not elaborate on the details of the GPGPU architectural changes in GT300 compared to GT200 that have brought you to the conclusions you have rendered? I would expect, after seeing the detail with which you presented the 8/9-series defect issue, that you also have something substantial to back up what you are saying about GT300. For your arguments about GT300 to carry any weight here, you are going to have to get into some core architectural design choices that support your position.

For example, what basic structure are you referring to that is common between Larrabee and GT300? Are you referring to GT300 having a per-multiprocessor writable and coherent L1 with a shared L2 memory hierarchy, similar to Larrabee? Or perhaps similar 4-way hardware hyperthreading with software fibers to extend this to N-way, as per the Larrabee design (in contrast to the N<=32 variable hardware "hyperthreading" design of GT200)? Or perhaps a combined vector+scalar register file and instruction set? Or perhaps GT300 using an L2 cache to communicate to/from the texture units like Larrabee does?
 
Moreover, it seems like that would do rather catastrophic things to latency for atomics, which goes against NV's goal of focusing on GPGPU.

As per DX11/OpenCL/CUDA, global atomics take the form

previousDataAtAddress = atomicOp(address, operationParameters)

Assuming the memory controller does the atomic ALU operation (unlike a CPU design, where the CPU performs the atomic ALU operation on a cacheline in L1), latency for atomics wouldn't be any higher than a standard global data read. And in cases where the kernel/shader doesn't need the return value of the atomic operation, latency wouldn't be a problem at all. The basic idea is that you could place something like atomic operations, which require core-to-core synchronization, post-MC, where the individual memory requests are already "serialized" and don't require coherency with other MCs (completely avoiding the coherency problem). Atomic operation throughput would then mostly be a function of how atomic operations are distributed across the different MCs (by address, per how the MCs divide up the address space) and of the throughput of the dedicated atomic ALU attached to each MC.
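To make the two cases concrete, here's a minimal CUDA-style sketch (the kernels and the 256-bin histogram are purely illustrative, not tied to any real chip): in the first the return value is ignored, so the thread never waits on the round trip; in the second the returned old value is the result, so it does.

```
// Case 1: return value ignored. The thread can fire the atomic off towards
// the memory controller and keep going; round-trip latency barely matters.
__global__ void histogram(const unsigned int *input, unsigned int *bins, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[input[i] % 256], 1u);
}

// Case 2: return value needed. The old value *is* the result (a write slot),
// so the thread has to wait for the full round trip, roughly the cost of an
// uncached global read.
__global__ void appendItems(unsigned int *counter, unsigned int *out,
                            const unsigned int *items, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int slot = atomicAdd(counter, 1u);
        out[slot] = items[i];
    }
}
```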
 
This has me confused. What do you mean by GDDR5 and DRAM being different? Does GDDR5 here refer to the memory controller?

Yes, I was using GDDR5 to refer to the memory controller. It's all a little confusing as GDDR5 could refer to:
1. A memory controller
2. A physical layer interface
3. A type of DRAM

DK
 
It's not a question of unbundling. The only defensible argument would be if exhaustive testing was only done on the Atom+chipset combo and additional resources would be required for Atom only tests.
You think that inventory overheads, separate packaging, etc. are irrelevant? I don't know the costs. It might only be a 1% difference, which would be irrelevant. But I don't know how to work them out :???:

All this arguing is over a rumour that's very convenient for NVidia's marketing strategy. Regardless of the pricing, NVidia cries foul but doesn't declare the costs. NVidia's making a pre-emptive strike on Pineview by "getting people addicted to Ion awesomeness so that they reject Pineview". I don't blame them - Ion is a better product for that 1% of Atom consumers who want to do stuff... whatever that stuff is. But I'm not going to trust NVidia on the prices Intel's set, even if I expect Intel to be anti-competitive and know that Intel has OEMs in a vice-like grip that NVidia deeply envies.

Jawed
 
Just to go back to the thermal cycling issue: I was wondering how positively or negatively switching to ceramic substrates would affect the portion of the problem stemming from the different expansion rates of organic substrates and silicon dies.

Perhaps it is not possible to switch packaging without refactoring the silicon and revalidating the package wiring. Ceramic substrates are probably more expensive (though if it led to a reduced differential in expansion, it might have prevented some future multi-million-dollar lawsuit, so...).
 