NVIDIA Fermi: Architecture discussion

Not necessarily. For example, a good profiler is only useful for specific hardware. A shader optimized for one piece of hardware will sometimes run well on another, but unless the architectures of NVIDIA's and AMD's GPUs converge at some point, tools designed for NVIDIA's GPUs are not going to be very useful for AMD's.
I think you should look at the paragraph that you decided to cut out from your quote. ;)
 
I think you should look at the paragraph that you decided to cut out from your quote. ;)

Well, the problem is that AMD's current design, although better in peak performance (density) and in many image processing workloads, is not necessarily better in other workloads. It's much easier to write and optimize for a scalar model than for a vector model. In the end, it's possible that many GPGPU programs may run faster on NVIDIA's GPUs than on AMD's even though AMD's GPUs have higher peak performance. If this turns out to be true, there's no reason why NVIDIA should go AMD's route. Of course, AMD may try to go NVIDIA's route, but even so they are not going to have very similar architectures, at least in the near future.
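To put the scalar vs. vector point in concrete terms, here's a rough, made-up sketch in CUDA-style code (hypothetical kernels, purely for illustration, not anyone's real program):

// Scalar model: one element per thread; serially dependent code is fine
// because the parallelism comes from running many threads.
__global__ void scale_scalar(const float* in, float* out, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * k;
}

// Vector model: to keep 4- or 5-wide units busy, independent operations have
// to be packed into each instruction (here via float4). For code less regular
// than this, finding that independent work per thread is the hard part.
__global__ void scale_vec4(const float4* in, float4* out, float k, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];
        out[i] = make_float4(v.x * k, v.y * k, v.z * k, v.w * k);
    }
}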
 
@Mintmaster

Won't clauses go away as memory access patterns and latencies become more varied and unpredictable? I don't see how they're sustainable in the compute world.
I don't think so. Clauses are a great way to minimize batch switching and maximize latency hiding. I'm sure hyperthreading works in a similar way, where as many loads as possible for one thread are issued while another thread is occupying the execution units.

For a lot of applications they don't matter because there could be enough threads to saturate tex throughput even with dependent ALU-TEX-ALU-TEX sequences, but some register-heavy programs may not have many threads in flight, and clauses will help immensely.
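As a loose illustration of what clause-style grouping buys you (a made-up CUDA-flavored sketch, nobody's real kernel): issue the independent fetches together, then do the dependent math, so the memory latency of all the loads overlaps instead of serializing as fetch-use, fetch-use.

// Assumes a and b each hold 2*n floats (two planes of n elements).
__global__ void blend2(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // "Fetch clause": four independent loads issued back to back.
    float a0 = a[i];
    float a1 = a[i + n];
    float b0 = b[i];
    float b1 = b[i + n];

    // "ALU clause": dependent arithmetic only after all loads are in flight.
    out[i] = a0 * b0 + a1 * b1;
}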
 
Well, the problem is that AMD's current design, although better in peak performance (density) and in many image processing workloads, is not necessarily better in other workloads. It's much easier to write and optimize for a scalar model than for a vector model.
Hence all the time I'm wasting explaining how AMD can move to a scalar model (or quasi-vector model that allows serially dependent code, if that's how you want to view my proposal).

In any case, note that I said NVidia will be fine "IF optimizations on their hardware do not carry over to ATI's". I don't think that will be the case by the time GPGPU becomes a substantial market. ATI will add the few missing features when it needs to (speedy atomic operations, public-write shared memory, indirect branching, more cache access, etc).
 
Hence all the time I'm wasting explaining how AMD can move to a scalar model (or quasi-vector model that allows serially dependent code, if that's how you want to view my proposal).

In any case, note that I said NVidia will be fine "IF optimizations on their hardware do not carry over to ATI's". I don't think that will be the case by the time GPGPU becomes a substantial market. ATI will add the few missing features when it needs to (speedy atomic operations, public-write shared memory, indirect branching, more cache access, etc).

I don't think missing features are the major problem though. The major problem, IMHO, is whether GPUs are going to have very similar performance characteristics, just like current CPUs. For example, do they have similar texture caches? Does an access pattern on shared memory have similar performance on both GPU architectures? Even CPUs sometimes have quite different performance characteristics, such as P4 vs. K7.

Also, when developers use something as their primary developing platform, they tend to spend more time optimizing for it. Therefore, it's quite likely that their programs are going to perform better on that platform.

Of course, if NVIDIA actually hits the jackpot and becomes the king of GPGPU, and GPGPU becomes the best thing since sliced bread, AMD could do what they did in the x86 CPU world: mimic the leading architecture's performance characteristics. But that's not an easy thing to do.
 
I don't think missing features are the major problem though. The major problem, IMHO, is whether GPUs are going to have very similar performance characteristics, just like current CPUs. For example, do they have similar texture caches? Does an access pattern on shared memory have similar performance on both GPU architectures? Even CPUs sometimes have quite different performance characteristics, such as P4 vs. K7.
I think there will be very little room for special optimization with massively parallel processing, and things that are good for one GPU will be good for another. CPUs have very complicated prefetch and branch prediction, and their performance is determined much more by their ability to cleverly eliminate stalls than by their raw throughput. For parallel workloads, however, there is no branch misprediction penalty and no stalls from cache misses. GPUs will be able to hide just about everything if it's a parallel task. Caches in GPUs are there to reduce BW requirements, not to avoid stalls, and that is achieved through intelligent data arrangement in the program rather than clever prefetch or cache-line lifetime algorithms in the hardware.
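A loose illustration of that "intelligent data arrangement" point (made-up CUDA-style kernels, just for the example): the same read, but a structure-of-arrays layout lets neighbouring threads hit contiguous words, so the memory system moves far fewer wasted bytes than with the strided array-of-structures accesses.

struct ParticleAoS { float x, y, z, w; };

// Array-of-structures: each thread reads 4 bytes at a 16-byte stride,
// so much of every fetched line is thrown away.
__global__ void copy_x_aos(const ParticleAoS* p, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p[i].x;
}

// Structure-of-arrays: consecutive threads read consecutive words,
// so the fetched bytes are exactly the bytes that get used.
__global__ void copy_x_soa(const float* x, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i];
}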

Sure, companies like Matrox or S3 had hardware that didn't live up to their specs, but ATI and NVidia know what they're doing, and workloads are nearly always limited by one throughput cap or another. It may be setup rate, bandwidth, ALU, interpolators, etc., but one of them is always near 100% utilization, so the performance is quite predictable. It's not like few-threaded CPU workloads at all.

What's more, once you get your code functional, optimizing for another platform may be as simple as changing a few parameters, e.g. block size for matrix multiplication, thread/group size, etc.
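For instance, the kind of parameter I mean might look like this (a made-up sketch; hypothetical kernel, assumes n is a multiple of TILE):

#define TILE 16   // e.g. 16 on one architecture, 8 or 32 on another

// Simple shared-memory tiled matrix multiply, C = A * B.
// Launch with dim3 block(TILE, TILE) and dim3 grid(n/TILE, n/TILE).
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}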
Of course, if NVIDIA actually hits the jackpot and becomes the king of GPGPU, and GPGPU becomes the best thing since sliced bread, AMD could do what they did in the x86 CPU world: mimic the leading architecture's performance characteristics. But that's not an easy thing to do.
I think for GPGPU it's far easier than with x86 CPU for all the reasons I mentioned above. However, right now the pressure is on NVidia to match ATI's compute density. ATI is sort of playing the waiting game for GPGPU, getting ready to pounce as a far more cost effective solution for any lucrative application.
 
However, right now the pressure is on NVidia to match ATI's compute density.

Are we talking about theoretical FLOPS here? Or FLOPS per unit of transistor area?
 
Are we talking about theoretical FLOPS here? Or FLOPS per unit of transistor area?
Even measured flops per transistor. For example, prunedtree measured 880 Gflops on RV770 for matrix multiplication, while the much larger GT200 achieved 375 Gflops with a CUDA-optimized algorithm. I don't know how big GT200b is, but that's over a 3x difference in perf/mm2 for a very real and basic operation.

Yes, GPU Folding does much better on NVidia hardware, but that's largely due to feature omission by ATI, workload differences, coding, etc. These are things that ATI can nullify very quickly if GPGPU grows.
 
NVidia's tools may be great, but not only is it probably quite easy for ATI to basically copy them feature for feature on the software front once the market becomes larger, but even if they can't, open standards will make it irrelevant because final deployment can be on any hardware.

Open standards don't commodify the leader as much as you think, for many reasons.

1) Developers tend to stick to the platform that they learn. Languages, IDEs, you name it. There's a switching cost from habitual learning.

2) Almost no one implements standards the same way. Two vendor implementations often differ in bugs and performance.

3) A vendor can protect their market without draconian methods like Apple's. Consider the fact that Microsoft retains a majority of the browser market despite five different standards having come and gone (HTML 1-5), and despite Microsoft's browser being 20x slower than current browsers, less secure, and feature incomplete. This, even though Netscape once had 80% market share. (And no, I do not attribute this solely to the fact that IE is often pre-installed. Differences in IE6/IE8 marketshare curves show people are habitually sticking, i.e. choosing to upgrade their existing browser.)

If Nvidia builds a substantial developer community and strong developer relations with software publishers, chances are, they will retain an advantage, even if they have inferior price/performance.

I think it is dangerous to assume that OpenCL/DX11 = write once, run anywhere, that is, you just tweak a few parameters and everything runs optimally. I think the reality will be much more brutally harsh, with bugs and compiler differences, combined with hidden hardware hazards, meaning that it will still be important to spend ample time optimizing for differing architectures.

I mean, the Folding@Home example points to issues surrounding general purpose portability. I don't expect this to really go away unless AMD and NVidia architectures converge.
 
I think there will be very little room for special optimization with massively parallel processing, and things that are good for one GPU will be good for another.
While I agree with this compared to the CPU vs. GPU space, I'm not sure I agree in general. Even now, to get top performance you often have to write code optimized for specific chips, even if it's all DirectCompute for instance, and even more so with OpenCL. I almost always have to tune things like block sizes and local memory usage, not to mention atomics and such, to get the best performance out of various targets. In fact, in the latest code I've been writing I've had to formalize this in terms of variables tweaked for various architectures, because otherwise you can be 2x or more off the best performance of a given chip.
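Roughly what that formalization looks like, sketched here in CUDA-style code for illustration (hypothetical struct and placeholder values, not my actual code): query the device once and pick the knobs from a per-architecture table.

#include <cuda_runtime.h>

struct TuneParams { int blockSize; int itemsPerThread; bool useSharedAtomics; };

TuneParams pickParams(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    TuneParams p;
    // Placeholder numbers; in practice each entry comes from benchmarking the
    // target chip, and the wrong choice can easily cost 2x in performance.
    if (prop.major >= 2) {            // e.g. a Fermi-class part
        p.blockSize = 256; p.itemsPerThread = 4; p.useSharedAtomics = true;
    } else {                          // e.g. a G80/GT200-class part
        p.blockSize = 128; p.itemsPerThread = 8; p.useSharedAtomics = false;
    }
    return p;
}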

I suspect this will get worse rather than better as these programming languages get more expressive too, but we'll see.
 
Differences in IE6/IE8 marketshare curves show people are habitually sticking, i.e. choosing to upgrade their existing browser
Sorry, but this argument is wrong. IE6 to IE7 to IE8 upgrades come from Windows Update.
99% of people either just click "install updates" or have them disabled.
 
Sorry, but this argument is wrong. IE6 to IE7 to IE8 upgrades come from Windows Update.
99% of people either just click "install updates" or have them disabled.

It's irrelevant how they get the install; this data comes from browser surveys of users (i.e. polling human beings, not server log analysis). The reality is that a significant number of people choose to stick with IE-based browsers. In some Asian locales, browsers that wrap and extend IE (and thus require a specific user download) are prevalent.

IE's continued presence can be boiled down to 3 factors: 1) default browser for new machines 2) corporate admins locking down XP desktops and forcing IE6 and 3) users who simply are used to IE and like it and do not want to switch to a foreign UI.

Oh, and in some countries, like South Korea, historical US crypto-export bans, coupled with heavy usage of a 128-bit crypto ActiveX control by government and banks, and a recent supreme court precedent, have effectively required IE by law. (A South Korean man sued the government for the right to use a non-IE browser on a government website, and lost the case in the supreme court, IIRC.)
 
1) Developers tend to stick to the platform that they learn. Languages, IDEs, you name it. There's a switching cost from habitual learning.

IMHO good developers adapt to new technologies/languages. I don't think people like John Carmack are using the same tools for Rage that they used to write Wolf3D.

3) A vendor can protect their market without draconian methods like Apple's. Consider the fact that Microsoft retains a majority of the browser market despite five different standards having come and gone (HTML 1-5), and despite Microsoft's browser being 20x slower than current browsers, less secure, and feature incomplete.

I hope that was intended to be humorous. Apple has a solid product. So does Nvidia with CUDA. Microsoft never did. They just used their monopoly.

If Nvidia builds a substantial developer community and strong developer relations with software publishers, chances are, they will retain an advantage, even if they have inferior price/performance.

Why would that happen? If the competitor has a better product and consumers buy that product, developers should target that, not Nvidia's inferior solution.

I think it is dangerous to assume that OpenCL/DX11 = write once, run anywhere, that is, you just tweak a few parameters and everything runs optimally. I think the reality will be much more brutally harsh, with bugs and compiler differences, combined with hidden hardware hazards, meaning that it will still be important to spend ample time optimizing for differing architectures.

If 90% of your OpenCL can be shared with a few vendor-specific optimizations, what's the problem? It's no different from how people optimized code for AMD or Intel in the past.

I mean, the Folding@Home example points to issues surrounding general purpose portability. I don't expect this to really go away unless AMD and NVidia architectures converge.

If the OpenCL abstraction layer is good enough you shouldn't care what the hardware looks like under the hood, right?
 
IMHO good developers adapt to new technologies/languages. I don't think people like John Carmack are using the same tools for Rage that they used to write Wolf3D.

Statistics dictate that most developers are not on the level of John Carmack. That's why most devs license engines and modify them.

Why would that happen? If the competitor has a better product and consumers buy that product, developers should target that, not Nvidia's inferior solution.

That's an idealized view of the world: that consumers will instantly choose the optimal product according to a set of criteria, and that developers will target it. In the real world, consumers choose to buy products based on many criteria as well as persuasive marketing. In the real world, a company releasing a new, superior product does not (most of the time) instantly disrupt the installed base.

And, in the real world, time is money. So unless a product is truly disruptive and so revolutionary that it is a world apart, simply being a little bit better would not necessarily convince a developer to give up productive tools purely for performance.

Look at it this way. Games are expensive and risky to develop. A developer wants to minimize development expense and maximize his market. To do that, he'll use off-the-shelf solutions when possible if they meet his needs (as long as the costs are not too high), and he'll target the lowest common denominator.

What this means is, one or two quarters of strict ATI dominance cannot meaningfully transform the deployed base of cards, nor is it going to convince devs to just drop NVidia. If anything, the lower performing card that is easier to develop on will be a closer target to the average consumer.

Thus, I stand by my comments that in the short to medium term, ATI's cost advantage is not going to sway developers or alter how games are developed very much. If Nvidia surrenders market share for 1-2 years, things could look different. In the mean time, good developer relations, marketing, and sales pressure can act to minimize losses.

There is a big difference in what geek/developer/enthusiasts like, and what is good for development business.

If the OpenCL abstraction layer is good enough you shouldn't care what the hardware looks like under the hood, right?

You could say the same for OpenGL and DirectX, yet ensuring optimal performance across GPUs often requires special code paths. Hell, NVidia and ATI have special *driver* code paths (app detection) for some games. The reality is, the abstraction layer can only do so much. All RDBMSes support SQL, but queries have differing performance characteristics on different databases, such that almost all multi-database applications end up with specialized code.
 
I think it is dangerous to assume that OpenCL/DX11 = write once, run anywhere, that is, you just tweak a few parameters and everything runs optimally. I think the reality will be much more brutally harsh, with bugs and compiler differences, combined with hidden hardware hazards, meaning that it will still be important to spend ample time optimizing for differing architectures

It will be just like Java, write once... debug everywhere ;).
 
It will be just like Java, write once... debug everywhere ;).
If it's as smooth as Java interoperability, using OpenCL could actually be enjoyable. I'm more pessimistic, though. Khronos should have developed a reference implementation together with the spec. If there is some ambiguity in the spec, you can at least compare against the reference implementation.
 
Guys, stop with the IE analogies. They're completely irrelevant, because nobody is paying for hardware (or anything else) there. NVidia's business model is to provide free software development tools to create a market that moves massive quantities of hardware - enough to be at least comparable to their GPU business.

Competition in a market that large will necessitate that NVidia is at least within spitting distance of ATI in hardware cost effectiveness. It could be as simple as "give us your code and we'll make it run at the same speed with 30% lower hardware cost". Who isn't going to take advantage of that?

NVidia painted themselves into a bit of a corner with their CUDA architecture, and I don't think they have much room to improve (G80 -> G92 -> GT200 didn't get much better in perf/transistor, and I'm skeptical about GT300 doing a lot better). They'll have to hope that GPGPU forces AMD to decrease efficiency by adopting some of their more expensive design decisions. After seeing R600, they were probably pretty confident in their competitiveness (as was I, which is why I still have an 8800GTS), but today it's a different story.
While I agree with this compared to the CPU vs. GPU space, I'm not sure I agree in general. Even now, to get top performance you often have to write code optimized for specific chips, even if it's all DirectCompute for instance, and even more so with OpenCL. I almost always have to tune things like block sizes and local memory usage, not to mention atomics and such, to get the best performance out of various targets. In fact, in the latest code I've been writing I've had to formalize this in terms of variables tweaked for various architectures, because otherwise you can be 2x or more off the best performance of a given chip.
This is very much in agreement with what I'm saying, though. A couple parameters are pretty easy to tweak, and you'll probably want those in your code anyway even for the purpose of optimizing for successive generations of NVidia hardware.

If you have a killer application of GPGPU, and get your code functional on NVidia hardware, wouldn't you choose AMD for your final product if you can tweak the same code to run more cost effectively on it?
 
If you have a killer application of GPGPU, and get your code functional on NVidia hardware, wouldn't you choose AMD for your final product if you can tweak the same code to run more cost effectively on it?

It depends on how easy it is to "tweak" the code. Sometimes it's pretty hard (such as, say, having to "vectorize" it in some cases). There are also time constraints. In many situations it's more important to deliver a usable product as soon as possible, so cost efficiency is not necessarily the most important issue.
 
I'm not saying it's necessarily the case. I'm saying it's usually the case, especially for the large-scale applications that will have to arrive to grow this market. And again, I don't expect it to be difficult to optimize by the time GPGPU becomes big. You won't be using different algorithms due to major feature discrepancies, because there won't be any.
 