What gives ATI's graphics architecture its advantages?

Well, from a 50,000-foot view, both AMD and NVidia are milking their "same" architectures; if anything, AMD is more of an incrementalist than NVidia when it comes to architecture. The real difference is that NVidia is betting on workloads (tessellation, double precision, CPU-like problems, ECC, etc.) that haven't materialized yet. I'd say NVidia went for (potentially unnecessary) complexity, and AMD went with a KISS approach.

I just think all of this drama about the companies themselves gets into the realm of aesthetics, subjective opinion, and f*nb*yism, whereas actual discussion of the merits of a design from a cost-benefit, engineering, or performance perspective, or of the effects on types of workloads, is much closer to something objective.

The original thread question was actually legitimate: why are AMD chips smaller for similar performance (implicitly, on the workloads being benchmarked today)? There is an objective answer to this question that doesn't involve emotion, corporate intrigue, or other junk.
 
I am a little confused by the last point. Is it really necessary to add more transistors to achieve higher clocks? I thought clocks were a matter of voltage and heat; other than that, any chip can be clocked higher with the right cooling (for example, the HD5870 already reaches 1300MHz on LN2).

Voltage and heat are not the only limiting factors. As you say, the cooler the chip, the higher it can go. But no matter how cool the chip is, there is a limit on how fast a signal can travel: the speed of light. The maximum possible frequency of a chip is determined by the maximum time needed for the signals to propagate in one clock tick. Because the speed of light is a constant, the only thing you can do is reduce the distance the signal has to travel. For this you can use pipelining, and this requires extra transistors for buffering and for splitting the work into stages.
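A toy model of that trade-off (the delay numbers are made up, just to show the shape of it):

Code:
#include <stdio.h>

/* f_max is set by the slowest pipeline stage. Splitting the combinational
   logic into more stages shortens the critical path, but every stage adds a
   fixed register (flip-flop) overhead - the extra transistors in question. */
int main(void) {
    double logic_ns = 2.0;   /* total combinational delay (assumed) */
    double reg_ns   = 0.1;   /* per-stage register overhead (assumed) */
    for (int stages = 1; stages <= 4; stages *= 2) {
        double stage_ns = logic_ns / stages + reg_ns;
        printf("%d stage(s): %.2f ns/stage -> ~%.0f MHz\n",
               stages, stage_ns, 1000.0 / stage_ns);
    }
    return 0;
}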
 
Voltage and heat are not the only limiting factors. As you say, the cooler the chip, the higher it can go. But no matter how cool the chip is, there is a limit on how fast a signal can travel: the speed of light. The maximum possible frequency of a chip is determined by the maximum time needed for the signals to propagate in one clock tick. Because the speed of light is a constant, the only thing you can do is reduce the distance the signal has to travel. For this you can use pipelining, and this requires extra transistors for buffering and for splitting the work into stages.

AFAIK, the interconnect delay is much larger than the speed of light.
 
AFAIK, the interconnect delay is much larger than the speed of light.

Much slower is probably the more apt term. Copper in the typical PCB material goes at roughly 1/2 the speed of light. The dielectric constant on die is roughly the same as well. If someone could ever get an air-bridge process to work you could almost double the signal propagation speeds on die.
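Back-of-the-envelope, taking the ~0.5c figure above at face value:

Code:
#include <stdio.h>

/* How far a signal can possibly travel in one clock tick at ~0.5c,
   before spending any time in actual logic. */
int main(void) {
    const double c_mm_per_ns = 299.792;            /* speed of light, mm/ns */
    const double v_mm_per_ns = 0.5 * c_mm_per_ns;  /* ~0.5c (assumed)       */
    const double clocks_mhz[] = { 850.0, 1400.0 };
    for (int i = 0; i < 2; i++) {
        double period_ns = 1000.0 / clocks_mhz[i];
        printf("%4.0f MHz: %.2f ns/tick -> ~%.0f mm of travel at best\n",
               clocks_mhz[i], period_ns, v_mm_per_ns * period_ns);
    }
    return 0;
}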
 
~1400 MHz vs. 850 MHz ALU clock.

The flops calculation already accounts for clock speeds.
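For reference, the usual back-of-the-envelope math behind that, assuming the commonly quoted unit counts and 2 flops per ALU per clock (MAD/FMA):

Code:
#include <stdio.h>

/* Theoretical single-precision throughput = ALUs * 2 flops * clock (GHz). */
int main(void) {
    double cypress = 1600 * 2 * 0.850;   /* HD 5870: 1600 ALUs @ 850 MHz  */
    double gf100   =  480 * 2 * 1.400;   /* GTX 480: 480 ALUs @ ~1400 MHz */
    printf("Cypress ~%.0f GFLOPS, GF100 ~%.0f GFLOPS, ratio ~%.1fx\n",
           cypress, gf100, cypress / gf100);
    return 0;
}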

That's a rather silly premise to begin with.

Not in the context of Dave's suggestion. Besides, the theoretical advantage is 2x, so I don't understand why you think a measured advantage of 1.5-2x isn't proportional...

It's not proportional by any means. 5x the SIMD width of a scalar design brings maybe 1.5-2x overall game performance, but that's pretty damn good considering that going from 1x to 5x only adds maybe 20% to the transistor cost and 10% to the board cost.
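Taking those ballpark figures at face value, the bang-for-buck arithmetic looks roughly like this (all numbers are the assumptions from the paragraph above):

Code:
#include <stdio.h>

int main(void) {
    double perf        = 1.75;   /* ~1.5-2x overall game performance        */
    double transistors = 1.20;   /* ~+20% transistor cost for 5x SIMD width */
    double board       = 1.10;   /* ~+10% board cost                        */
    printf("perf per transistor: %.2fx, perf per board dollar: %.2fx\n",
           perf / transistors, perf / board);
    return 0;
}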

Which was Dave's point about bang for the buck. But if Cypress is effectively 1.5-2x on the shading workload then that must mean Fermi catches up somewhere else (where?) or that the shading workload isn't significant to begin with. So it's still an open question.

[edit] Wait, in your comparison are you referring to Fermi or a theoretical 320-scalar-ALU Cypress @ 850 MHz?
 
Much slower is probably the more apt term. Copper in the typical PCB material goes at roughly 1/2 the speed of light. The dielectric constant on die is roughly the same as well. If someone could ever get an air-bridge process to work you could almost double the signal propagation speeds on die.

I thought it was 2/3, but it's been a while since I took those subjects.
 
Why should a scalar architecture be very significant for developers? High performance computation on the CPU has long relied on vectorization to get maximum performance, so why should GPGPU go back to scalar only?

It seems natural to me that a strictly scalar architecture would have more overhead.
 
Which was Dave's point about bang for the buck. But if Cypress is effectively 1.5-2x on the shading workload then that must mean Fermi catches up somewhere else (where?) or that the shading workload isn't significant to begin with. So it's still an open question.

The final frame rate could be bound by fillrate, texelrate and bandwidth (also inter-chip bandwidth) at high resolution with AA/AF, and not by shaders. And those are far from 1.5-2x, except texelrate.
 
Across a large cross-section of reviews and games (and removing obvious framebuffer limitations), at the moment Fermi averages out at 13% faster than Cypress at 2560x1600 (all AA/AF combinations) - yet it has >300% more triangle rate, 24% more fillrate, 270% more Z-fillrate and 16% more bandwidth.

I mean no disrespect, but did the 2GB Radeon HD 5870 fare better in the so-called framebuffer-limited scenarios?

Reading reviews of the Radeon HD 5870 Eyefinity certainly brings that into question. My main problem with the reviews is the lack of minimum framerates being shown in the graphs.

Crysis seems to get a modest boost in the minimum framerates (from 11.6 to 16.3 fps in Anandtech's review). In everything else they are neck and neck (the 1GB vs. 2GB Radeon HD 5870). Admittedly, comfortably close to the GeForce GTX 480 in most cases.
 
The final frame rate could be bound by fillrate, texelrate and bandwidth (also inter-chip bandwidth) at high resolution with AA/AF, and not by shaders. And those are far from 1.5-2x, except texelrate.

Of course, but we know that the shader advantage doesn't result in an overall performance advantage. You can't explain that away by saying the advantage in other areas isn't as pronounced. All cards see the same workload, and if you claim Cypress is faster at some portion of that workload then it obviously has to be sufficiently slower somewhere else if the final result isn't a win. The other option, of course, is that it's not faster at shading but that for a given real-world throughput its implementation is cheaper. Can't have it both ways.
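A quick two-bucket sanity check of that argument (the fractions are assumed, purely illustrative):

Code:
#include <stdio.h>

/* Frame time = shading + everything else. If shading is 2x faster but the
   total is a wash, how much slower must "everything else" be? */
int main(void) {
    const double shade_speedup = 2.0;
    const double fracs[] = { 0.3, 0.5, 0.7 };  /* assumed shading share of frame time */
    for (int i = 0; i < 3; i++) {
        double f = fracs[i];
        /* solve f/speedup + (1-f)/x = 1 for x */
        double x = (1.0 - f) / (1.0 - f / shade_speedup);
        printf("shading %.0f%% of the frame: the rest must run at %.2fx for parity\n",
               f * 100.0, x);
    }
    return 0;
}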
 
Why should a scalar architecture be very significant for developers? High performance computation on the CPU has long relied on vectorization to get maximum performance, so why should GPGPU go back to scalar only?

It seems natural to me that a strictly scalar architecture would have more overhead.

Ease of development; that, and many algorithms require pointer-chasing through data structures which are inherently not vector-oriented. There are three ways to vectorize (a quick C sketch of the first two follows the list):

1) code it by hand
2) have your compiler / tools do it
3) have your hardware do it
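
A minimal sketch of #1 vs. #2 in C (the SSE intrinsics are just a stand-in for whatever vector ISA you target; #3 is what a scalar-ISA GPU effectively does for you behind the scenes):

Code:
#include <xmmintrin.h>   /* SSE intrinsics */

/* #2: plain scalar loop, left to the compiler/tools to vectorize. */
void add_auto(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* #1: vectorized by hand, 4 floats at a time (n assumed to be a multiple of 4). */
void add_manual(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
}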

This is not too different than decisions made in the past with regard to optimal instruction scheduling:

1) write assembly by hand, specifically optimized for your CPU pipeline
2) let your compiler do it
3) let the CPU re-order it

Or, say, memory hierarchy (LRU cache vs manually managed scratchpad memory), or locking vs transactional memory (w/ HW support), or manual memory management vs GC (w/ HW support, like Azul Systems' Vega).

Now, the problem with #1 is that a) it's time consuming and b) you can't always predict ahead of time what needs to be done. The problem with approach #2 is that a) the tools are often not as smart as human beings and b) you can't always statically predict ahead of time what needs to be done.

The problem with #3 is that it's expensive to implement (but you are often able to optimize based on run-time-gathered information).

Now, all modern development uses a mix of all three strategies: 1) you try to write 'optimal' code that won't confuse your tools/compiler too much, 2) your compiler tries to make up for your incompetence or laziness, and 3) your hardware tends to try and help you out.

The only argument is over the proper mix of HW support and SW support, and who pays the costs: consumers or developers. Remember the arguments over in-order CPUs on consoles, or CELL's manually managed local store, DMA, and SPEs? Development time does matter, and can shift mindshare.

No one has yet devised a simultaneously performant and easy-to-use system for concurrent programming. Shared-nothing is the easiest to get right, but it can't run all algorithms efficiently. Thread-based concurrency with synchronization primitives offers performance, but with difficult development (see Mars Pathfinder). Then there is the problem of simultaneously targeting architectures with fundamentally different widths and memory architectures.

My point is, we still can't ignore serial, non-vector performance. We still have to solve physics, AI, and all kinds of other problems. Maybe Fusion is the best approach, I dunno. All I'm saying is, we've got 2 TFLOPS now, graphics are only marginally better than they were 4 years ago, and game physics still suck. (OK, part of the problem is that we're trying to solve O(n^3) problems, but anyway.)
 
Why should a scalar architecture be very significant for developers? High performance computation on the CPU has long relied on vectorization to get maximum performance, so why should GPGPU go back to scalar only?

Both NV and AMD have loop-level vectorization in hw. What makes AMD's GPUs peculiar is that they expose ILP-level vectorization to the programmer, which, AFAIK, nobody else does. For graphics shaders it works fine. For other things, not so much.

The bigger problem is that manually packing 4 scalar work items into a vec4 work item breaks the conceptual simplicity of the overall programming model and adds complexity. That is the tradeoff.

When you think of AMD's vec4/5, think of writing inline SSEx with intrinsics (OCL is just syntactic sugar), not loop-level autovectorization by the compiler.
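Roughly the difference between the two sketches below, in plain C with SSE as the stand-in for AMD's vec4 (function names are made up):

Code:
#include <xmmintrin.h>

/* Scalar work item: one element per item, the simple mental model. */
void saxpy_item(float *y, const float *x, float a, int gid) {
    y[gid] = a * x[gid] + y[gid];
}

/* Manually packed vec4 work item: each item now owns 4 elements, so the
   index math, remainder handling and any per-element branching become the
   programmer's problem - the extra complexity being traded for ILP. */
void saxpy_item_vec4(float *y, const float *x, float a, int gid) {
    int i = gid * 4;                     /* 4 elements per work item */
    __m128 vy = _mm_loadu_ps(y + i);
    __m128 vx = _mm_loadu_ps(x + i);
    __m128 va = _mm_set1_ps(a);
    _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
}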
 
Of course, but we know that the shader advantage doesn't result in an overall performance advantage. You can't explain that away by saying the advantage in other areas isn't as pronounced. All cards see the same workload, and if you claim Cypress is faster at some portion of that workload then it obviously has to be sufficiently slower somewhere else if the final result isn't a win. The other option, of course, is that it's not faster at shading but that for a given real-world throughput its implementation is cheaper. Can't have it both ways.

Or you could just explain it away with "not a factor in current games so it doesn't count", like you did for Nvidia's 300% geometry edge that nets them nothing (or maybe 13%).
 
Or you could just explain it away with "not a factor in current games so it doesn't count", like you did for Nvidia's 300% geometry edge that nets them nothing (or maybe 13%).

Already did. That is option #2. In any case the geometry advantage argument is sorta irrelevant since for the past few cycles we had the same dynamic at play with AMD having a large theoretical lead in flops and a geometry lead as well (higher clocks).
 
Btw. what is the primary bottleneck of GF100 performance?

Geometry and ALU power are really unlikely. Many people point at the 64 texturing units, but the performance drop with AF is comparable to GT200, not higher, so the ~2.5-times higher ALU:TEX ratio doesn't seem to be a major problem. Z-rate and blending rate are also fine... Is it the limitation of 32 pixels/clock or bandwidth?
 
Btw. what is the primary bottleneck of GF100 performance?

Geometry and ALU power are really unlikely.
Why is ALU unlikely???
Many people point at the 64 texturing units, but the performance drop with AF is comparable to GT200, not higher, so the ~2.5-times higher ALU:TEX ratio doesn't seem to be a major problem.
Not in the benchmarks I've seen.
http://www.computerbase.de/artikel/...geforce_gtx_480/5/#abschnitt_skalierungstests

Z-rate and blending rate are also fine... Is it the limitation of 32 pixels/clock or bandwidth?
Blend rate is fine in theory, not so much in practice according to the hardware.fr numbers (pretty much outclassed by Cypress). Some of that is due to the 32 pixels/clock, so I'd say that is a limiting factor. Since the achievable ROP rate seems to be so low regardless of memory clock, I sort of doubt it's really limited by a lack of memory bandwidth. The Z-rate, though, is indeed fine.

Looks to me like it could be a combination of ALU, TEX, and the weirdly low ROP throughput/rasterization limit. That is very, very speculative though...
 
Why should a scalar architecture be very significant for developers? High performance computation on the CPU has long relied on vectorization to get maximum performance, so why should GPGPU go back to scalar only?

It seems natural to me that a strictly scalar architecture would have more overhead.

It does have more overhead. But it makes life a lot easier.

For instance, suppose you have a function call in your kernel. Handling a function call per scalar is a hell of a lot easier than figuring out how to emit a function call across a 4-wide vector. One lets you use the same C code that you had before (a function operating on a scalar); the other requires that you either lose performance or regenerate the function call.
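A contrived C sketch of that choice, with SSE as the stand-in 4-wide vector and a made-up function:

Code:
#include <xmmintrin.h>

/* The scalar function you already had. */
static float my_func(float x) {
    return x * x + 1.0f;
}

/* Option A: keep the scalar call and scalarize - extract each lane, call it
   4 times, repack. Easy, but you throw away most of the vector win. */
static __m128 my_func_scalarized(__m128 v) {
    float tmp[4];
    _mm_storeu_ps(tmp, v);
    for (int i = 0; i < 4; i++)
        tmp[i] = my_func(tmp[i]);
    return _mm_loadu_ps(tmp);
}

/* Option B: regenerate a vector version of the function by hand. */
static __m128 my_func_vectorized(__m128 v) {
    return _mm_add_ps(_mm_mul_ps(v, v), _mm_set1_ps(1.0f));
}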

DK
 