What gives ATI's graphics architecture its advantages?

Kaotik

The topic title states a fact: think of what ATI did last gen with the HD4 series and now does with the HD5 series compared to the nVidia competition.

At a quick glance, nV's "truly scalar" design is far more effective, as in, it's far easier to get the "full potential" out of it in real-world situations compared to ATI's VLIW design. Yet last gen ATI was only a tad behind, just like this current gen, with substantially smaller chips.

And then there's tessellation. Sure, tessellation and geometry performance overall is the strong point of GF100, but what nV does with 16 tessellators, ATI does with 1 or 1½, depending on how you view it.

So what is it that makes the Radeons so much more effective, when at least in shader utilization the GeForces should be well ahead?

edit:
pressed enter too early on the topic, bit drunk here
 
Posting whilst drunk is not advised. :D I suggest you read up on something like http://www.anandtech.com/show/2937, which tells the story behind much of why they are successful at present.

I've read that, as well as the RV7xx story. The thing I'm not getting is that, on my limited understanding, the GeForces should be running rings around the Radeons with their scalar shaders and all, yet in practice it's almost the opposite (given the transistor ratio etc.).
 
I've read that, as well as the RV7xx story. The thing I'm not getting is that, on my limited understanding, the GeForces should be running rings around the Radeons with their scalar shaders and all, yet in practice it's almost the opposite (given the transistor ratio etc.).

It's because Nvidia's implementation is efficient in an architectural sense relative to the ATI architecture, but it's not efficient in terms of performance per transistor, and consequently it isn't as efficient in terms of performance per watt either. Quantity, in the case of the ATI chips, is a quality all of its own in terms of flops per mm², even if they aren't as efficient when it comes to overall utilization.
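
To put rough numbers on that, here's a quick back-of-the-envelope comparison using the commonly quoted launch specs of the top single-GPU cards (the GF100 die size is an estimate, since NVIDIA never published an exact figure). Raw peak rates only, before any utilization differences:

# Peak single-precision throughput per unit of die area, transistors and power,
# from commonly quoted launch specs. The GF100 die size is an estimate.
chips = {
    #                     GFLOPS (ALUs x 2 flops x GHz), mm^2, Btrans, board W
    "Cypress (HD 5870)": (1600 * 2 * 0.850, 334, 2.15, 188),
    "GF100 (GTX 480)":   (480 * 2 * 1.401, 529, 3.00, 250),
}
for name, (gflops, mm2, btrans, watts) in chips.items():
    print(f"{name:18s} {gflops:6.0f} GFLOPS  {gflops/mm2:4.1f}/mm^2  "
          f"{gflops/btrans:5.0f}/Btrans  {gflops/watts:4.1f}/W")

Even being generous about real-world utilization on the VLIW side, that's roughly a 3x gap in raw peak density, which is the "quantity is a quality" point.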
 
The whole thing kind of reminds me of OoOE chips vs Itanium, in the sense of how much you rely on software to 'pack' instructions optimally vs more fine-grained dynamic superscalar techniques. It seems NVidia went with finer granularity, and hence more logic overhead.
 
I've read that, as well as the RV7xx story. The thing I'm not getting is that, on my limited understanding, the GeForces should be running rings around the Radeons with their scalar shaders and all, yet in practice it's almost the opposite (given the transistor ratio etc.).

Game shaders usually have a *lot* of statically extractable ILP, which is why AMD gets high utilization from its 5-way VLIW; ~60% on the lower side, IIRC.
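
As an illustration of what "statically extractable ILP" buys the VLIW design, here's a toy greedy packer that fills 5-wide bundles from independent work. The shader snippet (N.L for two lights) and the packing rule are made up for illustration, not the real compiler:

from collections import namedtuple

Instr = namedtuple("Instr", "dst srcs")   # dst = op(srcs...)

# Toy snippet: diffuse N.L for two independent lights, then summed.
program = [
    Instr("a0", ("nx", "lax")), Instr("a1", ("ny", "lay")), Instr("a2", ("nz", "laz")),
    Instr("a3", ("a0", "a1")),  Instr("da", ("a3", "a2")),
    Instr("b0", ("nx", "lbx")), Instr("b1", ("ny", "lby")), Instr("b2", ("nz", "lbz")),
    Instr("b3", ("b0", "b1")),  Instr("db", ("b3", "b2")),
    Instr("sum", ("da", "db")),
]

def pack(instrs, width=5):
    """Greedily fill VLIW bundles; an op may issue only if all of its sources
    were produced in an earlier bundle (no forwarding within a bundle)."""
    ready = {"nx", "ny", "nz", "lax", "lay", "laz", "lbx", "lby", "lbz"}  # inputs
    remaining, bundles = list(instrs), []
    while remaining:
        bundle = [i for i in remaining if all(s in ready for s in i.srcs)][:width]
        for i in bundle:
            remaining.remove(i)
        ready.update(i.dst for i in bundle)
        bundles.append(bundle)
    return bundles

bundles = pack(program)
slots_used = sum(len(b) for b in bundles)
print(f"{slots_used} ops in {len(bundles)} bundles -> "
      f"{slots_used / (5 * len(bundles)):.0%} slot utilization")

With independent work to interleave you land in the 50-60% region; with a single dependency chain it drops much lower, which is roughly why compute kernels written in a scalar style can hurt on this design.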

Even if you throw away the 5x VLIW, their ALUs have ~30% more flops/mm², according to an analysis done by Jawed earlier.

In RV7xx they had GDDR5, so they immediately had a HUGE bandwidth advantage.

In RV8xx they had less of an advantage in memory clocks, but it was still there.
 
Could one describe Nvidia's solution as GPU-computing-centric, in the sense that their architecture allows programmers to leverage more performance per time invested in optimization? Whereas with AMD you need to pay close attention to vectorization to extract more performance?

If so, then this architectural change to "scalar" is by far their biggest investment in GPU computing, leaving double precision far behind. After all, they seem to want to push this as much as possible, winning developers over without them having to re-learn the way they're used to programming. But they're paying a hefty price for it across their current GeForce/Quadro/Tesla line-up. :eek:
 
Last gen ATI was only a tad behind, just like this current gen, with substantially smaller chips

Well, that handicaps the comparison in ATi's favor a bit. You're taking a measure of 3D performance vs die size where Nvidia has invested transistors in performance and features that aren't relevant to 3D gaming. On the flip side, there are workloads that prove those investments were not made in vain, where Nvidia's perf/mm² is just fine.

It's arguably worse with GF100 vs Cypress. Cypress does just what's necessary to achieve DX11 compliance and provide excellent performance in games whereas GF100 obviously goes a step further and facilitates high performance in a more diverse set of workloads.

It's hard to answer, though. Could Nvidia have simply added a tessellator to GT200 and doubled up the SIMDs and achieved parity with Cypress for ~2 billion transistors? Who knows.
 
Could one describe Nvidia's solution as GPU-computing-centric, in the sense that their architecture allows programmers to leverage more performance per time invested in optimization? Whereas with AMD you need to pay close attention to vectorization to extract more performance?
Yes. Vector ILP isn't freely available in most code, unlike in graphics. At the same time, I think the perils of vec4 vs scalar can be fixed in the compiler and caches, so they don't really need to change their quite optimized SIMD design.

If so, then this architectural change to "scalar" is by far their biggest investment in GPU computing, leaving double precision far behind. After all, they seem to want to push this as much as possible, winning developers over without them having to re-learn the way they're used to programming. But they're paying a hefty price for it across their current GeForce/Quadro/Tesla line-up. :eek:
DP is needed for HPC, and that market is in its infancy. The killer app of consumer GPGPU TODAY is graphics. You do the math.

I think the real issue is that even if you throw out the 5x ILP from VLIW, AMD still had (last gen) a ~30% advantage over NV in flops/mm² for their ALUs.
 
rpg.314,

I see what you mean. But from my perspective: sure, you CAN do something about vectorization or parallelization of code, but to make developers' lives easier, they just shouldn't have to, regardless of HOW exactly you rid them of this problem.

GPUs' main business is graphics, right, but that isn't GPGPU's. But that's what Nvidia is obviously desperately hoping to change with all their efforts in hard- and software.

Regarding the FLOPS/mm² issue: it's not only about the size of the ALUs themselves, it's also about keeping them fed with enough data, possibly under less-than-stellar circumstances. In the same way, AMD does it with their ROPs, which can do single-cycle RGB9E5 whereas Nvidia's cannot, yet both are quite good at INT8 formats.
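
For reference, since it came up: RGB9E5 is the packed shared-exponent HDR format, three 9-bit mantissas sharing one 5-bit exponent in a 32-bit word. A minimal decoder sketch following the layout in EXT_texture_shared_exponent, just to show what the ROPs have to unpack on top of the plain INT8 path:

def decode_rgb9e5(word: int) -> tuple:
    """Unpack a 32-bit RGB9E5 value into three floats (layout per
    EXT_texture_shared_exponent: R in bits 0-8, G 9-17, B 18-26, E 27-31)."""
    r = word & 0x1FF
    g = (word >> 9) & 0x1FF
    b = (word >> 18) & 0x1FF
    e = (word >> 27) & 0x1F
    scale = 2.0 ** (e - 15 - 9)   # exponent bias 15, 9 fraction bits, no implicit 1
    return (r * scale, g * scale, b * scale)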
 
rpg.314,

I see what you mean. But from my perspective: sure, you CAN do something about vectorization or parallelization of code, but to make developers' lives easier, they just shouldn't have to, regardless of HOW exactly you rid them of this problem.
Yeah, but that's being a bit too idealistic. :smile: More practically, I'd say AMD should target 80% of the performance of any code from scalar architectures on VLIW.
GPUs' main business is graphics, right, but that isn't GPGPU's. But that's what Nvidia is obviously desperately hoping to change with all their efforts in hard- and software.
I think AMD should start beating the Fusion drum, assuming it is on track. HPC would take a decent multi-socket Fusion over Fermi any day of the week.

Regarding the FLOPS/mm² issue: it's not only about the size of the ALUs themselves, it's also about keeping them fed with enough data, possibly under less-than-stellar circumstances. In the same way, AMD does it with their ROPs, which can do single-cycle RGB9E5 whereas Nvidia's cannot, yet both are quite good at INT8 formats.
AMD has had a bw/pin advantage for ~2 yrs now. :smile:
 
I think AMD should start beating the Fusion drum, assuming it is on track. HPC would take a decent multi-socket Fusion over Fermi any day of the week.

I don't follow this line of thought. Why do you presume that GPUs should only be used for "dumb/easy" high throughput workloads? Aren't there workloads out there that would benefit from a combination of high throughput and flexibility? Keeping all the hard stuff on the CPU doesn't sound like the way to go. Even in a Fusion world the GPU has to be a lot smarter than it is today.
 
I think NVidia basically just dug themselves a hole with the G80 shader architecture. They weren't impressed with R520, and had a plan to push GPU computing, so they figured that they would be fairly safe with making an SIMD engine that was more CPU-like than the one ATI came up with. They were validated even more when R600 came out and later when G92 clocked really high, so they thought they were sitting pretty. The less efficient GT200 looked fine.

Suddenly ATI got its act together with RV770; their architectural decisions produced a chip that gave them the perf/mm² advantage that should have been there from the beginning. NVidia also lost the ability to clock their shaders so high. Instead of a factor of 2-3 advantage, it shrank to about 1.6x with Fermi vs Cypress.

In terms of actual architectural decisions, one is that NVidia chose a warp size of 32 instead of ATI's 64. Another is that they use semaphores to track texture latency instead of the much simpler clause system that ATI uses, so the scheduler is more complex. There's higher L1 bandwidth and branching rate per ALU as well, though that's a bit of a moot point when you look at total ALU count.

All in all, ATI is trying to squeeze in as many FLOPs as it thinks are useful given that there is an overhead for each SIMD engine to control everything. Cypress has 20 schedulers, each doing 16x5x2 flops per clock. Fermi basically has 16 pairs of schedulers, each doing 16x2 flops per clock. Cypress gets lower utilization, but much higher density.
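
Spelling out the arithmetic behind those numbers (counting a MAD as 2 flops):

# Per-clock peak implied by the organisation described above (MAD = 2 flops).
cypress = 20 * 16 * 5 * 2    # 20 SIMDs x 16 lanes x 5-wide VLIW x MAD = 3200
fermi   = 16 * 2 * 16 * 2    # 16 SMs x 2 schedulers x 16 lanes x MAD  = 1024
print(cypress, fermi, round(cypress / fermi, 1))   # 3200 1024 3.1
# Fermi's ALUs run on a hot clock roughly 1.6x faster, which narrows the gap
# in absolute flops but doesn't close it.
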
The whole thing kind of reminds me of OoOE chips vs Itanium, in the sense of how much you rely on software to 'pack' instructions optimally vs more fine-grained dynamic superscalar techniques. It seems NVidia went with finer granularity, and hence more logic overhead.
The difference being, of course, that the computational density of an Itanium-type design doesn't help you much in desktop CPU competitiveness, because your workload isn't embarrassingly parallel.

I think it would be pretty easy for ATI to switch to a mostly scalar design with minimal changes. It's the rest of the design where the differences really show up.
 
I don't follow this line of thought. Why do you presume that GPUs should only be used for "dumb/easy" high throughput workloads? Aren't there workloads out there that would benefit from a combination of high throughput and flexibility? Keeping all the hard stuff on the CPU doesn't sound like the way to go. Even in a Fusion world the GPU has to be a lot smarter than it is today.

I wasn't clear enough there. I wish for Fermi-ish flexibility for Fusion GPUs too. I guess that sentence should have read:

HPC would take a decent multi-socket Fusion over a discrete GPU hanging off the PCIe bus any day of the week, even if the discrete GPU has ~4x more bogo-flops and ~2x more bogo-gbps.
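
A toy model of that trade-off; all numbers here are illustrative assumptions, not measurements:

# When does a PCIe-attached GPU beat an integrated part that shares memory?
# Assume the kernel's data must cross the bus once for the discrete GPU.
pcie_gbps  = 8.0     # ~PCIe 2.0 x16 effective bandwidth, GB/s (assumed)
gpu_gflops = 1000.0  # discrete GPU sustained throughput, GFLOPS (assumed)
apu_gflops = 250.0   # integrated part at 1/4 the throughput, GFLOPS (assumed)

# t_gpu = bytes/pcie_gbps + flops/gpu_gflops, t_apu = flops/apu_gflops
# (ignoring the integrated part's own memory traffic). Times are equal at:
break_even = 1.0 / (pcie_gbps * (1.0 / apu_gflops - 1.0 / gpu_gflops))
print(f"discrete GPU only wins above ~{break_even:.0f} flops per byte moved over PCIe")

Below that arithmetic intensity the bus transfer dominates and the slower-but-local part wins, which is the Fusion argument in a nutshell.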
 
I think AMD should start beating the Fusion drum, assuming it is on track. HPC would take a decent multi-socket Fusion over Fermi any day of the week.

Is Fusion even headed for that space any time soon, though? For the foreseeable future this appears to be a solution which will be targeting notebooks and desktops, and possibly netbooks. Basically, any market where IGPs are currently considered to be acceptable solutions for many customers.

With actual HPC trends going towards fewer sockets rather than more, and Fusion providing a heterogeneous execution platform at that, how do you see it competing at all in the same space that Fermi is targeting?
 
Why isn't ATI a lot faster than Nvidia with their huge advantage in shading power, texturing, and fillrate? From an architectural perspective, Nvidia's shaders are way behind, but in games like Crysis or STALKER they do fine. Something's not adding up.
 
Why isn't ATI a lot faster than Nvidia with their huge advantage in shading power, texturing, and fillrate? From an architectural perspective, Nvidia's shaders are way behind, but in games like Crysis or STALKER they do fine. Something's not adding up.

I would guess the biggest problem would be setup/rasterization rates.
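
To put a rough number on that guess, assuming the commonly quoted ~1 triangle/clock setup limit for Cypress at 850 MHz; the per-frame triangle counts are made up for illustration:

setup_rate   = 850e6       # triangles/second: 1 tri/clock at 850 MHz (assumed)
frame_budget = 1.0 / 60    # seconds per frame at 60 fps

for tris in (200e3, 1e6, 5e6):      # hypothetical per-frame triangle counts
    t = tris / setup_rate
    print(f"{tris/1e6:3.1f}M tris: setup alone = {t*1e3:4.2f} ms "
          f"({t/frame_budget:.0%} of a 60 fps frame)")

Once tessellation pushes per-frame triangle counts up, the front end eats into the frame budget no matter how wide the shader array is.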
 
Is Fusion even headed for that space any time soon, though? For the foreseeable future this appears to be a solution which will be targeting notebooks and desktops, and possibly netbooks. Basically, any market where IGPs are currently considered to be acceptable solutions for many customers.

As I see it, AMD has merely announced the parts targeting the low-hanging fruit of integration. When Intel gets into the game as well, and once AMD has design experience with this thing, then we'll start to see the results of Bulldozer's modularity coming into play.

With actual HPC trends going towards fewer sockets rather than more, and Fusion providing a heterogeneous execution platform at that, how do you see it competing at all in the same space that Fermi is targeting?

Heterogeneity is the future. I don't see any problem there.

Fewer sockets of shared-memory parallelism are a win in the face of intra-node distributed-memory parallelism.

The future of a majority of HPC is a single desktop. You can already get individual desktops today with 32-48 cores.

And until NV adds 10GbE/IB to their GPUs, they'll have a huge hole of MPI latency to dig themselves out of.
 