AMD: R9xx Speculation

You are missing the point. The discussion is about two levels of vectorization in the same program (the SIMD width of 64 and the VLIW width of 4) versus one (just the SIMD width), not about vectorization per se.
Work-item vectorisation can also improve memory system access patterns, e.g. by increasing the coherency of cache use through longer bursts.

The problem with these chips is you're programming the memory system as much as you're programming the ALUs.
 
The GPU score is 1.6k higher than the GTX "512"'s at Expreview (clocked at 480 levels), assuming the 6800 score is real and was run on similar hardware.

And that would still fit the original rumor that AMD's refresh is faster than NV's possible current high-end.
 
Juniper->Barts scaling the same would mean GF104 is about to meet its maker.
You mean the supposed scaling to 1280 ALUs, 80 TMUs and a 256-bit memory interface on the same process node? If that were true, I'd have to agree.

But then, AMD has other options for a refresh than just adding more of the same stuff, especially while 40nm wafer space is still quite scarce. I'd assume it would be wise not to scale your whole lineup of chips up by +50mm² (for the sake of the argument).
 
You mean the supposed scaling to 1280 ALUs, 80 TMUs and a 256-bit memory interface on the same process node? If that were true, I'd have to agree.
Two issues: Vantage seems to flatter ATI currently and 128-bit is still in play as an option for Barts.

But then, AMD has other options for a refresh than just adding more of the same stuff, especially while 40nm wafer space is still quite scarce. I'd assume it would be wise not to scale your whole lineup of chips up by +50mm² (for the sake of the argument).
Yeah, reinstating stuff that was cut from Evergreen.
 
If the memory system (and some other bits?) is radically better in SI, then maybe all that's possible with a 128-bit bus. It'd be cool if it were.
I believe it was originally planned as a 128-bit part (at 32nm), not that it still is one. With an expected die size of around 260-270mm², it wouldn't be clever not to utilize it for a 256-bit interface...
 
Both Samsung and Hynix still list 6.0GHz modules as their fastest products...
True, but there's also this.

Maybe Hynix doesn't list that memory anywhere else yet because AMD is their exclusive customer for it right now. The time frame surely fits; that memory was supposed to have been in mass production for some time now.

Of course it's also possible that this is just a test sample and the final 6870 will come with a lower memory clock.
 
Anandtech had a story about Cypress where AMD basically said something to that effect. The chip had to be pared down to meet size requirements. Density suffered due to additional measures taken to increase yields and counteract process issues.

One bit of curiosity is whether Northern Islands will be as aggressive in implementing such measures, assuming TSMC was able to remedy the problems that made AMD bloat Cypress in the first place.
 

The GPU-Z part seems right for an unrecognized GPU. Then we have the Vantage numbers:
GPU Test 1: 37.53
GPU Test 2: 30.50
CPU Test 1: 3682.88
CPU Test 2: 31.55
GPU Score: 11634
CPU Score: 25839
3DMark Score: X11963

The numbers add up correctly according to the Vantage score calculation, so no obvious fake that way.
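For what it's worth, the overall score reproduces exactly if you assume the Extreme preset combines the sub-scores as a weighted harmonic mean with 0.95/0.05 GPU/CPU weights (the formula shape and the weights are my inference, not something from the screenshot):

```
#include <cstdio>

int main()
{
    // Sub-scores from the screenshot above.
    double gpu = 11634.0, cpu = 25839.0;

    // Assumed Extreme-preset weighting: weighted harmonic mean,
    // 95% GPU / 5% CPU.
    double total = 1.0 / (0.95 / gpu + 0.05 / cpu);

    std::printf("X%.0f\n", total);  // prints X11963, matching the posted score
    return 0;
}
```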
But does the ratio between the two GPU tests look sufficiently far from something we know, like a high-clocked 480 or a 5870? (Of course the CPU may also influence that ratio, so the CPU scores would need to be comparable.)

Two issues: Vantage seems to flatter ATI currently and 128-bit is still in play as an option for Barts.

Is it? The Ati-forum piece looks very much like it's based on PCB blueprints, so things like the memory bus ("pin compatible to 5800") and the power envelope (2×6-pin) seem pretty certain.
 
Anandtech had a story about Cypress where AMD basically said something to that effect. The chip had to be pared down to meet size requirements. Density suffered due to additional measures taken to increase yields and counteract process issues.

One bit of curiosity is whether Northern Islands will be as aggressive in implementing such measures, assuming TSMC was able to remedy the problems that made AMD bloat Cypress in the first place.

As 3dilettante said.

Thanks!

I'm sure they will be as aggressive with the Product Requirement Specification as they were with Cypress, because they don't have a process shrink to help with power consumption and die size. Assuming of course it's a continuing evolution of TeraScale 2: more SIMDs mean more transistors, which means more heat and a bigger die, which means lower yields... unless you've got a bunch of process-engineering tricks up your sleeve to make the most of TSMC 40nm.

Perf/watt is a key metric for OEM sales these days, and that's where AMD's focus appears to be.
 
Lane masks in LRB and bitwise masks in SSE. Similar mechanisms are implemented in hardware on ATI and NV.
And for short vectors it's easy to do the same in software.
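Something like this minimal sketch, for instance (a hypothetical CUDA device helper for concreteness; SSE's andps/andnps/orps sequence is the direct CPU equivalent):

```
// Build an all-ones/all-zeros mask from the predicate and blend with
// bitwise ops: lane masking done "in software".
__device__ inline float blend(float a, float b, bool take_a)
{
    unsigned mask = take_a ? 0xFFFFFFFFu : 0u;
    return __uint_as_float((__float_as_uint(a) &  mask) |
                           (__float_as_uint(b) & ~mask));
}
```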
I'd like to see a moderately complex problem "vectorized" over ATI's VLIW slots before I'd believe it. Divergence and all the warts included.
I've done that already.
Basically it looks like manual loop unrolling with handling of the special cases for the possible divergences. Short divergences can often be handled efficiently with conditional moves, longer ones with normal control structures, which have effectively the same characteristics as lane masking.
You bloat the code but get quite some speedup (if the divergences don't dominate, but in that case GPUs suck either way).
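To make that concrete, a minimal sketch of the pattern (hypothetical CUDA float4 code for illustration; on ATI you'd write the same shape with float4 in OpenCL so each component maps onto a VLIW slot):

```
// Each work-item handles four elements; the short divergence (clamping
// negative inputs) is unrolled by hand and resolved with per-component
// conditional moves, so all four "slots" stay in lock-step.
__global__ void scale_clamp_vec4(const float4 *in, float4 *out,
                                 float scale, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n4) return;

    float4 v = in[i];
    v.x = v.x < 0.0f ? 0.0f : v.x * scale;  // compare + select, no branch
    v.y = v.y < 0.0f ? 0.0f : v.y * scale;
    v.z = v.z < 0.0f ? 0.0f : v.z * scale;
    v.w = v.w < 0.0f ? 0.0f : v.w * scale;
    out[i] = v;
}
```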
And it twists the programming model completely out of shape. :rolleyes:

Now there are multiple cores, one vector width (64) and another vector width (4) to tackle. All for at best a 4x perf gain.
No, since the normal vectorization on GPUs is implicit, you don't handle that part at all (except when dealing with shared memory). You simply add vectorization on top, that's it. It's roughly the same as using SSE intrinsics, only more flexible.
For one out of 3 vendors. Not worth the trouble in my book.

The following is from nv's marketing slides but IS absolutely true.

-- 2-3x is not worth spending much effort. Wait a little or spend some money.

-- 10-15x gains are worth doing big rewrites.
Obviously I'm not reading your or nvidia's books. A factor of 2 often decides whether something is feasible or not, and even a 50% speedup is nothing to sneeze at in my book. If you have to write something from the ground up either way (which is the case for GPGPU more often than not), it is definitely not an insurmountable task to implement it with relative ease if one has thought about and planned for that beforehand.

Btw, I mentioned this to you before and will reiterate it: nvidia GPUs also often gain from explicit vectorization, as it reduces the granularity of memory accesses and increases burst lengths. It is simply more cache-friendly, and with a lot of algorithms being bandwidth-limited, it can be astonishingly efficient for some problems in view of the "scalar" nature of nvidia GPUs.
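As a toy illustration (hypothetical CUDA kernels, assuming n is a multiple of 4): the float4 copy moves the same data with a quarter of the load instructions, each lane fetching 128 bits instead of 32, so a warp's accesses arrive as fewer, longer bursts.

```
// Scalar copy: one 32-bit load/store per thread.
__global__ void copy_scalar(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Vectorized copy: one 128-bit load/store per thread, i.e. the explicit
// vectorization discussed above; same bandwidth-bound work, longer bursts.
__global__ void copy_vec4(const float4 *in, float4 *out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) out[i] = in[i];
}
```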
 
And for short vectors it's easy to do the same in software.

Btw, I mentioned this to you before and will reiterate it: nvidia GPUs also often gain from explicit vectorization, as it reduces the granularity of memory accesses and increases burst lengths. It is simply more cache-friendly, and with a lot of algorithms being bandwidth-limited, it can be astonishingly efficient for some problems in view of the "scalar" nature of nvidia GPUs.

My experience with vectorization on Nvidia GPUs has not been positive. The extra register pressure from vectorized code often causes large occupancy losses and ends up significantly harming performance. That's one reason AMD needs larger register files than Nvidia.
 