Lane masks in LRB and bitwise masks in SSE. Similar mechanisms are implemented in hardware on ATI and NV.
And for short vectors it's easy to do the same in software.
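To illustrate, here is a minimal sketch (made-up names, toy computation) of how per-lane masking can be emulated in software for a 4-wide vector with plain bitwise ops, which is essentially what an SSE compare-and-blend does per lane:

#include <stdint.h>

/* bitwise select: take a where the mask bits are set, b elsewhere */
static inline int32_t blend(int32_t a, int32_t b, uint32_t m)
{
    return (int32_t)(((uint32_t)a & m) | ((uint32_t)b & ~m));
}

/* dst[i] = x[i] > 0 ? x[i] + y[i] : x[i] - y[i], without a per-lane branch */
static void masked_addsub(int32_t dst[4], const int32_t x[4], const int32_t y[4])
{
    for (int i = 0; i < 4; ++i) {
        uint32_t m = (x[i] > 0) ? 0xFFFFFFFFu : 0u;   /* lane mask: all ones or all zeros */
        dst[i] = blend(x[i] + y[i], x[i] - y[i], m);  /* both sides computed, then masked */
    }
}

Both sides of the "branch" are computed and the mask picks the result per lane, which is exactly what the hardware lane masks do for you.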
I'd like to see a moderately complex problem "vectorized" over ATI's VLIW slots before I'd believe it. Divergence and all the warts included.
I've done that already.
Basically it looks like manual loop unrolling with special-case handling for the possible divergences. Often that can be done efficiently with conditional moves for short divergences, or with normal control structures, which have effectively the same characteristics as lane masking.
You bloat the code but get a decent speedup (as long as the divergences don't dominate, but in that case GPUs suck either way).
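To sketch the pattern (hypothetical names and a toy computation; on ATI you'd write the equivalent with float4 in OpenCL, I'm just using CUDA-flavoured code here for illustration):

// each thread handles 4 consecutive elements ("slots"), and the short
// divergent case is folded into a select, i.e. effectively a conditional move
__global__ void saxpy_clamped_vec4(float *y, const float *x, float a, int n)
{
    int base = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (base + 3 >= n) return;                 // ignore the ragged tail for brevity

    #pragma unroll
    for (int i = 0; i < 4; ++i) {              // manual unrolling over the 4 slots
        float v = a * x[base + i] + y[base + i];
        y[base + i] = (v < 0.0f) ? 0.0f : v;   // divergence handled by a select
    }
}

Longer divergent regions stay as normal if/else blocks applied per slot, which behaves much like the lane masking on the other architectures.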
And it twists the programming model completely out of shape.
Now there are multiple cores, one vector width (64) and another vector width (4) to tackle. All for at best a 4x perf gain.
No, as the normal vectorization on GPUs is implicit, you don't handle that part at all (except when dealing with shared memory). You simply add the explicit vectorization on top, that's it. It's roughly the same as using SSE intrinsics, only more flexible.
For one out of 3 vendors. Not worth the trouble in my book.
The following is from nv's marketing slides but IS absolutely true.
-- 2-3x is not worth spending much effort. Wait a little or spend some money.
-- 10-15x gains are worth doing big rewrites.
Obviously I'm not reading your or nvidia's books. A factor of 2 often decides if something is feasible or not. And even a 50% speedup is nothing to sneeze at in my book. If you have to write something from the ground up either way (which is the case for GPGPU much more often than not), it is definitely not an insurmountable task to get it implemented, and it comes with relative ease if one has thought about and planned for that stuff beforehand.
Btw, I mentioned this to you before and I'll reiterate it: NVIDIA GPUs also often gain from explicit vectorization, as it reduces the granularity of memory accesses and increases the burst lengths. It is simply more cache friendly, and with a lot of algorithms being bandwidth limited, it can be astonishingly effective for some problems given the "scalar" nature of NVIDIA GPUs.
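As a minimal sketch of what I mean (hypothetical kernel, purely illustrative): each thread issues one 16-byte float4 access instead of four separate 4-byte ones, assuming 16-byte alignment and an element count that is a multiple of four.

// explicitly vectorized scale: one 128-bit load and one 128-bit store
// per thread instead of four 32-bit ones (n4 = number of float4 elements)
__global__ void scale_vec4(float4 *out, const float4 *in, float s, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n4) return;

    float4 v = in[i];                          // one 16-byte load per thread
    v.x *= s; v.y *= s; v.z *= s; v.w *= s;
    out[i] = v;                                // one 16-byte store per thread
}

Whether it pays off depends on alignment and on how bandwidth-bound the kernel is, but for streaming kernels the difference is often measurable.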