AMD Bulldozer Core Patent Diagrams

mczak · May 25, 2011

entity279 said:
Yes, I meant marketing cores.

The point was that most of Intel's offering is still quads, so your point about half the FP performance does not stand.

I think his point was that intel really handles 256bit AVX natively whereas AMD needs to split it up into 2 instructions.
That said, the AMD SSE unit is definitely more beefy otherwise (though I think latency is quite a bit worse). There's only one SSE unit handling muls in Sandy Bridge, for instance, so doing nothing but muls is twice as fast on BD (and if you're doing 256bit muls, BD is still as fast). It only gets better if you do FMAD (though not by much, since with as many muls as adds you can do the adds "for free" on SB). Not sure, though, how they'd behave with real code, where the most frequent instructions tend to be the shuffle and pack/unpack ones.

hoho · May 25, 2011

mczak said:
I think his point was that intel really handles 256bit AVX natively whereas AMD needs to split it up into 2 instructions.

My point was that one BD module with 1x256bit SIMD and 2x int cores has roughly equivalent SIMD throughput to one Intel core. Or in other words, BD has half the 256-bit SIMD throughput per int core vs Intel.

mczak said:
That said, the AMD SSE unit is definitely more beefy otherwise (though I think latency is quite a bit worse)

Most likely true in both cases. The only question will be how much software support is needed to actually benefit from it. If most devs just opt to go with the lowest common denominator and don't write special code for BD, it won't be good, to say the least.

rpg.314 · May 25, 2011

mczak said:
I think his point was that intel really handles 256bit AVX natively whereas AMD needs to split it up into 2 instructions.

It doesn't matter in the slightest.

3dilettante · May 25, 2011

There is a relatively minor performance degradation in 256-bit mode, not an improvement.
It matters enough that gcc will emit two 128-bit ops instead of one 256-bit for BD, or at least for this first-gen BD.
FMA4 was the horse AMD bet on, though there is no word on whether Intel will re-adopt it.

rpg.314 · May 25, 2011

I thought the splitting referred to 2 uops inside the decoder. I didn't know that the compiler itself was tuned to crack them into 2 instructions.

3dilettante · May 25, 2011

If the chip is presented with an AVX 256-bit instruction, it will crack it.
However, for BD this will result in a small performance penalty.

Knowing this, a compiler run set to target BD will avoid using 256-bit instructions in favor of 128-bit ones.

hoho · May 25, 2011

3dilettante said:
There is a relatively minor performance degradation in 256-bit mode, not an improvement.
It matters enough that gcc will emit two 128-bit ops instead of one 256-bit for BD, or at least for this first-gen BD.

Wait, does that mean a single int core technically can use both halves of the 256bit SIMD unit in parallel to run twice as many 128bit SIMD instructions?

3dilettante · May 25, 2011

There are two 128-bit SIMD FMA units in the BD FPU.
The SB FPU has one 256-bit SIMD FADD and one 256-bit SIMD FMUL.

In 128-bit code, both can do two 128-bit SIMD operations, with BD able to handle a mix that does not contain both FADD and FMUL.
Each design has a different mix of instructions it can best target.
256-bit AVX is not what BD targets, 128-bit FMA4 is what it would prefer.
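
The peak numbers implied by that description can be worked out in a few lines. This is only a back-of-envelope sketch in C, assuming single-precision lanes and counting nothing but issue width per cycle (no latency, decode or scheduling effects). The function names are made up for illustration and the pipe counts are taken from the posts above, not from measurements.

```c
#include <assert.h>

/* Number of single-precision lanes in a vector of the given width. */
static int sp_lanes(int width_bits) {
    assert(width_bits > 0 && width_bits % 32 == 0);
    return width_bits / 32;
}

/* BD module as described in the thread: two 128-bit FMA pipes.
 * Each pipe can take an add, a mul, or a fused multiply-add;
 * an FMA counts as 2 flops per lane, anything else as 1. */
static int bd_module_flops(int use_fma) {
    int pipes = 2;
    int flops_per_lane = use_fma ? 2 : 1;
    return pipes * sp_lanes(128) * flops_per_lane;
}

/* SB core as described in the thread: one FADD pipe and one FMUL pipe,
 * each up to 256 bits wide. With mul-only code the FADD pipe sits idle. */
static int sb_core_flops(int width_bits, int mul_only) {
    int busy_pipes = mul_only ? 1 : 2;
    return busy_pipes * sp_lanes(width_bits);
}
```

Under these assumptions, mul-only 128-bit code comes out at 8 flops/cycle per BD module against 4 per SB core, 256-bit muls put SB at 8 as well, and FMA on BD matches a balanced 256-bit add/mul mix on SB at 16. Since a BD module holds two int cores, that last figure is also the "half the throughput per int core" point made earlier.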

Triskaine · May 25, 2011

For those of you who wonder how AMD's FP predicament arose in the first place I have created a concise summary of the events that transpired two years ago.

hoom · May 31, 2011

Seems they are delaying

http://semiaccurate.com/2011/05/30/bulldozer-and-ivy-bridge-both-delayed-a-bit/
http://www.xbitlabs.com/news/cpu/di...lock_Speed_of_FX_Bulldozer_Chips_Sources.html

swaaye · May 31, 2011

Hopefully they get a few months of sales time before Ivy Bridge cancels out any advantage they have.

Blazkowicz · May 31, 2011

Triskaine said:
For those of you who wonder how AMD's FP predicament arose in the first place I have created a concise summary of the events that transpired two years ago.

but then AMD changed their coding scheme from cancelled SSE5 to one compatible with AVX, so in effect Intel will have AVX w/ FMA3 and AMD will have AVX w/ FMA4, with a two year headstart for AMD.

Triskaine · May 31, 2011

Blazkowicz said:
but then AMD changed their coding scheme from cancelled SSE5 to one compatible with AVX, so in effect Intel will have AVX w/ FMA3 and AMD will have AVX w/ FMA4, with a two year headstart for AMD.

It is only a headstart if someone will actually use it. The adoption rate of FMA4 is unlikely to exceed that of 3DNow! .

Blazkowicz · May 31, 2011

sure (and it's the latest in instruction/feature set fragmentation, like CUDA support or Intel's video accelerator), but it can be used in various middleware, open source encoders and the like.
even 3DNow! had a great use case: it was used by 3dfx drivers so it could keep the K6-2's head above water.

it also seems useful in a lot of cases:
http://en.wikipedia.org/wiki/Multiply-accumulate

A fast FMA can speed up and improve the accuracy of many computations which involve the accumulation of products:

* Dot product
* Matrix multiplication
* Polynomial evaluation (e.g., with Horner's rule)
* Newton's method for evaluating functions.

The 1999 standard of the C programming language supports the FMA operation through the fma standard math library function.

swaaye · Jun 2, 2011

I imagine it will get used by the x264 folks if they see value in it. They are usually quick to optimize for new CPU features that are worthy. They even have custom paths for a few CPUs, for example "SSE2slow" for K8 chips because they are slow with some SSE2 instructions and "SSE2fast" for just about everything else.
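
The general shape of that kind of per-CPU path selection is a function pointer picked once at startup. The sketch below is not x264's code: the flag names, the selection order and the stand-in functions are all hypothetical, and a real program would fill the flags from CPUID and point at hand-tuned routines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature bits; a real program would derive these from CPUID. */
enum { CPU_SSE2 = 1 << 0, CPU_SSE2_SLOW = 1 << 1, CPU_FMA4 = 1 << 2 };

typedef float (*dot_fn)(const float *, const float *, size_t);

/* Portable fallback. */
static float dot_c(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Stand-ins for tuned paths; here they simply forward to the C version. */
static float dot_sse2(const float *a, const float *b, size_t n) {
    return dot_c(a, b, n);
}
static float dot_fma4(const float *a, const float *b, size_t n) {
    return dot_c(a, b, n);
}

/* Pick the most specific path the CPU supports. Chips flagged as slow
 * with SSE2 are sent to the plain C path in this sketch. */
static dot_fn select_dot(unsigned cpu_flags) {
    if (cpu_flags & CPU_FMA4)
        return dot_fma4;
    if ((cpu_flags & CPU_SSE2) && !(cpu_flags & CPU_SSE2_SLOW))
        return dot_sse2;
    return dot_c;
}
```

A caller would run select_dot once with the detected flags, keep the returned pointer, and call through it in the hot loop, so adding an FMA4 path means one more function and one more branch in the selector.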

Blazkowicz said:
even 3DNow! had a great use case, it was used by 3dfx drivers so it could keep the K6/2's head out of the water.

All of the graphics companies used 3DNow in their drivers. Even Rendition had it in their final beta drivers.

But 3DNow helped a lot more if games themselves used it. Quake 2 was the only real significant example of that though and AMD itself optimized that exe.

mczak · Jun 2, 2011

Triskaine said:
It is only a headstart if someone will actually use it. The adoption rate of FMA4 is unlikely to exceed that of 3DNow! .

I think it should be better. 3DNow! required you to write a completely different code path. But FMA4 you can do with much less effort, though granted unless some SSE/AVX code is really MUL/ADD heavy it's probably still not really worth the effort.

fellix · Jun 9, 2011

Pictures and benchmarks of Zambezi ES sample with reduced clock-rate:

http://www.chiphell.com/thread-210890-1-1.html

rpg.314 · Jun 9, 2011

Does this thing look good?

fellix · Jun 9, 2011

Not really. There are also AIDA64 benches where the cache sub-system is behaving abnormally slowly on the same Zambezi ES sample. The thing is just unstable at its nominal 3.2GHz clock-rate and occasionally spews BSODs. That's the reason for the lowered core clocks.

itsmydamnation · Jun 10, 2011

they're also using a stepping that Charlie (if you believe him) said months ago was broken. I guess it's a question of, have AMD done this stuff on purpose? considering how long B0 steppings have been out in the wild, you would expect to hear something if there were major problems.
