AMD Bulldozer Core Patent Diagrams

I rechecked the descriptions for SB's integer SIMD units, and the current implementation does not have wider integer operations; rather, there is a promise for more at some later date.
Would that be AVX2? Is there such a thing planned?

There are two integer SIMD blocks for both BD and SB.
On both I see 3: BD has IMAC, MMX and MMX and SB has 3 (one "MUL" and two "ALU") :???: Are the pix on RWT no good? Am I misreading them?

BD does have an integer FMAC on the first FP port.
I guess what you're saying is that AMD's MMX = SB's ALU in RWT's diagrams, and BD has IMAC while SB has IMUL.

Consumer apps have a problem in that they are poorly threaded and favor single-threaded performance. The BD FPU has higher latency, and its read/write capability is no better than a single core's.
But in basic terms there's two 128-bit SSE units available to a single thread.

For many games, it is more of a question of whether SIMD shows up at all.
That was my impression.

The integer pipelines and single-threaded performance would matter more in the client space for games, which does not look like it favors BD.
I think the single-threaded game era has effectively ended. That's not to say per-thread efficiency isn't very important.

With regards to x264, it seems the Sandy Bridge preview thread has some chatter about using the special-purpose hardware in SB for the codec.
I saw that. Though Jason seemed pretty dismissive of that specifically. He was much more excited about the 2 load ports from L1.

This is a lateral move around the SSE/AVX debate, apparently.
If there's anything in it I expect we'll hear noises from Jason pretty early in SB's life - he was very prompt with Nehalem tweaks.
 
Would that be AVX2? Is there such a thing planned?
AVX changes a number of things so that it is more easily scaled upwards, such as a more generic context-saving method.
Int SIMD may go to 256-bit, and possibly 512. On the FP side, Intel said it is possible to go to 512 or 1024 bits.

On both I see 3: BD has IMAC, MMX and MMX and SB has 3 (one "MUL" and two "ALU") :???: Are the pix on RWT no good? Am I misreading them?
I left the shuffle/XBAR blocks to a separate sentence because it wasn't clear from the descriptions if all the blocks are necessarily 1:1 in functionality.
I think the MMX and SIMD ALU blocks do pretty much the same thing, going by the descriptions.

The FMAC does not have a matching unit in SB.
The XBAR might cover integer shuffles, but I'm not certain.
SB has a few extra places where it can send miscellaneous ops like shuffles and blends, but the BD diagram does not state exactly where those go.


But in basic terms there's two 128-bit SSE units available to a single thread.
It seems that way. It's not a major leap, and given the possible size of Zambezi, it's a sign that the client space is not the target for this.
If the clocks can be jacked up to some lofty numbers it can be more impressive.
A 5 GHz BD could jam a lot of Int SIMD ops through, for example.
I certainly hope it clocks higher than 4 GHz, but I fear its speed demon approach may not be justifiable without something significantly faster than 4.
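Back-of-envelope on that 5 GHz scenario, assuming two 128-bit integer SIMD pipes each retiring one packed op per cycle (an assumption for illustration, not a confirmed BD figure):

```python
def int_simd_ops_per_sec(ghz, pipes=2, lanes=4):
    # lanes = 128-bit pipe width / 32-bit elements; pipe count and
    # one-op-per-cycle throughput are assumptions, not disclosed specs.
    return ghz * 1e9 * pipes * lanes

# A 5 GHz part would push 40 billion 32-bit integer ops/s through:
print(int_simd_ops_per_sec(5.0))  # 4e+10
```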
 
It seems that way. It's not a major leap, and given the possible size of Zambezi, it's a sign that the client space is not the target for this.
JF's blog posting seems mainly to say one thing with regard to FP: "we're not going balls to the wall with AVX yet, in the meantime your SSE code might run faster than it did."

If the clocks can be jacked up to some lofty numbers it can be more impressive.
A 5 GHz BD could jam a lot of Int SIMD ops through, for example.
I certainly hope it clocks higher than 4 GHz, but I fear its speed demon approach may not be justifiable without something significantly faster than 4.
Is that before or after "turbo"? The clock range seems kinda ludicrous, e.g. +30% for "speed-demon" and +20% for turbo. I dare say SB isn't much different in this respect, a huge unknown.
 
WTF with the 'AMD can implement AVX later' stuff :?:
AVX is fully implemented in BD. BD just isn't designed around maximising peak AVX throughput.
 
JF's blog posting seems mainly to say one thing with regard to FP: "we're not going balls to the wall with AVX yet, in the meantime your SSE code might run faster than it did."
The FPU's basic design comes from AMD's aborted SSE5 attempt. It is weak at AVX because it probably predates AVX and was jerked around by a previously mentioned SSE5/AVX/XOP/FMA3/FMA4/FMAlater/FMA3again/OMGWTFBBQ debacle.

The current FPU design is an awkward missing-link design between SSE5 and AVX.

Is that before or after "turbo"? The clock range seems kinda ludicrous, e.g. +30% for "speed-demon" and +20% for turbo. I dare say SB isn't much different in this respect, a huge unknown.
SB's release clocks and initial turbo levels are pretty much official.
The top bin at release will have a 3.8 GHz turbo. It may have half a year or more to go up several speed grades before Zambezi.

As for the idea of BD's clocks reaching such high levels being implausible, I agree. This is why I'm much cooler to Zambezi in particular after the disclosure of more information about BD and SB.

AMD's strip-tease approach to architectural reveal is less enthralling when each admission adds little to be buzzed about.
 
It is weak at AVX because it probably predates AVX and was jerked around by a previously mentioned SSE5/AVX/XOP/FMA3/FMA4/FMAlater/FMA3again/OMGWTFBBQ debacle.
Wha?
AVX is a subset of SSE5.
XOP is the rejig of SSE5 to match the VEX encoding of AVX for the overlapping features while still allowing the extra bits which AMD implemented & Intel didn't.

The BD design is fundamentally geared to the idea that INT|FPU instruction ratio is at least 2|1 for most loads so it makes perfect sense that they wouldn't make an AVX peak-rate monster.
The difference is not in the details of the FPU implementation but in the higher level '2 INT cores sharing a FPU' Module architecture.

Sure it means that SB will easily win Blu-ray ripping benchmarks but that's really no biggie for me as long as BD is competitive in benchmarks for workloads I do use.
 
Wha?
AVX is a subset of SSE5.
SSE5's instructions, and its whole encoding scheme and context saving methods did not match AVX at all.
It's hard to imagine how SSE5, which did not provision for 256-bit operands or any future growth in vector width, can somehow be a superset of AVX, which does.

50% of the FPU's FP throughput is tied up in FMAC functionality that will not be used by AVX for the lifetime of Zambezi.
SSE5 leaned hard on the FMAC units.
The 128-bit data paths and use of doubles for decoding AVX shows BD's core did not shift too far from the SSE5 it had at inception.
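To put numbers on that 50%: a fused multiply-add delivers two FLOPs per lane per cycle, while AVX code with no FMA instructions can only extract a lone mul or add from the same pipe. A quick sketch, with the lane count assumed (4 single-precision lanes in a 128-bit pipe), purely for illustration:

```python
def flops_per_cycle(lanes=4, fused=True):
    # A fused multiply-add counts as 2 FLOPs per lane per cycle;
    # a standalone mul or add counts as 1.
    return lanes * (2 if fused else 1)

print(flops_per_cycle(fused=True))   # FMA4/XOP code: 8
print(flops_per_cycle(fused=False))  # AVX without FMA: 4, half the peak
```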

XOP is the rejig of SSE5 to match the VEX encoding of AVX for the overlapping features while still allowing the extra bits which AMD implemented & Intel didn't.
XOP is an attempt to wedge SSE5 as close as possible to AVX. Its encoding does not totally match, for reasons known to AMD and Intel. What this means for the future of XOP, I do not know, just that if Intel decides to use some of the bits in dispute, XOP cannot follow.

The BD design is fundamentally geared to the idea that INT|FPU instruction ratio is at least 2|1 for most loads so it makes perfect sense that they wouldn't make an AVX peak-rate monster.
FMA was how it was supposed to get its big FP boost, rather than longer vector widths.
It was still aiming for more FP power.
It's not geared to AVX peak-rate because it wasn't designed for AVX.
 
50% of the FPU's FP throughput is tied up in FMAC functionality that will not be used by AVX for the lifetime of Zambezi.
SSE5 leaned hard on the FMAC units.
The 128-bit data paths and use of doubles for decoding AVX shows BD's core did not shift too far from the SSE5 it had at inception.

I think it is pretty clear that AMD had most of the FPU completed when the decision to ditch SSE5 and embrace AVX was made. AMD being the minor player had to follow Intel. AVX execution in BD is going to be similar to SSE execution on 3DNow exe units.

The rationale for both is perfectly valid. For SSE5/FMA, not only can you get more performance for vectors, but scalar code could be sped up considerably as well (remember scalar SSE2 is the default today, x87 RIP!). AVX potentially doubles performance for vectorizable workloads.

Cheers
 
Dresdenboy's blog has a link to some GCC patch files with descriptions of the BD pipeline and instructions.

They appear to be based on some NDA information, and a few spots appear to be incomplete or are commented as being based on information coming from K8.

Instruction latencies for some of the more complex ops have gone up, 33-50% in the case of IMUL, which seems to be consistent with a higher-clocked design.
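That tradeoff is easy to see in wall-clock terms. With hypothetical cycle counts (not the actual patch figures), a longer-latency op at a higher clock can be a wash:

```python
def latency_ns(cycles, ghz):
    # GHz is cycles per nanosecond, so latency in ns is cycles / GHz.
    return cycles / ghz

# Hypothetical: a 3-cycle IMUL at 3 GHz and a 4-cycle IMUL at 4 GHz
# take the same 1 ns of wall-clock time.
assert latency_ns(3, 3.0) == latency_ns(4, 4.0) == 1.0
```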

Some parts appear inconsistent with what has been described so far.

http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01883.html

+;; The bdver1 contains four pipelined FP units, two integer units and
+;; two address generation units.
+;;
+;; The predecode logic is determining boundaries of instructions in the 64
+;; byte cache line. So the cache line straddling problem of K6 might be issue
+;; here as well, but it is not noted in the documentation.
+;;
+;; Three DirectPath instructions decoders and only one VectorPath decoder
+;; is available. They can decode three DirectPath instructions or one
+;; VectorPath instruction per cycle.

...

+;; Model the fact that double decoded instruction may take 2 cycles
+;; to decode when decoder2 and decoder0 in next cycle
+;; is used (this is needed to allow throughput of 1.5 double decoded
+;; instructions per cycle).
Some data is commented as coming from K8, and may not be accurate. The decode part is not commented as such, but this description does not correspond to a 4-wide decoder.
The direct-path and double throughputs seem to be more appropriate for a K8.
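The 1.5-per-cycle figure in the patch comment falls out of simple slot accounting, if you assume 3 decode slots per cycle and 2 slots consumed per double-decoded instruction:

```python
def double_decode_throughput(slots_per_cycle=3, slots_per_double=2):
    # Steady state: each double-decoded instruction occupies two of
    # the three decode slots, so 3 / 2 = 1.5 such instructions/cycle,
    # matching the comment in the patch.
    return slots_per_cycle / slots_per_double

print(double_decode_throughput())  # 1.5
```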
 
I'm not that sure of that, TBH. Traditionally, when AMD had something that Intel didn't have, that meant either no support for it until Intel had it too or death. AMD64 is the one exception, and that was a special case...FMA4 certainly isn't that. It's a bit 3DNow-ish, IMHO.
Not sure I agree here. 3DNow was somewhat difficult to implement - due to the different vector length compared to SSE, this pretty much meant you had to do two completely different implementations of your code if you wanted to support both.
However, FMA4 should be rather easy to exploit.
 
It's hard to imagine how SSE5, which did not provision for 256-bit operands or any future growth in vector width, can somehow be a superset of AVX, which does.
OK my bad I've been remembering this wrong it seems :oops:
But there was a decent chunk of overlap.

Its encoding does not totally match, for reasons known to AMD and Intel.
It matches for the AVX instructions.
AMD stayed out of the way of potential future Intel instructions but if Intel brings out VEX encoded equivalents to the other ones then AMD should be able to map the Intel encoding to the same instruction.

50% of the FPU's FP throughput is tied up in FMAC functionality that will not be used by AVX for the lifetime of Zambezi.
SSE5 leaned hard on the FMAC units.
Yeah, I'm kinda disappointed that they don't seem to have gone for the bridged FMA which would have allowed the MUL & ADD to still both be used for non-FMA ops.
On the other hand, they can do 2x 128-bit ADD, 2x 128-bit MUL, or one of each per clock as long as the other core isn't doing an FP op that clock, which is at least likely to be more area efficient.
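The issue constraint described there boils down to "at most two 128-bit FP ops per clock across the module," which can be sketched as follows (the two-pipe count is from the post; treating the pipes as fully symmetric is an assumption):

```python
def can_issue(adds, muls, pipes=2):
    # Two symmetric FMAC pipes: any mix of ADDs and MULs fits as long
    # as the total doesn't exceed the pipe count (and the sibling core
    # isn't using the shared FPU that clock).
    return adds + muls <= pipes

assert can_issue(2, 0)       # 2x 128-bit ADD
assert can_issue(0, 2)       # 2x 128-bit MUL
assert can_issue(1, 1)       # one of each
assert not can_issue(2, 1)   # three ops won't fit in two pipes
```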
 
What is AMD and Bulldozer doing different to what Intel and the Pentium 4 architecture did when trying to increase clockspeed at the expense of IPC?

It did not work for Intel, why will it work for AMD?
 
On newer motherboards (higher FSB) and LN2, NetBurst hit near 10 GHz, so it could reach its designed speeds.

I think the bigger issue with NetBurst was that some parts of the chip ran at 2x clock speed and also that branch misses really hurt. Speed instead of IPC wasn't its downfall; it was the overall design itself. IBM seems to have no problem with very high-clocking POWER chips, granted that's not the exact same market.
 
I can't wait to see performance numbers. My Q9550 is getting long in the tooth. I want to upgrade, but I don't feel what's currently out is worth upgrading to. I will most likely hold off till next fall though, so hopefully the home version of Sledgehammer is out by then.
 
What is AMD and Bulldozer doing different to what Intel and the Pentium 4 architecture did when trying to increase clockspeed at the expense of IPC?

Same as what IBM did with its latest POWER series processors? (Which totally kick ass.)

Their pipeline stages are about equal in length to Bulldozer's pipeline stages.

It did not work for Intel, why will it work for AMD?

Because Intel did not do the same thing as what AMD is doing.

Bulldozer increases pipeline length moderately, like 40-50%.

P4 increased pipeline length more than 100%.

And the major reason for the P4's (Willamette, Northwood) low IPC was not the long pipeline itself, but other design choices:

1) Very small L1D cache (to achieve low latency at high clock speed). Bulldozer has twice the L1D (per core, meaning 4 times the L1 capacity per thread compared to Northwood with HT enabled), and better memory reordering for cases where it misses.

2) L2 cache with too small a block count (small size combined with big block size). Bulldozer has 8/4 times more L2 cache blocks per core, or 16/8 times more cache blocks per module, than the P4 (Willamette/Northwood).

3) The P4's replay mechanism often ruined performance (it especially ruined SMT/HyperThreading performance). AFAIK Bulldozer will also have some kind of replay mechanism, but it's not needed as often as the P4's replay, so its impact on performance will be much smaller (and there is no SMT for integer code in Bulldozer, so it cannot disturb that).

4) High-clocked mini-ALUs with the lowest bits aligned to a different clock edge than the highest bits, which added lots of delay when data had to leave the mini-ALUs (as for shifting and multiplication, which made those instructions very slow). Bulldozer puts the whole integer core in one place and one clock domain; all integer operations are relatively fast, and overheads occur only when moving data to the FPU (where there are always longer latencies anyway).


Prescott is a different story:
Prescott fixed 1 and 4 of those, and helped with 2, but lengthened the pipeline much more, having pipeline stages almost half the length of even Bulldozer's, one third of the P6/K7 pipeline stage length.

So it's in a completely different pipeline-length class than Bulldozer, and in no way a balanced architecture.
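The depth argument can be put in rough numbers. With made-up stage counts that only respect the ratios claimed above (not official figures), the branch-flush cost scales with depth:

```python
def mispredict_cpi_overhead(depth, mispredicts_per_insn):
    # A mispredict flushes the pipeline, wasting roughly `depth` cycles.
    return depth * mispredicts_per_insn

# Hypothetical depths honoring the post's ratios, at a 1% mispredict rate:
base = mispredict_cpi_overhead(12, 0.01)  # K7/P6-class
bd = mispredict_cpi_overhead(18, 0.01)    # ~50% deeper
p4 = mispredict_cpi_overhead(26, 0.01)    # >100% deeper

print(bd / base)  # 1.5x the flush cost of K7/P6
print(p4 / base)  # ~2.17x
```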
 