AMD Bulldozer Core Patent Diagrams

YMM-use "appears" to need a mode-switch, which means the processor knows when the 256bit fp-unit has to serve 2x 128bit XMM-SIMDs and could rewire the 256bit-block into 2x 128bit blocks, if only the FP would have two schedulers.

JF-AMD just confirmed my guesses (heard my prayers):

http://www.semiaccurate.com/forums/showthread.php?t=2731&page=6

* two FP-schedulers (this is implicit)
* "runtime" FPU-split/merge (with process-switch granularity)
* 1x ratio for <=SSE
* 0.5x ratio for AVX/SSE5
* no tandem FPU (like 2x 128 units in the same SSE-thread)

I'll buy it. :smile:
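A quick sanity check of those 1x / 0.5x ratios, as a minimal C sketch. The two-128-bit-pipes-per-module figure is my assumption, purely for illustration:

Code:
#include <stdio.h>

/* Back-of-the-envelope per-core throughput, assuming a module has
   two 128-bit FP pipes shared between its two cores. */
int main(void)
{
    double pipes = 2.0;  /* 128-bit pipes per module (assumed) */
    double cores = 2.0;  /* cores sharing the module's FPU     */

    /* <=SSE: a 128-bit op needs one pipe -> 1.0 op/clock per core. */
    double sse_ratio = pipes / cores / 1.0;
    /* AVX: a 256-bit op needs both pipes -> 0.5 op/clock per core. */
    double avx_ratio = pipes / cores / 2.0;

    printf("<=SSE: %.1fx, AVX: %.1fx\n", sse_ratio, avx_ratio);
    return 0;
}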
 
Just for the archive: the speculated FMA unit of Bulldozer, which can do 1x FMA or 1x FMUL + 1x FADD:

http://citavia.blog.de/2009/11/23/some-additional-bits-of-information-7441398/

(I made a similar diagram in another thread about FMA in Evergreen; this was the source I couldn't find back then.)

There'd be quite a few restrictions on the operand combinations you could do with the second variant. If an FMUL and an FADD instruction shared an operand, it's certainly possible, but then you'd need a 4x128-bit output datapath to the retirement buffer. Mind you, in most cases that's 2x128 bits that'll never be used.

Granted, this isn't a mobile processor and they may not be worried about power, but that is just downright wasteful, not to mention all the extra forwarding paths that would be needed to support multiple entry points in the FADD stage, along with the ones associated with the extra output datapaths.

Honestly, the most sensible implementation would probably be variant 1, where 4x scalar 64-bit operations could be issued each cycle.
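To make the two operating modes concrete, here is a minimal C sketch of what such a bridged unit would compute in each configuration. The single-rounding fma() is standard C99; the pairing of an independent MUL and ADD is the speculated second mode, not a confirmed feature:

Code:
#include <math.h>

/* Mode 1: one fused multiply-add, i.e. round(a*b + c) with a
   single rounding step. */
double fma_mode(double a, double b, double c)
{
    return fma(a, b, c);
}

/* Mode 2 (speculated): the same datapath issues an independent
   FMUL and FADD per clock, each result rounded separately. */
void mul_add_mode(double m0, double m1, double a0, double a1,
                  double *prod, double *sum)
{
    *prod = m0 * m1;  /* FMUL result */
    *sum  = a0 + a1;  /* FADD result */
}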
 
Well, I guess SSE will enter the same compatibility state as 3DNow. You don't consider all the SSE (128-bit) logic a waste just because you run 3DNow programs. As such, I see Bulldozer as a pure 256-bit design, and AMD should have done it that way. Any excess of resources spent on old software is maybe lamentable, but unchangeable.

In fact, the entire 3DNow -> SSE -> AVX path should have been an extremely rich experience for AMD in building very flexible internal networks.
 
If an FMUL and an FADD instruction shared an operand, it's certainly possible, but then you'd need a 4x128-bit output datapath to the retirement buffer. Mind you, in most cases that's 2x128 bits that'll never be used.
Which is why the referenced paper says a bridged FMA makes for a 40% bigger FPU than a pure FMA.
The FPU being shared between two INT cores should help make that not such a bad thing. Same for the 1x256-bit/2x128-bit capability.

You get FMA that is a bit slower & a bit more power hungry than a pure FMA. But you keep similar speed and similar power to a classic MAD for MUL & ADD (where a pure FMA is comparatively slow & power hungry).

It's not the fastest or lowest-power option, but it seems like a good compromise for a general-purpose processor that is likely to be running mostly non-FMA code (legacy SSEn) but needs to be capable of FMA for the new instructions.

Presumably later architectures may migrate to a pure FMA once old software is less common or its performance is deemed less critical.
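For illustration, the same multiply-add written the legacy SSEn way and with an FMA4-style intrinsic of the kind Bulldozer was expected to support. The intrinsic name follows GCC's FMA4 headers (compile with -mfma4); how either form maps onto a bridged unit is an assumption:

Code:
#include <x86intrin.h>

/* Legacy SSE: separate multiply and add, two instructions and
   two rounding steps. */
__m128 madd_sse(__m128 a, __m128 b, __m128 c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}

/* FMA4: one fused instruction computing a*b + c with a single
   rounding step. */
__m128 madd_fma4(__m128 a, __m128 b, __m128 c)
{
    return _mm_macc_ps(a, b, c);
}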

Edit: or refer back to itsmydamnation's Nov 09 post: http://forum.beyond3d.com/showpost.php?p=1362342&postcount=94

Also, while re-reading the thread, I'd like to modify my response to this:
Great for future workloads, but not so hot for today's software if I'm not mistaken.
BD will do 1x128-bit FP per core, i.e. the same as the Phenom architecture, so it should give the same or better FP performance for existing software.
For future software, 256-bit will be there, but shared between two cores.
Given an equal number of cores, 256-bit FP-heavy stuff may suffer vs an architecture with 1x256-bit per core.
I saw a JF-AMD post somewhere pointing out that 8-module BD vs 8-core SB will be: 16 cores vs 8 cores, 16 threads vs 16 threads, & 8x256-bit FP vs 8x256-bit FP, so that shouldn't be too much of an issue, assuming similar clocks.
 
You get FMA that is a bit slower & a bit more power hungry than a pure FMA. But you keep similar speed and similar power to a classic MAD for MUL & ADD (where a pure FMA is comparatively slow & power hungry).

Until we have the latency tables for Bulldozer we can get all excited, so let's go: :D

An FMA block of that type could possibly also run an ADD and a MUL in parallel, OoO-parallel or thread-parallel, if the ports can be routed to two different register files. Then the question comes up: how is the internal register file organized? Do we have a unified register space for all 3 related modules? AMD's promise of being able to refactor into 4:1 or 1:1 Bulldozer variants indicates that the register<->opunit network is very advanced and flexible, in addition to being fine-grained (occupying fractions of units). In fact, a splittable SSE2 implementation now offers some sort of MIMD!
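As a hedged sketch of that split-vs-merged idea in C intrinsics: two independent 128-bit ops, such as two threads could issue to the split halves, against one 256-bit AVX op occupying the merged unit. This only illustrates the instruction mix; how the hardware routes it is the speculation above:

Code:
#include <immintrin.h>

/* Split view: two independent 128-bit ops (e.g. one per thread)
   that a 2x128-bit FPU could execute side by side. */
void split_work(__m128 *x, __m128 *y)
{
    x[0] = _mm_mul_ps(x[0], x[1]);  /* thread A: 128-bit FMUL */
    y[0] = _mm_add_ps(y[0], y[1]);  /* thread B: 128-bit FADD */
}

/* Merged view: one 256-bit op occupying the full-width unit. */
void merged_work(__m256 *z)
{
    z[0] = _mm256_add_ps(z[0], z[1]);
}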

It really sounds like this processor is a huge step, bigger than the Opteron design.
 
Until we have the latency tables for Bulldozer we can get all excited, so let's go:
Well, I was referring to the latency tables in the paper, which was written by AMD guys who had actually implemented the 3 types in 65nm hardware, so it's more reliable than pure theory.
Of course, it's entirely possible that they have actually built something different or found some improvements since <shrug>
 
16 instructions per cluster/clock?! :oops:

Not so long ago we were all presuming it was max 4 ops per cluster because they'd be making efforts to keep overhead low :LOL:

I wonder if they found some sort of alternative way to do decode that scales significantly more power- & area-efficiently than previous implementations?

Presumably this is only up to 8 on a restricted set of instructions that require little register space & execution resources?
But if those instructions are fairly common, or at least a common bottleneck, then it could be a fairly significant performance win.
 
16 instructions per cluster/clock?! :oops:
I think it is 8 ops/cluster
Presumably this is only up to 8 on a restricted set of instructions that require little register space & execution resources?
But if those instructions are fairly common, or at least a common bottleneck, then it could be a fairly significant performance win.
I'd think so too. On AMD (and presumably on Intel too), some instructions have a fast decode path.
 
4 ops/clock is the practical limit of ILP, and is not tied to x86 in any way.
How is it not tied to x86? CISC instructions do more work than RISC instructions. For example, look at x86's xchg instruction: it swaps the values of 2 registers. To do that in something like MIPS you would have to use 3 registers, and it generates dependencies:
Code:
x86

xchg eax, edx
Code:
mips

move   $t3, $t1
move   $t1, $t2
move   $t2, $t3
I know exchanges are not very common, but my point holds.
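For what it's worth, even the classic trick that avoids the third register doesn't buy any parallelism; a C version (my addition, just to underline the dependency point) still forms a strict three-op dependency chain:

Code:
/* XOR swap: no temporary needed, but each op depends on the one
   before it, so it decodes/executes no "wider" than the 3-move
   version. Breaks if a and b alias the same object. */
void xorswap(unsigned *a, unsigned *b)
{
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}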
 
How is it not tied to x86? CISC instructions do more work than RISC instructions. For example, look at x86's xchg instruction: it swaps the values of 2 registers. To do that in something like MIPS you would have to use 3 registers, and it generates dependencies:

Code:
mips

move   $t3, $t1
move   $t1, $t2
move   $t2, $t3
I know exchanges are not very common, but my point holds.

4 ops means 4 basic ops, the kind found in RISCs.
 
This is about instruction decode bandwidth, not instruction execution. Decode bandwidth is tied to the memory (sub-)systems and is not an x86-only problem space. The deal with very fast IPS processors is that instruction fetch may need to be served as fast as data fetch, with the little difference that data doesn't need to be processed, while instructions need to be broken up and put into the pipeline. Being able to put half a cacheline of insns, fully ready, in front of the pipeline in a single clock is a big deal (though not unprecedented). It comes with restrictions, most notably that the occupied memory (sub-)system resources stay within 64 bytes (16 bytes insns, 16+16 bytes load, 16 bytes store = 64 bytes). I assume this is the maximum Bulldozer's memory system can sustain; you don't need to waste logic on handling more insns/loads/stores, because you are memory limited anyway.
 
4 ops means 4 basic ops, the kind found in RISCs.
Ahh, so you are talking about micro-ops. Then it is probably about the same instruction count, possibly significantly more if operands come from memory.

What really needs to improve for IPC to increase is cache miss rates. Trading off higher latency for a higher hit rate is probably a better option for future CPUs.
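That trade can be eyeballed with the usual average-memory-access-time formula, AMAT = hit latency + miss rate * miss penalty. The numbers below are made up purely for illustration:

Code:
#include <stdio.h>

/* AMAT = hit latency + miss rate * miss penalty (all in cycles). */
static double amat(double hit, double miss_rate, double penalty)
{
    return hit + miss_rate * penalty;
}

int main(void)
{
    /* Hypothetical 100-cycle miss penalty. */
    printf("fast cache, more misses:    %.1f cycles\n",
           amat(3.0, 0.10, 100.0));  /* 13.0 */
    printf("slower cache, fewer misses: %.1f cycles\n",
           amat(4.0, 0.07, 100.0));  /* 11.0 */
    return 0;
}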
 