AMD Bulldozer Core Patent Diagrams

For single-threaded work, load/store for a single thread in FP should be able to use both load/store paths, shouldn't it? L2->L1 fills only have to be duplicated so that everything in either L1 is also in the other, to enable arbitrary loads. Stores from FP can use both paths to keep the two L1s coherent.
Memory operations for a single thread go through the load/store path of that thread's core.
The other core's memory pipeline is not accessible.
Back when the talk was of clusters and not cores, I had hoped this was not the case, since then AMD could have offered a really high-bandwidth input to the FPU.

As it stands, the FPU can only have a load throughput that matches the throughput of a single core, at 256 bits/cycle sustained.
This is the same whether it is one thread or two.
With AVX, SB should be able to sustain 384 bits/cycle of reads to its FPU, though write capability may be better with BD in the multithreaded case.
 
Wouldn't run-ahead pre-fetching for single-threaded work put everything in L2 at least, so L3 latency becomes irrelevant in that scenario?

Sorry for the out of order reply, I didn't see this before.
I'm not entirely sure what you mean by run-ahead prefetching.
There is the memory-level parallelism that is extracted by the OoO core's speculating ahead of dependences and firing off loads up to the capacity of the load queue.
If by run-ahead you mean some kind of scout threading, which will fire off accesses even without knowing if the address values it is using will actually be generated by the code it is running through, that is not on the menu for BD.

There are stride-based prefetchers, and some kind of predictive prefetcher AMD has not detailed. These don't really run ahead so much as they trigger on certain memory access patterns and start memory accesses under the assumption the pattern will hold.
The effectiveness of the implementation is not yet known.
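For illustration, here is a rough sketch of the kind of logic a stride-triggered prefetcher uses; the table structure, confidence threshold, and prefetch degree are invented for the example and say nothing about AMD's actual implementation:

```python
# Toy stride prefetcher: a per-PC table records the last address and stride seen.
# Once the same stride repeats a couple of times, it requests lines further ahead
# on the assumption the pattern will hold (mirroring the description above).
CONFIDENCE_THRESHOLD = 2   # illustrative value
PREFETCH_DEGREE = 2        # how many strides ahead to fetch, also illustrative

class StridePrefetcher:
    def __init__(self):
        self.table = {}    # pc -> (last_addr, last_stride, confidence)

    def access(self, pc, addr):
        last_addr, last_stride, conf = self.table.get(pc, (addr, 0, 0))
        stride = addr - last_addr
        conf = conf + 1 if stride != 0 and stride == last_stride else 0
        self.table[pc] = (addr, stride, conf)
        if conf >= CONFIDENCE_THRESHOLD:
            # Pattern holding: start accesses ahead of the demand stream.
            return [addr + stride * i for i in range(1, PREFETCH_DEGREE + 1)]
        return []

pf = StridePrefetcher()
for a in range(0x1000, 0x1200, 64):   # a load walking memory in 64-byte steps
    print(hex(a), [hex(p) for p in pf.access(pc=0x400123, addr=a)])
```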

The L2 seems large enough that it will be more resistant to cache pollution.
Whatever AMD has done for the L3, it would add some amount of latency to every L2 miss. I'd be curious which would win out. Intel has a larger L1, a smaller but faster L2, and an L3 with at least 20% higher latency than BD's L2, though its cores are just about that much more capable at hiding latency.
BD has the small L1, the large L2, and an unknown L3. The best the L3 can offer can't be better than what the L2 sets as a baseline.
I'm willing to be wowed by AMD, but SB's L3 arrangement does look really nifty.
 
If by run-ahead you mean some kind of scout threading, which will fire off accesses even without knowing if the address values it is using will actually be generated by the code it is running through, that is not on the menu for BD.
As far as I can tell, there appears to be consensus on this, but no actual reasoning. Why not?

For example in a single-threaded scenario the stride analyser can create instructions to be executed by the second core, so that L2 (or L1) is populated with the strided data in time to be used by the thread.

Anyway, I've hunted around for people's justifications, but I can't find any :???:
 
As far as I can tell, there appears to be consensus on this, but no actual reasoning. Why not?
If the implementation is run-ahead by having the other core perform the scouting, it is because there is no mechanism for sending register or execution status between cores.
If it is scout threading on one core, it is because each core has one thread.

If there is some other form of speculative runahead, it would have to work on a single core with only one thread to work with. I have not found a denial of something that specific, though I would question the utility since this runahead would be on top of an already aggressive OoO speculative pipeline that will be heavily exercised without runahead.

For example in a single-threaded scenario the stride analyser can create instructions to be executed by the second core, so that L2 (or L1) is populated with the strided data in time to be used by the thread.
The L2 prefetchers would initiate accesses in the cache without generating instructions. They wouldn't modify or create a code stream for the other core to run.
 
The L2's capacity is the big non-debatable advantage BD has over SB that I see.
Other features are unknowns (branch prediction, prefetchers, L3, clocks, memory controller), numerically inferior (Int units, FP ports, rename, load/store capacity, ROB length), or not future-proof (128-bit SSE advantage, slight FP rename advantage both negated by AVX).

We are always going to be comparing apples to oranges with BD since one BD core or module isn't equal to a core from Intel.

SB has a bigger ROB and deeper queues, which benefit a single-threaded workload, but only if SMT is disabled: the ROB and queues are partitioned between contexts with SMT enabled. SB has more integer units, but its execution ports are shared among a variety of execution units, so depending on the instruction mix SB will be faster or slower than a BD core.

The big handicap for BD is AVX. I think the BD FP unit was conceived before the AVX spec was finalized. It won't be as fast as SB for AVX workloads. I'm guessing AMD is already working on a double width FP unit (and data paths) contingency for BD 2.

Cheers
 
We are always going to be comparing apples to oranges with BD since one BD core or module isn't equal to a core from Intel.

Exactly, if we compare SB and BD on a per-thread basis then it isn't as clear cut:

              SB-Thread  BD-Core
Load-Queue    32         40
Store-Queue   18         24
RS-Entries    27         40
PRF INT       80         96
PRF FLOAT     72         80
ROB-Entries   84         64

I've made a few educated guesstimates and came to the conclusion that a single BD module should be about the same size as an SB core (~18 mm²). Both of them offer two threads, so from a silicon point of view comparing a module to an SB core is fair.
Sandy Bridge-EP will bring a maximum of 8C/16T per socket, AMD's Interlagos will bring a maximum of 8M/16C per socket, so AMD is well aware of BD's strengths and weaknesses.

If BD has no terrible drawbacks or bottlenecks we don't know about then it will likely be an adequate competitor to SB in the server market.

However, the situation on the client side will not be so favorable, as Intel's emphasis on single-thread performance will likely be more attractive to many customers than BD's multithread power.

The preliminary benchmarks show that Bobcat's IPC is on the same level as K8's, and if AMD can achieve that with a relatively low-power core, then a single BD core can easily top that.

So while it is very probable that BD won't match Sandy on ST IPC, an IPC for BD on the level of C2D to Nehalem is not an unreasonable expectation.
 
We are always going to be comparing apples to oranges with BD since one BD core or module isn't equal to a core from Intel.
Fortunately for us, the threads they will be running will not be apples and oranges.
For various server loads, BD should have an advantage on a module to core basis.

I have taken care to differentiate between the single and multithreaded cases, and noted that this is why Zambezi looks like it may not fit very well for client workloads, which tend to favor higher single-threaded performance, have no widespread incidence of scaling to 8 cores, and are frequently not as latency tolerant.


I've made a few educated guesstimates and came to the conclusion that a single BD module should be about the same size as an SB core (~18 mm²).
Educated guesses from what?

Sandy Bridge-EP will bring a maximum of 8C/16T per socket, AMD's Interlagos will bring a maximum of 8M/16C per socket, so AMD is well aware of BD's strengths and weaknesses.
The latter is an MCM, which will have an impact.
I also have not seen AMD making headway in changing the per-core licensing fees of some software products, which will mean that for economic reasons there are workloads where Interlagos would be extremely hard to justify. Even if the throughput is lower for SB, it would cost half as much in licensing, and those fees can buy hardware multiple times over.

So while it is very probable that BD won't match Sandy on ST IPC, an IPC for BD on the level of C2D to Nehalem is not an unreasonable expectation.
BD will be facing Ivy Bridge by the time it makes it to mainstream desktops, since Zambezi is seriously overkill in core count. However, without that unnecessarily high core count, Zambezi is not competitive in FP.
 
I have taken care to differentiate between the single and multithreaded cases, and noted that this is why Zambezi looks like it may not fit very well for client workloads, which tend to favor higher single-threaded performance, have no widespread incidence of scaling to 8 cores, and are frequently not as latency tolerant.

I agree that a 4 module/8 core BD would be overkill for most client machines and that most people would be better off with a 2 module/4 core chip which uses the power envelope headroom to clock higher.

But, I don't see *anything* in the microarchitecture of BD that should make it significantly weaker than what Intel is offering. Regardless, BD will be a huge jump over existing AMD CPUs.

The L2 cache size-timing tradeoff is unusual, but I'd be surprised if AMD hadn't done the simulation to justify this organization. We'll see in a year's time if they suddenly rearrange the L2 to something smaller and faster.

Cheers
 
Perhaps they are relying on Fusion (and BD+fusion) to compete in the desktop market, which is why BD seems to be server oriented.
 
I agree that a 4 module/8 core BD would be overkill for most client machines and that most people would be better off with a 2 module/4 core chip which uses the power envelope headroom to clock higher.

But, I don't see *anything* in the microarchitecture of BD that should make it significantly weaker than what Intel is offering. Regardless, BD will be a huge jump over existing AMD CPUs.
Is this facing off against a 4-core SB?

The L1 is smaller, and I reviewed some data for miss rates, and a 4-way 16KiB cache is measurably worse in miss rate than a 64 KiB 2-way cache.
Integer width would be twice as high for SB.

In FP, the FPU can issue 2 FP loads. That is a total of 4 FP loads per clock for a 2-module BD.
SB will have double that.

The best-case situation is that BD's FPU can split its FMACs into dual ADD and MUL units. In that case it has the same 128-bit SSE throughput as SB, though in the case of a large number of memory operands, this is limited by the load capability. In the multithreaded case, there is the possible added advantage of one more write through the other core.

When sustaining an FMUL and FADD per clock, SB will have FP shuffle capability.
BD will have its FP throughput halved.

With AVX, SB would completely embarrass a 2-module BD. It would have 50% higher read bandwidth, double the FP throughput, and nearly double the register rename capability.
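To put numbers on that, here is the back-of-envelope arithmetic on a per-SB-core versus per-BD-module basis, using only the widths assumed earlier in this thread (not confirmed specifications):

```python
# Read bandwidth into the FP unit, per clock, using the thread's assumed figures.
bd_fpu_read_bits = 256   # BD module: 256 bits/cycle sustained through one core's L/S path
sb_fpu_read_bits = 384   # SB core with AVX: 384 bits/cycle of reads
print(f"SB/BD read bandwidth: {sb_fpu_read_bits / bd_fpu_read_bits:.2f}x")   # -> 1.50x

# FP throughput: SB sustaining a 256-bit FMUL plus a 256-bit FADD per clock,
# versus a BD module's two 128-bit FMACs (best case, split into ADD and MUL).
sb_fp_bits = 2 * 256
bd_fp_bits = 2 * 128
print(f"SB/BD FP throughput: {sb_fp_bits / bd_fp_bits:.1f}x")                # -> 2.0x
```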


Perhaps they are relying on Fusion (and BD+fusion) to compete in the desktop market, which is why BD seems to be server oriented.

Llano will be holding the fort for quite some time, less if it gets delayed again.
 
Llano will be holding the fort for quite some time, less if it gets delayed again.
Anandtech says it is expected in late 2Q11. Still about 4 months after SB.

SB's integration of CPU and GPU cache hierarchy is really cool. It's a pity it will come later from AMD. Besides, SB has a clear lead in the video encode department.
 
Is this facing off against a 4-core SB?

I don't think a 4-core SB vs a 2 module BD is an apples-to-apples comparison, nor do I think a 4-core SB vs a 4 module BD is, hence my comment above about apples and oranges.

In order to evaluate BD and SB we need to look at:

1. Go-for-broke single thread performance of a SB core with SMT disabled vs. a BD core/module.
2. Massively multithreaded performance; as many cores in a package within a given power envelope.
3. Something in between, like a multithreaded application that can reasonably use 8-16 contexts. Again limited by power envelope.

The L1 is smaller, and I reviewed some data for miss rates, and a 4-way 16KiB cache is measurably worse in miss rate than a 64 KiB 2-way cache.
Integer width would be twice as high for SB.

The data cache is smaller, the instruction cache isn't. A 16KB 4-way cache has a 1% miss rate on SpecInt 2000 (old, but that is the one I can remember off the top of my head), a 64KB 2-way cache might have half that, but that doesn't impact average latency if the smaller cache allows operating frequency to be just a few percent higher.

On top of that AMD must be confident that they can schedule around L2 latency, otherwise they would have made it smaller and faster. Pathological cases where you are pointer chasing in a working set that is larger than 16KB and smaller than 64KB will suck on BD compared to Opteron, but that case is just that, pathological.

As for execution resources, it all comes down to if you compare half a module to one SB core or not.

Cheers
 
I don't think a 4-core SB vs a 2 module BD is an apples-to-apples comparison, nor do I think a 4-core SB vs a 4 module BD is, hence my comment above about apples and oranges.
From the i7 to i5, that will be the comparison.
A 2 module BD would be going against a SB i3, or rather, it would if it were released in the same year and not probably after Ivy Bridge.
The number of FP blocks looks like a decent marketing point to gauge comparisons.

At best, an i3 versus BD will likely have the same performance comparison as the larger chips: weaker IPC in integer, potentially better integer multithreading, roughly equal FP in SSE, likely loss in AVX.

In order to evaluate BD and SB we need to look at:

1. Go-for-broke single thread performance of a SB core with SMT disabled vs. a BD core/module.
A common client workload where performance matters at all.

2. Massively multithreaded performance; as many cores in a package within a given power envelope.
Not a significant factor for the client space.

3. Something in between, like a multithreaded application that can reasonably use 8-16 contexts. Again limited by power envelope.
Not too common as of yet, outside of a few cases like multimedia apps. SB can trade blows here, unless it's an app that loves AVX.

Power is an unknown factor. If AMD's process does not reasonably match Intel's, it may not be able to clock as high as needed.

The data cache is smaller, the instruction cache isn't.
Instruction cache misses are a fraction of what a data cache experiences. I was not considering the Icache.

A 16KB 4-way cache has a 1% miss rate on SpecInt 2000 (old, but that is the one I can remember off the top of my head), a 64KB 2-way cache might have half that, but that doesn't impact average latency if the smaller cache allows operating frequency to be just a few percent higher.
The L1 has 33% longer latency, as does an L2 hit.
The average latency is 33% longer by default, without changing the ratio of L1 and L2 hits.
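As a rough sketch of why that matters, here is the usual average-memory-access-time arithmetic; the latencies and miss rates below are approximations of the figures discussed in this thread, not measurements:

```python
# AMAT = L1 hit latency + L1 miss rate * (L2 hit latency + L2 miss rate * miss penalty)
def amat(l1_hit, l1_miss, l2_hit, l2_miss, beyond_l2):
    return l1_hit + l1_miss * (l2_hit + l2_miss * beyond_l2)

# 64KB 2-way 3-cycle L1 with a ~15-cycle L2 (roughly the K10-style arrangement)
old = amat(l1_hit=3, l1_miss=0.005, l2_hit=15, l2_miss=0.2, beyond_l2=200)
# 16KB 4-way 4-cycle L1 with a ~20-cycle L2 (roughly the BD figures discussed here)
new = amat(l1_hit=4, l1_miss=0.010, l2_hit=20, l2_miss=0.2, beyond_l2=200)

print(f"old-style AMAT: {old:.2f} cycles, BD-style AMAT: {new:.2f} cycles")
# In cycles the newer arrangement is slower; only a correspondingly higher clock
# (or latency hiding) brings the absolute time back down.
```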

On top of that AMD must be confident that they can schedule around L2 latency, otherwise they would have made it smaller and faster.
There are other reasons why they wouldn't or couldn't, some of which depend on the latency and effectiveness of the L3 and uncore.
If the L3 is prohibitively slow, or if whatever xbar is used to communicate with the memory controller, L3, or other cores prioritizes bandwidth over latency, then minimizing L2 misses is worthwhile even if it hurts performance in cases where accesses stay within the L2 or L1.

Pathological cases where you are pointer chasing in a working set that is larger than 16KB and smaller than 64KB will suck on BD compared to Opteron, but that case is just that, pathological.
16KB is pretty small on an absolute scale. It's not tiny like the P4's 8KB data cache, but that cache was uncomfortably small a decade ago. Working sets fortunately don't follow Moore's law, but they do grow slowly as time passes. I'd be curious how close they've come to doubling in a decade.
The vast majority of misses at that size are capacity misses, and that was with old SPEC benchmarks.
 
I don't think a 4-core SB vs a 2 module BD is an apples-to-apples comparison, nor do I think a 4-core SB vs a 4 module BD is, hence my comment above about apples and oranges.

A BD module vs SB core comparison could be reasonable, if they occupy roughly the same area.

Barring that, I guess we'll have to just settle for comparing chips which are close enough in price.
 
Instruction cache misses are a fraction of what a data cache experiences. I was not considering the Icache.

You're right that I$ misses are rare on client workloads, but they are quite common in server workloads.

My point was that you can schedule around D$ misses, while I$ misses effectively add to pipeline latency.

The L1 has 33% longer latency, as does an L2 hit.
The average latency is 33% longer by default, without changing the ratio of L1 and L2 hits.
Nehalem increased D$ latency over Core 2 by 33% with only minor impact. Prescott increased latency from 2 to 4 cycles over Northwood, albeit doubling the D$ at the same time, but again with modest negative impact.

Besides, measuring latency in cycles can be misleading; absolute time (nanoseconds) is preferable, IMO.


There are other reasons why they wouldn't or couldn't, some of which depend on the latency and effectiveness of the L3 and uncore.
If the L3 is prohibitively slow, or if whatever xbar is used to communicate with the memory controller, L3, or other cores prioritizes bandwidth over latency, then minimizing L2 misses is worthwhile even if it hurts performance in cases where accesses stay within the L2 or L1.

When I first saw the 2MB L2 cache figure, my first thought was that they had planned L3-less "duroned" low end desktop versions.

I actually think that the large L2 is a result of a server-centric view. You want to cache the active part of your software stack and a fair chunk of your data. If I'm not completely off, I think Oracle uses around 500-600 KB of code actively. This and the top of the most commonly used indices would fit in 1-2MB, but not in 512KB.

A smaller and faster L2 would probably make more sense for clients.

I can't really see any reason why BD's L3 should be obscenely slow. From the (faked?) die-shot I've seen, the L3 takes up the same area as the sum of the L2s. If they can keep latency below 40 cycles, each core can still schedule around most of the latency.

Cheers
 
Nehalem increased D$ latency over Core 2 by 33% with only minor impact. Prescott increased latency from 2 to 4 cycles over Northwood, albeit doubling the D$ at the same time, but again with modest negative impact.
Nehalem significantly improved its L2 latency in conjunction with slowing down the L1, although it did reduce L2 capacity massively compared to the earlier architectures.
Its L3 latency was still rather slow, but that is much improved with Sandy Bridge.

Besides, measuring latency in cycles can be misleading, absolute time (nano seconds) is preferable, IMO.
AMD has provided no time per cycle to derive wall clock time.
However, it seems optimistic AMD can clock its chip 33% higher to reach equivalent absolute time figures. That would require turbo speeds between 4.6 and 5 GHz. Something in the 4-4.5 range may be feasible. Sandy Bridge at introduction will have chips capable of turbo up to 3.8.
The lower end of BD's turbo range is hardly worth discussing.
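The turbo figures above come from simple scaling; the SB clocks are the launch turbo range assumed here, and nothing official exists for BD:

```python
# If L1/L2 latencies are 33% longer in cycles, matching absolute latency needs a
# clock roughly 33% higher. Scaling an assumed SB launch turbo range of 3.5-3.8 GHz:
for sb_turbo_ghz in (3.5, 3.8):
    print(f"SB {sb_turbo_ghz} GHz -> BD would need ~{sb_turbo_ghz * 1.33:.2f} GHz")
# -> roughly 4.6 to 5 GHz, which is why 4-4.5 GHz looks like the realistic ceiling.
```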

When I first saw the 2MB L2 cache figure, my first thought was that they had planned L3-less "duroned" low end desktop versions.
It's a likely possibility. The L3 in current AMD chips is more of a server-centric means of minimizing off-chip accesses and coherence traffic. If the L2s weren't so small in Phenom, there would have been minimal performance impact in losing the L3, and possibly a gain if AMD were able to strip the L3-check step out of the L2 miss process.

I actually think that the large L2 is a result of a server-centric view. You want to cache the active part of your software stack and a fair chunk of your data. If I'm not completely off, I think Oracle uses around 500-600 KB of code actively. This and the top of the most commonly used indices would fit in 1-2MB, but not in 512KB.
That could work if the capacity were in L2 cache or in a large L3, if it wasn't too slow. It obviously works for Intel.

A smaller and faster L2 would probably make more sense for clients.
Large size couldn't hurt, but faster definitely would help.
The large and slow L2 may be an awkward compromise AMD has made between the client and server markets. The server workloads would like generous capacity, either in the L2 or L3. The client market would like lower latency, but the best AMD could do is add a fair amount of extra L2 to lower the miss rate and reduce traffic to the slower parts of the hierarchy.

I can't really see any reason why BD's L3 should be obscenely slow. From the (faked?) die-shot I've seen, the L3 takes up the same area as the sum of the L2s. If they can keep latency below 40 cycles, each core can still schedule around most of the latency.
The cores are gunning for high clock rates. Getting respectable cycle numbers for the L3 and uncore would be harder, since even the nearer L2 got slower in cycle terms.
If the possibly hacked up die shot is accurate, it may be that AMD has subdivided the L3 into local tiles, which may salvage some of its latency numbers.

SB's L3 latency sounds very good, in part because of the subdivision of the cache.

Part of the unknown is how AMD is linking its modules and L3(s?). The shared L2 and the capping of the number of modules to 4 sounds like it may still be using a crossbar arrangement, which offers theoretically more uniform latency but puts an upper bound on scalability.
There were complaints about the crossbar's limited bandwidth in Phenom, where higher speeds of RAM yielded markedly reduced benefits because of that bottleneck.
 
Regarding the run-ahead stuff, I read that as run-ahead to refill the queues/caches ASAP after a branch mis-predict/pipeline flush.

BD has a deeper pipeline than they've previously had so a mis-predict/flush would have a larger penalty & they would want to do what they can to minimise that.
Better predictors is an obvious thing to do but after the inevitable miss, refilling the queues & caches quickly seems like a good idea too.
 
3dilettante, you really seem to know your stuff, but it seems you're coming from the point of view of looking for reasons that BD won't be able to compete. How about putting the glass-is-half-full cap on: from that perspective, and considering the information that hasn't been released yet, what do you think BD would have to do to equal or exceed SB performance?

John Fruehe has stressed several times that there is some stuff in the BD design that has not been disclosed yet, specifically designed around single-thread performance. How about, given what you know of the design so far, taking a guess at what it might be?

cheers
 
16KB is pretty small on an absolute scale. It's not tiny like the P4's 8KB data cache, but that cache was uncomfortably small a decade ago. Working sets fortunately don't follow Moore's law, but they do grow slowly as time passes. I'd be curious how close they've come to doubling in a decade.

If you are running SB with hyperthreading on, isn't the effective size of the L1 16 KB?
 
AMD has provided no time per cycle to derive wall clock time.
However, it seems optimistic AMD can clock its chip 33% higher to reach equivalent absolute time figures. That would require turbo speeds between 4.6 and 5 GHz. Something in the 4-4.5 range may be feasible. Sandy Bridge at introduction will have chips capable of turbo up to 3.8.
The lower end of BD's turbo range is hardly worth discussing.

Mitch Alsup mentioned on Usenet that AMD were gunning for 12 FO4 inverter delays per pipe stage, vs 16/17 for K8; this was before he left AMD. Add 5 inverter delays for latches and jitter and you end up with 17 FO4s vs 21/22 FO4s, or around 80%, which gives an upper bound of roughly a 25% frequency increase.
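Working that through explicitly (these are recollected targets, not announced numbers):

```python
# Cycle time in FO4 delays = logic depth per stage + latch/jitter overhead.
LATCH_JITTER_FO4 = 5
bd_cycle_fo4 = 12 + LATCH_JITTER_FO4   # 17 FO4, the target Mitch Alsup mentioned
k8_cycle_fo4 = 16 + LATCH_JITTER_FO4   # 21 FO4, using the lower K8 figure

ratio = bd_cycle_fo4 / k8_cycle_fo4
print(f"BD cycle time ~{ratio:.0%} of K8's -> up to ~{1 / ratio - 1:.0%} more frequency")
# -> ~81% of the cycle time, i.e. an upper bound of roughly 25% higher clock,
#    assuming the design actually closes timing at 12 FO4 per stage.
```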

Now, since AMD didn't mention anything about operating frequency in their BD press material, I'm guessing present silicon is well short of that.

Cheers
 