AMD Bulldozer Core Patent Diagrams

Haswell should bring an L1 that can support AVX without the bandwidth constraint experienced with SB and IB. That's on top of promoting integer SIMD to 256 bits, adding gather, and adding FMA.
Steamroller promises to make its FPU a little smaller.
For concurrent workloads, the lock elision and transactional memory support provide an opportunity for Haswell to scale its performance in heavily multithreaded integer workloads.
AMD, thus far, has not promised much improvement to its cache or memory architecture.

The transactional memory instructions are a big change, but these instructions could probably be added to cores without affecting the portions of the pipeline aimed at single threaded performance improvements. They could affect how Hyperthreading is implemented, and I eagerly await more details.
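For a sense of how code might use the transactional memory instructions, here is a minimal lock-elision sketch, assuming the RTM intrinsics exposed through immintrin.h and a -mrtm capable compiler; the lock, function names, and retry policy are my own invention for illustration, not anything confirmed about Haswell's implementation.

```
/* Hedged sketch: elide a spinlock with RTM, falling back to the real lock. */
#include <immintrin.h>   /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */

static volatile int fallback_lock = 0;      /* hypothetical lock: 0 free, 1 held */

static void spin_lock(volatile int *l)
{
    while (__sync_lock_test_and_set(l, 1))  /* GCC builtin, acquire semantics */
        while (*l)
            ;                               /* spin until the lock looks free */
}

static void spin_unlock(volatile int *l)
{
    __sync_lock_release(l);                 /* GCC builtin, release semantics */
}

void counter_add(long *counter, long n)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (fallback_lock)                  /* lock is now in our read-set...     */
            _xabort(0xff);                  /* ...so a real lock holder aborts us */
        *counter += n;                      /* speculative update, no lock traffic */
        _xend();                            /* commit; appears atomic to others   */
        return;
    }
    /* Transaction aborted or RTM unavailable: take the real lock. */
    spin_lock(&fallback_lock);
    *counter += n;
    spin_unlock(&fallback_lock);
}
```

The interesting part for multithreaded scaling is that, as long as nothing actually conflicts, no thread ever writes the lock line, so the threads don't serialize on it.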

For Haswell, the new FMAC and the associated framework to keep it fed could be a huge boost. AMD does seem to be rebalancing its CPU cores toward relatively more integer performance versus floating point, but it's not fabbing a massive GPU onto the same die for no reason. I think they want people to make use of the GPU when really heavy streams of floating-point calculations arise; the existing FPUs are sufficient to address legacy code and any sporadic floating point math that might come up.
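As a concrete illustration of what the FMAC buys over separate multiply and add, here is a hedged sketch using the _mm256_fmadd_ps intrinsic; the function and array names are mine, and the loop assumes n is a multiple of 8 just to keep it short.

```
/* Hedged sketch: 8-wide fused multiply-add, built with something like
   gcc -mavx -mfma -O2 (exact flags may vary by toolchain). */
#include <immintrin.h>

/* y[i] += a[i] * x[i]; n assumed to be a multiple of 8 for brevity. */
void axpy_fma(float *y, const float *a, const float *x, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_fmadd_ps(va, vx, vy);   /* one fused op instead of mul + add */
        _mm256_storeu_ps(y + i, vy);
    }
}
```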

The increase in L1 Icache size is a change whose magnitude is not yet given. I'm not sure how many workloads were complaining about the 64KB on Bulldozer, however.
Doubling the decoders sounds like an increase in the width of the instruction prefetch, or in the number of prefetches in flight, is in order.
The problem with expanding the L1 size is that the aliasing problem would worsen.

Idle thought:
They could increase the associativity and cache block size to chip away at the index bits.
Perhaps split the L1 internally with a pseudo-associative cache? That would be two 32-64KB halves/banks, each 4-way with 128-byte blocks. The higher associativity and smaller banks would take down 2 bits of aliasing, and then a rule that synonyms differing in the last bit are split between the halves would cover the rest.
It sounds complicated, but I'm not certain a 128KB 64-byte line 2-way cache is going to do much for this architecture.
The decoupled prediction and fetch logic would help buffer the hiccups inherent to having larger lines and the variability of having a pseudo-associative cache.
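To put rough numbers on the index-bit arithmetic, here is a back-of-the-envelope helper of my own (nothing from AMD); note that only capacity-per-way really moves the alias-bit count, since larger lines just shift bits from index to offset.

```
/* Hedged sketch: count the virtually-indexed bits that spill past a 4KB
   page offset for a given L1 configuration - the synonym/aliasing bits. */
#include <stdio.h>

static int log2i(unsigned x) { int n = 0; while (x >>= 1) n++; return n; }

static int alias_bits(unsigned size_bytes, unsigned ways)
{
    int index_plus_offset = log2i(size_bytes / ways);  /* bits used pre-translation */
    int page_offset_bits  = 12;                        /* 4KB pages */
    int spill = index_plus_offset - page_offset_bits;
    return spill > 0 ? spill : 0;
}

int main(void)
{
    printf("BD 64KB, 2-way   : %d alias bits\n", alias_bits(64 * 1024, 2));
    printf("64KB bank, 4-way : %d alias bits\n", alias_bits(64 * 1024, 4));
    printf("128KB, 2-way     : %d alias bits\n", alias_bits(128 * 1024, 2));
    return 0;
}
```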

The plus side to having two decoders, if they are fed, is that Steamroller might be able to brute force through one known soft spot for Sandy Bridge, where its fetch bandwidth becomes constraining if the uop cache misses. The core that's consuming this instruction bandwidth really isn't that big a bruiser, though.

The loop buffers and expanded predictor for Steamroller take on significance because it sounds like the front end of the pipeline is going to be longer no matter what, and anything that keeps the rest of the core from feeling it is an improvement. Sandy Bridge probably has a decently long pipeline, part of which is mitigated by the uop cache. The uop cache was hard to engineer, so that probably means AMD can't manage it yet. Loop buffers in the tradition of Nehalem would be consistent with AMD's multiyear lag behind the leader.

It does sound like AMD is steering away slightly from its cluster-based approach, mainly to give each core more single-threaded performance. Giving a decoded-uop cache to each core could take a significant chunk of silicon per core to implement, and sounds like it would naturally lead to splitting the decoder in two to better service each core's uop cache individually.

Separating the usual ICache into 2 higher-associativity but smaller pieces seems like it might entail more complication in the Ifetcher, or even a split, which might be going far enough to defeat the point of the cluster-based approach. It also probably does less for single-threaded performance and power savings than a uop cache, since the latter's contents are much "closer," pipeline-wise, to the execution units than the Icache's. From Agner Fog's tests, having only 2-way associativity in its ICache hurts BD, especially since it's servicing two threads; it sounds like addressing the poor associativity would be a better first step than splitting it up. Whatever AMD did, reducing L1 misses by 30% is a lot...
 
Doubling the decoders doesn't reduce fetch stalls; in fact, the opposite problem occurs, since the demand on fetch is now higher. The post-decode buffer, on the other hand, does decrease demand.
The proviso about feeding the decoders is basically a "what-if" where AMD doubles down on the decode duplication and gives them more raw bandwidth than what is currently available.
The single-threaded case wouldn't change as much, but the dual threaded case would give an advantage relative to an SB that misses the uop cache.
It sounds brute-force and rather aggressive, so I may have been charitable for entertaining the thought.

However, if the decoders are truly duplicated without degrading them, then the aggregate throughput in the case of doubles and microcode would be significantly better, since BD can't do two doubles at once and microcode blocks the front end for the other thread.
 
From a time-to-market and therefore effort perspective, wouldn't reusing the existing decode be quicker than designing a "new" 2-wide decoder? Would a 4-wide decoder make any meaningful difference to "regular consumer code"?
 
My guess is that the new split decoders might have something to do w/ the uop cache that was mentioned. The caches are meant to improve single-threaded performance by caching the results of the decode stage rather than its input, and their contents are tied to each core separately, so I'm guessing they saw a benefit to separating the decoders when implementing it.

The total lack of any mention also implies that there isn't going to be any AVX2 support until Excavator. It does seem like they revised BD to support AVX at the last minute and it wasn't an ideal implementation; hopefully their implementation of gather in AVX2 is up to snuff. Wild guess, but maybe they'll alias some of the GPU's units for the CPU's FPU needs eventually; not sure if that's realistic or how the context switching would work...
 
From a time-to-market and therefore effort perspective, wouldn't reusing the existing decode be quicker than designing a "new" 2-wide decoder? Would a 4-wide decoder make any meaningful difference to "regular consumer code"?

They are not going to use a 2-wide decoder; it would decrease single-thread performance too much, and would usually not be any better than the single 4-wide decoder for 2 threads.

So the decoders will be either 3- or 4-wide; 3 might be the best compromise.
 
Doubling the decoders doesn't reduce fetch stalls; in fact, the opposite problem occurs, since the demand on fetch is now higher. The post-decode buffer, on the other hand, does decrease demand.

The demand is the number of instructions that need to be fetched to execute the program; doubling the decoders doesn't really increase it. The decoders are not asking for more instructions. The fetcher feeds the decoders the code stream, and they will decode what they get.

Better branch prediction will also decrease demand, as fewer instructions will be fetched.

With a single decoder, if the buffer between ifetch and decode got full, ifetch had to wait/stall.
And when the buffer held quite a few instructions and there was a branch prediction miss, all those instructions for the thread had to be flushed out.

With Steamroller there are on average fewer instructions waiting to be decoded, so when a branch prediction miss occurs, fewer instructions are flushed. So with more decoders, the total number of instructions that has to be fetched is actually lower.
 
The demand is the number of instructions that need to be fetched to execute the program; doubling the decoders doesn't really increase it. The decoders are not asking for more instructions. The fetcher feeds the decoders the code stream, and they will decode what they get.

This is just a matter of semantics.. nothing increases the amount of instructions that "need" to be fetched, but increasing decoder width (or increasing execution width, and so on) can increase the amount of fetch bandwidth it could utilize. In other words, it could move a bottleneck onto the fetch units more often.

But no, I didn't mean that more fetch bandwidth would be needed to get the same amount of work done, if that's what you thought I was saying.

Better branch prediction will also decrease demand, as fewer instructions will be fetched.

But the cost of a branch mispredict is pretty much uniform across the pipeline, so the relative demand stays the same..

With a single decoder, if the buffer between ifetch and decode got full, ifetch had to wait/stall.

Well yeah. And when the execution units can't find enough to execute from its OoO window the decode units stall, and so on. There's no question that more decoder width is better if the bandwidth wasn't good enough (that's kind of a trivial statement), the question is how often was there not enough bandwidth. But I don't think anyone is really questioning that in the dual-threaded scenario the decoders were a bottleneck some of the time.

And when the buffer held quite a few instructions and there was a branch prediction miss, all those instructions for the thread had to be flushed out.

With Steamroller there are on average fewer instructions waiting to be decoded, so when a branch prediction miss occurs, fewer instructions are flushed. So with more decoders, the total number of instructions that has to be fetched is actually lower.

It doesn't matter if a fetched instruction was waiting to be decoded or if it was further along in the pipeline.. a branch misprediction must flush all instructions that were ever fetched after that branch, regardless of whether or not they've been decoded. So the branch misprediction penalty is not reduced by having more decode/execution/whatever resources. And you still have to fetch the same amount to get back to an equivalent amount of work done.

If you want to look at it as energy wasted instead of time it probably wastes less if the data never got to leave the fetch buffers.

You could say that wider decoders mean that fetch buffers don't need to be as large, but since there are already separate ones for each thread, you'd lose in the single-threaded case by making them smaller, since the single-threaded decode bandwidth isn't increasing. You could say the same thing for the post-decode buffer, depending on how robust it is - if it were a real cache it'd be worth relying on, but a loop buffer either works or doesn't, and pretty easily becomes completely useless if not executing a small enough loop.. so I don't know if AMD will want to rely on it to guarantee a performance baseline.
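To illustrate that all-or-nothing behaviour, a toy model (the capacity and names are invented for illustration, with no relation to AMD's or Intel's actual structures):

```
/* Hedged toy model: a loop buffer only pays off if the whole loop body fits;
   one uop too many and the front end is back to fetching and decoding every
   iteration. A uop cache degrades far more gracefully. */
#include <stdbool.h>

#define LOOP_BUFFER_UOPS 28   /* made-up capacity */

/* True if fetch and decode can be shut off while this loop runs. */
static bool loop_buffer_hit(int loop_body_uops, bool backward_branch_detected)
{
    return backward_branch_detected && loop_body_uops <= LOOP_BUFFER_UOPS;
}
```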
 
There are quite a few sharp minds here, so I will dare a few questions.
AMD went for CMT on the premise that it would deliver 80% of the performance of 2 cores for 50% of the cost. I haven't made measurements (and comparing a Bulldozer module to previous AMD cores may not be an optimal comparison), but looking at both a Trinity die and then at Llano seems to tell another story.

AMD might improve their modules' performance with Steamroller and then Excavator, but I would not bet on a significant reduction in the size of the module (their high-density libraries may do that, but that also applies to their other processor lines).

So what do you think of CMT?
The premise was +80% performance for 50% more silicon, and it looks like what AMD will pull off is +100% performance but a 100% increase in silicon (which makes it a bit moot).
Do you think the approach is a failure? If so, would you expect them to abandon it for their next brand-new architecture?

Do you think that if CMT were pushed further it could actually get closer to its premise? By "pushed further" I mean a module consisting of more than 2 cores.
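(For reference, the premise boils down to throughput per unit area; a quick back-of-the-envelope using only the numbers quoted in this thread, not measurements:)

```
/* Hedged back-of-the-envelope: perf per unit area for the CMT premise versus
   the outcome described above. Inputs are the thread's numbers, not data. */
#include <stdio.h>

int main(void)
{
    double premise_perf = 1.8, premise_area = 1.5;   /* +80% perf,  +50% area  */
    double outcome_perf = 2.0, outcome_area = 2.0;   /* +100% perf, +100% area */

    printf("premise: %.2f perf/area\n", premise_perf / premise_area);  /* 1.20 */
    printf("outcome: %.2f perf/area\n", outcome_perf / outcome_area);  /* 1.00,
                                           no better than two full cores */
    return 0;
}
```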
 
AMD went for CMT on the premise that it would deliver 80% of the performance of 2 cores for 50% of the cost. I haven't made measurements (and comparing a Bulldozer module to previous AMD cores may not be an optimal comparison), but looking at both a Trinity die and then at Llano seems to tell another story.

This comparison would make sense if there existed a full, old-school, BD based dual-core, with no shared elements. Versus K8L they added quite a few things, so it's not necessarily surprising that the module looks fat compared to that.
 
This comparison would make sense if there existed a full, old-school, BD based dual-core, with no shared elements. Versus K8L they added quite a few things, so it's not necessarily surprising that the module looks fat compared to that.
Agreed.
I made some quick (and gross) measurements of Star cores and Piledriver modules.
I found that a PD module is ~93% of the area two Star cores would cover. That's without L2; the L2 sizes are a wash. By eye, I would say there is a bit less "glue" between a 2-module part and a 4-core part, which may push the advantage further in favor of BD/PD.

It isn't that bad; in power-constrained environments at least, Trinity offers anywhere between 110% and 120% of the performance of Llano. There are cases where Llano wins, but also cases where Trinity wins by a greater margin.

So let's say CMT is a good idea: do you think that after Excavator AMD could go further and increase the number of cores within a module (like having 4 cores in a module)?
Or are there constraints on how far they can scale the front end?
 
AMD already refers to the new Jaguar quad-core architecture as a "Compute Unit", together with a shared L2. If they push Jaguar a bit beyond the pure mobile concept, it would fit very well in a server/WS envelope, with some sort of scalable interconnect and cache/memory infrastructure.
 
An interesting hearsay update about AMD's CPU development:

VR-Zone Article

The rumored gains are nice, but AMD's main problem has always been execution. Keller being back in charge is a very hopeful change though.
 
Low-Performance of AMD Microprocessors May Be Conditioned by… Poor BIOS.



:???:
 
The content of the article is reasonable (if you ignore the assertion that there could be other magic fixes for relevant performance issues) but the title is a total farce.
 
Does anyone have any idea what this NRAC thing might be? My guess would be a workaround for some x87-specific bug.
 
This is a nice hack but the gains are in a benchmark from the mid nineties.
Anything less than 10 years old is probably able to use SSE2 or better, and x87 is essentially unused by 64-bit applications, which default to SSE scalar math. This is good if you had a performance problem in Quake 1 and aren't bothered by any potential bug - the article doesn't explain much of the why, but says the gain goes away with multi-threading or perhaps multi-tasking.

It's largely irrelevant; it's only slightly more useful than unlocking better 80286-mode performance. You would have to do number crunching all day in a single legacy, single-threaded program you can't recompile or update.
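As a concrete aside on the x87-versus-SSE2 point: on a 32-bit build the choice is usually just a compiler flag away (assuming a GCC-style toolchain), which is why a fix that only helps x87 code has such a narrow audience.

```
/* Hedged example: the same scalar code compiled two ways.
   32-bit x87 build : gcc -m32 -mfpmath=387 -O2 muladd.c
   32-bit SSE2 build: gcc -m32 -msse2 -mfpmath=sse -O2 muladd.c
   64-bit builds default to SSE scalar math, so x87 rarely appears there. */
double mul_add(double a, double b, double c)
{
    return a * b + c;   /* fmul/fadd with x87, mulsd/addsd with SSE2 */
}
```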
 
Best is to go to the source and check the XS thread about it. The original Bulldozer was affected by this and hangs after applying the patch. Later iterations (Piledriver, Richland) work fine and show a speedup in SuperPi (so far only there, so it might be specific to the instruction mix used in it).
 
Some purported early benchmarks of Steamroller:

http://www.chinadiy.com.cn/html/21/n-11921.html

A 34% improvement in integer IPC is nothing to sneeze at if true and could put it neck and neck w/ Haswell. The FPU takes a step back, and I seem to remember reading somewhere that Steamroller was going to pare down redundancies in its FPU somewhat, so this drop is consistent with what we've heard.

EDIT: It was this article:

http://techreport.com/review/23485/amd-cto-reveals-first-steamroller-details

We're unsure what the floating-point "rebalance" is all about. Currently, Bulldozer's floating-point performance is relatively weak, in part because a single FPU is shared between two integer cores. Streamlining the FPU's execution hardware might save power, as is being claimed here, but we worry about performance. If "adjust to application trends" means hardware better suited to common workloads, then fine. If it means "gut the FPU and rely on the graphics processor to do floating-point math," well, that's less promising. We'll have to get more specifics about what's happening here. Update: AMD tells us it's not reducing the execution capabilities of the FP units at all. They've simply identified some redundancies (for instance, in the MMX units) and will be re-using some hardware in order "to save power and area, with no performance impact." So no worries, there, it seems.

Of course PR would say that. ;) Steamroller is a much more APU-oriented CPU. With consumers using GPU functionality more and more for their needs, a zippy CPU-based FPU isn't so important for the masses. They weren't in a hurry to fix BD's L3 latency issues either, as mentioned in the Anandtech article, since presumably most of their consumer parts won't have one.

I guess scientists doing simulations will learn to make do with GPUs for their floating point needs, but they are much less predictable and clean than using a CPU's instruction set and unified memory. AMD's memory unification starting with Kaveri is a nice start for properly unleashing the GPU.
 
L3 latency isn't the issue - it's not great, but it's still way faster than main memory. It's L3 throughput (especially write) that's the issue. The L2s are large and the L3 is an eviction cache, so things are fetched into L1 and L2. But the L3 throughput...

http://www.vmodtech.com/main/wp-con...2133c9d-16gxh-with-amd-fx-8350/aida64-mem.jpg
http://cdn.overclock.net/e/e4/350x700px-LL-e4eb580f_cachememtest.png


What I want to know is whether the L1 is "broken" - a 6:1 ratio of read to write bandwidth seems kind of pointless to me.

Edit: here is my FX-8350... it's running ESXi and was doing a fair bit of other work while the benchmark was going...

(cachemem.png - screenshot of the cache/memory benchmark results)
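If anyone wants to poke at the read-versus-write gap without AIDA, here is a crude single-threaded streaming sketch of my own; the buffer size and method are arbitrary choices, and a serious measurement would need non-temporal stores, multiple threads, and more care.

```
/* Hedged micro-benchmark sketch: rough streaming read vs write bandwidth.
   Not a substitute for AIDA; shrink BUF_BYTES to probe the caches instead. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_BYTES (64u * 1024u * 1024u)   /* big enough to spill past L3 */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    size_t n = BUF_BYTES / sizeof(long);
    long *buf = malloc(BUF_BYTES);
    long sum = 0;
    if (!buf) return 1;
    memset(buf, 1, BUF_BYTES);                       /* touch every page first */

    double t0 = seconds();
    for (size_t i = 0; i < n; i++) sum += buf[i];    /* streaming reads */
    double t1 = seconds();
    for (size_t i = 0; i < n; i++) buf[i] = (long)i; /* streaming writes */
    double t2 = seconds();

    printf("read : %.1f GB/s (checksum %ld)\n", BUF_BYTES / (t1 - t0) / 1e9, sum);
    printf("write: %.1f GB/s\n", BUF_BYTES / (t2 - t1) / 1e9);
    free(buf);
    return 0;
}
```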
 
Try the latest version of AIDA (3.0). The memory benchmark suite is now fully multi-threaded and latency readings are much more accurate.
 