AMD Bulldozer Core Patent Diagrams

I don't think it's so bad.
Clock Mesh gave them a free +10%, and it seems they mainly improved BD's single-core IPC. The BD architecture gains more overall from single-core IPC improvements than Intel does, since you have (sort of) two cores.
So, for a one-year gap, a +20% increase in performance is pretty good.

Also, it is very interesting that they haven't yet fixed the huge front-end problem: Intel has a 32K 8-way L1I whereas AMD still has a pathetic 64K 2-way... and AMD cores are far hungrier since they cannot cover latencies with HT (not to mention the AMD decoder, of course).
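
To make the associativity point concrete, here is a minimal sketch of a set-associative cache with LRU replacement. It assumes 64-byte lines and a made-up access pattern of three code lines that alias to the same set; the function name and the pattern are hypothetical, and this is a toy model rather than a description of either vendor's actual replacement policy.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed cache line size for both designs


def count_misses(size_bytes, ways, addresses, line_bytes=LINE_BYTES):
    """Count misses for an address stream in a set-associative LRU cache."""
    num_sets = size_bytes // (ways * line_bytes)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        cache_set = sets[line % num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)  # mark as most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[line] = True
    return misses


# Hypothetical pattern: three hot lines 32 KiB apart, touched in a loop.
pattern = [0, 32 * 1024, 64 * 1024] * 100

misses_64k_2way = count_misses(64 * 1024, 2, pattern)
misses_32k_8way = count_misses(32 * 1024, 8, pattern)
print(misses_64k_2way, misses_32k_8way)
```

In this toy pattern the 2-way cache misses on every access because three lines fight over two ways, while the smaller 8-way cache only takes the three cold misses. Two threads sharing the same 2-way L1I make this kind of conflict more likely.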

But I am curious to get steamroller info, where they *should* fix frontend issues.

edit: just a note on schedulers - the Win7 scheduler kills AMD CMT since by default it reschedules threads without processor affinity, thus trashing the L2 cache. On Intel this is not a problem since the L2 cache is just 128KB and the L3 cache can keep everything up. On AMD you get a lot of problems because you are basically trashing 2MB of cache instead of 128KB..
Also, I do not believe MS rewrote the W7 scheduler. One would be mad to touch a critical working component with such huge changes on the fly. So it is likely that W8 will have a better scheduler for AMD.
 
Did you mean 256K L2 for Intel? Or are you implying it's split into two halves, one for each thread? (It isn't, I'd think, but I've never thought about how two threads, or more on other archs, would share the L1 or L2 cache.)
 
edit: just a note on schedulers - the Win7 scheduler kills AMD CMT since by default it reschedules threads without processor affinity, thus trashing the L2 cache. On Intel this is not a problem since the L2 cache is just 128KB and the L3 cache can keep everything up. On AMD you get a lot of problems because you are basically trashing 2MB of cache instead of 128KB..
Also, I do not believe MS rewrote the W7 scheduler. One would be mad to touch a critical working component with such huge changes on the fly. So it is likely that W8 will have a better scheduler for AMD.

No. It was already covered just above you; there is no extra special sauce hiding in W8. If you want to see how much it doesn't matter, load up both types of processors with as many threads as they can simultaneously take -- the AMD loses by a mountain, and that's nothing that a scheduler can 'fix'.

As for trashing their cache? Yes, but that's because they've yet to spend the R&D to get the associativity of their front end beyond 2-way, which directly contradicts how they want certain parts of these cores to be shared... This isn't a scheduling issue, it's a design issue.

Edit: Let's put this to rest, can we? PC Stats tested the exact same rig, powered by an FX-8150, on both Win7 and Win8. Here is their basic summary:
For the most part Windows 8 saw a 1%-5% improvement with the AMD FX-8150 processor... <snip> Even with these 6-of-1, half a dozen of the other benchmark results between Windows 8 and Windows 7, on the whole the AMD FX-8150 processor was slightly faster under the Windows 8 Developer environment. Not at all what PCSTATS expected, but a welcome result none the less for AMD.

Also, keep in mind that this was before Microsoft made the changes to the W7 scheduler that corrected nearly all of the affinity questions. One or two of the tests came out pretty far ahead, but those instances were not always in Windows 8's favor either. The performance gap that is left is negligible, at best...
 
I don't know how MS would have got that same 1%-5% improvement with the Windows 7 patch as with Windows 8, if not by rewriting or tweaking the scheduler. And the scheduler variant could be chosen based on the CPUID.
I believe that on Linux, you can recompile your kernel to choose a different scheduler (perhaps the older one which was displaced by a newer, mostly more efficient one). Real-time Linux is a well-known case of a different scheduler.

Microsoft can patch the kernel; otherwise they'd be toast if there were any security vulnerability in it.
They have so many billions of dollars lying around, too. If neckbeard hippies can do it, they can do it, even if < 0.2% of their users care (users who might push for Windows Server 2012 instead of 2008, to make slightly better use of Opterons, and AMD fanbois).
 
I'm having a hard time understanding your post; are you asking if MS really changed their kernel scheduler? Or what?

MS did indeed make an updated Win7 / Server 2008r2 NTOSKRNL specifically to address the scheduling woes of the AMD Bulldozer line. Here it is: http://support.microsoft.com/kb/2645594

They also made a second hotfix to disable core parking on AMD Bulldozer cores which can cause some headaches when the OS tries to halt one half of a core while the other half might still be working on something: http://support.microsoft.com/kb/2646060

So, yes, Microsoft reworked the scheduler and the power management (via the above manually installed hotfixes, as neither comes automatically) to bring Win7 up to par with the scheduling capabilities of Win8. Win8's scheduler is slightly better, in that you don't have to manually select the option -- it can detect and deploy automatically for Bulldozer.
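
As a rough illustration of what a module-aware placement policy could look like, here is a minimal sketch. It assumes that cores 2m and 2m+1 share module m and that the scheduler prefers an idle module over the second core of a busy one; the function names and the numbering are hypothetical and are not taken from the hotfix itself.

```python
def placement_order(num_modules, cores_per_module=2):
    """Core ids in the order a module-aware policy would fill them.

    Assumed numbering: module m owns cores m*cores_per_module + slot.
    """
    order = []
    for slot in range(cores_per_module):
        for module in range(num_modules):
            order.append(module * cores_per_module + slot)
    return order


def place_threads(num_threads, num_modules, cores_per_module=2):
    """Pick cores for num_threads runnable threads, one per module first."""
    return placement_order(num_modules, cores_per_module)[:num_threads]


# An FX-8150 style layout: 4 modules, 2 cores each.
four_threads = place_threads(4, num_modules=4)
six_threads = place_threads(6, num_modules=4)
print(four_threads, six_threads)
```

With four threads, each one gets a module (and its shared front end, FPU and L2) to itself; only from the fifth thread on do two threads have to share a module.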
 
Oh, I was just wondering what kind of scheduler changes MS had added.
The BD architecture's problem with the scheduler was due not only to the 'mask' used to allocate threads to the cores, but also to the way threads are scheduled and re-scheduled on the same core, in order to (possibly) preserve L2 coherency.

Let me explain what I mean: Intel uses a small L2 cache and a huge L3, so it matters little on which core you re-schedule your thread. BD has 2MB of L2, whose content is potentially trash once you move the thread. So I was curious to know how MS dealt with this.

@Blazkowicz: of course - large parts of the kernel are written in C. Also, MS can even hotpatch in some cases - that is, patch a running element. About MS spending money to add features to the kernel... you can't just 'add things' to the kernel or rewrite, on the fly, stuff that executes every time under 1000000000 different conditions. You're not developing a media player that can crash, you're developing a kernel that should not crash. Hence patching is one thing, rewriting is another.
 
Let me explain what I mean: Intel uses a small L2 cache and a huge L3, so it matters little on which core you re-schedule your thread. BD has 2MB of L2, whose content is potentially trash once you move the thread. So I was curious to know how MS dealt with this.

Intel also has super fast core-to-core cache data transfer. For AMD, if a thread is scheduled to a different module, all those cache misses have to go through the uncore.

Cheers
 
Also keep in mind that part of the "fix" was to disable core parking, which basically means they're going to be at a power deficit when compared to the Intel side of the house.

Really, this isn't a Microsoft issue, it's an AMD design issue. The Bulldozer platform isn't suddenly better on Linux, either...
 
I often see "parked" against Bulldozer cores under W8, if the system is "near idle".

It works fine if the system is very very idle, but under light workloads (non-idle, but not full load by any stretch) the OS cannot park 'adjacent' cores. You leave 'em running, or you leave 'em both in standstill. You can't do one of each.

Or said another way, you can only park cores in sets of two on the Bulldozer platform -- if you want performance to be good.
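
The pairing constraint can be sketched in a few lines. This is a toy model, assuming cores 2m and 2m+1 belong to module m and that a module is only parked when every core in it is idle; the function name and core numbering are hypothetical.

```python
def parkable_cores(busy_cores, num_modules, cores_per_module=2):
    """Return the cores that can be parked when parking is per-module."""
    busy = set(busy_cores)
    parkable = []
    for module in range(num_modules):
        cores = [module * cores_per_module + i for i in range(cores_per_module)]
        # A module is parked only if none of its cores has work.
        if not any(core in busy for core in cores):
            parkable.extend(cores)
    return parkable


# A 6-core (3-module) part with light work on cores 0 and 3.
light_load = parkable_cores(busy_cores=[0, 3], num_modules=3)
print(light_load)
```

Two busy threads that land in different modules keep four cores awake here, so under this model the parked count is always even and depends on where the scheduler put the work.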
 
Right this moment WMP is playing music, Task Manager and Resource Monitor are showing, I'm editing this message in IE, Firefox is showing FB and there's 90 processes in total (1600 threads), resulting in about 7% cpu usage and from 0 to 4 cores (in pairs) parked - average 2 (on a 6 core processor). Processor clock is ~1600MHz.

Is that "very very idle"?
 
Could you perhaps park three cores? Or five? Instead of two or four? That was the point I was attempting to make. The power savings could be better if you could park them in odd numbers, but it really doesn't work that way.
 
But how much power saving are you really buying if you could do that? Most of the power-hungry stuff is shared (that's the whole point, right?); what's left per core is really just the L1D, register files, ALUs/AGLUs and load/store buses.
 
Maybe? The unfortunate part here is that AMD chose for pairs of physical cores to share units, rather than having their own. That requires that physical cores must be stopped in pairs, which reduces the granularity in which you could potentially be saving power.

You could attempt to make a similar argument with Intel; shutting down a physical core will deprive you of potentially two logical cores. But in that instance, the second logical core really doesn't "exist", it's just a superset of the physical unit. You really can't disconnect the two, as they share basically ALL of the logic rather than just a specific part.
 
Nope. A BD core also "doesn't exist", in this view. It's just a bunch of ALUs, registers and some data paths. An Intel HT core is just that, minus the ALUs (which btw don't eat that much power). We don't really know (IMHO) the granularity at which Intel or AMD architectures can power down unused logic and registers. So I would qualify this line of thinking as being speculative, at best.
 
The unfortunate part here is that AMD chose for pairs of physical cores to share units, rather than having their own.
Nothing unfortunate about that at all. The first version of the architecture is pretty ropey in places, there's no doubt about that, but the concept seems sound.
 
Hopefully they make the modules wider for both int and FP, then add SMT to each core. Especially when we start talking about really wide vectors, this seems like a really good use of power-hungry resources. It also seems to me that things like their cache design suit this kind of throughput-over-absolute-minimum-latency design, and multithreaded application design seems to lean that way as well.

The added side effect of this, over just having more cores, is that lightly threaded applications benefit from the work as well.

I will likely be disappointed but one can always hope :D
 
Nope. A BD core also "doesn't exist", in this view. It's just a bunch of ALUs, registers and some data paths. An Intel HT core is just that, minus the ALUs (which btw don't eat that much power). We don't really know (IMHO) the granularity at which Intel or AMD architectures can power down unused logic and registers. So I would qualify this line of thinking as being speculative, at best.

It's not just a bunch of ALUs, registers, and data paths; it's also a dedicated and pretty wide OoO scheduler and dual load/store units with a fair bit of reordering/forwarding capacity, plus the L1 DTLB and dcache... surely these things don't use a negligible amount of power. Maybe not a dominating portion compared to the shared front end and FPU, but significant nonetheless.

I never got the impression that there's fine grained power gating to anything inside a module, just the entire module itself. Other stuff can be clock gated though.
 