AMD Bulldozer Core Patent Diagrams

Though apparently in this form it was a big failure. Anyway, I'm not fully convinced the uop cache is that much better than the trace cache was (though it should indeed have lower overhead). The Pentium 4's problem with the trace cache wasn't so much the trace cache itself as that it relied too much on the code being in the trace cache, because the decoder was slow as molasses.
The uop cache is definitely simpler, and it doesn't have a bunch of the glass jaws the trace cache introduced. The mapping to the L1 is straightforward, its contents aren't dependent on the sequence of execution nor subject to duplication, and various events that flushed the entire trace cache on the P4 don't do the same to the uop cache.

That this could be snuck in as a parallel process to the decoder block probably took some engineering work, and it does buffer the more limited 16B decoder path.

And I'm also baffled by the low L1I associativity; it seems obvious that it's not enough. Yet AMD didn't even fix it with Piledriver.
The memory pipeline is pretty fundamental to the functioning of the front end. Changing that probably means changing the front end, which is more significant than what a tweaked Bulldozer variant is going to get.
 
I still think AMD went a bit too far with their automated design approach for Bulldozer, especially regarding the front end. If you look carefully at the die shot, the front-end block is very similar to what they have been using since K8. Now, of course, the thing is somehow patched for dual-threaded workloads, but I can't help but think there are too many legacy leftovers there, with the "copy-pasted" L1i being just one of them.
 
The uop cache is effectively a cache for one trace. If anything it validates the trace cache concept: massive issue width while consuming less power.

The problem with the P4 was that it was limited to decoding a single instruction per cycle when it missed the trace cache.

Intel picked the low-hanging fruit by exploiting the very predictable nature of loops, but I wouldn't be surprised to see more aggressive "uop" caches in the future with support for more traces. It won't be called a trace cache, though; that name is forever stigmatized.

Cheers
 
I still think AMD went a bit too far with their automated design approach for Bulldozer, especially regarding the front end. If you look carefully at the die shot, the front-end block is very similar to what they have been using since K8. Now, of course, the thing is somehow patched for dual-threaded workloads, but I can't help but think there are too many legacy leftovers there, with the "copy-pasted" L1i being just one of them.

The more automated design flow doesn't automatically lead to the decision to have a low-associativity L1 Icache.
The contents of the cache have changed, since branch prediction no longer has bits in the L1 like in K8.

The cache would have been reimplemented in BD, so it's not a simple matter of copy/paste. The designers looked at the major design parameters of the previous generation's L1, and they kept them.
 
The memory pipeline is pretty fundamental to the functioning of the front end. Changing that probably means changing the front end, which is more significant than what a tweaked Bulldozer variant is going to get.
The L1 instruction cache might be quite tied to the whole front end, but I don't think this poses a fundamental problem for increasing cache size or associativity. Maybe they weren't able to increase associativity without increasing latency, though.
That said, are you suggesting AMD is going to stick with 2-way L1 instruction cache associativity for the next 5 years or so? Because tweaked Bulldozer is all that's on the roadmap (well, apart from the low-power designs).
FWIW, L1I associativity was 8 for Core 2, 4 on Nehalem, and now back to 8 for Sandy/Ivy Bridge. Clearly these things can be redesigned.
AMD, OTOH, stuck with 2-way 64KB L1 instruction/data caches forever (since K7 days; K6 also had two-way caches, but only 2x32KB). Only BD now has a different L1D (Bobcat is also "blessed" with a 2-way 32KB L1I, though it probably makes sense there, and its 8-way 32KB L1D is actually more than what you get with BD...).
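For reference, a quick Python sketch multiplying out the geometries mentioned here (64-byte lines assumed for all of them):

```python
# Sets and per-way capacity for the L1I geometries discussed above.
def geometry(size_kb, ways, line=64):
    size = size_kb * 1024
    sets = size // (ways * line)
    return sets, size // ways  # (number of sets, bytes spanned by one way)

for name, size_kb, ways in [
    ("K7/K8/BD L1I (2-way 64KB)",     64, 2),
    ("Core 2 L1I (8-way 32KB)",       32, 8),
    ("Nehalem L1I (4-way 32KB)",      32, 4),
    ("Sandy Bridge L1I (8-way 32KB)", 32, 8),
]:
    sets, way_bytes = geometry(size_kb, ways)
    print(f"{name}: {sets} sets, {way_bytes} bytes per way")
```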
 
The uop cache is effectively a cache for one trace. If anything it validates the trace cache concept, - massive issue width while consuming less power.
I don't see it as having any traces. The contents of a trace cache reflect the execution path taken by the processor. If code is stored contiguously in memory as AA if BB else CC, the trace cache may contain AABB and AACC, or various combinations of trace fragments.
The uop cache is a post-decode cache that has a fixed relationship to the linearly addressed Icache, and it primarily validates a benefit for a chip running a complex ISA. To a limited extent, it gathers some low-hanging fruit for the original purpose of a trace cache--solving the problem of discontinuities in the fetch stream compromising superscalar issue.
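To make the distinction concrete, here's a toy sketch in Python of the two indexing schemes. The keys and payloads are made up purely for illustration and don't reflect any real hardware format:

```python
# Toy model only: real trace/uop caches tag, age, and replace entries in
# far more involved ways.

# A trace cache keys on (start address, branch path), so the same static
# code can be duplicated across traces:
trace_cache = {
    (0x1000, "taken"):     ["AA uops", "BB uops"],  # AA followed by taken branch
    (0x1000, "not-taken"): ["AA uops", "CC uops"],  # AA duplicated in this trace
}

# A uop cache keys purely on the aligned fetch address, mirroring the
# linearly addressed Icache, so each chunk of code is stored at most once:
uop_cache = {
    0x1000: ["AA uops"],
    0x1020: ["BB uops"],
    0x1040: ["CC uops"],
}

# Trace cache lookup needs the predicted path; uop cache lookup doesn't.
print(trace_cache[(0x1000, "taken")])
print(uop_cache[0x1000])
```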

The L1 instruction cache might be quite tied to the whole front end, but I don't think this poses a fundamental problem for increasing cache size or associativity. Maybe they weren't able to increase associativity without increasing latency, though.
The memory closest to the execution pipeline has less leeway in terms of how it impacts cycle time and the stages in the pipeline that deal with fetch and prediction.
The TLB and tag check logic would be altered if the ratio of tag and index bits changes.
One possible, if unlikely, change would be to significantly change the associativity or reduce capacity so that it matches the size/associativity ratio of Sandy Bridge.
This would eliminate the aliasing problem entirely, and discard a portion of the cache fill pipeline used to invalidate synonyms.
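A minimal sketch of why that works, assuming 4KB pages: if each way spans no more than a page, the set index comes entirely from untranslated offset bits, so a physical line can only ever land in one set and synonyms can't occur.

```python
PAGE = 4096  # 4KB page: the low 12 address bits are untranslated

def has_aliasing(size_kb, ways):
    # Bytes covered by one way = span of the index + offset bits.
    bytes_per_way = size_kb * 1024 // ways
    return bytes_per_way > PAGE

print(has_aliasing(64, 2))  # Bulldozer L1I: 32KB per way -> True (synonyms possible)
print(has_aliasing(32, 8))  # Sandy Bridge L1I: 4KB per way -> False (no aliasing)
```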

That said, are you suggesting AMD is going to stick with 2-way L1 instruction cache associativity for the next 5 years or so?
AMD promised little more than increases in some buffers and minor changes like new instructions for Piledriver coupled with improved clocks at a given power level, and that's what we got.
Some reports say that more change is in store with Steamroller, more so than was promised with Piledriver.
Since Steamroller is also meant to be on a new non-SOI node, more changes could be in the air because various parts of the pipeline will need to be adjusted anyway.

Because tweaked Bulldozer is all that's on the roadmap (well, apart from the low-power designs).
FWIW, L1I associativity was 8 for Core 2, 4 on Nehalem, and now back to 8 for Sandy/Ivy Bridge. Clearly these things can be redesigned.
Sandy Bridge is a bit more than a tweaked Nehalem.
Bulldozer to Piledriver is something like the SB to IVB transition, without the node jump.

AMD, OTOH, stuck with 2-way 64KB L1 instruction/data caches forever (since K7 days; K6 also had two-way caches, but only 2x32KB).
The size and associativity haven't changed, but the BD L1 is physically different and it no longer serves as part of the branch predictor.
 
Well, it's only ~6KB in size. Not much to find there. But it's a mile better than the old trace cache, for instance, with its huge redundancy overhead and complete lack of immunity to branch mispredictions. I think Intel's reasoning was/is still much more about the power-savings mantra, mostly for the mobile SKUs. Performance is hit and miss, anyway.

The cache holds 1536 uops. According to David Kanter's Sandy Bridge article (http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=4), it is accessed in 32-byte windows of 4 uops each, meaning that each uop is 8 bytes. This is actually kind of surprising, since Prescott also used 64-bit uops, which were generally much more limited than SB's fused uops.

That would make it 12KB, but it probably takes up significantly more space than a normal 12KB 6-way set-associative instruction cache would due to metadata mapping instruction addresses between it and the L1 instruction cache.
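Multiplying that out, taking the 1536-uop figure and the 8-bytes-per-uop reading at face value:

```python
uops = 1536
bytes_per_uop = 32 // 4       # 32-byte window divided by 4 uops
print(uops * bytes_per_uop)   # 12288 bytes = 12KB of raw uop storage
```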
 
The cache holds 1536 uops. According to David Kanter's Sandy Bridge article (http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=4), it is accessed in 32-byte windows of 4 uops each, meaning that each uop is 8 bytes.
My interpretation is that the 32 byte window represents the 32 byte aligned chunks as represented in memory and the standard ICache.
Any given window can be represented by up to 18 uops that can take up to 3 lines in the uop cache.
There isn't a 1:1 correspondence between the external representation and the uop cache in terms of size or op count. I'm not sure how to arrive at 8 bytes per uop in a general case.

There is a restriction that instructions with 64-bit immediates take up two slots, which may give the uop cache a granularity where a 32-bit immediate can fit comfortably, so perhaps each slot is between 64 and 128 bits in length.

The amplification of a 32-byte chunk of instructions would be variable.
If slots are 64 bits and each way can have up to 6 uops, that's at least 48 bytes of uop data, not counting metadata. A 32-byte window mapping to a fully occupied way would be amplified by a factor of 1.5.
96 bits per uop would give a factor of 2.25.
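Multiplied out, with both slot widths treated purely as assumptions:

```python
window_bytes = 32                      # one 32-byte x86 window
for slot_bits in (64, 96):             # assumed uop slot widths
    way_bytes = 6 * slot_bits // 8     # 6 uop slots per way, metadata excluded
    print(slot_bits, way_bytes / window_bytes)  # 64 -> 1.5, 96 -> 2.25
```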

That would make it 12KB, but it probably takes up significantly more space than a normal 12KB 6-way set-associative instruction cache would due to metadata mapping instruction addresses between it and the L1 instruction cache.
Mapping seems like it could be derived with a pointer and enough length counters.
There are at least 48 bits for the IP of the first instruction in the window, then a theoretical max of 18 length counters per window. The max byte length for an x86 instruction is 15, which naively suggests 4 bits per counter. This assumes there is a valid way to pad an instruction out to 15 bytes while translating to one uop; it may be unnecessary to be that naive, though, since an instruction that long leaves little room in the window for many more instructions.
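A rough bit budget for that scheme might look like the following; every number here is a guess about a plausible encoding, not a documented structure:

```python
ip_bits = 48          # IP of the first instruction in the window
max_insts = 18        # cap of uops per 32-byte window
counter_bits = 4      # covers x86's 15-byte max instruction length
print(ip_bits + max_insts * counter_bits)  # 120 bits of mapping metadata per window
```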
 
My interpretation is that the 32 byte window represents the 32 byte aligned chunks as represented in memory and the standard ICache.
Any given window can be represented by up to 18 uops that can take up to 3 lines in the uop cache.
There isn't a 1:1 correspondence between the external representation and the uop cache in terms of size or op count. I'm not sure how to arrive at 8 bytes per uop in a general case.

There is a restriction that instructions with 64-bit immediates take up two slots, which may give the uop cache a granularity where a 32-bit immediate can fit comfortably, so perhaps each slot is between 64 and 128 bits in length.

The amplification of a 32-byte chunk of instructions would be variable.
If slots are 64 bits and each way can have up to 6 uops, that's at least 48 bytes of uop data, not counting metadata. A 32-byte window mapping to a fully occupied way would be amplified by a factor of 1.5.
96 bits per uop would give a factor of 2.25.

I'm not talking about the 32-byte x86 instruction window that the uop cache scans from; I'm talking about the interface coming out of the uop cache, which is being labelled as 32 bytes. I may have incorrectly read a description of the former as one for the latter, though, since I was just skimming for some textual reference to the label. On second read, I think he's using 32 bytes to refer to what x86 instructions it could be "representing."

The 32-byte x86 windows don't correspond to the 4 uops that can come out of the cache, though, but to the full 6-uop lines, and I think you'd be hard pressed to get any 4 uops to fit 32 bytes of x86 code. But the correlation between x86 instructions and uops is irrelevant to what I'm saying; I'm referring strictly to the size of uops here (and thus the size of the uop cache). Of course I understand/agree with what you're saying; I'm strictly interested in the uop size.

64 bits might fit as the uop size. Despite being the same size as Prescott's (generally simpler) uops, the uop fusion rules don't really add a lot of extra data per uop. If this number is incorrect, then I suspect it's not that much larger, for example 80 bits; 128 bits would really surprise me. We both agree they're fixed width though, right?

The confusion behind fellix's original comment is probably because of Intel's claim that the uop cache performs "like a 6KB instruction cache." Going just by capacity, that'd imply that a uop in the cache is worth about 4 bytes of x86 code; the average bytes/instruction in typical programs is probably well under 4, but the average uops/instruction is also going to be a bit higher than 1, and the uop cache will have some unused parts in lines. So it's a pretty reasonable-sounding estimate.
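Spelled out, with Intel's 6KB figure and the 1536-uop capacity as the only inputs:

```python
uop_capacity = 1536
equiv_bytes = 6 * 1024
print(equiv_bytes / uop_capacity)  # 4.0 bytes of x86 code per cached uop
```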
 
I'm not talking about the 32-byte x86 instruction window that the uop cache scans from; I'm talking about the interface coming out of the uop cache, which is being labelled as 32 bytes.
I'm trying to think through the description of the process. The term "window" is used multiple times, but I wasn't sure what exactly is being referenced in each instance.
I reread the description of the uop cache hit process a few times, and after thinking it through, the initial guess of 64 bits fits. Upon a hit, the uop cache sends the 1-3 lines to an intermediate buffer; that buffer can take 6 uops a cycle, but it has an output limit of 4 uops/32 bytes, which gives each one 64 bits. Since each way of the uop cache has 6 uops, that would mean each way contains 48 bytes worth of op data plus additional metadata. 32 sets * 8 ways * 48 bytes per way / 8 bytes per uop gives 1.5K uops.
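Multiplying that geometry out as a sanity check (64-bit slots assumed):

```python
sets, ways, uops_per_way = 32, 8, 6
print(sets * ways * uops_per_way)      # 1536 uops
print(sets * ways * uops_per_way * 8)  # 12288 bytes of op data, metadata excluded
```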

64 bits might fit as the uop size. Despite being the same size as Prescott's (generally simpler) uops, the uop fusion rules don't really add a lot of extra data per uop. If this number is incorrect, then I suspect it's not that much larger, for example 80 bits; 128 bits would really surprise me. We both agree they're fixed width though, right?
Fixed, or mostly fixed, since ops with 64-bit immediates apparently span more than one slot, according to Intel's documentation.
 
I will need to go back and review numbers from years back.
On a multithreaded basis, there may be more areas where on a module to core basis Piledriver is more competitive.
In terms of single thread performance, I will need to check where it is in relation to Westmere before worrying about SB and IB.
 
I will need to go back and review numbers from years back.
On a multithreaded basis, there may be more areas where on a module to core basis Piledriver is more competitive.
In terms of single thread performance, I will need to check where it is in relation to Westmere before worrying about SB and IB.

Per-clock performance is only just up to Llano's (and still behind in quite a few areas), which is about 5% better on average than Phenom II. So nowhere near Intel, but it's a big turnaround considering it's a minor revision; the fact that at the same TDP you get almost 1000MHz over Llano, and turbo isn't used in that review, means a significant performance increase over Llano.

There have been statements from AMD saying Steamroller will bring their single-thread performance much closer to Intel's. I wonder what they are going to change? Decode, load/store, the FPU, the number of ALUs, a trace cache?

It will be very interesting to see what Steamroller is; it looks like it's good enough for at least Sony, maybe even Microsoft. I guess we will see if Bulldozer was Yonah or Northwood :LOL:
 
Are there any benchmarks around with the last Win8 preview?

Is this question in relation to the updated scheduling that was touted in Win8 to provide an incremental performance improvement on Bulldozer? My own speculation: I doubt it's significant enough to write about.
 
Is this question in relation to the updated scheduling that was touted in Win8 to provide an incremental performance improvement on Bulldozer? My own speculation: I doubt it's significant enough to write about.

The most important scheduler changes were already released for Windows 7 during the winter, so I don't expect big improvements over them with W8.
 