http://www.brightsideofnews.com/new...core-architectural-enhancements-unveiled.aspx
Increased L2 BTB size from 5K to 10K and from 8 to 16 banks.
AMD simply reused the layout from BD's FPU to stitch together wider pseudo-native 256-bit pipes in this case.
I think there is still an upper/lower data path split, but I'm not sure which way is best to handle it.
The units themselves have a double pink line through the middle, which may be a way to separately gate each half.
It may be better to rotate the whole FPU 90 degrees. Instead of it being left:right=Hi:Lo, it's left=register file 0, right=register file 1.
Each half would have its own Hi:Lo split. There seems to be some extra routing going on between the upper and lower halves of each side to permit shuffling between them.
The first picture in my previous post is the best I could find.
Do you have a higher resolution shot of Bulldozer's FPU as well?
The grainy shots I have wouldn't give the same kind of symmetry when flipping the lower to upper.
The upper half, with the slightly wider register file arrays, is actually an 80-bit implementation because of the x87 compatibility. The lower half is 64-bit and omits the x87 stack. It's similar to the approach found in K10, compared to K8.
Each quadrant has mirror symmetry within itself, but while the top two quadrants match, the lower two have slightly different patterns from the top and from each other.
I'm talking about how the hardware at the far edge of the units has a different axis of symmetry across the two chips.
It's sort of one-half of a POWER7's threading scheme, which should give more resources per thread.
So could this be a POWER7-style SMT for the int cores, but with only 1 thread per set of resources?
The AGUs get a micro op for every memory access, so FP instructions with a memory operand will be sent to both an AGU and the FP unit.
Aren't the ALUs/AGLUs needed for passing data to the FPU? Could the second set be targeted towards this? If you have two threads per int core, maybe having extra "dedicated" units is worth the cost.
Assuming this diagram of Bulldozer is labeled correctly, I agree with you that it's the microcode ROM.
The L1 has two sections, separated by a pink section that could be the microcode ROM.
It's gotta be a 2 x 48KB L1I. One for each decode unit. I think I can actually make out each of the 4 partitions that make up the 4 wide decode; I'm learning a lot from this shot.
There is what appears to be a fetch buffer for both sections, so I am left wondering if this is one big L1 I$ or two.
Good catch on the ALUs/AGUs. They also look more symmetrical.
A single integer core looks to have twice the integer ALUs, twice the AGUs, but the multiplier and divider sections don't look replicated. The physical register file doesn't appear to be doubled. If it is bigger, it's not significant enough to split it into new sections or appear to be more than incremental growth.
Agreed with rename and retire. Ancestry table looks a bit bigger. As far as Instruction Wakeup and Pickup goes, the table appears to be 33% larger. I'm seeing 3 partitions in BD/PD, and 4 in SR. The bit sandwiched in between, logic I'm assuming, also appears to be more robust.
Interestingly, it looks like the rename tables and retire structures are doubled in size.
The table that might have to do with waking up/picking instructions could be bigger, but it isn't doubled.
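A quick sanity check on the 33% figure, as a minimal sketch. It assumes the partitions seen in the die shots are all the same size, which the pictures can't confirm, and the function name is just made up for illustration:

```python
def growth_from_partitions(old_parts: int, new_parts: int) -> float:
    """Percent growth if every partition is the same size (an assumption)."""
    return (new_parts - old_parts) / old_parts * 100.0


# 3 partitions counted in BD/PD, 4 counted in SR (numbers from the post above).
print(f"{growth_from_partitions(3, 4):.1f}% larger")  # 33.3% larger
```

Going from 3 to 4 equal partitions is about a third more, well short of the 100% a doubled table would need.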
The area outside the register tables looks larger to me, although maybe it's just because the payload and immediate storage were moved. Yeah, I think that's it.
The odd thing from a single-threaded perspective is that the scheduler logic outside of the tables is either much denser or not much larger. It's also not really necessary to have double the retirement tracking or rename tables for a core whose decoder is still 4-wide--if the core is single-threaded.
Perhaps it isn't.
It looks largely the same. I think I see some tables near the border of the FPU that look larger, but the LSU is a real mess. I don't have an aid to dissect it either.
I've only had fuzzy BD shots to compare with, which makes the load/store section particularly hard to analyze.
The L1 data cache appears to be different, but not necessarily much bigger in area. If it's not bigger, it may be more aggressively banked. The interfacing logic on the side of the L1 doesn't appear to have more subunits, which might mean the port count hasn't changed. I don't know if its bandwidth has changed, but the fuzzily pictured width of that interface doesn't seem to be much different.
The L/S section appears relatively narrower compared to the sections that did grow, which could indicate it has been slightly modified.
There are a few duplicated/grown structures, which might be queues for loads and stores. My die-shot-fu isn't good enough to know which one is which. The more obviously duplicated structure may be a pair of store queues.
I don't think I've seen anything stating that the BPU logic is doubling.
Where are the µCode-ROMs and why is there nothing to see of the doubled-branch-prediction logic?
To me it is "just" doubled ... I zoomed into that pic:
One final thing of note: L1D appears to be 2 x ~21 or ~22KB. It's an "odd" number, but let me explain how I came to this conclusion.
I did it, see the thread at S|A. The rest is boring ;-)
And at this point, I'm too bored to look at the FPU.
From the leaked information at BSN:
- Store to load forwarding optimization
- Dispatch and retire up to 2 stores per cycle
- Improved memfile, from last 3 stores to last 8 stores, and allow tracking of dependent stack operations.
- Load queue (LDQ) size increased to 48, from 44.
- Store queue (STQ) size increased to 32, from 24.
- Increase dispatch bandwidth to 8 INT ops per cycle (4 to each core), from 4 INT ops per cycle (4 to just 1 core). 4 ops per cycle per core remains unchanged.
- Accelerate SYSCALL/SYSRET.
- Increased L2 BTB size from 5K to 10K and from 8 to 16 banks.
- Improved loop prediction.
- Increase PFB from 8 to 16 entries; the 8 additional entries can be used either for prefetch or as a loop buffer.
- Increase snoop tag throughput.
- Change from 4 to 3 FP pipe stages.
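To put the leaked numbers side by side, here is a small sketch that works out the relative change for each before/after pair in the list. The values are copied straight from the bullets above; the dictionary labels and the `relative_change` helper are my own naming, not anything from AMD or BSN.

```python
# Before/after pairs copied from the leaked list above.
# Labels are illustrative; "FP pipe stages" follows the list's own wording.
changes = {
    "Memfile tracked stores": (3, 8),
    "LDQ entries": (44, 48),
    "STQ entries": (24, 32),
    "INT dispatch ops/cycle": (4, 8),
    "L2 BTB entries (K)": (5, 10),
    "L2 BTB banks": (8, 16),
    "PFB entries": (8, 16),
    "FP pipe stages": (4, 3),
}


def relative_change(before: float, after: float) -> float:
    """Percent change from `before` to `after`."""
    return (after - before) / before * 100.0


for name, (before, after) in changes.items():
    pct = relative_change(before, after)
    print(f"{name:26s} {before:>3} -> {after:<3} {pct:+7.1f}%")
```

By this count the load queue grows by about 9% and the store queue by about a third, while the BTB, PFB and dispatch bandwidth are the items that actually double.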