Is this AMD Steamroller?

fellix · May 25, 2013

A brief assessment:

* huge L1i cache!
* beefed-up INT cores and larger L1d caches;
* two FP/SIMD units (4x128-bit FMA?);

itsmydamnation · May 25, 2013

well from my layman viewing, lots of structures look very similar to piledriver and lots look very different. So its ever a very good fake or it its steamroller.

where did it come from?

we know steamroller has 128KB L1i and this looks double pildedriver. would be nice if we had a way to scale it to pile driver core.

FPU looks twice as big, but AMD have said nothing about that.

edit:
given things like http://www.anandtech.com/show/6201/amd-details-its-3rd-gen-steamroller-architecture
say only 2 FMA's so maybe Excavator, but isn't this a little or a lot early for that.

Paran · May 25, 2013

What is the source of this shot?

Opteron · May 25, 2013

Where are the µCode-ROMs and why is there nothing to see of the doubled-branch-prediction logic?

Increased L2 BTB size from 5K to 10K and from 8 to 16 banks.

http://www.brightsideofnews.com/new...core-architectural-enhancements-unveiled.aspx

Blazkowicz · May 26, 2013

I have trouble understanding what the hell it is, just the basics.
Is that a GPU in the top half, and two bulldozers modules on the bottom? (with the big orange squares the left and the right being L2) /edit : this sounds dumb. /edit (bis) : I now believe it's that again.

Or it's two modules in the center, lol, GPU on bottom, scary looking control structures/whatever on the top and.. L3 on an APU?

fellix · May 26, 2013

No, the picture is a shot of a single module (two cores) with the L2 cache cropped out. The layout is similar to Bulldozer's architecture -- no GPU is involved here in any way.

3dilettante · May 26, 2013

For fun, I'll just assume this is a true representation of a future core, or someone put a decent amount of effort into making this up.
There isn't a good shot of this core and Bulldozer of equivalent quality, but here's what it sort of looks like to me.

The L1 has two sections, separated by a pink section that could be the microcode ROM. There is what appears to be a fetch buffer for both sections, so I am left wondering if this is one big L1 I$ or two.
For the instruction section there are two of everything, save the predecoder, branch predictor, and microcode. It think there's a little pink rectangle below the microcode that is a microcode-related engine, and even this is duplicated.

A single integer core looks to have twice the integer ALUs, twice the AGUs, but the multiplier and divider sections don't look replicated. The physical register file doesn't appear to be doubled. If it is bigger, it's not significant enough to split it into new sections or appear to be more than incremental growth.
The portions of the pipeline related to gathering operands and immediates for each instruction are doubled.
Interestingly, it looks like the rename tables and retire structures are doubled in size.
The table that might have to do with waking up/picking instructions could be bigger, but it isn't doubled.
The odd thing from a single-threaded perspective is that the scheduler logic outside of the tables is either much denser or not much larger. It's also not really necessary to have double the retirement tracking or rename tables for a core whose decoder is still 4-wide--if the core is single-threaded.
Perhaps it isn't.
If the integer pipeline is partitioned into halves, it might make the bypassing network less of a nightmare than expanding it from 2 ALUs and 2 AGUs to 4x4 single-cycle.

I've only had fuzzy BD shots to compare with, which makes the load/store section particularly hard to analyze.
The L1 data cache appears to be different, but not necessarily much bigger in area. If it's not bigger, it may be more aggressively banked. The interfacing logic on the side of the L1 doesn't appear to have more subunits, which might mean the port count hasn't changed. I don't know if its bandwidth has changed, but the fuzzily pictured width of that interface doesn't seem to be much different.
The L/S section appears relatively narrower compared to the sections that did grow, which could indicate it has been slightly modified.
There are a few duplicated/grown structures, which might be queues for loads and stores. My die-shot-fu isn't good enough to know which one is which. The more obviously duplicated structure may be a pair of store queues.

The FP unit appears to have rotated a bunch of components 90 degrees, and doubled the capacity of the register file. The two halves of each bank of registers aren't physically identical. This may be a legacy/full vector set distinction.
For what it's worth, what watchimpress stated is the retire control is duplicated. There were already two banks, one per thread with the original dual-threaded FPU. This could mean it's now able to support four.

I think there is still an upper/lower data path split, but I'm not sure which way is best to handle it.
The units themselves have a double pink line through the middle, which may be a way to separately gate each half.
It may be better to rotate the whole FPU 90 degrees. Instead of it being left:right=Hi:Lo, it's left=register file 0, right=register file 1.
Each half would have its own Hi:Lo split. There seems to be some extra routing going on between the upper and lower halves of each side to permit shuffling between them.
There's a slight break in the symmetry of the left and right sides, however, particularly in the lower right. There may be some special functions that use different hardware there. There is some additional routing in the lower half that could explain how one half could still use whatever special hardware is on the other side.
This may also explain the potentially narrower depiction of the FPU in some slides, which went form 2 FMAC+ 2MMX(one being also FSTORE) to 2 FMAC + 1MMX. The hardware would be mostly the same, but the common case would be that each thread can only see its side of the FPU and other ops that could cross to the other side have the one port doing double-duty.

fellix · May 26, 2013

Here's the original shot in full size and more detail - even the data paths are visible:

https://securecdn.disqus.com/uploads/mediaembed/images/500/9389/original.jpg

fellix · May 26, 2013

3dilettante said:
I think there is still an upper/lower data path split, but I'm not sure which way is best to handle it.
The units themselves have a double pink line through the middle, which may be a way to separately gate each half.
It may be better to rotate the whole FPU 90 degrees. Instead of it being left:right=Hi:Lo, it's left=register file 0, right=register file 1.
Each half would have its own Hi:Lo split. There seems to be some extra routing going on between the upper and lower halves of each side to permit shuffling between them.

AMD simply reused the layout from BD's FPU to stitch a wider pseudo-native 256-bit pipes in this case.
Here's a bit more clear shot of the FPU block in BD and what I think happened:

3dilettante · May 26, 2013

Do you have a higher resolution shot of Bulldozer's FPU as well?
The grainy shots I have wouldn't give the same kind of symmetry when flipping the lower to upper.

Each quadrant has mirror symmetry within itself, but while the top two quadrants match, the lower two have slightly different patterns from the top and from each other.

fellix · May 26, 2013

3dilettante said:
Do you have a higher resolution shot of Bulldozer's FPU as well?
The grainy shots I have wouldn't give the same kind of symmetry when flipping the lower to upper.

The first picture in my previous post is the best I could find.

3dilettante said:
Each quadrant has mirror symmetry within itself, but while the top two quadrants match, the lower two have slightly different patterns from the top and from each other.

The upper half, with the slightly wider register file arrays, is actually a 80-bit implementation because of the x87 comparability. The lower half is 64-bit that omits the x87 stack. It's a similar approach found in K10, compared to K8.

itsmydamnation · May 26, 2013

where did this die-shot come from.

edit:

so could this be a Power7 kind of style SMT for the int cores but only 1 thread per set of resources?

Opteron · May 26, 2013

You guys probably missed this?:

Excavator with 256bit FMACs?

3dilettante · May 27, 2013

fellix said:
The upper half, with the slightly wider register file arrays, is actually a 80-bit implementation because of the x87 comparability. The lower half is 64-bit that omits the x87 stack. It's a similar approach found in K10, compared to K8.

I'm talking about how the hardware at the far edge of the units has a different axis of symmetry across the two chips.
The silicon at the edge of Bulldozer's FPU unit has an axis of symmetry that runs horizontally in the shot. The ends of each ALU match the hardware above, but do not match the hardware at the edge of the ALUs on the other side.
This shot does things differently. Within each quadrant, the lower and upper halves are symmetrical, but each quadrant doesn't mirror the one below.

This could point to a change in how the high and low bits are handled from bulldozer, which sent half of the bits to one side of the FPU and the other half to the other.
This chip may be sending bits to one side when issuing an instruction.

itsmydamnation said:
so could this be a Power7 kind of style SMT for the int cores but only 1 thread per set of resources?

It's sort of one-half of a POWER7's threading scheme, which should give more resources per thread.

Within each physical core, there has to be some forwarding between the clusters, since there seems to be only one multiplier and divider, and it's all hanging off of one register file. The bypass network could be simplified if single-cycle bypass only happened within one partition, but at some point bits need to travel to either the divider or multiplier or back to a register, and those won't split for threads.

The question would be if a thread can use the other side of the partition. Issue capability and register porting would need to increase, which makes it seem unlikely that a thread can use double the resources. On the other hand, it would seem unlikely that at least some of the throughput wasn't increased, otherwise, why double the units? This could lead to some single-threaded improvement, or extra slots to consume rename ops.

itsmydamnation · May 28, 2013

Aren't the ALU's /AGLU's needed for passing data to the FPU? could the second set be targeted towards this, if you have two threads per int core maybe having extra "dedicated" units is worth the cost.

Also fellix where did this come from?!?!?! .......... I was hoping AMD would go SMT per Core within CMT because that's where i saw a potential uarch advantage over Intel because you could spend even more power and transistors on a more complex front end and FPU. I just didn't expect it to happen, yet alone to be seeing a dieshot of it in 2013........

Raqia · May 28, 2013

I love sexy shots like this. Reminds me of when they doubled the FPU width for k10, looks like they just duplicated a lot of the circuitry beneath the original block. The much larger area of buffer cells near the decoders might be for the u-op trace-cache like structure (it should be in Steamroller as well as Excavator).

3dilettante · May 28, 2013

itsmydamnation said:
Aren't the ALU's /AGLU's needed for passing data to the FPU? could the second set be targeted towards this, if you have two threads per int core maybe having extra "dedicated" units is worth the cost.

The AGUs get a micro op for every memory access, so FP instructions with a memory operand will send to both an AGU and the FP unit.
This is one area where having additional units may have a slight benefit even without increasing issue width by allowing certain instruction mixes to issue in a single cycle instead of splitting it up.

The benefits are limited to the capacity of the memory scheduler and load/store pipeline. Again, it's hard to tell from the quality of images I have, but I'm not sure that has been doubled.
There's also the question in the single-threaded case how readily data can be read or bypassed to all four AGUs in a single cycle. For the current AGU setup (edit: I was thinking of INC and simple ops, I'm not able to recall which ones may have two sources), they already read only one register at most in a cycle, so the existing file could support this. Bypassing gets more complicated, however, unless there are changes in how many destinations a given unit can bypass to in a cycle.

A simpler case is a dual-threaded scenario where each thread gets a lane of integer units, in which case both could issue two FP instructions with memory operands at the same time.
I'm still not certain how the memory pipeline has been adjusted to match.

Homeles · Jun 2, 2013

Excavator is supposed to be highly synthesized, right? In my opinion, this is still largely hand-drawn.

Compare to the automated Jaguar core:

It's hard to see in detail, but everything is mashed together in a manner that could only be done by machine. Here's an aid:

Whereas the rough layout of this alleged SR core looks like this (I think the parts underneath the INT-0 and INT-1 labels are actually the LSUs):

On a high level, this core has a decidedly hand-drawn layout, no? The resolution of the shot isn't high enough to make good judgement of how automated the design is on a block-level, but it doesn't appear to be very different than the largely hand-drawn Bulldozer/Pilederiver.

I don't know what the exact density improvement is from 32nm gate first -> 28nm gate first, but if we stick to traditional nomenclature, it's around 50% denser. Assuming L2 and L3 cache sizes aren't increasing (I think L2's staying the same... now I'm having trouble remembering), this should be feasible as an 4 module/8 core layout, right?

The modules themselves take up about a quarter of the space on Bulldozer/Piledriver, and making the lax assumption that a Steamroller module is twice the size of a Piledriver module on an equivalent process, that would lead to a crudely calculated die size of ~400mm2 on 32nm (1/4 * 315mm2 = 78.75mm2 -> 78.75mm2 + 315mm2 = 393.75mm2). Obviously there are still some area savings from the module design, despite being drastically reduced. If we take that 50% density number and apply it to the 400mm number, we're roughly at 300mm2, which is definitely feasible on 28nm. A 2M/4C Kaveri would be a delightfully powerful device using this layout.

Now if module count is increasing to the 6M/12C number I've seen thrown around, this would become a very large die -- 400mm2 or so. There's no question that such a chip would be manufacturable, but at what cost? There's no doubt that such a chip would be competitive, assuming the design isn't borked, but would the margins be good enough?

So if we ignore any potential density increases obtained by using an automated design, this core would absolutely be feasible. I really believe this to be Steamroller, and not Excavator. It looks too hand-drawn (at least from what we've been told to expect from EX), the photo has appeared in a more appropriate time frame (i.e., it's awfully early for an Excavator leak), and there's no reason why it wouldn't work as an FX or Kaveri part.

My final attempt to discredit this die shot as being EX: we have been shown a demonstration of a high density library being applied to a Bulldozer FPU. The multicolored rectangle, which I assume to be the FPU pipelines themselves, is much denser. In the SR and BD comparison shots that fellix has graciously provided, the "multicolored rectangle" bit is virtually identical. If it were to have been synthesized under the high density library, it would be markedly smaller.

Unless everything we've been told about EX is wrong, this is SR. AMD's slide showing 128 bit FMACs on SR may have been erroneous, or perhaps some sort of deliberate obfuscation to prevent revealing too much of their design.

Homeles · Jun 2, 2013

One final thing of note: L1D appears to be 2 x ~21 or ~22KB. It's an "odd" number, but let me explain how I came to this conclusion.

BD's L1D, like this shot, is partitioned into 4 quadrants, like a "plus sign." In BD, there are 3 rows of cells per quadrant, while in SR, there are 4. Each row is 9 cells wide in SR, and with a painful amount of squinting at SemiAccurate's die shot of Vishera, I've determined that Piledriver has the same 9 cell width.

4 / 3 * 16KB = 21.33 KB

Now it could be that there is more storage per "cell." A more sensible number would be 24KB.

The final explanation, and the more likely one, is that I did not calculate the "pink" cells or whatever that pink stuff is in the SR shot. Bulldozer and Vishera apparently has a "pink part" that is half the size of SR's. I assume those are tags? If we count that extra pink bit as a cell, with SR having twice as many pink bits, the calculation comes to (4 * 11) / (3 * 10) = 1.466 * 16KB = 23.466 = ~24KB with liberal rounding. So maybe SR has a 2 x 24KB L1D?

3dilettante said:
The L1 has two sections, separated by a pink section that could be the microcode ROM.

Assuming this diagram of Bulldozer is labeled correctly, I'm in concurrence with you that it's the microcode ROM.

There is what appears to be a fetch buffer for both sections, so I am left wondering if this is one big L1 I$ or two.

It's gotta be a 2 x 48KB L1I. One for each decode unit. I think I can actually make out each of the 4 partitions that make up the 4 wide decode; I'm learning a lot from this shot.

A single integer core looks to have twice the integer ALUs, twice the AGUs, but the multiplier and divider sections don't look replicated. The physical register file doesn't appear to be doubled. If it is bigger, it's not significant enough to split it into new sections or appear to be more than incremental growth.

Good catch on the ALUs/AGUs. They also look more symmetrical.

I think the Instruction Payload Storage has been moved to the "top" of the ALUs, register file and AGUs. It appears to have doubled in size. Also, the Immeditate Value Storage has doubled as well, as you mention, and I think it makes up the "right" border directly on the right of the AGUs.

Interestingly, it looks like the rename tables and retire structures are doubled in size.
The table that might have to do with waking up/picking instructions could be bigger, but it isn't doubled.

Agreed with rename and retire. Ancestry table looks a bit bigger. As far as Instruction Wakeup and Pickup goes, the table appears to be 33% larger. I'm seeing 3 partitions in BD/PD, and 4 in SR. The bit sandwiched in between, logic I'm assuming, also appears to be more robust.

The odd thing from a single-threaded perspective is that the scheduler logic outside of the tables is either much denser or not much larger. It's also not really necessary to have double the retirement tracking or rename tables for a core whose decoder is still 4-wide--if the core is single-threaded.
Perhaps it isn't.

The area outside the register tables looks larger to me, although maybe it's just because the payload and immediate storage were moved. Yeah, I think that's it.

I've only had fuzzy BD shots to compare with, which makes the load/store section particularly hard to analyze.
The L1 data cache appears to be different, but not necessarily much bigger in area. If it's not bigger, it may be more aggressively banked. The interfacing logic on the side of the L1 doesn't appear to have more subunits, which might mean the port count hasn't changed. I don't know if its bandwidth has changed, but the fuzzily pictured width of that interface doesn't seem to be much different.
The L/S section appears relatively narrower compared to the sections that did grow, which could indicate it has been slightly modified.
There are a few duplicated/grown structures, which might be queues for loads and stores. My die-shot-fu isn't good enough to know which one is which. The more obviously duplicated structure may be a pair of store queues.

It looks largely the same. I think I see some tables near the border of the FPU that look larger, but the LSU is a real mess. I don't have an aid to dissect it either.

And at this point, I'm too bored to look at the FPU.

Opteron said:
Where are the µCode-ROMs and why is there nothing to see of the doubled-branch-prediction logic?

I don't think I've seen anything stating that the BPU logic is doubling.

Opteron · Jun 3, 2013

Homeles said:
One final thing of note: L1D appears to be 2 x ~21 or ~22KB. It's an "odd" number, but let me explain how I came to this conclusion.

To me it is "just" doubled ... I zoomed into that pic:
http://images.bit-tech.net/content_images/2012/11/amd-fx-8350-review/piledriver-b.jpg
and I see also 4 rows for Piledriver.

And at this point, I'm too bored to look at the FPU.

I did it, see the thread at S|A. The rest is boring ;-)
To me it looked like they used the high-density libraries. If you scale the official AMD demonstration pic so that the retire banks have the same size, then the width of SR's unit is identical, it is just deeper/wider, due to the 2 additional banks, probably for 2 additional threads.

I don't think I've seen anything stating that the BPU logic is doubling.

From the leaked information of BSN:

- Store to load forwarding optimization
- Dispatch and retire up to 2 stores per cycle
- Improved memfile, from last 3 stores to last 8 stores, and allow tracking of dependent stack operations.
- Load queue (LDQ) size increased to 48, from 44.
- Store queue (STQ) size increased to 32, from 24.
- Increase dispatch bandwidth to 8 INT ops per cycle (4 to each core), from 4 INT ops per cycle (4 to just 1 core). 4 ops per cycle per core remains unchanged.
- Accelerate SYSCALL/SYSRET.
- Increased L2 BTB size from 5K to 10K and from 8 to 16 banks.
- Improved loop prediction.
- Increase PFB from 8 to 16 entries; the 8 additional entries can be used either for prefetch or as a loop buffer.
- Increase snoop tag throughput.
- Change from 4 to 3 FP pipe stages.

Is this AMD Steamroller?

Similar threads