AMD Bulldozer Core Patent Diagrams

L3 latency isn't the issue. It's not great, but it's still way faster than main memory. It's L3 throughput (especially write) that's the issue. The L2s are large and the L3 is an eviction cache; things are fetched into L1 and L2. But the L3 throughput...

http://www.vmodtech.com/main/wp-con...2133c9d-16gxh-with-amd-fx-8350/aida64-mem.jpg
http://cdn.overclock.net/e/e4/350x700px-LL-e4eb580f_cachememtest.png


What I want to know is: is the L1 "broken"? A 6:1 ratio of read to write seems kinda pointless to me.

That does seem like a major stumbling block. One of the slides for SR mentioned that store forwarding was improved so this might be more balanced now.
 
What I want to know is: is the L1 "broken"? A 6:1 ratio of read to write seems kinda pointless to me.
That's just a result of the L1 being a write-through cache. So an L1 write is always writing to L2 too (the number is higher than L2 write presumably because an L2 write would also include an L2->L1 read). And yes, this appears to be a weak point, though it's unclear how much it contributes to the generally lackluster performance of BD. I don't know if SR changes any of this, as it was probably chosen for a reason, but one "easy" "fix" would be doubling L2 throughput.
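To make the write-through point concrete, here is a toy sketch in Python that counts how many writes reach L2 for the same stream of stores under each policy. Everything in it is an assumption for illustration: the function name `l2_writes`, the 64-byte line size, an L1 that never evicts, and no write combining between L1 and L2. It says nothing about where the measured 6:1 figure comes from; it only shows why a write-through L1 ties store bandwidth to L2.

```python
def l2_writes(stores, policy, line_size=64):
    """Count writes that reach L2 for a list of store addresses.

    Toy model (assumed, not BD's real behaviour): the L1 never evicts,
    so write-back pushes each dirty line to L2 once at a final flush,
    while write-through forwards every single store.
    """
    if policy == "write-through":
        return len(stores)
    if policy == "write-back":
        dirty_lines = {addr // line_size for addr in stores}
        return len(dirty_lines)
    raise ValueError(f"unknown policy: {policy}")


# 1024 sequential 8-byte stores covering 8 KiB
stores = [i * 8 for i in range(1024)]
wt = l2_writes(stores, "write-through")
wb = l2_writes(stores, "write-back")
print(f"write-through: {wt} L2 writes, write-back: {wb} L2 writes")
```

In this model every store costs an L2 write under write-through, so sustained L1 store throughput can't exceed what L2 accepts, which is why doubling L2 throughput would help.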
 
The watch.impress segment has some interesting breakdowns of the chip.
Notably, the number of HVT transistors drops massively from 32nm to 28nm.

The overall shift to having more nominal Vt transistors and a larger proportion being regular-length seems to match up with the general premise that Steamroller's top end is somewhat lower, so more can be put in the nominal pool than before.
However, the drop in HVT was such that I wonder if it had to do with some quirk like dropping SOI.
The leakage numbers show a generally more leakage-resistant process, except, again, for the fastest and leakiest transistors.


Electronics Weekly had one sentence mentioning resonant clocking, but is it any more likely to actually show up this time, given its non-appearance in Piledriver?

Another blurb is the mention of the vdroop-detecting clocking scheme.
These dynamic schemes echo Intel's Foxton technology more than ever these days.
Intel has a vdroop-aware clock scheme for its experimental graphics core.
AMD seems to be using it to keep things functional at regular voltages, while Intel's is for near-threshold operation.

edit:

One thing I forgot to comment on was the number of custom macros for Steamroller.
It has an order of magnitude more than AMD's Jaguar.
Part of that may go to the requirements for Steamroller's per-core performance range, as well as the historical tie that architectural line has with the old AMD fabs.
I would wonder if a Bulldozer-derived core would ever be found on a non-GF process with that level of specificity (and since Jaguar with much less hasn't been hopping fabs), and whether that could have been what scared Sony away from the rumored Steamroller PS4.
 