Trinity vs Ivy Bridge

I think Trinity looks very good considering that AMD is still using the same 32 nm process, and the chip is pretty much the same size as Llano.

Compared to AMD's last generation (Llano):
+20% CPU performance
+20% GPU performance (gaming)
-15% power usage (normalized battery life)

I don't think we have seen gains this big (in performance per watt) from a new architecture since Pentium 4 -> Core 2. Trinity is a big improvement.
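For what it's worth, here's a quick back-of-the-envelope check of that claim, using only the three figures quoted above (a toy calculation, nothing more):

```python
# Rough perf/watt arithmetic behind the claim above, using only the quoted figures.
perf_gain = 1.20    # +20% CPU/GPU performance vs Llano
power_use = 0.85    # -15% power at normalized battery life
print(f"perf/watt improvement: {perf_gain / power_use - 1:.0%}")   # ~41%
```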

Well yes, my comment was directed at the Piledriver architecture's performance.

Nonetheless, it seems that Trinity's performance advantages come mainly from higher clocks on both the CPU and GPU, plus a Turbo function that actually works.

Still, it's a good upgrade from Llano, and if there are fewer fab limitations this time it'll probably sell lots and lots (outside China).
Let's not forget that even the A10-4500M will probably be priced below dual-core Ivy Bridges and offer a similar overall experience (plus much better gaming performance).


What worries me is that laptop OEMs may keep linking AMD to "low-cost laptops" and forever place their chips into machines with 40 Wh batteries, low-quality plastics, poor TN panels, slow hard drives and god-awful cooling systems that force the APUs to underclock.
 
Well yes, my comment was directed at the Piledriver architecture's performance.
Like you said, AMD itself promoted a '10-15% improvement in performance/watt every year' at the Bulldozer launch. Expecting more than this from AMD over the next 2-3 years is not wise.
 
Like you said, AMD itself promoted a '10-15% improvement in performance/watt every year' at the Bulldozer launch. Expecting more than this from AMD over the next 2-3 years is not wise.

Right now, I only expect AMD to stay competitive at a given TDP or price point. I fear that may not happen in the next 2-3 years if they stick with Bulldozer as the base architecture for their CPU cores.
 
Like you said, AMD itself promoted a '10-15% improvement in performance/watt every year' at the Bulldozer launch. Expecting more than this from AMD over the next 2-3 years is not wise.
That was the plan last year, but VR-Zone has since reported that AMD will make more substantial changes in Steamroller onwards in an attempt to compete with Intel in the top end of the market again.
http://vr-zone.com/articles/amd-to-survive-and-thrive-still-/15564.html
http://www.xbitlabs.com/news/cpu/di...vements_with_Steamroller_Microprocessors.html
 
What worries me is that laptop OEMs may keep linking AMD to "low-cost laptops" and forever place their chips into machines with 40 Wh batteries, low-quality plastics, poor TN panels, slow hard drives and god-awful cooling systems that force the APUs to underclock.

Do they have a choice? I doubt there's a huge market out there willing to pay for an expensive chassis without the best internals. Anand mentions a probable $600 ceiling for Trinity-based laptops.
 
...VR-Zone...
The slides with the 10-15% improvement figures were shown in October 2011, and Steamroller-based products are scheduled for 2013, so you would have ~2 years for a massive architecture change. That simply can't happen :)

On the other hand, sure, the Steamroller architecture is meant to be a much bigger departure from Bulldozer than Piledriver, which is really just a new silicon revision. At least Steamroller is rumored (A. Stiller, c't magazine) to have split the decoder into two independent ones. This bottleneck has been pointed out by many.
 
At least Steamroller is rumored (A. Stiller, c't magazine) to have split the decoder into two independent ones. This bottleneck has been pointed out by many.
Yeah, the BD architecture seems to have outrageous IPC kept dormant by a pesky front end that is even worse than K8/K10's (which was itself front-end and retirement limited).

About the BD architecture... it doesn't seem that bad, at least on paper. It surely has good margins for improvement, whereas that's probably harder for Intel's, which has already undergone several iterations since the old Pentium M design.

Just a side note, reading the AnandTech review... a perceptron?!?! No, really, does that mean AMD engineers actually added a perceptron - a true one?!?!
 
It's nothing to do with drivers. TDP between the GPU and CPU is shared, and power will be steered to whichever element is most "hungry" in that app. You need to look at like-for-like APU power budgets.

I find it a bit strange that the games it loses to Ivy are the same games AMD loses heavily to Nvidia: Batman, Dirt 3, Skyrim, etc. That seems a bit too much to be a coincidence to me.
 
That was the plan last year, but VR-Zone has since reported that AMD will make more substantial changes in Steamroller onwards in an attempt to compete with Intel in the top end of the market again.
http://vr-zone.com/articles/amd-to-survive-and-thrive-still-/15564.html
http://www.xbitlabs.com/news/cpu/di...vements_with_Steamroller_Microprocessors.html
It's AMD's word that this is the case.
We'll have to wait and see evidence that the same design pipeline that spat out BD (minus a slew of the engineers it has fired or lost) can do significantly better.
There is no expectation that the task they face will be any easier, and there are significant headwinds coming that they refuse to address.

Yeah, the BD architecture seems to have outrageous IPC kept dormant by a pesky front end that is even worse than K8/K10's (which was itself front-end and retirement limited).
Where would this outrageous IPC be hiding? The front end is a limiter, and as it turns out even weaker than expected, but there's not enough behind it to say there's a world of untapped performance per clock. That's not what the design targeted.

About the BD architecture... it doesn't seem that bad, at least on paper. It surely has good margins for improvement, whereas that's probably harder for Intel's, which has already undergone several iterations since the old Pentium M design.
On paper, it was described as purposefully trading away performance for the sake of power and design simplification. The latter seems to be a higher priority. Its philosophical underpinnings promise underperformance.

Just a side note, reading the AnandTech review... a perceptron?!?! No, really, does that mean AMD engineers actually added a perceptron - a true one?!?!
I'm trying to find references, but there may have been one added to Bobcat first.
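For anyone wondering what a "perceptron" even means in a branch predictor, here is a minimal sketch in the Jimenez/Lin style; the table size, history length and threshold below are illustrative guesses of mine, not anything AMD has published:

```python
# Toy perceptron branch predictor (Jimenez/Lin style). All parameters are
# illustrative, not taken from any AMD or Bobcat documentation.
HISTORY_LEN = 16                            # global history bits
TABLE_SIZE  = 1024                          # number of perceptrons
THRESHOLD   = int(1.93 * HISTORY_LEN + 14)  # training threshold from the original paper

weights = [[0] * (HISTORY_LEN + 1) for _ in range(TABLE_SIZE)]  # [bias, w1..wN]
history = [1] * HISTORY_LEN                 # +1 = taken, -1 = not taken

def predict(pc):
    w = weights[pc % TABLE_SIZE]
    # bias plus dot product of the weights with the global history
    y = w[0] + sum(wi * hi for wi, hi in zip(w[1:], history))
    return y >= 0, y                        # predicted direction, confidence

def update(pc, taken, y):
    w = weights[pc % TABLE_SIZE]
    t = 1 if taken else -1
    # train only on a misprediction or when confidence is below the threshold
    if (y >= 0) != taken or abs(y) <= THRESHOLD:
        w[0] += t
        for i, hi in enumerate(history):
            w[i + 1] += t * hi
    history.pop()                           # shift the new outcome into history
    history.insert(0, t)
```

The usual selling point over classic two-bit-counter schemes is that storage grows linearly with history length rather than exponentially, so it can learn correlations over much longer histories.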
 
Piledriver, in the form of a follow-up to the current FX-4100, would be good enough for me. Not that I need it much for now; these days I run a dual core with a web browser constantly wasting half of it.

€100 would be more than expensive enough for a CPU, considering that for it to make sense I'd need to upgrade memory, mobo, storage and graphics, plus learn yet more dirty command-line things, etc.
A low-end standalone Piledriver CPU will be better value for me than a Core i3, with the drawback that it eats more power.
 
I find it a bit strange that the games it loses to Ivy are the same games AMD loses heavily to Nvidia: Batman, Dirt 3, Skyrim, etc. That seems a bit too much to be a coincidence to me.

They could be winning for different reasons.
One possible reason in common is that AMD's designs at the high end seem to need a high pixel load to amortize higher overheads. Being limited at the front end seems to come up relatively frequently compared to Nvidia.
Intel's IGP might be better balanced, or its superior CPU and memory architecture is able to churn through the overhead better.

Intel's plans for reducing driver overhead further have been mentioned, so IVB may well improve relative to Trinity over the course of this year.
 
Where would this outrageous IPC be hiding?
It is outrageous not in an absolute sense, but relative: consider that the decoders average ~2 instructions/cycle in the long run, and you have one such decoder alternating between two cores, leaving the CPU starved. In an absolute sense, K10 was the best, given its 9 instructions/cycle potential. Anyway, BD has a 2R1W LSU per core (so 4R2W/cycle for the module, even if the LSU seems to have some problem; maybe the WCC slows it down?), so for a module you have a theoretical 4 ALUs (6 with fused jcc) + 4 AGLUs (which only handle mov r,r / mov r,[m], unlike the K10 design, unfortunately) + 2 FPUs, if you were able to feed the module properly.
Intel's processor is way behind that (4-5 ALU+"AGU" ports, 2R1W per cycle), yet it uses all its resources far better.
BD also pays with a longer pipeline and some instructions that are slower clock-for-clock than Intel's counterparts (plus shared L2 access, a smaller L1D with correspondingly more thrashing, etc...).

Its philosophical underpinnings promise underperformance.
It promises a drop in single-thread performance, and a drop for very FPU-intensive applications that require sharing the FlexFPU.
Yet it promises better overall IPC module-wise.

I'm trying to find references, but there may have been one added to Bobcat first.
Thank you - it would be very interesting to me, I couldn't believe my eyes when I read it.
 
It is outrageous not in an absolute sense, but relative: consider that the decoders average ~2 instructions/cycle, and you have one such decoder alternating between two cores, leaving the CPU starved.
The integer back end is rather constrained. There are situations where the front end can leave it stalled, but it's generally balanced for that anemic 2 ops per cycle.

In an absolute sense, K10 was the best, given its 9 instructions/cycle potential. Anyway, BD has a 2R1W LSU per core (so 4R2W/cycle for the module, even if the LSU seems to have some problem; maybe the WCC slows it down?)
Theoretically. Agner Fog's tests show that BD's read bandwidth suffered from some other penalty, because it couldn't sustain the bandwidth 2 reads per cycle should have garnered, and the write bandwidth to the L2 (required due to write-through) is very bad. Trinity should have doubled the latter, which is still pretty tight considering two cores have to share it.

so for a module you have a theoretical 4 ALUs (6 with fused jcc) + 4 AGLUs (which only handle mov r,r / mov r,[m], unlike the K10 design, unfortunately) + 2 FPUs, if you were able to feed the module properly.
Intel's processor is way behind that (4-5 ALU+"AGU" ports, 2R1W per cycle), yet it uses all its resources far better.
BD also pays with a longer pipeline and some instructions that are slower clock-for-clock than Intel's counterparts.
Why are you comparing two BD cores against one core from the others?

It promises a drop in single-thread performance, and a drop for very FPU-intensive applications that require sharing the FlexFPU.
Yet it promises better overall IPC module-wise.
The common usage of IPC, at least prior to AMD's marketing efforts, referred to instructions issued per cycle for a thread. It was more about straightline speed, not the aggregate number of instructions on the silicon. This is probably due to the term being commonplace prior to multicore solutions.
Weakening the language as AMD did is not an argument from a position of strength.
In general, the design is rife with special cases and crippling glass jaws. It needs code to cater to its many weaknesses, and so whatever utilization it supposedly garners is eclipsed by the loss of peak performance and a world that refuses to cater to a weak architecture.

Thank you - it would be very interesting to me, I couldn't believe my eyes when I read it.
It's a second-hand quote from a presentation.

http://www.realworldtech.com/beta/forums/index.cfm?action=detail&id=126504&threadid=126503&roomid=2
 
And here comes Albuquerque saying it's still not really explicit that the 17W part is a quad-core and the world+dog is just making up facts based on a vague slide.
Or something like that.

A6-4455M
Modules: 1
Cores: 2
TDP: 17W

poker+face.gif


Edit: Back to the topic at hand: Not surprising, and basically what I expected. Low clock, low CPU performance, more GPU than it can use. Sounds great for keeping themselves in the lowest cost bracket, complete with the lowest margins of the entire consumer-grade processor stack. This is a perfect way to keep themselves in just enough business to stay afloat, and not much else.

Sad :(
 
There are situations where the front end can leave it stalled, but it's generally balanced for that anemic 2 ops per cycle.
In order to get that 2 ops/cycle you'd need to decode 4 instructions (assuming single-path) every cycle; since the decoder *alternates* between cores every cycle, each core decodes (up to) 4 instructions every TWO cycles. Let me highly doubt you can get them: say you decode 3 mops/cycle, which is quite high. You then have 2 full cycles to execute those 3 mops, filling 1.5 ALUs, and that assumes no AGLU usage, not even for mov r,r / mov r,[m]. Averaging a decode rate between 2 and 3 mops/cycle, you get your 2+2 pipes quite underused (between 1 and 1.5 mops/cycle!).
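To spell the same arithmetic out (a toy model under the assumptions stated above: a 4-wide shared decoder that serves each core only every other cycle):

```python
# Effective per-core decode rate under the assumptions above: a 4-wide shared
# decoder alternating between the two cores of a module every cycle.
DECODE_WIDTH = 4                                # max mops per cycle from the shared decoder

for sustained in (2.0, 3.0, 4.0):               # plausible sustained decode rates
    per_core = min(sustained, DECODE_WIDTH) / 2 # each core only gets every other cycle
    print(f"{sustained:.0f} mops/cycle sustained -> ~{per_core:.1f} mops/cycle per core")
```

So even at a perfect 4/cycle the two integer ALUs of a core are only just fed, and at a more realistic 2-3/cycle they sit between 1 and 1.5 mops/cycle.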

Why are you comparing two BD cores against one core from the others?
Because of Hyper-Threading. A module runs two threads, and an Intel core runs two threads (granted, using very different architectures). The Intel core uses all its resources as best it can and delivers high single-thread performance, whereas an AMD module should give lower single-thread performance in exchange for higher two-thread IPC.
Unless you consider the AMD 4M/8C a full 8-core processor; but with a shared front end, I don't think it is.


The common usage of IPC, at least prior to AMD's marketing efforts, referred to instructions issued per cycle for a thread.
Well, if you consider mostly simple instructions for IPC, they compare fine, since the MOP output from the decoder is somewhat similar between AMD and Intel (though not equal; e.g. even a push gets decoded differently...).

It needs code to cater to its many weaknesses, and so whatever utilization it supposedly garners is eclipsed by the loss of peak performance and a world that refuses to cater to a weak architecture.
Yeah, this has been a weak point of AMD's all along. If you need to optimize assembly, you go for Intel's processor.
 
In order to get that 2 ops/cycle you'd need to decode 4 instructions (assuming single-path) every cycle; since the decoder *alternates* between cores every cycle, each core decodes (up to) 4 instructions every TWO cycles. Let me highly doubt you can get them: say you decode 3 mops/cycle, which is quite high. You then have 2 full cycles to execute those 3 mops, filling 1.5 ALUs, and that assumes no AGLU usage, not even for mov r,r / mov r,[m]. Averaging a decode rate between 2 and 3 mops/cycle, you get your 2+2 pipes quite underused (between 1 and 1.5 mops/cycle!).
... and there are single-uop instructions and double-uop instructions. The decoder can only decode one double-uop instruction per cycle. Microcoded instructions are even worse, and block the decoder (for both threads) for several cycles (automatically causing pipeline underutilization for both cores, since the max decode rate equals the execution rate). The same is true for instructions with long prefixes (4-7 prefixes = 14-15 extra decoding cycles, again blocking both cores). The CPU also has instruction fusion, but fusion doesn't increase the decoding rate (so fused instructions do not help with the decoding bottleneck). And the 2-way shared L1 instruction cache doesn't help much either...

But there are only 2 integer/logic ALU pipes per core (the 2 other pipes can only do memory operations), so it wouldn't be able to execute more than 2 integer/logic ALU instructions per clock even if the decoder were more capable. It would definitely get closer to 2 ALU ops per cycle, though...
 
... and there are.. [...]
All true, I just cut the lengthy details... Just one more: the presence of double-vector instructions outside a perfect 2-1-1 sequence adds even more stalls to the shared decoder, giving it an even lower throughput when it's shared. Intel has the same problem, of course, yet it suffers less since it does not have a whole extra 'core' to feed.

It would definitely get closer to 2 ALU ops per cycle, though...
With this front end? Hmmm, I doubt it. AMD says it was getting 2/cycle using K10's 3-instruction front end, hence the third, largely unused ALU. SB can consistently sustain 4-5/cycle only when it uses the TC on relatively optimized code... Maybe it will happen when they fix/split the decoder somehow...
 
In order to get that 2 ops/cycle you'd need to decode 4 instructions (assuming single-path) every cycle; since the decoder *alternates* between cores every cycle, each core decodes (up to) 4 instructions every TWO cycles. Let me highly doubt you can get them: say you decode 3 mops/cycle, which is quite high. You then have 2 full cycles to execute those 3 mops, filling 1.5 ALUs, and that assumes no AGLU usage, not even for mov r,r / mov r,[m]. Averaging a decode rate between 2 and 3 mops/cycle, you get your 2+2 pipes quite underused (between 1 and 1.5 mops/cycle!).
Masking the other core off showed performance improvement, although it was pretty modest.
There are just so many other ways to lose utilization that the decoupled front end is one weakness of many.
There are issue restrictions on which EXE pipeline can do what, such as MUL and DIV, and branches can only use one pipeline. In branchy code the core can look 50% thinner on a given cycle.
edit: Also a lack of move elimination, which is more noticeable with the claustrophobic 2 issue slots. Later iterations of the architecture will give the AGU ports the ability to handle moves, though. Intel's design does better.

Because of Hyper-Threading. A module runs two threads, and an Intel core runs two threads (granted, using very different architectures). The Intel core uses all its resources as best it can and delivers high single-thread performance, whereas an AMD module should give lower single-thread performance in exchange for higher two-thread IPC.
For all but the most friendly apps, Bulldozer doesn't provide superior aggregate throughput.

Unless you consider the AMD 4M/8C a full 8-core processor; but with a shared front end, I don't think it is.
The cores have separate memory pipelines, issue, and control hardware.


Well, if you consider mostly simple instructions for IPC, they compare fine, since the MOP output from the decoder is somewhat similar between AMD and Intel (though not equal; e.g. even a push gets decoded differently...).
IPC as traditionally used involves the number of instructions a design can execute for a thread in a cycle. In more general terms, it is what a core can generally manage when given a non-toy workload.
Regardless, actual benchmarks show that those cores wind up stalling more, so even in aggregate terms the number of instructions issued per clock is weak.

Yeah, this has been a weak point of AMD's all along. If you need to optimize assembly, you go for Intel's processor.
The bigger problem is that general-purpose processors have trended towards being resilient enough to not require so much handholding.
Sandy Bridge has a few core optimizations, such as trying to keep instruction counts low enough to fit in the uop cache and the complex and simple decoder arrangement. It's still very strong in non-ideal situations.
Bulldozer has a raft of other problems on top of that and it drops off ideal very quickly.
 
With this front end? Hmmm, I doubt it. AMD says it was getting 2/cycle using K10's 3-instruction front end, hence the third, largely unused ALU. SB can consistently sustain 4-5/cycle only when it uses the TC on relatively optimized code... Maybe it will happen when they fix/split the decoder somehow...
I meant that no matter how much they improved the decoder, with only two ALU execution units per core (the other two doing only memory ops), the integer IPC wouldn't dramatically improve. Of course, if we assume they get somewhere around 1.0 per cycle right now and could reach 1.5 with a (vastly) improved front end, that's 50% extra performance... but that would be overly optimistic... unless they drastically improved the cache system, branch predictors, store forwarding, etc. But it seems these are exactly the things Piledriver improved on, so instruction decode might be an even bigger bottleneck this time, unless they improved that over Bulldozer as well. We need more detailed architecture analysis (than the current reviews).
 