Trinity vs Ivy Bridge

The "ONLY example"?
Like, from the whole TWO (2) games that were tested in that article (Bulletstorm and SC2), one was favourable to multi-GPU and the other one wasn't.
:LOL:
Oh, so the Battlefield test that started the article doesn't count? Research! http://techreport.com/articles.x/21516/5

Nonetheless, it's funny how you keep running away from the v-sync argument.
I didn't "run away" from anything, your stated assertion is that everyone who isn't using VSYNC is automatically wrong and stupid and doesn't count. Hate to break it to you, but your assertion is not universal and thus does not negate the needs of people who are interested in playing without VSYNC. Or perhaps they are playing with VSYNC, but they're doing it on a 120HZ monitor (which is sufficiently high to exhibit microstuttering.)

As it seems, you probably threw out your perfectly fine CF setup and made a downgrade because you didn't know that v-sync would solve most of your problems.
Assumptions make an ass of you, not me. My second 4850 was damaged in transit by the movers who packed and moved my things during my corporate relocation from KY to CA, as were my case and a hard disk. I gave the surviving 4850 to my brother and replaced them both with a single 5850, which had effectively the same performance (when CF worked) and far better performance (when CF didn't).

Oh, and I always play with VSYNC, without exception.
 
swaaye, the reason is simple:
Trinity is capable of using its iGPU to enhance the performance of a low-cost discrete GPU (Turks).
Ivy Bridge isn't.
Therefore, it's normal that someone could consider this as an advantage.

Gipsel, Andrew Lauritzen was talking about input latency, not frame-to-screen latency. He meant keyboard+mouse, afaict.
 
http://www.legitreviews.com/article/1928/2/

Trinity sure thumps the 55W $1K 3920XM in Diablo 3 at least.

Isn't this a no-brainer? You could make the same claim about Trinity beating a $200 Quad Mobile Ivy Bridge system too, because it would be functionally identical. But that doesn't make sensationalist headlines, does it?

Yes, HD4000 isn't going to have the same graphics grunt that Trinity does when talking about fully feature-enabled high fillrate games. We know. It's bizarre that HD4000 beats Trinity in any single case at all, TBH. But grandstanding with the "OMG IT BEATS A $1000 PROCESSOR" is a bit dumb.

Tell ya what: throw an HD7850 into both of those systems and let's see who wins :)
 
Gipsel, Andrew Lauritzen was talking about input latency, not frame-to-screen latency. He meant keyboard+mouse, afaict.
He meant input latency as the delay between a user action and the time when the result becomes visible on the screen. All latencies and delays in the process are kind of additive, don't you think?
(i) you press a key or move the mouse
(ii) game engine updates the state reflecting that user input and initiates rendering of the scene
(iii) GPU renders the scene (probably into a queue of buffers)
(iv) buffer flip (vsync adds a delay before this)
(v) you see the effect of your action on screen (some TVs manage to introduce another significant delay here for the motion estimation / interframe calculation or whatever)
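To put some rough, made-up numbers on those five stages (purely illustrative values, not measurements of any particular system), the sum looks something like this:

Code:
# Back-of-the-envelope sum of the stages (i)-(v) above. Every figure is an
# assumed, typical-ish value purely for illustration, not a measurement.
input_sampling = 8.0    # (i)   input sampling / USB polling, ms
engine_update  = 16.7   # (ii)  game logic runs once per frame at 60 fps, ms
gpu_render     = 16.7   # (iii) one frame of GPU time; a deeper buffer queue adds whole frames, ms
vsync_wait     = 8.0    # (iv)  average wait for the next refresh at 60 Hz with vsync, ms
display_lag    = 30.0   # (v)   TV post-processing / motion interpolation, ms

total = input_sampling + engine_update + gpu_render + vsync_wait + display_lag
print("button press -> photons: roughly %.0f ms" % total)   # ~79 ms with these numbers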

But I guess we should stop this OT stuff. It leads nowhere.
 
Trinity results using OpenCL-accelerated Handbrake (didn't see this mentioned).
Link
I didn't see any IQ comparisons though.
 
But I guess we should stop this OT stuff. It leads nowhere.

A novel idea!!!

To conclude: some people like CF, some don't. If you feel this horse has not been beaten enough, please make a new thread.
 
Both of you are pathetic.

I've been called worse, actually much worse, and I've lived :)

I think the original question that started this diatribe has merit though: how does HD4000 ever manage to win against Trinity, at least when speaking about graphically robust games? I get the OpenCL stuff to a certain degree, but I would never expect HD4000 to pull ahead in actual gameplay. Crossfire certainly has at least some applicability here, in that more than a few of the Trinity designs will find themselves in a Hybrid CrossFire configuration. Will the performance be so lopsided in those scenarios as to cause even worse problems? (A tragically slow Trinity GPU combined with a reasonably fast mGPU brother?)

Being a driver issue might leave wiggle room for hope; being a larger pointer to an architecture starved for bandwidth or power is room for concern.
 
I'm curious about the VLIW4 aspect. Do you guys think they went this route because VLIW4 is space/transistor-efficient, or because it takes longer to design a CPU+GPU and GCN wasn't far enough along? I'm guessing it takes longer to design a CPU+GPU.

Also, how does TSMC 28nm compare to GF 32nm? I suppose that's a horribly complicated question.

I suppose they wanted GCN only with the tighter integration we'll see on the Steamroller APU; that means a new iteration of the CPU and a new iteration of GCN are both needed.
Not doing so would leave you with a "bad GCN" APU and a "good GCN" APU, which would be a nightmare to support and market.
 
Isn't this a no-brainer? You could make the same claim about Trinity beating a $200 Quad Mobile Ivy Bridge system too, because it would be functionally identical. But that doesn't make sensationalist headlines, does it?

Yes, HD4000 isn't going to have the same graphics grunt that Trinity does when talking about fully feature-enabled high fillrate games. We know. It's bizarre that HD4000 beats Trinity in any single case at all, TBH. But grandstanding with the "OMG IT BEATS A $1000 PROCESSOR" is a bit dumb.

Tell ya what: throw an HD7850 into both of those systems and let's see who wins :)

It's not so much about the price (which is just marketing anyway) as it is about TDP, at least as far as I'm concerned.

That Trinity is a 35W part, while the Core i7 is a 55W one. Since base graphics clocks vary with power, and they both feature dynamic bi-directional power-management, that makes a big difference.
 
I'm curious about the VLIW4 aspect. Do you guys think they went this route because VLIW4 is space/transistor-efficient, or because it takes longer to design a CPU+GPU and GCN wasn't far enough along? I'm guessing it takes longer to design a CPU+GPU.

Also, how does TSMC 28nm compare to GF 32nm? I suppose that's a horribly complicated question.

It's probably just a scheduling thing. Llano was on VLIW5 even though VLIW4 was already out, Trinity is on VLIW4 even though GCN is already out, and Kaveri will be on GCN even though Sea Islands will be out by then.

That's just the price of integration: a one-year lag in graphics technology.
 
It's probably just a scheduling thing. Llano was on VLIW5 even though VLIW4 was already out, Trinity is on VLIW4 even though GCN is already out, and Kaveri will be on GCN even though Sea Islands will be out by then.

That's just the price of integration: a one-year lag in graphics technology.

Retooling a graphics chip from TSMC's 40nm bulk process to GF's 32nm SOI process is a rather large undertaking.
The former was designed with a lot of input specifically from AMD's graphics division, with graphics chip designs (relatively low clock rates, high logic density) understood as a major use case, while the latter wasn't.

I have high hopes for Kaveri though, as both GCN(+) and Steamroller are being/have been designed natively for the 28nm bulk process (notice the lack of a Steamroller-based successor to the 32nm SOI Vishera on AMD roadmaps).
 
That Trinity is a 35W part, while the Core i7 is a 55W one. Since base graphics clocks vary with power, and they both feature dynamic bi-directional power-management, that makes a big difference.


The 3720QM features a TDP of 45W. On Sandy Bridge, there wasn't a difference in GPU performance between the 35W dual-core and the 45W quad-core, apart from the slightly higher-clocked GPU in the 45W quad. I expect only a small difference with Ivy Bridge there as well. Maybe the gap will increase from 20% to 30% on average or so, but that's it.
 
Retooling a graphics chip from TSMC's 40nm bulk process to GF's 32nm SOI process is a rather large undertaking.
The former was designed with a lot of input specifically from AMD's graphics division, with graphics chip designs (relatively low clock rates, high logic density) understood as a major use case, while the latter wasn't.

I have high hopes for Kaveri though, as both GCN(+) and Steamroller are being/have been designed natively for the 28nm bulk process (notice the lack of a Steamroller-based successor to the 32nm SOI Vishera on AMD roadmaps).

GCN was designed for TSMC's process, not GloFo's, so I'm not sure porting it will be much easier than it was for VLIW5/4. Incidentally, Kaveri probably should have taped out by now; maybe AMD will start talking a bit more about it soon.
 
Yep, we should be seeing a demo of Kaveri within a month, assuming it's all going to plan. I just realised that AMD has actually pulled in a month vs the Llano launch, which was in mid-June last year (and with bad availability).
 
Having a third port to handle a branch allows for things like handling a branch and a MUL at the same time. This is on top of a 50% advantage in most other integer ops.
Every BD core is 'underpowered & undersized' so that a module compares roughly to an SB core (from rough measurement, a BD module seems 25% bigger than an SB core).
However, in BD you'd have 4 ALUs running every cycle, compared to SB's 3, and with more memory operands (4R vs 2R). Even now, with the lower rate measured by Agner (around 1.8 per core), you can make something like 3.6 memory ops/cycle per module, which is more than 2.

Without optimizations or tricks such as eliminating moves or allowing them to run on the AGU ports, there is no starvation because the back end is effectively 2-wide 90% of the time and the actual ALU ports get stuck with a MOV when they could be doing something else.
You miss the point: the front end sets the upper limit. If it can't pump at least 2 MOPs/cycle/core on average, but barely ~1.x, the processor will never be able to run faster than that, no matter what optimization or AGLU renaming trick you use.
Now, you get 2 MOPs/cycle *IF* you can decode a perfect 2-1-1/1-1-1-1 sequence every single cycle. Every hiccup in it causes huge penalties to your MOPs/core bandwidth. Forget getting such perfect sequences, so you end up with a MOP bandwidth that is usually around 1-point-something. BD doesn't even need to move MOPs to the AGLUs, since the MOP bandwidth is so low that... they would sit unused all the time, unless running single-core with a full decoder to itself.
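To illustrate the point with a toy model (made-up probabilities, not the real decode rules, just a sketch): a 4-wide decoder shared between the two cores, alternating every cycle, with some instructions cracking into two MOPs and decode groups ending early at taken branches, already sits well below the 2 MOPs/cycle/core peak:

Code:
# Toy model of a 4-wide decoder shared by the two cores of a module, serving
# one core per cycle in alternation. The crack/branch probabilities are made
# up for illustration; this is NOT the real Bulldozer decode behaviour, just a
# sketch of why the sustained rate lands at "1-point-something" per core.
import random

def sustained_mops_per_core(cycles=200_000, p_double=0.15, p_taken_branch=0.15, width=4):
    random.seed(0)
    delivered = 0                            # MOPs handed to either core
    for _ in range(cycles):
        slots = width                        # this cycle the decoder serves one core
        while slots > 0:
            cost = 2 if random.random() < p_double else 1   # "double" op -> 2 MOPs / 2 slots
            if cost > slots:
                break                        # next op doesn't fit in this decode group
            slots -= cost
            delivered += cost
            if random.random() < p_taken_branch:
                break                        # taken branch ends the decode group
    return delivered / cycles / 2            # each core is served every other cycle

print("~%.2f MOPs/cycle/core sustained (peak would be 2.0)" % sustained_mops_per_core())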

BD looks to me like a questionable (early?) implementation of what seems a nice overall architecture. If AMD is able to reliably pump around 2 MOPs/cycle/core, we'll see a nice speedup of AMD processors.
 
Every BD core is 'underpowered & undersized' so that a module compares roughly to an SB core (from rough measurement, a BD module seems 25% bigger than an SB core).
However, in BD you'd have 4 ALUs running every cycle, compared to SB's 3, and with more memory operands (4R vs 2R). Even now, with the lower rate measured by Agner (around 1.8 per core), you can make something like 3.6 memory ops/cycle per module, which is more than 2.
The BD performance scenario is more fragile, and its advantage is only realizable in a single best case.
In a one-thread situation, SB beats BD.
In a two-thread situation where both threads can reach 2-wide utilization, BD may beat SB.
In a two-thread situation where one thread can reach 2 or more and the other cannot, SB wins.

Given that SB can still beat BD in various well-threaded integer apps, it seems that the back end can step all over itself with some regularity on top of other problems with the overall system.
A BD core is underpowered, undersized, and because of its cramped per-core capabilities more frequently underutilized.

You miss the point: the front end sets the upper limit. If it can't pump at least 2 MOPs/cycle/core on average, but barely ~1.x, the processor will never be able to run faster than that, no matter what optimization or AGLU renaming trick you use.
That's good because there are enough cases where the back end throttles itself to obscure some amount of the front end shortfall. I don't think AMD would have expanded the move issue rate if this wasn't the case.

Now, you get 2 MOPs/cycle *IF* you can decode a perfect 2-1-1/1-1-1-1 sequence every single cycle. Every hiccup in it causes huge penalties to your MOPs/core bandwidth. Forget getting such perfect sequences, so you end up with a MOP bandwidth that is usually around 1-point-something.
And there are many sequences that decode fine and then step over the two ports they have to work with. The front end's weakness is reflected in the back end's weakness. It balances out in general.

BD doesn't even need to move MOPs to the AGLUs, since the MOP bandwidth is so low that... they would sit unused all the time, unless running single-core with a full decoder to itself.
Why did AMD just expand MOV issue to the AGU ports if it wouldn't have an impact?
Not to mention that the front end can serve one core if the other core's thread has stalled, and as we've learned from the performance numbers, they do that a lot.


BD looks to me like a questionable (early?) implementation of what seems a nice overall architecture. If AMD is able to reliably pump around 2 MOPs/cycle/core, we'll see a nice speedup of AMD processors.
Build 2 decoders.
 
http://www.notebookcheck.net/Trinity-in-Review-AMD-A10-4600M-APU.74852.0.html

There's a hint at the performance of a dual-graphics solution with Trinity in here, with an Asus K75 (A8-4500M + HD7670M).
The iGPU scores are pretty much terrible because the laptop somehow only has a single DDR3 module, so the iGPU's memory bandwidth is halved.
So this HD7640G in the A8-4500M has 256 VLIW4 sp @ 500-650MHz and 12.8GB/s (shared) bandwidth.

These results are from the 3DMark 11 GPU score:

Trinity HD7640G w/ 12.8GB/s shared bandwidth: 651 points
HD7670M : 1050 points
Trinity HD7640G 12.8GB/s + HD7670M: 1647 points

The A8-4500M's GPU with halved bandwidth is pumping up the HD7670M's score by roughly 57%, which seems really good.
The A10-4600M (384 sp, 680MHz) is getting over 1000 GPU points in the reviews.
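
Quick sanity check on that scaling, using the scores quoted above (the review's numbers, my arithmetic):

Code:
# Dual Graphics scaling from the 3DMark 11 GPU scores quoted above.
# The scores are the review's; the derived percentages are just arithmetic.
igpu_alone = 651     # Trinity HD7640G with single-channel DDR3
dgpu_alone = 1050    # HD7670M on its own
dual_gfx   = 1647    # HD7640G + HD7670M in Dual Graphics

gain_over_dgpu = (dual_gfx - dgpu_alone) / dgpu_alone    # ~0.57
share_of_ideal = dual_gfx / (igpu_alone + dgpu_alone)    # ~0.97 of a perfect sum

print("+%.0f%% over the dGPU alone, %.0f%% of the ideal combined score"
      % (gain_over_dgpu * 100, share_of_ideal * 100))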

As a comparison, the most popular GF108/GF118 solutions (GT550M, GT630M) get between 900 and 1000 points.
The most common mini-Kepler GT640M gets 1700 points, and the GT650M gets over 2000 points.


If the Asus K75's results are representative of Crossfire performance, it's possible that an A10-4600M + HD7670M combo gets about the same results as a much more expensive Ivy Bridge + GT650M combo.
Of course, this is all in the 720p 3DMark11 GPU score, where the CPU plays a rather small part.

Nonetheless, we could be seeing a ~750€ Trinity combo that comes close to ~1100€ IB+nVidia combos (I'm talking about some Clevos and the Asus G series, for example) in gaming performance.
And I'm pretty sure that this same ~750€ Trinity combo will far surpass any IB+nVidia combo at the same price in gaming performance.
In v-synced gaming performance where Crossfire works, of course :p
 