Jaguar has an on-chip south bridge?

edit: ehm... ok, I googled...
--> I actually looked into Jaguar a little more for the first time. It's really puny, based on it being perhaps a 15% faster Bobcat. I would say it's somewhere in the same ballpark as Intel's Atom.

Jaguar is 15% faster in integer code compared to Bobcat, but it has also doubled the peak flops compared to Bobcat (double-wide SIMD units and doubled SIMD bandwidth per core). Bobcat was already beating Atom in benchmarks, so Jaguar should be over 2x faster than Atom in flops-heavy workloads. And on top of that, it also doubled the core count from Bobcat (there are no 4-core Atoms currently on the market to challenge Jaguar either).
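The peak-flops claim above can be sanity-checked with some back-of-envelope arithmetic. A minimal sketch, where the 1.6 GHz clock and the per-core flops/cycle figures are illustrative assumptions (Bobcat's narrower units taken as 4 SP flops/cycle, Jaguar's doubled SIMD width as 8), not official specs:

```python
# Back-of-envelope theoretical peak: GFLOPS = cores * flops/cycle * GHz.
# Clock and flops/cycle values below are assumptions for illustration.

def peak_gflops(cores, flops_per_cycle, ghz):
    return cores * flops_per_cycle * ghz

bobcat = peak_gflops(cores=2, flops_per_cycle=4, ghz=1.6)  # 12.8 GFLOPS
jaguar = peak_gflops(cores=4, flops_per_cycle=8, ghz=1.6)  # 51.2 GFLOPS
# Doubled SIMD width x doubled core count -> 4x peak at the same clock.
```

The point of the exercise: even before clock differences, width and core count alone give Jaguar a 4x paper advantage over Bobcat.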
--> Trinity has MUCH better single-thread performance than Jaguar. Everything that is not heavily threaded will work much faster on Trinity.

If you compare a Jaguar core to a Piledriver core (Trinity), Jaguar doesn't look that bad really.
--> I wonder if AMD can customize the L2 cache clock depending on target power/performance. Bobcat had a 1/2 speed L2 and I'm under the impression Jaguar for tablets and nettops will continue that trend. At least some reports are suggesting AMD can run the L2 at full clock, which for a console design is very desirable.

Well, AMD has stated that the L2 works at half speed (not the bus interface though).
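A half-clock L2 roughly doubles the load-to-use latency as seen by the core, which is why a full-clock L2 would matter for a console part. A toy illustration, where the 8-cycle L2 pipeline depth and the clocks are made-up assumptions:

```python
# Toy model: L2 latency in *core* cycles for a given L2 clock.
# l2_pipeline_cycles is the access time in L2 clock cycles (assumed).

def l2_latency_core_cycles(l2_pipeline_cycles, core_ghz, l2_ghz):
    # Each L2 cycle spans core_ghz / l2_ghz core cycles.
    return l2_pipeline_cycles * core_ghz / l2_ghz

half_clock = l2_latency_core_cycles(8, core_ghz=1.6, l2_ghz=0.8)  # 16.0
full_clock = l2_latency_core_cycles(8, core_ghz=1.6, l2_ghz=1.6)  # 8.0
```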
--> The IPC of Piledriver and Jaguar CPU cores should be pretty close.
--> You left out that Bobcat can only decode/issue 2 instructions per cycle, while PD can decode 4 and issue 4 to an integer core and/or 4 to the FPU.

That is true, but comparing 4 Jaguar cores vs. 2 Piledriver modules the overall decode throughput is indeed the same. Of course, if it is running only one thread (per module) then it should be better on Piledriver; OTOH if you run into instructions which need the microcode decoder (when there are two threads per module), Jaguar might be better (as it won't block the other thread).
--> And Bobcat can only do 1 load + 1 store per cycle, while PD can do 2 loads (I don't think it can actually do 2 stores though). These are not small differences - mainly, being able to support a load/store or two in conjunction with two ALU/branch/multiply/etc is a big deal, especially for x86.

PD can do either two loads or one load + one store per cycle (the cache has only two ports), whereas Jaguar is limited to one load + one store. This is not really that much of a difference, but yes, PD is better. I don't think it's that much of an issue though; essentially that's the same capability for Jaguar as Intel had up to Nehalem (whereas SNB/IVB now are more like PD in that regard, either two loads or one load + one store).
--> Even in FPU-heavy code it's nice to be able to issue at least one integer instruction in addition to two FP ops for flow control/pointer arithmetic/etc. And PD's FPU is more flexible even w/o FMA code because it can do either 2 FMULs or 2 FADDs per cycle instead of just one of each.

I thought Bobcat (and Jaguar) could issue more than 2 uops per clock as well (up to 6? - 2 each of integer, load/store, SIMD), as long as there are enough ops in the queues (obviously the decoder couldn't feed that). Maybe it can only retire 2 per clock though; K8 had serious restrictions there as well. I might be totally wrong here.
--> PD also undoubtedly has much bigger OoO resources, and probably better load/store disambiguation.

No doubt it has more OoO resources, but even Bobcat is quite respectable (e.g. the int PRF size is 64 for Bobcat and 96 for BD - not sure if it's the same for PD), though yeah, Bobcat has very few entries in the int/address/SIMD schedulers (Jaguar should increase that). Bobcat's load/store unit is also quite robust (some AMD paper stated it's more advanced than what any other AMD CPU had at that time, so probably better than what K10 had).
--> As far as L1D is concerned, Jaguar does have the bigger cache but loses in associativity (2-way vs 4-way), which is a liability on some workloads.

In contrast though, Bobcat doesn't suffer in some workloads from an L1 write-through cache like BD does.
--> And from the test numbers I've seen its L2 is not just lower bandwidth but at least as high latency.

AMD stated 17 cycles for Bobcat's L2 (but Jaguar could be different) and 18-20 for BD (though the latter may include the L1 latency; not sure about the former - in any case the numbers don't look too different).
--> You left out that Bobcat can only decode/issue 2 instructions per cycle, while PD can decode 4 and issue 4 to an integer core and/or 4 to the FPU.

The BD/PD decoder is shared between two cores. A module can decode 4 uops per cycle, but the decoder is time-sliced (every other cycle) between the two cores.
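The time-slicing argument above can be put into a toy model. This is only a sketch under the simplifying assumption that the shared decoder strictly alternates between the two cores when both are busy (real behavior has more corner cases):

```python
# Sketch: average decode slots available per core per cycle.
# Assumption: the shared BD/PD decoder serves each core on alternating
# cycles when both cores are active, and fully serves a lone thread.

def avg_decode_slots(decode_width, cores_sharing, both_cores_active):
    if both_cores_active:
        return decode_width / cores_sharing
    return decode_width

pd_both_active = avg_decode_slots(4, 2, True)   # 2.0 slots/core/cycle
pd_one_active  = avg_decode_slots(4, 2, False)  # 4 for the lone thread
jaguar         = avg_decode_slots(2, 1, True)   # 2.0, dedicated decoder
```

Under that assumption, a fully loaded PD module averages out to the same 2 decode slots per core per cycle that Jaguar's dedicated decoder provides, while a lone thread on PD gets the full 4.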
--> And PD's FPU is more flexible even w/o FMA code because it can do either 2 FMULs or 2 FADDs per cycle instead of just one of each.

Again, PD's FPU is shared between two cores. If one core does 2 FMUL/FADD/FMA per cycle, the other core can do nothing. If resources are evenly split, both Jaguar and PD have equal flops throughput. However, PD needs FMA code to reach its peak, while Jaguar can do that on old code (with separate muls and adds). So Jaguar should have better peak performance in current/legacy code (FMA3/FMA4 usage in applications/games is still very low, partly because there are two implementations that are not compatible).
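The legacy-vs-FMA argument reduces to a bit of arithmetic. A sketch assuming 128-bit pipes (4 SP lanes each) and an even split of the shared module resources; the pipe counts are as argued in the thread, not verified specs:

```python
# Sketch: peak single-precision flops per cycle per core, for "legacy"
# code (separate mul + add) vs FMA code. Assumption: 128-bit pipes.

SP_LANES = 4  # SP lanes in a 128-bit pipe

# PD module: two FMA-capable 128-bit pipes shared by the two cores.
pd_module_fma    = 2 * SP_LANES * 2   # an FMA counts as 2 flops -> 16
pd_module_legacy = 2 * SP_LANES       # 2 muls *or* 2 adds -> 8
pd_core_fma      = pd_module_fma / 2      # 8.0 per core, evenly split
pd_core_legacy   = pd_module_legacy / 2   # 4.0 per core

# Jaguar core: one 4-lane mul pipe plus one 4-lane add pipe, unshared.
jaguar_core_legacy = SP_LANES + SP_LANES  # 8 per core on mul+add code
```

So on mul+add code a fully loaded Jaguar core has twice the paper peak of a PD core sharing its module; PD only draws level per-core once the code uses FMA.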
--> PD also undoubtedly has much bigger OoO resources, and probably better load/store disambiguation.

Yes, that's likely true; however AMD improved OoO execution on Jaguar as well. According to http://semiaccurate.com/2012/08/29/another-nugget-on-amds-jaguar/ the scheduler can handle more entries and the core has larger reorder buffers.
--> Well AMD has stated that the L2 works at half speed (not the bus interface though).

Can you quote where you got this information? According to semiaccurate.com "the caches can run at half clock to save power when needed". If I understood that correctly, the caches can be dynamically downclocked to save power (when CPU load is light).
mczak said:
That is true but comparing 4 Jaguar cores vs. 2 Piledriver modules the overall decode throughput is indeed the same. Of course if it is running only one thread (per module) then it should be better on Piledriver, OTOH if you run into instructions (when there are two threads per module) which need the microcode decoder Jaguar might be better (as it won't block the other thread).

I thought Bobcat (and Jaguar) could issue more than 2 uops per clock as well (up to 6?, each 2 of integer, load/store, simd), as long as there are enough ops in the queues (obviously the decoder couldn't feed that). Maybe it can only retire 2 per clock though, K8 had serious restrictions there as well. I might be totally wrong here.
sebbbi said:
Already discussed on the last page, but a short recap:
Bobcat vs K10 (Athlon II X4 630 downclocked to 1.6 GHz):
http://www.xtremesystems.org/forums/...t-vs-K10-vs-K8
In generic integer calculations Bobcat has 5% slower IPC than a similarly clocked K10. The slightly improved Stars core (Llano) has a few percent higher IPC than K10. Bulldozer has slightly worse IPC than Stars, and Piledriver has pretty much equal IPC to Stars.
--> Can you quote where you got this information? According to semiaccurate.com "the caches can run at half clock to save power when needed". If I understood that correctly, the caches can be dynamically downclocked to save power (when CPU load is light).

I read it on Hardware.fr.
--> PD will be able to achieve better flexibility by alternating between threads as it decodes. This is relevant because it means that when one core can't utilize the extra decode bandwidth (due to the OoO window being backed up) the other core gets better decode bandwidth. Assuming that that's how PD works, anyway.

That's true... in theory. But the shared decoder is assumed to be one of the key bottlenecks in the BD/PD architecture. In Steamroller (http://www.anandtech.com/show/6201/amd-details-its-3rd-gen-steamroller-architecture) the biggest change AMD is making is moving back to a separate decoder per core (just like they had in earlier designs such as K10/Stars and Bobcat/Jaguar). AMD claims Steamroller is going to have 30% improved IPC, so it seems that the shared decoder wasn't such a good idea.
--> The comparison that forum poster makes is poor. The benchmarks are not well chosen and his classifications aren't even correct. A few synthetics like Sandra, 3DMark, and "speedtraq" don't make for a comprehensive comparison of CPU performance.

Ok, let's use AMD's official slides then. Bobcat slides (years ago) claimed 90% of the IPC compared to their desktop parts (K10 Phenom at that time). Bobcat was basically a simplified K10 core. They removed the third ALU pipeline because it had a very low utilization ratio (the same reason was given for why it isn't present in BD). They also removed everything that allowed the core to clock high (lots of extra pipeline stages are needed to reach high clocks). Their focus wasn't on reducing IPC; they tried to keep it close to the desktop CPUs (90% was the goal according to the slides).
This belief that Jaguar can surpass Llano's IPC is sorely lacking in any kind of architectural justification.
--> Ok, lets use AMDs official slides then. Bobcat slides (years ago) claimed 90% of the IPC compared to their desktop parts (K10 Phenom at that time).
--> That's true... in theory. But the shared decoder has assumed to be one of the key bottlenecks in BD/PD architecture. In Steamroller (http://www.anandtech.com/show/6201/amd-details-its-3rd-gen-steamroller-architecture) the biggest change AMD is doing is moving back to having a separate decoder per core (just like they had in their earlier designs such as K10/Stars and Bobcat/Jaguar). AMD claims Steamroller is going to have 30% improved IPC, so it seems that shared decoder wasn't such a good idea.
--> Bobcat was basically a simplified K10 core. They removed the third ALU pipeline, because it had very low utilization ratio (same reason was given why it wasn't present in BD). They also removed everything that allowed the core to clock high (lots of extra pipeline stages are needed to reach high clocks). Their focus wasn't on reducing IPC, they tried to keep it close to the desktop CPUs (90% was the goal according to the slides).
--> Totally false comparison. Yes, the shared decoder is worse than independent 4-wide ones. No, that doesn't make it the same as two independent 2-wide ones like in Jaguar.

You misunderstood me. I wasn't claiming that the 2-wide Jaguar decoder is in any way comparable to the 4-wide Steamroller decoder. I was just using Steamroller as an example of why the shared 4-wide decoder in BD/PD is a likely bottleneck, as the 2x independent decoders are the largest announced change in Steamroller and we see a big 30% IPC gain.
--> That's not even remotely true, it's a totally different core. There's way more different than a third ALU + AGU.

The designs are different, but there hasn't been a "totally different core" in modern CPU & GPU design for a long time (even the Xeon Phi core was based on the old Pentium design). AMD has always reused lots of building blocks. When you compare the front ends, caches, branch predictors, execution units, etc., you will see a lot of similarity to previous AMD designs.
--> It wasn't clear what they were comparing against, but I'd rather use real performance numbers from useful software and not a lay-person's small collection of synthetic benchmarks nor company marketing slides that are meant to portray the product in as positive a light as possible.

That's true. I shouldn't use web benchmarks that much. It would be interesting to see how a Jaguar-based APU compares to Llano. A 1.84 GHz Jaguar (quad core) shouldn't beat any of the current Llano options (there are no sub-1.6 GHz models), but with some underclocking of the Llano part we should be able to put this debate to rest.
--> The designs are different, but there haven't been a "totally different core" in modern CPU & GPU design for a long time (even Xeon Phi core was based on old Pentium design).
--> AMD has always reused lots of building blocks. When you compare the front ends, caches, branch predictors, execution units, etc, you will see lot of similarity to previous AMD designs.
--> b) Bobcat has totally new FPU (power optimized, slow x87, no 3dnow).
Why do you consider the Bobcat FPU slow with x87? As far as I can tell it is exactly as fast as you'd expect given its SIMD float capabilities. It even still has "free" fxch and, unlike Atom (which seems to suffer from some design flaw), it's got essentially the same instruction throughput as when using (scalar) SSE.
Oh, and dropping 3dnow is hardly relevant here; ripping that out should be fairly trivial and quite orthogonal to how the design of the FPU looks (and it probably won't save many transistors either).
--> Performance with 80-bit fp numbers is much slower, that's what I mean (latency 5, throughput 1 op per 3 cycles, vs 4/1).

First time I heard of that. Source?
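Taking the figures quoted above at face value (reading "3op/cycles" as one op every 3 cycles, and "4/1" as latency 4 with one op per cycle), a quick sketch of what they would mean for a stream of independent FP operations:

```python
# Cost sketch for a stream of independent FP ops:
# cycles ~= latency + (n_ops - 1) * cycles_per_op (steady-state rate).
# The latency/throughput numbers are the ones claimed in the post,
# taken at face value, not measured values.

def stream_cycles(n_ops, latency, cycles_per_op):
    return latency + (n_ops - 1) * cycles_per_op

x87_80bit  = stream_cycles(100, latency=5, cycles_per_op=3)  # 302 cycles
sse_scalar = stream_cycles(100, latency=4, cycles_per_op=1)  # 103 cycles
```

If those figures held, 80-bit x87 throughput would be roughly 3x worse than scalar SSE once the pipeline is full, which is why the claim drew a "Source?" in reply.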