They bothered to put GPU PhysX in titles too; why not this?
While it is possible to stay coherent across PCIe to some extent, it is simply not possible to "run monolithic-style HSA applications" on a PC dGPU, because of PCIe latency.
PCIe bandwidth is fine, and GPU graphics can hide modest latency, but this kind of compute requires low latency, and dGPU compute on a common memory pool will not have it. Tight HSA-style computation means using whichever unit is most appropriate for each piece of work, CPU or GPU. That means moving computation repeatedly back and forth between CPU and GPU, and doing that practically depends on very low hand-off overhead. No conventional dGPU video card can support that, because of PCIe.
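To put rough numbers on that (a back-of-envelope sketch; the hand-off latencies below are assumptions, not measurements), look at how much useful work you keep when every chunk has to bounce to the other side:

def useful_fraction(work_us, handoff_us):
    # Fraction of time spent computing when every 'work_us' microseconds
    # of work must be handed off to the other device.
    return work_us / (work_us + handoff_us)

for work_us in (1, 10, 100, 1000):
    pcie   = useful_fraction(work_us, handoff_us=2.0)   # assumed dGPU dispatch over PCIe
    onchip = useful_fraction(work_us, handoff_us=0.1)   # assumed APU-style shared-memory handoff
    print(f"{work_us:>5} us chunks: PCIe {pcie:6.1%}   on-chip {onchip:6.1%}")

With fine-grained chunks the PCIe case spends most of its time waiting, which is exactly the problem for "run each step wherever it fits best" workloads.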
Looks like they have a preferred die size for these APUs.
And another question: why are the L1 data and L1 instruction caches only 4-way and 3-way, while on Intel processors they are 8-way?
Does it introduce some type of advantage?
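My guess (not from the article): fewer ways means fewer tag comparisons per access, which helps latency and power at high clocks, and as long as size/ways stays at or below the 4 KB page size the set index fits inside the page offset, so the L1 can be virtually indexed without aliasing headaches. A quick sketch of the geometries, assuming 64-byte lines:

PAGE_BYTES = 4096
LINE_BYTES = 64

def geometry(size_kb, ways):
    size = size_kb * 1024
    sets = size // (ways * LINE_BYTES)
    way_bytes = size // ways
    return sets, way_bytes, way_bytes <= PAGE_BYTES  # True -> index fits in the page offset

for name, size_kb, ways in [("Intel L1D (32 KB, 8-way)", 32, 8),
                            ("Steamroller L1D (16 KB, 4-way)", 16, 4),
                            ("Steamroller L1I (96 KB, 3-way, per module)", 96, 3)]:
    sets, way_bytes, vipt_simple = geometry(size_kb, ways)
    print(f"{name:44} {sets:4} sets, {way_bytes // 1024:3} KB/way, "
          f"index within a 4 KB page: {vipt_simple}")

The 3-way instruction cache does blow past the page size per way, but an I-cache is read-only, so dealing with aliasing there is less painful than it would be on the data side.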
Seems like they are trying for higher margins, so die size is capped fairly low.
Hmm, an amped high-clocking Jaguar.
It probably needs some extra pipe stages. With the current 15-cycle mispredict latency we could maybe squeeze in a few more; there's a performance hit, sure, but stretching it to something like 18-19 cycles might not be too bad.
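Rough arithmetic on what those extra stages cost (the branch frequency and mispredict rate below are assumed workload numbers, not anything from the article):

base_cpi        = 1.0    # assumed CPI with perfect branch prediction
branch_freq     = 0.20   # assumed fraction of instructions that are branches
mispredict_rate = 0.05   # assumed fraction of branches mispredicted

for penalty in (15, 18, 19):
    cpi = base_cpi + branch_freq * mispredict_rate * penalty
    print(f"{penalty}-cycle mispredict penalty -> CPI {cpi:.2f}")

With those assumptions, going from 15 to 19 cycles is roughly a 3-4% CPI hit, which a decent clock bump would more than cover.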
So its integer resources would be about 2-wide (roughly two instructions per core per cycle), with about 32 KB of L1 I-cache per core.
A shared, long-latency L2.
It might be hard to get the L1 to clock as high, especially if we want to avoid losing performance to very low associativity. Maybe a 4-way, 16 KB L1 data cache.
The higher-end workloads might enjoy a more flexible load/store arrangement than one dedicated store pipe and one load pipe.
Roughly 8 FLOPs per cycle per core.
Does that sound about right?
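Sounds plausible to me for single precision. A quick peak-rate check on the hypothetical core sketched above (all of these parameters are this thread's guesses, not AMD numbers):

flops_per_cycle = 8     # guessed: 128-bit FP mul + 128-bit FP add, 4 + 4 single-precision FLOPs
cores           = 4

for ghz in (2.0, 3.0, 3.5):
    gflops = cores * flops_per_cycle * ghz
    print(f"{cores} cores x {flops_per_cycle} FLOPs/cycle @ {ghz} GHz = {gflops:.0f} SP GFLOPS peak")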
AMD's cache layouts on core clusters are a lot messier and more haphazard than Intel's nice private L2 + ring bus L3 configuration; compare most post-Nehalem cores to the sprawling Jaguar L2 on each quad cluster, or the Bulldozer L3, which is all over the place.
Making it look that good requires a lot more work.
One point that I've idly wondered about is the loss of memory scaling after DDR3-2133 and the missing two DCTs.
If the northbridge clock isn't rising with the higher memory speeds, it could be bottlenecking on the crossbar ports for the controllers. Earlier AMD chips had the same problem.
A quad-channel solution wouldn't necessarily have the same problem if AMD gave those DCTs their own crossbar endpoints.
The old dual-core A64 had similar issues with memory performance scaling when DDR2 models hit the market. The slow exclusive L2 ultimately limited the bandwidth from system memory to the cores.
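To put numbers on the crossbar-port idea above (both the port width and the northbridge clock here are assumptions I picked so the cap lands just above DDR3-2133; they are illustrative, not documented Kaveri figures):

def dram_gbps(mts, channels=2, bytes_per_channel=8):
    # Peak DRAM bandwidth in GB/s for DDR3 at 'mts' MT/s.
    return mts * channels * bytes_per_channel / 1000.0

def xbar_gbps(nb_ghz=2.2, bytes_per_cycle=16):
    # Assumed per-port crossbar bandwidth at the northbridge clock.
    return nb_ghz * bytes_per_cycle

for mts in (1866, 2133, 2400):
    dram = dram_gbps(mts)
    cap  = xbar_gbps()
    limiter = "crossbar port" if cap < dram else "DRAM"
    print(f"DDR3-{mts}: DRAM peak {dram:5.1f} GB/s vs port cap {cap:4.1f} GB/s -> limited by {limiter}")

Give each DCT its own endpoint, as suggested above, and per-port traffic drops to one channel's worth, which fits comfortably.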
I imagine that looking "nice and tight" is quite directly related to the wiring distance / complexity involved and the resulting latencies that are possible. Intel can even use its L3 for its on-die GPU, whereas AMD has uniformly decided to shelve a shared L3 for its APUs.
I am somewhat disappointed Techreport did not seem to attempt more troubleshooting (different memory? different motherboard? etc.) or consult AMD or other reviewers before publishing that (perhaps they did, but if so there is no note of it in the article).
As it stands, if they did not, it has the effect of (most likely unintentionally) spreading FUD.
Yeah, I agree with kalelovil. Those issues sound like a bad DIMM to me, not a processor failure.
Update — Faulty memory appears to be behind the crashing we experienced with the A8-7600T...
I believe Intel generally runs its ring-bus wires over the L3, which the more straightforward topology of a ring bus should help with.
I agree that it'd be nice if Keller stepped up and gave AMD a unified solution. Perhaps this is made more difficult in Bulldozer-derived designs by the tightly paired design of each cluster and its corresponding large L2. These don't seem like they're designed around an L3, which seems bolted on and high latency, whereas Intel's big cores have a comparatively small L2 and seem more reliant on the L3 for caching large data sets.
Bulldozer's L3 situation is an evolution of the previous gen, since the uncore is similar. The module setup reduces the burden on the uncore, since the shared L2 means the number of CPU clients is also cut in half.
Compare that to pre-Nehalem designs like the Core 2 Duo, or to Silvermont, which are closer in nature to AMD's BD derivatives when it comes to L2 cache sizes. (If Knights Landing is indeed Silvermont-based, I wonder if it would break up the pairing of cores with L2 for a different general architecture.)
If the slides are accurate, cores are paired behind a shared L2. The interconnect is drawn as changing to a grid of some kind.
I've seen several similar complaints on the various Macintosh-themed forums, so it's probably a Bootcamp-related problem.
Bootcamp on 2013 MBPs has been a nightmare for me :S First, EFI is not supported at all, so Windows 8+ is a pain to install. Even if you get it installed, there's no integrated graphics at all if you have the discrete GPU, so battery life sucks in Windows (only a problem if you have the discrete card, but yeah). Then, as you note, the power and cooling stuff is a mess. I guess Apple isn't really very motivated to fix it, but it's too bad, because it's the only really nice laptop with quad core + Iris Pro right now.
One thing I noticed about the Intel Iris is that GPU-Z reports 40 shader cores / 32 TMUs and only 8 ROPs. From where is this info derived?
I think they basically just have a database somewhere that they fill in. I've seen those numbers wrong a lot of times.