They bothered to put GPU PhysX in titles too; why not this?
While it is possible to stay coherent across PCIe to some extent, it is simply not possible to "run monolithic-style HSA applications" on a PC dGPU, because of PCIe latency.
PCIe bandwidth is fine, and GPU graphics can hide modest latency, but this kind of compute requires low latency, and dGPU compute on a common memory pool will not have it. Tight HSA-style computation means using whichever unit is most appropriate for each piece of work, CPU or GPU. That means moving computation repeatedly back and forth between CPU and GPU, and doing that practically depends on very low hand-off overhead. No conventional dGPU video card can support that, because of PCIe.
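To put rough numbers on that (a back-of-envelope sketch; the hand-off latencies below are assumptions, not measurements), look at how much useful work you keep when every chunk has to bounce to the other side:

def useful_fraction(work_us, handoff_us):
    # Fraction of time spent computing when every 'work_us' microseconds
    # of work must be handed off to the other device.
    return work_us / (work_us + handoff_us)

for work_us in (1, 10, 100, 1000):
    pcie   = useful_fraction(work_us, handoff_us=2.0)   # assumed dGPU dispatch over PCIe
    onchip = useful_fraction(work_us, handoff_us=0.1)   # assumed APU-style shared-memory handoff
    print(f"{work_us:>5} us chunks: PCIe {pcie:6.1%}   on-chip {onchip:6.1%}")

With fine-grained chunks the PCIe case spends most of its time waiting, which is exactly the problem for "run each step wherever it fits best" workloads.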
Looks like they have a preferred die size for these APUs.
And another question: why are the L1 data and L1 instruction caches only 4-way and 3-way, while on Intel processors they are 8-way?
Does it introduce some type of advantage?
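My guess (not from the article): fewer ways means fewer tag comparisons per access, which helps latency and power at high clocks, and as long as size/ways stays at or below the 4 KB page size the set index fits inside the page offset, so the L1 can be virtually indexed without aliasing headaches. A quick sketch of the geometries, assuming 64-byte lines:

PAGE_BYTES = 4096
LINE_BYTES = 64

def geometry(size_kb, ways):
    size = size_kb * 1024
    sets = size // (ways * LINE_BYTES)
    way_bytes = size // ways
    return sets, way_bytes, way_bytes <= PAGE_BYTES  # True -> index fits in the page offset

for name, size_kb, ways in [("Intel L1D (32 KB, 8-way)", 32, 8),
                            ("Steamroller L1D (16 KB, 4-way)", 16, 4),
                            ("Steamroller L1I (96 KB, 3-way, per module)", 96, 3)]:
    sets, way_bytes, vipt_simple = geometry(size_kb, ways)
    print(f"{name:44} {sets:4} sets, {way_bytes // 1024:3} KB/way, "
          f"index within a 4 KB page: {vipt_simple}")

The 3-way instruction cache does blow past the page size per way, but an I-cache is read-only, so dealing with aliasing there is less painful than it would be on the data side.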
Seems like they are trying for higher margins, so die size is capped fairly low.
Hmm, an amped high-clocking Jaguar.
It probably needs some extra pipe stages. With the current 15-cycle mispredict latency we could maybe squeeze in a few more; there's a performance hit, sure, but stretching it to something like 18-19 cycles might not be too bad.
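Rough arithmetic on what those extra stages cost (the branch frequency and mispredict rate below are assumed workload numbers, not anything from the article):

base_cpi        = 1.0    # assumed CPI with perfect branch prediction
branch_freq     = 0.20   # assumed fraction of instructions that are branches
mispredict_rate = 0.05   # assumed fraction of branches mispredicted

for penalty in (15, 18, 19):
    cpi = base_cpi + branch_freq * mispredict_rate * penalty
    print(f"{penalty}-cycle mispredict penalty -> CPI {cpi:.2f}")

With those assumptions, going from 15 to 19 cycles is roughly a 3-4% CPI hit, which a decent clock bump would more than cover.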
So its integer resources would be about 2-wide (roughly two instructions per core per cycle), with about 32 KB of L1 I-cache per core.
A shared, long-latency L2.
It might be hard to get the L1 to clock as high, especially if we want to avoid losing performance to very low associativity. Maybe a 4-way, 16 KB L1 data cache.
The higher-end workloads might enjoy a more flexible load/store arrangement than one dedicated store pipe and one load pipe.
Roughly 8 FLOPs per cycle per core.
Does that sound about right?
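Sounds plausible to me for single precision. A quick peak-rate check on the hypothetical core sketched above (all of these parameters are this thread's guesses, not AMD numbers):

flops_per_cycle = 8     # guessed: 128-bit FP mul + 128-bit FP add, 4 + 4 single-precision FLOPs
cores           = 4

for ghz in (2.0, 3.0, 3.5):
    gflops = cores * flops_per_cycle * ghz
    print(f"{cores} cores x {flops_per_cycle} FLOPs/cycle @ {ghz} GHz = {gflops:.0f} SP GFLOPS peak")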
AMD's cache layouts on core clusters are a lot messier and more haphazard than Intel's nice private L2 + ring bus L3 configuration; compare most post-Nehalem cores to the sprawling Jaguar L2 on each quad cluster, or the Bulldozer L3, which is all over the place.
Making it look that good requires a lot more work.
One point that I've idly wondered about is the loss of memory scaling after DDR3-2133 and the missing two DCTs.
If the northbridge clock isn't rising with the higher memory speeds, it could be bottlenecking on the crossbar ports for the controllers. Earlier AMD chips had the same problem.
A quad-channel solution wouldn't necessarily have the same problem if AMD gave those DCTs their own crossbar endpoints.
The old dual-core A64 had similar issues with memory performance scaling when DDR2 models hit the market. The slow exclusive L2 ultimately limited the bandwidth from system memory to the cores.
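To put numbers on the crossbar-port idea above (both the port width and the northbridge clock here are assumptions I picked so the cap lands just above DDR3-2133; they are illustrative, not documented Kaveri figures):

def dram_gbps(mts, channels=2, bytes_per_channel=8):
    # Peak DRAM bandwidth in GB/s for DDR3 at 'mts' MT/s.
    return mts * channels * bytes_per_channel / 1000.0

def xbar_gbps(nb_ghz=2.2, bytes_per_cycle=16):
    # Assumed per-port crossbar bandwidth at the northbridge clock.
    return nb_ghz * bytes_per_cycle

for mts in (1866, 2133, 2400):
    dram = dram_gbps(mts)
    cap  = xbar_gbps()
    limiter = "crossbar port" if cap < dram else "DRAM"
    print(f"DDR3-{mts}: DRAM peak {dram:5.1f} GB/s vs port cap {cap:4.1f} GB/s -> limited by {limiter}")

Give each DCT its own endpoint, as suggested above, and per-port traffic drops to one channel's worth, which fits comfortably.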
I imagine that looking "nice and tight" is quite directly related to the wiring distance / complexity involved and the resulting latencies that are possible. Intel can even use its L3 for its on-die GPU, whereas AMD has uniformly decided to shelve a shared L3 for its APUs.
I am somewhat disappointed Techreport did not seem to attempt more troubleshooting (different memory? different motherboard? etc.) or consult AMD or other reviewers before publishing that (perhaps they did, but if so there is no note of it in the article).
As it stands, if they did not, it has the effect of (most likely unintentionally) spreading FUD.
Yeah, I agree with kalelovil. Those issues sound like a bad DIMM to me, not a processor failure.
Update — Faulty memory appears to be behind the crashing we experienced with the A8-7600T...
I believe Intel generally runs its ring-bus wires over the L3, which the more straightforward topology of a ring bus should help with.
I agree that it'd be nice if Keller stepped up and gave AMD a unified solution. Perhaps this is made more difficult in Bulldozer-derived designs by the tightly paired design of each cluster and its corresponding large L2. These don't seem like they're designed around an L3, which seems bolted on and high latency, whereas Intel's big cores have a comparatively small L2 and seem more reliant on the L3 for caching large data sets.
Bulldozer's L3 situation is an evolution of the previous gen, since the uncore is similar. The module setup reduces the burden on the uncore, since the shared L2 means the number of CPU clients is also cut in half.
Compare that to pre-Nehalem designs like the Core 2 Duo, or to Silvermont, which are closer in nature to AMD's BD derivatives when it comes to L2 cache sizes. (If Knights Landing is indeed Silvermont-based, I wonder if it would break up the pairing of cores with L2 for a different general architecture.)
If the slides are accurate, cores are paired behind a shared L2. The interconnect is drawn as changing to a grid of some kind.
I've seen several similar complaints on the various Macintosh-themed forums, so it's probably a Bootcamp-related problem.
Bootcamp on 2013 MBPs has been a nightmare for me :S First, EFI is not supported at all, so Windows 8+ is a pain to install. Even if you get it installed, there's no integrated graphics at all if you have the discrete GPU, so battery life sucks in Windows (only a problem if you have the discrete card, but yeah). Then, as you note, the power and cooling stuff is a mess. I guess Apple isn't really very motivated to fix it, but it's too bad, because it's the only really nice laptop with quad core + Iris Pro right now.
One thing I noticed about the Intel Iris is that GPU-Z reports 40 shader cores / 32 TMUs and only 8 ROPs. From where is this info derived?
I think they basically just have a database somewhere that they fill in. I've seen those numbers wrong a lot of times.