Haswell vs Kaveri

HSA probably has no future if Intel doesn't support it.

The Bulldozer architecture is certainly a strange bit of nonsense. It seems like they should have overhauled Stars instead of pumping more resources into BD. It probably just shows how thinly stretched AMD's R&D is, so they just keep applying band-aids.
 
Why blame the failure of this CMT implementation on CMT itself?
Sharing units works fine for hardware whose performance the design barely cares about.

When dealing with something where performance matters, the architect who designed the concept wanted it for certain specific things:
1) Enabling a tighter critical loop for a fireball-type architecture -- this has completely lost out to the physical realities of today, although that was clear well before BD
2) Enabling newer, interesting modes of execution such as speculative threading and other memory tricks. He flat out said CMT didn't make sense in the absence of doing things in a more interesting manner, and Bulldozer does nothing interesting.
Problem is, nobody is doing any of these interesting things on an ongoing basis, or if they are, they aren't doing CMT.
If the interesting things that the designer of CMT said were needed to justify it are either impractical or don't need CMT at all, then CMT is pointless.

Something I simply don't understand is why AMD went for such a big chip. They made it tough for themselves to fight on price; extra transistors don't pay for themselves through increased performance.
The transistors themselves don't add cost. Kaveri isn't bigger than Trinity, and it's on a bulk process.
It likely trends towards being mildly cheaper to make.
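To make that concrete: a die's manufacturing cost follows its area and the process (wafer cost, defect density), not its transistor count. The sketch below uses the common gross-dies-per-wafer approximation and a simple Poisson yield model. The wafer cost and defect density are made-up placeholder values, not real foundry figures, and the 360 mm^2 comparison die is just 240 mm^2 scaled by the "two thirds of the PS4 chip" ratio mentioned later in the thread.

```python
import math

def dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 300.0) -> int:
    # Gross dies: wafer area / die area, minus an approximate edge-loss term.
    radius = wafer_diameter_mm / 2
    gross = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(gross - edge_loss)

def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    # Fraction of dies with zero defects under a Poisson defect model (100 mm^2 = 1 cm^2).
    return math.exp(-(die_area_mm2 / 100.0) * defects_per_cm2)

def cost_per_good_die(die_area_mm2: float, wafer_cost: float, defects_per_cm2: float) -> float:
    # Transistor count is not an input: only area and process parameters matter.
    good_dies = dies_per_wafer(die_area_mm2) * poisson_yield(die_area_mm2, defects_per_cm2)
    return wafer_cost / good_dies

# Hypothetical process parameters, for illustration only.
WAFER_COST = 5000.0   # assumed cost per 300 mm wafer
DEFECT_DENSITY = 0.3  # assumed defects per cm^2

for area in (240.0, 360.0):
    print(f"{area:.0f} mm^2: {dies_per_wafer(area)} gross dies, "
          f"~{cost_per_good_die(area, WAFER_COST, DEFECT_DENSITY):.2f} per good die")
```

Packing twice the transistors into the same 240 mm^2 changes nothing in this model; only a bigger die or a pricier process does.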

AMD should simply rework the Jaguar architecture to scale upwards (laptop, desktop, etc.), with a relaxed pipeline to allow higher clock rates and a full-speed L2. I would rather see eight pumped-up Jaguar cores in Kaveri than yet another botched attempt to iterate on the poor Bulldozer.
Hmm, an amped high-clocking Jaguar.
It probably needs some extra pipe stages. With the current 15-cycle mispredict latency, we could maybe squeeze in a few more stages. Sure, there's a performance hit, but stretching it to something like 18-19 cycles might not be too bad.
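A back-of-envelope way to size that hit: treat each mispredicted branch as a flat stall of the mispredict penalty on top of a base CPI. The workload numbers here (20% branches, 5% of them mispredicted, base CPI of 0.7) are hypothetical, and the model ignores everything else a longer pipe changes.

```python
def cycles_per_instruction(base_cpi: float, branch_fraction: float,
                           mispredict_rate: float, penalty_cycles: int) -> float:
    # Additive model: every mispredicted branch costs penalty_cycles of extra stall.
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Hypothetical workload parameters, for illustration only.
BASE_CPI, BRANCHES, MISPREDICTS = 0.7, 0.20, 0.05

baseline = cycles_per_instruction(BASE_CPI, BRANCHES, MISPREDICTS, 15)
for penalty in (15, 18, 19):
    cpi = cycles_per_instruction(BASE_CPI, BRANCHES, MISPREDICTS, penalty)
    # Clock increase needed to win back the per-clock throughput lost vs. 15 cycles.
    print(f"{penalty}-cycle penalty: CPI {cpi:.3f}, "
          f"needs {cpi / baseline - 1:.1%} more clock to break even")
```

Under these assumptions, going from 15 to 19 cycles costs a bit under 5% per clock, so the longer pipe pays off if it buys more than that in clock speed; a branchier or less predictable workload would raise the bar.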
So its integer resources would be about 2 wide, with about two instructions per core per cycle and about 32KB of L1 Icache per core.
Shared, long latency L2.
It might be hard getting the L1 to clock as high, especially if we want to avoid losing performance with its very low associativity. Maybe a 4-way 16 KB L1 data cache.
Higher-end workloads might enjoy a more flexible load/store setup than one dedicated load pipe and one store pipe.
Roughly 8 FLOPs per cycle per core.

Does that sound about right?
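For reference, the 8 FLOPs figure falls out of simple peak arithmetic. This sketch assumes single-precision math and the commonly described FP layouts (a Jaguar core with one 128-bit FADD pipe and one 128-bit FMUL pipe; a Bulldozer-family module with two 128-bit FMAC pipes shared by its two cores), and counts a fused multiply-add as two FLOPs.

```python
def peak_sp_flops_per_cycle(pipes: int, vector_bits: int, flops_per_lane_op: int) -> int:
    # Peak = FP pipes x 32-bit lanes per pipe x FLOPs per lane per operation.
    lanes = vector_bits // 32
    return pipes * lanes * flops_per_lane_op

# Jaguar core: separate 128-bit add and multiply pipes, 1 FLOP per lane each.
jaguar_core = peak_sp_flops_per_cycle(pipes=2, vector_bits=128, flops_per_lane_op=1)

# Bulldozer-family module: two 128-bit FMA pipes (FMA = 2 FLOPs), shared by 2 cores.
bd_module = peak_sp_flops_per_cycle(pipes=2, vector_bits=128, flops_per_lane_op=2)
bd_per_core = bd_module // 2

print(jaguar_core, bd_module, bd_per_core)  # 8 16 8
```

Both layouts land on the same 8 peak FLOPs per core per cycle.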

Anyway, the whole point is that Kaveri, as we know it, is a 240mm^2 chip, two thirds the size of the PS4 chip, that performs poorly in comparison.
Orbis would not be a good desktop chip. Kaveri might not be that compelling, but it doesn't need a completely custom platform, plus custom code that does only a fraction of what a desktop does, to avoid falling on its face.

Back to Entity279's point: another thing with CMT is that it scales poorly from a production point of view, since you move in increments of 2 cores / 1 module with the matching amount of L2.
There is no rule stating that the number of cores has to be even; those Phenom X3s were nice.

Those still had four cores on the die; one was just disabled.
 
I haven't followed the whole Kaveri story, but it seems to me that AMD was making a big deal of this whole HSA thing, which I understand as a CPU and GPU living harmoniously under the same roof, sharing the bread etc. Intuitively, that sounds like something that can have a lot of benefits, but are there any tests right now that confirm this?

Or do we have to wait for developers to write software that specifically uses this feature? (Which, for a low-end offering like this, doesn't seem all that likely?)

There's this:
[Image: HSA JPEG decoding benchmark chart]


And that:
[Image: HSA LibreOffice Calc benchmark chart]

Source: http://www.extremetech.com/computin...-wait-for-the-first-true-heterogeneous-chip/5
 
So if you are big on charts in LibreOffice Calc and decoding what must be gigantic JPEGs, then Kaveri is your chip. Oh boy.
 
You have to start somewhere. For the launch of a mid-range consumer chip produced by a company with what must be less than 20% market share these days, that's better initial support than I anticipated.

Plus, Kaveri also does quite well in regular OpenCL, non-HSA applications.
 
HSA probably has no future if Intel doesn't support it.

I think it's a foregone conclusion that Intel will support the same functionality; it's simply the way the industry is heading as a whole. Ivy Bridge was already more "HSA-like" than Richland.

Broadwell probably won't make that level of a leap, but I'd like to think Skylake will match Kaveri in HSA-style functionality. It seems to me that without a discrete GPU product line, HSA makes even more sense for Intel than it does for AMD.
 
Either parts of AMD are severely confused about what process they use, or 28SHP may be a PD-SOI process. An AMD representative confirmed to at least two different sites that 28SHP is an SOI process, and their shop says so too (though admittedly, products.amd.com lists it as 32nm, which is likely a copy-and-paste error).

Just like how AMD's own feature table said pre-GCN cards had Mantle in direct contradiction to the next marketing tab over, I think we can conclude AMD doesn't have their A team updating the tables.

AMD's CTO, slides at Kaveri's launch detailing the semi-custom 28nm SHP, and statements to investors say it's bulk.
 
Either parts of AMD are severely confused about what process they use, or 28SHP may be a PD-SOI process. An AMD representative confirmed to at least two different sites that 28SHP is an SOI process, and their shop says so too (though admittedly, products.amd.com lists it as 32nm, which is likely a copy-and-paste error).

The talk that will be given at ISSCC to present Steamroller clearly states in its title that it's a bulk process. Plus, GloFo hasn't talked about SOI in a while.

Thanks!

I assume that you'll see the largest improvements where the GPU and CPU need to work closely together and spreadsheet stuff may be a really good example of that.

AMD also mentions very large performance gains in binary tree searches, since you don't have to flatten and copy the whole tree anymore. However, this hasn't made it into any shipping application yet. Generally speaking, I suppose that any kind of highly parallel processing on dynamic data structures can benefit significantly from HSA, which is a big change.
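Roughly what "flatten and copy" means: without a shared address space, the GPU can't follow host pointers, so a pointer-based tree has to be serialized into flat arrays with index links before it is copied over and searched. With HSA's shared virtual memory, the GPU could in principle walk the original pointer structure instead. This Python sketch only illustrates the two data layouts on the CPU (nothing runs on a GPU), and every name in it is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def insert(root: Optional[Node], key: int) -> Node:
    # Plain unbalanced BST insert; returns the (possibly new) root.
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right

def search_pointer(root: Optional[Node], key: int) -> bool:
    # Walks the pointer-based tree directly: what shared virtual memory would allow.
    node = root
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False

def flatten(root: Optional[Node]) -> Tuple[List[int], List[int], List[int]]:
    # Serializes the tree (pre-order) into parallel arrays; child pointers become indices, -1 = none.
    keys: List[int] = []
    lefts: List[int] = []
    rights: List[int] = []

    def visit(node: Optional[Node]) -> int:
        if node is None:
            return -1
        idx = len(keys)
        keys.append(node.key)
        lefts.append(-1)
        rights.append(-1)
        lefts[idx] = visit(node.left)
        rights[idx] = visit(node.right)
        return idx

    visit(root)
    return keys, lefts, rights

def search_flat(keys: List[int], lefts: List[int], rights: List[int], key: int) -> bool:
    # Same search over the flattened copy, following indices instead of pointers.
    i = 0 if keys else -1
    while i != -1:
        if key == keys[i]:
            return True
        i = lefts[i] if key < keys[i] else rights[i]
    return False

root = None
for k in (50, 30, 70, 20, 40, 60, 80):
    root = insert(root, k)
flat = flatten(root)
print(search_pointer(root, 40), search_flat(*flat, 40))  # True True
print(search_pointer(root, 45), search_flat(*flat, 45))  # False False
```

The flatten step is the extra work (and, on a discrete card, the extra copy) that goes away once both processors can dereference the same pointers; it also has to be redone whenever the tree changes.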
 
The talk that will be given at ISSCC to present Steamroller clearly states in its title that it's a bulk process. Plus, GloFo hasn't talked about SOI in a while.
About as long as they haven't talked about the SHP process AMD uses (or claims to use).
I'm aware of the arguments against it. I only brought it up because AMD evidently confirmed SOI for Kaveri to some sites (heise.de and golem.de) that specifically asked about it.
 
Which representatives confirmed SOI?
Was one of them the CTO of AMD?
Certainly not. That's why my first alternative to explain the different accounts was that "parts of AMD are severely confused about what process they use". We should know it for sure in 4 weeks at the latest.
 
Broadwell probably won't make that level of a leap, but I'd like to think Skylake will match Kaveri in HSA-style functionality. It seems to me that without a discrete GPU product line, HSA makes even more sense for Intel than it does for AMD.

There is pushback from the incumbent compute workhorses at Intel--the mainline cores. Events in recent years seem to indicate that some of those objections may have been at least partly overridden.

Skylake's revamping of the GPU's memory handling to align it more closely with x86 page tables is a significant step, and one GCN has already taken.
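A toy model of why sharing page tables matters: if the GPU translates addresses through the same page table as the CPU, a pointer the CPU hands over is directly usable; with a separate GPU page table, every buffer has to be explicitly mapped (or copied) first. This is a conceptual sketch with a hypothetical single-level table and made-up addresses, not how Intel's or AMD's hardware (IOMMU walks, TLBs, multi-level tables) actually implements it.

```python
PAGE_SIZE = 4096

class PageTable:
    """Hypothetical single-level page table: virtual page number -> physical frame number."""
    def __init__(self) -> None:
        self.entries: dict = {}

    def map(self, vpn: int, pfn: int) -> None:
        self.entries[vpn] = pfn

    def translate(self, vaddr: int) -> int:
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.entries:
            raise KeyError(f"page fault at {vaddr:#x}")
        return self.entries[vpn] * PAGE_SIZE + offset

class Agent:
    """A CPU or GPU that resolves virtual addresses through some page table."""
    def __init__(self, name: str, page_table: PageTable) -> None:
        self.name = name
        self.page_table = page_table

    def resolve(self, vaddr: int) -> int:
        return self.page_table.translate(vaddr)

# The CPU maps a page and builds a pointer into it (addresses are made up).
cpu_table = PageTable()
cpu_table.map(vpn=0x12345, pfn=0x00ABC)
cpu = Agent("cpu", cpu_table)
ptr = 0x12345 * PAGE_SIZE + 0x10

# Shared page table: the GPU can dereference the CPU's pointer as-is.
gpu_shared = Agent("gpu", cpu_table)
print(gpu_shared.resolve(ptr) == cpu.resolve(ptr))  # True

# Separate GPU page table: the same pointer faults until a driver mirrors the mapping.
gpu_separate = Agent("gpu", PageTable())
try:
    gpu_separate.resolve(ptr)
except KeyError as err:
    print("separate table:", err)
```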
 
http://techreport.com/blog/25930/a-subjective-look-at-the-a8-7600-gaming-performance

That said, Batman and Tomb Raider both crashed to the desktop multiple times during testing, and they weren't the only games to have issues.

Serious Sam also crashed a couple of times.

Dirt: Showdown crashed to the desktop multiple times, too.

More game crashes hit when I tackled Sleeping Dogs. One of them even hosed part of the Windows install, forcing me to re-image the system. Ugh.

:LOL:
 
[Image: AMD HSA roadmap slide (screenshot dated 2012-02-02)]


From http://www.anandtech.com/show/5503/understanding-amds-roadmap-new-direction/2

What does "Extend to Discrete GPU" mean?
In terms of hUMA?
And is this only possible on an FM2+ board with an HSA-enabled APU?

There's word that the Excavator APU (Carrizo) is on FM2+, with DDR3. That could be inferred anyway: 2014 brings DDR4 to servers, with availability (and pricing) for the general public following in 2015, so a 2014 APU probably uses DDR3 no matter what.

As for the PCIe bus, "Extend to Discrete GPU" should mean PCIe 3.0 extended with support for coherency. Is there a name for that, and will any non-AMD product support it? I don't know. But that's a protocol update, and the PCIe x16 controller is right in the CPU (on sockets FM1/FM2/FM2+, as on sockets 1156/1155/1150), so at least I can see why a new motherboard or socket wouldn't be needed.

Yes, it's meaningless without an HSA APU (or, conceivably, a low-cost CPU that is an HSA-compatible APU with the GPU disabled).
It's a bit weird that Kaveri doesn't support "hUMA extended to discrete GPU"; maybe they can't test and validate it yet (or don't want to, or it would be meaningless). Which GPUs are supported? I don't think we know; it could be Bonaire, Hawaii and up, or only future products (20nm GPUs).

Possibly, Kaveri could support the feature with a BIOS update after Carrizo and next-gen dedicated GPUs are released.
 