SSE3 is supported by over 99% of the current installed CPU base according to the latest Steam Survey, so supporting it at least is a no-brainer. I don't think many current-generation games use AVX, since most games are developed first for current-generation consoles, and those don't support vectors wider than 128 bits. PC CPUs are so much faster than current console CPUs that the extra work and version management of going beyond SSE3 doesn't pay off. That time is much better spent optimizing the GPU side (as draw call / API overhead on PC is still an issue compared to consoles).

I think the biggest single improvement is that AVX is supported by both AMD and Intel. SSE4.x(a) had several variants that weren't fully compatible with each other. Jaguar, the CPU in the PS4, also supports AVX. This is good news for PC gaming, since games will already be AVX optimized for the console; no extra work is needed to support it.

Compared to SSE4.x, 128-bit AVX has some extra instructions such as broadcasts and masked moves, which save memory instructions in some cases. The VEX prefix also allows AVX to write the result to a separate register (nondestructive operation), reducing register pressure (and the extra move operations). Both of these are very good for Jaguar in particular, since unlike Sandy/Ivy Bridge, AMD CPUs do not get "free" moves via register renaming, and Jaguar can only sustain two uops per cycle. All the extra moves and extra shuffles take away slots that could be used for real work (adds and multiplies); AVX helps with that.

256-bit AVX on Jaguar: that's an interesting question that hasn't been answered by AMD yet (as far as I know). Running 256-bit AVX on Bulldozer doesn't help at all, but Bulldozer has a separate shared vector pipeline, so Jaguar might yield slightly different results. Bobcat splits 128-bit vector instructions into two 64-bit operations in the decoder.
128-bit operations take two cycles to decode (according to Agner Fog's analysis) and are two separate instructions for the rest of the pipeline. So in Bobcat's case, 128-bit (vs 64-bit) vectors only help by reducing instruction cache usage, and I don't see the instruction cache being a bottleneck for Jaguar (it has very good L1 caches). Let's wait for the first Jaguar benchmarks (and Agner's analysis); they shouldn't be far away, since there are already some leaked Temash tablet benchmarks around the net.
Intel does have some very weird ideas about market segmentation. However, in the case of TSX, at least one of the functions (Hardware Lock Elision) is backward compatible (i.e. code with TSX support runs fine on older CPUs, just without the benefit).
BD does have free moves, so it's not just Intel CPUs - though only for XMM regs, not YMM (I guess that's not only easier to implement, it probably makes sense since you rarely need moves with AVX anyway). Well, the moves aren't entirely free since you've still got the uops moving around, but that's the same as Sandy Bridge (only Ivy Bridge can do better). Not sure what Jaguar will do there, though; Bobcat certainly wasn't as advanced.

As for SSE3, I'm not convinced it's used much. It may be supported on more than 99% of all CPUs, but the additional instructions are so minor (float horizontal add/sub and that's about it) that you might as well take care of the remaining 1% of CPUs by just using SSE2 only. SSSE3 is way more interesting (byte shuffle, for instance), as is SSE4.1, but support for those is less widespread.
Moves take up decode bandwidth, but should otherwise be completely free on all physical-register-file OOOe implementations, since the move is completely resolved in the renaming stage. Bobcat and Jaguar are physical-register-file OOOe machines too, but given the narrow decoder, moves probably have an impact. Cheers
Start8 from Stardock.com fixes that for $5. I'd still be using Win7 if it weren't for this little app.
The "K" series have always been missing certain specific features to keep them out of serious production systems. Have a look at the i7-2600, i7-2600K, i7-3770, and i7-3770K: http://ark.intel.com/compare/52213,52214,65719,65523 The "K" series parts are both missing VT-d and TXT. This new "subtraction" of TSX doesn't seem much different to me.
It probably doesn't seem different to the Intel management that made that decision either, but it is a lot different. TSX, unlike VT-d and TXT, is something that can apply to a wide variety of software but needs actual programming effort to utilize, and it's a lot harder to motivate software developers to do this when a lot of their userbase won't have access to it and can't test it.

On move elimination: only Ivy Bridge implemented this optimization, meaning on Sandy Bridge moves still flowed through the execution units (and of course NetBurst uarchs did as well). Bulldozer only has it for SSE moves. I don't think it's necessarily completely free to allow multiple architectural registers to map to the same physical register, so I would not count it as a given on Bobcat and Jaguar.
There are at least two non-K SKUs that appear to have TSX disabled as well, at least according to Tom's Hardware. That's not including any omissions at i3 and below that might turn up eventually. I'm with most commentators in that I don't see the upside to fragmenting things like this, even though I'm not certain TSX will do much at the core counts and with the typical software that consumer SKUs are concerned with.
One possibility: they are locking it out of chips that might be used as Xeon replacements and will unlock it for Xeons?
Yes, but a Sandy/Ivy core can decode four instructions per cycle, while Bulldozer/Piledriver (and Bobcat/Jaguar) can only decode two instructions per cycle (per core). Sandy/Ivy thus have plenty of free decode slots available for the extra moves that get eliminated by the register renaming mechanism. The wider Intel cores should benefit more from this feature than the narrow AMD cores.

I agree that both SSSE3 and SSE4.1 are more interesting than SSE3, but SSE4.1 has only 62% hardware coverage (source: Steam Survey). SSE4.1 is not a good baseline (if you are targeting only a single instruction set). SSE3 horizontal operations are handy, for example, in dot product implementations (dot = mul + 2 x horizontal add, or 2 x dot = 2 x mul + 2 x horizontal add). In SSE2 a single dot product costs you six instructions (mul + 2 x add + 3 x shuffle). Games ported from Xbox 360 tend to use (AoS) vector dot products, because dot products are very fast on the Xbox 360 CPU (single-cycle throughput).

According to the Steam Survey, SSE3 has 99.4% coverage, while SSE2 has 99.8% (a 0.4% difference). 0.4% is not a valid reason to choose SSE2. Unless you want dot products that require 2x-3x more instructions... or are calculating everything using an SoA layout... but that seems to be something gameplay programmers are not willing to do. You give them a good optimized vector class, and that's the lowest-level abstraction they are going to use. SoA vector batch processing is only used by low-level engine programmers (as far as my experience goes).
That's probably the rationale, and it at least fits other feature segmentation like VT-d, TXT, AES-NI, ECC, etc., but if they're looking at TSX as an enterprise-class-only feature, then they're not positioning it well, IMO.
Well, I agree those instructions could be handy. But I still have some doubts that they are really all that useful - personally I've been able to avoid them whenever I first thought they'd be useful (actually that's not quite true, but almost). And even if they are a perfect fit for your code (such as that AoS dot product), the performance benefit is most likely close to nonexistent, because the internal implementation is apparently exactly that: ordinary add + shuffle. They generate tons of uops and have high latency and poor throughput - e.g. Wolfdale lists haddps as 3 uops, latency 7, throughput 1 every two clocks. And it's typically worse on AMD, where it looks like you could actually do better by doing the add+shuffle manually for some reason, at least with Bulldozer (at least it's not the same order of fail as SSE4.1 DPPS on BD, which generates 16 (!) uops and definitely looks like you could always do much better manually).

So those instructions probably don't help as much as you'd think from just looking at the instruction count - they look much better on paper than they are. And the workarounds (using shuffles) really are quite trivial, in contrast to the instructions you get with SSSE3/SSE4.1 (emulating byte shuffles by hand is hilarious, for instance, and emulating rounding correctly is tricky at best). Oh, and while I'm here, I'd like to bring up the other rare SSE3 instruction, lddqu: a band-aid specifically invented to help the P4 because its movdqu implementation was simply unbearable, and completely useless on any other CPU...
Crysis 3 only supports DX11 GPUs, and only 58% of Steam users have one. That is lower than SSE4.1's share. Most next-gen games/ports will require a decent DX11 GPU (and a decent CPU). Every Intel CPU (that isn't low-end) from the past 5 years supports SSE4.1, but (not so) old AMD CPUs could be a problem...

What we know (Steam):
May 2012: SSE4.1 - 52.66%, SSE4.2 - 38.56%, DX11 GPUs - 45.83%
November 2012: SSE4.1 - 59.94%, SSE4.2 - 46.70%, DX11 GPUs - 55.50%
February 2013: SSE4.1 - 62.06%, SSE4.2 - 50.08%, DX11 GPUs - 58.32%

Prediction:
November 2013 (when next-gen consoles launch): SSE4.1 - ~70%, SSE4.2 - ~60%, DX11 GPUs - ~70%
May 2014: SSE4.1 - ~75%, SSE4.2 - ~70%, DX11 GPUs - ~75%

Next-gen games/ports could require SSE4.x, and I think they should.
Crysis 3 is an outlier in that they aren't interested in selling the game to as many people as possible; they are more interested in pushing the tech as far as possible. Most companies take the Blizzard approach of trying to sell to the largest audience possible. But yes, the new consoles will change that dynamic, since then the largest pool of people will consist of PS4 / Xbox Next / PC, which means Dx11-class features. But don't be surprised if you still see Dx9-class games with Dx11 features added, if the developer/publisher wants to target PS3/X360/older PCs in addition to PS4/Xbox Next/Dx11 PCs. Regards, SB
That... is a sexy piece of hardware pr0n. Wow. Haswell is the first non-rectangular (4-core) Core i-series CPU ever. That GPU looks to be absolutely massive, unless Intel re-jigged the whole layout of the chip. In previous i-series chips, the CPU cores were lined up in a row with the L3 beneath them and the GPU tacked on to the side. Now I would assume the GPU sits on the opposite side of the L3 from the cores, filling the chip out into a square-ish shape. That'd mean 50+ percent of the die is GPU... Ugh!

The off-chip die is fairly large. I wonder what geometry it's manufactured on; I'd assume something coarser than 22nm, probably, seeing as DRAM is quite frugal with power and older fabs are cheaper to run... Anyhow, damn nice piece of kit. I'm all hot and bothered now!