"So, yeah, that has little to do with Intel dumbassness and more to do with Intel... not QAing enough."

It's definitely a very complex thing to implement, and possibly a point where Intel might have been better served by being even more restrictive about which SKUs got TSX, or by relegating it to a specific enterprise or customer set. The rumor that Skylake with AVX-512 might be deployed only to specific customers could be one way to handle large changes that carry a large potential for blowback from errata.
"Seriously, though. With this many issues, TSX has to be insanely complex to implement in the hardware."

Speculation within the core, particularly at the level of Intel's processors, is massively complex.
A high-performance cache subsystem is very complex.
The memory model being implemented, and the systems that depend on all of it for sharing, updating, and interrupt handling, are complicated.
Making all of that appear to be handled atomically across multiple internal pipelines, which breaks some of the abstraction that isolated them from one another, is complex.
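For a sense of what all that machinery has to make appear atomic, here is a minimal sketch of the software side of TSX's RTM interface in C. The _xbegin/_xend/_xabort intrinsics and _XBEGIN_STARTED are the real GCC/Clang RTM intrinsics (built with -mrtm); the lock-subscription fallback pattern is just a common idiom used here for illustration, not anything from the post.

```c
#include <immintrin.h>   /* RTM intrinsics: _xbegin, _xend, _xabort; compile with -mrtm */
#include <stdatomic.h>
#include <stdint.h>

/* Simple test-and-set fallback lock so the transaction can subscribe to it. */
static atomic_int fallback_lock = 0;

static void lock_acquire(void) { while (atomic_exchange(&fallback_lock, 1)) ; }
static void lock_release(void) { atomic_store(&fallback_lock, 0); }

/* Move 'amount' between two counters atomically. */
void transfer(int64_t *from, int64_t *to, int64_t amount)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        /* Reading the lock puts it in the read set: if a fallback thread
         * grabs it, this transaction aborts instead of racing with it. */
        if (atomic_load(&fallback_lock))
            _xabort(0xff);
        *from -= amount;
        *to   += amount;
        _xend();          /* commit: both stores become visible at once */
        return;
    }
    /* Aborted (conflict, capacity, interrupt, ...): take the real lock. */
    lock_acquire();
    *from -= amount;
    *to   += amount;
    lock_release();
}
```

Everything between _xbegin() and _xend() has to look like it happened in a single step to every other agent in the system, which is exactly where the interrupt/snoop/eviction timing cases below come in.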
This is also a case where I think the fallback to microcode that is often used to protect against errata might have reached its limit.
Buggy instructions or problems within the OoO engine can be replaced with an equivalent sequence of uops by the microcode engine, but if the bug was outside the core and involved something becoming externally visible, it may have been beyond the reach of such a fix.
Something like a timing issue with an interrupt, a cache transition, a power state change, a snoop, an eviction, or a failure to update the Bloom filter Intel possibly uses to let the read set spill out of the L1 could allow TSX transactions to appear non-atomic, or to miss a badly timed update that should have aborted them.
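To make the read-set-spill speculation a bit more concrete, here is a toy software model of that kind of filter. The structure, sizes, and hash functions are all invented for illustration; this is a conceptual sketch, not a description of Intel's actual hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a read-set "overflow" Bloom filter. */
#define FILTER_BITS 1024

typedef struct {
    uint64_t bits[FILTER_BITS / 64];
} readset_filter;

static uint32_t hash1(uint64_t line) { return (uint32_t)(line * 0x9E3779B97F4A7C15ull) % FILTER_BITS; }
static uint32_t hash2(uint64_t line) { return (uint32_t)(line * 0xC2B2AE3D27D4EB4Full) % FILTER_BITS; }

static void filter_set(readset_filter *f, uint32_t bit) { f->bits[bit / 64] |= 1ull << (bit % 64); }
static bool filter_get(const readset_filter *f, uint32_t bit) { return (f->bits[bit / 64] >> (bit % 64)) & 1; }

/* Called when a transactionally read cache line is evicted from the L1:
 * remember (approximately) that the line is still in the read set. */
void record_spilled_read(readset_filter *f, uint64_t line_addr)
{
    filter_set(f, hash1(line_addr));
    filter_set(f, hash2(line_addr));
}

/* Called on an incoming snoop/invalidation: a hit means another core wrote
 * a line we read, so the transaction must abort. */
bool snoop_requires_abort(const readset_filter *f, uint64_t line_addr)
{
    return filter_get(f, hash1(line_addr)) && filter_get(f, hash2(line_addr));
}
```

The key property is that false positives only cause spurious aborts, which is safe; a missed filter update is a false negative, and that is exactly the kind of silent atomicity failure described above.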
AMD's TLB bug was a case where a badly timed cache transition could push out stale page data whose integrity the system relies on, and it was effectively game over for any environment that hit it.
And that was something that only involved an L2/L3 cache eviction in the course of standard TLB functionality.
"It is complex. IIRC AMD tried to bring similar extensions to Bulldozer, but no luck either. And no rumors that Zen is going to support it either."

AMD had a paper detailing a proposal for its Advanced Synchronization Facility (ASF), but there was no sign of that particular idea getting as far as Bulldozer.
I believe comp.arch had a discussion with someone involved in that proposal indicating that AMD's internal evaluations, which may never have gotten near Bulldozer, had already walked back some of ASF's provisions.
ASF did allow for more feedback on transactional state and failure counts/modes, which could permit more intelligent handling and performance gains that a simple abort-only transactional model could not achieve. I hope I can find the discussion on it, since it seems like the walk-back on some of the paper's proposals caused it to lose some of its hoped-for improvements. However, that also points to how something can look compelling from a conceptual or algorithmic standpoint but run into a reality where the implementation proves to be beyond the means of the manufacturer, or winds up compromising other things so much that it becomes a Pyrrhic victory. For example, for all the risk taken to implement transactional memory, its gains are variable, and it is speculation that trades heavily in memory transfers, which means more speculation on operations that carry a higher power cost than mis-speculated in-core ALU ops.
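For comparison, Intel's RTM does return a coarse abort status to software, which already allows a limited version of that kind of feedback-driven handling. Below is a sketch of a retry policy built on it; the _XABORT_* flags are the real RTM definitions, but the policy itself is only an example, not anything proposed in ASF or by the post.

```c
#include <immintrin.h>   /* needs -mrtm; _XABORT_* flags come from the RTM intrinsics */
#include <stdbool.h>

#define MAX_RETRIES 4

/* Try to run 'critical(arg)' transactionally, using the abort status to
 * decide whether retrying is worthwhile. Returns false if the caller
 * should fall back to a lock. The point is that the failure mode
 * (conflict vs. capacity vs. explicit abort) is visible to software
 * and can steer the decision. */
bool run_transactionally(void (*critical)(void *), void *arg)
{
    for (int attempt = 0; attempt < MAX_RETRIES; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            critical(arg);
            _xend();
            return true;
        }
        if (status & _XABORT_CAPACITY)
            return false;   /* working set won't fit; retrying is pointless */
        if ((status & _XABORT_CONFLICT) || (status & _XABORT_RETRY))
            continue;       /* transient contention: try again */
        return false;       /* explicit abort, interrupt, etc.: give up */
    }
    return false;
}
```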
Possibly, that work was contemporaneous with one of the projects cancelled before Bulldozer. There are oddities in Bulldozer's memory subsystem and unexplained penalties in multithreaded scenarios that might point to "things" that could have been tried or hoped for, but the product was pretty much a "well, we've got to ship something" affair, where most of the reasons for doing CMT were discarded and the penalties from it could not be fully mitigated.
"IBM has great tech in this field. More complex memory versioning and huge EDRAM based LLC. IIRC Intel's implementation is L1 cache only, so transaction size must be very small. IIRC IBM also uses their tech for speculative execution. Way ahead of Intel."

IBM has also been willing to deploy architectures with transactional memory earlier, and presumably is more willing to adjust its architectures for more esoteric design points and memory subsystems than Intel was able or willing to do for a core that had to span so many markets. In addition, IBM can draw on its experience with an internally more baroque multiprocessor system and cache architecture (various and many disclosed sharing states, and, if I recall correctly, hardware/system recovery for transitions that lead to failure).
It also likely helps that IBM has so much control over the platform and software stack, since it gives them more levers to pull should corner cases arise.
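The quoted claim above, that Intel tracks transactions in the L1 so transactions must stay small, can be probed with a sketch like the following. The thresholds and abort behavior vary by microarchitecture and from run to run, and the buffer sizes here are arbitrary; this only illustrates the capacity limit, it is not a precise measurement.

```c
#include <immintrin.h>   /* -mrtm */
#include <stdio.h>
#include <stdlib.h>

#define CACHE_LINE 64

/* Touch 'lines' distinct cache lines inside one transaction and report the
 * outcome. On Haswell-class parts the write set is tracked in the L1D, so
 * footprints approaching its capacity (or its associativity within a set)
 * should start aborting with _XABORT_CAPACITY. */
static int try_footprint(volatile char *buf, size_t lines)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        for (size_t i = 0; i < lines; ++i)
            buf[i * CACHE_LINE] += 1;     /* each write joins the write set */
        _xend();
        return 1;                         /* committed */
    }
    return (status & _XABORT_CAPACITY) ? -1 : 0;  /* capacity vs. other abort */
}

int main(void)
{
    enum { MAX_LINES = 16384 };
    volatile char *buf = calloc(MAX_LINES, CACHE_LINE);
    if (!buf)
        return 1;

    /* Touch the pages up front so first-touch page faults do not cause
     * unrelated aborts inside the transaction. */
    for (size_t i = 0; i < (size_t)MAX_LINES * CACHE_LINE; i += CACHE_LINE)
        buf[i] = 1;

    for (size_t lines = 64; lines <= MAX_LINES; lines *= 2) {
        int r = try_footprint(buf, lines);
        printf("%5zu lines: %s\n", lines,
               r == 1 ? "committed" : r == -1 ? "capacity abort" : "other abort");
    }
    free((void *)buf);
    return 0;
}
```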