That would indeed seriously suck, but I don't think it makes sense for them to do that. Quad-core is pretty much mainstream now and if they want single applications to make use of an increasing number of cores/threads then HTM becomes vital. I'd even argue that it's less important for the server market since they mostly run multiple independent single-threaded applications.
A number of discussions concerning HTM bring up the cost of lock propagation across sockets. In the case of the desktop, propagation of state changes on-die is pretty fast (at least if you're Intel), and there are optimizations that can be put in place to speed up on-die synchronization. There is no strong expectation of client systems returning to multi-socket setups any time soon, and they may not return to multi-die setups either.
In the case of HPC and server systems that span multiple sockets, the propagation time is longer. When synchronization costs hundreds of nanoseconds to microseconds, the overhead of rolling back failed transactions can be masked by general latencies, and the performance upside is higher.
A single desktop CPU might be able to retry locks in much less time with less coherence overhead since it all falls back to a single shared LLC, and typically most latencies are simply shorter by default, so transactions have less idle time to hide in. Given the general burstiness of desktop loads and sensitivity to latency, the upside may be less and Intel might want a product segmentation checkbox.
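To put a rough shape on the cross-socket versus on-die argument, here is a toy cost model in Python. It is only a sketch: every number in it is a made-up placeholder rather than a measurement, the function names are mine, and it assumes a transaction that aborts pays the rollback and then falls back to taking the lock normally.

```python
def lock_cost(work_ns, lock_transfer_ns):
    """Pessimistic path: pay to move the lock's cache line, then do the work."""
    return lock_transfer_ns + work_ns


def txn_cost(work_ns, lock_transfer_ns, rollback_ns, abort_prob):
    """Optimistic path: one transactional attempt, with a locked fallback.

    Assumption: a successful transaction never transfers the lock line;
    an aborted one wastes its work, pays the rollback, then takes the lock.
    """
    success = work_ns
    failure = work_ns + rollback_ns + lock_cost(work_ns, lock_transfer_ns)
    return (1 - abort_prob) * success + abort_prob * failure


def elision_upside(work_ns, lock_transfer_ns, rollback_ns, abort_prob):
    """Expected nanoseconds saved by trying a transaction first (may be negative)."""
    return lock_cost(work_ns, lock_transfer_ns) - txn_cost(
        work_ns, lock_transfer_ns, rollback_ns, abort_prob
    )


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to illustrate the trend.
    common = dict(work_ns=100, rollback_ns=50, abort_prob=0.2)
    print("on-die upside (ns):      ", elision_upside(lock_transfer_ns=20, **common))
    print("cross-socket upside (ns):", elision_upside(lock_transfer_ns=300, **common))
```

With the same abort rate and rollback cost, the upside grows with the lock transfer latency, which is the sense in which a single-die desktop part has less to gain than a multi-socket box.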
I did review AMD's ASF a bit more, and while there's no data that indicates BD was intended to have it, there are elements of the design that at the very least do not put up roadblocks to implementing it.
ASF promises at a minimum that four 64-byte cache lines can be designated as speculative locations. This matches the associativity of both the L1 and WCC.
The sizing of the L1 and its associativity also avoids the virtual index aliasing problem the L1 Icache has, which for ASF can be important since a false synonym eviction could cause ASF to fail to meet its minimum capacity requirements.
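The arithmetic behind the associativity and aliasing points can be sketched in a few lines of Python. The cache geometries below are my assumptions from public Bulldozer descriptions (16 KiB 4-way L1 data cache, 4 KiB 4-way WCC, 64 KiB 2-way L1 Icache) along with 4 KiB base pages; the helper names are made up for illustration.

```python
PAGE_BYTES = 4096      # assumption: 4 KiB base pages
LINE_BYTES = 64        # cache line size, as in the ASF guarantee
ASF_MIN_LINES = 4      # minimum speculative lines ASF promises


def way_bytes(cache_bytes, ways):
    """Bytes covered by one way, which sets how many address bits index the cache."""
    return cache_bytes // ways


def can_alias(cache_bytes, ways, page_bytes=PAGE_BYTES):
    """True if a virtually indexed cache needs index bits above the page offset.

    When one way is no larger than a page, every index bit comes from the
    page offset, so two virtual synonyms always land in the same set.
    """
    return way_bytes(cache_bytes, ways) > page_bytes


def meets_asf_minimum(ways, min_lines=ASF_MIN_LINES):
    """Worst case: all protected lines map to one set, so ways must cover them."""
    return ways >= min_lines


if __name__ == "__main__":
    # Assumed geometries: (name, size in bytes, associativity).
    for name, size, ways in [("L1D", 16 * 1024, 4),
                             ("WCC", 4 * 1024, 4),
                             ("L1I", 64 * 1024, 2)]:
        print(name, "aliasing possible:", can_alias(size, ways),
              "| holds ASF minimum in one set:", meets_asf_minimum(ways))
```

Under those assumed sizes, the 4-way structures can hold the four-line minimum even if every line maps to the same set, and only the Icache geometry leaves room for synonyms.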
ASF at least initially was planned to work in the cache miss buffers, and perhaps HTM primitives could as well.
One other possibility is that the L1 can serve as the location for ASF, since it is not purely write-through and is snooped along with the rest of the module's hierarchy.