I Can Hazwell?

I don't know, transactional memory sounds like a solution that should be on the software side, not in hardware.
Indeed, there is quite a long history of "transactional memory" on the software side, and numerous implementations. I haven't tried any of them, but I'm sure they all come with a very long list of situations where transactions (as in a "transparent, automatic and carefree safety net") break down: network access, filesystem access and whatnot.

CPUs already have the most rudimentary transactional support - "compare and swap" instructions. I remember these used to stall all other work on the CPU. If you think of transactions in Haswell as the ability to atomically execute a bit more than one CPU instruction, without stalling the whole 8-core CPU, then transactions are quite a reasonable feature to have.
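
To make that concrete (my own sketch, nothing Intel-specific): a single compare-and-swap covers exactly one word, which is why lock-free code built on it has to retry in a loop, and why anything wider needs a lock or a transaction.

```cpp
#include <atomic>

std::atomic<int> counter{0};

// Lock-free increment built on compare-and-swap: retry until no other
// thread has changed the value between our load and our store.
void increment(std::atomic<int>& c) {
    int expected = c.load();
    while (!c.compare_exchange_weak(expected, expected + 1)) {
        // expected was reloaded with the current value on failure; try again
    }
}
```

A transactional region would give the same all-or-nothing guarantee across several independent words, which a single CAS cannot.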

Transactional support in CPUs must have some appeal, because many have tried it. I remember Sun (canceled) and IBM (delivered as part of one supercomputer). If Intel delivers a mass-produced product that scratches an itch many programmers apparently have, then hats off for pushing the state of the art forward.
 
Software-based transactional memory is generally too slow to offer any performance advantage, though.

Some CPUs have "LL/SC" (load-link/store-conditional) instructions - many RISC CPUs do - which are in a way similar to transactional memory and can also be used to implement lock-free algorithms. But of course transactional memory is still much broader.
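
For illustration (my sketch, written with portable C++ atomics rather than raw LL/SC): the classic lock-free pattern those primitives enable is a retry loop around compare_exchange, which on ARM or POWER typically compiles down to an LL/SC pair.

```cpp
#include <atomic>

struct Node {
    int   value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

// Lock-free stack push: link the new node to the current head, then try to
// swing head to it; if another thread got there first, reload and retry.
void push(int value) {
    Node* n = new Node{value, head.load(std::memory_order_relaxed)};
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
        // on failure n->next now holds the updated head; loop and retry
    }
}
```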
 
I don't know, transactional memory sounds like a solution that should be on the software side, not in hardware.
Think of it like HW encryption or even just prefetching: it's a HW implementation of an operation, which should make it far more efficient than just implementing it in software.
 
I don't know, transactional memory sounds like a solution that should be on the software side, not in hardware. On the other hand, maybe platforms like "Concurrent collections" in .NET could be rewritten and optimized using transactional memory.

The usual languages make a mess of side effects and mutation, which is why it's hard for them to do it with good performance. Haskell is perhaps the only language which can do it without a significant performance hit.
 
This functionality allows the core to speculate past something that it normally couldn't, in the same vein as how current designs reorder memory operations and replay if it turns out some of them conflict.

The problem with synchronization was that it was something hardware couldn't safely make assumptions about without programmer intervention.

Intel hasn't disclosed any implementations yet, so what parts of the memory pipeline it changes and what limits it imposes are not yet known.
It could be using the cache miss buffers, which was posited by another similar proposal. Writes and reads would be routed to the buffers, which would also be in contact with the cache controller.
Once the last instruction in the section ends the speculative state, it would be an all-or-nothing cache fill.
 
I'd be curious about the proposed use of per-line bits on each line in the L1/L2.

The lines in the write set must become visible atomically, but as presented it sounded like Kanter felt that a significant portion of the thousands of lines in the L2 could be in play for a transaction.

There could be a way to flash-clear the tracking bits, perhaps by storing them in a compact table or some kind of nested list next to the cache controller and core instead of spread across the L2.
If there is a sequential update process, there might be a period of inconsistency, where some lines in the transaction become visible externally while others are still pending. Another thread's accesses might straddle that update boundary, which may require a check by the cache controller and possibly an additional stall in processing the snoop.
 
I'd be curious about the proposed use of per-line bits on each line in the L1/L2.

As I understand it, it is used for hardware lock elision. Every cache line accessed within an HLE transaction gets a bit set for it. The L3 holds a copy of all cache lines in L1/L2. If a different processor tries to access a cache line, the L2 bits are checked for that particular line; if a bit is set, the transaction is aborted and the data is reloaded from L3 into the L1/L2 and into the foreign processor. The critical section is then retried with strict synchronization.

This allows a processor to speculatively perform a critical section.
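
Purely as an illustration (this uses Intel's published RTM intrinsics from immintrin.h; the shared variables and the simple spinlock fallback are my own invention), a speculatively executed critical section with a lock fallback would look roughly like this:

```cpp
#include <immintrin.h>   // _xbegin / _xend / _xabort, compile with -mrtm
#include <atomic>

std::atomic<int> lock_word{0};   // 0 = free, 1 = held (fallback spinlock)
long a = 0, b = 0;               // hypothetical shared data guarded by the lock

void critical_section() {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Speculative path: read the lock into the transaction's read set so
        // that a thread taking the fallback lock forces an abort here.
        if (lock_word.load(std::memory_order_relaxed) != 0)
            _xabort(0xff);
        ++a;
        ++b;
        _xend();   // commit: both updates become visible to other cores atomically
        return;
    }
    // Fallback path: the transaction aborted (conflict, capacity, ...),
    // so retry under the real lock.
    while (lock_word.exchange(1, std::memory_order_acquire) != 0) { /* spin */ }
    ++a;
    ++b;
    lock_word.store(0, std::memory_order_release);
}
```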

Cheers
 
As I understand it, it is used for hardware lock elision.
They're used for HTM and elision. Elision is overlaid on top of the HTM functionality.
My question concerns where he expects those bits to be located physically, and how they are updated atomically. The worst-case theoretical maximum is 5k+ lines being marked visible upon completion of a speculative section. (It will probably give up waaay before that.)

Per Intel's description, all modified lines in a transaction become visible to other readers atomically, and that could be violated if those bits were actually physically on every line and the core couldn't set them all visible within a single cycle.
 
Per Intel's description, all modified lines in a transaction become visible to other readers atomically, and that could be violated if those bits were actually physically on every line and the core couldn't set them all visible within a single cycle.

I would expect the bits to be located with the L2 cache tags. The L1s aren't store-through, but the core still needs to update the usage heuristics for the LRU algorithm used in L2 replacement so that a hot L1 line isn't evicted. When the core stores usage data, it can piggyback the HLE bit.

And yeah, they would need some special hardware to clear all bits in one cycle.

Cheers
 
My question concerns where he expects those bits to be located physically, and how they are updated atomically. The worst-case theoretical maximum is 5k+ lines being marked visible upon completion of a speculative section. (It will probably give up waaay before that.)

Per Intel's description, all modified lines in a transaction become visible to other readers atomically, and that could be violated if those bits were actually physically on every line and the core couldn't set them all visible within a single cycle.

I would expect the bits to be located with the L2 cache tags. The L1s aren't store-through, but the core still needs to update the usage heuristics for the LRU algorithm used in L2 replacement so that a hot L1 line isn't evicted. When the core stores usage data, it can piggyback the HLE bit.

Cache tags are the smart place to put them, because with transactions nearly every operation that touches the tags will also need to at least look at the transaction bits. Although I'd fully expect there to be txn bits in both the L1 and L2 tags, with the bits just transported intact to the L2 whenever data is evicted there (so the L2 tags don't need any extra space or complexity to track the ways of the L1 cache).

And yeah, they would need some special hardware to clear all bits in one cycle.

The big point is that this hardware really isn't that expensive. Turning an operation from line select into flash broadcast needs one extra signal line and one extra gate per signal junction. Instead of each junction choosing "if a, then propagate to A; if not a, then propagate to B", it becomes "if a or t, then propagate to A; if (not a) or t, then propagate to B".

And the only broadcast operations needed are commit (clear txn bits), and rollback (mark line as invalid).

So yes, it's entirely feasible for the entire cache to be used in a single transaction.
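
As a toy software model (purely illustrative; as noted above, real hardware would do the broadcast with an extra signal line per tag array, not a loop), the per-line bookkeeping being discussed amounts to something like this:

```cpp
#include <array>
#include <cstdint>
#include <cstddef>

// Toy model of per-line transaction tracking bits.
struct CacheLine {
    bool valid     = false;
    bool txn_read  = false;   // line is in the transaction's read set
    bool txn_write = false;   // line holds speculative, not-yet-visible data
    std::array<std::uint8_t, 64> data{};
};

struct Cache {
    std::array<CacheLine, 8192> lines;   // e.g. a 512 KB L2 with 64-byte lines

    // Commit: speculative data becomes architecturally visible; all txn bits
    // are cleared "at once" (the flash-clear broadcast discussed above).
    void commit() {
        for (auto& l : lines) { l.txn_read = false; l.txn_write = false; }
    }

    // Rollback: any line holding speculative data is simply invalidated, so
    // the pre-transaction copy gets refetched from L3/memory later.
    void rollback() {
        for (auto& l : lines) {
            if (l.txn_write) l.valid = false;
            l.txn_read  = false;
            l.txn_write = false;
        }
    }

    // A snoop from another core conflicts if it writes a line in the read set
    // or touches a line in the write set; that is what triggers an abort.
    bool snoop_conflicts(std::size_t index, bool snoop_is_write) const {
        const CacheLine& l = lines[index];
        return l.txn_write || (snoop_is_write && l.txn_read);
    }
};
```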
 
More talk now about Haswell being a multi-chip module in its higher incarnations, featuring an external L4 cache die entirely dedicated for graphics use. The schematic imagery shown over at VR-Zone suggests that the Haswell chip will feature a more square-ish overall shape rather than the increasingly rectangular appearance that has been the case since the Core i-series first launched.

Also, going by this image, the GPU portion now appears to occupy the majority of the die, and be more than twice the size of previous generations. Of course, this might not be an accurate representation of the chip and its layout, but speculation is fun, isn't it? After all, it's what drives every internet tech forum...! ;)
 
No idea, if the article doesn't say. Presumably leaked stuff, considering the diagrams are fairly detailed, but it could of course be the child of the active imagination of some random Adobe Illustrator user on the internet. :)
 
More talk now about Haswell being a multi-chip module in its higher incarnations, featuring an external L4 cache die entirely dedicated for graphics use.
I wonder how crazy it would have been to implement it as on-chip NUMA instead of a plain L4 cache just for the GPU. Decent (:p) OSes have had NUMA support for quite a while; I guess it would have kind of worked with that kind of setup as well.
 
Probably not enough storage to have been of any use as NUMA RAM. If that die is almost all DRAM, how much could it possibly store? Surely not more than, say, 128MB or so. That's a piss in the ocean today in terms of RAM.

Even if it's 512MB it's nothing major to speak of really.

Also, how would you get the OS to properly manage this low-latency RAM and not waste it on some useless random resident DLL crap or whatever that isn't used very much? There's no special support for such memory devices to my knowledge (because they don't actually exist yet... ;))
 
Also, how would you get the OS to properly manage this low-latency RAM and not waste it on some useless random resident DLL crap or whatever that isn't used very much? There's no special support for such memory devices to my knowledge (because they don't actually exist yet... ;))
How would it be any different from any random implementation of NUMA? I thought that when the OS sees some app (or parts of RAM) getting used more, it simply pulls that closer to the CPU. Though I admit I know really little about how NUMA works in reality, so I may have unrealistic expectations :)
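
For what it's worth, the existing NUMA machinery is largely explicit rather than automatic: on Linux an application (or its runtime) asks for memory on a particular node itself, e.g. through libnuma. A hedged sketch, assuming the eDRAM showed up as just another node (which no OS actually does today):

```cpp
#include <numa.h>      // libnuma, link with -lnuma
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) {
        std::puts("no NUMA support on this system");
        return 1;
    }

    // Pretend the on-package eDRAM were exposed as the highest-numbered node;
    // this is an assumption for illustration only.
    int fast_node = numa_max_node();

    // Explicitly place a 16 MB buffer on that node...
    std::size_t size = std::size_t(16) << 20;
    void* buf = numa_alloc_onnode(size, fast_node);
    if (!buf) return 1;

    // ...and touch it so the pages are actually faulted in there.
    std::memset(buf, 0, size);
    std::printf("allocated %zu bytes on node %d\n", size, fast_node);

    numa_free(buf, size);
    return 0;
}
```

With this interface, placement is decided when the memory is allocated; nothing here automatically migrates "hot" pages to the closer node.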
 
Well, how would it know this RAM is closer...? Besides, from what I understand, no desktop OS can optimize RAM placement at runtime; once something's loaded somewhere, it pretty much stays there.
 