I Can Hazwell?

Discussion in 'PC Industry' started by Grall, Nov 9, 2011.

  1. Miksu

    Regular

    Joined:
    Mar 9, 2003
    Messages:
    997
    Location:
    Finland
  2. Lux_

    Newcomer

    Joined:
    Sep 22, 2005
    Messages:
    206
    Indeed, there is quite a long history of "transactional memory" on the software side, with numerous implementations. I haven't tried any of them, but I'm sure they all come with a very long list of situations where transactions (as in a "transparent, automatic and carefree safety net") break down: network access, filesystem access and whatnot.

    CPUs already have the most rudimentary transactional support: "compare and swap" instructions. I remember these used to stall all other work on the CPU. If you think of transactions in Haswell as the ability to atomically execute a bit more than one instruction, and to do so without stalling the whole 8-core CPU, then transactions are quite a reasonable feature to have.

    Transactional support in CPUs must have some appeal, because many have tried it. I remember Sun (canceled) and IBM (delivered as part of one supercomputer). If Intel delivers a mass-produced product that scratches an itch many programmers apparently have, then hats off for pushing the state of the art forward.
     
  3. pcchen

    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,645
    Location:
    Taiwan
    Software-based transactional memory is generally too slow to have any performance advantage, though.

    Some CPUs, such as many RISC designs, have "LL/SC" (load-link/store-conditional) instructions, which are in a way similar to transactional memory and can also be used to implement lock-free algorithms. But of course transactional memory is still much broader.
     
  4. hoho

    Veteran

    Joined:
    Aug 21, 2007
    Messages:
    1,218
    Location:
    Estonia
    Think of it similarly to HW encryption or even just prefetching: a HW implementation of an operation, which should make it far more efficient than just implementing it in software.
     
  5. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    The usual languages make a mess of side effects and mutation, which is why it's hard for them to do software transactions with good performance. Haskell is perhaps the only language which can do it without a significant performance hit.
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,830
    Location:
    Well within 3d
    This functionality allows the core to speculate past something that it normally couldn't, in the same vein as how current designs reorder memory operations and replay if it turns out there was a conflict.

    The problem with synchronization was that it was something hardware couldn't safely make assumptions about without programmer intervention.

    Intel hasn't disclosed any implementations yet, so what parts of the memory pipeline it changes and what limits it imposes are not yet known.
    It could be using cache miss buffers, as posited by another similar proposal. Writes and reads would be routed to the buffers, which would also be in contact with the cache controller.
    Once the last instruction in the section ends the speculative state, it would be an all or nothing cache fill.
     
  7. Lux_

    Newcomer

    Joined:
    Sep 22, 2005
    Messages:
    206
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,830
    Location:
    Well within 3d
    I'd be curious about the proposed use of per-line bits on each line in the L1/L2.

    The lines in the write set must become visible atomically, but as presented it sounded like Kanter thought a significant portion of the thousands of lines in the L2 could be in play for a transaction.

    There could be a way to flash clear the listings, perhaps storing the bits in a compact table or using some kind of nested list next to the cache controller and core instead of across the L2.
    If there is a sequential update process, there might be a period of inconsistency, where some lines in the transaction become visible externally while others are still pending. Another thread's accesses might straddle that update line, which may require a check by the cache controller and possibly an additional stall in processing the snoop.
     
  9. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,095
    As I understand it, it is used for hardware lock elision. Each cache line accessed within an HLE transaction has a bit set for it. The L3 holds a copy of all cache lines in L1/L2. If a different processor tries to access a cache line, the L2 bits are checked for that particular line; if a bit is set, the transaction is aborted and the data is reloaded from L3 into the L1/L2 and into the foreign processor. The critical section is then retried with strict synchronization.

    This allows a processor to speculatively perform a critical section.

    Cheers
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,830
    Location:
    Well within 3d
    They're used for HTM and elision. Elision is overlaid on top of the HTM functionality.
    My question concerns where he expects those bits to be located physically, and how they are updated atomically. The worst-case theoretical maximum is 5k+ lines being marked visible upon the completion of a speculative section. (It will probably give out waaay before that.)

    Per Intel's description, all modified lines in a transaction become visible to other readers atomically, and that could be violated if those bits were physically on every line and the core couldn't set them all visible within a single cycle.
     
  11. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,095
    I would expect the bits to be located with the L2 cache tags. The L1s aren't write-through, but the core still needs to update usage heuristics for the LRU algorithm used in L2 replacement so that a hot L1 line isn't evicted. When the core stores usage data, it can piggyback the HLE bit.

    And yeah, they would need some special hardware to clear all bits in one cycle.

    Cheers
     
  12. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    478
    Cache tags are the smart place to put them, because with transactions nearly every operation that touches the tags will also need to at least look at the transaction bits. Although I'd fully expect that there will be txn bits in both L1 and L2 tags, and they will just be transported intact to L2 whenever data is evicted there. (so L2 tags don't need any extra space and complexity to handle the ways of the L1 cache).

    The big point is that this hardware really isn't that expensive. Turning an operation from line select into flash broadcast needs one extra signal line and one extra gate per signal junction. Instead of each junction choosing "if a, then propagate to A; if not a, then propagate to B", it becomes "if a or t, then propagate to A; if (not a) or t, then propagate to B".

    And the only broadcast operations needed are commit (clear txn bits), and rollback (mark line as invalid).

    So yes, it's entirely feasible for the entire cache to be used in a single transaction.
     
  13. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    9,337
    Location:
    La-la land
    More talk now about Haswell being a multi-chip module in its higher incarnations, featuring an external L4 cache die entirely dedicated to graphics use. The schematic imagery shown over at VR-Zone suggests that the Haswell chip will have a more square-ish overall shape, rather than the increasingly rectangular appearance that has been the case since the Core i-series first launched.

    Also, going by this image, the GPU portion now appears to occupy the majority of the die, and be more than twice the size of previous generations. Of course, this might not be an accurate representation of the chip and its layout, but speculation is fun, isn't it? After all, it's what drives every internet tech forum...! ;)
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,830
    Location:
    Well within 3d
    Are their schematics sourced from anywhere, or are those diagrams just a best guess?
     
  15. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    9,337
    Location:
    La-la land
    No idea if the article doesn't say. Presumably leaked stuff, considering the diagrams are fairly detailed, but it could of course be the child of an active imagination in some random Adobe Illustrator user on the internet. :)
     
  16. hoho

    Veteran

    Joined:
    Aug 21, 2007
    Messages:
    1,218
    Location:
    Estonia
    I wonder how crazy it would have been to implement it as on-chip NUMA instead of a plain L4 cache just for the GPU. Decent (:p) OSes have had NUMA support for quite a while; I guess it would have kind of worked with that kind of setup as well.
     
  17. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    9,337
    Location:
    La-la land
    Probably not enough storage to be of any use as NUMA RAM. If that die is almost all DRAM, how much could it possibly store? Surely not more than, say, 128 MB or so. That's a piss in the ocean in terms of RAM today.

    Even if it's 512MB it's nothing major to speak of really.

    Also, how would you get the OS to properly manage this low-latency RAM and not waste it on some useless random resident DLL crap or whatever that isn't used very much? There's no special support for such memory devices to my knowledge (because they don't actually exist yet... ;))
     
  18. hoho

    Veteran

    Joined:
    Aug 21, 2007
    Messages:
    1,218
    Location:
    Estonia
    How would it be any different from any random implementation of NUMA? I thought that when the OS sees some app (or part of RAM) being used more, it simply pulls it closer to the CPU. Though I admit I know very little about how NUMA works in reality, so I may have unrealistic expectations :)
     
  19. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    9,337
    Location:
    La-la land
    Well, how would it know this RAM is closer...? Besides, from what I understand no desktop OS can optimize RAM during runtime; once something's loaded somewhere it pretty much stays there.
     
  20. liolio

    liolio French frog
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,567
    Location:
    Bx, France


  • About Beyond3D

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.