Old 14-Feb-2012, 15:12   #26
3dilettante
Regular
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 5,423
Default

This functionality allows the core to speculate past something it normally couldn't, in the same vein as how current designs reorder memory operations and replay if it turns out there is a conflict.

The problem with synchronization was that it was something hardware couldn't safely make assumptions about without programmer intervention.

Intel hasn't disclosed any implementations yet, so what parts of the memory pipeline it changes and what limits it imposes are not yet known.
It could be using cache miss buffers, which was posited by another, similar proposal. Writes and reads would be routed to the buffers, which would also be in contact with the cache controller.
Once the last instruction in the section ends the speculative state, it would be an all-or-nothing cache fill.
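
As a sketch of the programming model Intel has described, here's what the RTM flavor might look like from software (illustrative only: the fallback lock and abort code are made up, and it assumes the _xbegin/_xend/_xabort intrinsics, compiled with -mrtm):

Code:
#include <immintrin.h>

static volatile int fallback_lock = 0;

void increment_shared(int *counter)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        if (fallback_lock)        /* lock joins our read set */
            _xabort(0xff);        /* holder present: bail out */
        (*counter)++;             /* speculative store, invisible... */
        _xend();                  /* ...until the all-or-nothing commit */
    } else {
        /* aborted (e.g. a conflict): retry under the real lock */
        while (__sync_lock_test_and_set(&fallback_lock, 1))
            ;
        (*counter)++;
        __sync_lock_release(&fallback_lock);
    }
}

The point being that the stores inside the region never become architecturally visible unless the whole section commits, which is exactly the all-or-nothing behavior above.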
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is offline   Reply With Quote
Old 15-Feb-2012, 14:15   #27
Lux_
Member
 
Join Date: Sep 2005
Posts: 206
Default

Superb: David Kanter: Analysis of Haswell's Transactional Memory
Lux_ is offline   Reply With Quote
Old 15-Feb-2012, 15:17   #28
3dilettante
Regular
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 5,423
Default

I'd be curious about the proposed use of per-line bits in the L1/L2.

The lines in the write set must become visible atomically, but as presented it sounded like Kanter felt that a significant portion of the thousands of lines in the L2 could be in play for a transaction.

There could be a way to flash-clear the listings, perhaps storing the bits in a compact table or in some kind of nested list next to the cache controller and core, instead of spreading them across the L2.
If there is a sequential update process, there might be a period of inconsistency where some lines in the transaction become visible externally while others are still pending. Another thread's accesses might straddle that update boundary, which may require a check by the cache controller and possibly an additional stall in processing the snoop.
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is offline   Reply With Quote
Old 15-Feb-2012, 18:24   #29
Gubbi
Senior Member
 
Join Date: Feb 2002
Posts: 2,847
Default

Quote:
Originally Posted by 3dilettante View Post
I'd be curious about the proposed use of per-line bits in the L1/L2.
As I understand it, it is used for hardware lock elision. Every cache line accessed within an HLE transaction has a bit set for it. The L3 holds a copy of all cache lines in the L1/L2. If a different processor tries to access a cache line, the L2 bits for that particular line are checked; if a bit is set, the transaction is aborted and the data is reloaded from the L3 into the L1/L2 and into the foreign processor. The critical section is then retried with strict synchronization.

This allows a processor to speculatively perform a critical section.
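
In code, an elided critical section looks like a plain spinlock with acquire/release hints on the lock operations. A sketch (my assumption on syntax, using the HLE flags GCC exposes; compile with -mhle):

Code:
static int lock = 0;

void locked_increment(int *data)
{
    /* XACQUIRE-prefixed exchange: the write to 'lock' is elided and
       the section runs speculatively; a conflict aborts it and
       re-runs it with the lock actually taken. */
    while (__atomic_exchange_n(&lock, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        ;                         /* spin */
    (*data)++;                    /* critical section */
    /* XRELEASE-prefixed store ends the elided region */
    __atomic_store_n(&lock, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}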

Cheers
__________________
I'm pink, therefore I'm spam
Gubbi is offline   Reply With Quote
Old 15-Feb-2012, 19:44   #30
3dilettante
Regular
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 5,423
Default

Quote:
Originally Posted by Gubbi View Post
As I understand it, it is used for hardware lock elision.
They're used for HTM and elision; elision is overlaid on top of the HTM functionality.
My question concerns where he expects those bits to be located physically, and how they are updated atomically. The worst-case theoretical maximum is 5k+ lines being updated as visible upon the completion of a speculative section. (It will probably give out waaay before that.)

Per Intel's description, all modified lines in a transaction become visible to other readers atomically, and that can be violated if those bits were actually physically on every line and the core can't set them to be visible within a single cycle.
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is offline   Reply With Quote
Old 15-Feb-2012, 23:48   #31
Gubbi
Senior Member
 
Join Date: Feb 2002
Posts: 2,847
Default

Quote:
Originally Posted by 3dilettante View Post
Per Intel's description, all modified lines in a transaction become visible to other readers atomically, and that can be violated if those bits were actually physically on every line and the core can't set them to be visible within a single cycle.
I would expect the bits to be located with the L2 cache tags. The L1s aren't store-through, but the core still needs to update usage heuristics for the LRU algorithm used in L2 replacement so that a hot L1 line isn't evicted. When the core stores that usage data, it can piggyback the HLE bit.

And yeah, they would need some special hardware to clear all bits in one cycle.

Cheers
__________________
I'm pink, therefore I'm spam
Gubbi is offline   Reply With Quote
Old 24-Feb-2012, 13:44   #32
tunafish
Member
 
Join Date: Aug 2011
Posts: 407
Default

Quote:
Originally Posted by 3dilettante View Post
My question concerns where he expects those bits to be located physically, and how they are updated atomically. The worst-case theoretical maximum is 5k+ lines being updated as visible upon the completion of a speculative section. (It will probably give out waaay before that.)

Per Intel's description, all modified lines in a transaction become visible to other readers atomically, and that can be violated if those bits were actually physically on every line and the core can't set them to be visible within a single cycle.
Quote:
Originally Posted by Gubbi View Post
I would expect the bits to be located with the L2 cache tags. The L1s aren't store-through, but the core still needs to update usage heuristics for the LRU algorithm used in L2 replacement so that a hot L1 line isn't evicted. When the core stores that usage data, it can piggyback the HLE bit.
Cache tags are the smart place to put them, because with transactions nearly every operation that touches the tags will also need to at least look at the transaction bits. I'd fully expect txn bits in both the L1 and L2 tags, transported intact to the L2 whenever data is evicted there (so the L2 tags don't need extra space or complexity to track the ways of the L1 cache).

Quote:
And yeah, they would need some special hardware to clear all bits in one cycle.
The big point is that this hardware really isn't that expensive. Turning an operation from line select into a flash broadcast needs one extra signal line and one extra gate per signal junction: instead of each junction choosing "if a, propagate to A; if not a, propagate to B", it becomes "if a or t, propagate to A; if (not a) or t, propagate to B".

And the only broadcast operations needed are commit (clear txn bits), and rollback (mark line as invalid).

So yes, it's entirely feasible for the entire cache to be used in a single transaction.
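
If it helps to see the gating written out, here's a toy C model of that junction logic (illustrative only; the loop stands in for what the hardware does in parallel in a single cycle):

Code:
#include <stdbool.h>

#define LINES 512

struct tag { bool txn_rd, txn_wr, valid; };
static struct tag tags[LINES];

/* One junction: a line is driven if it's the decoded target (a),
   or if the broadcast signal (t) is asserted. */
static bool driven(unsigned i, unsigned target, bool t)
{
    return (i == target) || t;
}

/* Commit: broadcast-clear every txn bit "at once". */
static void commit(void)
{
    for (unsigned i = 0; i < LINES; i++)
        if (driven(i, 0, true))
            tags[i].txn_rd = tags[i].txn_wr = false;
}

/* Rollback: broadcast-invalidate the write set. */
static void rollback(void)
{
    for (unsigned i = 0; i < LINES; i++)
        if (driven(i, 0, true) && tags[i].txn_wr) {
            tags[i].valid = false;
            tags[i].txn_rd = tags[i].txn_wr = false;
        }
}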
tunafish is offline   Reply With Quote
Old 19-Mar-2012, 18:16   #33
Grall
Invisible Member
 
Join Date: Apr 2002
Location: La-la land
Posts: 6,694
Default

More talk now about Haswell being a multi-chip module in its higher incarnations, featuring an external L4 cache die dedicated entirely to graphics use. The schematic imagery shown over at VR-Zone suggests that the Haswell chip will have a more square-ish overall shape rather than the increasingly rectangular appearance that has been the case since the Core i-series first launched.

Also, going by this image, the GPU portion now appears to occupy the majority of the die and to be more than twice the size of previous generations. Of course, this might not be an accurate representation of the chip and its layout, but speculation is fun, isn't it? After all, it's what drives every internet tech forum...!
__________________
"Du bist Metall!"
-L.V.
Grall is offline   Reply With Quote
Old 19-Mar-2012, 19:01   #34
3dilettante
Regular
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 5,423
Default

Are their schematics sourced from anywhere, or are those diagrams just a best guess?
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is offline   Reply With Quote
Old 20-Mar-2012, 09:07   #35
Grall
Invisible Member
 
Join Date: Apr 2002
Location: La-la land
Posts: 6,694
Default

No idea, if the article doesn't say. Presumably leaked stuff, considering the diagrams are fairly detailed, but it could of course be the child of the active imagination of some random Adobe Illustrator user on the internet.
__________________
"Du bist Metall!"
-L.V.
Grall is offline   Reply With Quote
Old 20-Mar-2012, 15:37   #36
hoho
Senior Member
 
Join Date: Aug 2007
Location: Estonia
Posts: 1,218
Default

Quote:
Originally Posted by Grall View Post
More talk now about Haswell being a multi-chip module in its higher incarnations, featuring an external L4 cache die dedicated entirely to graphics use.
I wonder how crazy it would have been to implement it as on-chip NUMA instead of a plain L4 cache just for the GPU. (Decent) OSes have had NUMA support for quite a while; I guess it would have kind-of worked with that kind of setup as well.
hoho is offline   Reply With Quote
Old 20-Mar-2012, 17:31   #37
Grall
Invisible Member
 
Join Date: Apr 2002
Location: La-la land
Posts: 6,694
Default

Probably not enough storage to have been of any use as NUMA RAM. If that die is almost all DRAM, how much could it possibly store? Surely not more than, say, 128 MB or so. That's a piss in the ocean today in terms of RAM.

Even if it's 512 MB, it's nothing major to speak of, really.

Also, how would you get the OS to properly manage this low-latency RAM and not waste it on some useless random resident DLL crap or whatever that isn't used very much? There's no special support for such memory devices to my knowledge (because they don't actually exist yet... )
__________________
"Du bist Metall!"
-L.V.
Grall is offline   Reply With Quote
Old 20-Mar-2012, 17:36   #38
hoho
Senior Member
 
Join Date: Aug 2007
Location: Estonia
Posts: 1,218
Default

Quote:
Originally Posted by Grall View Post
Also, how would you get the OS to properly manage this low-latency RAM and not waste it on some useless random resident DLL crap or whatever that isn't used very much? There's no special support for such memory devices to my knowledge (because they don't actually exist yet... )
How would it be any different from any random implementation of NUMA? I thought when the OS sees some app (or part of RAM) getting used more, it simply pulls that closer to the CPU. Though I admit I know really little about how NUMA works in reality, so I may have unrealistic expectations
hoho is offline   Reply With Quote
Old 20-Mar-2012, 21:28   #39
Grall
Invisible Member
 
Join Date: Apr 2002
Location: La-la land
Posts: 6,694
Default

Well, how would it know this RAM is closer...? Besides, from what I understand no desktop OS can optimize RAM placement at runtime; once something's loaded somewhere, it pretty much stays there.
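
About the best software can do today is pin allocations by hand at allocation time. A rough Linux sketch with libnuma (link with -lnuma), assuming such a cache die even showed up as a NUMA node, which is pure speculation:

Code:
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    /* Ask for 64 MB on node 0, i.e. the "close" memory. Placement
       happens at allocation time only; the kernel won't migrate
       hot pages there later, which is exactly my point. */
    size_t sz = (size_t)64 << 20;
    void *buf = numa_alloc_onnode(sz, 0);
    if (buf) {
        /* ... use buf ... */
        numa_free(buf, sz);
    }
    return 0;
}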
__________________
"Du bist Metall!"
-L.V.
Grall is offline   Reply With Quote
Old 11-Sep-2012, 19:02   #40
liolio
French frog
 
Join Date: Jun 2005
Location: France
Posts: 4,973
Default

IDF live coverage from AnandTech
__________________
Sebbbi about virtual texturing
The Law, by Frederic Bastiat
'The more corrupt the state, the more numerous the laws'.
- Tacitus
liolio is offline   Reply With Quote
Old 11-Sep-2012, 19:09   #41
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 3,018
Default

Good Lord!

Haswell is indeed taking full advantage of the 22nm tech.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
fellix is offline   Reply With Quote
Old 11-Sep-2012, 20:09   #42
Raqia
Member
 
Join Date: Oct 2003
Posts: 414
Default

Looks like the main pipeline stays the same, with a lot of quantitative improvements to certain structures as well as tons of uncore changes. Some of the more significant changes so far on the CPU side:

uArch:
- Extra integer port & extra store port
- AVX2
- TSX
- L1 latency & conflict improvements
- Beefed-up L1 & L2 cache bandwidth

Power:
- Active idle state

No news on clock speeds yet.
Raqia is offline   Reply With Quote
Old 11-Sep-2012, 20:12   #43
liolio
French frog
 
Join Date: Jun 2005
Location: France
Posts: 4,973
Default

Quote:
Originally Posted by fellix View Post
Good Lord!

Haswell is indeed taking full advantage of the 22nm tech.
Indeed. After AMD's presentation of the Steamroller cores, some here hinted at mild improvements vs Ivy Bridge. I may not get everything, but the thing looks like a monster.
They already had a huge advantage in cache bandwidth, and they are doubling it.
They are doubling FP throughput vs Ivy Bridge: 32/16 SP/DP FLOPs per cycle per core.
They are improving single-thread performance: a bigger reorder buffer and improved branch prediction, while better power management should allow the turbo to kick in more often.
They are widening the architecture: it can now execute 8 micro-ops per cycle, up from 6 in previous architectures.

They improved pretty much everything significantly. It looks to me like a freaking monster. If Intel doesn't cut features and artificially lower the performance of their "low end" parts, the next Core i3 might be all a gamer needs, and more.

They made significant improvements to the GPU and to the video decode and encode hardware.

All of this while lowering power consumption; it's nothing short of amazing.

Then there is compute: their GPUs were already doing well, and they added extra muscle to both the GPU and the CPU. With 0.5 TB/s of bandwidth to the last-level cache in compute operations, the thing could literally fly.
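
The arithmetic behind that FP doubling, as I understand it: two 256-bit FMA ports x 8 SP lanes x 2 ops (multiply + add) = 32 SP FLOPs per cycle, and 2 x 4 x 2 = 16 for DP. A small sketch of the new fused multiply-add (assuming the documented AVX2/FMA intrinsics; compile with -mfma):

Code:
#include <immintrin.h>

/* a*b + c over 8 floats in a single instruction; two of these
   issuing per cycle is what gives 32 SP FLOPs/clk per core. */
__m256 fma8(__m256 a, __m256 b, __m256 c)
{
    return _mm256_fmadd_ps(a, b, c);
}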
__________________
Sebbbi about virtual texturing
The Law, by Frederic Bastiat
'The more corrupt the state, the more numerous the laws'.
- Tacitus

Last edited by liolio; 11-Sep-2012 at 20:22.
liolio is offline   Reply With Quote
Old 11-Sep-2012, 20:52   #44
sebbbi
Senior Member
 
Join Date: Nov 2007
Posts: 1,341
Default

Quote:
Originally Posted by liolio View Post
With 0.5 TB/s of bandwidth to the last-level cache in compute operations, the thing could literally fly.
Yeah, that's really good.

Total LLC in current high-end i7 models is 8 MB (Haswell should be similar in this regard). That's quite a bit more shared cache for the GPU compared to GCN, for example (768 KB shared L2 in the 7970, 512 KB in models with a 256-bit memory bus). This is excellent news for some GPU compute algorithms. I wonder if they can also use the 8 MB cache as a render target (like we Xbox 360 programmers use the EDRAM). 8 MB is not enough for a whole 1080p RT, but it would be really good for shadow map rendering, for example (depth buffer only, and lots of overdraw). A 512x512 shadow map is 1 MB, and 1024x1024 is 4 MB (both would fit nicely into the LLC).
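
The footprint math, assuming a 32-bit depth format (a quick check):

Code:
#include <stdio.h>

int main(void)
{
    const double llc_mb = 8.0;
    for (unsigned dim = 512; dim <= 2048; dim *= 2) {
        double mb = (double)dim * dim * 4 / (1 << 20); /* 4 B/texel */
        printf("%4ux%-4u depth-only target: %5.1f MB (%s)\n",
               dim, dim, mb, mb <= llc_mb ? "fits in LLC" : "too big");
    }
    return 0;
}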

Transactional memory seems to be really interesting as well
sebbbi is offline   Reply With Quote
Old 11-Sep-2012, 21:07   #45
Bouncing Zabaglione Bros.
Regular
 
Join Date: Jun 2003
Posts: 6,358
Default

Are Intel really telling us all this stuff eight months before launch? That seems a bit bonkers, or is there the possibility that they might surprise people with an early launch?

It's not like Intel are up against anything substantial from AMD.
Bouncing Zabaglione Bros. is offline   Reply With Quote
Old 11-Sep-2012, 21:14   #46
3dilettante
Regular
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 5,423
Default

Intel's improved fetch bandwidth and widened the back end to handle two branches per cycle.
It doesn't mention predicting two branches per cycle, though.

It's kind of interesting to see a core that could, for some reason, generate 3 store addresses a cycle. Perhaps the extra port is there to keep store-address generation out of the way of the load calculations as much as possible.

The L1 is much better in terms of bandwidth, and it sounds like they truly dual-ported it given the claimed elimination of bank conflicts. This was a noted constraint with SB and its competition.
Transactional memory is handled by the L1, which sidesteps my concern about earlier speculation of enlisting the larger and more distant L2.
No mention of how gather is handled, though.

There's a lot of stuff done with an eye towards comprehensive power management and SOC-level control, including further empowering the system agent to trick the OS as to the true status of the cores and their thread activity.
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is offline   Reply With Quote
Old 11-Sep-2012, 21:40   #47
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,952
Default

Quote:
Originally Posted by Bouncing Zabaglione Bros. View Post
Are Intel really telling us all this stuff eight months before launch? That seems a bit bonkers, or is there the possibility that they might surprise people with an early launch?

It's not like Intel are up against anything substantial from AMD.
Par for the course for Intel to release new uarch information at IDF just like they always have. When else would they?
Exophase is offline   Reply With Quote
Old 11-Sep-2012, 21:40   #48
sebbbi
Senior Member
 
Join Date: Nov 2007
Posts: 1,341
Default

Quote:
Originally Posted by 3dilettante View Post
No mention of how gather is handled, though.
Hopefully they spill the beans soon. I just hope that you don't have to code a loop for it (like you must for Knights Corner). The worst case is that it's just a long microcoded sequence, but that wouldn't make much sense. I am keeping my pessimistic view until Intel proves me wrong. Efficient gather is almost too good to be true

... Intel has added two extra ports, but neither of them does load-related things. And "no changes to key pipelines" either, nor any mention of other load-related improvements. So my conclusion is that gather likely takes several cycles to complete (even without cache misses).
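
For reference, this is what gather looks like at the intrinsic level per the AVX2 docs (compile with -mavx2); whether it expands into a load loop internally is exactly what we don't know yet:

Code:
#include <immintrin.h>

/* Load table[idx[i]] for 8 indices with one instruction,
   instead of a hand-coded scalar load loop. */
__m256 gather8(const float *table, __m256i idx)
{
    return _mm256_i32gather_ps(table, idx, 4); /* scale: 4 bytes */
}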
sebbbi is offline   Reply With Quote
Old 11-Sep-2012, 22:00   #49
Raqia
Member
 
Join Date: Oct 2003
Posts: 414
Default

Looks like Intel's Oregon team is responsible for building on the basic pipeline foundation that the Haifa team lays out. This has been true for consecutive pairs of architectural "tocks" since they started following that execution plan: Conroe -> Nehalem, and Sandy Bridge -> Haswell. I'm expecting Skylake to be the next major retooling of the pipeline by the Israel team.
Raqia is offline   Reply With Quote
Old 11-Sep-2012, 22:09   #50
I.S.T.
Senior Member
 
Join Date: Feb 2004
Posts: 2,567
Default

Quote:
Originally Posted by sebbbi View Post
And "no changes to key pipelines" either.
I noticed that as well. What exactly does it mean?
I.S.T. is offline   Reply With Quote
