Old 14-Dec-2007, 15:01   #1
digitalwanderer
Dangerously Mirthful
 
Join Date: Feb 2002
Location: Winfield, IN USA
Posts: 15,314
Default Phenom TLB discussion (from Vista SP1)

Quote:
Originally Posted by Morgoth the Dark Enemy View Post
How the heck did you get a 9900 Phenom? ES? Or are you playing it cool and naming an OCed 9500 that?
It's the spider kit from Tahoe, just got it this week. It's a real 9900 2.6 GHz, but it's a B2 still.
Old 26-Dec-2007, 23:11   #2
WaltC
Senior Member
 
Join Date: Jul 2002
Location: BelleVue Sanatorium, Billary, NY. Patient privileges: Internet access
Posts: 2,694
Default

Quote:
Originally Posted by digitalwanderer View Post
It's the spider kit from Tahoe, just got it this week. It's a real 9900 2.6 GHz, but it's a B2 still.
Dig, assuming you haven't been brainwashed into peremptorily installing the TLB patch you might not need, how is the software you are running, running?

Last edited by WaltC; 26-Dec-2007 at 23:19. Reason: typo
Old 27-Dec-2007, 04:09   #3
Tim Murray
chaos dunk
 
Join Date: May 2003
Location: Mountain View, CA
Posts: 3,274
Default

Quote:
Originally Posted by WaltC View Post
Dig, assuming you haven't been brainwashed into peremptorily installing the TLB patch you might not need, how is the software you are running, running?
"Oh, my machine MIGHT not die with a machine check exception at any time due to a race condition in the L3 cache, so I won't install the TLB patch!" er...
Old 27-Dec-2007, 04:10   #4
digitalwanderer
Dangerously Mirthful
 
Join Date: Feb 2002
Location: Winfield, IN USA
Posts: 15,314
Default

I've actually had the TLB patch disabled since I got it and haven't had a single issue.
Old 27-Dec-2007, 04:45   #5
Tim Murray
chaos dunk
 
Join Date: May 2003
Location: Mountain View, CA
Posts: 3,274
Default

Quote:
Originally Posted by digitalwanderer View Post
I've actually had the TLB patch disabled since I got it and haven't had a single issue.
That's like overclocking a computer, using it in everyday work without any noticeable issues, and then having it crash instantly once you're stress testing it. Just because you don't know how to replicate it outside of that one specific application doesn't mean it's stable, and eventually you'll probably find some app that exhibits the same behavior (you know, kill the machine).

From what AMD has said, there's a race condition in L3 when one core sets the dirty bit and another reads from the now-dirty page soon after; this leads to hilarious memory corruption and then generates a machine check error, from which the processor cannot be restarted (i.e., you have to reboot--as far as I can tell you can't just set an interrupt handler to deal with it). Of course, that race condition kind of has to be based on clock speed, as far as I can tell, so I don't know what the deal is there.

I get the feeling if you're running multithreaded apps with both heavy memory accesses and poor cache coherency, you're going to encounter the crash. There's no mention of any virtualization-specific conditions that would cause this anywhere, so it's not going to be virtualization-specific--it's just the right kind of workload to generate the crash.

holy crap are we ever off topic
Old 27-Dec-2007, 04:51   #6
digitalwanderer
Dangerously Mirthful
 
Join Date: Feb 2002
Location: Winfield, IN USA
Posts: 15,314
Default

And I feel you're being a bit of an epipolar bear again.

I see your point, I shouldn't like my PC because it could lock up at anytime....but it hasn't and I've been f-ing trying so I'm just starting to think/feel/believe that mebbe when they say it's a rare condition that doesn't come up much that it really isn't.

I'm not saying you're wrong Tim, just relaying my personal experiences with the actual hardware.
Old 27-Dec-2007, 05:07   #7
Tim Murray
chaos dunk
 
Join Date: May 2003
Location: Mountain View, CA
Posts: 3,274
Default

Quote:
Originally Posted by digitalwanderer View Post
And I feel you're being a bit of an epipolar bear again.

I see your point, I shouldn't like my PC because it could lock up at anytime....but it hasn't and I've been f-ing trying so I'm just starting to think/feel/believe that mebbe when they say it's a rare condition that doesn't come up much that it really isn't.

I'm not saying you're wrong Tim, just relaying my personal experiences with the actual hardware.
Well, that doesn't mean it doesn't exist! And sure, it's rare in normal apps. You need the following conditions:

1. Multiple threads (or at least some sort of way for multiple cores to simultaneously access the same page, so multiple threads is probably the easiest way to imagine this).
2. The threads are accessing the same memory segments.
3. Some page (let's call it X) is cached in L3.
4. Core 1 writes to page X, which causes the dirty bit to be set on the page in L3.
5. Sometime very soon after (how soon that is, I have no idea), Core 2 writes to page X.

Now, here's where the TLB erratum hits you--the dirty bit (which exists in the TLB, hence the name) is ignored, you get memory corruption, and then the processor detects that things have gone HORRIBLY WRONG and generates the aforementioned machine check exception. Then the processor stops and you reboot.

My big problem with the idea that the TLB patch can be ignored is that while it's probably pretty stable right now, that kind of workload will be more common in six months. Six months later, it'll be even more common, and so on. Claiming that it's just not necessary for most people may be true-ish right now (you'll probably still find apps that need it, but maybe they'll be rare), but I don't think it will be as true in the future.
Old 27-Dec-2007, 05:12   #8
Skrying
S K R Y I N G
 
Join Date: Jul 2005
Posts: 4,815
Default

Wait, so the only issue is that it reboots the system? No long-term damage? What the hell is the worry then? Just run the system till you run into the issue and once you do, enable the fix. Easy...
Old 27-Dec-2007, 06:01   #9
Tim Murray
chaos dunk
 
Join Date: May 2003
Location: Mountain View, CA
Posts: 3,274
Default

Quote:
Originally Posted by Skrying View Post
Wait, so the only issue is that it reboots the system? No long-term damage? What the hell is the worry then? Just run the system till you run into the issue and once you do, enable the fix. Easy...
Data loss? Maybe rebooting the system is acceptable if you're just playing games, but if you're doing actual work with it, that's just not going to work (hence stopping shipments of Barcelona). There's absolutely no way I would run a Phenom without the patch.

I'm tempted to write an app that would basically make the thing crash if the erratum is correct, try to get some data on when it actually occurs. Wouldn't really be hard...
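
A rough sketch of the access pattern such an app might use, not a confirmed reproducer. Several threads repeatedly rewrite their own slot inside one shared page-sized buffer, then the result is checked for corruption. The names `hammer` and `run_trial`, the 4 KiB page size, and the 64-byte slot stride are all assumptions made for illustration. CPython's global interpreter lock keeps the threads from truly running at the same time, so this only shows the shape of the test; a real attempt would need native code with threads pinned to separate cores, and the exact timing needed to hit the erratum is not known from this thread.

```python
import threading

PAGE_SIZE = 4096    # assumed small-page size
SLOT_STRIDE = 64    # assumed cache-line size; one counter slot per thread


def hammer(page, slot, iterations):
    """Repeatedly rewrite one byte-wide counter inside the shared buffer."""
    offset = slot * SLOT_STRIDE
    for i in range(iterations):
        page[offset] = (i + 1) % 256


def run_trial(num_threads=4, iterations=100_000):
    """Return the slots whose final value differs from what was last written."""
    page = bytearray(PAGE_SIZE)  # every thread writes into this same buffer
    threads = [
        threading.Thread(target=hammer, args=(page, slot, iterations))
        for slot in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    expected = iterations % 256  # last value each thread stored in its slot
    return [
        slot for slot in range(num_threads)
        if page[slot * SLOT_STRIDE] != expected
    ]


if __name__ == "__main__":
    bad_slots = run_trial()
    print("corrupted slots:", bad_slots)
```

An empty list means no corruption was observed in that run. On affected hardware the more likely symptom is a machine check and a dead system rather than a tidy report, so a useful version would also log progress to disk as it goes.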
Old 27-Dec-2007, 06:15   #10
Skrying
S K R Y I N G
 
Join Date: Jul 2005
Posts: 4,815
Default

Quote:
Originally Posted by Tim Murray View Post
Data loss? Maybe rebooting the system is acceptable if you're just playing games, but if you're doing actual work with it, that's just not going to work (hence stopping shipments of Barcelona). There's absolutely no way I would run a Phenom without the patch.

I'm tempted to write an app that would basically make the thing crash if the erratum is correct, try to get some data on when it actually occurs. Wouldn't really be hard...
I'm not sure what person would purchase a Phenom for actual work; hell, I'm not sure who in their right mind would purchase a Phenom at all. Sure, data loss would be an issue, but crashes seem to happen at the darnedest times anyway, so typical saving policies should prevent anything major. If I was mostly gaming on the system, surfing, doing everyday tasks, or even just minor office work, then I wouldn't bother with the patch until the crash actually occurs.

I still haven't seen anyone have the issue pop up under normal everyday use, barring purposefully trying to force it.
Old 27-Dec-2007, 10:02   #11
digitalwanderer
Dangerously Mirthful
 
Join Date: Feb 2002
Location: Winfield, IN USA
Posts: 15,314
Default

Quote:
Originally Posted by Tim Murray View Post
I'm tempted to write an app that would basically make the thing crash if the erratum is correct, try to get some data on when it actually occurs. Wouldn't really be hard...
"How difficult could it be?"

I double-dog dare you.
Old 27-Dec-2007, 05:36   #12
Davros
Darlek ******
 
Join Date: Jun 2004
Posts: 9,651
Default

My god, aren't you easily pleased
Old 27-Dec-2007, 08:59   #13
K.I.L.E.R
Retarded moron
 
Join Date: Jun 2002
Location: Australia, Melbourne
Posts: 2,949
Default

Where did you get the details about the TLB issue, Tim?
If what you are saying is true then this issue is quite big; I make heavy use of multiple threads that access shared memory in certain parts of my work.
If some parts of that memory are cached in the level 3 and multiple threads access it after my main thread has written to it, then I'm pretty much screwed? Thank goodness I didn't buy a Phenom.
__________________
I eat coffee.
Old 27-Dec-2007, 10:18   #14
AlexV
Heteroscedasticitate
 
Join Date: Mar 2005
Posts: 2,362
Default

Quote:
Originally Posted by K.I.L.E.R View Post
Where did you get the details about the TLB issue, Tim?
If what you are saying is true then this issue is quite big; I make heavy use of multiple threads that access shared memory in certain parts of my work.
If some parts of that memory are cached in the level 3 and multiple threads access it after my main thread has written to it, then I'm pretty much screwed? Thank goodness I didn't buy a Phenom.
AMD documented it...erratum 298 IIRC. Dunno if the guy writing the erratum documentation came back from his holiday, but some reviews quoted AMD's text in full.
__________________
Donald Knuth: Science is what we understand well enough to explain to a computer. Art is everything else we do.
Old 27-Dec-2007, 15:11   #15
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,242
Default

The way I've interpreted the description of the error is that updates to TLB entries are not (edit: completely) atomic.

TLB entries exist in memory, and while the TLB used directly by the core is separate from the L1 cache, TLB entries that are evicted from the TLB can reside as data in either the L2 or L3.

The problem as AMD described it is that there is a window of time when a TLB entry that is present in the L2 needs to be updated due to a memory operation by the core, up at the L1 and TLB level (setting the accessed and dirty bits), but another memory operation forces the L2 data to be evicted to the L3.

What this means is that the L3 gets an old copy of the TLB entry that can then be loaded by another core.
As a result, two cores have two different versions of the same TLB entry.

The time window is very small, one core must update a TLB entry that is cached in its L2 at the same time that some other operation evicts the old TLB data in the L2 to the L3. Then, if another core loads up that old data, it, the system, or system data is screwed.
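
The sequence above can be written out as a toy model. This is only an illustration of the ordering problem as described in this post, not of how the actual hardware is built: the `PageTableEntry` class, the `simulate` function, and the frame number are all made up for the example, and the caches are reduced to plain variables.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PageTableEntry:
    """Hypothetical stand-in for one cached translation entry."""
    frame: int
    accessed: bool = False
    dirty: bool = False


def simulate(evict_during_update):
    """Return (core0_view, core1_view) of the same entry after one write."""
    l2_core0 = PageTableEntry(frame=0x1234)  # entry cached in core 0's L2

    # Core 0 performs a write, so the accessed and dirty bits must be set.
    pending = replace(l2_core0, accessed=True, dirty=True)

    if evict_during_update:
        # The race: the old line is evicted to L3 before the update lands.
        l3_shared = l2_core0
        core0_view = pending
    else:
        # Normal ordering: the update lands, then a later eviction carries it.
        l2_core0 = pending
        l3_shared = l2_core0
        core0_view = l2_core0

    core1_view = l3_shared  # core 1 loads whatever the shared L3 holds
    return core0_view, core1_view


if __name__ == "__main__":
    for raced in (False, True):
        core0, core1 = simulate(raced)
        print(f"eviction during update={raced}: consistent={core0 == core1}")
```

In the raced case the two cores end up holding different versions of the same entry, which is the inconsistency described above. The model says nothing about how often the window is hit in practice.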

Virtualization goes through a lot of common TLB accesses, which is why it is likely a bigger problem for server and virtualized loads. The problem with testing for this is that it requires a certain combination of events and data accesses that can force an L2 eviction at just the right time.

edit:
The erratum as I saw it described is a 2-parter. One part involves evictions to the L3; the other occurs if the same L2 cache line is probed, in which case the core might simply forget to set the accessed and dirty bits and as an added bonus may corrupt another completely unrelated cache operation.

Here's a description of the bug and the OS workaround in Linux that avoids most of the performance penalty associated with the BIOS workaround.

https://www.x86-64.org/pipermail/dis...er/010259.html
__________________
Dreaming of a .065 micron etch-a-sketch.

Last edited by 3dilettante; 27-Dec-2007 at 18:04. Reason: clarifications added
Old 27-Dec-2007, 16:46   #16
Bouncing Zabaglione Bros.
Regular
 
Join Date: Jun 2003
Posts: 6,177
Default

I run a MySQL database as a back end for a news reader. I would not be surprised to see that sort of app running on multiple cores could trigger this kind of problem. Rebooting your machine in the middle of disk writes to your relational database and shafting the in-ram disk and db caches is potentially a recipe for lots of hassle.

I suppose it's a case of how important stability is for you. If you don't mind your machine rebooting once a month while playing games, it's not a big deal. If you trash a database every few hours that then needs repair/rebuilding/restoring, it's too much hassle to live with.
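
To put rough numbers on the once-a-month versus every-few-hours comparison, here is a small worked example. It assumes crashes arrive independently at a constant average rate (a Poisson model), which nobody in this thread has established for the erratum. The 8-hour session and the reading of "every few hours" as every 3 hours are picked purely for illustration.

```python
import math


def crash_probability(mean_hours_between_crashes, session_hours):
    """Chance of at least one crash in a session, under a Poisson model."""
    rate = 1.0 / mean_hours_between_crashes
    return 1.0 - math.exp(-rate * session_hours)


SESSION_HOURS = 8.0  # assumed length of one working or gaming session

monthly = crash_probability(30 * 24, SESSION_HOURS)  # about once a month
few_hours = crash_probability(3, SESSION_HOURS)      # about every 3 hours

if __name__ == "__main__":
    print(f"once a month:    {monthly:.1%} chance per session")
    print(f"every few hours: {few_hours:.1%} chance per session")
```

Under those assumptions the first case is roughly a 1% risk per session and the second is over 90%, which is the gap between an occasional annoyance and a machine you can't rely on.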

I think the biggest problem is that the fix for this kills even more performance off chips that are already under-performing, which is why people are trying to justify ignoring the fix. No one would care if the fix didn't lower Phenom performance even more, and everyone would just install the patch and be done with it.
Old 28-Dec-2007, 03:05   #17
Geo
Mostly Harmless
 
Join Date: Apr 2002
Location: Uffda-land
Posts: 9,156
Default

Not everybody is running mission-critical apps on their PC. Certainly people who are should be using the patch. People who aren't, and own the cpu, might reasonably want to see if it bites them on the butt with their specific workload before making that decision.
__________________
"We'll thrash them --absolutely thrash them."--Richard Huddy on Larrabee
"Our multi-decade old 3D graphics rendering architecture that's based on a rasterization approach is no longer scalable and suitable for the demands of the future." --Pat Gelsinger, Intel
". . .its taking us longer than we would have liked to get a [Crossfire game] profiling system out there" --Terry Makedon, ATI, July 2006
"Christ, this is Beyond3D; just get rid of any f**ker talking about patterned chihuahuas! Can the dog write GLSL? No. Then it can f**k off." --Da Boss
Old 31-Dec-2007, 18:06   #18
Albuquerque
Red-headed step child
 
Join Date: Jun 2004
Location: Guess ;)
Posts: 3,088
Default

Quote:
Originally Posted by Geo View Post
Not everybody is running mission-critical apps on their PC. Certainly people who are should be using the patch. People who aren't, and own the cpu, might reasonably want to see if it bites them on the butt with their specific workload before making that decision.
Agreed. And due to the way in which this "bug" operates, I'd still be comfortable saying that a large quantity of PC users in the wild wouldn't be affected. Hell, the same people would also be the ones who wouldn't notice the performance detriment of the fix either
__________________
"...twisting my words"
Quote:
Originally Posted by _xxx_ 1/25 View Post
Get some supplies <...> Within the next couple of months, you'll need it.
Quote:
Originally Posted by _xxx_ 6/9 View Post
And riots are about to begin too.
Quote:
Originally Posted by _xxx_8/5 View Post
food shortages and huge price jumps I predicted recently are becoming very real now.
Quote:
Originally Posted by _xxx_ View Post
If it turns out I was wrong, I'll admit being stupid
Old 31-Dec-2007, 19:12   #19
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,242
Default

The patch isn't really an option for most people.

The limited release of Phenom before this bug popped up + the limited number of Spider boards that were sold to run Phenom + the need for BIOS updates to run Phenom as a drop-in replacement = not too many people.
Barcelona customers are pretty rare, and a number of the big ones are HPC installations that might use the OS workaround instead.

Going forward, every BIOS is going to have it and there won't be many that will have a way to turn it off.
I read that AMD's Overdrive currently doesn't have a switch for it either.
There's a nebulous "Turbo" button that apparently does turn it off, but what else does it do?
__________________
Dreaming of a .065 micron etch-a-sketch.

Last edited by 3dilettante; 31-Dec-2007 at 19:33.
Old 31-Dec-2007, 20:06   #20
digitalwanderer
Dangerously Mirthful
 
Join Date: Feb 2002
Location: Winfield, IN USA
Posts: 15,314
Default

Quote:
Originally Posted by 3dilettante View Post
I read that AMD's Overdrive currently doesn't have a switch for it either.
There's a nebulous "Turbo" button that apparently does turn it off, but what else does it do?
Uhm, AMD's Overdrive Utility (AOD) sure does have a switch for it. Three settings too: on, off, and middling.

Just gotta click on the little light thingy in the upper-right-hand corner. Green is on, yellow middling, red off.
Old 31-Dec-2007, 20:16   #21
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,242
Default

Perhaps I should restate what I said as "there is no explicit switch to disable the TLB patch".
There is a way to turn the workaround off, but it is not really marked as a switch for that one issue.
It is not confirmed that the setting that disables the workaround doesn't alter other settings.

This confused state is highlighted by your saying there are 3 settings. The workaround has a total of 2 states: on or off. That means other things may be affected.
__________________
Dreaming of a .065 micron etch-a-sketch.
