Phenom TLB discussion (from Vista SP1)

It's the Spider kit from Tahoe, just got it this week. It's a real 9900 at 2.6GHz, but it's still a B2.

Dig, assuming you haven't been brainwashed into peremptorily installing the TLB patch you might not need, how is the software you are running, running?
 
Dig, assuming you haven't been brainwashed into peremptorily installing the TLB patch you might not need, how is the software you are running, running?
"Oh, my machine MIGHT not die with a machine check exception at any time due to a race condition in the L3 cache, so I won't install the TLB patch!" er...
 
I've actually had the TLB patch disabled since I got it and haven't had a single issue.
That's like overclocking a computer, using it in everyday work without any noticeable issues, and then having it crash instantly once you're stress testing it. Just because you don't know how to replicate it outside of that one specific application doesn't mean it's stable, and eventually you'll probably find some app that exhibits the same behavior (you know, kill the machine).

From what AMD has said, there's a race condition in L3 when one core sets the dirty bit and another reads from the now-dirty page soon after; this leads to hilarious memory corruption and then generates a machine check error, from which the processor cannot be restarted (i.e., you have to reboot--as far as I can tell you can't just set an interrupt handler to deal with it). Of course, that race condition kind of has to be based on clock speed, as far as I can tell, so I don't know what the deal is there.

I get the feeling if you're running multithreaded apps with both heavy memory accesses and poor cache coherency, you're going to encounter the crash. There's no mention of any virtualization-specific conditions that would cause this anywhere, so it's not going to be virtualization-specific--it's just the right kind of workload to generate the crash.

holy crap are we ever off topic
 
And I feel you're being a bit of an epipolar bear again. :p

I see your point, I shouldn't like my PC because it could lock up at any time....but it hasn't and I've been f-ing trying, so I'm just starting to think/feel/believe that mebbe when they say it's a rare condition that doesn't come up much, it really doesn't.

I'm not saying you're wrong Tim, just relaying my personal experiences with the actual hardware. ;)
 
Well, that doesn't mean it doesn't exist! And sure, it's rare in normal apps. You need the following conditions:

1. Multiple threads (or at least some sort of way for multiple cores to simultaneously access the same page, so multiple threads is probably the easiest way to imagine this).
2. The threads are accessing the same memory segments.
3. Some page (let's call it X) is cached in L3.
4. Core 1 writes to page X, which causes the dirty bit to be set on the page in L3.
5. Sometime very soon after (how soon that is, I have no idea), Core 2 writes to page X.

Now, here's where the TLB erratum hits you--the dirty bit (which exists in the TLB, hence the name) is ignored, you get memory corruption, and then the processor detects that things have gone HORRIBLY WRONG and generates the aforementioned machine check exception. Then the processor stops and you reboot.

My big problem with the idea that the TLB patch can be ignored is that while it's probably pretty stable right now, that kind of workload will be more common in six months. Six months later, it'll be even more common, and so on. Claiming that it's just not necessary for most people may be true-ish right now (you'll probably still find apps that need it, but maybe they'll be rare), but I don't think it will be as true in the future.
 
Wait, so the only issue is that it reboots the system? No long-term damage? What the hell is the worry then? Just run the system till you run into the issue and once you do enable the fix. Easy...
 
Data loss? Maybe rebooting the system is acceptable if you're just playing games, but if you're doing actual work with it, that's just not going to work (hence stopping shipments of Barcelona). There's absolutely no way I would run a Phenom without the patch.

I'm tempted to write an app that would basically make the thing crash if the erratum is correct, try to get some data on when it actually occurs. Wouldn't really be hard...
 

I'm not sure what person would purchase a Phenom for actual work; hell, I'm not sure who in their right mind would purchase a Phenom at all. Sure, data loss would be an issue, but crashes seem to happen at the darnedest times anyway, so typical saving policies should prevent anything major. If I was mostly gaming on the system, doing normal surfing, everyday tasks, or even just minor office work, then I wouldn't bother with the patch until the issue actually occurs.

I still haven't seen anyone able to make the issue pop up under normal everyday use, barring purposefully trying to force it.
 
Where did you get the details about the TLB issue Tim?
If what you are saying is true then this issue is quite big; I make heavy use of multiple threads in certain parts of my work that access shared memory.
If some parts of that memory are cached in the L3 and multiple threads access it after my main thread has written to it, then I'm pretty much screwed? Thank goodness I didn't buy a Phenom.
 

AMD documented it...erratum 298 IIRC. Dunno if the guy writing the erratum documentation came back from his holiday, but some reviews quoted AMD's text in full.
 
The way I've interpreted the description of the error is that updates to TLB entries are not (edit: completely) atomic.

TLB entries exist in memory, and while the TLB used directly by the core is separate from the L1 cache, TLB entries that are evicted from the TLB can reside as data in either the L2 or L3.

The problem as AMD described it is that there is a window of time when a TLB entry that is present in the L2 needs to be updated, due to a memory operation by the core, to stay in sync with the L1 and TLB (the aforementioned accessed and dirty bits), but another memory operation forces the L2 data to be evicted to the L3.

What this means is that the L3 gets an old copy of the TLB entry that can then be loaded by another core.
As a result, two cores have two different versions of the same TLB entry.

The time window is very small, one core must update a TLB entry that is cached in its L2 at the same time that some other operation evicts the old TLB data in the L2 to the L3. Then, if another core loads up that old data, it, the system, or system data is screwed.

Virtualization goes through a lot of common TLB accesses, which is why it is likely a bigger problem for server and virtualized loads. The problem with testing for this is that it requires a certain combination of events and data accesses that can force an L2 eviction at just the right time.

edit:
The erratum as I saw it described is a 2-parter. One involved evictions to the L3, the other occurs if the same L2 cache line is probed, in which case the core might simply forget to set the accessed and dirty bits and as an added bonus may corrupt another completely unrelated cache operation.

Here's a description of the bug and the OS workaround in Linux that avoids most of the performance penalty associated with the BIOS workaround.

https://www.x86-64.org/pipermail/discuss/2007-December/010259.html
 
I run a MySQL database as a back end for a news reader. I would not be surprised if that sort of app running on multiple cores could trigger this kind of problem. Rebooting your machine in the middle of disk writes to your relational database and shafting the in-RAM disk and db caches is potentially a recipe for lots of hassle.

I suppose it's a case of how important stability is for you. If you don't mind your machine rebooting once a month while playing games, it's not a big deal. If you trash a database every few hours that then needs repair/rebuilding/restoring, it's too much hassle to live with.

I think the biggest problem is that the fix for this kills even more performance off chips that are already under-performing, which is why people are trying to justify ignoring the fix. No one would care if the fix didn't lower Phenom performance even more, and everyone would just install the patch and be done with it.
 
Not everybody is running mission-critical apps on their PC. Certainly people who are should be using the patch. People who aren't, and own the cpu, might reasonably want to see if it bites them on the butt with their specific workload before making that decision.
 

Agreed. And due to the way in which this "bug" operates, I'd still be comfortable saying that a large quantity of PC users in the wild wouldn't be affected. Hell, the same people would also be the ones who wouldn't notice the performance detriment of the fix either ;)
 
The patch isn't really an option for most people.

The limited release of Phenom before this bug popped up+the limited number of Spider boards that were sold to run Phenom+the need for BIOS updates to run Phenom as a drop-in replacement = not too many people.
Barcelona customers are pretty rare, and a number of the big ones are HPC installations that might use the OS workaround instead.

Going forward, every BIOS is going to have it and there won't be many that will have a way to turn it off.
I read that AMD's Overdrive currently doesn't have a switch for it either.
There's a nebulous "Turbo" button that apparently does turn it off, but what else does it do?
 
Uhm, AMD's Overdrive Utility (AOD) sure does have a switch for it. Three settings too: on, off, and middling.

Just gotta click on the little light thingy in the upper-right hand corner. Green is on, yellow middling, red off.
 