AMD: R8xx Speculation

Dave Baumann · Nov 3, 2009

Andrew Lauritzen said:
Not always - *I* shouldn't be able to write code that screws up my GPU either... that's not acceptable.

It doesn't screw up the GPU, it enters into a protection state - in this case its not a particularly end user friendly protection (which why why we improved it) but it is a protection state.

There's a difference between things like "turbo mode" which detect thermal conditions and modify clock rates and such on the fly and an application being able to bring down the machine - which should never be possible. The issue here is if there is *any* way *ever* to make it fail, then it's broken, no matter how often you expect to see that case. I don't recall any (working) CPUs that would fail in the hardware for some workload, but please correct me if I'm wrong.

http://www.cpu-world.com/Glossary/T/Thermal_Design_Power_(TDP).html

This is normal case of things - in you don't design to the maximum TDP because you're overdesigning for the corner cases. Generally you have catch alls for the corner cases.

No one is complaining about doing cool power savings and clock rate modification to keep the chip running at peak efficiency. The problem is you can't have the chip optimized for the 99% case, and fail catastrophically in the 1%. If you can 100% perfectly detect that 1% and down-clock or whatever you need to do to make it not die then that's completely fine, but I reiterate: the software should never be able to bring down the hardware.

And things have improved.

Andrew Lauritzen · Nov 3, 2009

Dave Baumann said:
It doesn't screw up the GPU, it enters into a protection state - in this case its not a particularly end user friendly protection (which why why we improved it) but it is a protection state.

That's fine if it still makes forward progress. The additional constraint here is that it still needs to comply with the 2 second command-buffer completion rate otherwise it doesn't satisfy Vista/WDDM's requirements.

I apologize if I'm simply not clear on the state of the art here, but it was my understanding that AMD was actually defending the card shutting down/producing incorrect results/frying itself as legitimate behavior... going slower is fine, but blowing up is not.

CNCAddict · Nov 3, 2009

Hmmm, the 5850 stock fan settings never allow it to spin up very fast so I could see how Furmark could overheat things. If you're having stability problems then it's easy to install MSI afterburner and dial in a sweet custom fan profile and get your temps down 10-20deg. I'm running 930Mhz in furmark and don't have any problems.

3dcgi · Nov 3, 2009

Andrew Lauritzen said:
That's fine if it still makes forward progress. The additional constraint here is that it still needs to comply with the 2 second command-buffer completion rate otherwise it doesn't satisfy Vista/WDDM's requirements.

What's the 2 second completion rate? There's no way to guarantee that as shaders could run for a long time and you can put 4 million indices in a draw command. Also, surely slowing down a high end chip will still be faster than a low end chip.

Andrew Lauritzen · Nov 3, 2009

3dcgi said:
What's the 2 second completion rate? There's no way to guarantee that as shaders could run for a long time and you can put 4 million indices in a draw command. Also, surely slowing down a high end chip will still be faster than a low end chip.

The driver is actually supposed to be able to hint WDDM as to where to split up a command buffer if it is unable to fit all of the resources in memory or finish in 2 seconds... otherwise Vista will kill the driver, assuming that it is hung.

And yes, it is unavoidable in general (particularly with shaders that can actually deadlock now), but it's also clearly unacceptable behavior for an application like Futuremark which has worked fine on older GPUs at < 2.0 seconds/frame rates for many years.

Dave Baumann · Nov 3, 2009

CNCAddict said:
Hmmm, the 5850 stock fan settings never allow it to spin up very fast so I could see how Furmark could overheat things.

The refence fan table will allow it to spin continue to spin up until it needs to.

hoom · Nov 3, 2009

There was a time before CPUs had built in thermal monitoring & throttling to protect against overheating.
I'm sure that the Intel burn test was quite capable of causing fried CPUs before then.

Bouncing Zabaglione Bros. · Nov 3, 2009

hoom said:
There was a time before CPUs had built in thermal monitoring & throttling to protect against overheating.
I'm sure that the Intel burn test was quite capable of causing fried CPUs before then.

If you want to go back seven years, then just turning on a PC without a heatsink could do that - but we're not at seven years ago.

dizietsma · Nov 3, 2009

Is it the voltage regulators that are the issue or the actual gpu with Furmark? I've got an aftermarket fan on my 4850 and with Furmark it goes up to 80C or so, no great shakes. If it's the gpu that causes the issue then rather than modding the drivers would it not be better to up spec the default cooler? The extra cost of the 5850 should cover this. Only joking.

On a similar topic I have finally decided after a lot of umming and arring that I will go for a 5870. There doesn't seem to be any about at the mo while TSMC sort things out, can Dave answer whether AIB's will be doing similar to the 4850 and having their own hsf soon? I quite like my cards cool as you can tell, psychological problem.

My monitor now flickers after the house electricity kept tripping so I am getting an H-IPS to replace my 24inch TN screen. So the pennies have been being well and truly spent.

neliz · Nov 3, 2009

dizietsma said:
On a similar topic I have finally decided after a lot of umming and arring that I will go for a 5870. There doesn't seem to be any about at the mo while TSMC sort things out, can Dave answer whether AIB's will be doing similar to the 4850 and having their own hsf soon? I quite like my cards cool as you can tell, psychological problem.

Sapphire already announced their Vapor-X models!
http://www.sapphiretech.com/presentation/product/?psn=000101&pid=293

3dilettante · Nov 3, 2009

Errata published by AMD and Intel show that there are either specification or silicon shortfalls that even CPUs encounter.

Given the rigorous testing and reliability qualifications those go through, these bugs tend to be more obscure these days (other CPU vendors would have similar and possibly more numerous bugs, but the openness displayed by x86 vendors is not universal).

AMD had a TLB issue with Barcelona, and there are a few non-atomic TLB issues on Intel chips that were published, any failure with those is a game-over from the OS perspective.

There have been board incompatibilities where socket-compatible CPUs were put into older boards that did not have a high-enough spec for their VRMs.

There are other precedents:http://en.wikipedia.org/wiki/Halt_and_Catch_Fire

The options for handling such problems are similar to what GPU makers have: a microcode update can be distributed for many problems, OS-level workarounds can be arranged, and workarounds can also appear in compilers and applications.

Voltage and amperage-based throttling in addition to thermal overload checking is something that is relatively recent in consumer CPUs.
It's nice to see that this feature has filtered down to GPUs.

ShaidarHaran · Nov 3, 2009

neliz said:
Sapphire already announced their Vapor-X models!
http://www.sapphiretech.com/presentation/product/?psn=000101&pid=293

And sites have already reviewed. Check the "awards" section.

dizietsma · Nov 3, 2009

Thanks guys, that looks like the modern batmobile rather than the original one !

Iron Tiger · Nov 3, 2009

Andrew Lauritzen said:
Yeah I have to agree with this... you can call a given workload atypical all you want, but that's no excuse for it screwing up your hardware.

I've been victimized by software homicide, and I don't blame the hardware for it.

Bouncing Zabaglione Bros. said:
I dunno about that. Look at IntelBurnTest, which is a popular stability testing tool that wraps up Intel's Linpack into a user-friendly package. It stresses the CPU to the max and is designed to light up as much of the chip and RAM as possible. It's used by Intel for stress testing, and is the baseline they work to, ie they want stability at this maximum stress.

It's not downclocking or stalling, in fact there's nothing that heats up the CPU more. It even produces 10 degress more heat than maximum heat in Prime95.

I had Linpack get my i7 up to 137W power draw (according to an Asrock util), but 3DMark Vantage got it up to 140W at same clocks with same voltage.

Andrew Lauritzen · Nov 3, 2009

3dilettante said:
Errata published by AMD and Intel show that there are either specification or silicon shortfalls that even CPUs encounter.

Ah but the key point here is that they're very forward with it being a problem in their hardware (hence "errata")... I don't think there would be such a backlash here if AMD would just say "yeah our hardware doesn't work so well in this case - we're fixing it, sorry for the inconvenience" rather than "guys don't run that stuff - we don't like it!"

Obviously we're not complaining about bugs getting into hardware... that happens all the time. But you don't defend those bugs as "stop running that code..." - you apologize and fix them.

I've been victimized by software homicide, and I don't blame the hardware for it.

I would! Obviously the software is partially to "blame" too, but it should never be possible to break the hardware with software, period. That represents a fundamentally more serious security flaw than anything in software.

3dilettante · Nov 3, 2009

Andrew Lauritzen said:
Ah but the key point here is that they're very forward with it being a problem in their hardware (hence "errata")... I don't think there would be such a backlash here if AMD would just say "yeah our hardware doesn't work so well in this case - we're fixing it, sorry for the inconvenience" rather than "guys don't run that stuff - we don't like it!"

I thought AMD put in a driver-level throttle on Furmark.
It's not ideal, but is it different than having an OS developer make a kernel change to work around a TLB issue, or even working with an application developer to create a patch to work around problems?

LordEC911 · Nov 3, 2009

My car can redline at 6k RPM, so when I keep it constantly at 6k it shouldn't have a problem right? If it does it is because Honda can't make an engine...

Andrew Lauritzen · Nov 3, 2009

3dilettante said:
I thought AMD put in a driver-level throttle on Furmark.

Right but if I write Furmark2.0.exe does it still get the workaround? i.e. is there still a way to make it blow up by renaming the app/doing something similar but slightly different, etc? In all of the CPU cases it is solved at the source (flushing TLB in problematic cases, detecting the error before it happens, etc), not just patched in for certain applications.

My car can redline at 6k RPM, so when I keep it constantly at 6k it shouldn't have a problem right? If it does it is because Honda can't make an engine...

That analogy would apply to overclocking... maybe. It doesn't apply at all to software just using the machine.

Florin · Nov 3, 2009

LordEC911 said:
My car can redline at 6k RPM, so when I keep it constantly at 6k it shouldn't have a problem right? If it does it is because Honda can't make an engine...

Please, no crummy car analogies

neliz · Nov 3, 2009

So Hemlock will be out 4 Months before a possible consumer version of Furby?

AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to upcoming ATI-RV870 lineup GPUs

Within 1 or 2 weeks

Within a month

Within couple months

Very late this year

Not until next year

Dave Baumann

Gamerscore Wh...

Andrew Lauritzen

Moderator

CNCAddict

3dcgi

Andrew Lauritzen

Moderator

Dave Baumann

Gamerscore Wh...

hoom

Bouncing Zabaglione Bros.

dizietsma

neliz

GIGABYTE Man

3dilettante

ShaidarHaran

hardware monkey

dizietsma

Iron Tiger

Andrew Lauritzen

Moderator

3dilettante

LordEC911

Andrew Lauritzen

Moderator

Florin

Merrily dodgy

neliz

GIGABYTE Man

Similar threads