Could async compute and higher GPU utilisation cause system failures? *spawn

Microsoft actually talked about what their solution to thermal issues was, but I can't remember exactly. I know the system will shut down if it reaches a certain threshold.
Yeah, but will ordinary workloads push that threshold? The GPUs will end up running hotter than we're used to when playing games that push the hardware more than ever before. Will the end result be:
1) Nothing, fans already have it covered
2) Fans spin up and the silent consoles get a bit noisier
3) They throttle back and the games run slower
4) They get hotter and hotter until safety measures kick in and switch the console off every 30 minutes of play time
5) They get hotter and hotter until they break

I'm confident number 2 has it covered, but it'll be interesting to see what the contingency is. As you say, XB1 will certainly switch itself off. Just trying to predict the next consolegate. Games that kill consoles or switch them off would be a nice headline generator for the online press. :yep2:
 
I have a feeling cooling has been specced so that at a certain ambient temperature the console can run at full power/100% utilization without having to ramp up the fan speed. Once you put the console in a space that has restricted airflow, or an extreme ambient temperature, it will start to ramp up fan speed and if the problem persists it will lower clocks to reduce power. Last resort is shutting down.
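To make that escalation order concrete, here's a minimal sketch of the kind of policy I'm describing: quiet baseline, then fan ramp, then clock reduction, then shutdown. Every threshold and name here (`FAN_RAMP_START_C`, `thermal_policy`, and so on) is a made-up assumption for illustration, not a real console value or firmware API.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only to illustrate the escalation order.
FAN_RAMP_START_C = 70.0   # below this the fan stays at its quiet baseline
FAN_MAX_C = 85.0          # at or above this the fan runs at 100%
THROTTLE_C = 90.0         # fan is already maxed, so start lowering clocks
SHUTDOWN_C = 100.0        # last resort
BASE_FAN = 0.3            # baseline fan duty (0.0 to 1.0)


@dataclass
class Action:
    fan: float          # fan duty, 0.0 to 1.0
    clock_scale: float  # 1.0 means full clocks
    shutdown: bool


def thermal_policy(die_temp_c: float) -> Action:
    """Pick a response for one temperature reading."""
    if die_temp_c >= SHUTDOWN_C:
        return Action(fan=1.0, clock_scale=0.0, shutdown=True)
    if die_temp_c >= THROTTLE_C:
        # Scale clocks linearly from 1.0 at THROTTLE_C towards 0.5 at SHUTDOWN_C.
        frac = (die_temp_c - THROTTLE_C) / (SHUTDOWN_C - THROTTLE_C)
        return Action(fan=1.0, clock_scale=1.0 - 0.5 * frac, shutdown=False)
    if die_temp_c >= FAN_MAX_C:
        return Action(fan=1.0, clock_scale=1.0, shutdown=False)
    if die_temp_c > FAN_RAMP_START_C:
        # Ramp the fan linearly between the baseline and 100%.
        frac = (die_temp_c - FAN_RAMP_START_C) / (FAN_MAX_C - FAN_RAMP_START_C)
        return Action(fan=BASE_FAN + (1.0 - BASE_FAN) * frac,
                      clock_scale=1.0, shutdown=False)
    return Action(fan=BASE_FAN, clock_scale=1.0, shutdown=False)


if __name__ == "__main__":
    for temp in (60.0, 77.5, 87.0, 95.0, 101.0):
        print(temp, thermal_policy(temp))
```

A real controller would presumably add hysteresis and read several sensors, but the ordering of responses is the point here.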

Small thermal chambers with temperature and humidity control are pretty cheap. They can easily design synthetic tests to tax the hardware and run thermal cycles for long periods of time, ramping between hot and cold. After last gen, it's an obvious next step in the system design.
 
If you put your console in an airtight cabinet (like Sony themselves did at some tradeshow last year) - or let your cat use the console as a nappy-time bed like a certain person on this forum lol - it obviously doesn't matter how much the fan spins up, option #2 will be ineffective. Downclocking probably isn't a very good idea either: in addition to gameplay issues arising from juddery framerates (and customers calling tech support thinking their console is busted and screaming about it on forums), it could potentially cause weird bugs or even crashes due to hidden race conditions existing in the code...

It's better to just suspend the game, inform the user of the overheat situation and try to cool off; forcibly shut down entirely if the heat still persists. I believe PS4 does this already, and most certainly the Xbone as well.
 
But isn't the PS4 GPU basically an R9 270X with 2 CUs shut off & clocked a lot lower? I know it's in a smaller box with a different cooling system, but it seems like it would have room to play with before overheating. R9 270Xs are clocked at 1 GHz.
 

Yes, the GPU is not really a big one and the PS4 has nice cooling.
https://www.youtube.com/watch?v=g1h8dvqEItc
http://vr-zone.com/articles/sony-engineer-unveils-functional-beauty-ps4s-cooling-system/69852.html

I don't believe that async compute will strain the cooling so much that the APU will fail. Maybe someone will perform some temperature tests when The Tomorrow Children is released.
 
The PS4 has multiple temperature sensors, including air temperature sensors, and it controls the fan to keep everything within thermal limits. There's no guesswork involved since the controller knows the temperature of sensitive areas at all times. It will spin up the fan if there's more demand, or shut down to prevent overheating if it ever reaches the limits of the fan's capabilities.

I tested the fan controller behavior on my PS4, and there's a huge amount of headroom available with that fan. I see no reason why a bit more load would cause any issue.

They certainly put a lot of effort to maximize the cooling efficiency, and they know what they are doing.
http://www.dualshockers.com/2014/01...iled-schematics-and-info-on-cooling-solution/
 
The Last of Us Furmarked

I think some high GPU utilization (with or without async compute) can already lead to some high temperatures in some current games.

For instance, I have found a particular spot in TLOUR where the PS4 fan goes completely berserk (more than Infamous uncapped), and without the need to suspend the game with the PS button (only pause).

It's in the hydroelectric dam level. Once you've made the small bridge with Ellie in order to cross it, go to this exact spot just past the middle of the small bridge, right above the rock underneath:

t1fb.jpg

And then you look down at the splashes, you aim and zoom in with the rifle exactly where I drew the red cross (between the middle of the bridge at the left, the rock at the top and just above the wall of the bridge) and hit pause:

r1fb.jpg

Put your seatbelt on... and prepare yourself for takeoff!
 

You're right, I don't think the PS4 or any modern GPU will fry itself. Furmark used to kill graphics cards and then they added features to stop it from doing so, and those features should be alive and well in the AMD chips.

But having the console (either one) go crazy with the fan full out at 100%, or the console shutting off every time at a certain spot, can be almost as bad for a consumer as the console straight up dying.
 
There's no indication that it would, for either console. As I said, there's a very big headroom available on PS4 fan, and I would assume there's an equally big margin on the XB1 (anyone have the specs of the fan?).

Lots of people had bad experiences with RROD last gen and have become paranoid. Regardless of solder issues, the launch 360 was the worst heat management ever designed for any console, it was a piece of shit. Its overheating issues, along with mechanical stress on the motherboard, made the solder weakness exponentially worse. But they did solve all of this with both a heatsink redesign and lower power chips, and there's no reason to expect any of these issues to reappear. We can deduce this just by looking at the XB1 fan design and clamp design: it's a very straightforward design which looks perfectly adequate and a huge improvement over the launch 360. The XB1 is also much lower power than the launch 360. We can also deduce that their stress testing was done correctly and has accounted for situations that approach 100% utilization of the SoC.

I'm more worried about dust buildup for those who don't vacuum the console's intakes regularly (I do it often as I'm paranoid about it), which would end up clogging the heat sink or intake grill over time (or for those of us with cats, who demand special care).
 
The Xbox One is obviously going to generate less heat than the PS4. I assume this is why MS were so reticent about putting a high-end card in there; they were crapping themselves over a repeat of the RROD fiasco, which cost them $1 billion or two, but it's cost them heavily in terms of market share.
Did it only cost them $1b to replace the consoles? They spent over twice that for Minecraft(!)

I wonder, if you were to ask gamers what they'd prefer (a separate CPU and GPU both far exceeding the current APU, or potentially exclusive rights to Minecraft on their console), what would they have chosen?
 

112mm fan with a matching heatsink. The heatsink looks much better than 'stock' quality. We're talking enthusiast grade, though not like Noctua, more like Cooler Master.

From Eurogamer:
Perhaps the Xbox One's power efficiency sounds too good to be true, especially when rumours not so long ago painted a very different picture of hot and loud Durango development hardware. Our sources concur that the February/March dev kits were indeed very loud indeed, but it wasn't due to over-heating - quite the opposite in fact. The thermal control algorithm - which monitors the heat output of the major chips on the motherboard and adjusts fan speed accordingly - simply wasn't implemented in the developing OS, and so to avoid damaging the hardware, the fans were set to 100 per cent all the time.

Honestly, I still have yet to (really) hear that fan. Realistically speaking, I can only hear the optical disc drive; it overpowers anything else from the console.

Thermals are a bit of a grey area for me. Let us assume current cooling is sufficient for the APU in both consoles. Is the RAM going to be okay though? Or will it generally only get hot when the RAM is overclocked, and therefore not be a concern for the consoles?
 
The likely nature of the Xbox 360's problems might have been made less problematic if there were higher utilization on the GPU, or more consistent utilization.

The actual analysis would be more complicated, depending on the physical and mechanical properties of the materials in the solder bumps and the layers the chip was packaged with.
Different materials expand differently with temperatures, which stresses increasingly brittle materials (such as lead-free solder, low-K interconnect dielectrics) and/or increasingly small structures (very fine bumps, vias).
On top of this, there is the question of the physical design of the package, which at least some have alleged Microsoft did not do as good a job on as others for the 360. Intel likely had the lead-free solder problem handled, as they had rolled that out ahead of most vendors. The shrink of the 360's chips likely led to a redesign of the package as well, which would have corrected shortcomings there.

Thermal cycling inflicts significant stress, which often happens if there are frequent trips to and from a low power to high power state.
A very spiky utilization situation can lead to much more accumulated damage than a high-utilization case that stays consistently high. AMD's 290 GPUs have the ability to maintain a very constant 95 C by modulating their power consumption and the fan speed. This freaked out some gamers, but it might have been better than if the GPU ping-ponged from 70 to 90.
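As a rough way to put numbers on the spiky versus steady comparison, here is a sketch using a simplified Coffin-Manson style relation, where cycles to failure scale as the temperature swing raised to a negative power. This is my own illustration rather than anything measured: the exponent of 2 is an assumed ballpark, the 5 C ripple for the "steady" case is an assumed figure, and `relative_fatigue_life` is a hypothetical helper name.

```python
def relative_fatigue_life(delta_t_c: float,
                          reference_delta_t_c: float,
                          exponent: float = 2.0) -> float:
    """Cycles to failure relative to a reference swing: (ref / dT) ** n.

    Simplified Coffin-Manson style scaling. The exponent is an assumption,
    and effects such as mean temperature and cycle frequency are ignored.
    """
    if delta_t_c <= 0 or reference_delta_t_c <= 0:
        raise ValueError("temperature swings must be positive")
    return (reference_delta_t_c / delta_t_c) ** exponent


if __name__ == "__main__":
    spiky_swing_c = 90.0 - 70.0   # GPU ping-ponging between 70 and 90
    steady_ripple_c = 5.0         # assumed small ripple around a held 95 C
    ratio = relative_fatigue_life(steady_ripple_c, spiky_swing_c)
    print(f"Steady case survives roughly {ratio:.0f}x as many cycles "
          f"as the spiky case under these assumptions")
```

The absolute numbers mean little, but the scaling shows why a smaller swing per cycle can matter more than the peak temperature itself.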
 
While I don't know how reliable their source is, xbox-experts.com seems to have an extensive write up of all the issues that supposedly caused RROD. Maybe a bit doom and gloom to sell their replacement clamp, but they sure attempted to be thorough. ;)

They say that MS didn't use the lead-free solder at its optimum temperature, for fear of breaking other components ???

What Causes Most Xbox 360 Red Light Errors ? (RROD)...
  • Xbox 360 Flexing Zones Caused by Chassis Design and X-clamps
At xbox-experts.com we discovered that the standoffs and outer lip of the metal chassis in which the motherboard rests were not completely level with each other. The 2 standoffs are exactly 0.75mm too high, this might not seem like much, but since they are supposed to be 3mm high this is about 1/4th more which is definitely noticeable with a straightedge or level. In addition to that we also discovered that middle area of the chassis where the X-Clamp bolts screw down are about 0.5mm lower than the rest of the chassis, causing the mainboard to be pulled down in the center as soon as the screws are tightened. This puts the entire mainboard under extreme stress which explains pretty much why there is such a wide range of errors that are not related to the CPU or GPU. This mainly applies to older units as we have seen some chassis revisions in which MS corrected the error themselves (indication they knew of the problem and it did indeed exist). It is advised to use three 1mm thick washers to measure your standoffs first to check if you have a revised chassis or not. An easy way to spot it is that the older faulty ones usually have rounded tops and the newer fixed ones have flat tops. This flexing could also explain why certain errors occur more frequently than others. The flexing zones can be divided into three parts...

Zone 1:
The most frequent error codes are 0102 (0100, 0101 and 0103 as well) and 0020. These are mainly GPU and CPU related errors. 0020 can also be caused by the RAM in rare cases. About 80% of the errors fall into this category, as the solder balls under the CPU and GPU experience the most flexing of all. This is caused by the x-clamps flexing upwards in the area directly under the CPU/GPU, with the force concentrated on very small points, in addition to the natural flexing caused by the metal case layout. Zone 1 is the area right under the CPU and GPU.

Zone 2:
The next most frequently occurring error codes are E74, 0022 and 0110, which are usually either RAM or ANA/HANA chip related. Sometimes E74 and 0022 can also be GPU related depending on the trace. Since the RAM and ANA/HANA chips are close to the GPU they can be affected by the Zone 1 flexing, in addition to flexing caused by the two standoffs. These may not occur as often as the Zone 1 errors, but still make up approximately 12.5% of the overall error codes.

Zone 3:
The Zone 3 area experiences the least amount of flexing of the three, but can still make up about 7.5% of the error codes. The related error codes are E73, 0021, E79 (if hardware related) and E71 (if hardware related). The components affected by Zone 3 flexing are the Southbridge, ethernet chip, NAND and the entire motherboard to some extent. Since the NAND and ethernet chips are not BGA (ball grid array) chips like the Southbridge, GPU, CPU, HANA, etc., they are a bit more resistant to the flexing.
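For anyone who wants the three zones in a form they can look up quickly, here is a small sketch that simply restates the codes and shares quoted above as a table. The `ZONES` layout and the `zone_for` helper are hypothetical names for illustration, and the data is only as reliable as the write-up itself.

```python
from typing import Optional

# Error codes and approximate shares as quoted in the write-up above.
ZONES = {
    1: {"codes": {"0100", "0101", "0102", "0103", "0020"},
        "share": 0.80,
        "area": "directly under the CPU and GPU"},
    2: {"codes": {"E74", "0022", "0110"},
        "share": 0.125,
        "area": "RAM and ANA/HANA chips near the GPU"},
    3: {"codes": {"E73", "0021", "E79", "E71"},
        "share": 0.075,
        "area": "Southbridge, ethernet chip, NAND, rest of the board"},
}


def zone_for(error_code: str) -> Optional[int]:
    """Return the flexing zone a secondary error code is attributed to, if any."""
    code = error_code.strip().upper()
    for zone, info in ZONES.items():
        if code in info["codes"]:
            return zone
    return None


if __name__ == "__main__":
    for code in ("0102", "E74", "E71", "9999"):
        print(code, "->", zone_for(code))
    print("shares sum to", sum(info["share"] for info in ZONES.values()))
```

Note that the write-up says E74 and 0022 can sometimes be GPU related too, so a lookup like this is only a first guess.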


schematics_complete_x.jpg




  • Xbox 360 Overheating / Thermal Runaway
Many of the 3 red light errors are blamed on overheating due to heat buildup caused by thermal runaway. Microsoft tried to combat this issue by installing better GPU heatsinks with added heatpipe attachment which didn't help much as the problems still persisted. Eventually MS got sick of the red rings and came out with the SLIM model which has an entirely new thermal design. The SLIM models seem to handle the heat a bit better, but still suffer from overheating and insufficient air flow.

In 2007 thermal design experts Naoki Asakawa and Mayuko Uno from Nikkei Electronics in Japan analyzed the 360's heat radiation system in early models to determine if overheating was a problem or not. Some of their findings...

  • The airflow cooling the heat sink is proportional to the cross-sectional area of the flow path, and in this case the cross-sectional area for the graphics IC heat sink was only about one-seventh that of the microprocessor. "Almost all of the air pulled in by the fan is used to cool the microprocessor, it looks like. They've made some effort to increase the cross-sectional area by widening the heat sink, but it doesn't look like it's very effective." "In PCs it is common practice to enclose the heat sink in a duct. There might not have been enough space available in the Xbox 360, but the duct stops just short of the heat sink. The heat sink is instead enclosed on top by the DVD drive, the case, etc. If the duct should happen to be dislodged in transport, for example, the airflow cooling the heat sink would drop significantly."
  • There was a temperature gap of 22C between the exhaust and room air, "When designing consumer products, it is common to seek a temperature gap of around 10C between exhaust and room temperatures," the thermal design expert said.
  • The maximum wind speed of the exhaust air is only 1.1 meters per second, only 1/2 to 1/3 of what normal desktop PCs produce. The expert noted, "The amount of exchanged air is slightly short considering the chassis' size (309 x 258 x 83 mm)."
  • It takes only 5 minutes of gaming for the GPU heatsink to reach 70C, a thermal gradient of about 10C/min and after 15 minutes of play, the GPU heatsink can reach temps near 100C.
  • The heat sink temperature for the microprocessor was stable at 59°C, but the heat sink on the graphics IC reached 70°C within only five minutes of starting the game. The incline was about 10°C/min, and by 15 minutes it reached 80°C, representing a difference of 57°C from room temperature. Assuming a summer room temperature of 35°C, estimates indicate that heat sink temperature would exceed 90°C, and IC temperature might well exceed 100°C.
  • When the IC, board, etc, reach excessive temperatures, the difference in the coefficients of thermal expansion cause board warpage, which in turn applies severe stress to the periphery of the ball grid array (BGA) connecting the two. Repeated exposure to elevated temperatures would cause cracks in the solder balls from heat fatigue, leading to failure.
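The Nikkei temperature figures above can be sanity-checked with a few lines of arithmetic. This is only a sketch: it assumes the heat sink's rise above room temperature stays the same when the room gets hotter, which is a simplification, and the variable names are mine.

```python
# Figures quoted from the Nikkei findings above.
HEATSINK_AT_5_MIN_C = 70.0
HEATSINK_AT_15_MIN_C = 80.0
RISE_OVER_ROOM_C = 57.0
SUMMER_ROOM_C = 35.0

# Room temperature implied by "80 C, a difference of 57 C from room temperature".
room_temp_c = HEATSINK_AT_15_MIN_C - RISE_OVER_ROOM_C

# Average heating rate over the first five minutes, starting from room temperature.
initial_gradient_c_per_min = (HEATSINK_AT_5_MIN_C - room_temp_c) / 5.0

# Assumption: the same 57 C rise applies on top of a 35 C summer room.
summer_heatsink_c = SUMMER_ROOM_C + RISE_OVER_ROOM_C

if __name__ == "__main__":
    print(f"implied room temperature: {room_temp_c:.0f} C")
    print(f"initial gradient: {initial_gradient_c_per_min:.1f} C/min")
    print(f"estimated summer heat sink temperature: {summer_heatsink_c:.0f} C")
```

Under that assumption the implied test room was about 23 C, the initial gradient comes out close to the quoted 10 C/min, and the summer estimate lands just above the quoted 90 C.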

nikkei_thermal_flexing.jpg



  • Lead-Free Solder and Improperly Reflowed Solder During Manufacture
The next issue many people blame problems on is the use of lead-free solder and the "cold" solder joints that result from it. Starting in July 2006, the E.U.'s RoHS Directive set strict environmental guidelines that restricted the use of lead in electrical and electronic equipment. For nearly 50 years, the standard solder used was a tin and lead combo which had a melting point of around 183C. The new lead-free solder needs temperatures of at least 217C. In fear of damage from over-heating, it is speculated that Microsoft's engineers most likely opted for the low end of the temperature profiles needed for re-flow.

Seattle PI's "Digital Joystick" interviewed an inside source who has worked on the Xbox 360 project for many years who stated that the...

"RROD is caused by anything that fails in the “digital backbone” on the mother board. Also known as a core digital error. CPU, GPU, memory, etc. Bad parts, incompatible parts (timing problems) bad manufacturing process (like solder joints), misapplied heat sinks or thermal interface material, missing parts, broken parts, parts of the wrong value, missed test coverage. Any one or more, on any chip, or many other discrete components, would cause this. And many of the failures were obviously infant mortality, where they work when they leave the factory and fail early in use. The main design flaw was the excessive heat on the GPU warping the mother board around it. This would stress the solder joints on the GPU and any bad joints would then fail in early life."

"Some defective parts, like BGAs where the solder balls are not of sufficient and uniform size, so they don’t solder down evenly, or the substrate is warped, causing some joints to have insufficient solder. Bad chips from marginal or under tested wafers. Others are deficient processes, like misaligning the solder paste to the circuit board, or same on the parts, or not having the thermal profile right in the reflow oven during soldering."

"Manufacturers new to PB free tend to err on the low temp side thinking they are saving the parts reliability wise from a large thermal load. What they are really doing is not reflowing the PB free solder enough to make a good joint. PB free solder is non eutectic, which means the different metals in the solder alloy melt at different temperatures, unlike leaded solder where everything melts at the same temperature. If you under heat it, it won’t bond well to the board or parts, won’t form a good joint, leaving voids and other defects in the joints that lead to early failure under normal circumstances. But when you add the extraordinary heat and mother board warpage that goes with it, well you get a catastrophic failure rate like we’ve all seen on 360."

Reflow experts from Manncorp did an extensive investigation into the quality of the lead free solder joints in the 360 and found that with an x-ray they could actually see solder balls that did not look like they had been re-flowed properly in the first place. On a pair of x-rays from the GPU, different sized solder balls are clearly visible, which is an indication some spheres were not completely reflowed during manufacture....

001.jpg


Manncorp also noted, "While a "cold" solder joint may provide an adequate electrical connection, long-term reliability is jeopardized, especially in application where the solder bonds are subject to wide temperature fluctuations. In such an environment, continuous expansion and contraction of materials with varying thermal coefficients will quickly destroy the integrity of a "cold" solder joint, creating intermittent problems or even complete failure. This is precisely the environment of the Xbox 360 motherboard, due to the high amounts of heat generated by the CPU, GPU and memory components when running graphics-intensive gaming applications."

The guys at bunniestudios decided to send in a 360 for failure analysis to MEFAS for solder joint inspection on the GPU through a process called “dye and pry”. In this process, the motherboard is flooded with red ink, and then the GPU is mechanically pried off the board. The red ink flows into any of the tiny cracks in the solder balls, and at least in theory, when you pry the GPU off the cracked regions will shear first so you will be left with visible red spots at the points of failure.

Normal Solder Joint:

Here is one of several balls on the GPU that exhibited signs of partial failure, showing there was some “voiding” seen in the balls, e.g. trapped gas bubbles inside the solder balls that might serve as starting points for mechanical failure :

  • Bad Heatsink Mounting due to X-clamp Setup
In many units with 3 red lights or no video problems, we noticed that the GPU heatsink looked to be sitting slightly crooked, with one side raised 1mm or so above the other. This leads to extra pressure being applied at one corner and not enough on the opposite side. The x-clamp setup allows the heatsink to sit unlevel because of its free floating design. Then they use way too much thermal paste, which they apply to the heatsink first, then secure the heatsink with the x-clamps.

The x-clamps essentially act like a little prying device that stresses the connections more and more with each thermal cycle until the already poor solder connections are broken. The pics below were from a "no video" unit....

  • Sloppy Thermal Compound Application....
Almost every unit from the factory has a very sloppy application of thermal compound. Since the x-clamps have a hard time keeping the heatsink level, the compound oozes out all over other components, possibly causing risk of a short. Also since the heatsink is not flat and level, it has a horrible thermal connection with the chips leading to overheating. In the pics below you can see how they used way too much compound that spread out over surrounding components. Notice all the air pockets that were formed which would act like little insulators. Also the compound in general is very dry/crusty and not likely to be very high quality...

xbox_thermal_paste.jpg



As you can clearly see, there are many factors which contribute to the xbox 360 3 red light error problems. Controlling heat and replacing the x-clamp retention system is the first step to take to protect your console. The "screws & bolts" x-clamp fixes are usually very temporary and still allow the motherboard to flex over time. They also do not address the overheating issues like the Hybrid eXtreme Uniclamp system does. eXtreme-Cool 360 thermal compound is super easy to install and provides the best thermal interface connection possible. Working units can benefit from using the kits by preventing further flexing/damage and fixing the problem before it starts. Most 3RROD units can be easily repaired in less than an hour, although a few stubborn units will require a reflow for proper repair. After a reflow, install the x-clamp repair kit and prevent the problems from ever coming back. Make sure to get the real Hybrid eXtreme Uniclamp X-clamp repair kits made by xbox-experts.com
 