Nvidia GeForce RTX 50-series Blackwell reviews

Gonna need a no-questions-asked exchange commitment from the OEMs. And not RMA-and-pray: send replacements first, then receive the defective units back. Get a war chest ready, Nvidia.
 
How are affected units identified? Is it up to the end consumer to notice that their GPU is slower than expected?
 
5080 isn’t a cut down chip, so doubtful.
Doubtful or not, NVIDIA has now confirmed 5080s are affected too:
Upon further investigation, we’ve identified that an early production build of GeForce RTX 5080 GPUs were also affected by the same issue. Affected consumers can contact the board manufacturer for a replacement.
 
How are affected units identified? Is it up to the end consumer to notice that their GPU is slower than expected?

It seems the firmware is passing on the lower ROP count to GPU-Z and other tools. You wouldn't need a performance test and comparison to detect it as an end user.
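
As a rough sketch of what that check amounts to: compare the ROP count a tool reports against the published spec for the model. The spec values below are the commonly listed figures for these cards, and the function name is made up for illustration; this is not how GPU-Z itself works internally.

```python
# Commonly listed ROP counts per model (assumed spec figures, not from Nvidia's statement).
EXPECTED_ROPS = {
    "RTX 5090": 176,
    "RTX 5080": 112,
    "RTX 5070 Ti": 96,
}


def check_rops(model: str, reported_rops: int) -> str:
    """Compare a tool-reported ROP count against the expected spec value."""
    expected = EXPECTED_ROPS.get(model)
    if expected is None:
        return "unknown model"
    if reported_rops < expected:
        return f"affected: {expected - reported_rops} ROPs missing"
    return "ok"


print(check_rops("RTX 5090", 168))  # affected: 8 ROPs missing
print(check_rops("RTX 5080", 112))  # ok
```

So if the tool shows 168 on a 5090, that alone is enough to go to the board manufacturer, no benchmarking needed.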
 
Is this a firmware bug? I still have no idea how this happened.
 
My understanding is that GPU-Z (and similar programs) just parse data provided to them by the drivers, which in turn interact with the firmware on the graphics card. All the data they read and parse is just what the hardware vendors choose to provide. Data that isn't provided is hardcoded from the tools' own database. None of these programs can actually inspect the cards in any direct or physical sense.

I don't think Nvidia has provided any statement with more details on what is causing this. I believe they've given only two official statements to The Verge that have been reported so far -

We have identified a rare issue affecting less than 0.5% (half a percent) of GeForce RTX 5090 / 5090D and 5070 Ti GPUs which have one fewer ROP than specified. The average graphical performance impact is 4%, with no impact on AI and Compute workloads. Affected consumers can contact the board manufacturer for a replacement. The production anomaly has been corrected.

Upon further investigation, we've identified that an early production build of GeForce RTX 5080 GPUs were also affected by the same issue. Affected consumers can contact the board manufacturer for a replacement.

Now I believe because of this -


combined with Nvidia's official solution and response being physical replacement, as opposed to a software update of some sort, is leading to the assumption that it isn't a firmware or driver issue. But given there's been no real statement or third-party investigation, we don't actually know what the cause or the problem is, other than that it manifests as 8 fewer ROPs being reported and lower performance.

It's interesting that manufacturing/performance issues were also reported as the reason for delaying the RTX 5070 and 5060, while the recent statement from Nvidia seems to be that the 5070 is not affected by this.

Now, the following is just complete wild speculation/musing on my part, but I'm not sure the above necessarily rules out a firmware issue. I also wonder if Nvidia has opted not to go with a software update and would rather eat the replacement cost, with so few cards in the wild currently, for security reasons. Why? I don't think Nvidia has ever formally commented on this, but I think it's believed they moved away from physically disabling (and configuring) parts of GPUs once they moved to secured/signed firmware and added a security chip with Maxwell. Nvidia has released firmware update tools in the past (e.g. for ReBAR and DP fixes), but an exploit was eventually found and released that enabled BIOS modding to a degree.
 
It's a chip-binning 'bug' at the factory, where some chips with a defect in the ROPs got green-lit for card production when they shouldn't have been.
As for how it happened, no one outside of TSMC and Nvidia will be able to tell.

Why didn’t AIBs catch it? That means there’s very little QA being done across the board to validate basic specifications or the chips reported the wrong ROP count at some point in the process.
 
Why didn’t AIBs catch it?
Why would they catch it? They get chips for their SKUs from Nvidia; they don't validate them a second time.
I guess that if they were able to change the ROP count in their custom models, they would look at it and thus catch any discrepancy. But since this number is supposedly fixed for any SKU, even a factory-OCed one, there's no reason for them to monitor it.
 
Why would they catch it? They get chips for their SKUs from Nvidia; they don't validate them a second time.

They assemble and therefore must test the completed graphics card. As part of that testing they’re clearly not validating the basic configuration of the product. That seems bizarre to me. The more likely scenario is the cards were misreporting.
 
Also possible. The actual number is only shown in the presence of a driver, which AIBs may not in fact have had until the announcement itself, for leak-containment reasons.
GPU-Z does say that in the absence of a proper driver it will use a simple database lookup. This is a likely explanation.
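
To illustrate why that would hide the problem, here is a minimal sketch of the fallback behaviour described above. The database contents and function name are hypothetical stand-ins, not GPU-Z's actual code: with no driver value available, the tool would simply show the spec number for the SKU.

```python
from typing import Optional

# Hypothetical built-in spec database, standing in for whatever a tool ships with.
SPEC_DATABASE = {"RTX 5090": 176, "RTX 5080": 112, "RTX 5070 Ti": 96}


def displayed_rops(model: str, driver_reported: Optional[int]) -> int:
    """Return the driver-reported ROP count if available, else the database value."""
    if driver_reported is not None:
        return driver_reported
    return SPEC_DATABASE[model]


# Without a driver, a card with missing ROPs would still show the spec value.
print(displayed_rops("RTX 5070 Ti", None))  # 96
# With a driver passing through the real count, the discrepancy becomes visible.
print(displayed_rops("RTX 5070 Ti", 88))    # 88
```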

It is interesting that the performance loss wasn't caught. It is small but seems easily measurable and repeatable. Makes me wonder what level of performance variation is deemed acceptable. Or if they test for performance variation at all.
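
For what it's worth, a batch-level check for this kind of variation could be as simple as the sketch below. The 3% tolerance, the serial names, and the scores are all made up for illustration; nothing here reflects what Nvidia or the AIBs actually run.

```python
from statistics import median


def flag_outliers(scores: dict[str, float], tolerance: float = 0.03) -> list[str]:
    """Return serials whose score is more than `tolerance` below the batch median."""
    baseline = median(scores.values())
    cutoff = baseline * (1 - tolerance)
    return [serial for serial, score in scores.items() if score < cutoff]


# Hypothetical benchmark scores for four cards of the same SKU.
batch = {"A": 100.0, "B": 101.0, "C": 99.5, "D": 96.0}
print(flag_outliers(batch))  # ['D']
```

A roughly 4% deficit like the one Nvidia quotes would trip a 3% threshold, but only if run-to-run noise on a healthy card stays well under that.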

Still wondering though. If these ROPs are defective, how does the card know to exclude them? Surely the firmware would have to know how many functional units are...functional. Can NVIDIA simply zap some ROPs for binning reasons and make no firmware changes to the card?
 
This makes me sad. I guess all testing was automated and little to no manual testing was done. That seems to be the way of the tech industry at the moment: let feature development write some automated tests and ship!

Semiconductors haven’t been manually tested for 70 years. There aren’t enough humans to test the billions of chips produced. Intel manufactures a million CPUs a day. Are you expecting that a person installs them in a computer and runs tests??
 
Still wondering though. If these ROPs are defective, how does the card know to exclude them? Surely the firmware would have to know how many functional units are...functional. Can NVIDIA simply zap some ROPs for binning reasons and make no firmware changes to the card?

GN Steve has some insight into the AIB testing process having visited the factories and he’s also skeptical that this wasn’t found during basic validation. Either something changed to invalidate the test results (last minute firmware or driver update) or Nvidia and AIBs knew about it and decided to ship anyway. I don’t know which one is worse. Gross incompetence or extreme disdain for gamers.

 
Is the number of active functional units specified in a card's firmware? Is it automatically determined by the firmware based on some hardware configuration?

Basically if I were going to turn a 3090 into a 3080, what steps would I take so that the card knows how many and which ROPs should be used? How does it know which ones are disabled?
 
No idea, but if I had to guess, the testing software determines which functional units are working properly and that's then baked into the firmware. Maybe the microcontroller on the chip has some basic ability to self-diagnose completely busted data or power paths too. Any broken or unneeded units are presumably fused off in hardware and can't be enabled in firmware/software like in the old days.
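
Purely to illustrate the guess above, one way to picture it is an enable mask with one bit per ROP partition, which the firmware reads to derive the active count. The mask layout, the groups-of-8 assumption, and the 22-partition figure for a 176-ROP part are my own assumptions, not anything Nvidia has documented.

```python
ROPS_PER_PARTITION = 8  # assumption: ROPs are enabled or disabled in groups of 8


def active_rops(enable_mask: int) -> int:
    """Count enabled partitions (set bits) and convert to a ROP count."""
    return bin(enable_mask).count("1") * ROPS_PER_PARTITION


full_mask = (1 << 22) - 1          # hypothetical: 22 partitions, all enabled
one_fused = full_mask & ~(1 << 5)  # hypothetical: partition 5 fused off

print(active_rops(full_mask))  # 176
print(active_rops(one_fused))  # 168
```

Under that picture, one partition wrongly marked (or genuinely) bad is exactly the 8-ROP shortfall people are seeing reported.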
 
So this is kind of interesting: does Nvidia mention OpenCL support (any version) at all for Blackwell in official materials?

The 5xxx drivers' release notes actually leave Blackwell out of the listed OpenCL support - https://us.download.nvidia.com/Windows/572.60/572.60-win11-win10-release-notes.pdf

6.2 Support for OpenCL 3.0
Maxwell, Pascal, Volta, Turing, and NVIDIA Ampere architecture GPUs are supported.

Whitepaper, zero mentions of OpenCL - https://images.nvidia.com/aem-dam/S...ell/nvidia-rtx-blackwell-gpu-architecture.pdf

Product page, zero mentions of OpenCL - https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/

From what I recall, Nvidia has historically backburnered OpenCL and rarely promoted its support. That makes the communication question a bit interesting: I don't recall them ever publicly communicating OpenCL support with past GPU launches (especially as part of the launch marketing), so would they be obligated to media blitz that Blackwell/5xxx doesn't support OpenCL?

As for CUDA, my understanding is that Nvidia has been gradually deprecating 32-bit CUDA support on the development side since, I think, 2014. Since CUDA support from the development perspective has never been a part of the consumer marketing, it's basically never been brought up. I don't think the consumer side, even enthusiasts, has ever cared about or really discussed the details of CUDA support until now (as there was no direct consumer-facing impact). Deprecation of 32-bit CUDA support was communicated in the development tool releases, so it wasn't a hidden change that was uncovered. So it does raise an interesting question of how much communication is owed here, and to whom.
 