NVIDIA Tegra Architecture

extrajudicial said:
It's not just "clock gating", and how do you explain the fact that Kepler and Maxwell have the EXACT same power consumption on compute loads?
And what is the performance?

http://anandtech.com/show/8526/nvidia-geforce-gtx-980-review/20

http://www.extremetech.com/computing/190463-nvidia-maxwell-gtx-980-and-gtx-970-review/3


Unless you have some benchmarks (more than one) showing a GTX 980 performing equal to or worse than a 770 in compute, I'd stop while you're behind.
 
It's not just "clock gating", and how do you explain the fact that Kepler and Maxwell have the EXACT same power consumption on compute loads?
Do they?
And how do you explain a massive perf/W advantage over Kepler and GCN?

Your explanation is that "the architecture is different" ... Ok!
No. My explanation was: all known architectural changes to Maxwell are exactly the kind of thing you'd do once the low-hanging clock gating fruit has already been picked: reducing expensive operations, reducing data movement.

Clock gating is nothing new. It's been very well supported by the standard ASIC tools for over a decade. And it's not as if Kepler was a disaster in terms of perf/W either: it was very competitive, if not better, compared to the competition.

The perf/W improvement of Maxwell took everybody by surprise for a reason: nobody expected it to be possible.
 
From the CUDA thread, it's clear that they added some kind of register reuse cache that reduces register fetches from the register banks. The banks are pretty large, so that should be quite an optimization. And if the register reuse cache is much closer to the ALUs, they lose less power moving the operands around as well. And then there's the reduced HW scheduling and the smaller crossbar that no longer lets every operand be routed to every ALU.
Can you provide a link to the CUDA thread you're referring to? I don't remember seeing one recently.
 
Can you provide a link to the CUDA thread you're referring to? I don't remember seeing one recently.
I can't find where we talked about it here (ok, I'm on my phone and didn't really look), but it's all about this: https://code.google.com/p/maxas/source/browse/

The guy decoded the Maxwell opcodes, wrote a scheduling assembler, and used it to write a very high performance SGEMM library. Crazy stuff. Tons of documentation about it too.

Read the SGEMM article to find out about the register reuse cache: https://code.google.com/p/maxas/wiki/sgemm
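
For a concrete picture of what that reuse cache buys, here's a minimal CUDA sketch (my own illustration, not code from the Maxas project; the kernel name and tile size are made up) of the FFMA-dense outer-product pattern the SGEMM article is built around, where each source register feeds several back-to-back FMAs:

Code:
// Illustrative only: each a[i] and b[j] is consumed by several consecutive
// FMAs. This is the access pattern the Maxwell operand reuse cache targets:
// the assembler can flag such operands for reuse so they are read from a
// small cache next to the ALUs instead of re-fetched from the register banks.
__global__ void outer_product_accumulate(const float* A, const float* B, float* C)
{
    float a[4], b[4], acc[4][4] = {};

    // Load a small working set into registers (addressing kept trivial).
    for (int i = 0; i < 4; ++i) {
        a[i] = A[threadIdx.x * 4 + i];
        b[i] = B[threadIdx.x * 4 + i];
    }

    // 16 FMAs over only 8 source registers: heavy operand reuse.
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            acc[i][j] = fmaf(a[i], b[j], acc[i][j]);

    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            C[(threadIdx.x * 4 + i) * 4 + j] = acc[i][j];
}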
 
Also, even if compute saw less improvement in perf/W, so, uhh, watt? I'd call that an interesting datapoint, worthy of discussion about how such a workload could trigger different HW paths. Not something sinister.
But without having seen any power numbers for pure compute workloads, that's all academic anyway.
 
I'm not saying that the architecture isn't more efficient, but its efficiency is only really easy to evaluate when it's under non-compute loads. When Maxwell is forced to compute, it's easily pulling 300W. Still 20-30W less than a 680/770, but that's not much considering the 120W difference in TDP.


Nvidia has done some really amazing work with clockspeed and power gating. And their work brings legit benefits to GPU efficiency when it renders games. As soon as GM204 switches to compute, it becomes just as big a hog as Kepler.


I'm just wondering how much of mobile GPU load will be gaming vs. compute.
 
Also, even if compute saw less improvement in perf/W, so, uhh, watt? I'd call that an interesting datapoint, worthy of discussion about how such a workload could trigger different HW paths. Not something sinister.
But without having seen any power numbers for pure compute workloads, that's all academic anyway.


So what? So maybe all our assumptions about Kepler being super efficient are incorrect?

This is not a curiosity it's a serious aspect of maxwell performance. This is the topic to this thread. If you just want to sweep all this under the rug i would like to at least See an explanation for why that 980 is drawing almost the same as the 770 and also why the nexus 9 with a smaller screen and almost same battery size yet 1hr less charge than an iPad.
 
I'm not saying that the architecture isn't more efficient, but its efficiency is only really easy to evaluate when it's under non-compute loads. When Maxwell is forced to compute, it's easily pulling 300W. Still 20-30W less than a 680/770, but that's not much considering the 120W difference in TDP.


Nvidia has done some really amazing work with clockspeed and power gating. And their work brings legit benefits to GPU efficiency when it renders games. As soon as GM204 switches to compute, it becomes just as big a hog as Kepler.


I'm just wondering how much of mobile GPU load will be gaming vs. compute.

The difference in TDP between the GTX 980 and GTX 680 is only 30 watts, not 120. That said, I'm not sure what the use of comparing TDP is; you're better off comparing actual power consumption.

Now let's see, FurMark is pretty much a power virus, which means it's hard to find a workload with higher power consumption. Power consumption in Crysis 3, for example, would be more realistic for games at least, but that also taxes the CPU more. In both, a GTX 980 consumes roughly the same or less power than a GTX 680 or GTX 770, while at the same time performing between 50% and 200% better in compute workloads (from the AnandTech compute benches). Now, can you explain to me how that does not indicate a major improvement in perf/W? From what I'm seeing here, saying that it's twice as efficient as Kepler wouldn't be too unrealistic.

That said, while I believe my reasoning is pretty sound, we haven't seen any actual power consumption figures for compute workloads on GM204. That's also why I'm careful about making any definite statements regarding compute power efficiency. But on the other hand, isn't FurMark mostly a compute workload? Or does it, for example, not tax the memory subsystem enough, or something?

In the end, you're still coming back to your clockspeed and power gating argument, which to me seems to indicate that you're not really listening to what others have to say.
 
Wow, you are a very distasteful person to have to educate, but that's ok :)


www.anandtech.com/show/8526/nvidia-geforce-gtx-980-review/21

Scroll down to the third benchmark (Load Power Consumption - FurMark)


The 2014 GTX 980 drew 294W to the 680's 314W. That's two years of Nvidia's best work at efficiency: twenty watts.


I would appreciate an apology for your rudeness.

Efficiency = performance/power.

Power didn't decrease a lot but performance shot up dramatically.
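
To put rough numbers on that, here's a back-of-the-envelope sketch. The values are illustrative only: the power figures are the FurMark draws quoted above, and the 1.5x speedup is deliberately the low end of the 50-200% compute gains cited earlier from the AnandTech benches.

Code:
#include <cstdio>

int main()
{
    // Power figures quoted earlier in the thread (FurMark load draw).
    const double watts_kepler  = 314.0;  // GTX 680
    const double watts_maxwell = 294.0;  // GTX 980

    // Assumed relative compute throughput: 1.5x is the LOW end of the
    // 50-200% range cited from the AnandTech compute benches.
    const double perf_kepler  = 1.0;
    const double perf_maxwell = 1.5;

    const double eff_kepler  = perf_kepler  / watts_kepler;   // perf per watt
    const double eff_maxwell = perf_maxwell / watts_maxwell;

    // ~1.6x even with near-identical power draw and the most pessimistic speedup.
    printf("perf/W improvement: %.2fx\n", eff_maxwell / eff_kepler);
    return 0;
}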
 
The difference in TDP between the GTX 980 and GTX 680 is only 30 watts, not 120. That said, I'm not sure what the use of comparing TDP is; you're better off comparing actual power consumption.



Now let's see, FurMark is pretty much a power virus, which means it's hard to find a workload with higher power consumption. Power consumption in Crysis 3, for example, would be more realistic for games at least, but that also taxes the CPU more. In both, a GTX 980 consumes roughly the same or less power than a GTX 680 or GTX 770, while at the same time performing between 50% and 200% better in compute workloads (from the AnandTech compute benches). Now, can you explain to me how that does not indicate a major improvement in perf/W? From what I'm seeing here, saying that it's twice as efficient as Kepler wouldn't be too unrealistic.



That said, while I believe my reasoning is pretty sound, we haven't seen any actual power consumption figures for compute workloads on GM204. That's also why I'm careful about making any definite statements regarding compute power efficiency. But on the other hand, isn't FurMark mostly a compute workload? Or does it, for example, not tax the memory subsystem enough, or something?



In the end, you're still coming back to your clockspeed and power gating argument, which to me seems to indicate that you're not really listening to what others have to say.


That's a gaming load! I think we can all agree that Crysis 3 is a game, right?


This isn't tough to test on your own. Load up any compute task intensive enough to fully load the GPUs and Maxwell's efficiency advantage drops to <10%. That doesn't mean it won't be generally much more efficient; it will be, because few workloads fully load the GPU the way intensive compute does.

Maxwell's efficiency advantage is very big under gaming loads on x86, but ARM has different competition. PowerVR has demonstrated they can match and beat Nvidia GPUs (Tegra 3/4) with their own designs, and that's because ARM is a different beast.
 
extrajudicial said:
Load up any compute task intensive enough to fully load the GPUs and Maxwell's efficiency advantage drops to <10%. That doesn't mean it won't be generally much more efficient; it will be, because few workloads fully load the GPU the way intensive compute does.
You are still confused. Power consumption is only half of the efficiency equation. A GTX980 may be using 90% of the power of a GTX770 in some cases, but it is also providing much, much better performance.
 
I'm on mobile and am having a hard time linking all the reviews. The consensus is that yes, it's very fast but very buggy. Lots of crashing and updates.

That is nonsense. I actually own a Shield Tablet, and the performance and stability are excellent and consistently as good as or better than the actively cooled Shield Portable (which is a great performer too). On Amazon there are more than a hundred reviews with an average rating of 4.5 out of 5 stars. The professional reviews have been very positive too.
 
I'm not saying that the architecture isn't more efficient, but its efficiency is only really easy to evaluate when it's under non-compute loads. When Maxwell is forced to compute, it's easily pulling 300W.

That is completely false. GTX 980 reference cards are set to have a power limit such that TDP is ~180W, while GTX 970 reference cards are set to have a power limit such that TDP is ~150W. Some AIB vendors such as Gigabyte have released Maxwell cards with higher power limits set in the BIOS, which allows for a higher TDP but also higher compute performance. Here are some details from well-respected reviewer Damien @ hardware.fr: http://forum.beyond3d.com/showpost.php?p=1876415&postcount=2363 . And here are GPGPU power consumption measurements from Tom's Hardware, which show a reference GTX 980 card in comparison to non-reference Gigabyte Windforce cards that have a much higher power limit:

[Image: Tom's Hardware torture-test power consumption overview]
 
That's a gaming load! I think we can all agree that Crysis 3 is a game, right?


This isn't tough to test on your own. Load up any compute task intensive enough to fully load the GPUs and Maxwell's efficiency advantage drops to <10%. That doesn't mean it won't be generally much more efficient; it will be, because few workloads fully load the GPU the way intensive compute does.

Maxwell's efficiency advantage is very big under gaming loads on x86, but ARM has different competition. PowerVR has demonstrated they can match and beat Nvidia GPUs (Tegra 3/4) with their own designs, and that's because ARM is a different beast.

It doesn't matter a bit that Crysis 3 is a game or that FurMark isn't a proper GPU compute benchmark. They give an indication of max power draw, and your max power draw will be roughly the same even for the most taxing compute workloads. All this while we see a huge increase in performance, which means that Maxwell gives you a huge increase in efficiency compared to Kepler.

How all this will translate to how they compare against GPU designs from the likes of IMG is anyone's guess at this point. We'll find out in due time. Also, x86 vs ARM doesn't really matter much when it comes to GPU performance and efficiency. That IMG overall seemed to have better designs than NVIDIA had in their Tegra 3/4 chips had nothing to do with ARM being a different beast, just with IMG being very good at what they do. With Tegra K1 and future Maxwell Tegras, NVIDIA finally seems to be bringing their A game to the low-power SoC space. As I said, time will tell how they all compare. We also shouldn't forget about Qualcomm....
 
If K1 performs as advertised I will be very impressed. I am discounting Maxwell's efficiency overall; I'm just saying there is nothing magic about the architecture that makes it more efficient. It's 100% power gating and extremely fine clockspeed management.


There is nothing wrong with that; it's a perfectly good way to manage power use... but it won't work for compute.


I agree that IMG isn't cashing in on anything special about ARM vs x86, but regarding power usage there are massively different TDP budgets between the parts IMG designs for ARM devices and what Nvidia builds for x86. IMG has had to manage power usage on a scale Nvidia never saw until Tegra.


And to the people claiming Tegra 3/4 were awesome: you're wrong. They held their own in benchmarks but throttled terribly and had dismal power usage numbers. K1 looks to be a massive improvement, but let's not pretend Nvidia has been successful with Tegra before.
 
extrajudicial said:
I am discounting Maxwell's efficiency overall
For no credible reason (you have yet to provide one anyway).

extrajudicial said:
I'm just saying there is nothing magic about the architecture that makes it more efficient.
Indeed, it is not magic; it is quality engineering.

extrajudicial said:
It's 100% power gating and extremely fine clockspeed management.
You have provided zero actual evidence that this is the case.

extrajudicial said:
But it won't work for compute.
You have also provided zero evidence that Maxwell is less efficient for compute workloads.
 
I'm just saying there is nothing magic about the architecture that makes it more efficient. It's 100% power gating and extremely fine clockspeed management.
I have given you three very concrete examples of architectural changes that undeniably are more power efficient:
- Simplification of operand fetching from the register bank: no more power-hungry crossbar to swizzle registers around. (See the Maxas SGEMM article.)
- A register cache to reduce the number of register bank accesses and the distance the operands need to travel. (Same SGEMM article.) The distance matters because, for layout reasons, large RAMs always need to be placed at the border of a block.
- A reduced number of ALUs where operations can be scheduled, which reduces the complexity of the routing logic.
But there are more:
- Operation scheduling removed from HW and moved to the compiler. (Nvidia slides, confirmed by Maxas.)
- Larger register files, which increase the number of registers that can be used in a shader. This improves occupancy and reduces the number of power-costly round trips to the L2 cache (see the occupancy sketch after this list).
- Increased utilization of the available ALUs: Kepler's SMX has 192 ALUs, but even Nvidia's SGEMM library never manages to exceed 75(?)% utilization. SMM has 128 ALUs, and the Maxas author gets to something like 96%.
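
On the register file / occupancy point, here is a minimal sketch (my own, not from the thread; the kernel and block size are placeholders) of how you can query the occupancy impact of a kernel's register and shared memory footprint with the CUDA runtime occupancy API:

Code:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float* out)
{
    // Placeholder work; real register usage depends on what the compiler emits.
    out[blockIdx.x * blockDim.x + threadIdx.x] = threadIdx.x * 0.5f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSize = 256;
    int blocksPerSM = 0;
    // Reports how many blocks fit on one SM given the kernel's register and
    // shared memory footprint; higher per-thread register use can lower this.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummy_kernel,
                                                  blockSize, 0 /* dynamic smem */);

    double occupancy = double(blocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("Theoretical occupancy at %d threads/block: %.0f%%\n",
           blockSize, occupancy * 100.0);
    return 0;
}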

These are all architectural changes that anyone who has a little bit of hardware knowledge would recognize as being beneficial for perf/W.

Yet you ignore them completely and repeatedly.

If you're willing to learn, you're welcome to join the discussion. But your stubborn way of covering your ears, yelling 'lalala', and perpetuating your mistaken beliefs is what's making you a troll.
 