Nintendo Switch Tech Speculation discussion

My point is you could take a 400 GFLOP GPU with 25 GB/s and an 800 GFLOP GPU with 25 GB/s; even though both are limited by bandwidth, the 800 GFLOP GPU will still be a lot faster, even though the more powerful GPU is more severely starved for bandwidth.

The power words here are "diminishing returns" - if that 800GFLOP/512SP GPU never gets to stretch its proverbial legs because it's hamstrung by low bandwidth, then why put it in there in the first place? Why not put in a 600GFLOP/384SP GPU that consumes less power, produces less heat, and still gives 90-95% of the performance, but at a lower cost from better yields?

I keep coming back to the AMD APUs, but that's because they're an excellent example of a solution that's severely bandwidth-constrained at the low end. From 800 MHz to 2 GHz, FPS scales just about linearly with bandwidth.
 
The power words here are "diminishing returns" - if that 800GFLOP/512SP GPU never gets to stretch its proverbial legs because it's hamstrung by low bandwidth, then why put it in there in the first place? Why not put in a 600GFLOP/384SP GPU that consumes less power, produces less heat, and still gives 90-95% of the performance, but at a lower cost from better yields?

I keep coming back to the AMD APUs, but that's because they're an excellent example of a solution that's severely bandwidth-constrained at the low end. From 800 MHz to 2 GHz, FPS scales just about linearly with bandwidth.

I guess that is a question only the GPU engineers can answer, as well as the API developers.
 
Someone on Reddit posted a comparison of the new 2017 Shield TV and Switch's Tegra SoC
A portion of their serial codes is similar: 1632A2 PARR39.COP, with -A2 at the end on both.

Hopefully Chipworks or someone will delid these and take a die shot to compare.

IMO, considering it has exactly the same top-side decoupling caps, that's a convincing indication the die is practically the same.
If there had been important changes, the power delivery would have been altered, and thus the decoupling would have been readjusted.
 
I think the memory bandwidth is an issue where people look at it as if it's completely stalling the operation. It's not a hard ceiling that brings things to a halt; it only slows down operations that max out the bandwidth. Lots of ops will be limited by something other than bandwidth. For example, let's say bandwidth-heavy operations take up 50 percent of frame time. A bandwidth limitation may result in those operations now taking 60-70 percent of frame time. It's always bad to have anything slowing down the rendering time, but it's manageable. The developer will simply have to make choices: either reduce bandwidth requirements, or speed up code in other areas to give the bandwidth-heavy operations enough time. My point is you could take a 400 GFLOP GPU with 25 GB/s and an 800 GFLOP GPU with 25 GB/s; even though both are limited by bandwidth, the 800 GFLOP GPU will still be a lot faster, even though the more powerful GPU is more severely starved for bandwidth. I think there has been a misconception that bottlenecks are a hard ceiling severely limiting framerate, when they really only slow down the operations that max them out.
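To put rough numbers on that argument (the workload split below is made up purely for illustration, and it ignores that compute and memory traffic overlap in practice):

def frame_time_ms(gflops, gbps, compute_work_gflop, bw_work_gb):
    # compute-bound portion + bandwidth-bound portion of the frame
    compute_ms = compute_work_gflop / gflops * 1000.0
    bw_ms = bw_work_gb / gbps * 1000.0
    return compute_ms + bw_ms

# Hypothetical per-frame workload: 8 GFLOP of shading and 0.4 GB of memory traffic.
for gflops in (400.0, 800.0):
    t = frame_time_ms(gflops, 25.0, 8.0, 0.4)
    print(f"{gflops:.0f} GFLOPS @ 25 GB/s -> {t:.1f} ms/frame ({1000.0/t:.0f} fps)")

# 400 GFLOPS: 20 + 16 = 36 ms (~28 fps); 800 GFLOPS: 10 + 16 = 26 ms (~38 fps).
# Both hit the same 25 GB/s wall, yet the 800 GFLOPS part is still clearly faster;
# the bandwidth-bound share of the frame just grows, i.e. diminishing returns.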

A bandwidth bottleneck isn't a hard ceiling, but if you're spending, say, 80 mm^2 on 4 SMs to get to 1 TFLOPS, then maybe you'd be better off spending 60 mm^2 on 3 SMs and using the remaining 20 mm^2 on, say, 16 MB of eDRAM.
Not being a hard bottleneck doesn't mean there wouldn't be a lot to gain by spending your resources elsewhere, which is the whole point of trying to develop a balanced architecture.

That said, if this 4 SM SoC happens to be the real thing, I doubt nvidia would pair it with 25 GB/s. In Parker they just clocked the same number of SMs 50% higher and doubled the memory channels for a total of 4*32-bit, 50 GB/s. They increased the bandwidth disproportionately to the GPU performance increase. Either those Denver cores are extremely bandwidth hungry or the GPU in TX1 is already a bit bottlenecked at 25 GB/s shared.
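For what it's worth, a quick bytes-per-FLOP comparison using the approximate figures above (256 cores, TX1 at roughly 1 GHz and 25.6 GB/s, Parker clocked ~50% higher on a 128-bit bus; the clocks are approximations):

def fp32_gflops(cores, clock_ghz):
    return cores * 2 * clock_ghz  # 2 FLOPs per core per clock (FMA)

tx1    = {"gflops": fp32_gflops(256, 1.0), "gbps": 25.6}  # ~512 GFLOPS FP32
parker = {"gflops": fp32_gflops(256, 1.5), "gbps": 51.2}  # ~768 GFLOPS FP32

for name, chip in (("TX1", tx1), ("Parker", parker)):
    print(f"{name}: {chip['gbps'] / chip['gflops']:.3f} bytes of bandwidth per FLOP")

# TX1 lands around 0.050 B/FLOP, Parker around 0.067 B/FLOP: bandwidth grew faster
# than compute, which is exactly the disproportion described above.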



If frames or FLOPs is not a constant it means you have a non-deterministic benchmark which is pointless.
Why would you do benchmarks in the assembly factory with the purpose of thoroughly studying performance numbers?
It's not a benchmark but a stress test for quality control, one that just happens to return performance statistics (as all stress tests should?). All relevant benchmarks have probably been done long ago in Nintendo's labs, not on the assembly line.


FPS = frames / time
FLOP/s (FLOP_rate) = total FLOPs / time

FPS/FLOP_rate = (frames / time) / (FLOPs / time) = (frames / time) * (time / FLOPs) = frames / FLOPs
You do know you just wrote RATIO * (time/time) = RATIO, right?



Can you find the random number generation in Julia set fractal code?
I already wrote I got different results with benchmarks I ran myself. The Julia fractal I ran in the GpuTest suite has a completely different behavior and the viewpoint pans horizontally and vertically. It looks different every time I run it.
You can't just assume this "julia.h" test has the exact same code as this example you found...
 
IMO, considering it has exactly the same top-side decoupling caps, that's a convincing indication the die is practically the same.
If there had been important changes, the power delivery would have been altered, and thus the decoupling would have been readjusted.
Fully Custom Codename - free engraving with online orders.
 
I would assume cost and heat are typically the reasons SoCs use smaller memory buses? The Tegra X1 actually had pretty decent bandwidth for its FLOPS. Compared to other mobile SoCs, it's fairly balanced.

If there is a secret sauce per se, I would bet it comes down to its ability to use half-precision shaders. Sebbi mentioned upwards of 70 percent of shader work can be done in half precision. I think we are looking at hardware slightly more powerful than last gen in portable mode, and two to three times the throughput when docked. A huge upgrade over previous gen hardware is in the memory. Pretty much all developers wished the 360 and PS3 had significantly more memory. Skyrim, for example, struggled largely because of the limited 512 MB of memory, especially on PS3. If those consoles had had 1-2 GB more memory, it would have lifted a huge burden from developers.
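Rough sketch of what the FP16 point is worth, assuming FP16 runs at exactly twice the FP32 rate and that the ~70% figure holds (both are assumptions):

fp16_fraction = 0.7   # share of shader work that can run in half precision (assumed)
fp16_speedup = 2.0    # FP16 rate relative to FP32 (assumed)

# Amdahl-style estimate of the effective ALU throughput gain
effective = 1.0 / ((1.0 - fp16_fraction) + fp16_fraction / fp16_speedup)
print(f"Effective ALU throughput gain: {effective:.2f}x")  # ~1.54x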

 
In that case, there is virtually zero hardware out there that is not bottlenecked :D
In real terms, a bottleneck describes a limiting factor far greater than the others in the system. So where the CPU, GPU and storage are perfectly balanced to keep everything busy, but the BW stops them being busy because not enough data can be fed, the BW would be a bottleneck. If you have enough BW to feed all the components and a potent GPU but a weak CPU that can't generate the workloads to keep the GPU busy, the CPU would be the bottleneck. In software, the 'bottleneck' is whatever resource is preventing one from achieving one's intended goal and has to be worked around.

Come on, we have to look at use cases here in order to evaluate it objectively. Never mind what marketing says: Switch is a handheld console first and a living-room console second. For that use case the hardware seems to be fine, especially taking power consumption into consideration!
The discussion has nothing to do with 25 GB/s in a handheld and everything to do with the wisdom of including an 800 GF GPU in such a BW-limited system, for the purpose of contesting the validity of the leak. Is the leak true or false? If true, it should make sense from a HW design POV.
 
My point is you could take a 400 GFLOP GPU with 25 GB/s and an 800 GFLOP GPU with 25 GB/s; even though both are limited by bandwidth, the 800 GFLOP GPU will still be a lot faster...
Has any one of our PC experts got some comparisons we can make between similar parts that vary mostly on BW? Are there any 1 TF ish GPUs with decent VRAM to compare the 850M to, for example? I tried a GTX470 versus 850M but didn't find any game benchmarks.
 
A huge upgrade over previous gen hardware is in the memory.

"Huge upgrade" over 11 year-old hardware means very little.
Yes, density and price-per-MB for RAM went insanely down during the last 11 years. What's new about that?
 
IMO, considering it has exactly the same top-side decoupling caps, that's a convincing indication the die is practically the same.
If there had been important changes, the power delivery would have been altered, and thus the decoupling would have been readjusted.

Yeah, the main customization will be the downclocks mentioned in the Eurogamer leak; everything pretty much makes perfect sense. So far we have too much evidence supporting the Eurogamer leak.

The VentureBeat story
The dev doc leak
The Switch teardown pics
Eurogamer confirming they've seen docs saying the specs in their leak will be the final specs.
 
Has any one of our PC experts got some comparisons we can make between similar parts that vary mostly on BW? Are there any 1 TF ish GPUs with decent VRAM to compare the 850M to, for example? I tried a GTX470 versus 850M but didn't find any game benchmarks.

There's a slight difference in clock speeds, but the easiest comparison would be the GTX 860M - it's the same core (GM107), but it comes with GDDR5 as standard, whereas GDDR5 is only an option on the 850M.

http://www.gaminglaptopsjunky.com/gtx-860m-vs-gtx-850m/

There's also an older comparison of the Kepler-based GT750M at the same site. It's much worse for the DDR3 card here because Kepler's color compression wasn't nearly as good as Maxwell's:

http://www.gaminglaptopsjunky.com/gt-750m-gddr5-vs-gt-750m-ddr3-gaming-performance-tested/

On a desktop GPU, you could easily use an OC tool (Rivatuner/Afterburner/etc) to lock power states/clock speeds as needed to yield the desired bandwidth and compute.
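For reference, the bandwidth gap behind those comparisons comes straight from effective transfer rate * bus width / 8; the DDR3 clock varies by laptop model, so treat these as ballpark values:

def bandwidth_gbps(mtps, bus_bits):
    return mtps * bus_bits / 8 / 1000.0

print(f"GTX 850M DDR3  (128-bit, ~2000 MT/s): {bandwidth_gbps(2000, 128):.0f} GB/s")  # ~32 GB/s
print(f"GTX 860M GDDR5 (128-bit, ~5000 MT/s): {bandwidth_gbps(5000, 128):.0f} GB/s")  # ~80 GB/s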
 
You do know you just wrote RATIO * (time/time) = RATIO, right?

I'm going to go through this again in more detail and hopefully you can explain which part you take issue with.

1) A benchmark measures FPS after executing a fixed number of frames over a variable amount of time.
2) A benchmark measures FLOP/s after executing a fixed number of FLOPs over a variable amount of time.
3) The time in the above two statements is the same.
4) Therefore, the ratio of FPS to FLOP/s should be a constant.
5) That is not the case in this screenshot.
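A minimal sketch of the check being described, with made-up run times and counters; if the frame count and FLOP count are both fixed, the ratio must come out identical for every run:

runs = [
    {"frames": 5000, "flops": 2.0e12, "seconds": 41.0},
    {"frames": 5000, "flops": 2.0e12, "seconds": 43.5},
]

for r in runs:
    fps = r["frames"] / r["seconds"]
    flop_rate = r["flops"] / r["seconds"]
    print(f"FPS / FLOP_rate = {fps / flop_rate:.3e}")  # 2.500e-09 both times

# If the screenshot shows different ratios across runs, then either the frame count
# or the FLOP count is not actually fixed (e.g. the workload depends on the camera path).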
 
The discussion has nothing to do with 25 GB/s in a handheld and everything to do with the wisdom of including an 800 GF GPU in such a BW-limited system, for the purpose of contesting the validity of the leak. Is the leak true or false? If true, it should make sense from a HW design POV.

You forget that the device is probably expected to spend most of its time undocked, a mode with less than 50% of the GFLOPs of docked mode. Would it make sense economically to have a lot of excess bandwidth for the most common operation mode? Wouldn't it make more sense to balance the chip for the most common operation mode, with just enough excess to make the faster mode possible, not ideal, just possible, and deal with the limitations where they arise in the faster mode? Isn't that a compromise Nintendo would be willing to make?

EDIT: Although we are discussing the viability of such a scenario, my opinion is that should that 800 GFLOP SoC exist, it would probably use a 128-bit interface.
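Putting the thread's own hypothetical numbers on that (~800 GFLOPS docked, a bit under half of that undocked, one shared 25.6 GB/s bus; none of these are confirmed specs):

bus_gbps = 25.6
for mode, gflops in (("docked", 800.0), ("portable", 380.0)):
    print(f"{mode}: {bus_gbps / gflops:.3f} bytes of bandwidth per FLOP")

# Portable gets roughly twice the bandwidth per FLOP, so a bus sized around the
# common undocked case looks far less starved there than it does when docked.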
 
"Huge upgrade" over 11 year-old hardware means very little.
Yes, density and price-per-MB for RAM went insanely down during the last 11 years. What's new about that?
Depending on the game, lack of memory may have been the biggest limitation many games faced in the previous generation. What is this generation's best-selling game by far? GTA V, and Switch can run GTA V, no question. I think the penalty Switch pays for being a hybrid system gets blown out of proportion. Maybe Battlefield would be a challenge to port to Switch, but a game like Garden Warfare might be pretty easy. I'm just pointing out that not every game taxes hardware equally, and lots of memory has been a necessity for quite some time, more so than processing power for a great deal of games.

 
BTW, is there any reason 3733 Mbps LPDDR4 isn't being considered in the hypothetical discussions? Samsung lists its status as mass produced. Is that not truly the case? Or is there any other reason for it not being viable? 30 GB/s is a lot less limiting than 25 GB/s, I'd say...
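Where the ~30 GB/s figure comes from, assuming a single 64-bit LPDDR4 interface like the TX1's:

def lpddr4_gbps(mtps, bus_bits=64):
    return mtps * bus_bits / 8 / 1000.0

print(f"LPDDR4-3200 (64-bit): {lpddr4_gbps(3200):.1f} GB/s")  # 25.6 GB/s
print(f"LPDDR4-3733 (64-bit): {lpddr4_gbps(3733):.1f} GB/s")  # ~29.9 GB/s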
 
I would assume cost and heat are typically the reasons SoCs use smaller memory buses?

Power consumption is a big one for mobile - often bigger than cost and heat.

Driving a bus takes power, as does sending data across it and refreshing the DRAM connected to it. Widening the interface also takes additional die area that could be used for compute or caches or whatever.

Every time you send data off-chip you burn hundreds (probably - it's a lot) of times the power you do by staying in registers or on-chip cache. HBM on a silicon interposer is better than regular DRAM, but still more expensive than staying on-chip.
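Purely illustrative orders of magnitude (the per-access energies below are assumed ballpark figures, not datasheet numbers), just to show why off-chip traffic dominates:

energy_pj = {
    "register / ALU op":  1.0,    # assumed ~1 pJ
    "on-chip cache read": 10.0,   # assumed ~10 pJ
    "off-chip DRAM read": 500.0,  # assumed ~hundreds of pJ
}

base = energy_pj["register / ALU op"]
for what, pj in energy_pj.items():
    print(f"{what}: ~{pj:g} pJ ({pj / base:.0f}x the ALU op)")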

BTW, is there any reason 3733 Mbps LPDDR4 isn't being considered in the hypothetical discussions? Samsung lists its status as mass produced. Is that not truly the case? Or is there any other reason for it not being viable? 30 GB/s is a lot less limiting than 25 GB/s, I'd say...

If high end phones are reluctant to use it, a (relatively) low BOM device like the NX certainly won't.
 
BTW, is there any reason 3733 Mbps LPDDR4 isn't being considered in the hypothetical discussions? Samsung lists its status as mass produced. Is that not truly the case? Or is there any other reason for it not being viable? 30 GB/s is a lot less limiting than 25 GB/s, I'd say...
Cost and/or the memory controller just doesn't support it.
 
Yeah, the main customization will be the downclocks mentioned in the Eurogamer leak; everything pretty much makes perfect sense. So far we have too much evidence supporting the Eurogamer leak.

The VentureBeat story
The dev doc leak
The Switch teardown pics
Eurogamer confirming they've seen docs saying the specs in their leak will be the final specs.

Yup.

Switch is probably a custom bin of Nvidia's Shield/Pixel chip. Probably has the A53 cores enabled too, for background download / suspend / update etc. Though these are unlikely to ever be made available to developers, unless background game hosting becomes a thing ....

I'm expecting that Nintendo buy the chips direct from Nvidia and don't fab them themselves; that Nvidia had a hand in designing the main board (they've already been there, done that); that the API is based on Nvidia's existing work for Tegra; that Nintendo basically recognised that buying the processing technology for their platform from Nvidia was better than them half assing something themselves.

Nvidia get to re-use IP they've already spent money on, Nintendo get a ready made architecture and an API that already has a ton of preparatory background work done.

I think there will be no meaningful Nintendo performance enhancements at the chip level: there is no edram; no secret large pool of esram; no magic 2 x 64-bit LPDDR4 chips; no Denver cores being switched in for final hardware without developers being informed because (reasons).

It's almost certain we know what Switch is now. And it's a fine piece of gaming kit.
 