Nvidia Ampere Discussion [2020-05-14]

If they're being less conservative about clocks and that's where the power consumption is coming from, it'll be interesting. Means we'll have better performance out of the box, but less overclocking potential. Also means they may be more sensitive to ambient temperatures. Maybe under-volting to keep boost clocks will be a thing like with AMD.
 
NVIDIA’s Flagship RTX 3090 GPU Will Have A TGP Of 350W, Complete Breakdown Leaked

https://wccftech.com/nvidia-rtx-3090-gpu-tgp-350w

Estimated Power Consumption / Losses

Total Graphics Power TGP 350 Watts
24 GB GDDR6X Memory (GA_0180_P075_120X140, 2.5 Watts per Module) -60 Watts
MOSFET, Inductor, Caps NVDD (GPU Voltage) -26 Watts
MOSFET, Inductor, Caps FBVDDQ (Framebuffer Voltage) -6 Watts
MOSFET, Inductor, Caps PEXVDD (PCIExpress Voltage) -2 Watts
Other Voltages, Input Section (AUX) -4 Watts
Fans, Other Power -7 Watts
PCB Losses -15 Watts
GPU Power approx. 230 Watts
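Quick sanity check on the leaked figures (a minimal sketch; every number comes straight from the table above, nothing else assumed):

```python
# Sanity check of the leaked RTX 3090 power breakdown (all values from the table above).
TGP = 350  # Total Graphics Power in watts

losses = {
    "24x GDDR6X modules @ 2.5 W each": 24 * 2.5,  # 60 W
    "NVDD VRM (GPU voltage)":          26,
    "FBVDDQ VRM (framebuffer)":         6,
    "PEXVDD VRM (PCI Express)":         2,
    "Other voltages, input (AUX)":      4,
    "Fans, other power":                7,
    "PCB losses":                      15,
}

total_losses = sum(losses.values())
print(f"Total losses: {total_losses:.0f} W")        # 120 W
print(f"GPU power:    {TGP - total_losses:.0f} W")  # 230 W, matching the leak
```

So the leak at least adds up internally: 350 W minus roughly 120 W of memory, VRM, fan, and PCB losses leaves about 230 W for the GPU itself.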
 
10nm? I guess that would explain the power consumption. Why isn't Nvidia transitioning to 7nm? I would have thought it would be very mature right now. Limited capacity of 7nm fabs?
 
The 10nm figure was a guess based on various earlier information; it's more accurate to say it's Samsung's 8nm process, which some people regard as an improved 10nm process. This was not said by Igor. Sorry, my mistake.
 
The 10nm figure was a guess based on various earlier information; it's more accurate to say it's Samsung's 8nm process, which some people regard as an improved 10nm process. This was not said by Igor. Sorry, my mistake.
Samsung is purported to be getting a "small number of orders", with rumors pointing to professional dies. Those would likely use Samsung's 7nm EUV process.

June 11, 2020
And that's precisely what is being rumoured: more than just a die shrink, though I doubt the consumer versions will use the same TSMC 7nm CoWoS design as the GA100. There is meant to be more L2 cache inside the GPU and though there seems to be half the number of Tensor Cores—those AI-specific bits of silicon—they seem to perform far better than the previous generation, which should all help when it comes to ray tracing.

There is still more speculation about whose 7nm process Nvidia is going to be using, with both TSMC and Samsung's node being thrown into the mix. Jen-Hsun has confirmed that Samsung will be manufacturing a small number of its graphics chips, with TSMC still set to remain the manufacturer of the vast majority of Ampere silicon.

I'd suggest that maybe Samsung's EUV node would be used for the larger, though smaller volume, professional dies, with the high-volume gaming chips likely to filter out of TSMC's established fabrication facilities, following on from the stacked GA100 chip the Taiwanese company has created for Nvidia.
https://www.pcgamer.com/nvidia-ampere-release-date-specs-performance/
 
Regarding Igor's speculation: how does he arrive at the strange 'space saving' placement of 12 GDDR6 chips, 4 left, 4 right, 3 top, 1 bottom?
Looking at the PCB of the RTX 2080 Ti, the placement is 4 top, 4 right, 4 bottom, which is definitely more space-saving, and that is on a full-sized PCB.
Notice that all the power regulators need quite a bit of space too.
Attachment: PCB (1).jpg
 
Rumor about RT on separate chip, for what it's worth: https://wccftech.com/nvidia-traversal-coprocessor-rtx-3090-gpu/
According to the source, the reason why NVIDIA has one fan at the bottom and one fan at the top is to facilitate airflow over two chips, not one. Coreteks guesses that this chip is something called a traversal coprocessor that will aid in the raytracing ability of the cards.
Erm... how exactly is the backside fan supposed to cool a chip that would sit at the same level as the fans on the topside, buried under the backplate at the opposite end of the card?
 
Rumor about RT on separate chip, for what it's worth: https://wccftech.com/nvidia-traversal-coprocessor-rtx-3090-gpu/
Utter BS. The entire programming model is tied to a single thread controlling a single ray at a time. With threads in flight, allocated registers, etc., it makes no sense whatsoever to offload that from the corresponding SMM; it's tightly bound to scheduling on that SMM and is, unless you are very unlucky with cache misses, most certainly latency sensitive, as latency between shader invocations directly translates into occupancy constraints. It does make sense to encapsulate it as a dedicated unit within an SMM, just like any other "core" in there, which is what the patent describes: fixed-function traversal, committing rays to the RT cores, and processing a warp full of hits as they are found.

Off-chip coprocessor, and then not even on the same package, is pure fiction. If someone suggested that NVidia would break entire SMMs off into chiplets, I might even believe that person. But ripping function units, for anything interleaved with execution flow, out of an SMM? Not a chance.
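To put rough numbers on the "latency translates into occupancy constraints" point, here's a back-of-the-envelope sketch. The register file size, warp limit and issue interval below are illustrative assumptions, not NVIDIA specifications:

```python
# Crude latency-hiding arithmetic: how much latency the warps resident on an SM
# can cover. All figures are illustrative assumptions, not NVIDIA specs.

REGFILE_REGS     = 65536  # 32-bit registers per SM (assumed)
THREADS_PER_WARP = 32
MAX_WARPS        = 32     # resident-warp limit per SM (assumed)
ISSUE_INTERVAL   = 4      # cycles between issues from any one warp (assumed)

def resident_warps(regs_per_thread):
    """Warps that fit once each thread has claimed its registers."""
    by_registers = REGFILE_REGS // (regs_per_thread * THREADS_PER_WARP)
    return min(by_registers, MAX_WARPS)

def hideable_latency(warps):
    """Cycles of stall one warp can tolerate while the others keep the SM busy."""
    return warps * ISSUE_INTERVAL

for regs in (32, 64, 128, 168):
    w = resident_warps(regs)
    print(f"{regs:3d} regs/thread -> {w:2d} resident warps, "
          f"~{hideable_latency(w):4d} cycles hideable")
```

The heavier the per-thread register allocation, the fewer warps are resident and the less latency can be hidden. Traversal sitting inside the SM and hitting L1/L2 stays within that budget; a round trip to a separate chip, easily thousands of cycles, would not, which is the whole problem with the "traversal coprocessor" idea.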
 
It's not necessarily *complete* BS. If you read the entire patent, Nvidia clearly expended an awful lot of effort to make the coprocessor asynchronous and decoupled from the main rendering path.
I do agree that off-package is quite unlikely, and wccftech's talk of a separate ray tracing add-in card isn't going to happen.

However, if you're already making a GPU with HBM, and you're already paying money for an interposer, breaking out the 'coprocessor' to its own chiplet doesn't seem like much of a stretch given the patent.
Also, since HBM2 has its own fancy control die at the bottom of the stack, it wouldn't be too much of a stretch for Nvidia to try to get one of its partners to make a variant of HBM that's dual-ported.

As the patent mentions, the coprocessor may function with read-only access to the shared memory too. Sounds exactly like the dual-ported VRAM that was common back in the 80s.
 

Attachments

  • patent-1.png
  • patent-2.png
Nvidia engineers spoke last year about chiplet designs (can't find the interview) and were very specific that future consumer products would remain monolithic dies for cost and complexity reasons.
 
Nvidia engineers spoke last year about chiplet designs (can't find the interview) and were very specific that future consumer products would remain monolithic dies for cost and complexity reasons.

I'd agree that it's probably highly unlikely in this case too, although not impossible.

After reading through the patent a few times, one of the first things I thought was that the strange rumours about GDDR6X made a little more sense if the ray tracing hardware was being at least partially decoupled from the rest of the GPU.
Interestingly enough, one of GDDR6's generally unused features is its dual-channel-per-package design.

It's not too much of a stretch to imagine tweaking it to either add a third 16-bit channel (read-only or otherwise) while keeping identical signaling specs, giving the ray tracing 'coprocessor' its own half-width memory bus,
or alternatively multiplexing the coprocessor's memory access onto one of the two existing 16-bit channels.

I.e., most of the time the basic GPU keeps its full memory bandwidth, but when the 'coprocessor' needs to access memory, it could lock the second channel on the GDDR6 for its own use and then release it when finished. At no point does the GPU completely stall or get locked out of memory, but from time to time it will drop to half memory bandwidth. If you aren't ray tracing, the base GPU gets the full memory bandwidth 100% of the time and the coprocessor sits idle.
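To put illustrative numbers on that (the package count and per-pin data rate below are made up for the example, not from the leak):

```python
# Rough bandwidth arithmetic for the channel-sharing idea above.
# Package count and data rate are assumptions for illustration only.

PACKAGES       = 12   # GDDR6 packages on a hypothetical 384-bit board
CHANNEL_WIDTH  = 16   # GDDR6 exposes two independent 16-bit channels per package
DATA_RATE_GBPS = 16   # Gb/s per pin (assumed)

def bandwidth_gb_s(channels_per_package):
    bus_width = PACKAGES * CHANNEL_WIDTH * channels_per_package
    return bus_width * DATA_RATE_GBPS / 8  # GB/s

print(f"GPU holding both channels:        {bandwidth_gb_s(2):.0f} GB/s (384-bit)")
print(f"GPU while coprocessor holds one:  {bandwidth_gb_s(1):.0f} GB/s (192-bit)")
```

So in the worst case the base GPU would temporarily run at half bandwidth while the coprocessor traverses, and at full bandwidth the rest of the time.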

The patent does make a lot more sense in the context of a hypothetical datacentre product for Google Stadia and the like, though.

Take GA100, add an RTX 'coprocessor' to the interposer, sprinkle a little dual-ported HBM2 into the mix, and you have your halo graphics product without needing to make a separate die or compromise GA100's compute performance by removing functional units to make room for the RTX hardware. Given that GA100 is already right near the reticle limit, it's either add the ray tracing capability as a coprocessor on the package, or design a completely separate ~800+ mm² die for a maximum-effort RTX-capable graphics product.

If this is the approach Nvidia's taking, it's pretty easy to see how the rumour mill may be right about the coprocessor/chiplet, just wrong about which product segment.
 

Attachments

  • gddr6.PNG
It's not necessarily *complete* BS. If you read the entire patent, Nvidia clearly expended an awful lot of effort to make the coprocessor asynchronous and decoupled from the main rendering path.

I don’t see anything in the patent that wouldn’t also apply to the existing on-chip co-processors (tessellators, TMUs). Certainly don’t see any evidence in the patent that nvidia is considering a separate chip for BVH intersection.

As usual the correct answer is probably the simplest one. The rumors are fantasy and RT hardware will remain on-chip.

Once we do get chiplets, the work to render a frame will be allocated at pixel-tile granularity across homogeneous chiplets. Bet on it.
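For what it's worth, here's a toy sketch of that kind of tile-granularity distribution (the tile size and chiplet count are arbitrary assumptions, not anything from a roadmap):

```python
# Toy illustration of distributing a frame across homogeneous chiplets at
# pixel-tile granularity. Tile size and chiplet count are arbitrary assumptions.

TILE_PX = 32  # tile edge in pixels (assumed)
GRID    = 2   # 2x2 checkerboard of tiles -> 4 identical chiplets (assumed)

def chiplet_for_pixel(x, y):
    """Checkerboard the screen's tiles across the chiplets so neighbouring
    tiles land on different chiplets and per-frame load stays balanced."""
    tx, ty = (x // TILE_PX) % GRID, (y // TILE_PX) % GRID
    return ty * GRID + tx

# Example: which chiplet owns the tile containing pixel (100, 40)?
print(chiplet_for_pixel(100, 40))  # -> 3 (tile column is odd, tile row is odd)
```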
 
Utter BS. The entire programming model is tied to a single thread controlling a single ray at a time. With threads in flight, allocated registers, etc., it makes no sense whatsoever to offload that from the corresponding SMM; it's tightly bound to scheduling on that SMM and is, unless you are very unlucky with cache misses, most certainly latency sensitive, as latency between shader invocations directly translates into occupancy constraints. It does make sense to encapsulate it as a dedicated unit within an SMM, just like any other "core" in there, which is what the patent describes: fixed-function traversal, committing rays to the RT cores, and processing a warp full of hits as they are found.
I completely agree. It might have made sense back in the early 90s, when everything that couldn't escape up a tree fast enough got its own coprocessor.

Off-chip coprocessor, and then not even on the same package, is pure fiction. If someone suggested that NVidia would break entire SMMs off into chiplets, I might even believe that person. But ripping function units, for anything interleaved with execution flow, out of an SMM? Not a chance.
Nothing I've read in the patent so far suggests that by co-processor they mean anything other than a semi-independent part of the IP, deeply integrated within the same physical chip as the … wait for it … streaming multiprocessors. For that reason, I could imagine that Gaming-Ampere retains the large(r) SM/L1 pool that GA100 has, maybe even hardcoded (BIOS-locked) in a triple split to reserve the additional memory over Turing for raytracing.
 