Nvidia Ampere Discussion [2020-05-14]

DavidGraham · Apr 16, 2021

Sony is installing a cluster of NVIDIA DGX A100 systems linked on an NVIDIA Mellanox InfiniBand network.

Sony’s engineers are packing machine-learning smarts into products from its Xperia smartphones, its entertainment robot, aibo, and a portfolio of imaging components for everything from professional and consumer cameras to factory automation and satellites. It’s even using AI to build the next generation of advanced imaging chips.

https://blogs.nvidia.com/blog/2021/...582&linkId=100000040843726#cid=_so-twit_en-us

Deleted member 2197 · Jul 7, 2021

NVIDIA Nsight Perf SDK v2021.1 Now Available | NVIDIA Developer Blog
July 6, 2021

The Nsight Perf SDK v2021.1 public release is now available for download.

New features include:

Simplified APIs to measure GPU performance within your application, and generate HTML performance reports.

Lower-level range profiling APIs, with utility libraries providing ease-of-use.

All of the above, usable in D3D11, D3D12, OpenGL, and Vulkan.

Samples illustrating usage in 3D, Compute, and Ray Tracing.

New: Ray Tracing Samples

Below is a screenshot of Microsoft’s Real Time Denoised Ambient Occlusion sample, containing Nsight Perf SDK instrumentation. Each render pass or phase of execution has been annotated to take a measurement.

Deleted member 2197 · Nov 11, 2021

NVIDIA® Nsight™ Graphics 2021.5 is released
November 10, 2021

Feature Enhancements:

Nsight Graphics & Nsight Aftermath now provide full support for Windows 11

NVIDIA NGX is now supported for Linux (in addition to Windows). For more details on NVIDIA NGX, see NVIDIA NGX Technology - AI for Visual Applications

Added acceleration structure analysis to the acceleration structure viewer

The viewer now offers the ability to see instances that are excessively overlapping, following the best practices for NVIDIA RTX Ray Tracing.

See Best Practices: Using NVIDIA RTX Ray Tracing
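To picture what "excessively overlapping" instances might look like, one can compare the axis-aligned bounding boxes of the instances. The sketch below is only an illustration and not how Nsight Graphics performs its analysis (the release notes do not describe the method); the `AABB` class, the `overlapping_pairs` helper and the 0.4 threshold are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class AABB:
    """Axis-aligned bounding box of one instance (hypothetical type)."""
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner

    def volume(self) -> float:
        v = 1.0
        for a, b in zip(self.lo, self.hi):
            v *= max(0.0, b - a)
        return v


def overlap_volume(a: AABB, b: AABB) -> float:
    """Volume of the intersection of two boxes, 0.0 if they are disjoint."""
    v = 1.0
    for alo, ahi, blo, bhi in zip(a.lo, a.hi, b.lo, b.hi):
        extent = min(ahi, bhi) - max(alo, blo)
        if extent <= 0.0:
            return 0.0
        v *= extent
    return v


def overlapping_pairs(boxes, threshold=0.4):
    """Return (i, j, ratio) for pairs whose shared volume exceeds
    `threshold` times the volume of the smaller box (assumed metric)."""
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(boxes), 2):
        smaller = min(a.volume(), b.volume())
        if smaller == 0.0:
            continue  # skip degenerate boxes
        ratio = overlap_volume(a, b) / smaller
        if ratio > threshold:
            flagged.append((i, j, ratio))
    return flagged


if __name__ == "__main__":
    instances = [
        AABB((0, 0, 0), (2, 2, 2)),
        AABB((1, 0, 0), (3, 2, 2)),      # shares half its volume with box 0
        AABB((10, 10, 10), (11, 11, 11)),  # far away from the others
    ]
    print(overlapping_pairs(instances))
```

Rays that enter the shared region have to descend into both instances, which is why heavy overlap tends to cost traversal time.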

Improvements:

Added an API for providing project parameters to the Nsight Graphics SDK injection API.

Added a file-based hierarchical sorting mode to the shader profiler.

This allows users to see all of the shaders originating from a particular file.

Additional Changes:

Support for graphics debugging and profiling of x86 (32-bit) applications will be removed in a future release. We will continue to support x86 (32-bit) launchers (e.g., Steam, Origin) that start x64 (64-bit) applications.

Known Issues:

When profiling ray tracing shaders using the Shader Profiler on a system with the R495 series driver, there is a known issue where samples may not be attributed. These samples will be classified as 'Unattributed'. This issue will be addressed in a future driver release.

After generating a trace of Quake II RTX in GPU Trace and clicking Analyze, the host may crash if you repeatedly click and cancel the Aggregate option. If you encounter this crash, please try clearing your app data as a workaround by going to the Help menu in Nsight Graphics and clicking on "Reset Application Data…"

Windows 11: Nsight Graphics may show incorrect Exception Summary status in the Reset Channel field of the Aftermath Dump Info tab for Vulkan applications. In addition, Reset Channel may show "3D" instead of "Compute."

Nsight Graphics Documentation (nvidia.com)
Download Center | NVIDIA Developer

CarstenS · Nov 11, 2021

Speaking of RT. Nvidia's "Best Practices" mention "Use triangle geometries when possible. Hardware excels in performing ray-triangle intersections. Ray-box intersections are accelerated too, but you get the most out of the hardware when tracing against triangle geometries."

I was under the impression that Nvidia as well as AMD could perform 4 box intersections or 1 triangle intersection per unit. Has anyone seen more documentation on Ampere (and possibly Turing) in this matter?

trinibwoy · Nov 11, 2021

CarstenS said:
Speaking of RT. Nvidia's "Best Practices" mention "Use triangle geometries when possible. Hardware excels in performing ray-triangle intersections. Ray-box intersections are accelerated too, but you get the most out of the hardware when tracing against triangle geometries."

I was under the impression that Nvidia as well as AMD could perform 4 box intersections or 1 triangle intersection per unit. Has anyone seen more documentation on Ampere (and possibly Turing) in this matter?

Nvidia has never shared the number of box or triangle intersections per clock in shipping products. If I recall correctly, their patents referred to compressing 8 boxes into a single cache line, so that may be a hint.

"In one embodiment, the TTU may include four traversal units to test up to eight child nodes for intersection with the ray in parallel."

https://patents.google.com/patent/US9582607B2/en

Jawed · Nov 11, 2021

CarstenS said:
Speaking of RT. Nvidia's "Best Practices" mention "Use triangle geometries when possible. Hardware excels in performing ray-triangle intersections. Ray-box intersections are accelerated too, but you get the most out of the hardware when tracing against triangle geometries."

I was under the impression that Nvidia as well as AMD could perform 4 box intersections or 1 triangle intersection per unit. Has anyone seen more documentation on Ampere (and possibly Turing) in this matter?

Ampere's triangle intersection rate is twice Turing's, isn't it?

The page was written in August 2020, before Ampere came along.

Is Ray-box used primarily (solely?) for procedural geometry?

trinibwoy · Nov 11, 2021

Jawed said:
Is Ray-box used primarily (solely?) for procedural geometry?

All geometry (triangles and procedural) is encapsulated in boxes. It's not a choice between boxes vs triangles.
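To make the distinction concrete: boxes are tested while walking the acceleration structure, and the primitive test (triangle or custom/procedural) only runs once a ray reaches a box that contains primitives. The following is a minimal software sketch of that idea with a single box around a list of triangles. It says nothing about how the RT cores are actually built; the function names are hypothetical, the box test is the standard slab method, and the triangle test is the well-known Möller-Trumbore algorithm.

```python
import math


def ray_box_hit(origin, direction, lo, hi):
    """Slab test: True if origin + t*direction (t >= 0) touches the box."""
    t_near, t_far = 0.0, math.inf
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            # Ray is parallel to this pair of planes: must start between them.
            if o < l or o > h:
                return False
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle_hit(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: return the hit distance t, or None on a miss."""
    e1, e2 = _sub(v1, v0), _sub(v2, v0)
    p = _cross(direction, e2)
    det = _dot(e1, p)
    if abs(det) < eps:
        return None  # ray is parallel to the triangle plane
    inv = 1.0 / det
    s = _sub(origin, v0)
    u = _dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = _cross(s, e1)
    v = _dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = _dot(e2, q) * inv
    return t if t > eps else None


def trace(origin, direction, box, triangles):
    """Box test first; triangle tests run only if the enclosing box is hit."""
    lo, hi = box
    if not ray_box_hit(origin, direction, lo, hi):
        return None
    nearest = None
    for v0, v1, v2 in triangles:
        t = ray_triangle_hit(origin, direction, v0, v1, v2)
        if t is not None and (nearest is None or t < nearest):
            nearest = t
    return nearest


if __name__ == "__main__":
    tri = ((0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 1.0, 5.0))
    box = ((0.0, 0.0, 4.0), (1.0, 1.0, 6.0))
    print(trace((0.25, 0.25, 0.0), (0.0, 0.0, 1.0), box, [tri]))
```

With procedural geometry, the inner `ray_triangle_hit` call would be replaced by a user-supplied intersection routine, while the box tests stay the same.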

OlegSH · Nov 11, 2021

CarstenS said:
but you get the most out of the hardware when tracing against triangle geometries

I guess that's just poor wording. What's meant here is that you should not use custom primitives instead of triangles, because custom primitives would not be HW accelerated in that case.

CarstenS · Nov 11, 2021

Poor wording when you have to make sure that developers do understand exactly what the DOs and DON'Ts for Raytracing are?

TopSpoiler · Aug 18, 2022

https://www.reddit.com/r/nvidia/comments/wr13m3

TopSpoiler · Oct 27, 2022

https://twitter.com/x/status/1585204125590962176

And efficiencies:

https://twitter.com/x/status/1585247117970673664

Man from Atlantis · Oct 31, 2022

https://www.reddit.com/r/hardware/comments/yhrnf6

pjbliverpool · Oct 31, 2022

Man from Atlantis said:
https://www.reddit.com/r/hardware/comments/yhrnf6

That's a brilliant chart and really telling about what we're actually getting with the 4xxx series.

troyan · Oct 31, 2022

Strange chart. Why are they starting with the 780 and not the 680? And with Pascal there was a 600mm^2 die, P100, which doesn't have more CUDA cores but has a better architecture and HBM.

Newguy · Oct 31, 2022

troyan said:
Strange chart. Why are they starting with the 780 and not the 680? And with Pascal there was a 600mm^2 die, P100, which doesn't have more CUDA cores but has a better architecture and HBM.

The 680 is GK104, whereas the 780 Ti is GK100/110 ("full" Kepler); for reference, the 680 is the same chip as the 770 (8 SMX). P100 isn't a consumer product, just as V100, A100 and H100 aren't consumer products, so they're not mentioned.

troyan · Oct 31, 2022

GK110 was ready for the Titan supercomputer in 2012. And the number of CUDA cores depends on the number of transistors Nvidia wants to spend. TU102 was bigger than P100, and P100 has every graphics function that gaming Pascal has.

Kaotik · Nov 8, 2022

Exclusive: Nvidia offers new advanced chip for China that meets U.S. export controls

U.S. chip maker Nvidia Corp is offering a new advanced chip in China that meets recent export control rules aimed at keeping cutting-edge technology out of China's hands, the company confirmed on Monday.

www.reuters.com

NVIDIA is making a new A800 for the Chinese market, since its A100 and H100 were among the GPUs caught in the latest round of sanctions in the trade war.
By the looks of it, the A800 is an A100 permanently limited to a maximum of 400 GB/s GPU-to-GPU bandwidth.

DavidGraham · Apr 16, 2024

DavidGraham said:
I am curious how much of B200 is disabled to improve yields? In A100 it was 15%, in H100 it was 8%.

A nearly complete A100 die is now being sold in China: an A100 with 96GB of VRAM and 7936 cores.

Original A100: 6912 cores and 80GB
New A100: 7936 cores and 96GB
Full A100: 8192 cores and 96GB

NVIDIA A100 "Ampere" GPUs With 7936 Cores Being Sold In China, 15% More Cores Than Original A100s

NVIDIA seems to have been shipping A100 "Ampere" GPUs with higher core counts than the original spec in China as per recent retail listings.

wccftech.com
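The percentages follow directly from the core counts listed above. Here is a quick arithmetic sketch; the 64 FP32 cores per SM figure used to derive SM counts is not stated in the thread and is taken from Nvidia's published GA100 configuration, and the helper name is made up for the example.

```python
FULL_DIE_CORES = 8192   # full A100 die, as listed in the post
CORES_PER_SM = 64       # assumption: GA100 FP32 cores per SM (not from the thread)

VARIANTS = {
    "Original A100": 6912,
    "New A100": 7936,
    "Full A100": 8192,
}


def disabled_fraction(enabled_cores, full_cores=FULL_DIE_CORES):
    """Share of the full die that is fused off for a given core count."""
    return 1.0 - enabled_cores / full_cores


if __name__ == "__main__":
    for name, cores in VARIANTS.items():
        sms = cores // CORES_PER_SM
        print(f"{name}: {cores} cores, {sms} SMs, "
              f"{disabled_fraction(cores):.1%} disabled")

    uplift = VARIANTS["New A100"] / VARIANTS["Original A100"] - 1.0
    print(f"New vs. original A100: {uplift:.1%} more cores")
```

The original A100 comes out at roughly 15.6% disabled, matching the "15%" in the quoted post, while the 7936-core part has only about 3% of the die disabled and about 14.8% more cores than the original, which the article headline rounds to 15%.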

TopSpoiler · Mar 28, 2025

https://twitter.com/x/status/1905493794130137406

Researchers reverse engineered Nvidia Ampere microarchitecture, revealing how the issue logic works including the policy of the issue scheduler, the structure of the register file and its associated cache, and multiple features of the memory pipeline.

Link to the paper
