DispatchRays doesn't return until after the hit shader runs, so it seems the bound hit shader isn't doing a lot of math. My guess is that it's just writing parameters out to a buffer, and a separate compute shader is spun up afterwards to actually do the material shading.
That sounds like a very good theory.
According to the profiler, each SM has a peak instruction issue rate of 4.0 per clock, and the peak measured is 3.2 (80%). That makes sense, as an SM has 4 independent partitions, each with its own instruction scheduler and execution units. What's bizarre is that the profiler reports the peak issue rate for each of the two FP pipelines as 0.5 per clock. I would expect 2.0, as each of the 4 partitions can issue to each FP pipeline every other clock (4 × 0.5). The ALU peak is 2.0 as expected, as it takes 2 clocks to process an ALU warp (4 / 2). The SFU peak is 0.5 as expected, as it takes 8 clocks to process an SFU warp (4 / 8). I don't know how to interpret the FP peaks.
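To make the arithmetic behind those expectations explicit, here is a rough back-of-the-envelope sketch in Python. It assumes a simple model in which each of the 4 SM partitions can start one warp instruction on a given pipeline every N clocks, so the per-SM peak is 4 / N. The function name, the dictionary labels, and the model itself are illustrative assumptions, not anything taken from the profiler or from NVIDIA's documentation; the "reported" values are just the numbers quoted above.

```python
# Assumed model (not from NVIDIA docs): per-SM peak issue rate for a pipeline
# is partitions / clocks_per_warp, i.e. each partition can start one warp
# instruction on that pipeline every `clocks_per_warp` clocks.
PARTITIONS_PER_SM = 4


def expected_peak_per_sm(clocks_per_warp, partitions=PARTITIONS_PER_SM):
    """Warp instructions per clock, summed over all partitions of one SM."""
    return partitions / clocks_per_warp


# Clocks per warp as stated in the post above (labels are hypothetical).
expected = {
    "fp (each of the two pipelines)": expected_peak_per_sm(2),
    "alu": expected_peak_per_sm(2),
    "sfu": expected_peak_per_sm(8),
}

# Peak values the profiler reported, as quoted in the post above.
reported = {
    "fp (each of the two pipelines)": 0.5,
    "alu": 2.0,
    "sfu": 0.5,
}

# Pipelines where the simple model and the profiler disagree.
mismatches = [name for name in expected if expected[name] != reported[name]]

for name in expected:
    flag = "MISMATCH" if name in mismatches else "ok"
    print(f"{name}: expected {expected[name]}, reported {reported[name]} ({flag})")

# Overall issue utilization: measured peak over theoretical peak.
issue_utilization = 3.2 / 4.0
print(f"issue utilization: {issue_utilization:.0%}")
```

Under this model only the FP pipelines disagree with the reported values, and by a factor of 4, which happens to equal the number of partitions. That might hint that the FP figure is being reported per partition rather than per SM, but that is only a guess.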
This appears to explain the terms being used:
Advanced Learning :: Nsight Graphics Documentation (nvidia.com)
What I don't understand is why the picture in that document is not explained by the bullet points that follow. Light Pipe and Heavy Pipe?
This describes the pipelines:
Kernel Profiling Guide :: Nsight Compute Documentation (nvidia.com)
where the heavy FMA pipeline is different from the light one because it has integer dot product functionality. Other "integer" operations occur on other pipelines, the primary one being "alu".
Then we have the "fma" pipeline: "Fused Multiply Add/Accumulate. The FMA pipeline processes most FP32 arithmetic (FADD, FMUL, FMAD). It also performs integer multiplication operations (IMUL, IMAD), as well as integer dot products.
On GA10x, FMA is a logical pipeline that indicates peak FP32 and FP16x2 performance. It is composed of the FMAHeavy and FMALite physical pipelines."
Notice that this isn't a real pipeline in Ampere, merely a convenience term to describe the grouping of the capability of the heavy and light pipelines.
The "alu" pipeline is for bitwise and boolean operations and some integer math.
As for the discrepancy in the peak rates, I have no idea!
Back to the subject of the benefits of FP32 throughput in NVIDIA ray tracing:
nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf
"Ray tracing denoising shaders are a good example of a workload that can benefit greatly from doubling FP32 throughput." I have no idea what proportion of the frame time is spent denoising ray tracing results.