NVIDIA definitely runs tensor and fp32 ops concurrently, especially now that the tensor cores are busy almost 100% of the time (doing upscaling, frame generation, denoising, HDR post-processing, and in the future neural rendering).
The latest NVIDIA generations have become considerably better at mixing all three workloads (tensor + ray + fp32) concurrently. I read somewhere (I can't find the source now) that ray tracing + tensor is the most common concurrent pairing, followed by ray tracing + fp32, then tensor + fp32.
**Concurrent execution of CUDA and Tensor cores** (forums.developer.nvidia.com):
> Yes, that is what it means. I don't know where you got that. If the compiler did not schedule tensor core instructions along with other instructions, what else would it be doing? NOP? Empty space? Maybe you are mixing up what the compiler does and what the warp scheduler does. The warp...
**I need help understanding how concurrency of CUDA Cores and Tensor Cores works between Turing and Ampere/Ada?** (forums.developer.nvidia.com):
> There isn't much difference between Turing, Ampere and Ada in this area. This question in various forms comes up from time to time, here is a recent thread. It's also necessary to have a basic understanding of how instructions are issued and how work is scheduled in CUDA GPUs, unit 3 of this...
The way I read this, it seems that while the workloads are executed concurrently, they are still not dispatched concurrently (unlike on some other architectures): each SM warp scheduler issues at most one instruction per cycle, so in any given cycle a scheduler feeds either the tensor pipe or the fp32 pipe, never both. So some pipes will be underutilized.
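To make that concrete, here is a minimal CUDA sketch; this is my own illustration, not code from either thread, and the kernel name, tile size, and loop count are arbitrary. One warp per block issues tensor-core MMA instructions via the WMMA API while the remaining warps issue plain fp32 FMAs. A given warp scheduler never dispatches an HMMA and an FFMA in the same cycle, but once issued, the two instruction streams execute concurrently in their separate pipelines.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Sketch: warp 0 of each block feeds the tensor cores (HMMA instructions),
// the other warps feed the fp32 pipe (FFMA instructions). Requires sm_70+.
__global__ void mixed_pipes(const half* a, const half* b, float* c,
                            const float* x, float* y, int n)
{
    int warp = threadIdx.x / 32;
    if (warp == 0) {
        // The whole warp cooperates on one 16x16x16 MMA tile -> tensor pipe.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
        wmma::fill_fragment(fc, 0.0f);
        wmma::load_matrix_sync(fa, a, 16);
        wmma::load_matrix_sync(fb, b, 16);
        wmma::mma_sync(fc, fa, fb, fc);
        wmma::store_matrix_sync(c + blockIdx.x * 16 * 16, fc, 16,
                                wmma::mem_row_major);
    } else {
        // Remaining warps keep the fp32 pipe busy with a chain of FMAs
        // (elements covered by warp 0's lanes are simply skipped here).
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i];
            #pragma unroll
            for (int k = 0; k < 64; ++k)
                v = fmaf(v, 1.0009765625f, 0.5f);  // FFMA
            y[i] = v;
        }
    }
    // The SM's warp schedulers interleave issue of the two instruction
    // streams cycle by cycle; the HMMA and FFMA pipelines then run in
    // parallel.
}
```

Profiling a kernel like this should show both the tensor and FMA pipes active within the same launch, which matches the "executed concurrently, but dispatched one instruction per cycle per scheduler" reading above.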