AMD: RDNA 3 Speculation, Rumours and Discussion

[Attached image: illustration from the patent]

Although this illustration (from the patent mentioned above) may well be incidental or reused from other patent diagrams (which AMD has done in the past), there is that use of the X3D die on top of the GCD (which has TSVs).
AMD did say during the FAD 2020 webcast that they will share more info about X3D in time.
Hopefully more info will trickle out over the next months.
 
They can do it 3-Hi too, but that's not particularly useful for now.
 
Is RDNA 3 expected to have tensor cores of some kind?
At least for now it looks like AMD is using matrix crunchers only on the Instinct line, and I don't really see a single reason why they would bring them to gaming GPUs to eat into the transistor budget.
 
If there is growth in the deep learning space for games, then eventually it would make sense to bring over better accelerators for that purpose instead of just increasing general compute.
 
Of course my guess is at best as good as anyone else's, but I wouldn't put my money on dedicated matrix crunchers being worth it in the near term for a gaming GPU. Guessing five years or more into the future is pretty useless.
 
Maybe not necessarily. The silicon area of a TPU is considerably smaller than general compute for its output. It's the same situation as CPU cores being much larger than GPU cores... and TPU cores being smaller still, and so forth.

This blog post is pretty important for understanding why the silicon usage may be worth it (also considering that extra bandwidth keeps coming at quite a premium):
https://timdettmers.com/2020/09/07/which-gpu-for-deep-learning/

Tensor Cores
Summary:

  • Tensor Cores reduce the used cycles needed for calculating multiply and addition operations, 16-fold — in my example, for a 32×32 matrix, from 128 cycles to 8 cycles.
  • Tensor Cores reduce the reliance on repetitive shared memory access, thus saving additional cycles for memory access.
  • Tensor Cores are so fast that computation is no longer a bottleneck. The only bottleneck is getting data to the Tensor Cores.
He goes through the actual calculations, but the major cycle savings are actually on the data-share side. I think his last point is the most critical aspect here: computation is so fast that perhaps we don't need more tensor cores; what we need is the ability to feed them. We naturally get more tensor units because they come with each SM/CU, so we end up with more tensor cores than we need. The only reason more tensor cores perform better in benchmarks is likely that more SMs mean more shared memory in flight at once. So having more TPUs didn't address a computational bottleneck; the additional SMs/CUs are addressing a bandwidth bottleneck.

Just thinking out loud: for RDNA 3 or even Ampere+, I think that's going to be the next area of innovation going forward. These computational units are sitting idle more and more, even though we keep putting in more of them. The data needs to get there faster. I like the idea of a big L3 cache for this purpose; for data scientists the size is limiting, however, but it's perhaps more than sufficient for games. All we'd need are a few TPUs on gaming hardware.

Overall I think the reduction in bandwidth requirements from using something like matrix crunchers will be significantly more versatile than just brute-forcing it with compute. And if the future moves in that direction, we don't need a lot of matrix units, we just need to feed them.

We see that Ampere has a much larger shared memory allowing for larger tile sizes, which reduces global memory access. Thus, Ampere can make better use of the overall memory bandwidth on the GPU memory. This improves performance by roughly 2-5%. The performance boost is particularly pronounced for huge matrices.

The Ampere Tensor Cores have another advantage in that they share more data between threads. This reduces the register usage. Registers are limited to 64k per streaming multiprocessor (SM) or 255 per thread. Comparing the Volta vs Ampere Tensor Core, the Ampere Tensor Core uses 3x fewer registers, allowing for more tensor cores to be active for each shared memory tile. In other words, we can feed 3x as many Tensor Cores with the same amount of registers. However, since bandwidth is still the bottleneck, you will only see tiny increases in actual vs theoretical TFLOPS. The new Tensor Cores improve performance by roughly 1-3%.
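To put rough numbers on the "just feed them" point, here is a minimal back-of-the-envelope sketch (plain host code, compilable as C++ or CUDA). The ~312 TFLOPS FP16 tensor throughput and ~1.55 TB/s bandwidth are rough public A100-class figures used only as an assumption, and the 128x128 tile and K=4096 depth are arbitrary illustrative choices, not anything from the blog or from AMD.

```cuda
// flop_per_byte.cu -- back-of-the-envelope only; the hardware numbers below are
// rough public figures for an A100-class part, assumed purely for illustration.
#include <cstdio>

int main() {
    // Machine balance: how many FLOPs the chip can do per byte of DRAM traffic.
    const double tensor_flops = 312e12;   // ~312 TFLOP/s FP16 tensor throughput (assumed)
    const double dram_bw      = 1.55e12;  // ~1.55 TB/s HBM bandwidth (assumed)
    const double balance      = tensor_flops / dram_bw;          // ~200 FLOP/byte

    // Arithmetic intensity of a T x T output tile of a matrix multiply,
    // streaming the shared K dimension once from DRAM (FP16 operands).
    const double T = 128.0, K = 4096.0, bytes_per_elem = 2.0;
    const double flops   = 2.0 * T * T * K;                         // multiply + add
    const double traffic = (2.0 * T * K + T * T) * bytes_per_elem;  // read A,B tiles, write C tile
    const double intensity = flops / traffic;                       // roughly T/2 for large K

    printf("machine balance : %6.0f FLOP/byte\n", balance);
    printf("128x128 tile    : %6.0f FLOP/byte\n", intensity);
    // If intensity < balance, the matrix units wait on memory, not the other
    // way around -- which is the "feed them" argument above.
    return 0;
}
```

The gap between those two numbers is essentially the argument above: even a modest number of matrix units starves unless the tiles are large and the data stays on-chip (shared memory, or a big L3 cache as suggested above).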
 

Honestly, IMHO he's not getting at the real point; he's not truly describing what's going on.

If my processor has an instruction that sums 512 values from contiguous memory, it would have the same characteristics, and it is not considered a tensor core. That is, the processor knows it has to burst-prefetch a specific chunk of memory, and afterwards can do a hierarchical parallel reduction (log2), which, if the ALUs are there, is just 8 cycles. These are single large instructions, mostly with interdependent immediate result-sharing. Almost every operation defined like this would run faster. Not because there's less internal bandwidth used, but because the situation is forced to be favorable for the task. If you count the bits going over the busses between the units, it's the same amount. The unit itself is a value buffer; it itself is a register, albeit an anonymous one. If you take away a GPR and add a unit, you don't change the number of state holders.

Say the vector (or matrix) is allowed to be scattered, then the mem-burst advantage goes away; if the instruction sequence were discretized, then the instruction "compression" advantage would go away; if the unit is only 1 instruction wide, then the parallelism advantage goes away. There are a myriad of examples of this. One would be the texture filtering block: the situation is forced to be favorable for the parallel reduction (locality, Morton order, cascade, ...). Same with the rasterizer blocks: they execute virtual (or on some architectures real) instructions on groups of pixels/tiles, which again is favorable for tile-internal depth reduction, tile-fetch bursts and so on.
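For reference, the hierarchical log2 reduction being described maps onto something like this minimal CUDA sketch (the 512-value size, single-block launch and float data are just illustrative assumptions, not a proposed instruction): each thread pre-adds two contiguous values during the burst-friendly load, and the remaining 256 partial sums collapse in 8 halving steps. Scatter the inputs and the burst-prefetch advantage is gone, which is exactly the point about the situation being forced to be favorable.

```cuda
// reduce512.cu -- minimal sketch of the log2-style reduction described above.
// Sums 512 contiguous floats with one 256-thread block; illustrative only.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum512(const float* in, float* out) {
    __shared__ float partial[256];
    unsigned t = threadIdx.x;

    // Burst-friendly: each thread grabs two contiguous elements and pre-adds them.
    partial[t] = in[t] + in[t + 256];
    __syncthreads();

    // Hierarchical reduction: 256 -> 128 -> ... -> 1, i.e. 8 halving steps,
    // each step a bank of independent adds that can run in parallel.
    for (unsigned stride = 128; stride > 0; stride >>= 1) {
        if (t < stride) partial[t] += partial[t + stride];
        __syncthreads();
    }
    if (t == 0) *out = partial[0];
}

int main() {
    float h_in[512], h_out = 0.0f;
    for (int i = 0; i < 512; ++i) h_in[i] = 1.0f;   // expected sum: 512

    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    sum512<<<1, 256>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```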

The question is which instructions should be designed/offered. There are just too many possible candidates, and maybe we should turn to FPGAs for these things. There already are small programmable units in today's processors. The different instructions one would use often occur together (that is, the code stream is not stationary), so reprogramming on the fly isn't prohibitive. Holistically this performs better than having a selected set of fast instructions, as you now have an infinite set of not-so-fast instructions, even if the FPGA is clocked lower. Say you program all the filter kernels used in a compute/graphics pipeline into the FPGA (DoF, other blurs, soft shadows, etc.) one stage after the other: even if the block is clocked slower, the net result is more performance.
 
AMD actually has a patent for that, programmable execution units for CPUs (to be added next to the integer and floating-point units): https://www.freepatentsonline.com/y2020/0409707.html
 
I don't think adding latencies is a sensible way to arrive at performance numbers. Latency can be (and often is) hidden.
Agreed. But there have to be limits to how many can be hidden. Eventually you run out of blocks or threads.
 
Yes, but if we're talking about matrix multiplication (which we are in the context of tensor cores) then we have predictable memory access patterns, high local data reuse, and thus a high math:mem ratio. If you amortise the startup cost you can get pretty close to the theoretical throughput of tensor cores. Global and shared memory access latencies can be almost completely hidden in this case.
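As a concrete illustration of the "predictable access patterns, high local data reuse" point, here is a minimal shared-memory-tiled matrix multiply sketch. Plain FMAs stand in for the actual tensor-core MMA path, and the tile size, matrix size and all-ones test data are arbitrary assumptions; the reuse structure that lets the latencies hide is the point, not peak performance.

```cuda
// tiled_mm.cu -- sketch of a shared-memory tiled matrix multiply (C = A * B).
// Plain FMAs stand in for the tensor-core path; N must be a multiple of TILE.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < N; k0 += TILE) {
        // Predictable, coalesced loads: each element is fetched from global
        // memory once per tile...
        As[threadIdx.y][threadIdx.x] = A[row * N + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        __syncthreads();

        // ...and then reused TILE times out of shared memory, so the math per
        // byte of global traffic goes up and the load latency can be hidden.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

int main() {
    const int N = 256;                       // hypothetical size, multiple of TILE
    size_t bytes = N * N * sizeof(float);
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 1.0f; }

    dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
    matmul_tiled<<<grid, block>>>(A, B, C, N);
    cudaDeviceSynchronize();
    printf("C[0] = %f (expected %d)\n", C[0], N);  // all-ones inputs => each C entry == N

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

In a real tensor-core kernel the inner loop would be an MMA instruction and the loads would be double-buffered, but the amortised-startup, hidden-latency structure is the same.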
 
I agree, and with respect to @Ethatron's post as well, this is pretty much a question of accelerators performing their specific task versus general-purpose compute.

The question postulated was whether or not having TPUs makes sense in gaming GPUs, and my answer at the time is that if we continue moving in that direction, then the trade-off in silicon may be worth it. Even if future video cards can provide as much general compute power as a TPU does today, the computational density still has to favour dedicated tensor math acceleration. The question of whether we should give up some compute units for tensor processing units really just boils down to how much we intend to use them.

I don't see an issue with not having TPUs today, but as deep learning workloads increase in the gaming space, I would imagine we eventually hit an inflection point where it makes more sense to ship with TPUs (or some other form of MLP accelerator) than without.
 
Don't forget Intel's Altera either (there's actually an Altera-branded chip on Sapphire Rapids too, but its function is unknown and it sits clearly outside the actual CPU chips).
Oh, I didn't know they bought Alterra.

Interesting. I think I ended up succeeding in university with an Alterra board but failed disastrously with the Xilinx board.
I don't think the professor knew what he was asking of an undergrad student.
 
Yeah, they did back in 2015 already, but they kept the branding until very recently (so recently that the Sapphire Rapids chip still says Altera on it). And it's with one "r" :)
 

From AMD's perspective, adding matrix units is going to be a tradeoff of development and production cost versus potential increased sales. So are games going in that direction? Certainly, but as always only to the limit of what consoles can do, and while the consoles have fast low-precision math on CPU and GPU, they don't have dedicated matrix units, so there's going to be a distinct limit to how much matrix math they'll do.

As well, the percentage of games using every last new feature and technology that comes along shrinks progressively with time. You could add matrix-only units, but competitive e-sports games aren't going to use them very meaningfully, if at all. Adding those isn't going to get fps up in Valorant or Dota 2 at low settings; and since you at least theoretically want precisely accurate per-pixel information in those games, are people buying GPUs really going to use AI upscaling, which by its nature is going to give you false information from a bad guess?

I don't know the answer to that. But I do see the use case for such units as at least semi-limited for the next seven-plus years. Is it worth it for AMD to add those units to a GPU? I'm unsure it is at the moment. Nvidia has successfully driven the niche high-end GPU market crazy for "Raytracing", but their TPUs have gone largely unnoticed so far, and with something that abstract I'm not sure how successful a PR campaign in that direction would be.
 