AMD: RDNA 3 Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Oct 28, 2020.

  1. [Attached image: upload_2021-4-5_18-21-50.png]

    Although this illustration (from the patent mentioned above) may be coincidental, or reused from other patent diagrams (which AMD has done in the past), it does show an X3D die stacked on top of the GCD (which has TSVs).
    AMD did say during the FAD 2020 webcast that they would share more info about X3D in time.
    Hopefully more info will trickle out over the next months.
     
  2. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    They can do it 3-Hi too but that's not particularly useful for now.
     
  3. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,424
    Likes Received:
    908
    Is RDNA 3 expected to have tensor cores of some kind?
     
    PSman1700 likes this.
  4. madhatter

    Newcomer

    Joined:
    Jul 23, 2020
    Messages:
    32
    Likes Received:
    25
    Probably not.
     
  5. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    20,502
    Likes Received:
    24,398
  6. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    At least for now it looks like AMD is using matrix crunchers only on the Instinct line, and I don't really see a single reason why they would bring them to gaming GPUs to eat into the transistor budget.
     
  7. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    If there is growth in the deep learning space for games, then eventually it would make sense to bring over better accelerators for that purpose instead of just increasing general compute.
     
    PSman1700 likes this.
  8. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    Of course my guess is at best as good as anyone else's, but I wouldn't put my money on dedicated matrix crunchers being worth it in the near term for gaming GPUs. Guessing five years or more into the future is pretty useless.
     
  9. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    Maybe not necessarily. The silicon footprint of TPUs is considerably smaller than that of general compute for the output they produce. This is one of those situations of CPU cores being much larger than GPU cores... and TPU cores being smaller still, and so forth.

    This blog post is pretty much essential for understanding why the silicon cost may be worth it (also considering that bandwidth keeps increasing at quite a premium):
    https://timdettmers.com/2020/09/07/which-gpu-for-deep-learning/

    Tensor Cores
    Summary:

    • Tensor Cores reduce the used cycles needed for calculating multiply and addition operations, 16-fold — in my example, for a 32×32 matrix, from 128 cycles to 8 cycles.
    • Tensor Cores reduce the reliance on repetitive shared memory access, thus saving additional cycles for memory access.
    • Tensor Cores are so fast that computation is no longer a bottleneck. The only bottleneck is getting data to the Tensor Cores.
    He'll go through the actual calculations, but the major cycle savings are actually on data share. I think his last point is the most critical aspect here: computation is so fast that perhaps we don't need more tensor cores; what we need is the ability to feed them. We naturally get more tensor units as they come with each SM/CU, so we naturally end up with more tensor cores than we need. The only reason more tensor cores perform better on benchmarks is likely that more SMs mean more data shared at once. So having more TPUs didn't address a computational bottleneck; the additional SMs/CUs are addressing a bandwidth bottleneck.

    Just thinking out loud: for RDNA 3 or even Ampere+, I think that's going to be the next set of innovations going forward. These computational units sit more and more idle even though we keep putting in more; the data needs to get there faster. I like the idea of a big L3 cache for this purpose. For data scientists its size is limiting, but it is perhaps more than sufficient for games. All we'd need are a few TPUs on gaming hardware.

    Overall I think reducing bandwidth pressure with something like matrix crunchers will be significantly more effective than just brute-forcing it with compute. And if the future moves in that direction, we don't need a lot of matrix units; we just need to feed them.
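
A rough sketch of the arithmetic behind this; all numbers are the blog's illustrative example figures (32x32 FP16 tile) or simple counting, not measurements of any real GPU:

```python
# Back-of-the-envelope for the blog's 32x32 FP16 matmul tile example.
# Figures are the blog's illustrative numbers, not real hardware values.

SIMD_CYCLES = 128    # blog's figure: plain FMA pipeline, one 32x32 tile
TENSOR_CYCLES = 8    # blog's figure: tensor-core path, same tile

speedup = SIMD_CYCLES / TENSOR_CYCLES      # 16x fewer cycles

# Why "feeding them" becomes the bottleneck: FLOPs in an n x n x n tile
# grow as n^3, while the data moved grows only as n^2.
n = 32
flops = 2 * n**3                 # one multiply + one add per term
bytes_moved = 3 * n * n * 2      # read A and B, write C, 2 bytes/elem (FP16)
intensity = flops / bytes_moved  # FLOPs per byte of traffic

print(speedup, round(intensity, 2))   # 16.0 10.67
```

Once the math side is 16x faster, that fixed FLOPs-per-byte ratio means memory traffic, not computation, sets the ceiling.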

     
    #229 iroboto, May 20, 2021
    Last edited: May 20, 2021
    pharma, xpea, Newguy and 4 others like this.
  10. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    946
    Likes Received:
    413
    Honestly, IMHO he's not getting at the real point; he's not truly describing what's going on.

    If my processor had an instruction that sums 512 values from contiguous memory, it would have the same characteristics, yet it would not be considered a tensor core. That is, the "processor" knows it has to burst-prefetch a specific chunk of memory, and afterwards it can do a hierarchical parallel reduction (log2), which, if the ALUs are there, is just 9 cycles. These are single large instructions, mostly with interdependent immediate result-sharing. Almost every operation defined like this would run faster, not because there's less internal bandwidth used, but because the situation is forced to be favorable for the task. If you count the bits going over the busses between the units, it's the same amount.

    The unit itself is a value buffer; it is itself a register, albeit an anonymous one. If you take away a GPR and add a unit, you don't change the number of state holders. Say the vector (or matrix) is allowed to be scattered: then the memory-burst advantage goes away. If the instruction sequence were discretized, the instruction "compression" advantage would go away; if the unit is only one instruction wide, the parallelism advantage goes away. There are a myriad of examples of this. One is the texture filtering block, where the situation is forced to be favorable for the parallel reduction (locality, Morton order, cascades, ...). Same with rasterizer blocks: they execute virtual (or on some architectures real) instructions on groups of pixels/tiles, which again is favorable for tile-internal depth reduction, tile-fetch bursts and so on.
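
A minimal sketch of the hierarchical (log2) reduction described here; each loop iteration models one cycle of a sufficiently wide ALU array, where all pair-sums in a pass are independent (note that 512 inputs need log2(512) = 9 strictly pairwise passes):

```python
def tree_reduce(values):
    """Pairwise parallel reduction: each pass sums adjacent pairs,
    modelling one 'cycle' of a sufficiently wide ALU array."""
    vals = list(values)
    passes = 0
    while len(vals) > 1:
        # all pair-sums in one pass are independent -> one parallel step
        vals = ([vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
                + ([vals[-1]] if len(vals) % 2 else []))
        passes += 1
    return vals[0], passes

total, steps = tree_reduce(range(512))
print(total, steps)   # 130816 9
```

The key property is exactly what the post describes: the data layout (dense, contiguous) and the instruction shape (one wide op) are forced to be favorable, so the depth is logarithmic instead of linear.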

    The question is which instructions should be designed/offered. There are just too many possible candidates, and maybe we should turn to FPGAs for these things; there already are small programmable units in today's processors. The different instructions one would use often occur together (that is, the code stream is not stationary), so reprogramming on the fly isn't prohibitive. Holistically, this performs better than having a selected set of fast instructions, as you now have an effectively infinite set of not-so-fast instructions, even if the FPGA is clocked lower. Say you program all the filter kernels used in a compute/graphics pipeline into the FPGA (DoF, other blurs, soft shadows, etc.), one stage after the other: even if the block is clocked slower, the net result is more performance.
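
A toy throughput model of that idea. Every number here (stage count, pixel count, clocks, cycles per pixel, reconfiguration cost) is an assumption chosen for illustration, not a real FPGA or GPU figure:

```python
# Hypothetical model: run each filter stage on a reconfigurable block
# that is clocked lower but retires a whole kernel per pixel per cycle,
# versus generic ALUs that need many cycles per pixel. All numbers are
# made up for illustration.

def pipeline_time(stages, pixels, cycles_per_pixel, clock_hz, reconfig_s=0.0):
    """Total seconds: per-stage reconfiguration cost plus per-pixel work."""
    return stages * (reconfig_s + pixels * cycles_per_pixel / clock_hz)

generic = pipeline_time(stages=4, pixels=2_000_000,
                        cycles_per_pixel=16, clock_hz=2.0e9)
fpga = pipeline_time(stages=4, pixels=2_000_000,
                     cycles_per_pixel=1, clock_hz=0.5e9, reconfig_s=0.001)

print(round(generic, 3), round(fpga, 3))   # 0.064 0.02
```

Under these assumed numbers the specialized block wins despite a 4x lower clock and a per-stage reconfiguration penalty, which is the holistic argument being made.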
     
    Jawed, pharma, DavidGraham and 2 others like this.
  11. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    AMD actually has a patent for that, programmable execution units for CPUs (to be added next to the integer and floating-point units): https://www.freepatentsonline.com/y2020/0409707.html
     
    Jawed likes this.
  12. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    I don't think adding latencies is a sensible way to arrive at performance numbers. Latency can be (and often is) hidden.
     
  13. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    Agreed. But there have to be limits to how much latency can be hidden. Eventually you run out of blocks or threads.
     
  14. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    Yes, but if we're talking about matrix multiplication (which we are in the context of tensor cores) then we have predictable memory access patterns, high local data reuse, and thus a high math:mem ratio. If you amortise the startup cost you can get pretty close to the theoretical throughput of tensor cores. Global and shared memory access latencies can be almost completely hidden in this case.
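
A quick sketch of amortising the startup cost he mentions; with a fixed startup latency and a long stream of tiles, achieved throughput approaches the theoretical peak (cycle counts here are assumptions for illustration, not real hardware figures):

```python
# Toy model: fraction of theoretical tensor-core throughput achieved
# when a fixed startup latency is amortised over many matmul tiles.
# All cycle counts are illustrative assumptions.

def efficiency(tiles, cycles_per_tile=8, startup=400):
    """ideal / actual cycles for a run of `tiles` back-to-back tiles."""
    ideal = tiles * cycles_per_tile
    actual = startup + tiles * cycles_per_tile
    return ideal / actual

for t in (10, 100, 1000, 10000):
    print(t, round(efficiency(t), 3))
# 10 0.167
# 100 0.667
# 1000 0.952
# 10000 0.995
```

This is the predictable-access-pattern case: because the work per tile is known in advance, the latency overlaps with compute and only the first tiles pay for it.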
     
    xpea and iroboto like this.
  15. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    I agree, and with respect to @Ethatron's post as well, this is pretty much a question of whether accelerators perform their specific task better than general-purpose compute does.

    The question posed was whether having TPUs makes sense in gaming GPUs, and my answer at the time is that if we keep moving in that direction, then the trade-off in silicon may be worth it. Even if video cards in the future can generate as much generic compute power as a TPU provides today, computational density still has to favour tensor-math acceleration. The question of whether we should give up some compute units for tensor processing units really just boils down to how much we intend to use them.

    I don't see an issue with not having TPUs today, but as deep learning workloads increase in the gaming space, I would imagine it will eventually hit an inflection point where it makes more sense to ship with TPUs (or some other form of MLP accelerator) than without.
     
  16. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    With AMD having bought Xilinx, are you trying to tell us something =P?
     
    Ethatron likes this.
  17. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    Don't forget Intel's Altera either (there's actually an Altera-branded chip on Sapphire Rapids too, but its function is unknown and it sits clearly outside the actual CPU chips)
     
    PSman1700 and iroboto like this.
  18. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    oh, I didn't know they bought Alterra.

    interesting. I think I ended up succeeding in university with an alterra board but failed disastrously with the xilinx board.
    I don't think the professor knew what he was asking of an undergrad student.
     
  19. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    Yeah, they did back in 2015 already but have kept the branding until very recently (so recently that the Sapphire Rapids chip still says Altera on it). And it's with one "r" :)
     
    iroboto likes this.
  20. Frenetic Pony

    Regular

    Joined:
    Nov 12, 2011
    Messages:
    807
    Likes Received:
    478
    From AMD's perspective it's going to be a trade-off of development and production cost versus potential increased sales. So are games going in that direction? Certainly, but as always only to the limit of what consoles can do, and while the consoles have rapid low-precision math on CPU and GPU, they don't have dedicated matrix units, so there's going to be a distinct limit to how much matrix math they'll do.

    As well, the percentage of games using every last new feature and technology that comes along shrinks progressively with time. You could add matrix-only units, but competitive e-sports games aren't going to use those very meaningfully, if at all. Adding them isn't going to get fps up in Valorant or Dota 2 at low settings; and since you at least theoretically want precisely accurate per-pixel information in those games, are people buying GPUs really going to use AI upscaling, which by its nature is going to give you false information from a bad guess?

    I don't know the answer to that. But I do see the use case for such units as at least semi-limited for the next seven-plus years. Is it worth it for AMD to add those units to a GPU? I'm unsure it is at the moment. Nvidia has successfully driven the niche high-end GPU market crazy for raytracing, but their TPUs have gone largely unnoticed so far, and with something that abstract I'm not sure how well a PR campaign in that direction would work.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.