Any chance AMD will drop the confusing “dual compute unit” terminology for RDNA2? It seems the two CUs share an L0 instruction cache and scalar data cache but all other resources are CU specific. Not sure those two caches are worth the confusing name.
The RDNA architecture has two modes of operation for the LDS, compute-unit mode and work-group processor mode, which are controlled by the compiler. The former is designed to match the behavior of the GCN architecture and statically divides the LDS capacity into equal portions between the two pairs of SIMDs. By matching the capacity of the GCN architecture, this mode ensures that existing shaders will run efficiently. However, the work-group processor mode allows using larger allocations of the LDS to boost performance for a single work-group.
People with uncommon names (well, uncommon where they currently live) sometimes go by a more familiar name. When I was working with Koreans, some of the more prominent people used English names as their first names, because English is cooler or something. Since his name could be confusing for people around him, he may simply have said something like "You can call me David", and that's that.

Considering he's addressed as Devinder throughout the interview, how likely is it that he has since switched his name and told WCCFtech specifically about it?
I'm aware of some people doing this, even going all official like Jen-Hsun switching to Jensen, but Devinder is and has been Devinder everywhere but the WCCFtech article.
Maybe it's also because of etymology: some of my English and American clients call me Thomas, because that would be my name in English. I don't know if that's the case for Devinder.
There has been only one ACE since early GCN, and there still is only one. What's shown as 2 ACEs is a single core with 2x SMT, and each thread polls a number of queues.
AMD's presentation of that implementation detail is more artistic freedom than anything else.
- "MEC" (Micro Engine Compute, aka the compute command processor)
Some chips have two MECs; other parts have only one. So far one MEC (up to 32 queues) seems to be more than enough to keep the shader core fully occupied.
The MEC block has 4 independent threads, referred to as "pipes" in engineering and "ACEs" (Asynchronous Compute Engines) in marketing. One MEC => 4 ACEs, two MECs => 8 ACEs. Each pipe can manage 8 compute queues, or one of the pipes can run HW scheduler microcode which assigns "virtual" queues to queues on the other 3/7 pipes.
I wonder if CU can mix workgroups that use lots of LDS with others that use only a little bit? Probably.
I also wonder how this compares with NV which seems to have more LDS in general (Ampere increased it once more).
They do share the instruction cache, just not the vector L0.

Actually, they don't share the L0. Each WGP can access double the LDS because it is part of the WGP.
More Info on WGP mode from AMD's RDNA whitepaper.
Yes. "However, the work-group processor mode allows using larger allocations of the LDS to boost performance for a single work-group."
drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c

	case CHIP_SIENNA_CICHLID:
		adev->gfx.me.num_me = 1;
		adev->gfx.me.num_pipe_per_me = 2;
		adev->gfx.me.num_queue_per_pipe = 1;

	if (info->chip_class >= GFX10_3)
		info->max_wave64_per_simd = 16;
	else if (info->chip_class == GFX10)
		info->max_wave64_per_simd = 20;
	else if (info->family >= CHIP_POLARIS10 && info->family <= CHIP_VEGAM)
		info->max_wave64_per_simd = 8;
The fetched instructions are deposited into wavefront controllers. Each SIMD has a separate instruction pointer and a 20-entry wavefront controller, for a total of 80 wavefronts per dual compute unit. Wavefronts can be from a different work-group or kernel, although the dual compute unit maintains 32 work-groups simultaneously. The new wavefront controllers can operate in wave32 or wave64 mode.
So, according to that commit, in Sienna there is a 16-entry wavefront controller per SIMD, right?

Yeah, that's my conclusion as well. I guess 20 was too much for Navi, so they've shrunk it to shave off some transistors.
Does the M in ASICREV_IS_SIENNA_M mean that it will be a mobile part?
#define ASICREV_IS_VEGA10_M(r) ASICREV_IS(r, VEGA10)
#define ASICREV_IS_VEGA10_P(r) ASICREV_IS(r, VEGA10)
#define ASICREV_IS_VEGAM_P(r) ASICREV_IS(r, VEGAM)
I haven't even noticed it.
Vega 10 has never been released as a mobile part, right? There are also some chips with V. I really can't find the logic behind it; maybe Value, Mid, and Performance?
Edit: oh, and Vega M is apparently P.
I found some old tweet by komachi which says that M stands for "Mainstream", unless things have changed. But it could indicate that Sienna Cichlid may not be the "Big Navi" that we're looking for. ¯\_(ツ)_/¯

After seeing the 128-bit bus, I never thought it was. I wonder why they pushed this one into the driver before Big Navi.