AMD: Navi Speculation, Rumours and Discussion [2019-2020]

After seeing the 128-bit bus I never thought it was. I wonder why they pushed this one into the driver before Big Navi.
Wait a sec... there is no mention of a 128-bit bus in the amdgpu kernel driver for Sienna Cichlid.

If you mean this code from staging...
Code:
   if (adev->asic_type == CHIP_SIENNA_CICHLID && amdgpu_emu_mode == 1) {
       adev->gmc.vram_type = AMDGPU_VRAM_TYPE_GDDR6;
       adev->gmc.vram_width = 1 * 128; /* numchan * chansize */
   } else {
... it refers only to emulation mode during memory controller initialization. See the code from the mainline kernel:

Code:
/**
 * gmc_v10_0_mc_init - initialize the memory controller driver params
 *
 * @adev: amdgpu_device pointer
 *
 * Look up the amount of vram, vram width, and decide how to place
 * vram and gart within the GPU's physical address space.
 * Returns 0 for success.
 */
static int gmc_v10_0_mc_init(struct amdgpu_device *adev)
{
    int chansize, numchan;

    if (!amdgpu_emu_mode)
        adev->gmc.vram_width = amdgpu_atomfirmware_get_vram_width(adev);
    else {
        /* hard code vram_width for emulation */
        chansize = 128;
        numchan = 1;
        adev->gmc.vram_width = numchan * chansize;
    }
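On a real board the width is read from the video BIOS via amdgpu_atomfirmware_get_vram_width(); the hardcoded 1 * 128 only ever applies on emulators. As a rough illustration of what numchan * chansize means on shipping hardware (my numbers for a 256-bit GDDR6 card like Navi 10, not taken from the driver):

Code:
#include <stdio.h>

int main(void)
{
    /* illustrative values: GDDR6 exposes 16-bit channels,
     * so a 256-bit bus is 16 channels x 16 bits */
    int numchan  = 16;
    int chansize = 16;
    printf("vram_width = %d bits\n", numchan * chansize); /* 256 */
    return 0;
}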
 
New patent awarded to AMD (it is not exactly HBCC). This will go very well with the other AMD patent for decompressing textures in real time using GPU shaders.
Stream data directly from SSD/NVMe to GPU memory, bypassing the host.

10,678,733 Apparatus for connecting non-volatile memory locally to a GPU through a local switch
Abstract
Described herein are a method and device for transferring data in a computer system. The device includes a host processor, a plurality of first memory architectures, a switch, a redundant array of independent drives (RAID) assist unit; and a second memory architecture. The host processor is configured to send a data transfer command to the RAID assist unit via the switch. The RAID assist unit is configured to create a set of parallel memory transactions between the plurality of first memory architectures and the second memory architecture, execute the set of parallel memory transactions via the local switch and absent interaction with the host processor; and notify the host processor upon completion of data transfer. In an implementation, the plurality of first memory architectures is non-volatile memories (NVMs) and the second memory architecture is local memory.
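As I read the abstract, the host only issues a command; the RAID assist unit then drives the parallel P2P transactions over the local switch and signals completion. A minimal control-flow sketch of that reading (all function names are invented for illustration, not AMD's interfaces):

Code:
#include <stdio.h>

/* invented stand-ins for the units named in the abstract */
static void host_send_command(void)      { puts("host: command -> RAID assist unit (via switch)"); }
static void assist_parallel_dma(int nvm) { printf("assist: P2P DMA, NVM[%d] -> GPU local memory\n", nvm); }
static void assist_notify_host(void)     { puts("assist: done -> interrupt to host"); }

int main(void)
{
    host_send_command();            /* the host's only involvement */
    for (int i = 0; i < 4; i++)     /* a set of parallel transactions, shown serially */
        assist_parallel_dma(i);     /* runs over the local switch, no host interaction */
    assist_notify_host();
    return 0;
}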
 
Another very interesting new patent award

10,672,095 Parallel data transfer to increase bandwidth for accelerated processing devices
Abstract
Techniques for improving data transfer in a system having multiple accelerated processing devices ("APDs") are described herein. In such a system, multiple APDs are coupled to a processor (e.g., a central processing unit ("CPU")) via a general interconnect fabric and to each other via a high speed interconnect. The techniques herein increase the effective bandwidth for transfer of data between the CPU and the APD by transmitting data to both APDs through the portion of the interconnect fabric coupled to each respective APD. Then, one of the APDs transfers data to the other APD or to the processor via the high speed inter-APD interconnect. Although data transferred "indirectly" through the helper APD takes slightly more time to be transferred than a direct transfer, the total effective bandwidth to the target is increased due to the high-speed inter-APD interconnect.
A secondary APD/GPU helps with the data transfer: multiple transfers go to multiple GPUs in parallel, and the second GPU then forwards its share back to the first GPU.
What is interesting is what exactly the bottleneck is that AMD wants to address, considering their GPUs are already PCIe 4.0 capable. If I had to guess, it might have something to do with texture streaming.

Come to think of it, could the addition of 2 new SDMA instances on Sienna Cichlid (in addition to the 2 already existing on Navi1X) have something to do with this?
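To picture the claimed bandwidth win, here is a minimal sketch of the split transfer, assuming two GPUs and an invented API (names are mine, not AMD's):

Code:
#include <stdio.h>
#include <stddef.h>

static void pcie_copy(const char *dst, size_t bytes) { printf("PCIe -> %s: %zu MiB\n", dst, bytes >> 20); }
static void p2p_copy(size_t bytes)                   { printf("helper -> target over inter-APD link: %zu MiB\n", bytes >> 20); }

int main(void)
{
    size_t total = (size_t)256 << 20;   /* 256 MiB payload */
    pcie_copy("target APD", total / 2); /* direct half, over the target's own link */
    pcie_copy("helper APD", total / 2); /* indirect half, over the helper's link in parallel */
    p2p_copy(total / 2);                /* helper forwards its half over the fast link */
    /* both PCIe links are busy at once, so effective CPU->target
     * bandwidth approaches 2x a single x16 link */
    return 0;
}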
 
New patent awarded to AMD (it is not exactly HBCC). This will go very well with the other AMD patent for decompressing textures in real time using GPU shaders.
Stream data directly from SSD/NVMe to GPU memory, bypassing the host.

10,678,733 Apparatus for connecting non-volatile memory locally to a GPU through a local switch

Potentially a next-gen Pro SSG? It may be applicable to desktop systems too, but this certainly seems pretty similar to the Pro SSG.

Another very interesting new patent award

10,672,095 Parallel data transfer to increase bandwidth for accelerated processing devices

A secondary APD/GPU helps with the data transfer: multiple transfers go to multiple GPUs in parallel, and the second GPU then forwards its share back to the first GPU.
What is interesting is what exactly the bottleneck is that AMD wants to address, considering their GPUs are already PCIe 4.0 capable. If I had to guess, it might have something to do with texture streaming.

Come to think of it, could the addition of 2 new SDMA instances on Sienna Cichlid (in addition to the 2 already existing on Navi1X) have something to do with this?

I assume this is targeted at servers and supercomputers, given the apparent reliance on multiple GPUs, unless I'm missing something. Presumably being able to feed a GPU with data faster than a PCIe 4.0 x16 interface would allow has tons of applications in those markets.
 
Potentially a next-gen Pro SSG? It may be applicable to desktop systems too, but this certainly seems pretty similar to the Pro SSG.
I assume this is targeted at servers and supercomputers, given the apparent reliance on multiple GPUs, unless I'm missing something. Presumably being able to feed a GPU with data faster than a PCIe 4.0 x16 interface would allow has tons of applications in those markets.

The patent specifically mentions a graphics-related issue, but I agree it is not limited to graphics only.
BACKGROUND
Accelerated processing devices ("APDs") include hardware for performing tasks such as rendering graphics. Some computer systems include multiple APDs. The inclusion of multiple APDs is intended to speed up the tasks performed by the APDs. However, due to the complexities in designing software configured to take advantage of the presence of multiple APDs, computer systems often do not utilize the full potential of multi-APD systems.

It is possible for work on a particular APD to be constrained by the speed of the assigned portion of the interconnect fabric (such as the PCIe connection assigned to that APD). More specifically, it is possible for work on the APD to be processed more quickly than work can be transferred over the PCIe connection to the APD. The techniques herein increase the effective bandwidth for transfer of data from the CPU to the APD and/or from the APD to the CPU through cooperation of one or more other APDs in a multi-APD system. For a write to a "target" APD, the technique involves transmitting data both directly to the target APD as well as indirectly to the target APD through one or more other APDs (designated as "helper" APDs). The one or more helper APDs then transmit the data to the target APD through a high speed inter-APD interconnect. Although data transferred "indirectly" through the helper APD may take more time to be transferred to the target APD than a direct transfer, the total effective bandwidth to the target APD is increased due to the high-speed inter-APD interconnect. For a read operation from a "source" APD, the technique is similar, but reversed. More specifically, the technique involves transmitting data from the source APD to the processor, both directly, as well as indirectly through one or more helper APDs. A "source" APD--that is, an APD involved in a read operation from the processor 102--may also be referred to sometimes herein as a "target" APD.

FIG. 1 is a block diagram of an example device 100 in which one or more features of the disclosure can be implemented. The device 100 could be one of, but is not limited to, for example, a computer, a gaming device, a handheld device, a set-top box, a television, a mobile phone, a tablet computer, or other computing device. The device 100 includes a processor 102, a memory 104, a storage 106, one or more input devices 108, and one or more output devices 110. The device 100 also includes one or more input drivers 112 and one or more output drivers 114. Any of the input drivers 112 are embodied as hardware, a combination of hardware and software, or software, and serve the purpose of controlling input devices 112 (e.g., controlling operation, receiving inputs from, and providing data to input drivers 112). Similarly, any of the output drivers 114 are embodied as hardware, a combination of hardware and software, or software, and serve the purpose of controlling output devices 114 (e.g., controlling operation, receiving inputs from, and providing data to output drivers 114). It is understood that the device 100 illustrated and described is an example and can include additional components not shown in FIG. 1, or may omit one or more components illustrated in FIG. 1.
 
New patent awarded to AMD
Stream data directly from SSD/NVMe to GPU memory, bypassing the host.
10,678,733 Apparatus for connecting non-volatile memory locally to a GPU through a local switch

That's US20190243791A1, which describes various ways to connect a video card's GPU and onboard NVMe or flash controllers through a dedicated PCI Express switch, so that they can communicate directly with each other using DMA peer-to-peer (P2P) transactions (in addition to standard transactions from the devices to the CPU root complex).
There are 64 lanes in total: two 16-lane links to the GPUs, a 16-lane link to the host CPU, and 2-lane or 4-lane links to 8x flash memory chips or 4x/8x NVMe drives.
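(The lane budget adds up: 2 x 16 for the GPUs + 16 for the host + 16 for storage (8 x2 links or 4 x4 links) = 64 lanes.)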

They specifically mention a solid state graphics card (SSG), as in the Radeon Pro SSG, but this one is similar in concept to the PCIe switch complex in the Nvidia DGX-2 supercomputer, which has 12 PCIe x16 switches connected in a two-level hierarchy. The CPUs connect to two leaves, each with three switches cross-connecting an NVMe disk, two onboard GPUs, and a PCIe x16 slot for a fiber-optic NIC (see the NVIDIA GPUDirect Storage presentation).

10,672,095 Parallel data transfer to increase bandwidth for accelerated processing devices
That's US20190188822A1 - Infinity Fabric related?
 
A secondary APD/GPU helps with the data transfer: multiple transfers go to multiple GPUs in parallel, and the second GPU then forwards its share back to the first GPU.
What is interesting is what exactly the bottleneck is that AMD wants to address, considering their GPUs are already PCIe 4.0 capable. If I had to guess, it might have something to do with texture streaming.
The bottleneck is host memory and host controller bandwidth, I presume.
If you need to broadcast memory to multiple GPUs, then re-distribution between GPUs, over a switch local to each GPU pair, is a sensible choice. And if the original source was DMA (mostly: NVMe drives), not buffered in host memory, then P2P redistribution is a must.
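Rough numbers to back that up (my arithmetic): a PCIe 4.0 x16 link carries roughly 32 GB/s per direction, so feeding two GPUs at full rate from host memory needs about 64 GB/s of reads, already above the ~51 GB/s of dual-channel DDR4-3200, before counting the write that put the data there in the first place.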
 
Anybody know what this patent is about?
US20190385270A1 Single pass prefix sum in a vertex shader

This describes an implementation of the parallel prefix sum (scan) algorithm using vertex shader wavefronts, with practical applications like emulating glLineStipple.
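For anyone unfamiliar: a prefix sum turns per-segment lengths into each segment's starting distance along the line, which is exactly what you need to phase a stipple pattern across segments. A sequential sketch of the idea (on the GPU the scan runs across a wavefront in the vertex shader; the names here are mine):

Code:
#include <stdio.h>

int main(void)
{
    /* screen-space lengths of consecutive segments in a line strip */
    float seg_len[] = { 3.0f, 5.0f, 2.0f, 4.0f };
    int n = sizeof(seg_len) / sizeof(seg_len[0]);
    float start, running = 0.0f;

    for (int i = 0; i < n; i++) {
        start = running;        /* exclusive prefix sum */
        running += seg_len[i];
        /* 'start' is the distance at which segment i begins; the
         * stipple pattern is offset by it so it continues unbroken */
        printf("segment %d starts at %.1f\n", i, start);
    }
    return 0;
}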

Is this primitive shaders?
US10453243B2 Primitive level preemption using discrete non-real-time and real time pipelines

No, it's about GPU scheduler preemption granularity - i.e. task switching on buffer, primitive, triangle, pixel, or instruction-level boundaries (since WDDM 1.2 for Windows 8).

Not sure what the invention is here; they basically describe primitive-level scheduling (i.e. groups of triangles) with two priorities (realtime and non-realtime).
 
Isn't that related to the MES (HWS) on Navi, which supports preemption inside its scheduling queues?

Code:
enum MES_AMD_PRIORITY_LEVEL {
    AMD_PRIORITY_LEVEL_LOW      = 0,
    AMD_PRIORITY_LEVEL_NORMAL   = 1,
    AMD_PRIORITY_LEVEL_MEDIUM   = 2,
    AMD_PRIORITY_LEVEL_HIGH     = 3,
    AMD_PRIORITY_LEVEL_REALTIME = 4,
    AMD_PRIORITY_NUM_LEVELS
};
 
Not sure what the invention is here; they basically describe primitive-level scheduling (i.e. groups of triangles) with two priorities (realtime and non-realtime).
The invention isn't the preemption itself, but rather "incomplete preemption with software emulation". Meaning only wavefront dispatch is halted, but all (or at least some) fixed-function parts of the pipeline are not flushed and stay tied to the currently preempted draw call. Instead, the parts of the pipeline that are more difficult to flush (I expect that's aiming at parts of the geometry engine, something before actual rasterization) are re-implemented as a pure compute shader for the real-time context, ensuring hard real-time guarantees and reduced preemption overhead at the cost of efficiency.
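If that reading is right, the control flow would look roughly like this. A minimal sketch with invented stub names (not AMD's actual interfaces):

Code:
#include <stdio.h>

/* invented stubs standing in for hardware/firmware operations */
static void halt_wave_dispatch(void)   { puts("stop launching new wavefronts"); }
static void run_rt_as_compute(void)    { puts("realtime pass via compute shader"); }
static void resume_wave_dispatch(void) { puts("preempted draw call continues"); }

int main(void)
{
    halt_wave_dispatch();    /* only wave dispatch stops; fixed-function
                              * geometry state stays resident, unflushed */
    run_rt_as_compute();     /* hard-to-flush geometry stages are emulated
                              * in compute for the realtime context */
    resume_wave_dispatch();  /* cheap resume, since nothing was flushed */
    return 0;
}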
 
This is probably the weirdest GPU release of all time.
Specs are up at AMD's website:


AMD Radeon™ Pro 5600M
Compute Units: 40
Stream Processors: 2,560
Peak Engine Clock: up to 1035 MHz
FP32 TFLOPS: up to 5.3
TGP: 50 W
HBM2 Memory: up to 8 GB
Memory Bandwidth: 394 GB/s
Memory Speed: 1.54 Gbps
 
So it's a 40 CU / 20 WGP GPU clocked at ~1 GHz with two very low-clocked HBM2 stacks, at a 50 W TGP.

Apple just updated its webpage out of the blue, and AMD said nothing.
 
That'll be an interesting undervolting part.
It has to run a lot lower than 1.2 V, because one stack at the regular 1.2 V / 2.4 Gbps should already provide more than enough bandwidth for a 5.3 TFLOPS GPU. There has to be a credible reason for using 2 stacks.
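Back-of-envelope (my arithmetic, assuming standard 1024-bit HBM2 stacks): 2 stacks x 1024 bits x 1.54 Gbps / 8 = 394 GB/s, matching the listed figure, while a single stack at 2.4 Gbps would already give 1024 x 2.4 / 8 = 307 GB/s.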
 
There's an announcement now:

https://www.globenewswire.com/news-...-16-inch-MacBook-Pro-for-Users-On-the-Go.html

It's an RDNA1 GPU, so yes, most probably Navi 12.
It's a very late RDNA1 GPU, not a very early RDNA2 one. And it's most probably exclusive to Apple, like Vega 12 before it.


I do wonder if the volume of high-end MacBooks (which on a worldwide scale is a niche within a niche) is really worth all the trouble of making an exclusive GPU just for it.
The Pro 5500 8GB GDDR6 -> Pro 5600 8GB HBM2 is a $700 (seven hundred) upgrade though, so AMD might be making as much money from this as Nvidia makes out of a 2080 Ti.
 