AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

Thread Status:
Not open for further replies.
  1. Krteq

    Krteq Newcomer

    Wait a sec... there is no mention of a 128-bit bus in the amdgpu kernel driver for Sienna Cichlid.

    If you mean this code from staging...
    Code:
       if (adev->asic_type == CHIP_SIENNA_CICHLID && amdgpu_emu_mode == 1) {
           adev->gmc.vram_type = AMDGPU_VRAM_TYPE_GDDR6;
           adev->gmc.vram_width = 1 * 128; /* numchan * chansize */
       } else {
    ... it refers only to emulation mode during memory-controller initialization. See the code from the mainline kernel:

    Code:
    /**
     * gmc_v10_0_mc_init - initialize the memory controller driver params
     *
     * @adev: amdgpu_device pointer
     *
     * Look up the amount of vram, vram width, and decide how to place
     * vram and gart within the GPU's physical address space.
     * Returns 0 for success.
     */
    static int gmc_v10_0_mc_init(struct amdgpu_device *adev)
    {
        int chansize, numchan;
    
        if (!amdgpu_emu_mode)
            adev->gmc.vram_width = amdgpu_atomfirmware_get_vram_width(adev);
        else {
            /* hard code vram_width for emulation */
            chansize = 128;
            numchan = 1;
            adev->gmc.vram_width = numchan * chansize;
        }
    
     
  2. A new patent awarded to AMD (it is not exactly HBCC). This will go very well with the other AMD patent for decompressing textures in real time using GPU shaders.
     It streams data directly from an SSD/NVMe drive to GPU memory, bypassing the host.

    10678733 Apparatus for connecting non-volatile memory locally to a GPU through a local switch
     
  3. Another very interesting new patent award

     10,672,095 Parallel data transfer to increase bandwidth for accelerated processing devices
     A secondary APD/GPU helps with the data transfer: the host transfers data to multiple GPUs in parallel, and the second GPU then forwards its share back to the first GPU.
     What is interesting is exactly which bottleneck AMD wants to address here, considering their GPUs are already PCIe 4.0 capable. If I had to guess, it might have something to do with texture streaming.

     When I think about it, could the addition of two new SDMA instances on Sienna Cichlid (in addition to the two already existing on Navi1X) have something to do with this?
     
  4. pjbliverpool

    pjbliverpool B3D Scallywag Legend

    Potentially a next-gen Pro SSG? It may be applicable to desktop systems too, but this certainly seems pretty similar to the Pro SSG.

    I assume this is targeted at servers and supercomputers given the seeming reliance on multiple GPUs, unless I'm missing something. Presumably being able to feed a GPU with data faster than a PCIe 4.0 x16 interface would allow has tons of applications in those markets.
     
  5. The patent specifically mentions a graphics-related issue. But I agree it is not limited to graphics only.
    It is possible for work on a particular APD to be constrained by the speed of the assigned portion of the interconnect fabric (such as the PCIe connection assigned to that APD). More specifically, it is possible for work on the APD to be processed more quickly than work can be transferred over the PCIe connection to the APD. The techniques herein increase the effective bandwidth for transfer of data from the CPU to the APD and/or from the APD to the CPU through cooperation of one or more other APDs in a multi-APD system. For a write to a "target" APD, the technique involves transmitting data both directly to the target APD as well as indirectly to the target APD through one or more other APDs (designated as "helper" APDs). The one or more helper APDs then transmit the data to the target APD through a high speed inter-APD interconnect. Although data transferred "indirectly" through the helper APD may take more time to be transferred to the target APD than a direct transfer, the total effective bandwidth to the target APD is increased due to the high-speed inter-APD interconnect. For a read operation from a "source" APD, the technique is similar, but reversed. More specifically, the technique involves transmitting data from the source APD to the processor, both directly, as well as indirectly through one or more helper APDs. A "source" APD--that is, an APD involved in a read operation from the processor 102--may also be referred to sometimes herein as a "target" APD.

    FIG. 1 is a block diagram of an example device 100 in which one or more features of the disclosure can be implemented. The device 100 could be one of, but is not limited to, for example, a computer, a gaming device, a handheld device, a set-top box, a television, a mobile phone, a tablet computer, or other computing device. The device 100 includes a processor 102, a memory 104, a storage 106, one or more input devices 108, and one or more output devices 110. The device 100 also includes one or more input drivers 112 and one or more output drivers 114. Any of the input drivers 112 are embodied as hardware, a combination of hardware and software, or software, and serve the purpose of controlling input devices 112 (e.g., controlling operation, receiving inputs from, and providing data to input drivers 112). Similarly, any of the output drivers 114 are embodied as hardware, a combination of hardware and software, or software, and serve the purpose of controlling output devices 114 (e.g., controlling operation, receiving inputs from, and providing data to output drivers 114). It is understood that the device 100 illustrated and described is an example and can include additional components not shown in FIG. 1, or may omit one or more components illustrated in FIG. 1.
     
  6. DmitryKo

    DmitryKo Regular

    That's US20190243791A1, which describes various ways to connect a video card's GPU and onboard NVMe or flash controllers through a dedicated PCI Express switch and have them communicate directly with each other using DMA peer-to-peer (P2P) transactions (in addition to standard transactions from the devices to the CPU Root Complex).
    There are a total of 64 lanes: two 16-lane links to the GPUs, a 16-lane link to the host CPU, and 2-lane or 4-lane links to 8x flash memory chips or 4x/8x NVMe drives.

    They specifically mention a solid state graphics card (SSG), as in the Radeon SSG, but this one is similar in concept to the PCIe switch complex in the Nvidia DGX-2 supercomputer, which has 12 PCIe x16 switches connected in a two-level hierarchy. The CPUs connect to two leaves, each with three switches cross-connecting an NVMe disk, two onboard GPUs, and a PCIe x16 slot for a fiber-optic NIC (see the NVIDIA GPUDirect Storage presentation).

    That's US20190188822A1 - Infinity Fabric related?
     
    Last edited: Sep 7, 2020
  7. Ext3h

    Ext3h Regular

    The bottleneck is host memory and host controller bandwidth, I presume.
    If you need to broadcast memory to multiple GPUs, then re-distribution between GPUs over a switch local to each GPU pair is a sensible choice. And if the original source was DMA (mostly NVMe drives), not buffered in host memory, then P2P redistribution is a must.
     
  8. Digidi

    Digidi Regular

    Nice Patents
     
  9. Digidi

    Digidi Regular

  10. Digidi

    Digidi Regular

  11. DmitryKo

    DmitryKo Regular

    US20190385270A1 Single pass prefix sum in a vertex shader

    This describes an implementation of the parallel prefix sum (scan) algorithm using vertex-shader wavefronts, with practical applications like emulating glLineStipple.

    US10453243B2 Primitive level preemption using discrete non-real-time and real time pipelines

    No, it's about GPU scheduler preemption granularity - i.e. task switching on buffer, primitive, triangle, pixel, or instruction-level boundaries (available since WDDM 1.2 for Windows 8).

    Not sure what the invention is here; they basically describe primitive-level scheduling (i.e. on groups of triangles) with two priorities (realtime and non-realtime).
     
    Last edited: Jun 12, 2020
  12. Isn't that related to the MES (HWS) on Navi, which supports preemption inside its scheduling queues?

    Code:
    enum MES_AMD_PRIORITY_LEVEL {
        AMD_PRIORITY_LEVEL_LOW      = 0,
        AMD_PRIORITY_LEVEL_NORMAL   = 1,
        AMD_PRIORITY_LEVEL_MEDIUM   = 2,
        AMD_PRIORITY_LEVEL_HIGH     = 3,
        AMD_PRIORITY_LEVEL_REALTIME = 4,
        AMD_PRIORITY_NUM_LEVELS
    };
     
  13. Ext3h

    Ext3h Regular

    The invention isn't the preemption itself, but rather "incomplete preemption with software emulation". Meaning only wavefront dispatch is halted, while all (or at least some) fixed-function parts of the pipeline are not flushed and stay tied to the currently preempted draw call. Instead, the parts of the pipeline that are more difficult to flush (I expect that's aiming at parts of the geometry engine, something before actual rasterization) are re-implemented as pure compute shaders for the real-time context, ensuring hard real-time guarantees and reduced preemption overhead at the cost of efficiency.
     
  14. Bondrewd

    Bondrewd Veteran

  15. This is probably the weirdest GPU release of all time.
    Specs are up at AMD's website:


    AMD Radeon™ Pro 5600M

    Compute Units: 40
    Stream Processors: 2,560
    Peak Engine Clock: up to 1035 MHz
    FP32 TFLOPS: up to 5.3
    TGP: 50 W
    HBM2 Memory: up to 8 GB
    Memory Bandwidth: 394 GB/s
    Memory Speed: 1.54 Gbps
     
  16. So it's a 40 CU / 20 WGP GPU clocked at ~1 GHz with two very low-clocked HBM2 stacks, within a 50 W TGP.

    Apple just updates its webpage out of the blue and AMD says nothing.
     
  17. Scott_Arm

    Scott_Arm Legend

    That'll be an interesting under-volting part.
     
  18. It has to be a lot lower than 1.2 V, because one stack at the regular 1.2 V / 2.4 Gbps should already provide more than enough bandwidth for a 5.3 TFLOPS GPU. There has to be a credible reason for using two stacks.
     
  19. There's an announcement now:

    https://www.globenewswire.com/news-...-16-inch-MacBook-Pro-for-Users-On-the-Go.html

    It's an RDNA1 GPU, so yes, most probably Navi 12.
    It's a very late RDNA1 GPU, not a very early RDNA2 one. And it's most probably exclusive to Apple, like Vega 12 before it.


    I do wonder if the volume of high-end MacBooks (which on a worldwide scale is a niche within a niche) really is worth all the trouble of making an exclusive GPU just for it.
    The Pro 5500 8GB GDDR6 -> Pro 5600 8GB HBM2 upgrade costs $700 (seven hundred), though, so AMD might be making as much money from this as Nvidia makes out of a 2080 Ti.
     
  20. Of which a decent chunk is then pocketed by Apple...
     
    Last edited by a moderator: Jun 15, 2020