AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Discussion in 'Architecture and Products' started by Deleted member 13524, Sep 20, 2016.

  1. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    This nitpick in the AMD Vega thread should be allowed:
P100 has more registers in total than any other nVidia GPU. Fiji, and very likely Vega 10, both have more registers in total than that (16 MB of vector registers, 16.8 MB including the scalar ones).
Traditionally, AMD builds GPUs with relatively large register files, so improving the energy efficiency of register file accesses could help there too. The question is how efficient or wasteful AMD's current register file design is (or whether the lower-hanging fruit is elsewhere), and how one could improve it without giving up the general simplicity of how it works. GCN appears to have been carefully designed from the beginning to reach this simplicity, and different aspects are intertwined to make it work.
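The totals quoted above can be sanity-checked with a little arithmetic. This is a sketch assuming Fiji's public configuration (64 CUs, 4 SIMDs per CU, 256 VGPRs per lane); the 800 physical SGPRs per SIMD is an assumption based on commonly cited GCN figures.

```python
# Back-of-the-envelope check of the Fiji register file totals.
CUS = 64
SIMDS_PER_CU = 4
LANES = 64                 # wavefront width
VGPRS_PER_LANE = 256       # 32-bit vector registers per lane
SGPRS_PER_SIMD = 800       # physical 32-bit scalar registers (assumed)

vector_bytes = CUS * SIMDS_PER_CU * LANES * VGPRS_PER_LANE * 4
scalar_bytes = CUS * SIMDS_PER_CU * SGPRS_PER_SIMD * 4

print(vector_bytes / 2**20)                   # 16.0 MB of VGPRs
print((vector_bytes + scalar_bytes) / 2**20)  # ~16.8 MB with SGPRs
```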
     
    #921 Gipsel, Jan 25, 2017
    Last edited: Jan 25, 2017
  2. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Maybe indirectly, but the register file size for the 3,840-CUDA-core P40 is 7,680 KB, keeping the ratio of 256 KB × 30 SMs. The first full-core Pascal thus has nearly half the register file size of the P100, which has 56 SMs.
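A quick check of these figures, using the per-SM register file size and SM counts given above:

```python
# Each Pascal SM carries a 256 KB register file, so totals
# scale directly with the SM count.
KB_PER_SM = 256

p40_kb = KB_PER_SM * 30    # GP102 (P40):  30 SMs
p100_kb = KB_PER_SM * 56   # GP100 (P100): 56 SMs

print(p40_kb)              # 7680 KB
print(p100_kb)             # 14336 KB
print(p40_kb / p100_kb)    # ~0.54, i.e. "nearly half"
```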
    Cheers
     
  3. JoshMST

    Regular

    Joined:
    Sep 2, 2002
    Messages:
    467
    Likes Received:
    25
    I have no idea what I am talking about.
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    This turns s_waitcnt effectively into a vector instruction with multiple write destinations, none of which are encoded in the instruction itself for the instruction buffer or upstream decoding to reference. Actually, the "s_" portion of it may already be a problem, since it puts it in the wrong domain and possibly a different sequencer. That s_waitcnt is a special instruction consumed in the instruction buffer may indicate it is even further upstream than that, and may only do a very simple operation in an early pipe stage that checks the counters and flips a bit to indicate the wavefront is ready in the next cycle.
    This change would be significant as it requires the operation to communicate across the IB, SIMD, and VMEM domains in non-trivial ways with information not actually encoded in the instruction.
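As a rough illustration of the baseline behavior under discussion, here is a toy model of the s_waitcnt counter mechanism: outstanding memory operations bump a per-wavefront counter at issue, completions decrement it in order, and s_waitcnt stalls the wave until the counter drops to the encoded threshold. This is a deliberate simplification for illustration only and says nothing about the pipeline-stage and domain-crossing issues raised above.

```python
# Toy model of GCN s_waitcnt semantics (vmcnt only).
class Wavefront:
    def __init__(self):
        self.vmcnt = 0          # outstanding vector-memory operations

    def issue_load(self):
        self.vmcnt += 1

    def complete_load(self):
        assert self.vmcnt > 0
        self.vmcnt -= 1         # completions retire in order

    def s_waitcnt(self, n):
        """True if the wave may proceed, False if it must stall."""
        return self.vmcnt <= n

w = Wavefront()
w.issue_load(); w.issue_load(); w.issue_load()
print(w.s_waitcnt(0))   # False: three loads still in flight
w.complete_load(); w.complete_load()
print(w.s_waitcnt(1))   # True: only one load outstanding
```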

    That implies a certain amount of information moving upstream. There is some number of pipeline stages between where s_waitcnt is able to make the thread wait, and where registers become available. Some of the manual wait states that affect instruction issue like setting VSKIP may mean a lag of 1 or 2 (x4 for real cycles), and that is following the flow of the pipeline. There may be registers that aren't available, which s_waitcnt would not be able to know for some time.

    Is there a particular reason why an instruction needs to explicitly specify a slot number? Memory operations are ordered, and it's not saving register context if the same slot is linked to more than one register pending the same waitcnt. The scenario where more than one slot is applied to the same register is equivalent to two loads hitting the same register.

    Since this still imposes a requirement that there be physical storage somewhere, why not put it into a special range of the register file? On top of that, add a small bit of register renaming and remove the explicit slot number. The target registers prior to a waitcnt get added to a small list, with each entry being paired with one of the special registers. A waitcnt would flip a bit or add a small offset, and then the current register and the load switch roles.
    Possibly, you could just define a range at one end of the register allocation for this, although that might be something along the lines of a clause-temporary or scratch register from the VLIW days.
    It keeps the shared vector memory pipeline from being modified too heavily, and reduces the need for communication between pipelines.
     
    Razor1 and sebbbi like this.
  5. revan

    Newcomer

    Joined:
    Nov 9, 2007
    Messages:
    55
    Likes Received:
    18
    Location:
    look in the sunrise ..will find me
  6. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Hmm, so he really claimed this HBCC stuff manages the memory in hardware and the developer doesn't have to take much care of it (which doesn't rule out driver intervention, though). Let's see how this will work out.
    PS: And he explicitly confirmed that the draw stream binning rasterizer just works without the need for code changes, as one could expect (someone here was doubting that).
     
    Lightman likes this.
  7. seahawk

    Regular

    Joined:
    May 18, 2004
    Messages:
    511
    Likes Received:
    141
    Does the VRAM controller need much software input today? IMHO no, as long as you do not try to interfere with what the application does. I cannot see how hardware could correctly guess which textures need to be loaded, and which will not be used soon and can be discarded, without the application handling it.
     
  8. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    696
    Likes Received:
    446
    Location:
    Slovenia
    But the driver does know that. It has access to the future rendering commands it will send to the GPU, and as such knows which textures it will need (as I mentioned before, they like it if they can have an entire frame pending; the so-called "render ahead"). Though that does not necessarily mean it also knows which mip level it will need. There really isn't much interesting stuff going on between the VRAM controller and software, but only until the application blows past the available GPU local memory. That's when the fun starts.
    DX11 drivers are already very good at this. The only problem with drivers interfering with applications arises in cases of virtual texturing, when an application overshoots the available GPU memory. This is because the driver will say "ok, this texture hasn't been used in a while, push it to system memory" while the game goes "ok, this texture hasn't been used in a while, let's reuse it for this new texture which we need next frame". But this should kinda fix itself with DX12/Vulkan and the use of actual hardware sparse resources.
    With HBCC and memory paging, yes, it can all just work. When the GPU accesses something that's not in local memory, it will fault and suspend the execution of the wave, and the OS will upload the required page to the GPU. After that is done, the wave will wake up and resume. This will "just work". The question is how well (without someone issuing a prefetch).
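The fault-and-resume flow described here can be sketched as classic demand paging, on the assumption that HBCC pages behave like OS pages. The class and method names are illustrative, not real driver APIs.

```python
# Demand-paging sketch of the HBCC "just works" path.
PAGE = 4096

class LocalMemory:
    def __init__(self):
        self.resident = {0}     # page 0 starts resident
        self.faults = 0

    def access(self, addr):
        page = addr // PAGE
        if page not in self.resident:
            self.faults += 1
            # the faulting wave is suspended here while the OS/driver
            # migrates the page in from system memory over PCIe
            self.resident.add(page)
        # wave resumes; the access now hits local memory
        return page

mem = LocalMemory()
mem.access(100)        # page 0: resident, no fault
mem.access(5000)       # page 1: fault, migrate, resume
print(mem.faults)      # 1
```

How well this performs hinges on exactly the point made above: every fault costs a PCIe round trip, so the scheme lives or dies by how rarely it faults (or how well something prefetches).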
     
    Razor1, Lightman and DavidGraham like this.
  9. seahawk

    Regular

    Joined:
    May 18, 2004
    Messages:
    511
    Likes Received:
    141
    But there is little difference between a conventional VRAM controller and the HBCC if you need to stop the wavefront until the data has been loaded from system RAM; both kinds of local memory are much faster than PCIe. So for the HBCC to make sense, it would need to wait less often than a conventional VRAM controller.
     
  10. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    696
    Likes Received:
    446
    Location:
    Slovenia
    At the moment you won't just stop a wavefront. You'll stop feeding rendering commands to the GPU (when the next command needs a texture that's not yet local).
     
    Razor1 likes this.
  11. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    417
    Likes Received:
    381
    Depends on what you mean by software input. The driver still needs to update the page table and make sure the needed resources are resident before an allocation in the GPUVM address space is accessed, and this part of memory management is now exposed via modern APIs (D3D12, Vulkan, etc.). Otherwise the GPU would go dead when a page fault occurred (except for sparse textures).

    Accessing pageable memory via the ATC (process VAS) can already be done transparently on the right hardware (e.g. Carrizo).
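The explicit-residency model described here can be sketched as follows: before work referencing an allocation is submitted, the application (or driver) must make that allocation resident, or the access would fault. All names are illustrative, not actual D3D12/Vulkan entry points.

```python
# Toy explicit-residency model in the spirit of modern APIs.
class GpuVm:
    def __init__(self):
        self.resident = set()

    def make_resident(self, allocation):
        self.resident.add(allocation)

    def submit(self, command_list, referenced):
        # modern APIs put this burden on the application: every
        # allocation the command list touches must already be resident
        missing = [a for a in referenced if a not in self.resident]
        if missing:
            raise RuntimeError(f"non-resident allocations: {missing}")
        # ... hand command_list to the GPU here ...

vm = GpuVm()
vm.make_resident("texture0")
vm.submit("draw_pass", ["texture0"])   # fine: everything is resident
```

The contrast with the HBCC discussion above is the point: here residency is the software's problem up front, rather than being resolved by a hardware fault at access time.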
     
  12. arijoytunir

    Regular

    Joined:
    Nov 13, 2012
    Messages:
    347
    Likes Received:
    12
    What are the die sizes of the Polaris 11 and 10 GPUs?
     
  13. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    22,146
    Likes Received:
    8,533
    Location:
    ಠ_ಠ
    ~123 & 232 mm^2
     
  14. ImSpartacus

    Regular

    Joined:
    Jun 30, 2015
    Messages:
    252
    Likes Received:
    199
    The "AMD Radeon RX 490 8GB" was mentioned as part of the recommended specs for the Fallout 4 High Res Texture Pack. The recommendation from the Nvidia side was a 1080.

    It's now been removed and the only recommendation is the 1080, but the folks at r/fallout happened to copy the text.

    Original Text:

     
  15. seahawk

    Regular

    Joined:
    May 18, 2004
    Messages:
    511
    Likes Received:
    141
    Big Vega or little Vega?
     
  16. Transponster

    Newcomer

    Joined:
    Feb 24, 2016
    Messages:
    74
    Likes Received:
    13
    Little Vega shouldn't be that much faster than the RX 480; it's basically supposed to replace it.
     
  17. Anarchist4000

    Veteran

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Might not even be Vega. It could be P12 (unlikely, too small?) or a P10 GX2 (no idea), assuming it wasn't a typo.
     
  18. ImSpartacus

    Regular

    Joined:
    Jun 30, 2015
    Messages:
    252
    Likes Received:
    199
    I'm thinking if it was a typo, they would've corrected the typo and been done with it.

    It's not unheard of for recommended specs to pick the "best" part from each camp, even if those aren't terribly comparable, e.g. 1080, 480.

    But if that was the case, then they would've just fixed the typo rather than remove mention of the "490".
     
    pharma likes this.
  19. The problem with the typo theory is that an RX 480 would never be in the same performance bracket as the GTX 1080. Then again, these requirements are completely messed up a lot more often than they should be.

    I'm gonna go with either a typo or the dual Polaris 10 that has also been appearing in the driver and kernel listings.
    2017's new product line will probably consist of the Polaris cards being rebranded to RX 5xx, big Vega being Fury Z or 2, and small Vega being the RX 580 or 590.

    Speaking of Vega, Hynix has updated their HBM2 portfolio again, and it seems they now only have 1.6Gbps chips. 2Gbps chips are nowhere to be found.

    https://videocardz.com/65649/sk-hynix-updates-memory-product-catalog-hbm2-available-in-q1-2017

    So either big Vega is coming with 1.6Gbps after all (410GB/s total), or AMD will be turning to Samsung to get the 2Gbps chips on big Vega and Hynix's chips will be used on something else like small Vega or APUs.
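The 410 GB/s figure follows from standard HBM2 parameters, assuming big Vega pairs two stacks, each with the usual 1024-bit interface:

```python
# Aggregate HBM2 bandwidth at 1.6 Gbps per pin.
stacks = 2                  # assumed stack count for big Vega
bus_bits_per_stack = 1024   # standard HBM2 stack interface width
gbps_per_pin = 1.6

gb_per_s = stacks * bus_bits_per_stack * gbps_per_pin / 8
print(gb_per_s)             # 409.6 -> the ~410 GB/s quoted above
```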
     
  20. Anarchist4000

    Veteran

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Not necessarily. A web guy could have removed it while its accuracy was being verified, or because an NDA was breached and it was drawing attention.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.