AMD: Volcanic Islands R1100/1200 (8***/9*** series) Speculation/ Rumour Thread

Discussion in 'Architecture and Products' started by Nemo, May 7, 2013.

  1. jimbo75

    Veteran

    Joined:
    Jan 17, 2010
    Messages:
    1,211
    Likes Received:
    0
    7850 with what CPU...

    edit - just noticed the cpu benchmarks. Hmm.
     
  2. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,583
    Likes Received:
    703
    Location:
    Guess...
    What does it matter? The CPU unleashes the GPU's full potential, it doesn't make it faster than it is. It seems as long as you have a half decent tri-core or better you can max out a 7850.

    It's not logical to assume the consoles are CPU limited in this game; if they were, they'd be running at higher resolutions than they are. Put another way, the resolution (or other graphical effects) will clearly be dialled up in the console versions until the GPUs are maxed out at the target framerate.
     
  3. jimbo75

    Veteran

    Joined:
    Jan 17, 2010
    Messages:
    1,211
    Likes Received:
    0
    I'm going to "wait and see". There's too much going on here and there isn't any obvious logic to it.
     
  4. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
  5. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    That's what I meant by patch data.
    I was giving a suggestion for avoiding the UV calculation in the tessellator entirely, but if AMD does do it, then that's great. The problem is that their data-flow management for tessellation sucks if it needs off-chip storage.
    I don't see how this is so different from triangles from vertex shaders generating pixels that run on multiple CUs/SMXs.

    AMD chips don't push polygons into an off-chip buffer there. They run the VS on however many CUs are deemed appropriate, fill a FIFO on chip to temporarily store the output, and when it's full the CUs stop doing VS work while the rasterizers start emptying the FIFO by creating pixel wavefronts for the CUs to process.

    This is how tessellation should work. Run the HS, fill a FIFO, and when it's full, stop HS work and let the tessellators empty the FIFO by creating DS wavefronts.
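
    The scheme described above can be sketched as a toy producer/consumer model. All the capacities below are invented for illustration (the post doesn't give FIFO sizes); only the 64-wide wavefront is a real GCN figure. The HS fills a small fixed on-chip FIFO with finished patches, stalling when it's full, while the tessellator drains it into DS wavefronts:

    ```python
    from collections import deque

    FIFO_CAPACITY = 8     # patches the on-chip FIFO can hold (invented)
    VERTS_PER_WAVE = 64   # DS vertices per wavefront (GCN wavefront size)

    def run_pipeline(num_patches, verts_per_patch):
        """Count DS wavefronts emitted and cycles the HS spends stalled."""
        fifo = deque()
        ds_wavefronts = 0
        hs_stalls = 0
        produced = 0
        while produced < num_patches or fifo:
            if produced < num_patches and len(fifo) < FIFO_CAPACITY:
                fifo.append(verts_per_patch)   # HS finishes one patch
            else:
                if produced < num_patches:
                    hs_stalls += 1             # HS blocked: FIFO is full
                verts = fifo.popleft()         # tessellator drains one patch
                # ceil-divide the patch's DS verts into wavefronts
                ds_wavefronts += -(-verts // VERTS_PER_WAVE)
                continue
            produced += 1
        return ds_wavefronts, hs_stalls
    ```

    With these made-up numbers, `run_pipeline(100, 64)` emits 100 DS wavefronts, and the HS stalls once per patch after the FIFO first fills; the point of the design is that those stalls free the CUs for DS work rather than spilling to memory.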
     
  6. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,436
    Likes Received:
    264
    HS's can be very expensive, so you need a lot of them in flight to perform well in low-tessellation cases. The FIFO you suggest is the limiting factor for how many HS waves can execute in parallel.

    Say you have 4 control points per patch, 64 bytes of output per control point, plus 6 32-bit tessellation factors per patch and you need 1000 control points in flight to hide the latency of the HS. That's 70000 bytes of storage. You really want it to be double that because you can't free the storage until the DS's are done with it. That's 140000 bytes for fairly skinny control points so why not use your cache or spill to memory if necessary. This way all that storage is available when you're not using tessellation.

    As the tessellation level rises you need less storage in the FIFO, but you don't know that until the HS calculates the tessellation factors.
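
    The storage estimate above can be checked directly; all the figures come from the post itself:

    ```python
    # Reproducing the HS output storage estimate from the post above.
    control_points_in_flight = 1000
    bytes_per_control_point = 64
    control_points_per_patch = 4
    tess_factors_per_patch = 6   # one 32-bit factor each

    patches = control_points_in_flight // control_points_per_patch  # 250 patches
    storage = (control_points_in_flight * bytes_per_control_point
               + patches * tess_factors_per_patch * 4)

    print(storage)       # 70000 bytes, matching the post
    print(storage * 2)   # 140000 bytes with double buffering
    ```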
     
  7. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    First of all, that's a failure in optimization if AMD is sacrificing high-factor performance to improve low-factor performance. Their own devrel material suggests not using tessellation at low factors, and poor HS performance won't matter then anyway.

    The DS is where displacement mapping happens, so I don't see why the HS needs 1000 control points to hide latency. The vast majority of HSes will be pure arithmetic.

    Anyway, HS output going to memory doesn't explain AMD's performance in the least. The higher the tessellation factor, the more DS verts per patch, and the less that the BW/latency of off-chip HS data will affect the triangle throughput of the tessellator.

    In reality, AMD's tris per clock drops drastically with tessellation factor. That must mean they are storing tessellator output off chip (maybe 5-6 bytes per DS vert), not HS output. There's no need to do this with properly designed hardware.
     
  8. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,436
    Likes Received:
    264
    Correct. This is why it doesn't make sense to have a large on chip FIFO.

    If the DS output went off chip you'd see performance scale with memory bandwidth.
     
  9. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,490
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    Yet, Hawaii is scaling just fine with occlusion culling, with or without tessellation. It's only when the data stream hits the rasterizer that the rate seems to be capped.
     
  10. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    I guess this will be explained by the launch reviews - or look at Kaotik's reply. He has proven very insightful time and again. ;)
     
    #1750 CarstenS, Nov 1, 2013
    Last edited by a moderator: Nov 1, 2013
  11. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
  12. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    It doesn't need to be large. Your calculations were dependent on having 1000 HS wavefronts, which is at least 50x too many. It's extremely rare for a HS to need texture access.

    BW wouldn't be an issue (4 verts/clk is < 24 GB/s), unless there is an internal restriction. Latency and/or extra clocks for export/import probably are.
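
    The 24 GB/s figure works out if we assume roughly 6 bytes per DS vertex (the estimate given earlier in the thread) and a core clock around 1 GHz, which is in Hawaii's range; both inputs are assumptions, not from this post:

    ```python
    verts_per_clock = 4
    bytes_per_vert = 6        # two 16-bit UVs plus indexing overhead (assumed)
    core_clock_hz = 1e9       # ~1 GHz, roughly Hawaii-class (assumed)

    bw = verts_per_clock * bytes_per_vert * core_clock_hz
    print(bw / 1e9)           # 24.0 GB/s, matching the estimate in the post
    ```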
     
  13. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,436
    Likes Received:
    264
    I said 1000 control points not 1000 HS wavefronts. If you don't think bandwidth is an issue then I'm not sure why you think that would be a stupid design.
     
  14. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    My mistake. Ignore what I wrote in that paragraph.

    But sustaining 1000 control points is silly for a shader without texture loads in a load-balanced system (even if it does initially issue that many). And even if you're right that the tens of kB needed to buffer enough control points is too much of a cost, problems with HS streamout would produce a t-factor effect on performance opposite to what we're seeing.

    BW for tessellator streamout is not an issue in theoretical tests, but it'll still sap significant BW (BTW, I forgot to multiply those numbers by 2 for write and read, but it's still a lot less than total BW. I assumed two 16-bit UVs plus a couple other bytes for indexing/overhead). So it still is stupid to stream UVs out, especially when they're so tiny and can be created on demand by the tessellator and packed directly into DS wavefronts.
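
    Folding in the factor of 2 mentioned above, and comparing against the R9 290X's 320 GB/s memory bandwidth (a spec figure, not from this post), the streamout traffic is significant but well short of saturating the bus:

    ```python
    streamout_bw = 24e9               # one-way tessellator streamout, from the earlier estimate
    total_traffic = streamout_bw * 2  # data is written out, then read back in
    hawaii_bw = 320e9                 # R9 290X total memory bandwidth

    print(total_traffic / 1e9)        # 48.0 GB/s of round-trip traffic
    print(total_traffic / hawaii_bw)  # 0.15 -> about 15% of board bandwidth
    ```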

    But why are you denying that there's a problem? I'm still waiting for your explanation, but AFAICS the only reason to have lower triangle throughput with high t-factor is UV streamout (high t-factor, in fact, reduces data flow in the pipeline), with stalls appearing either when writing out or reading back in.

    It's a flawed design.
     
  15. hoom

    Veteran

    Joined:
    Sep 23, 2003
    Messages:
    2,948
    Likes Received:
    497
    So 290 non X is AWOL...

    That XTL stuff was about 280X not 290 wasn't it?
     
  16. Dooby

    Regular

    Joined:
    Jul 21, 2003
    Messages:
    478
    Likes Received:
    3
    290 non-X was pushed back to Nov 5th last week. It's not yet completely AWOL.
     
  17. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,436
    Likes Received:
    264
    Don't underestimate how much latency there is in a graphics chip, or how complicated some HS's are, especially once subdivision surfaces start being used more and there are a lot of control points per patch.

    I'm not denying that there is a problem or that the design is flawed. It's just not flawed due to off-chip buffering. I can't give more detail than that. I am curious to see if other vendors make a similar mistake as they add tessellation support. Nvidia did not, though their first design wasn't perfect either. You can see how Nvidia improved between the 580 and 680.
    http://techreport.com/review/22653/nvidia-geforce-gtx-680-graphics-processor-reviewed/6
     
  18. hoom

    Veteran

    Joined:
    Sep 23, 2003
    Messages:
    2,948
    Likes Received:
    497
    OK I must have missed that.
     
  19. Sinistar

    Sinistar I LIVE
    Regular Subscriber

    Joined:
    Aug 11, 2004
    Messages:
    648
    Likes Received:
    61
    Location:
    Indiana
    http://forums.overclockers.co.uk/showthread.php?t=18551534&page=9

    I wonder how much faster these drivers make the R290X?
    Looking forward to the reviews.
     
  20. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Again, I'm accepting your claim that HS output may warrant going off chip. And latency is exactly why I think UV streamout can cause stalls.

    But once HS data is read back, it makes no sense for high tessellation factors to have reduced triangle throughput compared to low factors. The HS overhead gets amortized over more triangles, so throughput should go up.
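
    The amortization argument is simple arithmetic. With an invented per-patch HS cost and a rough quadratic triangle count per patch, the HS cost per triangle collapses as the tessellation factor rises:

    ```python
    hs_cost_cycles = 200        # per-patch HS cost (invented, for illustration)

    def tris_per_patch(tf):
        # Triangle count grows roughly quadratically with tessellation factor
        return tf * tf

    for tf in (2, 8, 32):
        print(tf, hs_cost_cycles / tris_per_patch(tf))
    # HS cycles per triangle shrink rapidly as the factor rises,
    # so fixed per-patch overhead cannot explain falling throughput.
    ```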

    Well, improved off-chip buffering is really the only thing AMD has mentioned between generations, so that's what I assumed. I guess it could be related to DS wavefront generation and distribution, or bank conflicts, or whatever, but it's hard to think of anything that makes throughput go down with higher factors.

    Whatever the issue is, AMD must be doing something stupid in the tessellator, and it has now been there for four generations.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.