NVIDIA shows signs ... [2008 - 2017]

Discussion in 'Graphics and Semiconductor Industry' started by Geo, Jul 2, 2008.

Thread Status:
Not open for further replies.
  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    30 SIMDs running at 1.5GHz is still pretty fast, though you could argue it's no faster than ~3GHz 4-core SSE with perfect SIMD utilisation (scalar code on the GPU would be 4x faster). The GPU has more bandwidth to play with.

    Flat or nested? What cycle counts for the alternate paths? What kinds of scatter/gather operations is the code doing?

    Looking at the code and from what I remember, the vector CS is faster than the scalar CS on ATI. The PS is scalar, only, I believe.

    All the GPUs have real predicates. ATI has a stack of predicates. NVidia has predicate registers. I don't understand the distinction you're trying to make.

    Sorry, you're right.

    It'd be good to work out how Fermi does DP. If it really has 32 SP units and 32 integer units per core that are entirely distinct (sharing register file ports and not overlapped) then something along those lines sounds feasible.

    It would be funny if Fermi and Larrabee do SP, DP and INT the same way.

    http://www.lirmm.fr/arith18/papers/libo-multipleprecisionmaf.pdf

    The optimal multiple-precision unit described here is 3.7x the size of the single-precision unit. It's 18% bigger than the optimal double-precision only unit. The double-precision only unit is 3.1x the size of the single-precision unit.
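The relative-area figures quoted from the paper can be sanity-checked with a line of arithmetic (units are multiples of the single-precision unit's area):

```python
# Sanity check of the quoted area ratios, in units of one SP multiplier.
sp = 1.0
combined = 3.7 * sp   # optimal multiple-precision unit
dp_only = 3.1 * sp    # optimal double-precision-only unit

overhead_pct = (combined / dp_only - 1) * 100
print(round(overhead_pct))  # ~19, close to the ~18% quoted
```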

    Hmm, I guess the 16->4 granularity I was describing is like the 256->64 granularity you're describing.

    Jawed
     
  2. EduardoS

    Newcomer

    Joined:
    Nov 8, 2008
    Messages:
    131
    Likes Received:
    0
    Since each warp executes over 4 clocks and we are using only 1 of them it's more like 30 scalar cores running at 375MHz, and they need 6 (is that for GTX285?) threads per core just to hide execution latency, CPUs do better than that...
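The arithmetic behind those numbers can be sketched as follows (assumed GT200-class figures; the ~24-cycle ALU latency is the commonly cited value, not something stated in this thread):

```python
# Back-of-the-envelope check of the "30 scalar cores at 375MHz" view.
sm_count = 30              # GT200-class part: 30 SMs
hot_clock_ghz = 1.5        # shader hot clock
issue_clocks_per_warp = 4  # each warp instruction is issued over 4 clocks

effective_mhz = hot_clock_ghz * 1e3 / issue_clocks_per_warp
print(sm_count, "scalar cores at", effective_mhz, "MHz")  # 30 at 375.0 MHz

alu_latency_clocks = 24    # commonly cited register-to-register latency
warps_to_hide = alu_latency_clocks // issue_clocks_per_warp
print(warps_to_hide)       # 6 warps in flight per SM to cover it
```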

    Varies a lot, sometimes none.

    Varies too much.

Well... Currently I don't remember any code where it matters... Do you remember any?

Yes, the vector is faster than scalar and the PS is scalar only. I was referring to the speed difference between the PS and CS in the scalar version; forget it, I made some mistakes...

    Anyway I also did simple simulations today:

[chart: efficiency/occupancy vs. warp/wavefront width, row-major vs. Z-order]

Simple and ugly, but better than nothing, or than just guessing. I took it from a trivial Mandelbrot of 1024x1024 pixels: I counted the number of loop iterations of each thread to get the number of unmasked executions, then divided by the number of iterations of a given warp/wavefront. The numbers at the bottom are the widths of the warp/wavefront, and the numbers in the columns are the efficiency/occupancy. I also tested two patterns, row-major order and Z-order, just for information.

Looking at the chart, the first thing to note is that avoiding row-major order for 2D arrays is a no-brainer :grin: locality is important. The shape of the curve in this chart is tied to the resolution I chose.

The second thing, looking at the Z-order line, is the performance gain nVidia gets by using 32 threads per warp instead of 64, and the performance drop from vectorizing the code on AMD hardware (which increases the granularity from 64 to 256): respectively 3.79% and 11.4%. Hardly a big advantage for nVidia, and also hardly an excuse for not vectorizing code with less than 80% ALU occupancy.
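For anyone wanting to reproduce the Z-order traversal, a minimal Morton-index sketch (hypothetical helper names; the bit-interleaving trick itself is standard):

```python
# Z-order (Morton) index: interleave the bits of x and y so that pixels
# close together in 2D stay close together in the 1D thread ordering.

def part1by1(n):
    # spread the low 16 bits of n: bits abcd become 0a0b0c0d
    n &= 0xFFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def morton(x, y):
    return part1by1(x) | (part1by1(y) << 1)

# Pixels visited in Morton order cover 2x2, then 4x4, ... blocks,
# which is why divergence drops compared with long row-major rows.
order = sorted(((x, y) for y in range(4) for x in range(4)),
               key=lambda p: morton(*p))
print(order[:4])  # first 2x2 block: [(0, 0), (1, 0), (0, 1), (1, 1)]
```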

Increasing the resolution increases efficiency in all cases, but I won't plot another chart today. Tomorrow I will try to add some methods for mitigating branch divergence; any bids?

About Voxille's Mandelbrot: due to the loop unrolling, I don't expect it to suffer as much from branch divergence as the trivial implementation does.
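A minimal sketch of the simulation described above (hypothetical function names; a small 64x64 grid instead of 1024x1024 to keep it quick):

```python
# Divergence experiment: count Mandelbrot iterations per pixel, group
# pixels into "warps" of a given width, and measure ALU occupancy as
# useful iterations / issued iterations, where every lane in a warp
# keeps issuing until the slowest lane finishes.

def mandel_iters(cx, cy, max_iter=256):
    x = y = 0.0
    for i in range(max_iter):
        if x * x + y * y > 4.0:
            return i
        x, y = x * x - y * y + cx, 2.0 * x * y + cy
    return max_iter

def occupancy(iters, warp_width):
    useful = sum(iters)
    issued = 0
    for w in range(0, len(iters), warp_width):
        warp = iters[w:w + warp_width]
        issued += max(warp) * len(warp)  # lanes masked, not retired early
    return useful / issued

# Row-major traversal over the classic view window.
N = 64
iters = [mandel_iters(-2.0 + 3.0 * px / N, -1.5 + 3.0 * py / N)
         for py in range(N) for px in range(N)]
for width in (2, 16, 64):
    print(width, occupancy(iters, width))  # occupancy falls as width grows
```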

    The raw data:
    Code:
                1024x1024                 4096x4096
         RowMajor     Zorder       RowMajor     Zorder
    2    98.75399%    98.75399%    99.42808%    99.42808%
    4    97.00922%    97.43842%    98.6191%     98.80145%
    8    94.20914%    95.67982%    97.43809%    97.96538%
   16    89.6497%     93.6318%     95.65114%    97.0504%
   32    82.99384%    90.6495%     92.74299%    95.80479%
   64    74.42341%    87.33846%    88.18603%    94.32423%
  128    61.60484%    82.94035%    81.49467%    92.22841%
  256    41.60408%    77.61035%    73.10745%    89.71968%
  512    24.65091%    72.02141%    60.45533%    85.88808%
 1024    21.14622%    65.19784%    40.78072%    81.84855%
 2048    20.80997%    57.08746%    24.42915%    76.54399%
 4096    20.1047%     49.18869%    21.02423%    71.98172%
 8192    19.28069%    41.28873%    20.83935%    65.88856%
16384    18.3604%     32.75394%    20.49293%    57.56558%
    
It doesn't have a predicate that lets me execute an instruction on one ALU but not on the other in the VLIW group; a predicate there would allow two threads to execute simultaneously, each using part of the execution units in the VLIW. Note that this only makes sense in the context of trying to mitigate branch divergence penalties.

    Optimal for a given task :wink:

The paper goes the reverse of the path we are discussing: it has a full DP-precision multiplier and wants to reuse it for vector SP, while we are discussing how to use SP multipliers to perform DP, which is good for x86, maybe not so useful for GPUs.

We could lower the level to how to build multipliers from scratch and make them able to perform one DP or x SP operations. I'm still arguing that it's possible to achieve one DP or 3 SP operations with few extra resources compared to a DP-only multiplier or 3 independent SP multipliers (and both happen to cost almost the same), and also that this ratio is the most cost-effective.
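A sketch of the decomposition being argued over, on integers rather than a real FP mantissa datapath: a 2w-bit product built from four w-bit partial products, which is one intuition for why DP out of SP multipliers tends toward 1/4 rate. The width `W` and helper names are illustrative only:

```python
# Schoolbook split: (ah*2^w + al) * (bh*2^w + bl)
#   = ah*bh*2^2w + (ah*bl + al*bh)*2^w + al*bl
# i.e. one wide multiply = four narrow multiplies plus shifts and adds.

W = 27  # roughly an SP mantissa pass (24 bits plus margin); illustrative

def wide_mul(a, b):
    mask = (1 << W) - 1
    ah, al = a >> W, a & mask
    bh, bl = b >> W, b & mask
    # four narrow multiplies, shifted and summed
    return ((ah * bh) << (2 * W)) + ((ah * bl + al * bh) << W) + al * bl

a, b = 0x1FFFFFFFFFFFFF, 0x123456789ABCD  # 53-bit-ish "mantissas"
assert wide_mul(a, b) == a * b
```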

    BTW, when designing a multiplier it's possible to do some area-latency tradeoffs, some tricks that are good for SSE CPUs may not be good for GPUs.

    Kind of.
     
  3. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Nvidia seems quite tight-lipped about anything Fermi beyond the whitepaper at the moment.

    Considering what Keane said at GTC, I think it's not totally out of the question that they have an expensive and a cheap set of 16 ALUs in each SM (what before would have been a TPC), the expensive one being able to occupy ports from the cheap one and do DP, while also doubling as an additional SP unit during normal operation.

    That's at least something that'd make sense considering the "even more modular" approach Fermi's taking and Nvidia supposedly being able to easily re-do a gaming-oriented Fermi with fewer transistors.
     
  4. Squilliam

    Squilliam Beyond3d isn't defined yet
    Veteran

    Joined:
    Jan 11, 2008
    Messages:
    3,495
    Likes Received:
    114
    Location:
    New Zealand
    I was wondering how the above in depth technical discussion proves how strained Nvidia are. Can anyone explain this to me?
     
  5. satein

    Regular

    Joined:
    Aug 17, 2005
    Messages:
    483
    Likes Received:
    21
    Location:
    Sheffield, UK.
    Just came across a short article at VR-Zone, "Jen-Hsun Huang: Nvidia is a Software Company, Nvidia's future, Fermi", which was summarised from Jen-Hsun's interview by CHW [link @CHW].

    This is quoted from VR-Zone's article...
    Perhaps those high-margin Quadro products were behind the design decisions of Fermi.

    Hope I put it on the right thread :wink:
     
  6. rendezvous

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    347
    Likes Received:
    12
    Location:
    Lund, Sweden
    Remember, remember the fifth of November.
    NVIDIA will hold their Q3 2010 earnings call on Nov 5, 5:00 pm Eastern time. That should shed some light on their situation.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Whoops, yes. So that's down to approximately the rate of scalars in a 4 core CPU. As for the threads, well we'd be talking about something that's got some kind of data parallel feel to it, otherwise it wouldn't be on the GPU.

    That's a measure of the infancy of this stuff, still no real analysis.

    Ray tracing is a nice, long-standing, example.

    Nice :grin:

    The row major order is more "random" I suppose and is therefore more interesting for the general case. I suppose the Z order's speed-up over row major is something like the gain one would expect to see from using DWF.

    To me the row major order difference between 16 and 64 is the interesting one. That's Larrabee versus ATI on something that's reasonably random. That's 20% better for Larrabee. And nesting would make that difference grow massively, presuming there's a reasonable difference in path lengths. The Julia application seems to be nested 4 deep at most (looking at the D3D and ATI assembly) - two IFs then two loops, all nested.

    How many times is the hardware allowed to re-combine? :razz:

    It implements 2-SP/1-DP. Seems equally applicable to CPUs and GPUs (though x86 has extended precision, 80-bit, as well). I quoted it because it gives a reasonable baseline for a comparison with what you're proposing and an indication of the cost of making something multi-functional.

    It's definitely interesting and it seems possible that it wouldn't be dramatically more expensive than the 1/4 rate we see in ATI right now. Though I'm still puzzled what it is about the current implementation that is so expensive that it is worthwhile cutting it out of the smaller chips. Maybe it's just marketing/differentiation.

    ATI's execution pipeline appears to be only 6 cycles at most (if the MUL starts after operands 1 and 2 have arrived while operand 3 is being fetched) so there isn't much breathing room. Also SSE doesn't have FMA, and a quick Google:

    http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/61121/

    reveals fundamental uncertainties about how to add it...

    Jawed
     
  8. Squilliam

    Squilliam Beyond3d isn't defined yet
    Veteran

    Joined:
    Jan 11, 2008
    Messages:
    3,495
    Likes Received:
    114
    Location:
    New Zealand
    Got it.
     
  9. EduardoS

    Newcomer

    Joined:
    Nov 8, 2008
    Messages:
    131
    Likes Received:
    0
    Well... It's harder to scale to more threads... But OK, in some cases it may still be fast.

    And this makes me remember Niagara...

    The row-major order takes a hit due to the shape of the Mandelbrot set: using long rectangles instead of squares increases the chances of crossing black regions (even if they don't cover a big area, they go from top to bottom). It's not exactly random or unavoidable.

    And... Well... I need more apps to see how common it is for others :)

    I didn't have time today to simulate it; let's see tomorrow. Would you like to propose some algorithms to do it? Preliminary results are not what you are expecting :)

    I would ask "how much does the front end cost? Does it pay for this gain?" But then I remembered that Larrabee is x86, with greater overhead on the front end than both AMD and nVidia, and there are also architectural differences between them:
    1) Having more overhead means the sweet spot moves towards wider SIMD;
    2) Slow gather (as apparently it is...) means that if a 2D buffer isn't stored in memory in Z-order it will be very slow to load it that way; if memory-accessing apps behave like Mandelbrot, wider SIMDs would take a serious performance hit;
    3) Not being clause-based could result in a high number of idle non-computation units (control flow, load/store);
    4) Even if it's not so wide compared to the others, it's the widest SIMD Intel has developed to date.

    Even if 1 favours wider vectors, 2 and 3 go against them, and they may not be willing to take risks because of 4, so the reason for the width choice may have no relation to branching.

    I could try many different forms.

    6 cycles at 850MHz is about 5.3 times more time than K8's 4 cycles at 3GHz.
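The arithmetic, worked out:

```python
# Dependent-op latency in wall-clock time: ATI pipeline vs. a 3GHz K8.
gpu_ns = 6 / 0.85  # 6 cycles at 850MHz, in nanoseconds
cpu_ns = 4 / 3.0   # 4 cycles at 3GHz
print(round(gpu_ns / cpu_ns, 1))  # 5.3
```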

    For Intel; everyone else already has FMA (IIRC POWER6 has it, at 5 cycles at 5GHz), and AMD is set to add it regardless of Intel's problems.
     
  10. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    19,426
    Likes Received:
    10,320
    Don't suppose there's any way for a mod to split off the recent technical discussion and put it in the appropriate forum? It's fascinating stuff but isn't exactly relevant to this thread anymore.

    Regards,
    SB
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I think worst-cases are more interesting...

    As well as Julia I suppose it's possible to rummage in the CUDA/Stream SDKs. Fishing around in GPGPU implementations that haven't scaled usefully might be worthwhile.

    I'm not sure how detailed your simulation is, but constraints like strands have to retain their position when merged (e.g. modulo SIMD width or register file width) or being limited to 8 or 16 threads could cause some grief.

    Hmm, well it seems Fermi is generalising away from graphics, towards Larrabee - big L1 (bigger than Larrabee) is one indication. But Larrabee has MB of L2 cache.

    Larrabee doesn't favour Z order, architecturally, except for texturing I reckon. Z order is a pain in the arse for applications that don't easily match that.

    Ha, you can't keep everything running at 100% utilisation. The Sequencer in ATI should be idle most of the time in any reasonably complex kernel. Anything that's compute bound will make load/store idle for some of the time, too. The scalar part of each Larrabee core is arguably more flexible than the Sequencer in ATI.

    Just got to wait for the die shot that'll reveal the balance between x86, SIMD and cache.

    I think NVidia and AMD have no choice about moving towards the kind of generality in Larrabee as the applications get more complex.

    There's also the size of memory transactions, 512-bit cache lines and ring bus that all go with the SIMD width. And register file organisation.

    That paper suggests 3 cycles for the combined DP/SP unit, which seems pretty nifty. ATI might be bound by the latency of transcendentals, not MAD/FMA. I've got no idea of typical cost/latency trade-offs.

    Jawed
     
  12. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    I don't think there are any real uncertainties. I think the issue has to do with benefits and costs and quite a few of those are unique to Intel's situation. It takes a really long time for people to start using new instructions.

    Intel knows how to do FMA, they have designed those units - and there are papers on doing FMA with an intermediate round for backwards compatibility (targeted at 2.5GHz for Rock).

    I think if you read between the lines of what Mark said, you'll figure out what the problems are...

    David
     
  13. v_rr

    Newcomer

    Joined:
    Apr 30, 2007
    Messages:
    147
    Likes Received:
    0
    GPU shipments grew 21.2% in Q3 '09

    http://vr-zone.com/articles/gpu-shipments-grew-21.2-in-q3-09/7936.html?doc=7936
    http://techreport.com/discussions.x/17834
     
  14. Sxotty

    Legend

    Joined:
    Dec 11, 2002
    Messages:
    5,496
    Likes Received:
    866
    Location:
    PA USA
    Interesting that it adds up to 97% in the most recent quarter and 98% in the quarter before. Someone else was the winner, really :)
     
  15. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    According to BSoN

     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Agreed. I was trying to suggest Intel has choices beyond area/latency.

    Jawed
     
  17. Sxotty

    Legend

    Joined:
    Dec 11, 2002
    Messages:
    5,496
    Likes Received:
    866
    Location:
    PA USA
    Oh I don't look at that site. :)

    Anyway, funnily enough, someone earlier linked to Newegg to show that Nvidia was discontinuing products because most were out of stock
    http://www.newegg.com/Product/Produ...0048 106793261 1067949754&name=Radeon HD 5870

    Oh noes! Discontinued :twisted:

    Actually that is obviously not the case, but the timing of all this is annoying me, as I wanted to get a card before Christmas (which for me means I want some competition, or prices to decline). But I guess I can manage to wait.
     
  18. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    What they were trying to prove with that link is that prices haven't gone down for nVidia products, so either a GTX 285 that is more expensive than an HD 5850 is really, really, really competitive, or the product isn't being treated as a competitive part and they hope to minimise losses on brand sales.
     
  19. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    19,426
    Likes Received:
    10,320
    Interesting. If Nvidia doesn't get something out for the holiday season, it's quite possible that ATI might match or surpass them in market share for Q4. Definitely NOT something I was expecting. Their Q3 numbers were far, FAR worse than I expected.

    Considering it appeared (at least to me) that they were still competitive in the retail space, I can only imagine much of this is due to OEM sales.

    Regards,
    SB
     
  20. Sxotty

    Legend

    Joined:
    Dec 11, 2002
    Messages:
    5,496
    Likes Received:
    866
    Location:
    PA USA
    And we already went over it. Newegg is not going to hold them just for fun unless someone else is paying them to. The longer they wait, the cheaper they will need to go to move them. Newegg has been in business for quite some time, so I figure they understand this. Either people who do not know better are still buying them, or someone is paying Newegg for the loss. There is no way they would just hoard them. Of course, it is possible they have one in stock and are just keeping it there for window dressing, I suppose.
     