Article: Tesla 10-Series Analysis [Part 1]

Discussion in 'GPGPU Technology & Programming' started by B3D News, Jun 26, 2008.

  1. B3D News

    B3D News Beyond3D News
    Regular

    Joined:
    May 18, 2007
    Messages:
    440
    Likes Received:
    1
    Another year, another Tesla. So, what's new? What does performance look like when only the shader core matters? And does the FP64 implementation make any sense? We touch on this and much more in the first part of our Tesla coverage... [more on Tesla & RV770 within the next few days!]

    Read the linked item
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,946
    Likes Received:
    2,370
    Location:
    Well within 3d
    I think, failing a significant change in RV770, that the expanded DP IEEE compliance does give GT200 a pretty good selling point for Tesla.

    I'm curious about what makes SSE2 denormal handling so expensive compared to Nvidia's DP unit.
    Does GT200 have exception handling? The fact that it doesn't support flags might be a related issue to that.
    Division and square root are handled differently, in software. Are the results from that method comparable to the SSE2 version?
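    As a rough sketch of what a software double-precision divide typically looks like (an illustration only, not NVIDIA's actual library sequence; the helper name div_nr is made up), the quotient can be built from a low-precision reciprocal seed refined by Newton-Raphson steps plus an FMA-based correction:

    ```cuda
    // Illustrative software double-precision division: reciprocal seed refined by
    // Newton-Raphson, then an FMA-corrected quotient. Not NVIDIA's actual code;
    // special cases (zero, infinity, denormal b) and correct rounding of the very
    // last bit are deliberately ignored here.
    #include <cmath>
    #include <cstdio>

    __host__ __device__ double div_nr(double a, double b)
    {
        double r = (double)(1.0f / (float)b);  // ~24-bit reciprocal seed
        double e = fma(-b, r, 1.0);            // e = 1 - b*r (exact thanks to FMA)
        r = fma(r, e, r);                      // ~48 correct bits
        e = fma(-b, r, 1.0);
        r = fma(r, e, r);                      // ~53 correct bits
        double q = a * r;                      // initial quotient
        double rem = fma(-b, q, a);            // exact remainder a - b*q
        return fma(r, rem, q);                 // corrected quotient
    }

    int main()
    {
        printf("1/3 = %.17g\n", div_nr(1.0, 3.0));
        return 0;
    }
    ```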
     
  3. INKster

    Veteran

    Joined:
    Apr 30, 2006
    Messages:
    2,110
    Likes Received:
    30
    Location:
    Io, lava pit number 12
    So, in essence, they're aiming at benefiting from Intel and AMD's vector software, but it'll also bring some goodies to "Larrabee", right?
    Are we far from full x86 ISA emulation on a GPU, or is it technically (legally would be another matter, I know) reachable with near-future GPUs/software tools?
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,946
    Likes Received:
    2,370
    Location:
    Well within 3d
    Approaching full IEEE compliance is a goal in itself, as various niches care about certain portions of the specification.
    If the GPU is getting closer to SSE emulation, it's because SSE tries to comply with the same standard.

    The big weakness in emulating x86 in hardware isn't the numerical behavior, but the flag support and the general weakness in exception handling.
    For FP vector code, exceptions are often kept imprecise for performance reasons, so it's not so bad there, but the rest of x86 would simply be considered broken with GPUs as they are currently.
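    For concreteness, this is the kind of sticky status-flag machinery being referred to, shown host-side with standard C++ fenv (a minimal sketch; a real build may need fast-math disabled for the flags to be reliable):

    ```cuda
    // Host-side sketch of IEEE sticky exception flags, which SSE2/x87 expose and
    // which GT200-class GPUs do not: after a stretch of computation you can ask
    // whether underflow, overflow, inexact, etc. happened anywhere along the way.
    #include <cfenv>
    #include <cstdio>

    int main()
    {
        std::feclearexcept(FE_ALL_EXCEPT);

        volatile double tiny = 1e-310;       // already in the denormal range
        volatile double x = tiny / 3.0;      // result is a smaller denormal
        (void)x;

        if (std::fetestexcept(FE_UNDERFLOW)) std::printf("underflow flag set\n");
        if (std::fetestexcept(FE_INEXACT))   std::printf("inexact flag set\n");
        return 0;
    }
    ```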
     
  5. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,454
    Likes Received:
    343
    Effectively having extra bits of exponent for small numbers is the only relevant part ... it's nice, but if you are using it your code is already working on the fringes of the available precision, not a good place to be on GPUs. Exceptions, or at least flags, are much more important ... at the moment GPUs just kill your precision silently.
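    A tiny host-side illustration of that silent failure mode (the flush-to-zero outcome is described in the comments; the values are arbitrary):

    ```cuda
    // On IEEE hardware with gradual underflow, the difference of these two distinct
    // values is a denormal and survives; on flush-to-zero hardware (G80/GT200
    // single precision), it becomes exactly 0.0f and nothing tells you it happened.
    #include <cstdio>

    int main()
    {
        volatile float a = 1.6e-38f;   // both just above the smallest normal float
        volatile float b = 1.2e-38f;
        float d = a - b;               // 4e-39: representable only as a denormal
        // Gradual underflow: d != 0 and 1/d is finite.
        // Flush-to-zero:     d == 0 even though a != b, and 1/d is infinity.
        printf("d = %g, 1/d = %g\n", d, 1.0f / d);
        return 0;
    }
    ```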
     
  6. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,003
    Likes Received:
    235
    Location:
    UK
    Yeah, I definitely think it'll help a lot in some niches and not at all in others. As for performance, obviously once they switch to 2DP units/SM for the highest-end chip that won't be too much of a concern. Given how long software development can take and how big GT200's die already is, it probably wasn't too bad of an idea to go for 1DP/SM this gen (and 0DP/SM for GT206/iGT206/iGT209).

    It doesn't; so the point is that SSE2 handles it in software through an 'exception', while NVIDIA doesn't have that option so they 'just' implemented fully pipelined denormal handling in their DP unit.
    You mean vector hardware, right? And sure, that means CUDA will ironically likely also be a very viable option for Larrabee instead of Ct or whatever... although Larrabee's performance cannot be extracted fully with SSE, so they'd need to implement a new path specifically for it.
    Yeah, pretty far, plus there's really no point emulating a MIMD processor with a SIMD machine! :)
    You're talking about denormals, right? Can you elaborate a bit? (and I agree exceptions are important; NV conveniently claims they aren't, and aren't really used outside of legacy code - duh, what would you expect them to say?)
     
  7. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Arun, as per usual, great work!

    "There’s now a memory coalescing cache in each memory controller partition."

    I figured that was going to happen. Is this write combine only or read combine as well?

    I wonder if this means they will soon finally be adding support for the PTX .surf surface cache?

    A CUDA compiler for x86 is also an awesome idea. Very smart move. What's next, CUDA compiled to CTM? ;)
     
  8. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,454
    Likes Received:
    343
    The exception isn't so you can perform denorm handling, the exception is there to know you are getting them in the first place.
     
  9. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,003
    Likes Received:
    235
    Location:
    UK
    Cheers! :)
    I wonder if this means they will soon finally be adding support for the PTX .surf surface cache?

    The big question there is: who makes the software investment, and what are the benefits? Clearly marginalizing OpenCL would be beneficial to NVIDIA, but giving AMD CUDA support could hurt them in HPC. I would actually make the argument (and in fact, I do briefly in Part 3) that it wouldn't be a good idea for anyone involved to go down that route yet.

    On the other hand, it would be immensely beneficial to NVIDIA and AMD to support Larrabee in OpenCL (since the fundamental paradigm is more aligned to their GPUs than Intel's AFAICT) and, therefore, it would also be beneficial to make CUDA support both R8xx and Larrabee in order to further reduce the influence of whatever API Intel might come up with. Overall, CUDA becoming the main GPGPU HPC API and OpenCL/DirectX Compute becoming the consumer GPGPU API would likely be a very good outcome for both NVIDIA and AMD (imo).
     
  10. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,003
    Likes Received:
    235
    Location:
    UK
    Oh yes, absolutely, I think I didn't make myself clear. I should have used the word 'interrupt', or perhaps another one I can't think of at the moment. Definitely, when you think of denormal exceptions, you tend to think of what's visible to the programmer and lets him know the program is encountering one.

    However, what I was referring to is what happens even when the denormal exception is disabled/ignored: SSE2 takes an incredibly long time to handle denormals because, AFAIK, it sends an interrupt (or whatever the right word is) to properly process it... in software/microcode. This clearly requires an exception-like hardware mechanism and a MIMD processing path, and GT200 has neither - so their only real option was to support it at full-speed.

    In addition to that, as you point out, it is sometimes highly beneficial to be able to also signal an exception to let the program know a denormal is being handled. So with SSE2, the CPU would then generate two 'exceptions': one to handle the denormal, and one to let the program know about it.
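    A minimal host-side sketch of that cost, and of the FTZ/DAZ escape hatch (loop count and values are arbitrary; the exact slowdown depends on the CPU):

    ```cuda
    // Sketch of the SSE denormal-assist cost: the same loop is timed once with
    // denormal operands handled by the microcode assist, and once with the FTZ/DAZ
    // bits set in MXCSR, which trades the slowdown for GPU-style flush-to-zero.
    #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
    #include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE
    #include <cstdio>
    #include <ctime>

    static double time_loop(float seed)
    {
        volatile float x = seed;
        std::clock_t t0 = std::clock();
        for (int i = 0; i < 10000000; ++i)
            x = x * 0.5f + seed;             // stays denormal while seed is denormal
        return double(std::clock() - t0) / CLOCKS_PER_SEC;
    }

    int main()
    {
        float denormal = 1e-39f;             // below the single-precision normal range
        printf("assists enabled:  %.3f s\n", time_loop(denormal));

        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
        printf("FTZ/DAZ enabled:  %.3f s\n", time_loop(denormal));
        return 0;
    }
    ```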
     
  11. INKster

    Veteran

    Joined:
    Apr 30, 2006
    Messages:
    2,110
    Likes Received:
    30
    Location:
    Io, lava pit number 12
    Or for POWER/PowerPC... ;)
    If it would help in things like "Roadrunner" (which is already a PPC/x86 hybrid), why not? It's a high-margin market, just like the ones Nvidia likes to go after.
    Plus, it could also trickle down to the consumer/"low end" in the future on a Playstation 4, for instance.
     
  12. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,454
    Likes Received:
    343
    Hey Arun, why not some comparative benchmarks? Try and finagle the ACML library from AMD and do some of the standard stuff (i.e. solving a system of linear equations, dense matrix multiplication, FFT, etc.) in single precision and double precision on the HD 4870 and the GTX 280?
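    On the NVIDIA side, the GEMM half of that comparison is only a few lines of legacy CUBLAS; here is a rough sketch (the matrix size, zero-filled inputs and single-run timing are arbitrary choices, and error checking is omitted):

    ```cuda
    // Rough SGEMM vs DGEMM throughput sketch using the legacy CUBLAS API that
    // shipped with the CUDA toolkit of the GT200 era. Build with: nvcc -lcublas
    #include <cublas.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    // Time one call and return sustained GFLOPS (a GEMM is 2*n^3 flops).
    template <typename F>
    static double gflops(int n, F gemm)
    {
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0, 0);
        gemm();
        cudaEventRecord(t1, 0);
        cudaEventSynchronize(t1);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        return 2.0 * n * n * n / (ms * 1e6);
    }

    int main()
    {
        const int n = 2048;                              // arbitrary problem size
        cublasInit();

        float  *sA, *sB, *sC;
        double *dA, *dB, *dC;
        cudaMalloc((void**)&sA, n * n * sizeof(float));
        cudaMalloc((void**)&sB, n * n * sizeof(float));
        cudaMalloc((void**)&sC, n * n * sizeof(float));
        cudaMalloc((void**)&dA, n * n * sizeof(double));
        cudaMalloc((void**)&dB, n * n * sizeof(double));
        cudaMalloc((void**)&dC, n * n * sizeof(double));
        cudaMemset(sA, 0, n * n * sizeof(float));        // contents don't affect timing
        cudaMemset(sB, 0, n * n * sizeof(float));
        cudaMemset(dA, 0, n * n * sizeof(double));
        cudaMemset(dB, 0, n * n * sizeof(double));

        printf("SGEMM: %.1f GFLOPS\n", gflops(n, [&] {
            cublasSgemm('N', 'N', n, n, n, 1.0f, sA, n, sB, n, 0.0f, sC, n); }));
        printf("DGEMM: %.1f GFLOPS\n", gflops(n, [&] {
            cublasDgemm('N', 'N', n, n, n, 1.0,  dA, n, dB, n, 0.0,  dC, n); }));

        cublasShutdown();
        return 0;
    }
    ```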
     
  13. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Yeah, a PPC/VMX output path for CUDA would be awesome. And why wait for PS4? How about CUDA on the 360's VMX and the PS3's PPU/SPUs? You need large software-extended vectors (i.e. beyond 4-way SIMD) on the 360 anyway to hide its long instruction latencies, so it seems like a really good match for CUDA. From what I have seen, developers still have a hard time with portable vector programming on the consoles. If it is in any way possible to warm console devs up to CUDA now, it might be a great thing for NV in the next generation.
     
  14. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Wow, both write and read combine. It looks like, as long as all accesses fall within the same device granularity block (32/64/128 bytes), it can swizzle the access in any way to/from the vector.
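    A sketch of the two patterns on GT200-class hardware (the kernel names are made up, and the half-warp coalescing rules from the programming guide are simplified here):

    ```cuda
    // Coalescing illustration: a half-warp whose 16 threads stay inside one aligned
    // 64-byte segment gets a single memory transaction even when the order within
    // the segment is permuted, while a 16-float stride puts every thread in a
    // different segment and costs one transaction per thread.
    #include <cuda_runtime.h>

    __global__ void permuted_within_segment(const float* in, float* out)
    {
        int lane = threadIdx.x & 15;                       // position in the half-warp
        int base = (blockIdx.x * blockDim.x + threadIdx.x) & ~15;
        out[base + lane] = in[base + (15 - lane)];         // reversed order, same segment
    }

    __global__ void strided_across_segments(const float* in, float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * 16];                               // 64 bytes apart: no coalescing
    }

    int main()
    {
        const int n = 1 << 20;
        float *in, *out;
        cudaMalloc((void**)&in,  n * 16 * sizeof(float));  // room for the strided reads
        cudaMalloc((void**)&out, n * sizeof(float));
        permuted_within_segment<<<n / 256, 256>>>(in, out);
        strided_across_segments<<<n / 256, 256>>>(in, out);
        cudaDeviceSynchronize();
        return 0;
    }
    ```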
     
  15. Tim Murray

    Tim Murray the Windom Earle of mobile SOCs
    Veteran

    Joined:
    May 25, 2003
    Messages:
    3,278
    Likes Received:
    66
    Location:
    Mountain View, CA
    Nice article, Arun :D

    Great to have finally met you, too...
     
  16. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,003
    Likes Received:
    235
    Location:
    UK
    Benchmarking just the standard stuff from libraries might be a good idea, yeah - would help if I had managed to steal a card at Editors' Day though! Ah well... Anyway if there ever is an opportunity to do that kind of thing (plus maybe looking into the performance characteristics of RV770's Shared Memory), I'll definitely grab it.
    Well, I'd very much worry about resource allocation issues there... However, one very appealing advantage to that strategy is that Xbox360/PS3 games being ported to the PC would benefit from CUDA acceleration, which is pretty damn cool. As for PS4, whether that makes sense depends a lot on what their system architecture will look like and how fast their CPU will be relative to the GPU...
    Indeed! :D Was very glad to meet you too (could have dodged the 'how did you know each other' question a bit better though, hehe)
     
  17. ninelven

    Veteran

    Joined:
    Dec 27, 2002
    Messages:
    1,692
    Likes Received:
    107
    I don't really understand including a discrete FP64 unit if they have no intention of exposing/utilizing it for the graphics consumer (PhysX maybe?). The idea of including even more FP64 units (which will never be used) per SM in future products seems ludicrous. Surely it would be more efficient to design a separate, all-FP64 chip specifically for Tesla applications.
     
  18. INKster

    Veteran

    Joined:
    Apr 30, 2006
    Messages:
    2,110
    Likes Received:
    30
    Location:
    Io, lava pit number 12
    Not all GPGPU applications need FP64 precision, so including plenty of FP32 support is certainly more effective (and definitely faster per die area) than building an all-new FP64-only chip just for GPGPU.
    The main target for the chip is the graphics card buyer, so all that FP64 processing capability would be wasted (not to mention "expensive", considering the number of transistors sitting unused per manufactured core).
    At heart, it's still a realtime 3D raster graphics rendering-oriented design, and I don't see that changing anytime soon.
     
  19. ninelven

    Veteran

    Joined:
    Dec 27, 2002
    Messages:
    1,692
    Likes Received:
    107
    So use the 32bit "graphics" core in that case. It should be more cost/performance effective since it isn't wasting transistors on FP64.

    I said FP64 specifically for Tesla... Tesla cards don't even have a video out. The only people buying the FP64 core would be those who need it /duh.
     
  20. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,718
    Likes Received:
    91
    Location:
    Taiwan
    They can't afford to design a new core specifically for Tesla. That's too expensive.

    The current designs of new GPUs seem to encourage mixed-precision usage. In many algorithms, you can use single precision to find an approximate result, then use double precision to quickly refine it into a more accurate one (which, in some cases, isn't actually necessary at all). In this usage pattern, it's reasonable to have a relatively small number of double precision units.
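    The pattern described here is classic iterative refinement; below is a minimal host-side sketch with a toy 3x3 system (the matrix, iteration count and the helper name solve_sp are arbitrary; on a GPU, the single-precision solve is the part you would offload):

    ```cuda
    // Mixed-precision iterative refinement sketch: solve approximately in single
    // precision, then repeatedly compute the residual in double precision, solve
    // for a correction in single precision again, and update the solution.
    #include <cmath>
    #include <cstdio>

    const int N = 3;

    // Naive single-precision Gaussian elimination (no pivoting), solving M*x = r.
    static void solve_sp(const double M[N][N], const double r[N], double x[N])
    {
        float a[N][N + 1];
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; ++j) a[i][j] = (float)M[i][j];
            a[i][N] = (float)r[i];
        }
        for (int k = 0; k < N; ++k)
            for (int i = k + 1; i < N; ++i) {
                float f = a[i][k] / a[k][k];
                for (int j = k; j <= N; ++j) a[i][j] -= f * a[k][j];
            }
        for (int i = N - 1; i >= 0; --i) {
            float s = a[i][N];
            for (int j = i + 1; j < N; ++j) s -= a[i][j] * (float)x[j];
            x[i] = s / a[i][i];
        }
    }

    int main()
    {
        double A[N][N] = {{4, 1, 0}, {1, 3, 1}, {0, 1, 2}};
        double b[N] = {1, 2, 3};
        double x[N], r[N], dx[N];

        solve_sp(A, b, x);                                 // cheap approximate solve
        for (int it = 0; it < 3; ++it) {
            double rmax = 0.0;
            for (int i = 0; i < N; ++i) {                  // residual in double precision
                r[i] = b[i];
                for (int j = 0; j < N; ++j) r[i] -= A[i][j] * x[j];
                rmax = std::fmax(rmax, std::fabs(r[i]));
            }
            printf("iteration %d: max residual %.3e\n", it, rmax);
            solve_sp(A, r, dx);                            // correction, single precision
            for (int i = 0; i < N; ++i) x[i] += dx[i];     // accumulate in double
        }
        printf("x = %.15f %.15f %.15f\n", x[0], x[1], x[2]);
        return 0;
    }
    ```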

    Another reason to have double precision is to appease the "it's not double precision so I'm not interested" crowd. Many people are simply too accustomed to the idea that scientific computing "must" use double precision, so they'll dismiss anything that can't do double precision at all.
     
