AMD announces new GPGPU card, hints at RV670 specs

Discussion in 'GPGPU Technology & Programming' started by Dave Baumann, Nov 8, 2007.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Something different

    Any chance that modifying/widening the DP4 paths will provide the requisite stages?

    Jawed
     
  2. Farhan

    Newcomer

    Joined:
    May 19, 2005
    Messages:
    152
    Likes Received:
    13
    Location:
    in the shade
    Regardless of whether it's a pipelined add, you can't do that shifting thing for the MUL because that would be incorrect. The alignment for p1 and p2 is always fixed (they are not 2 completely independent FP numbers, think of them as having a shared exponent). The addition is always between the top 54 bits of p1 and the bottom 54 bits of p2, with the carry propagation having to go through all the way to the MSB of p2 (27 bits).
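    To make the fixed alignment concrete, here is a minimal Python sketch using plain integers as mantissas. The 27/26-bit split of A and the names split_mul, LO_BITS and LO_MASK are assumptions for illustration, so the overlap widths it produces need not match the 54-bit and 27-bit figures above. Exponents, normalisation and rounding are left out entirely.

```python
# Sketch only: the 27/26-bit split and all names are assumptions for illustration.
LO_BITS = 27
LO_MASK = (1 << LO_BITS) - 1


def split_mul(a, b):
    """Multiply two 53-bit mantissas via two partial products of a with b."""
    a_lo = a & LO_MASK   # low 27 bits of a
    a_hi = a >> LO_BITS  # remaining high bits of a
    p1 = a_lo * b        # low partial product
    p2 = a_hi * b        # high partial product, always weighted by 2**27
    # The shift is a constant: it never depends on the operand values.
    return p1, p2, p1 + (p2 << LO_BITS)


# Two arbitrary 53-bit mantissas with the top (implicit) bit set.
a = (1 << 52) | 0x123456789ABCD
b = (1 << 52) | 0xFEDCBA987654
p1, p2, product = split_mul(a, b)
```

    The low 27 bits of p1 pass straight through to the result, while the rest of p1 overlaps the bottom of the shifted p2, so any carry has to ripple up through the remaining high bits of p2.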
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I've diagrammed a possible set of exponents:

    Code:
    Blo         27
    Alo         27
               ---
               w55
    Bhi       53
    Alo       27
             ---
             z81
             ---
    Z+W 
             =====
             z82    partial sum 1
             =====
     
    Blo       27
    Ahi       53
             ---
             y81
    Bhi     53
    Ahi     53
           ---
           107
           ---
    X+Y
           =====
           108    partial sum 2
           =====
     
    p1       z82
    p2    +108
           =======
           109
           =======
    For the sake of clarity, both A and B have exponent 53. When split into hi and lo parts, the hi parts keep their exponent, 53, while the lo parts are normalised to exponent 27 (though it could be lower for either of them). I've then worked through the multiplications and additions, calculating the maximum value of each of the resulting exponents.

    Doing this, I think I've understood my mistake. When I said "the count of significant bits in p2 determines how many bits from p1 are used, i.e. 54-p2+27", that was wrong: it should be the difference in exponents, as there are always 54 significant bits in p2.

    ---

    My suggestion is the addition, p1+p2, is done on the final adder in the pipeline (in lanes X and Y). This adder is required to perform a DADD instruction, so in this case it is also used for p1+p2. Since DADD has to support two 53-bit operands by being a 54-bit adder, the addition of p1+p2, 27 bits + 54 bits requires no extra hardware dedicated to MUL.

    So, what I'm thinking is that a conventional single precision DP4 needs to perform a final ADD on 4 MULs. So the DP4 instruction requires a 4 operand adder. I'm wondering if this same adder can also support:
    • DADD A, B
    • DMUL p1, p2
    • DMAD p1, p2, C
    C comes from A*B+C.

    Does DP4 work like that, though?
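
    As arithmetic only, the idea can be sketched like this in Python. Integers stand in for already-aligned mantissas, the names add4, dp4, dadd, dmul and dmad are hypothetical, and the 27-bit offset for p2 is an assumption. It says nothing about whether real DP4 hardware is built this way.

```python
# Sketch only: integers stand in for already-aligned mantissas. The names
# (add4, dp4, dadd, dmul, dmad) and the 27-bit split are assumptions, and
# exponent handling, normalisation and rounding are ignored.
LO_BITS = 27
LO_MASK = (1 << LO_BITS) - 1


def add4(w, x, y, z):
    """Stand-in for the 4-operand adder that finishes a DP4."""
    return w + x + y + z


def dp4(a, b):
    """Four multiplies feeding the 4-operand add."""
    return add4(a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3])


def dadd(a, b):
    """DADD A, B: two adder inputs used, two idle."""
    return add4(a, b, 0, 0)


def dmul(p1, p2):
    """DMUL p1, p2: p2 enters at its fixed 27-bit offset."""
    return add4(p1, p2 << LO_BITS, 0, 0)


def dmad(p1, p2, c):
    """DMAD p1, p2, C: the third adder input takes C."""
    return add4(p1, p2 << LO_BITS, c, 0)


# Example operands: two 53-bit mantissas and an addend.
a = (1 << 52) | 12345
b = (1 << 52) | 67890
c = 987654321
p1 = (a & LO_MASK) * b   # low partial product
p2 = (a >> LO_BITS) * b  # high partial product
```

    With these operands, dmul(p1, p2) reproduces a * b and dmad(p1, p2, c) reproduces a * b + c, each using at most three of the four adder inputs.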

    Jawed
     
  4. itaru

    Newcomer

    Joined:
    May 27, 2007
    Messages:
    156
    Likes Received:
    15
    http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=95565&enterthread=y
    AMD Stream SDK v1.1-beta Now Available For Download

    The AMD Stream Team is pleased to announce the availability of AMD Stream SDK v1.1-beta!

    The installation files are available for immediate download from:
    FTP Download Site For AMD Stream SDK v1.1-beta

    The AMD Stream Computing website will be updated in the next few days to reflect this new release.

    With v1.1-beta comes:

    - AMD FireStream 9170 support
    - Linux support (RHEL 5.1 and SLES 10 SP1)
    - Brook+ integer support
    - Brook+ #line number support for easier .br file debugging
    - Various bug fixes and runtime enhancements
    - Preliminary Microsoft Visual Studio 2008 support


    If you have any questions, please do not hesitate to post your question to the forum.

    Sincerely,
    AMD Stream Team
     
  5. wingless

    Newcomer

    Joined:
    Aug 5, 2007
    Messages:
    79
    Likes Received:
    0
    Location:
    Houston, Texas
    Awesome. I hope we see more ATI support in GPGPU before CUDA takes over the market.
     
  6. Karoshi

    Newcomer

    Joined:
    Aug 31, 2005
    Messages:
    181
    Likes Received:
    0
    Location:
    Mars
    Wishlist:
    - Brook CUDA backend.

    A quick search around here didn't find any references to this. I think I read a post suggesting CUDA on CTM or CAL a few days ago. Brook on CUDA seems easier.
    Disclaimer: I know CUDA and AMD's Stream SDK only at the executive PDF level.

    I see advantages to a Brook port to CUDA.
     
  7. itaru

    Newcomer

    Joined:
    May 27, 2007
    Messages:
    156
    Likes Received:
    15
    http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~126593,00.html
    AMD Stream Processor First to Break 1 Teraflop Barrier

    —Next-generation AMD FireStream™ 9250 processor accelerates scientific
    and engineering calculations, efficiently delivering supercomputer performance at
    up to eight gigaflops-per-watt —

    The AMD FireStream 9250 stream processor includes a second-generation
    double-precision floating point hardware implementation delivering
    more than 200 gigaflops, building on the capabilities of the earlier
    AMD FireStream™ 9170, the industry’s first GP-GPU with double-precision floating point support.
    The AMD FireStream 9250’s compact size makes it ideal for small 1U servers
    as well as most desktop systems, workstations, and larger servers and
    it features 1GB of GDDR3 memory, enabling developers to handle large, complex problems.

    AMD is also working closely with world class application and solution providers
    to ensure customers can achieve optimum performance results.
    Stream computing application and solution providers include CAPS entreprise,
    Mercury Computer Systems, RapidMind, RogueWave and VizExperts.
    Mercury Computer Systems provides high-performance computing systems
    and software designed for complex image, sensor, and signal processing applications.
    Its algorithm team reports that it has achieved 174 GFLOPS performance for
    large 1D complex single-precision floating point FFTs on the AMD FireStream 9250.
     
  8. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,833
    Likes Received:
    481
    174 GFLOPs is incredibly fast (CUFFT did around 20 on the G80 last I looked).
     
  9. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    1 TFLOP, <150W power, 1GB GDDR3 in a single slot? This ought to be interesting. I was sort of expecting a 2-slot card with a leaf blower.
     