Native FP16 support in GPU architectures

Discussion in 'Architecture and Products' started by xpea, Oct 17, 2014.

  1. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    3,864
    Likes Received:
    363
    Location:
    35.1415,-90.056
    Your statement was that IPC had not changed since the C2D days; I have demonstrated that claim to be incorrect. You point out cache differences as if that somehow invalidates the comparison, but that architectural change (among dozens of others) is what allowed the increase.

    I do not understand why you pointed it out, as strapping that L1 cache structure onto a C2D isn't going to magically net a 30% increase in IPC, and you know it.
     
  2. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,445
    Likes Received:
    181
    Location:
    Chania
    As I said, a single FP64 unit at 1GHz on 28nm is roughly 0.025mm² for synthesis alone, a figure I grabbed from one of Dally's past presentations long before Kepler appeared.

    :lol:
     
  3. homerdog

    homerdog donator of the year
    Legend Veteran Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,270
    Likes Received:
    1,038
    Location:
    still camping with a mauler
    Haswell is basically just Conroe with moar cachez and 22nm. Xmas told me so you know it's true.


    :grin:
     
  4. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    3,864
    Likes Received:
    363
    Location:
    35.1415,-90.056
    My intention was not to be inflammatory, but rather to clear up a misunderstanding that seems to pervade a lot of minds in the x86 world.

    I do infrastructure architecture for a living now; one of my many job duties is defining standards around hardware configuration for our x86 platforms. It very much surprises me when a "server admin", who had been the subject matter expert for spec'ing servers for decades before I showed up, makes statements not dissimilar to the above: "Oh, clock speeds have been the same forever, the only reason new CPUs are faster nowadays is because there are more cores."

    Not true at all. A lot of these same folks seem clueless to the implications of NUMA domains and their effect on code not specifically built to deal with them. The server admins here had been buying dual-socket servers for everything and then blaming terrible performance on the lower clockspeeds of the decacore CPUs they purchased. It boggles their minds when I suggest removing one of the CPUs and benchmarking again, especially when I finally forced the issue and the product performance increased some ~25%.

    Dual-socket decacore servers, running 32GB of RAM, running nothing more than a fat stack of very simplistic (and non-computationally bound) JVMs. What idiot doesn't profile the use case?
     
    #64 Albuquerque, Oct 21, 2014
    Last edited by a moderator: Oct 21, 2014
  5. homerdog

    homerdog donator of the year
    Legend Veteran Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,270
    Likes Received:
    1,038
    Location:
    still camping with a mauler
    I would say most "server admins". I regularly work with self-proclaimed experts (who, naturally, are paid more than me) who don't know the difference between an HDD and an SSD.
     
  6. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,579
    Likes Received:
    660
    Location:
    WI, USA
    lol yeah, I actually use an FX 5900 Ultra for some old games. I have only barely used a PS3.
     
  7. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    3,864
    Likes Received:
    363
    Location:
    35.1415,-90.056
    Ah yes, but all in the name of standards: "We only buy this configuration." Yeah, a $12,000 configuration that doesn't meet the needs of 80% of our deployed application base.

    Anyway, I need to stop hijacking this thread. IPC on modern processors has indeed increased significantly compared to processors from five years ago.
     
  8. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,445
    Likes Received:
    181
    Location:
    Chania
    Ewwww, that was the successor of the dustblower :shock: Jensen claimed back then that the G7x in the PS3 (I think Sony called it the "Reality Synthesizer") was twice as fast as a GeForce 6800. [/end of OT]
     
  9. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    I don't know how their hardware works, but I assume that they are only storing the (post-projection) position data to memory. According to my experiments in GPU-based software rasterization, I would definitely split the vertex shader (and vertex buffer) into two parts. There isn't usually that much ALU work and fetched data shared by the position calculations and the other calculations (tangent frame transform + decompress + tangent fetch + UV fetch). This way you save 75%-80% of the storage cost (for non-skinned stuff). I would personally execute all the parts of the vertex shader that produce varyings in the second stage (output them directly to LDS or a similar on-chip memory pool), so there's not much gain in using smaller-format varyings.

    Obviously with random shaders (not under your control) it might be hard to ensure that the splitting is always a win (or works 100% perfectly), and it might be hard to split the vertex data efficiently (the driver might do this, but transform feedback and other dynamic modifications still need to work).
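
    The storage claim above can be sanity-checked with a little arithmetic. The attribute sizes below are illustrative assumptions for a typical non-skinned vertex, not numbers from the post: a float4 clip-space position written by the first stage, versus the full set of varyings produced by the second.

```python
# Sketch of the split-vertex-shader storage argument. Sizes are assumed,
# not taken from the post above.

POSITION_BYTES = 16  # float4 clip-space position, the only first-stage output

# Varyings produced by the second stage (which could stay on-chip, e.g. in LDS):
VARYING_BYTES = (
    16    # tangent frame as a quaternion (float4)
    + 8   # texture coordinates (2 x float)
    + 12  # world-space position (3 x float)
    + 12  # world-space normal (3 x float)
)

full_vertex = POSITION_BYTES + VARYING_BYTES  # bytes if everything hit memory
saving = 1.0 - POSITION_BYTES / full_vertex
print(f"storage saved by writing only positions: {saving:.0%}")  # -> 75%
```

    With these assumed sizes, writing only positions to memory saves 75% of the per-vertex storage, the low end of the 75%-80% range quoted above; fatter varying sets push the saving higher.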
     
    #69 sebbbi, Oct 21, 2014
    Last edited by a moderator: Oct 21, 2014
  10. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,579
    Likes Received:
    660
    Location:
    WI, USA
    My 5900U is dead silent. It has an Arctic Accelero S2 on it. It does have some inductor noise but it's just enough for you to know it's there.

    GeForce FX has some advantages for old games, like hardware palettized textures and table fog. The 45.23 drivers (the oldest a 5900 can run) have solid compatibility with old stuff, which makes it better than ATI and also GF6+. FX cards are great as long as you aren't trying to play D3D9 games. Think of them as beefed-up GF4 Ti cards.
     
  11. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,314
    Likes Received:
    140
    Location:
    On the path to wisdom
    Rogue executes the vertex shader only once. Post-transform geometry is written to memory using lossless compression.

    There are various reasons for and against a two-stage approach, and this is an area where mobile GPU vendors have experimented a lot.
     
  12. MrMilli

    Newcomer

    Joined:
    Apr 6, 2008
    Messages:
    10
    Likes Received:
    0
    As I suspected. In GFXBench 3.0 Manhattan Offscreen: 32.5 fps vs 31 fps. A small victory for PowerVR. Need to see more benchmarks though.
     
  13. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    A small victory despite the 20/28 nm implementation difference, at unknown power consumption and cost difference.

    Hard to say who it's a small victory for other than Apple - it doesn't tell us much about the underlying architecture. It would be equally wrong to say GTX980 is a big victory for NVIDIA over PowerVR.
     
  14. MrMilli

    Newcomer

    Joined:
    Apr 6, 2008
    Messages:
    10
    Likes Received:
    0
    True, but the A8X is a 3 billion transistor chip. I'm pretty sure they can't push clocks that high, even on 20nm. Remember, 20nm isn't that magical. Apparently the K1 GPU is clocked around 850 MHz.
     
  15. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    We'll have to see what Nvidia's 20nm Maxwell SoC can do, then. Given Maxwell on 28nm, I'm thinking Maxwell on 20nm might be kind of magical.
     
  16. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,445
    Likes Received:
    181
    Location:
    Chania
    It'll lead you to the same dead end, since NV and Apple have different hardware refresh cycles. Erista will most likely be ahead of the A8X, just as the A9 will likely end up ahead of Erista.

    Maxwell is "magical" already in desktop GPUs, even on "just" 28nm. The whole thing is way OT anyway, but aren't we overrating the benefits of the underlying manufacturing process a bit?
     
  17. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    Sorry to divert the thread; I've been away for a long time. One of the things that concerns me in these mobile benchmarking reviews is that no one seems to be doing direct image quality analysis anymore.

    In the heyday of desktop GPUs evolving quickly, vendors would get caught with their pants down using all kinds of approximations, deliberately cheating, having poor anisotropic filtering, or otherwise disobeying what the benchmark requested of the drivers.

    How do we know this isn't occurring again? I don't think we should trust GFXBench or 3DMark, Apple or Nvidia; reviewers should be digging deep into the outputs of these chips. Who knows, maybe there's another "Quack.exe" lurking. Unfortunately, these mobile platforms seem very locked down: it's hard to get source and hard to probe the sandboxes to make the kinds of hacks that exposed cheating in the past (although it's easier to expose cheats on Android).
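
    The kind of analysis being asked for can at least be started with a crude automated diff: capture the same benchmark frame from a reference renderer and from the device under test, then compare per pixel. A minimal sketch, assuming the framebuffers are available as flat lists of 8-bit (r, g, b) tuples; how captures are obtained on each platform is left open, which is exactly the locked-down part lamented above.

```python
def frame_diff(ref, test):
    """Per-pixel comparison of two framebuffers given as flat lists of
    (r, g, b) tuples in 0-255. Returns the maximum per-channel error and
    the fraction of pixels that differ at all. A large max error, or a
    spatially clustered set of differing pixels, hints that filtering or
    precision was quietly cut on one side."""
    assert len(ref) == len(test), "frames must have the same pixel count"
    max_err = 0
    differing = 0
    for (r1, g1, b1), (r2, g2, b2) in zip(ref, test):
        err = max(abs(r1 - r2), abs(g1 - g2), abs(b1 - b2))
        if err:
            differing += 1
            max_err = max(max_err, err)
    return max_err, differing / len(ref)

# Toy frames: the "device" output darkens one pixel, as a lossy shortcut might.
reference = [(200, 200, 200)] * 4
device = [(200, 200, 200)] * 3 + [(180, 200, 200)]
print(frame_diff(reference, device))  # -> (20, 0.25)
```

    A real harness would of course work on full-resolution captures and add perceptual metrics, but even this level of diffing is what caught the desktop-era cheats.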
     
  18. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,445
    Likes Received:
    181
    Location:
    Chania
    Yes, it should definitely be a concern, but I'm also sure that all involved IHVs are wide awake, watching what each competitor is doing. Oh, and I definitely missed you; it's good to see you again.
     
  19. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland

    Sadly, this is not the same market. It's already really hard to find reliable benchmarks for mobile, let alone people who understand how the benchmarks work when they use them for reviews.
     
  20. Dominik D

    Regular

    Joined:
    Mar 23, 2007
    Messages:
    782
    Likes Received:
    22
    Location:
    Wroclaw, Poland
    We don't. That's why we need a proper OGL cert suite. What Khronos provides is a joke compared to what you have to go through for DirectX validation. I hope there will be some sort of transparency into waivers for different cores too; certain vendors cut corners wherever they can (or can't).
     