Larrabee vs Cell vs GPUs? *read the first post*

Discussion in 'GPGPU Technology & Programming' started by rpg.314, Apr 17, 2009.

  1. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    CUDA 2.1 PTX ISA 1.3 section 10.1.5 has the following:

    - bra : Indirect branch via register is not implemented.
    - call : Indirect call via register is not implemented.

    So branches and calls are currently supported only to an immediate address (defined at compile time).
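
    As an illustration (a hypothetical sketch of mine, not from the PTX doc), this is the kind of code that restriction rules out: a call through a device function pointer lowers to an indirect call at the PTX level, so it needs sm_2x / PTX 2.x, while for sm_1x targets the compiler has to resolve every branch and call target statically.

        // Hedged sketch: device-side function-pointer dispatch. Builds with
        // nvcc for sm_20 and later; the call-via-register it lowers to is
        // exactly what PTX ISA 1.x leaves unimplemented.
        __device__ float addOne(float x) { return x + 1.0f; }
        __device__ float twice (float x) { return x * 2.0f; }

        typedef float (*UnaryOp)(float);

        __global__ void apply(float* data, int n, int which)
        {
            // Taking a __device__ function's address is only legal in device
            // code, and the call below goes through a register.
            UnaryOp ops[2] = { addOne, twice };
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                data[i] = ops[which](data[i]);  // indirect call
        }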
     
  2. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    If you keep everything in cache, i.e. pre-bin every draw command intersecting the tile before the resolve, then sure, I could see MSAA working. One thing I'm wondering about here is how the texture unit on LRB handles state changes (texture changes). If you have all these tiles running in parallel, each with perhaps up to 1K to 4K state changes globally (and a smaller subset per tile), how does the texture unit manage tiles needing divergent subsets of textures?

    What I'm getting at here is whether there are conditions on LRB where tiles need to go to and from global memory pre-resolve.

    There are all sorts of good uses of insane Z or stencil fill. The most obvious cases are a pre-Z drawing pass, shadowmaps, and stencil frustums for shadowing or lighting optimizations. In the frustum drawing case I'd guess you'd be hitting the max fill limit. Sometimes on pre-Z or shadowmap drawing other constraints get in the way, such as writing out normals in pre-Z (you lose the double-Z rate for light pre-pass rendering, for example), or small triangles or a small triangle count per batch leaving the draws not limited by fill.

    Aren't vector min and max LRB instructions? HiZ is likely easy to do with a parallel min/max reduction.
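
    As a rough sketch of that idea (written as CUDA here for lack of a public LRB toolchain; the 8x8 tile size and max-Z convention are my assumptions), one small thread block per tile reduces the tile's depth samples to a single HiZ value:

        // Hedged sketch: build a per-tile max-Z (HiZ) buffer with a parallel
        // reduction. One 64-thread block handles one 8x8 tile; swap fmaxf
        // for fminf depending on the depth convention. Launch with a grid of
        // (width / 8, height / 8) blocks of 64 threads.
        __global__ void buildHiZ(const float* zbuf, float* hiz, int width)
        {
            __shared__ float s[64];
            int tx = threadIdx.x & 7, ty = threadIdx.x >> 3;
            int x = blockIdx.x * 8 + tx;
            int y = blockIdx.y * 8 + ty;
            s[threadIdx.x] = zbuf[y * width + x];
            __syncthreads();
            // Tree reduction: 64 values -> 1 in log2(64) = 6 steps.
            for (int stride = 32; stride > 0; stride >>= 1) {
                if (threadIdx.x < stride)
                    s[threadIdx.x] = fmaxf(s[threadIdx.x], s[threadIdx.x + stride]);
                __syncthreads();
            }
            if (threadIdx.x == 0)
                hiz[blockIdx.y * gridDim.x + blockIdx.x] = s[0];
        }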
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    The only useful thing I can say here is that texturing in Larrabee seems to attract the most handwaving. With the supposed strictness of D3D11 it'll be interesting to see how Intel gains WHQL...

    But 3x seems excessive; that's the only point I'm making. Perhaps the argument is that 3x Z-rate only costs NVidia an extra 15% die space (whatever, I don't know what it is) and produces a 15% performance gain (again, for argument's sake, I don't know what it is). 3x might not be a bar against the fixed-function ROP, but it would be a bar against the software version - so is it really needed?

    Game benchmarks answer with a resolute "no".

    Larrabee, in 2010, might be a little "early" as a full software pipeline. But what is the definition of the correct time? When NVidia implements it? Or when Intel's price/performance is better? etc.

    Yeah, REDUCE_MIN/MAX.

    Jawed
     
  4. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Not sure you can draw that conclusion based on the data we have. Who's to say extremely fast Z isn't compensating for deficiencies elsewhere? Sort of giving an ALU-light architecture a head start.
     
  5. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,490
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    Phony measurement -- on my 4890 board all numbers are correct (53 GPix out of 59 theoretical).
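    (For anyone keeping score, theoretical fill is just unit count x samples each unit writes per clock x core clock, so a measured 53 GPix against a theoretical 59 means the test is within roughly 10% of what the hardware can physically do.)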
    Does GT200 have support for hierarchical depth/stencil buffering?
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    It's possible. In the context of replacing ROPs with FLOPs it could indicate that the tipping point is further away, defined by the ratio of FLOPs/mm² : ROPs/mm² (or Z/mm²).

    Jawed
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    That was my point: ArchMark is prolly not a useful test.

    Not that I'm aware of.

    Jawed
     
  8. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,490
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    ArchMark is good by its nature, but it has to rely on a wacky OGL driver implementation. It definitely has less overhead than the GZEasy test suite, and this is important for synthetic evaluations.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Is that why it doesn't entirely work on ATI?

    Jawed
     
  10. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,490
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    Yup -- some surface formats are gone for good in there, methinks.
    There's one old "GPGPU" DIP benchmark which is also broken.
     
  11. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    It's not exactly the same between vendors, but to my knowledge both ATI and NV have some coarse representation of both depth and stencil used for acceleration.
     
  12. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Can somebody explain to me what hierarchical Z-cull is? :roll:

    Also, I have never understood how a large fill rate can compensate for anything. How is it (fill rate) calculated anyway? And why should anyone bother about it?
     
  13. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,490
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    Well, I don't know for sure, but NV's solution is probably a "single-level" coarse representation only; as we know, in ATi's implementation the buffer can scale from coarse to more fine-grained regions in case of a miss.
    Hyper-Z III - scroll down the page for more info on the subject.
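
    For the curious, the coarse test such a hierarchy enables looks something like this (a sketch under my own assumptions - a less-than depth test and one max-Z value per tile - not ATi's actual implementation):

        // Illustrative only: coarse hierarchical-Z rejection.
        // tileMaxZ = farthest depth currently stored anywhere in the tile.
        // triMinZ  = nearest depth the incoming triangle could produce there.
        __device__ bool tileMayPass(float tileMaxZ, float triMinZ)
        {
            // If even the triangle's nearest point lies behind everything in
            // the tile, all of its pixels would fail the depth test, so the
            // whole tile is rejected without reading per-pixel Z.
            return triMinZ < tileMaxZ;
        }
        // On an ambiguous result, a coarse-to-fine scheme (as described above
        // for ATi) drops to a finer level before testing per pixel.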
     
    #93 fellix, Apr 22, 2009
    Last edited by a moderator: Apr 22, 2009
  14. steampoweredgod

    Banned

    Joined:
    Nov 30, 2011
    Messages:
    164
    Likes Received:
    0
    Sorry for bumping this thread, but I thought it was better than creating a new one as the topic relates to it. In physics-related tasks, how are these three speculated to compare: DX11 GPUs, Larrabee (or the new Intel Knights CPU), and Cell? Earlier in the thread it was mentioned that memory latency is orders of magnitude higher on GPUs; how does this affect physics calculations? Considering the next-gen Knights CPU and a hypothetical Cell are likely around ~1 TFLOPS while DX11 GPUs are several times that, is that reflected in physics performance, or do things like latency do away with it?
     
  15. Ninjaprime

    Regular

    Joined:
    Jun 8, 2008
    Messages:
    337
    Likes Received:
    1
    I don't think that's accurate. I'm guessing the ~1 TFLOPS you're talking about with Knights Corner is from that recent claim of over 1 TFLOPS in DGEMM by Intel. DGEMM is double precision. That means it's much, much faster than 1 TFLOPS in GPU SP flops terms. I believe the target clock for KC is 1.6 GHz with ~50 cores, which adds up to about 2.5 TFLOPS SP, which only AMD's top GPU could beat in raw numbers - and it probably couldn't even begin to approach KC in real-world tests. For example, a Tesla C2070 (Fermi) I think hits around ~330 GFLOPS in DGEMM while its theoretical peak would be 515 GFLOPS DP, meaning KC is around 3x faster.

    I also think the hypothetical Cell rates lower than that; at least the 32-SPU Cell I've seen would only hit 800 GFLOPS SP, unless they plan on also increasing clock rates. Then again, it's hypothetical, so who knows. /shrug
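
    (For reference, the arithmetic behind those figures, counting an FMA as 2 flops per lane per clock: ~50 cores x 16 SP lanes x 2 x 1.6 GHz ≈ 2.56 TFLOPS SP for KC, and 32 SPUs x 4 SP lanes x 2 x 3.2 GHz ≈ 0.82 TFLOPS SP for the hypothetical Cell.)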
     
  16. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Both Cell and Larrabee are dead; it doesn't matter how good they are.
     
  17. steampoweredgod

    Banned

    Joined:
    Nov 30, 2011
    Messages:
    164
    Likes Received:
    0
    The old DP revision of Cell did about 1 DP flop per 2 SP; not sure what other modifications can be done, as the progress path seems to be in limbo. But one also has to take performance per watt into account: SPEs are said to consume about 1 W each on a modern process node.

    If Knights Corner is, say, 100 watts, the fair comparison is 100 SPEs, and I think that goes to the SPEs :wink: (though I've heard Knights is likely in the 200+ watt range).
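
    (Taking those numbers at face value: 100 SPEs at the known ~25.6 GFLOPS SP each at 3.2 GHz would be ~2.56 TFLOPS SP for ~100 W; whether the infrastructure around them fits in that budget is another question, as comes up below.)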

    I'd heard single-precision flops were in the 5+ TFLOPS range on state-of-the-art GPUs. I also heard somewhere that around one fifth of that is programmable flops, so the figure truly available would be around 1 TFLOPS SP, if that 1/5th comment is true.

    There's also another comment by a reputable member that latency is orders of magnitude higher on GPUs (though that was a while ago), which might or might not affect physics performance.

    If latency significantly affected performance and the 1/5th figure were true, something like a next-gen Cell would be equal to about 10 state-of-the-art 2012 GPUs in physics-related tasks. If the 1/5th figure is true but latency either isn't higher or doesn't affect performance, they might be similar. If neither the 1/5th comment is true nor latency affects performance, things might turn heavily towards the GPU, depending on which architecture is generally better for physics.

    Given that highly advanced physics performance could bring ubiquitous cloth simulation, hair simulation, fluid simulation, particle simulation, deformable terrain and realistic destruction, this is a very important area when it comes to videogames.
     
    #97 steampoweredgod, Dec 6, 2011
    Last edited by a moderator: Dec 6, 2011
  18. steampoweredgod

    Banned

    Joined:
    Nov 30, 2011
    Messages:
    164
    Likes Received:
    0
    Maybe; Cell might still live on in future PlayStation platforms.

    Besides, why did it die? What sort of agreements were in place? Can IBM use and sell SPE elements freely at will, or not? Was it ease of programming? Are there any limitations with regards to selling it in the console market?

    You can see, for example, that if there are any limitations on selling Cell architectures in the console market, it wouldn't bode well for IBM - which has contracts with multiple other companies in the console market - to have Cell architectures shown off excelling in press release after press release.

    Cell was featured as a key component in many of the greenest and most efficient supercomputer systems. Imagine the PR nightmare of having Cell at the top of the world's supercomputers and then it appearing on only one console platform; how would that look? Certainly not nice at all from a PR standpoint, especially for their clients.

    In another thread it was claimed that a developer got better physics performance out of the PS3's Cell (6 SPEs available) than out of a 4-core i7. Some unnamed next-gen console is rumored to have just 3 cores, cores that likely do not match an i7 core, and if we extrapolate we could say the five-year-old Cell would - assuming that statement regarding physics is true - likely beat the brand-new console CPU at this task.
     
    #98 steampoweredgod, Dec 6, 2011
    Last edited by a moderator: Dec 6, 2011
  19. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    No.
    Nobody wants it. Period. If demand is there, then contracts can be modified.

    This situation already exists.
     
  20. hoho

    Veteran

    Joined:
    Aug 21, 2007
    Messages:
    1,218
    Likes Received:
    0
    Location:
    Estonia
    Now add in the on-chip infrastructure required to service those 100 SPEs with data and you'll probably be incredibly lucky if you only need to cut the SPE count by 30% to fit in the same TDP :)
    A big point of Larrabee/KC was the interconnect between the individual cores, and the mesh/grid kind seems to work quite well for them. Cell's ring bus quite definitely can't scale to that many SPEs.
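
    A quick back-of-envelope on why (a host-side sketch with assumed node counts, nothing Cell- or KC-specific): average hop count between random node pairs grows linearly with node count on a ring, but only with its square root on a mesh.

        // Brute-force average hop distance: 100-node bidirectional ring vs a
        // 10x10 2D mesh. Prints roughly: avg hops: ring 25.0, mesh 6.6
        #include <cstdio>
        #include <cstdlib>

        int main()
        {
            const int N = 100, K = 10;  // 100 nodes; mesh is K x K
            double ring = 0.0, mesh = 0.0;
            for (int a = 0; a < N; ++a) {
                for (int b = 0; b < N; ++b) {
                    int d = abs(a - b);
                    ring += (d < N - d) ? d : N - d;                  // shorter way round
                    mesh += abs(a / K - b / K) + abs(a % K - b % K);  // Manhattan hops
                }
            }
            printf("avg hops: ring %.1f, mesh %.1f\n", ring / (N * N), mesh / (N * N));
            return 0;
        }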
     