Larrabee delayed to 2011?

Discussion in 'Architecture and Products' started by rpg.314, Sep 22, 2009.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    IMHO, the ability to write massively parallel code in C/C++/Fortran isn't a bonus. It's an evolutionary leftover from a distant era. Personally, I feel classic C/C++/Fortran are fundamentally broken for parallelism. Java/C# etc. are not much better.

    Well, so far nobody has managed to make a language that spans the dynamic range from massive parallelism to purely serial code and isn't purely functional. And the purely functional ones don't seem to be doing very well in terms of adoption just yet.

    10 years ago I'd have agreed with you, but today I'd say MSIL (or its bastardized cousin) has a better shot at it than the x86 ISA. Don't forget that there is very little production GPU code yet. And IMHO, 90% of it will be written with whatever tools MS can cook up.
     
  2. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    In any case where scatter/gather would run into an issue with snooping, it is going to just plain suck on any modern memory system. I've been in the room with various parties interested in scatter/gather, with significant experience and past history with it, and the fundamental problem is that unless what you are scattering/gathering sits in a local SRAM, the interconnect and memory become a fundamental problem. And of course, in all their workloads, there's no way to keep things that local, because the data sets are so large. Oh, what they would do to be able to use SRAM as main memory again.

    S/G on local caches isn't ideal, but it isn't really any harder than using the enormous, massively ported register files that ATI/Nvidia use now in order to support it. If you care a lot more about striding there are some elegant solutions available, but for general S/G, it's all about using multi-ported register files instead of RAM arrays for your L1 arrays. This is universal and doesn't really depend on whether you call your L1 arrays local stores, shared memory, or cache.
     
  3. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    So when this perfect mythical pink elephant decides to show up, I'm sure we'll ride him. In the meantime we'll use the robust beasts of burden that we already have.

    Functional languages are where it's at. It's a shame that so few people can actually use them correctly and that the two major functional languages are such clusters of semantic crap. Yet still, almost every piece of electronics now is based on them.
     
  4. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,629
    Likes Received:
    1,227
    Location:
    British Columbia, Canada
    I agree, but I don't think it invalidates the point. Even accepting a speed hit I don't see moving forward without caches. Fermi's caches - while not completely coherent - also show promising performance results, so I'm unwilling to accept that it's a problem that can't be overcome.

    While I resoundingly agree with your point, there's still a lot of non-performance-critical code that can happily run in any of these languages without affecting the overall speed of the code. At least until we have CPUs and GPUs on the same chip using the same memory subsystem (i.e. same cache hierarchy) - and maybe even then - it's unreasonable to say that all of this code belongs on the CPU.
     
  5. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Writing efficient low-level code comes down to internalizing the compiler and hardware ... the way it ends up being executed is inherently imperative, so the effort it takes to do this for functional programming is much greater than with imperative languages. Syntactic sugar is helpful, type systems are helpful, deadlock/race detection and prevention are helpful; completely obscuring control flow is not helpful.

    For those couple of % of most important code functional programming will never be the right tool.
     
  6. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Hmm... OK.

    C/C++ are anything but robust in the parallel era. If anything, they are an even more fragile tool in the parallel world. Or a sharper double-edged sword, if you prefer.

    Further, continuing to use a broken language - for any reason - won't fix the problem.

    Which two did you have in mind? Haskell and Erlang?
     
  7. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    Intel excoriates the Larrabee concept:

    http://www.techradar.com/news/compu...-cards/intel-larrabee-was-impractical--716960

     
  8. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    With the benefit of hindsight, the rasterizer is the x87 of GPUs. You may hate it, you may even deprecate it by fiat (or by a software renderer), but you may not remove it from the die.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I just don't believe having no fixed-function rasteriser is the problem. If it were that simple, you could give each core a rasteriser, or give each texture unit a rasteriser.

    Also, seeing how much grief 4 rasterisers have given NVidia (and they're fixed function), it seems to me Intel gave up too soon :razz:
     
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    I thought the grief was from maintaining triangle order in a distributed environment? That challenge remains whether you're doing rasterization in fixed-function units or in software.
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Precisely my point. Intel's software rasterisation obviates the triangle ordering problem with its tiled approach. The struggle NVidia had is analogous to the struggle Intel had in distributing work across the cores.

    I think the problem lies elsewhere, e.g. a yearly TAM of, say, $2 billion for performance/enthusiast discrete just isn't worth chasing in comparison with server/cloud/HPC.

    Also life's simpler for Intel if it doesn't have to write drivers for D3D. There was always the question hanging over the architecture of how long it would take Intel to get a game's performance right, with worrying statements that months after game release would be required. (AMD doesn't seem to have much of a different attitude, though.)
     
  12. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Offhand, I can't see how it obviates the need. You've just moved the serialization point from rasterization to spatial binning. Scaling spatial binning across cores while maintaining triangle order isn't exactly easy.
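
A minimal sketch of what that serialization point looks like (hypothetical, not Intel's actual scheme; the tile size, data layout, and function names are all my assumptions): triangles carry their submission index, so each screen-tile bin can be restored to primitive order before rasterization.

```python
# Hypothetical sketch: bin triangle bounding boxes into screen tiles while
# preserving primitive-submission order within each bin. TILE and the
# input format are illustrative assumptions, not Larrabee's real layout.
from collections import defaultdict

TILE = 64  # tile size in pixels (assumed)

def bin_triangles(triangles):
    """triangles: list of (prim_id, (min_x, min_y, max_x, max_y)).
    Returns {(tile_x, tile_y): [prim_id, ...]} in submission order."""
    bins = defaultdict(list)
    for prim_id, (x0, y0, x1, y1) in triangles:
        # A triangle lands in every tile its bounding box overlaps.
        for ty in range(y0 // TILE, y1 // TILE + 1):
            for tx in range(x0 // TILE, x1 // TILE + 1):
                bins[(tx, ty)].append(prim_id)
    # If binning were distributed across cores, each core would emit a
    # partial ordered list per tile and the lists would be merged by
    # prim_id -- that merge is exactly the serialization point above.
    for tile in bins:
        bins[tile].sort()
    return bins
```

Per-tile replay in `prim_id` order is what lets rasterization itself stay local to one core per tile, as discussed below.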

    And by the looks of it, they have got it almost right for SNB.
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    The serialisation is actually per tile-pixel (or more granular, e.g. per tile qquad), and local to a single core since tiles in rasterisation (stages post setup until back-end) don't span cores.
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Is that assuming the implementation placed the tessellation stage in the front-end and not the back-end?

    It would have been an interesting exercise to see what numbers Larrabee could have pulled in Heaven, its applicability to current workloads aside and assuming that the software renderer had been functionally coded to DX11 spec.

    This latest Intel statement is far more down on Larrabee graphics than anything I've seen thus far, and is a noticeable drop from a position I had already perceived as rather lukewarm. I suppose Tim Sweeney will need to wait a little longer for his software rendering dream to come true.
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I don't understand what you're suggesting.

    Tessellation was very much an open question, I don't remember any of Intel's materials covering it.
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The option existed to run the tessellation stages either in the front-end or the back-end.
    Whether there was ever an implementation of it for Larrabee is something I do not know, but Intel did discuss the possibility.

    If a primitive is allocated to a bin and the back-end is responsible for performing tessellation, the triangles generated on one core could cross the bin's tile boundaries.
     
  17. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Logarithmic shadow maps: now even more of a pipe dream! Oh well.

    Anyway... if you look at communications, the vast vast majority of software-centric architectures still do problematic algorithms like Turbo Coding and Viterbi in hardware blocks. But there are exceptions that do those very efficiently in software - the trick is their architecture is incredibly unusual and very different from a traditional processor, even though it could afaict rightfully be called Turing Complete (as long as you look at a large enough piece of it rather than just a subsystem).

    The basic problem with graphics is that the number of blocks that would benefit from such exotic architectures is actually very small, and their data flow is very complex (rasterisation being the poster child). And going down that route would create a lot of complexity at the compiler for more normal shading workloads, so overall it just doesn't make any sense and the best approach remains fixed-function.

    The one thing Larrabee did provide above and beyond any current desktop GPU architecture is scalar/MIMD, and interestingly on-core rather than as a separate on-chip block. I'm honestly unsure whether there is much benefit to on-core SIMD+MIMD in either graphics or GPGPU compared to separate SIMD and MIMD cores, but a frequent problem of the latter in 80s/90s architectures is the lack of bandwidth between the scalar and the vector part. With the power consumption of data communication even on-chip increasing to dramatic levels, there might be something to be said for on-core integration of the two not (just?) from a software level but from a hardware level. Some sort of close coupling at least would make sense.

    Of course, ideally we'd all go pure MIMD. Rys, can I haz Series6? :D (and please don't break my heart and tell me it's SIMD now :()
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Can't remember seeing that :???:

    Tessellation consumes patches. I don't think patches would be screen-space binned.

    The patches should be able to run in parallel through VS/HS to generate input to TS and DS. Ordering of triangles coming out of DS should be keyed by Patch ID, I presume (TS generating sub-patch triangle ID).

    Is there a serialisation I'm missing?

    Anyway, screen-space tiling for binning of triangles involved in tessellation (input or output) would be done post-GS.
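
A minimal sketch of the ordering idea suggested above (my reading, not a documented Larrabee mechanism; the names are illustrative): triangles emitted by the tessellator are keyed by (patch ID, sub-patch triangle ID), so patches can run in parallel and the output stream is restored to API order afterwards.

```python
# Hypothetical sketch: patches are tessellated concurrently and finish in
# arbitrary order; sorting on (patch_id, sub_tri_id) reconstructs the
# deterministic API-submission order of the generated triangles.
def restore_order(emitted):
    """emitted: list of (patch_id, sub_tri_id, tri_data) tuples in
    arbitrary completion order; returns tri_data in API order."""
    return [tri for _, _, tri in
            sorted(emitted, key=lambda t: (t[0], t[1]))]
```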
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Tom Forsyth's SIGGRAPH 2008 presentation touted the flexibility of assigning stages either to the front-end or the back-end.
    Included in that set are GS and tessellation.

    Wouldn't this mean it occurs in the front-end? If it's not in a bin, the back-end would not be able to grab it.

    VS is listed as a front-end capability. It was not clear to me that VS is one of the stages that could be put in either front or back.

    GS could be either front or back as well.
     
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Looking at slide 22, the only way I can interpret tessellation being done in the back-end (along with GS) is if TS is synonymous with VS->HS->TS->DS (i.e. it is not a reference purely to the TS stage). In my interpretation, DS would be split between front-end and back-end:
    • Front-end DS would generate screen-space coordinates for the purposes of binning.
    • Back-end DS would generate all the other attributes of each vertex.
    The advantages of delaying some DS work would include reduced storage in global memory and re-distribution of workload (e.g. later DS might lead to better load-scheduling).
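
A hypothetical sketch of that front/back-end DS split (entirely my illustration; the bilinear patch and all names are made up, not Larrabee's implementation): the front-end evaluates only position, which is all binning needs, and the back-end re-evaluates the domain point for the full vertex.

```python
# Hypothetical sketch: split domain-shader evaluation. Front-end computes
# position only (enough to bin the triangle to screen tiles); back-end
# produces the remaining attributes per bin. Toy bilinear patch for brevity.

def ds_position(patch, uv):
    """Front-end DS: position only, via bilinear interpolation of the
    four corner positions (a stand-in for real patch evaluation)."""
    u, v = uv
    p00, p10, p01, p11 = patch["positions"]
    return tuple((1 - u) * (1 - v) * a + u * (1 - v) * b
                 + (1 - u) * v * c + u * v * d
                 for a, b, c, d in zip(p00, p10, p01, p11))

def ds_full(patch, uv):
    """Back-end DS: position plus the other vertex attributes."""
    return {"pos": ds_position(patch, uv),
            "uv": uv}  # stand-in for normals, texcoords, etc.
```

The storage saving Jawed mentions falls out directly: only positions need to live in global memory between binning and back-end shading.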

    Precisely. But tessellation doesn't have to be complete for binning to start (position attribute of each vertex is mandatory for binning). Which leads me to suggest what I posted above.

    GS can do a variety of things. If GS is used merely to delete vertices/triangles then in theory it can be delayed until after binning - again this is a load-balancing question, I think. i.e. run GS across lots of cores as they do binning, rather than on a few cores while creating bins.

    Maybe there are some other usages of GS that are amenable to delayed execution (e.g. generating attributes)?

    ---

    By the way, the term "rasteriser" is often used to describe all of these stages: setup->rasterisation->pixel shading->output merger (ROP). So it's possible to interpret the statement about the lack of a fixed-function rasteriser as actually descriptive of lack of "setup->rasterisation->pixel shading->output merger". To be honest I think this is very likely the correct interpretation.

    I pretty much always thought it would be years before Intel was competitive at the enthusiast end, but process would eventually allow it to catch up. A major question for the other IHVs is what proportion of die space ends up being programmable compute, and the higher that rises the more competitive Intel becomes.
     