Intel pushing Larrabee console deal with Microsoft

Discussion in 'Console Technology' started by The Seventh Taylor, Sep 6, 2008.

  1. V3

    V3
    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    3,304
    Likes Received:
    5
    Each LRB core is going to be larger than a Cell SPU, or even the PPU.
    If IBM is predicting a 32-SPU Cell, I am doubtful Intel is going to have 64+ LRB cores in the same time frame unless it is a really large chip, >400mm2.
     
  2. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    EDRAM makes a big difference (for the worse) when you are fabbing your chip.
    We can simply live without it; I hope LRB leads the way on this front in the console market. (PowerVR is already taking excellent care of the mobile market.)
     
  3. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    1) You, as a developer, don't need to manually tile a thing on LRB (that would in theory be possible on 360 as well, but MS took another route..)
    2) The difference is in how you fab your chip + edram; it's generally not a walk in the park.

    EDRAM that doesn't affect logic is like ray tracing: it's the technology of the future, and it always will be.
     
  4. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Scan-line Based Deferred Renderer
     
  5. Fafalada

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    2,773
    Likes Received:
    49
    Ultra-wide MIMD 6502 FTW! :razz:

    Actually I meant Scanline in that context. To be fair I didn't even coin the abbreviation myself, nAo did it *points finger*.
     
  6. Blazkowicz

    Legend

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256

    you're describing a console with two monster GPGPU/GPU chips, one made by Intel and the other made by AMD, with different feature sets and instruction sets, competing on the same board :). That looks unlikely.

    Cell 2 + GPU, while still a bit weird for my liking, at least makes more sense: you would expect the SPUs to pack more general-purpose flops per area than a GPU or LRB, and the GPU to pack more pixel flops per area than LRB. It's what you're looking for.

    Regarding Larrabee, I think there would be a single one in a console, not two, for the same reasons I don't believe at all in two-GPU consoles. What you gain in die size (or not, as you need two dies) is offset by the more complex board, interconnects, etc., and you don't solve anything regarding power. I think Larrabee in a console would be quite a huge chip, with some amount of redundancy, with or without OoOE cores, and would balance its hugeness by being the only processing chip in the console (with the benefits of simplicity, the effectiveness of die shrinks, and the lead Intel has in silicon process).
     
  7. c2.0

    Newcomer

    Joined:
    Jul 27, 2008
    Messages:
    12
    Likes Received:
    0
    Assuming same time frame, same process, and same size, then obviously not.
    From the Cell roadmap, it looks like 32iv is set for 10/11, probably at 45nm. I think Intel would be up to 32nm LRB by 11/12.
    I would also think that an LRB core will be smaller than a PPU core (cache size and OoO buffers), so we can take into account the 4 PPU cores sitting in Cell as well, maybe making up for the size difference between LRB and SPU cores.
    If we consider LRB for either discrete graphics or a single chip console, it's not completely unreasonable to expect a 400+mm2 chip. (wasn't G80 something like 480mm2?)
    All in all, I stand by my original suggestion.


    The way graphics are trending, there's less and less difference between general purpose flops and "pixel flops". General purpose might be too broad, but I'd say anything that falls into the category of vectorizable throughput computation should run well on LRB, and I think Intel have picked the right time in investing in this as the future direction for GPUs.

    One chip to rule them all, eh :p

    I completely agree with your argument regarding power, and since you need to keep a console within a reasonable power budget, if you can exhaust that with a single chip, then that's probably the way to go.
    I was wondering whether going with the bigger single die/chip complicates design and manufacturing. Layout gets more complicated (in particular, the ring would have to grow, and latencies across the chip get higher), and the external connectors would have to grow to facilitate more memory channels to feed the chip (although NUMA adds its own share of load-balancing issues, at least it's easier to scale).
    In the end I think you're right: because of power limits, a single-chip solution is the most likely. I just wouldn't mind seeing two for the performance, assuming you won't be able to pack twice the number of cores on one die, or scale the memory interface (internal and external) linearly.
     
  8. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    The PPU is in-order. For its perf/W or perf/mm^2 of Si it's just a crappy design (as is XCPU).

    Cheers
     
  9. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    It really isn't hard to run faster than a PPU on a per-clock basis. XCPU cores are significantly faster, though I expect LRB's perf on scalar code to be better than XCPU's (clock for clock).
     
  10. betan

    Veteran

    Joined:
    Jan 26, 2007
    Messages:
    2,315
    Likes Received:
    0
    than PPU? Why?
     
  11. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Ask IBM, perhaps they know why ;)
     
  12. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    For purely parallel problems adding more hardware contexts to each core is a fairly easy way to turn a latency bound problem into a bandwidth bound one.

    What is going to be interesting is when a massively threaded application has threads that need to communicate, that is, when a significant amount of synchronization and data exchange has to occur.

    Speculative lock elision will obviously boost basic mutex performance.

    However, the demand-loaded nature of caches will be a problem when cores have to exchange data. Imagine one core (producer) writing to a FIFO, and another (consumer) reading from it. In a processor with a conventional MOESI coherency protocol, the producing core will acquire the cache line (with intent to write, invalidating all other copies) and put it in an exclusive state when writing it. The reading core afterwards needs to obtain the cache line. A whole bunch of coherency traffic is going on.
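To make the producer/consumer pattern concrete, here is a minimal sketch of an SPSC FIFO in C (names and the 64-byte line size are assumptions of mine, not from the post). The padding keeps the producer-owned and consumer-owned indices on separate cache lines, so at least the index writes don't bounce the same line between cores; the data slots themselves still incur exactly the Modified-to-Shared coherency traffic described above.

```c
#include <assert.h>
#include <stdint.h>

#define FIFO_CAP 16   /* one slot is kept empty to distinguish full from empty */

/* Hypothetical layout: head and tail sit on separate (assumed 64-byte)
 * cache lines so the producer's writes to `tail` and the consumer's
 * writes to `head` don't force one line to ping-pong between cores.
 * The buf[] lines still migrate producer -> consumer on every read. */
struct spsc_fifo {
    uint32_t buf[FIFO_CAP];
    char pad0[64];
    uint32_t head;            /* written only by the consumer */
    char pad1[64];
    uint32_t tail;            /* written only by the producer */
};

static int fifo_push(struct spsc_fifo *f, uint32_t v)
{
    uint32_t next = (f->tail + 1) % FIFO_CAP;
    if (next == f->head)
        return 0;             /* full */
    f->buf[f->tail] = v;
    f->tail = next;           /* a real MT version needs a release barrier here */
    return 1;
}

static int fifo_pop(struct spsc_fifo *f, uint32_t *out)
{
    if (f->head == f->tail)
        return 0;             /* empty */
    *out = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    return 1;
}
```

The sketch is single-threaded on purpose: the point is the memory layout, not the (omitted) barriers a real two-core version would need.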

    Special loads and stores that bypass the data caches could help significantly. Stores to FIFOs (or similar structures) could even be handled by a special smaller and therefore faster L2-look-aside SRAM structure to lower load-to-use latencies.

    There would still be some latency incurred from accessing lower-level caches, so the next step would be to add OOO capabilities. Data-capture schedulers are tiny, dense SRAM structures (look at an Athlon core floorplan or the new ARM Cortex-A9); decoupling the super-wide SIMD registers from the main ROB would save space.

    Cheers
     
  13. c2.0

    Newcomer

    Joined:
    Jul 27, 2008
    Messages:
    12
    Likes Received:
    0
    Yep, I was referring to the next-generation PPU. I thought they had plans to add out-of-order execution to bring the single-threaded perf up to scratch (seeing as they have the SPUs for heavy throughput work).

    As demonstrated by modern GPUs. How many algorithms map well to this, though, and how well does it scale as the ratio of compute to external bandwidth worsens? As the number of cores and the request demand on external memory increase, there should be less and less idle time on the memory interface. In this situation, doesn't it become redundant to try to hide latency with more contexts, when you should rather try to optimize locality and reuse, and minimize external (read: slow and energy-inefficient) traffic?


    If I understand SLE correctly, it should avoid a reader even needing exclusive access to anything. Today a reader would, at a minimum, have to write to the lock primitive. This definitely looks interesting and practical in the short term.
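A minimal sketch of the case SLE targets, using plain pthreads (the table and function names are invented for illustration): the critical section itself is read-only, yet with an ordinary mutex every reader still performs two writes to the lock word, bouncing that cache line between cores. SLE would speculate through the section and, absent a conflicting store, elide those writes entirely.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_table[4] = {10, 20, 30, 40};

/* The body between lock and unlock performs no stores to shared data,
 * so under SLE the acquisition could be elided and many readers could
 * proceed concurrently without invalidating each other's copy of the
 * lock's cache line. */
int read_entry(int i)
{
    pthread_mutex_lock(&table_lock);    /* write #1 to the lock word */
    int v = shared_table[i];            /* read-only critical section */
    pthread_mutex_unlock(&table_lock);  /* write #2 to the lock word */
    return v;
}
```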

    How feasible is it to treat shared cache lines differently (either from heuristic analysis or program hinting), automatically and transparently, rather than changing the programming model?
    I saw this paper: http://portal.acm.org/citation.cfm?...l=Portal&dl=ACM&CFID=2303156&CFTOKEN=37913594 at SPAA this year, but it doesn't look terribly scalable to me..
    I also heard murmurs of transactional memory support, but no one from Intel would comment either way on this :p
     
  14. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    At the cost of streaming transformed vertices through memory in frame-sized chunks.
     
  15. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    A cost that, imo, one would be pleased to pay compared to the complexity and costs of having edram on board. On the other hand, it seems that NVIDIA and AMD think TBDR designs are like the boogeyman; they can't even name them (perhaps, as you suggested a long time ago, it's all due to patent wars and such..)
     
    #95 nAo, Sep 10, 2008
    Last edited: Sep 11, 2008
  16. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    I don't see how that would help with the FIFO case, unless you mean bypassing the local cache. So what you would need is to be able to lock cache lines to a specific processor.
     
  17. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    Yes, bypassing data caches for both cores, but no, not locking the cache line to a specific processor. I want the cache line to reside in the L2, and I want semantics, or at least hints, on the loads and stores that tell the L2 apparatus the data just stored is likely to be read by some other core (or possibly by a different context on the same core).

    This alone would save on coherency traffic to and from the data caches. The next step is to make a special chunk of logic to handle these loads and stores, i.e. a subset of the L2 with much faster access times, to lower communication latencies.
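The closest existing analogue to the cache-bypassing stores described here is x86 non-temporal stores; a rough sketch (my illustration, not the L2-look-aside structure proposed in the post):

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: _mm_stream_si32 (MOVNTI), x86 only */

/* MOVNTI writes through write-combining buffers instead of allocating
 * the line in the producer's data cache, so a consumer on another core
 * doesn't have to snoop a Modified line out of the producer's L1.
 * The hint here is "don't cache at all", weaker than Gubbi's
 * "keep it in L2 for the next reader", but it shows the shape of
 * store-side hints that already exist. */
void produce(int *dst, const int *src, int n)
{
    for (int i = 0; i < n; ++i)
        _mm_stream_si32(dst + i, src[i]);  /* NT store, no cache fill */
    _mm_sfence();  /* order the NT stores before signaling the consumer */
}
```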

    Cheers
     
  18. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
    Unless GigaPixel's IP cannot be used because it somehow infringes on IMG Tech's IP (and we know how angry Simon gets when you touch his IP), NVIDIA should have the basic building blocks for a TBDR...
     
  19. Fafalada

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    2,773
    Likes Received:
    49
    Unlike most other GPUs, LRB should be able to walk the scene trees itself (in fact, if it's anything like it's cut out to be, it could be damn good at it as well), so why would it have to do "geometry capture" at the top scene level?
    And dated as it is, we talked about stuff like that back in the "Realizer" patent days.
     
  20. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Unless Intel provides a scene-graph API, that means you "need to manually tile" ... where it happens is irrelevant.

    Intel could go a long way toward providing something close enough to a scene-graph API for our purposes by creating an OpenGL extension for per-display-list bounding volumes.

    PS. like this ... or rather, like this.
     