Larrabee vs Cell vs GPUs? *read the first post*

Discussion in 'GPGPU Technology & Programming' started by rpg.314, Apr 17, 2009.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    The purpose of this thread is to discuss the relative weaknesses and strengths of the LRB, GPU and Cell architectures, with particular reference to how they might affect future architectures' evolution and market niches. I am particularly interested in applications where one approach delivers considerably better or worse results than the others, i.e. "presence/absence of x really hurts y while doing z".

    ---> ATM, points like "how easy it is to code for it" are somewhat secondary (at least in the beginning), since we are assuming that the programmer is willing to step up to the plate and that equally good (a subjective term, of course) tools are available for each architecture.

    ---> In the interest of fairness, to cancel the advantage of Cell vs GPUs (in that the PPE and SPEs sit very close together), I think it is fair to assume (if need be) that somebody, some time in the future, will put a tiny x86-64 core next to a giant shader array :lol: . The goal here is to consider which problems are more easily and efficiently solved on which architecture, so I think this is OK. Comments/cribs/questions/commendations :) welcome!

    {This obviously doesn't apply to LRB, as all of its cores are equal and the only thing stopping Intel from placing it in the motherboard socket is their marketing department. :evil:}

    ---> Points like "trick x works for a but trick y works for b regarding code optimization" are welcome, as they serve to illustrate the differences.

    ---> Since we are all armchair architects here spouting wisdom for others to follow (ok, mostly....he he), feel free to speculate along the lines of "to improve x with regard to doing y, I'll make the following changes". Along this line of thought, suggestions for improving programmer productivity are welcome too. But kindly keep the proposed surgeries a bit conservative/realistic/less invasive if possible.

    {Suggestions like a chip having a dual ARM Cortex-A9 based PPE, 30 SPEs with 512K LS each, 512 ALUs from nV, a quad x86 core with 4 hyperthreads each, an FPGA, all topped off with dedicated rasterizer/tessellator/texture units on the same die with optical interconnects, aka "ONE CHIP TO RULE THEM ALL :twisted:", are fun to read and laugh at, but do not contribute to anybody's (and definitely not my) understanding/knowledge.}


    Potential starting points. (meant to seed the discussion, please add your own)

    1) Video transcoding seems to work better on Cell than on GPUs.

    2) GPUs can't handle task-parallel code. For instance, the approach of building jobs on the PPE and then submitting them to a queue can't be taken.

    3) For 3D rendering, LRB doesn't suffer from pipeline bottlenecks, as it is all done in software.
     
  2. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    How about banked vs. cached memory in terms of vector scatter/gather? Clearly read-only vector gather can be covered by the texture units, so I'm talking about R/W vector scatter/gather only.

    An LRB hyperthread doing scatter/gather can only access one L1 line per clock, so if you do a scatter/gather which doesn't simply load from a single vector-sized cache line, performance suffers in proportion to the number of L1 lines hit. In my eyes, this vastly marginalizes the usefulness of scatter/gather.

    An even better question is how many L1 accesses LRB can do per cycle per core, i.e. how badly one thread issuing an expensive vector scatter/gather will affect execution of the other hyperthreads on the core. Perhaps very infrequent bad cases of scatter/gather will not hurt total ALU throughput too much. Hard to tell given the extent of the public info released thus far.

    One can compare and contrast this with current NVIDIA hardware, where scatter/gather to random memory locations runs at full speed as long as each lane of the vector accesses a separate memory bank. This allows a lot more flexibility with scatter/gather than the LRB approach in terms of keeping up ALU throughput.
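
    A minimal CUDA sketch (mine, not from the thread) of that banked behaviour: on G80/GT200 shared memory has 16 banks of 32-bit words, and a half-warp whose lanes hit 16 distinct banks gathers in a single step, while lanes that collide on one bank get serialized. The kernel and index-array names are hypothetical.

    __global__ void bank_demo(const int* __restrict__ idx_conflict_free,
                              const int* __restrict__ idx_conflicting,
                              float* out)
    {
        // 256 words of shared memory; word i lives in bank (i % 16) on G80/GT200.
        __shared__ float buf[256];
        int tid = threadIdx.x;

        // Stage some data; consecutive threads write consecutive banks (no conflicts).
        for (int i = tid; i < 256; i += blockDim.x)
            buf[i] = (float)i;
        __syncthreads();

        // Conflict-free gather: indices arranged so each lane of a half-warp
        // lands in a different bank (e.g. lane i reads word i, or i + 16*k).
        float a = buf[idx_conflict_free[tid]];

        // Worst-case gather: e.g. every index a multiple of 16 hits bank 0,
        // so the 16 lanes of the half-warp are serviced one after another.
        float b = buf[idx_conflicting[tid]];

        out[blockIdx.x * blockDim.x + tid] = a + b;
    }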
     
  3. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    An LRB hyperthread doing scatter/gather can only access one L1 line per clock, so if you do a scatter/gather which doesn't simply load from a single vector-sized cache line, performance suffers in proportion to the number of L1 lines hit. In my eyes, this vastly marginalizes the usefulness of scatter/gather.
    Good question. I am pretty sure that it can access only one cache line per cycle per core. AFAIK, there is a performance penalty on CPUs if you read a value that is split across cache lines, and I don't think this part of the Pentium core would have been reworked too much.
    Banked memory is certainly better in this regard. Making a banked cache seems possible too. If this problem turns out to be significant, now or in the future, I'd expect Intel to move to a banked cache. However, the cost (die size) of a banked cache would certainly be larger than that of the normal cache they appear to be using right now.
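
    If it does work that way, the cost of a 16-wide gather would scale with the number of distinct cache lines the indices touch. A tiny host-side sketch of that cost model (the 64-byte line size and the one-line-per-clock behaviour are assumptions from this discussion, not confirmed Intel numbers; the function names are mine):

    #include <cstddef>
    #include <cstdio>
    #include <set>

    // Count how many distinct 64-byte L1 lines a 16-element gather touches.
    int gather_line_cost(const float* base, const int idx[16], size_t line_bytes = 64)
    {
        std::set<size_t> lines;
        for (int i = 0; i < 16; ++i)
            lines.insert((size_t)&base[idx[i]] / line_bytes);
        return (int)lines.size();   // best case 1 line, worst case 16
    }

    int main()
    {
        alignas(64) static float data[4096];
        int contiguous[16], strided[16];
        for (int i = 0; i < 16; ++i) { contiguous[i] = i; strided[i] = i * 64; }
        std::printf("contiguous: %d line(s), strided: %d line(s)\n",
                    gather_line_cost(data, contiguous), gather_line_cost(data, strided));
        // Prints 1 vs 16: under a one-line-per-clock L1, roughly a 16x cost difference.
        return 0;
    }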
     
  4. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    To rephrase my question, why does video transcoding seem to work better on Cell than on GPUs?

    Is it a software maturity issue or is it an architectural issue?
     
  5. Brad Grenz

    Brad Grenz Philosopher & Poet
    Veteran

    Joined:
    Mar 3, 2005
    Messages:
    2,531
    Likes Received:
    2
    Location:
    Oregon
    Both, and marketing as well. I think the hardware vendors overpromised on that front, and as transcoding has proven to be a tough nut to crack on GPUs, corners have been cut to create the expected performance advantage.

    My opinion is that GPGPU has proven to be a huge red herring. HardOCP has been posting videos of tech demos from the Infernal Engine all week. The developers narrating are very emphatic on the point that it is far preferable to have your GPU processing graphics. Their physics engine, which is very impressive, relies on heavily threaded, CPU-based processing. The results are pretty cool, but they still point out that the PS3 version is the one they like to demo because it can simulate the most objects, even more than their PC build running 8 threads on an i7 platform. That being the case, I think there is a place for an intermediate level of hardware splitting the difference between big CPU cores and tiny shader cores. Ideally, these SPE-level execution units would have high-speed access to both the system and video memories.
     
  6. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    So GPUs are relatively poor at transcoding. The question is why. Any insights?

    For physics, SPEs are better than even the i7. Do you think that is because of the fast (L1-speed) and large (L2-sized) LS?

    AFAICS, in game physics it should be possible to know your data-flow patterns deterministically, so Cell should shine there. But won't GPUs be just as good at physics? After all, they too can use shared memory to collaborate and cache, and they make up for its small size by having relatively thin ALU resources per SM (CUDA speak).
     
  7. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,750
    Likes Received:
    127
    Location:
    Taiwan
    My knowledge of encoders goes back to the MPEG-2 era, so I'm not that familiar with more recent codecs such as H.264. However, to my understanding, the most easily workable tasks on a GPU are basically just transformation (DCT, or the H.264 thingy), motion search, and some pre-processing such as noise removal. Traditionally DCT and motion search are the two most time-consuming tasks in an encoder, but with a complex standard such as H.264 that's probably no longer the case. For example, I am not sure whether CABAC can be done fast on a GPU.

    In many physics simulations, you first bucket-sort objects into grid cells, then an SPE focuses on one cell (which should be able to fit into the LS). The processing of a cell is basically independent of the other cells, so it's quite suitable for Cell.

    The shared memory on G92/GT200 is really small, only 16 KB. Each SPE has 4 pages of 64 KB in its LS. However, I don't think GPUs are "bad" at physics, as there are many nice demos showing that GPUs can do some interesting physics simulation effects.
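
    A rough CUDA analogue of that bucketing scheme (my sketch, not pcchen's code; array names like cell_start/cell_count are hypothetical): one thread block plays the role of one SPE and stages its cell's particles into shared memory, the GPU's stand-in for the LS, before working on them.

    struct Particle { float x, y, z, pad; };    // 16 bytes per particle

    #define MAX_PER_CELL 512                    // 512 * 16 B = 8 KB, fits in 16 KB of shared memory

    __global__ void process_cells(const Particle* particles,
                                  const int* cell_start,   // prefix-summed bucket offsets
                                  const int* cell_count,
                                  Particle* out)
    {
        __shared__ Particle cell[MAX_PER_CELL];

        int c     = blockIdx.x;                 // one block per grid cell
        int start = cell_start[c];
        int n     = min(cell_count[c], MAX_PER_CELL);

        // Stage this cell's particles into shared memory.
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            cell[i] = particles[start + i];
        __syncthreads();

        // Work on the cell independently of all other cells (interactions with
        // neighbouring cells are omitted in this sketch).
        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            Particle p = cell[i];
            // ... collide / integrate p against the other particles in `cell` ...
            out[start + i] = p;
        }
    }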
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
  9. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    CABAC is particularly bad on CPUs, due to its very branchy nature. In fact, the CABAC portion of x264 slowed down by almost 10% on the new Core i7. :-/

    Hmm. 1 SPE has 25.6 GFLOPS and 256 KB of LS, so roughly 100 Kflop/s per byte. 1 SM on nV has about 21 GFLOPS per 16 KB of shared memory. That's not an equal comparison of course (code also has to live in an SPE's LS), but the SM's flop-to-local-store ratio is still more than an order of magnitude higher. IMHO, code size exceeding data size would be an anomaly in this regard (as would a bucket with very few particles). Rough numbers below.
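
    (The arithmetic behind that comparison, using the GFLOPS figures quoted above, rounded:)

    #include <cstdio>
    int main()
    {
        double spe_flops = 25.6e9, spe_bytes = 256.0 * 1024;   // one SPE and its LS
        double sm_flops  = 21.0e9, sm_bytes  = 16.0 * 1024;    // one SM and its shared memory
        std::printf("SPE: %.0f flop/s per byte of LS\n",         spe_flops / spe_bytes);  // ~1.0e5
        std::printf("SM : %.0f flop/s per byte of shared mem\n", sm_flops  / sm_bytes);   // ~1.3e6, ~13x higher
        return 0;
    }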

    BTW, does anyone have an estimate for the latency encountered on an SPE, i.e. when you DMA 16 bytes (the minimum allowed) in terms of clocks? On nV it is 400-600 clocks as per the CUDA guide, but 500-650 clocks as per Volkov's paper. It should be in the same ballpark, right? Or does XDR RAM act funny in this regard? On CPUs it is typically 100-200 clock cycles if I am not mistaken. Better estimates welcome.
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    @ Jawed: I have seen that presentation. What are the points you are trying to make here? And AMD's transcoder is worse off / not much better than Badaboom, AFAIK.
     
  11. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,807
    Likes Received:
    473
    Cell has 8- and 16-bit SIMD instructions, for one. For another, GPUs use parallelism not merely to hide memory latency; they use it to hide pretty much all latency in the pipeline ... because they usually work on massively parallel problems, and because memory latency is orders of magnitude higher than everything else, they just don't bother trying to keep latencies low for anything else. Their actual instruction latency is hideous compared to Cell's. That means you are forced to have many more threads than on Cell.

    They are too optimized for massively parallel workloads.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    h.264 encoding isn't usefully faster (for a given bit-rate and quality trade-off) on GPUs because of bottlenecks (e.g. PCI Express bus) and limited applicability of GPUs (motion estimation is about the only thing worth doing).

    Latency across PCI Express is a serious problem. As far as I can tell, in games it means a frame of delay in things like world physics.

    h.264 encoding isn't, in itself, all that transcoding requires, so there are other opportunities for performance gains. Decoding and de-noising are reasonable candidates, though of course GPUs have their own dedicated, fixed-function decoders.

    Jawed
     
  13. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Well, 8- and 16-bit ops certainly help.

    :) But then shouldn't they have more cache (aka shared memory) for massive parallelism? I mean, they are too optimized for massively parallel workloads which also have high arithmetic intensity. LRB seems much better in this regard, in that it will have a much larger cache-to-flop ratio than GPUs. Assuming a 2 GHz clock, it has 16 lanes * 2 flops (madd) * 2 GHz = 64 GFLOPS per core, against 256 KB of L2 per core. So perhaps not better than Cell.
     
  14. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    What about DCT, eh?
    It's an offline transcode, not a realtime one, so latency should be dominated by bandwidth right?

    BTW, these transcoders need cuda's atomic ops. Is that a pointer why cell does better at them than gpu's? But why do they need them?
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    It's in the AMD presentation. Firmly stuck in CPU land.

    The issue is that while the CPU does X, Y and Z steps in the algorithm, the GPU doing A has to be fast enough that X, Y and Z aren't slowed down. Otherwise the CPU just waits. It seems with AMD this latency was so bad that they didn't even bother with h.264 encoding until RV770 came along.

    Do these transcoders need atomic ops? I was under the impression that G80 was supported.

    http://www.cyberlink.com/products/powerdirector/cuda-optimization_en_US.html

    http://www.badaboomit.com/?q=node/113

    I can't find any statement of supported GPUs for Super LoiLoScope.

    I think G80's out-of-date PureVideo (hmm, does it even have PV?) is what's causing the variations in support.

    Jawed
     
  16. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,807
    Likes Received:
    473
    Cache runs into diminishing returns for graphics relatively fast, so no.
     
  17. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
  18. Rayne

    Newcomer

    Joined:
    Jun 23, 2007
    Messages:
    91
    Likes Received:
    0
    I can really only speak from my experience with the x86 architecture.

    I also have some experience with the Cell. I wrote some wavelet code and tried to speed it up using all the SPEs, but it didn't end very well. The Cell is too complex at a low level. I remember having serious problems with the DMAs to feed the SPEs with the data to process the blocks. Bad performance; more time was wasted in the setup than in the data processing.

    On the x86 side, I think one of the biggest problems with the SIMD instructions is the time you need to fill all the elements of an XMM register.

    If you are lucky and your data is aligned in memory, you can load it with a MOVAPS instruction. If the data is not aligned, you can use MOVUPS (slower, of course), although I did read that on the Core i7, MOVUPS is as fast as MOVAPS on aligned data.

    But there are worse cases. Sometimes you need to read 4 elements from 4 different memory addresses and shuffle them to fit into an XMM register. For me this is the worst case, and it kills SSE performance. You lose a lot of time doing the 4 loads and shuffling the data, and in these cases it is usually better to use the FPU, or a single component of an XMM register. Basically, you lose all the advantages of the SIMD instructions due to the high overhead of the data shuffling.
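
    A toy illustration of those three cases with the SSE intrinsics (my own code, with made-up pointer names):

    #include <xmmintrin.h>

    __m128 load_cases(const float* aligned_ptr,      // 16-byte aligned, contiguous
                      const float* unaligned_ptr,    // contiguous but not 16-byte aligned
                      const float* a, const float* b,
                      const float* c, const float* d)  // four scattered addresses
    {
        // Best case: one aligned load (MOVAPS) fills the whole register.
        __m128 v0 = _mm_load_ps(aligned_ptr);

        // Unaligned but contiguous: MOVUPS (cheap on Core i7, slower on older cores).
        __m128 v1 = _mm_loadu_ps(unaligned_ptr);

        // Worst case: four scalar loads plus shuffles just to build one register;
        // this is where the SIMD advantage largely evaporates.
        __m128 v2 = _mm_set_ps(*d, *c, *b, *a);

        return _mm_add_ps(v0, _mm_add_ps(v1, v2));
    }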

    I also think that Intel should implement a hardware 4x4 matrix transpose instruction (among other things :)).

    As for my recent experiences with CUDA on the GPU: I love the SIMT concept, but I think it still has the same problem with the data setup. In my Perlin noise kernel, I waste half of the time filling the shared memory, and it's a very small chunk of memory.

    To sum up, I think we have a lot of ALU power today, but memory access should be more flexible (and with TB/s of bandwidth :D).
     
  19. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Can we have some more details on that, please?

    It's really a shame that SSEx isn't orthogonal. Why does Intel have to design every ISA (save LRBni) like they are brain-dead or something? For your problem, maybe loading the four values into an aligned float[4] array and then doing a single aligned load would be faster :???:

    That would be fun. They do have an optimized 4x4 transpose macro available in the SSE headers if you use the intrinsics (quick sketch at the end of this post).

    That would make a lot of the effort put into exploiting the memory hierarchy useless. :twisted:
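
    For reference, the macro in question is _MM_TRANSPOSE4_PS from <xmmintrin.h>; here's a quick usage sketch (the wrapper function is just my illustration):

    #include <xmmintrin.h>

    // Transpose a row-major, 16-byte aligned 4x4 matrix entirely in registers.
    void transpose4x4(float m[16])
    {
        __m128 r0 = _mm_load_ps(m + 0);
        __m128 r1 = _mm_load_ps(m + 4);
        __m128 r2 = _mm_load_ps(m + 8);
        __m128 r3 = _mm_load_ps(m + 12);

        _MM_TRANSPOSE4_PS(r0, r1, r2, r3);   // rows become columns, no extra memory traffic

        _mm_store_ps(m + 0,  r0);
        _mm_store_ps(m + 4,  r1);
        _mm_store_ps(m + 8,  r2);
        _mm_store_ps(m + 12, r3);
    }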
     
  20. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,807
    Likes Received:
    473
    Personally, I have always thought every architecture should have special engines doing nothing except processing DMA lists (wasting 1/4 of your potential ALU issue slots on Larrabee is a rather poor alternative, IMO).
     