Intel on data-parallel languages and raytracing

Discussion in 'GPGPU Technology & Programming' started by B3D News, May 29, 2007.

  1. B3D News

    B3D News Beyond3D News
    Regular

    Joined:
    May 18, 2007
    Messages:
    440
    Intel is currently researching data-parallel languages, reports EETimes. These languages would most likely be for massively parallel architectures, such as Larrabee and Terascale, putting them in direct competition with NVIDIA and AMD in the GPGPU market.

    Read the full news item
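
    For anyone wondering what "data-parallel" means in practice, here is a minimal sketch (our own illustration, not anything from Intel's research, with plain OpenMP standing in for whatever language they end up shipping): the same operation applied independently to every element of an array, which is exactly the kind of loop such a language or runtime can spread across many cores.

    [code]
    // Illustration only: a "data-parallel" loop applies the same operation
    // independently to every element, so a runtime is free to hand chunks
    // of it to as many cores as it likes. Plain OpenMP is used here purely
    // as a stand-in for a dedicated data-parallel language.
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        const float a = 3.0f;

        // No iteration depends on any other iteration.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];

        std::printf("y[0] = %f\n", y[0]); // expect 5.0
        return 0;
    }
    [/code]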
     
  2. Arun

    Arun Unknown.
    Moderator Veteran

    Joined:
    Aug 28, 2002
    Messages:
    4,971
    Location:
    UK
    I don't want to sound like I'm bashing Intel's efforts (200 researchers? zomg, ftw, etc.) but it is interesting to note that these comments were made at the same time and by the same person who complained about the lack of effort by software makers to make their programs more parallel.

    These comments were primarily aimed at large companies (such as Microsoft), but they hardly apply exclusively to them. The argument presented is fundamentally flawed imo: "if one company doesn't do it, a competitor will"... That has been Intel's motto for a long time, and it was certainly true 10+ years ago, when Joe Consumer's computing experience was clearly limited by CPU performance.

    Nowadays, things are different. Client workloads are not very CPU-limited at all. Look at Windows Vista, the latest version of Office, and the couple of other apps that pretty much everyone uses. The only really mass-market apps I can think of that might benefit from higher performance are antiviruses, and fixed-function hardware that humiliates any CPU at that task has been making its debut in recent months/years. You'd expect that could eventually be integrated into chipsets and become a commodity.

    I'm not arguing that a number of apps won't benefit from multithreading. They obviously will. Games will benefit massively - that's just a matter of time - and a large number of non-mass-market applications will too. That doesn't justify the purchase of an octo-core CPU for Joe Consumer though, and the problem is that I fail to see not only what justifies it today, but what will justify it in 5+ years, when it will have become a commodity. The only interesting emerging workloads that might become more important (such as voice recognition) seem to benefit more from throughput cores (or even GPU cores) than from CPU cores.

    Of course, Intel must partially realize that, thus Larrabee. The big question there is its perf/cost$ (including perf/mm²), as unlike traditional CPUs it won't have any inherent advantage in terms of backwards compatibility etc. - but assuming that Intel plays its process advantage properly, and that the architecture is efficient enough... We'll see. Outside of graphics and HPC, it will also obviously depend on how fast emerging applications grow in importance, and how well they manage to benefit from data-parallel architectures. As for HPC and gaming, both perf/cost$ and perf/watt are obviously the key factors.
     
  3. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,359
    Well, business as usual we might say - Joe didn't need a Pentium 4 either.
    At least Joe's CPU should be reasonably power-efficient, with 6 or 7 idle cores in a good sleep mode and the IGP/"fusion units" taking care of video playback and maybe some other throughput things, as you say.
     
  4. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,115
    Location:
    Uffda-land
    Well, I've used this example a couple times now, including in my contribution to our commentary on Carmean's presentation. . . but I think Intel is still rather bitter --and judging by this, possibly a little alarmed-- at how their hyperthreading technology was out in the world for several years and really did not gain the kind of traction that might have made a significant difference for them at a time when AMD was pretty much kicking their butt with enthusiasts/gamers.

    Then along comes X2 and suddenly there are game patches that note that the game will see significant improvements on dual-core processors *and* Intel's old HT-enabled P4s. What does that tell you? It tells me that Intel did a lousy job of evangelizing HT to what should have been their showcase ISV audience for *several years*.
     
  5. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Well, Intel tries very hard, but at this point most developers still believe that faster cores are just around the corner. After it became clear that the free lunch is over, they were in a state of shock. Multithreaded programming was not on the skill list of most developers, and if you weren't one of the lucky teams that got direct help from Intel or AMD, you had to learn it the hard way on your own.
     
  6. 3dilettante

    Legend

    Joined:
    Sep 15, 2003
    Messages:
    6,749
    Location:
    Well within 3d
    Or that HT wasn't really all that great, and that using the P4 to spearhead SMT was not the best way to get people to multithread their code.

    The performance gains with HT were noticeably mixed, and it was a much cruder implementation of SMT than those found in other multithreaded processors of the time.
    SMT could hurt overall performance if done badly, and the P4 had a number of other architectural weaknesses that were worsened by HT, particularly in the way HT crudely divided shared buffers in half for each thread, and how multiple threads interacted with the P4's complex scheduling hardware.

    HT wasn't even fully fledged until several revisions of Netburst, and the best example of Netburst HT was Prescott. That core's other issues dragged HT down as well.

    Dual-core is much more reliable when it comes to maintaining single-threaded performance, and unlike HT, which appeared only on a subset of P4s, it was a market-wide shift.

    Dual cores most likely reached market just in time to benefit from the initial forays with HT, and in general they could be counted on to give consistent performance gains.
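
    Just to make the cache side of that concrete, a rough microbenchmark sketch (mine, purely illustrative): each thread sweeps a buffer small enough to live in an L1-sized cache when it has that cache to itself, but not once a second hardware thread is competing for the same capacity. Pin the two threads to the two logical CPUs of one HT core and then to two separate physical cores, and the throughput difference is the kind of effect being described.

    [code]
    // Rough sketch: two threads, each sweeping its own small buffer.
    // Alone, ~8 KB fits in a P4-class L1 D-cache; two threads competing
    // for one physical core's cache (as with HT) can evict each other
    // and lose much of that benefit. Compare runs with the threads
    // pinned to one HT pair versus two separate physical cores.
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    static void worker(std::vector<int>& buf, long iters, long& sink) {
        long sum = 0;
        for (long it = 0; it < iters; ++it)
            for (size_t i = 0; i < buf.size(); ++i)
                sum += buf[i];
        sink = sum; // keep the loop from being optimized away
    }

    int main() {
        const size_t elems = 2048;  // 2048 * 4 bytes = 8 KB per thread
        const long iters = 200000;
        std::vector<int> a(elems, 1), b(elems, 1);
        long s1 = 0, s2 = 0;

        auto start = std::chrono::steady_clock::now();
        std::thread t1(worker, std::ref(a), iters, std::ref(s1));
        std::thread t2(worker, std::ref(b), iters, std::ref(s2));
        t1.join();
        t2.join();
        auto stop = std::chrono::steady_clock::now();

        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
        std::printf("sum=%ld, elapsed=%lld ms\n", s1 + s2, ms);
        return 0;
    }
    [/code]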
     
  7. Tim Murray

    Tim Murray the Windom Earle of mobile SOCs
    Veteran

    Joined:
    May 25, 2003
    Messages:
    3,278
    Location:
    Mountain View, CA
    The problem isn't with HT, though, is it? Isn't it more of a problem with SMT slaughtering cache coherence in general? I remember benchmarks that showed that HT should be disabled on any machine running... Apache, I think? because performance tanked when the number of cache hits decreased enormously.

    I am waiting to see just what the graphics-oriented things they're working on are, though. A hybrid raytracer/rasterizer is all well and good, but it will never, ever catch on unless you have a fantastic API that is wonderful for developers to use. And then you need a killer app...
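
    As an aside on why raytracing keeps coming up alongside these throughput architectures: every ray is independent of every other ray, which is exactly the shape of work a wide data-parallel machine wants. A toy per-ray kernel, purely illustrative and nothing to do with whatever API Intel might actually ship:

    [code]
    // Toy per-ray kernel: ray-sphere intersection. Each pixel's ray can be
    // traced independently, which is why raytracing maps so naturally onto
    // wide data-parallel hardware - the open question is the API on top.
    #include <cmath>
    #include <cstdio>

    struct Vec3 { float x, y, z; };

    static Vec3  sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
    static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

    // Distance along the ray to the nearest hit, or -1 on a miss.
    static float hitSphere(Vec3 orig, Vec3 dir, Vec3 center, float radius) {
        Vec3 oc = sub(orig, center);
        float a = dot(dir, dir);
        float b = 2.0f * dot(oc, dir);
        float c = dot(oc, oc) - radius * radius;
        float disc = b * b - 4.0f * a * c;
        if (disc < 0.0f) return -1.0f;
        return (-b - std::sqrt(disc)) / (2.0f * a);
    }

    int main() {
        Vec3 eye = {0, 0, 0}, dir = {0, 0, -1}, center = {0, 0, -5};
        std::printf("hit at t = %f\n", hitSphere(eye, dir, center, 1.0f)); // expect 4.0
        return 0;
    }
    [/code]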
     
  8. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,083
    That was mostly for Northwood P4s, which suffered heavily from trace cache and D$ thrashing. Prescott added a lot of measures to better support SMT: a 4x bigger D$ with higher associativity, two trace caches (one for each active context), and many more registers. Of course, all these measures were negated in large part by Prescott's basic performance parameters: twice the D$ load-to-use latency (2 cycles to 4) and the much longer pipeline with its associated mispredict latency (although the branch predictor in Prescott is better).

    SMT on OOO processors looks like it's dead. In essence, you have to enlarge your I and D caches, provide a scheduler for each thread (because one big one holding all instructions in flight is simply too slow), and provide architected registers for each thread. All these structures become slower, so you introduce pipestages (or run at a lower clock), which impacts single-thread performance - and in the end all that effort goes toward better utilization of the execution units, which in a modern CPU take up less than 25% of the die (including massive SIMD FP units). Better to just replicate the entire core and get a guaranteed 2x speedup on independent threads.
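
    A trivial sketch of the "just replicate the core" case, purely illustrative: two completely independent workloads sharing nothing, which is exactly where a second full core buys you close to 2x.

    [code]
    // Two fully independent workloads with no shared data or memory
    // traffic - the case where a second complete core approaches a 2x
    // speedup. Compare the elapsed time against one thread doing both.
    #include <chrono>
    #include <cstdio>
    #include <thread>

    static void burn(long iters, long* out) {
        long acc = 1;
        for (long i = 0; i < iters; ++i)
            acc += i ^ (acc >> 3); // dependent integer work, no memory traffic
        *out = acc;
    }

    int main() {
        const long iters = 200000000L;
        long r1 = 0, r2 = 0;

        auto start = std::chrono::steady_clock::now();
        std::thread a(burn, iters, &r1);
        std::thread b(burn, iters, &r2);
        a.join();
        b.join();
        auto stop = std::chrono::steady_clock::now();

        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
        std::printf("result=%ld, elapsed=%lld ms\n", r1 + r2, ms);
        return 0;
    }
    [/code]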

    Cheers
     
    Tim Murray likes this.
  9. 3dilettante

    Legend

    Joined:
    Sep 15, 2003
    Messages:
    6,749
    Location:
    Well within 3d
    SMT was used successfully in IBM's POWER5. It wasn't for every workload, but it didn't have as many adverse effects as HT did for Netburst.

    IBM's method was more flexible, with software-controlled priority levels as well as hardware mechanisms that balanced overall instruction flow to keep a stalled thread from getting in the way of other threads.

    POWER5 was also a wider design than Netburst, with larger caches, and it wasn't as aggressively speculative as the P4.
    IBM had more spare units that could be used, and it didn't fill its queues as readily with instructions that would have to be replayed.

    There were a number of things that were characterized as glass jaws for the Netburst architecture. One big one was its highly speculative instruction scheduling and replay mechanism.

    The design's long pipeline and emphasis on speculation made it so that the chip would issue an instruction several cycles before it was known if a cache access would hit.
    It is usually the case that cache misses happen more often with SMT, and P4's smaller caches tended to feel the impact more than most.
    Since many instructions would be issued incorrectly, the P4's replay mechanism would loop the instructions back into the pipeline on every cache miss and access to memory.
    Not only that, but it was possible for the replay loop to get clogged by multiple replays during long dependency chains.

    For single-threaded performance, replay could be a headache, since the P4 would sometimes for no visible reason take hundreds or thousands of cycles to complete a simple stretch of code.
    For SMT, the massive amount of speculation consumed limited resources on a rather narrow core.

    SMT should have filled in stall cycles in the long P4 pipeline. The problem was that it had competition from the replay mechanism, which also tried to fill stall cycles. In pathological cases, the replay mechanism would fill stall cycles with instructions that inevitably stalled again and again.
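
    To give a feel for the kind of code that hits this worst, here's a sketch of my own (not taken from anywhere in particular): a pointer chase, i.e. a long chain of dependent loads where most of them miss. On a core that issues the dependent work before the hit/miss outcome is known, every one of those misses turns into replayed instructions rather than a simple stall.

    [code]
    // A long chain of dependent loads (pointer chasing). Each load's result
    // is needed to form the next address, and with a working set this large
    // most loads miss the caches. On a core that speculatively issues the
    // dependent work before hit/miss is known, each miss means replays.
    #include <cstdio>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        const size_t n = 1 << 22; // 4M entries (~32 MB), far larger than cache
        std::vector<size_t> next(n);
        for (size_t i = 0; i < n; ++i) next[i] = i;

        // Sattolo's algorithm: turn the identity into one big random cycle,
        // so the chase visits every slot and hardware prefetch can't help.
        std::mt19937 rng(42);
        for (size_t i = n - 1; i > 0; --i) {
            std::uniform_int_distribution<size_t> dist(0, i - 1);
            std::swap(next[i], next[dist(rng)]);
        }

        // The chase itself: exactly one dependent load per iteration.
        size_t p = 0;
        for (long it = 0; it < 50000000L; ++it)
            p = next[p];

        std::printf("%zu\n", p); // keep the loop from being optimized away
        return 0;
    }
    [/code]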

    Prescott theoretically improved its threading resources and replay mechanism.
    It was not enough to correct for the thermal ceiling that killed its clock scaling.

    edit:
    There is an interesting article on the replay mechanism on xbit:

    http://www.xbitlabs.com/articles/cpu/display/replay.html

    There's a section on its influence on hyperthreading that also includes a comparison between Prescott and Northwood that shows how much Prescott improved HT.
     
    #9 3dilettante, May 31, 2007
    Last edited by a moderator: May 31, 2007
    Tim Murray likes this.
  10. INKster

    Veteran

    Joined:
    Apr 30, 2006
    Messages:
    2,110
    Location:
    Io, lava pit number 12
    Huh ?!? :shock:

    "Larrabee" as a joint Intel-Nvidia effort.
    I wonder what this would mean for the future "Fusion" products (beyond the initially "simple" IGP/CPU on the same die)...
     
  11. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,115
    Location:
    Uffda-land
    I'm not taking that one to the bank yet. I can see why Intel would like Nvidia's participation. . .it's less clear to me why Nvidia would want to play ball unless there's a pretty sizeable revenue/royalty stream associated with it for them. Would Intel make that kind of deal? Doesn't seem in character for them.
     
  12. INKster

    Veteran

    Joined:
    Apr 30, 2006
    Messages:
    2,110
    Location:
    Io, lava pit number 12
    Extraordinary circumstances call for extraordinary partnerships, and the AMD/ATI merger was certainly one of them.
    Hannibal, in his late April article about "Larrabee" at arstechnica.com, shared some insider info about it which I found a bit... suspicious, given the constant references to the G80 architecture.
    This could be why Intel "named certain names" and excluded the R600 (aside from the fact that they are competitors, Intel could have used the AMD "Fusion" project as a bullet point for bashing, along the lines of "why our solution is better than theirs").

    Let's wait and see.
     
  13. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,115
    Location:
    Uffda-land
    Oh, I'm not ruling it out. I'm just saying when I read it I didn't exactly go "Ah, of course. . ."
     
  14. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Arun, many do agree that not a lot of everyday-use applications are going to benefit from ever-faster cores and from more of those cores running in parallel, but that may only hold for each application taken in isolation.

    A lot of people, intentionally or not (tons of programs installed as start-up items that do work while the PC is idle, or that steal a few cycles here and there), are running more and more programs/processes in parallel, and for users such as myself multi-core systems do pay off: the whole environment feels more responsive.
     
  15. nutball

    Veteran Subscriber

    Joined:
    Jan 10, 2003
    Messages:
    1,818
    Location:
    en.gb.uk
    This is an oft-stated argument. It scales to... maybe two cores on the desktop for a typical user. Four cores tops, but not for a typical user. It's not really a good justification for 8 or 16 CPU cores becoming the default option when buying a PC (unless our favourite operating system vendor can come up with new and even prettier ways to waste our computing resources for us).
     
  16. AlNets

    AlNets ¯\_(ツ)_/¯
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    17,842
    Location:
    Polaris

    Even so, I find the hard drive to be quite limiting when carrying out multiple tasks, despite dual core. I suppose ideally you'd have multiple programs on multiple hard drives from which to run those tasks. But there's too much conflicting use (thrashing?) of the single hard drive that many computers have (e.g. laptops, or template-built machines from Dell or HP).
     
