Is multiple core technology really needed for next-gen

Discussion in 'Console Technology' started by Ooh-videogames, Feb 17, 2004.

  1. SMarth

    Newcomer

    Joined:
    Feb 7, 2002
    Messages:
    43
    Likes Received:
    0
    That's because in SMP systems the processors are sharing the same resources (memory, bus, I/O), which are already slow for a single CPU. Global synchronization is also extremely costly. But there's never been any doubt in my mind that the future is asynchronous parallelism, in both hardware and software. There will always be a need to serialize and synchronize, of course; without that, reality may not make sense. But the idea is to restrict this to a minimum.

    Is this possible? Yes, absolutely. We just need to think a bit differently. After all, the art of parallel programming is in its infancy, while the hardware is barely existent.

    What fascinates me the most about 3D is how, more than anything else, it is contributing to the development of mainstream low-cost parallel architectures. Though currently most of that parallelism is hidden, it is still a step in the right direction. After all, without 3D, it could have taken decades more before we'd have seen the likes of what GPUs are offering us today. It's not that parallelism, asynchrony, or de-serialization is anything new, but it's just like gasoline: why change what works, even if it's primitive and limited? This is why I hope "Cell" will work well enough to inspire the rest of the industry to follow in its footsteps and push parallelism forward. But no matter what, one day everybody will go that way.

    The biggest limitation toward that goal is the hardware, not the software. Not easy to overcome, and we'll need much more than a few billion transistors to achieve the "thinking" and "living" computers which are our ultimate goal... no? :twisted: But by then we'll probably need 3D photonics to replace all that crappy electronics... :wink:
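    The point above about restricting serialization and synchronization to a minimum can be pictured with a minimal Python sketch (not from the post; function names are hypothetical): each worker accumulates into private state, so the only synchronization is a single join-and-merge at the end.

```python
# Sketch: minimize synchronization by giving each worker private state.
# The only serial points are thread startup and the final join/merge.
import threading

def parallel_sum(values, workers=4):
    chunks = [values[i::workers] for i in range(workers)]
    partials = [0] * workers          # one private slot per worker: no lock needed
    def work(idx):
        total = 0
        for v in chunks[idx]:         # all the real work is fully independent
            total += v
        partials[idx] = total
    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                      # the only synchronization point
    return sum(partials)              # short serial merge at the end
```

    The contrast is with a design where every worker updates one shared total under a lock, which serializes exactly the part that should stay parallel.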
     
  2. ERP

    ERP
    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Likes Received:
    49
    Location:
    Redmond, WA
    Strictly speaking, the current VU architecture has the same integer and float instruction throughput; that doesn't mean it's a good processor to run general code on.

    A lot of game data structures still require random access to memory, and while game programmers in general try to model structures as streams (at least the good ones do), sometimes it's either impossible or just impractical.
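    The stream-versus-random-access distinction above can be sketched in Python (illustrative only; names are hypothetical). The first form walks memory in storage order, which is what stream-oriented hardware wants; the second chases references, which is the access pattern that resists streaming.

```python
# Stream-friendly: the same update applied to elements in storage order.
def update_stream(positions, velocities, dt):
    for i in range(len(positions)):
        positions[i] += velocities[i] * dt

# Random access: each step follows a reference; the order of memory
# touched depends on where each node happens to point.
class Node:
    def __init__(self, pos, vel, nxt=None):
        self.pos, self.vel, self.next = pos, vel, nxt

def update_linked(head, dt):
    node = head
    while node is not None:
        node.pos += node.vel * dt
        node = node.next              # pointer chase
```

    In Python both run the same way, but on streaming hardware the first shape can be prefetched or DMA'd in bulk while the second cannot.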
     
  3. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Fully async logic has quite a bit of overhead; with the extremely short pipeline stages of high-performance processors, I'm not sure it would actually gain you anything.

    I think keeping clocks synchronous won't get much more difficult ... if we start using switched networks instead of long wires going everywhere, the domain in which the clock needs to be synchronous will get smaller with shrinks.

    As for how much parallelism we can use ... consider our brain, and its clock speed. The really interesting applications have more than enough parallelism to go around.
     
  4. Vince

    Veteran

    Joined:
    Apr 9, 2002
    Messages:
    2,158
    Likes Received:
    7
    True, but there is just a little bit of an architectural difference between a connectionist system and your traditional von Neumann architecture. :wink:

    Although that doesn't stop some concepts from transferring over, such as the thieves who stole AntiAliasing.
     
  5. Fafalada

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    2,773
    Likes Received:
    49
    Strictly speaking, no, it's still 1.16 : 1 ;)
    But you know I was talking about actual arithmetic throughput, data width, and completeness of the instruction sets, none of which the current VU complies with. And I can tell you that I've run into situations where every single one of these would have been good to have, even with the "simple" programs we usually write for VUs.

    Oh I agree, but then we all figure that's what we have that other processor for, next to the APUs, right?
    You can also do a whole lot more with non-streaming-friendly problems if APUs can indeed start in/out DMAs from their side.
     
  6. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
    Hehe... the current VUs have one single 16-bit ALSU with 16x16-bit registers: I would not call the integer processing power equal to the floating-point processing power.

    The APUs, even from IBM's own patents, promise to be different:

    1.) They can DMA from and to the Shared DRAM.

    2.) They can do some I/O work with external I/O devices ( through clever use of the DRAM's busy bits and other flags, if they are implemented ): you could, for example, have the I/O device send the data to a location in the Shared DRAM, and that data would be automatically forwarded to a selected portion of the interested APU's Local Storage. The same can be applied to send data to the I/O devices from the APU's Local Storage.

    3.) The 128x128-bit Register File can be used for floating-point and integer vectors as well as scalar values.

    4.) Per cycle, the APU's throughput is one of the following:

    a.) 1 FP Vector Instruction, or
    b.) 1 FX ( Integer ) Vector Instruction, or
    c.) 1 FP Scalar Instruction, or
    d.) 1 FX Scalar Instruction.


    5.) When executing scalar instructions, the APU is limited to a peak of 2 FP/FX ops/cycle, through the use of instructions like FP/FX MADD.
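    A back-of-the-envelope reading of the issue model above, as a Python sketch. The 4-lane split of the 128-bit registers and counting a MADD as two ops are assumptions on my part, not statements from the patents; the scalar case matches the 2 ops/cycle peak in point 5.

```python
# Rough peak ops/cycle implied by the issue rules above: one instruction
# per cycle, where a MADD counts as two ops (multiply + add), assuming
# a 128-bit register holds four 32-bit lanes.
LANES = 4          # assumed 32-bit elements in a 128-bit register
OPS_PER_MADD = 2   # multiply + add

def peak_ops_per_cycle(vector):
    width = LANES if vector else 1   # still one instruction issued per cycle
    return width * OPS_PER_MADD

print(peak_ops_per_cycle(vector=True))   # 8: vector MADD across 4 lanes
print(peak_ops_per_cycle(vector=False))  # 2: scalar MADD, as in point 5.)
```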
     
  7. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
    Great post. :)
     
  8. V3

    V3
    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    3,304
    Likes Received:
    5

    Hmm, they seem to be having a debate over multi-GHz vs. multi-core.
     
  9. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    "The Intel senior principal architect Douglas Carmean introduced himself as the panel's token speed enthusiast. He dismissed skepticism by offering performance data that simply concluded that on 3D rendering tasks, higher clock frequency translated directly into faster completion times. He then generalized these results by arguing that many of the tasks real users cared about were still single-threaded, unparallelized and big"

    You would think that if that were the case he could have picked an example which most users aren't already executing on a parallel, low-clocked processor ... hell, if you extrapolate from the past, then in one or two more generations real users won't even be running 3D rendering tasks on his processors anymore.
     
  10. nobie

    Regular

    Joined:
    Nov 11, 2003
    Messages:
    353
    Likes Received:
    0
    Location:
    Texas, USA
    [image: graph of Amdahl's law speedup vs. number of processors]
     
  11. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,360
    Likes Received:
    1,377
    THANKS!
    For digging out the quote. I've been referring to it for ages, but I haven't kept it around in quotable form.

    To what degree it will raise its head in the context of small-scale parallel processing in a games console is debatable. (Also, as was alluded to above, it gets increasingly complex/costly to wring out relatively small performance benefits for problems that aren't amenable to parallel coding.)
     
  12. Dio

    Dio
    Veteran

    Joined:
    Jul 1, 2002
    Messages:
    1,758
    Likes Received:
    8
    Location:
    UK
    One should note that that diagram is for a particular value of B. It is not always true that '16 processors' means '6x rather than 16x performance' (i.e. Your Mileage May Vary).

    It should also be noted that the same problem affects systems with non-identical but parallel processing units (i.e. VPUs) as well. Because VPU systems are pipelined, we do our internal analysis mostly on 'where are the bottlenecks', which is easier conceptually.
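    For reference, Amdahl's law with serial fraction B gives speedup 1/(B + (1 - B)/N) on N processors. A quick check in Python (B = 10% is an assumed value, chosen because it reproduces the "16 processors, roughly 6x" shape of the diagram being discussed):

```python
def amdahl_speedup(n, b):
    """Amdahl's law: speedup on n processors with serial fraction b."""
    return 1.0 / (b + (1.0 - b) / n)

print(amdahl_speedup(16, 0.10))   # ~6.4: the "6x rather than 16x" case
print(amdahl_speedup(16, 0.001))  # near 16x when almost nothing is serial
```

    So the "particular value of B" point stands: the same 16 processors give anything from ~6x to nearly 16x depending on the serial fraction.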
     
  13. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Don't use Amdahl's law unless you are willing to discuss the input parameters ... it is a little like all the drones using Drake's equation to "support" their pre-existing opinion on extraterrestrial life.
     
  14. MrWibble

    Regular

    Joined:
    Feb 9, 2002
    Messages:
    715
    Likes Received:
    43
    I dunno. He kinda/sorta has a point about most people not currently running (massively) parallel code. But then most people don't really have massively parallel architectures to use, so why would developers target such things in the first place? On a sequential architecture, parallel code will run slower...

    If architectures head down the parallel route to get performance, new algorithms will come to the fore to take advantage. If that happens the game flips around and the platforms without good parallelism become the ones that run like dogs.

    Speaking as a programmer, I think if I had an infinitely fast machine, I'd rather it were serial than parallel, because on the whole that's easier to deal with. But with the tangible limitations of the real world, my primary motivation is speed, and so I'll take whichever is ultimately faster, even though I might have to jump through some hoops to take advantage. It looks like many architects think parallel is the way to go right now to get more bang per buck, so I guess we'll deal with that. Plenty of the code I write ought to be quite happy running in parallel; it's just not necessarily structured like that yet.

    Also, as a footnote to this, I suspect that if it were as easy to clock a chip up by a factor of 16 (or even 6) as it ought to be to connect that many slower cores together, then chip makers would do exactly that. And shoving things together in parallel doesn't prevent you from taking advantage of the speed-ups possible in less parallel architectures either, so it's not like the parallel chips are going to be running at drastically lower clock rates.

    If the clocks on serial devices could be sped up 16x, but 16x the processors only gets you a 6x improvement, then we'd only need to crank the clocks on those parallel processors by 3x or more to get a better improvement.

    The graph is only flat at its extremity, based on a particular task. It also doesn't seem to take into account the speedup from doing *several different tasks at once*. If I have 3 independent tasks, and 3 processors to run them on, it ought to be pretty clear that I can do those tasks 3x faster than with one processor, barring an outside constraint such as memory bandwidth. The graph also shows good improvements for lower numbers of parallel processors, so why not choose to do a moderate amount of work in parallel and *also* boost the clock?

    I'm rambling, so I'll shut up now.
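    The "several different tasks at once" case above can be sketched with a thread pool in Python. The three tasks are hypothetical stand-ins (and in CPython these particular threads won't truly overlap CPU work because of the interpreter lock); the point is the task structure, not the timing: the jobs share nothing, so nothing in Amdahl's serial fraction couples them.

```python
# Sketch of N independent tasks on N workers: three unrelated jobs
# submitted to a pool, so (bandwidth permitting) they can overlap fully.
from concurrent.futures import ThreadPoolExecutor

def physics():   return sum(i * i for i in range(1000))            # hypothetical task
def audio():     return max(range(500))                            # hypothetical task
def pathfind():  return len([x for x in range(300) if x % 7 == 0]) # hypothetical task

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(f) for f in (physics, audio, pathfind)]
    results = [f.result() for f in futures]
```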
     
  15. nobie

    Regular

    Joined:
    Nov 11, 2003
    Messages:
    353
    Likes Received:
    0
    Location:
    Texas, USA
    It's true, the extremity at which separating threads becomes disadvantageous will not be reached any time soon. Even the "BE" form of Cell is basically only a 4-processor system, although presumably it could run a separate thread on each APU. It would probably be best to isolate the parallel tasks into their own threads, and, for example, have each APU work on a separate vertex or a separate pixel.

    I'm considering things here from a long-term perspective, even beyond PS3 and into the next decade. The truth is there most likely will not be a reliance on multi-threading, clock speed, or thread-level performance alone. All three will be milked for everything they can give, and all three will eventually run out of steam.
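    The "separate vertex per APU" split suggested above, sketched as plain data parallelism in Python (the slicing scheme and names are hypothetical): identical work applied to disjoint slices of the vertex list, with no communication between slices until the final interleave.

```python
# Sketch of the per-"APU" data-parallel split: each slice gets the same
# transform, independently, then the results are interleaved back.
def transform(vertex, scale):
    x, y, z = vertex
    return (x * scale, y * scale, z * scale)

def transform_all(vertices, scale, workers=4):
    slices = [vertices[i::workers] for i in range(workers)]    # one slice per "APU"
    done = [[transform(v, scale) for v in s] for s in slices]  # each slice is independent
    out = [None] * len(vertices)
    for i, s in enumerate(done):                               # interleave back
        out[i::workers] = s
    return out
```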
     
  16. nobie

    Regular

    Joined:
    Nov 11, 2003
    Messages:
    353
    Likes Received:
    0
    Location:
    Texas, USA
    The figures themselves are not as significant as the curve of the chart. Adjusting for the granularity of the grid, the curve follows the same trend whether B is 10% or 0.1%.
     
  17. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    What kind of a title is that?
     
  18. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    A straight line is part of the equation too; it is irrelevant to the question at hand if you don't pick parameters ... pick a computationally relevant part of a game engine which you think won't scale, and then we will see if we can make something meaningful out of Amdahl's law.
     
  19. Vince

    Veteran

    Joined:
    Apr 9, 2002
    Messages:
    2,158
    Likes Received:
    7
    Nobie, my previous objection to your invoking of Amdahl's law still makes me wonder. Now, I'm not sure, but explain this to me.

    It was my understanding that this law holds when trying to accelerate one entity, whatever it is, with N processors, and describes the diminishing returns you'll experience.

    What happens when you don't accelerate one entity with N processors, but rather N entities with N processors concurrently? How is this influenced?

    Why work on one entity with a plurality of processors, as if you're trying to emulate a serial pipeline, when you can just compute in parallel en masse? I have to be missing something; I'd like to know what.
     
  20. nobie

    Regular

    Joined:
    Nov 11, 2003
    Messages:
    353
    Likes Received:
    0
    Location:
    Texas, USA
    I'm not sure I understand what you mean by one entity. If a program is divided up into threads, and each processor is running a different thread, wouldn't this be N entities on N processors? This is what I'm referring to. Obviously, running a single-threaded application on multiple processors won't do you any good.

    Alright, I made an Excel spreadsheet out of the formula. The figures I plugged in are from the graph I posted, but you can play around with it and see for yourself.
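    For anyone without the spreadsheet, a rough Python stand-in that tabulates the same formula for a few values of B, showing the curve bending the same way whether B is 10% or 0.1%:

```python
# Tabulate Amdahl's-law speedup for a few serial fractions B and
# processor counts N, as a stand-in for the spreadsheet.
def speedup(n, b):
    return 1.0 / (b + (1.0 - b) / n)

for b in (0.10, 0.01, 0.001):
    row = [round(speedup(n, b), 2) for n in (1, 2, 4, 8, 16)]
    print(f"B={b}: {row}")
```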
     