Predict: The Next Generation Console Tech

Discussion in 'Console Technology' started by Acert93, Jun 12, 2006.

Thread Status:
Not open for further replies.
  1. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
    Are you sure that both patents relate to the same chip?
     
  2. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    21,710
    Likes Received:
    7,349
    Location:
    ಠ_ಠ
    You will need a TIFF plug-in. If you go to the help page, you will find a couple of links for IE/Netscape/Opera.

    edit: here's the link
    http://www.uspto.gov/patft/help/images.htm
     
  3. one

    one Unruly Member
    Veteran

    Joined:
    Jul 26, 2004
    Messages:
    4,835
    Likes Received:
    156
    Location:
    Minato-ku, Tokyo
    Do you mean the patent in my post? It was filed before the launch of the Xbox 360 in 2005; it describes the Xbox 360's architecture with the high-speed bus to Xenos. It's not the same chip; it's the inventors that are the same. So Rochester is most likely the location of the Xbox CPU design center. Apparently Microsoft is hiring engineers, including ones from AMD, for their next chip, but I am still not 100% sure whether Microsoft chose IBM as the design partner or not. At least the new patent suggests people at Rochester are developing a possible candidate.
     
  4. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
    Thanks Alstrong :)

    One, I don't get it: the date on the patent you linked is 19 June 2007, no?

    I think I'm completely lost now.
     
    #644 liolio, Jun 4, 2008
    Last edited by a moderator: Jun 4, 2008
  5. one

    one Unruly Member
    Veteran

    Joined:
    Jul 26, 2004
    Messages:
    4,835
    Likes Received:
    156
    Location:
    Minato-ku, Tokyo
    No, that's the publication date. The original filing date is February 24, 2005.
     
  6. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    21,710
    Likes Received:
    7,349
    Location:
    ಠ_ಠ
    When filing a patent, there is a period of at least 18 months before the documents are made public. Delays can be due to incomplete documentation. Examination for approval of the patent is requested after the documents are made public. It can take years before a patent is finally granted. Part of the problem is the number of patents, as they are generally examined on a first-come, first-served basis, i.e. a long queue.
     
  7. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
    OK, makes sense... :lol: talk about an idiot... :lol:

    Thanks for your answers (between sounds way better on paper than it really is... :lol: )
     
    #647 liolio, Jun 4, 2008
    Last edited by a moderator: Jun 4, 2008
  8. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,677
    Likes Received:
    194
    I agree. The PS4 cannot use PPEs that are anything like the one in PS3.
    When I said next-gen PPEs, I meant basically any CPU core that's several generations beyond the PPE. Something POWER6/7 based, maybe. Whatever it is, a much more robust CPU core than the heavily stripped-down in-order PPE. I think they need to go back to OoOE. There would only need to be a few of these (2 to 4) though, since, as with current-gen CELL, most of the processing power is delegated to the SPEs.
     
  9. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    If you don't mind, could you expand on this a bit? What about the local store programming makes the thread context size get so big? The way I thought of it is that with lots of light threads you don't have to do much context switching, as most threads just run their course.
     
  10. Crazyace

    Regular

    Joined:
    Feb 9, 2002
    Messages:
    333
    Likes Received:
    6
    Also, if you context switch on a PPE, don't you switch 32 int registers, 32 double registers, and 32 VMX registers (as well as various SPRs)? 128 registers for an SPE isn't drastically more...
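
    To put rough numbers on that comparison, here is a small sketch. The register counts are the ones in the post; the widths (64-bit integer and double registers, 128-bit VMX and SPE registers) and the 256 KB local store are my assumptions from the public Cell documentation, and SPRs are left out. Whether a given scheduler has to save the whole local store on a switch is also an assumption. The names are made up for illustration.

```python
# Register files as (register count, width in bits).
# Counts come from the post; widths are assumptions, SPRs are ignored.
PPE_REGISTERS = {
    "GPR (integer)": (32, 64),
    "FPR (double)": (32, 64),
    "VMX (vector)": (32, 128),
}
SPE_REGISTERS = {"unified": (128, 128)}

# Assumption: 256 KB of local store per SPE.
SPE_LOCAL_STORE_BYTES = 256 * 1024


def register_bytes(files):
    """Total bytes needed to save every register in the given files."""
    return sum(count * bits // 8 for count, bits in files.values())


ppe_bytes = register_bytes(PPE_REGISTERS)
spe_bytes = register_bytes(SPE_REGISTERS)

print("PPE registers per thread:", ppe_bytes, "bytes")
print("SPE registers:", spe_bytes, "bytes")
print("SPE registers + local store:", spe_bytes + SPE_LOCAL_STORE_BYTES, "bytes")
```

    On registers alone the SPE comes out at about twice the PPE, which fits "not drastically more". If the local store has to be saved as well, that part dominates the context size.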
     
  11. chachi

    Newcomer

    Joined:
    Sep 15, 2004
    Messages:
    120
    Likes Received:
    3
    The 970FX was a processor IBM designed for Apple several years ago to try to get into a power envelope that would let them use it in a mobile setting. I think they ended up using them in iMacs.

    http://en.wikipedia.org/wiki/PowerPC_970

    Obviously Apple decided IBM couldn't provide them with a competitive CPU (at least without investing a significant amount in R&D, or so the story goes), and in hindsight it's hard to argue that they were wrong to leave.
     
  12. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
    Well, another thing in favor... or not ;)... of these VTEs is that they have only 32 registers (in the patent they are 128 bits wide, but I would not view 256-bit registers in a bad light if they go ahead with the dual SIMD unit per VTE idea, as they mention in this and other patents).
     
  13. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
    IBM docs suggest you avoid preemptive thread scheduling and only create as many SPE threads as you have active physical SPEs (8 on a CELL BE and 16 on a dual-chip Blade); if you really must have more SPE threads than physical SPEs, you should allow each thread to run to completion.

    (CBE Handbook, p. 351)

    In and of itself programming for a local store does not make context switching a priority: it just makes for a rather inflexible programming model for the chip. Avoid context switches at any cost basically :).

    When I made the comment about lots of light threads (I perhaps misspoke when I used the word "light"), I meant it more in the sense of SoEMT: on a cache miss or other stall condition, the thread is put into a sleep state and a ready thread is activated, which more closely resembles what you also have on GPUs (it is not trivial to do what GPUs do with fragments in the same batch to cut latency down).

    Something like this CELL v2/v3 would be better suited to also handle work this way (not to mention the strategy Sun and Intel worked on with work/scout threads which keep going ahead of stall conditions to prefetch data and instructions, move past branches, etc...).

    You can stay away from many context switches, you can avoid stalls by manually DMA-ing data in while you do work on the previously received set of data, but...

    1.) you can still give hints to the HW prefetcher to do pretty much the same thing (control is not taken away from you completely, and in most chips you can lock cache lines to get some of your LS back).

    2.) it "requires" lots of micro-management from programmers. Not that cache based architectures allow you to completely forget about things such as cache size, data and code size and access, etc... you gain quite a bit of performance paying attention to those elements, but they are more forgiving in that aspect.
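
    The "DMA the next set in while you work on the previous one" idea is usually called double buffering. A minimal sketch of the pattern in Python, just to show the control flow: this is not SPE code, the single worker thread only stands in for an asynchronous DMA engine, and `fetch`, `process` and `double_buffered` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(chunks, index):
    """Stand-in for an asynchronous DMA transfer into a local buffer."""
    return list(chunks[index])


def process(buffer):
    """Stand-in for the compute kernel that runs on local data only."""
    return sum(x * x for x in buffer)


def double_buffered(chunks):
    """Process chunks in order while the next chunk is being fetched."""
    results = []
    if not chunks:
        return results
    # One worker plays the role of the DMA engine.
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, chunks, 0)
        for i in range(len(chunks)):
            current = pending.result()  # wait until chunk i has arrived
            if i + 1 < len(chunks):
                # Kick off the transfer of chunk i + 1 before computing.
                pending = dma.submit(fetch, chunks, i + 1)
            results.append(process(current))  # compute overlaps the transfer
    return results


print(double_buffered([[1, 2], [3], [4, 5]]))
```

    The micro-management mentioned in point 2 is visible even here: the programmer decides the chunk size, when each transfer starts, and when it is safe to touch a buffer.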
     
    #653 Panajev2001a, Jun 5, 2008
    Last edited by a moderator: Jun 5, 2008
  14. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,214
    Likes Received:
    1,202
    I think we need to be a bit more precise when using that "forgiving" word. I can feel my blood starting to boil here. :)
    In my opinion and experience cache based UMA multiprocessors are NOT forgiving.
    An easy programming model to port to, yes.
    Easy to get high utilization of computational resources, hell no.

    SMP UMA systems are great if you want to speed up multiple instances of legacy code as in classical business server applications, or (less good) if you want some improvement when porting an application and do some "low hanging fruit picking" in terms of threading.
    But there are bottlenecks in the (access to) shared resources, and there are contention problems, and there is the problem that the programming model doesn't help you much in dealing with these issues - indeed the goal is rather to help abstracting the underlying complexities away.

    Coming from scientific computing, what I really liked about the BE was that it helps make the computational behaviour deterministic. There are separate memory pools that belong to each SPE that won't be stomped by other processors or threads, there are robust mechanisms for transferring data between processors, separate communications pathways for main memory and the coprocessor,... neatly partitioned, and relatively predictable. It comes from my world. Yes, you have to structure your problem to fit the hardware to get good utilization, but if you do, predicting the results is relatively straightforward.

    Compare this to, say, the Xbox 360 setup, where you can have six hardware threads, all sharing the same cache, and if these threads evict each other's cached data (unpredictably, unless you lock by hand, and poof, there goes that ease of porting) it will generate bus traffic, over the same bus that not only handles main memory access and cache traffic, but also all communication between CPU and GPU. And that main memory is also accessed by the GPU, which has its own needs and ideas in terms of memory access. This may be a straightforward platform to port to, but to get high utilization and ensure consistent and predictable response in different situations is another matter entirely.

    My experience with SMP UMA systems has been that if you want high utilization out of them, "forgiving" is simply not an appropriate adjective.

    They are, and pardon my clear language, a fucking horrible mess that lacks not only the underlying hardware organization, but often also the band-aid software tools needed to analyze and help control the overall dataflow of the system. Coarse-grained parallelism over two or possibly four processors - OK. Maybe.
    Beyond that, you are deep into blood-vessel-bursting territory. Again, for performance-critical work. Horses for courses applies here as everywhere else. But if that is what you're doing... "forgiving" - no, not really.

    (* Slowly unclenches jaws *)
     
  15. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
    Edit: I do appreciate deterministic behavior, I can see why it is so important for you (you have got a very good point there). The problem might be to

    Easy programming model to port to, easier to get "more than decent" performance (provided the multi-threaded design of the project has some forethought behind it... but LS or caches, without being able to design parallel applications we can hang our keyboards to the wall... so little point complaining there)...

    I think that when it comes to a games console machine (and evidently since cache based architectures do not seem to die in the HPC field ;)) those do not seem a bad thing at all.

    Of course with more focus on high resources utilization the ease of porting goes to the sidelines, but helping developers get acceptable performance before going face down into the hardware might not always be a bad thing.

    Would we encourage people to forget the good lessons about data structuring (and making the project fit the architecture) they learned with architectures like CELL or tapping large multi-processors systems ? I'd hope not.

    I also am not too keen on the idea of just telling people "life sucks, get a helmet" and keep on bringing a steeper and steeper learning curve each hardware generation without ever looking back.

    Getting back on topic, any good engineer would have their blood boiling with all those unused transistors and untapped FLOPS ;).

    Obtaining very high performance out of a cache based parallel processor is still definitely possible even though it might be hard to go from decent/good performance to very high performance... maybe even harder than with a well designed Local Store based architecture (where going from horrible performance to decent performance is not exactly trivial either).

    A very important question (the $1 Billion question, so to speak) is where exactly our efforts on the combination of OS + programming languages (new or extended) + hardware resources should focus, and how best to attack the problem holistically on all three fronts.

    Is sticking with a Local Store based architecture and simply adding more cores the answer ?

    I do not know; if I knew the answer with a good degree of confidence and evidence to back it up, I'd be too busy counting money to write here, I'd think ;). (ok... I'd still find time, but under a different nick-name :D).

    Is Xbox 360's Xenon SMP style the only future we can look at ? There are many takes on the cache based multi-processor concept going on in companies at the moment, all coming from different angles: Intel is moving there from three directions (Nehalem will bring them to the 8+ cores arena, Larrabee also marches toward the massive multi-core land, IA-64 means to get there too [and I do not think experience designing and evolving that architecture over time will be fruitless for Intel]... three different software approaches too), SPARC is getting there from two directions (Rock and Niagara I/II/etc...), and even ARM is entering the SMP arena finally.
     
    #655 Panajev2001a, Jun 5, 2008
    Last edited by a moderator: Jun 5, 2008
  16. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
    This is the dual Vector Units for each VTE (shared register file between the two Vector Units):

    http://appft1.uspto.gov/netacgi/nph...AN/"International+Business+Machines"+AND+SIMD

    They do make the case that you could issue two vector instructions at the same time and that each Vector Unit could work on different registers (input and output registers)... you would at worst duplicate the 3x128 bit input lane and the 1x128 bit output lane going to the register file.

    Nice, that would make the number of registers a major key variable to tweak to get high execution units utilization: the two VU's can work together sharing resources or they can work separately.

    You could issue:

    VU_A ............. VU_B
    Scalar .......... Vector
    Scalar .......... Scalar
    Vector .......... Scalar
    Vector .......... Vector
    Vector .......... NOP

    etc... etc... In some cases the execution units of both VU's would be working on the same vector instruction (like the cross-product example they make).
     
  17. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,677
    Likes Received:
    194
    IMO a CELL chip for PS4 with only 2x the performance of CELL in PS3 would be pretty disappointing.

    Especially since Jim Kahle said in 2006 that by 2010 IBM would have a Teraflop on a chip, using 32 SPEs. That's the baseline minimum I'd expect for PS4 in 2011-2013.

    http://blogs.mercurynews.com/aei/2006/10/31/the_playstation/


    Now four to five times the performance of PS3 CELL, for next-gen CELL and probably also for PS4, sounds much more reasonable. And yet, that's a much smaller leap than the one from PS2 to PS3 (6.2 GFLOPs ==> 218 GFLOPs = ~35 times).
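
    A quick check of those ratios, using only the figures quoted in this post (6.2 GFLOPs for PS2, 218 GFLOPs for PS3, and the 1 teraflop forecast attributed to Kahle). Nothing here is measured, and the constant and function names are made up for illustration.

```python
# Figures as quoted in the post, in GFLOPs. They are not verified here.
PS2_GFLOPS = 6.2
PS3_GFLOPS = 218.0
TERAFLOP_CELL_GFLOPS = 1000.0  # forecast: a teraflop on a chip with 32 SPEs


def generation_leap(new_gflops, old_gflops):
    """How many times faster the newer part is, by peak GFLOPs."""
    return new_gflops / old_gflops


ps2_to_ps3 = generation_leap(PS3_GFLOPS, PS2_GFLOPS)
ps3_to_teraflop = generation_leap(TERAFLOP_CELL_GFLOPS, PS3_GFLOPS)

print(f"PS2 -> PS3: {ps2_to_ps3:.1f}x")
print(f"PS3 -> teraflop CELL: {ps3_to_teraflop:.1f}x")
```

    By these numbers a teraflop CELL lands inside the "four to five times" range, against roughly 35 times for the previous generation change.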

    BTW, what Kahle said in 2006 is not "crazy Ken Kutaragi" talking. That's far, far more conservative forecasting from the chief CELL architect.


    What do you think Pana?
     
  18. Carl B

    Carl B Friends call me xbd
    Moderator Legend

    Joined:
    Feb 20, 2005
    Messages:
    6,266
    Likes Received:
    63
    What folk will be capable of fabbing and what will go into consoles are two entirely separate things, however. Yes, IBM will be able to put out a chip like that, but I doubt that it would be for anyone other than their HPC customers. The problem for Sony, Microsoft, Nintendo(?) is that the rate at which chip costs diminish each gen has been thrown into a vat of molasses. Where before Sony (under KK notably) was willing to bite the bullet early on, knowing that chip costs would reduce significantly in time, to repeat that in the year 2011 (unless something drastic changes) would mean biting that bullet for quite a while. Thus the whole question becomes what their initial silicon budget is given this reality, rather than a question of what is possible.
     
  19. kyetech

    Regular

    Joined:
    Sep 10, 2004
    Messages:
    532
    Likes Received:
    0
    Yeah, and it's really starting to frustrate me that so many people still believe there will be huge 4 billion transistor GPUs with 8 GB of memory, 700 GB/s of bandwidth, and all these ludicrous specs.
     
  20. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,677
    Likes Received:
    194
    I could see a 2 billion transistor GPU.

    GT200 / GTX 280 is already ~1 billion+


    4 GB RAM total (system memory + graphic memory or unified).


    400-500 GB/sec system bandwidth. I don't see why not. First-generation XDR was meant to be capable of up to ~102 GB/sec bandwidth. XDR2 is capable of ~200 GB/sec, and Rambus announced the goal of hitting 1000 GB/sec (1 TB/sec) in 2010. I'm only advocating half that for sometime after 2010, when PS4 is going to show up.


    For next-gen Xbox, if it again uses EDRAM, the bandwidth on that could be a terabyte or several TB per second, since the EDRAM in Xbox 360 is already at 1/4 of a TB/sec.
     
    #660 Megadrive1988, Jun 6, 2008
    Last edited by a moderator: Jun 6, 2008