Predict: The Next Generation Console Tech

Are you sure that both patents relate to the same chip?
Do you mean the patent in my post? It was filed before the launch of Xbox 360 in 2005, and it describes Xbox 360's architecture with the high-speed bus to Xenos. It's not the same chip; it's the inventors that are the same. So Rochester is most likely the location of the Xbox CPU design center. Apparently Microsoft is hiring engineers, including some from AMD, for their next chip, but I am still not 100% sure whether Microsoft chose IBM as the design partner. At least the new patent suggests people at Rochester are developing a possible candidate.
 
Thanks Alstrong :)

One thing I don't get: the date on the patent you linked is 19 June 2007, no?

I think I'm completely lost now.
 
When filing a patent, there is a period of at least 18 months before the documents are made public. Delays can be due to incomplete documentation. Examination for approval of the patent is requested after the documents are made public. It can take years before a patent is finally granted. Part of the problem is the number of patents as they are generally examined on a first-come-first-serve basis i.e. a long queue.
 
Ok, that makes sense... :LOL: talk about an idiot... :LOL:

Thanks for your answers (by the way, it sounds way better on paper than it really is... :LOL: )
 
The current PPE design is seriously flawed: high-latency cache, strange pipeline quirks.
I prefer one healthy horse to two exhausted ones :D

I agree. The PS4 cannot use PPEs that are anything like the one in PS3.
When I said next-gen PPEs, I meant basically any CPU core that's several generations beyond the PPE. Something POWER6/7 based, maybe. Whatever it is, a much more robust CPU core than the heavily stripped-down in-order PPE. I think they need to go back to OoOE. There would only need to be a few of these (2 to 4), though, since like current-gen CELL, most of the processing power would be deferred to the SPEs.
 
Local Stores have the problem of ballooning the thread's context size, putting a stop to the plethora-of-light-threads approach that everyone else and their mother in the industry is pouring tons of R&D resources into (including academia).
If you don't mind, could you expand on this a bit? What about local store programming makes the thread context size get so big? The way I thought of it is that with lots of light threads you don't have to do much context switching, as most threads just run their course.
 
Also, if you context switch on a PPE, don't you switch 32 int registers, 32 double registers, and 32 VMX registers (as well as various SPRs)? 128 registers for an SPE isn't drastically more...
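For scale, here is a rough back-of-the-envelope in plain C (counting only architectural state, so OS bookkeeping and SPRs are ignored; the totals are illustrative): the register files really are in the same ballpark, but a preemptively switched SPE thread also has to drag its entire 256 KB Local Store along, which is where the context balloons.

```c
#include <stdio.h>

int main(void)
{
    /* PPE architectural state per hardware thread (SPRs/OS state ignored). */
    size_t ppe_ctx = 32 * 8      /* 32 x 64-bit GPRs            */
                   + 32 * 8      /* 32 x 64-bit FPRs            */
                   + 32 * 16;    /* 32 x 128-bit VMX registers  */

    /* SPE state: 128 x 128-bit unified registers...             */
    size_t spe_regs = 128 * 16;
    /* ...plus the whole 256 KB Local Store, which has no backing
       store and must be swapped out on a preemptive context switch. */
    size_t spe_ctx  = spe_regs + 256 * 1024;

    printf("PPE register context : ~%zu bytes\n", ppe_ctx);   /* ~1 KB   */
    printf("SPE register context : ~%zu bytes\n", spe_regs);  /* ~2 KB   */
    printf("SPE context incl. LS : ~%zu bytes\n", spe_ctx);   /* ~258 KB */
    return 0;
}
```

So it is not the 128 registers per se; it is the Local Store that has to travel with the thread.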
 
Do you mean the patent in my post? It was filed before the launch of Xbox 360 in 2005, and it describes Xbox 360's architecture with the high-speed bus to Xenos. It's not the same chip; it's the inventors that are the same. So Rochester is most likely the location of the Xbox CPU design center. Apparently Microsoft is hiring engineers, including some from AMD, for their next chip, but I am still not 100% sure whether Microsoft chose IBM as the design partner. At least the new patent suggests people at Rochester are developing a possible candidate.
The 970FX was a processor IBM designed for Apple several years ago to try to get into a power envelope that would let them use it in a mobile setting; I think they ended up using them in iMacs.

http://en.wikipedia.org/wiki/PowerPC_970

Obviously Apple decided IBM couldn't provide them with a competitive CPU (at least not without investing a significant amount in R&D, or so the story goes), and in hindsight it's hard to argue that they were wrong to leave.
 
Also, if you context switch on a PPE, don't you switch 32 int registers, 32 double registers, and 32 VMX registers (as well as various SPRs)? 128 registers for an SPE isn't drastically more...

Well, another thing in favor... or not ;)... of these VTEs is that they have only 32 registers (in the patent they are 128 bits wide, but I would not see 256-bit registers in a bad light if they go ahead with the dual-SIMD-unit-per-VTE idea they mention in this and other patents).
 
If you don't mind, could you expand on this a bit? What about local store programming makes the thread context size get so big? The way I thought of it is that with lots of light threads you don't have to do much context switching, as most threads just run their course.

IBM docs suggest you avoid preemptive thread scheduling and only create as many SPE threads as you have physical SPEs active (8 on a CELL BE and 16 on a dual-chip Blade): if you really have to have more SPE threads than physical SPEs, you should allow each thread to run to completion.

(CBE Handbook, p. 351)

In and of itself, programming for a local store does not make context switching a priority: it just makes for a rather inflexible programming model for the chip. Basically, avoid context switches at any cost :).
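To make that run-to-completion recommendation concrete, here is a minimal sketch of the scheduling style (plain C with pthreads standing in for the SPE runtime, so treat it as an illustration of the model rather than of the actual libspe API): exactly one worker per physical SPE, each draining jobs to completion, never preempting a job in flight.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_PHYSICAL_SPES 8   /* one worker per physical SPE, never more */
#define NUM_JOBS          64

static int next_job = 0;
static pthread_mutex_t job_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for "upload a program to an SPE and let it run to completion". */
static void run_job_to_completion(int worker, int job)
{
    printf("worker %d finished job %d\n", worker, job);
}

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    for (;;) {
        pthread_mutex_lock(&job_lock);
        int job = (next_job < NUM_JOBS) ? next_job++ : -1;
        pthread_mutex_unlock(&job_lock);
        if (job < 0)
            break;                      /* no more work: worker exits  */
        run_job_to_completion(id, job); /* never preempted mid-job     */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_PHYSICAL_SPES];
    for (long i = 0; i < NUM_PHYSICAL_SPES; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_PHYSICAL_SPES; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

The point is that scheduling stays at job granularity, so the expensive "save 256 KB of LS" case simply never happens.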

When I made the comment about lots of light threads (I perhaps misspoke when I used the word "light"), it was more in the sense of SoEMT: a cache miss or other stall condition ==> the thread is put into a sleep state and a ready thread is activated, which more closely resembles what you also have on GPUs (it is not trivial to do what GPUs do with fragments in the same batch to cut latency down).

Something like this CELL v2/v3 would be better suited to handle work this way too (not to mention the strategy Sun and Intel worked on with scout/helper threads, which keep running ahead of stall conditions to prefetch data and instructions, move past branches, etc...).

You can stay away from many context switches, and you can avoid stalls by manually DMA-ing data in while you work on the previously received set of data (the classic double-buffering pattern; see the sketch after the two points below), but...

1.) with a cache you can still give hints to the HW prefetcher to do pretty much the same thing (control is not completely taken away from you, and in most chips you can lock cache lines to get some of your LS back).

2.) it "requires" lots of micro-management from programmers. Not that cache based architectures allow you to completely forget about things such as cache size, data and code size and access, etc... you gain quite a bit of performance paying attention to those elements, but they are more forgiving in that aspect.
 
2.) it "requires" lots of micro-management from programmers. Not that cache based architectures allow you to completely forget about things such as cache size, data and code size and access, etc... you gain quite a bit of performance paying attention to those elements, but they are more forgiving in that aspect.

I think we need to be a bit more precise when using that "forgiving" word. I can feel my blood starting to boil here. :)
In my opinion and experience cache based UMA multiprocessors are NOT forgiving.
An easy programming model to port to, yes.
Easy to get high utilization of computational resources, hell no.

SMP UMA systems are great if you want to speed up multiple instances of legacy code, as in classical business server applications, or (less effectively) if you want some improvement when porting an application and do some "low-hanging fruit picking" in terms of threading.
But there are bottlenecks in the (access to) shared resources, there are contention problems, and there is the problem that the programming model doesn't help you much in dealing with these issues - indeed, the goal is rather to help abstract the underlying complexities away.

Coming from scientific computing, what I really liked about the BE was that it helps make the computational behaviour deterministic. There are separate memory pools that belong to each SPE that won't be stomped by other processors or threads, there are robust mechanisms for transferring data between processors, separate communications pathways for main memory and the coprocessor,... neatly partitioned, and relatively predictable. It comes from my world. Yes, you have to structure your problem to fit the hardware to get good utilization, but if you do, predicting the results is relatively straightforward.

Compare this to, say, the XBox360 setup, where you can have six hardware threads, all sharing the same cache, and if these threads evict each others cached data (unpredictably, unless you lock by hand, and poof there goes that ease of porting) it will generate bus traffic, over the same bus that not only handles main memory access and cache traffic, but also all communication between CPU and GPU. And that main memory is also accessed by the GPU which has its own needs and ideas in terms of memory access. This may be a straightforward platform to port to, but to get high utilization and ensure consistent and predictable response in different situations is another matter entirely.
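To make the "threads evicting each other's cached data and generating bus traffic" point a bit more tangible, here is a deliberately dumb generic C/pthreads toy (not Xenon code, and it shows coherence ping-pong on a shared line rather than capacity eviction in a shared L2, but the moral is the same: the code is trivially correct while its memory behaviour is invisible in the source). Two threads increment private counters that happen to share a cache line, then the same work runs with the counters padded apart. Build with something like gcc -O2 -pthread and compare the timings.

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

/* Contended case: both counters live in the same cache line, so every
   write by one thread invalidates the line in the other core's cache. */
static struct { volatile long a; volatile long b; } shared;

/* Friendly case: the counters are 128 bytes apart, so they always sit
   on different cache lines for typical 64/128-byte line sizes. */
static struct { volatile long a; char pad[120]; volatile long b; } padded;

static void *bump(void *p)
{
    volatile long *c = p;
    for (long i = 0; i < ITERS; i++)
        (*c)++;
    return NULL;
}

static void run_pair(volatile long *x, volatile long *y)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, (void *)x);
    pthread_create(&t2, NULL, bump, (void *)y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double t0 = now();
    run_pair(&shared.a, &shared.b);   /* same cache line: line ping-pongs  */
    double t1 = now();
    run_pair(&padded.a, &padded.b);   /* separate lines: same work, faster */
    double t2 = now();
    printf("same line: %.2fs, padded: %.2fs\n", t1 - t0, t2 - t1);
    return 0;
}
```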

My experience with SMP UMA systems has been that if you want high utilization out of them, "forgiving" is simply not an appropriate adjective.

They are, and pardon my clear language, a fucking horrible mess that lacks not only the underlying hardware organization, but often also the band-aid software tools needed to analyze and help control the overall dataflow of the system. Coarse-grained parallelism over two or possibly four processors - OK. Maybe.
Go beyond that, and you are deep into blood-vessel-bursting territory. Again, this is for performance-critical work. Horses for courses applies here as everywhere else. But if that is what you're doing... "forgiving" - no, not really.

(* Slowly unclenches jaws *)
 
Edit: I do appreciate deterministic behavior, and I can see why it is so important for you (you have a very good point there). The problem might be to...

In my opinion and experience cache based UMA multiprocessors are NOT forgiving.
An easy programming model to port to, yes.
Easy to get high utilization of computational resources, hell no.

Easy programming model to port to, and easier to get "more than decent" performance (provided the multi-threaded design of the project has some forethought behind it... but LS or caches, if we cannot design parallel applications we might as well hang our keyboards on the wall... so there is little point complaining there)...

I think that when it comes to a games console (and evidently, since cache-based architectures do not seem to be dying in the HPC field ;)) those do not seem like a bad thing at all.

Of course, with more focus on high resource utilization, ease of porting goes to the sidelines, but helping developers get acceptable performance before they go face down into the hardware might not always be a bad thing.

Would we encourage people to forget the good lessons about data structuring (and making the project fit the architecture) that they learned with architectures like CELL, or from tapping large multi-processor systems? I'd hope not.

I am also not too keen on the idea of just telling people "life sucks, get a helmet" and bringing a steeper and steeper learning curve with each hardware generation without ever looking back.

Getting back on topic, any good engineer would have his or her blood boiling over all those unused transistors and untapped FLOPS ;).

Obtaining very high performance out of a cache based parallel processor is still definitely possible even though it might be hard to go from decent/good performance to very high performance... maybe even harder than with a well designed Local Store based architecture (where going from horrible performance to decent performance is not exactly trivial either).

A very important question (the $1 billion question, so to speak) is where exactly our efforts on the combination of OS + programming languages (new or extended) + hardware resources should focus, and how best to attack the problem holistically on all three fronts.

Is sticking with a Local Store based architecture and simply adding more cores the answer ?

I do not know; if I knew the answer with a good degree of confidence and evidence to back it up, I'd be too busy counting money to write here, I think ;). (ok... I'd still find time, but under a different nick-name :D).

Is Xbox 360's Xenon-style SMP the only future we can look at? There are many takes on the cache-based multi-processor concept going on in companies at the moment, all coming from different angles: Intel is moving there from three directions (Nehalem will bring them to the 8+ core arena, Larrabee also marches toward massive multi-core land, and IA-64 means to get there too [and I do not think the experience of designing and evolving that architecture over time will be fruitless for Intel]... three different software approaches too), SPARC is getting there from two directions (Rock and Niagara I/II/etc...), and even ARM is finally entering the SMP arena.
 
This is the dual Vector Unit setup for each VTE (a register file shared between the two Vector Units):

http://appft1.uspto.gov/netacgi/nph...AN/"International+Business+Machines"+AND+SIMD

They do make the case that you could issue two vector instructions at the same time and that each Vector Unit could work on different registers (input and output registers)... at worst you would duplicate the 3x128-bit input lanes and the 1x128-bit output lane going to the register file.

Nice; that would make the number of registers a key variable to tweak for high execution unit utilization: the two VUs can work together sharing resources, or they can work separately.

You could issue:

VU_A ..... VU_B
Scalar ... Vector
Scalar ... Scalar
Vector ... Scalar
Vector ... Vector
Vector ... NOP

etc., etc... In some cases the execution units of both VUs would be working on the same vector instruction (like the cross-product example they give).
 
IMO a CELL chip for PS4 with only 2x the performance of CELL in PS3 would be pretty disappointing.

Especially since Jim Kahle said in 2006 that by 2010, IBM would have a teraflop on a chip, using 32 SPEs. That's the baseline minimum I'd expect for PS4 in 2011-2013.

http://blogs.mercurynews.com/aei/2006/10/31/the_playstation/

We will push the number of special processing units. By 2010, we will shoot for a teraflop on a chip. I think it establishes there is a roadmap. We want to invest in it. For those that want to invest in the software, it shows that there is life in this architecture as we continue to move forward.

DT: Right now you're at 200 gigaflops?

JK: We're in the low 200s now.

DT: So that is five times faster by 2010?

JK: Four or five times faster. Yes, you basically need about 32 special processing units.


Now, four to five times the performance of PS3's CELL for next-gen CELL, and probably also for PS4, sounds much more reasonable. And yet, that's a much smaller leap than from PS2 to PS3 (6.2 GFLOPS ==> 218 GFLOPS = ~35 times).
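For the arithmetic behind those figures (hedged: peak single precision, counting the customary 8 FLOPs per SPE per clock from a 4-wide fused multiply-add, and ignoring the PPE's contribution):

```c
#include <stdio.h>

int main(void)
{
    /* Peak single-precision FLOPs: 4-wide SIMD fused multiply-add
       = 8 FLOPs per SPE per clock (the usual way these numbers are counted). */
    const double flops_per_spe_per_clock = 8.0;
    const double clock_ghz = 3.2;

    double per_spe = flops_per_spe_per_clock * clock_ghz;        /* 25.6 GFLOPS */
    printf("1 SPE   @ 3.2 GHz: %6.1f GFLOPS\n", per_spe);
    printf("8 SPEs  @ 3.2 GHz: %6.1f GFLOPS ('low 200s' once the PPE is added)\n",
           8 * per_spe);                                         /* 204.8 */
    printf("32 SPEs @ 3.2 GHz: %6.1f GFLOPS\n", 32 * per_spe);   /* 819.2 */
    printf("32 SPEs @ 4.0 GHz: %6.1f GFLOPS (~the 2010 teraflop target)\n",
           32 * flops_per_spe_per_clock * 4.0);                  /* 1024 */
    return 0;
}
```

So the teraflop target needs both 4x the SPEs and a modest clock bump, which squares with Kahle's "four or five times faster".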

BTW, what Kahle said in 2006 is not "crazy Ken Kutaragi" talk. That's far, far more conservative forecasting from the chief CELL architect.


What do you think Pana?
 
What folk will be capable of fabbing and what will go into consoles are two entirely separate things, however. Yes, IBM will be able to put out a chip like that, but I doubt it would be for anyone other than their HPC customers. The problem for Sony, Microsoft, and Nintendo(?) is that the rate at which chip costs diminish each gen has been thrown into a vat of molasses, and whereas before Sony (under KK, notably) was willing to bite the bullet early on, knowing that chip costs would drop significantly in time, repeating that in the year 2011 (unless something drastic changes) would mean biting that bullet for quite a while. Thus the whole question becomes what their initial silicon budget is given this reality, rather than what is possible.
 
What folk will be capable of fabbing and what will go into consoles are two entirely separate things, however. Yes, IBM will be able to put out a chip like that, but I doubt it would be for anyone other than their HPC customers. The problem for Sony, Microsoft, and Nintendo(?) is that the rate at which chip costs diminish each gen has been thrown into a vat of molasses, and whereas before Sony (under KK, notably) was willing to bite the bullet early on, knowing that chip costs would drop significantly in time, repeating that in the year 2011 (unless something drastic changes) would mean biting that bullet for quite a while. Thus the whole question becomes what their initial silicon budget is given this reality, rather than what is possible.

Yeah, and it's really starting to frustrate me that so many people still believe there will be huge 4-billion-transistor GPUs with 8 GB of memory, 700 GB/s of bandwidth, and all these other ludicrous specs.
 
I could see a 2 billion transistor GPU.

GT200 / GTX 280 is already at ~1 billion+ transistors.


4 GB of RAM total (system memory + graphics memory, or unified).


400-500 GB/sec of system bandwidth. I don't see why not: first-generation XDR was meant to be capable of up to ~102 GB/sec, XDR2 is capable of ~200 GB/sec, and Rambus announced the goal of hitting 1000 GB/sec (1 TB/sec) in 2010. I'm only advocating half that for sometime after 2010, when PS4 is going to show up.
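All of those XDR numbers fall out of the same simple formula, per-pin data rate times interface width; a quick sketch (the widths and rates below are illustrative configurations, not any announced console design):

```c
#include <stdio.h>

/* bandwidth in GB/s = per-pin rate (Gbit/s) * interface width (bits) / 8 */
static double gb_per_s(double pin_rate_gbit, int width_bits)
{
    return pin_rate_gbit * width_bits / 8.0;
}

int main(void)
{
    printf("XDR   3.2 Gbit/pin x  64-bit : %6.1f GB/s (PS3's XDR interface)\n",
           gb_per_s(3.2, 64));     /*  25.6 */
    printf("XDR   6.4 Gbit/pin x 128-bit : %6.1f GB/s (~the 102 GB/s figure)\n",
           gb_per_s(6.4, 128));    /* 102.4 */
    printf("XDR2 ~12.8 Gbit/pin x 128-bit: %6.1f GB/s (~200 GB/s class)\n",
           gb_per_s(12.8, 128));   /* 204.8 */
    printf("1 TB/s needs e.g. 16 Gbit/pin x 512-bit: %6.1f GB/s\n",
           gb_per_s(16.0, 512));   /* 1024 */
    return 0;
}
```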


For the next-gen Xbox, if it again uses eDRAM, the bandwidth on that could be a terabyte or several TB per second, since the eDRAM in Xbox 360 is already at 1/4 of a TB/sec.
 