Next gen consoles: hints?

As to the issue of external memory, a license agreement was announced between Rambus, Sony, and Toshiba in January of this year for Rambus' high-speed XDR DRAM (codenamed "Yellowstone") and the Redwood parallel chip-to-chip interface. XDR DRAM supports data rates ranging from 3.2 GHz to 6.4 GHz with support for 64- or 128-bit interfaces supplying between 25 GB/s and 100 GB/s depending upon the implementation.

I'm going to assume that the 100 GB/s figure is the 128-bit interface running at 6.4 GHz, and that a 64-bit one would get around 50 GB/s.
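For what it's worth, the arithmetic behind that assumption is straightforward: peak bandwidth is just the bus width in bytes times the per-pin data rate. A quick back-of-the-envelope sketch in C (the configurations below are only the ones quoted in the announcement, not a claim about any particular console):

/* Rough XDR/Redwood bandwidth check: peak GB/s = (bus width in bits / 8) * data rate in GHz. */
#include <stdio.h>

static double xdr_bandwidth_gbs(int bus_bits, double data_rate_ghz)
{
    return (bus_bits / 8.0) * data_rate_ghz;
}

int main(void)
{
    printf("64-bit  @ 3.2 GHz: %6.1f GB/s\n", xdr_bandwidth_gbs(64, 3.2));   /* ~25.6  */
    printf("64-bit  @ 6.4 GHz: %6.1f GB/s\n", xdr_bandwidth_gbs(64, 6.4));   /* ~51.2  */
    printf("128-bit @ 6.4 GHz: %6.1f GB/s\n", xdr_bandwidth_gbs(128, 6.4));  /* ~102.4 */
    return 0;
}

Which lines up with the 25 GB/s to 100 GB/s range in the announcement, and with the ~50 GB/s guess for a 64-bit interface.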
 
Pana said:
At 3-4 GHz for the APUs we are talking about 512-768 GFLOPS; would you think that is not very useful for Vertex and Pixel Shading?

Yes, this is why I questioned his comment about Tim Sweeney's statements, which I ultimately misunderstood, but it still stands. There is a difference between the Intel CPU of 2010 and the Broadband Engine - something Dio (I believe) rightly made clear in that thread.

Personally, I can't understand how Dave arrives at his conclusions wrt shading performance. If the Suzuoki patent is correct (as we assume for discussion) then a single APU has roughly the 'raw' shading performance of an entire R350. The Broadband Engine has 32 of them, the Visualizer another 16.

DaveBaumann said:
Umm, Vince, the thread is titled "Next Generation Console Hints", and the DX Next article was posted in relation to some of the directions MS are likely to be taking with the XBox2, yes?

I thought you were all big on "this is the console forum, we talk about consoles".

Obviously. But I was keeping this argument strictly in the context of DirectX Next, the idea being that it's a constant we can at least base our little discussion on. By opening the debate up you can merely add to and subtract from DX Next as you see fit, which is BS. And beyond that, it follows that the next console could hypothetically deviate 100% from DX Next, so why even debate?

V3 said:
I assume the upper-end 6.4 GHz Redwood would give 64-bit @ 6.4 GHz for 51.2 GB/s.

Also, unlike HyperTransport, I remember reading that Redwood doesn't have an upper bound on bus width.
 
Also, unlike HyperTransport, I remember reading that Redwood doesn't have an upper bound on bus width.

Yes, it doesn't; it's basically capable of delivering 6.4 GB/s per byte of bus width. Cost and need pretty much dictate the upper bound. But who needs more than 51.2 GB/s for next year?

That's like the expected available bandwidth for a high-end PC 3D card.
 
Qroach said:
Actually Grall, MS downgraded the GPU speed once from the time they had actual silicon.

Whatever. Still, I've seen the 300MHz figure pop up in places on the web even after the machine launched (at 233MHz), like at howstuffworks.com for example. Their article was entirely built on MS PR and nothing else.

Nintendo also downgraded the speed of Flipper before it was released.

YES YES, we know. Don't get so defensive! ;) I only mentioned XB as an example.

As a few GameCube developers originally received faster versions of dev hardware.

AFAIK, no dev systems ran faster than the actual release hardware; even the 2nd/final version ran at only 300/150 MHz CPU/GPU. If you've got other information, it would be the first time I've seen it, not just here but anywhere on the web...


*G*
 
Vince said:
Personally, I can't understand how Dave arrives at his conclusions wrt shading performance. If the Suzuoki patent is correct (as we assume for discussion) then a single APU has roughly the 'raw' shading performance of an entire R350. The Broadband Engine has 32 of them, the Visualizer another 16.

What conclusions would they be then Vince? I’ve yet to reach any conclusions in terms of shading performance. I’ve got some questions over potential issues I can see with fragment shaders, but then evidently so have others including Faf. So, where would these conclusions over performance be?

As for an APU being analogous to an R350 – is this your calculation? If so, what was the basis?

Obviously. But I was keeping this argument strictly in the context of DirectX Next, the idea being that it's a constant we can at least base our little discussion on. By opening the debate up you can merely add to and subtract from DX Next as you see fit, which is BS. And beyond that, it follows that the next console could hypothetically deviate 100% from DX Next, so why even debate?

Ummm, Vince, this entire discussion has been about the application to consoles – the "debate" was ever thus – you are the one who apparently now wishes to limit it. Read the thread title, read why DX10 was brought in, read the analogy jvd was writing about that prompted your initial post, and read every subsequent reply to it – it was always about the application to consoles. :rolleyes:
 
Vince said:
There is a difference between the Intel CPU of 2010 and the Broadband Engine - something Dio (I believe) rightly made clear in that thread.
It was. In that thread we were explicitly discussing a PC architecture, and the restrictions of same (namely, that x86 is a given) make achieving competitive graphics performance purely on the CPU before at least 2010 unlikely (in my opinion).

I do agree that in the future a more CPU-like approach will have strengths, and it will undoubtedly have flexibility. However, I believe that, for equivalent silicon cost, less flexible hardware can always outperform more flexible hardware by a sufficient factor that 'design efficiencies' (or inefficiencies) will not keep the flexible hardware ahead.

Obviously, the PS3 people here disagree. It seems to me there are three major arguments being expressed:
- Sony are that good, and therefore design efficiencies can be a win
- PS3's silicon cost will be so high that nobody else can compete
- some exotic rendering approach such as REYES, volumetrics, etc. will ultimately prove significantly more efficient than a primitive pipe.

Personally I'm highly unconvinced by two of those three. Only time will tell, of course.

Vince said:
this is why I questioned his comment about Tim Sweeney's statements, which I ultimately misunderstood, but it still stands.
Please excuse me while I put on me Abraham Lincoln hat and chinstrap and call for peace, unity, etc.

There are a lot of misunderstandings going on in this thread. I do think everyone should take a little more time over their posts to try to be more clear (which will eliminate the need for comments of the 'Shit, Dave' variety).

In particular, many people are using certain bits of terminology very loosely, and they are interpreted somewhat differently by the PS3 and the PC crowds.
 
Reyes is far from exotic ... dunno if it is the best solution but we will get in trouble with aliasing soon though, and that is not a problem of efficiency.

x86 isn't an altogether horrible ISA for a massively parallel processor BTW. The problem is that only console manufacturers have the impetus to build one, and no one but Sony is ambitious enough it seems, and they aren't using it.
 
MfA said:
x86 isn't an altogether horrible ISA for a massively parallel processor BTW.

x86 is rather cluttered with obsolete legacy instructions, asymmetrical instruction lengths and tons of unnecessary addressing modes, wouldn't you say? Not to mention the lack of GPRs leading to high levels of I/O to shuffle data around, the stack-based FPU, etc...

Why bother in a console with x86, when a different ISA could provide better bang/buck?


*G*
 
Grall said:
MfA said:
x86 isn't an altogether horrible ISA for a massively parallel processor BTW.
x86 is rather cluttered with obsolete legacy instructions, asymmetrical instruction lengths and tons of unnecessary addressing modes, wouldn't you say? Not to mention the lack of GPRs leading to high levels of I/O to shuffle data around, the stack-based FPU, etc...

Why bother in a console with x86, when a different ISA could provide better bang/buck?
This argument seems to crop up about every six months :)

The cost of decode (relative to execution units and cache) on x86 is not considered terribly large nowadays - so one could call the x86 instruction set, together with register renaming, a compression technology (and in fact several respected hardware engineers have). In particular, the instruction cache is around 30% more efficient than on 'orthogonal' 16/32-bit instruction architectures (ARM, 68k and PPC were discussed in the article I read).

The stack-based FPU was never an issue (due to fxch being free since Pentium) - although it made the code a pain to write, it was no slower than a register-addressable system. Now with SSE2 one could argue the FPU is pretty much redundant...

x86 also has the huge advantage of volume production and competition in the marketplace, making it relatively cheap (<cough> Xbox bidding wars <cough>).

That said, I do think x86 is a horrible ISA for a massively parallel processor. But I'm not sure there are any good ISAs for massively parallel processors. The Transputer definitely didn't do it for me... the brief study I did (crikey, was that really ten years ago???) indicated the Cray X-MP might be OK, but it wasn't really general purpose...
 
DaveBaumann said:
Vince said:
Personally, I can't understand how Dave arrives at his conclusions wrt shading performance. If the Suzuoki patent is correct (as we assume for discussion) then a single APU has roughly the 'raw' shading performance of an entire R350. The Broadband Engine has 32 of them, the Visualizer another 16.

What conclusions would they be then Vince? I’ve yet to reach any conclusions in terms of shading performance.... So, where would these conclusions over performance be?

Um, page 5?

Dave Baumann said:
At the moment I'd say it would be tough for Sony to reach ATI's fragment shading potential.

I'm willing to bet, 'At the moment' you were resolved enough to comment. Which is why I questioned it and you got defensive. And I'm still a bit confused why you'd rather have a DX solution which has arbitrary restrictions on logic constructs 'forward' of sampling (which is basically where the next generation will be limited, kinda that open-ended O(Shader) concept) than something that doesn't. Why wouldn't you want the entire computational resource 'pool' that's not linearly bounded in task to be unified?

Dave Baumann said:
As for an APU being analogous to an R350 – is this your calculation? If so, what was the basis?

Perhaps you, or others, can help. I took the number off a somewhat recent (less than ~3 months old) ATI presentation which contained a slide that compared the aggregate FLOPS from the shader constructs on the R3x00 line to the NV3x.



Dave said:
Vince said:
Obviously. But I was keeping this argument strictly in the context of DirectX Next, the idea being that it's a constant we can at least base our little discussion on. By opening the debate up you can merely add to and subtract from DX Next as you see fit, which is BS. And beyond that, it follows that the next console could hypothetically deviate 100% from DX Next, so why even debate?

Ummm, Vince, this entire discussion has been about the application to consoles – the "debate" was ever thus – you are the one who apparently now wishes to limit it. Read the thread title, read why DX10 was brought in, read the analogy jvd was writing about that prompted your initial post, and read every subsequent reply to it – it was always about the application to consoles. :rolleyes:

Right, and read my post again. This time think about why I said what I did. In fact, I must have been playing Nostradamus when I posted that, as it answers your response.

For example, "we" (as a board more or less) have basically accepted the Suzuoki patent as a constant which we can use as a basis for discussion. Similar to how I intended to use DXNext. What you're doing can also be done by the PS3 side as one can point to the Sukuoki Cell patent and say, "Hey! Preferred Embodiment! They're going to amend it and put another 8 APUs, some nVidia IP for the Shaders, and a small paramecium wheel for power in there". Obviously, doing so doesn't lend itself well to discussion.
 
The point is that nowadays, for most general-purpose applications, you can take a processor like CELL - which is not a straight general-purpose design - and with modern technology clock it high enough that it runs your web browser, your word processor, your e-mail program or your Excel database fast enough... not to mention the case in which you are running a web browser + Word + Excel + e-mail program all at the "same" time, which is where CELL's parallel architecture (and the fact that the Broadband Engine basically contains anywhere from 4 to 32 independent CPUs, depending on how you count things ;) ) starts to kick in.

APUs are not that different from what you could envision a fused vertex + fragment shading unit to be, so I would not call CELL a normal CPU-like approach compared to the approach GPUs take.

Sony and Toshiba do have a good hold on new technology (they worked hard on manufacturing technologies for a fair number of years and spent a fortune on them; it is not like all the other engineers are stupid): 65 nm mixed bulk (record-size e-DRAM and the world's second-smallest SRAM cells at 65 nm) + SOI technology, and upcoming 45 nm full SOI (capacitor-less e-DRAM).

CELL was not designed only for PlayStation 3 (although it seems to make a perfect match for both the classical CPU portion and the GPU portion of the system), and if it is competitive with what Xbox 2 carries, depending on the year of release of the consoles (if PlayStation 3 releases at about the same time, naturally the overall power of the two consoles would be close), then it is all good.

The fabs built for CELL will be used, together with the processes designed with CELL fabrication in mind, for other Sony products... to make PlayStation 2 chips even cheaper (EE+GS @ 65 nm, anyone? ;) ) and to cut costs on the PSP by shrinking the main SoC to 65 nm.

CELL making its way into other Sony products will also help Sony cut its fixed costs (they would not have to buy as many ICs from third parties; currently they have been doing so not out of need, but because of a giant lack of communication between different R&D departments [trust me, it was a mess... :( fortunately they are working hard to fix it, with Transformation 60 and all the restructuring and consolidating]).
 
Dio said:
Grall said:
MfA said:
x86 isn't an altogether horrible ISA for a massively parallel processor BTW.
x86 is rather cluttered with obsolete legacy instructions, asymmetrical instruction lengths and tons of unnecessary addressing modes, wouldn't you say? Not to mention the lack of GPRs leading to high levels of I/O to shuffle data around, the stack-based FPU, etc...

Why bother in a console with x86, when a different ISA could provide better bang/buck?
This argument seems to crop up about every six months :)

The cost of decode (relative to execution units and cache) on x86 is not considered terribly large nowadays - so one could call the x86 instruction set, together with register renaming, a compression technology (and in fact several respected hardware engineers have). In particular, the instruction cache is around 30% more efficient than on 'orthogonal' 16/32-bit instruction architectures (ARM, 68k and PPC were discussed in the article I read).

The stack-based FPU was never an issue (due to fxch being free since Pentium) - although it made the code a pain to write, it was no slower than a register-addressable system. Now with SSE2 one could argue the FPU is pretty much redundant...

x86 also has the huge advantage of volume production and competition in the marketplace, making it relatively cheap (<cough> Xbox bidding wars <cough>).

That said, I do think x86 is a horrible ISA for a massively parallel processor. But I'm not sure there are any good ISAs for massively parallel processors. The Transputer definitely didn't do it for me... the brief study I did (crikey, was that really ten years ago???) indicated the Cray X-MP might be OK, but it wasn't really general purpose...

Funny you should say that. The first time I saw the Cell patent, I got a Cray Y-MP flashback. :)
No, they aren't really all that similar, but arguably a lot more similar than either are to a P4 on steroids.

I've been very busy lately, and haven't been able to contribute to the two threads where I might have added something significant, so instead I'll quickly chime in with a useless "me too" here to say that sure, you can build multiprocessors around x86, as Intel has very capably shown. But no, it is most definitely not a starting point you would choose today if you had a clean slate, and particularly Windows isn't the OS of choice for extracting maximum performance from a multiprocessor. (It can't even handle processes that don't need to communicate particularly well!) As the number of processors to be tightly coupled increases, the limitations of the x86/Windows combo are likely to become more pronounced, both from the perspective of efficient scaling and in the set of problems which are reasonably well addressed.

In a console environment, you want the processors to cooperate in solving a single (complex) task. At their best, typical x86 multiprocessors can only hope to take a moderately capable stab at solving several, completely unrelated tasks. Passable for servers, but not a good fit for a gaming console.

While "some" hardware engineers may talk about x86 adressing/instruction format/register starvation as compression, I'd say that those engineers either have a vested interest or are trying very hard to find something politely positive to say. I'll counter and quote Intel processor architect Andy Glew, who stated that while the x86 ISA wasn't as bad as some competitors made it out to be, the stack based FPU was unarguably "brain damaged". And yes, that was when he was still working for Intel. :)
 
Vince said:
And I'm still a bit confused why you'd rather have a DX solution which has arbitrary restrictions on logic constructs 'forward' of sampling (which is basically where the next generation will be limited, kinda that open-ended O(Shader) concept) than something that doesn't. Why wouldn't you want the entire computational resource 'pool' that's not linearly bounded in task to be unified?
You asked this question before and he replied. http://www.beyond3d.com/forum/viewtopic.php?t=9332&postdays=0&postorder=asc&start=110

One could of course reverse that: why do you think it is acceptable for sampling to be done in a fixed-function manner, but not acceptable for anything else?

Note also that even DX9 level chips do not have "arbitrary restrictions on logic constructs 'forward' of sampling". With multipass, MRT, and dependent texture sampling, they can implement any logic construct.
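To make the predication part of that concrete, here is a rough C sketch (an illustration only, not actual shader code): it shows how a data-dependent 'if' becomes straight-line arithmetic of the sort a DX9-class pixel shader evaluates, while loops and persistent state are what multipass rendering and render targets supply.

#include <stdio.h>

/* A data-dependent 'if' flattened into straight-line arithmetic, roughly the
 * way a DX9-class pixel shader expresses selection (think step() feeding a
 * lerp()). Illustrative only - real shader compilers emit their own
 * instruction sequences for this. */
static float select_ge(float cond, float if_true, float if_false)
{
    float mask = (cond >= 0.0f) ? 1.0f : 0.0f;      /* step(0, cond)              */
    return if_false + mask * (if_true - if_false);  /* lerp(if_false, if_true, m) */
}

int main(void)
{
    printf("%.1f %.1f\n", select_ge( 1.0f, 10.0f, 20.0f),   /* 10.0 */
                          select_ge(-1.0f, 10.0f, 20.0f));  /* 20.0 */
    return 0;
}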
 
Entropy said:
While "some" hardware engineers may talk about x86 adressing/instruction format/register starvation as compression, I'd say that those engineers either have a vested interest or are trying very hard to find something politely positive to say.
I'm not sure. I think inherently in most ISAs you have to have variable length instructions - it seems pointless to me to encode every instruction as 128 bits so you can encode 64-bit integer immediates and/or pointers. So why not divide it up on the byte level rather than 16 or 32 bits?

Of course, I would agree that you could do it much better than x86 - get both better compression and a more flexible ISA. I spent a few hours seeing how much better I could do once (I went for a 64-bit architecture with 16 GPRs and something like SSE). I think my conclusion was that byte boundaries were nice, but the smallest instruction I would ever encode was 16 bits.
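Purely as a hypothetical illustration of that code-density argument - the instruction mix and encoded sizes below are invented for the example, not measured from x86 or any other real ISA:

/* Hypothetical code-density comparison: byte-granular variable-length encoding
 * vs. a fixed 32-bit encoding, over an invented mix of 1000 instructions. */
#include <stdio.h>

int main(void)
{
    int simple    = 900;  /* common ops; assume ~3 bytes each when byte-granular */
    int large_imm = 100;  /* ops carrying big immediates; assume ~7 bytes each   */

    /* Fixed 32-bit encoding spends 4 bytes on everything and needs two
     * instructions to build a large immediate. */
    int variable_bytes = simple * 3 + large_imm * 7;
    int fixed_bytes    = simple * 4 + large_imm * 2 * 4;

    printf("byte-granular encoding: %d bytes\n", variable_bytes);  /* 3400 */
    printf("fixed 32-bit encoding:  %d bytes\n", fixed_bytes);     /* 4400 */
    printf("i-cache footprint ratio: %.2f\n", (double)fixed_bytes / variable_bytes);
    return 0;
}

With those made-up numbers the fixed-width stream carries roughly 30% more bytes into the i-cache, which is the flavour of the figure quoted above - though the real ratio obviously depends entirely on the instruction mix and encodings you assume.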
 
Vince said:
I'm willing to bet, 'At the moment' you were resolved enough to comment. Which is why I questioned it and you got defensive.

At that point I was unaware of the layout of what, apparently, is the Visualiser, and as this discussion has evolved there is a considerable lack of understanding from all parties on how the fragment processing is to occur.

The subsequent points I've raised with Panajev, which he appears to agree with to some extent, concern the most likely "best case usage" scenario, in which it looks like much of the resources of the BE will be concerned with geometry processing and the Visualiser mainly with fragment processing. In that case, in a general usage scenario, I think the fragment processing abilities are still a question – in relation to a DX Next implementation that may well be able to shift its resource usage from geometry to fragment processing as demand requires.

Vince said:
And I'm still a bit confused why you'd rather have a DX solution which has arbitrary restrictions on logic constructs 'forward' of sampling (which is basically where the next generation will be limited, kinda that open-ended O(Shader) concept) than something that doesn't. Why wouldn't you want the entire computational resource 'pool' that's not linearly bounded in task to be unified?

Vince, I'm not necessarily saying I'd rather have anything; I'm of the opinion that there are always relative merits and pitfalls to any system.

The obvious point with the DX approach is that it is focused primarily on graphics processing, which it can be faster at than a more general-purpose unit. Sure, there will be trade-offs in other areas, and ultimately differing architectures that appear in roughly the same timeframe will probably be fairly well matched given all their relative merits and weaknesses (and 90% of the software written will see to that anyway). I just don't subscribe to the theory that DX Next has any fundamental legacy that will necessarily inhibit it across different applications (from what I hear so far, DX Next will span a lot more than just PCs and high-end consoles).

And how do you mean restrictions "forward of sampling"? In the DX Next pipeline sampling could be one of the first things that you do!

Vince said:
Perhaps you, or others, can help. I took the number off a somewhat recent (less than ~3 months old) ATI presentation which contained a slide that compared the aggregate FLOPS from the shader constructs on the R3x00 line to the NV3x.

I believe the only comparisons to NVIDIA that ATI have done are in actual runtime, not theoretical rates. IIRC, 6 float instructions can be achieved in each pixel pipeline and, I think, 2 per vertex shader.

Vince said:
For example, "we" (as a board more or less) have basically accepted the Suzuoki patent as a constant which we can use as a basis for discussion. Similar to how I intended to use DXNext. What you're doing can also be done by the PS3 side as one can point to the Sukuoki Cell patent and say, "Hey! Preferred Embodiment! They're going to amend it and put another 8 APUs, some nVidia IP for the Shaders, and a small paramecium wheel for power in there". Obviously, doing so doesn't lend itself well to discussion.

No Vince, I've not done anything of the sort – you're the one who has made the assumption on how the DX Next platform can or will be implemented; it has always been the case that we just don't know the details of a DX Next implementation within a console environment. All we can say about MS's implementation is that we believe they are using the R500 platform as a graphics technology basis and that the DX Next presentations are reasonable grounds to assume these are the graphics directions likely to be taken – your post decrying the "legacy issues" it brings forth is fundamentally misplaced in the context of a console discussion, as you now appear to admit, because we still do not know implementation-specific details.

While you are taking the Suzuoki patent as a constant for what you believe to be PS3, you cannot do the same for a console-based DX implementation because we have no such specific details as yet; hence preconceptions about apparent "legacy" issues that are not related to the structure / direction of the API and are implementation-specific are misplaced right now. As has been pointed out a few times already – MS has surprised us with the choices they have made so far, and they may well continue to do so, so even extrapolating from alternate or previous platforms may not necessarily point to Xbox2.
 
Dio said:
One could of course reverse that: why do you think it is acceptable for sampling to be done in a fixed-function manner, but not acceptable for anything else?

Because I believe that there are some tasks which are inherently iterative in nature, whose function scales linearly or is constant. These are generally things like filtering, sampling, et al. They're resolution- or intensity-dependent - if you want bilinear, put your 16 multipliers and 12 adders down in silicon concurrently. It just makes sense.

Shading, topology and other such tasks really aren't intrinsically bounded like this. Nor do they have minimum or maximum levels of intensity (poor word, but the level of usage), nor do they have a sustained level like the above category, but shift dynamically on a task-by-task basis.

IMHO, the rest follows logically. And if not - if no such distinction as the above exists - then we should forget shaders altogether and go back to a DX7-style scheme where we just put everything in dedicated constructs.
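To put the filtering example above in concrete terms, here is a rough C sketch of a bilinear fetch (illustrative only - a real texture unit also handles addressing, wrapping and formats). The point is that the weighted sum is a fixed 16 multiplies and 12 adds per RGBA sample, every sample, which is exactly the kind of workload that maps cleanly onto dedicated silicon.

#include <stdio.h>

/* Bilinear fetch sketch: 4 multiplies + 3 adds per channel, so 16 multiplies
 * and 12 adds for an RGBA texel; the four weights cost a handful of extra
 * ops shared across the channels. */
typedef struct { float r, g, b, a; } Texel;

static float weighted_sum(float t00, float t10, float t01, float t11,
                          float w00, float w10, float w01, float w11)
{
    /* 4 multiplies + 3 adds per colour channel */
    return t00 * w00 + t10 * w10 + t01 * w01 + t11 * w11;
}

static Texel bilinear(Texel t00, Texel t10, Texel t01, Texel t11,
                      float fx, float fy)
{
    float w00 = (1.0f - fx) * (1.0f - fy);
    float w10 = fx          * (1.0f - fy);
    float w01 = (1.0f - fx) * fy;
    float w11 = fx          * fy;

    Texel out;
    out.r = weighted_sum(t00.r, t10.r, t01.r, t11.r, w00, w10, w01, w11);
    out.g = weighted_sum(t00.g, t10.g, t01.g, t11.g, w00, w10, w01, w11);
    out.b = weighted_sum(t00.b, t10.b, t01.b, t11.b, w00, w10, w01, w11);
    out.a = weighted_sum(t00.a, t10.a, t01.a, t11.a, w00, w10, w01, w11);
    return out;
}

int main(void)
{
    Texel t00 = {1, 0, 0, 1}, t10 = {0, 1, 0, 1};
    Texel t01 = {0, 0, 1, 1}, t11 = {1, 1, 1, 1};
    Texel c = bilinear(t00, t10, t01, t11, 0.5f, 0.5f);
    printf("%.2f %.2f %.2f %.2f\n", c.r, c.g, c.b, c.a);  /* 0.50 0.50 0.50 1.00 */
    return 0;
}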

Dio said:
Note also that even DX9 level chips do not have "arbitrary restrictions on logic constructs 'forward' of sampling". With multipass, MRT, and dependent texture sampling, they can implement any logic construct.

Really? So, I can throw out my CPU and run my Office off it? Wow, I had no idea, and here I'm thinking you guys were still unable to share Shading resources in DX9. *shrug* I apologize.

EDIT: Dave, I'll respond to your post later. I've had Dio's typed out since I first woke up and must leave. Ohh, and I was referring to that ATI presentation that had the NV3x and R3x0 both labeled in FLOPS; perhaps it was from Mojo... not sure.
 
Look at who's set the highest goal... which group is seeking the most... which believes they can achieve the most...

In the past, great things were achieved, things that some say have yet to be outdone in real-time...

Time... it's only a matter of time...

We shall wait and see... if failure does occur, for if it does not...

Obviously, the PS3 people here disagree. It seems to me there are three major arguments being expressed:
- Sony are that good, and therefore design efficiencies can be a win
- PS3's silicon cost will be so high that nobody else can compete
- some exotic rendering approach such as REYES, volumetrics, etc. will ultimately prove significantly more efficient than a primitive pipe.

There's always my argument: two or more chips designed to heavily contribute to graphics calcs, and properly interconnected... if designed well...

will tend to out-do a single graphics-calc chip paired with a non-heavily-contributing chip, especially if they have inadequate connections, or with a slightly modded chip....
 
Dio said:
I'm not sure. I think inherently in most ISAs you have to have variable length instructions - it seems pointless to me to encode every instruction as 128 bits so you can encode 64-bit integer immediates and/or pointers. So why not divide it up on the byte level rather than 16 or 32 bits?

At least one ISA comes to mind where they didn't share your view on instruction decoding 'compression', Dio. Then there's VLIW - a particularly good CPU philosophy in terms of efficiency (IMHO) where constant instruction length is mandatory. At the end of the day, variable instruction length - call it variable-length compression if you will - brings more indeterminism into the picture (imagine the hypothetical case where your i-cache has N bytes left and your innermost loop ends with an (N+1)-byte instruction?).
 
Vince said:
Dio said:
One could of course reverse that: why do you think it is acceptable for sampling to be done in a fixed-function manner, but not acceptable for anything else?
Because I believe that there are some tasks which are inherently iterative in nature, whose function scales linearly or is constant. These are generally things like filtering, sampling, et al. They're resolution- or intensity-dependent - if you want bilinear, put your 16 multipliers and 12 adders down in silicon concurrently. It just makes sense.
Is your inference "It will be faster to do this bit fixed-function"? If so, the limits you have set there are equally arbitrary to the limits that you deride DaveBaumann for supporting - as his logic is much the same, just with different arbitrary limits.

Of course, I could be misreading you, in which case I would gladly accept correction.

Vince said:
Dio said:
Note also that even DX9 level chips do not have "arbitrary restrictions on logic constructs 'forward' of sampling". With multipass, MRT, and dependent texture sampling, they can implement any logic construct.
Really? So, I can throw out my CPU and run my Office off it? Wow, I had no idea, and here I'm thinking you guys were still unable to share Shading resources in DX9.
It wouldn't be easy to port, but in theory it can be done. Did you see ET's entry for the shading competition?
http://www.beyond3d.com/forum/viewtopic.php?t=8811

What does the sharing of resources have to do with "arbitrary restrictions on logic constructs 'forward' of sampling"?
 