So, 1 PE after all, or is this just for GDC 2005?

  • No, this is only the CPU they are describing at GDC: the final CPU of PlayStation 3 will have more PEs

    Votes: 0 (0.0%)
  • "Excuse me... but I am me, and you... you are nothing." --Il Marchese del Grillo

    Votes: 0 (0.0%)
  • This, like the last option, is a joke option... do not choose it.

    Votes: 0 (0.0%)

  • Total voters: 185
Fafalada said:
Mythos said:
However, the devil may be in the details. ISA (Cell) that is...
Oh of course - the S|APUs are almost certainly a completely new ISA - I wasn't implying parallels between the whole architectures, or even less the programming models (which look to be very different at the moment), just the PPC cores which both of them use.

one said:
Well, I don't know why Xenon is thought to use a PowerPC 3xx derivative if it's something developed after the PPC 440. Can it be clocked at 3.5 GHz+?
And can it be clocked to 4 GHz+?
Just because both the Xenon CPU and the Cell PUs could be derived from the same core, it doesn't mean that core must be an existing PPC derivative, does it?
Clean sheet design and all? :p

The PowerPC 3xx family does not officially exist yet, so it is still a clean-sheet design :p.
 
ERP said:
It may not be possible to do that on a Cell-like architecture, because the combination of the interpreter and the associated global structures would blow the limited APU/SPU code space.

I do not think program+data will be limited to the 128 KB of Local Storage each APU has. We have seen several patents: one talks about using overlays to divide code and data into modules with shared data, with the main "loader/manager" module always resident in the Local Storage; we have also seen the description of a shared L1 cache for the APUs in a PE that is hit before the APU goes to main RAM, and the same patent suggested the CPU's L2 might be readable by the APUs, IIRC.

Do you really think programmers will have to manually segment and upload code and data to the APUs, without any help, just like you do on the Vector Units of the Emotion Engine?
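To illustrate what I mean by the overlay idea, here is a rough sketch of how a resident loader/manager module might work. The DMA function names are pure invention on my part, sketched from the patents, not a real API:

```c
/* Hypothetical sketch of a resident "loader/manager" module on an APU,
   as I read the overlay patent. apu_dma_get/apu_dma_wait are invented
   names standing in for whatever DMA primitives the hardware exposes. */
#include <stdint.h>

#define OVERLAY_REGION_SIZE (64 * 1024)   /* LS area reserved for swappable modules */

/* Fixed region of the 128 KB Local Storage that overlays are loaded into. */
static uint8_t overlay_region[OVERLAY_REGION_SIZE] __attribute__((aligned(128)));

/* Invented DMA primitives: copy 'size' bytes from main-memory effective
   address 'ea' into local storage, then wait on the transfer tag. */
extern void apu_dma_get(void *ls, uint64_t ea, uint32_t size, int tag);
extern void apu_dma_wait(int tag);

typedef void (*overlay_entry_fn)(void *shared_data);

/* The resident manager never leaves LS; it pulls modules in on demand. */
void run_overlay(uint64_t module_ea, uint32_t module_size, void *shared_data)
{
    apu_dma_get(overlay_region, module_ea, module_size, /*tag=*/0);
    apu_dma_wait(0);                         /* block until code+data arrive */

    overlay_entry_fn entry = (overlay_entry_fn)(void *)overlay_region;
    entry(shared_data);                      /* entry point at region start  */
}
```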
 
Panajev2001a said:
ERP said:
It may not be possible to do that on a Cell-like architecture, because the combination of the interpreter and the associated global structures would blow the limited APU/SPU code space.

I do not think program+data will be limited to the 128 KB of Local Storage each APU has. We have seen several patents: one talks about using overlays to divide code and data into modules with shared data, with the main "loader/manager" module always resident in the Local Storage; we have also seen the description of a shared L1 cache for the APUs in a PE that is hit before the APU goes to main RAM, and the same patent suggested the CPU's L2 might be readable by the APUs, IIRC.

Do you really think programmers will have to manually segment and upload code and data to the APUs, without any help, just like you do on the Vector Units of the Emotion Engine?



Local storage is a stupid design; a small cache is fine.
But what size is the address register in the APU? 16-bit? 32-bit?
A ray tracing engine must have big memory access :)
 
Local storage is a stupid design; a small cache is fine.
But what size is the address register in the APU? 16-bit? 32-bit?
A ray tracing engine must have big memory access
External memory references are done through DMA - and according to the patents, APU DMA controllers have their own TLB, so technically you could "see" the entire virtual address space of the machine from the APU.
That big enough? :p
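To make that concrete, a rough sketch of what a transfer through the DMA controller might look like: base + offset is computed on the APU, and the full effective address goes to the DMA unit. All function names are invented; this just shows the idea from the patents:

```c
/* Rough sketch: fetching one record from anywhere in the machine's
   virtual address space via the APU's DMA controller, whose TLB
   translates the full 64-bit effective address for us.
   apu_dma_get/apu_dma_wait are invented names. */
#include <stdint.h>

typedef struct {
    float pos[4];
    float normal[4];
} Vertex;                  /* 32 bytes, fits DMA alignment nicely */

extern void apu_dma_get(void *ls, uint64_t ea, uint32_t size, int tag);
extern void apu_dma_wait(int tag);

Vertex fetch_vertex(uint64_t array_ea, uint32_t index)
{
    static Vertex v __attribute__((aligned(16)));

    /* Base + offset is computed on the APU; the DMA unit just sees
       one full effective address into the shared virtual space. */
    apu_dma_get(&v, array_ea + (uint64_t)index * sizeof(Vertex),
                sizeof(Vertex), /*tag=*/1);
    apu_dma_wait(1);
    return v;
}
```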
 
Fafalada said:
Local storage is a stupid design; a small cache is fine.
But what size is the address register in the APU? 16-bit? 32-bit?
A ray tracing engine must have big memory access
External memory references are done through DMA - and according to the patents, APU DMA controllers have their own TLB, so technically you could "see" the entire virtual address space of the machine from the APU.
That big enough? :p

Yes, but the virtual address is a base; the offset is in the APU's register.
The Cell patent is not perfect: if IBM uses a cache for the APU instead of LS, then what would the DMA do?
 
Fafalada said:
Oh, you were thinking of an APU with no local storage at all? Personally I don't think that's likely to happen.

Local storage is not ideal for programming, costs too many transistors, and is a problem for Cell's evolution.

A 16 KB cache would be the perfect design for the APU.
 
Panajev2001a said:
ERP said:
It may not be possible to do that on a Cell-like architecture, because the combination of the interpreter and the associated global structures would blow the limited APU/SPU code space.

I do not think program+data will be limited to the 128 KB of Local Storage each APU has. We have seen several patents: one talks about using overlays to divide code and data into modules with shared data, with the main "loader/manager" module always resident in the Local Storage; we have also seen the description of a shared L1 cache for the APUs in a PE that is hit before the APU goes to main RAM, and the same patent suggested the CPU's L2 might be readable by the APUs, IIRC.

Do you really think programmers will have to manually segment and upload code and data to the APUs, without any help, just like you do on the Vector Units of the Emotion Engine?

I can't really comment on this.
But think about worst cases: I can write a single piece of code that would require multiple overlays to complete a single iteration, or I could just split up the code in the address space to do basically the same thing.
I think that once you get to explicit DMA as your transfer mechanism, you're going to be chunking your code and data manually.
 
I don't care if the Xenon CPU or the PS3 PU(s) are a clean-sheet design or not (whatever that means).
I think game developers care about CPUs that are fast on general-purpose code, that are easy to program for, and that expose 'custom' functionality that can boost special computations.
Even if I don't know a single thing about the Xenon CPU, I'm sure IBM did a great job. They have the know-how, very smart people... and Microsoft funding ;)

We have already debated ten times the memory latency problem and how SPUs don't seem to be designed to hide very big latencies (such as the latencies that can appear while doing texture sampling).
Obviously the STI guys are well aware of these kinds of problems; nevertheless they decided NOT to go down the fine-grained multithreading route.
I believe they preferred to have an array of very powerful and flexible stream processors under the control of a PU, rather than a set of (bigger) multithreaded processors. For a given process the STI/Cell choice should give us more (theoretical) flops per die area.
This is from Bill Dally's work, "Stream Processors vs. GPUs":

[image: latencyhiding.png]


DeanoC wrote a good summary and moreover added a good number of educated guesses, mostly driven by good common sense. So even if we end up doing texture sampling on the SPUs, in the common case texture work will be addressed by the GPU in its pixel shader engines. The SPUs will provide a very powerful base for vertex processing and just about everything else we can think of (yeah... MRM too ;) ), since as we know SPUs can address external memory and have several mechanisms to alleviate latency problems.
I'm quite excited to learn the final Xenon and PS3 specs (that's a hint for someone!! :D ), but I'm even more excited to think about how to exploit all that power in novel ways! :)



I can't really comment on this.
But think about worst cases: I can write a single piece of code that would require multiple overlays to complete a single iteration, or I could just split up the code in the address space to do basically the same thing.
What's the big deal? We can do the same on every PC... or on the Xbox: just switch the vertex shader every couple of drawindexedprimitive() calls ;)
Having code and data overlay support doesn't mean you (as a good programmer) are going to act as if local memory were infinite. In the end you would sort per overlay switch, but it would still be much simpler for the programmer than doing manual code/data chunking (I hate doing that on VU0/VU1), imho.
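Something like this hypothetical dispatcher is what I have in mind. None of these calls are real; it just shows the sorting idea:

```c
/* Hedged sketch: batching APU jobs by the overlay they need, so each
   overlay is loaded into local storage once per batch instead of once
   per job. Every name here is hypothetical illustration, not a real API. */
#include <stdlib.h>
#include <stdint.h>

typedef struct {
    uint32_t overlay_id;   /* which code overlay this job requires        */
    uint64_t data_ea;      /* main-memory effective address of job's data */
} Job;

/* Hypothetical resident-loader calls (see the loader sketch earlier). */
extern void load_overlay(uint32_t overlay_id);
extern void run_current_overlay(uint64_t data_ea);

static int by_overlay(const void *a, const void *b)
{
    const Job *ja = a, *jb = b;
    return (ja->overlay_id > jb->overlay_id) - (ja->overlay_id < jb->overlay_id);
}

void dispatch_jobs(Job *jobs, size_t count)
{
    qsort(jobs, count, sizeof(Job), by_overlay);   /* sort per overlay switch */

    uint32_t loaded = UINT32_MAX;                  /* nothing loaded yet */
    for (size_t i = 0; i < count; i++) {
        if (jobs[i].overlay_id != loaded) {        /* switch only at batch edges */
            load_overlay(jobs[i].overlay_id);
            loaded = jobs[i].overlay_id;
        }
        run_current_overlay(jobs[i].data_ea);
    }
}
```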

ciao,
Marco
 
What's the big deal? We can do the same on every PC... or on the Xbox: just switch the vertex shader every couple of drawindexedprimitive() calls
Having code and data overlay support doesn't mean you (as a good programmer) are going to act as if local memory were infinite. In the end you would sort per overlay switch, but it would still be much simpler for the programmer than doing manual code/data chunking (I hate doing that on VU0/VU1), imho.

I'm talking about the case where a single piece of code exceeds the local memory, as is the case with the interpreter for a script language once coupled with its global data.

Sure, you can organise a lot of code to run using overlays, but if you don't have good code/data locality and the data must be processed sequentially, then you have a problem.

Running scripts was just an example that happens to fit that model, and one I happen to have looked at recently.

You can construct other examples: almost any large data structure that doesn't have an obvious spatial locality but exhibits good locality over time.

I'm not saying you can't solve these problems on a Cell-like architecture, just that solutions tailored for a more general shared-memory architecture aren't going to port easily.
 
nAo said:
For a given process the STI/Cell choice should give us more (theoretical) flops per die area.
This is from Bill Dally's work, "Stream Processors vs. GPUs":

[image: latencyhiding.png]
Sorry, this is a bit OT but probably relevant enough...
It's true that a strict stream processor may have higher theoretical flops, but in practice the other architectures may well have higher real performance. Also, the author of that paper is slightly out of date with regard to GPU architectures; they are much more complex than the simple version he showed.
It's similar to the RISC/CISC arguments of yore: in practice, hybrids that borrow bits of both will likely win in the long term. Even looking at a current GPU (the NV40) we see a far more complex situation: it has a MIMD vertex shader extremely similar in design to a stream processor or Cell, combined with an array processor for the pixel shaders. With a few simple modifications (i.e. letting the vertex shader output to memory directly) you could choose whichever type of processor best fits the job. It will be interesting to see if PS3 keeps a similar design: Cell APUs for streaming work and NVIDIA pixel shaders for more random-access vector processing...

nAo said:
What's the big deal? We can do the same on every PC... or on the Xbox: just switch the vertex shader every couple of drawindexedprimitive() calls ;)
Having code and data overlay support doesn't mean you (as a good programmer) are going to act as if local memory were infinite. In the end you would sort per overlay switch, but it would still be much simpler for the programmer than doing manual code/data chunking (I hate doing that on VU0/VU1), imho.
I think it's not the specialist processing (graphics etc.) but the normal stuff that ERP is trying to fit on the APUs. Generally our game code is very 'pretty', with lots of virtuals and indirections, designed for maintainability and quick changes rather than fast processing. By 'pretty' I mean high-level software engineering principles: interfaces, encapsulation and stuff.
I suspect we will end up just running all this stuff on the main PU...
 
DeanoC said:
It's true that a strict stream processor may have higher theoretical flops, but in practice the other architectures may well have higher real performance.
That's why I wrote about theoretical performance :)
Common sense says fine-grained multithreaded architectures should achieve higher numbers more easily than a stream architecture, but in the long run (I'm thinking about the life cycle of a console...) developers should be able to extract more flops per die area from a stream architecture, at least in some specific applications :)

With a few simple modifications (i.e. letting the vertex shader output to memory directly) you could choose whichever type of processor best fits the job.
Then you should read one of NVIDIA's latest patent applications... ;)

It will be interesting to see if PS3 keeps a similar design: Cell APUs for streaming work and NVIDIA pixel shaders for more random-access vector processing...
Let's see how they want to address the syncing and feedback problems...

I suspect we will end up just running all this stuff on the main PU...
That's what most first- or second-generation (next-gen) games will do.
Saying the APUs will not run scripting code efficiently is true... but, well, who cares? I don't ;)
It's really too easy to say that an application completely unsuited to a specific architecture will run badly on that architecture.
Vertex texturing will probably run very slowly on the Xenon CPU's custom vector units... you know... :)

ciao,
Marco
 
DaveBaumann said:
Why would you even want to run it on the CPU?
That's the point, Dave... I don't want to run that stuff on the CPU. A CPU is not designed to run texture sampling code efficiently, just as we don't want to run script code on SPUs. Just because one can do something doesn't mean it's a good idea to actually do it.
Imho, it's nonsense to complain about these kinds of 'problems'.

ciao,
Marco
 
It occurred to me that one reason you might go with an LS versus a cache for a Cell APU is predictable latency.

If the LS were replaced with a cache, then all memory accesses would have extremely unpredictable latency, so hardware support for SoEMT would be required.

However, if you split memory accesses into LS and DMA, then AFAICS the only really high-latency event that can happen is a DMA access. Instruction ordering would hide most of the execution/LS latency, and the programmer could perform SoEMT manually by separating code into chunks ending with a DMA access, then looping across independent chunks before using the results of a transfer.

I believe it was mentioned that an APU has access to a very large number of registers (128?) and can have up to 32 outstanding DMA transfers. If this is true, then I think it is incorrect to say that the APU has poor latency-hiding mechanisms; it's just that the mapping of registers/LS to threads is left up to the programmer (or potentially a specialized compiler).
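A sketch of what I mean, assuming invented DMA primitives and a simple two-buffer split of the LS; while the APU crunches one chunk, the DMA engine fetches the next:

```c
/* A sketch of manual latency hiding: double-buffered streaming over a
   large array. While chunk N is processed in LS, the DMA engine is
   already fetching chunk N+1. apu_dma_get/apu_dma_wait are invented
   stand-ins; total_bytes is assumed to be a multiple of CHUNK. */
#include <stdint.h>
#include <stddef.h>

#define CHUNK 16384   /* bytes per buffer; two buffers stay well under 128 KB */

extern void apu_dma_get(void *ls, uint64_t ea, uint32_t size, int tag);
extern void apu_dma_wait(int tag);
extern void process_chunk(float *data, size_t n);   /* the actual kernel */

void stream_array(uint64_t ea, size_t total_bytes)
{
    static float buf[2][CHUNK / sizeof(float)] __attribute__((aligned(128)));
    int cur = 0;

    apu_dma_get(buf[cur], ea, CHUNK, cur);            /* prime the pipeline */

    for (size_t off = 0; off < total_bytes; off += CHUNK) {
        int nxt = cur ^ 1;
        if (off + CHUNK < total_bytes)                /* start the next fetch early */
            apu_dma_get(buf[nxt], ea + off + CHUNK, CHUNK, nxt);

        apu_dma_wait(cur);                            /* wait only for our chunk */
        process_chunk(buf[cur], CHUNK / sizeof(float));
        cur = nxt;
    }
}
```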

Serge
 
Mmm, let's see...

The headline figures for the chip produced by Toshiba (the chip in PS3?):

4000 Mbit/sec per pin,

and 512 Mbit per chip, with a 16-bit bus.

Let's assume a 4-chip, 4-channel, 16-bit-bus configuration: we get a pool of 256 MB at 32 GB/sec.

OK, so these are the final specs of the external RAM pool on PS3, OK?
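Just to check the arithmetic (the chip count and bus width above are my assumptions, not confirmed specs):

```c
/* Back-of-the-envelope check of the numbers above: 4 chips of 512 Mbit,
   each on a 16-bit bus at 4000 Mbit/s per pin. The chip count and bus
   width are assumptions from this post, not confirmed PS3 specs. */
#include <stdio.h>

int main(void)
{
    const int    chips          = 4;
    const double mbit_per_chip  = 512.0;    /* density per chip     */
    const int    pins_per_chip  = 16;       /* 16-bit bus per chip  */
    const double mbit_s_per_pin = 4000.0;   /* signalling rate      */

    double capacity_mb   = chips * mbit_per_chip / 8.0;              /* Mbit -> MB     */
    double bandwidth_gbs = chips * pins_per_chip * mbit_s_per_pin
                           / 8.0 / 1000.0;                           /* Mbit/s -> GB/s */

    printf("capacity : %.0f MB\n",   capacity_mb);    /* 256 MB  */
    printf("bandwidth: %.0f GB/s\n", bandwidth_gbs);  /* 32 GB/s */
    return 0;
}
```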
 
psurge said:
It occurred to me that one reason you might go with an LS versus a cache for a Cell APU is predictable latency.

If the LS were replaced with a cache, then all memory accesses would have extremely unpredictable latency, so hardware support for SoEMT would be required.

However, if you split memory accesses into LS and DMA, then AFAICS the only really high-latency event that can happen is a DMA access. Instruction ordering would hide most of the execution/LS latency, and the programmer could perform SoEMT manually by separating code into chunks ending with a DMA access, then looping across independent chunks before using the results of a transfer.

I believe it was mentioned that an APU has access to a very large number of registers (128?) and can have up to 32 outstanding DMA transfers. If this is true, then I think it is incorrect to say that the APU has poor latency-hiding mechanisms; it's just that the mapping of registers/LS to threads is left up to the programmer (or potentially a specialized compiler).

Serge


If the DMA reads or writes the LS data, the APU will stall until it's done!
To maximize performance you must divide the LS into 2, 3 or 4 segments, so the system can work in parallel.
A segment of about 30-50 KB... this is a NIGHTMARE!
 
version - I'm not sure I understand what you mean - how exactly would this stall the APU? Do you mean by occupying the read/write port to the LS, or...?

Do you just mean that segments of the LS must be assigned to different "threads"? In any case, I'm not proposing to use a system like this for large general-purpose code (like a Java VM). On the other hand, it should be able to handle the latency incurred by point-sampling a vertex texture, for example.

Even if the code is fairly nightmarish, I would expect this to be worth it for some of the core algorithms (e.g. adaptive subdivision of a subdivision surface with displacements).

Serge
 
psurge said:
version - I'm not sure I understand what you mean - how exactly would this stall the APU? Do you mean by occupying the read/write port to the LS, or...?

Do you just mean that segments of the LS must be assigned to different "threads"? In any case, I'm not proposing to use a system like this for large general-purpose code (like a Java VM). On the other hand, it should be able to handle the latency incurred by point-sampling a vertex texture, for example.

Even if the code is fairly nightmarish, I would expect this to be worth it for some of the core algorithms (e.g. adaptive subdivision of a subdivision surface with displacements).

Serge

You know the PS2's VU1? It's similar to that.
 