Next gen consoles: hints?

Panajev2001a said:
Dio said:
There is a danger going down this route: you can end up with polygon-level aliasing. If you have finely expressed detail that is smaller than a single pixel on screen, you effectively sample random polygons out of this detail which will create aliasing artifacts.
REYES :D
I'm not quite sure how REYES architectures solve this issue (I'll of course freely admit I'm a long way from being an expert - in fact most of what I know I just read at http://www.cis.ohio-state.edu/~stuart/cis781/final.html).

This portrays REYES as a subdivision algorithm (which is what I thought it was) while the case I am considering as causing a problem is not aided by subdivision. The problem in question is caused by the algorithmic description starting out with a pixel grid / primitive intersection.

If you have any screen-space algorithm that is considering extremely finely modelled detail (e.g. the 'grooves' on the surface of a brick) then you run into these problems - if the first step of your process is to intersect these tiny polygons with the pixel grid, then you have an aliasing problem.
 
Some patents if you care ;)

"Processor with Redundant Logic" ( James Kahle, chief architect of CELL in IBM )

http://makeashorterlink.com/?M1B322DC6


"Simmetric Multi Processor System" ( James Kahle, look at FIG.2 :) )

http://makeashorterlink.com/?B3D322DC6


"Processor Implementation having unified scalar and SIMD datapath" ( Michael Karl Gschwind, Harm Peter Hofstee, Martin Edward Hopkins... this is basically on e of the nicest APU patents together with Kahle's one and what we find about the APUs in Suzuoki's CELL patent ).

http://makeashorterlink.com/?M1F352DC6


"Processing Module for Broadband Networks" ( this is Sony's own Masakazu Suzuoki's CELL patent )

http://makeashorterlink.com/?T1C363DC6
 
Dio said:
Panajev2001a said:
Dio said:
There is a danger going down this route: you can end up with polygon-level aliasing. If you have finely expressed detail that is smaller than a single pixel on screen, you effectively sample random polygons out of this detail which will create aliasing artifacts.
REYES :D
I'm not quite sure how REYES architectures solve this issue (I'll of course freely admit I'm a long way from being an expert - in fact most of what I know I just read at http://www.cis.ohio-state.edu/~stuart/cis781/final.html).

This portrays REYES as a subdivision algorithm (which is what I thought it was) while the case I am considering as causing a problem is not aided by subdivision. The problem in question is caused by the algorithmic description starting out with a pixel grid / primitive intersection.

If you have any screen-space algorithm that is considering extremely finely modelled detail (e.g. the 'grooves' on the surface of a brick) then you run into these problems - if the first step of your process is to intersect these tiny polygons with the pixel grid, then you have an aliasing problem.

REYES, as Cook's original implementation provided, also calls for the death of aliasing :p Stochastic AA is their motto ;)
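Just to illustrate the idea ( a toy sketch, not how PRMan actually does it, and sampleScene() is a made-up stand-in for whatever finds the surface under a sub-pixel sample ): jitter the sample positions inside each pixel, so sub-pixel geometry averages out into noise instead of structured aliasing.

[code]
#include <cstdlib>

struct Colour { float r, g, b; };

// Hypothetical stand-in: in a real renderer this would return the colour of
// whatever (micro)polygon covers the sub-pixel position (x, y).
Colour sampleScene(float x, float y)
{
    bool check = (int(x) + int(y)) & 1;        // toy checkerboard "scene"
    return check ? Colour{1, 1, 1} : Colour{0, 0, 0};
}

// Stochastic (jittered) supersampling of one pixel: stratify the pixel into
// gridSize x gridSize cells and jitter one sample inside each cell.
Colour shadePixel(int px, int py, int gridSize = 4)
{
    Colour sum{0, 0, 0};
    for (int j = 0; j < gridSize; ++j)
        for (int i = 0; i < gridSize; ++i) {
            float jx = (i + std::rand() / float(RAND_MAX)) / gridSize;
            float jy = (j + std::rand() / float(RAND_MAX)) / gridSize;
            Colour c = sampleScene(px + jx, py + jy);
            sum.r += c.r; sum.g += c.g; sum.b += c.b;
        }
    float n = float(gridSize * gridSize);
    return { sum.r / n, sum.g / n, sum.b / n };
}
[/code]

The error from tiny sub-pixel detail doesn't go away, it just turns into high-frequency noise the eye tolerates far better than regular-grid aliasing.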
 
Paul said:
It's a Vector processing unit; this is what a VPU is (should have made it clear)

It's not general purpose, which is why I tend to refrain from calling it a CPU, although it IS a CPU but it's specialized...

So, it is acting as the general purpose CPU for the system as well? And, there is also some "arbitrary" split between the BE and Visualiser? Your earlier response wasn't clear though – are these separate chips?

Patent doesn't even go into any type of specifics about the VS, it mentions its basic structure, however it mentions no functions or how it works.

So, at the moment we’re guessing its capabilities, how it operates and whether this is actually specific to PS3?

Paul said:
I know you understand what I'm saying, it's just odd to explain.

Not really, you're using different terminology – you say "VPU" I hear "Visual Processing Unit", you say "VS" I hear "Vertex Shader".

Grall said:
Dude, don't be so thick-headed, just look at the diagram (and pretend that is the basic PS3 architecture; something we don't really know for sure yet).

Ummm, "Dude" first he described it as one thing and then something else, I'm trying to pinpoint what it is and what its function is – of course, as you point out this could be an entirely futile gesture because we are apparently assuming this is the PS3.

Grall said:
Again, look at the diagram. Note the black arrows that are quite visible in it. They point in both directions. I guess that is to signify it's not some kind of assembly-line setup where one APU pipes into the next until a pixel pops out at the bottom.

I read the diagram the way you described it – however a texture sample of some kind is frequently used as an input to a fragment processor, in which case the op would be going to the Pixel Engine first for the texture sample and then back up the APU pipeline, or read out of the pixel processor and fed back into the APU pipeline at the top. Does the Pixel Engine have the same level of access to RAM as the APUs? Can some data be read by the Pixel Engine and arbitrarily passed back to the APUs further up the pipeline?
 
Dave, that black arrow is a 1,024-bit bi-directional bus connecting the APUs, the Pixel Engine and its Logic, and the PU.

Broadband Engine and Visualizer would be two different chips.

The one we have more of a handle on is the Broadband Engine, as it is also named in the Rambus contract with Toshiba and Sony.

The Visualizer is basically a normal CELL processor optimized as you see for graphics operations.
 
Dave, that black arrow is a 1,024-bit bi-directional bus connecting the APUs, the Pixel Engine and its Logic, and the PU

And the data can be passed in each direction from each APU arbitrarily?

Panajev2001a said:
Broadband Engine and Visualizer would be two different chips.

OK, so there are different processors handling different parts of the processing pipeline.

The one we have more of a handle on is the Broadband Engine, as it is also named in the Rambus contract with Toshiba and Sony.

Would I be correct in the assumption that the BE would be handling large-scale geometry processing whilst the Visualiser mainly fragment operations?
 
DaveBaumann said:
Dave, that black arrow is a 1,024-bit bi-directional bus connecting the APUs, the Pixel Engine and its Logic, and the PU

1.) And the data can be passed in each direction from each APU arbitrarily?

Panajev2001a said:
Broadband Engine and Visualizer would be two different chips.

OK, so there are different processors handling different parts of the processing pipeline.

The one we have more of a handle on is the Broadband Engine, as it is also named in the Rambus contract with Toshiba and Sony.

2.) Would I be correct in the assumption that the BE would be handling large-scale geometry processing whilst the Visualiser mainly fragment operations?

1.) Data can be passed in each direction by each device connected to the bus: of course it is a bus, so only one device has control of it at any given time ( prolly all handled by the DMAC; that is, I see the DMAC being used to transfer data on the bus across APUs and between APUs and the PU – see the rough sketch at the end of this post )...

It would not be much different from how the Emotion Engine transfers data internally using the DMAC, but in this case each APU can access the DMAC by itself.

2.) That depends on how you, the programmer, want to use the BE: we might not have the Visualizer and have a specialized GPU for all we know, but that kind of Visualizer would make the most sense as it would allow the programmer much more freedom in allocating execution resources.

Also such a Visualizer would be easier to manufacture than a different custom chip as it would be a "modification" of the Broadband Engine: logic debugging and testing of the two chips would share a lot of work and this would reduce over-all costs.

Your assumption would be correct as I see the Broadband Engine being the faster of the two FLOPS- and OPS-wise, and the one that would handle Transform, Tessellation, Morphing, Physics, A.I., etc...

But you could also have a model in which both the Broadband Engine and the Visualizer help each other, dividing all the tasks; it all depends on how fast the connection between the two chips is.
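Going back to 1.), here is the rough sketch I mentioned ( purely illustrative, none of these names appear in the patent ): any unit on the 1,024-bit bus can request a transfer, but the DMAC grants the bus to one transfer at a time.

[code]
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>

struct DmaRequest {
    std::uint32_t sourceId;   // requesting unit: an APU, the PU or the Pixel Engine
    std::uint32_t destId;     // receiving unit
    const void*   src;
    void*         dst;
    std::size_t   bytes;
};

class Dmac {
    std::mutex bus;           // models "only one device has control of the bus"
public:
    void transfer(const DmaRequest& req) {
        std::lock_guard<std::mutex> own(bus);      // acquire the shared bus
        std::memcpy(req.dst, req.src, req.bytes);  // stand-in for the actual burst
    }
};
[/code]

The mutex is only there to model bus ownership; in hardware the DMAC would arbitrate the requests, and the point is that each APU can file one by itself, much like the Emotion Engine comparison above.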
 
DaveB said:
And given the criteria vendors are looking at at any one time, it may be more optimal to have separate pools of more dedicated resource, or it may be deemed more optimal to have pools of more generalized resource.
True, I wasn't really arguing that DX is a limitation - just that it has influence over the way hw designs pan out (of course influence goes both ways).

So there is some arbitrary split between what you are deeming the "CPU" and the Visualiser then?
It's implied they are two separate chips afaik.
It could be argued the split is just physical though - logically they could be viewed as single pool of APUs.
 
Panajev2001a said:
1.) Data can be passed in each direction by each device connected to the bus: of course it is a bus, so only one device has control of it at any given time ( prolly all handled by the DMAC; that is, I see the DMAC being used to transfer data on the bus across APUs and between APUs and the PU )...

I assume that efficiency can be gained by keeping things going in one direction only then and executing on each APU in sequence? i.e. should you opt to go in a different direction, this will stall the pipeline?

Your assumption would be correct as I see the Broadband Engine being the faster of the two FLOPS- and OPS-wise, and the one that would handle Transform, Tessellation, Morphing, Physics, A.I., etc...

But you could also have a model in which both the Broadband Engine and the Visualizer help each other, dividing all the tasks; it all depends on how fast the connection between the two chips is.

And this seems to be a somewhat fundamental issue, does it not? Even if the link is very high bandwidth, what is the possibility it will come close to the internal bandwidth of each of the chips?

Fafalada said:
It could be argued the split is just physical though - logically they could be viewed as single pool of APUs.

Yes, but it appears to be all dependent on the link between the two whether reasonably that's a sensible thing to do. If the link is slow and you start executing operations as such you might be horribly bottlenecked – if it's not then it might be feasible to execute in such a fashion.

However, fundamentally, I'm still at somewhat of a loss to see where this design negates the issues that Vince appears to bring up with the "DX" approach – we still have separate pools of resource and we still appear to have fairly reasonable divisions as to what will be executed where, as Panajev points out. In fact Vince has long since evangelized the unification of the VS and PS in NVIDIA's designs, which is something that is quite apparent with DX10, but with the logical splits we have here with PS3 (if indeed this is PS3) that appears not to be the case. Vince also mentions the "low bandwidth" of other designs, and yet that pitfall could be just as inherent here with the split between the BE and Visualiser – however, if the BE is mainly producing geometry then a low bandwidth link between the two is probably the last thing you want, and if there is a high bandwidth interconnect here I don't fundamentally see why that should be any different to another closed system that has a CPU and a graphics processor.

It strikes me that both appear to have similar issues, but the execution is shifted around between them with the focus on what's more important for that design probably reflecting this (i.e. I still have the impression that the main focus with PS3 is more on the primitive processing than the fragment processing, whilst the majority of execution time will be spent on fragment processing rather than primitive processing with DX based designs – even if they do have unified shader resources).
 
Dave,

It is likely that the link between the two chips is going to be quite fast IMHO.

Also, we have on each chip a nice amount of e-DRAM or fast external DRAM the chip is connected to and we can use that to buffer data sent between the two chips.

The point is that there is still the possibility of load balancing as an APU can be configured to run almost any kind of job: if the Broadband Engine is at almost peak utilization it might off-load some work to APUs that are free on the Visualizer and then get the result back.

Each APU could be doing a Vertex Shading task or a Pixel Shading task depending on what the Software needs.

This is what Vince wanted to highlight IMHO, comparing the unification of Pixel and Vertex Shaders with how the APU could do either kind of work, and thus we can have a CELL processor used for Pixel and Vertex processing and also for physics and game logic.

It would not be hard to see a computer entirely based on the Visualizer for example: it could act as CPU and GPU of the system.
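To put the load-balancing idea in code terms ( a sketch with my own naming, nothing from the patent ): try to place an Apulet on a free APU of the Broadband Engine first, and only spill over to an idle APU on the Visualizer when the BE is saturated.

[code]
#include <vector>

struct Apu { bool busy = false; };

struct CellChip {
    std::vector<Apu> apus;
    Apu* findIdle() {
        for (Apu& a : apus)
            if (!a.busy) return &a;
        return nullptr;
    }
};

// Returns the APU chosen to run the job, or nullptr if every APU on both
// chips is occupied ( in which case the job would have to queue ).
Apu* dispatch(CellChip& broadbandEngine, CellChip& visualizer)
{
    if (Apu* a = broadbandEngine.findIdle()) return a;
    return visualizer.findIdle();
}
[/code]

Whether spilling over is actually worth it then comes down to the cost of moving the Apulet and its data across the link, which is exactly Dave's bandwidth question.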
 
I assume that efficiency can be gained by keeping things going in one direction only then and executing on each APU in sequence? i.e. should you opt to go in a different direction, this will stall the pipeline?

APUs should be made to work in 128 KB chunks of data ( slightly less than that, counting space for buffering of data and instructions ) because this is what their Local Storage provides them with.

All the Instructions being executed must be fetched from the Local Storage or LS, and so do the operands these instructions are going to operate on ( either in the LS or in the Registers ): the LS is, for each APU, its System RAM.

The LS holds both Instructions and Data.

Each APU is like a CPU with its own eco-system.

Each PE has like 8 APUs, one PU and one DMAC.

Each PE is like a SMP system in itself and a Broadband Engine with multiple PEs is basically a small cluster of SMP machines.

When they said Distributed Processing was Inside the system they were not kidding :)
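A rough data-layout sketch of that hierarchy ( 128 KB of LS per APU, 8 APUs plus one PU and one DMAC per PE, and assuming four PEs in a Broadband Engine; the names and layout are mine, not the patent's ):

[code]
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLocalStoreBytes = 128 * 1024;          // each APU's own "System RAM"

struct Apu {
    std::array<std::uint8_t, kLocalStoreBytes> localStore;    // holds instructions AND data
    // registers and execution units omitted
};

struct ProcessorElement {            // one PE: effectively a small SMP node
    struct Pu   { /* control processor */ } pu;
    struct Dmac { /* moves data between LS and shared DRAM */ } dmac;
    std::array<Apu, 8> apus;
};

struct BroadbandEngine {             // multiple PEs: a cluster of SMP machines on one chip
    std::array<ProcessorElement, 4> pes;
};
[/code]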
 
i.e. I still have the impression that the main focus with PS3 is more on the primitive processing than the fragment processing, whilst the majority of execution time will be spent on fragment processing rather than primitive processing with DX based designs – even if they do have unified shader resources

Although I would not mind using all the APUs on both systems to implement a REYES style renderer based on micro-polygons, I think that we cannot say yet how fragment and primitive processing is going to be allocated.

What if programmers use the Broadband Engine to help with per-pixel programs execution together with the Visualizer ?
 
It is likely that the link between the two chips is going to be quite fast IMHO.

Well, I would hope so. But I raise the point – this can equally be so on any other closed system, can it not?

Also, we have on each chip a nice amount of e-DRAM or fast external DRAM the chip is connected to and we can use that to buffer data sent between the two chips.

Again, this applies to anything that’s designed in that fashion.

The point is that there is still the possibility of load balancing as an APU can be configured to run almost any kind of job: if the Broadband Engine is at almost peak utilization it might off-load some work to APUs that are free on the Visualizer and then get the result back.

Yup, load balancing can occur in a DX system as well – it already does in other areas (re. the old Mitsubishi IMPAC-GE geometry processor). DX10 makes this quite feasible as well: if a unified shader model is employed then under low fragment processor usage the graphics processor can be executing both geometry and fragment operations, however under heavy fragment processing some of the geometry work could be shifted to the CPU for processing, leaving more of the shader ALU cycles of the graphics processor concentrating on fragment processing (the texture lookup in the Vertex Shader may be a fly in the ointment to a certain extent, but that can probably be circumvented).
 
What if programmers use the Broadband Engine to help with per-pixel programs execution together with the Visualizer ?

How feasible is it going to be? If you have texture operations as inputs to the fragment program then you are going to have quite a lot of data flowing back and forth between the BE and Visualiser.
 
DaveBaumann said:
It is likely that the link between the two chips is going to be quite fast IMHO.

Well, I would hope so. But I raise the point – this can equally be so on any other closed system, can it not?

Also, we have on each chip a nice amount of e-DRAM or fast external DRAM the chip is connected to and we can use that to buffer data sent between the two chips.

Again, this applies to anything that’s designed in that fashion.

The point is that there is still the possibility of load balancing as an APU can be configured to run almost any kind of job: if the Broadband Engine is at almost peak utilization it might off-load some work to APUs that are free on the Visualizer and then get the result back.

Yup, load balancing can occur in a DX system as well – it already does in other areas (re. the old Mitsubishi IMPAC-GE geometry processor). DX10 makes this quite feasible as well: if a unified shader model is employed then under low fragment processor usage the graphics processor can be executing both geometry and fragment operations, however under heavy fragment processing some of the geometry work could be shifted to the CPU for processing, leaving more of the shader ALU cycles of the graphics processor concentrating on fragment processing (the texture lookup in the Vertex Shader may be a fly in the ointment to a certain extent, but that can probably be circumvented).

Interesting development for DirectX Next I have to admit.

The link between two chips can be fast in any system, this is of course not a CELL exclusive feature.

The idea with CELL is that, technology willing, you could keep adding APUs and then PEs and balance the work around pretty nicely.

Apulets ( which, as explained in Suzuoki's patent, contain both program and data ) also have Routing Information ( Destination ID, Reply ID and Source ID, and yes, there is the provision that those can contain IP addresses ) and a global ID that allows you to identify that packet anywhere.

Apulets can migrate across PEs or across chips or across computers on a LAN or a bigger network.

As soon as two CELL systems ( might be four PEs in the same chip, two chips in the same system or two devices in the same room, etc... ) are aware of each other they can trade information and share work.

Depending on the latency requirements of your application you might want to send your Apulets outside of your device to be processed by another CELL system on the network, or not.

Same thing happens between PE and PE.

Apulets ( let's limit ourselves to the ones that need only 1 APU at a time ) can be executed/processed by any APU on any CELL system; whether you as a programmer take advantage of it or not is not CELL's fault.


This overviews the concept the best way possible ( from the CELL patent by Masakazu Suzuoki ):

A computer architecture and programming model for high speed processing over broadband networks are provided.

The architecture employs a consistent modular structure, a common computing module and uniform software cells.

The common computing module includes a control processor, a plurality of processing units, a plurality of local memories from which the processing units process programs, a direct memory access controller and a shared main memory.

A synchronized system and method for the coordinated reading and writing of data to and from the shared main memory by the processing units also are provided.

A hardware sandbox structure is provided for security against the corruption of data among the programs being processed by the processing units.

The uniform software cells contain both data and applications and are structured for processing by any of the processors of the network. Each software cell is uniquely identified on the network.

A system and method for creating a dedicated pipeline for processing streaming data also are provided.

CELL was designed to be highly modular and scalable and the Apulet was created to fit this idea.
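To make the Apulet a bit more concrete, here is a guess at what a software cell header could contain, based only on the fields named above ( Destination / Source / Reply IDs that may carry IP addresses, a global ID, and a payload of both program and data ); the actual field widths and layout are not something the patent spells out:

[code]
#include <cstdint>
#include <vector>

struct CellAddress {
    std::uint32_t ipAddress;   // the provision mentioned above: routing IDs can contain IP addresses
    std::uint32_t unitId;      // e.g. which PE / APU on that host ( my assumption )
};

struct Apulet {
    // Routing information
    CellAddress destinationId; // where the cell should be processed
    CellAddress sourceId;      // who created / sent it
    CellAddress replyId;       // where the results should go back to
    std::uint64_t globalId;    // uniquely identifies the cell anywhere on the network

    // Payload: an Apulet carries both the program and the data it works on
    std::vector<std::uint8_t> program;
    std::vector<std::uint8_t> data;
};
[/code]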

CELL achieves power by concurrency, by adding more and more processing units in parallel while avoiding the problems a single Threaded processor would have with decoding, fetching and issuing instructions for so many internal execution units.

In a lot of ways CELL is not new per se, the fundamental ideas present in it have been talked about for a while.

A lot of what is in CELL is a series of technological and architectural tricks that make it all work together.

Transistor wise and testing wise this is not an inefficient method.
 
DaveBaumann said:
What if programmers use the Broadband Engine to help with per-pixel programs execution together with the Visualizer ?

How feasible is it going to be? If you have texture operations as inputs to the fragment program then you are going to have quite a lot of data flowing back and forth between the BE and Visualiser.

You might be able to handle that kind of traffic ( see, I like having these objections... it makes me think about problems on the programming side that looking at the machine purely from a Hardware and data flow perspective would not ) or you might choose to store some textures in the BE's DRAM as well.

If they do not need to be filtered it should not be a problem for the APUs to access them, as long as you have instructed them to do so, and I guess in some cases something might be done in software by a few APUs dedicated to just that: if you do not have bandwidth, well, re-calculate more and send less :)

The important fact is that each APU could do Pixel or Vertex Programs: this enables you, with the same technology, to make a fast CPU for games and multi-media and a fast GPU ( with a few fixes, as you see in the Visualizer ).

The rest depends on where the bottleneck is: if you are using an OpenGL or DirectX kind of pipeline and you are bottlenecked by Fragment Operations badly, even if it is not optimal you might want to use some DRAM of the Broadband Engine to store Textures, some APUs to process these Textures and other APUs to work on Fragment Programs.

Think about the F-buffer on the R350: if you touch it you are already slowing down, but at least your Shaders still run and do not fault the program.
 
DaveB said:
Yes, but it appears to be all dependent on the link between the two whether reasonably that's a sensible thing to do. If the link is slow and you start executing operations as such you might be horribly bottlenecked – if it's not then it might be feasible to execute in such a fashion.
Of course, bandwidth is always an issue, but that's true with almost any approach.

In fact Vince has long since evangelized the unification of the VS and PS in NVIDIA's designs, which is something that is quite apparent with DX10, but with the logical splits we have here with PS3 (if indeed this is PS3) that appears not to be the case.
I think he's coming from "APU as a general processing unit" view - not just vertex/pixel processing but other stuff as well. When a specific 'apulet' is to be executed, idea is you don't really care which APU you pass it to.
Obviously within certain constraints of performance/efficiency.

Or in other words, when your "GPU" can act functionally equivalent to the "CPU" - to the point of executing the same code, you are coming very close to some of the stuff Tim Sweeney was talking about - without the pointless things like full software rasterization. :p
 
If they do not need to be filtered it should not be a problem for the APUs to access them, as long as you have instructed them to do so, and I guess in some cases something might be done in software by a few APUs dedicated to just that: if you do not have bandwidth, well, re-calculate more and send less

It's not just a case of filtering them, it's also a case of sampling them.

The rest depends on where the bottleneck is: if you are using an OpenGL or DirectX kind of pipeline and you are bottlenecked by Fragment Operations badly

Eh?

That's not a given – it's up to the developer what they do with the resources available to them. There's nothing inherently fragment limited unless the developer chooses for things to be that way.

even if it is not optimal you might want to use some DRAM of the Broadband Engine to store Textures, some APUs to process these Textures and other APUs to work on Fragment Programs.

You still have to sample those textures though, and as far as we can see so far the only dedicated hardware is down in the Pixel Engine in the Visualiser.

Think about the F-buffer on the R350: if you touch it you are already slowing down, but at least your Shaders still run and do not fault the program.

Not really, or at least not in the implementations that it's been put to so far – 9800 has 96 instruction slots, which could potentially be executed in about 60 cycles (best case) and that is more of a bottleneck than passing anything out to the external F-Buffer memory with the bandwidth available to the 9800 (and generally bandwidth has scaled with performance, and future hardware will have larger instruction counts). So the F-Buffer itself doesn't necessarily slow anything down, just the length of the shader in the first place.
 
Panajev2001a said:
Think about the F-buffer on the R350: if you touch it you are already slowing down, but at least your Shaders still run and do not fault the program.
This is false. If you are running a long program then usually only the program execution time matters because it is the bottleneck; bandwidth, geometry, etc. usually aren't particularly relevant.

It can be proven that multipass is faster than a very long shader program under certain circumstances.
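A back-of-envelope way to see why the F-buffer traffic tends to be in the noise ( every figure below is invented except the ~60-cycle estimate, which comes from DaveBaumann's post above ):

[code]
#include <cstdio>

int main()
{
    const double aluCyclesPerPass   = 60.0;  // ~96 instructions, best case, per the post above
    const double fbufferBytes       = 16.0;  // one 4 x 32-bit intermediate per fragment ( assumption )
    const double bytesPerCycle      = 32.0;  // hypothetical external bandwidth per core clock
    const double spillCyclesPerPass = (2 * fbufferBytes) / bytesPerCycle;  // write it out, read it back

    std::printf("ALU cycles per pass:      %.1f\n", aluCyclesPerPass);
    std::printf("F-buffer cycles per pass: %.1f\n", spillCyclesPerPass);
    return 0;
}
[/code]

With numbers anywhere in that ballpark the cost of a pass is set by the shader length, not by spilling intermediates: once the program is long enough, splitting it across passes costs very little extra.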

Fafalada said:
When a specific 'apulet' is to be executed, idea is you don't really care which APU you pass it to.
Obviously within certain constraints of performance/efficiency.
Back to the Transputer again? This kind of architecture is usually considered to be extremely hard to get working close to its peak efficiency (on general code - how well specific divisions of general code, such as game general code or renderer general code work I don't have any specific data on).
 
Fafalada said:
I think he's coming from "APU as a general processing unit" view - not just vertex/pixel processing but other stuff as well. When a specific 'apulet' is to be executed, idea is you don't really care which APU you pass it to.
Obviously within certain constraints of performance/efficiency.

Or in other words, when your "GPU" can act functionally equivalent to the "CPU" - to the point of executing the same code, you are coming very close to some of the stuff Tim Sweeney was talking about - without the pointless things like full software rasterization. :p

But the issue appears to be that the implementation it's being put to here suffers from some of the same pitfalls that Vince was citing, due to the fact that there are two separate units. The "low bandwidth" issues that were mentioned are just as possible to be an issue, or not, in both cases – if bandwidth is an issue then the performance/constraints could well render the system mainly utilized as two separate blocks: one for CPU-like tasks and geometry processing and the other for fragment processing.

The other more fundamental question is: do you want your "GPU" to act functionally equivalent to the "CPU", or to spend its transistor budget focused on tasks that are commonly required of a "GPU"?
 