Next gen consoles: hints?

The rest depends on where the bottleneck is: if you are using an OpenGL or DirectX kind of pipeline and you are badly bottlenecked by Fragment Operations

Eh?

That’s not a given – it’s up to the developer what they do with the resources available to them. There’s nothing inherently fragment limited unless the developer chooses for things to be that way.

Of course Dave, I was just following your example of being Fragment Processing Limited ( I was not asserting that those pipelines are inherently Fragment Processing limited ): I specified two traditional rendering pipelines because if I went the REYES-like route the discussion would have changed a bit too much.

You still have to sample those textures though, and as far as we can see so far the only dedicated hardware is down in the Pixel Engine in the Visualiser.

That is true too, unless we plan to sample them with some dedicated APUs, yes... in software.

Alternatively you can transmit in the Apulet the Fragment Shader with no dependency on texture sampling: execute until you sample the texture for the pixel program, and then send the pixel program along with the sample to be executed by another APU on the Broadband Engine.
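
Something like this, as a very rough C++ sketch of the hand-off I have in mind ( every name here is made up for illustration, none of it comes from the patent ):

[code]
#include <cstdint>

struct Texel { float r, g, b, a; };

// The work packet that would travel between the chips: where to resume,
// the live temporaries, and the texel that was fetched before the split.
struct PixelJob {
    std::uint32_t pc;            // where in the shader to resume
    float         regs[8][4];    // live temporaries at the split point
    Texel         sample;        // the texel fetched before the hand-off
};

// Front half of the shader, run on the local APU ( stubbed out here ).
void runUntilTextureFetch(PixelJob& job) { job.pc = 42; }

// The dependent fetch itself, by dedicated hardware or in software.
Texel fetchTexel(float u, float v) { return Texel{u, v, 0.0f, 1.0f}; }

// In reality this would be a DMA transfer across the chip-to-chip bus.
void sendToRemoteApu(const PixelJob&) {}

void shadePixel(float u, float v)
{
    PixelJob job{};
    runUntilTextureFetch(job);      // execute up to the texture read
    job.sample = fetchTexel(u, v);  // resolve the dependency locally
    sendToRemoteApu(job);           // back half finishes on the other chip
}
[/code]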

Not really, or at least not in the implementations that it’s been put to so far – 9800 has 96 instruction slots, which could potentially be executed in about 60 cycles (best case), and that is more of a bottleneck than passing anything out to the external F-Buffer memory with the bandwidth available to the 9800 (and generally bandwidth has scaled with performance, and future hardware will have larger instruction counts). So the F-Buffer itself doesn’t necessarily slow anything down, just the length of the shader in the first place.

I was not saying that the F-buffer by itself was slowing anything down: it is, as you say, the fact that the shader is very long. If you did not have the F-buffer and did not want to break the Shader manually into multiple rendering passes, that Shader might crash or might not be executed at all, as it would exceed the Instruction Slots limit.

The F-buffer in that case is a save-the-day situation, transparent to the shaders you write.
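
As a rough sketch of the general idea ( not how ATI actually implemented it in hardware, just the concept of splitting a too-long shader and streaming the live temporaries through a FIFO between the passes ):

[code]
#include <cstddef>
#include <vector>

struct Fragment    { float x, y; };
struct Temporaries { float r0[4], r1[4]; };  // registers live at the cut

// Pass 1: run the first chunk of the shader and spill the live temps in
// fragment order ( the FIFO ordering is what keeps this transparent ).
void passOne(const std::vector<Fragment>& frags,
             std::vector<Temporaries>& fbuffer)
{
    fbuffer.resize(frags.size());
    for (std::size_t i = 0; i < frags.size(); ++i) {
        // ...execute instructions 0..95 for frags[i]...
        fbuffer[i] = Temporaries{};          // spill live registers
    }
}

// Pass 2: reload each fragment's temps in the same order and finish.
void passTwo(const std::vector<Fragment>& frags,
             const std::vector<Temporaries>& fbuffer)
{
    for (std::size_t i = 0; i < frags.size(); ++i) {
        Temporaries t = fbuffer[i];          // restore live registers
        (void)t;
        // ...execute the remaining instructions for frags[i]...
    }
}
[/code]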

I wonder how costly Texture Sampling would be for a unit like an APU: even for Vertex Programs, if they wanted to do Texture Look-ups in there they would need to sample the textures.
 
That is true too, unless we plan to sample them with some dedicated APUs, yes... in software.

And now you are slowing down the execution quite significantly.

Alternatively you can transmit in the Apulet the Fragment Shader with no dependency on texture sampling: execute until you sample the texture for the pixel program, and then send the pixel program along with the sample to be executed by another APU on the Broadband Engine.

Again, it seems you are reliant on a hellishly high bandwidth interconnect between the two chips – you’re going to be shuttling the entire program and any temporaries associated with it back and forth between each chip here.

I wonder how costly Texture Sampling would be for a unit like an APU: even for Vertex Programs, if they wanted to do Texture Look-ups in there they would need to sample the textures.

Well, put it like this: the Vertex Shader 3.0 specification has texture access in the VS, and it’s highly likely that you’ll see the hardware that implements this come either with its own dedicated texture sampling units or with direct access to the fragment pipelines’ samplers.
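
To give a feel for what sampling "in software" involves, here's a minimal bilinear fetch as plain scalar C++ (my own sketch, not anything from real hardware or the patent). Every line of it is ALU work that a dedicated sampler does in fixed function, before you even count mipmap selection, wrap modes or format decompression:

[code]
#include <algorithm>
#include <cmath>

struct Texel { float r, g, b, a; };

// Minimal clamped bilinear fetch: address arithmetic, four texel reads
// and three lerps per sample. No mipmapping, no wrap modes, no format
// conversion, all of which sampler hardware also handles for free.
Texel sampleBilinear(const Texel* tex, int w, int h, float u, float v)
{
    float x = u * w - 0.5f, y = v * h - 0.5f;
    int x0 = std::clamp(static_cast<int>(std::floor(x)), 0, w - 1);
    int y0 = std::clamp(static_cast<int>(std::floor(y)), 0, h - 1);
    int x1 = std::min(x0 + 1, w - 1);
    int y1 = std::min(y0 + 1, h - 1);
    float fx = x - std::floor(x), fy = y - std::floor(y);

    auto lerp = [](const Texel& a, const Texel& b, float t) {
        return Texel{ a.r + (b.r - a.r) * t, a.g + (b.g - a.g) * t,
                      a.b + (b.b - a.b) * t, a.a + (b.a - a.a) * t };
    };
    Texel top = lerp(tex[y0 * w + x0], tex[y0 * w + x1], fx);
    Texel bot = lerp(tex[y1 * w + x0], tex[y1 * w + x1], fx);
    return lerp(top, bot, fy);
}
[/code]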
 
Dio said:
Panajev2001a said:
Think about the F-buffer on the R350: if you touch it you are already slowing down, but at least your Shaders still run and do not fault the program.
This is false. If you are running a long program then usually only the program execution time matters, because it is the bottleneck; bandwidth, geometry, etc. usually aren't particularly relevant.

It can be proven that multipass is faster than a very long shader program under certain circumstances.

False ?

Before saying I am not speaking the truth, be sure you understand the analogy I was making.

If you touch the F-Buffer on the R350, you are already running slow... why ? Because your shader was already long enough to challenge the Instruction Slots limit of the chip.

Running a 60 cycle Shader, as Dave pointed out, is not fast, and a few more instructions would hit the Instruction Slots wall... and what would happen then ?

On the R350 you have the F-buffer, which allows you not to worry about the program failing to execute correctly, or halting, if you pass the instruction limit.

Using the F-buffer is slow... well, it is not like running a Shader that is 90+ Instructions long is blazingly fast either.

The F-buffer is a save-the-day case: if you run a 200 Instructions Shader on the R300 I am not sure what will happen, but on the R350 it would run, albeit slowly... but the important thing is that it runs.

Same thing as these load balancing tricks we are talking about: if on that Visualizer you were limited by Fragment Programs processing, why not try to off-load calculations to the Broadband Engine if that might speed rendering up ?
 
DaveBaumann said:
That is true too, unless we plan to sample them with some dedicated APUs, yes... in software.

And now you are slowing down the execution quite significantly.

They said the same about clipping :p ( PlayStation 2 makes you do that in software on the VU ).

Again, it seems you are reliant on a hellishly high bandwidth interconnect between the two chips – you’re going to be shuttling the entire program and any temporaries associated with it back and forth between each chip here.

True, I do expect high bandwidth between the two chips.

I know the Shading Program would have to be sent back, but temps could be re-calculated: in an architecture like CELL, calculating is less critical than moving data around.

Well, put it like this: the Vertex Shader 3.0 specification has texture access in the VS, and it’s highly likely that you’ll see the hardware that implements this come either with its own dedicated texture sampling units or with direct access to the fragment pipelines’ samplers.

Interesting to see what will be cooked up for the PlayStation 3... now that is an area to think about.
 
The F-buffer is a save-the-day case: if you run a 200 Instructions Shader on the R300 I am not sure what will happen, but on the R350 it would run, albeit slowly... but the important thing is that it runs.

Same thing as these load balancing tricks we are talking about: if on that Visualizer you were limited by Fragment Programs processing, why not try to off-load calculations to the Broadband Engine if that might speed rendering up ?

I think the F-Buffer analogy is a little misleading in this case; in reality all you are saying is "longer programs run slower". However, in the case you point out, with a unified shader resource such as DX10 allows, longer fragment programs can just be requested to operate on more shader units, putting the geometry processing on fewer (to none) or shifting some of it across to the CPU.
 
There is a danger going down this route: you can end up with polygon-level aliasing. If you have finely expressed detail that is smaller than a single pixel on screen, you effectively sample random polygons out of this detail, which will create aliasing artifacts.

LODs mitigate, but don't completely solve this: if you consider a sphere, polygons near the edge of the circle (that is, the rendered image on the screen) are always sampled at a higher resolution than those that are near the middle of the circle.

Well, you can't remove the aliasing. Stochastic sampling will come in handy, I suppose.

Here is that old paper if you want to read up on it.
http://www.cs.unc.edu/~lastra/Courses/Papers/Cook_Stochastic_Sampling_TOG86.pdf
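
For anyone who doesn't want to wade through the paper: the core trick is tiny. Stratify the pixel into a grid and jitter one sample per cell, which trades structured aliasing for less objectionable noise. A minimal sketch of the sampling pattern itself:

[code]
#include <cstddef>
#include <random>
#include <vector>

struct Sample { float x, y; };

// Jittered ( stratified ) sampling in the spirit of Cook's paper: one
// randomly perturbed sample per grid cell. A regular grid turns
// sub-pixel detail into structured aliasing; jittering turns the same
// error into noise instead.
std::vector<Sample> jitteredSamples(int gridW, int gridH, unsigned seed)
{
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> jitter(0.0f, 1.0f);

    std::vector<Sample> samples;
    samples.reserve(static_cast<std::size_t>(gridW) * gridH);
    for (int j = 0; j < gridH; ++j)
        for (int i = 0; i < gridW; ++i)
            samples.push_back({ (i + jitter(rng)) / gridW,
                                (j + jitter(rng)) / gridH });
    return samples;
}
[/code]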
 
Take a look at the detailed bumps on something like the characters in the second 3DMark test - something that's simple for a fragment shader, but what kind of poly requirements would be needed to do that with geometry?

It takes a bit more than 1 poly/pixel, I suppose. I am sure the detailed bumps will look mostly the same, but for characters that deform, a high resolution mesh will look better when they do deform. And as mesh resolution gets higher, and the ratio of pixel/poly decreases, shading at the fragment level will start to lose its advantage.
 
DaveBaumann said:
But the issue appears to be that the implementation it's being put to here suffers from some of the same pitfalls that Vince was citing, due to the fact that there are two separate units.

This is why I criticized DX (and stated in my rant on advancement that DX has effectively killed the potential for all radical ideas on the PC). Unfortunately, you must not have thought about it, because if you had you'd see that DX has this mutual interaction with its architecture that keeps it from ever attempting this type of system.

As for PS3, it's free of the legacy, so how do you know the Broadband Engine and the Visualizer (if that's indeed what's used) aren't on some sort of MCM? V3 posted a link to an SCEI patent pertaining to such a device from 2002, perhaps he'll be kind enough to link us.
 
DaveBaumann said:
The F-buffer is a save-the-day case: if you run a 200 Instructions Shader on the R300 I am not sure what will happen, but on the R350 it would run, albeit slowly... but the important thing is that it runs.

Same thing as these load balancing tricks we are talking about: if on that Visualizer you were limited by Fragment Programs processing, why not try to off-load calculations to the Broadband Engine if that might speed rendering up ?

I think the F-Buffer analogy is a little misleading in this case; in reality all you are saying is "longer programs run slower". However, in the case you point out, with a unified shader resource such as DX10 allows, longer fragment programs can just be requested to operate on more shader units, putting the geometry processing on fewer (to none) or shifting some of it across to the CPU.

I was not trying to make a misleading analogy: longer programs run slower, but at least they do still run.

In this case on the Broadband Engine pixel programs might run a bit slow, but if the Visualizer was already bottlenecked by other Pixel Programs we have not lost a whole lot. Better than the console crashing or doing something else similarly weird ? ;)

Sorry for a bad example...

You got me thinking, what do we do about Texture Sampling for Vertex Programs ?

I hope that the bandwidth between the Broadband Engine and the Visualizer will be high, but I cannot just rely on that... that is why I am thinking ;)
 
nAo said:
Compression would help save memory and bandwidth.

<snip> Multiresolution techniques integrated with subdivision surfaces.

I'm trying to remember where I saw a console developer propose utilizing Loop Subdivision with a basic wavelet compression scheme for the next generation platforms. Sound familiar to anyone?
 
Saying that the APUs are pure Vector Processors is incorrect though:

I didn't say they were pure, heh. However I see the BE's major job as rendering polygons and then sending them to the VS. True, the VS has its own computational units to do its own geometry... but I think those APUs are better used for shader crunching.

This is why I call BE the VPU.

So, it is acting as the general purpose CPU for the system as well? And, there is also some "arbitrary" split between the BE and Visualiser? Your earlier response wasn’t clear though – are these separate chips?

They are two different chips connected via Rambus Redwood.

The BE can do whatever you wanted it to do, you could have it do just physics and shift everything else to the VS's APU's if you wanted.



So, at the moment we’re guessing its capabilities, how it operates and whether this is actually specific to PS3?

Well... from the patent anyway, that second chip next to the BE, the Visualizer, runs at 4GHz and provides 512GFLOPS of computational power.

Not really, you’re using different terminology – you say "VPU" I hear "Visual Processing Unit", you say "VS" I hear "Vertex Shader".

We do come from different backgrounds concerning this stuff, different hardware studies anyway.
 
So there is some arbitrary split between what you are deeming the “CPU” and the Visualiser then?

The CPU is what is labeled PU in that diagram; there are eight of them, and they act as schedulers. The rest are resources to execute the physics/graphics/compression/decompression, etc.

Then those pixel engines are the backend I suppose.
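
So on that reading the PU's job looks like a work-queue scheduler. A toy C++ sketch of that division of labour ( the names and the eight-APUs-per-PU figure are just my reading of the patent diagrams, not established fact ):

[code]
#include <functional>
#include <queue>
#include <vector>

// Toy model of the PU-as-scheduler idea: the PU hands self-contained
// work packets ( "apulets" ) to whichever APU is free; the APUs do the
// actual physics/graphics/compression number crunching.
struct Apulet { std::function<void()> run; };    // code + data packet

struct Apu {
    void execute(const Apulet& a) { a.run(); }   // crunch one packet
};

struct Pu {
    std::vector<Apu>   apus{8};   // eight APUs per PU, per the diagrams
    std::queue<Apulet> pending;   // work handed down from the game

    void dispatch()               // round-robin the queue over the APUs
    {
        while (!pending.empty()) {
            for (Apu& apu : apus) {
                if (pending.empty()) break;
                apu.execute(pending.front());
                pending.pop();
            }
        }
    }
};
[/code]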
 
Paul said:
Well... from the patent anyway, that second chip next to the BE, the Visualizer, runs at 4GHz and provides 512GFLOPS of computational power.

Truthfully, we don't know this as fact. Yet, it makes me think of a question for Mr. Baumann as he once said:

Dave Baumann said (http://www.beyond3d.com/forum/viewtopic.php?p=202680):
Yes, and Cell isn’t getting to anything like the power that Tim was referencing

Just, what exactly qualifies as "the power that Tim was referencing"? Just so I know when we reach it. By the way you stated it, I think we can infer that it's quantifiable. Any standard metric works for me. Thanks.
 
Dave,

I see something ironic though, and you probably saw it before me... while we advocated the fact that VS and PS are merging in terms of functional units, on such a PlayStation 3 we would divide Vertex Shading and Pixel Shading between the two chips for best efficiency ( unless we take the penalty and sample on the BE with some dedicated APUs and all... ).

To be fair, we advocated that to let people understand why we felt the APU could be usable for both Pixel and Vertex Shading tasks, and that we can build both a CPU and a GPU off CELL without compromising the programmability of the GPU: we would have all the flexibility we need and access to Texture Sampling units.

Also, another funny thing you noticed ( much better arguing when you are not trying to deny something you feel is true as well ): in such a scenario Vertex Processing would be the area to push strongly, as you can much more easily off-load that to the Visualizer than off-load the Pixel Shading part to the Broadband Engine, due to the Texture Issue we just mentioned.

Unless someone has a better idea.
 
Truthfully, we don't know this as fact. Yet, it makes me think of a question for Mr. Baumann as he once said:

I know it's not fact; in fact, 4GHz for a GPU in 2005 is insane. However... 4GHz is indeed the number coming from the patent, I'm just taking it straight from there.
 
Vince said:
DaveBaumann said:
But the issue appears to be that the implementation it's being put to here suffers from some of the same pitfalls that Vince was citing, due to the fact that there are two separate units.

This is why I criticized DX (and stated in my rant on advancement that DX has effectively killed the potential for all radical ideas on the PC). Unfortunately, you must not have thought about it, because if you had you'd see that DX has this mutual interaction with its architecture that keeps it from ever attempting this type of system.

Vince, I'm saying that potentially PS3, from this outlay here, suffers from the same issues. DX doesn't limit anything - there is already a movement to a unified shader architecture, and that doesn't stifle more general purpose processing either, if that's the route that needs to be taken; but MS, ATI, NVIDIA, SGI, XGI, Nintendo, PowerVR, and a whole bunch of other people don't want to move down that route yet, as they evidently feel more focused hardware serves their needs better.
 
DaveBaumann said:
Vince, I'm saying that potentially PS3, from this outlay here, suffers from the same issues. DX doesn't limit anything - there is already a movement to a unified shader architecture, and that doesn't stifle more general purpose processing either, if that's the route that needs to be taken; but MS, ATI, NVIDIA, SGI, XGI, Nintendo, PowerVR, and a whole bunch of other people don't want to move down that route yet, as they evidently feel more focused hardware serves their needs better.

Dave, shit. At least quote my posts with a single topic in their entirety. I was clearly talking about how it's basically impossible for a PC IHV, with the current limitations imposed on it (that I've proposed are cyclically influential between DX and the IHVs), to offer a solution that's reminiscent of the Cell Patent.

I then stated that for all we know the interbus bandwidth issue could be negligible due to the entire system being on an MCM. Again, show me the PC company that can -feasibly- do this in a macroscopic view. Maybe Intel?
 
Vince said:
Dave Baumann said (http://www.beyond3d.com/forum/viewtopic.php?p=202680):
Yes, and Cell isn’t getting to anything like the power that Tim was referencing

Just, what exactly qualifies as "the power that Tim was referencing"? Just so I know when we reach it. By the way you stated it, I think we can infer that it's quantifiable. Any standard metric works for me. Thanks.

Tim said 10 years IIRC; I'm sure he's had ample opportunity to know what is going on around him, and he hasn't yet reappraised that opinion to my knowledge.

Vince said:
Dave, shit. At least quote my posts with a single topic in their entirety. I was clearly talking about how it's basically impossible for a PC IHV, with the current limitations imposed on it, to offer a solution that's reminiscent of the Cell Patent.

Show me a PC IHV that wants to do this. And we're not just talking about PCs here either - which is why I have mentioned closed box systems numerous times.

I then stated that for all we know the interbus bandwidth issue could be negligible due to the entire system being on an MCM. Again, show me the PC company that can -feasibly- do this in a macroscopic view. Maybe Intel?

Again, we're not talking about PCs here - do we know what buses are going to be employed by Xbox?
 
Dave Baumann said:
Vince said:
Dave Baumann said:
Yes, and Cell isn’t getting to anything like the power that Tim was referencing

Just, what exactly qualifies as "the power that Tim was referencing"? Just so I know when we reach it. By the way you stated it, I think we can infer that it's quantifiable. Any standard metric works for me. Thanks.


Tim said 10 years IIRC; I'm sure he's had ample opportunity to know what is going on around him, and he hasn't yet reappraised that opinion to my knowledge.

So, wait. How can you tell me Cell/Broadband Engine/PS3 isn't going to be anywhere near what Tim's intentions are if:
(a) he hasn't commented on it in 10 years, and
(b) you don't know what the hell he intended?
Alrighty then.

DaveBaumann said:
And we're not just talking about PCs here either - which is why I have mentioned closed box systems numerous times.

What?!? Then why are we even discussing DXNext? Why have we been talking about PC architectural limitations?

Where did this come from?
 
I seriously would like to know how slow it would be to sample Textures using some APUs as TMUs, if needed, with what we know of the APUs.

Peak of either 1 Vector FP/FX op ( we have MADD ) or 1 Scalar FP/FX op per cycle.
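
For a very rough sense of scale, assuming that issue rate and nothing else ( no load latency, texture already sitting in the Local Storage ), one bilinear sample might break down like this... every number below is a guess on my part:

[code]
// Back-of-the-envelope cost of one bilinear texture sample on such an
// APU, assuming one vector MADD ( or one scalar op ) issued per cycle.
// All figures are guesses; load latency and any DMA traffic are ignored.
//
//   address arithmetic ( u,v -> texel coords + weights )   ~6 ops
//   4 texel loads from Local Storage ( best case )         ~4 ops
//   3 vector lerps at 2 MADDs each                         ~6 ops
//                                                         ------
//   total                                                 ~16 cycles/sample
//
// A dedicated TMU, by contrast, returns a filtered bilinear sample
// every cycle, fully pipelined, which is the gap discussed above.
[/code]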
 