More info about RSX from NVIDIA

Titanio said:
I had read they weren't given the same priority in the pipeline, but otherwise it should be similar. The former point (or something else?) may make a big difference though (?)
You heard wrong; the heritage of an SPE is obvious to anybody who has the low-level details. It's a lot better than a PS2 VU, but at heart it's still a VU. The entire thing is designed for 4 vectors of floats. Anything else is supported, but not happily.
Specific examples include integer multiplies and scalar loads/stores (non-aligned).

Titanio said:
Anyway, specifically regarding what we're talking about here, ints or flops aside I don't think there's much argument that Cell needs GPU assistance for HD decoding given the performance they've disclosed in that area previously.
I've never written an HD decoder, so I don't actually have any insight. How well it would work on an SPE I don't know. However, even if an entire SPE is used decoding a single HD stream, it's still 1 of 7. So I agree with Faf; I can't see there being much need for GPU assistance, unless I fundamentally misunderstand modern codecs (last time I looked at them was back in MPEG-1 days, so it's entirely possible).
 
jvd said:
i know you want to make the case that nvidia can do what ati can do on the consoles .
Emh, no.. you're completely off the mark, sorry.
I'm trying to say.. what I'm saying: NG console games will be, in many cases, CPU limited too.
 
nAo said:
jvd said:
i know you want to make the case that nvidia can do what ati can do on the consoles .
Emh, no.. you're completely off the mark, sorry.
I'm trying to say.. what I'm saying: NG console games will be, in many cases, CPU limited too.

They won't be. Both parts will be pushed to the limits, and it will change from scene to scene. In fact, the Cell CPU may be powerful enough that the RSX is the limiting factor.

Much like putting an Athlon 64 dual-core 4000+ with a Radeon 9700 Pro: the Radeon will be the limiting factor in all games.
 
PCEngine said:
But the CPU in PS2 also do TnL...
No more than the Xbox CPU does - in other words, you'll see CPU-based skinning and the like being used, but not any kind of real T&L work beyond that.

Anyway, I agree with nAo; I think that early games will likely be CPU limited more often than anything else - on both new platforms.

Deano said:
It's a lot better than a PS2 VU, but at heart it's still a VU. The entire thing is designed for 4 vectors of floats.
Blasphemy - if it was designed for 4 vectors of floats it would give me scalar access, swizzles, matrix to vector register mapping and a dot product instruction :p
Admittedly, I still have to see the full documentation of the ISA, but what I've seen and heard so far practically screams of Altivec roots, which, like most other CPU SIMDs, is only 'ok' when it comes to FP stuff.

Yeah I know, ISA is just one small part of the processor, and I'm a whiney bitch :p
 
DeanoC said:
Titanio said:
I had read they weren't given the same priority in the pipeline, but otherwise it should be similar. The former point (or something else?) may make a big difference though (?)
You heard wrong; the heritage of an SPE is obvious to anybody who has the low-level details. It's a lot better than a PS2 VU, but at heart it's still a VU. The entire thing is designed for 4 vectors of floats. Anything else is supported, but not happily.
Specific examples include integer multiplies and scalar loads/stores (non-aligned).

Thanks! I had heard this mentioned a few times, that you need to vectorise your integer/scalar data to get maximum performance. I guess if you only have one or two integers for computation at a point in time, you're "wasting" a lot of power, but if you can map 4 ints to your vector, does that alleviate the problem somewhat, or are there issues elsewhere too? Thanks again..
 
Chalnoth said:
Examples:
(All at 1600x1200 8xS, 16-degree anisotropy)
UT2004: +73.7%
Far Cry: +57.5%
HL2: +76.2%
Doom3: +57.6%

I'm sorry, you lost it there. I specifically excluded AA and AF in order to avoid muddying the waters with changes in texturing efficiency, memory-utilisation efficiency (AA), or, indeed, textures swapping out across the PCI Express bus due to the frame buffer's size.

I'm talking solely about the ability of games to shade more pixels more rapidly. We're not seeing ANY games get faster due to the increased capacity of the ALUs in G70's fragment pipelines. All we're seeing are speed-ups due to the increased core clock and pipeline count.

Jawed
 
It would seem easy, especially early on, to get PPE bound on CELL fairly quickly.

In regards to the 4xAA benchmarks, the G70 has much greater framebuffer bandwidth available than RSX will.
 
Fafalada said:
Blasphemy - if it was designed for 4 vectors of floats it would give me scalar access, swizzles, matrix to vector register mapping and a dot product instruction :p
Admittedly, I still have to see the full documentation of the ISA, but what I've seen and heard so far practically screams of Altivec roots, which, like most other CPU SIMDs, is only 'ok' when it comes to FP stuff.

Yeah I know, ISA is just one small part of the processor, and I'm a whiney bitch :p
Fair enough, it doesn't have those nice features. It's a real SIMD engine, none of this fancy horizontal stuff :p Horizontal ops are for wimps; real men do it vertically ;-)
 
DeanoC said:
Fair enough, it doesn't have those nice features. It's a real SIMD engine, none of this fancy horizontal stuff :p Horizontal ops are for wimps; real men do it vertically ;-)
MEGALOL, I have a new signature ;)
 
Umh.. if the SPE doesn't have an instruction cache and it needs 2 instructions per clock (8 bytes) to feed itself, does this mean half of the local store bandwidth is wasted fetching instructions? :?
 
Rockster said:
It would seem easy, especially early on, to get PPE bound on CELL fairly quickly.

I think this may be especially true of early 3rd-party cross-platform games, since I believe many early titles will only use the PPC (PPE) cores. With limited time to hit launch, plus the challenge of new architectures to chew on, I would think that for time's sake a lot (not all!) of early cross-platform titles from 3rd parties will try to limit their heavy lifting to the PPC cores and only lightly experiment with multithreading for now.

If a game is going to be on both platforms and you have a short time to make it, the simplest solution is to use a single PPC core. I know if I were the head of a cross-platform game I would seriously consider this option. 1st-gen games won't be able to have every bell and whistle: advanced 3D rendering engine, intense particle systems, realistic physics, intelligent AI, believable facial expressions and lip synching, new dynamic gameplay elements, larger worlds with more interactive elements, etc... The burden on these guys is huge, and I think they will take it one step at a time.

Also, compared to the current consoles, both systems' PPC cores are a large step up in power in many areas. So even if they do not utilize every inch of the new consoles' CPU power, they still represent a fairly big leap over what they have currently been capable of.

I'm trying to say..what I'm saying: NG console games will be, in many cases, CPU limited too.

I think you could be right in many situations. CELL, e.g., requires that you multithread your code 8 ways:

1. 7 threads that will "play nice" with the SPEs *and* generate enough work to keep each one busy

2. 1 PPE core for general-purpose code and overall management

How many developers are skilled enough, right now, to properly multithread their games (apps that are not traditionally multithreaded) so that they have 7 such threads running on the SPEs, each thread maximizing its SPE's potential?

While CELL is an amazing chip, it is a completely new design and way to approach this stuff. Sounds very exciting... also sounds like many years of missed sleep to get at its potential.

360 fans should not gloat either. The same principles apply to the XeCPU: 3 cores for a total of 6 HW threads and 1MB of cache. Hmmmm. Yeah, sounds like fun to me. Same challenges, just in different ways. Ironically, the XeCPU seems designed with the goal of having a core (or more) dedicated to procedural synthesis, yet as DeanoC has remarked, this is something CELL is actually pretty good at. Basically you are dedicating a PPC core in the XeCPU to do what 2, maybe even 1, SPE can do on CELL. It seems you could eat into what CPU power you have pretty quickly doing stuff like that on the 360.

ERP, DeanoC, Faf, nAo, any others who may have to work with this stuff: sorry guys. :cry: Too bad Intel never hit that 10GHz projection. Not that a fast x86 solves all your problems, but fewer, yet faster, CPUs sounds more flexible.

Makes me wonder if a 1:5 CELL at 4.2GHz would have been better...

Faf said:
Not if I stick Z and Frame in separate memories.

So what is going to feed the power-hungry CELL in these CPU-starved settings? ;) CELL seems to rely on XDR being available for its use.
 
Jawed said:
I'm sorry, you lost it there. I specifically excluded AA and AF in order to avoid muddying the waters with changes in texturing efficiency, memory-utilisation efficiency (AA), or, indeed, textures swapping out across the PCI Express bus due to the frame buffer's size.
Heh, then the games are all CPU-limited, and your argument fails again.
 
nAo said:
PS3 could be CPU limited too
No, it won't. It's far too easy to make a program GPU-limited. But besides, any game that's designed for the platform will push both the CPU and the GPU to their limits.
 
Any game will be limited by its own requirements and nothing else (in a balanced system, anyhow). A game with loads of tiny, simple objects physically interacting will be CPU bound. A game with simple gameplay and fantastic graphics will be GPU bound. It all depends what the dev's trying to do.
 
aaronspink said:
Jaws said:
Xenos can only issue 96 instructions/cycle ~ 48 vec4 + 48 scalar. It doesn't have any further instructions available per cycle for any other execution units? :?

Xenos ~ 480 Flops/cycle + NOTHING ELSE?

G70 can issue 136 instructions per cycle.

e.g. 64 instructions on 56 vec4 + 8 scalar ~ 464 FLOPS/cycle AND it still has 72 UNUSED instructions per cycle.

G70 ~ 464 Flops/cycle + 72 instructions/cycle on further operations.

Actually it's probably broken down closer to this:

(PS) 24 × 2 ALUs, each of which can issue 2 instructions in co-issue = 96 inst
(PS) 24 misc ops (aka 16b NRM) = 24 inst
(VS) 8 VALUs, each of which can issue 1 instruction = 8 inst
(VS) 8 SALUs = 8 inst
Total = 136 inst

Aaron Spink
speaking for myself inc.

I'm not certain, but it looks to me as though each pixel shader can perform two vec4s and two scalars per cycle. That's 96 instructions at 480 FLOPs. You get another 80 from the vertex shaders, giving you a total of 560. You then have the 24 instructions left in the pixel shaders which we haven't accounted for (the normalise).

Even if each pixel shader can only perform one scalar op per cycle along with the two vec4s, we are still looking at 512 FLOPs per cycle.
 