Gamespy leaks Xenon specs (?)

DaveBaumann said:
2. 8 Pixel pipelines/shaders max., each a cluster of 6 ALUs.

Nope. As I said, dump previous concepts of pipelines. NV4x has already decoupled the ROP's from the fragment pipelines, Xenon will be abstracted even further.

Okay...when I say pipelines, I'm not talking about conventional fixed pipelines. I'm talking about dynamic pipelines in the same sense as I would talk about dynamic streaming pipelines for CELL. At any given time there will be various streams of pixel and vertex ops being executed on clusters of ALUs, not unlike streaming pipelines being executed on SPEs on CELL. This is what I mean by pipeline; is this what you're referring to as dumping previous concepts?
 
aaaaa00 said:
What DaveBaumann said.

There's no such thing as pipelines any more. Stop talking about pipelines. It's meaningless. The next-gen chips no longer have X pipelines each with Y texture units attached to them.

What you have in a next-gen chip is A number of shader ALUs, B number of pixels "in-flight" at a given time, and C pixels retired per clock, where A is typically in the multiple dozens, B is in the hundred or hundreds range, and C is in the high single digits, or low double digits.

Fillrate is no longer interesting. What is interesting is your max shader ops throughput per clock.

The reason pipelines were done away with is memory latency. You don't want a shader memory fetch to stall an entire pipeline and waste all that hardware.

What you want is a shader ALU to queue the memory access and drop the pixel it is working on back into the pool of in-flight pixels. Then you want it to find a pixel that's just come out of a memory-wait, and start working on it as quickly as possible.

If you keep your pool filled with enough in-flight pixels, you'll always find work for your ALUs to do, instead of having them stand around twiddling their thumbs waiting for memory.

Think of it as a hundred-way hyperthread.
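
For readers who want that scheduling idea made concrete, here is a minimal C++ toy model of a pool of in-flight pixels hiding memory latency. Every number in it (ALU count, pool sizes, fetch latency, ops per pixel) is an illustrative assumption, not a real Xenon figure:

    #include <cstdio>
    #include <queue>
    #include <vector>

    // Toy model of a pool of in-flight pixels hiding memory latency.
    // All figures are illustrative assumptions, not real Xenon numbers.
    constexpr int kNumAlus     = 48;   // "multiple dozens" of shader ALUs
    constexpr int kMemLatency  = 100;  // cycles a texture fetch takes to return
    constexpr int kOpsPerPixel = 32;   // ALU ops per pixel, one fetch at the midpoint

    struct Pixel { int ops_left; int mem_wait; };

    double Simulate(int pool_size, int cycles) {
        std::vector<Pixel> pool(pool_size, Pixel{kOpsPerPixel, 0});
        std::queue<int> ready;
        for (int i = 0; i < pool_size; ++i) ready.push(i);

        long busy = 0, total = 0;
        for (int c = 0; c < cycles; ++c) {
            // Pixels whose memory access has completed rejoin the ready pool.
            for (int i = 0; i < pool_size; ++i)
                if (pool[i].mem_wait > 0 && --pool[i].mem_wait == 0) ready.push(i);

            std::queue<int> next;  // pixels that become ready again next cycle
            for (int alu = 0; alu < kNumAlus; ++alu) {
                ++total;
                if (ready.empty()) continue;          // nothing to do: the ALU idles
                int i = ready.front(); ready.pop();
                ++busy;
                if (--pool[i].ops_left == 0) {
                    pool[i] = {kOpsPerPixel, 0};      // retire, start a fresh pixel
                } else if (pool[i].ops_left == kOpsPerPixel / 2) {
                    pool[i].mem_wait = kMemLatency;   // queue the texture fetch and
                    continue;                         // park the pixel until it returns
                }
                next.push(i);
            }
            while (!next.empty()) { ready.push(next.front()); next.pop(); }
        }
        return 100.0 * busy / total;
    }

    int main() {
        std::printf(" 64 pixels in flight: %2.0f%% ALU utilisation\n", Simulate(64, 5000));
        std::printf("400 pixels in flight: %2.0f%% ALU utilisation\n", Simulate(400, 5000));
    }

With only a few dozen pixels in flight the ALUs sit idle through most of each fetch; with a few hundred in flight they stay close to fully busy, which is exactly the point being made above.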

See above...I'm not talking about fixed pipelines, but referring to dynamic streaming pipelines...
 
Jaws said:
See above...I'm not talking about fixed pipelines, but referring to dynamic streaming pipelines...

Suffice it to say your post with the reconfiguring and "load balancing" is just wrong. It's not how the chip works.

You do not create a set of "dynamic pipelines" and assign shader ALUs to them. It just doesn't work that way.
 
aaronspink said:
z said:
two reasons:
1- all next-gen games will be high-def (spells; taking lots of space)
All my current games are "high-def". In addition, they have several different complete sets of textures, models, etc. They all fit on a single DVD. And would require even less space if they shipped with only 1 set of texture/world data.

Maybe all your “PC” games are, but we are talking about consoles. You have to consider the power difference.

2- Sony wants PS3 to be the ultimate promotion for its new platform, BD (as what PS2 did for DVD)
This is some nice historical revision. PS2 did jack all for DVD as a format. Not even a drop in the bucket.

I didn’t understand anything. Care to re-phrase?

And you have to think ahead; next-gen consoles will be around for years, heck, PSOne is still being produced. So when high-def movie platforms (BD & HD-DVD) become more popular, consoles will be ready to play them. Upgrading consoles is not a smart choice. Not to mention the 2nd point again.

Who wants to upgrade a console to play movies? When high def movies become popular, people will go down to fry's and buy one for $30-50 that will give much better quality than a console will.
I’m assuming that by that price you mean the price of BD movies. But what about the machine that will play it?

I still believe the Blu-ray in PS3 will be recordable.
Sony officially stated that PS3 will be “read only”, so no BD recording. But hey look at the bright side; you will likely still have a BD player before everyone else… not to mention a PS3 as an extra!
That goes for the rest of the consoles if they have BD or HD-DVD.
 
aaaaa00 said:
Jaws said:
See above...I'm not talking about fixed pipelines, but referring to dynamic streaming pipelines...

Suffice it to say your post with the reconfiguring and "load balancing" is just wrong. It's not how the chip works.

You do not create a set of "dynamic pipelines" and assign shader ALUs to them. It just doesn't work that way.

The R500 is a stream processor and as such a stream processor needs a control processor (i.e. HW load balancing in this case). Compare this analogy to a CELL processor which is a stream processor and its control processor, i.e. the PPE...
 
z said:
I still believe the Blu-ray in PS3 will be recordable.
Sony officially stated that PS3 will be “read only”, so no BD recording. But hey look at the bright side; you will likely still have a BD player before everyone else… not to mention a PS3 as an extra!
That goes for the rest of the consoles if they have BD or HD-DVD.
I'll still hold onto that belief 'til the last day ;)
 
Jaws said:
The R500 is a stream processor and as such a stream processor needs a control processor (i.e. HW load balancing in this case). Compare this analogy to a CELL processor which is a stream processor and its control processor, i.e. the PPE...

That's cause internally a GPU looks nothing like CELL.

GPUs are optimized around processing well defined pieces of data (a vertex or a pixel). You have a zillion ALUs. The ALUs are heavily hyperthreaded to hide memory latency. All or most of the ALUs run the same program at any given time. You keep lots of pieces of data around on-chip and context switch among them automatically and rapidly so you're never blocked on memory. You never feed the output of one ALU to another ALU (that creates a dependency, and makes it harder to get parallelism). Instead you try and work on as many different pieces of data as you can, because they're all independent, and hence can be processed in parallel.

Edit: To be clear, when I say you don't feed the output of one ALU to another, I mean you don't send a vertex to one ALU for vertex processing then chain it to another ALU for more vertex processing.

CELL is composed of 8 streaming processors. You have a control CPU that schedules things. When a CELL SPU needs to access memory, there is no automatic scheduling of the memory access, and no automatic hyperthreaded context switch. You construct virtual pipelines feeding the output of one SPU to another SPU.
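
To make that "virtual pipeline" contrast concrete, here is a tiny C++ sketch of the CELL-style arrangement just described: the programmer explicitly chains stages (standing in for SPUs) through buffers, rather than the hardware automatically context-switching work. The stage names and operations are invented for illustration; nothing here is real CELL or SPU code.

    #include <cstdio>
    #include <deque>

    // Sketch of explicitly chained "virtual pipeline" stages in the CELL style.
    // Each function stands in for work handed to one SPU; the deques stand in
    // for the explicitly managed buffers between them. Invented stage names.
    struct Vertex { float x, y, z; };

    static Vertex Transform(Vertex v)   { return {v.x * 2.0f, v.y * 2.0f, v.z}; } // "SPU 0"
    static Vertex Light(Vertex v)       { return {v.x, v.y, v.z + 1.0f}; }        // "SPU 1"
    static void   Emit(const Vertex& v) { std::printf("%.1f %.1f %.1f\n", v.x, v.y, v.z); }

    int main() {
        std::deque<Vertex> in = {{1,1,0}, {2,0,1}, {0,3,2}};
        std::deque<Vertex> q0, q1;   // buffers between stages

        // The programmer, not the hardware, decides what each stage does and
        // moves data from one stage's output into the next stage's input.
        while (!in.empty()) { q0.push_back(Transform(in.front())); in.pop_front(); }
        while (!q0.empty()) { q1.push_back(Light(q0.front()));     q0.pop_front(); }
        while (!q1.empty()) { Emit(q1.front());                    q1.pop_front(); }
    }

In the GPU arrangement described in the previous paragraphs the hardware does this scheduling for you; here every hand-off is explicit.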

DeannoC has already posted about this multiple times. GPUs are not CELL and a CELL cannot beat a GPU doing what a GPU does best.
 
Jaws said:
The R500 is a stream processor and as such a stream processor needs a control processor (i.e. HW load balancing in this case). Compare this analogy to a CELL processor which is a stream processor and its control processor, i.e. the PPE...

I think there are hints of Xenon itself being a "Cell", where the PPU would be the IBM part and the SPUs the ATI part (and with the cache union); we all know that this is the future of GPUs, maybe it is closer than we think ;)
 
aaaaa00 said:
DeannoC has already posted about this multiple times. GPUs are not CELL and a CELL cannot beat a GPU doing what a GPU does best.

It was never meant to. That's why NVIDIA is taking care of the GPU part in PS3...
 
london-boy said:
aaaaa00 said:
DeannoC has already posted about this multiple times. GPUs are not CELL and a CELL cannot beat a GPU doing what a GPU does best.

It was never meant to. That's why NVIDIA is taking care of the GPU part in PS3...

That's my point. You don't compare CELL with GPUs, which is what Jaws was doing.
 
aaaaa00 said:
Jaws said:
The R500 is a stream processor and as such a stream processor needs a control processor (i.e. HW load balancing in this case). Compare this analogy to a CELL processor which is a stream processor and its control processor, i.e. the PPE...

That's cause internally a GPU looks nothing like CELL.

GPUs are optimized around processing well defined pieces of data. You have a zillion ALUs. The ALUs are heavily hyperthreaded to hide memory latency. You keep lots of these pieces of data around on-chip and context switch among them automatically and rapidly so you're never blocked on memory. You never feed the output of one ALU to another ALU (that creates a dependency, and makes it harder to get parallelism). Instead you try and work on as many different pieces of data as you can, because they're all independent, and hence can be processed in parallel.

CELL is composed of 8 streaming processors. You have a control CPU that schedules things. When a CELL SPU needs to access memory, there is no automatic scheduling of the memory access, and no automatic hyperthreaded context switch. You construct virtual pipelines feeding the output of one SPU to another SPU.

DeannoC has already posted about this multiple times. GPUs are not CELL and a CELL cannot beat a GPU doing what a GPU does best.

Yes, I know...where did I mention GPU in that analogy?

They are both stream processors with different memory architectures to hide appropriate latencies for data access...and are both efficient at tasks appropriate for those latencies. Both are highly parallel architectures and both still need to deal with dependencies.
 
Jaws said:
Yes, I know...where did I mention GPU in that analogy?

You're drawing an analogy between the ATI chip (which IS a GPU), and CELL (which is not a GPU).

Anyway, the point is, you do not configure "dynamic pipelines" on the ATI chip. You do not say "I want 8 pipelines, with 6 ALUs each, or 24 pipelines with 2 ALUs each" or what not. That's not how the chip works. ALUs are not assigned to "dynamic pipelines". You do not allocate them in chunks of 6 ALUs. The possible configurations are not "8PS|0VS or 7PS|3VS or 6PS|6VS or 5PS|9VS or 4PS|12VS or 3PS|15VS or 2PS|18VS or 1PS|21VS".

That's all I'm trying to say.
 
aaaaa00 said:
Anyway, the point is, you do not configure "dynamic pipelines" on the ATI chip. You do not say "I want 8 pipelines, with 6 ALUs each, or 24 pipelines with 2 ALUs each" or what not. That's not how the chip works. ALUs are not assigned to "dynamic pipelines". You do not allocate them in chunks of 6 ALUs. The possible configurations are not "8PS|0VS or 7PS|3VS or 6PS|6VS or 5PS|9VS or 4PS|12VS or 3PS|15VS or 2PS|18VS or 1PS|21VS".

That's all I'm trying to say.
I don't think anyone's saying you do configure a set of pipelines. A graphics pipeline is the set of processes that takes some vector data and outputs a coloured pixel. That describes transforming vertices, texturing, applying shaders, and outputting to the frame buffer. Whether this is achieved in hardware that's structured as rigid physical pipes, or a general-purpose CPU that's doing all this in software, the graphics pipeline still exists.

What you're describing is a set of 'free-floating' processors that fetch and process data from a central pool, which sounds like an excellent use of resources and supports the idea that unified shaders are a more efficient approach than custom shaders. But it STILL works on a vertex->transform->shader->output graphics pipeline. You won't be doing shading before transforming, or texturing after outputting to the frame buffer! This is a set series of functions that have to be applied in order, and as such I think people talk of the shader 'pipes' and vertex 'pipes' as the functional units that perform these tasks, as well as using such terms to describe the existing hardware implementation, which you've described as not applying in the next-gen case.
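
As a concrete reading of that point, here is a bare C++ software sketch in which the pipeline is simply a fixed order of operations, regardless of what hardware would execute each stage. The stage bodies are placeholders for illustration, not any real API:

    #include <cstdio>
    #include <vector>

    // The conceptual pipeline as a fixed order of operations: vertex work,
    // rasterisation, pixel work, frame-buffer output. Stage bodies are
    // placeholders for illustration only.
    struct Vertex   { float x, y, z; };
    struct Fragment { int px, py; float colour; };

    int main() {
        std::vector<Vertex> mesh = {{0,0,0}, {1,0,0}, {0,1,0}};
        std::vector<float> framebuffer(4 * 4, 0.0f);      // tiny 4x4 render target

        // 1. Vertex / transform stage (stand-in vertex shader)
        for (auto& v : mesh) { v.x *= 2.0f; v.y *= 2.0f; }

        // 2. Rasterise into fragments (grossly simplified: one fragment per vertex)
        std::vector<Fragment> frags;
        for (const auto& v : mesh) frags.push_back({int(v.x), int(v.y), 0.5f});

        // 3. Pixel / shader stage (stand-in pixel shader)
        for (auto& f : frags) f.colour *= 2.0f;

        // 4. Output: write the frame buffer
        for (const auto& f : frags) framebuffer[f.py * 4 + f.px] = f.colour;

        std::printf("texel (2,0): %.1f\n", framebuffer[0 * 4 + 2]);
    }

Put stages 1 and 3 onto the same pool of unified ALUs and the hardware picture changes completely, but the order of operations in this flow does not.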
 
aaaaa00 said:
Jaws said:
Yes, I know...where did I mention GPU in that analogy?

You're drawing an analogy between the ATI chip (which IS a GPU), and CELL (which is not a GPU).
...

No, I keep saying they are both stream processors and comparing them as such. They BOTH will do GPU AND non-GPU tasks. E.g. see the thread on the PPU (physics processing unit) for what the R500 could do, or the suggestion that CELL may do vertex processing; and FYI, see this URL,

http://www.gpgpu.org/

aaaaa00 said:
...
Anyway, the point is, you do not configure "dynamic pipelines" on the ATI chip. You do not say "I want 8 pipelines, with 6 ALUs each, or 24 pipelines with 2 ALUs each" or what not. That's not how the chip works. ALUs are not assigned to "dynamic pipelines". You do not allocate them in chunks of 6 ALUs. The possible configurations are not "8PS|0VS or 7PS|3VS or 6PS|6VS or 5PS|9VS or 4PS|12VS or 3PS|15VS or 2PS|18VS or 1PS|21VS".

That's all I'm trying to say.

Okay, those 6 ALUs per pixel cluster and 2 per vertex cluster are just arbitrary numbers, that's all. And I'm comparing these ALU clusters with CELL SPEs, and the PPE with the R500 'control' core for HW scheduling/load balancing. If you're aware of the CELL patents, then you should be aware that dynamic streaming pipelines can be created with the required number of SPEs (ALUs) to get the task done with appropriate resources. I'm just using that analogy here for HW load balancing, but the minimum resource unit is not an SPE but a minimum 'chunk' of ALUs.

If you know how the HW load balancing works for the R500 then please elaborate...
 
Shifty Geezer said:
But it STILL works on a vertex->transform->shader->output graphics pipeline. You won't be doing shading before transforming, or texturing after outputting to the frame buffer! This is a set series of functions that have to be applied in order, and as such I think people talk of the shader 'pipes' and vertex 'pipes' as the functional units that perform these tasks, as well as using such terms to describe the existing hardware implementation, which you've described as not applying in the next-gen case.

You are talking about the overall rendering pipeline, which is not really what is being discussed there. What was being referenced when people were talking about pipelines was the traditional "X pixel pipelines, with Y texture units and Z ALUs per pipe".

However, even your description of the overall pipeline doesn't necessarily hold true under this arrangement - the data can be passed back and forth between the shader processors to make for slightly different processing; Richard Huddy has already talked about the prospect of going VS -> Geometry Shader -> back to VS. Also, shader units have been the embodiment of that conceptual render pipeline scenario with separate VS and PS units; physically the shader process of Xenon breaks that pipeline and it's sort of decoupled since both VS and PS will operate over it.
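
A rough sketch of what that decoupling might look like in spirit: one pool of shader units pulling both vertex and pixel batches, with an arbiter favouring whichever queue is deeper. The scheduling policy and numbers here are invented purely for illustration; ATI has not published how the Xenon part actually arbitrates.

    #include <cstdio>
    #include <queue>
    #include <string>

    // One unified pool of shader units serving both vertex and pixel work,
    // with a made-up arbitration rule: favour the deeper queue each slot.
    int main() {
        std::queue<std::string> vertex_work, pixel_work;
        for (int i = 0; i < 3; ++i) vertex_work.push("vertex batch " + std::to_string(i));
        for (int i = 0; i < 9; ++i) pixel_work.push("pixel batch " + std::to_string(i));

        const int kShaderUnits = 4;   // hypothetical unified ALU array
        while (!vertex_work.empty() || !pixel_work.empty()) {
            for (int u = 0; u < kShaderUnits; ++u) {
                if (vertex_work.empty() && pixel_work.empty()) break;
                // Load balancing: pick whichever kind of work is piling up.
                bool pick_vertex = !vertex_work.empty() &&
                    (pixel_work.empty() || vertex_work.size() >= pixel_work.size());
                std::queue<std::string>& q = pick_vertex ? vertex_work : pixel_work;
                std::printf("unit %d runs %s\n", u, q.front().c_str());
                q.pop();
            }
        }
    }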
 
Jaws said:
If you know how the HW load balancing works for the R500 then please elaborate...

It doesn't have an architecture like you're hinting at; an ALU isn't a processor... It's not an independent unit, the load balancing occurs at a higher level, these are your control units (instruction decoders)...

An instruction decoder sends an instruction to N ALUs, each ALU works on a different bit of data. There are M instruction decoders per GPU. So there are M parallel programs running but working on N*M data elements (also note each ALU is also SIMD, working on 4-5 floats, so N*M*5 FMADs per cycle).

It's more of a vector/array processor than a stream processor; look further back in time to Cray XMP architectures than the relatively modern idea of stream processors.
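
For anyone who wants that arithmetic spelled out, here is the N*M*width calculation worked through in C++ with deliberately made-up numbers (they are not the real R500/Xenon figures):

    #include <cstdio>

    // DeanoC's N*M*width arithmetic, worked through with illustrative numbers.
    // None of these values are the real R500/Xenon figures.
    int main() {
        const int decoders       = 3;    // M instruction decoders ("programs" in flight)
        const int alus_per_group = 16;   // N ALUs driven in lockstep by each decoder
        const int simd_width     = 5;    // floats per ALU (the "4-5 floats" above)
        const double clock_ghz   = 0.5;  // hypothetical clock speed

        const int fmads_per_cycle = decoders * alus_per_group * simd_width;  // N*M*width
        const double gflops = fmads_per_cycle * 2.0 /* mul + add */ * clock_ghz;

        std::printf("%d FMADs per cycle -> ~%.0f GFLOPS at %.1f GHz\n",
                    fmads_per_cycle, gflops, clock_ghz);
    }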
 
Those that want an HD disc player in the Xbox will probably get to buy a different version of the Xbox that comes with recording and other PVR capabilities.
 
DaveBaumann said:
You are talking about the overall rendering pipeline, which is not really what is being discussed there. What was being referenced when people were talking about pipelines was the traditional "X pixel pipelines, with Y texture units and Z ALUs per pipe".
Okay, maybe, though I still think if you tip your head to one side and squint a bit, Jaws' comments can be read as virtual pipes/processors! Besides, people need a handle for comparison. They know the existing hardware and limitations, and to understand what the future tech can do, need to be able to interpret it in existing terms. Like when the UK went decimal, and people took around with them the idea that 1kg is about 2lb or 2 bags of sugar, until they got used to the concept of what a gram actually was without needing a basis of comparison. However the next-gen tech works, to relate its performance with existing tech there needs to be a level of "ATi's next offering can perform about the same as existing tech with this many pipes..." equivalency.

However, even your description of the overall pipeline doesn't necessarily hold true under this arrangement - the data can be passed back and forth between the shader processors to make for slightly different processing; Richard Huddy has already talked about the prospect of going VS -> Geometry Shader -> back to VS. Also, shader units have been the embodiment of that conceptual render pipeline scenario with separate VS and PS units; physically the shader process of Xenon breaks that pipeline and it's sort of decoupled since both VS and PS will operate over it.
That's still a pipeline, just with some branching and loops. At the end of the day graphics consist of Input->processing->Output which can be shown in a flow chart/DFD. Where before it was a linear progression, next gen adds data branching and feedback.

Either way I'm curious what new effects this will have. Presumably displacement mapping. Anything else?
 
DeanoC said:
Jaws said:
If you know how the HW load balancing works for the R500 then please elaborate...

It doesn't have an architecture like you're hinting at; an ALU isn't a processor... It's not an independent unit, the load balancing occurs at a higher level, these are your control units (instruction decoders)...

An instruction decoder sends an instruction to N ALUs, each ALU works on a different bit of data. There are M instruction decoders per GPU. So there are M parallel programs running but working on N*M data elements (also note each ALU is also SIMD, working on 4-5 floats, so N*M*5 FMADs per cycle).

It's more of a vector/array processor than a stream processor; look further back in time to Cray XMP architectures than the relatively modern idea of stream processors.

Thanks...I've also seen the Cray XMP vector architecture comparison applied to CELL.
 