Quick tech questions concerning next gen...

pixelbox

Now before you bombard me with directions to the search engine, please just answer these questions or give me links to similar thread topics.

1. Can Cell be defined in a summary? If so, please explain.
2. What is important about FlexIO in Cell, and what possibilities does it open up?
3. What are threads, and how do they help processing?
4. What are display lists?
5. What are MAC, FMAC, and DMA controllers?
6. Please define bandwidth and bit rates (128 bits); how do they help speed and performance?
7. Define SIMD/VLIW.
8. Does Cell have graphical features like NURBS?
9. What is fillrate, and why does it affect draw distance and special effects?
10. How can you avoid idle SPEs when assigning them different tasks? How does this work?

Please help, I'm so lost and out of date, plus the search engine sucks! :rolleyes:
 
I'll bite - but I'll try to be very brief...

pixelbox said:
Now before you bombard me with directions to the search engine, please just answer these questions or give me links to similar thread topics.

1. Can Cell be defined in a summary? If so, please explain.

It's a CPU which employs multiple simple cores instead of the more traditional approach of a single core (or a very small number of more complex ones).

2. What is important about FlexIO in Cell, and what possibilities does it open up?

It's a fast bus which allows different devices to communicate.

3. What are threads, and how do they help processing?

A processor traditionally runs only a single stream of instructions at any one time. Threads are an encapsulation of something to run, which will be switched between. In software threading, many threads can be created and the switching occurs when the OS intervenes, either through an interrupt/timing mechanism (pre-emptive) or when asked to by the running program (co-operative). In hardware there are a limited number of threads (typically two) which the CPU directly supports and switches between every cycle (more or less).
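To make that concrete, here's a minimal software-threading sketch using POSIX threads (pthread_create/pthread_join are the real POSIX calls; the worker function and its output are just illustrative):

```c
#include <pthread.h>
#include <stdio.h>

/* Each thread is an encapsulated stream of instructions; the OS
   decides when to switch between them (pre-emptive scheduling). */
static void *worker(void *arg)
{
    for (int i = 0; i < 3; i++)
        printf("%s: step %d\n", (const char *)arg, i);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, "thread A");
    pthread_create(&b, NULL, worker, "thread B");
    pthread_join(a, NULL);   /* wait for both to finish */
    pthread_join(b, NULL);
    return 0;
}
```

Because the OS does the switching, the interleaving of the two workers' output can differ from run to run.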

Hmm... that wasn't so brief.

4. What are display lists?

Generally speaking, on a modern machine it's a list of things for the GPU to do (usually, draw a bunch of stuff).
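As a toy illustration of the idea (the command IDs and structs below are invented, not any real GPU's format): the CPU records work into a buffer up front, and the GPU walks it later.

```c
/* Toy display list - real GPUs define their own binary formats. */
enum cmd { CMD_SET_TEXTURE, CMD_DRAW_TRIANGLES, CMD_END };

struct command { enum cmd op; unsigned arg; };

/* The CPU records commands into a buffer... */
static const struct command list[] = {
    { CMD_SET_TEXTURE,    7   },  /* bind texture #7    */
    { CMD_DRAW_TRIANGLES, 128 },  /* draw 128 triangles */
    { CMD_END,            0   },
};

/* ...and the GPU (faked here as a CPU loop) consumes it later. */
static void gpu_execute(const struct command *c)
{
    for (; c->op != CMD_END; c++) {
        /* dispatch on c->op: set state, kick off a draw, etc. */
    }
}
```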

5. What are MAC, FMAC, and DMA controllers?
MAC/FMAC = (float) multiply-accumulate unit. A fairly simple and common building block of any chip that does maths. In this case the unit is designed to multiply two numbers and add a third, which covers many basic operations with a single fast unit.
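C99's fmaf in <math.h> exposes exactly this building block, computing a*b + c as one fused operation:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* One fused multiply-accumulate: a*b + c as a single operation,
       which is what an FMAC unit computes in hardware. */
    float r = fmaf(2.0f, 3.0f, 1.0f);   /* 2*3 + 1 = 7 */
    printf("%f\n", r);
    return 0;
}
```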

DMA = direct memory access - a DMA controller is pretty much anything that can directly read or write to memory - typically they can do memory transfers from point to point in a system, such as transferring data to or from a peripheral device.

6. Please define bandwidth and bit rates (128 bits); how do they help speed and performance?

Bandwidth is how much data you can transfer in any unit of time. The more bandwidth you have, the more data you can transfer - however it doesn't say anything about how long it takes to get from A to B, which is the latency. If you have an algorithm that can actually deal with a large amount of contiguous data, bandwidth is important. If you need to randomly access individual bits, it isn't.
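As a back-of-the-envelope sketch (the bus width and clock below are made-up example numbers, not any particular machine's):

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical example: a 128-bit bus moves 16 bytes per transfer. */
    double bus_bits = 128.0;
    double clock_hz = 1.0e9;                        /* 1 GHz, made up */
    double peak     = (bus_bits / 8.0) * clock_hz;  /* bytes/second   */
    printf("peak bandwidth: %.1f GB/s\n", peak / 1e9);  /* 16.0 */
    return 0;
}
```

Note that this says nothing about latency: those 16 bytes per transfer only help if your algorithm actually consumes contiguous data.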

7. Define SIMD/VLIW.
SIMD = single instruction, multiple data. So operating on several different values at once with the same instruction - usually used for vector operations.
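As a concrete example using x86 SSE intrinsics (Cell's SPEs and VMX have their own equivalents; this is just the most accessible illustration):

```c
#include <xmmintrin.h>   /* x86 SSE intrinsics */

/* One instruction, four floats: the SIMD idea in miniature. */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);      /* load 4 floats              */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vr = _mm_add_ps(va, vb);   /* 4 additions, 1 instruction */
    _mm_storeu_ps(out, vr);           /* store 4 results            */
}
```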

VLIW = very long instruction word - a way of encoding low-level hardware instructions that gives the programmer (or compiler) direct control over individual hardware units, instead of single instructions that implicitly do a bunch of stuff at once.

8. Does Cell have graphical features like NURBS?
It's a CPU, not a GPU - it could calculate them, and do so quickly, but it doesn't really have them (or any other graphical thing) as a "feature".

9. What is fillrate, and why does it affect draw distance and special effects?
It's how much you can draw in a unit of time. The more you can draw, the further you can draw - also special effects tend to involve multi-pass techniques or heavy over-draw, which eats fillrate.

10. How can you avoid idle SPEs when assigning them different tasks? How does this work?
Keep feeding them with more work.
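In practice that usually means a shared job queue that each SPE pulls from as soon as it finishes. A rough sketch of the idea (not actual Cell SDK code; all names are invented):

```c
/* Rough sketch of the job-queue idea. Real SPE code would pull
   jobs over via DMA, and several workers sharing the queue would
   need atomic updates to head/tail. */
struct job   { void (*run)(void *); void *data; };
struct queue { struct job jobs[64]; int head, tail; };

static void worker_loop(struct queue *q)
{
    /* As long as the queue never runs dry, the worker never idles. */
    while (q->head != q->tail) {
        struct job *j = &q->jobs[q->head++ % 64];
        j->run(j->data);
    }
}
```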
 
MrWibble said:
SIMD = single instruction, multiple data. So operating on several different values at once with the same instruction - usually used for vector operations.

VLIW = very long instruction word - a way of encoding low-level hardware instructions that gives the programmer (or compiler) direct control over individual hardware units, instead of single instructions that implicitly do a bunch of stuff at once.
OK, so VLIW is basically a form of assembly language, right? So my next question is: what is it used for, graphics, or things like physics and AI?

So what is SIMD used for? I associate vector operations with graphics; am I right?
 
MrWibble said:
It's how much you can draw in a unit of time. The more you can draw, the further you can draw - also special effects tend to involve multi-pass techniques or heavy over-draw, which eats fillrate.

OK, so I thought fillrate was based on resolution (how many pixels a GPU can draw). So what, fillrate refers to every pixel/texel a GPU can draw?
 
pixelbox said:
OK, so VLIW is basically a form of assembly language, right? So my next question is: what is it used for, graphics, or things like physics and AI?

So what is SIMD used for? I associate vector operations with graphics; am I right?


VLIW could be a form of assembly language. It's somewhat out of vogue in processors these days, but the idea is that it gives you more explicit control over the execution resources. Most GPUs actually use VLIW-like instructions, but you would actually write the code as a pixel or vertex shader, so one instruction in the shader does not map to one instruction on the machine.
Physics and AI would generally be written in C or C++.

SIMD is just a way to do wide math ops; anywhere you can do four of the same thing to make things faster, you can use it. Graphics would be an example, as would parts of a physics system, animation...
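For example, a physics integration step is the same math repeated across many particles - exactly the loop shape a vectorizing compiler can map onto SIMD units (a generic C sketch, nothing Cell-specific):

```c
/* The same operation applied to every element - the kind of loop
   a compiler can turn into SIMD instructions, processing four
   (or more) particles per instruction. */
void integrate(float *pos, const float *vel, int n, float dt)
{
    for (int i = 0; i < n; i++)
        pos[i] += vel[i] * dt;
}
```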
 
Sorry, lol, but does anyone know the fillrates of the PS3/Xbox 360 GPUs?

BTW thanks soooo much! :D I am extremely grateful you took the time out! ;)
 
ERP said:
VLIW could be a form of assembly language. It's somewhat out of vogue in processors these days
Yea it's much more popular to forgo using it in favour of archaic ISAs that can't fit even the basic functionality into the "normal width" instruction space. :p
 
Fafalada said:
Yea it's much more popular to forgo using it in favour of archaic ISAs that can't fit even the basic functionality into the "normal width" instruction space. :p

Ignoring architectures designed when instruction decode logic was hideously complex and commonly written in uCode.

If you want, we can discuss why it's out of vogue and why it makes sense for GPUs.

Basically it comes down to the fact that a lot of execution resources remain unused clock to clock in a conventional processor, which means the additional instruction bandwidth goes wasted. In fact in most code load/store instructions dominate, although I guess MS/Sony and IBM missed that paper ;)

Shader programs, on the other hand, use the execution resources very heavily, so VLIW is a really good fit.

Interestingly, the Nuon architecture used VLIW in its processor; they had a number of patents relating to compression of the instructions so that code didn't get hideously bloated in the load/store-bound cases.
 
ERP said:
Basically it comes down to the fact that a lot of execution resources remain unused clock to clock in a conventional processor, which means the additional instruction bandwidth goes wasted. In fact in most code load/store instructions dominate, although I guess MS/Sony and IBM missed that paper
I know, and I completely agree that in a conventional CPU there are more arguments against VLIW than pro, but when it comes to something like the SPE I just don't see any advantage that the current design has over VLIW.
Let's face it - currently, what benefit there is to code size for cases where we'd stick with unaligned instructions & single-issue load/store will tend to be greatly overshadowed by the preferred-slot paradigm bloat.
IMO VLIW would make things simpler issue/execution wise, provide instruction space for a better featured ISA, and because of the latter, also wouldn't tip the balance of average code size in any significant manner, since it could eliminate unnecessary bloat (it might even improve it in some cases).

Of course I'm no hw engineer, so someone feel free to correct my line of thinking. :p
 
Fafalada said:
I know, and I completely agree that in a conventional CPU there are more arguments against VLIW than pro, but when it comes to something like the SPE I just don't see any advantage that the current design has over VLIW.
Let's face it - currently, what benefit there is to code size for cases where we'd stick with unaligned instructions & single-issue load/store will tend to be greatly overshadowed by the preferred-slot paradigm bloat.
IMO VLIW would make things simpler issue/execution wise, provide instruction space for a better featured ISA, and because of the latter, also wouldn't tip the balance of average code size in any significant manner, since it could eliminate unnecessary bloat (it might even improve it in some cases).

Of course I'm no hw engineer, so someone feel free to correct my line of thinking. :p

I think to some extent it depends on what jobs you see an SPU doing...

Basically I've heard two different schools of thought.

The "write all your SPU code in assembly" school, where SPUs are used for expensive (FPU-wise) simple stuff, like vertex transforms, audio and cloth dynamics; in this school I agree with you.

The other school is the "put anything you can run on an SPU on an SPU, even if it runs slower there" school, the logic being that the PPE is extremely weak and will likely be the bottleneck in any real application, so anything you can put on an SPU is a win. In this scenario you have to consider the SPU to be a crippled conventional processor, so I think the argument goes more towards a conventional-style ISA.

FWIW I think the sweet spot is somewhere in the middle.
 
Fafalada said:
IMO VLIW would make things simpler issue/execution wise, provide instruction space for a better featured ISA, and because of the latter, also wouldn't tip the balance of average code size in any significant manner, since it could eliminate unnecessary bloat (it might even improve it in some cases).

There are also very good reasons not to make it VLIW. The SPE has 6 different execution units/pipelines. So you'd need 6 issue slots in your VL instruction word, quite a bit of bloat if they can't be used effectively.

Also, the complexity of the result-forwarding network increases dramatically: since 6 instructions can be issued every cycle, 6 results can be generated. The exec units also need data, so you need many more read ports in your register file, compromising operating frequency or increasing the pipeline length even further.

Cheers
 
Gubbi said:
So you'd need 6 issue slots in your VL instruction word, quite a bit of bloat if they can't be used effectively.
Not sure if I'm reading you right - but I was thinking more along the lines of VLIW with the same 2-way issue capacity that SPEs already have - so most of the existing execution resources should not change much, if at all.

The reason is that currently the SPE already needs instructions aligned in a fashion equivalent to that of a 2-way VLIW to dual-issue, so most optimized code would not even change in size (or rather, VLIW could make it smaller if the ISA was extended appropriately to take advantage of the extra instruction space).
As ERP notes, it's not entirely clear that one or the other is always a win, but personally I have my preference here.
 
Fafalada said:
Not sure if I'm reading you right - but I was thinking more along the lines of VLIW with the same 2-way issue capacity that SPEs already have - so most of the existing execution resources should not change much, if at all.

I think I get what you mean. You want the issue slot to implicitly decide what instruction can be issued in that slot, so that the same opcode can be used for different instructions in different slots.

This would give you one extra (implicit) bit of opcode space, right?

Traditionally VLIW instruction issue is controlled by the compiler, and the instruction word issues instructions to all exec units every cycle.

Cheers
 
Not to derail you guys' conversation, but I have some more questions if you would be so kind as to answer...
1. Just in case you missed this: OK, so I thought fillrate was based on resolution (how many pixels a GPU can draw). So what, fillrate refers to every pixel/texel a GPU can draw?
2. What is this 2-way/dual-issue stuff you speak of?
3. Are VLIW/SIMD different modes of a processor, or types of code?
4. What is the purpose of pipelines in GPUs, and what's their tie to fillrates, i.e. the 16 pipelines in PS2? I always thought of pipelines working like pistons in an engine with the gasoline vapor as data; would I be wrong in this case?

Please be kind and answer! :oops:
 
Gubbi said:
I think I get what you mean. You want the issue slot to implicitly decide what opcodes can be issued in that slot, so that the same opcode can be used for different instructions in different slots.
This would give you one extra (implicit) bit of opcode space, right?
Yup, this is what I had in mind. Given that the alignment restriction already exists in the SPE (it breaks co-issue), it would IMO make sense to take advantage of the extra opcode space.

Traditionally VLIW instruction issue is controlled by the compiler, and the instruction word issues instructions to all exec units every cycle.
Yeah, I forgot about that - I guess since I've only had limited experience with VLIWs in the past, most of it being on the PS2 VUs.
 
pixelbox said:
1. Just in case you missed this: OK, so I thought fillrate was based on resolution (how many pixels a GPU can draw). So what, fillrate refers to every pixel/texel a GPU can draw?
It refers to the number of pixels it can draw (peak) per time unit. In the case of Xenos, it's got 8 rasterizers and it's clocked at 500MHz (or close enough anyway), giving a theoretical max fillrate of 4Gpix/s. Considering the eDRAM framebuffer, I'm guessing it would probably be possible to come reasonably close to this figure when drawing simple polygons.

Likewise, from what we've guessed regarding RSX in PS3, and given some preliminary performance claims from Sony, if it's more or less a modified PC GPU (perhaps a big if), it would have 24 rasterizers running at 550MHz, giving a highly theoretical fillrate of 13.2Gpix/s. Considering its framebuffer bandwidth, each pixel filled couldn't even consume 2 bytes of bandwidth, which is obviously impossible. Modern GPUs don't even support accelerated 8-bits-per-pixel display modes, I might add...

This just goes to show, btw, that fillrate is a highly archaic method of measuring performance. If PS3 did have over 13Gpix/s of fillrate, it could redraw a 1080p screen 106 times per frame at 60fps - a ridiculous and uselessly high figure.
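For reference, the arithmetic behind that 106 figure, using the thread's own numbers:

```c
#include <stdio.h>

int main(void)
{
    double fill   = 13.2e9;          /* claimed peak pixels/second */
    double pixels = 1920.0 * 1080.0; /* one 1080p frame            */
    double fps    = 60.0;
    /* prints ~106: full-screen redraws available per 60fps frame */
    printf("redraws per frame: %.0f\n", fill / fps / pixels);
    return 0;
}
```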

2. What is this 2-way/dual-issue stuff you speak of?
It refers to the ability of a CPU to execute two separate instructions at once, usually with some restrictions applied, such as the data these instructions refer to having to be available at the time of execution, etc. ;) Also, not all CPUs are symmetrical, so not all instructions might be executable by either unit; this is quite common actually, and includes the SPUs in Cell, for example, where one pipe deals with (most) maths instructions and the other with loads/stores and other stuff.

3. Are VLIW/SIMD different modes of a processor, or types of code?
The latter. VLIW is when the compiler re-arranges the code and attempts to bunch up as many instructions as possible into pre-manufactured blocks that can all be executed at once, in parallel, in the CPU. The CPU therefore doesn't have to have any out-of-order execution hardware, as that bit has already been taken care of by the compiler. Theoretically, the transistors saved can then be spent on more parallel execution units for greater performance. In reality, VLIW has been a disappointment when used in a general-purpose processor (think: Itanic).

SIMD is merely performing the Same Instruction on Multiple pieces of Data, such as multiplying 5 with 11, 8, 23 and 7, for example. :) This differs from the above in that VLIW bunches up many, usually different, instructions.

4. What is the purpose of pipelines in GPUs
Well, a pipeline is merely a fancy-schmancy term for an assembly line. That's what it does, really: it breaks down a big task (such as drawing a shaded pixel of a polygon into a framebuffer) into many little tasks so that it can be finished in a step-by-step fashion. Pipelining in microprocessors has been around for decades, originally in CPUs and then in other devices as the microchip became more widely used. Superpipelining in CPUs was when true assembly-line execution became possible; originally one instruction had to move fully through the pipeline before the next could enter. Superpipelined CPUs (now a redundant term, as all chips feature this these days) feed new instructions into the pipe as soon as an open slot appears.

and what's their tie to fillrates, i.e. the 16 pipelines in PS2? I always thought of pipelines working like pistons in an engine with the gasoline vapor as data; would I be wrong in this case?
Heh, well, the analogy is a bit crude, but essentially correct I suppose. I'd say though that data isn't really the gasoline, because it's not data that drives the GPU; it's the other way around. So say it's a 16-cylinder air compressor instead. :)
 
Isn't it right to say that each pixel pipeline works on one pixel per clock? Hence the maximum pixel output of a GPU relative to pipes is (pipes × frequency).
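Plugging in the Xenos numbers quoted earlier as a sanity check (peak figures only; real fillrate is usually lower):

```c
#include <stdio.h>

int main(void)
{
    double pipes = 8.0;     /* Xenos pixel pipes, from above */
    double clock = 500e6;   /* 500 MHz                       */
    /* one pixel per pipe per clock -> 4.0 Gpix/s peak */
    printf("peak fill: %.1f Gpix/s\n", pipes * clock / 1e9);
    return 0;
}
```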

The best analogy for computers is water passing through a plumbing system. Water represents data and is constricted in flow by pipe size (bandwidth) and number of pipes. Tanks store water (RAM, SRAM, cache etc.) to pass to other parts of the system. In the case of a GPU, you'll have (I think) one pipe into the system that carries both Strawberry water (vertex data) and Raspberry water (pixel data). Both flavours are sent together and arrive in separate sets, but get mixed together and leave the GPU as Orange flavour (a shaded pixel). It'd probably make more sense to think of it as paint and consider the analogy of data processing to colour mixing, but still...

The speed at which you can produce Orange water depends on the speed at which you can mix Strawberry and Raspberry water. If the flow of either flavour into the GPU is too slow, the GPU has to wait. If the pipe out of the GPU carrying the Orange-flavour water away is too small, the GPU's tanks will not drain quickly and the Strawberry and Raspberry inflows will be held up. These are bottlenecks. You need to balance inflows and outflows. 100 taps at 1 litre a minute will fill a tank faster than 1 tap at 50 litres a minute. 16 pixel pipes at 100 MHz will process pixels faster than 4 pixel pipes at 350 MHz (all things being equal).

Plus, while I'm on the subject, if you need to get water from Tank A to Tank C via Tank B, the rate at which you can fill Tank C is the lowest flowrate.

Tank A >> 10 l/m >> Tank B >> 2 l/m >> Tank C

In this case, Tank C will fill at a rate of 2 litres a minute regardless of the Tank A >> Tank B flow (bandwidth). This shows that numbers like total system bandwidth are meaningless. The above system has 12 l/m of 'aggregate' flow (total system BW). This system...

Tank A >> 4 l/m >> Tank B >> 4 l/m >> Tank C

Has only 8 l/m aggregate, but fills Tank C twice as fast as the other system. Now you can think of Tank A as the CPU producing data, Tank B as the RAM, and Tank C as the GPU. The ability to process graphics depends on the minimum throughput, or the bottleneck.
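In code, the whole tank picture reduces to one rule: end-to-end throughput is the smallest link, not the sum (a trivial sketch):

```c
/* End-to-end throughput is the smallest link, not the sum of links. */
static double throughput(const double *links, int n)
{
    double min = links[0];
    for (int i = 1; i < n; i++)
        if (links[i] < min)
            min = links[i];
    return min;
}
/* e.g. links {10.0, 2.0} -> 2.0 l/m even though the "aggregate" is
   12, while links {4.0, 4.0} -> 4.0, twice as fast. */
```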

Of course, a console is much more complicated than a three-tank system. You have loads of tanks (stores) of widely varying capacity, like RAM, caches, registers, and optical storage. You'll have lots of pipelines which may not all be running at maximum capacity, and you'll have flows of data that need to be managed to prevent a 'blockage'. These are the problems the engineers face when designing the system, and to understand a system you have to understand the whole thing and the way its components link together. Figures like 'Total System Storage Size' and 'Total Flow Rate' may make for nice big numbers, but they tell you absolutely nothing about how quickly the flow of data can pass through, be processed, and be output.
 
I wonder how you handle latency in your pipe/tank analogy... Could it be the pipe length?
 