ATI R500 patent for Xenon GPU?

Then it seems that ALU is used for much more than addressing.
If just the first cycle can be hidden, it would be nice to know how many cycles (on the ALU) a 'standard' texture fetch (2D, 32-bit texture) takes.
In that test I can't find any info about what kind of texture fetch is being tested.
 
psurge said:
What about disallowing control flow (besides predicated instructions) and texture ops inside an ALU clause?
The patent seems pretty clear to me: an ALU clause consists of nothing but ALU operations.

The arbiters are able to pack the ALU's pipeline so that the ALUs are never sitting idle (unless of course no command thread requires ALU ops).

The patent only appears to be explicit about ALU pipeline length (8 cycles) when referring to an embodiment that uses two arbiters, which take it in turns to put command thread ALU instructions into the ALU pipeline. I interpret this to mean that two separate command threads are being interleaved, and each command thread is broken up into a maximum of 4 ALU instructions per clause.

But the focus of the patent isn't on ALU pipelining or the number of ALU operations that can be performed whilst waiting for a texture operation to complete (say).

I interpret the "graphics processing engine" as the shader core for ALL non-ALU instructions, not just texture instructions. There's more in there than just texturing and texture address calculation.

Basically an ALU clause would look like this :

Code:
{
    math_op1          // one or more back-to-back ALU operations
    math_op2
    ...
    math_opN
    [texture_op]      // optional: at most one texture op, ending the clause
    [control_flow_op] // optional: control flow also terminates the clause
}

A thread (in this case per pixel or per vertex) is returned to the reservation station at the end of each clause.
Yes, the patent is explicit about that. This applies to both ALU clauses and graphics processing clauses.

You have enough information available to determine:
- whether the thread is waiting for a texture
- which clause it should continue with, and whether or not this clause has been locally cached
Agreed.

Jawed
 
Save the Nanosecond

Page 38 said:
Support for virtual memory
- So texture downloads are much more efficient
- Now only those pages of the relevant mip levels will be present
--- Contrast that with the current situation where all of every mip level is required to be present in VPU-accessible memory before the first texel is filtered...
- And DX Next has the notion of graphics hardware contexts with maximum context switch times
- VM may also include write capabilities

So that sounds like a massive reduction in texture bandwidth demand. And it might support rendering to textures.

So is this coming to R500? Is a unified shader architecture dependent upon virtual memory to make it viable?

Jawed
 
Jawed - the thing is, dependent texture reads are an ALU instruction (at least, they depend on an ALU register for texture addresses and gradient instructions for LOD). The whole patent is pretty unclear to me - I can't tell whether the arbiter simply dispatches threads to a scheduler, or actually dispatches individual instructions to the ALU(s)...

As far as the interleaving of instructions goes - I have one more guess: maybe each instruction is issued 4 times (once for each pixel in a quad). Interleaving two threads means each instruction gets at least 8 cycles to complete before a dependent instruction is issued. If the ALU pipeline is 8 stages deep and this is limited to register read, execution, and write-back, then it seems like clock frequency could be fairly high...
 
psurge said:
Jawed - the thing is, dependent texture reads are an ALU instruction (at least, they depend on an ALU register for texture addresses and gradient instructions for LOD).
Jaws and I have been discussing a texture address calculation ALU (which we decided to label "tALU"). I don't understand if the tALU would be enough (or the right thing) to also do gradients. I'm stuck not really understanding how gradients are done.

Jaws and I haven't even decided whether the much-rumoured 3xALU+1xTMU architecture includes the tALU in the count of the three ALUs.

More annoyingly, what isn't clear is whether this ALU forms part of the ALU engine, or whether it is located within the graphics processing engine (which appears to be where the TMU is).

It's not clear to me whether the arbiter would want to fire and forget a TMU instruction (whether or not it requires an address calculation) or whether it's better to perform texture address calculation in the ALU engine in parallel with (if possible) math ALU operations. Then hold the resulting address in a register and transfer the command thread back to its reservation station, where at some point it'll be picked up by the arbiter for execution by the graphics processing engine, and the texture address will be ready in the register to be used by the TMU.

The whole patent is pretty unclear to me - I can't tell whether the arbiter simply dispatches threads to a scheduler, or actually dispatches individual instructions to the ALU(s)...

The arbiter is the scheduler. Well, scheduling also requires the reservation stations (which are essentially glorified FIFOs); these simply record the status of the command threads for all objects (pixels or vertices) that are currently in flight (1000?). By status we're talking about "this command thread is executing" or "this command thread needs to run some ALU instructions".

After a command thread has spent some time being executed, it is returned into the reservation station whence it came. This is done by updating the status of the command thread in the reservation station, e.g. marking it with "needs texturing". The state of the command thread is simply parked in memory: the registers, the loop indexes, the texturing results, etc. For good performance (i.e. minimal start and stop latency for each command thread), it seems to me that in Xbox 360 this'll be how the EDRAM is used. The actual instructions for a command thread also need to be stored somewhere, as the reservation stations merely record the status of each command thread.

It's not clear what all the limit conditions for command thread execution are, but clearly if an ALU engine runs out of ALU operations (say) or an engine runs into the execution of a dynamic branch which doesn't follow the predicted path, or an ALU engine has run all instructions leading up to a texturing operation, then that command thread will exit its engine (park its state in memory and mark the relevant slot in the reservation station with the latest status).

I suspect the code for the command threads will be held in EDRAM. It might make sense for the GPU to decode instructions into engine microcode and store that, so that when an engine runs the command thread, pipeline time is not wasted on decoding (keeping the pipeline short). That would also help the GPU to efficiently run the same shader on many hundreds of objects at a time, e.g. by making all copies of a command thread point at the micro op code held in a single place and therefore only decoded once. (The patent says nothing about this.)

With a simple scan of each reservation station from oldest entry to youngest, the arbiter can readily identify which command threads are ready to execute when the next ALU pipeline becomes ready to accept instructions. Similarly the arbiter knows which command threads are ready for the graphics processing engine, when it's ready. The actual instructions that constitute a command thread will be found in memory. I don't think the patent explicitly describes whether the instructions pass through the arbiter to get from memory into the engine. I'm not sure if that's relevant.

What is relevant is the arbiter is responsible for keeping each engine's pipeline full. So the arbiter needs to know when to issue the next command thread to the engine.

After that, the arbiter is responsible for ensuring that neither reservation station becomes empty. Presumably, for example, if it sees that the pixel reservation station is emptying faster or more frequently than the vertex reservation station, then it needs to prioritise vertex command threads over pixel command threads.

The patent is deliberately vague about scheduling.
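
To make my reading concrete, here's a toy sketch of an arbiter plus status-only reservation station. Every name and field in it is my invention, not the patent's:

Code:
# Hypothetical sketch: a reservation station as a FIFO of status-only entries,
# with thread state (registers etc.) parked elsewhere in memory.
from collections import deque

NEEDS_ALU, NEEDS_TEXTURE, EXECUTING = "needs_alu", "needs_texture", "executing"

class Entry:
    def __init__(self, thread_id, status, next_clause, state_addr):
        self.thread_id = thread_id      # which pixel/vertex command thread
        self.status = status            # a few bits of status, nothing more
        self.next_clause = next_clause  # which clause to continue with next
        self.state_addr = state_addr    # where the registers etc. are parked

class ReservationStation:
    def __init__(self):
        self.entries = deque()          # oldest entries at the front

    def pick(self, wanted_status):
        # scan from oldest to youngest for a thread ready for the given engine
        for e in self.entries:
            if e.status == wanted_status:
                return e
        return None

def arbitrate(station, alu_ready, gfx_ready):
    # the arbiter's only job here: keep both engines' pipelines fed
    if alu_ready:
        e = station.pick(NEEDS_ALU)
        if e:
            e.status = EXECUTING        # issue its next ALU clause
    if gfx_ready:
        e = station.pick(NEEDS_TEXTURE)
        if e:
            e.status = EXECUTING        # issue its texture/graphics clause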

As far as the interleaving of instructions goes - I have one more guess: maybe each instruction is issued 4 times (once for each pixel in a quad). Interleaving two threads means each instruction gets at least 8 cycles to complete before a dependent instruction is issued. If the ALU pipeline is 8 stages deep and this is limited to register read, execution, and write-back, then it seems like clock frequency could be fairly high...

Agreed on all that.

The patent is quite explicit that a 2-thread interleaving (sourced by two arbiters on one pair of ALU/graphics-processing engines) is one possible interpretation of the patent. Your proposed interleaving seems possible too. The key thing is to submit ALU operations to utilise as many ALUs per cycle as possible and to ensure there's no gaps in the pipeline.
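
Your quad-issue guess, sketched as toy code (entirely made up by me, the patent says nothing like this):

Code:
def issue_order(num_instructions, threads=("A", "B"), quad=4):
    # one entry per ALU pipeline slot: instruction i of thread t for pixel px
    order = []
    for i in range(num_instructions):
        for t in threads:
            for px in range(quad):
                order.append("%s.i%d.px%d" % (t, i, px))
    return order

print(issue_order(2))
# A.i0 occupies cycles 1-4 (one pixel per cycle), B.i0 cycles 5-8, and A.i1
# (possibly dependent on A.i0) is not issued until cycle 9, by which time
# A.i0 for pixel 0 has cleared the 8-deep pipeline.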

It may turn out that there's no interleaving.

Who knows. The patent is sufficiently vague. Its main focus is the arbiter and with it the two reservation stations and the two engines. The scheduling and multi-threading is centred on the arbiter with the reservation stations merely acting as scheduling data structures.

Whilst the patent is moderately specific about the ALU engine, it's annoyingly vague about the graphics processing engine. I like to think of the graphics processing engine as performing all shader instructions that the ALU engine can't.

Jawed
 
I wrote a post on this in another thread, but can't figure out how to link to the exact one, so: I don't think you have to monitor how empty or full the reservation stations are at all. If you issue threads like this:

Thread 1: If there are any vertex threads ready, issue the oldest available one. Otherwise, if there are pixel threads available, issue the oldest waiter.

Thread 2: If there are any remaining pixel threads ready, issue the oldest available one. Otherwise (if there is one), issue the oldest waiting vertex thread.

A vertex thread can't complete (be removed from the reservation station) until it is accepted by the rasterizer, which presumably accepts at most 1 vertex per cycle and rasterizes at most 1 triangle per cycle. So - if the pixel load is heavy, all vertex threads will be stalled on the rasterizer, and only pixel threads get issued. If the vertex load is heavy, the pixel station will be relatively empty and more vertex work gets scheduled...

Maybe I'm missing something, but it seems to me that just by limiting the size of the vertex reservation station and forcing vertices through the rasterizer, memory/rasterizer pressure automatically gives you load-balancing (between vertices and pixels) without having to do anything fancy at the arbitration stage.
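
A throwaway sketch of that two-slot policy, with invented names (none of this is from the patent):

Code:
def issue(vertex_ready, pixel_ready):
    # vertex_ready / pixel_ready: lists of runnable threads, oldest first
    issued = []
    # Slot 1: prefer the oldest ready vertex thread, else the oldest pixel
    if vertex_ready:
        issued.append(vertex_ready.pop(0))
    elif pixel_ready:
        issued.append(pixel_ready.pop(0))
    # Slot 2: prefer the oldest remaining pixel thread, else a vertex
    if pixel_ready:
        issued.append(pixel_ready.pop(0))
    elif vertex_ready:
        issued.append(vertex_ready.pop(0))
    return issued

# Vertex threads stalled on the rasterizer never appear in vertex_ready,
# so a heavy pixel load starves the vertex slot without any extra logic.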

So while I agree that the arbiter schedules threads, this is different from scheduling actual instructions. Anyway, just my 2c

Serge
 
Jawed - no not that one, but I'll try and explain it anyway. Sorry if my English is confusing - I have no excuse since it is my first language.

Unless a thread can be moved from one ALU to a completely different ALU, there is no point to having global storage for thread state (because if a thread stays local to an ALU you might as well keep all of its state nearby).

The cost of moving thread state from one of the ALUs to another (or to and from a global reservation station) involves moving data across a significant portion of the chip. This will take a variable number of clock cycles, and I assume it would also consume a non-trivial amount of power. A simple shader using 4 temp registers would require moving 64 bytes for the temp registers alone. So a thread is basically out of action for twice the amount of time it takes to do one of these transfers. Imagine a shader which performs say 4 math ops per texture op. Each texture op will most probably cause the thread to be returned to the global reservation station, which means 2 transfers of the entire thread state for every 4 math instructions performed.
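
Putting rough numbers on that (assuming 128-bit registers and the 4-math-ops-per-texture-op shader above):

Code:
regs = 4                      # temp registers used by the simple shader
reg_bytes = 16                # one 128-bit register
state = regs * reg_bytes      # 64 bytes of temps per thread
transfers = 2                 # out to the global station and back again
math_ops = 4                  # math ops performed between texture ops
print(state, state * transfers / math_ops)   # 64 bytes of state, 32 bytes moved per math op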

With such a high number of threads trying to enter and exit the global reservation station, you are going to have to provide large buses for data transfer, deal with contention on these buses, include logic that tracks threads that are "in-transit", and maybe even provide multiple ports to the reservation station memory.

A single reservation station capable of holding N threads is going to be much larger than M stations each holding N/M threads. This means it takes longer to search through the reservation station. On top of this, a global arbiter has to be able to find work at a higher rate than a smaller local arbiter, since it is in charge of all the ALUs instead of just a couple.

Finally, if you have a defect in the global arbiter, you have to toss the entire chip.

So basically, why take this approach when multiple local reservation stations avoid or mitigate all these problems?

Anyway - that's my argument as to why global structures are a bad idea.
 
Remember that the reservation stations only hold status for each command thread. Not state. The registers etc. have to be held somewhere else, outside of the reservation stations. The amount of data involved in a command thread is too large to go into a reservation station.

Well, that's my interpretation of the patent.

I agree there's some serious buffering required to hold the state of (approximately 1000, say?) command threads while they're waiting. Yep, I think you're right that there has to be locality between the buffer and the ALU/graphics processing engines. And your point about defects is a good one.

Jawed
 
Jawed -

Why hold the status in a global structure if you aren't going to do the same for other state (e.g. registers)? The only pro I can see to a global structure is that it could move threads across ALUs (and therefore get better resource utilization). But, as I said, IMO the cost of moving a thread would negate those benefits.

Somewhat related: I think that space available in the reservation station has to be determined at least partly by program register demands. PS3.0 specifies that a program can use up to 32 128-bit temp registers. 64 full-size thread contexts would require 32 kB of memory (roughly the size of an L1 D-cache for a modern CPU). So, I imagine the actual local register file would contain far fewer registers - 256 (64 threads*4 regs, 32 threads*8 regs, etc...) seems reasonable given how many are available in CPUs.
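
The arithmetic behind those figures, for anyone checking my numbers:

Code:
reg_bytes = 16                            # one 128-bit temp register
full_context = 32 * reg_bytes             # PS3.0 worst case: 512 bytes per thread
print(64 * full_context // 1024)          # 32 kB for 64 full-size contexts
print(64 * 4 * reg_bytes)                 # 4096 bytes if threads average 4 temps (256 registers total)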
 
"Status" means things like "executing" or "needs ALU" or "completed". That's just a few bits per command thread, and I imagine other status information only amounts to a fairly small amount of data too. There'll be lots of objects in flight, hundreds, thousands, who knows?... But obviously, only 64 being executed at a time.

I'm not suggesting that a reservation station must be a global data structure, I merely mentioned that it might be (days ago, now...). I don't think it's a big deal frankly. There may be only 1 byte of status data per command thread. However much data there is, it won't be a massive data structure.

It might be fruitful to think of the arbiters as micro-cores, and the reservation stations as registers in those micro-cores. Dunno... Obviously in that case the reservation stations aren't a single global data structure.

"State" means the registers, texturing results and other stuff that is uniquely produced by a command thread. Because of the sheer quantity of this data (I assume), the patent specifies that this data is not stored in reservation stations.

Rumours suggest Xbox 360 will have 10MB+ of EDRAM on the GPU die. Maybe some (or all) of that is consumed supporting the command threads.

Maybe the EDRAM is split into (16) localised blocks, one block per arbiter, so that there's uncontended bandwidth available to each arbiter+engines.

The remaining problem, then, is to interface, say, 16 distinct blocks of EDRAM with Xbox 360's general RAM. This would appear to be where Kaleidoscope (the ring network) comes into play. The GPU might have a 4-port interface to general RAM, with each port distributed one quarter of the way round the network. Between each port you've got 4 unified shaders and their associated EDRAM.

Plainly the memory used by the engines for registers and so on would need to be tightly integrated - so this is extra memory (for 64 executing command threads) beyond whatever EDRAM commitment there would need to be for the 1000-odd command threads that are waiting.

We're going way beyond what the patent says now :LOL:

Jawed
 
Well I think the eDRAM is going to be used for color/stencil/z-buffer storage (possibly textures). If it isn't, I have a hard time seeing where the bandwidth for these things is going to come from...

Anyway, the state that needs to be kept around for a pixel or vertex is basically just registers:
- temp registers (including predicate/loop/program counter)
- input registers
- output registers

You can probably get fancy with the input/output registers, since the output "registers" are basically the post T&L cache and the ROPs. Input registers might have to be kept around, adding another 16 (I think) registers to the set of temps.
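
A rough footprint, assuming worst-case PS3.0 register usage and that the inputs really do have to be kept around:

Code:
reg_bytes = 16
temps = 32 * reg_bytes        # 512 bytes: worst-case temps (incl. predicate/loop/PC)
inputs = 16 * reg_bytes       # 256 bytes, if interpolated inputs can't be regenerated
per_thread = temps + inputs   # 768 bytes per parked thread
print(per_thread, 1000 * per_thread // 1024)   # ~750 kB to park ~1000 threads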

Jawed said:
We're going way beyond what the patent says now :LOL:

Jawed

And I had fun doing it :D
 
[Image: xbox2_scheme_bg.gif]


Leak said:
...
The Xenon GPU is a custom 500+ MHz graphics processor from ATI. The shader core has 48 Arithmetic Logic Units (ALUs) that can execute 64 simultaneous threads on groups of 64 vertices or pixels. ALUs are automatically and dynamically assigned to either pixel or vertex processing depending on load. The ALUs can each perform one vector and one scalar operation per clock cycle, for a total of 96 shader operations per clock cycle. Texture loads can be done in parallel to ALU operations. At peak performance, the GPU can issue 48 billion shader operations per second.

The GPU has a peak pixel fill rate of 4+ gigapixels/sec (16 gigasamples/sec with 4× antialiasing). The peak vertex rate is 500+ million vertices/sec. The peak triangle rate is 500+ million triangles/sec. The interesting point about all of these values is that they’re not just theoretical—they are attainable with nontrivial shaders.

Xenon is designed for high-definition output. Included directly on the GPU die is 10+ MB of fast embedded dynamic RAM (EDRAM). A 720p frame buffer fits very nicely here. Larger frame buffers are also possible because of hardware-accelerated partitioning and predicated rendering that has little cost other than additional vertex processing. Along with the extremely fast EDRAM, the GPU also includes hardware instructions for alpha blending, z-test, and antialiasing.
...
...
Eight pixels (where each pixel is color plus z = 8 bytes) can be sent to the EDRAM every GPU clock cycle, for an EDRAM write bandwidth of 32 GB/sec. Each of these pixels can be expanded through multisampling to 4 samples, for up to 32 multisampled pixel samples per clock cycle. With alpha blending, z-test, and z-write enabled, this is equivalent to having 256 GB/sec of effective bandwidth! The important thing is that frame buffer bandwidth will never slow down the Xenon GPU.
...
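
For what it's worth, here's one way the leak's headline numbers hang together. The factor of 2 for read-modify-write on the "effective" bandwidth is my assumption:

Code:
clock = 0.5e9                          # 500 MHz
print(48 * 2 * clock / 1e9)            # 48.0 billion shader ops/sec: 48 ALUs * (1 vector + 1 scalar)
print(8 * 8 * clock / 1e9)             # 32.0 GB/s EDRAM write: 8 pixels * 8 bytes (colour + z) per clock
print(32 * 8 * 2 * clock / 1e9)        # 256.0 GB/s "effective": 32 samples * 8 bytes * read-modify-write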


Option 1:

US = 3 ALU + (tALU + TMU)

R500 = 16 US

So for option 1 US unit,

1. US unit is 4-way SMT
2. US unit is 6-issue

(i.e. 48 billion instructions @ 500 MHz and 64 threads total)

The patent mentions Arbiters can accept input from other Arbiters, so we may see groups/ clusters of say 4 US units, which share Reservation Stations (i.e. L1 Reservation Station analogy) and 4 groups of these making up the R500 (say L2 Reservation Station analogy)...

At this moment I'm leaning towards option 1. Also, the backend can only handle 8 output pixels per cycle written to the framebuffer. MSAA down-sampling hardware present alongside the eDRAM unit downsamples 32 samples per cycle and writes 8 anti-aliased pixels per cycle (4xMSAA?), giving a 4 GPixel/sec fillrate.

1280*720* (8 Byte/pixel) ~ 7.4 MB

That would leave 2.6 MB for other stuff on eDRAM...?

And that 10+ MB eDRAM module definitely looks like it's a separate chip with its own custom logic, i.e. z/stencil, alpha, MSAA etc...

I'm guessing an R500 with 16 US units, each US unit as above ~ 15-20 mm2 die area @ 90nm process

So 16 US units, R500 ~ 240-320 mm2 @ 90nm

That's a huge chip! And if you're gonna add another 10+ MB of eDRAM, that's GINORMOUS! :oops: ...So the eDRAM module looks like a separate chip with its 16 + 32 GB/s read/write bandwidth (48 GB/s?)...
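
Checking my own sums above (the 6-issue width, thread count and die areas are guesses, as stated):

Code:
us_units, issue_per_us, threads_per_us = 16, 6, 4
print(us_units * issue_per_us * 0.5)     # 48.0 billion instructions/sec at 500 MHz
print(us_units * threads_per_us)         # 64 threads in flight across the chip

fb = 1280 * 720 * 8                      # colour + z at 8 bytes per pixel
print(round(fb / 1e6, 1))                # 7.4 MB, leaving ~2.6 MB of the eDRAM free

print(us_units * 15, us_units * 20)      # 240 to 320 mm^2 at 15-20 mm^2 per US unit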
 
Jaws said:
The patent mentions Arbiters can accept input from other Arbiters, so we may see groups/ clusters of say 4 US units, which share Reservation Stations (i.e. L1 Reservation Station analogy) and 4 groups of these making up the R500 (say L2 Reservation Station analogy)...

I can't find anywhere in the patent that says arbiters can be connected directly to each other. :(

Also, I have a sneaky feeling that the tALU lives inside the ALU engine, and it counts as one of the 3 ALUs per TMU. But that's just me being cynical.

Jawed
 
Jawed said:
Jaws said:
The patent mentions Arbiters can accept input from other Arbiters, so we may see groups/ clusters of say 4 US units, which share Reservation Stations (i.e. L1 Reservation Station analogy) and 4 groups of these making up the R500 (say L2 Reservation Station analogy)...

I can't find anywhere in the patent that says arbiters can be connected directly to each other. :(

Also, I have a sneaky feeling that the tALU lives inside the ALU engine, and it counts as one of the 3 ALUs per TMU. But that's just me being cynical.

Jawed

Patent said:
...
[0022] Not illustrated in FIG. 4, in one embodiment an input arbiter provides the command threads to each of the first reservation station 302 and the second reservation station 304 based on whether the command thread is a pixel command thread, such as thread 312, or a vertex command thread, such as thread 318. In this embodiment, the arbiter 306 selectively retrieves either a pixel command thread, such as command thread 316, or a vertex command thread, such as command thread 322.
...

I may have read it wrong but the above gives me that impression with the 'input' Arbiter. Also the 'two' Arbiter embodiment gives me the impression that Arbiters can share Reservation Stations. But that's just me. Also, I've seen a Sony Pixel Engine patent (the die shot on the previous page) that also distributes its bus layout in that fashion.

Well, the tALU, I put alongside the TMU because currently the R420 ratio of TMU:tALU is 1:1. By putting it alongside the R500 ALU, it's 1:3. So that's the madness behind my logic of keeping it alongside the TMU! :p
 
Jaws said:
Patent said:
...
[0022] Not illustrated in FIG. 4, in one embodiment an input arbiter provides the command threads to each of the first reservation station 302 and the second reservation station 304 based on whether the command thread is a pixel command thread, such as thread 312, or a vertex command thread, such as thread 318. In this embodiment, the arbiter 306 selectively retrieves either a pixel command thread, such as command thread 316, or a vertex command thread, such as command thread 322.
...

I may have read it wrong but this gives me that impression.
Hmm, I read that as "input arbiter". This input arbiter is sitting on the end of the vertex FIFO and the set-up engine, deciding when to pull data out of each and initiate vertex or pixel command threads. Well, that's how I see it. This should lead to load balancing throughout the remainder of the rendering pipeline.

Also the 'two' Arbiter embodiment gives me the impression that Arbiters can share Reservation Stations. But that's just me. Also, I've seen a Sony Pixel Engine patent (the die shot on the previous page) that also distributes its bus layout in that fashion.
OK, that sounds interesting...

Well, the tALU, I put alongside the TMU because currently the R420 ratio of TMU:tALU is 1:1. By putting it alongside the R500 ALU, it's 1:3. So that's the madness behind my logic of keeping it alongside the TMU! :p
No, I meant that each ALU engine contains 2x general purpose ALUs plus one tALU, thus retaining a 1:1 ratio for TMUs and tALUs. Pure speculation. I'm just being cynical about the capabilities of the ALU engine (i.e. I suspect there's not as much non-texturing ALU power in there as we might hope).

Rummaging around:

http://www.ati.com/products/radeonx800/RADEONX800ArchitectureWhitePaper.pdf

Each pixel shader unit in the RADEON X800 actually consists of five distinct ALUs: two 72-bit floating point vector ALUs, two 24-bit floating point scalar ALUs, and a 96-bit texture address ALU.
So my earlier suggestion, way back in the thread, that X800 actually has 92 ALUs appears to be correct (80 ALUs in the pixel shader pipelines, plus 12 in the vertex shader pipelines).

If R500 is going to be using 48 ALUs, then on average they're going to need to be way more powerful than X800's. Ignoring the 16 tALUs, X800 has 76 ALUs:

12 4D VS2 ALUs
16 3D PS1.4 ALUs
16 3D PS2.0 ALUs
32 1D ALUs

Maybe R500 will have (ignoring tALUs):

- 32 4D ALUs
- 16 1D ALUs

In the vertex shader, you could issue 2 vector ops (4D) with 1 scalar.

In the pixel shader, you could issue 2 vector ops (3D) with up to 3 scalar ops (1D from each vector ALU + the scalar ALU). Or 9 scalar ops, if you treat both 4D ALUs as 4x 1D scalar and use the 1D ALU too.

Bags of power! On average 1.3x more ALU capacity per cycle than R420. What a coincidence...
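
Counting heads, for the record (tALUs excluded in both cases, and the R500 mix is pure guesswork):

Code:
x800 = {"4D VS2": 12, "3D PS1.4": 16, "3D PS2.0": 16, "1D": 32}
print(sum(x800.values()))                # 76 ALUs in X800, as listed above

r500_guess = {"4D": 32, "1D": 16}
print(sum(r500_guess.values()))          # 48 ALUs, matching the rumoured figure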

Jawed
 
Jawed said:
...
Maybe R500 will have (ignoring tALUs):

- 32 4D ALUs
- 16 1D ALUs

In the vertex shader, you could issue 2 vector ops (4D) with 1 scalar.

In the pixel shader, you could issue 2 vector ops (3D) with up to 3 scalar ops (1D from each vector ALU + the scalar ALU). Or 9 scalar ops, if you treat both 4D ALUs as 4x 1D scalar and use the 1D ALU too.

Bags of power! On average 1.3x more ALU capacity per cycle than R420. What a coincidence...

Jawed

Well, the 'leak' mentions that each of the 48 ALUs has a 'vector' and 'scalar' unit. So that's 48 vector and 48 scalar units. I expect the vector unit to be 4-way SIMD (128-bit, 4x 32-bit) and both units to be capable of 32-bit single precision. Also capable of floats and integers.

Ignoring the tALU, TMU + any other hardwired units, I expect these ALUs to be 'beefy' compared to R420's limited-precision ALUs. If you count an FMADD as 2 ops, then,

ALU = 4-way SIMD vector unit + scalar unit ~ 8 Flops/cycle + 2 Flops/cycle ~ 10 Flops/cycle

48 ALUs ~ 480 Flops per cycle

R500 @ 500 MHz ~ 480 * 0.5 GHz ~ 240 GFlops

These 240 GFlops would be fully programmable, 32-bit single-precision flops. This is in the ballpark of a 4 GHz, 8 SPE + PPE CELL ~ 296 GFlops. However, with the R500 you'd get all the hardwired graphics functionality in addition, and it would be a powerful chip with balanced, general-purpose 32-bit shading power.
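
The same sums in code form. FMADD counted as 2 flops throughout, and the CELL breakdown (SPEs plus the PPE's VMX and FPU) is my reading of where 296 comes from:

Code:
alu_flops = 4 * 2 + 1 * 2                # 4-way vector FMADD + scalar FMADD = 10 flops/cycle
print(48 * alu_flops * 0.5)              # 240.0 GFlops for 48 ALUs at 500 MHz

spe = 8 * 8 * 4                          # 8 SPEs * 8 flops/cycle * 4 GHz = 256 GFlops
ppe = (8 + 2) * 4                        # PPE VMX (8 flops) + FPU FMADD (2 flops) at 4 GHz
print(spe + ppe)                         # 296 GFlops for the 4 GHz CELL figure quoted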
 