nAo said:
"Is the texture ALU able to operate during the texture latency? (i.e. still interleave instructions whilst waiting to address the texture)"
AFAIK yes.

It seems as though Xmas thinks not on that one.
nAo said:
"Is the texture ALU able to operate during the texture latency? (i.e. still interleave instructions whilst waiting to address the texture)"
AFAIK yes.

The patent seems pretty clear to me: an ALU clause consists of nothing but ALU operations.

psurge said:
"What about disallowing control flow (besides predicated instructions) and texture ops inside an ALU clause?"
Yes, the patent is explicit about that. This applies to both ALU clauses and graphics processing clauses. Basically an ALU clause would look like this:

Code:
{
    math_op1
    math_op2
    ...
    math_opN
    [texture_op]
    [control_flow_op]
}
A thread (in this case per pixel or per vertex) is returned to the reservation station at the end of each clause.
Agreed. You have enough information available to determine:
- whether the thread is waiting for a texture
- which clause it should continue with, and whether or not this clause has been locally cached
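The flow described over the last few posts can be sketched as a toy scheduler. Everything here (the class names, the op lists, the single "texture data returns for everyone" event) is invented for illustration; the point is just the loop: run a clause, return to the reservation station, and let the arbiter skip threads still waiting on a texture.

```python
# Toy model of the patent's clause-based threading. All names are
# invented; texture latency is modelled as a single event that releases
# every stalled thread at once.

from collections import deque

class Thread:
    def __init__(self, name, clauses):
        self.name = name
        self.clauses = deque(clauses)     # each clause is a list of ops
        self.waiting_on_texture = False

def arbiter_pick(station):
    """Return the first thread not stalled on a texture fetch, if any."""
    for t in station:
        if not t.waiting_on_texture:
            station.remove(t)
            return t
    return None

def run(threads):
    station = list(threads)
    trace = []
    while station:
        t = arbiter_pick(station)
        if t is None:                     # all stalled: texture data arrives
            for s in station:
                s.waiting_on_texture = False
            continue
        clause = t.clauses.popleft()
        trace.extend(f"{t.name}:{op}" for op in clause)
        if clause[-1].startswith("tex"):  # clause ended with a texture op
            t.waiting_on_texture = True
        if t.clauses:                     # more clauses: back to the station
            station.append(t)
    return trace

trace = run([
    Thread("pix0", [["mul", "mad", "tex0"], ["mad"]]),
    Thread("pix1", [["add", "tex1"], ["mov"]]),
])
print(trace)
```

Note how pix1 gets issued while pix0 is waiting on its texture fetch, which is exactly the latency-hiding behaviour being discussed.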
DaveBaumann said:
nAo said:
"Is the texture ALU able to operate during the texture latency? (i.e. still interleave instructions whilst waiting to address the texture)"
AFAIK yes.

It seems as though Xmas thinks not on that one.
Page 38 said:
Support for virtual memory
- So texture downloads are much more efficient
- Now only those pages of the relevant mip levels will be present
--- Contrast that with the current situation where all of every mip level is required to be present in VPU-accessible memory before the first texel is filtered...
- And DX Next has the notion of graphics hardware contexts with maximum context switch times
- VM may also include write capabilities
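A rough back-of-envelope of why page-level residency matters. The texture size, page size, and touched-page count below are made-up example numbers, not figures from the slides:

```python
# Illustrative only: compare keeping a whole mip chain resident (the
# current situation described above) with keeping just the VM pages a
# frame actually touched.

PAGE_BYTES      = 4 * 1024   # assumed 4 KB VM page
BYTES_PER_TEXEL = 4          # RGBA8

def mip_chain_bytes(size):
    """Total bytes for a square RGBA8 texture plus its full mip chain."""
    total = 0
    while size >= 1:
        total += size * size * BYTES_PER_TEXEL
        size //= 2
    return total

full = mip_chain_bytes(2048)           # 22369620 bytes, ~21.3 MB
touched_pages = 40                     # suppose rendering touches 40 pages
partial = touched_pages * PAGE_BYTES   # 163840 bytes, 160 KB

print(full, partial)
```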
psurge said:
"Jawed - the thing is, dependent texture reads are an ALU instruction (at least, they depend on an ALU register for texture addresses and gradient instructions for LOD)."

Jaws and I have been discussing a texture address calculation ALU (which we decided to label "tALU"). I don't understand whether the tALU would be enough (or the right thing) to also do gradients. I'm stuck not really understanding how gradients are done.
The whole patent is pretty unclear to me - I can't tell whether the arbiter simply dispatches threads to a scheduler, or actually dispatches individual instructions to the ALU(s)...
As far as the interleaving of instructions goes - I have one more guess: maybe each instruction is issued 4 times (once for each pixel in a quad). Interleaving two threads means each instruction gets at least 8 cycles to complete before a dependent instruction is issued. If the ALU pipeline is 8 stages deep and this is limited to register read, execution, and write-back, then it seems like clock frequency could be fairly high...
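That guess checks out arithmetically. Quad size, thread count, and pipeline depth here are the post's own assumptions, not confirmed hardware parameters:

```python
# The post's guess: each instruction issues once per pixel of a quad
# (4 issue slots), and two threads are interleaved. All three numbers
# below are assumptions, not known specs.

QUAD_SIZE      = 4    # pixels per quad
THREADS        = 2    # interleaved threads
PIPELINE_DEPTH = 8    # assumed stages: register read, execute, write-back

# Cycles between issuing an instruction and its dependent successor
# from the same thread.
gap = QUAD_SIZE * THREADS

# Latency is hidden if the gap covers the pipeline depth.
latency_hidden = gap >= PIPELINE_DEPTH

print(gap, latency_hidden)   # 8 True
```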
Jawed said:
We're going way beyond what the patent says now
Jawed
Leak said:...
The Xenon GPU is a custom 500+ MHz graphics processor from ATI. The shader core has 48 Arithmetic Logic Units (ALUs) that can execute 64 simultaneous threads on groups of 64 vertices or pixels. ALUs are automatically and dynamically assigned to either pixel or vertex processing depending on load. The ALUs can each perform one vector and one scalar operation per clock cycle, for a total of 96 shader operations per clock cycle. Texture loads can be done in parallel to ALU operations. At peak performance, the GPU can issue 48 billion shader operations per second.
The GPU has a peak pixel fill rate of 4+ gigapixels/sec (16 gigasamples/sec with 4× antialiasing). The peak vertex rate is 500+ million vertices/sec. The peak triangle rate is 500+ million triangles/sec. The interesting point about all of these values is that they’re not just theoretical—they are attainable with nontrivial shaders.
Xenon is designed for high-definition output. Included directly on the GPU die is 10+ MB of fast embedded dynamic RAM (EDRAM). A 720p frame buffer fits very nicely here. Larger frame buffers are also possible because of hardware-accelerated partitioning and predicated rendering that has little cost other than additional vertex processing. Along with the extremely fast EDRAM, the GPU also includes hardware instructions for alpha blending, z-test, and antialiasing.
...
...
Eight pixels (where each pixel is color plus z = 8 bytes) can be sent to the EDRAM every GPU clock cycle, for an EDRAM write bandwidth of 32 GB/sec. Each of these pixels can be expanded through multisampling to 4 samples, for up to 32 multisampled pixel samples per clock cycle. With alpha blending, z-test, and z-write enabled, this is equivalent to having 256 GB/sec of effective bandwidth! The important thing is that frame buffer bandwidth will never slow down the Xenon GPU.
...
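The leak's headline numbers are mutually consistent with the quoted clock (taking 500 MHz as the baseline and 1 GB = 1e9 bytes, as the 32 GB/sec figure implies). A quick cross-check using only figures from the leak itself:

```python
# Cross-check of the leaked Xenon GPU figures (all inputs quoted from
# the leak; 500 MHz is the stated minimum clock).

CLOCK = 500e6                                  # Hz

alus        = 48
ops_per_alu = 2                                # 1 vector + 1 scalar per clock
shader_ops  = alus * ops_per_alu * CLOCK       # 48e9 shader ops/sec

pixels_per_clock = 8
fill_rate   = pixels_per_clock * CLOCK         # 4e9 pixels/sec
sample_rate = fill_rate * 4                    # 16e9 samples/sec with 4x AA

bytes_per_pixel = 8                            # colour + z
edram_write_bw  = pixels_per_clock * bytes_per_pixel * CLOCK   # 32e9 B/sec

print(shader_ops, fill_rate, sample_rate, edram_write_bw)
```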
Jaws said:
The patent mentions Arbiters can accept input from other Arbiters, so we may see groups/clusters of say 4 US units, which share Reservation Stations (i.e. L1 Reservation Station analogy) and 4 groups of these making up the R500 (say L2 Reservation Station analogy)...
Jawed said:
Jaws said:
"The patent mentions Arbiters can accept input from other Arbiters, so we may see groups/clusters of say 4 US units, which share Reservation Stations (i.e. L1 Reservation Station analogy) and 4 groups of these making up the R500 (say L2 Reservation Station analogy)..."
I can't find anywhere in the patent that says arbiters can be connected directly to each other.
Also, I have a sneaky feeling that the tALU lives inside the ALU engine, and it counts as one of the 3 ALUs per TMU. But that's just me being cynical.
Jawed
Patent said:...
[0022] Not illustrated in FIG. 4, in one embodiment an input arbiter provides the command threads to each of the first reservation station 302 and the second reservation station 304 based on whether the command thread is a pixel command thread, such as thread 312, or a vertex command thread, such as thread 318. In this embodiment, the arbiter 306 selectively retrieves either a pixel command thread, such as command thread 316, or a vertex command thread, such as command thread 322.
...
Jaws said:
Patent said:
...
[0022] Not illustrated in FIG. 4, in one embodiment an input arbiter provides the command threads to each of the first reservation station 302 and the second reservation station 304 based on whether the command thread is a pixel command thread, such as thread 312, or a vertex command thread, such as thread 318. In this embodiment, the arbiter 306 selectively retrieves either a pixel command thread, such as command thread 316, or a vertex command thread, such as command thread 322.
...
I may have read it wrong but this gives me that impression.

Hmm, I read that as "input arbiter". This input arbiter is sitting on the end of the vertex FIFO and the set-up engine, deciding when to pull data out of each and initiate vertex or pixel command threads. Well, that's how I see it. This should lead to load balancing throughout the remainder of the rendering pipeline.
OK, that sounds interesting... Also, the 'two' Arbiter embodiment gives me the impression that Arbiters can share Reservation Stations. But that's just me. Also, I've seen a Sony Pixel Engine patent (the die shot on the previous page) that also distributes its bus layout in that fashion.
No, I meant that each ALU engine contains 2x general purpose ALUs plus one tALU, thus retaining a 1:1 ratio for TMUs and tALUs. Pure speculation. I'm just being cynical about the capabilities of the ALU engine (i.e. I suspect there's not as much non-texturing ALU power in there as we might hope).

Well, the tALU, I put alongside the TMU because currently the R420 ratio of TMU:tALU is 1:1. By putting it alongside the R500 ALU, it's 1:3. So that's the madness behind my logic of keeping it alongside the TMU!
Each pixel shader unit in the RADEON X800 actually consists of five distinct ALUs: two 72-bit floating point vector ALUs, two 24-bit floating point scalar ALUs, and a 96-bit texture address ALU.

So my earlier suggestion, way back in the thread, that X800 actually has 92 ALUs appears to be correct (80 ALUs in the pixel shader pipelines, plus 12 in the vertex shader pipelines).
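The 92-ALU tally follows directly from the quoted breakdown (16 pixel pipes at 5 ALUs each, plus the 12 vertex ALUs as stated in the post):

```python
# Checking the X800 ALU count from the breakdown quoted above.

pixel_pipes         = 16
alus_per_pixel_pipe = 5    # 2 vector + 2 scalar + 1 texture-address ALU
vertex_alus         = 12   # as stated in the post, not broken down here

total = pixel_pipes * alus_per_pixel_pipe + vertex_alus
print(total)   # 92
```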
Jawed said:...
Maybe R500 will have (ignoring tALUs):
- 32 4D ALUs
- 16 1D ALUs
In the vertex shader, you could issue 2 vector ops (4D) with 1 scalar.
In the pixel shader, you could issue 2 vector ops (3D) with up to 3 scalar ops (1D from each vector ALU + the scalar ALU). Or 9 scalar ops, if you treat both 4D ALUs as 4x 1D scalar and use the 1D ALU too.
Bags of power! On average 1.3x more ALU capacity per cycle than R420. What a coincidence...
Jawed
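The speculated layout tallies as follows. All of these figures are the thread's speculation, not a confirmed R500 configuration:

```python
# Scalar-lane arithmetic for the speculated R500 ALU layout.

vec4_alus   = 32           # "32 4D ALUs"
scalar_alus = 16           # "16 1D ALUs"

# Total scalar lanes available per clock under this layout.
lanes = vec4_alus * 4 + scalar_alus * 1   # 144

# Pixel-shader co-issue per group of 2 vec4 ALUs + 1 scalar ALU:
# 2 vec3 ops plus 3 scalar ops (1D from each vector ALU + the scalar ALU)...
coissue_ops = 2 * 3 + 3 * 1   # 9 scalar ops' worth
# ...or treat both 4D ALUs as 4x 1D scalar and use the 1D ALU too.
split_ops   = 2 * 4 + 1       # also 9

print(lanes, coissue_ops, split_ops)
```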