ATI MSAA / eDRAM module patent for R500 / Xenon?

london-boy said:
What, kinda like NV2A was advertised as having 4GPixel fillrate when it really was 1GPixel? All because 2xAA was supposed to be "free"? We saw how that worked out in the end huh... Guess this time around, Sony will be victim of the NVIDIA math.

The original Xbox / XGPU / X-Chip spec was something like:

4.8 billion pixels per second fillrate. But it turned out this was 4.8 billion AA-sampled pixels per second, and the true pixel fillrate was 1.2 billion. That's at a 300 MHz clock.

At 233 MHz the true fillrate for Xbox is under 1 billion pixels per second, i.e. under 1 Gpixel. It's, of course, 932 Mpixels/sec, with 4x that figure for AA samples per second.

PS2 therefore has a higher fillrate than Xbox, in raw pixels and in single-textured pixels.
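
A quick sanity check on those figures (a rough sketch; the clocks and pipe counts are just the commonly quoted ones):

```python
# Back-of-envelope check on the fillrate numbers above.

XGPU_CLOCK_HZ = 233e6   # shipping Xbox GPU clock
XGPU_PIPES = 4          # pixel pipelines

xbox_pixels = XGPU_CLOCK_HZ * XGPU_PIPES   # true fillrate
xbox_samples = xbox_pixels * 4             # 4 AA samples per pixel

GS_CLOCK_HZ = 147.456e6  # PS2 Graphics Synthesizer clock
GS_PIPES = 16            # pixel engines

ps2_pixels = GS_CLOCK_HZ * GS_PIPES

print(f"Xbox: {xbox_pixels / 1e6:.0f} Mpixels/s, "
      f"{xbox_samples / 1e9:.2f} Gsamples/s")     # ~932 Mpix/s, ~3.73 Gsamples/s
print(f"PS2:  {ps2_pixels / 1e6:.0f} Mpixels/s")  # ~2359 Mpix/s raw
```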
 
Jawed said:
Sigh, I was wondering why the Z was "asymmetric", and trying to reconcile that with the "4 z/stencil" per clock on the diagram. Doh, that's 4 "quads" of Z per clock, which is 64 bytes, too.

That makes much more sense. Apologies for trying to decode the diagram, and not referring to the text as well.

Thanks,
Jawed

No probs...looking at the text again,

Leak said:
Eight pixels (where each pixel is color plus z = 8 bytes) can be sent to the EDRAM every GPU clock cycle, for an EDRAM write bandwidth of 32 GB/sec. Each of these pixels can be expanded through multisampling to 4 samples, for up to 32 multisampled pixel samples per clock cycle. With alpha blending, z-test, and z-write enabled, this is equivalent to having 256 GB/sec of effective bandwidth! The important thing is that frame buffer bandwidth will never slow down the Xenon GPU.

Looking at that again, I'm not sure where they're getting 256 GB/s 'effective' bandwidth from...hmm...

Read/write (16+32) ~ 48 GB/s --> 256 GB/s 'effective'

That's ~ 5.3x compression ratio...4x for pixels/z/stencil...the rest for 'other' stuff?
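
For what it's worth, here's one way the 256 could fall out of the sample-expanded numbers, assuming a read AND a write of every 4x-expanded sample during blend/z-test/z-write (pure guesswork on my part):

```python
# One possible reading of the leak's 256 GB/s 'effective' figure.

CLOCK_HZ = 500e6
PIXELS_PER_CLOCK = 8
BYTES_PER_PIXEL = 8   # colour + Z

write_bw = PIXELS_PER_CLOCK * BYTES_PER_PIXEL * CLOCK_HZ   # 32 GB/s into eDRAM
expanded_bw = write_bw * 4                                 # 128 GB/s at 4 samples/pixel
effective_bw = expanded_bw * 2                             # read + write -> 256 GB/s

print(write_bw / 1e9, expanded_bw / 1e9, effective_bw / 1e9)  # 32.0 128.0 256.0
```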
 
Jawed said:
Right, well, now that's sorted, I'm wondering about the relatively weak "pixel pipeline" capability of R500.

Going back to the implied 4GPixel/s fill-rate of R500 at 500MHz: that's 8 pixels per clock, compared with R420 which is 16 pixels per clock.

That seems to imply to me that the pixel shader core in R500 is smaller scale than R420. Sure it's SM3, 32-bit etc., but in terms of equivalent pipelines/TMUs/ALUs, R500 has a lower capacity than R420.

We're getting conflicting ideas about whether the EDRAM is actually on-die or a separate device.

I'm wondering, now, if ATI has traded shading power for the EDRAM module.

On the contrary, the off-chip eDRAM module has allowed for more shading power on the R500 die. The question is whether the unified shader units needed MORE die area on the R500 itself than a non-unified design would?

Also, the leak shows the eDRAM module as a separate chip. However, if the targeted manufacturing process is smaller, there's nothing to stop them implementing the same compression and bandwidth-saving features completely on-die on the R500...

Jawed said:
A while back I asserted that in ATI's counting, R420 has 76 ALUs (excluding texture address calculation ALUs) and that R500's equivalent is only 48 ALUs for the same functionality (vertex and pixel shader ALUs).

http://www.beyond3d.com/forum/viewtopic.php?p=496088#496088

Quick question: have you changed that graph? IIRC, your interpretation of the ALU transistor index for the R500 was originally the same as for the R420.

Jawed said:
So, compared with R420's 8GP/s coming from a pool of 64 ALUs arranged in 16 pipelines, it seems likely that R500 is getting 4GP/s from a pool of 48 ALUs, not all of which are pixel shading for 100% of the time...

So, by cutting back the transistors expended on the vertex and pixel shader engines, R500 has space for the on-die EDRAM.

I disagree...there seems to be an increased cost to 'other' logic, i.e. control, branching, registers, compression etc...but we can't really tell without knowing the die areas of all the chips...

Jawed said:
Presumably to generate anti-alias samples, the pixel pipeline needs to do no work beyond the interpolators. All the work to generate these samples is done by the interpolators - there's no texturing, blending, shading etc. required. Is that correct?

Jawed

I'd take the alpha 'blending' as done on the eDRAM module and not on the R500...

The texturing/shading will be done on the 'oversampled' fragments on the R500...
 
My understanding of the patent is that the z-compression is operating while calculating the MSAA for a fragment, i.e. un-compressed Z values representing AA samples are delivered by the GPU's interpolators to the EDRAM module, which are then assigned to the corresponding pixel as a list of compressed Zs. The patent then goes on to discuss how the compressed Zs are de-compressed when the next fragment and its accompanying multi-sample Zs turn up, all trying to fit into the same pixel. In other words, the GPU outputs un-compressed Zs for AA samples, because they're only compressed while buffering/calculating AA.

For this reason, I don't believe that the EDRAM gets compressed Z MSAA samples. Compression/decompression is performed inside the EDRAM module.

So because of all this, the "bandwidth" between the GPU and the EDRAM is simply 32GB/s, e.g. 2 quads of (pixel + Z) or 4 quads of Z.
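
Both framings come out to the same 64 bytes per clock; a quick check (assuming 4-byte colour and 4-byte Z):

```python
# Sanity check on the GPU -> EDRAM bus figure.

CLOCK_HZ = 500e6

pixel_quads = 2 * 4 * (4 + 4)   # 2 quads of (colour + Z): 8 pixels * 8 bytes
z_quads = 4 * 4 * 4             # 4 quads of Z only: 16 Z values * 4 bytes

assert pixel_quads == z_quads == 64
print(f"{64 * CLOCK_HZ / 1e9:.0f} GB/s")   # 32 GB/s
```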

Jawed
 
Jaws said:
Jawed said:
A while back I asserted that in ATI's counting, R420 has 76 ALUs (excluding texture address calculation ALUs) and that R500's equivalent is only 48 ALUs for the same functionality (vertex and pixel shader ALUs).

http://www.beyond3d.com/forum/viewtopic.php?p=496088#496088

Quick question: have you changed that graph? IIRC, your interpretation of the ALU transistor index for the R500 was originally the same as for the R420.
Yes. I'd made an error in interpreting the vertex shaders in R420 as dual-vector, when they're vector + scalar. I'd noticed this in the R520 Infomania thread:

http://www.beyond3d.com/forum/viewtopic.php?p=496703#496703

(top of the previous page shows the table I created for a comparison of R520 and R420.)

Jawed
 
Jawed said:
My understanding of the patent is that the z-compression is operating while calculating the MSAA for a fragment, i.e. un-compressed Z values representing AA samples are delivered by the GPU's interpolators to the EDRAM module, which are then assigned to the corresponding pixel as a list of compressed Zs. The patent then goes on to discuss how the compressed Zs are de-compressed when the next fragment and its accompanying multi-sample Zs turn up, all trying to fit into the same pixel. In other words, the GPU outputs un-compressed Zs for AA samples, because they're only compressed while buffering/calculating AA.

For this reason, I don't believe that the EDRAM gets compressed Z MSAA samples. Compression/decompression is performed inside the EDRAM module.

So because of all this, the "bandwidth" between the GPU and the EDRAM is simply 32GB/s, e.g. 2 quads of (pixel + Z) or 4 quads of Z.

Jawed

I'm not sure how you can save bandwidth by compression over a bus without compressing before sending over said bus and then decompressing at the other end...

Patent said:
...
The packing block 28 may pack a number of pixel fragments together into a data block that can be transferred more efficiently over the bus 30 in a manner that maximizes the available bandwidth. Such packing techniques can include buffering a number of fragments that are to be processed simultaneously within the custom memory 40, where when enough fragments have been collected within a particular block, the block is packed and then transferred over the bus 30. Packing can be used to ensure that only valid fragment data is sent over the bus 30 with minimal inclusion of placeholders when necessary. As such, if a fragment block includes up to eight fragments, yet only three fragments are included in a block when it is to be transferred over the bus 30, the memory bandwidth utilized will only be that required to transfer the three fragments, and not the bandwidth required to send over a complete eight fragment block. Such packing techniques are described in additional detail in copending patent application Ser. No. 09/630,783 entitled "GRAPHICS PROCESSING SYSTEM WITH ENHANCED BUS BANDWIDTH UTILIZATION AND METHOD THEREFOR" which was filed on Aug. 2, 2000.
...

If you look at FIG 1. of the patent, it clearly shows a 'packing block' (compression) on the GPU side and an 'unpacking block' (decompression) on the eDRAM side...
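
As a toy sketch of that packing idea (my own illustration, not the real hardware format): only valid fragments cross the bus, so bandwidth scales with the fragment count rather than the full block size.

```python
# Toy model of the patent's packing block.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Fragment:
    color: int      # packed RGBA
    z: float
    coverage: int   # multisample coverage mask

BLOCK_CAPACITY = 8  # up to eight fragments per block, per the patent

def pack_block(slots: List[Optional[Fragment]]) -> List[Fragment]:
    """Drop the empty slots; the packed block carries only valid fragments."""
    return [frag for frag in slots if frag is not None]

slots: List[Optional[Fragment]] = [None] * BLOCK_CAPACITY
slots[0] = Fragment(0xFFFFFFFF, 0.50, 0b1111)
slots[3] = Fragment(0x80808080, 0.25, 0b0011)
slots[5] = Fragment(0x00FF00FF, 0.90, 0b1100)

packed = pack_block(slots)
print(f"sending {len(packed)} of {BLOCK_CAPACITY} fragment slots over the bus")
```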
 
I've currently got my head in the patent trying to delineate the functions. Even though I said I wouldn't...

For example, I feel there's an inconsistency between the bus 30 in Fig. 1 and the bus indicated as "6" on the Leak diagram.

I say this because,

Patent said:
The pixel fragments 16 produced by the pixel pipe 14 are oversampled pixel fragments, where each pixel fragment includes a color value, a Z value, and a coverage mask. The coverage mask indicates to which of the samples for the pixel fragment the color and Z values correspond. For example, the color and Z values may only correspond to the bottom most samples of the pixel fragment. As such, only half of the bits within the coverage mask will be set, where this encodes for which pixel fragments the color and Z values are valid.

This fragment data isn't compressed. There's an ominous "Export" block on the side of the GPU in the leak. I think the Export block corresponds with Custom Memory Interface, 18, plus 20, 26 and 24 in the patent.

So I have this feeling that the data rate described at 6 is before fragment compression takes place. The pixel pipeline 14 in the GPU hands off fragments to let the Export and EDRAM modules in the leak diagram churn through raster output, including AA.

As far as I can tell this is anything but a one-hit process. It might be thousands of cycles later before the next fragment for a given pixel location turns up, requiring AA. Until a tile is complete, the entire multisample data for the tile has to be kept open. That's what the Sample Memory 25 seems to be for. I think that's how the remainder of the 10MB in the EDRAM, left over from the frame buffer, is consumed.
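
A rough carve-up, if you assume a 1280x720 frame buffer with colour+Z (the resolution and layout are my assumptions, not anything from the leak):

```python
# Rough guess at how the 10 MB might be split.

EDRAM_MB = 10
WIDTH, HEIGHT = 1280, 720
BYTES_PER_PIXEL = 8   # colour + Z

framebuffer_mb = WIDTH * HEIGHT * BYTES_PER_PIXEL / 2**20  # ~7.0 MB
sample_mb = EDRAM_MB - framebuffer_mb                      # ~3.0 MB for open tiles

print(f"{framebuffer_mb:.2f} MB frame buffer, {sample_mb:.2f} MB for multisample data")
```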

Obviously the key thing is that AA etc. can be performed without getting in the way of the GPU working with textures across the main bus.

Anyway, off to read more, cos I've started now...

Jawed
 
Jawed said:
Well, it would appear the EDRAM in Xbox 360 is going to be on die :D

http://www.beyond3d.com/forum/viewtopic.php?t=22333

This page:

http://www.necelam.com/edram90/index.php?Subject=edramoptions

says that 256Mb will consume half of a 15x15mm die at 90nm. So 10MB will consume about 1/3 of that, i.e. about 5x7.5mm. Less than 40mm squared, not very much.

So, if you have to fit EDRAM on die, you might have to make do with less ALUs ;)

Jawed

I'd make that: 256 Mbit ~ 32 MBytes -> 1/2(15*15) ~ 112.5 mm2

10 MBytes eDRAM ~ 35 mm2

Add some custom logic to that and you'd have, I'd guesstimate, a ~50+ mm2 die at 90nm.

Considering a PowerPC G5 is 60-70 mm2 at 90nm, that's not a small die. So I'd take it off the R500 and make it a separate custom eDRAM module chip unless they want to lose shading power on the R500 or save cost...
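
Spelled out (the custom-logic allowance is a guess):

```python
# Die-area guesstimate for the eDRAM module, from the NEC figures.

NEC_DIE_MM2 = 15 * 15    # 225 mm2 reference die at 90nm
MB_IN_HALF_DIE = 32      # 256 Mbit = 32 MBytes fills half of it

mm2_per_mb = (NEC_DIE_MM2 / 2) / MB_IN_HALF_DIE   # ~3.5 mm2 per MByte
edram_mm2 = 10 * mm2_per_mb                       # ~35 mm2 for 10 MBytes
die_mm2 = edram_mm2 + 15                          # custom logic allowance: a guess

print(f"{edram_mm2:.1f} mm2 eDRAM, ~{die_mm2:.0f} mm2 module die")  # 35.2, ~50
```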

EDIT: If you manage to read the whole patent, let us know on the 'compression/decompression'...otherwise, time permitting, I may trawl through the whole lot later... ;)
 
Jawed said:
Well R420 is 281mm squared...

http://www.beyond3d.com/misc/chipco...derby=release_date&order=Order&cname=

And clearly NEC is capable of making dies in the ballpark of 15x15mm squared.

Scaling R420 to 90nm, it would come out as 140mm squared, roughly. Add 50mm squared for EDRAM, and you're up to 200mm squared.

Add-in SM3 complexity and take away some ALUs, and you should still fit within 250mm squared.

So it all looks feasible on one die. :LOL: :!:

Jawed

Nah! ...look back at your 'transistor index' table... R500 has nearly twice as many as R420...

So,

2*140 mm squared + 50 mm squared at 90nm ~ 330 mm2! :oops:

Not small! ;)
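
Making the scaling explicit (an ideal (90/130)^2 shrink, which real chips never quite achieve, and the 2x logic factor is my reading of the transistor index, so treat these as ballparks):

```python
# The die-area scaling argument behind both estimates.

R420_MM2_130NM = 281
shrink = (90 / 130) ** 2                 # ~0.48 ideal area scaling

r420_90nm = R420_MM2_130NM * shrink      # ~135 mm2, call it ~140
jawed_r500 = r420_90nm + 50              # R420-scale logic + eDRAM ~ 185 mm2
jaws_r500 = 2 * r420_90nm + 50           # ~2x the logic + eDRAM    ~ 320 mm2

print(f"{r420_90nm:.0f} {jawed_r500:.0f} {jaws_r500:.0f}")  # ~135 ~185 ~320
```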
 
Aha! Only if we take the Jaws R500.

If we take the Jawed R500 then it's a difference of 14%.

Jawed
 
By the way, I think the 8 pixels per clock number is very significant.

If that determines the maximum number of pixel shader command threads that can be executing concurrently, then that implies to me that there are 8 Unified Shaders, each containing a pair of ALU units, each ALU unit consisting of 2 vector and 1 scalar ALUs.

Jawed
 
Jawed said:
By the way, I think the 8 pixels per clock number is very significant.

If that determines the maximum number of pixel shader command threads that can be executing concurrently, then that implies to me that there are 8 Unified Shaders, each containing a pair of ALU units, each ALU unit consisting of 2 vector and 1 scalar ALUs.

Jawed

If we ignore the number of US units and just concentrate on the number of ALUs, i.e. 48, and the fact that completed threads leave these ALUs (from the R500 patent), then:

48 ALUs --> 32 pixels per cycle (oversampled, peak)

At peak, 2/3 of the ALUs will output pixels; the rest will output vertices to maintain auto-load-balancing.

These 32 pixels per cycle are then 'downsampled' to 8 pixels per cycle (4xMSAA) and written to the frame buffer. I agree, the 8 ppc is key, because it also determines the total number of ALUs. But how these ALUs are arranged can be debated.

In this scenario, 48 ALUs,

32 ALUs ---> pixels shading

16 ALUs ---> vertex shading

So a 'unit' can be 16 ALUs...
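
The arithmetic behind that, spelled out (speculative, obviously):

```python
# Peak-throughput numbers for the 48-ALU scenario.

TOTAL_ALUS = 48
CLOCK_HZ = 500e6
MSAA = 4

pixel_alus = TOTAL_ALUS * 2 // 3           # 32 ALUs on pixels at peak
vertex_alus = TOTAL_ALUS - pixel_alus      # 16 ALUs on vertices

oversampled = pixel_alus                   # 32 oversampled pixels per cycle, peak
written = oversampled // MSAA              # 8 pixels per cycle after 4x downsample

print(written, written * CLOCK_HZ / 1e9)   # 8 pixels/clock -> 4.0 Gpixels/s
```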
 
Ah, (I'm guessing here) but when you multi-sample a pixel, you don't shade the multi-samples. In other words the multi-samples don't pass through any pipeline shader computation at all.

The multi-samples are (I'm guessing) generated by the interpolator and circumvent the pixel shader portion of the pixel pipeline entirely.

The interpolator, when it comes to a pixel on a triangle edge (or rather, when it generates one), knows that each edge pixel needs to be multi-sampled. It generates the coverage mask for the pixel based on the gradient of the edge and its offset with respect to the pixel's centre.

So the ALUs aren't involved in multi-sampling. But while the interpolator is generating multi-samples it can't generate pixels.
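
A toy illustration of that guess: evaluate an edge function at each sample position and set the corresponding mask bit, with no shader ALU involvement (the sample offsets and the edge test here are made up for illustration):

```python
# Coverage mask from an edge function, per the guess above.

def coverage_mask(a: float, b: float, c: float,
                  px: float, py: float) -> int:
    """Edge function E(x,y) = a*x + b*y + c; a sample is covered if E >= 0."""
    # hypothetical 4x sample offsets within the pixel (rotated-grid-ish)
    samples = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]
    mask = 0
    for i, (sx, sy) in enumerate(samples):
        if a * (px + sx) + b * (py + sy) + c >= 0:
            mask |= 1 << i
    return mask

# pixel at (10, 20) straddling the vertical edge x = 10.5 (E = x - 10.5)
print(bin(coverage_mask(1.0, 0.0, -10.5, 10.0, 20.0)))  # 0b1010: right-side samples
```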

Well, that's my understanding, erm, lots of guesses...

Jawed
 
Jawed said:
If that determines the maximum number of pixel shader command threads that can be executing concurrently
I can't see any logical reason why the number of pixels output per clock should be linked with the number of threads the hw is running.
 
nAo said:
Jawed said:
If that determines the maximum number of pixel shader command threads that can be executing concurrently
I can't see any logical reason why the number of pixels output per clock should be linked with the number of threads the hw is running.
I was trying to refer to the number of concurrently executing pixel shader command threads that generate a pixel for ROP, as opposed to the number of command threads in flight.

Obviously there's a huge queue of pixel shader command threads backed-up ready to generate pixels on the next cycle. In theory there are hundreds or thousands of pixel shader command threads in flight. But per clock there are only 8 pixels being fed for ROP, according to the leak.

This is why I think there are 8 Unified Shaders (two quads), each (quad?) fronted by an interpolator (for pixels - obviously vertices are handled separately). Each US can execute any combination of, perhaps, two command threads:

2 vertex
1 vertex, 1 pixel
2 pixel

and at the same time issue one or more texture operations (depending on how many TMUs it has).

I think each US has 2 ALU units (one per command thread - the patent even mentions this as a scenario, using two arbiters), each of which can co-issue two vec-4 and one scalar operations (three co-issued ops in each of the two ALU units, 6 ALUs per US in total).
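
That arrangement at least squares with the 48-ALU count from earlier in the thread (a quick check; the arrangement itself is pure speculation):

```python
# Checking the speculative US configuration against the 48-ALU figure.

UNIFIED_SHADERS = 8
ALU_UNITS_PER_US = 2     # one per command thread, two arbiters
ALUS_PER_UNIT = 2 + 1    # two vec-4 + one scalar, co-issued

print(UNIFIED_SHADERS * ALU_UNITS_PER_US * ALUS_PER_UNIT)  # 48
```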

I dunno what you'd do with so much co-issued shader code. Just stumbling around here, in the dark, speculating for the sake of it.

Obviously this configuration still leaves the question of what happens when two pixel threads in a US complete at the same time - presumably the Render Backend 350 in the patent (page 4):

http://www.beyond3d.com/forum/viewtopic.php?t=21708

will accept both pixels and queue them. Who knows, eh? You'd expect that the Render Backend would have to synch up the Interpolator-generated multisamples with their owning pixel, and so that makes for a natural queue that needs to be constructed.

The figure of 8 pixels per clock coming out of R500 is unexpectedly low, to be quite honest. It seems that R500 always produces multisamples for edge pixels, so 4xMSAA comes free, compensating somewhat for the "mere" 8 pixels per clock.

In other words you can't compare the pixel fill rate of R500 to R420, for example, because R500's maximum pixel fill rate is the same with 0x and 4xMSAA.

Jawed
 