Details on PowerVR's "USSE"?

Gunhead

Regular
I'm interested in the architecture of PowerVR's unified scalable shader engine. Does it have vector or scalar units? Does the thread management derive from their Metagence stuff, and does that imply anything interesting? How scalable is it, can you combine much more than 8 of them for a mainstream class PC product? (This is fantasy, I know... too bad Intel had too much NIH attitude to create killer integrated graphics around Series 5, let alone a discrete GPU. I suppose it's also one of the best archs for multichip or multicard solutions...) How much is texture filtering decoupled from the shader core? Et cetera...

Feel quite free to speculate, as they haven't given many technical details about the internals of USSE; at least I haven't seen any apart from fairly vague block diagrams and such. And please do correct any misconceptions I may have above.
 
Come on people, it's in the SGX, somebody must know something... or could Simon F help out here?
Not really. If you can find a public document then maybe I could comment on what is said but otherwise no.
 
Warning: anything below is based only on a few sparse details found here and there and the rest is pure layman's speculation:

I'm interested in the architecture of PowerVR's unified scalable shader engine. Does it have vector or scalar units?

I'd place my bets on the latter, yet no one can really be sure at this point. Even if someone did answer that question, it still comes down to what exactly everyone means nowadays by "scalar" or "superscalar" or whatever term is being (ab-)used.

Does the thread management derive from their Metagence stuff, and does that imply anything interesting?
If memory serves, Metagence was a division that started a lot later than PowerVR itself. Simple reasoning would tell me that a GPP/DSP core was actually "born" out of existing GPU IP rather than the other way around.

According to IMG's own claims, Metagence sports "superthreading"; a rough explanation can be found here (written with help from IMG employees):

http://www.audiodesignline.com/showArticle.jhtml?articleID=183701195

I might be completely wrong, but the whole superthreading thingy sounds to me more like an optimisation which is necessary for a general purpose core, unlike a typical GPU which since its conception was laid out for handling a large amount of threads in parallel.

How scalable is it, can you combine much more than 8 of them for a mainstream class PC product? (This is fantasy, I know... too bad Intel had too much NIH attitude to create killer integrated graphics around Series 5, let alone a discrete GPU. I suppose it's also one of the best archs for multichip or multicard solutions...)
If you put a gun to my head, I'd say that the SGX555 (which according to IMG's roadmap is to launch (?) in 2009) is that 8x scaled "something" compared to the lower end cores. Here are a few details from the latest SGX whitepaper:

PowerVR SGX core architecture currently comprises
seven variants, with sizes ranging from less than
1.5mm2 to 20.3mm2 in a 65nm process:
• SGX 510, 520, 530 - mobile, wireless
• SGX 535, 540 - high-end mobile, automotive
• SGX 545, 555 - PC, games consoles
20.3mm2@65nm sounds tiny, but it then also comes down to what exactly one means by "PC". An IGP is a PC part too in that sense.

Maximum effective pixel fillrate performance ranges
from 100Mpix/sec to 4000Mpix/sec @ 200MHz. Polygon
throughput ranges from 2Mpoly/sec to 100Mpoly/sec
@ 200MHz.
Let me sum the highest end thingy up:

4000MPix/s@200MHz
100MPolys/s
20.3mm2@65nm

I'd be damned if I hadn't seen a footnote in another document stating that the above rates are at less than 50% shader load (which might imply 1 Tri/clock). Anyway, my guess would be 8 TMUs * 200MHz * 2.5 overdraw = 4000; and that might be the mythical "8" you're looking for.
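For anyone who wants to poke at the numbers, here is a minimal sketch of that back-of-the-envelope arithmetic. The 8 TMUs and the 2.5x overdraw factor are my guesses from above, not published figures, and the function names are made up for illustration; only the 200MHz clock and the 4000Mpix/s and 100Mpoly/s peaks come from the whitepaper.

```python
# Back-of-the-envelope check of the whitepaper's peak figures.
# Published values: 200 MHz clock, 4000 Mpix/s, 100 Mpoly/s.
# ASSUMPTIONS (forum guesses, not published): 8 TMUs, 2.5x overdraw.

CLOCK_MHZ = 200
PEAK_MPIX_PER_S = 4000
PEAK_MPOLY_PER_S = 100


def effective_fillrate_mpix(tmus, clock_mhz, overdraw):
    """Effective fillrate in Mpix/s: raw texel rate scaled by the overdraw
    a deferred renderer avoids shading."""
    return tmus * clock_mhz * overdraw


def implied_overdraw(peak_mpix, tmus, clock_mhz):
    """Overdraw factor needed to reach a quoted 'effective' fillrate."""
    return peak_mpix / (tmus * clock_mhz)


def triangles_per_clock(mpoly_per_s, clock_mhz):
    """Triangle rate per clock cycle for a quoted polygon throughput."""
    return mpoly_per_s / clock_mhz


if __name__ == "__main__":
    print(effective_fillrate_mpix(8, CLOCK_MHZ, 2.5))
    print(implied_overdraw(PEAK_MPIX_PER_S, 8, CLOCK_MHZ))
    print(triangles_per_clock(PEAK_MPOLY_PER_S, CLOCK_MHZ))
```

The quoted polygon rate works out to 0.5 triangles per clock, which is why a "less than 50% shader load" footnote would be consistent with a peak of 1 Tri/clock. Change the TMU count to 4 and the implied overdraw doubles to 5, so the "8" is only as good as the overdraw guess.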

Given a partner requesting a higher end part and IMG being able to divert the needed amount of resources, I don't see why they couldn't scale such a tiny core a lot more; even more so under a smaller manufacturing process. Trick being that the former sentence has too many conditionals for my taste.

How much is texture filtering decoupled from the shader core? Et cetera...
No idea; they tout deferred texturing for one, and geometry processing decoupled from pixel processing.

Just don't expect Simon to comment on the above *snicker*
 
In this example the MTMP replaces the function of geometry processing 1502, texturing and shading 1508, alpha test 1509, fogging 1510 and alpha blend 1511 unit from figure 5. In addition the accumulation buffer is replaced with output buffer space allocated from the shared data store. This integration of functionality into a single unit with general programmability results in an exceptionally capable system that may either directly emulate or replace fixed function blocks or replace them with an arbitrary piece of code supplied by an attached application.

Hmmmm.....interesting.
 
Yes, it sounds like they one-upped R6xx by making everything programmable, down to the texture sampler/filtering units as well as the ROPs...

TMU + GS + VS + PS + ROP (in part) --> all implemented by: a MTMP programmable block.

At least unless I am reading the above in some very rushed way...

Edit:

I jumped to conclusions... too quickly... further down in the patent:

The processing pipelines 1860 are also now allocated specific functions with the tiling engine 1862, the pixel processing 1864 and the texturing unit 1866 being directly equivalent to same units in figure 5.

coupled with the earlier statement:

MTMP replaces the function of geometry processing 1502, texturing and shading 1508, alpha test 1509, fogging 1510 and alpha blend 1511 unit from figure 5. In addition the accumulation buffer is replaced with output buffer space allocated from the shared data store.

Reading those paragraphs of the patent, it seems that some parts of the texturing and ROP pipes are implemented by a single programmable unit which also implements GS+VS+PS (like SGX's USSE)...

Maybe blending (frame-buffer and textures), AA resolve and filtering were left in what they call the texturing unit and the pixel processing unit, while texture fetch, alpha and Z tests, etc. have been shifted to the Unified Shader processing block, just like what happened with GS, VS, and PS functionality.


http://appft1.uspto.gov/netacgi/nph...G&l=50&co1=AND&d=PG01&s2=MTMP&OS=MTMP&RS=MTMP

BTW, in the USPTO you did not use Imagination Technologies as Assignee Name but in the EU version of the same patent application you did... interesting :).
 
For the 2nd one which talks about the parameter compaction, is it correct to say that, when the display list gets too large, it calls up a section (macro-tile) to be rendered, but all it's actually doing is the HSR to remove objects that are covered, and then if they are covered up, free up the memory they are using (in whole blocks only)? Is that sort of the gist of it? The objects that are still in the front, are they still stored as normal?

This could probably be combined with actual compression of the display list data as mentioned in this older patent (?):

http://www.freepatentsonline.com/EP1351195.html
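To make my reading of the compaction idea concrete, here is a toy sketch of it. This is only my interpretation of the question above, not the patent's actual method: objects are opaque, HSR is a simple per-pixel nearest-depth test, and every name in it (Obj, visible_objects, compact) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    name: str
    depth: float        # smaller = nearer to the viewer
    pixels: frozenset   # pixels covered inside the macro-tile (opaque)
    block: int          # parameter-memory block that stores this object


def visible_objects(objs):
    """Toy HSR: for each pixel keep only the nearest object."""
    nearest = {}
    for o in objs:
        for p in o.pixels:
            if p not in nearest or o.depth < nearest[p].depth:
                nearest[p] = o
    return {o.name for o in nearest.values()}


def compact(objs):
    """Free only those blocks whose objects are ALL hidden.

    A hidden object that shares a block with a visible one stays in
    memory, because memory is released in whole blocks only.
    """
    visible = visible_objects(objs)
    blocks = {}
    for o in objs:
        blocks.setdefault(o.block, []).append(o)
    freed = sorted(
        b for b, members in blocks.items()
        if all(m.name not in visible for m in members)
    )
    kept = [o for o in objs if o.block not in freed]
    return kept, freed


if __name__ == "__main__":
    scene = [
        Obj("A", 0.9, frozenset({(0, 0), (0, 1)}), block=0),
        Obj("B", 0.8, frozenset({(0, 0)}), block=0),
        Obj("C", 0.1, frozenset({(0, 0), (0, 1)}), block=1),  # hides A and B
        Obj("D", 0.5, frozenset({(1, 1)}), block=2),
        Obj("E", 0.7, frozenset({(1, 1)}), block=2),          # hidden by D
    ]
    kept, freed = compact(scene)
    print([o.name for o in kept], freed)
```

In this toy scene block 0 is freed because both of its objects are covered, while E survives even though it is hidden, since it shares block 2 with the visible D. The objects still in front are stored exactly as before.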
 
Here is some USSE data from TI. This is for an SGX530 core.
Code:
Universal Scalable Shader Engine – Key Features
The (USSE) is the engine core of the PowerVR SGX architecture and supports a broad range of
instructions.
• Single programming model:
– Multithreaded with 16 simultaneous execution threads and up to 64 simultaneous data instances
– Zero-cost swapping in, and out, of threads
– Cached program execution model
– Dedicated pixel processing instructions
– Dedicated video encode/decode instructions
• SIMD execution unit supporting operations in:
– 32-bit IEEE float
– 2-way 16-bit fixed point
– 4-way 8-bit integer
– 32-bit bit-wise (logical only)
• Static and dynamic flow control:
– Subroutine calls
– Loops
– Conditional branches
– Zero-cost instruction predication
• Procedural geometry:
– Allows generation of primitives
– Effective geometry compression
– High-order surface support
• External data access:
– Permits reads from main memory using cache
– Permits writes to main memory
– Data fence facility
– Dependent texture reads

source: http://focus.ti.com/lit/ug/spruff6/spruff6.pdf
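As a rough illustration of what the "SIMD execution unit" entries in that list could mean in practice, here is a small sketch that treats one 32-bit register as 1 x 32-bit, 2 x 16-bit or 4 x 8-bit lanes. It is purely an assumption about how such packing behaves: the TI document does not say whether the hardware wraps or saturates on overflow, and simd_add is a made-up helper, not a USSE instruction.

```python
def simd_add(x, y, lane_bits):
    """Lane-wise add of two 32-bit words split into lanes of `lane_bits`.

    ASSUMPTION: each lane wraps on overflow; real hardware may saturate.
    lane_bits must be 8, 16 or 32.
    """
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane_bits must be 8, 16 or 32")
    mask = (1 << lane_bits) - 1
    out = 0
    for shift in range(0, 32, lane_bits):
        lane_sum = ((x >> shift) & mask) + ((y >> shift) & mask)
        out |= (lane_sum & mask) << shift   # drop the carry out of the lane
    return out


if __name__ == "__main__":
    x, y = 0x0001FFFF, 0x00000001
    for bits in (8, 16, 32):
        print(bits, hex(simd_add(x, y, bits)))
```

The same two input words give three different results depending on the lane width, because the carry is cut off at each lane boundary. That is the whole point of the 2-way and 4-way modes: one pass through the ALU handles two or four small values at once.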
 