Larrabee: Samples in Late 08, Products in 2H09/1H10

I always found this hard to believe, because coding for x86 involves things that defy code space efficiency: inserting instructions that do no useful work, like register swapping due to ISA asymmetry, register spills, register copies due to destructive updates, register clears to avoid false dependencies, etc.
Integer code on x86 is fairly packed compared to RISCs with fixed-width 32-bit instructions. SSE code - especially when coupled with REX prefixes for accessing the extra registers in x86-64 mode - tends to be in the same ballpark as, if not bigger than, AltiVec for example.
 
By the way, is there any serious, comparative study that proves that x86 does indeed compress the code? The only support for this claim I've ever seen was like "it's CISC and variable-length, therefore it's more compact, period".
Considering its age, it's not that bad really. I was actually mainly referring to code size reduction compared to fully decoded micro-instructions. The P6 architecture uses 118-bit long micro-instructions.
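
To put a rough number on that comparison (a sketch only; the average x86 instruction length used here is an assumption for illustration, not a measurement):

```c
#include <stdio.h>

int main(void)
{
    const double p6_uop_bits   = 118.0;  /* figure quoted above              */
    const double avg_x86_bytes = 3.5;    /* assumed average, illustrative     */

    double p6_uop_bytes = p6_uop_bits / 8.0;   /* ~14.75 bytes per micro-op   */

    printf("x86 encoding : ~%.2f bytes/instruction (assumed)\n", avg_x86_bytes);
    printf("P6 micro-op  : ~%.2f bytes/instruction\n", p6_uop_bytes);
    printf("expansion    : ~%.1fx once fully decoded\n",
           p6_uop_bytes / avg_x86_bytes);
    return 0;
}
```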

I'm curious how that actually compares to GPUs.
 
Integer code on x86 is fairly packed compared to RISCs with fixed-width 32-bit instructions. SSE code - especially when coupled with REX prefixes for accessing the extra registers in x86-64 mode - tends to be in the same ballpark as, if not bigger than, AltiVec for example.

Running x86 superscalar complicates the code density argument, at least as far as the I-cache is concerned.

For A64, each cache line has 6.5 bytes used to indicate predecode information that allows for (edit: fast and uniform) 3-way superscalar decode.
There are a number of other bytes, but those are tied to branch prediction and parity, which are orthogonal to the ISA.

In the instruction cache, assuming we fit 3 instructions into a 16-byte line, we're adding ~2 bytes per instruction.
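
Spelling the "~2 bytes per instruction" out (a sketch that simply takes the figures above as given):

```c
/* The "~2 bytes per instruction" figure above, written out: predecode bytes
   per 16-byte fetch group divided by the instructions assumed to fit in it. */
static double predecode_bytes_per_insn(void)
{
    const double predecode_bytes = 6.5;   /* per 16-byte group, as quoted */
    const double insns_per_group = 3.0;   /* as assumed above             */
    return predecode_bytes / insns_per_group;   /* ~2.2 bytes             */
}
```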

It's not just an x86 thing, though. Some number of bits is also used by POWER6 to move some decode stages out of the critical loop.
 
I'll ask a slightly different question: how many of an x86 CPU's pipeline stages are wasted just coping with x86 itself?

I was guessing the majority was in instruction decode and related stages. I suppose dealing with x86 flags and the other necessary forwarding (for performance with a small register set) adds some extra pipeline complexity... I wonder what the % of instruction decode space on chip was for the early and late model Alphas in comparison to x86?

By the way, is there any serious, comparative study that proves that x86 does indeed compress the code? The only support for this claim I've ever seen was like "it's CISC and variable-length, therefore it's more compact, period".

I always found this hard to believe, because coding for x86 involves things that defy code space efficiency: inserting instructions that do no useful work, like register swapping due to ISA asymmetry, register spills, register copies due to destructive updates, register clears to avoid false dependencies, etc. And let's not forget that x86 CPU vendors recommend a non-trivial subset of the actual x86 to feed their modern CPUs; the rest is considered slow and is kept for compatibility. This fact alone indicates wastage of opcode space. Also, there is still no way to write FP code in a sane way: you either deal with the ancient x87 or with gluttonous prefixes in SSE.

As for good opcode compression, how about Thumb?

My previous remark was sarcastic in nature, and even less valid for integer math when thinking about in-order execution without register renaming (for example, the really short EAX opcodes cannot be used as much, and you need to use the upper 8 registers of x86-64). The win from all the addressing modes goes away when you are mostly using RISC-style [reg+offset] addressing, especially with all the prefixes needed for SSE* and for accessing the upper 8 registers in x86-64. Perhaps we should throw in, for good measure, the space lost in general-purpose code keeping all branch targets properly aligned (see the sketch below).

I've done my share of painful run-time code generation for x86, and the expense of dealing with x86's mess (for example, ebp/esp register usage and addressing modes) is enough to make me loathe having to deal with it again, perhaps with runtime compiling of "shaders" on Larrabee (assuming we get this low-level access)... BUT if Larrabee becomes a really useful tool to do what I want and there isn't a good alternative, then there's no sense in complaining about something you have no choice but to live with.
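
For instance, keeping emitted branch targets 16-byte aligned in a run-time code generator ends up looking something like this (a minimal sketch; buf and pos are hypothetical names for the generator's output state):

```c
#include <stddef.h>

/* Pad the code buffer with single-byte NOPs (0x90) until the next emitted
   instruction - the branch target - starts on the requested alignment.
   buf and pos are hypothetical names for a code generator's output state;
   real generators usually prefer multi-byte NOP forms for this padding.   */
static void align_branch_target(unsigned char *buf, size_t *pos, size_t align)
{
    while (*pos % align != 0)
        buf[(*pos)++] = 0x90;
}
```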

BTW, I think NVidia 8/9 series opcodes are 32-bit with some extended 64-bit opcodes (perhaps for texturing) according to http://www.cs.rug.nl/~wladimir/decuda/.
 
Sorry if this sidetracks things a little. Could Larrabee and Nehalem (sounds like a donkey and a cart ridden by Larry Bee) be one of the first steps to phase out x86/x86-64 and develop a new architecture? They could emulate x86 and keep a few small Atom-like cores whilst migrating to a newer and better system. Perhaps something that doesn't have its roots in the 1980s?
 
BTW, I think NVidia 8/9 series opcodes are 32-bit with some extended 64-bit opcodes (perhaps for texturing) according to http://www.cs.rug.nl/~wladimir/decuda/.
The CTM guide had information for the Radeon X1900.
GPU Command instructions were up to 20 bytes, but ALU ops were 3-6 words in length.
I don't know about R600's instruction stream, nor the internal VLIW format generated from it, though we should note that the batch size of 64 means in the best case that a single instruction applies to 64 items, not including data used for pointing to the items.

Unless the instruction is 64 times the size of an x86 instruction, there is a form of compression right there.
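
As a tiny illustration of that point (a sketch; the 24-byte packet size is a placeholder, since the word size of the X1900 ALU encoding isn't pinned down above):

```c
/* Per-work-item instruction footprint when one decoded instruction drives a
   whole SIMD batch. The 24-byte packet in the comment is a placeholder
   value, not a measured figure for any particular GPU.                     */
static double instruction_bytes_per_item(double packet_bytes, double batch_size)
{
    return packet_bytes / batch_size;   /* e.g. 24.0 / 64 = 0.375 bytes/item */
}
```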

Sorry if this sidetracks things a little. Could Larrabee and Nehalem (sounds like a donkey and a cart ridden by Larry Bee) be one of the first steps to phase out x86/x86-64 and develop a new architecture? They could emulate x86 and keep a few small Atom-like cores whilst migrating to a newer and better system. Perhaps something that doesn't have its roots in the 1980s?
People keep hoping, but it keeps not happening.
Let's not forget that Intel had x86 emulation on Itanium, the last time the possibility existed.
Those ISA extensions on their own would, in a way, be a dead end anyway. They're just extensions that fall back on standard x86 for a lot of functionality, and their default encodings would still be quirky and oversized.
 
I don't know about R600's instruction stream, nor the internal VLIW format generated from it, though we should note that the batch size of 64 means in the best case that a single instruction applies to 64 items, not including data used for pointing to the items.
According to page 11 of:

http://coachk.cs.ucf.edu/courses/CDA6938/UCF_1_25_08.pdf

The ALU instruction is variable length, 1 to 7 64-bit words: up to 5 of those words are scalar instructions, and the other 2 are literals.

Jawed
 
Not knowing the bit length of the current x86 chips' internal ops, I can't make a direct comparison of contemporary architectures.

In comparison to the P6 at 118 bits per instruction, the biggest packet in the internal instruction format of R600 would be 56 bytes in length. Five P6 micro-ops would be ~74 bytes.

How many bits would the matching external shader instruction be in comparison, I wonder?
 
Not knowing the bit length of the current x86 chips' internal ops, I can't make a direct comparison of contemporary architectures.

In comparison to the P6 at 118 bits per instruction, the biggest packet in the internal instruction format of R600 would be 56 bytes in length. Five P6 micro-ops would be ~74 bytes.

How many bits would the matching external shader instruction be in comparison, I wonder?
Each of the 3 operands and the resultant in an SM4 shader instruction can relate to any one of 4096 vec4 registers - so in terms of R600 that would be any one of 16384 scalar operands. Multiply that by the number of threads supported by the architecture and you get...

From the assembly code I've looked at, it appears R600 provides for 128 vec4 registers per thread, actually implemented as 512 scalars. In order to support the resulting 16384-scalar register file, there'd presumably be some kind of abstracted register file paging mechanism never seen by the ALU pipeline, along with paging of the shader code into distinct programs that address subsets of the thread's register set.

A curiosity of the architecture is that the compiler assigns registers in two sequences: r0-upwards and r127-downwards. As far as I can tell, the r127-downwards registers are clause-temporaries, i.e. their lifetime is not that of the shader program, but the execution of the clause.

R600 breaks up shader programs into clauses of, at maximum, 32 instruction slots (or 128 scalar ops, whichever comes first - 128 scalar ops can be compiled in 26 instruction slots). So you could say that in order to run a 400-slot shader, R600 runs sequences of clauses from this shader that are a maximum of either 32 slots or 128 ops long, the latter being a harder limit, corresponding to 1024 bytes of code (125 * 8 bytes). (If the maximum number of literals per instruction are allocated then presumably only 18 instruction slots per clause could be compiled.)

So it would appear likely that R600 has a subsidiary register file that supports only clause-temporaries (r127, r126 etc.), in addition to the register file that supports "shader state", r0, r1, r2 etc.

The subsidiary register file would only need to support 2 threads at a time. If a clause can be a maximum of 128 operations, that's 128 scalar resultants required in the subsidiary register file - or 32 vec4s. So it would appear that the subsidiary register file needs to be 64 objects * 2 threads * 128 scalars * 4 bytes = 64KB. This couldn't actually be true because a clause that consisted of 128 resultants that were all junked once the clause completed would, in effect, do nothing. So that's not the whole answer...

So, instead of this, I can imagine that in setting up a thread's clause for execution by a SIMD, the relevant registers are copied from the register file into the subsidiary register file. So, this is a different organisation from what I've just proposed. Instead of the ALUs consuming operands from both the register file and a subsidiary register file (as well as writing to both), I'm guessing that the ALUs can only work against the subsidiary register file.

At the end of execution of a clause the updated shader state registers (r0 upwards) would need to be written back to the register file, while the clause-temporaries (r127 downwards) are junked. In order to pipeline these copy operations, I guess "triple buffering" is required. Sorta similar to the way in which Cell SPE programs often use triple buffering and DMA to split LS into "incoming", "active" and "outgoing" stages. Double buffering could work I suppose if the register copy operations are "fast".

If this is the case then operand addressing at the operation level in R600 would only need to cater for 512 scalar addresses (128 operations per clause, each operation having a maximum of 3 operands and 1 resultant), 9 bits per operand, 36 bits per operation, out of the budget of 64 bits per operation identified earlier. So out of the remaining 28 bits, perhaps 6-8 bits for opcode? Then various bits for masking and predication...
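
Writing that guess down as a C bitfield makes the budget easier to see (purely illustrative; the field names and widths follow the speculation above, not any documented R600 encoding):

```c
#include <stdint.h>

/* Speculative layout of a 64-bit R600 scalar ALU operation, following the
   bit budget guessed at above: 4 x 9-bit register addresses (3 sources plus
   the resultant) = 36 bits, a 7-bit opcode, and the remaining bits left for
   write masks, predication and other modifiers. This is NOT the documented
   encoding - just the post's arithmetic written down as a struct.          */
struct r600_alu_op_guess {
    uint64_t src0   : 9;   /* operand address, 0..511 scalars */
    uint64_t src1   : 9;
    uint64_t src2   : 9;
    uint64_t dst    : 9;   /* resultant address               */
    uint64_t opcode : 7;   /* "6-8 bits" guessed above        */
    uint64_t modif  : 21;  /* masking, predication, etc.      */
};
```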

The other issue is that the ALU pipeline needs to be able to address the memory cache (streamout, memexport or just general inter-thread data swapping). It's 8KB and can be either read or written with a granularity of 32-bits (I presume). I guess this "mov" instruction would have a unique destination/source formatting, re-using bits I've already described...

Gah, I wonder if there's "CTM" documentation for R600 out there...

Later in that PDF it says, "1MB of GPR space for fast register access."

For what it's worth various patent documents describe a "low capacity" register file which has been puzzling me for absolutely ages - I guess this tallies with the idea of the "subsidiary register file" I've been talking about. If this is true then the bulk "1MB register file" is actually just a "register file cache", virtualising the full 16384-scalars per thread "register set" that R600 supports. The actual register file implemented per SIMD is a small block of memory:

SIMD processor and addressing method

Here a 10-bit address is provided for a 1KB register file - able to address individual bytes, rather than the scalars (4 bytes) that I've been describing.

---

How big would all the registers for, say, 24 cores of Larrabee be, scalar + vector?
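
A back-of-the-envelope attempt, assuming 32 x 512-bit vector registers and 16 x86-64 integer registers per hardware thread, with 4 hardware threads per core - all assumptions about Larrabee rather than confirmed figures:

```c
#include <stdio.h>

int main(void)
{
    /* Assumed per-core figures (rumoured, not confirmed in this thread):    */
    /*  - 32 vector registers, 512 bits (64 bytes) each, per hardware thread */
    /*  - 16 x86-64 integer registers, 8 bytes each, per hardware thread     */
    /*  - 4 hardware threads per core                                        */
    const int cores   = 24;
    const int threads = 4;
    const int vregs   = 32, vreg_bytes = 64;
    const int gprs    = 16, gpr_bytes  = 8;

    int per_thread = vregs * vreg_bytes + gprs * gpr_bytes;   /* 2176 bytes  */
    int total      = per_thread * threads * cores;

    printf("per thread : %d bytes\n", per_thread);
    printf("per core   : %d bytes\n", per_thread * threads);
    printf("24 cores   : ~%d KB of architectural registers\n", total / 1024);
    return 0;
}
```

Which would come out at roughly 200KB of architectural register state across 24 cores, before counting any renaming or other implementation registers.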

Jawed
 
By the way, is there any serious, comparative study that proves that x86 does indeed compress the code? The only support for this claim I've ever seen was like "it's CISC and variable-length, therefore it's more compact, period".

Quite a few, actually. Overall, x86 object code tends to be smaller, on the order of 1.5-2x, versus equivalent non-x86 CPUs, and the critical code paths tend to be smaller as well. In general the major impact of this is in I-cache sizes.

Aaron Spink
speaking for myself inc.
 
R600 breaks up shader programs into clauses of, at maximum [...] In order to pipeline these copy operations, I guess "triple buffering" is required.

Wow. I can see good things about each of these architectural quirks alone, but together they sound like someone *trying* to waste power. Moving data around without operating on it is something power-efficient designs try to avoid, rather than building an architecture that requires it to happen at some minimum frequency regardless of other factors.

Of course all engineering is a balancing act, and maybe this pays for itself in increased efficiency elsewhere :???:.
 
Moving data around without operating on it is something power-efficient designs try to avoid, rather than building an architecture that requires it to happen at some minimum frequency regardless of other factors.
Which data is being moved around but not being used?

My guess is that a clause might consume, say, 4x vec4s from the register set, update 3x vec4 clause-temporaries and 2x vec4s during clause execution. Then after execution write back the 2x vec4s and junk the 3x vec4 clause-temporaries.

Are you alluding to the reading of registers without updating them? Or do you mean that over the lifetime of a shader a register might be read, say, 10 times (in 10 different clauses), but only updated once? :???:

As far as I can tell what we're looking at here is each clause being the equivalent of a macro-op. That it can consist of up to 125 micro-ops is a bit zany, but I'm struggling to see which data is being moved around wastefully.

Jawed
 
It's quite possible I simply misunderstood you.

The way I read it, there's essentially two register files, a big&slow one and a small&fast one. Each clause does all its work in the fast registers: at the beginning of the clause all values it needs to read are copied into the fast registers, and at the end of the clause any live outputs (i.e. values needed by subsequent clauses) are copied back to the slow registers.

Is that accurate?

If so, if clause A computes a value used by clauses B and C, then that value will be copied out from A and copied in to B and C. Compared to a design with a flat register file, that's three extra data movements.

This idea of big&slow vs. small&fast registers is just a two-level memory hierarchy (ignoring the cache and framebuffer levels beyond them). Which is fine by itself. But having these atomic clauses essentially means that every N operations you're forced to flush/invalidate the lowest level of the hierarchy, and reload it next time you need the data.
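
Putting that understanding into (made-up) code, since it makes the extra copies obvious - none of the names or sizes below come from documentation, they just encode the guessed behaviour:

```c
/* Hypothetical model of clause execution against a two-level register file:
   a big & slow main register file per thread and a small & fast clause-local
   one. All names and sizes are made up for illustration.                    */

enum { MAIN_REGS = 512, FAST_REGS = 64 };

typedef struct {
    float main_rf[MAIN_REGS];   /* big & slow: persistent shader state       */
    float fast_rf[FAST_REGS];   /* small & fast: clause working set          */
} thread_ctx;

typedef struct {
    int  live_in [FAST_REGS];   /* main-RF registers read by this clause     */
    int  live_out[FAST_REGS];   /* main-RF registers written by this clause  */
    int  n_in, n_out;
    void (*body)(float *fast_rf);  /* the clause's ALU work                  */
} clause;

static void run_clause(thread_ctx *t, const clause *c)
{
    /* copy-in: load everything the clause reads into the fast registers     */
    for (int i = 0; i < c->n_in; i++)
        t->fast_rf[i] = t->main_rf[c->live_in[i]];

    c->body(t->fast_rf);        /* ALUs only ever touch the fast registers   */

    /* copy-out: write back live results; clause-temporaries are just junked.
       For simplicity this sketch assumes the clause leaves its live outputs
       in fast_rf[0..n_out-1].                                                */
    for (int i = 0; i < c->n_out; i++)
        t->main_rf[c->live_out[i]] = t->fast_rf[i];
}
```

With this model, a value produced in clause A and consumed by clauses B and C gets one copy-out and two copy-ins that a flat register file would never perform.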
 
It's quite possible I simply misunderstood you.

The way I read it, there's essentially two register files, a big&slow one and a small&fast one. Each clause does all its work in the fast registers: at the beginning of the clause all values it needs to read are copied into the fast registers, and at the end of the clause any live outputs (i.e. values needed by subsequent clauses) are copied back to the slow registers.

Is that accurate?
Yep, that's a good summary.

If so, if clause A computes a value used by clauses B and C, then that value will be copied out from A and copied in to B and C. Compared to a design with a flat register file, that's three extra data movements.
Ah, OK, I understand what you're saying. Agreed.

This idea of big&slow vs. small&fast registers is just a two-level memory hierarchy (ignoring the cache and framebuffer levels beyond them). Which is fine by itself. But having these atomic clauses essentially means that every N operations you're forced to flush/invalidate the lowest level of the hierarchy, and reload it next time you need the data.
Often you'd be switching clauses anyway because of texture operations (or to evaluate a branching predicate for subsequent instruction issue). Regardless, yes, the extra data being moved is costly.

---

I've been fiddling some more with GPUShaderAnalyzer and what I'm finding is that there seems to be a low ceiling on the count of clause-temporaries. R122 seems to be the lowest I've gotten so far (i.e. 6 temporaries, R122-R127). 6 is a funny number. Anything more complex seems to start assigning shader registers, e.g. r1, r2 etc.

This makes me suspect that the first model is being used, where the ALU pipeline operates on both the main register file and the subsidiary one. That obviously makes the operand addressing more fiddly. Of course what I've been referring to as a subsidiary register file could be just a scratch area within the main register file :???:

Coming slightly back on topic I'm also wondering how Larrabee will map the voluminous register file specification of SM4 onto its caches/register-file/ALUs... Software threads versus hardware threads.

Jawed
 
This idea of big&slow vs. small&fast registers is just a two-level memory hierarchy (ignoring the cache and framebuffer levels beyond them).
Hmm, isn't this actually how G80 works? It does "wide but shallow" fetches from the register file into a block of memory that's used to enable the correct ordering of registers for the ALUs? This is how it "simulates" a multi-ported register file, using a single very fat port into a small memory that easily supports the "random" fetches the ALUs require.

As far as G80 ALUs are concerned addressing is from the fast register file, which presumably only requires a few bits of address per operand ("thread ID" takes care of the rest). I presume that the fast memory is a gather window for all types of operands: registers, parallel data cache and constants.

Jawed
 
I just got some performance figures back from a project I completed recently and found it rather interesting in relation to this discussion.


Some background..

The project in question was for the display section of a relatively advanced VNC system where the worst case involved analyzing two screens for differences, extracting the bounding rectangles of those differences, and moving the changed areas from one screen to the next while performing some dynamic digital grading, followed by an RGB->YUY2 colorspace conversion at the end. This is essentially a 3-step process: 1. Find differences and update the target screen 2. Perform digital grading 3. RGB->YUY2 (sounds a lot like a pixel shader running on a CPU? ;) )

I wrote all this in highly optimized MMX assembly code and got some rather surprising results. At 1680x1050x32 on a 1.3GHz P4 system the RGB->YUY2 transformation took about 9ms, as did the update stage... and the digital grading stage. The process as a whole ended up being around 30ms. When all 3 stages were combined into 1, the process as a whole went back down to around 9ms. I added some 'wavy screen effects' just to test the performance of doing so... still 9ms. The arithmetic instructions now outnumbered the memory operations by a massive amount.
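
For what it's worth, the win from combining the stages is essentially loop fusion: each pixel is touched once while it's hot in cache instead of the whole frame being streamed through memory three times. A minimal sketch (the per-pixel functions are stand-ins, not the MMX code described above):

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in per-pixel operations; the real code was hand-written MMX. */
static uint32_t update (uint32_t dst, uint32_t src) { return (src != dst) ? src : dst; }
static uint32_t grade  (uint32_t px)                { return px; }
static uint32_t to_yuy2(uint32_t px)                { return px; }

/* Three passes: each one streams the whole frame through the cache. */
static void separate_passes(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) dst[i] = update(dst[i], src[i]);
    for (size_t i = 0; i < n; i++) dst[i] = grade(dst[i]);
    for (size_t i = 0; i < n; i++) dst[i] = to_yuy2(dst[i]);
}

/* One fused pass: each pixel is loaded and stored once, so the working
   set per iteration stays in cache and the extra ALU work is ~free.     */
static void fused_pass(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = to_yuy2(grade(update(dst[i], src[i])));
}
```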

On a system clocked 500MHz faster but with the same cache size it was still at 9ms, but when it was run on a similarly clocked processor with twice the cache, the performance nearly doubled. On a dual-processor system with a combined total cache equal to the last setup (running the multithreaded version of the algorithm) performance was the same. C2Qs with doubled-up caches performed twice as fast as C2Ds, regardless of differences in clock rate.

To make a long story short, what ended up happening was that across all the systems tested, performance scaled almost perfectly linearly with cache size and nothing else seemed to have any effect on it. Tripling the ALU ops didn't affect performance, nor did going multi-core without increasing the overall cache amount. Bus speed differences had a marginal effect, but certainly nothing to write home about.

The moral of all of this is that with the clock frequencies being quoted for Larrabee the most limiting factor isn't going to be so much how many cores it has as it is how much cache and bandwidth is associated with each core and how it's organized. Being able to execute a bunch of different sections of the screen simultaneously doesn't really help you much if a single core can already process all the data from the last read before the next set arrives. I'm very interested in seeing exactly how they address this.. along with seeing what it's going to be like programming on it after they're done.

Looking over the posts here it seems a bunch of people are already saying this, but it was neat to have some "real" data on the matter.
 
Yes, cache misses are expensive on all current CPUs. This is even more of a problem when it comes to a pure in-order CPU. I can't remember who said it, but a GPU maker claimed that today's GPUs are designed to have very high texture cache hit rates, as this is one of the key factors in being fast.
 
Looking over the posts here it seems a bunch of people are already saying this, but it was neat to have some "real" data on the matter.
Forgive me for this silly question, but given that you probably had a pretty regular and predictable memory access pattern, were you prefetching your data at all?
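
On x86 that would typically look something like the sketch below (the prefetch distance is a made-up value and would need tuning per machine; the per-pixel work is a stand-in):

```c
#include <xmmintrin.h>   /* _mm_prefetch */
#include <stddef.h>
#include <stdint.h>

/* Software-prefetching a regular, linear scan. PREFETCH_AHEAD is a made-up
   distance (in bytes); prefetch is only a hint, so running slightly past
   the end of the buffer is harmless.                                        */
#define PREFETCH_AHEAD 256

static void process_scanline(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        _mm_prefetch((const char *)&src[i] + PREFETCH_AHEAD, _MM_HINT_T0);
        dst[i] = src[i];   /* stand-in for the real per-pixel work */
    }
}
```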
 
This makes me suspect that the first model is being used, where the ALU pipeline operates on both the main register file and the subsidiary one. That obviously makes the operand addressing more fiddly. Of course what I've been referring to as a subsidiary register file could be just a scratch area within the main register file :???:
Yay, just been reading the R600 ISA document and this is how it works, using a variably sized scratch area within the register file for clause temporaries.

It turns out there's only a maximum of 4 clause temporaries supported in R600.

In some cases I've seen GPUSA report fewer registers required for RV670. I wonder if this is because RV670 can support more clause temporaries?

Jawed
 