I seriously doubt that Larrabee is going to have generalized scatter/gather without a performance hit (just as with SSE); wouldn't it require too many read/write ports on the cache?
What's the actual impact of extra read/write ports (but keeping the same total bus width) on size and performance?
High-performance scatter/gather makes most sense for L1 cache. If all of the elements are located there you want it to be a low latency, high throughput operation (unlike SSE4's extract/insert instructions). When some of the elements are in L2 cache latency will be higher anyway, so it doesn't matter much if it takes a few extra cycles. The next few accesses should hit L1 cache again...
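As a rough sketch of why that matters (the function and names here are mine, nothing Larrabee-specific), this is what a 16-wide gather degenerates into without dedicated hardware:

```c
/* Hypothetical scalar fallback for a 16-wide gather: without hardware
 * support every lane becomes an independent load, so the slowest lane
 * (e.g. one L2 miss among fifteen L1 hits) sets the latency of the
 * whole operation. */
void gather16(float dst[16], const float *base, const int idx[16])
{
    for (int i = 0; i < 16; ++i)
        dst[i] = base[idx[i]];   /* 16 separate cache accesses */
}
```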
But it seems certain to have gather support in the texture units, and I can see some very good reasons to have a separate texture cache (L1).
It would be very interesting to know a bit more about Larrabee's texture samplers. Are they just 'co-processors' to the mini-cores that execute the texture sampling instructions? Are they glorified scatter/gather units and the filtering happens in the cores? How does cache coherency work with the rest of the chip? Etc.
I don't think it would be wise to count on scatter or gather for performance, but rather stick to standard SoA-style coding (and fully aligned, coalesced memory access patterns).
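To be clear about the layout I mean, a minimal illustration (struct names and sizes made up):

```c
/* AoS: fields interleaved per element; loading all x's into a SIMD
 * register needs a gather or a transpose. */
struct vertex_aos { float x, y, z, w; };

/* SoA: one contiguous, aligned array per field; a wide load of x's is a
 * single aligned, cache-line-friendly access. */
struct vertices_soa {
    float x[1024], y[1024], z[1024], w[1024];
};
```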
It seems that if you had access to interleaving and format-conversion instructions, the ROP could easily be built in software. With the speculative mental model I have of Larrabee's design, working on fully aligned 4x4 pixel blocks in the fragment shaders would make sense...
With 16-wide SIMD units, I expect it can transpose a 4x4 matrix in one instruction. This would make AoS-SoA conversion very fast, and indeed make scatter/gather less of a necessity for everything except texture sampling.
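For reference, here's the same idea with today's SSE as a stand-in (Larrabee's actual instruction for this is pure speculation on my part); four xyzw vertices go in, four component vectors come out:

```c
#include <xmmintrin.h>

/* AoS -> SoA for a block of four vertices via a 4x4 transpose. */
void aos_to_soa4(const float *v,   /* 16 floats: x0 y0 z0 w0 x1 y1 z1 w1 ... */
                 __m128 *xs, __m128 *ys, __m128 *zs, __m128 *ws)
{
    __m128 r0 = _mm_loadu_ps(v + 0);
    __m128 r1 = _mm_loadu_ps(v + 4);
    __m128 r2 = _mm_loadu_ps(v + 8);
    __m128 r3 = _mm_loadu_ps(v + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);   /* standard SSE transpose macro */
    *xs = r0; *ys = r1; *zs = r2; *ws = r3;
}
```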
Used as anything other than a GPU, Larrabee would still benefit greatly from scatter/gather instructions though. Think of ray tracing again, or any operation that requires a lookup table. Fast scatter/gather out of the L1 cache would greatly improve average latency. But maybe the texture samplers deal with all of this.
With this ordering, render target (RT) writes would land with good 2D locality for possible future texture reads from those RTs.
I'm not sure I follow here. Render targets are never read directly after being written. The first read as a texture could be in a very different location than the last write. So pixel order doesn't matter, at this level.
ROP blending would probably just be aligned reads/writes directly from the RT. Non-FP32 formats would be messier (extra performance cost) depending on the ISA. FP32 read/write would obviously be native (512-bit aligned R/W), but what about FP16, INT8, INT16? Possibly 256-bit aligned reads, with hardware pack/unpack + format conversion for the 16-bit types (perhaps done with separate instructions).
That seems likely. MMX/SSE has efficient pack/unpack instructions, so I'd expect Larrabee to have something similar.
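For example, something along these lines with today's SSE2 intrinsics as a stand-in (the blend is just an average, purely illustrative of the unpack/blend/pack pattern):

```c
#include <emmintrin.h>

/* Blend 16 INT8 channels from an 8-bit render target: widen to 16 bits,
 * do the arithmetic, pack back down with saturation. */
__m128i blend_avg_int8(__m128i dst_px, __m128i src_px)
{
    __m128i zero = _mm_setzero_si128();
    __m128i d_lo = _mm_unpacklo_epi8(dst_px, zero);             /* 8 x u16 */
    __m128i s_lo = _mm_unpacklo_epi8(src_px, zero);
    __m128i d_hi = _mm_unpackhi_epi8(dst_px, zero);
    __m128i s_hi = _mm_unpackhi_epi8(src_px, zero);
    __m128i lo = _mm_srli_epi16(_mm_add_epi16(d_lo, s_lo), 1);  /* (d+s)/2 */
    __m128i hi = _mm_srli_epi16(_mm_add_epi16(d_hi, s_hi), 1);
    return _mm_packus_epi16(lo, hi);                            /* 16 x u8 */
}
```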
In any case, rendering single points (or lines) is probably going to suffer the same poor performance relative to rendering triangles (most of the parallel shader computations doing nothing) that it does on NVidia and AMD/ATI GPUs.
Not necessarily. The software is not restricted to one rendering pipeline implementation, so you can specialize for the type of primitive. Generic scatter/gather would be handy though.
I think that at least for the fragment shader and ROP parts of a software pipeline, Larrabee could do quite well, setting aside my primary concern about it: hiding texture fetch latency (especially with dependent fetches).
Speculative prefetching FTW!
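Something like this is what I have in mind; purely a sketch, and whether Larrabee leans on software prefetch, hardware threads, or its texture units for this is anyone's guess:

```c
#include <xmmintrin.h>

/* Issue prefetches a few pixels ahead of the one being shaded, so the
 * texel fetch latency overlaps with useful work. */
void shade_batch(const float *texels, const int *addr, int n)
{
    for (int i = 0; i < n; ++i) {
        if (i + 4 < n)
            _mm_prefetch((const char *)&texels[addr[i + 4]], _MM_HINT_T0);
        /* ... shade pixel i using texels[addr[i]] ... */
    }
}
```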
Obviously, peak ALU performance could be around 16 SIMD lanes * 24 cores * 2.5 GHz = 960 GFLOPS, so Larrabee won't be a slouch there (at least compared to 2008 GPUs).
Add dual-issue or multiply-add to the equation and it can play with the big boys.
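Back-of-the-envelope, assuming a single-cycle multiply-add per lane (pure speculation on the issue rate): 16 lanes * 2 flops (MAD) * 24 cores * 2.5 GHz = 1920 GFLOPS.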
And I think it can have an efficiency advantage. A software pipeline never hits a fixed bottleneck in any one stage, doesn't stall with bubbles caused by register pressure, doesn't have to mask away unused components, can reduce overdraw to zero, can specialize for specific scenarios, etc.
I'm most excited about seeing the software evolve. The first version will likely be sucky, the second version could be so-so, the third implementation could be interesting, and the fourth implementation could blow us away. Then again, GPUs might choose a very similar software-oriented route...
This leaves attribute interpolation and all prior parts of the pipeline to talk about...
Attribute interpolation is at most two multiply-add operations per attribute, so I see no issues there. And I don't think there's any part of the pipeline we haven't discussed yet that isn't fairly straightforward to implement efficiently on a CPU.
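Concretely, in plane-equation form (names illustrative):

```c
/* Interpolate one attribute at pixel (x, y): exactly two multiply-adds. */
float interp(float a0, float dadx, float dady, float x, float y)
{
    return a0 + x * dadx + y * dady;
}
```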