Larrabee delayed to 2011?

Add to this the ability to program it in regular C++ (or Fortran or whatever), and you can see why an architecture like Larrabee is far more likely to dictate the future of HPC.

IMHO, the ability to write massively parallel codes in C/C++/Fortran isn't a bonus. It's an evolutionary leftover from a distant era. Personally, I feel classic C/C++/Fortran are fundamentally broken for parallelism. Java/C# etc. are not much better.

Developers don't like hybrid solutions.
Well, so far nobody has managed to make a language that spans the dynamic range from massively parallel to purely serial code and isn't purely functional. And the purely functional ones don't seem to be doing very well in terms of adoption just yet.

The best candidate to create a strong software ecosystem, in my opinion, would be x86. It's not vendor neutral, but it already has a massive existing ecosystem it can borrow from.
10 years ago I'd have agreed with you, but today I'd say MSIL (or its bastardized cousin) has a better shot at it than the x86 ISA. Don't forget that there are very few production GPU codes yet. And IMHO, 90% of them will be written with whatever tools MS can cook up.
 
Scatter/gather without just slow serialization is really hard to do with caches ... and with snooping you can just plain forget about it (scaling the snoop filters, invalidation ports etc. for the order-of-magnitude higher traffic it can cause is not an option).

In any case where scatter/gather would run into an issue with snooping, it is going to just plain suck on any modern memory system. I've been in the room with various parties interested in scatter/gather, people with significant experience and past history with it, and the fundamental problem is that unless the data you are scattering/gathering lives in a local SRAM, the interconnect and memory become the bottleneck. And of course, in all their workloads, there's no way to really keep things that local because the data sets are so large. Oh, what they would give to be able to use SRAM as main memory again.

S/G on local caches isn't ideal, but it isn't really any harder than using the enormous, massively ported register files ATI/Nvidia use now in order to support it. If you care a lot more about striding there are some elegant solutions available, but for general S/G, it's all about using multi-ported register files instead of RAM arrays for your L1 arrays. This is universal and doesn't really depend on whether you call your L1 arrays local stores, shared memory, or cache.
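
To make it concrete from the software side, here's a minimal sketch of what a gather boils down to, using AVX-512 intrinsics purely as a stand-in (Larrabee's LRBni vgather isn't publicly targetable); the hardware question above is how many of those 16 indexed loads the L1 arrays can actually service per cycle:

#include <immintrin.h>
#include <cstdint>

// Scalar "serialized" gather: one load per element, each potentially
// touching a different cache line / bank.
void gather_scalar(const float* base, const int32_t* idx, float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = base[idx[i]];
}

// Vector gather: 16 indexed loads issued from one instruction. Whether they
// complete in a few cycles or get serialized is exactly the L1 porting/banking
// question above. (Remainder handling omitted; compile with -mavx512f.)
void gather_vector(const float* base, const int32_t* idx, float* out, int n) {
    for (int i = 0; i + 16 <= n; i += 16) {
        __m512i vidx = _mm512_loadu_si512(idx + i);
        __m512  v    = _mm512_i32gather_ps(vidx, base, 4);  // scale = sizeof(float)
        _mm512_storeu_ps(out + i, v);
    }
}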
 
IMHO, the ability to write massively parallel codes in C/C++/Fortran isn't a bonus. It's an evolutionary leftover from a distant era. Personally, I feel classic C/C++/Fortran are fundamentally broken for parallelism. Java/C# etc. are not much better.

So when this perfect mythical pink elephant decides to show up, I'm sure we'll ride him. In the meantime we'll use the robust beasts of burden that we already have.

Well, so far nobody has managed to make a language that spans the dynamic range from massively parallel to purely serial code and isn't purely functional. And the purely functional ones don't seem to be doing very well in terms of adoption just yet.

Functional languages are where it's at. It's a shame that so few people can actually use them correctly and that the two major functional languages are such clusters of semantic crap. Yet still, almost every piece of electronics now is based on them.
 
Scatter/gather without just slow serialization is really hard to do with caches
I agree, but I don't think it invalidates the point. Even accepting a speed hit, I don't see moving forward without caches. Fermi's caches - while not completely coherent - also show promising performance results, so I'm unwilling to accept that it's a problem that can't be overcome.

IMHO, the ability to write massively parallel codes in C/C++/Fortran isn't a bonus. It's an evolutionary leftover from a distant era. Personally, I feel classic C/C++/Fortran are fundamentally broken for parallelism. Java/C# etc. are not much better.
While I resoundingly agree with your point, there's still a lot of non-performance-critical code that can happily run in any of these languages without affecting the overall speed of the code. At least until we have CPUs and GPUs on the same chip using the same memory subsystem (i.e. same cache hierarchy) - and maybe even then - it's unreasonable to say that all of this code belongs on the CPU.
 
Writing efficient low-level code comes down to internalizing the compiler and hardware ... the way it ends up being executed is inherently imperative, so the effort it takes to do this for functional programming is much greater than with imperative languages. Syntactic sugar is helpful, type systems are helpful, deadlock/race detection/prevention is helpful, completely obscuring control flow is not helpful.

For that couple of percent of the most important code, functional programming will never be the right tool.
 
So when this perfect mythical pink elephant decides to show up, I'm sure we'll ride him.

Hmm.., OK.

In the meantime we'll use the robust beasts of burden that we already have.
C/C++ are anything but robust in the parallel era. If anything, they are an even more fragile tool in the parallel world. Or a sharper double-edged sword, if you prefer.

Further, continuing to use a broken language - for any reason - won't fix the problem.
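
To make the fragility concrete, a minimal standard-C++ sketch (nothing Larrabee-specific, just an illustration): the racy version compiles without a single diagnostic, yet its behaviour is undefined the moment two threads run it.

#include <atomic>
#include <thread>
#include <vector>

long              sum_racy = 0;     // plain shared variable
std::atomic<long> sum_atomic{0};    // the "fix" the language makes you opt into

void accumulate(const std::vector<long>& data, bool safe) {
    for (long x : data) {
        if (safe) sum_atomic.fetch_add(x, std::memory_order_relaxed);
        else      sum_racy += x;    // non-atomic read-modify-write: a data race
    }
}

int main() {
    std::vector<long> data(1'000'000, 1);
    std::thread t1([&] { accumulate(data, false); });
    std::thread t2([&] { accumulate(data, false); });
    t1.join(); t2.join();
    // sum_racy is very unlikely to be 2'000'000, and the compiler was free to
    // assume the race never happens. Nothing in the source marks the hazard.
}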

Functional languages are where it's at. It's a shame that so few people can actually use them correctly and that the two major functional languages are such clusters of semantic crap.

Which two did you have in mind? Haskell and Erlang?
 
With the benefit of hindsight, the rasterizer is the x87 of GPUs. You may hate it, you may even deprecate it by fiat (or by a sw renderer), but you may not remove it from the die.
 
I just don't believe having no fixed-function rasteriser is the problem. If it were that simple, you could give each core a rasteriser, or give each texture unit a rasteriser.

Also, seeing how much grief 4 rasterisers have given NVidia (and theirs are fixed-function), it seems to me Intel gave up too soon :p
 
I thought the grief was from maintaining triangle order in a distributed environment? That challenge remains whether you're doing rasterization in fixed-function units or software.
 
I thought the grief was from maintaining triangle order in a distributed environment? That challenge remains whether you're doing rasterization in fixed-function units or software.
Precisely my point. Intel's software rasterisation obviates the triangle ordering problem with its tiled approach. The struggle NVidia had is analogous to the struggle Intel had in distributing work across the cores.

I think the problem lies elsewhere, e.g. a yearly TAM of, say, $2 billion for performance/enthusiast discrete just isn't worth chasing in comparison with server/cloud/HPC.

Also, life's simpler for Intel if it doesn't have to write drivers for D3D. There was always the question hanging over the architecture of how long it would take Intel to get a game's performance right, with worrying statements that it could take months after a game's release. (AMD doesn't seem to have much of a different attitude, though.)
 
Precisely my point. Intel's software rasterisation obviates the triangle ordering problem with its tiled approach.
Off hand, I can't see how it obviates the need. You just moved the serialization point from rasterization to spatial binning. Scaling spatial binning across cores while maintaining triangle order isn't exactly easy.

I think the problem lies elsewhere. e.g. $2 billion yearly TAM, say, for performance/enthusiast discrete just isn't worth chasing in comparison with server/cloud/HPC.
And by the looks of it, they have got it almost right for SNB.
 
Off hand, I can't see how it obviates the need. You just moved the serialization point from rasterization to spatial binning. Scaling spatial binning across cores while maintaining triangle order isn't exactly easy.
The serialisation is actually per tile-pixel (or more granular, e.g. per tile quad), and local to a single core, since tiles in rasterisation (the stages after setup, through to the back-end) don't span cores.
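
Roughly, the structure I mean looks like this (a generic binning sketch, not Intel's actual renderer; the names are made up). The only globally ordered step is appending set-up triangles to bins, shown serially here for clarity - real front-ends parallelise it with per-core bins merged in submission order - while each back-end core replays whole tiles in bin order, so per-pixel triangle order falls out locally:

#include <cstdint>
#include <vector>

struct Triangle {
    // post-setup data: edge equations etc., plus the tile-space bounding box
    int minTileX, minTileY, maxTileX, maxTileY;
};

// Front-end: bin triangles in submission order. Because we append in that
// order, every bin is already sorted by submission order.
void bin_triangles(const std::vector<Triangle>& tris,
                   std::vector<std::vector<uint32_t>>& bins, int tilesX) {
    for (uint32_t id = 0; id < tris.size(); ++id) {
        const Triangle& t = tris[id];
        for (int ty = t.minTileY; ty <= t.maxTileY; ++ty)
            for (int tx = t.minTileX; tx <= t.maxTileX; ++tx)
                bins[ty * tilesX + tx].push_back(id);
    }
}

// Back-end: a core owns whole tiles, so replaying a bin front to back keeps
// triangle order per pixel without any cross-core coordination.
void shade_tile(const std::vector<Triangle>& tris,
                const std::vector<uint32_t>& bin) {
    for (uint32_t id : bin) {
        (void)tris[id];  // rasterise within this tile, blend in API order, ...
    }
}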
 
Is that assuming the implementation placed the tessellation stage in the front-end and not the back?

It would have been an interesting exercise to see what numbers Larrabee could have pulled in Heaven, its applicability to current workloads aside and assuming the software renderer had been functionally coded to the DX11 spec.

This latest Intel statement is far more down on Larrabee graphics than anything I've seen thus far, and is a noticeable drop from a position I had already perceived as rather lukewarm. I suppose Tim Sweeney will need to wait a little longer for his software rendering dream to come true.
 
Is that assuming the implementation placed the tessellation stage in the front-end and not the back?
I don't understand what you're suggesting.

Tessellation was very much an open question; I don't remember any of Intel's materials covering it.
 
The option existed to run the tessellation stages either in the front-end or the back-end.
Whether there was ever an implementation of it for Larrabee is something I do not know, but Intel did discuss the possibility.

If a primitive is allocated to a bin and the back-end is responsible for performing tessellation, the generated triangles on one core could cross the bin's tile boundaries.
 
Logarithmic shadow maps: now even more of a pipe dream! Oh well.

Anyway... if you look at communications, the vast, vast majority of software-centric architectures still do problematic algorithms like Turbo Coding and Viterbi in hardware blocks. But there are exceptions that do those very efficiently in software - the trick is that their architecture is incredibly unusual and very different from a traditional processor, even though it could afaict rightfully be called Turing complete (as long as you look at a large enough piece of it rather than just a subsystem).

The basic problem with graphics is that the number of blocks that would benefit from such exotic architectures is actually very small, and their data flow is very complex (rasterisation being the poster child). And going down that route would create a lot of complexity at the compiler for more normal shading workloads, so overall it just doesn't make any sense and the best approach remains fixed-function.

The one thing Larrabee did provide above and beyond any current desktop GPU architecture is scalar/MIMD, and interestingly on-core rather than as a separate on-chip block. I'm honestly unsure whether there is much benefit to on-core SIMD+MIMD in either graphics or GPGPU compared to separate SIMD and MIMD cores, but a frequent problem of the latter in 80s/90s architectures is the lack of bandwidth between the scalar and the vector part. With the power consumption of data communication even on-chip increasing to dramatic levels, there might be something to be said for on-core integration of the two not (just?) from a software level but from a hardware level. Some sort of close coupling at least would make sense.

Of course, ideally we'd all go pure MIMD. Rys, can I haz Series6? :D (and please don't break my heart and tell me it's SIMD now :()
 
The option existed to run the tessellation stages either in the front-end or the back-end.
Can't remember seeing that :???:

If a primitive is allocated to a bin and the back-end is responsible for performing tessellation, the generated triangles on one core could cross the bin's tile boundaries.
Tessellation consumes patches. I don't think patches would be screen-space binned.

The patches should be able to run in parallel through VS/HS to generate input to TS and DS. Ordering of triangles coming out of DS should be keyed by Patch ID, I presume (TS generating sub-patch triangle ID).
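
For concreteness, the sort of key I have in mind would look something like this (purely a sketch; the names are hypothetical, not anything from Intel's materials):

#include <cstdint>
#include <tuple>

// Hypothetical ordering key for triangles produced by tessellation: primary
// order is the patch's submission order, secondary order is the TS-generated
// triangle index within that patch. Merging per-core output streams on this
// key reproduces API order without serialising the VS/HS/TS/DS work itself.
struct TessTriKey {
    uint32_t patchId;    // submission order of the input patch
    uint32_t subTriId;   // sub-patch triangle ID generated by TS
    bool operator<(const TessTriKey& o) const {
        return std::tie(patchId, subTriId) < std::tie(o.patchId, o.subTriId);
    }
};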

Is there a serialisation I'm missing?

Anyway, screen-space tiling for binning of triangles involved in tessellation (input or output) would be done post-GS.
 
Can't remember seeing that :???:
Tom Forsyth's SIGGRAPH 2008 presentation touted the flexibility of assigning stages to either the front-end or the back-end.
Included in that set are GS and tessellation.

Tessellation consumes patches. I don't think patches would be screen-space binned.
Wouldn't this mean it occurs in the front end? If it's not in a bin, the back end would not be able to grab it.

The patches should be able to run in parallel through VS/HS to generate input to TS and DS. Ordering of triangles coming out of DS should be keyed by Patch ID, I presume (TS generating sub-patch triangle ID).

Is there a serialisation I'm missing?
VS is listed as a front-end capability. It was not clear to me that VS is one of the stages that could be put in either front or back.

Anyway, screen-space tiling for binning of triangles involved in tessellation (input or output) would be done post-GS.

GS could be either front or back as well.
 
Tom Forsyth's SIGGRAPH 2008 presentation touted the flexibility of assigning stages to either the front-end or the back-end.
Included in that set are GS and tessellation.

Looking at slide 22, the only way I can interpret tessellation being done in the back-end (along with GS) is if TS is synonymous with VS->HS->TS->DS (i.e. it is not a reference purely to the TS stage). In my interpretation, DS would be split between front-end and back-end:
  • Front-end DS would generate screen-space coordinates for the purposes of binning.
  • Back-end DS would generate all the other attributes of each vertex.
The advantages of delaying some DS work would include reduced storage in global memory and re-distribution of workload (e.g. later DS might lead to better load-scheduling).
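
Very roughly, the split I'm imagining looks like this (a speculative sketch of my interpretation above; the function names are made up and nothing here comes from Intel's slides):

struct DomainPoint { float u, v; };
struct Position    { float x, y, z, w; };
struct FullVertex  { Position pos; float normal[3]; float uv[2]; };

// Front-end DS: evaluate only what binning needs - the (eventually screen-
// space) position - so tessellated triangles can be binned without computing
// or storing the full vertex.
Position ds_position_only(const DomainPoint& dp /*, patch constants */) {
    return { dp.u, dp.v, 0.0f, 1.0f };   // placeholder evaluation
}

// Back-end DS: re-evaluate (or finish evaluating) the full vertex inside the
// owning tile's bin, trading some recomputation for far less data parked in
// global memory between the two phases.
FullVertex ds_full(const DomainPoint& dp /*, patch constants */) {
    return { ds_position_only(dp), {0.0f, 0.0f, 1.0f}, {dp.u, dp.v} };
}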

Wouldn't this mean it occurs in the front end? If it's not in a bin, the back end would not be able to grab it.
Precisely. But tessellation doesn't have to be complete for binning to start (the position attribute of each vertex is what's mandatory for binning), which leads me to suggest what I posted above.

GS could be either front or back as well.
GS can do a variety of things. If GS is used merely to delete vertices/triangles then in theory it can be delayed until after binning - again this is a load-balancing question, I think. i.e. run GS across lots of cores as they do binning, rather than on a few cores while creating bins.

Maybe there are some other usages of GS that are amenable to delayed execution (e.g. generating attributes)?

---

By the way, the term "rasteriser" is often used to describe all of these stages: setup->rasterisation->pixel shading->output merger (ROP). So it's possible to interpret the statement about the lack of a fixed-function rasteriser as actually describing the lack of fixed-function "setup->rasterisation->pixel shading->output merger". To be honest, I think this is very likely the correct interpretation.

I pretty much always thought it would be years before Intel was competitive at the enthusiast end, but that process technology would eventually allow it to catch up. A major question for the other IHVs is what proportion of die space ends up being programmable compute; the higher that rises, the more competitive Intel becomes.
 