LRB - ditching x86?

RPG - lrb is about programmability. In theory, a pure accumulator architecture can run the same shaders as GT200 or LRB. In practice...

I see no reason to remove x86 compatibility from LRB, although it'd be a good idea to shrink the overhead as much as possible.

Also, LRB has some pretty big handicaps compared to NV's GPU efforts:
1. First discrete GPU by Intel (in a long time)
2. First time the design team has worked together
3. First time the driver team has worked together
etc. etc.

Intel is coming at this from a big disadvantage and hopefully they can get close to NV.

DK
 
OK, 1/4 cache + 1/3 VPU ≈ 58% useful area. The rest, i.e. the x86 part, takes up 42%, which is massive. :oops:
Does that 42% area also contain the ring bus connection, the scalar core, or any other "useless" things?
rpg.314 said:
And let's face it, what can LRB provide you as a programmer that GT200 can't (from a programmability POV).
Depends on what you are after. For regular DX/GL stuff it doesn't provide much, but LRB is not limited to those areas.
 
Does that 42% area also contain the ring bus connection, the scalar core, or any other "useless" things?
The 42% given is for the scalar core (I don't know the exact number; the actual die shots show differing ratios), which, going by comparisons with Pentium-era chips, is possibly a third larger than it needs to be. The original guesswork didn't go beyond rough fractions because too much wasn't known.
Now that die shots of Larrabee are available, a lot more is known about what is taken up by the cores, cache, and other hardware, and this dilutes the x86 penalty further.

The dominant die penalty as far as current graphics applications are concerned might not be x86, so much as Intel's insistence on using a fully featured CPU core 32 times over.
x86 might have made the solution in the ballpark of 10% larger than it otherwise would have been--although there are more modern examples of even slimmer cores than the earlier RISC I used as a baseline (and some really tiny embedded cores).
Until more exotic methods of rendering become available for a good comparison, we can see from the Larrabee rasterizer description that a decent chunk of the instructions and much of the general capability of the x86 cores is not used.
A general-purpose RISC core would still have this functionality, and while it might be slimmer, it would still not be used and multiplied 32 times over.

I cannot really figure on what power penalties there might be, as there is no good way to compare without physical implementations, and too much can vary between manufacturers and processes. The vector units themselves will likely swamp the TDP calculations during peak loads.
 
In a LRBni compatible way? I mean with all the masks and swizzles etc.?
No. Although masking and swizzling support is pretty good, AVX sorely lacks scatter/gather. As vectors get wider, collecting elements from different memory locations becomes a huge bottleneck. Also lacking are any reasonably fast exp/log instructions.

On the other hand, AVX does support vectors of small integers. But with 512-bit registers I don't think that actually matters much anyway.
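
To make the scatter/gather point concrete, here's a minimal sketch (plain SSE intrinsics, illustrative names only, not real AVX or LRBni code) of what a "gather" costs without hardware support: one scalar load per lane, so the work grows linearly with vector width.

#include <xmmintrin.h>

// Gather four floats from arbitrary indices without a gather instruction:
// four scalar loads, then pack them into one vector register.
__m128 gather4(const float* base, const int idx[4])
{
    return _mm_set_ps(base[idx[3]], base[idx[2]],
                      base[idx[1]], base[idx[0]]);   // _mm_set_ps takes lanes high-to-low
}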
 
Now, now, this is pure speculation. Let's wait for the actual hardware before declaring winners.
I personally would be very impressed if they get a Larrabee implementation matching the GTX 285 - with a completely software DX10 driver, mind you.
If this kind of performance can be achieved while emulating a hardware oriented API, I can't wait to see what direct low level access can get you in terms of performance, quality and features.

But in today's terms, that's like Intel releasing a GPU that's twice as big as GT200 but performs in line with something like an HD 3850.

Fully programmable or not, I wouldn't be particularly impressed with that, especially when you consider how programmable NV's GT4xx series and ATI's RV9xx (which this could be competing with) will be by then.

That's if the rumor is true, though; the fact that it's not going to be released for another 2 years suggests that the final performance will be much better than "what it is now".
 
No. Although masking and swizzling support is pretty good, AVX sorely lacks scatter/gather. As vectors get wider, collecting elements from different memory locations becomes a huge bottleneck. Also lacking are any reasonably fast exp/log instructions.

Better to say AVX lacks a well-performing scatter/gather (since SSE4's insert/extract can at least scatter/gather at the cost of one instruction per lane).

Still funny how the x86 vector hardware (excluding Larrabee) has been, and looks to still be, utterly clueless. One would think that after the 6th (or so) revision they might be able to get it right, but no, IMO. Instead they support the lazy-programmer route and provide fast unaligned loads and funny instructions for bad AoS code (which isn't ever going to scale as you increase the vector width).
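
For the record, here is roughly what that "one instruction per lane" scatter looks like with SSE4.1 extracts (a sketch with illustrative names, not code from any real driver):

#include <smmintrin.h>  // SSE4.1

// Scatter four 32-bit lanes to arbitrary addresses: one extract + store per lane.
void scatter4(int* base, const int idx[4], __m128i v)
{
    base[idx[0]] = _mm_extract_epi32(v, 0);
    base[idx[1]] = _mm_extract_epi32(v, 1);
    base[idx[2]] = _mm_extract_epi32(v, 2);
    base[idx[3]] = _mm_extract_epi32(v, 3);
}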
 
Still funny how the x86 vector hardware (excluding Larrabee) has been, and looks to still be, utterly clueless. One would think that after the 6th (or so) revision they might be able to get it right, but no, IMO. Instead they support the lazy-programmer route and provide fast unaligned loads and funny instructions for bad AoS code (which isn't ever going to scale as you increase the vector width).
It's far from perfect, but I wouldn't say clueless. Everything you need for multimedia and scientific computing is there. Also, don't underestimate the amount of SSE code in (graphics) drivers and programming frameworks. Just because it isn't widely used by application programmers doesn't mean it's little used in general.

Anyway, there's still a long way to go to use the full potential, and the fragmentation of SSE definitely doesn't help. Once all scalar operations have a vector equivalent, though, we'll be able to vectorize code automatically much more easily. Single-instruction scatter/gather and shifting by independent values are the first things that come to mind. The latter will be added by AMD in SSE5, by the way...
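
As an illustration of the shift point, here's a minimal sketch (plain SSE2, illustrative names) of the scalar fallback that shifting each lane by its own count currently requires:

#include <emmintrin.h>  // SSE2

// SSE only offers a single shared shift count per vector, so a per-lane
// variable shift falls back to scalar work, lane by lane.
__m128i shift_left_per_lane(__m128i v, __m128i counts)
{
    alignas(16) unsigned int x[4], c[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(x), v);
    _mm_store_si128(reinterpret_cast<__m128i*>(c), counts);
    for (int i = 0; i < 4; ++i)
        x[i] <<= c[i];                        // one scalar shift per lane
    return _mm_load_si128(reinterpret_cast<const __m128i*>(x));
}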
 
To all the lrb-is-about-programmability guys.

I admit I haven't given this issue more than a superficial thought. But if you think that LRB can do something that GT200 or rv770 can't, then I'd like to know that workload.

IOW: Please give me an example where you can do something with lrb1 that you can't (as efficiently or with as much effort) with GT200 or RV770.

My guess is that there aren't any at all.
 
Off the top of my head I recall:

Potentially escape the clutches of bog-standard rasterization for the ray-tracing fans out there.

order-independent transparency
irregular z-buffer

probably more compliant floating point capability
most likely better DP
a much shorter write-then-read path with the general caches


possibly elements of the driver or optimizer that run on the CPU that might be put on Larrabee
 
Off the top of my head I recall:

Potentially escape the clutches of bog-standard rasterization for the ray-tracing fans out there.

order-independent transparency
irregular z-buffer

OK, why can't you do that in CUDA if you use shared memory as your cache, with smaller tiles of course?
probably more compliant floating point capability
most likely better DP
That is not part of the programming model per se, although better numerics would help.

a much shorter write-then-read path with the general caches
Do you mean that writes don't stall / have less latency / are closer to the ALUs here?
possibly elements of the driver or optimizer that run on the CPU that might be put on Larrabee
That is a definite plus. It may help with runtime profile guided optimization, but I am not sure if that will be a big win over doing it over the PCIe bus.
 
OK, why can't you do that in CUDA if you use shared memory as your cache, with smaller tiles of course?
One Larrabee L2 tile (256 KB per core) has 16 times the capacity of the 16 KB of shared memory a multiprocessor in GT200 gets.

You added a qualifier to your request:
"IOW: Please give me an example where you can do something with lrb1 that you can't (as efficiently or with as much effort) with GT200 or RV770."

Wrangling CUDA to thrash about in tiny isolated memory pools and then having to export and import to global memory is nowhere near as efficient.
It's computationally possible, but I wouldn't hold my breath for acceptable performance.


Do you mean that writes don't stall / have less latency / are closer to the ALUs here?
Writing to memory and then reusing the result is a full round-trip.
Atomic ops in the latest Nvidia GPUs are very slow, for example.
 
IOW: Please give me an example where you can do something with lrb1 that you can't (as efficiently or with as much effort) with GT200 or RV770.
Render target read. That alone is the only example you should need, and it's just the tip of the iceberg :)
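
As a rough sketch of why it's essentially free in a software pipeline (illustrative C++ only, not Larrabee's actual code): when the framebuffer tile lives in the core's cache, the "pixel shader" can read the destination value like any other memory and apply whatever blend it likes.

#include <cstdint>

// Hypothetical 64x64 tile kept resident in cache; names and layout are illustrative.
struct Tile { uint32_t color[64 * 64]; };

uint32_t shade_pixel(Tile& tile, int x, int y, uint32_t src)
{
    uint32_t dst = tile.color[y * 64 + x];   // render target read: just a load
    // Arbitrary custom "blend": keep the numerically larger packed value.
    uint32_t out = (src > dst) ? src : dst;
    tile.color[y * 64 + x] = out;            // render target write: just a store
    return out;
}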

Anything where you have to modify the rasterizer is a good example too. Logarithmic, dual paraboloid, B-spline, patch, etc. rasterization all come to mind, not to mention much more sophisticated things (REYES).
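
A minimal half-space rasterizer sketch (illustrative, not Larrabee's code) shows why: the coverage test is just arithmetic, so warping the coordinates (logarithmic, dual paraboloid, ...) before the test is an ordinary code change rather than a new hardware unit.

struct Vec2 { float x, y; };

// Twice the signed area of triangle (a, b, p); positive when p lies to the left of a->b.
static float edge(Vec2 a, Vec2 b, Vec2 p)
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Assumes counter-clockwise winding in a y-up coordinate system. In a software
// pipeline, v0..v2 and p could first be run through any projection you like.
bool inside_triangle(Vec2 v0, Vec2 v1, Vec2 v2, Vec2 p)
{
    return edge(v0, v1, p) >= 0 && edge(v1, v2, p) >= 0 && edge(v2, v0, p) >= 0;
}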

Really though, if you're willing to cast off your graphics API shackles, there are few limits to what you can do. There are a huge number of things that are interesting to do in a renderer that don't map nicely to DirectX or even CUDA/OpenCL/ComputeShader.

And NVIDIA has implemented and demoed "alias free shadow maps" in CUDA, but they weren't exactly fast (or alias free for that matter :))...
 
I should have made my meaning clearer.

Please give me an example where you can do something with lrb1 that you can't (as efficiently or with as much effort) with CUDA or RV770+opencl (whenever it comes out).
Really though, if you're willing to cast off your graphics API shackles, there are few limits to what you can do. There are a huge number of things that are interesting to do in a renderer that don't map nicely to DirectX or even CUDA/OpenCL/ComputeShader.

That's what I want to know. What are those things that map nicely to lrb1 but not to CUDA or RV770+opencl?
 
A better question would be: of the things which Larrabee could do more efficiently than DX11-generation GPUs from NVidia and AMD, will those things run fast enough to be used in games?

We've got REYES and raytracing and much more running on GPUs now. And when I say run fast enough, I mean you could hit 30 fps at 720p with AA using that technique AND doing everything else needed to render the frame, including animating characters and transparent particles, post-processing, etc. So something like a really nice irregular shadowmap technique that runs at 30 Hz without rendering materials doesn't count; it needs to be the entire engine.
 
I clearly don't understand your question... all of the things that I mentioned you can do with LRB1 and not (efficiently) with CUDA/OpenCL. Render target read is a prime example, which is basically "free" on LRB and ridiculously expensive - if ever implemented at all - on current NVIDIA/AMD-style architectures.

We've got REYES and raytracing and much more running on GPUs now.
Uhuh, and I can run REYES on my phone if I bothered to code it up, but it's not going to be fast or efficient. All of the rasterization-based things that I mentioned can be implemented orders of magnitude more efficiently on LRB than on competing hardware. Go implement log rasterization - or any rasterization - in CUDA for instance and let me know how fast it is...

I'm not trying to be coy or dismiss OpenCL/ComputeShader, but are you guys really trying to argue that they are currently suitable for writing - for instance - an efficient software rasterizer in? I'd love to be convinced, but I'll wait until I see it. And if someone does rewrite the graphics pipeline in an efficient, programmable way on that hardware then maybe I can finally read my render targets :)
 
Please give me an example where you can do something with lrb1 that you can't (as efficiently or with as much effort) with CUDA or RV770+opencl (whenever it comes out).
Read the 'limitations' section (or whatever it's called) in the CUDA and OpenCL specifications.

Current GPUs don't have a stack and don't support function pointers, so you can forget about deep function calls, recursion, and polymorphism.
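
A minimal sketch of the kind of code that is trivial on a general-purpose core but has to be flattened by hand (explicit stack, switch instead of function pointers) on such GPUs; names are illustrative:

// Count tree nodes whose value satisfies a caller-supplied predicate.
struct Node {
    int   value;
    Node* left;
    Node* right;
};

int count_matching(const Node* n, bool (*pred)(int))   // function pointer
{
    if (!n) return 0;
    return (pred(n->value) ? 1 : 0)
         + count_matching(n->left, pred)                // recursion
         + count_matching(n->right, pred);
}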
 
Function pointers are in DX11, right?

DX11 gives better ability to patch shaders, but no run-time in-shader function pointers as far as I know.

Speaking about DX11, I'm playing a little devil's advocate here, because I do see a few things which I think would run faster on Larrabee, as well as stuff which I believe will run faster on non-Larrabee GPUs. The question really comes down to who will make best use of the platform at hand (and who has the best art and gameplay). Part of the problem here is that while we can talk about specific algorithms working better on a given machine, ultimately what matters more is which algorithm, adapted to the machine, provides better results. So perhaps on Larrabee you use A and on non-Larrabee you use B. Just because A doesn't work well on GPUs and B doesn't work well on Larrabee doesn't mean much.

What I see as more important is that a lot of developers won't have the ability or time to jump into LRB native and will instead be using DX11. So will Larrabee be faster at DX11?
 