LucidLogix Hydra, madness?

That's already been disclosed, I believe.
It's also not guaranteed to be applied universally.

The chip and driver will analyze a scene and then pick between SFR, AFR, object based, and whatever else it can do depending on what will work best or can be made to work for the task at hand.
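To put that in concrete terms, my guess at the kind of per-frame decision logic would look something like this (Python-ish sketch; every name and threshold is invented, nothing from Lucid):

# Hypothetical sketch of a per-frame split-mode decision. All names and thresholds
# are invented for illustration; nothing here comes from Lucid.
def pick_split_mode(frame_stats):
    if frame_stats["reads_previous_frame"]:            # inter-frame dependency (e.g. reflections)
        return "SFR"                                   # AFR would stall waiting on the other GPU's frame
    if frame_stats["object_count"] > 64 and frame_stats["little_overlap"]:
        return "OBJECT"                                # hand whole batches of objects to each GPU
    return "AFR"                                       # default: alternate whole frames between GPUs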
 
They'd be lucky to make a dime if that's the case. :???: Though some might say that they'll be lucky to make a nickel any which way. :LOL:
 
Well, there is a market for straight PCI-E bridges as well, as the Elsa product showed.

The fact that they are proactively on the defensive, saying "ELSA's implementation doesn't generally work in this mode", is a black mark, BTW. Basically the only news was that there will not actually be a product demonstrating their technology any time soon; the Fudzilla article was more like anti-news.
 
That's already been disclosed, I believe.
It's also not guaranteed to be applied universally.

The chip and driver will analyze a scene and then pick between SFR, AFR, object based, and whatever else it can do depending on what will work best or can be made to work for the task at hand.

I've been trying to wrap my head around this, but I don't see how this will work? Do they have a geometry front end? If not, how can they know which object is going where?

Also, are they going to share Z buffers between multiple video cards? How? If not, in case of overlapping objects rendered on different GPUs, you won't have the benefit of killing pixels early, and you may end up doing more work than with a single GPU. And then they'll still need some kind of additional pass to merge the two buffers. All over PCIe and without explicit support from AMD or Nvidia?

If they get this to work like they promise, it will be very interesting to see how they do it.
 
I'm not sure what exactly an implementation can do.
I got the impression from an Anandtech article that the chip can skip completely occluded objects, but that leaves a whole world of extra work for partial occlusion.
 
I've been trying to wrap my head around this, but I don't see how this will work? Do they have a geometry front end? If not, how can they know which object is going where?

Also, are they going to share Z buffers between multiple video cards? How? If not, in case of overlapping objects rendered on different GPUs, you won't have the benefit of killing pixels early, and you may end up doing more work than with a single GPU. And then they'll still need some kind of additional pass to merge the two buffers. All over PCIe and without explicit support from AMD or Nvidia?

If they get this to work like they promise, it will be very interesting to see how they do it.

From what I get from the patents, the Object mode splits up polygon data between the rendering pipelines (cards); when rendering is completed, the frame and Z buffers from each of the cards are merged in the compositing engine in the Hydra chip.
In the case of rendering transparent objects, each of the frame and Z buffers will be synchronized.
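If I read it right, for opaque geometry the composite step would boil down to a per-pixel Z compare, something like this (my own numpy-style sketch, assuming smaller Z means nearer):

import numpy as np

# Sketch of the per-pixel merge the compositing engine would have to do: for every
# pixel, keep the colour from whichever card produced the nearer Z value.
# Transparent geometry would need the extra synchronization mentioned above.
def composite(color_a, z_a, color_b, z_b):
    nearer_a = z_a <= z_b                              # True where card A's fragment is in front
    merged_color = np.where(nearer_a[..., None], color_a, color_b)
    merged_z = np.minimum(z_a, z_b)
    return merged_color, merged_z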

From the patents it seems like it should be able to switch rendering mode on the fly, by determining whether the mode used for the last frame was actually the optimal strategy.
 
But it should be a better system than just straight AFR, right?

There's nothing better than straight AFR. It has the highest potential efficiency of all multi-GPU approaches based on independent GPUs with dedicated framebuffers.

What I really don't understand about the Lucid hype is that something like object based rendering may remove the inter-frame dependencies that AFR is subject to but it introduces vastly more complicated intra-frame dependencies. Not to mention it really craps all over any sort of full screen Z-buffer techniques.

Deferred shading should be fun too.
 
I don't think so, but at least on X58 no extra hardware is required. There's still a certification fee or some such AFAIK.
 
There's nothing better than straight AFR. It has the highest potential efficiency of all multi-GPU approaches based on independent GPUs with dedicated framebuffers.

Umm no. IMHO, lrb's multi gpu scaling would definitely leave everyone else crapping in the pants even if lrb1 is a laughing stock all by itself. Single board multi gpu would definitely rock for it. Multiboard "intel's sli" would be limited only by inter gpu bandwidth, which, unfortunately, isn't going to improve anytime soon.

What I would love is ati/nv to adopt such similar automatic doubling of perf. No not 1.7/1.8/1.9x scaling. But actual 1.99x scaling which lrb should have without breaking into a sweat.

As far as the present approach is concerned, I find it to be a cheap slam-'em-together approach that isn't scalable in the long term. I think it's high time that ATI (in particular) did some dirty work and made the 4870X2 look like a single GPU to the driver itself. Just like multi-socket servers look to the kernel like multi-core CPUs.
 
Umm no. IMHO, lrb's multi gpu scaling would definitely leave everyone else crapping in the pants even if lrb1 is a laughing stock all by itself.
Which multi-gpu scaling solution are you talking about? Something Hydra based?

What I would love is ati/nv to adopt such similar automatic doubling of perf. No not 1.7/1.8/1.9x scaling. But actual 1.99x scaling which lrb should have without breaking into a sweat.
This is not about wishing something to be 1.99 scaling, it's about how to do it. But if the Intel solution is something like Hydra, then how will they pull it off? Like trinibwoy wrote: unlike AFR, with an object based solution, you also have to deal with intra-frame dependencies.

I think it's high time that ATI (in particular) did some dirty work and made the 4870X2 look like a single GPU to the driver itself. Just like multi-socket servers look to the kernel like multi-core CPUs.
If it were simple, it would already have been done...
 
Which multi-gpu scaling solution are you talking about? Something Hydra based?

LRB does not need any extra chip to midwife its multi-GPU work. It's all done in software, so just gang the two chips together at the hardware level and they are good to go. Just like multi-socket CPUs work out of the box. :LOL:

Of course, Intel will have to build in that capability, but I don't see them not doing it. Power/thermals may make it infeasible of course.
This is not about wishing something to be 1.99 scaling, it's about how to do it. But if the Intel solution is something like Hydra, then how will they pull it off? Like trinibwoy wrote: unlike AFR, with an object based solution, you also have to deal with intra-frame dependencies.

If the rendering is a giant for loop (appropriately vectorized and parallelized, of course), IMHO you don't need kludges like Hydra. Intra-frame dependencies are not a problem in this rendering method; they are more like consecutive kernel launches in CUDA, afaics.
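Roughly what I have in mind, as a toy sketch (everything below is invented for illustration):

# Toy illustration of the "consecutive kernel launches" view: each render pass is a
# parallel loop over independent work items, and a multi-chip split is just a
# partition of that loop's iteration space.
def run_pass(work_items, shade, num_chips=2):
    chunk = (len(work_items) + num_chips - 1) // num_chips
    results = []
    for chip in range(num_chips):                      # each slice could run on its own chip
        for item in work_items[chip * chunk:(chip + 1) * chunk]:
            results.append(shade(item))                # within one pass, items don't depend on each other
    return results

# A frame is then a sequence of such passes ("kernel launches"), e.g.
#   zbuf  = run_pass(tiles, z_prepass)
#   frame = run_pass(tiles, lambda t: shade_tile(t, zbuf))
# with each pass reading what the previous ones produced, much like consecutive
# CUDA kernel launches. (z_prepass and shade_tile are made-up placeholders.)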

If it were simple, it would already have been done...

It has been done before (look at SMP servers), but not in GPU space. OTOH, I'd argue that it's much easier and more scalable (in a restricted sense) in GPU space, as GPU caches are read-only so there is no cache coherency traffic.
 
LRB does not need any extra chip to midwife its multi-GPU work. It's all done in software, so just gang the two chips together at the hardware level and they are good to go.

There is no such thing as a pure software solution: you need to run it on some piece of hardware eventually.

Larrabee will have to execute the same identical set of input commands as any other GPU, after all, they're going to comply with DX10 etc.

The intra-frame dependencies are inherent to those commands. When you render a pixel that depends on a texture that was rendered in a previous step, then you have an intra-frame dependency. When you have a Z-only prepass before rendering the pixel, you have an intra-frame dependency.
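Written out as a flat command stream, the kind of dependency I mean looks like this (invented names, not any real API):

# Minimal sketch of a render-to-texture dependency inside a single frame.
# The command names are invented placeholders, not a real driver interface.
frame_commands = [
    ("set_render_target", "shadow_map"),
    ("draw", "scene_depth_only"),       # Z-only pass writes the shadow map
    ("set_render_target", "backbuffer"),
    ("bind_texture", "shadow_map"),     # the next draw reads what the first pass wrote;
    ("draw", "scene_lit"),              # if the shadow map sits in the other GPU's local
]                                       # memory, that read has to cross the inter-chip bus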

You're going to have to store those pieces of data somewhere and they will have to be accessed later on. The latter part has always been the problem: how can you make one chip access that data efficiently when it's stored in the local memory of another.

It is no different from an SMP system (which, contrary to what you imagine, is not exactly a champion when it comes to linear performance scaling) where one thread needs to wait for the results of another.

In a system where you have dependencies, it's vastly more efficient to keep everything under control on the same chip than to spread it out over many: a high bandwidth inter-chip data sharing interface is hard to design no matter what's sitting on either side of that interface.

In practice, you'll almost always end up with a NUMA situation, where local memory access is an order of magnitude faster than accessing data across the inter-chip bus... This is no different with an SMP system. There's a reason for the existence of NUMA-aware Linux schedulers...

Even in AFR, you have dependencies but they're only inter-frame, e.g. if a frame uses a previous one for reflections.

I'm afraid your view is a little bit naive...
 
I'm not sure what exactly an implementation can do.
I got the impression from an Anandtech article that the chip can skip completely occluded objects, but that leaves a whole world of extra work for partial occlusion.

Even then, why would I be so naive as to believe that graphics IHVs haven't looked into object-based culling so far? As a layman, I don't think the idea is completely worthless; since I had heard in the past that at least one IHV is doing some research on the matter, I'd rather speculate that it's a matter of time until IHVs incorporate such algorithms in their pipeline.
 
Even then, why would I be so naive as to believe that graphics IHVs haven't looked into object-based culling so far? As a layman, I don't think the idea is completely worthless; since I had heard in the past that at least one IHV is doing some research on the matter, I'd rather speculate that it's a matter of time until IHVs incorporate such algorithms in their pipeline.
The key phrase here is 'in their pipeline'. It's one thing for an IHV to embed it somewhere after the vertex shader stage; it's something else to bolt it onto an existing chip with no internal insight whatsoever into where the objects will actually end up.

Forget about combining different frames back together (which has its own problems, I'm sure), the most fundamental issue is to know where the data ends up on the screen.

The suggestion that this would be an adaptive process based on the previous frame certainly has merit, but even there I don't see how they can extract that data after the fact.

E.g. imagine that one object is completely occluded by another. When you get the color and Z buffers back at the end of the frame, how are you going to determine that one pixel was from object A and not from object B? How will you do it if your scene has thousands of objects?
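The only way I can see to answer that is with some kind of per-pixel object ID written during rendering, which is exactly the data a bolted-on chip never gets. A rough sketch of what you'd need (all invented):

# With a per-pixel object ID buffer, attributing final pixels back to objects is
# trivial; with only the merged colour and Z buffers there is nothing to count.
# This is purely illustrative; no such buffer is exposed to an external chip.
def coverage_per_object(id_buffer):
    counts = {}
    for row in id_buffer:                              # id_buffer: 2D grid of per-pixel object IDs
        for obj_id in row:
            counts[obj_id] = counts.get(obj_id, 0) + 1
    return counts                                      # pixels contributed by each object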
 
Personally, reading between the lines, I had similar thoughts/doubts about if and how the technology is supposed to work. One minor tidbit:

You're going to have to store those pieces of data somewhere and they will have to be accessed later on. The latter part has always been the problem: how can you make one chip access that data efficiently when it's stored in the local memory of another.

In theory I don't see why a shared memory pool between the chips would be a problem for such cases. On today's multi-chip/GPU configs, the first redundancy I can think of is double the amount of memory compared to single-chip solutions (whereby it's always N amount of memory per chip per frame and not twice as much).
 
Umm no. IMHO, lrb's multi gpu scaling would definitely leave everyone else crapping in the pants even if lrb1 is a laughing stock all by itself. Single board multi gpu would definitely rock for it. Multiboard "intel's sli" would be limited only by inter gpu bandwidth, which, unfortunately, isn't going to improve anytime soon.

Huh? What's so special about lrb that it will have magical multi-gpu scaling? It will face the same dependency problems that the other architectures face today. And like silent-guy said, lrb's architecture doesn't give it a free pass on inter-chip bandwidth.

In theory I don't see why a shared memory pool between the chips would be a problem for such cases. On today's multi-chip/GPU configs, the first redundancy I can think of is double the amount of memory compared to single-chip solutions (whereby it's always N amount of memory per chip per frame and not twice as much).

But isn't that the fundamental problem? In order to share memory you need some sort of communication protocol between the chips. Both chips can't independently just access the framebuffer without some sort of co-ordination. So all of your memory management processes now have to go over this bus - global memory atomic ops, for example.
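As a toy illustration of the point (threads and a lock standing in for chips and an inter-chip bus; pure stand-in code, not how any real GPU pair shares memory):

import threading

# Two "chips" bumping one shared counter: the result is only correct because every
# update is coordinated through the lock. Between real chips that coordination
# traffic has to cross the inter-chip bus, which is the expensive part.
counter = 0
lock = threading.Lock()

def chip(increments):
    global counter
    for _ in range(increments):
        with lock:                                     # every atomic update pays the synchronization cost
            counter += 1

threads = [threading.Thread(target=chip, args=(100000,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                                         # 200000, but only thanks to the coordination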
 