LucidLogix Hydra, madness?

How on earth does any of that actually work? My only guess is that everything is done on the CPU and the hardware is just a fancy PCIe bridge chip. I can't believe it actually touches the command stream between the GPU driver and the GPU, since that would mean reverse engineering ATI's and NV's drivers and re-implementing them. However, the article claims the chip does some of the splitting and load balancing, which would mean data goes: their driver -> their chip -> their driver -> GPU driver -> their chip as bridge -> GPU.
...
Oh, and how about transparency? Or off-screen buffers used for shadow maps and reflections? Or post-processing?

Perhaps their driver and chip keep track of resource allocation and the command stream, and cooperate to construct a dependency graph in a fashion similar to what the Larrabee paper said was done to bin the work.

Chunks of rendering that don't share resources within the same frame can be routed to separate cards.
Buffers built on one card that persist between frames could be handled by the driver inserting an export from one card and a load into the next.

It seems that such a system could hit a snag if the dependency graph doesn't split all that well, though.
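
Just to make the hand-waving concrete, here is the kind of bookkeeping I have in mind, as a Python sketch. All the names and the round-robin placement are invented for illustration; this is not a claim about what Lucid's chip or driver actually does.

```python
# Hypothetical sketch: build a dependency graph between "render chunks"
# (groups of draw calls) based on which render targets / textures they
# read and write, then assign chunks to GPUs, noting cross-GPU copies.
from collections import defaultdict

class Chunk:
    def __init__(self, name, reads, writes):
        self.name = name
        self.reads = set(reads)    # resources sampled (textures, buffers)
        self.writes = set(writes)  # render targets written

def build_dependencies(chunks):
    """Chunk B depends on chunk A if B reads or overwrites something A wrote earlier."""
    deps = defaultdict(set)
    for i, later in enumerate(chunks):
        for earlier in chunks[:i]:
            if later.reads & earlier.writes or later.writes & earlier.writes:
                deps[later.name].add(earlier.name)
    return deps

def assign_to_gpus(chunks, deps, num_gpus=2):
    """Naive round-robin placement; any dependency that crosses GPUs
    implies a buffer export from one card and a load into the other."""
    placement, transfers = {}, []
    for i, c in enumerate(chunks):
        gpu = i % num_gpus
        for d in deps[c.name]:
            if placement[d] != gpu:
                transfers.append((d, placement[d], gpu))  # copy needed
        placement[c.name] = gpu
    return placement, transfers

frame = [
    Chunk("shadow_map", reads=[], writes=["shadow_tex"]),
    Chunk("reflection", reads=[], writes=["refl_tex"]),
    Chunk("main_pass",  reads=["shadow_tex", "refl_tex"], writes=["hdr_buf"]),
    Chunk("post_fx",    reads=["hdr_buf"], writes=["backbuffer"]),
]
deps = build_dependencies(frame)
print(assign_to_gpus(frame, deps))
```

Even in this toy version you can see the snag: the main pass and post-processing depend on nearly everything, so a bad split turns into a stream of cross-card copies.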
 
:|||| @ http://www.extremetech.com/image_popup/0,,iid=215156&aID=231125&sID=25522,00.asp
And they claim near-linear scaling for THAT method?! It's possible, but only with really really smart algorithms and a few things that might be a tad too app-specific for their own good.

This brings up the next question: what the hell is Lucid's chip actually *doing*? And what Tensilica core does it use anyway?!
April 18th, 2006: http://www.tensilica.com/news_events/pr_2006_04_18.htm
April 19th, 2006: http://www.soccentral.com/results.asp?CatID=552&EntryID=18761

Furthermore, neither has *any* FP support. So I can't see how they could even understand the 3D object positions unless they transform that on the CPU. Assuming that's the case, I guess they could get a dependency graph for rendertargets and sort objects as they see fit, and even send some objects to both GPUs as Z-Only for the main pass etc...

Their claims aren't completely impossible, but they remain very difficult to swallow, to say the least... If they do pull it off though, I'm sure I won't be the only one to be very impressed!
 
Furthermore, neither has *any* FP support. So I can't see how they could even understand the 3D object positions unless they transform that on the CPU.
Sort last doesn't care about position ... the compositing step uses the Z-buffer. The black spots in the single-GPU frame on the screenshot aren't chosen; they just happen to be places where all the surfaces covering that area were drawn on the other GPU(s).
 
Anyone have any reasonable idea how the compositing engine might work?
Why wouldn't they just use the GPU? Rendering a single frame sized quad comparing Z-values inside the pixel shader is such a small amount of work in the grand scheme of things (communication overhead is a much scarier problem).
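
For what it's worth, the whole composite boils down to something like this (a toy Python sketch, purely to show the scale of the work; in practice it would be a couple of lines of pixel shader, and the buffer layout here is invented):

```python
# Illustration only: the per-pixel work a fullscreen composite pass would do,
# given the colour and depth buffers rendered by two GPUs (sort-last merge).
def composite(color_a, depth_a, color_b, depth_b):
    """Pick, per pixel, the colour from whichever GPU has the nearer Z."""
    assert len(color_a) == len(color_b) == len(depth_a) == len(depth_b)
    out = []
    for ca, za, cb, zb in zip(color_a, depth_a, color_b, depth_b):
        out.append(ca if za <= zb else cb)
    return out

# Tiny example: 4 "pixels", GPU A wins wherever its depth is smaller.
print(composite(["A"] * 4, [0.2, 0.9, 0.5, 1.0],
                ["B"] * 4, [0.3, 0.1, 0.5, 0.7]))
# -> ['A', 'B', 'A', 'B']
```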
 
Furthermore, neither has *any* FP support. So I can't see how they could even understand the 3D object positions unless they transform that on the CPU. Assuming that's the case, I guess they could get a dependency graph for rendertargets and sort objects as they see fit, and even send some objects to both GPUs as Z-Only for the main pass etc...

Their claims aren't completely impossible, but they remain very difficult to swallow, to say the least... If they do pull it off though, I'm sure I won't be the only one to be very impressed!

Another question: how can they possibly extract more information from the command stream than the existing 3D drivers? And assuming they can, what would prevent those two from doing just the same?

Related: could any of the 3D SW experts here give some insight into how much intra-frame and inter-frame dependency there really is between render buffers? Are secondary render targets typically small enough that they don't contribute too much to the overall rendering time?
 
:|||| @ http://www.extremetech.com/image_popup/0,,iid=215156&aID=231125&sID=25522,00.asp
And they claim near-linear scaling for THAT method?! It's possible, but only with really really smart algorithms and a few things that might be a tad too app-specific for their own good.
I could believe that they might have some really smart algorithms and some really cool ideas that can do multi-GPU better (or at least differently) than the simple SFR or AFR that both SLI and CF use right now. I'm sure there are a dozen distribution methods other than SFR and AFR with different performance and compatibility tradeoffs. AFR and SFR are by far the simplest, though, and seem to be "good enough".

If they were partnered with ATI or NV (or even S3) and had a joint announcement with ATI/NV/S3, I wouldn't doubt them for a minute. If they announced that they were targeting Larrabee I wouldn't really doubt them too much either. As it stands, however, I see absolutely no way that they can build a multi-GPU system that does low-level distribution of rendering and compositing when they don't have complete control over the system (driver through hardware). Obvious questions arise, like: how do you get data between the GPUs (without resorting to CPU readback and a write to the other GPU) when you can't interact with the hardware directly? It seems far too easy for the parts they don't control to do something unexpected that, in the best case, results in a performance cliff and, in the worst case, a deadlock.
 
Another question: how can they possibly extract more information from the command stream than the existing 3D drivers? And assuming they can, what would prevent those two from doing just the same?

They can't extract more from the command stream than a driver does, because the driver already gets everything. ATI and NVIDIA could therefore try to do equally smart things themselves. I remember NVIDIA had an idea to render some smaller, independent parts of the command stream on an IGP and send the result to the GPU that does the rest.

Related: could any of the 3D SW experts here give some insight into how much intra-frame and inter-frame dependency there really is between render buffers? Are secondary render targets typically small enough that they don't contribute too much to the overall rendering time?

It depends on the game; there is no general answer to this question.


Overall this looks to me like a somewhat improved version of the first custom CrossFire compositing chip plus a PCIe switch. We all know that ATI prefers AFR these days. As long as your CPU can feed the GPUs fast enough and you don't have frame-to-frame dependencies, AFR is hard to beat.
 
Why wouldn't they just use the GPU? Rendering a single frame sized quad comparing Z-values inside the pixel shader is such a small amount of work in the grand scheme of things (communication overhead is a much scarier problem).

I think it will be more efficient to do it in a simple compositing engine when there are more GPUs, in which case multiple Z values have to be compared for a single pixel. On a quick scan, the "compositing three-dimensional graphics images using associative decision mechanism" patent seems to be talking about accelerating such cases. This would take very little hardware anyway, so they can just read the multiple FBs and composite on the fly while transferring the result to the display. Much more efficient than transferring all the FBs to a single GPU and compositing there.
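
Purely as an illustration of how little the compositor has to do (this is my guess at the idea, not anything taken from the patent), the N-way case is just a nearest-Z selection over streams in scanout order:

```python
# Rough sketch: a fixed-function compositor consumes N (colour, depth) streams
# in scan order and emits one pixel at a time by keeping the nearest-Z sample;
# no full-frame staging is needed on the chip.
def stream_composite(streams):
    """streams: list of iterables yielding (color, depth) in scanout order."""
    for samples in zip(*streams):                   # one (color, depth) per GPU
        yield min(samples, key=lambda s: s[1])[0]   # nearest Z wins

gpu0 = [("r", 0.4), ("g", 0.9), ("b", 0.2)]
gpu1 = [("x", 0.5), ("y", 0.1), ("z", 0.3)]
gpu2 = [("p", 0.6), ("q", 0.8), ("s", 0.25)]
print(list(stream_composite([gpu0, gpu1, gpu2])))   # -> ['r', 'y', 'b']
```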

As soon as I saw the original link, I was reminded of HP's HPC visualization platform, where they use a cluster of PCs with GPUs together with a custom board called Sepia to do the compositing. Theirs is also a sort-last architecture.

Has anyone gone through the patent application, "Graphics processing and display system employing multiple graphics cores on a silicon chip of monolithic construction"? (Some images would help :/ ) They clearly mention they have three strategies for parallelization: Object Division, Image Division (=SFR) and Time Division (=AFR). The main difference from other approaches is that they keep profiling the GPU bottlenecks so they can decide which strategy to use for the next frame (or after the next sync), and hence can decide when Object Division is not a good fit. They seem to access the performance counters of the GPUs to estimate per-GPU bottlenecks.

They also mention that flush, swap and alpha blending are problematic and require all GPUs to be synchronized, especially in Object Division mode. When such a "blocking" event occurs, they composite the FB and it is sent to all the GPUs. Further, for cases such as alpha blending, all GPUs process all data from the application until alpha blending is disabled.

I guess when a render target is later re-used as a texture, it is composited and transferred to all GPUs. Since this kind of operation is common in games with multiple passes, I'm wondering why this is not a performance problem for them. Or perhaps this cost is outweighed by their more efficient use of the GPUs?

Can they really do these things below the GPU vendors' driver level? In my opinion, no. They would require too much GPU-specific information. I think all of this decision-making happens on the CPU, and they simply create different contexts for each GPU and distribute commands to the different GPUs.
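
If the patent is read that way, the CPU-side policy could be as mundane as something like this. Everything below is invented pseudo-logic (names, thresholds, event list), just to illustrate the "profile, pick a division strategy, fall back on blocking events" loop the application seems to describe:

```python
# Invented illustration of the per-frame policy: read back rough per-GPU
# bottleneck estimates, pick a division strategy for the next frame, and fall
# back to composite + broadcast when a "blocking" event forces a sync.
OBJECT_DIVISION, IMAGE_DIVISION, TIME_DIVISION = "object", "image(SFR)", "time(AFR)"

def pick_strategy(counters):
    """counters: per-GPU dicts with utilisation estimates in 0..1 (made up)."""
    geometry_bound = max(c["vertex_util"] for c in counters)
    fill_bound     = max(c["pixel_util"]  for c in counters)
    if geometry_bound > 0.9 and fill_bound > 0.9:
        return TIME_DIVISION       # everything saturated: alternate whole frames
    if fill_bound > geometry_bound:
        return IMAGE_DIVISION      # pixel-limited: split the screen
    return OBJECT_DIVISION         # geometry-limited: split the draw calls

def handle_event(event, gpus):
    """On a blocking event, composite the FBs and broadcast the result."""
    if event in ("flush", "swap", "alpha_blend_on", "rt_reused_as_texture"):
        merged = "composited_framebuffer"      # stand-in for the merge step
        for gpu in gpus:
            gpu["framebuffer"] = merged

counters = [{"vertex_util": 0.7, "pixel_util": 0.95},
            {"vertex_util": 0.6, "pixel_util": 0.9}]
print(pick_strategy(counters))   # -> image(SFR) with these made-up numbers
```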
 
Just to put some numbers to the scale of the problem ... a 1600x1200 4x MSAA HDR buffer takes enough bandwidth that they get capped at around 65 FPS with sort-last parallelization, without any dynamic textures at all and assuming they can use the full PCIe bandwidth. If they could somehow transfer the data in its losslessly compressed native form they would get some breathing room, but I don't quite see how they would manage that.
 
Let's do some math. Let's assume this custom chip has to merge two render targets. Since we're running a multi-chip system here, we should use something challenging: 1920x1200 with 4xAA and FP16, as we want some HDR too. That requires ~72 MB. To merge we also need the depth buffer, an additional ~36 MB. We need to send at least the final image back to one card, so we end up with overall traffic of roughly 180 MB per frame. 16 PCIe lanes can theoretically transfer up to 4 GB/s. That gives us 22 frames/s at best, and there would be no bandwidth left over to actually send any commands to the GPU.
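
A quick sanity check of both numbers in Python. One assumption on my part: the ~65 FPS figure in the post above corresponds to shipping only the 4xMSAA FP16 colour buffer, with both cases limited to ~4 GB/s of PCIe 1.x x16 bandwidth.

```python
# Sanity-checking the bandwidth figures discussed above.
def fps_cap(width, height, msaa, bytes_color, bytes_depth, send_back, bw=4e9):
    samples = width * height * msaa
    traffic = samples * (bytes_color + bytes_depth)   # one card's colour + depth
    if send_back:
        traffic += samples * bytes_color              # merged image returned to a card
    return bw / traffic, traffic / 1e6                # (frames/s, MB per frame)

# 1920x1200, 4xAA, FP16 colour + 32-bit depth, result sent back: ~180 MB, ~22 FPS
print(fps_cap(1920, 1200, 4, 8, 4, send_back=True))
# 1600x1200, 4xAA, FP16 colour only: ~61 MB, ~65 FPS
print(fps_cap(1600, 1200, 4, 8, 0, send_back=False))
```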

Edit: MFA had the same idea.
 
16x PCIe 2.0 is 8 GB/s and bidirectional; still dire, just a little less dire than assuming 4 GB/s unidirectional.
 
The switch in Hydra 100 is supposed to be PCIe 1.x.
Hah, fat chance it will scale better then ... it can't do sort last at high resolution + HDR and it is unlikely they will be able to do AFR better than NVIDIA/ATI.
 
Hah, fat chance it will scale better then ... it can't do sort first at high resolution + HDR and it is unlikely they will be able to do AFR better than NVIDIA/ATI.

They're not doing AFR. Did you read the preview from PCPerspective?
 
Did you read Krychek's post?

There is nothing in the fluff articles on the web or in their patents which hints at a solution for the fundamental problem which faces everything except for AFR ... communication bandwidth requirements higher than PCIe can supply most of the time (especially 1.0).

AFAICS the only method which could make do with PCIe bandwidth would be sort first with remote texturing for accessing dynamic textures ... but sort first pretty much requires integration with the rendering engine and GPUs are not designed to deal with the latency introduced by remote texturing (assuming a PCIe target can even directly access memory on a different PCIe target).
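
To make the "requires integration with the rendering engine" point concrete: sort first amounts to routing draw calls by their projected screen-space bounds, which is exactly the engine-level knowledge a bump-in-the-wire chip doesn't have. A toy sketch, with all names invented:

```python
# Toy illustration of sort-first routing: split the screen into vertical strips,
# one per GPU, and send each draw call to every GPU whose strip its projected
# bounding box touches. Draws that straddle a boundary go to both GPUs.
def route_draws(draws, screen_width, num_gpus):
    """draws: list of (name, x_min, x_max) screen-space bounds in pixels."""
    strip = screen_width / num_gpus
    routing = {gpu: [] for gpu in range(num_gpus)}
    for name, x_min, x_max in draws:
        first = max(0, int(x_min // strip))
        last  = min(num_gpus - 1, int(x_max // strip))
        for gpu in range(first, last + 1):
            routing[gpu].append(name)
    return routing

draws = [("terrain", 0, 1919), ("hero", 800, 1100), ("hud", 1700, 1919)]
print(route_draws(draws, 1920, 2))
# -> {0: ['terrain', 'hero'], 1: ['terrain', 'hero', 'hud']}
```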
 
Did you read Krychek's post?

There is nothing in the fluff articles on the web or in their patents which hints at a solution for the fundamental problem which faces everything except for AFR ... communication bandwidth requirements higher than PCIe can supply most of the time (especially 1.0).

So the demonstrations of different elements per frame being assigned to different GPUs were what, exactly?

Don't get me wrong, I'm highly skeptical of their claims as well. I'm not going to dismiss PCPerspective's piece off-handedly because of it, though.
 
When I said it can't do sort last at high resolution, I meant it would be forced to switch to AFR because of performance reasons.
 
When I said it can't do sort first at high resolution, I meant it would be forced to switch to AFR because of performance reasons.

Do you mean sort first or sort last? Object division will give sort last, since the FB has to be recombined and visibility resolved at each pixel.

The bandwidth requirement *is* too high, but I think ExtremeTech's comment that they observed it performing better than SLI counts for something, no? Maybe that demo didn't use AA :)
 