LucidLogix Hydra, madness?

Let's do some math. Let's assume this custom chip should merge two render targets together. As we're running a multi-chip system here, we should use something challenging: 1920x1200 with 4xAA and FP16, as we want some HDR too. That will require ~72 MB. To merge we might need the depth buffer as well, an additional ~36 MB. We also need to send back at least the final image to one card. That gives us an overall traffic of ~180 MB per frame. 16 PCIe lanes can theoretically transfer up to 4 GB/s, which works out to 22 frames/s at best, and that leaves no bandwidth to actually send any commands to the GPU.
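A quick sanity check of that math in Python (the per-sample byte sizes and the ~4 GB/s figure for a PCIe 1.1 x16 link are my assumptions, not Lucid's numbers):

```python
# Rough check of the pessimistic estimate above. Byte sizes and the 4 GB/s
# figure are assumptions for illustration, not LucidLogix numbers.
width, height, samples = 1920, 1200, 4

color = width * height * samples * 8   # FP16 RGBA, unresolved: ~74 MB
depth = width * height * samples * 4   # 32-bit depth, unresolved: ~37 MB

# one card's color + depth over the bus, plus the merged (still unresolved) image back
per_frame = color + depth + color      # ~184 MB per frame
pcie_bw = 4e9                          # ~4 GB/s for 16 PCIe 1.1 lanes, one direction

print(per_frame / 1e6, "MB per frame")     # ~184
print(pcie_bw / per_frame, "fps ceiling")  # ~22
```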

Edit: MFA had the same idea.

This makes a lot of assumptions. First, it assumes you'd need to send back all the MSAA samples (rather than leave the GPUs to resolve those to their buffers, and the Hydra chip to composite only the final MSAA-resolved pixels). Second, you're adding the "send back to one card" to your total, when it wouldn't count against the 4GB/sec traffic anyway. It's going the other direction, and PCIe is bi-directional.

Let's suppose each GPU is rendering 1920x1200 with 4xAA. Each one is going to get and process only the vertex buffers, textures, etc. that it needs to process its own parts of the frame - and the whole idea of the Hydra chip is that it's going to work a little software magic for real-time profiling to make sure the loads remain balanced. If an object/task/render target/cube map/whatever has an inter-frame dependency, it'll make sure those tasks remain on the same GPU to minimize cross-GPU traffic. If you need to render out a texture for use later in the frame rendering, just have the GPU that renders out that texture also do the later part. No need to composite it or transfer back to the other GPU.

There's no real reason each GPU can't take this process through to rendering final pixels, with the final pixels composited by the Hydra chip. And we all know the final frame buffer in almost all cases is 32-bit integer. That's, what, 9 MB for a 1920x1200 frame buffer (after AA resolve)?

I think there's plenty of headroom even in a PCIe 1.1 bus.
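For comparison, the optimistic case in the same back-of-the-envelope style (again assuming each GPU resolves its own MSAA and only resolved 32-bit pixels cross the bus):

```python
# Optimistic case: each GPU resolves AA locally and ships only final 32-bit pixels.
width, height = 1920, 1200

resolved = width * height * 4    # 32-bit color after AA resolve: ~9 MB
pcie_bw = 4e9                    # same ~4 GB/s PCIe 1.1 x16 assumption as before

# even if a whole resolved frame crossed the bus every frame, the ceiling is huge
print(resolved / 1e6, "MB per frame")     # ~9.2
print(pcie_bw / resolved, "fps ceiling")  # ~430
```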

The trick, the real "magic" that the Hydra technology supposedly performs, is being smart enough to have each GPU perform "half the work to render the frame" in such a way that it doesn't require constant cross-GPU communication. I wish I knew more about it, but the LucidLogix guys will only say so much right now. :)

We do know this much - neither the chip nor its driver running on the host machine knows or cares which graphics card you have plugged in. It's measuring stalls and frame completion rates and stuff to do its profiling.

Note that the Unreal Tournament images you see around the web (including on our site) are somewhat simplified representations of the GPU work split, according to the LucidLogix guys. There's actually a lot more going on, but they only show the stuff that makes a visual which is easy for people to understand.

They forbade anyone to take video, but I think that would have been very helpful for people like the crowd here to understand the tech. See the black parts in the UT scene? If you move the mouse even a little bit, those could change dramatically. As you move the view around, the screen flickers like mad as the surfaces being drawn by each GPU change from one frame to the next. If the sky is black (undrawn) in a screenshot you see, the slightest shift of view might change that in the next frame. You can definitely see the per-frame load balancing at work.

Just FYI- the LucidLogix guys said that in some games at really high resolutions, drawing pixels becomes the bottleneck and the chip basically just does standard split-frame rendering, because that evenly distributes the pixel drawing workload.

I know it seems "impossible", or alternately "so simple that the GPU vendors must have already thought of it and rejected it", but there really do seem to be just some clever software/hardware tricks going on. We'll see how it does when you try it with a really broad variety of games, of course, but from a hands-on and eyes-on demonstration it does indeed seem like the real deal.
 
This makes a lot of assumptions. First, it assumes you'd need to send back all the MSAA samples (rather than leave the GPUs to resolve those to their buffers, and the Hydra chip to composite only the final MSAA-resolved pixels).
You need the MSAA samples to composite the image. Otherwise, you don't know how much to "blend" a pixel with what's in the other buffer.
 
Perhaps they switch to SFR when rendering into multisampled render targets, since in that case it definitely becomes pixel heavy. AA definitely seems to pose problems for this.

If all GPUs allowed programmable sample positions they could do the AA themselves. Or maybe they could modify the vertex shader to displace the vertex positions differently on each GPU? With something like this they would pass all geometry that is rendered into AA targets to all the GPUs and convert MSAA into supersampling.
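Something like this, sketched in Python pseudocode rather than shader code (the offsets, names and two-GPU setup are purely hypothetical):

```python
# Hypothetical sketch of the per-GPU jitter idea: each GPU renders the same
# geometry shifted by a different sub-pixel offset, and compositing the results
# approximates supersampling. All names and numbers here are made up.
WIDTH, HEIGHT = 1920, 1200
GPU_JITTER = {0: (-0.25, -0.25), 1: (+0.25, +0.25)}  # sub-pixel offsets, in pixels

def jitter_clip_position(clip_pos, gpu_id):
    """Offset a clip-space position (x, y, z, w) by this GPU's jitter.
    The offset is scaled by w so that after the perspective divide it becomes
    a constant sub-pixel shift in NDC (2 NDC units span WIDTH/HEIGHT pixels)."""
    jx, jy = GPU_JITTER[gpu_id]
    x, y, z, w = clip_pos
    return (x + (2.0 * jx / WIDTH) * w,
            y + (2.0 * jy / HEIGHT) * w,
            z, w)

# e.g. the same vertex lands a quarter pixel apart on the two GPUs
print(jitter_clip_position((0.0, 0.0, 0.5, 1.0), 0))
print(jitter_clip_position((0.0, 0.0, 0.5, 1.0), 1))
```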

Anyone have hard facts on how much geometry is drawn into multisampled surfaces in modern games? What's the % of frame time spent in such passes?
 
This makes a lot of assumptions. First, it assumes you'd need to send back all the MSAA samples (rather than leave the GPUs to resolve those to their buffers, and the Hydra chip to composite only the final MSAA-resolved pixels).
Not that bad an assumption ... since otherwise any shared edge which the game engine renders in separate calls, and which gets distributed to different GPUs, will bleed in the background color; intersections won't be AA'd either.
Let's suppose each GPU is rendering 1920x1200 with 4xAA. Each one is going to get and process only the vertex buffers, textures, etc. that it needs to process its own parts of the frame
With sort-last rendering they can't have their own parts of the frame; you can only do that with sort-first ... the problem is that to sort first you have to transform first too. You can of course do all the transforms on all the GPUs and throw away the tris you don't need, but that will result in sublinear scaling.
If you need to render out a texture for use later in the frame rendering, just have the GPU that renders out that texture also do the later part. No need to composite it or transfer back to the other GPU.
Okay, let's suppose for a moment we are rendering a couple of shadow buffers and using them for rendering a frame. No need to composite or transfer anything ... just render everything with a single GPU, you say? I see a flaw in this plan.
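To make that dependency concrete, here is a tiny hypothetical frame graph (all pass names invented) showing why a shared shadow map forces either a cross-GPU copy or duplicated work once the main pass is split:

```python
# Hypothetical frame graph: one shadow pass feeding both halves of a split
# main pass. Every name here is invented for illustration.
passes = {
    "shadow_map": {"gpu": 0, "reads": []},
    "main_left":  {"gpu": 0, "reads": ["shadow_map"]},
    "main_right": {"gpu": 1, "reads": ["shadow_map"]},
}

for name, p in passes.items():
    for dep in p["reads"]:
        producer_gpu = passes[dep]["gpu"]
        if producer_gpu != p["gpu"]:
            # the awkward choice: copy the buffer over PCIe, or render it twice
            print(f"{name} on GPU {p['gpu']} needs {dep} from GPU {producer_gpu}")
# -> main_right on GPU 1 needs shadow_map from GPU 0
```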
 
Intercepting API-level instructions has been done before--just not with GPUs, to the best of my knowledge. For example, I'm pretty sure that Asus' DS3D GX and Creative's ALchemy software intercept DirectSound 3D calls in Vista and convert them to OpenAL to allow positional sound processing.

The scale and complexity is different, but I doubt that it's impossible to do something similar.
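For illustration, the general interception pattern looks something like this (a toy sketch; ALchemy and the Hydra driver are obviously far more involved, and every name below is made up):

```python
# Toy illustration of API-call interception: a wrapper sits between the
# application and the real API and can inspect or redirect each call.
# Purely illustrative; all names are made up.
class RealDevice:
    def draw(self, batch):
        print(f"real device drawing {batch}")

class InterceptingDevice:
    """Exposes the same interface as RealDevice, but decides per call which
    back-end actually executes it (stand-in for a Hydra-style driver shim)."""
    def __init__(self, backends):
        self.backends = backends
        self.next = 0

    def draw(self, batch):
        # trivial round-robin in place of real load profiling
        target = self.backends[self.next % len(self.backends)]
        self.next += 1
        target.draw(batch)

# the application only ever talks to the interceptor
device = InterceptingDevice([RealDevice(), RealDevice()])
device.draw("terrain")
device.draw("characters")
```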
 
Not that bad an assumption ... since otherwise any shared edge which the game engine renders in separate calls, and which gets distributed to different GPUs, will bleed in the background color; intersections won't be AA'd either.

Why not? This seems pretty damn easy to do, actually; you are still just resolving the pixels. It doesn't matter if the next pixel is resolved by you or not, as long as you have the correct data to resolve YOUR pixel.

With sort-last rendering they can't have their own parts of the frame; you can only do that with sort-first ... the problem is that to sort first you have to transform first too. You can of course do all the transforms on all the GPUs and throw away the tris you don't need, but that will result in sublinear scaling.

or you can just use a mask function and rely on early rejects.

Okay, let's suppose for a moment we are rendering a couple of shadow buffers and using them for rendering a frame. No need to composite or transfer anything ... just render everything with a single GPU, you say? I see a flaw in this plan.

Or render the objects that affect one mask set on one GPU and the ones that affect another mask set on the other GPU.

I'm making the assumption that the chip has enough functionality to know what is screen space and what isn't and deal with things correctly. Shouldn't be that hard.
 
Why not? This seems pretty damn easy to do, actually; you are still just resolving the pixels. It doesn't matter if the next pixel is resolved by you or not, as long as you have the correct data to resolve YOUR pixel.
There is no "YOUR pixel" in sort-last rendering; a single GPU cannot know which subpixel samples will be visible until after composition ... I'm not making it up that they are using sort-last, it's in their patents.
or you can just use a mask function and rely on early rejects.
There is nothing to early-reject with; the GPU doesn't get a hierarchical representation ... it gets a triangle soup. To compare a triangle to the mask function you have to know its screen-space extent, and to know that you have to transform it.
Or render the objects that affect one mask set on one GPU and the ones that affect another mask set on the other GPU.
I don't quite know what you mean by that.
I'm making the assumption that the chip has enough functionality to know what is screen space and what isn't and deal with things correctly. Shouldn't be that hard.
It's not hard, it's just a lot of work (for the CPU or the GPU, unless you are suggesting the Hydra chip will have its own vertex shaders).
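For what it's worth, per-pixel sort-last composition looks something like the sketch below: each GPU contributes color and depth, the compositor keeps the nearest sample, and with MSAA this has to happen per sample, which is exactly why resolved color alone isn't enough at shared edges. (Pure illustration, not taken from the Hydra patents.)

```python
# Minimal sort-last composition for one pixel: keep the nearest sample from
# whichever GPU drew it, then resolve. Illustration only, not from the patents.

def composite_samples(gpu_a, gpu_b):
    """gpu_a / gpu_b: lists of (depth, color) for one pixel's MSAA samples.
    Returns the composited, resolved color for that pixel."""
    merged = []
    for (da, ca), (db, cb) in zip(gpu_a, gpu_b):
        merged.append(ca if da <= db else cb)   # per-sample depth test
    # box-filter resolve happens after composition, not before
    return tuple(sum(ch) / len(merged) for ch in zip(*merged))

# Example: GPU A drew a foreground edge covering 2 of 4 samples, GPU B drew the
# background. Compositing already-resolved colors would lose this coverage info.
a = [(0.2, (1, 0, 0)), (0.2, (1, 0, 0)), (1.0, (0, 0, 0)), (1.0, (0, 0, 0))]
b = [(0.5, (0, 0, 1)), (0.5, (0, 0, 1)), (0.5, (0, 0, 1)), (0.5, (0, 0, 1))]
print(composite_samples(a, b))   # (0.5, 0.0, 0.5): a properly AA'd edge
```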
 
Would it be possible to place the Lucid chip on an X2/GX2 product? Can it do the same task as a PLX chip?

Yup. They're planning on it.

From The Tech Report article:

Lucid has identified a few places where its technology could likely be deployed at first. The most obvious, perhaps, is in place of a simple PCI Express switch chip on a dual-GPU video card like the Radeon HD 4870 X2. Lucid is already talking with board makers about the possibilities there.
 
I hope some ATI board partner does one with the Hydra chip and four RV870s. It'll be a toaster but I'll buy it anyway. :LOL:
 
I still can't wait to see how this one pans out in the long run. As someone previously posited:

It seems incredibly unlikely that a small(ish) startup could come along with something functionally this novel and neither red nor green could think of it first?

Anyway thanks for the response, I'll be keeping my eyes peeled for any more news. ttfn =)

edit: grammar owies.
 
Very interesting Device ID find, but he's reading way too much into the clockgen part...
 
It's one of the biggest hits slated for 2009. And I don't care so much about mixed-vendor capabilities, but more about the VRAM amount (get what you paid for) and having no "microstutter" anymore.

And to be honest: if those guys (you know who I mean...) think it's only some Bitboys malware crap, why the heck don't they just stop arguing?

anyway: time will tell.
 
Very interesting Device ID find, but he's reading way too much into the clockgen part...

'SYNOPSIS: You can clock each core seperately to achieve whatever your goal is (heat/power/high clockspeed on less cores/)'

This can already be done on Barcelona-based processors, but we knew this early last year. The clock gen in the southbridge also exists in the SB700 (or is it SB750?) and is used for the 'ACC' feature they use to improve overclocking. Don't know about Fusion, though.
 
Seems it wasn't mentioned here yet, but Intel has apparently decided to include the Hydra chip on the next revision of their X58 "Smackover" board.
 