LucidLogix Hydra, madness?

So all of your memory management processes now have to go over this bus - global memory atomic ops for example.
Coherency isn't really an issue (and not really necessary at that fine a granularity). The raw bandwidth required is the problem.
 
And like silent-guy said, LRB's architecture doesn't give it a free pass on inter-chip bandwidth.

I think Intel's own scaling measurements with different numbers of cores have left quite an imprint on people's minds. :)

Though I am not sure, I assume these were done at the simulator level and wrt a single LRB solution with this or that many cores, not multiple LRB add-on cards communicating via some kind of external interface.
 
Coherency isn't really an issue (and not really necessary at that fine a granularity). The raw bandwidth required is the problem.

I'm not sure what kind of bandwidth you mean here, but wouldn't memory bandwidth in theory automatically be twice as high with such a config using 2*256-bit buses as with 1 real 512-bit-wide bus?
 
Not memory bandwidth, interconnect bandwidth. Basically it takes extra pins and gates, which lie idle in single chip setups (which are the most common).
 
I'm not sure what kind of bandwidth you mean here, but wouldn't memory bandwidth in theory automatically be twice as high with such a config using 2*256-bit buses as with 1 real 512-bit-wide bus?

If you want a 2-chip uniform memory architecture wrt bandwidth (ignoring latency), where each chip has the same amount of bandwidth as a 1-chip solution, and if you assume that you can spread your accesses uniformly across the memory of both chips, then you'll have a 50% IO pin overhead compared to a single-chip solution.

In single-chip mode, you have a 2x256-bit bus. In dual-chip mode, you have one 256-bit bus going to local memory and one 256-bit bus going from chip A to chip B, but chip B also needs concurrent access to chip A, so you need an additional 256-bit bus for that.

In any case, your total bandwidth per IO pin (memory pins + interconnect pins) has gone down by 50%: you still only have 512 pins to memory with the same total bandwidth, but you also have 512 pins between the two chips. If your application is bandwidth limited, you didn't win anything.
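
To put rough numbers on that pin accounting, here is a back-of-the-envelope sketch (purely illustrative; the bit-widths are the ones from the example above, and the only added assumption is that every IO pin, memory or interconnect, runs at the same per-pin data rate):

```python
# Back-of-the-envelope pin accounting for the example above.
# Assumption: every IO pin (memory or interconnect) runs at the same
# per-pin data rate, so "bandwidth per pin" is a fair comparison.

# Single-chip solution: 2x256-bit memory bus, no interconnect.
single_mem_pins = 2 * 256              # 512 pins to memory
single_bw       = single_mem_pins      # relative bandwidth units

# Dual-chip uniform solution: each chip keeps a 256-bit local memory bus,
# plus a 2x256-bit chip-to-chip link so both chips can stream from the
# other chip's memory concurrently.
dual_mem_pins  = 2 * 256               # still 512 pins to memory in total
dual_link_pins = 2 * 256               # 512 pins between the two chips
dual_bw        = dual_mem_pins         # total memory bandwidth is unchanged

pins_per_chip_single = single_mem_pins             # 512
pins_per_chip_dual   = 256 + dual_link_pins        # 768 -> the 50% pin overhead per chip

bw_per_pin_single = single_bw / single_mem_pins                 # 1.0
bw_per_pin_dual   = dual_bw / (dual_mem_pins + dual_link_pins)  # 0.5 -> halved

print(f"pins per chip: {pins_per_chip_single} -> {pins_per_chip_dual} "
      f"(+{100 * (pins_per_chip_dual / pins_per_chip_single - 1):.0f}%)")
print(f"bandwidth per IO pin: {bw_per_pin_single:.2f} -> {bw_per_pin_dual:.2f}")
```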

If you make the interconnect only 2x128-bit and 2x384 bits to memory, then you're not uniform anymore and you have to start implementing application-specific allocation strategies. In the case of a GPU, I assume you could do smart placement of textures and Z-buffers, duplicate some textures in both memories, make sure GPU-A renders one part of the screen and GPU-B renders the other...

Starts to sound a lot like SFR, doesn't it? ;)

No matter what, you'll very soon run into issues that prevent perfect scaling.
 
This might be a stupid question, but how exactly would a third-party compositing solution get access to intermediate buffers on a GPU without assistance from the graphics driver? Not to mention how it would send a composited multi-sampled buffer back to a card for AA resolve. Actually, how does any communication happen at all without AMD's and Nvidia's blessing?
 
If you want a 2-chip uniform memory architecture wrt bandwidth (ignoring latency), where each chip has the same amount of bandwidth as a 1-chip solution, and if you assume that you can spread your accesses uniformly across the memory of both chips, then you'll have a 50% IO pin overhead compared to a single-chip solution.

In single-chip mode, you have a 2x256-bit bus. In dual-chip mode, you have one 256-bit bus going to local memory and one 256-bit bus going from chip A to chip B, but chip B also needs concurrent access to chip A, so you need an additional 256-bit bus for that.

In any case, your total bandwidth per IO pin (memory pins + interconnect pins) has gone down by 50%: you still only have 512 pins to memory with the same total bandwidth, but you also have 512 pins between the two chips. If your application is bandwidth limited, you didn't win anything.

If you make the interconnect only 2x128-bit and 2x384 bits to memory, then you're not uniform anymore and you have to start implementing application-specific allocation strategies.

Damn...:(

In the case of a GPU, I assume you could do smart placement of textures and Z-buffers, duplicate some textures in both memories, make sure GPU-A renders one part of the screen and GPU-B renders the other...

Starts to sound a lot like SFR, doesn't it? ;)

SFR isn't necessarily a panacea if you could, IMHLO, manage to scale geometry as well as with AFR and manage to divide the workload between the GPUs as well as possible.
 
If you want a 2-chip uniform memory architecture wrt bandwidth (ignoring latency), where each chip has the same amount of bandwidth as a 1-chip solution, and if you assume that you can spread your accesses uniformly across the memory of both chips, then you'll have a 50% IO pin overhead compared to a single-chip solution.

In single-chip mode, you have a 2x256-bit bus. In dual-chip mode, you have one 256-bit bus going to local memory and one 256-bit bus going from chip A to chip B, but chip B also needs concurrent access to chip A, so you need an additional 256-bit bus for that.

In any case, your total bandwidth per IO pin (memory pins + interconnect pins) has gone down by 50%: you still only have 512 pins to memory with the same total bandwidth, but you also have 512 pins between the two chips. If your application is bandwidth limited, you didn't win anything.

If you make the interconnect only 2x128-bit and 2x384 bits to memory, then you're not uniform anymore and you have to start implementing application-specific allocation strategies. In the case of a GPU, I assume you could do smart placement of textures and Z-buffers, duplicate some textures in both memories, make sure GPU-A renders one part of the screen and GPU-B renders the other...

Starts to sound a lot like SFR, doesn't it? ;)

No matter what, you'll very soon run into issues that prevent perfect scaling.

Nice explanation. If the software model can guarantee that no two chips write to the same location, but reads are arbitrary, then perhaps you will not have these issues. Let's take the case of simple texturing here. In a 2-GPU system, only the pixels at the border will need to read from the other GPU, and since the ratio of border to bulk pixels goes down as the resolution increases, NUMA-style hacks may not be necessary. For general texturing, it's difficult to say.
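
As a rough illustration of that border-to-bulk argument, here is a small sketch (assumptions mine: a simple top/bottom split between two GPUs, and only pixels within some fixed number of rows of the split line, here 8, need texels from the other GPU's memory):

```python
# Fraction of a GPU's pixels that would need remote texture reads in a
# 2-GPU top/bottom split, assuming only pixels within 'border' rows of
# the split line touch the other GPU's memory.
def remote_fraction(width, height, border=8):
    pixels_per_gpu = width * (height // 2)   # each GPU renders half the frame
    border_pixels  = width * border          # rows adjacent to the split line
    return border_pixels / pixels_per_gpu

for w, h in [(1280, 720), (1920, 1080), (2560, 1600), (3840, 2160)]:
    print(f"{w}x{h}: {remote_fraction(w, h):.1%} of pixels near the border")
```

As expected under these assumptions, the remote fraction shrinks as the resolution grows, from about 2% at 1280x720 to well under 1% at 3840x2160.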
 
I don’t want to spoil the party but I am still pretty sure that Hydra will never work as promised.

It could work to some degree with specially modified drivers and support from the GPU manufacturers, but not as a generic solution. As I don't expect support from Nvidia or ATI, there is only Intel left.
 
I don’t want to spoil the party but I am still pretty sure that Hydra will never work as promised.

It could work to some degree with specially modified drivers and support from the GPU manufacturers, but not as a generic solution. As I don't expect support from Nvidia or ATI, there is only Intel left.

In other words, nothing; why would Intel think of multi-GPU with a concept like Larrabee anyway?
 
Lucid had a new demo with MSI's Big Bang board, and Ryan (PCPer) had this to say about Hydra 200:
Based on what I saw last year as compared to this year, I am confident that it will achieve competitive gaming performance. (Sorry we can't say more, they are really tying our hands here.)
 