Evolution of cGPUs

While the Larrabee thread is quiet, I thought it would be interesting to discuss the future scalability of cGPU-style architectures. My post assumes that Larrabee-style systems are going to be the way forward and that future graphics APIs will trend more towards open, programmable architectures.

Please excuse the number of questions!

Having seen that Larrabee scales linearly up to 32 cores and less so as you go to 64 cores, what is the long-term evolution of a Larrabee-style architecture going to look like? What is it about the current design that causes problems in scaling beyond 32 / 48 cores? Is it the speed of the ring bus, or the fact that only 16 cores sit on each ring? What would be the consequence of placing more cores on each ring bus?

Could it be possible that in 12 years there could be a 1000-core chip that still scales well? Or will it take another conceptual shift before it can get there? Is software or hardware going to be the long-term obstacle in the evolution process?

Please discuss...
 
Interesting topic.

As always, I think process limitations will be a factor in how such architectures scale. In my opinion we will see something much greater than 1000 cores in 12 years, but I also believe that by then the industry as a whole will have moved away from silicon and traditional concepts. Otherwise, I don't see how or what would stop such an architecture from scaling on the hardware side. I'm just worried about how quickly, or slowly for that matter, software will adapt to Larrabee (and other fully programmable architectures), and, as I have said already... process limitations.
 
I'm far from the most competent person to answer this question.
Anyway, I think that the scaling issue is related to Amdahl's law:
http://en.wikipedia.org/wiki/Amdahl's_law

If I understand properly, the serial part of the work always ends up limiting performance.
In this case it may be, for example, the communication overhead between so many cores (if I understand properly, whatever the cause, these extra latencies add to the serial part of the execution and thus limit the performance scaling gained through parallelization).
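
For reference, the textbook form of Amdahl's law (a standard result, not something specific to Larrabee): if a fraction $s$ of the work is serial and the remaining $1-s$ is spread perfectly over $N$ cores, the speedup is

\[ S(N) = \frac{1}{s + \frac{1-s}{N}} \le \frac{1}{s}, \]

so even a serial/communication fraction of only 5% caps the speedup at 20x, no matter how many cores you add.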

I think that GPUs will face the same issue (under some workloads), but so far they are still made of fewer "cores" than, say, Larrabee.

It's also a matter of workloads; workloads with close to zero dependencies will scale really well.

I'm also interested in this point, so if knowledgeable members can give us more explanation, they are welcome :)
 
What is it about the current design that causes problems in scaling beyond 32 / 48 cores? Is it the speed of the ring bus, or the fact that only 16 cores sit on each ring? What would be the consequence of placing more cores on each ring bus?

I don't think that it is the speed of the ring bus; more like the number of contenders for the same bus. And reduced scaling beyond 32/48 cores is somewhat missing the point: the paper said those "cores" were Larrabee units, not Larrabee cores themselves. Though the architecture could very well have limitations at 32/48 cores.
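
A rough back-of-the-envelope for why the number of contenders matters (my own sketch, not taken from the Larrabee paper): on a bidirectional ring with $N$ stops, a message travels about $N/4$ hops on average, so with a per-core message rate $r$ the aggregate link occupancy grows roughly as

\[ N \cdot r \cdot \frac{N}{4} = O(N^2), \]

while the number of ring links only grows as $O(N)$. Past some point each core's share of ring bandwidth shrinks, which is one argument for multiple shorter rings or a hierarchical network rather than simply making one ring longer.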

Could it be possible that in 12 years there could be a 1000-core chip that still scales well?

I won't bet on anything beyond 2015; we are at the gates of the dead end of the CMOS process. One can scale down to 22 nm; beyond that, it's dodgy. At 10 nm, the party is officially over. :cry:

If I understand properly, the serial part of the work always ends up limiting performance.

I don't think I get you here. :???: Serial parts have, by definition, no communication. And for embarrassingly parallel apps like 3D rendering, there is very little to no communication. As for Amdahl's law, look here. All in all, could you please be more clear?

I think that GPUs will face the same issue (under some workloads), but so far they are still made of fewer "cores" than, say, Larrabee.

I don't think so. I think one potential reason is that Larrabee has coherent caches: as you scale the number of cores, coherence traffic increases. GPUs avoid this communication overhead by having no coherent caches at all and by launching tens of thousands of threads to hide latency. ;)
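
To put a toy number on the coherence point (my own assumption about broadcast snooping, not anything published for Larrabee): if each of $N$ cores misses at rate $m$ and every miss has to be snooped by the other $N-1$ caches, the snoop traffic grows as

\[ N \cdot m \cdot (N-1) = O(N^2), \]

while useful work only grows as $O(N)$. Directory-based schemes trade the broadcast for point-to-point messages; today's GPUs simply sidestep the problem as described above.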

I think the biggest limitation will be how to program something to make use of 32 cores.

Look at CUDA; it's not so hard to program hundreds of cores. Having an easy-to-please app helps, though. :smile:
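
To illustrate the "easy to please" case, here is a minimal CUDA sketch (names and sizes are mine, purely illustrative): one SAXPY launch creates a million independent threads, and the hardware scheduler spreads them over however many cores the chip happens to have, with no source changes needed as core counts grow.

// Minimal CUDA sketch: a data-parallel SAXPY where every element is
// independent, so it scales with core count without code changes.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        y[i] = a * x[i] + y[i];                     // independent per-element work
}

int main()
{
    const int n = 1 << 20;                          // 1M elements, illustrative
    const size_t bytes = n * sizeof(float);
    float *hx = (float *)malloc(bytes);
    float *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc((void **)&dx, bytes);
    cudaMalloc((void **)&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover n; the hardware scheduler maps
    // the blocks onto whatever number of multiprocessors the chip provides.
    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %f\n", hy[0]);                   // expect 5.0
    cudaFree(dx); cudaFree(dy); free(hx); free(hy);
    return 0;
}

The hard part, as you say, is the apps that aren't this friendly: anything with real data dependencies between those threads.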
 
I think the biggest limitation will be how to program something to make use of 32 cores.
In a past life I wrote 3D and 2D rendering software that ran on 20+ transputers. Mind you, occam probably made that aspect a bit easier (but others - no structures!!! - harder).
 
Could AMD configure the shaders in their GPU to run x86 instructions without major architectural changes, or would the shader cores pretty much need a full re-design for that?
 
Interesting topic.

As always, I think process limitations will be a factor in how such architectures scale. In my opinion we will see something much greater than 1000 cores in 12 years, but I also believe that by then the industry as a whole will have moved away from silicon and traditional concepts. Otherwise, I don't see how or what would stop such an architecture from scaling on the hardware side. I'm just worried about how quickly, or slowly for that matter, software will adapt to Larrabee (and other fully programmable architectures), and, as I have said already... process limitations.

I should also have written that I didn't mean in terms of process tech. It's very well documented that things get sketchy after 16 nm, but my post assumes things will carry on, and I actually believe they will. I'm sure there will be all sorts of exotic materials that save the day (that's another topic).

I really wanted to focus this topic on hardware and software design wrt scaling. Will scaling be a problem? And what will end up being harder to scale in the end: hardware or software?

Software seems the obvious one, but as the software problems become more open, how well will that map onto flexible architectures?
 
Interconnects are never really scalable ... and Intel is only one step up from the simplest possible architecture (they will probably end up with a hierarchical switched network). Their coherency model is based on snooping, and that just plain won't scale (too much broadcast traffic).

They can change those in new generations without breaking backwards compatibility for applications, though (the OS layer will need minimal changes).
 
Could AMD configure the shaders in their GPU to run x86 instructions without major architectural changes, or would the shader cores pretty much need a full re-design for that?

Running x86 in hardware would probably require a revamp of the entire GPU for proper support.
The shaders, even if they were retooled (they'd need more instruction support and precise exceptions), don't encompass the full functionality of a fully-fledged core, which would be necessary for x86 execution.
I'd also be interested in seeing the die size and power numbers for ten 5-wide x86 decoders on an RV770-type design.
 
We figured out how to go from 1 or 2 shaders in a video card to 800 shaders in hardware, and how to program for it. While a bit more complex, I think software and programming techniques will scale with this new hardware like they did for GPUs. That only took a few short years. More high-level programming tools will be developed, making it easier and easier to take advantage of an increasing number of cores in both cGPUs and CPUs, just as happened with our current GPUs. Actually, better multi-core CPUs will probably be developed out of advances in this technology. I wouldn't be surprised if some of the techniques used in GPUs make it to multi-core CPUs.
 
Running x86 in hardware would probably require a revamp of the entire GPU for proper support.
The shaders, even if they were retooled (they'd need more instruction support and precise exceptions), don't encompass the full functionality of a fully-fledged core, which would be necessary for x86 execution.
I'd also be interested in seeing the die size and power numbers for ten 5-wide x86 decoders on an RV770-type design.

Heh, if AMD had the R&D staff to do this, they could fix up the K6 family CPUs by adding x64 and their upcoming advanced vector instruction set (SSE5, I think?). Then they could plop it down on the upcoming 40 nm GPU process. This is essentially what Intel did with the old Pentium architecture. The damn thing could compete blow for blow with Larrabee. Larrabee software should work on any similar x86 implementation, I hope.

My idea is probably just a pipe dream, but we'll see. Something like that would work well in a Fusion or Torrenza implementation.
 
Heh, if AMD had the R&D staff to do this, they could fix up the K6 family CPUs by adding x64 and their upcoming advanced vector instruction set (SSE5, I think?). Then they could plop it down on the upcoming 40 nm GPU process. This is essentially what Intel did with the old Pentium architecture. The damn thing could compete blow for blow with Larrabee. Larrabee software should work on any similar x86 implementation, I hope.

My idea is probably just a pipe dream, but we'll see. Something like that would work well in a Fusion or Torrenza implementation.

Well, the initial Fusion is already confirmed to have an RV8xx-based graphics core, so that's out of the question, and honestly I don't see the point of doing that when AMD has the gfx chips they have now. Sure, it could compete against Larrabee and all, but I doubt the implementation would be as easy as you make it sound, or even close to it. Not to mention that since the main focus of Larrabee is the gfx market dominated by Radeons and GeForces, AMD would be competing not only with Intel & nVidia, but with themselves as well.
 