22 nm Larrabee

I think what you mean is that they are not specifically targeting Llano.
Specifically as in inclusively, yes.

Intel is expected to support OpenCL on Ivy Bridge's IGP. For the sake of argument, let's assume performance will be horrendous. Then clearly the existence of OpenCL applications doesn't prove that GPGPU on an IGP is the future. Similarly there is no proof yet that Llano's architecture has any merit beyond mere graphics. In other words, just because Llano runs OpenCL, doesn't mean it's a convincing incentive for developers to invest more into OpenCL development.

I'd love to see a software renderer written purely in OpenCL (not using any fixed-function hardware), and compare that against SwiftShader. Then we'd be able to get a true picture of the value of IGPs for computing...
 
I'd love to see a software renderer written purely in OpenCL (not using any fixed-function hardware), and compare that against SwiftShader. Then we'd be able to get a true picture of the value of IGPs for computing...

Why would an OpenCL-based software renderer be a better benchmark than many already-available image-editing, video-editing, video-encoding and password decrypting applications?
 
In other words, just because Llano runs OpenCL, doesn't mean it's a convincing incentive for developers to invest more into OpenCL development.
Llano is a huge incentive for developers to invest more into OpenCL development as the installed base of machines having competent gpu's just went through the roof.

EDIT
Intel is expected to support OpenCL on Ivy Bridge's IGP. For the sake of argument, let's assume performance will be horrendous.
It's Intel's IGP. Of course it is going to suck.
EDIT
I'd love to see a software renderer written purely in OpenCL (not using any fixed-function hardware), and compare that against SwiftShader. Then we'd be able to get a true picture of the value of IGPs for computing...
I'd love to see a dx11 software renderer (not using any fixed-function hardware), and compare that against Llano. Then we'd be able to get a true picture of the value of pure software rendering.
 
Why would an OpenCL-based software renderer be a better benchmark than many already-available image-editing, video-editing, video-encoding and password decrypting applications?
Because many GPGPU applications and benchmarks claim extraordinary speedups by comparing the results of high-end GPUs against a plain C implementation on the CPU.
 
In regard to software renderer vs IGP, it would imho be more interesting to see an updated version of Unreal vs BF3 running on an IGP (Llano, Sandy Bridge or Ivy Bridge).
 
Llano is a huge incentive for developers to invest more into OpenCL development as the installed base of machines having competent gpu's just went through the roof.
Unless Intel just went out of business and every consumer decided to upgrade today, nothing is going through the roof any time soon.
 
Unless Intel just went out of business and every consumer decided to upgrade today, nothing is going through the roof any time soon.
If Intel's software graphics rendering power is as you claim, then why wouldn't their OpenCL computational power scale just as well? People can use an OpenCL CPU device just as easily as a GPU device.
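The host code barely changes either; a minimal sketch of picking the CPU device (naive platform/device selection, error handling mostly omitted):

```c
/* Minimal sketch: selecting a CPU OpenCL device instead of a GPU one.
   The rest of the host code (context, queue, kernels) stays identical. */
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id   device;
    char           name[256];

    clGetPlatformIDs(1, &platform, NULL);

    /* Swap CL_DEVICE_TYPE_CPU for CL_DEVICE_TYPE_GPU and nothing else changes. */
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL) != CL_SUCCESS) {
        printf("no CPU OpenCL device on this platform\n");
        return 1;
    }

    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("running on: %s\n", name);
    return 0;
}
```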
 
That's easy. Workloads depend on ILP, TLP or DLP for high performance, and increasingly a combination of these. GPUs still only offer good DLP, with TLP improving but still suffering from cache contention. CPUs are great for both ILP and TLP, and are catching up really fast in DLP.

I suspect you're giving CPUs a free pass here. Their ILP prowess comes at the cost of large caches, OOOE and speculative branching which are all very expensive. Are you suggesting that all those things will scale accordingly with higher arithmetic throughput? Not to mention the burden of x86 decoders.

CPUs are also subject to cache thrashing due to TLP and frequent context switching, especially on larger data sets. There is no magic that will enable CPUs to feed much wider execution resources efficiently and do so for free.

GPUs will rely less on DLP in the future but it's doubtful that x86+AVX will offer much competition in graphics workloads. IGPs may be doomed but anything more than that has the memory bandwidth and transistor/power budget to put CPUs to shame.

Which converges it toward the CPU...

Not really, it just makes it faster.
 
IGPs may be doomed but anything more than that has the memory bandwidth and transistor/power budget to put CPUs to shame.

SB class igp's are doomed. I don't see any reason why strong alternatives like Llano's projected successors are doomed as well.
 
I think you're using the more hardware-centric definition of a register file as necessarily not being based on SRAM, or at least much more expensive SRAM? If so that doesn't apply because (as far as I can tell) GPUs frequently use L1-like SRAM for their register file as they can tolerate the inherently higher latency.

Given the porting requirements it is unlikely that it is anywhere near as dense or as power-efficient as a cache RAM array.
 
SB class igp's are doomed. I don't see any reason why strong alternatives like Llano's projected successors are doomed as well.

Llano's projected successors will be going up against their discrete counterparts. The only reason igp's and apu's exist is that many people don't care about graphics performance. For everyone else they're pretty useless.

Also, what's going to happen when games aren't based on 6 yr old console hardware any more? All IGPs will then resume their place in the trash bin.
 
I suspect you're giving CPUs a free pass here. Their ILP prowess comes at the cost of large caches, OOOE and speculative branching which are all very expensive. Are you suggesting that all those things will scale accordingly with higher arithmetic throughput? Not to mention the burden of x86 decoders.
You have to think of a high throughput homogeneous CPU as the unification of a legacy CPU and an IGP. The compute density isn't necessarily much higher than that of a whole APU. But the high throughput AVX units benefit from having access to the same cache hierarchy and from out-of-order execution. You save a lot of communication overhead and certain structures don't have to be duplicated. And as I've detailed before, executing AVX-1024 on 256-bit execution units drastically reduces the power consumption of the CPU's front-end and schedulers, and hides latency by implicitly allowing access to four times more registers.

So there are no compromises to legacy scalar execution, and it also exploits DLP in practically the same way as a GPU!
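To make that concrete, here's a rough sketch of how a hypothetical 1024-bit FMA could be cracked into four 256-bit operations using today's intrinsics; the v1024 type and fma1024 name are just illustrative, not a real ISA:

```c
/* Rough sketch: a hypothetical 1024-bit FMA expressed as four 256-bit
   AVX/FMA operations, the way AVX-1024 could be cracked into uops on
   256-bit execution units. Compile with -mfma. */
#include <immintrin.h>

typedef struct { __m256 lane[4]; } v1024;   /* 4 x 8 floats = 32 floats */

static inline v1024 fma1024(v1024 a, v1024 b, v1024 c)
{
    v1024 r;
    for (int i = 0; i < 4; ++i)             /* four back-to-back 256-bit uops */
        r.lane[i] = _mm256_fmadd_ps(a.lane[i], b.lane[i], c.lane[i]);
    return r;                               /* one decode, four execution cycles */
}
```

The point is that one instruction's worth of fetch, decode and scheduling work covers four execution cycles, which is exactly where the front-end power savings would come from.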

Besides, there is no viable alternative. You said you agree they will converge but wonder whether CPUs or GPUs are more representative (i.e. closer to the result of the convergence)? GPUs have a very long way to go to offer acceptable sequential performance. Some form of out-of-order execution, and a comprehensive cache hierarchy are an absolute must to be able to compete with CPUs. For CPUs to compete with GPUs the only thing lacking is AVX-1024...
CPUs are also subject to cache thrashing due to TLP and frequent context switching, especially on larger data sets.
Hardware thread switches are rare when you use software fiber scheduling. But even full context switches can be accelerated if that ever proves to be useful.
There is no magic that will enable CPUs to feed much wider execution resources efficiently and do so for free.
It would be no more magical nor free than on a GPU. I don't see any noteworthy obstacles in increasing the CPU's DLP.
GPUs will rely less on DLP in the future but it's doubtful that x86+AVX will offer much competition in graphics workloads. IGPs may be doomed but anything more than that has the memory bandwidth and transistor/power budget to put CPUs to shame.
Anything more as in anything larger? Once IGPs have been replaced by software rendering nothing is holding Intel from selling CPUs with more cores and more bandwidth. If they can increase their revenue by keeping people from buying low-end and mid-end discrete graphics cards, they won't let that opportunity slip.

Actually it's a simple question of growing the IGP or growing the CPU cores to threaten the mid-end discrete GPU market. Given that AVX2 brings us everything to drastically speed up software rendering and other high throughput applications, and it's readily extendable to 1024-bit registers, Intel seems focused on increasing CPU DLP. They only have to keep an adequate IGP around for long enough to make the transition.
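Gather is the obvious example of what AVX2 adds here; a sketch of fetching eight texels at per-pixel offsets with a single instruction (compile with -mavx2, names purely illustrative):

```c
/* Sketch: AVX2 gather fetching eight texels at per-pixel offsets in one
   instruction, where a scalar software renderer needs eight separate loads.
   'texels' and 'offsets' are illustrative names. */
#include <immintrin.h>

static inline __m256 fetch8(const float *texels, __m256i offsets)
{
    return _mm256_i32gather_ps(texels, offsets, 4);   /* scale = sizeof(float) */
}
```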

Software rendering is not limited by the API so once developers start using the CPU more directly it would even compete with high-end discrete cards. It will take many years, but the convergence isn't stopping so this is bound to happen. Perhaps by the end of this decade buying a discrete graphics card may seem as silly as buying a discrete sound card. They'll still exist but for the majority of consumers won't offer any worthwhile benefit.
Not really, it just makes it faster.
Increasing the clock frequency doesn't come for free. You need substantial changes to the register set, caches, instruction scheduling, etc. to sustain the higher throughput while keeping relative latencies the same. No matter what you do, increasing the clock frequency converges the GPU closer to the CPU microarchitecturally as well.
 
SB class igp's are doomed. I don't see any reason why strong alternatives like Llano's projected successors are doomed as well.
Butchering CPU performance for 50% higher graphics performance is hardly a success formula. Intel's IGPs might be "doomed", but that has never stopped them before. Why would it be of significance now?
 
Given the porting requirements it is unlikely that it is anywhere near as dense or as power-efficient as a cache RAM array.
Due to the split nature of the register files (each slice serves just a single vector lane or only a few of them), and because the units execute the same instruction for the same nominal register (just a different lane of the logical vector each clock) over 2 to 4 clocks, the register files do not need more ports than your typical L1. You can get away with a single read and a single write port.
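A toy software model of that access pattern, purely to illustrate it (the sizes and names are made up):

```c
/* Toy model of the scheme described above: a wide logical register split
   into four single-ported banks, each serving one lane group, with a full
   read spread over four consecutive clocks. */
#include <stdint.h>
#include <stdio.h>

#define BANKS 4                 /* lane groups = clocks per logical access */
#define WORDS 4                 /* 4 x 32 bit = 128 bits per bank */

typedef struct { uint32_t bank[BANKS][WORDS]; } logical_reg;   /* 512 bits */

/* One "clock": only the bank whose turn it is gets read, so a single
   read port per bank suffices. */
static void read_slice(const logical_reg *r, int clock, uint32_t out[WORDS])
{
    for (int w = 0; w < WORDS; ++w)
        out[w] = r->bank[clock % BANKS][w];
}

int main(void)
{
    logical_reg r = {{{0}}};
    uint32_t slice[WORDS];
    for (int clk = 0; clk < BANKS; ++clk) {     /* full register = 4 clocks */
        read_slice(&r, clk, slice);
        printf("clock %d: read 128-bit slice from bank %d\n", clk, clk % BANKS);
    }
    return 0;
}
```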
 
Butchering CPU performance for 50% higher graphics performance is hardly a success formula. Intel's IGPs might be "doomed", but that has never stopped them before. Why would it be of significance now?


Those charts are a good representation of cumulative sales over the past 15 years.
Regarding sales for the past 2-3 years (which is what matters the most for OEMs and computing-demanding software developers), they're a bit useless, as the top 5 GPUs aren't even in the market anymore.

I think you're downplaying graphics a bit too much, as if we were in ~2005.

Even if the end customer is ignorant of that fact in 99% of cases, OEMs know that a better GPU drastically enhances gaming, video and web-browsing performance. That's why Brazos was sold out in Q1 2011, and has taken quite a chunk out of Atom shipments.

So if OEMs value better performing iGPUs and prefer the option to bundle AMD APUs, more PCs with AMD APUs will be on the market, more people will buy AMD APUs, and more developers will put a nice, big and shiny stamp in their latest software claiming it takes full advantage of the iGPU in people's newly-bought PCs.
 
Llano has a noticeable gap in its integration today. That needn't be the case with its successors.

Besides, if IB's DRAM stacking takes off, it might reduce the bandwidth advantage of discretes to a dead heat once you factor in the benefits of integration.
 
Sorry to derail the conversation, but I have some questions about Larrabee / Knights Corner.
I read news about Intel managing to get CMOS at 32 nm; they expect this technology to now scale along with their process progress (they were stuck at 65 nm until then).
Are the ring buses made using this technique, or are they done some other way?

About Larrabee / Knights Corner: basically, can we expect changes from the original Larrabee, texture unit removal aside?
I have the feeling that Knights Corner is clearly a "filler product", something Intel pushes out to somewhat compete with GPGPU and make some money out of their investment.

As Nick is saying, Intel is putting its strength into the AVX2 instruction set (and a proper implementation of it). It doesn't make much sense to launch something next year that uses a completely different instruction set (hence my feeling about Knights Corner being a filler product).
We know really little about Haswell, but I don't believe it's the architecture that will allow Intel to do it all. It may allow software rendering with acceptable results for casual gamers, and do marvels for physics, AI, etc. for the others, but that's it. GPUs (and GPGPUs) will still be a compelling target for the workloads that map well to their architectures. 500 GFLOPS won't cut it against modern GPUs.

Intel needs a more throughput-oriented design if they want to stop GPUs from biting into their market share. It may also help them reach (or definitively secure) other markets. Honestly I don't know much, but after reading some material about the UltraSPARC CPU line and the upcoming IBM PowerPC A2, it looks to me like the way Larrabee was designed is no longer adapted to the goals Intel may be pursuing now. Maybe it's nothing, but I noticed that in all those designs the cores can access a "shared L2" (as I understand it, versus Larrabee's local subset of the L2: they can read and write anywhere in the L2 cache, whereas a Larrabee core can only read and write its own local subset of the L2 and merely read from the others). Could this be a wanted feature for the kind of work Larrabee's successors (after Knights Corner) might be intended for? (Or could/should Intel scale back the number of cores and include an L3?)
There is also the focus on power consumption: 16-wide SIMD may not be workable within the design, as it supposedly consumes a lot and sets terrible constraints on the memory system. A move to AVX2 as Nick is proposing sounds like a win to me; actually I wonder whether it would be worth it to get them to push 4 FLOPS per cycle (Haswell is supposed to do 2 FMACs per cycle, so twice 2 FLOPS, right? I'm not sure I got this right while reading).
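If I work it out, assuming 256-bit FMA units and, say, 4 cores around 3.5 GHz (both guesses on my part): 2 FMA ports x 8 floats x 2 flops per FMA = 32 flops per clock per core, so 4 x 32 x 3.5 GHz is roughly 450 GFLOPS, which would be about where the 500 GFLOPS figure above comes from.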
I also notice that in the PowerPC A2 the designers have put a huge focus on chip-to-chip communication, something that seems absent from Larrabee/Knights Corner, and that looks like a big omission. They need something that scales well.

So what are your POVs on the matter? Do you believe Intel, after experimenting with Larrabee, with Itanium's crumbling support and a possible threat to their CPU dominance, could launch proper throughput cores? What could they look like?
It could be a win for Intel, as Haswell might be awesome, but I don't believe the silicon budget will allow a proper do-it-all architecture (if it is to happen); they could have their way with heterogeneous designs: different cores but using the same ISA(s).
 