> And CPUs run OpenCL too.
CPUs run OpenCL only when specifically asked to do so.
The fGPU will transparently run code written for a dGPU.
Specifically as in inclusively, yes. I think what you mean is that they are not specifically targeting Llano.
I'd love to see a software renderer written purely in OpenCL (not using any fixed-function hardware), and compare that against SwiftShader. Then we'd be able to get a true picture of the value of IGPs for computing...
> In other words, just because Llano runs OpenCL doesn't mean it's a convincing incentive for developers to invest more into OpenCL development.
Llano is a huge incentive for developers to invest more into OpenCL development, as the installed base of machines with competent GPUs just went through the roof.
> Intel is expected to support OpenCL on Ivy Bridge's IGP. For the sake of argument, let's assume performance will be horrendous.
It's Intel's IGP. Of course it is going to suck.
> I'd love to see a software renderer written purely in OpenCL (not using any fixed-function hardware), and compare that against SwiftShader. Then we'd be able to get a true picture of the value of IGPs for computing...
I'd love to see a DX11 software renderer (not using any fixed-function hardware), and compare that against Llano. Then we'd be able to get a true picture of the value of pure software rendering.
> Why would an OpenCL-based software renderer be a better benchmark than many already-available image-editing, video-editing, video-encoding and password-decrypting applications?
Because many GPGPU applications and benchmarks claim extraordinary speedups by comparing the results of high-end GPUs against a plain C implementation on the CPU.
> Llano is a huge incentive for developers to invest more into OpenCL development, as the installed base of machines with competent GPUs just went through the roof.
Unless Intel just went out of business and every consumer decided to upgrade today, nothing is going through the roof any time soon.
> Unless Intel just went out of business and every consumer decided to upgrade today, nothing is going through the roof any time soon.
If Intel's software graphics rendering power is as you claim, then why wouldn't their OpenCL computational power scale just as well? People can use an OpenCL CPU device just as easily as a GPU device.
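A minimal host-code sketch of that point (illustrative only, assuming a single OpenCL platform; error handling mostly omitted): the entire difference between targeting the CPU and targeting the GPU is the device-type flag passed to clGetDeviceIDs.

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    char name[256];

    clGetPlatformIDs(1, &platform, NULL);

    /* Ask for a CPU device first; fall back to a GPU device.
       This flag is the only difference between the two targets. */
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL) != CL_SUCCESS)
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("OpenCL device: %s\n", name);
    return 0;
}
```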
That's easy. Workloads depend on ILP, TLP or DLP (instruction-, thread- and data-level parallelism) for high performance, and increasingly on a combination of these. GPUs still only offer good DLP, with TLP improving but still suffering from cache contention. CPUs are great at both ILP and TLP, and are catching up really fast in DLP.
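As a toy illustration of combining those forms of parallelism in one kernel (a hypothetical saxpy example, not from the thread), using OpenMP:

```c
#include <stddef.h>

/* saxpy: one loop exploiting TLP and DLP at once.
   "parallel for" spreads iterations across threads (TLP);
   "simd" vectorizes each thread's chunk across SIMD lanes (DLP);
   the out-of-order core extracts ILP from independent iterations. */
void saxpy(float *y, const float *x, float a, size_t n) {
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```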
Which converges it toward the CPU...
GPUs will rely less on DLP in the future, but it's doubtful that x86+AVX will offer much competition in graphics workloads. IGPs may be doomed, but anything more than that has the memory bandwidth and transistor/power budget to put CPUs to shame.
I think you're using the more hardware-centric definition of a register file as something that is necessarily not based on SRAM, or at least based on much more expensive SRAM? If so, that doesn't apply here, because (as far as I can tell) GPUs frequently use L1-like SRAM for their register files, as they can tolerate the inherently higher latency.
SB-class IGPs are doomed. I don't see any reason why strong alternatives like Llano's projected successors would be doomed as well.
> Llano's projected successors will be going up against their discrete counterparts.
And they will have significant advantages in latency, power and cost.
> I suspect you're giving CPUs a free pass here. Their ILP prowess comes at the cost of large caches, OOOE and speculative branching, which are all very expensive. Are you suggesting that all those things will scale accordingly with higher arithmetic throughput? Not to mention the burden of x86 decoders.
You have to think of a high-throughput homogeneous CPU as the unification of a legacy CPU and an IGP. The compute density isn't necessarily much higher than that of a whole APU. But the high-throughput AVX units benefit from having access to the same cache hierarchy and from out-of-order execution. You save a lot of communication overhead, and certain structures don't have to be duplicated. And as I've detailed before, executing AVX-1024 on 256-bit execution units drastically reduces the power consumption of the CPU's front-end and schedulers, and hides latency by implicitly allowing access to four times more registers.
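A rough software analogy of that execution model (our sketch; in real hardware the sequencing would happen inside the execution units): a logical 1024-bit operation is issued once but executed as four 256-bit steps, so front-end and scheduler work is amortized over four cycles.

```c
#include <immintrin.h>

/* Software analogy for AVX-1024 executed on 256-bit units: a logical
   1024-bit register is four 256-bit chunks, and one logical operation
   is sequenced over four steps. In hardware this sequencing would keep
   the front-end and schedulers idle for three out of every four
   cycles, while the wider logical registers help hide latency. */
typedef struct { __m256 chunk[4]; } v1024;

static inline v1024 v1024_add_ps(v1024 a, v1024 b) {
    v1024 r;
    for (int i = 0; i < 4; i++)   /* one 256-bit step per "cycle" */
        r.chunk[i] = _mm256_add_ps(a.chunk[i], b.chunk[i]);
    return r;
}
```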
> CPUs are also subject to cache thrashing due to TLP and frequent context switching, especially on larger data sets.
Hardware thread switches are rare when you use software fiber scheduling. But even full context switches can be accelerated, if that ever proves to be useful.
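For reference, a minimal user-space fiber switch (a sketch using the POSIX ucontext API; the function names are ours): swapcontext saves and restores register state without entering the kernel, which is why fiber switches are far cheaper than hardware thread or process context switches.

```c
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, fiber_ctx;

static void fiber_body(void) {
    printf("fiber: doing some work\n");
    swapcontext(&fiber_ctx, &main_ctx);   /* cooperative yield, no syscall-level context switch */
}

int main(void) {
    static char stack[64 * 1024];

    /* Set up the fiber with its own stack. */
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = stack;
    fiber_ctx.uc_stack.ss_size = sizeof(stack);
    fiber_ctx.uc_link = &main_ctx;
    makecontext(&fiber_ctx, fiber_body, 0);

    swapcontext(&main_ctx, &fiber_ctx);   /* run the fiber */
    printf("main: fiber yielded\n");
    return 0;
}
```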
> There is no magic that will enable CPUs to feed much wider execution resources efficiently and do so for free.
It would be no more magical nor free than on a GPU. I don't see any noteworthy obstacles to increasing the CPU's DLP.
> GPUs will rely less on DLP in the future, but it's doubtful that x86+AVX will offer much competition in graphics workloads. IGPs may be doomed, but anything more than that has the memory bandwidth and transistor/power budget to put CPUs to shame.
Anything more as in anything larger? Once IGPs have been replaced by software rendering, nothing is stopping Intel from selling CPUs with more cores and more bandwidth. If they can increase their revenue by keeping people from buying low-end and mid-range discrete graphics cards, they won't let that opportunity slip.
> Not really, it just makes it faster.
Increasing the clock frequency doesn't come for free. You need substantial changes to the register set, caches, instruction scheduling, etc. to sustain the higher throughput while keeping relative latencies the same. No matter what you do, increasing the clock frequency converges the GPU closer to the CPU microarchitecturally as well.
> SB-class IGPs are doomed. I don't see any reason why strong alternatives like Llano's projected successors would be doomed as well.
Butchering CPU performance for 50% higher graphics performance is hardly a success formula. Intel's IGPs might be "doomed", but that has never stopped them before. Why would it be of significance now?
> Given the porting requirements it is unlikely that it is anywhere near as dense or as power-efficient as a cache RAM array.
Due to the split nature of the register files (each slice serves just a single vector lane, or only a few) and the fact that the units execute the same instruction for the same nominal register over 2 to 4 clocks (just a different lane of the logical vector each clock), the register files need no more ports than your typical L1. You can get away with a single read port and a single write port.
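A toy model of that banking scheme (our sketch; the bank and register counts are made up for illustration): each single-ported bank delivers one slice of the logical register per clock, so a full-width operand read simply streams over several clocks.

```c
#include <stdio.h>

/* Toy model: a 32-lane logical vector register is sliced across
   4 banks of 8 lanes each. Each bank has a single read port and
   services one 8-lane slice per clock, so a full operand streams
   out over 4 clocks -- matching the cadence at which the execution
   units consume one lane group per clock. */
#define NUM_BANKS 4
#define LANES_PER_BANK 8
#define NUM_REGS 16

static float regfile[NUM_BANKS][NUM_REGS][LANES_PER_BANK];

int main(void) {
    /* Reading logical register r5: one slice per clock, and never
       more than one read port in use per bank. */
    for (int clk = 0; clk < NUM_BANKS; clk++) {
        const float *slice = regfile[clk][5];   /* bank clk, register 5 */
        printf("clock %d: bank %d delivers lanes %d..%d (lane0 = %g)\n",
               clk, clk, clk * LANES_PER_BANK,
               clk * LANES_PER_BANK + LANES_PER_BANK - 1, slice[0]);
    }
    return 0;
}
```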