22 nm Larrabee

Let's first get something straight. There are three main discussions going on right now:

1) GPGPU versus high-throughput CPU
2) Dedicated versus software graphics
3) Fixed-function versus programmable video
It really is about homogeneous vs heterogeneous compute, where fixed-function hardware is just one type of "core".
 
The handheld market is moving towards a 'reverse Turbo Boost' mechanism rather similar to what AMD implements on Cayman: you have a maximum frequency and the chip monitors its total power consumption and temperature at a variety of likely hotspots. It automatically reduces the frequency and voltage of different blocks as required to fit within the power and thermal budget specified by the OEM.

As the number of cores increases further on CPUs and both peak power and hotspots become a problem even at the default frequency, Intel will be forced to move to something closer to this even on the desktop. I suspect the CPU will have to be clocked lower when all 8 cores are doing full-throttle AVX2 work.
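As a rough illustration, a governor like the one described above boils down to a simple control loop. The sketch below is a toy version with made-up budget numbers and stub sensor reads, not any vendor's actual interface:

Code:
#include <algorithm>
#include <cstdio>
#include <vector>

// Toy 'reverse Turbo Boost' governor: every control interval, check total
// power and the hottest monitored spot, and nudge per-block frequencies
// down (or back up toward their maximum) to stay inside the OEM budget.
// The sensor reads below are stand-in stubs, not a real driver API.

struct Block { const char* name; double freq_mhz; double min_mhz; double max_mhz; };

constexpr double kPowerBudgetW  = 2.0;   // example OEM power budget
constexpr double kHotspotLimitC = 85.0;  // example thermal limit
constexpr double kStepMhz       = 50.0;  // adjustment per control interval

double read_total_power_watts()    { return 2.3; }  // stub: would read a power sensor
double read_hottest_spot_celsius() { return 80.0; } // stub: max over the hotspot sensors

void governor_tick(std::vector<Block>& blocks) {
    bool over_budget = read_total_power_watts() > kPowerBudgetW ||
                       read_hottest_spot_celsius() > kHotspotLimitC;
    for (Block& b : blocks) {
        double target = over_budget ? b.freq_mhz - kStepMhz   // back off to fit the budget
                                    : b.freq_mhz + kStepMhz;  // creep back toward maximum
        b.freq_mhz = std::clamp(target, b.min_mhz, b.max_mhz);
        // A real implementation would drop voltage along with frequency.
        std::printf("%s -> %.0f MHz\n", b.name, b.freq_mhz);
    }
}

int main() {
    std::vector<Block> blocks = {{"cpu", 1500, 600, 1500}, {"gpu", 400, 200, 400}};
    governor_tick(blocks);
}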

We're already seeing that underclocking on quad-core i7 laptops, and on the desktop AMD A8-3800 and A6-3600.
It's a nice throwback to old times - the turbo button's only purpose was to underclock your PC.
 
This isn't shocking, new, or unforeseen.
I guess it was indeed news for some contributors to this thread, unless they simply decided to ignore it. Sometimes it's easy to underestimate or even forget facts that don't play nice with a given/preconceived vision.
 
The communication overhead you keep speaking about is a myth. In a well designed heterogeneous system, the cost of communicating across cores is independent of the nature of cores itself.
Except that with a homogeneous architecture, you minimize having to communicate across cores!
 
That paper lacks an evaluation of executing wide vectors on less wide execution units (i.e. AVX-1024 on x86). So you can throw its conclusions in the trash bin.
Again, it only considers "a hypothetical heterogeneous processor [consisting] of a small number of large cores for single-thread performance and many small cores for throughput performance." It does not consider achieving high throughput out of unified cores.

Both these papers utterly fail to recognize that workloads do not consist of separate sequential and data parallel tasks. It's a smoothly varying gradation. With a heterogeneous architecture, you'd have to choose between two (or even more) types of core, neither of which would be ideal, and with a substantial overhead for transitioning between the two.
 
So, after a single core is done with it, just pass the pointer to the GPU. :cool:
No, after the serial task is done, process the parallel task with AVX and save having to transfer data and control to a different core.

And like I just said, there isn't really any transition between serial and parallel workloads. It can be tightly interwoven. Note that GCN will have a scalar ALU as well, to avoid the common case of repeating the exact same calculation for each work item. GCN will still choke on heavy scalar workloads though, but at least GPUs are taking the first baby steps toward a homogeneous architecture capable of handling any workload. CPUs merely have to implement AVX2 and to further scale performance/Watt there's AVX-1024.
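As a rough illustration of that interleaving on the CPU side (the function and the math below are made up for the example), the scalar setup that every work item would otherwise repeat runs once, and the 8-wide AVX loop follows immediately on the same in-cache buffer, with no hand-off to a different core or device:

Code:
#include <immintrin.h>
#include <cstddef>
#include <vector>

void scale_and_bias(std::vector<float>& data, float seed) {
    // "Serial" part: scalar math that every work item would otherwise repeat
    // (the kind of thing GCN's scalar ALU is meant to factor out).
    float scale = seed * 1.5f + 0.25f;
    float bias  = scale * 0.125f;

    __m256 vscale = _mm256_set1_ps(scale);
    __m256 vbias  = _mm256_set1_ps(bias);

    // "Parallel" part: 8-wide AVX loop over the same buffer, still warm in cache.
    std::size_t i = 0;
    for (; i + 8 <= data.size(); i += 8) {
        __m256 v = _mm256_loadu_ps(&data[i]);
        v = _mm256_add_ps(_mm256_mul_ps(v, vscale), vbias);
        _mm256_storeu_ps(&data[i], v);
    }
    for (; i < data.size(); ++i)  // scalar tail for the leftover elements
        data[i] = data[i] * scale + bias;
}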
 
No, after the serial task is done, process the parallel task with AVX and save having to transfer data and control to a different core.
The data is in L3. There is no transfer of data involved here.
 
Nick said:
It's an intrinsic limitation of fixed-function hardware that it's not forward compatible. Today's H.264 hardware is worthless for tomorrow's HEVC material, even at low resolution. No amount of power savings makes up for not being able to run something.

And how many people are still running Celeron 300's overclocked to 450MHz?

In the fast-moving world of personal computers, technology is obsolete within 3-5 years and new devices are purchased by the consumers that need them. Those that don't upgrade simply don't need or want the new level of performance. They have other priorities.
 
It depends on whether you're streaming through tons of data and applying various math on each iteration, or taking a small chunk that fits in L1/L2 and doing everything on it at once.
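A rough sketch of the two approaches, with made-up per-element passes standing in for "various math": the streaming version walks the whole buffer once per pass, the blocked version runs every pass on one cache-sized chunk before moving on.

Code:
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-ins for whatever per-element math is applied in each pass.
inline float pass_a(float x) { return x * 1.1f + 0.5f; }
inline float pass_b(float x) { return std::sqrt(std::abs(x)); }
inline float pass_c(float x) { return x * x - 0.25f; }

// Streaming: each pass walks the entire buffer, so between passes the data
// falls out of cache and has to be re-fetched from memory.
void process_streaming(std::vector<float>& data) {
    for (float& x : data) x = pass_a(x);
    for (float& x : data) x = pass_b(x);
    for (float& x : data) x = pass_c(x);
}

// Blocked: take a chunk that fits in L1/L2 and run all passes on it while
// it is still resident, then move to the next chunk.
void process_blocked(std::vector<float>& data, std::size_t chunk = 4096) {
    for (std::size_t base = 0; base < data.size(); base += chunk) {
        std::size_t end = std::min(base + chunk, data.size());
        for (std::size_t i = base; i < end; ++i) data[i] = pass_a(data[i]);
        for (std::size_t i = base; i < end; ++i) data[i] = pass_b(data[i]);
        for (std::size_t i = base; i < end; ++i) data[i] = pass_c(data[i]);
    }
}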
[edit]
Didn't notice there was already a new page that had a post saying the same thing :oops:
 
And how many people are still running Celeron 300's overclocked to 450MHz?

In the fast-moving world of personal computers, technology is obsolete within 3-5 years and new devices are purchased by the consumers that need them. Those that don't upgrade simply don't need or want the new level of performance. They have other priorities.
Don't feed him. He conveniently forgets that generally programmable logic won't be able to cope any better in his example. Just because something could compute something doesn't mean it should.
 
That paper lacks an evaluation of executing wide vectors on less wide execution units (i.e. AVX-1024 on x86). So you can throw its conclusions in the trash bin.

Again, it only considers "a hypothetical heterogeneous processor [consisting] of a small number of large cores for single-thread performance and many small cores for throughput performance." It does not consider achieving high throughput out of unified cores.

Both these papers utterly fail to recognize that workloads do not consist of separate sequential and data parallel tasks. It's a smoothly varying gradation. With a heterogeneous architecture, you'd have to choose between two (or even more) types of core, neither of which would be ideal, and with a substantial overhead for transitioning between the two.
Best trolling I have seen in years. I am done with this "discussion".
 
I point out that some research papers are not taking all options into account and suddenly I'm a troll?
 
Nick's just making a similar argument to vertex shaders + pixel shaders vs unified shaders, isn't he?
i.e. you never find the perfect balance if you have 2 separate types of units.
 
Not really... the problem is Nick just dismisses any evidence that runs counter to his way of thinking out of hand. FWIW, I think that Nick is right about much of what he says, just very wrong about when it is going to happen (he thinks it is a few years, I think it is a few decades...). You don't have to look very far back in this industry to find people making some outlandish claims about the way things will be 10 years in the future. Things rarely work out how people imagine...
 
Nick's just making a similar argument to vertex shaders + pixel shaders vs unified shaders, isn't he?
i.e. you never find the perfect balance if you have 2 separate types of units.
There's no choice. Physics doesn't like homogeneous compute at this scale.
 
Not really... the problem is Nick just dismisses any evidence that runs counter to his way of thinking out of hand. FWIW, I think that Nick is right about much of what he says, just very wrong about when it is going to happen (he thinks it is a few years, I think it is a few decades...). You don't have to look very far back in this industry to find people making some outlandish claims about the way things will be 10 years in the future. Things rarely work out how people imagine...
When do you guys think that Intel CPUs will have AVX-1024 (in 2008 Intel said it extends to 1024 bit FP and 512 bit integer—correct me if anything's changed since then)?

I'm going to assume Tick-Tock continues at ~1 microarchitecture / 2 years, and that AVX width changes only happen with microarchitecture changes (is this a safe assumption?).

If either INT or FP gets doubled per microarchitecture change, that gives:
2011 — Sandy Bridge — 128 bit INT / 256 bit FP
2013 — Haswell — 256 bit INT / 256 bit FP (this says Haswell's been delayed to 2014. Anyone know anything about that?)
2015 — SkyLake — 256 bit INT / 512 bit FP
2017 — SkyLake + 1 — 512 bit INT / 512 bit FP
2019 — SkyLake + 2 — 512 bit INT / 1024 bit FP

If both INT and FP get doubled per microarchitecture change (when possible), that gives 2017 — SkyLake + 1 — 512 bit INT / 1024 bit FP. Something quicker could give 2015 — SkyLake — 512 bit INT / 1024 bit FP.

Using the first sequence of values, if the number of cores doubles every 4 years and clock speed stays relatively unchanged, then (disregarding other changes) total INT/FP throughput doubles every 2 years, alternating between a vector-width doubling and a core-count doubling.
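Just to make the extrapolation explicit, here's the first sequence as a toy calculation (the starting point and the alternating doubling rule are the assumptions stated above, nothing more):

Code:
#include <cstdio>

// Toy extrapolation: start from Sandy Bridge (128-bit INT / 256-bit FP, 2011)
// and double one of INT or FP width per 2-year microarchitecture step,
// alternating, until FP reaches 1024 bits. This is just the arithmetic from
// the post, not an actual roadmap.
int main() {
    int year = 2011, int_bits = 128, fp_bits = 256;
    bool widen_int = true;  // in this sequence the Haswell step widens INT first
    while (fp_bits < 1024) {
        std::printf("%d: %4d-bit INT / %4d-bit FP\n", year, int_bits, fp_bits);
        if (widen_int) int_bits *= 2; else fp_bits *= 2;
        widen_int = !widen_int;
        year += 2;
    }
    std::printf("%d: %4d-bit INT / %4d-bit FP  <- AVX-1024 under these assumptions\n",
                year, int_bits, fp_bits);
    return 0;
}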
 