You are massively underestimating the demands of video processing in the next 10 years. I think 3D 4Kx4K HEVC could take 40x the performance of 2D 1080p H.264 High Profile. Even including the large one-time bump with AVX2, I'm very skeptical that CPU performance will scale by 40x in 10 years while remaining in the same thermal envelope. And even if it did, it will most certainly not be "low enough power consumption". The other side of the coin is that the PowerVR VXD392 and VXE382 can both handle 3D 4Kx4K H.264 High Profile today (scaling to HEVC in the future) at significantly less than 1W.
Let's first get something straight. There are three main discussions going on right now:
1) GPGPU versus high-throughput CPU
2) Dedicated versus software graphics
3) Fixed-function versus programmable video
Even if, hypothetically, the fixed-function hardware wins (3) hands down (which I'll debate below), that doesn't mean a thing for (1), and (2) can still go in either direction depending on platform, market segment, semiconductor technology breakthroughs, etc. Can we at least agree on that to some degree?
Now, while there's indeed plenty of room for scaling the video processing workload, I sincerely doubt that the majority of consumers want or need 3D 4Kx2K HEVC in the next ten years. Case in point: Blu-ray sales still have to overtake DVD sales. Quad Full HD screens are humongous and currently cost both your kidneys. And while I'm sure one day they'll become somewhat affordable, the question remains whether people actually see much need for them. This situation is once again not unlike audio processing: although higher qualities exist, 16-bit at 44 kHz has proven adequate for the masses for decades.
And yes, Sandy Bridge's video encoding doesn't achieve very high quality, but that is not an intrinsic limitation of hardware encoding (even if it's practically never going to beat x264's quality).
It's an intrinsic limitation of fixed-function hardware that it's not forward compatible. Today's H.264 hardware is worthless for tomorrow's HEVC material, even at low resolution. No amount of power savings makes up for not being able to run something.
The handheld market is moving towards a 'reverse Turbo Boost' mechanism, rather similar to what AMD implements on Cayman: you have a maximum frequency, and the chip monitors its total power consumption and temperature at a variety of likely hotspots. It automatically reduces the frequency and voltage of different blocks as required to fit within the power and thermal budget specified by the OEM.
As the number of cores increases further on CPUs and both peak power and hotspots become a problem even at the default frequency, Intel will be forced to move to something closer to this even on the desktop. I suspect the CPU will have to be clocked lower when all 8 cores are doing full-throttle AVX2 work.
Whether you increase the frequency with a single-threaded workload, or reduce the frequency with a multi-threaded workload, in the end that's exactly the same thing. It's just semantics whether you call the low or the high frequency the base frequency and boost or reverse boost from there.
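Either way, the underlying control loop is the same. Here's a minimal sketch of such a budget-driven governor; all names, numbers and the control policy are my own illustration, not AMD's, Intel's or any OEM's actual implementation:

```cpp
#include <algorithm>

// Illustrative only: inputs from hypothetical on-die monitors.
struct Sensors {
    double package_watts;   // total power draw reported by the power monitor
    double hottest_spot_c;  // hottest of the monitored thermal sensors
};

// One iteration of a power/thermal governor: stay at the maximum frequency
// unless the power or thermal budget set by the OEM is exceeded, then back
// off frequency (and, implicitly, voltage) until the chip fits again.
int next_frequency_mhz(const Sensors& s, int current_mhz,
                       double power_budget_w, double temp_limit_c) {
    const int min_mhz = 800, max_mhz = 2000, step_mhz = 100;
    if (s.package_watts > power_budget_w || s.hottest_spot_c > temp_limit_c)
        return std::max(min_mhz, current_mhz - step_mhz);   // throttle down
    return std::min(max_mhz, current_mhz + step_mhz);       // recover headroom
}
```

The real thing operates per block and adjusts voltage along with frequency, as described above; the sketch only captures the feedback principle.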
Just like High-K was at 45nm, Tri-gate is a one-time improvement at 22nm. It pre-emptively prevents some problems that would have become much more severe at future nodes, but it does not magically fix the power scaling problem in any general sense. Intel's 14nm process will take a fair bit more power per mm² of silicon than their 22nm process at the default voltage, and there's no way around that. I'm not aware of anything revolutionary at 14nm, and I haven't even heard anything very exciting for 10/8nm despite having my ears pretty close to the ground for process technology in general.
That's still many years from now. Back when 90 nm caused Intel to ditch the NetBurst architecture, people didn't have the slightest clue about 22 nm tri-gate devices. And even if 14 nm itself doesn't bring anything new for lowering power consumption, once again note that AVX-1024 creates a 3/4 clock gating opportunity. That's some dark silicon for you right there.
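To put a number on that 3/4: this assumes 1024-bit instructions are executed over multiple cycles on the existing 256-bit execution units (my reading of the AVX-1024 idea, not something spelled out above). The arithmetic is simply:

```cpp
// Back-of-the-envelope only, under the assumption stated above.
constexpr int vector_width_bits   = 1024;
constexpr int datapath_width_bits = 256;
constexpr int cycles_per_op       = vector_width_bits / datapath_width_bits;  // 4
// The front end only needs to issue a new uop once every 4 cycles, so fetch,
// decode and scheduling logic can be clock-gated the remaining fraction:
constexpr double gated_fraction   = 1.0 - 1.0 / cycles_per_op;                // 0.75
```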
Everyone is very well aware that new innovation in power efficiency will be required to continue Moore's Law. So a lot of effort goes into it, and it can come in many forms: not just semiconductor process breakthroughs and fine-grained clock gating, but also ISA enhancements and software itself. For example, gather replaces a power-hungry sequence of 26 uops, where each insert or extract instruction had to move more than 128 bits around instead of merely the individual elements! In a way the gather logic adds "dedicated" support for a common operation, but it's still quite generic (in the same way that other vector instructions are generic, at least). Again, it's all about covering ILP, TLP and DLP for generic workloads.

Software innovation also assists in increasing effective performance/Watt, through things like dynamic code generation and advanced work culling. And I've already mentioned the potential of out-of-order execution to actually assist in increasing cache hit ratios, and thus reducing the power consumption involved in fetching data from higher up in the hierarchy.
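To make the gather point concrete, here's a rough sketch (my own illustration with made-up function names, not taken from the post) of an eight-element gather emulated with scalar loads and insert instructions, versus a single AVX2 gather intrinsic:

```cpp
#include <immintrin.h>

// Pre-AVX2: each element is loaded individually and inserted into the vector,
// shuffling a whole 128-bit register just to place a single 32-bit value.
__m256 gather8_emulated(const float* base, const int idx[8]) {
    __m128 lo = _mm_setzero_ps(), hi = _mm_setzero_ps();
    lo = _mm_insert_ps(lo, _mm_load_ss(base + idx[0]), 0x00);
    lo = _mm_insert_ps(lo, _mm_load_ss(base + idx[1]), 0x10);
    lo = _mm_insert_ps(lo, _mm_load_ss(base + idx[2]), 0x20);
    lo = _mm_insert_ps(lo, _mm_load_ss(base + idx[3]), 0x30);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + idx[4]), 0x00);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + idx[5]), 0x10);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + idx[6]), 0x20);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + idx[7]), 0x30);
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}

// AVX2: a single gather instruction takes the base pointer and an index vector.
__m256 gather8_avx2(const float* base, __m256i indices) {
    return _mm256_i32gather_ps(base, indices, /*scale=*/4);
}
```

Every insert in the emulated version moves a full 128-bit register around just to deposit one 32-bit element, which is exactly where the extra uops and wasted data movement come from.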
So I'm not counting on a miracle from the process engineers. There's plenty of opportunity to scale homogeneous architectures for the foreseeable future. That said, I'm curious which merely "exciting" things you have heard about, if none of them qualify as "very exciting" yet...
Once again, even if you were right (which you are not), it's not just a question of technical viability but also of political considerations inside Intel. The lesson that Intel seems to have learned from the failure of Larrabee is the exact opposite of everything you're saying.
Like I said several times before, the failure of Larrabee as a high-end GPU doesn't mean the IGP won't get replaced by CPU cores. Intel still has plenty of other reasons to turn the CPU into a power-efficient high-throughput device with AVX2 and AVX-1024. So far these "politics" you talk about have not prevented AVX from converging toward LRBni.
If you want to see this happen, your only chance is to implement a kickass AVX2-based DirectX11 renderer in SwiftShader and manage to prove to the world - including Intel itself - that you can be competitive with Intel's own IGP in Haswell in terms of both absolute performance and performance/watt.
This isn't about me; it's about empowering every developer with limitless capabilities. And so it's not about DX11 software rendering either. A software renderer that's no more than a drop-in alternative for restrictive hardware rendering APIs would be a failure. Where things get really interesting is when you leave the beaten path...