22 nm Larrabee

CarstenS · Jul 11, 2011

Do you intentionally pick those half-sentences for quotation, leaving out the context entirely?

hoho said:
that depends directly on the software you are running. I would imagine a CPU-based OpenCL or something similar would scale automagically with wider configuration

OpenCL could profit, yes. But the vast majority of multi-core-aware CPU software does not run that, and would instead force Intel to divide its investment in die area and transistor count three- if not fourfold: becoming faster at serial workloads (Amdahl to the rescue!), becoming faster at traditional multicore software, becoming faster at graphics/OpenCL-type applications, or becoming more power efficient.

3dilettante · Jul 11, 2011

Nick said:
You can't do that at the decoder stage since it depends on the addresses.

I think you're right, the decoder is too far ahead of the data in the pipeline to have the needed data even if the index operand was ready.

Note that loads can already be decomposed into multiple fetches when they straddle a cache line boundary. This happens in the load unit itself, and it's still one uop.

That only needs one AGU op for the split load. Each independent address calculation gets its own uop in the memory pipeline. Do you envision a gather pipe that would handle the normally separate memory uops internally?

Note that an uninterrupted gather does not require writing two 256-bit registers. Zeroing the mask register merely requires flagging it in the renamer.

An interrupted one does, so there would be a mechanism for updating both destinations.

That would still require 24 uops and thus offer very little advantage.

My idea of having a separate op per cache access would be impractical.

How would Haswell feed the gather instruction with its operands?
From the description, there is a source/destination register, a mask register, a base register, and an index register.
There would be a move from the GPR file as well as the FP file for the combined VSIB:Base and VSIB:Index values.

In total, the hardware would see 4 separate operands in the uop.
If vgather has special treatment to expand the number of inputs, why stop at FMA3? FMA4 was promised and then dropped, potentially because of the number of operands needed in the uop.

Could vgather be a multi-uop instruction from the complex decoder, with up to 4 uops?
The process for collecting the operands could span 2 uops. A third uop could cover the actual gather.
I'm not sure if a fourth uop would be needed, possibly for the write to the second destination in the suspend case?
The end result is a fixed number of uops that would take one cycle to dispatch to the uop buffers.

aaronspink · Jul 12, 2011

rpg.314 said:
So the GPU's are SIMD but LRB is vector because LRB has scalar instructions as well, but not GPU's. Is that so?

If yes, then imho, that's more hair splitting than a meaningful distinction. And even then, GCN tips the vector scales.

Can you point out where I say that LRB is vector?

aaronspink · Jul 12, 2011

Gipsel said:
The most fundamental property of a vector instruction is that it operates on vectors. It's that simple. If you have instructions for common data types as elements of those vectors, you have a vector extension (yes, SIMD is only a subset, but who cares about that) to an otherwise scalar ISA.

I would say that the most fundamental parts of a vector ISA are the VL and VS fields/registers, seeing as that is the fundamental thing that differentiates vectors from simple SIMD.

rpg.314 · Jul 12, 2011

aaronspink said:
Can you point out where I say that LRB is vector?

Oh, since LRBni was marketed as being vector complete, I assumed it counted as vector as well.

The VL/VS way of distinguishing between SIMD and vector seems rather cosmetic. Current GPUs achieve the same effect by masking unused lanes and generating an address per lane using arithmetic instructions.
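A minimal sketch of that point, assuming a hypothetical 8-lane machine: a vector length (VL) becomes a lane mask and a vector stride (VS) becomes per-lane address arithmetic. The names `LANES` and `strided_load_simd` are made up for illustration and do not correspond to any real ISA or API.

```python
LANES = 8  # assumed fixed hardware SIMD width (illustrative only)


def strided_load_simd(memory, base, stride, vl):
    """Emulate a VL/VS-style strided load on fixed-width SIMD.

    Lanes at or beyond `vl` are masked off (returned as None) and
    never touch memory; active lanes compute their own address.
    """
    lane_ids = range(LANES)
    mask = [lane < vl for lane in lane_ids]                   # VL -> lane mask
    addresses = [base + lane * stride for lane in lane_ids]   # VS -> per-lane arithmetic
    return [memory[a] if m else None for a, m in zip(addresses, mask)]


memory = list(range(100))
result = strided_load_simd(memory, base=4, stride=3, vl=5)
```

Here five lanes load elements 4, 7, 10, 13 and 16, and the remaining three lanes stay idle.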

Gipsel · Jul 12, 2011

aaronspink said:
I would say that the most fundamental parts of a vector ISA are the VL and VS fields/registers, seeing as that is the fundamental thing that differentiates vectors from simple SIMD.

I don't see how this is supposed to be a fundamental difference compared to instructions operating on vector vs. scalar data.

Omitting VL/VS registers is merely stripping down the conventional architectures a bit. And even that would not be entirely correct. One could say that GPUs for instance have a fixed vector length, but while the configurable workgroup size normally serves another purpose, one could view it as some kind of equivalent. And as for vector stride, GPUs (and x86 CPUs with AVX2) are actually more flexible, as they also support irregular gathers. CPUs before that have to split that up into several instructions. Not a big deal from a fundamental point of view, especially as all of them are relatively short vector implementations. It's just a trade-off between performance and implementation complexity in the design.

Nick · Jul 12, 2011

CarstenS said:
Just came to my mind: The analogue parts of chips are said to scale really poorly with process tech, and that is a burden each GPU must bear. And on top of that you subtract that crucial part from CPUs... Maybe some GPU vendor goes back to separating 2D and video into an additional chip like Nvidia did in the past? That'd increase compute density a bit I'd say.

I didn't subtract that. The display logic is part of the System Agent. And besides, the I/O logic contains PCIe lanes for communicating with a discrete GPU... But let's not start nitpicking. The point was that doubling the vector width and implementing FMA brings the CPU a whole lot closer to the GPU in terms of computing density.

Number one is related to the point above: Removal of ff hardware in order to cram more compute units into a given die.

Absolutely. But now apply that to an APU's IGP. How would that still be significantly different from a CPU with AVX-1024? Keep in mind that removing fixed-function hardware and thus moving toward software rendering requires an extensive ISA, advanced data coherency, high cache hit rates, preemptive scheduling, etc. All of these things further converge the IGP toward the CPU's architecture. So at some point it would make sense to fully unify them into a homogeneous architecture.

Number two: contrary to GPUs, CPUs do not automatically profit from wider designs, i.e. higher counts (or width) of their execution units.

I'm sorry but that's absolute nonsense. It takes considerable effort to scale GPUs. And software renderers which make use of AVX do exist. Also, LLVM abstracts the vector width, and AVX encoding support is being implemented as we speak. Scaling to AVX-1024 will be trivial.

And finally number three: CPUs have to maintain hardware backward compatibility with their ancestors at least for a while in order for their extension to be of any real-world relevance. GPUs have the benefit (or curse) of being able (or forced, depending on your point of view) to mask that behind the driver. This, obviously, does not completely disable any kind of progress in the same line of extensions, but makes it at least more costly and/or inflexible compared to being able to do something completely different in your next generation. At least that's what I'd think from my layman's perspective.

As far as I'm aware, GPU manufacturers also maintain a high level of hardware backward compatibility. Just look at the many CUDA compute capabilities. You can't just radically throw things around. Don't underestimate the software work involved to ensure that a large number of APIs run reliably and efficiently on a completely new architecture. Once again GPUs don't appear to have any significant leg up.

The x86 legacy is only a small burden in comparison to the benefits. With AMD already having revealed the ISA for GCN, it appears they also want to start reaping benefit from long-term binary compatibility.

While I see your point, I also suspect that there must be a reason why Intel has decided to go with more ff hardware in the SB IGP (most of which is related to video playback/transcoding) than in previous generations instead of using the x86 cores.

This seems little more than a temporary selling point. Note that audio used to demand a discrete sound card, then it was moved to the chipset, and then it became a software driver, despite improving the quality along the way. MMX played a major role in the final move to software. Video processing used to demand a discrete graphics card, then it was moved to the chipset, now it's part of the CPU just like the IGP, and the next logical step would be a move to software. AVX2's 256-bit integer vector operations and gather support could mark the turning point.

Rys · Jul 12, 2011

Nick said:
Video processing used to demand a discrete graphics card, then it was moved to the chipset, now it's part of the CPU just like the IGP, and the next logical step would be a move to software. AVX2's 256-bit integer vector operations and gather support could mark the turning point.

Why would you do that when a high performance multi-standard decoder would be less than one square millimeter at a bare handful of mW, even for 4Kx4K, on the same process that AVX2 will show up on? x86 can't ever hope to compete with that, and that basic reasoning is something you consistently ignore, and it underpins almost everything you've had to say in this thread.

That fixed hardware is cheaper and more efficient in every way.

Nick · Jul 12, 2011

3dilettante said:
That only needs one AGU op for the split load. Each independent address calculation gets its own uop in the memory pipeline. Do you envision a gather pipe that would handle the normally separate memory uops internally?

Yes, the load unit would receive a single gather uop and all the source operands. It computes the full (virtual) address of the first element, and checks which of the other indices point to the same cache line. In the next cycle, it computes the address of the next element for which the cache line fetch hasn't already been queued (if any), again comparing which indices of other elements are stored within the same cache line. This is repeated till every unique cache line fetch has been queued up (1 to 8 cycles).
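The scheme described above can be sketched roughly as follows. This is only an illustration of the counting argument, not a model of any real load unit: the 64-byte cache line size and the names `CACHE_LINE_BYTES` and `gather_cycles` are assumptions, and each loop iteration stands in for one cycle.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size


def gather_cycles(base, indices, scale, mask=None):
    """Count the cycles needed to queue every unique cache line fetch.

    Each iteration models one cycle: compute the address of the first
    element whose line is not yet queued, queue that line, and drop
    every other element that falls within the same line.
    """
    if mask is None:
        mask = [True] * len(indices)
    # Byte addresses of the elements that are still enabled by the mask.
    pending = [base + i * scale for i, m in zip(indices, mask) if m]
    queued_lines = []
    cycles = 0
    while pending:
        line = pending[0] // CACHE_LINE_BYTES
        queued_lines.append(line)
        pending = [a for a in pending if a // CACHE_LINE_BYTES != line]
        cycles += 1
    # Note: a fully masked-off gather yields 0 here, a case the
    # 1-to-8-cycle estimate above does not cover.
    return cycles, queued_lines


# Eight consecutive 4-byte elements share one line: a single cycle.
best = gather_cycles(0, [0, 1, 2, 3, 4, 5, 6, 7], 4)
# Eight elements each 64 bytes apart: one line per element, eight cycles.
worst = gather_cycles(0, [0, 16, 32, 48, 64, 80, 96, 112], 4)
```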

How would Haswell feed the gather instruction with its operands?
From the description, there is a source/destination register, a mask register, a base register, and an index register.
There would be a move from the GPR file as well as the FP file for the combined VSIB:Base and VSIB:Index values.

In total, the hardware would see 4 separate operands in the uop.
If vgather has special treatment to expand the number of inputs, why stop at FMA3? FMA4 was promised and then dropped, potentially because of the number of operands needed in the uop.

The most straightforward solution would be to perform the source/destination register blending in a separate uop corresponding to a vblend instruction. There are two ports capable of executing vblend on Sandy Bridge, so this wouldn't have a noticeable effect on gather performance. However, considering that the other ports are already fed with up to three 256-bit input operands, it might not be that big of a deal to have the load port take the same amount of input.

Note that vblend with an immediate operand as the mask requires only a single uop, while vblendv, which takes a fourth register, requires two uops. This indicates that the 8-bit immediate is a free extra operand, but it takes a uop (the same one as vmovmsk) to extract it from a register.

The choice to go with FMA3 instead of FMA4 may simply be about uop encoding size. Since they store physical register indexes, FMA4 would have allowed significantly fewer uops in an equal sized uop cache versus FMA3. Note however that a fused FMA3 uop with a memory operand encodes four registers. Same for gather.

So in the worst case a gather instruction may end up taking three uops:
vmovmsk imm, mask
vgather temp, [base+vindex*scale], imm
vblend dest, dest, temp, imm

Despite that, the peak throughput would still be 1 full gather instruction every clock cycle since each of these uops takes a different port.

I'm not sure if a fourth uop would be needed, possibly for the write to the second destination in the suspend case?

An exception starts the execution of a micro-coded routine, so that routine may contain a uop to have the load unit write back the mask register.

Nick · Jul 12, 2011

Rys said:
Why would you do that when a high performance multi-standard decoder would be less than one square millimeter at a bare handful of mW, even for 4Kx4K, on the same process that AVX2 will show up on? x86 can't ever hope to compete with that, and that basic reasoning is something you consistently ignore, and it underpins almost everything you've had to say in this thread.

That fixed hardware is cheaper and more efficient in every way.

Where's the dedicated audio processing hardware then?

Arun · Jul 12, 2011

Nick said:
Where's the dedicated audio processing hardware then?

Every single mobile phone SoC in the world has dedicated audio hardware of some kind such as the Tensilica 330HiFi. The reason why that makes sense for them is that music is often played while the screen is disabled, so every milliwatt counts. The savings would be completely negligible on a PC, and anyway on notebooks the system architecture could not benefit from it. Handheld subsystems default to being fully disabled (power gated or not even any current in the first place so zero leakage) but PC subsystems default to being partially enabled (e.g. even empty PCI-E slots take quite a bit of power).

Rys said:
That fixed hardware is cheaper and more efficient in every way.

Rys, let me pre-emptively counter the other likely disagreement with your point: that if you need AVX2/FMA/Gather anyway, then using it for video is free whereas dedicated hardware is idle most of the time.

That forgets two important points:
1) This allows you to use the CPU for other purposes while processing video. This is a common and valuable scenario.
2) 'Dark silicon' has been the very foundation of the handheld industry for some time with this trend accelerating further, and as power consumption continues scaling slower than cost, it will become necessary in every part of the semiconductor industry, including PC desktops.

Think about it: if cost is reduced by 50% every generation but power consumption is reduced by 50% only every 2 generations, then if you are power constrained (which notebooks have been for a long time and desktops are starting to be as well to some extent) then the 'maximum viable chip cost' would go down by 50% every 2 generations. This is obviously unacceptable to semiconductor companies, just like it would have been unacceptable to Intel to stop scaling CPUs even if multicore programming took off even more slowly than it did.
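The arithmetic behind that can be sketched as follows, under simplifying assumptions that are not spelled out in the post: cost and power are taken per transistor, the power budget is fixed, and everything is normalized to 1.0 at generation 0. The function name `max_viable_chip_cost` is made up for illustration.

```python
def max_viable_chip_cost(generation, power_budget=1.0):
    """Relative cost of the largest chip that fits a fixed power budget.

    Assumptions: cost per transistor halves every generation, power
    per transistor halves every two generations, all values are
    normalized to 1.0 at generation 0.
    """
    cost_per_transistor = 0.5 ** generation
    power_per_transistor = 0.5 ** (generation / 2)
    # Number of transistors the power budget can keep switched on.
    affordable_transistors = power_budget / power_per_transistor
    return affordable_transistors * cost_per_transistor


for gen in (0, 2, 4):
    print(gen, max_viable_chip_cost(gen))
```

After two generations the budget powers twice as many transistors, but each costs a quarter as much, so the chip you can sell is worth half as much; after four generations it is a quarter.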

So 'dark silicon' doesn't just make sense in terms of battery efficiency but also economics. The problem on PCs (and GPUs specifically) is that most subsystems are often only clock gated rather than power gated. As leakage becomes an increasingly large part of power consumption over time (High-K and 3D Transistors help but aren't really enough) it will become more and more important to implement fine-grained power gating even if there is an area cost to it. But that's not the end of the world and dark silicon is here to stay, like it or not, including dedicated graphics acceleration of one sort or another.

rpg.314 · Jul 12, 2011

Nick said:
Where's the dedicated audio processing hardware then?

Power cost of HD video decode >>> power cost of audio processing.

nAo · Jul 12, 2011

It only took 30 pages to name dark silicon, thanks Arun. Physics doesn't really like homogeneous computing at this point, it doesn't matter how badly we want it, it's not around the corner.
Soon in a theater near you "The Revenge of Fixed Function Hardware" ©

aaronspink · Jul 12, 2011

Rys said:
Why would you do that when a high performance multi-standard decoder would be less than one square millimeter at a bare handful of mW, even for 4Kx4K, on the same process that AVX2 will show up on? x86 can't ever hope to compete with that, and that basic reasoning is something you consistently ignore, and it underpins almost everything you've had to say in this thread.

because "high performance multi-standard decoder" is generally a nice way of saying, "sure its craap, but its cheap craap!"

That fixed hardware is cheaper and more efficient in every way.

and worse quality and worse flexibility.

3dilettante · Jul 12, 2011

I'm not sure why dark silicon is such a revelation all of a sudden, other than perhaps certain marketers finally got around to mentioning it.

The selective inactivation or gating of circuits and units has been around for several nodes now, even for the desktop.
Is dark silicon qualitatively different from any other silicon that is behind a sleep transistor, gated clock, or power gate?

What of FPGA silicon, which is not on the continuum being discussed, but could be included at some point with Intel's recent interest in fabbing them?

hoho · Jul 12, 2011

Depends on the kind of fixed HW we are talking about. The encryption "accelerators" in ARM CPUs sure seem quite nice and should easily pay back for the extra transistors they take:

Obviously not everything can get anywhere near as big a boost, but I bet lots of stuff can be accelerated quite nicely.

Rys · Jul 12, 2011

aaronspink said:
because "high performance multi-standard decoder" is generally a nice way of saying, "sure its craap, but its cheap craap!"

Some are neither crap nor cheap.

aaronspink said:
and worse quality and worse flexibility.

Or better quality and more flexible than a 'software' solution in any given area/power/time/cost, you mean? You can spin generally programmable silicon all you like, but the reality is that it's just a complete waste of everyone's time, money, area and power for various things, and it being generally programmable is the only selling point. That doesn't give it a free pass as better.

Nick · Jul 13, 2011

Arun said:
Every single mobile phone SoC in the world has dedicated audio hardware of some kind such as the Tensilica 330HiFi. The reason why that makes sense for them is that music is often played while the screen is disabled, so every milliwatt counts. The savings would be completely negligible on a PC, and anyway on notebooks the system architecture could not benefit from it. Handheld subsystems default to being fully disabled (power gated or not even any current in the first place so zero leakage) but PC subsystems default to being partially enabled (e.g. even empty PCI-E slots take quite a bit of power).

I fully agree, but this argument wasn't about mobile phones. It was about Sandy Bridge extending its video processing capabilities. And while it may take many more years, eventually any budget CPU will offer a high enough performance at a low enough power consumption to allow software video processing for any practical purpose. At that point it's an utter waste to include dedicated hardware for it, no matter how small and no matter how power efficient it could be. Just like dedicated sound processing, it will go the way of the dodo.

That forgets two important points:
1) This allows you to use the CPU for other purposes while processing video. This is a common and valuable scenario.

Multi-core also allows multi-tasking, always delivering maximum aggregate performance. The reverse isn't possible with dedicated hardware though: you can't use the video processing logic to speed up anything else.

2) 'Dark silicon' has been the very foundation of the handheld industry for some time with this trend accelerating further, and as power consumption continues scaling slower than cost, it will become necessary in every part of the semiconductor industry, including PC desktops.

Again, that's the handheld market. In the desktop market, we instead see Turbo Boost technology to increase the clock frequency while the TDP hasn't been exceeded yet! And tri-gate technology enables generous performance/Watt scaling for several more nodes. Last but not least, there's AVX-1024 to provide a 3/4 cycle clock gating opportunity for the control logic. What happens beyond that nobody can accurately predict, but by that time the IGP should already be history.
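To put a number on the 3/4 figure, here is a rough sketch. It assumes, and this is a reading of the argument rather than anything stated in this post, that a 1024-bit instruction is sequenced over 256-bit execution units in four cycles and that the control logic only needs to be active for the first of them. The name `control_idle_fraction` is hypothetical.

```python
def control_idle_fraction(vector_bits, datapath_bits):
    """Fraction of cycles in which control logic could be clock gated.

    Assumes one instruction occupies the datapath for
    ceil(vector_bits / datapath_bits) cycles and the control logic
    is only needed during the first of those cycles.
    """
    cycles_per_instruction = -(-vector_bits // datapath_bits)  # ceiling division
    return (cycles_per_instruction - 1) / cycles_per_instruction


avx1024_on_256 = control_idle_fraction(1024, 256)  # 3 of every 4 cycles
avx256_on_256 = control_idle_fraction(256, 256)    # no gating opportunity
```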

Sandy Bridge extending the video processing capabilities really can't be called a trend and isn't even relevant to the software rendering discussion. GPUs extend their generic programmability every generation, and you can't execute shaders and computing kernels on hardwired logic. So let's not underestimate the effect of increasing software diversification. Even for graphics, things like micro-polygons, ray-tracing, custom anti-aliasing, etc. will render dedicated components less useful. Fortunately, software rendering allows taking many shortcuts.

Moore's Law is and has always been a self-fulfilling prophecy. To continue scaling the transistor density, they'll also have to continue reducing the power consumption. Tri-gate isn't the end, but just the beginning. Heterogeneous computing certainly isn't a viable long-term alternative. There are only so many things that can potentially be split off into dedicated components, it's a nightmare to develop and maintain software for, and there's a very real communication overhead for offloading things outside of the CPU cores.

Nick · Jul 13, 2011

rpg.314 said:
Power cost of HD video decode >>> power cost of audio processing.

Power cost of audio processing >>> power cost of text processing. Yet it turned into a software driver anyway.

The performance requirement for video processing is evolving more slowly than the CPU's performance/Watt. So it's merely a matter of time before video becomes just a minor task that can be handled perfectly by the CPU, just like audio became a minor task when computers outgrew being glorified typewriters.

Blazkowicz · Jul 13, 2011

The improvements in general-purpose CPU power will be used for running HTML5 and JavaScript anyway.

power cost of html5 web page with GUI elements and text >> power cost of adobe flash >> power cost of video processing.
