22 nm Larrabee

3dilettante · Dec 9, 2011

Nick said:
I sincerely doubt they couldn't. I think they simply expected software to easily scale to many threads by now and not suffer from Amdahl's Law. In such a world Bulldozer would have made a lot of sense, if it was well executed.

I have a hard time believing they all could have been that naive. There are, or at least were, many engineers doing in-depth analysis of the workloads BD would face. The case for such a regression is even weaker since the design is badly delayed and would have come out facing even more serial workloads if it were on time. It should have been out on the 45nm node.

A wide OoO x86 running in the 3-4 GHz range is a significant undertaking to convert to SMT.
It took Intel quite some time to get it right, and let's note that it had to do this twice, possibly three times if we consider that SB probably rearchitected key parts of the execution engine significantly compared to Nehalem.

Aside from Intel, the other significant high-performance wide OoO SMT design is POWER.

AMD may have decided that CMT gave them the best return on their engineering buck, possibly because of the difficulty in validating the design and making it run optimally while staying within power and manufacturing constraints.

Intel has managed all of this with far more resources, very good engineering, and far superior manufacturing.
IBM has managed this with a massive subsidy from its system and software side, some nice engineering, control over its hardware and software stack, and very relaxed power and yield requirements thanks to the previously mentioned subsidy and control over the platform.

In AMD's case, it needs to match all of that without the money, engineering, or process. Faced with uncertainty about doing a wide OoO SMT design with the resources on hand, it may have hoped it could min

Unfortunately they miscalculated how hard it is to exploit TLP.

I think they knew how hard it would be. The chip that they produced did not clock high enough and there are niggling issues with its memory performance throughout the hierarchy. They made the TLP situation worse with their evolutionary take on the uncore, which is still unimpressive when it comes to inter-core communication and still lags in memory utilization.

That's one interesting theory. I have little doubt that somewhere along the development of Bulldozer they realized they were moving in the wrong direction, resulting in resources to be cut and valuable time lost figuring out what to do.

They went in the wrong direction more than once. BD is not the first planned successor to K8. At least one SMT design flamed out, and there may have been more than one phase of BD.

The design seems like it is missing something. AMD seems to have architected the cores to handle a memory pipeline that is apparently very paranoid about synchronization and write combining, prone to burdening the L/S units, but also primed for long queues of ops and high straight line speed.
Minimizing the impact of redoing a failed transaction may have fit in that philosophy.

hoho · Dec 9, 2011

Nick said:
Intel has been sharing ALUs between two threads for almost ten years now, in the form of Hyper-Threading. As long as the ALUs are close together I don't see why the latency would be horrible. There's a 1 or 2 cycle penalty for data crossing domains but that has little impact.

Yes, it works for Intel exactly because the ALUs are already built together as a monolithic core. There is a huge difference between that and combining ALUs in BD's two separate cores.

Nick said:
I sincerely doubt they couldn't. I think they simply expected software to easily scale to many threads by now and not suffer from Amdahl's Law. In such a world Bulldozer would have made a lot of sense, if it was well executed.

Unfortunately they miscalculated how hard it is to exploit TLP. SIMD is far more effective, which is why Intel created AVX. For everything less DLP oriented, high ILP is still critical.

That makes no sense at all. Had they really expected multithreading to be the norm by now, then adding SMT to their CPU would only have made it scale even better than it does now.
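
To put rough numbers on the Amdahl's Law point in the quote above, here is a minimal sketch. The parallel fractions and thread counts are made-up illustrations, not measurements of any real workload or of Bulldozer.

```python
def amdahl_speedup(parallel_fraction: float, n_threads: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


if __name__ == "__main__":
    # Illustrative values only: how much do extra threads help
    # as the serial share of the work grows?
    for p in (0.5, 0.9, 0.99):
        row = ", ".join(f"{n} threads: {amdahl_speedup(p, n):.2f}x" for n in (2, 4, 8))
        print(f"parallel fraction {p:.2f} -> {row}")
```

Under these assumed numbers, a workload that is half serial stays below 2x no matter how many threads are added, which is the regime where per-thread performance dominates.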

denev2004 · Dec 9, 2011

Nick said:
Unfortunately they miscalculated how hard it is to exploit TLP. SIMD is far more effective, which is why Intel created AVX. For everything less DLP oriented, high ILP is still critical.

Judging by the criticism of vector units in AMD's APU whitepaper, AMD seems to share NVIDIA's idea of using a massive number of threads and units to increase performance.

Gipsel · Dec 9, 2011

denev2004 said:
Both
But actually I'm wondering whether it is more like CMT/FMT

"The FPU can receive up to four ops per cycle. These ops can only be from one thread, but the thread may change every cycle" (from the BD Optimization Guide)

That comes from the fact that the decoder runs with a fine-grained temporal multithreading scheme (basically barrel processing with just 2 threads). The decoders simply alternate each clock cycle between the two threads of the module. From that it follows that the FPU scheduler receives instructions from only a single thread in each cycle. I think the Intel decoders work the same way (in the execution pipeline behind the schedulers, instructions from two threads may be issued simultaneously, making it SMT, but the BD FPU does the same).
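
The alternating-decoder behaviour described above can be sketched roughly as follows. This is a toy model, not AMD's implementation: the function name, the `width=4` default (taken from the four-ops-per-cycle figure in the quoted guide), and the `skip_idle` policy for a thread with nothing to decode are all assumptions made for illustration.

```python
from collections import deque


def decode_schedule(ops_a, ops_b, width=4, skip_idle=True):
    """Toy model of a decoder shared by two threads.

    Each cycle the decoder serves exactly one thread and hands on up to
    `width` of that thread's ops. Returns a list of (cycle, thread_id, ops).
    """
    queues = (deque(ops_a), deque(ops_b))
    schedule = []
    cycle = 0
    while queues[0] or queues[1]:
        tid = cycle % 2  # alternate between the two threads every cycle
        if skip_idle and not queues[tid]:
            tid = 1 - tid  # assumption: an idle thread's slot goes to the other thread
        count = min(width, len(queues[tid]))
        group = [queues[tid].popleft() for _ in range(count)]
        schedule.append((cycle, tid, group))  # group is empty on a wasted slot
        cycle += 1
    return schedule


if __name__ == "__main__":
    thread_a = [f"a{i}" for i in range(6)]
    thread_b = [f"b{i}" for i in range(3)]
    for cycle, tid, group in decode_schedule(thread_a, thread_b):
        print(f"cycle {cycle}: thread {tid} -> {group}")
```

Every group in the schedule comes from a single thread, so anything fed directly by this stage sees ops from only one thread per cycle even though both threads make progress over time.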

Nick · Dec 9, 2011

tunafish said:
Every sentence you write screams you still don't get it.

I appreciate your input but there's no need to be condescending and arrogant.

I've read the Optimization Guide now and discovered that, unlike previous AMD architectures, the AGU units can also execute simple ALU instructions (hence they call it AGLU units in the manual).

That changes everything. There is obviously no lack of arithmetic execution width and adding or sharing more ALUs would be the last thing they need.

Haswell brings two important new technologies -- HTM and Gather.

Has HTM support been confirmed?

3dilettante · Dec 9, 2011

Nick said:
I've read the Optimization Guide now and discovered that, unlike previous AMD architectures, the AGU units can also execute simple ALU instructions (hence they call it AGLU units in the manual).

That changes everything. There is obviously no lack of arithmetic execution width and adding or sharing more ALUs would be the last thing they need.

AMD's BD documentation is rife with copy/paste from previous architectures and is often misleading.
In a lot of places, the descriptions are appropriate for a K8, not BD.

The AGLU line does not seem to be significantly different from what the int and AGU units did in previous architectures.

Check the instruction tables and note that most integer operations only go to the EX pipes. The simple math ops are probably the small shifts and adds related to address generation such as the LEA entries that do list the AGU as being involved. I haven't seen benchmarks indicating any integer instruction throughput greater than what could be expected for 2 integer pipes, but I am not sure that this has been specifically checked in isolation of other factors.

We may need to wait for Agner's optimization manual to see if his synthetics bear the AGLU thing out, but I have doubts it's more than BS from AMD.

mhouston · Dec 9, 2011

Nick said:
Has HTM support been confirmed?

Do you mean "Has it been double confirmed?"?

Nick · Dec 9, 2011

denev2004 said:
Judging by the criticism of vector units in AMD's APU whitepaper, AMD seems to share NVIDIA's idea of using a massive number of threads and units to increase performance.

CPU threads are entirely different from what GPU designers typically call threads (which should probably be called strands or fibers instead). Calling individual SIMD lanes cores also adds much to the confusion...

To bring the strengths of GPU architectures to the CPU, wider SIMD is more critical than more cores/threads. That appears to be exactly where the AVX roadmap is heading. My concern is that AMD didn't anticipate the need for stronger SIMD and will be playing catch up with Intel for at least another two generations.

Nick · Dec 9, 2011

3dilettante said:
AMD's BD documentation is rife with copy/paste from previous architectures and is often misleading.
In a lot of places, the descriptions are appropriate for a K8, not BD.

The AGLU line does not seem to be significantly different from what the int and AGU units did in previous architectures.

Check the instruction tables and note that most integer operations only go to the EX pipes.

Sigh, yes, that document is a mess. Thanks for pointing out the instruction tables. Assuming those are correct (I also spotted an inconsistency for popcnt) it's back to square one: Bulldozer lacks issue width and a form of SMT might fix it.

We may need to wait for Agner's optimization manual to see if his synthetics bear the AGLU thing out, but I have doubts it's more than BS from AMD.

Yeah I'm eagerly awaiting Agner Fog's analysis as well. I tend to rely more on his findings than the official optimization guidelines.

tunafish · Dec 9, 2011

Nick said:
I appreciate your input but there's no need to be condescending and arrogant.

Sorry.

I've read the Optimization Guide now and discovered that, unlike previous AMD architectures, the AGU units can also execute simple ALU instructions (hence they call it AGLU units in the manual).

That changes everything. There is obviously no lack of arithmetic execution width and adding or sharing more ALUs would be the last thing they need.

AFAIK, no instruction can be scheduled into both an AGU and an AGLU. I believe the only non-memory instructions that issue into AGLUs are inc and dec. Since the AGUs have all the hardware needed for add/sub, not issuing them there is probably to make the scheduler simpler. I think making add and sub available in the AGUs is an improvement probably worth making.

Has HTM support been confirmed?

No, but among the compiler people I know who visit such conferences, it's considered in the bag. It hasn't been publicized probably so they can disable it if it fails testing, like HyperThreading was originally.

Check the instruction tables and note that most integer operations only go to the EX pipes.

The instruction tables are also wrong. Some of the errors are borderline hilarious. It's not worth trusting any of it.

tunafish · Dec 9, 2011

Oh btw, having inc and dec (only) in the AGLU pipes can actually be bad in some cases. Having a tight loop with a load(+op), a store, and inc for the address (such as for normalizing a bunch of bytes in an array) operates at less than an iteration per cycle on BD because the inc occupies an AGU pipe.

Of course, in reality it doesn't matter because BD is so horribly bad at storing things into cache. Sigh.

Nick · Dec 9, 2011

tunafish said:
I'll put it one more time: 2x(3ALU+2AGU) would be cheaper to make than the 2x(1ALU+2AGU+1shared ALU) you proposed.

Why? By extension you appear to be implying that 2 x 4 ALU is cheaper than 2 x 2 shared ALU = 1 x 4 ALU, which clearly can't be the case.

It's quite possible that sharing only part is worse than sharing everything (like Hyper-Threading), but I believe that requires much more careful consideration than just stating that adding more ALUs is cheaper than sharing. There are a ton of other things to determine. For instance the shared ALUs could have a fast divider and multiplier, so you don't need two of each (slow ones) per module.

Nick · Dec 9, 2011

tunafish said:
I believe the only non-memory instructions that issue into AGLUs are inc and dec.

According to the Optimization Guide's Table 10 the AGUs don't execute anything other than address generation. Even though these tables could contain (more) errors, handling just inc/dec wouldn't make sense considering the result forwarding required for just these couple of instructions.

No, but among the compiler people I know who visit such conferences, it's considered in the bag. It hasn't been publicized probably so they can disable it if it fails testing, like HyperThreading was originally.

Interesting. Sounds quite plausible indeed. Even if it works fine they may not enable it till Broadwell, just to give it a worthwhile selling point. Also HTM support only becomes critical when the thread count increases, which may not happen till the 14 nm shrink anyway.

The instruction tables are also wrong. Some of the errors are borderline hilarious. It's not worth trusting any of it.

Could you point out some hilarious ones?

denev2004 · Dec 10, 2011

Nick said:
CPU threads are entirely different from what GPU designers typically call threads (which should probably be called strands or fibers instead). Calling individual SIMD lanes cores also adds much to the confusion...

To bring the strengths of GPU architectures to the CPU, wider SIMD is more critical than more cores/threads. That appears to be exactly where the AVX roadmap is heading. My concern is that AMD didn't anticipate the need for stronger SIMD and will be playing catch up with Intel for at least another two generations.

Well, I just mean they prefer the idea of lots of execution units as well as lots of threads, instead of using vector or SIMD units.

tunafish · Dec 10, 2011

Nick said:
According to the Optimization Guide's Table 10 the AGUs don't execute anything other than address generation. Even though these tables could contain (more) errors, handling just inc/dec wouldn't make sense considering the result forwarding required for just these couple of instructions.

Could you point out some hilarious ones?

According to the guide, the AGUs don't actually do address generation.

Code:

ADD reg, mem   EX0 | EX1 FastPath Single 5
MOV reg, mem32 EX0 | EX1 FastPath Single 4
MOV mem, reg   EX0 | EX1 FastPath Single 4

Interesting. Sounds quite plausible indeed. Even if it works fine they may not enable it till Broadwell, just to give it a worthwhile selling point. Also HTM support only becomes critical when the thread count increases, which may not happen till the 14 nm shrink anyway.

Another possibility is that they are going to segment it to expensive Xeon only, as it's mostly wanted by servers. Which would suck.

Nick said:
Why? By extension you appear to be implying that 2 x 4 ALU is cheaper than 2 x 2 shared ALU = 1 x 4 ALU, which clearly can't be the case.

It's quite possible that sharing only part is worse than sharing everything (like Hyper-Threading)

It is. As I said -- the more things there are that need to talk to each other, the more expensive it is. In HT, there is a single register file that everything is connected to. If you share part of the units, you need to route everything to two register files.

It's actually wrong to think of HT as shared units. As far as the execution cluster is concerned, there is no difference at all between the instructions coming from the two threads -- all that is handled by the decode and register rename hardware. They all end up in the same pool and use the same registers. They just never have dependencies on each other.

According to the Optimization Guide's Table 10 the AGUs don't execute anything other than address generation. Even though these tables could contain (more) errors, handling just inc/dec wouldn't make sense considering the result forwarding required for just these couple of instructions.

That was my reaction. Alas, my friend tested, and a single core running add add inc inc can do it in a clock.
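
A back-of-the-envelope way to check both observations (the add add inc inc test and the load/store/inc loop from the earlier post) is to count ops per pipe class. The sketch below assumes two EX pipes and two AGLU pipes per core, one AGLU slot for each load or store address, and ops that spread evenly within a class. The pipe assignments are hypotheses being compared, not documented behaviour, and the names are made up for illustration.

```python
from fractions import Fraction

# Assumed pipe counts per integer core (hypothetical model parameters).
PIPES = {"EX": 2, "AGLU": 2}


def min_cycles(op_counts, pipes=PIPES):
    """Lower bound on cycles per iteration: the busiest pipe class sets the pace."""
    return max(Fraction(op_counts.get(name, 0), count) for name, count in pipes.items())


# Sequence: add, add, inc, inc
SEQ_INC_ON_AGLU = {"EX": 2, "AGLU": 2}  # adds on EX, incs on AGLU
SEQ_INC_ON_EX = {"EX": 4, "AGLU": 0}    # everything competes for EX

# Loop body: load(+op), store, inc
LOOP_INC_ON_AGLU = {"EX": 1, "AGLU": 3}  # load address, store address, inc
LOOP_INC_ON_EX = {"EX": 2, "AGLU": 2}    # op and inc on EX, two addresses on AGLU

if __name__ == "__main__":
    print("add add inc inc, inc on AGLU:", min_cycles(SEQ_INC_ON_AGLU), "cycles")
    print("add add inc inc, inc on EX:  ", min_cycles(SEQ_INC_ON_EX), "cycles")
    print("loop, inc on AGLU:           ", min_cycles(LOOP_INC_ON_AGLU), "cycles per iteration")
    print("loop, inc on EX:             ", min_cycles(LOOP_INC_ON_EX), "cycles per iteration")
```

With these assumptions, one clock for add add inc inc is only reachable if inc issues to the AGLU pipes, and the same assignment puts the loop at 1.5 cycles per iteration, consistent with "less than an iteration per cycle".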

Nick · Dec 10, 2011

denev2004 said:
Well, I just mean they prefer the idea of lots of execution units as well as lots of threads, instead of using vector or SIMD units.

I'm still not sure where you're going with this argument. GPUs already have wide SIMD units, and now they can only add more cores.

That is not an indication that CPUs should focus on more cores. Widening the SIMD units is the cheapest way to increase throughput.

Nick · Dec 10, 2011

tunafish said:
Another possibility is that they are going to segment it to expensive Xeon only, as it's mostly wanted by servers. Which would suck.

That would indeed seriously suck, but I don't think it makes sense for them to do that. Quad-core is pretty much mainstream now and if they want single applications to make use of an increasing number of cores/threads then HTM becomes vital. I'd even argue that it's less important for the server market since they mostly run multiple independent single-threaded applications.

Also given the focus on auto-vectorization with AVX2, I expect they also intend to promote the development of automated tools to create multi-threaded applications. Enabling such software with HTM is the only way for Intel to convince people to continue to buy CPUs with more cores. In the short term SIMD offers a higher return on investment, but they have to think longer term as well so HTM should be made available sooner rather than later.

It is. As I said -- the more things there are that need to talk to each other, the more expensive it is. In HT, there is a single register file that everything is connected to. If you share part of the units, you need to route everything to two register files.

I'm not suggesting keeping two register files.

That was my reaction. Alas, my friend tested, and a single core running add add inc inc can do it in a clock.

That's seriously amazing and totally messed up at the same time. :|

Gipsel · Dec 10, 2011

Nick said:
I'm not suggesting keeping two register files.

The discussion runs in circles. I guess it is now the third loop or so.

denev2004 · Dec 11, 2011

Nick said:
I'm still not sure where you're going with this argument. GPUs already have wide SIMD units, and now they can only add more cores.

Are you sure the kind that Fermi has can be called "wide SIMD" compared to AVX?

Gipsel · Dec 11, 2011

denev2004 said:
Are you sure the kind that Fermi has can be called "wide SIMD" compared to AVX?

Fermi has wide SIMD units. They accept one instruction every second cycle for a vector with 32 elements.
By the way, the high latency (18 cycles minimum) of the Fermi units is caused in large part by the scheduling scheme (it does not allow result forwarding, i.e. register accesses add to the latency) and by the fact that a register file is shared between several (physically large) units. Compare with GCN: not only does each SIMD unit have its own register file, it is also distributed so that each lane of a SIMD unit has its own. That results in a latency of just 4 cycles. Larrabee of course has a single register file for its single SIMD unit, although it can permute operands before feeding them to the unit (so not a distributed one like AMD GPUs), and it also reaches 4 cycles latency for the simpler arithmetic operations.
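
The issue-rate figures above can be turned into rough numbers. This sketch uses only the values quoted in this post (32 elements every second cycle, 18 cycles minimum latency) plus the 256-bit AVX register width; the helper names are hypothetical, and real hardware has more constraints than this arithmetic captures.

```python
import math


def lanes_per_cycle(vector_elements, issue_interval_cycles):
    """Average number of vector elements started per cycle by one SIMD unit."""
    return vector_elements / issue_interval_cycles


def issues_to_cover_latency(latency_cycles, issue_interval_cycles):
    """Independent instructions needed in flight to keep the unit busy."""
    return math.ceil(latency_cycles / issue_interval_cycles)


def simd_lanes(register_bits, element_bits):
    """Number of elements that fit in one SIMD register."""
    return register_bits // element_bits


if __name__ == "__main__":
    print("Fermi elements per cycle per unit:", lanes_per_cycle(32, 2))
    print("Independent issues to cover 18 cycles:", issues_to_cover_latency(18, 2))
    print("AVX 32-bit lanes per 256-bit register:", simd_lanes(256, 32))
```

By this count a Fermi unit starts the equivalent of 16 elements per cycle against 8 single-precision lanes in a 256-bit AVX register, but it needs about nine independent instructions in flight to cover its latency.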
