I Can Hazwell?

This is incorrect. All software uses only virtual addresses; the OS is free to relocate the physical pages anywhere it wants.
It was my understanding that these virtual addresses were transformed into actual addresses once the program code was loaded somewhere into RAM by the operating system and would thus be unable to move, but if that's not true then it's pretty cool.

Modern Linux can do things like migrating memory to a closer NUMA node, or migrating pages to merge many small pages into a few 2 MB ones.
That sounds extremely useful, actually. Considering the number of people experiencing crashes in certain misbehaved software like Bethesda's Skyrim due to memory fragmentation, it would have been nice if Windows had been able to do this as well. Any idea why Microsoft hasn't bothered to pursue this path? Perhaps they're too content with their current market dominance... *shrug*
 
But OMG, this thing is a beast. AMD thought there's not much point in having a 3rd INT ALU, and Intel now has 4...
Intel has hyperthreading, so they can feed the ALUs from two instruction streams. Four ALUs is overkill for ILP alone, but add the TLP from two threads to the mix, and the situation becomes very different. As long as other parts of the chip are not a bottleneck, hyperthreading (two threads on a single core) should have performance closer to two separate (2 ALU) cores (in ALU tasks). It's not looking good for AMD.
A microcoded sequence could still be faster.
It would likely be at least slightly faster. If nothing else is improved, at least gather takes only one slot in (x86) L1 instruction cache (but several slots in uop cache), and they can choose the optimal uop sequence for the processor (x86 compilers are too general purpose for this task). But that's a bit of a pessimistic view, I must admit. Maybe I have spent too much time evading stuff like microcoded imul and sraw (variable shifts) in console programming :)
Port 6 and 7 provide integer capability and branching, while also keeping the vector pipes unencumbered.
An internal gather loop could utilize the extra integer operand access and branch capability of the extra ports without simultaneously blocking the vector pipes that would make use of a gather instruction. The store AGU all by itself seems unbalanced, unless it's sharing that port with something they've chosen not to discuss yet, maybe the specialized hardware that would scan a gather index register and detect how many belong to the same cache line.
Port 5 has vector shuffles, which might include that permute unit that both gather and vector work would like. It's not zero-sum because a gather would provide data from memory in the desired arrangement.
Good points. That would (also) be a good use for the extra ALU/branch ports. If your algorithm is vector math heavy (almost no ALU ops) and doesn't include too many gathers, the CPU should be able to mask out ("co-issue") all the microcoded ALU ops from the gather. But for algorithms that already have interleaved ALU and vector ops, this technique would make the ALU a bottleneck. It would also prevent the other thread (HT) from running ALU-heavy code while this one runs vector-heavy code... but of course the current ways of doing gather manually are even worse. And the fourth ALU helps in both cases.
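For reference, the manual gather in question looks roughly like this today (a sketch with AVX intrinsics; the helper name gather8_ps and the float/index layout are just for illustration):

#include <immintrin.h>

/* Manual "gather" as done today without an AVX2 instruction: spill the eight
   indices, do eight scalar loads, and rebuild the vector. Purely illustrative;
   a real routine would care about alignment and bounds checking. */
static __m256 gather8_ps(const float *base, __m256i idx)
{
    int i[8];
    _mm256_storeu_si256((__m256i *)i, idx);      /* extract the indices to memory */
    return _mm256_set_ps(base[i[7]], base[i[6]], base[i[5]], base[i[4]],
                         base[i[3]], base[i[2]], base[i[1]], base[i[0]]);
}

All of that extract/load/insert traffic is exactly the ALU and shuffle pressure being discussed above.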
The breadth of the engineering effort for this architecture visually dwarfs the competition. The promotion of integer vector instructions to 256-bit is going to put some serious hurt on one of the few areas BD was not outclassed in.
The fourth ALU should be improving integer performance as well (especially with hyperthreading). It seems that they have made some very good architectural choices that fit together very well. Previously (a year ago) I thought that gather would be one of the key new features of this architecture, but Haswell has so much more than that to offer. I can't wait to do some performance tests with transactional memory. Assuming it's fully L1-based, a transaction cannot access more than 32KB of memory (minus hyperthreading, minus cache aliasing = around 10KB to be sure). But that's more than enough for games, as game access patterns are usually cache line optimized and limited in scope. Enterprise software, however, might need more than Haswell's L1 has to offer for its transactions.
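Those tests will presumably look something like this once compilers expose the published RTM intrinsics (a sketch; assumes GCC-style _xbegin/_xend from <immintrin.h>, built with -mrtm, on a TSX-enabled part):

#include <immintrin.h>
#include <stdio.h>

static long counter;

/* Minimal RTM sketch: try the update transactionally, fall back otherwise.
   The fallback should take a real lock in production code. */
static void bump(void)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        counter++;                              /* runs transactionally; an abort rolls this back */
        _xend();
    } else {
        __sync_fetch_and_add(&counter, 1);      /* abort path */
    }
}

int main(void)
{
    for (int i = 0; i < 1000; i++)
        bump();
    printf("%ld\n", counter);
    return 0;
}

The interesting measurements would be the abort rate as the transaction footprint approaches that L1-sized limit.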
 
Hopefully they spill the beans soon. I just hope that you don't have to code a loop for it (like you must for Knights Corner). The worst case is that it's just a long microcoded sequence, but that wouldn't make much sense. I am keeping my pessimistic view until Intel proves me otherwise. Efficient gather is almost too good to be true :)

It adds a lot of complexity to support a single-instruction gather. A gather instruction could generate a multitude of addresses that all cause an MMU page-walk; the accesses themselves would require a multi-ported cache to be efficient, with each access potentially causing a full cache miss.

Worst case, a single gather instruction could take several thousand cycles to complete. So you either make it interruptible or suffer intolerable interrupt latency; the former means you need to save partial register state (*ugh*), the latter is just not acceptable.

Mind you, a fair fraction of accesses are likely to miss caches and with 4 to 8 cores on a chip it'll be relatively easy to saturate the main memory interface anyway.

So you end up spending a lot of complexity and power on something that might not add a whole lot of performance in the end.

Cheers
 
Considering the number of people experiencing crashes in certain misbehaved software like Bethesda's Skyrim due to memory fragmentation, it would have been nice if Windows had been able to do this as well. Any idea why Microsoft hasn't bothered to pursue this path?
Skyrim is a 32 bit executable, and has 32 bit pointers. It runs out of virtual address space, so no reordering can help it. 64 bit pointers allow a 64 bit virtual address space. With 64 bit pointers you pretty much never run out of virtual address space.
 
It runs out of virtual address space, so no reordering can help it.
Yeah, I know it's 32-bit, but if Windows supported reordering, maybe a large enough chunk of contiguous memory could be presented to the game.
 
Yeah, I know it's 32-bit, but if Windows supported reordering, maybe a large enough chunk of contiguous memory could be presented to the game.
No operating system can reorder your software's own virtual address space. It can only reorder the physical data in memory and update the virtual address tables accordingly. If you have only 32-bit pointers in your game and you do a lot of dynamic memory allocation, you will eventually run out of contiguous memory blocks (in the 32-bit virtual address space), and there's nothing an OS can do to help you.
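Roughly what that looks like in practice, if you want to see it happen (a sketch meant to be built as a 32-bit binary, e.g. gcc -m32; exact numbers depend on the allocator, so treat it as illustrative):

#include <stdio.h>
#include <stdlib.h>

#define BLOCK (1 << 20)              /* 1 MB blocks */
#define MAX_BLOCKS 4096              /* enough to cover a 32-bit address space */

int main(void)
{
    static void *blocks[MAX_BLOCKS];
    int n = 0;

    while (n < MAX_BLOCKS && (blocks[n] = malloc(BLOCK)) != NULL)
        n++;                         /* grab as much of the address space as we can */

    for (int i = 0; i < n; i += 2)
        free(blocks[i]);             /* half the memory is free again... */

    void *big = malloc(16 * BLOCK);  /* ...but no single hole is 16 MB of contiguous addresses */
    printf("allocated %d MB, freed %d MB, 16 MB request %s\n",
           n, (n + 1) / 2, big ? "succeeded" : "failed");
    return 0;
}

Half the memory is free at the end, yet the large request can still fail, because the free space is scattered in holes smaller than the request.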
 
It was my understanding that these virtual addresses were transformed into actual addresses once the program code was loaded somewhere into RAM by the operating system and would thus be unable to move, but if that's not true then it's pretty cool.

No, that's what page tables and TLBs are for. Basically, when you issue a load, the first thing that happens is that the CPU looks for the address you gave it from the TLB. If found, it takes the physical address stored in the TLB, and uses it instead. If not found, it fires up the page walker, and walks the page tables (an in-memory data structure) to find the correct physical address (and stores it in the TLB). If still not found, it interrupts into the OS and lets it handle it.

So the address translation is entirely dynamic and run-time. It's how processes are separated on multi-tasking operating systems -- your address 0x4000 can point to something completely different than my 0x4000, and the privileged operating system structures are not found in either of our address spaces.
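For the curious, the walk amounts to something like this in software terms (a simplified sketch of a 4-level x86-64 walk with 4 KB pages; read_phys is a hypothetical helper, and large pages, permission bits and the TLB fill itself are left out):

#include <stdint.h>

uint64_t read_phys(uint64_t phys_addr);   /* hypothetical: read 8 bytes of physical memory */

uint64_t translate(uint64_t cr3, uint64_t vaddr)
{
    uint64_t table = cr3 & ~0xFFFULL;                        /* top-level table base */
    for (int level = 3; level >= 0; level--) {
        unsigned idx = (vaddr >> (12 + 9 * level)) & 0x1FF;  /* 9 index bits per level */
        uint64_t entry = read_phys(table + idx * 8ULL);
        if (!(entry & 1))                                    /* present bit clear: fault to the OS */
            return (uint64_t)-1;
        table = entry & 0x000FFFFFFFFFF000ULL;               /* next table, or the final page frame */
    }
    return table | (vaddr & 0xFFF);                          /* frame base + offset within the page */
}

The hardware page walker does exactly this kind of pointer chase, which is why a TLB hit is so much cheaper than a walk.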

You can actually do all kinds of neat things with page tables. For example, Azul Systems uses them for non-blocking GC. Basically, they give you a cheap(ish) hook you can invoke on any memory access to a given page (on x86, that's 4K/2M/1G granularity).
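The user-space flavor of that trick on Linux is just mprotect plus a fault handler; this is not Azul's actual mechanism, just the same idea in miniature (sketch):

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static long pagesz;

/* The "hook": runs on the first access to a protected page. A GC would do its
   bookkeeping here; we just unprotect the page so the access can be replayed. */
static void on_access(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1));
    mprotect(page, pagesz, PROT_READ | PROT_WRITE);
}

int main(void)
{
    pagesz = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, pagesz, PROT_NONE,          /* any touch of this page faults */
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = { .sa_sigaction = on_access, .sa_flags = SA_SIGINFO };
    sigaction(SIGSEGV, &sa, NULL);

    p[0] = 42;                                       /* triggers the hook, then succeeds */
    printf("%d\n", p[0]);
    return 0;
}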


That sounds extremely useful, actually. Considering the number of people experiencing crashes in certain misbehaved software like Bethesda's Skyrim due to memory fragmentation

This wouldn't actually help. The things that get fragmented are the 32-bit virtual addresses -- the physical pieces of RAM can be moved about at will, but the 32-bit addresses cannot change, simply because the OS would then have to fix up every address in the program, and it cannot know what is an address and what is an unfortunately chosen integer.

it would have been nice if Windows had been able to do this as well. Any idea why Microsoft hasn't bothered to pursue this path? Perhaps they're too content with their current market dominance... *shrug*

They just can't keep up. In the internals, modern Linux is now about a decade ahead of Win8, and the difference is growing, not decreasing.
 
Intel's improved fetch bandwidth and widened the back end to handle two branches per cycle.
It doesn't mention predicting two branches per cycle, though.

Unless they decoupled the branch predictor from the rest of the frontend like AMD did I don't think they'd really even be able to predict multiple taken branches in one cycle. One block is loaded from fetch/uop cache and for that you can only make use of one BTB hit. No later instructions in the block would apply.

You could benefit from being able to predict multiple untaken branches in a block (up to the end or first taken branch). It may already do this. I know the BTB supports up to 4 branches per fetch block in SB; the prediction resolution before lookup may be capable of predicting all four in parallel.

Hmm, interesting that both port 0 and port 1 can do FMA, and port 1 can now do FP mul too, but port 0 can't do FP add. Any ideas why that would be?

My guess would be this: on SB/IB, FADD and FMUL latency is only 3 cycles but on Haswell FMA latency is 5 cycles which is substantially higher. David Kanter has remarked that Intel engineers found Bulldozer's 5-6 cycle FMA latency to be a weakness, so I don't think they'd be happy with 5 cycles for FADD and FMUL. So I'm guessing they did what they could to bypass the FMA unit to reduce latency for FADD and FMUL: you can get a multiply result early and start an add late. And for the early multiply result the rest of the FMA is a don't care, if it runs at all, but for the early add you have to feed it a 0 to start with. So it may be that a fast FADD is more complex to support than a fast FMUL and therefore they only have one.

It would likely be at least slightly faster. If nothing else is improved, at least gather takes only one slot in (x86) L1 instruction cache (but several slots in uop cache), and they can choose the optimal uop sequence for the processor (x86 compilers are too general purpose for this task).

In SB/IB the uop cache doesn't store more than the first few uops from a microcode sequence. Were you thinking that Haswell would expand entire microcode routines into the uop cache? I'm not sure they'd do this because it'd complicate the mapping between uop cache and L1 instruction cache and it'd also open up the potential for uop cache thrashing with lots of microcode instructions which would all have to be inlined into the cache to get proper performance.

Without such a mechanism Haswell would need much faster microcode ROM throughput to maintain a fast microcoded gather. Historically it has only been one uop per cycle, and while it's active the decoders can't provide anything. This might be enough for the gather itself (depending on what microcode is available), but it stalls everything else. It's hard to imagine Intel investing in either the ability to dispatch from the microcode and the uop cache/decoders simultaneously or a wide microcode ROM that can feed several uops per cycle, but I really wouldn't know what they do and don't find practical here.

Barring that I'd expect the gather to be done by an independent hardware state machine, regardless of whether or not it can service multiple loads per cycle. Even if it's stuck at one load per cycle it'll still be a lot better than the current alternative.

It adds a lot of complexity to support a single-instruction gather. A gather instruction could generate a multitude of addresses that all cause an MMU page-walk; the accesses themselves would require a multi-ported cache to be efficient, with each access potentially causing a full cache miss.

Worst case, a single gather instruction could take several thousand cycles to complete. So you either make it interruptible or suffer intolerable interrupt latency; the former means you need to save partial register state (*ugh*), the latter is just not acceptable.

Mind you, a fair fraction of accesses are likely to miss caches and with 4 to 8 cores on a chip it'll be relatively easy to saturate the main memory interface anyway.

So you end up spending a lot of complexity and power on something that might not add a whole lot of performance in the end.

Cheers

It's not pretty, but the gather could be replayed in its entirety upon any fault whatsoever, and if you so desire, after an interrupt. All that's required is that the cache and TLB have at least as many ways as there are elements in the vector, because otherwise you could get an infinite loop where the later fields keep evicting the former ones. This shouldn't be a problem for Haswell. In the normal case, the cost of a cache miss is big compared to the cost of redoing the earlier loads which are now in cache. An interrupt can evict the stuff that was gathered from the cache, but that's not that much worse than it evicting any of the rest of the program's working set (not to mention, having to save/restore registers). And interrupts aren't really frequent enough for this to be a concern.

This may be why AVX2 has no scatter instruction. Replaying a half-done scatter has more consequences.

No one would expect a gather instruction that's single cycle if everything's in cache. The reasonable highest end expectation is a gather that can load multiple elements if they're all in the same cache line, like Larrabee can. Tons of code would benefit from this. A lot of useful gathers can even have multiple fields going to the exact same address.
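Checking that property is cheap, which is why it seems plausible in hardware; in software it is just a compare of line addresses (a sketch; 64-byte lines and 4-byte float elements are assumptions):

#include <stdint.h>

/* How many of the eight gather indices hit the same 64-byte line as the first
   one. A Larrabee-style gather would satisfy all matching lanes in one access. */
static int lanes_on_first_line(const float *base, const uint32_t idx[8])
{
    uintptr_t first_line = (uintptr_t)&base[idx[0]] >> 6;    /* 64 B = 2^6 */
    int count = 0;
    for (int lane = 0; lane < 8; lane++)
        if (((uintptr_t)&base[idx[lane]] >> 6) == first_line)
            count++;
    return count;                          /* 8 means one line serves the whole gather */
}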
 
My guess would be this: on SB/IB, FADD and FMUL latency is only 3 cycles but on Haswell FMA latency is 5 cycles which is substantially higher. David Kanter has remarked that Intel engineers found Bulldozer's 5-6 cycle FMA latency to be a weakness, so I don't think they'd be happy with 5 cycles for FADD and FMUL. So I'm guessing they did what they could to bypass the FMA unit to reduce latency for FADD and FMUL: you can get a multiply result early and start an add late. And for the early multiply result the rest of the FMA is a don't care, if it runs at all, but for the early add you have to feed it a 0 to start with. So it may be that a fast FADD is more complex to support than a fast FMUL and therefore they only have one.
Hmm, that makes sense, though you're wrong about the latencies. Only FADD is 3 cycles on SNB/IVB/HSW; FMUL is 5 cycles, same as FMA. So maybe FADD indeed has some special path to get latency down to 3, whereas FMUL can just use the ordinary FMA path. This is indeed different from AMD, which has had the same latency for FMUL and FADD (and now FMA) for ages (K8/K10 had latency 4, BD latency 5-6).
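Those chain latencies are easy enough to sanity-check once hardware is in hand, with a dependent-chain probe along these lines (a rough sketch: rdtsc counts reference cycles, so turbo, warm-up and loop overhead all smear the number; compile with -mavx, or -mfma for the FMA variant):

#include <immintrin.h>
#include <x86intrin.h>
#include <stdio.h>

#define N 100000000L

int main(void)
{
    __m256 x = _mm256_set1_ps(1.0f);
    __m256 a = _mm256_set1_ps(1e-7f);

    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < N; i++)
        x = _mm256_add_ps(x, a);     /* swap in _mm256_mul_ps / _mm256_fmadd_ps to compare */
    unsigned long long t1 = __rdtsc();

    /* Each iteration depends on the previous result, so the time per iteration
       approximates the latency of the chained operation. */
    printf("~%.2f cycles per dependent op (sink: %g)\n",
           (double)(t1 - t0) / N,
           _mm_cvtss_f32(_mm256_castps256_ps128(x)));
    return 0;
}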
 
You're right, my mistake. All the more reason why it only supports one FADD, though. It's possible that it's implemented with an entirely separate unit.

Having a big difference in latency between FADD and FMUL is actually kind of surprising; the significand multiplication itself must be eating a lot of that, because you'd expect the normalization to be more expensive with the add.
 
I'd give another vote to "FADD is probably its own dedicated unit", both because FADD units are much cheaper than multiply ones, and because scheduling instructions gets a lot harder in cases where you can stuff things into the middle of a pipeline.

Also, the slides for ARCS004 and 005 are now up. TSX works on L1 (only), and gather zeroes out elements in the mask register when it successfully fetches them -- this way, after any fault the gather can just be restarted and it keeps all the work it has already done.
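In other words, the mask works as a progress bitmap. A scalar reference model of that behavior (just a sketch of the semantics, not the hardware; dword elements assumed):

#include <stdint.h>

/* Reference-model sketch of a masked, restartable gather over 8 dword lanes.
   A lane is attempted only if its mask is set; on success the lane's mask is
   cleared, so replaying after a fault redoes only the unfinished lanes. */
static void gather_model(uint32_t dst[8], const uint32_t *base,
                         const uint32_t idx[8], uint32_t mask[8])
{
    for (int lane = 0; lane < 8; lane++) {
        if (mask[lane] == 0)
            continue;                  /* already completed on an earlier attempt */
        dst[lane] = base[idx[lane]];   /* the real hardware may fault here... */
        mask[lane] = 0;                /* ...in which case this lane stays pending */
    }
}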
 
So, sadly, it's almost doubly confirmed that the next Core i3 4xxx parts will ridicule the upcoming quad-core (2-module) Steamroller for all personal usage.
AMD sounds like it's really in a tough spot... especially as Haswell might be released before Steamroller.

The scary part is that whereas the Jaguar cores look nice, we have no release date and Intel has a counter... If Atom uses the same power-saving techniques as Haswell, and taking into account how long Intel must have been working on that one, I'm scared that the "really nice Jaguar core" may look like a toy next to Intel's offering.
 
So, sadly, it's almost doubly confirmed that the next Core i3 4xxx parts will ridicule the upcoming quad-core (2-module) Steamroller for all personal usage.
AMD sounds like it's really in a tough spot... especially as Haswell might be released before Steamroller.

I think the aim of Bulldozer, and therefore Steamroller was never really to compete with Intel on a core-for-core basis, but to allow for more cores within the same transistor and power budget. This is precisely what AMD does with the FX lineup—well, technically they use more transistors and power, but still—and on the A lineup, that is for APUs, they choose to spend the extra transistors and power on the GPU.

Steamroller should continue that trend and, frankly, if Kaveri really does deliver a 30% performance improvement over Trinity, it should be more than enough for most people, so spending extra transistors and power on the GPU seems like the right thing to do.

The scary part is that whereas the Jaguar cores look nice, we have no release date and Intel has a counter... If Atom uses the same power-saving techniques as Haswell, and taking into account how long Intel must have been working on that one, I'm scared that the "really nice Jaguar core" may look like a toy next to Intel's offering.

We'll see. I think the new Atom is supposed to be OoO—which would make it an Atom only in name—but it's still targeted at phones while Jaguar isn't meant to go any lower than tablets. Different power targets usually mean different performance targets too. I wouldn't write AMD off just yet.
 
Now that the GPU in compute operation has 0.5 TB/s of bandwidth to the last level of cache, the thing could literally fly.

Unless it was specified otherwise, I think the cache in question means the GPU-dedicated L3 cache rather than the LLC. That makes a big difference.
 
[offtopic]

I think the aim of Bulldozer, and therefore Steamroller was never really to compete with Intel on a core-for-core basis, but to allow for more cores within the same transistor and power budget.

yep.

But this is a bad strategy.

A) What is usually needed is
1) One to a few very powerful CPU cores, for code which does not parallelize well
2) A large number of weak cores, for massively parallel code
Very little code needs something like 6-12 relatively powerful cores; that's either more than you need or not enough.

B) Transistors are getting "almost free", so sacrificing single-thread performance to save transistors is just a bad tradeoff, especially when those saved transistors are only used to put more "semi-powerful" cores on the chip.

Sacrificing single-thread IPC, however, might be reasonable in cases where it allows higher total performance through higher clock speeds, or considerable power savings.

Single-thread performance is still very important, and now that it's harder to get single-thread performance improvements, that just means CPU developers should concentrate on it more, not give up.

I think Intel has the better strategy here. They are concentrating more on single-thread performance, and using SMT to also get improvements with multiple threads.

AMD's strategy with Fusion would also be very good if they had just executed it correctly and designed a proper high-single-thread-performance CPU core for it, instead of having to use either 4 outdated cores or 4 new semi-powerful "mini-cores".


Intel is also bringing "many weak cores" into play with the Larrabee/Knights line; I wonder when they'll release a single chip with both high-end x86 cores and Larrabee cores.

[/offtopic]
 
A microcoded sequence could still be faster.

Microcode has a lot of rather thorny problems that aren't fun to implement, verify, or validate on a CPU. The coded-sequence solution provides a defined, bounded operation that can be implemented and verified much more easily.

The problem really is one of bounds. Consider that each entry in the gather could potentially point to a different PTE, and that PTE may or may not be loaded into the TLBs or even the cache. And I cannot recall off the top of my head whether the PTEs can themselves be virtually/indirectly allocated, etc. So for one gather you could be looking at potentially 16+ TLB fills + page faults + memory accesses, etc. We're talking upwards of thousands of cycles in what would, in the microcoded case, be an atomic operation with significant implications up and down the architecture and validation stack. If you look at the errata for various processors, you are likely to find many entries associated with long, complicated atomic memory operations.

By implementing it as a load->mask->fill->update instruction, the side effects are significantly restricted and the performance difference should be minimal in a modern core.
 
And as a follow up, doing it as a looped instruction sequence:

A: GATHER Y,X+([IMM]/W),Z
B: BNZ Z, A

has the benefit of allowing things like offloading to micro engines in the future if desired, while not requiring it at the start. It would be perfectly possible to later stick a microengine off an L1 or L2 and pass the instruction through to the microengine to do the whole gather.

Given the common use cases for gather, having a small microengine with 16-64 cache lines (1-4 element vectors, aka RGBA/XYZ/XY/etc.) would enable very fast gather generation.
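In C terms, that loop form would behave roughly like this (a sketch; the per-iteration granularity of "everything on one cache line" is an assumption, and __builtin_ctz is GCC/Clang-specific):

#include <stdint.h>

/* Each "GATHER step" completes the lanes it can (here, everything on the first
   cache line it touches) and the loop repeats until the mask Z is empty,
   which is what the BNZ Z, A branch checks. */
static void gather_loop(uint32_t dst[8], const uint32_t *base,
                        const uint32_t idx[8], uint32_t mask /* 8 lane bits */)
{
    while (mask != 0) {                                       /* the BNZ Z, A part */
        int first = __builtin_ctz(mask);                      /* lowest pending lane */
        uintptr_t line = (uintptr_t)&base[idx[first]] >> 6;   /* its 64-byte line */
        for (int lane = first; lane < 8; lane++) {            /* one GATHER step */
            if ((mask & (1u << lane)) &&
                ((uintptr_t)&base[idx[lane]] >> 6) == line) {
                dst[lane] = base[idx[lane]];
                mask &= ~(1u << lane);                        /* lane done, clear its bit */
            }
        }
    }
}

Each iteration clears at least one lane, so the loop is bounded by the vector width no matter how scattered the indices are.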
 
I think the aim of Bulldozer, and therefore Steamroller was never really to compete with Intel on a core-for-core basis, but to allow for more cores within the same transistor and power budget. This is precisely what AMD does with the FX lineup—well, technically they use more transistors and power, but still—and on the A lineup, that is for APUs, they choose to spend the extra transistors and power on the GPU.
Steamroller should continue that trend and, frankly, if Kaveri really does deliver a 30% performance improvement over Trinity, it should be more than enough for most people, so spending extra transistors and power on the GPU seems like the right thing to do.
Well, I've made up my mind, so to speak, no matter how many people tell me the situation is not that bad wrt CMT.

CMT did not deliver on the premise of "more cores within the same silicon and power budget".
Trinity modules are smaller than two Star cores; they are more featured, but not enough to make a significant difference (I made a rough measurement). If not for the better power-management features in power-constrained environments and the support for new instructions, the Star cores would still be better.
IMO it's the contrary: from an "industrial/production" point of view, the module approach actually offers less granularity than smaller cores. Back in the day AMD could sell 1- and 3-core variants; they no longer can.

On the IGP-integration side of the equation, Intel is sadly ahead of AMD. AMD seems completely focused on fixing their modules; the "uncore" progresses really slowly, if at all.
And AnandTech seems to confirm that fast memory will make it into Haswell. AMD may lose here too. On the compute side of things, Intel's IGP was already arguably better.
AMD is mostly just putting its CPU parts and GPU parts together, whereas Intel develops its APU as a whole. Maybe if AMD were not putting all their effort into fixing their modules' performance, they could do better here too.


In the meantime, a quick list of what they have postponed fixing:
The L3 (won't be done before at least 2014)
FP/SIMD performance (won't be done before at least 2014)
Support for AVX2 (won't be done before at least 2014)
Single-thread performance should catch up with their prior architecture maybe in 2013 with Steamroller.

Overall, I fail to see how AMD could be in a worse situation if they had passed on CMT.
They had proven solutions in front of them with SMT and the cache hierarchies of CPUs like Nehalem and POWER7. They decided to come up with their own take, and for me it failed. They should have reached the bitter and difficult conclusion, as soon as BD launched (or not that long after engineering samples were out), to push BD out the door (or scrap it) and start something new.

A 3-issue standard CPU core which would include all the refinements they put into BD and then PD: I fail to understand how such a CPU would not completely outperform their previous architectures, and as such it would be closer to Intel's offering.
Such a CPU might have ended up bigger than both a Star core and half a BD/PD module, but by how much? I suspect not by that much, not even enough to significantly change their costs.
It may also be a bit more power hungry, but it might allow for better power management and turbo. You have more granularity: you could change clock speed and clock gate on a per-core basis instead of a per-module basis (at the coarse-grained level).
If they didn't/couldn't copy IBM's or Intel's approach to the cache hierarchy, they could have come up with something akin to Jaguar's, which looks saner. I can't see (or understand) why AMD, which is still doing great things (maybe while beating a dead horse...), could not successfully engineer something like that.

At least they could fight Intel's dual cores with tri-cores instead of quad cores (better use of salvage parts) and have a chance to actually look good.
They are lagging Intel more and more.

All this sounds a bit like angst, but I believe that AMD can do so much better. The sad thing, IMHO, is that by 2014, when or if most of the CMT approach's pitfalls have been fixed (while still not bridging the gap with Intel, rather the contrary), and depending on the success of Windows RT, they might be threatened by ARM64 CPUs. ARM is already further along the APU road than AMD, with its Mali/A15 combination. They are going to end up between a rock and a hard place :(

We'll see. I think the new Atom is supposed to be OoO—which would make it an Atom only in name—but it's still targeted at phones while Jaguar isn't meant to go any lower than tablets. Different power targets usually mean different performance targets too. I wouldn't write AMD off just yet.
I don't think they are meant for phones only. I'll try to be optimistic, but as with Steamroller, Jaguar has no release date; AMD may have only a short head start.

EDIT

Oops, sorry, I just realized that we are indeed in the wrong thread to discuss this matter; sorry for the OT.
 
And as a follow up, doing it as a looped instruction sequence:

A: GATHER Y,X+([IMM]/W),Z
B: BNZ Z, A

has the benefit of allowing things like offloading to micro engines in the future if desired, while not requiring it at the start. It would be perfectly possible to later stick a microengine off an L1 or L2 and pass the instruction through to the microengine to do the whole gather.

Given the common use cases for gather, having a small microengine with 16-64 cache lines (1-4 element vectors, aka RGBA/XYZ/XY/etc.) would enable very fast gather generation.

The problem is the requirement for a new vector branch instruction. Masks in AVX2 are full vector registers, and the ISA isn't designed to branch using a vector as a condition input. You'd probably have to output to the zero flag instead. Still, that's two vectors the instruction has to write instead of one, and I'm not aware of any AVX2 instructions with such a capability. Surely that hurts the design somewhere. Larrabee gets around it by having a special register set for predicates.
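The practical workaround is to keep the loop condition scalar: test the mask vector against itself and branch on the resulting flags (sketch):

#include <immintrin.h>

/* VPTEST sets ZF when the AND of its operands is all zero, so testing the mask
   against itself tells you whether every lane has been cleared yet. */
static int gather_done(__m256i mask)
{
    return _mm256_testz_si256(mask, mask);   /* 1 once the mask is all zero */
}

That keeps the branch itself on ordinary flags, at the cost of an extra test per loop iteration.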
 