Are you basically suggesting in-order execution with n-way SMT for what is currently the FlexFP unit?
The two could coexist, or one could be modified to take on the tasks of the other. The designs are far enough apart that, for now, I think they are better kept separate.
AMD's coprocessor model allows for this. There's a latency cost, which AMD has accepted for over a decade, but it also frees the FPU to implement things however it wants; the integer side only needs to know the completion status.
If the CU keeps its own branching ability, it could free the integer pipe even more than the FlexFP does, since the ROB still has to track FP instruction completion status. The handoff would be its own instruction that could retire immediately, leaving the CU to worry about further fetches on its own.
The CU itself is already designed to serve multiple masters, whether from multiple apps or potentially from the compute control and graphics pipelines as well. A CPU would be just another client.
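To make that client relationship concrete, here's a minimal sketch of a shared-memory handoff, assuming a doorbell-plus-completion-flag protocol. The packet layout and names are hypothetical placeholders, not AMD's actual dispatch format:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical dispatch packet: the integer core fills this in and
 * then stops caring about anything except the completion flag. */
struct dispatch_packet {
    uint64_t kernel_addr;     /* CU fetches its own instructions from here */
    uint64_t arg_addr;        /* kernel arguments in shared memory */
    _Atomic uint32_t done;    /* completion status, written by the CU */
};

/* CPU side: hand off the work and retire the "handoff" immediately.
 * The CU worries about further fetches and its own branching. */
static void cpu_handoff(struct dispatch_packet *pkt,
                        volatile uint32_t *doorbell,
                        uint64_t kernel, uint64_t args)
{
    pkt->kernel_addr = kernel;
    pkt->arg_addr = args;
    atomic_store_explicit(&pkt->done, 0, memory_order_release);
    *doorbell = 1;   /* from here on, the packet is the CU's problem */
}

/* The only CU state the integer pipe ever has to track. */
static int cpu_is_done(struct dispatch_packet *pkt)
{
    return atomic_load_explicit(&pkt->done, memory_order_acquire);
}
```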
This makes me wonder, how is wavefront or thread scheduling fundamentally different from out-of-order scheduling? Isn't it simply scoreboarding versus Tomasulo?
There is no dependence checking within the buffer of wavefronts, just a readiness check at scalar issue (basically: did the last instruction finish?). The CU is described as being threaded in such a way that, on back-to-back issue, it knows the result of the previous instruction is ready.
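To make the contrast with Tomasulo concrete, here's a toy sketch of that readiness check, assuming one in-flight scalar instruction per wavefront (the structure and field names are mine, not AMD's):

```c
#include <stdbool.h>

#define NUM_WAVEFRONTS 40

/* Toy wavefront scoreboard: no dependence matrix, no tag broadcast,
 * no wakeup/select CAM as in a Tomasulo-style scheduler. */
struct wavefront {
    bool last_insn_done;   /* single readiness bit per wavefront */
    bool has_work;
};

/* Pick any wavefront whose previous instruction has completed.
 * Within one wavefront, issue stays strictly in order. */
static int pick_wavefront(struct wavefront wf[NUM_WAVEFRONTS])
{
    for (int i = 0; i < NUM_WAVEFRONTS; i++)
        if (wf[i].has_work && wf[i].last_insn_done)
            return i;      /* ready: issue its next instruction */
    return -1;             /* everyone is waiting; stall this cycle */
}
```

The latency hiding comes from having many wavefronts to choose from, not from reordering within one instruction stream, which is why the hardware cost is roughly a readiness bit per wavefront rather than per-instruction tag matching.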
What was the second way?
Well, I suppose there are four: CPU-only, CPU plus discrete, a hybrid chip, and hybrid plus discrete.
But it still means three paths can be taken: pure software using AVX2+, integrated CUs, or discrete CUs. The performance of each can be very different. I'm particularly "concerned" that a homogeneous Intel CPU taking the software path would outperform AMD's integrated CUs.
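For illustration, the fork a runtime would face might look like this; the probe functions and the size threshold are hypothetical placeholders, not a real API:

```c
#include <stddef.h>

enum compute_path { PATH_AVX_SOFTWARE, PATH_INTEGRATED_CU, PATH_DISCRETE_CU };

/* Hypothetical capability probes; a real runtime would query CPUID,
 * the driver, and the device topology. */
extern int has_discrete_cu(void);
extern int has_integrated_cu(void);

/* The same kernel can land on any of the three paths, and the
 * performance of each can be very different. */
static enum compute_path choose_path(size_t work_items)
{
    if (has_discrete_cu() && work_items > (size_t)1 << 20)
        return PATH_DISCRETE_CU;    /* big enough to pay the transfer toll */
    if (has_integrated_cu())
        return PATH_INTEGRATED_CU;
    return PATH_AVX_SOFTWARE;       /* pure software using AVX2+ */
}
```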
That potential exists, and the evaluation would depend on implementation details and workloads that won't be available for years, so I do not have a verdict to render.
Note that AMD has to invest a lot more transistors to make this approach work, while a homogeneous architecture would have the entire die at its disposal.
I would still argue that nobody has an entire die at their disposal at any given time, unless that die is a fraction of the size it would otherwise be, perhaps 1/3 or 1/4.
Even if Intel's "2 node" advantage applied to power ranges above ultramobile or embedded (it doesn't), that at most promises the design won't be more heavily gated and throttled than Sandy Bridge. The best case, 50% power scaling per node, is merely what was taken for granted prior to 90nm.
It's a case of having decent to good scaling versus an industry average of "meh" to mediocre.
It's very helpful, but it's just one factor of many.
AVX-1024 can lower the power consumption, and Intel has a process advantage. I'm doubtful AMD has the right formula here.
A single AVX-1024 operation must consume at least as much power as a single AVX-256 or AVX-128 operation. In an absolute sense it just constrains the growth.
If a workload is amenable to AVX-1024, then power consumption could be lower.
I see savings in I-cache and decode power. That leaves the execution units, register file, and writeback unchanged to slightly worse.
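As a back-of-the-envelope model of where those savings land, assume a 1024-bit op is sequenced over four 256-bit execution passes. The energy figures below are made-up placeholders purely to show which terms move and which don't:

```c
#include <stdio.h>

/* Made-up per-event energy figures (arbitrary units), just to show
 * which terms AVX-1024 amortizes and which it leaves untouched. */
#define E_FETCH_DECODE  4.0   /* I-cache + decode, paid once per instruction */
#define E_EXEC_256      10.0  /* one 256-bit pass: RF read, ALU, writeback */

int main(void)
{
    /* Four AVX-256 instructions: four front-end events, four passes. */
    double avx256 = 4 * E_FETCH_DECODE + 4 * E_EXEC_256;

    /* One AVX-1024 instruction cracked into four passes: one front-end
     * event, the same four execution passes. Execution, register file
     * and writeback energy are unchanged; only fetch/decode shrinks. */
    double avx1024 = 1 * E_FETCH_DECODE + 4 * E_EXEC_256;

    printf("AVX-256 x4: %.0f  AVX-1024 x1: %.0f (%.0f%% saved)\n",
           avx256, avx1024, 100.0 * (avx256 - avx1024) / avx256);
    return 0;
}
```

With these placeholder numbers the saving is around 20%, all of it in the front end, which matches the point above: the back half of the pipe is unchanged at best.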
You've claimed that the scheduler can be powered down, but I'm not so certain it's that easy. I don't know Sandy Bridge's internal scheduler implementation, though it is described as being unified across the whole core. Certainly, it can't be completely powered down for a single AVX instruction.
Another concern is that I'm not certain which unit broadcasts the register ID, tracks exceptions and interrupts, and updates completion status in the ROB. If any part of this lives in the scheduler, it needs to either stay powered or be duplicated further down the pipeline, where it would remain active.
Since results are written back to the register file per 256-bit chunk, there are windows in which the intermediate register state of a single instruction can become visible.
There are ways around this, and implementation details would be interesting.
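Here's a sketch of the hazard, modeling the destination as four 256-bit chunks committed one per pass. This is a toy model, not Sandy Bridge's actual writeback path:

```c
#include <stdint.h>
#include <string.h>

/* Toy 1024-bit architectural register, committed 256 bits at a time. */
struct reg1024 { uint64_t chunk[4][4]; };

/* Sequenced writeback: if an interrupt or fault arrives between passes,
 * the architectural register holds a half-old, half-new value unless
 * the implementation buffers all four chunks and commits atomically,
 * or the ISA defines the partial state as architecturally visible. */
static int writeback(struct reg1024 *dst, const struct reg1024 *src,
                     int (*pending_fault)(void))
{
    for (int pass = 0; pass < 4; pass++) {
        if (pending_fault())
            return pass;           /* chunks [0, pass) already visible */
        memcpy(dst->chunk[pass], src->chunk[pass], sizeof dst->chunk[pass]);
    }
    return 4;                      /* fully committed */
}
```

Buffering the chunks and committing in one shot, or making the instruction restartable at a chunk boundary, would be among those ways around it.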
BD kind of skips out on the debate by cracking its ops, so the existing logic applies at all times.
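For comparison, a rough picture of that cracking approach, where decode turns a 256-bit op into two ordinary 128-bit macro-ops (the structure is illustrative, not AMD's actual tracking format):

```c
#include <stdbool.h>

/* Bulldozer-style cracking: a 256-bit AVX op becomes two 128-bit
 * macro-ops at decode. Each half is scheduled, executed, and written
 * back by the existing 128-bit machinery, and the instruction retires
 * only when both halves are done, so no new partial-state rules are
 * needed anywhere in the pipeline. */
struct avx256_entry {
    bool half_done[2];   /* low and high 128-bit halves */
};

static bool can_retire(const struct avx256_entry *e)
{
    return e->half_done[0] && e->half_done[1];
}
```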