Larrabee delayed to 2011?

Non-coherence has already happened (Fermi). My experience programming with non-coherent caches is that they are practical, useful, and perform pretty well. You should give it a try sometime, you might be surprised.
On what (parallel) architecture with non-coherent r/w caches have you worked?
 
The difference between atomics and cache coherence is that atomics are explicit and visible in the code you write, whereas cache coherence introduces invisible performance bugs. We need architectures and programming models that push people in the direction of scalable parallel code. Cache coherence is wrong because it facilitates bad code.
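
To make the "invisible" part concrete, here is a contrived C++ sketch (purely illustrative, not from any real codebase): the two plain counters happen to share a cache line, so on typical hardware the line ping-pongs between the cores even though the threads never touch the same variable, and nothing in the source hints at it. The padded layout and the explicit atomic are there for contrast; the atomic at least announces in the code that it may be contended and expensive.

#include <atomic>
#include <cstdint>
#include <thread>

struct SharedLine {
    std::uint64_t a = 0;   // 'a' and 'b' sit on the same 64-byte cache line,
    std::uint64_t b = 0;   // so independent writers still fight over the line (false sharing)
};

struct PaddedLines {
    alignas(64) std::uint64_t a = 0;   // giving each counter its own line
    alignas(64) std::uint64_t b = 0;   // removes the coherence traffic entirely
};

int main() {
    SharedLine s;
    std::thread t1([&] { for (int i = 0; i < 50'000'000; ++i) ++s.a; });
    std::thread t2([&] { for (int i = 0; i < 50'000'000; ++i) ++s.b; });
    t1.join();
    t2.join();

    PaddedLines p;   // same code, different layout: the difference is invisible at the call site
    std::thread t3([&] { for (int i = 0; i < 50'000'000; ++i) ++p.a; });
    std::thread t4([&] { for (int i = 0; i < 50'000'000; ++i) ++p.b; });
    t3.join();
    t4.join();

    // The atomic, by contrast, is explicit: the potential cost is visible in the code.
    std::atomic<std::uint64_t> counter{0};
    std::thread t5([&] { for (int i = 0; i < 50'000'000; ++i)
                             counter.fetch_add(1, std::memory_order_relaxed); });
    t5.join();

    return static_cast<int>((s.a + s.b + p.a + p.b + counter.load()) & 1);   // keep results observable
}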

If by bad you mean correct working code, I agree.

Too many programmers parallelize their code and forget to parallelize their data structures because of cache coherence. That's why I consider it a bug, not a feature.

That is a feature. The fact that programmers don't do further work is laziness.
 
From a software developer's point of view, the opposite is certainly true: once you have partitioned your data structures for parallel execution on a non-coherent processor, it's easy to get parallel scalability on a processor with coherent caches. You've already done the essential work to parallelize your application. On the other hand, once you've parallelized your execution on a coherent processor, you do not necessarily have scalable code, and you have to do more work to partition your data structures.
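
A minimal C++ sketch of what I mean by partitioning the data (just an illustration): each thread accumulates into thread-private state and writes one result slot, the only sharing is the read-only input and a serial merge at the end. Code structured this way scales on non-coherent and coherent machines alike.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum a large array with per-thread partial results: no shared mutable state
// on the hot path, so scalability does not depend on cache coherence.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(data.size(), begin + chunk);
            long long local = 0;                          // thread-private accumulator
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;                           // one write per thread
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);   // serial merge
}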

Which is fine for trivial problems, but trivial problems haven't been problems for quite some time.

Non-coherence has already happened (Fermi). My experience programming with non-coherent caches is that they are practical, useful, and perform pretty well. You should give it a try sometime, you might be surprised.

Fermi is fine as long as you have no data interaction; once you do, which a LOT of real-world problems do, you are in trouble. For the problems that GPUs are good/great at, you would quite honestly be better off with a sea of super vectors. Less power, more performance, less complexity.
 
The difference between atomics and cache coherence is that atomics are explicit and visible in the code you write, whereas cache coherence introduces invisible performance bugs.
Nah, they're the same. An atomic call may be cheap or extremely expensive... you can't tell by looking at the code. Same with a memory operation in a coherent cache, except they're slightly less bad compared to atomics.

We need architectures and programming models that push people in the direction of scalable parallel code. Cache coherence is wrong because it facilitates bad code.
So do atomics. So does shared memory. So does basically everything that deviates from the happy world of completely independent stream programming. Sadly, while these things sacrifice performance and don't encourage programmers to write "scalable code", they are also useful, as the past 5 years have shown.

Too many programmers parallelize their code and forget to parallelize their data structures because of cache coherence. That's why I consider it a bug, not a feature.
Far more don't even parallelize their code until they have to... it's the same deal. People will optimize stuff when they need to, and it's far better to have code always *work* and then optimize for parallelism and caches than it is to have the correctness of the code effectively depend on those optimizations (which, I will note, are only necessary for a subset of the code you're writing).

From a software developer's point of view, the opposite is certainly true: once you have partitioned your data structures for parallel execution on a non-coherent processor, it's easy to get parallel scalability on a processor with coherent caches.
Right, but we're talking about code that cannot be expressed efficiently without coherent caches. Same deal with shared memory, same deal with atomics, etc. Vertex scatter histograms and parallel reductions written in DX9 all run on current hardware, but not as well as versions that use atomics and shared memory.

My experience programming with non-coherent caches is that they are practical, useful, and perform pretty well.
Sure, but I'd say the same thing for coherent caches :) Again, thinking about just the non-coherent cache cases predisposes you to not think about the problems that can't be expressed efficiently in that model. For instance, sparse, many/wide-bin histograms with local, data-dependent coherence are hard to express efficiently on current GPUs, Fermi included, but work well with coherent caches. You basically can't make use of shared memory/caches at all in this case unless they are coherent. You really want the hardware to efficiently move bins/cache lines around, which is precisely how coherent caches work in this case, and this is not uncommon.
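
For what it's worth, a rough CPU-side sketch of the histogram case (hypothetical code, assuming a bin table far too large and too sparsely hit to privatize per thread or to fit in any scratchpad): threads hammer a shared bin array with atomic increments, and because the hit pattern has data-dependent locality, a coherent cache keeps each core's currently-hot bins resident and migrates lines as the hot set shifts, with no explicit placement in the code.

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Sparse, wide-bin histogram: far more bins than fit in per-core local storage,
// with a drifting, data-dependent hot subset per thread.
void histogram(const std::vector<std::uint32_t>& keys,
               std::vector<std::atomic<std::uint32_t>>& bins,
               unsigned num_threads) {
    std::vector<std::thread> workers;
    const std::size_t chunk = (keys.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(keys.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) {
                // A coherent cache keeps this thread's hot bins in its local cache
                // and moves the lines when the hot set moves; the code does not
                // have to manage that placement explicitly.
                bins[keys[i] % bins.size()].fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
}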
 
Right, precisely. My point is you have to expect people to be using coherent caches and other features well. "You can screw yourself with them" is not an argument against it, in the same way that it wasn't an argument against atomics (which can screw you ever harder).

My point against full-coherency is not that you can screw yourself with them. Truth be told, programmers (of all levels of expertise) can screw themselves with just about any tool.

My argument against full-coherency is that its frequency of use doesn't justify its presence in hw in the massively parallel world.

Fermi seems to be at the sweet spot of coherency vs scalability vs frequency of use in emerging workloads.
 
See, people still don't understand: once you have coherence, it's easy to be incoherent. The opposite isn't true.
Think about it this way. Once you are incoherent/semi-coherent, both in hw and in programming model, are the advantages of implementing full-coherence in hw justified by their frequency of use in massively parallel sw?

By the same token, I'll buy non-coherence when that happens too ;)
Fair enough.

Unfortunately, they are quickly running out of parallelism.
Which consumer apps? I'd really like to know. No sarcasm intended.
 
Think about it this way. Once you are incoherent/semi-coherent, both in hw and in programming model, are the advantages of implementing full-coherence in hw justified by their frequency of use in massively parallel sw?

Yes, not all programs/algorithms are nice enough not to care about communication and interaction. You are coming at this from the perspective that coherency is some massive cost adder, when it really isn't. In fact, coherency can provide substantial performance benefits on a wide variety of codes, even those that don't necessarily need the coherence for correctness.

Which consumer apps? I'd really like to know. No sarcasm intended.

Graphics are already starting to hit significant issues with scaling vs resources/parallelism. Just look at the performance increases of this generation vs the last one on what is effectively one of the most parallelizable AND easiest to parallelize workloads known. Video compression already has issues, esp if you want to do anything close to real time.
 
Yes, not all programs/algorithms are nice enough not to care about communication and interaction. You are coming at this from the perspective that coherency is some massive cost adder, when it really isn't. In fact, coherency can provide substantial performance benefits on a wide variety of codes, even those that don't necessarily need the coherence for correctness.
I am not arguing for pure stream computing either. I am just saying that compared to Fermi, which IMHO is at the sweet spot of coherency vs utilization, LRB1 has features that are not worth the additional costs.

If you can provide an estimate for per core full-coherency costs, that will be so much better.

Graphics are already starting to hit significant issues with scaling vs resources/parallelism. Just look at the performance increases of this generation vs the last one on what is effectively one of the most parallelizable AND easiest to parallelize workloads known.
For the most part, that is due to ff hw in GPUs, which is on its way out. And bandwidth is equally constrained for everyone.

To make your case, you need to show parallel (100 hw threads minimum) workloads that do better with full-coherence than with semi-coherence.

AFAIK, in the production Unreal engine, the stuff that can be easily parallelized is already "Essentially functional" or on GPU.

The stuff that remains is gameplay simulation, which is in "High-level, object-oriented code and written in C++ or scripting language" and it only takes ~10% of CPU time. On top of that, they are looking for a functional implementation of this.

So, coherency???

Video compression already has issues, esp if you want to do anything close to real time.
Video encode/decode should be done in ff hw. Period.
 
Also,

"Manual synchronization (shared state concurrency) is hopelessly intractible here."

Sweeney runs amok; coherency, where art thou?
 
IMO the cost is not in the absolute area, but in the options which get ignored if you want absolute transparency and one-size-fits-all hardware coherency.

Let's say you want to do a banked cache with 1 scatter/gather per cycle per core without bank conflicts, increasing traffic 16-fold at peak ... the costs for snooping (at least on a scale where you need snoop filters) and directories both shoot up, so it's not an option. With page-level software-managed coherency this becomes trivial (the cache knows exactly where to go for misses/invalidates, no broadcasts or directory searches necessary). Not transparent though; you'd have to be rather careful how you used it compared to "normal" memory accesses.
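
A hand-wavy C++ sketch of what that non-transparent model looks like to the programmer (my illustration; flush_region()/invalidate_region() are hypothetical placeholders, no-ops here so the sketch runs on an ordinary coherent machine, but standing in for explicit cache writeback/invalidate operations on exactly the pages named, with no snoop broadcast or directory lookup involved):

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical software-managed-coherence primitives (no-op stubs here).
static void flush_region(const void* /*addr*/, std::size_t /*bytes*/) {}       // write dirty lines back
static void invalidate_region(const void* /*addr*/, std::size_t /*bytes*/) {}  // drop stale local copies

std::atomic<bool> ready{false};   // out-of-band flag standing in for a sync primitive

void producer(std::vector<float>& buf) {
    for (std::size_t i = 0; i < buf.size(); ++i) buf[i] = static_cast<float>(i);
    flush_region(buf.data(), buf.size() * sizeof(float));   // software decides exactly what to write back
    ready.store(true, std::memory_order_release);
}

float consumer(const std::vector<float>& buf) {
    while (!ready.load(std::memory_order_acquire)) {}            // wait for the handoff
    invalidate_region(buf.data(), buf.size() * sizeof(float));   // no broadcast, no directory search
    float sum = 0.0f;
    for (float v : buf) sum += v;
    return sum;
}

int main() {
    std::vector<float> buf(1 << 16);
    float result = 0.0f;
    std::thread p(producer, std::ref(buf));
    std::thread c([&] { result = consumer(buf); });
    p.join();
    c.join();
    return result > 0.0f ? 0 : 1;
}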

Let's say you want to use a switched mesh to route traffic ... if you are married to snooping then the option is off the table.

Let's say you want to use very narrow/simple cores ... just like with scatter/gather, the relative cost of the hardware-based coherency shoots up.
 
32 times that is ~100M trannies. Negligible in comparison to chips that have ~1B trannies???
Yes. According to [Brian Case, Intel Reveals Pentium Implementation Details: Architectural Enhancements Remain Shrouded by NDA, Microprocessor Report, vol. 7, no. 4, March 29, 1993], the overhead for x86 in the original Pentium is 30%. So we're talking about 30 million transistors on a 1 billion transistor chip.
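
Spelling out the arithmetic behind that (taking the original Pentium at roughly 3.1 million transistors, and the 32 cores assumed above):

~3.1M transistors per core x 30% x86 overhead ≈ ~0.9M transistors of overhead per core
~0.9M x 32 cores ≈ ~30M transistors
~30M / ~1B transistors for the whole chip ≈ 3%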

Exactly what miracle do you think you can pull off with that 3%? And why wouldn't we just have a 1.03 billion miracle chip instead?
How about an alternative approach: they run their apps on existing CPUs, profile which bits need perf, and rewrite them in OpenCL to target GPUs/whichever-LRB-is-available?
That would be ok if only a small part of the application was performance critical. But Larrabee aims much higher. Take the example of a ray-tracer. The entire application is performance critical. So rewriting it for OpenCL would be a massive undertaking, especially when we're talking about industry-quality raytracers here, and not someone's hobby raytracer written in one day. Likewise, a lot of scientific applications just aren't well suited to run on a heterogeneous system. Larrabee is like a pocket-sized supercomputer. It would be ridiculous to not be able to run the entire application on that supercomputer but require an API in between. Unless OpenCL turns into C++ with kernels it will only have limited use. In fact, the entire GPGPU approach hasn't been very successful yet. It took a lot of transistors to get to where it is today, but 3% more will take it to an entirely new level with minimal effort for the developers.
 
Yes. According to [Brian Case, Intel Reveals Pentium Implementation Details: Architectural Enhancements Remain Shrouded by NDA, Microprocessor Report, vol. 7, no. 4, March 29, 1993], the overhead for x86 in the original Pentium is 30%. So we're talking about 30 million transistors on a 1 billion transistor chip.
Transistor count, or area?
This is more about getting a more precise figure of what effect it has. The bulk of x86 overhead is in some of the least-dense logic areas, which has a disproportionate effect on core area. Cache would have higher density, but is mostly* agnostic.

*unless this were a performant wider-issue x86, in which case there would be a predecode component to the I-cache, or a trace cache if you're a P4

Within the logic component, x86 overhead is applied in varying degrees to the whole design.
Whatever overhead Larrabee has, it is an overly optimistic best-case scenario that assumes the VPU, which Intel stated takes up 1/3 of the core area, is not also larger by some percentage than it would otherwise be.

Exactly what miracle do you think you can pull off with that 3%? And why wouldn't we just have a 1.03 billion miracle chip instead?
Given the physical bulk of Larrabee, the answer is that it was pushing the limits of what could be squeezed onto a die.
And the 3% figure is not indicative of the area or power.
This has been gone over.
 
Yes. According to [Brian Case, Intel Reveals Pentium Implementation Details: Architectural Enhancements Remain Shrouded by NDA, Microprocessor Report, vol. 7, no. 4, March 29, 1993], the overhead for x86 in the original Pentium is 30%. So we're talking about 30 million transistors on a 1 billion transistor chip.

Exactly what miracle do you think you can pull off with that 3%? And why wouldn't we just have a 1.03 billion miracle chip instead?
Well, to compete in the market it was aimed at, all of x86 was pointless. While I agree that the long-term future will look like LRB when CPUs and GPUs become enmeshed in each other, that day is still years away. Intel overshot the market needs of TODAY.

And before you ask what you could do with 3% more transistors, ask yourselves why LRB's bogo-flops/mm and bogo-flops/W are not very impressive wrt its competitors of today. At any rate, it seems to have been delayed by a year.

That would be ok if only a small part of the application was performance critical. But Larrabee aims much higher. Take the example of a ray-tracer. The entire application is performance critical. So rewriting it for OpenCL would be a massive undertaking, especially when we're talking about industry quality raytracers here, and not someone's hobby raytracer written in one day.

If the entire app is perf critical, LRB will suck badly on the serial bits.

Considering how different LRB would have been from COTS x86, I fail to see much difference. If you want to be productive while you optimise for LRBni, you'll want a JIT-based implicitly parallel language, be it DXCS or OCL. Once you go there, there's not much distance between the two options.

In 2000, intrinsics made sense.

In 2010, OCL/DXCS/CUDA make sense.

It took a lot of transistors to get to where it is today, but 3% more will take it to an entirely new level with minimal effort for the developers.
Fair enough, but you'll still want to have a shaderish language to program it in.
 
Exactly what miracle do you think you can pull off with that 3%?
If x86 on a GPU doesn't actually make sense in the first place, why would you even want it, regardless of whether it "only" costs 3%/30 million trannies (that's roughly half of an entire GeForce 3!)?

Given that LRB is never going to run the Windows kernel or Office, what point is there to x86? You'd still want to run your GPGPU application through a compiler that's optimized for the precise architecture, so whether the compiler compiles to x86 or something custom doesn't really matter. In fact, making it compile to something custom would probably make life easier for the writers of the compiler, as x86 is likely the quirkiest and most legacy-ridden computing architecture still in large-scale use today...

And why wouldn't we just have a 1.03 billion miracle chip instead?
Because 1.03 billion trannies would be more than 3% more expensive/power-hungry than 1.00 billion? :p

With that reasoning, why not add in everything but the kitchen sink to the chip? On a 1 billion+ IC it's just a few percent here, a few percent there...doesn't matter, right?
 
The assumption that the overhead introduced by this or that ISA can be represented by a single number is weak at best. Area and/or power costs are only part of the story.
 
IMO the cost is not in the absolute area, but in the options which get ignored if you want absolute transparency and one-size-fits-all hardware coherency.

sure sure.

Let's say you want to do a banked cache with 1 scatter/gather per cycle per core without bank conflicts, increasing traffic 16-fold at peak ... the costs for snooping (at least on a scale where you need snoop filters) and directories both shoot up, so it's not an option.

Coherence is the least of your problems. If you aren't getting almost perfect locality, you'll blow out your memory subsystem. FYI, the architecture for this was done as part of the Alpha Tarantula work already.

With page-level software-managed coherency this becomes trivial (the cache knows exactly where to go for misses/invalidates, no broadcasts or directory searches necessary). Not transparent though; you'd have to be rather careful how you used it compared to "normal" memory accesses.

You'd have to effectively program it as message passing or only deal in a data parallel model.

Let's say you want to use a switched mesh to route traffic ... if you are married to snooping then the option is off the table.

So glad that there was no way that stupid Alpha EV7 system could work, what with that 2D switched torus interconnect.

Let's say you want to use very narrow/simple cores ... just like with scatter/gather, the relative cost of the hardware-based coherency shoots up.

So does everything else. A lot. It's one of the reasons we won't see 100+ core designs for quite some time.
 
In fact making it compile to something custom would probably make life easier for the writers of the compiler, as x86 is likely the quirkiest and most legacy-ridden computing architecture still in large-scale use today...

By quirkiest and most legacy-ridden, I assume you mean the most well-understood and best-supported, with the widest depth and variety of performance primitives, evaluations, and optimization tools in the history of computing, right!
 
If you can provide an estimate for per core full-coherency costs, that will be so much better.

Less than you think.

For the most part, that is due to ff hw in GPUs, which is on its way out. And bandwidth is equally constrained for everyone.

That's not what the data that has been generated really shows.

Video encode/decode should be done in ff hw. Period.

Video encode should only be done in hardware if you don't care at all about quality and bit rates. The difference between hardware encoders and even mediocre software encoders is night and day. Compared to something like x264, it's like you are still back in the MPEG-1 days.
 
Less than you think.
I am not hearing numbers.

That's not what the data that has been generated really shows.
In absence of that data, I can't say I agree with you.

Video encode should only be done in hardware if you don't care at all about quality and bit rates. The difference between hardware encoders and even mediocre software encoders is night and day. Compared to something like x264, it's like you are still back in the MPEG-1 days.
WHY?
 