It's just one person's speculation so far. Anyway, back to (~)Larrabee: OK, I didn't get that Haswell was supposed to use two 8-wide SIMDs.
Careful -> only DX10 IGPs (HD3200, GeForce 8200) and later will get you DXVA, Flash and browser acceleration.
Looking at your experience, I think you'd be better served with an E-350.
Micro-ATX boards with the E-350 look great but are expensive (you can get a low-end H61 board and a Sandy Bridge Pentium for that).
It's not a representation of cumulative sales for the last 15 years. It's a representation of the hardware people had in Q1 2011. And the most interesting observation is that a 2005 ultra-low-end IGP is leading the chart... Point being?
I said the list you mentioned represented cumulative sales for the last 15 years. The same list shows the S3 Virge, Intel Extreme Graphics, ATI Rage. My statement stands.
I'm not forgetting anything. This thread is mainly about Larrabee, desktop CPUs and GPUs, and close derivatives. The ultra-mobile market has different design goals and spending more silicon to achieve the lowest power consumption makes sense for the time being. You're forgetting about the rise of the ARM architecture in tablets, phones and smart/netbooks.
Even ARM is betting on GPU computing, with the OpenCL-oriented Mali-T604.
You mean like this: Runescape gameplay on intel gma 950 no lag? This isn't true. Try playing any 3D browser game (runescape, quake live, free realms, fusionfall, etc) and see how your 945G will fare, no matter what CPU you're using.
How do APUs help people who upgrade their GPU frequently? I change graphics cards about 3x more than I change CPUs in my desktop.
What huge disparity? Tesla offers ~1 TFLOP theoretical peak performance, but only half that in practice (complex workloads are even worse). It also consumes 200+ W and takes 3 billion transistors. Haswell on the other hand is likely to offer 500 GFLOPS at 100 W and 1 billion transistors. Granted, NVIDIA will have a new architecture by the 2013 timeframe as well, but it will cost transistors and power to increase efficiency and programmability, so I seriously doubt there will be a "huge" disparity. Last but not least, AVX-1024 will give the CPU another performance/Watt advantage to further close the gap, if there will even be a gap left at all... So you disagree with pretty much all of the scientific community related to computing hardware, as you seem to neglect the huge disparity in FLOPS between architectures.
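Just to make the per-Watt and per-transistor comparison above explicit, here's a trivial back-of-the-envelope calculation using only the figures quoted in that post (the sustained Tesla number and the Haswell estimate are of course speculative):

[code]
// Per-Watt and per-transistor ratios from the figures quoted above.
// (Tesla: ~500 GFLOPS sustained, 200+ W, 3B transistors;
//  hypothetical Haswell: 500 GFLOPS, 100 W, 1B transistors.)
#include <cstdio>

int main() {
    struct Chip { const char* name; double gflops, watts, btrans; };
    const Chip chips[] = {
        {"Tesla (sustained)",  500.0, 200.0, 3.0},
        {"Haswell (estimate)", 500.0, 100.0, 1.0},
    };
    for (const Chip& c : chips)
        std::printf("%-18s %5.1f GFLOPS/W, %6.1f GFLOPS per billion transistors\n",
                    c.name, c.gflops / c.watts, c.gflops / c.btrans);
    return 0;
}
[/code]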
Let's just agree to disagree then.
Does not sound to me like Intel would aim for performance at all costs. Here's what we already know about Haswell, per a conversation I had last month with Intel marketing chief Tom Kilroy: The mobile version of Haswell will be Intel's first system-on-a-chip designed for the mainstream laptop market, according to Kilroy.
Rough translation: But in early 2013 the "tock" to the Haswell processor is supposed to follow, again designed by the crew in Oregon around Ronak Singhal, which may well revive further techniques from the defunct Netburst architecture. There is also talk of a completely new cache design, a comparatively short 14-stage pipeline, new power-saving mechanisms, and a probably optional vector unit that is 512 bits wide and speaks LNI: Larrabee New Instructions.
ARM has a very, very long way to go to become part of a competitive HPC architecture. It absolutely won't take Intel longer to implement AVX-1024 than it would take ARM and its partners to gain a foothold in the HPC market. They have all grounds covered but I'm not so sure about them having that much time. ARM CPUs may bite into the share sooner rather than later.
Why? Intel already has quite successful power-efficient 6-core CPUs at 32 nm. An 8-core Haswell capable of delivering 1 TFLOP at the same TDP doesn't seem much of a challenge at 22 nm + FinFET. That would still be a very small chip by today's GPU standards. That's where I get lost. If Intel is really to kill GPUs (whatever their real purpose, graphics/compute) they need to deliver more perf. 8 Haswell cores (or Skylake something) would be big and still far away from GPUs in throughput.
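For what it's worth, here's the napkin math behind a ~1 TFLOP figure, assuming (and these are my assumptions, not announced specs) two 256-bit FMA units per core and a ~4 GHz clock:

[code]
// Napkin math for the ~1 TFLOP claim. Assumptions (mine, not announced specs):
// 8 cores, two 256-bit FMA units per core, ~4 GHz. One FMA = 2 FLOPs,
// 256 bits = 8 single-precision lanes.
#include <cstdio>

int main() {
    const double cores = 8, fma_units = 2, lanes = 8, flops_per_fma = 2, ghz = 4.0;
    std::printf("Peak: %.0f GFLOPS single precision\n",
                cores * fma_units * lanes * flops_per_fma * ghz);   // ~1024 GFLOPS
    return 0;
}
[/code]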
Why are matrix operations an NVIDIA weakness but not for AMD? And what will it cost to reach parity with CPUs? But that is "just" a distinct weakness of Fermi in this respect. They don't handle the matrix operations (which are the base for the top500 list) with very high efficiency. AMD GPUs are actually currently better in this and nvidia promised to improve that considerably with Kepler too (i.e. they aim for parity with CPUs).
Nvidia GPUs lack the register space to hold more values there. So they put more strain on the cache/local memory system (it doesn't matter which you use, same [and shared] bandwidth), which is simply a bit too slow. The fastest implementation on AMD GPUs doesn't use the LDS at all (it wouldn't have enough bandwidth either), but relies only on the reuse of values in the registers (which of course have enough bandwidth to the units). As AMD GPUs have more registers, the bandwidth of the caches is just fine to reach 90%+ of theoretical peak with large matrix multiplications (as CPUs can also do). Why are matrix operations an NVIDIA weakness but not for AMD? And what will it cost to reach parity with CPUs?
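To illustrate the register-reuse argument in plain code, here's a minimal 4x4 register-tiled matrix-multiply micro-kernel in C++ (a sketch of the general technique, not any vendor's actual kernel): every value loaded from A or B is reused four times out of registers, so the cache/LDS bandwidth needed per FLOP drops by the tile factor, and a bigger register file allows a bigger tile.

[code]
// Minimal 4x4 register-blocked GEMM micro-kernel: computes one 4x4 tile of
// C = A * B (row-major, n x n). Per k-step: 8 loads feed 16 multiply-adds,
// because every a[i] and b[j] is reused 4 times out of registers.
void gemm_tile4x4(const float* A, const float* B, float* C,
                  int n, int i0, int j0) {
    float acc[4][4] = {};                      // 16 accumulators held in registers
    for (int k = 0; k < n; ++k) {
        float a[4], b[4];
        for (int i = 0; i < 4; ++i) a[i] = A[(i0 + i) * n + k];
        for (int j = 0; j < 4; ++j) b[j] = B[k * n + (j0 + j)];
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                acc[i][j] += a[i] * b[j];
    }
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            C[(i0 + i) * n + (j0 + j)] = acc[i][j];
}
[/code]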
I never said x86 is the center of the universe. I just think Intel is very close to producing a homogeneous power-efficient high-throughput CPU. I don't believe x86 is the center of the universe. Legacy CPU support isn't a hard requirement in HPC and consumer software may follow suit soon enough.
Why would we come up with new workloads only after CPU-GPU unification? I agree it will allow new workloads which are a complex mix of ILP / TLP / DLP, but I can't really imagine anything beyond that... I agree integration is the future. At least for now. The moment homogenous architectures become good enough we'll come up with new workloads that require dedicated hardware. It's a cycle.
Do you have any supporting arguments? Or are you basing things solely on sentiment? In any case I'm not seeing any indication that AVX will provide competitive performance to discrete GPUs regardless of efficiency advantages. The raw advantage in throughput for GPUs is still too great.
The full support of legacy scalar ISAs and techniques is an albatross when it comes to throughput computing, not a benefit.
Why would the thermal and power limits of the CPU socket be any different than those of a graphics card? There's no free pass there. Intel is still constrained by the thermal and power limits of the CPU socket.
What do you mean it did squat? Software renderers which don't use SSE are way slower. Given that AVX is just a wider SSE and the latter did squat for software rendering performance I'm still not seeing reasons to be excited.
I think he means that software renderers that use SSE are still useless for real world gaming. What do you mean it did squat? Software renderers which don't use SSE are way slower.
Hi all,
Since Intel's 22 nm FinFET process technology will be production ready at about the same time as TSMC's 28 nm process, I was wondering if this means Intel is actually two generations ahead now.
I think this could give them the opportunity to launch an improved Larrabee product. The inherent inefficiency of such a highly generic architecture at running legacy games could be compensated by the sheer process advantage. Other applications and games could potentially be leaps ahead of those running on existing GPU architectures (e.g. for ray-tracing, to name just one out of thousands).
In particular for consoles this could be revolutionary. They need lots of flexibility to last for many years, and the software always has to be rewritten from scratch anyway, so it can make direct use of Larrabee's capabilities (instead of taking detours through restrictive APIs).
It seems to me that the best way for AMD and NVIDIA to counter this is to create their own fully generic architecture based on a more efficient ISA.
Thoughts?
Nicolas
Nick said: Why would we come up with new workloads only after CPU-GPU unification? I agree it will allow new workloads which are a complex mix of ILP / TLP / DLP, but I can't really imagine anything beyond that...
Do you have any supporting arguments? Or are you basing things solely on sentiment?
GPUs used to have a huge advantage in raw throughput due to exploiting TLP and DLP, while legacy CPUs only exploited ILP. Today's situation is very different. CPUs now feature multiple cores and multiple wide SIMD units, soon to be extended with FMA support. And AVX-1024 gives it the power efficiency of in-order processing. This leaves GPUs with no unique advantages.
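For concreteness, this is roughly what the wide SIMD + FMA path mentioned above looks like with AVX/FMA intrinsics (just a sketch; it needs a compiler and CPU with AVX and FMA support):

[code]
// 8-wide single-precision multiply-add with AVX/FMA intrinsics:
// y[i] = a * x[i] + y[i]. Assumes n is a multiple of 8; compile with
// AVX and FMA enabled (e.g. -mavx -mfma on gcc/clang).
#include <immintrin.h>

void saxpy_avx_fma(float a, const float* x, float* y, int n) {
    __m256 va = _mm256_set1_ps(a);
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_fmadd_ps(va, vx, vy);      // 8 fused multiply-adds per instruction
        _mm256_storeu_ps(y + i, vy);
    }
}
[/code]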
Yes, supporting a scalar ISA costs some throughput density, but any APU still requires CPU cores so you have to compare homogeneous architectures against their entire heterogeneous counterpart.
Why would the thermal and power limits of the CPU socket be any different than those of a graphics card?
What do you mean it did squat? Software renderers which don't use SSE are way slower.
A third option would be to increase ILP so they need fewer warps in flight and each of them gets a larger cut of the register file. Simply adopting GF104's superscalar issue could do the trick. Nvidia has basically two options: increasing register space (if they go to SMs with 64 ALUs, they will probably double the registers either way *) and/or increasing the bandwidth/size of the shared memory/L1, or a combination of the two, of course.
A third option would be to increase ILP so they need fewer warps in flight and each of them gets a larger cut of the register file. Simply adopting GF104's superscalar issue could do the trick.
I beg to differ, at least with regard to the Top500 - I'm aware of some specific MM tests which show pretty good utilization rates on Radeon hardware. But that is "just" a distinct weakness of Fermi in this respect. They don't handle the matrix operations (which are the base for the top500 list) with very high efficiency. AMD GPUs are actually currently better in this and nvidia promised to improve that considerably with Kepler too (i.e. they aim for parity with CPUs). And you always get a better power efficiency when choosing low voltage parts, that is true for GPUs too.
Yeah, it's worse than 3DMark for games. But I'm sure they'd welcome any advice to improve upon that. The more general problem is actually what the scores tell about the code those computers will actually encounter in reality. It's almost nothing. Just solving huge systems of linear equations isn't what most of these systems do as their daily work.
Throughput will continue to increase, even with a fully homogeneous architecture. What you're claiming is that this won't suffice, and we'll get workloads which will require heterogeneous dedicated hardware again. Could you give me an example of a task which requires much more throughput than graphics but less programmability, and would be worth the dedicated silicon? You can't imagine algorithms that require far more raw throughput than is available on current or near future hardware but would benefit from dedicated silicon? I sure hope we're not at the end of the road already!
Compared to SSE, AVX2 increases the throughput fourfold, adds non-destructive instructions, and features gather. That's way more than "just a few more flops", and then some. I find that question ironic given your unshakeable faith in Intel's ability to upset the status quo with just a few more flops. Where is your supporting evidence that slapping a few vector units onto an x86 CPU will result in computing nirvana or even compete with contemporary GPUs? My opinion is based on the facts on the ground, not wishful thinking.
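To show what the gather part buys in practice, here's a small AVX2 sketch (my example, assuming AVX2 support) contrasting a single hardware-gather table lookup with the element-by-element emulation pre-AVX2 code has to fall back on:

[code]
// Table lookup of 8 floats: AVX2 hardware gather vs. the per-element
// emulation pre-AVX2 code has to fall back on. Requires AVX2 (-mavx2).
#include <immintrin.h>

__m256 lookup8_gather(const float* table, const int* idx) {
    __m256i vidx = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(idx));
    return _mm256_i32gather_ps(table, vidx, 4);   // one gather instruction, scale = 4 bytes
}

__m256 lookup8_emulated(const float* table, const int* idx) {
    float tmp[8];
    for (int i = 0; i < 8; ++i)    // extract index, scalar load, insert - per element
        tmp[i] = table[idx[i]];
    return _mm256_loadu_ps(tmp);
}
[/code]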
That's sentiment, not fact. GPUs are losing this advantage too. If Sandy Bridge had FMA support and no IGP, it would be less than 200 mm² and deliver 435 GFLOPS. GF116 is 238 mm² and delivers 691 GFLOPS. Sure GPUs have no unique advantages if far higher performance doesn't count as an advantage in your books.
What's not to understand? You said CPUs are constrained by the thermal and power limits of the CPU socket. So I'm asking, what would keep them from increasing these thermal and power limits? For the same reasons that they're different today. I don't understand your question.
Ivy Bridge doesn't have AVX2 or AVX-1024. IGPs will stay around as long as these haven't been implemented. Software rendering being slow is not a cosmic constant. They're currently limited to using 100 GFLOPS, and emulating gather takes 3 uops per element. With four times higher throughput per core, non-destructive instructions, hardware gather, and more cores, software rendering is about to take a quantum leap. Why does Ivy Bridge still have an IGP? Software rendering is SLOW. Doubling CPU performance won't change a thing.
That depends on the total memory access latency per warp. CUDA features prefetching, which also benefits from having fewer warps competing for cache space. So increasing ILP with superscalar issue should help in several ways. If you increase ILP, you'll need more and not less warps in flight.
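A toy latency-hiding model (my simplification, not anything from the thread) makes the trade-off concrete: if each warp has some number of independent instructions it can issue before waiting on a memory result, then to cover a given latency at one instruction per cycle you need roughly latency/ILP warps, so more ILP per warp means fewer warps in flight under this model:

[code]
// Toy latency-hiding estimate (heavily simplified, Little's-law style):
// to cover `latency` stall cycles while issuing one instruction per cycle,
// the scheduler needs about `latency` independent instructions in flight;
// each warp contributes roughly `ilp` of them before it stalls on memory.
#include <cstdio>

int warps_needed(int latency_cycles, int ilp_per_warp) {
    return (latency_cycles + ilp_per_warp - 1) / ilp_per_warp;   // ceiling division
}

int main() {
    std::printf("latency 400 cycles, ILP 1: %d warps\n", warps_needed(400, 1)); // 400
    std::printf("latency 400 cycles, ILP 4: %d warps\n", warps_needed(400, 4)); // 100
    return 0;
}
[/code]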
Everything else being equal, more ILP means more warps in flight to hide the same memory latency. That depends on the total memory access latency per warp.