Albuquerque

This topic needs to be beaten down in a place that isn't about Skylake.

Yeah, AMD's big opportunity here is if Intel is slow to respond with making 8 real cores reasonably priced.

Intel already makes 8-core CPUs, and 10-core, and 12-core, and 14-core, and all the way up to 18 cores per socket. The only thing "stopping" Intel from dropping any number of those into the consumer market is demand.

No. The demand is already there (I know because I am part of it). And $1,000 for an 8-core is just greed; they do it because they can.

Quad-core is the sweet spot nowadays on the desktop, but since Bulldozer sucks and AMD is faltering, Intel still wants to force you into paying almost $200 for the cheapest quad-core, when even Skylake Celerons should be quad-core in this day and age.

You will need to defend this with data. Just as a thought experiment, you must realize that the super-majority of the consumer computing space is working in the Microsoft Office suite (Outlook, Word, Excel, OneNote, PowerPoint, Project, Visio) or some open-source version of the same. When those consumers aren't using local apps, they're using browser interfaces that are probably server-side execution with little Java applets. Even when they are gaming, the largest portion of that gaming is browser-based gaming, not locally installed, multi-gigabyte powerhouse games like we would talk about here on B3D.

All of the above items are lightly threaded at best. Consider that a desktop Core i3 can run (up to) four threads, which means it is likely more CPU than most of the population really needs.

I'd argue the reason why most applications don't use more than 4 cores is that we don't have more than 4-core CPUs in the mainstream. Pretty much all games these days make good use of 4 cores and most scale to some extent to 8. I've little doubt that if 8 cores were mainstream now (and had been for the last few years) then we'd be seeing pretty good scaling on them in lots of applications.

I take exception to a few things in this quote parade:
  • The overwhelming majority of consumers are using applications that are lightly threaded at best
  • The reason why most applications aren't using lots of cores (I contend that even saying "good use of 4" is a stretch for nearly all games) isn't because four cores aren't prevalent. Quite the opposite: we've had eight-thread CPUs since the i7 debuted. No, it's because code isn't that easy to parallelize, and this isn't a new piece of knowledge.
  • All things considered, as a purchaser of a 3930k when the C2 stepping came out, I'd rather be back on a "slower" quad-core Core i7 again. Nothing that I do uses these threads, outside of a rare video encoding / transcoding app that goes so fast anyway that it doesn't matter.
  • Cramming a zillion cores into a single socket eventually tilts the I/O-to-compute balance of the socket. There are workloads where this is not a problem, but personally I'd rather have multiple sockets when the scale gets to that point...

Let the fun begin.
 
To answer the question in the title: there are none.
And considering your opinion on the 3930k having cores that aren't being used, I'm wondering why you would want one?
 
To answer the question in the title: there are none.
And considering your opinion on the 3930k having cores that aren't being used, I'm wondering why you would want one?

Albu doesn't want a many-core CPU; this thread was spawned because he wants to demonstrate that we don't need them and shouldn't be asking for them.
 
Albu doesn't want a many-core CPU; this thread was spawned because he wants to demonstrate that we don't need them and shouldn't be asking for them.
But they should eventually, though, right? I mean, he's right that the average consumer doesn't require all this CPU power for now, but that doesn't change the fact that more cores can do an equivalent amount of work in parallel with a significantly smaller power footprint, and possibly a smaller thermal footprint.

We should see cores go up and MHz go down as we move further into the mobile space right?
 
We should see cores go up and MHz go down as we move further into the mobile space right?
This only works as long as the task can be parallelized (so games yes, browsers no) and as long as single-threaded performance is high enough.
I use both a phone with quad A7s and one with dual Kraits; the dual Krait phone is far more responsive when browsing, and I do not care about battery life in games.

We can get a quad-core from Intel for $50 (Atom). Moving up in price, Intel could do three things: give us more cores, give us more per-core performance, or give us higher GPU performance. Intel chose to do the latter two, and I'm happy they do it this way.
With low-end CPUs, the choice is between fast dual cores or slow quads; most consumers find quads slower (less responsive). With high-end CPUs, I'm very happy that Intel is giving me a GT2 GPU instead of 8 cores, because now I hopefully will not need a discrete GPU for my future 4K monitor.

I do agree that as the iGPUs on Intel chips advance, the mentioned power issue, along with process shrinks and DX12 rising in popularity, may change the balance so that there is a use for having 3 or 4 big cores even on the U family of chips – and then we'll get them.

To anybody arguing Intel is cheating us: no, their priorities are simply aligned with the majority of their consumers, as shown by the BDW-U die:
http://images.anandtech.com/doci/9320/BDW-U.png
 
16 cores (32 threads) doesn't benefit most consumer applications or games at all. You actually lose performance by using a CPU like this on a home computer, since the clock ceilings are slightly lower.

The i7-5960X is the biggest consumer desktop CPU, with 8 cores (16 threads). It has a whopping 20 MB of L3 cache and is clocked at 3 GHz:
http://ark.intel.com/products/82930...tion-20M-Cache-up-to-3_50-GHz#@specifications

18 cores (36 threads) is the highest available Haswell Xeon:
http://ark.intel.com/products/81061/Intel-Xeon-Processor-E5-2699-v3-45M-Cache-2_30-GHz

The core frequency of the 18 core Xeon is only 2.3 GHz.

If you look at various consumer software benchmarks, you notice that most don't scale beyond 4 cores. Some consumer apps used by hobbyists, like Photoshop, scale up to 8 cores (but only give a few percent better performance compared to 4). Only a few games benefit from more than 4 cores, and if you want to go wider, the 8-core i7 already beats the 14, 16 and 18 core Xeons in all games.

Even if your program is fully multithreaded and scales to any number of cores, some algorithms are inherently bandwidth bound. The 8-core i7 has 68 GB/s of bandwidth; the 18-core Xeon has exactly the same bandwidth. The Haswell architecture does 32 FLOPs per cycle per core, meaning that an 18-core Xeon peaks at 32 * 18 * 2.3 = 1.32 TFLOP/s. As CPUs achieve a higher percentage of their peak FLOPs, you can already easily saturate the 68 GB/s of bandwidth with 8 cores running optimized AVX2 code, meaning that the extra cores don't help at all for bandwidth-bound algorithms.
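
As a rough back-of-the-envelope check – just a sketch, using the core counts, clocks and the 68 GB/s figure quoted above; the bytes-per-FLOP ratio is only there to illustrate how little bandwidth each extra core gets:

```cpp
#include <cstdio>

// Rough model: peak FLOP/s = FLOPs/cycle/core * cores * clock.
// Haswell: two 256-bit FMA ports -> 32 single-precision FLOPs per cycle per core.
int main() {
    const double flops_per_cycle = 32.0;   // Haswell, single precision
    const double mem_bw_gb       = 68.0;   // quad-channel bandwidth (GB/s) from the post above

    struct Chip { const char* name; int cores; double ghz; };
    const Chip chips[] = {
        { "i7-5960X (8C @ 3.0 GHz)",         8, 3.0 },
        { "Xeon E5-2699 v3 (18C @ 2.3 GHz)", 18, 2.3 },
    };

    for (const Chip& c : chips) {
        double peak_gflops = flops_per_cycle * c.cores * c.ghz;
        // Bytes of memory traffic the chip can afford per FLOP before it becomes bandwidth bound.
        double bytes_per_flop = mem_bw_gb / peak_gflops;
        printf("%-32s peak %.0f GFLOP/s, %.3f bytes of bandwidth per FLOP\n",
               c.name, peak_gflops, bytes_per_flop);
    }
    // Both parts share the same 68 GB/s, so the 18-core chip gets even fewer bytes per FLOP:
    // streaming (bandwidth-bound) kernels see no benefit from the extra cores.
}
```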

The Xeons (and the 8-core i7 flagship) both have four-channel memory controllers, meaning that you need to buy four (or eight) memory modules to maximize the bandwidth. Most consumers wouldn't want to spend the extra money required for the memory and the motherboard, as the higher-clocked 4-core CPU is slightly faster in most of their apps and games. Four-channel memory buses are not cheap to manufacture. We need faster memory (HBM for example) to make high-core-count CPUs scale better in various algorithms. This is especially true with AVX-512.

We are facing a chicken-and-egg problem in consumer apps. Most are designed to scale up to four cores now, since that is the norm. Even if you code in a way that could theoretically scale up to any number of cores, critical sections tend to see a more than linear number of collisions as the thread count rises, meaning that synchronization becomes a bottleneck. I was highly pleased to see the TSX extensions in Haswell. They will allow many algorithms to scale to higher core counts efficiently. Unfortunately the chip was bugged and Intel disabled TSX. It will again take some years to get this important feature widely enough supported to warrant the programmer effort to support it.
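
For reference, this is roughly what TSX (the RTM variant) looks like from the programmer's side – a minimal sketch, assuming a CPU with TSX enabled and a compiler flag like -mrtm; the shared counter and the fallback spinlock are made up purely for illustration:

```cpp
#include <immintrin.h>   // _xbegin / _xend / _xabort (compile with -mrtm)
#include <atomic>

static std::atomic<bool> g_lock{false};   // hypothetical fallback spinlock
static long g_counter = 0;                // hypothetical shared state

static void lock_fallback()   { while (g_lock.exchange(true, std::memory_order_acquire)) { /* spin */ } }
static void unlock_fallback() { g_lock.store(false, std::memory_order_release); }

void increment() {
    unsigned status = _xbegin();                 // try to run the critical section as a hardware transaction
    if (status == _XBEGIN_STARTED) {
        if (g_lock.load(std::memory_order_relaxed))
            _xabort(0xff);                       // someone holds the real lock: abort and take the slow path
        ++g_counter;                             // no lock taken; conflicting writers abort each other instead
        _xend();                                 // commit the transaction
        return;
    }
    lock_fallback();                             // transaction aborted (conflict, capacity, ...): serialize
    ++g_counter;
    unlock_fallback();
}
```

The point is that uncontended threads never actually take the lock, so the critical section stops being a serialization point until a real conflict occurs.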
 
This only works as long as the task can be parallelized (so games yes, browsers no) and as long as single-threaded performance is high enough.

But didn't people say the same thing about games at the start of the XB360 generation before developers were forced to find new ways to parallelize games to fit the hardware?

We can get a quad-core from Intel for $50 (Atom). Moving up in price, Intel could do three things: give us more cores, give us more per-core performance, or give us higher GPU performance. Intel chose to do the latter two, and I'm happy they do it this way.

Have they given us more per core performance though? Outside of clock speed boosts and SIMD instruction sets, from the latest leaked Skylake benchmarks it looks like they may have improved per-core performance by around 30% or so over the 4.5-year-old, 4-generations-old Sandy Bridge. Everyone is always commenting on how desktop CPU progression has flattened out.

Add to that that the clock speed of the newest generation (Skylake) is rumoured to be lower than that of the two-generations-old Haswell, and from what I've seen of the leaked benchmarks it's barely faster at all.

I agree Intel have done a superb job in bringing up IGP performance over the last few years and no doubt that will ultimately benefit the PC gaming industry greatly, but I'd also argue that the average consumer no more needs a 48 EU IGP than s/he needs an 8 core CPU.

With high-end CPUs, I'm very happy that Intel is giving me a GT2 GPU instead of 8 cores, because now I hopefully will not need a discrete GPU for my future 4K monitor.

But you're representing just one small niche of the overall market there. Mid-high end gamers who have no need for an IGP are another market segment which Intel isn't particularly well catering to at the moment - if you consider the available technology rather than the competitive landscape.

To anybody arguing Intel is cheating us: no, their priorities are simply aligned with the majority of their consumers, as shown by the BDW-U die:
http://images.anandtech.com/doci/9320/BDW-U.png

I wouldn't describe Intel as cheating consumers, but they are holding back in order to increase their margins. That's just good business sense when there is virtually no competition. To put it in context, Lynnfield was released almost 6 years ago with 4 cores and 8MB of L3 at a price of $196, on a 45nm process and a 296mm² die with a 95W TDP.

6 years later and we're getting the Core i5-6600K with 4 cores and 6MB of L3 at a price of ~$250, on a 14nm process and a (likely much smaller) die with the same 95W TDP. Sure, we get an extra 1 GHz of clock speed (assuming full boost) and we get some reasonable IPC improvements plus a good IGP. But in 6 years? And a process difference of 45nm vs 14nm? That's a tough pill to swallow for anyone who won't actually be using that IGP.
 
GPU compute benefits from the integrated GPU, giving similar performance boosts for many parallel algorithms to what the extra 10 cores (running AVX2 code full tilt) in the 18-core 2.3 GHz CPU would give. The GPU is more efficient (perf/watt) at running highly parallel linear number crunching (and/or BW-heavy data movement) code than the highly complex Intel OoO CPU cores. Of course there are algorithms where the extra CPU cores would be better, but the Intel GPU has lately (Broadwell) received many important features for GPU compute, such as cache coherent virtual memory (fully backed by Intel's huge caches) and OpenCL 2.0/2.1 support (GPU-side kernel enqueue among other features) that make the interoperation of the CPU and GPU more efficient.

DirectX 12 will allow games to scale up to higher CPU core counts. Hopefully it is widely adopted, but I fear that the Windows 10 requirement will slow down adoption a bit (not everyone takes the free upgrade). It's too bad that TSX failed on Haswell (and the first Broadwell chips), as it would have combined nicely with renderers that scale up to high core counts. When these new features (and the API) become commonly used we will start seeing more gains from high CPU core counts in games. However, I don't expect to see a big breakthrough until the next console generation (assuming they sport 16+ core CPUs with TSX support and high memory bandwidth).
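
The main way DX12 spreads rendering work over more cores is per-thread command list recording: each thread gets its own command allocator and command list, and the main thread submits them all at once. A rough sketch only – error handling is omitted, and the device, queue and the per-thread RecordDraws() callback are assumptions, not real engine code:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
#include <thread>
#include <vector>
using Microsoft::WRL::ComPtr;

// Sketch: record draw work on N threads, submit once from the main thread.
void RecordFrame(ID3D12Device* device, ID3D12CommandQueue* queue, int numThreads) {
    std::vector<ComPtr<ID3D12CommandAllocator>>    allocs(numThreads);
    std::vector<ComPtr<ID3D12GraphicsCommandList>> lists(numThreads);
    std::vector<std::thread> workers;

    for (int t = 0; t < numThreads; ++t) {
        device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT,
                                       IID_PPV_ARGS(&allocs[t]));
        device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                                  allocs[t].Get(), nullptr, IID_PPV_ARGS(&lists[t]));
    }
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            // Each thread records its own slice of the scene into its own list.
            // RecordDraws(lists[t].Get(), t);   // hypothetical per-thread recording
            lists[t]->Close();
        });
    }
    for (auto& w : workers) w.join();

    std::vector<ID3D12CommandList*> submit;
    for (auto& l : lists) submit.push_back(l.Get());
    queue->ExecuteCommandLists((UINT)submit.size(), submit.data());
}
```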

I really hope Zen supports TSX. It would be sad if this awesome feature eventually becomes a Xeon-only (server/workstation) feature, like some other Intel extensions. If AMD supports TSX, we will have a good chance of seeing it adopted in important consumer software, meaning that it will become an important addition to consumer CPUs, leading us closer to higher core counts in consumer CPUs. Of course we have many other roadblocks to overcome before all consumer software scales up to 16 cores (= 32 threads).
 
@sebbbi It's interesting you mention TSX; I've read (a little) about this on Ars Technica mainly, and it has always been described as a business/big data feature; games haven't even been mentioned, so this is the first time I hear about TSX and games.

Fascinating! :)
 
@sebbbi It's interesting you mention TSX; I've read (a little) about this on Ars Technica mainly, and it has always been described as a business/big data feature; games haven't even been mentioned, so this is the first time I hear about TSX and games.

Fascinating! :)
Games might actually be a slightly better target, as one can assume that many of the business apps that were looking at high-contention scenarios already moved to fine-grained locking (and, for extra bloody cuts, lock-free). On the contrary, based on presentations here and there, many games / game engines appeared to go with reasonably heavy-handed approaches to locking / mutual-exclusion.
 
On the use of the integrated GPU, personally I'd love nothing more than to see it start getting put to good use in new games, but I fear we're a long way (if ever) from it being well utilized in combination with a discrete GPU, so in the meantime I'd prefer a high-core-count CPU-only option for gamers (and anyone else sporting a dGPU) that doesn't demand enthusiast-level prices. For that matter, I don't see why Intel can't release an 8 core + iGPU model at 14nm.

It seems to me the main reason they're not scaling core counts up (and clock speeds for that matter) is that they have no competitive pressure to do so. May as well keep the die small (more chips per wafer) and the clock speed low (more higher-binned chips) for greater profits than push the core count and clock speed envelopes.
 
Thanks, great read. It seems like there are many business opportunities out there ;).
I guess the question is: is it our tools or our education that's really stopping everyone from making scalable multi-core applications? If we look at the example provided, say HTML, web browsers need to do this in order, otherwise the webpage won't render as we expect it to. But what if we stopped using HTML – could we create a scripting language that would properly render and assemble a page where it should be, or are we over-complicating the problem just for the sake of parallelization?

(edit: just did some googlefu - looks like there are some movements to parallelize web browsers:
https://parlab.eecs.berkeley.edu/research/80)
HTML5 comes with Web Workers, which let you spawn background JS scripts. http://www.html5rocks.com/en/tutorials/workers/basics/ Could in theory help here. I never knew all this stuff existed; I really should thank you guys – this is what happens in the world of sad web form processing. You get complacent, and technology runs on by.

It's an interesting topic to discuss. As mentioned, I assume that in today's world, if we needed mass parallel compute power we would just use GPUs to do it. Especially with the way we're moving towards integrated GPUs on most mobile devices, I don't see CPUs scaling too large for the sake of doing a GPU's job.

We'd need a very real and significant reason (in the consumer application space) for CPUs to increase their core count I think. It may not have arrived yet, but it could arrive later.
 
But didn't people say the same thing about games at the start of the XB360 generation before developers were forced to find new ways to parallelize games to fit the hardware?
That pressure already exists. On the really low end, there are the quad A7s. On the low end, there are a ton of Atom-based Celeron and Pentium laptops out there – those are quad cores with low per-core performance. On the high end, we've had quad cores for years.
And at least the browser developers are working on better parallelization. They've got far too many quad A7 devices to ignore. And as somebody mentioned, MS Office (Excel) is becoming better at it too. When we get enough software, there will be enough use for more cores, and we'll get those cores.
This is not a case of only having single cores to develop parallelized software on. There is a ton of quad cores out there.
Besides which, the i5 4690K "has sixteen cores": in Cinebench, the quad-core 4690K has more performance than a "sixteen-core" Celeron J1900 would.

Have they given us more per core performance though?
I was not speaking in terms of generations.
I was speaking in terms of die size. Those 4 CHW cores are probably way smaller than 2 BDW cores.
My point was that using the same die size, Intel can go for high core counts, high per-core performance, or high GPU performance. For now, high per-core performance and high GPU performance trump high core counts.

Mid-high end gamers who have no need for an IGP are another market segment which Intel isn't particularly well catering to at the moment - if you consider the available technology rather than the competitive landscape.
How are they not catered for? They can get a really fast quad-core for $199 (i5 4590), or an unlocked one for $50 more.

6 years later and we're getting the Core i5-6600K with 4 cores and 6MB of L3 at a price of ~$250, on a 14nm process and a (likely much smaller) die with the same 95W TDP. Sure, we get an extra 1 GHz of clock speed (assuming full boost) and we get some reasonable IPC improvements plus a good IGP. But in 6 years? And a process difference of 45nm vs 14nm? That's a tough pill to swallow for anyone who won't actually be using that IGP.
177mm² for Haswell GT2 vs 296mm² for Lynnfield – down to about 60% of the area.
But.
1.4B vs 774M transistors, an 80% increase.
So we are actually getting way more transistors for the same price. And per-transistor costs aren't actually decreasing that quickly.
Another issue is the non-recurring engineering expense for every type of die. The quad cores with IGP are subsidized by non-gamers and have a ton of customers. The question is, will those cheap octa-cores actually have enough buyers?
And if we go to sixteen cores, there's the issue of separate ring buses with switches (basically, you actually have two octa-core CPUs on a single die, connected by a really fast interconnect). That has not only die size, development and power costs, but also means that software may see funky behavior (as this is effectively NUMA on a die pretending not to be one).
 
I don't see why Intel can't release an 8 core + iGPU model at 14nm.
An 8-core (16-thread) Skylake with 128 MB of EDRAM and the fastest/biggest iGPU would be perfect for my needs. With a quad-channel DDR4 memory controller of course (68 GB/s). That kind of setup would remove the bandwidth bottleneck of the GPU completely. A high TDP (120W) would allow both the CPU and the GPU to run at max turbo much more frequently. This would be a perfect chip for gamedev use.

The iGPU will become more important when DX12 games come out, as there is explicit multiadapter support. Unreal Engine devs have already shown a performance boost from offloading the tail of the frame to the iGPU. Also, asynchronous compute with the iGPU allows low-latency offloading of heavy parallel tasks to the GPU. Intel GPUs can run multiple compute shaders simultaneously (but not mixed compute + graphics), making the Intel GPU a perfect fit for helping the CPU in math-heavy async compute tasks while the discrete GPU renders the graphics. As a developer I am happy that almost all new consumer CPUs have iGPUs, allowing us to spend time targeting these setups. If 16-core CPUs with no iGPU become common, offloading processing (such as physics simulation) to the iGPU is no longer a valid approach, as the discrete GPU is too far away from the CPU to offload tasks with low latency requirements.
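
To make the explicit multiadapter point concrete, DX12/DXGI lets you enumerate the iGPU and the discrete GPU separately and create a device on each. A minimal sketch – checking the Intel vendor ID (0x8086) is a simplification, and a real engine would be smarter about picking adapters:

```cpp
#include <dxgi1_4.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch: pick the integrated and the discrete adapter so each can get its own
// ID3D12Device (explicit multiadapter). Vendor ID 0x8086 = Intel.
void PickAdapters(ComPtr<IDXGIAdapter1>& integrated, ComPtr<IDXGIAdapter1>& discrete) {
    ComPtr<IDXGIFactory4> factory;
    CreateDXGIFactory1(IID_PPV_ARGS(&factory));

    ComPtr<IDXGIAdapter1> adapter;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i) {
        DXGI_ADAPTER_DESC1 desc;
        adapter->GetDesc1(&desc);
        if (desc.Flags & DXGI_ADAPTER_FLAG_SOFTWARE) continue;   // skip the WARP software adapter
        if (desc.VendorId == 0x8086 && !integrated) integrated = adapter;
        else if (!discrete)                         discrete  = adapter;
    }
    // A real renderer would then create one ID3D12Device per adapter and move
    // intermediate results across via cross-adapter shared heaps.
}
```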
 
GPU compute benefits from the integrated GPU, giving similar performance boosts for many parallel algorithms to what the extra 10 cores (running AVX2 code full tilt) in the 18-core 2.3 GHz CPU would give.
One item of note is that the base clock for a Haswell Xeon running AVX code is not what is on the spec page.
http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/5

The 2699 has a base of 1.9 GHz if it detects AVX.
This is time-based throttling as well, rather than a check of the buffers. The altered status can hang around for a millisecond, which is absolutely glacial relative to the cores themselves. It sounds like a physical/electrical consideration and not as effective as Intel's management tends to be. This seems to point to greater difficulty in satisfying the disparate demands high-performance generalist cores are being saddled with, at least for current cores.

I really hope Zen supports TSX. It would be sad if this awesome feature eventually becomes a Xeon-only (server/workstation) feature, like some other Intel extensions. If AMD supports TSX, we will have a good chance of seeing it adopted in important consumer software, meaning that it will become an important addition to consumer CPUs, leading us closer to higher core counts in consumer CPUs. Of course we have many other roadblocks to overcome before all consumer software scales up to 16 cores (= 32 threads).
It would be interesting. AMD has lagged in instruction adoption lately. TSX is a much more involved change than the FMA4-capable Bulldozer line adopting FMA3, and it would point to a significant investment of resources and risk in the memory pipeline.

For such a small number of instructions, the replumbing, access filtering, and handling of the cache hierarchy is a very complicated endeavor in an area of architectural behavior that is very unforgiving of error, as the TSX bug that even Intel, with its vast resources, couldn't avoid attests.
 
Currently Intel has 2 cores in the low end, 4 cores in the mid-range and 6-8 cores in the high-end desktop.
There are enough popular games and applications that show better performance on 4 cores over 2 cores – sometimes not by much, but enough to make many people fork out almost $200 for the cheapest quad-core, and enough to justify 4 cores as the new baseline on the desktop. The same cannot be said about 6, 8 or more cores, though.
So, if Intel were to shift from 2/4/6-8 cores, as described above, to 4/6/8-10 cores in the next generation, which would make $50 3 GHz quad-core Celerons a reality, how many people would still want to pay $200+ for mid or high-end CPUs?
 
Also, asynchronous compute with the iGPU allows low-latency offloading of heavy parallel tasks to the GPU. Intel GPUs can run multiple compute shaders simultaneously (but not mixed compute + graphics), making the Intel GPU a perfect fit for helping the CPU in math-heavy async compute tasks while the discrete GPU renders the graphics. As a developer I am happy that almost all new consumer CPUs have iGPUs, allowing us to spend time targeting these setups. If 16-core CPUs with no iGPU become common, offloading processing (such as physics simulation) to the iGPU is no longer a valid approach, as the discrete GPU is too far away from the CPU to offload tasks with low latency requirements.

Thanks, this certainly has me convinced that we should retain the IGPs even if they're not being used in current games. On that basis I'd say 16-core desktop CPUs are definitely a bad idea right now. I'd still like to see the transition to 8 cores plus IGP at the mainstream level though, now that we've hit 14nm. I can't believe that would be a challenge for Intel on the new process – but of course it would reduce margins, so until we see some real competition it won't happen.
 
Why wouldn't you want the most powerful GPU you can get your hands on?
Obviously, as a graphics programmer I would use two high-end discrete GPUs as well (in addition to the iGPU). The problem right now is that if I want an 8-core (16-thread) CPU (for fast compile times) I can't have an iGPU. This means that I can't optimize my rendering code for Intel GPUs or test my explicit multiadapter DX12 code path unless I have a secondary computer just for that purpose. I will not switch my workstation 8-core Xeon to a 4-core i7 just to get an iGPU. I hope that Intel releases an 8-core CPU (Xeon or i7) with a full-scale iGPU. Preferably this CPU would also have EDRAM L4 cache (as that boosts compile times and iGPU performance).

This would be a perfect CPU for graphics programmers. I don't know if any other fields are interested in a configuration like this. Intel's LLC is shared between the CPU and the GPU, and the CPU and the GPU share virtual address space and support cache coherence. With 20+ MB of L3 and 128 MB of L4 cache, these chips would be perfect for low-latency mixed CPU+GPU processing.
 