Intel is pretty good at working out what costs more, so if it keeps building integrated chips, it's very likely the company ends up with more money from that design choice than it would from the alternative.
 
There is a cheap 16-core: an Opteron that looks like a dual 8-core FX, on a socket with quad-channel memory (G34). "Cheap" as in well under $1000 for e.g. the Opteron 6378, which is 2.4GHz Piledriver (+ turbo).

But it's not that good compared to 6-core Sandy Bridge and up.
 
One item of note is that the base clock a Haswell Xeon runs at under AVX is not what is on the spec page.
http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/5

The E5-2699 v3 has a base of 1.9GHz if it detects AVX.
This is time-based throttling as well, rather than a check of the buffers. The altered state can hang around for a millisecond, which is absolutely glacial relative to the cores themselves. It sounds like a physical/electrical consideration, and not as effective as Intel's power management tends to be. This seems to point to greater difficulty in satisfying the disparate demands high-performance generalist cores are being saddled with, at least for current cores.
Should we expect the AVX base clock to drop further (relative to normal base clock) with the upcoming introduction of AVX-512?
 
I can see the rationale, since the amount of hardware being driven is increasing significantly. It could widen the range, or insert additional clock brackets keyed on criteria other than just AVX vs. no AVX.
I have only seen a clear reference to the different AVX range being a consideration for Xeons.

I'm not clear on whether client versions of Skylake will have AVX-512; at least for now it is a Skylake Xeon and Xeon Phi extension.

Skylake is a different architecture, so the circumstances that led to this specific mode could be significantly different in later chips. One possibility is that the different clock modes, and the apparent restriction of AVX-512 to Xeon/HPC chips, point to a more complex trade-off than Intel found worthwhile in the client space.

That there can be different clock limits based on instruction usage doesn't strike me as ideal, but it is understandable. What I found noteworthy is that there may be non-obvious reasons why Intel's clock management has comparatively long wall-clock latencies for switching modes, and why its activation is rather coarse instead of querying activity or what's in the queues at nanosecond and microsecond granularity.
That could be something that might change with an improved architecture.
 
What I found interesting is how much hotter my Haswell-E gets running AVX2 code (I think). I'm away for a bit so I can't get exact numbers, but Prime95's torture test (AVX2-based) causes my CPU to run absurdly hotter than other benchmarks that max it out (I can't remember exactly, but I want to say something like 20 degrees hotter). Now obviously this is highly flawed (who knows what Prime95 is doing compared to other benchmarks), but I do suspect at least some of the extra heat comes from exercising the wider vector units.

It would be interesting if there were a multithreaded microbenchmark that let someone compare the power usage of 128-bit vector instructions against their 256-bit counterparts. I would be curious to know the "cost" of going wider.
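Something along these lines would do as a starting point. This is a minimal, hypothetical sketch (the file name, functions and structure are mine, not an existing tool): it pins every thread on either 128-bit or 256-bit FMAs for a fixed interval so package power and temperature can be compared with an external reader such as turbostat or the RAPL counters. It generates load only; it measures nothing itself.

```c
/* Rough load generator, not a measurement tool: every thread spins on either
 * 128-bit or 256-bit FMAs for a fixed time so package power/temperature can be
 * compared with an external reader (e.g. turbostat or RAPL counters).
 * Build with something like:  gcc -O2 -mavx2 -mfma -fopenmp simd_burn.c -o simd_burn
 * Run:  ./simd_burn 128 10    then    ./simd_burn 256 10
 */
#include <immintrin.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

static void burn128(double seconds) {
    __m128 a = _mm_set1_ps(1.0001f), b = _mm_set1_ps(0.9999f);
    __m128 c0 = _mm_set1_ps(0.5f), c1 = c0, c2 = c0, c3 = c0;
    double end = omp_get_wtime() + seconds;
    while (omp_get_wtime() < end) {
        /* Four independent accumulators to keep the FMA pipes reasonably full. */
        for (int i = 0; i < (1 << 16); i++) {
            c0 = _mm_fmadd_ps(a, b, c0);
            c1 = _mm_fmadd_ps(a, b, c1);
            c2 = _mm_fmadd_ps(a, b, c2);
            c3 = _mm_fmadd_ps(a, b, c3);
        }
    }
    /* Keep the results live so the loops aren't optimized away. */
    volatile float sink = _mm_cvtss_f32(_mm_add_ps(_mm_add_ps(c0, c1), _mm_add_ps(c2, c3)));
    (void)sink;
}

static void burn256(double seconds) {
    __m256 a = _mm256_set1_ps(1.0001f), b = _mm256_set1_ps(0.9999f);
    __m256 c0 = _mm256_set1_ps(0.5f), c1 = c0, c2 = c0, c3 = c0;
    double end = omp_get_wtime() + seconds;
    while (omp_get_wtime() < end) {
        for (int i = 0; i < (1 << 16); i++) {
            c0 = _mm256_fmadd_ps(a, b, c0);
            c1 = _mm256_fmadd_ps(a, b, c1);
            c2 = _mm256_fmadd_ps(a, b, c2);
            c3 = _mm256_fmadd_ps(a, b, c3);
        }
    }
    __m256 sum = _mm256_add_ps(_mm256_add_ps(c0, c1), _mm256_add_ps(c2, c3));
    volatile float sink = _mm_cvtss_f32(_mm256_castps256_ps128(sum));
    (void)sink;
}

int main(int argc, char **argv) {
    int width   = (argc > 1) ? atoi(argv[1]) : 256;   /* 128 or 256 */
    double secs = (argc > 2) ? atof(argv[2]) : 10.0;
    printf("Spinning %d-bit FMAs on %d threads for %.0f s; watch power/temps externally.\n",
           width, omp_get_max_threads(), secs);
    #pragma omp parallel
    {
        if (width == 128) burn128(secs);
        else              burn256(secs);
    }
    return 0;
}
```

Run it once per width and compare the reported package watts and temperatures while it spins; the difference between the two runs is a crude upper bound on what "going wider" costs on that particular chip.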
 
What I found interesting is how much hotter my Haswell-E gets running AVX2 code (I think).
Yeah there's really no magic here guys - it takes a lot more power to be doing all that extra SIMD stuff and the vast majority of code barely touches the throughput of Haswell, etc. Thus rather than constrain *all code* to lower frequencies, it makes more sense to only constrain code that is actually using a significant portion of the machine (i.e. AVX2). I agree that it is non-ideal that it takes so "long" to transition between the states, but presumably there's a fair amount of complexity there. I don't believe normal turbo bin transitions happen at a much higher rate than that anyways, correct?
 
I'm not clear on whether client versions of Skylake will have AVX-512; at least for now it is a Skylake Xeon and Xeon Phi extension.

Skylake is a different architecture, so the circumstances that led to this specific mode could be significantly different in later chips. One possibility is that the different clock modes, and the apparent restriction of AVX-512 to Xeon/HPC chips, point to a more complex trade-off than Intel found worthwhile in the client space.

Knowing Intel, I'd guess that consumer Skylake has the AVX-512 hardware built in but disabled (like 64-bit in Prescott or Hyper-Threading in Northwood, at the beginning).
They can use it internally to weed out the bugs, so by the time the architecture goes to -E/-EP and later -EX it's less and less buggy. That's how I interpret their strategy with hardware and microcode bugs in general.

The early target is also laptops, even the most "mobile" of them, including the "workstation tablets" from Microsoft.
So if power management of high-width instructions is a tricky problem or if they're not that desirable in that context, that's another reason to leave them out.

Kaby Lake might include them but if so Intel might want to communicate about it later.
 
Yeah there's really no magic here guys - it takes a lot more power to be doing all that extra SIMD stuff and the vast majority of code barely touches the throughput of Haswell, etc. Thus rather than constrain *all code* to lower frequencies, it makes more sense to only constrain code that is actually using a significant portion of the machine (i.e. AVX2). I agree that it is non-ideal that it takes so "long" to transition between the states, but presumably there's a fair amount of complexity there. I don't believe normal turbo bin transitions happen at a much higher rate than that anyways, correct?

Measurements in the following paper found that P-state transitions complete in tens of microseconds.
https://www.usenix.org/system/files/conference/atc14/atc14-paper-wamhoff.pdf

The millisecond latency is also long relative to another known AVX-specific mode change, found on Sandy Bridge, where going to 256-bit operations required several hundred floating-point instructions to warm up to full throughput. That's a fair number of cycles, but at multi-GHz clocks it works out to tens to a hundred nanoseconds or so for the CPU to determine that 256-bit AVX was needed and for the pipeline to reach full utilization.

Intel's DVFS and voltage regulation capabilities are very effective and very fast, so opting for a method that takes at least a millisecond and can kick in due to the detection of one AVX instruction does show there are complexities that the above methods do not handle. I think this could be an area that future designs would try to bring in line with other latencies and provide a little more dynamism in the response.
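These latencies are at least roughly observable from user space. Below is a rough, hypothetical sketch (all names are mine): it runs scalar-only work for a while, then times back-to-back chunks of 256-bit AVX math, so the first chunks coming out slower than the steady state would reflect the warm-up and/or the clock transition. It only reports wall-clock time per chunk; attributing the difference to frequency versus execution throughput would still need performance counters or turbostat running alongside.

```c
/* Sketch: time back-to-back chunks of 256-bit work immediately after a stretch
 * of scalar-only code, to see how long the first AVX chunks stay slow.
 * Plain AVX (no FMA) so it also runs on Sandy Bridge, where the warm-up was
 * first described. Build:  gcc -O2 -mavx avx_ramp.c -o avx_ramp
 */
#include <immintrin.h>
#include <stdio.h>
#include <time.h>

#define CHUNK_ITERS 20000   /* 256-bit mul+add pairs per timed chunk */
#define CHUNKS      200

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    /* Phase 1: scalar-only busy work so the core settles into its non-AVX state. */
    volatile double x = 1.0;
    for (long i = 0; i < 100000000L; i++)
        x = x * 1.0000001 + 1e-9;

    /* Phase 2: timed chunks of 256-bit multiplies and adds. */
    __m256d a = _mm256_set1_pd(1.0000001), b = _mm256_set1_pd(0.5);
    for (int c = 0; c < CHUNKS; c++) {
        double t0 = now_ns();
        for (int i = 0; i < CHUNK_ITERS; i++) {
            b = _mm256_mul_pd(b, a);
            b = _mm256_add_pd(b, a);
        }
        double t1 = now_ns();
        printf("chunk %3d: %8.0f ns\n", c, t1 - t0);
    }

    /* Keep b live so the timed loop is not optimized away. */
    double out[4];
    _mm256_storeu_pd(out, b);
    printf("(ignore) %g\n", out[0]);
    return 0;
}
```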
 
extra cost

This is counterintuitive, but it's entirely possible that having those extra few square millimeters of silicon that you aren't using reduces the cost of your chip.

When building chips on bleeding edge foundries, the initial costs of spinning out a design, making masks for it and validating it are massive. Massive enough that for any chip other than the very highest volume ones they are at least a significant portion of the total lifetime cost of making the chips.

All the following numbers are totally made up, but they illustrate the point:

Let's say that you are an enthusiast who wants a chip with the fastest CPU and no GPU, and there are a hundred thousand like you in the market. There are also 5 million customers who want a chip with an IGP. The marginal cost of manufacturing a chip without an IGP is $30, while the marginal cost of manufacturing a chip with one is $50. However, the initial cost of getting a design rolling is $10M for the IGP chip and $5M for the IGPless chip, and this needs to be amortized over all sold chips.

If the CPU maker makes just one design, the per-chip cost is ~$52 for everyone, while if they make two chips, the chips with an IGP cost $52 and the chips without one cost $80. ... At which point all the customers who would have bought the IGP-less chip wonder why the CPU maker is trying to fleece them and buy the version with the IGP instead.
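For what it's worth, the arithmetic behind those figures is just marginal cost plus design cost amortized over units sold; a trivial check with the made-up numbers above:

```c
/* Back-of-envelope for the made-up numbers in this post: amortized cost per
 * chip = marginal manufacturing cost + (design cost / units sold). */
#include <stdio.h>

static double per_chip(double design_cost, double marginal, double units) {
    return marginal + design_cost / units;
}

int main(void) {
    /* One shared design with an IGP, sold to everyone (5.1M units). */
    printf("single design, with IGP: $%.2f\n", per_chip(10e6, 50.0, 5.1e6));
    /* Two designs: IGP chip for 5M buyers, IGP-less chip for 100k enthusiasts. */
    printf("split: IGP chip:         $%.2f\n", per_chip(10e6, 50.0, 5.0e6));
    printf("split: IGP-less chip:    $%.2f\n", per_chip(5e6, 30.0, 0.1e6));
    return 0;
}
```

which prints roughly $52, $52 and $80 for the three cases.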

In the era of multi-patterning, CPUs made at the newest processes need sales in the millions to amortize the initial costs. The initial costs of $5M I used are seriously lowballing it. If Intel could actually make money by selling you a different kind of top-end CPU, they would do it.
 
Particularly for gaming, aren't we essentially limited by the console focus? Intel has leapt so far beyond previous and current gen that it doesn't seem like multicore will be an issue until the end of time.

DX12/Vulkan etc. look to alleviate CPU pressure as well...

?
 
Particularly for gaming, aren't we essentially limited by the console focus? Intel has leapt so far beyond previous and current gen that it doesn't seem like multicore will be an issue until the end of time.

DX12/Vulkan etc. look to alleviate CPU pressure as well...

?

This is one of the reasons why I'd like to see 8 cores become mainstream. Presumably being able to map a single console core to a single PC core would be easier for developers and beneficial from a performance POV, especially when we have DX12, which should offer similar scaling to consoles. Obviously it doesn't take a hyperthreaded 4Ghz Skylake core to match a 1.6 Ghz Jaguar core but when VR hits we're going to want to be pushing 30fps console games at a locked 90fps, possibly with extra CPU intensive effects applied. That's going to take a huge amount of additional CPU power, and I really don't want my quad core having to take that on at the same time as doing the work of 2 Jaguar cores per PC core.
 
This is one of the reasons why I'd like to see 8 cores become mainstream. Presumably being able to map a single console core to a single PC core would be easier for developers and beneficial from a performance POV, especially when we have DX12, which should offer similar scaling to consoles.
Nah, there's really no benefit there. A single core at double the frequency is pretty much strictly superior to two cores at half that frequency in terms of raw performance, so even if you were extremely generous to the IPC of the consoles and pretended that games can actually use all 8 cores and barely touch floating-point operations, a 3GHz quad core is still going to run circles around the console CPUs. In practice, even a dual core has no problems competing given the large gap in IPC, cache, SIMD, etc.

There is literally no point in 8 cores for pure gamers until games start doing more total work on the CPU. They're great for productivity though :)

Obviously it doesn't take a hyperthreaded 4Ghz Skylake core to match a 1.6 Ghz Jaguar core but when VR hits we're going to want to be pushing 30fps console games at a locked 90fps, possibly with extra CPU intensive effects applied.
In this case I don't think there's any direct comparison to be made between VR and consoles here to be honest. For VR you do whatever you can afford while still hitting the relevant performance. There are likely to be few if any compelling VR experiences that are "ported" from non-VR, fewer from consoles and basically zero that are ported naively enough to not receive major modifications when running in VR anyways. VR content really needs to be designed directly for VR as the constraints across the board from game design to rendering tradeoffs and choices are very different.

Do you want a fast CPU for VR? Sure, you want fast everything. But the comparative scaling of the GPU power dwarfs what you need on the CPU really.

Personally I'd love to see more games make use of the vast amount of CPU power that is actually available on quad core PCs these days already (let alone on 6-8 core machines), but realistically it's just not going to happen unless games sort out a better way to scale CPU load that is acceptable from a design perspective. That is really what is at the crux of this matter for gamers - if your game needs to run at all on a dual core machine and a console, it's likely not going to be able to scale up to "max out" a quad, let alone more cores of similar speeds.
 
In the era of multi-patterning, CPUs made at the newest processes need sales in the millions to amortize the initial costs. The initial costs of $5M I used are seriously lowballing it. If Intel could actually make money by selling you a different kind of top-end CPU, they would do it.

AMD reflects this in a different manner: there's only one FX chip and only one Kaveri chip. They eat the cost and sell various disabled models just so they can do one chip for each product line of a given generation. If they stick to what they have announced, though, with Zen they'll do a quad-core APU (CPU+GPU) and an eight-core CPU without a GPU.

Ignoring any additional low-power options in the Atom/Jaguar/Core M segments, that's only two chips for the whole market. Even fewer than the number of CPU+GPU configurations Intel offers.
 
Not to mention that AMD's Bulldozer is also a server chip & has an iGPU option in the motherboard chipset.
 
Nah, there's really no benefit there. A single core at double the frequency is pretty much strictly superior to two cores at half that frequency in terms of raw performance, so even if you were extremely generous to the IPC of the consoles and pretended that games can actually use all 8 cores and barely touch floating-point operations, a 3GHz quad core is still going to run circles around the console CPUs. In practice, even a dual core has no problems competing given the large gap in IPC, cache, SIMD, etc.
Broadwell IPC is roughly 2x compared to Jaguar. 3.5 GHz Broadwell is 2x clock rate compared to 1.75 GHz Jaguar. Jaguar has twice the number of cores. One Broadwell core (at 3.5GHz) is roughly as fast as 4 Jaguar cores. As you said, a high clocked dual core Broadwell pretty much matches the 8 core Jaguar when both are running perfectly threaded code. A quad core Broadwell would be twice as fast as the 8 core Jaguar.

A fast Broadwell dual core is certainly enough if you want to run the console ports at console settings ("medium") and console frame rate (30 fps). If you want to run the console ports at (locked) 60 fps, the CPU requirement doubles, and a quad core is going to be required, once the developers max out the Jaguar cores (7 cores are available for games now). It took quite a long time to get maximum performance out of the Xbox 360 PPC core. Jaguar is easier to utilize, but it takes time to learn it perfectly. Naughty Dog had a very good presentation at GDC 2015 about this: http://www.gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine.

Only time will tell whether a 4-core PC CPU is going to be enough when this console generation ends. PC gamers love 60 fps + ultra settings with long view distances (lots of draw calls). Even with DX12, the PC has higher draw-call driver overhead compared to consoles. Asynchronous compute is starting to be popular in console games, and some compute jobs require low latency (data needs to be available to the CPU during the same frame). Code like this cannot be executed on discrete GPUs. You have two choices: either run it on the iGPU (not always available), or run it with AVX on the CPU. If the latter is chosen, the PC port will be more taxing on the CPU than the console version (especially when combined with longer view distances and other high-quality settings).
Personally I'd love to see more games make use of the vast amount of CPU power that is actually available on quad core PCs these days already (let alone on 6-8 core machines), but realistically it's just not going to happen unless games sort out a better way to scale CPU load that is acceptable from a design perspective. That is really what is at the crux of this matter for gamers - if your game needs to run at all on a dual core machine and a console, it's likely not going to be able to scale up to "max out" a quad, let alone more cores of similar speeds.
Yes, game design is a problem for CPU scaling. You don't want to have more enemies and/or NPCs on more powerful CPUs. However, frame-rate scaling from 30 fps (dual-core mobile = equal to console fps) to 60 fps (gaming desktop) to 90 fps (VR) already triples the CPU cost. Scaling to ultra settings increases the CPU cost on top of that. It is highly probable that some VR games (at the end of this console generation) will benefit greatly from an 8-core CPU. Of course these games will also require monster GPUs to run properly :)
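Putting the rough numbers from these posts in one place (all figures are this thread's approximations, not measurements), the back-of-envelope looks like this:

```c
/* Back-of-envelope using the approximate figures quoted in this thread:
 * per-core speed scales with IPC x clock, total CPU budget with frame rate. */
#include <stdio.h>

int main(void) {
    double jaguar_clock = 1.75, broadwell_clock = 3.5;  /* GHz */
    double ipc_ratio = 2.0;                             /* Broadwell vs Jaguar, approx. */
    double per_core = ipc_ratio * broadwell_clock / jaguar_clock;   /* ~4x */
    printf("one Broadwell core ~= %.1f Jaguar cores\n", per_core);
    printf("quad Broadwell ~= %.1f Jaguar cores (the console exposes 7 to games)\n",
           4 * per_core);

    /* CPU-side cost of raising the frame rate, relative to the 30 fps console target. */
    int targets[] = {30, 60, 90};
    for (int i = 0; i < 3; i++)
        printf("%d fps: %.1fx the per-second CPU work, %.1f ms per frame\n",
               targets[i], targets[i] / 30.0, 1000.0 / targets[i]);
    return 0;
}
```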
 
Thinking about the lack of support for many cores in games makes me Rage... actually it does feel like treading water lately, watching progress slow.
 
The problem with multithreading isn't really that it's hard; it's more that a lot of programmers don't have the discipline and rigor to do it, and many also still think that Object-Oriented Programming is a panacea, which is pretty much the opposite of what you want for making fast (multithreaded or not) programs. (Note that going for SoA and arrays of data vs. lists of objects does help split work for concurrency, but you still have to care about data dependencies; see the sketch below.)
So the first problem is thought, which has massive latency; the second problem is software latency: although there are good libraries to get concurrency working for a lot of the simple problems games face, there are massive code bases no one masters anymore and that take forever to parallelise...
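To make the SoA point concrete, here's a tiny hypothetical example (the particle fields are invented for illustration) contrasting the object-style layout with the array-style one:

```c
/* Purely illustrative data-layout comparison: the same "drop everything by dy"
 * update written against an array-of-structures (AoS) layout and a
 * structure-of-arrays (SoA) layout. */
#include <stdio.h>
#include <stddef.h>

#define N 100000

/* AoS: the typical object-style layout, all fields of one particle together. */
struct ParticleAoS { float x, y, z, mass; };
static struct ParticleAoS aos[N];

/* SoA: one contiguous array per field. */
struct ParticlesSoA { float x[N], y[N], z[N], mass[N]; };
static struct ParticlesSoA soa;

/* The AoS loop strides over interleaved fields, pulling x/z/mass through the
 * cache even though only y is touched. */
static void drop_aos(float dy) {
    for (size_t i = 0; i < N; i++)
        aos[i].y -= dy;
}

/* The SoA loop streams through one dense array; iterations are independent, so
 * it vectorizes easily and can be chunked across threads without any sharing. */
static void drop_soa(float dy) {
    for (size_t i = 0; i < N; i++)
        soa.y[i] -= dy;
}

int main(void) {
    drop_aos(0.1f);
    drop_soa(0.1f);
    printf("aos[0].y = %f, soa.y[0] = %f\n", aos[0].y, soa.y[0]);
    return 0;
}
```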

I also want to see 8 cores becoming the norm, and I'm fine with having an on-board high-latency, high-throughput, massively parallel math co-processor, as long as it's used for that rather than graphics ;)
(Ok you can use it for graphics too and it can increase performance, but it might not be the best use of it. That will have to be tested.)
 