Where are the cheap 16-core desktop processors?

Discussion in 'PC Industry' started by Albuquerque, Jul 21, 2015.

  1. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    14,712
    Likes Received:
    2,166
    Tnx Sebbi.

    extra cost
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,069
    Likes Received:
    2,739
    Location:
    Well within 3d
    Intel is pretty good at determining what costs more, so if it keeps creating integrated chips, it very likely makes more money from that design choice than it would from the alternative.
     
    Kaarlisk likes this.
  3. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    There is a cheap 16-core: an Opteron that looks like a dual 8-core FX, on a socket with quad-channel memory (G34). "Cheap" as in well under $1000 for e.g. the Opteron 6378, which is a 2.4GHz Piledriver (+ turbo).

    But it's not that good compared to 6-core Sandy Bridge and up.
     
  4. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    759
    Likes Received:
    198
    Should we expect the AVX base clock to drop further (relative to normal base clock) with the upcoming introduction of AVX-512?
     
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,069
    Likes Received:
    2,739
    Location:
    Well within 3d
    I can see the benefit, since the amount of hardware being driven is increasing significantly. It could increase the range, or insert additional brackets for different reasons besides AVX or no AVX.
    I have only seen a clear reference to the different AVX range being a consideration for Xeons.

    I'm not clear on whether client versions of Skylake will have AVX512; at least for now it is a Skylake Xeon and Xeon Phi extension.

    Skylake is a different architecture, so the circumstances that led to this specific mode could be significantly different in later chips. One possibility is that the different clock modes and the apparent use of AVX512 in Xeon/HPC chips point to a more complex trade-off than Intel found worthwhile in the client space.

    That there can be different clock limits based on instruction usage doesn't strike me as ideal, but it is understandable. What I found noteworthy is that there may be non-obvious reasons for why Intel's clock management has comparatively long wall-clock latencies for switching modes, and it's rather coarse in activation rather than querying activity or what's on the queues at nanosecond and microsecond speeds.
    That could be something that might change with an improved architecture.
     
    Grall and iMacmatician like this.
  6. willardjuice

    willardjuice super willyjuice
    Moderator Veteran Alpha Subscriber

    Joined:
    May 14, 2005
    Messages:
    1,366
    Likes Received:
    219
    Location:
    NY
    What I found interesting is how much hotter my Haswell-E gets running AVX2 code (I think). I'm away for a bit so I can't get exact numbers, but Prime95's torture test (AVX2 based) causes my CPU to run absurdly hotter than other benchmarks that max it out (I can't remember exactly but I want to say something like 20 degrees hotter). Now obviously this is highly flawed (who knows what Prime95 is doing compared to other benchmarks) but I do suspect at least some of the extra heat is generated from utilizing the larger vector set.

    It would be interesting if there was a multithreaded micro benchmark that allowed someone to compare the power usage of 128-bit vector instructions versus their 256-bit counterparts. I would be curious to know the "cost" of going wider.
     
  7. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Yeah there's really no magic here guys - it takes a lot more power to be doing all that extra SIMD stuff and the vast majority of code barely touches the throughput of Haswell, etc. Thus rather than constrain *all code* to lower frequencies, it makes more sense to only constrain code that is actually using a significant portion of the machine (i.e. AVX2). I agree that it is non-ideal that it takes so "long" to transition between the states, but presumably there's a fair amount of complexity there. I don't believe normal turbo bin transitions happen at a much higher rate than that anyways, correct?
     
  8. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    Knowing Intel I guess that consumer Skylake has the AVX512 hardware built-in, but it's disabled (like 64-bit in Prescott or HT in Northwood, at the beginning).
    They can use it internally to weed out the bugs, so when the arch goes to -E/-EP and then later -EX it's less and less bugged. It's how I interpret the strategy, relative to all hardware or microcode bugs in general.

    The early target is also laptops or even the most "mobile" of them, including the "workstation tablets" from Microsoft.
    So if power management of high-width instructions is a tricky problem or if they're not that desirable in that context, that's another reason to leave them out.

    Kaby Lake might include them but if so Intel might want to communicate about it later.
     
  9. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    That's right, we're discussing Skylake on the multi-core thread and multi-core on the Skylake thread? :lol:
     
    tabs and iMacmatician like this.
  10. Kaarlisk

    Regular Newcomer Subscriber

    Joined:
    Mar 22, 2010
    Messages:
    293
    Likes Received:
    49
    At least Socket 1150 Haswell boosts voltage whenever it encounters AVX2 code.
     
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,069
    Likes Received:
    2,739
    Location:
    Well within 3d
    Measurements in the following paper found that P-state transitions take tens of microseconds.
    https://www.usenix.org/system/files/conference/atc14/atc14-paper-wamhoff.pdf

    The millisecond latency is longer relative to another known AVX-specific mode change that was found for Sandy Bridge, where going to 256 bits required several hundred floating point operations to warm up to full throughput. That's a fair amount of cycles, but at multi-GHz range it sounds like tens to a hundred nanoseconds or so for the CPU to determine that AVX-256 was needed and for the pipeline to reach full utilization.

    Intel's DVFS and voltage regulation capabilities are very effective and very fast, so opting for a method that takes at least a millisecond and can kick in due to the detection of one AVX instruction does show there are complexities that the above methods do not handle. I think this could be an area that future designs would try to bring in line with other latencies and provide a little more dynamism in the response.
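    To put the Sandy Bridge warm-up comparison above into numbers, here is a rough sketch of the cycle-to-time conversion. The 300-cycle count and the 3 GHz clock are assumptions picked to stand in for "several hundred" and "multi-GHz"; only the one millisecond lower bound comes from the discussion.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a cycle count to nanoseconds (1 GHz is one cycle per ns)."""
    return cycles / clock_ghz


# Assumed values: "several hundred" taken as 300 cycles, at an assumed 3 GHz.
warmup_ns = cycles_to_ns(300, 3.0)

# The AVX clock-mode switch is described as taking at least a millisecond.
mode_switch_ns = 1_000_000.0

print(f"AVX-256 warm-up: about {warmup_ns:.0f} ns")
print(f"AVX clock-mode switch: at least {mode_switch_ns / warmup_ns:,.0f}x longer")
```

    Under those assumptions the millisecond-scale mode change is around four orders of magnitude slower than the warm-up.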
     
  12. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    532
    Likes Received:
    163
    This is counterintuitive, but it's entirely possible that having those extra few square millimeters of silicon that you aren't using reduces the cost of your chip.

    When building chips on bleeding edge foundries, the initial costs of spinning out a design, making masks for it and validating it are massive. Massive enough that for any chip other than the very highest volume ones they are at least a significant portion of the total lifetime cost of making the chips.

    All the following numbers are totally made up, but they illustrate the point:

    Let's say that you are an enthusiast who wants a chip with the fastest CPU and no GPU, and there are a hundred thousand like you in the market. There are also 5 million customers who want a chip with an IGP. The marginal cost of manufacturing a chip without an IGP is $30, while the marginal cost of manufacturing a chip with one is $50. However, the initial cost of getting a design rolling is $10M for the IGP chip and $5M for the IGPless chip, and this needs to be amortized over all sold chips.

    If the CPU maker makes just one design, the per-chip manufacturing costs are ~$52 for each, while if they make two chips, the chips with IGP cost $52 and the chips without one cost $80 ($30 marginal plus $5M spread over 100,000 chips). ... At which point all the customers who would have bought it wonder why the CPU maker is trying to fleece them and buy the version with the IGP instead.

    In the era of multi-patterning, CPUs made at the newest processes need sales in the millions to amortize the initial costs. The initial costs of $5M I used are seriously lowballing it. If Intel could actually make money by selling you a different kind of top-end CPU, they would do it.
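    The amortization argument above can be checked with a few lines. This is only a sketch using the made-up figures from the post; the function and variable names are mine. With the figures as stated, the dedicated IGP-less design comes to $80 per chip, still far above the roughly $52 of the shared design.

```python
def per_chip_cost(marginal: float, design_cost: float, units: int) -> float:
    """Marginal manufacturing cost plus the design cost amortized over all units sold."""
    return marginal + design_cost / units


ENTHUSIASTS = 100_000    # buyers who want no IGP (made-up figure from the post)
IGP_BUYERS = 5_000_000   # buyers who want an IGP (made-up figure from the post)

# One design: everyone gets the IGP chip, one $10M design cost shared by all buyers.
one_design = per_chip_cost(50, 10_000_000, ENTHUSIASTS + IGP_BUYERS)

# Two designs: each design cost is carried only by that chip's own buyers.
igp_chip = per_chip_cost(50, 10_000_000, IGP_BUYERS)
igpless_chip = per_chip_cost(30, 5_000_000, ENTHUSIASTS)

print(f"one design, any buyer:  ${one_design:.2f}")
print(f"two designs, IGP chip:  ${igp_chip:.2f}")
print(f"two designs, no IGP:    ${igpless_chip:.2f}")
```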
     
    #32 tunafish, Jul 26, 2015
    Last edited: Jul 26, 2015
    Kaarlisk likes this.
  13. AlBran

    AlBran Ferro-Fibrous
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    20,540
    Likes Received:
    5,644
    Location:
    ಠ_ಠ
    Particularly for gaming, aren't we essentially limited by the console focus? Intel has leapt so far beyond previous and current gen that it doesn't seem like multicore will be an issue until the end of time.

    DX12/Vulkan etc. look to alleviate CPU pressure as well...

    ?
     
  14. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,583
    Likes Received:
    703
    Location:
    Guess...
    This is one of the reasons why I'd like to see 8 cores become mainstream. Presumably being able to map a single console core to a single PC core would be easier for developers and beneficial from a performance pov, especially when we have DX12, which should offer similar scaling to consoles. Obviously it doesn't take a hyperthreaded 4GHz Skylake core to match a 1.6GHz Jaguar core, but when VR hits we're going to want to be pushing 30fps console games at a locked 90fps, possibly with extra CPU-intensive effects applied. That's going to take a huge amount of additional CPU power, and I really don't want my quad core to have to take that on at the same time as doing the work of 2 Jaguar cores per PC core.
     
  15. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Nah, there's really no benefit there. A single core at double the frequency is pretty much strictly superior to two cores at half the frequency in terms of raw performance, so even if you were extremely generous to the IPCs of consoles and pretended that games can actually use all 8 cores and don't really touch floating point operations, a 3GHz quad core is still going to run circles around the console CPUs. In practice, even a dual core has no problems competing given the large gap in IPC, cache, SIMD, etc.
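    One way to see the "one fast core beats two slow ones" point is Amdahl's law scaled by clock speed. This is a simplified sketch that assumes equal IPC and ignores memory, cache and scheduling effects; the function name and the sample parallel fractions are illustrative.

```python
def relative_throughput(cores: int, clock: float, parallel_fraction: float) -> float:
    """Amdahl's law scaled by clock: speed relative to one core at clock 1.0."""
    serial_fraction = 1.0 - parallel_fraction
    return clock / (serial_fraction + parallel_fraction / cores)


for p in (0.0, 0.5, 0.9, 1.0):
    fast_single = relative_throughput(cores=1, clock=2.0, parallel_fraction=p)
    slow_dual = relative_throughput(cores=2, clock=1.0, parallel_fraction=p)
    print(f"parallel fraction {p:.1f}: "
          f"1 core @ 2x clock = {fast_single:.2f}, 2 cores @ 1x clock = {slow_dual:.2f}")
```

    In this model the two slower cores only catch up when the work is perfectly parallel; for anything less, the single fast core stays ahead.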

    There is literally no point in 8 cores for pure gamers until games start doing more total work on the CPU. They're great for productivity though :)

    I don't think there's any direct comparison to be made between VR and consoles here, to be honest. For VR you do whatever you can afford while still hitting the relevant performance. There are likely to be few if any compelling VR experiences that are "ported" from non-VR, fewer from consoles and basically zero that are ported naively enough to not receive major modifications when running in VR anyways. VR content really needs to be designed directly for VR as the constraints across the board from game design to rendering tradeoffs and choices are very different.

    Do you want a fast CPU for VR? Sure, you want fast everything. But the comparative scaling of the GPU power dwarfs what you need on the CPU really.

    Personally I'd love to see more games make use of the vast amount of CPU power that is actually available on quad core PCs these days already (let alone on 6-8 core machines), but realistically it's just not going to happen unless games sort out a better way to scale CPU load that is acceptable from a design perspective. That is really what is at the crux of this matter for gamers - if your game needs to run at all on a dual core machine and a console it's likely not going to be able to scale up to "max out" a quad, let alone more cores of similar speeds.
     
  16. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    AMD is reflecting this in a different manner: there's only one FX chip, there's only one Kaveri chip. They eat it up and sell various disabled models just so they can do one chip for each product of a given generation. If they stick to what they have announced though, with Zen they'll do a quad core APU (CPU+GPU) and an eight core CPU without a GPU.

    Ignoring any additional low power options in the Atom/Jaguar/Core M segments, that's only two chips for the whole market. Even fewer than the configurations of CPU+GPU from Intel.
     
  17. Kaarlisk

    Regular Newcomer Subscriber

    Joined:
    Mar 22, 2010
    Messages:
    293
    Likes Received:
    49
    Not to mention that AMD's Bulldozer is also a server chip & has an iGPU option in the motherboard chipset.
     
  18. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Broadwell IPC is roughly 2x compared to Jaguar. 3.5 GHz Broadwell is 2x clock rate compared to 1.75 GHz Jaguar. Jaguar has twice the number of cores. One Broadwell core (at 3.5GHz) is roughly as fast as 4 Jaguar cores. As you said, a high clocked dual core Broadwell pretty much matches the 8 core Jaguar when both are running perfectly threaded code. A quad core Broadwell would be twice as fast as the 8 core Jaguar.
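    The core-equivalence arithmetic above, written out as a sketch. The 2x IPC ratio and the clock speeds are the rough figures from this post; the names are illustrative, and perfectly threaded code is assumed throughout.

```python
IPC_RATIO = 2.0        # Broadwell IPC relative to Jaguar (rough figure from the post)
BROADWELL_GHZ = 3.5
JAGUAR_GHZ = 1.75
JAGUAR_CORES = 8


def jaguar_core_equivalents(broadwell_cores: int) -> float:
    """How many Jaguar cores a given number of Broadwell cores is worth."""
    return broadwell_cores * IPC_RATIO * (BROADWELL_GHZ / JAGUAR_GHZ)


for cores in (1, 2, 4):
    equivalent = jaguar_core_equivalents(cores)
    print(f"{cores} Broadwell core(s) ~ {equivalent:.0f} Jaguar cores "
          f"({equivalent / JAGUAR_CORES:.1f}x the 8 core Jaguar)")
```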

    A fast Broadwell dual core is certainly enough if you want to run the console ports at console settings ("medium") and console frame rate (30 fps). If you want to run the console ports at (locked) 60 fps, the CPU requirement doubles, and a quad core is going to be required, once the developers max out the Jaguar cores (7 cores are available for games now). It took quite a long time to get maximum performance out of the Xbox 360 PPC core. Jaguar is easier to utilize, but it takes time to learn it perfectly. Naughty Dog had a very good presentation at GDC 2015 about this: http://www.gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine.

    Only time will tell whether a 4 core PC CPU is going to be enough when this console generation ends. PC gamers love 60 fps + ultra settings with long view distances (lots of draw calls). Even with DX12, PC has higher draw call driver overhead compared to consoles. Asynchronous compute is starting to be popular in console games, and some compute jobs require low latency (data needs to be available to CPU during the same frame). Code like this cannot be executed on discrete GPUs. You have two choices, either run it on the iGPU (not always available), or run it with AVX on the CPU. If the latter is chosen, the PC port will be more taxing to the CPU than the console port (especially when combined with longer view distances and other high quality settings).
    Yes, game design is a problem for CPU scaling. You don't want to have more enemies and/or NPCs on more powerful CPUs. However frame rate scaling from 30 fps (dual core mobile = equal to console fps) to 60 fps (gaming desktop) to 90 fps (VR) already triples the CPU cost. Scaling to ultra settings increases the CPU cost on top of that. It is highly probable that some VR games (at the end of this console generation) will benefit greatly from an 8 core CPU. Of course these games also require monster GPUs to run properly :)
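    The frame rate scaling above can be sketched as a per-frame time budget. This assumes the CPU work per frame stays constant, so CPU cost grows linearly with the target frame rate; the function names are illustrative.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame, in milliseconds."""
    return 1000.0 / fps


def cpu_cost_scale(target_fps: float, base_fps: float = 30.0) -> float:
    """CPU throughput needed relative to the base frame rate, at constant work per frame."""
    return target_fps / base_fps


for fps in (30, 60, 90):
    print(f"{fps} fps: {frame_budget_ms(fps):.1f} ms per frame, "
          f"{cpu_cost_scale(fps):.0f}x the CPU cost of 30 fps")
```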
     
  19. Sxotty

    Veteran

    Joined:
    Dec 11, 2002
    Messages:
    4,828
    Likes Received:
    289
    Location:
    PA USA
    Thinking about the lack of support for many cores in games makes me Rage... actually it does feel like treading water lately, watching progress slow.
     
  20. Rodéric

    Rodéric a.k.a. Ingenu
    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,965
    Likes Received:
    825
    Location:
    Planet Earth.
    The problem with multi-threading isn't really that it's hard, it's more like a lot of programmers don't have the discipline and rigor to do it, and many also still think that Object Oriented Programming is a panacea, which is pretty much the opposite of what you want for making fast (multi-threaded or not) programs. (Note that going for SoA and arrays of data vs lists of objects does help with splitting work for concurrency, but you still have to care for data dependencies.)
    So the first problem is massive thought latency, the second problem is software latency: although there are good libraries to get concurrency to work for a lot of the simple problems faced in games, there are massive code bases that no one masters anymore and that take forever to parallelise...

    I also want to see 8 cores becoming the norm, and I'm fine with having an on-board high-latency high-throughput massively parallel mathematic co-processor, as long as it's used for that rather than graphics ;)
    (Ok you can use it for graphics too and it can increase performance, but it might not be the best use of it. That will have to be tested.)
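    As a small illustration of the SoA point above, here is a sketch that turns a list of objects into per-field arrays and splits the update into chunks with disjoint index ranges. The `Particle` class, field names and chunking helper are all hypothetical, and in CPython the GIL means this shows the data layout and work splitting rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


# Array-of-structures: one object per entity (hypothetical example data).
class Particle:
    def __init__(self, x: float, vx: float):
        self.x = x
        self.vx = vx


aos = [Particle(float(i), 1.0) for i in range(8)]

# Structure-of-arrays: one flat list per field.
xs = [p.x for p in aos]
vxs = [p.vx for p in aos]


def integrate_chunk(start: int, stop: int, dt: float) -> None:
    # Each worker writes only xs[start:stop], so chunks have no data dependencies.
    for i in range(start, stop):
        xs[i] += vxs[i] * dt


def integrate_parallel(dt: float, workers: int = 4) -> None:
    n = len(xs)
    step = (n + workers - 1) // workers
    ranges = [(s, min(s + step, n)) for s in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every chunk and re-raises any worker exception.
        list(pool.map(lambda r: integrate_chunk(r[0], r[1], dt), ranges))


integrate_parallel(0.5)
print(xs)
```

    The split is trivial here because each element depends only on itself; as noted above, real game code still has to sort out data dependencies between systems before chunking like this is safe.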
     
    entity279 likes this.

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.