Predict: Next gen console tech (9th iteration and 10th iteration edition) [2014 - 2017]

Discussion in 'Console Technology' started by Shifty Geezer, Dec 22, 2014.

Thread Status:
Not open for further replies.
  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,365
    Likes Received:
    3,955
    Location:
    Well within 3d
    Most of my thinking on why Sony was allegedly considering Steamroller is that the clocks were not intended to be the same, by a very significant margin.
    The base clocks for the desktop Kaveri SKUs seem to indicate that at least doubling Jaguar's clocks could have been possible, and the PS4 Pro's uptick in power consumption is an example of where Sony was fine with a marginal increase in overall power consumption over the release version of the PS4.

    There's loss of peak against matched FADD and FMUL operations that cannot be turned into an FMA, which Steamroller cannot avoid. Loss of peak due to a mix that is slanted towards FADD or FMUL is something Jaguar also experiences, which I think makes the decision less straightforward at an architectural level. Jaguar has more issue port contention on a per-clock basis due to its having 1 fewer FPU pipeline, and it does have a less feature-rich ISA. The extra instructions AMD had for the Bulldozer line saw limited use in the desktop market, but it would have been presented differently as part of a proprietary platform.
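
    A rough sketch of that peak loss (illustrative issue rules only, not either core's actual scheduler): on an FMA pipe, a mul/add pair that cannot fuse spends two issue slots on the same two flops, so throughput falls with the unfusable fraction.

    ```c
    /* Back-of-envelope effective throughput on an FMA pipe when only a
       fraction p of mul/add pairs can be fused. Hypothetical model:
       a fused pair takes 1 issue slot for 2 flops, an unfused pair
       takes 2 slots for the same 2 flops. */
    #include <assert.h>

    static double flops_per_slot(double p) {
        double slots = p * 1.0 + (1.0 - p) * 2.0; /* avg slots per pair */
        return 2.0 / slots;                       /* flops per slot */
    }

    int main(void) {
        assert(flops_per_slot(1.0) == 2.0); /* all fused: full peak  */
        assert(flops_per_slot(0.0) == 1.0); /* none fused: half peak */
        return 0;
    }
    ```

    At a 50% fused fraction this lands at 2/1.5 ≈ 1.33 flops per slot, i.e. two-thirds of peak.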

    As far as stalls in a multi-threaded context, I can think of some limitations like the store buffer out of the shared FPU not being able to supply the bandwidth of two cores, but this is a case of faltering into a tie rather than losing to 8 Jaguar cores due to Jaguar's narrower architecture. Some stalls with Piledriver and some possible errata were fixed with Steamroller.
    There are non-FPU related performance problems, such as odd regressions in multithreaded performance, cache bandwidth, and decode. Decode was generally doubled to make choking there unlikely, with some additional writeback and errata fixes that might have done a little for the rest.

    Are there examples in the PC space where having two cores sharing the FPU leads to a loss in performance versus one core per FPU? The benchmarks I can remember didn't show performance going down.

    There would be little reason to use a core optimized for a ~4GHz peak if it were to be bumped down to the level of one that struggled to hit 2 GHz.
    If the leaked Steamroller variant of the PS4 is a legitimate early version, it would indicate that concerns such as module area and FPU sharing were considered acceptable since those would have been known in advance.

    The actual health of the manufacturing process and the validation of the architecture and overall SOC might have been where the momentum changed.
    Perhaps Steamroller couldn't clock sufficiently faster for Sony's requirements with GF's process, or AMD could not make its deliverables for that architecture despite Kaveri seemingly inheriting some high-level similarities with Orbis.
    I have seen some discussion that the firmware situation for Kaveri was problematic, and it was notably later to be released than the consoles.
    Jaguar's hop to TSMC cost AMD money, and seemingly hindered its bug-fixing and physical characterization, hence the general lack of turbo until the return to GF.

    One scenario I've been mulling over is that AMD determined it could not get a validated Steamroller APU out of GF in time, within its clock/power parameters, and in volume.
    A Jaguar-based core, and an SOC that used it, could make a jump to TSMC, whereas a Steamroller APU with some apparent teething pains could not.
     
  2. HTupolev

    Regular

    Joined:
    Dec 8, 2012
    Messages:
    936
    Likes Received:
    564
    Minimalist-deep-pipeline high-clock cores?

    Intel:

    2000: "Theoretical peak performance can be decent for the die space!"

    "Architecturally-aware compilers will solve these problems!"

    2006: "We've decided that the best strategy in the CPU market is to sell Pentium 3s forever."

    IBM:

    2005: "Theoretical peak performance can be decent for the die space!"

    "These problems won't exist when people code to the specific console!"

    2006: "..."

    Toshiba: "You know the best thing about the Cell? We can build them without including a PPU anywhere on the die!"

    AMD:

    2011: "Theoretical peak performance can be decent for the die space!"

    2013: "Hey, remember when our CPU market share started to stabilize around 2007? Those were good years."

    B3D:

    2016: "Wouldn't theoretical peak performance be decent for the die space?"
     
    #1302 HTupolev, Nov 26, 2016
    Last edited: Nov 26, 2016
  3. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    I am not an HW engineer, but a double-clocked part (with identical die size) isn't likely to be just a "marginal increase" in power consumption. Wouldn't the power scale quadratically -> 4x higher power consumption (ballpark)? That's not good when you only get equal peak flops (assuming full FMA). Of course, performance in generic, less-well-threaded code would have been significantly better.

    FMA would of course be used a lot in console code. However even in well optimized AVX inner loops (or well optimized GPU microcode), the FMA percentage (of ALU ops) is usually closer to 50% than 100% (there's always some adds and muls and other ALU instructions mixed in). C/C++ compilers also don't automatically translate vector multiply+add intrinsics to FMA. You have to do it manually. Console devs also use libraries designed for PC. Most (open source) libraries aren't even vectorized, and if they are, you likely see SSE3 at most. So no FMA. Fortunately compilers can in some cases automatically translate float (non vector) mul+add to FMA: http://stackoverflow.com/questions/34265982/automatically-generate-fma-instructions-in-msvc.
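
    A minimal illustration of the manual fusion with standard C99 `fmaf` (the intrinsics case is analogous: `_mm_add_ps(_mm_mul_ps(a, b), c)` has to be rewritten by hand as `_mm_fmadd_ps(a, b, c)` on FMA3 hardware):

    ```c
    #include <math.h>
    #include <assert.h>

    /* Separate multiply then add: two rounding steps, and two
       instructions unless the compiler is allowed to contract them. */
    static float madd_separate(float a, float b, float c) {
        return a * b + c;
    }

    /* Explicit fused multiply-add (C99 <math.h>): one rounding step,
       maps to a single FMA instruction where the hardware has one. */
    static float madd_fused(float a, float b, float c) {
        return fmaf(a, b, c);
    }

    int main(void) {
        assert(madd_separate(2.0f, 3.0f, 4.0f) == 10.0f);
        assert(madd_fused(2.0f, 3.0f, 4.0f) == 10.0f);
        return 0;
    }
    ```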

    Jaguar lacks the TBM and XOP instruction sets (found in Steamroller), but Zen isn't going to support them either. Otherwise both Jaguar and Steamroller have the same stuff: SSE4.2, AVX1, BMI1, F16C. No BMI2 on either (so no PDEP for fast Morton code). Zen has AVX2 and BMI2 (Haswell+ on Intel).
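
    For context, the PDEP point in sketch form: without BMI2, a 2D Morton encode needs a shift-and-mask ladder per coordinate; with BMI2 it collapses to two bit deposits (masks shown in the comment). This is the standard textbook fallback, not console-specific code.

    ```c
    #include <stdint.h>
    #include <assert.h>

    /* Spread the low 16 bits of x out to the even bit positions. */
    static uint32_t part1by1(uint32_t x) {
        x &= 0x0000FFFFu;
        x = (x | (x << 8)) & 0x00FF00FFu;
        x = (x | (x << 4)) & 0x0F0F0F0Fu;
        x = (x | (x << 2)) & 0x33333333u;
        x = (x | (x << 1)) & 0x55555555u;
        return x;
    }

    /* Morton code: even bits from x, odd bits from y. With BMI2 this is
       just _pdep_u32(x, 0x55555555) | _pdep_u32(y, 0xAAAAAAAA). */
    static uint32_t morton2d(uint32_t x, uint32_t y) {
        return part1by1(x) | (part1by1(y) << 1);
    }

    int main(void) {
        assert(morton2d(0, 0) == 0);
        assert(morton2d(1, 0) == 1);
        assert(morton2d(0, 1) == 2);
        assert(morton2d(3, 3) == 15);
        return 0;
    }
    ```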
     
    Orion and Heinrich4 like this.
  4. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    AMD could have avoided the "speed demon" trap. Both IBM and Intel had already hit the limit and were struggling. For some reason AMD decided to continue Bulldozer development anyway. They should have just kept iterating on the Phenom II design; they already had a native 6-core design back then. 3.2 GHz wasn't that bad a limit (if you look at Intel's then-forthcoming Core 2 and Nehalem). Imagine where AMD would be now :)
     
  5. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,314
    Likes Received:
    414
    Location:
    Australia
    Is BD really a speed demon? To me it seems more like they just made bad choices: bad L1/L2, a "bad" L1i, very long mispredict penalties (far longer than the pipeline length), and too much pressure on the ALUs, with the FMA pipes resulting in long FP latencies for non-FMA ops. From what I know, the Fmax of BD is limited by the L2, so a Bulldozer with one extra ALU, write-back instead of write-through caching, a larger L1, and a uop cache, while keeping everything else the same, probably would have ended up clocking about the same at around the same power for a whole lot more performance.

    But it's only a guess and we will never know. Remember, Bulldozer's integer pipeline is only around 15-16 stages, not the 20-22 normally quoted (that figure is the mispredict penalty).
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,365
    Likes Received:
    3,955
    Location:
    Well within 3d
    Within the same implementation, there's a mostly linear relationship with clock and a quadratic one with voltage. Keeping it within one implementation factors out items like the design's frequency target, process, and logic complexity.
    Comparing two different architectures would need to control for items like that.
    Voltage is the larger driver, and there's an additional relationship within a design's clock/voltage curve where voltage starts to rise as clocks rise in order to overcome device and wire delay.
    Where that voltage increase starts to kick in is determined by the design and how it handles things like wire delay.

    A low-clocked design's voltage starts to ramp at lower clocks, where the relationship starts to look more cubic than quadratic.
    A high-performance design can take an initial hit in power consumption due to its linearly larger factors (clock, transistor count), but it won't be forcing its voltages higher (the big cost) until it reaches the upper part of its target range.
    For example: at launch, a 1.5 GHz Jaguar quad-core with a 2 CU GPU had a 15W TDP and the 2.0 GHz had a 25W TDP. Later mobile Kaveris at ~2GHz base had a 20W TDP.
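
    As a toy model of those relationships (dynamic power only, P = C·V²·f; the numbers are purely illustrative, not any real chip's):

    ```c
    #include <assert.h>

    /* Dynamic power model: switched capacitance * voltage^2 * clock. */
    static double dyn_power(double c, double v, double f) {
        return c * v * v * f;
    }

    int main(void) {
        double base = dyn_power(1.0, 1.0, 1.0);
        /* 2x clock at fixed voltage: power only doubles. */
        assert(dyn_power(1.0, 1.0, 2.0) == 2.0 * base);
        /* 2x clock that also needs 2x voltage: power goes up 8x. */
        assert(dyn_power(1.0, 2.0, 2.0) == 8.0 * base);
        return 0;
    }
    ```

    The quadratic voltage term is why staying inside the flat part of a design's V/f curve is cheap and pushing past it is not.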

    Also, for various reasons, Jaguar's power efficiency was rather middling, which meant the difference between it and the larger x86 core was smaller than it should have been.
    The fact that large cores can throttle down into Jaguar's range (FinFET removes a good portion of the static cost), while truly power-optimized architectures can hit it from below, partly explains why its particular compromise point has been discarded by AMD (among other reasons, like no longer having much of the responsible staff).

    I have not tried to really narrow down the error bounds, but doubling a launch Jaguar's core count could take it to the 20-30W range, when there were 45-65W Kaveri chips with larger GPUs as well.
    It seems possible Kaveri's number could be dropped if it settled on only having base clocks, and the larger GPU were accounted for.
    Something like 20W (ed: possibly worst-case) on top of the PS4's power budget (150W+) seems like it might not be an immediate disqualifier since the PS4 Pro is somewhere above the PS4.

    However, one possible disqualification is that Kaveri came out in 2014, and the mobile Kaveris that wound up beating the TDP of the hottest Jaguar didn't launch until June of that year.
     
    #1306 3dilettante, Nov 27, 2016
    Last edited: Nov 27, 2016
  7. MrFox

    MrFox Deludedly Fantastic
    Legend Veteran

    Joined:
    Jan 7, 2012
    Messages:
    6,487
    Likes Received:
    5,992
    Finally Hynix is going to put some pressure on Samsung and Toshiba. They have a 1TB single chip in their Q1 databook. It will have 72 layers on a 10nm-class process.

    H27Q8T8LEA
    8192Gb
    FBGA(152ball)
    Q4'17

    I want two of those on a user-replaceable M.2... with a fancy screw of course.

    They dropped their idea of DC-SF and are doing nitride charge trap like everybody else.
     
  8. RDGoodla

    Regular Newcomer

    Joined:
    Aug 21, 2010
    Messages:
    499
    Likes Received:
    138
    It seems that an Intel CPU & an Nvidia GPU are still the best choices for a next-gen high-performance console?

    How much more would the bill of materials cost compared with an APU-based console?
     
  9. ProspectorPete

    Regular Newcomer

    Joined:
    Feb 1, 2017
    Messages:
    414
    Likes Received:
    137
    PS5:
    Launch Q4 2019
    CPU (apu): ULP 16 core Zen
    GPU (apu): Navi 12TF with features from the next hardware
    memory 24GB HBM3 (up to 6 reserved for OS)
    storage: 1468GB 3D SSD with 5000MB/sec both read and write
    HDMI 3.0 with adaptable refresh
    Thunderbolt 4 ports with USB-C connector
    8x BDXL drive with option for UHD BD
    8K streaming through future update
    Wireless display link for PSVR2
    802.11AX wifi
    bluetooth 5.1 for controllers as well as a latency-free headset
     
  10. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,577
    Likes Received:
    16,028
    Location:
    Under my bridge
    How can you say that when we haven't seen a next-gen AMD APU with stacked RAM in action? And how can you say that when Vega and Ryzen make a competent combination at a far lower price? You could get more power from $400 of AMD-based console than from $400 of Intel+nVidia, no?
     
    RootKit likes this.
  11. N00b

    Regular

    Joined:
    Mar 11, 2005
    Messages:
    698
    Likes Received:
    114
    You forgot:
    * Fairy dust
    * Built-in world peace
     
  12. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,577
    Likes Received:
    16,028
    Location:
    Under my bridge
    That's going to make COD and GTA pretty boring...
     
    Prophecy2k, ToTTenTranz and N00b like this.
  13. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    There's a new GTA coming, but you play as Noah: you roam the city collecting lost sheep, bringing bread to orphans, and when you enter a car the driver asks you to please have a seat.
     
    Jay likes this.
  14. HTupolev

    Regular

    Joined:
    Dec 8, 2012
    Messages:
    936
    Likes Received:
    564
    You mean slingshot bread at orphans, resulting in them taking a nap.
     
  15. zupallinere

    Regular Subscriber

    Joined:
    Sep 8, 2006
    Messages:
    750
    Likes Received:
    95
    Wait a year, reduce some of those specs, and add more, slower memory, or even replace the SSD with X-Point memory using HBCC. Resolution isn't going to be a big deal in the future, since I think 4K is all you need, so why not see what you can do with a crazy amount of memory?
     
  16. Rikimaru

    Veteran Newcomer

    Joined:
    Mar 18, 2015
    Messages:
    1,023
    Likes Received:
    396
    It's likely HBM3's price will be too high for a console.
    Only a special low-cost HBM variant could be feasible.
     
  17. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,214
    Likes Received:
    1,202
    You are leaving out the main prediction (or is it implied?): that it will be fabbed on TSMC's 7nm node.
    (And why go so high in CPU parallelism? Won't average utilization be abysmal?)
     
  18. Rikimaru

    Veteran Newcomer

    Joined:
    Mar 18, 2015
    Messages:
    1,023
    Likes Received:
    396
    It's cheaper but not cheap.
    https://arstechnica.com/gadgets/2016/08/hbm3-details-price-bandwidth/
     
  19. DieH@rd

    Legend Veteran

    Joined:
    Sep 20, 2006
    Messages:
    6,227
    Likes Received:
    2,179
    Quite the opposite. They are the worst choices.

    As for the HBM talk... the more time goes by, the more I think the next consoles will stick with GDDR, especially if by 2019 someone manages to create 2GB chips. Sixteen of those chips would be enough for a nice 32GB of unified RAM, with speeds hopefully in the 350-500 GB/s range. That would be enough for 1080p-4K gaming.
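
    Sanity-checking that bandwidth guess (assuming 32-bit-wide chips, so 16 chips make a 512-bit bus; the 7 Gb/s per-pin rate is a hypothetical figure, not a quoted spec):

    ```c
    #include <assert.h>

    /* Aggregate memory bandwidth in GB/s: total bus pins times
       per-pin data rate in Gb/s, divided by 8 bits per byte. */
    static double bandwidth_gbs(int chips, int bits_per_chip,
                                double gbps_per_pin) {
        return chips * bits_per_chip * gbps_per_pin / 8.0;
    }

    int main(void) {
        /* 16 chips x 32 bits at 7 Gb/s/pin -> 448 GB/s,
           inside the quoted 350-500 GB/s range. */
        assert(bandwidth_gbs(16, 32, 7.0) == 448.0);
        return 0;
    }
    ```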

    As for storage... I don't think anyone will go above the speeds that consumer external storage devices can achieve. Most likely they will stick with ~500MB/s and call it a day.

    We all want fancy fast storage, but I would not be shocked if they stick with spinning laptop hard drives for another round [2TB as a starting size]. Cost trumps everything.
     
    #1319 DieH@rd, Apr 7, 2017
    Last edited: Apr 7, 2017
    Prophecy2k, DSoup and RootKit like this.
  20. Rodéric

    Rodéric a.k.a. Ingenu
    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,031
    Likes Received:
    898
    Location:
    Planet Earth.
    An 8-core Zen-based APU + Vega; if we are lucky, 16GB of HBM; even luckier, a standard PCIe-based SSD, plus an optical drive.
     