Bring back high performance single core CPUs already!

Discussion in 'PC Hardware, Software and Displays' started by Frontino, Apr 10, 2012.

  1. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    14,891
    Likes Received:
    2,309
    No Never....
     
  2. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,750
    Likes Received:
    127
    Location:
    Taiwan
    Of course, that depends on the hit rate (and miss penalty) of the P4's trace cache compared to a more traditional L1-I cache. IIRC, since the original P4's trace cache is not exactly large (12K uOps, roughly equivalent to 16KB), its hit rate should be similar to a more traditional CPU's, but its miss penalty is much larger (the whole trace has to be flushed and rebuilt, compared to just loading the instructions from the L2 cache in the traditional case).

    Looking back, a full trace cache seems to have been a bad idea. However, Sandy Bridge revived the idea in the form of a uOps cache (1.5K entries). Basically it works just like a traditional I-cache, but it stores decoded uOps instead of x86 instructions. So it's not a trace cache, but it's still able to take some load off the decoders.
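A rough way to see why the P4 arrangement hurts: with similar hit rates, the expected fetch cost is dominated by the miss penalty. The sketch below is a toy model; the hit rates and penalties are illustrative assumptions, not measured P4 or Sandy Bridge figures.

```python
# Toy model: expected front-end cost per fetch. With equal hit rates,
# the trace cache loses purely on its larger miss penalty (the trace
# must be flushed and rebuilt, vs a simple refill from L2).
# All numbers below are illustrative assumptions.

def expected_fetch_cost(hit_rate, hit_cost, miss_penalty):
    """Average cycles per fetch: hit contribution + miss contribution."""
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_penalty

trace_cache = expected_fetch_cost(0.95, 1, 30)  # assumed trace-rebuild cost
l1_icache   = expected_fetch_cost(0.95, 1, 10)  # assumed L2-refill cost

print(round(trace_cache, 2))  # 2.45
print(round(l1_icache, 2))    # 1.45
```

Same 95% hit rate in both cases, yet the average cost per fetch differs by a full cycle just from the rebuild penalty.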
     
  3. imaxx

    Newcomer

    Joined:
    Mar 9, 2012
    Messages:
    131
    Likes Received:
    1
    Location:
    cracks
    You're right, my point was that the TC in the P4 is... peculiar, since there's no front L1I, so you have to balance the comparison somehow, by either leaving both out or finding a way to include both.

    The one in SB, instead, looks cool to me. In moderate-sized loops it can pump up the processor speed heavily by taking the front end out of the picture.

    @Grall: BD seems really not bad at all. Don't look at performance for a while; look at the idea behind it. I did try to see it from every angle, wondering why it was so pathetic on performance. Its IPC is surely... terrible. I thought it could be due to many missing AGLU instructions, L1D cache thrashing, the extra cycle of latency, the undersized BTB, the missing 3rd ALU (but AMD said it was rarely used), or the shared L2 access that literally makes it ~an L3.
    In the end, I believe my very first impression was right.
    Over a long run, decoders average around 2 instr/cycle. BD can average 2 instr/cycle/core (theoretical 2+1 ALU + 2 AGLU + 2 float). A shared decoder would output, in the average/good case, 2-3 mops per 2 cycles... or between about 1 and 1.5 instructions per cycle per core. Mah.
     
  4. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,298
    Likes Received:
    396
    Location:
    Australia
    Except a Bulldozer module (L2 included) is about the same size as an SB core (L3 included). You can spin it any way you want, but the simple fact is that in the 17 watt class Intel has a 2c/4t part, and in a month or two Trinity will be out with 4c/4t, and from the looks of it it seems much improved over both Bulldozer and Stars.

    http://amdfx.blogspot.com.au/2012/04/amd-trinity-benchmark-geekbench.html
     
  5. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,750
    Likes Received:
    127
    Location:
    Taiwan
    Are you sure about this? Bulldozer's module size seems to be quite large. Its die size is 315 mm^2 @ 32nm, and from the floor plan each module looks to be roughly 50 mm^2 (that's without the L3 cache).

    Sandy Bridge, on the other hand, is 216 mm^2 @ 32nm. Its GPU/display/DMI part takes roughly the same proportion as Bulldozer's L3 cache + system part, so each Sandy Bridge core with its share of the L3 cache is roughly 36 mm^2. That's much smaller and, frankly, offers better performance in general.

    [EDIT] I used die photos of both CPUs to estimate the core size (Bulldozer's module without the L3 cache, and one Sandy Bridge core plus 1/4 of the L3 cache). The number for Bulldozer is 36 mm^2, while Sandy Bridge is 28.7 mm^2.
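The estimate above is just proportional scaling of a die photo. A sketch of the arithmetic, where the area fractions are rough eyeball assumptions chosen to match the quoted numbers:

```python
# Scale a region's share of the die photo by the known total die size.
# The fractions are eyeballed assumptions, not measured values.

def region_area(total_die_mm2, region_fraction):
    return total_die_mm2 * region_fraction

bd_module = region_area(315.0, 0.114)  # Bulldozer module, no L3 (~11.4% of die, assumed)
sb_core   = region_area(216.0, 0.133)  # SB core + 1/4 of L3 (~13.3% of die, assumed)

print(round(bd_module, 1))  # 35.9 -- close to the quoted 36 mm^2
print(round(sb_core, 1))    # 28.7 mm^2
```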
     
  6. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    16,141
    Likes Received:
    5,078
    That's also an excellent illustration of why AMD multi-core CPUs are just about as good as Sandy Bridge CPUs for enthusiast game settings in current games that can take advantage of multiple threads.

    In single-threaded or lightly threaded games, or games which stress the CPU more than the GPU, SB can sometimes have a noticeable lead, but in many cases an enthusiast would do just as well with an AMD multicore as an Intel multicore at enthusiast graphics settings.

    At least if all they are doing is gaming.

    Regards,
    SB
     
  7. hoho

    Veteran

    Joined:
    Aug 21, 2007
    Messages:
    1,218
    Likes Received:
    0
    Location:
    Estonia
    I'm one of those enthusiasts that also care about noise. Having the CPU consume a ton more power and thus needing beefier cooling would mean more noise.

    Though yeah, raw-performance-wise BD is good enough for most stuff, especially after its price was lowered somewhat. Its only problem is that its maximum performance isn't all that great vs the i7's.
     
  8. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    AMD's way of sharing resources (vector units) between two threads is also very well designed. Vector pipelines are longer than scalar/logic/fetch pipelines, so it's harder to keep them fed from single-thread ILP alone. Getting instructions from two threads benefits vector processing even more than generic processing. Additionally, vector instructions are not used as frequently as normal instructions (the usual usage pattern contains heavy bursts with lots of pipeline idling in between). So sharing a vector pipeline & execution units between two threads improves vector pipeline usage even more than HT does for generic processing.

    But the biggest gain of sharing vector units between two threads is that vector execution units take a lot of die space, and sharing cuts the die space requirement of the vector units in half. It's better to slightly beef up the vector execution unit and share it between two threads than to have separate ones for each core (with a very low utilization rate, because of the long pipelines and bursty instruction usage patterns).

    The reason why Bulldozer cannot currently match Sandy Bridge is not the shared vector units. One of the bottlenecks in the Bulldozer design is the L1i cache. It has only two-way associativity, but is shared between the two threads in the module. A two-way cache is just enough for one thread if the branching pattern is simple, but sharing such a simple cache between two threads is asking for trouble. Sandy Bridge's L1i is 8-way associative, so it works very well with two threads as well (HT). AMD should have improved their L1i cache logic (for example, 4-way associativity would have been a good compromise) when they decided to share the cache between two threads (just like Intel did with HT). Or they could have split the L1i cache in half and given both threads their own cache (so that they would not randomly evict each other's lines). It's not a common scenario that two threads are executing exactly the same code, so a shared cache doesn't help much compared to smaller separate ones.
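The associativity point can be shown with a toy LRU cache model: two threads with disjoint code footprints interleaved on one shared L1i. The geometry below (64 sets, 2 lines per set per thread) is an illustrative assumption, not Bulldozer's real configuration:

```python
# Toy set-associative I-cache with LRU replacement. Two threads each
# touch 2 distinct lines per set; with 2 ways they thrash each other,
# with 4 ways both footprints fit after warm-up.
# Sizes and access pattern are illustrative assumptions.
from collections import OrderedDict

def miss_rate(ways, sets, accesses):
    cache = [OrderedDict() for _ in range(sets)]
    misses = 0
    for line in accesses:
        s = cache[line % sets]          # set index from line address
        if line in s:
            s.move_to_end(line)         # hit: mark most recently used
        else:
            misses += 1
            if len(s) >= ways:
                s.popitem(last=False)   # evict least recently used
            s[line] = True
    return misses / len(accesses)

sets = 64
thread_a = [s for s in range(sets)] + [s + sets for s in range(sets)]
thread_b = [s + 2 * sets for s in range(sets)] + [s + 3 * sets for s in range(sets)]

interleaved = []
for _ in range(10):                     # replay both code loops, interleaved
    for a, b in zip(thread_a, thread_b):
        interleaved += [a, b]

print(miss_rate(2, sets, interleaved))  # 1.0 -- every access misses
print(miss_rate(4, sets, interleaved))  # 0.1 -- only cold misses remain
```

Each set cycles through 4 distinct lines, so 2-way LRU misses on every single access while 4-way only pays the cold misses, which matches sebbbi's argument that doubling the associativity (or splitting the cache) would have avoided the mutual eviction.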
     
    #88 sebbbi, Apr 18, 2012
    Last edited by a moderator: Apr 18, 2012
  9. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,457
    Likes Received:
    580
    Location:
    WI, USA
    Except with multi-GPU setups, where I've seen some curious test results with Bulldozer. I don't know if Phenom II was better, but there's no point in multi-GPU on one of those anymore anyway.
     
  10. Tim

    Tim
    Regular

    Joined:
    Mar 28, 2003
    Messages:
    875
    Likes Received:
    5
    Location:
    Denmark
    For comparison, Sandy Bridge has an 18-stage pipeline.
     
  11. imaxx

    Newcomer

    Joined:
    Mar 9, 2012
    Messages:
    131
    Likes Received:
    1
    Location:
    cracks
    What?? C2/Nehalem had around 13-14... more or less like the K8/K10 design.

    I don't remember seeing any relevant µarch changes (to the pipeline)...
    Ah, I think I understand: ~+4 for the loop buffer that has become a true TC.

    Well, this brings back the problem of the P4: when you consider a loop that fits within the TC boundaries (~1500 mops), decoding is out of play and you get full decode throughput.
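The reason pipeline depth keeps coming up here is the branch-mispredict penalty, whose flush cost scales with depth. A back-of-envelope sketch, where the branch frequency and mispredict rate are illustrative assumptions:

```python
# Extra cycles per instruction lost to mispredicted branches: the flush
# cost is roughly the pipeline depth. Rates below are assumptions.

def mispredict_overhead(depth, branch_freq=0.2, mispredict_rate=0.05):
    """Average CPI penalty = branches/instr * mispredicts/branch * flush depth."""
    return branch_freq * mispredict_rate * depth

print(round(mispredict_overhead(14), 2))  # 0.14 -- C2/Nehalem-like depth
print(round(mispredict_overhead(31), 2))  # 0.31 -- Prescott-like depth (assumed)
```

This is also why a uOp/trace cache hit helps twice over: it saves decode work and shortens the effective restart path.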
     
  12. almighty

    Banned

    Joined:
    Dec 17, 2006
    Messages:
    2,469
    Likes Received:
    5
    As I've run both, including a Phenom II X6 that I run 24/7 at 4.8GHz, I can say that average frame rates between all of them are mostly the same.

    Minimum frame rates, on the other hand, are a completely different story: they are a lot lower on the AMD chips, and on BD they are really low.

    Minimums are the most important number of all.

    And personally I would never compare any Core i series CPU with an AMD chip, as the Intel would completely bash them in gaming.

    The old 1156 Core i3 at 4GHz+, from a pure gaming point of view, would flat out leave any AMD chip for dead.
     
    #92 almighty, Apr 21, 2012
    Last edited by a moderator: Apr 21, 2012
  13. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    I've had similar gains on single-core, single-thread computers, just by asking the OS scheduler to set the priority of an offending process (Firefox) lower.

    Moving from a Sempron 64 to an Athlon II X2, on the same mobo, made my PC incredibly better. Sure, installing an OS in VirtualBox was utter pain (a major CPU hog for every simulated I/O).
     
  14. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    14,891
    Likes Received:
    2,309
    On the subject of P4s: my 3GHz Northwood seems to sit at 90% CPU usage while watching FLVs/Flash.
     
  15. pascal

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    1,830
    Likes Received:
    49
    Location:
    Brasil
    Just give me a 22nm 6.4GHz Xenon CPU :)

    Very small, cheap and fast for entertainment applications.
    And probably cool too.
     
    #95 pascal, May 20, 2012
    Last edited by a moderator: May 20, 2012
  16. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,583
    Likes Received:
    703
    Location:
    Guess...
    Even at 6.4GHz, Xenon would be slow compared to a modern mid-range quad core.
     
  17. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    I would like AMD to make a single-module Piledriver, still unlocked like the other FX processors:
    no GPU, just the basics for the AM3+ socket, small and somewhat cool running, so it can be o/c'd on the stock heatsink without too much noise.

    Maybe we can get something like that in the form of a disabled Trinity, but there's the socket dance.
     
  18. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,457
    Likes Received:
    580
    Location:
    WI, USA
    Yeah something like that would be interesting.
     
  19. pascal

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    1,830
    Likes Received:
    49
    Location:
    Brasil
    For gaming and entertainment it would be fast and cheap.
    Probably something around 230 GFlops peak :)
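The 230 GFLOPS figure checks out if one assumes Xenon-style numbers scaled to 6.4 GHz; the 12 flops/cycle/core value is the one commonly quoted for the 3.2 GHz part (115.2 GFLOPS) and is an assumption here:

```python
# Peak GFLOPS = cores * flops/cycle/core * clock (GHz).
# 12 flops/cycle/core is the commonly quoted Xenon figure (assumed).

cores = 3
flops_per_cycle_per_core = 12
clock_ghz = 6.4

peak_gflops = cores * flops_per_cycle_per_core * clock_ghz
print(round(peak_gflops, 1))  # 230.4
```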
     
  20. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,457
    Likes Received:
    580
    Location:
    WI, USA
    Xenon has been said to be practically similar to a high-end Athlon 64 X2. But there's no way 6.4 GHz is going to happen.

    What is "entertainment"? An HTPC just needs a video DSP like the ones APUs, IGPs and video cards have, plus any CPU that's faster than a P4.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.