Relative performance of AMD 2.9GHz APU vs current consoles?

pjbliverpool · Jul 12, 2011

swaaye said:
I'm really skeptical about Xenon being competitive with an Athlon 64 X2 on a per clock basis in general. Those PPC cores are awfully simplistic compared to K8 outside of SIMD.

The comparison was between a 2Ghz AX2 and the 3.2Ghz Xenon, so on a per clock basis you're right that Xenons not competitive if it is indeed a little slower.

pjbliverpool · Jul 12, 2011

ToTTenTranz said:
Going to the question at hand, it's a good guess that a 4-core, 400sp Llano is at least 2x more powerful than the XCGPU in the graphics department, but may not retain the same advantage in some FP-intensive CPU stuff like physics.

The CPU part of the A8-3850 is as fast or a little faster than a Q6600. In other words it would obliterate Xenon in pretty much anything.

swaaye · Jul 12, 2011

Yeah I think one of the Llano chips would be quite a nice little game console. I really wonder if Nintendo might be planning to use one. I'm counting on Sony and MS to go for "difficult to optimize speed demon with a zillion compromises to achieve max potential at low cost" though in their usual arms race.

sebbbi · Jul 12, 2011

swaaye said:
I'm really skeptical about Xenon being particularly competitive with an Athlon 64 X2 without a lot of attention to how the program works. As far as we know Xenon might be more comparable to Brazos or Atom.

They were comparing the whole CPU, not IPC or single thread execution. So 3 core (6 thread SMT) 3.2GHz in order PPC versus 2.0GHz dual core out of order Athlon 64. Those six hardware threads and a higher clock should be helping it a lot.

swaaye said:
Llano has about the same IPC as Athlon II, not Phenom II. Phenom II is to varying degrees faster than both.

Quoted directly from Anandtech:

"Architecturally AMD has made some minor updates to each Llano core. AMD is promising more than a 6% increase in instructions executed per clock (IPC) for the Llano cores vs. their 45nm Athlon II/Phenom II predecessors. The increase in IPC is due to the larger L2 cache, larger reorder and load/store buffers, new divide hardware, and improved hardware prefetchers.

On average I measured around a 3% performance improvement at the same clock speed as AMD's 45nm parts. Peak performance improved up to 14% however most of the gains were down in the 3—5% range."

swaaye · Jul 12, 2011

Yeah I would like to see some more thorough tests of Llano against a Phenom II.... They might be quite close most of the time. Of course the Phenom IIs are available in much higher clock grades.

KKRT · Jul 12, 2011

sebbbi said:
Update:
Found a forum with lots of Brazos E-350 APU benchmark links. Brazos is a netbook chip designed to compete with ATOM (much lower end than Llano).

Brazos fares pretty well in current console games:
Battlefield 2 (2005) - 1024x768, all max, 25-35 fps
Crysis 2 (2011) - 1280х720, 20 fps
Dirt 3 (2011) - 1024x600, 19-25 fps (ultra low noAA 25-28 fps)
Mass Effect 2 (2011) - 1024х600, 20 fps
Call of Duty: Modern Warfare 2 (2009) - 1280х720, 25 fps

Current consoles run those same games at 30 fps at similar resolutions. These games are definitely not optimized for Brazos architecture. Maybe we get some netbooks for the Christmas that are more capable than the current consoles

Amazing. We really need new generation asap and better Directx API/GPU drivers for PC devs.

Squilliam · Jul 12, 2011

KKRT said:
Amazing. We really need new generation asap and better Directx API/GPU drivers for PC devs.

We do have a new API, its DX11!

KKRT · Jul 12, 2011

Squilliam said:
We do have a new API, its DX11!

Yes and still Carmack complains that he cant make more than 3k calls on current GPUs

sebbbi · Jul 13, 2011

KKRT said:
Yes and still Carmack complains that he cant make more than 3k calls on current GPUs

You can make more than that if you really minimize the state changes, and use constant buffers and state blocks efficiently.

I actually did a raw performance test some time ago in DX11:
- Draw call + constant buffer change (including object matrix and four other constants): 50000 calls at 30 fps
- Draw call + ib + vb + cb change: 25000 calls at 30 fps

When doing virtual texturing, you do not have to change the textures at all, so I didn't benchmark them. But Carmack is also using virtual texturing, so this doesn't really matter

I also recently noticed that DX11 command buffer playback performance isn't any good. My test application rendered exactly the same amount of draw calls from a prerecorded command buffer. The application is not CPU bound at all (less than 5% CPU used for the tight render loop), and it surely isn't GPU bound either. Maybe the command buffer is streamed over PCI express every frame, and the API does some sanity checks to it... but it's a static buffer that's not changed, and thus no extra checks or transfer should be necessary every frame.

Also DX11 command buffers require that you setup every bit of state at the start of the buffer (including render targets, state blocks, views, buffers, etc) that the command buffer is going to use (most likely this is for validation purposes). You can't inherit any device state. This makes it impossible to use smaller command buffers efficiently (for example recording a buffer per object or per object group). And DX11 command buffers do not support predication (it's a really handy feature). I hope DX11.1 will improve on these features. Currently multithreaded rendering is basically pointless, because the playback of the command buffer is equally slow than just submitting the draw calls.

Actually one thing I hope that fusion will improve is the CPU->GPU->CPU communication. It might be that future fusion chips are capable of much higher draw call counts than discrete GPUs.

pjbliverpool · Jul 13, 2011

sebbbi said:
Actually one thing I hope that fusion will improve is the CPU->GPU->CPU communication. It might be that future fusion chips are capable of much higher draw call counts than discrete GPUs.

Haha now that would certainly be an interesting turnaround. Especially if the relatively poor CPU performance some consoles ports tend to display on PC's is down to draw call limitations.

We may see a mid range fusion chip outperforming a stronger seperate CPU and GPU combo!

ERP · Jul 13, 2011

Amazing. We really need new generation asap and better Directx API/GPU drivers for PC devs.

The fundamental issue with PC drivers is that they are solving a different problem.
Consoles run 1 application at a time, the driver doesn't have to track state or deal with context switches.
In addition PC drivers actually try and identify bad usage patterns of the API by applications and fix them during the calls which really isn't helping much from a performance standpoint. They also emulate older bugs, so as not to break older games.

You can do pretty well in DX11 in terms of the number of calls, but the exact number is very CPU dependent.

MJP · Jul 13, 2011

sebbbi said:
I also recently noticed that DX11 command buffer playback performance isn't any good. My test application rendered exactly the same amount of draw calls from a prerecorded command buffer. The application is not CPU bound at all (less than 5% CPU used for the tight render loop), and it surely isn't GPU bound either. Maybe the command buffer is streamed over PCI express every frame, and the API does some sanity checks to it... but it's a static buffer that's not changed, and thus no extra checks or transfer should be necessary every frame.

Also DX11 command buffers require that you setup every bit of state at the start of the buffer (including render targets, state blocks, views, buffers, etc) that the command buffer is going to use (most likely this is for validation purposes). You can't inherit any device state. This makes it impossible to use smaller command buffers efficiently (for example recording a buffer per object or per object group). And DX11 command buffers do not support predication (it's a really handy feature). I hope DX11.1 will improve on these features. Currently multithreaded rendering is basically pointless, because the playback of the command buffer is equally slow than just submitting the draw calls.

What hardware did you try this on, and how long ago? Nvidia just started supporting native multithreaded command buffers in their drivers pretty recently, and I don't think AMD supports it at all yet. So if you're using a driver without that support, the runtime has to package together all of the deferred command buffers into a single stream before handing it off to the driver so there's no way for it to be pre-converted into the native HW command buffer format.

mikiex · Jul 15, 2011

Its interesting that intergrated PCs are becomming a viable platform for gaming as performance goes up and price / watts goes down, why not forget consoles and move the PC in to the living room? Get a bunch of companies to join together like MSX, could this happen?

MS and Sony will have to come up with something pretty special.

sebbbi · Jul 15, 2011

ERP said:
You can do pretty well in DX11 in terms of the number of calls, but the exact number is very CPU dependent.

My test application uses only 5% of the CPU, and it's still draw call bound. Maybe it's bound by the driver architecture or by the bus bw/latency. Hard to say. I doubt having more powerful CPU would help it at all.

MJP said:
What hardware did you try this on, and how long ago? Nvidia just started supporting native multithreaded command buffers in their drivers pretty recently, and I don't think AMD supports it at all yet. So if you're using a driver without that support, the runtime has to package together all of the deferred command buffers into a single stream before handing it off to the driver so there's no way for it to be pre-converted into the native HW command buffer format.

AMD Radeon 5870 + i7 (eight HW threads) + Win7. The drivers are around half year old or so.

3dilettante · Jul 15, 2011

MJP said:
What hardware did you try this on, and how long ago? Nvidia just started supporting native multithreaded command buffers in their drivers pretty recently, and I don't think AMD supports it at all yet.

I haven't kept track, but for Nvidia the support was very narrow the last I heard of it. It was introduced for Civ5, and there may have been some driver-level hackishness to get it to work for Civ5. At least then, it was not readily extended to other apps. Has this been opened up?

ERP · Jul 15, 2011

sebbbi said:
My test application uses only 5% of the CPU, and it's still draw call bound. Maybe it's bound by the driver architecture or by the bus bw/latency. Hard to say. I doubt having more powerful CPU would help it at all.

AMD Radeon 5870 + i7 (eight HW threads) + Win7. The drivers are around half year old or so.

I think the command buffers are a different issue, I could guess at what it is but I don't know enough difinitively about the hardware or driver to say for sure.
It's unlikely to be the PCI bus, there is a lot of bandwidth there for command buffers.
If you don't think you're draw bound and you are not using 100% of a core, then something is causing the GPU and CPU to work synchronously, i.e. there is a sync operation somewhere in the driver path.
It might be fixable, or it could be an architectural limitation, it's impossible to know without details of the driver.

sebbbi · Jul 15, 2011

ERP said:
If you don't think you're draw bound and you are not using 100% of a core, then something is causing the GPU and CPU to work synchronously, i.e. there is a sync operation somewhere in the driver path.

For some reason the GPU can't process as many commands as I send it. It feels like the ring buffer gets full. But that's kind of strange since my draw calls are really simple (small single triangle calls with a simple color setting pixel shader).

My frame rendering is really simple:
1. Clear the render target + depth
2. Single call to ExecuteCommandList
3. Present (swap)

The command list is static and is generated at the application startup. It has 50000 draw calls (one cb+vb+ib change before each).

ERP · Jul 16, 2011

This is getting off topic but.
It's possible your filling the ring buffer, but unlikely.
There were issues with some of the early command buffer implementations on Xbox.
The driver had to patch the command buffer for every submission, the driver had to ensure that the previous submission wasn't still being processed by the GPU when the patch occurred, so inserted a fence causing the CPU to stall if the same command buffer was submitted repeatedly.
I'd guess that something similar is occurring here.

Mobius1aic · Sep 12, 2011

sebbbi said:
Update:
Found a forum with lots of Brazos E-350 APU benchmark links. Brazos is a netbook chip designed to compete with ATOM (much lower end than Llano).

Brazos fares pretty well in current console games:
Battlefield 2 (2005) - 1024x768, all max, 25-35 fps
Crysis 2 (2011) - 1280х720, 20 fps
Dirt 3 (2011) - 1024x600, 19-25 fps (ultra low noAA 25-28 fps)
Mass Effect 2 (2011) - 1024х600, 20 fps
Call of Duty: Modern Warfare 2 (2009) - 1280х720, 25 fps

Current consoles run those same games at 30 fps at similar resolutions. These games are definitely not optimized for Brazos architecture. Maybe we get some netbooks for the Christmas that are more capable than the current consoles

Those scores are pretty decent, but at what settings? Even still, it's still impressive when you consider that the E-350's x86 core are only running 1.5 GHz, and what, 500 MHz on the GPU? I wonder how much better the scores could be with a dual memory control (128 instead of 64 bit), though I guess the dual core Llano derivative would basically give us that measurement.

Relative performance of AMD 2.9GHz APU vs current consoles?

pjbliverpool

B3D Scallywag

pjbliverpool

B3D Scallywag

swaaye

Entirely Suboptimal

sebbbi

swaaye

Entirely Suboptimal

KKRT

Squilliam

Beyond3d isn't defined yet

KKRT

sebbbi

pjbliverpool

B3D Scallywag

ERP

MJP

mikiex

sebbbi

3dilettante

ERP

sebbbi

ERP

Mobius1aic

Quo vadis?

Similar threads