AMD Bulldozer Core Patent Diagrams

Good question. Supposedly Larrabee's ring bus runs under the L2 cache and so doesn't consume extra area. Does the ring bus in Sandy Bridge have the same kind of layout? Isn't there another Intel processor already out there with a ring bus? Does that consume area?

I doubt it runs *under* the L2 cache. Beckton also has a ring bus between cores/cache and system interface.
 
It could be a resolution issue, but I don't see any ring-bus on SB or Larrabee. I wonder why it pops out on Cell.

If you are talking about die photos, it is going to depend heavily on what layer the die was completed to for the photo. Global wires tend to run on the upper metal layers because of their reduced RC delay. The downside of the upper layers is a significantly reduced number of tracks.
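To put rough numbers on the track cost, here is a back-of-envelope sketch; every figure in it (ring width, control overhead, track pitch) is an invented assumption, not measured die data:

```c
/* Back-of-envelope: upper-metal tracks a Larrabee-style ring might eat.
 * All numbers are assumptions (rumored 512-bit ring, guessed control
 * overhead and track pitch), not die data. */
#include <stdio.h>

int main(void)
{
    int    data_bits  = 512;  /* assumed data width per direction      */
    int    directions = 2;    /* bidirectional ring = two wire bundles */
    int    overhead   = 64;   /* guessed control/ECC wires per bundle  */
    double pitch_um   = 0.8;  /* guessed upper-metal track pitch, um   */

    int wires = (data_bits + overhead) * directions;
    printf("%d wires ~ %.0f um wide routing strip\n",
           wires, wires * pitch_um);
    return 0; /* ~0.9 mm of track width running over the cache */
}
```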
 
One would wonder how, considering high-speed global buses tend to run on upper-layer metal.
So, excluding the areas of logic for ring stops, how much impact would a ring bus have on the density of cache if it runs over the cache (i.e. purely the wires between stops)? Would it impact the density of power/ground/data connections in areas of cache?

Is there a need for repeaters between ring stops? Repeaters can be quite small, can't they?

Would there be more impact from a ring bus upon other kinds of logic? e.g. would running a ring bus over portions of a core, such as the main integer units, have greater impact on the density of that logic than it would upon cache?
 
I don't think there's anything specific about a ring bus that makes it affect logic more than other constructs. If you have a lot of wires passing over/under/through logic they have the potential to add area as the gates might need to be spread out more than they otherwise would. This can be detrimental to timing because the local wires will be longer than if the bus didn't interfere.

Even if the bus doesn't result in increased area of the logic it can affect timing because the local wires still have to contend with the bus.
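On the earlier repeater question: repeaters do need to be dropped in periodically, and to first order an optimally repeated wire has delay linear in length. A toy calculation using the textbook 2*sqrt(r*c*R0*C0) estimate, with invented, vaguely 45nm-ish constants:

```c
/* Toy first-order estimate of repeatered global-wire delay.
 * Every constant below is an assumption, not process data.
 * Build: gcc wire.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double r  = 1.0e5;   /* wire resistance per length, ohm/m (assumed) */
    double c  = 2.0e-10; /* wire capacitance per length, F/m (assumed)  */
    double R0 = 1.0e3;   /* repeater drive resistance, ohm (assumed)    */
    double C0 = 1.0e-15; /* repeater input capacitance, F (assumed)     */

    /* With optimally sized and spaced repeaters, delay grows linearly
     * with length instead of quadratically. */
    double s_per_m = 2.0 * sqrt(r * c * R0 * C0);
    printf("~%.1f ps/mm\n", s_per_m * 1e9); /* s/m -> ps/mm */
    return 0;
}
```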
 
So, in most cases you don’t want one massive 256-bit floating point unit per core consuming all of that die space and all of that power just to sit around watching the integer cores do all of the heavy lifting. By sharing one 256-bit floating point unit per every 2 cores, we can keep die size and power consumption down, helping hold down both the acquisition cost and long-term management costs.
Hmm, considering how JF likes to promote the fact that customers don't buy die area, and that we're now in an era where unused circuits consume "no power":

In fact, the Flex FP is designed to reduce its active idle power consumption to a mere 2% of its peak power consumption.
the first quoted paragraph turns into pure bullshit.
 
This confirms that Bulldozer's FPU cannot split its FMA capability into separate ops.
For integer apps, this is much ado about nothing.

On the FP client side, a whole Bulldozer chip will have no significant per-clock advantages over Sandy Bridge (desktop, with a GPU) in 128-bit code.
SB will curb-stomp it in 256-bit AVX.
If any of this SIMD code needs to shuffle anything, BD goes down by half.
BD may be able to outclock SB, but this is a big shortfall to make up.

An inevitable SB without a GPU and with more cores will likely outclass BD in less die area in FP apps, with or without AVX.
The only area where BD sort of competes is FMA4, which is a single lonely, possibly short-lived, AMD-only factor in an Intel world.

FP aside, the high core count design will be great for server and enterprise apps that scale so long as they are licensed per-socket.
Given the vagaries of licensing, a number of big-time apps go by core.
They could give it away for free and it would still be a bad proposition.
The fact that AMD could not make each module one core (or at least convince the software vendors that it was) is a serious blunder in that market. That some old marketing materials appeared to label each module a core was probably an early mistake in draft materials, or it might mean that AMD tried to call a module a core but got called out on it by Oracle and others.

For the client side, unless AMD can drive clocks to 5-6 GHz, Zambezi will be eclipsed half a year before it comes out, even if we assume it is not beaten by Westmere.
 
An inevitable SB without a GPU and with more cores will likely outclass BD in less die area in FP apps, with or without AVX.
The only area where BD sort of competes is FMA4, which is a single lonely, possibly short-lived, AMD-only factor in an Intel world.

Could this be due to the fact that they are possibly looking to specialise their CPU line in a way to complement the weaknesses of their GPU compute capabilities, and vice versa?

The fact that AMD could not make each module one core (or at least convince the software vendors that it was) is a serious blunder in that market. That some old marketing materials appeared to label each module a core was probably an early mistake in draft materials, or it might mean that AMD tried to call a module a core but got called out on it by Oracle and others.

If you were running it in CUDA, for instance on a GTX 480, they wouldn't charge you for 480 cores, would they? Anyway, personally I would assume they would have a public/private dichotomy where in public a module = two cores and privately a module = a core, depending on whether they are speaking to other companies or the public at large.
 
Could this be due to the fact that they are possibly looking to specialise their CPU line in a way to complement the weaknesses of their GPU compute capabilities, and vice versa?
The co-processor model leaves the door open for something like that.
No timeframe has been given for when this could happen.
The integer side does not quite complement the weaknesses of the GPU side. It is also focusing on throughput at the expense of some single-threaded capability.

If you were running it in CUDA, for instance on a GTX 480, they wouldn't charge you for 480 cores, would they? Anyway, personally I would assume they would have a public/private dichotomy where in public a module = two cores and privately a module = a core, depending on whether they are speaking to other companies or the public at large.

The enterprise and server apps I am referring to don't run on CUDA, and Nvidia's definition of core is garbage.
A Bulldozer module has: one front end -> two issue networks with instruction control -> two ALU back ends.
That physical pathway for thread execution is what makes each core a core. The completely separate issue logic and control is a big factor.

If there were some way to make a Bulldozer module's cores into mere "clusters" closer to the tradition of other partitioned architectures, with instructions able to issue and forward results with latency penalties between clusters, AMD could have argued a module was a core.

If AMD is trying to pull a private/public shell game with core counts, it is an untenable position.
The people who license these products are in the public, and the providers of the software aren't going to hold AMD's confused PR baggage for them.
 
<snip>
SB will curb-stomp it in 256-bit AVX.

<snip>
The only area where BD sort of competes is FMA4, which is a single lonely, possibly short-lived, AMD-only factor in an Intel world.

I basically agree, playing Devil's advocate here:

1. There are no AVX apps today
2. Applications where AVX makes a difference are also applications where FMA4 makes a difference. Both will see support.
3. When AVX picks up, the BD FPU can be extended to support it.

If SSE/SSE2 is anything to go by, AMD has *plenty* of time to implement AVX.

Cheers
 
Both will see support.

I'm not so sure of that, TBH. Traditionally, when AMD had something that Intel didn't have, that meant either no support for it until Intel had it too, or death. AMD64 is the one exception, and that was a special case... FMA4 certainly isn't that. It's a bit 3DNow-ish, IMHO.
 
FMA4 certainly isn't that. It's a bit 3DNow-ish, IMHO.

I agree. My point is that the bulk of application developers will stick to SSE2 for the immediate future. People who really need the performance right now will go the extra mile to get it.

In retrospect, SSE was a bit 3DNow-ish; it wasn't until SSE2 and its support for doubles that we saw widespread use.

Cheers
 
Thanks for taking the time. I guess I have to brush up on my computers for dummies texts. :p
 
I basically agree, playing Devil's advocate here:

1. There are no AVX apps today
This is true, but there is no BD today, either.
There will be at least some AVX apps or codecs at or near the launch of SB, months before the release of BD.

2. Applications where AVX makes a difference are also applications where FMA4 makes a difference. Both will see support.
This is not necessarily true. FMA works if there are multiplies with a dependent add that can be folded in.
If there are situations where there is no such dependence, or the application is for some reason oddly particular about the numeric behavior of the rounding, FMA is not useful.
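A minimal illustration of the rounding point, using C99's fma() and round-to-nearest doubles: the unfused sequence rounds the product before the add, while the fused op rounds the exact result once, so code that is particular about numeric behavior can't swap one for the other.

```c
/* Sketch: fused vs. unfused a*b + c. fma() is standard C99 <math.h>.
 * Build: gcc fma_demo.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.0 + 0x1p-27;  /* exactly representable */
    double b = 1.0 - 0x1p-27;
    double c = -1.0;

    /* a*b = 1 - 2^-54 exactly. Rounding that product to double gives
     * 1.0, so the unfused sequence loses the residual entirely. */
    double unfused = a * b + c;     /* two roundings -> 0.0    */
    double fused   = fma(a, b, c);  /* one rounding  -> -2^-54 */

    printf("unfused: %a\nfused:   %a\n", unfused, fused);
    return 0;
}
```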

3. When AVX picks up, the BD FPU can be extended to support it.
That's probably an argument for the Bulldozer2 Core Patent Diagrams thread.

If SSE/SSE2 is anything to go by, AMD has *plenty* of time to implement AVX.
Meh. On a per-clock basis, it's not a massive improvement on old code.

There is the clock speed advantage AMD will hopefully achieve. If it flounders on clock, there is not much point in caring about BD.
I won't go into the balance of factors I mentioned in my single/multi-threaded comparisons earlier, except that much of the FP advantage I posited did depend on being able to split the FMAC up.
Without it, BD is more "meh" in the client space.
 
How would a codec use AVX? Do you know of any plans for such?
Audio and image processing tends to take advantage of extensions quickly.
Intel was researching a faster PNG codec in the labs, and it also usually updates its math libraries pretty quickly.

Sisoft Sandra will probably have it pretty quickly (Intel does tend to get benchmarks updated early as well).

Which client space apps care about FP?
Multimedia apps, mostly.
Games probably could use it more often, though this obviously isn't very uniform as the PhysX situation shows.
Windows might be able to use some of the AVX moves to its advantage, for memory copies and such.
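On the memory-copy point, a minimal sketch of the kind of 256-bit AVX copy loop a runtime could dispatch to; the function name and unaligned-load strategy are my own assumptions, and 256-bit integer loads/stores only need AVX, not AVX2:

```c
/* Hypothetical AVX copy loop. Build: gcc -mavx avxcopy.c */
#include <immintrin.h>
#include <stddef.h>
#include <string.h>

static void avx_copy(void *dst, const void *src, size_t n)
{
    unsigned char       *d = dst;
    const unsigned char *s = src;

    /* Move 32 bytes per iteration with unaligned 256-bit accesses. */
    while (n >= 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)s);
        _mm256_storeu_si256((__m256i *)d, v);
        s += 32; d += 32; n -= 32;
    }
    memcpy(d, s, n); /* scalar tail for the last <32 bytes */
}

int main(void)
{
    char src[100] = "hello, ymm", dst[100];
    avx_copy(dst, src, sizeof src);
    return dst[0] != 'h'; /* trivial smoke test */
}
```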
 
I thought you might have seen some comments related to a specific codec, e.g. x264's implementation of h.264.

This topic isn't very encouraging:

http://doom10.org/index.php?topic=514.0

and I can't find anything more salient. I don't know if AVX increases throughput for integer SIMD code, per se.

As for consumer apps, I'm wondering which of them makes such intensive use of FP, specifically, that they will lead to BD being considered "meh", or to AVX being rapidly-implemented.

How much game code is SIMD-FP limited on the CPU?
 
and I can't find anything more salient. I don't know if AVX increases throughput for integer SIMD code, per se.
I rechecked the descriptions for SB's integer SIMD units, and the current implementation does not have wider integer operations; rather, there is a promise for more at some later date.

There are two integer SIMD blocks for both BD and SB. BD does have an integer FMAC on the first FP port.
One SIMD block is on the store pipe, which may cut into how often it can be used. I am not sure whether the XBAR unit covers integer permutes as well.
The downside to the FMAC and XBAR is that their ops are neither SSE nor AVX.
On the other hand, a few codecs could be made with code paths to use XOP and FMA.

Otherwise, it seems like rough parity per-clock, though the store port sharing may rear its head.
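On the dual code path idea above: a hedged sketch of how a codec kernel might dispatch between an SSE baseline and an FMA4 path at runtime. This assumes GCC (__builtin_cpu_supports and the target attribute; older GCC may need -mfma4 on the whole file), that n is a multiple of 4, and all function names are invented:

```c
/* Hypothetical runtime dispatch between SSE and FMA4 code paths.
 * Build: gcc -O2 dispatch.c */
#include <x86intrin.h>

__attribute__((target("fma4")))
static void mac_fma4(float *out, const float *a, const float *b,
                     const float *c, int n)
{
    for (int i = 0; i < n; i += 4)          /* fused a*b + c per lane */
        _mm_storeu_ps(out + i,
                      _mm_macc_ps(_mm_loadu_ps(a + i),
                                  _mm_loadu_ps(b + i),
                                  _mm_loadu_ps(c + i)));
}

static void mac_sse(float *out, const float *a, const float *b,
                    const float *c, int n)
{
    for (int i = 0; i < n; i += 4)          /* separate mul then add */
        _mm_storeu_ps(out + i,
                      _mm_add_ps(_mm_mul_ps(_mm_loadu_ps(a + i),
                                            _mm_loadu_ps(b + i)),
                                 _mm_loadu_ps(c + i)));
}

void mac(float *out, const float *a, const float *b, const float *c, int n)
{
    if (__builtin_cpu_supports("fma4"))     /* CPUID-backed check */
        mac_fma4(out, a, b, c, n);
    else
        mac_sse(out, a, b, c, n);
}
```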

As for consumer apps, I'm wondering which of them makes such intensive use of FP, specifically, that they will lead to BD being considered "meh", or to AVX being rapidly-implemented.
Consumer apps have a problem in that they are poorly threaded and favor single-threaded performance. The BD FPU is higher latency, and its read/write capability is no better than a single core's.
It becomes "meh" because Zambezi is a server chip that will be tangling with a mid-range desktop SB.

How much game code is SIMD-FP limited on the CPU?
For many games, it is more a question of whether SIMD shows up at all.
The integer pipelines and single-threaded performance would matter more in the client space for games, which does not look like it favors BD.

On the other hand, I believe graphics drivers do leverage SIMD instructions, though of what flavor I am unsure.

edit:

With regards to x264, it seems the Sandy Bridge preview thread has some chatter about using the special-purpose hardware in SB for the codec.
This is a lateral move around the SSE/AVX debate, apparently.
It does not change the fact that I had erroneously assumed AVX doubled the Int SIMD side as well, or my misclassification of the FP needs of video codecs.
 