AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 GPU lineup?

  • Within 1 or 2 weeks

    Votes: 1 0.6%
  • Within a month

    Votes: 5 3.2%
  • Within couple months

    Votes: 28 18.1%
  • Very late this year

    Votes: 52 33.5%
  • Not until next year

    Votes: 69 44.5%

  • Total voters
    155
  • Poll closed.
I'm somewhat limited in the number of large files I can download where I am currently.

Can you direct me to which link and which slide in particular you are referring to?
 
The Intel presentation doesn't make mention of VLIW.
It does point out how data can be presented to the SIMD units in either SOA or AOS format, but the actual instruction issue is basic scalar instruction issue.

A processor using VLIW can be made to handle either SOA or AOS data, because VLIW concerns itself with how the instruction stream is structured and decoded, not how the data is structured.

A RV770 SIMD could be treated as ~4 16-wide SIMD units, each of which can behave in the manner described.
 
In the context of the problem domain and the threads that are running each instruction stream there are distinctions to be made. So "A RV770 SIMD could be treated as ~4 16-wide SIMD units" is vastly understating how much you have to mangle your software to get that to work. Intuition and maintainability be damned! :) And I doubt it would work since it's still all one big SIMD instruction whereas 4 LRB cores are completely independent.

It's hard to make the case that an RV770 SIMD is just a long 80-wide vector that we can manipulate with the same flexibility as LRB. The architecture is obviously built around submitting 5-wide instructions for 16 homogeneous threads/strands/work-items etc per-clock. Slicing it any other way is just improbable unless you take on the job of parallelizing the workload yourself (essentially chunking it into 4 at a time per thread and suffering the memory coherency, code obfuscation and branch granularity consequences). And wouldn't that also mean you have to take care of intra-thread predication yourself? Craziness!
 
In the context of the problem domain and the threads that are running each instruction stream there are distinctions to be made. So "A RV770 SIMD could be treated as ~4 16-wide SIMD units" is vastly understating how much you have to mangle your software to get that to work. Intuition and maintainability be damned! :) And I doubt it would work since it's still all one big SIMD instruction whereas 4 LRB cores are completely independent.
Not much significant mangling at all.

An RV770 SIMD has 16 internal divisions that take a VLIW instruction packet from a single thread for up to 5 separate and explicitly parallel operations.
If 5 parallel operations are not possible, the packet uses a NOP on the remaining units.

A Larrabee SIMD takes a single instruction from a single thread and runs it on 16 internal divisions.
The only difference is that Larrabee's design just has one ALU available in the 16 lanes.
If there were two vector units, a dual issue from the same thread would lead to instruction behavior no different than a RV770 with only 2 ALUs per cluster.

If the compiler can only find one operation to put into a VLIW packet, RV770's execution model is quite similar to Larrabee's.

The data structures and registers that are arranged in an SOA or AOS manner don't need to see or care about this.
 
An RV770 SIMD has 16 internal divisions that take a VLIW instruction packet from a single thread for up to 5 separate and explicitly parallel operations.

But then you're not following the SOA model outlined in the LRB presentation. Note that Intel uses SOA as an analogy for scalar issue. That's the fundamental message. What you're describing is still going to be vector issue per-thread, which by definition is AOS.

I think you might be complicating things a bit (perhaps because you know a lot more about this stuff than simple me :)) but I find the AOS and SOA concepts to be very intuitive and straightforward for describing per-thread instruction issue.

How the data is stored is really irrelevant as scatter/gather takes care of that. I understand AOS/SOA are usually used to describe how the data is stored but that's not what I'm referring to.

It does point out how data can be presented to the SIMD units in either SOA or AOS format, but the actual instruction issue is basic scalar instruction issue.

Nope, it's not talking about how data is presented to the SIMD. It's talking about how instructions are executed on the SIMD. Hence the point about gather/scatter being necessary to get SOA issue from AOS data.
 
But then you're not following the SOA model outlined in the LRB presentation. Note that Intel uses SOA as an analogy for scalar issue. That's the fundamental message. What you're describing is still going to be vector issue per-thread, which by definition is AOS.
Each ALU in each lane can look to a different register. The data doesn't know this.
It's up to 5 (not always so because there are only enough operands for 4 a lot of the time) separate 16-wide operations.

It is similar to how Larrabee would look if it could execute 5 vector instructions simultaneously.
From the point of view of the instruction operands, it can very readily match Intel's model.

What VLIW does is offer a certain amount of flexibility at the hardware level to mix and match and allow AOS work in cases where it is appropriate.
It also avoids the significant amount of work Larrabee's instruction issue hardware would have to do to perform superscalar issue on 5 instructions.

I think you might be complicating things a bit (perhaps because you know a lot more about this stuff than simple me :)) but I find the AOS and SOA concepts to be very intuitive and straightforward for describing per-thread instruction issue.
Your intuition is applying a concept relevant to one thing and projecting it to something that is separate.
VLIW does not control the format of the data in memory or in the registers that determine the operands being worked on, and AOS and SOA do not control how the processor decodes instructions and how it issues them to units.

How the data is stored is really irrelevant as scatter/gather takes care of that.
That is the difference.
According to the presentation, Larrabee does work with AOS structured data on items that it makes sense for, such as geometry work.
 
From the point of view of the instruction operands, it can very readily match Intel's model.

Intel's model is explicitly based on the notion of scalar issue. It says so right in the slides!! It's an integral part of the model. I agree that they could match Intel's model - at 20% ALU utilization.

"First step is to “scalarize” the code. Remember that each “scalar” op is doing 16 things at once"


What VLIW does is offer a certain amount of flexibility at the hardware level to mix and match and allow AOS work in cases where it is appropriate.
Perhaps, but you're still using AOS to refer to data layout. The LRB presentation doesn't concern itself with that. (We are still talking about the LRB presentation right?) Essentially they are eschewing any sort of VLIW approach.

Your intuition is applying a concept relevant to one thing and projecting it to something that is separate.
I'm not doing that. Intel did. Don't get hung up on pedantic definitions of the terms. Notice how they make an explicit differentiation between the execution and the format?

"SOA or “scalar”: a register holds XXXX" - Execution
"And the data is usually not in an SOA-friendly format" - Storage

According to the presentation, Larrabee does work with AOS structured data on items that it makes sense for, such as geometry work.
Slides 9, 10 and 11 are explicitly demonstrating that even data stored as AOS is processed as SOA ;) Can't get much clearer than this:

"Data is usually not in a friendly format! Larrabee adds scatter/gather support for reformatting"
"Allows 90% of our code to use this mode. Very little use of AOS modes"
 
Intel's model is explicitly based on the notion of scalar issue. It says so right in the slides!! It's an integral part of the model.

"First step is to “scalarize” the code. Remember that each “scalar” op is doing 16 things at once"
The dichotomy is scalar versus packed, and that is with regards to data layout in the operands, either in memory or in the registers.
In the case of a reg/mem architecture like Larrabee, the two can exist side by side.

It doesn't make a difference if v1 has values that all come from the same primitive, and then that register is scaled by some arbitrary value.
It's one primitive, not 16, but the ALUs and issue logic do not care.

Perhaps, but you're still using AOS to refer to data layout. The LRB presentation doesn't concern itself with that. (We are still talking about the LRB presentation right?)
I find it hard to believe AOS and SOA don't apply to data layout when the presentation goes through all those diagrams showing how the data is laid out in the registers.

I'm not doing that. Intel did. Don't get hung up on pedantic definitions of the terms. Notice how they make an explicit differentiation between the execution and the format?
VLIW is a specific term that applies to a specific realm. I don't think it's pedantic to say that VLIW (which applies to how instructions are organized, decoded, and issued) and SOA or AOS (which apply to how the data is handled) are orthogonal to one another.

There's nothing inherent to a design that uses VLIW instruction issue that keeps it from running a program that handles data in SOA or AOS format.

"SOA or “scalar”: a register holds XXXX" - Execution
"And the data is usually not in an SOA-friendly format" - Storage

Slides 9, 10 and 11 are explicitly demonstrating that even data stored as AOS is processed as SOA ;) Can't get much clearer than this:
No, it says that it can be done that way.

Slide 6 outlined two areas where it is appropriate to use AOS packed registers.

Instruction issue in either AOS or SOA programming scenarios is the same for Larrabee.
Barring implementation-related wrinkles for RV770 unrelated to it being VLIW, the same can be said for the GPU SIMD.

"Data is usually not in a friendly format! Larrabee adds scatter/gather support for reformatting"
"Allows 90% of our code to use this mode. Very little use of AOS modes"
Larrabee doesn't have a completely separate instruction issue mode for the 10% of code that can't use SOA.
It issues instructions in the exact same way, scalar issue of instructions once per clock. It just so happens that the operand registers are 16-wide SIMD registers that just so happen to have data entered in an SOA format.

edit: (issue is scalar for vector math because there is only one vector unit disclosed)

A SIMD in RV770 has superscalar issue of 5 statically compiled instructions once per clock.

Superscalar issue has nothing to do with the relationship or lack thereof between data elements within the operands.
 
It doesn't make a difference if v1 has values that all come from the same primitive, and then that register is scaled by some arbitrary value.
It's one primitive, not 16, but the ALUs and issue logic do not care.

Sigh, why do I always get into trouble with this :( I get that you're taking a purist view from the perspective of the hardware but whenever I discuss this stuff I'm always thinking in terms of a unit of work (which is what a developer is concerned with). The hardware never cares about the nature of the workload or the relationships between the data in each SIMD lane so I don't really find that to be a useful thing to discuss.

Superscalar issue has nothing to do with the relationship or lack thereof between data elements within the operands.
It does when you're talking about how a unit of work is processed. Basically you are saying that the hardware doesn't care if it's processing a vec4 for 4 elements at a time or if it's processing a scalar for 16 elements at a time. And I completely agree but what the hardware cares about isn't particularly interesting. As a developer I would certainly care though, because it has a direct impact on the utilization and performance I get out of the hardware. That should be apparent if you were to perform the exercise on slide 8 for 16 elements instead of 12.
 
Sigh, why do I always get into trouble with this :( I get that you're taking a purist view from the perspective of the hardware but whenever I discuss this stuff I'm always thinking in terms of a unit of work (which is what a developer is concerned with). The hardware never cares about the nature of the workload so I don't really find that to be a useful thing to discuss.
That is my point.
VLIW is a term that resides in the hardware realm.
It doesn't really impact what the workload or programmer sees unless the programmer is coding direct assembly, and even then it is agnostic to the workload.
Intel never made a distinction between SOA and VLIW in the presentation.
There is no SOA vs VLIW dichotomy.

It does when you're talking about how a unit of work is processed. Basically you are saying that the hardware doesn't care if it's processing a vec4 for 4 elements at a time or if it's processing a scalar for 16 elements at a time. And I completely agree but what the hardware cares about isn't particularly interesting. As a developer I would certainly care though, because it has a direct impact on the utilization and performance I get out of the hardware. That should be apparent if you were to perform the exercise on slide 8 for 16 elements instead of 12.

There is no reason why RV770 cannot perform a scalar operation on 16 elements at a time.
It's actually rather straightforward.
Issue a VLIW packet with only one instruction in it.
Each lane is a scalar ALU, and there are 16 clusters.
Only activate one per cluster, and the same operation is done 16 times in parallel on 16 separate elements.
What VLIW can potentially allow is to do an additional 3 or 4 operations that Larrabee would have to do over the next 3 or 4 cycles.

There are some nifty things Larrabee does differently than RV770 and the other way around, but if that's all you want to see happen, it can be done.
 
That is my point. VLIW is a term that resides in the hardware realm.

Understood. But VLIW/AOS or scalar/SOA - it's all the same to me as the result is the same. I'll try to be more careful in the future though so people actually understand what I'm trying to say :)

Intel never made a distinction between SOA and VLIW in the presentation.
There is no SOA vs VLIW dichotomy.

Seems pretty clear to me that's exactly what they were doing on slide #4.

There is no reason why RV770 cannot perform a scalar operation on 16 elements at a time.
It's actually rather straightforward. Issue a VLIW packet with only one instruction in it. Each lane is a scalar ALU, and there are 16 clusters.

It doesn't really impact what the workload or programmer sees unless the programmer is coding direct assembly, and even then it is agnostic to the workload.

Assuming the programmer doesn't care about utilization. When you said that RV770 could equal Larrabee's behavior above I assumed you meant at high utilization. Of course it can do it if 4/5 lanes are vacant.
 
Seems pretty clear to me that's exactly what they were doing on slide #4.
Which parts indicate that?

Some of the claims, like no hard-wired sources and features working the same for most instructions, are more a response to things standard x86 does.

The code scheduler being written as the design was being made is something Larrabee and any GPU would share.

Assuming the programmer doesn't care about utilization. When you said that RV770 could equal Larrabee's behavior above I assumed you meant at high utilization. Of course it can do it if 4/5 lanes are vacant.

The number of vacant lanes is an implementation-specific feature, though it obviously makes less sense to do VLIW on a scalar design.

That's a debate over scalar or superscalar execution. It is not unique to VLIW and it doesn't affect the program that handles data in SOA or AOS format.
The way VLIW works, if it did matter, the VLIW implementation would be broken.

It only performs work in parallel if the results are no different than if each single operation was done in sequence.
 
Yeah you're saying VLIW is an instruction encoding thing, scalar/superscalar is an execution thing and AOS/SOA is a storage thing. But the distinction really isn't relevant for an understanding of the real-world scenarios we're discussing since they're usually closely related anyway. But I'll stick to scalar/superscalar when discussing instruction issue if that helps. Although Intel is using AOS/SOA to describe the same thing so you can't blame me for that one ;)
 
Yeah you're saying VLIW is an instruction encoding thing, scalar/superscalar is an execution thing and AOS/SOA is a storage thing. But the distinction really isn't relevant for an understanding of the real-world scenarios we're discussing since they're usually closely related anyway. But I'll stick to scalar/superscalar when discussing instruction issue if that helps. Although Intel is using AOS/SOA to describe the same thing so you can't blame me for that one ;)

There may be a confusion of terms, where "scalar" is being used to describe two separate things.
Scalar or superscalar instruction issue is whether instructions are issued one at a time or can be issued at the same time. Either way, the instructions are independent of one another.

The scalarization of vector work is using scalar in mathematical terms of whether the registers in question contain whole vectors or contain scalar elements of vectors.

So one is scalar in the sense of there being a single unit instruction, while the other is a single subunit of a data vector.
 
With both Nvidia and Intel (for the most part) pushing SOA should we expect the same from AMD? Or are there good reasons to stick with VLIW going forward?
A reduction in register pressure by ~4x in a common use case?

AMD has little to prove in this regard IMO.
 