Larrabee delayed to 2011?

Case in point: Fermi is already in use in the 2nd-fastest supercomputer in the world. ATI hasn't even released a FireStream version of Cypress yet. Maybe ATI will improve with the next gen; until then Fermi is better.
Au contraire, monsieur. Even downclocked versions of ATI's last-gen (desktop!) products are sufficient for the seventh-fastest supercomputer.
 
Perhaps you might want to re-read my comment; I haven't written a word about Fermi performance, just about its programmability. Performance-wise, Fermi had better be faster than Cypress, given that it is ~50% larger than Cypress, has more memory BW, and draws more power.

For peak performance at least, Cypress is faster in both SP and DP flops (680 vs. 515 DP GFLOPS). So what it really comes down to is which one delivers better efficiency in which kernels.
 
Cypress in HD5870 is 544 DP GFLOPS theoretical. The server version, FireStream, is likely to be less.
 
No, T is not used for double-precision. Though I suppose it can be used to seed initial values for evaluation of DP-RCP etc., since DP transcendentals aren't in the instruction set and so require a "macro".

DP MUL and MAD are both single-cycle using XYZW cooperatively. DP ADD uses pairs of lanes cooperatively, meaning that ADD is also 544 GFLOPS. I presume DP ADD in GF100 is half the FLOPS of DP MAD - not sure.
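For reference, here is the arithmetic behind that 544 figure. It's a back-of-the-envelope sketch using the commonly quoted HD5870 numbers (320 VLIW-5 units at 850 MHz), not anything from vendor documentation:

Code:
// Rough check of the theoretical HD5870 DP/SP rates (my numbers, not AMD's).
#include <cstdio>

int main() {
    const double clock_ghz  = 0.85; // HD5870 engine clock
    const int    vliw_units = 320;  // 1600 SPs / 5 lanes (x,y,z,w + t)
    // One DP MAD per VLIW unit per clock with x,y,z,w cooperating and t idle;
    // a MAD counts as two flops.
    const double dp_gflops = vliw_units * 2 * clock_ghz;  // = 544
    // SP for comparison: all 1600 lanes can issue a MAD every clock.
    const double sp_gflops = 1600 * 2 * clock_ghz;        // = 2720
    std::printf("DP: %.0f GFLOPS, SP: %.0f GFLOPS\n", dp_gflops, sp_gflops);
    return 0;
}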
 
DP MUL and MAD are both single-cycle using XYZW cooperatively. DP ADD uses pairs of lanes cooperatively, meaning that ADD is also 544 GFLOPS. I presume DP ADD in GF100 is half the FLOPS of DP MAD - not sure.

DP Add for GF100 is 1/2 MAD flops.

Though I don't think the MAD/MUL uses the XYZW cooperatively. Makes more sense for it to simply reduce each from x16 to x4 unless the data paths between XYZW are already interleaved.
 
Though I don't think the MAD/MUL uses the XYZW cooperatively. Makes more sense for it to simply reduce each from x16 to x4 unless the data paths between XYZW are already interleaved.
It's definitely cooperative. Simple inspection of a single DP-MUL instruction, as it is compiled, shows this:

Code:
kernel void doubletest( double X<>, double Y<>, out double Z<>)
{
 Z = X * Y ;
}

compiles as:

Code:
; --------  Disassembly --------------------
00 TEX: ADDR(48) CNT(2) VALID_PIX 
      0  SAMPLE R1.xy__, R0.xyxx, t0, s0  UNNORM(XYZW) 
      1  SAMPLE R0.xy__, R0.xyxx, t1, s0  UNNORM(XYZW) 
01 ALU: ADDR(32) CNT(5) 
      2  x: MUL_64      R0.x,  R1.y,  R0.y      
         y: MUL_64      R0.y,  R1.y,  R0.y      
         z: MUL_64      ____,  R1.y,  R0.y      
         w: MUL_64      ____,  R1.x,  R0.x      
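         ; note: the four MUL_64 slots above together form one double-precision
         ; multiply; x and y receive the 64-bit result, while z and w ("____")
         ; participate but write nothing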
         t: MOV         R0.z,  0.0f      
02 EXP_DONE: PIX0, R0.xyzz
END_OF_PROGRAM
The ALU is already cooperative for dot products. Then there are the dependent instructions introduced by Evergreen that build upon the dot-product capability.
 
From there,

The debate ended up being just under 45 minutes long and was chopped up into YouTube-compliant lengths; we opted for picture quality instead of length. Installments will be posted until we run out, and if we ever get a host that has the bandwidth to support it all, we will post the whole thing as a single video.
 
From there,
Right... I'm just confused. Not being a real YouTube user how long does it take to upload the different parts? Or are they editing them for other reasons first? The first part seems to imply they're pretty raw so I'm curious as to what the delay is about.
 
"opted for picture quality" - in a fracking interview filmed with shaky cam (tm)? I rather think they're opting for visit-fishing…
 
Right... I'm just confused. Not being a real YouTube user how long does it take to upload the different parts? Or are they editing them for other reasons first? The first part seems to imply they're pretty raw so I'm curious as to what the delay is about.

It can take a while to upload, depending on server load, but you can upload the different videos concurrently. The limit is 10 minutes and under 2 GB.
 
Does someone have a transcription of this? The image quality isn't too bad but the sound quality is terrible.

Anyway, trying to understand the best I can...
Andrew Richards said:
What we should see is the full software approach comes out first and the custom product comes out in the end. And actually we see the opposite of that. We actually see that the first implementation of DX11 is the full hardware implementation.
While the facts appear to point in that direction, I believe Andrew fails to see some of the dynamics behind it.

There are thousands of hardware engineers working on DX11 products, while the number of people working on a full DX11 software implementation can probably be counted on one hand. What is lacking to change this around is fully generic multi-core hardware (such as Larrabee). Once that's on the market it won't take very long before innovative new applications appear as a software implementation before any hardware implementation (if that would ever appear at all). Even today a lot of 'hardware' features are actually implemented using software in the firmware or driver.

Also, while an optimized DX11 software implementation has yet to appear, I believe it would have been very straightforward for Microsoft to continue the development of WARP and make it available before any hardware. And that would have effectively been created by a handful of software engineers versus thousands of hardware engineers.
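For context, WARP is already exposed through the ordinary D3D11 device-creation path; an application only has to ask for the software driver type. A minimal sketch using the standard API (how much of DX10/DX11 the returned feature level covers depends on the WARP build installed):

Code:
// Minimal sketch: requesting the WARP software rasterizer instead of a
// hardware adapter via the standard D3D11 entry point.
#include <d3d11.h>
#pragma comment(lib, "d3d11.lib")

int main() {
    ID3D11Device*        device  = nullptr;
    ID3D11DeviceContext* context = nullptr;
    D3D_FEATURE_LEVEL    level   = D3D_FEATURE_LEVEL_9_1;

    HRESULT hr = D3D11CreateDevice(
        nullptr,               // default adapter selection
        D3D_DRIVER_TYPE_WARP,  // software rasterizer instead of a GPU
        nullptr, 0,
        nullptr, 0,            // let the runtime pick the feature levels
        D3D11_SDK_VERSION,
        &device, &level, &context);

    if (SUCCEEDED(hr)) {
        // 'level' reports how much of the D3D pipeline this WARP build implements.
        context->Release();
        device->Release();
    }
    return 0;
}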

Ironically, there are millions of software developers who can work on various aspects of graphics technology, while only several thousand people have a hand in designing hardware. So there is tremendous potential waiting to be unleashed. Huge companies such as Microsoft revolutionized the way we use computers not through hardware, but through a complete focus on software. The software revolution for computer graphics has yet to begin...
 
Wow, that's naive. Hardware engineers had the bright idea to implement a software pipeline, in hardware, decades ago.
 
Wow, that's naive.
What's naive exactly?
Hardware engineers had the bright idea to implement a software pipeline, in hardware, decades ago.
Sure, but what's your point? It's just one pipeline. What developers (such as Sweeney) want is to be able to implement any pipeline. And this will require a true software implementation on fully generic hardware.
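To make "any pipeline" concrete, here is a toy sketch of my own (the names and structure are purely illustrative, not from any shipping renderer): once everything runs in software on generic cores, each stage is just a function you can swap wholesale, including stages that are fixed-function in today's hardware.

Code:
// Toy illustration of a fully programmable pipeline: every stage is a
// replaceable function, so the "pipeline" is whatever the developer writes.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

struct Fragment { int x, y; float depth; };

// Hypothetical stage signatures; any of them can be replaced outright,
// e.g. swapping the rasterizer for a ray caster or a micropolygon dicer.
using RasterStage = std::function<std::vector<Fragment>(int /*triangle id*/)>;
using ShadeStage  = std::function<uint32_t(const Fragment&)>;
using BlendStage  = std::function<uint32_t(uint32_t src, uint32_t dst)>;

struct SoftPipeline {
    RasterStage rasterize;
    ShadeStage  shade;
    BlendStage  blend;

    void draw(int triangle, std::vector<uint32_t>& framebuffer, int width) {
        for (const Fragment& f : rasterize(triangle)) {
            uint32_t& dst = framebuffer[f.y * width + f.x];
            dst = blend(shade(f), dst);  // programmable blend, not a fixed ROP
        }
    }
};

int main() {
    std::vector<uint32_t> fb(16 * 16, 0);
    SoftPipeline p;
    p.rasterize = [](int) { return std::vector<Fragment>{{1, 1, 0.5f}}; };  // stub rasterizer
    p.shade     = [](const Fragment&) { return 0xff00ff00u; };
    p.blend     = [](uint32_t s, uint32_t d) { return s | d; };             // custom blend rule
    p.draw(0, fb, 16);
    std::printf("pixel(1,1) = %08x\n", fb[1 * 16 + 1]);
    return 0;
}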
 
There are thousands of hardware engineers working on DX11 products, while the number of people working on a full DX11 software implementation can probably be counted on one hand.
Is this a hint that there is a massive untapped market for a software DX11 implementation or...?

What is lacking to change this around is fully generic multi-core hardware (such as Larrabee).
Fully generic and performant multicore hardware.
Obviously, we've had multicore CPUs for years; they are just insufficient.
Larrabee I apparently was not the one to break that trend.

Once that's on the market it won't take very long before innovative new applications appear as a software implementation before any hardware implementation (if that would ever appear at all). Even today a lot of 'hardware' features are actually implemented using software in the firmware or driver.
I'd like an analysis of this. What particularly innovative things would have a significant material impact on the market?
There are a number of weaknesses in the standard pipeline that could potentially be corrected with a different implementation.
However, how much would this amount to externally for the consumer?
A number of algorithms promise to correct one weakness or another in software, and they often do. But the gains are often incremental (better transparency pre-DX11, a lot of chrome spheres) and not sufficient to counter the reduced performance in the bulk of the workload, or they wind up being capped by other restrictions (asset creation, memory, art pipeline, etc.).

Does this creative flowering of software renderers offer significantly greater utility to the market, or is it searching for a problem?

Also, while an optimized DX11 software implementation has yet to appear, I believe it would have been very straightforward for Microsoft to continue the development of WARP and make it available before any hardware. And that would have effectively been created by a handful of software engineers versus thousands of hardware engineers.
What is the economic incentive for Microsoft for doing so?

Ironically, there are millions of software developers who can work on various aspects of graphics technology, while only several thousand people have a hand in designing hardware. So there is tremendous potential waiting to be unleashed. Huge companies such as Microsoft revolutionized the way we use computers not through hardware, but through a complete focus on software. The software revolution for computer graphics has yet to begin...
So it's better if millions of people work on the same thing over and over versus having a few thousand work on the same thing once?
 