Beyond3D Forum

Beyond3D Forum (http://forum.beyond3d.com/index.php)
-   Pre-release GPU Speculation (http://forum.beyond3d.com/forumdisplay.php?f=51)
-   -   The NEXT LAST R600 Rumours & Speculation Thread (http://forum.beyond3d.com/showthread.php?t=39173)

Geo 01-Mar-2007 02:56

The NEXT LAST R600 Rumours & Speculation Thread
 
Some history
The ATI R600 Rumours & Speculation Centrum
The (inappropriately named) LAST R600 Rumours & Speculation Thread
Huddy says R600
Xbit: This report is the standard for where "64 shaders" comes from.
B3D on 80nm.
B3D on 512-bit to external memory
B3D on Xenos heritage
Tech Report suggesting Dave Orton said "96 shaders" for R600. However, actual quote was "next generation", which might leave the door open that he was referring to a 65nm refresh rather than R600.
Roughly 6 pages of discussion about purported die shots, starting at #674 here
Site claiming to have an R600 listing specs and "testing" results. Sober and considered opinion of B3D staff concerning the claimed specs: "Pttthhhppptt!"
DailyTech finding said site credible.
CJ's leaked specs discussing timeframes, prices and performance estimations on this very thread.
Henri Richard promises Q1 for R600 launch
AMD sets Tech Day for R600
Beyond3D and Xbit report R600 pushed to Q2.

Some spec and launch date information from the AMD 690G launch on 2/28/2007.

Arty 01-Mar-2007 03:01

The interesting bits from that EETimes article:
  • Separately, AMD gave one of the first public demos of the R600, its next-generation graphics controller that uses 320 multiply-accumulate units.
  • The company showed a Barcelona-based system using two 200W R600 graphics cards to hit a terabit/second benchmark.
  • AMD also demonstrated working versions of its next-generation graphics chip the R600 to be released by the end of June.
  • Release of the R600 has been delayed "a few weeks" so that AMD can roll out a full suite of graphics chips covering multiple market segments for the latest Microsoft DirectX 10 applications programming interface.

Farhan 01-Mar-2007 03:18

It's teraflop, not terabit :)

http://www.reghardware.co.uk/2007/02...d_690g_launch/
http://www.informationweek.com/news/...=Breaking+News
http://blogs.zdnet.com/Berlind/?p=363
http://content.zdnet.com/2346-10741_22-57089.html
http://blogs.zdnet.com/Berlind/?p=364

http://www.boincstats.com/stats/host...sah&st=0&or=10 - zomg Barcelona?

R300King! 01-Mar-2007 03:26

Why the new thread? :rolleyes:

Quote:

AMD gave one of the first public demos of the R600, its next-generation graphics controller that uses 320 multiply-accumulate units. The company showed a Barcelona-based system using two 200W R600 graphics cards to hit a terabit/second benchmark.
So doesn't this mean it's 320/2 = 160 units? If you divide the 160 by vec4 you get 40. :)

But not sure if any of this is even true.

INKster 01-Mar-2007 03:29

Quote:

Originally Posted by R300King! (Post 938464)

So doesn't this mean it's 320/2 = 160 units? If you divide the 160 by vec4 you get 40. :)

But not sure if any of this is even true.

And if you multiply 40 by 2, you get 80. :wink:

R300King! 01-Mar-2007 03:30

Also, that 1 Terabit(is that correct?)/sec processing power, does that include the CPUs?

Quote:

The company showed a Barcelona-based system using two 200W R600 graphics cards to hit a terabit/second benchmark

R300King! 01-Mar-2007 03:33

Quote:

Originally Posted by INKster (Post 938468)
And if you multiply 40 by 2, you get 80. :wink:

Yeah, I know. :D I mean a single R600 chip will only have 160 or 40 vec4. Maybe that board was with 2 mid-range R600s with fewer shaders than the XTX version. Who knows? :)

OT: Have you seen Level505 recently? It's covered with ads. Way more than before. lol

Jawed 01-Mar-2007 03:34

Aha, Barcelona/R600 :razz:

Jawed

SirPauly 01-Mar-2007 03:36

Quote:

Separately, AMD gave one of the first public demos of the R600, its next-generation graphics controller that uses 320 multiply-accumulate units. The company showed a Barcelona-based system using two 200W R600 graphics cards to hit a terabit/second benchmark.

Release of the R600 has been delayed "a few weeks" so that AMD can roll out a full suite of graphics chips covering multiple market segments for the latest Microsoft DirectX 10 applications programming interface. Rival Nvidia rolled out its high-end DX10 graphics controller, the GeForce 8800 last fall but has not filled out its product line with midrange and low-end parts based on it yet.

"As soon as AMD makes their DX10 announcements, I am sure we will hear about competing products from Nvidia," said McCarron.

In addition, AMD announced a new desktop chip set, the first from the ATI division since the merger last fall. The AMD 690 sports an ATI Radeon X1250 graphics core and a new video decode block. It is also the former ATI's first chip set to support the HDMI video interface with HDCP copy protection for high definition video.
Good news.

SirPauly 01-Mar-2007 03:41

So does 320 Multiply Accumulate units = 80 shaders?

Sound_Card 01-Mar-2007 03:42

Just when I was in a down mood with all this bad news of R600 springing up, this stuff pops up and smacks me in the face.

So indeed, AMD is going for a complete platform launch. :razz:

Natoma 01-Mar-2007 03:43

Quote:

Originally Posted by SirPauly (Post 938481)
So does 320 Multiply Accumulate units = 80 shaders?

Geo is collecting his wager winnings. ;)

Sound_Card 01-Mar-2007 03:44

Quote:

Originally Posted by SirPauly (Post 938481)
So does 320 Multiply Accumulate units = 80 shaders?


It could be some hybrid form, methinks. Like 160 scalar shaders and 40 vec4. Or I could be dead wrong and dumb.

Razor1 01-Mar-2007 03:46

Well, if it's vec4 + scalar, that would be 64 units :) If it's vec3 + scalar, then 80 units. At least that's what it sounds like to me.
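
For what it's worth, the grouping arithmetic above can be sketched in a few lines of Python. This is only a sketch: it assumes the 320 multiply-accumulate units from the EETimes report divide evenly into identical shader units, and the function name and the width labels are made up for illustration, not anything AMD has confirmed.

```python
def units_for_width(total_lanes: int, lanes_per_unit: int) -> int:
    """Number of shader units if every unit holds `lanes_per_unit` MADD lanes."""
    if total_lanes % lanes_per_unit != 0:
        raise ValueError("lanes do not divide evenly into units")
    return total_lanes // lanes_per_unit


TOTAL_LANES = 320  # figure quoted by EETimes, taken at face value here

# Hypothetical groupings discussed in the thread (labels are illustrative).
groupings = {
    "vec4 + scalar (5 lanes per unit)": 5,
    "vec3 + scalar (4 lanes per unit)": 4,
}

for label, width in groupings.items():
    print(f"{label}: {units_for_width(TOTAL_LANES, width)} units")
```

Either grouping accounts for all 320 lanes, which is why the raw number alone cannot settle the 64 vs 80 question.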

Sound_Card 01-Mar-2007 03:50

Quote:

Originally Posted by Razor1 (Post 938486)
Well, if it's vec4 + scalar, that would be 64 units :) If it's vec3 + scalar, then 80 units. At least that's what it sounds like to me.


64 vec4 + scalar sounds very good to me.

Cuthalu 01-Mar-2007 03:54

So, what's the release date gonna be? "A few weeks" + "by the end of June" = release in April, with June referring to the end of Q2, which was the last possible date in the previous Q2 time window?

Geo 01-Mar-2007 04:01

http://www.beyond3d.com/#news39176

The bit about early rumors would be a reference to both Xbit reporting 64 shaders and ATI (at the time) reporting that they'd leveraged Xenos. Add in today's 320 and some "version 2" of unified hints, and you've got the reasoning we used for that last bit.

Sound_Card 01-Mar-2007 04:01

Quote:

Originally Posted by Cuthalu (Post 938489)
So, what's the release date gonna be? "A few weeks" + "by the end of June" = release in April, with June referring to the end of Q2, which was the last possible date in the previous Q2 time window?


I thought that "delay of a few weeks" was referring to CeBIT.

pakotlar 01-Mar-2007 04:14

Quote:

Originally Posted by Geo (Post 938492)
http://www.beyond3d.com/#news39176

The bit about early rumors would be a reference to both Xbit reporting 64 shaders and ATI (at the time) reporting that they'd leveraged Xenos. Add in today's 320 and some "version 2" of unified hints, and you've got the reasoning we used for that last bit.

Geo, that would be pretty exciting if true. 64 vec4 + scalar wouldn't be too shabby at all if the arbiter was decently efficient.

edit: Am I missing something? What happened to plain old 80 vec4? I didn't read the EETimes article. Maybe I should check that.

Geo 01-Mar-2007 04:25

Quote:

Originally Posted by Cuthalu (Post 938489)
So, what's the release date gonna be? "A few weeks" + "by the end of June" = release in April, with June referring to the end of Q2, which was the last possible date in the previous Q2 time window?

I don't think we heard anything today to change the opinion we offered on the front page last week that we're probably looking at late April or early May.

Cuthalu 01-Mar-2007 04:26

What's the difference between vec4 + scalar and vec4? Is it better to have 64 vec4 + scalar than having 80 vec4, and if so, why?

Geo 01-Mar-2007 04:39

Well, that they are referring to "320" would suggest that they might be all functionally scalar, even tho grouped as 64 5D (our guess) or 80 4D (certainly not impossible). To the degree the scheduling allows them to be treated as scalar, then which it is won't be all that important for most purposes. Scalar at all is the big thing, as vec4 will not be as efficient (tho you could get a lively argument going about how much control logic you have to add to make that happen in calculating the relative efficiency).

Pete 01-Mar-2007 04:50

Once again, Geo, I didn't think before posting. :oops: Vec4 + scalar lines up with Xenos, so 64 sounds right. A little harder to line up against G80, perhaps, with that rogue scalar, but it'll be an interesting fight, for sure.

(That scalar also makes for a nice "+25%" on top of 64 vec4s. Now, where have I heard "+25%" before? Am I just spinning my brain cells if I think of preemptive PR? :))

But functionally scalar, eh? That'd be a kick in the pants. I wonder if their unified v.2 would take that step.

Anywho, if Xbit was right about 64 shader units, then they're probably right about 16 texture units, and that may mean 16 ROPs. But 16 of NV's, or something more? It'd almost have to be more, seemingly, given all that bandwidth and if we can estimate the shader and so core clock from the 2 * R600 = teraflop figure.

Geo 01-Mar-2007 04:57

What I want to see is if rwolf can make 320 ALUs and 500 gflops into something 2GHz-ish. :grin:

psurge 01-Mar-2007 05:05

Geo - easy. They could operate at 2.4GHz but only have throughput of 1 madd every 3 cycles.

Ailuros 01-Mar-2007 05:07

64*9*0.8 = 461 * 2 = 922 ? Either the frequency is higher or I'm "stealing" 2 FLOPs from my speculative layman's math there. Or it should have read "nearly 1 Teraflop"....

Quote:

To the degree the scheduling allows them to be treated as scalar, then which it is won't be all that important for most purposes.
No problem at all; but I put that one up as a nice reminder for all those that laughed or protested against the marketed "128 SPs" of G80. We can by now more safely say that it's in reality a 16*Vec8 ALU thingy (and that's open to corrections too); so if I understand the above correctly and it's truly 64 5D ALUs I could either think of 4*Vec16 or 8*Vec8.

Assuming roughly the same efficiency for the ALUs between the two architectures, the major difference so far seems to be the G80 ALU clock domain and R600's "phatter" units.

turtle 01-Mar-2007 05:19

Well, according to this:

http://www.hardspell.com/doc/hardware/34620.html

A13 silicon seems final, and 'no less than 800mhz'. It also says the GDDR3 version of R600 is 12 layer, the GDDR4 being between 12-16 layer PCB, with the OEM card being 512MB and the retail card being 1GB (if I read it correctly.)

Seems Geo's assumption-based article could very well be right based on the '800mhz' number. :grin:

Article also mentions RV630 is also in AIB hands, and they are preparing cards based on it.

Hello massive family if not enthusiast (4x4 barcelona/crossfire-physics) platform launch?

Ailuros 01-Mar-2007 05:24

Assuming my idiotic math above has any legs, they'd need roughly 870MHz to fully reach a hypothetical 500 GFLOP rate.
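
A minimal sketch of that back-of-the-envelope calculation, for anyone who wants to plug in other guesses. It assumes the 64 units at 9 FLOPs per unit per clock from the earlier 64*9*0.8 post; the function name and parameters are hypothetical, and none of the inputs are confirmed R600 specs.

```python
def required_clock_mhz(target_gflops: float, units: int,
                       flops_per_unit_per_clock: int) -> float:
    """Clock in MHz at which `units` ALU blocks hit `target_gflops` at peak rate."""
    flops_per_clock = units * flops_per_unit_per_clock
    # GFLOPs -> MFLOPs, then divide by FLOPs issued per clock to get MHz.
    return target_gflops * 1000.0 / flops_per_clock


# 64 units, 9 FLOPs per unit per clock (speculative figures from the thread).
print(round(required_clock_mhz(500.0, 64, 9)))  # roughly 870 MHz, as stated above
```

Swapping in a different FLOPs-per-clock guess (say 10 for five MADD lanes) moves the required clock accordingly, so the answer is only as good as the assumed unit layout.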

turtle 01-Mar-2007 05:30

Hmm...Maybe that old "A12 hits 1ghz" rumor has some legs, eh?

Things certainly are starting to come together. :yes:

Rangers 01-Mar-2007 05:52

Quote:

Originally Posted by serenity (Post 938457)
The interesting bits from that EETimes article:
  • Separately, AMD gave one of the first public demos of the R600, its next-generation graphics controller that uses 320 multiply-accumulate units.
  • The company showed a Barcelona-based system using two 200W R600 graphics cards to hit a terabit/second benchmark.
  • AMD also demonstrated working versions of its next-generation graphics chip the R600 to be released by the end of June.
  • Release of the R600 has been delayed "a few weeks" so that AMD can roll out a full suite of graphics chips covering multiple market segments for the latest Microsoft DirectX 10 applications programming interface.

So again, why the hell did they delay it?

I honestly can't believe it. It seems like ATI just did it to lose.

The whole idea of introducing a whole "suite" is just stupid, as neither Nvidia nor anybody else does that. You go high end first.

ChrisRay 01-Mar-2007 06:00

Quote:

SirPauly

http://www.3declipse.com/images/stor...1/8xqplane.png

http://www.3declipse.com/images/stor...16xaaplane.png

The x16 AA clearly isn't offering better smoothing than the X8Q..........and the reason why this shot matters is because it is close to a near horizontal with nice color contrasts to gauge image quality in a static shot. This is easier to notice with a moving screen.
Why are you picking out illustrations that don't cover the comparison as a whole? Comparing it to 4xAA and 8xAA will further explain the discrepancy shown here. But the loss of detail on tiny objects from 4x to 8x/16x 8x CSAA has the exact same problem. I chose that scene for specific reasons: to illustrate that near high-polygon edges will not benefit from it, and to show near distant objects in low resolutions. Nor will high levels of multisampling. It's more interesting to see what CSAA does to it in the very distant objects. Clearly I am familiar with that comparison, as I am the one who made it. And I will reiterate: it is still "not" comparable to 6xAA. Trying to compare it to 6xAA is still a poor comparison. There are plenty of other comparisons I used which can illustrate the exact opposite. I certainly am not hiding the issue that CSAA isn't perfect. But it's still a huge upgrade over 4xAA in most circumstances at a minimal performance hit. To the point that 4xAA has become a largely irrelevant performance mode for G80 users.

Natoma 01-Mar-2007 06:03

Quote:

Originally Posted by Rangers (Post 938555)
So again, why the hell did they delay it?

I honestly can't believe it. It seems like ATI just did it to lose.

The whole idea of introducing a whole "suite" is just stupid, as neither Nvidia nor anybody else does that. You go high end first.

ATI doesn't exist anymore. It's AMD remember. That changes the approach significantly.

nelg 01-Mar-2007 06:12

Perhaps AMD delayed the R600 to use the new family of Rx6XX cards to bolster the performance of Barcelona.

epicstruggle 01-Mar-2007 06:14

Quote:

Originally Posted by turtle (Post 938547)
Hmm...Maybe that old "A12 hits 1ghz" rumor has some legs, eh?

Things certainly are starting to come together. :yes:

I was thinking the same thing. We are going to have to revisit at least a few old rumors in the coming days.

Shtal 01-Mar-2007 06:21

Quote:

Originally Posted by Rangers (Post 938555)
So again, why the hell did they delay it?

I honestly can't believe it. It seems like ATI just did it to lose.

The whole idea of introducing a whole "suite" is just stupid, as neither Nvidia nor anybody else does that. You go high end first.

Just my own thoughts about delay!

1st = There is no solid DX10 driver for Vista. (For example, like for G80)
2nd = There are no DX10 Vista games.
3rd = Probably to surprise Nvidia, since they don't know what they are up against and have to adjust the GF8900GTX to match R600.
4th = There is probably little or no profit in the high end, so they need midrange graphics cards to make up the cost for an overall profit.
5th = Not many people will upgrade their video cards right away (for example, Geo with his GF8800GTX :) )

Dave Baumann 01-Mar-2007 06:24

http://biz.yahoo.com/bw/070301/20070228006340.html?.v=1

Quote:

SAN FRANCISCO--(BUSINESS WIRE)--AMD (NYSE: AMD - News) today showcased a single-system, Accelerated Computing platform that breaks the teraflop computing barrier. Organizations are ultimately expected to be able to apply this technology to a wide range of scientific, medical, business and consumer computing applications. At a press event in San Francisco, AMD demonstrated a "Teraflop in a Box" system running a standard version of Microsoft® Windows® XP Professional that harnessed the power of AMD Opteron(TM) dual-core processor technology and two next-generation AMD R600 Stream Processors capable of performing more than 1 trillion floating-point calculations per second using a general "multiply-add" (MADD) calculation. This achievement represents a ten-fold performance increase over today's high-performance server platforms, which deliver approximately 100 billion calculations per second.

Anarchist4000 01-Mar-2007 07:05

Quote:

Originally Posted by Rangers (Post 938555)
So again, why the hell did they delay it?

I honestly can't believe it. It seems like ATI just did it to lose.

The whole idea of introducing a whole "suite" is just stupid, as neither Nvidia nor anybody else does that. You go high end first.

I'm still leaning towards the platform launch, not just the family launch. And nobody else does it because nobody else can. Nvidia last I checked doesn't make CPUs and Intel's discrete graphics market hasn't quite developed yet.

Quote:

Originally Posted by Ailuros
Assuming my idiotic math above has any legs, they'd need roughly 870MHz to fully reach a hypothetical 500 GFLOP rate.

64*5*3*.8=768GFLOPs * 2 = 1.536TFLOPs

A R600 Crossfire should fairly effectively destroy the TFLOP barrier. Also consider this, with G80's missing MUL a single R600 has more than twice the theoretical FP power.


In regards to the scheduling what if they just didn't bother with making it perfectly efficient as ALUs seem somewhat cheap going by the R520->R580 example.

1+1 = 1+2 = 1+3 = 1+4

If it doesn't branch you can really pack em in there. If it does branch you could look at it like 2 scalars. I can't think of how you'd end up with any shaders that had a greater than 50% scalar:vector ratio. Save the complexity of the scheduling and just add more ALUs and clockspeed.

Quote:

Originally Posted by Geo
What I want to see is if rwolf can make 320 ALUs and 500 gflops into something 2GHz-ish.

Simple... Inverted Clock Domains. :runaway:

turtle 01-Mar-2007 07:21

Hmm. Dual-core 'city' CPU was used at the event instead of QC eh? Hint of things to come? :D

Can you comment on if that mysterious Opteron that showed up on BOINC is of the same breed or a hoax? That article certainly lends credence to the possibility of it being real...

Also, just out of curiosity, what MHZ number would be needed to hit 512GFLOPS using Geo's guesstimate on ALUs? ~890 (going by Al's math)?

Going by at least 1 TFLOP though, Al's math (which I have no idea is correct), and assuming that Opteron was around 24-25 GFLOPs (which might be slightly off): 975/2 = 487.5, and 487.5/(64*9) = 846 MHz.

That wouldn't quite be half a teraflop per card, but close.

I'mma guess it was running at 850mhz or greater. :razz:

overclocked 01-Mar-2007 07:27

If it's based on vec and scalar units, could those be clocked differently, with the vec slower and the scalar faster?

Unknown Soldier 01-Mar-2007 07:29

So now, speculation prices for this monster that will be out in a few weeks.

$600?

US

epicstruggle 01-Mar-2007 07:53

Quote:

Originally Posted by Unknown Soldier (Post 938592)
So now, speculation prices for this monster that will be out in a few weeks.

$600?

US

IMO, $650. Although maybe AMD will try to be aggressive with their pricing since they are launching a whole family of cards and not high end only. It will be interesting to see what other differences this launch will have with previous ones.

epic

rwolf 01-Mar-2007 08:02

Quote:

Originally Posted by Geo (Post 938527)
What I want to see is if rwolf can make 320 ALUs and 500 gflops into something 2GHz-ish. :grin:

lol ... ummm, I think they are saving that for the refresh at 65nm. :wink:

R300King! 01-Mar-2007 08:03

Can the 8800GTX in SLI do 1 TFLOP?

Also, does this include the CPU FLOPs?

How many GFLOPs can, let's say, an OC'd quad-core QX6700 do?

Acert93 01-Mar-2007 08:45

Quote:

Originally Posted by rwolf (Post 938603)
lol ... ummm, I think they are saving that for the refresh at 65nm. :wink:

Fast14 lives on :cool:

northfieldz 01-Mar-2007 09:36

G80 can do it as well when SLi is enabled.

And I don't think the number is too good for ATi, especially considering R600 is more of a vector-machine.

Per B 01-Mar-2007 09:59

Given the new positive signals both regarding R600 and Barcelona, could it simply be that AMD wants to be able to provide the first R600 (pre-)reviewers with Barcelona systems?!? If they don't, we can be pretty sure that the reviews will be performed on Core 2 Duo or Quad systems, and that doesn't look that good for AMD... it is still the CPU's that are the big thing for AMD.

So the R600 delay could simply be because AMD is on track with Barcelona, as well as the matching chip-sets!

vertex_shader 01-Mar-2007 10:02

"ATI R600 and the next field demonstration engines GDC"
http://66.249.93.104/translate_c?hl=...1/78/78496.htm

Evildeus 01-Mar-2007 10:07

Quote:

Originally Posted by Dave Baumann (Post 938571)
Microsoft® Windows® XP Professional

Hmmmm...

Hopefully, we will have a good GPU in may ;)

vertex_shader 01-Mar-2007 10:33

Quote:

Originally Posted by turtle (Post 938538)
Well, according to this:

http://www.hardspell.com/doc/hardware/34620.html

A13 silicon seems final, and 'no less than 800mhz'. It also says the GDDR3 version of R600 is 12 layer, the GDDR4 being between 12-16 layer PCB, with the OEM card being 512MB and the retail card being 1GB (if I read it correctly.)

Seems Geo's assumption-based article could very well be right based on the '800mhz' number. :grin:

Article also mentions RV630 is also in AIB hands, and they are preparing cards based on it.

Hello massive family if not enthusiast (4x4 barcelona/crossfire-physics) platform launch?

Now I can see the light at the end of the dark tunnel from AMD's perspective.
The situation looks good for AMD now (if these news/rumors turn out to be true); maybe the time has come to buy some shares :smile:

So is it almost confirmed that Barcelona is the reason why R600 was delayed?

icecold1983 01-Mar-2007 10:37

Isn't G80 like 330ish GFLOPs? How would 2 in SLI make 1 TFLOP?

DemoCoder 01-Mar-2007 10:38

Quote:

Originally Posted by Anarchist4000 (Post 938585)
64*5*3*.8=768GFLOPs * 2 = 1.536TFLOPs

Umm, where's this * 3 coming from? A MADD is 2 FLOPS, not 3. 64 * 5 * 2 * .8 = 512 GFLOPs is more like it.

Don't you think the press release would have said "1.5TFlops!" if your math was right?

Quote:

A R600 Crossfire should fairly effectively destroy the TFLOP barrier. Also consider this, with G80's missing MUL a single R600 has more than twice the theoretical FP power.
Nope. A G80 with missing MUL = 518Gflops. Certainly, the G80 won't always be able to use the missing MUL every cycle, but neither will the R600 be able to use every SIMD slot of their VEC4 every cycle either unless it is a scalar design IMHO. The true efficiency will be hard to calculate for both, so comparing absolutely unrealistic peak rates is nonsense.

CarstenS 01-Mar-2007 10:39

Quote:

Originally Posted by Ailuros (Post 938533)
64*9*0.8 = 461 * 2 = 922 ? Either the frequency is higher or I'm "stealing" 2 FLOPs from my speculative layman's math there. Or it should have read "nearly 1 Teraflop"....

Nope, but you need to read PR-Statements more carefully... ;)

"AMD demonstrated a "Teraflop in a Box" system running a standard version of Microsoft® Windows® XP Professional that harnessed the power of AMD Opteron(tm) dual-core processor technology and two next-generation AMD R600 Stream Processors capable of performing more than 1 trillion floating-point calculations per second using a general "multiply-add" (MADD) calculation."

My bold etc.

Assuming VR-Zone was correct and there really was a Barcelona CPU (which is based on "AMD Opteron(tm) dual-core processor technology " [note the "technology"]), you might end up with significantly less than 500 GFLOPs/sec. here, for a single R600, obviously.

Anarchist4000 01-Mar-2007 10:43

Quote:

Originally Posted by northfieldz
G80 can do it as well when SLi is enabled.

And I don't think the number is too good for ATi, especially considering R600 is more of a vector-machine.

One G80 = ~345GFLOPs ... Two G80's > 1000GFLOPs?

Yeah I'm not quite seeing it. Doubt they're getting 100% efficiency and perfect scaling for the SLI either.

I'm assuming the >1TFLOP mark was measured performance and not a purely theoretical number.

Quote:

Originally Posted by DemoCoder
Umm, where's this * 3 coming from? A MADD is 2 FLOPS, not 3. 64 * 5 * 2 * .8 = 512 GFLOPs is more like it.

Don't you think the press release would have said "1.5TFlops!" if your math was right?

MADD=2, ADD=1

Assuming they're still using the MADD+ADD setup they've been using before.

64*5*3*0.8 = 768.0GFLOPs R600
128*3*1.35 = 518.4GFLOPs Normal
128*2*1.35 = 345.6GFLOPs Missing MUL

When measuring the GFLOPs you're running an operation that lines up best to the card so every ALU should be fully utilized most of the time. Best case scenario basically. So R600 should be capable of feeding all of those pipelines. This would be one of those cases where I'd expect R600 to thrash G80 just because of the design focus. It's somewhat meaningless in real world applications, but for the purpose of doing that many operations R600 is capable of significantly more than G80. We don't know by how much R600 broke the barrier, but if it was measured performance, and discounting FLOPs from the CPUs, that's 66% efficiency (1 / 1.5 TFLOPs) including the scaling hit for Crossfire. I guess it really comes down to whether 1 TFLOP was measured or theoretical performance.
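
To make the competing peak numbers easy to reproduce, here is a small sketch that tabulates them. Everything about R600 in it is an assumption lifted from this thread (320 lanes, 0.8 GHz, MADD+ADD vs MADD only); the G80 rows use the 128 lanes at 1.35 GHz quoted above. The function and dictionary names are hypothetical.

```python
def peak_gflops(lanes: int, flops_per_lane_per_clock: int, clock_ghz: float) -> float:
    """Theoretical peak: every lane issues its full FLOP count every clock."""
    return lanes * flops_per_lane_per_clock * clock_ghz


# (lanes, FLOPs per lane per clock, clock in GHz); R600 rows are pure speculation.
configs = {
    "R600, MADD+ADD (assumed)": (320, 3, 0.80),
    "R600, MADD only (assumed)": (320, 2, 0.80),
    "G80, MADD+MUL": (128, 3, 1.35),
    "G80, MADD only": (128, 2, 1.35),
}

for name, (lanes, flops, ghz) in configs.items():
    print(f"{name}: {peak_gflops(lanes, flops, ghz):.1f} GFLOPs")
```

The spread between the two R600 rows (768 vs 512 GFLOPs) is exactly the disagreement over whether a MADD-only or a MADD+ADD unit should be assumed.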

Graham 01-Mar-2007 11:14

Quote:

Originally Posted by Rangers (Post 938555)
The whole idea of introducing a whole "suite" is just stupid, as neither Nvidia nor anybody else does that. You go high end first.

The major sales and major money come from the medium/low end. If you can launch these before the added hype the high end generates fades away, then all the better. Provided the high end card does well, it can only have a positive effect on the lower end cards.

I was told geforce 8000 cards are fastest in the world... wait.. I can't afford it. Never mind.
or..
I was told radeon x2000 cards are the fastest in the world... awesome! they have one at my price point!

Let's hope that if AMD does release an entire platform in one hit, top to bottom, they unify the naming schemes too... Like AMD X[series] [perf] [product]... AMD x2 300 graphics, AMD x2 400 cpu, AMD x2 200 platform... whatever. Something like that.

Arnold Beckenbauer 01-Mar-2007 11:24

Quote:

Originally Posted by Pete (Post 938522)
Anywho, if Xbit was right about 64 shader units, then they're probably right about 16 texture units, and that may mean 16 ROPs. But 16 of NV's, or something more? It'd almost have to be more, seemingly, given all that bandwidth and if we can estimate the shader and so core clock from the 2 * R600 = teraflop figure.

http://www.beyond3d.com/articles/xenos/index.php?p=06
If I understand the Xenos diagram correctly, Xenos' TMUs are not part of the three shader units; they are decoupled or something else.
Decoupled 24 TMUs and 64 5D-ALUs (four SIMD clusters à la Xenos?).
Or is it too crazy?

CarstenS 01-Mar-2007 11:29

Quote:

Originally Posted by Anarchist4000 (Post 938645)
One G80 = ~345GFLOPs ... Two G80's > 1000GFLOPs?

Yeah I'm not quite seeing it. Doubt they're getting 100% efficiency and perfect scaling for the SLI either.

I'm assuming the >1TFLOP mark was measured performance and not a purely theoretical number.

Under Vista I'm getting about 93 percent MADD efficiency on G80. (~322 out of 346 GFLOPs/sec.)
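
A quick sketch of how that percentage falls out, treating efficiency as measured throughput divided by theoretical peak. The 322 and 346 GFLOPs figures are the ones quoted above; the function name is hypothetical.

```python
def efficiency(measured_gflops: float, peak_gflops: float) -> float:
    """Fraction of theoretical peak actually achieved in a benchmark run."""
    if peak_gflops <= 0:
        raise ValueError("peak must be positive")
    return measured_gflops / peak_gflops


# Single G80, MADD test: ~322 measured against a ~346 GFLOPs peak.
print(f"{efficiency(322.0, 346.0):.1%}")
```

The same ratio applied to a two-card setup would also fold in whatever multi-GPU scaling loss there is, which is why a single-card figure says little about a dual-card teraflop claim.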

Anarchist4000 01-Mar-2007 11:33

Well they've already got tons of X's in the names as well as an affinity for 4 digit numbers so does that count as unified? XL, XT, XTX, FX, X2, x64

But the entire platform launch does look rather appealing from a marketing perspective. Of course all the reviewers are gonna be mad because they get nailed with a massive workload all at once.

Quote:

Originally Posted by Quasar
Under Vista I'm getting about 93 percent MADD efficiency on G80. (~322 out of 346 GFLOPs/sec.)

I'm assuming you aren't using SLI though and how exactly was it measured out of curiosity?

CarstenS 01-Mar-2007 11:41

Quote:

Originally Posted by Anarchist4000 (Post 938659)
I'm assuming you aren't using SLI though and how exactly was it measured out of curiosity?

No, it was a single G80. Measured with official 100.65 Forceware, Vista x86 and v1.2.1 of GPUBench's "Scalar vs. Vektor Instruction Issue"-part.

On R580+ I am only getting close to 75 percent efficiency on MAD and only about 50 percent on ADD (Cat 7.1; curious note: scalar split does not seem to work anymore in 7.1 drivers, but vec4 results are in line with older drivers).

DemoCoder 01-Mar-2007 11:43

Quote:

Originally Posted by Anarchist4000 (Post 938645)
MADD=2, ADD=1

Assuming they're still using the MADD+ADD setup they've been using before.

Unsubstantiated assumption. Xenos doesn't have that setup. And if they are trying to increase efficiency and density, as well as clocks, it's more likely that they've simplified the setup. It's harder to co-schedule dependent ALU instructions via ILP than to use TLP. Efficiency is worse. NVidia learned its lesson, and presumably ATI did too.

Quote:

This would be one of those cases where I'd expect R600 to thrash G80 just because of the design focus.
Design focus? How do you consider Nvidia's scalar design not focused on efficiency, but then go on to make the assumption that ATI went the route of vectorized ALUs and therefore, efficiency was its design focus? There's way too much cheerleading in these assumptions.

To me, the R600's true efficiency is a big question mark if it is indeed a vectorized GPU, because it takes more HW and compiler magic to extract efficiency out of this setup.

Anarchist4000 01-Mar-2007 11:52

For this case, benchmarking based on FLOPs, efficiency should be rather good under any condition. If they're using vectors they should be able to pack more into a given area. It's one of those tests where efficiency shouldn't be a significant factor, as it would be high on both. Therefore the card with the higher theoretical power wins. "Design focus" probably wasn't the correct description. I meant this was a situation extremely well suited for a vector based design.

I'd agree with you that in terms of actual real world performance the efficiency of R600 will be the deciding factor.

DemoCoder 01-Mar-2007 12:02

I don't get your efficiency measure. As far as benchmarking is concerned, efficiency = actual throughput/peak theoretical throughput. In this case, vectors lose out. Vectors may win on transistor density, but that's a different efficiency measure.

Without knowing what workload the benchmark consists of, you can't really make any claims as to real world efficiency. But it is well known that maximizing vectorization of code to match the underlying hardware vector architecture is a difficult problem. Unless you feed handcrafted *ideal* code to the vectorized units, it's unlikely you will get close to peak theoretical rates, unless you think the compiler performs voodoo levels of instruction scheduling.

It's simply easier to extract maximum efficiency and parallelism not having to worry about packing and co-issuing 5D operations. There's way more opportunities to screw up.

Now, if you want to claim that ATI fed handcrafted and completely artificial workloads that extracted near peak FLOPs rates, well la-de-da, but the people looking to buy GPUs to run on real world workloads are more interested in how the chip's efficiency compares on a diverse set of workloads.

You know, the PlayStation 2 had amazing peak theoretical rates that one could hack custom and artificial benchmarks to reach. It isn't hard with eDRAM. In the real world, you needed the PS2 performance analyzer to get anywhere near sane efficiency levels.
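
As a rough illustration of the efficiency measure above (actual throughput over peak theoretical throughput), here is a minimal Python sketch. The op-width lists are made-up workloads, and the model assumes the worst case where every operation occupies a whole issue slot on a 5-wide unit with no packing or co-issue by the compiler. None of it comes from AMD or from measured data.

```python
def lane_utilization(op_widths, lanes=5):
    """Useful lanes / available lanes when every op takes one full issue slot.

    Assumption (hypothetical model): no packing or co-issue, so a 3-component
    op on a 5-lane unit leaves 2 lanes idle for that slot.
    """
    if not op_widths:
        raise ValueError("need at least one operation")
    if any(w < 1 or w > lanes for w in op_widths):
        raise ValueError("each op width must be between 1 and the lane count")
    return sum(op_widths) / (lanes * len(op_widths))


# Made-up mix of vec4, vec3, vec2 and scalar ops (not measured data).
mixed = [4, 1, 3, 3, 1, 2]
# Handcrafted "ideal" stream that fills all five lanes in every slot.
ideal = [5, 5, 5, 5]

print(f"mixed workload on 5 lanes: {lane_utilization(mixed):.0%}")
print(f"ideal workload on 5 lanes: {lane_utilization(ideal):.0%}")
# The same work split into scalar ops on 1-lane units wastes nothing.
print(f"mixed workload, scalar:    {lane_utilization([1] * sum(mixed), lanes=1):.0%}")
```

A real compiler would recover some of the idle lanes by packing independent ops together, so the truth for any given shader sits somewhere between the two 5-lane figures.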

Techno+ 01-Mar-2007 12:05

I believe that if Anarchist's speculation on 3 FLOPs/cycle is correct, then an R600 at 1 GHz can achieve a TFLOP.

nexus_alpha 01-Mar-2007 12:06

Quote:

Originally Posted by Quasar (Post 938643)
Nope, but you need to read PR-Statements more carefully... ;)

"AMD demonstrated a "Teraflop in a Box" system running a standard version of Microsoft® Windows® XP Professional that harnessed the power of AMD Opteron(tm) dual-core processor technology and two next-generation AMD R600 Stream Processors capable of performing more than 1 trillion floating-point calculations per second using a general "multiply-add" (MADD) calculation."

My bold etc.

Assuming VR-Zone was correct and there really was a Barcelona CPU (which is based on "AMD Opteron(tm) dual-core processor technology" [note the "technology"]), you might end up with significantly less than 500 GFLOPs/sec. here.

capable of performing more than 1 trillion floating-point calculations per second using a general "multiply-add" (MADD) calculation."

DemoCoder 01-Mar-2007 12:09

Reminds me of the old 3dfx commercial.

Ailuros 01-Mar-2007 12:11

Quote:

Originally Posted by Anarchist4000 (Post 938645)
MADD=2, ADD=1

Assuming they're still using the MADD+ADD setup they've been using before.

64*5*3*0.8 = 768.0GFLOPs R600
128*3*1.35 = 518.4GFLOPs Normal
128*2*1.35 = 345.6GFLOPs Missing MUL

You realize, though, that it could also be one 4D MADD + one 1D ADD, don't you?
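
For anyone who wants to replay the arithmetic, here is a small Python sketch of the peak-rate math in the quoted post, plus the 4D MADD + 1D ADD alternative. The function name and labels are just for illustration; the unit counts, clocks and per-op FLOP weights (MADD=2, ADD=1) are the speculative figures from the thread, not confirmed specs.

```python
MADD, ADD = 2, 1  # FLOPs per lane per clock, as counted in the quoted post


def peak_gflops(lanes, flops_per_lane_per_clock, clock_ghz):
    """Peak theoretical rate: lanes * FLOPs per lane per clock * clock in GHz."""
    return lanes * flops_per_lane_per_clock * clock_ghz


# Lines from the quoted post (all speculative).
r600_madd_add = peak_gflops(64 * 5, MADD + ADD, 0.8)
normal = peak_gflops(128, 3, 1.35)
missing_mul = peak_gflops(128, 2, 1.35)

# Alternative reading: per 5D ALU, one 4D MADD (4 * 2 FLOPs) plus one 1D ADD.
flops_per_alu = 4 * MADD + 1 * ADD  # 9 FLOPs per ALU per clock
r600_4d_madd_1d_add = peak_gflops(64, flops_per_alu, 0.8)

for label, value in [
    ("R600, MADD+ADD on all 5 lanes", r600_madd_add),
    ("Normal", normal),
    ("Missing MUL", missing_mul),
    ("R600, 4D MADD + 1D ADD", r600_4d_madd_1d_add),
]:
    print(f"{label:32s} {value:7.1f} GFLOPs")
```

The two R600 readings differ by well over 300 GFLOPs at the same assumed clock, which is why the ALU makeup matters so much for the teraflop claim.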

nexus_alpha 01-Mar-2007 12:12

Quote:

Originally Posted by nexus_alpha (Post 938670)
capable of performing more than 1 trillion floating-point calculations per second using a general "multiply-add" (MADD) calculation."

Sorry, didn't know the words would come out so big; can't find the edit button.

Anarchist4000 01-Mar-2007 12:15

Quote:

Originally Posted by DemoCoder
It's simply easier to extract maximum efficiency and parallelism not having to worry about packing and co-issuing 5D operations. There's way more opportunities to screw up.

Now, if you want to claim that ATI fed handcrafted and completely artificial workloads that extracted near peak FLOPs rates, well la-de-da, but the people looking to buy GPUs to run on real world workloads are more interested in how the chip's efficiency compares on a diverse set of workloads.

In the interest of running a benchmark that breaks the 1 TFLOP barrier I assumed any test performed by a company and demonstrated to the public would be handcrafted and show a best case scenario. Running a test based on conditional testing of random numbers would be rather insane for a company to actually demonstrate unless it was beneficial in some way to their hardware.

Rereading that article, I would say it looks more like a single MADD and not the MADD+ADD setup I was assuming. They would have mentioned that in the article unless they didn't understand what was happening. In the past I was under the impression that ATI used Vec3+1, with each unit being a MADD+ADD and the +1 having additional SF logic.

Razor1 01-Mar-2007 12:45

Correct me if I'm wrong, but if these are Xenos-style ALUs, shouldn't it only be 8 + 2 flops per MAD and co-issued ADD?

Oh sorry, didn't see your post, DemoCoder ;)

CarstenS 01-Mar-2007 13:13

Quote:

Originally Posted by nexus_alpha (Post 938674)
Sorry, didn't know the words would come out so big; can't find the edit button.

So what? If you're referring to my 500 GFLOPs - that was for a single R600.

Geo 01-Mar-2007 14:02

Quote:

Originally Posted by Ailuros (Post 938533)
64*9*0.8 = 461 * 2 = 922 ? Either the frequency is higher or I'm "stealing" 2 FLOPs from my speculative layman's math there. Or it should have read "nearly 1 Teraflop"....

I assumed 10 rather than 9 to make the math work; in other words that 5th D scalar of the original Xenos config is now MADD like the others rather than ADD. Because otherwise I'd think the point of doing scalar in the first place would be more complicated if all the units aren't interchangeable. But then this is speculative, so perhaps you're right and I'm wrong.

Quote:

No problem at all; but I put that one up as a nice reminder for all those that laughed or protested against the marketed "128 SPs" of G80. We can by now more safely say that it's in reality a 16*Vec8 ALU thingy (and that's open to corrections too); so if I understand the above correctly and it's truly 64 5D ALUs I could either think of 4*Vec16 or 8*Vec8.
And it seems likely that G80 acts in quad mode for certain things too, so it's really about context, isn't it?
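
To make the 9-versus-10 argument concrete, here is a hedged Python sketch that works out the clock a two-card setup would need to reach 1 TFLOP under each assumption. The helper names are hypothetical, and the 64 ALUs per card and 0.8 GHz clock are the speculative numbers used in this thread, not confirmed specs.

```python
def two_card_gflops(flops_per_alu_per_clock, clock_ghz, alus_per_card=64, cards=2):
    """Peak GFLOPs for the whole box under the thread's speculative layout."""
    return cards * alus_per_card * flops_per_alu_per_clock * clock_ghz


def required_clock_ghz(target_gflops, flops_per_alu_per_clock, alus_per_card=64, cards=2):
    """Clock needed to hit target_gflops with the same layout."""
    return target_gflops / (cards * alus_per_card * flops_per_alu_per_clock)


for flops in (9, 10):  # 9 = 4D MADD + 1D ADD, 10 = MADD on all five lanes
    at_800 = two_card_gflops(flops, 0.8)
    needed = required_clock_ghz(1000.0, flops)
    print(f"{flops} FLOPs/ALU/clock: {at_800:.1f} GFLOPs at 0.8 GHz, "
          f"{needed * 1000:.0f} MHz needed for 1 TFLOP")
```

With 10 FLOPs per ALU the 0.8 GHz guess clears the teraflop mark on its own; with 9 it falls short unless the clock is higher, which is the gap described in the quoted post.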

Geo 01-Mar-2007 14:28

Oh, and if you look at what Wavey linked upstream:

Quote:

two next-generation AMD R600 Stream Processors capable of performing more than 1 trillion floating-point calculations per second using a general "multiply-add" (MADD) calculation.
You might wonder why it's specifying MADD. :smile: But, really, if you make the scalar assumption (and there seem to be some people who think that's not in the bag yet, that tossing it out as "320" was just marketing), then wouldn't that nearly demand that the units be all the same capability?

Jawed 01-Mar-2007 14:32

Quote:

Originally Posted by Geo (Post 938728)
wouldn't that nearly demand that the units be all the same capability?

If you count G80's MAD FLOPs, entirely excluding the SF unit, then you're left counting 128 MADs per clock.

So R600's 320 MADs per clock may be excluding SF too.

Jawed

Geo 01-Mar-2007 14:46

Err, come to think of it (he said, looking at his own report on the front page), "multiply-accumulate units" sounds an awful lot like MADD as reported by a reporter who isn't hip to the usual lingo. Doesn't it?

Razor1 01-Mar-2007 14:50

So possibly no co-issue?

Jawed 01-Mar-2007 14:51

Quote:

Originally Posted by Geo (Post 938746)
Err, come to think of it (he said, looking at his own report on the front page), "multiply-accumulate units" sounds an awful lot like MADD as reported by a reporter who isn't hip to the usual lingo. Doesn't it?

MAC is the generally accepted term for what we GPU people like to call MAD.

Jawed


All times are GMT +1.