Old 30-May-2005, 20:06   #126
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297
Default

Quote:
Originally Posted by Jawed
Ah, I've never seen an SFU on an NVidia diagram before. Good thinking.
Nvidia explicitly talks about SFUs in its vertex shader pipeline.
NV40 pixel pipelines can also perform a reciprocal and a normalization at the same time.

Quote:
I suppose, alternatively, it could be the Fog ALU that you can see here:

http://www.beyond3d.com/previews/nvi.../index.php?p=9

SM3 requires that Fog is done in shader code rather than as a fixed function unit in the ROP.

Jawed
Fog is not a good candidate because it's an operation you do once per fragment, so it doesn't make sense to add a special unit for it to the shader ALUs. Moreover, fog doesn't require a special function unit to be applied, as it uses just a couple of fmadd ops that the NV40 pixel pipelines already provide.
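To make the "couple of fmadd ops" point concrete, here is a minimal sketch in Python. It assumes the conventional linear fog formula (factor = (end - depth) / (end - start), followed by a blend toward the fog colour). The function names and the way scale and bias are folded into constants are illustrative assumptions, not NV40's actual instruction sequence.

```python
def madd(a, b, c):
    """One multiply-add: a * b + c (stands in for a hardware fmadd)."""
    return a * b + c


def linear_fog_factor(depth, fog_start, fog_end):
    """Linear fog factor as a single madd plus a clamp.

    scale and bias depend only on the fog range, so they would be
    constants in a real shader (assumption for this sketch).
    """
    scale = -1.0 / (fog_end - fog_start)
    bias = fog_end / (fog_end - fog_start)
    f = madd(depth, scale, bias)
    return min(1.0, max(0.0, f))


def apply_fog(color, fog_color, f):
    """Blend f * color + (1 - f) * fog_color using two madds per channel."""
    out = []
    for c, fc in zip(color, fog_color):
        t = madd(-f, fc, fc)       # (1 - f) * fog channel
        out.append(madd(f, c, t))  # f * colour channel + t
    return tuple(out)


if __name__ == "__main__":
    f = linear_fog_factor(50.0, 0.0, 100.0)
    print(apply_fog((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), f))
```

Under these assumptions the whole fog path is a handful of multiply-adds per fragment, which is the kind of work the general-purpose ALUs handle anyway.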
Old 30-May-2005, 20:09   #127
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297
Default

Quote:
Originally Posted by Xmas
That fog ALU is just a fixed point 4-component linear interpolation.
Yeah, Nvidia still provides a fixed-function 'fog unit' on NV40, to be used with integer render targets and/or with non-SM3.0 shaders.
Old 30-May-2005, 20:11   #128
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863
Default

Good info guys

Jawed
Old 30-May-2005, 22:11   #129
j^aws
Senior Member
 
Join Date: Jun 2004
Posts: 1,908
Default

Quote:
Originally Posted by DaveBaumann
Quote:
RSX ~ 136 Shop/cycle ~ 52 Vec4 + 52 Scalar + 32 Other units
Doubtful.
My first impressions too.

However I wanted to check a few things. Do you have official transistor counts for Xenos? IIRC, 232 + 100 mil was floating around?

If the 232 mil for the Xenos Shader module is correct and RSX has 300 mil, then it could be feasible?

The other question was that in the other thread, you seemed quite convinced RSX would have either 8 or 16 ROPs because of the 128 bit memory controller. Is that still a strong hunch? If so, 32 Pixel Pipes would *fit* those numbers?

Quote:
Originally Posted by nAo
RSX: 8 VS + 24 PS

1 VS = 1 vec4 + 1 scalar ops per cycle
1 PS = 1 vec4 + 1 vec4 (with co-issue 2 vec2) + 2 scalar ops per cycle (from RSX presentation diagram, there are 2 SFU units)

2 * 8 + (1 + 2 + 2) * 24 = 136
How many Dot/cycle do you count here?

I see 56 Dot/cycle which doesn't fit with the *required* 52 Dot/cycle I derived? Unless I'm missing something?
Old 30-May-2005, 22:26   #130
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863
Default

Quote:
Originally Posted by Jaws
The other question was that in the other thread, you seemed quite convinced RSX would have either 8 or 16 ROPs because of the 128 bit memory controller. Is that still a strong hunch? If so, 32 Pixel Pipes would *fit* those numbers?
There's no "fit" between pipelines and ROPs nowadays.

Just like Xenos seemingly only has 8 ROPs, but has "48 pixel pipelines".

http://www.beyond3d.com/forum/viewtopic.php?t=23450

Jawed
Old 30-May-2005, 23:20   #131
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297
Default

Quote:
Originally Posted by Jaws
I see 56 Dot/cycle which doesn't fit with the *required* 52 Dot/cycle I derived? Unless I'm missing something?
I see 56 Dot/cycle too.
I know it doesn't fit with the 51 GDot/s figure for full system performance, but even 52 dot products/cycle are too many:

If we assume that both ALUs in each RSX pixel pipeline can co-issue 2 instructions (3-1 or 2-2), as NV40 ALUs do, and that RSX has 8 VS and 20 PS, we have:
8*2 + 6*20 = 136 ops per clock cycle

CELL -> 1 Dot (PPE) + 7 Dot (SPE) = 8 Dot per clock cycle -> 25.6 GDot/s (at 3.2 GHz)
RSX -> 8 Dot (VS) + 40 Dot (PS) = 48 Dot per clock cycle -> 26.4 GDot/s (at 550 MHz)

Total: 52.0 GDot/s

I'm obviously having fun here, it's a divertissement so don't take this stuff too seriously.
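
The same back-of-the-envelope count can be sketched in Python. The 3.2 GHz and 550 MHz clocks are the figures used elsewhere in this thread, and the function names are made up for illustration. This is peak arithmetic only, not a performance model.

```python
CELL_CLOCK_GHZ = 3.2   # assumed Cell clock, as used in this thread
RSX_CLOCK_GHZ = 0.55   # assumed RSX clock (550 MHz), as used in this thread


def gdot_per_s(dots_per_cycle, clock_ghz):
    """Peak dot products per second, in billions."""
    return dots_per_cycle * clock_ghz


def rsx_shader_ops(vs_pipes, ps_pipes, ops_per_vs=2, ops_per_ps=6):
    """Shader ops per clock for a hypothetical VS/PS split."""
    return vs_pipes * ops_per_vs + ps_pipes * ops_per_ps


# 1 dot per cycle from the PPE plus 1 from each of 7 SPEs.
cell_gdot = gdot_per_s(1 + 7, CELL_CLOCK_GHZ)
# 8 VS dots plus 2 dots from each of 20 PS pipes.
rsx_gdot = gdot_per_s(8 + 2 * 20, RSX_CLOCK_GHZ)
total_gdot = cell_gdot + rsx_gdot

if __name__ == "__main__":
    print("shader ops/clock:", rsx_shader_ops(8, 20))
    print("Cell GDot/s:", round(cell_gdot, 1))
    print("RSX GDot/s:", round(rsx_gdot, 1))
    print("Total GDot/s:", round(total_gdot, 1))
```

With these assumptions the RSX side works out to about 26.4 GDot/s, so the sum lands at roughly 52 GDot/s.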
Old 30-May-2005, 23:26   #132
pc999
Senior Member
 
Join Date: Mar 2004
Location: Portugal
Posts: 3,528
Default

BTW, can you have some fun and tell us how many dots Xenos can do?
Old 30-May-2005, 23:31   #133
Panajev2001a
Senior Member
 
Join Date: Mar 2002
Posts: 3,187
Default

Quote:
Originally Posted by Jaws
Quote:
Originally Posted by DaveBaumann
Quote:
RSX ~ 136 Shop/cycle ~ 52 Vec4 + 52 Scalar + 32 Other units
Doubtful.
My first impressions too.

However I wanted to check a few things. Do you have official transistor counts for Xenos? IIRC, 232 + 100 mil was floating around?

If the 232 mil for the Xenos Shader module is correct and RSX has 300 mil, then it could be feasible?

The other question was that in the other thread, you seemed quite convinced RSX would have either 8 or 16 ROPs because of the 128 bit memory controller. Is that still a strong hunch? If so, 32 Pixel Pipes would *fit* those numbers?

Quote:
Originally Posted by nAo
RSX: 8 VS + 24 PS

1 VS = 1 vec4 + 1 scalar ops per cycle
1 PS = 1 vec4 + 1 vec4 (with co-issue 2 vec2) + 2 scalar ops per cycle (from RSX presentation diagram, there are 2 SFU units)

2 * 8 + (1 + 2 + 2) * 24 = 136
How many Dot/cycle do you count here?

I see 56 Dot/cycle which doesn't fit with the *required* 52 Dot/cycle I derived? Unless I'm missing something?
Let me run through the math, just for fun.

The slide said 51 Billion Dot Products/s.

I see 8 Dot4 from the VS ALU's and 48 Dot4 from the PS ALU's: this means 30.8 GDot4/s at 550 MHz.

The CPU can do, with the 7 SPE's, 22.4 GDot4/s at 3.2 GHz (4 Dot4's every 4 cycles on each SPE).

This would mean 53.2 GDot4's/s, which is a bit higher than the number they posted, and we have not yet taken into account the VMX unit of the PPE. That unit can provide an additional 3.2 GDot4's/s (same peak performance as each SPE), which would bring the total for the Broadband Engine to 25.6 GDot4/s.

The GPU should then only push, approximately, 25.4 GDot4/s (counting the PPE's VMX unit in the CPU's peak Dot4's/s) or 28.6 GDot4's/s (not counting it). At 550 MHz this means ~46-52 Dot4's/cycle, as Jaws said. No, buddy, you did not miss anything.

So, we have to map 52 Dot4's/cycle to a structure which at first look would provide 56 Dot4's/cycle or, in other words, map 28.6 GDot4's/s to an architecture which should push 30.8 GDot4's/s, going by what nAo posted, which I will re-quote here for the reader's viewing pleasure.

Quote:
Originally Posted by nAo
RSX: 8 VS + 24 PS

1 VS = 1 vec4 + 1 scalar ops per cycle
1 PS = 1 vec4 + 1 vec4 (with co-issue 2 vec2) + 2 scalar ops per cycle (from RSX presentation diagram, there are 2 SFU units)

2 * 8 + (1 + 2 + 2) * 24 = 136
From the PS ALU's we should count 52 Dot4's/cycle - 8 Dot4's/cycle (VS ALU's) = 44 Dot4's/cycle, but instead it seems that we should count 48 Dot4's/cycle.
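
As a rough check of the budget above, here is a small Python sketch. It takes the 51 GDot/s slide figure, the 3.2 GHz and 550 MHz clocks and the 8 VS pipes from this thread as given. The function name is hypothetical and the result is only peak arithmetic.

```python
TOTAL_GDOT = 51.0       # slide figure quoted in this thread
CELL_CLOCK_GHZ = 3.2    # assumed Cell clock
RSX_CLOCK_GHZ = 0.55    # assumed RSX clock (550 MHz)
VS_DOTS_PER_CYCLE = 8   # one Dot4 per VS pipe, 8 pipes assumed


def gpu_dots_per_cycle(cpu_units):
    """Dot4's per RSX cycle left over once the CPU's share is removed.

    cpu_units is the number of CPU units doing one Dot4 per cycle:
    7 for the SPEs alone, 8 if the PPE's VMX unit is counted too.
    """
    cpu_gdot = cpu_units * CELL_CLOCK_GHZ
    return (TOTAL_GDOT - cpu_gdot) / RSX_CLOCK_GHZ


with_vmx = gpu_dots_per_cycle(8)
without_vmx = gpu_dots_per_cycle(7)

if __name__ == "__main__":
    for label, dots in (("with VMX", with_vmx), ("without VMX", without_vmx)):
        ps_share = dots - VS_DOTS_PER_CYCLE
        print(label, round(dots, 1), "total,", round(ps_share, 1), "from PS")
```

Counting the VMX unit leaves roughly 46 Dot4's/cycle for the GPU, and leaving it out gives 52, of which 44 would have to come from the PS ALU's.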

Uhm... lots of thinking to be done.

The fun thing would be if Jen-Hsun made a typo there.
__________________
"Any idea worth a damn is already patented... twice" -Mfa
Old 30-May-2005, 23:39   #134
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297
Default

Quote:
Originally Posted by pc999
BTW, can you have some fun and tell us how many dots Xenos can do?
one dot per ALU -> 48 per cycle -> 24 GDot/s
Old 30-May-2005, 23:40   #135
Panajev2001a
Senior Member
 
Join Date: Mar 2002
Posts: 3,187
Default

Quote:
Originally Posted by nAo
Quote:
Originally Posted by Jaws
if we assume RSX pixel pipelines ALUs can both co-issue 2 instructions (3-1 or 2-2)
Edit: I guess you might want to go over this 3-1 or 2-2 business again. This is related to the shader ops count, right?

Each Vec4 ALU in the PS ALU complex would do 2 shader ops peak, and then you have 1 shader op from each of the two SFU's, for a total of 6 * pixel pipeline count per cycle.
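
For what it's worth, the two ways of counting discussed in this thread can be put side by side in a short Python sketch. The pipe counts (8 VS with either 24 or 20 PS) are the thread's guesses and the helper name is hypothetical.

```python
VS_OPS = 2  # 1 vec4 + 1 scalar op per vertex shader pipe


def total_shader_ops(vs_pipes, ps_pipes, ops_per_ps):
    """Peak shader ops per clock for a given VS/PS configuration."""
    return vs_pipes * VS_OPS + ps_pipes * ops_per_ps


# Both Vec4 ALUs co-issue 2 ops each, plus 1 op from each of the 2 SFUs.
both_alus_co_issue = 2 * 2 + 2   # 6 ops per pixel pipe
# Only one Vec4 ALU co-issues: 1 + 2 vector ops, plus the 2 SFU ops.
one_alu_co_issues = 1 + 2 + 2    # 5 ops per pixel pipe

if __name__ == "__main__":
    print(total_shader_ops(8, 20, both_alus_co_issue))
    print(total_shader_ops(8, 24, one_alu_co_issues))
```

Both configurations come out at 136 ops per clock, so the shader op figure alone cannot tell the 20-pipe and 24-pipe guesses apart.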
__________________
"Any idea worth a damn is already patented... twice" -Mfa
Old 30-May-2005, 23:42   #136
pc999
Senior Member
 
Join Date: Mar 2004
Location: Portugal
Posts: 3,528
Default

Quote:
Originally Posted by nAo
Quote:
Originally Posted by pc999
BTW, can you have some fun and tell us how many dots Xenos can do?
one dot per ALU -> 48 per cycle -> 24 GDot/s
Thank you very much.
Old 30-May-2005, 23:44   #137
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297
Default

Quote:
Originally Posted by Panajev2001a
Aren't both PS ALU's in each Pixel Pipeline capable of Vec4 operations?
Yes, 1 fmadd and 1 mul (on NV40).
Quote:
On NV40 one has to be used to help texture fetching, but when you can co-issue you should be able to do 2 Dot4's/cycle, right? You wrote "1 PS = 1 vec4 + 1 vec4" after all.
No, the second ALU can't do a dot4.
I assumed Nvidia 'extended' the second ALU on RSX to handle dot products too.
Old 30-May-2005, 23:56   #138
j^aws
Senior Member
 
Join Date: Jun 2004
Posts: 1,908
Default

Quote:
Originally Posted by Panajev2001a
...
The fun thing would be if Jen-Hsun made a typo there.
Maybe...that would mean by leaving out the VMX, he underestimated the power of PS3! :P

But I think counting 7 SPUs was *intentional* as a contributor of *shader* ops, because I speculated last year that SPUs may run Cg *shaders*. If they do, then by excluding the VMX unit, it's an accurate metric and a true reflection of its purpose!
Old 30-May-2005, 23:58   #139
Panajev2001a
Senior Member
 
Join Date: Mar 2002
Posts: 3,187
Default

Quote:
Originally Posted by nAo
Quote:
Originally Posted by Panajev2001a
Aren't both PS ALU's in each Pixel Pipeline capable of Vec4 operations?
Yes, 1 fmadd and 1 mul (on NV40).
Quote:
On NV40 one has to be used to help texture fetching, but when you can co-issue you should be able to do 2 Dot4's/cycle, right? You wrote "1 PS = 1 vec4 + 1 vec4" after all.
No, the second ALU can't do a dot4.
I assumed Nvidia 'extended' the second ALU on RSX to handle dot products too.
Ok, fair assumption.
__________________
"Any idea worth a damn is already patented... twice" -Mfa
Old 31-May-2005, 00:06   #140
Panajev2001a
Senior Member
 
Join Date: Mar 2002
Posts: 3,187
Default

Quote:
Originally Posted by Jaws
Quote:
Originally Posted by Panajev2001a
...
The fun thing would be if Jen-Hsun made a typo there.
Maybe...that would mean by leaving out the VMX, he underestimated the power of PS3! :P

But I think counting 7 SPUs was *intentional* as a contributor of *shader* ops, because I speculated last year that SPUs may run Cg *shaders*. If they do, then by excluding the VMX unit, it's an accurate metric and a true reflection of its purpose!
True.

I do not think you would be barred from running a Cg shader on the PPE IMHO though.
__________________
"Any idea worth a damn is already patented... twice" -Mfa
Old 31-May-2005, 00:08   #141
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297
Default

Quote:
Originally Posted by Panajev2001a
Ok, fair assumption.
Another assumption one can make, if we don't want to believe Nvidia extended its pixel pipeline design, is that they counted the 2 (independent) co-issued Dot2 ops the first ALU can execute and summed them with the Dot4s from the VS pipelines.
I don't even want to consider this option
Old 31-May-2005, 00:25   #142
Panajev2001a
Senior Member
 
Join Date: Mar 2002
Posts: 3,187
Default

Quote:
Originally Posted by nAo
Quote:
Originally Posted by Panajev2001a
Ok, fair assumption.
Another assumption one can make, if we don't want to believe Nvidia extended its pixel pipeline design, is that they counted the 2 (independent) co-issued Dot2 ops the first ALU can execute and summed them with the Dot4s from the VS pipelines.
I don't even want to consider this option
That option would still make 2 full Dot4's/cycle, because if they counted them all as just dot products, then how could we comparatively count the dot products coming from the Broadband Engine?
__________________
"Any idea worth a damn is already patented... twice" -Mfa
Old 31-May-2005, 00:26   #143
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297
Default

Quote:
Originally Posted by Panajev2001a
That option would still make 2 full Dot4's/cycle, because if they counted them all as just dot products, then how could we comparatively count the dot products coming from the Broadband Engine?
We can't, it doesn't make sense. That's why I reject this hypothesis.
Old 31-May-2005, 13:59   #144
Tacitblue
Member
 
Join Date: Apr 2005
Posts: 131
Default

http://www.extremetech.com/article2/...1817022,00.asp

Interview with one of the hardware guys on XBox 360.

On the memory bandwidth issue, my guess is they're using HyperTransport again: ~22GB/sec ties in well with the current stats on the HT website, and IBM is part of the HT consortium. So that seems to be one element carried over from the original box.
__________________
A Fanatic is a person who won't change his mind and can't change the subject.

Glory be to fanboys of consoles everywhere for giving me something to read and laugh at while i drink my morning coffee.
Old 31-May-2005, 23:09   #145
blakjedi
Senior Member
 
Join Date: Nov 2004
Location: Where U wish u were...
Posts: 1,926
Default

Where is everyone getting the Cell chip dot-product information from? I didn't think that the Cell had a dot product function. I'm confused.
__________________
"The bible is how god supposedly relays his message to the people. That means he wants people to understand wtf he is talking about. ." L233
*Justice --- When you get what you deserve
*Mercy ----- When you don't get what you deserve
*Grace ----- When you get what you don't deserve
Old 31-May-2005, 23:20   #146
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297
Default

Quote:
Originally Posted by blakjedi
Where is everyone getting the Cell chip dot-product information from? I didn't think that the Cell had a dot product function. I'm confused.
The SPEs and the PPE's VMX unit don't have a dot product instruction AFAIK, but four vec4 dot products can be calculated at the same time with 4 fmadd instructions, so the average throughput is one dot4 per clock cycle.
To be fair, things are more complex than that, as fmadd instructions on the SPEs have a 6-cycle latency AFAIK.
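
A minimal Python sketch of the idea, with plain lists standing in for 4-wide SIMD registers. The helper names are made up for illustration and this is not actual SPE or VMX code. The four input vector pairs are transposed so that each "register" holds the same component of four different vectors, and one fmadd per component accumulates all four dot products at once.

```python
def fmadd(a, b, c):
    """Lane-wise a * b + c over 4-wide 'registers' (sequences of 4 values)."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def dot4_x4(a_vecs, b_vecs):
    """Compute four vec4 dot products with four fmadds.

    a_vecs and b_vecs each hold four (x, y, z, w) vectors. They are
    transposed so each register holds one component of all four vectors.
    """
    a_soa = list(zip(*a_vecs))  # [all x's, all y's, all z's, all w's]
    b_soa = list(zip(*b_vecs))
    acc = [0.0, 0.0, 0.0, 0.0]
    for component in range(4):  # one fmadd each for x, y, z, w
        acc = fmadd(a_soa[component], b_soa[component], acc)
    return acc


if __name__ == "__main__":
    a = [(1, 2, 3, 4), (0, 1, 0, 1), (2, 2, 2, 2), (1, 0, 0, 0)]
    b = [(1, 1, 1, 1), (5, 6, 7, 8), (1, 2, 3, 4), (9, 9, 9, 9)]
    print(dot4_x4(a, b))
```

Four fmadds produce four results, hence one dot4 per instruction on average. If the 6-cycle latency figure is right, the fmadds within one batch depend on each other through the accumulator, so sustaining that rate would presumably require interleaving several independent batches.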
Old 31-May-2005, 23:29   #147
blakjedi
Senior Member
 
Join Date: Nov 2004
Location: Where U wish u were...
Posts: 1,926
Default

Quote:
Originally Posted by nAo
Quote:
Originally Posted by blakjedi
Where is everyone getting the Cell chip dot-product information from? I didn't think that the Cell had a dot product function. I'm confused.
The SPEs and the PPE's VMX unit don't have a dot product instruction AFAIK, but four vec4 dot products can be calculated at the same time with 4 fmadd instructions, so the average throughput is one dot4 per clock cycle.
To be fair, things are more complex than that, as fmadd instructions on the SPEs have a 6-cycle latency AFAIK.
So in other words it does have the equivalent of a dot product function, just with fairly high latency. OK, so when you say average throughput is one dot4 per clock cycle, are you talking per SPE or for the entire chip?
__________________
"The bible is how god supposedly relays his message to the people. That means he wants people to understand wtf he is talking about. ." L233
*Justice --- When you get what you deserve
*Mercy ----- When you don't get what you deserve
*Grace ----- When you get what you don't deserve
Old 31-May-2005, 23:38   #148
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297
Default

A dot4 per cycle per SPE and even PPE's VMX unit should provide one dot4 per cycle.
PS3 CPU would peak at 8 dot4 per cycle.
Old 31-May-2005, 23:40   #149
Panajev2001a
Senior Member
 
Join Date: Mar 2002
Posts: 3,187
Default

Quote:
Originally Posted by blakjedi
Quote:
Originally Posted by nAo
Quote:
Originally Posted by blakjedi
Where is everyone getting the Cell chip dot-product information from? I didn't think that the Cell had a dot product function. I'm confused.
The SPEs and the PPE's VMX unit don't have a dot product instruction AFAIK, but four vec4 dot products can be calculated at the same time with 4 fmadd instructions, so the average throughput is one dot4 per clock cycle.
To be fair, things are more complex than that, as fmadd instructions on the SPEs have a 6-cycle latency AFAIK.
So in other words it does have the equivalent of a dot product function, just with fairly high latency. OK, so when you say average throughput is one dot4 per clock cycle, are you talking per SPE or for the entire chip?
He is talking about each SPE.
__________________
"Any idea worth a damn is already patented... twice" -Mfa
Old 01-Jun-2005, 01:39   #150
pc999
Senior Member
 
Join Date: Mar 2004
Location: Portugal
Posts: 3,528
Default

See this topic. This has probably been discussed already, but take a look.

http://www.psinext.com/forums/viewtopic.php?t=6988


All times are GMT +1. The time now is 23:41.