AMD: R7xx Speculation

Status
Not open for further replies.
w/ 1,1GHz+ GDDR3 & 256bit SI RV770 would have only a bit over 70GB/s bandwidth which would be less than what RV670 has, so imho this is pure bs.
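For anyone wanting to check the arithmetic behind that figure, here is a minimal sketch. It assumes GDDR3/GDDR4 move two bits per pin per memory clock (double data rate) and takes the commonly quoted 1125MHz GDDR4 memory clock for HD 3870; the function name `peak_bandwidth_gbs` is made up for illustration.

```python
def peak_bandwidth_gbs(mem_clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Peak external memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = mem_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# Rumoured RV770 config from the post: 1.1GHz GDDR3 on a 256-bit interface.
rumoured_rv770 = peak_bandwidth_gbs(1100, 256)

# Assumption (not from the thread): HD 3870 runs GDDR4 at 1125MHz on 256 bits.
hd3870 = peak_bandwidth_gbs(1125, 256)

print(f"rumoured RV770: {rumoured_rv770:.1f} GB/s, HD 3870: {hd3870:.1f} GB/s")
```

Under those assumptions the rumoured configuration lands just below the HD 3870 figure, which is the point the post is making.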
 
How do you think the 2 SIMD RV610 handles the 3 types of shaders it has to run?
It handles them sequentially, from what I've gleaned.

Input/output queues for the win.
Sure, that works. It's not the only solution I can think of, though, or at least not single queues.


Which part of R600 are you thinking of here? Each part of the chip has queues.
Eventually the instruction has to leave the queue and the control signals are fed to each of the 16 elements in the SIMD.
That connection would be analogous to a single-instruction issue port.
Actually, it's pretty much the same thing as the issue ports to an x86 processor's SIMD units.

With time-slicing for general kernels I honestly don't see any relevance in the absolute number of SIMDs in a chip (ignoring their width).
As long as it's time-slicing, yes.
I guess my hangup is that it's not really concurrent execution when doing that.
Underutilization of parallel units of any sort during a given time-slice can't be opportunistically allocated to code waiting for its time slice.
 
w/ 1,1GHz+ GDDR3 & 256bit SI RV770 would have only a bit over 70GB/s bandwidth which would be less than what RV670 has, so imho this is pure bs.
I think it's possible.

If you compare 8800GTS-512 performance against HD3870 you can conclude that RV670 is wasting a lot of bandwidth. You could argue that RV670 is performing the same as if G92 had, say, 50GB/s available.

A persistent rumour is that RV770 is 50% faster than RV670. It might turn out that this is when MSAA is turned on and is due to having 4 Zs per clock instead of 2.

So, if RV770 with enhanced Z is capable of using bandwidth more effectively, then 50->70+ GB/s is much like the 50% performance gain that's rumoured.

This would then indicate that it has 16 TUs and then I wouldn't be at all surprised about 5:1 ALU:TEX. That would leave "R780" as a 800 SP, 32 TU, 32 ROP 2-chip board.
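The 5:1 figure can be reproduced with a quick sketch. It assumes each ALU is a 5-way VLIW unit (so 5 "SPs" per ALU), that the 2-chip board splits its units evenly, and that RV670 has 320 SPs and 16 TUs; the helper names are hypothetical.

```python
VLIW_WIDTH = 5  # assumption: one R600-family ALU counts as 5 "stream processors"


def per_chip(stream_processors, texture_units, chips):
    """Split board-level totals evenly across the chips on the board."""
    return stream_processors // chips, texture_units // chips


def alu_tex_ratio(stream_processors, texture_units):
    """VLIW ALUs per texture unit."""
    return (stream_processors / VLIW_WIDTH) / texture_units


# RV670 as shipped (assumed figures: 320 SPs, 16 TUs).
rv670_ratio = alu_tex_ratio(320, 16)

# The speculated "R780" board: 800 SPs and 32 TUs over 2 chips.
chip_sp, chip_tu = per_chip(800, 32, chips=2)
rv770_ratio = alu_tex_ratio(chip_sp, chip_tu)

print(f"RV670 {rv670_ratio:.0f}:1, speculated RV770 {rv770_ratio:.0f}:1 "
      f"({chip_sp} SPs, {chip_tu} TUs per chip)")
```

So a 16 TU chip with 400 SPs moves from RV670's 4:1 to 5:1, which is all the speculation above claims.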

Reminds me of when I decided that there was a distinct possibility that R600 would be only 16 TUs (instead of the much rumoured 32)...

Jawed
 
When I posted the speculation on our own website's forums earlier today, it was regarded as "BS" and "link bait" (you can read both here at B3D). :)

Well, never mind. Since I don't know for sure whether it's true, I'm not bothered if people prefer to stick to their own theories.
 
Eventually the instruction has to leave the queue and the control signals are fed to each of the 16 elements in the SIMD.
That connection would be analogous to a single-instruction issue port.
Actually, it's pretty much the same thing as the issue ports to an x86 processor's SIMD units.
Well, for what it's worth it might be best to think of an R600 clause as a macro op and the individual instructions as micro ops :D There can be as many as 128 micro ops, each of which is a 5-component VLIW.
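To make the macro op / micro op analogy concrete, here is a rough sketch that takes the figures in this post at face value (up to 128 VLIW instructions per clause, 5 slots each). The class names and the opcode strings are hypothetical and do not reflect the real R600 encoding.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_BUNDLES_PER_CLAUSE = 128  # figure quoted in the post
SLOTS_PER_BUNDLE = 5          # 5-component VLIW, per the post


@dataclass
class VliwBundle:
    """One "micro op": five slots, each holding an opcode or None (idle)."""
    slots: List[Optional[str]]

    def __post_init__(self):
        if len(self.slots) != SLOTS_PER_BUNDLE:
            raise ValueError(f"a bundle needs exactly {SLOTS_PER_BUNDLE} slots")


@dataclass
class Clause:
    """One "macro op": an ordered run of bundles issued as a unit."""
    bundles: List[VliwBundle] = field(default_factory=list)

    def append(self, bundle: VliwBundle) -> None:
        if len(self.bundles) >= MAX_BUNDLES_PER_CLAUSE:
            raise ValueError("clause is full")
        self.bundles.append(bundle)

    def scalar_op_count(self) -> int:
        """Number of occupied slots across all bundles in the clause."""
        return sum(1 for b in self.bundles for s in b.slots if s is not None)


clause = Clause()
clause.append(VliwBundle(["MUL", "MUL", "MUL", None, None]))  # hypothetical opcodes
clause.append(VliwBundle(["ADD", None, None, None, "SIN"]))
print(len(clause.bundles), "bundles,", clause.scalar_op_count(), "scalar ops")
```

The idle `None` slots are where VLIW packing efficiency is lost, independent of how many clauses are queued.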

As far as the TU is concerned, I don't know how many "micro ops" is the limit...

As long as it's time-slicing, yes.
I guess my hangup is that it's not really concurrent execution when doing that.
Underutilization of parallel units of any sort during a given time-slice can't be opportunistically allocated to code waiting for its time slice.
Are you talking about the VLIW instruction or something else?

Jawed
 
Well, for what it's worth it might be best to think of an R600 clause as a macro op and the individual instructions as micro ops :D There can be as many as 128 micro ops, each of which is a 5-component VLIW.
AMD's CPU macro ops actually might map most closely to the individual instructions in a clause.
In the Athlon derivatives, each mem-reg operation translates to an internal macro op that activates both an ALU and an AGU, a narrow form of what R600 does.
Intel, or at least before it introduced micro op fusion, would have created two independent micro ops.

An AMD designer at one point in a presentation even made a side comment that at least he considered AMD's CPUs to be internally VLIW, though he admitted others would probably not agree with him.

Are you talking about the VLIW instruction or something else?

Jawed
Just portions of the chip in general.
We can write code that exercises only certain facets of the chip and leaves others idle.
Time slicing, or at least the kind I'm thinking of, could leave portions of the chip uncalled until another slice is active again.
 
w/ 1,1GHz+ GDDR3 & 256bit SI RV770 would have only a bit over 70GB/s bandwidth which would be less than what RV670 has, so imho this is pure bs.

A long time ago I heard rumors that RV770 would use 512-bit memory; that may be false. Otherwise there could be two versions of RV770: 256-bit memory with 1200MHz GDDR4, and 512-bit memory with 900MHz GDDR3.

-And-

And R700 may have 2X RV770 + 2X 256bit ~2100MHz GDDR5 memory.
 
A long time ago I heard rumors that RV770 would use 512-bit memory; that may be false. Otherwise there could be two versions of RV770: 256-bit memory with 1200MHz GDDR4, and 512-bit memory with 900MHz GDDR3.
Or it's the well-known game of FUD: feed someone a snippet about some magical "internal ringbus" with 512 bits and he'll keep relaying it, not knowing that you're not counting external bits.

HD2900's ringbus is also supposed to be 1024 bits wide. :)
 

Supposed to? It actually is, internally.
 
Proves my point - thanks!

['supposed to' in the sense that most of the time when you talk about memory interfaces, you're characterizing their external bit width]

That's if you're not in marketing:p:)
 

If you're to calculate the maximum bandwidth of a GPU, internal bandwidth means jackshit, and no, marketing departments haven't yet gone so far as to calculate bandwidth based on it.

As for RV770 my gut feeling tells me that it's more likely it'll come with GDDR4 than GDDR3, yet still such a detail is not that important to me as long as the chip gets supplied with sufficient bandwidth.

If it truly can now do 4z/clock the first thing I'd like to see is its 8x MSAA performance.

Equally interesting would be whether resolve still takes place in the ALUs or has "bounced back" to hardware; in the latter case I'd really like to hear what motivated that change, since we've heard time and time again that shader resolve isn't a problem after all.
 

Gah, you're all nazis. :D Marketing doesn't have to calculate bandwidth in order to strategically place 512/1024 bits here and there and, voila, create noise about it. I wasn't arguing that it's relevant or irrelevant... in fact, I wasn't arguing at all. And this is O/T.

I'd be surprised if resolve went back to dedicated HW.
 
Gah, you're all nazis. :D Marketing doesn't have to calculate bandwidth in order to strategically place 512/1024 bits here and there and, voila, create noise about it. I wasn't arguing that it's relevant or irrelevant... in fact, I wasn't arguing at all. And this is O/T.

No, we aren't bit nazis LOL. Either way, since the only GPU with 512/1024 bits up to now was R600, if any "noise" was ever generated about it, it wasn't good noise at the end of the day. Hopefully by now some folks have learned that bus width or fancy theoretical numbers on paper do not define anything.

I'd be surprised if resolve went back to dedicated HW.

I'd be very surprised too; I was merely thinking out loud because there were and still are rumours about it.
 
For some reason, this part of the thread reminds me of the naming of the good old "GeForce 256". If I remember correctly, the 256 came from some bizarre addition of numbers from different parts of the architecture - PR gone mad!

Nice in a way to see that some things don't really change... :smile:
 
No, because each of the 4 SIMDs in R600 is independent of the others. The 64 pixels in a batch (where 63 don't want to run some instructions) all execute over a period of 4 consecutive clocks in just one of the SIMDs. The other SIMDs can be doing VS, GS or PS work.
Jawed

OK, I get it. A single batch is executed by a single SIMD in 64/16 = 4 consecutive clock cycles.
Why doesn't R600 have a branch granularity of 16, so that it can complete one batch in a single clock (and so perform better in dynamic branching)?
What, in your opinion, are the difficulties in implementing a finer branch granularity in R600?
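To put rough numbers on the question, here is a toy model. It assumes a 16-wide SIMD, that a batch executes a branch for all of its pixels if any one pixel needs it, and it ignores the hardware cost of tracking more, smaller batches; the function names are hypothetical.

```python
import math


def clocks_per_batch(batch_size, simd_width=16):
    """Clocks a SIMD needs to issue one instruction for a whole batch."""
    return math.ceil(batch_size / simd_width)


def wasted_fraction(needs_branch, batch_size):
    """Fraction of pixels that execute the branch without needing it.

    needs_branch: list of bools, one per pixel, in batch order.
    """
    wasted = 0
    for start in range(0, len(needs_branch), batch_size):
        batch = needs_branch[start:start + batch_size]
        if any(batch):
            # Every pixel in the batch pays, including those that did not ask.
            wasted += len(batch) - sum(batch)
    return wasted / len(needs_branch)


# 256 pixels, only the very first one needs the expensive branch.
pixels = [True] + [False] * 255

for size in (64, 16):
    print(f"batch of {size}: {clocks_per_batch(size)} clock(s) per instruction, "
          f"{wasted_fraction(pixels, size):.1%} of pixels do wasted work")
```

In this toy case a batch of 64 drags 63 uninterested pixels through the branch, while a batch of 16 drags 15, so the finer granularity wastes less work; what the model leaves out is exactly what the question asks about, the cost of implementing it.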
 
I am really wondering if AMD was so impossibly stupid as to go with 16 TMUs again (while Nvidia's next part is rumored at 80, btw, for perspective).

Otherwise I don't see the basis for Fudzilla claiming that R700 (supposedly two RV770 on one board) is 50% faster than RV670.

If RV770 is a 32 TMU/480 SP/900MHz part, I don't see how it can't exceed RV670 by at least 50-100% by itself, let alone in multi-GPU, since it does so in all relevant specs.

But a 16 TMU/480 SP part might actually explain it. Such a part would just see a small boost based on clock and more shader strength, maybe fixed ROPs, maybe 10-30%; maybe then two on RV770 get you to 50%.

If true, I am just going to laugh, and AMD should rightfully be laughed the rest of the way out of the business and we can stop with this pretending that they wish to compete.


Edit: The above speculation would also fit in with the rumored small die size of RV770, as well as the idea that AMD is abandoning single high-end chips in favor of multi-GPU configurations at the high end (perhaps AMD figures a small, cleaned-up, shader-beefed-up, faster-clocked, 16 TMU chip is just fine, since its principal high-end use would be as a building block for multi-GPU anyway), which of course would be a horrible road considering the drawbacks those suffer. But I wouldn't be surprised, because it's AMD...

Ah well, it's too early to speculate I suppose, just a frightening thought.
 
I think it's possible.

If you compare 8800GTS-512 performance against HD3870 you can conclude that RV670 is wasting a lot of bandwidth. You could argue that RV670 is performing the same as if G92 had, say, 50GB/s available.

A persistent rumour is that RV770 is 50% faster than RV670. It might turn out that this is when MSAA is turned on and is due to having 4 Zs per clock instead of 2.

So, if RV770 with enhanced Z is capable of using bandwidth more effectively, then 50->70+ GB/s is much like the 50% performance gain that's rumoured.

This would then indicate that it has 16 TUs and then I wouldn't be at all surprised about 5:1 ALU:TEX. That would leave "R780" as a 800 SP, 32 TU, 32 ROP 2-chip board.

Reminds me of when I decided that there was a distinct possibility that R600 would be only 16 TUs (instead of the much rumoured 32)...

Jawed

IMHO this does not fit well with the rumoured transistor count and die size of RV770 (830 million, 230-240 mm^2). In that case we would be spending more than 150 million transistors just to get 4 Z samples per clock in the ROPs and 16 more 5-way ALUs. OK, there could be other architectural improvements (more cache and so on), but IMO it makes more sense from a performance perspective to have more texture power along with more shader power. 32 TUs seems a bit too high, however, so IMHO we could see a 4-SIMD GPU with 20 5-way SPs per SIMD, and 20 TUs. This IMO makes a bit more sense: there is a slight dynamic branching penalty compared to RV670, but a definite improvement in all other scenarios.
What do you think about this?
 