View Full Version : The ATI R600 Rumours & Speculation Centrum
Demirug
06-Dec-2006, 07:17
That's pretty much what happens when you run more than one 3D app at a time now, isn't it? I'm thinking multiple APIs would be the same as multiple windows with the same API. All the API calls get translated to native GPU instructions, and then the GPU runs with it, in order.
But that could be an interesting future point of differentiation among (b/w, really) the IHVs. Of course, the easiest solution would be to have as many GPUs in the system as simultaneous GPU apps you want to run, and attach a GPU to each app. I guess this is one case where mentioning "dual core" GPUs (in the sense of the 7950GX2 or the dual-RV560 X1650XT) won't be universally scoffed at. ;)
Actually, with unified GPUs, you might be able to just limit the different apps to their own "cores" (shader units, texture units, maybe even ROPs in ATI's case). This'd be like multithreading, vs. the hyperthreading that I believe happens in current GPUs.
Stream of consciousness rambling ends ... now.
The Vista driver model has some kind of task switcher that manages the sharing of GPU processing time between multiple apps. As this is part of the kernel graphics subsystem, it is API neutral. The different APIs differ only in user space. Therefore it would be easy to add another graphics API that plays nicely with OpenGL and Direct3D.
Future Windows driver models will support multitasking as part of the GPU itself. To be precise, the driver interface already defines this, but it is not required for WDDM 1.0 hardware.
Rocco Siffredi
06-Dec-2006, 09:56
The Vista driver model has some kind of task switcher that manages the sharing of GPU processing time between multiple apps. As this is part of the kernel graphics subsystem, it is API neutral. The different APIs differ only in user space. Therefore it would be easy to add another graphics API that plays nicely with OpenGL and Direct3D.
Future Windows driver models will support multitasking as part of the GPU itself. To be precise, the driver interface already defines this, but it is not required for WDDM 1.0 hardware.
wait till revision 2.0 for something serious.
nicolasb
07-Dec-2006, 10:27
Hasn't anyone got any more R600 rumours? Should we maybe try starting some? :cool:
There is a rumour! The PCB will be green not red!
I heard it has a secret new innovation called Bennyvision.
It's a corollary of the physics capabilities of the new chip & will be enabled by a 'Yakety' patch similar to the famous 'Chuck' patch for R5x0 chips.
It's based on a special CTM variation tailored to helping female characters achieve that special bounce.
I understand that there will be built-in profiles to upgrade many popular existing games, but it's in the future games tailored for the tech that the true power of R600 will leap to the forefront.
Those who thought the G80 Froggie/Adrienne demos were pretty cool will be blown away by the new Ruby demo!
zgemboandislic
07-Dec-2006, 11:28
I heard it has a secret new innovation called Bennyvision.
It's a corollary of the physics capabilities of the new chip & will be enabled by a 'Yakety' patch similar to the famous 'Chuck' patch for R5x0 chips.
It's based on a special CTM variation tailored to helping female characters achieve that special bounce.
I understand that there will be built-in profiles to upgrade many popular existing games, but it's in the future games tailored for the tech that the true power of R600 will leap to the forefront.
Those who thought the G80 Froggie/Adrienne demos were pretty cool will be blown away by the new Ruby demo!
Are you talking out of your arrrse, or are you serious about this?
Are you talking out of your arrrse, or are you serious about this?
Pre-coffee? :lol:
Well, here's one ATi related question/conversation we could have... anyone know when ATi is going to release updated Vista drivers? Considering the jumps in performance VDDM and GPU drivers on Vista were supposed to give with regards to CPU utilization, I'm really saddened by the fact that World of Warcraft runs at about 2/3 of the speed it does on XP.
trinibwoy
07-Dec-2006, 19:15
Well those gains are assuming that all other things are equal between XP and Vista drivers. I think it's way too early to expect that.
Well, here's one ATi related question/conversation we could have... anyone know when ATi is going to release updated Vista drivers? Considering the jumps in performance VDDM and GPU drivers on Vista were supposed to give with regards to CPU utilization, I'm really saddened by the fact that World of Warcraft runs at about 2/3 of the speed it does on XP.
Strange, I'm running WoW in windowed mode, with Aero on, and see no performance losses compared to XP
(X1800 XL 256MB, 1280x900 6xAA + Performance AAA, 16x HQ AF, maxed WoW settings)
Considering the jumps in performance VDDM and GPU drivers on Vista were supposed to give with regards to CPU utilization
Those jumps were for DX9 as well as DX10?
Those jumps were for DX9 as well as DX10?
Well, the way I saw it is that since D3D9 works through the VDDM, it'll gain the jumps from that as well. In fact, alt+tabbing (i.e. scenarios in which the device is lost) doesn't seem to exhibit the same characteristics that it does in XP. Usually WoW has to do a lot of HD crunching to reload the resources when tabbing back in, but that doesn't happen anymore; this is one of the benefits that VDDM, I believe, was going to give, since events like lost devices don't happen when alt+tabbing. Or something.
Well those gains are assuming that all other things are equal between XP and Vista drivers. I think it's way too early to expect that.
And I think it's way too early to expect 512-bit memory buses but you guys are still discussing the R600 having that :-P
Anarchist4000
07-Dec-2006, 23:01
And I think it's way too early to expect 512-bit memory buses but you guys are still discussing the R600 having that
If those pictures of R600 are true, what else would they need all of those pins for? 512-bit would be the logical guess, unless each "processor" on the chip has its own separate memory bus or something along those lines.
Pin amounts really don't show us anything about what the bus size is.
Pin amounts really don't show us anything about what the bus size is.
Well it does show something and the bus size is currently the best guess.
Pin amounts really don't show us anything about what the bus size is.
Basing an argument on Geforce FX really is asking for it. If ever there was a GPU that proved the rule, that's it.
Jawed
Basing an argument on Geforce FX really is asking for it. If ever there was a GPU that proved the rule, that's it.
Jawed
How does that prove it then, Jawed? There is a 100 pin difference between the 5900 and 5800, and not much difference other than 2x the TMUs and a 256 vs 128 bit bus. Would you care to elaborate on why there wasn't a substantial increase in pin count when there was such a huge change in bus size?
By the way I could base it on many chips not just the fx......
Because FX (NV30) barely made it out the door before being thrown in the bin - in engineering terms they were prolly fighting noise (lots of grounding pins) at the high clocks they were breaking ground on and consequently power too (lots of power pins).
The pin count on NV30 makes it look like it was strapped together with sticking plasters and has absolutely nothing to do with bus configuration.
Jawed
5900s had the same number of TMUs as 5800s. With regards to comparing additional pins with 128 vs 256 and 256 vs 512, is it also worth bearing in mind that different memory technologies might require more/fewer pins for signalling and/or grounding? I'm only asking because I haven't bothered to check what the pin counts of like-for-like DDR, GDDR2 and GDDR3 chips are.
Because FX (NV30) barely made it out the door before being thrown in the bin - in engineering terms they were prolly fighting noise (lots of grounding pins) at the high clocks they were breaking ground on and consequently power too (lots of power pins).
The pin count on NV30 makes it look like it was strapped together with sticking plasters and has absolutely nothing to do with bus configuration.
Jawed
why do you say that, I don't see how you can make a comment on that by looking at the pin arrangement.
That is true Neeyik; GDDR3 vs. GDDR2, I'm not sure of the difference in pins for this.
why do you say that, I don't see how you can make a comment on that by looking at the pin arrangement.
I'm not looking at the pin arrangement on either.
I'm stating that there were significant engineering problems (clock speeds, heat, power) and that it's likely they attacked those problems with extra pins to control noise and get power where it's needed cleanly. I don't know this is the case, it's a hypothesis.
Regardless, my point still stands: you can't use an engineering abortion as the basis for an argument over pin counts. If you want to make your argument more credible find something else to base it upon, that's all.
I'm still not 100% on the 512-bit bus, but I haven't got any ammunition for a contra argument... If someone can come up with a theory for there being other chips on the R600 board, apart from R600 and the GDDR4 memory, using a hefty interface then I'm all for it :razz:
Then there's always the argument that all those pins are there on R600 for much the same reason as I'm asserting for NV30: to control electrical noise and deliver power :oops: We don't know it's not an abortion, after all...
Jawed
That was what I was kinda getting at, but I don't think it will be anything like the nv30 :), come on, that chip had major issues.
Just another example: the s25 delta chrome series from VIA had 950ish pins and it has a 64 bit bus.
Then if we look at the 6600, that has 1000 pins and a 128 bit bus; both chips are capable of using the same memory types.
Then you have the r520, which also has 1000ish pins with a 256 bit bus, same memory types again.
Kinda really makes ya wonder...
Another thing we can look at is the cross sectional area of the pins, that might also give us an idea of the power needed.
Actually counting the traces would give us a better idea of the bus than the actual pins.
Agreed, I'm not discounting the 512 bit bus, but there doesn't seem to be any guidance from other chips, especially if they are from different generations.
silent_guy
08-Dec-2006, 04:38
Actually counting the traces would give us a better idea of the bus than the actual pins.
Brillant idea! Why didn't we think of this before? :idea:
But don't we all already know it? Those extra pins are for the external ring bus!
(Yeah, I'm bored too...:???: )
Anarchist4000
08-Dec-2006, 06:57
Ok, I'm apologizing now for the horrible math I'm about to put into this thread. The numbers are rough estimates but still useful. I'm just counting the MADDs and not the ADDs for the ALU counts. The idea is to knock off all the pins that are likely tied to memory and then figure out how many pins/ALU they were using. A lot of assumptions are involved here, but I find the results interesting. The big assumption is that ATI won't drastically change the number of pins used for V/GND of all the ALUs. The more pins they actually used for the ALUs, the more accurate the results should be. I'm also not worried about how they go about grouping the actual ALUs. Sorry about the messy chart.
GDDR4 ~70 data pins per chip(136 total including V/GND)
R580 ~1261 pins
R580 ~8VS and 48PS(56 total)
R600 ~2140 pins
        Pins  Chips  Pin/Chip  Memory Pins  Core Pins  ALUs  Pins/ALU
R580    1260      8        70          560        700    56     12.50
256bit Bus
R600    2140      8        70          560       1580    64     24.69
R600    2140      8        70          560       1580    80     19.75
R600    2140      8        70          560       1580    96     16.46
R600    2140      8        70          560       1580   128     12.34
384bit Bus
R600    2140     12        70          840       1300    64     20.31
R600    2140     12        70          840       1300    80     16.25
R600    2140     12        70          840       1300    96     13.54
R600    2140     12        70          840       1300   128     10.16
512bit Bus
R600    2140     16        70         1120       1020    64     15.94
R600    2140     16        70         1120       1020    80     12.75
R600    2140     16        70         1120       1020    96     10.63
R600    2140     16        70         1120       1020   128      7.97
From those numbers, the two results close to R580 are:
96ALUs w/ 384bit bus
80ALUs w/ 512bit bus
None of the 256bit results appear to make any sense. The pattern of ATI doubling bus width every so many generations could lean towards 512 bit. That would leave it with a good deal of bandwidth and it would be a processing powerhouse with 80 ALUs.
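For anyone who wants to re-run the estimate with different guesses, here is a minimal Python sketch of the same arithmetic. Every input is the rough figure from the post above, not a confirmed spec, and the 32 bits per memory chip is only what the chip counts in the chart imply (8 chips for 256-bit). The function name is made up for illustration.

```python
# Rough reproduction of the pins-per-ALU estimate from the chart above.
# All inputs are the poster's guesses, not confirmed specifications.
PINS_PER_MEMORY_CHIP = 70   # ~70 GPU pins per memory chip, as assumed in the post
BITS_PER_MEMORY_CHIP = 32   # implied by the chart: 256-bit bus <-> 8 chips


def pins_per_alu(total_pins, bus_bits, alus):
    """Subtract the pins assumed to go to memory, divide the rest by ALU count."""
    chips = bus_bits // BITS_PER_MEMORY_CHIP
    memory_pins = chips * PINS_PER_MEMORY_CHIP
    core_pins = total_pins - memory_pins
    return core_pins / alus


r580 = pins_per_alu(1260, 256, 56)
print(f"R580 baseline: {r580:.2f} pins/ALU")

for bus_bits in (256, 384, 512):
    for alus in (64, 80, 96, 128):
        ratio = pins_per_alu(2140, bus_bits, alus)
        print(f"R600 {bus_bits}-bit, {alus:3d} ALUs: {ratio:6.2f} pins/ALU "
              f"({ratio - r580:+.2f} vs R580)")
```

Changing the two constants or the 2140-pin guess shows how sensitive the conclusion is to each assumption.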
Well, that's all fine, but what if ATi, trying to get higher clock speeds, increased the number of ground pins?
Brillant idea! Why didn't we think of this before? :idea:
But don't we all already know it? Those extra pins are for the external ring bus!
(Yeah, I'm bored too...:???: )
Heh, all we need is the kind of RAM and the amount of it; that would be so much easier than all this :)
Rangers
08-Dec-2006, 12:57
512 bus might be a powerful marketing term alone. It just sounds so much more pure than 384 bit.
Idea being knock off all the pins that are likely tied to memory and then figure out how many pins/ALU they were using.
I think R520 and R580 are pin-compatible. R580 has 3x the count of fragment shader ALUs that R520 has.
So, I don't think there's much mileage in an analysis of pins per ALU.
Jawed
512 bus might be a powerful marketing term alone. It just sounds so much more pure than 384 bit.
Nvidia was in trouble last time they were short 128-bits.
Contact pads conduct heat as well. And they're pretty cheap.
Spending an hour in ACAD can make wonders :lol: :
http://img120.imageshack.us/img120/8754/r600r5x0pinlayoutwb4.th.png (http://img120.imageshack.us/my.php?image=r600r5x0pinlayoutwb4.png)
As a scale base, I used the centric square of pads, which I believe are used for power or ground conducting, not sure though, but definitely the pad-pattern for R600 is denser (excluding the central square).
Nvidia was in trouble last time they were short 128-bits.
Really wasn't the bus size that got them in trouble, the fx had a whole lot of other issues. The fx5900 even with the 256 bit bus couldn't do much against a 9800. But take a 6600 with a 128 bit bus it does just fine against a 9800 with a 256 bit bus :wink:
Chalnoth
08-Dec-2006, 22:05
Really wasn't the bus size that got them in trouble, the fx had a whole lot of other issues. The fx5900 even with the 256 bit bus couldn't do much against a 9800. But take a 6600 with a 128 bit bus it does just fine against a 9800 with a 256 bit bus :wink:
Well, what the comparison of the GeForce FX 5800 Ultra and 6600 GT shows us is that making (or failing to make) the right design decisions can make or break an architecture. The FX 5800 Ultra and the 6600 GT have extremely similar specs: the same number of texture units, the same memory bandwidth, the same clock speeds, similar die footprint (when adjusted for the different process), to name a few. But the difference in performance between the two architectures could not be more striking.
The reason? nVidia made a large number of very poor design decisions with the FX 5800. Chief among them was not designing the shader compiler until the very end of the development process, as it wasn't until they started making the compiler that they discovered just how hard it was to write one for the architecture!
Today the GeForce 8800 really seems like it's built on an amazing architecture, with a large number of excellent design decisions. But it isn't until we see ATI's large die solution that we will have a good point of comparison. I don't think for a second that it's going to be things like the width of the memory bus that will make or break either architecture. Rather, it's going to be the fundamental design choices.
Very true, I don't see the r600 as having flaws either, looking at Xenos and the r5x0 series, so it's going to come down to who made the better choices overall. Both architectures will have their strengths, just as in the previous 2 gens.
Ailuros
09-Dec-2006, 06:40
Vendors analyze their architectures and define accordingly how much bandwidth each of them needs. In a worst case scenario a GPU could end up with less bandwidth than it would actually need, but rarely or actually never more.
Design flaws are a totally different story and are not necessarily connected to buswidths or available bandwidth. Trust me if we'd have a high end TBDR today in the makes any notion of the kind "yeah but it has only a 256bit bus" would be utter nonsense.
Right now the G80 doesn't show me any signs of being bandwidth starved from the few modest tests I could run. For a ~11% bandwidth increase, a <4% performance increase with 4xAA in 2048*1536. Yes I could increase sample density, but from a point and on it becomes somewhat moot, because I never disregard playability. And yes I have severe doubts that a G80 would be playable with 8xMSAA in that resolution with a 1GB framebuffer and =/>100GB/sec bandwidth.
IF R600 will be able to achieve the latter, then it's most definitely NOT because of a larger framebuffer and higher bandwidth ALONE. And that's a reality I wish more than just a few would realise. An architecture is a PACKAGE and should be analyzed only as such.
And yes I have severe doubts that a G80 would be playable with 8xMSAA in that resolution with a 1GB framebuffer and =/>100GB/sec bandwidth.
Unfortunate. Perhaps the G81 refresh?
EasyRaider
09-Dec-2006, 09:04
Right now the G80 doesn't show me any signs of being bandwidth starved from the few modest tests I could run. For a ~11% bandwidth increase, a <4% performance increase with 4xAA in 2048*1536. Yes I could increase sample density, but from a point and on it becomes somewhat moot, because I never disregard playability. And yes I have severe doubts that a G80 would be playable with 8xMSAA in that resolution with a 1GB framebuffer and =/>100GB/sec bandwidth.
2048*1536 with 4xAA? That doesn't say much. The relative need for bandwidth shouldn't change notably with resolution and AA samples. Which app(s)? What about HDR?
However, I generally agree with you. I don't think high speed GDDR4 or 512-bit external bus would help G80 that much without a similar increase in core speed.
Ailuros
09-Dec-2006, 10:49
2048*1536 with 4xAA? That doesn't say much. The relative need for bandwidth shouldn't change notably with resolution and AA samples. Which app(s)? What about HDR?
FEAR which is already a pretty heavy stenciling scenario. The trend doesn't change much in Oblivion with HDR either.
**edit: just for accuracy's sake:
http://users.otenet.gr/~ailuros/AAPerf.pdf
How much could one improve on 25 fps in order to hit playable rates? A frequency increase wouldn't help much either as you suggest below.
However, I generally agree with you. I don't think high speed GDDR4 or 512-bit external bus would help G80 that much without a similar increase in core speed.
I can't run any worthwhile tests with higher clock frequencies since at 630MHz it already locks up.
Ailuros
09-Dec-2006, 10:50
Unfortunate. Perhaps the G81 refresh?
I've got a better question: do you really need more than 4 samples in 2048 on a 21" CRT? Note that the mask beyond a 1200 height is being overridden already.
Stormlord
09-Dec-2006, 11:56
I'll add more fire to the speculation...
How about if ATI added Hypertransport (apart from the PCIe interface, or use a hypertransport to PCIe bridge) to the core...
It could be used as a fast interface for linking multiple R600 GPU or other chipset and other hardware.
It could explain extra pincount...
Also: it would completely fit in AMDs strategy
Note that ATI & AMD were already Hypertransport consortium members before (AMD being a founding member)...
http://www.hypertransport.org/consortium/cons_members.cfm
Kind Regards,
David
Note that ATI & AMD were already Hypertransport consortium members before (AMD being a founding member)...
http://www.hypertransport.org/consortium/cons_members.cfm
So is Nvidia.
I've got a better question: do you really need more than 4 samples in 2048 on a 21" CRT?
CRT? Why bother with AA at all? Just fiddle with the trim pot on the flyback transformer a bit...
There's no doubt of diminishing returns, even on discrete pixel displays, but 4xMSAA is hardly the panacea circa ~2007.
Direct3D 9.0 Ex (the L version name is no longer used) offers some additional features of the new Windows Vista driver model for Direct3D 9.
May you elaborate on this?
Maybe people should spend more time thinking about floating point render targets with AA:
Method and apparatus for anti-aliasing using floating point subpixel color values and compression of same (http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220060188163%22.PGNR.&OS=DN/20060188163&RS=DN/20060188163)
and how much bandwidth that consumes.
Jawed
mrcorbo
09-Dec-2006, 18:35
I'll add more fire to the speculation...
How about if ATI added Hypertransport (apart from the PCIe interface, or use a hypertransport to PCIe bridge) to the core...
It could be used as a fast interface for linking multiple R600 GPU or other chipset and other hardware.
It could explain extra pincount...
Also: it would completely fit in AMDs strategy
Note that ATI & AMD were already Hypertransport consortium members before (AMD being a founding member)...
http://www.hypertransport.org/consortium/cons_members.cfm
Kind Regards,
David
This occurred to me, as well. Though not for this generation. I was thinking more towards R700.
Can anyone speculate on whether AMD's Direct Connect architecture could be adapted to power CrossFire in the future?
Maybe people should spend more time thinking about floating point render targets with AA:
Method and apparatus for anti-aliasing using floating point subpixel color values and compression of same (http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220060188163%22.PGNR.&OS=DN/20060188163&RS=DN/20060188163)
and how much bandwidth that consumes.
Jawed
The compression schemes of the present invention take advantage of that fact that in multi-sampled anti-aliasing applications, samples, not pixels, tend to have the same color values. Other prior art types of color compression are geared toward taking advantage of pixel to pixel correlation. In contrast, in multi-sampled anti-aliasing applications, the dominant pattern is a stubble pattern where adjacent pixels are usually not of the same color. Instead, the present invention takes advantage of the sample to sample correlation that exists within the pixels.
Ok, so if they have much better compression of data for AA remind me why they need a 512-bit bus.
Maybe people should spend more time thinking about floating point render targets with AA:
Method and apparatus for anti-aliasing using floating point subpixel color values and compression of same (http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220060188163%22.PGNR.&OS=DN/20060188163&RS=DN/20060188163)
and how much bandwidth that consumes.
Jawed
Yeah know floating point texture compression reminded me of a paper that I read about doing audio with a GPU, wasn't that a rumor before for the r600?
Edit:
here it is
http://www-sop.inria.fr/reves/Nicolas.Tsingos/publis/posterfinal.pdf
Ok, so if they have much better compression of data for AA remind me why they need a 512-bit bus.
So they can take over the world! :shock:
Ailuros
09-Dec-2006, 21:49
Maybe people should spend more time thinking about floating point render targets with AA:
Method and apparatus for anti-aliasing using floating point subpixel color values and compression of same (http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220060188163%22.PGNR.&OS=DN/20060188163&RS=DN/20060188163)
and how much bandwidth that consumes.
Jawed
Some people do think about these details and that's why some of them happen to be TBDR advocates LOL :D
Somehow, somewhere, in a deep dark dungeon, Fuad Abazovic the fearless misguided GPU Guru surfs the net and stumbles upon the holy grail of all knowledge: Ail's post above. Using his Fud sense, he interprets the mysterious and seemingly divine inspired text. 5 minutes and a string bikini later, he starts writing his next headline: R600 IS A TBDR! (as in Truly Bad Ass Redhead, for our not so technical readers...or something like that, I think)
Reputator
10-Dec-2006, 05:32
Somehow, somewhere, in a deep dark dungeon, Fuad Abazovic the fearless misguided GPU Guru surfs the net and stumbles upon the holy grail of all knowledge: Ail's post above. Using his Fud sense, he interprets the mysterious and seemingly divine inspired text. 5 minutes and a string bikini later, he starts writing his next headline: R600 IS A TBDR! (as in Truly Bad Ass Redhead, for our not so technical readers...or something like that, I think)
He could have a point though, the need for memory bandwidth is in proportion to the power of the GPU, and the 8800GTX might not be bandwidth starved. A 512-bit bus on the R600 could be a stunt, a gimmick to give it a marketing edge, and to start a memory controller architecture intended for future generations (like the Ringbus, DDR4 support for the R520).
Ailuros
10-Dec-2006, 06:32
He could have a point though, the need for memory bandwidth is in proportion to the power of the GPU, and the 8800GTX might not be bandwidth starved.
Yes or "probably" to that....
A 512-bit bus on the R600 could be a stunt, a gimmick to give it a marketing edge, and to start a memory controller architecture intended for future generations (like the Ringbus, DDR4 support for the R520).
I don't think so; all I meant to say is that amongst the advantages of TBDRs is anything that has to do with floating point framebuffers; there such an architecture needs a lot less memory footprint and bandwidth for float HDR and AA combinations. Since neither G80 nor R600 is anything close to a full DR, added bandwidth and/or optimisations are a necessity.
trinibwoy
10-Dec-2006, 11:55
Isn't the memory footprint actually larger on a DR since you have to store full screen vertex information before rendering begins?
Marketing stunts are usually defined as packaging checkbox upon which minimal cost (read transistors, die size, silicon, R&D) have been spent. No one I know thinks a 512-bit bus could remotely fit that definition.
In all seriousness, no, it couldn't be a marketing stunt... I don't think tracing hell and some other non trivial issues associated with making a card using such a wide bus would be warranted for a marketing hoopla. If they do it, it means they need it/feel they need it for their architecture, which may behave differently WRT G80 when it comes to bandwidth/may have different goals to reach, IMO
Isn't the memory footprint actually larger on a DR since you have to store full screen vertex information before rendering begins?
Considering the memory required for a multisampled float framebuffer plus Z buffer compared to a 32bpp downsampled and tone-mapped output buffer, I'd say it probably isn't.
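To put rough numbers on that comparison, here is a back-of-the-envelope Python sketch. The formats are assumptions for illustration only (FP16 RGBA colour at 8 bytes and 32-bit Z/stencil at 4 bytes per sample, no compression), and it says nothing about how much scene/vertex data a deferred renderer has to keep around, which is the part of the question that stays open.

```python
def framebuffer_mib(width, height, samples, color_bytes, depth_bytes):
    """Uncompressed size of a render target in MiB (colour + depth per sample)."""
    bytes_per_sample = color_bytes + depth_bytes
    return width * height * samples * bytes_per_sample / 2**20


# Assumed: 4x multisampled FP16 RGBA colour (8 bytes) plus 32-bit Z/stencil (4 bytes).
msaa_float = framebuffer_mib(2048, 1536, samples=4, color_bytes=8, depth_bytes=4)

# Assumed: resolved, tone-mapped 32bpp output buffer, one sample, no depth kept.
resolved = framebuffer_mib(2048, 1536, samples=1, color_bytes=4, depth_bytes=0)

print(f"4x MSAA float framebuffer + Z: {msaa_float:.0f} MiB")
print(f"32bpp resolved output buffer:  {resolved:.0f} MiB")
```

Under these assumptions the multisampled float target is twelve times the size of the resolved buffer, which is the gap a tile-based design would have to fill with binned geometry before it came out worse.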
Ailuros
10-Dec-2006, 15:40
Isn't the memory footprint actually larger on a DR since you have to store full screen vertex information before rendering begins?
How much memory footprint do you need to store let's say 16x MSAA?
AFAIK, the memory footprint of MSAA should be equal to SSAA for the given number of samples.
Ailuros
10-Dec-2006, 16:27
AFAIK, the memory footprint of MSAA should be equal to SSAA for the given number of samples.
Common place for all architectures.
Cuthalu
10-Dec-2006, 19:03
This might be interesting if true: http://babelfish.altavista.com/babelfish/trurl_pagecontent?lp=zh_en&trurl=http%3a%2f%2fwww.pcpop.com%2fdoc%2f0%2f168%2f168302.shtml
800MHz @ <1.2v and GDDR3.
Ailuros
10-Dec-2006, 19:06
Does anyone speak the native language, in order to provide a better translation?
Skrying
10-Dec-2006, 19:12
Interesting if true, and the mention of GDDR3 really rubs me in an oh so weird way. Very interesting to say the least.
The only thing I get from that is that at 800MHz, R600 is consuming about 100W and that the power supply circuitry is vastly over-specified.
Jawed
trinibwoy
10-Dec-2006, 20:00
Considering the memory required for a multisampled float framebuffer plus Z buffer compared to a 32bpp downsampled and tone-mapped output buffer, I'd say it probably isn't.
Interesting - I didn't consider that AA was done on-the-fly.
Ailuros
10-Dec-2006, 20:55
Interesting if true, and the mention of GDDR3 really rubs me in an oh so weird way. Very interesting to say the least.
There has been more than one indication lately of a 512-bit bus (or, to phrase it even more safely, >256-bit). For anything >256 bits, GDDR3 makes more sense IMO from a memory availability and final board pricing perspective. Even "just" 900MHz GDDR3 @ 512-bit gives over 115GB/sec raw bandwidth.
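The 115GB/sec figure falls out of a simple calculation, sketched here in Python. It assumes the quoted 900MHz is the memory clock and that data moves twice per clock (double data rate); the function name is just for illustration.

```python
def raw_bandwidth_gb_s(mem_clock_mhz, bus_bits, transfers_per_clock=2):
    """Peak theoretical memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_bits / 8
    return mem_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


for bus_bits in (256, 384, 512):
    print(f"900MHz GDDR3 @ {bus_bits}-bit: "
          f"{raw_bandwidth_gb_s(900, bus_bits):.1f} GB/s")
```

This is raw peak bandwidth only; what an architecture actually sustains depends on the memory controller and on compression, which is the point made in the rest of the post.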
That gibberish also indicates that the R600 has better bandwidth saving techniques than G80; that one isolated could make me think of a 256bit bus, but then GDDR4 would make far more sense than GDDR3.
Megadrive1988
10-Dec-2006, 21:03
Nice to see that ATI (er AMD/ATI or just....AMD) is not going to be at a large transistor disadvantage as I previously thought, i.e. the 500M+ transistor R600 reports. Now it's up to 720M transistors, about 30 or 40 million more than G80. That, combined with what seems likely to be a 512-bit memory bus, and ATI could re-claim the crown again, provided the architecture itself is solid and better than or at least on par with G80.
Ailuros
10-Dec-2006, 21:08
Neither transistor counts nor buswidth are able to tell anything about an architecture's efficiency. Meaning that I wouldn't have thought less of any R600 if someone told me that it has "only" 500+M transistors.
^eMpTy^
10-Dec-2006, 21:43
In regards to R600's architecture...I've heard it's going to use vec4's...is that necessarily going to be less efficient than nvidia's scalar design?
If it can be split up into vec2 + vec2 or other combinations, it might end up just as efficient. Hard to say at this point.
Mintmaster
10-Dec-2006, 22:09
The only thing I get from that is that at 800MHz, R600 is consuming about 100W and that the power supply circuitry is vastly over-specified.
Jawed
How much power would be consumed by sixteen 512 Mbit GDDR3 chips @ 900MHz?
Mintmaster
10-Dec-2006, 22:23
In regards to R600's architecture...I've heard it's going to use vec4's...is that necessarily going to be less efficient than nvidia's scalar design?
I actually heard about the scalar design on R600 before anyone suggested it for G80, and that was way back in May. 64 shader pipes is still referring to the old mentality of vector ops, though. I bet ATI will describe it as 256 or 320 shader pipes in light of what NVidia's PR did, however.
I just want to get some more information about the texture samplers. Since the scalar processors are more efficient, there's even more need to have more texture samplers than on R580.
About the launch date of R600... I think it's fairly certain that we will not see R600 before the 17th of January 2007. That's most likely the day that we'll see a new X1950 card see the light of day... the X1950GT.
How much power would be consumed by sixteen 512 Mbit GDDR3 chips @ 900MHz?
10W?
I found this page which you can use to piddle about with DDR2 power consumption:
http://www.sun.com/servers/coolthreads/t2000/calc/index.jsp
though those will be ECC versions of DIMMS, I expect. Anyway, 1GB is about 5W.
GDDR4, internally, seems to be like DDR2 - so comparing GDDR3 and DDR2 isn't necessarily going to work. Also, GDDR3 does tend to run rather hot at 900MHz. Samsung seems to be notching-down GDDR3 voltages:
http://www.samsung.com/Products/Semiconductor/Support/ebrochure/memory/psg_all_product_200609.pdf
Jawed
trinibwoy
10-Dec-2006, 22:44
I bet ATI will describe it as 256 or 320 shader pipes in light of what NVidia's PR did, however.
Describing G80 as 32 vec4's isn't any better.
I bet ATI will describe it as 256 or 320 shader pipes in light of what NVidia's PR did, however.
See, and Trini laughed at me elsewhere when I said xxx shaders for R600. :smile:
I actually heard about the scalar design on R600 before anyone suggested it for G80, and that was way back in May.
I doubt R600 is going to be scalar in the way G80 is. Just call it a hunch based on patents.
I just want to get some more information about the texture samplers. Since the scalar processors are more efficient, there's even more need to have more texture samplers than on R580.
There's also clocks to take into account: 800-900 MHz TMUs?
And ATI always liked to talk about the ever-increasing ALU:TEX ratio. So you then start piddling about with a table of workable ratios, e.g.:
64:16
96:16
96:24
128:32
and pondering whether R600 has a 1:1 TMU:ROP ratio like all its predecessors (from R300 onwards - erm, maybe there's an exception somewhere that catches me out?).
And there's the non-filtering "texturing" pipelines, i.e. vertex fetch and constant buffer fetch, which also want to chomp memory/cache bandwidth.
Jawed
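To make the ratio table above easier to compare, here is a minimal sketch that reduces each candidate to an "N:1" ALU:TEX figure. The configurations are only the speculative ones listed in the post; the function name `alu_tex_ratio` is made up for illustration and says nothing about what R600 actually ships with.

```python
from fractions import Fraction

# Speculative ALU:TMU configurations from the post above (not confirmed specs).
CONFIGS = [(64, 16), (96, 16), (96, 24), (128, 32)]

def alu_tex_ratio(alus: int, tmus: int) -> Fraction:
    """Return the ALU:TEX ratio as an exact fraction (hypothetical helper)."""
    return Fraction(alus, tmus)

for alus, tmus in CONFIGS:
    # Fraction reduces automatically, so 64/16 prints as 4.
    print(f"{alus}:{tmus} -> {alu_tex_ratio(alus, tmus)}:1")
```

Only 96:16 moves the ratio away from 4:1, which is why it stands out in the list if the "ever-increasing ALU:TEX ratio" talk is taken at face value.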
May you elaborate on this?
To quote myself from another forum:
D3D9Ex is a slightly enhanced version of Direct3D 9, which adds cross-process shared surfaces, "unlimited memory" (all memory resources are "managed"), resource management control (prioritization of resources) and antialiased text rendering support. Oh, and less frequent "device removed" and thus "no more "device lost""
(almost direct quote from MS's presentation :razz: )
trinibwoy
10-Dec-2006, 23:03
See, and Trini laughed at me elsewhere when I said xxx shaders for R600. :smile:
I didn't laugh!! I just said that puts it in the 100-999 range :razz:
10W?
I was curious about that too, and was looking for the answer. That's what I figure, around there. If I interpret that article correctly, the core uses 80W, and the card uses 160-220W (idle/load?), so that leaves 80-140W... So if there are 16 chips, that's ~5-9W apiece, which sounds about right.
I was curious about that too, and was looking for the answer. That's what I figure, around there. If I interpret that article correctly, the core uses 80W, and the card uses 160-220W (idle/load?), so that leaves 80-140W... So if there are 16 chips, that's ~5-9W apiece, which sounds about right.
10W total :razz:
That article appears to be saying that the power supply on the card is capable of supporting 160W of power consumption and could easily be upgraded to 240W (erm, space for a third device?).
Jawed
He could have a point though, the need for memory bandwidth is in proportion to the power of the GPU, and the 8800GTX might not be bandwidth starved. A 512-bit bus on the R600 could be a stunt, a gimmick to give it a marketing edge, and to start a memory controller architecture intended for future generations (like the Ringbus, GDDR4 support for the R520).
I doubt it. Cost is king. If they do include a 512-bit bus perhaps it is because it has more horsepower than the 8800GTX by a substantial margin. :wink:
Sound_Card
11-Dec-2006, 01:19
The Inq states they scrapped UFO. However, this site, posted on the last page, states it has GDDR3, which means they scrapped Pele. Conflicting info here. Clock for clock, GDDR4 consumes less power than GDDR3, and if R600 is truly the power saver, I would think it would be on GDDR4.
I'm still going with 96 ALU's with 24 ROP's and 24 TMU's.
Tell me if I'm wrong please, but 64 ALU's with 16 ROP's and TMU's just seems way too small for a transistor budget of 720M. I know DX10 pushes transistor counts higher, but not that much. The home theater chip should still be off the GPU as well....:roll:
Reputator
11-Dec-2006, 02:19
if it can be split up into vec2 + vec2 or other combinations it might end up just as efficient. Hard to say at this point.
I'd hope so, though ATI is hard-set on their compiler.
I doubt it. Cost is king. If they do include a 512-bit bus perhaps it is because it has more horsepower than the 8800GTX by a substantial margin. :wink:
~80% more? We're all expecting a win for AMD here but I highly doubt under the best of circumstances the memory bandwidth is in reasonable proportion with this GPU's capabilities. Unless it's got 96 shaders and 24 ROPs as many have speculated.
pakotlar
11-Dec-2006, 03:43
I still don't know why you guys are arguing about whether or not the massive amount of bandwidth will be completely utilized. It is historically accurate to say that memory will always be used, no matter how much you have. I remember when R300 came out people were saying that the 256bit bus wouldn't be saturated. Well not only was it used to great effect (one of the prime advantages over NV30), but it laid the groundwork for more powerful devices that absolutely required the extra wide bus.
I think that this generation, we will see cases where the 86GB/s of bandwidth on the 8800GTX will not be enough, especially with UE3 engine games. I'm guessing that in cases where heavy AA + transparency SSAA is used @ high resolution we will see some major differences. There will also be cases where games are simply relying on a ton of texture reads, like COD2, where the extra bandwidth will cause a large difference (check out COD2 benchies on 1950XTX vs 1900XTX, for those of you who believe that the bandwidth added by the former did not give it an advantage over the latter).
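For reference, the 86GB/s figure falls out of the usual peak-bandwidth arithmetic: bus width in bytes times effective transfer rate. The sketch below assumes double-data-rate memory (two transfers per clock) and the 8800GTX's 384-bit bus at 900MHz; the 512-bit case at the same 900MHz GDDR3 clock is purely a what-if based on the rumours in this thread, not a known R600 spec. The function name is illustrative.

```python
def peak_bandwidth_gbs(bus_bits: int, mem_clock_mhz: float,
                       transfers_per_clock: int = 2) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_bits / 8
    transfers_per_second = mem_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9

# 8800GTX: 384-bit bus, 900MHz GDDR3.
print(peak_bandwidth_gbs(384, 900))

# Hypothetical: a 512-bit bus at the same 900MHz memory clock (assumption).
print(peak_bandwidth_gbs(512, 900))
```

At equal memory clocks the wider bus is only a one-third increase, so any larger advantage would have to come from faster memory as well.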
If, as some speculate, R600 will be underspecced in the shader department relative to its bandwidth, wouldn't it be neat to have a high-end card that finally includes standard AA on all titles? And I don't even mean user-selectable, but on by default. This may not prove a hindrance to upcoming, presumably more demanding games, b/c the absolute bandwidth excess remains.
Totally unrealistic for many reasons, I know.
BTW, that PCPop article says 80A @ 1.2V, so 96W, no? Dunno if the 160-240A translates at the same voltage (what does GDDR3/4 use?), but it's hard to believe it'll be much higher, considering it's 225W max from the PEG slot and two aux PEG plugs.
And why GDDR3 when R580+ already packs GDDR4? :???:
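The two power figures in the post above are simple enough to check. This is a rough sketch assuming P = I x V for the core rail, and 75W each for the PEG slot and for each auxiliary 6-pin plug (the values that give the 225W total quoted above); function and constant names are made up for illustration.

```python
SLOT_LIMIT_W = 75      # PEG slot power limit
AUX_6PIN_LIMIT_W = 75  # per auxiliary 6-pin PEG plug

def rail_power_w(amps: float, volts: float) -> float:
    """Power delivered on a rail: P = I * V."""
    return amps * volts

def board_power_limit_w(aux_plugs: int) -> int:
    """Maximum board power from the slot plus N auxiliary 6-pin plugs."""
    return SLOT_LIMIT_W + aux_plugs * AUX_6PIN_LIMIT_W

print(rail_power_w(80, 1.2))    # the 80A @ 1.2V figure from the article
print(board_power_limit_w(2))   # slot + two aux plugs
```

If the 160-240A figures were at the same 1.2V they would imply 192-288W on that rail alone, which is why it seems unlikely they translate directly.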
INKster
11-Dec-2006, 06:29
I still don't know why you guys are arguing about whether or not the massive amount of bandwidth will be completely utilized. It is historically accurate to say that memory will always be used, no matter how much you have. I remember when R300 came out people were saying that the 256bit bus wouldn't be saturated. Well not only was it used to great effect (one of the prime advantages over NV30), but it laid the groundwork for more powerful devices that absolutely required the extra wide bus.
I think that this generation, we will see cases where the 86GB/s of bandwidth on the 8800GTX will not be enough, especially with UE3 engine games. I'm guessing that in cases where heavy AA + transparency SSAA is used @ high resolution we will see some major differences. There will also be cases where games are simply relying on a ton of texture reads, like COD2, where the extra bandwidth will cause a large difference (check out COD2 benchies on 1950XTX vs 1900XTX, for those of you who believe that the bandwidth added by the former did not give it an advantage over the latter).
What if Nvidia decides to launch a "8900 GTX" or even a "GX2" as it seems to be suggested lately for a March/April release ?
That would be a 768bit "virtual"-bus (i know, i can't just add the theoretical width because of framebuffer content duplication issues, but it would also be true that such framebuffer would be used for only a portion of each frame at a time, not all of it at once).
BTW, i believe R300 only used a 256bit bus because, at the time, the pace of proper GDDR-type (DDR, DDR2, GDDR3 and GDDR4, since then) memory evolution was much slower.
Nvidia knew this too.
That's why they tried to use 500MHz DDR2 (1000MHz effective) for the GeforceFX 5800 Ultra, which was extremely rare, hot and expensive back then.
In fact, it was almost experimental to use DDR2 at such high speed in late 2002.
And why GDDR3 when R580+ already packs GDDR4? :???:
Since R580+ only uses half as many memory chips as a 512-bit-bus R600 would (8 instead of 16), maybe cost per unit was a factor when deciding the memory type.
Tim Murray
11-Dec-2006, 06:57
Has there been any agreement on what constitutes an "ALU" yet? :razz:
I'd hope so, though ATI is hard-set on their compiler.
~80% more? We're all expecting a win for AMD here but I highly doubt under the best of circumstances the memory bandwidth is in reasonable proportion with this GPU's capabilities. Unless it's got 96 shaders and 24 ROPs as many have speculated.
I'm still guessing that their shaders are running 1-3GHz.
Sunrise
11-Dec-2006, 09:55
10W total :razz:
:?:
Total doing what? Average when in use? Average when idle? If there are 16 devices, typical calculations are extremely theoretical and therefore very hard to do, because typically (depends on their workload) each of those don't operate at IDLE/BURST mode all the time. 10W? Not a chance, not even if we agree that our calculations are fairly theoretical. You also have to keep in mind that the DC has to have both sufficient headroom for handling typical bursts and peak current of all modules.
16x 512MBit -BJ11 devices @ 2.0V +- 0.1V would already take about ~18W when they are all just "sitting there", meaning they are IDLE (operating current) and in BURST mode this figure could go as high as ~38W (taken from Samsung's own GDDR3 tech papers).
Since ATI likes to "overspec" their GDDR3/GDDR4 (to increase headroom a little more) you can be fairly certain that their DC will need about 40W (max) for the memory alone.
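To put the competing estimates side by side, here is a small sketch that converts each total into a per-device figure and checks that sixteen 512Mbit devices do add up to 1GB. The totals are the ones quoted in this thread (18W idle, 38W burst, and the earlier 10W guess derived from doubling DDR2); nothing here is measured data, and the helper names are illustrative.

```python
DEVICES = 16
DENSITY_MBIT = 512

def total_capacity_mb(devices: int = DEVICES,
                      density_mbit: int = DENSITY_MBIT) -> float:
    """Total capacity in megabytes (8 bits per byte)."""
    return devices * density_mbit / 8

def per_device_watts(total_watts: float, devices: int = DEVICES) -> float:
    """Average power per memory device for a given total."""
    return total_watts / devices

print(total_capacity_mb())  # capacity in MB

# Totals quoted in the thread, in watts.
for label, total in (("idle", 18.0), ("burst", 38.0), ("DDR2-based guess", 10.0)):
    print(label, per_device_watts(total))
```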
Did you see my earlier post where I linked to a power consumption calculator for DDR2 memory?
I couldn't find any power consumption data for GDDR3, so I took 1GB of DDR2 and doubled it. It was a guess.
Jawed
Sunrise
11-Dec-2006, 10:11
Did you see my earlier post where I linked to a power consumption calculator for DDR2 memory?
I couldn't find any power consumption data for GDDR3, so I took 1GB of DDR2 and doubled it. It was a guess.
Jawed
Yeah, I read them both, but I'm a little lost how you can derive data from entirely different chips/modules/specs with different internal operation and workloads, while also applying them to typical GDDR3 graphics card operation.
Doing calculations like that using data from entirely different devices won't do your guess any good. ;)
I have to admit to being surprised that DDR2 is only about 5W under load for 1GB.
Anyway, 1GB of GDDR3 + R600 at 800MHz sounds like it'll consume ~140W. Dunno how much extra the rest of the card will consume...
Then there's this knotty question of what SKU that is, since we've been expecting the top one to use GDDR4. Is there a 1GB GDDR4 + R600 at 900MHz variant?
Maybe AMD is keeping back that configuration for later - maybe another spin is required?
Jawed
Maybe AMD is keeping back that configuration for later - maybe another spin is required?
Jawed
If they require another respin, it's 1900 vs gx2 all over again.
I think the Chinese report would be about right: they're packing, but not going to make January with the current state of affairs.
Has there been any agreement on what constitutes an "ALU" yet? :razz:
I think that would be an "A", an "L" and a "U" :yep2:
SugarCoat
11-Dec-2006, 11:27
10W total :razz:
That article appears to be saying that the power supply on the card is capable of supporting 160W of power consumption and could easily be upgraded to 240W (erm, space for a third device?).
Jawed
theinq suggested about a week or two ago that the (fat, not long) PCB was being designed not only for the R600 but for the refresh and future R700 as well. That would play an integral part in what it can support. If true, I think that it's probably the first good indication that they're going to attempt to be as small a burden as possible on the overall expenditures of AMD. More mileage for your dollar, if you will. Makes perfect sense to me: even if some things aren't fully utilized, get the support on there working, then disable whatever is redundant to the given chip until it's needed.
final chips are ready Jawed, happened about 2-3 weeks back.
Rangers
11-Dec-2006, 11:43
He could have a point though, the need for memory bandwidth is in proportion to the power of the GPU, and the 8800GTX might not be bandwidth starved. A 512-bit bus on the R600 could be a stunt, a gimmick to give it a marketing edge, and to start a memory controller architecture intended for future generations (like the Ringbus, GDDR4 support for the R520).
Or maybe the R600 is so powerful it needs that bandwidth?
Ailuros
11-Dec-2006, 11:48
I think that would be an "A", an "L" and a "U" :yep2:
A Lame Unit? *runs for his life*
final chips are ready Jawed, happened about 2-3 weeks back.
Groovy :grin:
Jawed
final chips are ready Jawed, happened about 2-3 weeks back.
No (and laughably so, where are you getting your info from?!). A new spin did come back, but not 2-3 weeks ago, and you'll only know if it's 'final' at this point if you work for AMD.
No (and laughably so, where are you getting your info from?!). A new spin did come back, but not 2-3 weeks ago, and you'll only know if it's 'final' at this point if you work for AMD.
oh it's final :wink: , I heard it 2-3 weeks ago, so it might be beforehand, I agree. They are hitting the MHz they wanted; only if there are flaws in the silicon will they need another respin. Actually, in the first spin, A0, the ROPs were broken, and that's why the GPU didn't work; A1 had clocking problems. Those are the only major problems I have heard about the R600 silicon, so this last spin should be the one. Seems like you have some information too, Rys, why not share?
oh it's final :wink: , I heard it 2-3 weeks ago, so it might be beforehand, I agree.
So I can now reach you at razor1@amd.com ? It's not final production silicon until someone at AMD gives the nod, and they haven't made that decision yet, I promise you that much. I think you'll find they're hoping it's the case, though, so maybe the place you hear it from is mistaking that hope for something it's not (yet, and fingers crossed since I want one sooner rather than later :twisted:).
lets wait and see then :wink:, yes I was told this is probably the final silicon, but I haven't heard about any issues with the last respin, they seem to be very happy with it, so unless there is something they haven't tested yet, I'm 99% sure its ready to go.
lets wait and see then :wink:
All the :wink: smilies in the world won't hide the fact that you're pissing in the wind with R600 most of the time, and your sources could be a bit better. Your chip revision data is off (by some distance), for example. ATI have never labelled chip spins like that.....
As for revealing my info, I always do so when it's correct, prudent and makes sense to, unlike some :runaway:
trinibwoy
11-Dec-2006, 12:33
I'm still guessing that their shaders are running 1-3GHz.
Hope that guess is based on more than just the fact that Nvidia decided to clock theirs that high. Given talk of 800Mhz speeds it's nearly guaranteed that those rumours refer to the shader domain as well.
ATI have never labelled chip spins like that.....
As for revealing my info, I always do so when it's correct, prudent and makes sense to, unlike some :runaway:
hehehe, FUDo does..
http://www.theinquirer.net/default.aspx?article=14302
All the :wink: smilies in the world won't hide the fact that you're pissing in the wind with R600 most of the time, and your sources could be a bit better. Your chip revision data is off (by some distance), for example. ATI have never labelled chip spins like that.....
As for revealing my info, I always do so when it's correct, prudent and makes sense to, unlike some :runaway:
Yeah right, next time you want to shit behind my back to the B3D members, I suggest you talk to me first. I know exactly what you have been saying, and who you are getting your information from. YOU remember this, my friend, I don't go behind your back unless I have to! Got it, good. I would be prudent with your information, because I do have friends at AMD, and they weren't happy with you going behind my back and talking shit.
Yeah, I left the metal layer, silicon respin out because I was too lazy to type one number.
Yeah right, next time you want to shit behind my back to the B3D members, I suggest you talk to me first. I know exactly what you have been saying, and who you are getting your information from. YOU remember this, my friend, I don't go behind your back unless I have to! Got it, good. I would be prudent with your information, because I do have friends at AMD, and they weren't happy with you going behind my back and talking shit.
Yeah, I left the metal layer, silicon respin out because I was too lazy to type one number.
Sounds as if you work for the CIA as an agent in the graphics industry :cool:
Don't take my words so seriously :wink:
Ailuros
11-Dec-2006, 13:10
I'd suggest we put an end to this catfight soap opera and bounce back on topic.
Hope that guess is based on more than just the fact that Nvidia decided to clock theirs that high. Given talk of 800Mhz speeds it's nearly guaranteed that those rumours refer to the shader domain as well.
We haven't seen clock domains from AMD yet, so yeah, I'm leaning the core is all one speed too (whatever that speed is).
We haven't seen clock domains from AMD yet, so yeah, I'm leaning the core is all one speed too (whatever that speed is).
How about with more processor units? I would like to ask which scenario would be more beneficial: a higher clock speed (more MHz or GHz) with fewer processing units, or a normal-speed chip (600-800MHz) with more processing units (as Trini said, 100-999)?
Edit: typo...
Rys, I apologize for my previous post; it was uncalled for and it shouldn't have been geared towards you. I let my anger get in the way of my judgement. Sorry.
Reputator
11-Dec-2006, 22:39
We haven't seen clock domains from AMD yet, so yeah, I'm leaning the core is all one speed too (whatever that speed is).
I'm thinking the same way.
How about with more processor units? I would like to ask which scenario would be more beneficial: a higher clock speed (more MHz or GHz) with fewer processing units, or a normal-speed chip (600-800MHz) with more processing units (as Trini said, 100-999)?
Generally it's better to have more pipes than clockspeed, because you get better utilization. The only case where there might be an exception to this is the FX, GF6, and GF7 architectures with the coupled texture unit, where a higher clockspeed ekes out better shader performance once the texturing needs for the scene are met.
Say, how would you guys feel about much simplified ALUs (in that they're pipelined and do less each clock), but that you get four scalar ones for each single vector one? Like with the 8800? And that they're individually relocatable (in that you make batches of all the ones who have to execute the same op), instead of simple quads?
Say, how would you guys feel about much simplified ALUs (in that they're pipelined and do less each clock), but that you get four scalar ones for each single vector one? Like with the 8800? And that they're individually relocatable (in that you make batches of all the ones who have to execute the same op), instead of simple quads?
Hmm, I was planning on going to bed in about an hour...
Jawed
Megadrive1988
11-Dec-2006, 22:48
The Inq states they scrapped UFO. However, this site, posted on the last page, states it has GDDR3, which means they scrapped Pele. Conflicting info here. Clock for clock, GDDR4 consumes less power than GDDR3, and if R600 is truly the power saver, I would think it would be on GDDR4.
I'm still going with 96 ALU's with 24 ROP's and 24 TMU's.
Tell me if I'm wrong please, but 64 ALU's with 16 ROP's and TMU's just seems way too small for a transistor budget of 720M. I know DX10 pushes transistor counts higher, but not that much. The home theater chip should still be off the GPU as well....:roll:
I'm sure there will be an R600 with 96 ALUs, 24 ROPs, 24 TMUs *and* a lower-end R6xx with 64 ALUs, 16 ROPs, 16 TMUs, as per usual.
So this is some kind of windowed instruction issue, then?
Let's say you've got two separate shader programs, a VS and a PS for the sake of argument:
VS:
ADD r0.xy, r1.xy, r2.xy
PS:
ADD r4.rg, r5.rg, r6.rg
the instruction scheduler could combine these two separately executing batches into a singly issued instruction:
ADD rA.1234, rB.1234, rC.1234
where registers A, B and C are proxies for the source/destination registers, and channels 1 to 4 are a hidden notation within the ALU that isn't exposed to the programmer.
Is that the sort of thing you mean?
Sounds like a right palaver.
Jawed
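The merging idea in the post above can be mocked up in a few lines. This is only a toy model of a hypothetical scheduler that greedily packs same-opcode operations from different shaders into one 4-channel issue slot; the `Op` record, `co_issue` function and lane count are assumptions for illustration, not a description of how R600 or any real hardware schedules instructions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    opcode: str   # e.g. "ADD"
    dst: str      # destination register and channels, e.g. "r0.xy"
    src_a: str
    src_b: str
    width: int    # number of channels the op occupies (1..4)

def co_issue(ops, lanes=4):
    """Greedily pack ops with the same opcode into slots of `lanes` channels."""
    slots = []
    for op in ops:
        for slot in slots:
            # An op can join a slot if the opcode matches and it still fits.
            if slot["opcode"] == op.opcode and slot["used"] + op.width <= lanes:
                slot["ops"].append(op)
                slot["used"] += op.width
                break
        else:
            # No compatible slot found: open a new one.
            slots.append({"opcode": op.opcode, "ops": [op], "used": op.width})
    return slots

vs_add = Op("ADD", "r0.xy", "r1.xy", "r2.xy", 2)  # from the vertex shader
ps_add = Op("ADD", "r4.rg", "r5.rg", "r6.rg", 2)  # from the pixel shader

for slot in co_issue([vs_add, ps_add]):
    print(slot["opcode"], [op.dst for op in slot["ops"]], "lanes used:", slot["used"])
```

The two vec2 ADDs land in a single slot using all four lanes, which is the "rA.1234" case described above. The hard part, gathering the operands from unrelated registers, is deliberately left out of this sketch.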
So this is some kind of windowed instruction issue, then?
Let's say you've got two separate shader programs, a VS and a PS for the sake of argument:
VS:
ADD r0.xy, r1.xy, r2.xy
PS:
ADD r4.rg, r5.rg, r6.rg
the instruction scheduler could combine these two separately executing batches into a singly issued instruction:
ADD rA.1234, rB.1234, rC.1234
where registers A, B and C are proxies for the source/destination registers, and channels 1 to 4 are a hidden notation within the ALU that isn't exposed to the programmer.
Is that the sort of thing you mean?
Sounds like a right palaver.
Jawed
You go fast, but ultimately: yes. There is no real reason anymore why the execution units would resemble the old model. It's the only way you can really do what DX10 asks. And not such a big step for ATi, considering.
Huh, interesting, I was thinking dynamic batch size. You have a set of schedulers, and a set of ALUs, and the schedulers "check-out" a set of ALUs for the duration of a batch (or until batch is suspended pending memory fetch, maybe re-allocated at branch points too, etc).
Now, *that* would be why you need more complex access to the register-mem, but that patent was NV's, wasn't it?
I think this makes a lot of sense for the GPGPU boys, or a mix-in work-set of physics, vertex (small batch size) and ps (mostly wide).
ed: would there be such a huge win to combine across shader lines (types and/or 'kernel's), and, how do you restore/maintain locality of texture access? Is there a minimum width?
ed: would there be such a huge win to combine across shader lines (types and/or 'kernel's)
Yes, the ability to make many dynamic branches alone is sufficient. Especially when you want to have a good execution unit load.
and, how do you restore/maintain locality of texture access? Is there a minimum width?
DX10 requires that to be through virtual memory, so you need a mechanism to do some kind of paging in any case. Small batches will still be pretty inefficient, but much less than with quads.
3dilettante
11-Dec-2006, 23:50
I wonder about having every individual ALU being capable of being individually relocatable to an arbitrary degree.
My impression of CTM's spec seems to indicate a grouping of the processors into arrays, where each member of the array works on a program that they all share.
Each individual unit would only need to know where it is in the shared program, and wouldn't need to look elsewhere or carry much more threading data beyond the bare minimum.
Let's say the array size is 16 units.
If each individual ALU unit were completely independent, that would mean that in the worst case, a naive implementation would multiply the amount of thread data per array by 16 times. In the case of multicycle ops, where a unit can jump around, the context becomes larger for every cycle the op takes.
If that entire model is abandoned, and each unit can be arbitrarily assigned, then each unit must be able to query every thread slot available and pick out an instruction, or the schedulers must be able to know the status of every unit (4*96), and be capable of drawing operands from anywhere, since we're batching by op, which is unlikely to have data locality. That's just plain huge.
It may be less expensive and more transparent to the outside world to give each array a pool of fixed-size program sections that they can make fused ops from.
Instead of actively seeking out paired operations from every active thread, just take advantage of coincidental batching behavior, knowing that the odds favor a lot of overlap.
You go fast, but ultimately: yes. There is no real reason anymore why the execution units would resemble the old model. It's the only way you can really do what DX10 asks. And not such a big step for ATi, considering.
It sounds like a huge amount of work gathering stuff from the register file.
The potentially weird thing about this is you no longer have batches. The only batching is for memory fetches/texture filtering. Any random combination of objects that need to run an "ADD" can be shoved together for the duration of that instruction and only that instruction.
It's interesting, actually, because you can view the ALU pipeline as only being capable of running one of about 15 different instructions (ADD, MUL, DP, RCP etc.). So in a way it shouldn't be hard to "sort" the available threads to identify objects whose pending instruction is "ADD" or whatever. You can set counters to keep a watch on threads' objects' pending instructions - taking account of dynamic branching (i.e. excluding objects that are predicated-out) and when a threshold for the array is reached (say the array is 16 vec4s) you can issue those objects together.
But still, the actual scheduling of this crazy merry-go-round sounds like sublime madness. The sort of thing you're glad you don't have to work out in detail.
Dynamic branching performance, being effectively batchless, if I'm right, should be stratospherically good. As long as there's enough objects in flight, of course. This rather ominous "some number" gets clearer :?: :
http://www.cupidity.f9.co.uk/b3d73.jpg
I'm not sure how relevant these are:
SIMD processor and addressing method (http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220060047937%22.PGNR.&OS=DN/20060047937&RS=DN/20060047937)
Method and apparatus for superword register value numbering (http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220050198468%22.PGNR.&OS=DN/20050198468&RS=DN/20050198468)
Jawed
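The "sort by pending instruction, issue when a counter hits the array width" idea can be sketched as follows. This is a speculative toy model only: the array width of 16, the tuple layout and the function name are assumptions taken from the wording of the post, and real hardware would also have to deal with operand gathering and latency, which this ignores.

```python
from collections import defaultdict

ARRAY_WIDTH = 16  # "say the array is 16 vec4s"

def issue_ready_groups(pending, width=ARRAY_WIDTH):
    """Bucket objects by pending opcode and emit a group each time a bucket fills.

    `pending` is an iterable of (object_id, opcode, active) tuples, where
    `active` is False for objects that are predicated out by a branch.
    Returns (issued_groups, leftovers).
    """
    buckets = defaultdict(list)
    issued = []
    for obj_id, opcode, active in pending:
        if not active:
            continue  # predicated-out objects never count towards a group
        bucket = buckets[opcode]
        bucket.append(obj_id)
        if len(bucket) == width:  # threshold reached: issue and reset counter
            issued.append((opcode, list(bucket)))
            bucket.clear()
    leftovers = {op: ids for op, ids in buckets.items() if ids}
    return issued, leftovers

# 40 objects alternating between ADD and MUL, all active.
work = [(i, "ADD" if i % 2 == 0 else "MUL", True) for i in range(40)]
issued, leftovers = issue_ready_groups(work)
print([(op, len(ids)) for op, ids in issued])
print(leftovers)
```

The leftovers are the objects that have to wait for more work with the same pending opcode, which is where the "enough objects in flight" caveat comes from.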
I wonder about having every individual ALU being capable of being individually relocatable to an arbitrary degree.
My impression of CTM's spec seems to indicate a grouping of the processors into arrays, where each member of the array works on a program that they all share.
Each individual unit would only need to know where it is in the shared program, and wouldn't need to look elsewhere or carry much more threading data beyond the bare minimum.
Let's say the array size is 16 units.
If each individual ALU unit were completely independent, that would mean that in the worst case, a naive implementation would multiply the amount of thread data by 16 times. In the case of multicycle ops, where a unit can jump around, the context becomes larger for every cycle the op takes.
If that entire model is abandoned, and each unit can be arbitrarily assigned, then each unit must be able to query every thread slot available and pick out an instruction, or the schedulers must be able to know the status of every unit (4*96), and be capable of drawing operands from anywhere, since we're batching by op, which is unlikely to have data locality. That's just plain huge.
It may be less expensive and more transparent to the outside world to give each array a pool of fixed-size program sections that they can make fused ops from.
Instead of actively seeking out paired operations from every active thread, just take advantage of coincidental batching behavior, knowing that the odds favor a lot of overlap.
You can achieve the same by making a thread for each scalar ALU. And you can still save and restore the registers per batch. A bit more logic, but about the same data traffic.
3dilettante
11-Dec-2006, 23:57
You can achieve the same by making a thread for each scalar ALU. And you can still save and restore the registers per batch. A bit more logic, but about the same data traffic.
Just one thread?
It would still need some kind of pick stage if it's pulling ops from different shaders. That wouldn't go away.
And that just one thread per array is 16x the threading context. Actually, if we're going fully scalar and each scalar unit goes in the same place as the vector unit, it's 64x the context.
I've gotten the sense that things are claustrophobic for ATI when it comes to threading already.
It sounds like a huge amount of work gathering stuff from the register file.
The potentially weird thing about this is you no longer have batches. The only batching is for memory fetches/texture filtering. Any random combination of objects that need to run an "ADD" can be shoved together for the duration of that instruction and only that instruction.
It's interesting, actually, because you can view the ALU pipeline as only being capable of running one of about 15 different instructions (ADD, MUL, DP, RCP etc.). So in a way it shouldn't be hard to "sort" the available threads to identify objects whose pending instruction is "ADD" or whatever. You can set counters to keep a watch on threads' objects' pending instructions - taking account of dynamic branching (i.e. excluding objects that are predicated-out) and when a threshold for the array is reached (say the array is 16 vec4s) you can issue those objects together.
Exactly. :D
But still, the actual scheduling of this crazy merry-go-round sounds like sublime madness. The sort of thing you're glad you don't have to work out in detail.
Dynamic branching performance, being effectively batchless, if I'm right, should be stratospherically good. As long as there's enough threads in flight, of course. This rather ominous "some number" gets clearer :?: :
You need to handle four times as much thread allocations, but the amount of data you need to handle stays the same. And you probably have to combine them back into quadruple (vector) operands in either case. Which also means that the amount of instruction dispatchers stays the same, if you have multiple targets per quad member.
"Some number". You can take that both ways.
;)
Just one thread?
Sorry, I meant that you split each thread into four.
It would still need some kind of pick stage if it's pulling ops from different shaders. That wouldn't go away.
And that just one thread per array is 16x the threading context. Actually, if we're going fully scalar and each scalar unit goes in the same place as the vector unit, it's 64x the context.
I've gotten the sense that things are claustrophobic for ATI when it comes to threading already.
? How do you get 64 times? I still assume you batch them, as there will still be many that run the same op.
3dilettante
12-Dec-2006, 00:10
The only batching is for memory fetches/texture filtering. Any random combination of objects that need to run an "ADD" can be shoved together for the duration of that instruction and only that instruction.
I may be misinterpreting what the idea is, but the dynamic grouping from any random number of places sounds like some kind of parallel+wide picking problem.
If it's anything like superscalar issue, the problem is quadratic in nature, and already huge.
So in a way it shouldn't be hard to "sort" the available threads to identify objects whose pending instruction is "ADD" or whatever.
You would need to know the status of every thread every cycle.
You can set counters to keep a watch on threads' objects' pending instructions - taking account of dynamic branching (i.e. excluding objects that are predicated-out) and when a threshold for the array is reached (say the array is 16 vec4s) you can issue those objects together.
What happens when the pending status can change with the next cycle? It would need to be checked again, quickly. Do we instead force those objects to hang and wait?
How do you get 64 times? I still assume you batch them, as there will still be many that run the same op.
Just based on the assumption of a 16 vec4 unit array that becomes 64 scalar units if the array still contains them. Since there is no way to know which threads are sharing an op, there must be some kind of list that points to which ones are being run. If there is complete arbitrariness, it's 64x the context in the worst case.
You would need to know the status of every thread every cycle.
Not really, as long as they all run the same sequence of ops. And a large part of that can be determined by the driver up front. You only have to watch for branches and stuff.
"Some number". You can take that both ways.
;)
Teehee, yeah, you only need a small batch (for the sake of dynamic branching performance) when the objects you're executing have a mixed predication bitset: 0111010110010101. You need to wait around until you have enough objects to make up a 1111111111111111 instruction-issue.
If the predication is identical across the set of objects (it might be 100 "lines" of 1111111111111111) then there's no need to do any fancy sorting, or worry that the "batch size" is effectively 1600.
It gets hairier when you take into account the number of channels this instruction consumes. If it's RCP, say, then that's a "batch" of 6400 :lol:
Jawed
Not really, as long as they all run the same sequence of ops. And a large part of that can be determined by the driver up front. You only have to watch for branches and stuff.
Which is, I think, where that hashing patent application (second one I linked) comes in. Erm...
Jawed
What happens when the pending status can change with the next cycle? It would need to be checked again, quickly. Do we instead force those objects to hang and wait?
It's just latency. You use a large number of threads to hide latency, so the hang doesn't affect throughput.
Jawed
Which is, I think, where that hashing patent application (second one I linked) comes in. Erm...
Jawed
Something like that, yes. ;)
The other nice thing about this is that when you KILL a pixel, vertex or primitive, the windower just shuffles-in other work to fill in the gap.
Hmm, sweet dreams... :lol:
Jawed
trinibwoy
12-Dec-2006, 02:14
I'm not sure if I missed it but in layman's terms what's the tangible advantage of such dynamic ALU allocation?
Batch size no longer determines branching granularity, or, the issue width is no longer tied to branching granularity. However you'd like to think about it. It frankly matches Imagine's streaming architecture much better (work units get shuffled on input and at branch points during divergence).
I will be interested in how they manage access to the register file, if they go into those details. Does the cost of branching go up, or is the cost of shuffling register access amortized over a number of instructions, or can that vary depending on the size of the divergent "kernel"s, or..?
ed: Hmm, perhaps I should add an example. Consider an input of a hundred fragments. Let's say the SIMD issue width is 16. Let's say that every odd fragment takes route 1, and every even fragment takes route 2. In G80, chaos. In "R600", no problem. A batch of 16 fragments which previously were labelled "1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31" are issued at once, and then the even fragments "2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32" are issued, then "33, 35..." etc. What is unclear to me is whether the register file is massively multi-ported, or if the cost of reshuffling the data is taken "at branch", or if the cost of reshuffling is taken at each register access (if you have a 16-wide access to the register file, then it would not be surprising to see register fetch for "1" conflict with "17" -- for example....)
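The odd/even example can be sketched in a few lines of Python. This is only an illustration under the assumption that the hardware can freely regroup fragments by the route they take; the function names are made up for the sketch, and routes are drained one after the other here rather than interleaved. It compares regrouping against fixed 16-wide batches in arrival order, where a batch containing both routes has to be issued once per route.

```python
from collections import defaultdict

def regroup_by_route(routes, issue_width=16):
    """Group fragment ids by branch route, then cut each group into issue-width batches."""
    by_route = defaultdict(list)
    for frag_id, route in routes.items():
        by_route[route].append(frag_id)
    batches = []
    for route in sorted(by_route):
        ids = by_route[route]
        for start in range(0, len(ids), issue_width):
            batches.append((route, ids[start:start + issue_width]))
    return batches

def issues_without_regrouping(routes, issue_width=16):
    """Fixed batches in arrival order: each batch is issued once per distinct route it contains."""
    ids = sorted(routes)
    total = 0
    for start in range(0, len(ids), issue_width):
        chunk = ids[start:start + issue_width]
        total += len({routes[i] for i in chunk})
    return total

# 100 fragments numbered 1..100: odd ones take route 1, even ones take route 2.
routes = {i: 1 if i % 2 else 2 for i in range(1, 101)}
batches = regroup_by_route(routes)

print(len(batches))                       # issues needed with regrouping
print(issues_without_regrouping(routes))  # issues needed with fixed batches
```

The sketch says nothing about the cost of fetching registers for a regrouped batch, which is exactly the open question raised in the post.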
Multi-porting is expensive, but potentially swapping a bunch of registers around isn't cheap either, and amortising costs by hoping for randomized state distribution sounds like a recipe for non-deterministic behavior. I'm not in love with any of those solutions. I'm hoping there's some other more clever approach....
Ailuros
12-Dec-2006, 08:22
Say, how would you guys feel about much simplified ALUs (in that they're pipelined and do less each clock), but that you get four scalar ones for each single vector one? Like with the 8800? And that they're individually relocatable (in that you make batches of all the ones who have to execute the same op), instead of simple quads?
Is that a trick question? :D It's a scenario that makes a lot of sense IMO.
Hanners
12-Dec-2006, 09:24
Nothing new at all really, but just for completeness Anandtech (http://www.anandtech.com/tradeshows/showdoc.aspx?i=2887) have made a brief mention of R600 in a new, pre-CES article...
What did we learn? Unfortunately, some of the more exciting items are under NDA until CES but we did pick up a few tidbits here and there. It appears the AMD/ATI R600 graphics cards are still on schedule for an early Q1 2007 launch and should provide some very serious competition to the GeForce 8800 series. However, all of the expected benefits and performance improvements of this release will also bring some serious power requirements. We heard power consumption numbers hovering around 430~450W for the high-end CrossFire setup while under full load. Those are power requirements just for the cards according to our sources who said the first silicon spins actually consumed even more power. What the final numbers will be is anyone's guess but be prepared to start looking at 800W+ power supplies in the near future if you want to run extreme performance GPU configurations.
Ailuros
12-Dec-2006, 09:30
Those are power requirements just for the cards according to our sources who said the first silicon spins actually consumed even more power.
430-450W just for the cards cannot be true. That sounds rather like total system power consumption. Up until now (and even with G80) in dual GPU configs the power consumption in SLi/CF is not 2* a single GPU.
I'm not sure if I missed it but in layman's terms what's the tangible advantage of such dynamic ALU allocation?
Maximisation of ALU utilisation, coupled with increased coherency of code execution in dynamic branching.
A traditional vector ALU in a GPU has a limited set of combinations of work it can do (not necessarily all of these):
4D
3D+1D
2D+2D
2D+1D
1D+1D
and it tries to maximise utilisation by performing two different (potentially complex) instructions in parallel, on a maximum of four operands:
MAD r0.xyz, r1.xyz, r2.xyz, r3.xyz
RSQ r6.w, r6.w
R600 appears to "turn things sideways" in a different fashion to the way that G80 turns vector operations sideways into scalars. G80 reduces the number of instructions the ALU can issue per clock to one (the missing MUL is technically a different ALU - whether it's ever found or not...). G80 uses time to determine the number of channels the instruction operates for. As a minimum, with something like ADD r0.x, r1.x, r2.x, G80 will spend one clock executing the instruction, across 16 objects (in fact it's two clocks: 8 vertices/primitives per clock; or four clocks: 8 fragments per clock).
R600 actually uses threads in flight (strictly: objects) turned sideways and tries to execute only one instruction per clock. So while ADD r0.xyzw, r1.xyzw, r2.xyzw consumes 16 vertices, ADD r0.x, r1.x, r2.x consumes 64 vertices from four different threads (assuming a thread consists of 16 vertices and the ALU array is 16x vec4 wide). Theoretically the 64 vertices could come from 64 different threads, which is why it's better to think in terms of objects really, not threads. (Or, you could take the NVidia definition of a thread: each vertex, primitive or fragment counts as a thread. I'm trying to talk in terms of R5xx threads, though.)
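The object counts in that paragraph follow from dividing the array's scalar lanes by the channel count of the instruction. A small sketch, assuming the 16x vec4 array and 16-object threads used in this post (both are guesses about R600, not confirmed figures; the function names are just for illustration):

```python
def objects_per_issue(channels, array_width=16, lanes_per_unit=4):
    """Objects one instruction can cover if every scalar lane is filled."""
    return (array_width * lanes_per_unit) // channels

def threads_needed(channels, thread_size=16):
    """How many 16-object threads must contribute objects to fill one issue."""
    objs = objects_per_issue(channels)
    return -(-objs // thread_size)  # ceiling division

for channels in (4, 3, 2, 1):
    print(channels, objects_per_issue(channels), threads_needed(channels))
```

A vec3 instruction does not divide the 64 lanes evenly, so one lane is left idle in this model.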
It appears that R600 uses a simplified pipeline. I'm going to guess that MAD takes 2 clocks, for example (i.e. MUL followed by ADD). And some instructions such as the special functions may take two or more clocks, too.
If MAD takes two clocks, it reduces the register file's port count. Instead of having to support 3 operands per clock, the register file only supports 2.
Additionally ATI's older pipeline (R300-R580) supported a fourth operand per clock because it could co-issue MAD with SF (e.g. RSQ - see the first code snippet).
R600's register file trades operand count per clock (i.e. 4 in older GPUs) against thread count (1 in older GPUs). R580, say, supports two key types of maximum operand fetches:
MAD r0.xyz, r1.xyz, r2.xyz, r3.xyz = three 3xfp32 (3*3*32) = 288 bits +
RSQ r6.w, r6.w = 32 bits
10xfp32 = 320 bits total
or:
MAD r0.xyzw, r1.xyzw, r2.xyzw, r3.xyzw = three 4xfp32
12xfp32 = 384 bits total
R600, I guess, only supports one type of operand fetch (i.e. no co-issue), but in different combinations:
MUL r0.x, r1.x, r2.x = two 1xfp32 (2*1*32) = 64 bits
four threads at a time = 8xfp32 = 256 bits
or:
MUL r0.xy, r1.xy, r2.xy = two 2xfp32 (2*2*32) = 128 bits
two threads at a time = 256 bits
or:
MUL r0.xyzw, r1.xyzw, r2.xyzw = two 4xfp32 (2*4*32) = 256 bits
one thread at a time = 256 bits.
Don't forget that you have to multiply these register fetches by the array width. So R580's 12-wide array requires 12x 384 bits = 4608 bits. R600 might have an array width of 16, so that's 4096 bits.
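The bit counts above can be checked mechanically. A minimal sketch, assuming every operand channel is fp32 and that R600 fetches two operands per clock with no co-issue, as guessed in this post (the helper name `fetch_bits` is invented for the sketch):

```python
FP32_BITS = 32

def fetch_bits(operands, channels, objects=1):
    """Register-file bits read per clock for one ALU position."""
    return operands * channels * FP32_BITS * objects

# R580: MAD on xyz co-issued with a scalar RSQ, or a full vec4 MAD.
r580_coissue = fetch_bits(3, 3) + fetch_bits(1, 1)
r580_vec4 = fetch_bits(3, 4)

# R600 (guess): two-operand MUL, packing 4, 2 or 1 objects per vec4 position.
r600_cases = [fetch_bits(2, 1, objects=4),
              fetch_bits(2, 2, objects=2),
              fetch_bits(2, 4, objects=1)]

# Multiply by array width: 12-wide R580 versus a hypothetical 16-wide R600.
r580_array = 12 * r580_vec4
r600_array = 16 * r600_cases[0]

print(r580_coissue, r580_vec4, r600_cases, r580_array, r600_array)
```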
---
It may be that R600 actually runs even more slowly than this, in order to ease fetching from the register file. It might be that it can only fetch 1 operand per clock (so that MUL takes 2 clocks or MAD takes 3 clocks). The motivation is to simplify accessing the register file.
One factor I haven't mentioned so far (a major part of the first patent I linked earlier) is that an operand can vary in size: e.g. 32, 64 or 128 bits and you can view these different combinations as encompassing different channel counts per operand. i.e. fetching a 64-bit operand from the register file is actually fetching a 2-channel fp32, e.g. r0.xy, etc. And if you note the "free" alignment that the patent discusses, you can see that fetching r0.yz is no more difficult than fetching r0.xy.
In older GPUs I assume that when r0.yz is requested, the register file actually supplies r0.xyzw and the ALU simply ignores x and w. This wastes register file bandwidth.
The key thing seems to be that R600's register file is only producing the bits you request. This, I suspect, provides a huge boost in register file bandwidth utilisation. G80 achieves something similar with its scalar fetches from the register file, where channels are never fetched if they're not needed.
So I can't be sure, but I wouldn't be surprised if R600 fetches only one operand per clock. This would be some compensation for the somewhat nutty register file access patterns that "random" packings of objects into each instruction-issue cause.
We might see 32-wide vec4 ALU arrays in R600 then.
Thinking about ALU complexity (i.e. special functions versus ADD or MUL) and the implicit pipeline stages required to support these different instructions makes me cautious about this though. A pipeline that operates as a 2-clock ADD or a 1-clock RSQ seems pretty unlikely if I'm honest (RSQ, I guess, takes more pipeline stages, so ADD would seem to be wastefully slow). But that's in the mysteries of ALU pipelining...
So, on balance, I suspect R600 fetches 2 operands per clock.
Jawed
nicolasb
12-Dec-2006, 10:47
R600 graphics cards are still on schedule for an early Q1 2007 launch
"Early" Q1 sounds interesting. That (to me) implies January rather than February or March.
"Early" Q1 sounds interesting. That (to me) implies January rather than February or March.
Ati's Summer also lasts 'till November..
And another thing, if there's only 4 FLOPs per ALU, you need a lot of ALUs to get ~500GFLOPs...
Sort of 1GHz, 128 ALUs...
Jawed
And another thing, if there's only 4 FLOPs per ALU, you need a lot of ALUs to get ~500GFLOPs...
Sort of 1GHz, 128 ALUs...
Jawed
Would 900Mhz do?
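For reference, the arithmetic behind those two posts, reading "4 FLOPs per ALU" as 4 FLOPs per ALU per clock and taking the 128-ALU count as the speculation it is (the function name is made up for this sketch):

```python
def peak_gflops(alus, flops_per_alu_per_clock, clock_mhz):
    """Peak throughput in GFLOPs for a given ALU count and clock."""
    return alus * flops_per_alu_per_clock * clock_mhz / 1000

at_1ghz = peak_gflops(128, 4, 1000)
at_900mhz = peak_gflops(128, 4, 900)
print(at_1ghz, at_900mhz)
```

With those assumptions 900MHz falls a little short of 500 GFLOPs, while 1GHz lands just above it.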
Isn't the "business" Q1 actually the Q2 in reality? March/April to me.
Isn't the "business" Q1 actually the Q2 in reality? March/April to me.
It depends on the company I guess.. I can imagine ATI having a completely different Q1 than AMD...
Quarters change depending on when the company did its initial filings. But Q2 for ATi last year was December to Feb. They might not be talking about fiscal quarters though.
Repost beware! doh.
The complete pre-CES blurb from Gary (http://anandtech.com/tradeshows/showdoc.aspx?i=2887)
What did we learn? Unfortunately, some of the more exciting items are under NDA until CES but we did pick up a few tidbits here and there. It appears the AMD/ATI R600 graphics cards are still on schedule for an early Q1 2007 launch and should provide some very serious competition to the GeForce 8800 series. However, all of the expected benefits and performance improvements of this release will also bring some serious power requirements. We heard power consumption numbers hovering around 430~450W for the high-end CrossFire setup while under full load. Those are power requirements just for the cards according to our sources who said the first silicon spins actually consumed even more power. What the final numbers will be is anyone's guess but be prepared to start looking at 800W+ power supplies in the near future if you want to run extreme performance GPU configurations.
This is a complete 180° to pcpop's claims and we're back to square one. :huh:
That was posted on this very page, just a few posts above :)
Repost beware! doh.
The complete pre-CES blurb from Gary (http://anandtech.com/tradeshows/showdoc.aspx?i=2887)
This is a complete 180° to pcpop's claims and we're back to square one. :huh:
I suspect the talk of an 800MHz GDDR3 card relates to an XT, not an XTX.
Jawed
I suspect the talk of an 800MHz GDDR3 card relates to an XT, not an XTX.
Jawed
Hard to believe that, 800 is already very much for such a huge chip.
Hard to believe that, 800 is already very much for such a huge chip.
But why's the cooler supposedly ginormous for a 100W chip?...
Best estimates put the 100W chip and GDDR3 along with the rest of the board in the region of 150-160W - far short of 215W+ required for the reported CrossFire configuration.
So, if there's room for another 50W+, then I can't help thinking there's a ~1GHz/GDDR4 variant floating around too.
Jawed
Ailuros
12-Dec-2006, 13:37
Hard to believe that, 800 is already very much for such a huge chip.
Well 800MHz@(hypothetical) 512bit gives already 102.4GB/s bandwidth. That's a significant raise in bandwidth already even compared to the X1950XTX.
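The 102.4GB/s figure works out if the 800MHz memory clock is double data rate, which GDDR3 is. A quick check, with the X1950XTX comparison figures (256-bit bus, 1000MHz GDDR4) filled in as an assumption rather than taken from this thread:

```python
def bandwidth_gb_s(mem_clock_mhz, bus_bits, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes), DDR by default."""
    bytes_per_transfer = bus_bits // 8
    return mem_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9

r600_guess = bandwidth_gb_s(800, 512)   # hypothetical 512-bit R600
x1950xtx = bandwidth_gb_s(1000, 256)    # assumed X1950XTX configuration
print(r600_guess, x1950xtx)
```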
Nanofoil cooling, huh?
http://www.hexus.net/content/item.php?item=7436
Scorchio!
Jawed
128 alu's, 256 layers of aluminium and copper!
Dave Baumann
12-Dec-2006, 14:46
Isn't the "business" Q1 actually the Q2 in reality?
Financials wise, the Visual and Media Business of AMD (former ATI) is fully rolled up into AMD's accounting period, which happens to be calendar based (IIRC).
Financials wise, the Visual and Media Business of AMD (former ATI) is fully rolled up into AMD's accounting period, which happens to be calendar based (IIRC).
This appears to be nearly true. They seem to have a wrinkle where they pick a Sunday near to the actual CY quarter to call it the FY quarter end. So 3Q was Oct 1, and 2Q was Jul 2, and 1Q was Mar 26. All Sundays. CY would have been Sep 30, Jun 30, and Mar 31.
Nanofoil cooling, huh?
http://www.hexus.net/content/item.php?item=7436
Scorchio!
Jawed
So how reliable is Hexus these days? I'm not so sure I want to sleep next to a damn reactor called R600. :???:
Hopefully someone was kidding with "Bob".
Of course, while each ALU might only do one or two operand fetches per clock (and one operand store), we don't know what the organization of those ALUs is. If the workset is flat across a set of ALUs, how many ALUs is that, what load is placed on the register file, and how is it dealt with? I have to wonder if that is really cheaper than going full MIMD with per-ALU dedicated registers....
Also, memory coherence doesn't seem like a slam dunk. Randomly associating (eg) add instructions from multiple fragments at multiple PCs doesn't seem like it would encourage fragments to stick together. However, if you let batch size grow and combine work based on PC only within a particular fragment kernel, that would seem to get most of the benefit....
Of course, large batch sizes encourage one to consider the problem from a streaming point of view, rather than register access. Perhaps the scheduler actually bins workloads based on operators. It reads a chunk of work in whatever width it would have normally done (a certain amount of over-read for branched code is expected). It sends work and operands into slots for each sort of operator. When one of the bins is full, the instruction dispatches, that bin empties, and the scheduler continues....
As long as you aren't terribly branched (ie: more than just a single level of bifurcation), that could work pretty darned well.
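The binning idea in this post can be sketched as a toy scheduler. Everything here is hypothetical (the class name, the 16-wide bins, the string opcodes); it only shows the "fill a bin per operator, dispatch when full" mechanism and ignores operand fetch and result write-back entirely.

```python
from collections import defaultdict

class OperatorBinScheduler:
    """Toy model: one bin per operator, dispatched as soon as it is full."""

    def __init__(self, issue_width=16):
        self.issue_width = issue_width
        self.bins = defaultdict(list)   # operator -> pending object ids
        self.dispatched = []            # (operator, object ids) in issue order

    def submit(self, op, obj_id, active=True):
        if not active:
            return  # over-read work that is predicated out is simply dropped
        pending = self.bins[op]
        pending.append(obj_id)
        if len(pending) == self.issue_width:
            self.dispatched.append((op, list(pending)))
            pending.clear()

sched = OperatorBinScheduler(issue_width=16)
for i in range(20):
    sched.submit("ADD", i)
    if i < 16:
        sched.submit("MUL", 100 + i)

print(len(sched.dispatched), len(sched.bins["ADD"]))
```

In this run the ADD bin fills first and dispatches, the MUL bin dispatches right after, and four ADDs are left waiting for more work to arrive.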
trinibwoy
12-Dec-2006, 17:03
Maximisation of ALU utilisation, coupled with increased coherency of code execution in dynamic branching.
It usually takes me a while to get my head around these things so please bear with me :)
If I'm reading you guys correctly the proposal is that per-clock R600 may be capable of accumulating instructions and associated operands (between 2 and 8 FP32's per instruction assuming a 2-cycle MADD?) from 1-4 "threads" to fill all available slots in an ALU.
In this organization what exactly is a "thread" ? I can imagine a situation where you have the typical thread consisting of multiple vertices/fragments but co-issuing the same scalar/vec2 instruction for multiple vertices/fragments from the same thread. Is that what is being described or is it more elaborate/flexible than that?
The utilization gain is somewhat obvious but what happens with vec3 instructions - is the ALU still co-issuable for instructions from the same fragment/vertex for a vec3 + scalar mix?
Still trying to understand how branching is affected....
trinibwoy
12-Dec-2006, 17:12
Nanofoil cooling, huh?
http://www.hexus.net/content/item.php?item=7436
Scorchio!
It's a bit early for April Fool's jokes isn't it :)
Of course, while each ALU might only do one or two operand fetches per clock (and one operand store), we don't know what the organization of those ALUs is. If the workset is flat across a set of ALUs, how many ALUs is that, what load is placed on the register file, and how is it dealt with? I have to wonder if that is really cheaper than going full MIMD with per-ALU dedicated registers....
I expect each array of ALUs to have a localised register file, similar to R5xx and G80.
Also, while I like to talk about "random" shuffling to fill an instruction-issue, it could well be far less sophisticated.
For example, if the array is 16 wide, it might be arranged as four quads. Substitution might only work at the quad level, thus reducing the randomness of register file access as well as hindering utilisation in worst-case branching or KILLed objects.
Also, memory coherence doesn't seem like a slam dunk. Randomly associating (eg) add instructions from multiple fragments at multiple PCs doesn't seem like it would encourage fragments to stick together.
No it definitely doesn't. We don't know if R600 will split quads. I think you aren't allowed to play with the texture coordinates within a dynamic branch (e.g. within a loop). This is a subject I'm not much good at.
As you know I was sceptical about this "shuffling" when we discussed it a while back before G80 was released - particularly because of the coherency and sequencing problems as well as being forced to perform random accesses against the register file. Even if each quad is kept coherent, amongst a set of quads (making up one or more batches) you're still left with a problem.
I think R600 needs some kind of texture re-ordering unit. In more general terms, all access to memory (for vertices, constants, textures and render targets) needs to be ordered for coherency, with a bit of pre-fetch salt sprinkled on top. I think this is what DiGuru was referring to when he described the virtual memory support that's required of D3D10. Since the virtual memory system has to be able to deal with cached, non-cached and distant (i.e. system, not local, memory) pages, it needs to be able to optimise access patterns and measure demand. Obviously there's a disconnect here, because textures come in blocks that are much smaller than what we're used to thinking of as "pages" - but the reality with D3D10 is that virtual memory paging is explicitly designed to support textures (e.g. which portions of the mip-map chain?).
So, between the client (e.g. TMU) and the virtual memory system there's got to be some give and take - so I guess some kind of tolerance within the TMU for results that don't come in the order they were requested.
In terms of the register file and execution pipeline coherence, "random" shuffling seems to solve its own problem. With G80 I don't think we ever discussed it in terms of being "random". Put another way, it seemed more constrained because back then we thought in terms of shuffling objects that all shared a common PC. I think multiple-PCs is the solution, because it increases the size of your possible solution set.
I suppose it's like bin-packing: before we were thinking that we were only allowed to pack blocks of one colour into the bin at any one time, e.g. only yellow blocks. Now we find we can combine yellow, cyan, magenta etc. blocks to pack the bin (colour=PC). Yippee!
The whole thing seems to be dependent on a less greedy pipeline. I'm guessing it only requests 2 operands per clock, which is significantly lower than R5xx or G7x, making it less intolerable.
However, if you let batch size grow and combine work based on PC only within a particular fragment kernel, that would seem to get most of the benefit....
I think PC fragmentation builds up extremely quickly once you have a few levels of nested DB. Before you know it, you've got a thinly spread set of ADDs, all with different PCs - none of which want to co-operate.
If you read the CTM manual, you'll find that DB is not a second class citizen. I think it's safe to say that ATI did this with a keen eye on making DB work even more fluidly.
Of course, large batch sizes encourage one to consider the problem from a streaming point of view, rather than register access. Perhaps the scheduler actually bins workloads based on operators. It reads a chunk of work in whatever width it would have normally done (a certain amount of over-read for branched code is expected). It sends work and operands into slots for each sort of operator. When one of the bins is full, the instruction dispatches, that bin empties, and the scheduler continues....
As long as you aren't terribly branched (ie: more than just a single level of bifurcation), that could work pretty darned well.
If you think of the predication bit for each object in a thread as being part of the PC (e.g. tack it onto the end as the least significant bit), then I don't see any problem with branch depth. Just execute the PCs that are "odd" and ignore the "even" ones!
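A minimal sketch of that tagging trick, assuming each object carries a PC and a one-bit predicate (the names `tag` and `runnable_groups` are invented here): the predicate becomes the least significant bit of the tag, and only odd tags are gathered for execution.

```python
from collections import defaultdict

def tag(pc, predicate_bit):
    """Append the predicate as the least significant bit of the PC."""
    return (pc << 1) | predicate_bit

def runnable_groups(objects):
    """objects: iterable of (obj_id, pc, predicate_bit). Keep only odd tags."""
    groups = defaultdict(list)
    for obj_id, pc, pred in objects:
        t = tag(pc, pred)
        if t & 1:  # predicated-in objects only
            groups[t].append(obj_id)
    return dict(groups)

objects = [(0, 10, 1), (1, 10, 0), (2, 42, 1), (3, 10, 1)]
print(runnable_groups(objects))
```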
Jawed
Dave Baumann
12-Dec-2006, 17:41
This appears to be nearly true. They seem to have a wrinkle where they pick a Sunday near to the actual CY quarter to call it the FY quarter end. So 3Q was Oct 1, and 2Q was Jul 2, and 1Q was Mar 26. All Sundays. CY would have been Sep 30, Jun 30, and Mar 31.
That's based on the start of the week numbers. 13 weeks per Q.
trinibwoy
12-Dec-2006, 17:50
This is a question about dynamic branching in general. Why is it not feasible today to just avoid sending a subset of a batch down the "null" branch ? Why is it that all pixels/vertices in a batch are forced to take every branch?
kemosabe
12-Dec-2006, 17:51
It's a bit early for April Fool's jokes isn't it :)
You ain't kidding. How 'reliable' does this (http://www.hexus.net/content/item.php?item=7437) one sound?
"We don't know the state of play with the A13 spin but do understand that ATi has got A12 running at a speed of 1GHz.
That's around half the target speed of the production version..."
Whoa there Willy! :lol:
wishiknew
12-Dec-2006, 18:56
"... forgot to connect the pins ..."
Did I really lose 5 months of my life.
In this organization what exactly is a "thread" ?
Using R5xx terminology, a thread is a group of objects that share a program counter (PC). Additionally, a thread is constrained by the instruction-issue configuration of the execution pipeline: in R580's pixel shader, that's 4 clocks for 12 fragments at a time = 48 fragments.
In R600 it seems there's actually little meaning to a thread. The second constraint (clock width * array width) still carries weight though - R600 might use the same 4-clock schedule, for example. In R300-R480 it seems that a fragment thread is defined by the size of a tile in screen space - but that is a guess (16x16 fragments: 64 quads, 256 fragments). I expect that R600 will utilise screen-space tiling. So the size of a tile puts a potential upper bound on a "thread". But in R580 it's already pretty soft, so getting softer...
In my earlier postings, if you read the 16-wide array as defining a "thread", everything hangs together quite nicely (a thread is executed in one clock for 16 objects per instruction). But the real definition of a thread in R600 will depend on the type of object (vertex, primitive or fragment) and the capability of the units that generate them or have to buffer them. e.g. the rasteriser might issue 32 fragments per clock, so that might be a "thread".
For the purposes of all this shuffling, "thread" mostly depends on PC. Since shuffling is agnostic about PC (it's only interested in instruction and channel-count) you might argue that a thread that starts out as 256 fragments gradually shrinks as dynamic branching nests deeper and deeper. Each DB level becomes its own "sub-thread" - but the scheduler doesn't care: an ADD at the 10th level can be issued alongside an ADD at the 4th level.
Ultimately "thread" (in the sense of a "batch" as I've been using it) merely connotes the fact that objects were "created together" (or read in from memory together, e.g. a set of vertices). There is coherency implicit in "thread", simply because GPUs like to make coherent use of memory. But as GPUs get more complex they're expected to spend less of their time making neatly-square accesses to memory simply because more advanced data structures and algorithms depend on incoherency. Ray tracing is a great example of incoherency...
I can imagine a situation where you have the typical thread consisting of multiple vertices/fragments but co-issuing the same scalar/vec2 instruction for multiple vertices/fragments from the same thread. Is that what is being described or is it more elaborate/flexible than that?
It doesn't matter where vertices/fragments come from. It only matters that they currently have a common instruction waiting to be executed, and that there are enough of them to form an execution "line" in the array.
So, if you have some DB:
penumbra=0
if "pixel in shadow penumbra"
for 16 samples
get sample
ADD sample to penumbra
next
end if
and a 16-fragment thread has the following predication pattern based on that if statement (I've put the count of 1s at the end):
0011100001111100 8
you only have 8 ADDs you want to run. But there'll be other threads running this particular shader, e.g. with the following predication:
0000011000000111 5
0111111111111111 15
0110110110111001 10
0011001100110110 8
0111010111001001 9
0100001100010100 5
0101010111000011 8
0111111110010101 11
So you can just gather together the "1"s into 16-wide lines to execute. That ADD is actually a scalar operation, so you need to make up 64-wide lines in reality. So, that's a line of 64 and a line of 15, which is 2 lines * 16 loop iterations = 32 clocks. In R580 a line is 12 fragments, so that set of 144 fragments requires 12 lines * 16 iterations = 192 clocks :!:
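The clock counts above can be reproduced from the nine predication patterns. A sketch, assuming a 64-lane line for the scalar ADD on the hypothetical R600 array and fixed 12-fragment lines on R580, with 16 loop iterations (function names are illustrative only):

```python
import math

masks = [
    "0011100001111100", "0000011000000111", "0111111111111111",
    "0110110110111001", "0011001100110110", "0111010111001001",
    "0100001100010100", "0101010111000011", "0111111110010101",
]

def clocks_with_gathering(masks, line_width, iterations):
    """Only predicated-in objects are packed into lines."""
    active = sum(m.count("1") for m in masks)
    return math.ceil(active / line_width) * iterations

def clocks_fixed_lines(masks, line_width, iterations):
    """Every object occupies a slot, whether predicated in or not."""
    total = sum(len(m) for m in masks)
    return math.ceil(total / line_width) * iterations

print(sum(m.count("1") for m in masks))          # active objects
print(clocks_with_gathering(masks, 64, 16))      # gathered, 64 scalar lanes
print(clocks_fixed_lines(masks, 12, 16))         # R580-style fixed lines
```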
It's a bit of a silly example because shadow penumbras don't often come that finely spaced, but I'm too lazy to make a bigger grid.
The utilization gain is somewhat obvious but what happens with vec3 instructions - is the ALU still co-issuable for instructions from the same fragment/vertex for a vec3 + scalar mix?
Teehee, I left out vec3 earlier deliberately cos it's messy and potentially quite unruly.
In theory the register file supports the degree of "randomness" that's required to fetch 21 vec3s (21 x 96 bits). But 21 is a nasty number compared to 16, 32 or 64, so I dunno, maybe there's a gotcha in there somewhere.
By the way, I'm expecting that co-issue will not be a feature of R600. It simplifies operand fetch and makes the ALU simpler and reduces "lost utilisation" in those cases where some loss is inevitable. It's an extension of only fetching 2 operands and being unable to do a single-cycle MAD.
Clearly, though, this is me just guessing. For all I know, R600 is similar to G80 with a primary ALU and an SF ALU...
Jawed
This is a question about dynamic branching in general. Why is it not feasible today to just avoid sending a subset of a batch down the "null" branch ? Why is it that all pixels/vertices in a batch are forced to take every branch?
Because a GPU is SIMD in the pixel shader pipelines. Historically, once the "lines" for execution in the SIMD array have been determined, you're stuck with them.
As the batch size gets smaller, the chances increase that all the fragments want to go the same way, whether that's null or not. So performance only increases when the entire batch is coherent and then your performance gain is dependent on the proportion of batches that are coherent and the average instruction-clock performance gain.
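To put rough numbers on the batch-size point: if each fragment took the branch independently with probability p (a deliberately crude assumption, since neighbouring fragments are usually correlated and real batches do much better than this), the chance that a whole batch goes the same way is p^n + (1-p)^n. The function name is made up for the sketch.

```python
def coherent_probability(p_taken, batch_size):
    """Chance that every fragment in a batch takes the same side of a branch,
    assuming independent fragments (a simplification, see above)."""
    return p_taken ** batch_size + (1 - p_taken) ** batch_size

for batch_size in (48, 16, 4):
    print(batch_size, coherent_probability(0.9, batch_size))
```

Even under this pessimistic model the trend matches the post: the smaller the batch, the more often it is fully coherent and can skip the untaken side.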
Vertex pipelines in SM3 GPUs seem to be MIMD, i.e. entirely independent of each other and not subject to coherency probability.
Jawed
"... forgot to connect the pins ..."
Did I really lose 5 months of my life.
Even if true no more than 2 months, I'd think. Not that it'd be a good thing, mind you.
Otoh, this is the same article that's predicting a 2GHz R600, apparently. Now, I know I'm 0-1 at chortling at insane-looking clock speeds ("up to 1.5GHZ", we all howled in laughter at the G80 rumor that turned out to be 1.35GHz shaders), but I'm still thinking 2GHz is out of reach. Particularly since ATI hasn't shown any evidence of clock domaining yet.
I'm still thinking 2GHz is out of reach. Particularly since ATI hasn't shown any evidence of clock domaining yet.
Fast14?:smile:
http://www.beyond3d.com/forum/showthread.php?p=854679&highlight=fast14#post854679
Prolly Rys messing with our heads, just because he can.
Jawed
Voltron
12-Dec-2006, 20:29
Or maybe just with liquid nitrogen? Maybe ATI has done some custom layout as well. Early R420 rumors underestimated the 16 ROPs.
Either way though, if it is true that final production has not begun or has only just begun, when is the soonest boards will be in the hands of consumers?
I assume the cycle time for a high end chip with 9 layers or more (is TSMC adding layers to 80 nm?) is around 12 weeks. Maybe they have brought it down a little, but I am thinking if production begins today you are looking at mid-March, minimum, for retail boards.
It's interesting at this juncture to compare the FUCK count for G80 and R600.
G80:
unified
way over 1GHz
scalar
R600:
512-bit
crazy instruction-binning architecture
2GHz targeted
:shock: Damn, this is funny... even if R600 is very much rumour
Jawed
kemosabe
12-Dec-2006, 20:43
Hasn't Rys previously claimed "no editorial control" over Deeplung's reports? He might want to reiterate, cause Willy sounds like a Fudo challenger today (http://www.hexus.net/content/item.php?item=7439).
Only 1250MHz to go. :lol::roll:
R300King!
12-Dec-2006, 21:01
A little more at Hexus.
Seemingly, ATi has been aiming to achieve 11,000 3DMark scores at 750MHz. That's tasty and may mean the R600 offers a strong challenge to NVIDIA's G80 GPU given that, as we reported in HEXUS.Beanz a few hours ago, ATi's already got samples running at 1GHz and has higher targets in mind.
http://www.hexus.net/content/item.php?item=7439
That's another source to back up what Razor reported earlier... ;)
If you think of the predication bit for each object in a thread as being part of the PC (e.g. tack it onto the end as the least significant bit), then I don't see any problem with branch depth. Just execute the PCs that are "odd" and ignore the "even" ones!
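A minimal sketch of that idea, assuming the per-object predicate really is tacked onto the PC as its least significant bit. The function names and the four-object thread are made up for illustration and say nothing about how R600 actually does it.

```python
def tag_pc(pc: int, predicate: bool) -> int:
    """Append the predicate as the least significant bit of the PC."""
    return (pc << 1) | int(predicate)


def objects_to_execute(tagged_pcs: list) -> list:
    """Return indices of objects whose tagged PC is odd (predicate set)."""
    return [i for i, tagged in enumerate(tagged_pcs) if tagged & 1]


# Four objects in one thread, all at PC 42, with diverging predicates.
predicates = [True, False, True, True]
tagged = [tag_pc(42, p) for p in predicates]
active = objects_to_execute(tagged)
```

Objects with an "even" tagged PC are simply skipped, so nested branches need no extra bookkeeping beyond the bit itself in this toy version.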
Jawed
The issue is with how registers are fetched. If register data comes from a unified pool of memory, how do you get a large set of random 32bit values? My proposal was to read a continuous set of registers, ignore the predicated/0 values, and store the results in a per-operator work-queue. When the operator work queue is full, all operands are read and can be dumped from the queue into the ALUs. A problem occurs when the majority of your bandwidth is spent reading data that is predicated/0, which can occur when threads diverge in a significant way.
The other problem is result storing, but, I was hoping someone else would see a solution for that :)
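To put rough numbers on the bandwidth problem, here is a toy model of the read-side proposal above: scan a contiguous run of registers, drop the predicated-off values, and stop when the work queue is full. The queue depth of 16 and all names are assumptions for illustration only.

```python
def fill_work_queue(registers, predicates, queue_depth=16):
    """Scan a contiguous register range, queueing only live operands.

    Returns (queue, reads, wasted), where `reads` counts every register
    fetched and `wasted` counts fetches discarded because the object was
    predicated off.
    """
    queue, reads, wasted = [], 0, 0
    for value, live in zip(registers, predicates):
        if len(queue) == queue_depth:
            break  # queue is full: operands can be dumped into the ALUs
        reads += 1
        if live:
            queue.append(value)
        else:
            wasted += 1
    return queue, reads, wasted


# Coherent case: every object is live, so no bandwidth is wasted.
regs = list(range(64))
queue, reads, wasted = fill_work_queue(regs, [True] * 64)

# Divergent case: only one object in four is live.
sparse = [i % 4 == 0 for i in range(64)]
queue_d, reads_d, wasted_d = fill_work_queue(regs, sparse)
```

In the divergent case most of the reads are thrown away before the queue fills, which is exactly the situation described above.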
So if they are aiming at 11,000 3DMarks @ 750MHz, what are they aiming for at 2GHz then? 30,000? :lol:
shaders at 2Ghz, rops at 750Mhz? :smile:
Would be a weee more plausible.
trinibwoy
12-Dec-2006, 21:57
wtf does this even mean :grin:
With the G80, NVIDIA is majoring on texture units. The G80 pre-calculates loads of answers and stores them in a texture unit. This will be the best way for NVIDIA to overcome the G80's lack of maths power compared to R600 and we await with more than a little interest how well it can achieve that aim.
wtf does this even mean :grin:
Sometimes it seems some folks think R600 is the traditional definition of a camel --"a horse built by committee". So it's got a 512-bit bus and huge 2GHz driven math power. . .but one texture addresser (that it probably dual-uses for z-cull or something). . .and probably on PCIe-1x bus. :lol:
wishiknew
12-Dec-2006, 22:07
Even if true no more than 2 months, I'd think.
Cross your fingers ATI won't have to do a presentation on why the R600 was late like the R520 was.
Sometimes it seems some folks think R600 is the traditional definition of a camel --"a horse built by committee". So it's got a 512-bit bus and huge 2GHz driven math power. . .but one texture addresser (that it probably dual-uses for z-cull or something). . .and probably on PCIe-1x bus. :lol:
Hmm.. 430 Watts.. that's 10 for the memory, 180 for the GPU, 40 for the rest of the board and 200 for the cooling..
wtf does this even mean :grin:
With the G80, NVIDIA is majoring on texture units. The G80 pre-calculates loads of answers and stores them in a texture unit. This will be the best way for NVIDIA to overcome the G80's lack of maths power compared to R600 and we await with more than a little interest how well it can achieve that aim.
Guessing it has to do with interpolants?
Generally we think of interpolants in NVidia hardware as being calculated as they're needed. Perhaps, even, being calculated more than once, for each time they're needed :???: .
Alternatively you can calculate them once and store them somewhere. It takes a chunk of constant memory (texture memory) to store them, but then you can get these results rapidly when they're needed without wasting any more time computing them.
That's the only thing I can think of...
I think, historically, ATI's GPUs have calculated and stored interpolants, as part of rasterisation (i.e. using dedicated hardware), so that they don't take any computing power while a fragment shader is running. It's expensive to store all this data though.
Jawed
trinibwoy
12-Dec-2006, 22:51
Yeah but for something like that we're not expecting G80 to have any kind of significant advantage are we? Sure it's got a lot of filtering power to spare but in terms of pure texture sampling it's only 32 @ 575Mhz.
Yeah but for something like that we're not expecting G80 to have any kind of significant advantage are we? Sure it's got a lot of filtering power to spare but in terms of pure texture sampling it's only 32 @ 575Mhz.
Two things:
we know sod all how G80 supports constant buffers and texture buffers
I suspect if this data was retrieved by a texture pipeline that no "addressing" in TA is required, so you have 64 pipelines at your beck and call
Clearly I'm reaching :lol:
Jawed
Hasn't Rys previously claimed "no editorial control" over Deeplung's reports? He might want to reiterate, cause Willy sounds like a Fudo challenger today (http://www.hexus.net/content/item.php?item=7439).
Only 1250MHz to go. :lol::roll:
I don't work for Hexus any more, B3D full time now thankyouplease :!: So those stories have zero to do with moi, although I did explain to someone there how to decode ATI chip revisions not long ago which made it in there. 1GHz with sampling silicon and 2GHz target are pretty laughable, IMO.
3dilettante
12-Dec-2006, 23:13
It's just latency. You use a large number of threads to hide latency, so the hang doesn't affect throughput.
Jawed
Assuming on-chip storage used to keep a list of the pending instructions, registers, and destinations is non-zero in size, the longer a grouping waits adds more than just latency. In a high-load situation, where large groupings of pending ops are waiting for a handful of lagging operations, it directly reduces the effectiveness of the threading hardware.
I'm assuming the threading capabilities of R600 are finite, of course. If the scheduling hardware is so over-engineered that it can handle more threads than can ever be put in-flight by the control processor, this may not be a problem. It probably won't clock too terribly high, but that's life.
I can imagine a number of implementations, some of which strike me as being more feasible than others.
On the other hand:
I'm having trouble parsing what is being discussed. If there were a few more details, or maybe a handy ASCII diagram, I might be more sure of whose version/interpretation I'm trying to address.
Some questions:
What's the hierarchy within this design: command processor>array controller>array sub-unit?
Where exactly are the ops (or counters or pointers) we are moving around being stored?
Are they sorted at the command processor level, controller level, ALU level?
What are the steps exactly that each op goes through prior to issue and afterwards?
What talks to what, and when, and how often?
Memory access units: are they tightly coupled or decoupled in a manner similar to G80?
Other piddling details, like what happens if a fixed-size bucket takes a long time to fill. Is there a time limit before the array just executes the valid ops and negates the invalid ones?
Is 1,000 cycles too long a wait for the 16th op to be valid? If we're talking virtual memory access to the system over the bus and through a northbridge, that's a real issue.
trinibwoy
12-Dec-2006, 23:16
It doesn't matter where vertices/fragments come from. It only matters that they currently have a common instruction waiting to be executed, and that there are enough of them to form an execution "line" in the array.
I obviously don't know the implementation details of this approach but it all sounds very hairy. Wouldn't there be a potentially insane amount of latency involved in gathering up each batch of instructions to issue each clock? I just have a gut feeling that there's a simpler approach to incoherent processing. But given how I felt after lunch today my gut might not be doing its best right now :)
The issue is with how registers are fetched.
And one day I might really have a go at understanding G80's register fetching and those ATI patent applications I posted earlier.
They're both doing magical stuff - I'm sorta resigned to thinking "it just works". :lol: No matter how dubious the whole escapade seems.
If register data comes from a unified pool of memory, how do you get a large set of random 32bit values?
Well my main point is that if you simplify the pipeline you reduce the need to build a byzantine fetcher. Fetching 2 operands per clock seems a lot simpler than fetching 4 (or more, I actually suspect R300-R580 might be fetching 6 or more operands per clock, because of the secondary ADD ALU...)
My proposal was to read a continuous set of registers, ignore the predicated/0 values, and store the results in a per-operator work-queue. When the operator work queue is full, all operands are read and can be dumped from the queue into the ALUs. A problem occurs when the majority of your bandwidth is spent reading data that is predicated/0, which can occur when threads diverge in a significant way.
I think that's the deal breaker. If the issuer is allowed to choose from any four threads (of 16 objects), say, when it's binning, you could say that puts a cap on the amount of bandwidth that's needed. Then, if your binning can only fill 12 out of 16 vec4 pipelines it's hard luck, the bin is submitted.
:???:
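A rough sketch of that capped binning, assuming the issuer surveys at most four threads of 16 objects each and submits whatever it has gathered, full or not. The data layout and the `build_bin` name are hypothetical.

```python
def build_bin(threads, instruction, max_threads=4, lanes=16):
    """Collect objects waiting on `instruction` from at most `max_threads` threads.

    `threads` is a list of lists; each inner list holds the pending
    instruction for every object in that thread.
    Returns (thread_index, object_index) pairs, at most `lanes` long.
    """
    chosen = []
    for t, pending in enumerate(threads[:max_threads]):
        for o, op in enumerate(pending):
            if op == instruction and len(chosen) < lanes:
                chosen.append((t, o))
    return chosen


# Four threads of 16 objects; only three objects per thread want "MUL".
threads = [["MUL" if o < 3 else "ADD" for o in range(16)] for _ in range(4)]
mul_bin = build_bin(threads, "MUL")
utilisation = len(mul_bin) / 16
```

Here the bin goes out with 12 of 16 lanes filled, the "hard luck" case: the cap bounds how much register bandwidth the survey can consume, at the price of some idle lanes.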
Here's another patent application:
Method and apparatus for static single assignment form dead code elimination (http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&r=1&f=G&l=50&d=PG01&S1=20050166194.PGNR.&OS=DN/20050166194&RS=DN/20050166194)
Great title, huh?...
Jawed
Another un :!: (http://www.hexus.net/content/item.php?item=7441), so February 14th is the day. <3
I obviously don't know the implementation details of this approach but it all sounds very hairy.
I'm in speculation mode too. DiGuru's been waving a green flag, so I know I'm on the right track. Trouble is, is it the full-length Nurburgring, where one lap's exhausting enough, let alone an entire race?...
Wouldn't there be a potentially insane amount of latency involved in gathering up each batch of instructions to issue each clock? I just have a gut feeling that there's a simpler approach to incoherent processing. But given how I felt after lunch today my gut might not be doing its best right now :)
Yeah, there needs to be a cap on the parameters at play here:
how long to wait for other threads to provide instructions which meet your binning criteria
how many threads to survey, to identify binnable instructions
how fine-grained is binning (per object, per quad of objects...)?
I do think it's worth pointing out that R580 has 128 threads in flight in each shader pipeline. That's 512 clocks of latency hiding in an architecture that prolly only needs to hide ~150 clocks of texture latency at any one time. Is that over-engineered? How serious are other forms of latency? What's the real world worst-case and how well does R580 cope?
R580 is quite happy running shader programs with 128 registers assigned per fragment, despite the fact SM3 only requires 32 registers. More over-engineering?
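The latency-hiding figure quoted for R580 works out as below. The 4 clocks per thread is an assumption, inferred only from 512 / 128; the other two numbers are the ones given above.

```python
threads_in_flight = 128        # per shader pipeline, as stated in the post
clocks_per_thread_issue = 4    # assumption: implied by 512 / 128
texture_latency = 150          # approximate figure from the post

# Total clocks that pass before the first thread gets its next turn.
latency_hidden = threads_in_flight * clocks_per_thread_issue

# How many times over the estimated texture latency is covered.
headroom = latency_hidden / texture_latency

# Fewest threads that would still cover the texture latency (ceiling division).
min_threads = -(-texture_latency // clocks_per_thread_issue)
```

On these numbers the texture latency is covered more than three times over, which is the basis for asking whether it is over-engineered or whether other latencies eat the slack.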
(A precis for my forthcoming answer to 3dilettante here...)
Jawed
"into the hands of the press". That probably means more like March 7 for NDA lift, based on previous precedent for a major architecture shift. Edit: Ah, actually the article suggests March 15th would be the earliest NDA lift.
But would they round up the press on Valentine's day for an editors day, while pissed off sweetums are prepared to make their lives hell when they return?
Edit: This sudden round of multiple R600 posts out of Hexus smells to me like either they are full of it (hope not), or somebody at AMD pulled the trigger on the latest rev as ready for public consumption and now the serious leaking gets going.
INKster
13-Dec-2006, 00:18
March availability seems way too late for a launch window.:???:
Vista consumer editions will be out for a month by then, with lots of media exposure and reviews, and G80 (probably G84 and G86 also) will have a second opportunity to shine completely alone in the Direct3D 10 graphics hardware category.
trinibwoy
13-Dec-2006, 01:20
This sudden round of multiple R600 posts out of Hexus smells to me like either they are full of it (hope not), or somebody at AMD pulled the trigger on the latest rev as ready for public consumption and now the serious leaking gets going.
The 2Ghz blurb expelled them out of the realm of plausibility for me. But I recall saying something similar with regard to some G80 tidbit that turned out to be true!
The 2Ghz blurb expelled them out of the realm of plausibility for me. But I recall saying something similar with regard to some G80 tidbit that turned out to be true!
Yeah, good points there. Every time I think R600 is coming into focus, it slips away again. :sad: Even what I think I know I don't entirely trust. Right now all I'd be willing to put in is floors on some numbers, no ceilings. Like "not less than 256-bit external memory bus" "not less than 64 (or 320, depending on how you count) shaders" "not less than 700MHz core" "not less than 16 ROPs". Like that.
The 2Ghz blurb expelled them out of the realm of plausibility for me. But I recall saying something similar with regard to some G80 tidbit that turned out to be true!
There was a patent that strongly backed up the G80 tidbit though.
Assuming on-chip storage used to keep a list of the pending instructions, registers, and destinations is non-zero in size, the longer a grouping waits adds more than just latency. In a high-load situation, where large groupings of pending ops are waiting for a handful of lagging operations, it directly reduces the effectiveness of the threading hardware.
I'm assuming the threading capabilities of R600 are finite, of course. If the scheduling hardware is so over-engineered that it can handle more threads than can ever be put in-flight by the control processor, this may not be a problem. It probably won't clock too terribly high, but that's life.
It's worth pointing out that there's a limited set of shader programs being threaded at one time. I don't know what R600 supports.
Earlier GPUs seem to support one vertex shader and one pixel shader loaded into the GPU at any one time. For the sake of argument, R600 might support eight each of vertex, primitive and pixel shader programs loaded.
Additionally, since programs can be reasonably large, the GPU is forced to page sections of code as required. I dunno what the parameters of that are.
Next we've got a high level view of scheduling, some of which is in the driver compiler:
seeks out clauses, where a clause is bounded by latency-inducing events
a latency-inducing event is a texture fetch, a constant fetch, a vertex fetch or a dynamic branch evaluation
the scheduler will shuffle latency-inducing events around, generally getting them executed as soon as possible
clauses of code (for very long programs) need to be paged with minimal disruption, implying that scheduling across tens of threads needs to have a degree of the round-robin about it, otherwise the GPU will thrash code pages - I've got no idea how big a code page is
out of order execution will generally increase cache thrashing, which ultimately costs memory bandwidth. Round robin execution also has corner cases that increase thrashing
various fixed function units (input assembler, rasteriser, ROPs etc.) all need to be kept busy, in addition to the ALU and TMU pipelines
the GPU as a whole needs to measure workloads and avoid stalls and overflowing buffers, adjusting thread priorities and throttling fixed-function units as appropriate
The whole thing is hideously complex :???: I'm just slapping down some parameters...
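The first two points in that list, clauses bounded by latency-inducing events, can be sketched like this. The opcode names are illustrative only, and treating the latency-inducing op as the last instruction of its clause is an assumption rather than anything known about the driver compiler.

```python
LATENCY_OPS = {"TEX", "CONST_FETCH", "VTX_FETCH", "BRANCH"}  # illustrative opcode names


def split_into_clauses(program):
    """Split an instruction list into clauses ending at latency-inducing ops."""
    clauses, current = [], []
    for op in program:
        current.append(op)
        if op in LATENCY_OPS:
            # A latency-inducing event closes the current clause.
            clauses.append(current)
            current = []
    if current:
        clauses.append(current)  # trailing ALU-only clause
    return clauses


program = ["MUL", "ADD", "TEX", "MAD", "BRANCH", "ADD", "MOV"]
clauses = split_into_clauses(program)
```

A scheduler would then switch threads at each clause boundary, since that is where a thread is about to stall anyway.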
I can imagine a number of implementations, some of which strike me as being more feasible than others.
On the other hand:
I'm having trouble parsing what is being discussed. If there were a few more details, or maybe a handy ASCII diagram, I might be more sure of whose version/interpretation I'm trying to address.
I might do a diagram at some point, but prolly not ASCII art - never been good at that. I'm interpreting the comments DiGuru has made. I don't know what he's referring to. How come he's become such an expert all of a sudden?!!! He's the one who really needs prodding!
Some questions:
What's the hierarchy within this design: command processor>array controller>array sub-unit?
Input Assembler or Setup Engine generates threads and distributes:
vertices/primitives by round-robin (?)
fragments by screen-space tiling
rasteriser/interpolator creates objects in thread, up to some per-thread limit:
early-Z culling
generates fragments
generates interpolated attributes and stores in shader state memory, per fragment
generates AA samples
Windowed issuer, issuing jobs that consist of a single instruction or request, based upon one or more reservation stations that use hashes to encode the nature of each job
when one of the sub-pipes (below) retires a job or a new thread/code-page is submitted, it
calls the register store pipeline, locking the corresponding hash until the store is complete (fixed number of cycles to execute?)
clears completed thread-IDs/object masks corresponding to the completed PC, from the hash's job queue
consults the predicate stack for the new PC and generates a hash for that instruction, summarising Instruction, Predicate, Channel-count
attaches to the hash's job queue a list of thread-IDs and object masks identifying the work to be done
surveys extant hashes for jobs that have reached thresholds:
age
percentage-full
priority
issues jobs to the appropriate pipelines, running the register/constant/attribute fetch pipeline in parallel
register/constant/attribute fetch pipeline - needs to be multi-threaded to support all of the pipelines below running in parallel :razz:
unravel the hash to determine the addresses of operands for threads at the head of the queue attached to the hash
request registers
request constants
request attributes
pool resulting operands into proxy registers for consumption by one of the pipelines below
register store pipeline, needs to be multi-threaded
accepts list of destination registers and output masks
updates register file
ALU pipeline
decode
fetch from proxy registers
math
bitwise shifting
submit result to register store pipeline
TMU pipeline
calculate job parameters, LOD etc.
fetch (including fetch re-ordering?)
filter
submit result to register store pipeline
memory-fetch pipeline
calculate address
fetch (including re-ordering?)
submit result to register store pipeline
streamout pipeline
determine output address
write to memory (including re-ordering?)
BR pipeline
evaluate BR for objects in job
push resulting predicates onto the objects' predication stacks in shader state memory
back-end, evaluates forward routing for completed threads:
to input assembler
to setup engine
to ROP
ROP
z-test
stencil test
write/read memory
alpha blend
AA blend
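The windowed issuer step in the outline above could look something like this toy model. It is purely speculative: a tuple of (instruction, predicate, channel count) stands in for the hash, only the age and fullness thresholds are modelled (priority is left out), and every name and number is an assumption.

```python
from collections import defaultdict


class WindowedIssuer:
    """Toy issuer: bins work by (instruction, predicate, channels)."""

    def __init__(self, bin_size=16, max_age=8):
        self.bin_size = bin_size            # objects needed for a full job
        self.max_age = max_age              # cycles before a partial job is issued
        self.queues = defaultdict(list)     # key -> [(thread_id, object_mask)]
        self.age = defaultdict(int)         # key -> cycles spent waiting

    def submit(self, instruction, predicate, channels, thread_id, object_mask):
        key = (instruction, predicate, channels)   # stands in for the hash
        self.queues[key].append((thread_id, object_mask))

    def _objects(self, key):
        # Count set bits across all object masks queued under this key.
        return sum(bin(mask).count("1") for _, mask in self.queues[key])

    def tick(self):
        """Advance one cycle; return the jobs whose thresholds were reached."""
        issued = []
        for key in list(self.queues):
            self.age[key] += 1
            if self._objects(key) >= self.bin_size or self.age[key] >= self.max_age:
                issued.append((key, self.queues.pop(key)))
                del self.age[key]
        return issued


issuer = WindowedIssuer(bin_size=16, max_age=3)
issuer.submit("MAD", 1, 4, thread_id=0, object_mask=0xFF00)   # 8 objects
issuer.submit("MAD", 1, 4, thread_id=1, object_mask=0x00FF)   # 8 objects
issuer.submit("RCP", 1, 1, thread_id=2, object_mask=0x0001)   # 1 object
first = issuer.tick()    # MAD bin is full and issues immediately
second = issuer.tick()   # RCP is still waiting
third = issuer.tick()    # RCP reaches max_age and issues part-full
```

The age threshold is what stops a nearly empty bin from waiting forever for the 16th object.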
E&OE - it's getting late :lol:
Jawed
Edit: This sudden round of multiple R600 posts out of Hexus smells to me like either they are full of it (hope not), or somebody at AMD pulled the trigger on the latest rev as ready for public consumption and now the serious leaking gets going.
Isn't it more likely that the AIBs in Taiwan, where there's a show happening, have started productising and begun being briefed by AMD - now that the end is in sight? (Whenever that is, sigh.)
Show + Taiwanese = unabated gossip :grin:
I'm just guessing that the AIBs are now getting something concrete. How many Christmases is it, now, that "ATI" has skewered sales of important products. It's kinda weird that there hasn't been more AIB churn.
Then you get strange things like OCZ, that has ATI CrossFire branded memory, getting back into the GPU business with NVidia 8800GTX. Maybe they got sick of waiting for R600?...
Oh, and then there's Futuremark. I'm not accusing them of leaking stuff, but hey, they must have some kind of R600 by now...
Jawed
You could very well be right, Jawed. I was influenced by Hexus putting a date on press getting samples (which, btw, I will now disclose to you that I can confirm that the date . . . is the first time I've heard it or any date, and if I had heard a date I couldn't talk about it, so there. :smile: ). That's the kind of thing that only happens when the snowball starts down the mountain. An awful lot of press is either volunteer or "semi-pro" and they need lead time to clear their calendars, get vacation scheduled, maybe get visas, etc, etc, etc. You don't mess with that much. When it's set, it's set, typically.
But then I wouldn't necessarily have heard yet. I just can say I haven't. :grin:
Voltron
13-Dec-2006, 03:30
As I mentioned in a post on the last page, production takes about 12 weeks for such complex chips (unless TSMC has reduced it).
Didn't Rys just say (not that he has to be correct) that the final production run has not commenced?
So mid-March at a minimum, because boards have to be made and shipped. And that's assuming the final rev is ready immediately.
"Production silicon" is often a backwards looking definition tho. :grin: It's why there is often good avail, then bad avail, then good avail again on a launch. Depends on what kind of confidence the IHV had in making that last test run order. . . because it could convert to being the first "production" run; then when you've sold out after the first week or two, you're sweating the ramping of full production hitting. There is a minimum # of wafers you can order. You can't say, "yeah, just give me 6 chips, then if they're good I'll order 60,000".
An awful lot of press is either volunteer or "semi-pro" and they need lead time to clear their calendars, get vacation scheduled, maybe get visas, etc, etc, etc. You don't mess with that much. When it's set, it's set, typically.
Even if you're pro it's always a mess to schedule vacations because you don't want to (or simply can't) miss key launches. Vacations and cancelled should never be linked words :mad:
"Production silicon" is often a backwards looking definition tho. :grin: It's why there is often good avail, then bad avail, then good avail again on a launch. Depends on what kind of confidence the IHV had in making that last test run order. . . because it could convert to being the first "production" run; then when you've sold out after the first week or two, you're sweating the ramping of full production hitting. There is a minimum # of wafers you can order. You can't say, "yeah, just give me 6 chips, then if they're good I'll order 60,000".
I think the term you're describing is "risk production".
silent_guy
13-Dec-2006, 05:43
I assume the cycle time for a high end chip with 9 layers or more (is TSMC adding layers to 80 nm?) is around 12 weeks. Maybe they have brought it down a little, but I am thinking if production begins today you are looking at mid-March, minimum, for retail boards.
Depends on where you start counting.
It's very likely that by now they have started a large set of (risk) wafers that are halted at metal 1. This gives them the opportunity to still make ECOs should they be necessary, but it will reduce the lead time from start-of-spin to packaged chips to roughly a month. It's not an uncommon practice.
This can bite them when they run into a problem that isn't metal solvable, but that's very rarely the case.
silent_guy
13-Dec-2006, 06:06
"... forgot to connect the pins ..."
Did I really lose 5 months of my life.
You know, this is one of the more believable rumors flying around.
A 'brick' or 'rock' as it is affectionately called: a design engineer's worst nightmare. It doesn't happen as often as it used to, but it's still quite common. The chip design process is one of a lot of checks and balances. There are all these verification methodologies that decrease the chance of fatal bugs, so it doesn't happen a lot that significant pieces of logic are conceptually broken: a decent test plan ensures that the serious stuff has been kicked out.
But in the backend stages of the design flow, things are much more fuzzy. Errors are harder to detect and major mistakes can happen. They are usually easy to fix in metal, but still they can result in first silicon that is basically dead. It's a huge black mark on the ego of the perpetrator and usually the whole company will know about the problem. (Though I've never seen anyone fired for it.)
Some examples:
- the layout team used an empty boot ROM and forgot to swap it with the real one. (The customer wasn't happy.)
- using a test boot ROM that only downloaded 20 bytes instead of the production version that downloaded 4K (because simulation took too long). Try squeezing a decent program into that! (I was young and needed the money.)
- IO pads inverted. (Our boss forbade us to talk about it to the guy who did it. He might have jumped off a bridge.)
- forgotten power connection in the IO ring. (The guy eventually became VP of engineering.)
- digital gate inserted along an analog control line.
- the fab mixing the metal layers of 2 different versions of the chip. (We needed a camera on a microscope to convince them.)
These are things that I have personally seen happen. Even the best checklist (and we have lots of them) can't prevent stupid mistakes. So, yes, disconnected IO pins is really not that far-fetched and can easily slip the schedule by a month or 2.
Hope that guess is based on more than just the fact that Nvidia decided to clock theirs that high. Given talk of 800Mhz speeds it's nearly guaranteed that those rumours refer to the shader domain as well.
No I suspect that Nvidia did that as a knee jerk reaction to the Intrinsity/ATI press release some time ago. I have been saying this long before we knew what G80 was.
dizietsma
13-Dec-2006, 06:37
I'm not sure about all these figures but even so it sounds pretty fast, pretty power hungry and has a "pretty" cooling solution so it sounds pretty expensive to produce (if not to buy).
Looks like I will be contemplating the mid range cards again from AMD or nvidia this time round. I can never quite convince myself to go the whole hog on the top end.
Chalnoth
13-Dec-2006, 06:53
No I suspect that Nvidia did that as a knee jerk reaction to the Intrinsity/ATI press release some time ago. I have been saying this long before we knew what G80 was.
There's no possible way that the high clock of nVidia's shader units is any sort of knee-jerk reaction. It was a central decision to the architecture that was sure to have profound effects upon the final specs.
PeterAce
13-Dec-2006, 12:12
R520, R600 and G80 all late.
The complexity seems to make being late the norm rather than a rare mishap.
Development cycles are said to be getting longer....
trinibwoy
13-Dec-2006, 15:16
No I suspect that Nvidia did that as a knee jerk reaction to the Intrinsity/ATI press release some time ago. I have been saying this long before we knew what G80 was.
And the first time you said it you received the same response. You can't knee-jerk a chip to 1.35Ghz. The clock speed is obviously integral to the entire G80 architecture. Are you insane?
I'm interpreting the comments DiGuru has made. I don't know what he's referring to. How come he's become such an expert all of a sudden?!!! He's the one who really needs prodding!
Nah, I've just been following the developments with half an eye, and that was the only model that fits the info we've got so far. I have no other good sources. But it makes a lot of sense (it's what I would have done as well), and it fits the DX10 specs, the patents and the way I think ATi works.
I would like to work out the model more, for my own understanding, but I've been rather busy with work. I have to finish a small program tonight, for example.
In the meantime, I wanted to add some further explanations of some stuff here and there, but you have so far done a very good job of that. I agree with your interpretation of it.
I would only like to add, that it actually simplifies some things just as much as it complicates others. For example, scheduling and issuing the hashed code segments in arbitrary batches makes things (including register loading) easier, while reordering after a branch makes it more complex. And storing the results can be done with binary masks as well, so it might be done with (an inverted copy of) the same logic that loads the registers.
Making objects according to hashed instruction sequences is simpler and requires much less storage for state information than keeping track and trying to fit a lot of the old kind of threads for execution. You only have to store what hash goes with what sequence. And you can use a binary tree consisting of pointers to an array to keep track of it, and have the right operand submitted to all the ALUs.
And, considering the patent application with the interesting name and the one about the hashing of code fragments (with a rolling hash), that tree is built by the driver compiler in advance. It is a lot like a zip file: you "zip" the code to be executed by storing the hashes and index to a table in a binary tree, and you only have to keep track of and issue all the available unique code segments from a high-level perspective. The PC is probably split into two parts: the hash ID and the operand in that sequence. That's what you batch and execute.
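A minimal sketch of a PC split that way, assuming the upper bits hold the hash ID of a code segment and the lower bits hold the position within it. The 8-bit offset width and the segment table contents are hypothetical.

```python
OFFSET_BITS = 8   # assumption: up to 256 instructions per hashed segment


def pack_pc(hash_id, offset):
    """Combine a segment's hash ID and an offset within it into one PC."""
    assert 0 <= offset < (1 << OFFSET_BITS)
    return (hash_id << OFFSET_BITS) | offset


def unpack_pc(pc):
    """Split a PC back into (hash_id, offset)."""
    return pc >> OFFSET_BITS, pc & ((1 << OFFSET_BITS) - 1)


# Table of unique code segments, as a driver compiler might build in advance.
segments = {
    0: ["MUL", "ADD"],
    1: ["TEX", "MAD", "MOV"],
}

pc = pack_pc(1, 2)
hash_id, offset = unpack_pc(pc)
instruction = segments[hash_id][offset]
```

Objects whose PCs share the same hash ID and offset are by construction waiting on the same instruction, so that pair is a natural key for batching.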
In that, it doesn't resemble a "normal" processor at all. It looks more like a processor with a smallish but variable amount of execution units. And each execution unit can have many identical copies. The circular bus is a good way to distribute the data needed to all the ALUs assigned to each execution unit. You just have to tell them: you're unit 17 for this sequence.
I haven't got a good idea how and where they store the registers, as the circular bus would have to be pretty wide to distribute them across that. But if they do, they could use another binary tree to store the contents.
But anyway, I think you do a great job explaining it in detail. ;)
Edit: and I also think that there is a minimum length to the code sequences that can be submitted to the execution units, to be able to traverse the trees. That would be linear in the depth of the tree. For example, if a simple ADD is the first instruction in the tree, they could issue that each clock, while a sequence of instructions that takes 5 clocks cannot be more than 5 levels deep to be able to issue it without stalls.
And the first time you said it you received the same response. You can't knee-jerk a chip to 1.35Ghz. The clock speed is obviously integral to the entire G80 architecture. Are you insane?
They basically had two years to do it. That is enough time.
nV would have had to make that technology from beginning to end, which they did to my knowledge, since they haven't licensed any IP for any specific silicon fabrication models. It would have taken them longer than 2 years to do that and make the GPU too, since something like that would likely have a significant effect on the GPU.
The last IP for silicon fabs nV licensed, from what I remember, was from the beginning of 2005 to mid-2005 from Arithmatica, which was to reduce silicon size by using mathematical layouts or something along those lines.
Sorry, it was March 2004 when they announced it; it was used in the GF6.
http://arithmatica.com/news/index.html
Interesting side note: looks like ATi might be using it as well now, 7/20/06.
trinibwoy
13-Dec-2006, 19:48
They basically had two years to do it. That is enough time.
Yep, it's obvious that Nvidia rewrote the G80 shader core two years ago after reading a press release. The evidence you've provided is overwhelming. It would suck for them if ATi didn't produce a 1Ghz+ part though - imagine all the heartache and tears that press release caused all for nothing :shock:
Method and apparatus for managing tasks in a Multiprocessor system (http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&r=1&f=G&l=50&d=PG01&S1=20060059484.PGNR.&OS=DN/20060059484&RS=DN/20060059484)
This patent application (which I think I've discussed before, it looks awfully familiar) is about "task completion, detection and task initiation".
Jawed
Your friendly neighborhood mod team requests all members to check the settings on their phazers are at "stun" rather than "caustic wit".
Thank you.
trinibwoy
13-Dec-2006, 20:01
I would only like to add, that it actually simplifies some things just as much as it complicates others.
If R600 is anything like you guys are describing then it's nothing like Xenos right? Somebody needs to put together some ASCII art or one of those colorful pictures ( Jawed !? ) - I'm floundering here trying to understand why there's anything "simple" about this approach :???:
geo: I was being sincere...honest!
Well, I think the ring bus is probably the structure that determines the maximum clock speed. That is, if they use one. And they might segment it and add binary tokens to speed it up even further.
If the ALUs are pipelined, they could raise the speed of those quite a lot. The main consideration would be: is it better to do the same thing in two clocks at 1.6 Ghz, instead of in one clock at 800 Mhz? And can the ring bus and indexing structures keep up with that?
Nah, I've just been following the developments with half an eye, and that was the only model that fits the info we've got so far. I have no other good sources. But it makes a lot of sense (it's what I would have done as well), and it fits the DX10 specs, the patents and the way I think ATi works.
Well, funnily enough, I've had some of these patent documents lying around for a while now, wondering if they "really meant what I thought they might mean", so your initial post was the go-ahead to launch into a proper explication of my ideas!
I would like to work out the model more, for my own understanding, but I've been rather busy with work. I have to finish a small program tonight, for example.
I'm going to be away from tomorrow morning for a few days, which is a bit annoying ...
And, considering the patent application with the interesting name and the one about the hashing of code fragments (with a rolling hash), that tree is built by the driver compiler in advance.
Yeah, this is what I've been thinking, that hashing can be performed outside of the GPU. I'm a bit uncertain if this entirely relieves the GPU of hashing, though, because of dynamic stuff. e.g. dependent texturing. Need to ruminate on that.
In that, it doesn't resemble a "normal" processor at all. It looks more like a processor with a smallish but variable amount of execution units. And each execution unit can have many identical copies.
I'm thinking it could be labelled as an "Instruction Associative Processor". Dave likens it to the Imagine processor, so he's ahead of us there...
Edit: and I also think that there is a minimum length to the code sequences that can be submitted to the execution units, to be able to traverse the trees. That would be linear in the depth of the tree. For example, if a simple ADD is the first instruction in the tree, they could issue that each clock, while a sequence of instructions that takes 5 clocks cannot be more than 5 levels deep to be able to issue it without stalls.
Ah, I've been thinking that everything is "1 instruction at a time". I'm wary of hashing multiple instructions because associativity drops off massively then.
Also, I'm thinking that since R600 is effectively "batchless", there's no need to worry feverishly about batch size. Each array might be 16x vec4s wide, but even with 4 clocks per instruction, the likelihood is that there's such an abundance of objects in flight that it's never particularly difficult to fill up 4 lines for a single instruction, to issue to the ALU array on four successive clocks.
So, what I'm thinking is that register file fetches and stores become simpler if you always have 4 clocks to perform them. The "randomness" looks far less intense - particularly if the issuer is constrained in how "randomly" it's allowed to select threads to make up the 4 lines for execution (e.g. there's some kind of page constraint and/or selection is on quads, not individual objects).
This is similar to how Xenos and R5xx spend 4 clocks per instruction - though those GPUs aren't dealing with such intensely "random" fetches. Their fetches are only random in terms of register. There's no randomness in terms of thread. Older GPUs, such as R4xx seemingly spent even longer, tens of clocks, fetching according to a pattern of (random registers, fixed thread).
I'm also still concerned by the fact that even if the ALU pipeline is only fetching 2 operands per clock, the TMU, memory fetch, streamout and BR pipelines all want to suck up one or two operands each. You could argue that these pipelines aren't so picky about operand-fetch latency - they would be happy running at half or quarter speed, for example. Erm...
---
When Dave and I discussed re-ordering before, we discussed a special pipeline for "register shuffling". This pipeline's only job is to move register values around, the idea being that when the time comes to fetch a "line" of operands, all the registers appear in a single fetch on one port.
This increases latency: when the predicate stack is pushed or popped the threads involved in the line need to have their registers realigned according to the new predicate.
This latency isn't on the critical path for instruction execution though - all the threads involved (which are already suffering a latency event due to the BR instruction, anyway) will take an extra N clocks for the shuffle to be complete.
But clearly register shuffling requires a dedicated read and write port, in order that shuffling can proceed while the main pipelines (ALU, TMU etc.) do their stuff.
It would also, effectively, do away with the need to use register proxies.
The problem I see with register shuffling is: which registers need shuffling? A thread can declare 4096 registers :!: The shuffler would need to inspect the program's predicate level to determine which registers are in scope (e.g. r4, r6, r12 and r45 are used during 3 instructions within an if statement). The longer the clause, the worse this gets. It seems like a deal-breaker to me.
Jawed
Ah, I've been thinking that everything is "1 instruction at a time". I'm wary of hashing multiple instructions because associativity drops off massively then.
But the patent is explicit about comparing variable length operand sequences, instead of only single ops. And it would decrease the amount of shuffling work around very much.
When Dave and I discussed re-ordering before, we discussed a special pipeline for "register shuffling". This pipeline's only job is to move register values around, the idea being that when the time comes to fetch a "line" of operands, all the registers appear in a single fetch on one port.
Like a second set of "ALUs" and data pathways that function like an active many-way and very wide cache? That sounds pretty good.
The problem I see with register shuffling is: which registers need shuffling? A thread can declare 4096 registers :!: The shuffler would need to inspect the program's predicate level to determine which registers are in scope (e.g. r4, r6, r12 and r45 are used during 3 instructions within an if statement). The longer the clause, the worse this gets. It seems like a deal-breaker to me.
Yes, that's my main problem with it as well.
We're still pretty much on the same line. ;)
Edit: I think we should see the hashed code segments as complex instructions, that take a variable amount of clocks in the pipeline, instead of looking at the actual ops as the smallest unit. Like a pipelined superscalar RISC processor executing microcode.
If R600 is anything like you guys are describing then it's nothing like Xenos right?
Nope, not really.
It's unified, and in terms of the relationship between the fixed-functions (i.e. input assembler, rasteriser(s), ROPs) and programmable arrays it is very similar (with D3D10 goodness sprinkled on top).
The ALU array is probably 16x wide, like Xenos. But the guts of the threading model is on a whole different plane.
In older GPUs, every ALU in a SIMD array is working on objects from the same thread (batch). So in Xenos, 16 fragments from a 4x4 portion of the screen, say, are in the ALU:
AAAABBBBCCCCDDDD
AAAABBBBCCCCDDDD
AAAABBBBCCCCDDDD
AAAABBBBCCCCDDDD
so that's four threads I've shown there (16 in a thread, not 64 as Xenos has, too lazy to make it bigger!).
In R600, those four threads of 16 each will progress through the ALU array as normal. Until dynamic branching occurs (there's other reasons, but keeping it simple).
Now we have an if statement, which predicates those pixels as follows:
1010101101010111
1011101010101010
1111010101010101
1110101010101101
applied to the threads on Xenos we get:
AaAaBbBBcCcCdDDD
AaAABbBbCcCcDdDd
AAAAbBbBcCcCdDdD
AAAaBbBbCcCcDDdD
Which still takes 4 clocks. On R600, after the threads are shuffled, you get:
AAAABBBBDDDD
AAAABCCCDD##
AAAACCCC####
BBBBCDDD####
which takes 3 clocks. Ideally the windowed issuer would have access to more than simply four threads, for packing, so those wasted #s would occur less often.
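A minimal sketch of the arithmetic behind the 4-clock versus 3-clock comparison above. It assumes a 16-pixel-wide array and perfect packing of pixels that pass the predicate, and it takes its counts from the four predicate rows rather than from the hand-drawn packed diagram. The names LANES, PREDICATE_ROWS and active_per_thread are made up for illustration.

```python
import math

LANES = 16  # assumed ALU array width, in pixels per clock

# Predicate rows copied from the post. Each row is one clock on the
# Xenos-style layout: four pixels from each of threads A, B, C and D.
PREDICATE_ROWS = [
    "1010101101010111",
    "1011101010101010",
    "1111010101010101",
    "1110101010101101",
]


def active_per_thread(rows, threads="ABCD", pixels_per_slot=4):
    """Count the pixels that pass the predicate, per thread."""
    counts = {t: 0 for t in threads}
    for row in rows:
        for i, t in enumerate(threads):
            slot = row[i * pixels_per_slot:(i + 1) * pixels_per_slot]
            counts[t] += slot.count("1")
    return counts


counts = active_per_thread(PREDICATE_ROWS)

# Without packing, every row is issued whether or not its pixels are live.
unpacked_clocks = len(PREDICATE_ROWS)

# With ideal packing, only live pixels occupy lanes.
packed_clocks = math.ceil(sum(counts.values()) / LANES)

print(counts, unpacked_clocks, packed_clocks)
```

With only four threads to draw from, the last packed line is partly empty, which is the "wasted #s" case; a wider window of threads would give the issuer more candidates to fill it.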
:grin:
Jawed
Well, I think the ring bus is probably the structure that determines the maximum clock speed. That is, if they use one. And they might segment it and add binary tokens to speed it up even further.
If the ALUs are pipelined, they could raise the speed of those quite a lot. The main consideration would be: is it better to do the same thing in two clocks at 1.6 Ghz, instead of in one clock at 800 Mhz? And can the ring bus and indexing structures keep up with that?
The ALUs are already pipelined even in current implementations. G80's ALU is 10 stages, going by the patent. What I am wondering is if this is much deeper than the previous implementations in order to clock it that high. Anyone have any info on that?
I'm thinking it could be labelled as an "Instruction Associative Processor". Dave likens it to the Imagine processor, so he's ahead of us there...
Actually, I haven't spent any time digesting (or even attempting to understand) the whole hashing thing, so the implementation could vary wildly. I haven't seen a good description from Imagine describing what it is that they might do to alleviate state management (register access/shuffling/whatever) around branches. However, the idea of batchless-ness is inherent in stream processing -- the "batch" is basically undefined, the input stream just ends at some point. The way that branches are handled is to split the shader into multiple kernels, and direct the output stream of one kernel into one of the sides of the branches. That strikes me as more similar to R600 than the traditional predication approach of G80. However, I'm pretty sure Imagine isn't doing anything like instruction association -- it doesn't really need to, and it's got a VLIW issue that it'd have to associate, which might become difficult in practice.
Also, I'm thinking that since R600 is effectively "batchless", there's no need to worry feverishly about batch size. Each array might be 16x vec4s wide, but even with 4 clocks per instruction, the likelihood is that there's such an abundance of objects in flight that it's never particularly difficult to fill up 4 lines for a single instruction, to issue to the ALU array on four successive clocks.
Yeah, I think you fetch as much as you can and issue what you can.
So, what I'm thinking is that register file fetches and stores become simpler if you always have 4 clocks to perform them. The "randomness" looks far less intense - particularly if the issuer is constrained in how "randomly" it's allowed to select threads to make up the 4 lines for execution (e.g. there's some kind of page constraint and/or selection is on quads, not individual objects).
Seems like you can have several moving parts involved -- you can have one part just gathering "next operation" instructions, a flush of that pushes the set of ops to a register fetcher, and whatever comes out of that mess winds up actually executing.
The problem I see with register shuffling is: which registers need shuffling? A thread can declare 4096 registers :!: The shuffler would need to inspect the program's predicate level to determine which registers are in scope (e.g. r4, r6, r12 and r45 are used during 3 instructions within an if statement). The longer the clause, the worse this gets. It seems like a deal-breaker to me.
I suppose one could keep this somewhat constrained by only keeping a small set of registers "live" (and therefore in need of shuffling). Downsides there, of course....
In way over my head....
-Dave
Thinking a bit more about binary trees and such, I do agree that it might be easier to issue everything one instruction at a time. Because I got it backwards: the depth of the tree from which you can fetch the next sequence depends on the length of the current sequence(s) executing.
Then again, it would mostly add latency, if you can prevent dependencies. Which might be hard to do. I'll have to think that over a bit more.
Edit: On the other hand, and thinking about the patents, I think they first mix and match multiple different programs in a way to minimize dependencies before calculating the hashes. They effectively only run a single program, no matter how much different ones you supply.
But the patent is explicit about comparing variable length operand sequences, instead of only single ops. And it would decrease the amount of shuffling work around very much.
Aha! Well, I haven't read it, just looked at the pretty pictures and thought "hmm, hashing..." :lol:
Edit: I think we should see the hashed code segments as complex instructions, that take a variable amount of clocks in the pipeline, instead of looking at the actual ops as the smallest unit. Like a pipelined superscalar RISC processor executing microcode.
Yes, I'm keen to keep in mind the idea that the ALU pipeline could be lightweight. ADD or MUL might be one clock, but MAD is two, for example (that comes back to the operand count, though).
ATI's older GPUs' vertex pipelines had a 2-operand fetch register file with a 2-clock MAD.
I'm also puzzled by SF. If anything, SF should prolly take even longer, say 3-6 clocks. The different SFs (RCP, RSQ, LG2, EX2, SIN) may all take different periods of time. That might increase the importance of that TCP patent I just linked.
So, instead of older GPUs that could do vec4 or vec3 + SF (or Xenos which does vec4 + SF), which means that SF has 1/4 of the throughput of MUL (in terms of scalars), R600 might achieve ~ the same ratio between MUL and SFs because SFs, even when issued on 64 scalars in an array of 16x vec4s, would take 2-4x as long to compute.
Jawed
Yes, I'm keen to keep in mind the idea that the ALU pipeline could be lightweight. ADD or MUL might be one clock, but MAD is two, for example (that comes back to the operand count, though).
Yes, and the sequence of MAD + RSQ might take 8 clocks or such. Which might be identified by hash #0007, and regarded as instruction #0007 by the top level scheduler.
I'm floundering here trying to understand why there's anything "simple" about this approach :???:
I forgot to mention earlier the saving afforded when an instruction is vec2 or scalar, which allows you to pack objects into the pipelines.
Anyway, why's it simple?
The math/bit-shifting/format-conversion instruction set is small and the number of objects that will execute exactly the same instruction, is huge.
When you have a shader program:
MUL
ADD
MAD
MUL
DP3
RSQ
ADD
MUL
you see the same few things happening over and over. What makes them vary from each other is the destination and operands:
MUL r0, r2, r3
ADD r1, r0, r4
MAD r3, r1, r5, r0
MUL r0, r2, r6
DP3 r0, r0, r3
RSQ r0, r0
ADD r0, r0, r7
MUL r1, r0, r8
But let's look at those MULs and code them with program counter (line number):
1 r0, r2, r3
4 r0, r2, r6
8 r1, r0, r8
Imagine we've got 1024 pixels to run that code for, threads numbered 1 to 64.
You can describe the program above like this:
MUL
===
1 r0, r2, r3
4 r0, r2, r6
8 r1, r0, r8
=======
1, 2, 3, ... 63, 64
ADD
===
2 r1, r0, r4
7 r0, r0, r7
=======
1, 2, 3, ... 63, 64
etc.
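The regrouping above (same opcode, different program counters and operands) can be sketched as a small table keyed by opcode. This only illustrates the bookkeeping described in the post, not how any hardware or driver stores it; PROGRAM and group_by_opcode are hypothetical names.

```python
from collections import defaultdict

# The eight-line shader from the post, one instruction per line.
PROGRAM = """\
MUL r0, r2, r3
ADD r1, r0, r4
MAD r3, r1, r5, r0
MUL r0, r2, r6
DP3 r0, r0, r3
RSQ r0, r0
ADD r0, r0, r7
MUL r1, r0, r8
"""


def group_by_opcode(source):
    """Map each opcode to the (program counter, operands) pairs that use it."""
    table = defaultdict(list)
    for pc, line in enumerate(source.splitlines(), start=1):
        opcode, operands = line.split(maxsplit=1)
        table[opcode].append((pc, operands))
    return dict(table)


table = group_by_opcode(PROGRAM)
for opcode, entries in table.items():
    print(opcode, entries)
```

Eight program lines collapse to five distinct opcodes, which is the sense in which "the same few things" happen over and over.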
There's also a list of threads, with status (program counter, predication + other stuff):
1 7 1000111001100001
2 7 1001111110000000
3 4 1111111111111111
4 4 1111111111111111
5 2 1000000000000001
6 1 1110011101111101
7 1 0000110010010000
As the program executes, it marks up each thread ID with the program counter. When a thread has performed the final MUL or ADD it's deleted from the queue attached to MUL or ADD. When the program is entirely finished, it's deleted from the thread list. As branches are evaluated, the predication is updated for each object in the thread.
Every clock, the GPU's looking to fill the array of ALUs, all 64 scalars as it were. I've left out channels on the code (r2.rgb or r8.a, etc.), just trying to simplify. Imagine instead there's an if statement surrounding this code, causing some pixels to be excluded - so packing is required.
So execution might go like this, with each thread taking it in turns, sort of:
Threads 1, 2 and 5 are ready to do ADD. Those three threads can run together, as there's 16 pixels satisfying the predicate. It doesn't matter that there's two different instructions in the program being executed, line numbers 7 and 2 - the ALU pipeline has this fact hidden from it by proxies: "r = rA + rB"
Next, thread 3 wants to do MUL. Notice all 16 pixels in that thread can fill the array
Thread 4 is just like 3, completely filling the array
Threads 6 and 7 jointly fill the array, performing a MUL
So that's an example of how an out of order GPU can issue mixed threads, not caring about the specific operands in any one line of code. The ALU array can then be packed both on the basis of predication but also based on the channel-count.
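A rough sketch of the issue walkthrough above, using the thread list from the post. It assumes a 16-pixel array and that any threads waiting on the same opcode can be packed together regardless of program counter; OPCODE_AT, THREADS and issues_per_opcode are illustrative names, not anything from a real driver.

```python
import math
from collections import defaultdict

LANES = 16  # assumed array width in pixels

# Opcode at each program counter of the eight-line shader in the post.
OPCODE_AT = {1: "MUL", 2: "ADD", 3: "MAD", 4: "MUL",
             5: "DP3", 6: "RSQ", 7: "ADD", 8: "MUL"}

# (thread id, program counter, predicate bits), as listed in the post.
THREADS = [
    (1, 7, "1000111001100001"),
    (2, 7, "1001111110000000"),
    (3, 4, "1111111111111111"),
    (4, 4, "1111111111111111"),
    (5, 2, "1000000000000001"),
    (6, 1, "1110011101111101"),
    (7, 1, "0000110010010000"),
]


def issues_per_opcode(threads):
    """Sum live pixels per pending opcode, then count full-width issues."""
    active = defaultdict(int)
    for _tid, pc, predicate in threads:
        active[OPCODE_AT[pc]] += predicate.count("1")
    return {op: math.ceil(n / LANES) for op, n in active.items()}


issues = issues_per_opcode(THREADS)
print(issues)
```

Threads 1, 2 and 5 contribute 7 + 7 + 2 = 16 live pixels for one ADD issue, and threads 3, 4, 6 and 7 contribute 48 for three MUL issues: four issues in total where seven separate thread issues would otherwise be needed.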
As far as the ALU array is concerned, life is sweet. The ALU pipeline doesn't worry where the data's coming from or going to. The cost is the additional pipelines which work "asynchronously" gathering operands, figuring out predication and then scattering the results afterwards. That then complicates the register file.
G80's array also has a pretty sweet life: it's also spoonfed with operands but it does have to evaluate predication as it goes.
Whereas G80 optimises for channel-count by going scalar, R600 optimises for both channel-count and predication with a single mechanism. G80's scalar SIMD array doesn't solve the predication problem directly (arguably makes it worse, because the array tends to be wider in terms of pixels, demanding higher coherency). Because predication is a tougher nut to crack, solving it obviates having to build a scalar pipeline to optimise for channel-count. R600 could have a scalar pipeline, but it would be only a minor gain over the vec4 pipeline, whilst costing a bundle more in complexity. R600 clearly can't achieve perfect channel packing (vec3 :!: ) but it seems hard to argue it's more than good enough.
Obviously, still guessing...
Jawed
Yes, and the sequence of MAD + RSQ might take 8 clocks or such. Which might be identified by hash #0007, and regarded as instruction #0007 by the top level scheduler.
Hmm, yes, I can see how the driver compiler can iteratively refine down to such hashes.
For example if you have
MUL r0.a, r1.a, r2.a
MAD r3.a, r0.a, r1.a, r3.a
RSQ r3.a
MUL r4.rgba, r5, r6
ADD r5.rgba, r4, r7
MAD r5.rgba, r8, r9, r5
ADD r5.rgba, r5.rgba, r3.a
the pairs of MUL and MAD instructions can't hash together, because of the channel count.
But you can hash the MUL, MAD, RSQ together and the MUL, ADD, MAD, ADD together.
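One way to picture the split by channel count: walk the listing and cut a new run whenever the width of the destination write mask changes. This is only a guess at the grouping step, not at the hashing itself, and channel_count and hashable_runs are hypothetical helpers.

```python
from itertools import groupby

# (opcode, destination with write mask) for the seven lines in the post.
CODE = [
    ("MUL", "r0.a"),
    ("MAD", "r3.a"),
    ("RSQ", "r3.a"),
    ("MUL", "r4.rgba"),
    ("ADD", "r5.rgba"),
    ("MAD", "r5.rgba"),
    ("ADD", "r5.rgba"),
]


def channel_count(dest):
    """Number of channels written; no mask is assumed to mean all four."""
    _reg, _, mask = dest.partition(".")
    return len(mask) if mask else 4


def hashable_runs(code):
    """Group consecutive instructions that write the same channel count."""
    return [
        (width, [op for op, _dest in run])
        for width, run in groupby(code, key=lambda ins: channel_count(ins[1]))
    ]


runs = hashable_runs(CODE)
print(runs)
```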
Hmm... interesting. :lol: I bet the compiler guys are in seventh heaven playing with this.
Time for another one I guess:
Method and Apparatus for Graphics Processing Using State and Shader Management (http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&r=1&f=G&l=50&d=PG01&S1=20050162437&OS=20050162437&RS=20050162437)
Jawed
Edit: On the other hand, and thinking about the patents, I think they first mix and match multiple different programs in a way to minimize dependencies before calculating the hashes. They effectively only run a single program, no matter how much different ones you supply.
I wonder if it's not programs, but streams within the program.
For example in the code I posted earlier, there's two streams:
MUL r0.a, r1.a, r2.a
MAD r3.a, r0.a, r1.a, r3.a
RSQ r3.a
MUL r4.rgba, r5, r6
ADD r5.rgba, r4, r7
MAD r5.rgba, r8, r9, r5
ADD r5.rgba, r5.rgba, r3.a
These can hash and execute in any order. The dependency is defined by the 7th line.
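The two streams and the join on the 7th line fall out of a simple read-after-write scan. A minimal sketch, ignoring channel masks and assuming RSQ r3.a reads and writes r3 in place (the post gives only one operand); producers is a made-up helper name.

```python
# (opcode, destination register, source registers), masks dropped.
CODE = [
    ("MUL", "r0", ["r1", "r2"]),
    ("MAD", "r3", ["r0", "r1", "r3"]),
    ("RSQ", "r3", ["r3"]),            # assumption: in-place RSQ on r3
    ("MUL", "r4", ["r5", "r6"]),
    ("ADD", "r5", ["r4", "r7"]),
    ("MAD", "r5", ["r8", "r9", "r5"]),
    ("ADD", "r5", ["r5", "r3"]),
]


def producers(code):
    """For each line, find the earlier lines whose results it reads."""
    last_writer = {}
    deps = {}
    for line, (_op, dest, sources) in enumerate(code, start=1):
        deps[line] = {last_writer[s] for s in sources if s in last_writer}
        last_writer[dest] = line
    return deps


deps = producers(CODE)
joins = [line for line, d in deps.items() if len(d) > 1]
print(deps, joins)
```

Lines 1 to 3 and lines 4 to 6 form two chains with no edges between them; only line 7 reads from both.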
Jawed
flippin_waffles
14-Dec-2006, 02:36
I forgot to mention earlier the saving afforded when an instruction is vec2 or scalar, which allows you to pack objects into the pipelines.
Anyway, why's it simple?
The math/bit-shifting/format-conversion instruction set is small and the number of objects that will execute exactly the same instruction, is huge.
When you have a shader program:
.......
LOTS OF INFO
......the results afterwards. That then complicates the register file.
G80's array also has a pretty sweet life: it's also spoonfed with operands but it does have to evaluate predication as it goes.
Whereas G80 optimises for channel-count by going scalar, R600 optimises for both channel-count and predication with a single mechanism. G80's scalar SIMD array doesn't solve the predication problem directly (arguably makes it worse, because the array tends to be wider in terms of pixels, demanding higher coherency). Because predication is a tougher nut to crack, solving it obviates having to build a scalar pipeline to optimise for channel-count. R600 could have a scalar pipeline, but it would be only a minor gain over the vec4 pipeline, whilst costing a bundle more in complexity. R600 clearly can't achieve perfect channel packing (vec3 :!: ) but it seems hard to argue it's more than good enough.
Obviously, still guessing...
Jawed
Well that's gotta be the granddaddy of all guesses! :lol:
flippin_waffles
14-Dec-2006, 02:39
I forgot to mention earlier the saving afforded when an instruction is vec2 or scalar, which allows you to pack objects into the pipelines.
Anyway, why's it simple?
The math/bit-shifting/format-conversion instruction set is small and the number of objects that will execute exactly the same instruction, is huge.
When you have a shader program:.......
LOTS OF INFO
........
with operands but it does have to evaluate predication as it goes.
Whereas G80 optimises for channel-count by going scalar, R600 optimises for both channel-count and predication with a single mechanism. G80's scalar SIMD array doesn't solve the predication problem directly (arguably makes it worse, because the array tends to be wider in terms of pixels, demanding higher coherency). Because predication is a tougher nut to crack, solving it obviates having to build a scalar pipeline to optimise for channel-count. R600 could have a scalar pipeline, but it would be only a minor gain over the vec4 pipeline, whilst costing a bundle more in complexity. R600 clearly can't achieve perfect channel packing (vec3 :!: ) but it seems hard to argue it's more than good enough.
Obviously, still guessing...
Jawed
Well that's gotta be the granddaddy of all guesses! :lol:
Hexus again:
http://www.hexus.net/content/item.php?item=7455
Codenames for R600... and R700!
R600:
XTX - Dragon's Head (new!) - GDDR4 ?
XT - Cat's Eye (already mentioned in PCPOP) - GDDR3 or GDDR4?
XL - UFO (already mentioned by TheInq) - GDDR3?
R700:
Wekiva!
bye
silent_guy
14-Dec-2006, 05:40
Edit: I noticed in other posts you were talking about a circular buffer. If that's different from the ring bus as we know it from previous chips, then the remarks below obviously don't apply. :wink:
Well, I think the ring bus is probably the structure that determines the maximum clock speed. That is, if they use one. And they might segment it and add binary tokens to speed it up even further.
On the contrary: the ring bus is probably the structure that's easiest to get to a high speed. The whole idea of a bus with separated request/return paths (ring bus or not) is to make them easy to pipeline without blocking. Non-blocking reads make it harder to prevent RAW or WAR race conditions at the system level, but that's the price you pay for almost unlimited clock scalability.
This has happened for all busses, internal and external.
E.g. externally, PCI went from blocking reads to non-blocking reads. Same for HT. Internally, ARM busses went from the sorry ASB to AHB to AXI.
The problem with internal busses is not speed but congestion (which is exactly what the ringbus specifically solves... at a price.)
If the ALUs are pipelined, they could raise the speed of those quite a lot. The main consideration would be: is it better to do the same thing in two clocks at 1.6 Ghz, instead of in one clock at 800 Mhz?
You still issue 1 operation per clock at 1.6 GHz. You've only increased latency.
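A quick worked version of that point, assuming an ideal pipeline that is always kept full, so one operation issues per clock. The pipeline helper and the stage counts are illustrative only, not G80 or R600 figures.

```python
def pipeline(clock_hz, stages):
    """Return (latency in seconds, issue rate in ops per second)."""
    latency_s = stages / clock_hz
    issue_rate = clock_hz  # one new operation per clock when kept full
    return latency_s, issue_rate


slow_latency, slow_rate = pipeline(800e6, stages=1)
fast_latency, fast_rate = pipeline(1.6e9, stages=2)

print(f"800 MHz, 1 stage : {slow_latency * 1e9:.2f} ns, {slow_rate:.2e} ops/s")
print(f"1.6 GHz, 2 stages: {fast_latency * 1e9:.2f} ns, {fast_rate:.2e} ops/s")
```

Under these assumptions the latency doubles when counted in clocks but is unchanged in nanoseconds, while the issue rate doubles, provided there is enough independent work in flight to cover the extra stage.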
And can the ring bus and indexing structures keep up with that?
The ringbus is running at the memory clock speed. Not at the core speed. The interface from ring bus to the core is a 'simple' asynchronous FIFO and a bit of routing logic based on a routing tag.
It is much much harder to get the memory issue logic (arbiter etc.) running at high speed. But that's not related to the ring bus.
Well, I think the ring bus is probably the structure that determines the maximum clock speed. That is, if they use one. And they might segment it and add binary tokens to speed it up even further.
If the ALUs are pipelined, they could raise the speed of those quite a lot. The main consideration would be: is it better to do the same thing in two clocks at 1.6 Ghz, instead of in one clock at 800 Mhz? And can the ring bus and indexing structures keep up with that?
Perhaps the ALUs aren't CMOS.
Acert93
14-Dec-2006, 06:42
Well that's gotta be the granddaddy of all guesses! :lol:
:lol: We should start a pool to see how much Jawed got right! We can say this: his "guess" is pretty detailed!
Perhaps the ALUs aren't CMOS.
Yeah, they might be made of pixie dust! :roll:
silent_guy
14-Dec-2006, 07:02
I don't work for Hexus any more, B3D full time now thankyouplease :!: So those stories have zero to do with moi, although I did explain to someone there how to decode ATI chip revisions not long ago which made it in there.
You also may want to tell them that each spin doesn't cost $1M: the fixed cost of a spin is determined by the number of masks. A full mask set is in the million dollar range: as feature sizes become smaller, the machines have to become ever more accurate and expensive. (In '96, a full mask set in 0.5um was only around $25K!)
A large part of the masks is used for the base layers, so when you have to do a metal spin, you'll typically try to minimize the number of metal layers you have to change. A metal spin that only touches, say, 6 layers (3 metal, 3 via) will cost only a fraction of a full spin.
DemoCoder
14-Dec-2006, 07:40
Here's another patent application:
Method and apparatus for static single assignment form dead code elimination (http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&r=1&f=G&l=50&d=PG01&S1=20050166194.PGNR.&OS=DN/20050166194&RS=DN/20050166194)
Great title, huh?...
Quite boring title actually, given that I have 3 compiler books sitting on my bookshelf that contain that phrase. Another lovely patent of the obvious, I look forward to an "invention" of SSA copy propagation, SSA constant folding, SSA constant conditionals, etc. Of course, extending SSA analysis from scalar instructions, and long precision, to SIMD registers is amazingly inventive, and not at all considered by others doing SIMD compiler optimizations before. Therefore, deserving of the patent.
It's a good thing, ATI made sure no one can reuse their invention technique:
"It should be understood that the implementation of other variations and modifications of the invention in its various aspects will be apparent to those of ordinary skill in the art and that the invention is not limited by the specific embodiments described herein. For example, the superword register may contain any suitable number of components and thereby the write mask contains the corresponding number of bits to allow for the effect management of any number of components in compiler operations, such as swizzles. It is therefore contemplated to cover by the present invention, any and all modifications, variations, or equivalents that fall within the spirit and scope of the basic underlying principles disclosed and claimed herein. "
Of course, the SGI patent on using SSA to optimize 64-bit math instructions to 32-bit also had a similar clause, so NVidia still has an opening to repatent all the other SSA form optimizations. GPUs are going double-precision FP in the future, so they could combine the ATI SIMD-superword SSA patent and the SGI 64-bit ALU -> 32-bit ALU SSA patent, to arrive at an SSA patent that turns SIMD FP64 code into scalar/vec2/vec3 FP32 :)
DemoCoder
14-Dec-2006, 07:58
I haven't been reading this thread, so I don't know how far back to start, but let me see if I can understand what is being proposed for the R600 architecture:
* still SIMD ALU based
* driver/HW calculates "live" components. If an ALU is found to have idle components, then another thread which would be doing a MADD for example, will be "packed" in.
So, if thread 1 is about to exec a vec2 MADD, and threads 2 and 3 are going to do a scalar, then all three will share the same ALU and execute together, effectively "packing" the ALU. Presumably, this can even work if say, you need to fit any arbitrary collection of vec1/2/3 ops across N ALUs.
Right off the bat, this looks like a 0-1 knapsack problem to me (NP-hard), so I don't see a hardware implementation coming up with an optimal solution. Approximation schemes would have to be used. Presumably, the driver compiler would have to do the heavy lifting, BUT:
The HLSL compiler already does a lot of these optimizations within a single shader, trying to pack the SIMD units, so it seems like a huge amount of driver effort, and perhaps hardware assistance for what would probably be a marginal improvement in efficiency.
Yes, the compiler could probably mark instructions with masks, and the thread dispatcher might be able to use this info to co-issue the next instruction in another thread, but the compiler driven scheme breaks down as soon as data dependent branches are introduced, and that means dead component detection would have to be done at runtime. Didn't we learn this lesson with Itanic?
This sounds way way too complex for me to believe it is something that ATI is actually doing.
Mintmaster
14-Dec-2006, 11:17
I just don't see the point of all this. What does this packing achieve?
I hope ATI isn't trying to optimize some obscure case that won't be used for ages. What did all the fancy thread handling and dynamic branching do for R5xx? Not much. They need to go for density now. Let the software catch up with the hardware before doing something crazy.
The Xenos style pixel grouping is all you really need to go G80-scalar style. Just change the way the ALUs operate.
I just don't see the point of all this. What does this packing achieve?
it achieves nothing (well, apart from bloating the die size)
I hope ATI isn't trying to optimize some obscure case that won't be used for ages.
I'm sure they're much more smart than that, I wouldn't be worried :)
zgemboandislic
14-Dec-2006, 12:40
Ummm, when do you guys suppose the benchmarks are going to start to leak? :D
Ummm, when do you guys suppose the benchmarks are going to start to leak? :D
maybe the scores will leak before the benchmarks will leak..
I thought the Inq already reported some 3dm06 scores a month or so ago.
Anyway...
Ailuros
14-Dec-2006, 13:13
I'd prefer if we'd stay on boards like B3D more on the technical analysis of upcoming hardware (speculative or not), instead of senseless and bloated synthetic benchmark scores thank you.
Right off the bat, this looks like a 0-1 knapsack problem to me (NP-hard), so I don't see a hardware implementation coming up with an optimal solution.
This is only hard if you don't split the vec instructions during instruction scheduling.
If you allow for a certain amount of slop on register fetch, you could think of the system as in-place building an N-wide vector for instruction issue. Any slop-over gets placed into the next vector issue. There's nothing NP-hard about that -- circular buffer with 1.x issue width.
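A minimal sketch of that "slop-over" packing, assuming each instruction is described only by how many scalar components it touches (1 to 4) and that the issue width is 64 lanes. The name `pack_issues` and the list-of-tuples layout are made up for illustration; nothing here is known about what R600 actually does.

```python
def pack_issues(widths, n=64):
    """Greedily fill n-wide issue slots with instruction components.

    widths: component count per instruction (e.g. 3 for a vec3 op).
    An instruction that does not fit in the current issue is split,
    and the remainder "slops over" into the next issue.
    Returns a list of issues; each issue is a list of
    (instruction_index, components_placed_in_this_issue).
    """
    issues = []
    current = []
    free = n
    for idx, w in enumerate(widths):
        remaining = w
        while remaining > 0:
            take = min(remaining, free)
            current.append((idx, take))
            free -= take
            remaining -= take
            if free == 0:
                # Issue is full: emit it and start the next one.
                issues.append(current)
                current = []
                free = n
    if current:
        issues.append(current)
    return issues


if __name__ == "__main__":
    # Thirty vec3 instructions: 90 components over 64-wide issues.
    for i, issue in enumerate(pack_issues([3] * 30)):
        used = sum(c for _, c in issue)
        print(f"issue {i}: {used}/64 lanes used")
```

It is a single linear pass with no search, which is the point: once splitting is allowed, every issue but the last is full and there is nothing left to optimise knapsack-style.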
If instructions and threads are getting moved around, what is the difference between 64 scalar ALUs and 16 Vec4 ALUs and 1 Vec64 ALU when all of them have only one instruction dispatcher?
The simplest algo I can come up with to construct the virt.vec suffers degradation when predication/0 rates are high. I think the way you address that is to fill each slot somewhat more selectively, although that really just moves the corner-cases elsewhere...
-Dave
Yeah, they might be made of pixie dust! :roll:
Oh gad, by the time Marketing gets done with that :!: "64 Pixie-L Shaders!" :razz:
trinibwoy
14-Dec-2006, 16:42
Every clock, the GPU's looking to fill the array of ALUs, all 64 scalars as it were.
The cost is the additional pipelines which work "asynchronously" gathering operands, figuring out predication and then scattering the results afterwards. That then complicates the register file.
Whereas G80 optimises for channel-count by going scalar, R600 optimises for both channel-count and predication with a single mechanism.
Well like you've indicated here there are two problems to solve - utilization and predication. Given the relative simplicity of the G80 approach to solving the utilization problem I'd have to guess that if ATi were to go this route it would be due to an acute infatuation with DB performance! This would be completely understandable though considering this general approach will probably last 4-5 years.
What's concerning is that while the instruction hashing, operand gathering and context management stuff sounds fantastic on paper - how much transistor logic would all this cost? Would the return on DB performance be worth it in the next year or two? This all makes G80's shader pipeline look pretty pedestrian in comparison - hard to believe it would be this easy to do something so radical.
What's concerning is that while the instruction hashing, operand gathering and context management stuff sounds fantastic on paper - how much transistor logic would all this cost? Would the return on DB performance be worth it in the next year or two?
That's been exactly my concern as well. By going wide you save on your dispatcher, but it costs in your scheduler. It's very clever, mind you. They are also, presumably, saving by going more RISCy, though there are tradeoffs to consider there as well, like Dot4 performance.
Of course, a simpler ALU would likely allow you to jack the clock up -- however, I detect some skepticism regarding 2GHz, even if 64xVec4 and ~500 GFLOPS has fair buy-in, and even if a simpler 1flop/clock/alu number would kind of imply that clock-speed ;^/
R600 clearly can't achieve perfect channel packing (vec3 :!: ) but it seems hard to argue it's more than good enough.
Why not think of the ALU array as 4xVec64 rather than 64xVec4? Without splitting instructions, you can get better than 95% packing in the worst case.
If there's a worst-case scenario it's that all of the operands take forever to read out of the register-file. A random thought to consider -- what if you use the results-writeback to reorder the register-file (by moving *only* the results register)? There's more bookkeeping to track where each of every thread's registers actually are, but, the re-ordering of the register-file winds up amortized over time, and migrates to be more coherent with instruction issue over time. It would seem to make the worst-case (all operands exist in the same "column" of the register file) less likely to persist, at least, though the bookkeeping costs look expensive to me. Hmm....
Right off the bat, this looks like a 0-1 knapsack problem to me (NP-hard), so I don't see a hardware implementation coming up with an optimal solution. Approximation schemes would have to be used. Presumably, the driver compiler would have to do the heavy lifting, BUT:
The HLSL compiler already does a lot of these optimizations within a single shader, trying to pack the SIMD units, so it seems like a huge amount of driver effort, and perhaps hardware assistance for what would probably be a marginal improvement in efficiency.
How about this:
ADD r0.xy, r1.xy
For GPUs, we assume that that op is run on half a vector ALU. Because it consists of two scalar ops. And for unified GPUs, we further assume, that this vector ALU is part of an ALU array, that has to run the same op on all the ALUs that are part of it. And we further assume that such an array consists of at least 16 ALU blocks (4 quads), so that gives us 64 scalar ALUs, of which half are active running that op. And that is, when all the pixels of the triangle that the array is assigned to are used and run the same execution path (for a pixel shader). If only half the designated screen area is covered by the triangle, we only use 16 of the 64 ALUs. And when half of them take a different branch, we're left with only 8 out of 64 active ALUs.
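The arithmetic in that example, written out as a small sketch. The function name and parameters are only illustrative; the numbers (64 scalar ALUs, an .xy write mask on a vec4 unit, half coverage, half the pixels on the other branch) are the ones from the paragraph above.

```python
from fractions import Fraction


def active_alus(total_alus, used_components, vector_width,
                coverage, same_branch):
    """Scalar ALUs doing useful work in a lock-step SIMD array.

    used_components / vector_width: share of each vector ALU the op uses
    coverage: fraction of the array's pixels covered by the triangle
    same_branch: fraction of covered pixels on the branch being executed
    """
    mask_share = Fraction(used_components, vector_width)
    return total_alus * mask_share * coverage * same_branch


if __name__ == "__main__":
    full = Fraction(1)
    half = Fraction(1, 2)
    print(active_alus(64, 2, 4, full, full))  # ADD r0.xy on a full array
    print(active_alus(64, 2, 4, half, full))  # triangle covers half
    print(active_alus(64, 2, 4, half, half))  # and half branch away
```

The three factors multiply, so each "only half" halves the useful work again: 32, then 16, then 8 of 64.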
Is this a far fetched example? Definitely not, with geometry and vertex shaders, dynamic branches and long programs. And it's the main reason for nVidia and ATi to drastically change their architectures. So, how are they going to improve on that?
The easiest solution is of course to restrict the bad cases. But DX10 says otherwise. The next solution is simply adding many more quads. Which isn't cost effective.
On further study, you might wonder why you use those 64 ALUs as a single unit. Because you can route the control signals around. If you have, say, 8 blocks of control logic, an 8 port switch and some multiplexers/demultiplexers, it isn't much more difficult to set up 8 arrays of ALUs of varying sizes. And at the same time, you're not limited to quad or even vector configurations anymore. So, you could use control block 0 to only activate the 8 ALUs it actually needs. Half an old quad.
But wait! If you do that, why stop there? Because, the control block is only interested in making sure the needed data is gathered and issuing a simple operation on them, like an ADD. And it's a safe bet that there are many more ADDs waiting in the instruction queue to be executed. So, that control block can issue all of them at once. And using register reordering, it's just ADD r0, r1 for all of them.
Ok, but dependencies, loads and stores are a problem with such a method. How do we fill the registers up front, and write all of them to their destination? But then again, it's easy to see that all current GPUs are capable of performing that trick. They can have in excess of 1024 threads running at the same time, and they manage to fill those registers when needed.
You might argue, that all those registers are at the moment coupled to a specific pixel, and that each pixel is assigned to a specific block of ALUs in a quad. So each register can be in the local storage of the ALU. A stack might work. But, if a stack works (you have to be able to address individual locations in any case), it might be just as easy to use a single stack for the whole quad. That's a 512 bit word size.
But, that creates a problem with branches, TEXKILL and other ops that change the flow. Even more so: each GPU already has a mechanism in place to distribute a whoppingly wide word of data to all ALUs that need it: when you do a texture lookup. And those are individually addressed (to the vector ALU blocks) as well!
Sure, all of that adds a lot of latency. But that's not really an issue, when you know plenty in advance and have pipelined your execution units. So the solution to the register load problem is already in place on current GPUs.
But, all that requires single, very wide words of data to be passed to all the ALUs. And we're discussing a mechanism that also requires a unified storage of all the registers. Like a many-way cache. And that has to gather the data for the very wide word to be distributed.
And that's where virtual memory comes in. You have a lot of small pages, that are all fetched in advance and of which words are distributed around the chip. In the worst case, you simply duplicate them multiple times. And you can have all the data delivered just in time.
But that wasn't your main objection. Because we have to manage all that: make sure we know what operation to execute on which ALUs, and where the needed data is. You want batches for that, that keep the execution units busy long enough, so you can orchestrate the next one in the meantime. And that's where the instruction reordering and hashing comes in.
You chop up all the active programs into independent sequences, normalize the register access, and hash them. That gives you a list of complex instructions, packed like multiple zip files that all use the same binary tree and index. And an array that specifies the actual instructions. And another (set of) array(s) that keeps track of the state information.
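A rough sketch of the "normalize the register access, and hash them" step, assuming instructions are plain strings with registers named r0, r1, ... and that two sequences count as "the same" when they match after renaming registers in order of first use. The names `normalize` and `bin_by_shape` are hypothetical, and a Python dict stands in for whatever hashing the hardware would do.

```python
import re
from collections import defaultdict

REG = re.compile(r"\br\d+")


def normalize(seq):
    """Rename registers by first appearance, so that
    'ADD r4.xy, r9.xy' and 'ADD r2.xy, r5.xy' both become
    'ADD r0.xy, r1.xy'."""
    mapping = {}

    def rename(match):
        # setdefault evaluates the new name before inserting,
        # so len(mapping) is the next free normalized index.
        return mapping.setdefault(match.group(0), f"r{len(mapping)}")

    return tuple(REG.sub(rename, instr) for instr in seq)


def bin_by_shape(programs):
    """Group instruction sequences that are identical after
    normalization. The dict hashes the normalized tuple; each bin
    could then be issued as one wide op."""
    bins = defaultdict(list)
    for name, seq in programs.items():
        bins[normalize(seq)].append(name)
    return dict(bins)


if __name__ == "__main__":
    programs = {
        "thread_a": ["ADD r4.xy, r9.xy"],
        "thread_b": ["ADD r2.xy, r5.xy"],
        "thread_c": ["MUL r1, r1"],
    }
    for shape, threads in bin_by_shape(programs).items():
        print(shape, "->", threads)
```

The side table mapping normalized names back to each thread's real registers is the state-tracking array the post mentions; it is left out here to keep the sketch short.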
So, now for the seriously interesting question: how are we going to schedule all that in an optimal way? But then again, I think that as long as your ALU load is above 50% at all times, it's already a clear win. And it shouldn't be too hard to come up with a binning and loading algorithm that does a lot better than that.
Sure it might not be optimal, but it gives a serious speed boost all the same.
;)
Rangers
14-Dec-2006, 23:09
I'd prefer if, on boards like B3D, we'd stick more to the technical analysis of upcoming hardware (speculative or not), instead of senseless and bloated synthetic benchmark scores, thank you.
Yeah, let's not even talk about how fast they are at all! Games are just bloated synthetic tests anyway! In fact, the new 8800GTX is totally useless!
Let's just do image quality comparisons all day long! It'll be so politically correct!