AMD: R8xx Speculation

curious.trier · Feb 15, 2010

Squilliam said:
If AMD are going to transition a chip onto the GF 32nm process quickly I would suggest its likely the Xbox 360 GPU. It makes the most sense as its directly applicable to getting Fusion up and running,

How in the name of God has Xenos got anything to do with LIano, Ontario

Squilliam · Feb 15, 2010

curious.trier said:
How in the name of God has Xenos got anything to do with LIano, Ontario

Not sure if theres any connection there. But we were talking about the possibility of 32nm GPUs so I figured I would offer an alternative.

Area593 · Feb 15, 2010

sebbbi said:
On our last engine I also just locked the 1x1 exposure map, and used that value as a shader constant later on. It's very fast on PC also, if you have multiple 1x1 textures to prevent lock stalling. Only the last texture is really needed, and 1x1 textures are very small (the memory overhead is really nothing).

In the engine I'm currently working on I use the same technique, minus the multiple targets, and I'm yet to find any issues with lock stalling. I render the frame pass, followed by the luminance stages, lock the texture and read the value, and then use that same value in the final pass during the same frame. The frame rates are still rock solid.

Jawed · Feb 15, 2010

nAo said:
Playing a bit with GSA I noticed that UAV reads are performed via texture fetches. The interesting thing is that since texture caches are read only and not coherent they seem to issue a cache line eviction per each UAV read, followed by a memory barrier to wait for the eviction to be completed (just before the texture fetch). I wonder why they don't simply use uncached reads.

Multiple work-items' reads should be satisfied by a single cache line.

They use it to handle UAVs global counters and counters for append/consume buffers (randomly notices while playing with GSA).

And it's the route for tessellation factor writes. GDS also has a set of atomics, just like the LDS ones.

So, in general GDS could be used as a global stack/FIFO/circular buffer just as long as 64KB is enough. I wonder if GDS is persistent across kernel invocations?

Jawed

MistaPi · Feb 15, 2010

Regarding the Anandtech article and sideport, why do you have to synchronizing the GPU's? What happens if you dont? I also believed the sideport in 4870x2 was disabled because driver optimizations made it redundant.

Kaotik · Feb 15, 2010

MistaPi said:
Regarding the Anandtech article and sideport, why do you have to synchronizing the GPU's? What happens if you dont? I also believed the sideport in 4870x2 was disabled because driver optimizations made it redundant.

At least as far as I've understood it, the microstuttering for example is a side effect from the fact that the GPUs don't run exactly synchronized

Gipsel · Feb 15, 2010

Jawed said:
This may be the abuse that you mentioned in the prior paragraph It seems you would use EXPORT_RAT_INST_XCHG_RTN to write using the atomic functionality, and then ignoring the return value. The bandwidth is low, because writes are DWords with no option to write 128-bits. No idea what kind of serialisation mechanics are in play.

And why shouldn't one try to use EXPORT_RAT_INST_STORE_TYPED? One can write up to 128bits at once with that. The ISA docs specifically mentions that it is the only instruction able to write more than a single dword to cached UAVs

Jawed · Feb 15, 2010

Gipsel said:
And why shouldn't one try to use EXPORT_RAT_INST_STORE_TYPED? One can write up to 128bits at once with that. The ISA docs specifically mentions that it is the only instruction able to write more than a single dword to cached UAVs

Is that a RW UAV or just a W UAV?

For what it's worth I suspect it is write-combined cached, not RW cached. I think this corresponds with the assembly instruction MEM_EXPORT_WRITE_IND, but I haven't fiddled around enough to know for sure. Reads from this UAV cannot be served by the data in the write cache.

The ISA documentation is quite incomplete sadly...

EDIT: I wonder if the UAV is defined as RW and EXPORT_RAT_INST_STORE_TYPED is used if the write-combining cache forwards data to the atomic cache as well as writing to memory - i.e. meeting the scenario you painted.

Jawed

Dave Baumann · Feb 15, 2010

Dave Baumann said:
With regards to Vector Adaptive Deinterlacing I'm told that it does work on 5570, however its on a later driver than reviewers had. It is also available on 5450, but not with the "Enforce Smooth Video Playback" option enabled and some of the other post processing features may need to be traded off.

And here is the Vector Adaptive Deinterlacing Driver for HD 5570: http://support.amd.com/us/kbarticles/Pages/ATI-Catalyst-driver-Radeon-HD5400.aspx

Anandtech also has an updated blog on the vide features and finds that when you run 5450 in a 1080p panel Vector Adaptive Deinterlacing does operate:

http://www.anandtech.com/weblog/showpost.aspx?i=669

Silent_Buddha · Feb 15, 2010

Alexko said:
Since 28nm is only 3 months behind 32nm and is a bulk process, I wouldn't expect AMD to bother with 32nm for GPUs... if it weren't for Llano.

Since they have to make a GPU on 32nm SOI anyway, why not make use of the experience they'll gain, and release more GPUs on 32nm? I wonder if it makes sense to do so...

The bigger question however is when they started developing whatever the next GPU is 2-3 years ago, was 28 nm a serious possibility or was 32 nm what was being marketted as what would be available in that timeframe? Unless 32 nm -> 28 nm is just a simple shrink, I would think oving to 28 nm would be non-trivial and incur quite a bit of delay. Meaning 32 nm might still be the target of the next GPU. This is, of course, assuming they weren't planning ot stick with 40 nm for 2 product cycles in the first place.

It's not exactly something you can just switch to on a whim.

Regards,
SB

MfA · Feb 15, 2010

Kaotik said:
At least as far as I've understood it, the microstuttering for example is a side effect from the fact that the GPUs don't run exactly synchronized

That doesn't mean anything (it sounds like it does, but it really doesn't).

Silent_Buddha · Feb 15, 2010

Dave Baumann said:
And here is the Vector Adaptive Deinterlacing Driver: http://support.amd.com/us/kbarticles/Pages/ATI-Catalyst-driver-Radeon-HD5400.aspx

Cool, doesn't affect me in any way, but nice to see it's available now as Newegg now has free shipping the cheapest 5450 in stock.

Off to order one.

Regards,
SB

Alexko · Feb 15, 2010

Silent_Buddha said:
The bigger question however is when they started developing whatever the next GPU is 2-3 years ago, was 28 nm a serious possibility or was 32 nm what was being marketted as what would be available in that timeframe? Unless 32 nm -> 28 nm is just a simple shrink, I would think oving to 28 nm would be non-trivial and incur quite a bit of delay. Meaning 32 nm might still be the target of the next GPU. This is, of course, assuming they weren't planning ot stick with 40 nm for 2 product cycles in the first place.

It's not exactly something you can just switch to on a whim.

Regards,
SB

I believe GF's 28nm is a half-node relative to their 32nm. That's not to say that moving from the latter to the former is trivial, but it's probably worth considering if the two processes are only 3 months apart, even if they hadn't really foreseen it when they started working on Northern Islands—but I suspect they realized it soon enough.

Squilliam · Feb 15, 2010

Its very disconcerting speculating about something like this when theres someone who posts here who probably knows the minute details of the product we're speculating about. That someone who is infact the product manager of the product we speculate about the most on the ATI side!

Such is life...

Silent_Buddha · Feb 15, 2010

Squilliam said:
Its very disconcerting speculating about something like this when theres someone who posts here who probably knows the minute details of the product we're speculating about. That someone who is infact the product manager of the product we speculate about the most on the ATI side!

Such is life...

There's a few folks from ATI with intimate knowledge of these things that post here.

Now try to imagine it from their side of things. They can't say anything about unreleased projects even if speculation is wildly wrong and possibly painting upcoming products in an unkind light. Likewise, they couldn't say anything to curb enthusiasm if speculation is wildly optimistic.

Imagine for instance if your parents were about to buy something, but you know something better is coming up, but you cannot say anything about it to them no matter what. You couldn't even tell them not to buy such and such item.

That's sort of like how leaks happen. You have someone you consider a close friend that you know you shouldn't tell anything to them, but you do anyway because they are a close friend and you trust them...

Regards,
SB

Gipsel · Feb 15, 2010

Jawed said:
Is that a RW UAV or just a W UAV?

No idea. The documentation doesn't say a word about a distinction there. And as I have no HD5000 GPU I had no chance to play around with it so far.

Jawed said:
For what it's worth I suspect it is write-combined cached, not RW cached. I think this corresponds with the assembly instruction MEM_EXPORT_WRITE_IND, but I haven't fiddled around enough to know for sure. Reads from this UAV cannot be served by the data in the write cache.

The ISA documentation is quite incomplete sadly...

EDIT: I wonder if the UAV is defined as RW and EXPORT_RAT_INST_STORE_TYPED is used if the write-combining cache forwards data to the atomic cache as well as writing to memory - i.e. meeting the scenario you painted.

Hmm, when looking at the Siggraph presentation it appears to be possible.
First, the eight 16kB segments are called "global memory R/W cache with atomics" and the block diagram shows two paths. One from the write combining buffer straigth to the memory controller, but also the possibility to go from the write combining buffer through the global memory R/W cache to the memory controller. Furthermore it is explicitly stated, that while read only loads go through the texture caches, unordered loads are served by that R/W cache. I know that this conflicts with what Micah Villmow said in the Stream Developer Forum. But he obviously referred to the current state of the OpenCL implementation and he also mentioned (in another thread) that it uses raw UAVs and not typed ones (opposed to DirectCompute which uses typed or structured UAVs), so it wouldn't work either way. I would interprete it in such a way that it should be possible for the hardware. But how to get this done, is not really clear to me.

Btw., one can specify both for all EXPORT_RAT_INST_xxx instructions, MEM_EXPORT_WRITE(_IND) as well as MEM_EXPORT_WRITE(_IND)_ACK. The latter doesn't return until the write has been carried out to memory. Looks almost like a writethrough switch, isn't it?
But you are right, the documentation is still quite incomplete in some respects. It is impossible (at least for me) to figure out how some things are supposed to work.

Squilliam · Feb 15, 2010

Silent_Buddha said:
There's a few folks from ATI with intimate knowledge of these things that post here.

Now try to imagine it from their side of things. They can't say anything about unreleased projects even if speculation is wildly wrong and possibly painting upcoming products in an unkind light. Likewise, they couldn't say anything to curb enthusiasm if speculation is wildly optimistic.

Imagine for instance if your parents were about to buy something, but you know something better is coming up, but you cannot say anything about it to them no matter what. You couldn't even tell them not to buy such and such item.

That's sort of like how leaks happen. You have someone you consider a close friend that you know you shouldn't tell anything to them, but you do anyway because they are a close friend and you trust them...

Regards,
SB

Its unfortunate for both sides I guess. So many people want to tell and so many people want to hear. I think Nvidia engineers aren't even allowed to post on online forums, so I guess at least ATI are generous in that respect.

Consider the name of this thread for instance. Every time Dave utters 'RV870' he has to pay money into a swear word jar. This very thread robs poor ATI employees of their livelihoods and likely they use that money to buy cake. Well that cake is a lie if its been bought off the blood and tears of ATI employees!

Silent_Buddha · Feb 16, 2010

Haha, looks like all the hubbub about 5450 being not being able to do vector adaptive deinterlacing that was in the original Anandtech article was due to poor testing on Anandtech's part rather than any deficiency in the card...

It turns out that the difference between rendering a 1920x1200 desktop and a 2560x1600 desktop is enough to make the 5450 start dropping frames when Vector Adaptive deinterlacing is in use. If we forced the issue by disabling ESVP, we would see a few frames get dropped in the Cheese Slices test when running at 2560x1600, but not at 1920x1200. This is actually something AMD was aware of, and as we just found out AMD has ESVP disable Vector Adaptive deinterlacing on a 2560x1600 desktop, but leaves it enabled at 1920x1200. Or in other words, the 5450 has enough compute power to do Vector Adaptive deinterlacing at HD desktop resolutions, just not at resolutions above that.

And since I'm unaware of any HDTV's with a resolution higher than 1080p (1920x1080)... Talk about a lot of people getting steamed over nothing.

Regards,
SB

FrameBuffer · Feb 16, 2010

Silent_Buddha said:
Haha, looks like all the hubbub about 5450 being not being able to do vector adaptive deinterlacing that was in the original Anandtech article was due to poor testing on Anandtech's part rather than any deficiency in the card...

And since I'm unaware of any HDTV's with a resolution higher than 1080p (1920x1080)... Talk about a lot of people getting steamed over nothing.

Regards,
SB

most considerate of you to conveniently leave out the part almost immediately following that:

With that said, there is one catch: the 5450 doesn’t have enough compute power to do any further post-processing on 1080i video besides Vector Adaptive deinterlacing. Just doing the decoding and that deinterlacing requires everything the 5450 can squeeze out, which means there’s nothing left for edge enhancement, de-noise, dynamic contrast, etc.

Good enough for some.. sure.. for someone who wants a compact, silent htpc card without having to kill most post processing at HD.. hardly and most certainly not over previous gen (with the sole exception being bitstreaming) such as the 4550 (from $19.99 to 44.99 vs 44.99 to 82.99 for 5450).

Jawed · Feb 16, 2010

Gipsel said:
First, the eight 16kB segments are called "global memory R/W cache with atomics" and the block diagram shows two paths. One from the write combining buffer straigth to the memory controller, but also the possibility to go from the write combining buffer through the global memory R/W cache to the memory controller.

Yeah, that's what prompted me to do the edit.

Furthermore it is explicitly stated, that while read only loads go through the texture caches, unordered loads are served by that R/W cache. I know that this conflicts with what Micah Villmow said in the Stream Developer Forum. But he obviously referred to the current state of the OpenCL implementation and he also mentioned (in another thread) that it uses raw UAVs and not typed ones (opposed to DirectCompute which uses typed or structured UAVs), so it wouldn't work either way. I would interprete it in such a way that it should be possible for the hardware. But how to get this done, is not really clear to me.

I can't find in the ISA a cached read that isn't using the vertex or texture caches, nor a side-effect of atomics (i.e. RTN).

Well, that's not strictly true, MEM_RD allows cached reads, but if writes and reads are in the same kernel, caching is not allowed, see the UNCACHED bit of MEM_RD_WORD_0 on page 2-58. I don't know which cache is implicated in cached reads :???:

I suspect the write-combining cache, because I interpret this to be a part of shader export functionality, and shader export functionality has to be able to send data back for more shading (e.g. vertex data is collated and sent back for pixel shading).

There is no read-path shown for the caches, i.e. the atomic RTN path isn't shown. I suspect it's from R/W cache through shader export. If that's true, then the entire path for cached MEM_RD is through colour buffer cache and write-combining cache. Or through colour buffer cache and shader export, bypassing write-combining.

With all these paths it certainly seems logical that general UAV cached RW should occur through the eight 16KB colour buffer caches. I can't think why it doesn't, but everywhere I turn in the ISA it's barred. I might be nothing more than addressing restrictions - these restrictions are like those seen in R700 and earlier GPUs where only a single UAV is available. Providing fully generic multi-UAV RW addressing/caching seems to be the crunch point, but I don't get why atomics aren't hobbled. I can only think that the atomics are stuffed into a queue for address collision resolution and the slowness here is seen as too slow for general RW.

Generally I feel this kind of stuff was a victim of the belt-tightening that resulted in a 334mm² Cypress instead of a ~480mm² one.

I expect AMD to implement something pretty much the same as Fermi's cache hierarchy. Though that has a wrinkle or two because TEX L1 and L1/shared-memory sit beside each other - similar to how Larrabee's semi-decoupled TUs have their own L1s. Since ATI L1 holds decompressed texels and is dual-purpose texture and vertex cache, this distinction might not apply.

Btw., one can specify both for all EXPORT_RAT_INST_xxx instructions, MEM_EXPORT_WRITE(_IND) as well as MEM_EXPORT_WRITE(_IND)_ACK. The latter doesn't return until the write has been carried out to memory. Looks almost like a writethrough switch, isn't it?

ACK might be solely to ensure that Sequencer can keep track of logical barriers, i.e. in order to effect a work-group memory barrier it needs to know that the writes have completed. Note also that fetches can have ACK. At the hardware thread level fetch ACK isn't required because of the clause-by-clause serialisation, but work-group scheduling can't work without ACK.

Atomics don't have ACK, they only have RTN which doesn't imply ACK (though it doesn't imply a lack of write-through either). It would make sense that atomics don't wait for write-through, after all atomics are being optimised for latency and a memory barrier can always be erected if write-through is required.

So I'm dubious it's for write-through, per se - I can't think of a scenario in which the reduced-latency for cache operations (as opposed to shader operations) associated with write-through would be useful.

But you are right, the documentation is still quite incomplete in some respects. It is impossible (at least for me) to figure out how some things are supposed to work.

Not to mention, IL documentation is lagging further behind it seems

Jawed

AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to upcoming ATI-RV870 lineup GPUs

Within 1 or 2 weeks

Within a month

Within couple months

Very late this year

Not until next year

curious.trier

Squilliam

Beyond3d isn't defined yet

Area593

Jawed

MistaPi

Kaotik

Drunk Member

Gipsel

Jawed

Dave Baumann

Gamerscore Wh...

Silent_Buddha

MfA

Silent_Buddha

Alexko

Squilliam

Beyond3d isn't defined yet

Silent_Buddha

Gipsel

Squilliam

Beyond3d isn't defined yet

Silent_Buddha

FrameBuffer

Jawed

Similar threads