Old 31-Jul-2010, 03:33   #1
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Default Three new DirectX 11 demos

Hey all,

Just wanted to throw out the links for three new DX11 demos from the "Advances in Real-Time Rendering in 3D Graphics and Games" and "Beyond Programmable Shading" courses at SIGGRAPH (slides will be posted soon). For now, you can run the demos (if you have a DX11 card) and check out the source at:

http://visual-computing.intel-resear...red_rendering/
http://visual-computing.intel-resear...ications/sdsm/
http://visual-computing.intel-resear...ications/avsm/

Enjoy!
__________________
The content of this message is my personal opinion only.

Last edited by Andrew Lauritzen; 31-Jul-2010 at 05:08.
Old 31-Jul-2010, 05:17   #2
Florin
Merrily dodgy
 
Join Date: Aug 2003
Location: The continent
Posts: 1,077
Default

Had to get the latest DX11 redist to get d3dcompiler_43.dll, and had to install the 32-bit version of the VC2010 redist (running Win7 x64).

Those shadows on sdsm are really pretty. I wonder if there's a way to get AA on the non-shadow polygons too (without CP overrides that is).

Also, I noticed weird dark spots over basic textures using Nvidia 3D glasses.

And sadly avsm.exe crashes for me:
Problem signature:
Problem Event Name: APPCRASH
Application Name: AVSM.exe
Application Version: 0.0.0.0
Application Timestamp: 4c4bb4a6
Fault Module Name: nvwgf2um.dll
Fault Module Version: 8.17.12.5721
Fault Module Timestamp: 4c0d6d9b
Exception Code: c0000005
Exception Offset: 0030d49a
OS Version: 6.1.7600.2.0.0.768.3
Locale ID: 1033
Additional Information 1: 0a9e
Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
Additional Information 3: 0a9e
Additional Information 4: 0a9e372d3b4ad19135b953a78882e789

Cheers for posting
__________________
Hemlock - When crushed, the leaves and root emit a rank, unpleasant odour
Old 31-Jul-2010, 05:21   #3
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Default

Quote:
Originally Posted by Florin View Post
Had to download [..]
Yup, it mentions those in the readme. Glad that you figured it out though, and thanks for linking them directly in the thread (note that AVSM may require the VC2008 SP1 runtime files if you don't already have them - many games install them). FWIW, all of the project files support native 64-bit builds, but we just bundled 32-bit executables for client compatibility.

Quote:
Originally Posted by Florin View Post
Those shadows on sdsm are really pretty. I wonder if there's a way to get AA on the non-shadow polygons too tho (without CP overrides that is).
Yeah, also mentioned in the readme... it uses deferred shading, so MSAA takes a bit of work (assuming you mean just normal polygon MSAA). However, it is quite possible, as the specific deferred shading demo shows; I just haven't ported that to the SDSM app yet.

Quote:
Originally Posted by Florin View Post
Also, I noticed weird dark spots over basic textures using Nvidia 3D glasses.
Yeah, no idea what the 3D stereo stuff is messing with in the app, and no way for me to test.

Quote:
Originally Posted by Florin View Post
And sadly avsm.exe crashes for me:
Also noted in the readme: it crashes the shader compiler on current NVIDIA cards (works on ATI). They know about it though and I imagine will have it fixed soon.

Last edited by Andrew Lauritzen; 31-Jul-2010 at 05:36.
Old 31-Jul-2010, 05:29   #4
Florin
Merrily dodgy
 
Join Date: Aug 2003
Location: The continent
Posts: 1,077
Thumbs up

Ooh readme.txt, right...radical concept
Old 02-Aug-2010, 19:05   #5
ShootMyMonkey
Senior Member
 
Join Date: Mar 2005
Posts: 1,157
Default

Cool. Hope the course materials go up soon as well. I was off at the OSL presentation, so I missed your part of the course and only got in around the time to see the tail end of Marco's bit. I was kind of saddened by the Uncharted piece -- not so much because it was a bad talk, but that he painted such a depressing picture of what he had to work with.
__________________
Life is veritably the exact opposite of a vacuum cleaner. Vacuums tend to suck less and less as time goes on.
Old 02-Aug-2010, 19:09   #6
Humus
Crazy coder
 
Join Date: Feb 2002
Location: Stockholm, Sweden
Posts: 3,140
Default

Very interesting stuff. Haven't had the time to look deeper into it yet, but the demos look nice.
__________________
[ Visit my site ]
I speak for myself and only myself.
Old 02-Aug-2010, 19:26   #7
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 2,365
Default

Quote:
Originally Posted by ShootMyMonkey View Post
Cool. Hope the course materials go up soon as well. I was off at the OSL presentation, so I missed your part of the course and only got in around the time to see the tail end of Marco's bit. I was kind of saddened by the Uncharted piece -- not so much because it was a bad talk, but that he painted such a depressing picture of what he had to work with.
The stuff is already up.
__________________
The views presented here are my own and do not represent my present or past employers' views in any way.
My blog
Eigen : simd done right
Old 02-Aug-2010, 19:33   #8
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco, CA
Posts: 4,210
Default

AVSM and SDSM presentations from the "Advances in real-time rendering course" will be available at the end of this week.
__________________
[my blog]
Isn't it enough to see that a garden is beautiful without having to believe that there are fairies at the bottom of it too? [Douglas Adams]
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way
Old 04-Aug-2010, 23:15   #9
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Default

The remaining presentations (SDSM and AVSM) are now available at the above links as well as a video for SDSM. Enjoy!
Old 11-Aug-2010, 11:49   #10
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,257
Default

In the SDSM presentation, slide 24, the histogram algorithm performs vastly slower on GTX480 than on HD 5870, 7.2ms versus 1.4ms. What the hell?

This is only a shared memory atomic operation, isn't it? Not a global atomic? (Notes on that slide imply it is shared.)

One of the peculiar things about the atomics in OpenCL (1.0 and 1.1) is that they always return a value, whereas the D3D11 atomics make a return value optional. (I presume the SDSM algorithm ignores the return value.)

http://msdn.microsoft.com/en-us/libr...(v=VS.85).aspx

Why not offer both in OpenCL?

Do NVidia atomics always return a value, in hardware? Is that the motivation for the OpenCL spec making the return not optional?

Is the performance problem really due to atomics on NVidia?
__________________
Sweet-spot + tick-tock = monster
Old 11-Aug-2010, 18:09   #11
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Default

Quote:
Originally Posted by Jawed View Post
In the SDSM presentation, slide 24, the histogram algorithm performs vastly slower on GTX480 than on HD 5870, 7.2ms versus 1.4ms. What the hell?
Yeah, that was my reaction as well. Incidentally, this is what particularly motivated writing the fast reduction path.

Quote:
Originally Posted by Jawed View Post
This is only a shared memory atomic operation, isn't it? Not a global atomic?
Correct. These are very fast on ATI (almost full-speed if they don't conflict!) but apparently less-so on NVIDIA. Indeed they are also the atomics that do not return values so no latency hiding is needed and they can theoretically be fully pipelined.

Quote:
Originally Posted by Jawed View Post
Why not offer both in OpenCL?
Good question, unless they assume that the compilers will detect when the return value is unused.

Quote:
Originally Posted by Jawed View Post
Is the performance problem really due to atomics on NVidia?
I can only assume so as that shader isn't doing much else. Check out the source code if you want the details but it is literally just a pile of local atomics followed by a few global atomics (these turn out to be faster than writing out all the local histograms and then reducing them). I played with removing the global atomics on NVIDIA but it made very little performance difference... it appears to be the local atomics in particular that are much slower than on ATI.
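For anyone who wants the shape of that shader without opening the source, here is a minimal CPU-side sketch in Python. It is not the demo's code: `tiled_histogram`, `tile_size`, the [0, 1] depth range and the "skip empty bins" merge are all assumptions made for illustration. The inner `+=` stands in for a local (shared memory) atomic add whose return value is ignored, and the merge loop stands in for the few global atomics.

```python
def tiled_histogram(depths, num_bins, tile_size):
    """Two-level histogram: per-tile local bins, then a merge into global bins."""
    global_bins = [0] * num_bins
    for start in range(0, len(depths), tile_size):
        # One tile per "core"; local_bins stands in for shared/local memory.
        local_bins = [0] * num_bins
        for z in depths[start:start + tile_size]:
            # Depth is assumed to be normalised to [0, 1].
            b = min(int(z * num_bins), num_bins - 1)
            local_bins[b] += 1  # local atomic add, return value unused
        for b, count in enumerate(local_bins):
            if count:  # assumption: only touched bins are merged
                global_bins[b] += count  # global atomic add
    return global_bins
```

Usage: `tiled_histogram([0.0, 0.1, 0.1, 0.9, 1.0], num_bins=4, tile_size=2)` bins the five samples into four buckets, two samples per tile. The sketch only shows the data flow; it says nothing about why one vendor's local atomics are slower than the other's.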

You can imagine that this and a number of other issues that are now exposed by these fairly low-level compute models complicate writing performance-portable code a lot...
Old 11-Aug-2010, 19:11   #12
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,257
Default

Quote:
Originally Posted by Andrew Lauritzen View Post
Yeah, that was my reaction as well. Incidentally, this is what particularly motivated writing the fast reduction path.
Pretty lucky with how it worked out in the end...

Quote:
Correct. These are very fast on ATI (almost full-speed if they don't conflict!) but apparently less-so on NVIDIA. Indeed they are also the atomics that do not return values so no latency hiding is needed and they can theoretically be fully pipelined.
I guess NVidia is doing local atomics as math in the ALUs, which would have the same speed for return/no-return value.

What I don't get is why there's such a severe slowdown when shared memory banking should mean that non-colliding addresses are often full speed (purely a question of bank conflicts - though Fermi architecture is more susceptible to bank conflicts than GT200).

Quote:
Good question, unless they assume that the compilers will detect when the return value is unused.
Ah yes, that would work.

Quote:
I can only assume so as that shader isn't doing much else. Check out the source code if you want the details but it is literally just a pile of local atomics followed by a few global atomics (these turn out to be faster than writing out all the local histograms and then reducing them). I played with removing the global atomics on NVIDIA but it made very little performance difference... it appears to be the local atomics in particular that are much slower than on ATI.
Back here:

http://forum.beyond3d.com/showthread...37#post1305337

when we had quite a bit of fun with the treacherous "Dickens word count" histogram, you were alluding to non-random distributions of data causing severe problems for GPU algorithms that depend on some form of software-managed cache.

I suppose this is that application: depth is generally not random at all. I suppose your algorithm lops bits of precision off Z so that there aren't too many bins, resulting in a huge collision rate, and disastrous slow down on NVidia.
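To make that guess concrete, here is a small Python sketch of how coarse depth bins turn coherent depth into same-bin collisions inside one SIMD group. The names (`collisions_per_group`, `simd_width`), the [0, 1] depth range and the multiply-and-truncate quantisation are assumptions for illustration only, not what the SDSM source necessarily does.

```python
def collisions_per_group(depths, num_bins, simd_width):
    """For each group of simd_width lanes, count lanes that hit a bin
    already targeted by another lane in the same group."""
    result = []
    for start in range(0, len(depths), simd_width):
        group = depths[start:start + simd_width]
        # Quantise: dropping precision maps nearby depths to the same bin.
        bins = [min(int(z * num_bins), num_bins - 1) for z in group]
        result.append(len(bins) - len(set(bins)))
    return result
```

With four bins and groups of four lanes, a run of neighbouring depths such as 0.50 to 0.53 all lands in one bin (three colliding lanes), while a spread-out group such as 0.1, 0.4, 0.6, 0.9 does not collide at all.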

So this poor atomic performance might be something of a corner case in that respect. Overall, atomics are still a little too new on GPUs to really know...

Quote:
You can imagine that this and a number of other issues that are now exposed by these fairly low-level compute models complicate writing performance-portable code a lot...
Yeah it's a great example. Actually, it's the kind of thing that could persuade developers to ignore atomics entirely, which seems like quite a shame.

Jawed
Old 11-Aug-2010, 19:31   #13
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Default

Quote:
Originally Posted by Jawed View Post
Pretty lucky with how it worked out in the end...
Yup although it was really the combination of that and finding the robustness issues with the more complex schemes (that require the full histogram) that made the reduce path the clear winner.

Quote:
Originally Posted by Jawed View Post
I guess NVidia is doing local atomics as math in the ALUs, which would have the same speed for return/no-return value.
They can't be doing lane collision analysis in the ALUs, can they? That would be crazy-expensive! Alternatively are you implying that they can somehow "lock" memory banks over multi-instruction sequences and then loop in the core to serialize conflicts?

Quote:
Originally Posted by Jawed View Post
I suppose this is that application: depth is generally not random at all. I suppose your algorithm lops bits of precision off Z so that there aren't too many bins, resulting in a huge collision rate, and disastrous slow down on NVidia.
Yes, precisely. In this case there are two problems:
1) Collisions. They are super-common with this data set. It's a big win if you can lane-swizzle in your SIMD and do some cheap address comparisons to handle the majority of these cases and serialize the rest as appropriate. Unfortunately none of the current programming models have a lane swizzling mechanism because they all try to pretend it's not SIMD.
2) Bin distribution for a spatial region. Each core is handling a tile of the screen and in that tile (or even the whole distribution sometimes) only a small number of bins are touched. Statically allocating local memory for the whole local histogram here and incidentally reducing the number of wavefronts in flight due to this is just a bad model... you want a proper cache. I would implement this by relying on Fermi's L1$ but I don't believe their global atomics are fully cached... i.e. just the switch from local to global atomics would probably slow it down a ton, even if there's no data-sharing between cores at all.
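A rough Python sketch of the idea in point 1, under stated assumptions: lanes in one SIMD group that target the same bin are merged first, so only one add per distinct bin is issued. On real hardware this would be done with lane swizzles and address comparisons; the dictionary here merely stands in for that, and it merges every duplicate rather than "the majority" with a serialised remainder. `combine_lanes` and `histogram_with_lane_combining` are hypothetical names.

```python
def combine_lanes(bin_per_lane):
    """Merge lanes that target the same bin into {bin: count}."""
    combined = {}
    for b in bin_per_lane:
        combined[b] = combined.get(b, 0) + 1
    return combined


def histogram_with_lane_combining(bins, num_bins, simd_width):
    """Build a histogram, issuing one add per distinct bin per SIMD group.

    Returns the histogram and the number of adds issued."""
    hist = [0] * num_bins
    adds_issued = 0
    for start in range(0, len(bins), simd_width):
        group = bins[start:start + simd_width]
        for b, count in combine_lanes(group).items():
            hist[b] += count  # one (atomic) add covers all colliding lanes
            adds_issued += 1
    return hist, adds_issued
```

For eight samples in groups of four, a fully colliding group costs one add instead of four, which is the win being described; a group with no collisions costs the same as before.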

Note that this was disastrously worse when we had to use vertex scatter + ROPs back in the pre-DX11 days... collisions in that pipeline were ridiculously expensive since they amounted to global atomics even with some clever use of multiple accumulation bins + reductions. Local atomics are the only thing that make this sort of algorithm feasible at all on GPUs but we still have a ways to go in the hardware.

Quote:
Originally Posted by Jawed View Post
So this poor atomic performance might be something of a corner case in that respect. Overall, atomics are still a little too new on GPUs to really know...
Yup, definitely possible as it is a fairly special data set, but not an abnormal one for graphics.

Quote:
Originally Posted by Jawed View Post
Yeah it's a great example. Actually, it's the kind of thing that could persuade developers to ignore atomics entirely, which seems like quite a shame.
Yup, we'll see. I would obviously prefer that people rise to the bar that ATI has set though.

Last edited by Andrew Lauritzen; 11-Aug-2010 at 19:39.
Old 11-Aug-2010, 19:39   #14
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco, CA
Posts: 4,210
Default

If your programming model doesn't explicitly expose horizontal ops and you have to do interlane/interthread communication vertically, your hardware and/or compiler better be very good at local atomics.
Old 11-Aug-2010, 20:55   #15
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,257
Default

Quote:
Originally Posted by Andrew Lauritzen View Post
They can't be doing lane collision analysis in the ALUs, can they? That would be crazy-expensive! Alternatively are you implying that they can somehow "lock" memory banks over multi-instruction sequences and then loop in the core to serialize conflicts?
Well, it is slow...

I guess someone could analyse this stuff in detail. Different patterns exploiting either address conflicts or bank conflicts, in varying counts per hardware thread, would enable a decent characterisation.
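As a starting point for that kind of characterisation, here is a Python sketch that classifies an access pattern by its two separate conflict types: several lanes on the same address, and several distinct addresses on the same bank. It assumes banks are interleaved per 32-bit word (bank = word address mod bank count); the bank count is a parameter, and `conflict_degrees` is a hypothetical helper, not a vendor tool. It predicts nothing about timings; it only labels the patterns you would then time on real hardware.

```python
from collections import Counter


def conflict_degrees(word_addresses, num_banks):
    """Return (max lanes sharing one address,
               max distinct addresses sharing one bank)."""
    same_address = max(Counter(word_addresses).values())
    addresses_by_bank = {}
    for address in set(word_addresses):
        # Assumption: banks are interleaved one 32-bit word at a time.
        addresses_by_bank.setdefault(address % num_banks, set()).add(address)
    same_bank = max(len(group) for group in addresses_by_bank.values())
    return same_address, same_bank
```

With 32 lanes and 32 banks, `[5] * 32` is a pure address conflict, `[i * 32 for i in range(32)]` is a pure bank conflict, and `list(range(32))` has neither, so sweeping between those extremes separates the two effects.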

Quote:
Yes, precisely. In this case there are two problems:
1) Collisions. They are super-common with this data set. It's a big win if you can lane-swizzle in your SIMD and do some cheap address comparisons to handle the majority of these cases and serialize the rest as appropriate. Unfortunately none of the current programming models have a lane swizzling mechanism because they all try to pretend it's not SIMD.
I don't think that's strictly true. Shared memory enables lane-swizzling, though obviously with latency.

Quote:
2) Bin distribution for a spatial region. Each core is handling a tile of the screen and in that tile (or even the whole distribution sometimes) only a small number of bins are touched. Statically allocating local memory for the whole local histogram here and incidentally reducing the number of wavefronts in flight due to this is just a bad model... you want a proper cache. I would implement this by relying on Fermi's L1$ but I don't believe their global atomics are fully cached... i.e. just the switch from local to global atomics would probably slow it down a ton, even if there's no data-sharing between cores at all.
GPU matrix multiplication is a brilliant example of a very harsh lesson: it looks insanely, almost-trivially, parallelisable but it's taken years for optimal algorithms to appear on NVidia and ATI - and the algorithms are quite different, too.

I'm not saying there's a way to make atomics vastly faster on GF100 for this kind of data/bin configuration - merely that it's early days.

Before, we saw that there was a huge speed-up on x86 with the Dickens histogram by carefully fine-tuning for the cache architecture. Having a "proper cache" didn't obviate this optimisation at all...

Quote:
Yup, definitely possible as it is a fairly special data set, but not an abnormal one for graphics.
Yeah. It does seem to me that Microsoft (or whoever) had its head screwed on with the optional return value thing.
Old 11-Aug-2010, 21:32   #16
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Default

Quote:
Originally Posted by Jawed View Post
I don't think that's strictly true. Shared memory enables lane-swizzling, though obviously with latency.
Not only latency but it requires a barrier which can't be used inside "potentially divergent" control flow. It's not the same.

Quote:
Originally Posted by Jawed View Post
GPU matrix multiplication is a brilliant example of a very harsh lesson: it looks insanely, almost-trivially, parallelisable but it's taken years for optimal algorithms to appear on NVidia and ATI - and the algorithms are quite different, too.
Sure, but this points to the fact that the value of a portable abstraction layer (which carries penalties) is diminishing rapidly if you need to use different algorithms on different hardware. This makes it all-the-more stupid that our languages are handcuffing our ability to write optimized code based on so-called "portability".

On a separate note, I'd be really happy if someone could come up with a better algorithm to solve the SDSM histogram problem. It's not like I just tried the one thing... I tried the full gamut of different ways to do it and all were as slow or slower. Hopefully a faster algorithm will "emerge", but that's not doing me a lot of good today.

Quote:
Originally Posted by Jawed View Post
I'm not saying there's a way to make atomics vastly faster on GF100 for this kind of data/bin configuration - merely that it's early days.
Sure, and it wouldn't be a problem if ATI hadn't already demonstrated that it can be done very well. It's all relative.

Quote:
Originally Posted by Jawed View Post
Before, we saw that there was a huge speed-up on x86 with the Dickens histogram by carefully fine-tuning for the cache architecture. Having a "proper cache" didn't obviate this optimisation at all...
Not sure what you're trying to say exactly... yes obviously caches need tuning just like everything else but the static allocation/partitioning of resources model that GPUs use right now is simply not flexible enough for cases like this that have a small but *data-dependent* working set.
Old 11-Aug-2010, 23:03   #17
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,257
Default

Quote:
Originally Posted by Andrew Lauritzen View Post
Not only latency but it requires a barrier which can't be used inside "potentially divergent" control flow. It's not the same.
Not all hardware has that programming restriction. Yet another dislocation between the reality of hardware and the language. (And if you get stuck in an infinite loop it's your own fault!)

Quote:
Sure, but this points to the fact that the value of a portable abstraction layer (which carries penalties) is diminishing rapidly if you need to use different algorithms on different hardware. This makes it all-the-more stupid that our languages are handcuffing our ability to write optimized code based on so-called "portability".

I think there are two conflicting things going on here:
  1. Learning by IHVs and API bodies - what works, what do developers want, and when does the transistor budget allow it at a useful performance level
  2. Abstraction - to stand a chance of using it
One of the advantages of the current abstractions is that they aren't imposing a low-level schema that we'll be stuck with for ever after. Currently we're not witnessing a build up akin to x86 cruft in the hardware.

The price for these exciting times (accompanied by fame for originating the cool stuff) is a hell of a lot of bootstrapping. SDSM looks fuck-off cool, by the way.

Quote:
On a separate note, I'd be really happy if someone could come up with a better algorithm to solve the SDSM histogram problem. It's not like I just tried the one thing... I tried the full gamut of different ways to do it and all were as slow or slower. Hopefully a faster algorithm will "emerge", but that's not doing me a lot of good today.
If one does turn up it'll be annoyingly obvious looking.

Quote:
Sure, and it wouldn't be a problem if ATI hadn't already demonstrated that it can be done very well. It's all relative.
If it turns out NVidia is not using the ALUs to do this, then I have to say I've not got the foggiest why it's so slow. Shared memory has the bank count and bandwidth in raw terms to be in the same ball park as ATI. That's what's so disturbing.

Quote:
Not sure what you're trying to say exactly... yes obviously caches need tuning just like everything else but the static allocation/partitioning of resources model that GPUs use right now is simply not flexible enough for cases like this that have a small but *data-dependent* working set.
Well at least NVidia has taken a step in this direction with L1$. Global atomics are nearly useless in GT200, but the ball is rolling. etc.

Do you have some other histogram-based algorithm that's really slow on ATI because it's a poor fit for the architecture? You must be working on something else... Guess you can't say.
Old 12-Aug-2010, 00:17   #18
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Default

Quote:
Originally Posted by Jawed View Post
Not all hardware has that programming restriction. Yet another dislocation between the reality of hardware and the language. (And if you get stuck in an infinite loop it's your own fault!)
Sure, but that's my point: it's precisely due to the languages trying to "abstract" over SIMD widths. Obviously if you had a purely MIMD machine then lane swizzling *wouldn't* be free or even make much sense. However the reality is we can probably put a reasonably minimal bound on the minimum SIMD widths of these GPUs... 4 is practically guaranteed and even 16 is probably reasonable.

Similar programming model problems arise when trying to implement persistent threads in current APIs.

Quote:
Originally Posted by Jawed View Post
One of the advantages of the current abstractions is that they aren't imposing a low-level schema that we'll be stuck with for ever after. Currently we're not witnessing a build up akin to x86 cruft in the hardware.
Debatable... we're already seeing legacy problems with code compiled to static shared memory sizes. Equivalently the most efficient code chooses block sizes related to the number of cores on the GPU. These are significant problems going forward.

Quote:
Originally Posted by Jawed View Post
The price for these exciting times (accompanied by fame for originating the cool stuff) is a hell of a lot of bootstrapping. SDSM looks fuck-off cool, by the way.
Glad you like it. To be clear, I'm not saying that the GPU models are a failure or anything - obviously this stuff would only be possible with them! I'm more commenting on the things that keep coming up and appear to be problematic moving forward... just stuff that I think we should be focusing our innovation and research efforts on improving. I definitely do not think the current GPU computing models are near an end state and I think most people would agree with that.

Quote:
Originally Posted by Jawed View Post
If one does turn up it'll be annoyingly obvious looking.
I sure hope so! In my presentation I unsubtly hinted to the audience to find me a faster implementation, so maybe with all those smart heads there and a touch of motivation/competition we'll get something.

Quote:
Originally Posted by Jawed View Post
Well at least NVidia has taken a step in this direction with L1$. Global atomics are nearly useless in GT200, but the ball is rolling. etc.
Agreed. I think caches are going to be absolutely necessary going forward and I imagine they will be better-integrated with atomics in the future as well. It's certainly quite possible to do well in hardware.

Quote:
Originally Posted by Jawed View Post
Do you have some other histogram-based algorithm that's really slow on ATI because it's a poor fit for the architecture?
Not really doing any other histogram-related stuff right now... I didn't mean to imply that I was. I've been doing mostly deferred shading stuff lately (as per the other demo), which brings out its own interesting hardware and software puzzles. Check my presentation and demo above for the initial batch, although there's nothing quite as drastic as the 4x perf difference in the SDSM histogram path (although frankly I do find the MSAA scheduling results to be pretty interesting in terms of programming models and hardware scheduling moving forward).
Old 12-Aug-2010, 10:19   #19
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,257
Default

Quote:
Originally Posted by Andrew Lauritzen View Post
Sure, but that's my point: it's precisely due to the languages trying to "abstract" over SIMD widths. Obviously if you had a purely MIMD machine then lane swizzling *wouldn't* be free or even make much sense. However the reality is we can probably put a reasonably minimal bound on the minimum SIMD widths of these GPUs... 4 is practically guaranteed and even 16 is probably reasonable.
LRBni doesn't provide arbitrary lane swizzle, does it? Swizzles are restricted to quad-lane neighbourhoods. Scatter/gather via a cache line is required.

As for SIMD width, 16 looks like a safe minimum for a few years yet.

Quote:
Similar programming model problems arise when trying to implement persistent threads in current APIs.
I dare say that seems likely to be an explicit feature of D3D12. Fingers-crossed.

Quote:
Debatable... we're already seeing legacy problems with code compiled to static shared memory sizes. Equivalently the most efficient code chooses block sizes related to the number of cores on the GPU. These are significant problems going forward.
I was referring to cruft in the hardware implementation.

Cruft in the software is an on-going problem. The size of shared memory is just one variable out of many that results in "over-optimisation" for today's hardware. Cache-line size, L1$ size, SIMD width, register file size, DDR burst length and DDR banking are some others.

NVidia, with CUDA, has attempted to obfuscate some parameters to prevent "over-optimisation". Guaranteeing warp size of 32 for the foreseeable future is good, though shared memory in Fermi (or at least GF100) has new bank conflict issues.

Even when hardware parameters are obfuscated developers are liable to dig, resulting in potential "over-optimisation".

Quote:
Glad you like it. To be clear, I'm not saying that the GPU models are a failure or anything - obviously this stuff would only be possible with them! I'm more commenting on the things that keep coming up and appear to be problematic moving forward... just stuff that I think we should be focusing our innovation and research efforts on improving. I definitely do not think the current GPU computing models are near an end state and I think most people would agree with that.
I dare say game developers are mostly still trying to catch up with D3D10(.1) and so their input on what's needed for D3D12 is limited. Obviously there are people at the cutting edge and maybe it's best that there's only a few of you stirring the pot.

Quote:
I sure hope so! In my presentation I unsubtly hinted to the audience to find me a faster implementation, so maybe with all those smart heads there and a touch of motivation/competition we'll get something.

I still haven't looked at your code, but these are three generic ideas:
  1. hash by SIMD lane - this is similar to the SR (globally shared register) technique that I referred to before on ATI.

    e.g. with a maximum of 2048 bins and shared memory capacity of 32KB, you can hash by a factor of 16. The obvious key is (absolute work item ID & 15). So you can do (ZCoarse << 4) + (WorkItemID & 15) to generate a collision-free atomic.

    Obviously with more bins you'd hash by a smaller factor and so would suffer some collisions.

  2. tile by work item - generally you should be fetching multiple samples per work item in a coherent tile, between atomics.

    There are two benefits here. First the hardware prefers to fetch coherently (e.g. gather4) and ATI likes to fetch wodges (i.e. 16x gather4) and has the register file capacity to do so. Second by tiling like this you automatically serialise ZBuckets that, being neighbours in 2D, are likely to collide.

    e.g. each work item fetches an 8x8 tile.

  3. scatter work items - to improve the serialisation of ZCoarse (if 1 and 2 don't eliminate collisions), you can make each work item discontiguously sample, reducing the chances of collisions amongst neighbouring work items.

    e.g. if you are doing 8x8 tiles per work item from technique 2, then work item 1 is offset from work item 0 along the diagonal, starting at 8,8, work item 2 starting at 16,16 etc.
Undoubtedly the different cards will want different tunings for techniques 2 and 3. 2 and 3 need to be balanced against each other, with 2 improving cache coherency while 3 spoils it.
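
To make techniques 1 and 3 concrete, here is a minimal CPU-side sketch in Python. It is not taken from the demo source: the function names are hypothetical, a plain list stands in for shared memory, and the ordinary `+=` stands in for what would be an atomic add on the GPU.

```python
# Sketch of technique 1 (hash by SIMD lane) and technique 3 (diagonal tiles).
# All names are hypothetical; this only illustrates the indexing.
LANE_BITS = 4                  # hash factor of 16, as in the example above
LANES = 1 << LANE_BITS
TILE = 8                       # 8x8 tile per work item, as in technique 2

def slot(z_coarse, work_item_id):
    """Private counter slot for this (bin, lane) pair."""
    return (z_coarse << LANE_BITS) + (work_item_id & (LANES - 1))

def build_histogram(samples, num_bins):
    """samples: iterable of (work_item_id, z_coarse) pairs."""
    counters = [0] * (num_bins * LANES)          # stands in for shared memory
    for work_item_id, z_coarse in samples:
        counters[slot(z_coarse, work_item_id)] += 1   # an atomic add on a GPU
    # Final reduction: fold the 16 per-lane copies of each bin together.
    return [sum(counters[b * LANES:(b + 1) * LANES]) for b in range(num_bins)]

def tile_origin(work_item_id):
    """Technique 3: each work item's tile starts further along the diagonal."""
    return (TILE * work_item_id, TILE * work_item_id)

# 16 neighbouring work items that all hit bin 5 land in 16 distinct slots.
slots = {slot(5, wid) for wid in range(16)}
hist = build_histogram([(wid, 5) for wid in range(16)], num_bins=8)
```

The price of the lane hash is the 16x larger counter array and the extra reduction pass at the end, which is why it only works while the replicated bins still fit in shared memory.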

Quote:
Not really doing any other histogram-related stuff right now... I didn't mean to imply that I was. I've been doing mostly deferred shading stuff lately (as per the other demo) which brings out its own interesting hardware and software puzzles. Check my presentation and demo above for the initial batch, although there's nothing quite as drastic as the 4x perf difference in the SDSM histogram path (although frankly I do find the MSAA scheduling results to be pretty interesting in terms of programming models and hardware scheduling moving forward).
I've only had a quick look at that...
__________________
Sweet-spot + tick-tock = monster
Old 12-Aug-2010, 18:55   #20
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Default

Quote:
Originally Posted by Jawed View Post
LRBni doesn't provide arbitrary lane swizzle, does it? Swizzles are restricted to quad-lane neighbourhoods.
Swizzles to the quads are free, but you can shuffle the 128-bit blocks using an instruction. That's all you need for horizontal reductions/address comparisons. Furthermore even if it did not, any swizzle neighbourhood (even 4) is better than none.

Quote:
Originally Posted by Jawed View Post
I dare say that seems likely to be an explicit feature of D3D12. Fingers-crossed.
I'm not as convinced that there's quite enough buy-in on this yet, but maybe OpenCL's first pass at a task system will motivate some innovation in that space.

Quote:
Originally Posted by Jawed View Post
I was referring to cruft in the hardware implementation.
Sure, although it's worth noting that increasing cache sizes behaves better with legacy code than scratch pad memory (which just goes unused).

Quote:
Originally Posted by Jawed View Post
Even when hardware parameters are obfuscated developers are liable to dig, resulting in potential "over-optimisation".
I can buy the "over-optimization" argument on CPUs where you're talking about single-digit % increases in a lot of cases (sometimes more, but on balance) but on GPUs you're often talking about an order of magnitude... that's too much performance to leave on the floor for "portability".

Quote:
Originally Posted by Jawed View Post
e.g. with a maximum of 2048 bins and shared memory capacity of 32KB, you can hash by a factor of 16. The obvious key is (absolute work item ID & 15). So you can do (ZCoarse << 4) + (WorkItemID & 15) to generate a collision-free atomic.
Right but the bins aren't just sums - they are 7 32-bit values each! There's not enough shared memory to amplify them even 2x at the moment. You might be able to pack a few things into half-floats and get a 2x spread but you're definitely not going to get to one per SIMD lane!

Of course the large majority of these bins don't get touched for a given tile so if you only had a cache...
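
A rough budget check shows why the replication does not fit. This is only a sketch: the 32KB capacity and the 7 values per bin come from the thread, but the bin count of 1024 is an assumption chosen for illustration (it is not stated here), and `max_copies` is a hypothetical helper.

```python
SHARED_BYTES = 32 * 1024        # shared memory capacity quoted in the thread
VALUES_PER_BIN = 7              # "7 32-bit values each"

def max_copies(num_bins, bytes_per_value):
    """How many full copies of the histogram fit in shared memory."""
    return SHARED_BYTES // (num_bins * VALUES_PER_BIN * bytes_per_value)

NUM_BINS = 1024                 # assumption for illustration only

copies_32bit = max_copies(NUM_BINS, 4)   # full 32-bit values
copies_16bit = max_copies(NUM_BINS, 2)   # everything packed to 16 bits
```

Under that assumption one copy fits at 32 bits per value and two fit only if every value is packed down to 16 bits, which is more aggressive than packing "a few things" into half-floats. Either way it is far short of one copy per SIMD lane.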

Quote:
Originally Posted by Jawed View Post
2. tile by work item - generally you should be fetching multiple samples per work item in a coherent tile, between atomics.
Yes mostly already done. Tile sizes are decoupled from the compute domain and chosen so that there are ~ as many tiles as required to just fill the GPU. This is important to minimize global writes/atomics traffic at the end of each work group.

I played with strided vs "linear" lookups across the thread group but the latter were generally faster. If NVIDIA's coalescing logic remains the same then the latter will definitely be faster. I haven't played with using gather4 explicitly though... it's quite an annoying programming model - if they want the access like that they have free rein to reorganize the work items in the group.

The lookup from the Z-buffer is *definitely* not the bottleneck though so I'm hesitant to try and optimize this much more.

Quote:
Originally Posted by Jawed View Post
3. scatter work items - to improve the serialisation of ZCoarse (if 1 and 2 don't eliminate collisions), you can make each work item discontiguously sample, reducing the chances of collisions amongst neighbouring work items.
Yup I played with more random sampling to reduce collisions and it does work but with one huge problem: you can't have a massive performance falloff in the case where EVERY pixel on the screen collides. Put another way, worst case performance is what matters, not making the easier cases faster. This is a key point for game developers and one that I've heard often. Thus I haven't put a lot of effort into making the fast cases faster.
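
As a toy illustration of the worst-case point, the sketch below uses a deliberately simple cost model: a SIMD-wide atomic takes as many passes as the largest number of lanes that target the same address. That model, the 64-wide group and the name `atomic_passes` are all assumptions for illustration, not measurements of any real hardware.

```python
from collections import Counter

def atomic_passes(addresses):
    """Passes needed if lanes hitting the same address are serialised."""
    return max(Counter(addresses).values())

WAVE = 64                                   # hypothetical SIMD width

best = atomic_passes(range(WAVE))           # every lane hits a different bin
worst = atomic_passes([7] * WAVE)           # every lane hits the same bin
```

Scattering the samples moves typical frames towards `best`, but a frame where everything sits at one depth still lands on `worst`, and that is the case that sets the frame budget.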
__________________
The content of this message is my personal opinion only.
Old 13-Aug-2010, 00:54   #21
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,257
Default

Quote:
Originally Posted by Andrew Lauritzen View Post
Swizzles to the quads are free, but you can shuffle the 128-bit blocks using an instruction. That's all you need for horizontal reductions/address comparisons. Furthermore even if it did not, any swizzle neighbourhood (even 4) is better than none.
Sigh, I totally forgot about shuffle - I remember being excited about that, once.

Quote:
I'm not as convinced that there's quite enough buy-in on this yet, but maybe OpenCL's first pass at a task system will motivate some innovation in that space.
Well both GPUs support concurrent compute kernels now (still pretty fuzzy on what that really means on ATI, though) so if it isn't in 12, well there's always LRB...

Quote:
Sure, although it's worth noting that increasing cache sizes behaves better with legacy code than scratch pad memory (which just goes unused).
I agree generally.

Quote:
I can buy the "over-optimization" argument on CPUs where you're talking about single-digit % increases in a lot of cases (sometimes more, but on balance) but on GPUs you're often talking about an order of magnitude... that's too much performance to leave on the floor for "portability".
The learning curve is still too steep even for the hardware designers (local atomics and tessellation being great examples, currently), so I think this needs repeated revisiting. (Then there's the physics of what's buildable, which I think is why tessellation is poor in ATI currently - not sure what the deal is with local atomics in NVidia.)

Some of Intel's Terascale work looks more like Transputer than Larrabee (maybe that's me in wishful thinking mode) and there's quite a vocal contingent who think the cache architecture of Larrabee isn't viable in the long term, where we're talking hundreds and thousands of cores.

So if we're really going to talk about programming models that can last longer than 10 years, then at best one can only hope for "local cache", whatever that actually means when programming 1000 cores.

Quote:
Right but the bins aren't just sums - they are 7 32-bit values each!
Well my posting this morning was the summation of my thoughts as I fell asleep last night before having a proper look at your code.

Quote:
There's not enough shared memory to amplify them even 2x at the moment. You might be able to pack a few things into half-floats and get a 2x spread but you're definitely not going to get to one per SIMD lane!
You reported about 1.4ms to compute the histogram on ATI, as I understand it (I've no idea what kind of overhead that has...). That's about 380 million cycles. For about 2 million samples? (Or is this 4xMSAA, 8 million samples?)

So around 200 cycles (or 50 for 4xMSAA?) per sample?

The inner loop is 31 cycles according to GPU Shader Analyzer.
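
For anyone checking the arithmetic, here it is spelled out. The 1.4ms, ~2 million samples and 31-cycle inner loop are the figures from this thread; the 850MHz clock and the 20 cores x 16 lanes are assumptions about the ATI part being discussed, and "cycles" is read as lane-cycles summed over the whole chip.

```python
CLOCK_HZ = 850e6            # assumed engine clock
LANES = 20 * 16             # assumed: 20 SIMD cores x 16 lanes each
HIST_SECONDS = 1.4e-3       # reported histogram time
SAMPLES = 2e6               # ~2 million samples, no MSAA
INNER_LOOP_CYCLES = 31      # figure quoted from GPU Shader Analyzer

lane_cycles = HIST_SECONDS * CLOCK_HZ * LANES      # about 380 million
cycles_per_sample = lane_cycles / SAMPLES          # about 190
slowdown = cycles_per_sample / INNER_LOOP_CYCLES   # about 6x over the ideal
```

With 8 million samples (4xMSAA) the same sum would give roughly 48 cycles per sample, which is where the "50" above comes from.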

Quote:
Yes mostly already done. Tile sizes are decoupled from the compute domain and chosen so that there are ~ as many tiles as required to just fill the GPU. This is important to minimize global writes/atomics traffic at the end of each work group.
Yeah, I was assuming a "persistent" kernel.

Quote:
I played with strided vs "linear" lookups across the thread group but the latter were generally faster. If NVIDIA's coalescing logic remains the same then the latter will definitely be faster. I haven't played with using gather4 explicitly though... it's quite an annoying programming model - if they want the access like that they have free rein to reorganize the work items in the group.
As I haven't done this stuff for real I don't understand the issue here. But I have a feeling I'm seeing coherent fetches where there aren't any.

Quote:
The lookup from the Z-buffer is *definitely* not the bottleneck though so I'm hesitant to try and optimize this much more.
I made that point partly to defend against the cache thrashing that technique 3 engenders. On ATI clause switches cost significant cycles, so a single clause of 16 TEX is preferable to 16 clauses of 1 TEX - the latency of the latter and the increase in cache thrashing it directly causes would hurt 3.

Quote:
Yup I played with more random sampling to reduce collisions and it does work but with one huge problem: you can't have a massive performance falloff in the case where EVERY pixel on the screen collides. Put another way, worst case performance is what matters, not making the easier cases faster. This is a key point for game developers and one that I've heard often. Thus I haven't put a lot of effort into making the fast cases faster.
Does the entire histogram need to be generated each frame?

As the camera translates the histogram is, in general, shifting coherently, isn't it?
__________________
Sweet-spot + tick-tock = monster
Old 13-Aug-2010, 05:59   #22
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,073
Default

Quote:
Originally Posted by Jawed View Post
Some of Intel's Terascale work looks more like Transputer than Larrabee (maybe that's me in wishful thinking mode) and there's quite a vocal contingent who think the cache architecture of Larrabee isn't viable in the long term, where we're talking hundreds and thousands of cores.
If you are sharing active data between hundreds and thousands of cores, it doesn't matter if you are coherent or not. Which is what all the "coherency is bad" people apparently don't get. What coherency buys you is significantly increased flexibility and dynamic scalability. The actual cost of the coherency when done correctly is fairly low in the actual hardware. In a lot of ways it's a software/hardware trade-off that has been going on for decades. Every single time, coherency has won, simply because it is more flexible and thus requires orders of magnitude less software complexity as you scale.

Quote:
So if we're really going to talk about programming models that can last longer than 10 years, then at best one can only hope for "local cache", whatever that actually means when programming 1000 cores.
It means you don't have to track every bit of data and can instead optimize only for the data that has a performance impact.
__________________
Aaron Spink
speaking for myself inc.
Old 13-Aug-2010, 07:52   #23
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Default

Quote:
Originally Posted by Jawed View Post
Sigh, I totally forgot about shuffle - I remember being excited about that, once.
Yeah it's nice, although packstore is still the niftiest instruction IMHO.

Quote:
Originally Posted by Jawed View Post
Well both GPUs support concurrent compute kernels now (still pretty fuzzy on what that really means on ATI, though) so if it isn't in 12, well there's always LRB...
Right although "concurrent" doesn't necessarily mean "can be launched by the GPU" or "guaranteed parallel execution" or anything like that. It's unclear how restricted the GPU schedulers are here.

Quote:
Originally Posted by Jawed View Post
Some of Intel's Terascale work looks more like Transputer than Larrabee (maybe that's me in wishful thinking mode) and there's quite a vocal contingent who think the cache architecture of Larrabee isn't viable in the long term, where we're talking hundreds and thousands of cores.
I hear that vocal section but TBH they're mostly software people making broad claims about hardware engineering based on generalized logic at best (of course this doesn't apply to everyone, but I've met a lot of these people). When I ask the hardware people though they say that it scales fine and I'm inclined to trust them more...

I'll accept some argument over requiring full coherence but I really don't see how we can get away without proper caches in the future. (As an aside, I'm also told that once you have a cache, coherence isn't very hard or expensive and scales fine too...) Unless people come up with something equivalently clever it will relegate GPUs to operate only on the fairly small subset of regular problems that they can already handle now. I think Fermi made a definite step in the right direction on this one and I'm excited to see what comes next.

Quote:
Originally Posted by Jawed View Post
So if we're really going to talk about programming models that can last longer than 10 years, then at best one can only hope for "local cache", whatever that actually means when programming 1000 cores.
What do you mean by "local cache"? Fermi's L1$ counts IMHO, and that's all I'm really asking for. If they get it fully pipelined with atomics and the like then we're 90% there.

Quote:
Originally Posted by Jawed View Post
For about 2 million samples? (Or is this 4xMSAA, 8 million samples?)
No MSAA at the moment so yeah, ~2 million samples.

Quote:
Originally Posted by Jawed View Post
As I haven't done this stuff for real I don't understand the issue here. But I have a feeling I'm seeing coherent fetches where there aren't any.
The "gather" fetch in the shading languages is sort of like bilinear in that you give it a float texture coordinate and get back the four surrounding elements. Not only does this mean having to do a silly int->float in this case but it also messes with the typical parallel domain since now each work item is handling those 4 elements in its kernel. Not a huge deal and definitely possible to do but not a trivial change to the code either.
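
Roughly what that looks like, emulated on the CPU in Python. This is only a sketch: the function names are hypothetical, edge clamping is left out, and the four texels come back in plain row-major order rather than whatever component order a real shading language defines.

```python
import math

def gather_coord(x, y, width, height):
    """The int->float step: UV of the corner shared by texels (x, y)..(x+1, y+1)."""
    return ((x + 1) / width, (y + 1) / height)

def gather4(texture, u, v):
    """Return the 2x2 texels that a bilinear fetch at (u, v) would blend."""
    height, width = len(texture), len(texture[0])
    x0 = math.floor(u * width - 0.5)
    y0 = math.floor(v * height - 0.5)
    return [texture[y0 + dy][x0 + dx] for dy in (0, 1) for dx in (0, 1)]

# 4x4 "depth buffer" whose values encode their own position (y * 4 + x).
depth = [[y * 4 + x for x in range(4)] for y in range(4)]

u, v = gather_coord(2, 1, 4, 4)
block = gather4(depth, u, v)    # one work item now owns four samples
```

The change to the kernel is that each work item has to step through the Z-buffer in 2x2 blocks and loop over the four returned values, instead of mapping one work item to one texel.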

I also don't get why they don't just rearrange the compute domain into gather4-style blocks if their hardware benefits from this... DirectCompute leaves that completely up in the air and implementation-defined. I must admit to not completely understanding the fast path here on ATI and how the gather4 stuff relates to linear memory loads.

Quote:
Originally Posted by Jawed View Post
Does the entire histogram need to be generated each frame?
For complete correctness, yes. It's like occlusion culling... even a minor shift from one frame to the next could reveal a new object (or occlude an old one). You can start to play games but you immediately lose the guarantee that every screen-space sample will have a shadow map sample. How important this is depends on stuff like the frequency of geometry in the frame and the speed of movement of the camera and objects.
__________________
The content of this message is my personal opinion only.
Old 13-Aug-2010, 17:55   #24
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,257
Default

Quote:
Originally Posted by aaronspink View Post
It means you don't have to track every bit of data and can instead optimize only for the data that has a performance impact.
SCC doesn't have coherency, that's my point.
__________________
Sweet-spot + tick-tock = monster
Old 13-Aug-2010, 19:03   #25
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,257
Default

Quote:
Originally Posted by Andrew Lauritzen View Post
I'll accept some argument over requiring full coherence but I really don't see how we can get away without proper caches in the future. (As an aside, I'm also told that once you have a cache, coherence isn't very hard or expensive and scales fine too...) Unless people come up with something equivalently clever it will relegate GPUs to operate only on the fairly small subset of regular problems that they can already handle now. I think Fermi made a definite step in the right direction on this one and I'm excited to see what comes next.

What do you mean by "local cache"? Fermi's L1$ counts IMHO, and that's all I'm really asking for. If they get it fully pipelined with atomics and the like then we're 90% there.
Yes, I think Fermi L1$ fits the bill.

Quote:
No MSAA at the moment so yeah, ~2 million samples.
Well, that's pretty disturbing then, because on ATI histogram generation is running at ~1/6th throughput: theoretical 31 cycles for the inner loop takes ~200 cycles.

Quote:
The "gather" fetch in the shading languages is sort of like bilinear in that you give it a float texture coordinate and get back the four surrounding elements. Not only does this mean having to do a silly int->float in this case but it also messes with the typical parallel domain since now each work item is handling those 4 elements in its kernel. Not a huge deal and definitely possible to do but not a trivial change to the code either.
I have to admit after seeing the projection/unprojection stuff in the code I'm lost. I think you are sampling raw depth, so should be able to sample contiguous areas from depth coherently (and fast), but...

Quote:
I also don't get why they don't just rearrange the compute domain into gather4-style blocks if their hardware benefits from this...
The mapping from compute domain to data should be totally under the programmer's control, ultimately because it's a balancing act. e.g. it can be optimal to do 16 gather4s or only 1. That decision is the programmer's. On NVidia the optimisation profile is different...

Quote:
DirectCompute leaves that completely up in the air and implementation-defined. I must admit to not completely understanding the fast path here on ATI and how the gather4 stuff relates to linear memory loads.
Every fetch of a mere 32 bits in your code results in the 128-bit pipe from L1 to registers being 25% utilised (per core it's a 512-bit pipe). You've only got ~1TB/s of that bandwidth in Cypress (54GB/s per core).
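
Those figures hang together as follows. The pipe widths are the ones given above; the 850MHz clock and the count of 20 cores are assumptions about Cypress used only to reproduce the totals.

```python
CLOCK_HZ = 850e6            # assumed engine clock
CORES = 20                  # assumed number of SIMD cores
PIPE_BITS = 128             # L1-to-register pipe width quoted above
PIPE_BITS_PER_CORE = 512    # per-core width quoted above
FETCH_BITS = 32             # one raw depth value per fetch

utilisation = FETCH_BITS / PIPE_BITS                      # 0.25
per_core_gbs = PIPE_BITS_PER_CORE / 8 * CLOCK_HZ / 1e9    # about 54 GB/s
total_gbs = per_core_gbs * CORES                          # about 1.09 TB/s
effective_gbs = total_gbs * utilisation                   # about 272 GB/s
```

So fetching single 32-bit values leaves roughly three quarters of the L1-to-register bandwidth unused, under these assumptions.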

In the end ALU:TEX is so high that Z fetches shouldn't be in the picture.

What is the worst-case with 100% collision rate?
__________________
Sweet-spot + tick-tock = monster

All times are GMT +1.