NVIDIA GF100 & Friends speculation

Before anyone asks: yes, this is a serious store. One of the better ones in Sweden.

The page can't be reached through the shopping lists.

No it's not. Which card has a "666-bit memory controller" running at 666 MHz and looks like two 8800 GTXs next to each other? Don't kid anyone.
 
Yes, but the API doesn't explicitly facilitate parallel processing. On the contrary, it actually makes it difficult, due to exactly the in-order requirement you mentioned. Hence the hardware is ahead of the software in this case.

The software requirement is unlikely to change any time in the reasonably near future (aka <DX15), as it is part of the fundamental programming model that order doesn't matter for triangles. If you make it matter, you just shift the serialization point from the GPU to the app, which will have to do the same thing the GPU does now.

And it's not an exact in-order requirement, it is a sorted-order requirement: as long as you maintain the correct sorted order, everything works out, hence you can have parallel implementations, tiling, etc. The limitation is pretty fundamental to raster graphics and 3D graphics in general.
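The out-of-order-execution, in-order-commit idea above can be sketched in a toy Python model (purely illustrative; no real GPU works like this dictionary-based version). Triangles "finish shading" in an arbitrary order, but their framebuffer writes are committed in submission order, so overlapping writes stay deterministic:

```python
import random

def rasterize_parallel(triangles, framebuffer):
    """Toy model: each triangle is a (pixel, color) pair. Shading may
    complete in any order, but writes commit in submission order."""
    completed = list(enumerate(triangles))
    random.shuffle(completed)          # pretend cores finish unpredictably

    # Reorder buffer: hold results until they can commit in sorted order.
    pending = dict(completed)
    for idx in range(len(triangles)):  # commit strictly in submission order
        pixel, color = pending.pop(idx)
        framebuffer[pixel] = color     # later triangles overwrite earlier ones

fb = {}
tris = [(0, "red"), (0, "blue"), (1, "green")]  # two triangles overlap pixel 0
rasterize_parallel(tris, fb)
# Pixel 0 always ends up "blue" (the later-submitted triangle), no matter
# what order the "shading" finished in.
```

The shuffle stands in for parallel execution; the in-order commit loop is the serialization point the posts above are arguing about.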
 
No it's not. Which card has a "666-bit memory controller" running at 666 MHz and looks like two 8800 GTXs next to each other? Don't kid anyone.


The picture is some inside prank. I meant that Inet is a good store, not that the card is real.
 
That would be a significant deviation from the information given out during Nvidia's Tesla introduction.
We have articles concerning this from Anandtech and Realworldtech.

There's a pdf about Tesla on Nvidia's own page that points out exception handling.

It wouldn't be the first time that a hardware vendor embellished the actual functionality supplied vs what is possible via software.
 
It'll run multiple kernels in parallel in the same SM, as far as I can tell.
I wonder how resources like registers and local shared mem are shared across kernels, or if they are shared at all. AFAIK AMD runs pixel and vertex shaders on the same core, statically partitioning the register file.
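The static-partitioning scheme mentioned for AMD can be sketched with made-up numbers (the register file size and per-thread register counts here are illustrative, not real hardware figures): split the register file evenly between resident kernels, then see how many threads of each still fit.

```python
REGISTER_FILE = 16384  # hypothetical per-core register count, not a real figure

def static_partition(kernels, reg_file=REGISTER_FILE):
    """Give each resident kernel an equal share of the register file and
    return how many threads of each kernel that share can hold."""
    share = reg_file // len(kernels)
    return {name: share // regs_per_thread
            for name, regs_per_thread in kernels.items()}

# Illustrative: a pixel shader using 16 registers/thread and a vertex
# shader using 32 registers/thread resident on the same core.
threads = static_partition({"pixel": 16, "vertex": 32})
# pixel: 8192 // 16 = 512 threads; vertex: 8192 // 32 = 256 threads
```

The obvious downside of the static split, and presumably why the sharing question matters, is that a kernel with light register pressure can't donate its unused share to a heavier one.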
 

Look at the bottom of the page: "file under graphics and humour".

Also, Fermin at Wikipedia:

Saturninus (in France "Saint Saturnin") was the first bishop of Toulouse, where he was sent during the "consulate of Decius and Gratus" (AD 250). He was martyred (traditionally in 257 AD), significantly by being tied to a bull by his feet and dragged to his death, a martyrdom that is sometimes transferred to Fermin and relocated at Pamplona. In Toulouse, the earliest church dedicated to Notre-Dame du Taur ("Our Lady of the Bull") still exists, though rebuilt; though the 11th century Basilica of Saint Sernin, the largest surviving Romanesque structure in France, has superseded it, the church is said to be built where the bull stopped, but more credibly must in fact be on a site previously dedicated to a pre-Christian sacred bull, perhaps the bull of Mithras. The street, which runs straight from the Capitole, is named, not the Rue de Notre-Dame, but the Rue du Taur.

There's a lot of bull in there.
 
Again you're equating the absence of obstacles with explicit support. Just because the hardware guys were able to work around API shortcomings, it doesn't mean the API is all hunky-dory.

Wise man says: you don't need to sync every coherence point.
Real man says: no you don't, but I'll be buggered if I'm going to try every combination to figure out which ones I do need to sync.
Result man says: this is why relaxed ordering died.

The point being that hardware can do whatever it bloody wants, but at the end of the day, if someone does it in a way that makes the programmer's life easier, he'll do it that way, even if it means hammering in screws.

Point being that unpredictable results lead to unpredictable bugs, which lead to doing everything the safest and easiest way possible. The reason the model exists is because it has to exist somewhere: either the hardware has to do it, the program has to do it, or the programmer has to do it. Now, as much as I don't like to bag on programmers, they generally don't care, and even the good ones aren't THAT good. Software is software; it can do anything if you don't care about it doing it fast. So hardware does it instead.
 
It wouldn't be the first time that a hardware vendor embellished the actual functionality supplied vs what is possible via software.

That level of bait-and-switch would make me want to cancel my Fermi-based supercomputer, if I were in the process of planning one.

Still waiting for someone to code up 15 infinite loops plus another kernel and see how it really works.
We'd need a Fermi to do that, though.
 
This is only true if you want to spend a lot of time scanning in and out, which takes tester time, which costs money.

I was also thinking that whatever overhead there is for at-speed test, it should target reducing test power consumption.

But this is just OT..
 
There's nothing for the API to support; it's just doing an operation in a faster way. Why should the API care?

That's a cop out. You could argue the same thing for DX8.

C has no explicit raytracing acceleration structures; does it need to change? OCL/CS/CUDA already support raytracing acceleration structures. That's the point of 'general' computing: you don't have to specify every single thing. Unless we see fixed-function raytracing hardware, we won't see any change to the API for a specific algorithm.

C is way lower level and isn't targeted at any specific application, unlike DirectX. If we can have vertex buffers for rasterization, why can't we have similar data structures for raytracing to provide some sort of structure to the process? Fixed-function hardware isn't the only driver for an API; consistency and structure are also key elements. For example, why are there still restrictions on input and output parameters and data structures for the programmable stages of the current pipeline?

No (device side pointers in CS), but that's a bit different than asking for DX to define the hardware-accelerated structures and algorithms, which is what you mooted first. One's a bit more low-level and generally enabling than the other.

Well you said that CS is all that's needed for raytracing but without pointer support is that really true from a practical standpoint? You could theoretically build an acceleration structure out of indexed buffers but there would be all sorts of wastage due to empty nodes etc. Not even sure how that would work as you can't guarantee a certain maximum size for leaf nodes.

ballot is what I think you're referring back to. These __syncthreads functions are refinements related to the behaviour of a synchronisation for an entire work group. Because a work group can be large, e.g. 512 work items, it's not possible to manipulate a mask as the contents of a simple register.

Ballot is used to set a bit mask based on an arbitrary predicate; however, it doesn't actually perform the scan. syncthreads_count(), on the other hand, can be used to run a prefix sum on that mask. That comes in handy if you want to count the number of elements in an array that meet a certain criterion (e.g. val < pivot in a quicksort).

See US patents 2009008952 and 20090132878. The PSCAN operation described there is pretty much equivalent to syncthreads_count except that the latter runs across the entire block and not just a single warp.
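The semantics being discussed can be sketched sequentially in Python (a behavioural model only; the real operations are single hardware instructions running across a warp or block): ballot packs one predicate bit per lane into a mask, and syncthreads_count returns, to every thread, how many predicates in the block were true.

```python
def ballot(predicates):
    """Warp-style ballot: pack one predicate bit per lane into a mask."""
    mask = 0
    for lane, p in enumerate(predicates):
        if p:
            mask |= 1 << lane
    return mask

def syncthreads_count(predicates):
    """Block-wide count: every thread would receive the number of threads
    whose predicate evaluated true."""
    return sum(1 for p in predicates if p)

# The quicksort-partition example from above: count values below the pivot.
vals, pivot = [3, 9, 1, 7, 4], 5
preds = [v < pivot for v in vals]
mask = ballot(preds)              # lanes 0, 2 and 4 set -> 0b10101
count = syncthreads_count(preds)  # 3 values below the pivot
```

In hardware, of course, the predicate is evaluated by each thread in parallel and the reduction happens at the barrier; the loop here is just the reference semantics.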
 
Well you said that CS is all that's needed for raytracing but without pointer support is that really true from a practical standpoint? You could theoretically build an acceleration structure out of indexed buffers but there would be all sorts of wastage due to empty nodes etc. Not even sure how that would work as you can't guarantee a certain maximum size for leaf nodes.

Why would you need that? Isn't a pointer just an index into global memory? :) (or some UAV in this case). And with the interlocked counters you'll have no problem "allocating" new blocks either. I see no problem in either building or traversing such structures.
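The index-as-pointer idea can be sketched like so (a CPU-side toy, with a plain counter standing in for the interlocked add; node layout and names are invented for illustration): nodes live in one flat "global memory" array, children are stored as integer indices rather than pointers, and a shared counter hands out fresh slots.

```python
import itertools

class NodePool:
    """Flat node array in 'global memory'. Children are stored as integer
    indices instead of pointers; -1 means no child."""
    def __init__(self, capacity):
        self.nodes = [None] * capacity
        self._next = itertools.count()  # stands in for an interlocked add

    def alloc(self, payload, left=-1, right=-1):
        idx = next(self._next)          # "atomic" fetch-and-increment
        self.nodes[idx] = (payload, left, right)
        return idx                      # this index is the "pointer"

pool = NodePool(capacity=8)
leaf_a = pool.alloc("tri 0")
leaf_b = pool.alloc("tri 1")
root = pool.alloc("split x=0.5", left=leaf_a, right=leaf_b)
# Traversal just follows indices: pool.nodes[root][1] leads to leaf_a.
```

On the GPU the counter would be an interlocked/atomic increment on a UAV, so many threads can grab distinct slots concurrently; the structure itself is identical.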
 
And this is supported by the inclusion of N-Patch support in R200, 3Dc and MSAA+HDR in R5xx, tessellation and Fetch-4 in R6xx and RV7xx, compute capabilities in R5xx, R6xx and RV7xx, and the compute capabilities beyond CS in Evergreen (to name but a few easy-to-point-at things off the top of my head)?

To play devil's advocate for a second: compute in DX11/OpenCL is now getting so general-purpose that it's no longer about specific API "features" being enabled, and more about implementation and performance.

So whether Fermi or Cypress go beyond OpenCL/DX11 in features isn't really important IMHO; what's important is how fast they can run all expected shader workloads. So for example, changing 'hidden' cache, scheduling, or bus architecture could dramatically speed up shaders without really introducing developer-visible "features"; rather, developer relations more or less have to document that stuff that used to be ass-slow is now realistically performant.
 