AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 lineup of GPUs?

  • Within 1 or 2 weeks

    Votes: 1 0.6%
  • Within a month

    Votes: 5 3.2%
  • Within couple months

    Votes: 28 18.1%
  • Very late this year

    Votes: 52 33.5%
  • Not until next year

    Votes: 69 44.5%

  • Total voters: 155
  • Poll closed.
How should I understand that: all chips below "RV870" are single-core, "RV870" consists of two dies on the same package, and the X2 is twice that? That doesn't make any sense to me; in such a case each core would have to be around 120mm², otherwise in a theoretical 2*"RV870" (two cores on the same die, or else 2*2) case the die-area advantage compared to the competition would be gone.
Could the advantage left be yields? It could be way easier to get four 120mm² (or larger) chips working than one huge chip of 480mm² or more.
Could focusing on a tiny design also give better results in terms of power consumption per mm²?
Economy of scale?
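For what it's worth, here's a back-of-the-envelope illustration of the yield point. The defect density is completely made up; it's just to show the shape of the argument with a simple Poisson model, Y = exp(-D*A):

```cpp
// Rough, hypothetical numbers (mine, not from any leak): with a Poisson
// yield model Y = exp(-D*A) and an assumed 0.5 defects/cm², a 480 mm² die
// yields roughly 9% while a 120 mm² die yields roughly 55%, i.e. far more
// good silicon per wafer even before any binning or redundancy tricks.
#include <cstdio>
#include <cmath>

int main()
{
    const double defects_per_cm2 = 0.5;            // assumed, illustrative only
    const double areas_mm2[] = { 120.0, 480.0 };
    for (double a : areas_mm2)
    {
        double yield = std::exp(-defects_per_cm2 * a / 100.0); // mm² -> cm²
        std::printf("%.0f mm^2 die: ~%.0f%% yield\n", a, yield * 100.0);
    }
    return 0;
}
```

Of course real yields also depend on redundancy and how much logic can be fused off, so treat the numbers as nothing more than a sketch.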
 
The focus seems to be on Compute Shader, and it seems to me to be an effort to deflect attention away from the noises NVidia's making about CS. NVidia seems ready to market CS4/CS4.1 as all that a developer needs (because all its GPUs since G80 work this way), so developers should focus on those, not CS5.

Worse still: lately, Nvidia has only been talking about DX Compute.
 
Could the advantage left be yields? It could be way easier to get four 120mm² (or larger) chips working than one huge chip of 480mm² or more.
Could focusing on a tiny design also give better results in terms of power consumption per mm²?
Economy of scale?

Well, Tchock drew the two pieces of info about "RV870" from the Chinese site Chiphell, and there you have a hypothetical >$10 packaging/testing cost, which sounds awfully weird. It may or may not be true.

For yields I could see an advantage under the current crappy TSMC conditions; for power consumption I can't figure any particular advantage if you compare a theoretical "RV870" X2 board (2*2 dies at 120mm² each) against a theoretical 480mm² chip.

There are way too many unknown factors to speculate on things like that without, for example, knowing the transistor density, frequencies and so on for each individual chip, if not the architectural characteristics as well.

In the recent past, with the 4870 X2, we had two 263mm² @ 55nm cores beating one large 583mm² @ 65nm GTX 280 chip in terms of performance but not in terms of power consumption. In the former case you have over 1.8B transistors across the two chips, and in the latter 1.4B, with completely different frequency characteristics.
 
This seems to imply that kernel domains can be sized, created and despatched by the GPU.

Or maybe it simply means that the GPU can auto-run a kernel based on a domain that's defined by a buffer that was created on a prior rendering pass. So the input buffer effectively defines the domain size, and completion of writes to that buffer is required before the new kernel can start. It might not even be a new kernel, but a repeated instance of the kernel that's just completed. Some kind of "successive refinement"?

They are most likely referring to the DispatchIndirect() call, which basically takes the call parameters of a regular Dispatch() call from a buffer on the GPU. This buffer would typically be written to in a previous pass by the CS. This way you don't need to read back any data to the CPU if you need to continue work on the data, but the dispatch parameters cannot be (easily) pre-determined. DX11 also has similar indirect versions of regular draw calls. These calls are going to be very important for any kind of system fully contained on the GPU.
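To make the pattern concrete, here's a minimal host-side sketch (mine, not from any particular codebase); device, ctx, csProducer and csConsumer are placeholders, and error handling is omitted:

```cpp
#include <d3d11.h>

// Hedged sketch of the DispatchIndirect() pattern described above.
void indirect_dispatch_example(ID3D11Device* device, ID3D11DeviceContext* ctx,
                               ID3D11ComputeShader* csProducer,
                               ID3D11ComputeShader* csConsumer)
{
    // A tiny GPU buffer holding 3 UINTs: ThreadGroupCountX, Y, Z.
    D3D11_BUFFER_DESC bd = {};
    bd.ByteWidth = 3 * sizeof(UINT);
    bd.Usage     = D3D11_USAGE_DEFAULT;
    bd.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
    bd.MiscFlags = D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS |
                   D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS;
    ID3D11Buffer* argsBuf = nullptr;
    device->CreateBuffer(&bd, nullptr, &argsBuf);

    // Raw UAV over the args buffer so a compute shader can write the counts.
    D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
    uavDesc.Format             = DXGI_FORMAT_R32_TYPELESS;
    uavDesc.ViewDimension      = D3D11_UAV_DIMENSION_BUFFER;
    uavDesc.Buffer.NumElements = 3;
    uavDesc.Buffer.Flags       = D3D11_BUFFER_UAV_FLAG_RAW;
    ID3D11UnorderedAccessView* argsUAV = nullptr;
    device->CreateUnorderedAccessView(argsBuf, &uavDesc, &argsUAV);

    // Pass 1: a CS decides how much follow-up work there is and writes the
    // thread-group counts into the buffer.
    ctx->CSSetShader(csProducer, nullptr, 0);
    ctx->CSSetUnorderedAccessViews(0, 1, &argsUAV, nullptr);
    ctx->Dispatch(64, 1, 1);

    // Unbind the UAV, then dispatch straight from the GPU-written counts -
    // no readback to the CPU.
    ID3D11UnorderedAccessView* nullUAV = nullptr;
    ctx->CSSetUnorderedAccessViews(0, 1, &nullUAV, nullptr);
    ctx->CSSetShader(csConsumer, nullptr, 0);
    ctx->DispatchIndirect(argsBuf, 0);

    argsUAV->Release();
    argsBuf->Release();
}
```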
 
Private-write/shared-read isn't even a hardware restriction on ATI (R7xx only), but Brook+ has this restriction too, and I can't work out the underlying logic of it (I can only think it's super-slow on ATI).
They probably simply have no hardware for collision detection/resolution ... personally I'd say just make it undefined since it's a shame to leave the full crossbar they need anyway for the reads unused for the writes, but meh.

PS. you can always implement a ring bus in software for cross thread communication :p
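Only half joking: as a rough sketch (mine, not anything the IHVs ship), a software "ring" stays entirely within private-write/shared-read, since every thread only writes its own slot and only reads its neighbour's:

```cpp
// Hedged CUDA sketch: a software ring between the threads of one block.
// After blockDim.x - 1 steps each thread has seen every other thread's
// value, with per-thread private writes and shared reads only.
__global__ void ring_all_reduce(const float* in, float* out)
{
    __shared__ float slot[256];                  // one slot per thread
    int tid    = threadIdx.x;
    float mine = in[blockIdx.x * blockDim.x + tid];
    float acc  = mine;

    for (int step = 1; step < blockDim.x; ++step)
    {
        slot[tid] = mine;                        // private write: own slot only
        __syncthreads();
        mine = slot[(tid + 1) % blockDim.x];     // shared read: neighbour's slot
        __syncthreads();                         // don't overwrite too early
        acc += mine;                             // e.g. accumulate everything seen
    }
    out[blockIdx.x * blockDim.x + tid] = acc;    // every thread ends with the full sum
}
// launch e.g.: ring_all_reduce<<<numBlocks, 256>>>(d_in, d_out);
```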
 
They are most likely referring to the DispatchIndirect() call, which basically takes the call parameters of a regular Dispatch() call from a buffer on the GPU. This buffer would typically be written to in a previous pass by the CS. This way you don't need to read back any data to the CPU if you need to continue work on the data, but the dispatch parameters cannot be (easily) pre-determined. DX11 also has similar indirect versions of regular draw calls. These calls are going to be very important for any kind of system fully contained on the GPU.
Thanks. So the ability to have data stream through multiple kernels like in Brook. Cool.

Even if only one CS kernel can run at any one time.

Jawed
 
They probably simply have no hardware for collision detection/resolution ... personally I'd say just make it undefined since it's a shame to leave the full crossbar they need anyway for the reads unused for the writes, but meh.
CUDA makes this undefined and writes to UAVs in D3D are undefined, so it's only really CS4.x that's affected. In Brook+ they might just be enforcing a "reasonableness", i.e. it's not reasonable in a high level abstraction to allow a programmer to "accidentally destroy data".

I wonder if the restriction in CS4.x is reflecting other restrictions in the hardware. CS4.x requires 768 threads in a thread group - all of which can share data with each other. But in CUDA only 512 threads can be in a block. I'm wondering if the inability of all 768 threads to write to an arbitrary shared memory address on NVidia also contributed to this limitation. But I'm not sure that restriction on shared memory addressing actually exists - it doesn't make sense to me; isn't shared memory just a flat address space per multiprocessor?

PS. you can always implement a ring bus in software for cross thread communication :p
Yes, it's very robust, and variations of this sort of fit the "stream" paradigm. I dare say the general concept of streams amongst threads in a group is a useful, scalable abstraction (with the proviso that a group of threads is capped in size). Well, it appears scalable, though the compiler's first implementation has a pessimistic view, treating every read as a waterfalling read :cry: which makes me think there could be a hardware restriction there that's about to be lifted with the next GPUs.

But some things just want a big fat blob of "fast" memory shared by all threads. Though now that I think about it, what algorithms can't use the private-write/shared-read model? 256 bytes of writable memory per thread is quite a lot - though the performance on ATI (only 1 wavefront) would be pretty miserable, and, well, the scheduling on NVidia effectively means that 64 threads function as "1 warp", so the performance there will be bad too.

I guess scan prefers write anywhere, and that's pretty fundamental.
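For example (a rough sketch of mine, just the up-sweep half of a work-efficient scan), the write target moves around shared memory on every pass, which is exactly what private-write/shared-read rules out:

```cpp
// Hedged CUDA sketch: up-sweep phase of a work-efficient scan. The index a
// thread writes to depends on the pass, so threads write "anywhere" in
// shared memory rather than only to their own slot.
__global__ void scan_upsweep(float* data, int n)   // n = blockDim.x, power of 2
{
    extern __shared__ float s[];
    int tid = threadIdx.x;
    s[tid] = data[tid];
    __syncthreads();

    for (int stride = 1; stride < n; stride *= 2)
    {
        int idx = (tid + 1) * stride * 2 - 1;      // moving write target != tid
        if (idx < n)
            s[idx] += s[idx - stride];
        __syncthreads();
    }
    data[tid] = s[tid];
}
// launch e.g.: scan_upsweep<<<1, 256, 256 * sizeof(float)>>>(d_data, 256);
```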

Jawed
 
768 threads in a thread group is the basic limit of G80 - but, something I didn't fully realise till recently, the CUDA block limit is 512 threads. I wonder if this limit of 512 also influenced the shared memory model.

The basic limits of G80 are:
8 groups / SM
768 threads / SM
512 threads / block

With 3 blocks of 256 threads you can get 768 threads per SM but then the shared memory would be divided by 3.
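As a hedged illustration of that trade-off (mine, using the figures quoted above: 16 KB of shared memory per SM, 768 threads per SM, 512 threads per block):

```cpp
// With 256-thread blocks and roughly <= 5 KB of shared memory per block,
// three blocks can be resident per SM (3 * 256 = 768 threads). Ask for
// more shared memory and fewer blocks fit, so fewer threads hide latency.
__global__ void tile_copy(const float* in, float* out)
{
    __shared__ float tile[1280];                 // ~5 KB -> 3 blocks fit in 16 KB
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[gid];
    __syncthreads();
    out[gid] = tile[threadIdx.x];
}
// tile_copy<<<numBlocks, 256>>>(d_in, d_out);
// Ask for 6-7 KB per block instead and only two 256-thread blocks fit,
// i.e. 512 resident threads per SM rather than 768.
```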
 
What's puzzling me is how can a block of <=512 threads with its private shared memory support the D3D11 requirement for 768 threads to share memory.

The only thing I can think of is the hardware must be working more flexibly than CUDA exposes.

Earlier, I was thinking that under CUDA each block can write to anywhere within its private region, but is able to read from the entirety of shared memory.

Shared memory effectively requires the thread scheduler to be able to "fence" each block - and when a fence is reached that block's portion of shared memory effectively becomes invalid, until a new block has started execution.

So if D3D11 thread groups are implemented using CUDA blocks, fencing requires scheduling multiple blocks as a group. That's seemingly more fiddly and it's going to hurt performance. Though any time a developer sets up thread groups of the maximum size under CS they're going to get shonky performance due to there being only 1 thread group, i.e. a full flush with start-up and shut-down latencies hitting the ALUs flat in the face. I suppose there is a shared-memory-write window that opens up for threads in the new block as threads in the old block are retired, so maybe not a full flush.

Maybe the 768 threads per group limit is only if there's no shared memory use. If memory is shared, then less (e.g. 512 threads)?
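As a reminder of why CUDA blocks can't simply be pooled (a sketch of mine, not a statement about what the hardware can or can't do):

```cpp
// In CUDA, __shared__ storage and __syncthreads() are strictly per block -
// there's no barrier or shared address space spanning two blocks. So a
// 768-thread D3D11 group can't just be a 512 + 256 block pair unless the
// hardware co-schedules them and exposes a wider fence than CUDA does.
__global__ void per_block_only(float* out)
{
    __shared__ float buf[256];
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();                 // synchronises this block only
    // Other blocks can never read this buf[]; theirs is separate storage.
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[blockDim.x - 1 - threadIdx.x];
}
// per_block_only<<<numBlocks, 256>>>(d_out);
```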

:???:

Jawed
 
Well, the first thing that comes to mind would be fast communication between cores (similar to dual- and quad-core CPUs where the actual CPU is two physical dies on one package).
However, whether this is better or worse than just stuffing everything into one big core on the GPU side, I don't know - probably worse performance-wise, better yield-wise.

Yep, but those CPUs that share the same MCM can't communicate with each other directly or share raw data. In fact they're just two separate entities that the OS sees as a quad-core. Hasn't everybody here already said (with a mocking tone) that doing the same thing on a GPU would be pretty lunatic, precisely because of the inter-core communication a classic GPU needs? So are we now saying it's possible after all that ATI really did what they promised before releasing the R700 series - that their expected new breed would be an MCM?

So, do those unclear notes from earlier, about RV870 testing needing an extra 10 USD, just mean that it's an MCM? And that it isn't worth it in terms of extra testing?


Does that mean the real deal is coming in late 2010?

Disregard that as totally irrelevant, even for this speculative thread. It's just some sales manager selling whatever they want. While whipping engineers in a backroom. ;)

Let's just hope that they don't want to jump on Nvidia's trash-talk bandwagon from last year when the actual Larrabee arrives. Not that even Intel still directly markets it against any future GPU.
 
I think that's just dissing LRB, isn't it?

Not entirely; I'd say those are quite diplomatic replies.

Fusion as a whole will have to wait till 2011 (2010 demos/test silicon?), most likely because Bulldozer's innings are still being worked on (that AVX support probably threw a wrench into it all too), but I wouldn't be shocked if an R9xx with units similar to Fusion's APUs appeared earlier.

I'm curious to see if and how those SoCs could place themselves in the embedded (mobile/handheld) markets in the long run.
 
"Pleasant" speed?
Give me LUDICROUS SPEED!!!

Uhmm, alas if any IHV can't manage to get up to twice the performance out of each new-generation GPU compared to its direct predecessor (always according to market segment). Something like that would be good enough in my book to label it a pleasant surprise, or else a real upgrade.
 