AMD RV770 refresh -> RV790

We can also infer that NVidia, like Ageia beforehand, is taking the piss by hobbling the CPU code - it has everything to gain and nothing to lose. What's entertaining is seeing how blatant the evidence for this fraud is, yet reviewers are not calling them out on it.

Well, the way I see it, evidence would have to include some sort of alternative implementation that does on a CPU what Nvidia claims you can only do on a GPU. I don't think the fact that CPU cores aren't maxed out is particularly damning evidence, and certainly not enough for reviewers to throw a hissy fit about. So until somebody steps up CPU physics, they will continue to get away with their current strategy.
 
We still have fudzilla and the chiphell admin claiming it's not pin compatible.
But if Charlie is right about a "790"-based, cost-reduced 4870, both would need new boards anyway.

But with a pure respin (at the same unit cost), the lineup/positioning would make some sense:
4890: 900/1000 MHz - the minimum if the 20% claim is to hold
4870: 750/900 - cost reduced and maybe just one power connector, both due to the lower TDP
4850: EOL; the 4870 becomes the new salvage part, and:
4770: 750/1000 - judging by the 4860M clocks, this would reach 4850 speeds
4750: 650/800 - as tested by guru3D
4730: DDR3
 
15-20% would be plenty. It would make it competitive with the GTX285 pretty easily, at still of course half the die size, and therefore much better profits for AMD. That's if it really reaches 15-20% and not a lower number. I certainly wouldn't be disappointed that it's not a new chip.

It would prove wrong a few people here who stated that a major upclock on an existing chip was nearly impossible, or something to that effect, though.

It would also be a pretty big psychological breakthrough for AMD. They would probably be at least equal to the fastest single GPU on the planet, which they weren't even with the 4870. OTOH, they have lost the overall top end (if only slightly) to the 295 (assuming there's no RV790 X2). So it depends on which you think is the bigger deal. Normally I would say the fastest single GPU is the bigger deal, but these days I'm not sure anymore.
 
I'm not sure how 15 or even 20% is going to be enough for the 4890 to match the 285. It will in some games of course, and it will even take over in others, but on average it's going to be slower as far as I can see. And that's under the best of circumstances.

What's it getting on the memory? A 20% core increase and an 8% memory increase is likely going to net between 5% and 15% on average.
 
I doubt Charlie is 100% correct in that newsblurb. The performance increase and power consumption are subject to the final frequencies. The rumoured frequency increases do not suggest a simple respin to me. Maybe he should tone it down a little, since he isn't exactly the first to report that it's a 55nm chip, for starters.

Before I believe anything I'd prefer to see the exact rated power consumption and die size of RV790, rather than just vague drivel.

Obviously Nvidia's implementation favors Geforces, just like their DirectX implementation favors Geforces. Do we have an example of a characteristic of the CUDA API that favors a specific architecture? Aren't CUDA and OpenCL all about structuring the problem in a parallel fashion and firing off kernels? I don't think there's anything about the API itself that demands a specific hardware implementation or memory architecture. For example, there's no reason shared memory couldn't be in off chip ram. It'd just be slow, but it would be the same for OpenCL as well.
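For illustration, here's a minimal OpenCL C sketch of that idea: a hypothetical work-group reduction (the kernel name and argument layout are made up for the example) that uses __local memory and a barrier. Nothing in the kernel says whether that __local storage is on-chip SRAM or carved out of off-chip RAM; that's entirely the implementation's problem, which is the point above.

// Hypothetical partial-sum kernel: each work-group reduces its chunk of 'in'
// into one value of 'out'. The __local buffer is the API's "shared memory"
// abstraction; the spec doesn't dictate where the hardware actually puts it.
// Assumes global size is a multiple of local size and local size is a power of two.
__kernel void partial_sum(__global const float *in,
                          __global float *out,
                          __local float *scratch)
{
    const size_t lid = get_local_id(0);
    const size_t gid = get_global_id(0);
    const size_t lsz = get_local_size(0);

    scratch[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);          // all work-items in the group sync here

    // Tree reduction within the work-group
    for (size_t s = lsz / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}

The host code just picks a global/local size and fires off the kernel; the same source should run on any conformant implementation, fast or slow.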

I should thank rjc for finding Eric Demers' interview because I was too bored to look it up. If the reason for not supporting CUDA had been that "it was not invented here", I have the feeling that he would have worded his reply in a completely different fashion.

I'm not sure how 15 or even 20% is going to be enough for the 4890 to match the 285. It will in some games of course, and it will even take over in others, but on average it's going to be slower as far as I can see. And that's under the best of circumstances.

NV is preparing the GTX275 to run against it; indications for that one point in the 10 cluster/7 ROP partition direction (see the GT200b in the GTX295), with as yet unknown frequencies. Assuming those are correct and the 275 has the exact same frequencies as on the 295, the result should lie more or less in the middle between the GTX260 and GTX285.

Rumours also suggest an "RV790-OC"; if that is merely a vendor-specific initiative then it's pure nonsense to even mention it. In any other case we'll have to wait and see what exactly it stands for.
 
What's it getting on the memory? A 20% core increase and an 8% memory increase is likely going to net between 5% and 15% on average.
Current word is that AMD/ATi has been putting 5 GHz chips on other parts...
160 GB/s vs 115 GB/s, a 39% increase; probably not needed and not going to happen, but it would be fun and makes a nice dream.
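For reference, the back-of-the-envelope arithmetic behind those figures, as a quick C sketch; the 256-bit bus and the 3.6 Gbps / 5.0 Gbps effective data rates are assumed values:

#include <stdio.h>

/* Rough GDDR5 bandwidth arithmetic for the figures above (assumed values). */
int main(void)
{
    const double bus_bits  = 256.0;    /* RV770/RV790 memory bus width        */
    const double rate_4870 = 3.6e9;    /* 900 MHz GDDR5, 3.6 Gbps effective   */
    const double rate_5g   = 5.0e9;    /* rumoured 5 Gbps effective parts     */

    double bw_4870 = rate_4870 * bus_bits / 8.0 / 1e9;   /* GB/s */
    double bw_5g   = rate_5g   * bus_bits / 8.0 / 1e9;

    printf("4870: %.1f GB/s, 5 Gbps parts: %.1f GB/s, +%.0f%%\n",
           bw_4870, bw_5g, (bw_5g / bw_4870 - 1.0) * 100.0);
    return 0;
}

That prints 115.2 GB/s vs 160.0 GB/s, i.e. the 39% above.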
 
I don't think there's anything about the API itself that demands a specific hardware implementation or memory architecture. For example, there's no reason shared memory couldn't be in off chip ram. It'd just be slow, but it would be the same for OpenCL as well.
"Fresh" OpenCL coders would make allowances for the architectures which are out there. CUDA coders would be slow to change and there is a code base out there already optimized for the existing architectures. There is little reason for ATI to support CUDA and hurt OpenCL adoption in the process. Maybe once PhysX becomes important and OpenCL is embraced by most developers they could put in CUDA emulation, but not now.
 
WRT PhysX and how/why it refuses to use more than one CPU core: How about a mod-move to a new thread?

Don't want to take this one further OT.


edit:
Jawed: Just take a look at the Novodex' Domino Fermat demo. It's also quite serial ;)
 
Jawed: Just take a look at the Novodex' Domino Fermat demo. It's also quite serial ;)
:LOL: That's only parallel in the sense that, occasionally, fallen blocks not at the leading edge shift :LOL:

It's interesting that this starts using ~6% CPU and trends to about 12% over the 10 minutes it takes to run - I've discovered that running windowed and hiding the window so it's not visible leaves just the simulation CPU usage. Or so it seems.

Jawed
 
If the reason for not supporting CUDA had been that "it was not invented here", I have the feeling that he would have worded his reply in a completely different fashion.

My cynicism gland is acting up again so I'll go ahead and say that he would never explicitly state that no matter the situation.

"Fresh" OpenCL coders would make allowances for the architectures which are out there. CUDA coders would be slow to change and there is a code base out there already optimized for the existing architectures.

I hope not. CUDA isn't nearly mature enough for developers to already be set in their ways. What happens when CUDA 3.x drops? Maybe I need to educate myself more, but I still don't see what differences there are between CUDA and OpenCL that would make the latter less attuned to a G80-like architecture.
 
For single-precision under OpenCL, SSE, Cell and ATI are vec4-native architectures, as opposed to NVidia's scalar-native one.

The underlying nature of OpenCL is:

Processing Element: A virtual scalar processor. A work-item may execute on one or more processing elements.

Jawed
 
For single-precision under OpenCL, SSE, Cell and ATI are vec4-native architectures, as opposed to NVidia's scalar-native one.

Not sure what you mean by that. Does OpenCL have explicit accommodations for vec4 processing elements? I couldn't see anything along those lines in any sample code. It would make sense if each provider's OpenCL back-end packed work-items appropriately for the hardware, but then that's not an API issue. On the flip side there's no reason why a CUDA back-end couldn't do the same and map to vec4 hardware. A "vec4 native" data parallel API certainly doesn't sound very flexible to me. Are you sure you're not mixing up instruction vectorization with work-unit vectorization?
 
My cynicism gland is acting up again so I'll go ahead and say that he would never explicitly state that no matter the situation.

I said he might have worded it differently; I won't claim I know Demers, but he has always struck me as someone who wouldn't lie or exaggerate, in public or in private.
 
Not sure what you mean by that.
It means the starting point for writing efficient code is about as far away as it's possible to get when using these architectures.

Given the problem of scalar-only code, a compiler has an obvious starting point: simply lump multiple threads into a single hardware thread on these architectures. But I think it'd be better if programmers were encouraged to think vec4. Vectors aren't going away.

It'd be interesting to see an analysis of the CUDA apps out there, to find out how many of them are most efficient when coded as purely scalar, i.e. at no point in the kernel code are any vector types or structures used.

Does OpenCL have explicit accommodations for vec4 processing elements? I couldn't see anything along those lines in any sample code. It would make sense if each provider's OpenCL back-end packed work-items appropriately for the hardware, but then that's not an API issue.
OpenCL supports vectors of various sizes as well as structures, both based upon various data types, such as float and integer. The programmer can query the underlying hardware to find out what the preferred "packing" is, e.g. scalar or vec4, but that doesn't lead to any automated optimisation as far as I can see.
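For what it's worth, a minimal host-side sketch of that query (platform/device selection simplified and error checking omitted); CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT is the property in question:

#include <stdio.h>
#include <CL/cl.h>

/* Sketch: ask a device what its preferred float packing is (1 = scalar,
   4 = vec4, etc.). First platform/GPU device taken, errors ignored. */
int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_uint width = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                    sizeof(width), &width, NULL);

    printf("Preferred float vector width: %u\n", width);
    return 0;
}

The answer is purely informational, though; it's still up to the programmer to act on it.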

OpenCL is not particularly close to the metal (though it promotes itself as such in much the way CUDA does, it seems), so yes each vendor of a just-in-time compiler for their hardware architecture has to translate from the given code to their hardware. When most of the hardware out there uses vec4-float as the most fundamental data type in its throughput ALUs/registers, obfuscating this by making the programming model based upon scalars seems troublesome.

Larrabee is interesting because it is essentially a vec16 architecture. There is support for native vec16 data in OpenCL, so it hasn't been left out in the cold. It may well turn out to be useful to program NVidia GPUs with vec8 or vec16 as the underlying data type, matching either SIMD-width or bank-count for various types of memory.

The issue is fundamentally about appropriate ways to "auto-parallelise", and starting with scalar code makes this rather more fiddly than starting with vec4.

Sure, it's too early to be able to tell, but it seems that programmers who want to write cross-platform OpenCL are going to spend a lot of time tripping up on the scalar-versus-vector issue, particularly as OpenCL makes scalar the underlying programming model, not vector.

Maybe everyone thinks it's no big deal - clearly I'm on the sidelines :!: It's been amazingly quiet so far, there's no real OpenCL community to speak of.

A "vec4 native" data parallel API certainly doesn't sound very flexible to me.
What you're saying is that graphics, being vec4-native, isn't flexible at all, particularly on machines that are effectively vec4 :???: And further, that running this same graphics code on NVidia's scalar architecture is difficult :???:

Have you looked at:

http://www.khronos.org/registry/cl/specs/opencl-1.0.33.pdf

The forum's pretty quiet:

http://www.khronos.org/message_boards/viewforum.php?f=28

Jawed
 
As I suspected, you're confusing instruction vectorization with work-item vectorization. Graphics code is written for a single pixel. There's no explicit coding for working on multiple pixels at once; that's up to the hardware provider to manage. There's nothing in OpenCL that explicitly maps multiple kernel instances (work-items) to a physical configuration. As for the scalar vs vector instruction issue, OpenCL cares about that as much as DirectX does, i.e. it doesn't.

I get what you're saying but you're essentially describing best practices for "using" the API. I'm saying there's nothing implicit in the API itself to make it a better fit for vec4 hardware. There's nothing stopping you from packing vec4's in Cuda either except that it would be a waste of time on Nvidia hardware.

I would argue that the autovectorization problem has been solved. See G80. Packing vec4's everywhere is clumsy, unintuitive and obfuscates the underlying algorithm. It's much easier to just write your code without fumbling around trying to issue vector instructions.
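To make the two kinds of vectorization concrete, here's a hypothetical pair of OpenCL C kernels doing the same multiply-add (names made up for the example): the first is the scalar style, one element per work-item, leaving any packing across SIMD lanes to the compiler/hardware; the second explicitly packs a float4 per work-item, which is the instruction vectorization being argued about.

// Scalar style: one float per work-item; work-item parallelism only.
__kernel void saxpy_scalar(__global const float *x,
                           __global float *y,
                           const float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}

// Explicitly vectorized style: one float4 per work-item, so each
// work-item issues vec4 arithmetic (launched with 1/4 the global size).
__kernel void saxpy_vec4(__global const float4 *x,
                         __global float4 *y,
                         const float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}

On a vec4 machine the back-end can either gang four work-items of the first kernel into a VLIW slot or map the second directly onto it; on a scalar machine the second just turns into four serial operations per work-item.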
 
I think G80's vec4 -> scalar transition is like the AoS to SoA transition. The scalar way is better, much better, particularly when there is hardware beneath you to aid it.

How can there be _any_ OpenCL community when there are no implementations of it? I think the implementations are done; they are waiting for conformance tests to come up.
 
As I suspected, you're confusing instruction vectorization with work-item vectorization. Graphics code is written for a single pixel. There's no explicit coding for working on multiple pixels at once; that's up to the hardware provider to manage. There's nothing in OpenCL that explicitly maps multiple kernel instances (work-items) to a physical configuration. As for the scalar vs vector instruction issue, OpenCL cares about that as much as DirectX does, i.e. it doesn't.
Why is .xyzw notation a part of D3D if the vec4 organisation of graphics and a lot of graphics hardware is irrelevant?

I get what you're saying but you're essentially describing best practices for "using" the API. I'm saying there's nothing implicit in the API itself to make it a better fit for vec4 hardware.
It seems to me that it hides the vec4-ness of most hardware and so presents the worst possible fit.

Vector organisation isn't going anywhere, and strictly speaking NVidia's architecture is actually vec32 at the ALU level and vec16 at the data level.

The default practice with OpenCL, scalar, leaves 3/4 of the vec4 machines unused.

There's nothing stopping you from packing vec4's in Cuda either except that it would be a waste of time on Nvidia hardware.
It's pretty normal to end up using vectors of data in CUDA threads, optimising the code for memory access patterns. It's the converse of a waste of time.

If the scalar approach were as tractable as you believe it to be, then a naive CUDA matrix multiplication using only scalars would perform the same as the fully optimised code, making the scalar-versus-vector argument entirely moot. The compiler would see the scalar code, understand the bandwidths and bottlenecks of the underlying architecture, and produce optimised code.
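For reference, a sketch of the kind of naive, scalars-only kernel being described, written here in OpenCL C for consistency with the rest of the thread (the CUDA version is analogous; the name and row-major layout are just illustrative). It's correct, but every element of A and B is re-fetched from global memory, so the compiler would have to discover the tiling and shared/local-memory reuse on its own:

// Naive matrix multiply, one output element per work-item, scalars only.
// C = A (MxK) * B (KxN), all row-major. No tiling, no local memory, no
// vector loads; every A/B element is re-fetched from global memory.
__kernel void matmul_naive(__global const float *A,
                           __global const float *B,
                           __global float *C,
                           const int M, const int N, const int K)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    if (row >= M || col >= N)
        return;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * N + col];

    C[row * N + col] = acc;
}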

Even in graphics, things like shadow rendering from an existing shadow buffer improve in performance when using something like the gather instruction added in D3D10.1. This isn't about fitting on the ALUs (though that's nice) so much as making better use of the available bandwidth, because the machine has a vector memory and vector fetching architecture.

I'm not arguing that vec4 on its own is enough for maximum performance, e.g. in matrix multiplication.

It seems to me that making programmers start off thinking "how do I best use .xyzw" is a better fit for the platforms OpenCL will find itself on.

Though you could argue that SSE and Cell will become irrelevant within a few years due to things like Larrabee and AVX, both being wider vector architectures.

Oh well, I'm just naively beating a dead horse :???:

Jawed
 
Why is .xyzw notation a part of D3D if the vec4 organisation of graphics and a lot of graphics hardware is irrelevant?

It's a consequence of the multi-component position and color attributes common in graphics. It's not generally applicable in a meaningful way otherwise. D3D doesn't require any particular hardware solution and certainly not vec4 specifically.

It seems to me that it hides the vec4-ness of most hardware and so presents the worst possible fit.

It seems to me that making programmers start off thinking "how do I best use .xyzw" is a better fit for the platforms OpenCL will find itself on.
Guess that's why we differ. I don't think APIs should be designed around hardware limitations; they should be designed to suit the problem they're intended to solve. The vec4-ness of graphics hardware won't map well to a lot of solutions that can't readily produce vector instructions.

The default practice with OpenCL, scalar, leaves 3/4 of the vec4 machines unused.
As it should be. Scalar is the general case. Anything above that is getting into architecture specific optimization. I'm surprised there hasn't been more discussion on this topic actually. I guess once people start coding OpenCL and running on different hardware we'll get some feedback.

If the scalar approach were as tractable as you believe it to be, then a naive CUDA matrix multiplication using only scalars would perform the same as the fully optimised code, making the scalar-versus-vector argument entirely moot.
I'm not saying that everything is easy and self-optimizing with a scalar approach, just more so than vector. Once you start doing per-thread optimizations for memory accesses your app isn't going to be that portable any more. It's going to be relatively easy to set up vector data access patterns in OpenCL that run fast on a vector machine but fall flat on their face when serialized on Nvidia hardware. The opposite is true as well, of course. But while there may be some per-thread optimization happening in CUDA, I'm sure most of the optimization is at the warp level, in terms of how arrays are aligned and indexed in memory.
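As a hypothetical illustration of that kind of access-pattern choice, here are two OpenCL C copy kernels (the names and the elems_per_item parameter are made up for the example): in the first, consecutive work-items touch consecutive addresses on each iteration, which suits warp/wavefront-level coalescing; in the second, each work-item walks its own contiguous chunk, which is fine for a CPU-style implementation but becomes scattered fetches across a warp on the GPU.

// Coalesced: work-item gid reads element gid + j*stride each iteration, so a
// warp/wavefront touches one contiguous block of memory per access.
__kernel void copy_coalesced(__global const float *in,
                             __global float *out,
                             const int elems_per_item)
{
    int gid = get_global_id(0);
    int stride = get_global_size(0);
    for (int j = 0; j < elems_per_item; ++j)
        out[gid + j * stride] = in[gid + j * stride];
}

// Chunked: each work-item owns a private contiguous chunk, so neighbouring
// work-items hit addresses elems_per_item apart and accesses don't coalesce.
__kernel void copy_chunked(__global const float *in,
                           __global float *out,
                           const int elems_per_item)
{
    int gid = get_global_id(0);
    int base = gid * elems_per_item;
    for (int j = 0; j < elems_per_item; ++j)
        out[base + j] = in[base + j];
}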

Though you could argue that SSE and Cell will become irrelevant within a few years due to things like Larrabee and AVX, both being wider vector architectures.
Well that's what I'd like to see too. If Larrabee can run 16 different threads on the SIMD in parallel then cool. But if it's really 4 vec4's then that's another level of per-thread optimization that you have to worry about that's not necessarily portable to other implementations. Something about the forced packing of everything into vectors bothers me as it adds another level of complexity to already complex problems.
 
ATI can do scalar (addressing is scalar, which is what's important; the per-VLIW branching isn't a big issue). The branch granularity is just lousy while doing it.
 