Old 08-Nov-2007, 18:01   #26
3vi1
Junior Member
 

AMD is not leveraging their product correctly. Thanks for the link, Jawed.

In that entire PDF I did not once see "Financial Calculations" leveraging the DP aspect of their new stream computing line. Or did I miss something? They need to go after the derivatives market like Nvidia, because that's where the money is.

I swear, when ATI sold AMD their company for $7 billion, they snatched the balls off that company right along with it.


OK, as far as MXM goes, couldn't this aspect of their presentation be used for biomedical applications like imaging, or for visual systems à la Global Hawk or surveillance?

I also imagine that stream computing will be a boon for voice recognition systems and other streaming applications coming to PCs. The market is moving more towards mobile applications anyhow; people are abandoning home PCs in favor of mobility, which makes sense to me.

A day late and a dollar short = AMD.

PS: That PDF had some of the sloppiest image work I have ever seen. And I am not at all surprised it came from AMD, considering the state of affairs.
Old 08-Nov-2007, 18:01   #27
Jawed
Regular
 

Quote:
Originally Posted by Tim Murray
Er, I severely doubt that a standalone medical imaging device that uses the 9170 would be ready (from a software standpoint) within the next year or so.
Chicken and egg. To me this is merely a question of form factor: MXM is a compact alternative to PCI Express connectivity, even if it's the cabled PEG we've seen demonstrated earlier in the year.

As for the readiness of the software, well, it's quite possible that a company already has the software running; there's no need to wait for the 9170 to do the R&D.

Jawed
Old 08-Nov-2007, 18:07   #28
3dilettante
Regular
 

Quote:
Originally Posted by MfA
What's the point in ECC? I doubt the difference in soft errors would be even close to a single order of magnitude (that GPU is a huge target, with far more area reserved for actual computation than in the normal systems with ECC). If you can't deal with them, ECC just provides an illusion of reliability.
The ECC is for the 2 GiB of RAM.

IBM's rule of thumb is 1 bit error per month per GiB of memory.

One such card is going to have 2 errors a month.
Assuming we try to pack these babies into a compute node, that's 2-4 cards a node.
At 8 errors a month, that's more than one a week.

A large system might have hundreds to thousands of nodes.

At a hundred nodes, the system is going to hit a silent data error in video RAM every hour.
At a thousand, the system is going to hit an error every five minutes, or would if anyone had enough faith in GPGPU to put together a system at that scale.
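In rough numbers, here's the same arithmetic as a quick Python sketch (the only inputs are the 2 GiB per card and IBM's figure above, with a month taken as ~730 hours):

Code:
# Mean time between silent memory errors, using IBM's rule of thumb
# of 1 bit error per GiB of memory per month (~730 hours).
HOURS_PER_MONTH = 730.0

def hours_between_errors(gib_per_card, cards_per_node, nodes):
    errors_per_month = gib_per_card * cards_per_node * nodes  # 1 error/GiB/month
    return HOURS_PER_MONTH / errors_per_month

for nodes in (1, 100, 1000):
    h = hours_between_errors(gib_per_card=2, cards_per_node=4, nodes=nodes)
    print(f"{nodes:5d} node(s): one error every {h * 60:8.1f} minutes")

# 1 node:     one error every ~3.8 days
# 100 nodes:  one error every ~55 minutes
# 1000 nodes: one error every ~5.5 minutes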

As for ECC on the GPU itself, I've not heard of any such measures for GPUs.
CPUs and other processors have been adding such features since 90nm to counter the rising error rates inherent in smaller geometries.
Whether that is important depends on the error rates no GPU designer has disclosed.
Old 08-Nov-2007, 18:51   #29
MfA
Regular
 

Quote:
Originally Posted by 3dilettante
Whether that is important depends on the error rates no GPU designer has disclosed.
You can't determine the value of ECC without making a guess at it.

PS. I don't think ECC will usually help much for logic errors.
Old 08-Nov-2007, 19:02   #30
3dilettante
Regular
 

ECC is usually used on CPU caches, and parity is used on the register files.

SRAM has a higher error rate than DRAM, and CPUs need such features just to keep error rates level as process features shrink.

CPUs have had a much higher burden placed on them, since they also manage the system.

Whether GPUs need such measures, given their increasing use of cache and massive register files, is something their designers must evaluate when they push their products into new fields.
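To make the cost difference concrete: parity is one extra bit per word and gives detection only, while ECC (e.g. a SECDED Hamming code) adds enough bits to also correct single flips. A toy sketch of the parity side in Python:

Code:
def parity(word):
    # Even parity over a 32-bit word: one extra stored bit.
    # Detects any odd number of flipped bits; corrects nothing.
    return bin(word & 0xFFFFFFFF).count("1") & 1

stored_word, stored_parity = 0xDEADBEEF, parity(0xDEADBEEF)
corrupted = stored_word ^ (1 << 13)        # a single cosmic-ray bit flip
assert parity(corrupted) != stored_parity  # mismatch on read: error detected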
Old 09-Nov-2007, 00:15   #31
MfA
Regular
 

Cosmic rays can flip a bit in a latch or on a gate just like they can an SRAM cell... what makes ECC so effective for DRAM/SRAM in most systems is the simple fact that RAM and caches make up so much of the area. GPUs are special.
Old 09-Nov-2007, 00:26   #32
3dilettante
Regular
 

Error detection in memory is also important because the initial event can persist.

A cosmic ray hitting a latch in the divider unit won't matter unless there is a divide instruction that just happens to be going through that precise layer of logic at that exact time in the clock cycle.

A bit flip to a memory cell does not end with the next clock cycle, and any time is a good time for a bit flip to wreak havoc.
That's why register files have parity, even though the registers themselves are actually rather small in relation to logic.
Error detection on memory is relatively cheap, compared to logic with built-in error checking.

Error detection and correction for logic is an active area of research, however.
The future geometries are expected to reduce reliability to the point that it will no longer be safe to assume any given unit will function correctly to the standards placed on it today.
Old 09-Nov-2007, 02:51   #33
Andrew Lauritzen
AndyTX
 

Sure, it might be nice to have ECC, but keep it in perspective: at the price/performance ratio of GPUs, and even of AMD's new card, it's more cost-effective to just buy TWO, perform the computation completely redundantly, and compare the results. This method is also resilient to errors in the chip logic. With a sufficiently abstract computation platform you could even buy one NVIDIA part and one AMD part to get some protection from errors in the hardware/compilers/drivers.

I'm not saying that ECC isn't useful, but it's definitely not a show-stopper. It would matter for comparison if, say, NVIDIA supported it and AMD didn't, but right now no GPGPU cards do. Thus the worst you can knock the cards is doubling their effective price/performance vs. platforms with ECC, and even if you do that they're still worth it by a long shot.
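A minimal sketch of that high-level redundancy in Python (the kernel here is a hypothetical stand-in for dispatching a work unit to a card; a nonzero tolerance would be needed if the two runs use different hardware or compilers, since rounding may differ):

Code:
import numpy as np

def run_redundant(kernel, inputs, tol=0.0):
    # Run the same work unit twice (ideally on two different cards,
    # or even two vendors) and treat disagreement as a probable error.
    a = kernel(inputs)  # e.g. dispatched to card A
    b = kernel(inputs)  # e.g. dispatched to card B
    if not np.allclose(a, b, rtol=0.0, atol=tol):
        raise RuntimeError("results diverged; re-run this work unit")
    return a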
Old 09-Nov-2007, 03:52   #34
ShaidarHaran
hardware monkey
 

Quote:
Originally Posted by AndyTX
Sure, it might be nice to have ECC, but keep it in perspective: at the price/performance ratio of GPUs, and even of AMD's new card, it's more cost-effective to just buy TWO, perform the computation completely redundantly, and compare the results. This method is also resilient to errors in the chip logic. With a sufficiently abstract computation platform you could even buy one NVIDIA part and one AMD part to get some protection from errors in the hardware/compilers/drivers.

I'm not saying that ECC isn't useful, but it's definitely not a show-stopper. It would matter for comparison if, say, NVIDIA supported it and AMD didn't, but right now no GPGPU cards do. Thus the worst you can knock the cards is doubling their effective price/performance vs. platforms with ECC, and even if you do that they're still worth it by a long shot.
Excellent point. Reminds me of a story...

Once upon a time I worked for a retail PC outlet. Since it was a big-box store, margins on PCs were extremely slim. Of course, management was always pressuring us to sell "the extended warranty". Since "warranty" is a dirty word in most customers' vocabularies, one of my co-workers would do whatever it took to sell more than just a PC to every customer, even if he couldn't sell them on the warranty. Seeing as how PCs are commodities nowadays, pushing anything more than just the computer out the door got to be quite difficult.
Anyway, one time a particularly "hard-sell" customer, who only wanted to buy a cheapo $399 PC and "didn't see the value in purchasing a warranty", was sold a second system as a spare in case the first one broke. I could never sell like that; I was too busy reading Ace's & RWT & the like.
Old 09-Nov-2007, 15:15   #35
3dilettante
Regular
 

Quote:
Originally Posted by AndyTX
Sure, it might be nice to have ECC, but keep it in perspective: at the price/performance ratio of GPUs, and even of AMD's new card, it's more cost-effective to just buy TWO, perform the computation completely redundantly, and compare the results.
Even if ECC RAM cost more than double what standard RAM does, I don't think that would be true.
I've seen 1 GB DDR2 priced at $40-50 non-ECC, non-registered, and ECC registered between $70-80.
Non-registered ECC 1 GB DDR2 exists too, and I've seen it priced at around $50.

ECC or some kind of correction on video RAM is likely to catch over 90% of memory errors.
From a large-system standpoint, one card with failing memory amongst 4,000 cards is much easier to catch if ECC errors keep popping up.

Your method is twice the price, twice the power consumption, and half the compute density at a system level, at the same or lesser performance.

If the system was designed to max out at 4 cards per motherboard, it doubles the number of PCIe slots needed, and likely every other system component besides the hard disks.

Quote:
This method is also resilient to errors in the chip logic. With a sufficiently abstract computation platform you could even buy one NVIDIA part and one AMD part to get some protection from errors in the hardware/compilers/drivers.
My compute cluster burns 1 MW and catches 90+% of errors.
Yours burns 2 MW, catches a few percent more errors, takes up twice the floor space, and is slower.

Quote:
I'm not saying that ECC isn't useful, but it's definitely not a show-stopper.
For large systems, it likely is.
And I'm sure there are some workloads that would really like the throughput with a modicum of data checking.
Old 09-Nov-2007, 17:32   #36
Andrew Lauritzen
AndyTX
 

Quote:
Originally Posted by 3dilettante
Your method is twice the price, twice the power consumption, and half the compute density at a system level, at the same or lesser performance.
I think you're missing my point... it's not that GPU+ECC == 2*GPU, as that clearly isn't the case. My point is that even without ECC, GPUs still provide order-of-magnitude better price/performance *and* power/performance than, say, a CPU cluster (for many tasks). Thus the lack of ECC compared to CPU clusters is not critical, as you can afford to introduce high-level redundancy into the system and still be laughing all the way to the bank.

Quote:
Originally Posted by 3dilettante
My compute cluster burns 1 MW and catches 90+% of errors.
Yours burns 2 MW, catches a few percent more errors, takes up twice the floor space, and is slower.
What are you comparing this to? A mythical GPU with ECC? If that existed, I would agree that it would probably be worth looking at. Since it doesn't, however, the comparison is a bit moot.
Old 09-Nov-2007, 18:26   #37
3dilettante
Regular
 

Quote:
Originally Posted by AndyTX
I think you're missing my point... it's not that GPU+ECC == 2*GPU, as that clearly isn't the case. My point is that even without ECC, GPUs still provide order-of-magnitude better price/performance *and* power/performance than, say, a CPU cluster (for many tasks). Thus the lack of ECC compared to CPU clusters is not critical, as you can afford to introduce high-level redundancy into the system and still be laughing all the way to the bank.
I misread your statement as a comparison between a GPU solution with ECC and a GPU solution without it, rather than as a comparison between a CPU-only system and a CPU+GPU system.

I'd agree with the price-performance in single-precision throughput, assuming the workload doesn't have very long runtimes and can tolerate error.

DP may be a spoiler: a top-bin quad-core Harpertown is expected to reach about 50 GFLOPS when it comes out.
Dual socket puts it at 100 GFLOPS DP.
At 500 GFLOPS SP, a 1/4 rate puts a single stream processor board at 125 GFLOPS DP.
That's not quite an order of magnitude.

In terms of power consumption, Harpertown is rated at 120 W per chip.

A single stream processor board has a TDP of 150 W.
Since we're doubling the hardware, it's 300 W for 125 GFLOPS.

GPGPU = 125 GFLOPS DP at 300 W
Dual-socket CPU (near future) = 100 GFLOPS DP at 240 W

Granted, Intel's TDP doesn't match AMD's definition for its CPUs. I don't know how AMD measures it for GPUs.

In that comparison the GFLOPS/W is actually the same, though I'll give the edge to the GPU, since each board might fall short of 150 W.

Of course, Harpertown is likely to sell for over $1.5k per processor.
So price-wise, 2 Harpertowns alone would be $3k.
AMD is competitively priced at $1,999 per board, and we now need two of them.
That's $3,998 for 125 GFLOPS.

Considering that a capable processor and the accompanying platform must still be bought to host and feed the cards, a significant portion of the CPU-only machine's cost must be added to the GPU system's cost as well.

I figure these considerations are not as important for very small projects, say one or two machines running tasks that don't take too long to run.

It rules out doubling GPGPUs, at least for big systems with long run times that need DP.

Single-precision should be better.

edit: On further reflection, memory bandwidth would be an edge for the GPU, if it's around 80 GB/s as on the consumer cards, versus Intel's 20 GB/s.

Still not quite an order of magnitude, but still workable for small systems.
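Pulling those assumed figures together in one place (sketch only; every number is an estimate from the posts above, including the assumed 1/4 DP rate):

Code:
# (DP GFLOPS, watts, price in USD): the estimates used above
systems = {
    "dual-socket Harpertown (est.)":      (100.0, 240.0, 3000.0),
    "2x FireStream, assumed 1/4-rate DP": (125.0, 300.0, 3998.0),
}

for name, (gflops, watts, price) in systems.items():
    print(f"{name}: {gflops / watts:.3f} GFLOPS/W, "
          f"${price / gflops:.0f} per DP GFLOPS")

# Both come out to ~0.417 GFLOPS/W; the doubled-up GPU setup is
# slightly *more* expensive per DP GFLOPS under these assumptions.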
Old 09-Nov-2007, 19:27   #38
Dave Baumann
Gamerscore Wh...
 

You're making an assumption that the performance of DP ops is 25% of single precision.
Old 09-Nov-2007, 19:45   #39
3dilettante
Regular
 

That is true; 1/4 was only mentioned somewhere in this thread, not confirmed by AMD.

The press release found it necessary to make a footnote that the product's GFLOPS rating was for SP throughput.

If DP were 1/2, my math would be off by a factor of two, and two FireStream cards at $3,998, running the exact same work unit for the sake of redundancy, would provide 250 GFLOPS DP.

Not an order of magnitude improvement over the upcoming Harpertown, but better.

I'm assuming that DP throughput is not the same as SP, otherwise AMD would not have added that sneaky little footnote. I'm also assuming that it's not greater than SP, since then AMD would have a higher GFLOPS rating in their PDF.

The cost of the surrounding system capable of running two such cards in tandem, or of having twice the nodes just for the sake of redundancy, might still eat away at the cost advantage, however.
Old 09-Nov-2007, 19:50   #40
Tim Murray
the Windom Earle of GPUs
 

Quote:
Originally Posted by Dave Baumann
You're making an assumption that the performance of DP ops is 25% of single precision.
If they're not, I'll eat two hats, and I'll make videos to sell on the Internet.
Old 09-Nov-2007, 20:03   #41
3dilettante
Regular
 

What if it's less than 25%?

The 90nm Cell's throughput drops down to something like 1/10, and that's on an architecture that doesn't try to fit tons of ALUs into a small area.
The 65nm variant is half-speed at DP with specialized hardware added.

Where does RV670 fit on that continuum, I wonder?
Old 09-Nov-2007, 23:02   #42
Mintmaster
Senior Member
 

I always thought it was mostly bus width, routing logic, and register storage that were the main limitations on increasing precision in GPUs. If you have a 50% DP rate and forget about increasing register space, then all those problems go away.

Because GPUs are made to handle hundreds of cycles of latency for texture instructions via thousands of fragments in flight, it's okay for them to have much, much longer instruction latency than a CPU if they want to. That takes a big chunk out of the cost of increasing precision.

If you actually look at the fundamentals, a DP multiplier isn't that big. 160 of them on a 666M-transistor chip is pretty reasonable, especially when you're just modifying 320 SP multipliers to act like that. Half-rate DP isn't out of the question, IMO. The original Cell was probably 1/10 speed because DP was a near-useless feature for its original market (the PS3). Now that it's getting some traction in HPC, the minimal investment required for half-speed DP is worth it.
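One way to see why combining SP multipliers works (a toy integer sketch of the schoolbook decomposition; real hardware combines partial-product arrays rather than doing four serial multiplies):

Code:
# One 2N-bit multiply from four N-bit multiplies: the same trick
# that lets pairs of SP multiplier arrays act as one DP-width one.
N = 27  # roughly half of a 53-bit DP mantissa

def wide_mul(a, b):
    mask = (1 << N) - 1
    a_hi, a_lo = a >> N, a & mask
    b_hi, b_lo = b >> N, b & mask
    return ((a_hi * b_hi) << (2 * N)) \
         + ((a_hi * b_lo + a_lo * b_hi) << N) \
         + a_lo * b_lo

a, b = 0x1FEDCBA987654, 0x1AAAA55559999  # two mantissa-sized integers
assert wide_mul(a, b) == a * b           # decomposition is exact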
Old 09-Nov-2007, 23:50   #43
3dilettante
Regular
 

1/2 is the upper limit of what can be expected, because a higher DP:SP ratio would mean there was hardware that could have been used to raise SP throughput instead.

The question is how much effort AMD put into DP for RV670. It is a derivative of a product only capable of SP, so how much would AMD be willing to tweak the design?

Why wouldn't they have added DP capability as a checkbox feature?
They certainly didn't disclose DP throughput, and 1/2 SP throughput would have been respectable enough to disclose.
Old 10-Nov-2007, 02:39   #44
Geo
Mostly Harmless
 

Quote:
Originally Posted by Tim Murray
If they're not, I'll eat two hats, and I'll make videos to sell on the Internet.
Quoted for permanence, my brotha.
Old 10-Nov-2007, 03:09   #45
Tim Murray
the Windom Earle of GPUs
 

Quote:
Originally Posted by Geo
Quoted for permanence, my brotha.
Welp. D:
Old 10-Nov-2007, 09:41   #46
3vi1
Junior Member
 

Quote:
Originally Posted by 3dilettante
What if it's less than 25%?

The 90nm Cell's throughput drops down to something like 1/10, and that's on an architecture that doesn't try to fit tons of ALUs into a small area.
The 65nm variant is half-speed at DP with specialized hardware added.

Where does RV670 fit on that continuum, I wonder?

An ExtremeTech article from June of this year covers a company building a GPU API, with benchmarks pitting the ATI 2900 against an Nvidia Quadro 4600 and a CPU in both SP and DP.

You can see the PDF here: http://www.gpucomputing.eu/download/en_presskit.pdf or just the three graphs in question here [no pdf]: http://www.gpucomputing.eu/index3.ph...demo1.php&id=2

From the benchmarks featured, it seems the 2900's DP is about 40% of its SP.

Any thoughts?

Old 10-Nov-2007, 09:51   #47
mhouston
A little of this and that
 

No GPGPU chip before this FireStream has native 64-bit. It can be emulated, but not at that performance. I have no idea how they are getting their claimed double-precision performance... I know for sure that R600 doesn't support double, nor does any shipping Nvidia part.
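For a feel of what emulation involves: "double-single" schemes represent one wide value as a pair of SP floats, built on error-free transforms such as Knuth's two-sum, so a single emulated add already costs several SP ops (multiplies cost more). A sketch, with numpy's float32 standing in for GPU single precision:

Code:
import numpy as np
f32 = np.float32

def two_sum(a, b):
    # Knuth's error-free addition: returns (s, e) such that
    # s + e == a + b exactly; s is the rounded SP sum, e the error.
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

hi, lo = two_sum(f32(1.0), f32(1e-10))
print(hi, lo)  # hi alone drops the 1e-10; the (hi, lo) pair keeps it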

Old 10-Nov-2007, 11:05   #48
3vi1
Junior Member
 

Like a side of coleslaw and mashed potatoes...

I was reading this article http://techreport.com/articles.x/10956/3 after watching AMD/ATI's stream computing presentation from September 2006.

So it hit me: what the hell has ATI/AMD been doing? Why have they not leveraged the awesome power of their GPU to do physics? If their GPU technology could do the stuff they showed in their demos, what's the holdup in getting something to the public?

So, after viewing the video, I wondered: where the hell can I buy a PhysicsCAD?!

Imagine for a moment a tool that allows rapid product development. Not only can you design a product's shape and form factor, but you can test its physical characteristics. Building a jungle gym? OK, let's drop a few 40-pound balls onto the top and against the sides to see how the structure holds up. Designing a rocket or lunar component to fly in the Google lunar competition? Well, let's test its design in a simulated hostile environment before you commit to expensive development costs. Do you have a patent for a new device and want to simulate its physical characteristics before the build phase? No problem: design the component's shape, then select the material type and run it through a battery of tests.

I don't mean to simplify things too much, but I imagine the applications for such software could be endless. A PhysicsCAD could be used in everything from educational arenas to product design and testing, before the actual product is ever built, thereby decreasing costs and increasing productivity.

Anyhow, unless AMD/ATI does something exciting to get developers back, they may not be able to withstand Intel's coming assault. So yeah, great, they've got new hardware, but so what? Nvidia is talking about 1 teraflop on Christmas store shelves, just in time for Santa!

AMD, THINK CREATIVELY! We need 3 players in the game to keep things honest... I think it all may be for naught, but a PhysicsCAD/simulator would be a great tool that would attract developers and increase sales beyond just games. Oh yeah: nano-tech modeling...

Just a thought. Cheers.
Old 10-Nov-2007, 14:56   #49
MfA
Regular
 

Quote:
Originally Posted by 3dilettante
What if it's less than 25%?

The 90nm Cell's throughput drops down to something like 1/10, and that's on an architecture that doesn't try to fit tons of ALUs into a small area.
They didn't reuse multiplier hardware between SP and DP processing, so it's not a valid comparison.
Old 12-Nov-2007, 02:35   #50
3dilettante
Regular
 

That is true.

I hadn't given much thought to the implementation on the SPE, just that 1/10 was perhaps the lowest an implementation could go and still be acceptable.

Actually, bringing up the separate-hardware point: a possible compromise for RV670 would be to conserve transistors by only extending the complex ALU to handle DP calculations. Expanding one unit would be less drastic than extending all five ALUs in a processor.

That would put DP performance at 1/5 SP.
Unless it actually iterates a DP op through the SP hardware twice, which would put it back at 1/10.
