Old 29-Jun-2011, 17:14   #376
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 987
Default

Quote:
Originally Posted by CarstenS View Post
Everything that's a physically correct reflection shows a color based on the curvature of the reflective surface and the distance of the reflected object. Or do you propose to alter the reflection calculations in raytracing to make them more … dramatic (?) instead of realistic?
No, of course not. The argument actually goes in another direction: as long as the whole picture does not show only (pseudo-)random colors, there is inherent locality to exploit.
Old 29-Jun-2011, 17:29   #377
CarstenS
Senior Member
 
Join Date: May 2002
Location: Germany
Posts: 2,842
Default

Quote:
Originally Posted by Arun View Post
I think there's a solid argument for downright faking it when it gets way too random. The human brain is unable to make anything useful out of it but the aliasing will still annoy it so it's a lose-lose situation. Anyway that's a last recourse (most of the time sane content and/or local AA will be enough) but I think it should be considered - physically accurate rendering for the sake of physical accuracy isn't a viable strategy.
I see your (and probably Gipsel's) point, but secondary rays nonetheless tend not to behave nearly as nicely as primary ones - and ray tracing with primary rays only is moot most of the time as well.


Edit: And that's where the argument started; we can now move on with the primary topic. Surely, there's a certain level of parallelism to exploit in some of the secondary rays, but equally surely, it won't be as high as for primary rays. And it's not only reflections that add to cache pollution.
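To put a rough number on how quickly reflections lose coherence, here is a minimal Python sketch (not taken from any renderer discussed here). It uses the standard mirror formula r = d - 2(d.n)n and assumes two parallel rays hit neighbouring surface points whose normals differ by a small tilt; the function names and the test angles are illustrative assumptions.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    length = math.sqrt(sum(a * a for a in v))
    return tuple(a / length for a in v)

def reflect(d, n):
    """Mirror direction d about the unit normal n: r = d - 2 (d . n) n."""
    dot = sum(a * b for a, b in zip(d, n))
    return tuple(a - 2.0 * dot * b for a, b in zip(d, n))

def angle_between(u, v):
    """Angle in degrees between two unit vectors."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    return math.degrees(math.acos(dot))

def reflected_spread(normal_tilt_deg):
    """Angle between the reflections of two parallel rays that hit
    neighbouring points whose normals differ by normal_tilt_deg.
    The tilt lies in the plane of incidence (assumed for simplicity)."""
    d = normalize((1.0, -1.0, 0.0))          # shared incoming direction
    n0 = (0.0, 1.0, 0.0)                     # normal at the first point
    t = math.radians(normal_tilt_deg)
    n1 = (math.sin(t), math.cos(t), 0.0)     # tilted normal at the neighbour
    return angle_between(reflect(d, n0), reflect(d, n1))

for tilt in (0.0, 1.0, 5.0, 20.0):
    print(tilt, round(reflected_spread(tilt), 3))
```

With a flat mirror (0 degrees tilt) neighbouring reflected rays stay parallel, while a 5 degree difference in normals spreads them by 10 degrees, since tilting a mirror by an angle turns the reflected ray by twice that angle. Gentle curvature therefore keeps most of the locality; strong curvature is what destroys it.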
__________________
English is not my native tongue. Before flaming please consider the possibility that I did not mean to say what you might have read from my posts.
Work| Recreation
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration!

Last edited by CarstenS; 29-Jun-2011 at 17:37.
Old 29-Jun-2011, 17:35   #378
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 987
Default

Quote:
Originally Posted by 3dilettante View Post
I can't speak for aaronspink
But we can let him speak. The original question was about the power consumption of a cache that was beefed up to serve the same usage profile as the register files of GPUs. So while this sentence is normally true:
Quote:
Originally Posted by aaronspink View Post
A cache will generally have much less power per bit than a register file.
it only applies to the "normal" usage scenarios and designs, as found in CPUs for instance. Aaron also writes that a cache
Quote:
Originally Posted by aaronspink View Post
just devolves into a register file as you add ports.
So where should the power consumption advantage come from if the actual memory arrays don't differ anymore?

Add the few additional tasks a cache must be able to handle (however expensive or cheap they may be) and the simple fact that the cache is very likely physically further away from the units than the register files (which are even split, so each lane of a vector ALU has its own register file placed close to the individual ALU), and remember that it costs energy to drive data over a distance. It necessarily follows that a cache would have a higher power consumption if used as a register file. What you gain is some flexibility, and performance will degrade more gracefully if you need more register space than the register file offers.

Eventually, we may very well see something like the scheme proposed by NVIDIA in that paper, where you have a few registers basically within each ALU, covering the operands for only 4 or 5 instructions, backed by a larger register file, which is in turn backed by a cache system. That way the data traffic to the levels further away from the ALU decreases, i.e. fewer transfers over larger distances are required, which lowers the power consumption.
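The effect of such a hierarchy can be illustrated with a toy model. The Python sketch below is only an assumption-laden illustration, not the scheme from the paper: it counts source operand reads for a short dependent instruction chain, with and without a small LRU operand buffer next to the ALU. The relative energy costs (1 for the buffer, 4 for the main register file) are made-up placeholders, and result writes are not charged at all.

```python
from collections import OrderedDict

# Hypothetical relative energy per operand read; illustrative, not measured.
ENERGY = {"operand_buffer": 1.0, "register_file": 4.0}

def operand_energy(instructions, buffer_entries):
    """Total read energy for all source operands of an instruction stream.

    instructions: list of (dest, src1, src2, ...) register names.
    buffer_entries: capacity of the small per-ALU buffer (0 disables it).
    """
    buffer = OrderedDict()  # least recently used entry comes first
    total = 0.0

    def touch(reg):
        # Insert or refresh a register in the buffer, evicting the LRU entry.
        if buffer_entries == 0:
            return
        buffer[reg] = True
        buffer.move_to_end(reg)
        while len(buffer) > buffer_entries:
            buffer.popitem(last=False)

    for dest, *sources in instructions:
        for reg in sources:
            level = "operand_buffer" if reg in buffer else "register_file"
            total += ENERGY[level]
            touch(reg)
        touch(dest)  # results land in the buffer first (assumed policy)
    return total

# A short dependent chain: each instruction reuses recent results.
program = [
    ("r1", "r0", "r0"),
    ("r2", "r1", "r0"),
    ("r3", "r2", "r1"),
    ("r4", "r3", "r2"),
]
print(operand_energy(program, 0))  # 32.0, every read goes to the register file
print(operand_energy(program, 4))  # 11.0, only the first read leaves the ALU
```

In this model only the very first read has to travel to the main register file once the buffer exists, which is the kind of traffic reduction the layered design is after. How large the real saving is depends entirely on the actual energy ratios and on how much operand reuse real shader code shows.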
Old 29-Jun-2011, 17:38   #379
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 987
Default

Quote:
Originally Posted by CarstenS View Post
I see your (and probably Gipsels') point, but nonetheless do secondary rays exhibit a tendency to not behave nearly as nice as primary ones
But they do behave "nice enough". If they stop doing that, they just create noise, which you normally want to avoid anyway, because the picture probably starts to look awful at that point. A bit of noise can improve the perceived realism, but a bit doesn't mean everything.
Old 29-Jun-2011, 17:43   #380
ToTTenTranz
Senior Member
 
Join Date: Jul 2008
Posts: 2,155
Default

Quote:
Originally Posted by Arun View Post
I think what you mean is that they are not specifically targeting Llano. But it is clearly compatible with Llano and given that there's nothing fancy about Llano's CPU-GPU integration and the architecture is the same one used in many other AMD GPUs, I'm really not sure what the benefit of that could possibly be?
Sure, there might be no technical benefit (except for lower latency in CPU<->GPU communication, if it'll ever make a difference?), but Llano is damn cheap.
Looking at APU + motherboard prices, it's almost like a €50 computing-capable graphics card is being offered for free.

From a market standpoint, it should be a game changer, increasing developers' interest in supporting OpenCL/DirectCompute.
Especially if demand for Llano laptops is as high as it's been rumoured to be.
Old 29-Jun-2011, 17:44   #381
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
Originally Posted by Nick View Post
And CPUs run OpenCL too.
CPUs run OpenCL only when specifically asked to do so.

An fGPU will transparently run code written for a dGPU.
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
Old 29-Jun-2011, 18:24   #382
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by Arun View Post
I think what you mean is that they are not specifically targeting Llano.
Specifically as in inclusively, yes.

Intel is expected to support OpenCL on Ivy Bridge's IGP. For the sake of argument, let's assume performance will be horrendous. Then clearly the existence of OpenCL applications doesn't prove that GPGPU on an IGP is the future. Similarly there is no proof yet that Llano's architecture has any merit beyond mere graphics. In other words, just because Llano runs OpenCL, doesn't mean it's a convincing incentive for developers to invest more into OpenCL development.

I'd love to see a software renderer written purely in OpenCL (not using any fixed-function hardware), and compare that against SwiftShader. Then we'd be able to get a true picture of the value of IGPs for computing...
Old 29-Jun-2011, 18:39   #383
ToTTenTranz
Senior Member
 
Join Date: Jul 2008
Posts: 2,155
Default

Quote:
Originally Posted by Nick View Post
I'd love to see a software renderer written purely in OpenCL (not using any fixed-function hardware), and compare that against SwiftShader. Then we'd be able to get a true picture of the value of IGPs for computing...
Why would an OpenCL-based software renderer be a better benchmark than many already-available image-editing, video-editing, video-encoding and password decrypting applications?
Old 29-Jun-2011, 19:01   #384
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
Originally Posted by Nick View Post
In other words, just because Llano runs OpenCL, doesn't mean it's a convincing incentive for developers to invest more into OpenCL development.
Llano is a huge incentive for developers to invest more into OpenCL development, as the installed base of machines having competent GPUs just went through the roof.

EDIT
Quote:
Intel is expected to support OpenCL on Ivy Bridge's IGP. For the sake of argument, let's assume performance will be horrendous.
It's Intel's IGP. Of course it is going to suck.
EDIT
Quote:
I'd love to see a software renderer written purely in OpenCL (not using any fixed-function hardware), and compare that against SwiftShader. Then we'd be able to get a true picture of the value of IGPs for computing...
I'd love to see a DX11 software renderer (not using any fixed-function hardware), and compare that against Llano. Then we'd be able to get a true picture of the value of pure software rendering.
Old 29-Jun-2011, 19:48   #385
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by ToTTenTranz View Post
Why would an OpenCL-based software renderer be a better benchmark than many already-available image-editing, video-editing, video-encoding and password decrypting applications?
Because many GPGPU applications and benchmarks claim extraordinary speedups by comparing the results of high-end GPUs against a plain C implementation on the CPU.
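As a back-of-the-envelope illustration of why such comparisons mislead, the sketch below divides a reported speedup by the scaling that a multithreaded, vectorized CPU version could ideally reach. The function name, the 100x figure, the core count, the SIMD width and the efficiency factor are all hypothetical values chosen for the arithmetic, not measurements.

```python
def normalized_speedup(reported_speedup, cpu_cores, simd_lanes, cpu_efficiency=1.0):
    """Speedup that remains once the CPU baseline also uses all cores and SIMD.

    cpu_efficiency is the assumed fraction of ideal cores * lanes scaling
    that optimized CPU code actually achieves.
    """
    cpu_scaling = cpu_cores * simd_lanes * cpu_efficiency
    return reported_speedup / cpu_scaling

# A claimed 100x over single-threaded scalar C, re-based against a quad-core
# CPU with 8 single-precision SIMD lanes (hypothetical numbers).
print(normalized_speedup(100.0, 4, 8))        # 3.125 with ideal CPU scaling
print(normalized_speedup(100.0, 4, 8, 0.5))   # 6.25 at 50% CPU efficiency
```

Even with a pessimistic efficiency factor, the headline number shrinks by an order of magnitude, which is why a tuned software renderer would be a more honest yardstick.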
Old 29-Jun-2011, 19:55   #386
liolio
French frog
 
Join Date: Jun 2005
Location: France
Posts: 4,172
Default

Regarding software renderers vs. IGPs, it would IMHO be more interesting to see an updated version of Unreal vs. BF3 running on an IGP (Llano, Sandy Bridge or Ivy Bridge).
Old 29-Jun-2011, 20:12   #387
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by rpg.314 View Post
Llano is a huge incentive for developers to invest more into OpenCL development as the installed base of machines having competent gpu's just went through the roof.
Unless Intel just went out of business and every consumer decided to upgrade today, nothing is going through the roof any time soon.
Old 30-Jun-2011, 01:47   #388
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,291
Default

Quote:
Originally Posted by Nick View Post
Unless Intel just went out of business and every consumer decided to upgrade today, nothing is going through the roof any time soon.
If Intel's software graphics rendering power is as you claim, then why wouldn't their OpenCL computational power scale just as well? People can use an OpenCL CPU device just as easily as a GPU device.
__________________
I speak only for myself.
Old 30-Jun-2011, 06:05   #389
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,809
Default

Quote:
Originally Posted by Nick View Post
That's easy. Workloads depend on ILP, TLP or DLP for high performance, and increasingly a combination of these. GPUs still only offer good DLP, with TLP improving but still suffering from cache contention. CPUs are great for both ILP and TLP, and are catching up really fast in DLP.
I suspect you're giving CPUs a free pass here. Their ILP prowess comes at the cost of large caches, OOOE and speculative branching which are all very expensive. Are you suggesting that all those things will scale accordingly with higher arithmetic throughput? Not to mention the burden of x86 decoders.

CPUs are also subject to cache thrashing due to TLP and frequent context switching, especially on larger data sets. There is no magic that will enable CPUs to feed much wider execution resources efficiently and do so for free.

GPUs will rely less on DLP in the future but it's doubtful that x86+AVX will offer much competition in graphics workloads. IGPs may be doomed but anything more than that has the memory bandwidth and transistor/power budget to put CPUs to shame.

Quote:
Which converges it toward the CPU...
Not really, it just makes it faster.
__________________
What the deuce!?
Old 30-Jun-2011, 07:25   #390
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
Originally Posted by trinibwoy View Post
IGPs may be doomed but anything more than that has the memory bandwidth and transistor/power budget to put CPUs to shame.
SB-class IGPs are doomed. I don't see any reason why strong alternatives like Llano's projected successors would be doomed as well.
Old 30-Jun-2011, 10:54   #391
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,570
Default

Quote:
Originally Posted by Arun View Post
I think you're using the more hardware-centric definition of a register file as necessarily not being based on SRAM, or at least much more expensive SRAM? If so that doesn't apply because (as far as I can tell) GPUs frequently use L1-like SRAM for their register file as they can tolerate the inherently higher latency.
Given the port requirements, it is unlikely to be anywhere near as dense or as power-efficient as a cache RAM array.
__________________
Aaron Spink
speaking for myself inc.
Old 30-Jun-2011, 12:07   #392
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,809
Default

Quote:
Originally Posted by rpg.314 View Post
SB class igp's are doomed. I don't see any reason why strong alternatives like Llano's projected successors are doomed as well.
Llano's projected successors will be going up against their discrete counterparts. The only reason IGPs and APUs exist is that many people don't care about graphics performance. For everyone else they're pretty useless.

Also, what's going to happen when games aren't based on 6 yr old console hardware any more? All IGPs will then resume their place in the trash bin.
Old 30-Jun-2011, 12:14   #393
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
Llano's projected successors will be going up against their discrete counterparts.
And they will have significant advantages of latency, power and cost.
Old 30-Jun-2011, 12:26   #394
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by trinibwoy View Post
I suspect you're giving CPUs a free pass here. Their ILP prowess comes at the cost of large caches, OOOE and speculative branching which are all very expensive. Are you suggesting that all those things will scale accordingly with higher arithmetic throughput? Not to mention the burden of x86 decoders.
You have to think of a high throughput homogeneous CPU as the unification of a legacy CPU and an IGP. The compute density isn't necessarily much higher than that of a whole APU. But the high throughput AVX units benefit from having access to the same cache hierarchy and from out-of-order execution. You save a lot of communication overhead and certain structures don't have to be duplicated. And as I've detailed before, executing AVX-1024 on 256-bit execution units drastically reduces the power consumption of the CPU's front-end and schedulers, and hides latency by implicitly allowing access to four times more registers.
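The front-end argument can be sketched with simple arithmetic. Note that AVX-1024 is a hypothetical extension as discussed in this thread, and the model below assumes that one 1024-bit instruction simply occupies a 256-bit execution unit for four cycles; the function name and the 32-bit element size are illustrative choices.

```python
def frontend_work(elements, register_bits, unit_bits=256, element_bits=32):
    """Return (instructions issued, execution-unit cycles) for one vector
    operation applied to `elements` values.

    register_bits: architectural vector register width.
    unit_bits: physical width of the execution unit (assumed 256).
    """
    lanes_per_instruction = register_bits // element_bits
    instructions = -(-elements // lanes_per_instruction)   # ceiling division
    cycles_per_instruction = max(1, register_bits // unit_bits)
    return instructions, instructions * cycles_per_instruction

print(frontend_work(1024, 256))    # (128, 128)
print(frontend_work(1024, 1024))   # (32, 128)
```

For 1024 single-precision elements the ALU is busy for 128 cycles either way, but with the wide registers only 32 instructions have to be fetched, decoded and scheduled instead of 128. The same factor of four applies to register storage: the same number of architectural registers holds four times as much data.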

So there are no compromises to legacy scalar execution, and it also exploits DLP in practically the same way as a GPU!

Besides, there is no viable alternative. You said you agree they will converge but wonder whether CPUs or GPUs are more representative (i.e. closer to the result of the convergence)? GPUs have a very long way to go to offer acceptable sequential performance. Some form of out-of-order execution, and a comprehensive cache hierarchy are an absolute must to be able to compete with CPUs. For CPUs to compete with GPUs the only thing lacking is AVX-1024...
Quote:
CPUs are also subject to cache thrashing due to TLP and frequent context switching, especially on larger data sets.
Hardware thread switches are rare when you use software fiber scheduling. But even full context switches can be accelerated if that ever proves to be useful.
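For readers unfamiliar with the term, software fibers are cooperatively scheduled tasks that switch in user space, so the hardware thread underneath never changes. A minimal sketch using Python generators as stand-in fibers follows; the scheduler and worker names are illustrative, and a real implementation would be native code with its own stacks.

```python
from collections import deque

def worker(name, steps):
    """A stand-in fiber: does one unit of work, then yields control."""
    for i in range(steps):
        yield f"{name}{i}"

def run_fibers(fibers):
    """Round-robin over generator-based fibers on a single OS thread.

    Returns the order in which the work units were executed.
    """
    ready = deque(fibers)
    trace = []
    while ready:
        fiber = ready.popleft()
        try:
            trace.append(next(fiber))   # run until the fiber yields
            ready.append(fiber)         # still has work, requeue it
        except StopIteration:
            pass                        # fiber finished, drop it
    return trace

print(run_fibers([worker("a", 2), worker("b", 3)]))
# ['a0', 'b0', 'a1', 'b1', 'b2']
```

Every switch here is an ordinary function return and call, with no involvement of the operating system, which is why the hardware thread's cache contents are disturbed only by the work itself.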
Quote:
There is no magic that will enable CPUs to feed much wider execution resources efficiently and do so for free.
It would be no more magical nor free than on a GPU. I don't see any noteworthy obstacles in increasing the CPU's DLP.
Quote:
GPUs will rely less on DLP in the future but it's doubtful that x86+AVX will offer much competition in graphics workloads. IGPs may be doomed but anything more than that has the memory bandwidth and transistor/power budget to put CPUs to shame.
Anything more as in anything larger? Once IGPs have been replaced by software rendering, nothing is holding Intel back from selling CPUs with more cores and more bandwidth. If they can increase their revenue by keeping people from buying low-end and mid-range discrete graphics cards, they won't let that opportunity slip.

Actually it's a simple question of growing the IGP or growing the CPU cores to threaten the mid-end discrete GPU market. Given that AVX2 brings us everything to drastically speed up software rendering and other high throughput applications, and it's readily extendable to 1024-bit registers, Intel seems focused on increasing CPU DLP. They only have to keep an adequate IGP around for long enough to make the transition.

Software rendering is not limited by the API so once developers start using the CPU more directly it would even compete with high-end discrete cards. It will take many years, but the convergence isn't stopping so this is bound to happen. Perhaps by the end of this decade buying a discrete graphics card may seem as silly as buying a discrete sound card. They'll still exist but for the majority of consumers won't offer any worthwhile benefit.
Quote:
Not really, it just makes it faster.
Increasing the clock frequency doesn't come for free. You need substantial changes to the register set, caches, instruction scheduling, etc. to sustain the higher throughput while keeping relative latencies the same. No matter what you do, increasing the clock frequency converges the GPU closer to the CPU microarchitecturally as well.

Last edited by Nick; 30-Jun-2011 at 12:41.
Old 30-Jun-2011, 12:58   #395
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by rpg.314 View Post
SB class igp's are doomed. I don't see any reason why strong alternatives like Llano's projected successors are doomed as well.
Butchering CPU performance for 50% higher graphics performance is hardly a success formula. Intel's IGPs might be "doomed", but that has never stopped them before. Why would it be of significance now?
Old 30-Jun-2011, 13:41   #396
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 987
Default

Quote:
Originally Posted by aaronspink View Post
Given the porting requirements it is unlikely that it is anywhere near as dense nor as power efficient as a cache ram array.
Due to the split nature of the register files (each slice serves just a single or only a few vector lanes), and because the units execute the same instruction for the same nominal register (just a different lane of the logical vector each clock) over 2 to 4 clocks, the register files do not need more ports than your typical L1. You can get away with a single read and a single write port.
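The port argument can be put into a toy calculation. The sketch assumes that one access to a register-file slice returns a row wide enough to feed the lane for every clock of the instruction, so the operand reads of one instruction can be spread over those clocks; the function name and this row layout are assumptions made for illustration, not a description of any specific GPU.

```python
import math

def read_ports_needed(source_operands, clocks_per_instruction):
    """Read ports one register-file slice needs in this simplified model.

    Each port access fetches one operand for all clocks of the instruction,
    so at most `clocks_per_instruction` operands can share a single port.
    """
    return math.ceil(source_operands / clocks_per_instruction)

print(read_ports_needed(3, 4))  # 1: three operands fetched over four clocks
print(read_ports_needed(3, 1))  # 3: a new instruction every clock
```

In this model a three-operand instruction issued over four clocks fits a single read port, whereas a design that starts a new three-operand instruction every clock needs three ports (or equivalent banking) for the same throughput.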
Old 30-Jun-2011, 14:16   #397
ToTTenTranz
Senior Member
 
Join Date: Jul 2008
Posts: 2,155
Default

Quote:
Originally Posted by Nick View Post
Butchering CPU performance for 50% higher graphics performance is hardly a success formula. Intel's IGPs might be "doomed", but that has never stopped them before. Why would it be of significance now?

Those charts are a good representation of cumulative sales over the past 15 years.
Regarding sales for the past 2-3 years (which is what matters the most for OEMs and computing-demanding software developers), they're a bit useless, as the top 5 GPUs aren't even in the market anymore.

I think you're downplaying graphics a bit too much, as if we were in ~2005.

Even if the end customer is unaware of that fact in 99% of cases, OEMs know that a better GPU drastically enhances gaming, video and web-browsing performance. That's why Brazos was sold out in Q1 2011, and has taken quite a chunk out of Atom shipments.

So if OEMs value better performing iGPUs and prefer the option to bundle AMD APUs, more PCs with AMD APUs will be on the market, more people will buy AMD APUs, and more developers will put a nice, big and shiny stamp in their latest software claiming it takes full advantage of the iGPU in people's newly-bought PCs.
Old 30-Jun-2011, 14:45   #398
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,809
Default

Quote:
Originally Posted by rpg.314 View Post
And they will have significant advantages of latency, power and cost.
Pretty trivial to pull off when they offer middling performance. Have you seen comparisons between desktop Llano and the lowly 6670?
Old 30-Jun-2011, 15:10   #399
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Llano has a noticeable gap in its integration today. That needn't be the case with its successors.

Besides, if IB's DRAM stacking takes off, it might reduce the bandwidth advantage of discrete cards to a dead heat once integration benefits are counted.
Old 30-Jun-2011, 21:13   #400
liolio
French frog
 
Join Date: Jun 2005
Location: France
Posts: 4,172
Default

Sorry to derail the conversation, but I have some questions about Larrabee/Knights Corner.
I read news about Intel managing to get CMOS @32nm; they now expect this technology to scale along with their process progress (they were stuck @65nm until then).
Are ring buses made using this technique, or are they done another way?

About Larrabee/K'sC: can we basically expect any changes from the original Larrabee, aside from the removal of the texture units?
I have the feeling that K'sC is clearly a "filler product", something Intel is pushing out to compete somewhat with GPGPU and make some money out of their investments.

As Nick is saying, Intel is putting its strength into the AVX2 instruction set (and a proper implementation). It doesn't make much sense to launch something next year that uses a completely different instruction set (hence my feeling about K'sC being a filler product).
We know very little about Haswell, but I don't believe it's the architecture that will allow Intel to do it all. It may allow software rendering with acceptable results for casual gamers, and do marvels for physics, AI, etc. for the others, but that's it. GPUs (and GPGPUs) will still be a compelling target for the workloads that map well to their architectures. 500 GFLOPS won't cut it against modern GPUs.

Intel needs a more throughput-oriented design if they want to stop GPUs from biting into their market share. It may also help them reach (or definitively secure) other markets. Honestly I don't know much, but after reading some material about the UltraSPARC CPU line and the upcoming IBM PowerPC A2, it looks to me like the way Larrabee was designed is no longer suited to the goals Intel may pursue now. Maybe it's nothing, but I noticed that in all those designs the cores can access a "shared L2" (as I understand it, the difference from Larrabee's local subsets of the L2 is that they can read and write anywhere in the L2 cache, whereas a Larrabee core can only read and write its local subset of the L2 and read from the others). Could this be a wanted feature for the kind of work Larrabee's successors (after K'sC) might be intended for? (Or could/should Intel scale back the number of cores and include an L3?)
There is also the focus on power consumption: 16-wide SIMD may not be workable within the design; it supposedly consumes a lot and sets terrible constraints on the memory system. A move to AVX2, as Nick is proposing, sounds like a win to me; actually I wonder if it is worth it to have them push 4 FLOPS per cycle (Haswell is supposed to do 2 FMAC per cycle, so twice 2 FLOPS, right? I'm not sure I got this properly while reading).
I also noticed that in the PowerPC A2 the designers have put a huge focus on chip-to-chip communication. That seems absent from Larrabee/K'sC and looks like a huge lack. They need something that scales well.
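The FLOPS question can be settled with a little arithmetic. The sketch below assumes two 256-bit FMA units per core (which matches the "2 FMAC per cycle" reading above) and single-precision elements; the four cores and the 3.9 GHz clock in the example are hypothetical values, picked only to show how a figure near 500 GFLOPS can come about.

```python
def peak_gflops(cores, clock_ghz, fma_units, simd_bits, element_bits=32):
    """Theoretical peak throughput in GFLOPS.

    Each FMA counts as two floating-point operations (multiply and add),
    and every FMA unit processes simd_bits / element_bits lanes per cycle.
    """
    lanes = simd_bits // element_bits
    flops_per_cycle = fma_units * lanes * 2
    return cores * clock_ghz * flops_per_cycle

# Hypothetical quad-core part at 3.9 GHz with two 256-bit FMA units per core.
print(peak_gflops(4, 3.9, 2, 256))  # about 499.2
```

Per SIMD lane that is indeed 2 FMAs times 2 FLOPs, so 4 FLOPs per cycle; with 8 single-precision lanes per 256-bit unit it comes to 32 FLOPs per cycle per core.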

So what are your POV(s) on the matter? Do you believe that Intel, after experimenting with Larrabee, with Itanium's crumbling support and a possible threat to their CPU dominance, could launch proper throughput cores? What could they look like?
It could be a win for Intel: Haswell might be awesome, but I don't believe the silicon budget will allow a proper do-it-all architecture (if that is ever to happen); they could have their way with heterogeneous designs, with different cores using the same ISA(s).
All times are GMT +1. The time now is 14:30.

