Old 29-Jun-2011, 09:04   #351
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,570
Default

Quote:
Originally Posted by rpg.314 View Post
How and why?
Cell density, ports, delay, signalling, etc. The designs of a cache SRAM cell and a register file cell are generally quite different.
__________________
Aaron Spink
speaking for myself inc.
aaronspink is offline   Reply With Quote
Old 29-Jun-2011, 09:08   #352
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,877
Default

Quote:
Originally Posted by aaronspink View Post
No it wouldn't. There is little difference between a register file and a cache RAM, except that the cache RAM is more compact and significantly lower power. Both can have a lot of sideband hardware, but it doesn't tend to dominate the characteristics at all.
I think you're using the more hardware-centric definition of a register file as necessarily not being based on SRAM, or at least on much more expensive SRAM? If so, that doesn't apply, because (as far as I can tell) GPUs frequently use L1-like SRAM for their register file as they can tolerate the inherently higher latency.

Quote:
Originally Posted by rpg.314 View Post
Doubtful. TMU's and rasterizer can't be removed without nuking your graphics perf. ROPs can't be removed unless you do sort middle rasterization, which is too big a change. There's not much ff hw left to remove without doing serious surgery on GPU arch.
Ah, but who talked of removing them? The ALU-FF ratio will just naturally increase over time.

And although many of them might not really save much hardware because of the increased cost of data movement, there are also tons of ideas that have been proposed over the years to reduce the amount of fixed-function hardware and improve programmability:
- Keep texture addressing in HW, but do texture filtering in the shader core at least for all >8-bit and FP formats. Arguably slightly less likely to be partial now that there's an 8-bit compressed FP format in DX11 (as opposed to FP10 which is really 32-bit); if this happens, it should be for all filtering, which is a pretty controversial step.
- Handle blending in the shader core. This is already done not only on PowerVR hardware, where it's easier because it's a TBDR, but also on Tegra, which is an IMR. Some of the collision checking means it probably doesn't save much, but it's useful. And while you're at it, non-linear color spaces and non-traditional AA techniques will become more frequent, so you might as well do MSAA resolve in the shader core as well (but properly, hi R600!)
- Do triangle setup in the shader core. Intel IGPs already do that (or did a few generations ago at least). One historical problem with that, ironically enough, was that FP32 wasn't enough for the corner cases unless you did things rather obtusely iirc. With FP64 becoming mainstream that's no longer a problem, although it may or may not still hurt power efficiency.
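To make "blending in the shader core" a little more concrete, here is a minimal sketch of a source-over alpha blend written as ordinary ALU code instead of a fixed-function ROP operation. The function name and the float RGBA tuples are assumptions for illustration only; a real implementation would also need the ordering and collision checks mentioned in the list above.

```python
def blend_src_over(src, dst):
    """Source-over blend of two non-premultiplied RGBA colors (floats in [0, 1]).

    Hypothetical helper: this is the arithmetic a shader would run once it has
    read the destination pixel, not any vendor's actual blend path.
    """
    sr, sg, sb, sa = src
    dr, dg, db, da = dst
    out_a = sa + da * (1.0 - sa)
    if out_a == 0.0:
        # Both inputs fully transparent: the color is undefined, return clear.
        return (0.0, 0.0, 0.0, 0.0)

    def mix(s, d):
        # Weight each channel by its alpha contribution, then un-premultiply.
        return (s * sa + d * da * (1.0 - sa)) / out_a

    return (mix(sr, dr), mix(sg, dg), mix(sb, db), out_a)
```

Because the blend is just code, swapping in a non-linear color space or a custom resolve filter is a matter of changing `mix`, which is the programmability argument made above.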

Quote:
Doubtful. Those clocks would need 2x more threads for same mem latency.
2x more threads for the same memory latency compared to the same architecture running the same programs at 1.5GHz, but a bit less than 2x more threads than today - as programs become longer, there will naturally be more Instruction Level Parallelism in them, which helps hide more memory latency for a given RF size. Not that the RF isn't important going forward, as indicated by the amount of effort NVIDIA put into that register file cache paper for relatively minor improvements in the grand scheme of things.
__________________
Focusing on non-graphics projects in 2013 (but I still love triangles)
"[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions."
Arun is offline   Reply With Quote
Old 29-Jun-2011, 09:38   #353
hkultala
Member
 
Join Date: May 2002
Location: Herwood, Tampere, Finland
Posts: 264
Default

Quote:
Originally Posted by Arun View Post
2x more threads for the same memory latency compared to the same architecture running the same programs at 1.5GHz, but a bit less than 2x more threads than today - as programs become longer, there will naturally be more Instruction Level Parallelism in them, which helps hide more memory latency for a given RF size.
Actually the ILP seems to be DECREASING instead of increasing. AMD is moving away from the ILP-centric (VLIW) approach, because there is not enough ILP in modern shader programs. When shader programs get more complex, they also start having more complex control structures, which limits ILP.
hkultala is offline   Reply With Quote
Old 29-Jun-2011, 09:53   #354
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Send a message via Skype™ to rpg.314
Default

Quote:
Originally Posted by aaronspink View Post
Cell density, ports, delay, signalling, etc. The designs of a cache SRAM cell and a register file cell are generally quite different.
It would really help if you could explain in some detail.
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
rpg.314 is offline   Reply With Quote
Old 29-Jun-2011, 09:57   #355
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Send a message via Skype™ to rpg.314
Default

Quote:
Handle blending in the shader core. This is already done not only on PowerVR hardware, where it's easier because it's a TBDR, but also on Tegra, which is an IMR. Some of the collision checking means it probably doesn't save much, but it's useful. And while you're at it, non-linear color spaces and non-traditional AA techniques will become more frequent, so you might as well do MSAA resolve in the shader core as well (but properly, hi R600!)
All this plainly screams for sort middle rendering. Or a big ass dram on package.
rpg.314 is offline   Reply With Quote
Old 29-Jun-2011, 09:58   #356
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 987
Default

Quote:
Originally Posted by CarstenS View Post
…or your surfaces are finely detailed (like the metallic, reflective spokes in a wheel). Oh wait, that's what we really want, isn't it?
No, I don't want each neighboring pixel/sample to show a completely different color. Then it's a noisy and completely aliased picture. That looks awful.
Gipsel is offline   Reply With Quote
Old 29-Jun-2011, 10:03   #357
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,877
Default

Quote:
Originally Posted by hkultala View Post
Actually the ILP seems to be DECREASING instead of increasing. AMD is moving away from the ILP-centric (VLIW) approach, because there is not enough ILP in modern shader programs. When shader programs get more complex, they also start having more complex control structures, which limits ILP.
That's a good point; it is true that more complex control structures limit ILP, but I think you're reading too much into AMD's move away from VLIW. That move is GPGPU-centric, where the control structures have always been more complex. And some of their limitations did not come from an inherent lack of ILP but rather from the restrictive nature of their multibanked register file (which I'm sure is extremely efficient in hardware, so I'm not saying it's a bad design decision - only that it obscures the real amount of ILP available).

Secondly, we're not talking about the same kind of ILP - AMD's VLIW needs enough independent ALU instructions, whereas hiding memory latency only requires enough ALU instructions (independent or not) between texture/memory accesses to hide the latency. In general, the number of ALU instructions between accesses to external memory MUST increase because ALU performance will increase faster than bandwidth.

And finally using ILP to hide memory latency on GPUs isn't like using ILP to improve single-threaded performance on CPUs at all. What matters is the AVERAGE amount of ILP over many threads, not the fluctuating amount of it available inside a single thread. There will often be parts of a program which are a sort of 'serial bottleneck' but those will be compensated by the parts that have a lot of ILP.
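The relationship between clock speed, ALU work per memory access and the number of threads needed can be sketched with a very simple occupancy model. Everything here is an assumption for illustration: one instruction issued per cycle, a fixed memory latency, and a fixed average run of ALU instructions between accesses. The 400 ns and 10-instruction figures are made up, not measurements of any GPU.

```python
import math


def threads_to_hide_latency(latency_ns, clock_ghz, alu_per_access):
    """Threads needed so the ALU never idles while one thread waits on memory.

    Idealized model (assumption): while one thread waits `latency_cycles`,
    every other thread contributes `alu_per_access` cycles of useful work.
    """
    latency_cycles = latency_ns * clock_ghz
    # ceil(...) other threads cover the wait, plus the waiting thread itself.
    return math.ceil(latency_cycles / alu_per_access) + 1


print(threads_to_hide_latency(400, 1.5, 10))  # baseline clock
print(threads_to_hide_latency(400, 3.0, 10))  # doubled clock, same program
print(threads_to_hide_latency(400, 3.0, 20))  # doubled clock, longer ALU runs
```

Under these assumptions the three cases need 61, 121 and 61 threads: doubling the clock roughly doubles the requirement for the same program, and doubling the average ALU run between accesses cancels that out. Note that the model only uses the average run length, which matches the point that what matters is the average over many threads.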

I could certainly be wrong and ILP won't increase, but I'd be surprised if it actually decreased.
Arun is offline   Reply With Quote
Old 29-Jun-2011, 10:13   #358
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by Gipsel View Post
I thought it was already discussed how power hungry that eventually is.
16 registers per thread is plenty to ensure that the vast majority of accesses to reused data are register accesses. I sincerely doubt that having any more registers would have a significant effect on power consumption.

Also, you can't force data accesses to be register accesses. What makes it even more ironic is that on a GPU you want to minimize the number of registers to maximize the number of wavefronts. And that again also worsens cache contention.
Quote:
I remember the FMA specification first turned up years ago. AMD announced that Bulldozer will support FMA(4) more than two years ago (and before, SSE5 proposed in 2007 also included [differently encoded 3 operand] FMA instructions). The confusion about FMA4 and FMA3 is also already 2 years old. So someone who didn't know that Intel will implement FMA(3) must have lived under a stone in a cave somewhere in the middle of nowhere for the last few years.
Some people still doubted that Intel would introduce FMA with Haswell, because Sandy Bridge doubled the ALU width already.
Quote:
For conventional GPUs it does not matter, as you can stream with a higher bandwidth directly from memory. So a huge LLC is basically wasted. But look at Intel's iGPU in Sandy Bridge! They already share the L3. Why do you think it will be any different in future versions of it (or AMD's)?
RAM bandwidth is highly proportional to computing power, both for discrete cards as well as IGPs. However, developers won't create GPGPU applications for something as weak as an IGP, regardless of whether or not it has access to an L3 cache. In other words, even though a large L3 cache can have a profound effect on the performance of a CPU, you don't fix all of a GPU's problems by throwing an L3 cache at it.
Quote:
And how does a compilation profit from wide vector units in that course?
Again, it doesn't. And it doesn't have to. The CPU is already great at tasks of this complexity. It's the GPU that has a long way to go to become any better at ILP and TLP.
Quote:
Yes, in the same 65nm process the core area (i.e. just the core including L1) grew from 19.7 mm² to 31.5 mm². And the complete die was 57% larger. Doesn't look like a doubling of compute density to me.
Aside from supporting x64, Core 2 also widened from a 3 to a 4 instruction architecture and implemented many other features. Unfortunately these major changes make it impossible to assess the isolated cost of widening the SSE paths to 128-bit.

It turns out a much better comparison is Brisbane versus Barcelona. Together with a slew of other changes which probably don't take a lot of space each, the widening of the SSE path made the core grow by only 23%. That's only 8% of Barcelona's entire die. So 5% for doubling the throughput probably isn't a bad approximation.

Doubling it again obviously costs more in absolute terms, but the rest of the core has grown / become more powerful as well. Sandy Bridge already widened part of the execution paths. Suffice it to say that implementing AVX2 in Haswell will be relatively cheap and we can consider it to have twice the throughput at a negligible cost.

That's absolutely not the case for GPUs, unless they start trading fixed-function hardware for programmable cores...
Nick is offline   Reply With Quote
Old 29-Jun-2011, 10:14   #359
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,877
Default

Quote:
Originally Posted by rpg.314 View Post
All this plainly screams for sort middle rendering. Or a big ass dram on package.
Heh, all of 3D rendering screams for TSV (true 3D packaging) or even better cheap silicon photonics. Not that I'd complain about a TBDR-based console, mind you.

Quote:
Originally Posted by Gipsel View Post
No, I don't want each neighboring pixel/sample to show a completely different color. Then it's a noisy and completely aliased picture. That looks awful.
Well ideally you'd try to automatically raytrace as many rays as necessary to avoid that (effectively dynamic SSAA) - which will also naturally result in a moderate amount of secondary ray coherence.
Arun is offline   Reply With Quote
Old 29-Jun-2011, 10:25   #360
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Send a message via Skype™ to rpg.314
Default

Quote:
Originally Posted by Gipsel View Post
No, I don't want each neighboring pixel/sample to show a completely different color. Then it's a noisy and completely aliased picture. That looks awful.
And what should we do with microtriangles?
rpg.314 is offline   Reply With Quote
Old 29-Jun-2011, 10:30   #361
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Send a message via Skype™ to rpg.314
Default

Quote:
Originally Posted by Nick View Post
However, developers won't create GPGPU applications for something as weak as an IGP, regardless of whether or not it has access to an L3 cache.
Are you asserting that developers won't create applications for Llano?
rpg.314 is offline   Reply With Quote
Old 29-Jun-2011, 10:55   #362
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 987
Default

Quote:
Originally Posted by rpg.314 View Post
And what should we do with microtriangles?
They still form a smooth shape, don't they?

The argument is actually a very general one. The complete absence of locality in an image means it is just a random coloring of pixels. You won't be able to recognize anything in it (which would basically be certain structures, like a cube, a square or whatever). Therefore each meaningful picture will show some locality. And on the way to generating this image, your algorithm will necessarily also experience locality in the accessed data structures (describing the scene).

Just take the simple example of a raytraced image of a half-transparent, half-reflective sphere in some environment. Neighboring rays intersecting the sphere give rise to secondary rays, but those secondary rays still very likely intersect the same (or very close) objects. After all, the reflection will show just a (distorted) image of the environment, and the same goes for the refracted rays. So unless you have a surface of pixel-sized mirrors each pointing in a random direction (which would simply create noise), secondary rays will also benefit from locality in the accessed data structures.
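The sphere example can be checked numerically with a small sketch. The scene is an assumption chosen purely for illustration: a unit sphere at the origin, a camera at (0, 0, -3) outside it, and two primary rays through neighboring "pixels" whose directions differ by 0.001 in x. All helper names are hypothetical.

```python
import math


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def reflect_off_unit_sphere(origin, direction):
    """Reflected direction of a ray hitting the unit sphere at the origin.

    Returns None on a miss. Assumes the ray origin lies outside the sphere
    and the sphere is in front of it.
    """
    d = normalize(direction)
    b = sum(o * k for o, k in zip(origin, d))
    c = sum(o * o for o in origin) - 1.0
    disc = b * b - c
    if disc < 0.0:
        return None
    t = -b - math.sqrt(disc)  # nearest intersection along the ray
    # On a unit sphere centered at the origin the hit point is also the normal.
    normal = tuple(o + t * k for o, k in zip(origin, d))
    d_dot_n = sum(k * n for k, n in zip(d, normal))
    return tuple(k - 2.0 * d_dot_n * n for k, n in zip(d, normal))


def angle_deg(a, b):
    """Angle between two unit vectors, in degrees."""
    cos = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    return math.degrees(math.acos(cos))


camera = (0.0, 0.0, -3.0)
r1 = reflect_off_unit_sphere(camera, (0.100, 0.0, 1.0))
r2 = reflect_off_unit_sphere(camera, (0.101, 0.0, 1.0))
print(angle_deg(r1, r2))
```

In this setup the two reflected rays diverge by only a fraction of a degree, several times more than the primary rays but still close enough that they are likely to traverse the same part of the scene structure. The curvature amplifies the divergence without destroying the coherence, which is the point being made above.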
Gipsel is offline   Reply With Quote
Old 29-Jun-2011, 11:17   #363
AlexV
Heteroscedasticitate
 
Join Date: Mar 2005
Posts: 2,354
Default

Quote:
Originally Posted by Arun View Post
- Do triangle setup in the shader core. Intel IGPs already do that (or did a few generations ago at least). One historical problem with that, ironically enough, was that FP32 wasn't enough for the corner cases unless you did things rather obtusely iirc. With FP64 becoming mainstream that's no longer a problem, although it may or may not still hurt power efficiency.
The problem there isn't the math, or available precision, IMHO (special case BFT - Big Fucking Triangles - and voila!), but rather the data-flow.
__________________
Donald Knuth: Science is what we understand well enough to explain to a computer. Art is everything else we do.
AlexV is offline   Reply With Quote
Old 29-Jun-2011, 11:19   #364
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by RecessionCone View Post
So, if widening AVX to 1024 bits will bring such large performance improvements at such low cost, why stop at 1024? Why not 2048 bits? Or even more?
Widening AVX without widening the execution units won't bring a large performance improvement. It mainly lowers power consumption, and would help hide latency. There's no point in anything beyond a 4:1 ratio due to diminishing returns.

Widening the execution units would increase throughput, but beyond dual 256-bit FMA units it would have a significant cost and require sacrificing scalar performance. It would also increase the instruction rate again and thus all related power consumption, and worsen the latency hiding. I doubt these compromises are worth it.

2048-bit and beyond isn't practically feasible since AVX is limited to 1024-bit. But it's a very reasonable limit since wider vectors would also worsen branch granularity and task granularity.
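The difference between widening the vector instructions and widening the execution units can be sketched with a toy cost model. It assumes (for illustration only) that a W-bit instruction is executed over W/E cycles on E-bit units, with no other overheads; the function and parameter names are hypothetical.

```python
def vector_cost(num_floats, vector_bits, exec_bits, float_bits=32):
    """Return (instructions issued, execution cycles) for one pass over the data.

    Toy model: assumes the data size divides evenly into vectors and that
    vector_bits is a multiple of exec_bits.
    """
    instructions = (num_floats * float_bits) // vector_bits
    cycles_per_instruction = vector_bits // exec_bits
    return instructions, instructions * cycles_per_instruction


print(vector_cost(1024, 256, 256))   # 256-bit instructions on 256-bit units
print(vector_cost(1024, 1024, 256))  # 1024-bit instructions on 256-bit units
```

For 1024 floats this gives (128, 128) and (32, 128): the 4:1 configuration issues a quarter of the instructions for the same number of execution cycles. In this model the gain is in front-end work (and therefore power) and in the slack available to hide latency, not in raw throughput.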
Nick is offline   Reply With Quote
Old 29-Jun-2011, 12:05   #365
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by trinibwoy View Post
Yes, everyone knows they will converge but the argument is whether the resulting architecture will look more like a GPU or a CPU. I don't understand why you're so convinced that today's CPUs are more representative of future many-core architectures than today's GPUs.
That's easy. Workloads depend on ILP, TLP or DLP for high performance, and increasingly a combination of these. GPUs still only offer good DLP, with TLP improving but still suffering from cache contention. CPUs are great for both ILP and TLP, and are catching up really fast in DLP.
Quote:
There is additional compute density to be had on GPUs as well. nVidia at least is predicting up to 3 GHz shader clocks in the next few years on GPU parts.
Which converges it toward the CPU...
Quote:
What are we expecting from Haswell in terms of fixed function hardware?
Nothing has been confirmed, but my personal expectation is that Intel won't risk any radical changes yet and will just include an enhanced DX11 IGP. They'll be able to seriously experiment with having the CPU cores assist in vertex and/or geometry processing though. If Skylake features AVX-1024 then a mainstream chip would deliver ~1 TFLOP at low power consumption so it can definitely entirely take over the IGP's task. By then things like texture sampling will likely require more programmability anyway so there's no need for any fixed-function hardware, although of course AVX can be extended with a few more valuable instructions.
Nick is offline   Reply With Quote
Old 29-Jun-2011, 12:14   #366
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,227
Send a message via ICQ to MfA
Default

GPUs are great at DLP and surviving high frequency cache misses from semi-random streaming.
__________________
Cinematic is the new streamlined.
MfA is offline   Reply With Quote
Old 29-Jun-2011, 12:45   #367
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by rpg.314 View Post
Are you asserting that developers won't create applications for Llano?
Yes.
Nick is offline   Reply With Quote
Old 29-Jun-2011, 14:37   #368
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Send a message via Skype™ to rpg.314
Default

Quote:
Originally Posted by Nick View Post
Yes.
http://forum.beyond3d.com/showthread.php?t=58195
rpg.314 is offline   Reply With Quote
Old 29-Jun-2011, 15:01   #369
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by rpg.314 View Post
http://forum.beyond3d.com/showthread.php?t=58195
So which of those is targeting Llano?
Nick is offline   Reply With Quote
Old 29-Jun-2011, 15:18   #370
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Send a message via Skype™ to rpg.314
Default

Quote:
Originally Posted by Nick View Post
So which of those is targeting Llano?
EVERYONE that uses opencl
rpg.314 is offline   Reply With Quote
Old 29-Jun-2011, 15:31   #371
ToTTenTranz
Senior Member
 
Join Date: Jul 2008
Posts: 2,157
Default

Quote:
Originally Posted by rpg.314 View Post
EVERYONE that uses opencl
And Compute Shader, and STREAM, and APP.
ToTTenTranz is offline   Reply With Quote
Old 29-Jun-2011, 15:48   #372
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by rpg.314 View Post
EVERYONE that uses opencl
They're not targeting Llano. And CPUs run OpenCL too.
Nick is offline   Reply With Quote
Old 29-Jun-2011, 16:29   #373
CarstenS
Senior Member
 
Join Date: May 2002
Location: Germany
Posts: 2,842
Send a message via ICQ to CarstenS
Default

Quote:
Originally Posted by Gipsel View Post
No, I don't want each neighboring pixel/sample to show a completely different color. Then it's a noisy and completely aliased picture. That looks awful.
Everything that's a physically correct reflection shows a color based on the curvature of the reflective surface and the distance of the reflected object. Or do you propose to alter the reflection calculations in raytracing to make them more … dramatic (?) instead of realistic?

Just let me clarify:
You'll also be getting this kind of thrashing around of secondary, or even worse tertiary, rays if you choose to apply some form of antialiasing in order to remove some of the noise from your rendered picture.
__________________
English is not my native tongue. Before flaming please consider the possiblity that I did not mean to say what you might have read from my posts.
Work| Recreation
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration!
CarstenS is offline   Reply With Quote
Old 29-Jun-2011, 16:48   #374
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,877
Default

Quote:
Originally Posted by Nick View Post
They're not targeting Llano. And CPUs run OpenCL too.
I think what you mean is that they are not specifically targeting Llano. But it is clearly compatible with Llano and given that there's nothing fancy about Llano's CPU-GPU integration and the architecture is the same one used in many other AMD GPUs, I'm really not sure what the benefit of that could possibly be?

Quote:
Originally Posted by CarstenS View Post
Everything that's a physically correct reflection shows a color based on the curvature of the reflective surface and the distance of the reflected object. Or do you propose to alter the reflection calculations in raytracing to make them more … dramatic (?) instead of realistic?
I think there's a solid argument for downright faking it when it gets way too random. The human brain is unable to make anything useful out of it but the aliasing will still annoy it so it's a lose-lose situation. Anyway that's a last recourse (most of the time sane content and/or local AA will be enough) but I think it should be considered - physically accurate rendering for the sake of physical accuracy isn't a viable strategy.
Arun is offline   Reply With Quote
Old 29-Jun-2011, 16:49   #375
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,109
Default

Quote:
Originally Posted by rpg.314 View Post
It would really help if you could explain in some detail.
I can't speak for aaronspink, but I do recall some differences were mentioned in the context of some recent CPUs.
The register file uses 8T SRAM, while caches use 6T, though in the case of Atom the L1 data cache also shifted to use 8T, which has a commensurate cost in storage per transistor.
The other caches stuck with 6T.

The upshot is that the use of 8T SRAM allowed the design to run reliably at lower voltages. SRAM reliability at a given voltage level has come up as a design consideration in discussions or articles about the latest designs.

I know that caches tend to favor density while register files tend to favor high performance.
I had thought that pushing a cache to the same level of porting as a register file would make it noticeably more bloated than a register file due to the scaling of the cache's ancillary hardware, but the most recent posts on the matter indicate the RAM would dominate.
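The "commensurate cost" of moving from 6T to 8T cells is easy to put a number on if you only count storage transistors. The sketch below does that for an assumed 32 KiB array; it deliberately ignores tags, decoders, sense amps and the fact that cell area does not scale exactly with transistor count.

```python
def cell_transistors(capacity_bytes, transistors_per_cell):
    """Transistors in the bit cells alone (no tags, ECC or peripheral logic)."""
    return capacity_bytes * 8 * transistors_per_cell


six_t = cell_transistors(32 * 1024, 6)
eight_t = cell_transistors(32 * 1024, 8)
print(six_t, eight_t, eight_t / six_t)
```

For 32 KiB that is 1,572,864 versus 2,097,152 transistors, i.e. one third more for the 8T array, which is the price paid for the extra ports and lower-voltage operation described above.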
__________________
Dreaming of a .065 micron etch-a-sketch.

Last edited by 3dilettante; 29-Jun-2011 at 16:54.
3dilettante is offline   Reply With Quote
