#351 |
|
Senior Member
Join Date: Jun 2003
Posts: 2,570
|
cell density, ports, delay, signalling, etc. The designs are generally quite different between a cache sram and a register file cell.
__________________
Aaron Spink speaking for myself inc. |
|
|
|
|
|
#352 | |||
|
Unknown.
Join Date: Aug 2002
Location: UK
Posts: 4,877
|
Quote:
Quote:
And although many of them might not really save much hardware because of the increased cost of data movement, there are also tons of ideas that have been proposed over the years to reduce the amount of fixed-function hardware and improve programmability:
- Keep texture addressing in HW, but do texture filtering in the shader core, at least for all >8-bit and FP formats. Arguably slightly less likely to be partial now that there's an 8-bit compressed FP format in DX11 (as opposed to FP10 which is really 32-bit); if this happens, it should be for all filtering, which is a pretty controversial step.
- Handle blending in the shader core. This is already done not only on PowerVR hardware, where it's easier because it's a TBDR, but also on Tegra, which is an IMR. Some of the collision checking means it probably doesn't save much but it's useful. And while you're at it, non-linear color spaces and non-traditional AA techniques will become more frequent so you might as well do MSAA resolve in the shader core as well (but properly, hi R600!)
- Do triangle setup in the shader core. Intel IGPs already do that (or did a few generations ago at least). One historical problem with that, ironically enough, was that FP32 wasn't enough for the corner cases unless you did things rather obtusely iirc. With FP64 becoming mainstream that's no longer a problem, although it may or may not still hurt power efficiency.
Quote:
__________________
Focusing on non-graphics projects in 2013 (but I still love triangles) "[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions." |
|||
|
|
|
|
|
#353 | |
|
Member
Join Date: May 2002
Location: Herwood, Tampere, Finland
Posts: 264
|
Quote:
|
|
|
|
|
|
|
#354 |
|
Senior Member
|
It would really help if you could explain in some detail.
|
|
|
|
|
|
#355 | |
|
Senior Member
|
Quote:
|
|
|
|
|
|
|
#356 | |
|
Member
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 987
|
Quote:
__________________
x: RCP_sat R2.x, R1.y y: RCP_sat ____, R1.y z: RCP_sat ____, R1.y |
|
|
|
|
|
|
#357 | |
|
Unknown.
Join Date: Aug 2002
Location: UK
Posts: 4,877
|
Quote:
Secondly, we're not talking about the same kind of ILP - AMD's VLIW needs enough independent ALU instructions, whereas hiding memory latency only requires enough ALU instructions (independent or not) between texture/memory accesses to hide the latency. In general, the number of ALU instructions between accesses to external memory MUST increase because ALU performance will increase faster than bandwidth.
And finally, using ILP to hide memory latency on GPUs isn't like using ILP to improve single-threaded performance on CPUs at all. What matters is the AVERAGE amount of ILP over many threads, not the fluctuating amount of it available inside a single thread. There will often be parts of a program which are a sort of 'serial bottleneck' but those will be compensated by the parts that have a lot of ILP.
I could certainly be wrong and ILP won't increase, but I'd be surprised if it actually decreased.
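To put rough numbers on the latency-hiding point, here is a toy model rather than a description of any real GPU scheduler. It assumes one ALU instruction issues per cycle, that every thread alternates between a run of ALU instructions and one memory access, and that only the average run length across threads matters. The function names and the 400-cycle latency are made up for illustration.

```python
import math

def average_run_length(run_lengths):
    """Mean number of ALU instructions between memory accesses across threads."""
    return sum(run_lengths) / len(run_lengths)

def threads_needed(mem_latency_cycles, alu_instrs_between_accesses):
    """Threads required so the ALU never idles while one thread waits on memory.

    While one thread waits mem_latency_cycles, the other (N - 1) threads each
    contribute one run of ALU work, so (N - 1) * run >= latency.
    The ALU instructions only have to exist; they need not be independent.
    """
    return math.ceil(mem_latency_cycles / alu_instrs_between_accesses) + 1

# Hypothetical workload: some threads are in a "serial bottleneck" phase with
# short runs, others are in phases with plenty of ALU work per access.
runs = [5, 35, 10, 30]
avg = average_run_length(runs)
print("average ALU run length:", avg)
print("threads needed at 400 cycles latency:", threads_needed(400, avg))
print("threads needed if the run length doubles:", threads_needed(400, 2 * avg))
```

In this model, doubling the ALU work between accesses roughly halves the number of threads (and therefore registers) needed to cover the same latency.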
|
|
|
|
|
|
#358 | |||||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Also, you can't force data accesses to be register accesses. What makes it even more ironic is that on a GPU you want to minimize the number of registers to maximize the number of wavefronts. And that again also worsens cache contention. Quote:
Quote:
Quote:
Quote:
It turns out a much better comparison is Brisbane versus Barcelona. Together with a slew of other changes which probably don't take a lot of space each, the widening of the SSE path made the core grow by only 23%. That's only 8% of Barcelona's entire die. So 5% for doubling the throughput probably isn't a bad approximation. Doubling it again obviously costs more in absolute terms, but the rest of the core has grown / become more powerful as well. Sandy Bridge already widened part of the execution paths. Suffice to say that implementing AVX2 in Haswell will be relatively cheap and we can consider it to have twice the throughput at a negligible cost. That's absolutely not the case for GPUs, unless they start trading fixed-function hardware for programmable cores... |
|||||
|
|
|
|
|
#359 | |
|
Unknown.
Join Date: Aug 2002
Location: UK
Posts: 4,877
|
Quote:
Well ideally you'd try to automatically raytrace as many rays as necessary to avoid that (effectively dynamic SSAA) - which will also naturally result in a moderate amount of secondary ray coherence.
|
|
|
|
|
|
#360 |
|
Senior Member
|
And what should we do with microtriangles?
|
|
|
|
|
|
#361 |
|
Senior Member
|
Are you asserting that developers won't create applications for Llano?
|
|
|
|
|
|
#362 |
|
Member
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 987
|
They still form a smooth shape, don't they?
The argument is actually a very general one. The complete absence of locality in an image means it is just a random coloring of pixels. You won't be able to recognize anything in it (which would basically be certain structures, like a cube, a square or whatever). Therefore each meaningful picture will show some locality. And on the way to generating this image, your algorithm will necessarily also experience locality in the accessed data structures (describing the scene). Just take the simple example of a raytraced image of a half-transparent, half-reflective sphere in some environment. Neighboring rays intersecting the sphere give rise to secondary rays. But those secondary rays still very likely intersect the same (or very close) objects. After all, the reflection will just show a (distorted) image of the environment, and the same goes for the refracted rays. So unless you have a surface made of pixel-sized mirrors each pointing in a random direction (which would simply create noise), secondary rays will also benefit from locality in the accessed data structures.
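The sphere example can be checked with a few lines of code. This is only a sketch under simple assumptions: a perfectly reflective unit sphere at the origin, a camera at (0, 0, -3), and two neighboring primary rays whose directions are picked arbitrarily. All names are hypothetical helpers, not part of any renderer.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def reflect_off_unit_sphere(origin, direction):
    """Return the reflected direction of a ray hitting the unit sphere
    centered at the origin, or None if the ray misses it."""
    d = normalize(direction)
    b = sum(o * k for o, k in zip(origin, d))
    c = sum(o * o for o in origin) - 1.0
    disc = b * b - c
    if disc < 0.0:
        return None
    t = -b - math.sqrt(disc)                 # nearest intersection
    hit = tuple(o + t * k for o, k in zip(origin, d))
    normal = hit                             # unit sphere: hit point is the normal
    d_dot_n = sum(k * n for k, n in zip(d, normal))
    return tuple(k - 2.0 * d_dot_n * n for k, n in zip(d, normal))

def angle_between_deg(a, b):
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    return math.degrees(math.acos(dot))

camera = (0.0, 0.0, -3.0)
primary_a = normalize((0.10, 0.0, 1.0))
primary_b = normalize((0.11, 0.0, 1.0))      # a neighboring pixel's ray

secondary_a = reflect_off_unit_sphere(camera, primary_a)
secondary_b = reflect_off_unit_sphere(camera, primary_b)

print("primary spread (deg):  ", angle_between_deg(primary_a, primary_b))
print("secondary spread (deg):", angle_between_deg(secondary_a, secondary_b))
```

The curved surface widens the spread between the two rays, but the secondary directions stay close to each other, so they are likely to traverse nearby parts of the scene data structure.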
|
|
|
|
|
#363 | |
|
Heteroscedasticitate
Join Date: Mar 2005
Posts: 2,354
|
Quote:
__________________
Donald Knuth: Science is what we understand well enough to explain to a computer. Art is everything else we do. |
|
|
|
|
|
|
#364 | |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Widening the execution units would increase throughput, but beyond dual 256-bit FMA units it would have a significant cost and require sacrificing scalar performance. It would also increase the instruction rate again, and thus all related power consumption, and worsen the latency hiding. I doubt these compromises are worth it. 2048-bit and beyond isn't practically feasible since AVX is limited to 1024-bit. But it's a very reasonable limit, since wider vectors would also worsen branch granularity and task granularity.
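The branch granularity point can be illustrated with a small calculation. This is a deliberately pessimistic sketch: it assumes 32-bit elements and that every lane takes a branch independently with the same probability, which real workloads with spatial coherence usually beat. The function names and the 0.9 probability are invented for the example.

```python
def lanes(vector_bits, element_bits=32):
    """Number of elements processed per vector instruction."""
    return vector_bits // element_bits

def p_uniform_branch(p_taken, n_lanes):
    """Probability that all lanes agree on a branch, so only one side
    has to be executed (assumes independent lanes)."""
    return p_taken ** n_lanes + (1.0 - p_taken) ** n_lanes

for bits in (256, 512, 1024):
    n = lanes(bits)
    print(bits, "bits:", n, "lanes, P(no divergence) =", p_uniform_branch(0.9, n))
```

Under these assumptions the chance of avoiding divergence drops quickly as the vector gets wider, which is the granularity cost being described.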
|
|
|
|
|
|
#365 | |||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Quote:
Quote:
|
|||
|
|
|
|
|
#366 |
|
Regular
|
GPUs are great at DLP and surviving high frequency cache misses from semi-random streaming.
__________________
Cinematic is the new streamlined. |
|
|
|
|
|
#367 |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
|
|
|
|
|
|
#368 |
|
Senior Member
|
|
|
|
|
|
|
#369 | |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
|
|
|
|
|
|
|
#370 |
|
Senior Member
|
|
|
|
|
|
|
#371 |
|
Senior Member
Join Date: Jul 2008
Posts: 2,157
|
|
|
|
|
|
|
#372 |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
|
|
|
|
|
|
#373 | |
|
Senior Member
|
Quote:
Just let me clarify: you'll also get this kind of thrashing around of secondary, or even worse tertiary, rays if you choose to apply some form of antialiasing in order to remove some of the noise from your rendered picture.
__________________
English is not my native tongue. Before flaming please consider the possibility that I did not mean to say what you might have read from my posts. Work | Recreation
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration! |
|
|
|
|
|
|
#374 | |
|
Unknown.
Join Date: Aug 2002
Location: UK
Posts: 4,877
|
I think what you mean is that they are not specifically targeting Llano. But it is clearly compatible with Llano and given that there's nothing fancy about Llano's CPU-GPU integration and the architecture is the same one used in many other AMD GPUs, I'm really not sure what the benefit of that could possibly be?
Quote:
|
|
|
|
|
|
#375 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,109
|
I can't speak for aaronspink, but I do recall some differences were mentioned in the context of some recent CPUs.
The register file uses 8T SRAM, while caches use 6T, though in the case of Atom the L1 data cache also shifted to use 8T, which has a commensurate cost in storage per transistor. The other caches stuck with 6T. The upshot is that the use of 8T SRAM allowed the design to run reliably at lower voltages. SRAM reliability at a given voltage level has come up as a design consideration in discussions or articles about the latest designs. I know that caches tend to favor density while register files tend to favor high performance. I had thought that pushing a cache to the same level of porting as a register file would make it noticeably more bloated than a register file due to the scaling of the cache's ancillary hardware, but the most recent posts on the matter indicate the RAM would dominate.
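For a sense of scale on the "storage per transistor" cost, here is a back-of-the-envelope sketch. It counts only the bit-cell transistors and ignores tags, decoders, sense amplifiers, extra ports and any difference in cell area, so it says nothing about real die size. The 32 KiB capacity is a hypothetical example, not a figure from any specific CPU.

```python
def cell_transistors(capacity_bytes, transistors_per_cell):
    """Transistors spent on the storage cells alone."""
    return capacity_bytes * 8 * transistors_per_cell

capacity = 32 * 1024  # hypothetical 32 KiB array

six_t = cell_transistors(capacity, 6)
eight_t = cell_transistors(capacity, 8)

print("6T cells:", six_t)
print("8T cells:", eight_t)
print("8T / 6T ratio:", eight_t / six_t)
```

So for the same capacity the 8T array spends a third more transistors on cells, which is the price paid for the lower-voltage reliability mentioned above.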
__________________
Dreaming of a .065 micron etch-a-sketch. Last edited by 3dilettante; 29-Jun-2011 at 16:54. |
|
|
|