|
|
#101 |
|
Member
Join Date: Jun 2007
Posts: 263
|
It must be like how our brain works. Seriously, specialization has its benefits. You would not want to emulate floating point with integer arithmetic.
|
|
|
|
|
|
#102 | |||||||||
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
Quote:
That's just one cycle of hidden latency for a branch that won't resolve for another 15 cycles (best case). Quote:
Quote:
Quote:
Quote:
As to the approximation you are using: what mispredict rate have you chosen? Is it a 10% chance of misprediction per individual branch, or the cumulative probability of a misprediction somewhere in a 64-instruction window with 6 branches in that range? Successive branches compound, and the cumulative rate can leave 30% of ROB entries uncommitted even with more reasonable per-branch misprediction rates.

What do you mean by IPC in this case? Barring an I-cache miss, a good 4-wide speculative processor is going to push close to 4 instructions through the front end of the pipeline every clock. Referencing the stacked penalty model in that paper, the front-end penalty is going to be over 3 times higher than for the 5-stage pipeline chosen in the model.

100% of all speculatively issued instructions will go through the mispredict pipeline up to the final point of execution. In terms of ROB instructions that do not commit, that's 30% of instructions going through hardware that, as a percentage of the non-cache core area, is close to 2/3 or more active logic. This is a fixed power cost 100% of the time.

Some as yet undetermined percentage of the wasted instructions will go so far as to execute and have their results pending in the ROB when they are negated, depending on the situation. So every speculated instruction pays the decode+schedule power cost, and some pay an additional execution-unit cost, for which I have no figures and which will vary depending on the operation type. Loads and stores in i7 are very heavily speculated, more so than in other OoO chips.
Quote:
That's different than running an instruction through the pipeline and not knowing it is invalid until the end. Quote:
If all the chip is doing is drawing a wall or a 10 pixel character, whatever time it takes will be sufficiently fast for the purposes of the GPU's target market. Quote:
Quote:
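To make the per-branch versus per-window distinction raised earlier in this post concrete, here is a minimal sketch. It assumes branches mispredict independently, which real code does not guarantee, and the function name is made up for illustration. It computes the chance of at least one mispredict in the window, which is not the same thing as the fraction of ROB entries thrown away.

```python
def window_mispredict_probability(per_branch_rate: float, branches_in_window: int) -> float:
    """Chance that at least one branch in the window mispredicts.

    Assumes every branch mispredicts independently with the same rate.
    """
    return 1.0 - (1.0 - per_branch_rate) ** branches_in_window


# Six branches per window, as in the 64-instruction example above.
for rate in (0.10, 0.05, 0.02):
    p = window_mispredict_probability(rate, 6)
    print(f"per-branch rate {rate:.0%}: at least one mispredict in window {p:.1%}")
```

Under that independence assumption, a 10% per-branch rate already gives close to a 47% chance that a six-branch window contains a mispredict, which is why the two ways of quoting the rate cannot be used interchangeably.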
__________________
Dreaming of a .065 micron etch-a-sketch. |
|||||||||
|
|
|
|
|
#103 | ||||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Dear god, CUDA's definitions are an abomination! Quote:
Quote:
Quote:
You can either try loading the necessary data explicitly, or you can let caching heuristics do it for you automatically. Given the increase in complexity and wide variety of algorithms, my vote goes to caches. Maybe not as much as 24 MB, but Larrabee's 4-8 MB sounds about right for a wide variety of workloads. |
||||
|
|
|
|
|
#104 | |
|
Senior Member
Join Date: Mar 2006
Posts: 1,682
|
Quote:
Let's define resources in terms of area of the full chip. I can see how, running Crysis at the highest quality level on a g98, major parts of the chip sit idle, waiting for a shader to complete. But on a well proportioned RV770 or g98 or GT200? Come on... |
|
|
|
|
|
|
#105 | ||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Quote:
Vice versa, CPUs have a lot to learn from GPUs, but they're quickly moving in the right direction. Today, a Core i7 delivers about 100 GFLOP/s. By 2012 the number of cores will have doubled, the vector width will have doubled, we'll have FMA support, and it will run at a slightly higher clock frequency. That's good for 1 TFLOP/s. Soon after that we should see support for scatter/gather appear. That does become compelling.

The most beautiful thing of all is that developers will be largely unrestricted by the API or any fixed-function hardware. As shown by FQuake, one can achieve amazing performance by doing things the way the developer intends to do them, instead of bending over backwards to work within the constraints of the API and the hardware. APIs won't disappear, but they'll become more of a framework like OpenCL.

Today's GPUs are still very much designed specifically to support the Direct3D pipeline. Everything is meticulously balanced to support that as best as possible. But as soon as you do something out of the ordinary you hit one bottleneck after the other, and that's happening a lot given the growing diversification of graphics techniques.

Anyway, I have a hard time believing that GPU architects are sitting on their hands. Programmable texture sampling can be implemented using largely generic scatter/gather units and doing the filtering in the shader units. ROP is also destined to become programmable, and tessellation units could eventually become programmable rasterizers as well. By moving everything to programmable cores the de facto bottlenecks vanish. Larrabee has a head start, but possibly not for very long.

I don't think there's any question whether in the end the CPU or the GPU will 'win'. They'll both win, but each in their own domain. Discrete graphics cards will keep ruling the high-performance graphics market, while CPUs will become adequate for low-end markets. Compare it to sound processing. Nowadays 95% of us don't have a discrete card any more. The CPU processes sound in a driver "on the side". With a CPU already capable of 1 TFLOP/s, there's no need to shell out for a second more dedicated chip, unless you want the absolute latest in graphics. |
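The peak-throughput arithmetic earlier in this post can be written out as a small sketch. Every number here is an assumption for illustration: 4 cores, 4 single-precision SSE lanes, one add and one multiply per lane per cycle, and 3.2 GHz for today's Core i7; for the 2012 part I read "FMA support" as another doubling of flops per lane, and 3.6 GHz is a made-up stand-in for "slightly higher".

```python
def peak_gflops(cores: int, simd_lanes: int, flops_per_lane_per_cycle: int, clock_ghz: float) -> float:
    """Theoretical peak in GFLOP/s: every lane of every core busy on every cycle."""
    return cores * simd_lanes * flops_per_lane_per_cycle * clock_ghz


# Assumed configuration for today's Core i7 (separate add and multiply per cycle).
today = peak_gflops(cores=4, simd_lanes=4, flops_per_lane_per_cycle=2, clock_ghz=3.2)

# Hypothetical 2012 part: cores x2, vector width x2, FMA counted as x2, clock up a bit.
projected = peak_gflops(cores=8, simd_lanes=8, flops_per_lane_per_cycle=4, clock_ghz=3.6)

print(f"today:     {today:.1f} GFLOP/s")
print(f"projected: {projected:.1f} GFLOP/s ({projected / today:.1f}x)")
```

The three doublings alone give 8x; the clock bump supplies the rest. These are peak figures only and say nothing about what real code sustains.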
||
|
|
|
|
|
#106 |
|
Regular
|
GPUs in fact have a far easier time using every CPU concept than x86 ... I don't think Intel is ever going to introduce split branches for instance.
|
|
|
|
|
|
#107 |
|
Senior Member
Join Date: Feb 2002
Posts: 2,019
|
|
|
|
|
|
|
#108 | |
|
Member
Join Date: Sep 2006
Posts: 273
|
Quote:
http://www.forum-3dcenter.org/vbulle...=267950&page=6 |
|
|
|
|
|
|
#109 | |
|
Member
Join Date: Jun 2007
Posts: 263
|
Quote:
There seems to be a tendency for total CPU usage percentage to drop on machines with more cores. This also happens with lower-resolution screens. It could be the driver part that copies the image from CPU to GPU memory. When I display the rendered image only every 10th frame, CPU utilization is well over 90%. Last edited by Voxilla; 12-Aug-2009 at 07:24. |
|
|
|
|
|
|
#110 | |
|
Naughty Boy!
|
Quote:
They started adding features that have no meaning for graphics, and since G80 they haven't really added any graphics features; they only improved CUDA. Mainly things like the shared cache and the double precision math. Rather radical changes from a hardware point of view, but Direct3D10 doesn't even know they exist. Direct3D11 only knows what to do with those things for compute shaders, not graphics. And in a slightly different direction, Intel also started experimenting with more programmability and less hardwired functionality in their IGPs. Part of the clipping and triangle setup is actually performed by special kernels. Same with acceleration of video decoding. Obviously Larrabee will be a similar approach... I wonder when nVidia and AMD are going to go down this route as well. Perhaps already with DX11 hardware?
__________________
ZX81 -> C64 -> Hercules -> Plantronics CGA -> Paradise VGA -> Amiga ECS -> Amiga AGA -> Cirrus Logic 5428 VLB -> S3 Trio64 -> Matrox Mystique -> PCX2 -> Matrox G200 -> Matrox G450 -> GeForce2 GTS -> Kyro II -> Radeon 8500 -> Radeon 9600XT -> GeForce 7600GT -> GeForce 8800GTS -> HD5770 |
|
|
|
|
|
|
#111 | |
|
Regular
|
Quote:
Still, having lots of space for context on chip is currently the only solution. And with the difference in performance between on-die memory and off-die memory only increasing, there's no solution in sight. 3D chip stacking is likely to lead to entire layers of nothing but memory - so on-chip memory is set merely to grow, not shrink. Regardless of the name you give the processor, CPU or GPU. Jawed
__________________
Can it play WoW? |
|
|
|
|
|
|
#112 | |
|
Regular
|
Quote:
At least it'll have bags of cache Jawed
__________________
Can it play WoW? |
|
|
|
|
|
|
#113 |
|
Now Officially a Top 10 Poster
Join Date: May 2006
Location: Maastricht, The Netherlands
Posts: 12,879
|
I looked into this and from what I read you can configure each core to run one to four threads, which would suggest that if you configure it to run with 1 thread, it would run at full speed.
|
|
|
|
|
|
#114 | |||||||
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
Quote:
If the example actually did exist, I would obviously go for the smaller chip, unless the bigger chip also came with a free unicorn. Quote:
The wastage may be invariant over time. Expecting a certain performance level in code that provides no obvious extra non-speculative work per clock may only come by speculation. For workloads where this is not a problem, the amount of non-computational active logic and die area expended on what turns out to be irrelevant bit-fiddling becomes increasingly inappropriate. For power-limited and mobile applications, which are increasingly driving the market, it does not yet appear to be appropriate, and I haven't seen anything rumored from now until the transition to 22nm, after which there's still nothing but also nothing substantive about anything else anyway. As a side note: In the case of other constraints, like the limits of the memory bus or on-die communications networks, the more precious a resource becomes, the less acceptable it is to waste it on things that turn out not to be needed. Quote:
Quote:
Read/write coherent caches are not yet in use, but their presence is orthogonal to the question of speculation. Larrabee's level of speculation is bare-minimum. It has an extremely short pipeline; it is an in-order design, and much of the code it runs is statically unrolled. As far as use of its vector capability, the pixel shader emulation reduces the amount of speculation to as close to zero as possible. It also reduces the use of the coherent capability of its caches to a very low level, but that is a separate issue. Quote:
Quote:
I think the world will have moved on from Crysis at that point. The more prevalent hardware case in 2012 won't be 8-core desktops, but probably hybrid solutions with on-die IGPs or GPUs, as the "core race" has already become uninteresting for most users. Quote:
__________________
Dreaming of a .065 micron etch-a-sketch. |
|||||||
|
|
|
|
|
#115 | ||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Quote:
|
||
|
|
|
|
|
#116 | |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
Quote:
If the core can be set to a 1-thread mode, Larrabee would look like a 1 GHz Pentium.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
|
#117 | |||
|
Regular
|
Quote:
CUDA introduced the concept of a fixed-size, shared memory. This low level architectural detail "breaks" the stream programming model. It's the MS-DOS 640KB limit all over again. D3D11 mandates 32KB of shared memory (and surely CUDA will mandate at least that for GT3xx cards).

Arguably shared memory is a performance kludge - an alternative to launching two kernels in succession (since all writes to shared memory require bounding with a fence). You could say it's no worse than programming for a CPU with a known 32KB L1, 256KB L2 and 4MB L3. The GPUs lose a lot of performance moving data off chip, only to read it back again - even though they're capable of hiding such latencies. So shared memory takes on the role of joining two distinct kernels separated by a fence to become arbitrary writes/reads of shared memory in a single kernel, with the performance constraints of fences and limited capacity.

Shared memory was originally called parallel data cache - hinting at what I think was its original graphics-related purpose of caching vertex and geometry kernel output - i.e. barycentrics for triangles and GS-amplification - for use by pixel-shading and setup.

Is 16KB enough? Dunno. I'm doubtful. As I am that 32KB is enough. Future versions of D3D are supposedly planned to increase this amount. OpenCL allows the programmer to query the device to obtain the "key dimensions" (I just picked out a few from Table 4.3 of the specification, v1.0.43):
Quote:
Quote:
Jawed
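As a rough illustration of why the fixed capacity matters for this post's 16KB versus 32KB question, here is a sketch that sizes a square tile from a queried local memory size. The dictionaries stand in for the result of an OpenCL device query; the key is the real `CL_DEVICE_LOCAL_MEM_SIZE` parameter name, but the devices, the halo width and the helper function are hypothetical.

```python
import math

# Hypothetical devices. The key is a clGetDeviceInfo parameter name;
# the values are illustrative, not measured.
DEVICE_16K = {"CL_DEVICE_LOCAL_MEM_SIZE": 16 * 1024}
DEVICE_32K = {"CL_DEVICE_LOCAL_MEM_SIZE": 32 * 1024}


def largest_square_tile(device: dict, bytes_per_element: int = 4, halo: int = 1) -> int:
    """Largest tile edge t such that a (t + 2*halo) x (t + 2*halo) block fits in local memory."""
    elements = device["CL_DEVICE_LOCAL_MEM_SIZE"] // bytes_per_element
    padded_edge = math.isqrt(elements)  # largest edge whose square still fits
    return max(0, padded_edge - 2 * halo)


for name, device in (("16KB", DEVICE_16K), ("32KB", DEVICE_32K)):
    print(name, "-> tile edge", largest_square_tile(device))
```

Doubling the capacity only grows the tile edge by roughly a factor of the square root of two, so a kernel tuned for one size has to be re-derived for the other, which is the 640KB-style coupling complained about above.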
__________________
Can it play WoW? |
|||
|
|
|
|
|
#118 | |
|
Regular
|
Quote:
Jawed
__________________
Can it play WoW? |
|
|
|
|
|
|
#119 | |
|
Naughty Boy!
|
Quote:
So theoretically nVidia *could* have implemented it this way, but I wouldn't be surprised if they didn't. In fact, I'd be surprised if they did. Either way, it demonstrates the point that by adding extra 'general purpose' features and instructions, you can expand the usefulness of the processing core, and use it for both new graphics and non-graphics applications. So that would indicate a trend that we're already moving away from a 'hardwired' or 'optimized' Direct3D implementation to something more general purpose, which happens to also work okay for Direct3D.
__________________
ZX81 -> C64 -> Hercules -> Plantronics CGA -> Paradise VGA -> Amiga ECS -> Amiga AGA -> Cirrus Logic 5428 VLB -> S3 Trio64 -> Matrox Mystique -> PCX2 -> Matrox G200 -> Matrox G450 -> GeForce2 GTS -> Kyro II -> Radeon 8500 -> Radeon 9600XT -> GeForce 7600GT -> GeForce 8800GTS -> HD5770 |
|
|
|
|
|
|
#120 | |
|
Now Officially a Top 10 Poster
Join Date: May 2006
Location: Maastricht, The Netherlands
Posts: 12,879
|
Quote:
Not sure we'll see it even in any chip at all. Talking about fast software renderers, I have a completely new idea for setting up data and rendering it. Since I have a relatively poor background in rendering I'm not sure how new it really is, and I'm wondering if I should just discuss it here or file a patent immediately. Probably just discussing it here would be good, or in a new thread. |
|
|
|
|
|
|
#121 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
The core is running at clock, but it's roughly half as wide as the dual-issue Pentium when running general code.
It was indicated in Intel slides (there was some additional reinforcement of my interpretation in this forum) that the core has a scalar x86 pipe and the vector pipe, and absent a vector instruction issue plain x86 is not going to take up the full width. The performance jump from single to dual issue is close enough to doubling that I rounded it off.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#122 |
|
Now Officially a Top 10 Poster
Join Date: May 2006
Location: Maastricht, The Netherlands
Posts: 12,879
|
Ah I see, fair enough.
|
|
|
|
|
|
#123 | |
|
Regular
|
Quote:
http://forum.beyond3d.com/showpost.p...&postcount=918 That's all I have. Feel free to suggest why graphics would leave shared memory unused. Jawed
__________________
Can it play WoW? |
|
|
|
|
|
|
#124 | ||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Quote:
There is a solution for having to keep lots of data on chip. That's why I pointed back to NV40. Clearly they've improved on that architecture since. What changed is that strands are processed faster when you decouple the ALUs from the texture samplers. On NV40 each instruction in a single strand had the same latency as the entire pipeline length. Nowadays it's a mix of texture sampling latencies, branch latencies, and ALU latencies. This reduced the number of strands in flight, allowing more registers per strand.

But we can do even better. That's where caches, speculation and out-of-order execution come in. Caches reduce the average memory access latency, speculation reduces average branch latency, and out-of-order execution reduces latencies due to instruction order. So all of them help process a strand much faster, so we don't have to store so many contexts or reduce throughput when we run out of registers (slide 7).

Of course the burning question is: do we really need to do even better than today's GPU architectures? In my opinion, yes. Shaders are becoming more complex every day. And to support things like deep function calls or recursion a large stack is required (CPUs typically have around 1 MB of stack space). Even though 1 MB won't be needed any time soon, I still expect a spectacular increase in the temporary data a strand needs to store. Any architecture not able to cope with that will be left behind. So I wouldn't instantly exclude something like speculation as (part of) the solution. Maybe it shouldn't be implemented as aggressively as on CPUs, and maybe there are better ways to spend your transistors, but it's not adding ever more registers.

Oh and if you're worried about power consumption, there has been some interesting research about branch prediction fidelity. When it's lower than a threshold, instruction fetch is halted till the branch is resolved. By choosing the right threshold a good balance between power consumption and utilization can be obtained, while keeping the number of strands in check. Last edited by Nick; 13-Aug-2009 at 12:16. |
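The threshold idea mentioned at the end of this post can be sketched as a toy model. This is not taken from the research referred to; it assumes a simple confidence estimate (the streak of correct predictions per branch address) and a made-up threshold, just to show where the power/utilization knob sits.

```python
from dataclasses import dataclass, field


@dataclass
class ConfidenceGate:
    """Toy fetch gate: speculate past a branch only if its recent predictions were reliable."""
    threshold: int = 2   # hypothetical: minimum streak of correct predictions to speculate
    max_count: int = 3   # streak counter saturates here
    streak: dict = field(default_factory=dict)  # branch address -> current streak

    def should_speculate(self, branch_pc: int) -> bool:
        # Low confidence means fetch stalls until the branch resolves.
        return self.streak.get(branch_pc, 0) >= self.threshold

    def update(self, branch_pc: int, prediction_was_correct: bool) -> None:
        if prediction_was_correct:
            self.streak[branch_pc] = min(self.max_count, self.streak.get(branch_pc, 0) + 1)
        else:
            self.streak[branch_pc] = 0  # one mispredict resets confidence


gate = ConfidenceGate()
gate.update(0x40, True)
gate.update(0x40, True)
print(gate.should_speculate(0x40))  # enough correct predictions in a row to fetch ahead
```

Raising `threshold` stalls fetch more often, trading utilization for less wasted work; lowering it moves back toward always speculating.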
||
|
|
|
|
|
#125 |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Yes. I believe they should have made it 2 x SSE, instead of SSE with 256-bit registers. Some instructions only operate on the lower 128 bits. Although this can be corrected in the future, it means we'll have to write yet another code path for AVX2. These sorts of unnecessary complications slow down the adoption of new ISA extensions.
Translating existing code from SSE to AVX should have been straightforward: use the corresponding AVX instructions and process twice the data in parallel. But because some instructions don't have a 256-bit equivalent yet (or should I say 2 x 128-bit), it can be a bit more complicated. Of course they did this to save a few transistors by not having to place two equivalent SSE units in parallel, but that's a mistake in my book. They could have foreseen having to do that at a later point anyway. I wouldn't care if some 256-bit instructions had an extra cycle of latency because they're executed on 128-bit units. At least that preserves binary compatibility, and just like when they widened SSE execution units from 64-bit to 128-bit it would seamlessly increase performance of existing executables on newer hardware! Same thing for scatter/gather. They should have already added it to the ISA. It doesn't matter if the first implementation uses multiple cycles to read/write each element individually. People would already be able to use the instructions and performance would increase from one CPU generation to the next. This would also allow Intel to assess how many transistors they should throw at it, using real world applications. The problem of collecting data elements from different memory locations is only going to get bigger, so again they could have foreseen that and added the instructions early on. Last but not least, by specifying foreseeable ISA extensions and sharing it with the competition, they can speed up the adoption and just battle things out at the hardware level. There are no winners when they add different ISA extensions. Developers just use the lowest common denominator while hardware manufacturers waste precious transistors and R&D on unused features. |
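The two arguments in this post, 256-bit operations executed on 128-bit units and scatter/gather that may initially handle one element at a time, come down to keeping the semantics fixed while the hardware underneath changes. A minimal sketch of those semantics follows; the function names are illustrative, and plain lists stand in for vector registers and memory.

```python
def add_128(a: list, b: list) -> list:
    """One 128-bit operation: four 32-bit lanes added in parallel."""
    assert len(a) == len(b) == 4
    return [x + y for x, y in zip(a, b)]


def add_256_on_128_units(a: list, b: list) -> list:
    """A 256-bit add issued as two 128-bit halves: same result, possibly an extra cycle."""
    assert len(a) == len(b) == 8
    return add_128(a[:4], b[:4]) + add_128(a[4:], b[4:])


def gather(memory: list, indices: list) -> list:
    """Gather semantics, one element per step as a first implementation might do it."""
    return [memory[i] for i in indices]


def scatter(memory: list, indices: list, values: list) -> None:
    """Scatter semantics: later elements overwrite earlier ones on duplicate indices."""
    for i, v in zip(indices, values):
        memory[i] = v


mem = [10, 20, 30, 40]
print(gather(mem, [3, 0, 0]))
scatter(mem, [1, 2], [21, 31])
print(mem)
```

Code written against these definitions would not need to change if a later implementation handled all eight lanes, or all gathered elements, in a single step; only the cycle count would.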
|
|
|
|
|