NVIDIA shows signs ... [2008 - 2017]

1. Shared Memory (yes it's used to hold operands for interpolation but that's not its primary purpose)
KB of shared memory should be used quite regularly, I expect. Worst case, 24 warps, each with one triangle, each with 3 vertices, each with 16 vec4 attributes = 18KB.
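For anyone checking the arithmetic, here's a minimal reading of it (assuming a vec4 attribute is 4 x 32-bit floats, i.e. 16 bytes - the names below are just for illustration):

[CODE]
// Worst-case shared memory for interpolation operands, per multiprocessor,
// under the assumptions stated above.
enum {
    WARPS      = 24,  // warps resident per multiprocessor
    VERTICES   = 3,   // one triangle per warp
    ATTRIBUTES = 16,  // vec4 attributes per vertex
    VEC4_BYTES = 16   // 4 x 32-bit float
};
static_assert(WARPS * VERTICES * ATTRIBUTES * VEC4_BYTES == 18432,
              "24 * 3 * 16 * 16 bytes = 18432 bytes = 18KB");
[/CODE]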

2. Going scalar, and the extra overhead that comes with it. We've seen in the past few days examples of CS code that required explicit vectorization to take full advantage of VLIW hardware. At the same time you've put in a lot of work demonstrating VLIW's higher efficiency in game shaders.
NLM Denoise has also shown vectorisation (of PS code - though the CS code should benefit equally) is worth doing, not only on ATI but on NVidia - but that's because it's memory bound.
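To make the vectorisation point concrete, here's a minimal CUDA-flavoured sketch (nothing to do with the actual NLM Denoise kernel, and all names are made up) of the same memory-bound pass written scalar and explicitly vectorised - the float4 version issues a quarter of the load/store instructions for the same bytes, which is where the win comes from on both vendors:

[CODE]
// Hypothetical scalar version: one 4-byte load and one 4-byte store per thread.
__global__ void scale_scalar(const float* in, float* out, float w, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = w * in[i];
}

// Explicitly vectorised version: one 16-byte load and store per thread.
__global__ void scale_vec4(const float4* in, float4* out, float w, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];
        out[i] = make_float4(w * v.x, w * v.y, w * v.z, w * v.w);
    }
}
[/CODE]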

I'm convinced NVidia thought it was more efficient/easier than building another VLIW - after all there's a widely held view that VLIWs are doomed to fail because compilers are terrible and absolute utilisation or utilisation per mm² is the pits. ATI's compiler still needs work. Then again, NVidia's compiler still has issues with things like 3DMark Vantage feature tests (hardly the most innocuous application) bouncing down and up in performance with driver revisions.

In my view NVidia made a bet that it could build a more area-/power-efficient "scalar" architecture than VLIW - it still had to be considerably faster at graphics than G71 - NVidia's consistent on the subject of scaling graphics performance for each generation. I don't know if NVidia still thinks pixel shaders are tending towards scalar, but that was the "marketing" at launch. Generally a lot of NVidia's "marketing" has been about the inherent utilisation benefits of scalar, including graphics. Scalar freed NVidia and graphics programmers from the worries about "coding for an architecture" - everything "just works, optimally". Of course one only finds out about the officially recognised failings in utilisation with each new GPU (i.e. the increased register file in GT200, the revamp of Fermi...). Anyway, NVidia appeared to think it could only win with scalar in terms of efficiency, however it's measured and regardless of workload.

Fermi appears to indicate NVidia has realised the wastefulness of G80 style per-thread out-of-order instruction issue. But such fine-grained scheduling appears to go hand-in-hand with the texturing architecture, which raises questions on how texturing (or load/store) instructions are going to be scheduled in Fermi - i.e. how the latency-hiding is scheduled/tracked. Fermi might be relaxed about this solely because L1 is huge in comparison with previous architectures - but I'm dubious because texturing, historically, has been very happy with very small caches. Bit of a puzzler that.

Maybe the benefits of out-of-order per-thread scheduling tail-off as register file size increases (i.e. as count of threads in flight increases), so it was always going to be an interim solution.

Jawed
 
KB of shared memory should be used quite regularly, I expect. Worst case, 24 warps, each with one triangle, each with 3 vertices, each with 16 vec4 attributes = 18KB.
And making it available to CUDA developers was free and easy? We see how well that worked out for the LDS on AMD hardware.
I'm convinced NVidia thought it was more efficient/easier than building another VLIW - after all there's a widely held view that VLIWs are doomed to fail because compilers are terrible and absolute utilisation or utilisation per mm² is the pits. ATI's compiler still needs work. Then again, NVidia's compiler still has issues with things like 3DMark Vantage feature tests (hardly the most innocuous application) bouncing down and up in performance with driver revisions.
Convinced by what? NV40 already took steps toward VLIW and there is nothing indicating that Nvidia thought they weren't competent enough to build one, especially considering VLIW hardware is implicitly simpler.
In my view NVidia made a bet that it could build a more area-/power-efficient "scalar" architecture than VLIW
That doesn't make sense, VLIW by definition is more area-efficient for a given number of (theoretical) flops so why would they believe that? In terms of measured efficiency see G92.
Generally a lot of NVidia's "marketing" has been about the inherent utilisation benefits of scalar, including graphics. Scalar freed NVidia and graphics programmers from the worries about "coding for an architecture" - everything "just works, optimally".
Hence, why the G8x design was indicative of a move towards accommodating the needs of general computation. Fact is that it requires less "coding for an architecture" than the alternatives, so setting some arbitrary benchmark of developer nirvana upon which to base your constant criticisms is pretty silly.
Of course one only finds out about the officially recognised failings in utilisation with each new GPU (i.e. the increased register file in GT200, the revamp of Fermi...). Anyway, NVidia appeared to think it could only win with scalar in terms of efficiency, however it's measured and regardless of workload.
I see, so improvements in later chips are now indicative of "failings" in predecessors? The fact that those chips were competitive with far lower theoretical flop numbers should be evidence enough for you. So how do you figure that they remained competitive with far lower theoretical numbers and "failings in utilization" to boot? Magic?
Fermi appears to indicate NVidia has realised the wastefulness of G80 style per-thread out-of-order instruction issue.
What gives you that impression that (1) G80 was wasteful in this regard and (2) that Fermi is any different?
 
And making it available to CUDA developers was free and easy? We see how well that worked out for the LDS on AMD hardware.

LDS is kind of a texture cache on rv770. Dunno about r8xx. That's why it has pathetic performance when used as a cuda style shared memory. :rolleyes:

That doesn't make sense, VLIW by definition is more area-efficient for a given number of (theoretical) flops so why would they believe that? In terms of measured efficiency see G92.

What makes you think VLIW is more area efficient for GPUs? :oops: VLIW uses area and power to make an individual thread run fast. In GPUs, only the total running time for all threads matters. So why bother with spending area to speed up a single thread? Every unused instruction slot wastes power and area. And compilers will never be perfect. VLIW is not worth the area and power for GPUs, at least in the near future. 10 years from now, who knows?

Hence, why the G8x design was indicative of a move towards accommodating the needs of general computation. Fact is that it requires less "coding for an architecture" than the alternatives, so setting some arbitrary benchmark of developer nirvana upon which to base your constant criticisms is pretty silly.

I am with Jawed here. Scalar code is easier to write, and vec4 computes won't always be around (ie in all workloads) to save the VLIW's efficiency. Also, it locks ATI at a 4:1 scalar:sfu ratio (at least apparently), while nv is going the other way.

I see, so improvements in later chips are now indicative of "failings" in predecessors? The fact that those chips were competitive with far lower theoretical flop numbers should be evidence enough for you.

Maybe they were competitive due to efficiencies/brute force in other regards. I am of the opinion that out of lrb, fermi and r8xx, r8xx is the most wasteful due to its vliw nature. Maybe next only to cell's spe's.
 
What makes you think VLIW is more area efficient for GPUs? :oops: VLIW uses area and power to make an individual thread run fast.
What mechanism would cause this?
VLIW takes the responsibility for dependence checking away from the hardware and puts it on the compiler.
The cost is in packing inefficiencies and code bloat. The hardware is permitted to become simpler, unless you think operand collectors and per-instruction status tracking comes cheap.
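As an illustration of that packing cost (hypothetical 5-slot bundle annotation, not any real ISA), a serially dependent chain of MADs leaves most of each bundle empty - which is exactly the bloat being described:

[CODE]
// Hypothetical 5-slot VLIW packing of a serially dependent chain; the
// bundle annotations are illustrative, not a real instruction encoding.
float chain(float x)
{
    float a = x * x + 1.0f;   // bundle 0: [MAD, ---, ---, ---, ---]
    float b = a * x + 2.0f;   // bundle 1: [MAD, ---, ---, ---, ---]  needs a
    float c = b * x + 3.0f;   // bundle 2: [MAD, ---, ---, ---, ---]  needs b
    return c;
}
[/CODE]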


I am with Jawed here. Scalar code is easier to write, and vec4 computes won't always be around (ie in all workloads) to save the VLIW's efficiency. Also, it locks ATI at a 4:1 scalar:sfu ratio (at least apparently), while nv is going the other way.
I'm not sure how bound AMD would be to a fixed ratio in the future. At least for games, the compiler is going to rework packing to match the internal structure.
 
What makes you think VLIW is more area efficient for GPUs? :oops: VLIW uses area and power to make an individual thread run fast. In GPUs, only the total running time for all threads matters. So why bother with spending area to speed up a single thread? Every unused instruction slot wastes power and area. And compilers will never be perfect. VLIW is not worth the area and power for GPUs, at least in the near future. 10 years from now, who knows?

VLIW is more area efficient in terms of theoretical flops / mm^2. I don't think you understood my post as I also highlighted the relative efficiencies of Nvidia's scalar approach.

I am with Jawed here. Scalar code is easier to write, and vec4 computes won't always be around (ie in all workloads) to save the VLIW's efficiency.

Heh, then I don't think you understand what Jawed is saying....as usual he was mocking claims such as you are making of scalar architectures being easier on developers.
 
What mechanism would cause this?
VLIW takes the responsibility for dependence checking away from the hardware and puts it on the compiler.
The cost is in packing inefficiencies and code bloat. The hardware is permitted to become simpler, unless you think operand collectors and per-instruction status tracking comes cheap.

Why bother with any ILP extraction (in sw, aka vliw or hw aka OoOE) in the first place? Just let the damn thing be scalar, in order. That will make it more area efficient still for GPU workloads.

You are comparing hw efficiency of extracting ILP in hw or in sw. I am comparing the benefits/costs of bothering with ILP in first place.


I'm not sure how bound AMD would be to a fixed ratio in the future. At least for games, the compiler is going to rework packing to match the internal structure.

That's possible, I feel that it will likely be accompanied with a big break in terms of their ALU architecture.
 
Why bother with any ILP extraction (in sw, aka vliw or hw aka OoOE) in the first place? Just let the damn thing be scalar, in order. That will make it more area efficient still for GPU workloads.

You are comparing hw efficiency of extracting ILP in hw or in sw. I am comparing the benefits/costs of bothering with ILP in first place.
Scalar isn't an easy win unless the code you're running has lots of scalar dependencies. Let's take a hypothetical vec4 architecture vs. a pure scalar one. Here I will refer to 1 vec4 ALU and 1 scalar ALU as being 1 unit in each architecture. In order for simple operations like "modulate texture color with vertex color" to go at the same rate on the two machines, the scalar architecture needs 4x the units, 4x the clock speed or a combination thereof. An increase in units means you need a more complex scheduler and an increase in clock speed means you need a more complex design. Also note that going wider means that the scalar architecture would have more latency per pixel. In either case, the peak ALU rate for the scalar architecture is likely to be lower than the vec4 architecture since it's unlikely that you could pack 4x the scalar units (by scaling the number of ALUs and/or clockspeed) given the costs elsewhere.
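For concreteness, the "modulate" example above is just a 4-component multiply (a rough sketch, all names made up):

[CODE]
// "Modulate texture colour with vertex colour" is four independent multiplies.
// A vec4 ALU issues them as one instruction; a scalar ALU issues four, so it
// needs 4x the units and/or clock to match the per-pixel rate.
struct colour { float r, g, b, a; };

colour modulate(colour tex, colour vtx)
{
    colour out;
    out.r = tex.r * vtx.r;
    out.g = tex.g * vtx.g;
    out.b = tex.b * vtx.b;
    out.a = tex.a * vtx.a;
    return out;
}
[/CODE]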

This doesn't make scalar bad, but it's not as simple as some people think.

-FUDie

P.S. Intel and AMD came up with MMX and SSE for a reason, right?
 
And making it available to CUDA developers was free and easy?
NVidia needs this functionality for graphics. Shared memory of some kind is implicit in NV40, as NV40 also does on-demand interpolation (though I suppose it's possible to do this by duplicating all vertex attributes and barycentrics across all fragments). So I interpret this as an evolution of the NV40 approach that suits the unified architecture (where one triangle might be shaded by tens of multiprocessors - i.e. each multiprocessor needs a local copy of all the triangle data).
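A speculative sketch of what that per-multiprocessor copy might look like (CUDA-flavoured, with made-up names and plane-equation attributes rather than barycentrics - not a claim about NVidia's actual implementation):

[CODE]
// Each block stages its triangle's attribute plane equations in shared memory,
// then every fragment interpolates on demand: attribute(x, y) = a0 + ax*x + ay*y.
struct plane_eq { float4 a0, ax, ay; };

__device__ float4 interpolate(const plane_eq& p, float x, float y)
{
    return make_float4(p.a0.x + p.ax.x * x + p.ay.x * y,
                       p.a0.y + p.ax.y * x + p.ay.y * y,
                       p.a0.z + p.ax.z * x + p.ay.z * y,
                       p.a0.w + p.ax.w * x + p.ay.w * y);
}

__global__ void shade(const plane_eq* tris, float4* out, int attribs)
{
    __shared__ plane_eq local[16];                 // up to 16 vec4 attributes
    if (threadIdx.x < attribs)
        local[threadIdx.x] = tris[blockIdx.x * attribs + threadIdx.x];
    __syncthreads();

    // Fake a fragment position from the thread index, purely for the sketch.
    float x = (float)(threadIdx.x % 8);
    float y = (float)(threadIdx.x / 8);
    out[blockIdx.x * blockDim.x + threadIdx.x] = interpolate(local[0], x, y);
}
[/CODE]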

We see how well that worked out for the LDS on AMD hardware.
You might just as well point the finger at the whole of CAL/IL/Brook+. LDS was added solely for compute.
Convinced by what? NV40 already took steps toward VLIW and there is nothing indicating that Nvidia thought they weren't competent enough to build one, especially considering VLIW hardware is implicitly simpler.
NV30 was VLIW, NV40 didn't need to take any steps.

That doesn't make sense, VLIW by definition is more area-efficient for a given number of (theoretical) flops so why would they believe that?
NVidia was, seemingly, driven by realisable FLOPs, not theoretical FLOPs.


Seemingly to the extent of killing absolute realisable FLOPs per mm².

In terms of measured efficiency see G92.
Hindsight's great, isn't it? It seems that NVidia's 40nm chips can't even exceed the performance of 55nm ATI chips - that's not solely because of a lack of GFLOPs.

Hence, why the G8x design was indicative of a move towards accommodating the needs of general computation.
Indicative of a move in software terms, yes. The hardware was built for graphics. The irony about this is that it's the inefficiencies of the fixed function hardware that make NVidia's design so large. The GPU size, for "equivalent realisable" FLOPs in vector form in NVidia's design might only have been 10% smaller (that's total GPU size, not mm² per FLOP).

Fact is that it requires less "coding for an architecture" than the alternatives, so setting some arbitrary benchmark of developer nirvana upon which to base your constant criticisms is pretty silly.
What arbitrary benchmark?

However "inefficiently" the code runs on VLIW, it still runs. Optimising for an architecture is more than just optimising for the ALUs. My beef is with a comparison that starts and ends with the "scalar has perfect utilisation" mantra. Optimising for VLIW is harder - better hope the compiler is decent. Optimising for memory hierarchies and latencies normally means some kind of vectorisation/unrolling.

I see, so improvements in later chips are now indicative of "failings" in predecessors?
Regardless, whether they're obvious (such as the register file being too small) or when they're not obvious (some instructions in MI block instruction-issue for the MAD and DP-MAD ALUs - I'm still not sure which instructions in MI do this), they're still failings. Utilisation isn't 100% on these older GPUs (like the scalar mantra likes to pretend it is). We knew it, but it's interesting to see what design changes are brought to bear.

You can judge how important those failings were separately (e.g. those failings aren't important for game shaders from 2005). Some things, like dynamic branching in NV40, are just an abortion.

The fact that those chips were competitive with far lower theoretical flop numbers should be evidence enough for you. So how do you figure that they remained competitive with far lower theoretical numbers and "failings in utilization" to boot? Magic?
FLOPs pretty-much always seem to fall victim to some other bottleneck: rasterisation rate, texture rate, fillrate, bandwidth etc.

What gives you that impression that (1) G80 was wasteful in this regard and (2) that Fermi is any different?
Fermi issues purely in-order from each warp. Until we know more about GF100's design we won't know how realisable it was in G80's timeframe. We'll also probably never really know how wasteful G80's approach was. Fermi, Larrabee and R800 all being in-order is a bit of a clue though, don't you think?

Jawed
 
Why bother with any ILP extraction (in sw, aka vliw or hw aka OoOE) in the first place? Just let the damn thing be scalar, in order. That will make it more area efficient still for GPU workloads.
What is the exact implementation you imagine in this instance? This can go either way without knowing details.
Do we want the same unit count, and how independent are these scalar lanes and how are we dividing the register file?
 
What is the exact implementation you imagine in this instance? This can go either way without knowing details.
Do we want the same unit count, and how independent are these scalar lanes and how are we dividing the register file?

I want scalar like nv or lrb. The advantage it has is that it removes one dimension of parallelism that you would otherwise bother with.
 
We are talking of scalar vs vliw in GPUs, not CPUs.
Wow good answer! :rolleyes: Good work dodging the rest of my post.

If it's good for CPUs, which don't do graphics, why wouldn't it be good for GPUs which do do graphics? Many graphics operations are vector operations and map quite well to VLIW implementations.

-FUDie
 
Good work dodging the rest of my post.

Your arguments do not make sense in the present context though they have their merits.

GPUs already have 64-wide logical SIMD (in r8xx); compared to that, SSE is soo-scalar. We are talking of AMD GPUs' need for vector ILP to maximize throughput, which nv GPUs don't need (not counting their RAW latency), even though they are 32-wide logical SIMD.

If it's good for CPUs, which don't do graphics, why wouldn't it be good for GPUs which do do graphics? Many graphics operations are vector operations and map quite well to VLIW implementations.

Except that GPUs' evolution isn't about graphics anymore. They have to keep in mind the needs of gpgpu folks as well.

As for VLIW's merits in a massively parallel xPU, try finding ILP in Hoshen Kopelman algorithm (for cubic-like lattices). :p

In other areas too, sometimes finding ILP is pretty ugly and needs twisted data layouts.
 
LDS is kind of a texture cache on rv770.
LDS is just a block of memory local to a core.

Dunno about r8xx.
Still have no idea whether LDS in R800 is any different.

That's why it has pathetic performance when used as a cuda style shared memory. :rolleyes:
You have a range of benchmarks? Since AMD never bothered to fix the CAL/Brook+ implementation, I've got no idea what the performance is like under D3D-CS or OpenCL, for either GPU.

What makes you think VLIW is more area efficient for GPUs? :oops: VLIW uses area and power to make an individual thread run fast. In GPUs, only the total running time for all threads matters. So why bother with spending area to speed up a single thread? Every unused instruction slot wastes power and area. And compilers will never be perfect. VLIW is not worth the area and power for GPUs, at least in the near future. 10 years from now, who knows?
VLIW in this case costs ~15% the area of NVidia's design per MAD :

http://forum.beyond3d.com/showpost.php?p=1321966&postcount=1815

and serially-dependent MADs run at 37% of the absolute performance:

http://forum.beyond3d.com/showpost.php?p=1263711&postcount=498

and in absolute performance in games VLIW suffers no meaningful deficit:

http://forum.beyond3d.com/showpost.php?p=1220350&postcount=27

though it's not possible to discern the effect on framerate from that data as framerate is subject to lots of other factors too.

I am with Jawed here. Scalar code is easier to write, and vec4 computes won't always be around (ie in all workloads) to save the VLIW's efficiency.
Scalar code is easier to write. At the other extreme I doubt there's anyone who wants to write ATI VLIW assembly, that would be horrendous - writing D3D assembly is bad enough.

You don't need vec4 computes to maximise a VLIW's efficiency, scalar operations often have implicit parallelism.

http://forum.beyond3d.com/showthread.php?t=53170
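To illustrate the implicit-parallelism point (hypothetical bundle packing again, made-up shader-ish code), even "scalar" code usually offers several independent operations for the compiler to co-issue:

[CODE]
#include <math.h>

// The first three statements are mutually independent, so a VLIW compiler can
// pack them together; only the final line has to wait on all of them.
float lighting(float n_dot_l, float n_dot_h, float spec_pow, float atten)
{
    float diff    = fmaxf(n_dot_l, 0.0f);                  // independent
    float spec    = powf(fmaxf(n_dot_h, 0.0f), spec_pow);  // independent
    float falloff = atten * atten;                         // independent
    return diff * falloff + spec * falloff;                // depends on all three
}
[/CODE]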

Also, it locks ATI at 4:1 scalar:sfu ratio (atleast apparently), while nv is going the other way.
You're still stuck with that bizarre misapprehension about scalar:sfu.

Maybe they were competitive due to efficiencies/brute force in other regards. I am of the opinion that out of lrb, fermi and r8xx, r8xx is the most wasteful due to its vliw nature. Maybe next only to cell's spe's.
No doubt, VLIW is brute force, just like SIMD is brute force. RV870 really pushes it with 64-way 5-way VLIW versus Larrabee's 16-way scalar.

Jawed
 
LDS is just a block of memory local to a core.

http://gpgpu.org/wp/wp-content/uploads/2009/09/E1-OpenCL-Architecture.pdf, slide 8 suggests the texturing element to me.

You have a range of benchmarks? Since AMD never bothered to fix the CAL/Brook+ implementation, I've got no idea what the performance is like under D3D-CS or OpenCL, for either GPU.

I think it was you who pointed out that blas3 is slower on r7xx with shared memory than with texturing.

VLIW in this case costs ~15% the area of NVidia's design per MAD :

http://forum.beyond3d.com/showpost.php?p=1321966&postcount=1815

I only see a "per single-precision MAD - 0.626mm² v 0.095mm² - 659%".
Sorry, I missed the 15% bit, no sarcasm intended.

(on a side note, considering their present flop density, they'll prolly keep using vliw in the near future at least)
and serially-dependent MADs run at 37% of the absolute performance:
http://forum.beyond3d.com/showpost.php?p=1263711&postcount=498

From that post,
37% of the absolute performance of GTX285.

Shouldn't it be measured against peak perf of 4870?
Worst case, AMD's ALUs are 76% bigger than NVidia's when running serial scalar code. Most of the time they're effectively 50% of the size in terms of performance per mm2.

It suggests amd suffers a 3.52x (1.76/0.5) perf hit in no-ILP code. Better than 5x, sure, but not sure if it's worth the area/power.
and in absolute performance in games VLIW suffers no meaningful deficit:

http://forum.beyond3d.com/showpost.php?p=1220350&postcount=27

In games, of course, there are lots of geometry-oriented operations, so vliw does fine.
At the other extreme I doubt there's anyone who wants to write ATI VLIW assembly,
prunedtree :smile:

You don't need vec4 computes to maximise a VLIW's efficiency, scalar operations often have implicit parallelism.

http://forum.beyond3d.com/showthread.php?t=53170

Yeah they do, that's why every CPU since the Pentium has been chasing ILP. ATM, I am not sure of VLIW's future in GPUs. It's just that 5-way VLIW seems a bridge too far for more gp workloads.

You're still stuck with that bizarre misapprehension about scalar:sfu.

I might have missed the finer points of amd's t unit. Care to clarify?
 
Texture fetches and LDS operations are mutually exclusive (LDS operations happen under the TEX control flow instruction), and LDS probably shares the same register file read/write cycles as texture-address/texture-data - but there's nothing "texturing" about it.

I think it was you who pointed out that blas3 is slower on r7xx with shared memory than with texturing.
That was AMD's implementation. AMD's implementation of matrix multiply (texture-cache optimised) is also slower than prunedtree's. Vasily Volkov and I discussed the possibilities (all his ideas) of improving the shared memory implementation - but the compiler was being naive (the unnecessarily-slow reads) and he abandoned it.
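For reference, the generic "CUDA-style shared memory" formulation being argued about is the textbook tile scheme below (a sketch, not AMD's or prunedtree's code; it assumes the matrix dimension is a multiple of the tile size and a matching grid/block shape):

[CODE]
// Each block stages TILE x TILE tiles of A and B in shared/local memory and
// every thread accumulates one element of C from them. Assumes n % TILE == 0
// and an (n/TILE, n/TILE) grid of (TILE, TILE) blocks.
#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
[/CODE]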

In the meantime AMD focused on OpenCL instead. Which relies upon IL. Which implies AMD will have sorted the compilation of IL LDS instructions. Or will do. The same applies to Direct Compute, I expect. AMD has no real intention of providing public support for IL now, as far as I can tell. AMD's making seemingly contradictory statements about the capability of LDS in R700 (can it or can't it support absolute addressed writes?). Maybe some brave soul will have a go at it under CS or OpenCL.

I thought you might have encountered some other work, away from CAL/IL/Brook+...

I wonder if LDS is used for holding attributes/barycentrics for on-demand interpolation in R800.

I only see a [quoted text] Sorry, I missed the 15% bit, no sarcasm intended.
:???: 15% is the "reciprocal" of 659%.

(on a side note, considering their present flop density, they'll prolly keep using vliw in the near future at least)
If you implement a transcendental ALU, you should implement other ALU capabilities to defray the large cost of it plus the cost of instruction decoding etc. ATI has several more MADs (4). NVidia in GT200 has SP-MAD and DP-MAD ALUs. Fermi changes this to a pair of general ALUs (SP, DP, Int).

Larrabee puts RCP and RSQRT in every lane of the vector unit. All the other transcendentals are evaluated the long way, it seems. So Intel kept to a pure scalar vector unit. Which is by far the easiest to compile for. Easier than NVidia's approach. Though by deleting MUL from the SFU NVidia's simplified compilation for Fermi. But Fermi's per warp in-order instruction-issue returns some work to the compiler I guess.
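As a rough sketch of what "the long way" with plain MULs/MADs looks like - just the standard Newton-Raphson refinements of an approximate per-lane RCP/RSQRT, not Larrabee code, and the seed arguments are assumed to come from whatever estimate instruction the hardware provides:

[CODE]
// One Newton-Raphson step roughly doubles the number of correct bits of an
// estimate; full-precision divide/sqrt (and, further out, the polynomial
// evaluation of other transcendentals) is built the same way from MADs.
float rcp_refine(float x, float est)      // est ~ 1/x from a hardware estimate
{
    return est * (2.0f - x * est);        // y' = y*(2 - x*y)
}

float rsqrt_refine(float x, float est)    // est ~ 1/sqrt(x) from a hardware estimate
{
    return est * (1.5f - 0.5f * x * est * est);  // y' = y*(1.5 - 0.5*x*y*y)
}
[/CODE]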

From that post,

Shouldn't it be measured against peak perf of 4870?
I don't think there's much point. If it isn't fast enough it isn't fast enough, buy the NVidia :p

It suggests amd suffers a 3.25x (1.76/0.5) perf hit in no-ILP code. Better than 5x, sure, not sure if it worth the area/power.
The Hoshen-Kopelman algorithm you mentioned earlier, at first glance, appears to be fetch/bandwidth limited on a GPU. I suppose the additional analysis you might run alongside the generation of the clusters would up the arithmetic intensity. I couldn't find anyone who's written-up a GPU implementation of it.

When kernels on ATI run notably slower than other processors AMD will be under pressure to compete. Whether it's because of the 64-wide threads, low VLIW utilisation, fragmented cache architecture, poor LDS bandwidth or general bandwidth constraints. AMD might change it radically before we ever really find out, since the penetration of ATI in compute is so low. Fermi moves the goalposts as well, of course.

But I reckon competing with Larrabee is much more interesting than competing with Fermi, because AMD has to compete on x86, GPU and x86/GPU fronts.

Double-precision could be interesting. DP is scalar/vec2 on ATI, effectively, making worst-case VLIW utilisation less glaring. But AMD doesn't seem to want to chase the HPC crowd.

prunedtree :smile:
My understanding is he didn't write any ATI assembly.

Yeah they do, that's why every CPU since the Pentium has been chasing ILP. ATM, I am not sure of VLIW's future in GPUs. It's just that 5-way VLIW seems a bridge too far for more gp workloads.
If you build a MAD+MAD/transcendental core, what's the incremental cost for another MAD? Bearing in mind that apart from the total cost of the core (including texturing and caches) the cores are still only 30-40% of the die. And bearing in mind that on graphics workloads utilisation is easily 60% and typically 70-80%.

(I should point out that I suspect there might be some core-shared scheduling hardware in ATI that's outside of the clearly defined cores on the die picture. That should be included in the area of the cores. But I don't know where it is or how big it is.)

I might have missed the finer points of amd's t unit. Care to clarify?
See prunedtree's thread. It gets used quite heavily (20-50% of non-graphics code? More?) and it isn't just for transcendentals (5-10% of non-graphics code?), which makes it cheaper than a pure transcendental ALU. Its utility in graphics should be even higher.

Jawed
 
Even if it's dynamic, it still implies that instructions could be issued from a variable number of threads/work-items per clock. That sounds like a further step beyond the already ominous DWF. And doesn't that completely break VLIW anyway?
Defined at compile time; the programmer may not be sure of the SIMD width when writing, but it won't be dynamic at runtime.

This trick could be applied in software today; it would save us from the need to vectorize some code, and even if ATI's architecture isn't the best suited to it, it works.

BTW, I like DWF too.

As for VLIW's merits in a massively parallel xPU, try finding ILP in Hoshen Kopelman algorithm (for cubic-like lattices). :p

In other areas too, sometimes finding ILP is pretty ugly and needs twisted data layouts.
Put 4 lanes in one and you will have at least 80% utilization, I have seen more challenging problems before...

What makes you think VLIW is more area efficient for GPUs? :oops: VLIW uses area and power to make an individual thread run fast. In GPUs, only the total running time for all threads matters. So why bother with spending area to speed up a single thread? Every unused instruction slot wastes power and area. And compilers will never be perfect. VLIW is not worth the area and power for GPUs, at least in the near future. 10 years from now, who knows?

I am with Jawed here. Scalar code is easier to write, and vec4 computes won't always be around (ie in all workloads) to save the VLIW's efficiency. Also, it locks ATI at a 4:1 scalar:sfu ratio (at least apparently), while nv is going the other way.
I think everyone (including me) had doubts about the efficiency of VLIW when the R600 came out; it didn't work for Itanium...

But now I think it has proved that it works, at least for graphics/SIMD workloads; after all, GPUs rely on parallelism and VLIW is just another form of parallelism, and a more flexible form than SIMD.

In the case of R600 it was especially efficient: many shaders at the time used vec4s to describe pixel colours and operated on all channels at once. If they use 15 float4 registers, a pure scalar architecture like G80 has to care about 60 float registers and then suffers from register pressure issues (and so is not so efficient); also, those shaders tend to have high ILP.

Of course, there are the scalar cases (which proved to be far fewer than the discussion here implies), typically using far fewer registers, where G80 shines. But even there there is no reason for the R600 VLIW not to perform well: software may split each lane in 4, bringing utilization to at least 80%, and the big vec4 register file will serve it just as well.

And also, there is the fat unit to talk about. There is simply no reason to have transcendental hardware on each ALU - they are not used that often. nVidia has a separate unit for handling them and it increases the cost of the control unit; the VLIW is fine with one slot, no additional complexity.

I wouldn't take VLIW out of graphics now, but I think 5-issue isn't a good width. My preference for graphics now goes to a VLIW width that's a multiple of 3 (3, 6, maybe 9, why not?) because of 64-bit math and double precision: a 64-bit mul may be done with 3 32-bit muls (Karatsuba) instead of wasting 4 slots and leaving one free like on current hardware. Dedicated hardware for DP doesn't seem a good idea because DP hardware is expensive and only a very small part of the code running on GPUs today depends on DP.
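For reference, the 3-multiply identity being invoked, shown at half width (32x32 -> 64-bit from 16-bit halves) so the carries stay trivial - the 64-bit-from-32-bit version has the same shape, just with carry handling:

[CODE]
#include <stdint.h>

// Karatsuba: x*y from three half-width multiplies instead of four.
// With x = x1*2^16 + x0 and y = y1*2^16 + y0:
//   x*y = hi*2^32 + (x1*y0 + x0*y1)*2^16 + lo
// and the cross term comes from one extra multiply: (x0+x1)*(y0+y1) - lo - hi.
uint64_t mul32_karatsuba(uint32_t x, uint32_t y)
{
    uint64_t x0 = x & 0xFFFF, x1 = x >> 16;
    uint64_t y0 = y & 0xFFFF, y1 = y >> 16;

    uint64_t lo  = x0 * y0;                          // multiply 1
    uint64_t hi  = x1 * y1;                          // multiply 2
    uint64_t mid = (x0 + x1) * (y0 + y1) - lo - hi;  // multiply 3

    return (hi << 32) + (mid << 16) + lo;
}
[/CODE]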
 
Put 4 lanes in one and you will have at least 80% utilization, I have seen more challenging problems before...

Try putting 4 lanes of Hoshen Kopelman algorithm (for cubic-like lattices) in one thread then. I am curious to know if you have seen/done this before.

More generally, try packing branch intensive control flow into a single vliw thread.

But now I think it has proved that it works, at least for graphics/SIMD workloads; after all, GPUs rely on parallelism and VLIW is just another form of parallelism, and a more flexible form than SIMD.

Nobody is doubting the utility of 5-way VLIW in graphics. It's the more gp workloads I am looking at.

Of course, there are the scalar cases (which proved to be far fewer than the discussion here implies), typically using far fewer registers, where G80 shines. But even there there is no reason for the R600 VLIW not to perform well: software may split each lane in 4, bringing utilization to at least 80%, and the big vec4 register file will serve it just as well.

And also, there is the fat unit to talk about. There is simply no reason to have transcendental hardware on each ALU - they are not used that often. nVidia has a separate unit for handling them and it increases the cost of the control unit; the VLIW is fine with one slot, no additional complexity.

I wouldn't take VLIW out of graphics now, but I think 5-issue isn't a good width. My preference for graphics now goes to a VLIW width that's a multiple of 3 (3, 6, maybe 9, why not?) because of 64-bit math and double precision: a 64-bit mul may be done with 3 32-bit muls (Karatsuba) instead of wasting 4 slots and leaving one free like on current hardware. Dedicated hardware for DP doesn't seem a good idea because DP hardware is expensive and only a very small part of the code running on GPUs today depends on DP.

I'd like to see the reactions of B3D crowd if they build a 9 way vliw on top of present 64 wide logical simd:)
 
I'd like to see the reactions of B3D crowd if they build a 9 way vliw on top of present 64 wide logical simd:)

Hmm, variable-length VLIW + SMT = win?
AFAIK (which is little) ATI is already variable-length VLIW, 1-5 instructions.
Going to 9-wide VLIW means scheduling almost 2 threads per clock. If you can keep 4+ threads on the ready-to-run list and go all hyperthreading on them, you could do very simple scheduling to achieve better resource utilization.
Mix up to 9 instructions out of the 5-max VLIW streams into a dynamic 9-max VLIW stream.
Wait, is this superscalar VLIW (dynamic superscalar static superscalar VLIW architecture)?
Seriously, what's this called?
And couldn't this help ATI going forward? Keep the virtual 5-wide arch, make the underlying hardware wider to absorb sub-5-instruction cycles and increase efficiency.
 
Try putting 4 lanes of Hoshen Kopelman algorithm (for cubic-like lattices) in one thread then. I am curious to know if you have seen/done this before.

If it is too branchy it will be slow on GPUs anyway.

BTW, do you know any GPU implementation of it?
 