Binning Renderer?
Quote (page 2):
The PowerVR* MBX and SGX series [PowerVR 2008], the Intel® Graphics Media Accelerator 900 Series [Lake 2005], the ARM Mali* [Stevens 2006], and Talisman [Torborg & Kajiya 1996] have been generally classified as sort-middle architectures.
Quote (page 5):
This section describes a sort-middle software renderer designed for the Larrabee architecture that uses binning for load balancing. Section 5 provides performance studies for this software renderer.
RV770 is 10*(16*(1+1+1+1+1)) = 800 SPs, right?
A 48-core Larrabee would be 48*(16*1) = 768 SPs.
So one ATI 16*(1+1+1+1+1) VLIW SIMD engine = 5 Larrabee cores.
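For what it's worth, here is that lane-count arithmetic spelled out as a tiny C++ sketch. It only uses the counts from the posts above; no clock speeds or FLOPS figures are assumed.

Code:
#include <cstdio>

int main() {
    // RV770: 10 SIMD engines x 16 VLIW units x (4 + 1) ALU slots = 800 "SPs"
    const int rv770_lanes = 10 * 16 * (1 + 1 + 1 + 1 + 1);

    // Hypothetical 48-core Larrabee: one 16-wide vector unit per core
    const int lrb_lanes = 48 * 16;

    std::printf("RV770 ALU lanes:    %d\n", rv770_lanes);   // 800
    std::printf("Larrabee ALU lanes: %d\n", lrb_lanes);     // 768

    // One 16-wide VLIW5 SIMD engine has 16 * 5 = 80 lanes,
    // i.e. the lane count of five 16-wide Larrabee cores.
    std::printf("Lanes per RV770 SIMD engine: %d\n", 16 * 5);
    return 0;
}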
Quote:
On real shaders, the scalar unit is usually 50% utilized or less. That's factoring in that the shader compiler does attempt to move work from the vector unit to the scalar unit (replacing a single vector instruction with 3 or 4 scalar instructions).

I'd be curious to know what this is based on.
Quote:
The other issue is that a majority of the operations that go through the 4-wide vector unit only use 3-element vectors. You're only using the fourth element maybe 25% of the time.

It's not a 4-wide vector, they are 4 scalars; other ops can be scheduled in there to "fill that hole". The shader compiler does this type of stuff automatically.
Quote:
The 5 ALU ops on ATI are split into 4-wide vector and 1-wide scalar operations.

Please tell me how you found out...
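To make the slot-packing point from the replies above concrete, here is a toy C++ sketch. It is not the real R6xx/R7xx ISA or shader compiler; it just shows why five independent scalar slots can co-issue a 3-component operation plus unrelated scalar work in one bundle.

Code:
#include <cstdio>
#include <string>
#include <vector>

// One VLIW bundle: up to 5 independent scalar ops issued per clock (illustrative).
struct Bundle {
    std::vector<std::string> slots;
};

int main() {
    // Scalar ops that are independent and ready this cycle: a vec3 multiply
    // (3 slots) plus two unrelated scalar ops pulled in to "fill the hole".
    const std::vector<std::string> ready = {
        "mul r0.x, r1.x, r2.x",
        "mul r0.y, r1.y, r2.y",
        "mul r0.z, r1.z, r2.z",
        "rcp r3.x, r4.x",
        "add r5.x, r5.x, r6.x",
    };

    Bundle b;
    for (const std::string& op : ready)
        if (b.slots.size() < 5)          // greedy packing into the 5 slots
            b.slots.push_back(op);

    std::printf("issued %zu ops in one bundle\n", b.slots.size());
    for (const std::string& op : b.slots)
        std::printf("  %s\n", op.c_str());
    return 0;
}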
Quote:
The only way to save space is to have tiles so large that object-level binning is doable without obscene overlap costs.

Yes, so? You can use large tiles inside your rendering engine and small tiles within the rasterizer.
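A minimal sketch of that two-level arrangement, assuming illustrative sizes (128-pixel bins for the binning pass, 16-pixel tiles inside the rasterizer) that are not taken from the paper:

Code:
#include <algorithm>
#include <cstdio>

constexpr int BIN  = 128;  // assumed coarse bin size used when binning primitives
constexpr int TILE = 16;   // assumed fine tile size walked by the rasterizer

struct Box { int x0, y0, x1, y1; };  // screen-space bounding box of a primitive

int main() {
    Box tri = { 70, 50, 300, 210 };

    // Binning pass: which coarse bins does the primitive touch?
    for (int by = tri.y0 / BIN; by <= tri.y1 / BIN; ++by)
        for (int bx = tri.x0 / BIN; bx <= tri.x1 / BIN; ++bx) {
            std::printf("bin (%d,%d)\n", bx, by);

            // Rasterization pass (per bin): walk the finer tiles inside the bin
            // that the bounding box overlaps; real code would test triangle edges.
            int x0 = std::max(tri.x0, bx * BIN), x1 = std::min(tri.x1, (bx + 1) * BIN - 1);
            int y0 = std::max(tri.y0, by * BIN), y1 = std::min(tri.y1, (by + 1) * BIN - 1);
            int tiles = ((x1 / TILE) - (x0 / TILE) + 1) * ((y1 / TILE) - (y0 / TILE) + 1);
            std::printf("  ~%d fine tiles touched inside this bin\n", tiles);
        }
    return 0;
}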
Quote:
Yes, starting with the rendering of bins before the entire scene has been binned will introduce an extra super-linear time/complexity factor ... but that in and of itself is not important, what's important is whether at those higher complexities it actually becomes slower than direct rasterization.

If tiling is a win.

If what is a win? Seems like a disadvantage to me.
He didn't. It's an assumption carried over from previous ATi architectures. This assumption is incorrect.
Quote:
Just process those actually needed. Set a flag in the output buffer to indicate which vertices have already been processed. Like an infinite cache (from a software view, in hardware the actual cache keeps frequently used vertices close).

Good, so you agree with me that it makes sense to 'cache' them. The paper doesn't mention this at all, but so many things are not mentioned that I'm willing to bet that loads of interesting details have been omitted.
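Here is roughly what that "flag in the output buffer" scheme looks like in code. This is a sketch under my own naming, not the paper's actual implementation: each indexed vertex is shaded at most once per batch, so the flag behaves like an unbounded post-transform cache.

Code:
#include <cstdint>
#include <vector>

struct VertexIn  { float pos[3]; };
struct VertexOut { float clip[4]; bool processed; };

// Trivial stand-in for a real vertex shader.
static VertexOut run_vertex_shader(const VertexIn& v) {
    VertexOut o;
    o.clip[0] = v.pos[0]; o.clip[1] = v.pos[1];
    o.clip[2] = v.pos[2]; o.clip[3] = 1.0f;
    o.processed = true;
    return o;
}

// Shade only the vertices the index list actually references, and never more
// than once: the flag in the output buffer marks already-processed vertices.
static void shade_referenced_vertices(const std::vector<VertexIn>& in,
                                      const std::vector<std::uint32_t>& indices,
                                      std::vector<VertexOut>& out) {
    out.assign(in.size(), VertexOut{});          // all flags start false
    for (std::uint32_t idx : indices)
        if (!out[idx].processed)
            out[idx] = run_vertex_shader(in[idx]);
}

int main() {
    std::vector<VertexIn> verts = { {{0,0,0}}, {{1,0,0}}, {{0,1,0}}, {{1,1,0}} };
    std::vector<std::uint32_t> indices = { 0, 1, 2,  2, 1, 3 };  // shared vertices
    std::vector<VertexOut> out;
    shade_referenced_vertices(verts, indices, out);  // vertices 1 and 2 shaded once
    return 0;
}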
Quote:
That's what they're telling you today...

Well, the beauty of this rendering architecture is that it's fluid and it will keep evolving/improving.
This is going to be true for pretty much any hardware that implements scatter/gather: the 4870, G100, or any vector supercomputer throughout the ages that doesn't have word-width SRAM main memory.
I think you're confusing thread-level parallelism and data-level parallelism (SIMD). They are two completely different forms of parallelism.
You don't need multi-core for SIMD, and you don't need SIMD for multi-core, and SIMD was available about a decade before multi-core was (at least in terms of consumer hardware).
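To illustrate the distinction, here is a small C++ sketch of the same loop written both ways. The 4-wide "lane" grouping and the thread count are arbitrary choices for the example; the point is that the two forms are orthogonal and compose freely.

Code:
#include <cstdio>
#include <thread>
#include <vector>

void add_simd_style(const float* a, const float* b, float* c, int n) {
    // Data-level parallelism: one operation applied across a group of lanes.
    // Written as a plain 4-element inner loop to stay portable; a real SIMD
    // path would use intrinsics or rely on the vectorizer.
    for (int i = 0; i < n; i += 4)
        for (int lane = 0; lane < 4 && i + lane < n; ++lane)
            c[i + lane] = a[i + lane] + b[i + lane];
}

void add_threaded(const float* a, const float* b, float* c, int n, int nthreads) {
    // Thread-level parallelism: independent instruction streams on disjoint ranges.
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        int begin = n * t / nthreads, end = n * (t + 1) / nthreads;
        pool.emplace_back([=] {
            for (int i = begin; i < end; ++i) c[i] = a[i] + b[i];
        });
    }
    for (auto& th : pool) th.join();
}

int main() {
    const int n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
    add_simd_style(a.data(), b.data(), c.data(), n);
    add_threaded(a.data(), b.data(), c.data(), n, 4);   // the two compose freely
    std::printf("c[0] = %.1f\n", c[0]);
    return 0;
}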
Quote:
Yes, so? You can use large tiles inside your rendering engine and small tiles within the rasterizer.

Whoops. I should have seen this obvious solution. You're absolutely right.
Quote:
but that in and of itself is not important, what's important is whether at those higher complexities it actually becomes slower than direct rasterization.

Sure, and it very well could be.
Quote:
I didn't know that it had changed.

Actually that particular configuration (Vec4+Scalar) has never appeared on the PC; it was XBOX 360 / Xenos specific. DX9 non-unified architectures featured a Vec3+Scalar, dual-issue MAD+ADD ALU configuration, while DX10 unified architectures have been 5D scalar MADs.
Quote:
Do the newer ATI VLIW cores support issuing multiple scalar or vector instructions in one clock?

Yes, the ALUs can process this and the shader compiler is designed to do this. Vectors can even be broken up and issued over multiple clocks in order to achieve better packing as well.
Quote:
I didn't know that it had changed.

Get GPUSA:
maven said:
2) I would not want to have to maintain the Larrabee driver / software stack. Ever! Under no circumstances. And in particular knowing Intel's GMA driver reputation, I'm not sure Intel can do it.
Quote:
If Larrabee takes over, perhaps Intel doesn't have to maintain a "driver" in the traditional sense. Intel could give the big game design companies access to their baseline software render engine and then say "have fun", allowing these companies to totally bypass DX and do any sort of customization they want to the rendering engine, physics computations, etc.

The render engine would have to sit on top of some kind of driver.
Quote:
In the long run, there is no need for the hardware to hide behind some thick Microsoft-controlled API and driver layer.

If and only if there is only Intel, and even then it is only guaranteed if there is a single Larrabee revision to program for.