Real World Technologies analyses Sandy Bridge IGP

Intel should shift more of its transistor budget to the GPU.

We "could" easily double the transistor budget for the iGPU, but the problem is that the available bandwidth isn't anywhere near the requirement. I would like to know how Intel is going to address that.

Another thing is drivers. Unlike ATI and Nvidia, which release drivers regularly or even monthly, Intel hasn't had new drivers since April. That is more than 4 months and counting.

I love fixed-function hardware: powerful and speedy. Personally, I would love a GPU just for graphics; I could leave all the rest of the workload to the quad-core CPU.
 
Here's a question for the devs out there. If it were in a console environment, would it be possible to use the stream processors in the Sandy Bridge GPU as auxiliary processors, similar to how the SPUs work in Cell?
 
I'm not an expert on this, but I don't see why it wouldn't. There's only one problem, though: together, the SPUs in Cell are much faster than the main CPU, while Sandy Bridge's shaders are much slower than the four CPU cores, so there would be little point in going through all the trouble; especially since Sandy's GPU isn't really designed with programmability in mind.

But the CPU/GPU balance in Sandy Bridge isn't suitable for a console at all, and you really wouldn't want to take any GPU power away from graphics. In fact, you'd want to do the opposite and use the CPU for some software rendering, or just a part of the pipeline, which I believe is precisely what Intel drivers do on PCs.
 
I agree, and I wasn't suggesting Sandy Bridge alone would be suitable for a console; the GPU portion is far too weak. It was more a suggestion of using the IGP as a sort of big SIMD extension to the CPU alongside a dedicated GPU, i.e. if you want to do some heavy SIMD lifting on the CPU, you could potentially offload some of that work to the IGP shaders, which would work in conjunction with the SIMD units on the CPU cores themselves.

This is just hypothetical though. I know there are plenty of other reasons why Sandy Bridge wouldn't be suitable for use in this way (heat, power, size, wasted transistors in the form of fixed-function GPU hardware in the IGP, etc.).
 
OK. I suppose in theory this should be possible, although far from easy, but in practice the overhead would probably far outweigh any potential gains, given that you'd have to jump through quite a few hoops to run some of your calculations on the GPU.

Plus, I'm not even sure Sandy's GPU is IEEE compliant, so you might run into consistency issues as well.
 
Just to elaborate further on the Sandy Bridge IGP being used for offload -- consider also that SB allows turbo-boost "sharing" between the GPU and CPU. If you have them both working at high levels, you'll lose the extra speed binning that you might otherwise pick up.

Given how slow the GPU would probably do things, having it also soak up your extra turbo bins means you may actually decelerate the whole process :)
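
The two objections above (offload overhead and lost turbo bins) can be put into a rough back-of-the-envelope model. The sketch below is only an illustration: the function name, the parameters and every number in it are hypothetical, not measurements of Sandy Bridge. It assumes the CPU and GPU work concurrently on a split workload, that offloading costs a fixed setup time, and that the CPU loses a fixed fraction of its throughput while the GPU is busy.

```python
def offload_speedup(work, cpu_rate, gpu_rate, gpu_fraction,
                    overhead=0.0, turbo_loss=0.0):
    """Speedup of CPU+GPU over CPU alone in a simple concurrent model.

    work         -- total amount of work (arbitrary units)
    cpu_rate     -- CPU throughput with full turbo (units per second)
    gpu_rate     -- GPU throughput (units per second)
    gpu_fraction -- share of the work handed to the GPU (0..1)
    overhead     -- fixed cost of setting up the offload (seconds)
    turbo_loss   -- fraction of CPU throughput lost while the GPU is busy
    """
    cpu_only_time = work / cpu_rate
    slowed_cpu_rate = cpu_rate * (1.0 - turbo_loss)
    cpu_part = (1.0 - gpu_fraction) * work / slowed_cpu_rate
    gpu_part = gpu_fraction * work / gpu_rate
    # Both halves run side by side, so the slower one sets the pace.
    offload_time = overhead + max(cpu_part, gpu_part)
    return cpu_only_time / offload_time


if __name__ == "__main__":
    # Hypothetical numbers: a GPU with one fifth of the CPU's throughput.
    ideal = offload_speedup(100, 10, 2, gpu_fraction=2 / 12)
    with_costs = offload_speedup(100, 10, 2, gpu_fraction=0.15,
                                 overhead=1.0, turbo_loss=0.1)
    print(f"ideal split, no costs:    {ideal:.2f}x")
    print(f"with overhead and turbo:  {with_costs:.2f}x")
```

With these made-up inputs, a perfectly balanced split with no costs gives a modest gain, while a small setup overhead plus a 10% turbo loss is enough to push the result below 1x, which is the "decelerate the whole process" case.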
 
There's no reason why you couldn't use the SNB graphics cores for real work. However, there is no OpenCL support, so the programming environment would be very challenging. Intel made a choice to only expose the GPU through their Media SDK...but that doesn't mean the hardware is crap.

And incidentally, I would expect the area for the GPU to more than double for Ivy Bridge.
 
Apparently, at the OpenCL BOF (whatever a BOF is) at SIGGRAPH, Intel showed a 3D audio demo (although I can't find any info about it), so OpenCL support may be coming.
 
Yeah, as it stands right now, almost all x86 CPUs have OpenCL/DirectCompute support (although Bulldozer/Atom might be tricky for variants that require 32KB of shared memory). Intel's integrated graphics on SNB does not have OpenCL support, but IVB will.

AMD's graphics in Fusion does have OpenCL support.

However, Intel has excellent programmable media hardware, something that AMD is somewhat behind on.

David
 
Atom doesn't seem to be supported. At least the SDK refused to install on my Z530-based netbook.
edit: Ah, you probably mean the IGP parts of the respective CPUs?
 
I think he meant the CPU, with the tight L1 being the problem (both BD and Atom have less than 32kB of L1, and I think on CPUs L1 serves as a stand-in for shared-mem). Which is not to say that one can't fix it (heck, the 4xxx series from ATI places shared-mem stuff in main VRAM), but it will be interesting to see whether anyone cares about Atom and what they'll do in BD.
 
AFAIK there is no way to lock something in the L1, is there? On CPUs the local memory is simply "emulated" as a buffer in global (main) memory. It relies completely on the cache hierarchy to speed it up. If the buffer is reused heavily, this works quite well without any special constructs. So if a small L1 is backed by a reasonably fast L2, it doesn't hurt much; performance degrades gracefully.

But using local memory on current CPUs doesn't help anyway in my opinion, or does anyone have a different experience? Using a small buffer in global memory for the same purpose should be the same speed.
Just looked it up: AMD's OpenCL CPU runtime (APP SDK 2.4) also reports that the local memory resides in global memory (it's a Phenom X6 in this case):
Code:
Number of platforms:                 1 
  Platform Profile:                 FULL_PROFILE 
  Platform Version:                 OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10) 
  Platform Name:                 AMD Accelerated Parallel Processing 
  Platform Vendor:                 Advanced Micro Devices, Inc. 
  Platform Extensions:                 cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 


  Platform Name:                 AMD Accelerated Parallel Processing 
Number of devices:                 1 
  Device Type:                     CL_DEVICE_TYPE_CPU 
  Device ID:                     4098 
  Max compute units:                 6 
  Max work items dimensions:             3 
    Max work items[0]:                 1024 
    Max work items[1]:                 1024 
    Max work items[2]:                 1024 
  Max work group size:                 1024 
  Preferred vector width char:             16 
  Preferred vector width short:             8 
  Preferred vector width int:             4 
  Preferred vector width long:             2 
  Preferred vector width float:             4 
  Preferred vector width double:         0 
  Native vector width char:             16 
  Native vector width short:             8 
  Native vector width int:             4 
  Native vector width long:             2 
  Native vector width float:             4 
  Native vector width double:             0 
  Max clock frequency:                 3200Mhz 
  Address bits:                     64 
  Max memory allocation:             2147483648 
  Image support:                 Yes 
  Max number of images read arguments:         128 
  Max number of images write arguments:         8 
  Max image 2D width:                 8192 
  Max image 2D height:                 8192 
  Max image 3D width:                 2048 
  Max image 3D height:                 2048 
  Max image 3D depth:                 2048 
  Max samplers within kernel:             16 
  Max size of kernel argument:             4096 
  Alignment (bits) of base address:         1024 
  Minimum alignment (bytes) for any datatype:     128 
  Single precision floating point capability 
    Denorms:                     Yes 
    Quiet NaNs:                     Yes 
    Round to nearest even:             Yes 
    Round to zero:                 Yes 
    Round to +ve and infinity:             Yes 
    IEEE754-2008 fused multiply-add:         No 
  Cache type:                     Read/Write 
  Cache line size:                 64 
  Cache size:                     65536 
  Global memory size:                 8389558272 
  Constant buffer size:                 65536 
  Max number of constant args:             8 
  Local memory type:                 Global 
  Local memory size:                 32768 
  Kernel Preferred work group size multiple:     1 
  Error correction support:             0 
  Unified memory for Host and Device:         1 
  Profiling timer resolution:             1 
  Device endianess:                 Little 
  Available:                     Yes 
  Compiler available:                 Yes 
  Execution capabilities:                  
    Execute OpenCL kernels:             Yes 
    Execute native function:             Yes 
  Queue properties:                  
    Out-of-Order:                 No 
    Profiling :                     Yes 
  Platform ID:                     0x7f9bd4692800 
  Name:                         AMD Phenom(tm) II X6 1090T Processor 
  Vendor:                     AuthenticAMD 
  Driver version:                 2.0 
  Profile:                     FULL_PROFILE 
  Version:                     OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10) 
  Extensions:                     cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_media_ops cl_amd_popcnt cl_amd_printf
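
The field that matters in a dump like this is "Local memory type", which mirrors the OpenCL device query `CL_DEVICE_LOCAL_MEM_TYPE` (either `CL_LOCAL` or `CL_GLOBAL`). As a minimal sketch, assuming the `Key: value` layout shown above, the field can be pulled out of saved output with a few lines of Python. The helper names `parse_clinfo` and `local_memory_is_emulated` and the shortened `SAMPLE` text are made up for illustration; they are not part of any SDK.

```python
def parse_clinfo(text):
    """Turn 'Key:   value' lines into a dict; lines without a colon are skipped.

    If a key appears more than once, the last occurrence wins.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # e.g. section titles such as the floating point heading
        info[key.strip()] = value.strip()
    return info


def local_memory_is_emulated(info):
    """True when the runtime reports local memory as living in global memory."""
    return info.get("Local memory type") == "Global"


# Shortened, hand-copied excerpt in the same layout as the dump.
SAMPLE = """\
  Device Type:                     CL_DEVICE_TYPE_CPU
  Cache size:                     65536
  Local memory type:                 Global
  Local memory size:                 32768
"""

if __name__ == "__main__":
    device = parse_clinfo(SAMPLE)
    print("local memory emulated in global memory:",
          local_memory_is_emulated(device))
    print("reported local memory size (bytes):",
          int(device["Local memory size"]))
```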
Really, since when, what context?
In OpenCL, since the beginning, as the RV7xx doesn't support the OpenCL access model for local memory.
In DirectCompute, it basically depends on the shader profile used. RV7xx, as a DX10.1 GPU, simply doesn't support DX11 CS 5.0 (which allows unrestricted local memory access but also requires 32kB of it), but it does support CS 4.0/4.1. There, MS chose the common denominator as the requirement for local memory and defined a more restricted access scheme, which is fully supported by the RV7xx LDS.
 
Intel has released about 5 driver versions since the 2361 drivers in April, although it took a bit of effort to find them.

2372-?
2406-?
2430-?
2462-Aug 1
2476-Aug 16
 
Do the pixel pipes in the SNB IGP run at the same clock as the shader units? I read the article, but I didn't see whether it was mentioned (maybe it was and I just overlooked it...)
 