Integrated graphics performance

Discussion in '3D Hardware, Software & Output Devices' started by Color me Dan, Oct 29, 2007.

  1. Color me Dan

    Regular

    Joined:
    May 19, 2007
    Messages:
    300
    Likes Received:
    1
    Location:
    Sweden
    Just wanted to tell everyone how pleasantly surprised I have been with my new computer's IGP. When I built it I didn't think I'd play a lot of games or anything on it, but soon my fingers were itching to see if the IGP could take anything at all. I bought a motherboard from ASUS, an M2A-something-HDMI, with an AMD 690G IGP (X1250). I fired up HL2 with it and I'm running through the game at 1280*1024 with shaders, shadows and water on high! I'd use a lower resolution and better filtering, but I have an LCD screen with a fixed resolution, so I'm avoiding it looking like crap.

    No stuttering, though when there is a "sudden" explosion in huge areas there's a little slowdown that is almost unnoticeable to me. Otherwise this little chip is like the Little Engine That Could, hehe, it's a champ! When I played HL2 on my previous computer with a GeForce FX 5900 it was stuttering all the time at lower resolutions with everything set on low. I'm amazed an IGP can outperform a standalone card (admittedly not a very good one, and a bit old).

    Now I think I might just play a little more HL2!
     
  2. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    3,984
    Likes Received:
    34
    Yay for 690G powah!

    Quite the capable IGP. Certainly better than anything the competition has out ATM.
     
  3. Flux

    Regular

    Joined:
    Nov 10, 2006
    Messages:
    313
    Likes Received:
    2
    What would be the current performance target for an Intel/AMD IGP as of 2011-2014+?

    10M triangles with DX11 shader effects? 100M triangles with DX11 shader effects?
    I would imagine the Radeon IGPs and Intel HD IGPs will be pretty decent by 2011-2014.
     
  4. orangpelupa

    orangpelupa Elite Bug Hunter
    Legend Veteran

    Joined:
    Oct 14, 2008
    Messages:
    7,287
    Likes Received:
    1,383
    Intel HD is decent on Haswell when it's not throttled.

    Example:
    My tablet can run Left 4 Dead 2 at 60 fps for 20 seconds. Then it crumbles to 30 fps...

    Bloody Intel set the TDP limit by time duration instead of real-time temperature. The chip runs at 55 °C, 60 °C, whatever... as soon as 20 seconds have passed it slows down.
     
  5. Flux

    Regular

    Joined:
    Nov 10, 2006
    Messages:
    313
    Likes Received:
    2
    How many polygons can an i5 IGP push on average?
    10M-20M with all shader effects on full blast?
    I know they are hot garbage compared to what NVIDIA and AMD have built.
     
  6. Wynix

    Veteran Regular

    Joined:
    Feb 23, 2013
    Messages:
    1,052
    Likes Received:
    57
    Not at all; Haswell has a decent GPU given the size/heat/power constraints.
    The next-gen Broadwell iGPU is meant to be a large improvement over Haswell.
     
  7. orangpelupa

    orangpelupa Elite Bug Hunter
    Legend Veteran

    Joined:
    Oct 14, 2008
    Messages:
    7,287
    Likes Received:
    1,383
    But if Broadwell still limits performance based on a timer rather than temperature...
    I'm not sure it will be good for gaming, but it will be fine for many kinds of work outside gaming.

    Btw, my Haswell i5 tablet can run Titanfall at 720p medium-low at 40 fps, then crawls to 10 fps when the throttle timer kicks in. Super annoying.

    On an AMD laptop I can simply point a small table fan at it or put it right under the air conditioner and the throttling is gone.

    This i5 Haswell keeps throttling after 20 seconds.
     
    #7 orangpelupa, Mar 30, 2014
    Last edited by a moderator: Mar 30, 2014
  8. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    I have been optimizing our engine for Intel integrated graphics for the last week. HD 4000 (Ivy Bridge) is highly bandwidth bound (HD 5000 would be even more so, but I don't have one to test). Simple buffer copies take a huge amount of time compared to ALU-heavy shader passes. I don't expect Broadwell to improve GPU performance much at all, unless Intel brings a faster memory bus (DDR4 isn't yet an option) or increases the L3 cache sizes. The 128 MB L4 (Crystalwell) is obviously a huge help, but will likely only be included in a few top Broadwell models (which cost almost double compared to the ones without L4, making them very low volume products).

    One thing that Broadwell likely improves a lot is the throttling. As the GPU is highly BW bound, increasing perf/watt is likely the main target instead of improving pure performance. Better perf/watt should reduce throttling. Ivy's HD 4000 (3632QM) throttles quite fast to 850 MHz (from 1150 MHz) according to my experiments, making it quite a hard optimization target. Optimizing against a moving target is tricky: your optimizations make a pass faster, but because it is faster, the GPU sometimes starts to throttle sooner as well.
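
    For anyone wanting to reproduce this kind of per-pass measurement on PC: below is a minimal sketch (not sebbbi's actual tooling) of timing a single GPU pass with D3D11 timestamp queries. The disjoint query's reported frequency is what keeps the result meaningful when the GPU clock is moving around due to turbo and throttling; all object and function names are placeholders.

    ```cpp
    // Minimal sketch: timing one GPU pass with D3D11 timestamp queries.
    // 'device', 'ctx' and the commented-out RenderPass() are placeholders.
    #include <d3d11.h>

    double TimePassMs(ID3D11Device* device, ID3D11DeviceContext* ctx)
    {
        ID3D11Query *disjoint = nullptr, *tsBegin = nullptr, *tsEnd = nullptr;
        D3D11_QUERY_DESC qd = {};
        qd.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
        device->CreateQuery(&qd, &disjoint);
        qd.Query = D3D11_QUERY_TIMESTAMP;
        device->CreateQuery(&qd, &tsBegin);
        device->CreateQuery(&qd, &tsEnd);

        ctx->Begin(disjoint);
        ctx->End(tsBegin);          // timestamp before the pass
        // RenderPass(ctx);         // the draw calls being measured go here
        ctx->End(tsEnd);            // timestamp after the pass
        ctx->End(disjoint);

        // Block for the results here for simplicity; a real profiler would
        // read them back a frame or two later to avoid stalling the GPU.
        D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj = {};
        while (ctx->GetData(disjoint, &dj, sizeof(dj), 0) == S_FALSE) {}
        UINT64 t0 = 0, t1 = 0;
        while (ctx->GetData(tsBegin, &t0, sizeof(t0), 0) == S_FALSE) {}
        while (ctx->GetData(tsEnd,   &t1, sizeof(t1), 0) == S_FALSE) {}

        // The disjoint query supplies the timestamp frequency for this interval.
        double ms = dj.Disjoint ? -1.0
                                : double(t1 - t0) / double(dj.Frequency) * 1000.0;
        disjoint->Release(); tsBegin->Release(); tsEnd->Release();
        return ms;
    }
    ```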
     
  9. orangpelupa

    orangpelupa Elite Bug Hunter
    Legend Veteran

    Joined:
    Oct 14, 2008
    Messages:
    7,287
    Likes Received:
    1,383
    Hi sebbbi, yeah, the throttling in Haswell is what's killing its performance. When it isn't throttled, it's quite fast.

    Sorry, this is a bit off topic...
    but is there something that makes developers limit the lowest resolution available to be selected?

    My Intel HD 4200 is bloody slow after 20 seconds, but it's still strong enough to run games at 800x600 or 1024xsomething. Unfortunately new games nowadays don't offer resolutions lower than 1280x720, and if I force a lower resolution from an .ini file, the registry, or the command line, the game simply crashes.

    Thanks,
    pardon my English
     
  10. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    15,026
    Likes Received:
    2,374
  11. Wynix

    Veteran Regular

    Joined:
    Feb 23, 2013
    Messages:
    1,052
    Likes Received:
    57
    Is it possible to run your CPU at a lower frequency to give the GPU extra legroom?
    I can run my CPU at 800 MHz, 2100 MHz or 3200 MHz depending on which power plan I use.
     
  12. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    What are you using to determine that, out of curiosity (GPA counters, etc.)? What are the specs of the system you're testing on (is it a single-channel memory config or something)? Certainly HD 5000 can get bandwidth (and more often, power!) bound, but Ivy Bridge is a significantly smaller GPU... I wouldn't be surprised if you're running more into basic ROP/sampler throughput issues than bandwidth per se. Intel chips also have rather large caches (GPU and LLC - the latter of which is easily big enough for entire render targets) which help a fair bit, especially if you order your produce/consume passes in a friendly manner.

    Anyways, optimizing to reduce bandwidth usage is never a bad thing, and I'm the first person to bitch at people for almost-always-unnecessary surface copy operations :) That said, I'd be sort of surprised if HD4000 is terribly bandwidth limited on a typical dual channel config @ 850Mhz outside of throughput benchmarks.

    It's not "throttling" per se, it's turbo. And alas, that is just going to be the world we live in, even on the high end. Power efficiency is king and entirely determines performance.

    Yep, that's the brave new world. Note that that's a 35W chip though, so obviously it's going to clock somewhat lower than bigger form factors. Haswell is more power efficient overall (and Broadwell even more so) and thus tends to maintain higher turbo speeds as well, but there's no escaping physics.

    The presentation from Codemasters at GDC speaks to this a bit:
    http://www.gdcvault.com/play/1020221/Rendering-in-Codemasters-GRID2-and
    Among the interesting notes: optimizing your CPU code is definitely one that people rarely think about but can make a big difference when power constrained. It's also important to pick a frame rate target and stick to it for an optimal experience (vsync, half vsync, etc). If you just let it run "as fast as possible" and heat up the chip in the first few seconds, the experience after that point will suck more than constraining it to an attainable rate from the get-go and letting it sleep in the pauses between frames. As a side effect you'll get less fan noise and lower input latency as well :)
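
    To make the frame-cap point concrete, here is a minimal sketch of a CPU-side limiter that sleeps away the slack instead of rendering as fast as possible; all names are illustrative, and in practice vsync/half-vsync via the swap chain is the better tool when available.

    ```cpp
    // Minimal sketch of "pick a frame rate target and stick to it".
    #include <chrono>
    #include <thread>

    void RunFrameLoop()
    {
        using clock = std::chrono::steady_clock;
        const auto frameBudget = std::chrono::microseconds(16667); // ~60 Hz

        auto nextDeadline = clock::now();
        for (;;)
        {
            nextDeadline += frameBudget;

            // UpdateAndRender();   // the game's own per-frame work goes here

            // Idle in the remaining budget so there is turbo headroom left for
            // the frames that actually need it (also: less fan noise).
            std::this_thread::sleep_until(nextDeadline);

            // If a frame ran long, don't try to "catch up" with a burst of frames.
            if (clock::now() > nextDeadline)
                nextDeadline = clock::now();
        }
    }
    ```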

    It might work, but the simpler solution is just to do less on the CPU/optimize with multithreading/SIMD. It's running single-threaded applications @ ~4GHz that sucks the power. Do note though that there are some arch issues with IVB and CPU vs. ring bus clocks that were improved on in Haswell.
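
    To illustrate the multithreading side of that (purely hypothetical code, not from any particular engine): spreading per-frame CPU work across worker threads means no single core has to sustain its highest turbo bin.

    ```cpp
    // Illustrative only: splitting per-frame CPU work across cores.
    // The Particle type and Update() stand in for whatever the frame processes.
    #include <algorithm>
    #include <future>
    #include <thread>
    #include <vector>

    struct Particle { float x, y, vx, vy; };
    static void Update(Particle& p, float dt) { p.x += p.vx * dt; p.y += p.vy * dt; }

    void UpdateParallel(std::vector<Particle>& items, float dt)
    {
        const size_t workers = std::max(1u, std::thread::hardware_concurrency());
        const size_t chunk   = (items.size() + workers - 1) / workers;

        std::vector<std::future<void>> tasks;
        for (size_t begin = 0; begin < items.size(); begin += chunk)
        {
            const size_t end = std::min(begin + chunk, items.size());
            tasks.push_back(std::async(std::launch::async, [&items, begin, end, dt] {
                for (size_t i = begin; i < end; ++i)
                    Update(items[i], dt);
            }));
        }
        for (auto& t : tasks)
            t.get(); // join all workers before the results are used
    }
    ```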

    Anyways we can chat here more or feel free to shoot me an e-mail or PM to follow up in more detail. Always happy to take a look at workloads and see if there's ways to make them perform better :)

    Regarding the "timer" stuff, the only thing I know of that operates that way is the PL2 power mode that allows it to exceed TDP for a brief period of time. This is meant to allow better race to idle, not something that is sustainable long term (otherwise that would be the TDP...). If you have a tablet that can truly maintain that higher TDP for long periods without heating up then the OEM made a poor SKU decision ;) In any case, you can modify a lot of this behaviour with the Intel Extreme Tuning Utility or similar (sometimes even in the BIOS). Reconfiguring TDP is effectively over-clocking, but as long as you're fine with that go nuts.
     
  13. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    I changed the light accumulation buffer to 11f-11f-10f from RGBA16F, and got a huge boost in many places. A full-screen alpha blend to the lighting buffer got almost twice as fast. Many post processing passes sampling the lighting buffer got considerably faster as well. Also, when I disabled 4xMSAA from virtual texture tile buffer generation (rendering to 128x128 x four RGBA8 MRTs) it got a nice boost as well. Of course it might be that the HD 4000 hardware is just generally slow at sampling RGBA16F textures and blending to RGBA16F render targets, and the ROPs are slow when MSAA is used (but not bandwidth starved). It's hard to distinguish between them, because the Intel GPA tool doesn't have a draw call bottleneck viewer (one that would show the relative utilization of all the fixed function units, ALUs and the memory controller for a selected draw call, allowing the developer to easily see what the hardware bottleneck is for each draw call).
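
    For context, the format swap described above comes down to a one-line change where the render target is created; a minimal D3D11 sketch follows (function and variable names are placeholders, and the trade-off is losing destination alpha).

    ```cpp
    // Sketch: light accumulation target as R11G11B10_FLOAT (4 bytes/pixel)
    // instead of R16G16B16A16_FLOAT (8 bytes/pixel), halving the bandwidth of
    // every write, blend and sample of that buffer.
    #include <d3d11.h>

    HRESULT CreateLightAccumBuffer(ID3D11Device* device, UINT width, UINT height,
                                   ID3D11Texture2D** tex, ID3D11RenderTargetView** rtv)
    {
        D3D11_TEXTURE2D_DESC desc = {};
        desc.Width            = width;
        desc.Height           = height;
        desc.MipLevels        = 1;
        desc.ArraySize        = 1;
        desc.Format           = DXGI_FORMAT_R11G11B10_FLOAT; // was DXGI_FORMAT_R16G16B16A16_FLOAT
        desc.SampleDesc.Count = 1;
        desc.Usage            = D3D11_USAGE_DEFAULT;
        desc.BindFlags        = D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE;

        HRESULT hr = device->CreateTexture2D(&desc, nullptr, tex);
        if (FAILED(hr))
            return hr;
        return device->CreateRenderTargetView(*tex, nullptr, rtv);
    }
    ```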
    We have a tight g-buffer layout, and our vertex formats etc. are tightly bit packed. However, because blending to EDRAM is very fast on Xbox 360 (and is the only way to do read-modify-write on EDRAM data on the X360), our engine heavily exploits it. This also works well on modern high end GPUs, but on low end PCs with limited bandwidth, blending burns more bandwidth than would be required by other means.
    Yes, this is an Ultrabook (very light and thin), and I think the 35W Intel chip was chosen because the laptop also has an NVIDIA GeForce 650M GPU. Fortunately the GeForce chip now runs the game at 1600x900 + locked 60 fps after my latest batch of optimizations.
    Our code is using all four CPU cores, but the total CPU utilization is only 12% according to GPA (on "low" quality settings). When I tried to reduce the draw distance, the Intel GPU actually got faster than I anticipated, because that also dropped the CPU usage (less data to process on the CPU side, and around 30% fewer draw calls). That likely helped the GPU get a bigger share of the TDP.
    I could try that to get more predictable analysis results. GPA usually shows clocks dropping to 850 MHz after a few frame captures.
     
  14. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    IIRC RGBA16F is half speed in the ROPs and maybe trilinear too (when not in the brilinear zone), although the latter probably doesn't apply to light accumulation. I think it actually has the same blend throughput though, as basically all blending is half the speed of the ROPs on Ivy Bridge, i.e. you may just be seeing throughput differences.

    Obviously reducing the bandwidth use helps with power and other stuff though too, especially on ultrabook-class parts where that enables more power headroom even if you're not entirely saturating the DRAM bandwidth.

    MSAA perf is terrible on Ivy Bridge and pretty bad on Haswell as well. I recommend avoiding it entirely until Broadwell and using screen-space/temporal methods instead.

    There's a myriad of hardware counters available... you can see how busy a lot of the units are (EUs, samplers), total amount of GPU memory read/writes, etc. Are you not able to see those counters for some reason?

    Yeah, I mean the reality is that on power-constrained stuff, burning DRAM bandwidth is never going to be good, whether it be Intel, NVIDIA, IMG or otherwise. People are simply going to have to adjust to organizing things in a more cacheable manner, although increasingly large caches (see Iris Pro, Maxwell, etc.) will help a bit here too.

    I get that this was designed for 360 and such, but obviously blending is going to be expensive if you can't keep it in cache. Have you experimented with doing a quasi-tiled thing and scissoring out regions of the screen to do in different "passes" to try and get better cache use? I remember you mentioning something similar with GCN in another thread and that sort of structure can definitely benefit Intel as well assuming there's not a massive amount of geometry to "re-render" between the passes (usually not for light volumes, even with very basic culling).

    Yeah I mean there's ultimately no way a 35W CPU+GPU is going to be able to compete with a 35W CPU+45W GPU. If it's a Macbook Pro you're talking about here, that's even more true since it's really not a "650M" per se in those given the clocks. Iris Pro is a better match to that sort of config but even then has a lower TDP.

    Yeah that is usually helpful although I recommend the frame-locking solution regardless for the final game. As I said letting everything run "as fast as possible" tends to produce an inferior user experience. For analysis you can also lower the GPU turbo multiplier, etc.
     
  15. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    And I suppose it doesn't help that we are rendering simultaneously to four render targets using alpha blending and 4xMSAA :)

    The 4xMSAA was actually a bug, because it's only needed in our offline virtual texture baking tools. The game never generates the upper mip pages at runtime (where decal triangles can be subpixel sized).
    I am spoiled by console tools that pretty much tell you instantly what your bottleneck is. Intel GPA seems to be among the best PC profiling tools available, but it takes some time to get used to it (and to the Intel hardware specifics).
    We have optimized all the draw calls where memory bandwidth has been shown to be a bottleneck (on console and desktop hardware). For the draw calls where something else is the bottleneck, memory bandwidth usage hasn't been optimized, since it wouldn't have helped performance at all. In general, on desktop PCs and consoles it has never been a good idea to optimize something that is not a bottleneck (a waste of time that could be used for optimizing something useful).

    However, with the recent "race to sleep" style GPU designs in laptops (and also the Radeon 290 / 290X on desktop), I might need to reconsider optimizing things that don't provide any immediate performance gains, but reduce power consumption and thus the likelihood of GPU TDP/temperature based throttling. However, profiling the gains of these optimizations is very difficult, as there's no immediate performance gain, and monitoring hardware clock rate or temperature is quite error prone (very difficult to get exact results). Does Intel have any tools that specifically help in this kind of optimization process?
    We don't do multiple combine passes with the alpha blender in a row (as it's always better to combine multiple things in a single pass). However, even a single full-screen alpha blended pass that reads two (uncompressed 32 bit full screen) textures and outputs to an RGBA16F render target costs a lot.

    On "low" settings, our particle rendering is at half resolution and soft particles are disabled (no 32 bit depth texture read), so the cost is not that heavy. The vertex cost in our particle rendering is very cheap, because we do particle lighting in a separate step (using stream out / memexport). Splitting into multiple small viewports could be a good idea in this case, as the resolution is low, and divided by 2x2 it is even lower (so splitting into 128x128 tiles that fit into the GPU ROP caches wouldn't cause that many scissor + draw pairs). Some GPUs however don't like frequent scissor/viewport changes (stalls), so this might not be a win on all GPUs.
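
    A rough sketch of what such a scissor-tiled pass could look like on D3D11 follows (illustrative only, not the Trials engine; it assumes the bound rasterizer state has scissoring enabled, and the draw callback and tile size are placeholders).

    ```cpp
    // Sketch: issue a blending-heavy pass one 128x128 scissor rect at a time so
    // each tile's render target data can stay resident in the GPU caches across
    // overlapping blends. Frequent scissor changes may stall some GPUs, so this
    // is not a guaranteed win everywhere.
    #include <d3d11.h>
    #include <algorithm>
    #include <functional>

    void DrawTiled(ID3D11DeviceContext* ctx, UINT rtWidth, UINT rtHeight,
                   const std::function<void()>& issueDraws, UINT tile = 128)
    {
        for (UINT y = 0; y < rtHeight; y += tile)
        {
            for (UINT x = 0; x < rtWidth; x += tile)
            {
                D3D11_RECT rect;
                rect.left   = LONG(x);
                rect.top    = LONG(y);
                rect.right  = LONG(std::min(x + tile, rtWidth));
                rect.bottom = LONG(std::min(y + tile, rtHeight));
                ctx->RSSetScissorRects(1, &rect);

                issueDraws(); // re-issue the blended draws, clipped to this tile
            }
        }
    }
    ```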
    Our game is already locked to 60 fps. We don't try to render any more frames than the display can show. We briefly tried a locked 30 fps option (for low end hardware) for our previous game, but it was horrible. Trials games NEED 60 fps to be enjoyable. I would rather disable shadows than go to 30 fps (that would almost double the frame rate, but it's not possible either, since we have some user-created levels where the game character itself is a shadow).
     
  16. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    That's probably getting a bit heavy for any integrated GPU ;)

    Yeah there are always improvements to be made, but it is legitimately a bit more complicated on the PC with many SKUs and things like power limitations/turbo. I can help if you have questions about Intel hardware or GPA metrics - feel free to ask here or PM. I'm also happy to forward feedback to the GPA team on how to improve the tool further.

    It's definitely a brave new world now :) It gets even more complicated in that you can optimize something based on a GPA Frame Analyzer profile and have it go faster there, but then see no performance improvement when you run the full game, due to how power sharing works. It's really more about optimizing for power efficiency on those targets. I wish I could tell you we have some magical tool to make that easy, but it's legitimately hard. We do have a few basic things: the GPA System Analyzer can show you GPU vs. CPU TDPs in real time, and the Intel Extreme Tuning Utility can do some similar things, but it's fairly coarse-grained. There are also tools like "Power Gadget", but a lot of those are more about managing sleep states and stuff that isn't as directly relevant to games.

    There's some high level recommendations in the Haswell dev guide (http://download-software.intel.com/...tion_Core_Graphics_Developers_Guide_Final.pdf) that apply to Ivy Bridge as well in terms of avoiding spin wait loops, capping frame rate, minimizing memory bandwidth, etc. Then there's more architectural stuff like realizing that stalled EUs waste power and trying to balance the load on the various parts of the pipeline, but there's no one silver bullet unfortunately.
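
    As an illustration of the spin-wait point from that guide, here is a hedged sketch assuming D3D11 and an event query created elsewhere (names are placeholders).

    ```cpp
    // Sketch of "avoid spin waits": when the CPU must wait on the GPU (here, a
    // D3D11 event query), sleep between polls instead of spinning a core at
    // full clock, which on a shared-TDP part steals power budget the GPU could
    // be using.
    #include <d3d11.h>
    #include <chrono>
    #include <thread>

    void WaitForGpu(ID3D11DeviceContext* ctx, ID3D11Query* eventQuery)
    {
        ctx->End(eventQuery); // D3D11_QUERY_EVENT: signalled when prior work completes

        for (;;)
        {
            BOOL done = FALSE;
            if (ctx->GetData(eventQuery, &done, sizeof(done), 0) == S_OK && done)
                break;
            // Instead of an empty busy loop, yield the core between polls.
            std::this_thread::sleep_for(std::chrono::microseconds(200));
        }
    }
    ```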

    There's definitely a limit to how many full passes over the screen you want to do, especially with "wider" formats. Memory traffic costs power/heat and there's only so much of that in these mobile parts :) I realize it's hard to reorganize an engine that was designed around different constraints, but going forward making more use of caches and such, avoiding unnecessary copying, making sure to read a given piece of data a minimal amount of times, etc. is absolutely going to be critical across the board on all GPUs.

    Yep, as a Trials player I agree :) The key point really is not so much to drop it to 30 but to pick an achievable rate at a given quality level and cap it there, rather than letting it run away and heat up the chip on the "easy" views (or the menu... believe me, that one happens a lot!). It's a significant enough problem that the driver will actually cap frame rates in certain situations (on battery, in certain power profiles), but it's obviously more ideal if applications do it themselves.

    Makes sense. Shadows can definitely be expensive and note that on Ivy Bridge doing foliage in shadow maps (depth writes + discard) hits some performance issues. Unfortunately there's not a lot you can do about it other than trying not to render a lot of stuff back-to-back to the same region of the screen with that state combination (i.e. the opposite of what you want to be doing for good caching ;)), but the issue is mostly fixed in Haswell. See the dev guide for some more details.

    And as always, feel free to fire me an e-mail/PM! I'm always happy to help out or answer questions.
     
  17. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    15,026
    Likes Received:
    2,374
    @orangpelupa
    If the windowing program doesn't work, try HiAlgo Switch.
     
  18. orangpelupa

    orangpelupa Elite Bug Hunter
    Legend Veteran

    Joined:
    Oct 14, 2008
    Messages:
    7,287
    Likes Received:
    1,383
    @Davros
    I have not tried the windowing thing yet, sorry. Will try it as soon as I finish fixing my tablet. The Windows 8.1 Store is broken again, again and again lol

    @Wynix
    I have done that and the GPU keeps getting throttled. According to ThrottleStop, my Intel Core i5 is also limited by a timer. The maximum speed at the 15 W TDP is only allowed for 20 seconds. After that it goes down to 5 W (usually hovering around 4.x W).
     
  19. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Andrew, do you know what the "3D Preference" slider ("Performance - Quality") does in the Intel Graphics and Media Control Panel? If I put it at "Performance", the frame rate of our game increases slightly, but I don't see a clear reduction in image quality. Does it turn on some sort of optimized texture filtering (or mip bias, etc.)?

    In "Performance" mode, one level (the cheapest one) already runs at a locked 60 fps most of the time at 640x480 resolution (HD 4000 with 35W TDP). Lowering the resolution from 720p to 640x480 doesn't actually improve performance that much, since shadow map rendering seems to take the majority of the frame time (on this GPU at this resolution). I wonder how well an HD 4000 with a higher TDP ceiling, or an HD 5000 / Iris Pro, would run it. It would definitely be nice to reach a locked 60 fps at 720p on an integrated GPU :)
     
  20. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    I think it does things like control how big the "brilinear" region is, etc. but I'm not 100% sure on the comprehensive list. I'll dig and let you know what I find out :)

    If you're willing, fire me an e-mail/PM and we can easily run your stuff through a range of GPUs and send you the results.
     