October 2nd, 2012 - A few weeks ago I became quite disappointed when comparing my rendering speed to that of the real WoW client, which is one of the highest-performing game clients in recent history.
I’ve made a ton of changes since then, guided by profiling tools like Intel’s VTune and the newest Nsight for Visual Studio.
Some of the optimizations in recent weeks are below. A few were silly and obvious, but some were only found when really drilling down.
— Completely re-tool cbuffer usage into frame-specific, object-specific, and mesh-specific structures. This especially helps reduce lighting-related Map/Unmap calls into the shaders. I reduced Map/Unmap copies by about 75%.
— Stop using string (shader name)->pointer maps for shader-specific D3D11 buffers and such. I converted my whole ShaderSet system to enum-based arrays.
— Don’t use sqrt in Pythagorean distance calculations unless you actually need to present the actual value; comparisons don’t need it. This brought the render list object sorting (see below) from about 5% of the rendering thread’s CPU time to 0.5%. Surprise, kids: sqrt isn’t fast.
— Toy with shader compile options for speed.
— Locally cache a bunch of variables that were accessed through accessors frequently and weren’t being optimized out by the compiler.
— Separate opaque and transparent render lists, resulting in far fewer blend state changes (also in preparation for instancing of opaques)
— Sort the opaque list front to back and the transparent list back to front. The former takes advantage of early-z culling in the pipeline, which is always recommended; I had never done it because of worries about my transparent objects.
— Don’t bother clearing the stencil buffer - I’m not using it.
— Calculate world-light lighting values once every second or so instead of every frame. That was definitely overkill.
— Turn some copy-constructor-calling assigns into references to avoid lots of copying (specifically transparency timelines and the hefty bone timelines)
— Get rid of my ‘extra stuff’ Map/Unmap buffer and move the stuff into one of the other buffers. Still not sure why I didn’t do this before.
— Only calculate bounding box corners once for passive (stationary) objects. Massive speedup. Bounding box calculation was 9% of the rendering thread’s CPU time before, now negligible. Just a terrible oversight on my part before.
— Draw terrain early in the frame so the GPU can work on it while the CPU does all the object-specific culling and such. This is a technique mentioned on several of the more hardcore optimization sites (like Matt Fisher’s; he wrote GPUView while interning(!) at Microsoft).
— Start reducing branching in my terrain’s pixel shader. This will be one of the biggest speedups of all (going from about 60 fps in an average scene to 100+) if my tests so far prove correct, though I have quite a ways to go. Unfortunately, indexing an array of textures in a pixel shader can’t be done with variables, only compile-time constants. To get around this and do true texture array sampling (via x,y,z coordinates rather than trying to access textures[n]), I will no longer be able to use D3DX11 SRV loading from graphics files and will have to roll my own texture creation and mipmapping. To get started I’m beginning to use stb_image, if you’ve heard of that. It’s nice for lightweight, no-frills loading of png files (for example) into raw RGBA memory, which is handy for pointing D3D at.
What’s the result of all this? Well, I don’t have a good test framework and scene set up for exact before-and-after numbers. But all told, I think frame rates are up anywhere from 50-80% compared to before the optimizations.
More updates soon.