In the last six years we have seen huge improvements in the performance of mobile phone system-on-a-chip designs, with gains spanning the CPU, the GPU, and the DDR memory system. The scale of these performance increases (100x on the CPU and 300x on the GPU by my reckoning) has made it possible to start treating high-end mobile phones as a target for the types of graphics algorithms which were previously only possible on games consoles and PC platforms. However, the mobile form factor places obstacles in the path of developers, and these must be successfully avoided.
System Power Optimization
The most significant limitation of smartphones from a performance point of view is the form factor. Passively cooling a chip inside a sealed case is never an easy task, and the ability of a device to dissipate heat determines how much power it can sustainably draw during gameplay. The challenge for AAA content on mobile is to get as much useful work as possible out of that thermally stable power budget, which is somewhere between 2 and 3 Watts for a typical smartphone.
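To make that budget concrete, it helps to turn Watts into energy per frame. The following is a minimal sketch of the arithmetic; the function name and the choice of 30 and 60 frames per second are my own assumptions for illustration, and the figure it produces is for the whole system, not just the GPU.

```python
def energy_per_frame_mj(sustained_watts: float, frames_per_second: float) -> float:
    """Energy available for one frame, in millijoules (1 W = 1 J/s)."""
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    return sustained_watts / frames_per_second * 1000.0


if __name__ == "__main__":
    # Sweep the 2 to 3 Watt range quoted above at two assumed frame rates.
    for watts in (2.0, 3.0):
        for fps in (30.0, 60.0):
            budget = energy_per_frame_mj(watts, fps)
            print(f"{watts:.1f} W at {fps:.0f} fps: {budget:.1f} mJ per frame")
```

Everything the CPU, GPU, and DDR do to produce a frame has to fit inside that per-frame energy figure, which is why the rest of this article is about not wasting it.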
Effectiveness vs Efficiency
One of my favourite quotes, from Peter Drucker’s book “The Effective Executive”, is:
Efficiency is doing things right; effectiveness is doing the right things.
While originally aimed at business managers, this quote can be re-tasked for our purposes of achieving the best possible performance in game content. Whenever working on content optimization it is human nature to focus effort on the algorithms and assets already implemented in the game – efficiency – rather than taking a step back to review if new approaches in the rendering pipeline or more significant reworking of assets could give bigger improvements – effectiveness.
What is “effectiveness” for console quality gaming on mobile? In my opinion it means spending energy on CPU cycles, GPU cycles, and DDR memory accesses which actually result in a visible output on the screen. Any cycle or byte we spend on something which is not visible is energy wasted and visual quality lost. In the remainder of this article I’ll present my top five fundamental principles which should be considered when trying to optimize content for mobile devices.
Principle One: Remove major redundancy in the application
The game application is the top of the stack and the only part which has overall knowledge of the scene structure; by the time draw operations reach the graphics driver that structural knowledge has been lost and cannot be exploited. Applications must aggressively exploit their scene knowledge to cull work which can be proven to be out of camera-shot, using techniques such as scene-graph node bounding boxes, zone and portal visibility information, primitive visibility sets, etc.
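As an illustration of the bounding-box approach, here is a minimal sketch of a conservative box-versus-view-volume test. The `AABB` type, the plane convention, and the box-shaped `VIEW_VOLUME` are assumptions made for this example; a real engine would derive six planes from its camera matrices.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# A plane is (nx, ny, nz, d); a point p is inside when n.p + d >= 0.
Plane = Tuple[float, float, float, float]


@dataclass(frozen=True)
class AABB:
    lo: Tuple[float, float, float]  # minimum corner
    hi: Tuple[float, float, float]  # maximum corner


def is_outside(box: AABB, planes: Sequence[Plane]) -> bool:
    """Return True only if the box is provably outside at least one plane."""
    for nx, ny, nz, d in planes:
        # Pick the corner of the box which lies furthest along the plane normal.
        px = box.hi[0] if nx >= 0 else box.lo[0]
        py = box.hi[1] if ny >= 0 else box.lo[1]
        pz = box.hi[2] if nz >= 0 else box.lo[2]
        # If even that corner is behind the plane, the whole box is.
        if nx * px + ny * py + nz * pz + d < 0:
            return True
    return False


# Hypothetical view volume: x and y in [-1, 1], z in [-10, -1].
VIEW_VOLUME = [
    (1.0, 0.0, 0.0, 1.0), (-1.0, 0.0, 0.0, 1.0),
    (0.0, 1.0, 0.0, 1.0), (0.0, -1.0, 0.0, 1.0),
    (0.0, 0.0, -1.0, -1.0), (0.0, 0.0, 1.0, 10.0),
]
```

The test is deliberately conservative: it never culls anything visible, but it may keep a box which is just outside a corner of the volume. That is usually the right trade-off, because a false "visible" only costs some work while a false "hidden" causes a rendering error.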
In the Afterpulse multiplayer third-person shooter, developed by Digital Legends, their game engine Karisma takes proactive steps to cull occluded meshes by inserting simplified occlusion volumes into the rendering pipeline. This allows the engine to discard complex meshes in the next frame if they are occluded by geometry which is closer to the camera. The images below show the effect of occlusion culling (top) compared to the unoptimized scene (bottom) when the player’s view is blocked by a vehicle.
Game engine culling schemes can also be used to minimize CPU usage, freeing up more Joules of energy which we can spend on rendering instead. Examples include culling an entire sub-tree in the scene graph with a single test rather than testing every node in it, and evaluating logic and physics updates for off-screen game elements at a lower frequency or precision.
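The sub-tree case can be sketched as follows. This is a hedged example, not how Karisma does it: the `Node` type and the `is_visible` callback are hypothetical, and it assumes each parent's bounds enclose all of its children so that rejecting a parent is safe.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def collect_visible(node: Node,
                    is_visible: Callable[[Node], bool],
                    stats: Dict[str, int]) -> List[str]:
    """Depth-first walk which stops descending as soon as a node is rejected."""
    stats["tests"] += 1
    if not is_visible(node):
        return []  # One test rejects the node and everything beneath it.
    visible = [node.name]
    for child in node.children:
        visible.extend(collect_visible(child, is_visible, stats))
    return visible


def count_nodes(node: Node) -> int:
    return 1 + sum(count_nodes(child) for child in node.children)


# Hypothetical scene: a visible street and a building interior which is hidden.
SCENE = Node("root", [
    Node("street", [Node("car_a"), Node("car_b")]),
    Node("interior", [Node("desk"), Node("chair"), Node("lamp")]),
])
HIDDEN = {"interior"}  # Stand-in for a real bounds or portal test.

if __name__ == "__main__":
    stats = {"tests": 0}
    names = collect_visible(SCENE, lambda n: n.name not in HIDDEN, stats)
    print(names, f"{stats['tests']} tests for {count_nodes(SCENE)} nodes")
```

In this toy scene the three props inside the hidden interior are never tested at all; the saving grows with the size of the rejected sub-tree.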
Principle Two: Help the hardware remove the in-frustum redundancy
A game engine can remove work which is outside the frustum relatively easily, but removing redundancy inside the frustum is much harder. One of the major potential sources of inefficiency there is fragment overdraw, where occluded fragments are shaded and then subsequently overwritten by fragments closer to the camera.
All GPUs support early depth and stencil testing (early-zs), which allows fragments which fail these tests to be discarded before shading. To maximize the utility of early-zs testing, first render opaque geometry in front-to-back depth order with depth testing enabled using a GL_LESS comparison. Then render transparencies in a back-to-front pass with depth testing enabled but depth writes disabled. This ensures the hardware can remove as much overdraw as possible while still allowing correct blending.
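The submission order described above can be sketched on the application side like this. The `Draw` record and its `view_depth` field are assumptions for the example; a real engine would typically sort on a per-object distance or a coarse depth bucket.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Draw:
    name: str
    view_depth: float  # Distance from the camera; larger means further away.
    transparent: bool


def order_draws(draws: Iterable[Draw]) -> List[Draw]:
    """Opaque draws front-to-back first, then transparent draws back-to-front."""
    draws = list(draws)
    # Opaque pass: nearest first, so later fragments fail the early depth test.
    # (Depth test GL_LESS, depth writes enabled.)
    opaque = sorted((d for d in draws if not d.transparent),
                    key=lambda d: d.view_depth)
    # Transparent pass: furthest first, so blending composites correctly.
    # (Depth test still enabled, depth writes disabled.)
    blended = sorted((d for d in draws if d.transparent),
                     key=lambda d: d.view_depth, reverse=True)
    return opaque + blended


if __name__ == "__main__":
    frame = [
        Draw("terrain", 50.0, False),
        Draw("player", 2.0, False),
        Draw("window", 10.0, True),
        Draw("smoke", 30.0, True),
        Draw("car", 8.0, False),
    ]
    print([d.name for d in order_draws(frame)])
```

An exact per-object sort is not required for the opaque pass; an approximate front-to-back order already removes most of the overdraw, so a cheap sort key is usually enough.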
Principle Three: Amortize software overheads
Despite our best efforts graphics drivers are not zero-cost; committing operations into the command stream consumes CPU cycles. To reduce CPU use it is important that applications batch draw operations together to make larger submissions into the graphics API. This may require the use of texture atlases and similar techniques to merge render states together, which will allow for larger batches. This advice holds true even for the new low-overhead APIs such as Vulkan; batching will help minimize CPU usage even there.
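To show why atlases help, here is a small sketch which groups draws by the render state they need. The `DrawCall` record, the state key of shader plus texture, and the asset names are all hypothetical; a real engine would also remap texture coordinates into the atlas, which is omitted here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class DrawCall(NamedTuple):
    shader: str
    texture: str
    mesh: str


def batch_by_state(draws: Iterable[DrawCall]) -> Dict[Tuple[str, str], List[str]]:
    """Group meshes which share a shader and texture into one submission."""
    batches: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for draw in draws:
        batches[(draw.shader, draw.texture)].append(draw.mesh)
    return dict(batches)


def use_atlas(draws: Iterable[DrawCall], atlas: str) -> List[DrawCall]:
    """Point every draw at a shared atlas texture (UV remapping not shown)."""
    return [draw._replace(texture=atlas) for draw in draws]


PROPS = [
    DrawCall("lit", "crate.png", "crate_1"),
    DrawCall("lit", "barrel.png", "barrel_1"),
    DrawCall("lit", "sign.png", "sign_1"),
    DrawCall("lit", "crate.png", "crate_2"),
]

if __name__ == "__main__":
    print(len(batch_by_state(PROPS)), "batches without an atlas")
    print(len(batch_by_state(use_atlas(PROPS, "props_atlas.png"))), "batch with one")
```

With separate textures the four props need three state changes and three submissions; once they share an atlas they collapse into a single batch.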
Principle Four: Optimize your data streams
GPUs are data-plane processors which consume large vertex and texture payloads, so optimizing these data streams is critical when trying to minimize DDR power. The aim here is to minimize the number of bytes we need to convey the necessary information to the GPU.
The easiest means to control geometry bandwidth is to reduce the number of triangles being used. This will have an impact on object silhouette quality, but a balance can be struck by using a reduced triangle count for meshes which are further from the camera, and normal maps to improve lighting inside the silhouette edge. In addition, per-vertex bandwidth can be minimized by using lower precision inputs, and by minimizing padding and unused fields. Afterpulse makes heavy use of compact data formats, such as using GL_INT_2_10_10_10_REV for normals and GL_BYTE vectors for color values, to ensure it makes the best use of the available bandwidth.
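As a sketch of what a compact normal format buys you, the following packs a unit normal into a single 32-bit word using the signed 10-10-10-2 layout (GL_INT_2_10_10_10_REV in OpenGL ES 3.0, with x in the lowest bits). The helper names are my own, and I am not claiming this is how Afterpulse packs its data.

```python
import struct
from typing import Tuple


def _to_snorm10(value: float) -> int:
    """Quantize [-1, 1] to a signed 10-bit integer, stored as two's complement."""
    clamped = max(-1.0, min(1.0, value))
    return int(round(clamped * 511.0)) & 0x3FF


def _from_snorm10(bits: int) -> float:
    signed = bits - 1024 if bits >= 512 else bits
    return max(signed / 511.0, -1.0)


def pack_normal(x: float, y: float, z: float) -> int:
    """Pack x, y, z into bits 0-9, 10-19, 20-29; the 2-bit w field is left at 0."""
    return _to_snorm10(x) | (_to_snorm10(y) << 10) | (_to_snorm10(z) << 20)


def unpack_normal(word: int) -> Tuple[float, float, float]:
    return (_from_snorm10(word & 0x3FF),
            _from_snorm10((word >> 10) & 0x3FF),
            _from_snorm10((word >> 20) & 0x3FF))


if __name__ == "__main__":
    full = struct.calcsize("<3f")   # Three 32-bit floats.
    packed = struct.calcsize("<I")  # One 32-bit word.
    print(f"{full} bytes per normal as floats, {packed} bytes packed")
    print(unpack_normal(pack_normal(0.0, 0.6, 0.8)))
```

The packed normal is a third of the size of three floats, at the cost of a quantization step of roughly 1/511 per component, which is normally invisible for lighting.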
Static texture data should also be compressed whenever possible, and should use mipmapping to dynamically match data size to fragment size. The Adaptive Scalable Texture Compression (ASTC) LDR profile, which provides a flexible selection of format and bitrate for both color and non-color data, is mandatory in OpenGL ES 3.2 and available in Vulkan as an optional device feature, so it’s a great time to re-review your assets and the compression schemes and bitrates you are using.
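When reviewing bitrates it is useful to have the numbers to hand. ASTC always stores one 128-bit block per footprint of texels, so the bitrate follows directly from the block size. The sketch below works through that arithmetic; the function names are mine, and the 1024x1024 texture is just an example.

```python
ASTC_BLOCK_BYTES = 16  # Every ASTC block is 128 bits, whatever its footprint.


def astc_bits_per_texel(block_w: int, block_h: int) -> float:
    return ASTC_BLOCK_BYTES * 8 / (block_w * block_h)


def astc_level_bytes(width: int, height: int, block_w: int, block_h: int) -> int:
    """Size of one mip level; partial blocks at the edges still cost a full block."""
    blocks_x = -(-width // block_w)   # Ceiling division.
    blocks_y = -(-height // block_h)
    return blocks_x * blocks_y * ASTC_BLOCK_BYTES


def astc_mip_chain_bytes(width: int, height: int, block_w: int, block_h: int) -> int:
    """Total size of a full mip chain down to 1x1."""
    total = 0
    while True:
        total += astc_level_bytes(width, height, block_w, block_h)
        if width == 1 and height == 1:
            return total
        width, height = max(1, width // 2), max(1, height // 2)


if __name__ == "__main__":
    raw = 1024 * 1024 * 4  # Uncompressed RGBA8 for comparison.
    print(f"RGBA8: {raw} bytes")
    for block in ((4, 4), (6, 6), (8, 8), (12, 12)):
        size = astc_level_bytes(1024, 1024, *block)
        print(f"ASTC {block[0]}x{block[1]}: {astc_bits_per_texel(*block):.2f} bpp, {size} bytes")
```

Moving a 1024x1024 texture from uncompressed RGBA8 to ASTC 8x8 reduces the top mip level from 4 MiB to 256 KiB, which is the kind of saving that makes a second look at every asset worthwhile.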
Principle Five: Optimize shaders
The final stage of optimization is to focus on optimizing shader programs to use as few cycles and as little power as possible. Specialize shaders for each use: avoid control flow decisions based on uniform values, and move uniform-on-uniform and uniform-on-constant computations out of the shader. Both steps avoid repeating the same computation for every program invocation in a draw call.
Like the data stream optimizations, also aim to use the lowest computational precision which works. The Afterpulse development process defaulted to using “mediump” fp16 operations in all shaders, and only increased precision to “highp” on a case-by-case basis when artefacts became visible.
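One common way to specialize shaders is to build variants at load time by injecting preprocessor defines, so that each variant contains no uniform-driven branch at all. The following is a minimal sketch of that idea; the `specialize` helper, the `USE_FOG` switch, and the shader itself are hypothetical.

```python
from typing import Dict

# Hypothetical GLSL ES 3.00 fragment shader with a compile-time feature switch.
TEMPLATE = """#version 300 es
precision mediump float;
uniform sampler2D u_albedo;
in vec2 v_uv;
out vec4 o_color;
void main() {
    vec4 color = texture(u_albedo, v_uv);
#if USE_FOG
    color.rgb = mix(color.rgb, vec3(0.5), 0.25);
#endif
    o_color = color;
}
"""


def specialize(source: str, defines: Dict[str, bool]) -> str:
    """Insert #define lines after the #version line, which must stay first."""
    lines = source.split("\n")
    insert_at = 1 if lines and lines[0].startswith("#version") else 0
    define_lines = [f"#define {name} {int(value)}"
                    for name, value in sorted(defines.items())]
    return "\n".join(lines[:insert_at] + define_lines + lines[insert_at:])


if __name__ == "__main__":
    for fog in (False, True):
        variant = specialize(TEMPLATE, {"USE_FOG": fog})
        print(variant.split("\n")[1])
```

Each variant is compiled once and selected per draw on the CPU, so the decision is made once per draw call instead of once per fragment. The cost is more shader programs to manage, so it pays to specialize only on switches which really change the work done.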
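To get a feel for where fp16 runs out of precision, you can round-trip values through a 16-bit float on the CPU. This is only an approximation of shader behaviour: "mediump" sets a minimum precision, and I am assuming an fp16 implementation as described above. The helper name and the sample values are mine.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a value to the nearest IEEE 754 half-precision float and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


if __name__ == "__main__":
    # Colour-like values in [0, 1] survive with a small relative error.
    print(0.3, "->", to_fp16(0.3))
    # Whole numbers above 2048 can no longer all be represented.
    print(2049.0, "->", to_fp16(2049.0))
    # A texture coordinate addressing the last texel of a 4096-wide texture
    # falls between two representable values, so it cannot be stored exactly.
    last_texel = 4095.0 / 4096.0
    print(last_texel, "->", to_fp16(last_texel))
```

Colours and normalized directions are good candidates for "mediump", while texture coordinates for large textures and world-space positions are the usual places where artefacts appear first and "highp" is needed.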
The trick behind successfully deploying complex rendering pipelines onto mobile devices is simply to make sure you are spending your precious energy budget where it makes a visible difference in the on-screen output. Make effective use of CPU and GPU cycles and DDR bandwidth by first removing the redundant work as early in the pipeline as possible, and then focus on efficiency, aiming to minimize the flops, bytes, and primitives required to render the parts of the scene which are visible on screen. Come and see our talk from ARM and Digital Legends at GDC 2017, and check out https://developer.arm.com/graphics/tutorials if you would like to know more.