Optimizing Matrix Operations

Last time, we reviewed some major frame rate issues I found in ng and established a process to measure and improve on the problem. We reviewed performance of a long-running test and identified the 4x4 matrix and vector modules as major culprits. Today, we’ll do line-by-line analysis of the slow parts and review my progress optimizing them.

Performance Analysis

Recapping the last measurements, here are the most expensive functions for the 5 worst exceptional frames, according to callgrind. This list was generated by the performance tool I wrote to do this exceptional frame analysis.

Frame  # ms/frame  call site

25798  4167.96     /usr/lib/python3.6/site-packages/ng/math/vector.py:__init__
27314  4120.22     /usr/lib/python3.6/site-packages/ng/ng.py:_entities_velocity
24394  3972.68     /usr/lib/python3.6/site-packages/ng/math/matrix.py:__matmul_
21420  3672.30     /usr/lib/python3.6/site-packages/ng/ng.py:_entities_velocity
20191  3374.32     /usr/lib/python3.6/site-packages/ng/math/matrix.py:__init__

The most expensive operations are initializing new matrix and vector class instances and multiplying matrices. These math operations are used per frame, per entity to update entity positions and to prepare MVP matrices for display output.

_entities_velocity indirectly uses vector operations through its own function calls, so it may drop off the list when vector is optimized. We’ll focus on vector and matrix right now.

The goal of this session is to reduce the number of usages of the vector and matrix modules, as well as to improve the performance of each usage.

Let’s use callgrind_annotate to get more information about the call patterns. To reduce the number of calls to a function, we need to know who is calling. We add the --tree=caller option to get this info. To improve the performance of the function, we need to know which lines perform worst. We add the --auto=yes option so callgrind will annotate the source lines of each module with the amount of time spent in each line.

Here are the header and the call tree for frame 25798, which give us more details on calls to vector.py:__init__.

$ callgrind_annotate --auto=yes --tree=caller  emitter-25798.pstats.cg
Profile data file 'emitter-25798.pstats.cg'
Profiled target:  (unknown)
Events recorded:  ns
Events shown:     ns
Event sort order: ns
Thresholds:       99
Include dirs:
User annotated:
Auto-annotation:  on

4,167,334,999  PROGRAM TOTALS

           ns  file:function

      415,000  < /usr/lib/python3.6/site-packages/ng/ng.py:_entities_velocity (255x) []
       34,000  < /usr/lib/python3.6/site-packages/ng/render/__init__.py:normalized_vertices (32
x) []
       20,000  < /usr/lib/python3.6/site-packages/ng/entity.py:size:128 (8x) []
      521,000  < /usr/lib/python3.6/site-packages/ng/math/bounding_box.py:aa_box (492x) []
       14,000  < /usr/lib/python3.6/site-packages/ng/entity.py:origin:118 (8x) []
4,040,391,999  < /usr/lib/python3.6/site-packages/ng/math/vector.py:__iadd__ (242x) []
       26,000  < /usr/lib/python3.6/site-packages/ng/ng.py:entity_create (8x) []
       34,000  < /usr/lib/python3.6/site-packages/ng/entity.py:velocity:170 (8x) []
       87,000  < /usr/lib/python3.6/site-packages/ng/entity.py:__init__ (40x) []
    1,238,000  < /usr/lib/python3.6/site-packages/ng/render/opengl.py:_draw_entity (984x) []
      278,000  < /usr/lib/python3.6/site-packages/ng/math/vector.py:__mul__ (242x) []
      235,000  < /usr/lib/python3.6/site-packages/ng/entity.py:position:137 (250x) []
       11,000  < /usr/lib/python3.6/site-packages/ng/entity.py:scale:159 (8x) []
4,042,445,000  *  /usr/lib/python3.6/site-packages/ng/math/vector.py:__init__

Most of the calls in the graph look innocuous at a glance. There may be opportunity to optimize some of those call patterns later, but nothing stands out since they are all necessary for the update routine. The calls to vector.py:__init__ from vector.py:__iadd__ and vector.py:__mul__ look suspicious, though. These are bound methods used for class instantiation and operator overloading. The callgrind analysis shows us the lines for __iadd__.

      .      def __iadd__(self, other: Any) -> 'Vector':
147,000          other = Vector(other)
      .          self.x += other.x
180,999          self.y += other.y
278,000  => /usr/lib/python3.6/site-packages/ng/math/vector.py:__init__ (242x)
      .          self.z += other.z
      .          return self

This method overloads the += operator, performing vector addition with a scalar value and mutating the original vector instance. This operation doesn’t need to allocate a new vector, but the first line creates a new vector instance from the user’s input. I intended this as a convenience, allowing the user to call the method with another type such as a tuple. Removing this convenience will significantly reduce the number of Vector allocations. Here’s the change:

def __iadd__(self, other: Vector) -> 'Vector':
    self.x += other.x
    self.y += other.y
    self.z += other.z

    return self

The __sub__ and __add__ methods had a similar pattern, so I also changed those. With these changes, we can expect frame 25798 to be improved and can move to frames 24394 and 20191. These both spend a lot of time in the matrix module. I did a quick comparison of 20191 and 24394 and saw that they are similar, so we can review only 24394.

Here are the callgrind header and tree for frame 24394.

$ callgrind_annotate --auto=yes --tree=caller  emitter-24394.pstats.cg
Profile data file 'emitter-24394.pstats.cg'
Profiled target:  (unknown)
Events recorded:  ns
Events shown:     ns
Event sort order: ns
Thresholds:       99
Include dirs:
User annotated:
Auto-annotation:  on

3,972,006,999  PROGRAM TOTALS

           ns  file:function

3,889,429,000  < /usr/lib/python3.6/site-packages/ng/entity.py:matrix_init (4074x) []
    2,352,000  < /usr/lib/python3.6/site-packages/ng/render/opengl.py:_render_entity (410x) []
    2,343,000  < /usr/lib/python3.6/site-packages/ng/render/opengl.py:_draw_entity (784x) []
3,881,484,999  *  /usr/lib/python3.6/site-packages/ng/math/matrix.py:__matmul__

   57,394,000  < /usr/lib/python3.6/site-packages/ng/render/opengl.py:_render_entity (198x) []
   59,996,999  < /usr/lib/python3.6/site-packages/ng/render/opengl.py:update (7x) []
   21,398,999  *  /usr/lib/python3.6/site-packages/ng/render/opengl.py:_render_entity

   30,022,000  < /usr/lib/python3.6/site-packages/ng/render/opengl.py:_render_entity (205x) []
   10,284,999  *  /usr/lib/python3.6/site-packages/ng/render/opengl.py:_draw_entity

    5,807,000  < /usr/lib/python3.6/site-packages/ng/math/matrix.py:__matmul__ (5268x) []
    5,807,000  *  ~:<built-in method numpy.core.multiarray.matmul>

    5,312,000  < /usr/lib/python3.6/site-packages/ng/math/matrix.py:__init__ (6767x) []
    5,312,000  *  ~:<built-in method numpy.core.multiarray.array>

    7,177,999  < /usr/lib/python3.6/site-packages/ng/entity.py:matrix_init (252x) []
3,882,551,000  < /usr/lib/python3.6/site-packages/ng/entity.py:position:137 (198x) []
   11,371,000  < /usr/lib/python3.6/site-packages/ng/entity.py:rotation:148 (211x) []
      819,000  < /usr/lib/python3.6/site-packages/ng/entity.py:children_add (6x) []
      135,000  < /usr/lib/python3.6/site-packages/ng/entity.py:scale:159 (6x) []
      151,999  < /usr/lib/python3.6/site-packages/ng/entity.py:origin:118 (6x) []
    4,083,000  *  /usr/lib/python3.6/site-packages/ng/entity.py:matrix_init

The worst performer, matrix.py:__matmul__, is the matrix multiplication operator on the Matrix class. The engine makes heavy use of this operation while calculating entity model matrices and while preparing MVP matrices for display rendering.

Again, there might be opportunity to improve caller efficiency, but let’s see how we can optimize this method in the matrix module.

Here are the line-by-line costs of the methods.

            .  class Matrix():
    3,702,999      def __init__(self, values):
    5,312,000  => ~:<built-in method numpy.core.multiarray.array> (6767x)
    2,327,000  => ~:<method 'tolist' of 'numpy.ndarray' objects> (6767x)
            .          self._np_array = np.array(values)
            .          self.values = self._np_array.tolist()
            .      def _columns(self):
            .          return len(self.values[0])
3,881,484,999      def __matmul__(self, other):
    5,807,000  => ~:<built-in method numpy.core.multiarray.matmul> (5268x)
    6,831,999  => /usr/lib/python3.6/site-packages/ng/math/matrix.py:__init__ (5268x)
            .          return Matrix(np.matmul(self._np_array, other._np_array))

The only purpose of the Matrix class is to be a wrapper around the numpy array. The original ng Matrix class implemented matrix operations in Python. That was slow, so I previously changed it to use a numpy array for its underlying operations, leaving Matrix as a wrapper class.

The price of the wrapper class is apparent; every time we create a new matrix or multiply two matrices, we incur the cost of a new Python class instance that serves only to create a new numpy array instance underneath it.

Wrapping third party APIs is a good practice in certain languages and situations, allowing one to protect one’s own APIs from changes to third party APIs. In this case, though, performance is the larger consideration.

If the wrapper class were very important to the design, one potential optimization would be to rewrite the matrix class as a C library. In this case, numpy is ubiquitous in the Python computation space and carries a stable API, so I’m comfortable using it directly instead of wrapping it. I ditched the Matrix class entirely and ng now directly uses numpy arrays for all matrix operations.

The Data

Here’s the data from a new run measured after the changes. I reformatted it a bit to compare it more easily with the data from the last run.


Run 0

Run 1

Session length (s)



ms/frame (mean)



ms/frame (median)



ms/frame (std dev)



Exceptional frames (variation multiple > 0.5



  • sec/exceptional frame



  • ms/frame (mean)



  • ms/frame (median)



  • ms/frame (std dev)



Active entities (mean)



Active entities (median)



Active entities (std dev)



The first thing to highlight is this run length is much shorter than the last. Because there will be several iterations of optimization, I decided to do shorter 10 minute runs for the intermediate iterations. The caveat is we can’t directly compare the summary data from this run to the last; the decreased session length affects the averaged data.

That’s okay. Again, this will be a series of improvements. We don’t need to do a direct session-to-session comparison to answer the questions I want to answer right now.

  • Are there still exceptional frames?

  • If so, how often?

  • If so, are the exceptions getting worse over time?

The first two questions are answered by the summary data. Yes, there are still exceptional frames. They happen several times per minute on average.


Last Run

This Run

Exceptional frames (variation multiple > 0.5



  • sec/exceptional frame



Let’s look at the graphs.


The spikes in the graph indicate the same pattern as before. The frame-to-frame variation worsens over the length of the session.

Here’s the new short list of expensive calls.

Frame # ms/frame  call site
8980  581.74      /usr/lib/python3.6/site-packages/ng/math/bounding_box.py:aa_box
8232  526.43      /usr/lib/python3.6/site-packages/ng/math/vector.py:__init__
7645  494.95      /usr/lib/python3.6/site-packages/ng/math/bounding_box.py:aa_box
6999  478.29      /usr/lib/python3.6/site-packages/ng/math/vector.py:__init__
6484  465.14      /usr/lib/python3.6/site-packages/OpenGL/wrapper.py:calculate_cArgs

Finally, some good news. The matrix module has dropped off the list of worst offenders and numpy’s array multiplication routines are not on the list. This suggests replacing the matrix class with numpy arrays improved things. The vector module is still there, indicating we might be able to improve it further. boundingbox_py.aa_box is new to the list. That function uses the vector module, so it’s reasonable to expect that vector is the bottleneck for aabox. An internal call in the PyOpenGL library is also new to the list this time.


We could continue to optimize the vector module. However, I’m certain we can make improvements there. I have less knowledge and certainty around the PyOpenGL library, because it is third-party code. Also, since ng’s OpenGL renderer is my first experience using OpenGL, I made some early design mistakes and already have some TODO items to fix there. For these reasons, PyOpenGL is the greater concern. I’ll leave the vector module for now and focus on improving OpenGL usage next.

Thanks for reading! Next time, we’ll look at my OpenGL optimizations, and review measurements of another run.