Many numerical computation libraries provide efficient implementations of vectorized operations such as matrix multiplication and dot products. These operations are implemented to utilize multiple CPU cores and to offload computation to a GPU if one is available. Matrix and vector operations are usually provided by BLAS (Basic Linear Algebra Subprograms) implementations; some examples are Intel MKL, OpenBLAS, and cuBLAS.
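
As a quick aside, if you are curious which BLAS implementation your own numpy build is linked against, numpy can report it (the exact output depends on how numpy was installed):

import numpy as np

# prints build information, including the BLAS/LAPACK libraries numpy was linked against
np.show_config()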

In this post, we’ll start with a naive implementation of matrix multiplication and gradually improve its performance. The goal of this post is to highlight how existing numerical libraries with vectorized operations can significantly speed up computations. We’ll be using the numpy and tensorflow libraries for this demo. This demo was prepared in a Jupyter Notebook, so we’ll use some Jupyter magic commands to measure execution time.

First, let’s create two matrices and use numpy’s matmul function to perform matrix multiplication, so that we can use the result to check whether our implementations are correct.

import tensorflow as tf
import numpy as np
tf.__version__ # 2.0.0

a = np.random.normal(size=(200, 784)).astype('float32')
b = np.random.normal(size=(784, 10)).astype('float32')

expected = np.matmul(a, b)

Pure Python implementation

Our first implementation will be written in pure Python; we will not use any external libraries. We need three loops here: the first iterates over the rows of the first matrix, the second over the columns of the second matrix, and the third over the elements of the \(i_{th}\) row of a and the \(j_{th}\) column of b. We multiply the elements of the \(i_{th}\) row and the \(j_{th}\) column pairwise and sum the products. The final sum is the value of output[i, j].

def py_matmul1(a, b):
    ra, ca = a.shape
    rb, cb = b.shape
    assert ca == rb, f"{ca} != {rb}"
    
    output = np.zeros(shape=(ra, cb))
    # loop over rows of a, columns of b, and the shared inner dimension
    for i in range(ra):
        for j in range(cb):
            for k in range(rb):
                output[i, j] += a[i, k] * b[k, j]
                
    return output

%time result = py_matmul1(a, b)
assert result.shape == expected.shape
assert np.allclose(result, expected, rtol=1e-02), (result, expected)

When executed, it takes 1.38 s on my machine. It is quite slow and can be improved significantly. If you look closely, the innermost loop is just computing the dot product of two vectors, in this case the \(i_{th}\) row of a and the \(j_{th}\) column of b. So let’s replace the innermost loop with a dot product.
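
To make that equivalence concrete, here is a small sanity check (an extra snippet, not part of the notebook above) showing that the scalar loop and np.dot agree for a single output cell:

# compare the scalar loop against np.dot for one output cell, e.g. i = 0, j = 0
i, j = 0, 0
loop_value = sum(a[i, k] * b[k, j] for k in range(b.shape[0]))
dot_value = np.dot(a[i], b[:, j])
assert np.allclose(loop_value, dot_value, rtol=1e-02)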

Vectorization I

def py_matmul2(a, b):
    ra, ca = a.shape
    rb, cb = b.shape
    assert ca == rb, f"{ca} != {rb}"
    
    output = np.zeros(shape=(ra, cb))
    for i in range(ra):
        for j in range(cb):
            # we replaced the innermost loop with a dot product
            output[i, j] = np.dot(a[i], b[:,j])
                
    return output

%time result = py_matmul2(a, b)
assert result.shape == expected.shape
assert np.allclose(result, expected, rtol=1e-02), (result, expected)

This implementation takes just 6 ms, a huge improvement over the naive implementation. Since the inner loop was essentially computing a dot product, we replaced it with the np.dot function, passing in the \(i_{th}\) row of matrix a and the \(j_{th}\) column of matrix b.

Vectorization II

Now let’s remove the for loop where we iterate over the columns of matrix b.

def py_matmul3(a, b):
    ra, ca = a.shape
    rb, cb = b.shape
    assert ca == rb, f"{ca} != {rb}"
    
    output = np.zeros(shape=(ra, cb))
    for i in range(ra):
        # compute the entire i-th row of the output in one call
        output[i] = np.dot(a[i], b)

    return output

%time result = py_matmul3(a, b)
assert result.shape == expected.shape
assert np.allclose(result, expected, rtol=1e-02), (result, expected)

This implementation takes 2.97 ms. By removing one more loop, the single line output[i] = np.dot(a[i], b) computes the entire \(i_{th}\) row of the output matrix: np.dot multiplies the row vector a[i] with every column of matrix b in one vectorized call. This kind of vectorized behaviour, along with broadcasting rules, is pretty much the same across major libraries like numpy, tensorflow, and pytorch.
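
As a small illustration of that point (not part of the original benchmark), adding a bias vector to a matrix broadcasts the same way in numpy and tensorflow:

# broadcast a (10,) bias vector across a (200, 10) matrix in both libraries
bias = np.random.normal(size=(10,)).astype('float32')
np_out = expected + bias                              # numpy broadcasts bias over every row
tf_out = tf.constant(expected) + tf.constant(bias)    # tensorflow applies the same rule
assert np.allclose(np_out, tf_out.numpy())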

Numpy matmul

Now let’s use numpy’s built-in matmul function.

def py_matmul4(a, b):
    ra, ca = a.shape
    rb, cb = b.shape
    assert ca == rb, f"{ca} != {rb}"
    
    return np.matmul(a, b)
    

%time result = py_matmul4(a, b)
assert result.shape == expected.shape
assert np.allclose(result, expected, rtol=1e-02), (result, expected)

Using numpy’s built-in matmul function, it takes 999 \(\mu\)s, which is the fastest of all the implementations so far.
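
As a side note, since Python 3.5 the @ operator dispatches to the same matrix multiplication for numpy arrays, so the call above can also be written more compactly:

# a @ b is equivalent to np.matmul(a, b) for 2-D arrays
result_at = a @ b
assert np.allclose(result_at, expected, rtol=1e-02)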

Tensorflow matmul

Tensorflow is very similar to numpy here: we just need to call its matmul function.

def py_matmul5(a, b):
    ra, ca = a.shape
    rb, cb = b.shape
    assert ca == rb, f"{ca} != {rb}"
    
    return tf.matmul(a, b)
    
tf_a = tf.constant(a)
tf_b = tf.constant(b)
%time result = py_matmul5(tf_a, tf_b)
assert result.shape == expected.shape
assert np.allclose(result, expected, rtol=1e-02), (result, expected)

It takes about 999 \(\mu\)s for tensorflow to compute the result. We can pass the numpy arrays directly without converting them to tensorflow tensors, but that performs a bit more slowly. In my experiments, calling py_matmul5(a, b) directly takes about 10 ms, while converting the numpy arrays to tf.Tensor objects with the tf.constant function beforehand yielded much better performance.
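
If you want to reproduce that comparison yourself, something along these lines contrasts the two call styles (a sketch; the exact timings will vary by machine):

# numpy arrays are converted to tensors on every call
%timeit py_matmul5(a, b)

# tensors created once with tf.constant pay the conversion cost up front
%timeit py_matmul5(tf_a, tf_b)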

Conclusion

In this post we saw different ways to do matrix multiplication. Along the way, we also looked at how to remove loops from our code and use optimized library functions for better performance. Most operations in neural networks are essentially tensor operations, i.e. matrix multiplications, dot products, etc., so getting familiar with the functions the libraries provide for these operations is helpful.
