Simply put, a Dense layer is a layer where every node is connected to all the nodes in the previous layer. Let’s create a simple neural network and see how the dense layer works. The image below is a simple feedforward neural network with one hidden layer. The input to the network is a vector X with elements x1 and x2, the hidden layer H contains 3 nodes h1, h2 and h3, and finally there is an output layer O with a single node o.

graph LR
    x1(x1) -- w11 --> h1
    x2(x2) -- w21 --> h1
    x1 -- w12 --> h2
    x2 -- w22 --> h2
    x1 -- w13 --> h3
    x2 -- w23 --> h3
    h1 --> o
    h2 --> o
    h3 --> o

In the network, the hidden layer and the output layer are both Dense layers. The hidden layer has 3 nodes and the output layer has 1 node. All nodes in the hidden layer are connected to all nodes in the input layer, and the single node in the output layer is connected to all nodes in the hidden layer.

The connections, also called edges, have weights associated with them. Collectively, these weights form a weight matrix. To calculate how many weights a layer needs, we multiply the number of nodes in the layer by the number of input features. In the network above, there are 2 input features (x1 and x2) and 3 hidden nodes, so the hidden layer has 2*3=6 weights in total.
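As a quick sanity check, here is a minimal sketch of that count (it uses numpy, which we will also use later in this post; the zero values are just placeholders):

import numpy as np

n_inputs, n_hidden = 2, 3           # 2 input features, 3 hidden nodes
W = np.zeros((n_hidden, n_inputs))  # one row of weights per hidden node
print(W.size)                       # 6 weights in total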

Let’s create a concrete example and see how the inputs and weights are used to calculate the output of a node. I’ll show how to calculate the output of node h1, but the same principle applies to all other nodes, including h2, h3 and o.

Let’s say our input is a vector with two elements [10, 20]. These numbers could mean anything depending on your problem. For example, they could be the number of male and female students in a class.

From the diagram, we can see that h1 has connections with weights w11 and w21. Let’s assume the values of these weights are 0.1 and 0.2 respectively. Then the output of h1 is calculated as:

graph LR
    x1 -- w11 --> sum[x1*w11 + x2*w21]
    x2 -- w21 --> sum
    sum --> output
\[x_{1} * w_{11} + x_{2} * w_{21}\]

\[10 * 0.1 + 20 * 0.2 = 5\]

We can follow the same procedure for every hidden and output node in the network to calculate its output.

If you look at the formula, we are essentially computing the dot product of the two vectors \(<x_{1}, x_{2}>\) and \(<w_{11}, w_{21}>\). We can calculate dot products like this for all hidden nodes to get the final output of the hidden layer. Note that the output of a node is a single number, whereas the output of a layer is a vector/list of these numbers.
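For example, the output of h1 can be computed as a single dot product (a minimal sketch with numpy; the variable names are just for illustration):

import numpy as np

x = np.array([10.0, 20.0])    # input vector [x1, x2]
w_h1 = np.array([0.1, 0.2])   # weights [w11, w21] into node h1
print(np.dot(x, w_h1))        # 5.0, the output of h1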

We can also calculate the output of the entire layer in just one step. Instead of computing the dot product for each node one by one, we can convert this step into a single matrix multiplication. This is very fast, as numerical libraries like numpy, tensorflow and pytorch have efficient implementations of these operations.

Let’s implement this in Python using numpy and pytorch.

import numpy as np
import torch

# initialize input data: one row with 2 features
X = np.array([
    [10.0, 20.0],
], dtype="float32")

# initialize weight matrix: one row of weights per hidden node
W = np.array([
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6]
], dtype="float32")

# matrix multiplication with numpy
print(X @ W.T)  # [[ 5. 11. 17.]]

First we create an input X. It is a matrix of shape (1, 2), i.e. one input row with 2 features. Although we just have one input vector, we still represent it as a matrix with 1 row and 2 columns. This is because in most cases you’ll process multiple inputs at the same time. If we had 10 inputs, our input matrix would have shape (10, 2).
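For instance, a batch of inputs could look like this (a hypothetical sketch; the two extra rows are made-up inputs just to show the shapes):

# hypothetical batch of 3 inputs, shape (3, 2)
X_batch = np.array([
    [10.0, 20.0],
    [1.0, 2.0],
    [0.5, 0.5],
], dtype="float32")
print((X_batch @ W.T).shape)  # (3, 3): one output row per input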

We set the values of the first input to 10 and 20, as above. Then, we initialize the weight matrix W of shape (3, 2). Each row in the matrix contains the weights for one hidden node, coming from the input nodes. Notice the shape of the weight matrix: it has 3 rows and 2 columns because there are 3 nodes in the layer and each of them connects to the 2 input nodes.

Finally, to get the output of the layer, we multiply the input with the weight matrix. The weight matrix is transposed so that the matrix multiplication shapes line up. We use the @ operator to perform the matrix multiplication. This is a built-in Python operator for matrix-matrix and matrix-vector products. You can also use the np.matmul function to get the same result.
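For example, both forms below give the same result (a quick check reusing the X and W defined above):

print(np.matmul(X, W.T))                        # [[ 5. 11. 17.]]
print(np.allclose(X @ W.T, np.matmul(X, W.T)))  # True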

The output is a matrix of shape (1, 3). Each row contains the output for one input. In our case there is only one input, so the output also has one row. The 3 columns are the output values of the hidden nodes. We see that the first value is 5, which is the output of node h1 and matches what we calculated above.

Now let’s see how to create a dense layer using pytorch. We use the torch.nn.Linear class to create a dense layer. I’ve set bias=False so that we can check whether our calculations so far are correct. I’ll explain biases shortly.

# convert numpy arrays to torch tensors
X_torch = torch.from_numpy(X)
W_torch = torch.from_numpy(W)

h_torch = torch.nn.Linear(in_features=2, out_features=3, bias=False)
h_torch.weight = torch.nn.Parameter(W_torch)
print(h_torch(X_torch)) # tensor([[ 5., 11., 17.]], grad_fn=<MmBackward>)

First we initialize a dense layer using the Linear class. It takes 3 parameters:

  • in_features : how many features the input contains
  • out_features : how many nodes are in the layer
  • bias : whether to enable the bias or not

Once we create the layer, we assign the weight matrix for this layer and finally get the output. Again, the output is the same as we expected.

In pytorch and tensorflow, we can treat layers like functions, which is why we are able to call h_torch(X_torch). Internally, these classes implement the __call__ method, which tells Python to treat an object like a function.
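To make that concrete, here is a tiny sketch (a made-up class, not part of pytorch) showing how __call__ lets an object be used like a function:

class Doubler:
    def __call__(self, x):
        return 2 * x

d = Doubler()
print(d(21))  # 42: the object is "called" like a function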

Note: The output from a layer is also called a feature vector.

So far we didn’t consider biases. The bias is a vector with the same size as the number of nodes in the layer. This vector is added element-wise to the output vector produced by the nodes. Let’s see this in action. We will create a bias vector of size 3, because there are 3 hidden nodes, and set its values to 1, 2, 3. We expect the output elements to be incremented by those amounts.

X_torch = torch.from_numpy(X)
W_torch = torch.from_numpy(W)
b_torch = torch.from_numpy(np.array([1, 2, 3], dtype="float32"))

h_torch = torch.nn.Linear(in_features=2, out_features=3, bias=True)
h_torch.weight = torch.nn.Parameter(W_torch)
h_torch.bias = torch.nn.Parameter(b_torch)
print(h_torch(X_torch)) # tensor([[ 6., 13., 20.]], grad_fn=<AddmmBackward>)

As expected, the bias vector was added element-wise to the output from the hidden nodes.
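We can verify the same calculation directly in numpy (a quick check reusing X and W from above, with the same bias values):

b = np.array([1, 2, 3], dtype="float32")
print(X @ W.T + b)  # [[ 6. 13. 20.]], matches the pytorch output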

Weights are initialized either randomly or using popular weight initialization methods such as Kaiming, Glorot, etc. Bias vectors are typically initialized as zero vectors.
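For example, this is roughly how a layer could be re-initialized in pytorch using its built-in init helpers (a minimal sketch; the exact scheme you pick depends on your network):

layer = torch.nn.Linear(in_features=2, out_features=3, bias=True)
torch.nn.init.kaiming_uniform_(layer.weight)  # Kaiming (He) initialization for the weights
torch.nn.init.zeros_(layer.bias)              # bias starts as a zero vector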

While a neural network is trained, the weights and biases of every layer are updated so that the network can generalize to the input dataset.

Now, as a final example, let’s implement the complete neural network shown above. Note that we haven’t discussed activation functions in this post, so I’m deliberately ignoring them for the purpose of this demonstration.

h = torch.nn.Linear(in_features=2, out_features=3, bias=True)
o = torch.nn.Linear(in_features=3, out_features=1)

hidden_output = h(X_torch)
print("Hidden output = ", hidden_output)
final_output = o(hidden_output)
print("Final output = ", final_output)

# Hidden output =  tensor([[7.7544, 3.1324, 1.0741]], grad_fn=<AddmmBackward>)
# Final output =  tensor([[0.7573]], grad_fn=<AddmmBackward>)

We created two layers, h and o. First we calculated the output of the hidden layer h, then passed that output to the output layer o to calculate the final output. Usually, before passing data to the next layer, an activation function is applied, which we will discuss in a future post.
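The same two layers can also be wired up with torch.nn.Sequential (a sketch of an equivalent way to compose them; the weights are again randomly initialized, so the numbers will differ from the run above):

model = torch.nn.Sequential(
    torch.nn.Linear(in_features=2, out_features=3, bias=True),
    torch.nn.Linear(in_features=3, out_features=1),
)
print(model(X_torch))  # shape (1, 1): one output value per input row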
