Separable convolutions are a rather interesting type of convolution. They work on two-dimensional inputs and can be applied spatially or depthwise. In the spatial case, we decompose our k × k kernel into two smaller kernels of sizes k × 1 and 1 × k. Instead of applying the k × k kernel, we first apply the k × 1 kernel and then apply the 1 × k kernel to its output. This reduces both the number of parameters in our network and the amount of computation. With the original kernel, we would have to carry out k² multiplications at each step, but with a separable convolution we only have to carry out 2k multiplications, which is fewer whenever k is larger than 2.
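To make the decomposition concrete, here is a minimal NumPy sketch of the spatial case. It assumes the k × k kernel can be written exactly as the outer product of a k × 1 and a 1 × k kernel, which is not true of every kernel. The helper `conv2d_valid`, the kernel values, and the input values are all chosen for illustration and are not taken from the text.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D sliding-window product (no padding, stride 1, no kernel flip)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # One output value costs kh * kw multiplications.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Illustrative kernels (an assumption): a k x 1 column and a 1 x k row.
col = np.array([[1.0], [2.0], [1.0]])   # k x 1
row = np.array([[-1.0, 0.0, 1.0]])      # 1 x k
full = col @ row                        # the equivalent k x k kernel

# Illustrative 6 x 6 input.
image = np.arange(36, dtype=float).reshape(6, 6)

# Apply the full kernel once, or the two small kernels one after the other.
direct = conv2d_valid(image, full)
two_pass = conv2d_valid(conv2d_valid(image, col), row)

# Multiplications per output position for each approach.
k = 3
mults_full = k * k        # 9
mults_separable = 2 * k   # 6

print(np.allclose(direct, two_pass))  # True
print(mults_full, mults_separable)    # 9 6
```

Both routes produce the same 4 × 4 output here. The same factorization also explains the parameter savings: the network stores 2k weights instead of k².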
Suppose we have a 3 × 3 kernel that we want to apply to a 6 × 6 input, ...