Convolutions For Video Understanding

Anas R.
5 min read · Jan 29, 2023

Talking about convolutions in the era of transformers, diffusion, and RLHF (ChatGPT)? I am not old school but an engineer whose job is to find the best solution.

Without further ado, let’s start.

Videos have become ubiquitous in today’s digital landscape, with an ever-increasing volume produced every day. Extracting meaningful information from these videos, known as video understanding, is crucial for a wide range of applications such as content analysis, surveillance, and entertainment. Video understanding involves extracting spatiotemporal information from a stack of frames, which captures both the visual content and the temporal dynamics of the video. This information can be used to classify, detect, or track objects, actions, and events within the video.

Architectures for Video Understanding

Machine learning researchers have proposed various architectures for recognizing and interpreting visual and temporal information. One popular approach is to use 3D CNNs, which are designed to capture both the spatial and temporal aspects of videos. Another approach is to use 2D CNNs as per-frame feature extractors, whose outputs are then fed into sequence models such as LSTMs or GRUs. Recently, Transformer architectures, known for their ability to handle sequential data, have also been used for video modeling. I have covered details of some of these architectures here.

In this article, we will try to summarize some Convolutional Neural Network based architectures in the literature.

Convolutional Neural Networks

In a convolutional neural network, filters slide along input features and calculate results known as feature maps. A nice visualization of how these architectures work while changing parameters can be seen here.

A 3×3 convolution filter is run over the blue input [image], returning the green output [feature map] (source).
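To make the sliding-filter picture concrete, here is a minimal pure-Python sketch of a single 3×3 filter moving over a small input with no padding and a stride of 1. The 5×5 input, the all-ones filter, and the function name `conv2d_valid` are made up for illustration. As in most deep learning libraries, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the filter with the patch underneath it and sum the products.
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Hypothetical 5x5 input (every row is 1..5) and a 3x3 filter of ones.
image = [[1, 2, 3, 4, 5] for _ in range(5)]
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # a 3x3 feature map: (5 - 3 + 1) positions along each axis
```

Each output value depends only on the 3×3 patch under the filter, and the same nine weights are reused at every position.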

Convolutional Neural Networks (CNNs) have been popular in computer vision partly because sharing filter weights across positions gives them far fewer parameters than fully connected networks. Although CNNs have been the de facto solution for images for many years, they have also been successfully applied to other data types such as graphs, videos, and time series.

Below we are going to discuss details of some CNN architectures that have been used for Video Understanding.

Inflated 3D (I3D) Architectures

I3D architectures convert a pre-trained 2D CNN, the Inception network, into a 3D CNN by inflating its weights to 3D. Each 2D filter is replicated along the temporal dimension, and the inflated network is then trained on video data. This allows the network to reuse the spatial features learned from images while learning temporal features, making it suitable for video-understanding tasks. Furthermore, the I3D architecture uses 3D pooling layers, which capture temporal information by pooling across time.
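The inflation step can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the I3D authors' code: the helper name `inflate_conv2d` and the small randomly initialized `Conv2d` standing in for a pre-trained layer are assumptions. The division by the temporal kernel size follows the rescaling described in the I3D paper, which makes the inflated filter respond to a video of identical frames exactly as the 2D filter responds to the single image.

```python
import torch
import torch.nn as nn


def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int) -> nn.Conv3d:
    """Build a Conv3d whose filters are the 2D filters repeated `time_dim` times."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(0, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kh, kw) -> (out, in, t, kh, kw), rescaled by 1 / t
        weight = conv2d.weight.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
        conv3d.weight.copy_(weight)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d


torch.manual_seed(0)
conv2d = nn.Conv2d(3, 8, kernel_size=3, padding=1)  # stand-in for a pre-trained layer
conv3d = inflate_conv2d(conv2d, time_dim=3)

image = torch.randn(1, 3, 16, 16)                 # (b, c, h, w)
video = image.unsqueeze(2).repeat(1, 1, 3, 1, 1)  # same frame repeated: (b, c, t, h, w)

with torch.no_grad():
    out2d = conv2d(image)  # (1, 8, 16, 16)
    out3d = conv3d(video)  # (1, 8, 1, 16, 16)

print(torch.allclose(out3d[:, :, 0], out2d, atol=1e-5))
```

Starting from weights that already behave sensibly on static content means training on video only has to add the temporal behavior on top.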


Other performant 2D architectures have also been translated to 3D data. Some examples are EfficientNet 3D and ResNet 3D.

S3D Architectures

The S3D architecture separates spatial and temporal processing in video modeling. It uses 2D convolutions to process the spatial information in each frame of the video and 1D convolutions to process the temporal information across frames.


There are other approaches like X3D which also use spatial and temporal convolutions for processing videos but use a different backbone network.

In the above and following architectures, the 2D spatial convolution is represented by Conv3D with a filter size of (1, 3, 3), which means the layer does not take temporal changes into account. On the other hand, the 1D temporal convolution is represented by Conv3D with a filter size of (3, 1, 1). In this case, the layer does not take spatial changes into account.
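One way to check this is to perturb a single frame and see which output frames change. The sketch below uses `torch.nn.Conv3d` with the two kernel shapes mentioned above. The channel count, clip size, and helper name `changed_frames` are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
spatial = nn.Conv3d(4, 4, kernel_size=(1, 3, 3), padding=(0, 1, 1))   # 2D spatial
temporal = nn.Conv3d(4, 4, kernel_size=(3, 1, 1), padding=(1, 0, 0))  # 1D temporal

video = torch.randn(1, 4, 8, 16, 16)  # (b, c, t, h, w)
perturbed = video.clone()
perturbed[:, :, 5] += 1.0             # change frame 5 only


def changed_frames(layer, a, b, tol=1e-6):
    """Indices of output frames that differ between inputs `a` and `b`."""
    with torch.no_grad():
        diff = (layer(a) - layer(b)).abs().amax(dim=(0, 1, 3, 4))
    return torch.nonzero(diff > tol).flatten().tolist()


spatial_frames = changed_frames(spatial, video, perturbed)
temporal_frames = changed_frames(temporal, video, perturbed)
print(spatial_frames)   # only the perturbed frame is affected
print(temporal_frames)  # the perturbed frame and its two neighbors are affected
```

The (1, 3, 3) layer treats every frame independently, while the (3, 1, 1) layer mixes each pixel with the same pixel in the previous and next frame.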

Mixed Convolution (MC)

Mixed convolution (MC) uses 3D convolutions only in the early layers of the network, with 2D convolutions in the top layers. The authors give the rationale that motion modeling is a low/mid-level operation that can be implemented via 3D convolutions in the early layers of a network, and spatial reasoning over these mid-level motion features (implemented by 2D convolutions in the top layers) leads to accurate action recognition.

# PyTorch-styled sample of Mixed Convolution (MC); input shape: (b, c, t, h, w)
Conv3d(kernel_size=(3, 3, 3))  # early layers: full 3D spatiotemporal convolution
Conv3d(kernel_size=(1, 3, 3))  # top layers: 2D convolution, spatial filters only
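A slightly fuller sketch of the mixed-convolution layout is given below. It is a toy four-layer stack, not the network from the paper: the depth, the channel widths, the point where 3D switches to 2D, and the helper name `conv_block` are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch, kernel_size):
    """Conv3d followed by ReLU, padded so that t, h, and w are preserved."""
    padding = tuple(k // 2 for k in kernel_size)
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size, padding=padding),
        nn.ReLU(inplace=True),
    )


# Hypothetical layout: 3D convolutions in the first two layers, 2D afterwards.
mc_net = nn.Sequential(
    conv_block(3, 16, (3, 3, 3)),   # early layer: motion modeling
    conv_block(16, 16, (3, 3, 3)),  # early layer: motion modeling
    conv_block(16, 32, (1, 3, 3)),  # top layer: spatial reasoning only
    conv_block(32, 32, (1, 3, 3)),  # top layer: spatial reasoning only
)

video = torch.randn(2, 3, 8, 32, 32)  # (b, c, t, h, w)
with torch.no_grad():
    features = mc_net(video)

kernel_sizes = [block[0].kernel_size for block in mc_net]
print(features.shape, kernel_sizes)
```

Because the top layers use (1, 3, 3) kernels, they hold a third of the weights of their 3D counterparts with the same channel widths.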


R(2+1)D factorizes 3D convolutional filters into separate spatial and temporal components. Each 3D convolution is replaced by two consecutive operations, a 2D spatial convolution and a 1D temporal convolution, which places an additional nonlinear rectification between the two. This allows the model to represent more complex functions with the same number of parameters. Additionally, the decomposition facilitates optimization, resulting in lower training and testing loss.

# PyTorch-styled sample of R(2+1)D; input shape: (b, c, t, h, w)
Conv3d(kernel_size=(1, 3, 3))  # spatial convolution
Conv3d(kernel_size=(3, 1, 1))  # temporal convolution
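The "same number of parameters" claim can be checked with a small runnable sketch. The intermediate channel count follows the formula given in the R(2+1)D paper, M = floor(t·d²·N_in·N_out / (d²·N_in + t·N_out)). The class name `Conv2Plus1D`, the 64-channel example, and the omission of batch normalization and bias terms are simplifications assumed here.

```python
import torch
import torch.nn as nn


def mid_channels(n_in, n_out, t=3, d=3):
    """Intermediate width that matches the parameter count of a t x d x d 3D conv."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


class Conv2Plus1D(nn.Sequential):
    """2D spatial conv -> ReLU -> 1D temporal conv (a simplified R(2+1)D block)."""

    def __init__(self, n_in, n_out, t=3, d=3):
        m = mid_channels(n_in, n_out, t, d)
        super().__init__(
            nn.Conv3d(n_in, m, kernel_size=(1, d, d), padding=(0, d // 2, d // 2), bias=False),
            nn.ReLU(inplace=True),  # the extra nonlinearity between the two convolutions
            nn.Conv3d(m, n_out, kernel_size=(t, 1, 1), padding=(t // 2, 0, 0), bias=False),
        )


def count_params(module):
    return sum(p.numel() for p in module.parameters())


full_3d = nn.Conv3d(64, 64, kernel_size=3, padding=1, bias=False)
factored = Conv2Plus1D(64, 64)

video = torch.randn(1, 64, 8, 14, 14)  # (b, c, t, h, w)
with torch.no_grad():
    out = factored(video)

print(mid_channels(64, 64), count_params(full_3d), count_params(factored))
```

For 64 input and 64 output channels the formula gives 144 intermediate channels, and both variants end up with 110,592 weights, while the factored block has an extra ReLU in the middle.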


The SlowFast architecture uses two parallel CNN pathways: a Slow pathway and a Fast pathway. The Slow pathway processes the video at a low frame rate and is designed to capture spatial semantics, while the Fast pathway processes the video at a high frame rate and is designed to capture rapidly changing motion. The Fast pathway is kept lightweight by using fewer channels, and its features are fused into the Slow pathway through lateral connections.
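The difference in frame rate comes down to how each pathway samples the same clip. Here is a minimal pure-Python sketch of that sampling. The function name `slowfast_indices` is hypothetical, and the stride `tau=16` and speed ratio `alpha=8` are illustrative values in line with settings reported for SlowFast, so treat them as assumptions.

```python
def slowfast_indices(num_frames, tau=16, alpha=8):
    """Return the frame indices sampled by the Slow and Fast pathways."""
    if tau % alpha != 0:
        raise ValueError("tau must be divisible by alpha")
    slow = list(range(0, num_frames, tau))           # one frame every tau frames
    fast = list(range(0, num_frames, tau // alpha))  # alpha times denser in time
    return slow, fast


slow_idx, fast_idx = slowfast_indices(num_frames=64)
print(len(slow_idx), slow_idx)  # sparse sampling for the Slow pathway
print(len(fast_idx))            # dense sampling for the Fast pathway
```

With a 64-frame clip, the Slow pathway sees 4 frames and the Fast pathway sees 32, so the Fast pathway observes the clip at 8 times the temporal resolution.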



bLVNet captures temporal interactions between frames using fusion. Features are extracted for even and odd frames using different layers, and the features of neighboring frames are fused. Another variant, bLVNet-TAM, also considers global temporal context through global fusion of the extracted deep features. The basic idea is to fuse temporal information at each time instance by weighted channel-wise aggregation.


For temporal aggregation, the authors use a Temporal Aggregation Module (TAM), which is essentially a depthwise convolution followed by temporal shifting.
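One way to read "weighted channel-wise aggregation" is sketched below: every channel gets its own weight for each temporal offset, the weighted features are shifted along time, and the shifted copies are summed. This is an illustration and not the authors' implementation. The three-frame window, the zero filling at clip boundaries, and the helper names `shift_time` and `temporal_aggregate` are assumptions. The result is cross-checked against a depthwise temporal `Conv3d`.

```python
import torch
import torch.nn as nn


def shift_time(x, offset):
    """Shift (b, c, t, h, w) along time so out[:, :, t] = x[:, :, t + offset], zero-filled."""
    out = torch.zeros_like(x)
    t = x.shape[2]
    if offset > 0:
        out[:, :, : t - offset] = x[:, :, offset:]
    elif offset < 0:
        out[:, :, -offset:] = x[:, :, : t + offset]
    else:
        out = x.clone()
    return out


def temporal_aggregate(x, weights):
    """weights has shape (3, c): per-channel weights for offsets -1, 0, +1."""
    out = torch.zeros_like(x)
    for k, offset in enumerate((-1, 0, 1)):
        w = weights[k].view(1, -1, 1, 1, 1)    # channel-wise (depthwise) weighting
        out = out + shift_time(w * x, offset)  # shift in time, then accumulate
    return out


torch.manual_seed(0)
channels = 4
x = torch.randn(2, channels, 6, 5, 5)  # (b, c, t, h, w)
weights = torch.randn(3, channels)
aggregated = temporal_aggregate(x, weights)

# Cross-check: the same operation written as a depthwise temporal convolution.
depthwise = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                      padding=(1, 0, 0), groups=channels, bias=False)
with torch.no_grad():
    depthwise.weight.copy_(weights.t().reshape(channels, 1, 3, 1, 1))
    reference = depthwise(x)

print(torch.allclose(aggregated, reference, atol=1e-5))
```

The aggregation needs only three weights per channel, which keeps it cheap compared with a full 3D convolution.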



In this article, we covered some convolutional neural network based approaches for video understanding. Some architectures are translations of performant 2D architectures to 3D, while others decompose spatiotemporal processing into spatial and temporal parts using 2D and 1D convolutions.