Transformers in Video Understanding

Anas R.
3 min read · Jan 15, 2023

Videos are everywhere, and their volume keeps growing. One way to tackle video problems is to classify individual frames, but this strategy ignores temporal changes. To model space and time jointly, machine learning researchers have proposed many solutions, and one recent technique is the transformer.

Transformers were introduced in natural language processing, but they are now almost everywhere: images and videos, classification, segmentation, and generation.

Defining Video Understanding

Video understanding means extracting learned information from a stack of frames; technically, we call this learned information spatiotemporal information.

The video understanding domain has advanced in parallel with image recognition. Traditional architectures include spatiotemporal 3D convolutional neural networks, which require significantly more computation than their image counterparts. Another way to handle spatiotemporal information is to extract per-frame features with a performant image recognition architecture and feed them to sequence models like LSTMs and GRUs.

Today we are going to learn about some recent transformer architectures employed in video understanding.

Video Vision Transformer (ViViT)

Arnab et al. introduce transformer-based models for video classification.

Their architecture has four variants:

Spatio-temporal attention

The spatiotemporal attention architecture in ViViT extracts 3D tubelets from a stack of frames and projects them using dense layers. The rest of the architecture consists of positional embeddings for these tokens, a transformer encoder, and a multi-layer perceptron head for classification.

To explain the architecture, I have added the tubelet embedding diagram from the paper.

Spatiotemporal Attention using Tubelet Embedding [Source]

A TensorFlow implementation of the framework can be seen here.
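Tubelet embedding can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the video size, tubelet size, embedding dimension, and the random projection matrix are all assumptions made for the example.

```python
import numpy as np

def tubelet_embed(video, t, h, w, d, rng):
    """Split a video (T, H, W, C) into non-overlapping t x h x w tubelets
    and linearly project each flattened tubelet to d dimensions."""
    T, H, W, C = video.shape
    nt, nh, nw = T // t, H // h, W // w
    # Reshape into a grid of tubelets, then flatten each tubelet.
    x = video.reshape(nt, t, nh, h, nw, w, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(nt * nh * nw, t * h * w * C)
    # Random projection stands in for the learned dense layer.
    W_proj = rng.standard_normal((t * h * w * C, d)) / np.sqrt(t * h * w * C)
    return x @ W_proj  # (num_tokens, d)

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 32, 32, 3))   # 8 frames of 32x32 RGB
tokens = tubelet_embed(video, t=2, h=16, w=16, d=128, rng=rng)
print(tokens.shape)  # (16, 128): 4 temporal x 2 x 2 spatial tubelets
```

Each of the 16 resulting tokens covers a small 3D volume of the clip, so a single token already carries both spatial and temporal information.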

Because every token attends to every other token, the architecture has quadratic complexity with respect to the number of input tokens.

Factorized encoder

The factorized encoder uses two separate transformer encoders that model spatial and temporal interactions independently.

Factorized Encoder (ViViT) for Video Understanding [Source]

Spatial Encoder: Tokens within a frame attend only to other tokens in the same frame. Each frame's representation is then forwarded to the temporal encoder, either after global average pooling or via a class token.

Temporal Encoder: The per-frame tokens produced by the spatial encoder attend to each other across time.

These independent spatially and temporally attended tokens are then forwarded to a multilayer perceptron head for classification.

This factorization also reduces the complexity: attention is quadratic only in the number of tokens per frame (spatially) and in the number of frames (temporally), rather than in the total number of spatiotemporal tokens.
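The two-stage encoder can be sketched as follows. This is a minimal NumPy illustration with a single attention head and random projection weights; the shapes, the use of average pooling (rather than a class token), and the absence of layer norms and residuals are simplifying assumptions.

```python
import numpy as np

def self_attention(x, rng):
    """Minimal single-head self-attention with random projections (illustrative)."""
    n, d = x.shape
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
frames, tokens_per_frame, d = 8, 16, 64
video_tokens = rng.standard_normal((frames, tokens_per_frame, d))

# Spatial encoder: attend within each frame, then average-pool
# to a single representation per frame.
frame_repr = np.stack([self_attention(f, rng).mean(axis=0) for f in video_tokens])

# Temporal encoder: the per-frame representations attend to each other.
out = self_attention(frame_repr, rng)
print(out.shape)  # (8, 64): one temporally contextualized token per frame
```

Note that the temporal encoder only ever sees one token per frame, which is exactly where the complexity saving comes from.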

Factorized self-attention

Here, instead of using two separate transformer encoders as in the previous variant, the authors factorize spatial and temporal interactions within the same encoder.

The self-attention inside each transformer block is modified to first compute interactions only spatially (within one frame), and then temporally (across the spatially attended tokens).

Factorized Self-Attention (ViViT) for Video Understanding [Source]
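The spatial-then-temporal factorization boils down to two attention passes with an axis swap in between. The sketch below uses content-only attention (no learned projections) purely to show the reshaping; the grid size and embedding dimension are illustrative assumptions.

```python
import numpy as np

def attend(x, d):
    """Content-based softmax attention over the second-to-last axis
    (no learned projections; illustrative only)."""
    s = x @ x.swapaxes(-1, -2) / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

T, N, d = 8, 16, 64  # frames, tokens per frame, embedding dim
rng = np.random.default_rng(0)
x = rng.standard_normal((T, N, d))

# Step 1: spatial attention -- each frame's N tokens attend to each other.
x = attend(x, d)                                  # (T, N, d)
# Step 2: temporal attention -- tokens at the same spatial location
# attend across frames, done by swapping the frame and token axes.
x = attend(x.swapaxes(0, 1), d).swapaxes(0, 1)    # (T, N, d)
print(x.shape)  # (8, 16, 64)
```

Compared to joint spatiotemporal attention over all T·N tokens, each pass here attends over only N or T tokens at a time.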

Factorized dot-product attention

In this last variant, self-attention itself is modified so that some attention heads attend over the spatial dimension while the others attend over the temporal dimension.

Factorized Dot Product Attention (ViViT) for Video Understanding [Source]
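A head-level split can be sketched as follows: half of the channels play the role of spatial heads and half the role of temporal heads, and their outputs are concatenated as in multi-head attention. Again, this is a hedged NumPy illustration with content-only attention and made-up shapes, not the paper's implementation.

```python
import numpy as np

def attend(x, d):
    """Content-based softmax attention (no learned projections; illustrative)."""
    s = x @ x.swapaxes(-1, -2) / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

T, N, d = 8, 16, 64  # frames, tokens per frame, embedding dim
rng = np.random.default_rng(0)
x = rng.standard_normal((T, N, d))
half = d // 2

# "Spatial heads": attend within each frame over the first half of channels.
spatial = attend(x[..., :half], half)                               # (T, N, d/2)
# "Temporal heads": attend across frames (same spatial index) over the rest.
temporal = attend(x[..., half:].swapaxes(0, 1), half).swapaxes(0, 1)  # (T, N, d/2)

# Concatenate head outputs, as in multi-head attention.
out = np.concatenate([spatial, temporal], axis=-1)
print(out.shape)  # (8, 16, 64)
```

Unlike factorized self-attention, the spatial and temporal interactions here happen in parallel within a single attention operation rather than one after the other.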

Video Swin Transformer

The Video Swin Transformer limits self-attention to non-overlapping local windows while also allowing for cross-window connections. By doing this, it adds an inductive bias of locality to the transformer architecture.

Previous video transformers compute self-attention globally, even with factorization across the spatial and temporal dimensions.

Video Swin Transformer: [Source]

Locality Inductive Bias: “The notion that image pixels are locally correlated and that their correlation maps are translation-invariant.” [A nice article about inductive biases in ML algorithms here]

TimeSFormer

The authors of TimeSFormer experiment with different self-attention schemes and find that “divided attention” leads to the best video classification accuracy among the design choices considered.

This attention scheme is the same as the factorized self-attention in the Video Vision Transformer.