DEVIAS

Learning Disentangled Video Representations of Action and Scene

ECCV 2024 Oral Paper 🔥
1. Kyung Hee University 2. Korea Advanced Institute of Science and Technology

∗ Equally contributed first authors. † Corresponding author.

DEVIAS works well across diverse scenarios, including both seen and unseen action-scene combinations.


Abstract

  • Humans can naturally understand the content of a video by extracting human actions from the surrounding scene context. Even when encountering a previously unseen action-scene combination, humans easily recognize both the action and the scene.
  • Unlike humans, video recognition models often learn scene-biased action representation due to the spurious correlation between actions and scenes in the training data. Although scene-debiased action recognition models might address the issue, they often overlook valuable scene information in the data.
  • We propose to learn DisEntangled VIdeo representations of Action and Scene (DEVIAS). DEVIAS consists of a Disentangling Encoder (DE) with the slot attention and an Action Mask Decoder (AMD) leverging action masks to help an action slot's learning process. Since slots are complementary to each other in slot attention, learning good action representation by AMD encourages the DE to learn good scene representation as well.

Architecture

Disentangling Encoder

The Disentangling Encoder (DE) leverages slot attention to disentangle and learn action and scene representations separately from entangled features. It employs the Hungarian algorithm to assign one action slot and one scene slot among the multiple encoded slots, using cross-entropy loss as the matching cost function.
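As an illustration of this matching step, the sketch below assigns slots to the ground-truth action and scene labels with the Hungarian algorithm, using each slot's classification cross-entropy as the cost. The function and variable names are ours for illustration, not from the official codebase, and the actual implementation may differ:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_slots(slot_logits, target_labels):
    """Match each ground-truth label (e.g. [action_id, scene_id]) to one slot.

    slot_logits:   (num_slots, num_classes) classification logits per slot.
    target_labels: list of ground-truth class indices.
    Returns a list where entry i is the slot index assigned to target i.
    """
    # Softmax over classes to get per-slot class probabilities.
    exp = np.exp(slot_logits - slot_logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    # Cost matrix: cross-entropy of each slot's prediction w.r.t. each target.
    cost = np.stack([-np.log(probs[:, t] + 1e-9) for t in target_labels],
                    axis=1)  # shape: (num_slots, num_targets)
    slot_idx, target_idx = linear_sum_assignment(cost)
    # Reorder so that entry i is the slot matched to target i.
    assignment = [None] * len(target_labels)
    for s, t in zip(slot_idx, target_idx):
        assignment[t] = int(s)
    return assignment
```

With more slots than targets, the unmatched slots are simply left unassigned, which is what lets the remaining slots absorb other content.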

Action Mask Decoder & Attention Guidance

Given the action slot, the Action Mask Decoder (AMD) learns to predict an action mask. We also design an Attention Guidance (AG) loss between the attention map of the action slot and the action mask. Together, AMD and AG encourage the action slot to contain pure action information, free of scene or object information. Since slots are complementary to each other, learning a good action representation via AMD and AG encourages the DE to learn a good scene representation as well.
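The guidance idea can be sketched as a loss that pulls the action slot's attention distribution toward the ground-truth action mask. The cross-entropy form below is an illustrative assumption on our part; the paper's exact formulation may differ:

```python
import numpy as np

def attention_guidance_loss(attn_map, action_mask, eps=1e-8):
    """Illustrative attention-guidance objective.

    attn_map:    (T, H, W) attention weights of the action slot over patches.
    action_mask: (T, H, W) binary foreground mask of the actor.
    Both are normalized to distributions and compared with cross-entropy,
    so attention mass off the actor is penalized.
    """
    attn = attn_map / (attn_map.sum() + eps)
    mask = action_mask / (action_mask.sum() + eps)
    return float(-(mask * np.log(attn + eps)).sum())
```

The loss is minimized when the attention map concentrates exactly on the masked actor region, which is what forces scene content out of the action slot.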


Experiments

DEVIAS learns disentangled action and scene representations. With disentangled representations, DEVIAS can understand a video regardless of whether the action-scene combination is seen or unseen. To validate this, we evaluate our model on various datasets: UCF-101, Kinetics-400, HVU (ECCV 2020), SCUBA (ICCV 2023), and HAT (NeurIPS 2022). We provide sample images from these datasets below.

Performances for both seen and unseen combination scenarios on UCF-101

We report the Top-1 action recognition and Top-5 scene recognition accuracies (%), along with their harmonic mean (H.M.). V.C./Sin. denotes SCUBA VQGAN-CLIP/Sinusoidal; S.O./Rand. denotes HAT Scene-Only/Random. For a description of each baseline model, please see the paper.

Performances for both seen and unseen combination scenarios on Kinetics-400

We report the Top-1 action recognition and Top-5 scene recognition accuracies (%), along with their harmonic mean (H.M.). V.C./Sin. denotes SCUBA VQGAN-CLIP/Sinusoidal; S.O./Rand. denotes HAT Scene-Only/Random.

Disentangled action and scene representation is beneficial for diverse downstream tasks

For fine-tuning on downstream datasets with diverse characteristics, whether temporally biased or scene-biased, DEVIAS uses the concatenation of the action and scene slots as the input to the classification head. As a result, DEVIAS shows favorable performance on Diving48, Something-Something V2, UCF-101, and ActivityNet compared to the baselines. Please see the paper for more details.
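The transfer step above amounts to feeding both slots to a single head. A minimal sketch, assuming a linear head (the names and the head architecture are illustrative, not from the official codebase):

```python
import numpy as np

def downstream_logits(action_slot, scene_slot, head_weight, head_bias):
    """Concatenate the action and scene slots and apply a linear
    classification head for a downstream task.

    action_slot, scene_slot: (slot_dim,) slot vectors from the DE.
    head_weight:             (num_classes, 2 * slot_dim) head parameters.
    head_bias:               (num_classes,) head parameters.
    """
    features = np.concatenate([action_slot, scene_slot])  # (2 * slot_dim,)
    return head_weight @ features + head_bias             # (num_classes,)
```

Keeping both slots lets the head weight action or scene evidence as the downstream dataset demands, e.g. leaning on the action slot for Diving48 and the scene slot for scene-biased datasets.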

Ablation study

We provide extensive ablation studies to validate the effect of each component of DEVIAS. Please see the paper for detailed explanations and further studies.


Qualitative Results

We demonstrate that DEVIAS learns disentangled action and scene representations using UMAP and attention map visualizations.

UMAP Visualization

Slot Attention Map Visualization

BibTeX


  @inproceedings{bae2024devias,
    title     = {DEVIAS: Learning Disentangled Video Representations of Action and Scene for Holistic Video Understanding},
    author    = {Bae, Kyungho and Ahn, Geo and Kim, Youngrae and Choi, Jinwoo},
    booktitle = {European Conference on Computer Vision},
    year      = {2024}
  }
  

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.