Sparse Multimodal Vision Transformer for Weakly Supervised Semantic Segmentation

Type
conference contribution
Date Issued
2023-06
Author(s)
Joëlle Hanna  
Michael Mommert  
Damian Borth
Editor(s)
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (EarthVision)
Abstract
Vision Transformers have proven their versatility and utility for complex computer vision tasks, such as land cover segmentation in remote sensing applications. While performing on par with or even outperforming other methods like Convolutional Neural Networks (CNNs), Transformers tend to require even larger datasets with fine-grained annotations (e.g., pixel-level labels for land cover segmentation). To overcome this limitation, we propose a weakly-supervised vision Transformer that leverages image-level labels to learn a semantic segmentation task and thereby reduce the human annotation load. We achieve this by slightly modifying the architecture of the vision Transformer: gating units in each attention head enforce sparsity during training, so that only the most meaningful heads are retained. This allows us to directly infer pixel-level labels from image-level labels by post-processing the unpruned attention heads of the model, and to refine our predictions by iteratively training a segmentation model with high fidelity. Training and evaluation on the DFC2020 dataset show that our method not only generates high-quality segmentation masks using image-level labels, but also performs on par with fully-supervised training relying on pixel-level labels. Finally, our results show that our method is able to perform weakly-supervised semantic segmentation even on small-scale datasets.
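The head-gating idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how the gates are parameterized or regularized, so the scalar per-head gates, the L1 penalty, the pruning threshold, and all function names below are assumptions chosen for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_attention(x, wq, wk, wv, gates):
    """Multi-head self-attention with one scalar gate per head.

    x:          (tokens, dim) input token embeddings
    wq, wk, wv: (heads, dim, head_dim) projection weights
    gates:      (heads,) gate values (hypothetical parameterization)
    Returns the gated per-head outputs and the attention maps.
    """
    q = np.einsum("td,hde->hte", x, wq)
    k = np.einsum("td,hde->hte", x, wk)
    v = np.einsum("td,hde->hte", x, wv)
    scores = np.einsum("hte,hse->hts", q, k) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)           # (heads, tokens, tokens)
    out = np.einsum("hts,hse->hte", attn, v)  # (heads, tokens, head_dim)
    # A gate of zero removes the head's contribution entirely.
    return gates[:, None, None] * out, attn


def sparsity_penalty(gates, weight=1e-2):
    """L1 penalty on the gates (an assumed regularizer, added to the loss)."""
    return weight * np.abs(gates).sum()


def active_heads(gates, threshold=0.05):
    """Indices of heads whose gate magnitude exceeds an assumed threshold."""
    return np.flatnonzero(np.abs(gates) > threshold)
```

In such a setup, only the attention maps of the heads returned by `active_heads` would be passed on to post-processing; how those maps are turned into pixel-level pseudo-labels is described in the paper itself and is not reproduced here.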
Keywords
Weakly Supervised Learning
Transformers
Sparsity
URL
https://www.alexandria.unisg.ch/handle/20.500.14171/122024
File(s)

open.access

Name

Hanna_Sparse_Multimodal_Vision_Transformer_for_Weakly_Supervised_Semantic_Segmentation_CVPRW_2023_paper.pdf

Size

3.38 MB

Format

Adobe PDF

Checksum (MD5)

d49098aedbc67b40a608025032a1c638
