Sparse Multimodal Vision Transformer for Weakly Supervised Semantic Segmentation
Type
conference contribution
Date Issued
2023-06
Author(s)
Editor(s)
Published in
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (EarthVision)
Abstract
Vision Transformers have proven their versatility and utility for complex computer vision tasks, such as land cover segmentation in remote sensing applications. While performing on par with or even outperforming other methods like Convolutional Neural Networks (CNNs), Transformers tend to require even larger datasets with fine-grained annotations (e.g., pixel-level labels for land cover segmentation). To overcome this limitation, we propose a weakly-supervised vision Transformer that leverages image-level labels to learn a semantic segmentation task, reducing the human annotation load. We achieve this by slightly modifying the architecture of the vision Transformer through the use of gating units in each attention head to enforce sparsity during training and thereby retain only the most meaningful heads. This allows us to directly infer pixel-level labels from image-level labels by post-processing the unpruned attention heads of the model and refining our predictions by iteratively training a segmentation model with high fidelity. Training and evaluation on the DFC2020 dataset show that our method not only generates high-quality segmentation masks using image-level labels, but also performs on par with fully-supervised training relying on pixel-level labels. Finally, our results show that our method is able to perform weakly-supervised semantic segmentation even on small-scale datasets.
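As a rough illustration of the head-gating mechanism described in the abstract (not the authors' implementation), the sketch below adds a learnable scalar gate to each attention head of a standard multi-head self-attention block and exposes an L1 penalty on the gates so that uninformative heads can be driven toward zero and pruned after training. All class names, signatures, and the choice of penalty are assumptions made for illustration.

```python
# Minimal sketch, assuming a PyTorch ViT-style attention block with one
# learnable gate per head; names and the L1 penalty are illustrative only.
import torch
import torch.nn as nn


class GatedMultiHeadAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One scalar gate per attention head; sigmoid keeps it in [0, 1].
        self.gate_logits = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)       # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        out = attn @ v                              # (b, heads, n, head_dim)
        gates = torch.sigmoid(self.gate_logits)    # (heads,)
        out = out * gates.view(1, -1, 1, 1)        # suppress low-gate heads
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

    def sparsity_loss(self) -> torch.Tensor:
        # L1 penalty pushing head gates toward zero; heads whose gates stay
        # near zero can be pruned and their attention maps discarded.
        return torch.sigmoid(self.gate_logits).sum()
```

In this reading, the surviving (unpruned) heads are the ones whose attention maps would later be post-processed into pixel-level pseudo-labels; the exact gating formulation and pruning threshold used in the paper may differ.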
Keywords
Weakly Supervised Learning
Transformers
Sparsity
File(s)
open.access
Name
Hanna_Sparse_Multimodal_Vision_Transformer_for_Weakly_Supervised_Semantic_Segmentation_CVPRW_2023_paper.pdf
Size
3.38 MB
Format
Adobe PDF
Checksum (MD5)
d49098aedbc67b40a608025032a1c638