This paper presents a comparative analysis of transformer-based fusion methods applied to a novel multimodal dataset for remote sensing semantic segmentation. The investigation evaluates the impact of several fusion methods on the accuracy of the results. For early fusion, we investigate Early Concatenation. For middle fusion, we investigate four methods: Token Patch Embedding, Channel Patch Embedding, Token Fusion at Attention Level, and Cross-Attention. Finally, as a representative of late fusion, we investigate Late Concatenation. The methods presented here are specifically designed to operate effectively with all modalities under investigation. Experiments conducted on the Ticino dataset show that Late Concatenation outperforms the best single-modality method (RGB) by 4.04%, 2.24%, and 3.47% on accuracy, precision, and mIoU, respectively. This study provides an opportunity to further explore transformer-based fusion methods, thereby enhancing our understanding of the potential of data fusion.
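For readers unfamiliar with the three reported metrics, the following is a minimal sketch of how pixel accuracy, precision, and mIoU can be derived from a confusion matrix. It is an illustration only: the function name `segmentation_metrics`, the toy two-class matrix, and the choice of macro-averaged precision are assumptions made here, not details taken from the paper or its evaluation code.

```python
def segmentation_metrics(confusion):
    """Compute accuracy, macro precision, and mIoU.

    confusion[t][p] is the number of pixels whose true class is t
    and whose predicted class is p (hypothetical layout).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n))

    precisions, ious = [], []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[t][c] for t in range(n)) - tp  # predicted c, truly another class
        fn = sum(confusion[c]) - tp                       # truly c, predicted another class
        if tp + fp > 0:
            precisions.append(tp / (tp + fp))
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))

    return {
        "accuracy": correct / total,
        # Assumption: precision is averaged over classes (macro average).
        "precision": sum(precisions) / len(precisions),
        # mIoU: mean over classes of TP / (TP + FP + FN).
        "miou": sum(ious) / len(ious),
    }


# Toy two-class confusion matrix (made-up numbers, not results from the paper).
toy = [[50, 10],
       [5, 35]]
metrics = segmentation_metrics(toy)
print(metrics)
```

Under these assumptions, comparing a fusion model against a single-modality baseline amounts to computing the same three numbers for both models on the same test split and taking the difference.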

Morelli, V., Barbato, M., Piccoli, F., Napoletano, P. (2024). Multimodal Fusion Methods with Vision Transformers for Remote Sensing Semantic Segmentation. In 2023 13th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS). IEEE Computer Society [10.1109/WHISPERS61460.2023.10430788].

Multimodal Fusion Methods with Vision Transformers for Remote Sensing Semantic Segmentation

Morelli, V; Barbato, M; Piccoli, F; Napoletano, P
2024

slide + paper
Multimodal fusion; Remote sensing; Semantic Segmentation; Vision Transformers;
English
13th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing, WHISPERS 2023 - 31 October 2023 through 2 November 2023
2023
2023 13th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS)
9798350395570
2024
open
Files in this item:
File: Morelli-2023-Whispers-AMM.pdf
Access: open access
Description: Conference contribution
Attachment type: Author’s Accepted Manuscript, AAM (Post-print)
License: Other
Size: 1.79 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/446101