
Local-to-Global Cross-Modal Attention-Aware Fusion for HSI-X Semantic Segmentation

Zhang, Xuming; Yokoya, Naoto; Gu, Xingfa; Tian, Qingjiu; Bruzzone, Lorenzo

2024

Abstract

Hyperspectral image (HSI) classification has recently reached its performance bottleneck. Multimodal data fusion is emerging as a promising approach to overcome this bottleneck by providing rich complementary information from the supplementary modality (X-modality). However, achieving comprehensive cross-modal interaction and fusion that can be generalized across different sensing modalities is challenging due to the disparity in imaging sensors, resolution, and content of different modalities. In this study, we propose a local-to-global cross-modal attention-aware fusion (LoGoCAF) framework for HSI-X segmentation. LoGoCAF adopts a two-branch semantic segmentation architecture to learn information from HSI and X modalities. The pipeline of LoGoCAF consists of a local-to-global encoder and a lightweight all multilayer perceptron (ALL-MLP) decoder. In the encoder, convolutions are used to encode local and high-resolution fine details in shallow layers, while transformers are used to integrate global and low-resolution coarse features in deeper layers. The ALL-MLP decoder aggregates information from the encoder for feature fusion and prediction. In particular, two cross-modality modules, the feature enhancement module (FEM) and the feature interaction and fusion module (FIFM), are introduced in each encoder stage. The FEM is used to enhance complementary information by combining information from the other modality across direction-aware, position-sensitive, and channel-wise dimensions. With the enhanced features, the FIFM is designed to promote cross-modality information interaction and fusion for the final semantic prediction. Extensive experiments demonstrate that our LoGoCAF achieves superior performance and generalizes well on various multimodal datasets. Code is available at https://github.com/xumzhang.
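The abstract describes the local-to-global encoder only at a high level; the exact FEM and FIFM designs are defined in the paper itself. Purely as an unofficial illustration of the stated idea, convolutional blocks for the shallow, high-resolution stages and transformer blocks for the deeper, low-resolution stages of each modality branch, a minimal PyTorch sketch might look as follows. All module names, channel sizes, depths, and the simple additive fusion (standing in for FEM + FIFM) are assumptions for illustration, not the authors' implementation from the linked repository.

```python
# Unofficial sketch of a "local-to-global" two-branch encoder in the spirit
# of the abstract: convolutions in shallow stages, transformers in deep
# stages. Channel sizes, depths, and the additive cross-modal fusion are
# illustrative assumptions, not the paper's FEM/FIFM design.
import torch
import torch.nn as nn


class ConvStage(nn.Module):
    """Shallow stage: strided convs encode local, high-resolution detail."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class TransformerStage(nn.Module):
    """Deep stage: self-attention integrates global, low-resolution context."""
    def __init__(self, in_ch, out_ch, heads=4):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)
        layer = nn.TransformerEncoderLayer(
            d_model=out_ch, nhead=heads,
            dim_feedforward=2 * out_ch, batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x):
        x = self.down(x)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C) token sequence
        tokens = self.attn(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class TwoBranchEncoder(nn.Module):
    """One branch per modality; per-stage features are fused (here: added)."""
    def __init__(self, hsi_ch, x_ch, dims=(32, 64, 128, 256)):
        super().__init__()
        def branch(in_ch):
            return nn.ModuleList([
                ConvStage(in_ch, dims[0]),           # local stages
                ConvStage(dims[0], dims[1]),
                TransformerStage(dims[1], dims[2]),  # global stages
                TransformerStage(dims[2], dims[3]),
            ])
        self.hsi_branch = branch(hsi_ch)
        self.x_branch = branch(x_ch)

    def forward(self, hsi, x):
        fused = []
        for f_h, f_x in zip(self.hsi_branch, self.x_branch):
            hsi, x = f_h(hsi), f_x(x)
            fused.append(hsi + x)  # stand-in for the paper's FEM + FIFM
        return fused  # multiscale features for an MLP-style decoder


if __name__ == "__main__":
    # Hypothetical shapes: a 144-band HSI patch and a single-band X image
    # (e.g., a LiDAR-derived raster) at the same spatial resolution.
    enc = TwoBranchEncoder(hsi_ch=144, x_ch=1)
    feats = enc(torch.randn(1, 144, 64, 64), torch.randn(1, 1, 64, 64))
    print([f.shape for f in feats])
```

The per-stage fused feature list mirrors how an All-MLP decoder of the SegFormer kind would consume multiscale encoder outputs; the actual decoder, FEM, and FIFM operations should be taken from the paper and its released code.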
Local-to-Global Cross-Modal Attention-Aware Fusion for HSI-X Semantic Segmentation / Zhang, Xuming; Yokoya, Naoto; Gu, Xingfa; Tian, Qingjiu; Bruzzone, Lorenzo. - In: IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING. - ISSN 0196-2892. - 62:5531817(2024), pp. 1-17. [10.1109/TGRS.2024.3465030]
Files in this record:

Local-to-Global_Cross-Modal_Attention-Aware_Fusion_for_HSI-X_Semantic_Segmentation.pdf
  Access: under embargo until 20/09/2026
  Type: Refereed author's manuscript (post-print)
  License: All rights reserved
  Size: 50.23 MB
  Format: Adobe PDF

Local-to-Global_Cross-Modal_Attention-Aware_Fusion_for_HSI-X_Semantic_Segmentation (1).pdf
  Access: repository managers only
  Type: Publisher's layout
  License: All rights reserved
  Size: 14.31 MB
  Format: Adobe PDF

Local-to-Global_Cross-Modal_Attention-Aware_Fusion_for_HSI-X_Semantic_Segmentation (1)_compressed.pdf
  Access: repository managers only
  Type: Publisher's layout
  License: All rights reserved
  Size: 1.97 MB
  Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/444072
Citations
  • PubMed Central: not available
  • Scopus: 2
  • Web of Science: 0
  • OpenAlex: not available