Zhang, Xuming; Yokoya, Naoto; Gu, Xingfa; Tian, Qingjiu; Bruzzone, Lorenzo, "CoMiX: Cross-Modal Fusion With Deformable Convolutions for HSI-X Semantic Segmentation," IEEE Transactions on Geoscience and Remote Sensing, ISSN 1558-0644, vol. 63, art. no. 5533317, pp. 1-17, 2025. DOI: 10.1109/TGRS.2025.3624105

CoMiX: Cross-Modal Fusion With Deformable Convolutions for HSI-X Semantic Segmentation

Lorenzo Bruzzone
2025-01-01

Abstract

Improving hyperspectral image (HSI) semantic segmentation by exploiting complementary information from supplementary modalities (termed X-modality) is promising but challenging due to significant differences in imaging sensors, image content, and resolution. Existing methods often underutilize the unique spatial-spectral features of HSIs by processing them uniformly with X-modality data. In addition, current cross-modality fusion strategies often suffer from limited intermodal interaction or significantly increased model complexity. To address these limitations, we propose CoMiX, an asymmetric encoder-decoder architecture with deformable convolutional networks (DCNs) for HSI-X semantic segmentation. CoMiX includes an encoder with two parallel, interacting backbones and a lightweight all-multilayer perceptron (ALL-MLP) decoder. The encoder consists of four stages, each incorporating 2-D DCN blocks for the X-modality to accommodate geometric variations and 3-D DCN blocks for HSIs to adaptively capture spatial-spectral features. Each stage also incorporates a cross-modality feature enhancement and eXchange (CMFeX) module and a feature fusion module (FFM). CMFeX exploits spatial-spectral correlations across modalities to recalibrate and enhance modality-specific and modality-shared features while adaptively exchanging complementary information. Its outputs are subsequently fused in the FFM and propagated to the next stage for further learning. Finally, the ALL-MLP decoder aggregates the fused features from all stages to produce the final predictions. Extensive experiments demonstrate that CoMiX achieves state-of-the-art performance and generalizes well to various multimodal datasets. The CoMiX code will be released soon.
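The core idea described in the abstract — recalibrating each modality's features and adaptively exchanging complementary information between the HSI and X-modality streams — can be illustrated with a minimal gated-exchange sketch. This is a simplified NumPy illustration of the general concept, not the paper's actual CMFeX formulation; the pooling-based gate and the fusion by summation are assumptions made for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_modal_exchange(f_hsi, f_x):
    """Toy gated exchange between two modality feature maps.

    f_hsi, f_x: arrays of shape (C, H, W). A per-channel gate derived
    from global average pooling decides how much complementary
    information each stream imports from the other. Illustrative only:
    the real CMFeX module learns its recalibration from data.
    """
    # Channel descriptors via global average pooling
    d_hsi = f_hsi.mean(axis=(1, 2))            # (C,)
    d_x = f_x.mean(axis=(1, 2))                # (C,)
    # Hypothetical per-channel gate comparing the two descriptors
    g = sigmoid(d_hsi - d_x)[:, None, None]    # (C, 1, 1)
    # Each stream keeps its own features where its gate is high and
    # absorbs the other modality's features where it is low.
    out_hsi = g * f_hsi + (1.0 - g) * f_x
    out_x = g * f_x + (1.0 - g) * f_hsi
    return out_hsi, out_x

def fuse(out_hsi, out_x):
    """Stand-in for the FFM: merge the two recalibrated streams."""
    return out_hsi + out_x
```

In CoMiX this kind of exchange happens at every encoder stage, with the fused result passed to the next stage; the sketch above only shows the single-stage data flow.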
Files in this item:

File: TGRS3624105.pdf
Access: open access
Description: This article has been accepted for publication in IEEE Transactions on Geoscience and Remote Sensing. This is the author's version, which has not been fully edited, and content may change prior to final publication. Citation information: DOI 10.1109/TGRS.2025.3624105
Type: Refereed author's manuscript (post-print)
License: All rights reserved
Size: 54.48 MB
Format: Adobe PDF

File: CoMiX_Cross-Modal_Fusion_With_Deformable_Convolutions_for_HSI-X_Semantic_Segmentation.pdf
Access: Archive administrators only
Type: Publisher's layout
License: All rights reserved
Size: 9.71 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/11572/475674
Citations
  • PMC: n/a
  • Scopus: 0
  • Web of Science: 0
  • OpenAlex: 0