Zhang, Xuming; Yokoya, Naoto; Gu, Xingfa; Tian, Qingjiu; Bruzzone, Lorenzo, "CoMiX: Cross-Modal Fusion With Deformable Convolutions for HSI-X Semantic Segmentation," IEEE Transactions on Geoscience and Remote Sensing, ISSN 1558-0644, vol. 63, art. no. 5533317, pp. 1-17, 2025. DOI: 10.1109/TGRS.2025.3624105

CoMiX: Cross-Modal Fusion With Deformable Convolutions for HSI-X Semantic Segmentation

Lorenzo Bruzzone
2025-01-01

Abstract

Improving hyperspectral image (HSI) semantic segmentation by exploiting complementary information from supplementary modalities (termed X-modality) is promising but challenging due to significant differences in imaging sensors, image content, and resolution. Existing methods often underutilize the unique spatial-spectral features of HSIs by processing them uniformly with X-modality data. In addition, current cross-modality fusion strategies often suffer from limited intermodal interaction or significantly increased model complexity. To address these limitations, we propose CoMiX, an asymmetric encoder-decoder architecture with deformable convolutional networks (DCNs) for HSI-X semantic segmentation. CoMiX includes an encoder with two parallel, interacting backbones and a lightweight all-multilayer perceptron (ALL-MLP) decoder. The encoder consists of four stages, each incorporating 2-D DCN blocks for the X-modality to accommodate geometric variations and 3-D DCN blocks for HSIs to adaptively capture spatial-spectral features. Each stage also incorporates a cross-modality feature enhancement and eXchange (CMFeX) module and a feature fusion module (FFM). CMFeX exploits spatial-spectral correlations across modalities to recalibrate and enhance modality-specific and modality-shared features while adaptively exchanging complementary information. Its outputs are subsequently fused in the FFM and propagated to the next stage for further learning. Finally, the ALL-MLP decoder aggregates the fused features from all stages to produce the final predictions. Extensive experiments demonstrate that CoMiX achieves state-of-the-art performance and generalizes well to various multimodal datasets. The CoMiX code will be released soon.
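The core idea described in the abstract — recalibrating each modality's features and adaptively exchanging complementary information between the HSI and X-modality streams — can be illustrated with a minimal gated-exchange sketch. This is a simplified NumPy illustration of the general concept, not the paper's actual CMFeX formulation; the pooling-based gate and the fusion by summation are assumptions made for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_modal_exchange(f_hsi, f_x):
    """Toy gated exchange between two modality feature maps.

    f_hsi, f_x: arrays of shape (C, H, W). A per-channel gate derived
    from global average pooling decides how much complementary
    information each stream imports from the other. Illustrative only:
    the real CMFeX module learns its recalibration from data.
    """
    # Channel descriptors via global average pooling
    d_hsi = f_hsi.mean(axis=(1, 2))            # (C,)
    d_x = f_x.mean(axis=(1, 2))                # (C,)
    # Hypothetical per-channel gate comparing the two descriptors
    g = sigmoid(d_hsi - d_x)[:, None, None]    # (C, 1, 1)
    # Each stream keeps its own features where its gate is high and
    # absorbs the other modality's features where it is low.
    out_hsi = g * f_hsi + (1.0 - g) * f_x
    out_x = g * f_x + (1.0 - g) * f_hsi
    return out_hsi, out_x

def fuse(out_hsi, out_x):
    """Stand-in for the FFM: merge the two recalibrated streams."""
    return out_hsi + out_x
```

In CoMiX this kind of exchange happens at every encoder stage, with the fused result passed to the next stage; the sketch above only shows the single-stage data flow.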
Files in this item:

File: TGRS3624105.pdf
Access: open access
Description: This article has been accepted for publication in IEEE Transactions on Geoscience and Remote Sensing. This is the author's version, which has not been fully edited, and content may change prior to final publication. Citation information: DOI 10.1109/TGRS.2025.3624105
Type: Refereed author's manuscript (post-print)
License: All rights reserved
Size: 54.48 MB
Format: Adobe PDF

File: CoMiX_Cross-Modal_Fusion_With_Deformable_Convolutions_for_HSI-X_Semantic_Segmentation.pdf
Access: Archive administrators only
Type: Publisher's layout
License: All rights reserved
Size: 9.71 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/11572/475674
Citations
  • PMC: n/a
  • Scopus: 0
  • Web of Science: 0
  • OpenAlex: 0