In machine learning, an information-theoretically optimal way to filter the best input features, without reference to any specific machine learning model, is to maximize the mutual information between the selected features and the model output: this choice minimizes the uncertainty in the output to be predicted, given the feature values. Although this criterion is optimal in the context of information theory, a practical difficulty in applying it lies in the need to estimate the mutual information from a limited set of input-output examples, in possibly very high-dimensional input spaces. Estimating probability densities from a limited number of data points in these conditions is far from trivial. Starting from the seminal proposals in , different approaches approximate the mutual information either by considering only a limited set of variable dependencies (such as dependencies among pairs or triplets of variables) or by assuming specific forms for the probability densities (such as Gaussian forms). In this paper we study the effect of using the exact mutual information between the selected features and the output, without resorting to any approximation (apart from that implicit, and unavoidable, in estimating it from experimental data). The objectives of this investigation are twofold: to assess how far one can go with the exact mutual information, in terms of CPU time and number of features, and to measure what is lost by adopting popular approximations that consider only relationships among small subsets of features, that make assumptions about the distribution of feature values (e.g., Gaussian), or that maximize upper bounds on the mutual information as proxies for the exact value. The experimental results show a significant performance advantage when the feature sets identified by exact mutual information are used, in both binary and multi-valued classification tasks, at the cost of longer CPU times.
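The criterion described above — greedily growing a feature set so that the mutual information between the *joint* selected features and the output is maximized, rather than relying on pairwise approximations — can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes discrete-valued features and estimates probabilities by plain frequency counts, and the function names (`mutual_information`, `greedy_exact_mifs`) are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete variables,
    estimated from paired samples via frequency counts."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        # sum over observed cells of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
        mi += pxy * log2(pxy / ((px[x] / n) * (py[y] / n)))
    return mi

def greedy_exact_mifs(features, output, k):
    """Forward selection: at each step add the feature that maximizes the
    MI between the joint tuple of all selected features and the output --
    the exact criterion, with no low-order (pairwise) approximation."""
    selected = []
    remaining = list(range(len(features)))
    for _ in range(k):
        def joint_mi(j):
            # Treat the selected features plus candidate j as one
            # composite discrete variable (a tuple per sample).
            cols = selected + [j]
            joint_feature = list(zip(*(features[c] for c in cols)))
            return mutual_information(joint_feature, output)
        best = max(remaining, key=joint_mi)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because the joint feature tuple takes exponentially many possible values as features are added, the frequency-count estimate degrades quickly with dimension — which is precisely the estimation difficulty, and the CPU-time cost, that the paper investigates.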
X-MIFS: Exact Mutual Information for feature selection / Brunato, Mauro; Battiti, Roberto. - PRINT. - (2016), pp. 3469-3476. (Paper presented at the conference IJCNN 2016, held in Vancouver, Canada, 24th-29th July 2016 [10.1109/IJCNN.2016.7727644].)
|Title:||X-MIFS: Exact Mutual Information for feature selection|
|Authors:||Brunato, Mauro; Battiti, Roberto|
|Title of the volume containing the paper:||2016 International Joint Conference on Neural Networks, IJCNN|
|Place of publication:||Piscataway, NJ|
|Publisher:||Institute of Electrical and Electronics Engineers Inc.|
|Year of publication:||2016|
|Scopus identifier:||2-s2.0-85007271201|
|WOS identifier:||WOS:000399925503091|
|Citation:||X-MIFS: Exact Mutual Information for feature selection / Brunato, Mauro; Battiti, Roberto. - PRINT. - (2016), pp. 3469-3476. (Paper presented at the conference IJCNN 2016, held in Vancouver, Canada, 24th-29th July 2016 [10.1109/IJCNN.2016.7727644].)|
|Appears in collections:||04.1 Paper in Proceedings|
Files in this record:
|07727644.pdf||Publisher's version, downloaded from IEEE Xplore||Publisher's layout||All rights reserved||Administrator|