Replicabilità e attendibilità delle misure cognitive: considerazioni teoriche e metodologiche sull’uso dei punteggi differenziali / Treccani, B. - In: GIORNALE ITALIANO DI PSICOLOGIA. - ISSN 0390-5349. - PRINT. - 53:2(2026), pp. 351-374. [10.1421/120862]
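Replicabilità e attendibilità delle misure cognitive: considerazioni teoriche e metodologiche sull’uso dei punteggi differenziali [Replicability and reliability of cognitive measures: theoretical and methodological considerations on the use of difference scores]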
Barbara Treccani
2026-01-01
Abstract
The effects observed in so-called compatibility tasks (e.g., the Stroop and Simon paradigms) are typically quantified as the difference between performance in incompatible and compatible conditions. Usually, such difference scores are highly replicable at the group level, but show low reliability as measures of individual differences. The present contribution critically examines this «reliability paradox» and its implications for the use of difference scores in experimental research, situating it within the research program developed by Carlo A. Umiltà and collaborators. On the one hand, the low reliability of these scores does not preclude their use in experimental settings; on the other hand, their high replicability alone does not guarantee their appropriateness. Indeed, their suitability depends on the relationship between their psychometric properties and the variability of the construct they are intended to measure in the population under study. Poorly reliable scores may be adequate in studies aimed at assessing the effects of experimental manipulations on constructs that are relatively homogeneous across individuals, such as the automatic processes that generate interference in compatibility tasks, but not for investigating constructs that vary substantially within the population, such as interference control processes. In the latter case, low reliability indicates that the score is not an adequate measure of the construct of interest. Finally, the article underscores the importance of critically evaluating research instruments and the theoretical assumptions guiding their use, in line with the methodological rigor that characterized Umiltà’s scientific work.
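The relationship the abstract describes, between the reliability of a difference score and the variability of the underlying construct across individuals, can be illustrated with a small simulation. The sketch below is not taken from the article: the Gaussian response times, the 50 ms mean effect, the trial counts, the two between-subject standard deviations, and the function names (`pearson`, `half_score`, `simulate`) are all hypothetical choices made only for illustration. It computes a group-level one-sample t value for the mean difference score and a Spearman-Brown corrected split-half reliability, once for a population in which the true effect is nearly the same for everyone and once for a population in which it varies widely.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def half_score(rng, true_effect, n_trials, sd_trial):
    """Mean incompatible RT minus mean compatible RT for one half of the trials."""
    # Assumption: RTs are Gaussian around a 500 ms baseline (hypothetical values).
    compatible = [rng.gauss(500.0, sd_trial) for _ in range(n_trials)]
    incompatible = [rng.gauss(500.0 + true_effect, sd_trial) for _ in range(n_trials)]
    return statistics.fmean(incompatible) - statistics.fmean(compatible)


def simulate(sd_between, n_subjects=200, trials_per_half=50,
             mean_effect=50.0, sd_trial=100.0, seed=1):
    """Return (group mean effect, one-sample t value, split-half reliability)."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_subjects):
        # Each simulated participant has a true effect drawn from the population.
        true_effect = rng.gauss(mean_effect, sd_between)
        first.append(half_score(rng, true_effect, trials_per_half, sd_trial))
        second.append(half_score(rng, true_effect, trials_per_half, sd_trial))
    scores = [(a + b) / 2 for a, b in zip(first, second)]
    group_mean = statistics.fmean(scores)
    # Group-level test of the mean difference score against zero.
    t_value = group_mean / (statistics.stdev(scores) / math.sqrt(n_subjects))
    # Correlate the two halves, then apply the Spearman-Brown correction.
    r = pearson(first, second)
    reliability = 2 * r / (1 + r)
    return group_mean, t_value, reliability


if __name__ == "__main__":
    for label, sd in (("homogeneous", 5.0), ("heterogeneous", 40.0)):
        mean_effect, t_value, reliability = simulate(sd_between=sd)
        print(f"{label}: mean={mean_effect:.1f} ms, "
              f"t={t_value:.1f}, reliability={reliability:.2f}")
```

Under these assumed parameters the group-level effect is detected robustly in both populations, while the split-half reliability is low only when the true effect is nearly constant across individuals. With identical trial-level noise, the same score becomes far more reliable once the construct genuinely varies in the population.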



