Interpreting Dataset Shift in Clinical Notes

Machine Learning for Health

Shariar Vaez-Ghaemi, Furong Jia, Monica Agrawal

ICD-9 depression notes form a tight cluster, while ICD-10 depression notes show broader semantic dispersion that mimics the wider distribution.

Summary

Distribution shift can degrade the performance of machine learning models. This concern is particularly salient in medicine, where several forces can shift Electronic Health Record (EHR) data over time. Distribution shift in the text domain is vastly understudied yet increasingly important, given the widespread integration of large language models into clinical workflows. Identifying that a shift exists is necessary but insufficient; actionability often requires understanding the nature of the shift. To address this challenge, we establish an extensible benchmark suite that induces synthetic distribution shifts using real clinical notes, and we develop two methods to assess generated shift explanations. We further introduce SIReNs, a general-domain, end-to-end approach that explains distributional differences between two datasets by selecting representative notes from each. Evaluated on both binary and continuous feature shifts, SIReNs recovers salient binary shifts well but struggles with subtler ones, and a substantial gap remains to a ground-truth oracle for continuous shifts, suggesting room for improvement in future methods.
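The core idea of explaining a shift by selecting representative notes can be illustrated with a simple embedding-space heuristic: compute the direction along which the two datasets' mean embeddings differ, then pick the notes from each dataset that lie furthest along that axis. This is a minimal sketch of the general idea, not the paper's actual SIReNs algorithm; the embeddings and the projection heuristic are assumptions for illustration.

```python
import numpy as np

def select_representatives(emb_a, emb_b, k=2):
    """Illustrative heuristic (not the paper's algorithm): pick the k
    notes from each dataset that best exemplify the distributional
    difference, by projecting onto the mean-shift direction."""
    # Direction of the shift between the two datasets' centroids
    direction = emb_b.mean(axis=0) - emb_a.mean(axis=0)
    direction /= np.linalg.norm(direction)
    # Score every note by its position along the shift axis
    scores_a = emb_a @ direction
    scores_b = emb_b @ direction
    reps_a = np.argsort(scores_a)[:k]          # most "A-like" notes
    reps_b = np.argsort(scores_b)[-k:][::-1]   # most "B-like" notes
    return reps_a, reps_b

# Toy note embeddings: dataset A clusters near the origin,
# dataset B is shifted along the first embedding dimension.
emb_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]])
emb_b = np.array([[1.0, 0.0], [1.2, 0.0], [0.9, 0.0]])
reps_a, reps_b = select_representatives(emb_a, emb_b)
print(reps_a, reps_b)  # indices of the most representative notes
```

In practice the note embeddings would come from a clinical language model, and the selection objective in the paper is richer than a single linear projection; this sketch only conveys the "explain a shift with exemplar notes" framing.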

Citation

Vaez-Ghaemi, Shariar, Furong Jia, and Monica Agrawal. “Interpreting Dataset Shift in Clinical Notes.” Machine Learning for Health 2025. 2025.

BibTex

@inproceedings{vaez2025interpreting, title={Interpreting Dataset Shift in Clinical Notes}, author={Vaez-Ghaemi, Shariar and Jia, Furong and Agrawal, Monica}, booktitle={Machine Learning for Health 2025}, year={2025} }
