Balancing Privacy and Data Utility in Electronic Health Records: A Two-Stage Synthetic Data Generation Approach

Priyatham Chadalawada, Dr. Wisam Bukaita

Abstract

Electronic Health Record (EHR) data are essential for improving medical research, advancing patient care, and developing predictive healthcare models. However, the sensitive nature of EHR data raises significant privacy concerns that require multiple protective layers and mechanisms before the data can be used or analyzed. Traditional differential privacy techniques, while effective in safeguarding patient information, often introduce excessive noise that compromises data utility. To address this challenge, this study presents a composite method that balances privacy protection with data quality. In the first stage, an interval-based perturbation randomly adjusts each data point within a predefined range, introducing controlled variability that maintains the data's statistical integrity while reducing the risk of re-identification. Gaussian noise is then added to further strengthen privacy protection and keep the data differentially private. In the second stage, k-Nearest Neighbors (kNN) is used to generate fully synthetic datasets by modeling patterns among neighboring data points. This creates records that preserve the original dataset’s statistical properties and relational structures without retaining identifiable information. This two-stage approach ensures robust privacy while producing high-fidelity synthetic data suitable for complex analyses, such as predictive modeling and longitudinal studies. Looking forward, this method will enable secure data sharing across institutions, accelerate AI-driven healthcare innovations, and support privacy-conscious research, paving the way for a future where EHR data can be leveraged safely and effectively.
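
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the perturbation range, the noise scale, or the exact rule by which kNN produces new records, so the function names, the parameters `half_width`, `sigma`, and `k`, and the choice to form each synthetic record as a random convex combination of its nearest neighbours are all assumptions made for illustration. The noise scale here is not calibrated to any privacy budget, so the sketch carries no formal differential privacy guarantee. The input records are made-up toy values.

```python
import numpy as np

def stage_one_perturb(x, half_width, sigma, rng):
    """Stage 1 (assumed form): interval perturbation plus Gaussian noise."""
    # Shift each value uniformly within [-half_width, +half_width].
    interval_noise = rng.uniform(-half_width, half_width, size=x.shape)
    # Add zero-mean Gaussian noise on top of the interval perturbation.
    gaussian_noise = rng.normal(0.0, sigma, size=x.shape)
    return x + interval_noise + gaussian_noise

def stage_two_knn_synthesize(data, k, rng):
    """Stage 2 (assumed form): one synthetic row per input row, built as a
    random convex combination of that row's k nearest neighbours."""
    n = data.shape[0]
    # Pairwise Euclidean distances between all rows.
    diffs = data[:, None, :] - data[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbour_idx = np.argsort(dists, axis=1)[:, :k]   # shape (n, k)
    # Random non-negative weights that sum to 1 for each row.
    weights = rng.dirichlet(np.ones(k), size=n)        # shape (n, k)
    return np.einsum("nk,nkd->nd", weights, data[neighbour_idx])

def rmse(a, b):
    """Root mean square error between two equally shaped arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Toy data: 200 made-up records with two numeric features.
rng = np.random.default_rng(0)
records = rng.normal(loc=[120.0, 80.0], scale=[15.0, 10.0], size=(200, 2))

perturbed = stage_one_perturb(records, half_width=2.0, sigma=1.0, rng=rng)
synthetic = stage_two_knn_synthesize(perturbed, k=5, rng=rng)

print("RMSE original vs perturbed:", rmse(records, perturbed))
print("Column means, original:    ", records.mean(axis=0))
print("Column means, synthetic:   ", synthetic.mean(axis=0))
```

In this sketch no synthetic row copies an input row directly, and the RMSE between original and perturbed values gives a simple measure of how much utility the first stage gives up. A real deployment would need the noise scale derived from a chosen privacy budget and a separate assessment of re-identification risk.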

Keywords: Differential Privacy, Electronic Health Record (EHR), Perturbation, Time Series, Root Mean Square Error (RMSE), Confidence Interval, k-Nearest Neighbors (kNN)

Article Details

How to Cite
CHADALAWADA, Priyatham; BUKAITA, Dr. Wisam. Balancing Privacy and Data Utility in Electronic Health Records: A Two-Stage Synthetic Data Generation Approach. Medical Research Archives, [S.l.], v. 13, n. 10, oct. 2025. ISSN 2375-1924. Available at: <https://esmed.org/MRA/mra/article/view/6953>. Date accessed: 06 dec. 2025. doi: https://doi.org/10.18103/mra.v13i10.6953.
Section
Review Articles
