Balancing Privacy and Utility in Healthcare Data Analysis
Balancing Privacy and Utility: A Two Stage Novel Approach to Differential Privacy in Electronic Healthcare Records Data
Dr. Wisam Bukaita1, Priyatham Chadalawada1
- Department of Mathematics and Computer Science, Lawrence Technological University Southfield, U.S.A
OPEN ACCESS
PUBLISHED: 31 October 2025
CITATION: Bukaita, W., and Chadalawada, P., 2025. Balancing Privacy and Utility: A Two Stage Novel Approach to Differential Privacy in Electronic Healthcare Records Data. Medical Research Archives, [online] 13(10). https://doi.org/10.18103/mra.v13i10.6953
COPYRIGHT © 2025 European Society of Medicine. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
DOI https://doi.org/10.18103/mra.v13i10.6953
ISSN 2375-1924
Abstract
Electronic Healthcare Records data are essential for improving medical research, advancing patient care, and developing predictive healthcare models. However, the sensitive nature of Electronic Healthcare Records data raises significant privacy concerns, that require multiple protective layers and mechanisms before allowing the utilization of this data and conducting any analysis. The traditional differential privacy techniques, whilst effective in safeguarding the patient information, however, often introduce unreasonable noise that usually compromises the data utility. To address this type of challenge, this study presents a composite method which balances the privacy protection with the data quality. The process introduced in this study would involve in applying a random noise as an interval-based perturbation technique as an initial step, randomly it adjusts the data points within a predefined range to construct controlled variability. This step will maintain its statistical integrity even while allaying the risk of re-identification. A random noise distributed normally called Gaussian noise is added to enhance privacy protection further. By combining these two techniques, the suggested approach effectively perturbs the dataset while preserving its utility for meaningful analysis. This research introduces a new approach to creating synthetic data that serves privacy and analytical reliability in Electronic Healthcare Records data analysis.
Keywords
Differential Privacy, Electronic Health Record (HER), Perturbation, Root Mean Square Error (RMSE), Confidence Interval.
1. Introduction
The maximum increasing of the adoption of the Electronic Health Records (EHRs) has to be transformed to the landscape of the healthcare data collection, storage, and utilization. These kinds of digital records offer longitudinal, structured, and diverse clinical data which are very essential not just for the patients care but also for the advancing medical research, optimizing treatment strategies, and informing healthcare policies. As noted by the Blumenthal et al., EHR systems plays a vital role in generating the patient cohorts for research, thereby supporting to the broader agenda of a meaningful use in the healthcare systems. Moreover, the EHR data has been proven especially as valuable during the global health emergencies like the COVID-19 pandemic, where the real-world data was more critical in order to assess the drug efficacy and also for tracking the disease progression, as highlighted by Dagliati et al. However, this integration of the EHRs into research raises a significant ethical and legal concerns which are related to data privacy and also the patient confidentiality. Even despite of the technological advancements in the data analysis, the risk of just re-identification remains as a pressing issue. Cowie et al. emphasizes that even while data-driven innovations in the healthcare are advancing, robust safeguards are essential in order to protect the patients’ information. Under the regulations such as the Health Insurance Portability and Accountability Act (HIPAA), the datasets containing a Protected Health Information (PHI) must be as the de-identified or as transformed in a way to ensure that individuals cannot be traced.
One of the most widely adopted frameworks for the privacy preservations in the data publishing and also in the analysis is Differential Privacy (DP). DP generally provides the mathematically rigorous guarantees against the re-identification just by introducing the calibrated random noises into the query responses or the data outputs. However, a common challenge in applying the traditional DP techniques, such as Gaussian mechanism, it’s the trade-off between the privacy and the utility higher privacy which guarantees typically that requires more noise, this can be significantly degrade data quality and the analytical performance.
To address this challenge, our research introduces a novel two-stage perturbation approach which is designed to enhance both privacy and data utility. This method generally involves by initially injecting uniformly distributed bounded random noises targeted to maintain core statistical characteristics, followed by the application of the Gaussian noises which are based on established differential privacy parameters (ε, δ, Δf). The rationale behind this dual-step design is just to reduce the overall scale of the Gaussian noises which are needed in the second stage which is by mitigating the sensitivity in the first stage, thereby preserving more of the data’s utility.
2. Literature Review
Research on privacy-preserving techniques for Electronic Health Records (EHRs) has produced a wide range of approaches, including de-identification, noise addition, data swapping, differential privacy mechanisms, and hybrid perturbation methods. These works collectively highlight the ongoing tension between protecting patient confidentiality and maintaining analytical utility of healthcare data.
A. De-Identification and Anonymization Approaches:
Data de-identification is the process of removing or altering the personal identifiers within the datasets to safeguard our individual privacy. Meystre et al. in their research they have identified the Quasi Identifiers which are sensitive PHI and which needs to be removed to make any data analysis. Early works demonstrated that removal of quasi-identifiers could safeguard individuals while retaining analytical value. However, subsequent evaluations revealed the risk of re-identification under linkage attacks. Im et al. (2024) extended this research in a clinical use case, showing that some quasi-identifier removal methods preserve more utility than others, though stronger de-identification methods often compromise usability. Jiang et al. (2021) similarly systematized data-sharing infrastructures, comparing anonymization, DP, and secure architectures to understand their trade-offs in medical research contexts.
B. Data Swapping:
Data swapping is also the same but it has another effective method for data anonymization. It will involve in exchanging the values between the records of a dataset to obscure the original data even while preserving the statistical properties. This technique is generally mainly used for maintaining the data utility for the analysis. Yidong Li and Hong Shen have explored the possibility of using the equi-width data swapping to preserve some parametric and some non-parametric statistics and have proved it by using covariance between the attributes and through multivariate histogram and proved that his approach of Data Swapping is so much more accurate than Equi-depth Swapping.
C. Noise Addition and Perturbation Mechanisms:
Noise perturbation has been central to privacy preservation. Chen et al. (2022) introduced the Haar wavelet transform with Gaussian DP, which reduced variance in range queries and enhanced aggregated query utility. Xiong et al. (2019) proposed variant noise in Gaussian process classification, scaling noise by data density to improve classification accuracy. Rai and Varsney (2023) expanded on this by exploring multiplicative Gaussian noise under multi-level trust settings, showing that trust-level–based scaling offers a nuanced balance between privacy and utility. Earlier contributions, such as Balasubramaniam et al. (2015), employed geometric perturbation (rotation and projection) for health data, demonstrating strong privacy guarantees but risks of distorting attribute magnitudes and clinical meaning. The goal of Hewage et al. (2023) is to obscure an individual data points even while allowing it for the aggregate analysis. A random noise addition obscures the individual data points, making it very difficult for adversaries to infer the sensitive information.
D. Differential Privacy (DP) and Its Enhancements
DP has emerged as a leading paradigm, offering rigorous privacy guarantees. Balle and Wang (2018) analytically calibrated Gaussian noise to reduce variance, improving performance in high-dimensional data. Li and Shen (2023) examined DP in a medical context, combining perturbation with Random Forest classifiers and finding robust classification despite noise. Shen et al. (2022) integrated DP with federated learning for mobile health, revealing that hybrid models can withstand simulated privacy attacks while maintaining performance.
In scenarios requiring data perturbation with a focus on simplicity and minimal privacy, Data De-identification, Data Swapping, Ransom Noise addition are employed and for scenarios where both Privacy and Security are prominent and theoretical guarantees are necessary efficiency, differential privacy emerges as a practical solution due to its Privacy guarantees which preserves the statistical properties of the dataset and generated robust private data which is impossible for Re-identification. Dong et al. in their research emphasizes that while Gaussian noise can provide robust privacy guarantees, it also introduces challenges in maintaining data utility due to high noise introduced by the Privacy Budget variable (ε).
Cynthia Dwork et al. proposed the equation (1) for differential privacy using Gaussian distribution, also known as the normal distribution. It is defined by a variance (σ²), where a random variable η is centered around the mean (µ), typically 0, and has a standard deviation (σ). The standard deviation of the noise is computed using the formula.
In this research the above equation is used to generate Gaussian noise. And to understand more about each parameter in the equation, sensitivity, Privacy Budget and Probability of failure are discussed below.
A. Sensitivity(Δf)
Sensitivity measures is to know on how much the output of a function f can be changed when one of the individual’s data is added or removed from the dataset which is to be defined by (2). Usually for determining the sensitivity, there are two cases which can be considered. Either the size of D and D′ and each one are differed by one record or D and D′ are of the same size but it also differs by one of the records within the dataset. In our research we have considered same size with a change in one record to calculate the sensitivity. Sensitivity would determine that how much noise it needs to be added to the output of f to ensure the differential privacy. Higher sensitivity requires more noise to obscure the impact of any single individual.
Δf = maxD,D′ |f(D) – f(D′)|
B. Privacy Budget(∈)
The parameter ∈ (epsilon) will be controlling the strength of the privacy guarantee. It can quantifies that how much information about an individual can be leaked just by the mechanism. The smaller the ∈, the stronger the privacy is This can be computed by using (3). Given two neighboring datasets D and D′, the privacy loss “LY,D,D′” quantifies as to how much more or less likely it is for a randomized algorithm Y to produce a specific output “y” even when applied to D compared to D′, it is expressed as a multiplicative ratio of their output probabilities. ∈ is given to the ratio of the probabilities for observing the same output onto the two neighboring datasets.
Formally, for any output
LY,D,D′(y) = Pr[Y(D) ∈ y] / Pr[Y(D′) ∈ y] ≤ e∈
where Y is the mechanism, and Pr is Probability density function, D and D′ are before and after the noises addition. In our second stage, this will be onto another important factor even while in combination with the sensitivity to introduce the gaussian noise into the dataset.
C. Probability of Failure(δ)
The parameter δ (delta) here is a small probability which would allow a slight relaxation of the strict ∈ differential privacy guarantee. The smaller the delta, the lower the probability of failure. It holds an accountability for the possibility even when the privacy guarantee might fail with probability. δ usually allows for the rare events where the privacy guarantee might not be on hold.
Mathematically this equation given by Dwork et al. as
F(δ) = √2ln(1.25/δ)
The traditional methods such often rely solely just on the addition of the Gaussian noise to achieve a perturbing of a dataset. However, in the recent studies suggest that the incorporating hybrid techniques can yield more as per the robust models. For example, there’s a technique which has been introduced by the Denham et al. (2020) which is Random projection-based cumulative noises which are addition to a combination which combines the three perturbation techniques, random projection, translation, and noise addition. This hybrid approach has achieved a good accuracy and the privacy. In our research, we have been implementing a similar technique by using combination of the techniques.
3. Material And Methods
Numerical datasets are commonly particular therefore it is well-suited for perturbation techniques, as they are inherently involving the mathematical calculations which would allow for the controlled modifications even while preserving the statistical properties. It is inspired by the Hybrid approaches which are used in the differential privacy researches and the Disclosure Avoidance System (DAS), this study has proposed a two-layer hybrid approach to the perturb data, ensuring it’s both privacy protection and the data utility. In this proposed method, two sequential perturbation techniques have been applied to balance data utility and patient privacy in EHR data analysis.
A. First Layer – Introduction of Random Noise
To determine the appropriate perturbation intervals for medical diagnostic values, our strategy involves selecting ranges grounded in clinical plausibility and typical biological variability. Rather than applying a uniform percentage across all variables shown in Table 1, the perturbation bounds are tailored for each test by referencing established clinical ranges—thresholds that indicate medical significance or risk. For instance, body temperature may be perturbed within ±0.5°C to simulate realistic fluctuations without crossing into fever or hypothermic zones, whereas heart rate may allow for a broader perturbation range such as ±15 bpm.
By incorporating these clinically informed perturbation limits using uniformly sampled noise within each test-specific interval given in Table 1, the modified data retains medical validity and remains context-aware. This preserves the semantic integrity of the data and ensures it avoids distortions that could imply clinically implausible conditions. The first-stage perturbation method is designed to preserve essential statistical properties of the original dataset, particularly the mean and variance within acceptable clinical margins. Since the perturbations are symmetric around the true value. As a result, when applied across large samples, the expected value of the perturbed data closely approximates the original mean, and the overall distribution retains its shape. While this step introduces bounded randomness to individual data points, the aggregation of such perturbations results in minimal distortion to population-level statistics such as mean, standard deviation, and distribution spread. This balance ensures that analytical tasks—such as time series modeling, anomaly detection, or public health trend analysis—remain valid and reliable, even when performed on perturbed data. Thus, although this stage is not a formal differential privacy mechanism, it plays a crucial role in preserving data utility while introducing uncertainty that mitigates re-identification risk.
| Test | Normal Range | Perturbation Interval | Notes |
|---|---|---|---|
| Blood Pressure | 90/60 – 120/80 mmHg | ±5–15 (systolic), ±5–10 (diastolic) | > 140/90 for simulating hypertension |
| Heart Rate | 60 – 100 bpm | ±5–15 bpm | >120 considered tachycardic |
| Body Temperature | 36.1 – 37.2 °C | ±0.2–0.5 °C | >38 = fever, <35 = hypothermia |
| Oxygen Saturation (SpO₂) | 95 – 100 % | ±1–3% | <92% is considered Hypoxemia |
| Blood Glucose (Fasting) | 70 – 99 mg/dL | ±5–15 mg/dL | >125 is considered hyperglycemia |
| Cholesterol (Total) | <200 mg/dL | ±10–25 mg/dL | >240 is considered high risk |
| BMI | 18.5 – 24.9 kg/m² | ±0.5–2 kg/m² | >30 = obesity, avoid values <15 |
| Hospital Stay | 3 – 7 days | ±1–3 days | Depends on use case |
The proposed research particularly focuses on time series data, which supports longitudinal analysis such as tracking disease progression or evaluating treatment outcomes. When aggregated across populations, these time series values can inform public health trends, predictive modeling, and resource forecasting. By introducing clinically bounded perturbations, the resulting synthetic data can be safely used for machine learning and anomaly detection while maintaining patient privacy and preserving critical temporal patterns. Unlike the approach introduced by recent studies such as Denham et al. (2020), which employs a combination of random projection, translation, and Gaussian noise, the methodology proposed in this study avoids geometric transformations and dimensionality reduction. While Denham’s framework is particularly effective for high-dimensional, unstructured data where projections can obscure original feature spaces, our approach is tailored specifically for structured numerical healthcare data. The focus is on preserving the interpretability, clinical relevance, and contextual integrity of each attribute. In the first stage, random noise is introduced within clinically defined intervals, ensuring the modified values remain medically plausible. This step is guided by domain-specific knowledge rather than arbitrary or uniform bounds, thereby supporting the preservation of data utility. The second stage adds Gaussian noise in accordance with formal differential privacy mechanisms. When combined, the two stages operate under a compositional framework that supports (ε, δ)-differential privacy guarantees, allowing for a balanced trade-off between privacy protection and analytical reliability. Together, these stages constitute a hybrid privacy-preserving mechanism that offers a more interpretable and context-sensitive alternative to existing approaches. For non-numerical data such as names, addresses, or contact information, either quasi-identifiers are removed or replaced with synthetic values before release.
B. Second Layer – Application of Gaussian Differential Privacy
Following the initial perturbation, Gaussian noises, which adheres to the differential privacy principles, has been introduced. This layer can ensure that the privacy guarantees of the Gaussian mechanism are to be preserved even while addressing one of its key challenges—the high level of these noises can be affecting the data utility.
C. Balance between Privacy and Utility in Sensitive Data
A crucial application of this kind of a method is in just handling the Electronic Health Records (EHRs), where both the privacy and the utility are of the paramount importance. Some strict privacy protections are usually necessary to comply within the regulations like the Health Insurance Portability and Accountability Act (HIPAA), which mandates that the stringent safeguards against the unauthorized data exposure. Simultaneously, EHR data must be retaining the statistical integrity to just support the critical research, clinical studies, and data-driven healthcare decisions. Excessive perturbation could be distorted key for the statistical parameters, which leads to inaccurate analyses, even while insufficient privacy measures could compromise the sensitive patient information. This necessitates are very delicate balance between the privacy preservation and by maintaining the statistical utility.
D. Justification for the Two-Layer Approach
The primary motivation for introducing an initial random noise layer is to pre-condition the dataset, thereby reducing the magnitude of Gaussian noise required in the second stage while still achieving meaningful privacy protection. Dong et al. (2020) noted that one of the critical limitations of the Gaussian mechanism under differential privacy is the significant degradation in data utility due to the high magnitude of noise necessary for strong privacy guarantees. Directly reducing the noise scale in the Gaussian mechanism can compromise its differential privacy guarantee by weakening the privacy budget (ε) or increasing the risk of re-identification. To address this, the proposed two-stage approach applies a minimal, domain-informed random perturbation in the first layer, which introduces controlled randomness while preserving essential statistical properties such as range and plausible variance.
This preconditioning step enables the subsequent application of the Gaussian mechanism with a smaller noise scale, corresponding to a reduced sensitivity (Δf) and privacy budget (ε). As a result, the final dataset can retain more analytical value without compromising privacy guarantees, assuming composition theorems are applied appropriately.
4. Results
In our experiment, we analyzed a 100-day timeseries of daily heart rate data. Table 2 presents the results from a single-stage Gaussian perturbation applied over a seven-day window, while Table 3 illustrates the outcome of the proposed two-stage method over the same period. Quantitative evaluation reveals that the two-stage method results in a lower root mean square error (RMSE) and a narrower 95% confidence interval compared to the single-stage Gaussian mechanism. While a narrower confidence interval typically indicates reduced variability and greater precision, this may also imply that the data has been overly smoothed. We acknowledge this trade-off and emphasize that preserving variance is not universally desirable—it depends on the application context. In clinical settings, for example, some smoothing may be acceptable if the underlying patterns (e.g., trends, thresholds) remain intact for decision-making.
| Date | HeartRate Original unperturbed values | HeartRate perturbed values Using Gaussian | Difference |
|---|---|---|---|
| 3/1/2020 | 98 | 92 | 6 |
| 3/2/2020 | 88 | 86 | 2 |
| 3/3/2020 | 74 | 69 | 5 |
| 3/4/2020 | 67 | 64 | 3 |
| 3/5/2020 | 80 | 88 | -8 |
| 3/6/2020 | 98 | 100 | -2 |
| Date | HeartRate Original unperturbed values | HeartRate First stage based random noise | HeartRate perturbed using two stages | Difference |
|---|---|---|---|---|
| 3/1/2020 | 98 | 95 | 100 | -2 |
| 3/2/2020 | 88 | 90 | 78 | 10 |
| 3/3/2020 | 74 | 71 | 75 | -1 |
| 3/4/2020 | 67 | 66 | 63 | 4 |
| 3/5/2020 | 80 | 84 | 79 | 1 |
| 3/6/2020 | 98 | 100 | 89 | 9 |
To strengthen the privacy evaluation, it is proposed augmenting the analysis with formal privacy metrics. Specifically, incorporating ε and δ values computed using the Gaussian mechanism’s privacy accountant would offer a concrete measure of privacy strength. Tables 2 and 3 illustrate the effects of the one-stage (Gaussian-only) and the proposed two-stage perturbation mechanisms on daily heart rate values over a sample period of seven days.
Table 2 shows the results of applying a single layer of Gaussian noise to the original heart rate data. The “HeartRate” column contains the original unperturbed values. The “Gaussian” column lists the corresponding values after adding Gaussian noise sampled from a distribution calibrated according to a fixed (ε, δ)-differential privacy budget. The “Difference” column is calculated as the original value minus the perturbed value (Difference = HeartRate – Gaussian). This difference represents the absolute distortion introduced by the Gaussian mechanism. For instance, on 3/1/2020, the heart rate dropped from 98 to 92 due to Gaussian noise, resulting in a difference of +6.
Table 3 provides a comparative view of the proposed hybrid method. Here, the data undergoes a first stage of interval-based random noise before being further processed with Gaussian noise in the second stage. The column “HeartRate” indicates the original values. The HeartRate First stage based random noise column shows the results after the first-stage noise, which is sampled uniformly from clinically defined intervals (e.g., ±15 bpm for heart rate). The “Two Stage” column presents the final values after the second Gaussian perturbation. The “Difference” column in this case refers to the total difference between the final output after both stages and the original input (Difference = HeartRate – Two Stage). For example, on 3/1/2020, the original heart rate was 98. After the first stage, it became 95, and after applying Gaussian noise in the second stage, it reached 100. Thus, the final difference is -2, indicating that the final value was slightly higher than the original.
This two-stage setup allows for a more nuanced privacy-preserving process. The initial randomization ensures a degree of variability while remaining within medically plausible limits, and the second stage guarantees formal differential privacy compliance. By distributing the privacy burden across two mechanisms, the second-stage Gaussian perturbation can use a smaller noise scale, preserving more of the data’s utility while maintaining a high level of privacy protection.
Evaluation Metrics for Measuring Privacy and Utility
To quantify the effectiveness of the above two approaches, few statistical measures are used:
- Standard Errors (SE) and Confidence Intervals (CI): the use of standard errors and confidence intervals is essential for quantifying the statistical uncertainty introduced by differential privacy mechanisms. The metrics used in this study serve as indicators of how perturbation affects data variability and estimation reliability. By comparing SE and CI values before and after applying privacy-preserving noise, researchers can assess not only the extent of distortion but also whether the resulting data maintains sufficient analytical precision. However, it is important to interpret narrower confidence intervals with caution, as they may suggest reduced variance that could either indicate improved consistency or excessive smoothing—potentially affecting the validity of downstream analyses. Therefore, SE and CI should be considered in conjunction with privacy metrics (e.g., ε, δ) to evaluate the trade-offs between utility and privacy in a comprehensive manner.
- Root Mean Squared Error (RMSE): is a key utility metric used in this study to assess how closely the perturbed datasets resemble the original data. Lower RMSE values, as observed in the two-stage approach, suggest that the structural characteristics of the original dataset are better preserved, which is favorable for maintaining utility in downstream analytical tasks such as machine learning or statistical inference. However, while reduced RMSE indicates improved data fidelity, it is important to interpret this result in the context of privacy guarantees. A very low RMSE could imply minimal distortion, raising the question of whether sufficient noise has been applied to meaningfully protect individual-level privacy. Therefore, RMSE alone cannot fully characterize the success of a privacy-preserving method. It must be considered alongside formal privacy metrics—such as the differential privacy parameters (ε, δ)—or empirical tests like membership inference attacks, to ensure that the level of protection remains robust. In this work, the reduced RMSE achieved by the two-stage approach illustrates better utility retention, but future studies should also explicitly quantify how this utility gain aligns with or affects the overall privacy budget.
Tables 5 and 6 present the forecasting accuracy metrics for both the Gaussian-only and the proposed two-stage mechanisms. Notably, the Root Mean Square Error (RMSE) is lower in the two-stage approach, suggesting that the perturbed data more closely aligns with the original values. This implies improved model accuracy and greater reliability for predictive tasks.
| Model | RMSE | MAE | MAPE | MASE | RMSSE |
|---|---|---|---|---|---|
| ARIMA (HeartRate) | 14.86908 | 12.22437 | 16.13931 | 1.04935 | 1.037862 |
| Drift | 60.08833 | 55.32203 | 69.49214 | 4.748886 | 4.194166 |
| Mean | 12.74094 | 10.73762 | 14.25469 | 0.921725 | 0.889318 |
| Naive | 24.34 | 21.08602 | 24.99317 | 1.81004 | 1.698932 |
| Seasonal Naive | 15.9509 | 13.08413 | 17.16669 | 1.123152 | 1.113373 |
| Model | RMSE | MAE | MAPE | MASE | RMSSE |
|---|---|---|---|---|---|
| ARIMA (HeartRate) | 11.82201 | 9.9 | 12.80059 | 0.848571 | 0.90493 |
| Drift | 62.14210 | 56.85 | 72.10922 | 4.872857 | 4.756764 |
| Mean | 11.82201 | 9.9 | 12.80059 | 0.848571 | 0.904934 |
| Naive | 20.87103 | 17.3333 | 20.46726 | 1.485714 | 1.59760 |
| Seasonal Naive | 15.30196 | 12.1166 | 15.08063 | 1.038571 | 1.171312 |
While the two-stage method demonstrates stronger analytical performance, it is essential to validate that the approach still meets formal privacy thresholds—such as acceptable (ε, δ) values—or passes empirical privacy tests (e.g., membership inference resistance). This ensures that the observed utility improvements do not come at the cost of weakened privacy protection. Table 6 presents the performance metrics of several forecasting models applied to the two-stage differentially private heart rate dataset. A notable observation is that the ARIMA and Mean models yield identical results across all metrics, which may initially seem unexpected given the inherent complexity of ARIMA relative to the simplicity of a constant mean forecast. However, upon further investigation, this convergence can be explained by two factors:
First, the dataset used for forecasting spans a short time horizon—seven days of daily heart rate measurements—and exhibits limited variability due to the smoothing effect of the two-stage perturbation process. The result is a series with a strong central tendency and minimal temporal fluctuation. In such conditions, the ARIMA model selected by the auto-ARIMA procedure tends to converge toward a model with negligible autoregressive or moving average terms, effectively functioning as a constant mean predictor. Second, the two-stage privacy mechanism, which applies context-aware random perturbation followed by Gaussian noise, appears to preserve the statistical stationarity of the heart rate series. The reduced variance and absence of strong trends or seasonality in the perturbed data diminish the comparative advantage of more complex models like ARIMA.
To validate this behavior, the models were also tested over longer time windows. As expected, when extended sequences with greater variability were analyzed, the ARIMA model diverged from the mean model and demonstrated more distinct forecasting behavior. Nevertheless, for the short-duration, perturbed dataset used in this experiment, the similarity in metrics is accurate and reflects the structure of the input data rather than an error in implementation or reporting. To enhance transparency, we explicitly note this convergence of ARIMA and Mean model results in Table 6. While the finding may appear counterintuitive, it in fact underscores the effectiveness of the two-stage privacy-preserving mechanism in retaining essential data structure.
5. Discussion
This study demonstrates that the proposed two-stage perturbation method offers a better balance between privacy and data utility compared to a traditional single-stage Gaussian differential privacy mechanism. By introducing bounded, domain-informed random noise in the first stage, the magnitude of Gaussian perturbation required in the second stage is reduced. This results in lower RMSE, narrower confidence intervals, and improved forecasting accuracy while maintaining medically plausible data values.
The comparative results show that the two-stage method preserves temporal patterns and statistical properties more effectively, supporting downstream tasks such as forecasting and anomaly detection. Importantly, the approach distributes the privacy budget across two mechanisms, creating a more efficient trade-off between privacy guarantees and analytical usability.
While the method shows strong potential, limitations remain. Over-smoothing could obscure subtle but meaningful signals, and further validation is needed on high-dimensional, multimodal healthcare datasets. Future work should also assess formal privacy parameters (ε, δ) under composition and evaluate resistance to adversarial attacks.
6. Conclusion
The proposed two-stage perturbation method—first applying clinically informed interval-based random noise, followed by Gaussian differential privacy noise—effectively balances privacy protection and data utility in EHR datasets. Experimental results demonstrate that this approach reduces RMSE from 14.87 (Gaussian-only) to 11.82 (two-stage), lowers the standard deviation from 13.04 to 11.64, and narrows the 95% confidence interval from [77.14, 82.25] to [76.59, 81.15], indicating improved statistical fidelity and reliability.
By preserving temporal patterns and essential statistical properties, the method supports downstream tasks such as forecasting and anomaly detection while maintaining (ε, δ)-differential privacy guarantees. The dual-layered strategy also allows customizable privacy-utility trade-offs depending on dataset sensitivity and intended use. Overall, these results suggest that the hybrid perturbation technique provides a practical, high-utility solution for privacy-preserving analysis of sensitive healthcare data.
7. References
- Blumenthal D, Tavenner M. The “Meaningful Use” regulation for electronic health records. N Engl J Med. 2010;363(6):501–504. doi:10.1056/NEJMp1006114
- Dagliati A, Malovini A, Tibollo V, Bellazzi R. Health informatics and EHR to support clinical research in the COVID-19 pandemic: an overview. Brief Bioinform. 2021;22(2):812–822. doi:10.1093/bib/bbaa418
- Cowie M, Blomster J, Curtis LH, et al. Electronic health records to facilitate clinical research. Clin Res Cardiol. 2016;106(1):1–9. doi:10.1007/s00392-016-1025-6
- Meystre SM, Friedlin FJ, South BR, et al. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Med Res Methodol. 2010;10:70. doi:10.1186/1471-2288-10-70
- Li Y, Shen H. Equi-width data swapping for private data publication. In: 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies. Higashi-Hiroshima, Japan: IEEE; 2009:231–238. doi:10.1109/PDCAT.2009.69
- Hewage UHWA, Sinha R, Naeem MA. Privacy-preserving data (stream) mining techniques and their impact on data mining accuracy: a systematic literature review. Artif Intell Rev. 2023;56(9):10427–10464. doi:10.1007/s10462-023-10425-3
- Dong J, Su W, Zhang L. A central limit theorem for differentially private query answering. 2021. doi:10.48550/arxiv.2103.08721
- Dwork C, McSherry F, Nissim K, Smith A. Calibrating noise to sensitivity in private data analysis. J Priv Confid. 2017;7(3). doi:10.29012/jpc.v7i3.405
- Zhao J, Wang T, Bai T, et al. Reviewing and improving the Gaussian mechanism for differential privacy. arXiv. 2019. https://arxiv.org/abs/1911.12060
- Geng Q, Viswanath P. The optimal noise-adding mechanism in differential privacy. IEEE Trans Inf Theory. 2016;62(2). doi:10.1109/TIT.2015.2504967
- Denham B, Pears R, Naeem MA. Enhancing random projection with independent and cumulative additive noise for privacy-preserving data stream mining. Expert Syst Appl. 2020;152:113380. doi:10.1016/j.eswa.2020.113380
- Abowd JM, Ashmead R, Cumings-Menon R, et al. Disclosure avoidance in practice: the U.S. Census Bureau’s disclosure avoidance system for the 2020 decennial census. 2022. doi:10.1162/99608f92.529e3cb9
- Chen D, Li Y, Chen J, Bi H, Ding X. Differential privacy via Haar wavelet transform and Gaussian mechanism for range query. Comput Intell Neurosci. 2022;2022:8139813. doi:10.1155/2022/8139813
- Xiong Z, Li L, Yan J, Wang H, He H, Jin Y. Differential privacy with variant-noise for Gaussian processes classification. In: Nayak AC, Sharma A, eds. PRICAI 2019: Trends in Artificial Intelligence. Lecture Notes in Computer Science. Vol 11672. Springer; 2019:107–119. doi:10.1007/978-3-030-29894-4_9
- Im E, Kim HJ, Lee HS, et al. Exploring the tradeoff between data privacy and utility with a clinical data analysis use case. BMC Med Inform Decis Mak. 2024;24(1):147. doi:10.1186/s12911-024-02545-9
- Adams T, Birkenbihl C, Otte K, et al; Alzheimer’s Disease Neuroimaging Initiative. On the fidelity versus privacy and utility trade-off of synthetic patient data. medRxiv. 2024. doi:10.1101/2024.12.06.24317239
- Balle B, Wang YX. Improving the Gaussian mechanism for differential privacy: analytical calibration and optimal denoising. arXiv. 2018. https://arxiv.org/abs/1805.06530
- Jiang X, et al. Privacy-preserving data sharing infrastructures for medical research: systematization and comparison. BMC Med Inform Decis Mak. 2021;21(1):242. doi:10.1186/s12911-021-01602-x
- Shen A, Francisco L, Sen S, Tewari A. Exploring the privacy-utility tradeoff in differentially private federated learning for mobile health: a novel approach using simulated privacy attacks. medRxiv. 2022. doi:10.1101/2022.10.17.22281116
- Rai RK, Varsney M. Gaussian noise multiplicative privacy for data perturbation under multi-level trust. Int J Intell Syst Appl Eng. 2023.
- Li Y, Shen H. A differential privacy perturbation with random forest classifier in medical database. In: Proc Int Conf Recent Trends Comput (ICRTC 2023). Springer; 2023.
- Balasubramaniam N, et al. Geometric data perturbation-based personal health record transactions in cloud computing. Sci World J. 2015;2015:927867. doi:10.1155/2015/927867