Using Geographically Weighted Random Forests to Analyze County-level Diabetes Prevalence in the USA
Abstract
Diabetes poses a major public health challenge in the United States, ranking among the top ten leading causes of death. Its prevalence is closely tied to factors such as obesity and lifestyle behaviors, yet it varies significantly across geographic regions. Traditional regression models often fail to capture the full relationship between dependent and independent variables, especially when spatial heterogeneity is present. To understand county-level diabetes prevalence and its associated risk factors, researchers have employed spatial linear regression models, which have limitations, including the assumption of linear relationships and inadequate handling of multicollinearity. To address these limitations, this study employs a geographically weighted random forest (GW-RF) model, which combines random forests and locally weighted regression via a spatial weight matrix, as an exploratory and predictive tool. County-level diabetes prevalence data for the USA from 2010 to 2020, along with twelve independent variables, were divided into two time periods: before and after the National Health Interview Survey updates (referred to as the “historical” and “current” periods, respectively). These data were then used to explore the nature and pattern of county-level diabetes prevalence and to compare the performance of GW-RF against other global and spatially weighted models. We found that all geographically weighted models outperformed their non-spatial counterparts in both periods, indicating that spatial variation plays an important role in explaining county-level diabetes prevalence. Our results further indicate that the GW-RF model captures spatial heterogeneity and predicts diabetes prevalence more effectively than the other global and local models.
Compared to global ordinary least squares regression, global random forests, and geographically weighted ordinary least squares, our GW-RF model achieved higher R² values by 3.5%, 1.1%, and 0.6% (historical), and 2.3%, 0.5%, and 0.4% (current), as well as lower normalized root mean squared error values by 6.1%, 2%, and 1% (historical), and 0.8%, 0.3%, and 0.2% (current), respectively. We also found that, although the models generally performed well, their performance dropped in the current period. This decline may be because the current period showed less spatial autocorrelation in diabetes prevalence (historical Moran’s I: 0.559; current Moran’s I: 0.45). This shift in the underlying spatial patterns of diabetes could reflect known changes in survey methodology or actual epidemiological changes, both of which warrant further investigation. The findings also suggest that the GW-RF model can support health professionals and policymakers in making accurate projections, detecting emerging hotspots, and guiding targeted prevention and control efforts.
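The spatial autocorrelation statistic reported in the abstract, global Moran's I, can be illustrated with a minimal sketch. Everything in the example is an assumption made for illustration: the four-unit chain of neighboring areas, the prevalence values, and the binary adjacency weights are hypothetical, and the abstract does not state which weighting scheme the study used.

```python
def morans_i(values, weights):
    """Global Moran's I for n values and an n x n spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]            # deviations from the mean
    s0 = sum(sum(row) for row in weights)       # sum of all weights
    # Weighted cross-products of deviations between every pair of units
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)               # total squared deviation
    return (n / s0) * (num / den)


# Hypothetical chain of four areas: 0-1, 1-2, and 2-3 are neighbors.
chain_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = [10.0, 10.0, 2.0, 2.0]    # similar values sit next to each other
alternating = [10.0, 2.0, 10.0, 2.0]  # neighbors always differ

print(round(morans_i(clustered, chain_weights), 3))    # 0.333
print(round(morans_i(alternating, chain_weights), 3))  # -1.0
```

Values near zero indicate little spatial structure, positive values indicate that similar prevalence levels cluster together, and negative values indicate that neighbors tend to differ. A drop such as the one reported between the historical and current periods therefore means neighboring counties resemble each other less than before.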
Article Details
The Medical Research Archives grants authors the right to publish and reproduce the unrevised contribution in whole or in part at any time and in any form for any scholarly non-commercial purpose with the condition that all publications of the contribution include a full citation to the journal as published by the Medical Research Archives.