Machine learning enhances biomarker discovery: From multi-omics to functional genomics.
Abstract
Importance: Biomarkers are critical for precision medicine, supporting disease diagnosis, prognosis, personalized treatment, and monitoring. Traditional biomarker discovery methods, which often focus on single genes or proteins, face several challenges, including limited reproducibility, limited ability to integrate multiple data streams, high false-positive rates, and inadequate predictive accuracy. Machine learning, deep learning, and large language models, paired with advances in omics technologies, address these limitations by analyzing large, complex multi-omics datasets to identify more reliable and clinically useful biomarkers.
Observations: Machine learning and deep learning have proven effective in biomarker discovery by integrating diverse, high-volume data types such as genomics, transcriptomics, proteomics, metabolomics, imaging, and clinical records. These approaches have identified diagnostic, prognostic, and predictive biomarkers across fields including oncology, infectious diseases, neurological disorders, and autoimmune diseases. Newer methodological developments include approaches to identify functional biomarkers, notably biosynthetic gene clusters, which are crucial for discovering antibiotics and anticancer drugs. Key artificial intelligence (AI) techniques include neural networks, transformers, large language models, and feature selection methods, which are increasingly applied to omics data and in clinical settings. However, challenges remain regarding data quality, biological complexity, model interpretability, validation, and generalization. Regulatory and ethical considerations also affect clinical adoption, underscoring the importance of validated, trustworthy, and explainable AI methods.
Conclusions and Relevance: Machine learning, deep learning, and AI agent-based approaches significantly enhance biomarker discovery, providing valuable biological insights and advancing precision medicine. Future research should focus on directly linking genomic data to functional outcomes, particularly for biosynthetic gene clusters and non-coding RNAs. Rigorous validation, model interpretability, and regulatory compliance are essential for clinical implementation. These advances promise to improve personalized treatment strategies and patient outcomes.
Article Details
The Medical Research Archives grants authors the right to publish and reproduce the unrevised contribution in whole or in part at any time and in any form for any scholarly non-commercial purpose with the condition that all publications of the contribution include a full citation to the journal as published by the Medical Research Archives.