Advances in Human Genome Resolution: The Role of Pan-Genomic Strategies and Fine-Tuning Pre-trained Genomic Models

Duo Du, Yupeng Zhang, Fan Zhong, Lei Liu

Abstract

The discovery of the DNA double-helix structure greatly advanced molecular genetics, shaping and refining the central dogma and enabling researchers to explore genotype-phenotype regulation at multiple levels. With the continued advancement of third-generation sequencing technology, an increasing number of highly accurate human genomes have been assembled, such as T2T-CHM13 and HG002. These high-quality genome sequences not only provide a more comprehensive human reference sequence but also enable functional genomics studies within a unified coordinate system. To better explore and resolve the complex genetic information contained in human genome sequences, researchers have proposed two novel research strategies: graphical pan-genomes and pre-trained genomic models. Graphical pan-genomes provide high-quality population-level references, reveal genomic diversity within populations, and make it possible to explore the sequence complexity of specific regions, such as the KIR immune region. Concurrently, studies of pre-trained models on the human genome offer new perspectives for interpreting sequence function and uncovering hidden genetic codes, potentially leading toward complete decoding of DNA. Together, these two strategies have deepened our understanding of the human genome, fostered the development of bioinformatics ecosystems, and are expected to yield further insights and breakthroughs in the field. This review therefore focuses on DNA sequencing, human genome assembly, high-quality pan-genomes, and pre-trained genomic large language models (LLMs), summarizing the latest achievements and progress in human genome research, discussing existing challenges, and providing future perspectives.

Keywords: Human Genome, Pan-Genomics, Third-Generation Sequencing, Genomic Models, DNA Sequencing, Genome Assembly, Genomic Diversity, Functional Genomics, Graphical Pan-Genome, Pre-Trained Genomic Models

Article Details

How to Cite
DU, Duo et al. Advances in Human Genome Resolution: The Role of Pan-Genomic Strategies and Fine-Tuning Pre-trained Genomic Models. Medical Research Archives, [S.l.], v. 12, n. 7, July 2024. ISSN 2375-1924. Available at: <https://esmed.org/MRA/mra/article/view/5571>. Date accessed: 15 Nov. 2024. doi: https://doi.org/10.18103/mra.v12i7.5571.
Section
Review Articles
