Research Projects Directory

17,687 active projects

This information was updated 4/27/2025

The Research Projects Directory includes information about all projects that currently exist in the Researcher Workbench to help provide transparency about how the Workbench is being used. Each project specifies whether Registered Tier or Controlled Tier data are used.

Note: Researcher Workbench users provide information about their research projects independently. Views expressed in the Research Projects Directory belong to the relevant users and do not necessarily represent those of the All of Us Research Program. Information in the Research Projects Directory is also cross-posted on AllofUs.nih.gov in compliance with the 21st Century Cures Act.

Binge-Drinking “Contagion”

Scientific Questions Being Studied

Question. How does binge-drinking behavior spread through social and neighborhood networks in adults aged 25-60, and how does that spread shape early cardiovascular-disease (CVD) risk?
Why it matters. Binge drinking has shifted from college years into early mid-life, driving preventable hypertension, stroke, and heart disease—especially in communities that already shoulder disproportionate health burdens. Yet most studies look at people in isolation and miss the “contagion” nature of heavy drinking. By using the nationwide, highly diverse All of Us data set, we can follow real-world patients over time, map how drinking patterns cluster, and link those patterns to blood-pressure, lipid, and diabetes readings already collected in electronic health records (EHRs). Understanding these connections will help public-health agencies design smarter, more equitable interventions—such as targeting key peer influencers rather than applying one-size-fits-all policies.

Project Purpose(s)

  • Disease Focused Research (Cardiovascular Risk)
  • Population Health
  • Social / Behavioral
  • Educational
  • Methods Development

Scientific Approaches

We will build a cohort of All of Us participants aged 25-60 who have at least one alcohol-use survey response or EHR code from 2018-2024 and at least 12 months of clinical follow-up. Step 1: clean and merge survey alcohol measures, vital signs, and lab results; derive neighborhood deprivation indices from ZIP codes. Step 2: use machine-learning “similarity” models (graph neural networks) to infer peer ties based on shared traits and geography. Step 3: embed those ties into an open-source agent-based simulation that treats binge drinking like an infectious process (people can move from non-drinker → moderate → binge under peer influence). Each simulated agent’s cardiovascular risk will update yearly using validated CVD prediction equations fed by their own AoU blood-pressure, lipid, and HbA1c data. We will test intervention scenarios—tax increases, social-media campaigns, and combined strategies—and examine effects on CVD events and racial/ethnic risk gaps.
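The contagion step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's simulation: the names `step` and `simulate`, the parameters `base_rate` and `peer_effect`, and the rule that an agent's chance of moving up one level grows with the share of binge-drinking peers are all hypothetical. Only upward moves (non-drinker → moderate → binge) are modeled, and the yearly CVD risk update is left out.

```python
import random

# Drinking states in escalating order, as named in the approach above.
STATES = ("non_drinker", "moderate", "binge")


def step(states, ties, base_rate=0.02, peer_effect=0.10, rng=None):
    """Advance every agent by one time step (synchronous update).

    states: dict mapping agent id -> state name
    ties:   dict mapping agent id -> list of peer ids (inferred peer ties)
    Assumed rule: an agent moves up one level with probability
    base_rate + peer_effect * (share of its peers who currently binge).
    """
    rng = rng or random.Random()
    new_states = {}
    for agent, state in states.items():
        peers = ties.get(agent, [])
        binge_share = (
            sum(states[p] == "binge" for p in peers) / len(peers) if peers else 0.0
        )
        p_up = min(1.0, base_rate + peer_effect * binge_share)
        level = STATES.index(state)
        if level < len(STATES) - 1 and rng.random() < p_up:
            level += 1
        new_states[agent] = STATES[level]
    return new_states


def simulate(states, ties, years, seed=0, **kwargs):
    """Run the model for a number of yearly steps and return the full history."""
    rng = random.Random(seed)
    history = [dict(states)]
    for _ in range(years):
        states = step(states, ties, rng=rng, **kwargs)
        history.append(dict(states))
    return history
```

An intervention scenario could then be approximated by lowering `base_rate` or `peer_effect` for selected agents and comparing the resulting histories.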

Anticipated Findings

We expect heavy drinking to cluster around highly connected “hub” individuals, accelerating CVD risk beyond traditional models. Network-targeted interventions—like reaching the 10% of participants with the greatest peer influence—should reduce future heart attacks and strokes more effectively than blanket alcohol-tax increases while narrowing the risk gap between Black and White participants. The study will produce three outputs: (1) a reproducible All of Us dataset and code notebook for other researchers; (2) an interactive simulation tool for local health departments to test policies; and (3) peer-reviewed evidence that integrating social-network dynamics into chronic-disease prevention strategies yields larger, equitable health gains. This will extend All of Us data into behavioral-epidemiological modeling and provide a template for future precision public health studies.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Registered Tier

Research Team

Owner:

  • MUNTASIR MASUM - Early Career Tenure-track Researcher, University at Albany, State University of New York

SDOH, HRQoL, and Cancer

Scientific Questions Being Studied

I would like to test the correlation between social determinants of health (SDoH) and health-related quality of life (HRQoL) outcomes in cancer patients. In so doing, I will learn more about the disparities that patients with cancer experience, and possible interventions to remedy such disparities.

Project Purpose(s)

  • Population Health
  • Social / Behavioral

Scientific Approaches

Use All of Us survey data to measure SDoH and HRQoL variables, then fit logistic regression models with SDoH as the independent variables and HRQoL as the dependent variable.
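As a rough illustration of the modeling step, the sketch below fits a logistic regression by batch gradient descent using only the standard library. It assumes the HRQoL outcome has been dichotomized (for example poor vs. not poor) and that each SDoH variable is numeric; the function name `fit_logistic` and the toy data are hypothetical, and a real analysis would more likely use an established statistics package.

```python
import math


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent.

    X: list of feature rows (one row of SDoH values per participant)
    y: list of 0/1 outcomes (assumed dichotomized HRQoL)
    Returns the weights with the intercept first.
    """
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for row, target in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - target
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        # Step against the average gradient of the log-loss.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Toy data: one binary SDoH indicator (1 = adverse) and a poor-HRQoL flag.
X = [[0], [0], [0], [0], [1], [1], [1], [1]]
y = [0, 0, 0, 1, 0, 1, 1, 1]
intercept, slope = fit_logistic(X, y)
odds_ratio = math.exp(slope)  # odds of poor HRQoL, adverse vs. not adverse
```

Exponentiating a coefficient gives the odds ratio for that SDoH variable, which is one way to compare which determinants matter most.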

Anticipated Findings

Worse SDoH may lead to worse HRQoL, and some SDoH may have a greater impact than others. Identifying which SDoH are most impactful could provide a clearer target for improving the quality of life of cancer patients.

Demographic Categories of Interest

  • Race / Ethnicity

Data Set Used

Controlled Tier

Research Team

Owner:

  • Michael Zhong - Project Personnel, University of Maryland, Baltimore

HAP 464 NEW

Scientific Questions Being Studied

I am conducting research on the use and effectiveness of bupropion (an antidepressant) among Hispanic participants using data from the All of Us Research Program. The goal is to explore potential differences in treatment response, side effects, and outcomes compared to other populations.

Project Purpose(s)

  • Educational

Scientific Approaches

This research aims to highlight any disparities or unique trends in the prescription and effectiveness of bupropion for Hispanic populations, contributing to more personalized and equitable healthcare approaches.

Anticipated Findings

Through this project, I seek to understand the intersection of ethnicity, pharmacology, and mental health by studying Hispanic experiences with bupropion treatment within a diverse national cohort.

Demographic Categories of Interest

  • Race / Ethnicity
  • Age
  • Sex at Birth

Data Set Used

Registered Tier

Research Team

Owner:

v8_PGS_WGS

Scientific Questions Being Studied

We aim to develop and validate pan-phenome polygenic risk scores (PRS) across diverse diseases and health conditions using large-scale genomic and clinical data. The key scientific questions include: 1) Can integrating genomic data across phenotypes enhance predictive accuracy of PRS for complex diseases? 2) Do pan-phenome PRS models improve disease prediction consistently across diverse populations? These questions are critical because robust PRS can significantly improve personalized risk prediction, early disease detection, and prevention strategies for all.

Project Purpose(s)

  • Drug Development
  • Ancestry

Scientific Approaches

We integrate genomic and clinical data from biobank resources (e.g., All of Us, UK Biobank) to construct pan-phenome polygenic risk scores (PRS). We will utilize genome-wide association study (GWAS) summary statistics, linkage disequilibrium score regression, and multi-trait PRS methods (e.g., MTAG, PRS-CS, PRSice) to capture shared genetic architecture across phenotypes. PRS validation and evaluation of predictive performance will be done using independent datasets and cross-validation approaches. Analyses will use Python, R, PLINK, and Hail for data processing, and machine learning tools (XGBoost, scikit-learn) for optimizing predictive accuracy.
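At its core, a polygenic risk score is a weighted sum of effect-allele counts. The sketch below shows only that final scoring step, under stated assumptions: the per-variant weights are taken as given (in practice they would come from a method such as those named above), and the function names and variant IDs are hypothetical.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages for one individual.

    dosages: dict mapping variant id -> effect-allele count (0, 1, 2,
             or a fractional imputed dosage)
    weights: dict mapping variant id -> per-allele effect size
    Variants without a dosage for this individual are skipped.
    """
    return sum(beta * dosages[v] for v, beta in weights.items() if v in dosages)


def standardize(scores):
    """Convert raw scores to z-scores within a cohort."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    if sd == 0:
        return [0.0] * n
    return [(s - mean) / sd for s in scores]


# Hypothetical weights and genotypes for illustration only.
weights = {"rs1": 0.10, "rs2": -0.20, "rs3": 0.50}
person = {"rs1": 2, "rs2": 1}
raw = polygenic_score(person, weights)
```

Standardizing within each ancestry group, rather than across the whole cohort, is one common design choice when comparing scores across populations.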

Anticipated Findings

Anticipated findings include demonstration that multi‐phenotype PRSs built on shared genetic architectures across traits will outperform traditional single‐trait PRSs, yielding higher discrimination (e.g. AUC or C-index) and better calibrated risk estimates. We expect to identify pleiotropic loci contributing to multiple diseases, quantify genetic correlations across the phenome, and show improved transferability of PRSs in non‐European ancestry cohorts—thereby reducing prediction disparities. These results will advance understanding of genetic pleiotropy, refine multi-trait PRS construction methods, and establish a scalable, reproducible framework for equitable genomic risk prediction. By releasing optimized pipelines and summary metrics, our work will catalyze broader adoption of pan-phenome PRSs in precision medicine and population health research.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Chenjie Zeng - Research Fellow, National Human Genome Research Institute (NIH - NHGRI)

Autoimmunity and Cancer

In this research project, we aim to identify new genetic variants that may contribute to the development of autoimmune diseases and cancer. The identification of these variants may influence the diagnosis and treatment of autoimmune diseases and cancer.

Scientific Questions Being Studied

In this research project, we aim to identify new genetic variants that may contribute to the development of autoimmune diseases and cancer. The identification of these variants may influence the diagnosis and treatment of autoimmune diseases and cancer.

Project Purpose(s)

  • Ancestry

Scientific Approaches

In this study, we aim to investigate the association between genetic variants and the development of autoimmunity and cancer. We will compare the frequencies of patients with autoimmunity and cancer among subjects who harbor specific genetic variants and those who do not. We plan to use other biobanks across the globe to answer our study question more robustly.

Anticipated Findings

We anticipate discovering new genetic variants that may contribute to the development of autoimmunity and cancer. The identification of such variants may shed light on the identification of new biological mechanisms, diagnosis, and treatment of autoimmune diseases and cancer.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Collaborators:

  • Mehmet Hocaoglu - Research Fellow, University of Pittsburgh

LDL and Coronary Artery Disease Polygenic Risk Score Hazard Modeling

Scientific Questions Being Studied

Currently, statins or other lipid lowering therapy is automatically prescribed to patients with LDL >190, so-called severe hypercholesterolemia, regardless of the presence or absence of other risk factors. With the advent and impending widespread clinical availability of polygenic risk scores (PRS), particularly for coronary artery disease, there is now the possibility to predict patients' risk based on the combination of their genetic risk and clinical or modifiable risk factors. This will have implications on clinical decision making in a number of ways but one of particular interest to this study is informing when to prescribe statins. Thus, we seek to ascertain the PRSs above which individuals with hypercholesterolemia, but not severe hypercholesterolemia, have equivalent or greater risk than someone with severe hypercholesterolemia alone. This will inform statin prescription guidelines taking into account one's aggregate genetic risk in addition to their clinical risk factors.

Project Purpose(s)

  • Disease Focused Research (Coronary Artery Disease)
  • Ancestry

Scientific Approaches

Our analyses will involve Cox proportional hazard modeling of strata with specific LDL and polygenic risk score ranges with the reference group being all individuals with LDL 190+ regardless of polygenic risk score. We will utilize coronary artery disease outcome data, risk factor data to use for covariate control, and genetic data to use polygenic risk scores as an independent variable (the latter of which have already been calculated and validated by our lab). Replication across other cohorts is additionally underway.
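Before any hazard model is fitted, each participant has to be placed in a stratum. The sketch below shows one way that assignment could look. The LDL bands follow the ranges mentioned elsewhere in this entry, but the PRS cut points (80th and 90th percentile), the choice to exclude LDL below 100, the use of "at least 190" for the reference group, and the name `assign_stratum` are all assumptions made for illustration; finding the real PRS thresholds is the stated goal of the study.

```python
from collections import Counter


def assign_stratum(ldl_mg_dl, prs_percentile):
    """Return a stratum label, or None if LDL is below the lowest band."""
    if ldl_mg_dl >= 190:
        # Reference group: all LDL 190+ individuals, regardless of PRS.
        return "reference_LDL_190_plus"
    if ldl_mg_dl >= 160:
        ldl_band = "LDL_160_190"
    elif ldl_mg_dl >= 130:
        ldl_band = "LDL_130_160"
    elif ldl_mg_dl >= 100:
        ldl_band = "LDL_100_130"
    else:
        return None  # assumed to fall outside the analysis
    # Hypothetical PRS bands; placeholders for the thresholds under study.
    if prs_percentile >= 90:
        prs_band = "PRS_90_100"
    elif prs_percentile >= 80:
        prs_band = "PRS_80_90"
    else:
        prs_band = "PRS_0_80"
    return f"{ldl_band}|{prs_band}"


# Toy cohort of (LDL in mg/dL, PRS percentile) pairs.
cohort = [(200, 5), (145, 95), (165, 85), (120, 50), (90, 99)]
counts = Counter(assign_stratum(ldl, prs) for ldl, prs in cohort)
```

Each non-reference label could then enter a Cox model as an indicator variable, so that its hazard ratio is read relative to the LDL 190+ group.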

Anticipated Findings

We anticipate that subgroups with lower LDL ranges (130-160, 160-190) but with sufficiently high PRSs (determining the exact score/percentile thresholds will be the target of this study) will have hazard ratios and event rates comparable to or greater than those of individuals with LDL >190. This risk modeling will provide clinically actionable thresholds to inform lipid lowering guidelines for individuals with higher genetic risk who have heightened, but not extremely elevated, lipid levels. This, for example, could tell us that someone with an LDL of 100-130 and a 90th percentile PRS has risk equivalent to that of individuals with LDL >190, which would warrant initiating preventive therapy (lipid lowering medication) for these individuals.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Version 8 Rafiq Projects

Scientific Questions Being Studied

• What are the germline variants, molecular markers, and reproductive factors associated with lung cancer in adult females within the All of Us cohort?
• Do reproductive factors interact with molecular and genetic markers to increase lung cancer risk in adult females?

Project Purpose(s)

  • Educational

Scientific Approaches

a) Study design: nested case-control design
b) Study population:
• Case group: Adult female (≥18 years) diagnosed with lung cancer
• Control group: Adult female (≥18 years) without a history of lung cancer, matched on age, smoking status, and ethnicity.
c) Primary outcome: Incidence of lung cancer in females
d) Primary exposure:
• Reproductive factors: Age at menarche, Age at menopause, Age at first child, history of Oral Contraceptive pill, history of uterus/ovary removal, Pattern of menstruation, parity, History of Hormone replacement therapy, History of HPV vaccination
• Molecular factors: SNPs related to Estrogen Receptor (ER), Aromatase, PD-1/PD-L1, FOXP3-expressing CD4+ cells/Treg cells, and the EGFR and ErbB-2 genes
e) Covariates:
• Demographic:
Education, BMI, Residence, Occupation
• Clinical:
Histopathology, Site, Differentiation, Stage, Metastatic site

Anticipated Findings

Our research aims to enhance understanding of lung cancer susceptibility by exploring interactions between genetic and reproductive factors, informing targeted prevention and personalized treatments. By identifying specific SNPs associated with lung cancer and reproductive factors, we improve risk prediction models, enabling early interventions to reduce mortality rates. Integrating genetic findings with clinical data elucidates biological mechanisms underlying lung cancer susceptibility, potentially leading to novel therapies. Insights from our study have significant public health implications, informing tailored policies emphasizing modifiable lifestyle factors for lung cancer prevention, particularly among genetically vulnerable adult females.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Privacy-Preserving Synthetic Whole Genome Data Using Diffusion Model

Scientific Questions Being Studied

"How can discrete diffusion models be optimized to generate synthetic whole genome data that maintains high statistical fidelity and utility for downstream genomic analysis while providing robust privacy guarantees against re-identification attacks?"

While whole genome data is invaluable for advancing precision medicine and disease understanding, its highly identifiable nature creates significant privacy risks. We investigate whether diffusion models—specifically adapted for discrete genomic data with ancestry-conditioning—can generate synthetic genomic datasets that preserve critical population-level statistics. This work addresses a need in biomedical research to enable broader data sharing while protecting participant privacy, potentially accelerating discoveries in disease genetics while maintaining public trust. Our approach focuses particularly on ensuring balanced representation across ancestry groups, addressing a critical equity gap in current genomic research.

Project Purpose(s)

  • Population Health
  • Educational
  • Ancestry

Scientific Approaches

Our study employs a privacy-preserving deep learning approach using the All of Us genomic dataset comprising 500 samples with 13.8M variants after quality control filtering. I'll implement a discrete diffusion probabilistic model with ancestry-conditional generation. My focus is adapting diffusion models for discrete genomic data while preserving population structure.
I'll evaluate the model using: (1) statistical fidelity metrics (allele frequency distribution, linkage disequilibrium patterns, PCA overlap); (2) privacy protection measures (membership inference attack resistance, differential privacy guarantees); and (3) utility preservation metrics (GWAS reproducibility, disease risk prediction performance).
All preprocessing and model training occurs within the secured All of Us Researcher Workbench environment using Hail for genomic processing and PyTorch for model implementation.
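One of the statistical fidelity metrics listed above, agreement in allele frequencies, is simple enough to sketch. The example below assumes genotypes are stored as rows of 0/1/2 alternate-allele counts with no missing values, and compares real and synthetic data by mean absolute difference and Pearson correlation. The function names and the toy matrices are hypothetical; this is not the project's evaluation code.

```python
def allele_frequencies(genotypes):
    """Per-variant alternate-allele frequency for a diploid genotype matrix.

    genotypes: list of samples, each a list of 0/1/2 allele counts.
    """
    n_samples = len(genotypes)
    n_variants = len(genotypes[0])
    return [
        sum(row[j] for row in genotypes) / (2 * n_samples)
        for j in range(n_variants)
    ]


def mean_abs_diff(a, b):
    """Average absolute gap between two frequency vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5


# Toy matrices: 2 samples x 3 variants.
real = [[0, 1, 2], [0, 1, 2]]
synthetic = [[0, 1, 2], [0, 2, 2]]
af_real = allele_frequencies(real)
af_syn = allele_frequencies(synthetic)
gap = mean_abs_diff(af_real, af_syn)
```

A small gap and a correlation near 1 would suggest the synthetic data preserve marginal allele frequencies, though this says nothing about linkage disequilibrium or privacy, which need their own checks.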

Anticipated Findings

Anticipated results:
- Maintain high statistical fidelity (preserving population structure and linkage disequilibrium patterns)
- Provide quantifiable privacy guarantees against re-identification attacks
- Support downstream genomic analyses with minimal utility loss compared to real data

The contributions to scientific knowledge would include:
- Novel methodological advances in adapting diffusion models specifically for discrete genomic data
- Empirical evidence on the privacy-utility tradeoff in synthetic genomic data
- A framework for ancestry-conditional generation that helps address representation gaps
- Quantifiable metrics for evaluating synthetic genomic data quality

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Registered Tier

Research Team

Owner:

Duplicate of HAP823 Linda

Effect of drugs on patients with bipolar disorder.

Scientific Questions Being Studied

Effect of drugs on patients with bipolar disorder.

Project Purpose(s)

  • Educational
  • Other Purpose (Continuing analysis)

Scientific Approaches

We will use Python to analyze the data and conduct regression analysis to estimate the effect of bipolar drugs.

Anticipated Findings

Evaluate which bipolar drugs are better for which condition.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Registered Tier

Research Team

Owner:

Hip Fracture Risk Prediction Study

Scientific Questions Being Studied

I intend to study the clinical, demographic, and lifestyle risk factors associated with hip fractures among older adults. Hip fractures are a major public health concern due to their high morbidity, mortality, and healthcare costs. By analyzing the All of Us dataset, I aim to identify predictors of hip fracture risk, explore potential disparities across population subgroups, and contribute to the development of targeted prevention strategies.

Project Purpose(s)

  • Population Health
  • Methods Development
  • Control Set
  • Ancestry

Scientific Approaches

I will conduct a retrospective observational study using the Registered Tier dataset, including survey responses, electronic health records, and physical measurements. Logistic regression models and machine learning methods such as random forests may be used to identify predictors of hip fractures. Analyses will be conducted using R and Python tools within the secure Researcher Workbench environment, following All of Us privacy and data use guidelines.

Anticipated Findings

I anticipate identifying key risk factors associated with hip fractures and uncovering potential health disparities among different population groups. These findings could contribute to improved risk assessment models, inform early prevention strategies, and ultimately reduce the burden of hip fractures on individuals and the healthcare system.

Demographic Categories of Interest

  • Race / Ethnicity
  • Age
  • Income Level

Data Set Used

Registered Tier

Research Team

Owner:

  • shuo sun - Graduate Trainee, Michigan Technological University

Genetics

I am exploring data within the All of Us cohort to perform research projects that characterize the genetic architecture of immune-mediated disease.

Scientific Questions Being Studied

I am exploring data within the All of Us cohort to perform research projects that characterize the genetic architecture of immune-mediated disease.

Project Purpose(s)

  • Ancestry

Scientific Approaches

I plan on performing analyses linking common human genetic variation to EHR phenotypes in the All of Us dataset.

Anticipated Findings

Our work could improve our understanding of why certain patients are more susceptible to certain immune-mediated conditions.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Duplicate of Data Storage in the Researcher Workbench

Scientific Questions Being Studied

The purpose of this workspace is to teach users how to use and manage data storage within the Researcher Workbench. Getting into the habit of storing intermediate results will help researchers be more efficient and allow easy collaboration between users, whether they use R, Python or SAS as their primary programming language.

Project Purpose(s)

  • Educational

Scientific Approaches

First, we will provide an overview of the different storage options within the Researcher Workbench (RW) along with their advantages and disadvantages.
Then, using the current CDR, we will query example datasets and demonstrate:
- how to save a dataframe or a plot to the RW storage
- how to read a dataframe or a plot from the storage
- how to list data in the storage
- how to move data in/between storages.

We will use a package created by one of our DST members, Aymone Kouame, to perform these tasks in R, Python or SAS. The GitHub link to the code will also be available.

Anticipated Findings

After going through this tutorial, researchers will have a better understanding of data storage and management in the RW.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Registered Tier

Research Team

Owner:

Duplicate of Data Wrangling in All of Us Program (v8)

Scientific Questions Being Studied

For educational purposes: to show best practices when using Jupyter notebooks for data access, storage, and data manipulation (transformations, conversions, cleaning, optimization), and to address other research-support issues that are useful to multiple AoU researchers.

Project Purpose(s)

  • Educational
  • Other Purpose (For use with office hours. Notebooks for adding code snippets useful for researchers. This is a placeholder for creating notebooks for best practices, among other things)

Scientific Approaches

For educational purposes: to show best practices when using Jupyter notebooks for data access, storage, and data manipulation (transformations, conversions, cleaning, optimization), and to address other research-support issues that are useful to multiple AoU researchers.

Anticipated Findings

For educational purposes: to show best practices when using Jupyter notebooks for data access, storage, and data manipulation (transformations, conversions, cleaning, optimization), and to address other research-support issues that are useful to multiple AoU researchers.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Registered Tier

Research Team

Owner:

IBD and Cognitive

Scientific Questions Being Studied

Our team aims to investigate the association between inflammatory bowel disease (IBD) and cognitive function using the All of Us database, focusing on three primary scientific questions. First, we will examine whether IBD patients demonstrate poorer performance on cognitive assessments compared to matched controls, adjusting for key confounders like age, education, and comorbidities. Second, we will explore whether disease duration, severity, or treatment modalities modify this relationship. This work could have significant implications for clinical practice by highlighting the need for cognitive monitoring in IBD management and potentially identifying modifiable risk factors for intervention.

Project Purpose(s)

  • Disease Focused Research (IBD)

Scientific Approaches

Our team will employ a multidisciplinary approach combining epidemiological methods with advanced analytics using the All of Us Researcher Workbench. We will utilize the controlled tier dataset containing EHRs, surveys, and physical measurements from a diverse participant population. We'll implement a matched cohort design, pairing IBD cases with non-IBD controls (1:3 ratio) based on age, sex, and education, followed by multivariable regression adjusting for cardiovascular risk factors, depression, and lifestyle variables. For longitudinal analysis, we'll leverage follow-up survey data where available to assess cognitive trajectory. Advanced methods will include mediation analysis to evaluate inflammatory markers (CRP, ESR) as potential pathways, and machine learning (logistic regression/Random Forest) to identify high-risk subgroups.
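The 1:3 matched cohort design mentioned above can be illustrated with a small greedy matcher. This is a sketch under assumptions, not the team's procedure: it matches exactly on sex and education, allows age to differ by up to a caliper (2 years here, a hypothetical value), and draws controls without replacement. The name `match_controls` and the record fields are likewise hypothetical.

```python
def match_controls(cases, controls, ratio=3, age_caliper=2):
    """Greedy 1:ratio matching without replacement.

    Each person is a dict with 'id', 'age', 'sex', and 'education'.
    Returns a dict mapping case id -> list of matched control ids.
    A case may receive fewer than `ratio` controls if too few are eligible.
    """
    available = list(controls)
    matches = {}
    for case in cases:
        eligible = [
            c for c in available
            if c["sex"] == case["sex"]
            and c["education"] == case["education"]
            and abs(c["age"] - case["age"]) <= age_caliper
        ]
        # Prefer the controls closest in age to the case.
        eligible.sort(key=lambda c: abs(c["age"] - case["age"]))
        chosen = eligible[:ratio]
        chosen_ids = {c["id"] for c in chosen}
        available = [c for c in available if c["id"] not in chosen_ids]
        matches[case["id"]] = [c["id"] for c in chosen]
    return matches


cases = [{"id": "case1", "age": 50, "sex": "F", "education": "college"}]
controls = [
    {"id": "c1", "age": 50, "sex": "F", "education": "college"},
    {"id": "c2", "age": 51, "sex": "F", "education": "college"},
    {"id": "c3", "age": 52, "sex": "F", "education": "college"},
    {"id": "c4", "age": 50, "sex": "M", "education": "college"},
    {"id": "c5", "age": 60, "sex": "F", "education": "college"},
]
matched = match_controls(cases, controls)
```

Greedy matching is order-dependent, so results can change with the order of cases; optimal matching methods avoid that at the cost of more computation.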

Anticipated Findings

Our team anticipates finding a significant association between IBD and poorer cognitive performance, particularly in processing speed and executive function, with stronger effects observed in patients with longer disease duration or more severe inflammation. We expect systemic inflammatory markers to partially mediate this relationship, supporting the gut-brain axis hypothesis. These findings would provide the first large-scale evidence of cognitive impairment in IBD within a diverse US population, addressing current literature gaps from predominantly small, homogenous studies. The results could reshape clinical paradigms by highlighting the need for cognitive screening in IBD management and informing trials of anti-inflammatory therapies for neuroprotection.

Demographic Categories of Interest

  • Race / Ethnicity
  • Age
  • Sex at Birth
  • Gender Identity
  • Disability Status
  • Access to Care
  • Education Level
  • Income Level

Data Set Used

Controlled Tier

Research Team

Owner:

  • Ligang Liu - Research Fellow, Ohio State University

Collaborators:

  • Lusi Zhang - Graduate Trainee, University of Minnesota

Genetic risk in autoimmune diseases

Scientific Questions Being Studied

To observe genetic risk factors and identify novel gene-gene combinations or interactions that are associated with autoimmune diseases, primarily SLE, Sjogren's syndrome, rheumatoid arthritis and type 1 diabetes.

Project Purpose(s)

  • Disease Focused Research (Autoimmune diseases)
  • Methods Development
  • Control Set
  • Ancestry

Scientific Approaches

We will look for genetic risk factors in genotype data. We will run statistical association analyses, using odds ratios and Fisher's exact test, between risk alleles and the subject phenotype groups. We will use control data as a validation set.
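For a single risk allele, the association test described above reduces to a 2x2 table. The sketch below computes the odds ratio and a two-sided Fisher's exact p-value from the hypergeometric distribution using only the standard library. The table layout (a = cases with the allele, b = cases without, c = controls with, d = controls without) and the function names are assumptions for illustration; the two-sided rule used here sums the probabilities of all tables no more likely than the observed one.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio for a 2x2 table; infinite if b * c is zero."""
    return (a * d) / (b * c) if b * c else float("inf")


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for a 2x2 table.

    With margins fixed, the top-left cell follows a hypergeometric
    distribution. The p-value sums every table whose probability does
    not exceed that of the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    total = sum(
        prob(x) for x in range(lo, hi + 1)
        if prob(x) <= p_obs * (1 + 1e-9)  # tolerance for float ties
    )
    return min(1.0, total)


# Toy table: 3 of 4 cases and 1 of 4 controls carry the risk allele.
or_estimate = odds_ratio(3, 1, 1, 3)
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

When many alleles or gene-gene combinations are tested this way, the resulting p-values would need a multiple-testing correction before being interpreted.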

Anticipated Findings

Identifying genetic risk factors and their associations with autoimmune diseases could help in understanding the mechanisms of disease development and pathogenesis pathways.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Ilona Nln - Research Associate, Cornell University

STARRpipeline for srWGS v8

Scientific Questions Being Studied

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable, efficient genomic analyses, as GDS is a highly optimized data format for storing and processing large-scale genetic data. By converting the All of Us v8 genotype dataset into GDS, we aim to facilitate downstream population genetics studies, variant annotation, imputation, and genome-wide association studies (GWAS) in a computationally efficient manner.
This work is particularly relevant to public health and precision medicine as it ensures that genomic data from a diverse cohort is stored in a structured, accessible manner for researchers. The All of Us dataset is one of the most diverse genetic datasets available, representing populations historically underrepresented in genetic research.

Project Purpose(s)

  • Methods Development
  • Ancestry

Scientific Approaches

For this study, we will utilize high-performance bioinformatics tools and computational methods to efficiently process and analyze the All of Us v8 short-read Whole Genome Sequencing (srWGS) dataset. Our primary objective is to construct a Genomic Data Structure (GDS) file that includes all variants and samples, enabling efficient large-scale genomic analysis.
Datasets
We will use the All of Us v8 genotype dataset, which consists of high-coverage short-read WGS data from a diverse cohort. This dataset contains single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels), and provides a unique opportunity to study population genetics and disease-associated variants in underrepresented populations.
Research Methods & Tools
Variant Data Conversion & GDS Construction
Convert data from Hail MatrixTable (MT) or VCF format into GDS using SeqArray (an R/Bioconductor package) or other Bioconductor packages.
Optimize data compression and indexing for efficient query performance.
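To give a sense of why a binary array format can be more compact than a text format, here is a minimal Python sketch that packs diploid genotype dosages into 2 bits each. This is a toy illustration only: it is not the GDS on-disk layout and does not use the SeqArray API, and the encoding (0, 1, 2 for alternate-allele counts, 3 for missing) and the function names are assumptions chosen for the example.

```python
def pack_genotypes(dosages):
    """Pack dosages (0, 1, 2, or None for missing) into 2 bits per sample."""
    packed = bytearray((len(dosages) + 3) // 4)
    for i, g in enumerate(dosages):
        if g is None:
            code = 3  # hypothetical code for a missing call
        elif g in (0, 1, 2):
            code = g
        else:
            raise ValueError(f"unsupported dosage: {g!r}")
        # Four samples share one byte; sample i uses bits 2*(i % 4) and up.
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n):
    """Recover the first n dosages from a buffer made by pack_genotypes."""
    out = []
    for i in range(n):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        out.append(None if code == 3 else code)
    return out


if __name__ == "__main__":
    calls = [0, 1, 2, None, 2]
    blob = pack_genotypes(calls)
    assert unpack_genotypes(blob, len(calls)) == calls
    # Text genotype fields such as "0/1" need at least 3 bytes per sample.
    print(len(blob), "bytes packed vs", 3 * len(calls), "bytes as text")
```

A real conversion would rely on the tools named above; the sketch only shows the kind of saving that bit-level encoding makes possible before any general-purpose compression is applied.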

Anticipated Findings

Improved Data Storage & Accessibility – The GDS format will enhance data compression, retrieval speed, and computational efficiency compared to traditional formats like VCF or Hail MatrixTable.
High-Quality Variant Representation – Through rigorous quality control (QC), we expect to retain high-confidence SNPs and indels, ensuring data integrity for downstream analyses.
Population Genomic Insights – Preliminary analyses may reveal population structure, ancestry distributions, and allele frequency patterns, providing deeper insight into the diverse All of Us cohort.
Benchmarking Computational Performance – We will assess how GDS improves storage efficiency and query performance, facilitating its integration into bioinformatics pipelines.
Scientific Contributions
Enable Large-Scale Genomic Research – By optimizing the dataset in GDS format, researchers will be able to conduct GWAS, polygenic risk score (PRS) studies, and rare variant burden analysis more efficiently.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Hufeng Zhou - Early Career Tenure-track Researcher, Harvard T. H. Chan School of Public Health

Collaborators:

  • Xihao Li - Early Career Tenure-track Researcher, University of North Carolina, Chapel Hill

Create aGDS for all of US srWGS v8

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable,…

Scientific Questions Being Studied

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable, efficient genomic analyses, as GDS is a highly optimized data format for storing and processing large-scale genetic data. By converting the All of Us v8 genotype dataset into GDS, we aim to facilitate downstream population genetics studies, variant annotation, imputation, and genome-wide association studies (GWAS) in a computationally efficient manner.
This work is particularly relevant to public health and precision medicine as it ensures that genomic data from a diverse cohort is stored in a structured, accessible manner for researchers. The All of Us dataset is one of the most diverse genetic datasets available, representing populations historically underrepresented in genetic research.

Project Purpose(s)

  • Methods Development
  • Ancestry

Scientific Approaches

For this study, we will utilize high-performance bioinformatics tools and computational methods to efficiently process and analyze the All of Us v8 short-read Whole Genome Sequencing (srWGS) dataset. Our primary objective is to construct a Genomic Data Structure (GDS) file that includes all variants and samples, enabling efficient large-scale genomic analysis.
Datasets
We will use the All of Us v8 genotype dataset, which consists of high-coverage short-read WGS data from a diverse cohort. This dataset contains single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels), and provides a unique opportunity to study population genetics and disease-associated variants in underrepresented populations.
Research Methods & Tools
Variant Data Conversion & GDS Construction
Convert data from Hail MatrixTable (MT) or VCF format into GDS using SeqArray (an R/Bioconductor package) or other Bioconductor packages.
Optimize data compression and indexing for efficient query performance.

Anticipated Findings

Improved Data Storage & Accessibility – The GDS format will enhance data compression, retrieval speed, and computational efficiency compared to traditional formats like VCF or Hail MatrixTable.
High-Quality Variant Representation – Through rigorous quality control (QC), we expect to retain high-confidence SNPs and indels, ensuring data integrity for downstream analyses.
Population Genomic Insights – Preliminary analyses may reveal population structure, ancestry distributions, and allele frequency patterns, providing deeper insight into the diverse All of Us cohort.
Benchmarking Computational Performance – We will assess how GDS improves storage efficiency and query performance, facilitating its integration into bioinformatics pipelines.
Scientific Contributions
Enable Large-Scale Genomic Research – By optimizing the dataset in GDS format, researchers will be able to conduct GWAS, polygenic risk score (PRS) studies, and rare variant burden analysis more efficiently.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Hufeng Zhou - Early Career Tenure-track Researcher, Harvard T. H. Chan School of Public Health

Collaborators:

  • Jun Qian - Other, All of Us Program Operational Use

aGDS for all of US srWGS v8 chrY

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable,…

Scientific Questions Being Studied

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable, efficient genomic analyses, as GDS is a highly optimized data format for storing and processing large-scale genetic data. By converting the All of Us v8 genotype dataset into GDS, we aim to facilitate downstream population genetics studies, variant annotation, imputation, and genome-wide association studies (GWAS) in a computationally efficient manner.
This work is particularly relevant to public health and precision medicine as it ensures that genomic data from a diverse cohort is stored in a structured, accessible manner for researchers. The All of Us dataset is one of the most diverse genetic datasets available, representing populations historically underrepresented in genetic research.

Project Purpose(s)

  • Methods Development
  • Ancestry

Scientific Approaches

For this study, we will utilize high-performance bioinformatics tools and computational methods to efficiently process and analyze the All of Us v8 short-read Whole Genome Sequencing (srWGS) dataset. Our primary objective is to construct a Genomic Data Structure (GDS) file that includes all variants and samples, enabling efficient large-scale genomic analysis.
Datasets
We will use the All of Us v8 genotype dataset, which consists of high-coverage short-read WGS data from a diverse cohort. This dataset contains single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels), and provides a unique opportunity to study population genetics and disease-associated variants in underrepresented populations.
Research Methods & Tools
Variant Data Conversion & GDS Construction
Convert data from Hail MatrixTable (MT) or VCF format into GDS using SeqArray (an R/Bioconductor package) or other Bioconductor packages.
Optimize data compression and indexing for efficient query performance.

Anticipated Findings

Improved Data Storage & Accessibility – The GDS format will enhance data compression, retrieval speed, and computational efficiency compared to traditional formats like VCF or Hail MatrixTable.
High-Quality Variant Representation – Through rigorous quality control (QC), we expect to retain high-confidence SNPs and indels, ensuring data integrity for downstream analyses.
Population Genomic Insights – Preliminary analyses may reveal population structure, ancestry distributions, and allele frequency patterns, providing deeper insight into the diverse All of Us cohort.
Benchmarking Computational Performance – We will assess how GDS improves storage efficiency and query performance, facilitating its integration into bioinformatics pipelines.
Scientific Contributions
Enable Large-Scale Genomic Research – By optimizing the dataset in GDS format, researchers will be able to conduct GWAS, polygenic risk score (PRS) studies, and rare variant burden analysis more efficiently.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Hufeng Zhou - Early Career Tenure-track Researcher, Harvard T. H. Chan School of Public Health

Collaborators:

  • Jun Qian - Other, All of Us Program Operational Use

aGDS for all of US srWGS v8 chrX

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable,…

Scientific Questions Being Studied

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable, efficient genomic analyses, as GDS is a highly optimized data format for storing and processing large-scale genetic data. By converting the All of Us v8 genotype dataset into GDS, we aim to facilitate downstream population genetics studies, variant annotation, imputation, and genome-wide association studies (GWAS) in a computationally efficient manner.
This work is particularly relevant to public health and precision medicine as it ensures that genomic data from a diverse cohort is stored in a structured, accessible manner for researchers. The All of Us dataset is one of the most diverse genetic datasets available, representing populations historically underrepresented in genetic research.

Project Purpose(s)

  • Methods Development
  • Ancestry

Scientific Approaches

For this study, we will utilize high-performance bioinformatics tools and computational methods to efficiently process and analyze the All of Us v8 short-read Whole Genome Sequencing (srWGS) dataset. Our primary objective is to construct a Genomic Data Structure (GDS) file that includes all variants and samples, enabling efficient large-scale genomic analysis.
Datasets
We will use the All of Us v8 genotype dataset, which consists of high-coverage short-read WGS data from a diverse cohort. This dataset contains single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels), and provides a unique opportunity to study population genetics and disease-associated variants in underrepresented populations.
Research Methods & Tools
Variant Data Conversion & GDS Construction
Convert data from Hail MatrixTable (MT) or VCF format into GDS using SeqArray (an R/Bioconductor package) or other Bioconductor packages.
Optimize data compression and indexing for efficient query performance.

Anticipated Findings

Improved Data Storage & Accessibility – The GDS format will enhance data compression, retrieval speed, and computational efficiency compared to traditional formats like VCF or Hail MatrixTable.
High-Quality Variant Representation – Through rigorous quality control (QC), we expect to retain high-confidence SNPs and indels, ensuring data integrity for downstream analyses.
Population Genomic Insights – Preliminary analyses may reveal population structure, ancestry distributions, and allele frequency patterns, providing deeper insight into the diverse All of Us cohort.
Benchmarking Computational Performance – We will assess how GDS improves storage efficiency and query performance, facilitating its integration into bioinformatics pipelines.
Scientific Contributions
Enable Large-Scale Genomic Research – By optimizing the dataset in GDS format, researchers will be able to conduct GWAS, polygenic risk score (PRS) studies, and rare variant burden analysis more efficiently.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Hufeng Zhou - Early Career Tenure-track Researcher, Harvard T. H. Chan School of Public Health

Collaborators:

  • Jun Qian - Other, All of Us Program Operational Use

aGDS for all of US srWGS v8 chr9

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable,…

Scientific Questions Being Studied

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable, efficient genomic analyses, as GDS is a highly optimized data format for storing and processing large-scale genetic data. By converting the All of Us v8 genotype dataset into GDS, we aim to facilitate downstream population genetics studies, variant annotation, imputation, and genome-wide association studies (GWAS) in a computationally efficient manner.
This work is particularly relevant to public health and precision medicine as it ensures that genomic data from a diverse cohort is stored in a structured, accessible manner for researchers. The All of Us dataset is one of the most diverse genetic datasets available, representing populations historically underrepresented in genetic research.

Project Purpose(s)

  • Methods Development
  • Ancestry

Scientific Approaches

For this study, we will utilize high-performance bioinformatics tools and computational methods to efficiently process and analyze the All of Us v8 short-read Whole Genome Sequencing (srWGS) dataset. Our primary objective is to construct a Genomic Data Structure (GDS) file that includes all variants and samples, enabling efficient large-scale genomic analysis.
Datasets
We will use the All of Us v8 genotype dataset, which consists of high-coverage short-read WGS data from a diverse cohort. This dataset contains single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels), and provides a unique opportunity to study population genetics and disease-associated variants in underrepresented populations.
Research Methods & Tools
Variant Data Conversion & GDS Construction
Convert data from Hail MatrixTable (MT) or VCF format into GDS using SeqArray (an R/Bioconductor package) or other Bioconductor packages.
Optimize data compression and indexing for efficient query performance.

Anticipated Findings

Improved Data Storage & Accessibility – The GDS format will enhance data compression, retrieval speed, and computational efficiency compared to traditional formats like VCF or Hail MatrixTable.
High-Quality Variant Representation – Through rigorous quality control (QC), we expect to retain high-confidence SNPs and indels, ensuring data integrity for downstream analyses.
Population Genomic Insights – Preliminary analyses may reveal population structure, ancestry distributions, and allele frequency patterns, providing deeper insight into the diverse All of Us cohort.
Benchmarking Computational Performance – We will assess how GDS improves storage efficiency and query performance, facilitating its integration into bioinformatics pipelines.
Scientific Contributions
Enable Large-Scale Genomic Research – By optimizing the dataset in GDS format, researchers will be able to conduct GWAS, polygenic risk score (PRS) studies, and rare variant burden analysis more efficiently.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Hufeng Zhou - Early Career Tenure-track Researcher, Harvard T. H. Chan School of Public Health

Collaborators:

  • Jun Qian - Other, All of Us Program Operational Use

aGDS for all of US srWGS v8 chr6 third

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable,…

Scientific Questions Being Studied

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable, efficient genomic analyses, as GDS is a highly optimized data format for storing and processing large-scale genetic data. By converting the All of Us v8 genotype dataset into GDS, we aim to facilitate downstream population genetics studies, variant annotation, imputation, and genome-wide association studies (GWAS) in a computationally efficient manner.
This work is particularly relevant to public health and precision medicine as it ensures that genomic data from a diverse cohort is stored in a structured, accessible manner for researchers. The All of Us dataset is one of the most diverse genetic datasets available, representing populations historically underrepresented in genetic research.

Project Purpose(s)

  • Methods Development
  • Ancestry

Scientific Approaches

For this study, we will utilize high-performance bioinformatics tools and computational methods to efficiently process and analyze the All of Us v8 short-read Whole Genome Sequencing (srWGS) dataset. Our primary objective is to construct a Genomic Data Structure (GDS) file that includes all variants and samples, enabling efficient large-scale genomic analysis.
Datasets
We will use the All of Us v8 genotype dataset, which consists of high-coverage short-read WGS data from a diverse cohort. This dataset contains single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels), and provides a unique opportunity to study population genetics and disease-associated variants in underrepresented populations.
Research Methods & Tools
Variant Data Conversion & GDS Construction
Convert data from Hail MatrixTable (MT) or VCF format into GDS using SeqArray (an R/Bioconductor package) or other Bioconductor packages.
Optimize data compression and indexing for efficient query performance.

Anticipated Findings

Improved Data Storage & Accessibility – The GDS format will enhance data compression, retrieval speed, and computational efficiency compared to traditional formats like VCF or Hail MatrixTable.
High-Quality Variant Representation – Through rigorous quality control (QC), we expect to retain high-confidence SNPs and indels, ensuring data integrity for downstream analyses.
Population Genomic Insights – Preliminary analyses may reveal population structure, ancestry distributions, and allele frequency patterns, providing deeper insight into the diverse All of Us cohort.
Benchmarking Computational Performance – We will assess how GDS improves storage efficiency and query performance, facilitating its integration into bioinformatics pipelines.
Scientific Contributions
Enable Large-Scale Genomic Research – By optimizing the dataset in GDS format, researchers will be able to conduct GWAS, polygenic risk score (PRS) studies, and rare variant burden analysis more efficiently.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Hufeng Zhou - Early Career Tenure-track Researcher, Harvard T. H. Chan School of Public Health

Collaborators:

  • Jun Qian - Other, All of Us Program Operational Use

aGDS for all of US srWGS v8 chr5

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable,…

Scientific Questions Being Studied

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable, efficient genomic analyses, as GDS is a highly optimized data format for storing and processing large-scale genetic data. By converting the All of Us v8 genotype dataset into GDS, we aim to facilitate downstream population genetics studies, variant annotation, imputation, and genome-wide association studies (GWAS) in a computationally efficient manner.
This work is particularly relevant to public health and precision medicine as it ensures that genomic data from a diverse cohort is stored in a structured, accessible manner for researchers. The All of Us dataset is one of the most diverse genetic datasets available, representing populations historically underrepresented in genetic research.

Project Purpose(s)

  • Methods Development
  • Ancestry

Scientific Approaches

For this study, we will utilize high-performance bioinformatics tools and computational methods to efficiently process and analyze the All of Us v8 short-read Whole Genome Sequencing (srWGS) dataset. Our primary objective is to construct a Genomic Data Structure (GDS) file that includes all variants and samples, enabling efficient large-scale genomic analysis.
Datasets
We will use the All of Us v8 genotype dataset, which consists of high-coverage short-read WGS data from a diverse cohort. This dataset contains single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels), and provides a unique opportunity to study population genetics and disease-associated variants in underrepresented populations.
Research Methods & Tools
Variant Data Conversion & GDS Construction
Convert data from Hail MatrixTable (MT) or VCF format into GDS using SeqArray (an R/Bioconductor package) or other Bioconductor packages.
Optimize data compression and indexing for efficient query performance.

Anticipated Findings

Improved Data Storage & Accessibility – The GDS format will enhance data compression, retrieval speed, and computational efficiency compared to traditional formats like VCF or Hail MatrixTable.
High-Quality Variant Representation – Through rigorous quality control (QC), we expect to retain high-confidence SNPs and indels, ensuring data integrity for downstream analyses.
Population Genomic Insights – Preliminary analyses may reveal population structure, ancestry distributions, and allele frequency patterns, providing deeper insight into the diverse All of Us cohort.
Benchmarking Computational Performance – We will assess how GDS improves storage efficiency and query performance, facilitating its integration into bioinformatics pipelines.
Scientific Contributions
Enable Large-Scale Genomic Research – By optimizing the dataset in GDS format, researchers will be able to conduct GWAS, polygenic risk score (PRS) studies, and rare variant burden analysis more efficiently.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Hufeng Zhou - Early Career Tenure-track Researcher, Harvard T. H. Chan School of Public Health

Collaborators:

  • Jun Qian - Other, All of Us Program Operational Use

aGDS for all of US srWGS v8 chr4

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable,…

Scientific Questions Being Studied

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable, efficient genomic analyses, as GDS is a highly optimized data format for storing and processing large-scale genetic data. By converting the All of Us v8 genotype dataset into GDS, we aim to facilitate downstream population genetics studies, variant annotation, imputation, and genome-wide association studies (GWAS) in a computationally efficient manner.
This work is particularly relevant to public health and precision medicine as it ensures that genomic data from a diverse cohort is stored in a structured, accessible manner for researchers. The All of Us dataset is one of the most diverse genetic datasets available, representing populations historically underrepresented in genetic research.

Project Purpose(s)

  • Methods Development
  • Ancestry

Scientific Approaches

For this study, we will utilize high-performance bioinformatics tools and computational methods to efficiently process and analyze the All of Us v8 short-read Whole Genome Sequencing (srWGS) dataset. Our primary objective is to construct a Genomic Data Structure (GDS) file that includes all variants and samples, enabling efficient large-scale genomic analysis.
Datasets
We will use the All of Us v8 genotype dataset, which consists of high-coverage short-read WGS data from a diverse cohort. This dataset contains single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels), and provides a unique opportunity to study population genetics and disease-associated variants in underrepresented populations.
Research Methods & Tools
Variant Data Conversion & GDS Construction
Convert data from Hail MatrixTable (MT) or VCF format into GDS using SeqArray (an R/Bioconductor package) or other Bioconductor packages.
Optimize data compression and indexing for efficient query performance.

Anticipated Findings

Improved Data Storage & Accessibility – The GDS format will enhance data compression, retrieval speed, and computational efficiency compared to traditional formats like VCF or Hail MatrixTable.
High-Quality Variant Representation – Through rigorous quality control (QC), we expect to retain high-confidence SNPs and indels, ensuring data integrity for downstream analyses.
Population Genomic Insights – Preliminary analyses may reveal population structure, ancestry distributions, and allele frequency patterns, providing deeper insight into the diverse All of Us cohort.
Benchmarking Computational Performance – We will assess how GDS improves storage efficiency and query performance, facilitating its integration into bioinformatics pipelines.
Scientific Contributions
Enable Large-Scale Genomic Research – By optimizing the dataset in GDS format, researchers will be able to conduct GWAS, polygenic risk score (PRS) studies, and rare variant burden analysis more efficiently.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Hufeng Zhou - Early Career Tenure-track Researcher, Harvard T. H. Chan School of Public Health

Collaborators:

  • Jun Qian - Other, All of Us Program Operational Use

aGDS for all of US srWGS v8 chr3 third

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable,…

Scientific Questions Being Studied

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable, efficient genomic analyses, as GDS is a highly optimized data format for storing and processing large-scale genetic data. By converting the All of Us v8 genotype dataset into GDS, we aim to facilitate downstream population genetics studies, variant annotation, imputation, and genome-wide association studies (GWAS) in a computationally efficient manner.
This work is particularly relevant to public health and precision medicine as it ensures that genomic data from a diverse cohort is stored in a structured, accessible manner for researchers. The All of Us dataset is one of the most diverse genetic datasets available, representing populations historically underrepresented in genetic research.

Project Purpose(s)

  • Methods Development
  • Ancestry

Scientific Approaches

For this study, we will utilize high-performance bioinformatics tools and computational methods to efficiently process and analyze the All of Us v8 short-read Whole Genome Sequencing (srWGS) dataset. Our primary objective is to construct a Genomic Data Structure (GDS) file that includes all variants and samples, enabling efficient large-scale genomic analysis.
Datasets
We will use the All of Us v8 genotype dataset, which consists of high-coverage short-read WGS data from a diverse cohort. This dataset contains single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels), and provides a unique opportunity to study population genetics and disease-associated variants in underrepresented populations.
Research Methods & Tools
Variant Data Conversion & GDS Construction
Convert data from Hail MatrixTable (MT) or VCF format into GDS using SeqArray (an R/Bioconductor package) or other Bioconductor packages.
Optimize data compression and indexing for efficient query performance.

Anticipated Findings

Improved Data Storage & Accessibility – The GDS format will enhance data compression, retrieval speed, and computational efficiency compared to traditional formats like VCF or Hail MatrixTable.
High-Quality Variant Representation – Through rigorous quality control (QC), we expect to retain high-confidence SNPs and indels, ensuring data integrity for downstream analyses.
Population Genomic Insights – Preliminary analyses may reveal population structure, ancestry distributions, and allele frequency patterns, providing deeper insight into the diverse All of Us cohort.
Benchmarking Computational Performance – We will assess how GDS improves storage efficiency and query performance, facilitating its integration into bioinformatics pipelines.
Scientific Contributions
Enable Large-Scale Genomic Research – By optimizing the dataset in GDS format, researchers will be able to conduct GWAS, polygenic risk score (PRS) studies, and rare variant burden analysis more efficiently.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Hufeng Zhou - Early Career Tenure-track Researcher, Harvard T. H. Chan School of Public Health

Collaborators:

  • Jun Qian - Other, All of Us Program Operational Use

aGDS for all of US srWGS v8 chr3 new

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable,…

Scientific Questions Being Studied

The goal of this study is to construct a Genomic Data Structure (GDS) file that includes all variants and samples from the All of Us v8 genotype dataset (short-read Whole Genome Sequencing, srWGS). This effort is crucial for enabling scalable, efficient genomic analyses, as GDS is a highly optimized data format for storing and processing large-scale genetic data. By converting the All of Us v8 genotype dataset into GDS, we aim to facilitate downstream population genetics studies, variant annotation, imputation, and genome-wide association studies (GWAS) in a computationally efficient manner.
This work is particularly relevant to public health and precision medicine as it ensures that genomic data from a diverse cohort is stored in a structured, accessible manner for researchers. The All of Us dataset is one of the most diverse genetic datasets available, representing populations historically underrepresented in genetic research.

Project Purpose(s)

  • Methods Development
  • Ancestry

Scientific Approaches

For this study, we will utilize high-performance bioinformatics tools and computational methods to efficiently process and analyze the All of Us v8 short-read Whole Genome Sequencing (srWGS) dataset. Our primary objective is to construct a Genomic Data Structure (GDS) file that includes all variants and samples, enabling efficient large-scale genomic analysis.
Datasets
We will use the All of Us v8 genotype dataset, which consists of high-coverage short-read WGS data from a diverse cohort. This dataset contains single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels), and provides a unique opportunity to study population genetics and disease-associated variants in underrepresented populations.
Research Methods & Tools
Variant Data Conversion & GDS Construction
  • Convert the Hail MatrixTable (MT) or VCF data into GDS using SeqArray (an R/Bioconductor package) or other Bioconductor packages.
  • Optimize data compression and indexing for efficient query performance.
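The conversion step above can be illustrated with a minimal sketch. This is not SeqArray and does not produce a GDS file; it is a plain-Python stand-in, using only the standard library, that shows the general idea of turning row-oriented VCF text into column-oriented arrays with compressed genotypes and a position index. The toy VCF data, the `MISSING` sentinel, and all function names are assumptions made for illustration.

```python
import zlib
from bisect import bisect_left, bisect_right

MISSING = 255  # hypothetical sentinel for a missing genotype call

# Toy VCF text with three variants and three samples (made-up data).
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n"
    "chr3\t1001\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1\n"
    "chr3\t1050\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\t./.\t0/0\n"
    "chr3\t1100\t.\tG\tGA\t.\tPASS\t.\tGT\t0/0\t0/0\t0/1\n"
)


def gt_to_dosage(gt):
    """Count non-reference alleles in a GT string; MISSING if any allele is '.'."""
    alleles = gt.split(":")[0].replace("|", "/").split("/")
    if "." in alleles:
        return MISSING
    return sum(1 for allele in alleles if allele != "0")


def vcf_to_columns(vcf_text):
    """Parse VCF text into sample IDs plus one list per variant field."""
    columns = {"chrom": [], "pos": [], "ref": [], "alt": [], "dosage": []}
    samples = []
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs follow the FORMAT column
            continue
        columns["chrom"].append(fields[0])
        columns["pos"].append(int(fields[1]))
        columns["ref"].append(fields[3])
        columns["alt"].append(fields[4])
        # One byte per sample keeps the genotype column compact.
        columns["dosage"].append(bytes(gt_to_dosage(gt) for gt in fields[9:]))
    return samples, columns


def pack_dosage(columns):
    """Compress the whole genotype column as a single byte string."""
    return zlib.compress(b"".join(columns["dosage"]))


def query_range(columns, start, end):
    """Return row indices with start <= POS <= end (POS assumed sorted)."""
    lo = bisect_left(columns["pos"], start)
    hi = bisect_right(columns["pos"], end)
    return list(range(lo, hi))


samples, columns = vcf_to_columns(VCF_TEXT)
```

Storing each field as its own array is what lets a range query touch only the position column, and lets the genotype bytes be compressed together. A real GDS workflow would rely on the SeqArray package in R rather than code like this.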

Anticipated Findings

Improved Data Storage & Accessibility – The GDS format will enhance data compression, retrieval speed, and computational efficiency compared to traditional formats like VCF or Hail MatrixTable.
High-Quality Variant Representation – Through rigorous quality control (QC), we expect to retain high-confidence SNPs and indels, ensuring data integrity for downstream analyses.
Population Genomic Insights – Preliminary analyses may reveal population structure, ancestry distributions, and allele frequency patterns, providing deeper insight into the diverse All of Us cohort.
Benchmarking Computational Performance – We will assess how GDS improves storage efficiency and query performance, facilitating its integration into bioinformatics pipelines.
Scientific Contributions
Enable Large-Scale Genomic Research – By optimizing the dataset in GDS format, researchers will be able to conduct GWAS, polygenic risk score (PRS) studies, and rare variant burden analysis more efficiently.
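
As a rough illustration of the quality-control and allele-frequency summaries mentioned above, the following sketch computes a per-variant call rate and alternate-allele frequency from diploid dosages and keeps variants that pass a call-rate threshold. The 0.95 threshold, the toy genotypes, and the function names are assumptions for illustration only; the project description does not state which QC criteria will be used.

```python
def variant_stats(dosages):
    """Return (call_rate, alt_allele_frequency) for one variant.

    `dosages` holds one value per sample: 0, 1, or 2 copies of the
    alternate allele, or None for a missing call (diploid assumed).
    """
    called = [d for d in dosages if d is not None]
    call_rate = len(called) / len(dosages) if dosages else 0.0
    # Each called diploid sample contributes two alleles to the denominator.
    alt_af = sum(called) / (2 * len(called)) if called else None
    return call_rate, alt_af


def filter_variants(genotypes, min_call_rate=0.95):
    """Keep variants whose call rate meets the (illustrative) threshold.

    Returns a dict mapping variant ID to alternate-allele frequency.
    """
    kept = {}
    for variant_id, dosages in genotypes.items():
        call_rate, alt_af = variant_stats(dosages)
        if alt_af is not None and call_rate >= min_call_rate:
            kept[variant_id] = alt_af
    return kept


# Made-up genotypes for three variants across four samples.
toy_genotypes = {
    "v1": [0, 1, 2, 1],
    "v2": [0, None, None, 1],
    "v3": [0, 0, 0, 0],
}
passing = filter_variants(toy_genotypes)
```

In this toy input, `v2` is dropped because only half of its samples have a call, while `v1` and `v3` are retained with their frequencies.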

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Hufeng Zhou - Early Career Tenure-track Researcher, Harvard T. H. Chan School of Public Health

Collaborators:

  • Jun Qian - Other, All of Us Program Operational Use
Request a Review of this Research Project

You can request that the All of Us Resource Access Board (RAB) review a research purpose description if you have concerns that this research project may stigmatize All of Us participants or violate the Data User Code of Conduct in some other way. To request a review, you must fill in a form, which you can access by selecting ‘request a review’ below.