Research Projects Directory

16,893 active projects

This information was updated 3/19/2025

The Research Projects Directory includes information about all projects that currently exist in the Researcher Workbench to help provide transparency about how the Workbench is being used. Each project specifies whether Registered Tier or Controlled Tier data are used.

Note: Researcher Workbench users provide information about their research projects independently. Views expressed in the Research Projects Directory belong to the relevant users and do not necessarily represent those of the All of Us Research Program. Information in the Research Projects Directory is also cross-posted on AllofUs.nih.gov in compliance with the 21st Century Cures Act.

Duplicate of LiRong lab PTSD and ASUD/OUD

Scientific Questions Being Studied

The purpose of this research is to evaluate a transformer-based binary classification model that predicts the development of Alcohol and Substance Use Disorder (ASUD) or Opioid Use Disorder (OUD) in patients diagnosed with Post-Traumatic Stress Disorder (PTSD) from their hospital visits, conditions, medication use, and other Electronic Health Record (EHR) data.

Project Purpose(s)

  • Disease Focused Research (psychiatric disorders)
  • Population Health
  • Social / Behavioral
  • Drug Development
  • Methods Development

Scientific Approaches

STUDY DESIGN
• Study design: Retrospective data analysis
• Study population and sample size: PTSD and ASUD patients
• Data source: All of Us
• Analytic methods: Deep learning (DL)-based modeling

Our proposed research will use DL to evaluate hospital visit and diagnosis patterns and to predict the risk of ASUD/OUD development in patients diagnosed with PTSD.

Anticipated Findings

Contribution to scientific field:
1. Advancing care through DL and statistical analysis. Given patient information and medical history, the risk of developing ASUD/OUD can be predicted at little cost, assisting health care personnel in preliminary diagnosis.
2. Application of new computational tools for analysis of big data. This project will develop new integrated pharmaco-analytical tools for PTSD and ASUD/OUD. These computational tools will be used to analyze interactions between multimodal information for stronger clinical outcome prediction.

Demographic Categories of Interest

  • Race / Ethnicity
  • Age
  • Sex at Birth
  • Gender Identity
  • Sexual Orientation
  • Geography
  • Disability Status
  • Access to Care
  • Education Level
  • Income Level

Data Set Used

Controlled Tier

Research Team

Owner:

statin

Scientific Questions Being Studied

Statins, a class of lipid-lowering agents, are widely used to prevent cardiovascular disease. However, recent studies have reported a risk of metabolic disorders among statin users. We will investigate the association between genetic factors and statin adverse events.

Project Purpose(s)

  • Drug Development
  • Ancestry

Scientific Approaches

We will conduct various analyses such as genome-wide association study, polygenic risk score analysis, and others.

Anticipated Findings

Finding genetic factors for dementia could improve patients' quality of life and contribute to developing treatment methods.

Demographic Categories of Interest

  • Race / Ethnicity

Data Set Used

Controlled Tier

Research Team

Owner:

  • Yoon-A Park - Graduate Trainee, Ewha Womans University, College of Pharmacy
  • Da Hoon Lee - Graduate Trainee, Ewha Womans University, College of Pharmacy

Duplicate of How to query All by All results and analysis details

Scientific Questions Being Studied

The All by All tables encompass about 3,400 phenotypes with gene-based and single-variant associations across nearly 250,000 whole genome sequences. The phenotypes in the All by All tables include data from the Personal and Family Health History survey, physical measurements, lab measurements, conditions, and medications.

The All by All tables are made available to Controlled Tier researchers as Hail Tables and Hail MatrixTables. In this way, the summary statistics of the association tests can be incorporated into research studies without researchers having to perform these costly analyses themselves. More details about the All by All tables can be found in the User Support Hub article: https://support.researchallofus.org/hc/en-us/articles/27049847988884-Overview-of-the-All-by-All-tables-available-on-the-All-of-Us-Researcher-Workbench.

Project Purpose(s)

  • Educational

Scientific Approaches

The All by All tables are available as Hail Tables and Hail MatrixTables. This Featured Workspace focuses on demonstrating methods to effectively filter and export summary statistics of interest from the All by All tables. For example, it provides a method to filter the P value and Beta from a phenotype-specific Hail Table.
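
The filtering idea described above can be sketched in plain Python. This is only an illustration under stated assumptions: the rows, the field names (locus, pvalue, beta), and the 5e-8 threshold are hypothetical stand-ins, not the actual All by All schema or the workspace's own Hail code.

```python
# Hypothetical rows standing in for a phenotype-specific summary-statistics table.
# Field names are assumptions made for this sketch, not the real All by All schema.
rows = [
    {"locus": "chr1:1000", "pvalue": 3.2e-9, "beta": 0.12, "n_cases": 500},
    {"locus": "chr2:2000", "pvalue": 0.04, "beta": -0.01, "n_cases": 500},
    {"locus": "chr7:7000", "pvalue": 1.0e-8, "beta": -0.30, "n_cases": 500},
]


def filter_significant(rows, threshold=5e-8):
    """Keep only locus, P value, and beta for rows below the P value threshold."""
    return [
        {"locus": r["locus"], "pvalue": r["pvalue"], "beta": r["beta"]}
        for r in rows
        if r["pvalue"] < threshold
    ]


hits = filter_significant(rows)
for hit in hits:
    print(hit["locus"], hit["pvalue"], hit["beta"])
```

In a real workspace the same select-then-filter pattern would be applied to the Hail Table itself rather than to an in-memory list.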

Anticipated Findings

The All by All tables include known and novel associations between genomic and phenotypic data contributed by All of Us participants. These associations will enable a number of research studies, such as examining a genetic variant of interest for association with a variety of disease conditions and other phenotypes.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Genome-Phenome Foundation Model (V7)

Scientific Questions Being Studied

The objective of this project is to develop a deep learning-based genomic foundation model to provide mechanistic insights into pan-disease biology and substantially advance precision medicine. In contrast with traditional methods in statistical genetics (e.g. Genome-Wide Association Studies (GWAS), and Polygenic Risk Scores (PRS)), this approach aims to more effectively utilize the vast amounts of genome-wide sequencing data currently available through self-supervised learning and data-driven patient subtyping. It also aims to actively learn from human ancestral variation, rather than ignoring it, in order to make its predictions more generalisable and equitable across different human populations. If successful, this framework will offer a more scalable, accurate, and unbiased means of interrogating genome-phenome interactions.

Project Purpose(s)

  • Population Health
  • Drug Development
  • Methods Development
  • Control Set
  • Ancestry
  • Ethical, Legal, and Social Implications (ELSI)

Scientific Approaches

First, we train a self-supervised, DNA-only foundation model to learn the general patterns of genetic sequences. This is analogous to how Large Language Models (like ChatGPT) first pretrain on large corpora of unstructured text to learn general principles of language and concepts. Next, we will embed the genomes of individuals into lower-dimensional representations that capture important variation. Finally, we will train a phenome-wide model to predict relationships between the genome and all phenotypes (symptoms, diseases, traits). Unlike traditional approaches, we aim to create unified representations of the human phenome using knowledge graphs that describe the relationships between diverse medical concepts. This allows us to study the genomic underpinnings of all diseases on a continuous spectrum and make more precise patient-specific predictions.

Anticipated Findings

The proposed genomic foundation model will have numerous scientific and clinical applications. First, it can be used to project genomes from new individuals into the shared latent space to identify digital twins, making it a powerful multi-disease diagnostic tool. In silico experiments can be run to predict the phenotypic outcomes of specific therapeutic interventions, thereby maximising efficacy and minimising off-target effects. In other words, this model could be used to inform the safest and most effective course of treatment given an individual's particular genetic makeup. Furthermore, it could serve as a high-throughput engine for novel therapeutic target discovery that takes into account the effect across all phenotypes and diseases simultaneously. This could be used to identify individuals who are most likely to respond to a given therapeutic, permitting more targeted clinical trials with higher success rates.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Collaborators:

  • Moritz Schaefer - Research Associate, Medical University Vienna, Austria

All of Us - Controlled

Scientific Questions Being Studied

The goal of the workspace is to develop a pipeline to streamline the retrieval of GWAS summary statistics from a user-entered phenotype accession number. Later goals include integrating qqman (https://CRAN.R-project.org/package=qqman), LocusZoomR (https://CRAN.R-project.org/package=locuszoomr), and MetaXcan S-PrediXcan (https://github.com/hakyimlab/MetaXcan) as downstream analyses of the summary statistics.

Project Purpose(s)

  • Educational

Scientific Approaches

We will use GWAS summary statistics from the All of Us database, together with Jupyter notebooks, to build a pipeline that, given a phenotype accession number, can find the corresponding Hail MatrixTable, draw a Manhattan plot from the table, draw a LocusZoom plot, and compute omic associations.
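
As a rough illustration of the Manhattan plot step, the sketch below computes the plotting coordinates: a cumulative genomic position on the x-axis and -log10(P) on the y-axis. It is not the project's pipeline; the chromosome lengths, variant tuples, and the function name manhattan_points are hypothetical values chosen for the example.

```python
import math


def manhattan_points(stats, chrom_lengths):
    """Return (x, y) pairs: cumulative genomic position and -log10(p).

    stats: iterable of (chromosome, position, p_value) tuples.
    chrom_lengths: list of (chromosome, length) in plotting order.
    """
    offsets, running = {}, 0
    for chrom, length in chrom_lengths:
        offsets[chrom] = running  # where this chromosome starts on the x-axis
        running += length
    return [(offsets[c] + pos, -math.log10(p)) for c, pos, p in stats]


# Toy inputs (assumed values, not real genome coordinates).
chrom_lengths = [("1", 1000), ("2", 800)]
stats = [("1", 100, 1e-3), ("2", 50, 1e-8)]
points = manhattan_points(stats, chrom_lengths)
print(points)
```

The resulting pairs can be passed to any plotting library; packages such as qqman perform an equivalent transformation internally.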

Anticipated Findings

A GitHub repository for the pipeline will be made available and updated as the research progresses. This community workspace provides a training environment for a coursework project in Loyola University Chicago’s master's in bioinformatics degree program.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Collaborators:

  • Drew Patterson - Graduate Trainee, Loyola University Chicago
  • Benjamin Moginot - Graduate Trainee, Loyola University Chicago
  • Angelina Carcione - Graduate Trainee, Loyola University Chicago

GWAS, TWAS, and Data Viz in the All of Us Cloud

Scientific Questions Being Studied

The goal of the workspace is to develop a pipeline to streamline the retrieval of GWAS summary statistics from a user-entered phenotype accession number. Later goals include integrating qqman (https://CRAN.R-project.org/package=qqman), LocusZoomR (https://CRAN.R-project.org/package=locuszoomr), and MetaXcan S-PrediXcan (https://github.com/hakyimlab/MetaXcan) as downstream analyses of the summary statistics.

Project Purpose(s)

  • Educational

Scientific Approaches

We will use GWAS summary statistics from the All of Us database, together with Jupyter notebooks, to build a pipeline that, given a phenotype accession number, can find the corresponding Hail MatrixTable, draw a Manhattan plot from the table, draw a LocusZoom plot, and compute omic associations.

Anticipated Findings

A GitHub repository for the pipeline will be made available and updated as the research progresses. This community workspace provides a training environment for a coursework project in Loyola University Chicago’s master's in bioinformatics degree program.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Registered Tier

Research Team

Owner:

Collaborators:

  • Drew Patterson - Graduate Trainee, Loyola University Chicago
  • Benjamin Moginot - Graduate Trainee, Loyola University Chicago
  • Angelina Carcione - Graduate Trainee, Loyola University Chicago

Dup. of SGM Identity, Adverse SDOH, and Trauma-Related Mental Health Disorders

Scientific Questions Being Studied

This project will investigate the links between adverse social determinants of health, stress associated with minority group membership (sexual and gender minority, or SGM, membership), and the prevalence of trauma-related mental health conditions. It will also compare SGM and non-SGM people at the group level to examine how minority stress may affect social determinants of health and diagnosis with trauma-related mental health conditions.

These questions are important because extant literature has illustrated the mental health disparities experienced by SGM and has hypothesized (e.g., through the minority stress model) that social determinants of health (and specifically experiences of discrimination) may explain the disparity. However, little quantitative research to date has positioned adverse social determinants of health as a proxy for psychological trauma and explored all four categories of disorder in the present study (anxiety, depressive, PTSD and SU disorder).

Project Purpose(s)

  • Population Health
  • Social / Behavioral

Scientific Approaches

This study will use the All of Us Controlled Tier Dataset v8, analyzed in either Jupyter notebooks (with R) or RStudio. It will use independent-measures t-tests, regression analysis, and moderation analysis. Depending on the initial results of these tests, the study may also use other statistical methods, such as path analysis.
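
For readers unfamiliar with the independent-measures t-test mentioned above, a minimal sketch follows. It uses Python rather than the R the study plans to use, it computes Welch's variant (which does not assume equal variances, a choice made for this example and not stated by the project), and the two groups are invented numbers.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    # Standard error of the difference in means, using sample variances.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Toy scores for two independent groups (made-up data for illustration).
group_a = [2.0, 4.0, 6.0]
group_b = [1.0, 2.0, 3.0]
t_stat = welch_t(group_a, group_b)
print(round(t_stat, 4))
```

A full analysis would also compute degrees of freedom and a P value, which statistical packages in R or Python provide directly.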

Anticipated Findings

It is anticipated that the study will reveal that sexual and gender minority people are more likely to experience adverse social determinants of health related to psychological trauma and that experience with these conditions may explain the previously documented disparity in diagnosis with particular trauma-related mental health disorders. This anticipated finding will help further nuance conversation around mental health disparities for sexual and gender minority people to recognize how systemic and structural inequities trickle down to individual experiences and inform future research and public health initiatives.

Demographic Categories of Interest

  • Sex at Birth
  • Gender Identity
  • Sexual Orientation

Data Set Used

Controlled Tier

Research Team

Owner:

  • Zachary McNiece - Early Career Tenure-track Researcher, San Jose State University

Collaborators:

  • Sean Bullock - Graduate Trainee, San Jose State University

Stroke GWAS v8 CDR

Scientific Questions Being Studied

Stroke is a major cause of global morbidity and mortality. Genome-wide association studies (GWAS) have improved our understanding of stroke and its risk factors. However, a majority of GWAS have focused on European ancestry populations, leaving other populations underrepresented. Our study aims to identify genetic variants associated with stroke in individuals with African ancestry.

Project Purpose(s)

  • Ancestry

Scientific Approaches

We will use a subset of participants within the All of Us study with African ancestry and extract data pertaining to stroke events using ICD-10 codes. Genome-wide association study (GWAS) methods will be used to identify genetic associations with total stroke, ischemic stroke, and stroke subtypes.

Anticipated Findings

We anticipate finding both previously identified and novel genetic associations with stroke. These findings will expand our understanding of stroke biology and improve the representation of non-European populations in genetic research.

Demographic Categories of Interest

  • Race / Ethnicity

Data Set Used

Controlled Tier

Research Team

Owner:

  • Alice Man - Graduate Trainee, McMaster University

Collaborators:

  • Guilherme da Rocha - Research Fellow, McMaster University
  • Michael Chong - Early Career Tenure-track Researcher, McMaster University

Duplicate of Monogenic Diabetes v8

Scientific Questions Being Studied

Genetic analysis of MODY/monogenic diabetes. We are looking at known monogenic diabetes variants and their phenotypes.

Project Purpose(s)

  • Disease Focused Research (Monogenic Diabetes)

Scientific Approaches

We will use the genetic dataset of All of Us, R code on Jupyter Notebooks, and statistical regression to display the correlation of genes to their phenotypes, which will be pulled from ICD codes and other health record information.

Anticipated Findings

We anticipate finding associations between gene variants and phenotypes within the genes known to be risk factors for MODY. Our findings could help us better understand MODY, including its various components and complications. MODY disproportionately affects individuals of non-European ancestry. With the AoU dataset, we can provide a more robust analysis of MODY because of the greater genetic diversity available to us in this dataset.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Neighborhood Dynamics and Health Disparities

Scientific Questions Being Studied

This research uses machine learning to assess how neighborhood characteristics influence health outcomes for African Americans. We aim to quantify the impact of economic and non-economic factors on psychological well-being and explore their intersections. Prior research tends to focus on economic indicators and overlooks non-economic factors that can be protective in disadvantaged neighborhoods. By considering both factors, we aim to provide a comprehensive understanding of the social determinants of health and identify community assets that foster resilience. We also employ innovative machine learning techniques, which can provide a more robust assessment of model performance, identify the most important predictors, and uncover complex interactions. Findings can lead to more targeted policy and community programs by highlighting the importance of equitable resource allocation to underrepresented communities and interventions that improve social cohesion and environmental conditions.

Project Purpose(s)

  • Population Health
  • Social / Behavioral

Scientific Approaches

This study uses data from the All of Us Research Program. The dataset contains detailed demographic, socioeconomic, lifestyle, and health information from over one million participants, and survey data on perceived neighborhood characteristics, social life, stress, and everyday life perceptions. Data preprocessing and machine learning analyses will be conducted in R. Supervised and unsupervised machine learning will be used to quantify the predictive effects of neighborhood characteristics on psychological health for African Americans. We will evaluate several techniques, including regularized linear regression, support vector machines, multilayer perceptron, and random forest, and retain the model with the lowest testing error. Variable importance scores and SHAP values will be used to identify key neighborhood characteristics and interpret their relation to psychological health. H statistics will be used to identify interactions, with plots generated to interpret these interactions.
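
The model-selection rule described above (fit several candidate models, then retain the one with the lowest testing error) can be sketched as follows. This is a toy illustration in Python, not the study's R code: the two candidates (a mean predictor and a least-squares line) stand in for the regularized regression, SVM, MLP, and random forest models, and the data are invented.

```python
from statistics import mean


def mse(model, data):
    """Mean squared error of a fitted model on (x, y) pairs."""
    return mean((model(x) - y) ** 2 for x, y in data)


def fit_mean(train):
    """Baseline candidate: always predict the training mean."""
    m = mean(y for _, y in train)
    return lambda x: m


def fit_line(train):
    """Second candidate: ordinary least-squares line through the training data."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)


# Made-up train/test split for illustration only.
train = [(0, 0.0), (1, 2.0), (2, 4.0), (3, 6.0)]
test = [(4, 8.0), (5, 10.0)]

candidates = {"mean": fit_mean, "line": fit_line}
errors = {name: mse(fit(train), test) for name, fit in candidates.items()}
best = min(errors, key=errors.get)  # retain the model with the lowest test error
print(best, errors)
```

Evaluating on held-out data, as here, is what keeps the comparison from simply rewarding the most flexible model.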

Anticipated Findings

We anticipate finding that economic and non-economic neighborhood characteristics predict psychological well-being for African Americans. Socioeconomic indicators like income and employment rates are expected to remain strongly associated with psychological health. Additionally, respondents in neighborhoods with high social cohesion, low disorder, and adaptive environments will report better mental health, higher quality of life, greater social satisfaction, and increased resilience, despite economic disadvantage. We also expect a complex interplay between neighborhood characteristics that differentially impacts psychological health. Findings can highlight the importance of non-economic factors in predicting psychological well-being and identify the most robust predictors of this outcome. This work can also demonstrate the utility of machine learning and how these methods can be used to uncover complex interactions and generate new directions for research and theory development.

Demographic Categories of Interest

  • Race / Ethnicity

Data Set Used

Controlled Tier

Research Team

Owner:

  • Alaysia Brown - Research Fellow, Harvard Faculty of Arts and Sciences

AD clinical variables

Scientific Questions Being Studied

We aim to study how clinical factors, such as lifestyle, impact the response to Alzheimer's disease treatment. This research is essential for ensuring that patients receive the most effective treatments.

Project Purpose(s)

  • Disease Focused Research (Alzheimer's disease)

Scientific Approaches

We will use patient demographics, medical history, medication usage, cognitive assessments, and other clinical factors to correlate treatment responses with specific characteristics. R will be used to analyze the data and create visualizations.

Anticipated Findings

I anticipate identifying predictive factors for treatment efficacy, which will help optimize treatment outcomes. Ultimately, the study will contribute valuable insights to our understanding of Alzheimer’s treatment response and resistance.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Elizabeth Kim - Undergraduate Student, University of California, San Diego

Collaborators:

  • Tianyi Chen - Undergraduate Student, University of California, San Diego
  • Miski Abdi - Early Career Tenure-track Researcher, University of California, Los Angeles
  • Kevin Zhang - Undergraduate Student, University of California, San Diego

Genetics of dystonia V8

Scientific Questions Being Studied

Dystonia is a neurological disorder that causes involuntary muscle movement. Many forms of dystonia are inherited, but genetic factors that cause or predispose individuals to dystonia are still unknown in many patients. Thus, diagnosis, prognosis, and treatment strategies are limited.

Project Purpose(s)

  • Disease Focused Research (dystonia)
  • Ancestry

Scientific Approaches

We will conduct a genome-wide association study (GWAS) on a cohort of dystonia patients, excluding drug-induced or acquired forms of dystonia, to determine genetic variants associated with dystonic phenotypes.

Anticipated Findings

We anticipate that the large number of participants in All of Us will allow us to identify genetic associations of dystonia and help inform diagnosis, prognosis, and treatment of patients with dystonia.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

ME/CFS biomarker replication study

Scientific Questions Being Studied

Myalgic Encephalomyelitis/Chronic Fatigue Syndrome (ME/CFS) is a multi-system illness with an unknown set of biological causes and risk factors. The prevalence of ME/CFS is relatively high (estimated at 0.2-0.4% worldwide), and it is known to have a heritable/genetic component. There are currently no diagnostic criteria, and there is no curative treatment. Diagnosis of ME/CFS is typically a long-winded process that proceeds by exclusion.

In this study, we seek to understand if ME/CFS can be detected from blood-based biomarkers. This question is important as it has the potential to support the development of a blood-based diagnostic for the condition.

Project Purpose(s)

  • Disease Focused Research (Myalgic Encephalomyelitis / Chronic Fatigue Syndrome)
  • Control Set

Scientific Approaches

We aim to estimate the association of ME/CFS status with blood-based biomarkers, correcting for known confounders such as Sex assigned at birth and Age. ME/CFS is known to be a sex-biased condition that occurs more frequently in younger individuals. We will use a cohort of individuals with ICD-10 code G93.3 (CFS) and an indication of overall poor health. In order to obtain an accurate and robust estimate of these associations, we will leverage state-of-the-art non-parametric statistical methodology which minimises model misspecification bias and leads to robust error quantification. These one-step estimators are implemented in the R package npcausal. We will also perform prediction modelling.

This work is a partial replication study of our earlier large-scale analysis of the same scientific questions on the UK Biobank. Since the latter has been performed on white British individuals, we will restrict our study population in All of Us to white individuals of European descent.

Anticipated Findings

This work is a partial replication study of our earlier large-scale analysis of the same scientific questions on the UK Biobank (https://doi.org/10.1101/2024.08.26.24312606). Our anticipated findings are a replication (or not) of the association of ME/CFS with various blood-based biomarkers. An independent replication would solidify our findings on the UK Biobank. A discrepancy could be explained by a lower number of individuals with ME/CFS in All of Us than in the UK Biobank, or by a difference in the two populations.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Collaborators:

  • Artur Miralles Méharon - Graduate Trainee, The University of Edinburgh

Cancer Risk Prediction

Scientific Questions Being Studied

The overall objective of this application is to further develop and validate an existing model for predicting cancer risk from electronic health records, focused primarily on pancreatic cancer, lung cancer, and ovarian cancer, with a view to enabling future surveillance programs among the general public that can facilitate earlier detection of the disease in high-risk cohorts. Our central hypothesis is that artificial intelligence / machine learning (AI/ML) methods, specifically transformer-based neural network models, can produce usefully accurate predictions of pancreatic cancer risk based on temporal data in electronic health records. Early detection of cancer is critical because early treatment leads to better outcomes than late-stage treatment, and accurate tools for the design of surveillance programs based on patient records would have enormous benefits for public health. See also our May 2023 publication: doi.org/10.1038/s41591-023-02332-5.

Project Purpose(s)

  • Methods Development

Scientific Approaches

We plan to create a cancer risk prediction model based on trajectories of clinical data routinely available in the electronic health record using a transformer-based neural network. Clinical data elements to be considered include not only known risk factors such as diabetes, BMI, and smoking status, but also complete time trajectories of a large number of diagnosis codes from clinical histories, pharmacy records, and laboratory test results, as well as inherited genomic variants. We will use the All of Us EHR data as one source for prospectively testing our models, in addition to other outside datasets.

Anticipated Findings

With the completion of this research, (a) we will be poised to implement a prediction-surveillance program that can be used to identify patients who are at elevated risk for cancer and should be enrolled in surveillance and/or interception programs for disease detection, therapy, and prevention; (b) we will have characterized interactions among clinical risk factors on cancer risk; and (c) we will have produced a prototype method and open-source software tool that applies AI technology to patient trajectory data for prediction of disease risk for cancer and beyond.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Registered Tier

Research Team

Owner:

Duplicate of How to Work with All of Us Genomic Data (Hail - Plink)(v8)

Scientific Questions Being Studied

Not applicable - these notebooks demonstrate, through example analyses, how to use Hail and PLINK to perform genome-wide association studies using the All of Us genomic data and phenotypic data.

Project Purpose(s)

  • Other Purpose (Demonstrate to the All of Us Researcher Workbench users how to get started with the All of Us genomic data and tools. It includes an overview of all the All of Us genomic data and shows some simple examples on how to use these data.)

Scientific Approaches

Not applicable - these notebooks demonstrate, through example analyses, how to use Hail and PLINK to perform genome-wide association studies using the All of Us genomic and phenotypic data.

Anticipated Findings

Not applicable - these notebooks demonstrate, through example analyses, how to use Hail and PLINK to perform genome-wide association studies using the All of Us genomic and phenotypic data.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Duplicate of Beginner Intro to AoU Data and the Workbench

This workspace contains multiple notebooks that assess users' understanding of the workbench and OMOP. These notebooks are meant to help users check their knowledge not only on Python, R, and SQL, but also on the general data structure and data…

Scientific Questions Being Studied

This workspace contains multiple notebooks that assess users' understanding of the workbench and OMOP. These notebooks are meant to help users check their knowledge not only on Python, R, and SQL, but also on the general data structure and data model used by the All of Us program.

Project Purpose(s)

  • Educational

Scientific Approaches

There is no scientific approach used in this workspace because it is meant for educational purposes only. We will cover all aspects of OMOP, and hence will use most datasets available in the workbench.

Anticipated Findings

We do not anticipate any findings. Instead, we are educating people on the use of the workbench and the common data model, OMOP, used by the program.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Registered Tier

Research Team

Owner:

Telomere Length Among Racially Diverse Breast Cancer Survivors

Breast cancer disparities exist and these may be due to experiences of chronic stress. This project will investigate biological aging, as measured by telomere length, among racially and ethnically diverse breast cancer survivors to better understand biological mechanisms in breast…

Scientific Questions Being Studied

Breast cancer disparities exist and these may be due to experiences of chronic stress. This project will investigate biological aging, as measured by telomere length, among racially and ethnically diverse breast cancer survivors to better understand biological mechanisms in breast cancer disparities. Using demographic, survey, actigraphy, and genomic data, we hypothesize that lower socioeconomic status, increased stress, and poor sleep will be associated with shorter telomere length. Findings from this study will uncover how stress gets under the skin and help us identify points of intervention to address health inequities.

Project Purpose(s)

  • Disease Focused Research (female breast cancer)
  • Population Health
  • Social / Behavioral
  • Educational
  • Ancestry

Scientific Approaches

The datasets we will be using include the whole genomic data and long-read Telomere-to-Telomere (T2T) Hail MatrixTable (MT) to obtain telomere length data as well as demographic, electronic health records, actigraphy, and survey data. We will estimate relationships between our measures of interest, with telomere length as the primary outcome.

Anticipated Findings

Current research shows racial and ethnic breast cancer disparities that may be due to chronic stress. Chronic stress has been shown to impact biological aging, specifically by shortening telomere length. However, very little is known about telomere length in racially and ethnically diverse breast cancer survivors. We anticipate racial/ethnic differences in breast cancer survivors' telomere length, along with differences in telomere length associated with other stress and resilience factors. Findings from this study will uncover how stress gets under the skin and help us identify points of intervention to address health inequities.

Demographic Categories of Interest

  • Race / Ethnicity
  • Age
  • Education Level
  • Income Level

Data Set Used

Controlled Tier

Research Team

Owner:

  • Erica Tate - Undergraduate Student, San Francisco State University

Duplicate of All of Us v7 GWAS on LDL Cholesterol with Regenie

The main questions this workspace is attempting to address are whether a scalable GWAS can be created on the AoU platform and whether (and how) that GWAS can be optimized for run cost and time. These questions are important for…

Scientific Questions Being Studied

The main questions this workspace is attempting to address are whether a scalable GWAS can be created on the AoU platform and whether (and how) that GWAS can be optimized for run cost and time. These questions are important for making the AoU platform (and a common genomics analysis like a GWAS) as accessible and understandable as possible for researchers of all experience levels. By attempting to address the scalability of this GWAS and comparing its methodological accuracy, this workspace will confirm the usability of the data and Workbench platform for genomic research. This workspace allows researchers to reliably analyze future AoU data releases, and performing this GWAS multiple times and tracking cluster metrics will reveal optimized configurations to better inform future researchers seeking to recreate this GWAS or perform other analyses of similar computational intensity.

Project Purpose(s)

  • Methods Development
  • Other Purpose (The purpose of this workspace is to recreate an efficient and scalable Genome Wide Association Study (GWAS) across whole genome sequenced data on an LDL Cholesterol phenotype with Regenie and dsub.)

Scientific Approaches

This workspace is intended to provide a functional and scalable GWAS on AoU data. The GWAS will apply the methodologies of the hail.is GWAS tutorial, the featured workspace GWAS, Nicole DeFlaux's and Margret Sunitha's phenotype generation and PC analysis, as well as Seung Hoan Choi and Xin Wang's QC methodologies used in the GWAS demonstration project—the corresponding papers of which are available here: https://www.biorxiv.org/content/10.1101/2022.11.29.518423v2 https://www.medrxiv.org/content/10.1101/2022.11.23.22282687v1

Anticipated Findings

The research conducted in this study is not novel and there are no anticipated findings from this study other than a successful recreation of prior GWAS performances. The success of this replication, however, will contribute to the body of bioinformatic knowledge by further acknowledging the utility and necessity of cloud-based analysis platforms that enable genomics research. Moreover, this replication's success establishes the validity of the All of Us Researcher Workbench and dataset as usable, reliable resources with which genomics analyses can be conducted.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Olivia Inch - Undergraduate Student, Pennsylvania State University

Common metabolic disease genetic association analysis (v7)

Common metabolic diseases (type 2 diabetes, cardiovascular disease, etc.) are influenced by both genetic and non-genetic risk factors, such as lifestyle habits (e.g. smoking, alcohol consumption, dietary patterns, physical activity, and sleep duration). Understanding both genetic and non-genetic factors and…

Scientific Questions Being Studied

Common metabolic diseases (type 2 diabetes, cardiovascular disease, etc.) are influenced by both genetic and non-genetic risk factors, such as lifestyle habits (e.g. smoking, alcohol consumption, dietary patterns, physical activity, and sleep duration). Understanding both genetic and non-genetic factors and their interactions may provide insights into potential pharmacologic targets and further lay the scientific foundation for precision interventions. Genome-wide association studies (GWAS) can characterize the genetic effects on complex diseases. The All of Us Research Program offers an excellent path forward to use GWAS to elucidate human disease drivers. In response to FNIH RFP2 “GENERATION of New genetic, -omic, or biomarker data for Common Metabolic Diseases,” we propose to generate and discover novel genetic associations in the All of Us Research Program.

Project Purpose(s)

  • Disease Focused Research (Lifespan diabetes, diabetes complications and other metabolic diseases)

Scientific Approaches

We will perform disease, disease-related trait, and covariate harmonization in the All of Us Cohort for diseases (type 2 diabetes, cardiovascular disease outcomes and hypertension), disease-related traits (glycemic traits, CVD risk factors, blood pressure), and important covariates (adiposity traits, smoking, alcohol consumption, dietary patterns, physical activity, and sleep duration). We will report counts and power calculations that describe the genetic associations that could be discovered in the All of Us Cohort. We will explore the availability of additional common metabolic diseases and complications and perform power calculations for these traits. We will perform genetic association analyses with each disease and outcome (type 2 diabetes, type 1 diabetes, obesity, atherosclerosis, chronic kidney disease, heart failure, and NAFLD).

Anticipated Findings

We aim to discover novel genetic associations with metabolic disease traits and contribute summary statistics to the Common Metabolic Disease Knowledge Portal.

Demographic Categories of Interest

  • Race / Ethnicity
  • Age

Data Set Used

Controlled Tier

Research Team

Owner:

  • Kaavya Ashok - Early Career Tenure-track Researcher, Broad Institute
  • Alisa Manning - Early Career Tenure-track Researcher, Mass General Brigham

Collaborators:

  • Yonah Borns-Weil - Project Personnel, Mass General Brigham
  • Ravi Mandla - Project Personnel, Broad Institute
  • Raymond Kreienkamp - Research Fellow, Broad Institute
  • Lukasz Szczerbinski - Research Fellow, Broad Institute
  • Varun Lingadal - Research Assistant, Boston Children's Hospital
  • Katherine Taylor - Other, Broad Institute
  • Stephanie Giamberardino - Other, All of Us Researcher Academy/RTI International
  • Sara Cromer - Research Fellow, Mass General Brigham
  • Reagan Ballard - Undergraduate Student, University of North Carolina, Chapel Hill
  • Laura Raffield - Other, University of North Carolina, Chapel Hill
  • Miriam Udler - Early Career Tenure-track Researcher, Broad Institute
  • Micah Hysong - Graduate Trainee, University of North Carolina, Chapel Hill
  • Mali DiMeo - Research Assistant, Boston Children's Hospital
  • Maheak Vora - Project Personnel, Broad Institute
  • Josephine Li - Early Career Tenure-track Researcher, Mass General Brigham
  • Josep Mercader - Early Career Tenure-track Researcher, Broad Institute
  • Jia Zhu - Early Career Tenure-track Researcher, Boston Children's Hospital
  • Harry Wright - Research Assistant, University of Exeter
  • Grier Page - Senior Researcher, All of Us Researcher Academy/RTI International
  • Gareth Hawkes - Research Fellow, University of Exeter
  • Alicia Huerta - Research Fellow, Broad Institute
  • Aaron Deutsch - Research Fellow, Mass General Brigham
  • Alexandra Barry - Graduate Trainee, Mass General Brigham

Metabolic Syndrome in SCI

To determine if there are sex differences in the prevalence of cardiometabolic syndrome and associated risk factors in individuals with spinal cord injury.

Scientific Questions Being Studied

To determine if there are sex differences in the prevalence of cardiometabolic syndrome and associated risk factors in individuals with spinal cord injury.

Project Purpose(s)

  • Population Health

Scientific Approaches

We will compare the prevalence of cardiometabolic syndrome between males and females with spinal cord injury by assessing body mass index, diagnosis of type 2 diabetes, HDL-C and triglyceride concentrations, and blood pressure measurements. We will include key confounders in our analysis, including age, race, smoking, and alcohol use.

Anticipated Findings

This research will further knowledge of sex and gender differences in cardiometabolic diseases for individuals with spinal cord injury.

Demographic Categories of Interest

  • Disability Status

Data Set Used

Controlled Tier

Research Team

Owner:

Collaborators:

  • Jia Li - Other, Ohio State University

Cancer Risk Study v8

Cancer is among the leading causes of death worldwide. While there are many factors that increase risk for cancer, like smoking and exposure to carcinogens, genetics also play a huge role. The hypothesis of this project is that genomic evaluation…

Scientific Questions Being Studied

Cancer is among the leading causes of death worldwide. While there are many factors that increase risk for cancer, like smoking and exposure to carcinogens, genetics also play a huge role. The hypothesis of this project is that genomic evaluation of individuals with cancer and healthy individuals will aid us in identifying cancer risk genes. Therefore, the overarching goal of our study involves a case-control cohort analysis to identify inherited risk genes for cancer for individuals of different origins. Identifying genetic factors for tissue-specific cancer risk can then be used towards the development of a genetic diagnostic test in the clinic to identify individuals at high risk of cancer.

Project Purpose(s)

  • Disease Focused Research (Cancer)
  • Methods Development
  • Ancestry

Scientific Approaches

We will take advantage of the massive amount of genomic information in All of Us and select variants that are more likely to be disease associated or pathogenic. We will then conduct association testing to identify mutations that are more frequent in cases compared to controls. Upon finding genes of interest, we will examine the top candidates in detail. The tools required for our pipeline include vcftools, bcftools, samtools, vt, snpeff, annovar, king, plink and plink/seq.
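The case-versus-control comparison described above can be illustrated with a one-sided Fisher's exact test on carrier counts. This is only a minimal sketch: the project does not state which association test it uses, and the function name and the counts in the usage note are hypothetical.

```python
from math import comb


def fisher_exact_greater(case_carriers, case_total, control_carriers, control_total):
    """One-sided Fisher's exact test (assumed test, not named in the project).

    Returns the probability, under the null of no association, of seeing at
    least `case_carriers` variant carriers among the cases.
    """
    carriers = case_carriers + control_carriers
    total = case_total + control_total
    denom = comb(total, case_total)
    upper = min(carriers, case_total)
    p_value = 0.0
    for k in range(case_carriers, upper + 1):
        # Hypergeometric probability that exactly k carriers fall among the cases.
        p_value += comb(carriers, k) * comb(total - carriers, case_total - k) / denom
    return p_value
```

For example, with made-up counts, `fisher_exact_greater(3, 3, 0, 3)` asks how surprising it is that all three carriers in a six-person sample are cases. A small p-value suggests the variant is enriched in cases, although a real analysis would also need to account for ancestry and relatedness.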

Anticipated Findings

This study will help us identify genes/pathways that predispose an individual to cancer. These findings can then be used towards the development of a genetic diagnostic test in the clinic to identify high-risk individuals. These high-risk individuals would then benefit from personalized screening programs that usually target high-risk groups (for example, heavy drinkers, smokers, and those with a family history). Early detection could improve cancer prognosis substantially.

Demographic Categories of Interest

  • Race / Ethnicity

Data Set Used

Controlled Tier

Research Team

Owner:

G6PDD Validation Study

G6PDdef is commonly associated with anemia and protection against malaria, though hypotheses still form around other defining traits. Study results show that G6PD expression is most prevalent in bone marrow and testis tissue, but this enzyme plays an important role…

Scientific Questions Being Studied

G6PDdef is commonly associated with anemia and protection against malaria, though hypotheses still form around other defining traits. Study results show that G6PD expression is most prevalent in bone marrow and testis tissue, but this enzyme plays an important role in metabolism, so its deficiency might affect a variety of organs. Previous attempts to examine specific hypotheses about non-hematologic manifestations of G6PDdef have been poorly powered or designed for analyzing across carrier males and females and, thus, non-conclusive. A recent PheWAS confirmed one of these hypotheses: that G6PDdef is associated with increased complications of diabetes mellitus (Breeyear et al., 2024). This study supports our hypothesis that a properly powered and designed PheWAS can empower a deeper understanding of G6PDdef. Our study performs PheWAS using All of Us (AoU) datasets to analyze individuals with G6PDdef.

Project Purpose(s)

  • Disease Focused Research (glucose-6-phosphate dehydrogenase deficiency)
  • Ancestry

Scientific Approaches

For our study, our primary aim is to perform a PheWAS focused only on traits for which it has been posited that there is an association with G6PDdef. Our discovery PheWAS will be undertaken with the AoU data from all populations using males hemizygous for G6PD alleles rated as class A or B as per Geck et al. as well as females homozygous or compound heterozygous for class A and/or class B alleles. Each allele will be filtered using metrics (read depth, b-allele frequency, genotype quality) to ensure high quality. Bonferroni corrections for multiple hypothesis testing for the PheWAS will be based on the number of traits used in the Concept ID list. Our replication PheWAS will be undertaken with the new cases from the v8 dataset using males and females as described for the discovery work. Multiple-hypothesis correction will be used based on the number of traits showing statistically significant association from the discovery PheWAS.
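The two-stage correction described above can be sketched as follows: the discovery threshold divides the significance level by the number of traits tested, and the replication threshold divides it by the number of traits that passed discovery. The function name, the 0.05 significance level, and the trait labels and p-values are all hypothetical placeholders, not values from the project.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the Bonferroni per-test threshold and the traits that pass it.

    `p_values` maps a trait label to its p-value. The default alpha of 0.05
    is an assumption; the project does not state its significance level.
    """
    if not p_values:
        raise ValueError("at least one tested trait is required")
    threshold = alpha / len(p_values)
    hits = sorted(trait for trait, p in p_values.items() if p < threshold)
    return threshold, hits


# Made-up discovery results for four hypothetical traits.
discovery = {"trait_a": 0.001, "trait_b": 0.02, "trait_c": 0.3, "trait_d": 0.004}
discovery_threshold, discovery_hits = bonferroni_significant(discovery)

# Replication corrects only for the traits that passed discovery.
replication = {"trait_a": 0.01, "trait_d": 0.2}
replication_threshold, replication_hits = bonferroni_significant(replication)
```

Because fewer traits are carried into replication, the replication threshold is less strict than the discovery threshold, which is the practical benefit of the two-stage design.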

Anticipated Findings

Our study will provide improved understanding of the genetic basis of human disease. We aim to better understand the phenotypic profile of G6PDdef, which would lead to more accurate diagnoses and treatment plans in the clinic.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • David D'Onofrio - Graduate Trainee, Icahn School of Medicine at Mount Sinai

Duplicate of How to run WDLs using Cromwell in the Researcher Workbench

The purpose of this workspace is to demonstrate how to use Cromwell within the Researcher Workbench. This workspace will demonstrate writing a WDL script to validate VCF files.

Scientific Questions Being Studied

The purpose of this workspace is to demonstrate how to use Cromwell within the Researcher Workbench. This workspace will demonstrate writing a WDL script to validate VCF files.

Project Purpose(s)

  • Educational

Scientific Approaches

The purpose of this workspace is to demonstrate how to use Cromwell within the researcher workbench. This workspace will demonstrate writing a WDL script to validate VCF files.

Anticipated Findings

N/A

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Yunxi Li - Graduate Trainee, Boston Children's Hospital

Passive Predictors of Postoperative Pain and Mental Health

Over 300 million surgeries are performed worldwide each year. Approximately 10-35% of surgical patients experience Chronic Post-Surgical Pain (CPSP), typically defined as pain at the site of surgery lasting for 3 or more months. As the number of surgeries increases…

Scientific Questions Being Studied

Over 300 million surgeries are performed worldwide each year. Approximately 10-35% of surgical patients experience Chronic Post-Surgical Pain (CPSP), typically defined as pain at the site of surgery lasting for 3 or more months. As the number of surgeries increases both nationally and globally, CPSP is becoming a major public health problem associated with poor quality of life, inability to return to work, and greater healthcare costs. Given these adverse outcomes and the number of impacted individuals, it is critical to identify and address modifiable risk factors for transition from acute to chronic pain after surgery. The purpose of the current research is to identify risk factors for CPSP and related problems (e.g., mental health conditions) that can be measured unobtrusively via consumer wearable devices.

Project Purpose(s)

  • Disease Focused Research (Persistent Post-Surgical Pain)

Scientific Approaches

This study will include individuals in the All of Us study who: 1) had a documented surgical procedure; 2) provided at least 5 days of Fitbit data within 31 days before and/or after the date of surgery; and 3) completed the Overall Health survey at least 3 months and not more than 3 years after the date of surgery. From the Fitbit data, we will extract average pre- and post-operative sleep, activity, and heart rate information, as well as within-person trends (e.g., increase or decrease in activity over time) and day-to-day variability. We will use logistic regression to examine whether these variables are associated with pain and/or mental health symptoms reported 3 months to 3 years after surgery.
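The three per-person summaries mentioned above (average, within-person trend, and day-to-day variability) could be derived from a daily series roughly as shown below. This is a sketch under assumptions: the project does not specify its formulas, so the trend is taken here as a least-squares slope per day and the variability as a sample standard deviation, and the function name and step counts are hypothetical.

```python
from statistics import mean, stdev


def summarize_daily_series(observations, min_days=5):
    """Summarize (day_offset, value) pairs for one person and one metric.

    Assumed definitions: trend = least-squares slope per day,
    variability = sample standard deviation of the daily values.
    `min_days` mirrors the 5-day minimum in the inclusion criteria.
    """
    if len(observations) < min_days:
        raise ValueError(f"need at least {min_days} days of data")
    days = [day for day, _ in observations]
    values = [value for _, value in observations]
    day_mean = mean(days)
    value_mean = mean(values)
    sxx = sum((day - day_mean) ** 2 for day in days)
    if sxx == 0:
        raise ValueError("observations must span more than one distinct day")
    sxy = sum((day - day_mean) * (value - value_mean) for day, value in observations)
    return {
        "mean": value_mean,
        "trend_per_day": sxy / sxx,
        "variability_sd": stdev(values),
    }


# Made-up post-operative step counts, keyed by days since surgery.
steps = [(1, 1000), (2, 2000), (3, 3000), (4, 4000), (5, 5000)]
summary = summarize_daily_series(steps)
```

Summaries like these, computed separately for the pre- and post-operative windows, would then serve as predictors in the logistic regression.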

Anticipated Findings

We expect sleep, activity, and heart rate information derived from Fitbits to be associated with persistent postoperative pain and mental health symptoms. Preoperatively, we expect lower sleep quantity and quality, higher night-to-night sleep variability, and lower physical activity to be risk factors for persistent postoperative pain and mental health problems. In the first month after surgery, we expect positive day-to-day trends in sleep and physical activity to be associated with lower risk of persistent postoperative pain and mental health problems. This knowledge will aid in early identification of individuals at risk for persistent postoperative pain.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Madelyn Frumkin - Early Career Tenure-track Researcher, Dartmouth College

Collaborators:

  • Wenyu Zhang - Graduate Trainee, Dartmouth College

TestProj

Exploring dataset

Scientific Questions Being Studied

Exploring dataset

Project Purpose(s)

  • Ancestry

Scientific Approaches

Kaplan-Meier, Cox regression

Anticipated Findings

Not sure

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:
