Research Projects Directory

13,290 active projects

This information was updated 9/19/2024

The Research Projects Directory includes information about all projects that currently exist in the Researcher Workbench to help provide transparency about how the Workbench is being used. Each project specifies whether Registered Tier or Controlled Tier data are used.

Note: Researcher Workbench users provide information about their research projects independently. Views expressed in the Research Projects Directory belong to the relevant users and do not necessarily represent those of the All of Us Research Program. Information in the Research Projects Directory is also cross-posted on AllofUs.nih.gov in compliance with the 21st Century Cures Act.

CV Disease Intermediate Module

Scientific Questions Being Studied

Train users to use the All of Us database.

Project Purpose(s)

  • Educational

Scientific Approaches

To explore Fitbit data and CV disease for training purposes.

Anticipated Findings

Users trained to use the All of Us database.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Registered Tier

Research Team

Owner:

Collaborators:

  • Riley Breitlow - Undergraduate Student, Crown College
  • Jocelynn Shockley - Undergraduate Student, Crown College
  • Ella Hollander - Undergraduate Student, Crown College

Analyze Missing Values Project

Scientific Questions Being Studied

The project focuses on assessing the effectiveness of an existing guide to depression medications for minority populations, specifically African Americans diagnosed with Major Depression (without Bipolar disorder). The key antidepressant drug in focus for this project is Venlafaxine.

Project Purpose(s)

  • Educational

Scientific Approaches

The project involves data cleaning, specifically addressing missing values and imputing them, followed by data transformation and machine learning model building and testing. This project utilizes logistic regression and Lasso regression, including pairwise interactions of factors that predict response to antidepressants. The SAFE procedure will be used for dimensionality reduction.
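The pairwise interactions mentioned above can be illustrated with a minimal Python sketch. It only shows how interaction terms might be built before model fitting; the feature names are hypothetical and are not All of Us variable names, and the Lasso fit and the SAFE procedure are not shown.

```python
from itertools import combinations


def add_pairwise_interactions(row):
    """Return a copy of `row` extended with the product of every pair of features."""
    expanded = dict(row)
    # Sort the names so the generated interaction keys are deterministic.
    for a, b in combinations(sorted(row), 2):
        expanded[f"{a}*{b}"] = row[a] * row[b]
    return expanded


# Hypothetical binary indicators for one participant (assumed names).
example = {"age_over_65": 1, "female": 0, "prior_ssri": 1}
features = add_pairwise_interactions(example)
```

With three input features the expanded row holds six columns: the three originals plus one product per pair. A penalized model such as Lasso could then shrink uninformative interaction terms toward zero.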

Anticipated Findings

The findings of this project aim to predict the factors influencing response to antidepressants. By analyzing a diverse set of variables, including demographic, clinical, and psychological factors, we seek to identify which predictors are most significant in determining how individuals respond to antidepressant treatment. These insights will contribute to personalized treatment approaches, helping clinicians tailor antidepressant therapies to better meet the needs of their patients, particularly within minority populations.

Demographic Categories of Interest

  • Race / Ethnicity

Data Set Used

Registered Tier

Research Team

Owner:

Creutzfeldt-Jakob Disease (CJD) AI Model Development

Scientific Questions Being Studied

Primary scientific question: Can we develop an AI model using the All of Us dataset to improve early detection and diagnosis of Creutzfeldt-Jakob Disease (CJD)?

Importance:

  • CJD is often misdiagnosed or diagnosed late due to similarities with other dementias.
  • Early detection is crucial for patient care and potential treatments.
  • The diverse All of Us dataset can reveal subtle patterns and risk factors not apparent in smaller studies.

Research aims:

  • Identify early symptoms and risk factors across diverse populations.
  • Develop an AI model to enhance CJD detection.
  • Explore disparities in CJD diagnosis and progression among demographic groups.

This study could improve diagnostic tools, enable earlier interventions, and deepen understanding of CJD, benefiting patients and advancing neurodegenerative disease research. It also showcases the value of large-scale, diverse datasets in rare disease research.

Project Purpose(s)

  • Disease Focused Research (Creutzfeldt-Jakob Disease (CJD))
  • Methods Development
  • Control Set
  • Ancestry

Scientific Approaches

I will analyze the All of Us Registered Tier dataset to study CJD, focusing on EHR data, survey responses, and physical measurements. My approach includes:

  • Data analysis using the All of Us Researcher Workbench
  • Machine learning models to identify CJD patterns and risk factors
  • Natural language processing for clinical note analysis

Methods involve cohort identification, feature engineering, and model development. Tools include R, Python, SQL, and Jupyter Notebooks.

This approach aims to develop an AI model for improved CJD detection and diagnosis, potentially revealing new insights across diverse populations.

Anticipated Findings

Key anticipated findings for CJD from the All of Us dataset:

  • Early symptoms and risk factors
  • Novel biomarkers in health records
  • CJD progression across populations
  • Improved AI detection model

Contributions:

  • Enhanced early detection
  • Comprehensive view addressing disparities
  • New pathophysiology insights
  • Demonstrating value of large datasets
  • Potential prevention strategies

This research aims to advance CJD understanding and care using the diverse All of Us dataset, improving outcomes and guiding future neurodegenerative disease research.

Demographic Categories of Interest

  • Race / Ethnicity
  • Age

Data Set Used

Registered Tier

Research Team

Owner:

Sleep Outcomes and Geographical Locations

Scientific Questions Being Studied

The study will explore how geographical location and changes in daylight savings impact sleep patterns, using wearable Fitbit data. We aim to understand the relationship between sleep quality, duration, and disturbances with various regions and their climates, alongside examining the potential influence of illnesses through a PheWAS analysis. Investigating how daylight savings affects sleep across regions is also a central question. These insights are crucial for understanding how external environmental factors influence public health through sleep, a known critical component of overall well-being.

Project Purpose(s)

  • Population Health

Scientific Approaches

We will use de-identified Fitbit sleep data linked to participants' geographical locations and medical history, conducting a PheWAS to identify associations between sleep disorders and illnesses. The study will leverage geographic information systems (GIS) to analyze location-specific variables, such as altitude, latitude, and climate. Analytical tools like regression models and machine learning will be employed to examine correlations, while time-series analysis will help assess daylight savings effects. Our study will also account for confounders like age, gender, and lifestyle factors.

Anticipated Findings

We anticipate identifying significant relationships between sleep disturbances and specific geographical factors like altitude or latitude, alongside finding patterns in the impact of daylight savings. The PheWAS may reveal associations between sleep disturbances and chronic illnesses. These findings could advance our understanding of environmental and geographical influences on sleep health and inform public health interventions to improve sleep outcomes across diverse populations.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Collaborators:

  • Diego Mazzotti - Senior Researcher, University of Kansas Medical Center

bxu_all

Scientific Questions Being Studied

This project aims to explore whether physical activity measured by Fitbit is associated with general mental well-being, and what the potential moderating factors are. By understanding the relationship between physical activity and mental health, we can identify modifiable factors that promote well-being.

Project Purpose(s)

  • Social / Behavioral

Scientific Approaches

We will explore the following three sets of data: (1) Fitbit measures, (2) psychiatric diagnoses, and (3) genomic data. The first set of analyses will examine the relationships between Fitbit measures and psychiatric diagnoses. The second set will include genetic instruments, including polygenic scores, to see whether physical activity is a moderating factor or an independent contributor to mental well-being.

Anticipated Findings

We expect to see associations between levels of physical activity and the risk of psychiatric disorders. The findings can help the field understand the role of physical activity and whether it can serve as a modifying factor for public intervention.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Katherine Forthman - Project Personnel, Laureate Institute for Brain Research
  • Firas Naber - Project Personnel, Laureate Institute for Brain Research
  • Bohan Xu - Project Personnel, Laureate Institute for Brain Research

Collaborators:

  • Wenjie Zheng - Other, Laureate Institute for Brain Research

Duplicate of How to Work with All of Us Genomic Data (Hail - Plink)(v7)

Scientific Questions Being Studied

Not applicable - these notebooks demonstrate, by example, how to use Hail and PLINK to perform genome-wide association studies using the All of Us genomic and phenotypic data.

Project Purpose(s)

  • Other Purpose (Demonstrate to the All of Us Researcher Workbench users how to get started with the All of Us genomic data and tools. It includes an overview of all the All of Us genomic data and shows some simple examples on how to use these data.)

Scientific Approaches

Not applicable - these notebooks demonstrate, by example, how to use Hail and PLINK to perform genome-wide association studies using the All of Us genomic and phenotypic data.

Anticipated Findings

Not applicable - these notebooks demonstrate, by example, how to use Hail and PLINK to perform genome-wide association studies using the All of Us genomic and phenotypic data.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

HAP786 - TEAM B - Remove Algorithm Bias

Scientific Questions Being Studied

We are analyzing a dataset that examines the effects of antidepressants across different populations. The goal is to explore the data and perform analyses on both race-neutral and race-specific groups to identify significant outcomes. We will employ regression models to identify key risk factors associated with the use of antidepressants in these populations.

Project Purpose(s)

  • Educational

Scientific Approaches

We will be using a dataset that contains information on the use of antidepressants in various populations. The dataset likely includes demographic variables such as race, age, gender, and clinical variables like dosage, response rates, and potential side effects. The dataset will allow us to distinguish between race-neutral and race-specific subgroups, enabling a deeper analysis of how antidepressants affect different populations.
Race-neutral analysis: How do antidepressants impact the overall population, regardless of race, and what are the general risk factors associated with treatment outcomes?
Race-specific analysis: How do antidepressants affect specific racial groups, and are there unique risk factors or outcomes that differ by race?
Our goal is to highlight both general and race-specific risk factors, offering actionable findings that could inform clinical practice and enhance treatment outcomes for different demographic groups.

Anticipated Findings

We expect to find important factors that affect how well antidepressants work, both for everyone and for specific racial groups. A key goal of our project is to remove algorithm bias, ensuring that the findings are fair and accurate across all populations. By understanding these differences and eliminating bias, the study can help create more personalized and effective mental health care, improving how antidepressants are used and guiding better treatment decisions in the future.

Demographic Categories of Interest

  • Race / Ethnicity

Data Set Used

Registered Tier

Research Team

Owner:

Collaborators:

  • Venkata Siva Durga Ponugumatla - Graduate Trainee, George Mason University
  • NAGAGAYATHRI MADIRAJU - Graduate Trainee, George Mason University
  • Heidi Barrientos - Graduate Trainee, George Mason University

IschemicHeartDiseaseGenomics

Scientific Questions Being Studied

Our central hypothesis is that neurocognitive impairment and CAD share genetic and modifiable risk factors. We propose the following aims to further evaluate our hypothesis: 1) Assess modifiable risk factors across CAD, neurocognitive impairment, and patients sharing both phenotypes. 2) Evaluate the burden of bleeding, arrhythmic, and cardiovascular procedural events across these phenotypes. 3) To further evaluate the shared genetic underpinnings, perform a GWAS analysis across these phenotypes.

Project Purpose(s)

  • Disease Focused Research (cognitive disorder, ischemic heart disease)
  • Ancestry

Scientific Approaches

We plan to use univariate analysis to identify clinical, behavioral, and procedural risk factors associated with CAD and neurocognitive impairment. Further, we plan to perform subgroup analysis and compute polygenic risk scores across the phenotypes. Finally, we plan to perform GWAS analysis to identify pleiotropic loci associated with both phenotypes.

Anticipated Findings

Impaired cognition and coronary artery disease (CAD) are major public health concerns and share many risk factors. Cognitive impairment after CAD remains a matter of significant concern. This study aims to evaluate modifiable clinical risk factors and identify unique genetic risk factors shared among patients with impaired cognition and CAD. Identifying novel risk factors would help clinicians identify patients at risk of cognitive impairment after CAD or vice versa.

Demographic Categories of Interest

  • Race / Ethnicity
  • Disability Status
  • Access to Care
  • Education Level

Data Set Used

Controlled Tier

Research Team

Owner:

Collaborators:

  • Minoo Bagheri - Teacher/Instructor/Professor, Vanderbilt University Medical Center

Duplicate of probiotic and bc controlled

Scientific Questions Being Studied

explore risk factors for breast cancer

Project Purpose(s)

  • Disease Focused Research (female breast cancer)

Scientific Approaches

explore risk factors for breast cancer

Anticipated Findings

explore risk factors for breast cancer

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Yangbo Sun - Early Career Tenure-track Researcher, University of Tennessee Health Science Center, Memphis

Duplicate of Food Insecurity in Lung Cancer Survivors

Scientific Questions Being Studied

What is the prevalence of food insecurity in lung cancer survivors from the All of Us Research Program? Is food insecurity associated with inflammation and stress in lung cancer survivors, after controlling for stage of cancer, smoking status, demographics (e.g., income, education, race/ethnicity), and co-morbidities? We hypothesize that food insecurity is one factor associated with stress and inflammation in lung cancer survivors.

The researchers acknowledge that systemic racism/discrimination is a distal determinant of lung cancer and the intermediate factors, like food insecurity and smoking, are on the causal pathway. In this study, we focus on a highly modifiable, proximal determinant of lung cancer: food insecurity. Determining the association between food insecurity and c-reactive protein, neutrophil to lymphocyte ratio, and cortisol in lung cancer survivors will help to inform interventions that aim to reduce food insecurity and in turn, could reduce lung cancer disparities.

Project Purpose(s)

  • Disease Focused Research (lung cancer)
  • Social / Behavioral

Scientific Approaches

Aim 1a: Determine the prevalence of food insecurity in lung cancer survivors. The food insecurity prevalence rate for lung cancer survivors will be compared to the general U.S. population. Responses to two questions from the Social Determinants of Health questionnaire will be used to assess food insecurity.
Aim 1b: Determine differences in food insecurity by race/ethnicity and household income.

Aim 2: Determine the association between food insecurity and c-reactive protein, neutrophil to lymphocyte ratio, and cortisol in lung cancer survivors. C-reactive protein, neutrophil to lymphocyte ratio, and cortisol will be obtained from lung cancer survivors’ electronic health records.
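As a rough illustration of the quantities named in the aims above, the following Python sketch computes a prevalence and a neutrophil to lymphocyte ratio (NLR). The function names and the per-participant food-insecurity flags are hypothetical; the source does not say how the two questionnaire items are scored, so the flags are assumed to be derived upstream.

```python
def prevalence(flags):
    """Share of True values among non-missing flags, or None if all are missing."""
    known = [f for f in flags if f is not None]
    return sum(known) / len(known) if known else None


def neutrophil_lymphocyte_ratio(neutrophils, lymphocytes):
    """Return NLR, or None when a count is missing or the lymphocyte count is zero."""
    if neutrophils is None or not lymphocytes:
        return None
    return neutrophils / lymphocytes


# Hypothetical food-insecurity flags (None = questionnaire not answered).
food_insecure = [True, False, None, True]
rate = prevalence(food_insecure)

# Hypothetical absolute counts taken from the same blood draw.
nlr = neutrophil_lymphocyte_ratio(6.0, 2.0)
```

Missing values are excluded from the denominator rather than counted as food secure, which is one possible choice among several.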

Anticipated Findings

Aim 1: We hypothesize that food insecurity prevalence in lung cancer survivors from the All of Us research project is higher than that in the general United States population (10%). Based on prior literature, we hypothesize that food insecurity is higher in racial/ethnic minorities and in those from low household incomes.

Aim 2: We hypothesize that food insecurity is associated with c-reactive protein, neutrophil to lymphocyte ratio, and cortisol in lung cancer survivors, after controlling for stage of cancer, smoking status, demographic characteristics (e.g., individual-level SES, education, race/ethnicity), and co-morbidities.

Demographic Categories of Interest

  • Race / Ethnicity
  • Income Level

Data Set Used

Registered Tier

Research Team

Owner:

Complex Traits GWAS and Polygenic Scores (v7)_bella

Scientific Questions Being Studied

Genome-wide association studies (GWAS) have identified tens of thousands of genotype-phenotype associations for human complex traits. Polygenic risk score (PRS) for a trait is typically calculated as a weighted sum of trait-associated allele counts across numerous loci in the genome, where the weight is obtained from a corresponding GWAS. PRS is an effective tool to quantify the aggregated genetic propensity for a trait or disease. With rapid advances in GWAS sample size and statistical methodologies, PRS has shown substantially improved prediction accuracy and great potential in disease risk screening and precision medicine. The main goals of this project are 1) to run GWAS on numerous complex traits to identify and interpret genetic associations through integrative modeling of annotation data, and 2) to produce a set of PRS for hundreds of complex traits using newly released genomic data in AllofUs.
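The weighted-sum definition of a PRS given above can be written out as a short Python sketch. The variant identifiers and weights are made up for illustration, and this is not the PRS-CS method, which estimates the weights themselves.

```python
def polygenic_score(weights, allele_counts):
    """Weighted sum of allele counts; variants absent from the genotype are skipped."""
    return sum(w * allele_counts[v] for v, w in weights.items() if v in allele_counts)


# Hypothetical GWAS effect sizes per variant (assumed identifiers).
gwas_weights = {"rs0001": 0.5, "rs0002": -0.25, "rs0003": 1.0}
# Hypothetical allele counts (0, 1, or 2 copies) for one individual.
genotype = {"rs0001": 2, "rs0002": 1, "rs0003": 0}

score = polygenic_score(gwas_weights, genotype)
```

Here the score is 0.5 × 2 − 0.25 × 1 + 1.0 × 0 = 0.75.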

Project Purpose(s)

  • Social / Behavioral
  • Methods Development
  • Ancestry

Scientific Approaches

We will use software such as Hail, Regenie, and/or BOLT-LMM to run GWAS. We will implement a state-of-the-art method named PRS-CS to compute PRS for each GWAS trait. We will benchmark and optimize the performance of PRS models using a summary statistics-based cross-validation approach called PUMAS developed by our group (Zhao et al. Genome Biology 22(1), 2021). AllofUs genomic data will undergo rigorous quality control (QC) procedures, including removing variants with low sequencing depth and low variant calling quality.

Anticipated Findings

We will produce GWAS summary statistics for numerous complex traits and disorders. We will also produce PRS for all individuals with whole-genome sequencing (WGS) data in AllofUs. Every individual will have hundreds of scores quantifying their genetic propensity for a large collection of diseases and traits. These scores will be immediately applicable in future studies. For example, one planned future study is to integrate breast cancer PRS with electronic health record data in AllofUs to improve risk screening accuracy.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Bella Ren - Undergraduate Student, University of Wisconsin, Madison

Duplicate of Neighborhood Environment and Inflammation in Cancer

Scientific Questions Being Studied

Prior studies find associations between the neighborhood environment and poor health outcomes in cancer, including screening, diagnosis, treatment, and survival. The physical (e.g., disinvestment) and social (e.g., segregation) neighborhood environment could impact cancer-related health through several mechanisms, including stress and immune dysregulation, by promoting unhealthy behaviors, or through access to health care and resources. The aim of this research project is to determine:
1. What is the association between the neighborhood environment, including deprivation, disorder and social cohesion, and cancer diagnosis in All of Us participants?
2. Is the neighborhood environment associated with stress-evoked inflammation, measured via serum levels of c-reactive protein, neutrophil to lymphocyte ratio, and albumin in All of Us participants?

Project Purpose(s)

  • Disease Focused Research (cancer)
  • Social / Behavioral

Scientific Approaches

Using 3-digit zip code, the area deprivation index, and the social determinants of health questionnaire questions on neighborhood disorder and social cohesion, we will investigate the association between neighborhood environment and cancer diagnosis, including lung, breast, prostate and colorectal cancer diagnosis. We will conduct a sub-analysis of cancer diagnosis occurring after 2018, year of initiation of the SDOH questionnaire, to determine if there is a temporal relationship between the neighborhood environment and cancer diagnosis. We will extract data on serum levels of c-reactive protein, neutrophils, lymphocytes, albumin from electronic health records to correspond to values drawn within 1 year of completion of the social determinants of health questionnaire. We will use standard descriptive statistics, logistic regression, and machine learning to determine neighborhood predictors of cancer diagnosis.
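The step of matching lab values to the questionnaire date could look like the Python sketch below. It assumes "within 1 year" means up to 365 days before or after completion, which the description does not specify; the record layout and field names are hypothetical.

```python
from datetime import date, timedelta


def labs_within_window(labs, survey_date, days=365):
    """Keep lab records dated within `days` before or after the survey date."""
    window = timedelta(days=days)
    return [lab for lab in labs if abs(lab["date"] - survey_date) <= window]


# Hypothetical questionnaire completion date and EHR lab records.
survey = date(2019, 6, 1)
labs = [
    {"test": "CRP", "date": date(2019, 1, 15), "value": 3.1},
    {"test": "CRP", "date": date(2021, 3, 2), "value": 5.4},
    {"test": "albumin", "date": date(2020, 5, 20), "value": 4.0},
]
kept = labs_within_window(labs, survey)
```

Only the first and third records fall inside the window in this example; the 2021 value is dropped.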

Anticipated Findings

We hypothesize that neighborhood environment, including more deprivation, more neighborhood disorder and less social cohesion, is associated with having a cancer diagnosis. We hypothesize that individuals from more deprived neighborhood environments will have higher inflammation as measured by CRP, NLR, and albumin.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

SLMM_software_test

Scientific Questions Being Studied

Historically, researchers responded to limitations in genomic data sharing policy and practice by conducting meta analysis on summary outputs from isolated genomic datasets. Recent work has demonstrated the increased power of individual-level genetic analysis on pooled datasets. In addition, advancements in data access and sharing policies coupled with technological advancements in cloud-based environments for data access and analysis have opened up new possibilities for pooled analysis of large-scale genomic datasets. The NIH All of Us Research Program and UK Biobank are two leading examples of large, population scale studies which combine genomic data with deep phenotypic health data. There is a grand opportunity to demonstrate how the world’s largest research-ready biomedical datasets can create more value together and advance discovery in genome science.

Project Purpose(s)

  • Other Purpose (This is a demonstration project meant to support research with All of Us genomic data. Please see https://www.biorxiv.org/content/10.1101/2022.11.29.518423)

Scientific Approaches

The primary goal of this project is to demonstrate the potential of the All of Us Researcher Workbench for pooled analyses of All of Us and UK Biobank data. Specifically, we aim to: 1. Develop and describe an approved, secure path for connecting UK Biobank data to the All of Us Researcher Workbench. 2. Conduct a genome-wide association study of blood lipids on the pooled dataset aimed at demonstrating that biomedical researchers can be more productive when permitted to analyze the union of the cohorts, as opposed to computing aggregate results in separate data silos for each cohort and then combining those aggregates.

Anticipated Findings

The secondary goal of this project is to demonstrate and measure the experience when the same analyses are repeated in a siloed manner. Specifically we aim to: 3. Repeat the previously described genome-wide association study on the All of Us Researcher Workbench when working with the All of Us data and on UK Biobank’s DNAnexus when working with the UK Biobank data. 4. Conduct a meta analysis on the aggregate results for each cohort (in accordance with each program’s data use policies) and compare the result of combining those aggregates to the results from the pooled analysis. Evaluate not only differences in results, but also differences in analysis cost and analyst productivity.
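For the step of combining per-cohort aggregate results, one common choice is a fixed-effect inverse-variance meta-analysis. The project description does not say which method is used, so the Python sketch below is an assumption, and the effect sizes are invented.

```python
import math


def inverse_variance_meta(estimates):
    """Combine (beta, standard_error) pairs into one fixed-effect estimate."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    return beta, math.sqrt(1.0 / total)


# Hypothetical per-cohort (beta, SE) for a single variant.
cohort_results = [(0.10, 0.02), (0.14, 0.02)]
meta_beta, meta_se = inverse_variance_meta(cohort_results)
```

With equal standard errors the combined effect is the simple average (0.12), and the combined standard error shrinks by a factor of √2. These are the kind of aggregate values that would then be compared against the pooled analysis.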

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Daxin Wu - Graduate Trainee, University of Toronto

Duplicate of How to Work with All of Us Genomic Data (Hail - Plink)(v7)

Scientific Questions Being Studied

Not applicable - these notebooks demonstrate, by example, how to use Hail and PLINK to perform genome-wide association studies using the All of Us genomic and phenotypic data.

Project Purpose(s)

  • Other Purpose (Demonstrate to the All of Us Researcher Workbench users how to get started with the All of Us genomic data and tools. It includes an overview of all the All of Us genomic data and shows some simple examples on how to use these data.)

Scientific Approaches

Not applicable - these notebooks demonstrate, by example, how to use Hail and PLINK to perform genome-wide association studies using the All of Us genomic and phenotypic data.

Anticipated Findings

Not applicable - these notebooks demonstrate, by example, how to use Hail and PLINK to perform genome-wide association studies using the All of Us genomic and phenotypic data.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Jaylin Knight - Graduate Trainee, North Carolina State University

Predictive Modeling for Antidepressant Efficacy in African American Populations

Scientific Questions Being Studied

What are the key predictors of antidepressant response among African American participants in the All of Us database, and how do interactions between these predictors influence treatment outcomes? Additionally, how can we effectively address missing values to enhance the robustness of our predictive models, and what factors contribute to the discontinuation of antidepressants?
Understanding these aspects is crucial for advancing personalized medicine, improving treatment outcomes, and reducing health disparities. Identifying predictors and interactions can lead to more effective and individualized antidepressant therapies, while addressing missing data ensures accurate and reliable models. Insights into discontinuation factors can also inform strategies to improve treatment adherence and success.

Project Purpose(s)

  • Educational

Scientific Approaches

Descriptive Analysis:
Goal: Summarize key details about the study population, including antidepressant use and discontinuation.
Tools: Python (Pandas, NumPy), SQL.

Regression Modeling:
Goal: Identify which factors predict how well antidepressants work and explore interactions between these factors.
Tools: Python, R.

Imputation of Missing Values:
Goal: Handle missing data to ensure accurate model results.
Approach: Use techniques like mean imputation, KNN, or multiple imputation.
Tools: Python, R.

LASSO Regression:
Goal: Simplify the model by selecting the most important predictors.
Tools: Python (Scikit-learn), R.

Tools and Software:
Python: For data processing, modeling, and analysis.
R: For detailed statistical analysis and visualization.
SQL: For extracting data.
Tableau: For visualizing and presenting results.
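Of the imputation techniques listed above, mean imputation is the simplest to show. The Python sketch below is illustrative only; the column values are hypothetical, and KNN or multiple imputation would need a dedicated library.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to estimate the mean from, so return the column unchanged.
        return list(values)
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical age column with two missing entries.
ages = [30, None, 50, 40, None]
imputed = impute_mean(ages)
```

Mean imputation keeps the column mean unchanged but understates its variance, which is one reason KNN or multiple imputation may be preferred for modeling.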

Anticipated Findings

Anticipated findings from the study include identifying key predictors of antidepressant response, understanding how interactions between these predictors influence treatment outcomes, and revealing factors leading to antidepressant discontinuation. The study is expected to highlight specific demographic, medical, and treatment-related variables that impact efficacy. By addressing missing data effectively, it will enhance model accuracy. These findings will contribute to the body of scientific knowledge by refining personalized treatment strategies, improving understanding of antidepressant responses in African American populations, and advancing methods for handling missing data in health research. This can lead to better-targeted interventions and reduced health disparities in mental health care.

Demographic Categories of Interest

  • Race / Ethnicity

Data Set Used

Registered Tier

Research Team

Owner:

Stroke Phenotype Association Study

The primary scientific question of this study is: How do genetic markers associated with stroke differ between Black/African descended populations and white populations, and how do social determinants of health modulate these associations? This question is crucial because stroke disproportionately…

Scientific Questions Being Studied

The primary scientific question of this study is: How do genetic markers associated with stroke differ between Black/African descended populations and white populations, and how do social determinants of health modulate these associations? This question is crucial because stroke disproportionately affects Black/African-descended populations, contributing to significant health disparities. Understanding the genetic underpinnings of stroke in conjunction with social determinants of health can provide insights into the multifactorial nature of this disease. This research is relevant to public health as it aims to identify both biological and social risk factors that contribute to stroke, with the ultimate goal of informing targeted interventions to reduce these disparities.

Project Purpose(s)

  • Disease Focused Research (Stroke)
  • Population Health
  • Ancestry

Scientific Approaches

To address this question, I will utilize the All of Us dataset, focusing on the controlled tier that includes both genomic and demographic information. The study will involve a genotype-phenotype association analysis to identify genetic variants associated with stroke across different populations. I will stratify the analysis by race/ethnicity, specifically comparing Black/African descended populations with white populations. Additionally, I will incorporate social determinants of health, such as socioeconomic status, access to healthcare, and neighborhood characteristics, as covariates in the analysis. Statistical methods such as logistic regression, GWAS, and polygenic risk scores will be employed to assess the interactions between genetic markers and social factors. Data will be processed using bioinformatics tools for variant calling, imputation, and annotation, followed by statistical analysis using R or Python.

Anticipated Findings

It is anticipated that this study will reveal distinct genetic markers associated with stroke in Black/African-descended and White populations, potentially identifying novel variants that are specific to or more prevalent in these populations. Furthermore, the study is expected to highlight how social determinants of health modify the relationship between these genetic markers and stroke risk. The findings will contribute to the growing body of literature on the genetic basis of stroke, particularly in diverse populations, and provide evidence on the role of social factors in exacerbating or mitigating genetic risk. This research will also provide a framework for integrating genetic data with social determinants of health in studies addressing other complex diseases. Ultimately, this research could inform the development of precision medicine approaches and public health strategies aimed at reducing stroke disparities and improving health outcomes in marginalized communities.

Demographic Categories of Interest

  • Race / Ethnicity
  • Gender Identity
  • Sexual Orientation
  • Geography
  • Access to Care
  • Education Level
  • Income Level

Data Set Used

Controlled Tier

Research Team

Owner:

  • Carter Clinton - Early Career Tenure-track Researcher, North Carolina State University

Collaborators:

  • Jaylin Knight - Graduate Trainee, North Carolina State University

Long Reads TRs PreProd Transition

We want to profile the distribution of tandem repeat sizes in the All of Us PacBio long read genomes. This will serve as a control cohort for analyzing and identifying pathogenic repeat expansions in patients with rare diseases.

Scientific Questions Being Studied

We want to profile the distribution of tandem repeat sizes in the All of Us PacBio long read genomes. This will serve as a control cohort for analyzing and identifying pathogenic repeat expansions in patients with rare diseases.

Project Purpose(s)

  • Methods Development
  • Ancestry

Scientific Approaches

We will run the tool 'TRGT' on the 1,027 aligned PacBio long-read samples. This should produce an estimate of the size of each of the two alleles for each tandem repeat locus that we specify for analysis.

Anticipated Findings

This will generate a first-of-its-kind dataset of allele sizes at tandem repeat loci, which will be a valuable reference as others in the field attempt to interpret whether a given repeat expansion is pathogenic at a particular size seen in patients.

Demographic Categories of Interest

  • Race / Ethnicity

Data Set Used

Controlled Tier

Research Team

Owner:

  • Matt Danzi - Research Fellow, University of Miami

Collaborators:

  • Sarah Fazal - Research Fellow, University of Miami

Duplicate of Genomics Undergrad Lesson Plan Exemplar v7

This workspace is an example used to train new users of the AoU researcher workbench on how to design and execute projects using R and Python in Jupyter Notebook.

Scientific Questions Being Studied

This workspace is an example used to train new users of the AoU researcher workbench on how to design and execute projects using R and Python in Jupyter Notebook.

Project Purpose(s)

  • Educational

Scientific Approaches

We plan on incorporating CURE best practices into these resources and designing them to be used in both independent and course-based research environments.

Anticipated Findings

This project will have a far-reaching impact by making it easier for student researchers across the United States to design and implement research projects using the All of Us data resource.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Collaborators:

  • Glorie Ofoche - Undergraduate Student, Towson University
  • Ebenezer Ajisafe - Undergraduate Student, Towson University

Duplicate of Duplicate of POAG

Our specific scientific question is to understand the genetic determinants of glaucoma and their implications for disease risk prediction and management. We aim to identify genetic variants associated with glaucoma susceptibility through genome-wide association studies using whole-genome sequencing (WGS) data.…

Scientific Questions Being Studied

Our specific scientific question is to understand the genetic determinants of glaucoma and their implications for disease risk prediction and management.
We aim to identify genetic variants associated with glaucoma susceptibility through genome-wide association studies using whole-genome sequencing (WGS) data. The importance of these questions lies in their potential to advance our understanding of glaucoma genetics, improve risk prediction, and inform personalized approaches to disease management.

Project Purpose(s)

  • Disease Focused Research (glaucoma)
  • Methods Development
  • Ancestry

Scientific Approaches

We will use whole genome sequencing data.
We will first run sample-level and variant-level QC on the WGS data using Hail and bcftools.
After QC, Regenie will be used to run GWAS to explore genotype-phenotype associations in glaucoma by correlating identified genetic variants with clinical phenotypes. Furthermore, we plan to develop polygenic risk scores (PRS) for glaucoma using PLINK.

Anticipated Findings

We anticipate identifying population-specific genetic risk factors and novel genetic variants for glaucoma within diverse populations represented in the All of Us cohort. Moreover, we plan to identify the factors that are strongly correlated with a high genetic risk burden. These findings can shed light on the biological pathways involved in glaucoma development and provide potential targets for therapeutic intervention. This knowledge can inform culturally sensitive interventions and address disparities in glaucoma care across diverse populations.

Demographic Categories of Interest

  • Race / Ethnicity
  • Age
  • Geography

Data Set Used

Controlled Tier

Research Team

Owner:

  • LIYIN CHEN - Research Fellow, Mass General Brigham

Collaborators:

  • Kirill Zaslavsky - Research Fellow, Mass General Brigham
  • Chloe Park - Project Personnel, Mass General Brigham

Xiaoyu's workspace

We are going to generate prediction models using an HIV patient cohort. One purpose is to generate a classification model to determine if a person might have HIV according to their EHR.

Scientific Questions Being Studied

We are going to generate prediction models using an HIV patient cohort. One purpose is to generate a classification model to determine if a person might have HIV according to their EHR.

Project Purpose(s)

  • Disease Focused Research (Human immunodeficiency virus infectious disease)

Scientific Approaches

Machine learning modeling, stratified sampling based model inference.

The case definition includes individuals aged 18-29 who have HIV diagnosis ICD codes or relevant personal medical history (“Personal Medical History” survey: “Are you currently prescribed medications and/or receiving treatment for HIV/AIDS?”), together with HIV-related lab tests or drug exposure, excluding those on pre-exposure prophylaxis (PrEP).
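
The case definition above can be read as a boolean rule. The sketch below encodes one possible reading: (ICD code or survey response) and (lab test or drug exposure), restricted to ages 18-29 and excluding PrEP. That grouping is an assumption, since the description does not spell out the precedence, and the `Participant` record with its field names is hypothetical rather than an actual Workbench schema.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical flattened record; field names are illustrative only."""
    age: int
    has_hiv_icd_code: bool
    survey_reports_hiv_treatment: bool
    has_hiv_lab_test: bool
    has_hiv_drug_exposure: bool
    on_prep: bool


def is_case(p: Participant) -> bool:
    """Apply the assumed reading of the case definition to one participant."""
    in_age_range = 18 <= p.age <= 29
    # Diagnosis evidence: ICD codes or the survey response.
    diagnosis_evidence = p.has_hiv_icd_code or p.survey_reports_hiv_treatment
    # Supporting evidence: HIV-related lab tests or drug exposure.
    supporting_evidence = p.has_hiv_lab_test or p.has_hiv_drug_exposure
    return (in_age_range and diagnosis_evidence
            and supporting_evidence and not p.on_prep)


# Illustrative participants, not real data.
example_case = Participant(25, True, False, True, False, False)
example_prep = Participant(25, True, False, False, True, True)
print(is_case(example_case), is_case(example_prep))
```

Keeping each criterion in a named variable makes it easy to adjust the rule if the intended grouping differs from the one assumed here.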

Anticipated Findings

We anticipate building a generalised model for HIV diagnosis-related outcomes to support HIV classification, care management, and viral load detection.

Demographic Categories of Interest

  • Education Level
  • Income Level

Data Set Used

Registered Tier

Research Team

Owner:

  • Xiaoyu Wang - Research Assistant, Florida State University

Collaborators:

  • Balu Bhasuran - Research Fellow, Florida State University

Complex Traits GWAS and Polygenic Scores (v7)

Genome-wide association studies (GWAS) have identified tens of thousands of genotype-phenotype associations for human complex traits. Polygenic risk score (PRS) for a trait is typically calculated as a weighted sum of trait-associated allele counts across numerous loci in the genome,…

Scientific Questions Being Studied

Genome-wide association studies (GWAS) have identified tens of thousands of genotype-phenotype associations for human complex traits. Polygenic risk score (PRS) for a trait is typically calculated as a weighted sum of trait-associated allele counts across numerous loci in the genome, where the weight is obtained from a corresponding GWAS. PRS is an effective tool to quantify the aggregated genetic propensity for a trait or disease. With rapid advances in GWAS sample size and statistical methodologies, PRS has shown substantially improved prediction accuracy and great potential in disease risk screening and precision medicine. The main goals of this project are 1) to run GWAS on numerous complex traits to identify and interpret genetic associations through integrative modeling of annotation data, and 2) to produce a set of PRS for hundreds of complex traits using newly released genomic data in AllofUs.
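
The weighted-sum definition of a PRS given above can be sketched in a few lines. This is a minimal illustration for a single individual, assuming allele counts coded as 0, 1, or 2 per locus; the function name `polygenic_score` and the example weights are hypothetical, and this is not the PRS-CS method the project plans to use.

```python
from typing import Sequence


def polygenic_score(allele_counts: Sequence[int],
                    weights: Sequence[float]) -> float:
    """Weighted sum of trait-associated allele counts across loci."""
    if len(allele_counts) != len(weights):
        raise ValueError("one weight is required per locus")
    if any(count not in (0, 1, 2) for count in allele_counts):
        raise ValueError("allele counts are assumed to be 0, 1, or 2")
    return sum(w * x for w, x in zip(weights, allele_counts))


# Illustrative GWAS effect sizes and genotype for three loci (made-up values).
weights = [0.5, -0.25, 1.0]
counts = [2, 1, 0]
score = polygenic_score(counts, weights)
print(score)
```

In practice the weights come from GWAS summary statistics, and the sum runs over many thousands of loci rather than three.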

Project Purpose(s)

  • Social / Behavioral
  • Methods Development
  • Ancestry

Scientific Approaches

We will use software such as Hail, Regenie, and/or BOLT-LMM to run GWAS. We will implement a state-of-the-art method named PRS-CS to compute PRS for each GWAS trait. We will benchmark and optimize the performance of PRS models using a summary statistics-based cross-validation approach called PUMAS, developed by our group (Zhao et al. Genome Biology 22(1), 2021). AllofUs genomic data will undergo rigorous quality control (QC) procedures, including removing variants with low sequencing depth and low variant calling quality.

Anticipated Findings

We will produce GWAS summary statistics for numerous complex traits and disorders. We will also produce PRS for all individuals with whole-genome sequencing (WGS) data in AllofUs. Every individual will have hundreds of scores quantifying their genetic propensity for a large collection of diseases and traits. These scores will be immediately applicable in future studies. For example, one planned future study is to integrate breast cancer PRS with electronic health record data in AllofUs to improve risk screening accuracy.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Yuchang Wu - Research Fellow, University of Wisconsin, Madison

Collaborators:

  • Stephen Dorn - Graduate Trainee, University of Wisconsin, Madison
  • Zijie Zhao - Graduate Trainee, University of Wisconsin, Madison
  • Zhongxuan Sun - Undergraduate Student, University of Wisconsin, Madison
  • Qiongshi Lu - Early Career Tenure-track Researcher, University of Wisconsin, Madison
  • Meiyi Yan - Graduate Trainee, University of Wisconsin, Madison
  • Longjin Che - Undergraduate Student, University of Wisconsin, Madison
  • Jonathan Haugstad - Undergraduate Student, University of Wisconsin, Madison
  • Inês Dutra - Late Career Tenured Researcher, University of Wisconsin, Madison
  • Bella Ren - Undergraduate Student, University of Wisconsin, Madison
  • Aubrey Barnard - Research Fellow, University of Wisconsin, Madison

Cancer Research

Is there a connection between breast cancer and melanoma? If so, is the correlation in the genome rather than due to environmental factors?

Scientific Questions Being Studied

Is there a connection between breast cancer and melanoma? If so, is the correlation in the genome rather than due to environmental factors?

Project Purpose(s)

  • Other Purpose (want to explore the available data for potential interest in Breast Cancer and Melanoma research.)

Scientific Approaches

We plan to look at electronic health data and genomic data.

Anticipated Findings

Determine whether there is a correlation between breast cancer and melanoma in specific areas of the genome.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Duplicate of Workshop: Intro to All of Us Genomics Data

This workspace is meant to help researchers get familiar with the All of Us Researcher Workbench. There are five hands-on exercises during the workshop, each with a specific notebook. Exercise 1: Duplicate the workspace & start the cloud environment Exercise…

Scientific Questions Being Studied

This workspace is meant to help researchers get familiar with the All of Us Researcher Workbench. There are five hands-on exercises during the workshop, each with a specific notebook.
Exercise 1: Duplicate the workspace & start the cloud environment
Exercise 2: Looking at the genomic data (notebook)
Exercise 3: GWAS - extracting phenotypic data (notebook)
Exercise 4: GWAS - running Hail GWAS (notebook)
Exercise 5: Advanced GWAS (2 notebooks)

By running the exercises in this workspace, researchers will become more familiar with the genomic data, know how to access the genomic data, see how the genomic data and tools can be used in the Researcher Workbench, and be able to start their own genomic data project.

Project Purpose(s)

  • Other Purpose (This workspace is meant for use during the Introduction to Analyzing All of Us Genomic Data workshop. In this workshop, participants will get hands-on experience using the genomics data by running a genome-wide association study (GWAS) using Hail.)

Scientific Approaches

We are using the All of Us dataset in order to run a genome-wide association study (GWAS) using Hail. In the workshop, we will give an introduction to the All of Us Researcher Workbench and demonstrate how to use the Cohort Builder and Jupyter Notebooks to set up a research project. Using Jupyter notebooks, we will create a dataset linking the All of Us phenotypic data to the short read whole genome sequencing (srWGS) data. After running the GWAS steps using Hail, we will visualize the results.

Anticipated Findings

This study is running a genome-wide association study (GWAS) using Hail, using height as the selected phenotypic data. We do not anticipate findings from this example workspace but we expect that workshop participants will be able to apply similar methods to their future research.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

  • Jennifer Cohen - Early Career Tenure-track Researcher, Duke University

Practice

This will be used for educational purposes for school and learning. I will be editing my own cohorts and datasets.

Scientific Questions Being Studied

This will be used for educational purposes for school and learning. I will be editing my own cohorts and datasets.

Project Purpose(s)

  • Educational

Scientific Approaches

This will be used for educational purposes for school and learning. I will be editing my own cohorts and datasets.

Anticipated Findings

This will be used for educational purposes for school and learning. I will be editing my own cohorts and datasets.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Practice

We are looking at how the presence of certain genes, and possibly environmental factors, affects the presence of certain birth defects.

Scientific Questions Being Studied

We are looking at how the presence of certain genes, and possibly environmental factors, affects the presence of certain birth defects.

Project Purpose(s)

  • Disease Focused Research (birth defects)
  • Ancestry

Scientific Approaches

We plan to look at datasets with the genes of healthy pregnant populations and pregnancies that resulted in birth defects and see if there are certain genes that set the two apart. The same will be done looking at other environmental factors.

Anticipated Findings

We hope to determine whether specific genes and/or environmental elements are connected to the presence of birth defects.

Demographic Categories of Interest

This study will not center on underrepresented populations.

Data Set Used

Controlled Tier

Research Team

Owner:

Collaborators:

  • Priscille Ndzomo - Undergraduate Student, Towson University
Request a Review of this Research Project

You can request that the All of Us Resource Access Board (RAB) review a research purpose description if you have concerns that this research project may stigmatize All of Us participants or violate the Data User Code of Conduct in some other way. To request a review, you must fill in a form, which you can access by selecting ‘request a review’ below.