Master's Thesis: Modeling SNP-Conditioned Cellular Phenotypes via Flow Matching in the labs of Prof. Fabian Theis and Dr. Matthias Heinig
04.09.2025, Theses, Bachelor's and Master's Theses
This Master's thesis aims to extend CellFlow to model how SNP-based genetic variation influences single-cell phenotypes and responses to perturbations. It integrates single-cell data with population genomics to predict genotype-dependent effects, focusing on epistasis and disease relevance. Candidates with machine learning and bioinformatics skills are encouraged to apply via email.
Context
Advances in high-throughput single-cell profiling and large-scale phenotypic screens have unlocked unprecedented insights into cellular heterogeneity and the effects of experimental perturbations. Tools like CellFlow (Klein et al. 2025), a generative framework based on flow matching, have demonstrated powerful capabilities in predicting cellular responses to perturbations such as drug treatments, gene knockouts, or cytokine stimulations.
In parallel, population-scale genomics projects such as OneK1K (Yazar et al. 2022), which combine genetic variation (e.g., SNPs) with single-cell RNA-seq data, provide a rich substrate for exploring how genetic variation modulates cellular phenotypes. This thesis will integrate these two directions, leveraging CellFlow to model how SNP-defined genetic backgrounds influence cellular responses to perturbations, with a particular focus on QTL-associated variants and epistatic interactions.
Thesis Goals and Research Questions
The primary objective of this thesis is to extend CellFlow to incorporate individual-specific genetic variation, particularly common SNPs, as conditioning variables. This enables predictive modeling of single-cell phenotypes under both genetic and experimental perturbation contexts, with applications ranging from fundamental questions in gene regulation to genetic risk prediction and genotype-informed response prediction. A longer-term goal is to position this framework as a tool for modeling cellular phenotypes relevant to disease predisposition, therapeutic response, and inter-individual variability. Specific questions include:
- Can we model epistatic interactions between SNPs in terms of their effect on phenotypic response trajectories?
- Can we identify causal variants and study gene-environment interactions?
- How well can CellFlow predict phenotypes for unseen SNP combinations or rare alleles, particularly under perturbation conditions?
- Can SNP-conditioned models reveal genotype-dependent variation in cellular responses that align with known or hypothesized disease mechanisms?
- To what extent can SNP-conditioned models be generalized or transferred across genetically diverse populations or cell types?
Methodology and Scope
- Adapt the CellFlow architecture to ingest SNP-based sample covariates, including QTLs and epistatic variant pairs.
- Represent genotypes with informative encodings, e.g., one-hot encoding, gene embeddings derived from ESM3 (Hayes et al. 2025), SNP embeddings derived from DNA language models (Nguyen et al. 2023; Hingerl et al. 2025), or known functional annotations (Zheng et al. 2024).
- Evaluate on datasets such as OneK1K, eQTL Consortium, or sc-eQTLGen.
- Compare against baseline models, such as:
  - Conditional VAEs
  - Epistasis prediction models
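As a rough illustration of the simplest encoding mentioned above, the following sketch builds a per-donor condition vector from one-hot encoded SNP genotypes plus a multiplicative term for each candidate epistatic pair. It is a minimal sketch under stated assumptions, not part of CellFlow: the function names, the 0/1/2 alternate-allele dosage coding, and the product-based interaction term are all hypothetical choices made for this example.

```python
import numpy as np


def one_hot_genotypes(dosages):
    """One-hot encode alternate-allele dosages (assumed coding: 0, 1, 2).

    dosages: integer array of shape (n_donors, n_snps).
    Returns a float32 array of shape (n_donors, n_snps * 3).
    """
    dosages = np.asarray(dosages)
    if dosages.min() < 0 or dosages.max() > 2:
        raise ValueError("dosages must be 0, 1, or 2")
    one_hot = np.eye(3, dtype=np.float32)[dosages]  # (n_donors, n_snps, 3)
    return one_hot.reshape(dosages.shape[0], -1)


def pairwise_interaction_features(dosages, pairs):
    """Dosage product per SNP pair (an assumed, deliberately simple epistasis term)."""
    dosages = np.asarray(dosages, dtype=np.float32)
    idx_a = [a for a, _ in pairs]
    idx_b = [b for _, b in pairs]
    return dosages[:, idx_a] * dosages[:, idx_b]


def build_condition_vectors(dosages, pairs):
    """Concatenate one-hot genotypes and pair terms into one vector per donor."""
    return np.concatenate(
        [one_hot_genotypes(dosages), pairwise_interaction_features(dosages, pairs)],
        axis=1,
    )


# Toy example: 2 donors, 3 SNPs, 2 hypothetical candidate SNP pairs.
toy_dosages = np.array([[0, 1, 2], [2, 2, 0]])
toy_pairs = [(0, 1), (1, 2)]
condition = build_condition_vectors(toy_dosages, toy_pairs)
print(condition.shape)
```

Each row of `condition` could then serve as a sample-level covariate for a conditional generative model. Learned embeddings from protein or DNA language models would replace the one-hot block while keeping the same overall interface.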
Candidate Profile
- Strong background in machine learning, computational biology, or bioinformatics
- Familiarity with deep generative models
- Experience with single-cell RNA-seq data and basic statistical genetics
- Programming skills in Python, PyTorch or JAX, and data science tools
How to Apply
Send an email to matthias.heinig@tum.de, dominik.klein@helmholtz-munich.de, and lucas.arnoldt@tum.de with the following information:
- Your CV
- A brief introduction outlining your background and motivation
- Your preferred start date
- Academic transcripts
We look forward to receiving your email!
References
- Klein et al. (2025). CellFlow enables generative single-cell phenotype modeling with flow matching. bioRxiv; DOI: 10.1101/2025.04.11.648220
- Yazar et al. (2022). Single-cell eQTL mapping identifies cell type–specific genetic control of autoimmune disease. Science, 376 (6589); DOI: 10.1126/science.abf3041
- Hayes et al. (2025). Simulating 500 million years of evolution with a language model. Science, 387 (6736); DOI: 10.1126/science.ads0018
- Nguyen et al. (2023). HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution. arXiv; DOI: 10.48550/arXiv.2306.15794
- Hingerl et al. (2025). scooby: Modeling multi-modal genomic profiles from DNA sequence at single-cell resolution. bioRxiv; DOI: 10.1101/2024.09.19.613754
- Zheng et al. (2024). Leveraging functional genomic annotations and genome coverage to improve polygenic prediction of complex traits within and between ancestries. Nature Genetics, 56; DOI: 10.1038/s41588-024-01704-y
Contact: lucas.arnoldt@tum.de
More information
https://www.helmholtz-munich.de/en/icb/research-groups/theis-lab