Mathematical Biology @ Penn

All talks will take place on Mondays 4‐5 PM in 318 Carolyn Lynch Lab, unless stated otherwise.

NOTE: If you wish to receive announcements regarding this seminar series, please subscribe to the mailing list from this link.

Spring 2017

Date Speaker & Title
Jan 23 Adam Siepel (Cold Spring Harbor Laboratory)
Fast, scalable prediction of deleterious noncoding variants from genomic data

Across many species, a large fraction of genetic variants that influence phenotypes of interest is located outside of protein-coding genes, yet existing methods for identifying such variants have poor predictive power. Here, we introduce a new computational method, called LINSIGHT, that substantially improves the prediction of noncoding nucleotide sites at which mutations are likely to have deleterious fitness consequences, and which therefore are likely to be phenotypically important. LINSIGHT combines a simple neural network for functional genomic data with a probabilistic model of molecular evolution. The method is fast and highly scalable, enabling it to exploit the “Big Data” available in modern genomics. It can be fitted to data by maximum likelihood using an online stochastic gradient ascent algorithm, with gradients computed efficiently by back-propagation. We show that LINSIGHT outperforms the best available methods in identifying human noncoding variants associated with inherited diseases. In addition, we apply LINSIGHT to an atlas of human enhancers and show that the fitness consequences at enhancers depend on cell type, tissue specificity, and constraints at associated promoters.

Jan 30 Aaditya Rangan (NYU Courant)
Covariate-corrected biclustering methods for gene-expression and GWAS data

A common goal in data-analysis is to sift through a large matrix and detect any significant submatrices (i.e., biclusters) that have a low numerical rank. To give an example from genomics, one might imagine a data-matrix involving several genetic-measurements taken across many patients. In this context a ‘bicluster’ would correspond to a subset of genetic-measurements that are correlated across a subset of the patients. While some biclusters might extend across most (or all) of the patients, it is also possible for biclusters to involve only a small subset of patients. Detecting biclusters such as these provides a first step towards unraveling the physiological mechanisms underlying the heterogeneity within a patient population.

We present a simple algorithm for tackling this biclustering problem - i.e., for detecting low-rank submatrices from within a much larger data-matrix. An important feature of our method is that it can easily be modified to account for many considerations which commonly arise in practice. For example, our algorithm can be used to find biclusters that manifest only within a ‘case’-population without manifesting within a ‘control’-population. Moreover, our algorithm can correct for categorical- and continuous-covariates, as well as sparsity within the data. We illustrate these practical features with two examples; the first drawn from gene-expression analysis and the second drawn from a much larger genome-wide-association-study (GWAS).

Feb 6 Champak Reddy (CUNY Graduate Center)
A flexible inference of complex population histories and recombination from multiple genomes

Analyzing whole genome sequences provides an unprecedented resolution of the historical demography of populations. In the process, most inferential methods either ignore or simplify the confounding effects of recombination and population history on the observed polymorphism. Going further, we build upon an existing analytic approach that partitions the genome into blocks of equal (and arbitrary) size and summarizes the polymorphism and linkage information as blockwise counts of SFS types (bSFS). We introduce a novel composite likelihood framework, using the bSFS, that jointly models demography and recombination and is explicitly designed to scale up to multiple whole genome sequences. The flexible nature of our method further allows for arbitrarily complex population histories using unphased and unpolarized whole genome sequences. We review the demographic history of the two known orangutan species for the first time using multiple genome sequences (over 160 Mbp in length) from each population. Our results indicate that the orangutan species diverged approximately 650-950 thousand years ago. After speciation, secondary contact modelled as pulse admixture (∼300,000 years ago) is shown to have better support than continuous gene flow, which corresponds to dispersal opportunity coupled with the periodic sea-level changes in South East Asia.

Feb 13 Davorka Gulisija (University of Pennsylvania)
Phenotypic plasticity promotes balanced polymorphism and recombination modification in periodic environments

Phenotypic plasticity is known to arise in varying habitats where it diminishes harmful environmental effects. How plasticity shapes genetic architecture of traits under varying selection is unknown. Using an analytic approximation and Monte Carlo simulations, we show that balanced polymorphism and recombination modification arise simultaneously as a consequence of epistatic plastic modification in periodic environments. Under this novel finite-population scenario of balancing selection, recombination arises between a plasticity modifier locus and its target coding-locus in the absence of typically assumed antagonistic co-evolution or constant influx of mutation. Moreover, even in the absence of epistasis or initial physical linkage between the co-modified coding loci, they cluster together such that alleles with aligned effects associate in supergenes. In turn, diversity increases due to both recombination between modifier and target loci and recombination suppression between additively acting target loci. This study uncovers the role of phenotypic plasticity in the maintenance of genetic variation and in the evolution of recombination rates.

Feb 24 (Friday) Bernd Sturmfels (University of California, Berkeley)
Does antibiotic resistance evolve in hospitals?

We present a joint paper with Anna Seigal, Portia Mira and Miriam Barlow, aimed at addressing the question in the title. Nosocomial outbreaks of bacteria and the heavy usage of antibiotics suggest that resistance evolves in hospital environments. To test this assumption, we studied resistance phenotypes of bacteria collected from patient isolates at a community hospital. A graphical model analysis shows no association between resistance and patient information other than time of arrival. This allows us to focus on time course data. Our main contribution is a statistical hypothesis test called the Nosocomial Evolution of Resistance Detector (NERD). It calculates the significance of resistance trends occurring in a hospital. It can help detect clonal outbreaks, and is available as an R-package. This lecture offers a glimpse into the use of mathematics and statistics in a bio-medical project.

See here for a related news item.

Feb 27 Alex McAvoy (Harvard University)
Linear payoff relationships in repeated games

In 2012, the study of the repeated Prisoner’s Dilemma was revitalized by the discovery of a new class of strategies known as “zero-determinant” (ZD) strategies. Through coercion, for example, ZD strategies allow a player to extort the opponent and obtain an unfair share of the payoffs. More generally, a player can use ZD strategies to unilaterally enforce linear relationships on expected payoffs, capturing classical fair strategies like tit-for-tat as well as extortionate and generous counterparts. I will discuss extensions of ZD strategies to arbitrary repeated games, including those with (i) larger action spaces (beyond just “cooperate” and “defect”), (ii) more than two players, and (iii) asynchronous moves. Beyond the fact that ZD strategies exist for a broad class of biologically-relevant interactions, these strategies can also always be assumed to have a short memory of the past (in fact, a memory of just the most recent round). Therefore, they are robust to changes in the structure of the game and can be implemented relatively easily.
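
The linear payoff relationship mentioned above can be checked numerically in the simplest setting. The following is a minimal sketch for the classical two-player, two-action Prisoner's Dilemma only, not the extensions discussed in the talk. The payoff values, the function names, and the parameters chi (extortion factor) and phi (a scaling constant) are illustrative assumptions. An extortionate strategy for player X should enforce s_X - P = chi * (s_Y - P) against any memory-one opponent.

```python
import numpy as np

# Prisoner's Dilemma payoffs (conventional illustrative values, not from the talk).
R, S, T, P = 3.0, 0.0, 5.0, 1.0

def extortionate_zd(chi, phi):
    """Memory-one strategy for X: cooperation probabilities after outcomes
    (CC, CD, DC, DD), written as (X's last move, Y's last move)."""
    p = np.array([
        1.0 - phi * (chi - 1.0) * (R - P),
        1.0 - phi * ((P - S) + chi * (T - P)),
        phi * ((T - P) + chi * (P - S)),
        0.0,
    ])
    if np.any(p < 0.0) or np.any(p > 1.0):
        raise ValueError("phi is too large for these payoffs and chi")
    return p

def long_run_payoffs(p, q):
    """Expected per-round payoffs (s_X, s_Y) when X plays p and Y plays q.
    q is indexed from Y's own point of view, so CD and DC are swapped."""
    q_seen_by_x = np.asarray(q, dtype=float)[[0, 2, 1, 3]]
    M = np.zeros((4, 4))
    for s in range(4):
        px, qy = p[s], q_seen_by_x[s]
        M[s] = [px * qy, px * (1 - qy), (1 - px) * qy, (1 - px) * (1 - qy)]
    # Stationary distribution: v M = v with entries summing to one.
    A = np.vstack([M.T - np.eye(4), np.ones(4)])
    b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    s_x = v @ np.array([R, S, T, P])
    s_y = v @ np.array([R, T, S, P])
    return s_x, s_y

chi = 3.0
p = extortionate_zd(chi, phi=1.0 / 26.0)
rng = np.random.default_rng(0)
for _ in range(3):
    q = rng.uniform(0.1, 0.9, size=4)   # an arbitrary memory-one opponent
    s_x, s_y = long_run_payoffs(p, q)
    print(f"s_X - P = {s_x - P:.6f},  chi * (s_Y - P) = {chi * (s_y - P):.6f}")
```

The two printed quantities should agree for every opponent q, which is the sense in which the relationship is enforced unilaterally.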

Mar 6 Spring Recess
Mar 13 Philipp Messer (Cornell University)
Genetic manipulation of entire populations with CRISPR gene drives

A functioning gene drive system could fundamentally change our strategies for the control of vector-borne diseases by facilitating rapid dissemination of transgenes that prevent pathogen transmission or reduce vector capacity. CRISPR/Cas9 gene drive promises such a mechanism, which works by converting cells that are heterozygous for a drive construct into homozygotes, thereby enabling super-Mendelian inheritance. Though CRISPR gene drive activity has already been demonstrated, a key obstacle for current systems is their propensity to generate resistance alleles. I will present both theoretical and experimental results that shed light on how such resistance alleles emerge during the drive process, how genetic variability in the population impacts their formation rate, and how these alleles are expected to ultimately affect the spread of a drive in the population. We find that the key factor determining the probability that resistance evolves is the formation rate of resistance alleles, while the conversion efficiency of the drive construct, its fitness cost, and its introduction frequency have only minor impact. Our experiments in the model system Drosophila melanogaster confirm that resistance alleles arise at high rates both prior to fertilization in the germline and post-fertilization in the embryo due to maternally deposited Cas9. These findings demonstrate that resistance will likely impose a severe limitation to the effectiveness of current CRISPR gene drive approaches, inform strategies for the engineering of drives with lower resistance potential, and motivate the possibility to embrace resistance as a possible mechanism for controlling a drive.

Mar 20 Thibault Lagache (Columbia University)
Spatial statistics in bioimage analysis

New advances in fluorescence microscopy make it possible to localize thousands of molecules with nanometer resolution inside living cells. This calls for the development of new statistical tools in spatial analysis to characterize the distribution of molecules and the spatial coupling between different molecules in multi-color microscopy. We will present the tools that we have recently developed. These are based on the mathematical analysis of Ripley’s K function, and allow one to test statistically the randomness of the distribution of molecules inside individual cells, and also to measure the coupling between molecules labeled in different colors. We will show some of the main biological applications of our tools.
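
For readers unfamiliar with Ripley's K function, here is a minimal sketch of the naive estimator, assuming a rectangular observation window of known area and no edge correction. This is not the speaker's method or software. The function name and the simulated points are illustrative assumptions. For a completely random (Poisson) pattern, K(r) is close to pi * r^2, so departures from that curve suggest clustering or dispersion.

```python
import numpy as np

def ripley_k(points, radii, area):
    """Naive Ripley's K estimate with no edge correction.
    points: (n, 2) array of coordinates; radii: distances at which to evaluate K;
    area: area of the observation window."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    off_diagonal = ~np.eye(n, dtype=bool)          # drop each point's distance to itself
    pair_dists = dists[off_diagonal]
    counts = np.array([(pair_dists <= r).sum() for r in radii])
    return area * counts / (n * (n - 1))

rng = np.random.default_rng(1)
uniform_points = rng.uniform(0.0, 1.0, size=(500, 2))   # simulated, not real data
radii = np.array([0.02, 0.05, 0.1])
print("estimated K:", ripley_k(uniform_points, radii, area=1.0))
print("pi * r^2   :", np.pi * radii ** 2)
```

Because points near the window boundary have fewer observable neighbors, this uncorrected estimate tends to fall slightly below pi * r^2 at larger radii; practical tools apply an edge correction.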

Mar 27 Tom Chou (UCLA)
Hematopoietic stem cell self-renewal and clonal aging determine clone size fluctuations of granulocyte populations in rhesus macaque

In recent experiments, virally tagged hematopoietic stem cells (HSCs) were autologously transplanted into rhesus macaques, and peripheral blood cells were sampled over fourteen years. Peripheral blood samples were then sequenced and the abundances of cells with different tags were quantified. Analysis of these clone sizes using a rescaled neutral growth model indicated rapid equilibration of clone size distributions after transplantation. Here, we address the clone size heterogeneity and the puzzling temporal fluctuations of the sizes of clones. Through mathematical modeling and statistical analysis of the data, we find that random HSC self-renewal in the bone marrow is consistent with the observed clonal size heterogeneity in the sampled peripheral blood. The dynamic variability in the sizes of individual clones, including the occasional extinctions and resurrections of certain clones, is naturally explained by a proliferation model that incorporates clonal aging by imposing a maximum number of divisions on progenitor cells. Our analysis quantifies the multi-stage stochastic dynamics of HSCs, progenitor cells, and peripheral blood, and shows that they can arise from an initial self-renewal stage followed by generation-limited progenitor cell bursting. Within this mechanistic picture, we use the data to infer estimates for HSC differentiation rates and a consistent maximum number of progenitor cell divisions.

Apr 3 Hong Qian (University of Washington)
The mathematical foundation of a landscape theory for living matters and life

The physicists' notion of energy is derived from Newtonian mechanics. The theory of thermodynamics is developed based on that notion, and the realization of mechanical energy dissipation in terms of heat. Since the work of L. Boltzmann, who trusted that atoms were real as early as 1884, heat has been intimately related to the stochastic motion of invisible atoms and molecules. In this talk, starting from a stochastic description of a class of rather general dynamics that is not limited to mechanics, we show that a notion of energy can be derived mathematically, in the limit of vanishing stochasticity, based on the Kullback-Leibler divergence, or relative entropy, associated with stochastic Markov processes. With the emergent notion of an energy function, e.g., "landscape", a mathematical structure inherent to the stochastic dynamics, which is akin to thermodynamics, is revealed. This analysis implies that an abstract "thermodynamic structure" exists, and can be formulated, for dynamics of complex systems independent of classical thermal physics, for example, in ecology.

Apr 10 Jeff Wall (UCSF)
Population genetic analyses of the GenomeAsia 100K Pilot Project

The GenomeAsia 100K consortium plans to generate high-coverage whole-genome sequence data from 100,000 Asian individuals over the next ~3 years, as a way of facilitating future sequence-based biomedical studies in Asian populations. I will discuss some initial analyses of the pilot phase of the project, consisting of 1,739 genomes from ~200 different populations. These will include analyses of the historical branching order of populations, contributions of Neanderthal, Denisovan and other archaic human groups to modern human genomes, estimates of the 1000 Genomes Project genotype call error rate, and estimates of the somatic mutation rate in cell lines.

Apr 17 No Seminar. MathBio Social.
Apr 24 Jun Li (University of Michigan)
Mutations, genetic identity, and data granularity

I will talk about two studies where new insights are gained after we work on a different level of data granularity. First, in collaboration with Sebastian Zoellner, we analyzed ~36 million extremely rare variants (defined as singletons in ~4,000 individuals) uniformly ascertained in an as yet unpublished whole-genome sequencing dataset. Our goal is to estimate mutation rate variation across the genome, and to identify genomic and sequence-based predictors of such variation. We found that some genomic features, such as H3K36me3 peaks and CpG islands, can either increase or decrease mutation rates depending on the adjacent sequence context. This shows that their impact on mutations cannot be understood by studying all mutation subtypes in aggregate. In the second study, in collaboration with Noah Rosenberg, we assessed the possibility of using an individual's microsatellite genotype data to find matched records in a database of SNP genotypes, even when they have no shared markers. By using ~1,000 samples analyzed on both the 13 tandem repeat markers in the FBI standard forensic panel and 650K common variants routinely typed in GWAS, we demonstrate the feasibility of cross-identifying individuals between the criminal justice system on one hand and genetic or ancestry research on the other. These results add to the list of examples where group-level patterns cannot always be transferred to the individual level, or vice versa. Choosing the right granular level of inquiry thus continues to be one of the biggest challenges in data science.

May 1 Pier Francesco Palamara (Harvard School of Public Health)
Decoding of pairwise coalescent times and detection of recent adaptation in biobank-scale SNP array data sets

Coalescent hidden Markov models (HMM) such as the pairwise sequentially Markovian coalescent (PSMC, Li and Durbin, 2010) enable estimating the locus-specific posterior distribution of the time to most recent common ancestor (TMRCA) of a pair of haploid chromosomes when high-coverage sequencing data is available. I will present the “ascertained sequentially Markovian coalescent” (ASMC), a coalescent HMM that can be used to accurately estimate locus-specific TMRCA probabilities in widely available SNP array data. ASMC utilizes an extremely efficient recursive formulation of the forward/backward HMM algorithm, which enables analysis of very large data sets to reconstruct a detailed landscape of coalescent times along the genome. I will describe results from running ASMC in several cohorts, including ~120,000 unrelated British individuals from the UK Biobank data set, where we find that multiple loci underwent positive selection during the past ~200 generations. Looking at deeper time scales, we detect widespread negative selection that concentrates in regions enriched for heritability in several disease phenotypes.

Past Seminars

Date Speaker & Title
Sep 12 Reception, 4:00‐5:30 PM, 318 Carolyn Lynch Lab
Sep 19 Barbara Engelhardt (Princeton University)
Structured latent factor models to recover interpretable networks from transcriptomic data

Latent factor models have been the recent focus of much attention in "big data" applications because of their ability to quickly allow the user to explore the underlying data in a controlled and interpretable way. In genomics, latent factor models are commonly used to identify population substructure, identify gene clusters, and control noise in large data sets. In this talk I present a general framework for Bayesian latent factor models. I will illustrate the power of these models for a broad class of structured problems in genomics via application to the Genotype-tissue Expression (GTEx) data set. In particular, by using a Bayesian biclustering version of this model, the estimated latent structure may be used to identify gene co-expression networks that co-vary uniquely in one tissue type (and other conditions). We validate network edges using tissue-specific expression quantitative trait loci.

Sep 26 Yoseph Barash (University of Pennsylvania)
Modeling RNA local splicing variations from large heterogeneous datasets

Alternative splicing (AS) of genes is a key contributor to transcriptome variations and numerous diseases. RNA-Seq experiments produce millions of short RNA reads and are commonly used to assess alternative splicing variations in one of two ways: full gene isoform quantification, or relative abundance of binary AS events such as exon skipping. In this talk I will present a new framework we developed, based on gene splice graphs, to define, quantify and visualize splicing variations. The new formulation, termed LSV (local splice variations), captures previously defined binary AS events, but also much more complex variations. We show such complex variations are common across metazoans, and can be accurately quantified. Next, I will discuss our current research into accurately capturing splicing variations when handling large heterogeneous datasets. Such data can involve hundreds or more human subjects and pose statistical and computational challenges.

Oct 3 Tandy Warnow (University of Illinois at Urbana–Champaign)
Grand challenges in phylogenomics

Estimating the Tree of Life will likely involve a two-step procedure, where in the first step trees are estimated on many genes, and then the gene trees are combined into a tree on all the taxa. However, the true gene trees may not agree with the species tree due to biological processes such as deep coalescence, gene duplication and loss, and horizontal gene transfer. Statistically consistent methods based on the multi-species coalescent model have been developed to estimate species trees in the presence of incomplete lineage sorting; however, the relative accuracy of these methods compared to the usual "concatenation" approach is a matter of substantial debate within the research community. In this talk I will present new state of the art methods we have developed for estimating species trees in the presence of incomplete lineage sorting (ILS), and show how they can be used to estimate species trees from genome-scale datasets. I will also discuss tradeoffs between data quantity and quality, and the implications for big data genomic analysis.

Oct 10 Petros Drineas (Purdue University)
Dimensionality reduction in the analysis of human genetics data

Dimensionality reduction algorithms have been widely used for data analysis in numerous application domains including the study of human genetics. For instance, linear dimensionality reduction techniques (such as Principal Components Analysis) have been extensively applied in population genetics. In this talk we will discuss such applications and their implications for human genetics.
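
As a concrete illustration of linear dimensionality reduction on genotype data, here is a minimal sketch of Principal Components Analysis computed with a singular value decomposition. It assumes a matrix with individuals as rows and SNP allele counts (0, 1, or 2) as columns. The function name and the simulated two-group data are illustrative assumptions and are not taken from the talk.

```python
import numpy as np

def pca_scores(genotypes, n_components=2):
    """Project individuals (rows) onto the top principal components.
    genotypes: (individuals, SNPs) array of allele counts 0, 1, or 2."""
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)                    # center each SNP column
    U, singular_values, _ = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * singular_values[:n_components]
    explained = singular_values ** 2 / (singular_values ** 2).sum()
    return scores, explained[:n_components]

# Toy data: two simulated groups with different allele frequencies.
rng = np.random.default_rng(2)
freqs_a = rng.uniform(0.1, 0.9, size=200)
freqs_b = np.clip(freqs_a + rng.normal(0.0, 0.15, size=200), 0.05, 0.95)
group_a = rng.binomial(2, freqs_a, size=(30, 200))
group_b = rng.binomial(2, freqs_b, size=(30, 200))
scores, explained = pca_scores(np.vstack([group_a, group_b]))
print(scores.shape, explained)
```

Plotting the first column of scores against the second is the usual way to look for population structure. Published analyses often also rescale each SNP by a function of its allele frequency before the decomposition, a step omitted here.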

Oct 17 Erol Akcay Lab Presentation + MathBio Social
Oct 24 Philip Johnson (University of Maryland)
Quantitative methods for comparing T cell repertoires

The vertebrate T cell adaptive immune response has the challenging task of recognizing all possible pathogens while not attacking "self." Evolution's solution to this challenge has been to generate a repertoire of T cells within a single individual via a process of recombination and intra-individual selection that creates a vast diversity of distinct T cell receptors (TCRs). The subset of this repertoire that responds to any particular infection can be qualitatively described as broad or narrow and public or private. We have developed methods for analyzing TCR repertoire sequencing data and quantitatively evaluating the statistical significance of differences between samples. These T cell data are equivalent to population genetic sequencing of pooled individuals with mixed frequencies. We use summary statistics as well as the full frequency spectrum, which incorporates TCR frequencies in addition to binary presence/absence data. We apply these methods to experiments examining mouse naive repertoires and mouse antigen-specific (post-LCMV) repertoires.

Oct 31 Amit Singer (Princeton University)
Vector diffusion maps and the graph connection Laplacian

Vector diffusion maps (VDM) is a mathematical framework for organizing and analyzing high-dimensional datasets that generalizes diffusion maps and other nonlinear dimensionality reduction methods, such as LLE, ISOMAP, and Laplacian eigenmaps. Whereas weighted undirected graphs are commonly used to describe networks and relationships between data objects, in VDM each edge is endowed with an orthogonal transformation encoding the relationship between the data at its vertices. The graph structure and orthogonal transformations are summarized by the graph connection Laplacian. In manifold learning, VDM can infer topological properties from point cloud data such as orientability, and graph connection Laplacians converge to their manifold counterparts (Laplacians for vector fields and higher order forms) in the large sample limit. The graph connection Laplacian satisfies a Cheeger-type inequality that provides a theoretical performance guarantee for the popular spectral algorithm for rotation synchronization, a problem with many applications in robotics and computer vision. The application to 2D class averaging in cryo-electron microscopy will serve as our main motivation.

Nov 7 Assaf Amitai (MIT)
Changes in local chromatin structure during homology search: effects of local contacts on search time

Double-strand break (DSB) repair by homologous recombination (HR) requires an efficient and timely search for a homologous template. We developed a statistical method of analysis based on single-particle trajectory data which allows us to extract forces acting on chromatin at DSBs. We can differentiate between extrinsic forces from the actin cytoskeleton and intrinsic alterations on the nucleosomal level at the cleaved MAT locus in budding yeast. Using polymer models we show that reduced tethering forces lead to local decondensation near DSBs, which reduces the mean first encounter time by two orders of magnitude. Local decondensation likely stems from loss of internal mechanical constraints and a local redistribution of nucleosomes that depends on chromatin remodelers. Simulations verify that local changes in inter-nucleosomal contacts near DSBs would drastically shorten the time required for a long-range homology search.

Nov 14 Uri Keich (University of Sydney)
Controlling the rate of false discoveries in tandem mass spectrum identifications

A typical shotgun proteomics experiment produces thousands of tandem mass spectra, each of which can be tentatively assigned a corresponding peptide by using a database search procedure that looks for a peptide-spectrum match (PSM) that optimizes the score assigned to a matched pair. Some of the resulting PSMs will be correct while others will be false, and we have no way to verify which is which. The statistical problem we face is that of controlling the false discovery rate (FDR), or the expected proportion of false PSMs among all reported pairings. While there is a rich statistical literature on controlling the FDR in the multiple hypothesis testing context, controlling the FDR in the PSM context is mostly done through the "home grown" method called target-decoy competition (TDC). After a brief introduction to the problem of tandem mass spectrum identification we will explore the reasons why the mass spec community has been using this non-standard approach to controlling the FDR. We will then discuss how calibration can increase the number of correct discoveries and offer an alternative method for controlling the FDR in the presence of calibrated scores. We will conclude by arguing that our analysis extends to a more general setup than the mass spectrum identification problem.
Joint work with Bill Noble (University of Washington)
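
To make the target-decoy competition idea concrete, here is a minimal sketch of how a score cutoff might be chosen. It assumes each spectrum has already been searched against both a target and a decoy database and that only the better-scoring match was kept. The function name, the inputs, and the toy numbers are illustrative assumptions. The "+1" in the estimate is a common conservative convention rather than something stated in the abstract, and tied scores are not handled specially.

```python
def tdc_discoveries(scores, is_decoy, alpha):
    """Return the number of target PSMs reported at estimated FDR <= alpha.
    scores: winning score per spectrum after target-decoy competition;
    is_decoy: True where the winning match came from the decoy database."""
    ranked = sorted(zip(scores, is_decoy), key=lambda pair: pair[0], reverse=True)
    targets = decoys = 0
    best_targets = 0
    for _, decoy in ranked:
        if decoy:
            decoys += 1
        else:
            targets += 1
        # Decoy wins above the cutoff estimate the number of false target wins.
        estimated_fdr = (decoys + 1) / max(targets, 1)
        if estimated_fdr <= alpha:
            best_targets = targets   # keep the deepest cutoff that still passes
    return best_targets

toy_scores = [10, 9, 8, 7, 6, 5]                       # hypothetical PSM scores
toy_decoy = [False, False, False, True, False, False]  # hypothetical labels
print(tdc_discoveries(toy_scores, toy_decoy, alpha=0.5))
```

The sketch scans the ranked list once and keeps the largest set of top-scoring PSMs whose estimated FDR stays at or below the requested level.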

Nov 21 Yujin Chung (Temple University)
Bayesian inference of evolutionary divergence with genomic data under diverse demographic models

In the study of diverging populations and species, a common goal is to disentangle the conflicting signals of prolonged genetic drift (elevating divergence) and gene exchange (removing it). In this talk, I present a new Bayesian method for estimating demographic history using population genomic samples. Several key innovations are introduced that allow the study of diverse models within an Isolation with Migration framework. The new method implements a 2-step analysis, with an initial Markov chain Monte Carlo (MCMC) phase that samples simple coalescent trees, followed by the calculation of the joint posterior density for the parameters of a demographic model. In step 1, the MCMC sampling phase, the method uses a reduced state space, consisting of coalescent trees without migration paths, and a simple importance sampling distribution without the demography of interest. Once obtained, a single sample of trees can be used in step 2 to calculate the joint posterior density for model parameters under multiple diverse demographic models, without having to repeat MCMC runs. Migration paths are not included in the state space of the MCMC phase, but rather are handled by analytic integration using a Markov chain as a representation of genealogy in step 2 of the analysis. Because of the separation of coalescent events from migration events and the demography of interest, the method is scalable to a large number of loci with excellent MCMC mixing properties. With an implementation of the new method in the computer program MIST, I demonstrate the method’s accuracy, scalability and other advantages using simulated data and DNA sequences of two common chimpanzee subspecies: Pan troglodytes (P. t.) troglodytes and P. t. verus.

Nov 28 Joshua Schraiber (Temple University)
Assessing the relationship of ancient samples to modern populations and to each other

When ancient samples are sequenced, one of the first questions asked is how those samples relate to modern populations and to each other. Commonly, this is assessed using methods such as Structure or Admixture, which model individuals as mixtures of latent "ancestry components". If an ancient individual is found to not carry similar ancestry components to a modern individual, that sample is considered to be not directly related to the modern individual. However, the model used by Structure fails to account for the difference in genetic drift in ancient and modern populations and hence can cause misleading inferences about the relationships of ancient samples to modern populations. As a first step toward remedying this, I developed a novel method that can estimate the relationship of an arbitrarily large sample of ancient individuals to a modern reference population. I do this using a diffusion theory approach, and integrate over the uncertainty in genotypes that results from low coverage ancient sequences. Although the approach can only estimate the relationships of a single population at a time, I use an ad hoc clustering method that can group individuals into populations and refine estimates of how those populations are related to the modern reference panel. I might show some application to human ancient DNA data from Europe.

Dec 5 Jonathan Terhorst (University of California, Berkeley)
Robust and scalable inference of population history from hundreds of unphased whole-genomes

It has recently been demonstrated that inference methods based on genealogical processes with recombination can reveal past population history in unprecedented detail. However, these methods scale poorly with sample size, which limits resolution in the recent past, and they require phased genomes, which contain switch errors that can catastrophically distort the inferred history. In this talk I present SMC++, a new statistical tool capable of analyzing orders of magnitude more samples than existing methods, while requiring only unphased genomes (its results are independent of phasing). The key innovation is a novel probabilistic framework which couples the genealogical process for a given individual with allele frequency information from a large reference panel such as 1000 Genomes.

SMC++ can jointly infer population size histories and split times in diverged populations, and it employs a novel spline regularization scheme that greatly reduces estimation error. We apply SMC++ to analyze sequence data from over a thousand human genomes in Africa and Eurasia, hundreds of genomes from a Drosophila population in Africa, and tens of genomes from zebra finch and long-tailed finch populations in Australia.

Date Speaker & Title
Feb 1 Joshua Plotkin Lab Presentation + MathBio Social
Feb 8 Paul Jenkins (University of Warwick)
Exact simulation of the Wright-Fisher diffusion

The Wright-Fisher family of diffusion processes is a class of evolutionary models widely used in population genetics. Simulation and inference from these diffusions is therefore of widespread interest. However, simulating a Wright-Fisher diffusion is difficult because there is no known closed-form formula for its transition function. In this talk I show how it is possible to simulate exactly from the scalar Wright-Fisher diffusion with general drift, extending ideas based on 'retrospective' simulation. The key idea is to exploit an eigenfunction expansion representation of the transition function. This approach also yields methods for exact simulation from several processes related to the Wright-Fisher diffusion: (i) the ancestral process of an infinite-leaf Kingman coalescent tree; (ii) its infinite-dimensional counterpart, the Fleming-Viot process; and (iii) its bridges. This is joint work with Dario Spano.

Feb 15 Kelley Harris (Stanford University)
The genetic cost of Neanderthal introgression

In human populations living outside Africa, at least 2-4% of the gene pool is inferred to have Neanderthal origin. This is the result of interbreeding between humans and Neanderthals that took place soon after humans first migrated out of Africa. Recent studies have shown that this Neanderthal DNA appears to be most abundant in regions of the genome that have no known functional importance, whereas Neanderthal DNA appears to be depleted from regions of the human genome that are subject to strong evolutionary constraint. A possible explanation for this pattern is that early human-Neanderthal hybrids had low fertility or fitness due to epistatic incompatibilities between the two species. In this talk I will present a possible alternative explanation: that Neanderthals had lower fitness than their human contemporaries due to long periods of inbreeding and low population size. Using published estimates of Neanderthal effective population size history and the distribution of mutational fitness effects, we infer that Neanderthal fitness was at least 40% lower than human fitness on average. This mutational load implies that Neanderthal DNA should have been selected against and depleted from functional regions of the genome without need to invoke epistasis. We also predict a residual Neanderthal mutation load in non-Africans, leading to a fitness reduction of at least 0.5%. This effect of Neanderthal admixture has been left out of previous debate on mutation load differences between Africans and non-Africans. We also show that if many deleterious mutations are recessive, the Neanderthal admixture fraction could increase over time due to the protective effect of Neanderthal haplotypes against deleterious alleles that arose recently in the human population. This might partially explain why so many organisms retain gene flow from other species and appear to derive adaptive benefits from introgression. This is joint work with Rasmus Nielsen.

Feb 22 Nancy Zhang Lab Presentation + MathBio Social
Feb 29 Po-Ru Loh (Harvard University)
Fast and accurate long-range phasing in a UK Biobank cohort

Genotyping arrays produce diploid genetic data in which maternally and paternally derived haploid chromosomes are combined into a total allele count at each site. Inferring haploid phase from diploid data -- "phasing" for short -- is a fundamental question in human genetics and a key step in genotype imputation. Most existing methods for computational phasing apply hidden Markov models; these algorithms are statistically precise but computationally challenging. Long-range phasing (LRP) is an alternative, much faster approach that harnesses long identity-by-descent (IBD) tracts shared among related individuals; in such IBD regions, phase inference is straightforward. However, because of its reliance on long IBD, LRP has previously only been successfully applied in data sets containing close relatives. I will describe a new LRP-based method, Eagle, that leverages distant relatedness -- ubiquitous in very large data sets such as the UK Biobank -- along with fast approximate HMM decoding to achieve a speedup of 1-2 orders of magnitude over existing methods.
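
To make the phasing problem concrete, the sketch below collapses two haplotypes into per-site allele counts and then enumerates every haplotype pair consistent with the resulting genotype vector. It is a brute-force illustration only, not Eagle or any HMM-based method, and the function names are hypothetical.

```python
from itertools import product

def to_genotype(hap_a, hap_b):
    """Collapse two 0/1 haplotypes into per-site allele counts (0, 1, or 2)."""
    return [a + b for a, b in zip(hap_a, hap_b)]

def compatible_phasings(genotype):
    """Enumerate unordered haplotype pairs consistent with a genotype vector."""
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    for bits in product([0, 1], repeat=len(het_sites)):
        # Homozygous sites are fixed: count 0 gives allele 0, count 2 gives allele 1.
        hap_a = [g // 2 for g in genotype]
        hap_b = [g // 2 for g in genotype]
        # Heterozygous sites carry one copy of each allele, in either arrangement.
        for site, bit in zip(het_sites, bits):
            hap_a[site], hap_b[site] = bit, 1 - bit
        pairs.add(frozenset([tuple(hap_a), tuple(hap_b)]))
    return pairs

genotype = to_genotype([0, 1, 1, 0], [1, 1, 0, 0])
phasings = compatible_phasings(genotype)
```

With h heterozygous sites (h at least 1) there are 2^(h-1) candidate pairs, which is why allele counts alone do not determine phase.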

Mar 14 David Holcman (École Normale Supérieure)
Polymer models, asymptotic analysis, simulations of cellular nucleus dynamics

The organization and the dynamics of chromatin in the cell nucleus are still not resolved. Experimental and theoretical approaches are now converging, and I propose to summarize recent progress using polymer models and mathematical analysis for studying stochastic single-particle trajectories. Finally, we will discuss asymptotic methods for estimating the mean time for a polymer to loop, interpreting large data sets and reconstructing chromatin organization.

Mar 21 Eleni Katifori Lab Presentation + MathBio Social
Mar 28 James Greene (Rutgers University)
Mathematical models of drug resistance and tumor heterogeneity

Resistance to chemotherapy is one of the major impediments to treatment success. Furthermore, experimental evidence suggests that drug resistance is a complex biological phenomenon, with many influences that interact nonlinearly. In this talk, we present work on models of tumor heterogeneity and their impact on drug resistance. Specifically, we present a structured population model as a general framework for studying drug resistance, as well as specific models of cell-cycle dynamics and its effect on cancer growth and treatment efficacy.

Apr 4 Anne Shiu (Texas A&M University)
Multisite phosphorylation systems: dynamics and steady states

Reaction networks taken with mass-action kinetics arise in many settings, from epidemiology to population biology to systems of chemical reactions. The systematic study of the resulting polynomial ordinary differential equations began in the 1970s, and in recent years, this area has seen renewed interest, due in part to applications to systems biology. This talk focuses on the dynamics and steady states of such systems. Our main interest is in certain signaling networks, namely, multisite phosphorylation networks. In particular, these systems exhibit “toric steady states” (that is, the ODEs generate a binomial ideal), which enables us to efficiently determine their capacity for multiple steady states. Also, we show that when the phosphorylation/dephosphorylation mechanism is “processive” (binding of a substrate and an enzyme molecule results in addition or removal of phosphate groups at all phosphorylation sites), these systems exhibit rigid dynamics: each invariant set contains a unique steady state, which is a global attractor.
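
As a small illustration of how mass-action kinetics produce polynomial ODEs, the sketch below builds the right-hand side for a toy reversible binding reaction in which A and B combine into C. The network, the rate constants, and the function names are assumptions chosen for illustration; this is not one of the phosphorylation networks from the talk.

```python
def mass_action_rhs(state, k_on=1.0, k_off=0.5):
    """Right-hand side for reversible binding of A and B into C.

    Under mass-action kinetics the forward rate k_on * a * b is a degree-2
    monomial and the reverse rate k_off * c is linear, so the ODE system is
    polynomial in (a, b, c).
    """
    a, b, c = state
    net_flux = k_on * a * b - k_off * c
    return (-net_flux, -net_flux, net_flux)

def euler_integrate(state, dt=0.001, steps=20000):
    """Integrate with forward Euler steps (a simple, non-adaptive scheme)."""
    for _ in range(steps):
        rhs = mass_action_rhs(state)
        state = tuple(x + dt * dx for x, dx in zip(state, rhs))
    return state

a, b, c = euler_integrate((1.0, 1.0, 0.0))
```

For this toy network the steady-state condition k_on * a * b = k_off * c is a single binomial equation, and the totals a + c and b + c are conserved along trajectories.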

Apr 11 Guy Sella (Columbia University)
A model for the genetic architecture of quantitative traits

Many phenotypes of interest, including susceptibility to many common diseases, are “quantitative”, meaning that the heritable variation in the trait is largely due to numerous genetic variants of small effects segregating in the population. The causes of quantitative genetic variation have been studied in evolutionary biology for over a century. This pursuit has recently come to the forefront of research in human genetics as well, with the push to map variants that underlie heritable genetic variation in phenotypes. Notably, since 2007, genome-wide association studies (GWAS) in humans have led to the identification of thousands of variants reproducibly associated with hundreds of quantitative traits, including susceptibility to a wide variety of diseases. These studies reveal intriguing differences among traits in their genetic architecture (i.e., the number of associated variants, their effect sizes and frequencies) and in the fraction of the heritable variation explained. Interpreting these findings has been difficult, however, in no small part because we lack generative models relating population genetic processes (e.g., pleiotropy, selection and genetic drift) to the genetic architecture of quantitative traits. I will present such a model and discuss how it helps to understand the results of GWAS. I will also describe our preliminary results using GWAS findings to learn about the forces underlying heritable variation in human height.

Apr 18 Amaury Lambert (UPMC Univ Paris 06)
From individual-based population models to lineage-based models of phylogenies

A popular line of research in evolutionary biology is to use time-calibrated phylogenies to understand the process of diversification. This is done by performing statistical inference under stochastic models of species diversification. These models thus need to be robust, biologically sound and mathematically tractable. We first introduce some new lineage-based, stochastic models of phylogenies, featuring e.g., protracted speciation or age-dependent extinction. Using recent mathematical results allowing the computation of tree likelihoods, we present ML parameter estimates inferred from the bird phylogeny. Our goal is then to obtain (these or other) models of phylogenies, starting from an individual-based description of populations. We present in particular two non-exchangeable models of phylogenetic trees thus obtained. In the first one, speciation is modelled by genetic differentiation of individual lineages. The second one is a scaling limit of the Tilman-Levins type model where interspecific competition is only felt by older species from younger species.

Apr 25 Junhyong Kim Lab Presentation + MathBio Social
May 2 Simon Gravel (McGill University)
Population structure in African-Americans

We present a detailed population genetic study of 3 African-American cohorts comprising over 3000 genotyped individuals across US urban and rural communities: two nation-wide longitudinal cohorts, and the 1000 Genomes ASW cohort. Ancestry analysis reveals a uniform breakdown of continental ancestry proportions across regions and urban/rural status, with 79% African, 19% European, and 1.5% Native American/Asian ancestries, with substantial between-individual variation. The Native American ancestry proportion is higher than previous estimates and is maintained after self-identified Hispanics and individuals with substantial inferred Spanish ancestry are removed. This supports direct admixture between Native Americans and African Americans on US territory, and linkage patterns suggest contact early after African-American arrival to the Americas. Local ancestry patterns and variation in ancestry proportions across individuals are broadly consistent with a single African-American population model with early Native American admixture and ongoing European gene flow in the South. The size and broad geographic sampling of our cohorts enable detailed analysis of the geographic and cultural determinants of finer-scale population structure. Recent identity-by-descent analysis reveals fine-scale geographic structure consistent with the routes used during slavery and in the great African-American migrations of the twentieth century: east-to-west migrations in the south, and distinct south-to-north migrations into New England and the Midwest. These migrations follow transit routes available at the time and are in stark contrast with European-American relatedness patterns.

May 9 Matthias Steinrücken (University of Massachusetts Amherst)
The joint total tree length at linked loci in populations of variable size

The inference of historical population sizes from contemporary genomic sequence data has gained a lot of attention in recent years. A particular focus has been on recent exponential growth in humans. This recent growth has had a strong impact on the distribution of rare genetic variants, which are of particular importance when studying disease related genetic variation. The popular PSMC method (Li and Durbin, 2011) can be used to infer population sizes from a sample of two chromosomes. However, the small sample size severely limits the power of this method in the recent past. To improve inference in the recent past, we extend the Coalescent Hidden Markov model approach of PSMC to larger sample sizes. To this end, we augment the hidden states of the model to the total length of the coalescent trees relating the sampled chromosomes at each locus. A key ingredient in this extension is the joint distribution of the total tree length at two linked loci in populations of variable size. This joint distribution has received little attention in the literature, especially under variable population size. In this talk, we show that it can be obtained as the solution of a system of semi-linear hyperbolic partial differential equations. Such systems can cause instabilities in standard numerical solution approaches. We thus implement a solution scheme based on the method of characteristics that allows us to compute the solutions efficiently. We further compare these solutions to explicit coalescent simulations to demonstrate their accuracy. We will discuss potential extension of our approach to infer divergence times and migration rates in structured populations.

May 18 Nicolas Lanchier (Arizona State University)
Fluctuation and fixation in the Axelrod model

The Axelrod model is a spatial stochastic model for the dynamics of cultures which includes two key social components: homophily, the tendency of individuals to interact more frequently with individuals who are more similar, and social influence, the tendency of individuals to become more similar when they interact. Each individual is characterized by a collection of opinions about different issues, and pairs of neighbors interact at a rate equal to the number of issues for which they agree, which results in the interacting pair agreeing on one more issue. This model has been extensively studied during the past 15 years based on numerical simulations and heuristic arguments while there is a lack of analytical results. This talk gives rigorous fluctuation and fixation results for the one-dimensional system that sometimes confirm and sometimes refute some of the conjectures proposed by applied scientists.
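
The interaction rule described above can be sketched as a discrete-time simulation on a one-dimensional path. This is a minimal sketch under stated assumptions: the continuous-time rates are replaced by an acceptance probability proportional to the number of agreements, and the function names, system size, and parameter values are all illustrative.

```python
import random

def axelrod_step(culture, rng):
    """One update attempt of a discrete-time Axelrod-type model on a path."""
    n_issues = len(culture[0])
    i = rng.randrange(len(culture) - 1)  # pick neighbors i and i + 1
    left, right = culture[i], culture[i + 1]
    differing = [k for k in range(n_issues) if left[k] != right[k]]
    agreements = n_issues - len(differing)
    # Homophily: the pair interacts with probability proportional to agreement.
    if differing and rng.random() < agreements / n_issues:
        k = rng.choice(differing)
        # Social influence: one neighbor adopts the other's opinion on issue k.
        if rng.random() < 0.5:
            left[k] = right[k]
        else:
            right[k] = left[k]

def simulate(n_sites=20, n_issues=3, n_opinions=2, steps=5000, seed=1):
    """Run the model from a uniformly random initial configuration."""
    rng = random.Random(seed)
    culture = [[rng.randrange(n_opinions) for _ in range(n_issues)]
               for _ in range(n_sites)]
    for _ in range(steps):
        axelrod_step(culture, rng)
    return culture

final = simulate()
```

Neighbors that disagree on every issue never interact in this sketch, which is the mechanism behind the fixation behavior the abstract refers to.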

Date Speaker & Title
Sep 14 Reception, 4:00‐5:30 PM, 318 Carolyn Lynch Lab
Sep 21 John (Jack) Kamm (University of California, Berkeley)
momi: A new method for inferring demography and computing the multipopulation sample frequency spectrum

The sample frequency spectrum (SFS) describes the distribution of allele counts at segregating sites, and is a useful statistic for both summarizing genetic data and inferring biological parameters. SFS-based inference proceeds by comparing observed and expected values of the SFS, but computing the expectations is computationally challenging when there are multiple populations related by a complex demographic history.

We are developing a new software package, momi (MOran Models for Inference), that computes the multipopulation SFS under population size changes (including exponential growth), population mergers and splits, and pulse admixture events. Underlying momi is a multipopulation Moran model, which is equivalent to the coalescent and the Wright-Fisher diffusion, but has computational advantages in both speed and numerical stability. Techniques from graphical models are used to integrate out historical allele frequencies. Automatic differentiation provides the gradient and Hessian, which are useful for searching through parameter space and for computing asymptotic confidence intervals.

Using momi, we are able to compute the exact SFS for more complex demographies than previously possible. In addition, the expectations of a wide range of statistics, such as the time to most recent common ancestor (TMRCA) and total branch length, can also be efficiently computed. The scaling properties of momi depend heavily on the pattern of migration events, but for certain demographic histories, momi can scale up to tens to hundreds of populations. We demonstrate the accuracy of momi by applying it to simulated data, and are in the process of applying it to real data to infer a model of human history involving archaic hominins (Neanderthal and Denisovan) and modern humans in Africa, Europe, East Asia, and Melanesia.
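
As a concrete illustration of the statistic itself, the sketch below computes the observed single-population SFS from a small matrix of sampled chromosomes. It does not use momi and does not compute expected values under a demographic model; the 0/1 encoding and the function name are assumptions for illustration.

```python
from collections import Counter

def sample_frequency_spectrum(haplotypes):
    """Count segregating sites by derived-allele count.

    haplotypes: list of equal-length 0/1 lists, one per sampled chromosome,
    where 1 marks the derived allele (an assumed encoding).
    Returns a list sfs where sfs[k - 1] is the number of sites at which
    exactly k of the n chromosomes carry the derived allele, for 0 < k < n.
    """
    n = len(haplotypes)
    # zip(*haplotypes) iterates over sites (columns) of the sample matrix.
    counts = Counter(sum(column) for column in zip(*haplotypes))
    # Sites with count 0 or n are not segregating and are left out.
    return [counts.get(k, 0) for k in range(1, n)]

haplotypes = [
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
]
sfs = sample_frequency_spectrum(haplotypes)
```

In this toy sample three sites are singletons, one site is fixed for the derived allele, and one is monomorphic for the ancestral allele, so only the first entry of the spectrum is nonzero.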

Sep 28 Peter Ralph (University of Southern California)
Geography and adaptation

Most species are distributed across geography, and so can develop regional differences across the range. These may often result from natural selection, and the geographic scale on which different solutions to evolutionary problems can appear -- the spatial resolution of adaptation -- is determined by a balance between migration, selection, and availability of genetic variation. I will describe some recent work to develop appropriate stochastic models for this problem, as well as progress using georeferenced whole genomes to fit a geographically explicit population model. A recurrent theme is the use of diffusion theory to describe the spatial motion of lineages.

Oct 5 Michael Lässig (University of Cologne)
Predicting the evolution of influenza

The human flu virus undergoes rapid evolution, which is driven by interactions with its host immune system. We describe the evolutionary dynamics by a fitness model based on two phenotypes of the virus: protein folding stability and susceptibility to human immune response. This model successfully predicts the evolution of influenza one year into the future. Thus, evolutionary analysis transcends its traditional role of reconstructing past history. This has important consequences for public health: evolutionary predictions can inform the selection of influenza vaccine strains.

Oct 12 Ben Kunsberg (Brown University)
Local shape from shading with a generic constraint

Humans have a remarkable ability to infer shape from shading (SFS) information. In computer vision this is often formulated with a Lambertian reflectance function, but it remains under-posed and incompletely solved. Abstractly, the intensity in an image is a single valued function and the goal is to uncover the vector valued normal function. This ill-posedness has resulted in many proposed techniques that are either regularizations or propagations from known values. Our goal is to understand, mathematically and computationally, how we solve this problem.

First, it has been shown psychophysically that our perception (via gauge figure estimates) is remarkably accurate even when the boundary is masked. Thus classical propagating approaches requiring known values along a boundary, such as those of characteristic curves or fast marching methods, are unlikely to model the visual system's solution.

An alternative approach requires regularization priors (in a Bayesian framework) or energy terms (in a variational framework). However, many of the proposed priors are ad-hoc and chosen by researchers to optimize performance for a particular test dataset. It is hard to conclude (from solely performance metrics) whether these priors are useful or accurate, e.g. good results are functions of these priors, resolution, the optimization techniques, the test set, and so on.

In this talk, we describe a different approach. We consider the SFS problem on image patches modeled as Taylor polynomials of any order and seek to recover a solution for that patch. We build a boot-strapping tensor framework that allows us to relate a smooth image patch to all of the polynomial surface solutions (under any light source). We then use a generic constraint on the light source to restrict these solutions to a 2-D subspace, plus an unknown rotation matrix. We then investigate several special cases where the ambiguity reduces and the solution can be anchored. Interestingly, these anchor solutions relate to those situations in which human performance is also veridical.

Oct 19 George Hagstrom (Princeton University)
The evolution of distributed sensing and collective computation in animal populations

Many animal groups exhibit rapid, coordinated collective motion. Yet, the evolutionary forces that cause such collective responses to evolve are poorly understood. Here we develop analytical methods and evolutionary simulations based on experimental data from schooling fish. We use these methods to investigate how populations evolve within unpredictable, time-varying resource environments. We show that populations evolve toward a distinctive regime in behavioral phenotype space, where small responses of individuals to local environmental cues cause spontaneous changes in the collective state of groups, similar to phase transitions in physical systems. Through these transitions, individuals evolve the emergent capacity to sense and respond to resource gradients (i.e. individuals perceive gradients via social interactions, rather than sensing gradients directly), and to allocate themselves among distinct, distant resource patches. These results yield insight into how natural selection, acting on selfish individuals, results in the highly effective collective responses evident in nature.

Oct 26 James Zou (Microsoft Research New England and MIT)
Estimating the unseen variants in human populations provides a roadmap for precision medicine

What can we learn from 60,000 genome sequences? I will describe how we are leveraging the largest collection of human exomes to model the landscape of harmful genetic variations in healthy individuals. I will also discuss how we can use the previously identified variants to accurately estimate properties of the unobserved variants that exist in the general population. Our linear program estimators have strong mathematical guarantees. This model of rare, unobserved variants provides a roadmap for future sequencing projects, such as the Precision Medicine Initiative.

Nov 2 Erick Matsen (Fred Hutchinson Cancer Research Center)
Learning how antibodies are drafted and revised

Antibodies must recognize a great diversity of antigens to protect us from infectious disease. The binding properties of antibodies are determined by the sequences of their corresponding B cell receptors (BCRs). These BCR sequences are created in "draft" form by VDJ recombination, which randomly selects and deletes from the ends of V, D, and J genes, then joins them together with additional random nucleotides. If they pass initial screening and bind an antigen, these sequences then undergo an evolutionary process of mutation and selection, "revising" the BCR to improve binding to its cognate antigen. It has recently become possible to determine the antibody-determining BCR sequences resulting from this process in high throughput. Although these sequences implicitly contain a wealth of information about both antigen exposure and the process by which we learn to resist pathogens, this information can only be extracted using computer algorithms.

In this talk, I will describe two recent projects to develop model-based inferential tools for analyzing BCR sequences: first, a hidden Markov model (HMM) framework to reconstruct BCR rearrangement events and determine which BCRs derived from the same rearrangements, and second, a novel method for assessing selection on BCRs that side-steps the difficulties in differentiating between selection and motif-driven mutation. We use this new method to derive a per-residue map of selection on millions of reads, which provides a more nuanced view of the constraints on framework and variable regions.

This work is joint with Trevor Bedford (Fred Hutch), Vladimir Minin (UW Statistics), and Duncan Ralph (Fred Hutch).

Nov 9 Iain Mathieson (Harvard University)
Eight thousand years of natural selection in Europe

The arrival of farming in Europe around 8,500 years ago necessitated adaptation to new environments, pathogens, diets, and social organizations. While indirect evidence of adaptation can be detected in patterns of genetic variation in present-day people, ancient DNA makes it possible to witness selection directly by analyzing samples from populations before, during and after adaptation events. Here we report the first genome-wide scan for selection using ancient DNA, capitalizing on the largest ancient whole-genome dataset yet assembled: 230 West Eurasians dating to between 6500 and 1000 BCE, including 163 with newly reported data. The new samples include the first genome-wide data from the Anatolian Neolithic culture, who we show were members of the population that was the source of Europe’s first farmers. We identify genome-wide significant signatures of selection at loci associated with diet, pigmentation and immunity, and two independent episodes of selection on height.

Nov 16 Sebastien Roch (University of Wisconsin — Madison)
From Genomes to Trees and Beyond: A Survey of Recent Results in Mathematical Phylogenomics

The reconstruction of the Tree of Life is a classical problem in evolutionary biology that has benefited from many branches of mathematics, including probability, combinatorics, algebra, and geometry. Modern DNA sequencing technologies are producing a deluge of new genetic data---transforming how we view the Tree of Life and how it is reconstructed. I will survey recent progress on some mathematical questions that arise in this context.

Nov 23 Charles Epstein (University of Pennsylvania)
The ins and outs of diffusion limits
Nov 30 MathBio Social, 4:00‐5:30 PM, 318 Carolyn Lynch Lab