Bayesian Mixtures for Large Scale Inference

Denti, F

Bayesian mixture models are ubiquitous in statistics due to their simplicity and flexibility, and can be easily employed in a wide variety of contexts. In this dissertation, we aim to contribute to current Bayesian data analysis methods, often motivated by research questions from biological applications. In particular, we focus on the development of novel Bayesian mixture models, typically in a nonparametric setting, to improve and extend active research areas that involve large-scale data: the modeling of nested data, multiple hypothesis testing, and dimensionality reduction.\\
Our goal is therefore twofold: to develop robust statistical methods motivated by a solid theoretical background, and to propose efficient, scalable and tractable algorithms for their application.\\
The thesis is organized as follows. In Chapter \ref{intro} we briefly review the methodological background and discuss the necessary concepts from the different areas to which this dissertation contributes.\\
In Chapter \ref{CAM} we propose a Common Atoms model (CAM) for nested datasets, which overcomes the limitations of the nested Dirichlet process discussed in \citet{Camerlenghi2018}. We derive its theoretical properties and develop a slice sampler for nested data to obtain an efficient algorithm for posterior simulation. We then embed the model in a Rounded Mixture of Gaussian kernels framework to apply our method to an abundance table from a microbiome study.\\
In Chapter \ref{BNPT} we develop a Bayesian nonparametric (BNP) version of the two-group model \citep{Efron2004}, modeling both the null density $f_0$ and the alternative density $f_1$ with Pitman-Yor (PY) process mixture models. We propose to fix the two discount parameters $\sigma_0$ and $\sigma_1$ so that $\sigma_0>\sigma_1$, on the rationale that the null PY should be closer to its base measure (chosen to be a standard Gaussian), while the alternative PY should have fewer constraints. To induce separation, we employ a non-local prior \citep{Johnson} on the location parameter of the base measure of the PY placed on $f_1$. We show how the model performs in different scenarios and apply this methodology to a microbiome dataset.\\
Chapter \ref{Peluso} presents a second proposal for the two-group model. Here, we make use of non-local distributions to model the alternative density directly in the likelihood formulation. We propose both a parametric and a nonparametric formulation of the model. We provide a theoretical justification for the adoption of this approach and, after comparing the performance of our model with several competitors, we present three applications to real, publicly available genomic datasets.\\
In Chapter \ref{CRIME} we focus on improving the model for the estimation of intrinsic dimensions (IDs) discussed in \citet{Allegra}. In particular, the authors estimate the IDs by modeling the ratio of the distances from a point to its first and second nearest neighbors (NNs). First, we propose to include more suitable priors in their parametric, finite mixture model. Then, we extend the existing theoretical methodology by deriving closed-form distributions for the ratios of distances from a point to two NNs of generic order. We propose a simple Dirichlet process mixture model, where we exploit the novel theoretical results to extract more information from the data. The chapter concludes with simulation studies and an application to real data.\\
Finally, Chapter \ref{Conclusions} outlines future directions and concludes.


Denti, F. (2020). Bayesian Mixtures for Large Scale Inference. (Doctoral thesis, Università degli Studi di Milano-Bicocca, 2020).