The aim of the paper is to discuss the association between SNP genotype data and a disease. In genetic association studies, statistical analyses with multiple markers have been shown to be more powerful, efficient, and biologically meaningful than single-marker association tests. As the number of genetic markers considered is typically large, here we cluster them and then study the association between groups of markers and disease. We propose a two-step procedure. First, a Bayesian nonparametric cluster estimate under normalized generalized gamma process mixture models is introduced, so that we are able to incorporate the information from large-scale SNP data with a much smaller number of explanatory variables. Then, thanks to the introduction of a genetic score, we study the association between the relevant disease response and groups of markers using a logit model. Inference is obtained via an MCMC truncation method recently introduced in the literature. We also provide a review of the state of the art of Bayesian nonparametric cluster models and algorithms for the class of mixtures adopted here. Finally, the model is applied to a genome-wide association study of Crohn’s disease in a case-control setting.
Nonparametric Bayesian Methods in Biostatistics and Bioinformatics