Principal Investigator: Professor Z Ghahramani
We live in an era of abundant data. Rapid technological advances, such as the internet, have made it possible to collect, store and share large amounts of information more easily than ever before. The availability of large amounts of data has had a major impact on society, commerce, and the sciences.
Data plays a particularly important role in the sciences. Data is what you get from conducting experiments, and data is what you use to test scientific theories. In recent years, the amount of data collected and generated in the sciences has grown tremendously. We need better tools to model this data, so that we can understand and test theories and make scientific predictions.
Our proposal focuses on advanced statistical tools for modelling data. It is important that the models are based on probability and statistics, because any model of real-world phenomena has to represent the uncertainty that arises from incomplete information and noisy measurements. Probability theory provides a coherent mathematical language for expressing uncertainty in models. Our proposal develops models based on Bayesian statistics, which is the application of probability theory to learn unknown quantities from observable data, and which was known as “inverse probability” until the 20th century. Bayesian statistics can also be used to compare multiple models (i.e. hypotheses) given the data, and thus can play a fundamental role in scientific hypothesis testing.
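As a purely illustrative sketch of these two uses of Bayesian statistics, consider a toy coin-flipping experiment, which is our own example and not one of the models in the proposal. The function names, the uniform prior on the coin's bias and the equal prior weight on the two hypotheses are all assumptions made for illustration. The first function learns an unknown quantity (the bias) from data; the remaining functions compare two hypotheses, "the coin is fair" and "the coin has an unknown bias", given the same data.

```python
from math import factorial


def posterior_mean_bias(heads, flips):
    """Posterior mean of the coin's bias under an assumed uniform prior.

    With a uniform prior the posterior is Beta(heads + 1, tails + 1),
    whose mean is (heads + 1) / (flips + 2).
    """
    return (heads + 1) / (flips + 2)


def evidence_fair(heads, flips):
    """Probability of the observed sequence if the coin is exactly fair."""
    return 0.5 ** flips


def evidence_unknown_bias(heads, flips):
    """Probability of the observed sequence averaged over a uniform prior.

    Integrating p**heads * (1 - p)**tails over p in [0, 1] gives
    heads! * tails! / (flips + 1)!.
    """
    tails = flips - heads
    return factorial(heads) * factorial(tails) / factorial(flips + 1)


def posterior_prob_fair(heads, flips, prior_fair=0.5):
    """Posterior probability of the 'fair coin' hypothesis via Bayes' rule."""
    weighted_fair = prior_fair * evidence_fair(heads, flips)
    weighted_unknown = (1.0 - prior_fair) * evidence_unknown_bias(heads, flips)
    return weighted_fair / (weighted_fair + weighted_unknown)


if __name__ == "__main__":
    # Hypothetical data: 9 heads observed in 10 flips.
    print("Estimated bias:", posterior_mean_bias(9, 10))
    print("Probability the coin is fair:", posterior_prob_fair(9, 10))
```

With lopsided data such as 9 heads in 10 flips, the probability assigned to the "fair coin" hypothesis falls well below its prior value of one half, while balanced data such as 5 heads in 10 flips raises it. This is the sense in which the data can be used to weigh competing hypotheses.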
We will develop new computational tools for Bayesian modelling, ensuring that the models are flexible enough to capture the complexity of real-world phenomena and scalable enough to deal with very large data sets. We will also develop new methods for deciding which data to collect and which experiments to perform, which can greatly reduce the cost of scientific inquiry. We will make use of the latest advances in computer hardware, in the form of massively parallel graphics processing units (GPUs), to speed up the modelling of scientific data.
This proposal is truly cross-disciplinary in that we do not focus on a single scientific discipline. In fact, we have assembled a team whose expertise spans Bayesian modelling across the physical, biological and social sciences. We will create modelling tools for better astronomical surveying of the skies so that we can understand the composition of our universe; we will create tools for analysing gene and protein data so that we can better understand biological phenomena and design drug therapies; and we will develop powerful methods for modelling and predicting economic and financial data, which we hope will help reduce risk in financial markets.
Surprisingly, these diverse areas of the sciences—astronomy, biology and economics—can come together through a unified set of computational and statistical modelling tools. Our advances will benefit not just these areas but many other areas of science based on data-intensive modelling.