Deep Learning — Publications

Bridging the Human-AI Knowledge Gap - Concept Discovery and Transfer in AlphaZero

Artificial Intelligence (AI) systems have made remarkable progress, attaining super-human performance across various domains. This presents us with an opportunity to further human knowledge and improve human expert performance by leveraging the hidden knowledge encoded within these highly performant AI systems. Yet, this knowledge is often hard to extract, and may be hard to understand or learn from. Here, we show that this is possible by proposing a new method that allows us to extract new chess concepts in AlphaZero, an AI system that mastered the game of chess via self-play without human supervision. Our analysis indicates that AlphaZero may encode knowledge that extends beyond the existing human knowledge, but knowledge that is ultimately not beyond human grasp, and can be successfully learned from. In a human study, we show that these concepts are learnable by top human experts, as four top chess grandmasters show improvements in solving the presented concept prototype posi... [full abstract]


Lisa Schut, Nenad Tomasev, Tom McGrath, Demis Hassabis, Ulrich Paquet, Been Kim
arXiv pre-print (2023)
[paper]
Link to this publication

Managing AI Risks in an Era of Rapid Progress

In this short consensus paper, we outline risks from upcoming, advanced AI systems. We examine large-scale social harms and malicious uses, as well as an irreversible loss of human control over autonomous AI systems. In light of rapid and continuing AI progress, we propose priorities for AI R&D and governance.In this short consensus paper, we outline risks from upcoming, advanced AI systems. We examine large-scale social harms and malicious uses, as well as an irreversible loss of human control over autonomous AI systems. In light of rapid and continuing AI progress, we propose priorities for AI R&D and governance.


Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann
arXiv (2023)
[paper]
Link to this publication

In-Context Learning in Large Language Models Learns Label Relationships but Is Not Conventional Learning

The predictions of Large Language Models (LLMs) on downstream tasks often improve significantly when including examples of the input–label relationship in the context. However, there is currently no consensus about how this in-context learning (ICL) ability of LLMs works. For example, while Xie et al. (2021) liken ICL to a general-purpose learning algorithm, Min et al. (2022) argue ICL does not even learn label relationships from in-context examples. In this paper, we provide novel insights into how ICL leverages label information, revealing both capabilities and limitations. To ensure we obtain a comprehensive picture of ICL behavior, we study probabilistic aspects of ICL predictions and thoroughly examine the dynamics of ICL as more examples are provided. Our experiments show that ICL predictions almost always depend on in-context labels, and that ICL can learn truly novel tasks in-context. However, we also find that ICL struggles to fully overcome prediction preferences acqui... [full abstract]


Jannik Kossen, Yarin Gal, Tom Rainforth
arXiv, 2023
[arXiv]
Link to this publication

Three Towers: Flexible Contrastive Learning with Pretrained Image Models

We introduce Three Towers (3T), a flexible method to improve the contrastive learning of vision-language models by incorporating pretrained image classifiers. While contrastive models are usually trained from scratch, LiT (Zhai et al., 2022) has recently shown performance gains from using pretrained classifier embeddings. However, LiT directly replaces the image tower with the frozen embeddings, excluding any potential benefits of contrastively training the image tower. With 3T, we propose a more flexible strategy that allows the image tower to benefit from both pretrained embeddings and contrastive training. To achieve this, we introduce a third tower that contains the frozen pretrained embeddings, and we encourage alignment between this third tower and the main image-text towers. Empirically, 3T consistently improves over LiT and the CLIP-style from-scratch baseline for retrieval tasks. For classification, 3T reliably improves over the from-scratch baseline, and while it under... [full abstract]


Jannik Kossen, Mark Collier, Basil Mustafa, Xiao Wang, Xiaohua Zhai, Lucas Beyer, Andreas Peter Steiner, Jesse Berent, Rodolphe Jenatton, Effrosyni Kokiopoulou
NeurIPS, 2023
[OpenReview] [arXiv]
Link to this publication

Diversifying AI - Towards Creative Chess with AlphaZero

In recent years, Artificial Intelligence (AI) systems have surpassed human intelligence in a variety of computational tasks. However, AI systems, like humans, make mistakes, have blind spots, hallucinate, and struggle to generalize to new situations. This work explores whether AI can benefit from creative decision-making mechanisms when pushed to the limits of its computational rationality. In particular, we investigate whether a team of diverse AI systems can outperform a single AI in challenging tasks by generating more ideas as a group and then selecting the best ones. We study this question in the game of chess, the so-called “drosophila of AI”. We build on AlphaZero (AZ) and extend it to represent a league of agents via a latent-conditioned architecture, which we call AZdb. We train AZdb to generate a wider range of ideas using behavioral diversity techniques and select the most promising ones with sub-additive planning. Our experiments suggest that AZdb plays chess in dive... [full abstract]


Tom Zahavy, Vivek Veeriah, Shaobo Hou, Kevin Waugh, Matthew Lai, Edouard Leurent, Nenad Tomasev, Lisa Schut, Demis Hassabis, Satinder Singh
arXiv pre-print (2023)
[paper]
Link to this publication

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perform tasks. However, this approach relies on the stated reasoning faithfully reflecting the model’s actual reasoning, which is not always the case. To improve over the faithfulness of CoT reasoning, we have models generate reasoning by decomposing questions into subquestions. Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT while improving the faithfulness of the model’s stated reasoning on several recently-proposed metrics. By forcing the model to answer simpler subquestions in separate contexts, we greatly increase the... [full abstract]


A Radhakrishnan, K Nguyen,, A Chen, C Chen, C Denison, D Hernandez, E Durmus, E Hubinger, J Kernion, K Lukosiute, N Cheng, N Joseph, N Schiefer, O Rausch, S McCandlish, S El Showk, T Lanham, T Maxwell, V Chandrasekaran, Z Hatfield-Dodds, J Kaplan, Jan Brauner, SR Bowman, E Perez
Anthropic pre-print (2023)
[paper]
Link to this publication

Measuring Faithfulness in Chain-of-Thought Reasoning

Large language models (LLMs) perform better when they produce step-by-step, “Chain-ofThought” (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model’s actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT’s performance boost does not seem to come from CoT’s added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumsta... [full abstract]


Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamile Lukosiute, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwell, Timothy Telleen-Lawton, Tristan Hume, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, Samuel R. Bowman, Ethan Perez
Anthropic pre-print (2023)
[paper]
Link to this publication

Active Acquisition for Multimodal Temporal Data: A Challenging Decision-Making Task

We introduce a challenging decision-making task that we call active acquisition for multimodal temporal data (A2MT). In many real-world scenarios, input features are not readily available at test time and must instead be acquired at significant cost. With A2MT, we aim to learn agents that actively select which modalities of an input to acquire, trading off acquisition cost and predictive performance. A2MT extends a previous task called active feature acquisition to temporal decision making about high-dimensional inputs. We propose a method based on the Perceiver IO architecture to address A2MT in practice. Our agents are able to solve a novel synthetic scenario requiring practically relevant cross-modal reasoning skills. On two large-scale, real-world datasets, Kinetics-700 and AudioSet, our agents successfully learn cost-reactive acquisition behavior. However, an ablation reveals they are unable to learn adaptive acquisition strategies, emphasizing the difficulty of the task ev... [full abstract]


Jannik Kossen, Cătălina Cangea, Eszter Vértes, Andrew Jaegle, Viorica Patraucean, Ira Ktena, Nenad Tomasev, Danielle Belgrave
TMLR, 2023
[OpenReview] [arXiv]
Link to this publication

Prediction-Oriented Bayesian Active Learning

Information-theoretic approaches to active learning have traditionally focused on maximising the information gathered about the model parameters, most commonly by optimising the BALD score. We highlight that this can be suboptimal from the perspective of predictive performance. For example, BALD lacks a notion of an input distribution and so is prone to prioritise data of limited relevance. To address this we propose the expected predictive information gain (EPIG), an acquisition function that measures information gain in the space of predictions rather than parameters. We find that using EPIG leads to stronger predictive performance compared with BALD across a range of datasets and models, and thus provides an appealing drop-in replacement.


Freddie Bickford Smith, Andreas Kirsch, Sebastian Farquhar, Yarin Gal, Adam Foster, Tom Rainforth
International Conference on Artificial Intelligence and Statistics (AISTATS), 2023
[Paper] [BibTeX]
Link to this publication

KL Guided Domain Adaptation

Domain adaptation is an important problem and often needed for real-world applications. In this problem, instead of i.i.d. training and testing datapoints, we assume that the source (training) data and the target (testing) data have different distributions. With that setting, the empirical risk minimization training procedure often does not perform well, since it does not account for the change in the distribution. A common approach in the domain adaptation literature is to learn a representation of the input that has the same (marginal) distribution over the source and the target domain. However, these approaches often require additional networks and/or optimizing an adversarial (minimax) objective, which can be very expensive or unstable in practice. To improve upon these marginal alignment techniques, in this paper, we first derive a generalization bound for the target loss based on the training loss and the reverse Kullback-Leibler (KL) divergence between the source and the ... [full abstract]


Tuan Nguyen, Toan Tran, Yarin Gal, Philip H. S. Torr, Atılım Güneş Baydin
International Conference on Learning Representations, 2022
[arXiv] [BibTex]
Link to this publication

Bayesian uncertainty quantification for machine-learned models in physics

Being able to quantify uncertainty when comparing a theoretical or computational model to observations is critical to conducting a sound scientific investigation. With the rise of data-driven modelling, understanding various sources of uncertainty and developing methods to estimate them has gained renewed attention. Yarin Gal and four other experts discuss uncertainty quantification in machine-learned models with an emphasis on issues relevant to physics problems.


Yarin Gal, Petros Koumoutsakos, Francois Lanusse, Gilles Louppe, Costas Papadimitriou
Nature Reviews Physics volume 4, pages 573–577 (2022)
[Nature Review Physics]
Link to this publication

Plex: Towards Reliability using Pretrained Large Model Extensions

A recent trend in artificial intelligence is the use of pretrained models for language and vision tasks, which have achieved extraordinary performance but also puzzling failures. Probing these models’ abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks involving uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and proper scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot uncertainty). We devise 10 types of tasks over 40 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large model extensions for vision and language... [full abstract]


Dustin Tran, Jeremiah Liu, Michael W. Dusenberry, Du Phan, Mark Collier, Jie Ren, Kehang Han, Zi Wang, Zelda Mariet, Huiyi Hu, Neil Band, Tim G. J. Rudner, Karan Singhal, Zachary Nado, Joost van Amersfoort, Andreas Kirsch, Rodolphe Jenatton, Nithum Thain, Honglin Yuan, Kelly Buchanan, Kevin Murphy, D. Sculley, Yarin Gal, Zoubin Ghahramani, Jasper Snoek, Balaji Lakshminarayan
Contributed Talk, ICML Pre-training Workshop, 2022
[OpenReview] [Code] [BibTex] [Google AI Blog Post]
Link to this publication

Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval

The ability to accurately model the fitness landscape of protein sequences is critical to a wide range of applications, from quantifying the effects of human variants on disease likelihood, to predicting immune-escape mutations in viruses and designing novel biotherapeutic proteins. Deep generative models of protein sequences trained on multiple sequence alignments have been the most successful approaches so far to address these tasks. The performance of these methods is however contingent on the availability of sufficiently deep and diverse alignments for reliable training. Their potential scope is thus limited by the fact many protein families are hard, if not impossible, to align. Large language models trained on massive quantities of non-aligned protein sequences from diverse families address these problems and show potential to eventually bridge the performance gap. We introduce Tranception, a novel transformer architecture leveraging autoregressive predictions and retrieva... [full abstract]


Pascal Notin, Mafalda Dias, Jonathan Frazer, Javier Marchena-Hurtado, Aidan Gomez, Debora Marks, Yarin Gal
ICML, 2022
[Preprint] [BibTex] [Code]
Link to this publication

RITA: a Study on Scaling Up Generative Protein Sequence Models

In this work we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on over 280 million protein sequences belonging to the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models in next amino acid prediction, zero-shot fitness, and enzyme function prediction, showing benefits from increased scale. We release the RITA models openly, to the benefit of the research community.


Daniel Hesslow, Niccoló Zanichelli, Pascal Notin, Iacopo Poli, Debora Marks
ICML, Workshop on Computational biology, 2022 (Spotlight)
[Preprint] [Code] [BibTex]
Link to this publication

Mixtures of large-scale dynamic functional brain network modes

Accurate temporal modelling of functional brain networks is essential in the quest for understanding how such networks facilitate cognition. Researchers are beginning to adopt time-varying analyses for electrophysiological data that capture highly dynamic processes on the order of milliseconds. Typically, these approaches, such as clustering of functional connectivity profiles and Hidden Markov Modelling (HMM), assume mutual exclusivity of networks over time. Whilst a powerful constraint, this assumption may be compromising the ability of these approaches to describe the data effectively. Here, we propose a new generative model for functional connectivity as a time-varying linear mixture of spatially distributed statistical “modes”. The temporal evolution of this mixture is governed by a recurrent neural network, which enables the model to generate data with a rich temporal structure. We use a Bayesian framework known as amortised variational inference to learn model parameters ... [full abstract]


Chetan Gohil, Evan Roberts, Ryan Timms, Alex Skates, Cameron Higgins, Andrew Quinn, Usama Pervaiz, Joost van Amersfoort, Pascal Notin, Yarin Gal, Stanislaw Adaszewski, Mark Woolrich
NeuroImage Volume 263 [Paper] [BibTex]
Link to this publication

On Feature Collapse and Deep Kernel Learning for Single Forward Pass Uncertainty

Inducing point Gaussian process approximations are often considered a gold standard in uncertainty estimation since they retain many of the properties of the exact GP and scale to large datasets. A major drawback is that they have difficulty scaling to high dimensional inputs. Deep Kernel Learning (DKL) promises a solution: a deep feature extractor transforms the inputs over which an inducing point Gaussian process is defined. However, DKL has been shown to provide unreliable uncertainty estimates in practice. We study why, and show that with no constraints, the DKL objective pushes “far-away” data points to be mapped to the same features as those of training-set points. With this insight we propose to constrain DKL’s feature extractor to approximately preserve distances through a bi-Lipschitz constraint, resulting in a feature space favorable to DKL. We obtain a model, DUE, which demonstrates uncertainty quality outperforming previous DKL and other single forward pass uncertain... [full abstract]


Joost van Amersfoort, Lewis Smith, Andrew Jesson, Oscar Key, Yarin Gal
arXiv (2022)
[Paper]
Link to this publication

Prioritized Training on Points that are Learnable, Worth Learning, and not yet Learnt

Training on web-scale data can take months. But much computation and time is wasted on redundant and noisy points that are already learnt or not learnable. To accelerate training, we introduce Reducible Holdout Loss Selection (RHO-LOSS), a simple but principled technique which selects approximately those points for training that most reduce the model’s generalization loss. As a result, RHO-LOSS mitigates the weaknesses of existing data selection methods: techniques from the optimization literature typically select “hard” (e.g. high loss) points, but such points are often noisy (not learnable) or less task-relevant. Conversely, curriculum learning prioritizes “easy” points, but such points need not be trained on once learned. In contrast, RHO-LOSS selects points that are learnable, worth learning, and not yet learnt. RHO-LOSS trains in far fewer steps than prior art, improves accuracy, and speeds up training on a wide range of datasets, hyperparameters, and architectures (MLPs, C... [full abstract]


Sören Mindermann, Jan Brauner, Muhammed Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan Gomez, Adrien Morisot, Sebastian Farquhar, Yarin Gal
ICML, 2022 [Paper]
Link to this publication

Prospect Pruning: Finding Trainable Weights at Initialization using Meta-Gradients

Pruning neural networks at initialization would enable us to find sparse models that retain the accuracy of the original network while consuming fewer computational resources for training and inference. However, current methods are insufficient to enable this optimization and lead to a large degradation in model performance. In this paper, we identify a fundamental limitation in the formulation of current methods, namely that their saliency criteria look at a single step at the start of training without taking into account the trainability of the network. While pruning iteratively and gradually has been shown to improve pruning performance, explicit consideration of the training stage that will immediately follow pruning has so far been absent from the computation of the saliency criterion. To overcome the short-sightedness of existing methods, we propose Prospect Pruning (ProsPr), which uses meta-gradients through the first few steps of optimization to determine which weights t... [full abstract]


Milad Alizadeh, Shyam A. Tailor, Luisa Zintgraf, Joost van Amersfoort, Sebastian Farquhar, Nicholas Donald Lane, Yarin Gal
ICLR, 2022 [arXiv] [OpenReview] [BibTex]
Link to this publication

Evaluating Approximate Inference in Bayesian Deep Learning

Uncertainty representation is crucial to the safe and reliable deployment of deep learning. Bayesian methods provide a natural mechanism to represent epistemic uncertainty, leading to improved generalization and calibrated predictive distributions. Understanding the fidelity of approximate inference has extraordinary value beyond the standard approach of measuring generalization on a particular task: if approximate inference is working correctly, then we can expect more reliable and accurate deployment across any number of real-world settings. In this competition, we evaluate the fidelity of approximate Bayesian inference procedures in deep learning, using as a reference Hamiltonian Monte Carlo (HMC) samples obtained by parallelizing computations over hundreds of tensor processing unit (TPU) devices. We consider a variety of tasks, including image recognition, regression, covariate shift, and medical applications. All data are publicly available, and we release several baselines... [full abstract]


Andrew Gordon Wilson, Sanae Lotfi, Sharad Vikram, Matthew D Hoffman, Yarin Gal, Yingzhen Li, Melanie F Pradier, Andrew Foong, Sebastian Farquhar, Pavel Izmailov
Proceedings of Machine Learning Research, 176;113-114
[paper]
Link to this publication

Benchmarking Bayesian Deep Learning on Diabetic Retinopathy Detection Tasks

Bayesian deep learning seeks to equip deep neural networks with the ability to precisely quantify their predictive uncertainty, and has promised to make deep learning more reliable for safety-critical real-world applications. Yet, existing Bayesian deep learning methods fall short of this promise; new methods continue to be evaluated on unrealistic test beds that do not reflect the complexities of downstream real-world tasks that would benefit most from reliable uncertainty quantification. We propose a set of real-world tasks that accurately reflect such complexities and are designed to assess the reliability of predictive models in safety-critical scenarios. Specifically, we curate two publicly available datasets of high-resolution human retina images exhibiting varying degrees of diabetic retinopathy, a medical condition that can lead to blindness, and use them to design a suite of automated diagnosis tasks that require reliable predictive uncertainty quantification. We use th... [full abstract]


Neil Band, Tim G. J. Rudner, Qixuan Feng, Angelos Filos, Zachary Nado, Michael W. Dusenberry, Ghassen Jerfel, Dustin Tran, Yarin Gal
NeurIPS Datasets and Benchmarks Track, 2021
Spotlight Talk, NeurIPS Workshop on Distribution Shifts, 2021
Symposium on Machine Learning for Health (ML4H) Extended Abstract Track, 2021
NeurIPS Workshop on Bayesian Deep Learning, 2021
[OpenReview] [Code] [BibTex]
Link to this publication

Speedy Performance Estimation for Neural Architecture Search

Reliable yet efficient evaluation of generalisation performance of a proposed architecture is crucial to the success of neural architecture search (NAS). Traditional approaches face a variety of limitations: training each architecture to completion is prohibitively expensive, early stopped validation accuracy may correlate poorly with fully trained performance, and model-based estimators require large training sets. We instead propose to estimate the final test performance based on a simple measure of training speed. Our estimator is theoretically motivated by the connection between generalisation and training speed, and is also inspired by the reformulation of a PAC-Bayes bound under the Bayesian setting. Our model-free estimator is simple, efficient, and cheap to implement, and does not require hyperparameter-tuning or surrogate training before deployment. We demonstrate on various NAS search spaces that our estimator consistently outperforms other alternatives in achieving be... [full abstract]


Binxin (Robin) Ru, Clare Lyle, Lisa Schut, Miroslav Fil, Mark van der Wilk, Yarin Gal
NeurIPS 2021
[Paper]
Link to this publication

Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning

High-quality estimates of uncertainty and robustness are crucial for numerous real-world applications, especially for deep learning which underlies many deployed ML systems. The ability to compare techniques for improving these estimates is therefore very important for research and practice alike. Yet, competitive comparisons of methods are often lacking due to a range of reasons, including: compute availability for extensive tuning, incorporation of sufficiently many baselines, and concrete documentation for reproducibility. In this paper we introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks. As of this writing, the collection spans 19 methods across 9 tasks, each with at least 5 metrics. Each baseline is a self-contained experiment pipeline with easily reusable and extendable components. Our goal is to provide immediate starting points for experimentation with new methods or applications. A... [full abstract]


Zachary Nado, Neil Band, Mark Collier, Josip Djolonga, Michael W. Dusenberry, Sebastian Farquhar, Angelos Filos, Marton Havasi, Rodolphe Jenatton, Ghassen Jerfel, Jeremiah Liu, Zelda Mariet, Jeremy Nixon, Shreyas Padhy, Jie Ren, Tim G. J. Rudner, Yeming Wen, Florian Wenzel, Kevin Murphy, D. Sculley, Balaji Lakshminarayanan, Jasper Snoek, Yarin Gal, Dustin Tran
NeurIPS Workshop on Bayesian Deep Learning, 2021
[arXiv] [Code] [Blog Post (Google AI)] [BibTex]
Link to this publication

Disease variant prediction with deep generative models of evolutionary data

Quantifying the pathogenicity of protein variants in human disease-related genes would have a marked effect on clinical decisions, yet the overwhelming majority (over 98%) of these variants still have unknown consequences. In principle, computational methods could support the large-scale interpretation of genetic variants. However, state-of-the-art methods have relied on training machine learning models on known disease labels. As these labels are sparse, biased and of variable quality, the resulting models have been considered insufficiently reliable. Here we propose an approach that leverages deep generative models to predict variant pathogenicity without relying on labels. By modelling the distribution of sequence variation across organisms, we implicitly capture constraints on the protein sequences that maintain fitness. Our model EVE (evolutionary model of variant effect) not only outperforms computational approaches that rely on labelled data but also performs on par with,... [full abstract]


Jonathan Frazer, Pascal Notin, Mafalda Dias, Aidan Gomez, Joseph K. Min, Kelly Brock, Yarin Gal, Debora Marks
Nature, 2021 (volume 599, pages 91–95)
[Paper] [BibTex] [Preprint] [Website] [Code]
Link to this publication

Deep Deterministic Uncertainty for Semantic Segmentation

We extend Deep Deterministic Uncertainty (DDU), a method for uncertainty estimation using feature space densities, to semantic segmentation. DDU enables quantifying and disentangling epistemic and aleatoric uncertainty in a single forward pass through the model. We study the similarity of feature representations of pixels at different locations for the same class and conclude that it is feasible to apply DDU location independently, which leads to a significant reduction in memory consumption compared to pixel dependent DDU. Using the DeepLab-v3+ architecture on Pascal VOC 2012, we show that DDU improves upon MC Dropout and Deep Ensembles while being significantly faster to compute.


Jishnu Mukhoti, jv, Philip HS Torr, Yarin Gal
arXiv (2021)
[Paper]
Link to this publication

Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks

There has been significant research done on developing methods for improving robustness to distributional shift and uncertainty estimation. In contrast, only limited work has examined developing standard datasets and benchmarks for assessing these approaches. Additionally, most work on uncertainty estimation and robustness has developed new techniques based on small-scale regression or image classification tasks. However, many tasks of practical interest have different modalities, such as tabular data, audio, text, or sensor data, which offer significant challenges involving regression and discrete or continuous structured prediction. Thus, given the current state of the field, a standardized large-scale dataset of tasks across a range of modalities affected by distributional shifts is necessary. This will enable researchers to meaningfully evaluate the plethora of recently developed uncertainty quantification methods, as well as assessment criteria and state-of-the-art baselin... [full abstract]


Andrey Malinin, Neil Band, Alexander Ganshin, German Chesnokov, Yarin Gal, Mark J. F. Gales, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, Vyas Raina, Denis Roginskiy, Mariya Shmatova, Panagiotis Tigas, Boris Yangel
NeurIPS Datasets and Benchmarks Track, 2021
[arXiv] [BibTex] [Code]
[Competition Website] [Blog Post (OATML)] [Blog Post (Yandex Research)]
Link to this publication

Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning

We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms. However, unlike conventional non-parametric models, we let the model learn end-to-end from the data how to make use of other datapoints for prediction. Empirically, our models solve cross-datapoint lookup and complex reasoning tasks unsolvable by traditional deep learning models. We show highly competitive results on tabular data, early results on CIFAR-10, and give insight into how the model makes use of the interactions between points.


Jannik Kossen, Neil Band, Clare Lyle, Aidan Gomez, Yarin Gal, Tom Rainforth
NeurIPS, 2021
[OpenReview] [arXiv] [Code]
Link to this publication

Galaxy Zoo DECaLS: Detailed visual morphology measurements from volunteers and Deep Learning for 314,000 galaxies

We present Galaxy Zoo DECaLS: detailed visual morphological classifications for Dark Energy Camera Legacy Survey images of galaxies within the SDSS DR8 footprint. Deeper DECaLS images (r = 23.6 versus r = 22.2 from SDSS) reveal spiral arms, weak bars, and tidal features not previously visible in SDSS imaging. To best exploit the greater depth of DECaLS images, volunteers select from a new set of answers designed to improve our sensitivity to mergers and bars. Galaxy Zoo volunteers provide 7.5 million individual classifications over 314 000 galaxies. 140 000 galaxies receive at least 30 classifications, sufficient to accurately measure detailed morphology like bars, and the remainder receive approximately 5. All classifications are used to train an ensmble of Bayesian convolutional neural networks (a state-of-the-art deep learning method) to predict posteriors for the detailed morphology of all 314 000 galaxies. We use active learning to focus our volunteer effort on the galaxies... [full abstract]


Mike Walmsley, Chris Lintott, Tobias Géron, Sandor Kruk, Coleman Krawczyk, Kyle W Willett, Steven Bamford, Lee S Kelvin, Lucy Fortson, Yarin Gal, William Keel, Karen L Masters, Vihang Mehta, Brooke D Simmons, Rebecca Smethurst, Lewis Smith, Elisabeth M Baeten, Christine Macmillan
Monthly Notices of the Royal Astronomical Society
[Paper]
Link to this publication

Deterministic Neural Networks with Inductive Biases Capture Epistemic and Aleatoric Uncertainty

While Deep Ensembles are the state-of-the art for uncertainty prediction, standard softmax neural nets suffer from feature collapse and cannot disentangle aleatoric and epistemic uncertainty. We show that a single softmax neural net with minimal changes can beat epistemic uncertainty predictions of Deep Ensembles and other complex single-forward-pass uncertainty approaches (DUQ and SNGP) while also disentangling uncertainties. Our Deep Deterministic Uncertainty (DDU) is based on three insights: i) predictive entropy confounds aleatoric and epistemic uncertainty, and softmax entropy is inconsistent for OoD points; ii) with appropriate inductive biases, i.e. residual connections and spectral normalization, feature-space density reliably captures epistemic uncertainty; and, iii) density estimation and classification objectives might have different optima. Thus, DDU disentangles aleatoric uncertainty using softmax entropy and epistemic uncertainty using a separate featur... [full abstract]


Jishnu Mukhoti, Andreas Kirsch, Joost van Amersfoort, Philip H.S. Torr, Yarin Gal
Uncertainty & Robustness in Deep Learning Workshop, ICML, 2021
[Paper] [BibTex] [Poster]
Link to this publication

On Pitfalls in OoD Detection: Entropy Considered Harmful

Entropy of a predictive distribution averaged over an ensemble or several posterior weight samples is often used as a metric for Out-of-Distribution (OoD) detection. However, we show that predictive entropy is inappropriate for this task because it mistakes ambiguous in-distribution samples as OoD. This issue remains hidden on curated datasets commonly used for benchmarking. We introduce a new dataset, Dirty-MNIST, with a long tail of ambiguous samples, which exemplifies this problem. Additionally, we look at the entropy of single, deterministic, softmax models and show that it is unreliable exactly for OoD samples. In summary, we caution against using predictive or softmax entropy for OoD detection in practice and introduce several methods to evaluate the quantitative difference between several uncertainty metrics.


Andreas Kirsch, Jishnu Mukhoti, Joost van Amersfoort, Philip H.S. Torr, Yarin Gal
Uncertainty & Robustness in Deep Learning Workshop, ICML, 2021
[Paper] [BibTex] [Poster]
Link to this publication

Can convolutional ResNets approximately preserve input distances? A frequency analysis perspective

ResNets constrained to be bi-Lipschitz, that is, approximately distance preserving, have been a crucial component of recently proposed techniques for deterministic uncertainty quantification in neural models. We show that theoretical justifications for recent regularisation schemes trying to enforce such a constraint suffer from a crucial flaw – the theoretical link between the regularisation scheme used and bi-Lipschitzness is only valid under conditions which do not hold in practice, rendering existing theory of limited use, despite the strong empirical performance of these models. We provide a theoretical explanation for the effectiveness of these regularisation schemes using a frequency analysis perspective, showing that under mild conditions these schemes will enforce a lower Lipschitz bound on the low-frequency projection of images. We then provide empirical evidence supporting our theoretical claims, and perform further experiments which demonstrate that our broader concl... [full abstract]


Lewis Smith, jv, Haiwen Huang, Stephen Roberts, Yarin Gal
arXiv (2022)
[Paper]
Link to this publication

Kessler : A machine learning library for spacecraft collision avoidance

As megaconstellations are launched and the space sector grows, space debris pollution is posing an increasing threat to operational spacecraft. Low Earth orbit is a junkyard of dead satellites, rocket bodies, shrapnels, and other debris that travel at very high speed in an uncontrolled manner. Collisions at orbital speeds can generate fragments and potentially trigger a cascade of more collisions endangering the whole population, a scenario known since the late 1970s as the Kessler syndrome. In this work we present Kessler: an open-source Python package for machine learning (ML) applied to collision avoidance. Kessler provides functionalities to import and export conjunction data messages (CDMs) in their standard format and predict the evolution of conjunction events based on explainable ML models. In Kessler we provide Bayesian recurrent neural networks that can be trained with existing collections of CDM data and then deployed in order to predict the contents of future CDMs in... [full abstract]


Giacomo Acciarini, Francesco Pinto, Francesca Letizia, José A. Martinez-Heras, Klaus Merz, Christopher Bridges, Atılım Güneş Baydin
8th European Conference on Space Debris
[Paper]
Link to this publication

Generating Interpretable Counterfactual Explanations By Implicit Minimisation of Epistemic and Aleatoric Uncertainties

Counterfactual explanations (CEs) are a practical tool for demonstrating why machine learning classifiers make particular decisions. For CEs to be useful, it is important that they are easy for users to interpret. Existing methods for generating interpretable CEs rely on auxiliary generative models, which may not be suitable for complex datasets, and incur engineering overhead. We introduce a simple and fast method for generating interpretable CEs in a white-box setting without an auxiliary model, by using the predictive uncertainty of the classifier. Our experiments show that our proposed algorithm generates more interpretable CEs, according to IM1 scores, than existing methods. Additionally, our approach allows us to estimate the uncertainty of a CE, which may be important in safety-critical applications, such as those in the medical domain.


Lisa Schut, Oscar Key, Rory McGrath, Luca Costabello, Bogdan Sacaleanu, Medb Corcoran, Yarin Gal
AISTATS, 2021
[Paper] [Code]
Link to this publication

Towards global flood mapping onboard low cost satellites with machine learning

Spaceborne Earth observation is a key technology for flood response, offering valuable information to decision makers on the ground. Very large constellations of small, nano satellites— ’CubeSats’ are a promising solution to reduce revisit time in disaster areas from days to hours. However, data transmission to ground receivers is limited by constraints on power and bandwidth of CubeSats. Onboard processing offers a solution to decrease the amount of data to transmit by reducing large sensor images to smaller data products. The ESA’s recent PhiSat-1 mission aims to facilitate the demonstration of this concept, providing the hardware capability to perform onboard processing by including a power-constrained machine learning accelerator and the software to run custom applications. This work demonstrates a flood segmentation algorithm that produces flood masks to be transmitted instead of the raw images, while running efficiently on the accelerator aboard the PhiSat-1. Our models ar... [full abstract]


Gonzalo Mateo-Garcia, Joshua Veitch-Michealis, Lewis Smith, Silviu Oprea, Guy Schumann, Yarin Gal, Atılım Güneş Baydin, Dietmar Backes
Nature Scientific Reports, 2021
[Paper]
Link to this publication

Interpretable Neural Architecture Search via Bayesian Optimisation with Weisfeiler-Lehman Kernels

Current neural architecture search (NAS) strategies focus only on finding a single, good, architecture. They offer little insight into why a specific network is performing well, or how we should modify the architecture if we want further improvements. We propose a Bayesian optimisation (BO) approach for NAS that combines the Weisfeiler-Lehman graph kernel with a Gaussian process surrogate. Our method not only optimises the architecture in a highly data-efficient manner, but also affords interpretability by discovering useful network features and their corresponding impact on the network performance. Moreover, our method is capable of capturing the topological structures of the architectures and is scalable to large graphs, thus making the high-dimensional and graph-like search spaces amenable to BO. We demonstrate empirically that our surrogate model is capable of identifying useful motifs which can guide the generation of new architectures. We finally show that our method outpe... [full abstract]


Binxin (Robin) Ru, Xingchen Wan, Xiaowen Dong, Michael A. Osborne
ICLR, 2021
[Paper]
Link to this publication

Uncertainty Quantification for virtual diagnostic of particle accelerators

Virtual diagnostic (VD) is a computational tool based on deep learning that can be used to predict a diagnostic output. VDs are especially useful in systems where measuring the output is invasive, limited, costly or runs the risk of altering the output. Given a prediction, it is necessary to relay how reliable that prediction is, i.e., quantify the uncertainty of the prediction. In this paper, we use ensemble methods and quantile regression neural networks to explore different ways of creating and analyzing prediction’s uncertainty on experimental data from the Linac Coherent Light Source at SLAC National Lab. We aim to accurately and confidently predict the current profile or longitudinal phase space images of the electron beam. The ability to make iformed decisions under uncertainty is crucial for reliable deployment of deep learning tools on safety-critical systems as particle accelerators.


Owen Convery, Lewis Smith, Yarin Gal, Adi Hanuka
Physical Review Accelerators and Beams
[Paper]
Link to this publication

Invariant Representations for Reinforcement Learning without Reconstruction

We study how representation learning can accelerate reinforcement learning from rich observations, such as images, without relying either on domain knowledge or pixel-reconstruction. Our goal is to learn representations that provide for effective downstream control and invariance to task-irrelevant details. Bisimulation metrics quantify behavioral similarity between states in continuous MDPs, which we propose using to learn robust latent representations which encode only the task-relevant information from observations. Our method trains encoders such that distances in latent space equal bisimulation distances in state space. We demonstrate the effectiveness of our method at disregarding task-irrelevant information using modified visual MuJoCo tasks, where the background is replaced with moving distractors and natural videos, while achieving SOTA performance. We also test a first-person highway driving task where our method learns invariance to clouds, weather, and time of day. F... [full abstract]


Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, Sergey Levine
ICLR, 2021 (Oral)
[Paper]
Link to this publication

Improving VAEs' Robustness to Adversarial Attack

Variational autoencoders (VAEs) have recently been shown to be vulnerable to adversarial attacks, wherein they are fooled into reconstructing a chosen target image. However, how to defend against such attacks remains an open problem. We make significant advances in addressing this issue by introducing methods for producing adversarially robust VAEs. Namely, we first demonstrate that methods proposed to obtain disentangled latent representations produce VAEs that are more robust to these attacks. However, this robustness comes at the cost of reducing the quality of the reconstructions. We ameliorate this by applying disentangling methods to hierarchical VAEs. The resulting models produce high–fidelity autoencoders that are also adversarially robust. We confirm their capabilities on several different datasets and with current state-of-the-art VAE adversarial attacks, and also show that they increase the robustness of downstream tasks to attack.


Matthew JF Willetts, Alexander Camuto, Tom Rainforth, Steve Roberts, Christopher Holmes
ICLR, 2021
[Paper]
Link to this publication

Capturing Label Characteristics in VAEs

We present a principled approach to incorporating labels in variational autoencoders (VAEs) that captures the rich characteristic information associated with those labels. While prior work has typically conflated these by learning latent variables that directly correspond to label values, we argue this is contrary to the intended effect of supervision in VAEs—capturing rich label characteristics with the latents. For example, we may want to capture the characteristics of a face that make it look young, rather than just the age of the person. To this end, we develop a novel VAE model, the characteristic capturing VAE (CCVAE), which “reparameterizes” supervision through auxiliary variables and a concomitant variational objective. Through judicious structuring of mappings between latent and auxiliary variables, we show that the CCVAE can effectively learn meaningful representations of the characteristics of interest across a variety of supervision schemes. In particular, we show th... [full abstract]


Tom Joy, Sebastian Schmon, Philip Torr, Siddharth N, Tom Rainforth
ICLR, 2021
[Paper]
Link to this publication

Improving Transformation Invariance in Contrastive Representation Learning

We propose methods to strengthen the invariance properties of representations obtained by contrastive learning. While existing approaches implicitly induce a degree of invariance as representations are learned, we look to more directly enforce invariance in the encoding process. To this end, we first introduce a training objective for contrastive learning that uses a novel regularizer to control how the representation changes under transformation. We show that representations trained with this objective perform better on downstream tasks and are more robust to the introduction of nuisance transformations at test time. Second, we propose a change to how test time representations are generated by introducing a feature averaging approach that combines encodings from multiple transformations of the original input, finding that this leads to across the board performance gains. Finally, we introduce the novel Spirograph dataset to explore our ideas in the context of a differentiable g... [full abstract]


Adam Foster, Rattana Pukdee, Tom Rainforth
ICLR, 2021
[Paper]
Link to this publication

On Statistical Bias In Active Learning: How and When to Fix It

Active learning is a powerful tool when labelling data is expensive, but it introduces a bias because the training data no longer follows the population distribution. We formalize this bias and investigate the situations in which it can be harmful and sometimes even helpful. We further introduce novel corrective weights to remove bias when doing so is beneficial. Through this, our work not only provides a useful mechanism that can improve the active learning approach, but also an explanation for the empirical successes of various existing approaches which ignore this bias. In particular, we show that this bias can be actively helpful when training overparameterized models—like neural networks—with relatively modest dataset sizes.


Sebastian Farquhar, Yarin Gal, Tom Rainforth
ICLR, 2021 (Spotlight)
[Paper]
Link to this publication

Black-Box Optimization with Local Generative Surrogates

We propose a novel method for gradient-based optimization of black-box simulators using differentiable local surrogate models. In fields such as physics and engineering, many processes are modeled with non-differentiable simulators with intractable likelihoods. Optimization of these forward models is particularly challenging, especially when the simulator is stochastic. To address such cases, we introduce the use of deep generative models to iteratively approximate the simulator in local neighborhoods of the parameter space. We demonstrate that these local surrogates can be used to approximate the gradient of the simulator, and thus enable gradient-based optimization of simulator parameters. In cases where the dependence of the simulator on the parameter space is constrained to a low dimensional submanifold, we observe that our method attains minima faster than baseline methods, including Bayesian optimization, numerical optimization, and approaches using score function gradient... [full abstract]


Sergey Shirobokov, Vladislav Belavin, Michael Kagan, Andrey Ustyuzhanin, Atılım Güneş Baydin
Advances in Neural Information Processing Systems 34 (NeurIPS)
[Paper]
Link to this publication

Global Earth Magnetic Field Modeling andForecasting with Spherical Harmonics Decomposition

Modeling and forecasting the solar wind-driven global magnetic field perturbations is an open challenge. Current approaches depend on simulations of computationally demanding models like the Magnetohydrodynamics (MHD) model or sampling spatially and temporally through sparse ground-based stations (SuperMAG). In this paper, we develop a Deep Learning model that forecasts in Spherical Harmonics space, replacing reliance on MHD models and providing global coverage at oneminute cadence, improving over the current state-of-the-art which relies on feature engineering. We evaluate the performance in SuperMAG dataset (improved by 14.53%) and MHD simulations (improved by 24.35%). Additionally, we evaluate the extrapolation performance of the spherical harmonics reconstruction based on sparse ground-based stations (SuperMAG), showing that spherical harmonics can reliably reconstruct the global magnetic field as evaluated on MHD simulation


Panagiotis Tigas, Téo Bloch, Vishal Upendran, Banafsheh Ferdoushi, Yarin Gal, Siddha Ganju, Ryan M. McGranaghan, Mark C. M. Cheung, Asti Bhatt
Machine Learning and the Physical Sciences Workshop - 34th NeurIPS 2020 [Paper]
Determining new representations of “Geoeffectiveness” using deep learning - AGU 2020
Link to this publication

On the robustness of effectiveness estimation of nonpharmaceutical interventions against COVID-19 transmission

There remains much uncertainty about the relative effectiveness of different nonpharmaceutical interventions (NPIs) against COVID-19 transmission. Several studies attempt to infer NPI effectiveness with cross-country, data-driven modelling, by linking from NPI implementation dates to the observed timeline of cases and deaths in a country. These models make many assumptions. Previous work sometimes tests the sensitivity to variations in explicit epidemiological model parameters, but rarely analyses the sensitivity to the assumptions that are made by the choice the of model structure (structural sensitivity analysis). Such analysis would ensure that the inferences made are consistent under plausible alternative assumptions. Without it, NPI effectiveness estimates cannot be used to guide policy. We investigate four model structures similar to a recent state-of-the-art Bayesian hierarchical model. We find that the models differ considerably in the robustness of their NPI effectivene... [full abstract]


Mrinank Sharma, Sören Mindermann, Jan Brauner, Gavin Leech, Anna B. Stephenson, Tomáš Gavenčiak, Jan Kulveit, Yee Whye Teh, Leonid Chindelevitch, Yarin Gal
NeurIPS, 2020
[Paper]
Link to this publication

Capsule Networks: A Generative Probabilistic Perspective

‘Capsule’ models try to explicitly represent the poses of objects, enforcing a linear relationship between an objects pose and those of its constituent parts. This modelling assumption should lead to robustness to viewpoint changes since the object-component relationships are invariant to the poses of the object. We describe a probabilistic generative model that encodes these assumptions. Our probabilistic formulation separates the generative assumptions of the model from the inference scheme, which we derive from a variational bound. We experimentally demonstrate the applicability of our unified objective, and the use of test time optimisation to solve problems inherent to amortised inference.


Lewis Smith, Lisa Schut, Yarin Gal, Mark van der Wilk
Object Oriented Learning Workshop, ICML 2020
[Paper]
Link to this publication

SliceOut: Training Transformers and CNNs faster while using less memory

We demonstrate 10-40% speedups and memory reduction with Wide ResNets, EfficientNets, and Transformer models, with minimal to no loss in accuracy, using SliceOut—a new dropout scheme designed to take advantage of GPU memory layout. By dropping contiguous sets of units at random, our method preserves the regularization properties of dropout while allowing for more efficient low-level implementation, resulting in training speedups through (1) fast memory access and matrix multiplication of smaller tensors, and (2) memory savings by avoiding allocating memory to zero units in weight gradients and activations. Despite its simplicity, our method is highly effective. We demonstrate its efficacy at scale with Wide ResNets & EfficientNets on CIFAR10/100 and ImageNet, as well as Transformers on the LM1B dataset. These speedups and memory savings in training can lead to CO2 emissions reduction of up to 40% for training large models.


Pascal Notin, Aidan Gomez, Joanna Yoo, Yarin Gal
Under review
[Paper]
Link to this publication

Unpacking Information Bottlenecks: Unifying Information-Theoretic Objectives in Deep Learning

The Information Bottleneck principle offers both a mechanism to explain how deep neural networks train and generalize, as well as a regularized objective with which to train models. However, multiple competing objectives are proposed in the literature, and the information-theoretic quantities used in these objectives are difficult to compute for large deep neural networks, which in turn limits their use as a training objective. In this work, we review these quantities and compare and unify previously proposed objectives, which allows us to develop surrogate objectives more friendly to optimization without relying on cumbersome tools such as density estimation. We find that these surrogate objectives allow us to apply the information bottleneck to modern neural network architectures. We demonstrate our insights on MNIST, CIFAR-10 and Imagenette with modern DNN architectures (ResNets).


Andreas Kirsch, Clare Lyle, Yarin Gal
Uncertainty & Robustness in Deep Learning Workshop, ICML, 2020
[Paper] [BibTex] [Poster]
Link to this publication

Learning CIFAR-10 with a Simple Entropy Estimator Using Information Bottleneck Objectives

The Information Bottleneck (IB) principle characterizes learning and generalization in deep neural networks in terms of the change in two information theoretic quantities and leads to a regularized objective function for training neural networks. These quantities are difficult to compute directly for deep neural networks. We show that it is possible to backpropagate through a simple entropy estimator to obtain an IB training method that works for modern neural network architectures. We evaluate our approach empirically on the CIFAR-10 dataset, showing that IB objectives can yield competitive performance on this dataset with a conceptually simple approach while also performing well against adversarial attacks out-of-the-box.


Andreas Kirsch, Clare Lyle, Yarin Gal
Uncertainty & Robustness in Deep Learning Workshop, ICML, 2020
[Paper] [BibTex]
Link to this publication

Invariant Causal Prediction for Block MDPs

Generalization across environments is critical to the successful application of reinforcement learning algorithms to real-world challenges. In this paper, we consider the problem of learning abstractions that generalize in block MDPs, families of environments with a shared latent state space and dynamics structure over that latent space, but varying observations. We leverage tools from causal inference to propose a method of invariant prediction to learn model-irrelevance state abstractions (MISA) that generalize to novel observations in the multi-environment setting. We prove that for certain classes of environments, this approach outputs with high probability a state abstraction corresponding to the causal feature set with respect to the return. We further provide more general bounds on model error and generalization error in the multi-environment setting, in the process showing a connection between causal variable selection and the state abstraction framework for MDPs. We giv... [full abstract]


Amy Zhang, Clare Lyle, Shagun Sodhani, Angelos Filos, Marta Kwiatkowska, Joelle Pineau, Yarin Gal, Doina Precup
Causal Learning for Decision Making Workshop at ICLR, 2020
[Paper]
ICML, 2020
[Paper]
Link to this publication

Divide, Conquer, and Combine: a New Inference Strategy for Probabilistic Programs with Stochastic Support

Universal probabilistic programming systems (PPSs) provide a powerful framework for specifying rich and complex probabilistic models. They further attempt to automate the process of drawing inferences from these models, but doing this successfully is severely hampered by the wide range of non–standard models they can express. As a result, although one can specify complex models in a universal PPS, the provided inference engines often fall far short of what is required. In particular, we show they produce surprisingly unsatisfactory performance for models where the support may vary between executions, often doing no better than importance sampling from the prior. To address this, we introduce a new inference framework: Divide, Conquer, and Combine, which remains efficient for such models, and show how it can be implemented as an automated and general-purpose PPS inference engine. We empirically demonstrate substantial performance improvements over existing approaches on two examp... [full abstract]


Yuan Zhou, Hongseok Yang, Yee Whye Teh, Tom Rainforth
ICML, 2020
[Paper]
Link to this publication

Try Depth Instead of Weight Correlations: Mean-field is a Less Restrictive Assumption for Deeper Networks

We challenge the longstanding assumption that the mean-field approximation for variational inference in Bayesian neural networks is severely restrictive. We argue mathematically that full-covariance approximations only improve the ELBO if they improve the expected log-likelihood. We further show that deeper mean-field networks are able to express predictive distributions approximately equivalent to shallower full-covariance networks. We validate these observations empirically, demonstrating that deeper models decrease the divergence between diagonal- and full-covariance Gaussian fits to the true posterior.


Sebastian Farquhar, Lewis Smith, Yarin Gal
Contributed talk, Workshop on Bayesian Deep Learning, NeurIPS 2019
[Workshop paper], [arXiv]
Link to this publication

Gradient \(\ell_1\) Regularization for Quantization Robustness

We analyze the effect of quantizing weights and activations of neural networks on their loss and derive a simple regularization scheme that improves robustness against post-training quantization. By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths as energy and memory requirements of the application change. Unlike quantization-aware training using the straight-through estimator that only targets a specific bit-width and requires access to training data and pipeline, our regularization-based method paves the way for ``on the fly’’ post-training quantization to various bit-widths. We show that by modeling quantization as a \(\ell_\infty\)-bounded perturbation, the first-order term in the loss expansion can be regularized using the \(\ell_1\)-norm of gradients. We experimentally validate our method on different architectures on CIFAR-10 and ImageNet datasets and show that the regulariz... [full abstract]


Milad Alizadeh, Arash Behboodi, Mart van Baalen, Christos Louizos, Tijmen Blankevoort, Max Welling
ICLR, 2020
[OpenReview]
Link to this publication

BayesOpt Adversarial Attack

Black-box adversarial attacks require a large number of attempts before finding successful adversarial examples that are visually indistinguishable from the original input. Current approaches relying on substitute model training, gradient estimation or genetic algorithms often require an excessive number of queries. Therefore, they are not suitable for real-world systems where the maximum query number is limited due to cost. We propose a query-efficient black-box attack which uses Bayesian optimisation in combination with Bayesian model selection to optimise over the adversarial perturbation and the optimal degree of search space dimension reduction. We demonstrate empirically that our method can achieve comparable success rates with 2-5 times fewer queries compared to previous state-of-the-art black-box attacks.


Binxin (Robin) Ru, Adam Cobb, Arno Blaas, Yarin Gal
ICLR, 2020
[OpenReview]
Link to this publication

Using U-Nets to Create High-Fidelity Virtual Observations of the Solar Corona

Understanding and monitoring the complex and dynamic processes of the Sun is important for a number of human activities on Earth and in space. For this reason, NASA’s Solar Dynamics Observatory (SDO) has been continuously monitoring the multi-layered Sun’s atmosphere in high-resolution since its launch in 2010, generating terabytes of observational data every day. The synergy between machine learning and this enormous amount of data has the potential, still largely unexploited, to advance our understanding of the Sun and extend the capabilities of heliophysics missions. In the present work, we show that deep learning applied to SDO data can be successfully used to create a high-fidelity virtual telescope that generates synthetic observations of the solar corona by image translation. Towards this end we developed a deep neural network, structured as an encoder-decoder with skip connections (U-Net), that reconstructs the Sun’s image of one instrument channel given temporally align... [full abstract]


Valentina Salvatelli, Souvik Bose, Brad Neuberg, Luiz F. G. dos Santos, Mark Cheung, Miho Janvier, Atılım Güneş Baydin, Yarin Gal, Meng Jin
Machine Learning and the Physical Sciences Workshop (ML4PS), NeurIPS 2019
[arXiv]
Link to this publication

Auto-Calibration of Remote Sensing Solar Telescopes with Deep Learning

As a part of NASA’s Heliophysics System Observatory (HSO) fleet of satellites,the Solar Dynamics Observatory (SDO) has continuously monitored the Sun since2010. Ultraviolet (UV) and Extreme UV (EUV) instruments in orbit, such asSDO’s Atmospheric Imaging Assembly (AIA) instrument, suffer time-dependent degradation which reduces instrument sensitivity. Accurate calibration for (E)UV instruments currently depends on periodic sounding rockets, which are infrequent and not practical for heliophysics missions in deep space. In the present work, we develop a Convolutional Neural Network (CNN) that auto-calibrates SDO/AIA channels and corrects sensitivity degradation by exploiting spatial patterns in multi-wavelength observations to arrive at a self-calibration of (E)UV imaging instruments. Our results remove a major impediment to developing future HSOmissions of the same scientific caliber as SDO but in deep space, able to observe the Sun from more vantage points than just SDO’s curren... [full abstract]


Brad Neuberg, Souvik Bose, Valentina Salvatelli, Luiz F.G. dos Santos, Mark Cheung, Miho Janvier, Atılım Güneş Baydin, Yarin Gal, Meng Jin
Machine Learning and the Physical Sciences Workshop (ML4PS), NeurIPS 2019
[arXiv]
Link to this publication

PAC-Bayes Generalization Bounds for Invariant Neural Networks

Invariance is widely described as a desirable property of neural networks, but the mechanisms by which it benefits deep learning remain shrouded in mystery. We show that building invariance into model architecture via feature averaging provably tightens PAC-Bayes generalization bounds, as compared to data augmentation. Furthermore, through a link to the marginal likelihood and Bayesian model selection, we provide justification for using the improvement in these bounds for model selection. Our key observation is that invariance doesn’t just reduce variance in deep learning: it also changes the parameter-function mapping, and this leads better provable guarantees for the model. We verify our theoretical results empirically on a permutation-invariant dataset.


Clare Lyle, Marta Kwiatkowska, Yarin Gal
14th Women in Machine Learning Workshop (WiML 2019)
[WiML]
Link to this publication

Prediction of GNSS Phase Scintillations: A Machine Learning Approach

A Global Navigation Satellite System (GNSS) uses a constellation of satellites around the earth for accurate navigation, timing, and positioning. Natural phenomena like space weather introduce irregularities in the Earth’s ionosphere, disrupting the propagation of the radio signals that GNSS relies upon. Such disruptions affect both the amplitude and the phase of the propagated waves. No physics-based model currently exists to predict the time and location of these disruptions with sufficient accuracy and at relevant scales. In this paper, we focus on predicting the phase fluctuations of GNSS radio waves, known as phase scintillations. We propose a novel architecture and loss function to predict 1 hour in advance the magnitude of phase scintillations within a time window of plus-minus 5 minutes with state-of-the-art performance.


Kara Lamb, Garima Malhotra, Athanasios Vlontzos, Edward Wagstaff, Atılım Güneş Baydin, Anahita Bhiwandiwalla, Yarin Gal, Freddie Kalaitzis, Anthony Reina, Asti Bhatt
Machine Learning and the Physical Sciences Workshop (ML4PS), NeurIPS 2019
[arXiv]
Link to this publication

Correlation of Auroral Dynamics and GNSS Scintillation with an Autoencoder

High energy particles originating from solar activity travel along the the Earth’s magnetic field and interact with the atmosphere around the higher latitudes. These interactions often manifest as aurora in the form of visible light in the Earth’s ionosphere. These interactions also result in irregularities in the electron density, which cause disruptions in the amplitude and phase of the radio signals from the Global Navigation Satellite Systems (GNSS), known as ‘scintillation’. In this paper we use a multi-scale residual autoencoder (Res-AE) to show the correlation between specific dynamic structures of the aurora and the magnitude of the GNSS phase scintillations (σϕ). Auroral images are encoded in a lower dimensional feature space using the Res-AE, which in turn are clustered with t-SNE and UMAP. Both methods produce similar clusters, and specific clusters demonstrate greater correlations with observed phase scintillations. Our results suggest that specific dynamic structure... [full abstract]


Kara Lamb, Garima Malhotra, Athanasios Vlontzos, Edward Wagstaff, Atılım Güneş Baydin, Anahita Bhiwandiwalla, Yarin Gal, Freddie Kalaitzis, Anthony Reina, Asti Bhatt
Machine Learning and the Physical Sciences Workshop (ML4PS), NeurIPS 2019
[arXiv]
Link to this publication

Single-Frame Super-Resolution of Solar Magnetograms; Investigating Physics-Based Metrics & Losses

Breakthroughs in our understanding of physical phenomena have traditionally followed improvements in instrumentation. Studies of the magnetic field of the Sun, and its influence on the solar dynamo and space weather events, have benefited from improvements in resolution and measurement frequency of new instruments. However, in order to fully understand the solar cycle, high-quality data across time-scales longer than the typical lifespan of a solar instrument are required. At the moment, discrepancies between measurement surveys prevent the combined use of all available data. In this work, we show that machine learning can help bridge the gap between measurement surveys by learning to super-resolve low-resolution magnetic field images and translate between characteristics of contemporary instruments in orbit. We also introduce the notion of physics-based metrics and losses for super-resolution to preserve underlying physics and constrain the solution space of possible super-reso... [full abstract]


Anna Jungbluth, Xavier Gitiaux, Shane A.Maloney, Carl Shneider, Paul J. Wright, Freddie Kalaitzis, Michel Deudon, Atılım Güneş Baydin, Yarin Gal, Andrés Muñoz-Jaramillo
Machine Learning and the Physical Sciences Workshop (ML4PS), NeurIPS 2019
[arXiv]
Link to this publication

Flood Detection On Low Cost Orbital Hardware

Satellite imaging is a critical technology for monitoring and responding to natural disasters such as flooding. Despite the capabilities of modern satellites, there is still much to be desired from the perspective of first response organisations like UNICEF. Two main challenges are rapid access to data, and the ability to automatically identify flooded regions in images. We describe a prototypical flood segmentation system, identifying cloud, water and land, that could be deployed on a constellation of small satellites, performing processing on board to reduce downlink bandwidth by 2 orders of magnitude. We target PhiSat-1, part of the FSSCAT mission, which is planned to be launched by the European Space Agency (ESA) near the start of 2020 as a proof of concept for this new technology.


Joshua Veitch-Michaelis, Gonzalo Mateo-Garcia, Silviu Oprea, Lewis Smith, Atılım Güneş Baydin, Dietmar Backes, Yarin Gal, Guy Schumann
Spotlight talk, Artificial Intelligence for Humanitarian Assistance and Disaster Response (AI+HADR) NeurIPS 2019 Workshop
[arXiv]
Link to this publication

FDL: Mission Support Challenge

The Frontier Development Lab (FDL) is a National Aeronautics and Space Administration (NASA) machine learning program with the stated aim of conducting artificial intelligence research for space exploration and all humankind with support in the European program from the European Space Agency (ESA). Interdisciplinary teams of researchers and data-scientists are brought together to tackle a range of challenging, real-world problems in the space-domain. The program primarily consists of a sprint phase during which teams tackle separate problems in the spirit of ‘coopetition’. Teams are given a problem brief by real stakeholders and mentored by a range of experts. With access to exceptional computational resources, we were challenged to make a serious contribution within just eight weeks. Stated simply, our team was tasked with producing a system capable of scheduling downloads from satellites autonomously. Scheduling is a difficult problem in general, of course, complicated further... [full abstract]


Luís F. Simões, Ben Day, Vinutha M. Shreenath, Callum Wilson, Chris Bridges, Sylvester Kaczmarek, Yarin Gal
NeurIPS 2019 Workshop on Machine Learning Competitions for All
[arXiv]
Link to this publication

Machine Learning for Generalizable Prediction of Flood Susceptibility

Flooding is a destructive and dangerous hazard and climate change appears to be increasing the frequency of catastrophic flooding events around the world. Physics-based flood models are costly to calibrate and are rarely generalizable across different river basins, as model outputs are sensitive to site-specific parameters and human-regulated infrastructure. In contrast, statistical models implicitly account for such factors through the data on which they are trained. Such models trained primarily from remotely-sensed Earth observation data could reduce the need for extensive in-situ measurements. In this work, we develop generalizable, multi-basin models of river flooding susceptibility using geographically-distributed data from the USGS stream gauge network. Machine learning models are trained in a supervised framework to predict two measures of flood susceptibility from a mix of river basin attributes, impervious surface cover information derived from satellite imagery, and h... [full abstract]


Chelsea Sidrane, Dylan J Fitzpatrick, Andrew Annex, Diane O’Donoghue, Piotr Bilinksi, Yarin Gal
Spotlight talk, Artificial Intelligence for Humanitarian Assistance and Disaster Response (AI+HADR) NeurIPS 2019 Workshop
[arXiv]
Link to this publication

Location Conditional Image Generation using Generative Adversarial Networks

Can an AI-artist instil the emotion of sense of place in its audience? Motivated by this thought, this paper presents our endeavours to make a GANs model learn the visual characteristics of locations to achieve creativity. The project’s novelty lies in addressing the problem of the hardness of GANs training for an extremely diverse dataset in a contextual setting. The project explores GANs as an impressionist artist who adds its perspective to the artwork without hampering photo realism.


Mayur Saxena, Aidan Gomez, Yarin Gal
Machine Learning for Creativity and Design NeurIPS 2019 Workshop
[Paper]
Link to this publication

On the Benefits of Disentangled Representations

Recently there has been a significant interest in learning disentangled representations, as they promise increased interpretability, generalization to unseen scenarios and faster learning on downstream tasks. In this paper, we investigate the usefulness of different notions of disentanglement for improving the fairness of downstream prediction tasks based on representations. We consider the setting where the goal is to predict a target variable based on the learned representation of high-dimensional observations (such as images) that depend on both the target variable and an unobserved sensitive variable. We show that in this setting both the optimal and empirical predictions can be unfair, even if the target variable and the sensitive variable are independent. Analyzing more than 12600 trained representations of state-of-the-art disentangled models, we observe that various disentanglement scores are consistently correlated with increased fairness, suggesting that disentanglemen... [full abstract]


Francesco Locatello, Gabriele Abbati, Tom Rainforth, Stefan Bauer, Bernhard Schölkopf, Olivier Bachem
NeurIPS, 2019
[arXiv]
Link to this publication

Variational Bayesian Optimal Experimental Design

Bayesian optimal experimental design (BOED) is a principled framework for making efficient use of limited experimental resources. Unfortunately, its applicability is hampered by the difficulty of obtaining accurate estimates of the expected information gain (EIG) of an experiment. To address this, we introduce several classes of fast EIG estimators by building on ideas from amortized variational inference. We show theoretically and empirically that these estimators can provide significant gains in speed and accuracy over previous approaches. We further demonstrate the practicality of our approach on a number of end-to-end experiments.


Adam Foster, Martin Jankowiak, Eli Bingham, Paul Horsfall, Yee Whye Teh, Tom Rainforth, Noah Goodman
NeurIPS, 2019
[arXiv]
Link to this publication

An Analysis of the Effect of Invariance on Generalization in Neural Networks

Invariance is often cited as a desirable property of machine learning systems, claimed to improve model accuracy and reduce overfitting. Empirically, invariant models often generalize better than their non-invariant counterparts. But is it possible to show that invariant models provably do so? In this paper we explore the effect of invariance on model generalization. We find strong Bayesian and frequentist motivations for enforcing invariance which leverage recent results connecting PAC-Bayes generalization bounds and the marginal likelihood. We make use of these results to perform model selection on neural networks.


Clare Lyle, Marta Kwiatkowska, Mark van der Wilk, Yarin Gal
Understanding and Improving Generalization in Deep Learning workshop, ICML, 2019
[Paper]
Link to this publication

Towards Inverse Reinforcement Learning for Limit Order Book Dynamics

We investigate whether Inverse Reinforcement Learning (IRL) can infer rewards from agents within real financial stochastic environments: limit order books (LOB). Our results illustrate that complex behaviours, induced by non-linear reward functions amid agent-based stochastic scenarios, can be deduced through inference, encouraging the use of inverse reinforcement learning for opponent-modelling in multi-agent systems.


Jacobo Roa-Vicens, Cyrine Chtourou, Angelos Filos, Francisco Rullan, Yarin Gal, Ricardo Silva
Oral Presentation, Multi-Agent Learning Workshop at the 36th International Conference on Machine Learning, 2019
[arXiv] [BibTex]
Link to this publication

Multi³Net: Segmenting Flooded Buildings via Fusion of Multiresolution, Multisensor, and Multitemporal Satellite Imagery

We propose a novel approach for rapid segmentation of flooded buildings by fusing multiresolution, multisensor, and multitemporal satellite imagery in a convolutional neural network. Our model significantly expedites the generation of satellite imagery-based flood maps, crucial for first responders and local authorities in the early stages of flood events. By incorporating multitemporal satellite imagery, our model allows for rapid and accurate post-disaster damage assessment and can be used by governments to better coordinate medium- and long-term financial assistance programs for affected areas. The network consists of multiple streams of encoder-decoder architectures that extract spatiotemporal information from medium-resolution images and spatial information from high-resolution images before fusing the resulting representations into a single medium-resolution segmentation map of flooded buildings. We compare our model to state-of-the-art methods for building footprint segme... [full abstract]


Tim G. J. Rudner, Marc Rußwurm, Jakub Fil, Ramona Pelich, Benjamin Bischke, Veronika Kopackova, Piotr Bilinski
AAAI 2019
NeurIPS 2018 Workshop AI for Social Good
[arXiv] [Code] [BibTex] [Media]
Link to this publication

Targeted Dropout

Neural networks are extremely flexible models due to their large number of parameters, which is beneficial for learning, but also highly redundant. This makes it possible to compress neural networks without having a drastic effect on performance. We introduce targeted dropout, a strategy for post hoc pruning of neural network weights and units that builds the pruning mechanism directly into learning. At each weight update, targeted dropout selects a candidate set for pruning using a simple selection criterion, and then stochastically prunes the network via dropout applied to this set. The resulting network learns to be explicitly robust to pruning, comparing favourably to more complicated regularization schemes while at the same time being extremely simple to implement, and easy to tune.


Aidan Gomez, Ivan Zhang, Kevin Swersky, Yarin Gal, Geoffrey E. Hinton
Workshop on Compact Deep Neural Networks with industrial applications, NeurIPS 2018
[Paper] [BibTex]
Link to this publication

An Empirical study of Binary Neural Networks' Optimisation

Binary neural networks using the Straight-Through-Estimator (STE) have been shown to achieve state-of-the-art results, but their training process is not well-founded. This is due to the discrepancy between the evaluated function in the forward path, and the weight updates in the back-propagation, updates which do not correspond to gradients of the forward path. Efficient convergence and accuracy of binary models often rely on careful fine-tuning and various ad-hoc techniques. In this work, we empirically identify and study the effectiveness of the various ad-hoc techniques commonly used in the literature, providing best-practices for efficient training of binary models. We show that adapting learning rates using second moment methods is crucial for the successful use of the STE, and that other optimisers can easily get stuck in local minima. We also find that many of the commonly employed tricks are only effective towards the end of the training, with these methods making early ... [full abstract]


Milad Alizadeh, Javier Fernández-Marqués, Nicholas D. Lane, Yarin Gal
International Conference on Learning Representations (ICLR), 2019
[Paper] [Code]
Link to this publication

Automating Asteroid Shape Modeling From Radar Images

Characterizing the shapes and spin states of near-Earth asteroids is essential both for trajectory predictions to rule out potential future Earth impacts and for planning spacecraft missions. But reconstructing objects’ shapes and spins from delay-Doppler data is a computationally intensive inversion problem. We implement a Bayesian optimization routine that uses SHAPE to autonomously search the space of spin-state parameters, yielding spin state constraints within a factor of 3 less computer runtime and minimal human supervision. These routines are now being incorporated into radar data processing pipelines at Arecibo.


Michael W. Busch, Agata Rozek, Sean Marshall, Grace Young, Adam Cobb, Chedy Raissi, Yarin Gal, Lance Benner, Shantanu Naidu, Marina Brozovic, Patrick Taylor
COSPAR (Committee on Space Research) Assembly, 2018
[Blog Post (Adam Cobb)] [BibTex]
Link to this publication

Using Pre-trained Full-Precision Models to Speed Up Training Binary Networks For Mobile Devices

Binary Neural Networks (BNNs) are well-suited for deploying Deep Neural Networks (DNNs) to small embedded devices but state-of-the-art BNNs need to be trained from scratch. We show how weights from a trained full-precision model can be used to speed-up training binary networks. We show that for CIFAR-10, accuracies within 1% of the full-precision model can be achieved in just 5 epochs.


Milad Alizadeh, Nicholas D. Lane, Yarin Gal
16th ACM International Conference on Mobile Systems (MobiSys), 2018
[Abstract] [BibTex]
Link to this publication

More publications here: publications

Are you looking to do a PhD in machine learning? Did you do a PhD in another field and want to do a postdoc in machine learning? Would you like to visit the group?

How to apply


Contact

We are located at
Department of Computer Science, University of Oxford
Wolfson Building
Parks Road
OXFORD
OX1 3QD
UK
Twitter: @OATML_Oxford
Github: OATML
Email: oatml@cs.ox.ac.uk