Seminars are usually held in the main seminar room (Euler Room - A002) of the Euler building.
If you wish to receive the seminar announcements, please send an email to
The subject line should be 'subscribe big-data yourname' where yourname is your first and last name.

Tuesday 29 May, 14h00
Leveraging historical data for high-dimensional regression adjustment, a machine learning approach
Samuel Branders (UCL)

The amount of data collected from patients involved in clinical trials is continuously growing. All those patient characteristics are potential covariates that could be used to improve study analysis and power. At the same time, the development of computerized systems simplifies access to huge amounts of historical data. However, it is still difficult to leverage those data when dealing with small clinical trials, such as in Phases I and II. Their restricted number of patients limits the number of covariates that can be included in the analysis. The purpose of this talk is to present how machine learning can overcome this problem by taking advantage of historical data with larger sample sizes. Our approach is to pre-specify the combination of the baseline covariates by building a "meta-covariate". In small studies, using this meta-covariate alone will limit the loss of degrees of freedom while making the best use of all generated data. Two advantages of fitting the covariates on independent data are to free the modeling from the study constraints and to limit the risk of overfitting. These are of particular interest with complex data, i.e. non-normal distributions or the presence of non-linearities. To demonstrate the benefit of the methodology, we discuss several questions:
- How many covariates are too many? Can we go beyond the simple rule of thumb of one variable per 10 patients?
- What should be the minimum performance of the ML model in this context?
- When should we use machine learning?

Tuesday 08 May, 14h00
Data science for precision medicine
Thibault Helleputte (DNAlytics)

The field of precision medicine aims at developing new decision-support tools for healthcare professionals, in order to improve patients' health and healthcare system sustainability, and potentially to better position some pharmaceutical products. These tools can be developed by combining a broad range of health-related data and mathematical modeling strategies, notably machine learning approaches. In this talk, we will discuss how some very well defined machine learning approaches for predictive model training, feature selection and performance evaluation survive the reality of biology, medicine and clinical applications.

Tuesday 24 April, 14h00 (Maxwell building, Shannon room)
Locating the source of diffusion in large-scale and random networks
Patrick Thiran (EPFL)

We will survey some results on the localization of the source of diffusion in a network. There have been significant efforts in studying the dynamics of epidemic propagation on networks, and more particularly on the forward problem of epidemics: understanding the diffusion process and its dependence on the infecting and curing rates. We address here the inverse problem of inferring the original source of diffusion, given the infection data gathered at some of the nodes in the network. Indeed, because of the large size of many real networks, the state of all nodes in a network cannot in general be observed. We show that it is fundamentally possible to estimate the location of the source from measurements collected by sparsely placed observers. We present a strategy that is optimal for arbitrary trees, achieving maximum probability of correct localization, and describe efficient implementations for arbitrary graphs. When propagation times are (close to) deterministic values, the smallest number of sensors needed to exactly localize the source is the double metric dimension of the network. We compute tight bounds for this quantity in a number of topologies that include Erdős-Rényi random graphs, and we comment on its implications for the detectability of a source in actual networks. This is joint work with Elisa Celis, Pedro Pinto, Brunella Spinelli and Martin Vetterli.

Tuesday 16 April, 14h00
The Hierarchical Adaptive Forgetting Variational Filter
Vincent Moens (COSY, Institute of Neuroscience, UCL)

A common problem in Machine Learning and statistics consists in detecting whether the current sample in a stream of data belongs to the same distribution as previous ones, is an isolated outlier, or inaugurates a new distribution of data. We present a hierarchical Bayesian algorithm that aims at learning a time-specific approximate posterior distribution of the parameters describing the distribution of the observed data. We derive the update equations of the variational parameters of the approximate posterior at each time step for models from the exponential family, and show that these updates have interesting counterparts in Reinforcement Learning (RL). In this perspective, our model can be seen as a hierarchical RL algorithm that learns a posterior distribution according to a certain stability confidence that is, in turn, learned according to its own stability confidence. Finally, we show some applications of our generic model, first in an RL context, next with an adaptive Bayesian Autoregressive model, and finally in the context of Stochastic Gradient Descent optimization.

Tuesday 13 March, 14h00
Approximate Bayesian inference in Machine Learning: where stats meets optimisation
Thibaut Lienart (University of Oxford)

In this talk I will discuss some recent work on distributed approximate Bayesian inference for Machine Learning, where one is interested not only in finding good parameters for a given model (e.g. the weights of a Neural Network) but also in modelling the uncertainty around these parameters, and thereby the uncertainty in the predictions made with the model. This is of tremendous importance in modern applications of ML such as self-driving cars and healthcare. I will show how approximate Bayesian inference converts a hard integration problem into a simpler optimisation problem and discuss some approaches that have been used to tackle the optimisation. Finally, I will give a critical view on the field, hopefully leading to a discussion.
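For readers unfamiliar with the idea, the standard variational-inference identity illustrates how integration becomes optimisation. This is the textbook formulation and not necessarily the one used in the talk; the notation (data x, parameters θ, approximating family Q) is assumed here for illustration.

```latex
% Computing the posterior requires the evidence, an integral over all parameters:
%   p(\theta \mid x) = p(x, \theta) / p(x), \qquad p(x) = \int p(x, \theta)\, d\theta .
% For any distribution q(\theta), the log-evidence decomposes as
\log p(x)
  = \underbrace{\mathbb{E}_{q(\theta)}\!\left[\log p(x, \theta)\right]
      - \mathbb{E}_{q(\theta)}\!\left[\log q(\theta)\right]}_{\mathrm{ELBO}(q)}
  + \mathrm{KL}\!\left(q(\theta) \,\|\, p(\theta \mid x)\right).
% Since \log p(x) does not depend on q, the best approximation within Q is
q^{\star} = \arg\max_{q \in \mathcal{Q}} \mathrm{ELBO}(q)
          = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\!\left(q(\theta) \,\|\, p(\theta \mid x)\right).
```

The intractable integral p(x) never has to be evaluated: only expectations under the chosen q are needed, and these can be optimised over the parameters of q.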

Tuesday 27 February, 16h30
Increasing stability of training of Deep CNNs with Stochastic Gradient Descent method. Application to Image classification tasks
Jenny Benois-Pineau (University of Bordeaux)

Supervised learning for classification of visual data has been the major approach during the last two decades. With the advent of GPUs, well-known supervised classifiers such as Artificial Neural Networks (ANNs) came to be applied to these problems. Convolutional Neural Networks (CNNs) are a specific case of ANNs, designed specifically for visual information classification tasks such as object recognition and localization, visual tracking, saliency prediction and image categorization. In contrast to the usual fully connected Artificial Neural Networks, their main characteristic is the limitation of the receptive field of neurons using the convolution operation, followed by data reduction through pooling of features. By stacking convolution, pooling and non-linearity layers in deeper and deeper architectures, classifiers such as AlexNet, VGG, GoogLeNet and ResNet have been built. Training Deep CNNs requires a large amount of labelled data and, despite the availability of computational resources, remains a heavy task. For parameter optimisation, first-order methods such as gradient descent are used, namely stochastic gradient descent (SGD). The conditions for convergence of these optimizers, i.e. the convexity of the objective function, are not guaranteed. This is why different forms of SGD have been proposed. Still, the optimisation process remains unstable, and it is therefore difficult to identify the iteration at which to stop. In our talk we will present the main principles of Deep CNNs for image classification tasks and develop an approach we propose to smooth the objective during training. We study it experimentally and present results on the well-known MNIST image database.

Tuesday 12 Dec, 14h00
Positive semi-definite embedding for dimensionality reduction and out-of-sample extensions
Michaël Fanuel (UCL)

Dimensionality reduction of data is often an important preprocessing step in statistics or machine learning. Several non-linear techniques exist, for instance, Diffusion Maps or, more generally, manifold learning. In this context, we will discuss a recent proposal to reduce the dimension of Euclidean data by relying on methods developed for graph embedding. More specifically, we discuss a numerical method phrased as a Semi-Definite Program for estimating a distance (semi-metric) on the data cloud. As an interesting feature, this positive semi-definite "matrix" can be extended to a "kernel" function thanks to an out-of-sample formula. Indeed, if any new data point arises, it may also be embedded at a low computational cost thanks to the extension formula. Inspired by this property, we will also briefly discuss an infinite dimensional analogue of this discrete problem in the context of machine learning.

Thursday 2 Nov, 11h00
Graph embeddings and community detection -- a control theoretic perspective
Michael Schaub (MIT and Oxford University)

Community detection, the task of partitioning a network into groups of nodes similar to each other according to some criterion, has received enormous attention in the past decade. Many different notions of what constitutes a good community exist in the literature, some based on finding groups with high edge-density, a small cut between the different groups, or by assuming a particular (generative) model for the structure of the network as a whole. In this talk we focus on a dynamical notion of community structure: we think of our system under study as a network of coupled dynamical units, and aim to find groups of nodes which influence the system in a coherent way for a particular time-scale. Previous work, focusing on diffusion processes, has shown that such a perspective can indeed be fruitful, and that notions such as Modularity, spectral clustering and various other well-known graph partitioning heuristics can be recovered using this point of view. In this talk we reconsider such a dynamical approach towards community detection using a more control-theoretic perspective. This enables us not only to consider more general system descriptions, such as networks with both positive and negative edges, but also reveals interesting connections to ideas used in dimensionality reduction and graph embeddings. More importantly, however, it also allows us to connect to concepts considered in Control Theory, such as controllability and observability Gramians, thereby offering a fresh perspective on the problems of community detection and graph embeddings.

Thursday 11 Oct, 11h00
Temporal Pattern of (Re)tweets Reveal Cascade Migration
Ayan Bhowmick (Indian Institute of Technology Kharagpur)

Twitter has recently become one of the most popular online social networking websites, where users can share news and ideas through messages in the form of tweets. As a tweet gets retweeted from user to user, large cascades of information diffusion are formed over the Twitter follower network. Existing works on cascades have mainly focused on predicting their popularity in terms of size. In this paper, we leverage the temporal pattern of retweets to model the diffusion dynamics of a cascade. Notably, retweet cascades provide two complementary kinds of information: (a) inter-retweet time intervals of retweets, and (b) diffusion of the cascade over the underlying follower network. Using datasets from Twitter, we identify two types of cascades based on the presence or absence of early peaks in their sequence of inter-retweet intervals. We identify multiple diffusion localities associated with a cascade as it propagates over the network. Our studies reveal the transition of a cascade to a new locality, facilitated by pivotal users that are highly cascade dependent, following saturation of the current locality. We propose an analytical model to show the co-occurrence of first peaks and cascade migration to a new locality, as well as to predict locality saturation from inter-retweet intervals. Finally, we validate these claims from empirical data showing co-occurrence of first peaks and migration with good accuracy; we obtain even better accuracy for successfully classifying saturated and non-saturated diffusion localities from inter-retweet intervals.

Monday 11 Sep, 16h30 (SCES03)
Data science for modeling opinion dynamics on social media
Corentin Vande Kerckhove (UCL)

Over the last decade, the arrival of social media has established new practices which have changed the way humans interact. Nowadays, anyone can access personal information by simply following a subset of relevant users with a social media account. Recent technologies such as the Web 2.0 with user-generated content have allowed individual actions to be combined through social interactions at a scale that was previously unreachable. The availability of online data has led to a recent surge in attempts to understand how social influence impacts human behavior. Detecting influential actors and accurately measuring people’s opinions are major challenges for data scientists. This work investigates a set of relevant questions concerning the study of online opinions. How precisely can we predict the variations of judgments resulting from social influence, even when the information is scarce? Does social influence promote or undermine the performance of a group? Are we able to predict the voting behavior of citizens from their online posts and comments? In this thesis, we aim at understanding how opinions are expressed on social media, and whether their evolution can be anticipated by taking the interaction between agents into account. In the first part of the thesis, we introduced the concept of intrinsic noise through a variable that captures the part of human judgment revision that is composed of unpredictable variations. To quantify opinion dynamics that are subject to social influence, we carried out online experiments in which the participants had to estimate some quantities while receiving information about the other participants’ opinions. In the end, we discovered that about two thirds of the errors made by classical opinion dynamics models are due to these unpredictable variations. In the second part of the thesis, we focused on the measurement of people’s opinions based on digital traces and we analyzed the communication patterns arising on real social media.
In the context of political elections, our results showed that valid ideological positions can be deduced for political actors and engaged citizens based solely on network and textual data publicly available on social media platforms.

Monday 11 Sep, 14h00
Achieving budget-optimality with adaptive schemes in crowdsourcing
Sewoong Oh (University of Illinois at Urbana-Champaign)

Crowdsourcing platforms provide marketplaces where task requesters can pay to get labels on their data. Such markets have emerged recently as popular venues for collecting annotations that are crucial in training machine learning models in various applications. However, as jobs are tedious and payments are low, errors are common in such crowdsourced labels. A common strategy to overcome such noise in the answers is to add redundancy by getting multiple answers for each task and aggregating them using a method such as majority voting. For such a system, there is a fundamental question of interest: how can we maximize the accuracy given a fixed budget on how many responses we can collect on the crowdsourcing system? We characterize this fundamental trade-off between the budget (how many answers the requester can collect in total) and the accuracy in the estimated labels. In particular, we ask whether adaptive task assignment schemes lead to a more efficient trade-off between the accuracy and the budget.
Adaptive schemes, where tasks are assigned adaptively based on the data collected thus far, are widely used in practical crowdsourcing systems to efficiently use a given fixed budget. However, existing theoretical analyses of crowdsourcing systems suggest that the gain of adaptive task assignments is minimal. To bridge this gap, we investigate this question under a strictly more general probabilistic model, which has been recently introduced to model practical crowdsourced annotations. Under this generalized Dawid-Skene model, we characterize the fundamental trade-off between budget and accuracy. I will present a novel adaptive task assignment scheme that matches this fundamental limit. This allows us to quantify the fundamental gap between adaptive and non-adaptive schemes, by comparing the trade-off with the one for non-adaptive schemes.
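To make the non-adaptive baseline mentioned in the abstract concrete, here is a minimal sketch of redundancy plus majority voting. It is not the speaker's adaptive scheme; the function names, the toy answers and the tie-breaking rule are assumptions chosen for illustration only.

```python
from collections import Counter


def majority_vote(answers):
    """Aggregate redundant labels per task by majority voting.

    `answers` maps a task id to the list of labels collected for it
    (hypothetical data layout). Ties are broken by taking the smallest
    label, an arbitrary choice made so the result is deterministic.
    """
    aggregated = {}
    for task, labels in answers.items():
        counts = Counter(labels)
        top = max(counts.values())
        aggregated[task] = min(label for label, c in counts.items() if c == top)
    return aggregated


def accuracy(estimates, truth):
    """Fraction of tasks whose aggregated label matches the true label."""
    correct = sum(estimates[task] == truth[task] for task in truth)
    return correct / len(truth)


# Toy example: 3 tasks, 3 answers each, i.e. a total budget of 9 responses.
answers = {"t1": [1, 1, 0], "t2": [0, 0, 1], "t3": [1, 0, 0]}
truth = {"t1": 1, "t2": 0, "t3": 1}

estimates = majority_vote(answers)
print(estimates, accuracy(estimates, truth))
```

Here every task receives the same number of answers regardless of how ambiguous it looks. An adaptive scheme would instead decide where to spend the remaining budget based on the answers collected so far.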
