WP1: LARGE SCALE DATA AND SYSTEMS (UCL, KUL, UGent, ULg, UMons, UNamur)

Introduction and state of the art

The numerical treatment of accurate models of systems from engineering and the bio-sciences, which take into account distributed parameters, multiple time-scales and cross-coupling between dynamics as in multi-physics systems, gives rise to problems characterized by a very large number of variables or equations. While the simulation of such systems is already a challenge for state-of-the-art numerical algorithms, quantifying uncertainty propagation, as well as making the leap from simulation to optimization and control of such systems, quickly becomes computationally infeasible.

At the same time, we have evolved towards an information-rich world, characterized by the availability of huge amounts of data (e.g. massive biomedical data sets, data generated by sensor networks) and the presence of extremely large information networks (e.g. information available from the internet), where extracting relevant information becomes a major challenge and where most optimization techniques needed to recover that information require too much computing time.

Both in the treatment of large systems and of large data sets, significant advances in algorithmic design are indispensable: exploiting structure (e.g. low-rank structures in matrix methods), multi-scale aspects and sparsity, using new data representations (e.g. tensor-based algorithms) and adjusting to the available computational hardware (e.g. distributed and parallel computing) are crucial. Novel algorithms also need to be developed that are targeted to the next generation of multi-core processors and supercomputers. In biomedical data analysis, other main challenges lie in data fusion and in the integration of complementary techniques. Several methods from signal processing, data mining and statistics have already been used successfully to analyze data sets and extract new information. However, most existing methods are still based on a single classification method and on single-center data collection, which restricts their general applicability. Integrated techniques include new "components"-based methods to perform an integrated analysis of multiple groups of possibly heterogeneous data types.

In the area of large-scale and complex systems, the challenge nowadays is to make the leap from academic problems to realistic real-life applications. Multi-scale problems have to be addressed appropriately so that their simulation and optimization remain computationally feasible, either by exploiting the fundamental multi-scale properties or by using distributed algorithms and parallel computing. To analyze massive data sets, such as biological, internet or biomedical data, one needs to merge tools from signal processing, machine learning and complex systems analysis in order to advance the state of the art in several application areas. These include the extraction of optimal adaptive treatment strategies from available clinical data, the study of social network dynamics and the extraction of relevant information from complex network behaviors.

Research objectives and proposed topics
WP1.1 Large scale optimization (UCL, KUL, UGent, ULg, UNamur)

Manifold techniques. Several problems of science and engineering can be phrased as finding an optimizer of a cost function defined on a nonlinear manifold. The advantages of the manifold-based approach are a reduced computational complexity and/or stronger convergence results. We are pursuing the development of efficient general-purpose algorithms on manifolds. Differential-geometric concepts also underlie our ongoing work on collaborative filtering, low-rank approximation of higher-order tensors, curve fitting on manifolds and nonsmooth optimization on manifolds. Recent work in the area has focused on developing and exploiting new geometries for the manifold of symmetric or non-symmetric fixed-rank matrices. Motivated by statistical analysis and machine-learning applications, special attention will be devoted to providing quantitative bounds for the convergence analysis, handling efficiently non-smooth cost functions, and developing stochastic optimization algorithms on manifolds. Novel algorithms will be developed for the optimization of large scale power system problems (500 million variables), aimed at security management of the European Electric Power System. This work is in collaboration with Florida State University.
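
As a concrete illustration of the manifold-based approach, the sketch below (with illustrative parameters, not one of the general-purpose algorithms referred to above) minimizes the Rayleigh quotient over the unit sphere by Riemannian gradient descent: the Euclidean gradient is projected onto the tangent space and the iterate is retracted back to the manifold by normalization.

    import numpy as np

    def sphere_gradient_descent(A, x0, iters=2000):
        x = x0 / np.linalg.norm(x0)
        step = 0.5 / np.linalg.norm(A, 2)            # conservative fixed step
        for _ in range(iters):
            egrad = 2.0 * A @ x                      # Euclidean gradient of x'Ax
            rgrad = egrad - (x @ egrad) * x          # projection onto tangent space
            x = x - step * rgrad
            x = x / np.linalg.norm(x)                # retraction back to the sphere
        return x

    rng = np.random.default_rng(0)
    M = rng.standard_normal((50, 50))
    A = M + M.T                                      # symmetric test matrix
    x = sphere_gradient_descent(A, rng.standard_normal(50))
    print(x @ A @ x, np.linalg.eigvalsh(A)[0])       # the two values should nearly agree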

Random methods in Nonlinear Optimization. We revisit the idea of using randomization techniques in optimization. Such methods can be based on a finer description of the problem classes and can therefore outperform standard optimization techniques. Moreover, appropriate randomized oracles can significantly reduce the cost of each iteration while keeping the expected number of iterations at the same level. Finally, randomized search directions can help to solve problems with restricted informational support (e.g. when derivatives cannot be computed).
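
The following toy sketch illustrates the third ingredient, randomized search directions for derivative-free problems: the gradient is replaced by a finite difference along a random Gaussian direction. The test function and all parameter values are illustrative assumptions.

    import numpy as np

    def random_search(f, x0, step=1e-3, mu=1e-6, iters=20000, seed=0):
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        for _ in range(iters):
            u = rng.standard_normal(x.size)
            g = (f(x + mu * u) - f(x)) / mu * u   # finite difference along random direction
            x = x - step * g                       # unbiased (up to O(mu)) gradient step
        return x

    f = lambda x: np.sum((x - 1.0) ** 2)           # smooth convex test function
    print(random_search(f, np.zeros(10)))          # approaches the all-ones vector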

Constrained Low-Rank Approximations. Low-rank approximation of a given matrix or tensor is a fundamental problem in numerical linear algebra, related to many application domains such as machine learning, graph theory and model reduction, and is an indispensable tool for the analysis of high-dimensional data. We aim at a better characterization and understanding of such problems, by studying their computational complexity in order to understand which instances can be approximated in polynomial time and which subclasses are NP-hard. Another goal is to design novel algorithms and techniques for the analysis of high-dimensional data. The effectiveness of these methods is assessed in the context of their use in data analysis as well as for building approximate and exact extended formulations for linear and convex programs.
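
For the unconstrained case, the best rank-k approximation in Frobenius norm is given in polynomial time by the truncated singular value decomposition (Eckart-Young); the sketch below shows this baseline, against which the hardness of constrained variants (e.g. nonnegativity) is contrasted. Data and sizes are illustrative.

    import numpy as np

    def best_rank_k(X, k):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return U[:, :k] * s[:k] @ Vt[:k, :]          # keep k leading singular triplets

    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 60))  # exact rank 8
    X_noisy = X + 0.01 * rng.standard_normal(X.shape)
    print(np.linalg.norm(X_noisy - best_rank_k(X_noisy, 8)))  # residual at noise level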

Large-Scale Convex Optimization. Optimization modeling and algorithmic techniques can be applied to many different domains and to various types of problems. Convex optimization problems form an important class of continuous problems which hold the promise of efficient algorithms with theoretical performance guarantees. We plan to study both first-order and second-order methods in order to balance accuracy and performance. Additional problem structure, such as sparsity, separability, smoothness or saddle-point structure, will be exploited for further improvements. We intend to develop modeling tools that recognize these features for several real-world applications.
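
As an illustration of a first-order method with a theoretical performance guarantee that exploits structure (here sparsity via an l1 term), the sketch below implements an accelerated proximal gradient scheme (FISTA) for l1-regularized least squares. The test problem and parameter values are illustrative assumptions.

    import numpy as np

    def fista(A, b, lam, iters=300):
        L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of the gradient
        x = np.zeros(A.shape[1]); y = x.copy(); t = 1.0
        for _ in range(iters):
            g = A.T @ (A @ y - b)                     # gradient of the smooth part at y
            z = y - g / L
            x_new = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
            t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
            y = x_new + (t - 1.0) / t_new * (x_new - x)                # momentum step
            x, t = x_new, t_new
        return x

    rng = np.random.default_rng(2)
    A = rng.standard_normal((80, 200))
    x_true = np.zeros(200); x_true[:5] = 3.0          # sparse ground truth
    b = A @ x_true + 0.01 * rng.standard_normal(80)
    print(np.nonzero(np.round(fista(A, b, lam=0.5), 1))[0])  # mostly the first 5 indices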

Non-convex problems. Non-convex optimization problems are often intractable, but local minima can often be obtained via the solution of sequences of convex problems. This is for instance the case in stability problems of dynamical systems or in the solution of bilinear matrix inequalities, which frequently occur in analysis and design problems for dynamical systems. Several teams collaborate on the study of such successive convexifications in order to derive efficient algorithms converging to local minima. We will also develop novel algorithms tackling optimization problems over high-degree multivariate nonnegative polynomials. Stability and accuracy will be enforced by the choice of the polynomial basis.
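
The mechanism of replacing a non-convex problem by a sequence of convex ones can be shown on a one-dimensional toy instance: writing f = g - h as a difference of convex functions and linearizing h at the current iterate yields convex subproblems whose minimizers converge to a local minimizer of f. The example below is purely illustrative.

    import numpy as np

    # Minimize f(x) = x^4 - 3x^2 by writing f = g - h with g(x) = x^4 (convex)
    # and h(x) = 3x^2 (convex), and linearizing h at the current iterate x_k.
    # Each convex subproblem min_x x^4 - h'(x_k) x has a closed-form minimizer.
    def dc_iteration(x0, iters=50):
        x = x0
        for _ in range(iters):
            slope = 6.0 * x                 # h'(x_k)
            x = np.cbrt(slope / 4.0)        # argmin of x^4 - slope*x: 4x^3 = slope
        return x

    print(dc_iteration(0.5), dc_iteration(-0.5))   # local minimizers +-sqrt(3/2) = +-1.2247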

Data assimilation. The goal of data assimilation is to improve the quality of the numerical simulation of a phenomenon through the use of observational data. The common approach to data assimilation consists of determining the best initial condition for a forecast from the knowledge of a previous forecast, of collected data, and of statistics on the uncertainty related to the first two sources of information. This process is repeated cyclically in operational weather and ocean prediction centers. Data assimilation gives rise to huge optimization problems, for which it is crucial to develop robust and efficient algorithms. We will develop new techniques to reach good approximations of the solution in a short time relative to the problem size. We will also address computational aspects and constraint-handling issues of the Ensemble Kalman Filter and of Optimal Interpolation when applied to large-scale systems. The application to regional-scale air quality models will be explored (see also WP2.5).
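
To make the Ensemble Kalman Filter concrete, the sketch below implements its analysis step on a toy linear observation setting: the forecast covariance is replaced by the sample covariance of an ensemble, so only small, ensemble-sized systems need to be solved. The model and dimensions are illustrative assumptions.

    import numpy as np

    def enkf_analysis(Xf, y, H, R, rng):
        n, N = Xf.shape                        # state dimension, ensemble size
        Xm = Xf.mean(axis=1, keepdims=True)
        Ap = (Xf - Xm) / np.sqrt(N - 1)        # normalized forecast anomalies
        S = H @ Ap                             # anomalies mapped to observation space
        K = Ap @ S.T @ np.linalg.inv(S @ S.T + R)   # Kalman gain from sample covariances
        Y = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R, size=N).T
        return Xf + K @ (Y - H @ Xf)           # updated (analysis) ensemble

    rng = np.random.default_rng(3)
    n, N = 40, 25
    truth = np.sin(np.linspace(0, 2 * np.pi, n))
    H = np.eye(n)[::4]                         # observe every 4th state component
    R = 0.01 * np.eye(H.shape[0])
    Xf = truth[:, None] + 0.5 * rng.standard_normal((n, N))  # perturbed forecast ensemble
    Xa = enkf_analysis(Xf, H @ truth, H, R, rng)
    print(np.abs(Xa.mean(axis=1) - truth).mean())  # analysis mean is closer to the truth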

KKT systems in constrained optimization. Efficiently and robustly solving the Karush-Kuhn-Tucker systems arising in optimization methods for large-scale nonlinear programming problems is a big challenge that requires good insight into both linear algebra and optimization. We will develop appropriate preconditioners for iterative methods designed to solve such saddle-point systems arising in constrained optimization, exploiting their special structure and properties.
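
The sketch below illustrates the classical block-diagonal preconditioning idea for a saddle-point system [[A, B'], [B, 0]]: with P = blkdiag(A, B A^-1 B'), a minimal-residual method converges in very few iterations. The diagonal (1,1) block, which makes the Schur complement cheap to form exactly, is a simplifying assumption for this demonstration; in practice both blocks would themselves be approximated.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    rng = np.random.default_rng(4)
    n, m = 200, 50
    a = 1.0 + rng.random(n)                            # SPD diagonal (1,1) block
    A = sp.diags(a)
    B = (sp.random(m, n, density=0.05, random_state=4) + sp.eye(m, n)).tocsr()
    K = sp.bmat([[A, B.T], [B, None]], format='csr')   # symmetric indefinite KKT matrix
    rhs = rng.standard_normal(n + m)

    S = sp.csc_matrix(B @ sp.diags(1.0 / a) @ B.T)     # exact Schur complement B A^-1 B'
    S_solve = spla.factorized(S)

    def apply_P(r):                                    # apply P^-1 = blkdiag(A, S)^-1
        return np.concatenate([r[:n] / a, S_solve(r[n:])])

    M = spla.LinearOperator((n + m, n + m), matvec=apply_P)
    x, info = spla.minres(K, rhs, M=M)
    print(info, np.linalg.norm(K @ x - rhs))           # small residual, few iterations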

Robust optimization of PDE-constrained systems. The discretization of PDE-constrained optimization problems leads to numerical models with possibly millions of optimization variables. In order to solve such problems, a standard approach is to include advanced linear algebra techniques into classical optimization techniques. However, more ambitious approaches integrate the optimization and linear algebra techniques more tightly. This has led, for example, to multi-level one-shot methods and multi-level optimization techniques. These methods have been shown, for quite a few classes of problems, to operate with optimal computational complexity, i.e., linear in the number of degrees of freedom. Here, we will extend these approaches towards systems of PDEs, models on irregular domains, and nonlinear equations appearing in real-life engineering optimization. Particularly challenging is the question of how to deal with state and control constraints efficiently, as the decrease in regularity of the PDE solution adversely affects the performance of the optimized solvers. Another challenge is the handling of uncertainties in PDE-constrained optimization, and the efficient treatment of min-max or chance-constrained PDE models.
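
The following small sketch shows the basic adjoint machinery underlying such PDE-constrained solvers on a 1-D Poisson control problem: one state solve and one adjoint solve per iteration yield the full reduced gradient, independently of the number of control variables. Discretization, step size and regularization weight are illustrative assumptions.

    import numpy as np

    # min_q 0.5*||u - u_d||^2 + 0.5*alpha*||q||^2  subject to  -u'' = q,
    # homogeneous Dirichlet BCs, finite-difference discretization A u = q.
    n = 100; h = 1.0 / (n + 1); alpha = 1e-4
    grid = np.linspace(h, 1 - h, n)
    A = (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
         - np.diag(np.ones(n - 1), -1)) / h**2          # stiffness matrix for -u''
    u_d = np.sin(np.pi * grid)                          # desired state

    q = np.zeros(n)
    for _ in range(200):
        u = np.linalg.solve(A, q)                       # state solve
        p = np.linalg.solve(A.T, u - u_d)               # adjoint solve
        grad = alpha * q + p                            # reduced gradient
        q = q - 50.0 * grad                             # fixed step (illustrative)
    print(np.linalg.norm(u - u_d) * np.sqrt(h))         # tracking error decreases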

WP1.2 Model reduction (KUL, UCL, UGent)

Parametric studies of dynamical systems, e.g. arising in acoustics and vibrations, require a large number of expensive evaluations of the system for changing parameter values. In order to reduce the computational cost, model order reduction is of utmost importance. We currently focus on non-linear systems and mainly on parametric systems. For parametric problems, the model reduction method should be adapted to the specific features of the parameters in order to remain tractable. For model reduction of partial differential equation problems, Krylov-based techniques will be used, as well as techniques from system identification such as Loewner matrices. Model reduction techniques can also significantly improve the simulation of multiscale differential equations that only have a few 'slow' degrees of freedom. We will study how Krylov-type model order reduction techniques can exploit strong time-scale separation for the automatic detection of macroscopic variables, and how the resulting error can be controlled adaptively. An alternative technique for handling very high-dimensional problems is the use of quasi-Monte Carlo methods, which employ low-discrepancy point sets. We will study such dimension reduction methods for problems from financial engineering and quantum mechanics. We will also further extend our techniques of tangential interpolation for use in optimal H2-norm approximation of time-varying and interconnected systems. Special attention will be given to the simulation of partial differential equations such as the wave propagation of tsunamis. Collaborations will be sought with Florida State University.
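
As an indication of how Krylov-based reduction operates, the sketch below performs one-sided moment matching for a linear state-space model: the system is projected onto the Krylov subspace built from A^-1 b, so that the reduced transfer function matches the full one around s = 0. The dense test system and the reduced order are illustrative assumptions.

    import numpy as np

    def arnoldi_mor(A, b, c, k):
        lu_solve = np.linalg.solve                    # stand-in for a sparse factorization
        V = np.zeros((len(b), k))
        v = lu_solve(A, b)                            # first Krylov vector A^-1 b
        V[:, 0] = v / np.linalg.norm(v)
        for j in range(1, k):
            w = lu_solve(A, V[:, j - 1])              # next Krylov vector
            w -= V[:, :j] @ (V[:, :j].T @ w)          # Gram-Schmidt orthogonalization
            V[:, j] = w / np.linalg.norm(w)
        return V.T @ A @ V, V.T @ b, V.T @ c          # reduced (A_r, b_r, c_r)

    rng = np.random.default_rng(5)
    n = 400
    A = -np.diag(1.0 + rng.random(n)) + 0.1 * rng.standard_normal((n, n)) / np.sqrt(n)
    b = rng.standard_normal(n); c = rng.standard_normal(n)
    Ar, br, cr = arnoldi_mor(A, b, c, k=20)
    s = 0.1                                            # compare transfer functions at s
    G  = c @ np.linalg.solve(s * np.eye(n) - A, b)
    Gr = cr @ np.linalg.solve(s * np.eye(20) - Ar, br)
    print(G, Gr)                                       # should agree well near s = 0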

Proper Orthogonal Decomposition (POD) is a powerful and elegant method that has been used, in combination with Galerkin projection as well as identification techniques, for deriving reduced-order models of high-dimensional systems obtained from partial differential equations. The method provides a basis for the modal decomposition of an ensemble of functions, such as data obtained in the course of experiments or numerical simulations. When POD is applied to a multidimensional system, the snapshot matrix usually has many more rows than columns because the Cartesian structure of the data is ignored. We propose to take the multidimensional nature of the spatial coordinates into account by using a tensor representation. More research on the derivation of optimal basis vectors for multidimensional systems will be required.
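
A minimal matrix-based POD sketch, with synthetic travelling-wave snapshots as an illustrative assumption, is given below; it is exactly this construction, with the spatial coordinates flattened into the rows of the snapshot matrix, that the tensor representation is meant to improve upon.

    import numpy as np

    # Snapshot matrix: each column is the sampled solution at one time instant.
    x = np.linspace(0, 1, 500)
    snapshots = np.stack([np.sin(2 * np.pi * (x - 0.01 * t)) +
                          0.3 * np.sin(6 * np.pi * (x - 0.03 * t))
                          for t in range(100)], axis=1)        # 500 x 100

    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(energy, 0.999)) + 1                # modes for 99.9% energy
    Phi = U[:, :r]                                             # POD basis
    print(r, np.linalg.norm(snapshots - Phi @ (Phi.T @ snapshots)))  # small residual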

WP1.3 Large scale linear algebra problems (UCL, KUL)

Compressive sampling and structured learning. The increase in biomedical measurement techniques for diagnosis and follow-up of human diseases strongly calls for compression in order to keep the data flow tractable. To extract the relevant information, biomedical data sets should be represented sparsely in a context-dependent basis, because this decreases computation, memory usage and data communication, as required by increasingly sophisticated information processing devices. Traditionally, compressed sensing approaches using random sampling are employed to sparsify data for wireless transmission. We will investigate how exact estimation and approximation techniques from computer algebra can be used in a noisy environment to represent biomedical signals sparsely with uniform sampling. Stronger compression can also be achieved by learning the relevant patterns hidden in medical data sets. In both epilepsy and neonatal monitoring, large data sets are recorded that have to be summarized in a compact and clear diagnostic report. An efficient way to do this is to learn the important patterns in a structured way for every subject.
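
The sketch below illustrates the basic compressive sampling mechanism with a greedy recovery algorithm (orthogonal matching pursuit): a signal that is sparse in a known basis is reconstructed from a small number of random linear measurements. Sizes and the random measurement matrix are illustrative assumptions.

    import numpy as np

    def omp(Phi, y, k):
        residual, support = y.copy(), []
        for _ in range(k):
            j = int(np.argmax(np.abs(Phi.T @ residual)))   # most correlated atom
            support.append(j)
            coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
            residual = y - Phi[:, support] @ coef          # orthogonalize the residual
        x = np.zeros(Phi.shape[1]); x[support] = coef
        return x

    rng = np.random.default_rng(6)
    m, n, k = 80, 256, 5
    Phi = rng.standard_normal((m, n)) / np.sqrt(m)         # random measurement matrix
    x_true = np.zeros(n); x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
    x_hat = omp(Phi, Phi @ x_true, k)
    print(np.linalg.norm(x_hat - x_true))                   # near-exact recovery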

Algorithms for structured, nonlinear and parameter-dependent problems. We have a long history of exploiting structure while solving a wide variety of linear algebra problems. This expertise has already led to fruitful collaborations within the IAP, and will again be made available to the partners. For multidimensional data, there is a growing tendency to develop algorithms for tensors. Besides the Canonical Polyadic Decomposition and the Tucker decomposition, the Block-Term Decomposition will be examined and applied to problems in system identification. We will also focus on efficient linear algebra techniques for specific applications, such as spectral clustering for large-scale data problems (see also WP2.7).
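
For concreteness, the sketch below computes a canonical polyadic decomposition of a third-order tensor by plain alternating least squares, the simplest representative of this family of tensor algorithms; dimensions, rank and initialization are illustrative assumptions.

    import numpy as np

    def khatri_rao(B, C):
        # column-wise Kronecker product: row j*K+k holds B[j,:] * C[k,:]
        return (B[:, None, :] * C[None, :, :]).reshape(-1, B.shape[1])

    def cp_als(T, R, iters=200, seed=0):
        rng = np.random.default_rng(seed)
        I, J, K = T.shape
        A = rng.standard_normal((I, R)); B = rng.standard_normal((J, R))
        C = rng.standard_normal((K, R))
        T1 = T.reshape(I, J * K)                       # mode-1 unfolding
        T2 = np.moveaxis(T, 1, 0).reshape(J, I * K)    # mode-2 unfolding
        T3 = np.moveaxis(T, 2, 0).reshape(K, I * J)    # mode-3 unfolding
        for _ in range(iters):                         # update each factor in turn
            A = T1 @ np.linalg.pinv(khatri_rao(B, C)).T
            B = T2 @ np.linalg.pinv(khatri_rao(A, C)).T
            C = T3 @ np.linalg.pinv(khatri_rao(A, B)).T
        return A, B, C

    rng = np.random.default_rng(7)
    A0, B0, C0 = (rng.standard_normal((d, 3)) for d in (15, 12, 10))
    T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)          # exact rank-3 tensor
    A, B, C = cp_als(T, R=3)
    T_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
    print(np.linalg.norm(T - T_hat) / np.linalg.norm(T))  # typically near zero
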
Another research line concerns distance problems for linear and nonlinear eigenvalue problems, with applications to graph theory, structural dynamics and robust control. This work includes the computation of the maximum smallest eigenvalue of parametric matrix pairs, and the design of robust controllers for distributed-parameter systems such as time-delay systems. We will also analyze parametric eigenvalue problems involving Kronecker products and Lyapunov equations. Specific QR-type and Krylov-type methods will be developed that respect this structure. We will study sweeping techniques that pass over large-scale data only once, while extracting as much spectral information as possible for various types of large-scale matrices. This work will be pursued in collaboration with CNR Bari, Italy.

WP1.4 Numerical methods for highly oscillatory problems (KUL)

We will develop numerical methods and software for the evaluation of integrals that are high-dimensional or highly oscillatory. Both appear in mathematical models in a wide range of disciplines and are computationally demanding. For high-dimensional integrals, the so-called curse of dimensionality implies exponential growth of computing times with increasing dimension. Progress is envisioned through the use of optimized quasi-Monte Carlo techniques. Highly oscillatory integrals, in turn, are demanding due to strict sampling requirements. However, ongoing research shows that these requirements can be lifted, and numerical methods based on the asymptotic analysis of the integrals for large frequencies hold the promise of being frequency-independent. Similarly, by incorporating asymptotic analysis into numerical simulation methods, we will construct novel algorithms for the simulation of wave equations, aiming at lifting the dependency on the frequency. The mathematical literature contains many additional results in asymptotic analysis, which we will use to extend the method to more complicated configurations. Moreover, we will investigate the extent to which asymptotic analysis of wave problems can improve the frequency robustness of existing methods.
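
The high-dimensional part can be illustrated directly: the sketch below compares plain Monte Carlo with a scrambled Sobol' rule (via scipy.stats.qmc) on a smooth 10-dimensional integral with exact value 1, where the quasi-Monte Carlo error typically decays close to O(1/n) rather than O(1/sqrt(n)). The integrand is an illustrative assumption.

    import numpy as np
    from scipy.stats import qmc

    d, n = 10, 2**12
    f = lambda X: np.prod(1.0 + 0.1 * (X - 0.5), axis=1)   # integral over [0,1]^d is 1

    rng = np.random.default_rng(8)
    mc_est = f(rng.random((n, d))).mean()                   # plain Monte Carlo
    sobol = qmc.Sobol(d=d, scramble=True, seed=8).random(n) # low-discrepancy points
    qmc_est = f(sobol).mean()
    print(abs(mc_est - 1.0), abs(qmc_est - 1.0))            # QMC error usually smaller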

WP1.5 Large scale dynamical systems simulation (UCL, KUL, UNamur)

Multiscale problems are ubiquitous in science and engineering, and they pose severe challenges for numerical simulation. On the one hand, a mathematical model at the macroscopic level can often not be obtained from the available microscopic description without making simplifying assumptions that are hard to justify. On the other hand, direct simulation using the microscopic model may well be prohibitively expensive. We focus on the development of a generic computational framework that makes it possible to perform simulations at the macroscopic level when a complete macroscopic mathematical model is not available. The framework is based on repeated, appropriately initialized local microscopic simulations, and on numerical techniques to transfer information between the microscopic and macroscopic levels of description. Our research will also aim at the development of numerical and analytical tools particularly suitable to study the (in)stability and control of variables (chaos detection, diffusion frequency analysis, Hamiltonian control), to accurately determine the time evolution of the system variables (high-performance numerical integration schemes) and to determine or estimate the key parameters and variables of systems for simplification purposes (normal forms). The main challenges that we will address are: i) quantifying the macroscopic uncertainty that stems from a stochastic microscopic model, aiming at adaptive algorithms with automatic error control; ii) developing novel distributed algorithms that make extensive use of parallel computing; and iii) developing data-driven algorithms that do not require predefined macroscopic variables. This framework can also be applied to biochemical processes, as described in WP2.7-2.8.
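
A toy sketch of this micro-macro mechanism (coarse projective integration on a linear fast-slow system, with all parameter values illustrative) is given below: short bursts of the microscopic model heal the fast variable and estimate the slow time derivative, which is then extrapolated over a much larger step.

    import numpy as np

    eps, dt, burst = 1e-3, 2e-4, 10          # time-scale separation, micro step, burst length

    def micro_burst(x, y):                   # explicit Euler on the full microscopic model:
        for _ in range(burst):               #   x' = -y (slow),  y' = (x - y)/eps (fast)
            x, y = x - dt * y, y + dt * (x - y) / eps
        return x, y

    x, y, t, Dt = 1.0, 0.0, 0.0, 0.05        # Dt = large projective (outer) step
    while t < 2.0:
        x1, y = micro_burst(x, y)            # first burst heals the fast variable
        x2, y = micro_burst(x1, y)           # second burst estimates the slow slope
        slope = (x2 - x1) / (burst * dt)
        x = x2 + Dt * slope                  # extrapolate the slow variable
        t += Dt + 2 * burst * dt
    print(x, np.exp(-t))                     # effective slow dynamics: x' ~ -x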

Large-scale nonlinear dynamic system models formulated as coupled differential-algebraic equations (DAE) are routinely developed for numerical simulation in diverse engineering domains such as process engineering, large integrated chemical plants, electrical power generation, and the automotive industry. While the simulation of these large-scale and multi-physics systems in an acceptable time is already a challenge, their dynamic optimization necessitates novel distributed algorithms that make extensive use of parallel computing. Within the framework of the Flanders Exascience Lab (one of the Intel Labs Europe, www.exascience.com/) we will develop numerical algorithms and programming models for large-scale simulation on multicore computers and supercomputers.

WP1.6 Biomedical data analysis (KUL, UGent, UMons)

MRS data processing. Magnetic Resonance Imaging (MRI) is a widely used non-invasive and radiation-free medical imaging modality. In vivo Magnetic Resonance Spectroscopy (MRS) is a more specialized technique, which yields additional and essential metabolic information. However, the bottleneck that keeps MRS out of routine clinical examinations is the lack of robustness in extracting and interpreting relevant features from very noisy, low-spatial-resolution MRS data. We will develop robust and automated MRS data processing for 2D/3D multi-voxel MRS measurements, based on semi-supervised quality assessment and spatial prior knowledge. Additionally, supervised and unsupervised tissue classification for high-resolution multi-slice MR images will be tackled by data fusion of several MR imaging modalities acquired during the same measurement session, including MRS. The first key application is a therapy follow-up study of brain glioma patients monitored with multi-modal MR modalities.

Digital signal processing. We will use advanced digital signal processing to investigate multiple applications, such as pathological voice analysis, anesthesia monitoring aids, sleep studies (polysomnography and sleep stage classification), cough counting and neural signal analysis (prosthesis movements). Data processing will also be used to optimize medical records: on the one hand through an optimal presentation of new patient data to the physician with respect to his or her specialty, and on the other hand through comparison with data from previous patients in order to improve the support provided.

Reinforcement Learning. Several algorithmic developments are motivated by the statistical analysis of massive biological data sets. This includes new "components"-based methods to perform an integrated analysis of multiple groups of possibly heterogeneous data types, such as those arising from different -omics technologies. Many challenges remain in this context, including accommodating different measurement scales, high-dimensional intra-correlated data, optimal scaling and sparsity in data matrices. A novel challenge in such studies is to integrate the use of massive neuroimaging data as phenotypes of neuronal diseases. Using Reinforcement Learning (RL) and marginal mean models, we aim at extracting (optimal) adaptive treatment strategies from available clinical data; such strategies can take delay effects into account and avoid the use of pre-specified mathematical models. As an alternative to mathematical profile analysis models, smoothers can be used to identify profiles. A complication of the latter, however, is inference about future behavioral patterns. Especially when data sets are small, there is a need to combine classical statistical methodologies with ideas from reinforcement learning.
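
A toy sketch of the batch setting is given below: tabular Q-iteration extracts a treatment policy purely from logged (state, action, reward, next state) transitions, without a pre-specified process model. The simulator that generates the "clinical" data, and all sizes, are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(9)
    nS, nA, gamma = 5, 2, 0.9                 # health states, actions (wait/treat), discount

    def simulate_step(s, a):                  # hidden "true" dynamics, used only to
        drift = 1 if a == 1 else -1           # generate the logged data set
        s2 = int(np.clip(s + drift + rng.integers(-1, 2), 0, nS - 1))
        reward = float(s2) - 0.3 * a          # health gain minus treatment cost
        return s2, reward

    batch = []                                # logged transitions under a random policy
    for _ in range(5000):
        s, a = int(rng.integers(nS)), int(rng.integers(nA))
        s2, r = simulate_step(s, a)
        batch.append((s, a, r, s2))

    Q = np.zeros((nS, nA))
    for _ in range(100):                      # batch (fitted) Q-iteration
        Q_new = np.zeros_like(Q); count = np.zeros_like(Q)
        for s, a, r, s2 in batch:
            Q_new[s, a] += r + gamma * Q[s2].max()   # sampled Bellman backup
            count[s, a] += 1
        Q = Q_new / np.maximum(count, 1)
    print(Q.argmax(axis=1))                   # learned action per health state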

Network approach. The application of complex network tools to neuroscience and neuro-imaging datasets has recently led to major advances in understanding the way the brain works at a system level. Our aim is to go beyond a purely static point of view and to develop a robust framework and algorithms for unraveling how functional networks develop in time and robustly retain properties in spite of environmental fluctuations. The development of sound statistical procedures to compare between-group network properties will also be a priority.

WP1.7 Information extraction from very large databases (UCL, KUL, ULg, UNamur)

Similarity measures. Ongoing research concerns the classification of similarity measures on graphs. Recently, we have shown that several of these measures can be rewritten in terms of fixed points of a scaled affine transformation, and we have proposed a novel definition of node-to-node similarity that is better suited to yield information relevant to the graph matching problem. For large-scale applications, it is important to be able to efficiently compute low-rank approximations of similarity matrices. This topic relates to an ongoing collaboration between two of our teams that aims at a better understanding of the geometry of sets of low-rank matrices. We will also use efficient matrix theoretic techniques in graph theoretic problems and apply them to reputation systems, image segmentation, dominant feature extraction and ranking in graphs with negative links. In image processing applications, we are especially interested in extending the use of the modularity function and the corresponding Louvain algorithm to perform segmentation of image sequences rather than still images.
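
The sketch below shows a fixed-point iteration of this type, in the spirit of the node-to-node similarity measure of Blondel et al.: the similarity matrix between two graphs is repeatedly mapped through both adjacency structures and normalized. The two small test graphs are illustrative.

    import numpy as np

    def graph_similarity(A, B, iters=100):
        # S[i, j] = similarity between vertex i of graph B and vertex j of graph A
        S = np.ones((B.shape[0], A.shape[0]))
        for _ in range(iters):                # even number of steps: even iterates converge
            S = B @ S @ A.T + B.T @ S @ A
            S /= np.linalg.norm(S)            # Frobenius normalization
        return S

    cycle = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], float)   # directed 3-cycle
    path = np.array([[0, 1, 0, 0], [0, 0, 1, 0],
                     [0, 0, 0, 1], [0, 0, 0, 0]], float)         # directed path graph
    print(np.round(graph_similarity(cycle, path), 3))            # 4 x 3 similarity matrix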

Time and embedding. We aim at examining social network dynamics in connection with the physical space in which interactions take place. This becomes possible as providers of online services have access to a valuable source of data on the geographic location of users, as well as to the online friendship connections among them. The main goal is to propose computational models explaining the observed generic properties of human dynamics and mobility, and to develop a mathematical framework for dynamic complex networks. This includes the development of mathematical models and statistical measures that account for the real trajectories taking place on the networks and go beyond simplistic Markov models. In general, such algorithms find applications in a broad range of industries providing services to huge numbers of clients, which want to be able to interpret these data in order to detect optimal settings for their activities.
