Tentative topics for the PhD thesis

Advanced Statistics and Data Science 2: New data, new models, new challenges

a. Time series explainability in the context of Big Data.

There have been various efforts to develop model-agnostic methodologies that offer explainability or interpretability for black-box models. While several attempts have been made to adapt these model-agnostic techniques to time series analysis, a significant research gap remains in the domain of multivariate time series, particularly for Deep Learning models. Specifically, classification and regression tasks involving multivariate time series data are commonly addressed with convolutional neural networks (CNNs), recurrent neural networks (RNNs) or, to a lesser extent, Transformers, owing to their versatility in handling multiple time series and incorporating exogenous variables (numerical or categorical). However, the flexibility of these methods often hinders the extraction of meaningful explanations, leaving practitioners to rely on techniques adapted from the computer vision literature, whose applicability to time series is uncertain. This thesis proposes a comprehensive investigation and implementation of novel explainability methodologies for multivariate time series data in the context of Deep Learning models. The overarching objective is to enhance our understanding of the data and of the model's underlying learning process, thereby facilitating improved classification and regression outcomes.
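As an illustration of the kind of technique this line of research starts from, the following minimal sketch (in PyTorch, with assumed layer sizes and a toy random data set) adapts a gradient-based saliency map, a method borrowed from the computer vision literature, to a small 1D-CNN classifier for multivariate time series; the result is one relevance value per variable and per time step.

    # Minimal sketch, not a definitive implementation: gradient-based saliency
    # for a toy multivariate time series classifier (assumed layer sizes).
    import torch
    import torch.nn as nn

    class TSConvNet(nn.Module):
        """Toy 1D-CNN classifier; channels play the role of the observed variables."""
        def __init__(self, n_channels: int, n_classes: int):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(n_channels, 16, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            self.classifier = nn.Linear(32, n_classes)

        def forward(self, x):  # x: (batch, n_channels, n_timesteps)
            return self.classifier(self.features(x).squeeze(-1))

    def saliency(model: nn.Module, x: torch.Tensor, target: int) -> torch.Tensor:
        """Gradient of the target logit w.r.t. the input: one relevance value
        per variable and per time step (same shape as x)."""
        model.eval()
        x = x.detach().clone().requires_grad_(True)
        model(x)[:, target].sum().backward()
        return x.grad.abs()

    # Example on random data: 8 series with 3 variables observed at 128 time steps.
    model = TSConvNet(n_channels=3, n_classes=2)
    relevance = saliency(model, torch.randn(8, 3, 128), target=1)  # shape (8, 3, 128)

Whether such pixel-level recipes remain faithful once the "image" axes are variables and time is precisely the kind of question this thesis intends to study.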

In particular, this thesis aims to explore:

  • Explainability for multivariate time series tasks with CNNs, RNNs, and Transformers: This involves developing techniques to elucidate the decision-making process of these models when applied to multivariate time series data.
  • Inherent techniques from different Deep Learning architectures: This entails identifying and leveraging intrinsic properties of CNNs, RNNs, and Transformers to extract interpretability directly from the model architecture.
  • Agnostic methods for any Deep Learning architecture: The aim is to develop model-agnostic approaches that can be applied to a wide range of Deep Learning models, regardless of their specific architecture (an occlusion-style sketch is given after this list).
  • Explainability in the context of generative models (diffusion models).
  • Explainability in new architectures like Structured State Space Models.
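For the model-agnostic direction mentioned in the list above, a natural starting point is an occlusion-style relevance measure that only requires a black-box prediction function: mask one variable over a sliding window, replace it by a baseline value, and record the drop in the predicted score. The sketch below is illustrative; the function and parameter names (occlusion_relevance, window, baseline) are choices made here, not an existing API.

    # Minimal sketch of a model-agnostic, occlusion-style relevance measure.
    import numpy as np

    def occlusion_relevance(predict, x, target, window=16, baseline=0.0):
        """predict: callable mapping an array of shape (n_channels, n_timesteps)
        to a vector of class scores. Returns an (n_channels, n_windows) array of
        score drops: large values mean the masked block mattered for the target."""
        n_channels, n_steps = x.shape
        reference = predict(x)[target]
        starts = list(range(0, n_steps - window + 1, window))
        relevance = np.zeros((n_channels, len(starts)))
        for c in range(n_channels):
            for j, s in enumerate(starts):
                x_masked = x.copy()
                x_masked[c, s:s + window] = baseline  # occlude one variable, one window
                relevance[c, j] = reference - predict(x_masked)[target]
        return relevance

    # Illustrative use with a dummy "black box" returning two scores.
    rng = np.random.default_rng(0)
    dummy_predict = lambda x: np.array([x.mean(), x.std()])
    scores = occlusion_relevance(dummy_predict, rng.normal(size=(3, 128)), target=0)

Because nothing but the prediction function is used, the same code applies unchanged to CNNs, RNNs, Transformers or any future architecture.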

 

b. EEG data: Contributions of Functional Data Analysis and IML.

Electroencephalograms (EEG) are recorded as long multivariate time series with spatially dependent components (nearby electrodes produce more similar time series than electrodes that are far from each other). The use of predictive machine learning models is increasingly common in this field, as it allows neuroscientists to make predictions about future behavior based on patterns learned from brain data, with minimal human intervention. Additionally, functional data analysis (FDA) is a promising way to model these data, in both the time and frequency domains.

The specific research steps in this proposal are the following:

  • To explore the potential contribution of interpretable machine learning (IML) to the understanding of the neural basis of cognition. Specifically, the proposal is to consider benchmark EEG datasets on which non-interpretable machine learning models perform well, and then to adapt interpretability tools to this specific kind of data.
  • To transform EEG data into functional data, enabling new forms of analysis. This transformation can be carried out in different ways. Raw EEG data can be turned into smoother functional data by truncating a basis expansion; FDA then allows us to investigate the variation of the signal and to decompose the data into different modes of variation. Alternatively, EEG data can be analyzed in the frequency domain by means of the estimated spectral density, a functional representation of an observed EEG that can be treated directly as a functional datum (both representations are sketched after this list).
  • To design an experimental phase of EEG data collection that will allow us to explore the limits of the methods used. The goal is to obtain our own experimental EEG data in the context of a cognitive task and then to analyze them using the methods developed in the previous chapters of the thesis. In particular, this objective aims to deepen our understanding of the neural basis of cognitive load.
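As a first illustration of the two functional representations mentioned in the second step, the following minimal sketch (SciPy and NumPy, on a simulated signal with an assumed sampling rate of 250 Hz) obtains a smooth time-domain curve by truncating a Fourier basis expansion and an estimated spectral density by Welch's method.

    # Minimal sketch, simulated data: two functional representations of one EEG channel.
    import numpy as np
    from scipy.signal import welch

    fs = 250                                  # assumed sampling rate (Hz)
    t = np.arange(0, 4, 1 / fs)               # 4 seconds of signal
    rng = np.random.default_rng(0)
    eeg = np.sin(2 * np.pi * 10 * t) + 0.5 * rng.standard_normal(t.size)  # toy 10 Hz rhythm

    # (1) Time domain: truncated Fourier basis expansion -> smooth functional datum.
    K = 60                                    # keep the first 60 coefficients (~up to 15 Hz)
    coef = np.fft.rfft(eeg)
    coef[K:] = 0.0                            # truncation of the basis expansion
    eeg_smooth = np.fft.irfft(coef, n=eeg.size)

    # (2) Frequency domain: estimated spectral density as a functional datum.
    freqs, psd = welch(eeg, fs=fs, nperseg=fs)  # power spectral density over frequency

Once every channel of every recording is represented this way, standard FDA tools (functional principal components, functional regression, functional depths) become available for the modes-of-variation analysis described above.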

 

c. Non-linear dimensionality reduction methods for Big Data.

In the era of big data, modern machines and systems increasingly incorporate multiple sensors that automatically monitor their operation, resulting in large, high-dimensional, and highly interdependent data sets. The problem of dimensionality reduction is crucial in this context. Classical dimensionality reduction techniques are Principal Component Analysis (PCA, which works on an n x p data matrix) and Multidimensional Scaling (MDS, which works on an n x n distance matrix between individuals). Both methods fail to describe the data set well when the data are distributed around a low-dimensional non-linear manifold embedded in a high-dimensional space. There exist several non-linear extensions of PCA and MDS: principal curves, local MDS, Isomap, t-SNE or UMAP, among others. These methods (except principal curves) use an n x n distance matrix, which entails memory and computing-time problems for large sample sizes n.
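For reference, classical (metric) MDS can be written in a few lines from the n x n distance matrix via double centering; the sketch below (on simulated data with n = 500) makes explicit where the O(n^2) memory and computing cost comes from.

    # Minimal sketch: classical (metric) MDS from an n x n distance matrix.
    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 20))         # n = 500 individuals, p = 20 variables

    D = squareform(pdist(X))                   # n x n distance matrix: the memory bottleneck
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered inner-product matrix
    eigval, eigvec = np.linalg.eigh(B)         # full n x n eigendecomposition
    top = np.argsort(eigval)[::-1][:2]         # two leading dimensions
    Y = eigvec[:, top] * np.sqrt(eigval[top])  # 2-D MDS configuration

Non-linear methods such as Isomap or local MDS replace D by geodesic or neighborhood-based dissimilarities but still manipulate an n x n object, which is precisely what the research steps below aim to circumvent.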

The specific research steps in this proposal are the following:

  • To propose versions of distance-based non-linear dimensionality reduction methods that can deal with large datasets in moderate time and without memory problems. In previous publications, members of the research group have explored six different strategies for applying MDS to very large datasets. We conjecture that only two of them can be adapted to the non-linear methods, because the other four rely heavily on the classical metric scaling solution to the MDS problem.
  • To apply our proposals to a massive data set coming from sensor-equipped structures. This point is related to the field of Structural Health Monitoring (SHM: checking the correct behavior of engineering structures and determining whether any type of maintenance is required), where we have some experience.

  • To explore the possibility of performing non-linear dimensionality reduction through a neural network approach, using the autoencoder architecture. Furthermore, the ability of neural networks to capture non-linear relationships suggests the possibility of developing non-linear variants of MDS based on autoencoders.
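A minimal sketch of this last idea (PyTorch, assumed layer sizes, toy data): a fully connected autoencoder whose low-dimensional bottleneck yields a non-linear embedding without ever forming an n x n distance matrix.

    # Minimal sketch, not a definitive implementation: autoencoder-based
    # non-linear dimensionality reduction (assumed layer sizes, toy data).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Autoencoder(nn.Module):
        def __init__(self, p: int, d: int = 2):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(p, 64), nn.ReLU(), nn.Linear(64, d))
            self.decoder = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, p))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    # Full-batch training on a toy data set: n = 2000 observations, p = 20 sensors.
    X = torch.randn(2000, 20)
    model = Autoencoder(p=20)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(200):
        opt.zero_grad()
        loss = F.mse_loss(model(X), X)         # reconstruction error
        loss.backward()
        opt.step()

    with torch.no_grad():
        embedding = model.encoder(X)           # n x 2 low-dimensional representation

Replacing the reconstruction loss with a stress-type loss on pairwise distances within mini-batches is one conceivable route toward the non-linear MDS variants mentioned above.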