Project summary

Advanced Statistics and Data Science 2: New data, new models, new challenges

In this century, statistics as a scientific discipline has witnessed remarkable technological and scientific changes. On the one hand, next-generation datasets (including Big Data) have proliferated; in addition to the problem of volume, they present other challenges that distinguish them from traditional tabular data. In particular, data coming from wearable devices (continuous health monitoring devices, smart watches, or even mobile phones that automatically generate continuous time-dependent data), from electroencephalograms (EEGs, recorded as long multivariate time series with spatially dependent components), and from graphs (social networks, communication networks, bibliographic networks) have become increasingly common.

On the other hand, the emergence of Data Science, a new scientific discipline midway between Mathematics, Computer Science, and Statistics that aims to extract meaningful information from data, has posed a challenge of its own. Although Statistics shares this goal, the term Data Science is associated with concepts and techniques developed outside the field of Statistics, such as machine learning (neural networks or NNs, random forests), deep learning (convolutional NNs, recurrent NNs), and Artificial Intelligence (AI). In particular, the recent emergence of popular generative AI tools, such as ChatGPT, clearly shows that AI presents both opportunities and ethical challenges, including the transparency and explainability of algorithmic predictive models. Over the past 15 years, a powerful line of research has developed around Interpretable Machine Learning (IML), also known as eXplainable AI (XAI).

The main objective of this project is to address the challenges posed by recent types of datasets (increasingly large and complex) and by new ways of analyzing them (more flexible but less transparent than traditional statistical techniques). We plan to pursue five lines of research:

(1) New directions in interpretability and explainability for predictive models. Our first goal is to develop an automatic procedure for identifying groups of explanatory variables that are jointly relevant in a prediction model. We will also introduce interpretability into functional regression. Finally, we plan to extend existing interpretability methods to the analysis of time series with recurrent NNs.

(2) Wearable device data: A Functional Data Analysis approach. We plan to extend linear local likelihood estimation to the Beta model with two functional parameters, allowing for the possibility of mixed effects.

(3) EEG data: Contributions of Functional Data Analysis and IML. We will transform EEG data into functional data, enabling new forms of analysis. In addition, we will introduce interpretability methods to the deep learning tools that are being used for EEG data. Finally, we will design an experimental stage of EEG data collection that will allow us to explore the limits of the methods used.

(4) Data coming from graphs: Prediction and Bayesian modeling. The degree distribution of a graph is often well fitted by generalizations of the Zipf distribution. We will study the role of these distributions in prediction and forecasting problems arising in graphs.

(5) Nonlinear dimensionality reduction methods for Big Data. Distance-based approaches to this problem are limited by memory and time constraints, which we will try to overcome. We will also explore the possibility of defining autoencoder versions of these methods.