Accepted papers

All 31 papers accepted at IDA 2022 appear in the Springer-Nature LNCS proceedings Advances in Intelligent Data Analysis XX.
The proceedings are freely available to all conference attendees on this webpage, from now until two weeks after the conference.

Oral presentations last 22 minutes including questions (ideally 17 + 5).

Wednesday 20th

Session 1 (chairwoman: Elisa Fromont)

  • Oral 1 (3:00-3:25 pm): Yoeri Poels and Vlado Menkovski. VAE-CE: Visual Contrastive Explanation using Disentangled VAEs Abstract
    The goal of a classification model is to assign the correct labels to data. In most cases, this data is not fully described by the given set of labels. Often a rich set of meaningful concepts exist in the domain that can describe each datapoint much more precisely. Such concepts can also be highly useful for interpreting the model’s classifications. In this paper, we propose Variational Autoencoder-based Contrastive Explanation (VAE-CE), a model that represents data with high-level concepts and uses this representation for both classification and explanation. The explanations are contrastive, conveying why a datapoint is assigned to one class rather than an alternative class. An explanation is specified as a set of transformations of the input datapoint, where each step changes a concept towards the contrastive class. We build the model using a disentangled VAE, extended with a new supervised method for disentangling individual dimensions. An analysis on synthetic data and MNIST validates the utility of the approaches to both disentanglement and explanation generation. Code is available at
  • Oral 2 (3:25-3:50 pm): Romaric Gaudel, Luis Galárraga, Julien Delaunay, Laurence Rozé and Vaishnavi Bhargava. S-LIME: Reconciling Locality and Fidelity in Linear Explanations Abstract
    The benefit of locality is one of the major premises of LIME, one of the most prominent methods to explain black-box machine learning models. This emphasis relies on the postulate that the more locally we look at the vicinity of an instance, the simpler the black-box model becomes, and the more accurately we can mimic it with a linear surrogate. As logical as this seems, our findings suggest that, with the current design of LIME, the surrogate model may degenerate when the explanation is too local, namely, when the bandwidth parameter σ tends to zero. Based on this observation, the contribution of this paper is twofold. Firstly, we study the impact of both the bandwidth and the training vicinity on the fidelity and semantics of LIME explanations. Secondly, and based on our findings, we propose S-LIME, an extension of LIME that reconciles fidelity and locality.

Session 2 (chairman: Bruno Crémilleux)

  • Oral 3 (4:15-4:40 pm): Hugo Ayats, Peggy Cellier and Sebastien Ferre. A Two-Step Approach for Explainable Relation Extraction Abstract
    Knowledge Graphs (KG) offer easy-to-process information. An important issue in building a KG from texts is the Relation Extraction (RE) task, which identifies and labels relationships between entity mentions. In this paper, to address the RE problem, we propose to combine a deep learning approach for relation detection and a symbolic method for relation classification. This combination offers both the performance of deep learning methods and the interpretability of symbolic methods. This method has been evaluated and compared with state-of-the-art methods on TACRED, a relation extraction benchmark, and has shown interesting quantitative and qualitative results.
  • Oral 4 (4:40-5:05 pm): Nirbhaya Shaji, Joao Gama and Rita P. Ribeiro. Bank statements to network features: Extracting features out of time series using visibility graph Abstract
    Non-traditional data like the applicant’s bank statement is a significant source for decision-making when granting loans. We find that we can use methods from network science on the applicant’s bank statements to convert inherent cash flow characteristics to predictors for default prediction in a credit scoring or credit risk assessment model. First, the credit cash flow is extracted from a bank statement and later converted into a visibility graph or network. Afterward, we use this visibility network to find features that predict the borrowers’ repayment behavior. We see that feature selection methods select all the five extracted features. Finally, SMOTE is used to balance the training data. The model using the features from the network and the standard features together is shown to have superior performance compared to the model that uses only the standard features, indicating the network features’ predictive power.
  • Oral 5 (5:05-5:30 pm): Mandani Ntekouli, Gerasimos Spanakis, Lourens Waldorp and Anne Roefs. Using Explainable Boosting Machine (EBM) to compare Idiographic and Nomothetic Approaches for Ecological Momentary Assessment (EMA) Data Abstract
    Previous research on EMA data of mental disorders was mainly focused on multivariate regression-based approaches modeling each individual separately. This paper goes a step further towards exploring the use of non-linear interpretable machine learning (ML) models in classification problems. ML models can enhance the ability to accurately predict the occurrence of different behaviors by recognizing complicated patterns between variables in data. To evaluate this, the performance of various ensembles of trees is compared to linear models using imbalanced synthetic and real-world datasets. After examining the distributions of AUC scores in all cases, non-linear models appear to be superior to baseline linear models. Moreover, apart from personalized approaches, group-level prediction models are also likely to offer an enhanced performance. Accordingly, two different nomothetic approaches to integrate data of more than one individual are examined, one using directly all data during training and one based on knowledge distillation. Interestingly, it is observed that in one of the two real-world datasets, the knowledge distillation method achieves improved AUC scores (mean relative change of +17% compared to personalized), showing how it can benefit EMA data classification and performance.
  • Oral 6 (5:30-5:55 pm): Toyah Overton, Allan Tucker, Tim James and Dimitar Hristozov. dunXai: DO-U-Net for Explainable (Multi-Label) Image Classification Abstract
    Artificial Intelligence (AI) and Machine Learning (ML) are becoming some of the most dominant tools in scientific research. Despite this, little is often understood about the complex decisions taken by the models in predicting their results. This disproportionately affects biomedical and healthcare research where explainability of AI is one of the requirements for its wide adoption. To help answer the question of what the network is looking at when the labels do not correspond to the presence of objects in the image but the context in which they are found, we propose a novel framework for Explainable AI that combines and simultaneously analyses Class Activation and Segmentation Maps for thousands of images. We apply our approach to two distinct, complex examples of real-world biomedical research, and demonstrate how it can be used to provide a global and concise numerical measurement of how distinct classes of objects affect the final classification. We also show how this can be used to inform model selection, architecture design and aid traditional domain researchers in interpreting the model results.

Thursday 21st

Session 3 (chairman: Matthijs Van Leeuwen)

  • Oral 7 (10:00-10:25 am): Maximilian Stubbemann and Gerd Stumme. LG4AV: Combining Language Models and Graph Neural Networks for Author Verification Abstract
    The verification of document authorships is important in various settings. Researchers are for example judged and compared by the amount and impact of their publications and public figures are confronted by their posts on social media. Therefore, it is important that authorship information in frequently used data sets is correct. The question of whether a given document is written by a given author is commonly referred to as authorship verification (AV). While AV is a widely investigated problem in general, only a few works consider settings where the documents are short and written in a rather uniform style. This makes most approaches impractical for bibliometric data. Here, authorships of scientific publications have to be verified, often with just abstracts and titles available. To this point, we present LG4AV which combines language models and graph neural networks for authorship verification. By directly feeding the available texts in a pre-trained transformer architecture, our model does not need any hand-crafted stylometric features that are not meaningful in scenarios where the writing style is, at least to some extent, standardized. By the incorporation of a graph neural network structure, our model can benefit from relations between authors that are meaningful with respect to the verification process.
  • Oral 8 (10:25-10:50 am): Stefan Horoi, Jessie Huang, Bastian Rieck, Guillaume Lajoie, Guy Wolf and Smita Krishnaswamy. Exploring the Geometry and Topology of Neural Network Loss Landscapes Abstract
    Recent work has established clear links between the generalization performance of trained neural networks and the geometry of their loss landscape near the local minima to which they converge. This suggests that qualitative and quantitative examination of the loss landscape geometry could yield insights about neural network generalization performance during training. To this end, researchers have proposed visualizing the loss landscape through the use of simple dimensionality reduction techniques. However, such visualization methods have been limited by their linear nature and only capture features in one or two dimensions, thus restricting sampling of the loss landscape to lines or planes. Here, we expand and improve upon these in three ways. First, we present a novel “jump and retrain” procedure for sampling relevant portions of the loss landscape. We show that the resulting sampled data holds more meaningful information about the network’s ability to generalize. Next, we show that non-linear dimensionality reduction of the jump and retrain trajectories via PHATE, a trajectory and manifold-preserving method, allows us to visualize differences between networks that are generalizing well vs poorly. Finally, we combine PHATE trajectories with a computational homology characterization to quantify trajectory differences.

Session 4 (chairman: Romaric Gaudel)

  • Oral 10 (11:15-11:40 am): Narjes Davari, Sepideh Pashami, Bruno Veloso, Slawomir Nowaczyk, Yuantao Fan, Pedro Pereira, Rita Ribeiro and João Gama. A fault detection framework based on LSTM autoencoder: a case study for Volvo bus data Abstract
    This study applies a data-driven anomaly detection framework based on a Long Short-Term Memory (LSTM) autoencoder network for several subsystems of a public transport bus. The proposed framework efficiently detects abnormal data, significantly reducing the false alarm rate compared to available alternatives. Using historical repair records, we demonstrate how detection of abnormal sequences in the signals can be used for predicting equipment failures. The deviations from normal operation patterns are detected by analysing the data collected from several on-board sensors (e.g., wet tank air pressure, engine speed, engine load) installed on the bus. The performance of LSTM autoencoder (LSTM-AE) is compared against the multi-layer autoencoder (mlAE) network in the same anomaly detection framework. The experimental results show that the performance indicators of the LSTM-AE network, in terms of F1 Score, Recall, and Precision, are better than those of the mlAE network.
  • Oral 9 (11:40 am-12:05 pm) – remotely: Mohamed Fakhfakh, Bassem Bouaziz, Faiez Gargouri and Lotfi Chaari. Efficient Bayesian learning of sparse deep artificial neural networks Abstract
    In supervised Machine Learning (ML), Artificial Neural Networks (ANN) are commonly utilized to analyze signals or images for a variety of applications. They are increasingly performing as a strong tool to establish the relationships among data and being successfully applied in science due to their generalization ability, noise and fault tolerance. One of the most difficult aspects of using the learning process is optimization of the network weights. A gradient-based technique with a back-propagation strategy is commonly used for this optimization stage. Regularization is commonly employed for the benefit of efficiency. This optimization gets difficult when non-smooth regularizers are applied, especially to promote sparse networks. Due to differentiability difficulties, traditional gradient-based optimizers cannot be employed. In this paper, we propose an MCMC-based optimization strategy within a Bayesian framework. An effective sampling strategy is designed using Hamiltonian dynamics. The suggested strategy appears to be effective in allowing ANNs with modest complexity levels to achieve high accuracy rates, as seen by promising findings.
  • Oral 11 (12:05-12:30 pm) – remotely: Yaroub Elloumi, Nesrine Abroug and Mohamed Hedi Bedoui. End-to-End Mobile System for Diabetic Retinopathy Screening Based on Lightweight Deep Neural Network Abstract
    Diabetic Retinopathy (DR) is the leading cause of visual impairment among working-aged adults. Screening and early diagnosis of DR is essential to avoid visual acuity reduction and blindness. However, a worldwide limited access to ophthalmologists may prevent an early diagnosis of this blinding condition. In this paper, we propose a novel method for screening DR from smartphone-captured fundus images. The main challenges are to perform highly accurate detection even with reduced quality of handheld captured fundus images and to provide the result on the smartphone used for acquisition. For such a need, we apply transfer learning to the lightweight deep neural network “NasnetMobile” which is used as a feature descriptor, while configuring a multi-layer perceptron classifier to deduce the DR disease, in order to take benefit from their lower complexity. A dataset composed of 440 fundus images is structured, where the acquisition and statement are performed by expert ophthalmologists. A cross-validation process is conducted where 95.91% accuracy, 94.44% sensitivity, 96.92% specificity and 95.71% precision on average are achieved. In addition, the whole processing flowchart is implemented on a mobile device, where the execution time is under one second whatever the fundus image is. Those performances allow deploying the proposed system in a clinical context.

Session 5 (chairman: Hendrik Blockeel)

  • Oral 12 (1:45-2:10 pm): Dusan Hetlerovic, Lubomir Popelinsky, Pavel Brazdil, Carlos Soares and Fernando Freitas. On Usefulness of Outlier Elimination in Classification Tasks Abstract
    Although outlier detection/elimination has been studied before, few comprehensive studies exist on when exactly this technique would be useful as preprocessing in classification tasks. The objective of our study is to fill in this gap. We have performed experiments with 12 various outlier elimination methods (OEMs) and 10 classification algorithms on 50 different datasets. The results were then processed by the proposed reduction method, whose aim is to identify the most useful workflows for a given set of tasks (datasets). The reduction method has identified just three OEMs that are generally useful for the given set of tasks. We have shown that the inclusion of these OEMs is indeed useful, as it leads to lower loss in accuracy and the difference is quite significant (0.5%) on average.
  • Oral 13 (2:10-2:35 pm): Maciej Grzenda. Quantifying Changes in Predictions of Classification Models for Data Streams Abstract
    Evaluation methods for data stream classification have frequently been focused on how available data are used for learning a model and for its performance assessment, with major emphasis on the difference between predicted and true labels. More recently, growing interest in delayed labelling evaluation has resulted in the evaluation of multiple predictions made by an evolving model for an instance before its true label arrival. Still, under this setting, predictions are compared with true labels and the changes in predictions themselves are not the focus. In this study, we aim to provide an intuitive evaluation framework to quantify changes in predictions made over time for the same input instances by evolving classification models. The primary motivation is to gain insight into the impact of the evolution of a classification model on the changes in decision boundaries, which may effectively re-assign the instances to other classes. The prediction change measures proposed in this study make it possible to reveal the scale of such changes. Furthermore, the notions of volatility of predictions and productive volatility are proposed and quantified. Results for a number of real and synthetic data streams show that similar accuracy of the models can be accompanied by significantly different volatility of predictions made by these models.
  • Oral 14 (2:35-3:00 pm): Ekaterina Antonenko and Jesse Read. Multi-Modal Ensembles of Regressor Chains for Multi-Output Prediction Abstract
    Multi-target regression is a predictive task involving multiple numerical outputs per instance. In the domain of multi-label classification there exist a large number of techniques that successfully model outputs together. Classifier Chains is one such example that is naturally extendable to the multi-target regression task (as Regressor Chains). However, although this method is straightforward to adapt to the regression setting, large improvements over independent models (as seen already in the multi-label classification context over the recent decade) have not as of yet been obtained from Regressor Chains. One of the reasons for this is the adoption of squared-error-based loss metrics which do not require consideration of joint-target modeling. In this paper, we consider cases where the predictive distribution can be multimodal. Such a scenario, which easily manifests in real-world tasks involving uncertainty, motivates a different loss metric and, thereby, a different approach. We thus present a new method for multi-target regression: Multi-Modal Ensemble of Regressor Chains (mmERC), which performs competitively on datasets exhibiting a multi-modal distribution, both against independent regressors and state-of-the-art ensembles of regressor chains. We argue that such distributions are not sufficiently considered in the regression and particularly multi-target regression literature.
  • IDA 2022 Frontier Prize. Oral 15 (3:00-3:25 pm): Lucile Dierckx, Mélanie Beauvois and Siegfried Nijssen. Detection and Multi-Label Classification of Bats Abstract
    As bats are an important indicator for the health of their habitat, projects in multiple countries monitor bat populations by collecting audio recordings of bat calls. Analyzing these recordings is however a tedious task and there is a need for systems that accurately and efficiently detect and classify bat calls. While earlier studies focused on detection and classification separately, in this paper we propose a first approach that combines these two tasks. Moreover, we aim to build a multi-label classifier that is able to detect if multiple bat species are present in the same audio recording. One of the challenges we face is that the available data focuses either on detection or single-label classification, but not on the combined task of detection and multi-label classification. We propose to address this by a data augmentation approach and demonstrate that the resulting approach achieves the objectives of being accurate and efficient.

Session 6 (chairman: Michael Berthold) keynote

Friday 22nd

Session 7 (chairman: Panče Panov)

  • Oral 16 (9:00-9:25 am): Wen-Chi Yang, Arcchit Jain, Luc De Raedt and Wannes Meert. Parameter Learning in ProbLog With Annotated Disjunctions Abstract
    In parameter learning, a partial interpretation most often contains information about only a subset of the parameters in the program. However, standard EM-based algorithms use all interpretations to learn all parameters, which significantly slows down learning. To tackle this issue, we introduce EMPLiFI, an EM-based parameter learning technique for probabilistic logic programs, that improves the efficiency of EM by exploiting the rule-based structure of logic programs. In addition, EMPLiFI enables parameter learning of multi-head annotated disjunctions in ProbLog programs, which was not yet possible in previous methods. Theoretically, we show that EMPLiFI is correct. Empirically, we compare EMPLiFI to LFI-ProbLog and EMBLEM. The results show that EMPLiFI is the most efficient in learning single-head annotated disjunctions. In learning multi-head annotated disjunctions, EMPLiFI is more accurate than EMBLEM, while LFI-ProbLog cannot handle this task.
  • Oral 17 (9:25-9:50 am): Mina Rafla, Nicolas Voisine and Bruno Cremilleux. Evaluation of Uplift Models with Non-Random Assignment Bias Abstract
    Uplift Modeling measures the impact of an action (marketing, medical treatment) on a person’s behavior. This allows the selection of the subgroup of persons for which the effect of the action will be most noteworthy. Uplift estimation is based on groups of people who have received different treatments. These groups are assumed to be equivalent. However, in practice, we observe biases between these groups. We propose in this paper a protocol to evaluate and study the impact of the Non-Random Assignment bias (NRA) on the performance of the main uplift methods. Then we present a weighting method to reduce the effect of the NRA bias. Experimental results show that our bias reduction method significantly improves the performance of uplift models under NRA bias.

Session 8 (chairman: Joao Gama)

  • Oral 18 (3:00-3:25 pm): Etienne Lehembre, Ronan Bureau, Bruno Cremilleux, Bertrand Cuissart, Jean-Luc Lamotte, Alban Lepailleur, Abdelkader Ouali and Albrecht Zimmermann. Selecting Outstanding Patterns Based on their Neighbourhood Abstract
    The purpose of pattern mining is to help experts understand their data. Following the assumption that an analyst expects neighboring patterns to show similar behavior, we investigate the interestingness of a pattern given its neighborhood. We define a new way of selecting outstanding patterns, based on an order relation between patterns and a quality score. An outstanding pattern shows only small syntactic variations compared to its neighbors but deviates strongly in quality. Using several supervised quality measures, we show experimentally that only very few patterns turn out to be outstanding. We also illustrate our approach with patterns mined from molecular data.
  • Oral 19 (3:25-3:50 pm): Hadi Fanaee-T. Tensor Completion Post-Correction Abstract
    Many real-world tensors come with missing values. The task of estimating such missing elements is called tensor completion (TC). It is a fundamental problem with a wide range of applications in data mining, machine learning, signal processing, and computer vision. In the last decade, several different algorithms have been developed, a couple of which have shown high-quality performance in diverse domains. However, our investigation shows that even state-of-the-art TC algorithms sometimes make poor estimations for a few cases that are not noticeable if we look at their overall performance. Such wrong estimates might nevertheless have a severe effect on some decisions. It becomes a crucial issue in applications where humans are involved. Making bad decisions based on such poor estimations can harm fairness. We propose the first algorithm for tensor completion post-correction, called TCPC, to identify some of such poor estimates from the output of any TC algorithm and refine them with more realistic estimations. Our initial experiments with five real-life tensor datasets show that TCPC is an effective post-correction method.

Session 9 (chairman: Wouter Duivesteijn)

  • Oral 20 (4:15-4:40 pm): Javier Perez Tobia, Apurva Narayan and Phillip Braun. AGS: Attribution Guided Sharpening as a Defense Against Adversarial Attacks Abstract
    Even though deep learning has allowed for significant advances in the last decade, it is still vulnerable to adversarial attacks: inputs that, despite looking similar to clean data, can force neural networks to make incorrect predictions. Moreover, deep learning models usually act as a black box or an oracle that does not provide any explanations behind its outputs. In this paper, we propose Attribution Guided Sharpening (AGS), a defense against adversarial attacks that incorporates explainability techniques as a means to make neural networks robust. AGS uses the saliency maps generated on a non-robust model to guide Choi and Hall’s sharpening method to denoise input images before passing them to a classifier. We show that AGS can outperform previous defenses on three benchmark datasets: MNIST, CIFAR-10 and CIFAR-100, and achieve state-of-the-art performance against AutoAttack.
  • Oral 21 (4:40-5:05 pm): Daniel Schuster, Emanuel Domnitsch, Sebastiaan J. van Zelst and Wil van der Aalst. A Generic Trace Ordering Framework for Incremental Process Discovery Abstract
    Executing operational processes generates valuable event data in organizations’ information systems. Process discovery describes the learning of process models from such event data. Incremental process discovery algorithms allow learning a process model from event data gradually. In this context, process behavior recorded in event data is incrementally fed into the discovery algorithm that integrates the added behavior to a process model under construction. In this paper, we investigate the open research question of the impact of the ordering of incrementally selected process behavior on the quality, i.e., recall and precision, of the learned process models. We propose a framework for defining ordering strategies for traces, i.e., observed process behavior, for incremental process discovery. Further, we provide concrete instantiations of this framework. We evaluate different trace-ordering strategies on real-life event data. The results show that trace-ordering strategies can significantly improve the quality of the learned process models.
  • Oral 22 (5:05-5:30 pm): Fabian Hinder, Valerie Vaquet, and Barbara Hammer. Suitability of Different Metric Choices for Concept Drift Detection Abstract
    The notion of concept drift refers to the phenomenon that the distribution underlying the observed data changes over time; as a consequence, machine learning models may become inaccurate and need adjustment. Many unsupervised approaches for drift detection rely on measuring the discrepancy between the sample distributions of two time windows. This may be done directly, after some preprocessing (feature extraction, embedding into a latent space, etc.), or with respect to inferred features (mean, variance, conditional probabilities, etc.). Most drift detection methods can be distinguished by what metric they use, how this metric is estimated, and how the decision threshold is found. In this paper, we analyze the structural properties of the drift-induced signals in the context of different metrics. We compare different types of estimators and metrics theoretically and empirically and investigate the relevance of the single metric components. In addition, we propose new choices and demonstrate their suitability in several experiments.

Poster Presentations (spotlights 9:50 to 10:25 am) from 10:50 am to 12:50 pm

  • Thomas Villmann, Daniel Staps, Jensun Ravinchandran, Sascha Saralajew, Michael Biehl and Marika Kaden. A Learning Vector Quantization Architecture for Transfer Learning Based Classification by Means of Nullspace Evaluation Abstract
    We present a method that allows training a Generalized Matrix Learning Vector Quantization (GMLVQ) model for classification using data from several, maybe non-calibrated, sources without explicit transfer learning. This is achieved by using a siamese-like GMLVQ-architecture, which comprises different sets of prototypes for the target classification and for the separation learning of the sources. In this architecture, a linear map is trained by means of GMLVQ for source distinction in the mapping space in parallel to the classification task learning. The respective null-space projection provides a common data representation of the different source data for an all-together classification learning.
  • Stefany Guarnizo, Ioanna Miliou and Panagiotis Papapetrou. Impact of dimensionality on nowcasting seasonal influenza with environmental factors Abstract
    Seasonal influenza is an infectious disease of multi-causal etiology and a major cause of mortality worldwide that has been associated with environmental factors. In the attempt to model and predict future outbreaks of seasonal influenza with multiple environmental factors, we face the challenge of increased dimensionality that makes the models more complex and unstable. In this paper, we propose a nowcasting and forecasting framework that compares the theoretical approaches of Single Environmental Factor and Multiple Environmental Factors. We introduce seven solutions to minimize the weaknesses associated with the increased dimensionality when predicting seasonal influenza activity level using multiple environmental factors as external proxies. Our work provides evidence that using dimensionality reduction techniques as a strategy to combine multiple datasets improves seasonal influenza forecasting without the penalization of increased dimensionality.
  • Lu Yin, Vlado Menkovski, Yulong Pei and Mykola Pechenizkiy. Semantic-Based Few-Shot Classification by Psychometric Learning Abstract
    Few-shot classification tasks aim to classify images in query sets based on only a few labeled examples in support sets. Most studies usually assume that each image in a task has a single and unique class association. Under these assumptions, these algorithms may not be able to identify the proper class assignment when there is no exact matching between support and query classes. For example, given a few images of lions, bikes, and apples to classify a tiger. However, in a more general setting, we could consider the higher-level concept, the large carnivores, to match the tiger to the lion for semantic classification. Existing studies rarely considered this situation due to the incompatibility of label-based supervision with complex conception relationships. In this work, we advance the few-shot learning towards this more challenging scenario, the semantic-based few-shot learning, and propose a method to address the paradigm by capturing the inner semantic relationships using psychometric learning. The experiment results on the CIFAR-100 dataset show the superiority of our method for the semantic-based few-shot learning compared to the baseline.
  • Joost F. van der Haar, Sander C. Nagelkerken, Igor G. Smit, Kjell van Straaten, Janneke A. Tack, Rianne M. Schouten and Wouter Duivesteijn. Efficient Subgroup Discovery Through Auto-Encoding Abstract
    Current subgroup discovery methods struggle to produce good results for large real-life datasets with high dimensionality. Run times can become high and dependencies between attributes are hard to capture. We propose a method in which auto-encoding is applied for dimensionality reduction before subgroup discovery is performed. In an experimental study, we find that auto-encoding increases both the quality and coverage for our dataset with over 500 attributes. On the dataset with over 250 attributes and the one with the most instances, the coverage improves, while the quality remains similar. For smaller datasets, quality and coverage remain similar or see a minor decrease. Additionally, we greatly improve the run time for each dataset-algorithm combination; for the datasets with over 250 and 500 attributes run times decrease by a factor of on average 150 and 200, respectively. We conclude that dimensionality reduction is a promising method for subgroup discovery in datasets with many attributes and/or a high number of instances.
  • Huiyao Wu and Maryam Tavakol. MuseBar: Alleviating Posterior Collapse in Recurrent VAEs toward Music Generation Abstract
    Machine learning has shown remarkable artistic values and commercial potentials in the music industry. Recurrent variational autoencoders (RVAEs) have been widely applied to this area due to the condensing, inclusive, and smooth nature of their latent space. However, RNNs are powerful auto-regressive models on their own, where the decoder in a RVAE can be strong enough to work independently from the encoder. When this happens, the model degrades from an autoencoder to a traditional RNN, which is known as posterior collapse. In this paper, we propose a cost-effective bar-wise regulation schema called MuseBar to alleviate this problem for music generation. We impose a prior on the hidden state of every music bar in the RNN encoder, instead of only on the last hidden state as in the standard RVAEs, such that the latent code is learned under stronger regulations. We further evaluate our proposed method, quantitatively and qualitatively, with extensive experiments on manually scraped musical data. The results demonstrate that the barwise regulation significantly improves the quality of the latent space in terms of Mutual Information and Kullback-Leibler divergence.
  • Nádia Soares, João Gonçalves, Raquel Vasconcelos and Rita Ribeiro. Combining Multiple Data Sources to Predict IUCN Conservation Status of Reptiles Abstract
    Biodiversity loss is a hot topic. We are losing species at a high rate, even before their extinction risk is assessed. The International Union for Conservation of Nature (IUCN) Red List is the most complete assessment of all species conservation status, yet it only covers a small part of the species identified so far. Additionally, many of the existing assessments are outdated, either due to the ever-evolving nature of taxonomy or to the lack of reassessments. The assessment of the conservation status of a species is a long, mostly manual process that needs to be carefully done by experts. The conservation field would gain by having ways of automating this process, for instance, by prioritizing the species which experts and financing should focus on. In this paper, we present a pipeline used to derive a conservation dataset out of openly available data and obtain predictions, through machine learning techniques, on which species are most likely to be threatened. We applied this pipeline to the different groups within the Reptilia class as a model of one of the most under-assessed taxonomic groups. Additionally, we compared the performance of models using datasets that include different sets of predictors describing species ecological requirements and geographical distributions such as IUCN’s area and extent of occurrence. Our results show that most groups benefit from using ecological variables together with IUCN predictors. Random Forest appeared as the best method for most species groups, and feature selection was shown to improve results.
  • Ammar Shaker, Francesco Alesiani and Shujian Yu. Modular-Relatedness for Continual Learning Abstract
    Deep Neural Network (NN) architectures often achieve super-human performance in many application domains. Recent models are made of up to billions of parameters (e.g. GPT2 and GPT3 for Natural Language Processing) and require massive training resources. How can these models be trained on sequences of tasks without negatively affecting each other? Continual Learning (CL) methods tackle the problem of incrementally updating NN models with new tasks while retaining the performance on previously learned tasks. In this paper, we propose a continual learning (CL) technique that is beneficial to sequential task learners by improving their retained accuracy and reducing catastrophic forgetting. The principal target of our approach is the automatic extraction of modular parts of the neural network (NN) and then estimating the relatedness between the tasks given these modular components. This technique is applicable to the CL family of rehearsal-based (e.g., the Gradient Episodic Memory) approaches where episodic memory is needed. Empirical results demonstrate remarkable performance gain (in terms of robustness to forgetting) for methods such as GEM based on our technique, especially when the memory budget is very limited.
  • Yann Dauxais, Urchade Zaratiana, Matthieu Laneuville, Simon Hernandez, Pierre Holat and Charlie Grosman. Towards Automation of Topic Taxonomy Construction Abstract
    The automation of taxonomy construction has increased in popularity recently. Such interest for the domain has been motivated by the large number of new scientific papers published each year that implies a growing difficulty in following the new topics of the different scientific domains and their importance in the topic hierarchy. In this paper, we propose a way to automatically construct topic taxonomies from millions of scientific article abstracts and ways to automatically evaluate this construction. While, to our knowledge, other approaches rely on pipelines of models and human evaluation to validate them, we chose to rely on simple models that are easier to evaluate automatically and, thus, promote the improvement of our models thanks to a large number of iterations. The contribution of this paper is threefold: 1) the proposition of a new method to construct taxonomies from a large set of scientific papers, 2) a method to precompile taxonomy information into matrices that will be quickly queried, and 3) an objective method to automatically evaluate the constructed taxonomies without requiring human evaluation.
  • Stepan Veretennikov, Koen Minartz, Vlado Menkovski, Burcu Gumuscu and Jan de Boer. Simulation of Scientific Experiments with Generative Models Abstract
    Lab experiments are a crucial part of research in natural sciences. High-throughput screening is leveraged to generate hypotheses, by evaluating a wide range of experimental parameter values and accumulating a wealth of data on the corresponding experimental outcomes. The data is subsequently analyzed to design new rounds of experiments. While discriminative models have previously proven useful for screening data analytics, they do not account for randomness inherent to lab experiments, and do not have the capacity to capture the potentially high-dimensional relationship between the experiment input parameters and outcomes. Instead, we take a data-driven simulation perspective on the problem. Inspired by biomaterials research experiments, we consider a case where both the input parameter space and the outcome space have a high-dimensional (image) representation. We propose a deep generative model that serves simultaneously as a simulation model of the experiment, i.e. allows generating potential outcomes conditioned on the experiment input, and as a tool for inverse design, i.e. generating instances of inputs that could lead to a given experiment outcome. A proof-of-concept evaluation on a synthetic dataset shows that the model is able to learn the embedded relationship between the properties of the input and of the output in a probabilistic manner and allows for experiment simulation and design application scenarios.

