Otto-von-Guericke-Universität Magdeburg


Research Colloquium on Data and Knowledge Engineering

This research colloquium presents current research in the area of Data and Knowledge Engineering (DKE). The colloquium usually takes place on Thursdays from 13:00 s.t. in Room G29-301.
Please direct any questions about the colloquium to Andreas Nürnberger.

Upcoming Talks:

3 July 2017 (13:15 s.t., Room G29-301)
Theory and Practice of Big Data Analytics for Railway Transportation Systems
Assoc. Prof. Luca Oneto (University of Genoa, Italy)

Big Data analytics is one of the current trending research interests in many industrial sectors, and in particular in the context of railway transportation systems. Many aspects of the railway world can benefit greatly from new technologies and methodologies for collecting, storing, processing, analysing, and visualizing large amounts of data, as well as from methods from machine learning, artificial intelligence, and computational intelligence that extract actionable information from those data. The EC H2020 In2Rail project is a prime example of an initiative designed to bring big data technologies into the railway world. The purpose of this talk is to show how theory and practice must be exploited together in order to solve real big data analytics problems in the field of railway transportation systems. In particular, we will focus on one of the problems we are facing in the In2Rail project: predicting train delays in the Italian railway network by exploiting both data from Rete Ferroviaria Italiana and exogenous data sources. For this purpose, we will draw on the most recent advances in the analytics field, from deep learning to the thresholdout model-selection framework.

Past Talks:

12 May 2017 (10:00 s.t., Room G29-301)
Big Data Visualization: Graphics quality factors
Prof. Dr. Juan J. Cuadrado Gallego (Universidad de Alcalá, Spain)

Nowadays Big Data is used in almost every field of human knowledge. Its main goal is to analyze large databases in order to find useful information that expands the knowledge of the field to which it is applied, and the reason to acquire knowledge is, in turn, to share it. It is at these two points, analysis and communication, that data visualization plays an ever larger role: it can make the analysis of data faster and the transmission of the acquired knowledge easier. For these reasons, many new data graphics are used and published every day. But do all of them deliver what they are used for? That is, do they all allow an easier and faster analysis of the databases, and an easier and faster transmission of the information and knowledge obtained from that analysis? The answer is no, because it is not enough simply to use graphics to improve big data analysis. The user must know when and how to use data visualization. It is not enough to know how a graphic is built; one must also know which design aspects make a graphic useful. This talk introduces the quality aspects that must be applied to obtain not just data visualization, but high-quality data visualization.

11 May 2017 (11:00 s.t., Room G29-301)
Three Algorithms Inspired by Data from the Life Sciences
Dr. Allan Tucker (Brunel University London)

In this talk I will discuss how the analysis of real-world data from health and the environment can shape novel algorithms. Firstly, I will discuss some of our work on modelling clinical data. In particular I will discuss the collection of longitudinal data and how this creates challenges for diagnosis and the modelling of disease progression. I will then discuss how cross-sectional studies offer additional useful information that can be used to model disease diversity within a population but lack valuable temporal information. Finally, I will discuss the importance of inferring models that generalise well to new independent data and how this can sometimes lead to new challenges, where the same variables can represent subtly different phenomena. Some examples in ecology and genomics will be described.

10 May 2017 (17:00 s.t., Room G29-301)
Multiobjective Clustering
Prof. Dr. Sanghamitra Bandyopadhyay (Indian Statistical Institute, Kolkata)

When the only data available is unlabelled, clustering is one of the primary operations applied. The objective is to group data points that are similar to each other while clearly separating dissimilar groups. In clustering, some similarity/dissimilarity metric is usually optimized so that a pre-defined objective attains its optimal value; the problem of clustering is therefore essentially one of optimization. The use of metaheuristic methods such as genetic algorithms has been demonstrated successfully in the past for clustering data sets. The clustering problem inherently admits a number of criteria, or cluster validity indices, that have to be optimized simultaneously to obtain improved results. Hence, in recent times the problem has been posed in a multiobjective optimization (MOO) framework and popular metaheuristics for multiobjective optimization have been applied. In this talk, we will first briefly discuss the fuzzy c-means algorithm, followed by an introduction to the basic principles of MOO and the popular NSGA-II algorithm. Subsequently it will be shown how the algorithm is useful for solving the clustering problem. Since such algorithms provide a number of solutions, a way of combining the multiple clustering solutions so obtained into a single one using supervised learning will be explained. Finally, results will be demonstrated on the clustering of some popular gene expression data sets.
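As background for the talk's starting point, here is a minimal sketch of the fuzzy c-means iteration that frames clustering as optimization; the function name and the toy data are illustrative, not taken from the talk:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: alternate centroid and membership updates.

    U[i, k] is the degree to which point i belongs to cluster k;
    m > 1 controls the fuzziness of the partition.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                 # rows sum to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # membership-weighted means
        # distance of every point to every centre (eps avoids division by 0)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)      # standard FCM update
    return centers, U
```

A multiobjective formulation, as in the talk, would evaluate each candidate partition under several validity indices instead of this single objective.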

19 January 2017 (13:00 s.t., Room G29-301)
Random Shapelet Forests for time series classification
Prof. Panagiotis Papapetrou (Stockholm University)

In this talk I will present a novel technique for time series classification called the random shapelet forest. Shapelets are discriminative subsequences of time series, usually embedded in shapelet-based decision trees. The enumeration of time series shapelets is, however, computationally costly, which, together with the inherent difficulty decision-tree learning has in handling high-dimensional data effectively, severely limits the applicability of shapelet-based decision-tree learning to large (multivariate) time series databases.

In the first part of the talk I will discuss a novel tree-based ensemble method for univariate and multivariate time series classification using shapelets, called the generalized random shapelet forest algorithm. The algorithm generates a set of shapelet-based decision trees, where both the choice of instances used for building a tree and the choice of shapelets are randomized. For univariate time series, it is demonstrated through an extensive empirical investigation that the proposed algorithm yields predictive performance comparable to the current state-of-the-art and significantly outperforms several alternative algorithms, while being at least an order of magnitude faster. Similarly for multivariate time series, it is shown that the algorithm is significantly less computationally costly and more accurate than the current state-of-the-art.

The second part of the talk will focus on early classification of time series. I will present a novel technique that extends the random shapelet forest to allow for early classification of time series. An extensive empirical investigation has shown that the proposed algorithm is superior to alternative state-of-the-art approaches when predictive performance is considered more important than earliness. The algorithm allows for tuning the trade-off between accuracy and earliness, thereby supporting the generation of early classifiers that can be dynamically adapted to specific needs at low computational cost.
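To make the shapelet notion concrete, here is a hedged sketch of the distance computation that shapelet-based trees split on, together with the kind of randomized shapelet choice the forest relies on instead of exhaustive enumeration (function names are illustrative, not from the paper):

```python
import numpy as np

def shapelet_distance(series, shapelet):
    """Distance of a shapelet to a series: the minimum Euclidean distance
    over all sliding windows of the shapelet's length."""
    windows = np.lib.stride_tricks.sliding_window_view(series, len(shapelet))
    return np.sqrt(((windows - shapelet) ** 2).sum(axis=1)).min()

def sample_shapelet(X_train, rng, min_len=2):
    """Randomized shapelet choice: a random subsequence of a random series."""
    series = X_train[rng.integers(len(X_train))]
    length = rng.integers(min_len, len(series) + 1)
    start = rng.integers(len(series) - length + 1)
    return series[start:start + length]
```

Each tree node would threshold `shapelet_distance` against a split value; randomizing both the training instances and the shapelets is what removes the enumeration cost.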

15 December 2016 (13:00 s.t., Room G29-301)
Handling Time-Series Data with Visual Analytics: Challenges and Examples
Dr. Theresia Gschwandtner (TU Wien)

Due to the ever-growing amounts of available data, we need effective ways to make these often complex and heterogeneous data accessible and analyzable. The aim of Visual Analytics (VA) is to support this information discovery process by combining humans' outstanding capabilities of visual perception with the computational power of computers. By combining interactive visualizations for the visual exploration of trends, patterns, and relationships with automatic methods such as machine learning and data mining, VA enables knowledge discovery in large and complex bodies of data. The design of such VA solutions, however, requires careful consideration of the data and tasks at hand, as well as of the knowledge and capabilities of the user who is going to work with the solution. Dealing with time-oriented data makes this task even more complex, as time is an exceptional data dimension with special characteristics. In my talk, I will illustrate different aspects and characteristics of time-oriented data and how we tackled these problems in previous work with respect to data, users, and tasks. I will give several examples of VA solutions developed in our group, with a particular focus on examples from the healthcare domain.

7 July 2016 (15:00 s.t., Room G29-301)
Active Learning for Classification Problems Using Structural Information
Dr. rer. nat. Tobias Reitmaier (Universität Kassel)

Nowadays, media, commercial, and personal content is increasingly consumed, exchanged, and thus stored in the digital world. IT companies increasingly try to exploit these data commercially using data mining and machine learning methods, which usually requires a time- and cost-intensive categorization or classification of the data. An efficient approach to reducing these costs is active learning (AL): AL steers the training process of a classifier by deliberately querying individual data points, which are then labelled with a class by experts. However, an analysis of current methods shows that AL still has deficits. In particular, structural information, given by the spatial arrangement of the (un)labelled data, is used insufficiently. Moreover, many existing AL techniques pay too little attention to their practical applicability. To meet these challenges, this talk presents several solutions that build on one another: First, probabilistic generative models are used to capture the structure of the data, and the self-adaptive, (almost) parameter-free selection strategy 4DS (Distance-Density-Distribution-Diversity Sampling) is developed, which uses structural information for sample selection. Then, the AL process is extended by a transductive learning process in order to iteratively refine the data model during learning based on the class information that becomes available. Building on this, the new data-dependent kernel RWM (Responsibility Weighted Mahalanobis) is defined for AL training of a support vector machine (SVM).
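For contrast with the structure-aware 4DS strategy, here is a minimal sketch of the classical uncertainty (margin) sampling baseline that such work improves upon; the nearest-centroid classifier and function name are illustrative assumptions, not the talk's method:

```python
import numpy as np

def uncertainty_sampling_round(X_pool, centroids):
    """Pick the index of the pool point whose distances to its two nearest
    class centroids differ least, i.e. the most ambiguous point, which is
    then sent to an expert for labelling."""
    d = np.linalg.norm(X_pool[:, None, :] - centroids[None, :, :], axis=2)
    d.sort(axis=1)                # per point: nearest, second nearest, ...
    margin = d[:, 1] - d[:, 0]    # small margin = high uncertainty
    return int(margin.argmin())
```

This baseline looks only at the current decision boundary; the talk's point is that the spatial structure of the unlabelled pool carries additional information it ignores.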

1 July 2016 (13:00 s.t., Room G29-301)
Joining Forces Against Crime: How Combining Data Mining and Game Theory Can Help Fight Crime
Prof. Richard Weber (Department of Industrial Engineering, Universidad de Chile)

Data mining methods have been used successfully for many years to detect crime patterns, with applications in, for example, fraud detection, the prediction of crimes in public spaces, and cybercrime. Typically, the data analysed are those describing the crime. In many cases, however, the interaction between criminals and those responsible for security is not adequately taken into account. In this talk we present a hybrid model for the classification of crime patterns that models this interaction explicitly. Using the identification of phishing emails as an example, a game between "attacker" and "defender" is described, which provides input for a binary classifier based on support vector machines. On the basis of a large data set, we show the advantages the described hybrid model offers. Numerous starting points for further work indicate the potential for future applied research.

25 May 2016 (14:15, Room 301)
Metro Maps: Straight-line, Curved, and Concentric
Prof. Dr. Alexander Wolff (Universität Würzburg)

The first schematic metro maps appeared in the 1930s, when the networks became too big to be readable in a geographically correct layout. Only 70 years later did computer scientists start to investigate how to automate the drawing of metro maps. In my talk, I will present a few of these approaches.

14 April 2016 (13:00, Room 301)
Space, Time, and Visual Analytics
Prof. Natalia Andrienko, Prof. Gennady Andrienko (Fraunhofer IAIS and City University London)

Visual analytics aims to combine the strengths of human and computer data processing. Visualization, whereby humans and computers cooperate through graphics, is the means through which this is achieved. Sophisticated synergies are required for analyzing spatio-temporal data and solving spatio-temporal problems. It is necessary to take into account the specifics of geographic space, time, and spatio-temporal data. While a wide variety of methods and tools are available, it is still hard to find guidelines for considering a data set systematically from multiple perspectives. To fill this gap, we systematically consider the structure of spatio-temporal data and possible transformations, and demonstrate several workflows for the comprehensive analysis of different data sets, paying special attention to the investigation of data properties. We shall show several workflows for the analysis of real data sets on human mobility, city traffic, animal movement, and football. We finish the talk by outlining directions for future research, including semantic-level analysis and big data.

21 January 2016 (13:15, Room 301)
Learning Shortest Paths for Text Summarisation
Prof. Dr. Ulf Brefeld (Leuphana Universität Lüneburg)

We cast multi-sentence compression as a structured prediction problem. Related sentences are represented by a word graph such that every path in the graph is considered a (more or less meaningful) summary of the collection. We propose to adapt shortest path algorithms to the data at hand so that the shortest path realises the best possible summary. We report on empirical results and compare our approach to state-of-the-art baselines using word graphs. The proposed technique can be applied to a great variety of objectives that are traditionally solved by dynamic programming. I'll conclude with a short discussion of learning knapsack-like problems using the same framework.
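A minimal sketch of the word-graph idea: whitespace tokenisation and the 1/frequency edge weight below are simplifying assumptions, not the learned weights the talk proposes:

```python
import heapq
from collections import defaultdict

def word_graph(sentences):
    """Merge sentences into a word graph: nodes are words, and edge weight
    1/count makes transitions shared by many sentences cheap."""
    counts = defaultdict(int)
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            counts[(a, b)] += 1
    graph = defaultdict(list)
    for (a, b), c in counts.items():
        graph[a].append((b, 1.0 / c))
    return graph

def shortest_path(graph, src="<s>", dst="</s>"):
    """Plain Dijkstra over the word graph; the path is the compression."""
    pq = [(0.0, src, [src])]
    seen = set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == dst:
            return path[1:-1]          # strip the sentence markers
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(pq, (cost + w, nxt, path + [nxt]))
    return []
```

On related sentences this plain shortest path can degenerate to a very short, uninformative summary, which is exactly the behaviour that motivates learning the edge weights from data rather than fixing them by frequency.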

10 December 2015 (13:00, Room G26.1-010)
Trajectories Through the Disease Process: Cross Sectional and Longitudinal Data Analysis
Dr. Allan Tucker (Brunel University London)

Degenerative diseases such as cancer, Parkinson’s disease, and glaucoma are characterised by a continuing deterioration of organs or tissues over time. This monotonic increase in the severity of symptoms is not always straightforward, however. The rate can vary in a single patient during the course of their disease, so that sometimes rapid deterioration is observed and at other times the symptoms may stabilise (or even improve, for example when medication is used). The characteristic course of many degenerative diseases is nonetheless a general transition from healthy to early onset to advanced stages. Clinical trials are typically conducted over a population within a defined time period in order to illuminate certain characteristics of a health issue or disease process. These cross-sectional studies provide a snapshot of disease processes over a large number of people but do not allow us to model the temporal nature of disease, which is essential for detailed prognostic predictions. Longitudinal studies, on the other hand, are used to explore how these processes develop over time in a number of people, but they can be expensive and time-consuming, and many studies only cover a relatively small window within the disease process. This talk explores the application of intelligent data analysis techniques for building reliable models of disease progression from both cross-sectional and longitudinal studies. The aim is to learn disease `trajectories' from cross-sectional data, integrating longitudinal data and taking into account the sometimes non-stationary nature of the disease process.

19 November 2015 (13:15, Room G29-301)
Knowledge based Tax Fraud Fighting
Prof. Dr. Hans-Joachim Lenz (Freie Universität Berlin)

Tax fraud is a criminal activity in which a manager of a firm, or at least one tax payer, intentionally manipulates tax data to deprive the tax authorities or the government of money for his own benefit. Tax fraud is a kind of data fraud, and it happens everywhere in daily life, all the time. Data fraud is characterized by four fields: spying out, data plagiarism, manipulation, and fabrication. Tax fraud investigations can be embedded into the methodology of knowledge-based reasoning. One way is to apply case-based reasoning, where similar stored cases are retrieved and their information re-used. Alternatively, we put the focus on Bayesian learning theory as a stepwise procedure integrating prior information, facts from first (and follow-up) investigations, and partial or background information. There is, and will be, no omnibus test available to detect the underlying manipulations of (even double-entry) bookkeeping data in business with high precision. However, a bundle of techniques exists to give hints of tax fraud: probability-distribution analysis methods, applications of Benford’s Law, inlier and outlier analysis, as well as tests of conformity between data and business key-indicator systems. Finally, investigators may be hopeful in the long run, because betrayers will never be able to construct a perfectly manipulated world of figures; cf. F. Wehrheim (2011).
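One of the listed hint-giving techniques can be sketched in a few lines: comparing the leading-digit distribution of booked amounts with the frequencies Benford's Law predicts. The function names and the plain absolute-deviation score are illustrative assumptions; a real investigation would use a proper statistical test:

```python
import math
from collections import Counter

def leading_digit(x):
    """First significant digit of a non-zero number."""
    x = abs(x)
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def benford_deviation(amounts):
    """Total absolute deviation of observed leading-digit frequencies from
    Benford's law P(d) = log10(1 + 1/d); larger values hint at manipulated
    figures (a hint, never proof)."""
    obs = Counter(leading_digit(a) for a in amounts if a != 0)
    n = sum(obs.values())
    return sum(abs(obs.get(d, 0) / n - math.log10(1 + 1 / d))
               for d in range(1, 10))
```

Naturally grown amounts spanning several orders of magnitude tend to score low, while invented figures with, say, uniformly distributed leading digits score noticeably higher.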

12 November 2015 (13:15, Room G29-301)
Ethical Challenges of a Mobility Service Provider in Handling Customer Data in the Digital Age
Dr. Karl Teille (Head of the Volkswagen Group's Institute of Computer Science)

Information ethics, a branch of philosophy, is an applied ethics that examines moral questions in the handling of digitally available information in information and communication technologies. Challenges arise from combining notions that are in part more than 5000 years old with technologies that are not yet 50 years old. Four theses on this:

Against this background, questions arise about social responsibility and the value of deliberately drawn moral limits as the yardstick and measure of ethical foundations in the digital age. This will be examined using the example of the challenges a mobility service provider faces in handling customer data.

Last modified: 26.06.2017 - Contact: Webmaster