# SDM 2019 Tutorial, Calgary, AB, Canada

**Date:** May 2-4, 2019

**Location:** Hyatt Regency Calgary

- Presenters
- Tutorial Summary
- Motivation
- Outline
- Target Audience & Prerequisites
- Presenter Biographies
- Previous Versions
- Acknowledgements

## Presenters

Jose Bento, Boston College

Tina Eliassi-Rad, Northeastern University

Stratis Ioannidis, Northeastern University

Leo Torres, Northeastern University

## Tutorial Summary

Representations of real-world phenomena as graphs (a.k.a. networks) are ubiquitous, ranging from social and information networks, to technological, biological, chemical, and brain networks. Many graph mining tasks – including clustering, anomaly detection, nearest neighbor, similarity search, pattern recognition, and transfer learning – require a distance that is a metric between graphs to be computed efficiently. Several classic distances between graphs do not scale to graphs with millions of nodes; others do not satisfy all of the metric properties: non-negativity, positive definiteness, symmetry, and triangle inequality. The purpose of this tutorial is to review the recent and expanding literature of graph metric spaces, focusing specifically on tractable metrics. To that end, this tutorial will cover classic metrics such as the chemical, edit, reaction, and Chartrand-Kubiki-Shultz (CKS) distances, as well as their heuristic approximations. It will also cover both convex and non-convex relaxations of classical metrics that have been recently developed and satisfy the metric property. In addition, it will cover how to incorporate metric embeddings and node/edge attributes when computing graph distances.

## Motivation

Graph representations of real-world phenomena are ubiquitous – from social and information networks, to technological, biological, chemical, and brain networks. Many graph mining tasks require a distance (or, conversely, a similarity) measure between graphs. Examples include clustering of graphs and anomaly detection, nearest neighbor and similarity search, pattern recognition, and transfer learning. Such tasks find applications in diverse areas including image processing, chemistry, and social network analysis, to name a few.

### Why Metrics?

Intuitively, given two (unlabeled) graphs, their distance is a score quanitifying their structural differences. A highly desirable property for such a score is that it is a *metric*, i.e., it is non-negative, symmetric, positive-definite, and, crucially, satisfies the triangle inequality.

Metrics exhibit significant computational advantages over non-metrics. For example, operations such as nearest-neighbor search, clustering, outlier detection, and diameter computation have known fast algorithms precisely when performed over objects embedded in a metric space. Indeed, some of these tasks become tractable or admit formal guarantees precisely when performed over a metric space. For example, finding the nearest neighbor or the diameter of a dataset become polylogarithimic under metric assumptions; similarly, approximation algorithms for clustering (which is NP-hard) rely on metric assumptions, whose absence leads to a deterioration on known bounds. This difference between metrics and non-metrics can play quite a significant role in practice.

### Tutorial Objectives

The purpose of this tutorial is to review the recent and expanding literature of graph metric spaces, focusing specifically on tractable metrics. To that end, the tutorial will cover classic metrics such as the chemical, edit, reaction, and CKS distances, as well as heuristic approximations. It will also cover both convex and non-convex relaxations that have been recently developed and satisfy the metric property.

## Outline

The tutorial will consist of three parts. Each part will have a duration of 50 minutes, with 10 minutes for questions in between, and will be given by a different tutor.

### Part 1: Introduction to Graph Distances [slides]

- Examples illustrating applications in which comparing graphs is important.
- Biology, computer vision, social sciences, chemistry

- Commonly used notions for comparing graphs
- Overview of graph isomorphism, and subgraph isomorphisms
- Overview of graph alignment methods, and distance metrics: Edit, Chemical, CKS, spectral distances.
- Commparing multiple graphs

- Importance of graph metrics in applications
- Clustering, nearest neighbors, classification, diameter computation.

- Scalability issues
- Examples of scalable alignment algorithms, and how they can produce a distance between graphs
- Are these metrics?

### Part 2: A Family of Tractable Graph Metrics [slides]

- Relaxations of the Chemical and CKS distances
- Optimization Spaces, permutations, doubly-stochastic matrices/Birkhoff polytope, relation to the Weisfeiler-Lehman Algorithm (WL) algorithm, orthogonal Matrices/Stiefler manifold, relation to co-spectrality
- Advantages of an optimization-based formulation, contrast to other optimization-based algorithms
- Better relaxations, problems with using semidefinite programming and/or sum-of-squares techniques
- Efficiency and scalability, convexity and non-convexity

- Incorporating Metric Embeddings
- Metric embedding: broad definitions, soft vs.~hard constraints, effects on metric property, effects on convergence
- Examples of metric embeddings

- Distributed Computation
- the Alternating Direction Method of Multipliers, effect of sparsity, granularity of decomposition, tuning
- Python/Spark/GPU implementations

### Part 3: Non-backtracking Matrix, Graph Distance, and Metric Embedding [slides]

- Graphs as metric spaces
- The length spectrum
- Modifying the length spectrum
- Detour: Non-backtracking matrix (NBM)
- Graph distance
- Properties & examples

- Geometry of graph embeddings
- Vectorization vs. pattern recognition
- Literature review
- Edge embedding with NBM
- Examples

## Target Audience and Prerequisites

Our target audience includes researchers and practitioners in graph mining and network analysis. We are targeting people who are interested in theory and applications of measuring distances between graphs. We expect the audience to come away with an overview of the state-of-art in tractable graph distances and have a better understanding of the challenges in this area. No assumption is made about familiarity with graph mining and network analysis. A brief overview of these topics will be included in the tutorial.

## Presenter Biographies

**Jose Bento** completed his Ph.D. in Electrical Engineering at Stanford University where he worked with Professor Andrea Montanari on statistical inference and structural learning of graphical models. After his Ph.D., he moved to Disney Research, Boston lab, where he worked with Dr. Jonathan Yedidia on algorithms for distributed optimization, robotics, and computer vision. He is now with the Computer Science department at Boston College. His current research lies at the intersection of distributed algorithms and machine learning. In 2014 he received a Disney Inventor Award for his work on distributed optimization, which recently lead to an approved patent. In 2016 he was awarded a $10M NIH joint grant to study the emergence of antibiotic resistance and in 2017 a $2M NSF joint grant to study measures of distances between large graphs.

**Tina Eliassi-Rad** is an Associate Professor of Computer Science at Northeastern University in Boston, MA. She is also on the fac- ulty of Northeastern’s Network Science Institute. Prior to joining Northeastern, Tina was an Associate Professor of Computer Science at Rutgers University; and before that she was a Member of Technical Staff and Principal Investigator at Lawrence Livermore National Laboratory. Tina earned her Ph.D. in Computer Sciences (with a minor in Mathematical Statistics) at the University of Wisconsin- Madison. Her research is rooted in data mining and machine learning; and spans theory, algorithms, and applications of massive data from networked representations of physical and social phenomena. Tina’s
work has been applied to personalized search on the World-Wide Web, statistical indices of large-scale scientific simulation data, fraud detection, mobile ad targeting, and cyber situational awareness. She has published over 70 peer-reviewed papers (including a best paper runner-up award at ICDM’09, a best interdisciplinary paper award at CIKM’12, and a best of conference paper at ICDM’16); and has given over 140 invited presentations. She has presented 9 tutorials at top conferences such as KDD, ICDM, SDM, WSDM, and WWW. Her algorithms have been incorporated into systems used by the gov- ernment and industry (e.g., IBM System G Graph Analytics) as well as open-source software (e.g., Stanford Network Analysis Project). In 2010, she received an Outstanding Mentor Award from the Office of Science at the US Department of Energy.

**Stratis Ioannidis** is an assistant professor in the Electrical and Computer Engineering Department of Northeastern University, in Boston, MA, where he also holds a courtesy appointment with the College of Computer and Information Science. He received his B.Sc. (2002) in Electrical and Computer Engineering from the National Technical University of Athens, Greece, and his M.Sc. (2004) and Ph.D. (2009) in Computer Science from the University of Toronto, Canada. Prior to joining Northeastern, he was a research scientist at the Technicolor research centers in Paris, France, and Palo Alto, CA, as well as at Yahoo Labs in Sunnyvale, CA. He is the recipient of an NSF CAREER Award, a Google Faculty Research Award, and a best paper award at ACM ICN 2017. His research interests span machine learning, distributed systems, networking, optimization, and privacy.

**Leo Torres** is a third-year PhD student at Northeastern’s Network Science Institute, working with Tina Eliassi-Rad. He is broadly interested in the intersection of Network Science, Complexity Science, and Mathematics. He studies complex networks by applying different fields of mathematics, such as differential and metric geometry, and algebraic topology; and different computational approaches such as graph mining and machine learning. Currently, his main research focus is to introduce novel mathematical tools to the network science toolbox, and design efficient algorithms around them to uncover previously unknown features of networks. Leo has a B.S. in Mathematics from Pontificia Universidad Católica del Perú has held positions as mathematics instructor at
the undergraduate level, Spanish teacher, and research programmer at Wolfram Research South America.

## Previous Versions of Tutorial

## Acknowledgements

The presenters would like to acknowledge the generous support of the National Science Foundation via grants IIS-1741197 and IIS-1741129, as well as the Google Cloud Platform for providing GCP compute credits to these awards.