
# Unsupervised graph classification/representation learning via distances¶

This demo shows how to train a graph classification model without supervision. The trained model can be used to compute embedding vectors, or representations, for graphs.

The algorithm trains against a ground-truth distance between graphs: it embeds pairs of graphs simultaneously and compares the distance between the resulting embedding vectors to the ground-truth distance.
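In outline, the training objective can be sketched as follows (a minimal illustration only: embed is a hypothetical stand-in for the real model defined later in the notebook):

import numpy as np

# Hypothetical sketch: the Euclidean distance between two graph embeddings
# should match the ground-truth distance between the graphs.
def pair_squared_error(embed, graph1, graph2, true_distance):
    predicted = np.linalg.norm(embed(graph1) - embed(graph2))
    return (predicted - true_distance) ** 2

# Training minimises the mean of this squared error over sampled pairs.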

It is inspired by UGraphEmb[1].

[1]: Y. Bai et al., “Unsupervised Inductive Graph-Level Representation Learning via Graph-Graph Proximity,” arXiv:1904.01098 [cs, stat], Jun. 2019.

[3]:

import stellargraph as sg
import pandas as pd
import numpy as np
import networkx as nx
import tensorflow as tf
from tensorflow import keras

from IPython.display import display, HTML


## Dataset¶

The PROTEINS dataset consists of about one thousand graphs, each labelled with one of two classes, 1 or 2.

[4]:

dataset = sg.datasets.PROTEINS()
display(HTML(dataset.description))

# load the list of graphs and the corresponding graph labels
graphs, graph_labels = dataset.load()

Each graph represents a protein and graph labels represent whether they are enzymes or non-enzymes. The dataset includes 1113 graphs with 39 nodes and 73 edges on average for each graph. Graph nodes have 4 attributes (including a one-hot encoding of their label), and each graph is labelled as belonging to 1 of 2 classes.
[5]:

graph_labels.value_counts().to_frame()

[5]:

   label
1    663
2    450

The graphs value is a list of StellarGraph instances:

[6]:

print(graphs[0].info())

StellarGraph: Undirected multigraph
 Nodes: 42, Edges: 162

 Node types:
  default: [42]
    Features: float32 vector, length 4
    Edge types: default-default->default

 Edge types:
    default-default->default: [162]
        Weights: all 1 (default)
        Features: none


Summary statistics of the sizes of the graphs:

[7]:

summary = pd.DataFrame(
    [(g.number_of_nodes(), g.number_of_edges()) for g in graphs],
    columns=["nodes", "edges"],
)
summary.describe().round(1)

[7]:

       nodes   edges
count  1113.0  1113.0
mean     39.1   145.6
std      45.8   169.3
min       4.0    10.0
25%      15.0    56.0
50%      26.0    98.0
75%      45.0   174.0
max     620.0  2098.0

## Create the model¶

[8]:

# pads the graphs in each batch to the same number of nodes
generator = sg.mapper.PaddedGraphGenerator(graphs)

[9]:

# two GCN layers of sizes 64 and 32; pool_all_layers=True pools the output of
# every GCN layer (not just the last) into the final graph embedding
gc_model = sg.layer.GCNSupervisedGraphClassification(
    [64, 32], ["relu", "relu"], generator, pool_all_layers=True
)

[10]:

# calling in_out_tensors twice gives two input/output pairs that share the
# same underlying weights, so both graphs in a pair use the same model
inp1, out1 = gc_model.in_out_tensors()
inp2, out2 = gc_model.in_out_tensors()

# Euclidean distance between the two graph embeddings
vec_distance = tf.norm(out1 - out2, axis=1)

[11]:

# pair_model predicts the distance between two graphs and is what gets
# trained; embedding_model computes the embedding of a single graph and
# reuses the weights trained via pair_model
pair_model = keras.Model(inp1 + inp2, vec_distance)
embedding_model = keras.Model(inp1, out1)
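Since both models are built from the same gc_model, they share all their weights. A quick sanity check of this (a sketch, relying on Keras deduplicating shared layer weights):

# the two models should report the same set of trainable weights
assert len(pair_model.trainable_weights) == len(embedding_model.trainable_weights)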


## Train the model¶

The model is trained on 100 random pairs of graphs, along with the ground-truth distance between them.

### Similarity measure¶

This method can use any notion of distance or similarity between two graphs. In this case, we use something efficient but not particularly accurate: the distance between the spectra (eigenvalues) of the graphs' Laplacian matrices.

Other options include graph edit distance and maximum common subgraph, but these are NP-hard to compute and too slow for this demonstration.
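As a toy illustration of the measure (separate from the training pipeline), we can compare the Laplacian spectra of a 5-node path and a 5-node cycle, using the networkx and numpy imports from above:

# Laplacian eigenvalues of two small graphs, in ascending order
path_spec = nx.laplacian_spectrum(nx.path_graph(5))
cycle_spec = nx.laplacian_spectrum(nx.cycle_graph(5))
print(np.round(path_spec, 2))
print(np.round(cycle_spec, 2))

# the spectral distance between the two graphs
print(np.linalg.norm(path_spec - cycle_spec))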

[12]:

def graph_distance(graph1, graph2):
    spec1 = nx.laplacian_spectrum(graph1.to_networkx(feature_attr=None))
    spec2 = nx.laplacian_spectrum(graph2.to_networkx(feature_attr=None))
    # the graphs may differ in size, so compare only the k smallest eigenvalues
    k = min(len(spec1), len(spec2))
    return np.linalg.norm(spec1[:k] - spec2[:k])
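For example, the spectral distance between the first two graphs in the dataset can be computed directly:

print(graph_distance(graphs[0], graphs[1]))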


### Training examples¶

[13]:

graph_idx = np.random.RandomState(0).randint(len(graphs), size=(100, 2))
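Each row of graph_idx holds the indices of one pair of graphs; the first few pairs can be inspected directly:

# first three sampled index pairs, each row selecting two graphs to compare
print(graph_idx[:3])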

[14]:

targets = [graph_distance(graphs[left], graphs[right]) for left, right in graph_idx]

[15]:

train_gen = generator.flow(graph_idx, batch_size=10, targets=targets)
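The flow returns a Keras Sequence, so a single batch can be inspected directly (a sketch; the exact layout of the padded inputs is an implementation detail of PaddedGraphGenerator):

# the first batch: padded inputs for each graph pair, plus the distance targets
batch_inputs, batch_targets = train_gen[0]
print(len(batch_inputs), batch_targets.shape)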


### Training procedure¶

[16]:

# mean squared error between predicted and ground-truth distances
pair_model.compile(keras.optimizers.Adam(1e-2), loss="mse")

[17]:

%%time
history = pair_model.fit(train_gen, epochs=500, verbose=0)
sg.utils.plot_history(history)

[plot of the training loss history]
CPU times: user 3min 30s, sys: 1min 40s, total: 5min 11s
Wall time: 1min 17s


## Compute embeddings¶

[18]:

embeddings = embedding_model.predict(generator.flow(graphs))
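The result is one embedding vector per graph; a shape check confirms this (the second dimension is the embedding size, determined by the GCN layer sizes above):

# one row per graph, one column per embedding dimension
print(embeddings.shape)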


Now that we’ve computed some embedding vectors in an unsupervised fashion, we can use them for downstream supervised, semi-supervised and unsupervised tasks.

### Supervised graph classification¶

We can use the embedding vectors as features for a logistic regression classifier, trained on the true graph labels.

[19]:

from sklearn.linear_model import LogisticRegression
from sklearn import model_selection

[20]:

train_labels, test_labels = model_selection.train_test_split(
    graph_labels, train_size=0.1, test_size=None, stratify=graph_labels
)

# graph_labels has a 1-based index, so subtract 1 to index into embeddings
test_embeddings = embeddings[test_labels.index - 1]
train_embeddings = embeddings[train_labels.index - 1]

lr = LogisticRegression(multi_class="auto", solver="lbfgs")
lr.fit(train_embeddings, train_labels)

y_pred = lr.predict(test_embeddings)
gcn_acc = (y_pred == test_labels).mean()
print(f"Test classification accuracy: {gcn_acc}")

Test classification accuracy: 0.6846307385229541


#### Confusion matrix¶

[21]:

pd.crosstab(test_labels, y_pred, rownames=["true"], colnames=["predicted"])

[21]:

predicted    1    2
true
1          506   91
2          225  180

### Visualising embeddings¶

We can also assess the embeddings qualitatively, by visualising them with dimensionality reduction.

[22]:

from sklearn.manifold import TSNE

tsne = TSNE(2)
two_d = tsne.fit_transform(embeddings)

[23]:

from matplotlib import pyplot as plt

plt.scatter(two_d[:, 0], two_d[:, 1], c=graph_labels.cat.codes, cmap="jet", alpha=0.4)

[23]:

<matplotlib.collections.PathCollection at 0x163db1a20>


## Conclusion¶

This demo demonstrated how to train a graph classification model without supervision, and how to use the resulting model to compute embedding vectors or representations for graphs.

The algorithm works with three components:

• a ground-truth distance or similarity between two graphs, such as graph edit distance or, as here, Laplacian spectrum distance (chosen for efficiency)

• a model that encodes graphs into embedding vectors

• a data generator that yields pairs of graphs and the corresponding ground-truth distance

This model is inspired by UGraphEmb[1].
