Execute this notebook: Download locally

# Unsupervised graph classification/representation learning via distances¶

This demo demonstrated training a graph classification model without supervision. This model could be used to compute embedding vectors or representations for graphs.

The algorithm uses a ground-truth distance between graphs as a metric to train against, by embedding pairs of graphs simultaneously and combining the resulting embedding vectors to match the distance.

It is inspired by UGraphEmb[1].

[1]: Y. Bai et al., “Unsupervised Inductive Graph-Level Representation Learning via Graph-Graph Proximity,” arXiv:1904.01098 [cs, stat], Jun. 2019.

```
[3]:
```

```
import stellargraph as sg
import pandas as pd
import numpy as np
import networkx as nx
import tensorflow as tf
from tensorflow import keras
from IPython.display import display, HTML
```

## Dataset¶

The PROTEINS dataset consists of about one thousand graphs, with binary labels `1`

or `2`

.

```
[4]:
```

```
dataset = sg.datasets.PROTEINS()
display(HTML(dataset.description))
graphs, graph_labels = dataset.load()
```

```
[5]:
```

```
graph_labels.value_counts().to_frame()
```

```
[5]:
```

label | |
---|---|

1 | 663 |

2 | 450 |

The `graphs`

value consists of many `StellarGraph`

instances:

```
[6]:
```

```
print(graphs[0].info())
```

```
StellarGraph: Undirected multigraph
Nodes: 42, Edges: 162
Node types:
default: [42]
Features: float32 vector, length 4
Edge types: default-default->default
Edge types:
default-default->default: [162]
Weights: all 1 (default)
Features: none
```

Summary statistics of the sizes of the graphs:

```
[7]:
```

```
summary = pd.DataFrame(
[(g.number_of_nodes(), g.number_of_edges()) for g in graphs],
columns=["nodes", "edges"],
)
summary.describe().round(1)
```

```
[7]:
```

nodes | edges | |
---|---|---|

count | 1113.0 | 1113.0 |

mean | 39.1 | 145.6 |

std | 45.8 | 169.3 |

min | 4.0 | 10.0 |

25% | 15.0 | 56.0 |

50% | 26.0 | 98.0 |

75% | 45.0 | 174.0 |

max | 620.0 | 2098.0 |

## Create the model¶

```
[8]:
```

```
generator = sg.mapper.PaddedGraphGenerator(graphs)
```

```
[9]:
```

```
gc_model = sg.layer.GCNSupervisedGraphClassification(
[64, 32], ["relu", "relu"], generator, pool_all_layers=True
)
```

```
[10]:
```

```
inp1, out1 = gc_model.in_out_tensors()
inp2, out2 = gc_model.in_out_tensors()
vec_distance = tf.norm(out1 - out2, axis=1)
```

```
[11]:
```

```
pair_model = keras.Model(inp1 + inp2, vec_distance)
embedding_model = keras.Model(inp1, out1)
```

## Train the model¶

The model is trained on 100 random pairs of graphs, along with the ground-truth distance between them.

### Similarity measure¶

This method can use any notion of distance or similarity between two graphs. In this case, we use something efficient, but not particularly accurate: the distance between the spectrum (or eigenvalues) of the Laplacian matrix of the graphs.

Other options include graph edit distance and minimum common subgraph, but these are NP-hard to compute and are too slow for this demonstration.

```
[12]:
```

```
def graph_distance(graph1, graph2):
spec1 = nx.laplacian_spectrum(graph1.to_networkx(feature_attr=None))
spec2 = nx.laplacian_spectrum(graph2.to_networkx(feature_attr=None))
k = min(len(spec1), len(spec2))
return np.linalg.norm(spec1[:k] - spec2[:k])
```

### Training examples¶

```
[13]:
```

```
graph_idx = np.random.RandomState(0).randint(len(graphs), size=(100, 2))
```

```
[14]:
```

```
targets = [graph_distance(graphs[left], graphs[right]) for left, right in graph_idx]
```

```
[15]:
```

```
train_gen = generator.flow(graph_idx, batch_size=10, targets=targets)
```

### Training procedure¶

```
[16]:
```

```
pair_model.compile(keras.optimizers.Adam(1e-2), loss="mse")
```

```
[17]:
```

```
%%time
history = pair_model.fit(train_gen, epochs=500, verbose=0)
sg.utils.plot_history(history)
```

```
['...']
CPU times: user 3min 30s, sys: 1min 40s, total: 5min 11s
Wall time: 1min 17s
```

## Compute embeddings¶

```
[18]:
```

```
embeddings = embedding_model.predict(generator.flow(graphs))
```

## Downstream tasks¶

Now that we’ve computed some embedding vectors in an unsupervised fashion, we can use them for other supervised, semi-supervised and unsupervised tasks.

### Supervised graph classification¶

We can use the embedding vectors to perform logistic regression classification, using the labels.

```
[19]:
```

```
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
```

```
[20]:
```

```
train_labels, test_labels = model_selection.train_test_split(
graph_labels, train_size=0.1, test_size=None, stratify=graph_labels
)
test_embeddings = embeddings[test_labels.index - 1]
train_embeddings = embeddings[train_labels.index - 1]
lr = LogisticRegression(multi_class="auto", solver="lbfgs")
lr.fit(train_embeddings, train_labels)
y_pred = lr.predict(test_embeddings)
gcn_acc = (y_pred == test_labels).mean()
print(f"Test classification accuracy: {gcn_acc}")
```

```
Test classification accuracy: 0.6846307385229541
```

#### Confusion matrix¶

```
[21]:
```

```
pd.crosstab(test_labels, y_pred, rownames=["true"], colnames=["predicted"])
```

```
[21]:
```

predicted | 1 | 2 |
---|---|---|

true | ||

1 | 506 | 91 |

2 | 225 | 180 |

### Visualising embeddings¶

We can also get a qualitative measure of the embeddings, using dimensionality reduction.

```
[22]:
```

```
from sklearn.manifold import TSNE
tsne = TSNE(2)
two_d = tsne.fit_transform(embeddings)
```

```
[23]:
```

```
from matplotlib import pyplot as plt
plt.scatter(two_d[:, 0], two_d[:, 1], c=graph_labels.cat.codes, cmap="jet", alpha=0.4)
```

```
[23]:
```

```
<matplotlib.collections.PathCollection at 0x163db1a20>
```

## Conclusion¶

This demo demonstrated training a graph classification model without supervision. This model could be used to compute embedding vectors or representations for graphs.

The algorithm works with three components:

a ground truth distance or similarity between two graphs such as graph edit distance, or, in this case, Laplacian spectrum distance (for efficiency)

a model that encodes graphs into embedding vectors

a data generator that yields pairs of graphs and the corresponding ground truth distance

This model is inspired by UGraphEmb[1].

Execute this notebook: Download locally