Node classification with Relational Graph Convolutional Network (RGCN)

This example demonstrates how to use an RGCN [1] on the AIFB dataset with stellargraph.

[1] Modeling Relational Data with Graph Convolutional Networks. Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, Max Welling (2017). https://arxiv.org/pdf/1703.06103.pdf

First we load the required libraries.

[3]:

from rdflib.extras.external_graph_libs import *
from rdflib import Graph, URIRef, Literal

import networkx as nx
from networkx.classes.function import info

import stellargraph as sg
from stellargraph.mapper import RelationalFullBatchNodeGenerator
from stellargraph.layer import RGCN

import numpy as np
import matplotlib.pyplot as plt
import os
import pandas as pd

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

import sklearn
from sklearn import model_selection
from collections import Counter
from stellargraph import datasets
from IPython.display import display, HTML
%matplotlib inline

Loading the data

(See the “Loading from Pandas” demo for details on how data can be loaded.)

[4]:

dataset = datasets.AIFB()
display(HTML(dataset.description))
G, affiliation = dataset.load()

The AIFB dataset describes the AIFB research institute in terms of its staff, research group, and publications. First used for machine learning with RDF in Bloehdorn, Stephan and Sure, York, "Kernel Methods for Mining Instance Data in Ontologies", The Semantic Web (2008), http://dx.doi.org/10.1007/978-3-540-76298-0_5. It contains ~8k entities, ~29k edges, and 45 different relationships or edge types. In (Bloehdorn et al 2007) the dataset was first used to predict the affiliation (i.e., research group) for people in the dataset. The dataset contains 178 members of a research group with 5 different research groups. The goal is to predict which research group a researcher belongs to.

[5]:

print(G.info())

StellarDiGraph: Directed multigraph
 Nodes: 8285, Edges: 29043

 Node types:
  default: [8285]
    Features: float32 vector, length 8285
    Edge types: default-http://swrc.ontoware.org/ontology#abstract->default, default-http://swrc.ontoware.org/ontology#address->default, default-http://swrc.ontoware.org/ontology#author->default, default-http://swrc.ontoware.org/ontology#booktitle->default, default-http://swrc.ontoware.org/ontology#carriedOutBy->default, ... (40 more)

 Edge types:
    default-http://swrc.ontoware.org/ontology#publication->default: [4163]
    default-http://www.w3.org/1999/02/22-rdf-syntax-ns#type->default: [4124]
    default-http://swrc.ontoware.org/ontology#author->default: [3986]
    default-http://swrc.ontoware.org/ontology#isAbout->default: [2477]
    default-http://swrc.ontoware.org/ontology#name->default: [1302]
    default-http://swrc.ontoware.org/ontology#year->default: [1227]
    default-http://swrc.ontoware.org/ontology#title->default: [1227]
    default-http://swrc.ontoware.org/ontology#publishes->default: [1217]
    default-http://swrc.ontoware.org/ontology#projectInfo->default: [952]
    default-http://swrc.ontoware.org/ontology#hasProject->default: [952]
    default-http://swrc.ontoware.org/ontology#booktitle->default: [765]
    default-http://swrc.ontoware.org/ontology#month->default: [759]
    default-http://swrc.ontoware.org/ontology#isWorkedOnBy->default: [571]
    default-http://swrc.ontoware.org/ontology#pages->default: [548]
    default-http://swrc.ontoware.org/ontology#abstract->default: [534]
    default-http://swrc.ontoware.org/ontology#dealtWithIn->default: [357]
    default-http://swrc.ontoware.org/ontology#member->default: [339]
    default-http://swrc.ontoware.org/ontology#volume->default: [311]
    default-http://swrc.ontoware.org/ontology#series->default: [298]
    default-http://swrc.ontoware.org/ontology#homepage->default: [239]
    ... (25 more)

The relationship ‘affiliation’ indicates whether a researcher is affiliated with a research group, e.g. (researcher, research group, affiliation). It is used to create the one-hot labels in the affiliation DataFrame. These relationships are not included in G, and neither is the inverse relationship ‘employs’. The idea here is to test whether we can recover a ‘missing’ relationship.
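As a rough illustration of how such triples could be turned into one-hot labels, the sketch below uses made-up researcher and group names (hypothetical, not taken from AIFB) and plain NumPy. It is not the code behind datasets.AIFB().load(), which may work differently.

```python
import numpy as np

# Hypothetical (researcher, research group) pairs, as they might be read
# from 'affiliation' triples. These names are illustrative only.
affiliation_triples = [
    ("person_a", "group_1"),
    ("person_b", "group_3"),
    ("person_c", "group_1"),
    ("person_d", "group_2"),
]

people = [person for person, _ in affiliation_triples]
groups = sorted({group for _, group in affiliation_triples})
group_index = {group: i for i, group in enumerate(groups)}

# One row per researcher, one column per research group.
one_hot = np.zeros((len(people), len(groups)), dtype=np.float32)
for row, (_, group) in enumerate(affiliation_triples):
    one_hot[row, group_index[group]] = 1.0

print(groups)
print(one_hot)
```

Each row contains exactly one 1, marking the group that researcher belongs to. This is the shape of target that a softmax output with a categorical cross-entropy loss expects.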

Input preparation

The nodes don’t natively have features, so each node has been given a one-hot indicator vector (hence the feature length of 8285, one entry per node) to allow the model to learn from the graph structure. We only train on the people with affiliations, so we split those labelled nodes into a training set (80%) and a test set (20%).
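To see why one-hot indicators let a model learn from structure alone, consider the minimal NumPy sketch below. The sizes are tiny stand-ins for the 8285 nodes and 32 hidden units used in this demo, and the weights are random rather than learned.

```python
import numpy as np

num_nodes, hidden_size = 5, 3  # tiny stand-ins for 8285 nodes and 32 units

# One-hot indicator features: node i gets the i-th standard basis vector.
features = np.eye(num_nodes, dtype=np.float32)

# A first-layer weight matrix (random here, learned during training in practice).
rng = np.random.default_rng(0)
weights = rng.normal(size=(num_nodes, hidden_size)).astype(np.float32)

# Multiplying a one-hot row by the weights selects a single row of the weights,
# so each node effectively owns its own learnable embedding vector.
projected = features @ weights
print(np.allclose(projected, weights))
```

In other words, with identity features the first layer acts like an embedding lookup, and the graph convolutions then mix those per-node vectors along the edges.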

[6]:

train_targets, test_targets = model_selection.train_test_split(
    affiliation, train_size=0.8, test_size=None
)

[7]:

generator = RelationalFullBatchNodeGenerator(G, sparse=True)

train_gen = generator.flow(train_targets.index, targets=train_targets)
test_gen = generator.flow(test_targets.index, targets=test_targets)

RGCN model creation and training

We use stellargraph to create an RGCN object. This creates a stack of relational graph convolutional layers. We add a softmax layer to transform the features created by RGCN into class predictions and create a keras model. Then we train the model on the stellargraph generators.
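For intuition, a single relational graph convolution can be sketched in NumPy as below. This follows the propagation rule in the paper (a normalised sum over neighbours per relation, plus a self-connection), using dense matrices and a toy graph with two hypothetical relations. It is a simplified sketch, not stellargraph's implementation, whose normalisation and sparse handling may differ.

```python
import numpy as np

def rgcn_layer(features, adjacencies, rel_weights, self_weight):
    """One relational graph convolution: ReLU(sum_r A_r @ H @ W_r + H @ W_0).

    adjacencies[r][i, j] == 1 means node i receives a message from node j
    under relation r (a convention assumed for this sketch).
    """
    out = features @ self_weight  # self-connection term
    for adj, weight in zip(adjacencies, rel_weights):
        # Row-normalise so each node averages over its neighbours in this relation.
        degree = adj.sum(axis=1, keepdims=True)
        norm_adj = np.divide(adj, degree, out=np.zeros_like(adj), where=degree > 0)
        out += norm_adj @ features @ weight
    return np.maximum(out, 0.0)  # ReLU activation

rng = np.random.default_rng(0)
num_nodes, in_dim, out_dim = 4, 4, 2
features = np.eye(num_nodes)  # one-hot node features

# Two toy relations on four nodes; node 1 has no incoming edges in either.
adj_rel_a = np.array(
    [[0, 1, 0, 0], [0, 0, 0, 0], [0, 1, 0, 1], [0, 0, 0, 0]], dtype=float
)
adj_rel_b = np.array(
    [[0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 1, 0]], dtype=float
)

rel_weights = [rng.normal(size=(in_dim, out_dim)) for _ in range(2)]
self_weight = rng.normal(size=(in_dim, out_dim))

hidden = rgcn_layer(features, [adj_rel_a, adj_rel_b], rel_weights, self_weight)
print(hidden.shape)  # (4, 2)
```

Stacking two such layers, as layer_sizes=[32, 32] does in this demo, lets each node's representation depend on its two-hop relational neighbourhood.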

Each RGCN layer creates a weight matrix for each relationship in the graph. If num_bases == 0, these weight matrices are completely independent. If num_bases != 0, each weight matrix is a different linear combination of the same basis matrices. This introduces parameter sharing and reduces the number of parameters in the model. See the paper for more details.
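The basis decomposition can be sketched as follows. The helper names basis_weights and count_params are made up for this illustration, and the parameter counts cover only the relation-specific weights (self-connection and bias terms are left out), so they are indicative rather than an exact count for the stellargraph model.

```python
import numpy as np

def basis_weights(coefficients, bases):
    """Build per-relation weights W_r = sum_b coefficients[r, b] * bases[b].

    coefficients: (num_relations, num_bases)
    bases:        (num_bases, in_dim, out_dim)
    """
    return np.einsum("rb,bio->rio", coefficients, bases)

def count_params(num_relations, in_dim, out_dim, num_bases):
    """Relation-specific parameters only (self-connection and bias excluded)."""
    if num_bases == 0:
        return num_relations * in_dim * out_dim
    return num_bases * in_dim * out_dim + num_relations * num_bases

rng = np.random.default_rng(0)
coefficients = rng.normal(size=(45, 20))
bases = rng.normal(size=(20, 6, 4))  # small in/out sizes keep the example light
weights = basis_weights(coefficients, bases)
print(weights.shape)  # (45, 6, 4)

# First layer of this demo: 45 edge types, 8285 input features, 32 units.
print(count_params(45, 8285, 32, num_bases=0))   # 11930400
print(count_params(45, 8285, 32, num_bases=20))  # 5303300
```

Under these assumptions, num_bases=20 roughly halves the relation-specific parameters of the first layer, since the 45 relations share 20 basis matrices and differ only in their 20 mixing coefficients.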

[8]:

rgcn = RGCN(
    layer_sizes=[32, 32],
    activations=["relu", "relu"],
    generator=generator,
    bias=True,
    num_bases=20,
    dropout=0.5,
)

[9]:

x_in, x_out = rgcn.in_out_tensors()
predictions = Dense(train_targets.shape[-1], activation="softmax")(x_out)
model = Model(inputs=x_in, outputs=predictions)
model.compile(
    loss="categorical_crossentropy",
    optimizer=keras.optimizers.Adam(0.01),
    metrics=["acc"],
)

[10]:

history = model.fit(train_gen, validation_data=test_gen, epochs=20)

Epoch 1/20
1/1 [==============================] - 27s 27s/step - loss: 1.6109 - acc: 0.2746 - val_loss: 1.5623 - val_acc: 0.3611
Epoch 2/20
1/1 [==============================] - 23s 23s/step - loss: 1.5564 - acc: 0.5000 - val_loss: 1.4438 - val_acc: 0.4167
Epoch 3/20
1/1 [==============================] - 22s 22s/step - loss: 1.4328 - acc: 0.5070 - val_loss: 1.2094 - val_acc: 0.5000
Epoch 4/20
1/1 [==============================] - 21s 21s/step - loss: 1.2018 - acc: 0.5141 - val_loss: 0.9568 - val_acc: 0.6389
Epoch 5/20
1/1 [==============================] - 20s 20s/step - loss: 0.8872 - acc: 0.7606 - val_loss: 0.7373 - val_acc: 0.6944
Epoch 6/20
1/1 [==============================] - 20s 20s/step - loss: 0.7686 - acc: 0.8099 - val_loss: 0.5692 - val_acc: 0.7778
Epoch 7/20
1/1 [==============================] - 21s 21s/step - loss: 0.6025 - acc: 0.8662 - val_loss: 0.4802 - val_acc: 0.8889
Epoch 8/20
1/1 [==============================] - 21s 21s/step - loss: 0.4335 - acc: 0.8944 - val_loss: 0.4364 - val_acc: 0.9444
Epoch 9/20
1/1 [==============================] - 21s 21s/step - loss: 0.3616 - acc: 0.9437 - val_loss: 0.4061 - val_acc: 0.9444
Epoch 10/20
1/1 [==============================] - 21s 21s/step - loss: 0.3286 - acc: 0.9437 - val_loss: 0.3821 - val_acc: 0.9444
Epoch 11/20
1/1 [==============================] - 20s 20s/step - loss: 0.3106 - acc: 0.9507 - val_loss: 0.3619 - val_acc: 0.9444
Epoch 12/20
1/1 [==============================] - 21s 21s/step - loss: 0.2678 - acc: 0.9437 - val_loss: 0.3498 - val_acc: 0.9167
Epoch 13/20
1/1 [==============================] - 20s 20s/step - loss: 0.2236 - acc: 0.9507 - val_loss: 0.3463 - val_acc: 0.9167
Epoch 14/20
1/1 [==============================] - 21s 21s/step - loss: 0.2434 - acc: 0.9296 - val_loss: 0.3552 - val_acc: 0.9167
Epoch 15/20
1/1 [==============================] - 20s 20s/step - loss: 0.2236 - acc: 0.9296 - val_loss: 0.3680 - val_acc: 0.9167
Epoch 16/20
1/1 [==============================] - 20s 20s/step - loss: 0.1783 - acc: 0.9437 - val_loss: 0.3912 - val_acc: 0.9167
Epoch 17/20
1/1 [==============================] - 20s 20s/step - loss: 0.1887 - acc: 0.9437 - val_loss: 0.4214 - val_acc: 0.9167
Epoch 18/20
1/1 [==============================] - 19s 19s/step - loss: 0.1636 - acc: 0.9437 - val_loss: 0.4550 - val_acc: 0.9167
Epoch 19/20
1/1 [==============================] - 18s 18s/step - loss: 0.1699 - acc: 0.9437 - val_loss: 0.4450 - val_acc: 0.9167
Epoch 20/20
1/1 [==============================] - 18s 18s/step - loss: 0.1848 - acc: 0.9437 - val_loss: 0.4342 - val_acc: 0.9167

[11]:

sg.utils.plot_history(history)

../../_images/demos_node-classification_rgcn-node-classification_19_0.png

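Now we assess the accuracy of our trained model on the test set. It reaches about 92% accuracy in this run, which is a good result for this example dataset. Because the same test set was passed as validation data during training, the metrics match those of the final training epoch.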

[12]:

test_metrics = model.evaluate(test_gen)
print("\nTest Set Metrics:")
for name, val in zip(model.metrics_names, test_metrics):
    print("\t{}: {:0.4f}".format(name, val))


Test Set Metrics:
        loss: 0.4342
        acc: 0.9167
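The acc value is the fraction of nodes whose highest-probability class matches the true one-hot label. A minimal NumPy sketch of that calculation is shown below; the helper name categorical_accuracy and the toy arrays are ours, not part of the notebook.

```python
import numpy as np

def categorical_accuracy(one_hot_targets, predicted_probs):
    """Fraction of rows where the predicted class equals the true class."""
    true_class = one_hot_targets.argmax(axis=1)
    predicted_class = predicted_probs.argmax(axis=1)
    return float(np.mean(true_class == predicted_class))

# Toy example: four nodes, three classes; the third prediction is wrong.
targets = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
probs = np.array(
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3], [0.2, 0.5, 0.3]]
)
print(categorical_accuracy(targets, probs))  # 0.75

# An accuracy of 0.9167 on a 36-node test set corresponds to 33 correct nodes.
print(round(33 / 36, 4))  # 0.9167
```

With a test set this small, each misclassified researcher changes the accuracy by almost three percentage points, so differences between runs should be read with that in mind.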

Node embeddings

We take the node embeddings to be the activations output by the last graph convolution layer in the RGCN layer stack, and visualise them, coloring nodes by their true research group label. We expect to see nice clusters of researchers in the node embedding space, with researchers from the same group belonging to the same cluster.

To calculate the node embeddings rather than the class predictions, we create a new model with the same inputs as we used previously, x_in, but now the output is the embeddings x_out rather than the predicted class. Note that the weights trained previously are kept in the new model.

[13]:

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# get embeddings for all people nodes
all_gen = generator.flow(affiliation.index, targets=affiliation)
embedding_model = Model(inputs=x_in, outputs=x_out)
emb = embedding_model.predict(all_gen)

[14]:

X = emb.squeeze(0)
y = affiliation.idxmax(axis="columns").astype("category")

if X.shape[1] > 2:
    # Embeddings have more than 2 dimensions: project to 2D with t-SNE for plotting.
    transform = TSNE

    trans = transform(n_components=2)
    emb_transformed = pd.DataFrame(trans.fit_transform(X), index=affiliation.index)
    emb_transformed["label"] = y
else:
    # Embeddings are already 2D: use them directly (columns are already named 0 and 1).
    emb_transformed = pd.DataFrame(X, index=affiliation.index)
    emb_transformed["label"] = y

[15]:

alpha = 0.7

fig, ax = plt.subplots(figsize=(7, 7))
ax.scatter(
    emb_transformed[0],
    emb_transformed[1],
    c=emb_transformed["label"].cat.codes,
    cmap="jet",
    alpha=alpha,
)
ax.set(aspect="equal", xlabel="$X_1$", ylabel="$X_2$")
plt.title(
    "{} visualization of RGCN embeddings for AIFB dataset".format(transform.__name__)
)
plt.show()

../../_images/demos_node-classification_rgcn-node-classification_25_0.png

Aside from a slight overlap, the classes are well separated despite the projection to only two dimensions. This indicates that our model is performing well at clustering the researchers into the right groups.
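If a number is preferred over a visual impression, one rough check is to ask how often a point lies closest to the centroid of its own class. The sketch below does this on synthetic 2D points; the helper nearest_centroid_agreement and the data are hypothetical and not part of the notebook. Since t-SNE does not preserve distances faithfully, such a score on the projected points is only indicative.

```python
import numpy as np

def nearest_centroid_agreement(points, labels):
    """Fraction of points whose nearest class centroid is their own class."""
    classes = np.unique(labels)
    centroids = np.stack([points[labels == c].mean(axis=0) for c in classes])
    # Pairwise distances between every point and every class centroid.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    nearest = classes[distances.argmin(axis=1)]
    return float(np.mean(nearest == labels))

# Synthetic, well-separated clusters standing in for a 2D embedding.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
labels = np.repeat(np.arange(3), 20)
points = centres[labels] + rng.normal(scale=0.5, size=(60, 2))

score = nearest_centroid_agreement(points, labels)
print(score)
```

In the notebook, the same function could be applied to emb_transformed[[0, 1]].to_numpy() and emb_transformed["label"].cat.codes.to_numpy(), assuming those columns exist as created above.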
