Comparison of link prediction with random-walk-based node embeddings¶
This demo notebook compares the link prediction performance of the embeddings learned by Node2Vec [1], Attri2Vec [2], GraphSAGE [3] and GCN [4] on the Cora dataset, under the same edge train-test-split setting. Node2Vec and Attri2Vec embeddings are learned by capturing the similarity of nodes' random-walk contexts, while GraphSAGE and GCN embeddings are learned in an unsupervised way by encouraging nodes that co-occur in short random walks to be close in the embedding space.
We tackle link prediction as a supervised learning problem on top of node representations/embeddings. After obtaining embeddings, a binary classifier can be trained to predict whether a link exists between any two nodes in the graph. Several hyperparameters affect the quality of the link classifier; this demo incorporates model selection into the pipeline to choose the best binary operator to apply to a pair of node embeddings.
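As a toy sketch of this idea (all names and values here are hypothetical, not part of the pipeline built below): a binary operator turns two node embeddings into one edge feature vector that an off-the-shelf classifier can consume.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
embedding = {node: rng.normal(size=8) for node in range(10)}  # toy node embeddings
pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]  # toy labelled node pairs
labels = [1, 0, 1, 0]  # 1 = edge exists, 0 = no edge

# Hadamard operator: element-wise product of the two node embeddings
features = [embedding[u] * embedding[v] for u, v in pairs]
classifier = LogisticRegression().fit(features, labels)
print(classifier.predict_proba([embedding[8] * embedding[9]]))  # score an unseen pair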
There are four steps:
Obtain embeddings for each node
For each set of hyperparameters, train a classifier
Select the classifier that performs the best
Evaluate the selected classifier on unseen data to validate its ability to generalise
References:
[1] Node2Vec: Scalable Feature Learning for Networks. A. Grover, J. Leskovec. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016.
[2] Attributed Network Embedding via Subspace Discovery. D. Zhang, J. Yin, X. Zhu and C. Zhang. Data Mining and Knowledge Discovery, 2019.
[3] Inductive Representation Learning on Large Graphs. W.L. Hamilton, R. Ying, and J. Leskovec. Neural Information Processing Systems (NIPS), 2017.
[4] Semi-Supervised Classification with Graph Convolutional Networks. T. N. Kipf, M. Welling. International Conference on Learning Representations (ICLR), 2017.
[3]:
import matplotlib.pyplot as plt
from math import isclose
from sklearn.decomposition import PCA
import os
import networkx as nx
import numpy as np
import pandas as pd
from stellargraph import StellarGraph, datasets
from stellargraph.data import EdgeSplitter
from collections import Counter
import multiprocessing
from IPython.display import display, HTML
from sklearn.model_selection import train_test_split
%matplotlib inline
Load the dataset¶
The Cora dataset is a homogeneous network where all nodes are papers and edges between nodes are citation links, e.g. paper A cites paper B.
(See the “Loading from Pandas” demo for details on how data can be loaded.)
[4]:
dataset = datasets.Cora()
display(HTML(dataset.description))
graph, _ = dataset.load(largest_connected_component_only=True, str_node_ids=True)
[5]:
print(graph.info())
StellarGraph: Undirected multigraph
 Nodes: 2485, Edges: 5209

 Node types:
  paper: [2485]
    Features: float32 vector, length 1433
    Edge types: paper-cites->paper

 Edge types:
    paper-cites->paper: [5209]
        Weights: all 1 (default)
        Features: none
Construct splits of the input data¶
We have to carefully split the data to avoid data leakage and evaluate the algorithms correctly:
For computing node embeddings, a Train Graph (graph_train)
For training classifiers, a classifier Training Set (examples_train) of positive and negative edges that weren't used for computing node embeddings
For choosing the best classifier, a Model Selection Test Set (examples_model_selection) of positive and negative edges that weren't used for computing node embeddings or training the classifier
For the final evaluation, with the learned node embeddings from the Train Graph (graph_train), the chosen best classifier is applied to a Test Set (examples_test) of positive and negative edges not used for computing the node embeddings, training the classifier, or model selection
Test Graph¶
We begin with the full graph and use the EdgeSplitter class to produce:
Test Graph
Test set of positive/negative link examples
The Test Graph is the reduced graph we obtain from removing the test set of links from the full graph.
[6]:
# Define an edge splitter on the original graph:
edge_splitter_test = EdgeSplitter(graph)
# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from graph, and obtain the
# reduced graph graph_test with the sampled links removed:
graph_test, examples_test, labels_test = edge_splitter_test.train_test_split(
p=0.1, method="global"
)
print(graph_test.info())
** Sampled 520 positive and 520 negative edges. **
StellarGraph: Undirected multigraph
 Nodes: 2485, Edges: 4689

 Node types:
  paper: [2485]
    Features: float32 vector, length 1433
    Edge types: paper-cites->paper

 Edge types:
    paper-cites->paper: [4689]
        Weights: all 1 (default)
        Features: none
Train Graph¶
This time, we use the EdgeSplitter on the Test Graph and perform a train/test split on the examples to produce:
Train Graph
Training set of link examples
Set of link examples for model selection
[7]:
# Do the same process to compute a training subset from within the test graph
edge_splitter_train = EdgeSplitter(graph_test)
graph_train, examples, labels = edge_splitter_train.train_test_split(
p=0.1, method="global"
)
(
examples_train,
examples_model_selection,
labels_train,
labels_model_selection,
) = train_test_split(examples, labels, train_size=0.75, test_size=0.25)
print(graph_train.info())
** Sampled 468 positive and 468 negative edges. **
StellarGraph: Undirected multigraph
 Nodes: 2485, Edges: 4221

 Node types:
  paper: [2485]
    Features: float32 vector, length 1433
    Edge types: paper-cites->paper

 Edge types:
    paper-cites->paper: [4221]
        Weights: all 1 (default)
        Features: none
Below is a summary of the different splits that have been created in this section.
[8]:
pd.DataFrame(
[
(
"Training Set",
len(examples_train),
"Train Graph",
"Test Graph",
"Train the Link Classifier",
),
(
"Model Selection",
len(examples_model_selection),
"Train Graph",
"Test Graph",
"Select the best Link Classifier model",
),
(
"Test set",
len(examples_test),
"Test Graph",
"Full Graph",
"Evaluate the best Link Classifier",
),
],
columns=("Split", "Number of Examples", "Hidden from", "Picked from", "Use"),
).set_index("Split")
[8]:
| Split | Number of Examples | Hidden from | Picked from | Use |
|---|---|---|---|---|
| Training Set | 702 | Train Graph | Test Graph | Train the Link Classifier |
| Model Selection | 234 | Train Graph | Test Graph | Select the best Link Classifier model |
| Test set | 1040 | Test Graph | Full Graph | Evaluate the best Link Classifier |
Create random walker¶
We define a helper function to generate biased random walks from the given graph, with the following fixed random walk parameters:
p - Random walk parameter “p” that defines the probability, “1/p”, of returning to the source node
q - Random walk parameter “q” that defines the probability, “1/q”, of moving to a node away from the source node
[9]:
from stellargraph.data import BiasedRandomWalk
def create_biased_random_walker(graph, walk_num, walk_length):
# parameter settings for "p" and "q":
p = 1.0
q = 1.0
return BiasedRandomWalk(graph, n=walk_num, length=walk_length, p=p, q=q)
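A quick usage sketch (illustrative only; the embedding functions below call this helper internally). With p = q = 1.0 the walk reduces to an unbiased random walk, and run returns one list of node IDs per walk:
walker = create_biased_random_walker(graph, walk_num=2, walk_length=5)
walks = walker.run(nodes=list(graph.nodes()))  # 2 walks of length 5 from each node
print(len(walks), walks[0])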
Parameter Settings¶
We train Node2Vec, Attri2Vec, GraphSAGE, and GCN by following the same unsupervised learning procedure: we first generate a set of short random walks from the given graph and then learn node embeddings from batches of (target, context) pairs collected from the random walks. For learning node embeddings, we need to specify the following parameters:
dimension - Dimensionality of node embeddings
walk_number - Number of walks from each node
walk_length - Length of each random walk
epochs - Number of epochs for training the embedding learning model
batch_size - Batch size for training the embedding learning model
We consistently set the node embedding dimension to 128 for all algorithms, but use different hidden-layer configurations for each algorithm so that it can exert its respective power. The remaining parameters are set as:
| | Node2Vec | Attri2Vec | GraphSAGE | GCN |
|---|---|---|---|---|
| walk_number | 20 | 4 | 1 | 1 |
| walk_length | 5 | 5 | 5 | 5 |
| epochs | 6 | 6 | 6 | 6 |
| batch_size | 50 | 50 | 50 | 50 |
As all algorithms use the same walk_length, batch_size and epochs values, we set them uniformly here:
[10]:
walk_length = 5
[11]:
epochs = 6
[12]:
batch_size = 50
For different algorithms, users can find the best parameter setting with the Model Selection edge set.
Node2Vec¶
We use Node2Vec [1] to calculate node embeddings. These embeddings are learned so that nodes that are close in the graph remain close in the embedding space. We train Node2Vec with the StellarGraph Node2Vec components.
[13]:
from stellargraph.data import UnsupervisedSampler
from stellargraph.mapper import Node2VecLinkGenerator, Node2VecNodeGenerator
from stellargraph.layer import Node2Vec, link_classification
from tensorflow import keras
def node2vec_embedding(graph, name):
# Set the embedding dimension and walk number:
dimension = 128
walk_number = 20
print(f"Training Node2Vec for '{name}':")
graph_node_list = list(graph.nodes())
# Create the biased random walker to generate random walks
walker = create_biased_random_walker(graph, walk_number, walk_length)
# Create the unsupervised sampler to sample (target, context) pairs from random walks
unsupervised_samples = UnsupervisedSampler(
graph, nodes=graph_node_list, walker=walker
)
# Define a Node2Vec training generator, which generates batches of training pairs
generator = Node2VecLinkGenerator(graph, batch_size)
# Create the Node2Vec model
node2vec = Node2Vec(dimension, generator=generator)
# Build the model and expose input and output sockets of Node2Vec, for node pair inputs
x_inp, x_out = node2vec.in_out_tensors()
# Use the link_classification function to generate the output of the Node2Vec model
prediction = link_classification(
output_dim=1, output_act="sigmoid", edge_embedding_method="dot"
)(x_out)
# Stack the Node2Vec encoder and prediction layer into a Keras model, and specify the loss
model = keras.Model(inputs=x_inp, outputs=prediction)
model.compile(
optimizer=keras.optimizers.Adam(lr=1e-3),
loss=keras.losses.binary_crossentropy,
metrics=[keras.metrics.binary_accuracy],
)
# Train the model
model.fit(
generator.flow(unsupervised_samples),
epochs=epochs,
verbose=2,
use_multiprocessing=False,
workers=4,
shuffle=True,
)
# Build the model to predict node representations from node ids with the learned Node2Vec model parameters
x_inp_src = x_inp[0]
x_out_src = x_out[0]
embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)
# Get representations for all nodes in ``graph``
node_gen = Node2VecNodeGenerator(graph, batch_size).flow(graph_node_list)
node_embeddings = embedding_model.predict(node_gen, workers=1, verbose=0)
def get_embedding(u):
u_index = graph_node_list.index(u)
return node_embeddings[u_index]
return get_embedding
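As an aside (a hedged micro-optimisation, not part of the original notebook): graph_node_list.index(u) scans the whole list on every lookup, so if embeddings are queried many times, a precomputed index map is faster. A minimal sketch of the same closure:
def make_get_embedding(graph_node_list, node_embeddings):
    # Precompute node id -> row index once, making each lookup O(1)
    # instead of list.index's O(n) scan.
    index_of = {node_id: index for index, node_id in enumerate(graph_node_list)}

    def get_embedding(u):
        return node_embeddings[index_of[u]]

    return get_embedding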
Attri2Vec¶
We use Attri2Vec [2] to calculate node embeddings. Attri2Vec learns node representations by performing a linear/non-linear mapping on node content attributes while simultaneously making nodes that share similar random-walk context nodes have similar representations. As node content features are used to learn the embeddings, we expect Attri2Vec to achieve better link prediction performance than Node2Vec, which preserves network structure only.
[14]:
from stellargraph.mapper import Attri2VecLinkGenerator, Attri2VecNodeGenerator
from stellargraph.layer import Attri2Vec
def attri2vec_embedding(graph, name):
# Set the embedding dimension and walk number:
dimension = [128]
walk_number = 4
print(f"Training Attri2Vec for '{name}':")
graph_node_list = list(graph.nodes())
# Create the biased random walker to generate random walks
walker = create_biased_random_walker(graph, walk_number, walk_length)
# Create the unsupervised sampler to sample (target, context) pairs from random walks
unsupervised_samples = UnsupervisedSampler(
graph, nodes=graph_node_list, walker=walker
)
# Define an Attri2Vec training generator, which generates batches of training pairs
generator = Attri2VecLinkGenerator(graph, batch_size)
# Create the Attri2Vec model
attri2vec = Attri2Vec(
layer_sizes=dimension, generator=generator, bias=False, normalize=None
)
# Build the model and expose input and output sockets of Attri2Vec, for node pair inputs
x_inp, x_out = attri2vec.in_out_tensors()
# Use the link_classification function to generate the output of the Attri2Vec model
prediction = link_classification(
output_dim=1, output_act="sigmoid", edge_embedding_method="ip"
)(x_out)
# Stack the Attri2Vec encoder and prediction layer into a Keras model, and specify the loss
model = keras.Model(inputs=x_inp, outputs=prediction)
model.compile(
optimizer=keras.optimizers.Adam(lr=1e-3),
loss=keras.losses.binary_crossentropy,
metrics=[keras.metrics.binary_accuracy],
)
# Train the model
model.fit(
generator.flow(unsupervised_samples),
epochs=epochs,
verbose=2,
use_multiprocessing=False,
workers=1,
shuffle=True,
)
# Build the model to predict node representations from node features with the learned Attri2Vec model parameters
x_inp_src = x_inp[0]
x_out_src = x_out[0]
embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)
# Get representations for all nodes in ``graph``
node_gen = Attri2VecNodeGenerator(graph, batch_size).flow(graph_node_list)
node_embeddings = embedding_model.predict(node_gen, workers=1, verbose=0)
def get_embedding(u):
u_index = graph_node_list.index(u)
return node_embeddings[u_index]
return get_embedding
GraphSAGE¶
GraphSAGE [3] learns node embeddings for attributed graphs by aggregating neighboring node attributes. The aggregation parameters are learned by encouraging node pairs that co-occur in short random walks to have similar representations. As node attributes are also leveraged, GraphSAGE is expected to perform better than Node2Vec at link prediction.
[15]:
from stellargraph.mapper import GraphSAGELinkGenerator, GraphSAGENodeGenerator
from stellargraph.layer import GraphSAGE
def graphsage_embedding(graph, name):
# Set the embedding dimensions, the numbers of sampled neighboring nodes and walk number:
dimensions = [128, 128]
num_samples = [10, 5]
walk_number = 1
print(f"Training GraphSAGE for '{name}':")
graph_node_list = list(graph.nodes())
# Create the biased random walker to generate random walks
walker = create_biased_random_walker(graph, walk_number, walk_length)
# Create the unsupervised sampler to sample (target, context) pairs from random walks
unsupervised_samples = UnsupervisedSampler(
graph, nodes=graph_node_list, walker=walker
)
# Define a GraphSAGE training generator, which generates batches of training pairs
generator = GraphSAGELinkGenerator(graph, batch_size, num_samples)
# Create the GraphSAGE model
graphsage = GraphSAGE(
layer_sizes=dimensions,
generator=generator,
bias=True,
dropout=0.0,
normalize="l2",
)
# Build the model and expose input and output sockets of GraphSAGE, for node pair inputs
x_inp, x_out = graphsage.in_out_tensors()
# Use the link_classification function to generate the output of the GraphSAGE model
prediction = link_classification(
output_dim=1, output_act="sigmoid", edge_embedding_method="ip"
)(x_out)
# Stack the GraphSAGE encoder and prediction layer into a Keras model, and specify the loss
model = keras.Model(inputs=x_inp, outputs=prediction)
model.compile(
optimizer=keras.optimizers.Adam(lr=1e-3),
loss=keras.losses.binary_crossentropy,
metrics=[keras.metrics.binary_accuracy],
)
# Train the model
model.fit(
generator.flow(unsupervised_samples),
epochs=epochs,
verbose=2,
use_multiprocessing=False,
workers=4,
shuffle=True,
)
# Build the model to predict node representations from node features with the learned GraphSAGE model parameters
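    # Note: the link generator supplies interleaved (source, destination) input
    # tensors, one pair per neighborhood hop, so the even-indexed entries of
    # x_inp select the source-node inputs.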
x_inp_src = x_inp[0::2]
x_out_src = x_out[0]
embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)
# Get representations for all nodes in ``graph``
node_gen = GraphSAGENodeGenerator(graph, batch_size, num_samples).flow(
graph_node_list
)
node_embeddings = embedding_model.predict(node_gen, workers=1, verbose=0)
def get_embedding(u):
u_index = graph_node_list.index(u)
return node_embeddings[u_index]
return get_embedding
GCN¶
GCN [4] learns node embeddings through graph convolution. Traditionally, GCN relies on node labels as supervision for training. Here, we consider the unsupervised link prediction setting and learn informative GCN node embeddings by encouraging nodes that co-occur in short random walks to have similar representations, as is done when training GraphSAGE.
[16]:
from stellargraph.mapper import FullBatchLinkGenerator, FullBatchNodeGenerator
from stellargraph.layer import GCN, LinkEmbedding
def gcn_embedding(graph, name):
# Set the embedding dimensions and walk number:
dimensions = [128, 128]
walk_number = 1
print(f"Training GCN for '{name}':")
graph_node_list = list(graph.nodes())
# Create the biased random walker to generate random walks
walker = create_biased_random_walker(graph, walk_number, walk_length)
# Create the unsupervised sampler to sample (target, context) pairs from random walks
unsupervised_samples = UnsupervisedSampler(
graph, nodes=graph_node_list, walker=walker
)
# Define a GCN training generator, which generates the full batch of training pairs
generator = FullBatchLinkGenerator(graph, method="gcn")
# Create the GCN model
gcn = GCN(
layer_sizes=dimensions,
activations=["relu", "relu"],
generator=generator,
dropout=0.3,
)
# Build the model and expose input and output sockets of GCN, for node pair inputs
x_inp, x_out = gcn.in_out_tensors()
# Use the dot product of node embeddings to make node pairs co-occurring in short random walks represented closely
prediction = LinkEmbedding(activation="sigmoid", method="ip")(x_out)
prediction = keras.layers.Reshape((-1,))(prediction)
# Stack the GCN encoder and prediction layer into a Keras model, and specify the loss
model = keras.Model(inputs=x_inp, outputs=prediction)
model.compile(
optimizer=keras.optimizers.Adam(lr=1e-3),
loss=keras.losses.binary_crossentropy,
metrics=[keras.metrics.binary_accuracy],
)
# Train the model
batches = unsupervised_samples.run(batch_size)
for epoch in range(epochs):
print(f"Epoch: {epoch+1}/{epochs}")
batch_iter = 1
for batch in batches:
samples = generator.flow(batch[0], targets=batch[1], use_ilocs=True)[0]
[loss, accuracy] = model.train_on_batch(x=samples[0], y=samples[1])
output = (
f"{batch_iter}/{len(batches)} - loss:"
+ " {:6.4f}".format(loss)
+ " - binary_accuracy:"
+ " {:6.4f}".format(accuracy)
)
if batch_iter == len(batches):
print(output)
else:
print(output, end="\r")
batch_iter = batch_iter + 1
# Get representations for all nodes in ``graph``
embedding_model = keras.Model(inputs=x_inp, outputs=x_out)
node_embeddings = embedding_model.predict(
generator.flow(list(zip(graph_node_list, graph_node_list)))
)
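    # The full-batch model predicts per (source, target) pair; as each pair here
    # is (node, node), keep the source-side embedding of every pair and drop the
    # batch dimension to obtain one vector per node.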
node_embeddings = node_embeddings[0][:, 0, :]
def get_embedding(u):
u_index = graph_node_list.index(u)
return node_embeddings[u_index]
return get_embedding
Train and evaluate the link prediction model¶
There are a few steps involved in using the learned embeddings to perform link prediction:
1. We calculate link/edge embeddings for the positive and negative edge samples by applying a binary operator to the embeddings of the source and target nodes of each sampled edge.
2. Given the embeddings of the positive and negative examples, we train a logistic regression classifier to predict a binary value indicating whether an edge between two nodes should exist or not.
3. We evaluate the performance of the link classifier for each of the 4 operators on the training data, with node embeddings calculated on the Train Graph (graph_train), and select the best classifier.
4. The best classifier is then used to calculate scores on the test data, with node embeddings trained on the Train Graph (graph_train).
Below is a set of helper functions that let us repeat these steps for each of the binary operators.
[17]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
# 1. link embeddings
def link_examples_to_features(link_examples, transform_node, binary_operator):
return [
binary_operator(transform_node(src), transform_node(dst))
for src, dst in link_examples
]
# 2. training classifier
def train_link_prediction_model(
link_examples, link_labels, get_embedding, binary_operator
):
clf = link_prediction_classifier()
link_features = link_examples_to_features(
link_examples, get_embedding, binary_operator
)
clf.fit(link_features, link_labels)
return clf
def link_prediction_classifier(max_iter=5000):
lr_clf = LogisticRegressionCV(Cs=10, cv=10, scoring="roc_auc", max_iter=max_iter)
return Pipeline(steps=[("sc", StandardScaler()), ("clf", lr_clf)])
# 3. and 4. evaluate classifier
def evaluate_link_prediction_model(
clf, link_examples_test, link_labels_test, get_embedding, binary_operator
):
link_features_test = link_examples_to_features(
link_examples_test, get_embedding, binary_operator
)
score = evaluate_roc_auc(clf, link_features_test, link_labels_test)
return score
def evaluate_roc_auc(clf, link_features, link_labels):
predicted = clf.predict_proba(link_features)
# check which class corresponds to positive links
positive_column = list(clf.classes_).index(1)
return roc_auc_score(link_labels, predicted[:, positive_column])
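For intuition about the metric (illustrative numbers only): roc_auc_score measures how well the predicted probabilities rank positive examples above negative ones, with 1.0 a perfect ranking and 0.5 no better than chance.
from sklearn.metrics import roc_auc_score

# 3 of the 4 (negative, positive) pairs are ranked correctly -> AUC = 0.75
print(roc_auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75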
We consider 4 different operators:
Hadamard
\(L_1\)
\(L_2\)
average
The paper [1] provides a detailed description of these operators. All operators produce link embeddings that have equal dimensionality to the input node embeddings (128 dimensions for our example).
[18]:
def operator_hadamard(u, v):
return u * v
def operator_l1(u, v):
return np.abs(u - v)
def operator_l2(u, v):
return (u - v) ** 2
def operator_avg(u, v):
return (u + v) / 2.0
def run_link_prediction(binary_operator, embedding_train):
clf = train_link_prediction_model(
examples_train, labels_train, embedding_train, binary_operator
)
score = evaluate_link_prediction_model(
clf,
examples_model_selection,
labels_model_selection,
embedding_train,
binary_operator,
)
return {
"classifier": clf,
"binary_operator": binary_operator,
"score": score,
}
binary_operators = [operator_hadamard, operator_l1, operator_l2, operator_avg]
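As a quick sanity check (toy vectors, illustrative only), each operator maps a pair of embeddings to a link embedding of the same dimensionality, as stated above:
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 0.0, 4.0])
for op in binary_operators:
    print(op.__name__, op(u, v))
# operator_hadamard [ 2.  0. 12.]
# operator_l1 [1. 2. 1.]
# operator_l2 [1. 4. 1.]
# operator_avg [1.5 1.  3.5]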
Train and evaluate the link model with the specified embedding¶
[19]:
def train_and_evaluate(embedding, name):
embedding_train = embedding(graph_train, "Train Graph")
# Train the link classification model with the learned embedding
results = [run_link_prediction(op, embedding_train) for op in binary_operators]
best_result = max(results, key=lambda result: result["score"])
print(
f"\nBest result with '{name}' embeddings from '{best_result['binary_operator'].__name__}'"
)
display(
pd.DataFrame(
[(result["binary_operator"].__name__, result["score"]) for result in results],
columns=("name", "ROC AUC"),
).set_index("name")
)
# Evaluate the best model using the test set
test_score = evaluate_link_prediction_model(
best_result["classifier"],
examples_test,
labels_test,
embedding_train,
best_result["binary_operator"],
)
return test_score
Collect the link prediction results for Node2Vec, Attri2Vec, GraphSAGE and GCN¶
Get Node2Vec link prediction result¶
[20]:
node2vec_result = train_and_evaluate(node2vec_embedding, "Node2Vec")
Training Node2Vec for 'Train Graph':
link_classification: using 'dot' method to combine node embeddings into edge embeddings
Train for 7674 steps
Epoch 1/6
7674/7674 - 27s - loss: 0.5544 - binary_accuracy: 0.6739
Epoch 2/6
7674/7674 - 38s - loss: 0.4469 - binary_accuracy: 0.7548
Epoch 3/6
7674/7674 - 38s - loss: 0.2881 - binary_accuracy: 0.8811
Epoch 4/6
7674/7674 - 38s - loss: 0.1686 - binary_accuracy: 0.9420
Epoch 5/6
7674/7674 - 39s - loss: 0.1369 - binary_accuracy: 0.9522
Epoch 6/6
7674/7674 - 41s - loss: 0.1277 - binary_accuracy: 0.9551
Best result with 'Node2Vec' embeddings from 'operator_l2'
| name | ROC AUC |
|---|---|
| operator_hadamard | 0.810491 |
| operator_l1 | 0.835257 |
| operator_l2 | 0.845412 |
| operator_avg | 0.513223 |
Get Attri2Vec link prediction result¶
[21]:
attri2vec_result = train_and_evaluate(attri2vec_embedding, "Attri2Vec")
Training Attri2Vec for 'Train Graph':
link_classification: using 'ip' method to combine node embeddings into edge embeddings
Train for 1535 steps
Epoch 1/6
1535/1535 - 4s - loss: 0.6997 - binary_accuracy: 0.5324
Epoch 2/6
1535/1535 - 4s - loss: 0.6275 - binary_accuracy: 0.6479
Epoch 3/6
1535/1535 - 4s - loss: 0.4614 - binary_accuracy: 0.8077
Epoch 4/6
1535/1535 - 4s - loss: 0.3255 - binary_accuracy: 0.8848
Epoch 5/6
1535/1535 - 5s - loss: 0.2451 - binary_accuracy: 0.9198
Epoch 6/6
1535/1535 - 4s - loss: 0.1901 - binary_accuracy: 0.9427
Best result with 'Attri2Vec' embeddings from 'operator_l1'
| name | ROC AUC |
|---|---|
| operator_hadamard | 0.871274 |
| operator_l1 | 0.893191 |
| operator_l2 | 0.883913 |
| operator_avg | 0.542300 |
Get GraphSAGE link prediction result¶
[22]:
graphsage_result = train_and_evaluate(graphsage_embedding, "GraphSAGE")
Training GraphSAGE for 'Train Graph':
link_classification: using 'ip' method to combine node embeddings into edge embeddings
Train for 384 steps
Epoch 1/6
384/384 - 31s - loss: 0.5610 - binary_accuracy: 0.7552
Epoch 2/6
384/384 - 26s - loss: 0.5380 - binary_accuracy: 0.7844
Epoch 3/6
384/384 - 29s - loss: 0.5376 - binary_accuracy: 0.7840
Epoch 4/6
384/384 - 26s - loss: 0.5339 - binary_accuracy: 0.7943
Epoch 5/6
384/384 - 26s - loss: 0.5321 - binary_accuracy: 0.7958
Epoch 6/6
384/384 - 26s - loss: 0.5313 - binary_accuracy: 0.7936
Best result with 'GraphSAGE' embeddings from 'operator_l2'
| name | ROC AUC |
|---|---|
| operator_hadamard | 0.883986 |
| operator_l1 | 0.882963 |
| operator_l2 | 0.888150 |
| operator_avg | 0.507890 |
Get GCN link prediction result¶
[23]:
gcn_result = train_and_evaluate(gcn_embedding, "GCN")
Training GCN for 'Train Graph':
Using GCN (local pooling) filters...
Epoch: 1/6
384/384 - loss: 0.4742 - binary_accuracy: 0.5882
Epoch: 2/6
384/384 - loss: 0.3981 - binary_accuracy: 0.8235
Epoch: 3/6
384/384 - loss: 0.3889 - binary_accuracy: 0.9412
Epoch: 4/6
384/384 - loss: 0.3537 - binary_accuracy: 0.9118
Epoch: 5/6
384/384 - loss: 0.3784 - binary_accuracy: 0.8824
Epoch: 6/6
384/384 - loss: 0.3334 - binary_accuracy: 0.8824
Best result with 'GCN' embeddings from 'operator_hadamard'
| name | ROC AUC |
|---|---|
| operator_hadamard | 0.881794 |
| operator_l1 | 0.824445 |
| operator_l2 | 0.760666 |
| operator_avg | 0.595777 |
Comparison between Node2Vec, Attri2Vec, GraphSAGE and GCN on the test set¶
The ROC AUC scores on the test set of links, for each embedding method with its corresponding best operator:
[24]:
pd.DataFrame(
[
("Node2Vec", node2vec_result),
("Attri2Vec", attri2vec_result),
("GraphSAGE", graphsage_result),
("GCN", gcn_result),
],
columns=("name", "ROC AUC"),
).set_index("name")
[24]:
| name | ROC AUC |
|---|---|
| Node2Vec | 0.843417 |
| Attri2Vec | 0.936742 |
| GraphSAGE | 0.925422 |
| GCN | 0.902997 |
Conclusion¶
This example has demonstrated how to use the stellargraph library to build a link prediction algorithm for homogeneous graphs using the unsupervised embeddings learned by Node2Vec [1], Attri2Vec [2], GraphSAGE [3] and GCN [4].
For more information about the link prediction process, each of these algorithms has a dedicated demo with more details.