Inductive Node Representation Learning through attri2vec¶

Run the master version of this notebook:

This is the python implementation of the attri2vec algorithm outlined in paper *`Attributed Network Embedding Via Subspace Discovery <https://arxiv.org/abs/1901.04095>`__* D. Zhang, Y. Jie, X. Zhu and C. Zhang, arXiv:1901.04095, [cs.SI], 2019. The implementation uses the stellargraph libraries.

attri2vec¶

attri2vec learns node representations by performing a linear/non-linear mapping on node content attributes. To make the learned node representations respect structural similarity, `DeepWalk <https://dl.acm.org/citation.cfm?id=2623732>`__/`node2vec <https://snap.stanford.edu/node2vec>`__ learning mechanism is used to make nodes sharing similar random walk context nodes represented closely in the subspace, which is achieved by maximizing the occurrence probability of context nodes conditioned on the representation of the target nodes. The probability is modelled by Softmax and negative sampling is used to speed up its calculation. This makes attri2vec equivalent to predict whether a node occurs in the given target node’s context in random walks with the representation of the target node, by minimizing the cross-entropy loss.

In implementation, node embeddings are learnt by solving a simple classification task: given a large set of “positive” (target, context) node pairs generated from random walks performed on the graph (i.e., node pairs that co-occur within a certain context window in random walks), and an equally large set of “negative” node pairs that are randomly selected from the graph according to a certain distribution, learn a binary classifier that predicts whether arbitrary node pairs are likely to co-occur in a random walk performed on the graph. Through learning this simple binary node-pair-classification task, the model automatically learns an inductive mapping from attributes of nodes to node embeddings in a low-dimensional vector space, which preserves structural and feature similarities of the nodes.

To train the attri2vec model, we first construct a training set of nodes, which is composed of an equal number of positive and negative (target, context) pairs from the graph. The positive (target, context) pairs are the node pairs co-occurring on random walks over the graph whereas the negative node pairs are the sampled randomly from the global node degree distribution of the graph. In attri2vec, each node is attached with two kinds of embeddings: 1) the inductive ‘input embedding’, i.e, the objective embedding, obtained by perform a non-linear transformation on node content features, and 2) ‘output embedding’, i.e., the parameter vector used to predict its occurrence as a context node, obtained by looking up a parameter table. Given a (target, context) pair, attri2vec outputs a predictive value to indicate whether it is positive or negative, which is obtained by performing the dot product of the ‘input embedding’ of the target node and the ‘output embedding’ of the context node, followed by a sigmoid activation.

The entire model is trained end-to-end by minimizing the binary cross-entropy loss function with regards to predicted node pair labels and true node pair labels, using stochastic gradient descent (SGD) updates of the model parameters, with minibatches of ‘training’ node pairs generated on demand and fed into the model.

In this demo, we first train the attri2vec model on the in-sample subgraph and obtain a mapping function from node attributes to node representations, then apply the mapping function to the content attributes of out-of-sample nodes and obtain the representations of out-of-sample nodes. We evaluate the quality of inferred out-of-sample node representations by using it to predict the links of out-of-sample nodes.

[1]:

# install StellarGraph if running on Google Colab
import sys
if 'google.colab' in sys.modules:
  %pip install -q stellargraph[demos]==1.0.0rc1

[2]:

# verify that we're using the correct version of StellarGraph for this notebook
import stellargraph as sg

try:
    sg.utils.validate_notebook_version("1.0.0rc1")
except AttributeError:
    raise ValueError(
        f"This notebook requires StellarGraph version 1.0.0rc1, but a different version {sg.__version__} is installed.  Please see <https://github.com/stellargraph/stellargraph/issues/1172>."
    ) from None

[3]:

import networkx as nx
import pandas as pd
import numpy as np
import os
import random

import stellargraph as sg
from stellargraph.data import UnsupervisedSampler
from stellargraph.mapper import Attri2VecLinkGenerator, Attri2VecNodeGenerator
from stellargraph.layer import Attri2Vec, link_classification

from tensorflow import keras

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

Loading DBLP network data¶

This demo uses a DBLP citation network, a subgraph extracted from DBLP-Citation-network V3. To form this subgraph, papers from four subjects are extracted according to their venue information: Database, Data Mining, Artificial Intelligence and Computer Vision, and papers with no citations are removed. The DBLP network contains 18,448 papers and 45,661 citation relations. From paper titles, we construct 2,476-dimensional binary node feature vectors, with each element indicating the presence/absence of the corresponding word. By ignoring the citation direction, we take the DBLP subgraph as an undirected network.

As papers in DBLP are attached with publication year, the DBLP network with the dynamic property can be used to study the problem of out-of-sample node representation learning. From the DBLP network, we construct four in-sample subgraphs using papers published before 2006, 2007, 2008 and 2009, and denote the four subgraphs as DBLP2006, DBLP2007, DBLP2008, and DBLP2009. For each subgraph, the remaining papers are taken as out-of-sample nodes. We consider the case where new coming nodes have no links. We predict the links of out-of-sample nodes using the learned out-of-sample node representations and compare its performance with the node content feature baseline.

The dataset used in this demo can be downloaded from https://www.kaggle.com/daozhang/dblp-subgraph. The following is the description of the dataset:

The content.txt file contains descriptions of the papers in the following format:

    <paper_id> <word_attributes> <class_label> <publication_year>

The first entry in each line contains the unique integer ID (ranging from 0 to 18,447) of the paper followed by binary values indicating whether each word in the vocabulary is present (indicated by 1) or absent (indicated by 0) in the paper. Finally, the last two entries in the line are the class label and the publication year of the paper. The edgeList.txt file contains the citation relations. Each line describes a link in the following format:

    <ID of paper1> <ID of paper2>

Each line contains two paper IDs, with paper2 citing paper1 or paper1 citing paper2.

Download and unzip the dblp-subgraph.zip file to a location on your computer and set the data_dir variable to point to the location of the dataset (the “DBLP” directory containing “content.txt” and “edgeList.txt”).

[4]:

data_dir = "~/data/DBLP"

Load the graph from the edgelist.

[5]:

edgelist = pd.read_csv(
    os.path.join(data_dir, "edgeList.txt"),
    sep="\t",
    header=None,
    names=["source", "target"],
)
edgelist["label"] = "cites"  # set the edge type

Load paper content features, subjects and publishing years.

[6]:

feature_names = ["w_{}".format(ii) for ii in range(2476)]
node_column_names = feature_names + ["subject", "year"]
node_data = pd.read_csv(
    os.path.join(data_dir, "content.txt"), sep="\t", header=None, names=node_column_names
)

Construct the whole graph from edge list.

[7]:

G_all_nx = nx.from_pandas_edgelist(edgelist, edge_attr="label")

Specify node types.

[8]:

nx.set_node_attributes(G_all_nx, "paper", "label")

Get node features.

[9]:

all_node_features = node_data[feature_names]

Create the Stellargraph with node features.

[10]:

G_all = sg.StellarGraph.from_networkx(G_all_nx, node_features=all_node_features)

[11]:

print(G_all.info())

NetworkXStellarGraph: Undirected multigraph
 Nodes: 18448, Edges: 45611

 Node types:
  paper: [18448]
    Edge types: paper-cites->paper

 Edge types:
    paper-cites->paper: [45611]

Get DBLP Subgraph¶

with papers published before a threshold year¶

Get the edge list connecting in-sample nodes.

[12]:

year_thresh = 2006  # the threshold year for in-sample and out-of-sample set split, which can be 2007, 2008 and 2009
subgraph_edgelist = []
for ii in range(len(edgelist)):
    source_index = edgelist["source"][ii]
    target_index = edgelist["target"][ii]
    source_year = int(node_data["year"][source_index])
    target_year = int(node_data["year"][target_index])
    if source_year < year_thresh and target_year < year_thresh:
        subgraph_edgelist.append([source_index, target_index])
subgraph_edgelist = pd.DataFrame(
    np.array(subgraph_edgelist), columns=["source", "target"]
)
subgraph_edgelist["label"] = "cites"  # set the edge type

Construct the network from the selected edge list.

[13]:

G_sub_nx = nx.from_pandas_edgelist(subgraph_edgelist, edge_attr="label")

Specify node types.

[14]:

nx.set_node_attributes(G_sub_nx, "paper", "label")

Get the ids of the nodes in the selected subgraph.

[15]:

subgraph_node_ids = sorted(list(G_sub_nx.nodes))

Get the node features of the selected subgraph.

[16]:

subgraph_node_features = node_data[feature_names].reindex(subgraph_node_ids)

Create the Stellargraph with node features.

[17]:

G_sub = sg.StellarGraph.from_networkx(G_sub_nx, node_features=subgraph_node_features)

[18]:

print(G_sub.info())

NetworkXStellarGraph: Undirected multigraph
 Nodes: 11776, Edges: 28937

 Node types:
  paper: [11776]
    Edge types: paper-cites->paper

 Edge types:
    paper-cites->paper: [28937]

Train attri2vec on the DBLP Subgraph¶

Specify the other optional parameter values: root nodes, the number of walks to take per node, the length of each walk.

[19]:

nodes = list(G_sub.nodes())
number_of_walks = 2
length = 5

Create the UnsupervisedSampler instance with the relevant parameters passed to it.

[20]:

unsupervised_samples = UnsupervisedSampler(
    G_sub, nodes=nodes, length=length, number_of_walks=number_of_walks
)

Set the batch size and the number of epochs.

[21]:

batch_size = 50
epochs = 6

Define an attri2vec training generator, which generates a batch of (feature of target node, index of context node, label of node pair) pairs per iteration.

[22]:

generator = Attri2VecLinkGenerator(G_sub, batch_size)

Building the model: a 1-hidden-layer node representation (‘input embedding’) of the target node and the parameter vector (‘output embedding’) for predicting the existence of context node for each (target context) pair, with a link classification layer performed on the dot product of the ‘input embedding’ of the target node and the ‘output embedding’ of the context node.

Attri2Vec part of the model, with a 128-dimenssion hidden layer, no bias term, no dropout and no normalization. (Dropout can be switched on by specifying a positive dropout rate, 0 < dropout < 1 and normalization can be set to ‘l2’).

[23]:

layer_sizes = [128]
attri2vec = Attri2Vec(
    layer_sizes=layer_sizes, generator=generator, bias=False, normalize=None
)

[24]:

# Build the model and expose input and output sockets of attri2vec, for node pair inputs:
x_inp, x_out = attri2vec.in_out_tensors()

Use the link_classification function to generate the prediction, with the ‘ip’ edge embedding generation method and the ‘sigmoid’ activation, which actually performs the dot product of the ‘input embedding’ of the target node and the ‘output embedding’ of the context node followed by a sigmoid activation.

[25]:

prediction = link_classification(
    output_dim=1, output_act="sigmoid", edge_embedding_method="ip"
)(x_out)

link_classification: using 'ip' method to combine node embeddings into edge embeddings

Stack the Attri2Vec encoder and prediction layer into a Keras model, and specify the loss.

[26]:

model = keras.Model(inputs=x_inp, outputs=prediction)

model.compile(
    optimizer=keras.optimizers.Adam(lr=1e-2),
    loss=keras.losses.binary_crossentropy,
    metrics=[keras.metrics.binary_accuracy],
)

Train the model.

[27]:

history = model.fit(
    generator.flow(unsupervised_samples),
    epochs=epochs,
    verbose=2,
    use_multiprocessing=False,
    workers=1,
    shuffle=True,
)

Epoch 1/6
4711/4711 - 194s - loss: 0.7453 - binary_accuracy: 0.5219
Epoch 2/6
4711/4711 - 194s - loss: 0.6641 - binary_accuracy: 0.5677
Epoch 3/6
4711/4711 - 201s - loss: 0.5845 - binary_accuracy: 0.6375
Epoch 4/6
4711/4711 - 195s - loss: 0.5185 - binary_accuracy: 0.7198
Epoch 5/6
4711/4711 - 194s - loss: 0.4662 - binary_accuracy: 0.7767
Epoch 6/6
4711/4711 - 194s - loss: 0.4220 - binary_accuracy: 0.8110

Predicting links of out-of-sample nodes with the learned attri2vec model¶

Build the node based model for predicting node representations from node content attributes with the learned parameters. Below a Keras model is constructed, with x_inp[0] as input and x_out[0] as output. Note that this model’s weights are the same as those of the corresponding node encoder in the previously trained node pair classifier.

[28]:

x_inp_src = x_inp[0]
x_out_src = x_out[0]
embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)

Get the node embeddings, for both in-sample and out-of-sample nodes, by applying the learned mapping function to node content features.

[29]:

node_ids = node_data.index
node_gen = Attri2VecNodeGenerator(G_all, batch_size).flow(node_ids)
node_embeddings = embedding_model.predict(node_gen, workers=4, verbose=1)

369/369 [==============================] - 1s 2ms/step

Get the positive and negative edges for in-sample nodes and out-of-sample nodes. The edges of the in-sample nodes only include the edges between in-sample nodes, and the edges of out-of-sample nodes are referred to all the edges linked to out-of-sample nodes, including the edges connecting in-sample and out-of-sample edges.

[30]:

year_thresh = 2006
in_sample_edges = []
out_of_sample_edges = []
for ii in range(len(edgelist)):
    source_index = edgelist["source"][ii]
    target_index = edgelist["target"][ii]
    if source_index > target_index:  # neglect edge direction for the undirected graph
        continue
    source_year = int(node_data["year"][source_index])
    target_year = int(node_data["year"][target_index])
    if source_year < year_thresh and target_year < year_thresh:
        in_sample_edges.append([source_index, target_index, 1])  # get the positive edge
        negative_target_index = unsupervised_samples.random.choices(
            node_data.index.tolist(), k=1
        )  # generate negative node
        in_sample_edges.append(
            [source_index, negative_target_index[0], 0]
        )  # get the negative edge
    else:
        out_of_sample_edges.append(
            [source_index, target_index, 1]
        )  # get the positive edge
        negative_target_index = unsupervised_samples.random.choices(
            node_data.index.tolist(), k=1
        )  # generate negative node
        out_of_sample_edges.append(
            [source_index, negative_target_index[0], 0]
        )  # get the negative edge
in_sample_edges = np.array(in_sample_edges)
out_of_sample_edges = np.array(out_of_sample_edges)

Construct the edge features from the learned node representations with l2 normed difference, where edge features are the element-wise square of the difference between the embeddings of two head nodes. Other strategy like element-wise product can also be used to construct edge features.

[31]:

in_sample_edge_feat_from_emb = (
    node_embeddings[in_sample_edges[:, 0]] - node_embeddings[in_sample_edges[:, 1]]
) ** 2
out_of_sample_edge_feat_from_emb = (
    node_embeddings[out_of_sample_edges[:, 0]]
    - node_embeddings[out_of_sample_edges[:, 1]]
) ** 2

Train the Logistic Regression classifier from in-sample edges with the edge features constructed from attri2vec embeddings.

[32]:

clf_edge_pred_from_emb = LogisticRegression(
    verbose=0, solver="lbfgs", multi_class="auto", max_iter=500
)
clf_edge_pred_from_emb.fit(in_sample_edge_feat_from_emb, in_sample_edges[:, 2])

[32]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Predict the edge existence probability with the trained Logistic Regression classifier.

[33]:

edge_pred_from_emb = clf_edge_pred_from_emb.predict_proba(
    out_of_sample_edge_feat_from_emb
)

Get the positive class index of edge_pred_from_emb.

[34]:

if clf_edge_pred_from_emb.classes_[0] == 1:
    positive_class_index = 0
else:
    positive_class_index = 1

Evaluate the AUC score for the prediction with attri2vec embeddings.

[35]:

roc_auc_score(out_of_sample_edges[:, 2], edge_pred_from_emb[:, positive_class_index])

[35]:

0.7880958790510728

As the baseline, we also investigate the performance of node content features in predicting the edges of out-of-sample nodes. Firstly, we construct edge features from node content features with the same strategy.

[36]:

in_sample_edge_rep_from_feat = (
    node_data[feature_names].values[in_sample_edges[:, 0]]
    - node_data[feature_names].values[in_sample_edges[:, 1]]
) ** 2
out_of_sample_edge_rep_from_feat = (
    node_data[feature_names].values[out_of_sample_edges[:, 0]]
    - node_data[feature_names].values[out_of_sample_edges[:, 1]]
) ** 2

Then we train the Logistic Regression classifier from in-sample edges with the edge features constructed from node content features.

[37]:

clf_edge_pred_from_feat = LogisticRegression(
    verbose=0, solver="lbfgs", multi_class="auto", max_iter=500
)
clf_edge_pred_from_feat.fit(in_sample_edge_rep_from_feat, in_sample_edges[:, 2])

[37]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Predict the edge existence probability with the trained Logistic Regression classifier.

[38]:

edge_pred_from_feat = clf_edge_pred_from_feat.predict_proba(
    out_of_sample_edge_rep_from_feat
)

Get positive class index of clf_edge_pred_from_feat.

[39]:

if clf_edge_pred_from_feat.classes_[0] == 1:
    positive_class_index = 0
else:
    positive_class_index = 1

Evaluate the AUC score for the prediction with node content features.

[40]:

roc_auc_score(out_of_sample_edges[:, 2], edge_pred_from_feat[:, positive_class_index])

[40]:

0.6601881857840772

attri2vec can inductively infer the representations of out-of-sample nodes from their content attributes. As the inferred node representations well capture both structure and node content information, they perform much better than node content features in predicting the links of out-of-sample nodes.

Run the master version of this notebook: