Notebook demo on loading Cora dataset into Neo4J Database¶
This notebook demonstrates how to load the Cora dataset into a Neo4J graph database.
Run the master version of this notebook:
[1]:
# install StellarGraph if running on Google Colab
import sys
if 'google.colab' in sys.modules:
    %pip install -q stellargraph[demos]==1.0.0rc1
[2]:
# verify that we're using the correct version of StellarGraph for this notebook
import stellargraph as sg
try:
    sg.utils.validate_notebook_version("1.0.0rc1")
except AttributeError:
    raise ValueError(
        f"This notebook requires StellarGraph version 1.0.0rc1, but a different version {sg.__version__} is installed. Please see <https://github.com/stellargraph/stellargraph/issues/1172>."
    ) from None
[3]:
import pandas as pd
import os
from stellargraph import datasets
from IPython.display import display, HTML
Load Cora dataset¶
[4]:
dataset = datasets.Cora()
display(HTML(dataset.description))
dataset.download()
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
[5]:
edge_list = pd.read_csv(
    os.path.join(dataset.data_directory, "cora.cites"),
    sep="\t",
    header=None,
    names=["target", "source"],
)
edge_list["label"] = "cites"
[6]:
display(edge_list.head(5))
|   | target | source  | label |
|---|--------|---------|-------|
| 0 | 35     | 1033    | cites |
| 1 | 35     | 103482  | cites |
| 2 | 35     | 103515  | cites |
| 3 | 35     | 1050679 | cites |
| 4 | 35     | 1103960 | cites |
[7]:
feature_names = ["w_{}".format(ii) for ii in range(1433)]
column_names = feature_names + ["subject"]
node_list = pd.read_csv(
    os.path.join(dataset.data_directory, "cora.content"),
    sep="\t",
    header=None,
    names=column_names,
)
Preprocess data¶
[8]:
# gather all features into lists under 'features' column.
node_list["features"] = node_list[feature_names].values.tolist()
node_list = node_list.drop(columns=feature_names)
node_list["id"] = node_list.index
node_list.head(5)
[8]:
|         | subject | features | id |
|---------|---------|----------|----|
| 31336   | Neural_Networks        | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 31336   |
| 1061127 | Rule_Learning          | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ... | 1061127 |
| 1106406 | Reinforcement_Learning | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 1106406 |
| 13195   | Reinforcement_Learning | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 13195   |
| 37879   | Probabilistic_Methods  | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 37879   |
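The feature-packing step above can be illustrated on a toy frame; the column names and values here are illustrative, not part of the Cora data:

```python
import pandas as pd

# a tiny stand-in for the Cora node table: two word columns plus a subject
toy = pd.DataFrame({"w_0": [0, 1], "w_1": [1, 0], "subject": ["x", "y"]})
feature_cols = ["w_0", "w_1"]

# gather the per-word columns into a single list-valued column, as done above
toy["features"] = toy[feature_cols].values.tolist()
toy = toy.drop(columns=feature_cols)
# each row now carries its whole word vector as one Python list
```

Storing the vector as a single list column is what lets each node be sent to Neo4J as one record with a `features` property.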
Ingest data into Neo4J database¶
We define the graph schema as follows:
- Each vertex represents a paper, with the attributes:
- subject (String): the class to which the paper belongs. There are seven classes in total.
- features (List[int]): a 0/1-valued vector indicating the presence of each word in the dictionary.
- ID (int): the id of the paper. (Note: this ID attribute is different from the Neo4J id, i.e., the internal id that Neo4J automatically assigns to each node or relationship.)
- Each directed edge represents a citation, pointing from the citing paper to the paper being cited.
As the Cora dataset is small, we can use Cypher queries and execute the transactions via a Python driver.
For bigger datasets, this loading job might take a very long time, so it is more convenient to use the neo4j-admin import
tool, tutorial here.
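The transaction-based loading used below follows a simple batching pattern, which can be sketched as a small helper; the function name, `batch_size` value, and toy DataFrame here are illustrative, not part of the notebook's data:

```python
import pandas as pd


def record_batches(df, batch_size):
    """Yield successive slices of a DataFrame as lists of record dicts,
    ready to be passed as parameters to a Cypher UNWIND query."""
    for start in range(0, len(df), batch_size):
        yield df.iloc[start : start + batch_size].to_dict("records")


# toy frame standing in for the Cora node list
toy = pd.DataFrame({"id": [1, 2, 3, 4, 5], "subject": ["a", "b", "c", "d", "e"]})
batches = list(record_batches(toy, batch_size=2))
# 5 rows with batch_size=2 yield batches of sizes 2, 2, 1
```

Sending a few hundred records per transaction keeps each query payload small while avoiding the overhead of one transaction per row.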
[9]:
import time
[10]:
import py2neo
default_host = os.environ.get("STELLARGRAPH_NEO4J_HOST")
# Create the Neo4J Graph database object; the arguments can be edited to specify location and authentication
graph = py2neo.Graph(host=default_host, port=None, user=None, password=None)
Delete the existing nodes and relationships in the current database.
[11]:
empty_db_query = """
MATCH (n)
DETACH DELETE n
"""

tx = graph.begin(autocommit=True)
tx.evaluate(empty_db_query)
Load all nodes to the graph database.
[12]:
loading_node_query = """
UNWIND $node_list as node
CREATE (e:paper {
    ID: toInteger(node.id),
    subject: node.subject,
    features: node.features
})
"""

# For efficient loading, we load batches of nodes into Neo4J.
batch_len = 500

for batch_start in range(0, len(node_list), batch_len):
    batch_end = batch_start + batch_len
    # turn node dataframe into a list of records
    records = node_list.iloc[batch_start:batch_end].to_dict("records")
    tx = graph.begin(autocommit=True)
    tx.evaluate(loading_node_query, parameters={"node_list": records})
Load all edges to the graph database.
[13]:
loading_edge_query = """
UNWIND $edge_list as edge
MATCH (source:paper {ID: toInteger(edge.source)})
MATCH (target:paper {ID: toInteger(edge.target)})
MERGE (source)-[r:cites]->(target)
"""

batch_len = 500

for batch_start in range(0, len(edge_list), batch_len):
    batch_end = batch_start + batch_len
    # turn edge dataframe into a list of records
    records = edge_list.iloc[batch_start:batch_end].to_dict("records")
    tx = graph.begin(autocommit=True)
    tx.evaluate(loading_edge_query, parameters={"edge_list": records})
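Since the edge-loading query silently skips any edge whose endpoints fail to MATCH, it can be worth checking referential integrity in pandas before loading. A minimal sketch on toy frames shaped like `node_list` and `edge_list` (the values here are illustrative, not the real Cora ids):

```python
import pandas as pd

# toy stand-ins for the notebook's node_list and edge_list
node_list = pd.DataFrame({"id": [35, 1033, 103482], "subject": ["a", "b", "c"]})
edge_list = pd.DataFrame(
    {"target": [35, 35], "source": [1033, 103482], "label": "cites"}
)

# every citation endpoint should refer to a known paper id
known_ids = set(node_list["id"])
dangling = edge_list[
    ~edge_list["source"].isin(known_ids) | ~edge_list["target"].isin(known_ids)
]
# an empty `dangling` frame means every edge endpoint has a matching paper node
```

If `dangling` is non-empty, those rows would simply be dropped by the MATCH clauses, so the relationship count in Neo4J would be lower than `len(edge_list)`.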