
Loading the Cora dataset into a Neo4j database

This notebook demonstrates how to load the Cora dataset into a Neo4j graph database.

[3]:
import pandas as pd
import os
from stellargraph import datasets
from IPython.display import display, HTML

Load the Cora dataset

(See the “Loading from Pandas” demo for details on how data can be loaded.)

[4]:
dataset = datasets.Cora()
display(HTML(dataset.description))
dataset.download()
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
[5]:
edge_list = pd.read_csv(
    os.path.join(dataset.data_directory, "cora.cites"),
    sep="\t",
    header=None,
    names=["target", "source"],
)
edge_list["label"] = "cites"
[6]:
display(edge_list.head(5))
   target   source  label
0      35     1033  cites
1      35   103482  cites
2      35   103515  cites
3      35  1050679  cites
4      35  1103960  cites
[7]:
feature_names = ["w_{}".format(ii) for ii in range(1433)]
column_names = feature_names + ["subject"]
node_list = pd.read_csv(
    os.path.join(dataset.data_directory, "cora.content"),
    sep="\t",
    header=None,
    names=column_names,
)
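
As a quick sanity check (not part of the original notebook), we can confirm that the two DataFrames match the dataset description above:

# The description above promises 2708 publications and 5429 citation links.
print(f"{len(node_list)} nodes, {len(edge_list)} edges")
assert len(node_list) == 2708
assert len(edge_list) == 5429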

Preprocess the data

[8]:
# gather all features into lists under 'features' column.
node_list["features"] = node_list[feature_names].values.tolist()

node_list = node_list.drop(columns=feature_names)
node_list["id"] = node_list.index
node_list.head(5)
[8]:
                        subject                                           features       id
31336           Neural_Networks  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...    31336
1061127           Rule_Learning  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...  1061127
1106406  Reinforcement_Learning  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  1106406
13195    Reinforcement_Learning  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...    13195
37879     Probabilistic_Methods  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...    37879
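
As another optional check (again not in the original notebook), we can verify that the subject column really contains the seven classes mentioned in the dataset description:

# Count papers per subject; there should be exactly seven classes.
print(node_list["subject"].value_counts())
assert node_list["subject"].nunique() == 7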

Ingest the data into the Neo4j database

We define the graph schema as follows:

  • Each vertex represents a paper, with three attributes:

    • subject (String): the class to which the paper belongs. There are seven classes in total.

    • features (List[int]): a 0/1-valued vector indicating the presence of each word in the dictionary.

    • ID (int): the id of each paper. (Note: this ID attribute is distinct from the internal id that Neo4j automatically assigns to each node or relationship.)

  • Each directed edge represents a citation and points to the paper being cited.

As the Cora dataset is small, we can run Cypher queries and execute the transactions via a Python driver.

For bigger datasets, this loading job might take very long, so it is more convenient to use the neo4j-admin import tool instead (see the Neo4j documentation for a tutorial); a rough sketch of the required CSV format follows.
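
For illustration only, here is a minimal sketch of how the two DataFrames could be written as CSVs in the header format that neo4j-admin import expects. The file names and the paper label below are our own choices, and array properties such as the word vector need a typed header (e.g. features:int[]) plus a matching array delimiter, so this sketch keeps only the scalar node columns; consult the Neo4j import documentation for the full details.

# Sketch: export CSVs with neo4j-admin import headers
# (":ID", ":LABEL", ":START_ID", ":END_ID", ":TYPE").
nodes_csv = node_list[["id", "subject"]].rename(columns={"id": "ID:ID"})
nodes_csv[":LABEL"] = "paper"
nodes_csv.to_csv("papers.csv", index=False)

edges_csv = edge_list.rename(
    columns={"source": ":START_ID", "target": ":END_ID", "label": ":TYPE"}
)
edges_csv.to_csv("cites.csv", index=False)

# The import itself runs from the command line against a stopped database,
# along the lines of:
#   neo4j-admin import --nodes=papers.csv --relationships=cites.csv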

[9]:
import time
[10]:
import py2neo

default_host = os.environ.get("STELLARGRAPH_NEO4J_HOST")

# Create the Neo4j Graph database object; port, user and password parameters
# can be added to specify the location and authentication
graph = py2neo.Graph(host=default_host)
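
For reference, a connection with explicit location and credentials might look like the following; the values are placeholders, and the exact keyword arguments depend on your py2neo version:

# Hypothetical explicit connection; replace the placeholder values with
# your own Neo4j instance's details.
graph = py2neo.Graph(host="localhost", port=7687, user="neo4j", password="<password>")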

Delete the existing nodes and relationships in the current database.

[11]:
empty_db_query = """
    MATCH (n)
    DETACH DELETE n
    """

tx = graph.begin(autocommit=True)
tx.evaluate(empty_db_query)

Delete any existing constraints or indexes in the current database.

[12]:
constraints = graph.run("CALL db.constraints").data()
for constraint in constraints:
    graph.run(f"DROP CONSTRAINT {constraint['name']}")

indexes = graph.run("CALL db.indexes").data()
for index in indexes:
    graph.run(f"DROP INDEX {index['name']}")

Load all nodes into the graph database.

[13]:
loading_node_query = """
    UNWIND $node_list AS node
    CREATE (e:paper {
        ID: toInteger(node.id),
        subject: node.subject,
        features: node.features
    })
    """

# For efficient loading, we load batches of nodes into Neo4j.
batch_len = 500

for batch_start in range(0, len(node_list), batch_len):
    batch_end = batch_start + batch_len
    # turn node dataframe into a list of records
    records = node_list.iloc[batch_start:batch_end].to_dict("records")
    tx = graph.begin(autocommit=True)
    tx.evaluate(loading_node_query, parameters={"node_list": records})
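
As a quick verification (not part of the original notebook), we can count the loaded nodes; the result should equal len(node_list), i.e. 2708:

# Count the :paper nodes created by the batched loading above.
node_count = graph.evaluate("MATCH (n:paper) RETURN count(n)")
print(node_count)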

Load all edges into the graph database.

[14]:
loading_edge_query = """
    UNWIND $edge_list AS edge

    MATCH (source:paper {ID: toInteger(edge.source)})
    MATCH (target:paper {ID: toInteger(edge.target)})

    MERGE (source)-[r:cites]->(target)
    """

batch_len = 500

for batch_start in range(0, len(edge_list), batch_len):
    batch_end = batch_start + batch_len
    # turn edge dataframe into a list of records
    records = edge_list.iloc[batch_start:batch_end].to_dict("records")
    tx = graph.begin(autocommit=True)
    tx.evaluate(loading_edge_query, parameters={"edge_list": records})
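
Similarly, we can count the loaded relationships. Since MERGE deduplicates repeated source/target pairs, the count is at most len(edge_list), i.e. 5429:

# Count the :cites relationships created by the batched loading above.
edge_count = graph.evaluate("MATCH (:paper)-[r:cites]->(:paper) RETURN count(r)")
print(edge_count)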

Ensure node IDs are unique. Creating this constraint also automatically creates an index, which will improve the performance of querying nodes by ID.

[15]:
node_id_constraint = """
    CREATE CONSTRAINT
    ON (n:paper)
    ASSERT n.ID IS UNIQUE
    """

tx = graph.begin(autocommit=True)
tx.evaluate(node_id_constraint)
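
With the constraint in place, lookups by ID are backed by the index. As an illustrative check (using a paper ID seen in the edge list above):

# Fetch one paper by ID; this lookup now uses the unique index on :paper(ID).
print(graph.evaluate("MATCH (n:paper {ID: 35}) RETURN n.subject"))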
