Notebook demo on loading Cora dataset into Neo4J Database¶
This notebook demonstrates how to load the Cora dataset into a Neo4J graph database.
Run the master version of this notebook:
[1]:
# install StellarGraph if running on Google Colab
import sys
if 'google.colab' in sys.modules:
    %pip install -q stellargraph[demos]==1.0.0rc1
[2]:
# verify that we're using the correct version of StellarGraph for this notebook
import stellargraph as sg
try:
    sg.utils.validate_notebook_version("1.0.0rc1")
except AttributeError:
    raise ValueError(
        f"This notebook requires StellarGraph version 1.0.0rc1, but a different version {sg.__version__} is installed. Please see <https://github.com/stellargraph/stellargraph/issues/1172>."
    ) from None
[3]:
import pandas as pd
import os
from stellargraph import datasets
from IPython.display import display, HTML
Load Cora dataset¶
[4]:
dataset = datasets.Cora()
display(HTML(dataset.description))
dataset.download()
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
[5]:
edge_list = pd.read_csv(
    os.path.join(dataset.data_directory, "cora.cites"),
    sep="\t",
    header=None,
    names=["target", "source"],
)
edge_list["label"] = "cites"
[6]:
display(edge_list.head(5))
|   | target | source  | label |
|---|--------|---------|-------|
| 0 | 35     | 1033    | cites |
| 1 | 35     | 103482  | cites |
| 2 | 35     | 103515  | cites |
| 3 | 35     | 1050679 | cites |
| 4 | 35     | 1103960 | cites |
[7]:
feature_names = ["w_{}".format(ii) for ii in range(1433)]
column_names = feature_names + ["subject"]
node_list = pd.read_csv(
    os.path.join(dataset.data_directory, "cora.content"),
    sep="\t",
    header=None,
    names=column_names,
)
Preprocess data¶
[8]:
# gather all features into lists under 'features' column.
node_list["features"] = node_list[feature_names].values.tolist()
node_list = node_list.drop(columns=feature_names)
node_list["id"] = node_list.index
node_list.head(5)
[8]:
|         | subject | features | id |
|---------|---------|----------|----|
| 31336   | Neural_Networks        | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 31336   |
| 1061127 | Rule_Learning          | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ... | 1061127 |
| 1106406 | Reinforcement_Learning | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 1106406 |
| 13195   | Reinforcement_Learning | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 13195   |
| 37879   | Probabilistic_Methods  | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 37879   |
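The feature-packing step above can be illustrated on a toy frame; the column names and values here are illustrative, not part of the Cora data:

```python
import pandas as pd

# a tiny stand-in for the Cora node table: two word columns plus a subject
toy = pd.DataFrame({"w_0": [0, 1], "w_1": [1, 0], "subject": ["x", "y"]})
feature_cols = ["w_0", "w_1"]

# gather the per-word columns into a single list-valued column, as done above
toy["features"] = toy[feature_cols].values.tolist()
toy = toy.drop(columns=feature_cols)
# each row now carries its whole word vector as one Python list
```

Storing the vector as a single list column is what lets each node be sent to Neo4J as one record with a `features` property.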
Ingest data into Neo4J database¶
We define the graph schema as follows:
- Each vertex represents a paper, with the attributes:
- subject (String): the class to which the paper belongs. There are seven classes in total.
- features (List[int]): a 0/1-valued vector indicating the presence of each word in the dictionary.
- ID (int): the id of the paper. (Note: this ID attribute is different from the Neo4J id, i.e., the internal id that Neo4J automatically assigns to each node or relationship.)
- Each directed edge represents a citation, pointing from the citing paper to the paper being cited.
As the Cora dataset is small, we can use Cypher queries and execute the transactions via a Python driver.
For bigger datasets, this loading job might take a very long time, so it is more convenient to use the neo4j-admin import
tool, tutorial here.
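The transaction-based loading used below follows a simple batching pattern, which can be sketched as a small helper; the function name, `batch_size` value, and toy DataFrame here are illustrative, not part of the notebook's data:

```python
import pandas as pd


def record_batches(df, batch_size):
    """Yield successive slices of a DataFrame as lists of record dicts,
    ready to be passed as parameters to a Cypher UNWIND query."""
    for start in range(0, len(df), batch_size):
        yield df.iloc[start : start + batch_size].to_dict("records")


# toy frame standing in for the Cora node list
toy = pd.DataFrame({"id": [1, 2, 3, 4, 5], "subject": ["a", "b", "c", "d", "e"]})
batches = list(record_batches(toy, batch_size=2))
# 5 rows with batch_size=2 yield batches of sizes 2, 2, 1
```

Sending a few hundred records per transaction keeps each query payload small while avoiding the overhead of one transaction per row.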
[9]:
import time
[10]:
import py2neo
default_host = os.environ.get("STELLARGRAPH_NEO4J_HOST")
# Create the Neo4J Graph database object; the arguments can be edited to specify location and authentication
graph = py2neo.Graph(host=default_host, port=None, user=None, password=None)
Delete the existing nodes and relationships in the current database.
[11]:
empty_db_query = """
MATCH (n)
DETACH DELETE n
"""

tx = graph.begin(autocommit=True)
tx.evaluate(empty_db_query)
Load all nodes to the graph database.
[12]:
loading_node_query = """
UNWIND $node_list as node
CREATE (e:paper {
    ID: toInteger(node.id),
    subject: node.subject,
    features: node.features
})
"""

# For efficient loading, we load batches of nodes into Neo4J.
batch_len = 500

for batch_start in range(0, len(node_list), batch_len):
    batch_end = batch_start + batch_len
    # turn node dataframe into a list of records
    records = node_list.iloc[batch_start:batch_end].to_dict("records")
    tx = graph.begin(autocommit=True)
    tx.evaluate(loading_node_query, parameters={"node_list": records})
Load all edges to the graph database.
[13]:
loading_edge_query = """
UNWIND $edge_list as edge
MATCH (source:paper {ID: toInteger(edge.source)})
MATCH (target:paper {ID: toInteger(edge.target)})
MERGE (source)-[r:cites]->(target)
"""

batch_len = 500

for batch_start in range(0, len(edge_list), batch_len):
    batch_end = batch_start + batch_len
    # turn edge dataframe into a list of records
    records = edge_list.iloc[batch_start:batch_end].to_dict("records")
    tx = graph.begin(autocommit=True)
    tx.evaluate(loading_edge_query, parameters={"edge_list": records})
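Since the edge-loading query silently skips any edge whose endpoints fail to MATCH, it can be worth checking referential integrity in pandas before loading. A minimal sketch on toy frames shaped like `node_list` and `edge_list` (the values here are illustrative, not the real Cora ids):

```python
import pandas as pd

# toy stand-ins for the notebook's node_list and edge_list
node_list = pd.DataFrame({"id": [35, 1033, 103482], "subject": ["a", "b", "c"]})
edge_list = pd.DataFrame(
    {"target": [35, 35], "source": [1033, 103482], "label": "cites"}
)

# every citation endpoint should refer to a known paper id
known_ids = set(node_list["id"])
dangling = edge_list[
    ~edge_list["source"].isin(known_ids) | ~edge_list["target"].isin(known_ids)
]
# an empty `dangling` frame means every edge endpoint has a matching paper node
```

If `dangling` is non-empty, those rows would simply be dropped by the MATCH clauses, so the relationship count in Neo4J would be lower than `len(edge_list)`.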