Source

A Source can be implemented for any file or any local and/or remote store that can contain a graph. A Source is responsible for reading nodes and edges from the graph.

A Source must subclass the kgx.source.source.Source class and must implement the following methods:

  • parse

  • read_nodes

  • read_edges

parse method

  • Responsible for parsing a graph from a file/store

  • Must return a generator that iterates over the node and edge records from the graph

read_nodes method

  • Responsible for reading nodes from the file/store

  • Must return a generator that iterates over node records

  • Each node record must be a 2-tuple (node_id, node_data), where:

    • node_id is the node CURIE

    • node_data is a dictionary that represents the node properties

read_edges method

  • Responsible for reading edges from the file/store

  • Must return a generator that iterates over edge records

  • Each edge record must be a 4-tuple (subject_id, object_id, edge_key, edge_data), where:

    • subject_id is the subject node CURIE

    • object_id is the object node CURIE

    • edge_key is the unique key for the edge

    • edge_data is a dictionary that represents the edge properties
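
The record shapes described above can be sketched as follows. This is a minimal standalone illustration, not KGX's own code: InMemorySource is a hypothetical name, the constructor arguments and the argument-free parse signature are assumptions, the CURIEs are made up, and the class does not import kgx so that it runs on its own. A real source would subclass kgx.source.source.Source.

```python
from typing import Any, Dict, Generator, List, Tuple

NodeRecord = Tuple[str, Dict[str, Any]]
EdgeRecord = Tuple[str, str, str, Dict[str, Any]]


class InMemorySource:
    """Hypothetical source that reads a graph held in plain Python lists.

    A real implementation would subclass kgx.source.source.Source; this
    sketch stays standalone so that it can run without KGX installed.
    """

    def __init__(self, nodes: List[Dict[str, Any]], edges: List[Dict[str, Any]]):
        self.nodes = nodes
        self.edges = edges

    def parse(self) -> Generator[tuple, None, None]:
        """Yield every node record, then every edge record."""
        yield from self.read_nodes()
        yield from self.read_edges()

    def read_nodes(self) -> Generator[NodeRecord, None, None]:
        """Yield (node_id, node_data) 2-tuples."""
        for node in self.nodes:
            yield node["id"], node

    def read_edges(self) -> Generator[EdgeRecord, None, None]:
        """Yield (subject_id, object_id, edge_key, edge_data) 4-tuples."""
        for edge in self.edges:
            # Assumption: the edge key is built from subject, predicate and object.
            key = f"{edge['subject']}-{edge['predicate']}-{edge['object']}"
            yield edge["subject"], edge["object"], key, edge


# Made-up example data using a placeholder EX: prefix.
source = InMemorySource(
    nodes=[{"id": "EX:gene1", "name": "gene one"}, {"id": "EX:disease1", "name": "disease one"}],
    edges=[{"subject": "EX:gene1", "predicate": "biolink:related_to", "object": "EX:disease1"}],
)
records = list(source.parse())
```

Because every method returns a generator, a consumer can process one record at a time without the whole graph being materialized first.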

kgx.source.source

Base class for all Sources in KGX.

kgx.source.graph_source

GraphSource is responsible for reading from an instance of kgx.graph.base_graph.BaseGraph and must use only the methods exposed by BaseGraph to access the graph.

kgx.source.tsv_source

TsvSource is responsible for reading KGX-formatted CSV or TSV files using Pandas. Each flat file is treated as a Pandas DataFrame, and data are read from it in chunks.

KGX expects two separate files - one for nodes and another for edges.
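
The idea of reading a nodes file and an edges file in chunks can be illustrated with a short sketch. Note that this is not how TsvSource is written: it swaps in Python's standard csv module for Pandas so that it runs without extra dependencies, and read_in_chunks, the column names, and the file contents are all hypothetical.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def read_in_chunks(handle: TextIO, chunk_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most chunk_size rows from a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    while True:
        # Pull the next chunk_size rows; an empty list means the file is exhausted.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Made-up file contents held in memory; real KGX files carry more columns.
nodes_tsv = io.StringIO("id\tname\nEX:gene1\tgene one\nEX:gene2\tgene two\nEX:disease1\tdisease one\n")
edges_tsv = io.StringIO("subject\tpredicate\tobject\nEX:gene1\tbiolink:related_to\tEX:disease1\n")

node_chunks = list(read_in_chunks(nodes_tsv, chunk_size=2))
edge_chunks = list(read_in_chunks(edges_tsv, chunk_size=2))
```

Only one chunk is held in memory at a time, which is the property that makes chunked reading attractive for large flat files.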

kgx.source.json_source

JsonSource is responsible for reading data from a KGX-formatted JSON file using the ijson library, which allows data to be streamed from the file.

kgx.source.jsonl_source

JsonlSource is responsible for reading data from KGX-formatted JSON Lines files using the jsonlines library.

KGX expects two separate JSON Lines files - one for nodes and another for edges.

kgx.source.trapi_source

TrapiSource is responsible for reading data from Translator Reasoner API (TRAPI) formatted JSON.

kgx.source.obograph_source

ObographSource is responsible for reading data from OBOGraphs in JSON.

kgx.source.sssom_source

SssomSource is responsible for reading data from SSSOM formatted files.

kgx.source.neo_source

NeoSource is responsible for reading data from a local or remote Neo4j instance.

kgx.source.rdf_source

RdfSource is responsible for reading data from RDF N-Triples.

This source makes use of a custom kgx.parsers.ntriples_parser.CustomNTriplesParser for parsing N-Triples, which extends rdflib.plugins.parsers.ntriples.NTriplesParser.

To ensure proper parsing of N-Triples and a relatively low memory footprint, it is recommended that the N-Triples be sorted based on the subject IRIs.

sort -k 1,2 -t ' ' data.nt > data_sorted.nt
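
One plausible way to see why sorted input keeps memory low is sketched below. It is an illustration only and not the logic of CustomNTriplesParser: split_triple and group_by_subject are hypothetical helpers, the line splitting is deliberately naive, and the example IRIs are made up. When all triples for a subject are adjacent, each subject's properties can be collected and emitted before the next subject is read.

```python
from itertools import groupby
from typing import Dict, Iterable, Iterator, List, Tuple


def split_triple(line: str) -> Tuple[str, str, str]:
    """Naively split one N-Triples line into subject, predicate and object."""
    body = line.strip()
    # Drop the terminating "." that closes every N-Triples statement.
    if body.endswith("."):
        body = body[:-1].rstrip()
    subject, predicate, obj = body.split(" ", 2)
    return subject, predicate, obj


def group_by_subject(lines: Iterable[str]) -> Iterator[Tuple[str, Dict[str, List[str]]]]:
    """Yield (subject, properties) one subject at a time.

    Only correct when the input is sorted by subject, because groupby
    merges only adjacent items that share a key.
    """
    triples = (split_triple(line) for line in lines if line.strip())
    for subject, group in groupby(triples, key=lambda triple: triple[0]):
        properties: Dict[str, List[str]] = {}
        for _, predicate, obj in group:
            properties.setdefault(predicate, []).append(obj)
        yield subject, properties


# Made-up triples, already sorted by subject.
sorted_lines = [
    '<http://example.org/a> <http://example.org/name> "A" .',
    '<http://example.org/a> <http://example.org/related_to> <http://example.org/b> .',
    '<http://example.org/b> <http://example.org/name> "B" .',
]
grouped = list(group_by_subject(sorted_lines))
```

With unsorted input the same subject could reappear later in the file, so a reader would have to keep every partially built record in memory until the end.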

kgx.source.owl_source

OwlSource is responsible for parsing an OWL ontology.

When parsing an OWL ontology, this source also adds OwlStar annotations to certain OWL axioms.

kgx.source.sparql_source

SparqlSource has yet to be implemented.

In principle, SparqlSource should be able to read data from a local or remote SPARQL endpoint.