Summarize Graph

The Summarize Graph operation takes an instance of kgx.graph.base_graph.BaseGraph and generates summary statistics for the entire graph.

This operation generates the summary as YAML (or JSON), in a format that is compatible with the Knowledge Graph Hub dashboard.

The main entry point is the kgx.graph_operations.summarize_graph.generate_graph_stats method.

The tool detects and logs anomalies in the graph. By default these are reported to stderr, but they may be redirected to a file using the error_log parameter.

Note: To generate a summary statistics YAML that is consistent with Translator API (TRAPI) Release 1.1 standards, refer to Meta Knowledge Graph.

Streaming Data Processing Mode

For very large graphs, the Summarize Graph operation can also process graph data in streaming mode (command flag --stream=True), which significantly reduces the memory footprint required to process such graphs.
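To illustrate why streaming keeps the memory footprint small, here is a minimal sketch that tallies node categories from a generator, one record at a time. It does not use KGX itself: `stream_nodes`, `tally_categories`, and the record layout are hypothetical stand-ins chosen for illustration.

```python
from collections import Counter
from typing import Dict, Iterator, Tuple


def stream_nodes() -> Iterator[Tuple[str, Dict]]:
    """Hypothetical record source; a real stream would read from a file."""
    yield "HGNC:1", {"category": ["biolink:Gene"]}
    yield "HGNC:2", {"category": ["biolink:Gene"]}
    yield "MONDO:1", {"category": ["biolink:Disease"]}


def tally_categories(records: Iterator[Tuple[str, Dict]]) -> Counter:
    """Count categories while holding only one record in memory at a time."""
    counts: Counter = Counter()
    for _node_id, data in records:
        counts.update(data.get("category", []))
    return counts


category_counts = tally_categories(stream_nodes())
```

Because the generator yields records lazily, only the running counts grow with the data, not the set of records held in memory.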

kgx.graph_operations.summarize_graph

class kgx.graph_operations.summarize_graph.GraphSummary(name='', node_facet_properties: Optional[List] = None, edge_facet_properties: Optional[List] = None, progress_monitor: Optional[Callable[[kgx.utils.kgx_utils.GraphEntityType, List], None]] = None, error_log: str = None, **kwargs)[source]

Bases: object

Class for generating a “classical” knowledge graph summary.

The optional ‘progress_monitor’ should be a lightweight Callable which is injected into the class ‘inspector’ Callable, which in turn is designed to intercept node and edge records streaming through the GraphSummary (inside a Transformer.process() call). The first (GraphEntityType) argument of the Callable tags the record as a NODE or an EDGE. The second argument given to the Callable is the current record itself. The intent of this Callable is to provide a hook for KGX applications that want to passively monitor the graph data stream. As such, the Callable could simply tally up the number of times it is called with a NODE or an EDGE, then provide a suitable (quick!) report of that count back to the KGX application. The Callable (function or callable class) is strictly meant to be procedural: it should not modify the record and should be of low complexity, so as not to introduce a large computational overhead to graph summarization.
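A minimal sketch of such a tallying monitor is shown below. To keep it self-contained, it defines a local stand-in for GraphEntityType (assumed to have NODE and EDGE members); `RecordTally` and its `report` method are hypothetical names, not part of KGX.

```python
from enum import Enum
from typing import Dict, List


class GraphEntityType(Enum):
    """Local stand-in for kgx.utils.kgx_utils.GraphEntityType (assumed members)."""
    NODE = "node"
    EDGE = "edge"


class RecordTally:
    """Counts NODE and EDGE records without touching their contents."""

    def __init__(self) -> None:
        self.counts: Dict[GraphEntityType, int] = {
            GraphEntityType.NODE: 0,
            GraphEntityType.EDGE: 0,
        }

    def __call__(self, entity_type: GraphEntityType, rec: List) -> None:
        # Only the tag is inspected; the record itself is never modified.
        self.counts[entity_type] += 1

    def report(self) -> str:
        """Return a short, cheap summary of what has been seen so far."""
        return (
            f"{self.counts[GraphEntityType.NODE]} nodes, "
            f"{self.counts[GraphEntityType.EDGE]} edges seen"
        )


monitor = RecordTally()
monitor(GraphEntityType.NODE, ["HGNC:1", {"name": "example"}])
monitor(GraphEntityType.EDGE, ["HGNC:1", "MONDO:1", "e1", {}])
```

With KGX installed, an instance like this would presumably be passed as the progress_monitor argument of the GraphSummary constructor, in place of the stand-in enum used here.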

class Category(category_curie: str, summary)[source]

Bases: object

Internal class for compiling statistics about a distinct category.

__init__(category_curie: str, summary)[source]

GraphSummary.Category constructor.

category: str

Biolink Model category curie identifier.

analyse_node_category(summary, n, data)[source]

Analyse metadata of a given graph node record of this category.

Parameters
  • summary (GraphSummary) – GraphSummary within which the Category is being analysed.

  • n (str) – Curie identifier of the node record (not used here).

  • data (Dict) – Complete data dictionary of node record fields.

get_cid() → int[source]
Returns

Internal GraphSummary index id for tracking a Category.

Return type

int

get_count()[source]
Returns

Count of nodes which have this category.

Return type

int

get_count_by_id_prefixes()[source]
Returns

Count of nodes by id_prefixes for nodes which have this category.

Return type

int

get_id_prefixes() → Set[source]
Returns

Set of identifier prefix (strings) used by nodes of this Category.

Return type

Set[str]
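As a rough illustration of what counts by identifier prefix and the set of prefixes look like, the sketch below derives both from a handful of CURIEs. The helper names (`curie_prefix`, `prefix_counts`) and the rule of splitting on the first colon are assumptions for this example, not the KGX implementation.

```python
from collections import Counter
from typing import Iterable, Set


def curie_prefix(curie: str) -> str:
    """Return the text before the first colon of a CURIE."""
    return curie.split(":", 1)[0]


def prefix_counts(node_ids: Iterable[str]) -> Counter:
    """Count how many node identifiers use each prefix."""
    return Counter(curie_prefix(node_id) for node_id in node_ids)


node_ids = ["HGNC:1100", "HGNC:1101", "NCBIGene:672"]
counts = prefix_counts(node_ids)
prefixes: Set[str] = set(counts)
```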

get_name() → str[source]
Returns

Biolink CURIE name of the category.

Return type

str

json_object()[source]
Returns

JSON-friendly metadata for this category.

Return type

Dict[str, Any]

__call__(entity_type: kgx.utils.kgx_utils.GraphEntityType, rec: List)[source]

Transformer ‘inspector’ Callable, for analysing a stream of graph data.

Parameters
  • entity_type (GraphEntityType) – Indicates what kind of record is being passed to the function for analysis.

  • rec (Dict) – Complete data dictionary of the given record.
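The dispatch performed by an ‘inspector’ Callable of this kind might look roughly like the sketch below. It is a stripped-down, hypothetical class (`MiniSummary`), not the KGX code, and it assumes that node records arrive as [n, data] and edge records as [u, v, k, data], matching the analyse_node and analyse_edge signatures. GraphEntityType is again a local stand-in.

```python
from enum import Enum
from typing import Dict, List


class GraphEntityType(Enum):
    """Local stand-in for kgx.utils.kgx_utils.GraphEntityType (assumed members)."""
    NODE = "node"
    EDGE = "edge"


class MiniSummary:
    """Hypothetical, stripped-down inspector; not the KGX implementation."""

    def __init__(self) -> None:
        self.node_count = 0
        self.edge_count = 0

    def analyse_node(self, n: str, data: Dict) -> None:
        self.node_count += 1

    def analyse_edge(self, u: str, v: str, k: str, data: Dict) -> None:
        self.edge_count += 1

    def __call__(self, entity_type: GraphEntityType, rec: List) -> None:
        # Assumption: node records are [n, data]; edge records are [u, v, k, data].
        if entity_type == GraphEntityType.NODE:
            self.analyse_node(*rec)
        elif entity_type == GraphEntityType.EDGE:
            self.analyse_edge(*rec)
        else:
            raise ValueError(f"Unexpected entity type: {entity_type}")


inspector = MiniSummary()
inspector(GraphEntityType.NODE, ["HGNC:1", {"category": ["biolink:Gene"]}])
inspector(GraphEntityType.EDGE, ["HGNC:1", "MONDO:1", "e1", {"predicate": "biolink:related_to"}])
```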

__init__(name='', node_facet_properties: Optional[List] = None, edge_facet_properties: Optional[List] = None, progress_monitor: Optional[Callable[[kgx.utils.kgx_utils.GraphEntityType, List], None]] = None, error_log: str = None, **kwargs)[source]

GraphSummary constructor.

Parameters
  • name (str) – (Graph) name assigned to the summary.

  • node_facet_properties (Optional[List]) – A list of properties to facet on. For example, ['provided_by']

  • edge_facet_properties (Optional[List]) – A list of properties to facet on. For example, ['knowledge_source']

  • progress_monitor (Optional[Callable[[GraphEntityType, List], None]]) – Function given a peek at the current record being stream processed by the class wrapped Callable.

  • error_log (str) – Where to write any graph processing error message (stderr, by default)

add_node_stat(tag: str, value: Any)[source]

Compile/add a node statistic for a given tag = value annotation of the node.

Parameters
  • tag (str) – Tag label for the annotation.

  • value (Any) – Value of the specific tag annotation.

analyse_edge(u: str, v: str, k: str, data: Dict)[source]

Analyse metadata of one graph edge record.

Parameters
  • u (str) – Subject node curie identifier of the edge.

  • v (str) – Object node curie identifier of the edge.

  • k (str) – Key identifier of the edge record (not used here).

  • data (Dict) – Complete data dictionary of edge record fields.

analyse_node(n, data)[source]

Analyse metadata of one graph node record.

Parameters
  • n (str) – Curie identifier of the node record (not used here).

  • data (Dict) – Complete data dictionary of node record fields.

get_category(category_curie: str) → kgx.graph_operations.summarize_graph.GraphSummary.Category[source]

Returns the Category object associated with a given (Biolink) category curie.

Parameters

category_curie (str) – Curie identifier for the (Biolink) category.

Returns

GraphSummary.Category object for a given Biolink category.

Return type

Category

get_facet_counts(data: Dict, stats: Dict, x: str, y: str, facet_property: str) → Dict[source]

Facet on facet_property and record the count for stats[x][y][facet_property].

Parameters
  • data (dict) – Node/edge data dictionary

  • stats (dict) – The stats dictionary

  • x (str) – first key

  • y (str) – second key

  • facet_property (str) – The property to facet on

Returns

The stats dictionary

Return type

Dict
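The general idea of faceting into a nested stats dictionary can be sketched as follows. This is an independent illustration rather than the KGX implementation: the function name `facet_counts`, the "unknown" fallback, and the exact layout below stats[x][y][facet_property] are all assumptions.

```python
from typing import Any, Dict


def facet_counts(
    data: Dict[str, Any], stats: Dict, x: str, y: str, facet_property: str
) -> Dict:
    """Increment stats[x][y][facet_property][value] for each value found in data."""
    # Assumption: records lacking the property are tallied under "unknown".
    values = data.get(facet_property, "unknown")
    if isinstance(values, str):
        values = [values]
    bucket = stats.setdefault(x, {}).setdefault(y, {}).setdefault(facet_property, {})
    for value in values:
        bucket[value] = bucket.get(value, 0) + 1
    return stats


stats: Dict = {}
facet_counts(
    {"provided_by": ["source_a", "source_b"]},
    stats, "count_by_category", "biolink:Gene", "provided_by",
)
facet_counts(
    {"provided_by": "source_a"},
    stats, "count_by_category", "biolink:Gene", "provided_by",
)
```

Both a single string and a list of strings are accepted here, so that single-valued and multi-valued properties are counted in the same way.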

get_graph_summary(name: str = None, **kwargs) → Dict[source]

Similar to summarize_graph except that the node and edge statistics are already captured in the GraphSummary class instance (perhaps by Transformer.process() stream inspection) and therefore, the data structure simply needs to be ‘finalized’ for saving or similar use.

Parameters
  • name (Optional[str]) – Name for the graph (if being renamed)

  • kwargs (Dict) – Any additional arguments (ignored in this method at present)

Returns

A knowledge map dictionary corresponding to the graph

Return type

Dict

get_name()[source]
Returns

Currently assigned knowledge graph name.

Return type

str

get_node_stats() → Dict[str, Any][source]
Returns

Statistics for the nodes in the graph.

Return type

Dict[str, Any]

save(file, name: str = None, file_format: str = 'yaml')[source]

Save the current GraphSummary to a specified (open) file (device).

Parameters
  • file (File) – Text file handler open for writing.

  • name (str) – Optional string to which to (re-)name the graph.

  • file_format (str) – Text output format (‘json’ or ‘yaml’) for the saved graph summary (default: ‘yaml’)

Returns

Return type

None

summarize_graph(graph: kgx.graph.base_graph.BaseGraph) → Dict[source]

Summarize the entire graph.

Parameters

graph (kgx.graph.base_graph.BaseGraph) – The graph

Returns

The stats dictionary

Return type

Dict

summarize_graph_edges(graph: kgx.graph.base_graph.BaseGraph) → Dict[source]

Summarize the edges in a graph.

Parameters

graph (kgx.graph.base_graph.BaseGraph) – The graph

Returns

The edge stats

Return type

Dict

summarize_graph_nodes(graph: kgx.graph.base_graph.BaseGraph) → Dict[source]

Summarize the nodes in a graph.

Parameters

graph (kgx.graph.base_graph.BaseGraph) – The graph

Returns

The node stats

Return type

Dict

kgx.graph_operations.summarize_graph.generate_graph_stats(graph: kgx.graph.base_graph.BaseGraph, graph_name: str, filename: str, node_facet_properties: Optional[List] = None, edge_facet_properties: Optional[List] = None) → None[source]

Generate stats from Graph.

Parameters
  • graph (kgx.graph.base_graph.BaseGraph) – The graph

  • graph_name (str) – Name for the graph

  • filename (str) – Filename to write the stats to

  • node_facet_properties (Optional[List]) – A list of properties to facet on. For example, ['provided_by']

  • edge_facet_properties (Optional[List]) – A list of properties to facet on. For example, ['knowledge_source']

kgx.graph_operations.summarize_graph.gs_default(o)[source]

JSONEncoder ‘default’ function override to properly serialize ‘Set’ objects (into ‘List’)
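The technique of overriding the JSON encoder's ‘default’ hook can be sketched with the standard library alone. The function below (`set_to_list_default`) is a hypothetical equivalent written for illustration, not the body of gs_default; sorting the set is an added choice to make the output deterministic.

```python
import json
from typing import Any


def set_to_list_default(o: Any) -> Any:
    """Serialize sets as lists; reject anything else, as json does by default."""
    if isinstance(o, (set, frozenset)):
        # Sorted for deterministic output (assumes mutually comparable elements).
        return sorted(o)
    raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")


summary = {"id_prefixes": {"NCBIGene", "HGNC"}, "count": 2}
encoded = json.dumps(summary, default=set_to_list_default)
```

The hook is only called for objects that json cannot serialize natively, so ordinary strings, numbers, lists and dictionaries pass through unchanged.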

kgx.graph_operations.summarize_graph.summarize_graph(graph: kgx.graph.base_graph.BaseGraph, name: str = None, node_facet_properties: Optional[List] = None, edge_facet_properties: Optional[List] = None) → Dict[source]

Summarize the entire graph.

Parameters
  • graph (kgx.graph.base_graph.BaseGraph) – The graph

  • name (str) – Name for the graph

  • node_facet_properties (Optional[List]) – A list of properties to facet on. For example, ['provided_by']

  • edge_facet_properties (Optional[List]) – A list of properties to facet on. For example, ['knowledge_source']

Returns

The stats dictionary

Return type

Dict