Meta Knowledge Graph¶
The Meta Knowledge Graph operation takes an instance of kgx.graph.base_graph.BaseGraph and generates a Translator API (TRAPI) Release 1.1 compatible knowledge map for the entire graph.
The operation writes the graph summary as JSON (or YAML) in a format compatible with the content metadata standards of the Knowledge Graph Exchange (KGE) Archive.
The main entry point is the kgx.graph_operations.meta_knowledge_graph.generate_meta_knowledge_graph function.
The tool detects and logs anomalies in the graph. Reporting defaults to stderr, but may be redirected to a file using the error_log parameter.
Note: To generate a summary statistics YAML that is compatible with the Knowledge Graph Hub dashboard, refer to the Summarize Graph operation.
Streaming Data Processing Mode¶
For very large graphs, the Meta Knowledge Graph operation can process graph data as a stream (command flag --stream=True), which significantly reduces the memory footprint required to process such graphs.
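As a rough illustration of why streaming keeps memory use low, the following sketch tallies node categories from records that are produced one at a time. It is not the KGX implementation: `stream_nodes`, `tally_categories`, the tab-separated input layout, and the sample identifiers are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def stream_nodes(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one node record at a time from 'id<TAB>category' lines (hypothetical layout)."""
    for line in lines:
        node_id, category = line.rstrip("\n").split("\t")
        yield {"id": node_id, "category": [category]}


def tally_categories(records: Iterable[Dict]) -> Counter:
    """Count categories while keeping only the running totals in memory."""
    counts: Counter = Counter()
    for record in records:
        counts.update(record["category"])
    return counts


# Illustrative input; in practice this could be an open file handle read lazily.
lines = [
    "HGNC:11603\tbiolink:Gene\n",
    "MONDO:0005002\tbiolink:Disease\n",
    "HGNC:1100\tbiolink:Gene\n",
]
category_counts = tally_categories(stream_nodes(lines))
print(dict(category_counts))
```

Because the generator hands over one record at a time, memory use depends on the number of distinct categories rather than on the number of records.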
Provenance Statistics¶
The Meta Knowledge Graph operation can count nodes and edges by Biolink 2.0 biolink:knowledge_source provenance (and related is_a descendant slot terms). The node_facet_properties and edge_facet_properties CLI (and code method) arguments must be set explicitly to specify which provenance slot names are counted in a given graph. By default, provided_by slots are used for nodes and knowledge_source slots for edges.
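The idea of faceting counts on provenance slots can be sketched in a few lines of plain Python. This is only an approximation of the concept, not the KGX code: `count_by_facet` is a hypothetical helper, the records are made-up sample data, and the assumption that a slot value may be either a string or a list is ours.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def count_by_facet(records: Iterable[Dict], facet_properties: List[str]) -> Dict[str, Counter]:
    """Tally records per source value for each requested facet property."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for record in records:
        for facet in facet_properties:
            value = record.get(facet)
            if value is None:
                continue  # this record does not carry the facet
            # Accept either a single source string or a list of sources (assumption).
            sources = [value] if isinstance(value, str) else value
            counts[facet].update(sources)
    return dict(counts)


# Made-up sample records for illustration only.
nodes = [
    {"id": "HGNC:11603", "provided_by": ["hgnc"]},
    {"id": "HGNC:1100", "provided_by": ["hgnc", "ensembl"]},
    {"id": "MONDO:0005002"},
]
edges = [
    {"subject": "HGNC:11603", "object": "MONDO:0005002", "knowledge_source": "infores:example"},
]

node_counts = count_by_facet(nodes, ["provided_by"])
edge_counts = count_by_facet(edges, ["knowledge_source"])
```

The two facet lists play the role of the node_facet_properties and edge_facet_properties arguments described above.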
kgx.graph_operations.meta_knowledge_graph¶
-
class
kgx.graph_operations.meta_knowledge_graph.
MetaKnowledgeGraph
(name='', node_facet_properties: Optional[List] = None, edge_facet_properties: Optional[List] = None, progress_monitor: Optional[Callable[[kgx.utils.kgx_utils.GraphEntityType, List], None]] = None, error_log=None, **kwargs)[source]¶ Bases:
object
Class for generating a TRAPI 1.1 style of “meta knowledge graph” summary.
The optional ‘progress_monitor’ for the validator should be a lightweight Callable which is injected into the class ‘inspector’ Callable, designed to intercept node and edge records streaming through the Validator (inside a Transformer.process() call). The first (GraphEntityType) argument of the Callable tags the record as a NODE or an EDGE. The second argument is the current record itself. The Callable is strictly meant to be procedural: it is a hook for KGX applications that want to passively monitor the graph data stream. For example, it could simply tally the number of times it is called with a NODE or an EDGE, then give a suitable (quick!) report of that count back to the KGX application. The Callable (function or callable class) should not modify the record and should be of low complexity, so as not to add a large computational overhead to validation.
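A progress monitor of the kind described above might look like the following sketch. To keep the example runnable on its own, `GraphEntityType` here is a local stand-in for kgx.utils.kgx_utils.GraphEntityType (only the NODE and EDGE members mentioned above are assumed), `RecordTally` is a hypothetical name, and the layout of the sample records is an assumption.

```python
from collections import Counter
from enum import Enum
from typing import Any


class GraphEntityType(Enum):
    """Local stand-in for kgx.utils.kgx_utils.GraphEntityType (assumed members)."""
    NODE = "node"
    EDGE = "edge"


class RecordTally:
    """Lightweight callable that counts records by entity type without touching them."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def __call__(self, entity_type: GraphEntityType, rec: Any) -> None:
        # Only increment a counter: the record is never read or mutated.
        self.counts[entity_type] += 1

    def report(self) -> str:
        nodes = self.counts[GraphEntityType.NODE]
        edges = self.counts[GraphEntityType.EDGE]
        return f"{nodes} nodes, {edges} edges"


monitor = RecordTally()
# Sample records; their exact layout is an assumption for this sketch.
monitor(GraphEntityType.NODE, ["HGNC:11603", {"category": ["biolink:Gene"]}])
monitor(GraphEntityType.NODE, ["MONDO:0005002", {"category": ["biolink:Disease"]}])
monitor(GraphEntityType.EDGE, ["HGNC:11603", "MONDO:0005002", None, {"predicate": "biolink:related_to"}])
print(monitor.report())  # prints "2 nodes, 1 edges"
```

In real use, an instance of such a class would be passed as the progress_monitor argument of the MetaKnowledgeGraph constructor, using the library's own GraphEntityType.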
-
class
Category
(category_curie: str, mkg)[source]¶ Bases:
object
Internal class for compiling statistics about a distinct category.
-
__init__
(category_curie: str, mkg)[source]¶ MetaKnowledgeGraph.Category constructor.
- category_curie: str
Biolink Model category CURIE identifier.
-
analyse_node_category
(n, data) → None[source]¶ Analyse metadata of a given graph node record of this category.
- Parameters
n (str) – Curie identifier of the node record (not used here).
data (Dict) – Complete data dictionary of node record fields.
-
classmethod
get_category_curie_from_index
(cid: int) → str[source]¶ - Parameters
cid (int) – Internal MetaKnowledgeGraph index id for tracking a Category.
- Returns
Curie identifier of the Category.
- Return type
str
-
get_cid
()[source]¶ - Returns
Internal MetaKnowledgeGraph index id for tracking a Category.
- Return type
int
-
get_count_by_source
(facet: str = 'provided_by', source: str = None) → Dict[str, Any][source]¶ - Parameters
facet (str) – Facet tag (default, ‘provided_by’) from which the count should be returned
source (str) – Source name about which the count is desired.
- Returns
Count of nodes, by node ‘provided_by’ knowledge source, for a given category. Returns dictionary of all source counts, if input ‘source’ argument is not specified.
- Return type
Dict
-
-
__call__
(entity_type: kgx.utils.kgx_utils.GraphEntityType, rec: List)[source]¶ Transformer ‘inspector’ Callable, for analysing a stream of graph data.
- Parameters
entity_type (GraphEntityType) – Indicates what kind of record is being passed to the function for analysis.
rec (Dict) – Complete data dictionary of the given record.
-
__init__
(name='', node_facet_properties: Optional[List] = None, edge_facet_properties: Optional[List] = None, progress_monitor: Optional[Callable[[kgx.utils.kgx_utils.GraphEntityType, List], None]] = None, error_log=None, **kwargs)[source]¶ MetaKnowledgeGraph constructor.
- Parameters
name (str) – (Graph) name assigned to the summary.
node_facet_properties (Optional[List]) – A list of node properties (e.g. knowledge_source tags) to facet on. For example,
['provided_by']
edge_facet_properties (Optional[List]) – A list of edge properties (e.g. knowledge_source tags) to facet on. For example,
['original_knowledge_source', 'aggregator_knowledge_source']
progress_monitor (Optional[Callable[[GraphEntityType, List], None]]) – Function given a peek at the current record being stream processed by the class wrapped Callable.
error_log – Where to write any graph processing error message (stderr, by default).
-
analyse_edge
(u, v, k, data) → None[source]¶ Analyse metadata of one graph edge record.
- Parameters
u (str) – Subject node curie identifier of the edge.
v (str) – Object node curie identifier of the edge.
k (str) – Key identifier of the edge record (not used here).
data (Dict) – Complete data dictionary of edge record fields.
-
analyse_node
(n: str, data: Dict) → None[source]¶ Analyse metadata of one graph node record.
- Parameters
n (str) – Curie identifier of the node record (not used here).
data (Dict) – Complete data dictionary of node record fields.
-
get_category
(category_curie: str) → kgx.graph_operations.meta_knowledge_graph.MetaKnowledgeGraph.Category[source]¶ Gets the MetaKnowledgeGraph.Category object that tracks statistics for a given (Biolink) category.
- Parameters
category_curie (str) – Curie identifier for the (Biolink) category.
- Returns
MetaKnowledgeGraph.Category object for a given Biolink category.
- Return type
MetaKnowledgeGraph.Category
-
get_edge_count_by_predicate
(predicate_curie: str) → int[source]¶ Counts the number of edges in the graph with the specified predicate.
- Parameters
predicate_curie (str) – (Biolink) curie identifier for the predicate.
- Returns
Number of edges for the given predicate.
- Return type
int
- Raises
RuntimeError – Error if predicate identifier is empty string or None.
-
get_edge_count_by_source
(subject_category: str, predicate: str, object_category: str, facet: str = 'knowledge_source', source: Optional[str] = None) → Dict[str, Any][source]¶ Returns count by source for one S-P-O triple (S, O being Biolink categories; P, a Biolink predicate)
-
get_edge_mapping_count
() → int[source]¶ Counts the number of distinct edge Subject (category) - P (predicate) -> Object (category) mappings in the knowledge graph.
- Returns
Count of subject(category) - predicate -> object(category) mappings in the graph.
- Return type
int
-
get_edge_stats
() → List[Dict[str, Any]][source]¶ - Returns
Knowledge map for the list of edges in the graph.
- Return type
List[Dict[str, Any]]
-
get_graph_summary
(name: str = None, **kwargs) → Dict[source]¶ Similar to summarize_graph except that the node and edge statistics are already captured in the MetaKnowledgeGraph class instance (perhaps by Transformer.process() stream inspection) and therefore, the data structure simply needs to be ‘finalized’ for saving or similar use.
- Parameters
name (Optional[str]) – Name for the graph (if being renamed)
kwargs (Dict) – Any additional arguments (ignored in this method at present)
- Returns
A TRAPI 1.1 compliant meta knowledge graph of the knowledge graph returned as a dictionary.
- Return type
Dict
-
get_node_count_by_category
(category_curie: str) → int[source]¶ Counts the number of nodes in the graph with the specified (Biolink) category curie.
- Parameters
category_curie (str) – Curie identifier for the (Biolink) category.
- Returns
Number of nodes for the given category.
- Return type
int
- Raises
RuntimeError – Error if category identifier is empty string or None.
-
get_node_stats
() → Dict[str, kgx.graph_operations.meta_knowledge_graph.MetaKnowledgeGraph.Category][source]¶ - Returns
Statistics for the nodes in the graph.
- Return type
Dict[str, Category]
-
get_number_of_categories
() → int[source]¶ Counts the number of distinct (Biolink) categories encountered in the knowledge graph (not including those of ‘unknown’ category)
- Returns
Number of distinct (Biolink) categories found in the graph (excluding nodes with ‘unknown’ category)
- Return type
int
-
get_predicate_count
() → int[source]¶ Counts the number of distinct edge predicates in the knowledge graph.
- Returns
Number of distinct (Biolink) predicates in the graph.
- Return type
int
-
get_total_edge_counts_across_mappings
() → int[source]¶ Aggregate count of the edges in the graph for every mapping. Edges with subject and object nodes with multiple assigned categories will have their count replicated under each distinct mapping of its categories.
- Returns
Number of the edges counted across all mappings.
- Return type
int
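The difference between the total edge count and the count across mappings can be shown with a small sketch. This is our own illustration of the replication rule described above, not the library's code; the node identifiers and categories are made up.

```python
from collections import Counter
from itertools import product

# Made-up graph: node "A" carries two categories, node "B" carries one.
node_categories = {
    "A": ["biolink:Gene", "biolink:NamedThing"],
    "B": ["biolink:Disease"],
}
edges = [("A", "biolink:related_to", "B")]

mapping_counts: Counter = Counter()
for subject, predicate, obj in edges:
    # One edge is counted once per (subject category, object category) pair.
    for s_cat, o_cat in product(node_categories[subject], node_categories[obj]):
        mapping_counts[(s_cat, predicate, o_cat)] += 1

total_edges = len(edges)
distinct_mappings = len(mapping_counts)
total_across_mappings = sum(mapping_counts.values())
print(total_edges, distinct_mappings, total_across_mappings)
```

Here a single edge contributes to two mappings, so the aggregate across mappings exceeds the number of edges in the graph.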
-
get_total_edges_count
() → int[source]¶ Gets the total number of ‘valid’ edges in the data set (ignoring those with ‘unknown’ subject or predicate category mappings)
- Returns
Total count of edges in the graph.
- Return type
int
-
get_total_node_counts_across_categories
() → int[source]¶ The aggregate count of all node to category mappings for every category. Note that nodes with multiple categories will have their count replicated under each of its categories.
- Returns
Total count of node to category mappings for the graph.
- Return type
int
-
get_total_nodes_count
() → int[source]¶ Counts the total number of distinct nodes in the knowledge graph (not including those ignored due to being of ‘unknown’ category)
- Returns
Number of distinct nodes in the knowledge graph.
- Return type
int
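The two node totals above differ whenever a node has more than one category. The following sketch, written under our own assumptions rather than taken from KGX, contrasts the distinct node count with the aggregate count across categories; the sample nodes are invented, and skipping nodes without a category is how this sketch models the 'unknown' case.

```python
from collections import Counter

# Made-up nodes mapped to their category lists.
nodes = {
    "HGNC:11603": ["biolink:Gene", "biolink:NamedThing"],
    "MONDO:0005002": ["biolink:Disease"],
    "EX:1": [],  # no category: treated as 'unknown' and skipped in this sketch
}

category_counts: Counter = Counter()
counted_nodes = set()
for node_id, categories in nodes.items():
    if not categories:
        continue
    counted_nodes.add(node_id)
    # A node with several categories is counted once under each of them.
    category_counts.update(categories)

total_nodes = len(counted_nodes)
number_of_categories = len(category_counts)
total_across_categories = sum(category_counts.values())
print(total_nodes, number_of_categories, total_across_categories)
```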
-
save
(file, name: str = None, file_format: str = 'json') → None[source]¶ Save the current MetaKnowledgeGraph to a specified (open) file (device).
- Parameters
file (File) – Text file handler open for writing.
name (str) – Optional string to which to (re-)name the graph.
file_format (str) – Text output format (‘json’ or ‘yaml’) for the saved meta knowledge graph (default: ‘json’)
- Returns
- Return type
None
-
summarize_graph
(graph: kgx.graph.base_graph.BaseGraph, name: str = None, **kwargs) → Dict[source]¶ Generate a meta knowledge graph that describes the composition of the graph.
- Parameters
graph (kgx.graph.base_graph.BaseGraph) – The graph
name (Optional[str]) – Name for the graph
kwargs (Dict) – Any additional arguments (ignored in this method at present)
- Returns
A TRAPI 1.1 compliant meta knowledge graph of the knowledge graph returned as a dictionary.
- Return type
Dict
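For orientation, a TRAPI 1.1 meta knowledge graph has a `nodes` mapping from category to its identifier prefixes and an `edges` list of subject/predicate/object category triples. The sketch below builds a dictionary of roughly that shape from toy data. It is not the output of summarize_graph: `sketch_meta_knowledge_graph` is a hypothetical helper, the `count` field is an illustrative extra, and the sample identifiers are made up.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def sketch_meta_knowledge_graph(
    nodes: Dict[str, List[str]], edges: List[Tuple[str, str, str]]
) -> Dict:
    """Build a TRAPI-1.1-like summary from node categories and (s, p, o) edges."""
    id_prefixes = defaultdict(set)
    for node_id, categories in nodes.items():
        prefix = node_id.split(":", 1)[0]  # CURIE prefix, e.g. "HGNC"
        for category in categories:
            id_prefixes[category].add(prefix)

    triples: Counter = Counter()
    for subject, predicate, obj in edges:
        for s_cat in nodes[subject]:
            for o_cat in nodes[obj]:
                triples[(s_cat, predicate, o_cat)] += 1

    return {
        "nodes": {cat: {"id_prefixes": sorted(p)} for cat, p in id_prefixes.items()},
        "edges": [
            {"subject": s, "predicate": p, "object": o, "count": n}
            for (s, p, o), n in triples.items()
        ],
    }


sample_nodes = {
    "HGNC:11603": ["biolink:Gene"],
    "HGNC:1100": ["biolink:Gene"],
    "MONDO:0005002": ["biolink:Disease"],
}
sample_edges = [
    ("HGNC:11603", "biolink:related_to", "MONDO:0005002"),
    ("HGNC:1100", "biolink:related_to", "MONDO:0005002"),
]
mkg = sketch_meta_knowledge_graph(sample_nodes, sample_edges)
```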
-
summarize_graph_edges
(graph: kgx.graph.base_graph.BaseGraph) → List[Dict][source]¶ Summarize the edges in a graph.
- Parameters
graph (kgx.graph.base_graph.BaseGraph) – The graph
- Returns
The edge stats
- Return type
List[Dict]
-
summarize_graph_nodes
(graph: kgx.graph.base_graph.BaseGraph) → Dict[source]¶ Summarize the nodes in a graph.
- Parameters
graph (kgx.graph.base_graph.BaseGraph) – The graph
- Returns
The node stats
- Return type
Dict
-
kgx.graph_operations.meta_knowledge_graph.
generate_meta_knowledge_graph
(graph: kgx.graph.base_graph.BaseGraph, name: str, filename: str) → None[source]¶ Generate a knowledge map that describes the composition of the graph and write it to filename.
- Parameters
graph (kgx.graph.base_graph.BaseGraph) – The graph
name (Optional[str]) – Name for the graph
filename (str) – The file to write the knowledge map to
-
kgx.graph_operations.meta_knowledge_graph.
mkg_default
(o)[source]¶ JSONEncoder ‘default’ function override to properly serialize ‘Set’ objects (into ‘List’)
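The general technique behind such an override is the `default` hook of the standard library `json` module, which is called for any object the encoder cannot serialize on its own. The following is a minimal sketch of that technique, not the body of mkg_default: `set_default` is a hypothetical name, and sorting the set is our own choice to make the output deterministic.

```python
import json


def set_default(o):
    """Convert sets to lists so that json.dumps can serialize them."""
    if isinstance(o, (set, frozenset)):
        return sorted(o)  # sorted for stable output (a choice made for this sketch)
    # Follow the json module convention for unsupported types.
    raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")


summary = {"id_prefixes": {"HGNC", "ENSEMBL"}}
encoded = json.dumps(summary, default=set_default)
print(encoded)  # prints {"id_prefixes": ["ENSEMBL", "HGNC"]}
```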
-
kgx.graph_operations.meta_knowledge_graph.
summarize_graph
(graph: kgx.graph.base_graph.BaseGraph, name: str = None, **kwargs) → Dict[source]¶ Generate a meta knowledge graph that describes the composition of the graph.
- Parameters
graph (kgx.graph.base_graph.BaseGraph) – The graph
name (Optional[str]) – Name for the graph
kwargs (Dict) – Any additional arguments
- Returns
A TRAPI 1.1 compliant meta knowledge graph of the knowledge graph returned as a dictionary.
- Return type
Dict