INDRA CoGEx 1.0.0 Documentation
License and funding
INDRA CoGEx is made available under the 2-clause BSD license. The development of this project is funded under the DARPA Young Faculty Award (ARO grant W911NF2010255).
INDRA CoGEx modules reference
INDRA CoGEx Apps
INDRA CoGEx Gene List Analysis (indra_cogex.apps.gla)
INDRA CoGEx Knowledge Assembly (indra_cogex.assembly)
Assembly of Node objects.
- class NodeAssembler(nodes=None)[source]
Assembles Node objects.
Initialize a new NodeAssembler object.
- assemble_nodes()[source]
Assemble the nodes in the assembler.
Nodes with the same grounding are assembled into a single node that contains all the labels and data from all the nodes.
- Returns:
A list of Node objects.
- Return type:
nodes
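The merging rule described above can be sketched as follows. This is a minimal illustration, not the library's implementation: SimpleNode and assemble_simple_nodes are hypothetical stand-ins, the labels are made up for the example, and conflicting data keys are resolved by letting the later node win, which is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set, Tuple


@dataclass
class SimpleNode:
    """Hypothetical stand-in for a CoGEx Node (not the library's class)."""

    db_ns: str
    db_id: str
    labels: Set[str] = field(default_factory=set)
    data: Dict[str, Any] = field(default_factory=dict)


def assemble_simple_nodes(nodes: List[SimpleNode]) -> List[SimpleNode]:
    """Merge nodes that share a (namespace, identifier) grounding."""
    merged: Dict[Tuple[str, str], SimpleNode] = {}
    for node in nodes:
        key = (node.db_ns, node.db_id)
        if key not in merged:
            merged[key] = SimpleNode(node.db_ns, node.db_id)
        # Union the labels and combine the data of every duplicate.
        # Assumption: on a key conflict, the later node's value wins.
        merged[key].labels |= node.labels
        merged[key].data.update(node.data)
    return list(merged.values())


nodes = [
    SimpleNode("HGNC", "6871", {"BioEntity"}, {"name": "MAPK1"}),
    SimpleNode("HGNC", "6871", {"Gene"}, {"source": "example"}),
    SimpleNode("GO", "GO:0006915", {"BioEntity"}, {"name": "apoptotic process"}),
]
assembled = assemble_simple_nodes(nodes)
```

Here the two HGNC:6871 entries collapse into one node carrying both labels and both data keys, while the GO term is left untouched.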
INDRA CoGEx Client
The INDRA CoGEx Client.
Enrichment Analysis (indra_cogex.client.enrichment)
A module for performing enrichment analysis with the INDRA CoGEx service.
Continuous Gene Enrichment Analysis (indra_cogex.client.enrichment.continuous)
A collection of analyses possible on gene lists (of HGNC identifiers) with scores.
For example, this could be applied to the log2 fold-change scores from differential gene expression experiments.
Warning
This module requires the optional dependency gseapy. Install with pip install gseapy.
- get_human_scores(path, read_csv_kwargs=None, gene_symbol_column_name=None, score_column_name=None)[source]
Load a differential gene expression file with human measurements.
- Parameters:
path (Union[Path, str, DataFrame]) – Path to the file to read with pandas.read_csv().
read_csv_kwargs (Optional[Dict[str, Any]]) – Keyword arguments to pass to pandas.read_csv().
gene_symbol_column_name (Optional[str]) – The name of the column with gene symbols. If None, will try to guess.
score_column_name (Optional[str]) – The name of the column with scores. If None, will try to guess.
- Return type:
- Returns:
A dictionary of human gene HGNC IDs to scores.
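The shape of the result (HGNC identifier to score) can be illustrated with a standard-library sketch. It does not call the real function, which reads the file with pandas.read_csv(). The SYMBOL_TO_HGNC lookup, the load_scores helper, and the column names are hypothetical and exist only for this example.

```python
import csv
import io
from typing import Dict, TextIO

# Hypothetical symbol-to-HGNC lookup used only for this illustration.
SYMBOL_TO_HGNC = {"MAPK1": "6871", "TP53": "11998"}


def load_scores(handle: TextIO, symbol_column: str, score_column: str) -> Dict[str, float]:
    """Read a delimited file and return a mapping of HGNC ID to score."""
    scores: Dict[str, float] = {}
    for row in csv.DictReader(handle):
        hgnc_id = SYMBOL_TO_HGNC.get(row[symbol_column])
        if hgnc_id is None:
            continue  # skip symbols that cannot be mapped to HGNC
        scores[hgnc_id] = float(row[score_column])
    return scores


example = io.StringIO("gene,log2FoldChange\nMAPK1,1.5\nTP53,-2.0\nNOTAGENE,0.3\n")
scores = load_scores(example, "gene", "log2FoldChange")
```

The resulting dictionary has the same form as the scores argument accepted by the GSEA functions in this module.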
- get_mouse_scores(path, read_csv_kwargs=None, gene_symbol_column_name=None, score_column_name=None)[source]
Load a differential gene expression file with mouse measurements.
This function extracts the MGI gene symbols, maps them to MGI identifiers, uses PyOBO to map orthologs to HGNC, then returns the HGNC gene and scores as a dictionary.
- Parameters:
path (Union[Path, str, DataFrame]) – Path to the file to read with pandas.read_csv().
read_csv_kwargs (Optional[Dict[str, Any]]) – Keyword arguments to pass to pandas.read_csv().
gene_symbol_column_name (Optional[str]) – The name of the column with gene symbols. If None, will try to guess.
score_column_name (Optional[str]) – The name of the column with scores. If None, will try to guess.
- Return type:
- Returns:
A dictionary of mapped orthologous human gene HGNC IDs to scores.
- get_rat_scores(path, read_csv_kwargs=None, gene_symbol_column_name=None, score_column_name=None)[source]
Load a differential gene expression file with rat measurements.
This function extracts the RGD gene symbols, maps them to RGD identifiers, uses PyOBO to map orthologs to HGNC, then returns the HGNC gene and scores as a dictionary.
- Parameters:
path (Union[Path, str, DataFrame]) – Path to the file to read with pandas.read_csv().
read_csv_kwargs (Optional[Dict[str, Any]]) – Keyword arguments to pass to pandas.read_csv().
gene_symbol_column_name (Optional[str]) – The name of the column with gene symbols. If None, will try to guess.
score_column_name (Optional[str]) – The name of the column with scores. If None, will try to guess.
- Return type:
- Returns:
A dictionary of mapped orthologous human gene HGNC IDs to scores.
- go_gsea(scores, directory=None, *, client, **kwargs)[source]
Run GSEA with gene sets for each Gene Ontology term.
- Parameters:
client (Neo4jClient) – The Neo4j client.
scores (Dict[str, float]) – A mapping from HGNC gene identifiers to floating point scores (e.g., from a differential gene expression analysis).
directory (Union[None, Path, str]) – Specify the directory if the results should be saved, including both a dataframe and plots for each gene set.
kwargs – Remaining keyword arguments to pass through to gseapy.prerank().
- Return type:
DataFrame
- Returns:
A pandas dataframe with the GSEA results
- gsea(scores, gene_sets, directory=None, alpha=None, keep_insignificant=True, **kwargs)[source]
Run GSEA on pre-ranked data.
- Parameters:
scores (Dict[str, float]) – A mapping from HGNC gene identifiers to floating point scores (e.g., from a differential gene expression analysis).
gene_sets (Dict[Tuple[str, str], Set[str]]) – A mapping from
directory (Union[None, Path, str]) – Specify the directory if the results should be saved, including both a dataframe and plots for each gene set.
alpha (Optional[float]) – The cutoff for significance. Defaults to 0.05.
keep_insignificant (bool) – If False, removes insignificant results (those with a p-value above alpha).
kwargs – Remaining keyword arguments to pass through to gseapy.prerank().
- Return type:
DataFrame
- Returns:
A pandas dataframe with the GSEA results
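How alpha and keep_insignificant interact can be sketched in plain Python. This is an assumption-laden illustration: filter_by_significance is a hypothetical helper, the rows are plain dictionaries rather than a DataFrame, and the column the real function filters on is not stated in this reference, so a column named p is assumed.

```python
from typing import Dict, List, Optional

Row = Dict[str, float]


def filter_by_significance(
    rows: List[Row],
    alpha: Optional[float] = None,
    keep_insignificant: bool = True,
) -> List[Row]:
    """Optionally drop rows whose p-value is not below alpha."""
    if alpha is None:
        alpha = 0.05  # the documented default cutoff
    if keep_insignificant:
        return list(rows)
    # Assumption: significance is judged on a column named "p".
    return [row for row in rows if row["p"] < alpha]


rows = [{"p": 0.001}, {"p": 0.04}, {"p": 0.2}]
kept = filter_by_significance(rows, keep_insignificant=False)
strict = filter_by_significance(rows, alpha=0.01, keep_insignificant=False)
```

With the default keep_insignificant=True nothing is removed, so alpha only matters once filtering is switched on.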
- indra_downstream_gsea(scores, directory=None, *, client, minimum_evidence_count=None, minimum_belief=None, **kwargs)[source]
Run GSEA for each entry in the INDRA database and the set of human genes that are upstream regulators of it.
- Parameters:
client (Neo4jClient) – The Neo4j client.
scores (Dict[str, float]) – A mapping from HGNC gene identifiers to floating point scores (e.g., from a differential gene expression analysis).
directory (Union[None, Path, str]) – Specify the directory if the results should be saved, including both a dataframe and plots for each gene set.
minimum_evidence_count (Optional[int]) – The minimum number of evidences for a relationship to count it as a regulator. Defaults to 1 (i.e., cutoff not applied).
minimum_belief (Optional[float]) – The minimum belief for a relationship to count it as a regulator. Defaults to 0.0 (i.e., cutoff not applied).
kwargs – Remaining keyword arguments to pass through to gseapy.prerank().
- Return type:
DataFrame
- Returns:
A pandas dataframe with the GSEA results
- indra_upstream_gsea(scores, directory=None, *, client, minimum_evidence_count=None, minimum_belief=None, **kwargs)[source]
Run GSEA for each entry in the INDRA database and the set of human genes that it regulates.
- Parameters:
client (Neo4jClient) – The Neo4j client.
scores (Dict[str, float]) – A mapping from HGNC gene identifiers to floating point scores (e.g., from a differential gene expression analysis).
directory (Union[None, Path, str]) – Specify the directory if the results should be saved, including both a dataframe and plots for each gene set.
minimum_evidence_count (Optional[int]) – The minimum number of evidences for a relationship to count it as a regulator. Defaults to 1 (i.e., cutoff not applied).
minimum_belief (Optional[float]) – The minimum belief for a relationship to count it as a regulator. Defaults to 0.0 (i.e., cutoff not applied).
kwargs – Remaining keyword arguments to pass through to gseapy.prerank().
- Return type:
DataFrame
- Returns:
A pandas dataframe with the GSEA results
- phenotype_gsea(scores, directory=None, *, client, **kwargs)[source]
Run GSEA with HPO phenotype gene sets.
- Parameters:
client (Neo4jClient) – The Neo4j client.
scores (Dict[str, float]) – A mapping from HGNC gene identifiers to floating point scores (e.g., from a differential gene expression analysis).
directory (Union[None, Path, str]) – Specify the directory if the results should be saved, including both a dataframe and plots for each gene set.
kwargs – Remaining keyword arguments to pass through to gseapy.prerank().
- Return type:
DataFrame
- Returns:
A pandas dataframe with the GSEA results
- reactome_gsea(scores, directory=None, *, client, **kwargs)[source]
Run GSEA with Reactome gene sets.
- Parameters:
client (Neo4jClient) – The Neo4j client.
scores (Dict[str, float]) – A mapping from HGNC gene identifiers to floating point scores (e.g., from a differential gene expression analysis).
directory (Union[None, Path, str]) – Specify the directory if the results should be saved, including both a dataframe and plots for each gene set.
kwargs – Remaining keyword arguments to pass through to gseapy.prerank().
- Return type:
DataFrame
- Returns:
A pandas dataframe with the GSEA results
- wikipathways_gsea(scores, directory=None, *, client, **kwargs)[source]
Run GSEA with WikiPathways gene sets.
- Parameters:
client (Neo4jClient) – The Neo4j client.
scores (Dict[str, float]) – A mapping from HGNC gene identifiers to floating point scores (e.g., from a differential gene expression analysis).
directory (Union[None, Path, str]) – Specify the directory if the results should be saved, including both a dataframe and plots for each gene set.
kwargs – Remaining keyword arguments to pass through to gseapy.prerank().
- Return type:
DataFrame
- Returns:
A pandas dataframe with the GSEA results
Discrete Gene Enrichment Analysis (indra_cogex.client.enrichment.discrete)
A collection of analyses possible on gene lists (of HGNC identifiers).
- go_ora(client, gene_ids, background_gene_ids=None, **kwargs)[source]
Calculate over-representation on all GO terms.
- Parameters:
client (Neo4jClient) – The Neo4j client.
background_gene_ids (Optional[Collection[str]]) – List of HGNC gene identifiers for the background gene set. If not given, all genes with HGNC IDs are used as the background.
**kwargs – Additional keyword arguments to pass to _do_ora.
- Return type:
DataFrame
- Returns:
DataFrame with columns: curie, name, p, q, mlp, mlq
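As an illustration of what an over-representation p-value measures, the sketch below computes a right-tailed hypergeometric probability for the overlap between a query gene list and one gene set. It is not the library's _do_ora: the helper name ora_p_value is hypothetical, the exact statistical test and the multiple-testing correction behind the q column are not specified in this reference, and reading mlp as the negative base-10 logarithm of p is an assumption.

```python
import math
from typing import Set


def ora_p_value(query: Set[str], gene_set: Set[str], background: Set[str]) -> float:
    """Right-tailed hypergeometric p-value for the overlap of two gene sets."""
    # Restrict both sets to the background before counting.
    query = query & background
    gene_set = gene_set & background
    n_background = len(background)
    n_set = len(gene_set)
    n_query = len(query)
    overlap = len(query & gene_set)
    total = math.comb(n_background, n_query)
    # Sum P(X = k) for every overlap at least as large as the observed one.
    tail = sum(
        math.comb(n_set, k) * math.comb(n_background - n_set, n_query - k)
        for k in range(overlap, min(n_set, n_query) + 1)
    )
    return tail / total


background = {str(i) for i in range(1, 11)}  # ten hypothetical gene identifiers
gene_set = {"1", "2", "3"}
query = {"1", "2", "4"}
p_value = ora_p_value(query, gene_set, background)
mlp = -math.log10(p_value)  # assumed meaning of the "mlp" column
```

In this toy case two of the three query genes fall in the gene set, giving p = 22/120; a complete overlap would give 1/120.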
- indra_downstream_ora(client, gene_ids, background_gene_ids=None, *, minimum_evidence_count=1, minimum_belief=0.0, **kwargs)[source]
Calculate a p-value for each entity in the INDRA database based on the genes that are causally upstream of it and how they compare to the query gene set.
- Parameters:
client (Neo4jClient) – The Neo4j client.
background_gene_ids (Optional[Collection[str]]) – List of HGNC gene identifiers for the background gene set. If not given, all genes with HGNC IDs are used as the background.
minimum_evidence_count (Optional[int]) – Minimum number of evidences to consider a causal relationship.
minimum_belief (Optional[float]) – Minimum belief to consider a causal relationship.
**kwargs – Additional keyword arguments to pass to _do_ora.
- Return type:
DataFrame
- Returns:
DataFrame with columns: curie, name, p, q, mlp, mlq
- indra_upstream_ora(client, gene_ids, background_gene_ids=None, *, minimum_evidence_count=1, minimum_belief=0.0, **kwargs)[source]
Calculate a p-value for each entity in the INDRA database based on the set of genes that it regulates and how they compare to the query gene set.
- Parameters:
client (Neo4jClient) – The Neo4j client.
background_gene_ids (Optional[Collection[str]]) – List of HGNC gene identifiers for the background gene set. If not given, all genes with HGNC IDs are used as the background.
minimum_evidence_count (Optional[int]) – Minimum number of evidences to consider a causal relationship.
minimum_belief (Optional[float]) – Minimum belief to consider a causal relationship.
**kwargs – Additional keyword arguments to pass to _do_ora.
- Return type:
DataFrame
- Returns:
DataFrame with columns: curie, name, p, q, mlp, mlq
- phenotype_ora(gene_ids, background_gene_ids=None, *, client, **kwargs)[source]
Calculate over-representation on all HP phenotypes.
- Parameters:
background_gene_ids (Optional[Collection[str]]) – List of HGNC gene identifiers for the background gene set. If not given, all genes with HGNC IDs are used as the background.
client (Neo4jClient) – The Neo4j client.
**kwargs – Additional keyword arguments to pass to _do_ora.
- Return type:
DataFrame
- Returns:
DataFrame with columns: curie, name, p, q, mlp, mlq
- reactome_ora(client, gene_ids, background_gene_ids=None, **kwargs)[source]
Calculate over-representation on all Reactome pathways.
- Parameters:
client (Neo4jClient) – The Neo4j client.
background_gene_ids (Optional[Collection[str]]) – List of HGNC gene identifiers for the background gene set. If not given, all genes with HGNC IDs are used as the background.
**kwargs – Additional keyword arguments to pass to _do_ora.
- Return type:
DataFrame
- Returns:
DataFrame with columns: curie, name, p, q, mlp, mlq
- wikipathways_ora(client, gene_ids, background_gene_ids=None, **kwargs)[source]
Calculate over-representation on all WikiPathway pathways.
- Parameters:
client (Neo4jClient) – The Neo4j client.
background_gene_ids (Optional[Collection[str]]) – List of HGNC gene identifiers for the background gene set. If not given, all genes with HGNC IDs are used as the background.
**kwargs – Additional keyword arguments to pass to _do_ora.
- Return type:
DataFrame
- Returns:
DataFrame with columns: curie, name, p, q, mlp, mlq
Signed Gene Enrichment Analysis (indra_cogex.client.enrichment.signed)
A collection of analyses possible on pairs of gene lists (of HGNC identifiers).
- reverse_causal_reasoning(positive_hgnc_ids, negative_hgnc_ids, minimum_size=4, alpha=None, keep_insignificant=True, *, client, minimum_evidence_count=None, minimum_belief=None)[source]
Implement the Reverse Causal Reasoning algorithm from [catlett2013].
- Parameters:
client (Neo4jClient) – A Neo4j client.
positive_hgnc_ids (Iterable[str]) – A list of positive-signed HGNC gene identifiers (e.g., up-regulated genes in a differential gene expression analysis).
negative_hgnc_ids (Iterable[str]) – A list of negative-signed HGNC gene identifiers (e.g., down-regulated genes in a differential gene expression analysis).
minimum_size (int) – The minimum number of entities marked as downstream of an entity for it to be usable as a hypothesis.
alpha (Optional[float]) – The cutoff for significance. Defaults to 0.05.
keep_insignificant (bool) – If False, removes insignificant results (those with a p-value above alpha).
minimum_evidence_count (Optional[int]) – The minimum number of evidences for a relationship to count it as a regulator. Defaults to 1 (i.e., cutoff not applied).
minimum_belief (Optional[float]) – The minimum belief for a relationship to count it as a regulator. Defaults to 0.0 (i.e., cutoff not applied).
- Return type:
DataFrame
- Returns:
A pandas DataFrame with results for each entity in the graph database
[catlett2013] Catlett, N. L., et al. (2013). Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data. BMC Bioinformatics, 14(1), 340. https://doi.org/10.1186/1471-2105-14-340
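The scoring idea behind reverse causal reasoning can be sketched with a concordance test: for one upstream hypothesis, count how many of its signed downstream genes were observed moving in the predicted direction, then ask how likely that many agreements would be by chance. The helper below is a simplified illustration under stated assumptions, not the library's implementation: concordance_p_value is a hypothetical name, the input encoding (+1 for increases, -1 for decreases) is made up, the gene identifiers are placeholders, and a gene listed in both query sets is treated as positive.

```python
import math
from typing import Dict, Set


def concordance_p_value(
    targets: Dict[str, int], positive: Set[str], negative: Set[str]
) -> float:
    """Binomial tail probability of observing at least this many agreements.

    ``targets`` maps each downstream gene to +1 (the hypothesis increases it)
    or -1 (the hypothesis decreases it).
    """
    correct = incorrect = 0
    for gene, predicted in targets.items():
        if gene in positive:
            observed = 1
        elif gene in negative:
            observed = -1
        else:
            continue  # gene was not observed in either list
        if observed == predicted:
            correct += 1
        else:
            incorrect += 1
    n = correct + incorrect
    # P(X >= correct) for X ~ Binomial(n, 0.5)
    return sum(math.comb(n, k) for k in range(correct, n + 1)) / 2 ** n


targets = {"A": 1, "B": 1, "C": -1, "D": -1}
p_all_correct = concordance_p_value(targets, {"A", "B"}, {"C"})
p_one_wrong = concordance_p_value(targets, {"A", "B", "D"}, {"C"})
```

Three agreements out of three observed genes give 1/8; adding one contradiction raises the p-value to 5/16.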
Gene Enrichment Analysis Utilities (indra_cogex.client.enrichment.utils)
Utility functions for gene enrichment analysis.
Utilities for getting gene sets.
- collect_gene_sets(query, *, client, background_gene_ids=None, include_ontology_children=False, cache_file=None)[source]
Collect gene sets based on the given query.
- Parameters:
query (str) – A Cypher query.
client (Neo4jClient) – The Neo4j client.
background_gene_ids (Optional[Iterable[str]]) – List of HGNC gene identifiers for the background gene set. If not given, all genes with HGNC IDs are used as the background.
include_ontology_children (bool) – If True, extend the gene set associations with associations from child terms using the INDRA ontology.
- Return type:
- Returns:
A dictionary whose keys are 2-tuples of the CURIE and name of each queried item and whose values are sets of HGNC gene identifiers (as strings).
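The effect of include_ontology_children on the returned mapping can be sketched as below. This is a hedged illustration only: extend_with_children is a hypothetical helper, the CURIEs are placeholders, and the child lookup is passed in as a plain dictionary instead of being derived from the INDRA ontology.

```python
from typing import Dict, Set, Tuple

GeneSets = Dict[Tuple[str, str], Set[str]]


def extend_with_children(gene_sets: GeneSets, children: Dict[str, Set[str]]) -> GeneSets:
    """Add the genes of each term's child terms to the term's own gene set.

    ``children`` maps a CURIE to the CURIEs of all of its descendants.
    """
    curie_to_genes = {curie: genes for (curie, _name), genes in gene_sets.items()}
    extended: GeneSets = {}
    for (curie, name), genes in gene_sets.items():
        combined = set(genes)  # copy so the input is not mutated
        for child in children.get(curie, set()):
            combined |= curie_to_genes.get(child, set())
        extended[(curie, name)] = combined
    return extended


gene_sets = {
    ("EX:parent", "parent term"): {"6871"},
    ("EX:child", "child term"): {"11998"},
}
extended = extend_with_children(gene_sets, {"EX:parent": {"EX:child"}})
```

The parent term picks up the child's gene, while the child's own set is unchanged.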
- get_entity_to_regulators(*, client, background_gene_ids=None, minimum_evidence_count=1, minimum_belief=0.0)[source]
Get a mapping from each entity in the INDRA database to the set of human genes that are causally upstream of it.
- Parameters:
client (Neo4jClient) – The Neo4j client.
background_gene_ids (Optional[Iterable[str]]) – List of HGNC gene identifiers for the background gene set. If not given, all genes with HGNC IDs are used as the background.
minimum_evidence_count (Optional[int]) – The minimum number of evidences for a relationship to count it as a regulator. Defaults to 1 (i.e., cutoff not applied).
minimum_belief (Optional[float]) – The minimum belief for a relationship to count it as a regulator. Defaults to 0.0 (i.e., cutoff not applied).
- Return type:
- Returns:
A dictionary whose keys are 2-tuples of the CURIE and name of each entity and whose values are sets of HGNC gene identifiers (as strings).
- get_entity_to_targets(*, client, background_gene_ids=None, minimum_evidence_count=1, minimum_belief=0.0)[source]
Get a mapping from each entity in the INDRA database to the set of human genes that it regulates.
- Parameters:
client (Neo4jClient) – The Neo4j client.
background_gene_ids (Optional[Iterable[str]]) – List of HGNC gene identifiers for the background gene set. If not given, all genes with HGNC IDs are used as the background.
minimum_evidence_count (Optional[int]) – The minimum number of evidences for a relationship to count it as a regulator. Defaults to 1 (i.e., cutoff not applied).
minimum_belief (Optional[float]) – The minimum belief for a relationship to count it as a regulator. Defaults to 0.0 (i.e., cutoff not applied).
- Return type:
- Returns:
A dictionary whose keys are 2-tuples of the CURIE and name of each entity and whose values are sets of HGNC gene identifiers (as strings).
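The role of minimum_evidence_count and minimum_belief can be illustrated with in-memory edges. The sketch below is not how the library queries Neo4j: the Edge tuple, the entity_to_regulators helper, and the example CURIEs are hypothetical, and inclusive thresholds are assumed so that the defaults of 1 and 0.0 leave every edge in place.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Edge(NamedTuple):
    """Hypothetical regulator-to-entity relationship with support metadata."""

    regulator_hgnc_id: str
    entity_curie: str
    entity_name: str
    evidence_count: int
    belief: float


def entity_to_regulators(
    edges: Iterable[Edge],
    minimum_evidence_count: int = 1,
    minimum_belief: float = 0.0,
) -> Dict[Tuple[str, str], Set[str]]:
    """Group regulators by entity, keeping only sufficiently supported edges."""
    result: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for edge in edges:
        if edge.evidence_count < minimum_evidence_count:
            continue
        if edge.belief < minimum_belief:
            continue
        result[(edge.entity_curie, edge.entity_name)].add(edge.regulator_hgnc_id)
    return dict(result)


edges = [
    Edge("6871", "EX:1", "entity one", 3, 0.9),
    Edge("11998", "EX:1", "entity one", 1, 0.4),
    Edge("6871", "EX:2", "entity two", 1, 0.95),
]
unfiltered = entity_to_regulators(edges)
well_supported = entity_to_regulators(edges, minimum_evidence_count=2)
```

Raising either threshold shrinks the gene sets, and entities left without any regulator disappear from the mapping.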
- get_go(*, background_gene_ids=None, client)[source]
Get GO gene sets.
- Parameters:
client (Neo4jClient) – The Neo4j client.
background_gene_ids (Optional[Iterable[str]]) – List of HGNC gene identifiers for the background gene set. If not given, all genes with HGNC IDs are used as the background.
- Return type:
- Returns:
A dictionary whose keys are 2-tuples of the CURIE and name of each GO term and whose values are sets of HGNC gene identifiers (as strings).
- get_phenotype_gene_sets(*, background_gene_ids=None, client)[source]
Get HPO phenotype gene sets.
- Parameters:
client (Neo4jClient) – The Neo4j client.
background_gene_ids (Optional[Iterable[str]]) – List of HGNC gene identifiers for the background gene set. If not given, all genes with HGNC IDs are used as the background.
- Return type:
- Returns:
A dictionary whose keys are 2-tuples of the CURIE and name of each phenotype gene set and whose values are sets of HGNC gene identifiers (as strings).
- get_reactome(*, background_gene_ids=None, client)[source]
Get Reactome gene sets.
- Parameters:
client (Neo4jClient) – The Neo4j client.
background_gene_ids (Optional[Iterable[str]]) – List of HGNC gene identifiers for the background gene set. If not given, all genes with HGNC IDs are used as the background.
- Return type:
- Returns:
A dictionary whose keys are 2-tuples of the CURIE and name of each Reactome pathway and whose values are sets of HGNC gene identifiers (as strings).
- get_wikipathways(*, background_gene_ids=None, client)[source]
Get WikiPathways gene sets.
- Parameters:
client (Neo4jClient) – The Neo4j client.
background_gene_ids (Optional[Iterable[str]]) – List of HGNC gene identifiers for the background gene set. If not given, all genes with HGNC IDs are used as the background.
- Return type:
- Returns:
A dictionary whose keys are 2-tuples of the CURIE and name of each WikiPathways pathway and whose values are sets of HGNC gene identifiers (as strings).
Neo4j Client (indra_cogex.client.neo4j_client)
Neo4j client module.
- class Neo4jClient(url=None, auth=None)[source]
A client to communicate with an INDRA CoGEx Neo4j instance.
- Parameters:
url (Optional[str]) – The bolt URL to the neo4j instance to override INDRA_NEO4J_URL set as an environment variable or set in the INDRA config file.
auth (Optional[Tuple[str, str]]) – A tuple consisting of the user name and password for the neo4j instance to override INDRA_NEO4J_USER and INDRA_NEO4J_PASSWORD set as environment variables or set in the INDRA config file.
Initialize the Neo4j client.
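The override order described for url and auth can be sketched as follows. This is a simplified illustration, not the client's constructor: resolve_connection is a hypothetical helper, it only consults environment variables, and the INDRA config file lookup mentioned above is deliberately left out.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_connection(
    url: Optional[str] = None,
    auth: Optional[Tuple[str, str]] = None,
    environ: Mapping[str, str] = os.environ,
) -> Tuple[Optional[str], Optional[Tuple[str, str]]]:
    """Prefer explicit arguments, then fall back to INDRA_NEO4J_* variables."""
    if url is None:
        url = environ.get("INDRA_NEO4J_URL")
    if auth is None:
        user = environ.get("INDRA_NEO4J_USER")
        password = environ.get("INDRA_NEO4J_PASSWORD")
        if user is not None and password is not None:
            auth = (user, password)
    return url, auth


# Placeholder settings standing in for real environment variables.
fake_env = {
    "INDRA_NEO4J_URL": "bolt://localhost:7687",
    "INDRA_NEO4J_USER": "neo4j",
    "INDRA_NEO4J_PASSWORD": "example-password",
}
from_env = resolve_connection(environ=fake_env)
overridden = resolve_connection(url="bolt://other-host:7687", environ=fake_env)
```

Passing environ explicitly keeps the example independent of the machine it runs on.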
- create_single_property_node_index(index_name, label, property_name, exist_ok=False)[source]
Create a single property node index.
- create_single_property_relationship_index(index_name, rel_type, property_name)[source]
Create a single property relationship index.
NOTE: Relationship indexes can only be created once, and there is no IF NOT EXISTS option to silently ignore if the index already exists.
- get_all_relations(node, relation=None, node_type=None, other_type=None)[source]
Get relations that connect sources and targets with the given node.
- Parameters:
- Returns:
A list of relations matching the constraints.
- Return type:
rels
- get_common_sources(targets, relation, source_type=None, target_type=None)[source]
Return the common source nodes related to all the given targets via a given relation type.
- Parameters:
- Returns:
A list of source nodes.
- Return type:
sources
- get_common_targets(sources, relation, source_type=None, target_type=None)[source]
Return the common target nodes related to all the given sources via a given relation type.
- Parameters:
- Returns:
A list of target nodes.
- Return type:
targets
- get_predecessors(target, relations, source_type=None, target_type=None)[source]
Return the nodes that precede the given node via the given relation types.
- Parameters:
- Returns:
A list of predecessor nodes.
- Return type:
predecessors
- static get_property_from_relations(relations, prop)[source]
Return the set of property values on given relations.
- get_relations(source=None, target=None, relation=None, source_type=None, target_type=None, limit=None, bidirectional=False)[source]
Return relations based on source, target and type constraints.
This is a generic function for getting relations. All of its parameters are optional, though at least a source or a target needs to be provided.
- Parameters:
source (Optional[Tuple[str, str]]) – Source namespace and ID.
target (Optional[Tuple[str, str]]) – Target namespace and ID.
source_type (Optional[str]) – A constraint on the source type.
target_type (Optional[str]) – A constraint on the target type.
limit (Optional[int]) – A limit on the number of relations returned.
bidirectional (Optional[bool]) – If True, return both directions of relationships between the source and target.
- Returns:
A list of relations matching the constraints.
- Return type:
rels
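To make the constraints concrete, here is a rough sketch of how such arguments could be turned into a Cypher pattern. Everything in it is an assumption for illustration: build_match is a hypothetical helper, the way identifiers are stored on nodes (a lowercase namespace prefix in an id property) is a guess, and production code should pass values as query parameters instead of interpolating them into the query string.

```python
from typing import Optional, Tuple


def build_match(
    source: Optional[Tuple[str, str]] = None,
    target: Optional[Tuple[str, str]] = None,
    relation: Optional[str] = None,
    source_type: Optional[str] = None,
    target_type: Optional[str] = None,
    limit: Optional[int] = None,
    bidirectional: bool = False,
) -> str:
    """Compose a MATCH query from optional source, target, and type constraints."""
    if source is None and target is None:
        raise ValueError("At least a source or a target must be provided")

    def node(var: str, grounding: Optional[Tuple[str, str]], node_type: Optional[str]) -> str:
        label = f":{node_type}" if node_type else ""
        # Assumed identifier format: "<lowercase namespace>:<id>".
        props = f" {{id: '{grounding[0].lower()}:{grounding[1]}'}}" if grounding else ""
        return f"({var}{label}{props})"

    source_pattern = node("s", source, source_type)
    target_pattern = node("t", target, target_type)
    rel_pattern = f"[r:{relation}]" if relation else "[r]"
    arrow = "-" if bidirectional else "->"
    query = f"MATCH p={source_pattern}-{rel_pattern}{arrow}{target_pattern} RETURN p"
    if limit is not None:
        query += f" LIMIT {limit}"
    return query


query = build_match(source=("HGNC", "6871"), relation="expressed_in", limit=5)
```

The sketch returns a single path variable p so that the result lines up with the p=(h)-[r]->(t) shape that relation queries expect.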
- get_source_agents(target, relation)[source]
Return the nodes related to the target via a given relation type as INDRA Agents.
- get_source_relations(target, relation=None, target_type=None, source_type=None)[source]
Get relations that connect sources to the given target.
- Parameters:
- Returns:
A list of relations matching the constraints.
- Return type:
rels
- get_sources(target, relation=None, source_type=None, target_type=None)[source]
Return the nodes related to the target via a given relation type.
- Parameters:
- Returns:
A list of source nodes.
- Return type:
sources
- get_successors(source, relations, source_type=None, target_type=None)[source]
Return the nodes that succeed the given node via the given relation types.
- Parameters:
- Returns:
A list of successor nodes.
- Return type:
successors
- get_target_agents(source, relation, source_type=None)[source]
Return the nodes related to the source via a given relation type as INDRA Agents.
- get_target_relations(source, relation=None, source_type=None, target_type=None)[source]
Get relations that connect the given source to its targets.
- Parameters:
- Returns:
A list of relations matching the constraints.
- Return type:
rels
- get_targets(source, relation=None, source_type=None, target_type=None)[source]
Return the nodes related to the source via a given relation type.
- Parameters:
- Returns:
A list of target nodes.
- Return type:
targets
- has_relation(source, target, relation, source_type=None, target_type=None)[source]
Return True if there is a relation between the source and the target.
- Parameters:
- Returns:
True if there is a relation of the given type, otherwise False.
- Return type:
related
- static neo4j_to_node(neo4j_node)[source]
Return a Node from a neo4j internal node.
- Parameters:
neo4j_node (Node) – A neo4j internal node using its internal data structure and identifier scheme.
- Returns:
A Node object with the INDRA standard identifier scheme.
- Return type:
node
- classmethod neo4j_to_relation(neo4j_path)[source]
Return a Relation from a neo4j internal single-relation path.
- Parameters:
neo4j_path (Path) – A neo4j internal single-edge path using its internal data structure and identifier scheme.
- Returns:
A Relation object with the INDRA standard identifier scheme.
- Return type:
relation
- static neo4j_to_relations(neo4j_path)[source]
Return a list of Relations from a neo4j internal multi-relation path.
- static node_to_agent(node)[source]
Return an INDRA Agent from a Node.
- Parameters:
node (Node) – A Node object.
- Returns:
An INDRA Agent with standardized name and expanded/standardized db_refs.
- Return type:
agent
- query_dict(query, **query_params)[source]
Run a read-only query that generates a dictionary.
- Return type:
- query_dict_value_json(query, **query_params)[source]
Run a read-only query that generates a dictionary.
- Return type:
- query_nodes(query, **query_params)[source]
Run a read-only query for nodes.
- Parameters:
query (str) – The query string to be executed.
query_params – Query parameters to pass to cypher
- Returns:
A list of Node instances corresponding to the results of the query
- Return type:
values
- query_relations(query, **query_params)[source]
Run a read-only query for relations.
- Parameters:
query (str) – The query string to be executed. Must have a RETURN with a single element p where in the MATCH part of the query it has something like p=(h)-[r]->(t).
query_params – Query parameters to pass to query transaction function that will fill out the placeholders in the cypher query
- Returns:
A list of Relation instances corresponding to the results of the query
- Return type:
values
- autoclient(*, cache=False, maxsize=128)[source]
Wrap a function that takes a client for easier usage.
- Parameters:
cache (bool) – Should the result be cached using functools.lru_cache()? Is False by default.
maxsize (Optional[int]) – If cache is True, this is the value passed to the maxsize argument of functools.lru_cache(). Set to None for unlimited caching, but beware that this can potentially use a lot of memory and isn’t a good idea for queries that can take a lot of different kinds of input over time.
- Returns:
A decorator object that will wrap the function
Examples
Not appropriate for caching (i.e., many possible inputs, especially in a web app scenario):
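    from typing import Tuple

    from indra_cogex.client.neo4j_client import Neo4jClient, autoclient

    @autoclient()
    def get_tissues_for_gene(gene: Tuple[str, str], *, client: Neo4jClient):
        # Look up the tissues in which the given gene is expressed.
        return client.get_targets(
            gene,
            relation="expressed_in",
            source_type="BioEntity",
            target_type="BioEntity",
        )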
Appropriate for caching (e.g., doesn’t take inputs at all):
    from collections import Counter

    @autoclient(cache=True, maxsize=1)
    def get_node_count(*, client: Neo4jClient) -> Counter:
        # Count the nodes carrying each label reported by the database.
        return Counter(
            {
                label[0]: client.query_tx(f"MATCH (n:{label[0]}) RETURN count(*)")[0][0]
                for label in client.query_tx("call db.labels();")
            }
        )
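The behavior described for the decorator can be approximated with a short sketch: inject a client when the caller does not pass one, and optionally wrap the result in functools.lru_cache(). This is an assumption-based illustration, not the library's code; FakeClient and autoclient_sketch are hypothetical names, and how the real decorator obtains its default client is not covered here.

```python
import functools
from typing import Callable, List, Optional


class FakeClient:
    """Hypothetical stand-in for Neo4jClient so the sketch runs offline."""


def autoclient_sketch(*, cache: bool = False, maxsize: Optional[int] = 128) -> Callable:
    """Return a decorator that supplies a client and optionally caches results."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def with_client(*args, client=None, **kwargs):
            if client is None:
                client = FakeClient()  # assumption: a default client is created
            return func(*args, client=client, **kwargs)

        if not cache:
            return with_client
        # Cache on the caller-supplied arguments; all of them must be hashable.
        return functools.lru_cache(maxsize=maxsize)(with_client)

    return decorator


calls: List[FakeClient] = []


@autoclient_sketch(cache=True, maxsize=1)
def get_answer(*, client):
    calls.append(client)  # record each real invocation
    return 42


@autoclient_sketch()
def describe(name: str, *, client) -> str:
    return f"{name} via {type(client).__name__}"


first = get_answer()
second = get_answer()
```

Because get_answer takes no inputs, a cache of size one is enough, and the second call never reaches the wrapped function.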
The INDRA CoGEx Neo4j Client (indra_cogex.client.queries)
- get_drugs_for_side_effect(side_effect, *, client)[source]
Return the drugs for the given side effect.
- get_drugs_for_target(target, *, client)[source]
Return the drugs targeting the given protein.
- Parameters:
client (Neo4jClient) – The Neo4j client.
- Return type:
Iterable[Agent]
- Returns:
The drugs targeting the given protein.
- get_drugs_for_targets(targets, *, client)[source]
Return the drugs targeting each of the given targets.
- get_evidences_for_mesh(mesh_term, include_child_terms=True, *, client)[source]
Return the evidence objects for the given MESH term.
- Parameters:
client (Neo4jClient) – The Neo4j client.
include_child_terms (bool) – If True, also match against the child MESH terms of the given MESH ID.
- Return type:
- Returns:
The evidence objects for the given MESH ID grouped into a dict by statement hash.
- get_evidences_for_stmt_hash(stmt_hash, *, client, limit=None, offset=0, remove_medscan=True)[source]
Return the matching evidence objects for the given statement hash.
- Parameters:
client (Neo4jClient) – The Neo4j client.
stmt_hash (int) – The statement hash to query, accepts both string and integer.
limit (Optional[int]) – The maximum number of results to return.
offset (int) – The number of results to skip before returning the first result.
remove_medscan (bool) – If True, remove the MedScan evidence from the results.
- Return type:
Iterable[Evidence]
- Returns:
The evidence objects for the given statement hash.
- get_evidences_for_stmt_hashes(stmt_hashes, *, client, limit=None, remove_medscan=True)[source]
Return the matching evidence objects for the given statement hashes.
- Parameters:
client (Neo4jClient) – The Neo4j client.
stmt_hashes (Iterable[int]) – The statement hashes to query, accepts integers and strings.
limit (Optional[str]) – The optional maximum number of evidences returned for each statement hash.
remove_medscan (bool) – If True, remove the MedScan evidence from the results.
- Return type:
- Returns:
A mapping of stmt hash to a list of evidence objects for the given statement hashes.
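The shape of the returned mapping, together with the per-hash limit and the MedScan filter, can be sketched with plain dictionaries. This is only an illustration under stated assumptions: group_evidence is a hypothetical helper, evidence is represented as a dictionary with a source_api key instead of an INDRA Evidence object, and the hashes are made-up numbers.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Optional, Tuple, Union

EvidenceDict = Dict[str, Any]


def group_evidence(
    rows: Iterable[Tuple[Union[int, str], EvidenceDict]],
    limit: Optional[int] = None,
    remove_medscan: bool = True,
) -> Dict[int, List[EvidenceDict]]:
    """Group (statement hash, evidence) pairs into a dict keyed by integer hash."""
    grouped: Dict[int, List[EvidenceDict]] = defaultdict(list)
    for stmt_hash, evidence in rows:
        if remove_medscan and evidence.get("source_api") == "medscan":
            continue
        bucket = grouped[int(stmt_hash)]  # accept string or integer hashes
        if limit is not None and len(bucket) >= limit:
            continue  # this hash already has the maximum number of evidences
        bucket.append(evidence)
    return dict(grouped)


rows = [
    (12345, {"source_api": "reach", "text": "A activates B."}),
    ("12345", {"source_api": "medscan", "text": "A increases B."}),
    (12345, {"source_api": "sparser", "text": "B is activated by A."}),
    (-678, {"source_api": "reach", "text": "C inhibits D."}),
]
grouped = group_evidence(rows, limit=1)
```

Normalizing the keys to integers means that a hash supplied as a string lands in the same bucket as its integer form.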
- get_genes_for_go_term(go_term, include_indirect=False, *, client)[source]
Return the genes associated with the given GO term.
- Parameters:
client (Neo4jClient) – The Neo4j client.
go_term (Tuple[str, str]) – The GO term to query. Example: ("GO", "GO:0006915")
include_indirect (bool) – Should ontological children of the given GO term be queried as well? Defaults to False.
- Return type:
- Returns:
The genes associated with the given GO term.
- get_go_terms_for_gene(gene, include_indirect=False, *, client)[source]
Return the GO terms for the given gene.
- get_mutated_genes(cell_line, *, client)[source]
Return the list of genes that are mutated in a given cell line.
- Parameters:
client – The Neo4j client.
cell_line – The cell line to query.
- get_node_counter(*, client)[source]
Get a count of each entity type.
- Parameters:
client (Neo4jClient) – The Neo4j client.
- Return type:
- Returns:
A Counter of the entity types.
Warning
This code assumes all nodes only have one label, as in label[0].
- get_pmids_for_mesh(mesh_term, include_child_terms=True, *, client)[source]
Return the PubMed IDs for the given MESH term.
- Parameters:
client (Neo4jClient) – The Neo4j client.
include_child_terms (bool) – If True, also match against the child MESH terms of the given MESH term.
- Return type:
- Returns:
The PubMed IDs for the given MESH term and, optionally, its child terms.
- get_schema_graph(*, client)[source]
Get a NetworkX graph reflecting the schema of the Neo4j graph.
Generate a PDF diagram (works with PNG and SVG too) with the following:
>>> from networkx.drawing.nx_agraph import to_agraph
>>> client = ...
>>> graph = get_schema_graph(client=client)
>>> to_agraph(graph).draw("~/Desktop/cogex_schema.pdf", prog="dot")
- Return type:
MultiDiGraph
Return the shared pathways for the given list of genes.
- get_stmts_for_mesh(mesh_term, include_child_terms=True, *, client, **kwargs)[source]
Return the statements with evidence for the given MESH ID.
- Parameters:
client (Neo4jClient) – The Neo4j client.
include_child_terms (bool) – If True, also match against the children of the given MESH ID.
kwargs – Additional keyword arguments to forward to get_stmts_for_stmt_hashes().
- Return type:
Iterable[Statement]
- Returns:
The statements for the given MESH ID.
- get_stmts_for_paper(paper_term, *, client, **kwargs)[source]
Return the statements with evidence from the given paper.
- Parameters:
client (Neo4jClient) – The Neo4j client.
paper_term (Tuple[str, str]) – The term to query. Can be a PubMed ID, PMC ID, TRID, or DOI.
- Return type:
List[Statement]
- Returns:
The statements for the given paper.
- get_stmts_for_pubmeds(pubmeds, *, client, **kwargs)[source]
Return the statements with evidence from the given PubMed IDs.
- Parameters:
client (
Neo4jClient
) – The Neo4j client.
- Return type:
List
[Statement
]- Returns:
The statements for the given PubMed identifiers.
Example
from indra_cogex.client.queries import get_stmts_for_pubmeds pubmeds = [20861832, 19503834] stmts = get_stmts_for_pubmeds(pubmeds)
- get_stmts_for_stmt_hashes(stmt_hashes, *, evidence_map=None, client, evidence_limit=None, return_evidence_counts=False, subject_prefix=None, object_prefix=None)[source]
Return the statements for the given statement hashes.
- Parameters:
client (Neo4jClient) – The Neo4j client.
stmt_hashes (Iterable[int]) – The statement hashes to query.
evidence_map (Optional[Dict[int, List[Evidence]]]) – Optionally provide a mapping of stmt hash to a list of evidence objects
evidence_limit (Optional[int]) – An optional maximum number of evidences to return
- Return type:
Union[List[Statement], Tuple[List[Statement], Mapping[int, int]]]
- Returns:
The statements for the given statement hashes.
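Because the return type is a union, calling code may want to normalize both shapes. The sketch below shows one way to do so. It assumes the tuple form corresponds to return_evidence_counts=True, which is an inference from the signature rather than something stated above; `split_result` is a hypothetical helper, and plain strings stand in for Statement objects.

```python
def split_result(result):
    """Normalize the two documented return shapes into (statements, counts)."""
    if isinstance(result, tuple):
        statements, counts = result
        return list(statements), dict(counts)
    # List-only shape: no evidence counts were returned.
    return list(result), {}


# Stand-in values: strings replace Statement objects, and the hashes are made up.
stmts_only = ["stmt_a", "stmt_b"]
stmts_with_counts = (["stmt_a", "stmt_b"], {111: 3, 222: 1})

plain = split_result(stmts_only)
counted = split_result(stmts_with_counts)
```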
- get_stmts_meta_for_stmt_hashes(stmt_hashes, *, client)[source]
Return the metadata and statements for a given list of hashes
- Parameters:
stmt_hashes (Iterable[int]) – The list of statement hashes to query.
client (Neo4jClient) – The Neo4j client.
- Return type:
- Returns:
A dict of statements with their metadata
- get_targets_for_drug(drug, *, client)[source]
Return the proteins targeted by the given drug.
- Parameters:
client (Neo4jClient) – The Neo4j client.
- Return type:
Iterable[Agent]
- Returns:
The proteins targeted by the given drug.
- get_targets_for_drugs(drugs, *, client)[source]
Return the proteins targeted by each of the given drugs
- is_gene_in_pathway(gene, pathway, *, client)[source]
Return True if the gene is in the given pathway.
- is_gene_in_tissue(gene, tissue, *, client)[source]
Return True if the gene is expressed in the given tissue.
- is_gene_mutated(gene, cell_line, *, client)[source]
Return True if the gene is mutated in the given cell line.
- is_go_term_for_gene(gene, go_term, *, client)[source]
Return True if the given GO term is associated with the given gene.
- is_side_effect_for_drug(drug, side_effect, *, client)[source]
Return True if the given side effect is associated with the given drug.
Subnetwork Client (indra_cogex.client.subnetwork)
Queries that generate statement subnetworks.
- indra_mediated_subnetwork(nodes, *, client)[source]
Return the INDRA Statement subnetwork induced by pairs of statements between the given nodes.
For example, if gene A and gene B are given as the query, find statements mediated by X such that A -> X -> B.
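The A -> X -> B pattern can be sketched over a plain edge list. This is only an illustration of the pattern, not the query that indra_cogex runs; `mediated_pairs` is a hypothetical helper, and the sketch assumes the mediator X is not itself one of the query nodes.

```python
def mediated_pairs(edges, query_nodes):
    """Find (a, x, b) triples where a -> x -> b, a and b are query nodes, and x is not."""
    query = set(query_nodes)
    targets = {}
    for source, target in edges:
        targets.setdefault(source, set()).add(target)
    triples = []
    for a in sorted(query):
        for x in sorted(targets.get(a, ())):
            if x in query:
                continue  # direct edges between query nodes are not mediated
            for b in sorted(targets.get(x, ())):
                if b in query and b != a:
                    triples.append((a, x, b))
    return triples


# Toy directed edges; only A -> X -> B is a mediated path between A and B.
toy_edges = [("A", "X"), ("X", "B"), ("A", "B"), ("B", "Y")]
triples = mediated_pairs(toy_edges, ["A", "B"])
```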
- indra_subnetwork(nodes, *, client)[source]
Return the INDRA Statement subnetwork induced by the given nodes.
- indra_subnetwork_go(go_term, *, client, include_indirect=False, mediated=False, upstream_controllers=False, downstream_targets=False)[source]
Return the INDRA Statement subnetwork induced by the given GO term.
- Parameters:
go_term (Tuple[str, str]) – The GO term to query. Example: ("GO", "GO:0006915")
client (Neo4jClient) – The Neo4j client.
include_indirect (bool) – Should ontological children of the given GO term be queried as well? Defaults to False.
mediated (bool) – Should relations A->X->B be included for X not associated to the given GO term? Defaults to False.
upstream_controllers (bool) – Should relations A<-X->B be included for upstream controller X not associated to the given GO term? Defaults to False.
downstream_targets (bool) – Should relations A->X<-B be included for downstream target X not associated to the given GO term? Defaults to False.
- Return type:
List[Statement]
- Returns:
The INDRA statement subnetwork induced by GO term.
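The upstream_controllers (A<-X->B) and downstream_targets (A->X<-B) patterns can be illustrated on a toy edge list. This sketch is not the library's query; `shared_regulators_and_targets` is a hypothetical helper, and `members` stands in for the genes associated with the GO term.

```python
def shared_regulators_and_targets(edges, members):
    """Find outside nodes X with X -> A and X -> B, or A -> X and B -> X, for members A != B."""
    members = set(members)
    upstream, downstream = {}, {}
    for source, target in edges:
        if source not in members and target in members:
            upstream.setdefault(source, set()).add(target)
        if source in members and target not in members:
            downstream.setdefault(target, set()).add(source)
    # Keep only outside nodes connected to at least two distinct members.
    controllers = {x: hit for x, hit in upstream.items() if len(hit) >= 2}
    targets = {x: hit for x, hit in downstream.items() if len(hit) >= 2}
    return controllers, targets


# Toy edges: X regulates both A and B, Y is targeted by both, Z touches only A.
toy_edges = [("X", "A"), ("X", "B"), ("A", "Y"), ("B", "Y"), ("Z", "A")]
controllers, targets = shared_regulators_and_targets(toy_edges, ["A", "B"])
```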
- indra_subnetwork_relations(nodes, *, client)[source]
Return the subnetwork induced by the given nodes as a set of Relations.
Indexing of The Database
Once the database is built, it can be indexed using the following command:
python -m indra_cogex.indexing
or from the root directory of the repository:
./build_extra_indexes.sh
Indexing The Database (indra_cogex.indexing)
A collection of functions for indexing on the database.
- index_evidence_on_stmt_hash(client, exist_ok=False)[source]
Index all Evidence nodes on the stmt_hash property
- Parameters:
client (Neo4jClient) – Neo4jClient instance to the graph database to be indexed
exist_ok (bool) – If False, raise an exception if the index already exists. Default: False.
- index_indra_rel_on_stmt_hash(client)[source]
Index all indra_rel relationships on stmt_hash property
- Parameters:
client (Neo4jClient) – Neo4jClient instance to the graph database to be indexed
- index_nodes_on_id(client, exist_ok=False)[source]
Index all nodes on the id property
- Parameters:
client (Neo4jClient) – Neo4jClient instance to the graph database to be indexed
exist_ok (bool) – If False, raise an exception if the index already exists. Default: False.
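The exist_ok behavior described for these functions can be sketched with an in-memory stand-in. `ToyIndexRegistry` is hypothetical and does not talk to Neo4j; it only mirrors the documented semantics, where exist_ok=False raises if the index already exists. The exception type is an arbitrary choice for the sketch.

```python
class ToyIndexRegistry:
    """In-memory stand-in for a database's set of indexes (hypothetical)."""

    def __init__(self):
        self.indexes = set()

    def create_index(self, name, exist_ok=False):
        """Create an index; return False if it already existed and exist_ok is set."""
        if name in self.indexes:
            if exist_ok:
                return False  # nothing created; the existing index is left untouched
            raise ValueError(f"index already exists: {name}")
        self.indexes.add(name)
        return True


registry = ToyIndexRegistry()
created_first = registry.create_index("node_id")
created_again = registry.create_index("node_id", exist_ok=True)
try:
    registry.create_index("node_id")
    raised = False
except ValueError:
    raised = True
```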
Indexing CLI (indra_cogex.indexing.cli)
indra_cogex indexing
Build indexes on the database.
indra_cogex indexing [OPTIONS]
Options
- --all
Build all indexes
- --index-nodes
Index all nodes on the id property.
- --index-evidence-nodes
Index the Evidence nodes on the stmt_hash property.
- --index-indra-relations
Index the INDRA relations on the stmt_hash property.
- --exist-ok
If set, silently skip indexes that already exist; otherwise, an exception is raised when attempting to set an index that already exists.
INDRA CoGEx Sources
Source CLI (indra_cogex.sources.cli)
INDRA CoGEx Sources Processor (indra_cogex.sources.processor)
Base classes for processors.
Bgee Processor (indra_cogex.sources.bgee)
Processor for Bgee.
cBioPortal Processor (indra_cogex.sources.cbioportal)
CellMarker Processor (indra_cogex.sources.cellmarker)
Processor for the CellMarker database.
See also
Website: http://xteam.xbio.top/CellMarker/
Publication: https://doi.org/10.1093/nar/gky900
chembl Processor (indra_cogex.sources.chembl)
Processor for ChEMBL.
- class ChemblIndicationsProcessor(version=None)[source]
Bases: Processor
A processor for ChEMBL indications.
- MOLECULE_SQL = '\nSELECT DISTINCT\n MOLECULE_DICTIONARY.chembl_id,\n MOLECULE_DICTIONARY.pref_name\nFROM MOLECULE_DICTIONARY\nJOIN DRUG_INDICATION ON MOLECULE_DICTIONARY.molregno == DRUG_INDICATION.molregno\n'
SQL for ChEMBL to get molecules that have indications
- SQL = '\nSELECT\n MOLECULE_DICTIONARY.chembl_id,\n DRUG_INDICATION.mesh_id,\n DRUG_INDICATION.max_phase_for_ind\nFROM MOLECULE_DICTIONARY\nJOIN DRUG_INDICATION ON MOLECULE_DICTIONARY.molregno == DRUG_INDICATION.molregno\n'
SQL for ChEMBL to get indications
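To see what the indications query returns, it can be run against a toy database. The sketch below assumes an SQLite database (SQLite accepts the `==` operator used in the query). The two tables are reduced to the columns the query touches, and every row is made up for illustration.

```python
import sqlite3

# The indications query as documented on ChemblIndicationsProcessor.SQL.
SQL = """
SELECT
    MOLECULE_DICTIONARY.chembl_id,
    DRUG_INDICATION.mesh_id,
    DRUG_INDICATION.max_phase_for_ind
FROM MOLECULE_DICTIONARY
JOIN DRUG_INDICATION ON MOLECULE_DICTIONARY.molregno == DRUG_INDICATION.molregno
"""

conn = sqlite3.connect(":memory:")
# Toy schema and rows; identifiers are placeholders, not real ChEMBL or MESH IDs.
conn.executescript(
    """
    CREATE TABLE MOLECULE_DICTIONARY (molregno INTEGER, chembl_id TEXT, pref_name TEXT);
    CREATE TABLE DRUG_INDICATION (molregno INTEGER, mesh_id TEXT, max_phase_for_ind INTEGER);
    INSERT INTO MOLECULE_DICTIONARY VALUES (1, 'TOY_MOL_1', 'TOY DRUG ONE');
    INSERT INTO MOLECULE_DICTIONARY VALUES (2, 'TOY_MOL_2', 'TOY DRUG TWO');
    INSERT INTO DRUG_INDICATION VALUES (1, 'TOY_MESH_A', 4);
    INSERT INTO DRUG_INDICATION VALUES (1, 'TOY_MESH_B', 2);
    """
)
rows = sorted(conn.execute(SQL).fetchall())
conn.close()
```

The molecule without any indication row drops out of the result because the query uses an inner join.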
clinicaltrials Processor (indra_cogex.sources.clinicaltrials)
This module implements input for ClinicalTrials.gov data.
NOTE: ClinicalTrials.gov is working on a more modern API that is currently in beta: https://beta.clinicaltrials.gov/data-about-studies/learn-about-api Once this API is released, we should switch to using it. The instructions for using the current/old API are below.
To obtain the custom download for ingest, do the following:
1. Go to https://clinicaltrials.gov/api/gui/demo/simple_study_fields
2. Enter the following in the form:
expr=
fields=NCTId,BriefTitle,Condition,ConditionMeshTerm,ConditionMeshId,InterventionName,InterventionType,InterventionMeshTerm,InterventionMeshId
min_rnk=1
max_rnk=500000 # or any number larger than the current number of studies
fmt=csv
3. Send Request
4. Enter the captcha characters into the text box and then press enter (make sure to use the enter key and not press any buttons).
5. The website will display “please wait…” for a couple of minutes; after that, the Save to file button will become active.
6. Click the Save to file button to download the response as a txt file.
7. Rename the txt file to clinical_trials.csv, compress it with gzip clinical_trials.csv to get clinical_trials.csv.gz, and then place this file into <pystow home>/indra/cogex/clinicaltrials/
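The final rename, compress, and place step can also be scripted. The sketch below is one possible way to do it in Python. `stage_clinical_trials` is a hypothetical helper, and the pystow home directory is passed in explicitly rather than looked up, since its location depends on the local setup. The demonstration file contents are made up.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def stage_clinical_trials(downloaded_txt, pystow_home):
    """Gzip the download into <pystow_home>/indra/cogex/clinicaltrials/clinical_trials.csv.gz."""
    target_dir = Path(pystow_home) / "indra" / "cogex" / "clinicaltrials"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "clinical_trials.csv.gz"
    with open(downloaded_txt, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


# Demonstration in a temporary directory with a made-up one-row file.
with tempfile.TemporaryDirectory() as tmp:
    txt = Path(tmp) / "download.txt"
    txt.write_text("NCTId,BriefTitle\nNCT_PLACEHOLDER,Toy study\n")
    staged = stage_clinical_trials(txt, Path(tmp) / "pystow_home")
    with gzip.open(staged, "rt") as fh:
        first_line = fh.readline().strip()
    staged_name = staged.name
    staged_parent = staged.parent.name
```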
goa Processor (indra_cogex.sources.goa)
Processor for the Gene Ontology Associations (GOA) database.
INDRA DB Processor (indra_cogex.sources.indra_db)
Processor for the INDRA database.
- class DbProcessor(dir_path=None)[source]
Bases: Processor
Processor for the INDRA database.
Initialize the INDRA database processor.
INDRA Ontology Processor (indra_cogex.sources.indra_ontology)
Processor for the INDRA ontology.
InterPro Processor (indra_cogex.sources.interpro)
Processor for the InterPro database.
This was added in https://github.com/bgyori/indra_cogex/pull/125.
Odinson Processor (indra_cogex.sources.odinson)
The Odinson Processor
Pathways Processor (indra_cogex.sources.pathways)
PubMed Processor (indra_cogex.sources.pubmed)
Sider Processor (indra_cogex.sources.sider)
INDRA CoGEx Representation (indra_cogex.representation)
This documentation covers the helper functions and the Python objects that represent Neo4j nodes and relations.
Representations for nodes and relations to upload to Neo4j.
- class Node(db_ns, db_id, labels, data=None)[source]
Representation for a node.
Initialize the node.
- Parameters:
db_ns (str) – The namespace associated with the node. Uses the INDRA standard.
db_id (str) – The identifier within the namespace associated with the node. Uses the INDRA standard.
labels (Collection[str]) – A collection of labels for the node.
data (Optional[Mapping[str, Any]]) – An optional data dictionary associated with the node.
- class Relation(source_ns, source_id, target_ns, target_id, rel_type, data=None)[source]
Representation for a relation.
Initialize the relation.
- Parameters:
source_ns (str) – The namespace associated with the source node.
source_id (str) – The identifier within the namespace associated with the source node.
target_ns (str) – The namespace associated with the target node.
target_id (str) – The identifier within the namespace associated with the target node.
rel_type (str) – The type of relation.
data (Optional[Mapping[str, Any]]) – An optional data dictionary associated with the relation.
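To show how the constructor arguments fit together, the sketch below uses stand-in dataclasses that mirror the documented Node and Relation signatures. `ToyNode` and `ToyRelation` are hypothetical and are not the indra_cogex classes. The label "BioEntity", the relation type "associated_with", and the HGNC identifier are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any, Collection, Mapping, Optional


@dataclass
class ToyNode:
    """Stand-in mirroring the documented Node(db_ns, db_id, labels, data=None) signature."""

    db_ns: str
    db_id: str
    labels: Collection[str]
    data: Optional[Mapping[str, Any]] = None


@dataclass
class ToyRelation:
    """Stand-in mirroring the documented Relation signature."""

    source_ns: str
    source_id: str
    target_ns: str
    target_id: str
    rel_type: str
    data: Optional[Mapping[str, Any]] = None


# The GO identifier is the example used for go_term elsewhere in this reference.
go_term = ToyNode("GO", "GO:0006915", ["BioEntity"], {"name": "apoptotic process"})
gene = ToyNode("HGNC", "0000", ["BioEntity"])  # placeholder identifier, not a real gene
rel = ToyRelation(gene.db_ns, gene.db_id, go_term.db_ns, go_term.db_id, "associated_with")
```

A relation refers to its endpoints only by namespace and identifier, so the same pair of values must be used on the node and on the relation for the two to line up.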