Data
Data loading, preprocessing, and biological-data containers.
This submodule provides:
- Data containers (GeneData, EnzymeData, MediumData, MetaboliteData) that wrap biological measurements and align them with COBRA metabolic models.
- Fetching utilities (fetch_HPA_data, list_models, load_remote_model) for downloading data and models from public databases (HPA, BiGG, Metabolic Atlas).
- Preprocessing helpers (translate_gene_id, unify_score_column, transform_HPA_data, get_gene_id_map) for gene-ID translation, score unification, and HPA data pivoting.
- Synthetic data (get_syn_gene_data) for generating simulated gene-expression matrices for testing.
GeneData
GeneData(
data: Union[AnnData, Series, dict],
convert_to_str: bool = True,
expression_threshold: float = 0.0001,
absent_expression: float = 0,
data_transform=None,
discrete_transform=None,
ordered_thresholds: list = None,
)
Bases: BaseData
Store gene-expression data and compute reaction activity scores.
GeneData wraps a mapping of gene IDs to expression values and provides methods to:
- align the data with a COBRA model via a pipeGEM.analysis.RxnMapper,
- compute per-reaction activity scores using gene–protein–reaction (GPR) rules,
- apply global or local thresholding strategies,
- aggregate data across multiple samples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | pd.Series, anndata.AnnData, or dict | Gene-expression values keyed by gene ID. For | required |
| convert_to_str | bool | If | True |
| expression_threshold | float | Genes with expression below this value are set to absent_expression. Default is 0.0001. | 0.0001 |
| absent_expression | float | Value assigned to genes below expression_threshold. Default is 0. | 0 |
| data_transform | callable or str | Transformation applied to expression values when computing reaction scores (e.g. | None |
| discrete_transform | str, dict, or callable | Maps raw expression values to discrete levels before storing. Recognised strings: | None |
| ordered_thresholds | list of float | Ascending cut-offs for digitising expression into integer bins centred around zero. | None |
Attributes:
| Name | Type | Description |
|---|---|---|
| gene_data | dict[str, float] | Mapping of gene IDs to (possibly transformed) expression values. |
| genes | list[str] | Sorted list of gene IDs. |
| rxn_mapper | RxnMapper or None | Reaction mapper created by :meth: |
| data_transform | callable | The transformation applied to values when accessing :attr: |
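The interplay of expression_threshold and absent_expression can be pictured with a small stand-alone sketch. It does not call pipeGEM; the helper name apply_expression_threshold is hypothetical, and only the documented rule (values below the threshold are replaced by absent_expression) is modelled.

```python
def apply_expression_threshold(gene_data, expression_threshold=1e-4, absent_expression=0):
    """Return a copy in which values below the threshold become absent_expression."""
    return {
        gene: absent_expression if value < expression_threshold else value
        for gene, value in gene_data.items()
    }


# geneB falls below the default threshold of 0.0001 and is treated as absent.
raw = {"geneA": 10.5, "geneB": 0.00001, "geneC": 0.5}
cleaned = apply_expression_threshold(raw)
print(cleaned)
```

Returning a new dict keeps the raw measurements available for comparison; whether GeneData itself copies or mutates its input is not stated here.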
Examples:
>>> gd = GeneData({"geneA": 10.5, "geneB": 0.0})
>>> gd["geneA"]
10.5
>>> gd.align(model)
>>> gd.rxn_scores # reaction → score mapping
transformed_gene_data (property)
transformed_gene_data: Dict[str, float]
Gene data after applying the specified data_transform.
Returns:
| Type | Description |
|---|---|
| dict[str, float] | Dictionary mapping gene IDs to their transformed expression values. |
rxn_scores (property)
rxn_scores: Dict[str, float]
Reaction scores calculated by a RxnMapper. A RxnMapper assigns a score to each reaction in the aligned model based on its gene-reaction rule. By default, 'or' relationships are evaluated with max() and 'and' relationships with min(), representing isozymes and protein subunits, respectively. Users can choose a different operation for either relationship.
Returns:
| Name | Type | Description |
|---|---|---|
| rxn_scores | dict[str, float] | |
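The default min/max evaluation of gene-reaction rules can be illustrated without pipeGEM. The sketch below is an assumption-laden stand-in, not the RxnMapper implementation: the nested-tuple rule format and the function name gpr_score are hypothetical, and NaN values are skipped to mimic the documented 'nanmin' and 'nanmax' defaults.

```python
import math


def gpr_score(rule, gene_data, missing_value=math.nan):
    """Score a rule given as a gene ID or a tuple ("and" | "or", [sub-rules])."""
    if isinstance(rule, str):
        # Genes without data receive missing_value (NaN by default).
        return gene_data.get(rule, missing_value)
    operator, sub_rules = rule
    if operator not in ("and", "or"):
        raise ValueError(f"unknown operator: {operator!r}")
    scores = [gpr_score(sub, gene_data, missing_value) for sub in sub_rules]
    # NaN-aware reduction: genes without data are ignored, as nanmin/nanmax would do.
    known = [s for s in scores if not math.isnan(s)]
    if not known:
        return math.nan
    # "and" models subunits of one complex (weakest link); "or" models isozymes.
    return min(known) if operator == "and" else max(known)


expression = {"g1": 4.0, "g2": 9.0, "g3": 1.0}
# (g1 and g2) or g3: the complex is limited by g1; the isozyme g3 is weaker.
rule = ("or", [("and", ["g1", "g2"]), "g3"])
print(gpr_score(rule, expression))
```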
align
align(model, **kwargs)
Calculate rxn_scores using a metabolic model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | | The model with the genes and reactions to be mapped onto. | required |
| kwargs | | Keyword arguments used to create a RxnMapper object; the supported keys are listed after this table. | {} |
Supported keyword arguments:
- threshold (float or int, default 0): reaction scores below this threshold are assigned absent_value.
- absent_value (float or int, default 0): the value assigned to reactions whose score is lower than the threshold.
- missing_value (any, default np.nan): the value assigned to genes not included in gene_data.
- and_operation (str, default 'nanmin'): the operation used to evaluate 'and' gene-reaction relationships.
- or_operation (str, default 'nanmax'): the operation used to evaluate 'or' gene-reaction relationships.
- plus_operation (str, default 'nansum'): the operation used to evaluate 'plus' gene-reaction relationships.
Returns:
| Type | Description |
|---|---|
| None | |
transformed_rxn_scores
transformed_rxn_scores(func) -> dict
Get the transformed reaction activity scores.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| func | | Function used to transform the reaction score. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| transformed_rxn_scores | dict | A dict with reaction IDs as keys and transformed reaction scores as values. |
calc_rxn_score_stat
calc_rxn_score_stat(
rxn_ids,
ignore_na=True,
na_value=0,
return_if_all_na=-1,
method="mean",
) -> float
Calculate a statistic (mean or median) for reaction scores of specified reactions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| rxn_ids | list or set | IDs of the reactions to include in the calculation. | required |
| ignore_na | bool | If True, ignore NaN scores during calculation. | True |
| na_value | float | Value to replace NaN scores with if | 0 |
| return_if_all_na | float | Value to return if all selected reaction scores are NaN. | -1 |
| method | "mean" or "median" | The statistic to calculate. | "mean" |
Returns:
| Type | Description |
|---|---|
| float | The calculated statistic. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| AttributeError | If reaction scores have not been calculated yet (call |
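A stand-alone sketch of the NaN handling described for calc_rxn_score_stat follows. It works on a plain dict instead of a GeneData instance, the function name score_stat is hypothetical, and replacing NaN with na_value when ignore_na is False is an assumed reading of the parameter table rather than confirmed behaviour.

```python
import math
from statistics import mean, median


def score_stat(rxn_scores, rxn_ids, ignore_na=True, na_value=0,
               return_if_all_na=-1, method="mean"):
    """Mean or median of selected reaction scores with explicit NaN handling."""
    if method not in ("mean", "median"):
        raise ValueError(f"unsupported method: {method!r}")
    selected = [rxn_scores[rxn_id] for rxn_id in rxn_ids]
    # Nothing usable: return the sentinel instead of a statistic.
    if all(math.isnan(score) for score in selected):
        return return_if_all_na
    if ignore_na:
        selected = [score for score in selected if not math.isnan(score)]
    else:
        # Assumption: NaN scores are substituted by na_value in this branch.
        selected = [na_value if math.isnan(score) else score for score in selected]
    return mean(selected) if method == "mean" else median(selected)


scores = {"R1": 2.0, "R2": math.nan, "R3": 4.0}
print(score_stat(scores, ["R1", "R2", "R3"]))
print(score_stat(scores, ["R1", "R2", "R3"], ignore_na=False))
```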
apply
apply(func)
Apply a function to each reaction score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| func | callable | A function that takes a single reaction score (float) as input. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, float] | A dictionary mapping reaction IDs to the results of applying |
Raises:
| Type | Description |
|---|---|
| AttributeError | If reaction scores have not been calculated yet (call |
get_threshold
get_threshold(
name: str, transform: bool = True, **kwargs
) -> ALL_THRESHOLD_ANALYSES
Calculate expression thresholds for classifying genes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Thresholding method name (e.g. | required |
| transform | bool | If | True |
| **kwargs | | Additional keyword arguments forwarded to the threshold finder's | {} |
Returns:
| Type | Description |
|---|---|
| ThresholdAnalysis | A result object whose type depends on name (e.g. |
assign_local_threshold
assign_local_threshold(
local_threshold_result,
transform: bool = True,
method: Literal[
"binary", "ratio", "diff", "rdiff"
] = "binary",
group: str = None,
**kwargs
) -> None
Replace gene-expression values using per-gene local thresholds.
Modifies gene_data in place according to method:
- "binary": 1 if expression > threshold, else 0.
- "ratio": expression / threshold.
- "diff": threshold − expression (positive ⇒ under-expressed).
- "rdiff": expression − threshold (positive ⇒ over-expressed).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| local_threshold_result | LocalThresholdAnalysis | Result object containing per-gene thresholds (from :func: | required |
| transform | bool | If | True |
| method | "binary", "ratio", "diff", or "rdiff" | Comparison strategy (default "binary"). | "binary" |
| group | str | Column to select from | None |
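The four comparison strategies reduce to simple arithmetic on one expression value and one threshold. The following sketch restates them as a free function; compare_to_threshold is a hypothetical name, and the real method additionally handles transforms and group selection that are not modelled here.

```python
def compare_to_threshold(expression, threshold, method="binary"):
    """Apply one of the documented comparison strategies to a single gene."""
    if method == "binary":
        return 1 if expression > threshold else 0
    if method == "ratio":
        return expression / threshold
    if method == "diff":
        # Positive result means the gene is under-expressed.
        return threshold - expression
    if method == "rdiff":
        # Positive result means the gene is over-expressed.
        return expression - threshold
    raise ValueError(f"unknown method: {method!r}")


for strategy in ("binary", "ratio", "diff", "rdiff"):
    print(strategy, compare_to_threshold(6.0, 2.0, strategy))
```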
aggregate (classmethod)
aggregate(
data: Dict[
str, Dict[str, Union[Dict[str, GeneData], GeneData]]
],
method: str = "concat",
prop: Literal["data", "score"] = "data",
absent_expression: float = 0,
group_annotation: DataFrame = None,
) -> DataAggregation
Aggregate gene data or reaction scores from multiple sources.
Combines multiple GeneData objects into a single pipeGEM.analysis.DataAggregation result, either by concatenation or by applying a pandas aggregation method (e.g. "mean", "median").
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, dict[str, GeneData]] or dict[str, GeneData] | Gene-data objects to aggregate. When nested (two-level dict), columns are named | required |
| method | str | Aggregation method. | 'concat' |
| prop | "data" or "score" | Which property to extract from each | "data" |
| absent_expression | float | Fill value for missing genes (default 0). | 0 |
| group_annotation | DataFrame | Sample-level group labels. Its index must match the resulting column names when method is | None |
Returns:
| Type | Description |
|---|---|
| DataAggregation | Result object wrapping the aggregated DataFrame. |
Raises:
| Type | Description |
|---|---|
| AssertionError | If prop is not |
| ValueError | If group_annotation index does not overlap with aggregated column names. |
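The difference between concatenation and an aggregation method can be shown on plain dicts. This is a simplified sketch under stated assumptions: it uses the standard library instead of pandas, handles only a flat mapping of sample name to gene values, supports only "concat" and "mean", and the name aggregate_samples is hypothetical.

```python
from statistics import mean


def aggregate_samples(samples, method="concat", absent_expression=0):
    """Combine {sample: {gene: value}} into a gene-by-sample table or a per-gene mean."""
    genes = sorted(set().union(*(data.keys() for data in samples.values())))
    # Genes missing from a sample are filled with absent_expression.
    table = {
        gene: {name: data.get(gene, absent_expression) for name, data in samples.items()}
        for gene in genes
    }
    if method == "concat":
        return table
    if method == "mean":
        return {gene: mean(row.values()) for gene, row in table.items()}
    raise ValueError(f"unsupported method in this sketch: {method!r}")


samples = {"s1": {"a": 2.0, "b": 4.0}, "s2": {"a": 6.0}}
print(aggregate_samples(samples))
print(aggregate_samples(samples, method="mean"))
```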
MediumData
MediumData(
data,
conc_col_label="mmol/L",
conc_unit="mmol/L",
id_index=False,
name_index=True,
id_col_label="human_1",
name_col_label=None,
)
Bases: BaseData
Stores and processes medium composition data for constraining metabolic models.
This class handles loading medium data (metabolite concentrations), aligning it with exchange reactions in a metabolic model, and applying these concentrations as constraints on reaction bounds.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | DataFrame containing medium composition data. Must include columns for metabolite IDs and concentrations. Optionally includes metabolite names. | required |
| conc_col_label | str | The label of the column containing metabolite concentrations in | "mmol/L" |
| conc_unit | str | The unit of the concentrations provided in | "mmol/L" |
| id_index | bool | If True, assumes the DataFrame index contains the metabolite IDs. If False, uses the column specified by | False |
| name_index | bool | If True, assumes the DataFrame index contains the metabolite names. If False, uses the column specified by | True |
| id_col_label | str | The label of the column containing metabolite IDs, used if | "human_1" |
| name_col_label | str | The label of the column containing metabolite names, used if | None |
Attributes:
| Name | Type | Description |
|---|---|---|
| data_dict | dict | Dictionary mapping metabolite IDs to their concentrations. |
| rxn_dict | dict | Dictionary mapping exchange reaction IDs to corresponding metabolite concentrations after alignment with a model. |
| name_dict | dict | Dictionary mapping metabolite IDs to their names (if available). |
| conc_unit | Quantity | The concentration unit parsed by |
align
align(
model,
external_comp_name="e",
met_id_format="{met_id}{comp}",
raise_err=False,
)
Aligns medium metabolite data with exchange reactions in a metabolic model.
Iterates through the metabolites in data_dict and attempts to find
corresponding exchange reactions in the model. Populates rxn_dict
with mappings from reaction IDs to metabolite concentrations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Model or Model | The metabolic model to align against. | required |
| external_comp_name | str | The identifier for the external compartment in the model. | "e" |
| met_id_format | str | A format string to construct the full metabolite ID in the model, using the metabolite ID from the data ( | "{met_id}{comp}" |
| raise_err | bool | If True, raises a KeyError if a metabolite from the data cannot be found in the model's external compartment. If False, issues a warning. | False |
Returns:
| Type | Description |
|---|---|
| None | |
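How met_id_format turns a medium metabolite ID into a model metabolite ID can be sketched with Python's str.format. The placeholder names met_id and comp come from the default format string; that comp is filled with external_comp_name is an assumption, and both the helper name and the example IDs are illustrative only.

```python
def build_external_met_id(met_id, external_comp_name="e", met_id_format="{met_id}{comp}"):
    """Construct the model-side metabolite ID for the external compartment."""
    # Assumption: the {comp} placeholder receives the external compartment name.
    return met_id_format.format(met_id=met_id, comp=external_comp_name)


default_id = build_external_met_id("MAM01965")
underscored_id = build_external_met_id("glc__D", met_id_format="{met_id}_{comp}")
print(default_id, underscored_id)
```

Models that separate the compartment with an underscore or brackets would need a matching pattern such as "{met_id}_{comp}" or "{met_id}[{comp}]".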
apply
apply(
model,
cell_dgw=1e-12,
n_cells_per_l=1000000000.0,
time_hr=96,
flux_unit="mmol/g/hr",
threshold=1e-06,
)
Applies the medium constraints to the bounds of model reactions.
Calculates the maximum possible influx rate for each metabolite based on its concentration, cell density, dry weight, and time. Updates the lower or upper bounds of the corresponding exchange, sink, or demand reactions in the model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Model or Model | The metabolic model whose reaction bounds will be modified. | required |
| cell_dgw | float | Cell dry weight in grams. | 1e-12 |
| n_cells_per_l | float | Number of cells per liter of medium. | 1e9 |
| time_hr | float | Duration of the experiment or simulation in hours. | 96 |
| flux_unit | str | The desired unit for reaction fluxes in the model. The calculated influx bounds will be converted to this unit. | "mmol/g/hr" |
| threshold | float | A minimum absolute value for the calculated bound. Bounds smaller than this threshold will be set to this value (or its negative). Helps avoid numerical issues with zero bounds. | 1e-6 |
Returns:
| Type | Description |
|---|---|
| None | |
Notes
- Modifies the model object in place.
- Assumes exchange reactions consuming the metabolite have negative stoichiometry.
- Sets bounds for unconstrained inorganic exchanges or sinks/demands to 0 if they only produce/consume metabolites, respectively. Issues warnings for others.
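One plausible way to turn a concentration into a maximum influx rate is to divide it by the biomass per liter and the duration. The sketch below assumes exactly that formula, which is not confirmed by this page; it also skips unit conversion and returns only the magnitude of the bound. The function name max_influx_bound is hypothetical.

```python
def max_influx_bound(conc_mmol_per_l, cell_dgw=1e-12, n_cells_per_l=1e9,
                     time_hr=96, threshold=1e-6):
    """Estimate an uptake bound magnitude in mmol/g/hr (assumed formula)."""
    # Biomass density: grams of dry weight per liter of medium.
    biomass_g_per_l = cell_dgw * n_cells_per_l
    # mmol/L divided by g/L and hours gives mmol per gram per hour.
    rate = conc_mmol_per_l / (biomass_g_per_l * time_hr)
    # Very small bounds are raised to the threshold to avoid near-zero values.
    return max(rate, threshold)


# 5 mmol/L with the default cell parameters is roughly 52 mmol/g/hr.
print(max_influx_bound(5.0))
```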
from_catalog (classmethod)
from_catalog(medium, **kwargs)
Load a medium from the built-in pipeGEM.data.MediumCatalog.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| medium | MediumCatalog or str | A :class: | required |
| **kwargs | | Keyword arguments forwarded to :meth: | {} |
Returns:
| Type | Description |
|---|---|
| MediumData | |
Raises:
| Type | Description |
|---|---|
| ValueError | If medium is a string that does not match any catalog entry. |
| TypeError | If medium is neither a |
Examples:
>>> m9 = MediumData.from_catalog('M9')
>>> m9 = MediumData.from_catalog(MediumCatalog.M9)
supplement
supplement(met_id, concentration, name=None)
Add or update a metabolite in the medium.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| met_id | str | BiGG (or other scheme) metabolite identifier. | required |
| concentration | float | Concentration in the unit stored in :attr: | required |
| name | str | Human-readable name. If omitted, met_id is used as the name. | None |
Returns:
| Type | Description |
|---|---|
| MediumData | |
Warns:
| Type | Description |
|---|---|
| UserWarning | If :attr: |
remove
remove(met_id)
Remove a metabolite from the medium.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| met_id | str | Metabolite identifier to remove. | required |
Returns:
| Type | Description |
|---|---|
| MediumData | |
Raises:
| Type | Description |
|---|---|
| KeyError | If met_id is not present in :attr: |
Warns:
| Type | Description |
|---|---|
| UserWarning | If :attr: |
combine
combine(other, mode='union', conflict='max')
Combine two media into a new MediumData instance.
Concentrations in other are unit-converted to match self before combining. The original instances are not modified.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | MediumData | The second medium. | required |
| mode | "union" or "intersection" | | 'union' |
| conflict | "max", "min", "sum", "first", or "second" | How to resolve a metabolite present in both media: | 'max' |
Returns:
| Type | Description |
|---|---|
| MediumData | New instance with the combined composition (unit = |
Raises:
| Type | Description |
|---|---|
| TypeError | If other is not a :class: |
| ValueError | If mode or conflict is invalid, or if units are incompatible. |
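The mode and conflict options can be pictured on two plain dicts of metabolite concentrations. In this sketch the meaning of each conflict value is assumed from its name ("first" keeps the value of the first medium, "second" that of the other), unit conversion is left out, and combine_media is a hypothetical helper rather than the library method.

```python
def combine_media(first, second, mode="union", conflict="max"):
    """Merge two {metabolite: concentration} dicts without modifying either."""
    resolvers = {
        "max": max,
        "min": min,
        "sum": lambda a, b: a + b,
        "first": lambda a, b: a,
        "second": lambda a, b: b,
    }
    if mode not in ("union", "intersection"):
        raise ValueError(f"invalid mode: {mode!r}")
    if conflict not in resolvers:
        raise ValueError(f"invalid conflict: {conflict!r}")
    # Metabolites present in both media go through the conflict resolver.
    shared = first.keys() & second.keys()
    combined = {met: resolvers[conflict](first[met], second[met]) for met in shared}
    if mode == "union":
        # Metabolites unique to either medium are carried over unchanged.
        for source in (first, second):
            for met, conc in source.items():
                combined.setdefault(met, conc)
    return combined


base = {"glc": 5.0, "o2": 1.0}
extra = {"glc": 10.0, "nh4": 2.0}
print(combine_media(base, extra))
print(combine_media(base, extra, mode="intersection", conflict="sum"))
```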
from_file (classmethod)
from_file(file_name='DMEM', csv_kw=None, **kwargs)
Loads medium data from a file.
Supports TSV and CSV formats. Looks for the file in the standard medium/ directory relative to the package structure first. If not found there, attempts to load from the provided file_name path directly.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_name | str or Path | The base name of the medium file (e.g., "DMEM", "Hams") or a full path to a custom medium file. The method will try appending ".tsv" first, then assume CSV if not found or if | "DMEM" |
| csv_kw | dict | Keyword arguments to pass directly to | None |
| **kwargs | | Additional keyword arguments passed directly to the | {} |
Returns:
| Type | Description |
|---|---|
| MediumData | An instance of the MediumData class initialized with the loaded data. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the specified file cannot be found either in the default directory or at the provided path. |
| Exception | Propagates exceptions from |
EnzymeData
EnzymeData(
data: Union[DataFrame],
gene_id_col: Optional[str] = None,
prot_id_col: Optional[str] = None,
rxn_id_col: Optional[str] = None,
met_id_col: Optional[str] = None,
mw_col: str = "MW",
kcat_col: str = "Kcat",
alt_kcat_col: str = "DLKcat",
prot_seq_col: str = "Sequence",
ec_num_col: str = "EC",
sa_col: str = "SA",
)
Bases: BaseData
Store enzyme kinetic parameters for enzyme-constrained model construction.
Wraps a DataFrame of per-gene (or per-protein) kinetic data — kcat, molecular weight (MW), EC numbers, protein sequences, etc. — and provides methods to align it with a COBRA model and optionally run DLKcat for in-silico kcat prediction.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Enzyme kinetic data. Each row represents one gene/protein entry. | required |
| gene_id_col | str | Column to use as gene IDs (index). If | None |
| prot_id_col | str | Column containing protein/UniProt IDs. If | None |
| rxn_id_col | str | Column containing reaction IDs. If | None |
| met_id_col | str | Column containing metabolite IDs associated with each reaction. | None |
| mw_col | str | Column for molecular weight values (default 'MW'). | 'MW' |
| kcat_col | str | Column for experimentally measured kcat values (default 'Kcat'). | 'Kcat' |
| alt_kcat_col | str | Column for alternative (predicted) kcat values, e.g. from DLKcat (default 'DLKcat'). | 'DLKcat' |
| prot_seq_col | str | Column for amino-acid sequences (default 'Sequence'). | 'Sequence' |
| ec_num_col | str | Column for EC numbers (default 'EC'). | 'EC' |
| sa_col | str | Column for specific activity values (default 'SA'). | 'SA' |
Attributes:
| Name | Type | Description |
|---|---|---|
| prot_id_col | str or None | Protein ID column name. |
| mw_col, kcat_col, alt_kcat_col, prot_seq_col, ec_num_col, sa_col | str | Column names for the respective fields. |
calc_molecular_weight (staticmethod)
calc_molecular_weight(seq: str) -> float
Estimate protein molecular weight from an amino-acid sequence.
Uses average amino-acid residue masses and adds 18.02 Da for the terminal water molecule. Non-standard characters are silently skipped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seq | str | One-letter amino-acid sequence. | required |
Returns:
| Type | Description |
|---|---|
| float | Approximate molecular weight in Daltons. |
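The estimate described above (sum of average residue masses plus one water) can be reproduced in a few lines. The mass table below lists commonly tabulated average residue masses and is an assumption; the values used inside pipeGEM may differ slightly, and estimate_molecular_weight is a hypothetical name.

```python
# Average residue masses in Daltons (amino acid minus one water); assumed values.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.02  # terminal water, as stated in the description


def estimate_molecular_weight(seq):
    """Sum residue masses, skip unknown characters, and add one water."""
    return sum(AVERAGE_RESIDUE_MASS.get(residue, 0.0) for residue in seq) + WATER_MASS


# A lone glycine comes out near its free-amino-acid mass of about 75.07 Da.
print(round(estimate_molecular_weight("G"), 4))
```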
check_gene_rxn_pair
check_gene_rxn_pair(
ref_model, raise_err: bool = True
) -> None
Checks if the gene-reaction pairs in the enzyme data exist in the reference model.
Iterates through the enzyme DataFrame and verifies that for each row, the gene (index or gene_id_col) is associated with the reaction specified in _rxn_id_col within the ref_model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ref_model | Model or Model | The metabolic model used as a reference. | required |
| raise_err | bool | If True, raises a ValueError upon finding a mismatch. If False, issues a warning instead. | True |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| AttributeError | If |
| KeyError | If a reaction ID from the data is not found in the model. |
rxn_items
rxn_items() -> Dict[str, Dict[str, Union[str, float]]]
Returns a dictionary mapping reaction IDs to their best-matched enzyme data.
Requires the .align() method to be called first to populate the _best_matched_df.
Returns:
| Type | Description |
|---|---|
| Dict[str, Dict[str, Union[str, float]]] | A dictionary where keys are reaction IDs and values are dictionaries containing 'protein_to_use' (protein ID), 'best_kcat' (kcat value), and 'best_mw' (molecular weight). |
Raises:
| Type | Description |
|---|---|
| AttributeError | If |
run_DLKcat
run_DLKcat(
met_data: MetaboliteData, device: str = "cpu"
) -> None
Runs the DLKcat tool to predict kcat values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| met_data | MetaboliteData | Metabolite data object containing SMILES information needed by DLKcat. | required |
| device | str | Device to run DLKcat on ('cpu' or 'cuda' if available). | "cpu" |
Notes
Predictions are stored in alt_kcat_col. Existing positive values in kcat_col are treated as curated values and are not overwritten.
align
align(
model,
check_and_raise=True,
run_DLKcat=True,
device="cpu",
)
Align enzyme data with a metabolic model.
Maps genes in the enzyme DataFrame to their corresponding reactions and metabolites using the model's GPR rules. Optionally runs DLKcat to predict missing kcat values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Model or Model | The metabolic model to align against. | required |
| check_and_raise | bool | If | True |
| run_DLKcat | bool | If | True |
| device | str | PyTorch device for DLKcat ( | 'cpu' |
MetaboliteData
MetaboliteData(
data: Union[DataFrame],
met_id_col: Optional[str] = None,
smiles_col="SMILES",
)
Bases: BaseData
Store metabolite structural data (SMILES) for enzyme-constrained modelling.
Used by EnzymeData when running DLKcat predictions, which require substrate SMILES strings alongside protein sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | DataFrame containing at least a SMILES column. Rows represent metabolites. | required |
| met_id_col | str | Column whose values should be used as the DataFrame index (metabolite IDs). If | None |
| smiles_col | str | Name of the column holding SMILES strings (default 'SMILES'). | 'SMILES' |
Raises:
| Type | Description |
|---|---|
| KeyError | If smiles_col is not found among the DataFrame columns. |
Attributes:
| Name | Type | Description |
|---|---|---|
| smiles_col | str | Column name used for SMILES look-ups. |
get_smiles
get_smiles(ids)
Retrieve SMILES string(s) for the given metabolite ID(s).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ids | str or list of str | One or more metabolite IDs. | required |
Returns:
| Type | Description |
|---|---|
| str or ndarray | A single SMILES string when ids is a |
MediumInfo (dataclass)
MediumInfo(
name: str,
description: str,
source: str,
organism: str,
medium_type: str,
default_id_col: str = "BiGG",
is_approximate: bool = False,
source_url: str = "",
composition_url: str = "",
composition_note: str = "",
)
Metadata for a named medium in the catalog.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | Filename stem (without .tsv) used to locate the medium file. |
| description | str | Short human-readable description. |
| source | str | Literature citation or derivation note. |
| organism | str | Intended host organism(s). |
| medium_type | str | One of |
| default_id_col | str | Column label in the TSV that contains metabolite IDs (default "BiGG"). |
| is_approximate | bool | |
| source_url | str | URL for the original citation or source publication, when known. |
| composition_url | str | URL for the formulation used to build or validate the bundled TSV. |
| composition_note | str | Short note describing whether the TSV is an exact defined recipe, an ionized exchange representation, or an approximate proxy. |
MediumCatalog
Bases: Enum
Catalog of named media bundled with pipeGEM.
Each member maps to a MediumInfo instance that carries metadata and the TSV filename. Use pipeGEM.data.MediumData.from_catalog to load a medium by catalog entry.
Examples:
>>> from pipeGEM.data import MediumCatalog
>>> MediumCatalog.M9.value.description
'M9 minimal salts medium with glucose'
>>> MediumCatalog.LB.value.is_approximate
True
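The pattern of an Enum whose members carry dataclass metadata, together with lookup by member or by name, can be sketched in isolation. DemoInfo, DemoCatalog, and resolve_medium are hypothetical stand-ins with made-up descriptions; they mirror the documented ValueError and TypeError behaviour of from_catalog but are not the pipeGEM classes.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class DemoInfo:
    """Minimal metadata record; frozen so instances are hashable Enum values."""
    name: str
    description: str
    is_approximate: bool = False


class DemoCatalog(Enum):
    M9 = DemoInfo("M9", "demo minimal medium")
    LB = DemoInfo("LB", "demo complex medium", is_approximate=True)


def resolve_medium(medium):
    """Accept a catalog member or its name and return the metadata record."""
    if isinstance(medium, DemoCatalog):
        return medium.value
    if isinstance(medium, str):
        try:
            return DemoCatalog[medium].value
        except KeyError:
            raise ValueError(f"unknown medium: {medium!r}") from None
    raise TypeError(f"expected DemoCatalog or str, got {type(medium).__name__}")


print(resolve_medium("M9").description)
print(resolve_medium(DemoCatalog.LB).is_approximate)
```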
find_local_threshold
find_local_threshold(
data_df, **kwargs
) -> ALL_THRESHOLD_ANALYSES
Compute per-gene local expression thresholds across multiple samples.
This is a convenience wrapper that creates a "local" threshold finder and calls its find_threshold method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_df | DataFrame | Expression matrix with genes as rows and samples (or groups) as columns. | required |
| **kwargs | | Forwarded to the local threshold finder (e.g. | {} |
Returns:
| Type | Description |
|---|---|
| LocalThresholdAnalysis | Result object containing per-gene threshold values accessible via its |
See Also
GeneData.get_threshold : Instance method that delegates to any registered threshold finder.
load_remote_model
load_remote_model(
model_id,
format="mat",
branch="main",
download_dest="default",
)
Load a metabolic model from a remote database (BiGG or Metabolic Atlas).
If the model_id is found in the BiGG database, it is loaded directly using cobrapy. Otherwise, it attempts to download the model from the Metabolic Atlas GitHub repository.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_id | str | The ID of the model to load. | required |
| format | str | The format of the model file to download (e.g., "mat", "xml", "yml"). Defaults to "mat". | 'mat' |
| branch | str | The GitHub branch to download the model from for Metabolic Atlas models. Defaults to "main". | 'main' |
| download_dest | str | The destination directory to download the model to. Defaults to "default", which saves to a 'models' directory relative to the project root. | 'default' |
Returns:
| Type | Description |
|---|---|
| Model | The loaded metabolic model. |
list_models
list_models(
databases=["metabolic atlas", "BiGG"],
organism=None,
max_n_rxns=np.inf,
max_n_mets=np.inf,
max_n_genes=np.inf,
**kwargs
) -> pd.DataFrame
List available metabolic models from specified databases with optional filtering.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| databases | List[str] | A list of database names to fetch models from (e.g., ["metabolic atlas", "BiGG"]). Defaults to ["metabolic atlas", "BiGG"]. | ['metabolic atlas', 'BiGG'] |
| organism | str | Filter models by organism name (e.g., "human", "mouse"). Case-insensitive. | None |
| max_n_rxns | float | Maximum number of reactions allowed in the models. Defaults to infinity. | inf |
| max_n_mets | float | Maximum number of metabolites allowed in the models. Defaults to infinity. | inf |
| max_n_genes | float | Maximum number of genes allowed in the models. Defaults to infinity. | inf |
| **kwargs | | Additional keyword arguments for DataBaseFetcherIniter. | {} |
Returns:
| Type | Description |
|---|---|
| DataFrame | A DataFrame containing information about the available models, including 'id', 'organism', 'reaction_count', 'metabolite_count', 'gene_count', and 'database'. Returns an empty DataFrame if no data is fetched. |
fetch_HPA_data
fetch_HPA_data(
data_name: str,
data_path: Union[str, Path] = Path(
__file__
).parent.parent.parent
/ Path("external_data/HPA"),
) -> dict
Fetch Human Protein Atlas (HPA) data.
Downloads the specified HPA dataset if it doesn't exist locally.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_name | str | The name of the HPA dataset to fetch (e.g., 'rna_tissue_consensus'). | required |
| data_path | Union[str, Path] | The directory path to save or load the data from. Defaults to 'external_data/HPA' relative to the project root. | Path(__file__).parent.parent.parent / Path("external_data/HPA") |
Returns:
| Type | Description |
|---|---|
| dict | A dictionary containing the path to the downloaded TSV file under the key "data_path". |
get_syn_gene_data
get_syn_gene_data(
model: Union[Model, Model],
n_sample: int,
n_genes: Optional[int] = None,
groups: Optional[str] = None,
random_state: int = 42,
returned_dtype: str = "DataFrame",
) -> Union[pd.DataFrame, AnnData]
Generate synthetic gene expression data with a given number of samples and genes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Model or Model | A model containing information about the genes to simulate expression data for. | required |
| n_sample | int | The number of samples to generate. | required |
| n_genes | int | The number of genes to simulate expression data for. If None, use all genes in the model (default=None). | None |
| groups | str | The name of the attribute containing group information for the genes (default=None). | None |
| random_state | int | The random seed to use for generating the data (default=42). | 42 |
| returned_dtype | str | The type of object to return. Must be either 'DataFrame' or 'AnnData' (default='DataFrame'). | 'DataFrame' |
Returns:
| Type | Description |
|---|---|
| Union[DataFrame, AnnData] | The simulated gene expression data. If returned_dtype is 'DataFrame', returns a pandas DataFrame with gene IDs as the index and sample IDs as the columns. If returned_dtype is 'AnnData', returns an AnnData object with the simulated expression data as the X attribute, and empty obs and var attributes. |
transform_HPA_data
transform_HPA_data(
data_df,
categories: List[str],
gene_id_col: str = "entrezgene",
score_col_name: str = "score",
)
Pivot HPA data into a gene × sample expression matrix.
Groups rows by gene_id_col and the specified categories, averages duplicate entries, then pivots so that each unique combination of category values becomes a column (sample).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_df | DataFrame | Filtered HPA DataFrame (e.g. output of :func: | required |
| categories | list of str | Column names that together define a "sample" (e.g. | required |
| gene_id_col | str | Column (or | 'entrezgene' |
| score_col_name | str | Column with numeric expression scores (default 'score'). | 'score' |
Returns:
| Type | Description |
|---|---|
| dict | |
Raises:
| Type | Description |
|---|---|
| ValueError | If categories is empty (at least one sample column is required). |
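The group, average, and pivot steps can be followed on a list of row dicts. The sketch below uses only the standard library where the real function works on a DataFrame; pivot_scores is a hypothetical name, "Tissue" is an illustrative category column, and a nested dict stands in for the gene × sample matrix.

```python
from collections import defaultdict
from statistics import mean


def pivot_scores(rows, categories, gene_id_col="entrezgene", score_col_name="score"):
    """Average duplicate (gene, sample) entries and pivot to {gene: {sample: score}}."""
    if not categories:
        raise ValueError("at least one category column is required")
    grouped = defaultdict(list)
    for row in rows:
        # A sample is the unique combination of category values.
        sample = tuple(row[column] for column in categories)
        grouped[(row[gene_id_col], sample)].append(row[score_col_name])
    matrix = defaultdict(dict)
    for (gene, sample), scores in grouped.items():
        matrix[gene][sample] = mean(scores)
    return dict(matrix)


rows = [
    {"entrezgene": "1", "Tissue": "liver", "score": 2.0},
    {"entrezgene": "1", "Tissue": "liver", "score": 4.0},
    {"entrezgene": "1", "Tissue": "lung", "score": 1.0},
    {"entrezgene": "2", "Tissue": "liver", "score": 5.0},
]
print(pivot_scores(rows, ["Tissue"]))
```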
unify_score_column
unify_score_column(
data_df: DataFrame,
level_dic: Dict[str, float],
score_col_name: str,
) -> (
pd.DataFrame,
Dict[str, Dict[str, Tuple[float, float]]],
)
Convert heterogeneous HPA expression columns into a single score.
Handles three cases depending on which columns are present:
- Level count columns (e.g. "High", "Medium", "Low"): compute a weighted average using level_dic as weights and return continuous CORDA thresholds.
- A "Level" column with discrete labels: map labels to numeric scores via level_dic and return discrete CORDA thresholds.
- Quantitative columns ("pTPM" or "NX"): rename the first matching column to score_col_name and return None thresholds (thresholding left to downstream methods).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_df | DataFrame | HPA expression DataFrame (one row per gene × tissue/cell-type). | required |
| level_dic | dict[str, float] | Mapping from expression-level labels (e.g. | required |
| score_col_name | str | Name for the unified score column added to the returned DataFrame. | required |
Returns:
| Type | Description |
|---|---|
| dict | |
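The first case, a weighted average over level counts, might look like the sketch below. Normalising by the total count is an assumption about how the weights are applied, the numeric weights in level_dic are illustrative, and weighted_level_score is a hypothetical helper that handles a single row.

```python
def weighted_level_score(level_counts, level_dic):
    """Average the level weights, weighted by how often each level was observed."""
    total = sum(level_counts.get(level, 0) for level in level_dic)
    if total == 0:
        return float("nan")  # no observations for any known level
    weighted = sum(level_counts.get(level, 0) * weight
                   for level, weight in level_dic.items())
    return weighted / total


level_dic = {"High": 3.0, "Medium": 2.0, "Low": 1.0}
# One "High", one "Medium", and two "Low" observations for a gene.
counts = {"High": 1, "Medium": 1, "Low": 2}
print(weighted_level_score(counts, level_dic))
```

The second case, a single discrete label per row, would reduce to a direct lookup such as level_dic["High"].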
translate_gene_id
translate_gene_id(
data_df: DataFrame,
map_df: DataFrame,
gene_col: str,
to_id: str,
)
Translate gene identifiers in a DataFrame using a mapping.
Adds a new column (or replaces the index) with the translated IDs and drops rows that could not be mapped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_df | DataFrame | DataFrame containing the original gene identifiers. | required |
| map_df | DataFrame or dict | Mapping from original IDs to target IDs. If a DataFrame, it is used via :meth: | required |
| gene_col | str | Column in data_df that holds the source gene IDs. Use | required |
| to_id | str | Name for the new translated-ID column. If | required |
Returns:
| Type | Description |
|---|---|
| dict | |
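Translating identifiers and dropping unmapped rows can be sketched on a list of row dicts with a plain dict as the mapping. translate_ids is a hypothetical stand-in for the DataFrame-based function, and the gene identifiers in the example are illustrative.

```python
def translate_ids(rows, id_map, gene_col, to_id):
    """Add a translated-ID field to each row and drop rows without a mapping."""
    translated = []
    for row in rows:
        new_id = id_map.get(row[gene_col])
        if new_id is None:
            continue  # rows that cannot be mapped are dropped
        translated.append({**row, to_id: new_id})
    return translated


rows = [
    {"ensembl": "ENSG_A", "score": 1.2},
    {"ensembl": "ENSG_B", "score": 0.4},
    {"ensembl": "ENSG_C", "score": 3.3},
]
id_map = {"ENSG_A": "101", "ENSG_C": "303"}
print(translate_ids(rows, id_map, gene_col="ensembl", to_id="entrezgene"))
```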
get_gene_id_map
get_gene_id_map(
gene_names: List[str],
from_id: str,
to_id: str,
df_path: Union[PathLike, str],
dataset: Union[str, Dict] = "hsapiens_gene_ensembl",
ds_kws: Optional[Dict] = None,
map_type: str = "df",
drop_unused: bool = False,
ref_model: Optional[Model] = None,
)
Get a gene ID mapper from local path or BioMart.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| gene_names | list of str | The gene names or IDs to be translated into other gene names or IDs. | required |
| from_id | str | The name of the current IDs (e.g. | required |
| to_id | str | The name of the transformed IDs (e.g. | required |
| df_path | path-like or str | Path to cache the mapping DataFrame as a TSV file. If | required |
| dataset | str | BioMart dataset name (default 'hsapiens_gene_ensembl'). | 'hsapiens_gene_ensembl' |
| ds_kws | dict | Backward-compatible dataset keyword dictionary used by earlier versions. If supplied, | None |
| map_type | str | | 'df' |
| drop_unused | bool | If | False |
| ref_model | Model | Reference model used when drop_unused is | None |
Returns:
| Type | Description |
|---|---|
| dict | |