
Data

Data loading, preprocessing, and biological-data containers.

This submodule provides:

  • Data containers — :class:GeneData, :class:EnzymeData, :class:MediumData, :class:MetaboliteData — that wrap biological measurements and align them with COBRA metabolic models.
  • Fetching utilities — :func:fetch_HPA_data, :func:list_models, :func:load_remote_model — for downloading data and models from public databases (HPA, BiGG, Metabolic Atlas).
  • Preprocessing helpers — :func:translate_gene_id, :func:unify_score_column, :func:transform_HPA_data, :func:get_gene_id_map — for gene-ID translation, score unification, and HPA data pivoting.
  • Synthetic data — :func:get_syn_gene_data — for generating simulated gene-expression matrices for testing.

GeneData

GeneData(
    data: Union[AnnData, Series, dict],
    convert_to_str: bool = True,
    expression_threshold: float = 0.0001,
    absent_expression: float = 0,
    data_transform=None,
    discrete_transform=None,
    ordered_thresholds: list = None,
)

Bases: BaseData

Store gene-expression data and compute reaction activity scores.

GeneData wraps a mapping of gene IDs to expression values and provides methods to:

  • align the data with a COBRA model via a :class:~pipeGEM.analysis.RxnMapper,
  • compute per-reaction activity scores using gene–protein–reaction (GPR) rules,
  • apply global or local thresholding strategies,
  • aggregate data across multiple samples.

Parameters:

Name Type Description Default
data pd.Series, anndata.AnnData, or dict

Gene-expression values keyed by gene ID. For AnnData, the object must contain exactly one observation (row).

required
convert_to_str bool

If True (default), gene IDs are cast to strings.

True
expression_threshold float

Genes with expression below this value are set to absent_expression. Default is 1e-4.

0.0001
absent_expression float

Value assigned to genes below expression_threshold. Default is 0.

0
data_transform callable or str

Transformation applied to expression values when computing reaction scores (e.g. np.log2, "log2"). None means identity.

None
discrete_transform str, dict, or callable

Maps raw expression values to discrete levels before storing. Recognised strings: "HPA" (Human Protein Atlas scoring). A dict is used as a direct look-up table.

None
ordered_thresholds list of float

Ascending cut-offs for digitising expression into integer bins centred around zero.

None

Attributes:

Name Type Description
gene_data dict[str, float]

Mapping of gene IDs to (possibly transformed) expression values.

genes list[str]

Sorted list of gene IDs.

rxn_mapper RxnMapper or None

Reaction mapper created by :meth:align; None before alignment.

data_transform callable

The transformation applied to values when accessing :attr:rxn_scores or :attr:transformed_gene_data.

Examples:

>>> gd = GeneData({"geneA": 10.5, "geneB": 0.0})
>>> gd["geneA"]
10.5
>>> gd.align(model)
>>> gd.rxn_scores  # reaction → score mapping
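The interaction of convert_to_str, expression_threshold, and absent_expression can be sketched in plain Python. This is only an illustration of the parameter descriptions above, not pipeGEM's implementation: the helper name apply_expression_threshold is hypothetical, and the strict "below" comparison is an assumption.

```python
def apply_expression_threshold(values, expression_threshold=1e-4,
                               absent_expression=0.0, convert_to_str=True):
    """Set values below the threshold to absent_expression (illustrative only)."""
    cleaned = {}
    for gene, value in values.items():
        key = str(gene) if convert_to_str else gene
        # Assumption: "below" means strictly less than the threshold.
        cleaned[key] = absent_expression if value < expression_threshold else value
    return cleaned


raw = {"geneA": 10.5, "geneB": 0.00005, 101: 2.0}
cleaned = apply_expression_threshold(raw)
print(cleaned)  # {'geneA': 10.5, 'geneB': 0.0, '101': 2.0}
```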

transformed_gene_data property

transformed_gene_data: Dict[str, float]

Gene data after applying the specified data_transform.

Returns:

Type Description
dict[str, float]

Dictionary mapping gene IDs to their transformed expression values.

rxn_scores property

rxn_scores: Dict[str, float]

Reaction scores calculated by a RxnMapper, which assigns a score to each reaction in the aligned model based on its gene-reaction relationship. By default, 'or' relationships are evaluated with max() and 'and' relationships with min(), representing isozymes and protein subunits, respectively. Users can substitute a different operation for either relationship.

Returns:

Name Type Description
rxn_scores dict[str, float]
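As a rough illustration of the default rule described above, the sketch below scores a gene-reaction rule with min() for 'and' and max() for 'or'. The nested-tuple rule format and the function name score_gpr are hypothetical conveniences for this example. They are not how RxnMapper represents rules.

```python
def score_gpr(rule, expression, and_op=min, or_op=max):
    """Recursively score a rule given per-gene expression values."""
    if isinstance(rule, str):          # a leaf is a gene ID
        return expression[rule]
    operator, *operands = rule         # e.g. ("and", "g1", "g2")
    scores = [score_gpr(r, expression, and_op, or_op) for r in operands]
    if operator == "and":              # protein subunits: limited by the lowest
        return and_op(scores)
    if operator == "or":               # isozymes: the highest one suffices
        return or_op(scores)
    raise ValueError(f"unknown operator: {operator!r}")


expression = {"g1": 5.0, "g2": 1.0, "g3": 3.0}
rule = ("or", ("and", "g1", "g2"), "g3")   # (g1 and g2) or g3
score = score_gpr(rule, expression)
print(score)  # 3.0, i.e. max(min(5.0, 1.0), 3.0)
```

Passing a different callable, for example and_op=sum, mimics the idea that the operation behind each relationship can be swapped.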

align

align(model, **kwargs)

Calculate rxn_scores using a metabolic model.

Parameters:

Name Type Description Default
model

The model containing the genes and reactions onto which the data is mapped.

required
kwargs

Keyword arguments used to create a RxnMapper object, including:

  • threshold: float or int, default = 0. The absent_value will be assigned to the rxn_scores below this threshold.
  • absent_value: float or int, default = 0. The value assigned to the reactions with score lower than the threshold.
  • missing_value: any, default = np.nan. The value assigned to the genes not included in the gene_data.
  • and_operation: str, default = 'nanmin'. The operation name used to calculate the 'and' gene-reaction relationships.
  • or_operation: str, default = 'nanmax'. The operation name used to calculate the 'or' gene-reaction relationships.
  • plus_operation: str, default = 'nansum'. The operation name used to calculate the 'plus' gene-reaction relationships.

Valid operations include:

  • nanmin: return the minimum while ignoring all NaN values
  • nanmax: return the maximum while ignoring all NaN values
  • nansum: return the sum of the expression values while ignoring all NaN values
  • nanmean: return the mean of the expression values while ignoring all NaN values

{}

Returns:

Type Description
None
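The NaN-ignoring operations and the threshold / absent_value pair can be illustrated with a small standard-library sketch. The helper names nan_op and apply_score_threshold are hypothetical, and returning NaN when every input is NaN is an assumption made for this example. The library may behave differently in that case.

```python
import math


def nan_op(name, values):
    """Apply a NaN-ignoring reduction selected by name (illustrative only)."""
    kept = [v for v in values if not math.isnan(v)]
    if not kept:                      # assumption: all-NaN input stays NaN
        return math.nan
    if name == "nanmin":
        return min(kept)
    if name == "nanmax":
        return max(kept)
    if name == "nansum":
        return sum(kept)
    if name == "nanmean":
        return sum(kept) / len(kept)
    raise ValueError(f"unknown operation: {name!r}")


def apply_score_threshold(score, threshold=0, absent_value=0):
    """Replace scores below the threshold with absent_value; keep NaN as is."""
    if math.isnan(score):
        return score
    return absent_value if score < threshold else score


subunits = [4.0, math.nan, 2.5]       # the NaN stands for a gene missing from the data
print(nan_op("nanmin", subunits))     # 2.5
print(nan_op("nanmax", subunits))     # 4.0
print(apply_score_threshold(2.5, threshold=3, absent_value=0))  # 0
```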

transformed_rxn_scores

transformed_rxn_scores(func) -> dict

Get the transformed reaction activity scores.

Parameters:

Name Type Description Default
func

Function used to transform the reaction score

required

Returns:

Name Type Description
transformed_rxn_scores dict

A dict with reaction IDs as keys and transformed reaction scores as values.

calc_rxn_score_stat

calc_rxn_score_stat(
    rxn_ids,
    ignore_na=True,
    na_value=0,
    return_if_all_na=-1,
    method="mean",
) -> float

Calculate a statistic (mean or median) for reaction scores of specified reactions.

Parameters:

Name Type Description Default
rxn_ids list or set

IDs of the reactions to include in the calculation.

required
ignore_na bool

If True, ignore NaN scores during calculation.

True
na_value float

Value to replace NaN scores with if ignore_na is False.

0
return_if_all_na float

Value to return if all selected reaction scores are NaN.

-1
method (mean, median)

The statistic to calculate.

"mean"

Returns:

Type Description
float

The calculated statistic.

Raises:

Type Description
ValueError

If method is not "mean" or "median".

AttributeError

If reaction scores have not been calculated yet (call .align() first).
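The way ignore_na, na_value, and return_if_all_na combine can be sketched as follows. This is a standalone approximation of the documented behaviour. The function name rxn_score_stat is hypothetical, and checking for the all-NaN case before applying na_value is an assumption.

```python
import math
import statistics


def rxn_score_stat(rxn_scores, rxn_ids, ignore_na=True, na_value=0,
                   return_if_all_na=-1, method="mean"):
    """Mean or median of selected reaction scores with NaN handling."""
    if method not in ("mean", "median"):
        raise ValueError("method must be 'mean' or 'median'")
    scores = [rxn_scores[r] for r in rxn_ids]
    if all(math.isnan(s) for s in scores):
        return return_if_all_na
    if ignore_na:
        scores = [s for s in scores if not math.isnan(s)]      # drop NaN
    else:
        scores = [na_value if math.isnan(s) else s for s in scores]  # fill NaN
    stat = statistics.mean if method == "mean" else statistics.median
    return stat(scores)


scores = {"R1": 2.0, "R2": math.nan, "R3": 4.0}
print(rxn_score_stat(scores, ["R1", "R2", "R3"]))                   # 3.0
print(rxn_score_stat(scores, ["R1", "R2", "R3"], ignore_na=False))  # 2.0
print(rxn_score_stat(scores, ["R2"]))                               # -1
```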

apply

apply(func)

Apply a function to each reaction score.

Parameters:

Name Type Description Default
func callable

A function that takes a single reaction score (float) as input.

required

Returns:

Type Description
dict[str, float]

A dictionary mapping reaction IDs to the results of applying func to their scores.

Raises:

Type Description
AttributeError

If reaction scores have not been calculated yet (call .align() first).

get_threshold

get_threshold(
    name: str, transform: bool = True, **kwargs
) -> ALL_THRESHOLD_ANALYSES

Calculate expression thresholds for classifying genes.

Parameters:

Name Type Description Default
name str

Thresholding method name (e.g. "percentile", "rFastCormic", "local"). Passed to :func:~pipeGEM.analysis.threshold_finders.create.

required
transform bool

If True (default), apply :attr:data_transform to the gene data before computing thresholds.

True
**kwargs

Additional keyword arguments forwarded to the threshold finder's find_threshold method.

{}

Returns:

Type Description
ThresholdAnalysis

A result object whose type depends on name (e.g. PercentileThresholdAnalysis, rFastCormicThresholdAnalysis).

assign_local_threshold

assign_local_threshold(
    local_threshold_result,
    transform: bool = True,
    method: Literal[
        "binary", "ratio", "diff", "rdiff"
    ] = "binary",
    group: str = None,
    **kwargs
) -> None

Replace gene-expression values using per-gene local thresholds.

Modifies :attr:gene_data in place according to method:

  • "binary": 1 if expression > threshold, else 0.
  • "ratio": expression / threshold.
  • "diff": threshold − expression (positive ⇒ under-expressed).
  • "rdiff": expression − threshold (positive ⇒ over-expressed).

Parameters:

Name Type Description Default
local_threshold_result LocalThresholdAnalysis

Result object containing per-gene thresholds (from :func:find_local_threshold).

required
transform bool

If True (default), apply :attr:data_transform to both gene values and thresholds before comparison.

True
method (binary, ratio, diff, rdiff)

Comparison strategy (default "binary").

"binary"
group str

Column to select from local_threshold_result.exp_ths. Defaults to "exp_th".

None
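The four comparison strategies listed for method can be written out as a short standalone sketch. The function compare_to_threshold and the example values are hypothetical. The real method also handles data_transform and group selection, which are omitted here.

```python
def compare_to_threshold(expression, threshold, method="binary"):
    """Compare one expression value with its per-gene threshold."""
    if method == "binary":
        return 1 if expression > threshold else 0
    if method == "ratio":
        return expression / threshold
    if method == "diff":               # positive means under-expressed
        return threshold - expression
    if method == "rdiff":              # positive means over-expressed
        return expression - threshold
    raise ValueError(f"unknown method: {method!r}")


gene_data = {"g1": 8.0, "g2": 2.0}
thresholds = {"g1": 4.0, "g2": 4.0}
results = {
    method: {g: compare_to_threshold(v, thresholds[g], method)
             for g, v in gene_data.items()}
    for method in ("binary", "ratio", "diff", "rdiff")
}
print(results["binary"])  # {'g1': 1, 'g2': 0}
print(results["ratio"])   # {'g1': 2.0, 'g2': 0.5}
```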

aggregate classmethod

aggregate(
    data: Dict[
        str, Dict[str, Union[Dict[str, GeneData], GeneData]]
    ],
    method: str = "concat",
    prop: Literal["data", "score"] = "data",
    absent_expression: float = 0,
    group_annotation: DataFrame = None,
) -> DataAggregation

Aggregate gene data or reaction scores from multiple sources.

Combines multiple :class:GeneData objects into a single :class:~pipeGEM.analysis.DataAggregation result, either by concatenation or by applying a pandas aggregation method (e.g. "mean", "median").

Parameters:

Name Type Description Default
data dict[str, dict[str, GeneData]] or dict[str, GeneData]

Gene-data objects to aggregate. When nested (two-level dict), columns are named "outer_key:inner_key".

required
method str

Aggregation method. "concat" (default) keeps all columns; any other value is called as a pandas DataFrame method along axis=1 (e.g. "mean", "median").

'concat'
prop (data, score)

Which property to extract from each GeneData: "data" → :attr:gene_data, "score" → :attr:rxn_scores.

"data"
absent_expression float

Fill value for missing genes (default 0).

0
group_annotation DataFrame

Sample-level group labels. Its index must match the resulting column names when method is "concat".

None

Returns:

Type Description
DataAggregation

Result object wrapping the aggregated DataFrame.

Raises:

Type Description
AssertionError

If prop is not "data" or "score".

ValueError

If group_annotation index does not overlap with aggregated column names.
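The column naming for nested input and the role of absent_expression can be pictured with plain dictionaries. In this sketch each sample is a simple gene-to-value dict standing in for a GeneData object, and the helper names flatten_samples and aggregate_samples are hypothetical. The real method returns a DataAggregation wrapping a DataFrame.

```python
def flatten_samples(data):
    """Name columns 'outer_key:inner_key' for nested input, else 'key'."""
    columns = {}
    for outer_key, value in data.items():
        nested = any(isinstance(v, dict) for v in value.values())
        if nested:
            for inner_key, genes in value.items():
                columns[f"{outer_key}:{inner_key}"] = genes
        else:
            columns[outer_key] = value
    return columns


def aggregate_samples(data, method="concat", absent_expression=0.0):
    """Build a gene x sample table, then optionally average across samples."""
    columns = flatten_samples(data)
    all_genes = sorted({g for genes in columns.values() for g in genes})
    table = {
        gene: {col: genes.get(gene, absent_expression)   # fill missing genes
               for col, genes in columns.items()}
        for gene in all_genes
    }
    if method == "concat":
        return table
    if method == "mean":
        return {gene: sum(row.values()) / len(row) for gene, row in table.items()}
    raise ValueError(f"unsupported method in this sketch: {method!r}")


data = {"liver": {"s1": {"g1": 2.0, "g2": 4.0}, "s2": {"g1": 6.0}}}
print(aggregate_samples(data))
# {'g1': {'liver:s1': 2.0, 'liver:s2': 6.0}, 'g2': {'liver:s1': 4.0, 'liver:s2': 0.0}}
print(aggregate_samples(data, method="mean"))  # {'g1': 4.0, 'g2': 2.0}
```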

MediumData

MediumData(
    data,
    conc_col_label="mmol/L",
    conc_unit="mmol/L",
    id_index=False,
    name_index=True,
    id_col_label="human_1",
    name_col_label=None,
)

Bases: BaseData

Stores and processes medium composition data for constraining metabolic models.

This class handles loading medium data (metabolite concentrations), aligning it with exchange reactions in a metabolic model, and applying these concentrations as constraints on reaction bounds.

Parameters:

Name Type Description Default
data DataFrame

DataFrame containing medium composition data. Must include columns for metabolite IDs and concentrations. Optionally includes metabolite names.

required
conc_col_label str

The label of the column containing metabolite concentrations in data.

"mmol/L"
conc_unit str

The unit of the concentrations provided in conc_col_label. Uses pint for unit handling.

"mmol/L"
id_index bool

If True, assumes the DataFrame index contains the metabolite IDs. If False, uses the column specified by id_col_label.

False
name_index bool

If True, assumes the DataFrame index contains the metabolite names. If False, uses the column specified by name_col_label.

True
id_col_label str

The label of the column containing metabolite IDs, used if id_index is False.

"human_1"
name_col_label str

The label of the column containing metabolite names, used if name_index is False. If None and name_index is False, names will not be stored.

None

Attributes:

Name Type Description
data_dict dict

Dictionary mapping metabolite IDs to their concentrations.

rxn_dict dict

Dictionary mapping exchange reaction IDs to corresponding metabolite concentrations after alignment with a model.

name_dict dict

Dictionary mapping metabolite IDs to their names (if available).

conc_unit Quantity

The concentration unit parsed by pint.

align

align(
    model,
    external_comp_name="e",
    met_id_format="{met_id}{comp}",
    raise_err=False,
)

Aligns medium metabolite data with exchange reactions in a metabolic model.

Iterates through the metabolites in data_dict and attempts to find corresponding exchange reactions in the model. Populates rxn_dict with mappings from reaction IDs to metabolite concentrations.

Parameters:

Name Type Description Default
model Model or Model

The metabolic model to align against.

required
external_comp_name str

The identifier for the external compartment in the model.

"e"
met_id_format str

A format string to construct the full metabolite ID in the model, using the metabolite ID from the data (met_id) and the external_comp_name (comp).

"{met_id}{comp}"
raise_err bool

If True, raises a KeyError if a metabolite from the data cannot be found in the model's external compartment. If False, issues a warning.

False

Returns:

Type Description
None
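How met_id_format, external_comp_name, and raise_err fit together can be sketched with Python's str.format and the warnings module. The helper names and the set of model metabolite IDs are hypothetical, and the lookup of the matching exchange reaction is left out, so this sketch maps metabolite IDs rather than reaction IDs.

```python
import warnings


def build_model_met_id(met_id, external_comp_name="e",
                       met_id_format="{met_id}{comp}"):
    """Compose the model-side metabolite ID from the data-side ID."""
    return met_id_format.format(met_id=met_id, comp=external_comp_name)


def align_medium(data_dict, model_met_ids, external_comp_name="e",
                 met_id_format="{met_id}{comp}", raise_err=False):
    """Keep metabolites found in the model; warn or raise for the rest."""
    aligned = {}
    for met_id, conc in data_dict.items():
        full_id = build_model_met_id(met_id, external_comp_name, met_id_format)
        if full_id in model_met_ids:
            aligned[full_id] = conc
        elif raise_err:
            raise KeyError(full_id)
        else:
            warnings.warn(f"{full_id} not found in the model; skipped")
    return aligned


model_mets = {"glc__D_e", "o2_e"}                 # hypothetical model content
medium = {"glc__D": 5.0, "o2": 1.0, "unknown": 2.0}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    aligned = align_medium(medium, model_mets, met_id_format="{met_id}_{comp}")
print(aligned)      # {'glc__D_e': 5.0, 'o2_e': 1.0}
print(len(caught))  # 1
```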

apply

apply(
    model,
    cell_dgw=1e-12,
    n_cells_per_l=1000000000.0,
    time_hr=96,
    flux_unit="mmol/g/hr",
    threshold=1e-06,
)

Applies the medium constraints to the bounds of model reactions.

Calculates the maximum possible influx rate for each metabolite based on its concentration, cell density, dry weight, and time. Updates the lower or upper bounds of the corresponding exchange, sink, or demand reactions in the model.

Parameters:

Name Type Description Default
model Model or Model

The metabolic model whose reaction bounds will be modified.

required
cell_dgw float

Cell dry weight in grams.

1e-12
n_cells_per_l float

Number of cells per liter of medium.

1e9
time_hr float

Duration of the experiment or simulation in hours.

96
flux_unit str

The desired unit for reaction fluxes in the model. The calculated influx bounds will be converted to this unit.

"mmol/g/hr"
threshold float

A minimum absolute value for the calculated bound. Bounds smaller than this threshold will be set to this value (or its negative). Helps avoid numerical issues with zero bounds.

1e-6

Returns:

Type Description
None

Notes

  • Modifies the model object in place.
  • Assumes exchange reactions consuming the metabolite have negative stoichiometry.
  • Sets bounds for unconstrained inorganic exchanges or sinks/demands to 0 if they only produce/consume metabolites, respectively. Issues warnings for others.
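One plausible reading of the influx calculation is that the whole concentration is consumed by the biomass in one litre over the given time. The sketch below encodes that reading. The formula, the function name max_influx, and the clipping step are assumptions for illustration, not the library's confirmed implementation, and the unit conversion handled by pint is omitted.

```python
import math


def max_influx(conc_mmol_per_l, cell_dgw=1e-12, n_cells_per_l=1e9,
               time_hr=96, threshold=1e-6):
    """Estimate an uptake limit in mmol/g/hr (assumed formula)."""
    biomass_g_per_l = cell_dgw * n_cells_per_l        # grams of cells per litre
    rate = conc_mmol_per_l / (biomass_g_per_l * time_hr)
    return max(rate, threshold)                       # avoid near-zero bounds


glucose_rate = max_influx(5.0)      # 5 mmol/L with the default settings
trace_rate = max_influx(1e-12)      # tiny concentration is clipped to threshold
print(glucose_rate, trace_rate)
```

For an exchange reaction that consumes the metabolite with negative stoichiometry, a value like this would become the magnitude of the lower bound.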

from_catalog classmethod

from_catalog(medium, **kwargs)

Load a medium from the built-in :class:~pipeGEM.data.MediumCatalog.

Parameters:

Name Type Description Default
medium MediumCatalog or str

A :class:~pipeGEM.data.MediumCatalog member or a case-insensitive string matching the enum name (e.g. 'M9', 'lb', 'DMEM_HIGH_FFA').

required
**kwargs

Keyword arguments forwarded to :meth:from_file / __init__. id_col_label and name_index default to the values stored in the catalog entry but can be overridden here.

{}

Returns:

Type Description
MediumData

Raises:

Type Description
ValueError

If medium is a string that does not match any catalog entry.

TypeError

If medium is neither a MediumCatalog member nor a string.

Examples:

>>> m9 = MediumData.from_catalog('M9')
>>> m9 = MediumData.from_catalog(MediumCatalog.M9)

supplement

supplement(met_id, concentration, name=None)

Add or update a metabolite in the medium.

Parameters:

Name Type Description Default
met_id str

BiGG (or other scheme) metabolite identifier.

required
concentration float

Concentration in the unit stored in :attr:conc_unit. Use float('inf') for unconstrained species.

required
name str

Human-readable name. If omitted, met_id is used as the name.

None

Returns:

Type Description
MediumData

self, to allow method chaining.

Warns:

Type Description
UserWarning

If :attr:rxn_dict is non-empty (the alignment may be stale).

remove

remove(met_id)

Remove a metabolite from the medium.

Parameters:

Name Type Description Default
met_id str

Metabolite identifier to remove.

required

Returns:

Type Description
MediumData

self, to allow method chaining.

Raises:

Type Description
KeyError

If met_id is not present in :attr:data_dict.

Warns:

Type Description
UserWarning

If :attr:rxn_dict is non-empty (the alignment may be stale).

combine

combine(other, mode='union', conflict='max')

Combine two media into a new :class:MediumData instance.

Concentrations in other are unit-converted to match self before combining. The original instances are not modified.

Parameters:

Name Type Description Default
other MediumData

The second medium.

required
mode (union, intersection)

'union' — include metabolites from either medium. 'intersection' — include only metabolites present in both.

'union'
conflict (max, min, sum, first, second)

How to resolve a metabolite present in both media:

  • 'max' — use the larger concentration.
  • 'min' — use the smaller concentration.
  • 'sum' — add both concentrations.
  • 'first' — keep self's value.
  • 'second' — use other's value.
'max'

Returns:

Type Description
MediumData

New instance with the combined composition (unit = self's unit).

Raises:

Type Description
TypeError

If other is not a :class:MediumData instance.

ValueError

If mode or conflict is invalid, or if units are incompatible.
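The mode and conflict options can be illustrated on plain dictionaries that map metabolite IDs to concentrations. The function combine_media is a hypothetical sketch. It assumes both inputs already share one unit, so the unit conversion described above is omitted.

```python
RESOLVERS = {
    "max": max,
    "min": min,
    "sum": lambda a, b: a + b,
    "first": lambda a, b: a,
    "second": lambda a, b: b,
}


def combine_media(first, second, mode="union", conflict="max"):
    """Return a new dict; neither input is modified."""
    if mode not in ("union", "intersection"):
        raise ValueError(f"invalid mode: {mode!r}")
    if conflict not in RESOLVERS:
        raise ValueError(f"invalid conflict: {conflict!r}")
    shared = first.keys() & second.keys()
    keys = (first.keys() | second.keys()) if mode == "union" else shared
    resolve = RESOLVERS[conflict]
    combined = {}
    for met in sorted(keys):
        if met in shared:                       # present in both media
            combined[met] = resolve(first[met], second[met])
        else:                                   # present in only one medium
            combined[met] = first[met] if met in first else second[met]
    return combined


base = {"glc": 5.0, "o2": 1.0}
extra = {"glc": 10.0, "gln": 2.0}
print(combine_media(base, extra))
# {'glc': 10.0, 'gln': 2.0, 'o2': 1.0}
print(combine_media(base, extra, mode="intersection", conflict="sum"))
# {'glc': 15.0}
```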

from_file classmethod

from_file(file_name='DMEM', csv_kw=None, **kwargs)

Loads medium data from a file.

Supports TSV and CSV formats. Looks for the file in the standard medium/ directory relative to the package structure first. If not found there, attempts to load from the provided file_name path directly.

Parameters:

Name Type Description Default
file_name str or Path

The base name of the medium file (e.g., "DMEM", "Hams") or a full path to a custom medium file. The method will try appending ".tsv" first, then assume CSV if not found or if csv_kw is provided.

"DMEM"
csv_kw dict

Keyword arguments to pass directly to pandas.read_csv. If provided, CSV reading is prioritized. Example: {'sep': ',', 'index_col': 0}.

None
**kwargs

Additional keyword arguments passed directly to the MediumData constructor (__init__), such as conc_col_label, id_col_label, etc.

{}

Returns:

Type Description
MediumData

An instance of the MediumData class initialized with the loaded data.

Raises:

Type Description
FileNotFoundError

If the specified file cannot be found either in the default directory or at the provided path.

Exception

Propagates exceptions from pandas.read_csv or MediumData.__init__.

EnzymeData

EnzymeData(
    data: Union[DataFrame],
    gene_id_col: Optional[str] = None,
    prot_id_col: Optional[str] = None,
    rxn_id_col: Optional[str] = None,
    met_id_col: Optional[str] = None,
    mw_col: str = "MW",
    kcat_col: str = "Kcat",
    alt_kcat_col: str = "DLKcat",
    prot_seq_col: str = "Sequence",
    ec_num_col: str = "EC",
    sa_col: str = "SA",
)

Bases: BaseData

Store enzyme kinetic parameters for enzyme-constrained model construction.

Wraps a DataFrame of per-gene (or per-protein) kinetic data — kcat, molecular weight (MW), EC numbers, protein sequences, etc. — and provides methods to align it with a COBRA model and optionally run DLKcat for in-silico kcat prediction.

Parameters:

Name Type Description Default
data DataFrame

Enzyme kinetic data. Each row represents one gene/protein entry.

required
gene_id_col str

Column to use as gene IDs (index). If None, the existing DataFrame index is used.

None
prot_id_col str

Column containing protein/UniProt IDs. If None, gene IDs are used as protein identifiers and a warning is issued.

None
rxn_id_col str

Column containing reaction IDs. If None, reaction mapping is inferred during :meth:align via the model's GPR rules.

None
met_id_col str

Column containing metabolite IDs associated with each reaction.

None
mw_col str

Column for molecular weight values (default "MW"). If absent, MW is inferred from prot_seq_col.

'MW'
kcat_col str

Column for experimentally measured kcat values (default "Kcat").

'Kcat'
alt_kcat_col str

Column for alternative (predicted) kcat values, e.g. from DLKcat (default "DLKcat").

'DLKcat'
prot_seq_col str

Column for amino-acid sequences (default "Sequence").

'Sequence'
ec_num_col str

Column for EC numbers (default "EC").

'EC'
sa_col str

Column for specific activity values (default "SA").

'SA'

Attributes:

Name Type Description
prot_id_col str or None

Protein ID column name.

mw_col, kcat_col, alt_kcat_col, prot_seq_col, ec_num_col, sa_col str

Column names for the respective fields.

calc_molecular_weight staticmethod

calc_molecular_weight(seq: str) -> float

Estimate protein molecular weight from an amino-acid sequence.

Uses average amino-acid residue masses and adds 18.02 Da for the terminal water molecule. Non-standard characters are silently skipped.

Parameters:

Name Type Description Default
seq str

One-letter amino-acid sequence.

required

Returns:

Type Description
float

Approximate molecular weight in Daltons.
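The estimate described above can be reproduced with a table of average residue masses. The values below are commonly tabulated averages and are an assumption of this sketch. The table used by the library may differ slightly, and the function name estimate_mw is hypothetical.

```python
# Average amino-acid residue masses in Daltons (assumed reference values).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.02  # terminal water, as stated in the description above


def estimate_mw(seq):
    """Sum residue masses, skip unknown characters, add one water."""
    return sum(RESIDUE_MASS[aa] for aa in seq if aa in RESIDUE_MASS) + WATER


glycine = estimate_mw("G")     # about 75.07 Da, close to free glycine
dipeptide = estimate_mw("GA")  # Gly-Ala
print(round(glycine, 4), round(dipeptide, 4))
```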

check_gene_rxn_pair

check_gene_rxn_pair(
    ref_model, raise_err: bool = True
) -> None

Checks if the gene-reaction pairs in the enzyme data exist in the reference model.

Iterates through the enzyme DataFrame and verifies that for each row, the gene (index or gene_id_col) is associated with the reaction specified in _rxn_id_col within the ref_model.

Parameters:

Name Type Description Default
ref_model Model or Model

The metabolic model used as a reference.

required
raise_err bool

If True, raises a ValueError upon finding a mismatch. If False, issues a warning instead.

True

Raises:

Type Description
ValueError

If raise_err is True and a mismatch is found.

AttributeError

If _rxn_id_col is None or not set.

KeyError

If a reaction ID from the data is not found in the model.

rxn_items

rxn_items() -> Dict[str, Dict[str, Union[str, float]]]

Returns a dictionary mapping reaction IDs to their best-matched enzyme data.

Requires the .align() method to be called first to populate the _best_matched_df.

Returns:

Type Description
Dict[str, Dict[str, Union[str, float]]]

A dictionary where keys are reaction IDs and values are dictionaries containing 'protein_to_use' (protein ID), 'best_kcat' (kcat value), and 'best_mw' (molecular weight).

Raises:

Type Description
AttributeError

If .align() has not been called yet (_best_matched_df is None).

run_DLKcat

run_DLKcat(
    met_data: MetaboliteData, device: str = "cpu"
) -> None

Runs the DLKcat tool to predict kcat values.

Parameters:

Name Type Description Default
met_data MetaboliteData

Metabolite data object containing SMILES information needed by DLKcat.

required
device str

Device to run DLKcat on ('cpu' or 'cuda' if available).

"cpu"

Notes

Predictions are stored in alt_kcat_col. Existing positive values in kcat_col are treated as curated values and are not overwritten.

align

align(
    model,
    check_and_raise=True,
    run_DLKcat=True,
    device="cpu",
)

Align enzyme data with a metabolic model.

Maps genes in the enzyme DataFrame to their corresponding reactions and metabolites using the model's GPR rules. Optionally runs DLKcat to predict missing kcat values.

Parameters:

Name Type Description Default
model Model or Model

The metabolic model to align against.

required
check_and_raise bool

If True (default) and rxn_id_col was provided at construction, raise on gene–reaction mismatches.

True
run_DLKcat bool

If True (default), attempt to predict kcat values via the DLKcat deep-learning model.

True
device str

PyTorch device for DLKcat ("cpu" or "cuda").

'cpu'

MetaboliteData

MetaboliteData(
    data: Union[DataFrame],
    met_id_col: Optional[str] = None,
    smiles_col="SMILES",
)

Bases: BaseData

Store metabolite structural data (SMILES) for enzyme-constrained modelling.

Used by :class:EnzymeData when running DLKcat predictions, which require substrate SMILES strings alongside protein sequences.

Parameters:

Name Type Description Default
data DataFrame

DataFrame containing at least a SMILES column. Rows represent metabolites.

required
met_id_col str

Column whose values should be used as the DataFrame index (metabolite IDs). If None, the existing index is kept.

None
smiles_col str

Name of the column holding SMILES strings (default "SMILES").

'SMILES'

Raises:

Type Description
KeyError

If smiles_col is not found among the DataFrame columns.

Attributes:

Name Type Description
smiles_col str

Column name used for SMILES look-ups.

get_smiles

get_smiles(ids)

Retrieve SMILES string(s) for the given metabolite ID(s).

Parameters:

Name Type Description Default
ids str or list of str

One or more metabolite IDs.

required

Returns:

Type Description
str or ndarray

A single SMILES string when ids is a str, or a NumPy array of strings when ids is a list.

MediumInfo dataclass

MediumInfo(
    name: str,
    description: str,
    source: str,
    organism: str,
    medium_type: str,
    default_id_col: str = "BiGG",
    is_approximate: bool = False,
    source_url: str = "",
    composition_url: str = "",
    composition_note: str = "",
)

Metadata for a named medium in the catalog.

Attributes:

Name Type Description
name str

Filename stem (without .tsv) used to locate the medium file.

description str

Short human-readable description.

source str

Literature citation or derivation note.

organism str

Intended host organism(s).

medium_type str

One of 'minimal', 'defined', 'rich', or 'complex'.

default_id_col str

Column label in the TSV that contains metabolite IDs (default 'BiGG').

is_approximate bool

True for undefined/complex media whose composition is estimated.

source_url str

URL for the original citation or source publication, when known.

composition_url str

URL for the formulation used to build or validate the bundled TSV.

composition_note str

Short note describing whether the TSV is an exact defined recipe, an ionized exchange representation, or an approximate proxy.

MediumCatalog

Bases: Enum

Catalog of named media bundled with pipeGEM.

Each member maps to a :class:MediumInfo instance that carries metadata and the TSV filename. Use :meth:~pipeGEM.data.MediumData.from_catalog to load a medium by catalog entry.

Examples:

>>> from pipeGEM.data import MediumCatalog
>>> MediumCatalog.M9.value.description
'M9 minimal salts medium with glucose'
>>> MediumCatalog.LB.value.is_approximate
True

find_local_threshold

find_local_threshold(
    data_df, **kwargs
) -> ALL_THRESHOLD_ANALYSES

Compute per-gene local expression thresholds across multiple samples.

This is a convenience wrapper that creates a "local" threshold finder and calls its :meth:find_threshold method.

Parameters:

Name Type Description Default
data_df DataFrame

Expression matrix with genes as rows and samples (or groups) as columns.

required
**kwargs

Forwarded to the local threshold finder (e.g. groups, group_dic).

{}

Returns:

Type Description
LocalThresholdAnalysis

Result object containing per-gene threshold values accessible via its exp_ths attribute.

See Also

GeneData.get_threshold : Instance method that delegates to any registered threshold finder.

load_remote_model

load_remote_model(
    model_id,
    format="mat",
    branch="main",
    download_dest="default",
)

Load a metabolic model from a remote database (BiGG or Metabolic Atlas).

If the model_id is found in the BiGG database, it is loaded directly using cobrapy. Otherwise, it attempts to download the model from the Metabolic Atlas GitHub repository.

Parameters:

Name Type Description Default
model_id str

The ID of the model to load.

required
format str

The format of the model file to download (e.g., "mat", "xml", "yml"). Defaults to "mat".

'mat'
branch str

The GitHub branch to download the model from for Metabolic Atlas models. Defaults to "main".

'main'
download_dest str

The destination directory to download the model to. Defaults to "default", which saves to a 'models' directory relative to the project root.

'default'

Returns:

Type Description
Model

The loaded metabolic model.

list_models

list_models(
    databases=["metabolic atlas", "BiGG"],
    organism=None,
    max_n_rxns=np.inf,
    max_n_mets=np.inf,
    max_n_genes=np.inf,
    **kwargs
) -> pd.DataFrame

List available metabolic models from specified databases with optional filtering.

Parameters:

Name Type Description Default
databases List[str]

A list of database names to fetch models from (e.g., ["metabolic atlas", "BiGG"]). Defaults to ["metabolic atlas", "BiGG"].

['metabolic atlas', 'BiGG']
organism str

Filter models by organism name (e.g., "human", "mouse"). Case-insensitive.

None
max_n_rxns float

Maximum number of reactions allowed in the models. Defaults to infinity.

inf
max_n_mets float

Maximum number of metabolites allowed in the models. Defaults to infinity.

inf
max_n_genes float

Maximum number of genes allowed in the models. Defaults to infinity.

inf
**kwargs

Additional keyword arguments for DataBaseFetcherIniter.

{}

Returns:

Type Description
DataFrame

A DataFrame containing information about the available models, including 'id', 'organism', 'reaction_count', 'metabolite_count', 'gene_count', and 'database'. Returns an empty DataFrame if no data is fetched.

fetch_HPA_data

fetch_HPA_data(
    data_name: str,
    data_path: Union[str, Path] = Path(
        __file__
    ).parent.parent.parent
    / Path("external_data/HPA"),
) -> dict

Fetch Human Protein Atlas (HPA) data.

Downloads the specified HPA dataset if it doesn't exist locally.

Parameters:

Name Type Description Default
data_name str

The name of the HPA dataset to fetch (e.g., 'rna_tissue_consensus').

required
data_path Union[str, Path]

The directory path to save or load the data from. Defaults to 'external_data/HPA' relative to the project root.

parent / Path('external_data/HPA')

Returns:

Type Description
dict

A dictionary containing the path to the downloaded TSV file under the key "data_path".

get_syn_gene_data

get_syn_gene_data(
    model: Union[Model, Model],
    n_sample: int,
    n_genes: Optional[int] = None,
    groups: Optional[str] = None,
    random_state: int = 42,
    returned_dtype: str = "DataFrame",
) -> Union[pd.DataFrame, AnnData]

Generate synthetic gene expression data with a given number of samples and genes.

Parameters:

Name Type Description Default
model Model or Model

A model containing information about the genes to simulate expression data for.

required
n_sample int

The number of samples to generate.

required
n_genes int

The number of genes to simulate expression data for. If None, use all genes in the model (default=None).

None
groups str

The name of the attribute containing group information for the genes (default=None).

None
random_state int

The random seed to use for generating the data (default=42).

42
returned_dtype str

The type of object to return. Must be either 'DataFrame' or 'AnnData' (default='DataFrame').

'DataFrame'

Returns:

Type Description
Union[DataFrame, AnnData]

The simulated gene expression data. If returned_dtype is 'DataFrame', returns a pandas DataFrame with gene IDs as the index and sample IDs as the columns. If returned_dtype is 'AnnData', returns an AnnData object with the simulated expression data as the X attribute, and empty obs and var attributes.

transform_HPA_data

transform_HPA_data(
    data_df,
    categories: List[str],
    gene_id_col: str = "entrezgene",
    score_col_name: str = "score",
)

Pivot HPA data into a gene × sample expression matrix.

Groups rows by gene_id_col and the specified categories, averages duplicate entries, then pivots so that each unique combination of category values becomes a column (sample).

Parameters:

Name Type Description Default
data_df DataFrame

Filtered HPA DataFrame (e.g. output of :func:unify_score_column).

required
categories list of str

Column names that together define a "sample" (e.g. ["Tissue", "Cell type"]). Multiple columns are joined with "_" to form a single sample label.

required
gene_id_col str

Column (or "index") holding gene identifiers (default "entrezgene").

'entrezgene'
score_col_name str

Column with numeric expression scores (default "score").

'score'

Returns:

Type Description
dict

{"data_df": pd.DataFrame} — a genes × samples matrix where rows are genes and columns are sample labels.

Raises:

Type Description
ValueError

If categories is empty (at least one sample column is required).
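The group, average, and pivot steps described above can be sketched in plain Python. This is an illustration of the documented behaviour under stated assumptions, not the library's code: `sketch_pivot` is a hypothetical name, rows are plain dicts rather than a DataFrame, and the tissue and cell-type values are made-up example data.

```python
from collections import defaultdict


def sketch_pivot(rows, categories, gene_id_col="entrezgene", score_col_name="score"):
    """Average duplicate (gene, sample) entries and return {gene: {sample: mean}}."""
    if not categories:
        raise ValueError("at least one sample column is required")
    buckets = defaultdict(list)
    for row in rows:
        # Join the category values with "_" to form a single sample label.
        sample = "_".join(str(row[c]) for c in categories)
        buckets[(row[gene_id_col], sample)].append(row[score_col_name])
    matrix = defaultdict(dict)
    for (gene, sample), scores in buckets.items():
        matrix[gene][sample] = sum(scores) / len(scores)  # mean of duplicates
    return dict(matrix)


rows = [
    {"entrezgene": "1", "Tissue": "liver", "Cell type": "hepatocytes", "score": 2.0},
    {"entrezgene": "1", "Tissue": "liver", "Cell type": "hepatocytes", "score": 4.0},
    {"entrezgene": "2", "Tissue": "lung", "Cell type": "macrophages", "score": 1.0},
]
matrix = sketch_pivot(rows, ["Tissue", "Cell type"])
```

In this sketch a gene simply has no entry for a sample it was never measured in; a DataFrame pivot would typically hold a missing value in that cell instead.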

unify_score_column

unify_score_column(
    data_df: DataFrame,
    level_dic: Dict[str, float],
    score_col_name: str,
) -> (
    pd.DataFrame,
    Dict[str, Dict[str, Tuple[float, float]]],
)

Convert heterogeneous HPA expression columns into a single score.

Handles three cases depending on which columns are present:

  1. Level count columns (e.g. "High", "Medium", "Low"): compute a weighted average using level_dic as weights and return continuous CORDA thresholds.
  2. A "Level" column with discrete labels: map labels to numeric scores via level_dic and return discrete CORDA thresholds.
  3. Quantitative columns ("pTPM" or "NX"): rename the first matching column to score_col_name and return None thresholds (thresholding left to downstream methods).

Parameters:

Name Type Description Default
data_df DataFrame

HPA expression DataFrame (one row per gene × tissue/cell-type).

required
level_dic dict[str, float]

Mapping from expression-level labels (e.g. "High") to numeric weights. Also used to map discrete "Level" labels.

required
score_col_name str

Name for the unified score column added to the returned DataFrame.

required

Returns:

Type Description
dict

{"data_df": pd.DataFrame, "used_rxn_thres": dict or None} — the updated DataFrame and the CORDA threshold dictionary (or None when quantitative data is used).
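As a rough illustration of the first two cases, the sketch below computes a count-weighted average and a direct label lookup. The exact formula is an assumption (the source only says the weighted average uses `level_dic` as weights), the function names are hypothetical, the `level_dic` values are example numbers, and the CORDA thresholds are deliberately left out because their derivation is not described here.

```python
def weighted_level_score(level_counts, level_dic):
    """Case 1 (assumed formula): sum(count * weight) / sum(count) over known levels."""
    total = sum(level_counts.get(level, 0) for level in level_dic)
    if total == 0:
        return None  # no observations for any known level
    weighted = sum(level_counts.get(level, 0) * w for level, w in level_dic.items())
    return weighted / total


def discrete_level_score(label, level_dic):
    """Case 2: map a discrete "Level" label to its numeric score (None if unknown)."""
    return level_dic.get(label)


# Example weights only; real values are chosen by the caller.
level_dic = {"High": 3.0, "Medium": 2.0, "Low": 1.0}

score = weighted_level_score({"High": 1, "Medium": 1, "Low": 2}, level_dic)
label_score = discrete_level_score("Medium", level_dic)
```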

translate_gene_id

translate_gene_id(
    data_df: DataFrame,
    map_df: DataFrame,
    gene_col: str,
    to_id: str,
)

Translate gene identifiers in a DataFrame using a mapping.

Adds a new column (or replaces the index) with the translated IDs and drops rows that could not be mapped.

Parameters:

Name Type Description Default
data_df DataFrame

DataFrame containing the original gene identifiers.

required
map_df DataFrame or dict

Mapping from original IDs to target IDs. If a DataFrame, it is used via :meth:pandas.Series.map; if a dict, keys are original IDs and values are translated IDs.

required
gene_col str

Column in data_df that holds the source gene IDs. Use "index" to translate the DataFrame index instead.

required
to_id str

Name for the new translated-ID column. If "index", the translated IDs replace the DataFrame index.

required

Returns:

Type Description
dict

{"data_df": pd.DataFrame, "gene_id_col": str} — the updated DataFrame and the name of the translated-ID column.
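The map-then-drop behaviour can be sketched without pandas as follows. This is a simplified stand-in, assuming rows are dicts and the mapping is a dict; `sketch_translate` and the IDs in the example are hypothetical, and the index-handling options of the real function are not reproduced.

```python
def sketch_translate(rows, mapping, gene_col, to_id):
    """Add a translated-ID field to each row and drop rows that cannot be mapped."""
    translated = []
    for row in rows:
        new_id = mapping.get(row[gene_col])
        if new_id is None:
            continue  # unmapped genes are dropped, as the description states
        translated.append({**row, to_id: new_id})  # copy the row, add the new column
    return translated


rows = [
    {"ensembl_gene_id": "ENSG_A", "score": 1.0},  # made-up IDs for illustration
    {"ensembl_gene_id": "ENSG_B", "score": 2.0},
    {"ensembl_gene_id": "ENSG_C", "score": 3.0},
]
mapping = {"ENSG_A": "GENE1", "ENSG_C": "GENE3"}
result = sketch_translate(rows, mapping, "ensembl_gene_id", "external_gene_name")
```

The input rows are left unmodified; each kept row is a copy with the extra field, which mirrors adding a column rather than overwriting the source IDs.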

get_gene_id_map

get_gene_id_map(
    gene_names: List[str],
    from_id: str,
    to_id: str,
    df_path: Union[PathLike, str],
    dataset: Union[str, Dict] = "hsapiens_gene_ensembl",
    ds_kws: Optional[Dict] = None,
    map_type: str = "df",
    drop_unused: bool = False,
    ref_model: Optional[Model] = None,
)

Get a gene ID mapper from a local path or BioMart.

Parameters:

Name Type Description Default
gene_names list of str

The gene names or IDs to be translated into another set of gene names or IDs.

required
from_id str

The name of the current IDs (e.g. "ensembl_gene_id").

required
to_id str

The name of the transformed IDs (e.g. "external_gene_name").

required
df_path path-like or str

Path to cache the mapping DataFrame as a TSV file. If None, the mapping is fetched from BioMart without caching.

required
dataset str

BioMart dataset name (default "hsapiens_gene_ensembl").

'hsapiens_gene_ensembl'
ds_kws dict

Backward-compatible dataset keyword dictionary used by earlier versions. If supplied, name, dataset, or dataset_name is used as the BioMart dataset name.

None
map_type str

"df" to return a DataFrame, "dict" to return a dict.

'df'
drop_unused bool

If True, drop genes not present in ref_model.

False
ref_model Model

Reference model used when drop_unused is True.

None

Returns:

Type Description
dict

{"map_df": ...} where the value is a DataFrame or dict.
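The "local path or BioMart" behaviour follows a cache-or-fetch pattern, sketched below with the standard library. Everything here is an assumption for illustration: `sketch_get_id_map` is a hypothetical name, the `fetch` callable stands in for the BioMart query (no network call is made), the IDs are made up, and filtering the result by `gene_names` is a guess at how that argument is used.

```python
import csv
import tempfile
from pathlib import Path


def sketch_get_id_map(gene_names, from_id, to_id, df_path, fetch):
    """Load an ID mapping from a TSV cache if present, else fetch it and cache it."""
    path = Path(df_path) if df_path is not None else None
    if path is not None and path.exists():
        with path.open(newline="") as fh:
            reader = csv.DictReader(fh, delimiter="\t")
            pairs = [(r[from_id], r[to_id]) for r in reader]
    else:
        pairs = fetch(from_id, to_id)  # hypothetical stand-in for the remote query
        if path is not None:
            with path.open("w", newline="") as fh:
                writer = csv.writer(fh, delimiter="\t")
                writer.writerow([from_id, to_id])  # header row names both ID types
                writer.writerows(pairs)
    wanted = set(gene_names)
    # Assumption: only the requested genes are kept in the returned mapping.
    return {"map_df": {src: dst for src, dst in pairs if src in wanted}}


calls = []


def fake_fetch(from_id, to_id):
    calls.append((from_id, to_id))  # record each "remote" call
    return [("ENSG01", "GENE_A"), ("ENSG02", "GENE_B")]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "id_map.tsv"
    first = sketch_get_id_map(
        ["ENSG01"], "ensembl_gene_id", "external_gene_name", cache, fake_fetch
    )
    second = sketch_get_id_map(
        ["ENSG01"], "ensembl_gene_id", "external_gene_name", cache, fake_fetch
    )
```

The second call reads the TSV written by the first, so the fetch function runs only once.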