Data¶

Data loading, preprocessing, and biological-data containers.

This submodule provides:

Data containers — :class:GeneData, :class:EnzymeData, :class:MediumData, :class:MetaboliteData — that wrap biological measurements and align them with COBRA metabolic models.
Fetching utilities — :func:fetch_HPA_data, :func:list_models, :func:load_remote_model — for downloading data and models from public databases (HPA, BiGG, Metabolic Atlas).
Preprocessing helpers — :func:translate_gene_id, :func:unify_score_column, :func:transform_HPA_data, :func:get_gene_id_map — for gene-ID translation, score unification, and HPA data pivoting.
Synthetic data — :func:get_syn_gene_data — for generating simulated gene-expression matrices for testing.

GeneData ¶

GeneData(
    data: Union[AnnData, Series, dict],
    convert_to_str: bool = True,
    expression_threshold: float = 0.0001,
    absent_expression: float = 0,
    data_transform=None,
    discrete_transform=None,
    ordered_thresholds: list = None,
)

Bases: BaseData

Store gene-expression data and compute reaction activity scores.

GeneData wraps a mapping of gene IDs to expression values and provides methods to:

align the data with a COBRA model via a :class:~pipeGEM.analysis.RxnMapper,
compute per-reaction activity scores using gene–protein–reaction (GPR) rules,
apply global or local thresholding strategies,
aggregate data across multiple samples.

Parameters:

Name	Type	Description	Default
`data`	`pd.Series, anndata.AnnData, or dict`	Gene-expression values keyed by gene ID. For `AnnData`, the object must contain exactly one observation (row).	required
`convert_to_str`	`bool`	If `True` (default), gene IDs are cast to strings.	`True`
`expression_threshold`	`float`	Genes with expression below this value are set to absent_expression. Default is `1e-4`.	`0.0001`
`absent_expression`	`float`	Value assigned to genes below expression_threshold. Default is `0`.	`0`
`data_transform`	`callable or str`	Transformation applied to expression values when computing reaction scores (e.g. `np.log2`, `"log2"`). `None` means identity.	`None`
`discrete_transform`	`str, dict, or callable`	Maps raw expression values to discrete levels before storing. Recognised strings: `"HPA"` (Human Protein Atlas scoring). A dict is used as a direct look-up table.	`None`
`ordered_thresholds`	`list of float`	Ascending cut-offs for digitising expression into integer bins centred around zero.	`None`
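
The effect of `expression_threshold` and `absent_expression` can be pictured with the minimal sketch below. It is plain Python, not pipeGEM code: `apply_expression_threshold` is a hypothetical helper, and the strict `<` comparison is an assumption drawn from the parameter descriptions above.

```python
def apply_expression_threshold(gene_data, expression_threshold=1e-4,
                               absent_expression=0, convert_to_str=True):
    """Return a copy in which values below the threshold are replaced."""
    cleaned = {}
    for gene, value in gene_data.items():
        key = str(gene) if convert_to_str else gene
        # Assumption: "below" means strictly less than the threshold.
        cleaned[key] = absent_expression if value < expression_threshold else value
    return cleaned


raw = {"geneA": 10.5, "geneB": 0.00001, 101: 2.0}
cleaned = apply_expression_threshold(raw)
print(cleaned)
```

Here `geneB` falls below `1e-4` and is replaced by `0`, while the integer key `101` becomes the string `"101"`.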

Attributes:

Name	Type	Description
`gene_data`	`dict[str, float]`	Mapping of gene IDs to (possibly transformed) expression values.
`genes`	`list[str]`	Sorted list of gene IDs.
`rxn_mapper`	`RxnMapper or None`	Reaction mapper created by :meth:`align`; `None` before alignment.
`data_transform`	`callable`	The transformation applied to values when accessing :attr:`rxn_scores` or :attr:`transformed_gene_data`.

Examples:

>>> gd = GeneData({"geneA": 10.5, "geneB": 0.0})
>>> gd["geneA"]
10.5
>>> gd.align(model)
>>> gd.rxn_scores  # reaction → score mapping

transformed_gene_data `property` ¶

transformed_gene_data: Dict[str, float]

Gene data after applying the specified data_transform.

Returns:

Type	Description
`dict[str, float]`	Dictionary mapping gene IDs to their transformed expression values.

rxn_scores `property` ¶

rxn_scores: Dict[str, float]

Reaction scores calculated by a RxnMapper. The RxnMapper assigns a score to each reaction in the aligned model based on its gene-reaction relationship. By default, 'or' relationships are evaluated with max() and 'and' relationships with min(), representing isozymes and protein subunits, respectively. Users can choose a different operation for each relationship through the keyword arguments of align.

Returns:

Name	Type	Description
`rxn_scores`	`dict[str, float]`	Mapping of reaction IDs to reaction scores.
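
The min/max evaluation of gene-reaction rules can be sketched as follows. This is an illustration only, not the RxnMapper implementation: the nested-tuple rule format and the helper names `nan_reduce` and `score_gpr` are assumptions made for the example, and real rules are parsed from the model.

```python
import math


def nan_reduce(values, reducer):
    """Apply reducer to the non-NaN values; return NaN if none remain."""
    kept = [v for v in values if not math.isnan(v)]
    return reducer(kept) if kept else math.nan


def score_gpr(rule, gene_data):
    """Score a rule given as a gene ID or an ("and" | "or", [sub-rules]) tuple."""
    if isinstance(rule, str):
        # Genes without a measurement get NaN (a stand-in for missing_value).
        return gene_data.get(rule, math.nan)
    operator, operands = rule
    scores = [score_gpr(sub_rule, gene_data) for sub_rule in operands]
    # 'and' (protein subunits) -> minimum, 'or' (isozymes) -> maximum.
    reducer = min if operator == "and" else max
    return nan_reduce(scores, reducer)


gene_data = {"g1": 5.0, "g2": 2.0, "g3": 8.0}
# (g1 and g2) or g3 or g4, where g4 has no measurement
rule = ("or", [("and", ["g1", "g2"]), "g3", "g4"])
score = score_gpr(rule, gene_data)
print(score)
```

The complex `g1 and g2` is limited by its least-expressed subunit (2.0), and the isozyme alternative `g3` then sets the reaction score to 8.0; the unmeasured `g4` is ignored.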

align ¶

align(model, **kwargs)

Calculate rxn_scores using a metabolic model.

Parameters:

Name	Type	Description	Default
`model`		The model with the genes and reactions to be mapped onto	required
`kwargs`		Keyword arguments used to create a RxnMapper object. `threshold` (float or int, default `0`): reaction scores below this threshold are set to `absent_value`. `absent_value` (float or int, default `0`): value assigned to reactions whose score is lower than the threshold. `missing_value` (any, default `np.nan`): value assigned to genes not included in the gene data. `and_operation` (str, default `'nanmin'`): operation used to calculate 'and' gene-reaction relationships. `or_operation` (str, default `'nanmax'`): operation used to calculate 'or' gene-reaction relationships. `plus_operation` (str, default `'nansum'`): operation used to calculate 'plus' gene-reaction relationships. Valid operations include `nanmin` (minimum, ignoring NaN values), `nanmax` (maximum, ignoring NaN values), `nansum` (sum, ignoring NaN values), and `nanmean` (mean, ignoring NaN values).	`{}`

Returns:

Type	Description
`None`

transformed_rxn_scores ¶

transformed_rxn_scores(func) -> dict

Get the transformed reaction activity scores.

Parameters:

Name	Type	Description	Default
`func`		Function used to transform the reaction score	required

Returns:

Name	Type	Description
`transformed_rxn_scores`	`dict`	A dict with reaction IDs as keys and transformed reaction scores as values.

calc_rxn_score_stat ¶

calc_rxn_score_stat(
    rxn_ids,
    ignore_na=True,
    na_value=0,
    return_if_all_na=-1,
    method="mean",
) -> float

Calculate a statistic (mean or median) for reaction scores of specified reactions.

Parameters:

Name	Type	Description	Default
`rxn_ids`	`list or set`	IDs of the reactions to include in the calculation.	required
`ignore_na`	`bool`	If True, ignore NaN scores during calculation.	`True`
`na_value`	`float`	Value to replace NaN scores with if `ignore_na` is False.	`0`
`return_if_all_na`	`float`	Value to return if all selected reaction scores are NaN.	`-1`
`method`	`(mean, median)`	The statistic to calculate.	`"mean"`

Returns:

Type	Description
`float`	The calculated statistic.

Raises:

Type	Description
`ValueError`	If `method` is not "mean" or "median".
`AttributeError`	If reaction scores have not been calculated yet (call `.align()` first).
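
How the NaN-related parameters interact can be sketched in plain Python. The function `score_stat` below is a hypothetical stand-in that takes the score dictionary explicitly; the order of the checks is an assumption based on the parameter descriptions, not the library's code.

```python
import math
from statistics import mean, median


def score_stat(rxn_scores, rxn_ids, ignore_na=True, na_value=0,
               return_if_all_na=-1, method="mean"):
    """Mean or median of selected reaction scores with NaN handling."""
    if method not in ("mean", "median"):
        raise ValueError("method must be 'mean' or 'median'")
    selected = [rxn_scores[rxn_id] for rxn_id in rxn_ids]
    if all(math.isnan(s) for s in selected):
        return return_if_all_na
    if ignore_na:
        selected = [s for s in selected if not math.isnan(s)]  # drop NaN
    else:
        selected = [na_value if math.isnan(s) else s for s in selected]  # fill NaN
    stat = mean if method == "mean" else median
    return stat(selected)


scores = {"R1": 4.0, "R2": math.nan, "R3": 2.0}
print(score_stat(scores, ["R1", "R2", "R3"]))                   # NaN dropped
print(score_stat(scores, ["R1", "R2", "R3"], ignore_na=False))  # NaN replaced by 0
print(score_stat(scores, ["R2"]))                               # all NaN
```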

apply ¶

apply(func)

Apply a function to each reaction score.

Parameters:

Name	Type	Description	Default
`func`	`callable`	A function that takes a single reaction score (float) as input.	required

Returns:

Type	Description
`dict[str, float]`	A dictionary mapping reaction IDs to the results of applying `func` to their scores.

Raises:

Type	Description
`AttributeError`	If reaction scores have not been calculated yet (call `.align()` first).

get_threshold ¶

get_threshold(
    name: str, transform: bool = True, **kwargs
) -> ALL_THRESHOLD_ANALYSES

Calculate expression thresholds for classifying genes.

Parameters:

Name	Type	Description	Default
`name`	`str`	Thresholding method name (e.g. `"percentile"`, `"rFastCormic"`, `"local"`). Passed to :func:`~pipeGEM.analysis.threshold_finders.create`.	required
`transform`	`bool`	If `True` (default), apply :attr:`data_transform` to the gene data before computing thresholds.	`True`
`**kwargs`		Additional keyword arguments forwarded to the threshold finder's `find_threshold` method.	`{}`

Returns:

Type	Description
`ThresholdAnalysis`	A result object whose type depends on name (e.g. `PercentileThresholdAnalysis`, `rFastCormicThresholdAnalysis`).

assign_local_threshold ¶

assign_local_threshold(
    local_threshold_result,
    transform: bool = True,
    method: Literal[
        "binary", "ratio", "diff", "rdiff"
    ] = "binary",
    group: str = None,
    **kwargs
) -> None

Replace gene-expression values using per-gene local thresholds.

Modifies :attr:gene_data in place according to method:

"binary" — 1 if expression > threshold, else 0.
"ratio" — expression / threshold.
"diff" — threshold − expression (positive ⇒ under-expressed).
"rdiff" — expression − threshold (positive ⇒ over-expressed).

Parameters:

Name	Type	Description	Default
`local_threshold_result`	`LocalThresholdAnalysis`	Result object containing per-gene thresholds (from :func:`find_local_threshold`).	required
`transform`	`bool`	If `True` (default), apply :attr:`data_transform` to both gene values and thresholds before comparison.	`True`
`method`	`(binary, ratio, diff, rdiff)`	Comparison strategy (default `"binary"`).	`"binary"`
`group`	`str`	Column to select from `local_threshold_result.exp_ths`. Defaults to `"exp_th"`.	`None`
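
The four comparison strategies listed above reduce to simple arithmetic on one expression value and its threshold. The sketch below shows them side by side; `compare_to_threshold` is a hypothetical helper, not a pipeGEM function.

```python
def compare_to_threshold(expression, threshold, method="binary"):
    """Compare one expression value with its local threshold."""
    if method == "binary":
        return 1 if expression > threshold else 0
    if method == "ratio":
        return expression / threshold
    if method == "diff":
        return threshold - expression   # positive => under-expressed
    if method == "rdiff":
        return expression - threshold   # positive => over-expressed
    raise ValueError(f"unknown method: {method!r}")


results = {
    method: compare_to_threshold(8.0, 2.0, method)
    for method in ("binary", "ratio", "diff", "rdiff")
}
print(results)
```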

aggregate `classmethod` ¶

aggregate(
    data: Dict[
        str, Dict[str, Union[Dict[str, GeneData], GeneData]]
    ],
    method: str = "concat",
    prop: Literal["data", "score"] = "data",
    absent_expression: float = 0,
    group_annotation: DataFrame = None,
) -> DataAggregation

Aggregate gene data or reaction scores from multiple sources.

Combines multiple :class:GeneData objects into a single :class:~pipeGEM.analysis.DataAggregation result, either by concatenation or by applying a pandas aggregation method (e.g. "mean", "median").

Parameters:

Name	Type	Description	Default
`data`	`dict[str, dict[str, GeneData]] or dict[str, GeneData]`	Gene-data objects to aggregate. When nested (two-level dict), columns are named `"outer_key:inner_key"`.	required
`method`	`str`	Aggregation method. `"concat"` (default) keeps all columns; any other value is called as a pandas DataFrame method along `axis=1` (e.g. `"mean"`, `"median"`).	`'concat'`
`prop`	`(data, score)`	Which property to extract from each `GeneData`: `"data"` → :attr:`gene_data`, `"score"` → :attr:`rxn_scores`.	`"data"`
`absent_expression`	`float`	Fill value for missing genes (default `0`).	`0`
`group_annotation`	`DataFrame`	Sample-level group labels. Its index must match the resulting column names when method is `"concat"`.	`None`

Returns:

Type	Description
`DataAggregation`	Result object wrapping the aggregated DataFrame.

Raises:

Type	Description
`AssertionError`	If prop is not `"data"` or `"score"`.
`ValueError`	If group_annotation index does not overlap with aggregated column names.
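
The column naming for nested input and the role of `absent_expression` can be illustrated with plain dictionaries standing in for `GeneData` objects. This is a sketch under that assumption; `concat_gene_data` and `row_mean` are hypothetical helpers, and the real method returns a `DataAggregation` built on a DataFrame.

```python
def concat_gene_data(data, absent_expression=0):
    """Build {column_name: {gene: value}}, filling genes missing from a column."""
    columns = {}
    for outer_key, value in data.items():
        if all(isinstance(v, dict) for v in value.values()):
            # Two-level input: columns are named "outer_key:inner_key".
            for inner_key, gene_values in value.items():
                columns[f"{outer_key}:{inner_key}"] = gene_values
        else:
            columns[outer_key] = value
    all_genes = sorted({gene for col in columns.values() for gene in col})
    return {
        name: {gene: col.get(gene, absent_expression) for gene in all_genes}
        for name, col in columns.items()
    }


def row_mean(table):
    """Per-gene mean across columns, analogous to method="mean"."""
    genes = next(iter(table.values())).keys()
    return {g: sum(col[g] for col in table.values()) / len(table) for g in genes}


data = {
    "liver": {"s1": {"g1": 4.0, "g2": 1.0}, "s2": {"g1": 2.0}},
    "brain": {"s1": {"g2": 3.0}},
}
table = concat_gene_data(data)
means = row_mean(table)
print(list(table))
print(means["g1"])
```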

MediumData ¶

MediumData(
    data,
    conc_col_label="mmol/L",
    conc_unit="mmol/L",
    id_index=False,
    name_index=True,
    id_col_label="human_1",
    name_col_label=None,
)

Bases: BaseData

Stores and processes medium composition data for constraining metabolic models.

This class handles loading medium data (metabolite concentrations), aligning it with exchange reactions in a metabolic model, and applying these concentrations as constraints on reaction bounds.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	DataFrame containing medium composition data. Must include columns for metabolite IDs and concentrations. Optionally includes metabolite names.	required
`conc_col_label`	`str`	The label of the column containing metabolite concentrations in `data`.	`"mmol/L"`
`conc_unit`	`str`	The unit of the concentrations provided in `conc_col_label`. Uses `pint` for unit handling.	`"mmol/L"`
`id_index`	`bool`	If True, assumes the DataFrame index contains the metabolite IDs. If False, uses the column specified by `id_col_label`.	`False`
`name_index`	`bool`	If True, assumes the DataFrame index contains the metabolite names. If False, uses the column specified by `name_col_label`.	`True`
`id_col_label`	`str`	The label of the column containing metabolite IDs, used if `id_index` is False.	`"human_1"`
`name_col_label`	`str`	The label of the column containing metabolite names, used if `name_index` is False. If None and `name_index` is False, names will not be stored.	`None`

Attributes:

Name	Type	Description
`data_dict`	`dict`	Dictionary mapping metabolite IDs to their concentrations.
`rxn_dict`	`dict`	Dictionary mapping exchange reaction IDs to corresponding metabolite concentrations after alignment with a model.
`name_dict`	`dict`	Dictionary mapping metabolite IDs to their names (if available).
`conc_unit`	`Quantity`	The concentration unit parsed by `pint`.

align ¶

align(
    model,
    external_comp_name="e",
    met_id_format="{met_id}{comp}",
    raise_err=False,
)

Aligns medium metabolite data with exchange reactions in a metabolic model.

Iterates through the metabolites in data_dict and attempts to find corresponding exchange reactions in the model. Populates rxn_dict with mappings from reaction IDs to metabolite concentrations.

Parameters:

Name	Type	Description	Default
`model`	`Model or Model`	The metabolic model to align against.	required
`external_comp_name`	`str`	The identifier for the external compartment in the model.	`"e"`
`met_id_format`	`str`	A format string to construct the full metabolite ID in the model, using the metabolite ID from the data (`met_id`) and the `external_comp_name` (`comp`).	`"{met_id}{comp}"`
`raise_err`	`bool`	If True, raises a KeyError if a metabolite from the data cannot be found in the model's external compartment. If False, issues a warning.	`False`

Returns:

Type	Description
`None`
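
The role of `met_id_format` and `raise_err` can be sketched as below. The example only covers the ID-matching step against a set of model metabolite IDs; looking up the corresponding exchange reactions is left out, and `match_external_metabolites` is a hypothetical helper rather than part of pipeGEM.

```python
import warnings


def match_external_metabolites(data_dict, model_met_ids, external_comp_name="e",
                               met_id_format="{met_id}{comp}", raise_err=False):
    """Map medium metabolite IDs to model metabolite IDs in the external compartment."""
    matched = {}
    for met_id, conc in data_dict.items():
        full_id = met_id_format.format(met_id=met_id, comp=external_comp_name)
        if full_id in model_met_ids:
            matched[full_id] = conc
        elif raise_err:
            raise KeyError(full_id)
        else:
            warnings.warn(f"{full_id} not found in the model")
    return matched


medium = {"glc__D": 25.0, "xyz": 1.0}
model_met_ids = {"glc__D_e", "o2_e"}
# BiGG-style IDs separate the compartment with an underscore.
matched = match_external_metabolites(medium, model_met_ids,
                                     met_id_format="{met_id}_{comp}")
print(matched)
```

With the default `raise_err=False`, the unknown metabolite `xyz` only triggers a warning and is skipped.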

apply ¶

apply(
    model,
    cell_dgw=1e-12,
    n_cells_per_l=1000000000.0,
    time_hr=96,
    flux_unit="mmol/g/hr",
    threshold=1e-06,
)

Applies the medium constraints to the bounds of model reactions.

Calculates the maximum possible influx rate for each metabolite based on its concentration, cell density, dry weight, and time. Updates the lower or upper bounds of the corresponding exchange, sink, or demand reactions in the model.

Parameters:

Name	Type	Description	Default
`model`	`Model or Model`	The metabolic model whose reaction bounds will be modified.	required
`cell_dgw`	`float`	Cell dry weight in grams.	`1e-12`
`n_cells_per_l`	`float`	Number of cells per liter of medium.	`1e9`
`time_hr`	`float`	Duration of the experiment or simulation in hours.	`96`
`flux_unit`	`str`	The desired unit for reaction fluxes in the model. The calculated influx bounds will be converted to this unit.	`"mmol/g/hr"`
`threshold`	`float`	A minimum absolute value for the calculated bound. Bounds smaller than this threshold will be set to this value (or its negative). Helps avoid numerical issues with zero bounds.	`1e-6`

Returns:

Type	Description
`None`

Notes

Modifies the model object in place.
Assumes exchange reactions consuming the metabolite have negative stoichiometry.
Sets bounds for unconstrained inorganic exchanges or sinks/demands to 0 if they only produce/consume metabolites, respectively. Issues warnings for others.
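
One plausible reading of the influx calculation is that the available amount of a metabolite is spread over the biomass and the duration of the experiment. The sketch below encodes that reading; the exact formula, the helper name `max_influx`, and the omission of `pint` unit conversion are all assumptions, so treat the numbers as illustrative.

```python
def max_influx(conc_mmol_per_l, cell_dgw=1e-12, n_cells_per_l=1e9,
               time_hr=96, threshold=1e-6):
    """Upper limit on uptake in mmol/g/hr (assumed formula, see lead-in)."""
    biomass_g_per_l = cell_dgw * n_cells_per_l           # g dry weight per liter
    rate = conc_mmol_per_l / (biomass_g_per_l * time_hr)  # mmol / g / hr
    # Bounds smaller than the threshold are raised to the threshold.
    return max(rate, threshold)


glucose_limit = max_influx(25.0)
print(round(glucose_limit, 2))
```

With the default parameters, 25 mmol/L of a metabolite corresponds to roughly 260 mmol/g/hr. Whether the value is written to the lower or the upper bound depends on the direction of the reaction in the model.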

from_catalog `classmethod` ¶

from_catalog(medium, **kwargs)

Load a medium from the built-in :class:~pipeGEM.data.MediumCatalog.

Parameters:

Name	Type	Description	Default
`medium`	`MediumCatalog or str`	A :class:`~pipeGEM.data.MediumCatalog` member or a case-insensitive string matching the enum name (e.g. `'M9'`, `'lb'`, `'DMEM_HIGH_FFA'`).	required
`**kwargs`		Keyword arguments forwarded to :meth:`from_file` / `__init__`. `id_col_label` and `name_index` default to the values stored in the catalog entry but can be overridden here.	`{}`

Returns:

Type	Description
`MediumData`

Raises:

Type	Description
`ValueError`	If medium is a string that does not match any catalog entry.
`TypeError`	If medium is neither a `MediumCatalog` member nor a string.
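
The accepted argument types and the two error cases can be sketched with a small stand-in enum. `DemoCatalog` and `resolve_medium` are hypothetical names used only for this illustration; the real catalog members carry `MediumInfo` values.

```python
from enum import Enum


class DemoCatalog(Enum):
    """Hypothetical stand-in for MediumCatalog."""
    M9 = "M9"
    LB = "LB"
    DMEM_HIGH_FFA = "DMEM_HIGH_FFA"


def resolve_medium(medium):
    """Accept a catalog member or a case-insensitive member name."""
    if isinstance(medium, DemoCatalog):
        return medium
    if isinstance(medium, str):
        by_name = {member.name.lower(): member for member in DemoCatalog}
        if medium.lower() not in by_name:
            raise ValueError(f"unknown medium: {medium!r}")
        return by_name[medium.lower()]
    raise TypeError("medium must be a catalog member or a string")


print(resolve_medium("lb"))
print(resolve_medium(DemoCatalog.M9))
```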

Examples:

>>> m9 = MediumData.from_catalog('M9')
>>> m9 = MediumData.from_catalog(MediumCatalog.M9)

supplement ¶

supplement(met_id, concentration, name=None)

Add or update a metabolite in the medium.

Parameters:

Name	Type	Description	Default
`met_id`	`str`	BiGG (or other scheme) metabolite identifier.	required
`concentration`	`float`	Concentration in the unit stored in :attr:`conc_unit`. Use `float('inf')` for unconstrained species.	required
`name`	`str`	Human-readable name. If omitted, met_id is used as the name.	`None`

Returns:

Type	Description
`MediumData`	`self`, to allow method chaining.

Warns:

Type	Description
`UserWarning`	If :attr:`rxn_dict` is non-empty (the alignment may be stale).

remove ¶

remove(met_id)

Remove a metabolite from the medium.

Parameters:

Name	Type	Description	Default
`met_id`	`str`	Metabolite identifier to remove.	required

Returns:

Type	Description
`MediumData`	`self`, to allow method chaining.

Raises:

Type	Description
`KeyError`	If met_id is not present in :attr:`data_dict`.

Warns:

Type	Description
`UserWarning`	If :attr:`rxn_dict` is non-empty (the alignment may be stale).

combine ¶

combine(other, mode='union', conflict='max')

Combine two media into a new :class:MediumData instance.

Concentrations in other are unit-converted to match self before combining. The original instances are not modified.

Parameters:

Name	Type	Description	Default
`other`	`MediumData`	The second medium.	required
`mode`	`(union, intersection)`	`'union'` — include metabolites from either medium. `'intersection'` — include only metabolites present in both.	`'union'`
`conflict`	`(max, min, sum, first, second)`	How to resolve a metabolite present in both media: `'max'` — use the larger concentration. `'min'` — use the smaller concentration. `'sum'` — add both concentrations. `'first'` — keep `self`'s value. `'second'` — use `other`'s value.	`'max'`

Returns:

Type	Description
`MediumData`	New instance with the combined composition (unit = `self`'s unit).

Raises:

Type	Description
`TypeError`	If other is not a :class:`MediumData` instance.
`ValueError`	If mode or conflict is invalid, or if units are incompatible.
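
The interplay of `mode` and `conflict` can be sketched on plain dictionaries of concentrations. The example assumes both media already share the same unit and uses a hypothetical helper, `combine_media`, in place of the method.

```python
def combine_media(first, second, mode="union", conflict="max"):
    """Combine two {met_id: concentration} dicts without modifying either."""
    resolvers = {
        "max": max,
        "min": min,
        "sum": lambda a, b: a + b,
        "first": lambda a, b: a,
        "second": lambda a, b: b,
    }
    if mode not in ("union", "intersection"):
        raise ValueError(f"invalid mode: {mode!r}")
    if conflict not in resolvers:
        raise ValueError(f"invalid conflict: {conflict!r}")
    shared = first.keys() & second.keys()
    keys = (first.keys() | second.keys()) if mode == "union" else shared
    combined = {}
    for met_id in sorted(keys):
        if met_id in shared:
            combined[met_id] = resolvers[conflict](first[met_id], second[met_id])
        else:
            combined[met_id] = first[met_id] if met_id in first else second[met_id]
    return combined


a = {"glc__D": 5.0, "o2": 1.0}
b = {"glc__D": 10.0, "nh4": 2.0}
print(combine_media(a, b))                                      # union, max
print(combine_media(a, b, mode="intersection", conflict="sum"))
```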

from_file `classmethod` ¶

from_file(file_name='DMEM', csv_kw=None, **kwargs)

Loads medium data from a file.

Supports TSV and CSV formats. Looks for the file in the standard medium/ directory relative to the package structure first. If not found there, attempts to load from the provided file_name path directly.

Parameters:

Name	Type	Description	Default
`file_name`	`str or Path`	The base name of the medium file (e.g., "DMEM", "Hams") or a full path to a custom medium file. The method will try appending ".tsv" first, then assume CSV if not found or if `csv_kw` is provided.	`"DMEM"`
`csv_kw`	`dict`	Keyword arguments to pass directly to `pandas.read_csv`. If provided, CSV reading is prioritized. Example: `{'sep': ',', 'index_col': 0}`.	`None`
`**kwargs`		Additional keyword arguments passed directly to the `MediumData` constructor (`__init__`), such as `conc_col_label`, `id_col_label`, etc.	`{}`

Returns:

Type	Description
`MediumData`	An instance of the MediumData class initialized with the loaded data.

Raises:

Type	Description
`FileNotFoundError`	If the specified file cannot be found either in the default directory or at the provided path.
`Exception`	Propagates exceptions from `pandas.read_csv` or `MediumData.__init__`.

EnzymeData ¶

EnzymeData(
    data: Union[DataFrame],
    gene_id_col: Optional[str] = None,
    prot_id_col: Optional[str] = None,
    rxn_id_col: Optional[str] = None,
    met_id_col: Optional[str] = None,
    mw_col: str = "MW",
    kcat_col: str = "Kcat",
    alt_kcat_col: str = "DLKcat",
    prot_seq_col: str = "Sequence",
    ec_num_col: str = "EC",
    sa_col: str = "SA",
)

Bases: BaseData

Store enzyme kinetic parameters for enzyme-constrained model construction.

Wraps a DataFrame of per-gene (or per-protein) kinetic data — kcat, molecular weight (MW), EC numbers, protein sequences, etc. — and provides methods to align it with a COBRA model and optionally run DLKcat for in-silico kcat prediction.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Enzyme kinetic data. Each row represents one gene/protein entry.	required
`gene_id_col`	`str`	Column to use as gene IDs (index). If `None`, the existing DataFrame index is used.	`None`
`prot_id_col`	`str`	Column containing protein/UniProt IDs. If `None`, gene IDs are used as protein identifiers and a warning is issued.	`None`
`rxn_id_col`	`str`	Column containing reaction IDs. If `None`, reaction mapping is inferred during :meth:`align` via the model's GPR rules.	`None`
`met_id_col`	`str`	Column containing metabolite IDs associated with each reaction.	`None`
`mw_col`	`str`	Column for molecular weight values (default `"MW"`). If absent, MW is inferred from prot_seq_col.	`'MW'`
`kcat_col`	`str`	Column for experimentally measured kcat values (default `"Kcat"`).	`'Kcat'`
`alt_kcat_col`	`str`	Column for alternative (predicted) kcat values, e.g. from DLKcat (default `"DLKcat"`).	`'DLKcat'`
`prot_seq_col`	`str`	Column for amino-acid sequences (default `"Sequence"`).	`'Sequence'`
`ec_num_col`	`str`	Column for EC numbers (default `"EC"`).	`'EC'`
`sa_col`	`str`	Column for specific activity values (default `"SA"`).	`'SA'`

Attributes:

Name	Type	Description
`prot_id_col`	`str or None`	Protein ID column name.
`mw_col, kcat_col, alt_kcat_col, prot_seq_col, ec_num_col, sa_col`	`str`	Column names for the respective fields.

calc_molecular_weight `staticmethod` ¶

calc_molecular_weight(seq: str) -> float

Estimate protein molecular weight from an amino-acid sequence.

Uses average amino-acid residue masses and adds 18.02 Da for the terminal water molecule. Non-standard characters are silently skipped.

Parameters:

Name	Type	Description	Default
`seq`	`str`	One-letter amino-acid sequence.	required

Returns:

Type	Description
`float`	Approximate molecular weight in Daltons.
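
The estimate described above amounts to summing residue masses and adding one water molecule. The sketch below uses commonly tabulated average residue masses; the exact table used by pipeGEM may differ slightly, and `estimate_molecular_weight` is a hypothetical name for the illustration.

```python
# Average residue masses in Da (amino acid minus one water); assumed values.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.02  # terminal H and OH


def estimate_molecular_weight(seq):
    """Approximate molecular weight in Da; unknown characters are skipped."""
    residues = sum(AVERAGE_RESIDUE_MASS.get(aa, 0.0) for aa in seq.upper())
    return residues + WATER_MASS


dipeptide = estimate_molecular_weight("GA")
print(round(dipeptide, 2))
```

For the dipeptide `GA` this gives about 146.15 Da, and a non-standard character such as `X` leaves the result unchanged.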

check_gene_rxn_pair ¶

check_gene_rxn_pair(
    ref_model, raise_err: bool = True
) -> None

Checks if the gene-reaction pairs in the enzyme data exist in the reference model.

Iterates through the enzyme DataFrame and verifies that for each row, the gene (index or gene_id_col) is associated with the reaction specified in _rxn_id_col within the ref_model.

Parameters:

Name	Type	Description	Default
`ref_model`	`Model or Model`	The metabolic model used as a reference.	required
`raise_err`	`bool`	If True, raises a ValueError upon finding a mismatch. If False, issues a warning instead.	`True`

Raises:

Type	Description
`ValueError`	If `raise_err` is True and a mismatch is found.
`AttributeError`	If `_rxn_id_col` is None or not set.
`KeyError`	If a reaction ID from the data is not found in the model.

rxn_items ¶

rxn_items() -> Dict[str, Dict[str, Union[str, float]]]

Returns a dictionary mapping reaction IDs to their best-matched enzyme data.

Requires the .align() method to be called first to populate the _best_matched_df.

Returns:

Type	Description
`Dict[str, Dict[str, Union[str, float]]]`	A dictionary where keys are reaction IDs and values are dictionaries containing 'protein_to_use' (protein ID), 'best_kcat' (kcat value), and 'best_mw' (molecular weight).

Raises:

Type	Description
`AttributeError`	If `.align()` has not been called yet (`_best_matched_df` is None).

run_DLKcat ¶

run_DLKcat(
    met_data: MetaboliteData, device: str = "cpu"
) -> None

Runs the DLKcat tool to predict kcat values.

Parameters:

Name	Type	Description	Default
`met_data`	`MetaboliteData`	Metabolite data object containing SMILES information needed by DLKcat.	required
`device`	`str`	Device to run DLKcat on ('cpu' or 'cuda' if available).	`"cpu"`

Notes

Predictions are stored in alt_kcat_col. Existing positive values in kcat_col are treated as curated values and are not overwritten.

align ¶

align(
    model,
    check_and_raise=True,
    run_DLKcat=True,
    device="cpu",
)

Align enzyme data with a metabolic model.

Maps genes in the enzyme DataFrame to their corresponding reactions and metabolites using the model's GPR rules. Optionally runs DLKcat to predict missing kcat values.

Parameters:

Name	Type	Description	Default
`model`	`Model or Model`	The metabolic model to align against.	required
`check_and_raise`	`bool`	If `True` (default) and `rxn_id_col` was provided at construction, raise on gene–reaction mismatches.	`True`
`run_DLKcat`	`bool`	If `True` (default), attempt to predict kcat values via the DLKcat deep-learning model.	`True`
`device`	`str`	PyTorch device for DLKcat (`"cpu"` or `"cuda"`).	`'cpu'`

MetaboliteData ¶

MetaboliteData(
    data: Union[DataFrame],
    met_id_col: Optional[str] = None,
    smiles_col="SMILES",
)

Bases: BaseData

Store metabolite structural data (SMILES) for enzyme-constrained modelling.

Used by :class:EnzymeData when running DLKcat predictions, which require substrate SMILES strings alongside protein sequences.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	DataFrame containing at least a SMILES column. Rows represent metabolites.	required
`met_id_col`	`str`	Column whose values should be used as the DataFrame index (metabolite IDs). If `None`, the existing index is kept.	`None`
`smiles_col`	`str`	Name of the column holding SMILES strings (default `"SMILES"`).	`'SMILES'`

Raises:

Type	Description
`KeyError`	If smiles_col is not found among the DataFrame columns.

Attributes:

Name	Type	Description
`smiles_col`	`str`	Column name used for SMILES look-ups.

get_smiles ¶

get_smiles(ids)

Retrieve SMILES string(s) for the given metabolite ID(s).

Parameters:

Name	Type	Description	Default
`ids`	`str or list of str`	One or more metabolite IDs.	required

Returns:

Type	Description
`str or ndarray`	A single SMILES string when ids is a `str`, or a NumPy array of strings when ids is a list.

MediumInfo `dataclass` ¶

MediumInfo(
    name: str,
    description: str,
    source: str,
    organism: str,
    medium_type: str,
    default_id_col: str = "BiGG",
    is_approximate: bool = False,
    source_url: str = "",
    composition_url: str = "",
    composition_note: str = "",
)

Metadata for a named medium in the catalog.

Attributes:

Name	Type	Description
`name`	`str`	Filename stem (without .tsv) used to locate the medium file.
`description`	`str`	Short human-readable description.
`source`	`str`	Literature citation or derivation note.
`organism`	`str`	Intended host organism(s).
`medium_type`	`str`	One of `'minimal'`, `'defined'`, `'rich'`, or `'complex'`.
`default_id_col`	`str`	Column label in the TSV that contains metabolite IDs (default `'BiGG'`).
`is_approximate`	`bool`	`True` for undefined/complex media whose composition is estimated.
`source_url`	`str`	URL for the original citation or source publication, when known.
`composition_url`	`str`	URL for the formulation used to build or validate the bundled TSV.
`composition_note`	`str`	Short note describing whether the TSV is an exact defined recipe, an ionized exchange representation, or an approximate proxy.

MediumCatalog ¶

Bases: Enum

Catalog of named media bundled with pipeGEM.

Each member maps to a :class:MediumInfo instance that carries metadata and the TSV filename. Use :meth:~pipeGEM.data.MediumData.from_catalog to load a medium by catalog entry.

Examples:

>>> from pipeGEM.data import MediumCatalog
>>> MediumCatalog.M9.value.description
'M9 minimal salts medium with glucose'
>>> MediumCatalog.LB.value.is_approximate
True

find_local_threshold ¶

find_local_threshold(
    data_df, **kwargs
) -> ALL_THRESHOLD_ANALYSES

Compute per-gene local expression thresholds across multiple samples.

This is a convenience wrapper that creates a "local" threshold finder and calls its :meth:find_threshold method.

Parameters:

Name	Type	Description	Default
`data_df`	`DataFrame`	Expression matrix with genes as rows and samples (or groups) as columns.	required
`**kwargs`		Forwarded to the local threshold finder (e.g. `groups`, `group_dic`).	`{}`

Returns:

Type	Description
`LocalThresholdAnalysis`	Result object containing per-gene threshold values accessible via its `exp_ths` attribute.

load_remote_model ¶

load_remote_model(
    model_id,
    format="mat",
    branch="main",
    download_dest="default",
)

Load a metabolic model from a remote database (BiGG or Metabolic Atlas).

If the model_id is found in the BiGG database, it is loaded directly using cobrapy. Otherwise, it attempts to download the model from the Metabolic Atlas GitHub repository.

Parameters:

Name	Type	Description	Default
`model_id`	`str`	The ID of the model to load.	required
`format`	`str`	The format of the model file to download (e.g., "mat", "xml", "yml"). Defaults to "mat".	`'mat'`
`branch`	`str`	The GitHub branch to download the model from for Metabolic Atlas models. Defaults to "main".	`'main'`
`download_dest`	`str`	The destination directory to download the model to. Defaults to "default", which saves to a 'models' directory relative to the project root.	`'default'`

Returns:

Type	Description
`Model`	The loaded metabolic model.

list_models ¶

list_models(
    databases=["metabolic atlas", "BiGG"],
    organism=None,
    max_n_rxns=np.inf,
    max_n_mets=np.inf,
    max_n_genes=np.inf,
    **kwargs
) -> pd.DataFrame

List available metabolic models from specified databases with optional filtering.

Parameters:

Name	Type	Description	Default
`databases`	`List[str]`	A list of database names to fetch models from (e.g., ["metabolic atlas", "BiGG"]). Defaults to ["metabolic atlas", "BiGG"].	`['metabolic atlas', 'BiGG']`
`organism`	`str`	Filter models by organism name (e.g., "human", "mouse"). Case-insensitive.	`None`
`max_n_rxns`	`float`	Maximum number of reactions allowed in the models. Defaults to infinity.	`inf`
`max_n_mets`	`float`	Maximum number of metabolites allowed in the models. Defaults to infinity.	`inf`
`max_n_genes`	`float`	Maximum number of genes allowed in the models. Defaults to infinity.	`inf`
`**kwargs`		Additional keyword arguments for DataBaseFetcherIniter.	`{}`

Returns:

Type	Description
`DataFrame`	A DataFrame containing information about the available models, including 'id', 'organism', 'reaction_count', 'metabolite_count', 'gene_count', and 'database'. Returns an empty DataFrame if no data is fetched.

fetch_HPA_data ¶

fetch_HPA_data(
    data_name: str,
    data_path: Union[str, Path] = Path(
        __file__
    ).parent.parent.parent
    / Path("external_data/HPA"),
) -> dict

Fetch Human Protein Atlas (HPA) data.

Downloads the specified HPA dataset if it doesn't exist locally.

Parameters:

Name	Type	Description	Default
`data_name`	`str`	The name of the HPA dataset to fetch (e.g., 'rna_tissue_consensus').	required
`data_path`	`Union[str, Path]`	The directory path to save or load the data from. Defaults to 'external_data/HPA' relative to the project root.	`parent / Path('external_data/HPA')`

Returns:

Type	Description
`dict`	A dictionary containing the path to the downloaded TSV file under the key "data_path".

get_syn_gene_data ¶

get_syn_gene_data(
    model: Union[Model, Model],
    n_sample: int,
    n_genes: Optional[int] = None,
    groups: Optional[str] = None,
    random_state: int = 42,
    returned_dtype: str = "DataFrame",
) -> Union[pd.DataFrame, AnnData]

Generate synthetic gene expression data with a given number of samples and genes.

Parameters:

Name	Type	Description	Default
`model`	`Model or Model`	A model containing information about the genes to simulate expression data for.	required
`n_sample`	`int`	The number of samples to generate.	required
`n_genes`	`int`	The number of genes to simulate expression data for. If None, use all genes in the model (default=None).	`None`
`groups`	`str`	The name of the attribute containing group information for the genes (default=None).	`None`
`random_state`	`int`	The random seed to use for generating the data (default=42).	`42`
`returned_dtype`	`str`	The type of object to return. Must be either 'DataFrame' or 'AnnData' (default='DataFrame').	`'DataFrame'`

Returns:

Type	Description
`Union[DataFrame, AnnData]`	The simulated gene expression data. If returned_dtype is 'DataFrame', returns a pandas DataFrame with gene IDs as the index and sample IDs as the columns. If returned_dtype is 'AnnData', returns an AnnData object with the simulated expression data as the X attribute, and empty obs and var attributes.

transform_HPA_data ¶

transform_HPA_data(
    data_df,
    categories: List[str],
    gene_id_col: str = "entrezgene",
    score_col_name: str = "score",
)

Pivot HPA data into a gene × sample expression matrix.

Groups rows by gene_id_col and the specified categories, averages duplicate entries, then pivots so that each unique combination of category values becomes a column (sample).

Parameters:

Name	Type	Description	Default
`data_df`	`DataFrame`	Filtered HPA DataFrame (e.g. output of :func:`unify_score_column`).	required
`categories`	`list of str`	Column names that together define a "sample" (e.g. `["Tissue", "Cell type"]`). Multiple columns are joined with `"_"` to form a single sample label.	required
`gene_id_col`	`str`	Column (or `"index"`) holding gene identifiers (default `"entrezgene"`).	`'entrezgene'`
`score_col_name`	`str`	Column with numeric expression scores (default `"score"`).	`'score'`

Returns:

Type	Description
`dict`	`{"data_df": pd.DataFrame}` — a genes × samples matrix where rows are genes and columns are sample labels.

Raises:

Type	Description
`ValueError`	If categories is empty (at least one sample column is required).
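
The group, average, and pivot steps described above can be sketched with plain pandas. This is a minimal illustration under stated assumptions, not the library's code: `pivot_hpa_sketch` and the temporary `sample` column are hypothetical names, the toy values are made up, and the `"index"` option for `gene_id_col` is not handled.

```python
import pandas as pd


def pivot_hpa_sketch(data_df, categories, gene_id_col="entrezgene",
                     score_col_name="score"):
    """Illustrative stand-in for the documented behaviour."""
    if not categories:
        raise ValueError("categories must contain at least one column name")
    df = data_df.copy()
    # Join the category columns with "_" to form one sample label per row.
    df["sample"] = df[categories].astype(str).agg("_".join, axis=1)
    # Average duplicate (gene, sample) entries, then spread samples into columns.
    matrix = (
        df.groupby([gene_id_col, "sample"])[score_col_name]
        .mean()
        .unstack("sample")
    )
    return {"data_df": matrix}


toy_hpa = pd.DataFrame({
    "entrezgene": ["1", "1", "1", "2"],
    "Tissue": ["liver", "liver", "lung", "liver"],
    "Cell type": ["hepatocytes", "hepatocytes", "alveolar cells", "hepatocytes"],
    "score": [1.0, 3.0, 2.0, 4.0],
})
pivoted = pivot_hpa_sketch(toy_hpa, ["Tissue", "Cell type"])["data_df"]
```

In this toy input, gene `"1"` has two `liver_hepatocytes` rows that are averaged, and gene `"2"` has no lung measurement, so that cell of the matrix is left as `NaN`.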

unify_score_column ¶

unify_score_column(
    data_df: DataFrame,
    level_dic: Dict[str, float],
    score_col_name: str,
) -> (
    pd.DataFrame,
    Dict[str, Dict[str, Tuple[float, float]]],
)

Convert heterogeneous HPA expression columns into a single score.

Handles three cases depending on which columns are present:

Level count columns (e.g. "High", "Medium", "Low"): compute a weighted average using level_dic as weights and return continuous CORDA thresholds.
A "Level" column with discrete labels: map labels to numeric scores via level_dic and return discrete CORDA thresholds.
Quantitative columns ("pTPM" or "NX"): rename the first matching column to score_col_name and return None thresholds (thresholding left to downstream methods).

Parameters:

Name	Type	Description	Default
`data_df`	`DataFrame`	HPA expression DataFrame (one row per gene × tissue/cell-type).	required
`level_dic`	`dict[str, float]`	Mapping from expression-level labels (e.g. `"High"`) to numeric weights. Also used to map discrete `"Level"` labels.	required
`score_col_name`	`str`	Name for the unified score column added to the returned DataFrame.	required

Returns:

Type	Description
`dict`	`{"data_df": pd.DataFrame, "used_rxn_thres": dict or None}` — the updated DataFrame and the CORDA threshold dictionary (or `None` when quantitative data is used).
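
To make the first two cases concrete, the sketch below computes a count-weighted average over level columns and a direct label lookup for a `"Level"` column. It assumes the weighted average is normalised by the total count per row, which this reference does not spell out; the helper name `weighted_level_score` and the weights in `level_dic` are hypothetical, and the CORDA thresholds are not reproduced.

```python
import pandas as pd

# Hypothetical weights; real values are supplied by the caller.
level_dic = {"High": 3.0, "Medium": 2.0, "Low": 1.0}


def weighted_level_score(df, level_dic):
    """Average the level weights, weighted by the per-level counts in each row."""
    levels = [lv for lv in level_dic if lv in df.columns]
    weights = pd.Series({lv: level_dic[lv] for lv in levels})
    counts = df[levels]
    # Assumed normalisation: sum(count * weight) / sum(count).
    return counts.mul(weights, axis=1).sum(axis=1) / counts.sum(axis=1)


# Case 1: level count columns.
counts_df = pd.DataFrame(
    {"High": [2, 0], "Medium": [1, 0], "Low": [1, 4]},
    index=["geneA", "geneB"],
)
count_scores = weighted_level_score(counts_df, level_dic)

# Case 2: a single "Level" column with discrete labels.
labels_df = pd.DataFrame({"Level": ["High", "Low", "Medium"]})
labels_df["score"] = labels_df["Level"].map(level_dic)
```

With this normalisation, a row whose counts are all zero yields `NaN`, so such rows would need separate handling.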

translate_gene_id ¶

translate_gene_id(
    data_df: DataFrame,
    map_df: DataFrame,
    gene_col: str,
    to_id: str,
)

Translate gene identifiers in a DataFrame using a mapping.

Adds a new column (or replaces the index) with the translated IDs and drops rows that could not be mapped.

Parameters:

Name	Type	Description	Default
`data_df`	`DataFrame`	DataFrame containing the original gene identifiers.	required
`map_df`	`DataFrame or dict`	Mapping from original IDs to target IDs. If a DataFrame, it is used via :meth:`pandas.Series.map`; if a dict, keys are original IDs and values are translated IDs.	required
`gene_col`	`str`	Column in data_df that holds the source gene IDs. Use `"index"` to translate the DataFrame index instead.	required
`to_id`	`str`	Name for the new translated-ID column. If `"index"`, the translated IDs replace the DataFrame index.	required

Returns:

Type	Description
`dict`	`{"data_df": pd.DataFrame, "gene_id_col": str}` — the updated DataFrame and the name of the translated-ID column.
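
The map-then-drop behaviour can be illustrated with `pandas.Series.map` and a dict mapping. This is a sketch of the documented behaviour rather than the library's code: `translate_ids_sketch`, the toy identifiers, and the column names are hypothetical, and only the dict form of the mapping is covered.

```python
import pandas as pd


def translate_ids_sketch(data_df, mapping, gene_col, to_id):
    """Illustrative stand-in: translate IDs and drop rows without a match."""
    source = data_df.index.to_series() if gene_col == "index" else data_df[gene_col]
    # IDs missing from the mapping become NaN.
    translated = source.map(mapping)
    keep = translated.notna()
    df = data_df.loc[keep.values].copy()
    if to_id == "index":
        df.index = translated[keep].values
    else:
        df[to_id] = translated[keep].values
    return {"data_df": df, "gene_id_col": to_id}


toy_scores = pd.DataFrame({
    "ensembl": ["ENSG_A", "ENSG_B", "ENSG_C"],
    "score": [1.0, 2.0, 3.0],
})
toy_mapping = {"ENSG_A": "GENE_A", "ENSG_C": "GENE_C"}
translated_out = translate_ids_sketch(
    toy_scores, toy_mapping, gene_col="ensembl", to_id="symbol"
)
```

Here `ENSG_B` has no entry in the mapping, so its row is dropped and the result keeps two rows with a new `symbol` column.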

get_gene_id_map ¶

get_gene_id_map(
    gene_names: List[str],
    from_id: str,
    to_id: str,
    df_path: Union[PathLike, str],
    dataset: Union[str, Dict] = "hsapiens_gene_ensembl",
    ds_kws: Optional[Dict] = None,
    map_type: str = "df",
    drop_unused: bool = False,
    ref_model: Optional[Model] = None,
)

Get a gene ID mapper from local path or BioMart.

Parameters:

Name	Type	Description	Default
`gene_names`	`list of str`	The gene names or IDs to translate into another type of gene name or ID.	required
`from_id`	`str`	The name of the current IDs (e.g. `"ensembl_gene_id"`).	required
`to_id`	`str`	The name of the transformed IDs (e.g. `"external_gene_name"`).	required
`df_path`	`path-like or str`	Path to cache the mapping DataFrame as a TSV file. If `None`, the mapping is fetched from BioMart without caching.	required
`dataset`	`str`	BioMart dataset name (default `"hsapiens_gene_ensembl"`).	`'hsapiens_gene_ensembl'`
`ds_kws`	`dict`	Backward-compatible dataset keyword dictionary used by earlier versions. If supplied, `name`, `dataset`, or `dataset_name` is used as the BioMart dataset name.	`None`
`map_type`	`str`	`"df"` to return a DataFrame, `"dict"` to return a dict.	`'df'`
`drop_unused`	`bool`	If `True`, drop genes not present in ref_model.	`False`
`ref_model`	`Model`	Reference model used when drop_unused is `True`.	`None`

Returns:

Type	Description
`dict`	`{"map_df": ...}` where the value is a DataFrame or dict.
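
One plausible reading of the `df_path` caching behaviour is "load the TSV if it exists, otherwise fetch and save". The sketch below shows that pattern under this assumption. It does not call BioMart: `cached_id_map` is a hypothetical helper, and `fake_fetch` stands in for the remote query and returns made-up identifiers.

```python
import tempfile
from pathlib import Path

import pandas as pd


def cached_id_map(df_path, fetch, from_id, to_id, map_type="df"):
    """Return a mapping, reading a TSV cache when present (assumed behaviour)."""
    path = None if df_path is None else Path(df_path)
    if path is not None and path.exists():
        map_df = pd.read_csv(path, sep="\t", dtype=str)
    else:
        map_df = fetch()  # stand-in for the remote query
        if path is not None:
            map_df.to_csv(path, sep="\t", index=False)
    if map_type == "dict":
        return {"map_df": dict(zip(map_df[from_id], map_df[to_id]))}
    return {"map_df": map_df}


fetch_calls = []


def fake_fetch():
    # Hypothetical replacement for a BioMart request.
    fetch_calls.append(1)
    return pd.DataFrame({
        "ensembl_gene_id": ["ENSG_A", "ENSG_B"],
        "external_gene_name": ["GENE_A", "GENE_B"],
    })


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "id_map.tsv"
    first = cached_id_map(cache_file, fake_fetch,
                          "ensembl_gene_id", "external_gene_name")
    second = cached_id_map(cache_file, fake_fetch,
                           "ensembl_gene_id", "external_gene_name",
                           map_type="dict")
```

The second call finds the cached file and does not invoke the fetch function again, which is the point of supplying a cache path.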
