Curate DataFrames and AnnDatas
Curating datasets typically means three things:
Validate: ensure a dataset meets predefined validation criteria
Standardize: transform a dataset so that it meets validation criteria, e.g., by fixing typos or using standardized identifiers
Annotate: link a dataset against metadata records
In LaminDB, valid metadata is metadata that’s stored in a metadata registry, and validation criteria merely define a mapping onto a field of a registry.
Example: "Experiment 1" is a valid value for ULabel.name if a record with this name exists in the ULabel registry.
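The definition above can be illustrated with a conceptual sketch in plain Python. This is not how LaminDB stores or queries records: the registry is modeled as a simple set of names, and both the `is_valid` helper and the registry content are hypothetical.

```python
# Conceptual sketch only: a registry field modeled as a set of names.
# The content below is hypothetical, not read from a real LaminDB instance.
ulabel_names = {"Experiment 1", "Experiment 2"}


def is_valid(value, registry_field):
    """A value is valid if a record with this value exists in the registry field."""
    return value in registry_field


print(is_valid("Experiment 1", ulabel_names))
print(is_valid("Experiment 3", ulabel_names))
```

The point of the sketch is that validity is defined relative to the current registry content rather than by a fixed schema: a value becomes valid once a matching record exists.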
# !pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --schema bionty
→ connected lamindb: testuser1/test-curate
Validate a DataFrame
Let’s start with a DataFrame that we’d like to validate.
import lamindb as ln
import bionty as bt
import pandas as pd
df = pd.DataFrame(
    {
        "temperature": [37.2, 36.3, 38.2],
        "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
        "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
        "donor": ["D0001", "D0002", "DOOO3"],
    },
    index=["obs1", "obs2", "obs3"],
)
df
→ connected lamindb: testuser1/test-curate
| | temperature | cell_type | assay_ontology_id | donor |
|---|---|---|---|---|
| obs1 | 37.2 | cerebral pyramidal neuron | EFO:0008913 | D0001 |
| obs2 | 36.3 | astrocyte | EFO:0008913 | D0002 |
| obs3 | 38.2 | oligodendrocyte | EFO:0008913 | DOOO3 |
Define validation criteria and create a Curator object.
# in the dictionary, each key is a column name of the dataframe, and each value
# is a registry field onto which values are mapped
categoricals = {
    "cell_type": bt.CellType.name,
    "assay_ontology_id": bt.ExperimentalFactor.ontology_id,
    "donor": ln.ULabel.name,
}
# pass validation criteria
curate = ln.Curator.from_df(df, categoricals=categoricals)
✓ added 3 records with Feature.name for columns: 'cell_type', 'assay_ontology_id', 'donor'
• 1 non-validated values are not saved in Feature.name: ['temperature']!
→ to lookup values, use lookup().columns
→ to save, run add_new_from_columns
The validate() method checks our data against the defined criteria. It identifies which values are already validated (exist in our registries) and which are potentially problematic (do not yet exist in our registries).
curate.validate()
• saving validated records of 'cell_type'
✓ added 2 records from public with CellType.name for cell_type: 'astrocyte', 'oligodendrocyte'
! 1 non-validated values are not saved in CellType.name: ['cerebral pyramidal neuron']!
→ to lookup values, use lookup().cell_type
→ to save, run .add_new_from('cell_type')
• saving validated records of 'assay_ontology_id'
• saving validated records of 'donor'
! 3 non-validated values are not saved in ULabel.name: ['D0001', 'D0002', 'DOOO3']!
→ to lookup values, use lookup().donor
→ to save, run .add_new_from('donor')
• mapping cell_type on CellType.name
! 1 terms is not validated: 'cerebral pyramidal neuron'
→ fix typos, remove non-existent values, or save terms via .add_new_from('cell_type')
✓ assay_ontology_id is validated against ExperimentalFactor.ontology_id
• mapping donor on ULabel.name
! 3 terms are not validated: 'D0001', 'D0002', 'DOOO3'
→ fix typos, remove non-existent values, or save terms via .add_new_from('donor')
False
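To make the logic behind this report concrete, here is a minimal sketch in plain Python of what a per-column check amounts to. It is an illustration under stated assumptions, not LaminDB's implementation: `validate_columns` is a hypothetical helper, and the registries are hypothetical sets that stand in for the state before any donor was registered.

```python
def validate_columns(columns, registries):
    """Return (all_valid, non_validated) for the given categorical columns.

    columns: dict mapping column name -> list of values
    registries: dict mapping column name -> set of registered values
    """
    non_validated = {}
    for name, registered in registries.items():
        # dict.fromkeys drops duplicates while keeping first-seen order
        missing = [v for v in dict.fromkeys(columns[name]) if v not in registered]
        if missing:
            non_validated[name] = missing
    return len(non_validated) == 0, non_validated


columns = {
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "donor": ["D0001", "D0002", "DOOO3"],
}
# Hypothetical registry content for illustration
registries = {
    "cell_type": {"astrocyte", "oligodendrocyte", "cerebral cortex pyramidal neuron"},
    "donor": set(),  # no donors registered yet
}

ok, non_validated = validate_columns(columns, registries)
print(ok)
print(non_validated)
```

Columns that are not listed in the mapping, such as temperature, are simply not checked in this sketch, mirroring the fact that only the categorical columns passed to the curator are validated.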
Register new metadata values
If you see “non-validated” values, you’ll need to decide whether to add them to your registries or “fix” them in your dataset.
For cell_type, we saw that ‘cerebral pyramidal neuron’ is not validated. Let’s find out which cell type in the public ontology might be the actual match.
# use a lookup object to get the correct spelling of categories from a public ontology via `public=True`
lookup = curate.lookup(public=True)
lookup
Lookup objects from the public:
.cell_type
.assay_ontology_id
.donor
.columns
Example:
→ categories = validator.lookup()['cell_type']
→ categories.alveolar_type_1_fibroblast_cell
To look up public ontologies, use .lookup(public=True)
# here is an example for the cell_type column
cell_types = lookup["cell_type"]
cell_types.cerebral_cortex_pyramidal_neuron
CellType(ontology_id='CL:4023111', name='cerebral cortex pyramidal neuron', definition='A Pyramidal Neuron With Soma Located In The Cerebral Cortex.', synonyms=None, parents=array(['CL:0000598', 'CL:0010012'], dtype=object))
# fix the cell type
df.cell_type = df.cell_type.replace({"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name})
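When browsing a lookup object is impractical, a fuzzy string match can help shortlist candidates for a non-validated value. The sketch below uses `difflib` from the Python standard library rather than any LaminDB API, and `public_names` is a hypothetical, hand-written list of ontology names; in practice the candidates would come from the public ontology.

```python
import difflib

# Hypothetical list of public ontology names for illustration
public_names = ["astrocyte", "oligodendrocyte", "cerebral cortex pyramidal neuron"]

value = "cerebral pyramidal neuron"
# Return at most one candidate whose similarity ratio is at least 0.6
suggestions = difflib.get_close_matches(value, public_names, n=1, cutoff=0.6)

# Build a replacement mapping only if a candidate was found
mapping = {value: suggestions[0]} if suggestions else {}

cell_types = ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"]
standardized = [mapping.get(v, v) for v in cell_types]
print(standardized)
```

A fuzzy match is only a suggestion: string similarity says nothing about biological equivalence, so each proposed replacement should be reviewed before it is applied to the dataset.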
For donor, we want to add the new donors: ‘D0001’, ‘D0002’, ‘DOOO3’.
# this adds donors that were _not_ validated
curate.add_new_from("donor")
✓ added 3 records with ULabel.name for donor: 'D0001', 'D0002', 'DOOO3'
# validate again
validated = curate.validate()
validated
• saving validated records of 'cell_type'
• saving validated records of 'assay_ontology_id'
• saving validated records of 'donor'
✓ cell_type is validated against CellType.name
✓ assay_ontology_id is validated against ExperimentalFactor.ontology_id
✓ donor is validated against ULabel.name
True
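The register-then-revalidate cycle can be sketched in plain Python as follows. This is a conceptual illustration with a set standing in for the ULabel registry; `register_new` is a hypothetical helper and not part of LaminDB.

```python
def register_new(values, registry):
    """Add values that are not yet registered; return the newly added ones."""
    new = [v for v in dict.fromkeys(values) if v not in registry]
    registry.update(new)
    return new


donor_registry = set()  # hypothetical, initially empty registry
donors = ["D0001", "D0002", "DOOO3"]

valid_before = all(d in donor_registry for d in donors)
added = register_new(donors, donor_registry)
valid_after = all(d in donor_registry for d in donors)

print(valid_before, added, valid_after)
```

Because registering a value makes it valid by definition, it is worth reviewing new values for typos before adding them; once added, a later validation run will no longer flag them.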
Validate an AnnData
Here we additionally specify which registry field to validate var_index against.
import anndata as ad
X = pd.DataFrame(
    {
        "ENSG00000081059": [1, 2, 3],
        "ENSG00000276977": [4, 5, 6],
        "ENSG00000198851": [7, 8, 9],
        "ENSG00000010610": [10, 11, 12],
        "ENSG00000153563": [13, 14, 15],
        "corrupted": [16, 17, 18],
    },
    index=df.index,
)
adata = ad.AnnData(X=X, obs=df)
adata
AnnData object with n_obs × n_vars = 3 × 6
obs: 'temperature', 'cell_type', 'assay_ontology_id', 'donor'
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,  # validate var.index against Gene.ensembl_gene_id
    categoricals=categoricals,
    organism="human",
)
• 1 non-validated values are not saved in Feature.name: ['temperature']!
→ to lookup values, use lookup().columns
→ to save, run add_new_from_columns
curate.validate()
• saving validated records of 'var_index'
✓ added 5 records from public with Gene.ensembl_gene_id for var_index: 'ENSG00000081059', 'ENSG00000276977', 'ENSG00000198851', 'ENSG00000010610', 'ENSG00000153563'
! 1 non-validated values are not saved in Gene.ensembl_gene_id: ['corrupted']!
→ to lookup values, use lookup().var_index
→ to save, run .add_new_from_var_index()
• saving validated terms of 'cell_type'
• saving validated terms of 'assay_ontology_id'
• saving validated terms of 'donor'
• mapping var_index on Gene.ensembl_gene_id
! 1 terms is not validated: 'corrupted'
→ fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
✓ cell_type is validated against CellType.name
✓ assay_ontology_id is validated against ExperimentalFactor.ontology_id
✓ donor is validated against ULabel.name
False
Non-validated terms can be accessed via:
curate.non_validated
{'var_index': ['corrupted']}
Subset the AnnData to validated genes only:
adata_validated = adata[:, ~adata.var.index.isin(curate.non_validated["var_index"])].copy()
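The subsetting expression builds a boolean mask over the variable names and keeps only the columns that are not flagged. The same idea is sketched below in plain Python, with the matrix as a list of rows instead of an AnnData object; the variable names and values are copied from the example above.

```python
var_names = [
    "ENSG00000081059",
    "ENSG00000276977",
    "ENSG00000198851",
    "ENSG00000010610",
    "ENSG00000153563",
    "corrupted",
]
# One row per observation (obs1, obs2, obs3), one column per variable
X = [
    [1, 4, 7, 10, 13, 16],
    [2, 5, 8, 11, 14, 17],
    [3, 6, 9, 12, 15, 18],
]
non_validated = {"var_index": ["corrupted"]}

# True for columns to keep, i.e. names that are not flagged as non-validated
keep = [name not in non_validated["var_index"] for name in var_names]

var_validated = [name for name, k in zip(var_names, keep) if k]
X_validated = [[x for x, k in zip(row, keep) if k] for row in X]

print(var_validated)
print(X_validated)
```

Dropping the column removes its measurements from the dataset, so this is appropriate when the variable cannot be mapped to a valid identifier; if the name is merely misspelled, correcting it is the less lossy option.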
Now let’s validate the subsetted object:
curate = ln.Curator.from_anndata(
    adata_validated,
    var_index=bt.Gene.ensembl_gene_id,  # validate var.index against Gene.ensembl_gene_id
    categoricals=categoricals,
    organism="human",
)
curate.validate()
• 1 non-validated values are not saved in Feature.name: ['temperature']!
→ to lookup values, use lookup().columns
→ to save, run add_new_from_columns
• saving validated records of 'var_index'
• saving validated terms of 'cell_type'
• saving validated terms of 'assay_ontology_id'
• saving validated terms of 'donor'
✓ var_index is validated against Gene.ensembl_gene_id
✓ cell_type is validated against CellType.name
✓ assay_ontology_id is validated against ExperimentalFactor.ontology_id
✓ donor is validated against ULabel.name
True
Save a curated artifact
The validated object can subsequently be saved as an Artifact:
artifact = curate.save_artifact(description="test AnnData")
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run
! 1 unique term (25.00%) is not validated for name: temperature
Validated features and labels are linked to the artifact:
artifact.describe()
Artifact(uid='ERiOfpYONO5QXonQ0000', is_latest=True, description='test AnnData', suffix='.h5ad', type='dataset', size=20336, hash='6dfQCkZFszTuTqs3omVY-w', n_observations=3, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-10-19 15:04:45 UTC)
Provenance
.storage = '/home/runner/work/lamindb/lamindb/docs/test-curate'
.created_by = 'testuser1'
Labels
.cell_types = 'astrocyte', 'oligodendrocyte', 'cerebral cortex pyramidal neuron'
.experimental_factors = 'single-cell RNA sequencing'
.ulabels = 'D0001', 'D0002', 'DOOO3'
Features
'assay_ontology_id' = 'single-cell RNA sequencing'
'cell_type' = 'astrocyte', 'cerebral cortex pyramidal neuron', 'oligodendrocyte'
'donor' = 'D0001', 'D0002', 'DOOO3'
Feature sets
'var' = 'TCF7', 'PDCD1', 'CD3E', 'CD4', 'CD8A'
'obs' = 'cell_type', 'assay_ontology_id', 'donor'
We’ve walked through validating, standardizing, and annotating datasets in these key steps:
Defining validation criteria
Validating data against existing registries
Adding new validated entries to registries
Annotating artifacts with validated metadata
By following these steps, you can ensure your data is standardized and well-curated.
If you have datasets that aren’t DataFrame-like or AnnData-like, read: Curate datasets of any format.