Add derivative filename validation#70

Draft
astewartau wants to merge 5 commits into main from validate-derivatives

@astewartau astewartau commented Mar 5, 2026

Summary

  • Adds schema.rules.files.deriv to the regex chain in BIDSValidator._init_regexes() so derivative filenames are recognized as valid BIDS
  • Derivative-specific entities (space, desc, res, den, label, atlas, seg, hemi, scale) are now parsed correctly by parse() and accepted by is_bids()

Before

>>> BIDSValidator.is_bids("/sub-01/anat/sub-01_space-MNI152_desc-preproc_T1w.nii.gz")
False
>>> BIDSValidator.parse("/sub-01/anat/sub-01_space-MNI152_desc-preproc_T1w.nii.gz")
{}

After

>>> BIDSValidator.is_bids("/sub-01/anat/sub-01_space-MNI152_desc-preproc_T1w.nii.gz")
True
>>> BIDSValidator.parse("/sub-01/anat/sub-01_space-MNI152_desc-preproc_T1w.nii.gz")
{'subject': '01', 'datatype': 'anat', 'space': 'MNI152', 'description': 'preproc', 'suffix': 'T1w', 'extension': '.nii.gz'}

Closes #62

Test plan

  • New tests/test_derivatives.py with 12 parametrized tests (valid filenames, entity parsing, invalid filenames)
  • Existing tests pass

Include schema.rules.files.deriv in the regex chain so derivative
filenames (with entities like space, desc, res, den) are recognized
as valid BIDS.

Closes #62
@astewartau
Contributor Author

One question to discuss: should derivative filenames only be accepted under bids/derivatives/? This PR does not enforce that; derivative filenames are currently accepted regardless of where they sit in the dataset.
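To make the question concrete, here is a minimal sketch of a location check, assuming derivative datasets live in a top-level derivatives/ directory of the dataset root. `is_under_derivatives` is a hypothetical helper written for this discussion; it is not part of this PR or of BIDSValidator.

```python
from pathlib import PurePosixPath


def is_under_derivatives(relpath: str) -> bool:
    """Return True if a dataset-relative path sits inside a top-level derivatives/ directory.

    Hypothetical helper for discussion only.
    """
    parts = PurePosixPath(relpath).parts
    # Drop the leading "/" carried by dataset-relative paths such as "/sub-01/...".
    if parts and parts[0] == "/":
        parts = parts[1:]
    # Require at least one component after "derivatives".
    return len(parts) > 1 and parts[0] == "derivatives"
```

A check like this could gate whether the derivative rules are consulted for a given path, at the cost of rejecting standalone derivative datasets that are not nested under a raw dataset.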

@codecov

codecov bot commented Mar 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.98%. Comparing base (878d57d) to head (640b2fc).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #70      +/-   ##
==========================================
+ Coverage   90.89%   90.98%   +0.08%     
==========================================
  Files          13       14       +1     
  Lines         846      854       +8     
  Branches      124      124              
==========================================
+ Hits          769      777       +8     
  Misses         47       47              
  Partials       30       30              

@effigies
Contributor

effigies commented Mar 6, 2026

I just want to check what your goal is here. Is this intended to flesh out the BIDSValidator object for PyBIDS use, or is it to work on the full validator?

If it's the former, then you're right that we should distinguish between raw and derivative datasets, instead of declaring derivative files always valid. I might suggest giving BIDSValidator.__init__ a dataset_type parameter that enables derivative rules.
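A rough sketch of what a dataset_type switch could look like, using two hand-written toy patterns in place of the schema-generated regexes. `SketchValidator`, `RAW_PATTERNS` and `DERIV_PATTERNS` are illustrative names only, and the accepted values "raw" and "derivative" are assumed to mirror the DatasetType field of dataset_description.json; the real BIDSValidator builds its regexes from the schema.

```python
import re

# Toy patterns standing in for the schema-generated regexes (assumption).
RAW_PATTERNS = [
    r"/sub-(?P<subject>[0-9a-zA-Z]+)/anat/sub-(?P=subject)_T1w\.nii\.gz",
]
DERIV_PATTERNS = [
    r"/sub-(?P<subject>[0-9a-zA-Z]+)/anat/sub-(?P=subject)"
    r"(_space-[0-9a-zA-Z]+)?(_desc-[0-9a-zA-Z]+)?_T1w\.nii\.gz",
]


class SketchValidator:
    """Illustrative validator whose derivative rules are opt-in."""

    def __init__(self, dataset_type: str = "raw") -> None:
        if dataset_type not in ("raw", "derivative"):
            raise ValueError(f"Unknown dataset_type: {dataset_type!r}")
        patterns = list(RAW_PATTERNS)
        # Derivative rules are only added when explicitly requested.
        if dataset_type == "derivative":
            patterns += DERIV_PATTERNS
        self._regexes = [re.compile(p) for p in patterns]

    def is_bids(self, path: str) -> bool:
        return any(rx.fullmatch(path) for rx in self._regexes)
```

With this shape, `SketchValidator()` keeps rejecting the derivative filename from the PR description, while `SketchValidator(dataset_type="derivative")` accepts it.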

If it's the latter, then we need to do some design. Here's the current validation loop:

def validate(tree: FileTree, schema: Namespace) -> None:
    """Check if the file path is BIDS compliant.

    Parameters
    ----------
    tree : FileTree
        Full FileTree object to iterate over and check
    schema : Namespace
        Schema object to validate dataset against
    """
    validator = BIDSValidator()
    dataset = Dataset(tree, schema)

    for file in walk(tree, dataset):
        if not validator.is_bids(file.path):
            print(f'{file.path} is not a valid bids filename')

file is a Context:

@attrs.define
class Context:
    """A context object that creates context for file on access."""

    file: FileTree
    dataset: Dataset
    subject: ctx.Subject | None
    file_parts: FileParts = attrs.field(init=False)

    def __attrs_post_init__(self) -> None:
        self.file_parts = FileParts.from_file(self.file, self.schema)

    @property
    def schema(self) -> Namespace:
        """The BIDS specification schema."""
        return self.dataset.schema

    @property
    def path(self) -> str:
        """Path of the current file."""
        return self.file_parts.path

    @property
    def entities(self) -> dict[str, str | None]:
        """Entities parsed from the current filename."""
        return self.file_parts.entities

    @property
    def datatype(self) -> str | None:
        """Datatype of current file, for examples, anat."""
        return self.file_parts.datatype

    @property
    def suffix(self) -> str | None:
        """Suffix of current file."""
        return self.file_parts.suffix

    @property
    def extension(self) -> str | None:
        """Extension of current file including initial dot."""
        return self.file_parts.extension

    @property
    def modality(self) -> str | None:
        """Modality of current file, for examples, MRI."""
        if (datatype := self.file_parts.datatype) is not None:
            return datatype_to_modality(datatype, self.schema)
        return None

    @property
    def size(self) -> int:
        """Length of the current file in bytes."""
        return self.file.path_obj.stat().st_size

    @property
    def associations(self) -> ctx.Associations:
        """Associated files, indexed by suffix, selected according to the inheritance principle."""
        return ctx.Associations()

    @property
    def columns(self) -> Namespace | None:
        """TSV columns, indexed by column header, values are arrays with column contents."""
        if self.extension == '.tsv':
            return load_tsv(self.file)
        elif self.extension == '.tsv.gz':
            columns = tuple(self.sidecar.Columns) if self.sidecar else ()
            return load_tsv_gz(self.file, columns)
        return None

    @property
    def json(self) -> Namespace | None:
        """Contents of the current JSON file."""
        if self.file_parts.extension == '.json':
            return Namespace(load_json(self.file))
        return None

    @property
    def gzip(self) -> None:
        """Parsed contents of gzip header."""
        pass

    @cached_property
    def nifti_header(self) -> ctx.NiftiHeader | None:
        """Parsed contents of NIfTI header referenced elsewhere in schema."""
        if self.extension in ('.nii', '.nii.gz'):
            return load_nifti_header(self.file)
        return None

    @property
    def ome(self) -> None:
        """Parsed contents of OME-XML header, which may be found in OME-TIFF or OME-ZARR files."""
        pass

    @property
    def tiff(self) -> None:
        """TIFF file format metadata."""
        pass

    @property
    def sidecar(self) -> Namespace | None:
        """Sidecar metadata constructed via the inheritance principle."""
        sidecar = load_sidecar(self.file) or {}
        return Namespace(sidecar)

We could rename it context.

Broadly speaking we need something like:

partials = []
for rule in file_rules:
    if full_match(context, rule):
        break
    if partial_match(context, rule):
        partials.append(rule)
else:  # Break is never called
    suggestion = generate_suggestion(partials)  # May be empty
    error("UNKNOWN_FILE", suggestion)
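The for/else shape above can be tried out in isolation. The following is a toy sketch only: `ToyRule`, `check`, and the string-based `full_match` / `partial_match` are hypothetical stand-ins that match on suffix and extension, not the schema-driven matching the real validator would need.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToyRule:
    suffix: str
    extension: str


def full_match(name: str, rule: ToyRule) -> bool:
    # Both suffix and extension agree with the rule.
    return name.endswith(f"_{rule.suffix}{rule.extension}")


def partial_match(name: str, rule: ToyRule) -> bool:
    # The suffix agrees with the rule; the extension is not checked.
    stem = name.split(".", 1)[0]
    return stem.endswith(f"_{rule.suffix}")


def check(name: str, rules: list[ToyRule]) -> str | None:
    """Return None if some rule fully matches, otherwise an error message."""
    partials = []
    for rule in rules:
        if full_match(name, rule):
            break
        if partial_match(name, rule):
            partials.append(rule)
    else:  # Runs only when the loop finished without hitting break
        hint = ", ".join(f"{r.suffix}{r.extension}" for r in partials)
        suggestion = f" (did you mean: {hint}?)" if hint else ""
        return f"UNKNOWN_FILE: {name}{suggestion}"
    return None
```

Collecting partial matches only costs anything on the failure path, since a full match breaks out of the loop before the suggestion is built.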

We don't currently have a rule data structure, but we can create dataclasses to match the types of rules:

@attrs.define
class PathRule:
    selectors: list[str]
    level: Literal["optional", "required"]
    path: str

@attrs.define
class StemRule:
    selectors: list[str]
    level: Literal["optional", "required"]
    stem: str
    datatypes: list[str]
    extensions: list[str]

@attrs.define
class SuffixRule:
    selectors: list[str]
    level: Literal["optional", "required"]
    suffixes: list[str]
    entities: dict[str, Literal["optional", "required"]]
    datatypes: list[str]
    extensions: list[str]

Possibly our full_match could look like:

def full_match(context: Context, rule: PathRule | StemRule | SuffixRule) -> bool:
    match rule:
        case PathRule(path=path):
            return context.path == path
        case StemRule(stem=stem, datatypes=dtypes, extensions=exts):
            return fnmatch(context.file.name, stem) and ...
        case SuffixRule(suffixes=suffixes, ...):
            return context.suffix in suffixes 

Anyway, that got a bit long. What's the target?

@bendhouseart

> If it's the former, then you're right that we should distinguish between raw and derivative datasets, instead of declaring derivative files always valid. I might suggest giving BIDSValidator.__init__ a dataset_type parameter that enables derivative rules.

I believe it was the former: the goal is to help validate generated (or existing) filenames from PyBIDS and other tools.


Development

Successfully merging this pull request may close these issues.

Validate Derivatives

3 participants