wf_psf.data.data_adapter

Data Adapter.

This module manages dataset lifecycle transitions for the WF-PSF pipeline.

Overview

Two orthogonal state machines are maintained:

1. Structure state

  • COMPLETE → SPLIT via split_data()

  • SPLIT → COMPLETE via join_data()

2. Representation state

  • NUMPY → TENSORFLOW via convert_to_tensorflow()
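
As an illustration of what "orthogonal" means here, the two axes can be modelled as independent fields, so that a transition on one never touches the other. This is a minimal sketch, not the module's implementation: the enum members and values mirror the ones documented on this page, while AdapterState and its method names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class StructureState(Enum):
    COMPLETE = 1
    SPLIT = 2


class RepresentationState(Enum):
    NUMPY = 1
    TENSORFLOW = 2


@dataclass
class AdapterState:
    """Hypothetical holder for the two independent state axes."""

    structure: StructureState = StructureState.COMPLETE
    representation: RepresentationState = RepresentationState.NUMPY

    def split(self) -> None:
        # COMPLETE -> SPLIT; the representation axis is untouched.
        self.structure = StructureState.SPLIT

    def join(self) -> None:
        # SPLIT -> COMPLETE; the representation axis is untouched.
        self.structure = StructureState.COMPLETE

    def to_tensorflow(self) -> None:
        # NUMPY -> TENSORFLOW; the structure axis is untouched.
        self.representation = RepresentationState.TENSORFLOW


state = AdapterState()
state.split()
state.to_tensorflow()
```

After these two calls the sketch is in SPLIT structure and TENSORFLOW representation; calling join() afterwards would change only the structure axis.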

Glossary

COMPLETE

Dataset stored as a single container.

SPLIT

Dataset stored as train/test subsets.

NUMPY

Data stored as NumPy arrays.

TENSORFLOW

Data stored as TensorFlow tensors.

Design principles

  • Structure and representation are orthogonal.

  • All transitions are explicit and idempotent where possible.

  • No training or model logic lives in this module.

  • Dataset field names are canonicalized for downstream models.
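
The last principle can be pictured as a key-renaming pass over the dataset dictionary. The sketch below is only an illustration: the alias table and the canonicalize helper are hypothetical, and the real mapping used by this module is not documented in this section. The canonical names are borrowed from the DataAdapter attributes (sources, positions, masks).

```python
# Hypothetical alias table: raw field name -> canonical field name.
# The real mapping used by wf_psf is not documented in this section.
ALIASES = {
    "stars": "sources",
    "noisy_stars": "sources",
    "pos": "positions",
}


def canonicalize(dataset: dict) -> dict:
    """Return a copy of ``dataset`` with known aliases renamed."""
    return {ALIASES.get(key, key): value for key, value in dataset.items()}


raw = {"noisy_stars": [[0.1, 0.2]], "pos": [[3.0, 4.0]], "masks": [[1, 0]]}
canonical = canonicalize(raw)
```

Keys without an alias pass through unchanged. If two aliases of the same canonical name appear in one dataset, this sketch keeps whichever comes last; a real implementation would need an explicit rule for that case.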

Notes

The DataAdapter class manages these transitions while providing a consistent interface for accessing dataset contents.

Authors: Jennifer Pollack <jennifer.pollack@cea.fr>

Classes

DataAdapter(dataset, converter[, params, ...])

Adapter for managing dataset structure and backend representation.

LoadedDataset([complete, train, test])

Structured container for loaded dataset.

RepresentationState(value)

Representation state of the dataset.

StructureState(value)

Structural state of the dataset.

class wf_psf.data.data_adapter.DataAdapter(dataset: LoadedDataset, converter: TensorFlowDatasetConverter, params: Any | None = None, metadata: dict | None = None)[source]

Bases: object

Adapter for managing dataset structure and backend representation.

The adapter provides a consistent interface to datasets regardless of whether they are stored as a complete dataset or as train/test splits, and whether the underlying representation is NumPy or TensorFlow.

It also canonicalizes dataset fields to the names expected by downstream models.

Notes

Instances should be created via DataAdapterFactory.build().

Attributes:
complete_data

Return the complete dataset in the current representation.

masks

Get masks for the complete dataset.

metadata

Get dataset metadata.

params

Get dataset params.

positions

Get positions for the complete dataset.

representation_state

Return the current representation state of the dataset.

sources

Get sources for the complete dataset.

structure_state

Return the current structural state of the dataset.

test_data

Return the test set in the current representation.

train_data

Return the training set in the current representation.

zernike_prior

Get Zernike prior for the complete dataset.

Methods

convert_to_tensorflow(simPSF, n_bins_lambda, ...)

Convert dataset containers from NumPy to TensorFlow representation.

join_data([keys])

Join train and test splits into a single complete dataset.

release_numpy()

Release NumPy datasets.

release_tensorflow()

Release TensorFlow datasets.

split_data([ratio, seed])

Split the complete dataset into train and test sets if not already split.

property complete_data

Return the complete dataset in the current representation.

convert_to_tensorflow(simPSF, n_bins_lambda, mode)[source]

Convert dataset containers from NumPy to TensorFlow representation.

Applies the configured converter to transform dataset fields associated with canonical keys into TensorFlow-compatible formats.

Conversion is performed on the active structure:

  • SPLIT: converts train and test datasets separately

  • COMPLETE: converts the full dataset

Parameters:
  • simPSF (PSFSimulator) – Simulator instance passed to the converter.

  • n_bins_lambda (int) – Number of wavelength bins used during conversion.

  • mode (DatasetMode) – Dataset operation mode used to select the appropriate dataset schema for a given pipeline process (e.g. training, validation, inference).

Raises:

RuntimeError – If no converter is configured.

Notes

  • Conversion is idempotent: if the data is already in TensorFlow representation, this method does nothing.

  • Converted datasets are stored in internal attributes (_train_tf, _test_tf, _complete_tf) and do not overwrite the original NumPy data.
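
The two notes above describe a common pattern: convert once, cache the result next to the original, and return early on later calls. The sketch below shows that pattern under stated assumptions. ConversionSketch, fake_converter and the calls counter are hypothetical, a tuple stands in for a tensor so the example runs without TensorFlow, and the order of the two guard checks is a guess rather than the documented behaviour.

```python
from typing import Callable, Optional


class ConversionSketch:
    """Hypothetical stand-in showing the idempotent-conversion pattern."""

    def __init__(self, complete: dict, converter: Optional[Callable[[dict], dict]]):
        self._complete_np = complete  # original data, never overwritten
        self._complete_tf = None      # filled in on first conversion
        self._converter = converter
        self.calls = 0                # counts real conversions, for the demo

    def convert(self) -> None:
        if self._converter is None:
            raise RuntimeError("No converter configured.")
        if self._complete_tf is not None:
            return  # already converted: do nothing
        self._complete_tf = self._converter(self._complete_np)
        self.calls += 1


def fake_converter(data: dict) -> dict:
    # A tuple stands in for a tensor so the sketch runs without TensorFlow.
    return {key: tuple(value) for key, value in data.items()}


adapter = ConversionSketch({"positions": [1.0, 2.0]}, fake_converter)
adapter.convert()
adapter.convert()  # second call is a no-op
```

Keeping the converted copy in a separate attribute is what lets the original data survive conversion until it is explicitly released.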

join_data(keys: list[str] | None = None)[source]

Join train and test splits into a single complete dataset.

Concatenates corresponding arrays from the train and test containers along the first axis (sample dimension) for the specified keys.

Parameters:

keys (list of str, optional) – Dataset fields to join. If None, uses the canonical dataset keys.

Raises:

RuntimeError – If the adapter is not in SPLIT state or if train/test data is missing.

Notes

Only keys present in both train and test datasets are joined.
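
The join rule can be sketched as follows. This is an illustration, not the module's code: join_splits is a hypothetical helper, and plain Python lists stand in for NumPy arrays so the example has no dependencies. With real arrays, np.concatenate([train[key], test[key]], axis=0) would play the role of the list concatenation.

```python
def join_splits(train: dict, test: dict, keys: list) -> dict:
    """Concatenate train and test entries sample-wise for shared keys."""
    joined = {}
    for key in keys:
        if key in train and key in test:
            # List concatenation appends test samples after train samples,
            # i.e. along the first (sample) axis.
            joined[key] = list(train[key]) + list(test[key])
    return joined


train = {"positions": [[0, 0], [1, 1]], "sources": ["a", "b"]}
test = {"positions": [[2, 2]], "masks": ["m"]}
complete = join_splits(train, test, keys=["positions", "sources", "masks"])
```

Here only "positions" is joined, because "sources" is missing from the test set and "masks" is missing from the training set.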

property masks

Get masks for the complete dataset.

property metadata: dict | None

Get dataset metadata.

property params: Any | None

Get dataset params.

property positions

Get positions for the complete dataset.

release_numpy()[source]

Release NumPy datasets.

release_tensorflow()[source]

Release TensorFlow datasets.

property representation_state

Return the current representation state of the dataset.

property sources

Get sources for the complete dataset.

split_data(ratio: float | None = None, seed: int | None = None)[source]

Split the complete dataset into train and test sets if not already split.

Parameters:
  • ratio (float, optional) – Fraction of the dataset to use for training. If None, the value from params is used if available, otherwise 0.8.

  • seed (int, optional) – The random seed for reproducibility (default is from params).

Raises:

RuntimeError – If the dataset is not in COMPLETE state when attempting to split.

Notes

  • Splitting is idempotent: if the dataset is already in SPLIT state, this method does not modify the data or re-split the dataset.
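
To show how ratio and seed interact, here is a minimal sketch of a reproducible split. It assumes the split is a seeded shuffle followed by a cut at ratio, which this page does not confirm; split_indices is a hypothetical helper, and the default of 0.8 is taken from the parameter description above.

```python
import random


def split_indices(n_samples: int, ratio: float = 0.8, seed=None):
    """Return (train_idx, test_idx) from a seeded shuffle of range(n_samples)."""
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * ratio)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(10, ratio=0.8, seed=42)
```

With ten samples and a ratio of 0.8 this yields eight training indices and two test indices, and calling it again with the same seed reproduces the same partition.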

property structure_state

Return the current structural state of the dataset.

property test_data

Return the test set in the current representation.

property train_data

Return the training set in the current representation.

property zernike_prior

Get Zernike prior for the complete dataset.

class wf_psf.data.data_adapter.LoadedDataset(complete: dict | None = None, train: dict | None = None, test: dict | None = None)[source]

Bases: object

Structured container for loaded dataset.

complete

The complete dataset (if in COMPLETE state).

Type:

dict, optional

train

The training dataset (if in SPLIT state).

Type:

dict, optional

test

The test dataset (if in SPLIT state).

Type:

dict, optional

Methods

is_complete()

Check if the dataset is in COMPLETE state.

is_split()

Check if the dataset is in SPLIT state.

is_complete() → bool[source]

Check if the dataset is in COMPLETE state.

is_split() → bool[source]

Check if the dataset is in SPLIT state.
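
A container with this shape could look like the sketch below. The field names and defaults follow the documented signature, but the class name LoadedDatasetSketch and the rules inside is_complete() and is_split() are assumptions; the real checks may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadedDatasetSketch:
    complete: Optional[dict] = None
    train: Optional[dict] = None
    test: Optional[dict] = None

    def is_complete(self) -> bool:
        # Assumed rule: COMPLETE means a complete container is present.
        return self.complete is not None

    def is_split(self) -> bool:
        # Assumed rule: SPLIT means both subsets are present.
        return self.train is not None and self.test is not None


whole = LoadedDatasetSketch(complete={"positions": [[0, 0]]})
halves = LoadedDatasetSketch(train={"positions": [[0, 0]]}, test={"positions": [[1, 1]]})
```

Under these assumed rules, whole reports COMPLETE and halves reports SPLIT, while a container holding only a training set reports neither.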

class wf_psf.data.data_adapter.RepresentationState(value)[source]

Bases: Enum

Representation state of the dataset.

  • NUMPY: the dataset is represented as a NumPy array.

  • TENSORFLOW: the dataset is represented as a TensorFlow tensor.

NUMPY = 1
TENSORFLOW = 2

class wf_psf.data.data_adapter.StructureState(value)[source]

Bases: Enum

Structural state of the dataset.

  • COMPLETE: the dataset is complete and not split into train/test.

  • SPLIT: the dataset is split into train and test sets.

COMPLETE = 1
SPLIT = 2