wf_psf.data.data_adapter

Data Adapter.

This module manages dataset lifecycle transitions for the WF-PSF pipeline.

Overview

Two orthogonal state machines are maintained:

1. Structure state

COMPLETE → SPLIT via split_data()
SPLIT → COMPLETE via join_data()

2. Representation state

NUMPY → TENSORFLOW via convert_to_tensorflow()

Glossary

COMPLETE: Dataset stored as a single container.
SPLIT: Dataset stored as train/test subsets.
NUMPY: Data stored as NumPy arrays.
TENSORFLOW: Data stored as TensorFlow tensors.

Design principles

Structure and representation are orthogonal.
All transitions are explicit and idempotent where possible.
No training or model logic lives in this module.
Dataset field names are canonicalized for downstream models.
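
The orthogonality of the two state machines can be illustrated with a minimal sketch. The enum members mirror the ones documented in this module; the StateTracker class and its method names are hypothetical and are not part of wf_psf.

```python
from enum import Enum


class StructureState(Enum):
    COMPLETE = 1
    SPLIT = 2


class RepresentationState(Enum):
    NUMPY = 1
    TENSORFLOW = 2


class StateTracker:
    """Hypothetical helper that tracks both states independently."""

    def __init__(self):
        self.structure = StructureState.COMPLETE
        self.representation = RepresentationState.NUMPY

    def split(self):
        # Idempotent: calling this twice leaves the state unchanged.
        self.structure = StructureState.SPLIT

    def join(self):
        self.structure = StructureState.COMPLETE

    def to_tensorflow(self):
        # Touches only the representation axis, never the structure axis.
        self.representation = RepresentationState.TENSORFLOW


tracker = StateTracker()
tracker.split()
tracker.split()          # second call is a no-op
tracker.to_tensorflow()  # structure stays SPLIT
```

Because each transition writes to exactly one of the two attributes, any structure state can be combined with any representation state.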

Notes

The DataAdapter class manages these transitions while providing a consistent interface for accessing dataset contents.

Authors: Jennifer Pollack <jennifer.pollack@cea.fr>

Classes

`DataAdapter`(dataset, converter[, params, ...])	Adapter for managing dataset structure and backend representation.
`LoadedDataset`([complete, train, test])	Structured container for a loaded dataset.
`RepresentationState`(value)	Representation state of the dataset.
`StructureState`(value)	Structural state of the dataset.

class wf_psf.data.data_adapter.DataAdapter(dataset: LoadedDataset, converter: TensorFlowDatasetConverter, params: Any | None = None, metadata: dict | None = None)

Bases: object

Adapter for managing dataset structure and backend representation.

The adapter provides a consistent interface to datasets regardless of whether they are stored as a complete dataset or as train/test splits, and whether the underlying representation is NumPy or TensorFlow.

It also canonicalizes dataset fields to the names expected by downstream models.

Notes

Instances should be created via DataAdapterFactory.build().

Attributes:

complete_data: Return the complete dataset in the current representation.
masks: Get masks for the complete dataset.
metadata: Get dataset metadata.
params: Get dataset params.
positions: Get positions for the complete dataset.
representation_state: Return the current representation state of the dataset.
sources: Get sources for the complete dataset.
structure_state: Return the current structural state of the dataset.
test_data: Return the test set in the current representation.
train_data: Return the training set in the current representation.
zernike_prior: Get Zernike prior for the complete dataset.

Methods

`convert_to_tensorflow`(simPSF, n_bins_lambda, ...)	Convert dataset containers from NumPy to TensorFlow representation.
`join_data`([keys])	Join train and test splits into a single complete dataset.
`release_numpy`()	Release NumPy datasets.
`release_tensorflow`()	Release TensorFlow datasets.
`split_data`([ratio, seed])	Split the complete dataset into train and test sets if not already split.

property complete_data: Return the complete dataset in the current representation.

convert_to_tensorflow(simPSF, n_bins_lambda, mode)

Convert dataset containers from NumPy to TensorFlow representation.

Applies the configured converter to transform dataset fields associated with canonical keys into TensorFlow-compatible formats.

Conversion is performed on the active structure:

SPLIT: converts train and test datasets separately
COMPLETE: converts the full dataset

Parameters:

simPSF (PSFSimulator) – Simulator instance passed to the converter.
n_bins_lambda (int) – Number of wavelength bins used during conversion.
mode (DatasetMode) – Dataset operation mode used to select the appropriate dataset schema for a given pipeline process (e.g. training, validation, inference).

Raises:

RuntimeError – If no converter is configured.

Notes

Conversion is idempotent: if the data is already in TensorFlow representation, this method does nothing.

Converted datasets are stored in internal attributes (_train_tf, _test_tf, _complete_tf) and do not overwrite the original NumPy data.
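
The idempotency and non-overwriting behaviour described in these notes can be sketched as follows. ToyAdapter, fake_converter and the calls counter are hypothetical illustrations, not the wf_psf implementation; only the attribute name _complete_tf is taken from the notes above, and the "already converted" check is an assumption.

```python
class ToyAdapter:
    """Hypothetical stand-in for the adapter's conversion bookkeeping."""

    def __init__(self, complete, converter):
        self._complete = complete   # original (NumPy-like) data, never overwritten
        self._complete_tf = None    # filled in by the first conversion
        self._converter = converter
        self.calls = 0              # counts how often the converter actually ran

    def convert(self):
        if self._converter is None:
            raise RuntimeError("No converter configured.")
        if self._complete_tf is not None:
            return  # already converted: do nothing
        self.calls += 1
        self._complete_tf = self._converter(self._complete)


def fake_converter(data):
    # Placeholder for a real NumPy-to-TensorFlow conversion.
    return {key: tuple(values) for key, values in data.items()}


adapter = ToyAdapter({"positions": [1.0, 2.0]}, fake_converter)
adapter.convert()
adapter.convert()  # no-op: the converter is not invoked again
```

Keeping the converted data in a separate attribute means the original container remains available until it is explicitly released.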

join_data(keys: list[str] | None = None)

Join train and test splits into a single complete dataset.

Concatenates corresponding arrays from the train and test containers along the first axis (sample dimension) for the specified keys.

Parameters:

keys (list of str, optional) – Dataset fields to join. If None, uses the canonical dataset keys.

Raises:

RuntimeError – If the adapter is not in SPLIT state or if train/test data is missing.

Notes

Only keys present in both train and test datasets are joined.
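
A minimal sketch of the joining rule, assuming dict-based containers. Plain Python lists stand in for NumPy arrays here, so list concatenation plays the role of concatenation along the first axis. The function join_splits and the key names used as dictionary keys are hypothetical.

```python
def join_splits(train, test, keys):
    """Concatenate train and test entries along the sample axis."""
    joined = {}
    for key in keys:
        # Only keys present in both splits are joined; others are skipped.
        if key in train and key in test:
            joined[key] = train[key] + test[key]  # train samples first
    return joined


train = {"positions": [[0.1, 0.2], [0.3, 0.4]], "masks": [[1], [0]]}
test = {"positions": [[0.5, 0.6]]}  # no "masks" entry in the test split

complete = join_splits(train, test, keys=["positions", "masks"])
```

In this sketch "masks" is dropped from the result because it is missing from the test split.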

property masks: Get masks for the complete dataset.

property metadata: dict | None: Get dataset metadata.

property params: Any | None: Get dataset params.

property positions: Get positions for the complete dataset.

release_numpy(): Release NumPy datasets.

release_tensorflow(): Release TensorFlow datasets.

property representation_state: Return the current representation state of the dataset.

property sources: Get sources for the complete dataset.

split_data(ratio: float | None = None, seed: int | None = None)

Split the complete dataset into train and test sets if not already split.

Parameters:

ratio (float, optional) – The fraction of the dataset to use for training (default is 0.8 or from params).
seed (int, optional) – The random seed for reproducibility (default is from params).

Raises:

RuntimeError – If the dataset is not in COMPLETE state when attempting to split.

Notes

Splitting is idempotent: if the dataset is already in SPLIT state, this method does not modify the data or re-split the dataset.
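
The role of ratio and seed can be sketched with a seeded index shuffle. This is an illustration only: whether the module shuffles, and how it rounds the training size, is not stated here, so both are assumptions. The function split_indices is hypothetical.

```python
import random


def split_indices(n_samples, ratio=0.8, seed=None):
    """Return (train_idx, test_idx) for a shuffled split."""
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the global RNG state untouched.
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * ratio)  # assumption: round down
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(10, ratio=0.8, seed=42)
repeat_train, repeat_test = split_indices(10, ratio=0.8, seed=42)
```

With a fixed seed, repeated calls produce the same partition, which is what makes a split reproducible across runs.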

property structure_state: Return the current structural state of the dataset.

property test_data: Return the test set in the current representation.

property train_data: Return the training set in the current representation.

property zernike_prior: Get Zernike prior for the complete dataset.

class wf_psf.data.data_adapter.LoadedDataset(complete: dict | None = None, train: dict | None = None, test: dict | None = None)

Bases: object

Structured container for a loaded dataset.

complete

The complete dataset (if in COMPLETE state).

Type: dict, optional

train

The training dataset (if in SPLIT state).

Type: dict, optional

test

The test dataset (if in SPLIT state).

Type: dict, optional

Methods

`is_complete`()	Check if the dataset is in COMPLETE state.
`is_split`()	Check if the dataset is in SPLIT state.

is_complete() → bool: Check if the dataset is in COMPLETE state.

is_split() → bool: Check if the dataset is in SPLIT state.
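
A container with this shape could be sketched as a dataclass. The class name LoadedDatasetSketch is hypothetical, and the conditions used in is_complete() and is_split() are assumptions, since the exact checks are not documented here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadedDatasetSketch:
    complete: Optional[dict] = None
    train: Optional[dict] = None
    test: Optional[dict] = None

    def is_complete(self) -> bool:
        # Assumption: COMPLETE means the single container is populated.
        return self.complete is not None

    def is_split(self) -> bool:
        # Assumption: SPLIT requires both subsets to be populated.
        return self.train is not None and self.test is not None


whole = LoadedDatasetSketch(complete={"positions": [[0.0, 0.0]]})
parts = LoadedDatasetSketch(train={"positions": []}, test={"positions": []})
```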

class wf_psf.data.data_adapter.RepresentationState(value)

Bases: Enum

Representation state of the dataset.

NUMPY: the dataset is represented as a NumPy array.
TENSORFLOW: the dataset is represented as a TensorFlow tensor.

NUMPY = 1

TENSORFLOW = 2

class wf_psf.data.data_adapter.StructureState(value)

Bases: Enum

Structural state of the dataset.

COMPLETE: the dataset is complete and not split into train/test.
SPLIT: the dataset is split into train and test sets.

COMPLETE = 1

SPLIT = 2
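
Since both state classes derive from Enum, members can be looked up by value or by name using standard Enum behaviour. The sketch below redefines a local enum with the documented values so that it runs without wf_psf installed.

```python
from enum import Enum


class StructureState(Enum):
    # Local mirror of the documented members and values.
    COMPLETE = 1
    SPLIT = 2


state = StructureState(2)                 # lookup by value
name_lookup = StructureState["COMPLETE"]  # lookup by name
```

Comparing members with `is` is the idiomatic check, because each Enum member is a singleton.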