transforms
- class transforms.ColumnFunctionTransform(dataset_name: str, output_dataset_name: str, input_column: str, output_column: str, function: Callable[[Series], Series] | str | None = None, extra_code: str | None = None)[source]
Bases: Transform, ExtraCodeFunctionMixin, YAMLObject
A transform for applying a function to every value in a column of a DataFrame.
- Parameters:
- dataset_name: str
the dataset to operate on
- output_dataset_name: str
the name given to the output dataset
- input_column: str
the name of the column the function should be applied to
- output_column: str
the name given to the output column containing the results of applying the function
- function: Union[Callable[[pandas.Series], pandas.Series], str]
The unary function to apply to the values in the chosen column.
- extra_code: Optional[str]
This can contain additional code for the transform function, such as the definition of a function over multiple lines or split into multiple functions for readability.
- Attributes:
- yaml_flow_style
Methods
- evaluate_function(function, extra_code): Compile and evaluate the given function and an additional, optional code fragment within a separate global environment and return the executable function object.
- from_yaml(loader, node): Convert a representation node to a Python object.
- get_data(dataset_name): Retrieve a dataset with the given name from the data repository associated with this transform.
- set_data_repo(data_repo)
- to_yaml(dumper, data): Convert a Python object to a representation node.
- prepare
- process
- set_name
- yaml_dumper
- yaml_tag = '!ColumnFunctionTransform'
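Ignoring the YAML plumbing, the core operation is roughly the following pandas sketch (the dataset, column names, and function are illustrative, not part of the API):

```python
import pandas as pd

# Hypothetical input dataset: a DataFrame with one numeric column.
df = pd.DataFrame({"delay": [1, 2, 3]})

# The configured function has the signature Series -> Series; it is
# applied to input_column and its result is stored in output_column.
to_milliseconds = lambda s: s * 1000
df["delay_ms"] = to_milliseconds(df["delay"])
```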
- class transforms.ConcatTransform(dataset_names: List[str] | None, output_dataset_name: str)[source]
Bases: Transform, YAMLObject
A transform for concatenating all DataFrames from the given datasets.
- Parameters:
- dataset_names: Optional[List[str]]
the list of datasets to concatenate
- output_dataset_name: str
the name given to the output dataset
- Attributes:
- yaml_flow_style
Methods
- from_yaml(loader, node): Convert a representation node to a Python object.
- get_data(dataset_name): Retrieve a dataset with the given name from the data repository associated with this transform.
- set_data_repo(data_repo)
- to_yaml(dumper, data): Convert a Python object to a representation node.
- concat
- prepare
- process
- set_name
- yaml_dumper
- yaml_tag = '!ConcatTransform'
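The underlying operation is a plain pandas concatenation; a minimal sketch with two illustrative input datasets:

```python
import pandas as pd

# Two hypothetical input datasets, each holding one DataFrame.
runs_a = pd.DataFrame({"run": [1, 2], "value": [0.5, 0.7]})
runs_b = pd.DataFrame({"run": [3], "value": [0.9]})

# ConcatTransform concatenates all DataFrames from the listed
# datasets into a single output DataFrame.
combined = pd.concat([runs_a, runs_b], ignore_index=True)
```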
- class transforms.FunctionTransform(dataset_name: str, output_dataset_name: str, function: Callable[[DataFrame], DataFrame] | str | None = None, extra_code: str | None = None)[source]
Bases: Transform, ExtraCodeFunctionMixin, YAMLObject
A transform for applying an arbitrary function to a whole DataFrame.
- Parameters:
- dataset_name: str
The dataset to operate on.
- output_dataset_name: str
The name given to the output dataset.
- function: Union[Callable[[pandas.DataFrame], pandas.DataFrame], str]
The unary function to apply to each DataFrame of the dataset. It takes the full DataFrame as its only argument and returns a DataFrame.
- extra_code: Optional[str]
This can contain additional code for the transform function, such as the definition of a function over multiple lines or split into multiple functions for readability.
- Attributes:
- yaml_flow_style
Methods
- evaluate_function(function, extra_code): Compile and evaluate the given function and an additional, optional code fragment within a separate global environment and return the executable function object.
- from_yaml(loader, node): Convert a representation node to a Python object.
- get_data(dataset_name): Retrieve a dataset with the given name from the data repository associated with this transform.
- set_data_repo(data_repo)
- to_yaml(dumper, data): Convert a Python object to a representation node.
- prepare
- process
- set_name
- yaml_dumper
- yaml_tag = '!FunctionTransform'
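In essence the transform calls a DataFrame -> DataFrame function on each DataFrame of the dataset; a sketch with an illustrative filter-and-derive function:

```python
import pandas as pd

# Hypothetical input dataset.
df = pd.DataFrame({"bytes": [1500, 500, 9000]})

# The configured function takes the full DataFrame as its only
# argument and returns a DataFrame; here it filters rows and adds
# a derived column.
def keep_small_packets(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame[frame["bytes"] <= 1500].copy()
    out["kbytes"] = out["bytes"] / 1000.0
    return out

result = keep_small_packets(df)
```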
- class transforms.GroupedAggregationTransform(dataset_name: str, output_dataset_name: str, input_column: str, output_column: str, grouping_columns: ~typing.List, raw: bool = False, pre_concatenate: bool = False, aggregation_function: ~typing.Callable[[~pandas.core.series.Series], object] | str | None = None, extra_code: str | None = None, timestamp_selector: ~typing.Callable = <function NDFrame.head>)[source]
Bases: Transform, ExtraCodeFunctionMixin, YAMLObject
A transform for dividing a dataset into distinct partitions with pandas.DataFrame.groupby, each partition sharing the same values in the specified list of grouping/partitioning columns, and then applying a function to the values in a given column of that partition, producing an aggregate scalar value.
- Parameters:
- dataset_name: str
the dataset to operate on
- output_dataset_name: str
the name given to the output dataset
- input_column: str
the name of the column the function should be applied to
- output_column: str
the name given to the output column containing the results of applying the function
- grouping_columns: List
the set of columns used for partitioning the dataset
- raw: bool
whether to append the raw output of the aggregation function to the result list
- pre_concatenate: bool
concatenate all input DataFrames before processing
- aggregation_function: Union[Callable[[pandas.Series], object], str]
The unary function to apply to each partition. It should expect a pandas.Series as its argument and return a scalar value.
- extra_code: Optional[str]
This can contain additional code for the transform function, such as the definition of a function over multiple lines or split into multiple functions for readability.
- timestamp_selector: Callable
the function used to select the row of the partition data that serves as a template for the output row in case of aggregation
- Attributes:
- yaml_flow_style
Methods
- evaluate_function(function, extra_code): Compile and evaluate the given function and an additional, optional code fragment within a separate global environment and return the executable function object.
- from_yaml(loader, node): Convert a representation node to a Python object.
- get_data(dataset_name): Retrieve a dataset with the given name from the data repository associated with this transform.
- set_data_repo(data_repo)
- to_yaml(dumper, data): Convert a Python object to a representation node.
- aggregate_frame
- prepare
- process
- set_name
- yaml_dumper
- yaml_tag = '!GroupedAggregationTransform'
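Conceptually this is a groupby followed by a Series -> scalar reduction on input_column; a sketch with illustrative names:

```python
import pandas as pd

# Hypothetical input dataset with one grouping column.
df = pd.DataFrame({
    "node": ["a", "a", "b", "b"],
    "rtt":  [10.0, 30.0, 20.0, 40.0],
})

# Partition by grouping_columns, then reduce input_column of each
# partition to a scalar with the aggregation function.
aggregated = (
    df.groupby(["node"])["rtt"]
      .agg(lambda s: s.mean())
      .reset_index(name="rtt_mean")
)
```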
- class transforms.GroupedFunctionTransform(dataset_name: str, output_dataset_name: str, input_column: str, output_column: str, grouping_columns: ~typing.List, raw: bool = False, aggregate: bool = False, pre_concatenate: bool = False, transform_function: ~typing.Callable[[~pandas.core.frame.DataFrame], ~pandas.core.frame.DataFrame] | ~typing.Callable[[~pandas.core.frame.DataFrame], object] | str | None = None, extra_code: str | None = None, timestamp_selector: ~typing.Callable = <function NDFrame.head>)[source]
Bases: Transform, ExtraCodeFunctionMixin, YAMLObject
A transform for dividing a dataset into distinct partitions with pandas.DataFrame.groupby, each partition sharing the same values in the specified list of grouping/partitioning columns, and then applying a function to that partition.
- Parameters:
- dataset_name: str
the dataset to operate on
- output_dataset_name: str
the name given to the output dataset
- input_column: str
the name of the column the function should be applied to
- output_column: str
the name given to the output column containing the results of applying the function
- grouping_columns: List
the set of columns used for partitioning the dataset
- raw: bool
whether to append the raw output of transform_function to the result list
- aggregate: bool
whether the transform function returns a scalar or an object (like a pandas.DataFrame)
- pre_concatenate: bool
concatenate all input DataFrames before processing
- transform_function: Union[Callable[[pandas.DataFrame], pandas.DataFrame], Callable[[pandas.DataFrame], object], str]
The unary function to apply to each partition. It should expect a pandas.DataFrame as its argument and return a pandas.DataFrame (or an arbitrary object if raw is true).
- extra_code: Optional[str]
This can contain additional code for the transform function, such as the definition of a function over multiple lines or split into multiple functions for readability.
- timestamp_selector: Callable
the function used to select the row of the partition data that serves as a template for the output row in case of aggregation
- Attributes:
- yaml_flow_style
Methods
- evaluate_function(function, extra_code): Compile and evaluate the given function and an additional, optional code fragment within a separate global environment and return the executable function object.
- from_yaml(loader, node): Convert a representation node to a Python object.
- get_data(dataset_name): Retrieve a dataset with the given name from the data repository associated with this transform.
- set_data_repo(data_repo)
- to_yaml(dumper, data): Convert a Python object to a representation node.
- aggregate_frame
- prepare
- process
- set_name
- yaml_dumper
- yaml_tag = '!GroupedFunctionTransform'
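Unlike GroupedAggregationTransform, here each whole partition is handed to the transform function as a DataFrame; a sketch with an illustrative per-partition sort:

```python
import pandas as pd

# Hypothetical input dataset with one grouping column.
df = pd.DataFrame({
    "node": ["a", "a", "b"],
    "seq":  [2, 1, 5],
})

# Each partition is passed as a DataFrame to the transform function;
# here every partition is sorted by sequence number.
parts = [g.sort_values("seq") for _, g in df.groupby(["node"])]
result = pd.concat(parts, ignore_index=True)
```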
- class transforms.MergeTransform(dataset_name_left: str, dataset_name_right: str, output_dataset_name: str, left_key_columns: List[str] | None = None, right_key_columns: List[str] | None = None, match_by_filename: bool = True, matching_attribute: str = 'source_files')[source]
Bases: Transform, YAMLObject
A transform for merging the columns from two DataFrames, from two distinct datasets, similar to a SQL INNER JOIN.
Essentially a wrapper around pandas.merge.
- Parameters:
- dataset_name_left: str
the left dataset to operate on
- dataset_name_right: str
the right dataset to operate on
- output_dataset_name: str
the name given to the output dataset
- left_key_columns: Optional[List[str]]
the names of the columns from the left dataset that are used as keys for joining
- right_key_columns: Optional[List[str]]
the names of the columns from the right dataset that are used as keys for joining
- match_by_filename: bool
whether to match merge input by the filename the data has been extracted from
- matching_attribute: str
the attribute to match the datasets on
- Attributes:
- yaml_flow_style
Methods
- from_yaml(loader, node): Convert a representation node to a Python object.
- get_data(dataset_name): Retrieve a dataset with the given name from the data repository associated with this transform.
- set_data_repo(data_repo)
- to_yaml(dumper, data): Convert a Python object to a representation node.
- merge
- prepare
- prepare_matched_by_attribute
- prepare_simple_sequential
- process
- set_name
- yaml_dumper
- merge(data_l: DataFrame, data_r: DataFrame, left_key_columns: List[str] | None = None, right_key_columns: List[str] | None = None)[source]
- yaml_tag = '!MergeTransform'
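Since the transform wraps pandas.merge, its effect corresponds to an inner join on the key columns; a sketch with illustrative datasets:

```python
import pandas as pd

# Two hypothetical input datasets sharing a key column.
left = pd.DataFrame({"run": [1, 2, 3], "sent": [10, 20, 30]})
right = pd.DataFrame({"run": [2, 3, 4], "received": [19, 28, 5]})

# MergeTransform joins the columns of both DataFrames on the key
# columns, keeping only rows present on both sides (inner join).
merged = pd.merge(left, right, left_on=["run"], right_on=["run"],
                  how="inner")
```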
- class transforms.Transform[source]
Bases: YAMLObject
The base class for all transforms
- Attributes:
- yaml_flow_style
Methods
- from_yaml(loader, node): Convert a representation node to a Python object.
- get_data(dataset_name): Retrieve a dataset with the given name from the data repository associated with this transform.
- set_data_repo(data_repo)
- to_yaml(dumper, data): Convert a Python object to a representation node.
- prepare
- process
- set_name
- yaml_dumper
- get_data(dataset_name: str)[source]
Retrieve a dataset with the given name from the data repository associated with this transform
- Parameters:
- dataset_name: str
The name of the dataset to retrieve from the data repository
- set_data_repo(data_repo: dict)[source]
- Parameters:
- data_repo: dict
The dictionary containing all loaded datasets necessary for this transform
- yaml_tag = '!Transform'
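The get_data/set_data_repo contract amounts to sharing a dict of named datasets among transforms; a minimal stand-in sketch (the DemoTransform class and dataset name are illustrative, not the library's actual implementation):

```python
import pandas as pd

# Minimal stand-in for the data-repository contract: a plain dict
# mapping dataset names to data, attached via set_data_repo and
# queried via get_data.
class DemoTransform:
    def set_data_repo(self, data_repo: dict):
        self.data_repo = data_repo

    def get_data(self, dataset_name: str):
        return self.data_repo[dataset_name]

t = DemoTransform()
t.set_data_repo({"pings": pd.DataFrame({"rtt": [1.0]})})
pings = t.get_data("pings")
```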