Using Transforms
When a metric needs to be calculated from a single signal or by combining multiple signals, a transform is required.
A transform should generally receive a list of pandas.DataFrame objects as input and output a list of pandas.DataFrame objects. Typically, it iterates over that list and applies the operation represented by the transform to every DataFrame in the list.
When developing code for a transform or debugging errors, it can be very useful to start an interactive console by adding
start_ipython_dbg_cmdline(user_ns=locals())
to the code in the recipe and running single-threaded by adding --worker 1 --single-threaded to the command line. This allows inspection and manipulation of the loaded data in a comfortable REPL.
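For illustration, a hypothetical user-defined transform function could temporarily embed the console like this; the function name is a placeholder, and start_ipython_dbg_cmdline is assumed to be importable from the framework (its exact import path is not shown in this section):

    # start_ipython_dbg_cmdline is assumed to be provided by the framework.
    def my_transform_function(df):
        # Drop into an interactive console with access to all local variables,
        # including the DataFrame currently being processed.
        start_ipython_dbg_cmdline(user_ns=locals())
        return df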
There are currently six types of transform implemented:
ConcatTransform
MergeTransform
FunctionTransform
ColumnFunctionTransform
GroupedAggregationTransform
GroupedFunctionTransform
ConcatTransform
This transform will concatenate all datasets into a single dataset for further processing.
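For reference, the effect corresponds roughly to a pandas concatenation; the following is a minimal sketch with made-up data, not the transform's actual implementation:

    import pandas as pd

    # Two datasets with the same columns, e.g. results from two runs.
    part_a = pd.DataFrame({"time": [1.2, 1.4], "run": ["a", "a"]})
    part_b = pd.DataFrame({"time": [1.1, 1.3], "run": ["b", "b"]})

    # Concatenate all datasets into a single dataset for further processing.
    combined = pd.concat([part_a, part_b], ignore_index=True)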
MergeTransform
This transform will combine two datasets with different columns based on the given keys.
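The effect corresponds roughly to a pandas merge on the key columns; a minimal sketch with made-up column names:

    import pandas as pd

    # Two datasets with different columns that share the key column "job_id".
    durations = pd.DataFrame({"job_id": [1, 2], "duration": [3.5, 4.2]})
    energies = pd.DataFrame({"job_id": [1, 2], "energy": [10.1, 12.7]})

    # Combine both datasets into one based on the given key(s).
    merged = pd.merge(durations, energies, on="job_id")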
FunctionTransform
This is for applying an arbitrary function to the dataset and saving the result in another (or the same) set.
The user-defined unary function given by the function parameter is executed for every pandas.DataFrame in the selected dataset.
The extra_code parameter can contain arbitrary Python code, such as function definitions, which can be used when more complex or custom transforming functions are required.
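As an illustration, the function is a unary mapping from DataFrame to DataFrame; the column names below are made up, and a helper like this is the kind of definition that could be supplied via extra_code:

    import pandas as pd

    # A unary function executed for every DataFrame in the selected dataset;
    # here it adds a column with the running mean of a hypothetical "latency" column.
    def add_running_mean(df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()
        df["latency_mean"] = df["latency"].expanding().mean()
        return df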
ColumnFunctionTransform
This is for applying a function to every value in a selected column of the data
and saving the result in another (or the same) column.
The user-defined unary function given by the function parameter is executed for every value, in every pandas.DataFrame in the selected dataset.
The extra_code parameter can contain arbitrary Python code.
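For illustration, the function here operates on single values rather than whole DataFrames; the column names and units below are made up:

    # A unary function executed for every value of the selected column,
    # e.g. converting nanoseconds to milliseconds.
    def ns_to_ms(value):
        return value / 1_000_000

    # Equivalent effect in plain pandas: apply the function to every value
    # of one column and store the result in another (or the same) column.
    # df["duration_ms"] = df["duration_ns"].map(ns_to_ms)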
GroupedFunctionTransform
This is for dividing the input pandas.DataFrame objects into partitions based on sharing the same values in the columns given by grouping_columns. On each of these partitions a user-defined function is applied. The unary function takes a DataFrame as argument and returns either a DataFrame, a scalar value, or an arbitrary object as result. In the case of a scalar, the parameter aggregate should be set; the first row of the partition DataFrame is taken and the single value output of the function is added in a new column. In the case of an arbitrary object, the parameter raw should be set; the output of the function is then passed on without modification, most likely for export as a JSON representation of the object.
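On a single input DataFrame the effect roughly corresponds to a pandas groupby followed by applying the function per partition; the following sketch uses made-up data and only approximates the aggregate and raw handling described above:

    import pandas as pd

    df = pd.DataFrame({
        "benchmark": ["a", "a", "b", "b"],  # hypothetical grouping column
        "time": [1.0, 2.0, 3.0, 5.0],
    })

    # Case 1: the function returns a DataFrame per partition.
    def normalize(part: pd.DataFrame) -> pd.DataFrame:
        part = part.copy()
        part["time_norm"] = part["time"] / part["time"].max()
        return part

    per_partition = pd.concat(
        normalize(part) for _, part in df.groupby(["benchmark"])
    )

    # Case 2: the function returns a scalar per partition (aggregate case);
    # the scalar is attached to the first row of each partition as a new column.
    def mean_time(part: pd.DataFrame) -> float:
        return part["time"].mean()

    rows = []
    for _, part in df.groupby(["benchmark"]):
        first = part.iloc[[0]].copy()
        first["mean_time"] = mean_time(part)
        rows.append(first)
    aggregated = pd.concat(rows, ignore_index=True)

    # Case 3: the function returns an arbitrary object (raw case),
    # e.g. a dict that is later exported as JSON.
    def summarize(part: pd.DataFrame) -> dict:
        return {"count": len(part), "max_time": part["time"].max()}

    raw_results = [summarize(part) for _, part in df.groupby(["benchmark"])]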