# Evaluation & Plotting

## Key Technologies

The framework is built on top of [pandas](https://pandas.pydata.org/), [seaborn](https://seaborn.pydata.org/index.html) and [dask](https://docs.dask.org/en/stable/).

Dask is used to build task dependency graphs via [`dask.delayed`](https://docs.dask.org/en/latest/delayed.html) and for scalable parallelisation of the execution of those tasks, either locally on a desktop/laptop or remotely on a (SLURM) cluster using [Dask.distributed](https://distributed.dask.org/en/stable/) and [Dask-Jobqueue](https://jobqueue.dask.org/en/latest/index.html).

There are a few key concepts necessary for using the framework effectively:

- [`pandas.Series`](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#series)
- [`pandas.DataFrame`](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe)

The main data structure of the framework is the [`pandas.DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame), which is similar to a spreadsheet or a table in a SQL database. For everything interesting you will need some knowledge of how to index, partition and apply a function to a `DataFrame`. A short [introduction](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html) to the data structures used in pandas is available; reading the documentation for [`pandas.DataFrame.groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby) is also advisable.

## Getting Insight

Evaluation and plotting are done by running `run_recipe.py` with the name of a 'recipe' YAML file, which contains the steps for processing the results of a batch of simulation runs. For a minimal introduction to YAML and some specifics of PyYAML, see the [PyYAML documentation](https://pyyaml.org/wiki/PyYAMLDocumentation).

### Recipes

A recipe describes the individual tasks as a list of key-value pairs; internally, the data and operations are assembled (by [dask](https://docs.dask.org/en/latest/)) into a dependency/task graph.

A recipe can contain two phases: `evaluation` and `plot`. Each phase is optional.

The `evaluation` phase itself consists of three sub-phases:

- `extractors`: for extracting the desired data from the input databases
- `transforms`: for processing the extracted data in some way
- `exporter`: for saving the extracted and possibly processed data

The `plot` phase consists of three sub-phases:

- `reader`: for loading the data exported from the `evaluation` phase
- `transforms`: for processing the loaded data in some way
- `tasks`: for actually plotting the loaded and possibly processed data

The sub-phases are evaluated in this order. Each sub-phase consists of a list of tasks to execute, and each task usually depends on a task in the previous sub-phase. If a task is not depended upon by another task, it will not be part of the dependency/task graph and will thus not be executed.

Each task either creates or modifies a named 'dataset', a list of `pandas.DataFrame`s. Usually an extractor creates a `pandas.DataFrame` for each input file and stores the resulting list under the user-defined `dataset_name` in an internal dictionary; all `transforms` are then executed over each `DataFrame` in that list separately, and the results are written to disk, either separately or concatenated into a single `DataFrame`.
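Since every transform ultimately operates on plain `pandas.DataFrame`s, a little pandas fluency goes a long way. The following standalone sketch (the column names are made up for illustration and are not part of the framework) shows the partition-and-aggregate pattern that `DataFrame.groupby` provides:

```
import pandas as pd

# A toy result table; the column names are purely illustrative.
df = pd.DataFrame({
    "mpr": [0.1, 0.1, 0.5, 0.5],      # hypothetical market penetration rate
    "cbr": [0.02, 0.03, 0.11, 0.09],  # hypothetical channel busy ratio
})

# Partition the rows by one column and aggregate each group --
# the kind of operation a recipe's transforms typically perform.
mean_cbr = df.groupby("mpr")["cbr"].mean().reset_index()
print(mean_cbr)
```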
The basic structure of a recipe is as follows. For the `evaluation` phase:

```
evaluation:
  extractors:
    dataset0: !extractor_class
      parameter0: "value"
  transforms:
    transformed_dataset0: !transform_class
      dataset_name: "dataset0"
      parameter0: "value"
  exporter:
    name0: !exporter_class
      dataset_name: "transformed_dataset0"
      parameter0: "value"
```

For the plotting phase:

```
plot:
  reader:
    dataset0: !reader_class
      input_files:
        - "/path/regular/expression0"
        - "/path/regular/expression1"
  transforms:
    transformed_dataset0: !transform_class
      dataset_name: "dataset0"
      output_dataset_name: "dataset0"
      parameter0: "value"
  tasks:
    plot_task0: !plotting_class
      dataset_name: "transformed_dataset0"
```

For simplicity only one task is listed for each sub-phase, but an arbitrary number of tasks is possible. For ease of use, the following omissions are possible:

- the `transforms` sub-phase is optional; chaining transforms is possible
- the `exporter` and `reader` sub-phases are optional; the `plot` phase can then just use the datasets extracted in the `evaluation` phase

The first line of the definition of a task has the format `<name>: !<class>` and defines the name and the type of the task. The type is just the class name (or more precisely, the YAML tag assigned to the class, but they are literally the same) of the desired operation. What follows are the parameters of the constructor for that class.

In the `extractors` and `reader` sub-phases, the collection of data that is to be extracted, processed and plotted is given a name so that it can be referenced in the tasks of other sub-phases. In the example above, the name given is `dataset0`. Each task has at least one parameter, `dataset_name`, which references the input dataset the task operates upon. A transform will always have a parameter `output_dataset_name` to assign a name to the result of the operation. This assignment can overwrite previously defined names, and thus also free the associated data if it is not depended upon by another task.

Most of the documentation for the parameters of the actual components is in the API documentation, e.g. the documentation one needs for just plotting is in `plots.PlottingTask`, specifically in the documentation of the parameters of the constructor for that class.

#### Tags

The `evaluation` phase also supports assigning tags to the extracted data. A tag is a property shared among a subset of the input data, e.g. the repetition number of a run, the run number, or the rate at which vehicles are equipped with V2X hardware. The syntax for the tag definition is as follows:

```
evaluation:
  tags:
    attributes:
      repetition: |
        [{ 'regex': r'repetition', 'transform': lambda v: int(v) }]
    iterationsvars:
      tag_name_1: |
        [{ 'regex': r'anotherExampleRE.*', 'transform': lambda v: str(v) }]
    parameters:
      tag_name_2: |
        [{ 'regex': r'exampleRE.*', 'transform': lambda v: str(v) },
         { 'regex': r'exampleRE2.*', 'transform': lambda v: str(v) },
        ]
```

The `attributes`, `iterationsvars` and `parameters` are predefined categories for the tags and are extracted from different places in the input database. The general procedure is to use the regular expression (Python flavour, [syntax](https://docs.python.org/3/library/re.html#regular-expression-syntax)) defined by the `regex` key to match on the name of the attribute, and then to apply the unary function defined by the `transform` key to the value in the column associated with the category.
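To make the matching procedure concrete, here is what a single tag entry amounts to in plain Python; the attribute name and value are invented for illustration, and this is not the framework's internal code:

```
import re

# One entry of a tag definition: a regular expression matched against the
# name of an attribute/parameter, and a unary function applied to its value.
entry = {"regex": r"repetition", "transform": lambda v: int(v)}

# Hypothetical name/value pair as it might appear in the input database.
attr_name, attr_value = "repetition", "3"

if re.match(entry["regex"], attr_name):
    tag_value = entry["transform"](attr_value)  # -> 3 (as an int)
```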
The tags are extracted from:

- `attributes`: the `runAttr` table
  - the `regex` matches on the value in the `attrName` column
  - the `transform` is applied to the value in the `attrValue` column
- `iterationvars`: the row with `attrName=='iterationvars'` in the `runAttr` table
  - the `regex` matches on the value of the `attrValue` column of that row
  - the `transform` is applied to the value matched by the regular expression
- `parameters`: the `runParam` table
  - the `regex` matches on the value in the `paramKey` column
  - the `transform` is applied to the value in the `paramValue` column

Multiple regular expressions can be bound to the same tag, in case of a heterogeneous data set or typing errors. The built-in tag definitions can be found in `tag_regular_expressions.py`.

### Examples

Example recipes can be found in the `examples` directory in the root of this repository:

- `lineplot.yaml`: a basic recipe for producing a CBR-over-MPR lineplot. One should probably start with this as a template.
- `CUI.yaml`: calculates, for every vehicle, the mean of the differences between consecutive receptions of a CAM and plots them as a lineplot. This is probably the second template to look at, as it uses `GroupedFunctionTransform` to partition the input data by MPR and by the name of the module emitting the signal used as a marker for CAM emission (a standalone sketch of this computation follows the list below).
- `recipe.yaml`: a more elaborate recipe showcasing all possible options
- `statistic.yaml`: extracts the results for the statName `LemObjectUpdateInterval:stats` from the `statistic` table and saves them, then plots the mean of the values (from the `statMean` column of the table) over the market penetration rate.
- `boxstats.yaml`: showcases using a custom function to calculate the values needed for a boxplot, forward them as a Python list to the exporter and save them as a JSON file
- `sqlextractor.yaml`: a showcase for the generic SQLite extractor
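To get a feel for the per-group computation that `CUI.yaml` performs, the following standalone pandas sketch computes, for each vehicle, the mean difference between consecutive reception times; the column names are invented and do not correspond to the actual database schema:

```
import pandas as pd

# Hypothetical reception log: one row per received CAM.
df = pd.DataFrame({
    "vehicle": ["v0", "v0", "v0", "v1", "v1"],
    "rx_time": [1.0, 1.4, 2.1, 1.2, 2.0],
})

# Per vehicle: differences between consecutive reception times,
# then the mean of those differences.
cui = (
    df.sort_values("rx_time")
      .groupby("vehicle")["rx_time"]
      .apply(lambda s: s.diff().mean())
)
print(cui)
```

In a recipe, a function of this kind would be supplied to a `GroupedFunctionTransform`; the exact parameter names are documented in the API documentation.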