Best practices
==============

This is a collection of best practices to consider when using this framework:

- keep the input data (SQLite3/feather files) on an SSD, otherwise the random access pattern on files served over NFS will lead to rather severe performance losses
- test the extraction and plotting with a subset of your data first. There is no need to waste time & energy if there's a typo somewhere or the parameters of a task are not set appropriately.
- increase the verbosity with `-vv...` to see what the extractors, transformers and exporters do, e.g. which columns are extracted, what the in- and output of the transformers look like, and how columns get categorized.
- if you are working with a subset of your data, be sure to test the assumptions made in your code on the whole dataset, e.g. the handling of NaN values. Testing for those edge cases is a good way to verify your simulation code too.
- if you just want to extract data, add `--eval-only` to your command line. Similarly, if you just want to plot, add `--plot-only`.
- test the regular expression used for the paths to the input data. In a hurry, one might have copied an expression that uses shell-style globbing instead of a regular expression.
- test the regular expression used for tag extraction in a Python REPL on the actual string values stored in the database (a short sketch of this follows the list).
- only three parameters for partitioning the data are supported in plotting tasks. If more are needed, use a `GroupedFunctionTransform` to add another column to the DataFrame that combines multiple parameters into a string, which can then be used as the `hue`, `row` or `column` parameter of a plotting task (a sketch of such a combined column follows the list).
- running `run_recipe.py` with the parameter `--plot-task-graphs` generates plots of the task graph in the directory for temporary files (set with `--tmpdir`).
- when using `pandas.DataFrame.groupby` to partition the data, e.g. in a `GroupedFunctionTransform`, try to limit the number of keys used for partitioning and the size of the input `DataFrame`s to minimise processing time and memory usage. Only concatenate the input into one large `DataFrame` if the operation has to run over all data, or over subsets of the data that cannot otherwise be selected easily (a sketch of partition-wise processing follows the list).
- when extracting a signal together with its associated position data, many SQL JOIN operations over the `eventNumber` column are executed. This is fairly slow since there is no index over the `eventNumber` column (inspect the `sqlite_master` table in the result database to verify this; a sketch for listing the existing indexes follows the list). To improve performance, one can construct an index over `eventNumber`:

  ```sh
  #!/bin/sh
  for file in "$@"
  do
      \time sqlite3 "$file" 'CREATE INDEX eventNumber_index ON vectorData(eventNumber);' \
          && echo "created index over eventNumber in" "$file"
  done
  ```

  This roughly doubles the storage usage, so it should only be applied to a copy of the dataset, and the resulting files should be deleted after use. Note: this only affects the extraction from the databases; all later stages are unaffected.
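A minimal sketch for checking a tag-extraction regular expression against the strings that are actually stored in a result database. The pattern, the database path and the `runAttr`/`attrValue` table and column names are placeholders, not part of this framework; substitute the expression from your recipe and the table that actually holds your tag strings.

```python
import re
import sqlite3

# Placeholder pattern; replace it with the expression from your recipe.
pattern = re.compile(r"numNodes=(?P<numNodes>\d+)")

# Placeholder path, table and column; inspect your result database
# (e.g. SELECT * FROM sqlite_master;) to find where the tag strings live.
con = sqlite3.connect("results/example.sqlite3")
rows = con.execute("SELECT DISTINCT attrValue FROM runAttr LIMIT 20").fetchall()
con.close()

# Print each stored string together with the captured groups (or a miss).
for (value,) in rows:
    match = pattern.search(value)
    print(f"{value!r:50} -> {match.groupdict() if match else 'no match'}")
```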
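A minimal sketch of the combined-column idea for plotting with more than three partitioning parameters. The column names `txPower` and `bitrate` are made up for illustration, and how the function is attached to a `GroupedFunctionTransform` depends on your recipe; only the DataFrame operation itself is shown.

```python
import pandas as pd

def add_combined_label(df: pd.DataFrame) -> pd.DataFrame:
    """Merge two parameters into one string column usable as `hue`, `row` or `column`."""
    df = df.copy()
    df["txPower_bitrate"] = (
        df["txPower"].astype(str) + " dBm / " + df["bitrate"].astype(str) + " bps"
    )
    return df

# Example input with the assumed parameter columns:
df = pd.DataFrame({"txPower": [10, 10, 20],
                   "bitrate": [6e6, 12e6, 6e6],
                   "value": [0.1, 0.2, 0.3]})
print(add_combined_label(df))
```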
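A minimal sketch of partition-wise processing with `pandas.DataFrame.groupby`: each group is reduced on its own instead of first concatenating everything into one large `DataFrame`. The column names and data are placeholders standing in for one of the extracted `DataFrame`s.

```python
import pandas as pd

# Placeholder data standing in for an extracted DataFrame.
df = pd.DataFrame({
    "run": ["r0", "r0", "r1", "r1"],
    "node": ["a", "b", "a", "b"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Few grouping keys keep the number of partitions low; each partition is
# reduced independently, so no concatenated copy of all data is needed.
summary = (
    df.groupby(["run", "node"], sort=False)["value"]
      .agg(["mean", "count"])
      .reset_index()
)
print(summary)
```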
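A minimal sketch for listing the indexes of a result database via its `sqlite_master` table, e.g. to check whether the `eventNumber` index from the shell snippet above already exists. The database path is a placeholder.

```python
import sqlite3

con = sqlite3.connect("results/example.sqlite3")  # placeholder path
indexes = con.execute(
    "SELECT name, tbl_name, sql FROM sqlite_master WHERE type = 'index'"
).fetchall()
con.close()

for name, table, sql in indexes:
    # `sql` is NULL for implicitly created indexes (e.g. UNIQUE constraints).
    print(f"{table}: {name}\n  {sql or '(auto index)'}")
```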