Best practices
This is a collection of best practices to consider when using this framework:
Keep the input data (SQLite3/feather files) on an SSD disk; otherwise the random access pattern on the files over NFS will lead to rather severe performance losses.
Test the extraction and plotting with a subset of your data first. There is no need to waste time and energy if there is a typo somewhere or the parameters of a task are not appropriately set.
Increase the verbosity with `-vv...` to see what extractors, transformers and exporters do, e.g. which columns are extracted, which are the in- and outputs of transformers, and which get categorized.

If you are working with a subset of your data, be sure to test the assumptions made in your code on the whole dataset, e.g. the handling of NaN values; a minimal check is sketched below. Testing for those edge cases is a good way to verify your simulation code too.
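For example, a quick check of the extracted data can surface columns with NaN values before you invest time in a full run. This is only a minimal sketch; the DataFrame and column names (`eventNumber`, `rxPower`) are hypothetical placeholders for whatever your extraction step actually produces:

```python
import numpy as np
import pandas as pd

# Hypothetical extracted data; replace with the DataFrame produced by your extraction step.
df = pd.DataFrame({
    "eventNumber": [1, 2, 3, 4],
    "rxPower": [0.5, np.nan, 0.7, np.nan],
})

# Report which columns contain NaN values so the transform code can be checked against them.
nan_columns = df.columns[df.isna().any()].tolist()
print(f"columns with NaN values: {nan_columns}")

# Make the chosen policy explicit instead of silently propagating NaN, e.g. drop incomplete rows.
clean = df.dropna(subset=["rxPower"])
print(f"dropped {len(df) - len(clean)} rows with missing rxPower")
```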
If you just want to extract data, add `--eval-only` to your command line. Similarly, if you just want to plot, add `--plot-only`.
Test the regular expression used for the paths to the input data. One might have hurried and copied an expression with globbing as used by a shell, which has different semantics.
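To illustrate the difference, a shell glob such as `*.sqlite3` does not mean the same thing when interpreted as a regular expression. The following sketch (with made-up file names) contrasts the two using Python's `fnmatch` and `re` modules:

```python
import fnmatch
import re

paths = ["results/run-0.sqlite3", "results/run-1.sqlite3", "results/notes.txt"]

# Shell globbing: '*' matches any characters.
glob_pattern = "*.sqlite3"
print([p for p in paths if fnmatch.fnmatch(p, glob_pattern)])

# Regular expression: the equivalent pattern needs '.*' and an escaped dot.
regex_pattern = r".*\.sqlite3$"
print([p for p in paths if re.search(regex_pattern, p)])

# fnmatch.translate converts a glob into the corresponding regular expression.
print(fnmatch.translate(glob_pattern))
```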
Test the regular expression used for tag extraction in a Python REPL on the actual string values in the database.
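A rough sketch of such a check is shown below. The database file name, the queried table and column (`runAttr`, `attrValue`), and the regex itself are assumptions for illustration and need to be adapted to your result files:

```python
import re
import sqlite3

# Placeholder database; table and column names are assumptions, adapt them to your schema.
con = sqlite3.connect("results/run-0.sqlite3")
values = [row[0] for row in con.execute("SELECT attrValue FROM runAttr LIMIT 20")]
con.close()

# Candidate tag-extraction regex (also an assumption), using a named group.
pattern = re.compile(r"numNodes=(?P<numNodes>\d+)")

for value in values:
    match = pattern.search(value)
    print(repr(value), "->", match.groupdict() if match else "no match")
```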
Only three parameters for partitioning the data are supported in plotting tasks. If more are needed, use a `GroupedFunctionTransform` to add another column to the DataFrame that combines multiple parameters into a string, which can then be used in the `hue`, `row` or `column` parameter of a plotting task.
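The exact `GroupedFunctionTransform` API is not shown here; the sketch below only illustrates the underlying idea in plain pandas, i.e. deriving a single string column from several parameter columns (all names are hypothetical) that a plotting task can then use as its `hue`, `row` or `column` parameter:

```python
import pandas as pd

# Hypothetical parameter columns; in practice these come from the extracted DataFrame.
df = pd.DataFrame({
    "numNodes": [10, 10, 20, 20],
    "txPower": [1.0, 2.0, 1.0, 2.0],
    "delay": [0.12, 0.15, 0.20, 0.30],
})

# Combine two parameters into one string column usable as `hue`, `row` or `column`.
df["nodes_txPower"] = (
    "nodes=" + df["numNodes"].astype(str) + " tx=" + df["txPower"].astype(str)
)
print(df)
```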
Running `run_recipe.py` with the parameter `--plot-task-graphs` generates plots of the task graph in the directory for temporary files (set with `--tmpdir`).

When using `pandas.DataFrame.groupby` to partition the data, e.g. in a `GroupedFunctionTransform`, try limiting the number of keys used for partitioning and the size of the input `DataFrame`s to minimise processing time and memory usage. Only concatenate the input into a large `DataFrame` if the operation has to happen over all data or over subsets of the data that cannot otherwise be easily selected.
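As a rough illustration of that advice (column names are made up), process each partition on its own and only build one large DataFrame when a later step genuinely needs all of the data at once:

```python
import pandas as pd

# Hypothetical, already pre-filtered input DataFrame.
df = pd.DataFrame({
    "run": [0, 0, 1, 1, 2, 2],
    "module": ["a", "b", "a", "b", "a", "b"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Group by as few keys as necessary and keep only the small per-group result in memory.
results = []
for run, group in df.groupby("run"):
    results.append({"run": run, "mean_value": group["value"].mean()})

# The concatenated result here is a small summary, not the full raw data.
summary = pd.DataFrame(results)
print(summary)
```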
When extracting a signal with its associated position data, a lot of SQL JOIN operations over the `eventNumber` column are executed. This is a fairly slow procedure since there is no index over the `eventNumber` column (inspect the `sqlite_master` table in the result database to verify this). To improve performance, one can construct an index over `eventNumber`:
```sh
#!/bin/sh
# Create an index over eventNumber in each result database given as an argument.
for file in "$@"
do
    \time sqlite3 "$file" 'CREATE INDEX eventNumber_index ON vectorData(eventNumber);' \
        && echo "created index over eventNumber in" "$file"
done
```
This increases storage usage by a factor of two, so it should only be done on a copy of the dataset, and the resulting files should be deleted after use. Note that this only applies to extraction from the databases; all stages after that are not affected.
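To verify whether such an index already exists, the `sqlite_master` table mentioned above can be queried directly; a small sketch (the file name is a placeholder):

```python
import sqlite3

# Placeholder file name; point this at one of your result databases.
con = sqlite3.connect("results/run-0.sqlite3")
indexes = con.execute(
    "SELECT name, tbl_name FROM sqlite_master WHERE type = 'index'"
).fetchall()
con.close()

# An entry like ('eventNumber_index', 'vectorData') means the index has been created.
print(indexes)
```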