Best practices
This is a collection of best practices to consider when using this framework:
- keep the input data (SQLite3/feather files) on an SSD; otherwise the random access pattern on the files over NFS will lead to rather severe performance losses.
- test the extraction and plotting with a subset of your data first. No need to waste time & energy if there is a typo somewhere or the parameters of a task are not set appropriately.
- increase the verbosity with `-vv...` to see what the extractors, transformers and exporters do, e.g. which columns are extracted, which are the in- and output of transformers, and which get categorized.
- if you are working with a subset of your data, be sure to test the assumptions made in your code on the whole dataset, e.g. the handling of NaN values. Testing for those edge cases is a good way to verify your simulation code too.
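  A minimal sketch of such a check, assuming a `DataFrame` named `df` with a hypothetical `rssi` column; adapt the column names to your own data:

  ```python
  import pandas as pd

  # Hypothetical extracted data; replace with the DataFrame produced by your
  # extraction stage.
  df = pd.DataFrame({"eventNumber": [1, 2, 3], "rssi": [-92.0, None, -88.5]})

  # Verify the assumption "there are no NaN values" on the whole dataset.
  nan_counts = df.isna().sum()
  print(nan_counts[nan_counts > 0])

  # Handle them explicitly instead of relying on silent propagation,
  # e.g. drop incomplete rows (or fill them, depending on the analysis).
  df = df.dropna(subset=["rssi"])
  ```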
- if you just want to extract data, add `--eval-only` to your command line. Similarly, if you just want to plot, add `--plot-only`.
- test the regular expression used for the paths to the input data. One might have hurried and copied an expression that uses shell-style globbing instead of regular expression syntax.
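  A quick way to catch this is to try the expression on one known path before starting a full run; the pattern and path below are made up for illustration:

  ```python
  import re

  # A shell glob such as 'results/*/vars_*.sqlite3' would not work here,
  # because '*' is a quantifier in a regular expression, not a wildcard.
  path_pattern = r"results/.*/vars_.*\.sqlite3$"
  sample_path = "results/campaign_a/vars_rep_0.sqlite3"

  print(bool(re.search(path_pattern, sample_path)))  # True
  ```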
- test the regular expression used for tag extraction in a Python REPL on the actual string values in the database.
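  For example, in a REPL session (the tag string and the groups below are hypothetical; paste real values from your database):

  ```python
  >>> import re
  >>> value = "run-numPeds=40,seed=7"  # copy an actual value from the database
  >>> pattern = re.compile(r"numPeds=(?P<numPeds>\d+),seed=(?P<seed>\d+)")
  >>> pattern.search(value).groupdict()
  {'numPeds': '40', 'seed': '7'}
  ```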
- only three parameters for partitioning the data are supported in plotting tasks. If more are needed, use a `GroupedFunctionTransform` to add another column to the `DataFrame` that combines multiple parameters into a string, which can then be used in the `hue`, `row` or `column` parameter of a plotting task.
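  The underlying pandas operation is plain string concatenation of the parameter columns; a sketch with made-up parameter names (how the function is attached to a `GroupedFunctionTransform` depends on your recipe):

  ```python
  import pandas as pd

  # Hypothetical run parameters that exceed the three supported
  # partitioning dimensions of a plotting task.
  df = pd.DataFrame({"txPower": [10, 10, 20], "bitrate": [6, 12, 6]})

  # Combine two parameters into one string column that can then be used
  # as the hue, row or column parameter of a plotting task.
  df["txPower_bitrate"] = (
      df["txPower"].astype(str) + "dBm/" + df["bitrate"].astype(str) + "Mbps"
  )
  print(df["txPower_bitrate"].unique())
  ```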
- running `run_recipe.py` with the parameter `--plot-task-graphs` generates plots of the task graph in the directory for temporary files (set with `--tmpdir`).
- when using `pandas.DataFrame.groupby` to partition the data, e.g. using the `GroupedFunctionTransform`, try to limit the number of keys used for partitioning and the size of the input `DataFrame`s to minimise processing time and memory usage. Only concatenate the input into one large `DataFrame` if the operation has to happen over all data or over subsets of the data that cannot otherwise be easily selected.
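  A sketch of this pattern with hypothetical column names; each group is reduced on its own instead of concatenating everything first:

  ```python
  import pandas as pd

  # Hypothetical extracted data.
  df = pd.DataFrame({
      "runId": ["r0", "r0", "r1", "r1"],
      "delay": [0.12, 0.10, 0.31, 0.29],
  })

  # Group by as few keys as necessary and reduce each small group directly ...
  per_run = df.groupby("runId")["delay"].mean()
  print(per_run)

  # ... rather than building one large concatenated DataFrame first, which is
  # only needed when an operation really has to see all the data at once.
  ```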
- when extracting a signal with its associated position data, a lot of SQL JOIN operations over the `eventNumber` column are executed. This is fairly slow since there is no index over the `eventNumber` column (inspect the `sqlite_master` table in the result database to verify this). To improve performance, one can construct an index over `eventNumber`:

  ```sh
  #!/bin/sh
  # Create an index over eventNumber in every result database given as argument.
  for file in "$@"
  do
      \time sqlite3 "$file" 'CREATE INDEX eventNumber_index ON vectorData(eventNumber);' \
          && echo "created index over eventNumber in $file"
  done
  ```
  This increases the storage usage by a factor of two, so it should only be done on a copy of the dataset, and the resulting files should be deleted after use. Note: this only applies to the extraction from the databases; all stages after that are not affected.
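  To check whether such an index already exists, the `sqlite_master` table can be inspected directly; a small sketch using the standard library, with a placeholder file name:

  ```python
  import sqlite3

  # Placeholder path; point this at one of your result databases.
  with sqlite3.connect("results/run_0.sqlite3") as con:
      indices = con.execute(
          "SELECT name, tbl_name FROM sqlite_master WHERE type = 'index'"
      ).fetchall()

  # If no entry mentions vectorData/eventNumber, the JOINs cannot use an index.
  print(indices)
  ```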