Best practices

This is a collection of best practices to consider when using this framework:

  • keep the input data (SQLite3/feather files) on an SSD; otherwise the random access pattern on files read over NFS will lead to rather severe performance losses.

  • test the extraction and plotting with a subset of your data first. No need to waste time & energy if there’s a typo somewhere or the parameters of a task are not appropriately set.

  • increase the verbosity with -vv... to see what extractors, transformers and exporters do, e.g. which columns are extracted, which columns serve as input and output of transformers, and how they get categorized.

  • if you are working with a subset of your data, be sure to test the assumptions made in your code on the whole dataset, e.g. handling of NaN values. Testing for those edge cases is a good way to verify your simulation code too.
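
    As an illustration, here is a minimal pandas sketch of such a check; the column names are invented for this example and not part of the framework:

        import numpy as np
        import pandas as pd

        # Invented example data; substitute your extracted DataFrame.
        df = pd.DataFrame({
            "eventNumber": [0, 1, 2, 3],
            "rxPower": [1.2, np.nan, 0.8, np.nan],
        })

        # Check whether any column contains NaN values; an assumption
        # that holds on a small subset may silently break on the full
        # dataset.
        print(df.isna().any())

        # Decide explicitly how NaNs are handled, e.g. drop the rows.
        cleaned = df.dropna(subset=["rxPower"])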

  • if you just want to extract data, add --eval-only to your command line; similarly, if you just want to plot, add --plot-only.

  • test the regular expression used for the paths to the input data; in a hurry, it is easy to copy a globbing expression as used by a shell instead of a proper regular expression.
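
    To illustrate the difference, a short Python snippet; the file name and patterns here are made up:

        import re
        from fnmatch import translate

        path = "results/run-42.sqlite3"

        # A shell glob copied verbatim is not a valid regular expression:
        # re.compile("*.sqlite3") raises re.error ("nothing to repeat").
        # The equivalent regular expression looks like this:
        pattern = re.compile(r".*\.sqlite3$")
        print(bool(pattern.search(path)))  # True

        # fnmatch.translate converts a shell glob into a regular expression:
        print(translate("*.sqlite3"))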

  • test the regular expression used for tag extraction in a Python REPL on the actual string values stored in the database.
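
    For example, with a made-up tag string (the actual format depends on your simulation campaign), such a REPL session could look like this:

        >>> import re
        >>> # Paste an actual value from the database here; this one is invented.
        >>> tag = "numNodes=20,seed=3"
        >>> m = re.search(r"numNodes=(\d+)", tag)
        >>> m.group(1)
        '20'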

  • only three parameters for partitioning the data are supported in plotting tasks. If more are needed, use a GroupedFunctionTransform to add a column to the DataFrame that combines multiple parameters into a single string, which can then be used as the hue, row or column parameter of a plotting task; see the sketch below.
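
    A sketch of such a combined column in plain pandas, with invented column names (within the framework, this would be done inside a GroupedFunctionTransform):

        import pandas as pd

        # Invented example parameters.
        df = pd.DataFrame({
            "numNodes": [10, 10, 20],
            "bitrate": [6, 12, 6],
        })

        # Combine two parameters into one string column that can then be
        # assigned to the hue, row or column parameter of a plotting task.
        df["numNodes_bitrate"] = (
            df["numNodes"].astype(str) + "/" + df["bitrate"].astype(str)
        )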

  • running run_recipe.py with the parameter --plot-task-graphs generates plots of the task graph in the directory for temporary files (set with --tmpdir).

  • when using pandas.DataFrame.groupby to partition the data, e.g. in a GroupedFunctionTransform, try to limit the number of keys used for partitioning and the size of the input DataFrames to minimise processing time and memory usage. Only concatenate the input into one large DataFrame if the operation has to happen over all data, or over subsets of the data that can’t otherwise be easily selected.
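
    A minimal sketch of this pattern in plain pandas, with invented column names and a placeholder computation:

        import pandas as pd

        def process(group: pd.DataFrame) -> pd.DataFrame:
            # Placeholder for the actual per-partition computation.
            return group.assign(delay_mean=group["delay"].mean())

        df = pd.DataFrame({
            "run": [0, 0, 1, 1],
            "delay": [0.8, 0.7, 0.9, 1.1],
        })

        # Group by as few keys as necessary: every additional key
        # multiplies the number of partitions and the bookkeeping overhead.
        result = pd.concat(process(group) for _, group in df.groupby("run"))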

  • when extracting a signal together with its associated position data, many SQL JOIN operations over the eventNumber column are executed. This is fairly slow since there is no index over the eventNumber column (inspect the sqlite_master table in the result database to verify this). To improve performance, one can construct an index over eventNumber:

    #!/bin/sh

    # Create an index over eventNumber in each database passed as argument.
    for file in "$@"
    do
        \time sqlite3 "$file" 'CREATE INDEX eventNumber_index ON vectorData(eventNumber);' \
            && echo "created index over eventNumber in $file"
    done
    

    This increases the storage usage by a factor of two, so it should only be used on a copy of the dataset, and the resulting files should be deleted after use. Note that this only affects the extraction from the databases; all stages after that are unaffected.
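
    To verify whether the index exists, one can query the sqlite_master table, e.g. with Python’s sqlite3 module (the file name below is illustrative):

        import sqlite3

        con = sqlite3.connect("results/run-42.sqlite3")

        # List all indices known to the database; after running the script
        # above, eventNumber_index should show up here.
        for name, sql in con.execute(
            "SELECT name, sql FROM sqlite_master WHERE type = 'index'"
        ):
            print(name, sql)
        con.close()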