Reprocessing

Reprocessing Observations:

To process many nights of data, the recommended procedure is to use scripts/reprocess_obs.py, which is to be executed inside a KPF Docker container. Change to your KPF-Pipeline git-repo directory, and start an interactive Docker container set up for running the KPF DRP with the following command:

./docker_run.sh

The reprocess_obs.py script deletes old 2D/L1/L2/QLP/outliers/logs/logs_QLP files before starting reprocessing of a given night. Using the linux utility ‘nice’, it’s execution is deprioritized (nice=15) so that regular processing isn’t slowed down. It also procuces or appends to a log file (default name: reprocess_obs.log) with columns Datecode, Start Time, End Time, Run Time, and Version. Dates that have been reprocessed with the same pipeline versions (according to the log file) are skipped. In production processing by the DRP development team, this command is in the xterm called Reprocessing.:

python3 reprocess_obs.py --delete --ncpu 96 yyyymmdd YYYYMMDD

Here’s the docstring showing all of the options.:

usage: reprocess_obs.py [-h] [--ncpu NCPU] [--delete] [--verbose] [--force] [--logfile LOGFILE]
                         [--forward] [--not-nice] [--dry-run] [--local-tz LOCAL_TZ]
                        startdate enddate

Reprocess KPF data over a date range.

positional arguments:
  startdate            Start date in YYYYMMDD format
  enddate              End date in YYYYMMDD format

options:
  -h, --help           show this help message and exit
  --ncpu NCPU          Number of CPUs to use
  --delete             Delete existing 2D/L1/L2/QLP (but not L0/masters) files before reprocessing
  --qlp-regen          Regenerate Quicklook plots (and yaml files for QC) after reprocessing
  --verbose            Verbose stdout
  --force              Process even if datecode/version are listed in the logfile
  --logfile LOGFILE    Log file path
  --forward            Process datecodes in chronological order (reverse is default)
  --not-nice           Do not apply standard nice (=15) deprioritization
  --dry-run            Print commands without executing them
  --local-tz LOCAL_TZ  Local timezone (default: America/Los_Angeles)

One can also reprocess KPF data using the kpf command and a recipe. The command below launches 50 processes to reprocess L0 files into 2D/L1/L2 files for the date YYYYMMDD.:

kpf --ncpu 50 --watch /data/L0/YYYYMMDD/ --reprocess -c configs/kpf_drp.config -r recipes/kpf_drp.recipe

Reprocessing Masters:

Reprocessing master files over a range of observation dates from yyyymmdd to YYYYMMDD is accomplished with scripts/reprocess_masters.py, which is a cousin of reprocess_obs.py having similar command-line options. This script is to be executed inside a KPF Docker container. A number of master files are produced, and it can take hours to reprocess a single observation date. In production processing by the DRP development team, this command is in the xterm called Masters Repocessing.

Change to your KPF-Pipeline git-repo directory, and start an interactive Docker container set up for reprocessing masters running with the following command:

./docker-masters-run.sh

An example command to reprocess a single night is (start and end dates are the same):

python3 scripts/reprocess_masters.py 20241009 20241009 --ncpu 1 --verbose  --not-nice --force

Here’s the docstring showing all of the options:

usage: reprocess_masters.py [-h] [--steps {2d,stacks_etc,order_stuff,l12,wls,etalon} [{2d,stacks_etc,order_stuff,l12,wls,etalon} ...]] [--force] [--dry-run]
                            [--logfile LOGFILE] [--ncpu NCPU] [--forward] [--not-nice] [--local-tz LOCAL_TZ] [-v]
                            startdate enddate

Run complete masters pipeline or specified parts inside container.

positional arguments:
  startdate             Start observation date in YYYYMMDD format
  enddate               End observation date in YYYYMMDD format (same as startdate for single date)

options:
  -h, --help            show this help message and exit
  --steps {2d,stacks_etc,order_stuff,l12,wls,etalon} [{2d,stacks_etc,order_stuff,l12,wls,etalon} ...]
                        Which steps to run (default: all)
  --force               Process even if datecode/version are listed in the logfile
  --dry-run             Dry run mode: print commands without executing them
  --logfile LOGFILE     Log file path
  --ncpu NCPU           Number of parallel observation-date processes (default = 1)
  --forward             Process dates in chronological order (reverse is default)
  --not-nice            Do not apply standard nice (=15) deprioritization
  --local-tz LOCAL_TZ   Local timezone for logfile lines (default: America/Los_Angeles)
  -v, --verbose         Print detailed messages during execution

At the start of the script, there are lines to remove old master files for the specified dates and remove associated records from the CalFiles database table.

The name of the log file can be specified (the default is masters_reprocessing.log).

Quicklook reprocessing – qlp_parallel.py:

For a daterange from yyyymmdd to YYYYMMDD with NCPU cpus.:

./scripts/qlp_parallel.py yyyymmdd YYYYMMDD --ncpu NCPU --l0 --2d --l1 --l2 --master

The full description is here:

Description:
  This command line script uses the 'parallel' utility to execute the recipe
  called 'recipes/quicklook_match.recipe' to generate standard Quicklook data
  products.  The script selects all KPF files based on their
  type (L0/2D/L1/L2/master) from the standard data directory using a date
  range specified by the parameters start_date and end_date.  L0 files are
  included if the --l0 flag is set or none of the --l0, --2d, --l1, --l2
  flags are set (in which case all data types are included).  The --2d,
  --l1, and --l2 flags have similar functions.  The script assumes that it
  is being run in Docker and will return with an error message if not.
  If start_date is later than end_date, the arguments will be reversed
  and the files with later dates will be processed first.

  Invoking the --print_files flag causes the script to print filenames
  but not create QLP data products.

  The --ncpu parameter determines the maximum number of cores used.

  The following feature is not operational if this script is run inside of
  a Docker container: If the --load parameter (a percentage, e.g. 90 = 90%)
  is set to a non-zero value, this script will be throttled so that no new
  files will have QLPs processed until the load is below that value.  Note
  that throttling works in steady state; it is possible to overload the
  system with the first set of jobs if --ncpu is set too way high.

Arguments:
  start_date     Start date as YYYYMMDD, YYYYMMDD.SSSSS, or YYYYMMDD.SSSSS.SS
  end_date       End date as YYYYMMDD, YYYYMMDD.SSSSS, or YYYYMMDD.SSSSS.SS

Options:
  --l0           Select all L0 files in date range
  --2d           Select all 2D files in date range
  --l1           Select all L1 files in date range
  --l2           Select all L2 files in date range
  --master       Select all master files in date range
  --ncpu         Number of cores used for parallel processing; default=10
  --load         Maximum load (1 min average); default=0 (only activated if !=0)
  --print_files  Display file names matching criteria, but don't generate Quicklook plots
  --help         Display this message

Usage:
  python qlp_parallel.py YYYYMMDD.SSSSS YYYYMMDD.SSSSS --ncpu NCPU --load LOAD --l0 --2d --l1 --l2 --master --print_files

Examples:
  ./scripts/qlp_parallel.py 20230101.12345.67 20230101.17 --ncpu 50 --l0 --2d
  ./scripts/qlp_parallel.py 20240501 20240505 --ncpu 150 --load 90

Reprocess specific observations – slowtouch.py:

Individual observations can be reprocessed by touching the L0 files. To reprocess a set of files, use the script slowtouch.sh. Files are touched slowly (usually with 0.2 sec between touching individual files) to avoid overloading the file event triggers system that initiate reprocessing of specific files.:

./scripts/slowtouch.py

This script is used to touch a list of KPF L0 files that have names like KP.20230623.12345.67.fits. This is useful to initiate reprocessing using the KPF DRP. The full descriptio is here:

Script name: slowtouch.py

This script 'touches' a list of KPF L0 files with names like
KP.YYYYMMDD.12345.67.fits to trigger reprocessing in the KPF DRP.

Ways to provide filenames (any combination works):
  1) As positional arguments on the command line.
  2) With -f <csv>, reading the first column (quotes removed; header 'observation_id' skipped).
  3) With -d <dir>, adding every file name in that directory.

Date range mode (Docker only):
  If you pass exactly two positional arguments that are valid datecodes
  (YYYYMMDD) and you do NOT use -f/--csv or -d/--dir, the script switches to
  'date range mode'. This mode is available only when running inside a Docker
  container. In date range mode it:
    • Validates the two YYYYMMDD values and sorts them into start_date/end_date.
    • Uses the time series database (TSDB) to query for ObsIDs in that date window,
      optionally filtering by:
        --only-object <name>   (e.g., autocal-bias)
        --only-source <name>   (e.g., Star, Etalon, Dark, etc.)
    • Touches each matched ObsID's L0 file under the resolved L0 base path.

  If you attempt date range mode outside Docker, the script prints an error and exits.

Options (all optional):
  -f, --csv <filename>       CSV with L0 filenames in the first column (can be used multiple times)
  -d, --dir <directory>      Directory to scan for filenames (can be used multiple times)
  -p, --path <path>          L0 base path (default: automatic)
                             automatic -> /data/L0 when in Docker, /data/kpf/L0 otherwise
  -s, --sleep <seconds>      Sleep interval between touches (default: 0.2)
  -e, --echo                 Echo touch commands instead of executing
      --only-object <name>   (Date range mode) filter TSDB rows to this OBJECT (e.g., autocal-bias)
      --only-source <name>   (Date range mode) filter TSDB rows to this SOURCE (e.g., Star, Etalon, Dark)

Examples:
  slowtouch.py KP.20230623.12345.67.fits KP.20230623.12345.68.fits # touch two files (matched to L0 dir by ObsID)
  slowtouch.py -f filenames.csv                                    # touch files in first col of csv
  slowtouch.py -d /path/to/directory                               # touch files in dir (matched to L0 dir by ObsID)
  slowtouch.py KP.20230623.12345.67.fits -p /new/L0/path -s 0.5    # specify L0 path and sleep interval
  slowtouch.py KP.20230623.12345.67.fits -e                        # echo touch commands
  slowtouch.py 20241001 20241015 --only-object autocal-dark        # touch matching object name in date range
  slowtouch.py 20241001 20241015 --only-source Star                # touch matching source type in date range
  slowtouch.py 20241001 20241015 --only-source Etalon              # touch matching source type in date range

Reprocess specific observations – kpf_slowtouch.sh (deprecated; use slowtouch.py instead):

Individual observations can be reprocessed by touching the L0 files, or touching the 2D/L1/L2 files to start reprocessing at a later stage. To reprocess a set of files, use the script kpf_slowtouch.sh. Files are touched slowly (usually with 0.2 sec between touching individual files) to avoid overloading the file event triggers system that initiate reprocessing of specific files.:

./scripts/kpf_slowtouch.sh

This script is used to touch a list of KPF L0 files that have names like KP.20230623.12345.67.fits. This is useful to initiate reprocessing using the KPF DRP. The list of L0 files can be provided in multiple ways:

  1. As command-line arguments when invoking the script.

  2. In the first column of a CSV file specified with the -f option. This is useful for CSV files with a large set of L0 filenames downloaded from Jump. Such files might have double quotes around the L0 filename, which the script will remove when appropriate.

  3. All filenames in a directory specified with the -d option.

The (optional) command-line options are:

-f <filename>       : The script will read the KPF L0 filenames
                      from the first column of a CSV with the name <filename>.
                      Useful for lists of L0 files downloaded from Jump.
-d <directory>      : Adds every file in <directory> to the list of L0 files.
-p <path>           : Sets the L0 path to <path>.
                      Default value: /data/kpf/L0
-s <sleep_interval> : Sets the interval between file touches.
                      Default value: 0.2 [sec]
-e                  : Echo the touch commands instead of executing them.

Some example uses of this script are:

  1. To provide filenames using command line arguments: ./kpf_slowtouch.sh KP.20230623.12345.67.fits KP.20230623.12345.68.fits

  2. To provide filenames using a CSV file: ./kpf_slowtouch.sh -f filenames.csv

  3. To provide files listed in a directory: ./kpf_slowtouch.sh -d /path/to/directory

  4. To change the default L0 path and sleep interval between touches: ./kpf_slowtouch.sh KP.20230623.12345.67.fits -p /new/path -s 0.5

  5. To echo the touch commands instead of executing them: ./kpf_slowtouch.sh KP.20230623.12345.67.fits -e

Monitoring processing progress – kpf_processing_progress.py:

Print the status of processing for a date range:

./scripts/kpf_processing_progress.py YYYYMMDD YYYYMMDD

The full description is here:

Description:
  This script is used to assess the status and progress of processing KPF data.
  It searches over a range of dates specified by the first two arguments which are
  of the form YYYYMMDD.  For each date (with /data/kpf/L0/YYYYMMDD as the
  assumed L0 directory), it examines each L0 file and the associated 2D/L1/L2
  files in their related directories.  If the first argument is a date after the
  second argument, then the dates are printed in reverse chronological order (later
  dates first).  The output of this script is a table with columns indicating the
  date for each row, the most recent modification date for and L0 file in that
  directory, the fraction of 2D files processed, the fraction of L1 files processed,
  and the fraction of L2 files processed.  Sample output is shown below.

  > ./scripts/kpf_processing_progress.py 20231231 20230101 --current_version 2.5


  DATECODE | LAST L0 MOD DATE | 2D PROCESSING  | L1 PROCESSING  | L2 PROCESSING
  ------------------------------------------------------------------------------
  20231221 | 2023-12-21 10:18 |  256/256  100% |  254/256   99% |  229/230   99%
  20231220 | 2023-12-20 16:00 |  342/342  100% |  342/342  100% |  315/315  100%
  20231219 | 2023-12-19 16:00 |  406/406  100% |  406/406  100% |  377/379   99%
  20231218 | 2023-12-18 16:00 |  531/531  100% |  528/531   99% |  501/504   99%
  20231217 | 2023-12-17 16:00 |  524/524  100% |  524/524  100% |  497/497  100%
  20231216 | 2023-12-16 16:00 |  527/527  100% |  524/527   99% |  497/500   99%

  The following criteria are used to determine if 2D/L1/L2 files are "processed":

      - not in the junk file list ('/data/kpf/reference/Junk_Observations_for_KPF.csv');
        if the file is missing, all files are assumed to not be junk
      - have the Green, Red, or CaHK extension present in the L0 file
      - not a Dark or Bias exposure [only applied to L2 files]
      - the 2D/L1/L2 exists
      - the modification time of the 2D/L1/L2 file is later than the
        modification time of the associated L0 file
      - the DRP version number is equal to or greater than the current DRP version
        number of the master branch on Github [only if --check_version option
        selected]

                #    - not junk
                #    - Green, Red, or CaHK extension present
                #    - not a Dark or Bias exposure
                #    - file present
                #    - L2 modification time more recent than L0 modification time
                #    - current DRP version number (if check_version option selected)

  Command-line options listed below enable touching of the L0 files associated
  with 2D/L1/L2 files that are not present, printing those filenames, printing the
  filenames of the 2D/L1/L2 files themselves, and turning on the DRP version check.

Options:
  --help             Display this message
  --print_files      Display missing file names (or files that fail other criteria)
  --print_files_2D   Display missing 2D file names (or files that fail other criteria)
  --print_files_L1   Display missing L1 file names (or files that fail other criteria)
  --print_files_L2   Display missing L2 file names (or files that fail other criteria)
  --touch_files      Touch the base L0 files of missing 2D/L1/L2 files
  --check_version    Checks that each 2D/L1/L2 file has the current Git version for the KPF-Pipeline
  --current_version  The current version of determining completion status; e.g. --current version 2.5

Usage:
  kpf_processing_progress.py YYYYMMDD [YYYYMMDD] [--print_files] [--print_files_2D] [--print_files_L1] [--print_files_L2] [--touch_files] [--check_version]

Example:
  ./scripts/kpf_processing_progress.sh 20231114 20231231 --print_files