Reprocessing
Reprocessing Observations:
To process many nights of data, the recommended procedure is to use scripts/reprocess_obs.py, which is to be executed inside a KPF Docker container. Change to your KPF-Pipeline git-repo directory, and start an interactive Docker container set up for running the KPF DRP with the following command:
./docker_run.sh
When run with --delete, the reprocess_obs.py script deletes old 2D/L1/L2/QLP/outliers/logs/logs_QLP files before starting to reprocess a given night. Using the Linux utility ‘nice’, its execution is deprioritized (nice=15) so that regular processing isn’t slowed down; pass --not-nice to disable this. It also creates or appends to a log file (default name: reprocess_obs.log) with columns Datecode, Start Time, End Time, Run Time, and Version. Dates that have already been reprocessed with the same pipeline version (according to the log file) are skipped unless --force is given. In production processing by the DRP development team, this command is run in the xterm called Reprocessing:
python3 reprocess_obs.py --delete --ncpu 96 yyyymmdd YYYYMMDD
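The skip rule described above can be sketched in a few lines of Python. This is an illustration only: the log layout (one row per night, columns separated by `|`), the sample rows, and the helper names `parse_log` and `already_processed` are assumptions, and the format actually written by reprocess_obs.py may differ.

```python
def parse_log(text):
    """Return a set of (datecode, version) pairs found in log text.

    Assumed layout (not confirmed by the script): one row per night with
    '|'-separated columns Datecode, Start Time, End Time, Run Time, Version.
    """
    done = set()
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Skip blank lines, the header row, and malformed rows.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        done.add((fields[0], fields[4]))
    return done


def already_processed(datecode, version, done):
    """True if this night was already reprocessed with this pipeline version."""
    return (datecode, version) in done


# Made-up rows for illustration; real log contents will differ.
sample_log = """Datecode | Start Time | End Time | Run Time | Version
20240101 | 2024-02-01 10:00 | 2024-02-01 12:30 | 2.5h | 2.5.3
20240102 | 2024-02-01 12:30 | 2024-02-01 14:45 | 2.25h | 2.5.2
"""
done = parse_log(sample_log)
print(already_processed("20240101", "2.5.3", done))  # same version: skip
print(already_processed("20240102", "2.5.3", done))  # older version: reprocess
```

Keying the lookup on the (datecode, version) pair means a night is reprocessed again whenever the pipeline version changes; per the option list, --force bypasses the check altogether.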
Here’s the docstring showing all of the options:
usage: reprocess_obs.py [-h] [--ncpu NCPU] [--delete] [--verbose] [--force] [--logfile LOGFILE]
[--forward] [--not-nice] [--dry-run] [--local-tz LOCAL_TZ]
startdate enddate
Reprocess KPF data over a date range.
positional arguments:
startdate Start date in YYYYMMDD format
enddate End date in YYYYMMDD format
options:
-h, --help show this help message and exit
--ncpu NCPU Number of CPUs to use
--delete Delete existing 2D/L1/L2/QLP (but not L0/masters) files before reprocessing
--qlp-regen Regenerate Quicklook plots (and yaml files for QC) after reprocessing
--verbose Verbose stdout
--force Process even if datecode/version are listed in the logfile
--logfile LOGFILE Log file path
--forward Process datecodes in chronological order (reverse is default)
--not-nice Do not apply standard nice (=15) deprioritization
--dry-run Print commands without executing them
--local-tz LOCAL_TZ Local timezone (default: America/Los_Angeles)
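As a rough illustration of how --forward, --not-nice, and --dry-run interact, the sketch below expands a date range into datecodes (newest first unless forward is requested), prefixes each per-night command with `nice -n 15`, and prints the commands rather than running them. The function names are hypothetical, and the per-night command simply reuses the kpf invocation documented in this section; whether reprocess_obs.py issues exactly that command is an assumption.

```python
from datetime import date, timedelta


def datecodes(startdate, enddate, forward=False):
    """List YYYYMMDD datecodes in [startdate, enddate], newest first by default."""
    d0 = date(int(startdate[:4]), int(startdate[4:6]), int(startdate[6:8]))
    d1 = date(int(enddate[:4]), int(enddate[4:6]), int(enddate[6:8]))
    ndays = (d1 - d0).days + 1
    codes = [(d0 + timedelta(days=i)).strftime("%Y%m%d") for i in range(ndays)]
    return codes if forward else codes[::-1]


def build_command(datecode, ncpu, nice=True):
    """Build the per-night command as an argument list."""
    # Assumption: the per-night command matches the documented kpf invocation.
    cmd = ["kpf", "--ncpu", str(ncpu), "--watch", f"/data/L0/{datecode}/",
           "--reprocess", "-c", "configs/kpf_drp.config",
           "-r", "recipes/kpf_drp.recipe"]
    # 'nice -n 15' lowers the scheduling priority of the launched process.
    return ["nice", "-n", "15"] + cmd if nice else cmd


# Dry run: print the commands instead of executing them.
for code in datecodes("20240228", "20240301"):
    print(" ".join(build_command(code, ncpu=96)))
```

Building each command as a list keeps it ready for `subprocess.run` once the dry-run output has been checked.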
One can also reprocess KPF data using the kpf command and a recipe. The command below launches 50 processes to reprocess L0 files into 2D/L1/L2 files for the date YYYYMMDD:
kpf --ncpu 50 --watch /data/L0/YYYYMMDD/ --reprocess -c configs/kpf_drp.config -r recipes/kpf_drp.recipe
Reprocessing Masters:
Reprocessing master files over a range of observation dates from yyyymmdd to YYYYMMDD is accomplished with scripts/reprocess_masters.py, a cousin of reprocess_obs.py with similar command-line options. This script is to be executed inside a KPF Docker container. A number of master files are produced, and it can take hours to reprocess a single observation date. In production processing by the DRP development team, this command is run in the xterm called Masters Reprocessing.
Change to your KPF-Pipeline git-repo directory, and start an interactive Docker container set up for reprocessing masters with the following command:
./docker-masters-run.sh
An example command to reprocess a single night is (start and end dates are the same):
python3 scripts/reprocess_masters.py 20241009 20241009 --ncpu 1 --verbose --not-nice --force
Here’s the docstring showing all of the options:
usage: reprocess_masters.py [-h] [--steps {2d,stacks_etc,order_stuff,l12,wls,etalon} [{2d,stacks_etc,order_stuff,l12,wls,etalon} ...]] [--force] [--dry-run]
[--logfile LOGFILE] [--ncpu NCPU] [--forward] [--not-nice] [--local-tz LOCAL_TZ] [-v]
startdate enddate
Run complete masters pipeline or specified parts inside container.
positional arguments:
startdate Start observation date in YYYYMMDD format
enddate End observation date in YYYYMMDD format (same as startdate for single date)
options:
-h, --help show this help message and exit
--steps {2d,stacks_etc,order_stuff,l12,wls,etalon} [{2d,stacks_etc,order_stuff,l12,wls,etalon} ...]
Which steps to run (default: all)
--force Process even if datecode/version are listed in the logfile
--dry-run Dry run mode: print commands without executing them
--logfile LOGFILE Log file path
--ncpu NCPU Number of parallel observation-date processes (default = 1)
--forward Process dates in chronological order (reverse is default)
--not-nice Do not apply standard nice (=15) deprioritization
--local-tz LOCAL_TZ Local timezone for logfile lines (default: America/Los_Angeles)
-v, --verbose Print detailed messages during execution
At startup, the script removes old master files for the specified dates and removes the associated records from the CalFiles database table.
The name of the log file can be specified (the default is masters_reprocessing.log).
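The --steps option accepts one or more values from a fixed list. The argparse sketch below shows one way such an interface can be declared; it is derived from the usage line above for illustration and is not taken from the script's source.

```python
import argparse

# Step names copied from the documented usage line.
STEPS = ["2d", "stacks_etc", "order_stuff", "l12", "wls", "etalon"]

parser = argparse.ArgumentParser(prog="reprocess_masters.py")
parser.add_argument("startdate")
parser.add_argument("enddate")
# nargs="+" collects one or more step names; omitting --steps selects all steps.
parser.add_argument("--steps", nargs="+", choices=STEPS, default=STEPS)
parser.add_argument("--ncpu", type=int, default=1)
parser.add_argument("--force", action="store_true")

args = parser.parse_args(["20241009", "20241009", "--steps", "wls", "etalon"])
print(args.steps)
```

With an interface declared this way, --steps greedily consumes the arguments that follow it, so it is safest to place it after the two dates or to follow it with another option.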
Quicklook reprocessing – qlp_parallel.py:
To regenerate Quicklook products for a date range from yyyymmdd to YYYYMMDD using NCPU CPUs:
./scripts/qlp_parallel.py yyyymmdd YYYYMMDD --ncpu NCPU --l0 --2d --l1 --l2 --master
The full description is here:
Description:
This command line script uses the 'parallel' utility to execute the recipe
called 'recipes/quicklook_match.recipe' to generate standard Quicklook data
products. The script selects all KPF files based on their
type (L0/2D/L1/L2/master) from the standard data directory using a date
range specified by the parameters start_date and end_date. L0 files are
included if the --l0 flag is set or none of the --l0, --2d, --l1, --l2
flags are set (in which case all data types are included). The --2d,
--l1, and --l2 flags have similar functions. The script assumes that it
is being run in Docker and will return with an error message if not.
If start_date is later than end_date, the arguments will be reversed
and the files with later dates will be processed first.
Invoking the --print_files flag causes the script to print filenames
but not create QLP data products.
The --ncpu parameter determines the maximum number of cores used.
The following feature is not operational if this script is run inside of
a Docker container: If the --load parameter (a percentage, e.g. 90 = 90%)
is set to a non-zero value, this script will be throttled so that no new
files will have QLPs processed until the load is below that value. Note
that throttling works in steady state; it is possible to overload the
system with the first set of jobs if --ncpu is set way too high.
Arguments:
start_date Start date as YYYYMMDD, YYYYMMDD.SSSSS, or YYYYMMDD.SSSSS.SS
end_date End date as YYYYMMDD, YYYYMMDD.SSSSS, or YYYYMMDD.SSSSS.SS
Options:
--l0 Select all L0 files in date range
--2d Select all 2D files in date range
--l1 Select all L1 files in date range
--l2 Select all L2 files in date range
--master Select all master files in date range
--ncpu Number of cores used for parallel processing; default=10
--load Maximum load (1 min average); default=0 (only activated if !=0)
--print_files Display file names matching criteria, but don't generate Quicklook plots
--help Display this message
Usage:
python qlp_parallel.py YYYYMMDD.SSSSS YYYYMMDD.SSSSS --ncpu NCPU --load LOAD --l0 --2d --l1 --l2 --master --print_files
Examples:
./scripts/qlp_parallel.py 20230101.12345.67 20230101.17 --ncpu 50 --l0 --2d
./scripts/qlp_parallel.py 20240501 20240505 --ncpu 150 --load 90
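The rule for the --l0, --2d, --l1, and --l2 flags (no flag set means every level is selected) can be expressed compactly. The sketch below is a hypothetical helper, not code from qlp_parallel.py; it deliberately leaves out --master because the description does not spell out how master files behave when no flag is given.

```python
def selected_levels(l0=False, d2=False, l1=False, l2=False):
    """Apply the documented rule for the --l0, --2d, --l1, and --l2 flags."""
    requested = {"L0": l0, "2D": d2, "L1": l1, "L2": l2}
    # No flag set means every level is included.
    if not any(requested.values()):
        return list(requested)
    return [level for level, on in requested.items() if on]


print(selected_levels())                  # all four levels
print(selected_levels(l0=True, d2=True))  # only L0 and 2D
```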
Reprocess specific observations – slowtouch.py:
Individual observations can be reprocessed by touching the L0 files. To reprocess a set of files, use the script slowtouch.py. Files are touched slowly (usually with 0.2 sec between individual files) to avoid overloading the file-event trigger system that initiates reprocessing of specific files:
./scripts/slowtouch.py
This script is used to touch a list of KPF L0 files that have names like KP.20230623.12345.67.fits. This is useful to initiate reprocessing using the KPF DRP. The full description is here:
Script name: slowtouch.py
This script 'touches' a list of KPF L0 files with names like
KP.YYYYMMDD.12345.67.fits to trigger reprocessing in the KPF DRP.
Ways to provide filenames (any combination works):
1) As positional arguments on the command line.
2) With -f <csv>, reading the first column (quotes removed; header 'observation_id' skipped).
3) With -d <dir>, adding every file name in that directory.
Date range mode (Docker only):
If you pass exactly two positional arguments that are valid datecodes
(YYYYMMDD) and you do NOT use -f/--csv or -d/--dir, the script switches to
'date range mode'. This mode is available only when running inside a Docker
container. In date range mode it:
• Validates the two YYYYMMDD values and sorts them into start_date/end_date.
• Uses the time series database (TSDB) to query for ObsIDs in that date window,
optionally filtering by:
--only-object <name> (e.g., autocal-bias)
--only-source <name> (e.g., Star, Etalon, Dark, etc.)
• Touches each matched ObsID's L0 file under the resolved L0 base path.
If you attempt date range mode outside Docker, the script prints an error and exits.
Options (all optional):
-f, --csv <filename> CSV with L0 filenames in the first column (can be used multiple times)
-d, --dir <directory> Directory to scan for filenames (can be used multiple times)
-p, --path <path> L0 base path (default: automatic)
automatic -> /data/L0 when in Docker, /data/kpf/L0 otherwise
-s, --sleep <seconds> Sleep interval between touches (default: 0.2)
-e, --echo Echo touch commands instead of executing
--only-object <name> (Date range mode) filter TSDB rows to this OBJECT (e.g., autocal-bias)
--only-source <name> (Date range mode) filter TSDB rows to this SOURCE (e.g., Star, Etalon, Dark)
Examples:
slowtouch.py KP.20230623.12345.67.fits KP.20230623.12345.68.fits # touch two files (matched to L0 dir by ObsID)
slowtouch.py -f filenames.csv # touch files in first col of csv
slowtouch.py -d /path/to/directory # touch files in dir (matched to L0 dir by ObsID)
slowtouch.py KP.20230623.12345.67.fits -p /new/L0/path -s 0.5 # specify L0 path and sleep interval
slowtouch.py KP.20230623.12345.67.fits -e # echo touch commands
slowtouch.py 20241001 20241015 --only-object autocal-dark # touch matching object name in date range
slowtouch.py 20241001 20241015 --only-source Star # touch matching source type in date range
slowtouch.py 20241001 20241015 --only-source Etalon # touch matching source type in date range
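The core idea (read names, resolve each to its L0 path, touch with a pause in between) can be sketched as follows. This is not the code of slowtouch.py: the helper names are hypothetical, and the `<base>/<YYYYMMDD>/<ObsID>.fits` layout is an assumption inferred from the L0 paths quoted elsewhere in this document. The demonstration runs against a temporary directory so it does not touch real data.

```python
import csv
import io
import os
import re
import tempfile
import time

# ObsID pattern taken from the file names quoted in the description.
OBSID_RE = re.compile(r"KP\.(\d{8})\.\d{5}\.\d{2}")


def read_first_column(csv_text):
    """Return names from the first CSV column, skipping an 'observation_id' header."""
    names = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        name = row[0].strip().strip('"')
        if name and name != "observation_id":
            names.append(name)
    return names


def l0_path(name, base):
    """Map a name containing an ObsID to <base>/<YYYYMMDD>/<ObsID>.fits (assumed layout)."""
    m = OBSID_RE.search(name)
    if m is None:
        raise ValueError(f"no ObsID found in {name!r}")
    return os.path.join(base, m.group(1), m.group(0) + ".fits")


def slow_touch(names, base, sleep=0.2, echo=False):
    """Touch each existing L0 file, pausing between files; return the touched paths."""
    touched = []
    for name in names:
        path = l0_path(name, base)
        if echo:
            print(f"touch {path}")
        elif os.path.exists(path):
            os.utime(path, None)  # set access/modification times to now
            touched.append(path)
        time.sleep(sleep)  # spread out the file events
    return touched


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "20230623"))
    target = os.path.join(base, "20230623", "KP.20230623.12345.67.fits")
    open(target, "w").close()
    names = read_first_column('observation_id\n"KP.20230623.12345.67.fits"\n')
    result = slow_touch(names, base, sleep=0.0)
    print(len(result))
```

The sleep between touches is what keeps the file-event system from receiving a burst of events; the 0.2 s default mirrors the -s option above.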
Reprocess specific observations – kpf_slowtouch.sh (deprecated; use slowtouch.py instead):
Individual observations can be reprocessed by touching the L0 files, or by touching the 2D/L1/L2 files to start reprocessing at a later stage. To reprocess a set of files, use the script kpf_slowtouch.sh. Files are touched slowly (usually with 0.2 sec between individual files) to avoid overloading the file-event trigger system that initiates reprocessing of specific files:
./scripts/kpf_slowtouch.sh
This script is used to touch a list of KPF L0 files that have names like KP.20230623.12345.67.fits. This is useful to initiate reprocessing using the KPF DRP. The list of L0 files can be provided in multiple ways:
As command-line arguments when invoking the script.
In the first column of a CSV file specified with the -f option. This is useful for CSV files with a large set of L0 filenames downloaded from Jump. Such files might have double quotes around the L0 filename, which the script will remove when appropriate.
All filenames in a directory specified with the -d option.
The (optional) command-line options are:
-f <filename> : The script will read the KPF L0 filenames
from the first column of a CSV with the name <filename>.
Useful for lists of L0 files downloaded from Jump.
-d <directory> : Adds every file in <directory> to the list of L0 files.
-p <path> : Sets the L0 path to <path>.
Default value: /data/kpf/L0
-s <sleep_interval> : Sets the interval between file touches.
Default value: 0.2 [sec]
-e : Echo the touch commands instead of executing them.
Some example uses of this script are:
To provide filenames using command line arguments:
./kpf_slowtouch.sh KP.20230623.12345.67.fits KP.20230623.12345.68.fits
To provide filenames using a CSV file:
./kpf_slowtouch.sh -f filenames.csv
To provide files listed in a directory:
./kpf_slowtouch.sh -d /path/to/directory
To change the default L0 path and sleep interval between touches:
./kpf_slowtouch.sh KP.20230623.12345.67.fits -p /new/path -s 0.5
To echo the touch commands instead of executing them:
./kpf_slowtouch.sh KP.20230623.12345.67.fits -e
Monitoring processing progress – kpf_processing_progress.py:
Print the status of processing for a date range:
./scripts/kpf_processing_progress.py YYYYMMDD YYYYMMDD
The full description is here:
Description:
This script is used to assess the status and progress of processing KPF data.
It searches over a range of dates specified by the first two arguments which are
of the form YYYYMMDD. For each date (with /data/kpf/L0/YYYYMMDD as the
assumed L0 directory), it examines each L0 file and the associated 2D/L1/L2
files in their related directories. If the first argument is a date after the
second argument, then the dates are printed in reverse chronological order (later
dates first). The output of this script is a table with columns indicating the
date for each row, the most recent modification date for an L0 file in that
directory, the fraction of 2D files processed, the fraction of L1 files processed,
and the fraction of L2 files processed. Sample output is shown below.
> ./scripts/kpf_processing_progress.py 20231231 20230101 --current_version 2.5
DATECODE | LAST L0 MOD DATE | 2D PROCESSING | L1 PROCESSING | L2 PROCESSING
------------------------------------------------------------------------------
20231221 | 2023-12-21 10:18 | 256/256 100% | 254/256 99% | 229/230 99%
20231220 | 2023-12-20 16:00 | 342/342 100% | 342/342 100% | 315/315 100%
20231219 | 2023-12-19 16:00 | 406/406 100% | 406/406 100% | 377/379 99%
20231218 | 2023-12-18 16:00 | 531/531 100% | 528/531 99% | 501/504 99%
20231217 | 2023-12-17 16:00 | 524/524 100% | 524/524 100% | 497/497 100%
20231216 | 2023-12-16 16:00 | 527/527 100% | 524/527 99% | 497/500 99%
The following criteria are used to determine if 2D/L1/L2 files are "processed":
- not in the junk file list ('/data/kpf/reference/Junk_Observations_for_KPF.csv');
if the file is missing, all files are assumed to not be junk
- have the Green, Red, or CaHK extension present in the L0 file
- not a Dark or Bias exposure [only applied to L2 files]
- the 2D/L1/L2 exists
- the modification time of the 2D/L1/L2 file is later than the
modification time of the associated L0 file
- the DRP version number is equal to or greater than the current DRP version
number of the master branch on Github [only if --check_version option
selected]
# - not junk
# - Green, Red, or CaHK extension present
# - not a Dark or Bias exposure
# - file present
# - L2 modification time more recent than L0 modification time
# - current DRP version number (if check_version option selected)
Command-line options listed below enable touching of the L0 files associated
with 2D/L1/L2 files that are not present, printing those filenames, printing the
filenames of the 2D/L1/L2 files themselves, and turning on the DRP version check.
Options:
--help Display this message
--print_files Display missing file names (or files that fail other criteria)
--print_files_2D Display missing 2D file names (or files that fail other criteria)
--print_files_L1 Display missing L1 file names (or files that fail other criteria)
--print_files_L2 Display missing L2 file names (or files that fail other criteria)
--touch_files Touch the base L0 files of missing 2D/L1/L2 files
--check_version Checks that each 2D/L1/L2 file has the current Git version for the KPF-Pipeline
--current_version The current version used for determining completion status; e.g. --current_version 2.5
Usage:
kpf_processing_progress.py YYYYMMDD [YYYYMMDD] [--print_files] [--print_files_2D] [--print_files_L1] [--print_files_L2] [--touch_files] [--check_version]
Example:
./scripts/kpf_processing_progress.py 20231114 20231231 --print_files
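Two of the criteria listed in the description (the product file exists, and it is newer than its L0 file) are enough to reproduce the style of the progress table. The sketch below is an illustration under stated assumptions: the helper names and the `_L1`/`_L2` file names are hypothetical, the junk list, extension checks, and version check are omitted, and the percentage is truncated because that agrees with the sample output (229/230 is shown as 99%).

```python
import os
import tempfile


def is_processed(l0_file, product_file):
    """True if the product exists and was modified after its L0 file."""
    if not os.path.exists(product_file):
        return False
    return os.path.getmtime(product_file) > os.path.getmtime(l0_file)


def progress(pairs):
    """Return (n_processed, n_total, percent) for (L0, product) path pairs."""
    done = sum(is_processed(l0, prod) for l0, prod in pairs)
    total = len(pairs)
    # Truncate rather than round, matching the sample table.
    percent = int(100 * done / total) if total else 0
    return done, total, percent


# Demonstration with throwaway files and fixed modification times.
with tempfile.TemporaryDirectory() as tmp:
    l0 = os.path.join(tmp, "KP.20231221.12345.67.fits")
    l1 = os.path.join(tmp, "KP.20231221.12345.67_L1.fits")
    l2 = os.path.join(tmp, "KP.20231221.12345.67_L2.fits")  # never created
    for path, mtime in [(l0, 1000), (l1, 2000)]:
        open(path, "w").close()
        os.utime(path, (mtime, mtime))
    summary = progress([(l0, l1), (l0, l2)])
print(summary)
```

Comparing modification times is what makes touching an L0 file mark its products as stale: after the touch, the existing 2D/L1/L2 files are older than the L0 file and no longer count as processed.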