Major changes
- Metric store integration: The previously external cc-metric-store component was integrated into cc-backend. In this process the configuration for the metric store was simplified considerably. It is no longer possible to use an external time-series database. It is still possible to either send the metric data to multiple time-series backends or to forward all metric data to cc-backend. We also dropped support for the Prometheus metric database.
- Drop support for MySQL/MariaDB: We only support SQLite from now on. SQLite performs better and requires less administration.
- New Slurm adapter: We now provide an official Slurm batch job adapter with tighter Slurm integration. The REST API should still work, but was extended to also provide Slurm node and job states. The job- and node-state API is offered as a REST API or via NATS.
- Revised configuration: The structure of the configuration was unified and consolidated. It can now be distributed across multiple files. The UI configuration can be adjusted selectively. Defaults for the metric plots can be configured per cluster or subcluster.
- Switch to more flexible .env handling: In previous releases the environment variables had to be provided in an .env file, which had to exist. We switched to the godotenv package, which is more flexible about where and how the environment variables are provided; see the sketch after this list.
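As a minimal sketch of the new .env handling: an .env file still works, but it no longer has to exist, and variables set in the process environment (shell, systemd unit, container) are picked up as well. The variable name below is purely hypothetical; check the .env template shipped with cc-backend for the real names.

#!/bin/bash
# Option 1: keep a classic .env file next to the binary (now optional).
# Option 2: export the variables directly in the environment, e.g. in a
#           systemd unit or an interactive shell before starting cc-backend.
export SOME_CC_BACKEND_SECRET="value"   # hypothetical variable name
./cc-backend                            # start cc-backend as usual (flags omitted)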
New experimental features
- Automatic job taggers: Jobs can now be tagged automatically: application types are detected, pathological jobs are classified, and matching tags are attached to the jobs. The tagging behavior is specified via configurable rules.
- Alternative job-archive backends: As alternatives to the file-based job archive, SQLite and S3-compatible object-store backends are now available.
What you need to do
You need to:
- Adapt your central config.json to the new configuration scheme.
- Revise all of your cluster.json files in the job archive to reflect the current options.
- Migrate your job database to version 10 (see Database migration).
- Migrate your job archive to version 3 (see Job Archive migration).
- Transfer the checkpoints from the external cc-metric-store instance to the cc-backend ./var/checkpoints directory.
The database migration can take more than one day. To minimize the downtime you can copy the existing SQLite database and perform the migration on the copy while the production instance is still running; cc-slurm-adapter will synchronize any missing jobs afterwards. The archive migration should only take 1-2 hours, provided you run it on a fast storage medium, e.g. an NVMe disk. A sketch of the copy-and-migrate workflow follows below.
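The following is a minimal sketch of the copy-and-migrate approach described above. The paths, file names, and cc-backend flags (-config, -migrate-db) are assumptions; check ./cc-backend -h and the release documentation for the exact names.

#!/bin/bash
# Sketch: migrate a copy of the job database while production keeps running.
# Paths and flag names are assumptions; adjust to your installation.

set -euo pipefail

DB="./var/job.db"               # live SQLite database (assumed path)
DB_COPY="./var/job-migrate.db"  # working copy used for the migration

# 1. Take a consistent copy of the live database while cc-backend keeps running.
sqlite3 "$DB" ".backup '$DB_COPY'"

# 2. Point a separate cc-backend invocation at the copy (e.g. via a temporary
#    config file) and run the schema migration on it.
./cc-backend -config ./migrate-config.json -migrate-db

# 3. Stop the production instance, swap the migrated database in, restart,
#    and let cc-slurm-adapter resynchronize the jobs added in the meantime.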
Configuration changes
A GitHub repository with complete configuration examples is available. All configuration options are now checked against a JSON schema. The number of required options was significantly reduced.
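cc-backend checks the configuration at startup, but you can also pre-check a config file before deploying it with any generic JSON Schema validator. The sketch below assumes the Python tool check-jsonschema and a schema file obtained from the repository; the schema file name and location are assumptions.

#!/bin/bash
# Validate a config file against a JSON schema before deploying it.
# Requires the generic "check-jsonschema" tool (pip install check-jsonschema).
# The schema file name below is an assumption; fetch the actual schema from
# the cc-backend repository.
check-jsonschema --schemafile config.schema.json config.json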
Transfer cc-metric-store checkpoints
We now offer the option to run cc-metric-store attached to cc-backend, meaning both cc-backend and cc-metric-store share the same configuration and run on the same server. The checkpoints of the internal cc-metric-store reside in the var directory of cc-backend. If you choose to use the internal cc-metric-store as your metric store, you can do the following to bring over the old checkpoints from your external cc-metric-store:
Look for the "checkpoints" key in your CCMS (cc-metric-store) and CCB (cc-backend) config.json:
"checkpoints": {
"interval": "12h",
"directory": "./var/checkpoints",
"restore": "48h"
},
You can either move the checkpoints manually or use the following script:
#!/bin/bash
# The paths to the "directory" configured in the CCMS and CCB config.json.
# Replace the dummy paths below with your actual paths.
CCMS_CHECKPOINTS_DIR="/home/dummy/cc-metric-store/var/checkpoints"
CCB_CHECKPOINTS_DIR="/home/dummy/cc-backend/var/checkpoints"

# Check if the source directory actually exists
if [ -d "$CCMS_CHECKPOINTS_DIR" ]; then
  # Create the destination directory if it does not exist yet
  if [ ! -d "$CCB_CHECKPOINTS_DIR" ]; then
    mkdir -p "$CCB_CHECKPOINTS_DIR"
  fi
  # Move the checkpoint contents into the cc-backend checkpoints directory
  mv -f "$CCMS_CHECKPOINTS_DIR"/* "$CCB_CHECKPOINTS_DIR"/
  echo "Success: 'checkpoints' moved from $CCMS_CHECKPOINTS_DIR to $CCB_CHECKPOINTS_DIR"
else
  echo "Error: Directory '$CCMS_CHECKPOINTS_DIR' does not exist."
fi
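Saved for example as move-checkpoints.sh (the file name is only an illustration), the script can be run and checked like this:

chmod +x move-checkpoints.sh
./move-checkpoints.sh
ls /home/dummy/cc-backend/var/checkpoints   # verify the checkpoints arrived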
Known issues
- Currently, energy footprint metrics of type energy are ignored when calculating the total energy.
- For energy footprint metrics of type power the unit is ignored and the metric is assumed to be in Watt.