ClusterCockpit is a framework for job-specific performance and power monitoring on distributed HPC clusters. It is designed with a strong focus on ease of installation and maintenance, high security, and intuitive usability.

ClusterCockpit provides a modern web interface offering tailored views for different user groups.

For HPC users

A comprehensive overview of running and completed batch jobs
Access to a wide range of job-level metrics, including hardware performance counters and power data
Flexible sorting, filtering, and tagging of jobs
Support for identifying performance bottlenecks and inefficient resource usage

For support staff

Unified access to job data across multiple clusters
Advanced filtering and sorting by job, user, or system
Customizable statistical analyses with aggregated job and user data
A cluster status dashboard for quick detection of system-wide issues

For administrators

Single-file deployment of the ClusterCockpit web backend with Systemd integration
Node agents available as RPM and DEB packages
Multiple authentication options, including local accounts, LDAP, OpenID Connect, and JWT
A comprehensive REST/NATS API for integration with batch schedulers and existing monitoring infrastructures

ClusterCockpit is used in production at several HPC centers, demonstrating its maturity and suitability for real-world HPC operations (see the list of users).

How does it work?

Simple setup

In a simple setup ClusterCockpit consists of the following components:

The web user interface and API backend: cc-backend
The node-level metric collection agent (one per compute node): cc-metric-collector
The Slurm scheduler adapter: cc-slurm-adapter

Node-level metrics are collected continuously by the metric collector and sent to the backend at fixed intervals. Job metadata is provided by one Slurm adapter per Slurm Controller or by a custom adapter for other batch job schedulers and is transmitted to cc-backend via HTTP or NATS.

Job metadata is stored in an internal SQLite database. For running jobs, cc-backend queries an internal metrics store to retrieve all required time-series data. Once a job has finished, its complete dataset, including metadata and metrics, is persisted to a JSON-based job archive.

cc-backend supports multiple archive backends:

A file-based archive
A single-file SQLite-based archive
An S3-compatible object store

Finished jobs are loaded on demand from the job archive. The internal metrics store uses a memory pool, retaining time-series data only as long as used by running jobs. This design enables data retention policies and allows ClusterCockpit to operate with minimal maintenance overhead.

Alternative setup

Alternative ClusterCockpit software architecture

A more complicated setup with multiple clusters or stricter requirements with regard to security may look as follows:

The web user interface and API backend (There is always only one backend instance): cc-backend
The node-level metric collection agent (one per compute node): cc-metric-collector
The Slurm scheduler adapter (one per Slurm controller): cc-slurm-adapter
Optional: an external cc-metric-store. You can run one instance for all clusters, or any other distribution up to one instance per subcluster. You can also mix both approaches, using the internal metric store for some clusters and one or more external metric stores for others.

The rest of the architecture is the same as above.

Where to go next?

Getting Started: Set up and explore a local ClusterCockpit demo
Installation manual: Plan, configure, and deploy a production ClusterCockpit installation
User guide: Learn how to use the ClusterCockpit web interface

Documentation Structure

Tutorials: Step-by-step guides for configuring and deploying ClusterCockpit
How-to Guides: Practical solutions to common tasks and problems
Explanation: Background information, concepts, and terminology used in ClusterCockpit
Reference: Detailed technical reference documentation

2 - Release specific infos

Settings and issues specific to the current release

Major changes

Metric store integration: The previously external cc-metric-store component was integrated into cc-backend. In this process the configuration for the metric store was made much simpler. It is no longer possible to use an external time-series database. It is still possible, however, to either send the metric data to multiple time-series backends or to forward all metric data to cc-backend. We also dropped support for the Prometheus metric database.
Drop support for MySQL/MariaDB: We only support SQLite from now on. SQLite performs better and requires less administration.
New Slurm adapter: We now provide an official Slurm batch job adapter with tighter Slurm integration. The REST API should still work but was extended to also provide Slurm node and job states. The job and node-state API is offered as a REST API or via NATS.
Revised configuration: The structure of the configuration was unified and consolidated. It can now be distributed across multiple files. The UI configuration can be selectively configured. Defaults for the metric plots can be configured per cluster/subcluster.
Switch to more flexible .env handling: In previous releases the environment variables had to be provided in an .env file, which had to exist. We switched to the godotenv package, which is more flexible about where and how the environment variables are provided.

New experimental features

Automatic Job taggers: It is possible to automatically detect application types and classify pathological jobs and tag jobs accordingly. The tagger rules are specified in rules.
Alternative job-archive backends: As alternatives to the file-based job archive, there are now SQLite and S3-compatible object store backends.

What you need to do

You need to:

Adapt your central config.json to the new configuration scheme.
Revise all of your cluster.json files in the job archive to reflect the current options.
Migrate your job database to version 10 (see Database migration).
Migrate your job archive to version 3 (see Job Archive migration).
Transfer the checkpoints from the external cc-metric-store instance to the cc-backend ./var/checkpoints directory

The database migration can take more than one day. To minimize the downtime you can copy the existing SQLite database and perform the migration on the copy while the production instance is still running. cc-slurm-adapter will synchronize any missing jobs afterwards. The archive migration should only take 1-2h. This only applies if you do it on a fast storage medium, e.g. an NVMe disk.
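Copying a database file that is still being written to can yield an inconsistent copy. The following is a minimal sketch of taking the copy with SQLite's online backup API instead, which Python's standard sqlite3 module exposes as Connection.backup(). The function name and the file paths in the usage note are assumptions for illustration; this is not part of the ClusterCockpit tooling.

```python
import sqlite3


def snapshot_sqlite(src_path: str, dst_path: str) -> None:
    """Copy a (possibly live) SQLite database into a new file.

    Uses SQLite's online backup API, which produces a consistent copy
    even if the source database is in use by another process.
    """
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # With the default arguments the whole database is copied in one step.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

A call such as snapshot_sqlite("./var/job.db", "./var/job-migration.db") would then create the copy on which the migration can be run; both paths are hypothetical and need to be adapted to your installation.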

Configuration changes

Complete configuration examples are available in a GitHub repository. All configuration options are now checked against a JSON schema. The number of required options has been significantly reduced.

Transfer `cc-metric-store` checkpoints

cc-backend now offers the option to run cc-metric-store as an integrated component. In this mode cc-backend and the metric store share the same configuration and run on the same server. The checkpoints of the internal metric store reside in the var directory of cc-backend. If you choose cc-metric-store-internal as your metric store, you can bring over the old checkpoints from your external cc-metric-store as follows:

Look for the "checkpoints" key in the config.json of both cc-metric-store (CCMS) and cc-backend (CCB):

"checkpoints": {
  "interval": "12h",
  "directory": "./var/checkpoints",
  "restore": "48h"
},
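To double-check which directories are involved before moving anything, the "checkpoints" object can be looked up programmatically in both configuration files. The sketch below searches a parsed config recursively, because the key may be nested at different levels in the two files. The function name and the example fragment (including the "metricstore" parent key) are assumptions for illustration.

```python
import json
from typing import Any, Optional


def find_checkpoints(config: Any) -> Optional[dict]:
    """Return the first "checkpoints" object found anywhere in a parsed config."""
    if isinstance(config, dict):
        value = config.get("checkpoints")
        if isinstance(value, dict):
            return value
        children = list(config.values())
    elif isinstance(config, list):
        children = config
    else:
        return None
    for child in children:
        found = find_checkpoints(child)
        if found is not None:
            return found
    return None


# Hypothetical config fragment mirroring the snippet above.
example = json.loads("""
{
  "metricstore": {
    "checkpoints": {
      "interval": "12h",
      "directory": "./var/checkpoints",
      "restore": "48h"
    }
  }
}
""")

print(find_checkpoints(example)["directory"])
```

Running the same lookup on both config.json files gives the source and target directories for the move.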

You can either move the checkpoints manually or use the following script:

#!/bin/bash

# Paths to the "directory" configured in the CCMS and CCB config.json.
# Replace the dummy paths below with your own.
CCMS_CHECKPOINTS_DIR="/home/dummy/cc-metric-store/var/checkpoints"
CCB_CHECKPOINTS_DIR="/home/dummy/cc-backend/var/checkpoints"

# Check if the source directory actually exists
if [ -d "$CCMS_CHECKPOINTS_DIR" ]; then
    # Create the target directory (and missing parents) if needed
    mkdir -p "$CCB_CHECKPOINTS_DIR"

    # Move the contents rather than the directory itself, so that no
    # nested checkpoints/checkpoints directory is created
    if mv -f "$CCMS_CHECKPOINTS_DIR"/* "$CCB_CHECKPOINTS_DIR"/; then
        echo "Success: checkpoints moved from $CCMS_CHECKPOINTS_DIR to $CCB_CHECKPOINTS_DIR"
    else
        echo "Error: moving the checkpoints failed." >&2
        exit 1
    fi
else
    echo "Error: Directory '$CCMS_CHECKPOINTS_DIR' does not exist." >&2
    exit 1
fi

Known issues

Currently energy footprint metrics of type energy are ignored for calculating total energy.
With energy footprint metrics of type power the unit is ignored and it is assumed the metric has the unit Watt.

3 - Getting Started

Information on how to set up our demo and build cc-backend

The central component of ClusterCockpit is the web and API backend cc-backend. We provide a demo setup that allows you to get an impression of the web interface. If you just want to try the demo and you run Linux, you can do so using the cc-backend release binary; detailed instructions are given in the section "Demo with release binary". If you have a different OS or want to build cc-backend yourself, follow the instructions below.

Prerequisites

To build cc-backend you need:

A Go compiler, version 1.24 or newer. Most recent OS environments provide a package with a recent enough version. On macOS we recommend installing Go with Homebrew.
A Node.js environment including the npm package manager.
A git revision control client.
For the demo shell script you need wget to download the example job archive.

Try it out

All ClusterCockpit components are available within the GitHub ClusterCockpit project.

Clone cc-backend and change directory into the repository:

git clone https://github.com/ClusterCockpit/cc-backend.git && cd cc-backend

Note

The startDemo script will download a tar file with 38MB (223MB on disk)!

Execute the demo start script:

./startDemo.sh

What follows is output from building cc-backend and downloading the job-archive

HTTP server listening at 127.0.0.1:8080...

Open a web browser and access http://localhost:8080. You should see the ClusterCockpit login page:

Enter demo for the Username and demo for the Password and press the Submit button. After that the ClusterCockpit index page should be displayed:

The demo user has the admin role and therefore can see all views.

Note

Because the demo only loads data from the job archive, some views, such as the status and systems views, do not work!

For details about the features of the web interface have a look at the user guide.

Installation

We provide an installation manual that guides you through planning and configuring a production ClusterCockpit deployment. If you are a computing center and run into problems, do not hesitate to ask for help in our communication channels.

3.1 - Demo with release binary

The demo setup with the release binary only works on a Linux system running on an x86-64 processor.

Grab the release binary at GitHub. The following description assumes you perform all tasks from your home folder. Extract the tar archive:

tar xzf cc-backend_Linux_x86_64.tar.gz

Create an empty folder and copy the binary cc-backend from the extracted archive folder to this folder:

mkdir ./demo

cp cc-backend ./demo

Change to the demo folder and run the following command to set up the required var directory and to initialize the SQLite database, config.json, and .env files:

./cc-backend -init

Open config.json in an editor of your choice to edit the existing cluster's name and add a second cluster. Name the clusters fritz and alex. Afterwards the file should look as follows:

{
    "addr": "127.0.0.1:8080",
    "archive": {
        "kind": "file",
        "path": "./var/job-archive"
    },
    "jwts": {
        "session-max-age": "24h"
    },
    "clusters": [
        {
            "name": "fritz",
            "metricDataRepository": {
                "kind": "cc-metric-store",
                "url": "http://localhost:8082",
                "token": ""
            },
            "filterRanges": {
                "numNodes": {
                    "from": 1,
                    "to": 64
                },
                "duration": {
                    "from": 0,
                    "to": 86400
                },
                "startTime": {
                    "from": "2023-01-01T00:00:00Z",
                    "to": null
                }
            }
        },
        {
            "name": "alex",
            "metricDataRepository": {
                "kind": "cc-metric-store",
                "url": "http://localhost:8082",
                "token": ""
            },
            "filterRanges": {
                "numNodes": {
                    "from": 1,
                    "to": 64
                },
                "duration": {
                    "from": 0,
                    "to": 86400
                },
                "startTime": {
                    "from": "2023-01-01T00:00:00Z",
                    "to": null
                }
            }
        }
    ]
}
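Hand-edited JSON is easy to break, for example with a stray trailing comma. As a quick syntax check before starting cc-backend, the file can be parsed with a strict JSON parser. The sketch below does this with Python's standard json module and lists the configured cluster names; the function name is an assumption, and only the "clusters" and "name" keys from the listing above are used.

```python
import json
from typing import List


def cluster_names(config_text: str) -> List[str]:
    """Parse config.json text and return the configured cluster names.

    json.loads raises json.JSONDecodeError on syntax errors such as
    trailing commas, so a successful call doubles as a syntax check.
    """
    config = json.loads(config_text)
    return [cluster["name"] for cluster in config.get("clusters", [])]


# Shortened, hypothetical stand-in for the full config.json shown above.
sample = '{"addr": "127.0.0.1:8080", "clusters": [{"name": "fritz"}, {"name": "alex"}]}'
print(cluster_names(sample))
```

To check the real file, pass its contents instead, e.g. cluster_names(open("config.json").read()).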

Download the demo job archive:

wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/job-archive-demo.tar

Extract the job archive:

tar xf job-archive-demo.tar

Initialize the database using the data from the job archive and create the demo user:

./cc-backend -init-db -add-user demo:admin:demo -loglevel info

Start the web server:

./cc-backend -server -dev -loglevel info

Open a web browser and access http://localhost:8080. You should see the ClusterCockpit login page:

Enter demo for the Username and demo for the Password and press the Submit button. After that the ClusterCockpit index page should be displayed:

The demo user has the admin role and therefore can see all views.

Note

Because the demo only loads data from the job archive, some views, such as the status and systems views, do not work!

For details about the features of the web interface have a look at the user guide.

4 - Tutorials

Detailed step-by-step lessons on how to configure and deploy ClusterCockpit

4.1 - Plan overall ClusterCockpit architecture

How to decide on communication and data flows

Introduction

When deploying ClusterCockpit in production, two key architectural decisions need to be made:

Transport mechanism: How metrics flow from collectors to the metric store (REST API vs NATS)
Metric store deployment: Where the metric store runs (internal to cc-backend vs external standalone)

This guide helps you understand the trade-offs to make informed decisions based on your cluster size, administrative capabilities, and requirements.

Transport: REST API vs NATS

The cc-metric-collector can send metrics to cc-metric-store using either direct HTTP REST API calls or via NATS messaging.

REST API Transport

With REST transport, each collector node sends HTTP POST requests directly to the metric store endpoint.

┌─────────────┐     HTTP POST      ┌──────────────────┐
│  Collector  │ ─────────────────► │  cc-metric-store │
│   (Node 1)  │                    │                  │
└─────────────┘                    │                  │
┌─────────────┐     HTTP POST      │                  │
│  Collector  │ ─────────────────► │                  │
│   (Node 2)  │                    └──────────────────┘
└─────────────┘
      ...

Advantages:

Simple setup with no additional infrastructure
Direct point-to-point communication
Easy to debug and monitor
Works well for smaller clusters (< 500 nodes)

Disadvantages:

Each collector needs direct network access to metric store
No buffering: if metric store is unavailable, metrics are lost
Scales linearly with node count (many concurrent connections)
Higher load on metric store during burst scenarios

NATS Transport

With NATS, collectors publish metrics to a NATS server, and the metric store subscribes to receive them.

┌─────────────┐                    ┌─────────────┐
│  Collector  │ ──► publish ──►    │             │
│   (Node 1)  │                    │             │
└─────────────┘                    │    NATS     │     subscribe     ┌──────────────────┐
┌─────────────┐                    │   Server    │ ◄───────────────► │  cc-metric-store │
│  Collector  │ ──► publish ──►    │             │                   └──────────────────┘
│   (Node 2)  │                    │             │
└─────────────┘                    └─────────────┘
      ...

Advantages:

Decoupled architecture: collectors and metric store are independent
Built-in buffering and message persistence (with JetStream)
Better scalability for large clusters (1000+ nodes)
Supports multiple subscribers (e.g., external metric store for redundancy)
Collectors continue working even if metric store is temporarily down
Lower connection overhead (single connection per collector to NATS)
Integrated key management via NKeys (Ed25519-based authentication):
- No need to generate and distribute JWT tokens to each collector
- Centralized credential management in NATS server configuration
- Support for accounts with fine-grained publish/subscribe permissions
- Credential revocation without redeploying collectors
- Simpler key rotation compared to JWT token redistribution

Disadvantages:

Additional infrastructure component to deploy and maintain
More complex initial setup and configuration
Additional point of failure (NATS server)
Requires NATS expertise for troubleshooting

Recommendation

Cluster Size	Recommended Transport
< 200 nodes	REST API
200-500 nodes	Either (based on preference)
> 500 nodes	NATS

For large clusters or environments requiring high availability, NATS provides better resilience and scalability. For smaller deployments or when minimizing complexity is important, REST API is sufficient.
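The recommendation table can be written down as a small helper, for instance for use in a deployment planning script. This is only a restatement of the thresholds given above; the function name is an assumption, and the thresholds are rules of thumb rather than hard limits.

```python
def recommend_transport(num_nodes: int) -> str:
    """Map a cluster size to the transport recommended in the table above."""
    if num_nodes < 0:
        raise ValueError("num_nodes must not be negative")
    if num_nodes < 200:
        return "REST API"
    if num_nodes <= 500:
        return "Either"
    return "NATS"


for size in (64, 350, 2000):
    print(size, recommend_transport(size))
```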

Metric Store: Internal vs External

The cc-metric-store storage engine can run either integrated within cc-backend (internal) or as a separate standalone service (external).

Internal Metric Store

The metric store runs as part of the cc-backend process, sharing the same configuration and lifecycle.

┌────────────────────────────────────────┐
│              cc-backend                │
│  ┌──────────────┐  ┌────────────────┐  │
│  │   Web UI &   │  │  metric-store  │  │
│  │   GraphQL    │  │    (internal)  │  │
│  └──────────────┘  └────────────────┘  │
└────────────────────────────────────────┘
         │                    ▲
         ▼                    │
    ┌─────────┐          ┌─────────┐
    │ Browser │          │Collector│
    └─────────┘          └─────────┘

Advantages:

Single process to deploy and manage
Unified configuration
Simplified networking (metrics received on same endpoint)
Lower resource overhead
Easier initial setup

Disadvantages:

Metric store restart requires cc-backend restart
Resource contention between web serving and metric ingestion
No horizontal scaling of metric ingestion
Single point of failure for entire system

External Metric Store

The metric store runs as a separate process, communicating with cc-backend via its REST API.

┌──────────────────┐         ┌──────────────────┐
│    cc-backend    │ ◄─────► │  cc-metric-store │
│   (Web UI/API)   │  query  │    (external)    │
└──────────────────┘         └──────────────────┘
         │                            ▲
         ▼                            │
    ┌─────────┐                  ┌─────────┐
    │ Browser │                  │Collector│
    └─────────┘                  └─────────┘

Advantages:

Independent scaling and resource allocation
Can restart metric store without affecting web interface
Enables redundancy with multiple metric store instances
Better isolation for security and resource management
Can run on dedicated hardware optimized for in-memory workloads

Disadvantages:

Two processes to deploy and manage
Separate configuration files
Additional network communication between components
More complex setup and monitoring

Recommendation

Scenario	Recommended Deployment
Development/Testing	Internal
Small production (< 200 nodes)	Internal
Medium production (200-1000 nodes)	Either
Large production (> 1000 nodes)	External
Resource-constrained head node	External (on dedicated host)

Security Considerations

Network Exposure

Component	REST Transport	NATS Transport
Metric Store	Exposed to all collector nodes	Only exposed to NATS server
NATS Server	N/A	Exposed to all collectors and metric stores
cc-backend	Exposed to users	Exposed to users

With NATS, the metric store can be isolated from the compute network, reducing the attack surface. The NATS server becomes the single point of ingress for metrics. Another option to isolate the web backend from the compute network is to set up cc-metric-collector proxies.

Authentication

REST API: Uses JWT tokens (Ed25519 signed). Each collector needs a valid token configured and distributed to it.
NATS: Supports multiple authentication methods:
- Username/password (simple, suitable for smaller deployments)
- NKeys (Ed25519 key pairs managed centrally in NATS server)
- Credential files (.creds) for decentralized authentication
- Accounts for multi-tenancy with isolated namespaces

NKeys Advantage: With NATS NKeys, authentication keys are managed in the NATS server configuration rather than distributed to each collector. This simplifies credential management significantly:

Add/remove collectors by editing NATS server config
Revoke access instantly without touching collector nodes
No JWT token expiration to manage
Keys can be scoped to specific subjects (publish-only for collectors)

For both transports, ensure:

Keys are properly generated and securely stored
TLS is enabled for production deployments
Network segmentation isolates monitoring traffic

Privilege Separation

Both cc-backend and the external cc-metric-store support dropping privileges after binding to privileged ports (via user and group configuration). This limits the impact of potential vulnerabilities.

Performance Considerations

Memory Usage

The metric store keeps data in memory based on retention-in-memory. Memory usage scales with:

Number of nodes
Number of metrics per node
Number of hardware scopes (cores, sockets, accelerators)
Retention duration
Metric frequency

For a 1000-node cluster with 20 metrics at 60-second intervals and 48-hour retention, expect approximately 10-20 GB of memory usage. For larger setups and many core level metrics this can increase up to 100GB, which must fit into main memory.
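How these factors multiply can be illustrated with a rough back-of-the-envelope estimate. The sketch below counts only the raw sample payload and rests on two assumptions that are not taken from the cc-metric-store implementation: 8 bytes per sample, and an average number of time series per metric and node (1 for node-level metrics only, higher when core, socket, or accelerator scopes are collected). Real memory usage additionally includes data structure overhead.

```python
def estimate_sample_bytes(
    nodes: int,
    metrics_per_node: int,
    series_per_metric: int,
    interval_seconds: int,
    retention_hours: int,
    bytes_per_sample: int = 8,  # assumption, not taken from cc-metric-store
) -> int:
    """Rough estimate of raw sample storage held in memory, in bytes."""
    samples_per_series = retention_hours * 3600 // interval_seconds
    series = nodes * metrics_per_node * series_per_metric
    return series * samples_per_series * bytes_per_sample


GB = 10**9
# Node-level series only (series_per_metric=1)
node_only = estimate_sample_bytes(1000, 20, 1, 60, 48)
# Hypothetical average of 32 series per metric due to finer hardware scopes
with_scopes = estimate_sample_bytes(1000, 20, 32, 60, 48)
print(f"node level only: {node_only / GB:.2f} GB")
print(f"with finer scopes: {with_scopes / GB:.2f} GB")
```

Under these assumptions the number of hardware scopes dominates the result, which is why core-level metrics increase the memory demand so strongly.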

CPU Usage

Internal: Competes with cc-backend web serving
External: Dedicated resources for metric processing

For clusters with high query load (many users viewing job details), external deployment prevents metric ingestion from impacting user experience.

Disk I/O

Checkpoints are written periodically. For large deployments:

Use fast storage (SSD) for checkpoint directory
Consider separate disks for checkpoints and archives
Monitor disk space for archive growth

Example Configurations

Small Cluster (Internal + REST)

Single cc-backend with internal metric store, collectors using REST:

// cc-backend config
{
  "metricstore": {
    "enabled": true,
    "checkpoints": {
      "interval": "12h",
      "directory": "./var/checkpoints"
    },
    "retention-in-memory": "48h"
  }
}

Large Cluster (External + NATS)

Separate cc-metric-store with NATS transport:

// cc-metric-store config
{
  "main": {
    "addr": "0.0.0.0:8080",
    "jwt-public-key": "..."
  },
  "nats": {
    "address": "nats://nats-server:4222",
    "username": "ccms",
    "password": "..."
  },
  "metric-store": {
    "retention-in-memory": "48h",
    "memory-cap": 80,
    "checkpoints": {
      "interval": "12h",
      "directory": "/data/checkpoints"
    },
    "cleanup": {
      "mode": "archive",
      "interval": "48h",
      "directory": "/data/archive"
    },
    "nats-subscriptions": [
      {
        "subscribe-to": "hpc-metrics",
        "cluster-tag": "mycluster"
      }
    ]
  }
}

Decision Checklist

Use this checklist to guide your architecture decision:

Cluster size: How many nodes need monitoring?
Availability requirements: Is downtime acceptable?
Administrative capacity: Can you manage additional services?
Network topology: Can collectors reach the metric store directly?
Resource constraints: Is the head node resource-limited?
Security requirements: Do you need network isolation?
Growth plans: Will the cluster expand significantly?

For most new deployments, starting with internal metric store and REST transport is recommended. You can migrate to external deployment and/or NATS later as needs evolve.

4.2 - ClusterCockpit installation manual

How to plan and configure a ClusterCockpit installation

Introduction

ClusterCockpit requires the following components:

A node agent running on all compute nodes that measures the required metrics and forwards all data to a time-series metrics database. ClusterCockpit provides its own node agent cc-metric-collector. This is the recommended setup, but ClusterCockpit can also be integrated with other node agents, e.g. collectd, Prometheus, or Telegraf. In this case you have to use the agent with its accompanying time-series database and ensure that the metric data is sent or forwarded to cc-backend.
The API and web interface backend cc-backend. Only one instance of cc-backend is required. It provides the HTTP server at the desired monitoring URL for serving the web interface. It also integrates an in-memory metric store.
A SQL database. The only supported option is the built-in SQLite database. You can set up LiteStream as a service that continuously replicates the SQLite database to multiple storage backends.
(Optional) Metric store: One or more cc-metric-store instances. Advantages for using an external cc-metric-store are:
- Independent scaling and resource allocation
- Can restart metric store without affecting web interface and the other way around
- Enables redundancy with multiple metric store instances
- Better isolation for security and resource management
- Can run on dedicated hardware optimized for in-memory workloads
(Optional) NATS message broker: Apart from REST APIs ClusterCockpit also supports NATS as a way to connect components. Using NATS brings a number of advantages:
- More flexible deployment and testing. Instances can have different URLs or IP addresses. Test instances are easy to deploy in parallel without a need to touch the configuration.
- NATS comes with sophisticated built-in key management, which also makes it possible to restrict authorization to specific subjects.
- NATS may provide a larger message throughput compared to REST over HTTP.
- Upcoming ClusterCockpit components such as the Energy Manager require NATS.
A batch job scheduler adapter that provides the job meta information to cc-backend. This is done by using the provided REST or NATS API for starting and stopping jobs. Currently available adapters:
- Slurm: Golang based solution (cc-slurm-adapter) maintained by NHR@FAU. This is the recommended option in case you use Slurm. All options in cc-backend are supported.
- Slurm: Python based solution (cc-slurm-sync) maintained by PC2 Paderborn
- HTCondor: cc-condor-sync maintained by Saarland University

Server Hardware

cc-backend is multi-threaded and therefore benefits from multiple cores. Enough memory is required to hold the metric data cache; for most setups 128GB should be enough. You can set an upper limit for the memory used by the internal in-memory metric cache. It is possible to run cc-backend in a virtual machine. For best performance the ./var folder of cc-backend, which contains the SQLite database file and the file-based job archive, should be located on a fast storage device, ideally an NVMe SSD. The SQLite database file and the job archive will grow over time (unless you remove old jobs using a retention policy). Our setup, covering multiple clusters over 5 years, takes 75GB for the SQLite database and around 1.4TB for the job archive. If you have very high job counts, we recommend using a retention policy to keep the database and the job archive at a manageable size. If you archive old jobs, the database can easily be restored using cc-backend. We run cc-backend as a systemd service.

Planning and initial configuration

We recommend the following order for planning and configuring a ClusterCockpit installation:

Decide on overall setup: Initially you have to decide on some fundamental design options about how the components communicate with each other and how the data flows from the compute nodes to the backend.
Set up your metric list: With two exceptions, you are free to choose which metrics to collect. The exceptions are mem_bw for main memory bandwidth and flops_any for flop throughput (double-precision flops are scaled up to single-precision rates). The metric list is an integral part of the configuration of all ClusterCockpit components.
Planning of deployment
Configure and deploy cc-metric-collector
Configure and deploy cc-backend
Configure and deploy cc-slurm-adapter or another job scheduler adapter of your choice

You can find complete example production configurations in the cc-examples repository.

Common problems

The following issues commonly come up when installing ClusterCockpit for the first time.

Inconsistent metric names across components

At the moment you need to configure the metric list in every component separately. In cc-metric-collector the metrics that are sent to cc-backend are determined by the collector configuration and possible renaming in the router configuration. In cc-backend you need to create a cluster.json configuration in the job archive for every cluster. There you set up which metrics are shown in the web frontend, including many additional properties for the metrics. For running jobs cc-backend queries the internal metric store for exactly those metric names, and if there is no match there will be an error.
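A simple set comparison can reveal such mismatches before they show up as errors in the web interface. The following sketch assumes that the parsed cluster.json lists its metrics under a "metricConfig" key with a "name" field per entry, and that you have collected the metric names emitted by cc-metric-collector by some other means. Both the function name and the example data are hypothetical.

```python
from typing import Iterable, List


def metrics_missing_from_collector(cluster_config: dict, collected: Iterable[str]) -> List[str]:
    """Return metric names configured in cluster.json that the collector does not send."""
    # Assumption: metrics are listed under "metricConfig", each with a "name".
    configured = {metric["name"] for metric in cluster_config.get("metricConfig", [])}
    return sorted(configured - set(collected))


# Hypothetical example data
cluster_json = {"metricConfig": [{"name": "mem_bw"}, {"name": "flops_any"}, {"name": "cpu_load"}]}
collector_metrics = ["mem_bw", "cpu_load", "ipc"]
print(metrics_missing_from_collector(cluster_json, collector_metrics))
```

Any name reported here would be queried by cc-backend for running jobs without a matching series in the metric store.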

We provide a JSON schema based specification as part of the job meta and metric schema. This specification recommends a minimal set of metrics, and we suggest using the metric names provided there. While it is up to you whether you adhere to the metric names suggested in the schema, there are two exceptions: mem_bw (main memory bandwidth) and flops_any (total flop rate with DP flops scaled to SP flops) are required for the roofline plots to work.
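To make the scaling of DP flops to SP flops concrete, here is a minimal sketch. It assumes the common convention that one double-precision flop is counted as two single-precision flops; the weight of 2 is an assumption and is not stated in this document, so check it against your collector configuration.

```python
def flops_any(flops_sp: float, flops_dp: float, dp_weight: float = 2.0) -> float:
    """Combine SP and DP flop rates into a single SP-equivalent rate.

    dp_weight is an assumed scaling factor for DP flops (default 2.0).
    """
    return flops_sp + dp_weight * flops_dp


# Hypothetical rates in GFlop/s
print(flops_any(flops_sp=10.0, flops_dp=5.0))
```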

Inconsistent device naming between `cc-metric-collector` and batch job scheduler adapter

The batch job scheduler adapter (e.g. cc-slurm-adapter) provides a list of resources that are used by the job. cc-backend queries the internal metric store with exactly those resource IDs to get all metrics for a job. As a consequence, if cc-metric-collector uses a different naming scheme, the metrics will not be found.

If you have GPU accelerators cc-slurm-adapter should use the PCI-E device addresses as ids. The option gpuPciAddrs for the nvidia and rocm-smi collectors in the collector configuration must be configured. To validate and debug problems you can use the cc-backend debug endpoint:

curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/debug"

This will return the current state of cc-metric-store. You can search for a hostname and inspect all topology leaf nodes that are available for it.

Missing nodes in subcluster node lists

ClusterCockpit supports multiple subclusters as part of a cluster. A subcluster in this context is a homogeneous hardware partition with a dedicated metric and device configuration. cc-backend dynamically matches the nodes a job runs on to a subcluster node list to figure out on which subcluster a job is running. If nodes are missing in a subcluster node list this fails and the metric list used may be wrong.
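The effect of an incomplete node list can be illustrated with a simplified model of the matching. The sketch below is not cc-backend's actual implementation: it assumes node lists are already expanded into plain sets of hostnames and requires all nodes of a job to belong to the same subcluster. All names and hostnames are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set


def match_subcluster(job_nodes: Iterable[str], subclusters: Dict[str, Set[str]]) -> Optional[str]:
    """Return the subcluster whose node list contains all job nodes, else None."""
    nodes = set(job_nodes)
    if not nodes:
        return None
    for name, node_list in subclusters.items():
        if nodes <= node_list:
            return name
    return None


# Hypothetical subcluster node lists; node f0103 was forgotten in "main".
subclusters = {"main": {"f0101", "f0102"}, "spr": {"f2001", "f2002"}}
print(match_subcluster(["f0101", "f0102"], subclusters))
print(match_subcluster(["f0102", "f0103"], subclusters))
```

In this model the second job cannot be assigned to any subcluster because one of its nodes is missing from the list.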

4.3 - Decide on metric list

Planning and naming the metrics

Introduction

Deciding on a sensible and meaningful set of metrics is a key factor in how useful the monitoring will be. As part of a collaborative project, several academic HPC centers agreed on a minimal set of metrics, including their names. Consistent naming is crucial for establishing what metrics mean, and we urge you to adhere to the metric names suggested there. You can find this list as part of the ClusterCockpit job data structure JSON schemas.

ClusterCockpit supports multiple clusters within one instance of cc-backend. You have to create separate metric lists for each of them. In cc-backend the metric lists are provided as part of the cluster configuration. Every cluster is configured as part of the job archive using one cluster.json file per cluster. This how-to describes in-detail how to create a cluster.json file.

Required Metrics

Flop throughput rate: `flops_any`

Memory bandwidth: `mem_bw`

Memory capacity used: `mem_used`

Requested CPU core utilization: `cpu_load`

Total fast network bandwidth: `net_bw`

Total file IO bandwidth: `file_bw`
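A quick way to check a cluster configuration against the required metric names listed above is sketched below. It assumes that the metric list is stored under a key named metricConfig as a list of objects with a name attribute; verify this against the JSON schema before relying on it. The example document is made up.

```python
import json

REQUIRED = ["flops_any", "mem_bw", "mem_used", "cpu_load", "net_bw", "file_bw"]


def missing_required(cluster_json_text, key="metricConfig"):
    """List required metric names absent from a cluster configuration."""
    config = json.loads(cluster_json_text)
    names = {m["name"] for m in config.get(key, [])}
    return [m for m in REQUIRED if m not in names]


# Hypothetical, heavily shortened cluster configuration.
example = '{"name": "demo", "metricConfig": [{"name": "flops_any"}, {"name": "mem_bw"}]}'
print(missing_required(example))
```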

Recommended CPU Metrics

Instruction throughput per cycle: `ipc`

User active CPU core utilization: `cpu_user`

Double precision flop throughput rate: `flops_dp`

Single precision flop throughput rate: `flops_sp`

Average core frequency: `clock`

CPU power consumption: `rapl_power`

Recommended GPU Metrics

GPU utilization: `acc_used`

GPU memory capacity used: `acc_mem_used`

GPU power consumption: `acc_power`

Recommended node level metrics

Ethernet read bandwidth: `eth_read_bw`

Ethernet write bandwidth: `eth_write_bw`

Fast network read bandwidth: `ic_read_bw`

Fast network write bandwidth: `ic_write_bw`

File system metrics

Warning

A file system metric tree is not yet supported in cc-backend.

In the schema a tree of file system metrics is suggested. This makes it possible to provide a similar set of metrics for the different file systems used in a cluster. The file system type names suggested are:

nfs
lustre
gpfs
nvme
ssd
hdd
beegfs

File system read bandwidth: `read_bw`

File system write bandwidth: `write_bw`

File system read requests: `read_req`

File system write requests: `write_req`

File system inodes used: `inodes`

File system open and close: `accesses`

File system file syncs: `fsync`

File system file creates: `create`

File system file open: `open`

File system file close: `close`

File system file seeks: `seek`

4.4 - Deployment

Plan and implement deployment workflow

Deployment

Why we do not provide a docker container

The ClusterCockpit web backend binary has no external dependencies; everything is included in the binary. The external assets, SQL database and job archive, would also be external in a Docker setup. The only advantage of a Docker setup would be that the initial configuration is automated, but this only needs to be done once. We therefore think that setting up Docker, securing it, and maintaining it is not worth the effort.

It is recommended to install all ClusterCockpit components in a common directory, e.g. /opt/monitoring, /var/monitoring or /var/clustercockpit. In the following we use /opt/monitoring.

Two Systemd services run on the central monitoring server:

clustercockpit: binary cc-backend in /opt/monitoring/cc-backend.
cc-metric-store: binary cc-metric-store in /opt/monitoring/cc-metric-store.

ClusterCockpit is deployed as a single binary that embeds all static assets. We recommend keeping all cc-backend binary versions in a folder archive and linking the currently active one from the cc-backend root. This allows for easy roll-back in case something doesn’t work.

Please Note

cc-backend is started with root rights to open the privileged ports (80 and 443). It is recommended to set the configuration options user and group, in which case cc-backend will drop root permissions once the ports are taken. Make sure that the ownership of the ./var folder and its contents is set accordingly.

Workflow to deploy new version

This example assumes the DB and job archive versions did not change.

Stop systemd service:

sudo systemctl stop clustercockpit.service

Back up the SQLite DB file! This is as simple as copying it.
Copy the new cc-backend binary to /opt/monitoring/cc-backend/archive (Tip: Use a date tag like YYYYMMDD-cc-backend). Here is an example:

cp ~/cc-backend /opt/monitoring/cc-backend/archive/20231124-cc-backend

Link from the cc-backend root to the current version (-f replaces a link left over from the previous deployment, -n prevents following it):

ln -sfn /opt/monitoring/cc-backend/archive/20231124-cc-backend /opt/monitoring/cc-backend/cc-backend

Start systemd service:

sudo systemctl start clustercockpit.service

Check if everything is ok:

sudo systemctl status clustercockpit.service

Check log for issues:

sudo journalctl -u clustercockpit.service

Check the ClusterCockpit web frontend and your Slurm adapters to see whether anything is broken!
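The copy and link steps of the workflow above can be scripted. The following is a sketch only: it covers neither stopping and starting the Systemd service nor the database backup, and the function name and directory layout passed in are assumptions. It creates the new link under a temporary name and renames it over the old one, so the link never points to a missing file.

```python
import os
import shutil
import tempfile


def activate(new_binary, root, tag):
    """Copy new_binary to <root>/archive/<tag>-cc-backend and repoint
    the <root>/cc-backend symlink to it."""
    archive = os.path.join(root, "archive")
    os.makedirs(archive, exist_ok=True)
    target = os.path.join(archive, f"{tag}-cc-backend")
    shutil.copy2(new_binary, target)
    link = os.path.join(root, "cc-backend")
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, link)  # rename over the old link in one step
    return target


# Demonstration in a temporary directory instead of /opt/monitoring.
demo = tempfile.mkdtemp()
new_binary = os.path.join(demo, "cc-backend-new")
with open(new_binary, "w") as f:
    f.write("placeholder")
root = os.path.join(demo, "cc-backend")
activate(new_binary, root, "20231124")
latest = activate(new_binary, root, "20231201")
print(os.readlink(os.path.join(root, "cc-backend")))
```

Older versions stay in the archive folder, so a roll-back only needs the link to be pointed at the previous file.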

4.5 - Setup of cc-metric-store

How to configure and deploy cc-metric-store

Note

The standalone cc-metric-store shares its core storage engine with cc-backend. It is intended for distributed setups and redundancy.

Introduction

The cc-metric-store provides an in-memory metric time-series database. It is configured via a JSON configuration file (config.json). Metrics are received via messages using the ClusterCockpit ccMessage protocol. It can receive messages via an HTTP REST API or by subscribing to a NATS subject. Requesting data is possible via an HTTP REST API.

Configuration

For a complete list of configuration options see here. The configuration is organized into four main sections: main, metrics, nats, and metric-store.

Minimal example of a configuration file:

{
  "main": {
    "addr": "localhost:8080",
    "jwt-public-key": "kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0="
  },
  "metrics": {
    "clock": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "mem_bw": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "flops_any": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "flops_dp": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "flops_sp": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "mem_used": {
      "frequency": 60,
      "aggregation": null
    }
  },
  "metric-store": {
    "checkpoints": {
      "interval": "12h",
      "directory": "./var/checkpoints"
    },
    "memory-cap": 100,
    "retention-in-memory": "48h",
    "cleanup": {
      "mode": "archive",
      "interval": "48h",
      "directory": "./var/archive"
    }
  }
}

Main Section

The main section specifies the address and port on which the server should listen (addr). Optionally, for HTTPS, paths to TLS cert and key files can be specified via https-cert-file and https-key-file. If using a privileged port (e.g., 443), you can specify user and group to drop root permissions after binding. The backend-url option allows connecting to cc-backend for querying job information. The REST API uses JWT token based authentication. The option jwt-public-key provides the Ed25519 public key to verify signed JWT tokens.

Metrics Section

The cc-metric-store will only accept metrics that are specified in its metric list. The metric names must match exactly! The frequency of a metric specifies how incoming values are binned. If multiple values are received in the same interval, older values are overwritten; if no value is received in an interval, there is a gap. cc-metric-store can aggregate metrics across topological entities, e.g., to compute an aggregate node scope value from core scope metrics. The aggregation attribute specifies how the aggregate value is computed. Resource metrics usually require sum, whereas diagnostic metrics (e.g., clock) require avg, since summing clock frequencies would be meaningless. Metrics that are only available at node scope should set aggregation to null.
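The binning and aggregation rules can be illustrated with a simplified model. This is a sketch for building intuition, not the cc-metric-store implementation; the function names and the sample values are made up.

```python
def bin_values(samples, frequency, start, end):
    """Place (timestamp, value) samples into fixed-width bins.

    A later sample in the same bin overwrites an earlier one;
    bins without any sample stay None (a gap)."""
    bins = [None] * ((end - start) // frequency)
    for ts, value in samples:
        idx = (ts - start) // frequency
        if 0 <= idx < len(bins):
            bins[idx] = value
    return bins


def aggregate(values, mode):
    """Combine finer-scope values (e.g. per core) into one node value."""
    if mode == "sum":
        return sum(values)
    if mode == "avg":
        return sum(values) / len(values)
    raise ValueError("no aggregation configured for this metric")


# Two samples fall into the first 60 s bin, none into the second.
print(bin_values([(0, 1.0), (30, 2.0), (130, 3.0)], 60, 0, 180))
print(aggregate([2400, 2600], "avg"), aggregate([1.5, 2.5], "sum"))
```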

Metric-Store Section

The most important configuration option is the retention-in-memory setting. It specifies for which duration back in time metrics should be provided. This should be long enough to cover common job durations plus a safety margin. The memory-cap option sets the maximum memory percentage to use. The memory footprint scales with the number of nodes, the number of native metric scopes (cores, sockets), the number of metrics, and the memory retention time divided by the frequency.
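The scaling rule above can be turned into a rough estimate. The sketch below only counts stored values; the cluster size and the assumption of 8 bytes per value are illustrative, and any per-buffer overhead is ignored. It also assumes, pessimistically, that every metric is stored at the finest scope.

```python
def estimate_values_in_memory(nodes, scopes_per_node, metrics,
                              retention_s, frequency_s):
    """Number of values held: one per node, native scope, metric and bin."""
    return nodes * scopes_per_node * metrics * (retention_s // frequency_s)


values = estimate_values_in_memory(
    nodes=100, scopes_per_node=64, metrics=10,
    retention_s=48 * 3600, frequency_s=60,
)
bytes_per_value = 8  # assumption: one 64-bit float per value
print(values, "values,", values * bytes_per_value / 1e9, "GB")
```

Doubling the retention time or halving the frequency interval doubles the estimate, which is why both settings should be chosen together.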

The cc-metric-store supports checkpoints and cleanup/archiving. Checkpoints are always performed on shutdown. To not lose data on a crash or other failure, checkpoints are written regularly in fixed intervals. Checkpoints that are not needed anymore can either be archived (moved and compressed) or deleted, controlled by the cleanup.mode setting (archive or delete). The cleanup happens at the interval specified in cleanup.interval. You may want to set up a cron job to delete older archive files.

Authentication

Authentication uses signed (but unencrypted) JWT tokens. Only Ed25519/EdDSA cryptographic key pairs are supported. A client has to sign the token with its private key. The server checks whether the configured public key matches the private key with which the token was signed, whether the token was altered after signing, and whether the token has expired. All other token attributes are ignored.

We provide an article on how to generate JWTs. There is also a background info article on JWT usage in ClusterCockpit. Tokens are cached in cc-metric-store to minimize overhead.
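Because the tokens are signed but not encrypted, anyone holding a token can read its claims. The sketch below decodes header and payload with the Python standard library, which can help when debugging an expired or wrongly issued token. It does not verify the signature, and the example token with its dummy signature is made up for illustration.

```python
import base64
import json


def decode_segment(segment):
    """Decode one base64url JWT segment (JWTs strip the padding)."""
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def inspect_jwt(token):
    """Return (header, payload) WITHOUT verifying the signature."""
    header_b64, payload_b64, _signature = token.split(".")
    return decode_segment(header_b64), decode_segment(payload_b64)


def encode_segment(obj):
    """Helper to build the illustrative token below."""
    raw = json.dumps(obj, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hypothetical token with a dummy signature, for illustration only.
token = ".".join([
    encode_segment({"typ": "JWT", "alg": "EdDSA"}),
    encode_segment({"user": "admin", "exp": 1700000000}),
    "dummy-signature",
])
print(inspect_jwt(token))
```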

NATS

As an alternative to HTTP REST, cc-metric-store can also receive metrics via NATS. You can find more information about NATS in this background article.

To enable NATS in cc-metric-store add the nats section for the connection and nats-subscriptions in the metric-store section:

{
  "nats": {
    "address": "nats://localhost:4222",
    "username": "user",
    "password": "password"
  },
  "metric-store": {
    "nats-subscriptions": [
      {
        "subscribe-to": "hpc-nats",
        "cluster-tag": "fritz"
      }
    ]
  }
}

The nats section configures the NATS server connection with address and credentials. The nats-subscriptions within metric-store define which subjects to subscribe to and how to tag incoming metrics with cluster information.

4.6 - Setup of cc-metric-collector

How to configure and deploy cc-metric-collector

Introduction

cc-metric-collector is a node agent for measuring, processing and forwarding node level metrics. It is currently mostly documented via Markdown documents in its GitHub repository. The configuration consists of the following parts:

collectors: Metric sources. There is a large number of collectors available. Important and also most demanding to configure is the likwid collector for measuring hardware performance counter metrics.
router: Rename, drop and modify metrics.
sinks: Configuration where to send the metrics.
receivers: Receive metrics. Useful as a proxy to connect different metric sinks. Can be left empty in most cases.

Build and deploy

Since cc-metric-collector needs to be installed on every compute node and requires configuration specific to the node hardware, it is demanding to install and configure. The Makefile can generate RPM and DEB packages. There is also a Systemd service file included, which you may take as a blueprint. More information on deployment is available here.

Collectors

You may want to have a look at our collector configuration, which includes configurations for many different systems, Intel and AMD CPUs and NVIDIA GPUs. The general recommendation is to first decide on the metrics you need and then figure out which collectors are required. For hardware performance counter metrics you may want to have a look at likwid-perfctr performance groups for inspiration on how to compute the required derived metrics on your target processor architecture.

Router

The router can rename, drop and modify metrics. Top-level configuration attributes (these can usually be left at their defaults):

interval_timestamp: If true, metrics received within the same interval get the same time stamp. Default is true.
num_cache_intervals: Number of intervals that are cached in the router. Default is 1. Set to 0 to disable the router cache.
hostname_tag: Set a host name different from what is returned by hostname.
max_forward: Number of metrics read at once from a Golang channel. Default is 50. The value has to be larger than 1. Recommendation: Leave at default!

Below you find the operations that are supported by the message processor.

Rename metrics

To rename metrics, add a rename_messages section mapping the old metric name to the new name.

"process_messages" : {
    "rename_messages" : {
        "load_one" : "cpu_load",
        "net_bytes_in_bw" : "net_bytes_in",
        "net_bytes_out_bw" : "net_bytes_out",
        "net_pkts_in_bw" : "net_pkts_in",
        "net_pkts_out_bw" : "net_pkts_out",
        "ib_recv_bw" : "ib_recv",
        "ib_xmit_bw" : "ib_xmit",
        "lustre_read_bytes_diff" : "lustre_read_bytes",
        "lustre_read_requests_diff" : "lustre_read_requests",
        "lustre_write_bytes_diff" : "lustre_write_bytes",
        "lustre_write_requests_diff" : "lustre_write_requests"
    }
}

Drop metrics

Sometimes collectors provide a lot of metrics that are not needed. To reduce the data volume, metrics can be dropped. Some collectors also allow excluding metrics at the collector level using the exclude_metrics option.

Note

If you are using cc-metric-store, all metrics that are not configured in its metric list are also silently dropped.

"process_messages" : {
   "drop_messages" : [
       "load_five",
       "load_fifteen",
       "proc_run",
       "proc_total"
   ]
}

Normalize unit naming

Enforce a consistent naming of units in metrics. This option should always be set to true which is the default. The metric value is not altered!

"process_messages" : {
   "normalize_units": true
}

Change metric unit

The collectors usually do not alter the unit of a metric. To change the unit prefix, set the change_unit_prefix key. The value is automatically scaled according to the old unit prefix.

"process_messages" : {
   "change_unit_prefix": {
       "name == 'mem_used'": "G",
       "name == 'swap_used'": "G",
       "name == 'mem_total'": "G",
       "name == 'swap_total'": "G",
       "name == 'cpufreq'": "M"
   }
}
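The scaling that goes along with a prefix change can be illustrated as follows. This sketch assumes decimal SI prefixes; whether the collector uses decimal or binary factors for a given unit is not stated here, so treat the factor table as an assumption. The function name and the example value are made up.

```python
# Assumption: decimal SI prefixes.
PREFIX_FACTOR = {"": 1.0, "K": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}


def change_prefix(value, old_prefix, new_prefix):
    """Rescale value so that it is expressed with new_prefix."""
    return value * PREFIX_FACTOR[old_prefix] / PREFIX_FACTOR[new_prefix]


# A memory size reported with prefix K, expressed with prefix G.
print(change_prefix(263_000_000, "K", "G"))
```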

Add tags

To add tags set the add_tags_if configuration attribute. The following statement unconditionally sets a cluster name tag for all metrics.

Note

You always want to set the cluster tag if you are using cc-metric-collector within the ClusterCockpit framework!

"process_messages" : {
    "add_tags_if": [
      {
        "key": "cluster",
        "value": "alex",
        "if": "true"
      }
    ]
}

Sinks

A simple example configuration for two sinks: HTTP cc-metric-store and NATS:

{
  "fritzstore": {
    "type": "http",
    "url": "http://monitoring.nhr.fau.de:8082/api/write?cluster=fritz",
    "jwt": "XYZ",
    "idle_connection_timeout": "60s"
  },
  "fritznats": {
    "type": "nats",
    "host": "monitoring.nhr.fau.de",
    "database": "fritz",
    "nkey_file": "/etc/cc-metric-collector/nats.nkey"
  }
}

All metrics are sent concurrently to all configured sinks.

Note

cc-metric-store only accepts timestamps in seconds

4.7 - Setup of cc-backend

How to configure and deploy cc-backend

Introduction

cc-backend is the main hub within the ClusterCockpit framework. Its configuration consists of the general part in config.json and the cluster configurations in cluster.json files, which are part of the job archive. The job archive is a long-term persistent storage for all job meta and metric data. The job meta data, including job statistics, as well as the user data are stored in a SQL database. Secrets such as passwords and tokens are provided as environment variables. Environment variables can be initialized using a .env file residing in the same directory as cc-backend. If a .env file is used, environment variables that are already set take precedence over the values in the file.
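The precedence rule for the .env file can be modelled as shown below. This is a sketch of the rule only, not the parser used by cc-backend; quoting and comment handling are simplified assumptions, and the values are placeholders.

```python
import os
import tempfile


def load_env_file(path, environ):
    """Load KEY=VALUE lines from path; variables already set are kept."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            environ.setdefault(key.strip(), value.strip().strip('"'))
    return environ


# Usage on a throwaway file; the values are placeholders.
env_path = os.path.join(tempfile.mkdtemp(), ".env")
with open(env_path, "w", encoding="utf-8") as f:
    f.write('JWT_PUBLIC_KEY="from-file"\nJWT_PRIVATE_KEY="also-from-file"\n')

environ = {"JWT_PUBLIC_KEY": "from-shell"}  # stands in for os.environ
load_env_file(env_path, environ)
print(environ)
```

The variable that was already set keeps its value, while the missing one is filled in from the file.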

Note (cc-backend before v1.5.0)

For versions before v1.5.0 the .env file was the only option to set environment variables, and they could not be set by other means!

Configuration

cc-backend provides a command line switch to generate an initial template for all required configuration files apart from the job archive:

./cc-backend -init

This will create the ./var folder, generate initial versions of the config.json and .env files, and initialize a SQLite database file.

`config.json`

Below is a production configuration enabling the following functionality:

Use HTTPS only
Mark jobs as short jobs if they ran for less than 5 minutes
Enable authentication and user syncing via an LDAP directory
Allow a user session to be initiated via a JWT token, e.g. by an IDM portal
Drop permissions after privileged ports are taken
Enable re-sampling of time-series metric data for long jobs
Enable NATS for job and metric store APIs
Set metric in memory retention to 48h
Set upper memory capping for internal metric store to 100GB
Enable archiving of metric data
Use S3 as job archive backend. Note: The file based archive in ./var/job-archive is the default.

Not included below but set by the default settings:

Use compression for metric data files in job archive
Allow access to the REST API from all IPs

{
  "main": {
    "addr": "0.0.0.0:443",
    "https-cert-file": "/etc/letsencrypt/live/url/fullchain.pem",
    "https-key-file": "/etc/letsencrypt/live/url/privkey.pem",
    "user": "clustercockpit",
    "group": "clustercockpit",
    "short-running-jobs-duration": 300,
    "enable-job-taggers": true,
    "resampling": {
      "minimum-points": 600,
      "trigger": 180,
      "resolutions": [240, 60]
    },
    "api-subjects": {
      "subject-job-event": "cc.job.event",
      "subject-node-state": "cc.node.state"
    }
  },
  "nats": {
    "address": "nats://x.x.x.x:4222",
    "username": "root",
    "password": "root"
  },
  "auth": {
    "jwts": {
      "max-age": "2000h"
    },
    "ldap": {
      "url": "ldaps://hpcldap.rrze.uni-erlangen.de",
      "user_base": "ou=people,ou=hpc,dc=rz,dc=uni,dc=de",
      "search_dn": "cn=hpcmonitoring,ou=roadm,ou=profile,ou=hpc,dc=rz,dc=uni,dc=de",
      "user_bind": "uid={username},ou=people,ou=hpc,dc=rrze,dc=uni,dc=de",
      "user_filter": "(&(objectclass=posixAccount))",
      "sync_interval": "24h"
    }
  },
  "cron": {
    "commit-job-worker": "1m",
    "duration-worker": "5m",
    "footprint-worker": "10m"
  },
  "archive": {
    "kind": "s3",
    "endpoint": "http://x.x.x.x",
    "bucket": "jobarchive",
    "access-key": "xx",
    "secret-key": "xx",
    "retention": {
      "policy": "move",
      "age": 365,
      "location": "./var/archive"
    }
  },
  "metric-store": {
    "memory-cap": 100,
    "retention-in-memory": "48h",
    "cleanup": {
      "mode": "archive",
      "directory": "./var/archive"
    },
    "nats-subscriptions": [
      {
        "subscribe-to": "hpc-nats",
        "cluster-tag": "fritz"
      },
      {
        "subscribe-to": "hpc-nats",
        "cluster-tag": "alex"
      }
    ]
  },
  "ui-file": "ui-config.json"
}

Environment variables

Secrets are provided as environment variables. The only two required secrets are JWT_PUBLIC_KEY and JWT_PRIVATE_KEY, which are used to sign generated JWT tokens and to validate JWT authentication.

Please refer to the environment reference for details.

5 - How-to Guides

How-to solve concrete problems

5.1 - Configure retention policies

Managing database and job archive size with retention policies

Overview

Over time, the ClusterCockpit database and job archive can grow significantly, especially in production environments with high job counts. Retention policies help keep your storage at a manageable size by automatically removing or archiving old jobs.

Why use retention policies?

Without retention policies:

The SQLite database file can grow to tens of gigabytes
The job archive can reach terabytes in size
Storage requirements increase indefinitely
System performance may degrade

A typical multi-cluster setup over 5 years can accumulate:

75 GB for the SQLite database
1.4 TB for the job archive

Retention policies allow you to balance data retention needs with storage capacity.

Retention policy options

ClusterCockpit supports three retention policies:

None (default)

No automatic cleanup. Jobs are kept indefinitely.

{
  "archive": {
    "kind": "file",
    "path": "./var/job-archive"
  }
}

Delete

Permanently removes jobs older than the specified age from both the job archive and the database.

Use when:

Storage space is limited
You don’t need long-term job data
You have external backups or data exports

Configuration example:

{
  "archive": {
    "kind": "file",
    "path": "./var/job-archive",
    "retention": {
      "policy": "delete",
      "age": 365,
      "includeDB": true
    }
  }
}

This configuration will:

Delete jobs older than 365 days
Remove them from both the job archive and database
Run automatically based on the cleanup interval

Move

Moves old jobs to a separate location for long-term archival while removing them from the active database.

Use when:

You need to preserve historical data
You want to reduce active database size
You can store archived data on cheaper, slower storage

Configuration example:

{
  "archive": {
    "kind": "file",
    "path": "./var/job-archive",
    "retention": {
      "policy": "move",
      "age": 365,
      "location": "/mnt/archive/old-jobs",
      "includeDB": true
    }
  }
}

This configuration will:

Move jobs older than 365 days to /mnt/archive/old-jobs
Remove them from the active database
Preserve the data for potential future analysis

Configuration parameters

`archive.retention` section

Parameter	Type	Required	Default	Description
`policy`	string	Yes	-	Retention policy: `none`, `delete`, or `move`
`age`	integer	No	7	Age threshold in days. Jobs older than this are affected
`includeDB`	boolean	No	true	Also remove jobs from the database (not just archive)
`location`	string	For `move`	-	Target directory for moved jobs (only for `move` policy)

Complete configuration examples

Example 1: One-year retention with deletion

Suitable for environments with limited storage:

{
  "archive": {
    "kind": "file",
    "path": "./var/job-archive",
    "retention": {
      "policy": "delete",
      "age": 365,
      "includeDB": true
    }
  }
}

Example 2: Two-tier archival system

Keep 6 months active, move older data to long-term storage:

{
  "archive": {
    "kind": "file",
    "path": "./var/job-archive",
    "retention": {
      "policy": "move",
      "age": 180,
      "location": "/mnt/slow-storage/archive",
      "includeDB": true
    }
  }
}

Example 3: S3 backend with retention

Using S3 object storage with one-year retention:

{
  "archive": {
    "kind": "s3",
    "endpoint": "https://s3.example.com",
    "bucket": "clustercockpit-jobs",
    "access-key": "your-access-key",
    "secret-key": "your-secret-key",
    "retention": {
      "policy": "delete",
      "age": 365,
      "includeDB": true
    }
  }
}

How retention policies work

Automatic execution: Retention policies run automatically based on the configured interval
Age calculation: Jobs are evaluated based on their startTime field
Batch processing: All jobs older than the specified age are processed in one operation
Database cleanup: When includeDB: true, corresponding database entries are removed
Archive handling: Based on policy (delete removes, move relocates)
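The age calculation above can be sketched as follows, for example to preview which jobs a policy would affect before enabling it. The sketch assumes that startTime is a Unix timestamp in seconds; the function name and the job records are hypothetical.

```python
import time


def jobs_past_retention(jobs, age_days, now=None):
    """Return jobs whose startTime is more than age_days before now."""
    now = time.time() if now is None else now
    cutoff = now - age_days * 24 * 3600
    return [job for job in jobs if job["startTime"] < cutoff]


now = 1_700_000_000
jobs = [
    {"jobId": 1, "startTime": now - 400 * 86400},  # older than one year
    {"jobId": 2, "startTime": now - 10 * 86400},   # recent
]
print([j["jobId"] for j in jobs_past_retention(jobs, 365, now)])
```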

Best practices

Planning retention periods

Consider these factors when setting the age parameter:

Accounting requirements: Some organizations require job data for billing/auditing
Research needs: Longer retention for research clusters where users may need historical data
Storage capacity: Available disk space and growth rate
Compliance: Legal or institutional data retention policies

Recommended retention periods:

Use Case	Suggested Age
Development/testing	30-90 days
Production (limited storage)	180-365 days
Production (ample storage)	365-730 days
Research/archival	730+ days or use `move` policy

Storage considerations

For `move` policy

Mount the target location on slower, cheaper storage (e.g., spinning disks, network storage)
Ensure sufficient space at the target location
Consider periodic backups of the moved archive
Document the archive structure for future retrieval

For `delete` policy

Create backups first: Always backup your database and job archive before enabling deletion
Test on a copy: Verify the retention policy works as expected on test data
Export important data: Consider exporting summary statistics or critical job data before deletion

Monitoring and maintenance

Track archive size: Monitor growth to adjust retention periods
```
du -sh /var/job-archive
du -sh /path/to/database.db
```
Verify retention execution: Check logs for retention policy runs
```
grep -i retention /var/log/cc-backend.log
```

Regular backups: Backup before changing retention settings

cp -r /var/job-archive /backup/job-archive-$(date +%Y%m%d)
cp /var/clustercockpit.db /backup/clustercockpit-$(date +%Y%m%d).db

Restoring deleted jobs

If using `move` policy

Jobs moved to the retention location can be restored:

Stop cc-backend

Use the archive-manager tool to import jobs back:

cd tools/archive-manager
go build
./archive-manager -import \
  -src-config '{"kind":"file","path":"/mnt/archive/old-jobs"}' \
  -dst-config '{"kind":"file","path":"./var/job-archive"}'

Rebuild database from archive:
```
./cc-backend -init-db
```
Restart cc-backend

If using `delete` policy

Jobs cannot be restored unless you have external backups. This is why backups are critical before enabling deletion.

archive-manager: Manage and validate job archives
archive-migration: Migrate archives between schema versions
Database migration: See database migration guide

Troubleshooting

Retention policy not running

Check:

Verify archive.retention is properly configured in config.json
Ensure cc-backend was restarted after configuration changes
Check logs for errors: grep -i retention /var/log/cc-backend.log

Database size not decreasing

Possible causes:

includeDB: false - Database entries are not being removed
SQLite doesn’t automatically reclaim space - run VACUUM:
```
sqlite3 /var/clustercockpit.db "VACUUM;"
```
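The effect of VACUUM can be observed on a throwaway database with the Python standard library. This is a sketch only; the table name and data are made up. Note that, according to the SQLite documentation, VACUUM may temporarily need free disk space of up to twice the database size, so it is probably best run while cc-backend is stopped.

```python
import os
import sqlite3
import tempfile


def vacuum(db_path):
    """Run VACUUM and return (size_before, size_after) in bytes."""
    before = os.path.getsize(db_path)
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("VACUUM")
    finally:
        conn.close()
    return before, os.path.getsize(db_path)


# Demonstration on a throwaway database (table name is hypothetical).
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE demo (payload TEXT)")
conn.executemany("INSERT INTO demo VALUES (?)", [("x" * 1000,)] * 2000)
conn.commit()
conn.execute("DELETE FROM demo")  # frees pages, but the file keeps its size
conn.commit()
conn.close()

before, after = vacuum(path)
print(before, after)
```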

Jobs not being moved to target location

Check:

Target directory exists and is writable
Sufficient disk space at target location
File permissions allow cc-backend to write to location
Path in location is absolute, not relative

Performance impact

If retention policy execution causes performance issues:

Consider running during off-peak hours (feature may require manual execution)
Reduce the number of old jobs by running retention more frequently with shorter age periods
Use more powerful hardware for the database operations

5.2 - How to set up hierarchical metric collection

Configure multiple cc-metric-collector instances to receive metrics from compute nodes and forward them to the backend

Overview

In large HPC clusters, it’s often impractical or undesirable to have every compute node connect directly to the central database. A hierarchical collection setup allows you to:

Reduce database connections: Instead of hundreds of nodes connecting directly, use aggregation nodes as intermediaries
Improve network efficiency: Aggregate metrics at rack or partition level before forwarding
Add processing layers: Filter, transform, or enrich metrics at intermediate collection points
Increase resilience: Buffer metrics during temporary database outages

This guide shows how to configure multiple cc-metric-collector instances where compute nodes send metrics to aggregation nodes, which then forward them to the backend database.

Architecture

flowchart TD
  subgraph Rack1 ["Rack 1 - Compute Nodes"]
    direction LR
    node1["Node 1<br/>cc-metric-collector"]
    node2["Node 2<br/>cc-metric-collector"]
    node3["Node 3<br/>cc-metric-collector"]
  end

  subgraph Rack2 ["Rack 2 - Compute Nodes"]
    direction LR
    node4["Node 4<br/>cc-metric-collector"]
    node5["Node 5<br/>cc-metric-collector"]
    node6["Node 6<br/>cc-metric-collector"]
  end

  subgraph Aggregator ["Aggregation Node"]
    ccrecv["cc-metric-collector<br/>(with receivers)"]
  end

  subgraph Backend ["Backend Server"]
    ccms[("cc-metric-store")]
    ccweb["cc-backend<br/>(Web Frontend)"]
  end
  
  node1 --> ccrecv
  node2 --> ccrecv
  node3 --> ccrecv
  node4 --> ccrecv
  node5 --> ccrecv
  node6 --> ccrecv
  
  ccrecv --> ccms
  ccms <--> ccweb

Components

Compute Node Collectors: Run on each compute node, collect local metrics, forward to aggregation node
Aggregation Node: Receives metrics from multiple compute nodes, optionally processes them, forwards to cc-metric-store
cc-metric-store: In-memory time-series database for metric storage and retrieval
cc-backend: Web frontend that queries cc-metric-store and visualizes metrics

Configuration

Step 1: Configure Compute Nodes

Compute nodes collect local metrics and send them to the aggregation node using a network sink (NATS or HTTP).

Using NATS (Recommended)

NATS provides better performance, reliability, and built-in clustering support.

config.json:

{
  "sinks-file": "/etc/cc-metric-collector/sinks.json",
  "collectors-file": "/etc/cc-metric-collector/collectors.json",
  "receivers-file": "/etc/cc-metric-collector/receivers.json",
  "router-file": "/etc/cc-metric-collector/router.json",
  "main": {
    "interval": "10s",
    "duration": "1s"
  }
}

sinks.json:

{
  "nats_aggregator": {
    "type": "nats",
    "host": "aggregator.example.org",
    "port": "4222",
    "subject": "metrics.rack1"
  }
}

collectors.json (enable metrics you need):

{
  "cpustat": {},
  "memstat": {},
  "diskstat": {},
  "netstat": {},
  "loadavg": {},
  "tempstat": {}
}

router.json (add identifying tags):

{
  "interval_timestamp": true,
  "process_messages": {
    "manipulate_messages": [
      {
        "add_base_tags": {
          "cluster": "mycluster",
          "rack": "rack1"
        }
      }
    ]
  }
}

receivers.json (empty for compute nodes):

{}

Using HTTP

HTTP is simpler but less efficient for high-frequency metrics.

sinks.json (HTTP alternative):

{
  "http_aggregator": {
    "type": "http",
    "host": "aggregator.example.org",
    "port": "8080",
    "path": "/api/write",
    "idle_connection_timeout": "5s",
    "timeout": "3s"
  }
}

Step 2: Configure Aggregation Node

The aggregation node receives metrics from compute nodes via receivers and forwards them to the backend database.

config.json:

{
  "sinks-file": "/etc/cc-metric-collector/sinks.json",
  "collectors-file": "/etc/cc-metric-collector/collectors.json",
  "receivers-file": "/etc/cc-metric-collector/receivers.json",
  "router-file": "/etc/cc-metric-collector/router.json",
  "main": {
    "interval": "10s",
    "duration": "1s"
  }
}

receivers.json (receive from compute nodes):

{
  "nats_rack1": {
    "type": "nats",
    "address": "localhost",
    "port": "4222",
    "subject": "metrics.rack1"
  },
  "nats_rack2": {
    "type": "nats",
    "address": "localhost",
    "port": "4222",
    "subject": "metrics.rack2"
  }
}
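With many racks, writing the receiver entries by hand becomes tedious. The sketch below generates a receivers.json body that mirrors the layout above; the function name is hypothetical and the attribute names are simply copied from the example.

```python
import json


def nats_receivers(racks, address="localhost", port="4222"):
    """Build one NATS receiver entry per rack, mirroring the layout above."""
    return {
        f"nats_{rack}": {
            "type": "nats",
            "address": address,
            "port": port,
            "subject": f"metrics.{rack}",
        }
        for rack in racks
    }


print(json.dumps(nats_receivers(["rack1", "rack2"]), indent=2))
```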

sinks.json (forward to cc-metric-store):

{
  "metricstore": {
    "type": "http",
    "host": "backend.example.org",
    "port": "8082",
    "path": "/api/write",
    "idle_connection_timeout": "5s",
    "timeout": "5s",
    "jwt": "eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDbVXKrQr4jNiQV-B_1-uaL_lW8d8gGb-TSAG9KdMg"
  }
}

Note: The jwt token must be signed with the private key corresponding to the public key configured in cc-metric-store. See JWT generation guide for details.

collectors.json (optionally collect local metrics):

{
  "cpustat": {},
  "memstat": {},
  "loadavg": {}
}

router.json (optionally process metrics):

{
  "interval_timestamp": false,
  "num_cache_intervals": 0,
  "process_messages": {
    "manipulate_messages": [
      {
        "add_base_tags": {
          "datacenter": "dc1"
        }
      }
    ]
  }
}

Step 3: Set Up cc-metric-store

The backend server needs cc-metric-store to receive and store metrics from the aggregation node.

config.json (/etc/cc-metric-store/config.json):

{
  "metrics": {
    "cpu_user": {
      "frequency": 10,
      "aggregation": "avg"
    },
    "cpu_system": {
      "frequency": 10,
      "aggregation": "avg"
    },
    "mem_used": {
      "frequency": 10,
      "aggregation": null
    },
    "mem_total": {
      "frequency": 10,
      "aggregation": null
    },
    "net_bw": {
      "frequency": 10,
      "aggregation": "sum"
    },
    "flops_any": {
      "frequency": 10,
      "aggregation": "sum"
    },
    "mem_bw": {
      "frequency": 10,
      "aggregation": "sum"
    }
  },
  "http-api": {
    "address": "0.0.0.0:8082"
  },
  "jwt-public-key": "kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0=",
  "retention-in-memory": "48h",
  "checkpoints": {
    "interval": "12h",
    "directory": "/var/lib/cc-metric-store/checkpoints",
    "restore": "48h"
  },
  "archive": {
    "interval": "24h",
    "directory": "/var/lib/cc-metric-store/archive"
  }
}

Important configuration notes:

metrics: Must list ALL metrics you want to store. Only configured metrics are accepted.
frequency: Must match the collection interval from cc-metric-collector (in seconds)
aggregation: "sum" for resource metrics (bandwidth, FLOPS), "avg" for diagnostic metrics (CPU %), null for node-only metrics
jwt-public-key: Must correspond to the private key used to sign JWT tokens in the aggregation node sink configuration
retention-in-memory: How long to keep metrics in memory (should cover typical job durations)
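The first three notes can be checked mechanically before deploying. The sketch below is one possible way to do that, assuming both configuration files have already been loaded into dictionaries (for example with `json.load`). The function names are hypothetical, and the duration parser only understands simple values such as "10s", "5m" or "1h".

```python
import re

# Values named in the notes above: "sum", "avg", or null (None in Python)
ALLOWED_AGGREGATIONS = {"sum", "avg", None}


def duration_to_seconds(text: str) -> int:
    """Convert simple durations such as "10s", "5m" or "1h" to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * {"s": 1, "m": 60, "h": 3600}[unit]


def check_metric_store_config(store_cfg: dict, collector_cfg: dict) -> list[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    interval = duration_to_seconds(collector_cfg["main"]["interval"])
    for name, metric in store_cfg["metrics"].items():
        frequency = metric.get("frequency")
        if frequency != interval:
            problems.append(
                f"{name}: frequency {frequency} does not match collector interval {interval}"
            )
        if metric.get("aggregation") not in ALLOWED_AGGREGATIONS:
            problems.append(f"{name}: unknown aggregation {metric.get('aggregation')!r}")
    return problems
```

An empty result means that every configured metric uses the collector interval as its frequency and one of the three aggregation values listed above.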

Install cc-metric-store:

# Download binary
wget https://github.com/ClusterCockpit/cc-metric-store/releases/latest/download/cc-metric-store

# Install
sudo mkdir -p /opt/monitoring/cc-metric-store
sudo mv cc-metric-store /opt/monitoring/cc-metric-store/
sudo chmod +x /opt/monitoring/cc-metric-store/cc-metric-store

# Create directories
sudo mkdir -p /var/lib/cc-metric-store/checkpoints
sudo mkdir -p /var/lib/cc-metric-store/archive
sudo mkdir -p /etc/cc-metric-store

Create systemd service (/etc/systemd/system/cc-metric-store.service):

[Unit]
Description=ClusterCockpit Metric Store
After=network.target

[Service]
Type=simple
User=cc-metricstore
Group=cc-metricstore
WorkingDirectory=/opt/monitoring/cc-metric-store
ExecStart=/opt/monitoring/cc-metric-store/cc-metric-store -config /etc/cc-metric-store/config.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Start cc-metric-store:

# Create user
sudo useradd -r -s /bin/false cc-metricstore
sudo chown -R cc-metricstore:cc-metricstore /var/lib/cc-metric-store

# Start service
sudo systemctl daemon-reload
sudo systemctl start cc-metric-store
sudo systemctl enable cc-metric-store

# Check status
sudo systemctl status cc-metric-store

Step 4: Set Up NATS Server

The aggregation node needs a NATS server to receive metrics from compute nodes.

Install NATS:

# Using Docker
docker run -d --name nats -p 4222:4222 nats:latest

# Or install the release binary manually (example for Linux amd64)
curl -L https://github.com/nats-io/nats-server/releases/download/v2.10.5/nats-server-v2.10.5-linux-amd64.zip -o nats-server.zip
unzip nats-server.zip
sudo mv nats-server-v2.10.5-linux-amd64/nats-server /usr/local/bin/

NATS Configuration (/etc/nats/nats-server.conf):

listen: 0.0.0.0:4222
max_payload: 10MB
max_connections: 1000

# Optional: Enable authentication
authorization {
  user: collector
  password: secure_password
}

# Optional: Enable clustering for HA
cluster {
  name: metrics-cluster
  listen: 0.0.0.0:6222
}

Start NATS:

# Systemd
sudo systemctl start nats
sudo systemctl enable nats

# Or directly
nats-server -c /etc/nats/nats-server.conf

Advanced Configurations

Multiple Aggregation Levels

For very large clusters, you can create multiple aggregation levels:

flowchart TD
  subgraph Compute ["Compute Nodes"]
    node1["Node 1-100"]
  end

  subgraph Rack ["Rack Aggregators"]
    agg1["Aggregator<br/>Rack 1-10"]
  end

  subgraph Cluster ["Cluster Aggregator"]
    agg_main["Main Aggregator"]
  end

  subgraph Backend ["Backend"]
    ccms[("cc-metric-store")]
  end
  
  node1 --> agg1
  agg1 --> agg_main
  agg_main --> ccms

Rack-level aggregator sinks.json:

{
  "cluster_aggregator": {
    "type": "nats",
    "host": "main-aggregator.example.org",
    "port": "4222",
    "subject": "metrics.cluster"
  }
}

Cluster-level aggregator receivers.json:

{
  "all_racks": {
    "type": "nats",
    "address": "localhost",
    "port": "4222",
    "subject": "metrics.cluster"
  }
}

Load Balancing with Multiple Aggregators

Use NATS queue groups to distribute load across multiple aggregation nodes:

Compute node sinks.json:

{
  "nats_aggregator": {
    "type": "nats",
    "host": "nats-cluster.example.org",
    "port": "4222",
    "subject": "metrics.loadbalanced"
  }
}

Aggregator 1 and 2 receivers.json (identical configuration):

{
  "nats_with_queue": {
    "type": "nats",
    "address": "localhost",
    "port": "4222",
    "subject": "metrics.loadbalanced",
    "queue_group": "aggregators"
  }
}

With queue_group configured, NATS automatically distributes messages across all aggregators in the group.
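The difference between a plain subscription and a queue group can be illustrated with a toy model. The sketch below does not talk to NATS at all. It only mimics the delivery semantics: without a queue group every subscriber sees every message, with a queue group each message is handed to exactly one randomly chosen member. The function name `deliver` and the subscriber names are assumptions for this illustration.

```python
import random
from collections import Counter


def deliver(messages, subscribers, queue_group=False, rng=None):
    """Toy model of NATS delivery; returns the number of messages per subscriber."""
    rng = rng or random.Random()
    received = Counter({name: 0 for name in subscribers})
    for _ in messages:
        if queue_group:
            # Exactly one member of the group gets the message
            received[rng.choice(subscribers)] += 1
        else:
            # Every subscriber gets its own copy
            for name in subscribers:
                received[name] += 1
    return received


messages = range(1000)
aggregators = ["aggregator1", "aggregator2"]
print("plain subscription:", dict(deliver(messages, aggregators)))
print("queue group:       ", dict(deliver(messages, aggregators, queue_group=True)))
```

In the queue group case the per-aggregator counts add up to the number of published messages, which is why identical receiver configurations on several aggregators do not lead to duplicated metrics.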

Filtering at Aggregation Level

Reduce cc-metric-store load by filtering metrics at the aggregation node:

Aggregator router.json:

{
  "interval_timestamp": false,
  "process_messages": {
    "manipulate_messages": [
      {
        "drop_by_name": ["cpu_idle", "cpu_guest", "cpu_guest_nice"]
      },
      {
        "drop_by": "value == 0 && match('temp_', name)"
      },
      {
        "add_base_tags": {
          "aggregated": "true"
        }
      }
    ]
  }
}
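To reason about what such a rule set does to a stream of metrics, it can be modelled offline. The sketch below covers only the `drop_by_name` and `add_base_tags` rules; the expression-based `drop_by` rule is not modelled. Metrics are represented as plain dictionaries, and the assumption that existing tags take precedence over base tags is made for this illustration only and is not taken from the router implementation.

```python
def apply_router_rules(metrics, drop_names, base_tags):
    """Drop metrics by name and add base tags to the remaining ones."""
    kept = []
    for metric in metrics:
        if metric["name"] in drop_names:
            continue  # corresponds to drop_by_name
        # Assumption: a tag already present on the metric is not overwritten
        tags = {**base_tags, **metric.get("tags", {})}
        kept.append({**metric, "tags": tags})
    return kept


incoming = [
    {"name": "cpu_idle", "tags": {"hostname": "node001"}, "value": 97.0},
    {"name": "cpu_user", "tags": {"hostname": "node001"}, "value": 2.5},
]
for metric in apply_router_rules(
    incoming,
    drop_names={"cpu_idle", "cpu_guest", "cpu_guest_nice"},
    base_tags={"aggregated": "true"},
):
    print(metric)
```

Running a captured sample of real metrics through a model like this before changing router.json gives a rough idea of how much traffic a filter removes.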

Metric Transformation

Aggregate or transform metrics before forwarding:

Aggregator router.json:

{
  "interval_timestamp": false,
  "num_cache_intervals": 1,
  "interval_aggregates": [
    {
      "name": "rack_avg_temp",
      "if": "name == 'temp_package_id_0'",
      "function": "avg(values)",
      "tags": {
        "type": "rack",
        "rack": "<copy>"
      },
      "meta": {
        "unit": "degC",
        "source": "aggregated"
      }
    }
  ]
}

High Availability Setup

Use multiple NATS servers in cluster mode:

NATS server 1 config:

cluster {
  name: metrics-cluster
  listen: 0.0.0.0:6222
  routes: [
    nats://nats2.example.org:6222
    nats://nats3.example.org:6222
  ]
}

Compute node sinks.json (with failover):

{
  "nats_ha": {
    "type": "nats",
    "host": "nats1.example.org,nats2.example.org,nats3.example.org",
    "port": "4222",
    "subject": "metrics.rack1"
  }
}

Deployment

1. Install cc-metric-collector

On all nodes (compute and aggregation):

# Download binary
wget https://github.com/ClusterCockpit/cc-metric-collector/releases/latest/download/cc-metric-collector

# Install
sudo mkdir -p /opt/monitoring/cc-metric-collector
sudo mv cc-metric-collector /opt/monitoring/cc-metric-collector/
sudo chmod +x /opt/monitoring/cc-metric-collector/cc-metric-collector

2. Deploy Configuration Files

Compute nodes:

sudo mkdir -p /etc/cc-metric-collector
sudo cp config.json /etc/cc-metric-collector/
sudo cp sinks.json /etc/cc-metric-collector/
sudo cp collectors.json /etc/cc-metric-collector/
sudo cp receivers.json /etc/cc-metric-collector/
sudo cp router.json /etc/cc-metric-collector/

Aggregation node:

sudo mkdir -p /etc/cc-metric-collector
# Deploy aggregator-specific configs
sudo cp aggregator-config.json /etc/cc-metric-collector/config.json
sudo cp aggregator-sinks.json /etc/cc-metric-collector/sinks.json
sudo cp aggregator-receivers.json /etc/cc-metric-collector/receivers.json
# etc...

3. Create Systemd Service

On all nodes (/etc/systemd/system/cc-metric-collector.service):

[Unit]
Description=ClusterCockpit Metric Collector
After=network.target

[Service]
Type=simple
User=cc-collector
Group=cc-collector
ExecStart=/opt/monitoring/cc-metric-collector/cc-metric-collector -config /etc/cc-metric-collector/config.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

4. Start Services

Order of startup:

Start cc-metric-store on backend server
Start NATS server on aggregation node
Start cc-metric-collector on aggregation node
Start cc-metric-collector on compute nodes

# On backend server
sudo systemctl start cc-metric-store

# On aggregation node
sudo systemctl start nats
sudo systemctl start cc-metric-collector

# On compute nodes
sudo systemctl start cc-metric-collector

# Enable on boot (on all nodes)
sudo systemctl enable cc-metric-store  # backend only
sudo systemctl enable nats              # aggregator only
sudo systemctl enable cc-metric-collector

Testing and Validation

Test Compute Node → Aggregator

On compute node, run once to verify metrics are collected:

cc-metric-collector -config /etc/cc-metric-collector/config.json -once

On aggregation node, check NATS for incoming metrics:

# Subscribe to see messages
nats sub 'metrics.>'
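The subject `metrics.>` uses a NATS wildcard. As a minimal sketch of the matching rules, the function below treats `*` as exactly one token and `>` as one or more trailing tokens. The function name `subject_matches` is made up for this illustration and is not part of any NATS client library.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Check whether a NATS subject matches a subscription pattern."""
    pattern_tokens = pattern.split(".")
    subject_tokens = subject.split(".")
    for index, token in enumerate(pattern_tokens):
        if token == ">":
            # '>' is only valid as the last token and needs at least one token to match
            return index == len(pattern_tokens) - 1 and len(subject_tokens) > index
        if index >= len(subject_tokens):
            return False
        if token != "*" and token != subject_tokens[index]:
            return False
    return len(pattern_tokens) == len(subject_tokens)


for subject in ("metrics.rack1", "metrics.rack2", "metrics.cluster", "other.rack1"):
    print(subject, subject_matches("metrics.>", subject))
```

A subscription on `metrics.>` therefore shows traffic for all rack subjects used in this guide at once, while `metrics.rack1` shows a single rack.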

Test Aggregator → cc-metric-store

On aggregation node, verify metrics are forwarded:

# Check logs
journalctl -u cc-metric-collector -f

On backend server, verify cc-metric-store is receiving data:

# Check cc-metric-store logs
journalctl -u cc-metric-store -f

# Query metrics via REST API (requires valid JWT token)
curl -H "Authorization: Bearer $JWT_TOKEN" \
  "http://backend.example.org:8082/api/query?cluster=mycluster&from=$(date -d '5 minutes ago' +%s)&to=$(date +%s)"

Validate End-to-End

Check cc-backend to see if metrics appear for all nodes:

Open cc-backend web interface
Navigate to node view
Verify metrics are displayed for compute nodes
Check that tags (cluster, rack, etc.) are present

Monitoring and Troubleshooting

Check Collection Pipeline

# Compute node: metrics are being sent
journalctl -u cc-metric-collector -n 100 | grep -i "sent\|error"

# Aggregator: metrics are being received
journalctl -u cc-metric-collector -n 100 | grep -i "received\|error"

# NATS: check connections
nats server info
nats server list

Common Issues

Metrics not appearing in cc-metric-store:

Check compute node → NATS connection
Verify NATS → aggregator reception
Check aggregator → cc-metric-store sink (verify JWT token is valid)
Verify metrics are configured in cc-metric-store’s config.json
Examine router filters (may be dropping metrics)

High latency:

Reduce the metric collection frequency on compute nodes (i.e. increase the interval)
Increase batch size in aggregator sinks
Add more aggregation nodes with load balancing
Check network bandwidth between tiers

Memory growth on aggregator:

Reduce num_cache_intervals in router
Check sink write performance
Verify cc-metric-store is accepting writes
Monitor NATS queue depth

Memory growth on cc-metric-store:

Reduce retention-in-memory setting
Increase checkpoint frequency
Verify archive cleanup is working

Connection failures:

Verify firewall rules allow NATS port (4222)
Check NATS server is running and accessible
Test network connectivity: telnet aggregator.example.org 4222
Review NATS server logs: journalctl -u nats -f
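On nodes where telnet is not installed, the connectivity test can be done with a few lines of Python. This is a minimal sketch using only the standard library; the function name `port_open` is hypothetical, and the host and port in the example are the ones used throughout this guide.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures
        return False


# Example: print(port_open("aggregator.example.org", 4222))
```

A result of `False` only says that the TCP connection failed. It does not distinguish between a firewall rule, a stopped NATS server, or a DNS problem.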

Performance Tuning

Compute nodes (reduce overhead):

{
  "main": {
    "interval": "30s",
    "duration": "1s"
  }
}

Aggregator (increase throughput):

{
  "metricstore": {
    "type": "http",
    "host": "backend.example.org",
    "port": "8082",
    "path": "/api/write",
    "timeout": "10s",
    "idle_connection_timeout": "10s"
  }
}

NATS server (handle more connections):

max_connections: 10000
max_payload: 10MB
write_deadline: "10s"

Security Considerations

NATS Authentication

NATS server config:

authorization {
  users = [
    {
      user: "collector"
      password: "$2a$11$..."  # bcrypt hash
    }
  ]
}

Compute node sinks.json:

{
  "nats_aggregator": {
    "type": "nats",
    "host": "aggregator.example.org",
    "port": "4222",
    "subject": "metrics.rack1",
    "username": "collector",
    "password": "secure_password"
  }
}

TLS Encryption

NATS server config:

tls {
  cert_file: "/etc/nats/certs/server-cert.pem"
  key_file: "/etc/nats/certs/server-key.pem"
  ca_file: "/etc/nats/certs/ca.pem"
  verify: true
}

Compute node sinks.json:

{
  "nats_aggregator": {
    "type": "nats",
    "host": "aggregator.example.org",
    "port": "4222",
    "subject": "metrics.rack1",
    "ssl": true,
    "ssl_cert": "/etc/cc-metric-collector/certs/client-cert.pem",
    "ssl_key": "/etc/cc-metric-collector/certs/client-key.pem"
  }
}

Firewall Rules

On aggregation node:

# Allow NATS from compute network
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/8" port protocol="tcp" port="4222" accept'

sudo firewall-cmd --reload

On backend server:

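# Allow HTTP from aggregation node to cc-metric-store
# Replace the placeholder 192.0.2.10 with the IP address of the aggregation node (firewalld rich rules do not accept host names)
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.0.2.10" port protocol="tcp" port="8082" accept'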

sudo firewall-cmd --reload

Alternative: Using NATS for cc-metric-store

Instead of HTTP, you can also use NATS to send metrics from the aggregation node to cc-metric-store.

Aggregation node sinks.json:

{
  "nats_metricstore": {
    "type": "nats",
    "host": "backend.example.org",
    "port": "4222",
    "subject": "metrics.store"
  }
}

cc-metric-store config.json (add NATS section):

{
  "metrics": { ... },
  "nats": {
    "address": "nats://0.0.0.0:4222",
    "subscriptions": [
      {
        "subscribe-to": "metrics.store",
        "cluster-tag": "mycluster"
      }
    ]
  },
  "http-api": { ... },
  "jwt-public-key": "...",
  "retention-in-memory": "48h",
  "checkpoints": { ... },
  "archive": { ... }
}

Benefits of NATS:

Better performance for high-frequency metrics
Built-in message buffering
No need for JWT tokens in sink configuration
Easier to scale with multiple aggregators

Trade-offs:

Requires running NATS server on backend
More complex infrastructure

5.3 - Database migrations

Database migrations

Introduction

In general, an upgrade is nothing more than a replacement of the binary file. All the necessary files, except the database file, the configuration file and the job archive, are embedded in the binary file. It is recommended to use a directory where the file names of the binary files are named with a version indicator. This can be, for example, the date or the Unix epoch time. A symbolic link points to the version to be used. This makes it easier to switch to earlier versions.
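The versioned-binary layout described above can be managed with a small helper. The following is a sketch under the assumption that all binaries live in one directory and that the symbolic link is called `cc-backend`; both the function name `activate_version` and the file names in the example are made up for this illustration.

```python
import os


def activate_version(directory: str, binary_name: str, link_name: str = "cc-backend") -> str:
    """Point the symbolic link at the given versioned binary and return the link path."""
    target = os.path.join(directory, binary_name)
    if not os.path.isfile(target):
        raise FileNotFoundError(target)
    link_path = os.path.join(directory, link_name)
    tmp_path = link_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    # Create the new link under a temporary name, then rename it over the old
    # link so that there is never a moment without a valid link.
    os.symlink(binary_name, tmp_path)
    os.replace(tmp_path, link_path)
    return link_path


# Example: activate_version("/opt/monitoring/cc-backend", "cc-backend-20240601")
```

Switching back to an earlier version is then the same call with the older file name.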

The database and the job archive are versioned. Each release binary supports specific versions of the database and job archive. If a version mismatch is detected, the application is terminated and migration is required.

IMPORTANT NOTE
It is recommended to make a backup copy of the database before each update. This is mandatory in case the database needs to be migrated. In the case of SQLite, this means stopping cc-backend and copying the SQLite database file to a safe location.

Migrating the database

IMPORTANT NOTE
If your database is larger than 10 GB, you may want to do a test migration on a copy of the database to determine the expected downtime before attempting the migration in production.

After you have backed up the database, run the following command to migrate the database to the latest version:

> ./cc-backend -migrate-db

The migration files are embedded in the binary and can also be viewed in the cc-backend source tree. We use the migrate library.

If something goes wrong, you can check the status and get the current schema (here for sqlite):

> sqlite3 var/job.db

In the sqlite console execute:

.schema

to get the current database schema. You can query the current version and whether the migration failed with:

SELECT * FROM schema_migrations;

The first column indicates the current database version. The second column is a dirty flag, which is set if a migration did not complete successfully.
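The same check can be scripted, for example as part of an upgrade procedure. The sketch below reads the first row of `schema_migrations` with Python's built-in `sqlite3` module and opens the database read-only. The function name `migration_status` is hypothetical, and the column order (version first, dirty flag second) follows the description above.

```python
import sqlite3


def migration_status(db_path: str) -> tuple[int, bool]:
    """Return (version, dirty) from the schema_migrations table."""
    # mode=ro makes sure the check can never modify the database
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        row = conn.execute("SELECT * FROM schema_migrations").fetchone()
    finally:
        conn.close()
    if row is None:
        raise RuntimeError("schema_migrations contains no rows")
    return int(row[0]), bool(row[1])


# Example: version, dirty = migration_status("var/job.db")
```

A dirty flag of `True` is a signal to restore the backup before retrying the migration.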

5.4 - Job archive migrations

Job archive migrations

Introduction

IMPORTANT NOTE
It is recommended to make a backup copy of the job-archive before each update.

Migrating the job archive

Notice

Don’t forget to also migrate archived jobs in case you use an archive retention policy! Archive migration is only supported from the previous archive version.

Job archive migration requires a separate tool (archive-migration), which is part of the cc-backend source tree (build with go build ./tools/archive-migration) and is also provided as part of the releases.

Migration is supported only between two successive releases. You can find details on how to use the archive-migration tool in its reference documentation.

The cluster.json files in job-archive-new must be checked for errors, especially whether the aggregation attribute is set correctly for all metrics.

Migration takes a few hours for large job archives (several hundred GB). A versioned job archive contains a version.txt file in the root directory of the job archive. This file contains the version as an unsigned integer.
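Before starting a migration it is useful to know which version an archive currently has. The sketch below reads `version.txt` from the archive root and returns it as an integer. The function name `read_archive_version` is made up for this illustration, and the example path is an assumption.

```python
from pathlib import Path


def read_archive_version(archive_root: str) -> int:
    """Read the job archive version from version.txt in the archive root."""
    text = (Path(archive_root) / "version.txt").read_text().strip()
    # The file is expected to contain a single unsigned integer
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"unexpected content in version.txt: {text!r}")
    return int(text)


# Example: print(read_archive_version("./var/job-archive"))
```

A missing `version.txt` raises `FileNotFoundError`, which indicates that the directory is not a versioned job archive.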

5.6 - Hands-On Demo

Hands-On Demo for a basic ClusterCockpit setup and API usage (without Docker)

Prerequisites

perl
go
npm
Optional: curl
Script migrateTimestamps.pl

Documentation

You can find READMEs or API docs in

./cc-backend/configs
./cc-backend/init
./cc-backend/api

ClusterCockpit configuration files

cc-backend

./.env Passwords and Tokens set in the environment
./config.json Configuration options for cc-backend

cc-metric-store

./config.json Optional to overwrite configuration options

cc-metric-collector

Not yet included in the hands-on setup.

Setup Components

Start by creating a base folder for all of the following steps.

mkdir clustercockpit
cd clustercockpit

Setup cc-backend

Clone Repository
- git clone https://github.com/ClusterCockpit/cc-backend.git
- cd cc-backend
Build
- make
Activate & configure environment for cc-backend
- cp configs/env-template.txt .env
- Optional: Have a look via vim .env
- Copy the config.json file included in this tarball into the root directory of cc-backend: cp ../../config.json ./
Prepare Datafolder and Database file
- mkdir var
- ./cc-backend -migrate-db
Back to toplevel clustercockpit
- cd ..

Setup cc-metric-store

Clone Repository
- git clone https://github.com/ClusterCockpit/cc-metric-store.git
- cd cc-metric-store
Build Go Executable
- go get
- go build
Prepare Datafolders
- mkdir -p var/checkpoints
- mkdir -p var/archive
Update Config
- vim config.json
- Exchange existing setting in metrics with the following:

"clock":      { "frequency": 60, "aggregation": null },
"cpi":        { "frequency": 60, "aggregation": null },
"cpu_load":   { "frequency": 60, "aggregation": null },
"flops_any":  { "frequency": 60, "aggregation": null },
"flops_dp":   { "frequency": 60, "aggregation": null },
"flops_sp":   { "frequency": 60, "aggregation": null },
"ib_bw":      { "frequency": 60, "aggregation": null },
"lustre_bw":  { "frequency": 60, "aggregation": null },
"mem_bw":     { "frequency": 60, "aggregation": null },
"mem_used":   { "frequency": 60, "aggregation": null },
"rapl_power": { "frequency": 60, "aggregation": null }

Back to toplevel clustercockpit
- cd ..

Setup Demo Data

mkdir source-data
cd source-data
Download JobArchive-Source:
- wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/job-archive-dev.tar.xz
- tar xJf job-archive-dev.tar.xz
- mv ./job-archive ./job-archive-source
- rm ./job-archive-dev.tar.xz
Download CC-Metric-Store Checkpoints:
- mkdir -p cc-metric-store-source/checkpoints
- cd cc-metric-store-source/checkpoints
- wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/cc-metric-store-checkpoints.tar.xz
- tar xf cc-metric-store-checkpoints.tar.xz
- rm cc-metric-store-checkpoints.tar.xz
Back to source-data
- cd ../..
Run timestamp migration script. This may take tens of minutes!
- cp ../migrateTimestamps.pl .
- ./migrateTimestamps.pl
- Expected output:

Starting to update start- and stoptimes in job-archive for emmy
Starting to update start- and stoptimes in job-archive for woody
Done for job-archive
Starting to update checkpoint filenames and data starttimes for emmy
Starting to update checkpoint filenames and data starttimes for woody
Done for checkpoints

Back to toplevel clustercockpit
- cd ..
Copy cluster.json files from source to migrated folders
- cp source-data/job-archive-source/emmy/cluster.json cc-backend/var/job-archive/emmy/
- cp source-data/job-archive-source/woody/cluster.json cc-backend/var/job-archive/woody/
Initialize Job-Archive in SQLite3 job.db and add demo user
- cd cc-backend
- ./cc-backend -init-db -add-user demo:admin:demo
- Expected output:

<6>[INFO]    new user "demo" created (roles: ["admin"], auth-source: 0)
<6>[INFO]    Building job table...
<6>[INFO]    A total of 3936 jobs have been registered in 1.791 seconds.

Back to toplevel clustercockpit
- cd ..

Startup both Apps

In cc-backend root: $./cc-backend -server -dev
- Starts ClusterCockpit at http://localhost:8080
  - Log: <6>[INFO] HTTP server listening at :8080...
- Use local internet browser to access interface
  - You should see and be able to browse finished Jobs
  - Metadata is read from SQLite3 database
  - Metricdata is read from job-archive/JSON-Files
- Create User in settings (top-right corner)
  - Name apiuser
  - Username apiuser
  - Role API
  - Submit & Refresh Page
- Create JWT for apiuser
  - In Userlist, press Gen. JWT for apiuser
  - Save JWT for later use
In cc-metric-store root: $./cc-metric-store
- Starts the cc-metric-store at http://localhost:8081, Log:

2022/07/15 17:17:42 Loading checkpoints newer than 2022-07-13T17:17:42+02:00
2022/07/15 17:17:45 Checkpoints loaded (5621 files, 319 MB, that took 3.034652s)
2022/07/15 17:17:45 API http endpoint listening on '0.0.0.0:8081'

Does not have a graphical interface
Optional: Test function by executing:

$ curl -H "Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw" -D - "http://localhost:8081/api/query" -d "{ \"cluster\": \"emmy\", \"from\": $(expr $(date +%s) - 60), \"to\": $(date +%s), \"queries\": [{
  \"metric\": \"flops_any\",
  \"host\": \"e1111\"
}] }"

HTTP/1.1 200 OK
Content-Type: application/json
Date: Fri, 15 Jul 2022 13:57:22 GMT
Content-Length: 119
{"results":[[JSON-DATA-ARRAY]]}
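The same query can be assembled in Python, which avoids the shell quoting in the curl call. The sketch below only builds the request object and mirrors the curl example: a POST to `/api/query` with a bearer token and a JSON body covering the last 60 seconds. The function name `build_query_request` is hypothetical. Actually sending the request with `urllib.request.urlopen` requires a running cc-metric-store and a valid token.

```python
import json
import time
import urllib.request


def build_query_request(base_url, token, cluster, host, metric, window_seconds=60, now=None):
    """Build (but do not send) a query request for cc-metric-store."""
    now = int(time.time()) if now is None else now
    body = {
        "cluster": cluster,
        "from": now - window_seconds,
        "to": now,
        "queries": [{"metric": metric, "host": host}],
    }
    return urllib.request.Request(
        f"{base_url}/api/query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )


request = build_query_request("http://localhost:8081", "<JWT>", "emmy", "e1111", "flops_any")
print(request.full_url)
print(request.data.decode("utf-8"))
```

Printing the body before sending makes it easy to compare it with the JSON document passed to curl above.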

Development API web interfaces

The -dev flag enables web interfaces to document and test the APIs:

Local GQL Playground - A GraphQL playground. To use it you must have an authenticated session in the same browser.
Local Swagger Docs - A Swagger UI. To use it you have to be logged out, so there must be no user session in the same browser. Use the JWT token with role API generated previously to authenticate via HTTP header.

Use cc-backend API to start job

Enter the URL http://localhost:8080/swagger/index.html in your browser.
Enter your JWT token you generated for the API user by clicking the green Authorize button in the upper right part of the window.
Click the /job/start_job endpoint and click the Try it out button.
Enter the following JSON into the request body text area and fill in a recent start timestamp, which you can obtain by executing date +%s:

{
    "jobId":         100000,
    "arrayJobId":    0,
    "user":          "ccdemouser",
    "subCluster":    "main",
    "cluster":       "emmy",
    "startTime":    <date +%s>,
    "project":       "ccdemoproject",
    "resources":  [
        {"hostname":  "e0601"},
        {"hostname":  "e0823"},
        {"hostname":  "e0337"},
        {"hostname": "e1111"}],
    "numNodes":      4,
    "numHwthreads":  80,
    "walltime":      86400
}

The response body should be the database id of the started job, for example:

{
  "id": 3937
}
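Instead of editing the timestamp by hand, the request body can be generated. The sketch below builds the same fields as the example above and fills in the current Unix time, which is the value `date +%s` prints. The function name `start_job_payload` is hypothetical, and the assumption that every node contributes the same number of hardware threads (20 per node, giving 80 for four nodes) is made for this illustration.

```python
import json
import time


def start_job_payload(job_id, cluster, sub_cluster, user, project, hostnames,
                      hwthreads_per_node, walltime=86400, start_time=None):
    """Build a start_job request body with the fields used in this demo."""
    if start_time is None:
        start_time = int(time.time())  # same value as `date +%s`
    return {
        "jobId": job_id,
        "arrayJobId": 0,
        "user": user,
        "subCluster": sub_cluster,
        "cluster": cluster,
        "startTime": start_time,
        "project": project,
        "resources": [{"hostname": name} for name in hostnames],
        "numNodes": len(hostnames),
        # Assumption: all nodes have the same number of hardware threads
        "numHwthreads": hwthreads_per_node * len(hostnames),
        "walltime": walltime,
    }


payload = start_job_payload(
    100000, "emmy", "main", "ccdemouser", "ccdemoproject",
    ["e0601", "e0823", "e0337", "e1111"], hwthreads_per_node=20,
)
print(json.dumps(payload, indent=4))
```

The printed JSON can be pasted into the request body text area of the Swagger UI.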

Check in ClusterCockpit
- User ccdemouser should appear in Users-Tab with one running job
- It could take up to 5 Minutes until the Job is displayed with some current data (5 Min Short-Job Filter)
- Job then is marked with a green running tag
- Metricdata displayed is read from cc-metric-store!

Use cc-backend API to stop job

Enter the URL http://localhost:8080/swagger/index.html in your browser.
Enter your JWT token you generated for the API user by clicking the green Authorize button in the upper right part of the window.
Click the /job/stop_job/{id} endpoint and click the Try it out button.
Enter the database id at id that was returned by start_job and copy the following into the request body. Replace the timestamp with a recent one:

{
  "cluster": "emmy",
  "jobState": "completed",
  "stopTime": <RECENT TS>
}

On success a JSON document with the job metadata is returned.
Check in ClusterCockpit
- User ccdemouser should appear in Users-Tab with one completed job
- Job is no longer marked with a green running tag -> Completed!
- Metricdata displayed is now read from job-archive!
Check in job-archive
- cd ./cc-backend/var/job-archive/emmy/100/000
- cd $STARTTIME
- Inspect meta.json and data.json
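The directory `100/000` appears to be derived from the job ID 100000 by splitting it into the part above and below one thousand. The sketch below encodes that reading. It is inferred from this single example and not from the archive specification, so treat the function `job_archive_dir` as a hypothetical helper for finding demo jobs.

```python
from pathlib import Path


def job_archive_dir(root, cluster, job_id, start_time):
    """Guess the archive directory of a job from its ID and start time."""
    level1 = job_id // 1000            # 100000 -> 100
    level2 = job_id % 1000             # 100000 -> 0, written as "000"
    return Path(root) / cluster / str(level1) / f"{level2:03d}" / str(start_time)


print(job_archive_dir("./cc-backend/var/job-archive", "emmy", 100000, "<STARTTIME>"))
```

Replace `<STARTTIME>` with the start timestamp that was sent in the start_job request.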

Helper scripts

In this tarball you can find the perl script generate_subcluster.pl that helps to generate the subcluster section for your system. Usage:
Log into an exclusive cluster node.
The LIKWID tools likwid-topology and likwid-bench must be in the PATH!
$./generate_subcluster.pl outputs the subcluster section on stdout

Please be aware that

You have to enter the name and node list for the subCluster manually.
GPU detection only works if LIKWID was built with CUDA available and you also run likwid-topology with CUDA loaded.
Do not blindly trust the measured peakflops values.
Because the script blindly relies on the CSV format output by likwid-topology this is a fragile undertaking!

5.7 - How to add a MOTD notification banner

Add a message of the day banner on the homepage

Overview

To add a notification banner you can add a file notice.txt to the ./var directory of the cc-backend server. As long as this file is present all text in this file is shown in an info banner on the homepage.

As an alternative the admin role can also add and edit the notification banner from the settings view.

5.8 - How to create a `cluster.json` file

How to initially create a cluster configuration

Overview

Every cluster is configured using a dedicated cluster.json file, that is part of the job archive. You can find the JSON schema for it here. This file provides information about the homogeneous hardware partitions within the cluster including the node topology and the metric list. A real production configuration is provided as part of cc-examples.

`cluster.json`: Basics

The cluster.json file contains three top-level parts: the name of the cluster, the metric configuration, and the subcluster list. You can find the latest cluster.json schema here. Basic layout of cluster.json files:

{
  "name": "fritz",
  "metricConfig": [
    {
      "name": "cpu_load",
      ...
    },
    {
      "name": "mem_used",
      ...
    }
  ],
  "subClusters": [
    {
      "name": "main",
      ...
    },
    {
      "name": "spr",
      ...
    }
  ]
}

`cluster.json`: Metric configuration

There is one metric list per cluster. You can find a list of recommended metrics and their naming here. Example for a metric list entry with only the required attributes:

"metricConfig": [
    {
        "name": "flops_sp",
        "unit": {
            "base": "Flops/s",
            "prefix": "G"
        },
        "scope": "hwthread",
        "timestep": 60,
        "aggregation": "sum",
        "peak": 5600,
        "normal": 1000,
        "caution": 200,
        "alert": 50
    }
]

Explanation of required attributes:

name: The metric name.
unit: The metric's unit. Base can be: B (for bytes), F (for flops), B/s, F/s, Flops (for floating point operations), Flops/s (for FLOP rate), CPI (for cycles per instruction), IPC (for instructions per cycle), Hz, W (for Watts), °C, or empty string for no unit. Prefix can be: K, M, G, T, P, or E.
scope: The native metric measurement resolution. Can be node, socket, memoryDomain, core, hwthread, or accelerator.
timestep: The measurement interval in seconds
aggregation: How the metric is aggregated within the node topology. Can be one of sum, avg, or empty string for no aggregation (node level metrics).
Metric thresholds. Whether a threshold applies to larger or smaller values depends on the optional attribute lowerIsBetter (default false).
- peak: The maximum possible metric value
- normal: A common metric value level
- caution: Metric value requires attention
- alert: Metric value requiring immediate attention
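The interaction between the thresholds and lowerIsBetter can be sketched as a small classification function. This is an illustration of the idea, not the evaluation code used by cc-backend. In particular, the strict comparisons at the boundaries and the three result labels are assumptions.

```python
def classify(value, caution, alert, lower_is_better=False):
    """Classify a metric value as "ok", "caution" or "alert"."""
    if lower_is_better:
        # Large values are bad, e.g. memory usage close to the node capacity
        if value > alert:
            return "alert"
        if value > caution:
            return "caution"
    else:
        # Small values are bad, e.g. a very low FLOP rate
        if value < alert:
            return "alert"
        if value < caution:
            return "caution"
    return "ok"


# flops_sp thresholds from the example above: caution 200, alert 50
for value in (30, 150, 1200):
    print(value, classify(value, caution=200, alert=50))
```

With lowerIsBetter set to true the thresholds are read in the opposite direction, so caution has to be smaller than alert.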

Optional attributes:

footprint: Is this a job footprint metric. Set to how the footprint is aggregated: Can be avg, min, or max. Footprint metrics are shown in the footprint UI component and job view polar plot.
energy: Should the metric be used to calculate the job energy. Can be power (metric has unit Watts) or energy (metric has unit Joules).
lowerIsBetter: Is lower better. Influences frontend UI and evaluation of metric thresholds. Default is false.
restrict: Whether to restrict visibility of this metric to non-user roles (admin, support, manager). Default is false. When set to true, regular users cannot view this metric.
subClusters (Type: array of objects): Overwrites for specific subClusters. The metrics per default are valid for all subClusters. It is possible to overwrite or remove metrics for specific subClusters. If a metric is overwritten for a subCluster all attributes have to be set, partial overwrites are not supported. Example for a metric overwrite:

{
    "name": "mem_used",
    "unit": {
        "base": "B",
        "prefix": "G"
    },
    "scope": "node",
    "aggregation": "sum",
    "footprint": "max",
    "timestep": 60,
    "lowerIsBetter": true,
    "peak": 256,
    "normal": 128,
    "caution": 200,
    "alert": 240,
    "subClusters": [
        {
            "name": "spr1tb",
            "footprint": "max",
            "peak": 1024,
            "normal": 512,
            "caution": 900,
            "alert": 1000
        },
        {
            "name": "spr2tb",
            "footprint": "max",
            "peak": 2048,
            "normal": 1024,
            "caution": 1800,
            "alert": 2000
        }
    ]
},

This metric characterizes the memory capacity used by a job. Aggregation for a job is the sum of all node values. As footprint the largest allocated memory capacity is used. For this configuration lowerIsBetter is set, so jobs whose values exceed the metric thresholds are flagged. There are two subClusters with 1 TB and 2 TB memory capacity compared to the default 256 GB.

Example for removing metrics for a subcluster:

{
  "name": "vectorization_ratio",
  "unit": {
    "base": ""
  },
  "scope": "hwthread",
  "aggregation": "avg",
  "timestep": 60,
  "peak": 100,
  "normal": 60,
  "caution": 40,
  "alert": 10,
  "subClusters": [
    {
      "name": "icelake",
      "remove": true
    }
  ]
}
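To see which metrics remain visible for a given subcluster, the `remove` entries can be evaluated with a few lines of Python. The sketch below works on the `metricConfig` list of an already parsed cluster.json. The function name `metrics_for_subcluster` is hypothetical, and the sample data is shortened to the attributes the function needs.

```python
def metrics_for_subcluster(metric_config, subcluster):
    """Return the names of all metrics that are not removed for a subcluster."""
    names = []
    for metric in metric_config:
        removed = any(
            override.get("name") == subcluster and override.get("remove", False)
            for override in metric.get("subClusters", [])
        )
        if not removed:
            names.append(metric["name"])
    return names


metric_config = [
    {"name": "mem_used"},
    {"name": "vectorization_ratio", "subClusters": [{"name": "icelake", "remove": True}]},
]
print(metrics_for_subcluster(metric_config, "icelake"))
print(metrics_for_subcluster(metric_config, "main"))
```

For the icelake subcluster the vectorization_ratio metric is left out, while all other subclusters keep it.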

`cluster.json`: subcluster configuration

SubClusters in ClusterCockpit are subsets of a cluster with homogeneous hardware. The subCluster part specifies the node topology, a list of nodes that are part of a subCluster, and the node capabilities that are used to draw the roofline diagrams.

Topology Structure

The topology section defines the hardware topology using nested arrays that map relationships between hardware threads, cores, sockets, memory domains, and dies:

node: Flat list of all hardware thread IDs on the node
socket: Hardware threads grouped by physical CPU socket (2D array)
memoryDomain: Hardware threads grouped by NUMA domain (2D array)
die: Optional grouping by CPU die within sockets (2D array). This is used for multi-die processors where each socket contains multiple dies. If not applicable, use an empty array []
core: Hardware threads grouped by physical core (2D array)
accelerators: Optional list of attached accelerators (GPUs, FPGAs, etc.)
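Because the topology arrays are usually long and written by hand or by a script, a consistency check can catch copy-and-paste errors. The sketch below assumes that every non-empty grouping level should contain each hardware thread of the node list exactly once. That rule is an assumption made for this illustration, and the function name `check_topology` is hypothetical.

```python
def check_topology(topology):
    """Return a list of problems found in a topology dictionary."""
    node_threads = sorted(topology["node"])
    problems = []
    for level in ("socket", "memoryDomain", "die", "core"):
        groups = topology.get(level) or []
        if not groups:
            continue  # optional or unused level, e.g. "die": []
        grouped_threads = sorted(thread for group in groups for thread in group)
        # Comparing sorted lists detects missing, extra and duplicated IDs
        if grouped_threads != node_threads:
            problems.append(f"{level}: hardware threads do not match the node list")
    return problems


topology = {
    "node": [0, 1, 2, 3],
    "socket": [[0, 1], [2, 3]],
    "memoryDomain": [[0, 1], [2, 3]],
    "die": [],
    "core": [[0], [1], [2], [3]],
}
print(check_topology(topology))
```

An empty list means that all grouping levels cover the same set of hardware threads as the node list.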

The resource ID for CPU cores is the OS processor ID. For GPUs we recommend using the PCI-E address as resource ID.

Here is an example:

{
  "name": "icelake",
  "nodes": "w22[01-35],w23[01-35],w24[01-20],w25[01-20]",
  "processorType": "Intel Xeon Gold 6326",
  "socketsPerNode": 2,
  "coresPerSocket": 16,
  "threadsPerCore": 1,
  "flopRateScalar": {
    "unit": {
      "base": "F/s",
      "prefix": "G"
    },
    "value": 432
  },
  "flopRateSimd": {
    "unit": {
      "base": "F/s",
      "prefix": "G"
    },
    "value": 9216
  },
  "memoryBandwidth": {
    "unit": {
      "base": "B/s",
      "prefix": "G"
    },
    "value": 350
  },
  "topology": {
    "node": [
      0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
      21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
      39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56,
      57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71
    ],
    "socket": [
      [
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35
      ],
      [
        36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
        54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71
      ]
    ],
    "memoryDomain": [
      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
      [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35],
      [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53],
      [54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71]
    ],
    "die": [],
    "core": [
      [0],
      [1],
      [2],
      [3],
      [4],
      [5],
      [6],
      [7],
      [8],
      [9],
      [10],
      [11],
      [12],
      [13],
      [14],
      [15],
      [16],
      [17],
      [18],
      [19],
      [20],
      [21],
      [22],
      [23],
      [24],
      [25],
      [26],
      [27],
      [28],
      [29],
      [30],
      [31],
      [32],
      [33],
      [34],
      [35],
      [36],
      [37],
      [38],
      [39],
      [40],
      [41],
      [42],
      [43],
      [44],
      [45],
      [46],
      [47],
      [48],
      [49],
      [50],
      [51],
      [52],
      [53],
      [54],
      [55],
      [56],
      [57],
      [58],
      [59],
      [60],
      [61],
      [62],
      [63],
      [64],
      [65],
      [66],
      [67],
      [68],
      [69],
      [70],
      [71]
    ]
  }
}
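The groupings in the topology section each have to cover the hardware threads listed in node. The following is a minimal Python sketch of such a consistency check; it is not part of cc-backend, and the function name check_topology as well as the assumption that every non-empty grouping must contain each hardware thread exactly once are ours.

```python
from collections import Counter


def check_topology(topology: dict) -> list[str]:
    """Return a list of problems; an empty list means the groupings are consistent.

    Assumption (not taken from cc-backend): every non-empty grouping must
    contain each hardware thread ID from "node" exactly once.
    """
    node_ids = set(topology["node"])
    problems = []
    for level in ("socket", "memoryDomain", "die", "core"):
        groups = topology.get(level) or []  # "die" may be empty or absent
        if not groups:
            continue
        counts = Counter(hwt for group in groups for hwt in group)
        grouped_ids = set(counts)
        missing = sorted(node_ids - grouped_ids)
        unknown = sorted(grouped_ids - node_ids)
        repeated = sorted(hwt for hwt, n in counts.items() if n > 1)
        if missing or unknown or repeated:
            problems.append(
                f"{level}: missing={missing} unknown={unknown} repeated={repeated}"
            )
    return problems
```

Loading a subCluster entry with json.load and passing its "topology" object to check_topology reports, for example, a hardware thread that was skipped in the core list.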

Since it is tedious to write this by hand, we provide a Perl script as part of cc-backend that generates a subCluster template. This script only works if the LIKWID tools are installed and in the PATH. You also need the LIKWID library for cc-metric-collector. You find instructions on how to install LIKWID here.

Example: SubCluster with GPU Accelerators

Here is an example for a subCluster with GPU accelerators:

{
  "name": "a100m80",
  "nodes": "a[0531-0537],a[0631-0633],a0731,a[0831-0833],a[0931-0934]",
  "processorType": "AMD Milan",
  "socketsPerNode": 2,
  "coresPerSocket": 64,
  "threadsPerCore": 1,
  "flopRateScalar": {
    "unit": {
      "base": "F/s",
      "prefix": "G"
    },
    "value": 432
  },
  "flopRateSimd": {
    "unit": {
      "base": "F/s",
      "prefix": "G"
    },
    "value": 9216
  },
  "memoryBandwidth": {
    "unit": {
      "base": "B/s",
      "prefix": "G"
    },
    "value": 400
  },
  "topology": {
    "node": [
      0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
      21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
      39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56,
      57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
      75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
      93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108,
      109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123,
      124, 125, 126, 127
    ],
    "socket": [
      [
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37,
        38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
        56, 57, 58, 59, 60, 61, 62, 63
      ],
      [
        64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81,
        82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,
        100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
        114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127
      ]
    ],
    "memoryDomain": [
      [
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37,
        38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
        56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73,
        74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91,
        92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,
        108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121,
        122, 123, 124, 125, 126, 127
      ]
    ],
    "core": [
      [0],
      [1],
      [2],
      [3],
      [4],
      [5],
      [6],
      [7],
      [8],
      [9],
      [10],
      [11],
      [12],
      [13],
      [14],
      [15],
      [16],
      [17],
      [18],
      [19],
      [20],
      [21],
      [22],
      [23],
      [24],
      [25],
      [26],
      [27],
      [28],
      [29],
      [30],
      [31],
      [32],
      [33],
      [34],
      [35],
      [36],
      [37],
      [38],
      [39],
      [40],
      [41],
      [42],
      [43],
      [44],
      [45],
      [46],
      [47],
      [48],
      [49],
      [50],
      [51],
      [52],
      [53],
      [54],
      [55],
      [56],
      [57],
      [58],
      [59],
      [60],
      [61],
      [62],
      [63],
      [64],
      [65],
      [66],
      [67],
      [68],
      [69],
      [70],
      [71],
      [72],
      [73],
      [74],
      [75],
      [76],
      [77],
      [78],
      [79],
      [80],
      [81],
      [82],
      [83],
      [84],
      [85],
      [86],
      [87],
      [88],
      [89],
      [90],
      [91],
      [92],
      [93],
      [94],
      [95],
      [96],
      [97],
      [98],
      [99],
      [100],
      [101],
      [102],
      [103],
      [104],
      [105],
      [106],
      [107],
      [108],
      [109],
      [110],
      [111],
      [112],
      [113],
      [114],
      [115],
      [116],
      [117],
      [118],
      [119],
      [120],
      [121],
      [122],
      [123],
      [124],
      [125],
      [126],
      [127]
    ],
    "accelerators": [
      {
        "id": "00000000:0E:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:13:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:49:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:4F:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:90:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:96:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:CC:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:D1:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      }
    ]
  }
}

Important: Each accelerator requires three fields:

id: Unique identifier (PCI-E address recommended, e.g., “00000000:0E:00.0”)
type: Type of accelerator. Valid values are: "Nvidia GPU", "AMD GPU", "Intel GPU"
model: Specific model name (e.g., “A100”, “MI100”)

You must ensure that the metric collector and the Slurm adapter use the same identifier format (PCI-E address) for the accelerator resource ID.
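The accelerator requirements above can be checked before deploying a cluster.json. This is a small Python sketch under our own assumptions: the function name check_accelerators is hypothetical, and the PCI-E address pattern is derived only from the example IDs shown above (the address format is recommended, not enforced, so the check can be switched off).

```python
import re

VALID_TYPES = {"Nvidia GPU", "AMD GPU", "Intel GPU"}
# domain:bus:device.function, modelled on IDs such as "00000000:0E:00.0"
PCI_ADDRESS = re.compile(r"^[0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-7]$")


def check_accelerators(accelerators: list, require_pci: bool = True) -> list:
    """Return a list of problems found in the "accelerators" array."""
    problems = []
    seen_ids = set()
    for index, acc in enumerate(accelerators):
        # All three fields are required and must be non-empty.
        for field in ("id", "type", "model"):
            if not acc.get(field):
                problems.append(f"accelerator {index}: missing field '{field}'")
        acc_type = acc.get("type")
        if acc_type and acc_type not in VALID_TYPES:
            problems.append(f"accelerator {index}: invalid type '{acc_type}'")
        acc_id = acc.get("id")
        if acc_id:
            if require_pci and not PCI_ADDRESS.match(acc_id):
                problems.append(f"accelerator {index}: id '{acc_id}' is not a PCI-E address")
            if acc_id in seen_ids:
                problems.append(f"accelerator {index}: duplicate id '{acc_id}'")
            seen_ids.add(acc_id)
    return problems
```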

5.9 - How to customize cc-backend

Add legal texts, modify login page, and add custom logo.

Overview

Customizing cc-backend means replacing the placeholder logo, legal texts, and login template with your own. You can also place a text file in ./var to add dynamic status or notification messages to the ClusterCockpit homepage.

Replace legal texts

To replace the imprint.tmpl and privacy.tmpl legal texts, you can place your version in ./var/. At startup cc-backend will check if ./var/imprint.tmpl and/or ./var/privacy.tmpl exist and use them instead of the built-in placeholders. You can use the placeholders in web/templates as a blueprint.

To replace the default login layout and styling, you can place your version in ./var/. At startup cc-backend will check if ./var/login.tmpl exists and use it instead of the built-in placeholder. You can use the default template web/templates/login.tmpl as a blueprint.

Replace logo

To change the logo displayed in the navigation bar, you can provide the file logo.png in the folder ./var/img/. On startup cc-backend will check if the folder exists and use the images provided there instead of the built-in images. You may also place additional images there that you use in a custom login template.

To add a notification banner you can add a file notice.txt to ./var. As long as this file is present all text in this file is shown in an info banner on the homepage.

5.10 - How to deploy and update cc-backend

Recommended deployment and update workflow for production use

Workflow for deployment

Why we do not provide a docker container

It is recommended to install all ClusterCockpit components in a common directory, e.g. /opt/monitoring, /var/monitoring or /var/clustercockpit. In the following we use /opt/monitoring.

Two systemd services run on the central monitoring server:

clustercockpit: binary cc-backend in /opt/monitoring/cc-backend.
cc-metric-store: binary cc-metric-store in /opt/monitoring/cc-metric-store.

Workflow to update

This example assumes the DB and job archive versions did not change. In case the new binary requires a newer database or job archive version read here how to migrate to newer versions.

Stop systemd service:

sudo systemctl stop clustercockpit.service

Back up the SQLite DB file! This is as simple as copying it.
Copy new cc-backend binary to /opt/monitoring/cc-backend/archive (Tip: Use a date tag like YYYYMMDD-cc-backend). Here is an example:

cp ~/cc-backend /opt/monitoring/cc-backend/archive/20231124-cc-backend

Link from cc-backend root to current version

ln -sfn /opt/monitoring/cc-backend/archive/20231124-cc-backend /opt/monitoring/cc-backend/cc-backend

Start systemd service:

sudo systemctl start clustercockpit.service

Check if everything is ok:

sudo systemctl status clustercockpit.service

Check log for issues:

sudo journalctl -u clustercockpit.service

Finally, check the ClusterCockpit web frontend and your Slurm adapters to make sure nothing is broken.
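The copy and link steps of this workflow can be scripted. Below is a hedged Python sketch that covers only those two steps: it copies the new binary into the archive folder under a date tag and then switches the cc-backend symlink. The function name install_version and the temporary link name cc-backend.new are our own; stopping the service, backing up the database, and restarting remain manual steps as described above.

```python
import os
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def install_version(new_binary: Path, root: Path, tag: Optional[str] = None) -> Path:
    """Copy new_binary to <root>/archive/<tag>-cc-backend and relink <root>/cc-backend."""
    tag = tag or date.today().strftime("%Y%m%d")  # date tag in YYYYMMDD form
    archive = root / "archive"
    archive.mkdir(parents=True, exist_ok=True)
    target = archive / f"{tag}-cc-backend"
    shutil.copy2(new_binary, target)  # keeps the executable permission bits

    link = root / "cc-backend"
    tmp_link = root / "cc-backend.new"  # hypothetical temporary name
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink(target, tmp_link)
    # Renaming over the old link replaces it in one step, so the path
    # <root>/cc-backend always points at a complete binary.
    os.replace(tmp_link, link)
    return target
```

A call such as install_version(Path.home() / "cc-backend", Path("/opt/monitoring/cc-backend"), "20231124") mirrors the two commands shown in the workflow. Older versions stay in the archive folder, which makes a rollback a matter of relinking.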

5.11 - How to enable and configure auto-tagging

Enable automatic job tagging for application detection and job classification

Overview

ClusterCockpit provides automatic job tagging to classify and categorize jobs based on configurable rules. The tagging system consists of two components:

Application Detection - Identifies which application a job is running by matching patterns in the job script
Job Classification - Analyzes job performance metrics to identify performance issues or characteristics

Tags are automatically applied when jobs start or stop, and can also be applied retroactively to existing jobs. This feature is disabled by default and must be explicitly enabled in the configuration.

Enable auto-tagging

Step 1: Copy configuration files

The tagging system requires configuration files to define application patterns and classification rules. Example configurations are provided in the cc-backend repository at configs/tagger/.

From the cc-backend root directory, copy the configuration files to the var directory:

mkdir -p var/tagger
cp -r configs/tagger/apps var/tagger/
cp -r configs/tagger/jobclasses var/tagger/

This copies:

Application patterns (var/tagger/apps/) - Text files containing regex patterns to match application names in job scripts (16 example applications)
Job classification rules (var/tagger/jobclasses/) - JSON files defining rules to classify jobs based on metrics (3 example rules)
Shared parameters (var/tagger/jobclasses/parameters.json) - Common threshold values used across multiple classification rules

Step 2: Enable in configuration

Add or set the enable-job-taggers configuration option in your config.json:

{
  "enable-job-taggers": true
}

Important: Automatic tagging is disabled by default. Setting this to true activates automatic tagging for jobs that start or stop after cc-backend is restarted.

Step 3: Restart cc-backend

The tagging system loads configuration from ./var/tagger/ at startup:

./cc-backend -server

Step 4: Verify configuration loaded

Check the logs for messages indicating successful initialization:

[INFO] Setup file watch for ./var/tagger/apps
[INFO] Setup file watch for ./var/tagger/jobclasses

These messages confirm the tagging system is active and watching for configuration changes.

How auto-tagging works

Automatic tagging

When enable-job-taggers is set to true, tags are automatically applied at two points in the job lifecycle:

Job Start - Application detection runs immediately when a job starts, analyzing the job script to identify the application
Job Stop - Job classification runs when a job completes, analyzing metrics to identify performance characteristics

Note: Only jobs that start or stop after enabling the feature are automatically tagged. Existing jobs require manual tagging (see below).

Manual tagging (retroactive)

To apply tags to existing jobs in the database, use the -apply-tags command line option:

./cc-backend -apply-tags

This processes all jobs in the database and applies current tagging rules. This is useful when:

You have existing jobs created before tagging was enabled
You’ve added new tagging rules and want to apply them to historical data
You’ve modified existing rules and want to re-evaluate all jobs

The -apply-tags option works independently of the enable-job-taggers configuration setting.

Hot reload

The tagging system watches configuration directories for changes. You can modify or add rules without restarting cc-backend:

Changes to var/tagger/apps/* are detected automatically
Changes to var/tagger/jobclasses/* are detected automatically

Simply edit the files and the new rules will be applied to subsequent jobs.

Application detection

Application detection identifies which software a job is running by matching patterns in the job script.

Configuration format

Application patterns are stored in text files under var/tagger/apps/. Each file represents one application, and the filename (without .txt extension) becomes the tag name.

Each file contains one or more regular expression patterns, one per line:

Example: var/tagger/apps/vasp.txt

vasp
VASP

Example: var/tagger/apps/python.txt

python
pip
anaconda
conda

How it works

When a job starts, the system retrieves the job script from metadata
Each line in the app configuration files is treated as a regex pattern
Patterns are matched case-insensitively against the lowercased job script
If a match is found, a tag of type app with the filename as tag name is applied
Only the first matching application is tagged
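The matching steps above can be sketched in a few lines of Python. This is an illustration only, not the cc-backend implementation (which is written in Go): the function name detect_app is hypothetical, and we assume that applications are tried in the order in which they appear in the dictionary, since the order in which cc-backend visits the pattern files is not stated here.

```python
import re
from typing import Optional


def detect_app(job_script: str, app_patterns: dict) -> Optional[str]:
    """Return the name of the first application whose pattern matches, or None.

    app_patterns maps a tag name (the file name without .txt) to the list of
    regex patterns read from that file, one pattern per line.
    """
    script = job_script.lower()
    for app, patterns in app_patterns.items():
        for pattern in patterns:
            # Case-insensitive match against the lowercased script.
            if re.search(pattern, script, flags=re.IGNORECASE):
                return app  # only the first matching application is tagged
    return None
```

Because patterns are searched anywhere in the script, a short pattern such as pip also matches words like pipeline. Anchoring patterns with word boundaries (for example \bpip\b) reduces such false positives.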

Adding new applications

To add detection for a new application:

Create a new file in var/tagger/apps/ (e.g., tensorflow.txt)
Add regex patterns, one per line:
```
tensorflow
tf\.keras
import tensorflow
```
The file is automatically detected and loaded (no restart required)

The tag name will be the filename without the .txt extension (e.g., tensorflow).

Provided application patterns

The example configuration includes patterns for 16 common HPC applications:

vasp
python
gromacs
lammps
openfoam
starccm
matlab
julia
cp2k
cpmd
chroma
flame
caracal
turbomole
orca
alf

Job classification

Job classification analyzes completed jobs based on their metrics and properties to identify performance issues or characteristics.

Configuration format

Job classification rules are defined in JSON files under var/tagger/jobclasses/. Each rule file contains:

Metrics required - Which job metrics to analyze
Requirements - Pre-conditions that must be met
Variables - Computed values used in the rule
Rule expression - Boolean expression that determines if the rule matches
Hint template - Message displayed when the rule matches

Shared parameters

The file var/tagger/jobclasses/parameters.json defines threshold values used across multiple rules:

{
  "lowcpuload_threshold_factor": 0.9,
  "excessivecpuload_threshold_factor": 1.1,
  "job_min_duration_seconds": 600.0,
  "sampling_interval_seconds": 30.0
}

These parameters can be referenced in rule expressions and make it easy to maintain consistent thresholds across multiple rules.

Rule file structure

Each classification rule is a JSON file with the following structure:

Example: var/tagger/jobclasses/lowload.json

{
  "name": "Low CPU load",
  "tag": "lowload",
  "parameters": ["lowcpuload_threshold_factor", "job_min_duration_seconds"],
  "metrics": ["cpu_load"],
  "requirements": [
    "job.shared == \"none\"",
    "job.duration > job_min_duration_seconds"
  ],
  "variables": [
    {
      "name": "load_threshold",
      "expr": "job.numCores * lowcpuload_threshold_factor"
    }
  ],
  "rule": "cpu_load.avg < cpu_load.limits.caution",
  "hint": "Average CPU load {{.cpu_load.avg}} falls below threshold {{.cpu_load.limits.caution}}"
}

Field descriptions

Field	Description
`name`	Human-readable description of the rule
`tag`	Tag identifier applied when the rule matches
`parameters`	List of parameter names from `parameters.json` to include in rule environment
`metrics`	List of metrics required for evaluation (must be present in job data)
`requirements`	Boolean expressions that must all be true for the rule to be evaluated
`variables`	Named expressions computed before evaluating the main rule
`rule`	Boolean expression that determines if the job matches this classification
`hint`	Go template string for generating a user-visible message

Expression environment

Expressions in requirements, variables, and rule have access to:

Job properties:

job.shared - Shared node allocation type
job.duration - Job runtime in seconds
job.numCores - Number of CPU cores
job.numNodes - Number of nodes
job.jobState - Job completion state
job.numAcc - Number of accelerators
job.smt - SMT setting

Metric statistics (for each metric in metrics):

<metric>.min - Minimum value
<metric>.max - Maximum value
<metric>.avg - Average value
<metric>.limits.peak - Peak limit from cluster config
<metric>.limits.normal - Normal threshold
<metric>.limits.caution - Caution threshold
<metric>.limits.alert - Alert threshold

Parameters:

All parameters listed in the parameters field

Variables:

All variables defined in the variables array
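To make the evaluation order concrete, here is a Python sketch that builds such an environment and evaluates a rule in the order requirements, variables, rule. It is a simplified model and not the actual implementation: the function name evaluate_rule is hypothetical, and Python's eval stands in for the expr language, which only works for expressions that happen to be valid in both (operators such as && and || would not evaluate this way).

```python
from types import SimpleNamespace


def evaluate_rule(rule: dict, job: dict, stats: dict, parameters: dict):
    """Return (matched, env) for one rule and one job.

    stats maps a metric name to {"min", "max", "avg", "limits": {...}}.
    """
    env = {"job": SimpleNamespace(**job)}
    for metric in rule["metrics"]:
        if metric not in stats:
            return False, env  # a required metric is missing from the job data
        values = dict(stats[metric])
        values["limits"] = SimpleNamespace(**values["limits"])
        env[metric] = SimpleNamespace(**values)
    for name in rule.get("parameters", []):
        env[name] = parameters[name]

    def run(expression: str):
        # Illustration only: never use eval on untrusted rule files.
        return eval(expression, {"__builtins__": {}}, env)

    # 1. All requirements must hold, otherwise the rule is skipped.
    if not all(run(req) for req in rule.get("requirements", [])):
        return False, env
    # 2. Variables are computed in order and added to the environment.
    for variable in rule.get("variables", []):
        env[variable["name"]] = run(variable["expr"])
    # 3. The main rule decides whether the tag is applied.
    return bool(run(rule["rule"])), env
```

With the lowload.json example, a two-hour exclusive job on 64 cores with an average cpu_load of 12.5 and a caution limit of 32 satisfies both requirements and matches the rule. The returned environment then also contains load_threshold, which a hint template could reference.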

Expression language

Rules use the expr language for expressions. Supported operations:

Arithmetic: +, -, *, /, %, ^
Comparison: ==, !=, <, <=, >, >=
Logical: &&, ||, !
Functions: Standard math functions (see expr documentation)

Hint templates

Hints use Go’s text/template syntax. Variables from the evaluation environment are accessible:

{{.cpu_load.avg}}     # Access metric average
{{.job.duration}}     # Access job property
{{.load_threshold}}   # Access computed variable

Adding new classification rules

To add a new classification rule:

Create a new JSON file in var/tagger/jobclasses/ (e.g., memoryLeak.json)
Define the rule structure following the format above
Add any new parameters to parameters.json if needed
The file is automatically detected and loaded (no restart required)

Example: Detecting memory leaks

{
  "name": "Memory Leak Detection",
  "tag": "memory_leak",
  "parameters": ["memory_leak_slope_threshold"],
  "metrics": ["mem_used"],
  "requirements": ["job.duration > 3600"],
  "variables": [
    {
      "name": "mem_growth",
      "expr": "(mem_used.max - mem_used.min) / job.duration"
    }
  ],
  "rule": "mem_growth > memory_leak_slope_threshold",
  "hint": "Memory usage grew by {{.mem_growth}} bytes per second"
}

Don’t forget to add memory_leak_slope_threshold to parameters.json.
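A forgotten parameter is easy to miss, so a quick check before deploying new rules can help. The following Python sketch lists, per rule file, the parameter names that are referenced in the parameters field but not defined in parameters.json. The function name missing_parameters is our own and the script is not part of cc-backend.

```python
import json
from pathlib import Path


def missing_parameters(jobclasses_dir) -> dict:
    """Map rule file name -> parameters it references but parameters.json lacks."""
    directory = Path(jobclasses_dir)
    known = set(json.loads((directory / "parameters.json").read_text()))
    missing = {}
    for rule_file in sorted(directory.glob("*.json")):
        if rule_file.name == "parameters.json":
            continue  # the shared parameter file is not a rule
        rule = json.loads(rule_file.read_text())
        unknown = [name for name in rule.get("parameters", []) if name not in known]
        if unknown:
            missing[rule_file.name] = unknown
    return missing
```

Running missing_parameters("var/tagger/jobclasses") after adding the memory leak rule, but before editing parameters.json, would report memory_leak_slope_threshold for that rule file.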

Provided classification rules

The example configuration includes 3 classification rules:

lowload - Detects jobs with low CPU load (avg CPU load below caution threshold)
excessiveload - Detects jobs with excessive CPU load (avg CPU load above peak × threshold factor)
lowutilization - Detects jobs with low resource utilization (flop rate below alert threshold)

Troubleshooting

Tags not applied

Check tagging is enabled: Verify enable-job-taggers: true is set in config.json

Check configuration exists:

ls -la var/tagger/apps
ls -la var/tagger/jobclasses

Check logs for errors:
```
./cc-backend -server -loglevel debug
```
Verify file permissions: Ensure cc-backend can read the configuration files
For existing jobs: Use ./cc-backend -apply-tags to retroactively tag jobs

Rules not matching

Enable debug logging: Set log level to debug to see detailed rule evaluation:
```
./cc-backend -server -loglevel debug
```
Check requirements: Ensure all requirements in the rule are satisfied
Verify metrics exist: Classification rules require job metrics to be available in the job data
Check metric names: Ensure metric names in rules match those in your cluster configuration

File watch not working

If changes to configuration files aren’t detected automatically:

Restart cc-backend to reload all configuration
Check filesystem supports file watching (some network filesystems may not support inotify)
Check logs for file watch setup messages

Best practices

Start simple: Begin with basic rules and refine based on results
Use requirements: Filter out irrelevant jobs early with requirements to avoid unnecessary metric processing
Test incrementally: Add one rule at a time and verify behavior before adding more
Document rules: Use descriptive names and clear hint messages
Share parameters: Define common thresholds in parameters.json for consistency
Version control: Keep your var/tagger/ configuration in version control to track changes
Backup before changes: Test new rules on a development instance before deploying to production

Tag types and usage

The tagging system creates two types of tags:

app - Application tags (e.g., “vasp”, “gromacs”, “python”)
jobClass - Classification tags (e.g., “lowload”, “excessiveload”, “lowutilization”)

Tags can be:

Queried and filtered in the ClusterCockpit UI
Used in API queries to find jobs with specific characteristics
Referenced in reports and analytics

Tags are stored in the database and appear in the job details view, making it easy to identify application usage and performance patterns across your cluster.

5.12 - How to generate JWT tokens

Overview

ClusterCockpit uses JSON Web Tokens (JWT) for authorization of its APIs. JWTs are the industry standard for securing APIs and are also used, for example, in OAuth2. For details on JWTs refer to the JWT article in the Concepts section.

When a user logs in via the /login page using a browser, a session cookie is used for all requests after the successful login. The cookie is secured using the random bytes in the SESSION_KEY env variable, which you should also change in production. JWTs make it easier to use the ClusterCockpit APIs from scripts or other external programs. The token is specified in the Authorization HTTP header using the Bearer schema (there is an example below). Tokens can be issued to users from the configuration view in the Web-UI or from the command line (using the -jwt <username> option). To use the token for API endpoints such as /api/jobs/start_job/, the user it was issued to needs to have the api role. Regular users can only perform read-only queries and only look at data connected to jobs they started themselves.

There are two usage scenarios:

The APIs are used during a browser session. API accesses are authorized with the active session.
The REST API is used outside a browser session, e.g. by scripts. In this case you have to issue a token manually. This is possible from within the configuration view or on the command line. It is recommended to issue the JWT for a special user that only has the api role. Using different users for different purposes enables fine-grained access control and access revocation management.

The token is commonly specified in the Authorization HTTP header using the Bearer schema. ClusterCockpit uses an Ed25519 (EdDSA) private/public key pair to sign and verify its tokens. You can use cc-backend to generate new JWT tokens.

Create a new Ed25519 public/private key pair for signing and validating tokens

We provide a small utility tool as part of cc-backend:

go build ./cmd/gen-keypair/
./gen-keypair

Add key pair in your `.env` file for `cc-backend`

An env file template can be found in ./configs. cc-backend requires the private key to sign newly generated JWT tokens and the public key to validate tokens used to authenticate in its REST APIs.

Generate new JWT token

Every user with the admin role can create or change a user in the configuration view of the web interface. To generate a new JWT for a user just press the GenJWT button behind the user name in the user list.

A new api user and a corresponding JWT can also be generated from the command line.

Create new API user with admin and api role:

./cc-backend -add-user myapiuser:admin,api:<password>

Create a new JWT token for this user:

./cc-backend -jwt myapiuser

Use issued token on client side

curl -X GET "<API ENDPOINT>" -H "accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer <JWT TOKEN>"

This token can be used for the cc-backend REST API as well as for the cc-metric-store. If you use the token for cc-metric-store you have to configure it to use the corresponding public key for validation in its config.json.
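For scripts, the same request can be made without curl. The sketch below uses only the Python standard library; the function names build_api_request and fetch_json are hypothetical, and the endpoint URL and token are placeholders you have to supply, exactly as in the curl example.

```python
import json
import urllib.request


def build_api_request(endpoint_url: str, jwt_token: str) -> urllib.request.Request:
    """Build a GET request that carries the token in the Authorization header."""
    return urllib.request.Request(
        endpoint_url,
        method="GET",
        headers={
            "Accept": "application/json",
            "Authorization": f"Bearer {jwt_token}",  # Bearer schema
        },
    )


def fetch_json(endpoint_url: str, jwt_token: str, timeout: float = 10.0):
    """Send the request and decode the JSON response body."""
    request = build_api_request(endpoint_url, jwt_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping the token out of the script itself, for example by reading it from an environment variable or a file with restricted permissions, avoids leaking it through version control.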

Note

By default the JWT tokens generated by cc-backend do not expire! To set an expiration date you have to configure an expiration duration in config.json. You find details here, use keys jwts:max-age.

Of course the JWT token can also be generated by other means, as long as it is signed with an ED25519 private key and the corresponding public key is configured in cc-backend or cc-metric-store. For the claims that are set and used by ClusterCockpit refer to the JWT article.
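When working with tokens from different sources, it helps to look at what a token actually contains. A JWT consists of three base64url-encoded segments separated by dots, so header and claims can be inspected with the Python standard library. The sketch below only decodes; it does not verify the signature and must not be used to decide whether a token is trustworthy. The function names are our own.

```python
import base64
import json


def decode_segment(segment: str) -> dict:
    """Decode one base64url-encoded JSON segment of a JWT."""
    padded = segment + "=" * (-len(segment) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def inspect_token(token: str):
    """Return (header, claims) of a JWT without verifying its signature."""
    header_b64, payload_b64, _signature = token.split(".")
    return decode_segment(header_b64), decode_segment(payload_b64)
```

The header shows the signing algorithm, and the claims show, for example, whether an expiration time is set.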

cc-metric-store

The cc-metric-store also uses JWTs for authentication. As it does not issue new tokens, it does not need to know the private key. The public key of the keypair that is used to generate the JWTs that grant access to the cc-metric-store can be specified in its config.json. When configuring the metricDataRepository object in the cluster.json file of the job-archive, you can put a token issued by cc-backend itself.

Other tools to generate signed tokens

The golang-jwt project provides a small command line tool to sign and verify tokens. You can install it with:

go install github.com/golang-jwt/jwt/v5/cmd/jwt@latest

OpenSSL can be used to generate ED25519 key-pairs:

# Generate ed25519 private key
openssl genpkey -algorithm ed25519 -out privkey.pem
# export its public key
openssl pkey -in privkey.pem -pubout -out pubkey.pem

5.13 - How to plan and configure resampling

Configure metric resampling

Enable timeseries resampling

ClusterCockpit now supports resampling of time series data to a lower frequency. This dramatically improves load times for very large or very long jobs, and we recommend enabling it. Resampling is supported for running as well as for finished jobs.

Note: For running jobs, this currently only works with the newest version of cc-metric-store. Resampling support for the Prometheus time series database will be added in the future.

Resampling Algorithm

To preserve visual accuracy while reducing data points, ClusterCockpit utilizes the Largest-Triangle-Three-Buckets (LTTB) algorithm.

Standard downsampling methods often fail to represent data accurately:

Averaging: Tends to smooth out important peaks and valleys, hiding critical performance spikes.
Decimation (Step sampling): Simply skips points, which can lead to random data loss and missed outliers.

In contrast, LTTB uses a geometric approach to select data points that form the largest triangles effectively. This technique creates a downsampled representation that retains the perceptual shape of the original line graph, ensuring that significant extrema and performance trends remain visible even at lower resolutions.
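To illustrate the idea, here is a compact Python sketch of LTTB. It follows the commonly described form of the algorithm and is not the code used by ClusterCockpit: the data is split into buckets, the first and last points are always kept, and from each bucket the point is chosen that forms the largest triangle with the previously selected point and the average of the next bucket.

```python
def lttb(points, threshold):
    """Downsample a list of (x, y) tuples to `threshold` points using LTTB."""
    n = len(points)
    if threshold >= n or threshold < 3:
        return list(points)  # nothing to do

    sampled = [points[0]]  # always keep the first point
    bucket_size = (n - 2) / (threshold - 2)
    a = 0  # index of the most recently selected point

    for i in range(threshold - 2):
        # Index range of the current bucket.
        start = int(i * bucket_size) + 1
        end = int((i + 1) * bucket_size) + 1
        # Average point of the next bucket (the tail of the data for the last bucket).
        next_end = min(int((i + 2) * bucket_size) + 1, n)
        next_bucket = points[end:next_end]
        avg_x = sum(p[0] for p in next_bucket) / len(next_bucket)
        avg_y = sum(p[1] for p in next_bucket) / len(next_bucket)

        ax, ay = points[a]
        best_area, best_index = -1.0, start
        for j in range(start, end):
            bx, by = points[j]
            # Area of the triangle (selected point, candidate, next average).
            area = abs((ax - avg_x) * (by - ay) - (ax - bx) * (avg_y - ay)) * 0.5
            if area > best_area:
                best_area, best_index = area, j
        sampled.append(points[best_index])
        a = best_index

    sampled.append(points[-1])  # always keep the last point
    return sampled
```

A single spike in an otherwise flat series forms by far the largest triangle in its bucket, so it survives downsampling, whereas averaging would flatten it.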

Configuration

To enable resampling, you must add the following toplevel configuration key:

"resampling": {
  "minimum-points": 300,
  "trigger": 30,
  "resolutions": [
    600,
    300,
    120,
    60
  ]
}

Configuration Parameters

The resampling object is optional. If configured, it enables dynamic downsampling of metric data using the following properties:

minimum-points (Integer) Specifies the minimum number of data points required to trigger resampling. This ensures short jobs are not unnecessarily downsampled.
- Example: If minimum-points is set to 300 and the native frequency is 60 seconds, resampling will only trigger for jobs longer than 5 hours (300 points * 60 seconds = 18,000 seconds; 18,000 / 3600 = 5 hours).
resolutions (Array [Integer]) An array of target resampling resolutions in seconds.
- Example: [600, 300, 120, 60]
- Note: The finest resolution in this list must match the native resolution of your metrics. If you have different native resolutions across your metric configuration, you should use the finest available resolution here. The implementation will automatically fall back to the finest available resolution if an exact match isn’t found.
trigger (Integer) Controls the zoom behavior. It specifies the threshold of visible data points required to trigger the next zoom level. When the number of visible points in the plot window drops below this value (due to zooming in), the backend loads the next finer resolution level.
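The relation between minimum-points and job duration is simple arithmetic. The following Python sketch makes it explicit; the function names are our own and the native resolution of 60 seconds is taken from the example above.

```python
def min_duration_for_resampling(minimum_points: int, native_resolution_s: int) -> int:
    """Job duration in seconds above which resampling applies."""
    return minimum_points * native_resolution_s


def resampling_applies(job_duration_s: int, minimum_points: int, native_resolution_s: int) -> bool:
    """True if the job has more native data points than minimum-points."""
    return job_duration_s > min_duration_for_resampling(minimum_points, native_resolution_s)


for points in (300, 900):
    seconds = min_duration_for_resampling(points, 60)
    print(f"minimum-points={points}: jobs longer than {seconds / 3600:g} hours")
```

For a metric with a coarser native resolution, the same minimum-points value corresponds to a proportionally longer job duration.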

Example view of resampling in graphs

The following examples demonstrate how the configuration above (minimum-points: 300, trigger: 30) affects the visualization of a 16-hour job.

1. Initial Overview (Coarse Resolution)

Because the job duration (~16 hours) requires more than 300 points at native resolution, the system automatically loads the 600s resolution. This provides a fast “overview” load without fetching high-frequency data. The tooltip in this example shows data points every 10 minutes, matching the resolution of 600 seconds.

Initial overview at 600s resolution

2. Zooming without Triggering

When the user zooms in, the system checks if the number of visible data points in the new view is less than the trigger value (30). In the example below, the zoomed window still contains enough points, so the resolution remains at 600s. As the tooltip of the example shows, data points are still 10 minutes apart.

Zoom action that does not trigger update

3. Zooming and Triggering Detail

As the user zooms in deeper, the number of visible points drops below the trigger threshold of 30. This signals the backend to fetch a finer resolution (e.g., 120s or 60s). The graph updates dynamically to show the high-frequency peaks that were previously smoothed out. In this example, the backend detected that the number of selected data points fell below the trigger threshold and loaded the second-to-last resampling level with a resolution of 120 seconds. With a native resolution of 60 seconds, this means one data point every 2 minutes, as the tooltip of the example shows.

Zoom action triggering finer resolution
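The trigger check in steps 2 and 3 can be sketched as follows. This Python snippet only models the comparison of visible points against the trigger value; it deliberately does not model which finer level the backend picks. The function names and the window sizes in the usage note are our own illustrative choices.

```python
def visible_points(window_s: int, resolution_s: int) -> int:
    """Number of data points inside the visible time window."""
    return window_s // resolution_s


def needs_finer_resolution(window_s: int, resolution_s: int, trigger: int) -> bool:
    """True if zooming left fewer visible points than the trigger value."""
    return visible_points(window_s, resolution_s) < trigger
```

For the 16-hour job at 600s resolution, the full view holds 96 points. A zoom window of 6 hours still holds 36 points, which is above the trigger of 30, while a window of 4 hours holds only 24 points and would cause finer data to be loaded.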

4. Visual Comparison

The animation below highlights the difference in visual density and performance between the raw data and the optimized resampled view. Because minimum-points is 300, resampling triggers only for jobs longer than 5 hours (assuming a native frequency of 60 seconds).

Comparison of resampling

Suggestion on configuring the resampling

Based on our experiments and the performance we observed, we recommend the following:

Set "minimum-points" to 900. Assuming a native frequency of 60, resampling is then triggered only for jobs longer than 15 hours.
Configure "resolutions" with only 2 or 3 levels, the last level being the native frequency. A resampling frequency of 600 is recommended only for jobs longer than 24 hours.
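The duration thresholds quoted in this section follow from multiplying minimum-points by the native frequency. A minimal Python check of that arithmetic (the helper name is hypothetical):

```python
def resampling_threshold_hours(minimum_points, native_frequency=60):
    """Job duration in hours above which resampling is applied."""
    return minimum_points * native_frequency / 3600


print(resampling_threshold_hours(300))  # 5.0
print(resampling_threshold_hours(900))  # 15.0
```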

5.14 - How to regenerate the Swagger UI documentation

Overview

This project integrates Swagger UI to document and test its REST API. The Swagger documentation files can be found in ./api/.

Note

Regenerating the Swagger UI files is only required if you change the file ./internal/api/rest.go. Otherwise the Swagger UI is already built correctly and ready to use.

Generate Swagger UI files

You can generate the swagger-ui configuration by running the following command from the cc-backend root directory:

go run github.com/swaggo/swag/cmd/swag init -d ./internal/api,./pkg/schema -g rest.go -o ./api

You need to move one generated file:

mv ./api/docs.go ./internal/api/docs.go

Finally rebuild cc-backend:

make

Use the Swagger UI web interface

If you start cc-backend with the -dev flag, the Swagger web interface is available at http://localhost:8080/swagger/. To use the Try Out functionality, e.g. to test the REST API, you must enter a JWT key for a user with the API role.

Info

The user who owns the JWT key must not be logged into the same browser (have a valid session), or the Swagger requests will not work. It is recommended to create a separate user that has only the API role.

5.15 - How to set up a systemd service

Run ClusterCockpit components as systemd services

How to run as a systemd service.

The files in this directory assume that you install ClusterCockpit to /opt/monitoring/cc-backend. Of course you can choose any other location, but make sure you replace all paths starting with /opt/monitoring/cc-backend in the clustercockpit.service file!

The config.json may contain the optional fields user and group. If specified, the application will call setuid and setgid after reading the config file and binding to a TCP port (so it can take a privileged port), but before it starts accepting any connections. This is good for security, but also means that the var/ directory must be readable and writeable by this user. The .env and config.json files may contain secrets and should not be readable by this user. If these files are changed, the server must be restarted.

Clone this repository somewhere in your home

git clone git@github.com:ClusterCockpit/cc-backend.git

(Optional) Install dependencies and build. In general it is recommended to use the provided release binaries.

cd cc-backend && make

Copy the binary to the target folder (adapt if necessary):

sudo mkdir -p /opt/monitoring/cc-backend/

cp ./cc-backend /opt/monitoring/cc-backend/

Modify the config.json and env-template.txt files from the configs directory to your liking and put them in the target directory

cp ./configs/config.json /opt/monitoring/cc-backend/config.json && cp ./configs/env-template.txt /opt/monitoring/cc-backend/.env

vim /opt/monitoring/cc-backend/config.json # do your thing...
vim /opt/monitoring/cc-backend/.env # do your thing...

(Optional) Customization: Add your own versions of the login view, legal texts, and logo image. You may use the templates in ./web/templates as a blueprint. Each override is applied independently.

cp login.tmpl /opt/monitoring/cc-backend/var/
cp imprint.tmpl /opt/monitoring/cc-backend/var/
cp privacy.tmpl /opt/monitoring/cc-backend/var/
# Ensure your logo and any images used in your login template have a suitable size.
cp -R img /opt/monitoring/cc-backend/img

Copy the systemd service unit file. You may adapt it to your needs.

sudo cp ./init/clustercockpit.service /etc/systemd/system/clustercockpit.service

Enable and start the server

sudo systemctl enable clustercockpit.service # optional (if done, (re-)starts automatically)

sudo systemctl start clustercockpit.service

Check what's going on:

sudo systemctl status clustercockpit.service

sudo journalctl -u clustercockpit.service

5.16 - How to use the REST API Endpoints

Overview

ClusterCockpit offers several REST API endpoints. While some are an integral part of the ClusterCockpit stack workflow (such as start_job), others are optional. These optional endpoints supplement the functionality of the web interface with information that can be accessed from scripts or the command line. For example, job metrics can be requested for specific jobs and processed in external statistics programs.

All of the endpoints listed for both administrators and users are secured by JWT authentication. As such, all prerequisites applicable to JSON Web Tokens apply in this case as well, e.g. private and public key setup.

See also the Swagger Reference for more detailed information on each endpoint and the payloads.

Admin Accessible REST API

Endpoints described here should be restricted to administrators only, as they include integral functions.

Admin API Prerequisites

A JWT has to be generated by either a dedicated API user (with only the api role) or by an administrator with both admin and api roles.
JWTs have a limited lifetime, i.e. they become invalid after a configurable amount of time (see auth.jwt.max-age config option).
Administrator endpoints are additionally subject to a configurable IP whitelist (see api-allowed-ips config option). By default there is no restriction on the IPs that can access the endpoints.

Admin API Endpoints and Functions

Endpoint	Method	Request Payload(s)	Description
`/api/users/`	GET	-	Lists all Users
`/api/clusters/`	GET	-	Lists all Clusters
`/api/tags/`	DELETE	JSON Payload	Removes payload array of tags specified with `Type, Name, Scope` from DB. Private Tags cannot be removed.
`/api/jobs/start_job/`	POST, PUT	JSON Payload	Starts Job
`/api/jobs/stop_job/`	POST, PUT	JSON Payload	Stops Jobs
`/api/jobs/`	GET	URL-Query Params	Lists Jobs
`/api/jobs/{id}`	POST	$id, JSON Payload	Loads specified job metadata
`/api/jobs/{id}`	GET	$id	Loads specified job with metrics
`/api/jobs/tag_job/{id}`	POST, PATCH	$id, JSON Payload	Adds payload array of tags specified with `Type, Name, Scope` to Job with $id. Tags are created in DB.
`/api/jobs/tag_job/{id}`	DELETE	$id, JSON Payload	Removes payload array of tags specified with `Type, Name, Scope` from Job with $id. Tags remain in DB.
`/api/jobs/edit_meta/{id}`	POST, PATCH	$id, JSON Payload	Edits meta_data DB column info
`/api/jobs/metrics/{id}`	GET	$id, URL-Query Params	Loads specified job metrics for metric and scope params
`/api/jobs/delete_job/`	DELETE	JSON Payload	Deletes job specified in payload
`/api/jobs/delete_job/{id}`	DELETE	$id, JSON Payload	Deletes job specified by db id
`/api/jobs/delete_job_before/{ts}`	DELETE	$ts	Deletes all jobs before specified unix timestamp

User Accessible REST API

Endpoints described here can be used by users to write scripted job analysis for their jobs only.

User API Prerequisites

A JWT has to be generated by either a dedicated API user (with only the api role) or a user with the additional api role.
JWTs have a limited lifetime, i.e. they become invalid after a configurable amount of time (see jwt.max-age config option).

User API Endpoints and Functions

Endpoint	Method	Request	Description
`/userapi/jobs/`	GET	URL-Query Params	Lists Jobs
`/userapi/jobs/{id}`	POST	$id, JSON Payload	Loads specified job metadata
`/userapi/jobs/{id}`	GET	$id	Loads specified job with metrics
`/userapi/jobs/metrics/{id}`	GET	$id, URL-Query Params	Loads specified job metrics for metric and scope params
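As an illustration of scripted access, the following Python sketch builds an authenticated GET request for the job list endpoint using only the standard library. The host name, the token placeholder, and the cluster query parameter are assumptions for the example; consult the Swagger reference for the parameters each endpoint actually accepts. The token is passed in the Authorization header with the Bearer prefix.

```python
import urllib.parse
import urllib.request


def build_request(base_url, path, token, params=None):
    """Build (but do not send) an authenticated GET request."""
    url = base_url.rstrip("/") + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, method="GET")
    # The JWT is sent as a bearer token in the Authorization header.
    req.add_header("Authorization", "Bearer " + token)
    return req


# Hypothetical host, token placeholder, and query parameter
req = build_request("https://cc.example.com", "/userapi/jobs/", "<JWT>",
                    {"cluster": "alex"})
print(req.full_url)
```

To actually send the request, pass it to urllib.request.urlopen(req) and parse the response body as JSON.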

5.17 - How to use the Swagger UI documentation

Overview

This project integrates Swagger UI to document and test its REST API. The Swagger documentation files can be found in ./api/.

Access the Swagger UI web interface

Info

6 - Explanation

Articles about background information, terms, and concepts in ClusterCockpit

6.1 - Authentication

A behind the scenes description of how authentication mechanisms are implemented

Overview

The authentication is implemented in internal/auth/. In auth.go an interface is defined that any authentication provider must fulfill. It also acts as a dispatcher to delegate the calls to the available authentication providers.

Two authentication types are available:

JWT authentication for the REST API that does not create a session cookie
Session based authentication using a session cookie

The most important routines in auth are:

Login() Handle POST request to login user and start a new session
Auth() Authenticate user and put User Object in context of the request

The http router calls auth in the following cases:

r.Handle("/login", authentication.Login( ... )).Methods(http.MethodPost): The POST request on the /login route will call the Login callback.
r.Handle("/jwt-login", authentication.Login( ... )): Any request on the /jwt-login route will call the Login callback. Intended for use for the JWT token based authenticators.
Any route in the secured subrouter will always call Auth(). On success it calls the next handler in the chain; on failure it renders the login template.

secured.Use(func(next http.Handler) http.Handler {
  return authentication.Auth(
    // On success:
    next,

    // On failure:
    func(rw http.ResponseWriter, r *http.Request, err error) {
      // Render login form
    })
})

A JWT token can be used to initiate an authenticated user session. This can either happen by calling the login route with a token provided in a header or via a special cookie containing the JWT token. For API routes the access is authenticated on every request using the JWT token and no session is initiated.

The Login function (located in auth.go):

Extracts the user name and gets the user from the user database table. In case the user is not found the user object is set to nil.
Iterates over all authenticators and:
- Calls its CanLogin function which checks if the authentication method is supported for this user.
- Calls its Login function to authenticate the user. On success a valid user object is returned.
- Creates a new session object, stores the user attributes in the session and saves the session.
- Starts the onSuccess http handler
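The dispatch logic of the Login function can be sketched as follows. This is a toy Python model of the steps listed above, not the Go code in auth.go: the class and function names, the dictionary-based user table, and the plain-text password comparison are all assumptions for illustration (the real local authenticator compares bcrypt hashes).

```python
class LocalAuthenticator:
    """Toy stand-in for the local password authenticator."""

    def __init__(self, passwords):
        self.passwords = passwords  # username -> password (toy only)

    def can_login(self, user, form):
        # Applies only to known users with a local password source
        return user is not None and user["auth_source"] == "local"

    def login(self, user, form):
        if self.passwords.get(user["username"]) != form.get("password"):
            raise PermissionError("Authentication failed")
        return user


def login(form, user_table, authenticators, sessions):
    """Try each authenticator in order; create a session on success."""
    user = user_table.get(form.get("username"))  # None if not found
    for auth in authenticators:
        if not auth.can_login(user, form):
            continue
        user = auth.login(user, form)  # raises on failure
        # Store the user attributes in a new session
        sessions[user["username"]] = {"roles": user["roles"]}
        return user  # the real code starts the onSuccess handler here
    raise PermissionError("No authenticator applies")


users = {"alice": {"username": "alice", "auth_source": "local",
                   "roles": ["user"]}}
sessions = {}
login({"username": "alice", "password": "s3cret"}, users,
      [LocalAuthenticator({"alice": "s3cret"})], sessions)
print(sessions)  # {'alice': {'roles': ['user']}}
```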

Local authenticator

This authenticator is applied if

return user != nil && user.AuthSource == AuthViaLocalPassword

Compares the password provided by the login form to the password hash stored in the user database table:

if e := bcrypt.CompareHashAndPassword([]byte(user.Password), []byte(r.FormValue("password"))); e != nil {
  log.Errorf("AUTH/LOCAL > Authentication for user %s failed!", user.Username)
  return nil, fmt.Errorf("Authentication failed")
}

LDAP authenticator

This authenticator is applied if the user was found in the database and its AuthSource is LDAP:

if user != nil {
  if user.AuthSource == schema.AuthViaLDAP {
    return user, true
  }
}

If the option SyncUserOnLogin is set, it tries to sync the user from the LDAP directory. If this succeeds, the user is persisted to the database and can log in.

Gets the LDAP connection and tries a bind with the provided credentials:

if err := l.Bind(userDn, r.FormValue("password")); err != nil {
  log.Errorf("AUTH/LDAP > Authentication for user %s failed: %v", user.Username, err)
  return nil, fmt.Errorf("Authentication failed")
}

JWT Session authenticator

Login via JWT token will create a session without password. For login the X-Auth-Token header is not supported. This authenticator is applied if the Authorization header or query parameter login-token is present:

  return user, r.Header.Get("Authorization") != "" ||
    r.URL.Query().Get("login-token") != ""

The Login function:

Parses the token and checks if it is expired
Checks if the signing method is EdDSA, HS256, or HS512
Checks if the claims are valid and extracts them
The following claims have to be present:
- sub: The subject, in this case this is the username
- exp: Expiration in Unix epoch time
- roles: String array with roles of user
If the user does not exist in the database and the option SyncUserOnLogin is set, adds the user to the user database table with AuthViaToken AuthSource.
Returns a valid user object
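The claim checks in this list can be illustrated with a short Python sketch. It operates on an already decoded claims dictionary and deliberately does not verify the token signature, which the real authenticator does first. The function name check_claims is hypothetical.

```python
import time

REQUIRED_CLAIMS = ("sub", "exp", "roles")


def check_claims(claims, now=None):
    """Return (username, roles) or raise ValueError.

    NOTE: signature verification is out of scope for this sketch.
    """
    now = time.time() if now is None else now
    missing = [c for c in REQUIRED_CLAIMS if c not in claims]
    if missing:
        raise ValueError("missing claims: " + ", ".join(missing))
    if claims["exp"] <= now:
        raise ValueError("token expired")
    return claims["sub"], list(claims["roles"])


print(check_claims({"sub": "alice", "exp": 200, "roles": ["user"]}, now=100))
# ('alice', ['user'])
```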

Login via JWT cookie token will create a session without password. It is first checked if the required configuration options are set:

trustedIssuer
CookieName

The environment variable CROSS_LOGIN_JWT_PUBLIC_KEY is required as well: It is used to verify the identity of the trustedIssuer. The public key must match accordingly.

This authenticator is applied if the configured cookie is present:

  jwtCookie, err := r.Cookie(cookieName)

  if err == nil && jwtCookie.Value != "" {
    return true
  }

The Login function:

Extracts and parses the token
Checks if the signing method is Ed25519/EdDSA
In case publicKeyCrossLogin is configured:
- Checks if the iss issuer claim matches the trusted issuer from the configuration
- Returns the public cross login key
- Otherwise returns the standard public key
Checks if the claims are valid
Depending on the option validateUser, the roles are extracted from the JWT token or taken from the user object fetched from the database
Asks the browser to delete the JWT cookie
If the user does not exist in the database and the option SyncUserOnLogin is set, adds the user to the user database table with AuthViaToken AuthSource.
Returns a valid user object

Auth

The Auth function (located in auth.go):

Returns a new http handler function that is defined right away
This handler tries two methods to authenticate a user:
- Via a JWT API token in AuthViaJWT()
- Via a valid session in AuthViaSession()
If err is nil and the user object is valid, it puts the user object in the request context and starts the onSuccess http handler
Otherwise it calls the onFailure handler

AuthViaJWT

Implemented in JWTAuthenticator:

Extracts the token from either the X-Auth-Token header or the Authorization header with Bearer prefix
Parses the token and checks if it is valid. The Parse routine also checks if the token is expired.
If the option validateUser is set, it ensures the user object exists in the database and takes the roles from the database user
Otherwise the roles are extracted from the roles claim
Returns a valid user object with AuthType set to AuthToken

AuthViaSession

Extracts the session
Gets the values username, projects, and roles from the session
Returns a valid user object with AuthType set to AuthSession

6.2 - Configuration Management

How ClusterCockpit deals with versioning of external assets

Release versions

Versions are marked according to semantic versioning. Each version embeds the following static assets in the binary:

Web frontend with javascript files and all static assets
Golang template files for server-side rendering
JSON schema files for validation
Database migration files

The remaining external assets are:

The SQL database used
The job archive
The configuration files config.json and .env

The external assets are versioned with integer IDs. This means that each release binary is bound to specific versions of the SQL database and the job archive. The configuration file is checked against the current schema at startup. The -migrate-db command line switch can be used to migrate the SQL database from a previous version to the latest one. We offer a separate tool archive-migration to migrate an existing job archive from the previous to the latest version.

Versioning of APIs

cc-backend provides two API backends:

A REST API for querying jobs.
A GraphQL API for data exchange between web frontend and cc-backend.

The REST API will also be versioned. We still have to decide whether we will also support older REST API versions by versioning the endpoint URLs. The GraphQL API is for internal use and will not be versioned.

How to build

In general it is recommended to use the provided release binary. In case you want to build cc-backend yourself, please always use the provided makefile. This ensures that the frontend is also built correctly and that the version is encoded in the binary.

6.3 - InfluxDB Line Protocol

Detailed specification of the InfluxDB Line Protocol format used for metric ingestion, covering Node and Hardware level metrics.

Overview

All metrics ingested into the cc-metric-store—whether via REST API or NATS—must strictly adhere to the InfluxDB Line Protocol. This text-based format allows us to tag high-frequency telemetry data with the necessary dimensions (cluster, host, hardware type) for efficient querying.

Line Protocol Syntax

The general format for a single data point is:

<measurement>,<tag_set> <field_set> <timestamp>

In our specific cc-metric-store implementation, the structure translates to:

metric_name,cluster=<name>,hostname=<host>,type=<hw_type>,type-id=<id> value=<float> <unix_epoch>

Component	Description	Example
Measurement	The specific metric name being recorded.	`cpu_load`
Tags	Key-value pairs providing context (metadata).	`cluster=alex,hostname=node01`
Fields	The actual data value. We use a single field key: `value`.	`value=45.2`
Timestamp	Unix timestamp in seconds.	`1725827464`

Metric Modes

We distinguish between two primary scopes of metrics: Hardware Level and Node Level.

1. Hardware Level Metrics

These metrics track the performance of specific sub-components within a node (e.g., a specific CPU core, a GPU, or a memory domain).

Requirement: You must include the type-id tag to distinguish between multiple components of the same type on the same host.

Schema:

<metric>,cluster=<c>,hostname=<h>,type=<component>,type-id=<index> value=<v> <time>

Example Hardware Types:

hwthread: Logical CPU threads. (IDs: 0..127 for Cluster1, 0..71 for Cluster2)
socket: Physical CPU sockets. (IDs: 0..1)
accelerator: GPUs or FPGA cards. (IDs: PCI Bus Address, e.g., 00000000:49:00.0)
memoryDomain: NUMA nodes. (IDs: 0..7)

Example Payload:

cpu_user,cluster=alex,hostname=a0603,type=hwthread,type-id=12 value=88.5 1725827464
core_power,cluster=fritz,hostname=f0201,type=socket,type-id=0 value=120.0 1725827464

2. Node Level Metrics

These metrics represent the aggregate state of the entire node.

Requirement: The type tag is set to node. The type-id tag is usually omitted or ignored for these metrics.

Schema:

<metric>,cluster=<c>,hostname=<h>,type=node value=<v> <time>

Example Payload:

mem_used,cluster=alex,hostname=a0603,type=node value=64000.0 1725827464
ib_xmit,cluster=fritz,hostname=f0201,type=node value=1024500.0 1725827464
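A line in this format can be assembled with a small helper. The sketch below is one possible way to do it in Python; the function name format_line is hypothetical, and the helper does not escape special characters (spaces, commas, equals signs) in tag values, which a production client would have to handle.

```python
def format_line(metric, cluster, hostname, hw_type, value, timestamp,
                type_id=None):
    """Format one data point in the line protocol layout shown above."""
    tags = [f"cluster={cluster}", f"hostname={hostname}", f"type={hw_type}"]
    # type-id is only needed for hardware level metrics
    if type_id is not None:
        tags.append(f"type-id={type_id}")
    return f"{metric},{','.join(tags)} value={float(value)} {int(timestamp)}"


# Hardware level metric (with type-id)
print(format_line("cpu_user", "alex", "a0603", "hwthread", 88.5,
                  1725827464, type_id=12))
# Node level metric (type-id omitted)
print(format_line("mem_used", "alex", "a0603", "node", 64000.0, 1725827464))
```

The two calls reproduce the first hardware level and the first node level example payload lines from this section.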

To test this protocol with synthetic data, you can use the Metric Generator. See the documentation here: Metric Generator Script

6.4 - JSON Web Token

JSON Web Token (JWT) usage in ClusterCockpit

Introduction

ClusterCockpit uses JSON Web Tokens (JWT) for authorization of its APIs. JSON Web Token (JWT) is an open standard (RFC 7519) that defines a compact and self-contained way for securely transmitting information between parties as a JSON object. This information can be verified and trusted because it is digitally signed. In ClusterCockpit JWTs are signed with a public/private key pair using EdDSA (Ed25519). Because tokens are signed using public/private key pairs, the signature also certifies that only the party holding the private key is the one that signed it. Expiration of the generated tokens as well as the maximum length of a browser session can be configured in the config.json file described here.

The Ed25519 algorithm for signatures was chosen because it is compatible with other tools that require authentication, such as NATS.io, and because these elliptic-curve methods provide similar security with smaller keys compared to something like RSA. They are slightly more expensive to validate, but that effect is negligible.

JWT Payload

You may view the payload of a JWT token at https://jwt.io/#debugger-io. Currently ClusterCockpit sets the following claims:

iat: Issued at claim. The “iat” claim is used to identify the time at which the JWT was issued. This claim can be used to determine the age of the JWT.
sub: Subject claim. Identifies the subject of the JWT, in our case this is the username.
roles: An array of strings specifying the roles set for the subject.
exp: Expiration date of the token (only if explicitly configured)

It is important to know that JWTs are not encrypted, only signed. This means that outsiders cannot create new JWTs or modify existing ones, but they are able to read out the username.

If there is an external service like an AuthAPI that can generate JWTs and hand them over to ClusterCockpit via cookies, CC can be configured to accept them:

.env: CC needs a public Ed25519 key to verify foreign JWT signatures. Public keys in PEM format can be converted with the instructions in /tools/convert-pem-pubkey-for-cc.

CROSS_LOGIN_JWT_PUBLIC_KEY="+51iXX8BdLFocrppRxIw52xCOf8xFSH/eNilN5IHVGc="

config.json: Insert the name of the cookie (set by the external service) containing the JWT so that CC knows where to look. Define a trusted issuer (JWT claim ‘iss’); tokens from any other issuer will be rejected. If you want usernames and user roles from JWTs (‘sub’ and ‘roles’ claim) to be validated against CC’s internal database, you need to enable it here. Unknown users will then be rejected and roles set via JWT will be ignored.

"jwts": {
    "cookieName": "access_cc",
    "forceJWTValidationViaDatabase": true,
    "trustedExternalIssuer": "auth.example.com"
}

Make sure your external service includes the same issuer (iss) in its JWTs. Example JWT payload:

{
  "iat": 1668161471,
  "nbf": 1668161471,
  "exp": 1668161531,
  "sub": "alice",
  "roles": ["user"],
  "jti": "a1b2c3d4-1234-5678-abcd-a1b2c3d4e5f6",
  "iss": "auth.example.com"
}

6.5 - Metric Store

An architectural view of the CC Metric Store and how its background workers operate.

Introduction

CCMS (Cluster Cockpit Metric Store) is a simple in-memory time series database. It stores data about the nodes in your cluster for a specific interval of days. Node data can be collected with various instrumentation tools such as RAPL, LIKWID, or PAPI, which provide metrics like memory bandwidth, flops, clock frequency, and CPU usage. After a specified number of days, the data is written to disk, archived, and released from the in-memory database. This documentation explains in detail how the CCMS components work. The outline is as follows:

Present the structure of the metric store.
Explain background workers.

Let us get started with the very basic understanding of how CCMS is structured and how it manages data over time.

General tree structure can be as follows:

root
|-----cluster
|      |------node -> [node-metrics]
|      |       |--components -> [node-level-metrics]
|      |       |--components -> [node-level-metrics]
|      |
|      |------node -> [node-metrics]
|              |--components -> [node-level-metrics]
|              |--components -> [node-level-metrics]
|
|-----cluster
       |------node -> [node-metrics]
       |       |--components -> [node-level-metrics]
       |       |--components -> [node-level-metrics]
       |
       |------node -> [node-metrics]
               |--components -> [node-level-metrics]
               |--components -> [node-level-metrics]

A simple tree representation with example:

root
|-----alex
|      |------a903 -> [mem_cached,cpu_idle,nfs4_read]
|      |       |--hwthread01 -> [cpu_load,cpu_user,flops_any]
|      |       |--accelerator01 -> [mem_bw,mem_used,flops_any]
|      |
|      |------a322 -> [mem_cached,cpu_idle,nfs4_read]
|              |--hwthread42 -> [cpu_load,cpu_user,flops_any]
|              |--accelerator05 -> [mem_bw,mem_used,flops_any]
|
|-----fritz
       |------f104 -> [mem_cached,cpu_idle,nfs4_read]
       |       |--hwthread35 -> [cpu_load,cpu_user,flops_any]
       |       |--socket02 -> [cpu_load,cpu_user,flops_any]
       |
       |------f576 -> [mem_cached,cpu_idle,nfs4_read]
               |--hwthread47 -> [cpu_load,cpu_user,flops_any]
               |--cpu01 -> [cpu_load,cpu_user,flops_any]

Example tree structure of CCMS containing two clusters, ‘alex’ and ‘fritz’. Each cluster contains its own nodes, and each node contains its components. Every node and every component holds metrics. a903 is an example of a node, and hwthread01 and accelerator01 are node-level components. Each node has its own metrics (node-metrics), and each node-level component has its own metrics as well (node-level-metrics).

Internal data structures used in cc-metric-store

A representation of the Level and Buffer data structure with the buffer chain.

From our previous example, we move from a simplistic view to a more realistic one. Each buffer for a given metric holds up to BUFFER_CAP elements in its data array. Usually BUFFER_CAP is 512 elements, so for float64 elements the buffer size is 4 KB (512 × 8 bytes), which is also the typical size of a memory page. Below you can find all the data structures and their member variables. In our example, the start times of consecutive buffers are exactly 512 epoch seconds apart. Older buffers are linked as the previous (prev) buffer of the newer one. This creates a chain of buffers for every level.

Data structure used to hold the data in memory:

MemoryStore

type MemoryStore struct {
    // Metric configurations parsed from config.json, keyed by metric name
    Metrics map[string]MetricConfig

    // Initial root level.
    root    Level
}

Level

// From our example, alex, fritz, a903, a322, hwthread01 are all of the Level data structure.
type Level struct {
    // Stores the metrics for the level.
    // From our example, mem_cached, flops_any are of the Buffer data structure.
    metrics  []*buffer

    // Stores the child levels, keyed by name.
    children map[string]*Level
}

Buffer

type buffer struct {
    // Pointer to previous buffer
    prev      *buffer

    // Pointer to next buffer
    next      *buffer

    // Array of floats to store the measurements (up to BUFFER_CAP elements)
    data      []float64

    // Interval in seconds at which measurements will arrive.
    frequency int64

    // Buffer's start time stored in epoch seconds
    start     int

    // If true, this buffer will be skipped for file checkpointing
    archived  bool

    closed    bool
}

MetricConfig

type MetricConfig struct {
    // Interval in seconds at which measurements will arrive.
    // A frequency of 60 means the timestep/resolution is 60 seconds.
    Frequency     int

    // Can be 'sum', 'avg' or null. Describes how to aggregate metrics from the same timestep over the hierarchy.
    Aggregation   string

    // Private, used internally...
    Offset        int
}
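To make the buffer chain concrete, here is a minimal Python model of it. It is a simplification for illustration, not a port of the Go code: values are simply appended in order, whereas the real store places values by timestamp. The class layout mirrors the fields named above; everything else is an assumption.

```python
BUFFER_CAP = 512  # elements per buffer; 512 * 8 bytes (float64) = 4096 bytes


class Buffer:
    def __init__(self, start, frequency, prev=None):
        self.start = start          # epoch seconds of the first slot
        self.frequency = frequency  # seconds between measurements
        self.data = []
        self.prev = prev            # older buffer in the chain
        self.next = None            # newer buffer in the chain
        if prev is not None:
            prev.next = self

    def write(self, value):
        """Append a value; return the buffer that is now the chain head."""
        if len(self.data) == BUFFER_CAP:
            # Buffer is full: open a new one and link the old one as prev
            new_start = self.start + BUFFER_CAP * self.frequency
            head = Buffer(new_start, self.frequency, prev=self)
            head.data.append(value)
            return head
        self.data.append(value)
        return self


head = Buffer(start=0, frequency=1)
for i in range(BUFFER_CAP + 1):
    head = head.write(float(i))
print(head.start, len(head.data), len(head.prev.data))  # 512 1 512
```

With a frequency of 1 second, the second buffer starts exactly 512 epoch seconds after the first, matching the spacing described for the example.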

Background workers

Background workers are separate threads, one spawned for each of the following background tasks:

Data retention -> This background worker uses the retention-in-memory parameter in config.json to set its looping interval. Whenever the interval elapses, it releases all buffers in CCMS that are older than the user-given retention time.

In this example, we assume that data is inserted continuously into CCMS with a retention period of 48 hours. The background worker always checks at an interval of retention-period/2, here every 24 hours, so that CCMS retains 48 hours of data overall. Once 72 hours of data have accumulated, the background worker releases the first 24 hours of data from the in-memory database.

Note: The internal cc-metric-store has a dynamic buffer retention feature: buffers for jobs that run longer than the retention-in-memory duration are kept in the metric store. Once the jobs complete, the retained buffers are freed during the next retention cycle.

Data checkpointing -> This background worker uses the interval from the checkpoints parameter in config.json as its looping interval. Whenever the interval elapses, it writes local backups of the CCMS data to disk. The checkpoint files can be found in the user-defined directory given by the directory sub-parameter of the checkpoints parameter in config.json. Checkpointing does not remove data from the in-memory database; data is released from memory only once the retention period is reached.
Data cleaning -> There are two modes of data cleanup: the checkpoint files in the checkpoint directory are either deleted or archived. This background worker uses the interval from the cleanup parameter in config.json as its looping interval. Whenever the interval elapses, it zips all checkpoint files that are older than the user-given time in the interval sub-parameter. Once the checkpoint files are zipped, they are deleted from the checkpoint directory. If no interval is specified, it uses the retention-in-memory duration by default. If no cleanup section is specified, the default mode is delete mode.
Memory-usage tracker -> This worker tracks the memory usage of CCMS once every hour. It estimates the size of CCMS from the number of buffers and their lengths. The worker depends on the memory-cap value from config.json. Once the memory usage of CCMS exceeds the memory-cap value, it first frees the dynamically retained buffers of long-running jobs. If the memory usage is still above the limit, it frees the last buffer of every level in the metric store.
Graceful shutdown handler -> This is a special background worker that detects system or keyboard interrupts like Ctrl+C or Ctrl+Z. On an interrupt, it is essential to save the data from the in-memory database, since CCMS may contain data that exists only in memory and has not been checkpointed yet. This background worker therefore scans for buffers that have not been checkpointed and writes them to the checkpoint files before shutting down CCMS.
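The retention rule from the 48-hour example can be expressed in a few lines of Python. This sketch models buffers as dictionaries with start and end times in epoch seconds, which is an assumption for illustration; it also ignores the dynamic retention of buffers that belong to running jobs.

```python
def free_expired(buffers, now, retention_seconds):
    """Split buffers into (kept, freed) according to the retention time."""
    cutoff = now - retention_seconds
    kept = [b for b in buffers if b["end"] > cutoff]
    freed = [b for b in buffers if b["end"] <= cutoff]
    return kept, freed


H = 3600
# Three buffers covering hours 0-24, 24-48, and 48-72
buffers = [{"start": i * 24 * H, "end": (i + 1) * 24 * H} for i in range(3)]

# Retention of 48 h, worker tick at hour 72: the first 24 h are released
kept, freed = free_expired(buffers, now=72 * H, retention_seconds=48 * H)
print(len(kept), len(freed))  # 2 1
```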

Reusing the buffers in cc-metric-store

This section explains how CCMS reuses buffers once they are released by the retention background worker.

In this example, we extend the previous example and assume that the retention background worker releases the last buffer of each level, i.e. of node and node-level metrics. A buffer that is about to be unlinked from the buffer chain is not freed from memory; instead, it is unlinked and stored in the memory pool as shown. This allows buffers to be reused whenever a buffer reaches the BUFFER_CAP limit and a metric requests a new buffer.
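The idea of a buffer pool can be sketched as follows. This is a toy Python model in which a buffer is just a list; the class name BufferPool and its methods are hypothetical and do not reflect the actual cc-metric-store code.

```python
class BufferPool:
    """Keeps released buffers so they can be handed out again."""

    def __init__(self):
        self._free = []

    def put(self, buf):
        buf.clear()              # drop old samples, keep the object
        self._free.append(buf)

    def get(self):
        # Reuse a released buffer if one is available, else allocate
        return self._free.pop() if self._free else []


pool = BufferPool()
old = [1.0, 2.0, 3.0]
pool.put(old)            # retention worker releases the buffer
new = pool.get()         # a metric requests a new buffer
print(new is old, new)   # True []
```

Reusing released buffers avoids repeated allocation and deallocation when metrics continuously fill buffers and request new ones.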

6.6 - NATS messaging

NATS message broker infrastructure

Introduction

NATS is a powerful messaging solution supporting many paradigms. Since it is itself implemented in Golang it provides excellent support for Golang based applications. Currently NATS is offered in most ClusterCockpit applications as an alternative to the default REST API. We plan to make NATS the default way to communicate within the ClusterCockpit framework in the future.

Advantages for us to use NATS:

Scalable and low overhead messaging infrastructure
Flexible configuration-free setup of message sources and consumers
Built-in zero-trust JWT-based authentication system
Simple message filtering based on hierarchical subject names
Multicast and message queue support

Authentication

NATS provides a sophisticated authentication scheme based on JWT tokens and NKeys. It provides the nsc tool to create and manage tokens supporting fine-grained authentication and authorization control.

6.7 - Roles

Description of roles used in the web interface

ClusterCockpit uses a fixed set of user roles to control data access and distinguish authorizations. Roles are primarily used in the web interface to display different views, but they also limit which data the server backend returns in response to requests.

The roles currently implemented are:

User Role

The standard role for all users. By default, granted to all users imported from LDAP. It is also the default selection for the administrative “Create User” form.

Use Case: View and list personal jobs, view personal job detail, inspect metrics of personal jobs.

Access: Jobs started from the user's account only.

Manager Role

A privileged role for project supervisors. This role has to be granted manually by administrators. If ClusterCockpit is configured to accept JWT logins from external management applications, it is possible to retain roles granted in the respective application, see JWT docs.

In addition to the role itself, one or more projects need to be assigned to the user by administrators.

Use Case: In addition to personal job access, this role is intended to view and inspect all jobs of all users of the assigned projects (usergroups), in order to self-manage and identify problems of the subordinate user group.

Access: Personally started jobs, regardless of project. Additionally, all jobs started from all users of the assigned projects (usergroups).

Support Role

A privileged role for support staff. This role has to be granted manually by administrators. If ClusterCockpit is configured to accept JWT logins from external management applications, it is possible to retain roles granted in the respective application, see JWT docs.

In regard to job view access, this role is identical to administrators. However, web interface view access differs and, most importantly, access to administrative options is prohibited.

Use Case: In addition to personal job access, this role is intended to view and inspect all jobs of all users active on the clusters, in order to identify problems and give guidance for the userbase as a whole, supporting the administrative staff in these tasks.

Access: Personally started jobs, regardless of project. Additionally, all jobs started from all users on all configured clusters.

Administrator Role

The highest available authority for administrative staff only. This role has to be granted manually by other administrators. No JWT can ever grant this role.

All jobs from all active users on all systems can be accessed, as well as all web interface views. In addition, the administrative options in the settings view are accessible.

Use Case: General access and ClusterCockpit administrative tasks from the settings page.

Access: General access.

API Role

An optional, technical role given to users in order to enable usage of the RESTful API endpoints. This role has to be granted manually by administrators. No JWT can ever grant this role.

This role can be granted to a specialized “API User”, which has neither a password nor any other roles and therefore cannot log in by itself. Such a user is only intended for generating JWT access tokens, for example for scripted API access.

Alternatively, this role can be granted to actual users, for example to administrators who want to generate personal API tokens for testing.

Use Case: Interact with ClusterCockpit's REST API.

Access: Allows usage of ClusterCockpit's REST API.
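
The job access rules of the roles described above can be summarized in a few lines. The following Python sketch is only an illustration of those rules, not the cc-backend implementation; the `User` and `Job` classes and the `can_view_job` helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    username: str
    roles: set
    projects: set = field(default_factory=set)  # only relevant for managers


@dataclass
class Job:
    user: str
    project: str


def can_view_job(user: User, job: Job) -> bool:
    """Apply the job access rules described for each role."""
    if job.user == user.username:
        return True  # every role may access personally started jobs
    if user.roles & {"admin", "support"}:
        return True  # all jobs of all users on all clusters
    if "manager" in user.roles and job.project in user.projects:
        return True  # all jobs of the assigned projects
    return False


manager = User("carol", {"manager"}, {"projA"})
print(can_view_job(manager, Job("bob", "projA")))
```

Note that the API role does not appear in this check: as described above, it enables usage of the REST API and does not by itself widen job access in this sketch.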

7 - Reference

In-depth technical documentation

In-depth description of configuration options, file formats, and REST API interfaces.

7.1 - cc-backend

ClusterCockpit Backend References

Reference information regarding the primary ClusterCockpit component “cc-backend” (GitHub Repo).

7.1.1 - Command Line

ClusterCockpit Command Line Options

This page describes the command line options for the cc-backend executable.

-add-user <username>:[admin,support,manager,api,user]:<password>

Function: Add a new user. Only one role can be assigned.

Example: -add-user abcduser:manager:somepass

  -apply-tags

Function: Run taggers on all completed jobs and exit.

  -config <path>

Function: Specify alternative path to config.json.

Default: ./config.json

Example: -config ./configfiles/configuration.json

  -del-user <username>

Function: Remove an existing user.

Example: -del-user abcduser

  -dev

Function: Enable development components: GraphQL Playground and Swagger UI.

  -force-db

Function: Force database version, clear dirty flag and exit.

  -gops

Function: Listen via github.com/google/gops/agent (for debugging).

  -import-job <path-to-meta.json>:<path-to-data.json>, ...

Function: Import a job. Argument format: <path-to-meta.json>:<path-to-data.json>,...

Example: -import-job ./to-import/job1-meta.json:./to-import/job1-data.json,./to-import/job2-meta.json:./to-import/job2-data.json
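
The argument format can be illustrated with a short parser. This Python sketch is not how cc-backend parses the option; `parse_import_arg` is a hypothetical helper that splits the value into (meta, data) path pairs and assumes that paths contain neither commas nor colons.

```python
def parse_import_arg(arg: str) -> list[tuple[str, str]]:
    """Split '<meta>:<data>,<meta>:<data>,...' into a list of path pairs."""
    pairs = []
    for item in arg.split(","):
        meta, sep, data = item.strip().partition(":")
        if not sep or not meta or not data:
            raise ValueError(f"expected <meta.json>:<data.json>, got {item!r}")
        pairs.append((meta, data))
    return pairs


arg = ("./to-import/job1-meta.json:./to-import/job1-data.json,"
       "./to-import/job2-meta.json:./to-import/job2-data.json")
for meta_path, data_path in parse_import_arg(arg):
    print(meta_path, data_path)
```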

  -init

Function: Set up the var directory and initialize the SQLite database file, config.json and .env.

  -init-db

Function: Go through job-archive and re-initialize the job, tag, and jobtag tables (all running jobs will be lost!).

Caution: All running jobs will be lost!

  -jwt <username>

Function: Generate and print a JWT for the user specified by its username.

Example: -jwt abcduser

  -logdate

Function: Set this flag to add date and time to log messages.

  -loglevel <level>

Function: Sets the logging level.

Arguments: debug | info | warn | err | crit

Default: warn

Example: -loglevel debug

  -migrate-db

Function: Migrate database to supported version and exit.

  -revert-db

Function: Migrate database to previous version and exit.

  -server

Function: Start the server and keep listening on the configured port after initialization and argument handling.

  -sync-ldap

Function: Sync the hpc_user table with LDAP.

  -version

Function: Show version information and exit.

7.1.2 - Configuration

ClusterCockpit Configuration Option References

cc-backend requires a JSON configuration file. The configuration file is structured into components. Every component is configured either in a separate JSON object or using a separate file. When a section is put in a separate file, the section key must have a -file suffix.

Example:

"auth-file": "./var/auth.json"

To override the default config file path, specify the location of a JSON configuration file with the -config <file path> command line option.

Configuration Options

Section `main`

Section must exist.

addr: Type string (Optional). Address on which the http (or https) server listens (for example: ‘0.0.0.0:80’). Default localhost:8080.
api-allowed-ips: Type array of strings (Optional). IPv4 addresses from which the secured administrator API endpoint functions /api/* can be reached. Default: No restriction. The previous * wildcard is still supported but obsolete.
user: Type string (Optional). Drop root permissions once .env was read and the port was taken. Only applicable if using privileged port.
group: Type string. Drop root permissions once .env was read and the port was taken. Only applicable if using privileged port.
disable-authentication: Type bool (Optional). Disable authentication (for everything: API, Web-UI, …). Default false.
embed-static-files: Type bool (Optional). If all files in web/frontend/public should be served from within the binary itself (they are embedded) or not. Default true.
static-files: Type string (Optional). Folder where static assets can be found, if embed-static-files is false. No default.
db: Type string (Optional). The db file path. Default: ./var/job.db.
enable-job-taggers: Type bool (Optional). Enable automatic job taggers for application and job class detection. Requires tagger rules to be provided. Default: false.
validate: Type bool (Optional). Validate all input JSON documents against JSON schema. Default: false.
session-max-age: Type string (Optional). Specifies for how long a session shall be valid as a string parsable by time.ParseDuration(). If 0 or empty, the session/token does not expire! Default 168h.
https-cert-file and https-key-file: Type string (Optional). If both those options are not empty, use HTTPS using those certificates. Default: No HTTPS.
redirect-http-to: Type string (Optional). If not the empty string and addr does not end in “:80”, redirect every request incoming at port 80 to that url.
stop-jobs-exceeding-walltime: Type int (Optional). If not zero, automatically mark jobs as stopped if they run X seconds longer than their walltime. Only applies if walltime is set for job. Default 0.
short-running-jobs-duration: Type int (Optional). Do not show running jobs shorter than X seconds. Default 300.
emission-constant: Type integer (Optional). Energy Mix CO2 Emission Constant [g/kWh]. If set, the UI displays the estimated CO2 emission for a job based on the job's total energy.
resampling: Type object (Optional). If configured, will enable dynamic downsampling of metric data using the configured values.
- minimum-points: Type integer. Minimum number of data points a job must have before resampling is applied; Example: 600. With minimum-points: 600 and a sampling frequency of 60 seconds per sample, resampling triggers only for jobs longer than 10 hours (600 points × 60 s = 36,000 s = 10 h).
- resolutions: Type array [integer]. Array of resampling target resolutions, in seconds; Example: [600,300,60].
- trigger: Type integer. Trigger next zoom level at less than this many visible datapoints.
machine-state-dir: Type string (Optional). Where to store MachineState files. TODO: Explain in more detail!
api-subjects: Type object (Optional). NATS subjects configuration for subscribing to job and node events. Default: No NATS API.
- subject-job-event: Type string. NATS subject for job events (start_job, stop_job).
- subject-node-state: Type string. NATS subject for node state updates.
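
The arithmetic behind the minimum-points option of the resampling object above can be checked with a few lines of Python. This is a sketch of the calculation only; the function names are hypothetical, and treating the limit as a strict "greater than" comparison is an assumption based on the wording "jobs > 10 hours".

```python
def resampling_threshold_hours(minimum_points: int, frequency_s: int) -> float:
    """Job duration (in hours) at which a job reaches minimum_points samples."""
    return minimum_points * frequency_s / 3600


def has_enough_points(duration_s: int, frequency_s: int, minimum_points: int) -> bool:
    """True if a job of the given duration has more than minimum_points samples."""
    return duration_s / frequency_s > minimum_points


# 600 points at one sample per 60 seconds correspond to 10 hours.
print(resampling_threshold_hours(600, 60))
print(has_enough_points(11 * 3600, 60, 600))
```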

Section `nats`

Section is optional.

address: Type string. Address of the NATS server (e.g., nats://localhost:4222).
username: Type string (Optional). Username for NATS authentication.
password: Type string (Optional). Password for NATS authentication.
creds-file-path: Type string (Optional). Path to NATS credentials file for authentication.

Section `cron`

Section must exist.

commit-job-worker: Type string. Frequency of commit job worker. Default: 2m
duration-worker: Type string. Frequency of duration worker. Default: 5m
footprint-worker: Type string. Frequency of footprint worker. Default: 10m

Section `archive`

Section is optional. If section is not provided, the default is kind set to file with path set to ./var/job-archive.

kind: Type string (required). Set archive backend. Supported values: file, s3, sqlite.
path: Type string (Optional). Path to the job-archive. Default: ./var/job-archive.
compression: Type integer (Optional). Setup automatic compression for jobs older than number of days.
retention: Type object (Optional). Enable retention policy for archive and database.
- policy: Type string (required). Retention policy. Possible values none, delete, move.
- include-db: Type bool (Optional). Also remove jobs from database. Default: true.
- age: Type integer (Optional). Act on jobs with startTime older than age (in days). Default: 7 days.
- location: Type string (Optional). The target directory for retention. Only applicable for retention policy move.
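
The age option of the retention object selects jobs by their startTime. The following Python sketch shows that selection under simple assumptions: jobs are plain dictionaries with a startTime in Unix seconds, and `jobs_to_retire` is a hypothetical helper, not part of cc-backend.

```python
import time

SECONDS_PER_DAY = 86400


def jobs_to_retire(jobs, age_days=7, now=None):
    """Return the jobs whose startTime is older than age_days."""
    now = time.time() if now is None else now
    cutoff = now - age_days * SECONDS_PER_DAY
    return [job for job in jobs if job["startTime"] < cutoff]


now = 10 * SECONDS_PER_DAY
jobs = [
    {"jobId": 1, "startTime": 1 * SECONDS_PER_DAY},  # 9 days old
    {"jobId": 2, "startTime": 5 * SECONDS_PER_DAY},  # 5 days old
]
print([job["jobId"] for job in jobs_to_retire(jobs, age_days=7, now=now)])
```

What happens to the selected jobs then depends on the policy: they are deleted, moved to location, or left alone for policy none.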

Section `auth`

Section must exist.

jwts: Type object (required). For JWT Authentication.
- max-age: Type string (required). Configure how long a token is valid. As string parsable by time.ParseDuration().
- cookie-name: Type string (Optional). Cookie that should be checked for a JWT token.
- validate-user: Type bool (Optional). Deny login for users not in database (but defined in JWT). Overwrite roles in JWT with database roles.
- trusted-issuer: Type string (Optional). Issuer that should be accepted when validating external JWTs.
- sync-user-on-login: Type bool (Optional). Add non-existent user to DB at login attempt with values provided in JWT.
- update-user-on-login: Type bool (Optional). Update existent user in DB at login attempt with values provided in JWT. Currently only the person name is updated.
ldap: Type object (Optional). For LDAP Authentication and user synchronisation. Default nil.
- url: Type string (required). URL of LDAP directory server.
- user-base: Type string (required). Base DN of user tree root.
- search-dn: Type string (required). DN for authenticating LDAP admin account with general read rights.
- user-bind: Type string (required). Expression used to authenticate users via LDAP bind. Must contain uid={username}.
- user-filter: Type string (required). Filter to extract users for syncing.
- username-attr: Type string (Optional). Attribute with full user name. Defaults to gecos if not provided.
- sync-interval: Type string (Optional). Interval used for syncing local user table with LDAP directory. Parsed using time.ParseDuration.
- sync-del-old-users: Type bool (Optional). Delete obsolete users in database.
- sync-user-on-login: Type bool (Optional). Add non-existent user to DB at login attempt if user exists in LDAP directory.
oidc: Type object (Optional). For OpenID Connect Authentication. Default nil.
- provider: Type string (required). OpenID Connect provider URL.
- sync-user-on-login: Type bool. Add non-existent user to DB at login attempt with values provided.
- update-user-on-login: Type bool. Update existent user in DB at login attempt with values provided. Currently only the person name is updated.

Section `metric-store`

Section must exist.

retention-in-memory: Type string (required). Keep the metrics in memory for the given time interval. After this interval has passed, the metrics are freed. Buffers that are still used by running jobs are kept.
memory-cap: Type integer (required). If memory used exceeds value in GB, buffers still used by long running jobs will be freed.
num-workers: Type integer (Optional). Number of concurrent workers for checkpoint and archive operations. Default: min(runtime.NumCPU()/2+1, 10).
checkpoints: Type object (required). Configuration for checkpointing the metrics buffers
- file-format: Type string (Optional). Format to use for checkpoint files. Can be JSON or Avro. Default: Avro.
- directory: Type string (Optional). Path in which the checkpoints should be placed. Default: ./var/checkpoints.
cleanup: Type object (Optional). Configuration for the cleanup process. If not set the mode is delete with interval set to the retention-in-memory interval.
- mode: Type string (Optional). The mode for cleanup. Can be delete or archive. Default: delete.
- interval: Type string (Optional). Interval at which the cleanup runs.
- directory: Type string (required if mode is archive). Directory where to put the archive files.
nats-subscriptions: Type array (Optional). List of NATS subjects the metric store should subscribe to. Items are of type object with the following attributes:
- subscribe-to: Type string (required). NATS subject to subscribe to.
- cluster-tag: Type string (Optional). Allow lines without a cluster tag, use this as default.

Section `ui`

The ui section specifies defaults for the web user interface. The defaults for which metrics to show in different views can be overridden per cluster or subcluster.

job-list: Type object (Optional). Job list defaults. Applies to user and jobs views.
- use-paging: Type bool (Optional). If classic paging is used instead of continuous scrolling by default.
- show-footprint: Type bool (Optional). If footprint bars are shown as first column by default.
node-list: Type object (Optional). Node list defaults. Applies to node list view.
- use-paging: Type bool (Optional). If classic paging is used instead of continuous scrolling by default.
job-view: Type object (Optional). Job view defaults.
- show-polar-plot: Type bool (Optional). If the job metric footprints polar plot is shown by default.
- show-footprint: Type bool (Optional). If the annotated job metric footprint bars are shown by default.
- show-roofline: Type bool (Optional). If the job roofline plot is shown by default.
- show-stat-table: Type bool (Optional). If the job metric statistics table is shown by default.
metric-config: Type object (Optional). Global initial metric selections for primary views of all clusters.
- job-list-metrics: Type array [string] (Optional). Initial metrics shown for new users in job lists (User and jobs view).
- job-view-plot-metrics: Type array [string] (Optional). Initial metrics shown for new users as job view metric plots.
- job-view-table-metrics: Type array [string] (Optional). Initial metrics shown for new users in job view statistics table.
- clusters: Type array of objects (Optional). Overrides for global defaults by cluster and subcluster.
  - name: Type string (required). The name of the cluster.
  - job-list-metrics: Type array [string] (Optional). Initial metrics shown for new users in job lists (User and jobs view) for this cluster.
  - job-view-plot-metrics: Type array [string] (Optional). Initial metrics shown for new users as job view timeplots for this cluster.
  - job-view-table-metrics: Type array [string] (Optional). Initial metrics shown for new users in job view statistics table for this cluster.
  - sub-clusters: Type array of objects (Optional). The array of overrides per subcluster.
    - name: Type string (required). The name of the subcluster.
    - job-list-metrics: Type array [string] (Optional). Initial metrics shown for new users in job lists (User and jobs view) for subcluster.
    - job-view-plot-metrics: Type array [string] (Optional). Initial metrics shown for new users as job view timeplots for subcluster.
    - job-view-table-metrics: Type array [string] (Optional). Initial metrics shown for new users in job view statistics table for subcluster.
plot-configuration: Type object (Optional). Initial settings for plot render options.
- color-background: Type bool (Optional). If the metric plot backgrounds are initially colored by threshold limits.
- plots-per-row: Type integer (Optional). How many plots are initially rendered per row. Applies to job, single node, and analysis views.
- line-width: Type integer (Optional). Initial thickness of rendered plotlines. Applies to metric plot, job compare plot and roofline.
- color-scheme: Type array [string] (Optional). Initial colorScheme to be used for metric plots.

7.1.3 - Environment

ClusterCockpit Environment Variables

All security-related configuration, e.g. keys and passwords, is set using environment variables. These can be set by means of a .env file in the project root.

Environment Variables

JWT_PUBLIC_KEY and JWT_PRIVATE_KEY: Base64 encoded Ed25519 keys used for JSON Web Token (JWT) authentication. You can generate your own keypair using go run ./tools/gen-keypair/. The release binaries also include the gen-keypair tool for x86-64. For more information, see the JWT documentation.
SESSION_KEY: Some random bytes used as secret for cookie-based sessions
LDAP_ADMIN_PASSWORD: The LDAP admin user password (optional)
CROSS_LOGIN_JWT_HS512_KEY: Used for token based logins via another authentication service (optional)
OID_CLIENT_ID: OpenID connect client id (optional)
OID_CLIENT_SECRET: OpenID connect client secret (optional)
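
One way to produce random bytes for SESSION_KEY is Python's secrets module. The sketch below emits 16 random bytes as 32 hexadecimal characters, which has the same shape as the value in the template .env file of this section. This document does not state a required key length, so treat the size as an assumption.

```python
import secrets


def new_session_key(num_bytes: int = 16) -> str:
    """Return cryptographically strong random bytes as a hex string."""
    return secrets.token_hex(num_bytes)


# Paste the printed line into your .env file.
print(f'SESSION_KEY="{new_session_key()}"')
```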

Template `.env` file

Below is an example .env file. Copy it as .env into the project root and adapt it for your needs.

# Base64 encoded Ed25519 keys (DO NOT USE THESE TWO IN PRODUCTION!)
# You can generate your own keypair using `go run tools/gen-keypair/main.go`
JWT_PUBLIC_KEY="kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0="
JWT_PRIVATE_KEY="dtPC/6dWJFKZK7KZ78CvWuynylOmjBFyMsUWArwmodOTN9itjL5POlqdZkcnmpJ0yPm4pRaCrvgFaFAbpyik/Q=="

# Base64 encoded Ed25519 public key for accepting externally generated JWTs
# Keys in PEM format can be converted, see `tools/convert-pem-pubkey/Readme.md`
CROSS_LOGIN_JWT_PUBLIC_KEY=""

# Some random bytes used as secret for cookie-based sessions (DO NOT USE THIS ONE IN PRODUCTION)
SESSION_KEY="67d829bf61dc5f87a73fd814e2c9f629"

# Password for the ldap server (optional)
LDAP_ADMIN_PASSWORD="mashup"
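
A file in this format is easy to inspect programmatically, for example to check that no required variable is missing before starting the server. The following Python sketch is not how cc-backend loads the file; `parse_dotenv` is a hypothetical helper that only handles comment lines and the KEY="value" form used in the template above.

```python
def parse_dotenv(text: str) -> dict:
    """Parse KEY="value" lines, skipping blank lines and '#' comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first '=' only: Base64 values may end in '='.
        key, sep, value = line.partition("=")
        if not sep:
            continue
        env[key.strip()] = value.strip().strip('"')
    return env


sample = '''
# Example values, not real secrets
JWT_PUBLIC_KEY="AAAA="
SESSION_KEY="0123456789abcdef"
'''
env = parse_dotenv(sample)
missing = [k for k in ("JWT_PUBLIC_KEY", "JWT_PRIVATE_KEY", "SESSION_KEY") if not env.get(k)]
print("missing:", missing)
```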

7.1.4 - REST API

ClusterCockpit RESTful API Endpoint Reference

REST API Authorization

In ClusterCockpit, JWTs are signed with an Ed25519 public/private key pair. The signature certifies that the token was signed by the party holding the private key. JWTs in ClusterCockpit are not encrypted, which means that all information they contain is readable as clear text. The expiration of generated tokens can be configured in config.json using the max-age option in the jwts object. Example:

"jwts": {
    "max-age": "168h"
},

The party that generates and signs JWTs must possess the private key, and any party that accepts JWTs must possess the public key to validate them. cc-backend therefore requires both keys: the private key to sign generated tokens and the public key to validate tokens provided by REST API clients.
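
Because the tokens are signed but not encrypted, anyone holding a token can read its claims. The following Python sketch demonstrates this with the standard library only. It deliberately does not verify the signature, so it must never be used to decide whether a token is trustworthy; the helper names and the sample token are made up for illustration.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def peek_claims(token: str) -> dict:
    """Return the (unverified!) claims of a JWT."""
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))


def _b64url(obj) -> str:
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Made-up token with a dummy signature, just to show the clear-text payload.
token = ".".join([_b64url({"alg": "EdDSA", "typ": "JWT"}),
                  _b64url({"sub": "abcduser", "roles": ["api"]}),
                  "dummy-signature"])
print(peek_claims(token))
```

For this reason, tokens should be treated like passwords and never contain information that the holder is not supposed to see.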

Generate ED25519 key pairs

We provide a tool as part of cc-backend to generate an ED25519 keypair. The tool is called gen-keypair and is provided as part of the release binaries. You can easily build it yourself in the cc-backend source tree with:

go build ./tools/gen-keypair/

To use it just call it without any arguments:

./gen-keypair

Usage of Swagger UI documentation

Swagger UI is a REST API documentation and testing framework. To use the Swagger UI for testing you have to run an instance of cc-backend on localhost (and use the default port 8080):

./cc-backend -server

You may want to start the demo as described here. This Swagger UI is also available as part of cc-backend if you start it with the -dev option:

./cc-backend -server -dev

You may access it at this URL.

Swagger API Reference

Non-Interactive Documentation

This reference is rendered using the swaggerui plugin based on the original definition file found in the ClusterCockpit repository, but without a serving backend.

This means that all interactivity (“Try It Out”) will not return actual data. However, a Curl call and a compiled Request URL will still be displayed, if an API endpoint is executed.

Administrator API

Endpoints displayed here correspond to the administrator /api/ endpoints, but user-accessible /userapi/ endpoints are functionally identical. See these lists for information about accessibility.

7.1.5 - Authentication Handbook

How to configure and use the authentication backends

Introduction

cc-backend supports the following authentication methods:

Local login with credentials stored in SQL database
Login with authentication to a LDAP directory
Authentication via JSON Web Token (JWT):
- With token provided in HTTP request header
- With token provided in cookie
Login via OpenID Connect (against a KeyCloak instance)

All above methods create a session cookie that is then used for subsequent authentication of requests. Multiple authentication methods can be configured at the same time. If LDAP is enabled it takes precedence over local authentication. The OpenID Connect method against a KeyCloak instance enables many more authentication methods using the ability of KeyCloak to act as an Identity Broker.

The REST API uses stateless authentication via a JWT token, which means that every request must be authenticated.

General configuration options

All configuration is part of the cc-backend configuration file config.json. All security-sensitive options, such as passwords and tokens, are passed as environment variables. cc-backend can read a .env file upon startup and set the environment variables contained there.

Duration of session

By default, the maximum duration of a session is 7 days. To change this, set the option session-max-age to a string that can be parsed by the Golang time.ParseDuration() function. For most use cases the largest unit h is the only relevant option. Example:

"session-max-age": "24h",

To enable unlimited session duration set session-max-age either to 0 or empty string.
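
To check a duration string before putting it into config.json, the format can be approximated outside of Go. The following Python sketch is a simplified stand-in for time.ParseDuration(): it accepts sequences of unsigned decimal numbers with the units ns, us, µs, ms, s, m and h, but not signs or numbers without a leading digit. Treating an empty string or 0 as a zero duration follows the description above and is not a property of the Go function itself.

```python
import re
from datetime import timedelta

# Units of Go duration strings, expressed in seconds.
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
                 "s": 1.0, "m": 60.0, "h": 3600.0}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '168h' or '1h30m' into a timedelta."""
    if text in ("", "0"):
        return timedelta(0)  # interpreted as "no expiry" by the option above
    total, pos = 0.0, 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return timedelta(seconds=total)


print(parse_duration("168h"))
print(parse_duration("1h30m"))
```

There is no day unit, so a value such as 7d is rejected and has to be written as 168h.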

LDAP authentication

Configuration

To enable LDAP authentication the following set of options are required as attributes of the ldap JSON object:

url: URL of the LDAP directory server. This must be a complete URL including the protocol and not only the host name. Example: ldaps://ldsrv.mydomain.com.
user_base: Base DN of user tree root. Example: ou=people,ou=users,dc=rz,dc=mydomain,dc=com.
search_dn: DN for authenticating an LDAP admin account with general read rights. This is required for the sync on login and the sync options. Example: cn=monitoring,ou=adm,ou=profile,ou=manager,dc=rz,dc=mydomain,dc=com
user_bind: Expression used to authenticate users via LDAP bind. Must contain uid={username}. Example: uid={username},ou=people,ou=users,dc=rz,dc=mydomain,dc=com.
user_filter: Filter to extract users for syncing. Example: (&(objectclass=posixAccount)).

Optional configuration options are:

username_attr: Attribute with full user name. Defaults to gecos if not provided.
sync_interval: Interval used for syncing SQL user table with LDAP directory. Parsed using time.ParseDuration. The sync interval is always relative to the time cc-backend was started. Example: 24h.
sync_del_old_users: Type boolean. Delete users in SQL database if not in LDAP directory anymore. This of course only applies to users that were added from LDAP.
syncUserOnLogin: Type boolean. Add non-existent user to DB at login attempt if user exists in LDAP directory. This option allows users to log in immediately after they are added to the LDAP directory.

The LDAP authentication method requires the environment variable LDAP_ADMIN_PASSWORD for the search_dn account that is used to sync users.
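
The role of the {username} placeholder in user_bind can be shown with a small example. This Python sketch only illustrates the substitution and is not taken from cc-backend; `bind_dn` is a hypothetical helper, and a real implementation additionally has to take care of characters that need escaping in a DN.

```python
def bind_dn(user_bind: str, username: str) -> str:
    """Build the DN used for the LDAP bind of a given user."""
    if "uid={username}" not in user_bind:
        raise ValueError("user_bind must contain uid={username}")
    return user_bind.replace("{username}", username)


template = "uid={username},ou=people,ou=users,dc=rz,dc=mydomain,dc=com"
print(bind_dn(template, "fritz"))
```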

Usage

If LDAP is configured it is the first authentication method that is tried if a user logs in using the login form. A sync with the LDAP directory can also be triggered from the command line using the flag -sync-ldap.

OpenID Connect authentication

Configuration

To enable OpenID Connect authentication the following set of options are required below a top-level oidc key:

provider: The base URL of your OpenID Connect provider. Example: https://auth.example.com/realms/mycloud.

Full example:

"oidc": {
  "provider": "https://auth.server.com:8080/realms/nhr-cloud"
},

Furthermore the following environment variables have to be set (in the .env file):

OID_CLIENT_ID: Set this to the Client ID you configured in Keycloak.
OID_CLIENT_SECRET: Set this to the Client ID secret available in your Keycloak Open ID Client configuration.

Required settings in KeyCloak

The OpenID Connect implementation was only tested against the KeyCloak provider.

Steps to set up KeyCloak:

Create a new realm. This will determine the provider URL.
Create a new OpenID Connect client
Set a Client ID, the Client ID secret is automatically generated and available at the Credentials tab.
For Access settings set:
- Root URL: This is the base URL of your cc-backend instance.
- Valid redirect URLs: Set this to oidc-callback. Wildcards did not work for me.
- Web origins: Set this also to the base URL of your cc-backend instance.
  Keycloak client Access settings
Enable PKCE:
- Click on Advanced tab. Further click on Advanced settings on the right side.
- Set the option Proof Key for Code Exchange Code Challenge Method to S256.

Set PKCE Keycloak option — Keycloak advanced client settings for PKCE

Everything else can be left to the default. Do not forget to create users in your realm before testing.

Usage

If the oidc config key is correctly set and the required environment variables are available, an additional button for OpenID Connect Login is shown below the login mask. If pressed, this button will redirect to the OpenID Connect login.

OpenID Connect login mask — Login mask with OpenID Connect enabled

Local authentication

No configuration is required for local authentication.

Usage

You can add a user on the command line using the flag -add-user:

./cc-backend -add-user <username>:<roles>:<password>

Example:

./cc-backend -add-user fritz:admin,api:myPass

Roles can be admin, support, manager, api, and user.

Users can be deleted using the flag -del-user:

./cc-backend -del-user fritz

Warning

The option -del-user as currently implemented will delete ALL users that match the username independent of its origin. This means it will also delete user records that were added from LDAP or JWT tokens.

JWT token authentication

JSON web tokens are a standardized method for representing encoded claims securely between two parties. In ClusterCockpit they are used for authorization to use REST APIs as well as a method to delegate authentication to a third party. This section only describes JWT based authentication for initiating a user session.

Two variants exist:

[1] Session Authenticator: Passes JWT token in the HTTP header Authorization using the Bearer prefix or using the query key login-token.

Example for Authorization header:

Authorization: Bearer S0VLU0UhIExFQ0tFUiEK

Example for query key used as form action in external application:

<form method="post" action="$CCROOT/jwt-login?login-token=S0VLU0UhIExFQ0tFUiEK" target="_blank">
  <button type="submit">Access CC</button>
</form>

[2] Cookie Session Authenticator: Reads the JWT token from a named cookie provided by the request, which is deleted after the session was successfully initiated. This is a more secure alternative to the standard header based solution.

JWT Configuration

[0] Basic required configuration:

In order to enable JWT based transactions generally, the following has to be true:

The jwts JSON object has to exist within config.json, even if no other attribute is set within.
- We recommend setting the max-age attribute: Specifies for how long a JWT token shall be valid, defined as a string parsable by time.ParseDuration().
- This will only affect JWTs generated by ClusterCockpit, e.g. for the use with REST-API endpoints.

In addition, the following environment variables are used:

JWT_PRIVATE_KEY: The application's own private key to be used with JWT transactions. Required for cookie based logins and REST-API communication.
JWT_PUBLIC_KEY: The application's own public key to be used with JWT transactions. Required for cookie based logins and REST-API communication.

[1] Configuration for JWT Session Authenticator:

Compatible signing methods are: HS256, HS512

Only a shared (symmetric) key saved as environment variable CROSS_LOGIN_JWT_HS512_KEY is required.

[2] Configuration for JWT Cookie Session Authenticator:

Tokens are signed with: Ed25519/EdDSA

To enable JWT authentication via cookie the following set of options are required as attributes of the jwts JSON object:

cookieName (String): Specifies which cookie should be checked for a JWT token (if no authorization header is present)
trustedIssuer (String): Specifies which issuer should be accepted when validating external JWTs (iss-claim)

In addition, the Cookie Session Authenticator method requires the following environment variable:

CROSS_LOGIN_JWT_PUBLIC_KEY: Primary public key for this method, validates identity of tokens received from trustedIssuer and must therefore match accordingly.

[3] Optional configuration attributes of the jwts JSON object, valid for both [1] and [2], are:

validateUser (Bool): Load user by username encoded in sub-claim from database, including roles, denying login if not matched in database. Ignores all other claims. By design not combinable with both syncUserOnLogin and/or updateUserOnLogin options.
syncUserOnLogin (Bool): If user encoded in token does not exist in database, add a new user entry. Does not update user on recurring JWT logins.
updateUserOnLogin (Bool): If user encoded in token does exist in database, update the user entry with all encoded information. Does not add users on first-time JWT login.

JWT Usage

[1] Usage for JWT Session Authenticator:

The endpoint for initiating JWT logins in ClusterCockpit is /jwt-login

For login with JWT Header, the header has to include the Authorization: Bearer $TOKEN information when accessing this endpoint. For login with JWT request parameter, the external website has to submit an action with the parameter ?login-token=$TOKEN (See example above).

In both cases, the JWT should contain the following parameters:

sub: The subject, in this case this is the username. Will be used for user matching if validateUser is set.
exp: Expiration in Unix epoch time. Can be small as the token is only used during login.
name: The full name of the person assigned to this account. Will be used to update user table.
roles: String array with roles of user.
projects: [Optional] String array with projects of user. Relevant if user has manager-role.
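
For the Session Authenticator, an external application has to create such a token itself. The following Python sketch builds an HS512-signed JWT with the claims listed above, using only the standard library. It is an illustration and not code from ClusterCockpit: `make_login_token` is a hypothetical helper, the key is passed in as raw bytes, and how the value of CROSS_LOGIN_JWT_HS512_KEY has to be encoded is not covered here, so check the JWT documentation for that detail.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """base64url without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_login_token(key: bytes, username: str, name: str, roles: list,
                     projects=None, ttl_s: int = 60, now=None) -> str:
    """Create a short-lived HS512 JWT carrying the login claims."""
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS512", "typ": "JWT"}
    claims = {"sub": username, "exp": issued + ttl_s, "name": name, "roles": roles}
    if projects:
        claims["projects"] = projects  # only relevant for the manager role
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha512).digest()
    return signing_input + "." + _b64url(signature)


# Placeholder key for the example only, never use a fixed key in production.
token = make_login_token(b"k" * 64, "abcduser", "Abcd User", ["user"])
print("Authorization: Bearer " + token)
```

A short ttl_s is sufficient because, as noted for the exp claim, the token is only used during login.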

[2] Usage for JWT Cookie Session Authenticator:

The token must be set within a cookie with a name matching the configured cookieName.

The JWT should then contain the following parameters:

sub: The subject, in this case this is the username. Will be used for user matching if validateUser is set.
exp: Expiration in Unix epoch time. The lifetime can be short, as the token is only used during login.
name: The full name of the person assigned to this account. Will be used to update user table.
roles: String array with roles of user.
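
The claim set described above can be assembled as a plain dictionary before it is handed to a JWT library for signing. This is a sketch under assumptions: the function name build_login_claims, the default lifetime of 60 seconds, and the example values are hypothetical, and signing is intentionally left out.

```python
import time


def build_login_claims(username, full_name, roles, projects=None,
                       lifetime_s=60, now=None):
    """Return the JWT payload for a login token as a dict (unsigned)."""
    issued = int(time.time()) if now is None else int(now)
    claims = {
        "sub": username,              # used for user matching
        "exp": issued + lifetime_s,   # Unix epoch time, short-lived
        "name": full_name,            # full name of the account holder
        "roles": list(roles),         # string array with roles
    }
    if projects is not None:
        # Optional; relevant for users with the manager role.
        claims["projects"] = list(projects)
    return claims
```

For example, build_login_claims("alice", "Alice Example", ["manager"], projects=["project-a"]) yields a payload with all five parameters; omitting projects yields the four parameters listed for the cookie variant.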

Authorization control

cc-backend uses roles to decide if a user is authorized to access certain information. The roles and their rights are described in more detail here.

7.1.6 - Job Archive Handbook

All you need to know about the ClusterCockpit Job Archive

The job archive specifies an exchange format for job meta and performance metric data. It consists of two parts:

a JSON file format
a directory hierarchy / key specification

Using an open, portable, and simple specification based on JSON objects makes it possible to exchange job performance data for research and analysis, and provides a robust way to archive that data.

The current release supports new SQLite and S3 object store based job archive backends. These are still experimental; for production we recommend the proven file-based job archive. One major disadvantage of the file-based backend is that it consumes a large number of inodes when the job count is large.

Trying the new job-archive backends

We provide the archive-manager tool, which converts between the different job-archive formats. With it you can convert an existing file-based job archive into either the SQLite or the S3 variant. Be aware that this may take a long time for large archives. Details on how to use the tool can be found in the archive-manager reference documentation.

Specification for file path / key

To limit the number of entries within a single directory, a tree approach is used that splits the integer job ID into chunks of 1000. Usually two layers of directories are sufficient, but the concept can be extended to an arbitrary number of layers.

For a 2 layer schema this can be achieved with (code example in Perl):

$level1 = $jobID/1000;
$level2 = $jobID%1000;
$dstPath = sprintf("%s/%s/%d/%03d", $trunk, $destdir, $level1, $level2);

For the SQLite and S3 object store based backends this layering is not needed, but it is kept so that the naming stays consistent. What is the path in the file-based backend is used as the object key or column value in those backends.

Example

For the job ID 1034871 on cluster large with start time 1768978339 the key is ./large/1034/871/1768978339.
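
The same two-layer split can be written in Python. This sketch mirrors the Perl lines above for non-negative integer job IDs; the function name job_archive_key is hypothetical, and the key is returned relative to the archive root, without the leading ./.

```python
def job_archive_key(cluster: str, job_id: int, start_time: int) -> str:
    """Return the two-layer archive path/key for a job."""
    # Integer division gives the first layer, the remainder the second.
    level1, level2 = divmod(job_id, 1000)
    # The second layer is zero-padded to three digits, as with %03d in Perl.
    return f"{cluster}/{level1}/{level2:03d}/{start_time}"


print(job_archive_key("large", 1034871, 1768978339))
# prints: large/1034/871/1768978339
```

The zero padding matters for job IDs whose last three digits are small: job 1034005 ends up in 1034/005, not 1034/5.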

Create a Job archive from scratch

If you place the job archive in the ./var folder, create the directory with:

mkdir -p ./var/job-archive

The job archive is versioned; the current version is documented in the Release Notes. Currently you have to create the version file manually when initializing the job archive:

echo 3 > ./var/job-archive/version.txt

Directory layout

ClusterCockpit supports multiple clusters. For each cluster you need to create a directory named after the cluster and a cluster.json file specifying the metric list and the hardware partitions within the cluster. Hardware partitions are subsets of a cluster with homogeneous hardware (CPU type, memory capacity, GPUs); they are called subclusters in ClusterCockpit.

For an example setup with the clusters fritz, alex, and woody, the job archive directory hierarchy looks as follows:

./var/job-archive/
     version.txt
     fritz/
        cluster.json
     alex/
        cluster.json
     woody/
        cluster.json
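
The skeleton of such a hierarchy can also be created with a short script. This is a sketch under assumptions: the function name init_job_archive is hypothetical, and the version number has to be taken from the Release Notes. It writes version.txt and one directory per cluster, but deliberately does not create the cluster.json files, since those have to be provided by the administrator.

```python
from pathlib import Path


def init_job_archive(root, clusters, version):
    """Create the archive root, version.txt, and one directory per cluster."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    # Same content as produced by: echo <version> > version.txt
    (root / "version.txt").write_text(f"{version}\n")
    for name in clusters:
        (root / name).mkdir(exist_ok=True)
    return root
```

A call such as init_job_archive("./var/job-archive", ["fritz", "alex", "woody"], 3) reproduces the layout above apart from the cluster.json files.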

Note

The cluster.json files currently have to be provided and maintained by the administrator!

For help with creating a cluster.json file, see the How to create a cluster.json file guide.

JSON file format

Overview

Every cluster must be configured in a cluster.json file.

The job data consists of two files:

meta.json: Contains job meta information and job statistics.
data.json: Contains complete job data with time series

The JSON format specification is available as a JSON Schema (https://json-schema.org/) file. The latest version of the JSON schema is part of the cc-backend source tree. For external reference it is also available in a separate repository.

Specification `cluster.json`

The JSON schema specification in its raw format is available at the cc-lib GitHub repository. A variant rendered for better readability is found in the references.

Specification `meta.json`

The JSON schema specification in its raw format is available at the cc-lib GitHub repository. A variant rendered for better readability is found in the references.

Specification `data.json`

The JSON schema specification in its raw format is available at the cc-lib GitHub repository. A variant rendered for better readability is found in the references.

Metric time series data is stored with a fixed time step. The time step is set per metric. If no value is available for a timestamp in a metric time series, null is entered.
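
Consumers of the data therefore have to reconstruct timestamps from the start time and the time step, and have to tolerate null entries. The sketch below assumes a series has already been extracted as a flat JSON array. The function names are hypothetical and the actual field layout is defined by the schema, not by this example.

```python
import json


def timestamps(start: int, timestep: int, count: int) -> list:
    """Timestamps implied by a fixed time step, one per stored value."""
    return [start + i * timestep for i in range(count)]


def mean_ignoring_gaps(values):
    """Average of a series, skipping null (None) entries; None if all are gaps."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# JSON null becomes None when parsed.
series = json.loads("[1.0, null, 3.0, 5.0]")
print(timestamps(1768978339, 60, len(series)))
print(mean_ignoring_gaps(series))  # prints: 3.0
```

Skipping gaps instead of treating them as zero keeps a missing sample from dragging down the average.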

7.1.7 - Schemas

ClusterCockpit Schema References

ClusterCockpit Schema References for

Application Configuration
Cluster Configuration
Job Data
Job Statistics
Units
Job Archive Job Metadata
Job Archive Job Metricdata

The schemas in their raw form can be found in the ClusterCockpit GitHub repository.

Manual Updates

Changes to the original JSON schemas found in the repository are not automatically rendered in this reference documentation.

The raw JSON schemas are parsed and rendered for better readability using the json-schema-for-humans utility.

Last Update: 04.12.2024

7.1.7.1 - Application Config Schema

ClusterCockpit Application Config Schema Reference

A detailed description of each of the application configuration options can be found in the config documentation.

The following schema in its raw form can be found in the ClusterCockpit GitHub repository.

Manual Updates

Changes to the original JSON schema found in the repository are not automatically rendered in this reference documentation.

Last Update: 04.12.2024

cc-backend configuration file schema

Title: cc-backend configuration file schema


Type	`object`
Required	No
Additional properties	Any type allowed

Property	Pattern	Type	Deprecated	Definition	Title/Description
- addr	No	string	No	-	Address where the http (or https) server will listen on (for example: ’localhost:80’).
- apiAllowedIPs	No	array of string	No	-	Addresses from which secured API endpoints can be reached
- user	No	string	No	-	Drop root permissions once .env was read and the port was taken. Only applicable if using privileged port.
- group	No	string	No	-	Drop root permissions once .env was read and the port was taken. Only applicable if using privileged port.
- disable-authentication	No	boolean	No	-	Disable authentication (for everything: API, Web-UI, …).
- embed-static-files	No	boolean	No	-	If all files in `web/frontend/public` should be served from within the binary itself (they are embedded) or not.
- static-files	No	string	No	-	Folder where static assets can be found, if embed-static-files is false.
- db-driver	No	enum (of string)	No	-	sqlite3 or mysql (mysql will work for mariadb as well).
- db	No	string	No	-	For sqlite3 a filename, for mysql a DSN in this format: https://github.com/go-sql-driver/mysql#dsn-data-source-name (Without query parameters!).
- archive	No	object	No	-	Configuration keys for job-archive
- disable-archive	No	boolean	No	-	Keep all metric data in the metric data repositories, do not write to the job-archive.
- validate	No	boolean	No	-	Validate all input json documents against json schema.
- session-max-age	No	string	No	-	Specifies for how long a session shall be valid as a string parsable by time.ParseDuration(). If 0 or empty, the session/token does not expire!
- https-cert-file	No	string	No	-	Filepath to SSL certificate. If also https-key-file is set use HTTPS using those certificates.
- https-key-file	No	string	No	-	Filepath to SSL key file. If also https-cert-file is set use HTTPS using those certificates.
- redirect-http-to	No	string	No	-	If not the empty string and addr does not end in :80, redirect every request incoming at port 80 to that url.
- stop-jobs-exceeding-walltime	No	integer	No	-	If not zero, automatically mark jobs as stopped running X seconds longer than their walltime. Only applies if walltime is set for job.
- short-running-jobs-duration	No	integer	No	-	Do not show running jobs shorter than X seconds.
- emission-constant	No	integer	No	-	.
- cron-frequency	No	object	No	-	Frequency of cron job workers.
- enable-resampling	No	object	No	-	Enable dynamic zoom in frontend metric plots.
+ jwts	No	object	No	-	For JWT token authentication.
- oidc	No	object	No	-	-
- ldap	No	object	No	-	For LDAP Authentication and user synchronisation.
+ clusters	No	array of object	No	-	Configuration for the clusters to be displayed.
- ui-defaults	No	object	No	-	Default configuration for web UI

1. Property `cc-backend configuration file schema > addr`


Type	`string`
Required	No

Description: Address where the http (or https) server will listen on (for example: ’localhost:80’).

2. Property `cc-backend configuration file schema > apiAllowedIPs`


Type	`array of string`
Required	No

Description: Addresses from which secured API endpoints can be reached

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
apiAllowedIPs items	-

2.1. cc-backend configuration file schema > apiAllowedIPs > apiAllowedIPs items


Type	`string`
Required	No

3. Property `cc-backend configuration file schema > user`


Type	`string`
Required	No

Description: Drop root permissions once .env was read and the port was taken. Only applicable if using privileged port.

4. Property `cc-backend configuration file schema > group`


Type	`string`
Required	No

Description: Drop root permissions once .env was read and the port was taken. Only applicable if using privileged port.

5. Property `cc-backend configuration file schema > disable-authentication`


Type	`boolean`
Required	No

Description: Disable authentication (for everything: API, Web-UI, …).

6. Property `cc-backend configuration file schema > embed-static-files`


Type	`boolean`
Required	No

Description: If all files in web/frontend/public should be served from within the binary itself (they are embedded) or not.

7. Property `cc-backend configuration file schema > static-files`


Type	`string`
Required	No

Description: Folder where static assets can be found, if embed-static-files is false.

8. Property `cc-backend configuration file schema > db-driver`


Type	`enum (of string)`
Required	No

Description: sqlite3 or mysql (mysql will work for mariadb as well).

Must be one of:

“sqlite3”
“mysql”

9. Property `cc-backend configuration file schema > db`


Type	`string`
Required	No

Description: For sqlite3 a filename, for mysql a DSN in this format: https://github.com/go-sql-driver/mysql#dsn-data-source-name (Without query parameters!).

10. Property `cc-backend configuration file schema > archive`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Configuration keys for job-archive

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ kind	No	enum (of string)	No	-	Backend type for job-archive
- path	No	string	No	-	Path to job archive for file backend
- compression	No	integer	No	-	Setup automatic compression for jobs older than number of days
- retention	No	object	No	-	Configuration keys for retention

10.1. Property `cc-backend configuration file schema > archive > kind`


Type	`enum (of string)`
Required	Yes

Description: Backend type for job-archive

Must be one of:

“file”
“s3”

10.2. Property `cc-backend configuration file schema > archive > path`


Type	`string`
Required	No

Description: Path to job archive for file backend

10.3. Property `cc-backend configuration file schema > archive > compression`


Type	`integer`
Required	No

Description: Setup automatic compression for jobs older than number of days

10.4. Property `cc-backend configuration file schema > archive > retention`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Configuration keys for retention

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ policy	No	enum (of string)	No	-	Retention policy
- includeDB	No	boolean	No	-	Also remove jobs from database
- age	No	integer	No	-	Act on jobs with startTime older than age (in days)
- location	No	string	No	-	The target directory for retention. Only applicable for retention move.

10.4.1. Property `cc-backend configuration file schema > archive > retention > policy`


Type	`enum (of string)`
Required	Yes

Description: Retention policy

Must be one of:

“none”
“delete”
“move”

10.4.2. Property `cc-backend configuration file schema > archive > retention > includeDB`


Type	`boolean`
Required	No

Description: Also remove jobs from database

10.4.3. Property `cc-backend configuration file schema > archive > retention > age`


Type	`integer`
Required	No

Description: Act on jobs with startTime older than age (in days)

10.4.4. Property `cc-backend configuration file schema > archive > retention > location`


Type	`string`
Required	No

Description: The target directory for retention. Only applicable for retention move.
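
As an illustration of the archive object, the following sketch builds an example configuration and checks it against the enumerations rendered above. The function name check_archive_config and the numeric values are hypothetical, and the allowed values are copied from this rendered schema, which may lag behind the current release.

```python
import json

ARCHIVE_KINDS = {"file", "s3"}
RETENTION_POLICIES = {"none", "delete", "move"}


def check_archive_config(archive: dict) -> list:
    """Return a list of problems found in an archive configuration object."""
    problems = []
    if archive.get("kind") not in ARCHIVE_KINDS:
        problems.append("kind is required and must be one of: file, s3")
    retention = archive.get("retention")
    if retention is not None:
        policy = retention.get("policy")
        if policy not in RETENTION_POLICIES:
            problems.append("retention.policy is required: none, delete, or move")
        elif "location" in retention and policy != "move":
            problems.append("retention.location only applies to policy move")
    return problems


example_archive = {
    "kind": "file",
    "path": "./var/job-archive",
    "compression": 7,  # illustrative: compress jobs older than 7 days
    "retention": {"policy": "delete", "age": 365, "includeDB": True},
}
print(json.dumps({"archive": example_archive}, indent=2))
print(check_archive_config(example_archive))  # prints: []
```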

11. Property `cc-backend configuration file schema > disable-archive`


Type	`boolean`
Required	No

Description: Keep all metric data in the metric data repositories, do not write to the job-archive.

12. Property `cc-backend configuration file schema > validate`


Type	`boolean`
Required	No

Description: Validate all input json documents against json schema.

13. Property `cc-backend configuration file schema > session-max-age`


Type	`string`
Required	No

Description: Specifies for how long a session shall be valid as a string parsable by time.ParseDuration(). If 0 or empty, the session/token does not expire!
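
Several options in this schema (session-max-age, the cron-frequency workers, jwts max-age, ldap sync_interval) take strings in the format of Go's time.ParseDuration, for example 5m or 1h30m. To sanity-check such values outside of Go, a simplified Python approximation could look like the sketch below. It is not the Go implementation: the function name is hypothetical, overflow handling is omitted, and the empty string is rejected here, so the "empty means no expiry" case has to be handled by the caller.

```python
import re

# Seconds per unit, for the unit suffixes accepted by Go's time.ParseDuration.
_UNIT_SECONDS = {
    "ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
    "s": 1.0, "m": 60.0, "h": 3600.0,
}
# One "<number><unit>" group; "ms" is listed before "m" so it is tried first.
_GROUP = re.compile(r"(\d+(?:\.\d*)?|\.\d+)(ns|us|µs|ms|s|m|h)")


def parse_duration_seconds(text: str) -> float:
    """Parse a Go-style duration string such as '1h30m' into seconds."""
    if not text:
        raise ValueError("empty duration")
    sign = -1.0 if text[0] == "-" else 1.0
    body = text[1:] if text[0] in "+-" else text
    if body == "0":  # a bare zero is valid without a unit
        return 0.0
    if not body:
        raise ValueError(f"invalid duration: {text!r}")
    total, pos = 0.0, 0
    while pos < len(body):
        match = _GROUP.match(body, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    return sign * total
```

A value such as 10 without a unit is rejected, which matches the behaviour of time.ParseDuration for everything except a bare 0.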

14. Property `cc-backend configuration file schema > https-cert-file`


Type	`string`
Required	No

Description: Filepath to SSL certificate. If also https-key-file is set use HTTPS using those certificates.

15. Property `cc-backend configuration file schema > https-key-file`


Type	`string`
Required	No

Description: Filepath to SSL key file. If also https-cert-file is set use HTTPS using those certificates.

16. Property `cc-backend configuration file schema > redirect-http-to`


Type	`string`
Required	No

Description: If not the empty string and addr does not end in :80, redirect every request incoming at port 80 to that url.

17. Property `cc-backend configuration file schema > stop-jobs-exceeding-walltime`


Type	`integer`
Required	No

Description: If not zero, automatically mark jobs as stopped running X seconds longer than their walltime. Only applies if walltime is set for job.

18. Property `cc-backend configuration file schema > short-running-jobs-duration`


Type	`integer`
Required	No

Description: Do not show running jobs shorter than X seconds.

19. Property `cc-backend configuration file schema > emission-constant`


Type	`integer`
Required	No

Description: .

20. Property `cc-backend configuration file schema > cron-frequency`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Frequency of cron job workers.

Property	Pattern	Type	Deprecated	Definition	Title/Description
- duration-worker	No	string	No	-	Duration Update Worker [Defaults to ‘5m’]
- footprint-worker	No	string	No	-	Metric-Footprint Update Worker [Defaults to ‘10m’]

20.1. Property `cc-backend configuration file schema > cron-frequency > duration-worker`


Type	`string`
Required	No

Description: Duration Update Worker [Defaults to ‘5m’]

20.2. Property `cc-backend configuration file schema > cron-frequency > footprint-worker`


Type	`string`
Required	No

Description: Metric-Footprint Update Worker [Defaults to ‘10m’]

21. Property `cc-backend configuration file schema > enable-resampling`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Enable dynamic zoom in frontend metric plots.

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ trigger	No	integer	No	-	Trigger next zoom level at less than this many visible datapoints.
+ resolutions	No	array of integer	No	-	Array of resampling target resolutions, in seconds.

21.1. Property `cc-backend configuration file schema > enable-resampling > trigger`


Type	`integer`
Required	Yes

Description: Trigger next zoom level at less than this many visible datapoints.

21.2. Property `cc-backend configuration file schema > enable-resampling > resolutions`


Type	`array of integer`
Required	Yes

Description: Array of resampling target resolutions, in seconds.

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
resolutions items	-

21.2.1. cc-backend configuration file schema > enable-resampling > resolutions > resolutions items


Type	`integer`
Required	No

22. Property `cc-backend configuration file schema > jwts`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: For JWT token authentication.

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ max-age	No	string	No	-	Configure how long a token is valid. As string parsable by time.ParseDuration()
- cookieName	No	string	No	-	Cookie that should be checked for a JWT token.
- validateUser	No	boolean	No	-	Deny login for users not in database (but defined in JWT). Overwrite roles in JWT with database roles.
- trustedIssuer	No	string	No	-	Issuer that should be accepted when validating external JWTs
- syncUserOnLogin	No	boolean	No	-	Add non-existent user to DB at login attempt with values provided in JWT.

22.1. Property `cc-backend configuration file schema > jwts > max-age`


Type	`string`
Required	Yes

Description: Configure how long a token is valid. As string parsable by time.ParseDuration()

22.2. Property `cc-backend configuration file schema > jwts > cookieName`


Type	`string`
Required	No

Description: Cookie that should be checked for a JWT token.

22.3. Property `cc-backend configuration file schema > jwts > validateUser`


Type	`boolean`
Required	No

Description: Deny login for users not in database (but defined in JWT). Overwrite roles in JWT with database roles.

22.4. Property `cc-backend configuration file schema > jwts > trustedIssuer`


Type	`string`
Required	No

Description: Issuer that should be accepted when validating external JWTs

22.5. Property `cc-backend configuration file schema > jwts > syncUserOnLogin`


Type	`boolean`
Required	No

Description: Add non-existent user to DB at login attempt with values provided in JWT.

23. Property `cc-backend configuration file schema > oidc`


Type	`object`
Required	No
Additional properties	Any type allowed

23.1. The following properties are required

provider

24. Property `cc-backend configuration file schema > ldap`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: For LDAP Authentication and user synchronisation.

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ url	No	string	No	-	URL of LDAP directory server.
+ user_base	No	string	No	-	Base DN of user tree root.
+ search_dn	No	string	No	-	DN for authenticating LDAP admin account with general read rights.
+ user_bind	No	string	No	-	Expression used to authenticate users via LDAP bind. Must contain uid={username}.
+ user_filter	No	string	No	-	Filter to extract users for syncing.
- username_attr	No	string	No	-	Attribute with full username. Default: gecos
- sync_interval	No	string	No	-	Interval used for syncing local user table with LDAP directory. Parsed using time.ParseDuration.
- sync_del_old_users	No	boolean	No	-	Delete obsolete users in database.
- syncUserOnLogin	No	boolean	No	-	Add non-existent user to DB at login attempt if user exists in LDAP directory

24.1. Property `cc-backend configuration file schema > ldap > url`


Type	`string`
Required	Yes

Description: URL of LDAP directory server.

24.2. Property `cc-backend configuration file schema > ldap > user_base`


Type	`string`
Required	Yes

Description: Base DN of user tree root.

24.3. Property `cc-backend configuration file schema > ldap > search_dn`


Type	`string`
Required	Yes

Description: DN for authenticating LDAP admin account with general read rights.

24.4. Property `cc-backend configuration file schema > ldap > user_bind`


Type	`string`
Required	Yes

Description: Expression used to authenticate users via LDAP bind. Must contain uid={username}.

24.5. Property `cc-backend configuration file schema > ldap > user_filter`


Type	`string`
Required	Yes

Description: Filter to extract users for syncing.

24.6. Property `cc-backend configuration file schema > ldap > username_attr`


Type	`string`
Required	No

Description: Attribute with full username. Default: gecos

24.7. Property `cc-backend configuration file schema > ldap > sync_interval`


Type	`string`
Required	No

Description: Interval used for syncing local user table with LDAP directory. Parsed using time.ParseDuration.

24.8. Property `cc-backend configuration file schema > ldap > sync_del_old_users`


Type	`boolean`
Required	No

Description: Delete obsolete users in database.

24.9. Property `cc-backend configuration file schema > ldap > syncUserOnLogin`


Type	`boolean`
Required	No

Description: Add non-existent user to DB at login attempt if user exists in LDAP directory
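
The required keys and the user_bind constraint of the ldap object can be checked with a few lines. A minimal sketch: the function name check_ldap_config is hypothetical, and all host names and DNs in the example are placeholders.

```python
LDAP_REQUIRED = ("url", "user_base", "search_dn", "user_bind", "user_filter")


def check_ldap_config(ldap: dict) -> list:
    """Return a list of problems found in an ldap configuration object."""
    problems = [f"missing required key: {key}"
                for key in LDAP_REQUIRED if key not in ldap]
    # The bind expression must contain the literal placeholder uid={username}.
    if "user_bind" in ldap and "uid={username}" not in ldap["user_bind"]:
        problems.append("user_bind must contain uid={username}")
    return problems


example_ldap = {
    "url": "ldaps://ldap.example.org",            # placeholder host
    "user_base": "ou=people,dc=example,dc=org",   # placeholder DNs
    "search_dn": "cn=reader,dc=example,dc=org",
    "user_bind": "uid={username},ou=people,dc=example,dc=org",
    "user_filter": "(objectClass=posixAccount)",
    "sync_interval": "24h",                       # parsed by time.ParseDuration
}
print(check_ldap_config(example_ldap))  # prints: []
```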

25. Property `cc-backend configuration file schema > clusters`


Type	`array of object`
Required	Yes

Description: Configuration for the clusters to be displayed.

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
clusters items	-

25.1. cc-backend configuration file schema > clusters > clusters items


Type	`object`
Required	No
Additional properties	Any type allowed

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ name	No	string	No	-	The name of the cluster.
+ metricDataRepository	No	object	No	-	Type of the metric data repository for this cluster
+ filterRanges	No	object	No	-	This option controls the slider ranges for the UI controls of numNodes, duration, and startTime.

25.1.1. Property `cc-backend configuration file schema > clusters > clusters items > name`


Type	`string`
Required	Yes

Description: The name of the cluster.

25.1.2. Property `cc-backend configuration file schema > clusters > clusters items > metricDataRepository`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: Type of the metric data repository for this cluster

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ kind	No	enum (of string)	No	-	-
+ url	No	string	No	-	-
- token	No	string	No	-	-

25.1.2.1. Property `cc-backend configuration file schema > clusters > clusters items > metricDataRepository > kind`


Type	`enum (of string)`
Required	Yes

Must be one of:

“influxdb”
“prometheus”
“cc-metric-store”
“test”

25.1.2.2. Property `cc-backend configuration file schema > clusters > clusters items > metricDataRepository > url`


Type	`string`
Required	Yes

25.1.2.3. Property `cc-backend configuration file schema > clusters > clusters items > metricDataRepository > token`


Type	`string`
Required	No

25.1.3. Property `cc-backend configuration file schema > clusters > clusters items > filterRanges`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: This option controls the slider ranges for the UI controls of numNodes, duration, and startTime.

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ numNodes	No	object	No	-	UI slider range for number of nodes
+ duration	No	object	No	-	UI slider range for duration
+ startTime	No	object	No	-	UI slider range for start time

25.1.3.1. Property `cc-backend configuration file schema > clusters > clusters items > filterRanges > numNodes`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: UI slider range for number of nodes

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ from	No	integer	No	-	-
+ to	No	integer	No	-	-

25.1.3.1.1. Property `cc-backend configuration file schema > clusters > clusters items > filterRanges > numNodes > from`


Type	`integer`
Required	Yes

25.1.3.1.2. Property `cc-backend configuration file schema > clusters > clusters items > filterRanges > numNodes > to`


Type	`integer`
Required	Yes

25.1.3.2. Property `cc-backend configuration file schema > clusters > clusters items > filterRanges > duration`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: UI slider range for duration

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ from	No	integer	No	-	-
+ to	No	integer	No	-	-

25.1.3.2.1. Property `cc-backend configuration file schema > clusters > clusters items > filterRanges > duration > from`


Type	`integer`
Required	Yes

25.1.3.2.2. Property `cc-backend configuration file schema > clusters > clusters items > filterRanges > duration > to`


Type	`integer`
Required	Yes

25.1.3.3. Property `cc-backend configuration file schema > clusters > clusters items > filterRanges > startTime`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: UI slider range for start time

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ from	No	string	No	-	-
+ to	No	null	No	-	-

25.1.3.3.1. Property `cc-backend configuration file schema > clusters > clusters items > filterRanges > startTime > from`


Type	`string`
Required	Yes
Format	`date-time`

25.1.3.3.2. Property `cc-backend configuration file schema > clusters > clusters items > filterRanges > startTime > to`


Type	`null`
Required	Yes
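
Putting the required properties of a clusters item together, an entry could look like the example in the sketch below. The checker function, the URL, and the range values are hypothetical. The rules it applies (allowed repository kinds, integer ranges, a date-time string for startTime from, and null for startTime to) are taken from the schema tables above.

```python
from datetime import datetime

REPOSITORY_KINDS = {"influxdb", "prometheus", "cc-metric-store", "test"}


def check_cluster_item(item: dict) -> list:
    """Return a list of problems found in one entry of the clusters array."""
    problems = []
    for key in ("name", "metricDataRepository", "filterRanges"):
        if key not in item:
            problems.append(f"missing required key: {key}")

    repo = item.get("metricDataRepository", {})
    if repo.get("kind") not in REPOSITORY_KINDS:
        problems.append("metricDataRepository.kind is not an allowed value")
    if "url" not in repo:
        problems.append("metricDataRepository.url is required")

    ranges = item.get("filterRanges", {})
    for key in ("numNodes", "duration"):
        bounds = ranges.get(key, {})
        if not all(isinstance(bounds.get(b), int) for b in ("from", "to")):
            problems.append(f"filterRanges.{key} needs integer from and to")

    start = ranges.get("startTime", {})
    try:
        # fromisoformat() in older Python versions rejects a trailing "Z".
        datetime.fromisoformat(str(start["from"]).replace("Z", "+00:00"))
    except (KeyError, ValueError):
        problems.append("filterRanges.startTime.from must be a date-time string")
    if "to" not in start or start["to"] is not None:
        problems.append("filterRanges.startTime.to must be present and null")
    return problems


example_cluster = {
    "name": "fritz",
    "metricDataRepository": {"kind": "cc-metric-store",
                             "url": "http://localhost:8082"},
    "filterRanges": {
        "numNodes": {"from": 1, "to": 64},
        "duration": {"from": 0, "to": 86400},
        "startTime": {"from": "2023-01-01T00:00:00Z", "to": None},
    },
}
print(check_cluster_item(example_cluster))  # prints: []
```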

26. Property `cc-backend configuration file schema > ui-defaults`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Default configuration for web UI

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ plot_general_colorBackground	No	boolean	No	-	Color plot background according to job average threshold limits
+ plot_general_lineWidth	No	integer	No	-	Initial linewidth
+ plot_list_jobsPerPage	No	integer	No	-	Jobs shown per page in job lists
+ plot_view_plotsPerRow	No	integer	No	-	Number of plots per row in single job view
+ plot_view_showPolarplot	No	boolean	No	-	Option to toggle polar plot in single job view
+ plot_view_showRoofline	No	boolean	No	-	Option to toggle roofline plot in single job view
+ plot_view_showStatTable	No	boolean	No	-	Option to toggle the node statistic table in single job view
+ system_view_selectedMetric	No	string	No	-	Initial metric shown in system view
+ job_view_showFootprint	No	boolean	No	-	Option to toggle footprint ui in single job view
+ job_list_usePaging	No	boolean	No	-	Option to switch from continuous scroll to paging
+ analysis_view_histogramMetrics	No	array of string	No	-	Metrics to show as job count histograms in analysis view
+ analysis_view_scatterPlotMetrics	No	array of array	No	-	Initial scatter plot configuration in analysis view
+ job_view_nodestats_selectedMetrics	No	array of string	No	-	Initial metrics shown in node statistics table of single job view
+ job_view_selectedMetrics	No	array of string	No	-	-
+ plot_general_colorscheme	No	array of string	No	-	Initial color scheme
+ plot_list_selectedMetrics	No	array of string	No	-	Initial metric plots shown in jobs lists

26.1. Property `cc-backend configuration file schema > ui-defaults > plot_general_colorBackground`


Type	`boolean`
Required	Yes

Description: Color plot background according to job average threshold limits

26.2. Property `cc-backend configuration file schema > ui-defaults > plot_general_lineWidth`


Type	`integer`
Required	Yes

Description: Initial linewidth

26.3. Property `cc-backend configuration file schema > ui-defaults > plot_list_jobsPerPage`


Type	`integer`
Required	Yes

Description: Jobs shown per page in job lists

26.4. Property `cc-backend configuration file schema > ui-defaults > plot_view_plotsPerRow`


Type	`integer`
Required	Yes

Description: Number of plots per row in single job view

26.5. Property `cc-backend configuration file schema > ui-defaults > plot_view_showPolarplot`


Type	`boolean`
Required	Yes

Description: Option to toggle polar plot in single job view

26.6. Property `cc-backend configuration file schema > ui-defaults > plot_view_showRoofline`


Type	`boolean`
Required	Yes

Description: Option to toggle roofline plot in single job view

26.7. Property `cc-backend configuration file schema > ui-defaults > plot_view_showStatTable`


Type	`boolean`
Required	Yes

Description: Option to toggle the node statistic table in single job view

26.8. Property `cc-backend configuration file schema > ui-defaults > system_view_selectedMetric`


Type	`string`
Required	Yes

Description: Initial metric shown in system view

26.9. Property `cc-backend configuration file schema > ui-defaults > job_view_showFootprint`


Type	`boolean`
Required	Yes

Description: Option to toggle footprint ui in single job view

26.10. Property `cc-backend configuration file schema > ui-defaults > job_list_usePaging`


Type	`boolean`
Required	Yes

Description: Option to switch from continuous scroll to paging

26.11. Property `cc-backend configuration file schema > ui-defaults > analysis_view_histogramMetrics`


Type	`array of string`
Required	Yes

Description: Metrics to show as job count histograms in analysis view

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
analysis_view_histogramMetrics items	-

26.11.1. cc-backend configuration file schema > ui-defaults > analysis_view_histogramMetrics > analysis_view_histogramMetrics items


Type	`string`
Required	No

26.12. Property `cc-backend configuration file schema > ui-defaults > analysis_view_scatterPlotMetrics`


Type	`array of array`
Required	Yes

Description: Initial scatter plot configuration in analysis view

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
analysis_view_scatterPlotMetrics items	-

26.12.1. cc-backend configuration file schema > ui-defaults > analysis_view_scatterPlotMetrics > analysis_view_scatterPlotMetrics items


Type	`array of string`
Required	No

	Array restrictions
Min items	1
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
analysis_view_scatterPlotMetrics items items	-

26.12.1.1. cc-backend configuration file schema > ui-defaults > analysis_view_scatterPlotMetrics > analysis_view_scatterPlotMetrics items > analysis_view_scatterPlotMetrics items items


Type	`string`
Required	No

26.13. Property `cc-backend configuration file schema > ui-defaults > job_view_nodestats_selectedMetrics`


Type	`array of string`
Required	Yes

Description: Initial metrics shown in node statistics table of single job view

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
job_view_nodestats_selectedMetrics items	-

26.13.1. cc-backend configuration file schema > ui-defaults > job_view_nodestats_selectedMetrics > job_view_nodestats_selectedMetrics items


Type	`string`
Required	No

26.14. Property `cc-backend configuration file schema > ui-defaults > job_view_selectedMetrics`


Type	`array of string`
Required	Yes

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
job_view_selectedMetrics items	-

26.14.1. cc-backend configuration file schema > ui-defaults > job_view_selectedMetrics > job_view_selectedMetrics items


Type	`string`
Required	No

26.15. Property `cc-backend configuration file schema > ui-defaults > plot_general_colorscheme`


Type	`array of string`
Required	Yes

Description: Initial color scheme

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
plot_general_colorscheme items	-

26.15.1. cc-backend configuration file schema > ui-defaults > plot_general_colorscheme > plot_general_colorscheme items


Type	`string`
Required	No

26.16. Property `cc-backend configuration file schema > ui-defaults > plot_list_selectedMetrics`


Type	`array of string`
Required	Yes

Description: Initial metric plots shown in jobs lists

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
plot_list_selectedMetrics items	-

26.16.1. cc-backend configuration file schema > ui-defaults > plot_list_selectedMetrics > plot_list_selectedMetrics items


Type	`string`
Required	No

Generated using json-schema-for-humans on 2024-12-04 at 16:45:59 +0100

7.1.7.2 - Cluster Schema

ClusterCockpit Cluster Schema Reference

The following schema in its raw form can be found in the ClusterCockpit GitHub repository.

Manual Updates

Changes to the original JSON schema found in the repository are not automatically rendered in this reference documentation.

Last Update: 04.12.2024

HPC cluster description

Title: HPC cluster description


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Metadata of an HPC cluster

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ name	No	string	No	-	The unique identifier of a cluster
+ metricConfig	No	array of object	No	-	Metric specifications
+ subClusters	No	array of object	No	-	Array of cluster hardware partitions

1. Property `HPC cluster description > name`


Type	`string`
Required	Yes

Description: The unique identifier of a cluster

2. Property `HPC cluster description > metricConfig`


Type	`array of object`
Required	Yes

Description: Metric specifications

	Array restrictions
Min items	1
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
metricConfig items	-

2.1. HPC cluster description > metricConfig > metricConfig items


Type	`object`
Required	No
Additional properties	Any type allowed

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ name	No	string	No	-	Metric name
+ unit	No	object	No	In embedfs://unit.schema.json	Metric unit
+ scope	No	string	No	-	Native measurement resolution
+ timestep	No	integer	No	-	Frequency of timeseries points
+ aggregation	No	enum (of string)	No	-	How the metric is aggregated
- footprint	No	enum (of string)	No	-	Is it a footprint metric and what type
- energy	No	enum (of string)	No	-	Is it used to calculate job energy
- lowerIsBetter	No	boolean	No	-	Is lower better.
+ peak	No	number	No	-	Metric peak threshold (Upper metric limit)
+ normal	No	number	No	-	Metric normal threshold
+ caution	No	number	No	-	Metric caution threshold (Suspicious but does not require immediate action)
+ alert	No	number	No	-	Metric alert threshold (Requires immediate action)
- subClusters	No	array of object	No	-	Array of cluster hardware partition metric thresholds

2.1.1. Property `HPC cluster description > metricConfig > metricConfig items > name`


Type	`string`
Required	Yes

Description: Metric name

2.1.2. Property `HPC cluster description > metricConfig > metricConfig items > unit`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://unit.schema.json

Description: Metric unit

2.1.3. Property `HPC cluster description > metricConfig > metricConfig items > scope`


Type	`string`
Required	Yes

Description: Native measurement resolution

2.1.4. Property `HPC cluster description > metricConfig > metricConfig items > timestep`


Type	`integer`
Required	Yes

Description: Frequency of timeseries points

2.1.5. Property `HPC cluster description > metricConfig > metricConfig items > aggregation`


Type	`enum (of string)`
Required	Yes

Description: How the metric is aggregated

Must be one of:

“sum”
“avg”

2.1.6. Property `HPC cluster description > metricConfig > metricConfig items > footprint`


Type	`enum (of string)`
Required	No

Description: Is it a footprint metric and what type

Must be one of:

“avg”
“max”
“min”

2.1.7. Property `HPC cluster description > metricConfig > metricConfig items > energy`


Type	`enum (of string)`
Required	No

Description: Is it used to calculate job energy

Must be one of:

“power”
“energy”

2.1.8. Property `HPC cluster description > metricConfig > metricConfig items > lowerIsBetter`


Type	`boolean`
Required	No

Description: Is lower better.

2.1.9. Property `HPC cluster description > metricConfig > metricConfig items > peak`


Type	`number`
Required	Yes

Description: Metric peak threshold (Upper metric limit)

2.1.10. Property `HPC cluster description > metricConfig > metricConfig items > normal`


Type	`number`
Required	Yes

Description: Metric normal threshold

2.1.11. Property `HPC cluster description > metricConfig > metricConfig items > caution`


Type	`number`
Required	Yes

Description: Metric caution threshold (Suspicious but does not require immediate action)

2.1.12. Property `HPC cluster description > metricConfig > metricConfig items > alert`


Type	`number`
Required	Yes

Description: Metric alert threshold (Requires immediate action)
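The required properties and enumerations listed so far for a `metricConfig` item can be summarised in a small Python check. This is a sketch under stated assumptions: the helper name `check_metric_config` is hypothetical, all numeric values are invented, and the layout of the `unit` object (`base`, `prefix`) is an assumption because `unit.schema.json` is not reproduced in this reference.

```python
# Properties marked "Required: Yes" for a metricConfig item.
REQUIRED = ("name", "unit", "scope", "timestep", "aggregation",
            "peak", "normal", "caution", "alert")

# Enumerations documented for metricConfig items.
ENUMS = {
    "aggregation": ("sum", "avg"),
    "footprint": ("avg", "max", "min"),
    "energy": ("power", "energy"),
}


def check_metric_config(item):
    """Return a list of problems for one metricConfig item."""
    problems = [f"missing required property: {key}"
                for key in REQUIRED if key not in item]
    for key, allowed in ENUMS.items():
        if key in item and item[key] not in allowed:
            problems.append(f"{key} must be one of {list(allowed)}")
    return problems


# Hypothetical example; values are placeholders, not recommendations.
example = {
    "name": "mem_bw",
    "unit": {"base": "B/s", "prefix": "G"},  # assumed layout of the unit object
    "scope": "socket",
    "timestep": 60,
    "aggregation": "sum",
    "footprint": "avg",
    "peak": 350.0,
    "normal": 100.0,
    "caution": 50.0,
    "alert": 10.0,
}
print(check_metric_config(example))
```

For authoritative validation, use the raw JSON schema from the repository with a JSON Schema validator rather than a hand-written check like this one.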

2.1.13. Property `HPC cluster description > metricConfig > metricConfig items > subClusters`


Type	`array of object`
Required	No

Description: Array of cluster hardware partition metric thresholds

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
subClusters items	-

2.1.13.1. HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items


Type	`object`
Required	No
Additional properties	Any type allowed

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ name	No	string	No	-	Hardware partition name
- footprint	No	enum (of string)	No	-	Is it a footprint metric and what type. Overwrite global setting
- energy	No	enum (of string)	No	-	Is it used to calculate job energy. Overwrite global
- lowerIsBetter	No	boolean	No	-	Is lower better. Overwrite global
- peak	No	number	No	-	-
- normal	No	number	No	-	-
- caution	No	number	No	-	-
- alert	No	number	No	-	-
- remove	No	boolean	No	-	Remove this metric for this subcluster

2.1.13.1.1. Property `HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > name`


Type	`string`
Required	Yes

Description: Hardware partition name

2.1.13.1.2. Property `HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > footprint`


Type	`enum (of string)`
Required	No

Description: Is it a footprint metric and what type. Overwrite global setting

Must be one of:

“avg”
“max”
“min”

2.1.13.1.3. Property `HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > energy`


Type	`enum (of string)`
Required	No

Description: Is it used to calculate job energy. Overwrite global

Must be one of:

“power”
“energy”

2.1.13.1.4. Property `HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > lowerIsBetter`


Type	`boolean`
Required	No

Description: Is lower better. Overwrite global

2.1.13.1.5. Property `HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > peak`


Type	`number`
Required	No

2.1.13.1.6. Property `HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > normal`


Type	`number`
Required	No

2.1.13.1.7. Property `HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > caution`


Type	`number`
Required	No

2.1.13.1.8. Property `HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > alert`


Type	`number`
Required	No

2.1.13.1.9. Property `HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > remove`


Type	`boolean`
Required	No

Description: Remove this metric for this subcluster
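The per-subcluster entries overwrite the global metric settings, and `remove` drops the metric for that subcluster. One way to picture the result is the Python sketch below. It is an illustration only: the function name `effective_metric`, the subcluster names and all numbers are hypothetical, and the exact merge rules used by cc-backend are assumed, not taken from its source code.

```python
# Properties that a subClusters entry may overwrite, per the table above.
OVERRIDABLE = ("footprint", "energy", "lowerIsBetter",
               "peak", "normal", "caution", "alert")


def effective_metric(metric, subcluster):
    """Return the metric settings for one subcluster, or None if removed."""
    base = {k: v for k, v in metric.items() if k != "subClusters"}
    for override in metric.get("subClusters", []):
        if override["name"] != subcluster:
            continue
        if override.get("remove", False):
            return None  # metric is not available on this subcluster
        base.update({k: override[k] for k in OVERRIDABLE if k in override})
        return base
    return base  # no entry for this subcluster: global settings apply


# Hypothetical metric with two subcluster entries.
metric = {
    "name": "mem_bw",
    "peak": 200.0, "normal": 100.0, "caution": 50.0, "alert": 10.0,
    "subClusters": [
        {"name": "gpu", "peak": 400.0},
        {"name": "login", "remove": True},
    ],
}
print(effective_metric(metric, "gpu"))
print(effective_metric(metric, "login"))
```

In this sketch a subcluster without its own entry simply inherits the global thresholds.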

3. Property `HPC cluster description > subClusters`


Type	`array of object`
Required	Yes

Description: Array of cluster hardware partitions

	Array restrictions
Min items	1
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
subClusters items	-

3.1. HPC cluster description > subClusters > subClusters items


Type	`object`
Required	No
Additional properties	Any type allowed

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ name	No	string	No	-	Hardware partition name
+ processorType	No	string	No	-	Processor type
+ socketsPerNode	No	integer	No	-	Number of sockets per node
+ coresPerSocket	No	integer	No	-	Number of cores per socket
+ threadsPerCore	No	integer	No	-	Number of SMT threads per core
+ flopRateScalar	No	object	No	-	Theoretical node peak flop rate for scalar code in GFlops/s
+ flopRateSimd	No	object	No	-	Theoretical node peak flop rate for SIMD code in GFlops/s
+ memoryBandwidth	No	object	No	-	Theoretical node peak memory bandwidth in GB/s
+ nodes	No	string	No	-	Node list expression
+ topology	No	object	No	-	Node topology

3.1.1. Property `HPC cluster description > subClusters > subClusters items > name`


Type	`string`
Required	Yes

Description: Hardware partition name

3.1.2. Property `HPC cluster description > subClusters > subClusters items > processorType`


Type	`string`
Required	Yes

Description: Processor type

3.1.3. Property `HPC cluster description > subClusters > subClusters items > socketsPerNode`


Type	`integer`
Required	Yes

Description: Number of sockets per node

3.1.4. Property `HPC cluster description > subClusters > subClusters items > coresPerSocket`


Type	`integer`
Required	Yes

Description: Number of cores per socket

3.1.5. Property `HPC cluster description > subClusters > subClusters items > threadsPerCore`


Type	`integer`
Required	Yes

Description: Number of SMT threads per core

3.1.6. Property `HPC cluster description > subClusters > subClusters items > flopRateScalar`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: Theoretical node peak flop rate for scalar code in GFlops/s

Property	Pattern	Type	Deprecated	Definition	Title/Description
- unit	No	object	No	In embedfs://unit.schema.json	Metric unit
- value	No	number	No	-	-

3.1.6.1. Property `HPC cluster description > subClusters > subClusters items > flopRateScalar > unit`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://unit.schema.json

Description: Metric unit

3.1.6.2. Property `HPC cluster description > subClusters > subClusters items > flopRateScalar > value`


Type	`number`
Required	No

3.1.7. Property `HPC cluster description > subClusters > subClusters items > flopRateSimd`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: Theoretical node peak flop rate for SIMD code in GFlops/s

Property	Pattern	Type	Deprecated	Definition	Title/Description
- unit	No	object	No	In embedfs://unit.schema.json	Metric unit
- value	No	number	No	-	-

3.1.7.1. Property `HPC cluster description > subClusters > subClusters items > flopRateSimd > unit`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://unit.schema.json

Description: Metric unit

3.1.7.2. Property `HPC cluster description > subClusters > subClusters items > flopRateSimd > value`


Type	`number`
Required	No

3.1.8. Property `HPC cluster description > subClusters > subClusters items > memoryBandwidth`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: Theoretical node peak memory bandwidth in GB/s

Property	Pattern	Type	Deprecated	Definition	Title/Description
- unit	No	object	No	In embedfs://unit.schema.json	Metric unit
- value	No	number	No	-	-

3.1.8.1. Property `HPC cluster description > subClusters > subClusters items > memoryBandwidth > unit`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://unit.schema.json

Description: Metric unit

3.1.8.2. Property `HPC cluster description > subClusters > subClusters items > memoryBandwidth > value`


Type	`number`
Required	No

3.1.9. Property `HPC cluster description > subClusters > subClusters items > nodes`


Type	`string`
Required	Yes

Description: Node list expression

3.1.10. Property `HPC cluster description > subClusters > subClusters items > topology`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: Node topology

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	array of integer	No	-	HwThread lists of node
+ socket	No	array of array	No	-	HwThread lists of sockets
+ memoryDomain	No	array of array	No	-	HwThread lists of memory domains
- die	No	array of array	No	-	HwThread lists of dies
- core	No	array of array	No	-	HwThread lists of cores
- accelerators	No	array of object	No	-	List of accelerator devices

3.1.10.1. Property `HPC cluster description > subClusters > subClusters items > topology > node`


Type	`array of integer`
Required	Yes

Description: HwThread lists of node

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
node items	-

3.1.10.1.1. HPC cluster description > subClusters > subClusters items > topology > node > node items


Type	`integer`
Required	No

3.1.10.2. Property `HPC cluster description > subClusters > subClusters items > topology > socket`


Type	`array of array`
Required	Yes

Description: HwThread lists of sockets

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
socket items	-

3.1.10.2.1. HPC cluster description > subClusters > subClusters items > topology > socket > socket items


Type	`array of integer`
Required	No

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
socket items items	-

3.1.10.2.1.1. HPC cluster description > subClusters > subClusters items > topology > socket > socket items > socket items items


Type	`integer`
Required	No

3.1.10.3. Property `HPC cluster description > subClusters > subClusters items > topology > memoryDomain`


Type	`array of array`
Required	Yes

Description: HwThread lists of memory domains

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
memoryDomain items	-

3.1.10.3.1. HPC cluster description > subClusters > subClusters items > topology > memoryDomain > memoryDomain items


Type	`array of integer`
Required	No

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
memoryDomain items items	-

3.1.10.3.1.1. HPC cluster description > subClusters > subClusters items > topology > memoryDomain > memoryDomain items > memoryDomain items items


Type	`integer`
Required	No

3.1.10.4. Property `HPC cluster description > subClusters > subClusters items > topology > die`


Type	`array of array`
Required	No

Description: HwThread lists of dies

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
die items	-

3.1.10.4.1. HPC cluster description > subClusters > subClusters items > topology > die > die items


Type	`array of integer`
Required	No

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
die items items	-

3.1.10.4.1.1. HPC cluster description > subClusters > subClusters items > topology > die > die items > die items items


Type	`integer`
Required	No

3.1.10.5. Property `HPC cluster description > subClusters > subClusters items > topology > core`


Type	`array of array`
Required	No

Description: HwThread lists of cores

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
core items	-

3.1.10.5.1. HPC cluster description > subClusters > subClusters items > topology > core > core items


Type	`array of integer`
Required	No

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
core items items	-

3.1.10.5.1.1. HPC cluster description > subClusters > subClusters items > topology > core > core items > core items items


Type	`integer`
Required	No
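The topology lists and the per-node counts describe the same hardware, so a plausibility check can compare them. The Python sketch below is illustrative only: the schema itself does not enforce this consistency, the helper name `check_topology` is hypothetical, and the hardware thread numbering in the example is invented for a small two-socket node.

```python
def check_topology(subcluster):
    """Compare topology lists with the socket, core and thread counts."""
    topo = subcluster["topology"]
    expected = (subcluster["socketsPerNode"]
                * subcluster["coresPerSocket"]
                * subcluster["threadsPerCore"])
    node = topo["node"]
    problems = []
    if len(node) != expected:
        problems.append(f"node lists {len(node)} hwthreads, expected {expected}")
    # die and core are optional; socket and memoryDomain are required.
    for level in ("socket", "memoryDomain", "die", "core"):
        groups = topo.get(level)
        if groups is None:
            continue
        flat = [thread for group in groups for thread in group]
        if sorted(flat) != sorted(node):
            problems.append(f"{level} lists do not cover the node hwthreads exactly once")
    return problems


# Hypothetical node: 2 sockets x 2 cores x 2 SMT threads = 8 hwthreads.
example = {
    "socketsPerNode": 2,
    "coresPerSocket": 2,
    "threadsPerCore": 2,
    "topology": {
        "node": [0, 1, 2, 3, 4, 5, 6, 7],
        "socket": [[0, 1, 2, 3], [4, 5, 6, 7]],
        "memoryDomain": [[0, 1, 2, 3], [4, 5, 6, 7]],
        "core": [[0, 1], [2, 3], [4, 5], [6, 7]],
    },
}
print(check_topology(example))
```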

3.1.10.6. Property `HPC cluster description > subClusters > subClusters items > topology > accelerators`


Type	`array of object`
Required	No

Description: List of accelerator devices

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
accelerators items	-

3.1.10.6.1. HPC cluster description > subClusters > subClusters items > topology > accelerators > accelerators items


Type	`object`
Required	No
Additional properties	Any type allowed

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ id	No	string	No	-	The unique device id
+ type	No	enum (of string)	No	-	The accelerator type
+ model	No	string	No	-	The accelerator model

3.1.10.6.1.1. Property `HPC cluster description > subClusters > subClusters items > topology > accelerators > accelerators items > id`


Type	`string`
Required	Yes

Description: The unique device id

3.1.10.6.1.2. Property `HPC cluster description > subClusters > subClusters items > topology > accelerators > accelerators items > type`


Type	`enum (of string)`
Required	Yes

Description: The accelerator type

Must be one of:

“Nvidia GPU”
“AMD GPU”
“Intel GPU”

3.1.10.6.1.3. Property `HPC cluster description > subClusters > subClusters items > topology > accelerators > accelerators items > model`


Type	`string`
Required	Yes

Description: The accelerator model
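An `accelerators` list can be checked against the three required properties and the `type` enumeration as sketched below in Python. The helper name `check_accelerators`, the device ids and the model string are hypothetical. The duplicate-id check is an extra assumption derived from the description "The unique device id"; the schema itself does not enforce uniqueness.

```python
ACCELERATOR_TYPES = ("Nvidia GPU", "AMD GPU", "Intel GPU")


def check_accelerators(accelerators):
    """Return a list of problems for a topology accelerators array."""
    problems = []
    seen_ids = set()
    for i, acc in enumerate(accelerators):
        for key in ("id", "type", "model"):
            if key not in acc:
                problems.append(f"accelerator {i}: missing {key}")
        if "type" in acc and acc["type"] not in ACCELERATOR_TYPES:
            problems.append(f"accelerator {i}: unknown type {acc['type']!r}")
        if "id" in acc:
            if acc["id"] in seen_ids:
                problems.append(f"accelerator {i}: duplicate id {acc['id']!r}")
            seen_ids.add(acc["id"])
    return problems


# Hypothetical devices; ids and model names are placeholders.
example = [
    {"id": "gpu0", "type": "Nvidia GPU", "model": "example-model"},
    {"id": "gpu1", "type": "Nvidia GPU", "model": "example-model"},
]
print(check_accelerators(example))
```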

Generated using json-schema-for-humans on 2024-12-04 at 16:45:59 +0100

7.1.7.3 - Job Data Schema

ClusterCockpit Job Data Schema Reference

The following schema in its raw form can be found in the ClusterCockpit GitHub repository.

Manual Updates

Changes to the original JSON schema found in the repository are not automatically rendered in this reference documentation.

Last Update: 04.12.2024

Job metric data list

Title: Job metric data list


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Collection of metric data of an HPC job

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ mem_used	No	object	No	-	Memory capacity used
+ flops_any	No	object	No	-	Total flop rate with DP flops scaled up
+ mem_bw	No	object	No	-	Main memory bandwidth
+ net_bw	No	object	No	-	Total fast interconnect network bandwidth
- ipc	No	object	No	-	Instructions executed per cycle
+ cpu_user	No	object	No	-	CPU user active core utilization
+ cpu_load	No	object	No	-	CPU requested core utilization (load 1m)
- flops_dp	No	object	No	-	Double precision flop rate
- flops_sp	No	object	No	-	Single precision flop rate
- vectorization_ratio	No	object	No	-	Fraction of arithmetic instructions using SIMD instructions
- cpu_power	No	object	No	-	CPU power consumption
- mem_power	No	object	No	-	Memory power consumption
- acc_utilization	No	object	No	-	GPU utilization
- acc_mem_used	No	object	No	-	GPU memory capacity used
- acc_power	No	object	No	-	GPU power consumption
- clock	No	object	No	-	Average core frequency
- eth_read_bw	No	object	No	-	Ethernet read bandwidth
- eth_write_bw	No	object	No	-	Ethernet write bandwidth
+ filesystems	No	array of object	No	-	Array of filesystems
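In the table above, properties marked with `+` are required and those marked with `-` are optional. A quick presence check for the required ones could look like the following Python sketch. The function name `missing_job_metrics` is hypothetical, and the empty placeholder objects only stand in for real metric data, whose structure is defined in `job-metric-data.schema.json`.

```python
# Top-level properties marked as required in the job data schema.
REQUIRED_METRICS = ("mem_used", "flops_any", "mem_bw", "net_bw",
                    "cpu_user", "cpu_load", "filesystems")


def missing_job_metrics(job_data):
    """Return the required top-level properties absent from job_data."""
    return [name for name in REQUIRED_METRICS if name not in job_data]


# Placeholder job data: empty objects instead of real time series.
job_data = {name: {} for name in REQUIRED_METRICS}
job_data["filesystems"] = []  # documented as an array of objects
print(missing_job_metrics(job_data))
```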

1. Property `Job metric data list > mem_used`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: Memory capacity used

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.

1.1. Property `Job metric data list > mem_used > node`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

2. Property `Job metric data list > flops_any`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: Total flop rate with DP flops scaled up

Property	Pattern	Type	Deprecated	Definition	Title/Description
- node	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.
- socket	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.
- memoryDomain	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.
- core	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.
- hwthread	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.

2.1. Property `Job metric data list > flops_any > node`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

2.2. Property `Job metric data list > flops_any > socket`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

2.3. Property `Job metric data list > flops_any > memoryDomain`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

2.4. Property `Job metric data list > flops_any > core`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

2.5. Property `Job metric data list > flops_any > hwthread`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

3. Property `Job metric data list > mem_bw`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: Main memory bandwidth

Property	Pattern	Type	Deprecated	Definition	Title/Description
- node	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.
- socket	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.
- memoryDomain	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.

3.1. Property `Job metric data list > mem_bw > node`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

3.2. Property `Job metric data list > mem_bw > socket`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

3.3. Property `Job metric data list > mem_bw > memoryDomain`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

4. Property `Job metric data list > net_bw`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: Total fast interconnect network bandwidth

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.

4.1. Property `Job metric data list > net_bw > node`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

5. Property `Job metric data list > ipc`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Instructions executed per cycle

Property	Pattern	Type	Deprecated	Definition	Title/Description
- node	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.
- socket	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.
- memoryDomain	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.
- core	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.
- hwthread	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.

5.1. Property `Job metric data list > ipc > node`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

5.2. Property `Job metric data list > ipc > socket`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

5.3. Property `Job metric data list > ipc > memoryDomain`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

5.4. Property `Job metric data list > ipc > core`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

5.5. Property `Job metric data list > ipc > hwthread`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

6. Property `Job metric data list > cpu_user`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: CPU user active core utilization

Property	Pattern	Type	Deprecated	Definition	Title/Description
- node	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.
- socket	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.
- memoryDomain	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.
- core	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.
- hwthread	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.

6.1. Property `Job metric data list > cpu_user > node`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

6.2. Property `Job metric data list > cpu_user > socket`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

6.3. Property `Job metric data list > cpu_user > memoryDomain`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

6.4. Property `Job metric data list > cpu_user > core`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

6.5. Property `Job metric data list > cpu_user > hwthread`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

7. Property `Job metric data list > cpu_load`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: CPU requested core utilization (load 1m)

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	object	No	In embedfs://job-metric-data.schema.json	Referenced schema could not be loaded during schema generation; no documentation available.

7.1. Property `Job metric data list > cpu_load > node`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json

Description: Referenced schema could not be loaded during schema generation; no documentation available.

8. Property `Job metric data list > flops_dp`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Double precision flop rate

Property	Pattern	Type	Deprecated	Definition	Title/Description
- node	No	object	No	In embedfs://job-metric-data.schema.json	-
- socket	No	object	No	In embedfs://job-metric-data.schema.json	-
- memoryDomain	No	object	No	In embedfs://job-metric-data.schema.json	-
- core	No	object	No	In embedfs://job-metric-data.schema.json	-
- hwthread	No	object	No	In embedfs://job-metric-data.schema.json	-

8.1. Property `Job metric data list > flops_dp > node`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


8.2. Property `Job metric data list > flops_dp > socket`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


8.3. Property `Job metric data list > flops_dp > memoryDomain`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


8.4. Property `Job metric data list > flops_dp > core`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


8.5. Property `Job metric data list > flops_dp > hwthread`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


9. Property `Job metric data list > flops_sp`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Single precision flop rate

Property	Pattern	Type	Deprecated	Definition	Title/Description
- node	No	object	No	In embedfs://job-metric-data.schema.json	-
- socket	No	object	No	In embedfs://job-metric-data.schema.json	-
- memoryDomain	No	object	No	In embedfs://job-metric-data.schema.json	-
- core	No	object	No	In embedfs://job-metric-data.schema.json	-
- hwthread	No	object	No	In embedfs://job-metric-data.schema.json	-

9.1. Property `Job metric data list > flops_sp > node`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


9.2. Property `Job metric data list > flops_sp > socket`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


9.3. Property `Job metric data list > flops_sp > memoryDomain`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


9.4. Property `Job metric data list > flops_sp > core`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


9.5. Property `Job metric data list > flops_sp > hwthread`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


10. Property `Job metric data list > vectorization_ratio`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Fraction of arithmetic instructions using SIMD instructions

Property	Pattern	Type	Deprecated	Definition	Title/Description
- node	No	object	No	In embedfs://job-metric-data.schema.json	-
- socket	No	object	No	In embedfs://job-metric-data.schema.json	-
- memoryDomain	No	object	No	In embedfs://job-metric-data.schema.json	-
- core	No	object	No	In embedfs://job-metric-data.schema.json	-
- hwthread	No	object	No	In embedfs://job-metric-data.schema.json	-

10.1. Property `Job metric data list > vectorization_ratio > node`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


10.2. Property `Job metric data list > vectorization_ratio > socket`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


10.3. Property `Job metric data list > vectorization_ratio > memoryDomain`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


10.4. Property `Job metric data list > vectorization_ratio > core`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


10.5. Property `Job metric data list > vectorization_ratio > hwthread`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


11. Property `Job metric data list > cpu_power`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: CPU power consumption

Property	Pattern	Type	Deprecated	Definition	Title/Description
- node	No	object	No	In embedfs://job-metric-data.schema.json	-
- socket	No	object	No	In embedfs://job-metric-data.schema.json	-

11.1. Property `Job metric data list > cpu_power > node`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


11.2. Property `Job metric data list > cpu_power > socket`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


12. Property `Job metric data list > mem_power`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Memory power consumption

Property	Pattern	Type	Deprecated	Definition	Title/Description
- node	No	object	No	In embedfs://job-metric-data.schema.json	-
- socket	No	object	No	In embedfs://job-metric-data.schema.json	-

12.1. Property `Job metric data list > mem_power > node`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


12.2. Property `Job metric data list > mem_power > socket`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


13. Property `Job metric data list > acc_utilization`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: GPU utilization

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ accelerator	No	object	No	In embedfs://job-metric-data.schema.json	-

13.1. Property `Job metric data list > acc_utilization > accelerator`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


14. Property `Job metric data list > acc_mem_used`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: GPU memory capacity used

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ accelerator	No	object	No	In embedfs://job-metric-data.schema.json	-

14.1. Property `Job metric data list > acc_mem_used > accelerator`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


15. Property `Job metric data list > acc_power`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: GPU power consumption

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ accelerator	No	object	No	In embedfs://job-metric-data.schema.json	-

15.1. Property `Job metric data list > acc_power > accelerator`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


16. Property `Job metric data list > clock`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Average core frequency

Property	Pattern	Type	Deprecated	Definition	Title/Description
- node	No	object	No	In embedfs://job-metric-data.schema.json	-
- socket	No	object	No	In embedfs://job-metric-data.schema.json	-
- memoryDomain	No	object	No	In embedfs://job-metric-data.schema.json	-
- core	No	object	No	In embedfs://job-metric-data.schema.json	-
- hwthread	No	object	No	In embedfs://job-metric-data.schema.json	-

16.1. Property `Job metric data list > clock > node`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


16.2. Property `Job metric data list > clock > socket`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


16.3. Property `Job metric data list > clock > memoryDomain`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


16.4. Property `Job metric data list > clock > core`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


16.5. Property `Job metric data list > clock > hwthread`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


17. Property `Job metric data list > eth_read_bw`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Ethernet read bandwidth

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	object	No	In embedfs://job-metric-data.schema.json	-

17.1. Property `Job metric data list > eth_read_bw > node`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


18. Property `Job metric data list > eth_write_bw`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Ethernet write bandwidth

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	object	No	In embedfs://job-metric-data.schema.json	-

18.1. Property `Job metric data list > eth_write_bw > node`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


19. Property `Job metric data list > filesystems`


Type	`array of object`
Required	Yes

Description: Array of filesystems

	Array restrictions
Min items	1
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
filesystems items	-

19.1. Job metric data list > filesystems > filesystems items


Type	`object`
Required	No
Additional properties	Any type allowed

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ name	No	string	No	-	-
+ type	No	enum (of string)	No	-	-
+ read_bw	No	object	No	-	File system read bandwidth
+ write_bw	No	object	No	-	File system write bandwidth
- read_req	No	object	No	-	File system read requests
- write_req	No	object	No	-	File system write requests
- inodes	No	object	No	-	File system write requests
- accesses	No	object	No	-	File system open and close
- fsync	No	object	No	-	File system fsync
- create	No	object	No	-	File system create
- open	No	object	No	-	File system open
- close	No	object	No	-	File system close
- seek	No	object	No	-	File system seek

19.1.1. Property `Job metric data list > filesystems > filesystems items > name`


Type	`string`
Required	Yes

19.1.2. Property `Job metric data list > filesystems > filesystems items > type`


Type	`enum (of string)`
Required	Yes

Must be one of:

“nfs”
“lustre”
“gpfs”
“nvme”
“ssd”
“hdd”
“beegfs”
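
The requirements listed so far for a filesystems item (required `name`, `type`, `read_bw` and `write_bw`, with `type` restricted to the values above) can be checked with a few lines of Python. This is a minimal sketch that assumes the item is a plain dict parsed from JSON; `check_filesystem_item` and the sample values are hypothetical and not part of ClusterCockpit.

```python
# Illustrative only: check_filesystem_item is a hypothetical helper,
# not part of ClusterCockpit.
ALLOWED_FS_TYPES = {"nfs", "lustre", "gpfs", "nvme", "ssd", "hdd", "beegfs"}
REQUIRED_FS_KEYS = ("name", "type", "read_bw", "write_bw")


def check_filesystem_item(item: dict) -> list[str]:
    """Return a list of problems found in one filesystems item (empty if none)."""
    problems = [
        f"missing required property: {key}"
        for key in REQUIRED_FS_KEYS
        if key not in item
    ]
    if "name" in item and not isinstance(item["name"], str):
        problems.append("name must be a string")
    if "type" in item:
        fs_type = item["type"]
        if not isinstance(fs_type, str) or fs_type not in ALLOWED_FS_TYPES:
            problems.append(f"unsupported type: {fs_type!r}")
    for key in ("read_bw", "write_bw"):
        if key in item and not isinstance(item[key], dict):
            problems.append(f"{key} must be an object")
    return problems


# Hypothetical sample item; the metric objects are left empty for brevity.
sample = {"name": "scratch", "type": "lustre", "read_bw": {}, "write_bw": {}}
print(check_filesystem_item(sample))
```

The helper only covers the constraints named in the table and enumeration above; a full validation would use the JSON schema itself.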

19.1.3. Property `Job metric data list > filesystems > filesystems items > read_bw`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: File system read bandwidth

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	object	No	In embedfs://job-metric-data.schema.json	-

19.1.3.1. Property `Job metric data list > filesystems > filesystems items > read_bw > node`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


19.1.4. Property `Job metric data list > filesystems > filesystems items > write_bw`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: File system write bandwidth

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	object	No	In embedfs://job-metric-data.schema.json	-

19.1.4.1. Property `Job metric data list > filesystems > filesystems items > write_bw > node`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


19.1.5. Property `Job metric data list > filesystems > filesystems items > read_req`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: File system read requests

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	object	No	In embedfs://job-metric-data.schema.json	-

19.1.5.1. Property `Job metric data list > filesystems > filesystems items > read_req > node`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


19.1.6. Property `Job metric data list > filesystems > filesystems items > write_req`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: File system write requests

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	object	No	In embedfs://job-metric-data.schema.json	-

19.1.6.1. Property `Job metric data list > filesystems > filesystems items > write_req > node`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


19.1.7. Property `Job metric data list > filesystems > filesystems items > inodes`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: File system write requests

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	object	No	In embedfs://job-metric-data.schema.json	-

19.1.7.1. Property `Job metric data list > filesystems > filesystems items > inodes > node`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


19.1.8. Property `Job metric data list > filesystems > filesystems items > accesses`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: File system open and close

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	object	No	In embedfs://job-metric-data.schema.json	-

19.1.8.1. Property `Job metric data list > filesystems > filesystems items > accesses > node`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


19.1.9. Property `Job metric data list > filesystems > filesystems items > fsync`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: File system fsync

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	object	No	In embedfs://job-metric-data.schema.json	-

19.1.9.1. Property `Job metric data list > filesystems > filesystems items > fsync > node`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


19.1.10. Property `Job metric data list > filesystems > filesystems items > create`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: File system create

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	object	No	In embedfs://job-metric-data.schema.json	-

19.1.10.1. Property `Job metric data list > filesystems > filesystems items > create > node`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


19.1.11. Property `Job metric data list > filesystems > filesystems items > open`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: File system open

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	object	No	In embedfs://job-metric-data.schema.json	-

19.1.11.1. Property `Job metric data list > filesystems > filesystems items > open > node`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


19.1.12. Property `Job metric data list > filesystems > filesystems items > close`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: File system close

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	object	No	In embedfs://job-metric-data.schema.json	-

19.1.12.1. Property `Job metric data list > filesystems > filesystems items > close > node`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


19.1.13. Property `Job metric data list > filesystems > filesystems items > seek`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: File system seek

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ node	No	object	No	In embedfs://job-metric-data.schema.json	-

19.1.13.1. Property `Job metric data list > filesystems > filesystems items > seek > node`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-data.schema.json


Generated using json-schema-for-humans on 2024-12-04 at 16:45:59 +0100
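
For the metrics documented in this part of the job metric data reference, the scopes marked as required (Required: Yes) can be summarized in a small lookup table. The sketch below is illustrative: `REQUIRED_SCOPES` and `missing_scopes` are hypothetical names, and the table only covers the metrics whose scope requirements are listed above.

```python
# Illustrative only: names below are hypothetical, not ClusterCockpit APIs.
# Metrics whose scopes are all optional map to an empty tuple.
REQUIRED_SCOPES = {
    "cpu_load": ("node",),
    "flops_dp": (),
    "flops_sp": (),
    "vectorization_ratio": (),
    "cpu_power": (),
    "mem_power": (),
    "acc_utilization": ("accelerator",),
    "acc_mem_used": ("accelerator",),
    "acc_power": ("accelerator",),
    "clock": (),
    "eth_read_bw": ("node",),
    "eth_write_bw": ("node",),
}


def missing_scopes(metric_name: str, metric: dict) -> list[str]:
    """Return the required scope keys that are absent from a metric object."""
    required = REQUIRED_SCOPES.get(metric_name, ())
    return [scope for scope in required if scope not in metric]


print(missing_scopes("cpu_load", {}))
print(missing_scopes("flops_dp", {"socket": {}}))
```

A metric that is not in the table is treated as having no required scopes, which is a simplification chosen for this sketch.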

7.1.7.4 - Job Statistics Schema

ClusterCockpit Job Statistics Schema Reference

The following schema in its raw form can be found in the ClusterCockpit GitHub repository.

Manual Updates

Changes to the original JSON schema found in the repository are not automatically rendered in this reference documentation.

Last Update: 04.12.2024

Job statistics

Title: Job statistics


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Format specification for job metric statistics

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ unit	No	object	No	In embedfs://unit.schema.json	Metric unit
+ avg	No	number	No	-	Job metric average
+ min	No	number	No	-	Job metric minimum
+ max	No	number	No	-	Job metric maximum

1. Property `Job statistics > unit`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://unit.schema.json

Description: Metric unit

2. Property `Job statistics > avg`


Type	`number`
Required	Yes

Description: Job metric average

Restrictions
Minimum	≥ 0

3. Property `Job statistics > min`


Type	`number`
Required	Yes

Description: Job metric minimum

Restrictions
Minimum	≥ 0

4. Property `Job statistics > max`


Type	`number`
Required	Yes

Description: Job metric maximum

Restrictions
Minimum	≥ 0
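
As a rough illustration of the format above, the following sketch builds a job statistics object from a list of samples. It assumes the samples are non-negative numbers, matching the minimum restriction on `avg`, `min` and `max`, and that `unit` follows the Unit Schema (`base` plus optional `prefix`). The function name `build_statistics` and the sample values are hypothetical.

```python
# Illustrative only: build_statistics is a hypothetical helper.
from typing import Optional


def build_statistics(samples: list[float], base: str,
                     prefix: Optional[str] = None) -> dict:
    """Build a dict with the four required properties: unit, avg, min, max."""
    if not samples:
        raise ValueError("at least one sample is needed")
    if any(value < 0 for value in samples):
        raise ValueError("avg, min and max must be >= 0")
    unit = {"base": base}
    if prefix is not None:
        unit["prefix"] = prefix  # prefix is optional in the unit schema
    return {
        "unit": unit,
        "avg": sum(samples) / len(samples),
        "min": min(samples),
        "max": max(samples),
    }


print(build_statistics([1.0, 2.0, 3.0], "F/s", "G"))
```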


7.1.7.5 - Unit Schema

ClusterCockpit Unit Schema Reference

The following schema in its raw form can be found in the ClusterCockpit GitHub repository.

Manual Updates

Changes to the original JSON schema found in the repository are not automatically rendered in this reference documentation.

Last Update: 04.12.2024

Metric unit

Title: Metric unit


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Format specification for job metric units

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ base	No	enum (of string)	No	-	Metric base unit
- prefix	No	enum (of string)	No	-	Unit prefix

1. Property `Metric unit > base`


Type	`enum (of string)`
Required	Yes

Description: Metric base unit

Must be one of:

“B”
“F”
“B/s”
“F/s”
“CPI”
“IPC”
“Hz”
“W”
“°C”
""

2. Property `Metric unit > prefix`


Type	`enum (of string)`
Required	No

Description: Unit prefix

Must be one of:

“K”
“M”
“G”
“T”
“P”
“E”
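
A minimal sketch of how a unit object could be checked against the two enumerations above and turned into a display label. Concatenating prefix and base (for example `G` and `B/s` into `GB/s`) is an assumption made for this example, and `unit_label` is a hypothetical helper.

```python
# Illustrative only: unit_label is a hypothetical helper.
BASE_UNITS = {"B", "F", "B/s", "F/s", "CPI", "IPC", "Hz", "W", "°C", ""}
PREFIXES = {"K", "M", "G", "T", "P", "E"}


def unit_label(unit: dict) -> str:
    """Validate a unit object and return prefix + base as one string."""
    base = unit.get("base")
    if not isinstance(base, str) or base not in BASE_UNITS:
        raise ValueError(f"unsupported or missing base unit: {base!r}")
    prefix = unit.get("prefix")
    if prefix is None:
        return base  # prefix is optional
    if not isinstance(prefix, str) or prefix not in PREFIXES:
        raise ValueError(f"unsupported prefix: {prefix!r}")
    return prefix + base


print(unit_label({"base": "B/s", "prefix": "G"}))
```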


7.1.7.6 - Job Archive Metadata Schema

ClusterCockpit Job Archive Metadata Schema Reference

The following schema in its raw form can be found in the ClusterCockpit GitHub repository.

Manual Updates

Changes to the original JSON schema found in the repository are not automatically rendered in this reference documentation.

Last Update: 04.12.2024

Job meta data

Title: Job meta data


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Metadata of an HPC job

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ jobId	No	integer	No	-	The unique identifier of a job
+ user	No	string	No	-	The unique identifier of a user
+ project	No	string	No	-	The unique identifier of a project
+ cluster	No	string	No	-	The unique identifier of a cluster
+ subCluster	No	string	No	-	The unique identifier of a sub cluster
- partition	No	string	No	-	The Slurm partition to which the job was submitted
- arrayJobId	No	integer	No	-	The unique identifier of an array job
+ numNodes	No	integer	No	-	Number of nodes used
- numHwthreads	No	integer	No	-	Number of HWThreads used
- numAcc	No	integer	No	-	Number of accelerators used
+ exclusive	No	integer	No	-	Specifies how nodes are shared. 0 - Shared among multiple jobs of multiple users, 1 - Job exclusive, 2 - Shared among multiple jobs of same user
- monitoringStatus	No	integer	No	-	State of monitoring system during job run
- smt	No	integer	No	-	SMT threads used by job
- walltime	No	integer	No	-	Requested walltime of job in seconds
+ jobState	No	enum (of string)	No	-	Final state of job
+ startTime	No	integer	No	-	Start epoch time stamp in seconds
+ duration	No	integer	No	-	Duration of job in seconds
+ resources	No	array of object	No	-	Resources used by job
- metaData	No	object	No	-	Additional information about the job
- tags	No	array of object	No	-	List of tags
+ statistics	No	object	No	-	Job statistic data
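
The required properties in the table above (those marked with `+`) can be checked before a metadata file is written to the job archive. The sketch below assumes the metadata is a plain dict; `missing_meta_keys` and every value in `sample_meta` are made up for illustration.

```python
# Illustrative only: missing_meta_keys and sample_meta are hypothetical.
REQUIRED_META_KEYS = (
    "jobId", "user", "project", "cluster", "subCluster", "numNodes",
    "exclusive", "jobState", "startTime", "duration", "resources",
    "statistics",
)


def missing_meta_keys(meta: dict) -> list[str]:
    """Return the required job metadata properties that are absent."""
    return [key for key in REQUIRED_META_KEYS if key not in meta]


sample_meta = {
    "jobId": 123456,
    "user": "alice",
    "project": "proj01",
    "cluster": "clusterA",
    "subCluster": "main",
    "numNodes": 2,
    "exclusive": 1,
    "jobState": "completed",
    "startTime": 1700000000,
    "duration": 3600,
    "resources": [{"hostname": "node001"}, {"hostname": "node002"}],
    "statistics": {},
}
print(missing_meta_keys(sample_meta))
```

This only checks presence; type and range restrictions are described per property below.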

1. Property `Job meta data > jobId`


Type	`integer`
Required	Yes

Description: The unique identifier of a job

2. Property `Job meta data > user`


Type	`string`
Required	Yes

Description: The unique identifier of a user

3. Property `Job meta data > project`


Type	`string`
Required	Yes

Description: The unique identifier of a project

4. Property `Job meta data > cluster`


Type	`string`
Required	Yes

Description: The unique identifier of a cluster

5. Property `Job meta data > subCluster`


Type	`string`
Required	Yes

Description: The unique identifier of a sub cluster

6. Property `Job meta data > partition`


Type	`string`
Required	No

Description: The Slurm partition to which the job was submitted

7. Property `Job meta data > arrayJobId`


Type	`integer`
Required	No

Description: The unique identifier of an array job

8. Property `Job meta data > numNodes`


Type	`integer`
Required	Yes

Description: Number of nodes used

Restrictions
Minimum	> 0

9. Property `Job meta data > numHwthreads`


Type	`integer`
Required	No

Description: Number of HWThreads used

Restrictions
Minimum	> 0

10. Property `Job meta data > numAcc`


Type	`integer`
Required	No

Description: Number of accelerators used

Restrictions
Minimum	> 0

11. Property `Job meta data > exclusive`


Type	`integer`
Required	Yes

Description: Specifies how nodes are shared. 0 - Shared among multiple jobs of multiple users, 1 - Job exclusive, 2 - Shared among multiple jobs of same user

Restrictions
Minimum	≥ 0
Maximum	≤ 2
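
The three `exclusive` values can be given readable names in client code. This is a sketch using Python's standard `enum` module; the class name `NodeSharing` and its member names are hypothetical and chosen only for this example.

```python
# Illustrative only: NodeSharing is a hypothetical name.
from enum import IntEnum


class NodeSharing(IntEnum):
    SHARED = 0            # shared among multiple jobs of multiple users
    EXCLUSIVE = 1         # job exclusive
    SHARED_SAME_USER = 2  # shared among multiple jobs of same user


def parse_exclusive(value: int) -> NodeSharing:
    """Map the integer from the metadata to a named member.

    Raises ValueError for values outside the allowed range 0..2.
    """
    return NodeSharing(value)


print(parse_exclusive(1).name)
```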

12. Property `Job meta data > monitoringStatus`


Type	`integer`
Required	No

Description: State of monitoring system during job run

13. Property `Job meta data > smt`


Type	`integer`
Required	No

Description: SMT threads used by job

14. Property `Job meta data > walltime`


Type	`integer`
Required	No

Description: Requested walltime of job in seconds

Restrictions
Minimum	> 0

15. Property `Job meta data > jobState`


Type	`enum (of string)`
Required	Yes

Description: Final state of job

Must be one of:

“completed”
“failed”
“cancelled”
“stopped”
“out_of_memory”
“timeout”

16. Property `Job meta data > startTime`


Type	`integer`
Required	Yes

Description: Start epoch time stamp in seconds

Restrictions
Minimum	> 0

17. Property `Job meta data > duration`


Type	`integer`
Required	Yes

Description: Duration of job in seconds

Restrictions
Minimum	> 0

18. Property `Job meta data > resources`


Type	`array of object`
Required	Yes

Description: Resources used by job

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
resources items	-

18.1. Job meta data > resources > resources items


Type	`object`
Required	No
Additional properties	Any type allowed

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ hostname	No	string	No	-	-
- hwthreads	No	array of integer	No	-	List of OS processor ids
- accelerators	No	array of string	No	-	List of accelerator device ids
- configuration	No	string	No	-	The configuration options of the node

18.1.1. Property `Job meta data > resources > resources items > hostname`


Type	`string`
Required	Yes

18.1.2. Property `Job meta data > resources > resources items > hwthreads`


Type	`array of integer`
Required	No

Description: List of OS processor ids

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
hwthreads items	-

18.1.2.1. Job meta data > resources > resources items > hwthreads > hwthreads items


Type	`integer`
Required	No

18.1.3. Property `Job meta data > resources > resources items > accelerators`


Type	`array of string`
Required	No

Description: List of accelerator device ids

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
accelerators items	-

18.1.3.1. Job meta data > resources > resources items > accelerators > accelerators items


Type	`string`
Required	No

18.1.4. Property `Job meta data > resources > resources items > configuration`


Type	`string`
Required	No

Description: The configuration options of the node

19. Property `Job meta data > metaData`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Additional information about the job

Property	Pattern	Type	Deprecated	Definition	Title/Description
- jobScript	No	string	No	-	The batch script of the job
- jobName	No	string	No	-	Slurm Job name
- slurmInfo	No	string	No	-	Additional Slurm information as shown by scontrol show job

19.1. Property `Job meta data > metaData > jobScript`


Type	`string`
Required	No

Description: The batch script of the job

19.2. Property `Job meta data > metaData > jobName`


Type	`string`
Required	No

Description: Slurm Job name

19.3. Property `Job meta data > metaData > slurmInfo`


Type	`string`
Required	No

Description: Additional Slurm information as shown by scontrol show job

20. Property `Job meta data > tags`


Type	`array of object`
Required	No

Description: List of tags

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	True
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
tags items	-

20.1. Job meta data > tags > tags items


Type	`object`
Required	No
Additional properties	Any type allowed

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ name	No	string	No	-	-
+ type	No	string	No	-	-

20.1.1. Property `Job meta data > tags > tags items > name`


Type	`string`
Required	Yes

20.1.2. Property `Job meta data > tags > tags items > type`


Type	`string`
Required	Yes

21. Property `Job meta data > statistics`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: Job statistic data

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ mem_used	No	object	No	In embedfs://job-metric-statistics.schema.json	Memory capacity used (required)
+ cpu_load	No	object	No	In embedfs://job-metric-statistics.schema.json	CPU requested core utilization (load 1m) (required)
+ flops_any	No	object	No	In embedfs://job-metric-statistics.schema.json	Total flop rate with DP flops scaled up (required)
+ mem_bw	No	object	No	In embedfs://job-metric-statistics.schema.json	Main memory bandwidth (required)
- net_bw	No	object	No	In embedfs://job-metric-statistics.schema.json	Total fast interconnect network bandwidth (required)
- file_bw	No	object	No	In embedfs://job-metric-statistics.schema.json	Total file IO bandwidth (required)
- ipc	No	object	No	In embedfs://job-metric-statistics.schema.json	Instructions executed per cycle
+ cpu_user	No	object	No	In embedfs://job-metric-statistics.schema.json	CPU user active core utilization
- flops_dp	No	object	No	In embedfs://job-metric-statistics.schema.json	Double precision flop rate
- flops_sp	No	object	No	In embedfs://job-metric-statistics.schema.json	Single precision flop rate
- rapl_power	No	object	No	In embedfs://job-metric-statistics.schema.json	CPU power consumption
- acc_used	No	object	No	In embedfs://job-metric-statistics.schema.json	GPU utilization
- acc_mem_used	No	object	No	In embedfs://job-metric-statistics.schema.json	GPU memory capacity used
- acc_power	No	object	No	In embedfs://job-metric-statistics.schema.json	GPU power consumption
- clock	No	object	No	In embedfs://job-metric-statistics.schema.json	Average core frequency
- eth_read_bw	No	object	No	In embedfs://job-metric-statistics.schema.json	Ethernet read bandwidth
- eth_write_bw	No	object	No	In embedfs://job-metric-statistics.schema.json	Ethernet write bandwidth
- ic_rcv_packets	No	object	No	In embedfs://job-metric-statistics.schema.json	Network interconnect read packets
- ic_send_packets	No	object	No	In embedfs://job-metric-statistics.schema.json	Network interconnect send packets
- ic_read_bw	No	object	No	In embedfs://job-metric-statistics.schema.json	Network interconnect read bandwidth
- ic_write_bw	No	object	No	In embedfs://job-metric-statistics.schema.json	Network interconnect write bandwidth
- filesystems	No	array of object	No	-	Array of filesystems

21.1. Property `Job meta data > statistics > mem_used`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: Memory capacity used (required)

21.2. Property `Job meta data > statistics > cpu_load`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: CPU requested core utilization (load 1m) (required)

21.3. Property `Job meta data > statistics > flops_any`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: Total flop rate with DP flops scaled up (required)

21.4. Property `Job meta data > statistics > mem_bw`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: Main memory bandwidth (required)

21.5. Property `Job meta data > statistics > net_bw`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: Total fast interconnect network bandwidth (required)

21.6. Property `Job meta data > statistics > file_bw`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: Total file IO bandwidth (required)

21.7. Property `Job meta data > statistics > ipc`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: Instructions executed per cycle

21.8. Property `Job meta data > statistics > cpu_user`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: CPU user active core utilization

21.9. Property `Job meta data > statistics > flops_dp`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: Double precision flop rate

21.10. Property `Job meta data > statistics > flops_sp`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: Single precision flop rate

21.11. Property `Job meta data > statistics > rapl_power`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: CPU power consumption

21.12. Property `Job meta data > statistics > acc_used`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: GPU utilization

21.13. Property `Job meta data > statistics > acc_mem_used`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: GPU memory capacity used

21.14. Property `Job meta data > statistics > acc_power`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: GPU power consumption

21.15. Property `Job meta data > statistics > clock`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: Average core frequency

21.16. Property `Job meta data > statistics > eth_read_bw`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: Ethernet read bandwidth

21.17. Property `Job meta data > statistics > eth_write_bw`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: Ethernet write bandwidth

21.18. Property `Job meta data > statistics > ic_rcv_packets`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: Network interconnect read packets

21.19. Property `Job meta data > statistics > ic_send_packets`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: Network interconnect send packets

21.20. Property `Job meta data > statistics > ic_read_bw`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: Network interconnect read bandwidth

21.21. Property `Job meta data > statistics > ic_write_bw`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: Network interconnect write bandwidth

21.22. Property `Job meta data > statistics > filesystems`


Type	`array of object`
Required	No

Description: Array of filesystems

	Array restrictions
Min items	1
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
filesystems items	-

21.22.1. Job meta data > statistics > filesystems > filesystems items


Type	`object`
Required	No
Additional properties	Any type allowed

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ name	No	string	No	-	-
+ type	No	enum (of string)	No	-	-
+ read_bw	No	object	No	In embedfs://job-metric-statistics.schema.json	File system read bandwidth
+ write_bw	No	object	No	In embedfs://job-metric-statistics.schema.json	File system write bandwidth
- read_req	No	object	No	In embedfs://job-metric-statistics.schema.json	File system read requests
- write_req	No	object	No	In embedfs://job-metric-statistics.schema.json	File system write requests
- inodes	No	object	No	In embedfs://job-metric-statistics.schema.json	File system write requests
- accesses	No	object	No	In embedfs://job-metric-statistics.schema.json	File system open and close
- fsync	No	object	No	In embedfs://job-metric-statistics.schema.json	File system fsync
- create	No	object	No	In embedfs://job-metric-statistics.schema.json	File system create
- open	No	object	No	In embedfs://job-metric-statistics.schema.json	File system open
- close	No	object	No	In embedfs://job-metric-statistics.schema.json	File system close
- seek	No	object	No	In embedfs://job-metric-statistics.schema.json	File system seek

21.22.1.1. Property `Job meta data > statistics > filesystems > filesystems items > name`


Type	`string`
Required	Yes

21.22.1.2. Property `Job meta data > statistics > filesystems > filesystems items > type`


Type	`enum (of string)`
Required	Yes

Must be one of:

“nfs”
“lustre”
“gpfs”
“nvme”
“ssd”
“hdd”
“beegfs”

21.22.1.3. Property `Job meta data > statistics > filesystems > filesystems items > read_bw`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: File system read bandwidth

21.22.1.4. Property `Job meta data > statistics > filesystems > filesystems items > write_bw`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: File system write bandwidth

21.22.1.5. Property `Job meta data > statistics > filesystems > filesystems items > read_req`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: File system read requests

21.22.1.6. Property `Job meta data > statistics > filesystems > filesystems items > write_req`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: File system write requests

21.22.1.7. Property `Job meta data > statistics > filesystems > filesystems items > inodes`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: File system write requests

21.22.1.8. Property `Job meta data > statistics > filesystems > filesystems items > accesses`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: File system open and close

21.22.1.9. Property `Job meta data > statistics > filesystems > filesystems items > fsync`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: File system fsync

21.22.1.10. Property `Job meta data > statistics > filesystems > filesystems items > create`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: File system create

21.22.1.11. Property `Job meta data > statistics > filesystems > filesystems items > open`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: File system open

21.22.1.12. Property `Job meta data > statistics > filesystems > filesystems items > close`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: File system close

21.22.1.13. Property `Job meta data > statistics > filesystems > filesystems items > seek`


Type	`object`
Required	No
Additional properties	Any type allowed
Defined in	embedfs://job-metric-statistics.schema.json

Description: File system seek
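
As an illustration of the properties listed above, the following Python sketch builds a minimal job metadata document and checks it against the required properties and restrictions from this reference. It is not the validator ClusterCockpit uses. The function name `check_job_meta`, the sample values, and the empty per-metric objects are assumptions made for this example (the layout of job-metric-statistics.schema.json is not reproduced here).

```python
# Required properties and enum values, copied from the reference above.
REQUIRED = ["jobId", "user", "project", "cluster", "subCluster", "numNodes",
            "exclusive", "jobState", "startTime", "duration", "resources",
            "statistics"]
REQUIRED_STATISTICS = ["mem_used", "cpu_load", "flops_any", "mem_bw", "cpu_user"]
JOB_STATES = {"completed", "failed", "cancelled", "stopped",
              "out_of_memory", "timeout"}


def check_job_meta(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = [f"missing required property: {key}"
                for key in REQUIRED if key not in meta]
    if problems:
        return problems  # the checks below need these keys to exist
    if meta["jobState"] not in JOB_STATES:
        problems.append(f"invalid jobState: {meta['jobState']!r}")
    if not 0 <= meta["exclusive"] <= 2:
        problems.append("exclusive must be 0, 1 or 2")
    for key in ("numNodes", "startTime", "duration"):
        if meta[key] <= 0:
            problems.append(f"{key} must be greater than 0")
    for index, resource in enumerate(meta["resources"]):
        if "hostname" not in resource:
            problems.append(f"resources[{index}] is missing hostname")
    for key in REQUIRED_STATISTICS:
        if key not in meta["statistics"]:
            problems.append(f"statistics is missing {key}")
    return problems


# Hypothetical sample document; all values are invented for illustration.
example = {
    "jobId": 123456, "user": "alice", "project": "proj1",
    "cluster": "clusterA", "subCluster": "main",
    "numNodes": 1, "exclusive": 1, "jobState": "completed",
    "startTime": 1700000000, "duration": 3600,
    "resources": [{"hostname": "node001", "hwthreads": [0, 1, 2, 3]}],
    # Per-metric objects are left empty; their schema is defined elsewhere.
    "statistics": {name: {} for name in REQUIRED_STATISTICS},
}

print(check_job_meta(example))
```

A full validation should still be done against the JSON schema itself, for example with archive-manager -validate.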

Generated using json-schema-for-humans on 2024-12-04 at 16:45:59 +0100

7.1.7.7 - Job Archive Metrics Data Schema

ClusterCockpit Job Archive Metrics Data Schema Reference

The following schema in its raw form can be found in the ClusterCockpit GitHub repository.

Manual Updates

Changes to the original JSON schema found in the repository are not automatically rendered in this reference documentation.

Last Update: 04.12.2024

Job metric data

Title: Job metric data


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Metric data of an HPC job

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ unit	No	object	No	In embedfs://unit.schema.json	Metric unit
+ timestep	No	integer	No	-	Measurement interval in seconds
- thresholds	No	object	No	-	Metric thresholds for specific system
- statisticsSeries	No	object	No	-	Statistics series across topology
+ series	No	array of object	No	-	-

1. Property `Job metric data > unit`


Type	`object`
Required	Yes
Additional properties	Any type allowed
Defined in	embedfs://unit.schema.json

Description: Metric unit

2. Property `Job metric data > timestep`


Type	`integer`
Required	Yes

Description: Measurement interval in seconds

3. Property `Job metric data > thresholds`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Metric thresholds for specific system

Property	Pattern	Type	Deprecated	Definition	Title/Description
- peak	No	number	No	-	-
- normal	No	number	No	-	-
- caution	No	number	No	-	-
- alert	No	number	No	-	-

3.1. Property `Job metric data > thresholds > peak`


Type	`number`
Required	No

3.2. Property `Job metric data > thresholds > normal`


Type	`number`
Required	No

3.3. Property `Job metric data > thresholds > caution`


Type	`number`
Required	No

3.4. Property `Job metric data > thresholds > alert`


Type	`number`
Required	No

4. Property `Job metric data > statisticsSeries`


Type	`object`
Required	No
Additional properties	Any type allowed

Description: Statistics series across topology

Property	Pattern	Type	Deprecated	Definition	Title/Description
- min	No	array of number	No	-	-
- max	No	array of number	No	-	-
- mean	No	array of number	No	-	-
- percentiles	No	object	No	-	-

4.1. Property `Job metric data > statisticsSeries > min`


Type	`array of number`
Required	No

	Array restrictions
Min items	3
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
min items	-

4.1.1. Job metric data > statisticsSeries > min > min items


Type	`number`
Required	No

Restrictions
Minimum	≥ 0

4.2. Property `Job metric data > statisticsSeries > max`


Type	`array of number`
Required	No

	Array restrictions
Min items	3
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
max items	-

4.2.1. Job metric data > statisticsSeries > max > max items


Type	`number`
Required	No

Restrictions
Minimum	≥ 0

4.3. Property `Job metric data > statisticsSeries > mean`


Type	`array of number`
Required	No

	Array restrictions
Min items	3
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
mean items	-

4.3.1. Job metric data > statisticsSeries > mean > mean items


Type	`number`
Required	No

Restrictions
Minimum	≥ 0

4.4. Property `Job metric data > statisticsSeries > percentiles`


Type	`object`
Required	No
Additional properties	Any type allowed

Property	Pattern	Type	Deprecated	Definition	Title/Description
- 10	No	array of number	No	-	-
- 20	No	array of number	No	-	-
- 30	No	array of number	No	-	-
- 40	No	array of number	No	-	-
- 50	No	array of number	No	-	-
- 60	No	array of number	No	-	-
- 70	No	array of number	No	-	-
- 80	No	array of number	No	-	-
- 90	No	array of number	No	-	-
- 25	No	array of number	No	-	-
- 75	No	array of number	No	-	-

4.4.1. Property `Job metric data > statisticsSeries > percentiles > 10`


Type	`array of number`
Required	No

	Array restrictions
Min items	3
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
10 items	-

4.4.1.1. Job metric data > statisticsSeries > percentiles > 10 > 10 items


Type	`number`
Required	No

Restrictions
Minimum	≥ 0

4.4.2. Property `Job metric data > statisticsSeries > percentiles > 20`


Type	`array of number`
Required	No

	Array restrictions
Min items	3
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
20 items	-

4.4.2.1. Job metric data > statisticsSeries > percentiles > 20 > 20 items


Type	`number`
Required	No

Restrictions
Minimum	≥ 0

4.4.3. Property `Job metric data > statisticsSeries > percentiles > 30`


Type	`array of number`
Required	No

	Array restrictions
Min items	3
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
30 items	-

4.4.3.1. Job metric data > statisticsSeries > percentiles > 30 > 30 items


Type	`number`
Required	No

Restrictions
Minimum	≥ 0

4.4.4. Property `Job metric data > statisticsSeries > percentiles > 40`


Type	`array of number`
Required	No

	Array restrictions
Min items	3
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
40 items	-

4.4.4.1. Job metric data > statisticsSeries > percentiles > 40 > 40 items


Type	`number`
Required	No

Restrictions
Minimum	≥ 0

4.4.5. Property `Job metric data > statisticsSeries > percentiles > 50`


Type	`array of number`
Required	No

	Array restrictions
Min items	3
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
50 items	-

4.4.5.1. Job metric data > statisticsSeries > percentiles > 50 > 50 items


Type	`number`
Required	No

Restrictions
Minimum	≥ 0

4.4.6. Property `Job metric data > statisticsSeries > percentiles > 60`


Type	`array of number`
Required	No

	Array restrictions
Min items	3
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
60 items	-

4.4.6.1. Job metric data > statisticsSeries > percentiles > 60 > 60 items


Type	`number`
Required	No

Restrictions
Minimum	≥ 0

4.4.7. Property `Job metric data > statisticsSeries > percentiles > 70`


Type	`array of number`
Required	No

	Array restrictions
Min items	3
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
70 items	-

4.4.7.1. Job metric data > statisticsSeries > percentiles > 70 > 70 items


Type	`number`
Required	No

Restrictions
Minimum	≥ 0

4.4.8. Property `Job metric data > statisticsSeries > percentiles > 80`


Type	`array of number`
Required	No

	Array restrictions
Min items	3
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
80 items	-

4.4.8.1. Job metric data > statisticsSeries > percentiles > 80 > 80 items


Type	`number`
Required	No

Restrictions
Minimum	≥ 0

4.4.9. Property `Job metric data > statisticsSeries > percentiles > 90`


Type	`array of number`
Required	No

	Array restrictions
Min items	3
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
90 items	-

4.4.9.1. Job metric data > statisticsSeries > percentiles > 90 > 90 items


Type	`number`
Required	No

Restrictions
Minimum	≥ 0

4.4.10. Property `Job metric data > statisticsSeries > percentiles > 25`


Type	`array of number`
Required	No

	Array restrictions
Min items	3
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
25 items	-

4.4.10.1. Job metric data > statisticsSeries > percentiles > 25 > 25 items


Type	`number`
Required	No

Restrictions
Minimum	≥ 0

4.4.11. Property `Job metric data > statisticsSeries > percentiles > 75`


Type	`array of number`
Required	No

	Array restrictions
Min items	3
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
75 items	-

4.4.11.1. Job metric data > statisticsSeries > percentiles > 75 > 75 items


Type	`number`
Required	No

Restrictions
Minimum	≥ 0
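
To make the shape of statisticsSeries more concrete, the Python sketch below derives min, max, and mean arrays from several per-node series by aggregating across nodes at each time step. Reading "statistics series across topology" this way is an assumption, as are the function name `statistics_series` and the sample values. Percentiles are omitted.

```python
from statistics import fmean


def statistics_series(series: list[dict]) -> dict:
    """Aggregate per-node data arrays into one min/max/mean array each."""
    # zip(*...) yields one tuple per time step containing the value of
    # every node; it stops at the shortest data array.
    columns = list(zip(*(item["data"] for item in series)))
    return {
        "min": [min(column) for column in columns],
        "max": [max(column) for column in columns],
        "mean": [fmean(column) for column in columns],
    }


# Hypothetical two-node job with three time steps (the schema lists a
# minimum of three items per array).
nodes = [
    {"hostname": "node001", "data": [1.0, 2.0, 3.0]},
    {"hostname": "node002", "data": [3.0, 2.0, 5.0]},
]
result = statistics_series(nodes)
print(result["min"], result["max"], result["mean"])
```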

5. Property `Job metric data > series`


Type	`array of object`
Required	Yes

	Array restrictions
Min items	N/A
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

Each item of this array must be	Description
series items	-

5.1. Job metric data > series > series items


Type	`object`
Required	No
Additional properties	Any type allowed

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ hostname	No	string	No	-	-
- id	No	string	No	-	-
+ statistics	No	object	No	-	Statistics across time dimension
+ data	No	array	No	-	-

5.1.1. Property `Job metric data > series > series items > hostname`


Type	`string`
Required	Yes

5.1.2. Property `Job metric data > series > series items > id`


Type	`string`
Required	No

5.1.3. Property `Job metric data > series > series items > statistics`


Type	`object`
Required	Yes
Additional properties	Any type allowed

Description: Statistics across time dimension

Property	Pattern	Type	Deprecated	Definition	Title/Description
+ avg	No	number	No	-	Series average
+ min	No	number	No	-	Series minimum
+ max	No	number	No	-	Series maximum

5.1.3.1. Property `Job metric data > series > series items > statistics > avg`


Type	`number`
Required	Yes

Description: Series average

Restrictions
Minimum	≥ 0

5.1.3.2. Property `Job metric data > series > series items > statistics > min`


Type	`number`
Required	Yes

Description: Series minimum

Restrictions
Minimum	≥ 0

5.1.3.3. Property `Job metric data > series > series items > statistics > max`


Type	`number`
Required	Yes

Description: Series maximum

Restrictions
Minimum	≥ 0

5.1.4. Property `Job metric data > series > series items > data`


Type	`array`
Required	Yes

	Array restrictions
Min items	1
Max items	N/A
Items unicity	False
Additional items	False
Tuple validation	See below

5.1.4.1. At least one of the items must be


Type	`number`
Required	No

Restrictions
Minimum	≥ 0
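
The following Python sketch assembles a metric document with the required properties unit, timestep, and series, and fills the per-series statistics (avg, min, max) from the data array. The helper names and sample values are invented for this example, and the content of the unit object is an assumption because unit.schema.json is not reproduced here.

```python
def make_series_item(hostname: str, data: list[float]) -> dict:
    """Build one series item; statistics summarise the data over time."""
    if not data:
        raise ValueError("data must contain at least one value")
    return {
        "hostname": hostname,
        "statistics": {
            "avg": sum(data) / len(data),
            "min": min(data),
            "max": max(data),
        },
        "data": list(data),
    }


def make_metric(unit: dict, timestep: int, node_data: dict) -> dict:
    """Build a metric document from a mapping of hostname to data array."""
    return {
        "unit": unit,
        "timestep": timestep,
        "series": [make_series_item(host, values)
                   for host, values in node_data.items()],
    }


# The unit layout below is assumed, not taken from this reference.
metric = make_metric({"base": "B/s"}, 60, {"node001": [1.0, 2.0, 3.0]})
print(metric["series"][0]["statistics"])
```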

Generated using json-schema-for-humans on 2024-12-04 at 16:45:59 +0100

7.1.8 - Tools

Command-line tools for ClusterCockpit maintenance and administration

This section documents the command-line tools included with ClusterCockpit for various maintenance, migration, and administrative tasks.

Available Tools

Archive Management

archive-manager: Comprehensive job archive management, validation, cleaning, and import/export
archive-migration: Migrate job archives between schema versions

Security & Authentication

gen-keypair: Generate Ed25519 keypairs for JWT signing and validation
convert-pem-pubkey: Convert external Ed25519 PEM keys to ClusterCockpit format

Diagnostics

grepCCLog.pl: Analyze log files to identify non-archived jobs

Data Generation for cc-metric-store

dataGenerator.sh: Connects to cc-metric-store (external or internal) and pushes data at 1-minute intervals.

Building Tools

All Go-based tools follow the same build pattern:

cd tools/<tool-name>
go build

Common Features

Most tools support:

Configurable logging levels (-loglevel)
Timestamped log output (-logdate)
Configuration file specification (-config)

7.1.8.1 - archive-manager

Job Archive Management Tool

The archive-manager tool provides comprehensive management and maintenance capabilities for ClusterCockpit job archives. It supports validation, cleaning, importing between different archive backends, and general archive operations.

Build

cd tools/archive-manager
go build

Command-Line Options

-s <path>

Function: Specify the source job archive path.

Default: ./var/job-archive

Example: -s /data/job-archive

-config <path>

Function: Specify alternative path to config.json.

Default: ./config.json

Example: -config /etc/clustercockpit/config.json

-validate

Function: Validate a job archive against the JSON schema.

-remove-cluster <cluster>

Function: Remove specified cluster from archive and database.

Example: -remove-cluster oldcluster

-remove-before <date>

Function: Remove all jobs with start time before the specified date.

Format: YYYY-Mon-DD with an abbreviated English month name, for example 2006-Jan-04

Example: -remove-before 2023-Jan-01

-remove-after <date>

Function: Remove all jobs with start time after the specified date.

Format: YYYY-Mon-DD with an abbreviated English month name, for example 2006-Jan-04

Example: -remove-after 2024-Dec-31

-import

Function: Import jobs from source archive to destination archive.

Note: Requires -src-config and -dst-config options.

-src-config <json>

Function: Source archive backend configuration in JSON format.

Example: -src-config '{"kind":"file","path":"./archive"}'

-dst-config <json>

Function: Destination archive backend configuration in JSON format.

Example: -dst-config '{"kind":"sqlite","dbPath":"./archive.db"}'

-loglevel <level>

Function: Sets the logging level.

Default: info

Example: -loglevel debug

-logdate

Function: Set this flag to add date and time to log messages.

Usage Examples

Validate Archive

./archive-manager -s /data/job-archive -validate

Clean Old Jobs

# Remove jobs that started before January 1, 2023
./archive-manager -s /data/job-archive -remove-before 2023-Jan-01
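
To check which jobs a given cutoff date would select before running the command, the date can be converted to an epoch timestamp and compared with a job's startTime. The Python sketch below does this. The function names are hypothetical, and both the UTC interpretation and the strict "before" comparison are assumptions; the tool's own time zone and boundary handling may differ.

```python
from datetime import datetime, timezone


def parse_cutoff(text: str) -> int:
    """Parse a date such as 2023-Jan-01 into epoch seconds (UTC assumed)."""
    # %b matches the abbreviated month name in the current locale, which
    # is English unless the program changed the locale.
    parsed = datetime.strptime(text, "%Y-%b-%d")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def selected_by_remove_before(start_time: int, cutoff: str) -> bool:
    """True if a job with this startTime lies before the cutoff date."""
    return start_time < parse_cutoff(cutoff)


print(parse_cutoff("2023-Jan-01"))
print(selected_by_remove_before(1700000000, "2023-Jan-01"))
```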

Import Between Archives

# Import from file-based archive to SQLite archive
./archive-manager -import \
  -src-config '{"kind":"file","path":"./old-archive"}' \
  -dst-config '{"kind":"sqlite","dbPath":"./new-archive.db"}'
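
When the import is started from a script, building the JSON arguments programmatically avoids shell quoting mistakes. The Python sketch below only assembles the argument list shown above. The function name `import_command` is hypothetical, and the configuration keys are the ones used in the examples of this page.

```python
import json
import shlex


def import_command(src: dict, dst: dict,
                   binary: str = "./archive-manager") -> list[str]:
    """Return the argv list for an archive import between two backends."""
    compact = {"separators": (",", ":")}  # JSON without extra spaces
    return [
        binary, "-import",
        "-src-config", json.dumps(src, **compact),
        "-dst-config", json.dumps(dst, **compact),
    ]


argv = import_command(
    {"kind": "file", "path": "./old-archive"},
    {"kind": "sqlite", "dbPath": "./new-archive.db"},
)
# shlex.join renders the list as a single, safely quoted shell command.
print(shlex.join(argv))
```

The list can be passed unchanged to `subprocess.run(argv, check=True)`, which bypasses the shell and its quoting rules.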

Archive Information

# Display archive statistics
./archive-manager -s /data/job-archive

Features

Validation: Verify job archive integrity against JSON schemas
Cleaning: Remove jobs by date range or cluster
Import/Export: Transfer jobs between different archive backend types
Statistics: Display archive information and job counts
Progress Tracking: Real-time progress reporting for long operations

7.1.8.2 - archive-migration

Job Archive Schema Migration Tool

The archive-migration tool migrates job archives from old schema versions to the current schema version. It handles schema changes such as the exclusive → shared field transformation and adds/removes fields as needed.

Features

Parallel Processing: Uses a worker pool for fast migration
Dry-Run Mode: Preview changes without modifying files
Safe Transformations: Applies well-defined schema transformations
Progress Reporting: Shows real-time migration progress
Error Handling: Continues on individual failures, reports at end

Build

cd tools/archive-migration
go build

Command-Line Options

-archive <path>

Function: Path to job archive to migrate (required).

Example: -archive /data/job-archive

-dry-run

Function: Preview changes without modifying files.

-workers <n>

Function: Number of parallel workers.

Default: 4

Example: -workers 8

-loglevel <level>

Function: Sets the logging level.

Default: info

Example: -loglevel debug

-logdate

Function: Add date and time to log messages.

Schema Transformations

Exclusive → Shared

Converts the old exclusive integer field to the new shared string field:

0 → "multi_user"
1 → "none"
2 → "single_user"

Missing Fields

Adds fields required by current schema:

submitTime: Defaults to startTime if missing
energy: Defaults to 0.0
requestedMemory: Defaults to 0
shared: Defaults to "none" if still missing after transformation

Deprecated Fields

Removes fields no longer in schema:

mem_used_max, flops_any_avg, mem_bw_avg
load_avg, net_bw_avg, net_data_vol_total
file_bw_avg, file_data_vol_total
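
The three groups of transformations can be sketched in Python as follows. This is an illustration of the rules listed above, not the tool's implementation (which is written in Go). It assumes that the deprecated fields are top-level keys of meta.json and that the old exclusive key is dropped after conversion. The function name `migrate_meta` and the sample document are invented.

```python
SHARED_BY_EXCLUSIVE = {0: "multi_user", 1: "none", 2: "single_user"}
DEPRECATED = (
    "mem_used_max", "flops_any_avg", "mem_bw_avg",
    "load_avg", "net_bw_avg", "net_data_vol_total",
    "file_bw_avg", "file_data_vol_total",
)


def migrate_meta(meta: dict) -> dict:
    """Return a migrated copy of a job metadata document."""
    out = dict(meta)
    # Exclusive -> Shared: map the old integer to the new string value.
    if "exclusive" in out:
        value = out.pop("exclusive")
        if "shared" not in out and value in SHARED_BY_EXCLUSIVE:
            out["shared"] = SHARED_BY_EXCLUSIVE[value]
    # Missing fields: add defaults without overwriting existing values.
    out.setdefault("shared", "none")
    if "submitTime" not in out and "startTime" in out:
        out["submitTime"] = out["startTime"]
    out.setdefault("energy", 0.0)
    out.setdefault("requestedMemory", 0)
    # Deprecated fields: remove them if present.
    for key in DEPRECATED:
        out.pop(key, None)
    return out


old = {"jobId": 1, "exclusive": 2, "startTime": 1700000000, "load_avg": 0.9}
new = migrate_meta(old)
print(new)
```

Applying `migrate_meta` to its own output returns an equal document, because every step either adds a missing key or removes a key that is no longer present.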

Usage Examples

Preview Changes (Dry Run)

./archive-migration --archive /data/job-archive --dry-run

Migrate Archive

# IMPORTANT: Back up your archive first!
cp -r /data/job-archive /data/job-archive-backup

# Run migration
./archive-migration --archive /data/job-archive

Migrate with Verbose Logging

./archive-migration --archive /data/job-archive --loglevel debug

Migrate with More Workers

./archive-migration --archive /data/job-archive --workers 8

Safety

Always back up your archive before running migration!

The tool modifies meta.json files in place. While transformations are designed to be safe, unexpected issues could occur. Follow these safety practices:

Always run with --dry-run first to preview changes
Back up your archive before migration
Test on a copy of your archive first
Verify results after migration

Verification

After migration, verify the archive:

# Use archive-manager to check the archive
cd ../archive-manager
./archive-manager -s /data/job-archive

# Or validate specific jobs
./archive-manager -s /data/job-archive --validate

Troubleshooting

Migration Failures

If individual jobs fail to migrate:

Check the error messages for specific files
Examine the failing meta.json files manually
Fix invalid JSON or unexpected field types
Re-run migration (already-migrated jobs will be processed again)

Performance

For large archives:

Increase --workers for more parallelism
Use --loglevel warn to reduce log output
Monitor disk I/O if migration is slow

Technical Details

The migration process:

Walks archive directory recursively
Finds all meta.json files
Distributes jobs to worker pool
For each job:
- Reads JSON file
- Applies transformations in order
- Writes back migrated data (if not dry-run)
Reports statistics and errors

Transformations are idempotent - running migration multiple times is safe (though not recommended for performance).
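
The process described above (recursive walk, worker pool, dry-run, errors collected and reported at the end) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the tool's actual code: the function names are hypothetical, the transform callback stands in for the schema transformations, and it must return a new dictionary rather than modify its argument.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor


def find_meta_files(root):
    """Walk the archive recursively and yield every meta.json path."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if "meta.json" in filenames:
            yield os.path.join(dirpath, "meta.json")


def migrate_file(path, transform, dry_run):
    """Read, transform and (unless dry_run) rewrite one meta.json file."""
    try:
        with open(path, encoding="utf-8") as handle:
            meta = json.load(handle)
        migrated = transform(meta)  # must return a new dict, not mutate meta
        changed = migrated != meta
        if changed and not dry_run:
            with open(path, "w", encoding="utf-8") as handle:
                json.dump(migrated, handle)
        return path, changed, None
    except (OSError, ValueError, TypeError) as err:
        # One broken job must not stop the run; it is reported at the end.
        return path, False, err


def migrate_archive(root, transform, workers=4, dry_run=False):
    """Distribute all jobs to a worker pool and collect statistics."""
    stats = {"total": 0, "changed": 0, "failed": []}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda p: migrate_file(p, transform, dry_run),
            find_meta_files(root),
        )
        for path, changed, err in results:
            stats["total"] += 1
            if err is not None:
                stats["failed"].append((path, str(err)))
            elif changed:
                stats["changed"] += 1
    return stats
```

Because a file is only rewritten when the transformed data differs from what was read, a second run over an already migrated archive reports zero changes.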

7.1.8.3 - convert-pem-pubkey

Convert Ed25519 Public Key from PEM to ClusterCockpit Format

The convert-pem-pubkey tool converts an Ed25519 public key from PEM format to the base64 format used by ClusterCockpit for JWT validation.

Use Case

When you have externally generated JSON Web Tokens (JWT) that should be accepted by cc-backend, the external provider shares its public key (the counterpart of the private key it uses for JWT signing) in PEM format. ClusterCockpit requires this key in a different format, which this tool provides.

Build

cd tools/convert-pem-pubkey
go build

Usage

Input Format (PEM)

-----BEGIN PUBLIC KEY-----
MCowBQYDK2VwAyEA+51iXX8BdLFocrppRxIw52xCOf8xFSH/eNilN5IHVGc=
-----END PUBLIC KEY-----

Convert Key

# Insert your public Ed25519 PEM key into dummy.pub
echo "-----BEGIN PUBLIC KEY-----
MCowBQYDK2VwAyEA+51iXX8BdLFocrppRxIw52xCOf8xFSH/eNilN5IHVGc=
-----END PUBLIC KEY-----" > dummy.pub

# Run conversion
go run . dummy.pub

Output Format

CROSS_LOGIN_JWT_PUBLIC_KEY="+51iXX8BdLFocrppRxIw52xCOf8xFSH/eNilN5IHVGc="

Configuration

Copy the output into ClusterCockpit’s .env file
Restart ClusterCockpit backend
ClusterCockpit can now validate JWTs from the external provider

Command-Line Arguments

convert-pem-pubkey <pem-file>

Arguments: Path to PEM-encoded Ed25519 public key file

Example: go run . dummy.pub

Example Workflow

# 1. Navigate to tool directory
cd tools/convert-pem-pubkey

# 2. Save external provider's PEM key
cat > external-key.pub <<EOF
-----BEGIN PUBLIC KEY-----
MCowBQYDK2VwAyEA+51iXX8BdLFocrppRxIw52xCOf8xFSH/eNilN5IHVGc=
-----END PUBLIC KEY-----
EOF

# 3. Convert to ClusterCockpit format
go run . external-key.pub

# 4. Add output to .env file
# CROSS_LOGIN_JWT_PUBLIC_KEY="+51iXX8BdLFocrppRxIw52xCOf8xFSH/eNilN5IHVGc="

# 5. Restart cc-backend

Technical Details

The tool:

Reads Ed25519 public key in PEM format
Extracts the raw key bytes
Encodes to base64 string
Outputs in ClusterCockpit’s expected format

This enables ClusterCockpit to validate JWTs signed by external providers using their Ed25519 keys.
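
To make the conversion steps concrete, here is a minimal Python sketch that reproduces the example from this page using only the standard library. It is not the tool's Go code. It relies on the fact that a DER-encoded Ed25519 SubjectPublicKeyInfo is a fixed 12-byte header followed by the 32 raw key bytes; the function name pem_to_cc_pubkey is hypothetical.

```python
import base64

# Fixed DER header of an Ed25519 SubjectPublicKeyInfo structure.
ED25519_SPKI_PREFIX = bytes.fromhex("302a300506032b6570032100")


def pem_to_cc_pubkey(pem_text):
    """Return the base64-encoded raw Ed25519 key contained in a PEM block."""
    lines = [line.strip() for line in pem_text.strip().splitlines()]
    # Drop the BEGIN/END armor lines and join the base64 body.
    body = "".join(l for l in lines if l and not l.startswith("-----"))
    der = base64.b64decode(body, validate=True)
    if len(der) != 44 or not der.startswith(ED25519_SPKI_PREFIX):
        raise ValueError("not an Ed25519 SubjectPublicKeyInfo public key")
    raw_key = der[len(ED25519_SPKI_PREFIX):]  # the 32 raw key bytes
    return base64.b64encode(raw_key).decode("ascii")


PEM = """-----BEGIN PUBLIC KEY-----
MCowBQYDK2VwAyEA+51iXX8BdLFocrppRxIw52xCOf8xFSH/eNilN5IHVGc=
-----END PUBLIC KEY-----"""

print('CROSS_LOGIN_JWT_PUBLIC_KEY="%s"' % pem_to_cc_pubkey(PEM))
```

For the example key, the printed line matches the Output Format section above.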

7.1.8.4 - gen-keypair

Generate Ed25519 Keypair for JWT Signing

The gen-keypair tool generates a new Ed25519 keypair for signing and validating JWT tokens in ClusterCockpit.

Purpose

Generates a cryptographically secure Ed25519 public/private keypair that can be used for:

JWT token signing (private key)
JWT token validation (public key)

Build

cd tools/gen-keypair
go build

Usage

go run .

Or after building:

./gen-keypair

Output

The tool outputs a keypair in base64-encoded format:

ED25519 PUBLIC_KEY="<base64-encoded-public-key>"
ED25519 PRIVATE_KEY="<base64-encoded-private-key>"
This is NO JWT token. You can generate JWT tokens with cc-backend. Use this keypair for signing and validation of JWT tokens in ClusterCockpit.

Configuration

Add the generated keys to ClusterCockpit’s configuration:

Option 1: Environment Variables (.env file)

ED25519_PUBLIC_KEY="<base64-encoded-public-key>"
ED25519_PRIVATE_KEY="<base64-encoded-private-key>"

Option 2: Configuration File (config.json)

{
  "jwts": {
    "publicKey": "<base64-encoded-public-key>",
    "privateKey": "<base64-encoded-private-key>"
  }
}

Example Workflow

# 1. Generate keypair
cd tools/gen-keypair
go run . > keypair.txt

# 2. View generated keys
cat keypair.txt

# 3. Add to .env file (manual or scripted)
grep PUBLIC_KEY keypair.txt >> ../../.env
grep PRIVATE_KEY keypair.txt >> ../../.env

# 4. Restart cc-backend to use new keys

Security Notes

The private key must be kept secret
Store private keys securely (file permissions, encryption at rest)
Use environment variables or secure configuration management
Do not commit private keys to version control
Rotate keys periodically for enhanced security

Technical Details

The tool uses:

Go’s crypto/ed25519 package
/dev/urandom as entropy source on Linux
Base64 standard encoding for output format

Ed25519 provides:

Fast signature generation and verification
Small key and signature sizes
Strong security guarantees
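
Before copying generated keys into a configuration, a quick structural check can catch copy-and-paste mistakes. The Python sketch below assumes that the tool prints the keys exactly as Go's crypto/ed25519 represents them: a 32-byte public key and a 64-byte private key whose second half repeats the public key. The function name check_keypair is hypothetical, and the check validates only the shape of the values, not their cryptographic soundness.

```python
import base64


def check_keypair(public_b64, private_b64):
    """Check the base64 values for the sizes used by Go's crypto/ed25519."""
    public = base64.b64decode(public_b64, validate=True)
    private = base64.b64decode(private_b64, validate=True)
    if len(public) != 32:
        raise ValueError("public key must decode to 32 bytes")
    if len(private) != 64:
        raise ValueError("private key must decode to 64 bytes")
    # Go stores the private key as seed (32 bytes) followed by the public key.
    if private[32:] != public:
        raise ValueError("private key does not embed the given public key")
    return True
```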

7.1.8.5 - grepCCLog.pl

Analyze ClusterCockpit Log Files for Running Jobs

The grepCCLog.pl script analyzes ClusterCockpit log files to identify jobs that were started but not yet archived on a specific day. This is useful for troubleshooting and monitoring job lifecycle.

Purpose

Parses ClusterCockpit log files to:

Identify jobs that started on a specific day
Detect jobs that have not been archived
Generate statistics per user
Report jobs that may be stuck or still running

Usage

./grepCCLog.pl <logfile> <day>

Arguments

<logfile>

Function: Path to ClusterCockpit log file

Example: /var/log/clustercockpit/cc-backend.log

<day>

Function: Day of month to analyze (numeric)

Example: 15 (for October 15th)

Output

The script produces:

List of Non-Archived Jobs: Details for each job that started but hasn’t been archived
Per-User Summary: Count of non-archived jobs per user
Total Statistics: Overall count of started vs. non-archived jobs

Example Output

======
jobID:  12345 User:  alice
======
======
jobID:  12346 User:  bob
======
alice => 1
bob => 1
Not stopped: 2 of 10

Log Format Requirements

The script expects log entries in the following format:

Job Start Entry

Oct 15 ... new job (id: 123): cluster=woody, jobId=12345, user=alice, ...

Job Archive Entry

Oct 15 ... archiving job... (dbid: 123): cluster=woody, jobId=12345, user=alice, ...

Limitations

Hard-coded for cluster name woody
Hard-coded for month Oct
Requires specific log message format
Day must match exactly

Customization

To adapt for your environment, modify the script:

# Line 19: Change cluster name
if ( $cluster eq 'your-cluster-name' && $day eq $Tday  ) {

# Line 35: Change cluster name for archive matching
if ( $cluster eq 'your-cluster-name' ) {

# Lines 12 & 28: Update month pattern
if ( /Oct ([0-9]+) .../ ) {
# Change 'Oct' to your desired month

Use Cases

Debugging: Identify jobs that failed to archive properly
Monitoring: Track running jobs for a specific day
Troubleshooting: Find stuck jobs in the system
Auditing: Verify job lifecycle completion

Example Workflow

# Analyze today's jobs (e.g., October 15)
./grepCCLog.pl /var/log/cc-backend.log 15

# Find jobs started on the 20th
./grepCCLog.pl /var/log/cc-backend.log 20

# Check specific log file
./grepCCLog.pl /path/to/old-logs/cc-backend-2024-10.log 15

Technical Details

The script:

Opens specified log file
Parses log entries with regex patterns
Tracks started jobs in hash table
Tracks archived jobs in separate hash table
Compares to find jobs without archive entry
Aggregates statistics per user
Outputs results

Jobs are matched by database ID between start and archive entries (the id: field in start entries, the dbid: field in archive entries).
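
The matching logic can be sketched in Python as shown below. This is an illustration based on the log format documented on this page, not a port of the Perl script; the regular expressions and the function name find_unarchived are assumptions, and cluster and month are parameters here instead of hard-coded values.

```python
import re
from collections import Counter

START_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d+)\s.*new job \(id: (?P<dbid>\d+)\): "
    r"cluster=(?P<cluster>[^,]+), jobId=(?P<jobid>\d+), user=(?P<user>[^,\s]+)"
)
ARCHIVE_RE = re.compile(
    r"archiving job\.\.\. \(dbid: (?P<dbid>\d+)\): cluster=(?P<cluster>[^,]+)"
)


def find_unarchived(lines, day, cluster="woody", month="Oct"):
    """Return (open jobs by database ID, count per user, number of started jobs)."""
    started = {}      # database ID -> (jobId, user)
    archived = set()  # database IDs that have an archive entry
    for line in lines:
        match = START_RE.search(line)
        if match:
            if (match["month"] == month and int(match["day"]) == int(day)
                    and match["cluster"] == cluster):
                started[match["dbid"]] = (match["jobid"], match["user"])
            continue
        match = ARCHIVE_RE.search(line)
        if match and match["cluster"] == cluster:
            archived.add(match["dbid"])
    open_jobs = {k: v for k, v in started.items() if k not in archived}
    per_user = Counter(user for _jobid, user in open_jobs.values())
    return open_jobs, per_user, len(started)
```

The three return values correspond to the list of non-archived jobs, the per-user summary, and the total used in the "Not stopped: X of Y" line.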

7.1.8.6 - Metric Generator Script

Overview

The Metric Generator is a bash script designed to simulate high-frequency metric data for the alex and fritz clusters. It is primarily used for testing the connection to cc-metric-store and for pushing dummy data into it. The target can either be a separately hosted cc-metric-store (external mode) or the cc-metric-store integrated into cc-backend (internal mode).

The script supports two transport mechanisms:

REST API (via curl)
NATS Messaging (via nats-cli)

It also supports two deployment scopes to handle different URL structures and authentication methods:

Internal (cc-metric-store integrated into cc-backend)
External (separate, self-hosted cc-metric-store)

Configuration

The script behavior is controlled by variables defined at the top of the file.

Main Operation Flags

Variable	Options	Description
`TRANSPORT_MODE`	`"REST"` / `"NATS"`	REST: Sends HTTP POST requests. NATS: Publishes to a NATS subject.
`CONNECTION_SCOPE`	`"INTERNAL"` / `"EXTERNAL"`	INTERNAL: Use the cc-metric-store integrated into cc-backend. EXTERNAL: Use a separate, self-hosted cc-metric-store.
`API_USER`	String (e.g., `"demo"`)	The username used to generate the JWT when in INTERNAL mode.

Network Settings

Variable	Description	Required Mode
`SERVICE_ADDRESS`	Base URL of the API (e.g., `http://localhost:8080`).	REST
`NATS_SERVER`	NATS connection string (e.g., `nats://0.0.0.0:4222`).	NATS
`NATS_SUBJECT`	The subject topic to publish messages to (e.g., `hpc-nats`).	NATS
`JWT_STATIC`	A hardcoded Bearer token used for authentication.	EXTERNAL

Logic & Behavior

Connection Scopes (REST Mode)

The script automatically adjusts the target URL and Authentication method based on the CONNECTION_SCOPE.

Feature	Scope: `INTERNAL`	Scope: `EXTERNAL`
Target URL	`{SERVICE_ADDRESS}/metricstore/api/write`	`{SERVICE_ADDRESS}/api/write`
Authentication	Dynamic: Executes `./cc-backend -jwt "$API_USER"`	Static: Uses `JWT_STATIC` variable
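
The scope-dependent behavior in the table can be expressed as a small lookup. The following Python sketch only mirrors the table; the function name rest_target and the returned dictionary layout are hypothetical, and the token command is returned as an argument list without being executed.

```python
def rest_target(service_address, scope, api_user="demo", jwt_static=""):
    """Return target URL and token source for a CONNECTION_SCOPE value."""
    base = service_address.rstrip("/")
    if scope == "INTERNAL":
        return {
            "url": base + "/metricstore/api/write",
            # Dynamic token: the script runs this command to obtain a JWT.
            "token_command": ["./cc-backend", "-jwt", api_user],
            "static_token": None,
        }
    if scope == "EXTERNAL":
        return {
            "url": base + "/api/write",
            "token_command": None,
            # Static token: taken from the JWT_STATIC variable.
            "static_token": jwt_static,
        }
    raise ValueError("scope must be INTERNAL or EXTERNAL")
```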

Transport Modes

REST: The script writes a batch of metrics to a temporary file and uses curl to POST the file binary to the configured URL.
NATS: The script writes a batch of metrics to a temporary file and pipes (|) the content directly to the nats pub command.

Data Specifications

The script generates text in InfluxDB line protocol format. It iterates through the differing hardware hierarchies of the two clusters, Alex and Fritz.

1. Metric Dimensions (Tags)

Every data point includes the following tags:

cluster: alex or fritz
hostname: A random host from the predefined host lists.
type: The hardware level (see below).
type-id: The specific index or ID of the hardware component.

2. Hierarchy Levels

Hierarchy Type	ID Format	Count	Notes
`hwthread`	Integer	0..127 (Alex) / 0..71 (Fritz)	Highest volume metric
`accelerator`	PCI Address	8 per node	Alex Only
`memoryDomain`	Integer	0..7	Alex Only
`socket`	Integer	0..1	All Clusters
`node`	N/A	1 per host	All Clusters

3. Metric Fields

Standard Metrics (hwthread, socket, accelerator, memoryDomain):

cpu_load, cpu_user, flops_any, cpu_irq, cpu_system, ipc, cpu_idle, cpu_iowait, core_power, clock

Node Metrics (node):

cpu_irq, cpu_load, mem_cached, net_bytes_in, cpu_user, cpu_idle, nfs4_read, mem_used, nfs4_write, nfs4_total, ib_xmit, ib_xmit_pkts, net_bytes_out, cpu_iowait, ib_recv, cpu_system, ib_recv_pkts
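
To illustrate the data volume these specifications imply, the following Python sketch builds one batch of line protocol text for a single host. It is not the bash script itself. The accelerator PCI addresses, the value range, and the function names are hypothetical; the tags, hierarchy counts, and metric names are taken from the tables above.

```python
import random
import time

STANDARD_METRICS = [
    "cpu_load", "cpu_user", "flops_any", "cpu_irq", "cpu_system",
    "ipc", "cpu_idle", "cpu_iowait", "core_power", "clock",
]
NODE_METRICS = [
    "cpu_irq", "cpu_load", "mem_cached", "net_bytes_in", "cpu_user",
    "cpu_idle", "nfs4_read", "mem_used", "nfs4_write", "nfs4_total",
    "ib_xmit", "ib_xmit_pkts", "net_bytes_out", "cpu_iowait", "ib_recv",
    "cpu_system", "ib_recv_pkts",
]
# Hierarchy type -> number of components per host, per cluster.
HIERARCHY = {
    "alex": {"hwthread": 128, "accelerator": 8, "memoryDomain": 8, "socket": 2},
    "fritz": {"hwthread": 72, "socket": 2},
}


def type_id(level, index):
    """Return the type-id tag value; accelerator addresses are made up."""
    if level == "accelerator":
        return "00000000:%02x:00.0" % index  # hypothetical PCI address
    return str(index)


def generate_batch(cluster, hostname, timestamp=None, rng=random):
    """Return all line protocol lines for one host and one timestamp."""
    ts = int(time.time()) if timestamp is None else timestamp
    lines = []
    for level, count in HIERARCHY[cluster].items():
        for index in range(count):
            for metric in STANDARD_METRICS:
                value = round(rng.uniform(0, 100), 2)
                lines.append(
                    f"{metric},cluster={cluster},hostname={hostname},"
                    f"type={level},type-id={type_id(level, index)} "
                    f"value={value} {ts}"
                )
    for metric in NODE_METRICS:
        value = round(rng.uniform(0, 100), 2)
        lines.append(
            f"{metric},cluster={cluster},hostname={hostname},type=node "
            f"value={value} {ts}"
        )
    return lines
```

With these counts, one Alex host produces 10 × (128 + 8 + 8 + 2) + 17 = 1477 lines per batch and one Fritz host produces 10 × (72 + 2) + 17 = 757 lines.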

Usage Examples

1. Run for Internal CCMS

Set the variables inside the script:

TRANSPORT_MODE="REST"
CONNECTION_SCOPE="INTERNAL"

Effect: Generates a new token using cc-backend and posts to /metricstore/api/write.

2. Run for External CCMS

Set the variables inside the script:

TRANSPORT_MODE="REST"
CONNECTION_SCOPE="EXTERNAL"

Effect: Uses the static JWT and posts to /api/write.

3. Run as NATS Publisher

Set the variables inside the script:

TRANSPORT_MODE="NATS"

Effect: Publishes the data to the NATS server on the subject hpc-nats.

7.2 - cc-metric-store

ClusterCockpit Metric Store References

Reference information regarding the ClusterCockpit component “cc-metric-store” (GitHub Repo).

7.2.1 - Command Line

ClusterCockpit Metric Store Command Line Options

This page describes the command line options for the cc-metric-store executable.

  -config <path>

Function: Specifies alternative path to application configuration file.

Default: ./config.json

Example: -config ./configfiles/configuration.json

  -dev

Function: Enables the Swagger UI REST API documentation and playground at /swagger/.

  -gops

Function: Go server listens via github.com/google/gops/agent (for debugging).

  -loglevel <level>

Function: Sets the logging level.

Options: debug, info, warn (default), err, crit

Example: -loglevel debug

  -logdate

Function: Add date and time to log messages.

  -version

Function: Shows version information and exits.

Running

./cc-metric-store                              # Uses ./config.json
./cc-metric-store -config /path/to/config.json # Custom config path
./cc-metric-store -dev                         # Enable Swagger UI at /swagger/
./cc-metric-store -loglevel debug              # Verbose logging

Example Configuration

See Configuration Reference for detailed descriptions of all options.

{
  "main": {
    "addr": "localhost:8080",
    "jwt-public-key": "kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0="
  },
  "metrics": {
    "clock": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_idle": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_iowait": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_irq": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_system": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_user": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "acc_utilization": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "acc_mem_used": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "acc_power": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "flops_any": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "flops_dp": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "flops_sp": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "ib_recv": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "ib_xmit": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "cpu_power": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "mem_power": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "ipc": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_load": {
      "frequency": 60,
      "aggregation": null
    },
    "mem_bw": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "mem_used": {
      "frequency": 60,
      "aggregation": null
    }
  },
  "metric-store": {
    "checkpoints": {
      "interval": "12h",
      "directory": "./var/checkpoints"
    },
    "memory-cap": 100,
    "retention-in-memory": "48h",
    "cleanup": {
      "mode": "archive",
      "interval": "48h",
      "directory": "./var/archive"
    }
  }
}

7.2.2 - Configuration

ClusterCockpit Metric Store Configuration Option References

Configuration options are located in a JSON file. The default path is config.json in the current working directory. An alternative path to the configuration file can be specified using the command line switch -config <filename>.

All durations are specified as strings that are parsed by Go's time.ParseDuration (allowed suffixes: s, m, h, …).

The configuration is organized into four main sections: main, metrics, nats, and metric-store.

Main Section

main: Server configuration (required)
- addr: Address to bind to, for example localhost:8080 or 0.0.0.0:443 (required)
- https-cert-file: Filepath to SSL certificate. If https-key-file is also set, HTTPS is used (optional)
- https-key-file: Filepath to SSL key file. If https-cert-file is also set, HTTPS is used (optional)
- user: Drop root permissions to this user once the port has been bound. Only applicable if using a privileged port (optional)
- group: Drop root permissions to this group once the port has been bound. Only applicable if using a privileged port (optional)
- backend-url: URL of cc-backend for querying job information, e.g., https://localhost:8080 (optional)
- jwt-public-key: Base64 encoded Ed25519 public key, use this to verify requests to the HTTP API (required)
- debug: Debug options (optional)
  - dump-to-file: Path to file for dumping internal state (optional)
  - gops: Enable gops agent for debugging (optional)

Metrics Section

metrics: Map of metric-name to objects with the following properties (required)
- frequency: Timestep/Interval/Resolution of this metric in seconds (required)
- aggregation: Can be "sum", "avg" or null (required)
  - null means aggregation across topology levels is disabled for this metric (use for node-scope-only metrics)
  - "sum" means that values from the child levels are summed up for the parent level
  - "avg" means that values from the child levels are averaged for the parent level
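
As a worked illustration of the three aggregation settings, the Python sketch below combines the values of child levels (for example two sockets) into one value for the parent level (the node). It only demonstrates the semantics described above and is not the store's implementation; the function name aggregate is hypothetical.

```python
def aggregate(child_values, mode):
    """Combine child-level values into a parent-level value.

    mode is "sum", "avg" or None, matching the aggregation option.
    """
    if mode is None:
        return None  # aggregation across topology levels is disabled
    if not child_values:
        return None  # nothing to aggregate
    if mode == "sum":
        return sum(child_values)
    if mode == "avg":
        return sum(child_values) / len(child_values)
    raise ValueError("aggregation must be 'sum', 'avg' or None")
```

For two sockets reporting 10.0 and 14.0, "sum" gives 24.0 for the node (suitable for flops_any or mem_bw), while "avg" gives 12.0 (suitable for clock or ipc).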

NATS Section

nats: NATS server connection configuration (optional)
- address: URL of NATS.io server, example: nats://localhost:4222 (required if nats section present)
- username: NATS username for authentication (optional)
- password: NATS password for authentication (optional)

Metric-Store Section

metric-store: Storage engine configuration (required)
- checkpoints: Checkpoint configuration (required)
  - interval: Create checkpoints every X seconds/minutes/hours (required)
  - directory: Path to checkpoint directory (required)
- retention-in-memory: Keep all values in memory for at least that amount of time. Should be long enough to cover common job durations (required)
- memory-cap: Maximum percentage of system memory to use (optional)
- cleanup: Cleanup/archiving configuration (required)
  - mode: Either "archive" (move and compress old checkpoints) or "delete" (remove old checkpoints) (required)
  - interval: Perform cleanup every X seconds/minutes/hours (required)
  - directory: Path to archive directory (required if mode is "archive")
- nats-subscriptions: Array of NATS subscription configurations (optional, requires nats section)
  - subscribe-to: NATS subject to subscribe to (required)
  - cluster-tag: Default cluster tag for incoming metrics (required)
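
The required and optional markers in this reference can be checked before starting the service. The following Python sketch is a hypothetical pre-flight check derived only from the option list above; it is not part of cc-metric-store, and the function name validate_config is an assumption.

```python
def validate_config(cfg):
    """Return a list of problems found in a parsed cc-metric-store config."""
    errors = []

    main = cfg.get("main")
    if not isinstance(main, dict):
        errors.append("main: section is required")
    else:
        for key in ("addr", "jwt-public-key"):
            if not main.get(key):
                errors.append(f"main.{key}: required")

    metrics = cfg.get("metrics")
    if not isinstance(metrics, dict):
        errors.append("metrics: section is required")
    else:
        for name, spec in metrics.items():
            if not isinstance(spec, dict):
                errors.append(f"metrics.{name}: must be an object")
                continue
            freq = spec.get("frequency")
            if not isinstance(freq, int) or freq <= 0:
                errors.append(f"metrics.{name}.frequency: positive integer required")
            if "aggregation" not in spec or spec["aggregation"] not in ("sum", "avg", None):
                errors.append(f"metrics.{name}.aggregation: must be sum, avg or null")

    nats = cfg.get("nats")
    if nats is not None and not (isinstance(nats, dict) and nats.get("address")):
        errors.append("nats.address: required if nats section is present")

    store = cfg.get("metric-store")
    if not isinstance(store, dict):
        errors.append("metric-store: section is required")
        return errors
    checkpoints = store.get("checkpoints") or {}
    for key in ("interval", "directory"):
        if not checkpoints.get(key):
            errors.append(f"metric-store.checkpoints.{key}: required")
    if not store.get("retention-in-memory"):
        errors.append("metric-store.retention-in-memory: required")
    cleanup = store.get("cleanup") or {}
    mode = cleanup.get("mode")
    if mode not in ("archive", "delete"):
        errors.append("metric-store.cleanup.mode: must be archive or delete")
    if not cleanup.get("interval"):
        errors.append("metric-store.cleanup.interval: required")
    if mode == "archive" and not cleanup.get("directory"):
        errors.append("metric-store.cleanup.directory: required in archive mode")
    if store.get("nats-subscriptions") and nats is None:
        errors.append("metric-store.nats-subscriptions: requires nats section")
    return errors
```

The function takes the dictionary produced by json.load and returns an empty list when none of the listed rules is violated.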

7.2.3 - Metric Store REST API

ClusterCockpit Metric Store RESTful API Endpoint description

Authentication

JWT tokens

cc-metric-store supports only JWT tokens using the EdDSA/Ed25519 signing method. The token is provided using the Authorization Bearer header.

Example script to test the endpoint:

# Only use a JWT token if JWT authentication has been set up
JWT="eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"

curl -X 'GET' 'http://localhost:8080/api/query/' -H "Authorization: Bearer $JWT" \
  -d '{ "cluster": "alex", "from": 1720879275, "to": 1720964715, "queries": [{"metric": "cpu_load","host": "a0124"}] }'
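
The same query can be prepared from Python with the standard library. The sketch below only builds the request object, mirroring the curl call above; the function name build_query_request is hypothetical, and setting a JSON Content-Type header is an assumption that the curl example does not make. Sending the request with urllib.request.urlopen requires a running cc-metric-store.

```python
import json
import urllib.request


def build_query_request(base_url, jwt, cluster, host, metric, start, end,
                        method="GET"):
    """Build (but do not send) a query request for /api/query/."""
    payload = {
        "cluster": cluster,
        "from": start,
        "to": end,
        "queries": [{"metric": metric, "host": host}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/query/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {jwt}",
            "Content-Type": "application/json",  # assumption, see lead-in
        },
        method=method,
    )
```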

NATS

As an alternative to the REST API, cc-metric-store can receive metrics via NATS messaging. See the NATS configuration for setup details.

Usage of Swagger UI

The Swagger UI is available as part of cc-metric-store if you start it with the -dev option:

./cc-metric-store -dev

You may access it at http://localhost:8080/swagger/ (adjust port to match your main.addr configuration).

API Endpoints

The following REST endpoints are available:

Endpoint	Method	Description
`/api/query/`	GET/POST	Query metrics with selectors
`/api/write/`	POST	Write metrics (InfluxDB line protocol)
`/api/free/`	POST	Free buffers up to timestamp
`/api/debug/`	GET	Dump internal state (debugging)
`/api/healthcheck/`	GET	Node health status

Payload format for write endpoint

The data comes in InfluxDB line protocol format.

<metric>,cluster=<cluster>,hostname=<hostname>,type=<node/hwthread/etc> value=<value> <epoch_time_in_ns_or_s>

Real example:

proc_run,cluster=fritz,hostname=f2163,type=node value=4i 1725620476214474893

A more detailed description of the ClusterCockpit-flavored InfluxDB line protocol and its types can be found in the CC specification.
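
To show how such a line breaks down into metric name, tags, value, and timestamp, here is a deliberately simplified Python parser. It assumes exactly one field per line and no escaped spaces or commas, so it covers the example above but not the full line protocol; the function name parse_line is hypothetical.

```python
def parse_line(line):
    """Split one simple line protocol entry into its components."""
    head, field, timestamp = line.strip().split(" ")
    metric, *tag_parts = head.split(",")
    tags = dict(part.split("=", 1) for part in tag_parts)
    key, raw = field.split("=", 1)
    # A trailing "i" marks an integer value, otherwise treat it as float.
    value = int(raw[:-1]) if raw.endswith("i") else float(raw)
    return {
        "metric": metric,
        "tags": tags,
        "field": key,
        "value": value,
        "timestamp": int(timestamp),
    }


example = "proc_run,cluster=fritz,hostname=f2163,type=node value=4i 1725620476214474893"
parsed = parse_line(example)
```

For the example line, parsed["tags"] holds cluster, hostname, and type, and parsed["value"] is the integer 4.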

Example script to test endpoint:

# Only use a JWT token if JWT authentication has been set up
JWT="eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"

curl -X 'POST' 'http://localhost:8080/api/write/' -H "Authorization: Bearer $JWT" \
  -d "proc_run,cluster=fritz,hostname=f2163,type=node value=4i 1725620476214474893"

Testing with the Metric Generator

For comprehensive testing of the write endpoint, a Metric Generator Script is available. This script simulates high-frequency metric data and supports both REST and NATS transport modes, as well as internal (integrated into cc-backend) and external (standalone) cc-metric-store deployments.

Swagger API Reference

Non-Interactive Documentation

This reference is rendered using the swagger-ui plugin based on the original definition file found in the ClusterCockpit repository, but without a serving backend.

This means that interactive features (“Try It Out”) will not return actual data. However, a curl call and the compiled request URL are still displayed when an API endpoint is executed.

7.3 - cc-metric-collector

ClusterCockpit Metric Collector References

Reference information regarding the ClusterCockpit component “cc-metric-collector” (GitHub Repo).

Overview

cc-metric-collector is a node agent for measuring, processing and forwarding node level metrics. It is part of the ClusterCockpit ecosystem.

The metric collector sends (and receives) metrics in the InfluxDB line protocol because it is flexible and cleanly separates tags (like index columns in relational databases) from fields (like data columns).

Key Features

Modular Architecture: Flexible plugin-based system with collectors, sinks, receivers, and router
Multiple Data Sources: Collect metrics from various sources (procfs, sysfs, hardware libraries, custom commands)
Flexible Output: Send metrics to multiple sinks simultaneously (InfluxDB, Prometheus, NATS, etc.)
On-the-fly Processing: Router can tag, filter, aggregate, and transform metrics before forwarding
Network Receiver: Accept metrics from other collectors to create hierarchical setups
Low Overhead: Efficient serial collection with single timestamp per interval

Architecture

There is a single timer loop that triggers all collectors serially, collects the data and sends the metrics to the configured sinks. This ensures all data is submitted with a single timestamp. The sinks currently use mostly blocking APIs.

The receiver runs as a goroutine side-by-side with the timer loop and asynchronously forwards received metrics to the sink.

flowchart LR
  subgraph col ["Collectors"]
  direction TB
  cpustat["cpustat"]
  memstat["memstat"]
  tempstat["tempstat"]
  misc["..."]
  end
  
  subgraph Receivers ["Receivers"]
  direction TB
  nats["NATS"]
  httprecv["HTTP"]
  miscrecv["..."]
  end

  subgraph calc["Aggregator"]
  direction LR
  cache["Cache"]
  agg["Calculator"]
  end

  subgraph sinks ["Sinks"]
  direction RL
  influx["InfluxDB"]
  ganglia["Ganglia"]
  logger["Logfile"]
  miscsink["..."]
  end

  cpustat --> CollectorManager["CollectorManager"]
  memstat --> CollectorManager
  tempstat --> CollectorManager
  misc --> CollectorManager

  nats  --> ReceiverManager["ReceiverManager"]
  httprecv --> ReceiverManager
  miscrecv --> ReceiverManager

  CollectorManager --> newrouter["Router"]
  ReceiverManager -.-> newrouter
  calc -.-> newrouter
  newrouter --> SinkManager["SinkManager"]
  newrouter -.-> calc

  SinkManager --> influx
  SinkManager --> ganglia
  SinkManager --> logger
  SinkManager --> miscsink

Components

Collectors: Read data from local system sources (files, commands, libraries) and send to router
Router: Process metrics by caching, filtering, tagging, renaming, and aggregating
Sinks: Send metrics to storage backends (InfluxDB, Prometheus, NATS, etc.)
Receivers: Accept metrics from other collectors via network (HTTP, NATS) and forward to router

The key difference between collectors and receivers is that collectors are called periodically while receivers run continuously and submit metrics at any time.
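
The serial timer loop described in the Architecture section can be illustrated with a single interval. The Python sketch below is a conceptual model only, not the collector's Go code: collectors are plain callables returning metric values, the router step is reduced to attaching base tags, and all names are hypothetical.

```python
import time


def run_interval(collectors, sinks, base_tags, now=None):
    """Run one measurement interval and return the forwarded batch."""
    # One timestamp is taken per interval and shared by all metrics.
    timestamp = int(time.time()) if now is None else now
    batch = []
    # Collectors are triggered one after another, never in parallel.
    for _name, read in collectors.items():
        for metric, value in read().items():
            batch.append({
                "name": metric,
                "value": value,
                "tags": dict(base_tags),  # router step: attach base tags
                "timestamp": timestamp,
            })
    # Every configured sink receives the same batch.
    for sink in sinks:
        sink(batch)
    return batch
```

A receiver would sit outside this function and push metrics into the sinks at any time, which is the difference noted above.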

Supported Metrics

Supported metrics are documented in the cc-specifications.

Deployment Scenarios

The metric collector was designed with flexibility in mind, so it can be used in many scenarios:

Direct to Database

flowchart TD
  subgraph a ["Cluster A"]
  nodeA[NodeA with CC collector]
  nodeB[NodeB with CC collector]
  nodeC[NodeC with CC collector]
  end
  a --> db[(Database)]
  db <--> ccweb("Webfrontend")

Hierarchical Collection

flowchart TD
  subgraph a [ClusterA]
  direction LR
  nodeA[NodeA with CC collector]
  nodeB[NodeB with CC collector]
  nodeC[NodeC with CC collector]
  end
  subgraph b [ClusterB]
  direction LR
  nodeD[NodeD with CC collector]
  nodeE[NodeE with CC collector]
  nodeF[NodeF with CC collector]
  end
  a --> ccrecv{"CC collector as receiver"}
  b --> ccrecv
  ccrecv --> db[("Database1")]
  ccrecv -.-> db2[("Database2")]
  db <-.-> ccweb("Webfrontend")

7.3.1 - Configuration

cc-metric-collector Configuration Reference

Configuration Overview

The configuration of cc-metric-collector consists of five configuration files: one global file and four component-related files.

Alternatively, the complete configuration can be provided as a single JSON document that can be distributed over the network and persisted as a file (see Alternative Configuration Format).

Global Configuration File

The global file contains paths to the other four component files and some global options.

Default location: /etc/cc-metric-collector/config.json (can be overridden with -config flag)

Example

{
  "sinks-file": "/etc/cc-metric-collector/sinks.json",
  "collectors-file": "/etc/cc-metric-collector/collectors.json",
  "receivers-file": "/etc/cc-metric-collector/receivers.json",
  "router-file": "/etc/cc-metric-collector/router.json",
  "main": {
    "interval": "10s",
    "duration": "1s"
  }
}

Note: Relative paths are resolved against the execution folder of the cc-metric-collector binary, so absolute paths are recommended.

Configuration Reference

Config Key	Type	Default	Description
`sinks-file`	string	-	Path to sinks configuration file (relative or absolute)
`collectors-file`	string	-	Path to collectors configuration file (relative or absolute)
`receivers-file`	string	-	Path to receivers configuration file (relative or absolute)
`router-file`	string	-	Path to router configuration file (relative or absolute)
`main.interval`	string	`10s`	How often metrics should be read and sent to sinks. Parsed using `time.ParseDuration()`
`main.duration`	string	`1s`	How long one measurement should take. Important for collectors like `likwid` that measure over time.

Alternative Configuration Format

Instead of separate files, you can embed component configurations directly:

{
  "sinks": {
    "mysink": {
      "type": "influxasync",
      "host": "localhost",
      "port": "8086"
    }
  },
  "collectors": {
    "cpustat": {}
  },
  "receivers": {},
  "router": {
    "interval_timestamp": false
  },
  "main": {
    "interval": "10s",
    "duration": "1s"
  }
}
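
The relationship between the two formats can be sketched as a small loader: for each component, use the embedded object if present, otherwise read the referenced file. The Python sketch below is an illustration only. The precedence of embedded sections over file references is an assumption, and the function name load_component_configs is hypothetical.

```python
import json

COMPONENTS = ("sinks", "collectors", "receivers", "router")


def load_component_configs(global_cfg):
    """Return the four component configurations as dictionaries."""
    parts = {}
    for name in COMPONENTS:
        file_key = name + "-file"
        if name in global_cfg:
            # Embedded form: the component config is part of the document.
            parts[name] = global_cfg[name]
        elif file_key in global_cfg:
            # File form: the global file points to a separate JSON file.
            with open(global_cfg[file_key], encoding="utf-8") as handle:
                parts[name] = json.load(handle)
        else:
            raise KeyError(f"neither '{name}' nor '{file_key}' is configured")
    return parts
```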

Component Configuration Files

Collectors Configuration

The collectors configuration file specifies which metrics should be queried from the system. See Collectors for available collectors and their configuration options.

Format: An object keyed by collector name, where each value holds the options for that collector (not a list).

File: collectors.json

Example:

{
  "cpustat": {},
  "memstat": {},
  "diskstat": {
    "exclude_metrics": [
      "disk_total"
    ]
  },
  "likwid": {
    "access_mode": "direct",
    "liblikwid_path": "/usr/local/lib/liblikwid.so",
    "eventsets": [
      {
        "events": {
          "cpu": ["FLOPS_DP"]
        }
      }
    ]
  }
}

Common Options (available for most collectors):

Option	Type	Description
`exclude_metrics`	[]string	List of metric names to exclude from forwarding to sinks
`send_meta`	bool	Send metadata information along with metrics (default varies)

See: Collectors Documentation for collector-specific configuration options.

Note: Some collectors dynamically load shared libraries. Ensure the library path is part of the LD_LIBRARY_PATH environment variable.

Sinks Configuration

The sinks configuration file defines where metrics should be sent. Multiple sinks of the same or different types can be configured.

Format: Object with named sink configurations

File: sinks.json

Example:

{
  "local_influx": {
    "type": "influxasync",
    "host": "localhost",
    "port": "8086",
    "organization": "myorg",
    "database": "metrics",
    "password": "mytoken"
  },
  "central_prometheus": {
    "type": "prometheus",
    "host": "0.0.0.0",
    "port": "9091"
  },
  "debug_log": {
    "type": "stdout"
  }
}

Common Sink Types:

Type	Description
`influxasync`	InfluxDB v2 asynchronous writer
`influxdb`	InfluxDB v2 synchronous writer
`prometheus`	Prometheus Pushgateway
`nats`	NATS messaging system
`stdout`	Standard output (for debugging)
`libganglia`	Ganglia monitoring system
`http`	Generic HTTP endpoint

See: cc-lib Sinks Documentation for sink-specific configuration options.

Note: Some sinks dynamically load shared libraries. Ensure the library path is part of the LD_LIBRARY_PATH environment variable.

Router Configuration

The router sits between collectors/receivers and sinks, enabling metric processing such as tagging, filtering, renaming, and aggregation.

File: router.json

Simple Example:

{
  "add_tags": [
    {
      "key": "cluster",
      "value": "mycluster",
      "if": "*"
    }
  ],
  "interval_timestamp": false,
  "num_cache_intervals": 0
}

Advanced Example:

{
  "num_cache_intervals": 1,
  "interval_timestamp": true,
  "hostname_tag": "hostname",
  "max_forward": 50,
  "process_messages": {
    "manipulate_messages": [
      {
        "add_base_tags": {
          "cluster": "mycluster"
        }
      }
    ]
  }
}

Configuration Reference:

Option	Type	Default	Description
`interval_timestamp`	bool	`false`	Use common timestamp (interval start) for all metrics in an interval
`num_cache_intervals`	int	`0`	Number of past intervals to cache (0 disables cache, required for interval aggregates)
`hostname_tag`	string	`"hostname"`	Tag name for hostname (added to locally created metrics)
`max_forward`	int	`50`	Max metrics to read from a channel at once (must be > 1)
`process_messages`	object	-	Message processor configuration (see below)

See: Router Documentation for detailed configuration options and Message Processor for advanced processing.

Receivers Configuration

Receivers enable cc-metric-collector to accept metrics from other collectors via network protocols. For most standalone setups, this file can contain only an empty JSON map ({}).

File: receivers.json

Example:

{
  "nats_rack0": {
    "type": "nats",
    "address": "nats-server.example.org",
    "port": "4222",
    "subject": "rack0"
  },
  "http_receiver": {
    "type": "http",
    "address": "0.0.0.0",
    "port": "8080",
    "path": "/api/write"
  }
}

Common Receiver Types:

Type	Description
`nats`	NATS subscriber
`http`	HTTP server endpoint for metric ingestion

See: cc-lib Receivers Documentation for receiver-specific configuration options.

Configuration Examples

Complete example configurations can be found in the example-configs directory of the repository.

Configuration Validation

To validate your configuration before running the collector:

# Test configuration loading
cc-metric-collector -config /path/to/config.json -once

The -once flag runs all collectors once and then exits, which is useful for testing.

7.3.2 - Installation

Building and installing cc-metric-collector

Building from Source

Prerequisites

Go 1.16 or higher
Git
Make
Standard build tools (gcc, etc.)

Basic Build

In most cases, a simple make in the main folder is enough to get a cc-metric-collector binary:

git clone https://github.com/ClusterCockpit/cc-metric-collector.git
cd cc-metric-collector
make

The build process automatically:

Downloads dependencies via go get
Checks for LIKWID library (for LIKWID collector)
Downloads and builds LIKWID as a static library if not found
Copies required header files for cgo bindings

Build Output

After a successful build, you’ll have:

cc-metric-collector binary in the project root
LIKWID library and headers (if LIKWID collector was built)

System Integration

Configuration Files

Create a directory for configuration files:

sudo mkdir -p /etc/cc-metric-collector
sudo cp example-configs/*.json /etc/cc-metric-collector/

Edit the configuration files according to your needs. See Configuration for details.

User and Group Setup

It’s recommended to run cc-metric-collector as a dedicated user:

sudo useradd -r -s /bin/false cc-metric-collector
sudo mkdir -p /var/log/cc-metric-collector
sudo chown cc-metric-collector:cc-metric-collector /var/log/cc-metric-collector

Pre-configuration

The main configuration settings for system integration are pre-defined in scripts/cc-metric-collector.config. This file contains:

UNIX user and group for execution
PID file location
Other system settings

Adjust and install it:

# Edit the configuration
editor scripts/cc-metric-collector.config

# Install to system location
sudo install --mode 644 \
             --owner root \
             --group root \
             scripts/cc-metric-collector.config /etc/default/cc-metric-collector

Systemd Integration

If you are using systemd as your init system:

# Install the systemd service file
sudo install --mode 644 \
             --owner root \
             --group root \
             scripts/cc-metric-collector.service /etc/systemd/system/cc-metric-collector.service

# Reload systemd daemon
sudo systemctl daemon-reload

# Enable the service to start on boot
sudo systemctl enable cc-metric-collector

# Start the service
sudo systemctl start cc-metric-collector

# Check status
sudo systemctl status cc-metric-collector

SysVinit Integration

If you are using an init system based on /etc/init.d daemons:

# Install the init script
sudo install --mode 755 \
             --owner root \
             --group root \
             scripts/cc-metric-collector.init /etc/init.d/cc-metric-collector

# Enable the service
sudo update-rc.d cc-metric-collector defaults

# Start the service
sudo /etc/init.d/cc-metric-collector start

The init script reads basic configuration from /etc/default/cc-metric-collector.

Package Installation

RPM Packages

To build RPM packages:

make RPM

Requirements:

RPM tools (rpm and rpmspec)
Git

The command uses the RPM SPEC file scripts/cc-metric-collector.spec and creates packages in the project directory.

Install the generated RPM:

sudo rpm -ivh cc-metric-collector-*.rpm

DEB Packages

To build Debian packages:

make DEB

Requirements:

dpkg-deb
awk, sed
Git

The command uses the DEB control file scripts/cc-metric-collector.control and creates a binary deb package.

Install the generated DEB:

sudo dpkg -i cc-metric-collector_*.deb

Note: DEB package creation is experimental and not as well tested as RPM packages.

Customizing Packages

To customize RPM or DEB packages for your local system:

Fork the cc-metric-collector repository
Enable GitHub Actions in your fork
Make changes to scripts, code, etc.
Commit and push your changes
Tag the commit: git tag v0.x.y-myversion
Push tags: git push --tags
Wait for the Release action to complete
Download RPMs/DEBs from the Releases page of your fork

Library Dependencies

LIKWID Collector

The LIKWID collector requires the LIKWID library. There is currently no Golang interface to LIKWID, so cgo is used to create bindings.

The build process handles LIKWID automatically:

Checks if LIKWID is installed system-wide
If not found, downloads and builds LIKWID with direct access mode
Copies necessary header files

To use a pre-installed LIKWID:

export LD_LIBRARY_PATH=/path/to/likwid/lib:$LD_LIBRARY_PATH

Other Dynamic Libraries

Some collectors and sinks dynamically load shared libraries:

Component	Library	Purpose
LIKWID collector	liblikwid.so	Hardware performance data
NVIDIA collector	libnvidia-ml.so	NVIDIA GPU metrics
ROCm collector	librocm_smi64.so	AMD GPU metrics
Ganglia sink	libganglia.so	Ganglia metric submission

Ensure required libraries are in your LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
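
To see which of the libraries from the table can be found in a given search path, a small check like the one below may be useful. It is a sketch under stated assumptions: the helper names find_library and report_missing are hypothetical, and only the directories passed in are searched, whereas the dynamic loader also consults its cache and the system default directories.

```python
import os

# Library names taken from the table above.
REQUIRED_LIBS = {
    "LIKWID collector": "liblikwid.so",
    "NVIDIA collector": "libnvidia-ml.so",
    "ROCm collector": "librocm_smi64.so",
    "Ganglia sink": "libganglia.so",
}


def find_library(libname, search_path):
    """Return the first matching file in a colon-separated search path, or None."""
    for directory in search_path.split(":"):
        if not directory or not os.path.isdir(directory):
            continue
        for entry in sorted(os.listdir(directory)):
            # Also accept versioned names such as liblikwid.so.5.
            if entry == libname or entry.startswith(libname + "."):
                return os.path.join(directory, entry)
    return None


def report_missing(search_path, libs=REQUIRED_LIBS):
    """Return the components whose library was not found in the search path."""
    return [comp for comp, lib in libs.items() if find_library(lib, search_path) is None]
```

A typical call would be report_missing(os.environ.get("LD_LIBRARY_PATH", "")). A component only matters here if the corresponding collector or sink is actually enabled.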

Permissions

Hardware Access

Some collectors require special permissions:

Collector	Requirement	Solution
LIKWID (direct)	Direct hardware access	Run as root or use `capabilities`
IPMI	Access to IPMI devices	User must be in `ipmi` group
Temperature	Access to `/sys/class/hwmon`	Usually readable by all users
GPU collectors	Access to GPU management libraries	User must have GPU access rights

Setting Capabilities (Alternative to Root)

For LIKWID direct access without running as root:

sudo setcap cap_sys_rawio=ep /path/to/cc-metric-collector

Warning: Direct hardware access can be dangerous if misconfigured. Use with caution.

Verification

After installation, verify the collector is working:

# Test configuration
cc-metric-collector -config /etc/cc-metric-collector/config.json -once

# Check logs
journalctl -u cc-metric-collector -f

# Or for SysV
tail -f /var/log/cc-metric-collector/collector.log

Troubleshooting

Common Issues

Issue: cannot find liblikwid.so

Solution: Set LD_LIBRARY_PATH in your shell, or configure it in the systemd service file

Issue: permission denied accessing hardware

Solution: Run as root, use capabilities, or adjust file permissions

Issue: Configuration file not found

Solution: Use the -config flag or place config.json in the current working directory

Issue: Metrics not appearing in sink

Solution: Check sink configuration, network connectivity, and router settings

Debug Mode

Run in foreground with debug output:

cc-metric-collector -config /path/to/config.json -log stderr

Run collectors only once for testing:

cc-metric-collector -config /path/to/config.json -once

7.3.3 - Usage

Running and using cc-metric-collector

Command Line Interface

Basic Usage

cc-metric-collector [options]

Command Line Options

Flag	Type	Default	Description
`-config`	string	`./config.json`	Path to configuration file
`-log`	string	`stderr`	Path for logfile (use `stderr` for console)
`-once`	bool	`false`	Run all collectors only once then exit

Examples

Run with default configuration:

cc-metric-collector

Run with custom configuration:

cc-metric-collector -config /etc/cc-metric-collector/config.json

Log to file:

cc-metric-collector -config /etc/cc-metric-collector/config.json \
                    -log /var/log/cc-metric-collector/collector.log

Test configuration (run once):

cc-metric-collector -config /etc/cc-metric-collector/config.json -once

This runs all collectors exactly once and exits. Useful for:

Testing configuration
Debugging collector issues
Validating metric output
One-time metric collection

Running as a Service

Systemd

Start service:

sudo systemctl start cc-metric-collector

Stop service:

sudo systemctl stop cc-metric-collector

Restart service:

sudo systemctl restart cc-metric-collector

Check status:

sudo systemctl status cc-metric-collector

View logs:

journalctl -u cc-metric-collector -f

Enable on boot:

sudo systemctl enable cc-metric-collector

SysVinit

Start service:

sudo /etc/init.d/cc-metric-collector start

Stop service:

sudo /etc/init.d/cc-metric-collector stop

Restart service:

sudo /etc/init.d/cc-metric-collector restart

Check status:

sudo /etc/init.d/cc-metric-collector status

Operation Modes

Daemon Mode (Default)

In daemon mode, cc-metric-collector runs continuously with a timer loop that:

Triggers all enabled collectors serially
Collects metrics, using a single timestamp per interval if interval_timestamp is enabled in the router
Forwards metrics through the router
Sends processed metrics to all configured sinks
Sleeps until the next interval

Interval timing is controlled by the main.interval configuration parameter.
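
The timer loop described above can be illustrated with a short simulation. This is only a sketch of the control flow, not the collector's actual implementation: run_intervals and its callable arguments are hypothetical stand-ins for collectors, the router, and sinks, and the sleep and clock parameters exist so the loop can be exercised without waiting.

```python
import time


def run_intervals(collectors, router, sinks, interval_s, rounds,
                  sleep=time.sleep, clock=time.monotonic):
    """Run `rounds` collection intervals using plain callables as stand-ins."""
    for _ in range(rounds):
        started = clock()
        metrics = []
        # 1. Trigger all enabled collectors one after another (serially).
        for read in collectors:
            metrics.extend(read())
        # 2. Forward the metrics through the router for processing.
        processed = router(metrics)
        # 3. Send the processed metrics to every configured sink.
        for write in sinks:
            write(processed)
        # 4. Sleep for the remainder of the interval.
        elapsed = clock() - started
        sleep(max(0.0, interval_s - elapsed))
```

Because the collectors run serially, one slow collector delays all others in the same interval, which is one reason to disable collectors that are not needed.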

One-Shot Mode

Activated with the -once flag, this mode:

Initializes all collectors
Runs each collector exactly once
Processes and forwards metrics
Exits

Useful for:

Configuration testing
Debugging
Cron-based metric collection
Integration with other monitoring tools

Metric Collection Flow

sequenceDiagram
    participant Timer
    participant Collectors
    participant Router
    participant Sinks
    
    Timer->>Collectors: Trigger (every interval)
    Collectors->>Collectors: Read metrics from system
    Collectors->>Router: Forward metrics
    Router->>Router: Process (tag, filter, aggregate)
    Router->>Sinks: Send processed metrics
    Sinks->>Sinks: Write to backends
    Timer->>Timer: Sleep until next interval

Common Usage Patterns

Basic Monitoring Setup

Collect basic system metrics and send to InfluxDB:

config.json:

{
  "collectors-file": "./collectors.json",
  "sinks-file": "./sinks.json",
  "receivers-file": "./receivers.json",
  "router-file": "./router.json",
  "main": {
    "interval": "10s",
    "duration": "1s"
  }
}

collectors.json:

{
  "cpustat": {},
  "memstat": {},
  "diskstat": {},
  "netstat": {},
  "loadavg": {}
}

sinks.json:

{
  "influx": {
    "type": "influxasync",
    "host": "influx.example.org",
    "port": "8086",
    "organization": "myorg",
    "database": "metrics",
    "password": "mytoken"
  }
}

router.json:

{
  "add_tags": [
    {
      "key": "cluster",
      "value": "production",
      "if": "*"
    }
  ],
  "interval_timestamp": true
}

receivers.json:

{}

HPC Node Monitoring

Extended monitoring for HPC compute nodes:

collectors.json:

{
  "cpustat": {},
  "memstat": {},
  "diskstat": {},
  "netstat": {},
  "loadavg": {},
  "tempstat": {},
  "likwid": {
    "access_mode": "direct",
    "liblikwid_path": "/usr/local/lib/liblikwid.so",
    "eventsets": [
      {
        "events": {
          "cpu": ["FLOPS_DP", "CLOCK"]
        }
      }
    ]
  },
  "nvidia": {},
  "ibstat": {}
}

Hierarchical Collection

Compute nodes send their metrics to an aggregation node:

Node config - sinks.json:

{
  "nats_aggregator": {
    "type": "nats",
    "host": "aggregator.example.org",
    "port": "4222",
    "subject": "cluster.rack1"
  }
}

Aggregation node config - receivers.json:

{
  "nats_rack1": {
    "type": "nats",
    "address": "localhost",
    "port": "4222",
    "subject": "cluster.rack1"
  },
  "nats_rack2": {
    "type": "nats",
    "address": "localhost",
    "port": "4222",
    "subject": "cluster.rack2"
  }
}

Aggregation node config - sinks.json:

{
  "influx": {
    "type": "influxasync",
    "host": "influx.example.org",
    "port": "8086",
    "organization": "myorg",
    "database": "metrics",
    "password": "mytoken"
  }
}

Multi-Sink Configuration

Send metrics to multiple destinations:

sinks.json:

{
  "primary_influx": {
    "type": "influxasync",
    "host": "influx1.example.org",
    "port": "8086",
    "organization": "myorg",
    "database": "metrics",
    "password": "token1"
  },
  "backup_influx": {
    "type": "influxasync",
    "host": "influx2.example.org",
    "port": "8086",
    "organization": "myorg",
    "database": "metrics",
    "password": "token2"
  },
  "prometheus": {
    "type": "prometheus",
    "host": "0.0.0.0",
    "port": "9091"
  }
}

Monitoring and Debugging

Check Collector Status

Use -once mode to test without running continuously:

cc-metric-collector -config /etc/cc-metric-collector/config.json -once

Debug Output

Log to stderr for immediate feedback:

cc-metric-collector -config /etc/cc-metric-collector/config.json -log stderr

Verify Metrics

Check what metrics are being collected:

Configure stdout sink temporarily
Run in -once mode
Observe metric output

Temporary debug sink:

{
  "debug": {
    "type": "stdout"
  }
}

Common Issues

No metrics appearing:

Check collector configuration
Verify collectors have required permissions
Ensure sinks are reachable
Check that the router isn’t filtering out the metrics

High CPU usage:

Increase main.interval value
Disable expensive collectors
Check for router performance issues

Memory growth:

Reduce num_cache_intervals in router
Check for sink write failures
Verify metric cardinality isn’t excessive

Performance Tuning

Interval Adjustment

Faster updates (more overhead):

{
  "main": {
    "interval": "5s",
    "duration": "1s"
  }
}

Slower updates (less overhead):

{
  "main": {
    "interval": "60s",
    "duration": "1s"
  }
}

Collector Selection

Only enable collectors you need:

{
  "cpustat": {},
  "memstat": {}
}

Metric Filtering

Use router to exclude unwanted metrics:

{
  "process_messages": {
    "manipulate_messages": [
      {
        "drop_by_name": ["cpu_idle", "cpu_iowait"]
      }
    ]
  }
}

Security Considerations

Running as Non-Root

Most collectors work without root privileges, except:

LIKWID (direct mode)
IPMI collector
Some hardware-specific collectors

Use capabilities instead of root when possible.

Network Security

When using receivers:

Use authentication (NATS credentials, HTTP tokens)
Restrict listening addresses
Use TLS for encrypted transport
Firewall receiver ports appropriately

File Permissions

Protect configuration files containing credentials:

sudo chmod 600 /etc/cc-metric-collector/config.json
sudo chown cc-metric-collector:cc-metric-collector /etc/cc-metric-collector/config.json
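
In the examples in this documentation, credentials such as sink passwords appear in sinks.json rather than in config.json, so it can be worth checking every JSON file in the configuration directory. The following sketch reports files that group or other users can access. The helper names is_private and exposed_files are hypothetical.

```python
import os
import stat


def is_private(path):
    """True if neither group nor others have any permission bits on the file."""
    mode = os.stat(path).st_mode
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def exposed_files(directory):
    """List JSON files in a directory that group or others can access."""
    return sorted(
        name for name in os.listdir(directory)
        if name.endswith(".json") and not is_private(os.path.join(directory, name))
    )
```

For example, exposed_files("/etc/cc-metric-collector") would list the files that still need a chmod 600.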

7.3.4 - Metric Router

Routing and processing metrics in cc-metric-collector

Overview

The metric router sits between collectors/receivers and sinks, enabling metric processing such as:

Adding and removing tags
Filtering and dropping metrics
Renaming metrics
Aggregating metrics across an interval
Normalizing units
Setting common timestamps

Basic Configuration

File: router.json

Minimal configuration:

{
  "interval_timestamp": false,
  "num_cache_intervals": 0
}

Typical configuration:

{
  "add_tags": [
    {
      "key": "cluster",
      "value": "mycluster",
      "if": "*"
    }
  ],
  "interval_timestamp": true,
  "num_cache_intervals": 0
}

Configuration Options

Core Settings

Option	Type	Default	Description
`interval_timestamp`	bool	`false`	Use common timestamp (interval start) for all metrics in an interval
`num_cache_intervals`	int	`0`	Number of past intervals to cache (0 disables cache, required for interval aggregates)
`hostname_tag`	string	`"hostname"`	Tag name for hostname (added to locally created metrics)
`max_forward`	int	`50`	Max metrics to read from a channel at once (must be > 1)

The `interval_timestamp` Option

Collectors’ Read() functions are not called simultaneously, so metrics within an interval can have different timestamps.

When true: All metrics in an interval get a common timestamp (the interval start time).

When false: Each metric keeps its original collection timestamp.

Use case: Enable this to simplify time-series alignment in your database.

The `num_cache_intervals` Option

Controls metric caching for interval aggregations.

Value	Behavior
`0`	Cache disabled (no aggregations possible)
`1`	Cache last interval only (minimal memory, basic aggregations)
`2+`	Cache multiple intervals (for complex time-based aggregations)

Note: Required to be > 0 for interval_aggregates to work.

The `hostname_tag` Option

By default, the router tags locally created metrics with the hostname.

Default tag name: hostname

Custom tag name:

{
  "hostname_tag": "node"
}

The `max_forward` Option

Performance tuning for metric processing.

How it works: When the router receives a metric, it tries to read up to max_forward additional metrics from the same channel before processing.

Default: 50

Must be: Greater than 1
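
The batching behavior described for max_forward can be sketched with a standard queue. This is an illustration only: drain_batch is a hypothetical helper, a Python queue stands in for the router's input channel, and the exact batch boundary (one blocking read plus up to max_forward further reads) follows the wording above and is an assumption about the real implementation.

```python
import queue


def drain_batch(channel, max_forward=50):
    """Block for one metric, then take up to max_forward more without blocking."""
    if max_forward <= 1:
        raise ValueError("max_forward must be greater than 1")
    batch = [channel.get()]
    for _ in range(max_forward):
        try:
            batch.append(channel.get_nowait())
        except queue.Empty:
            # The channel is empty, so process what has been read so far.
            break
    return batch
```

Reading several metrics per wake-up reduces the number of times the router has to switch between its input channels, at the cost of the other channels waiting slightly longer.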

Metric Processing

Modern Configuration (Recommended)

Use the process_messages section with the message processor:

{
  "process_messages": {
    "manipulate_messages": [
      {
        "add_base_tags": {
          "cluster": "mycluster",
          "partition": "compute"
        }
      },
      {
        "drop_by_name": ["cpu_idle", "mem_cached"]
      },
      {
        "rename_by": {
          "clock_mhz": "clock"
        }
      }
    ]
  }
}

Legacy Configuration (Deprecated)

The following options are deprecated but still supported for backward compatibility. They are automatically converted to process_messages format.

Adding Tags

Deprecated syntax:

{
  "add_tags": [
    {
      "key": "cluster",
      "value": "mycluster",
      "if": "*"
    },
    {
      "key": "type",
      "value": "socket",
      "if": "name == 'temp_package_id_0'"
    }
  ]
}

Modern equivalent:

{
  "process_messages": {
    "manipulate_messages": [
      {
        "add_base_tags": {
          "cluster": "mycluster"
        }
      },
      {
        "add_tags_by": {
          "type": "socket"
        },
        "if": "name == 'temp_package_id_0'"
      }
    ]
  }
}

Deleting Tags

Deprecated syntax:

{
  "delete_tags": [
    {
      "key": "unit",
      "if": "*"
    }
  ]
}

Never delete these tags: hostname, type, type-id

Dropping Metrics

By name (deprecated):

{
  "drop_metrics": [
    "not_interesting_metric",
    "debug_metric"
  ]
}

By condition (deprecated):

{
  "drop_metrics_if": [
    "match('temp_core_%d+', name)",
    "match('cpu', type) && type-id == 0"
  ]
}

Modern equivalent:

{
  "process_messages": {
    "manipulate_messages": [
      {
        "drop_by_name": ["not_interesting_metric", "debug_metric"]
      },
      {
        "drop_by": "match('temp_core_%d+', name)"
      }
    ]
  }
}

Renaming Metrics

Deprecated syntax:

{
  "rename_metrics": {
    "old_name": "new_name",
    "clock_mhz": "clock"
  }
}

Modern equivalent:

{
  "process_messages": {
    "manipulate_messages": [
      {
        "rename_by": {
          "old_name": "new_name",
          "clock_mhz": "clock"
        }
      }
    ]
  }
}

Use case: Standardize metric names across different systems or collectors.
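
The deprecated-to-modern mappings shown above for tags, drops, and renames can be summarized in a small conversion sketch. It is not the converter used by the router: convert_legacy is a hypothetical helper, it covers only the options for which a modern equivalent is shown in this section, and the order of the generated steps is an assumption.

```python
def convert_legacy(router_cfg):
    """Translate deprecated router options into a process_messages section."""
    steps = []
    base_tags = {}
    for tag in router_cfg.get("add_tags", []):
        if tag.get("if", "*") == "*":
            # Unconditional tags become base tags.
            base_tags[tag["key"]] = tag["value"]
        else:
            # Conditional tags keep their condition.
            steps.append({"add_tags_by": {tag["key"]: tag["value"]}, "if": tag["if"]})
    if base_tags:
        steps.insert(0, {"add_base_tags": base_tags})
    if router_cfg.get("drop_metrics"):
        steps.append({"drop_by_name": list(router_cfg["drop_metrics"])})
    for condition in router_cfg.get("drop_metrics_if", []):
        steps.append({"drop_by": condition})
    if router_cfg.get("rename_metrics"):
        steps.append({"rename_by": dict(router_cfg["rename_metrics"])})
    return {"process_messages": {"manipulate_messages": steps}}
```

Options such as delete_tags, normalize_units, and change_unit_prefix are deliberately left out because no modern equivalent is given for them here.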

Normalizing Units

Deprecated syntax:

{
  "normalize_units": true
}

Effect: Normalizes unit names (e.g., byte, Byte, B, bytes → consistent format)

Changing Unit Prefixes

Deprecated syntax:

{
  "change_unit_prefix": {
    "mem_used": "G",
    "mem_total": "G"
  }
}

Use case: Convert memory metrics from kB (as reported by /proc/meminfo) to GB for better readability.

Interval Aggregates (Experimental)

Requires: num_cache_intervals > 0

Derive new metrics by aggregating metrics from the current interval.

Configuration

{
  "num_cache_intervals": 1,
  "interval_aggregates": [
    {
      "name": "temp_cores_avg",
      "if": "match('temp_core_%d+', metric.Name())",
      "function": "avg(values)",
      "tags": {
        "type": "node"
      },
      "meta": {
        "group": "IPMI",
        "unit": "degC",
        "source": "TempCollector"
      }
    }
  ]
}

Parameters

Field	Type	Description
`name`	string	Name of the new derived metric
`if`	string	Condition to select which metrics to aggregate
`function`	string	Aggregation function (e.g., `avg(values)`, `sum(values)`, `max(values)`)
`tags`	object	Tags to add to the derived metric
`meta`	object	Metadata for the derived metric (use `"<copy>"` to copy from source metrics)

Available Functions

Function	Description
`avg(values)`	Average of all matching metrics
`sum(values)`	Sum of all matching metrics
`min(values)`	Minimum value
`max(values)`	Maximum value
`count(values)`	Number of matching metrics

Complex Example

Calculate mem_used from multiple memory metrics:

{
  "interval_aggregates": [
    {
      "name": "mem_used",
      "if": "source == 'MemstatCollector'",
      "function": "sum(mem_total) - (sum(mem_free) + sum(mem_buffers) + sum(mem_cached))",
      "tags": {
        "type": "node"
      },
      "meta": {
        "group": "<copy>",
        "unit": "<copy>",
        "source": "<copy>"
      }
    }
  ]
}

Dropping Source Metrics

If you only want the aggregated metric, drop the source metrics:

{
  "drop_metrics_if": [
    "match('temp_core_%d+', metric.Name())"
  ],
  "interval_aggregates": [
    {
      "name": "temp_cores_avg",
      "if": "match('temp_core_%d+', metric.Name())",
      "function": "avg(values)",
      "tags": {
        "type": "node"
      },
      "meta": {
        "group": "IPMI",
        "unit": "degC"
      }
    }
  ]
}

Processing Order

The router processes metrics in a specific order:

Add hostname_tag (if sent by collectors or cache)
Change timestamp to interval timestamp (if interval_timestamp == true)
Check if metric should be dropped (drop_metrics, drop_metrics_if)
Add tags (add_tags)
Delete tags (delete_tags)
Rename metric (rename_metrics) and store old name in meta as oldname
Add tags again (to support conditions using new name)
Delete tags again (to support conditions using new name)
Normalize units (if normalize_units == true)
Convert unit prefix (change_unit_prefix)
Send to sinks
Move to cache (if num_cache_intervals > 0)

Legend:

Operations apply to metrics from collectors (c)
Operations apply to metrics from receivers (r)
Operations apply to both (c,r)
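
The first steps of the processing order can be traced with a simplified model. This is a sketch under stated assumptions and not the router's code: route is a hypothetical function, metrics are plain dictionaries, only unconditional ("if": "*") tag rules and name-based drops are modelled, and the second tag pass, unit handling, sinks, and cache are omitted. The option names are the legacy ones from the configuration examples in this section.

```python
def route(metric, cfg, hostname, interval_start):
    """Apply the documented steps to one metric dict; returns None if dropped."""
    m = {
        "name": metric["name"],
        "value": metric["value"],
        "time": metric["time"],
        "tags": dict(metric.get("tags", {})),
        "meta": dict(metric.get("meta", {})),
    }
    # Add the hostname tag under the configured tag name.
    m["tags"].setdefault(cfg.get("hostname_tag", "hostname"), hostname)
    # Replace the timestamp with the interval start if requested.
    if cfg.get("interval_timestamp", False):
        m["time"] = interval_start
    # Drop check by name (conditions from drop_metrics_if are not modelled).
    if m["name"] in cfg.get("drop_metrics", []):
        return None
    # Add and delete tags (only unconditional rules are modelled).
    for tag in cfg.get("add_tags", []):
        if tag.get("if", "*") == "*":
            m["tags"][tag["key"]] = tag["value"]
    for tag in cfg.get("delete_tags", []):
        if tag.get("if", "*") == "*":
            m["tags"].pop(tag["key"], None)
    # Rename and remember the previous name in the metadata.
    new_name = cfg.get("rename_metrics", {}).get(m["name"])
    if new_name is not None:
        m["meta"]["oldname"] = m["name"]
        m["name"] = new_name
    return m
```

The order matters: because the drop check runs before renaming, drop rules have to refer to the original metric name, not the renamed one.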

Complete Example

{
  "interval_timestamp": true,
  "num_cache_intervals": 1,
  "hostname_tag": "hostname",
  "max_forward": 50,
  "process_messages": {
    "manipulate_messages": [
      {
        "add_base_tags": {
          "cluster": "production",
          "datacenter": "dc1"
        }
      },
      {
        "drop_by_name": ["cpu_idle", "cpu_guest", "cpu_guest_nice"]
      },
      {
        "rename_by": {
          "clock_mhz": "clock"
        }
      },
      {
        "add_tags_by": {
          "high_temp": "true"
        },
        "if": "name == 'temp_package_id_0' && value > 70"
      }
    ]
  },
  "interval_aggregates": [
    {
      "name": "temp_avg",
      "if": "match('temp_core_%d+', name)",
      "function": "avg(values)",
      "tags": {
        "type": "node"
      },
      "meta": {
        "group": "Temperature",
        "unit": "degC",
        "source": "TempCollector"
      }
    }
  ]
}

Performance Considerations

Caching: Only enable if you need interval aggregates (memory overhead)
Complex conditions: Evaluated for every metric (CPU overhead)
Aggregations: Evaluated at the start of each interval (CPU overhead)
max_forward: Higher values can improve throughput but increase latency

7.3.5 - Collectors

Available metric collectors for cc-metric-collector

Overview

Collectors read data from various sources on the local system, parse it into metrics, and submit these metrics to the router. Each collector is a modular plugin that can be enabled or disabled independently.

Configuration Format

File: collectors.json

The collectors configuration is a set of objects (not a list), where each key is the collector type:

{
  "collector_type": {
    "collector_specific_option": "value"
  }
}

Common Configuration Options

Most collectors support these common options:

Option	Type	Default	Description
`exclude_metrics`	[]string	`[]`	List of metric names to exclude from forwarding to sinks
`send_meta`	bool	varies	Send metadata information along with metrics

Example:

{
  "cpustat": {
    "exclude_metrics": ["cpu_idle", "cpu_guest"]
  },
  "memstat": {}
}

Available Collectors

System Metrics

Collector	Description	Source
`cpustat`	CPU usage statistics	`/proc/stat`
`memstat`	Memory usage statistics	`/proc/meminfo`
`loadavg`	System load average	`/proc/loadavg`
`netstat`	Network interface statistics	`/proc/net/dev`
`diskstat`	Disk I/O statistics	`/sys/block/*/stat`
`iostat`	Block device I/O statistics	`/proc/diskstats`

Hardware Monitoring

Collector	Description	Requirements
`tempstat`	Temperature sensors	`/sys/class/hwmon`
`cpufreq`	CPU frequency	`/sys/devices/system`
`cpufreq_cpuinfo`	CPU frequency from cpuinfo	`/proc/cpuinfo`
`ipmistat`	IPMI sensor data	`ipmitool` command

Performance Monitoring

Collector	Description	Requirements
`likwid`	Hardware performance counters via LIKWID	liblikwid.so
`rapl`	CPU energy consumption (RAPL)	`/sys/class/powercap`
`schedstat`	CPU scheduler statistics	`/proc/schedstat`
`numastats`	NUMA node statistics	`/sys/devices/system/node`

GPU Monitoring

Collector	Description	Requirements
`nvidia`	NVIDIA GPU metrics	libnvidia-ml.so (NVML)
`rocm_smi`	AMD ROCm GPU metrics	librocm_smi64.so

Network & Storage

Collector	Description	Requirements
`ibstat`	InfiniBand statistics	`/sys/class/infiniband`
`lustrestat`	Lustre filesystem statistics	Lustre client
`gpfs`	GPFS filesystem statistics	GPFS utilities
`beegfs_meta`	BeeGFS metadata statistics	BeeGFS metadata client
`beegfs_storage`	BeeGFS storage statistics	BeeGFS storage client
`nfs3stat`	NFS v3 statistics	`/proc/net/rpc/nfs`
`nfs4stat`	NFS v4 statistics	`/proc/net/rpc/nfs`
`nfsiostat`	NFS I/O statistics	`nfsiostat` command

Process & Job Monitoring

Collector	Description	Requirements
`topprocs`	Top processes by resource usage	`/proc` filesystem
`slurm_cgroup`	Slurm cgroup statistics	Slurm cgroups
`self`	Collector’s own resource usage	`/proc/self`

Custom Collectors

Collector	Description	Requirements
`customcmd`	Execute custom commands to collect metrics	Any command/script

Collector Lifecycle

Each collector implements these functions:

Init(config): Initializes the collector with configuration
Initialized(): Returns whether initialization was successful
Read(duration, output): Reads metrics and sends to output channel
Close(): Cleanup and shutdown

Example Configurations

Minimal System Monitoring

{
  "cpustat": {},
  "memstat": {},
  "loadavg": {}
}

HPC Node Monitoring

{
  "cpustat": {},
  "memstat": {},
  "diskstat": {},
  "netstat": {},
  "loadavg": {},
  "tempstat": {},
  "likwid": {
    "access_mode": "direct",
    "liblikwid_path": "/usr/local/lib/liblikwid.so",
    "eventsets": [
      {
        "events": {
          "cpu": ["FLOPS_DP", "CLOCK"]
        }
      }
    ]
  },
  "nvidia": {},
  "ibstat": {}
}

Filesystem-Heavy Workload

{
  "cpustat": {},
  "memstat": {},
  "diskstat": {},
  "lustrestat": {},
  "nfs4stat": {},
  "iostat": {}
}

Minimal Overhead

{
  "cpustat": {
    "exclude_metrics": ["cpu_guest", "cpu_guest_nice", "cpu_steal"]
  },
  "memstat": {
    "exclude_metrics": ["mem_slab", "mem_sreclaimable"]
  }
}

Collector Development

Creating a Custom Collector

Collectors implement the MetricCollector interface. See collectors README for details.

Basic structure:

// SampleCollector embeds the shared metricCollector base type.
type SampleCollector struct {
    metricCollector
    config SampleCollectorConfig
}

// Method signatures to implement (bodies omitted here).
func (m *SampleCollector) Init(config json.RawMessage) error
func (m *SampleCollector) Read(interval time.Duration, output chan lp.CCMetric)
func (m *SampleCollector) Close()

Registration

Add your collector to collectorManager.go:

var AvailableCollectors = map[string]MetricCollector{
    "sample": &SampleCollector{},
}

Metric Format

All collectors submit metrics in InfluxDB line protocol format via the CCMetric type.

Metric components:

Name: Metric identifier (e.g., cpu_used)
Tags: Index-like key-value pairs (e.g., type=node, hostname=node01)
Fields: Data values (typically just value)
Metadata: Source, group, unit information
Timestamp: When the metric was collected
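
To make the components concrete, the following sketch renders a metric as a single InfluxDB line protocol line. It is an illustration rather than the CCMetric implementation: to_line_protocol is a hypothetical helper, metadata is omitted because it is not part of the line itself, and tags are sorted by key for a stable output.

```python
def _escape(text, chars):
    """Backslash-escape the given characters, as line protocol requires."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    """Format a field value: booleans, integers (i suffix), floats, strings."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(name, tags, fields, timestamp_ns):
    """Render one metric as: name,tag=value,... field=value,... timestamp."""
    tag_part = "".join(
        f",{_escape(key, ', =')}={_escape(str(value), ', =')}"
        for key, value in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(key, ', =')}={_format_field(value)}" for key, value in fields.items()
    )
    return f"{_escape(name, ', ')}{tag_part} {field_part} {timestamp_ns}"
```

For the example components listed above, a node-level cpu_used metric from node01 would be rendered as a name, the two tags, a single value field, and a nanosecond timestamp on one line.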

Performance Considerations

Collector overhead: Each enabled collector adds CPU overhead
I/O impact: Some collectors read many files (e.g., per-core statistics)
Library overhead: GPU and hardware performance collectors can be expensive
Selective metrics: Use exclude_metrics to reduce unnecessary data

7.4 - cc-slurm-adapter

ClusterCockpit Slurm Adapter References

Reference information regarding the ClusterCockpit component “cc-slurm-adapter” (GitHub Repo).

Overview

cc-slurm-adapter is a software daemon that feeds cc-backend with job information from Slurm in real time.

Key Features

Fault Tolerant: Handles cc-backend or Slurm downtime gracefully without losing jobs
Automatic Recovery: Submits jobs to cc-backend as soon as services are available again
Realtime Updates: Supports immediate job notification via Slurm Prolog/Epilog hooks
NATS Integration: Optional job notification messaging via NATS
Minimal Dependencies: Uses Slurm commands (sacct, squeue, sacctmgr, scontrol) - no slurmrestd required

Architecture

The daemon runs on the same node as slurmctld and operates in two modes:

Daemon Mode: Periodic synchronization (default: every 60 seconds) between Slurm and cc-backend
Prolog/Epilog Mode: Immediate trigger on job start/stop events (optional, reduces latency)

Data is submitted to cc-backend via REST API. Note: Slurm’s slurmdbd is mandatory.

Notice

You can set the Slurm option MinJobAge to extend how long Slurm keeps job information in memory after a job completes.

Limitations

Resource Information Availability

Because slurmdbd does not store all job information, some details may be unavailable in certain cases:

Resource allocation information is obtained via scontrol --cluster <cluster> show job <jobid> --json
This information becomes unavailable a few minutes after job completion
If the daemon is stopped for too long, jobs may lack resource information
Critical Impact: Without resource information, cc-backend cannot associate jobs with metrics (CPU, GPU, memory)
Jobs will still be listed in cc-backend but metric visualization will not work

Slurm Version Compatibility

Supported Versions

These Slurm versions are known to work:

24.xx.x
25.xx.x

Compatibility Notes

All Slurm-related code is concentrated in slurm.go for easier maintenance. The most common compatibility issue is a nil pointer dereference caused by missing JSON fields.

Debugging Incompatibilities

If you encounter nil pointer dereferences:

Get a job ID via squeue or sacct

Check JSON layouts from both commands (they differ):

sacct -j 12345 --json
scontrol show job 12345 --json
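
Since the two layouts differ, comparing which keys are present in each output can help locate a missing field. The sketch below flattens both JSON documents into dotted key paths and reports the paths unique to each side. The helper names key_paths and compare_layouts are hypothetical, and no particular Slurm JSON schema is assumed.

```python
import json


def key_paths(node, prefix=""):
    """Return the set of dotted key paths present in a parsed JSON document."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(value, path)
    elif isinstance(node, list):
        for item in node:
            # List elements share one path so that arrays of jobs collapse.
            paths |= key_paths(item, prefix + "[]")
    return paths


def compare_layouts(sacct_json, scontrol_json):
    """Return (paths only in sacct output, paths only in scontrol output)."""
    a = key_paths(json.loads(sacct_json))
    b = key_paths(json.loads(scontrol_json))
    return sorted(a - b), sorted(b - a)
```

The inputs would be the saved outputs of the two commands above, read as strings. The same comparison between outputs of two Slurm versions can show which fields were added or removed.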

SlurmInt and SlurmString Types

Slurm has been transitioning API formats:

SlurmInt: Handles both plain integers and Slurm’s “infinite/set” struct format
SlurmString: Handles both plain strings and string arrays (uses first element if array, blank if empty)

These custom types maintain backward compatibility across Slurm versions.
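
The idea behind the two types can be illustrated with tolerant parsers that accept either representation. This is a Python sketch, not the Go implementation in slurm.go: the function names are hypothetical, the object field names set, infinite, and number are an assumption about Slurm's JSON output, and mapping unset values to None and infinite values to float("inf") is a choice made for this example.

```python
def parse_slurm_int(raw):
    """Accept a plain integer or an object with set/infinite/number fields."""
    if isinstance(raw, bool):
        raise TypeError("boolean is not a valid Slurm integer")
    if isinstance(raw, int):
        return raw
    if isinstance(raw, dict):
        # Assumption: infinite values map to float("inf"), unset values to None.
        if raw.get("infinite", False):
            return float("inf")
        if not raw.get("set", False):
            return None
        return int(raw["number"])
    raise TypeError(f"unsupported Slurm integer: {raw!r}")


def parse_slurm_string(raw):
    """Accept a plain string or an array (first element, blank if empty)."""
    if isinstance(raw, str):
        return raw
    if isinstance(raw, list):
        return str(raw[0]) if raw else ""
    raise TypeError(f"unsupported Slurm string: {raw!r}")
```

Accepting both shapes in one place keeps the rest of the code independent of which Slurm version produced the JSON.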

7.4.1 - Installation

Installing and building cc-slurm-adapter

Prerequisites

Go 1.24.0 or higher
Slurm with slurmdbd configured
cc-backend instance with API access
Access to the slurmctld node

Building from Source

Requirements

go 1.24.0+

Dependencies

Key dependencies (managed via go.mod):

github.com/ClusterCockpit/cc-lib - ClusterCockpit common library
github.com/nats-io/nats.go - NATS client

Compilation

make

This creates the cc-slurm-adapter binary.

Build Commands

# Build binary
make

# Format code
make format

# Clean build artifacts
make clean

7.4.2 - cc-slurm-adapter Configuration

cc-slurm-adapter configuration reference

Configuration File Location

Default: /etc/cc-slurm-adapter/config.json

Example Configuration

{
  "pidFilePath": "/run/cc-slurm-adapter/daemon.pid",
  "prepSockListenPath": "/run/cc-slurm-adapter/daemon.sock",
  "prepSockConnectPath": "/run/cc-slurm-adapter/daemon.sock",
  "lastRunPath": "/var/lib/cc-slurm-adapter/last_run",
  "slurmPollInterval": 60,
  "slurmQueryDelay": 1,
  "slurmQueryMaxSpan": 604800,
  "slurmQueryMaxRetries": 5,
  "ccPollInterval": 21600,
  "ccRestSubmitJobs": true,
  "ccRestUrl": "https://my-cc-backend-instance.example",
  "ccRestJwt": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "gpuPciAddrs": {
    "^nodehostname0[0-9]$": ["00000000:00:10.0", "00000000:00:3F.0"],
    "^nodehostname1[0-9]$": ["00000000:00:10.0", "00000000:00:3F.0"]
  },
  "ignoreHosts": "^nodehostname9\\w+$",
  "natsServer": "mynatsserver.example",
  "natsPort": 4222,
  "natsSubject": "mysubject",
  "natsUser": "myuser",
  "natsPassword": "123456789",
  "natsCredsFile": "/etc/cc-slurm-adapter/nats.creds",
  "natsNKeySeedFile": "/etc/cc-slurm-adapter/nats.nkey"
}

Configuration Reference

Required Settings

Config Key	Type	Description
`ccRestUrl`	string	URL to cc-backend’s REST API (must not contain trailing slash)
`ccRestJwt`	string	JWT token from cc-backend for REST API access
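
A quick check of the two required settings, including the trailing-slash rule, might look like the following sketch. The helper check_required is hypothetical and not part of cc-slurm-adapter.

```python
def check_required(cfg):
    """Return a list of problems with the required cc-backend settings."""
    problems = []
    for key in ("ccRestUrl", "ccRestJwt"):
        if not cfg.get(key):
            problems.append(f"{key} is missing or empty")
    url = cfg.get("ccRestUrl") or ""
    if url.endswith("/"):
        problems.append("ccRestUrl must not contain a trailing slash")
    return problems
```

The configuration would typically be loaded with json.load from the file at the default location before being passed to this function.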

Daemon Settings

Config Key	Type	Default	Description
`pidFilePath`	string	`/run/cc-slurm-adapter/daemon.pid`	Path to PID file (prevents concurrent execution)
`lastRunPath`	string	`/var/lib/cc-slurm-adapter/lastrun`	Path to file storing last successful sync timestamp (as file mtime)

Socket Settings

Config Key	Type	Default	Description
`prepSockListenPath`	string	`/run/cc-slurm-adapter/daemon.sock`	Socket for daemon to receive prolog/epilog events. Supports UNIX and TCP formats (see below)
`prepSockConnectPath`	string	`/run/cc-slurm-adapter/daemon.sock`	Socket for prolog/epilog mode to connect to daemon

Socket Formats:

UNIX: /run/cc-slurm-adapter/daemon.sock or unix:/run/cc-slurm-adapter/daemon.sock
TCP IPv4: tcp:127.0.0.1:12345 or tcp:0.0.0.0:12345
TCP IPv6: tcp:[::1]:12345, tcp:[::]:12345, tcp::12345
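
The socket formats listed above can be told apart by their prefix. The following Python sketch only illustrates how such strings decompose; it is not the adapter's own parser, and the function name parse_socket_address is hypothetical.

```python
def parse_socket_address(spec: str):
    """Split a socket spec into ("unix", path) or ("tcp", host, port).

    Illustrative only: the prefixes follow the formats listed in this
    section, the parsing rules themselves are an assumption.
    """
    if spec.startswith("tcp:"):
        rest = spec[len("tcp:"):]
        # The port is everything after the last colon.
        host, sep, port = rest.rpartition(":")
        if not sep or not port.isdigit():
            raise ValueError(f"invalid TCP socket spec: {spec!r}")
        # Strip the brackets around IPv6 literals such as [::1].
        if host.startswith("[") and host.endswith("]"):
            host = host[1:-1]
        return ("tcp", host, int(port))
    if spec.startswith("unix:"):
        return ("unix", spec[len("unix:"):])
    # A bare path is treated as a UNIX domain socket.
    return ("unix", spec)
```

An empty host, as in tcp::12345, is returned as an empty string, which conventionally means "all interfaces".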

Slurm Polling Settings

Config Key	Type	Default	Description
`slurmPollInterval`	int	60	Interval (seconds) for periodic sync to cc-backend
`slurmQueryDelay`	int	1	Wait time (seconds) after prolog/epilog event before querying Slurm
`slurmQueryMaxSpan`	int	604800	Maximum time (seconds) to query jobs from the past (prevents flooding)
`slurmQueryMaxRetries`	int	10	Maximum Slurm query attempts on Prolog/Epilog events

cc-backend Settings

Config Key	Type	Default	Description
`ccPollInterval`	int	21600	Interval (seconds) to query all jobs from cc-backend (prevents stuck jobs)
`ccRestSubmitJobs`	bool	true	Submit started/stopped jobs to cc-backend via REST (set false if using NATS-only)

Hardware Mapping

Config Key	Type	Default	Description
`gpuPciAddrs`	object	`{}`	Map of hostname regexes to GPU PCI address arrays (must match NVML/nvidia-smi order)
`ignoreHosts`	string	`""`	Regex of hostnames to ignore (jobs that run only on matching hosts are discarded)
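
To make the two hardware mapping options more concrete, the following Python sketch shows one plausible reading of them. It assumes that the first matching regex in gpuPciAddrs wins and that a job is discarded only if every one of its hosts matches ignoreHosts; both function names are hypothetical and the adapter's actual matching logic may differ.

```python
import re


def gpu_addrs_for_host(hostname, gpu_pci_addrs):
    """Return the PCI address list of the first regex matching hostname."""
    for pattern, addrs in gpu_pci_addrs.items():
        if re.search(pattern, hostname):
            return addrs
    return []  # no mapping configured for this host


def job_is_ignored(job_hosts, ignore_hosts):
    """True if ignore_hosts is set and all hosts of the job match it."""
    if not ignore_hosts or not job_hosts:
        return False
    return all(re.search(ignore_hosts, host) for host in job_hosts)
```

The index of an address in the returned list corresponds to the GPU index, which is why the order has to match NVML/nvidia-smi.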

NATS Settings

Config Key	Type	Default	Description
`natsServer`	string	`""`	NATS server hostname (leave blank to disable NATS)
`natsPort`	uint16	4222	NATS server port
`natsSubject`	string	`"jobs"`	Subject to publish job information to
`natsUser`	string	`""`	NATS username (for user auth)
`natsPassword`	string	`""`	NATS password
`natsCredsFile`	string	`""`	Path to NATS credentials file
`natsNKeySeedFile`	string	`""`	Path to NATS NKey seed file (private key)

Note: The deprecated ipcSockPath option has been removed. Use prepSockListenPath and prepSockConnectPath instead.
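
The required keys and defaults from the reference tables can be checked before the daemon is started. The following Python sketch is not part of cc-slurm-adapter: load_config, DEFAULTS and REQUIRED are hypothetical names, and rejecting ipcSockPath outright is a choice made for this sketch only.

```python
import json

# Defaults copied from the reference tables in this section.
DEFAULTS = {
    "slurmPollInterval": 60,
    "slurmQueryDelay": 1,
    "slurmQueryMaxSpan": 604800,
    "slurmQueryMaxRetries": 10,
    "ccPollInterval": 21600,
    "ccRestSubmitJobs": True,
    "natsPort": 4222,
    "natsSubject": "jobs",
}
REQUIRED = ("ccRestUrl", "ccRestJwt")


def load_config(text):
    """Parse a config JSON string, apply defaults and run basic checks."""
    cfg = {**DEFAULTS, **json.loads(text)}
    missing = [key for key in REQUIRED if not cfg.get(key)]
    if missing:
        raise ValueError(f"missing required settings: {missing}")
    if cfg["ccRestUrl"].endswith("/"):
        raise ValueError("ccRestUrl must not contain a trailing slash")
    if "ipcSockPath" in cfg:
        raise ValueError("ipcSockPath was removed; use prepSockListenPath/prepSockConnectPath")
    return cfg
```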

7.4.3 - Daemon Setup

Setting up cc-slurm-adapter as a daemon

The daemon mode is required for cc-slurm-adapter to function. This page describes how to set up the daemon using systemd.

1. Copy Binary and Configuration

Copy the binary and create a configuration file:

sudo mkdir -p /opt/cc-slurm-adapter
sudo cp cc-slurm-adapter /opt/cc-slurm-adapter/
sudo cp config.json /opt/cc-slurm-adapter/

Security: The config file contains sensitive credentials (JWT, NATS). Set appropriate permissions:

sudo chmod 600 /opt/cc-slurm-adapter/config.json

2. Create System User

sudo useradd -r -s /bin/false cc-slurm-adapter
sudo chown -R cc-slurm-adapter:slurm /opt/cc-slurm-adapter

3. Grant Slurm Permissions

The adapter user needs permission to query Slurm:

sacctmgr add user cc-slurm-adapter Account=root AdminLevel=operator

Critical: If permissions are not set and Slurm is restricted, NO JOBS WILL BE REPORTED.

4. Install systemd Service

Create /etc/systemd/system/cc-slurm-adapter.service:

[Unit]
Description=cc-slurm-adapter
Wants=network.target
After=network.target

[Service]
User=cc-slurm-adapter
Group=slurm
ExecStart=/opt/cc-slurm-adapter/cc-slurm-adapter -daemon -config /opt/cc-slurm-adapter/config.json
WorkingDirectory=/opt/cc-slurm-adapter/
RuntimeDirectory=cc-slurm-adapter
RuntimeDirectoryMode=0750
Restart=on-failure
RestartSec=15s

[Install]
WantedBy=multi-user.target

Notes:

RuntimeDirectory creates /run/cc-slurm-adapter for PID and socket files
Group=slurm allows Prolog/Epilog (running as slurm user) to access the socket
RuntimeDirectoryMode=0750 enables group access

5. Enable and Start Service

sudo systemctl daemon-reload
sudo systemctl enable cc-slurm-adapter
sudo systemctl start cc-slurm-adapter

Verification

Check that the service is running:

sudo systemctl status cc-slurm-adapter

You should see output indicating the service is active and running.

7.4.4 - Prolog/Epilog Hooks

Setting up Prolog/Epilog hooks for immediate job notification

Prolog/Epilog hook setup is optional but recommended for immediate job notification, which reduces latency compared to relying solely on periodic polling.

Prerequisites

Daemon must be running (see Daemon Setup)
Hook script must be accessible from slurmctld
Hook script must exit with code 0 to avoid rejecting job allocations

1. Create Hook Script

Create /opt/cc-slurm-adapter/hook.sh:

#!/bin/sh
/opt/cc-slurm-adapter/cc-slurm-adapter
exit 0

Make it executable:

sudo chmod +x /opt/cc-slurm-adapter/hook.sh

Important: Always exit with 0. Non-zero exit codes will reject job allocations.

2. Configure Slurm

Add to slurm.conf:

PrEpPlugins=prep/script
PrologSlurmctld=/opt/cc-slurm-adapter/hook.sh
EpilogSlurmctld=/opt/cc-slurm-adapter/hook.sh

3. Restart slurmctld

sudo systemctl restart slurmctld

Note: If using a non-default socket path, add -config /path/to/config.json to hook.sh. The config file must then be readable by the slurm user or group (for example, mode 640 with group slurm instead of 600).

Multi-Cluster Setup

For multiple slurmctld nodes, use TCP sockets instead of UNIX sockets:

{
  "prepSockListenPath": "tcp:0.0.0.0:12345",
  "prepSockConnectPath": "tcp:slurmctld-host:12345"
}

This allows Prolog/Epilog hooks on different nodes to connect to the daemon over the network.

How It Works

Job Event: Slurm triggers Prolog/Epilog hook when a job starts or stops
Socket Message: Hook sends job ID to daemon via socket
Immediate Query: Daemon queries Slurm for that specific job
Fast Submission: Job submitted to cc-backend with minimal delay

This reduces the job notification latency from up to 60 seconds (default poll interval) to just a few seconds.

7.4.5 - Usage

Command line usage and operation modes

Command Line Flags

Flag	Description
`-config <path>`	Specify the path to the config file (default: `/etc/cc-slurm-adapter/config.json`)
`-daemon`	Run in daemon mode (if omitted, runs in Prolog/Epilog mode)
`-debug <log-level>`	Set the log level (default: 2, max: 5)
`-help`	Show help for all command line flags

Operation Modes

Daemon Mode

Run the adapter as a persistent daemon that periodically synchronizes job information:

cc-slurm-adapter -daemon -config /path/to/config.json

This mode:

Runs continuously in the background
Queries Slurm at regular intervals (default: 60 seconds)
Submits job information to cc-backend
Should be managed by systemd (see Daemon Setup)

Prolog/Epilog Mode

Run the adapter from Slurm’s Prolog/Epilog hooks for immediate job notification:

cc-slurm-adapter

This mode:

Only runs when triggered by Slurm (job start/stop)
Sends job ID to the running daemon via socket
Exits immediately
Must be invoked from Slurm hook scripts (see Prolog/Epilog Hooks)

Best Practices

Production Deployment

Keep Daemon Running: Resource info expires quickly after job completion
Monitor Logs: Watch for Slurm API changes or nil pointer errors
Secure Credentials: Restrict config file permissions (600 or 640)
Use Prolog/Epilog Carefully: Always exit with 0 to avoid blocking job allocations
Test Before Production: Verify in development environment first

Performance Tuning

High Job Volume: Reduce slurmPollInterval if periodic sync causes lag
Low Latency Required: Enable Prolog/Epilog hooks
Resource Constrained: Increase ccPollInterval (reduces cc-backend queries)

Debug Logging

Enable verbose logging for troubleshooting:

cc-slurm-adapter -daemon -debug 5 -config /path/to/config.json

Log Levels:

2 (default): Errors and warnings
5 (max): Verbose debug output

For systemd services, edit the service file to add -debug 5 to the ExecStart line.

7.4.6 - Troubleshooting

Debugging and common issues

Check Service Status

Verify the daemon is running:

sudo systemctl status cc-slurm-adapter

You should see output indicating the service is active (running).

View Logs

cc-slurm-adapter logs to stderr (captured by systemd):

sudo journalctl -u cc-slurm-adapter -f

Use -f to follow logs in real-time, or omit it to view historical logs.

Enable Debug Logging

Edit the systemd service file to add -debug 5:

ExecStart=/opt/cc-slurm-adapter/cc-slurm-adapter -daemon -debug 5 -config /opt/cc-slurm-adapter/config.json

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart cc-slurm-adapter

Log Levels:

2 (default): Errors and warnings
5 (max): Verbose debug output

Common Issues

Issue	Possible Cause	Solution
No jobs reported	Missing Slurm permissions	Run `sacctmgr add user cc-slurm-adapter Account=root AdminLevel=operator`
Socket connection errors	Wrong socket path or permissions	Check `prepSockListenPath`/`prepSockConnectPath` and RuntimeDirectoryMode
Prolog/Epilog failures	Non-zero exit code in hook script	Ensure hook script exits with `exit 0`
Missing resource info	Daemon stopped too long	Keep daemon running; resource info expires minutes after job completion
Job allocation failures	Prolog/Epilog exit code ≠ 0	Check hook script and ensure cc-slurm-adapter is running

Debugging Slurm Compatibility Issues

If you encounter nil pointer dereferences or unexpected errors:

Get a job ID via squeue or sacct:
```
squeue
# or
sacct
```

Check JSON layouts from both commands (they differ):

sacct -j 12345 --json
scontrol show job 12345 --json

Compare the output with what the adapter expects in slurm.go
Report issues to the GitHub repository with:
- Slurm version
- JSON output samples
- Error messages from logs

Verifying Configuration

Check that your configuration is valid:

# Test if config file is readable
cat /opt/cc-slurm-adapter/config.json

# Verify JSON syntax
jq . /opt/cc-slurm-adapter/config.json

Testing Connectivity

Test cc-backend Connection

# Test REST API endpoint (replace with your JWT)
curl -H "Authorization: Bearer YOUR_JWT_TOKEN" \
     https://your-cc-backend-instance.example/api/jobs/
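
The same check can be scripted. The following Python sketch builds the request that the curl command above sends, using only the standard library; the function names are hypothetical, and the URL and token are placeholders.

```python
import urllib.request


def build_jobs_request(cc_rest_url, jwt):
    """Build a GET request for /api/jobs/ with a JWT bearer token."""
    if cc_rest_url.endswith("/"):
        raise ValueError("ccRestUrl must not contain a trailing slash")
    return urllib.request.Request(
        cc_rest_url + "/api/jobs/",
        headers={"Authorization": f"Bearer {jwt}"},
        method="GET",
    )


def check_backend(request, timeout=10):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling check_backend(build_jobs_request("https://your-cc-backend-instance.example", "YOUR_JWT_TOKEN")) raises urllib.error.HTTPError if the backend rejects the token.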

Test NATS Connection

If using NATS, verify connectivity:

# Using nats-cli (if installed)
nats server check -s nats://mynatsserver.example:4222

Performance Issues

If the adapter is slow or missing jobs:

Check Slurm Response Times: Run sacct and squeue manually to see if Slurm is responding slowly
Adjust Poll Intervals: Lower slurmPollInterval for more frequent checks (but higher load)
Enable Prolog/Epilog: Reduces dependency on polling for immediate job notification
Check System Resources: Ensure adequate CPU/memory on the slurmctld node

7.4.7 - Architecture

Technical architecture and internal details

Synchronization Flow

The daemon operates on a periodic synchronization cycle:

Timer Trigger: Periodic timer (default: 60s) triggers sync
Query Slurm: Fetch job data via sacct, squeue, scontrol
Submit to cc-backend: POST job start/stop via REST API
Publish to NATS: Optional notification message (if enabled)

This ensures that all jobs are eventually captured, even if Prolog/Epilog hooks fail or are not configured.

Prolog/Epilog Flow

When Prolog/Epilog hooks are enabled, immediate job notification works as follows:

Job Event: Slurm triggers Prolog/Epilog hook when a job starts or stops
Socket Message: Hook sends job ID to daemon via socket
Immediate Query: Daemon queries Slurm for that specific job
Fast Submission: Job submitted to cc-backend with minimal delay

This reduces latency from up to 60 seconds (default poll interval) to just a few seconds.

Data Sources

The adapter queries multiple Slurm commands to build complete job information:

Slurm Command	Purpose
`sacct`	Historical job accounting data
`squeue`	Current job queue information
`scontrol show job`	Resource allocation details (JSON format)
`sacctmgr`	User permissions

Important: scontrol show job provides critical resource allocation information (nodes, CPUs, GPUs) that is only available while the job is in memory. This information typically expires a few minutes after job completion, which is why keeping the daemon running continuously is essential.

State Persistence

The adapter maintains minimal state on disk:

Last Run Timestamp: Stored as file modification time in lastRunPath
- Used to determine which jobs to query on startup
- Prevents flooding cc-backend with historical jobs after restarts
PID File: Stored in pidFilePath
- Prevents concurrent daemon execution
- Automatically cleaned up on graceful shutdown
Socket: IPC between daemon and Prolog/Epilog instances
- Created at prepSockListenPath (daemon listens)
- Connected at prepSockConnectPath (Prolog/Epilog connects)
- Supports both UNIX domain sockets and TCP sockets

Fault Tolerance

The adapter is designed to be fault-tolerant:

Slurm Downtime

Retries Slurm queries with exponential backoff
Continues operation once Slurm becomes available
No job loss during Slurm restarts
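
As an illustration of retrying with exponential backoff, the following Python sketch doubles the wait time after each failed attempt. The base delay, factor and cap are hypothetical values chosen for the example; the adapter's actual timings are not documented here.

```python
import time


def backoff_delays(max_retries, base=1.0, factor=2.0, cap=60.0):
    """Delays (seconds) between attempts: base, base*factor, ... up to cap."""
    return [min(cap, base * factor ** attempt) for attempt in range(max_retries)]


def retry(query, max_retries, sleep=time.sleep):
    """Call query() until it succeeds or max_retries attempts have failed."""
    last_error = None
    for delay in backoff_delays(max_retries):
        try:
            return query()
        except Exception as err:  # sketch: treat every error as retryable
            last_error = err
            sleep(delay)
    raise RuntimeError(f"query failed after {max_retries} attempts") from last_error
```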

cc-backend Downtime

Queues jobs internally (up to slurmQueryMaxSpan seconds in the past)
Submits queued jobs once cc-backend is available
Prevents duplicate submissions

Daemon Restarts

Uses lastRunPath timestamp to catch up on missed jobs
Limited by slurmQueryMaxSpan to prevent overwhelming the system
Resource allocation data may be lost for jobs that completed while daemon was down
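
The catch-up window after a restart can be pictured as follows. This Python sketch assumes the daemon starts querying at the later of the lastRunPath modification time and "now minus slurmQueryMaxSpan"; the function name and the handling of a missing file are assumptions.

```python
import os
import time


def query_window_start(last_run_path, max_span, now=None):
    """Return the UNIX timestamp from which jobs would be queried."""
    now = time.time() if now is None else now
    earliest = now - max_span  # never look further back than max_span
    try:
        last_run = os.stat(last_run_path).st_mtime
    except FileNotFoundError:
        # Assumption: without a previous run, use the full allowed span.
        return earliest
    return max(last_run, earliest)
```

With the default slurmQueryMaxSpan of 604800 seconds, a daemon that was down for two weeks would only catch up on the last seven days.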

Multi-Cluster Considerations

For environments with multiple Slurm clusters:

Run one daemon instance per slurmctld node
Use cluster-specific configuration files
Consider TCP sockets for Prolog/Epilog if slurmctld is not on compute nodes

Performance Characteristics

Resource Usage

Memory: Minimal (< 50 MB typical)
CPU: Low (periodic bursts during synchronization)
Network: Moderate (REST API calls to cc-backend, NATS if enabled)

Scalability

Tested with clusters of 1000+ nodes
Handles thousands of jobs per day
Poll interval can be tuned based on job submission rate

Latency

Without Prolog/Epilog: Up to slurmPollInterval seconds (default: 60s)
With Prolog/Epilog: Typically < 5 seconds

7.4.8 - API Integration

Integration with cc-backend and NATS

cc-backend REST API

The adapter communicates with cc-backend using its REST API to submit job information.

Configuration

Set these required configuration options:

{
  "ccRestUrl": "https://my-cc-backend-instance.example",
  "ccRestJwt": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "ccRestSubmitJobs": true
}

ccRestUrl: URL to cc-backend’s REST API (must not contain trailing slash)
ccRestJwt: JWT token from cc-backend for REST API access
ccRestSubmitJobs: Enable/disable REST API submissions (default: true)

Endpoints Used

The adapter uses the following cc-backend API endpoints:

Endpoint	Method	Purpose
`/api/jobs/start_job/`	POST	Submit job start event
`/api/jobs/stop_job/<jobId>`	POST	Submit job completion event

Authentication

All API requests include a JWT bearer token in the Authorization header:

Authorization: Bearer <ccRestJwt>

Job Data Format

Jobs are submitted in ClusterCockpit’s job metadata format, including:

Job ID and cluster name
User and project information
Start and stop times
Resource allocation (nodes, CPUs, GPUs)
Job state and exit code

Error Handling

Connection Errors: Adapter retries with exponential backoff
Authentication Errors: Logged as errors; check JWT token validity
Validation Errors: Logged with details about invalid fields

NATS Messaging

NATS integration is optional and provides real-time job notifications to other services.

Configuration

{
  "natsServer": "mynatsserver.example",
  "natsPort": 4222,
  "natsSubject": "mysubject",
  "natsUser": "myuser",
  "natsPassword": "123456789"
}

Leave natsServer empty to disable NATS integration.

Authentication Methods

The adapter supports multiple NATS authentication methods:

1. Username/Password

{
  "natsUser": "myuser",
  "natsPassword": "mypassword"
}

See: NATS Username/Password Auth

2. Credentials File

{
  "natsCredsFile": "/etc/cc-slurm-adapter/nats.creds"
}

See: NATS Credentials File

3. NKey Authentication

{
  "natsNKeySeedFile": "/etc/cc-slurm-adapter/nats.nkey"
}

See: NATS NKey Auth

Message Format

Jobs are published as JSON messages to the configured subject:

{
  "jobId": "12345",
  "cluster": "mycluster",
  "user": "username",
  "project": "projectname",
  "startTime": 1234567890,
  "stopTime": 1234567900,
  "numNodes": 4,
  "resources": { ... }
}
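
A subscriber receives this message as raw bytes. The following Python sketch decodes such a payload and derives the job duration; it relies only on the fields shown above and assumes that a still-running job has no stopTime (or a null value), which is not confirmed by this page. The function name is hypothetical.

```python
import json


def summarize_job_message(payload: bytes):
    """Decode a job message and compute its duration if it has stopped."""
    msg = json.loads(payload)
    start = msg["startTime"]
    stop = msg.get("stopTime")  # assumption: absent/null while running
    return {
        "jobId": msg["jobId"],
        "cluster": msg["cluster"],
        "running": stop is None,
        "durationSeconds": None if stop is None else stop - start,
    }
```

Such a function could serve as the body of a message callback in any NATS client library.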

Use Cases

NATS integration is useful for:

Real-time Monitoring: Other services can subscribe to job events
Event-Driven Workflows: Trigger actions when jobs start/stop
Alternative to REST: Can disable REST submission and use NATS-only
Multi-Component Architecture: Multiple services consuming job events

Performance Considerations

NATS adds minimal latency (typically < 1ms)
Messages are fire-and-forget (no delivery guarantees by default)
Consider using NATS JetStream for persistent queues if needed

Dual Submission Mode

By default, the adapter submits jobs to both cc-backend REST API and NATS:

{
  "ccRestSubmitJobs": true,
  "natsServer": "mynatsserver.example"
}

This ensures:

cc-backend receives authoritative job data
Other services can react to job events in real-time

NATS-Only Mode

For specialized deployments, you can disable REST submission:

{
  "ccRestSubmitJobs": false,
  "natsServer": "mynatsserver.example"
}

Warning: In this mode, you must ensure another component (e.g., a NATS subscriber) is forwarding job data to cc-backend, or jobs will not appear in the UI.
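
The two modes follow directly from the configuration keys. As a rough illustration, this Python sketch derives the active submission targets from a config dictionary, using the documented defaults (ccRestSubmitJobs is true, an empty natsServer disables NATS); the function name is hypothetical.

```python
def submission_targets(cfg):
    """Return the list of enabled targets: "rest", "nats", both or none."""
    targets = []
    if cfg.get("ccRestSubmitJobs", True):
        targets.append("rest")
    if cfg.get("natsServer", ""):
        targets.append("nats")
    return targets
```

An empty result means that job data is sent nowhere, which is worth flagging in a pre-flight check.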

8 - Web Interface

How to use the web interface?

Home

After logging in, users land on a table that lists each configured cluster as a row with the following columns:

Name: The configured cluster’s name
Running Jobs: Number of jobs that have been running for longer than 5 minutes (or the configured shortRunning duration)
- Clicking the link forwards to the job list with preset filters for cluster and running jobs
Total Jobs: Number of jobs in the respective job archive
- Clicking the link forwards to the job list with a preset filter for cluster
Status View: Link to the status view of the respective cluster
- This column is only shown for users with admin authority.
Systems View: Link to the nodes view of the respective cluster
- This column is only shown for users with admin authority.

The navigation bar allows direct access to ClusterCockpit’s different views and functions. Depending on the user’s authorization, the selectable views can differ.

For most viewports, the navigation bar is rendered fully expanded:

Item	Title	Description
1	Home Button	Leads back to the home table
2	Views	Leads to ClusterCockpit’s different views; the selection depends on the user’s authority
3	Searchbar	Top-level searchbar; see the Searchbar section for full usage information
4	Documentation	Leads to this Documentation
5	Settings	Leads to ClusterCockpit settings page
6	Logout	Logs out the active user

Adaptive Render Versions

On smaller viewports, the navigation bar will be rendered in one of two collapsed states:

ClusterCockpit Collapsed Navbar: Partially collapsed navigation bar. ‘Groups’ expands to show links for the Users, Projects, Tags, and Nodes views. ‘Stats’ expands to show links for the Analysis and Status views. The searchbar, Logout, and Settings are not shown here but are still rendered explicitly in this case.

ClusterCockpit Burger Navbar: On mobile devices, the navigation bar as a whole is reduced to a burger navigation icon and displays all views, as well as the searchbar, as a stacked navigation menu.

8.1 - Settings

Webinterface Settings Page

The settings view allows non-privileged users to choose their preferred paging style, to customize how metric plots are rendered, and to generate personalized tokens for use with the API. Customization options include line width, number of plots per row (where applicable), whether backgrounds should be colored, and the color scheme of multi-line metric plots.

Privileged users will find an additional interface for choosing the preferred paging style used in the node list view.

Administrators will also find an administrative interface for handling local user accounts. This includes creating local accounts from the interface, editing user roles, listing and deleting existing users, generating JSON Web Tokens for API usage, and delegating managed projects for manager role users.

User Options

Settings available to the User Role are:

Field	Options	Note
Job List Paging Type	Classic / Continuous	Style of paging in job list and user job list
Line Width	# Pixels	Width of the lines in the timeseries plots
Plots Per Row	# Plots	How many plots to show next to each other on pages such as the job or nodes views
Colored Backgrounds	Yes / No	Color plot backgrounds indicating mean values within warning thresholds
Color Blind Mode	Yes / No	Whether to use color vision deficiency friendly palettes across the webinterface
Color Scheme	See Below	Render multi-line metric plots in different color ranges. Will change to CVD-Friendly palettes if Color Blind Mode is active

Generate JWT

This function will generate and return a personalized JWT, printed into the “Display JWT” field.

If working with the ClusterCockpit API, this token is required to authorize the user against the REST API endpoints.

Color Schemes

Name	Colors
Default
Autumn
Beach
BlueRed
Rainbow
Binary
GistEarth
BlueWaves
BlueGreenRedYellow

CVD-Friendly Color Schemes

These color palettes are based on https://personal.sron.nl/~pault/ and https://tsitsul.in/blog/coloropt/

Name	Colors
HighContrast
Bright
Muted
NormalSixColor
NormalTwelveColor

Support Options

Settings available to the Support User Role are:

Field	Options	Note
Node List Paging Type	Classic / Continuous	Style of paging in node list view

Admin Options

Create User

New users can be created directly via the web interface. On successful creation, a green response message is returned, and the user is directly visible in the “Special Users” table if the user has at least two roles, or a single role other than user.

Error messages will also be displayed if the user creation process failed. No user account is saved to the database in this case.

Please note: Users are usually imported via LDAP on ClusterCockpit startup.

Field	Option	Note
Username (ID)	`string`	Required, must be unique
Password	`string`	Only API users are allowed to have a blank password; users with a blank password can only authenticate via JWT
Project	`string`	Only manager users can have a project
Name	`string`	Name of the user, optional, can be blank
Email Address	`string`	User’s email, optional, can be blank
Role	Select one	See roles for more detailed information
	`API`	Allowed to interact with REST API
Default	`User`	Same as if created via LDAP sync
	`Manager`	Allows to inspect jobs and users of given project
	`Support`	Allows to inspect jobs and users of all projects, has no admin view or settings access
	`Admin`	General access

Special Users

This table does not contain users whose only role saved in the database is user. This is the case for all users created by LDAP import, so these users are not shown here. However, LDAP users’ roles can still be edited, and such users will appear in the table as soon as an authority higher than user, or a second authority, is granted.

All other special case users, e.g. new users manually created with support role, will appear in the list.

User accounts can be deleted by pressing the respective button displayed for each user entry. A verification pop-up window will appear to prevent accidental user deletion.

Additionally, JWTs for specific users can be generated here as well.

Column	Example	Description
Username	`abcd1`	Username of this user
Name	`Paul Atreides`	Name of this user
Project(s)	`abcd`	Managed project(s) of this user
Email	`demo@demo.com`	Email address of this user
Roles	`admin,api`	Role(s) of this user
JWT	Press button to reveal freshly generated token	Generate a JWT for this user for use with the CC REST API endpoints
Delete	Press button to verify deletion	Delete this user

Edit User Role

On creation, users can only have one role. However, multiple roles can be assigned to a user account. Roles are added or removed here.

Enter an existing username and select an existing (for removal) or new (for addition) role in the drop-down menu.

Then press the respective button to remove or add the selected authority from the user account. Errors will be displayed if existing roles are added or non-existing roles are removed.

Edit Managed Projects

On creation, users can only have one managed project. However, it is allowed to assign multiple projects to a manager account. The addition or removal of projects is performed here.

Enter an existing username and select an existing (for removal) or new (for addition) project by entering the respective projectId.

Then press the respective button to remove or add the selected project from the manager account. Errors will be displayed if existing projects are added, non-existing projects are removed, or if the user account is not authorized to manage projects at all.

Scramble Names (Presentation Mode)

Activating this switch will replace all user names, person names, and project names with random strings. It is intended for presentations on a production system, withholding critical information from a public audience.

Metric Plot Resampling

If “Resampling” of metric plots is enabled in the configuration file (config.json), and read correctly on start-up, this informational display will list both the number of data points below which the next resolution will be requested (“Trigger”) and the applicable resolutions themselves.

Note: Changes to the resampling options have to be performed by changing the configuration file and restarting the application.

Edit Notice Shown On Homepage

The contents of the text form field will be written into $CCPATH/var/notice.txt on submission. If this file does not exist, it will be created.

If any content is found, an informational card will be rendered above the home site table. The content will also be mirrored within the form field itself.

Removing any content from the form field, and submitting, will clear the file and remove the rendered card from the homepage. This state is indicated by the placeholder text “No Content.” being shown in the form field.

8.2 - Searchbar

Toplevel Searchbar Functionality

The top searchbar handles page-wide searches either by entering a search term directly as <query>, or by using a “keyword” in the form <keyword>:<query>. Entering a search term directly starts a hierarchical search which returns the first match in the hierarchy (see table below). It is recommended to supply the search with a keyword to specify the searched entity. For example, jobName:myJobName will specifically search for all jobs which have the queried string (or a part thereof) in their metadata jobName field. For all keywords with examples, see the table below.

Both keywords and queries are trimmed of all spaces before performing the search, returning the same results independently of location and number of spaces, e.g. name : Paul and name: paul are both handled identically.

Unprocessable queries will return a message detailing the cause of the error.

Available Keywords

Please note: Hovering over the information icon right of the query field will list all keywords in the webinterface.

Keyword	Example Query	Destination	Note
No Keyword Used	`abcd100`	Joblist or User Joblist	Performs hierarchical search `jobId -> username -> name -> projectId -> jobName`
JobId	`jobId:123456`	Joblist	Allows multiple identical matches, e.g. JobIds from different clusters
JobName	`jobName:myJobName`	Joblist	Works with partial queries. Allows multiple identical matches, e.g. JobNames from different clusters. An additional `Last 30 Days` filter is active by default.
ProjectId	`projectId:abcd100`	Joblist	All Jobs of the given project
Username	`username:abcd100a`	Users Table	Only active users are returned. Users without jobs are not shown. An additional `Last 30 Days` filter is active by default. Admin Only
Name	`name:Paul`	Users Table	Works with partial queries. Only active users are returned. Users without jobs are not shown. An additional `Last 30 Days` filter is active by default. Admin Only
ArrayJobId	`arrayJobId:891011`	Joblist	All Jobs of the given arrayJobId. An additional `Last 30 Days` filter is active by default.
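
The keyword syntax can be summarized with a small parser. The following Python sketch splits a search term into keyword and query and trims surrounding spaces, as described above; it is not the webinterface's implementation, the names are hypothetical, and the keyword spellings are taken from the example queries in the table.

```python
KEYWORDS = {"jobId", "jobName", "projectId", "username", "name", "arrayJobId"}

# Order used when no keyword is given (see "No Keyword Used" in the table).
HIERARCHY = ["jobId", "username", "name", "projectId", "jobName"]


def parse_search(term):
    """Return (keyword, query); keyword is None for a hierarchical search."""
    if ":" in term:
        keyword, query = (part.strip() for part in term.split(":", 1))
        if keyword not in KEYWORDS:
            raise ValueError(f"unknown keyword: {keyword!r}")
    else:
        keyword, query = None, term.strip()
    if not query:
        raise ValueError("empty query")
    return keyword, query
```

A result with keyword None would then be tried against each entity in HIERARCHY until the first match is found.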

8.3 - Plots

Plot Descriptions and Functionality

Most plots visible in the ClusterCockpit webinterface are implemented via uPlot or Chart.js, which both offer various functionality to the user.

Metric Plots

The main plot component of ClusterCockpit renders the metric values retrieved from the systems in a time dependent manner.

Interactivity

A selector crosshair is shown when hovering over the rendered data, and the data points corresponding to the legend are highlighted.

It is possible to zoom in by dragging a selection square with your mouse. Double-Clicking into the plot will reset the zoom.

Please note: Metric plots will be rendered with regard to the configured normal metric threshold at first, i.e. the threshold will either be the highest rendered value (spaced line), or will be used to cut-off outliers (10 x normal threshold). Resetting by double-clicking will re-render the plot with regard to the highest value of the dataset, i.e. adapt the Y-axis to match said maximum value.

Resampling of Data

If “Resampling” of metric plots is enabled in the configuration file (config.json), data is initially loaded at the coarsest resolution. Zooming into the dataset, as described above, will continuously trigger a reload of the data in finer resolutions, until the highest resolution is reached. A finer resolution is requested from the backend as soon as the number of visible data points falls below a configured amount (“Trigger”).

Please note: While archived data is read from disk, and therefore can be resampled in the backend directly, resampling of data for running jobs requires the use of a matching version of CC-Metric-Store.

Running job metric data read from older versions of CCMS will still be returned correctly, but always in the metric’s configured timestep.
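
The trigger rule can be sketched in a few lines of Python. The example assumes that resolutions are given as timesteps in seconds, so that a smaller number is a finer resolution; the function name and the sample values are hypothetical.

```python
def next_resolution(resolutions, current, visible_points, trigger):
    """Return the resolution to request after a zoom.

    Steps to the next finer resolution once fewer than `trigger`
    data points are visible; otherwise keeps the current one.
    """
    ordered = sorted(resolutions, reverse=True)  # coarsest timestep first
    index = ordered.index(current)
    if visible_points < trigger and index + 1 < len(ordered):
        return ordered[index + 1]
    return current
```

For instance, with resolutions [600, 300, 60] and a trigger of 180, zooming a 600-second plot down to 100 visible points would request the 300-second data.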

Conditional Legends

Hovering over the rendered data will display a legend as a yellow hover box. Depending on the amount of data shown, this legend will render differently:

Single Dataset: Runtime and Dataset Identifier Only
2 to 6 Datasets: Runtime, Line Color and Dataset Identifier
7 to 12 Datasets: Runtime and Dataset Identifier Only
More than 12 Datasets: No Legend
Statistics Datasets: Runtime and Dataset Identifier Only (See below)

The “no legend” case is required to not clutter the display in case of high data volume, e.g. core granularity data for more than 128 cores, which would result in 128 legend entries, possibly blocking the plotting area of metric graphs below.
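
The legend rules above amount to a simple decision on the number of datasets. The following Python sketch restates them; the function name and the returned labels are hypothetical.

```python
def legend_mode(num_datasets, is_statistics=False):
    """Map the number of rendered datasets to the legend variant."""
    if is_statistics or num_datasets == 1:
        return "runtime+identifier"
    if 2 <= num_datasets <= 6:
        return "runtime+color+identifier"
    if 7 <= num_datasets <= 12:
        return "runtime+identifier"
    return "none"  # more than 12 datasets: no legend
```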

Example

Colored Backgrounds

The plot’s background is colored depending on the average value of the viewed metric relative to its configured threshold values. The three cases are:

White: Metric average within expected parameters. No performance impact.
Yellow: Metric average below expected parameters, but not yet critical. Possible performance impact.
Red: Metric average unexpectedly low. Indicator for suboptimal usage of resources. Performance impact to be expected.
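
The three cases can be expressed as a comparison against two thresholds. The following Python sketch assumes two hypothetical parameters, caution and alert, with alert below caution; the actual threshold names and comparison rules are defined by the metric configuration and may differ.

```python
def background_color(average, caution, alert):
    """Classify a metric average against two lower thresholds."""
    if average < alert:
        return "red"     # unexpectedly low
    if average < caution:
        return "yellow"  # below expectation, not yet critical
    return "white"       # within expected parameters
```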

Example

Statistics Variant

In the job list views, high amounts of data are by default rendered as a statistical representation of the numerous, single datasets:

Maximum: The maximum values of the base datasets of each point in time, over time. Colored in green.
Median: The median values of the base datasets of each point in time, over time. Colored in black.
Minimum: The minimal values of the base datasets of each point in time, over time. Colored in red.
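
The three statistics series are computed per point in time across all base datasets. The following Python sketch shows the idea with the standard library, assuming all datasets have the same length and share their timestamps; the function name is hypothetical.

```python
from statistics import median


def statistics_series(datasets):
    """Reduce many equally long series to max, median and min series."""
    columns = list(zip(*datasets))  # one tuple of values per timestep
    return {
        "max": [max(column) for column in columns],
        "median": [median(column) for column in columns],
        "min": [min(column) for column in columns],
    }
```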

Example

Note: Archived jobs might still show “Max/Average/Min” in their metric plots. This is because these jobs were archived before the change in favor of median values was made.

Histograms

Histograms display (binned) data, allowing distributions of the respective data source to be visualized. Data highlighting, zooming, and resetting the zoom work as described for metric plots.

Example

Roofline Plot

A roofline plot, or roofline model, represents the utilization of available resources as the relation between computation and memory usage.
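The relation between computation and memory usage can be made concrete with the standard roofline model: attainable performance is limited either by peak compute performance or by memory bandwidth multiplied by operational intensity. The following Python sketch shows this textbook formulation; it is not taken from the ClusterCockpit code, and the function names and units (GFLOP/s, GB/s, FLOP/byte) are assumptions for illustration.

```python
def operational_intensity(flops_rate: float, mem_bandwidth: float) -> float:
    """Operational intensity in FLOP/byte from GFLOP/s and GB/s."""
    return flops_rate / mem_bandwidth


def attainable_performance(intensity: float,
                           peak_flops: float,
                           peak_bandwidth: float) -> float:
    """Roof of the standard roofline model in GFLOP/s.

    Below the ridge point the memory roof (bandwidth * intensity) applies,
    above it the compute roof (peak_flops).
    """
    return min(peak_flops, peak_bandwidth * intensity)
```

With an assumed peak of 1000 GFLOP/s and 200 GB/s, a code with an intensity of 0.5 FLOP/byte is bound by the memory roof at 100 GFLOP/s, while a code with 10 FLOP/byte is bound by the compute roof.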

Dotted Roofline

Roofline models rendered as dotted plots display the utilization of hardware resources over time.

Please Note: The roofline models rendered in the status view are not job-derived, but display the utilization of single nodes at the moment of data collection. Therefore, no time information is required, and all dots are colored blue.

Example

Heatmap Roofline

The roofline model shown in the analysis view is the single exception: it is rendered as a heatmap. This is because the displayed data is derived from more than one job, since the analysis view returns all jobs matching the selected filters. The roofline therefore colors regions of accumulated activity in increasing shades of red, depicting the regions below the roofs in which the returned jobs primarily perform.

Please note: The plot is rendered in double-logarithmic scaling, yet the lines in the background appear linear. The reason is that the heatmap roofline is rendered manually (and directly) using only HTML canvas, while the dotted roofline model is rendered with the help of the uPlot package, which allows easy display of double-log scales.

Example

Polar Plots

A polar, or radar, plot represents the utilization of key metrics. Both the maximum and the average utilization as a fraction of the 100% theoretical maximum (labelled as 1.0) are rendered on a number of axes equal to the number of displayed key metrics. This leads to an increasing area, which in turn marks increasingly optimal resource usage. In principle, this is a graphic representation of data also shown in the footprint component.

By clicking on one of the two legends, the respective dataset will be hidden. This can be useful if high overlap reduces visibility.

Example

Scatter / Bubble Plot

Bubble scatter plots show the position of the averages of two selected metrics in relation to each other.

Each circle represents one job, while the size of a circle is proportional to its node hours. Darker circles mean multiple jobs have the same averages for the respective metric selection.

Example

8.4 - Filters

Webinterface Filter Options

Filter Button as displayed in Job List Views

The ClusterCockpit filter component is used for reducing the number of jobs, either for direct display in job list views, or to specify the data source for collecting information displayed in user or project tables, as well as the analysis view.

Multiple Active Filters — Three active filters have reduced the total job count considerably

Multiple filters can be easily combined by selecting more than one option of the available filters.
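Conceptually, combining filters narrows the result set: a job is returned only if it matches every active filter. The Python sketch below illustrates this under the assumption that filters are combined with a logical AND; the `Job` record, the cluster names and the `apply_filters` helper are hypothetical and do not reflect ClusterCockpit's actual data model or API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Job:
    # Hypothetical, reduced job record for illustration only
    cluster: str
    state: str
    duration_s: int


JobFilter = Callable[[Job], bool]


def apply_filters(jobs: List[Job], filters: List[JobFilter]) -> List[Job]:
    """Keep only jobs accepted by every active filter (logical AND)."""
    return [job for job in jobs if all(f(job) for f in filters)]


jobs = [
    Job("alpha", "running", 1200),
    Job("alpha", "completed", 7200),
    Job("beta", "completed", 9000),
]

# Three active filters: cluster, job state, and minimum duration
active_filters = [
    lambda job: job.cluster == "alpha",
    lambda job: job.state == "completed",
    lambda job: job.duration_s > 3600,
]

remaining = apply_filters(jobs, active_filters)
```

Here, only the completed two-hour job on the hypothetical cluster `alpha` remains; with no active filters, all jobs are returned.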

By clicking on the respective filter pill, colored in blue and located to the right of the filter component, one can directly access the respective filter's menu for editing or removing the filter.

At the moment, the following filters are implemented:

Cluster/Partition

Select a configured cluster, or a specified partition of a given cluster, and display only jobs started on that cluster (and partition).

Options: All cluster names, and nested partition names, configured in config.json

Default: Any Cluster (Any Partition)

Job States

Select one or more job states, and display only jobs matching the selected criteria.

Options: running, completed, failed, cancelled, stopped, timeout, preempted, out_of_memory

Default: All states

Start Time

Select the timeframe in which jobs were started, and display only jobs matching the selected criteria.

Options: Free selection of date dd.mm.YYYY and time hh:mm for from and to limits.

Default: All Starttimes

Preset: Jobs started one month ago until $now

Duration

Select the duration of jobs, and display only jobs matching the selected criteria.

Options: Duration less than hh:mm, duration more than hh:mm, duration between two duration selections. Only one of the three options can be used at a time.

Default: All Durations

Resources

Select a named node or specify an amount of used resources, and display only jobs matching the selected criteria.

Options:

Named node free text field: Enter a hostname here to only return jobs which ran on this node. Select the desired match logic (Defaults to “Equal”, i.e. exact match).
Range selectors: Select a range of allocated job resources ranging from the minimal to the maximum configured resource count of all clusters. If the cluster filter is set, the ranges are limited to the respective resources’ configuration. Available resources are:
- Nodes
- HWThreads
- Accelerators (if available)

Default: No named node, full resource ranges of all configured clusters

Energy

Specify total consumed energy, and display only jobs matching the selected range.

Options: “Total Job Energy” in kWh.

Default: No selection

Please note: Consumed energy will be written during archiving after a job has finished. Thus, this filter only works on jobs which are not marked as running.

Statistics

Specify ranges of metric statistics, and display only jobs matching the selected criteria.

Please note: Metric statistics listed here for selection are configured. All metrics, for which the footprint flag is set in the respective metrics’ configuration will be available here.

Example Options:

FLOPs (Avg.): Select Range From-To by dragging the slider or entering values directly.
Memory Bandwidth (Avg.): Select Range From-To by dragging the slider or entering values directly.
Load (Avg.): Select Range From-To by dragging the slider or entering values directly.
Memory Used (Max.): Select Range From-To by dragging the slider or entering values directly.

Default: Full metric statistics ranges as configured

Start Time Quick Selections

Please note: Not available in all views!

Quickly select a preconfigured range of job start times. Will display as named start time filter.

When the returned URL is copied and shared, the named filter value will transfer over.

Options: Last 6 hours, Last 24 hours, Last 7 Days, Last 30 Days

Default: No selection

8.5 - Views

View-Specific Frontend Usage Information.

Usage descriptions for each view of the ClusterCockpit web interface.

8.5.1 - My Jobs

All Jobs as Table of the Active User

The “My Jobs” View is available to all users regardless of authority and displays the user's personal jobs, i.e. jobs started by this user's username on the cluster systems.

The view is a personal variant of the user job view and therefore also consists of three components: Basic Information about the user's jobs, selectable statistic histograms of the jobs, and a generalized job list.

Users are able to change the sorting, select and reorder the rendered metrics, filter, and activate a periodic reload of the data.

User Information and Basic Distributions

The top row always displays personal usage information, independent of the selected filters. Information displayed:

Username
Person Name (if available in DB)
Total Jobs
Short Jobs (as defined by the configuration, default: less than 300 second runtime)
Total Walltime
Total Core Hours

Additional histograms depicting the distribution of job duration and number of nodes occupied by the returned jobs are affected by the selected filters. The binning of the duration histogram can be selected by the user. The options are as follows:

Bin Size	Number of Bins	Maximum Displayed Duration
1 Minute (1m)	60	1 Hour
10 Minute (10m)	72	12 Hours
1 Hour (1h, Default)	48	2 Days
6 Hours (6h)	12	3 Days
12 Hours (12h)	14	1 Week
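The maximum displayed duration in the table above follows directly from bin size multiplied by the number of bins. The Python sketch below reproduces this arithmetic and shows how a job duration could be assigned to a bin. The names are hypothetical, and the handling of durations beyond the displayed maximum (returning `None`) is an assumption, not documented behavior.

```python
from datetime import timedelta
from typing import Optional

# Bin size and number of bins per selectable option, as listed in the table
DURATION_BINNINGS = {
    "1m": (timedelta(minutes=1), 60),
    "10m": (timedelta(minutes=10), 72),
    "1h": (timedelta(hours=1), 48),
    "6h": (timedelta(hours=6), 12),
    "12h": (timedelta(hours=12), 14),
}


def max_displayed_duration(binning: str) -> timedelta:
    """Maximum displayed duration = bin size * number of bins."""
    bin_size, num_bins = DURATION_BINNINGS[binning]
    return bin_size * num_bins


def bin_index(duration: timedelta, binning: str) -> Optional[int]:
    """Zero-based bin of a job duration, or None if outside the histogram.

    Assumption: durations beyond the displayed maximum are not binned.
    """
    bin_size, num_bins = DURATION_BINNINGS[binning]
    index = duration // bin_size  # timedelta // timedelta yields an int
    return index if 0 <= index < num_bins else None
```

For the default option `1h`, 48 bins of one hour each cover two days; a job running for 90 minutes would fall into the second bin (index 1).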

Selectable Histograms

Histograms depicting the distribution of the selected jobs’ statistics can be selected from the top navbar “Select Histograms” button. The displayed data is based on the jobs returned from active filters, and will be pulled from the database.

The binning of the statistics histograms can be selected by the user; the bin limits are calculated automatically. The options are as follows: 10 (Default), 20, 50, 100.

Please note: Metrics statistics listed here for selection are configured. All metrics, for which the footprint flag is set in the respective metrics’ configuration will be available here.

Job List

The job list displays all jobs started by your username on the systems. Additional filters will always respect this limitation. For a detailed description of the job list component, see the related documentation.

8.5.2 - User Jobs

All Jobs as Table of a Selected User

The “User Jobs” View is only available to management and supporting staff and displays jobs of the selected user, i.e. jobs started by this user's username on the cluster systems.

The view consists of three components: Basic Information about the user's jobs, selectable statistic histograms of the jobs, and a generalized job list.

Users are able to change the sorting, select and reorder the rendered metrics, filter, and activate a periodic reload of the data.

User Information and Basic Distributions

The top row always displays information about the user, independent of the selected filters.

Information displayed:

Username
Person Name (if available in DB)
Total Jobs
Short Jobs (as defined by the configuration, default: less than 300 second runtime)
Total Walltime
Total Core Hours

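Additional histograms depicting the distribution of job duration and number of nodes occupied by the returned jobs are affected by the selected filters. The binning of the duration histogram can be selected by the user. The options are as follows:

Bin Size	Number of Bins	Maximum Displayed Duration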
1 Minute (1m)	60	1 Hour
10 Minute (10m)	72	12 Hours
1 Hour (1h, Default)	48	2 Days
6 Hours (6h)	12	3 Days
12 Hours (12h)	14	1 Week

Selectable Histograms

The binning of the statistics histograms can be selected by the user; the bin limits are calculated automatically. The options are as follows: 10 (Default), 20, 50, 100.

Please note: Metrics statistics listed here for selection are configured. All metrics, for which the footprint flag is set in the respective metrics’ configuration will be available here.

Job List

The job list displays all jobs started by this user's username on the systems. Additional filters will always respect this limitation. For a detailed description of the job list component, see the related documentation.

8.5.3 - Job List

A Configurable Table Displaying Jobs According to Filters

Job View — Job List. In this example, the optional footprint is displayed, two filters are active, and the table is refreshed every minute. The first job has a high node count, therefore the plots are rendered in the statistics variant. The ‘mem_bw’ metric likely has artifacts as shown by the grey footprint. The second job has tags and displays less than optimal performance in the ‘flops_any’ metric, coloring the respective plot background in orange.

The primary view of ClusterCockpit's web interface is the tabular listing of jobs, which displays various information about the jobs returned by the selected filters. This information includes the jobs’ full meta data, such as runtime or job state, as well as an optional footprint, allowing quick assessment of the jobs’ performance.

Most importantly, the list displays a selectable array of metrics as time-dependent metric plots, which allows detailed insight into the jobs’ performance at a glance.

Default Users: For users without additional roles, this view is labelled as ‘Job Search’. Displayed jobs are limited to jobs started by the active user, otherwise the functionality is identical, e.g. filtering or footprint display.

Manager Users: For users with additional manager role, this view is labelled as ‘Managed Jobs’. Displayed jobs are limited to jobs started by users of the managed projects (usergroups), otherwise the functionality is identical, e.g. filtering or footprint display.

Several options allow configuration of the displayed data, which are also persisted for each user individually, either for general usage or by cluster.

Sorting

Basic selection of sorting parameter and direction. By default, jobs are sorted by starting timestamp in descending order (latest jobs first). Other selections to sort by are

Duration
Number of Nodes
Number of Hardware-Threads
Number of Accelerators
Total Energy Consumed
Additional configured Metric Statistics
…

Please note: Additional metrics statistics are configured. All metrics, for which the footprint flag is set in the respective metrics’ configuration will be available as additional sorting options.

Switching of the sort direction is achieved by clicking on the arrow icon next to the desired sorting parameter.

Metrics

Selection of metrics shown in the tabular view for each job. The list is compiled from all available configured metrics of the ClusterCockpit instance, and the tabular view will be updated upon applying the changes.

In addition to the metric names themselves, the availability by cluster is indicated as a comma-separated list next to the metric identifier. This information will change to the availability by partition if the cluster filter is active.

It is furthermore possible to edit the order of the selected metrics. This can be achieved by dragging and dropping the metric selectors to the desired order, where the topmost metric will be displayed next to the “Job Info” column, and additional metrics will be added on the right side.

Lastly, the optional “Footprint” Column can be activated (and deactivated) here. It will always be rendered next to the “Job Info” column, while metrics start right of the “Footprint” column, if activated.

Filters

Selection of filters applied to the queried jobs. By default, no filters are activated if the view was opened via the navigation bar. At multiple locations throughout the web interface, direct links will lead to this view with one or more preset filters active, e.g. selecting a cluster's “running jobs” from the home page will open this view displaying only running jobs of that cluster.

Possible options are:

Cluster/Partition: Filter by configured cluster (and partitions thereof)
Job State: Filter by defined job state(s)
Start Time: Filter by start timestamp
Duration: Filter by job duration
Tags: Filter by tags assigned to jobs
Resources: Filter by allocated resources or named node
Energy: Filter by consumed total energy (for completed jobs only)
Statistics: Filter by average usage of defined metrics

Each filter and its default value is described in detail in the Filters section (8.4).

Job Count

The total number of jobs returned by the backend for the given set of filters.

Search and Reload

Search for specific jobname, project or username (privileged only) using the searchbox by selecting from the dropdown and entering the query.

Force a complete reload of the table data, or set a timed periodic reload (30, 60, 120, 300 Seconds).

Search for specific project

If the Job-List was opened via a ProjectId-Link or the Projects List, the text search will be fixed to the selected project, and allows for filtering jobnames and users in that project, as indicated by the placeholder text.

If desired, the fixed project can be removed by pressing the button right of the input field, returning the joblist to its default state.

Job List Table

The main component of the job list view renders data pulled from the database, the job archive (completed jobs) and the configured metric data source (running jobs).

Job Info

The meta data containing general information about the job is represented in the “Job Info” column, which is always the first column to be rendered. From here, users can navigate to the detailed view of one specific job as well as the user or project specific job lists.

Field	Example	Description	Destination
Job Id	`123456`	The JobId of the job assigned by the scheduling daemon	Job View
Job Name	`myJobName`	The name of the job as supplied by the user	-
Username	`abcd10`	The username of the submitting user	User Jobs
Project	`abcd`	The name of the usergroup the submitting user belongs to	Joblist with preset Filter
Resources	`n100`	Indicator for the allocated resources. Single resources will be displayed by name, i.e. exclusive single-node jobs or shared resources. Multiples of resources will be indicated by icons for nodes, CPU Threads, and accelerators.	-
Partition	`main`	The cluster partition this job was started on	-
Start Timestamp	`10.1.2024, 10:00:00`	The epoch timestamp the job was started at, formatted for human readability	-
Duration	`0:21:10`	The runtime of the job, will be updated for running jobs on reload. Additionally indicates the state of the job as colored pill	-
Walltime	`24:00:00`	The allocated walltime for the job as per job submission script	-

Footprint

The optional footprint column will show base metrics for job performance at a glance, and will hint to performance (and performance problems) in regard to configurable metric thresholds.

Please note: Metric statistics displayed here are configured. All metrics, for which the footprint flag is set in the respective metrics’ configuration will be shown in this view.

Examples:

Field	Description	Note
cpu_load	Average CPU utilization	-
flops_any	Floprate calculated as `f_any = (f_double x 2) + f_single`	-
mem_bw	Average memory bandwidth used	Non-GPU Cluster only
mem_used	Maximum memory used	Non-GPU Cluster only
acc_utilization	Average accelerator utilization	GPU Cluster Only

Colors and icons differentiate between the different warning states based on the configured threshold of the metrics. Reported metric values below the warning threshold simply report bad performance in one or more metrics, and should therefore be inspected by the user for future performance improvement.

Metric values colored in blue, however, usually report performance above the expected levels, which is exactly why these metrics should be inspected as well. The “maximum” thresholds often correspond to the performance theoretically achievable by the respective hardware component, but are rarely reached in practice. Inspecting jobs that report such levels can reveal averaging errors, unrealistic spikes in the metric data, or even bugs in the code of ClusterCockpit.

Color	Level	Description	Note
Blue	Info	Metric value below maximum configured peak threshold	Job performance above expected parameters - Inspection recommended
Green	OK	Metric value below normal configured threshold	Job performance within expected parameters
Yellow	Caution	Metric value below configured caution threshold	Job performance might be impacted
Red	Warning	Metric value below configured warning threshold	Job performance impacted with high probability - Inspection recommended
Dark Grey	Error	Metric value extremely above maximum configured threshold	Inspection required - Metric spikes in affected metrics can lead to erroneous average values
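The level table above can be read as a chain of threshold comparisons. The following Python sketch illustrates one possible reading for metrics where higher values are better. The parameter names (`peak`, `normal`, `caution`, `alert`), the exact boundary handling, and the treatment of every value above `peak` as an error are assumptions; the table itself only speaks of values “extremely above” the maximum.

```python
def footprint_level(value: float, peak: float, normal: float,
                    caution: float, alert: float) -> str:
    """Map a footprint metric value to a level name (higher is better).

    Threshold names are assumed and expected to be ordered as
    peak > normal > caution > alert.
    """
    if value > peak:
        return "error"    # dark grey: above the configured maximum
    if value > normal:
        return "info"     # blue: above expected parameters
    if value > caution:
        return "ok"       # green: within expected parameters
    if value > alert:
        return "caution"  # yellow: performance might be impacted
    return "warning"      # red: performance very likely impacted
```

With assumed thresholds of `peak=100`, `normal=80`, `caution=40` and `alert=20`, a value of 50 is reported as `"ok"`, a value of 10 as `"warning"`, and a value of 150 as `"error"`.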

For examples, see images in the job view section.

Metric Row

Selected metrics are rendered here in the selected order as metric lineplots. Aspects of the rendering can be configured at the settings page.

8.5.4 - Job Comparison

Compare Job Metric Statistics

Job List With Compare Switch — Job list with compare switch. In this example, filters return 145 jobs, while no job is selected manually.

Accessible from the job list primary view, the job compare view allows for the comparison of metric statistics in a pseudo-time-dependent manner.

The “Compare Jobs” button is located in the upper right corner of the job list view. Jobs for comparison are either selected by

… a combination of filters resulting in a dataset of 500 jobs or less.
… manual job selection by checking the box in the job info card.

If too many jobs are returned by the current filter selection, the button will be disabled.

If jobs are directly selected from the current job list, the button will display the current count, as well as an additional “Reset” button, which will empty the list of selected jobs, if pressed.

Manual job selection will also work if the current job list has more than 500 returned jobs, while the subsequent job compare view will ignore all additional filters, and only show selected jobs. Returning to the job list also returns with the last used filters.

This allows manual job selection between pages, but also manual job selection between different filter combinations!
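The rules for when the “Compare Jobs” button is usable can be condensed into a short decision function. This Python sketch only mirrors the behavior described above; the function name and return values are hypothetical, and the behavior for an empty filter result is not documented and therefore not modeled.

```python
from typing import List, Tuple

MAX_FILTERED_JOBS = 500  # documented limit for filter-based comparison


def compare_state(filtered_job_count: int,
                  selected_job_ids: List[int]) -> Tuple[bool, str]:
    """Return (button enabled, source of the compared jobs)."""
    if selected_job_ids:
        # Manual selection takes precedence and ignores the filter result
        return True, "manual"
    if filtered_job_count <= MAX_FILTERED_JOBS:
        return True, "filter"
    # Too many jobs returned by the filters and nothing selected manually
    return False, "disabled"
```

For example, 145 filtered jobs without manual selection enable a filter-based comparison, while 1200 filtered jobs disable the button unless jobs are selected manually.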

Fixed Compare Elements

Job Compare Options and Resource Compare — Job compare view top elements. The count of 145 jobs remains after switching to this view. The resource plot shows jobs sorted by their startTime, and all jobs have allocated accelerators (red data line).

The compare view features a reduced header:

Sorting is disabled, as jobs are always sorted by startTime in ascending order.
The filter component is removed and only shows the total number of compared jobs.
The refresh component is also removed.

The “Metric Selection” is active and can be used to add additional metric comparison plots to the view, if desired.

“Return to List” closes the compare view and restores the former job list view.

The “Resource Compare” plot is always shown at the first position. It features a semi-logarithmic view of allocated job resources in a pseudo-time-dependent manner, as all jobs are sorted by actual start time. The data is colored as follows:

Black: Nodes - will always be at least 1 (Note: Also for shared jobs!)
Blue: Hardware Threads ( ~ Cores)
Red: Accelerators - Can be zero! If so, no line is rendered.

The legend includes further information, such as:

Job-ID
Cluster (and subCluster) on which the job ran
Runtime of the job

Selectable Compare Elements

Job Compare Metric Plot and Table — Job compare view metric plot and table. ‘Clock’ metric statistics are plotted for every job sorted by their startTime. All information is also shown as sortable table at the bottom of the compare view.

Below the resource compare plot, the individual metric compare plots are rendered. For each job, the Min/Max/Avg of the respective metric is plotted in a banded manner.

Zooming is possible, and will be synchronized to all other rendered plots, including the resource comparison.

Please Note: Due to spacing reasons, not all jobIDs can be rendered as tick-marks if the total count of compared jobs is high!

Below the plots, all information is again rendered as a single table consisting of the following columns:

JobID
Start Time
Duration
Cluster
Resources (Nodes, Threads, Accs)
For each Metric: Minimum, Maximum, Average

It is possible to filter for specific jobIDs or parts thereof; all other columns are sortable.

Clicking on a JobID will lead to the respective Job View.

8.5.5 - Job

Detailed Single Job Information View

The job view displays all data related to one specific job in full detail, and allows detailed inspection of all metrics at several scopes, as well as manual tagging of the job.

Top Bar

The top bar of each job view replicates the “Job Info” and “Footprint” seen in the job list, and additionally renders general metric information in specialized plots.

For shared jobs, a list of jobs which run (or ran) concurrently is shown as well.

Job Info

Identical to the job list equivalent, this component displays meta data containing general information about the job. From here, users can navigate to the detailed view of one specific job as well as the user or project specific job lists.

Field	Example	Description	Destination
Job Id	`123456`	The JobId of the job assigned by the scheduling daemon. The icon on the right allows for easy copy to clipboard.	Job View
Job Name	`myJobName`	The name of the job as supplied by the user	-
Username	`abcd10`	The username of the submitting user	User Jobs
Project	`abcd`	The name of the usergroup the submitting user belongs to	Joblist with preset Filter
Resources	`n100`	Indicator for the allocated resources. Single resources will be displayed by name, i.e. exclusive single-node jobs or shared resources. Multiples of resources will be indicated by icons for nodes, CPU Threads, and accelerators.	-
Partition	`main`	The cluster partition this job was started on	-
Start Timestamp	`10.1.2024, 10:00:00`	The epoch timestamp the job was started at, formatted for human readability	-
Duration	`0:21:10`	The runtime of the job, will be updated for running jobs on reload. Additionally indicates the state of the job as colored pill	-
Walltime	`24:00:00`	The allocated walltime for the job as per job submission script	-

At the bottom, all tags attached to the job are listed. Users can manage attached tags via the “manage X Tag(s)” button.

Concurrent Jobs

In the case of a shared job, a second tab next to the job info will display all jobs which were run on the same hardware at the same time. “At the same time” is defined as “has a starting or ending time which lies between the starting and ending time of the reference job” for this purpose.

A cautious period of five minutes is applied to both limits, in order to restrict display of jobs which have too little overlap, and would just clutter the resulting list of jobs.
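The definition of “at the same time” can be sketched as a simple interval check. The Python example below is one possible reading: the five-minute period is interpreted as shrinking the reference job's time window on both sides, which is an assumption. The function name is hypothetical, and the additional check that both jobs ran on the same hardware is omitted.

```python
MARGIN_S = 300  # five minutes, applied to both limits


def is_concurrent(ref_start: int, ref_end: int,
                  other_start: int, other_end: int,
                  margin: int = MARGIN_S) -> bool:
    """Check whether another job starts or ends within the reference job.

    All arguments are epoch timestamps in seconds. Assumption: the margin
    narrows the reference window, excluding jobs with too little overlap.
    """
    lower = ref_start + margin
    upper = ref_end - margin
    starts_within = lower <= other_start <= upper
    ends_within = lower <= other_end <= upper
    return starts_within or ends_within
```

Note that, read literally, this definition does not match a job which starts before and ends after the reference job, since neither its start nor its end lies within the reference window.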

Each overlapping job is listed with its jobId as a link leading to this job's detailed job view.

Footprint

Identical to the job list equivalent, this component will show base metrics for job performance at a glance, and will hint to job quality and problems in regard to configurable metric thresholds. In contrast to the job list, it is always active and shown in the detailed job view.

Please note: Metric statistics displayed here are configured. All metrics, for which the footprint flag is set in the respective metrics’ configuration will be shown in this view.

Examples:

Field	Description	Note
cpu_load	Average CPU utilization	-
flops_any	Floprate calculated as `f_any = (f_double x 2) + f_single`	-
mem_bw	Average memory bandwidth used	-
mem_used	Maximum memory used	Non-GPU Cluster only
acc_utilization	Average accelerator utilization	GPU Cluster Only

Colors and icons differentiate between the different warning states based on the configured thresholds of the metrics. Reported metric values below the warning threshold simply report bad performance in one or more metrics, and should therefore be inspected by the user for future performance improvement.

Color	Level	Description	Note
Blue	Info	Metric value below maximum configured peak threshold	Job performance above expected parameters - Inspection recommended
Green	OK	Metric value below normal configured threshold	Job performance within expected parameters
Yellow	Caution	Metric value below configured caution threshold	Job performance might be impacted
Red	Warning	Metric value below configured warning threshold	Job performance impacted with high probability - Inspection recommended
Dark Grey	Error	Metric value extremely above maximum configured threshold	Inspection required - Metric spikes in affected metrics can lead to erroneous average values

Specific to the job view: In the job view, the footprint component also allows for 1:1 rendering of HTML code saved within the job's meta data section of the database. This is intended for administrative messages towards the user who created the job, e.g. for displaying warnings, hints, or contact information.

Examples

Footprint with good Performance — Footprint of a job with performance well within expected parameters, ‘mem_bw’ even overperforms.

Footprint with mixed Performance — Footprint of an accelerated job with mixed performance parameters.

Footprint with Errors — Footprint of a job with performance averages way above the expected maxima - Look for artifacts!

Polar Representation

Next to the footprints, a second tab will render the polar plot representation of the configured footprint metrics. Minimum, Average and Maximum ranges are rendered.

Roofline Representation

A roofline plot representing the utilization of available resources as the relation between computation and memory usage over time (color scale blue -> red).

Energy Summary

For completed jobs, the energy estimates are shown below the top bar. Energy is shown in kilowatt hours for all contributing metrics. If a constant for g/kWh is configured, an additional estimate is calculated which displays the amount of carbon emissions.

Please note: Energy metrics displayed here are configured. All metrics, for which the energy flag is set in the respective metrics’ configuration will be shown in this view.

In addition, “Total Energy” is calculated as the sum of all configured metrics, regardless of their origin. I.e., if core_power and cpu_power are configured, both values contribute to the total energy.
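The energy summary can be illustrated with a short calculation. In this Python sketch, the total is the sum of all configured energy metrics as described above. That the emission estimate is derived from the total energy is an assumption, as are the function name and the example values.

```python
from typing import Dict, Optional, Tuple


def energy_summary(
    metric_energy_kwh: Dict[str, float],
    emission_g_per_kwh: Optional[float] = None,
) -> Tuple[float, Optional[float]]:
    """Return (total energy in kWh, estimated emissions in g or None).

    All configured energy metrics contribute to the total, regardless of
    their origin. Emissions are only estimated if a constant is configured.
    """
    total_kwh = sum(metric_energy_kwh.values())
    if emission_g_per_kwh is None:
        return total_kwh, None
    return total_kwh, total_kwh * emission_g_per_kwh


# Both core_power and cpu_power are configured, so both are summed up
total, emissions = energy_summary(
    {"core_power": 1.5, "cpu_power": 2.0}, emission_g_per_kwh=400.0
)
```

With the example values, the total is 3.5 kWh and the estimate amounts to 1400 g. Since all configured metrics are summed, overlapping measurements such as `core_power` and `cpu_power` are both counted in the total.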

Metric Plot Table

The view's middle section consists of metric plots for each metric selected in the “Select Metrics” menu, which defaults to all configured metrics available to the job's cluster and subCluster.

The data shown per metric defaults to the smallest available granularity of the metric with data of all nodes, but can be changed at will by using the drop down selectors above each plot.

If available, the statistical representation can be selected as well, by scope (e.g. stats series (node)).

Please note: The backend will calculate and return a statistical data series if the underlying metric dataset has at least 15 data series, e.g. a job utilizing 15 or more cores.

Tagging

Manual tagging of jobs is performed by using the “Manage Tags” option.

Tags are categorized into three “Scopes” of visibility:

Admin: Only administrators can create and attach these tags. Only visible for administrators and support personnel.
Global: Administrators and support personnel can create and attach these tags. Visible for everyone.
Private: Everyone can create and attach private tags, only visible to the creator.

Available tags are listed, colored by scope, and can be added to the job's database entry simply by pressing the respective button.

The list can be filtered for specific tags by using the “Search Tags” prompt.

New tags can be created by entering a new type:name combination in the search prompt, which will display a button for creating this new tag. Privileged users will additionally be able to select the “Scope” (see above) of the new tag.

Statistics and Meta Data

On the bottom of the job view, additional information about the job is collected. By default, the statistics of selected metrics are shown in tabular form, each in their metrics’ native granularity.

Statistics Table

The statistics table collects all metric statistical values (min, max, avg) for each allocated node and each granularity.

The metrics to be displayed can be selected using the “Select Metrics” selection pop-up window. In the header, next to the metric name, a second drop down allows the selection of the displayed granularity. If no other scopes than node are available, the drop down menu is disabled.

Core and Accelerator metrics default to their respective native granularities automatically.

For multi-node jobs, fine granularities are not requested from the backend from the start. The “Load Scopes” option allows additional scopes to be loaded later; this applies to all selected metrics in the statistics table, and also to metrics selected later.

Job Script

This tab displays the job script with which this job was started on the systems.

Slurm Info

This tab displays information returned from the SLURM batch process management software.

8.5.6 - Users

Table of All Users Running Jobs on the Clusters

User Table, sorted by ‘Total Jobs’ in descending order. In addition, active filters reduce the underlying data to jobs with more than one hour runtime, started on the GPU accelerated cluster.

This view lists all users who are, or were, active on the configured clusters. Information about the total number of jobs, walltimes and calculation usages is shown.

It is possible to filter the list by username using the equally named prompt, which also accepts partial queries.

The filter component allows limitation of the returned users based on job parameters like start timestamp or memory usage.

The table can be sorted by clicking the respective icon next to the column headers.

Please Note: A “Last 30 Days” filter is activated by default when opening this view.

Managers Only: For users with manager authority, this view will be titled ‘Managed Users’ in the navigation bar. Managers will only be able to see other user accounts of the managed projects.

Details

Column	Description	Note
User Name	The user account jobs are associated with	Links to the user’s job list with a preset filter returning only jobs of this user, plus additional histograms
Name	The name of the user
Total Jobs	User’s total of all started jobs
Total Walltime	User’s total requested walltime
Total Core Hours	User’s total of all used core hours
Total Accelerator Hours	Users’ total of all used accelerator hours	Please Note: This column is always shown, and will return `0` for clusters without installed accelerators

8.5.7 - Projects

Table of All Projects Running Jobs on the Clusters

This view lists all projects (usergroups) which are, or were, active on the configured clusters. Information about the total number of jobs, walltime, and compute usage is shown.

It is possible to filter the list by project name using the equally named prompt, which also accepts partial queries.

The filter component allows limitation of the returned projects based on job parameters like start timestamp or memory usage.

The table can be sorted by clicking the respective icon next to the column headers.

Please Note: A “Last 30 Days” filter is activated by default when opening this view.

Managers Only: For users with manager authority, this view will be titled ‘Managed Projects’ in the navigation bar. Managers will only be able to see collected data of managed projects.

Details

Column	Description	Note
Project Name	The project (usergroup) jobs are associated with	Links to a job list with preset filter returning only jobs of this project
Total Jobs	Project total of all started jobs
Total Walltime	Project total requested walltime
Total Core Hours	Project total of all used core hours
Total Accelerator Hours	Project total of all used accelerator hours	Please Note: This column is always shown, and will return `0` for clusters without installed accelerators

8.5.8 - Tags

Lists Active Tags Used in the Frontend

This view lists all tags currently used within the ClusterCockpit instance:

The Tag Type of the tag(s) is displayed as dark grey header, collecting all tags which share it, with a total count shown on the right.
The Names of all tags sharing one Tag Type, the number of matching jobs per name, and the scope are rendered as pills below the header, colored accordingly (see below).

Each tag’s pill is clickable, and leads to a job list with a preset filter matching only jobs tagged with this specific label.

Tag Scopes

Tags are categorized into three “Scopes” of visibility, and colored accordingly:

Admin (Cyan): Only administrators can create and attach these tags. Only visible for administrators and support personnel.
Global (Purple): Administrators and support personnel can create and attach these tags. Visible for everyone.
Private (Yellow): Everyone can create and attach private tags, only visible to the creator.

Remove Tags

Tags and all their job attachments can be removed from the database if a red X symbol is attached to the tag’s pill. A confirmation popup will appear, after which the tag and all attachments are deleted and the tag is removed from the list.

The following rules apply:

Only Administrators are authorized to remove tags with scopes “global” and “admin” via this functionality in this view.
Managers and Support-Personnel cannot remove “global” and “admin” tags from the database this way.
Every User, including staff, can remove their own “private” tags (but not those of other users).
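The removal rules above can be condensed into a single permission check. The following Go sketch is illustrative only: the type names, constants, and the function `canRemoveTag` are hypothetical and are not taken from the cc-backend source.

```go
package main

import "fmt"

// Role and Scope are hypothetical string types used only for this sketch.
type Role string
type Scope string

const (
	RoleAdmin   Role = "admin"
	RoleSupport Role = "support"
	RoleManager Role = "manager"
	RoleUser    Role = "user"

	ScopeAdmin   Scope = "admin"
	ScopeGlobal  Scope = "global"
	ScopePrivate Scope = "private"
)

// canRemoveTag reports whether a user with the given role may delete a tag
// in the tags view. isOwner states whether the user created the tag.
func canRemoveTag(role Role, scope Scope, isOwner bool) bool {
	switch scope {
	case ScopeAdmin, ScopeGlobal:
		// Only administrators may remove "admin" and "global" tags.
		return role == RoleAdmin
	case ScopePrivate:
		// Private tags can only be removed by their creator, whatever the role.
		return isOwner
	default:
		// Unknown scopes are rejected in this sketch.
		return false
	}
}

func main() {
	fmt.Println(canRemoveTag(RoleAdmin, ScopeGlobal, false))   // administrator, global tag
	fmt.Println(canRemoveTag(RoleSupport, ScopeGlobal, false)) // support staff, global tag
	fmt.Println(canRemoveTag(RoleUser, ScopePrivate, true))    // user, own private tag
	fmt.Println(canRemoveTag(RoleAdmin, ScopePrivate, false))  // administrator, someone else's private tag
}
```

The sketch treats private tags as owner-only even for administrators, which follows from private tags being visible only to their creator.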

Please note: Creating tags and adding/removing them to/from jobs is either done by using the respective REST API calls, or manually from the job view.

8.5.9 - Nodes

Node Based Metric Information of one Cluster

Node Overview

Nodes View. This example shows the last two hours of the ‘clock’ metric of eight nodes. Node ‘f0147’ of the ‘main’ partition has an average below the configured ‘alert’ threshold, and is colored in red.

The node overview is always called in respect to one specified cluster. It displays the current state of all nodes in that cluster in respect to one selected metric, rendered in form of metric plots, and independent of job meta data, i.e. without consideration for job start and end timestamps.

Please note: The X-axis of all plots rendered in this view is relative to the latest data point received from the collector daemon. The time displayed therefore reaches backward, as indicated by negative X-axis labels.

Overview Selection Bar

Selections regarding the display, and update, of the plots rendered in the node table can be performed here:

Find Node: Filter the node table by hostname. Partial queries are possible.
Displayed Timerange: Select the timeframe to be rendered in the node table
- Custom: Select timestamp from and to in which the data should be fetched. It is possible to select date and time.
- 15 Minutes, 30 Minutes, 1 Hour, 2 Hours, 4 Hours, 12 Hours, 24 Hours
Metric: Select the metric to be fetched for all nodes. If no data can be fetched, messages are displayed per node.
(Periodic) Reload: Force reload of fresh data from the backend or set a periodic reload in specified intervals
- 30 Seconds, 60 Seconds, 120 Seconds, 5 Minutes

Node Table

Nodes (hosts) are ordered alphanumerically in this table, rendering the selected metric in the selected timeframe.

Each heading links to the singular node view of the respective host.

Node List

The node list view is also always called in respect to one specified cluster, and optionally, subCluster. It displays the current state of all nodes in that cluster (or subCluster) in respect to a selectable number, and order, of metrics. Plots are rendered in form of metric plots, and are independent of job meta data, i.e. without consideration for job start and end timestamps.

The always visible “Node Info”-Card displays the following information. “List”-Buttons lead to the corresponding views with preset filters.

Field	Example	Description	Destination
Card Header	`Node a0421 Alex A40`	Hostname and Cluster	Node View
Status Indicator	`Status Exclusive`	Indicates the host state via keywords, see below	-
Activity	`2 Jobs`	Number of Jobs currently running on host	Job List
Users	`2 Users`	Number and IDs of users currently running jobs	User Table
Projects	`1 Project`	Number and IDs of projects currently running jobs	Project Table

In order to give an idea of the current node state, the following indicators are possible:

Node Status	Type	Description
Exclusive	Job-Info	One exclusive job is currently running, utilizing all of the nodes’ hardware
Shared	Job-Info	One or more shared jobs are currently running, utilizing allocated amounts of the nodes’ hardware
Allocated	Fallback	If more jobs than one are running, but all jobs are marked as ’exclusive’, this fallback is used
Idle	Job-Info	No currently active jobs
Warning	Warning	At least one of the selected metrics does not return data successfully. Can hint to configuration problems.
Unhealthy	Warning	None of the selected metrics return data successfully. Node could be offline or misconfigured.

Please note: All “Warning States” are estimated on the basis of returned metric data from the metric data repository. No actual hardware health information is queried or handled in any way or form.
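The status table above can be read as a small decision procedure. The Go sketch below is one possible interpretation, not the actual implementation: the `Job` type and the function `nodeStatus` are hypothetical, and the order of precedence (warning states before job information) is an assumption, since the table does not specify it.

```go
package main

import "fmt"

// Job is a hypothetical, reduced job record for this sketch.
type Job struct {
	Exclusive bool
}

// nodeStatus derives the status keyword from the running jobs and from
// metricsOK, which holds per selected metric whether data was returned.
func nodeStatus(jobs []Job, metricsOK []bool) string {
	okCount := 0
	for _, ok := range metricsOK {
		if ok {
			okCount++
		}
	}
	// Assumption: warning states take precedence over job information.
	if len(metricsOK) > 0 {
		if okCount == 0 {
			return "Unhealthy"
		}
		if okCount < len(metricsOK) {
			return "Warning"
		}
	}
	if len(jobs) == 0 {
		return "Idle"
	}
	exclusive := 0
	for _, j := range jobs {
		if j.Exclusive {
			exclusive++
		}
	}
	switch {
	case len(jobs) == 1 && exclusive == 1:
		return "Exclusive"
	case exclusive == len(jobs):
		// More than one job, all marked exclusive: fallback state.
		return "Allocated"
	default:
		return "Shared"
	}
}

func main() {
	allOK := []bool{true, true}
	fmt.Println(nodeStatus([]Job{{Exclusive: true}}, allOK))
	fmt.Println(nodeStatus([]Job{{Exclusive: false}, {Exclusive: false}}, allOK))
	fmt.Println(nodeStatus([]Job{{Exclusive: true}, {Exclusive: true}}, allOK))
	fmt.Println(nodeStatus(nil, allOK))
	fmt.Println(nodeStatus(nil, []bool{true, false}))
	fmt.Println(nodeStatus(nil, []bool{false, false}))
}
```

As the note above states, the warning states in this sketch depend only on whether metric data was returned, not on any hardware health query.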

List Selection Bar

The selection header allows for configuration of the displayed data in terms of selected metrics or timerange.

Field	Example	Description
Metrics	`4 Selected`	Menu for and Number of Metrics currently selected
Resolution	`600`	Resolution of the metric plots rendered for each node
Find Node(s)	`a0421`	Filter for hostnames
Range	`Last 12hrs`	Time range to be displayed as X-Axis
Refresh	`60 Seconds`	Enable automatic refresh of metric plots

Field	Example	Description	Destination
Job Id	`123456`	The JobId of the job assigned by the scheduling daemon. The icon on the right allows for easy copy to clipboard.	Job View

Extended Legend

For nodes with multiple jobs running on them, accelerator metrics are extended by the username and the job-id currently utilizing this hardware ID. This is based on the ID information sent during job-start to cc-backend (Database resources-column).

8.5.10 - Node

All Metrics of One Selected Node

The node view is always called in respect to one specified cluster and one specified node (host). It displays the current state of all metrics for that node, rendered in form of metric plots, and independent of job meta data, i.e. without consideration for job start and end timestamps.

Selection Bar

Information and selections regarding the data of the plots rendered in the node table can be performed here:

Name: The hostname of the selected node
Displayed Timerange: Select the timeframe to be rendered in the node table
- Custom: Select timestamp from and to in which the data should be fetched. It is possible to select date and time.
- 15 Minutes, 30 Minutes, 1 Hour, 2 Hours, 4 Hours, 12 Hours, 24 Hours
Activity: Number of jobs currently allocated to this node. Exclusively used nodes will always display 1 if a job is running at the moment, or 0 if not.
- The “Show List”-Button leads to the job list with a preset filter fetching only jobs currently allocated on this node.
(Periodic) Reload: Force reload of fresh data from the backend or set a periodic reload in specified intervals
- 30 Seconds, 60 Seconds, 120 Seconds, 5 Minutes

Node Table

Metrics are ordered alphanumerically in this table, rendering each metric in the selected timeframe.

8.5.11 - Analysis

Metric Data Analysis View

The analysis view is always called in respect to one specified cluster. It collects and renders data based on the jobs returned by the active filters, which can be specified to a high detail, allowing analysis of specific aspects.

Please note: By default, the requested data is limited by a preset start time filter to jobs started within the last 6 hours. In addition, some results are not calculated when the returned amount of jobs exceeds 500 entries, in order to save on rendering time.

General Information

The general information section of the analysis view is always rendered and consists of the following elements

Totals

Total counts of collected data based on the returned jobs matching the requested filters:

Total Jobs
Total Short Jobs (By default defined as jobs shorter than 5 minutes)
Total Walltime
Total Node Hours
Total Core Hours
Total Accelerator Hours

Top Users and Projects

The ten most active users or projects are rendered in a combination of pie chart and tabular legend with values displayed. By default, the top ten users with the most jobs matching the selected filters will be shown.

Hovering over one of the pie chart fractions will display a legend featuring the identifier and value of the selected parameter.

The selection can be changed directly in the headers of the pie chart and the table, and can be changed to

Element	Options
Pie Chart	`Users, Projects`
Table	`Walltime, Node Hours, Core Hours, Accelerator Hours`

The selection is saved for each user and cluster, and will select the last chosen types of list as default the next time this view is opened.

“User Names” and “Project Codes” are rendered as links, leading to user job lists or project job lists with preset filters for cluster and entity ID.

Please note: The legend colors are fixed by their position, and not by their respective identifier. This means that the orange fraction will always be the largest fraction, even if the contributing user or project changes.

Heatmap Roofline

A roofline plot representing the utilization of available resources as the relation between computation and memory for all jobs matching the filters. In order to represent the data in a meaningful way, the time information of the raw data is abstracted and represented as a heat map, with increasingly red sections of the roofline plot being the most populated regions of utilization.

Histograms

Two histograms depicting the duration and number of allocated cores distributions for the returned jobs matching the filters.

Selectable Data Representations

The second half of the analysis view consists of areas reserved for rendering user-selected data representations.

Select Plots for Histograms: Opens a selector listing all configured metrics of the respective cluster. One or more metrics can be selected, and the data returned will be rendered as average distributions normalized by node hours (core hours, accelerator hours; depending on the metric).
Select Plots in Scatter Plots: Opens a selector which allows selection of user chosen combinations of configured metrics for the respective cluster. Selected duplets will be rendered as scatter bubble plots for each selected pair of metrics.

Analysis View Scatter Selection — Three pairs of metrics are already selected for scatter representation. Remove a selected pair by pressing the ‘x’ button, add a new pair by selecting two metrics from the dropdown menu, and confirm by pressing ‘Add Plot’.

Average Distribution Histograms

Analysis View Average Distributions — Three selected metrics are represented as normalized, average distributions based on returned jobs.

These histograms show the distribution of the normalized averages of all jobs matching the filters, split into 50 bins for high detail.

Normalization is achieved by weighting the selected metric data job averages by node hours (default), or by either accelerator hours (for native accelerator scope metrics) or core hours (for native core scope metrics).
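One plausible reading of this weighting is a weighted histogram in which every job adds its node hours to the bin that contains its metric average. The Go sketch below illustrates that reading only; the `JobAvg` type, the function `weightedHistogram`, the explicit value range, and the final scaling to a sum of 1 are assumptions and may differ from what cc-backend actually computes.

```go
package main

import "fmt"

// JobAvg is a hypothetical record: one job's average of the selected metric
// together with the weight used for normalization.
type JobAvg struct {
	Avg       float64 // job average of the selected metric
	NodeHours float64 // weight, here node hours
}

// weightedHistogram sorts jobs into equal-width bins between lo and hi.
// Each job adds its node hours to its bin; the result is scaled so that
// all bins sum to 1 (assumption made for this sketch).
func weightedHistogram(jobs []JobAvg, lo, hi float64, bins int) []float64 {
	if bins <= 0 || hi <= lo {
		return nil
	}
	hist := make([]float64, bins)
	width := (hi - lo) / float64(bins)
	total := 0.0
	for _, j := range jobs {
		if j.Avg < lo || j.Avg > hi || j.NodeHours <= 0 {
			continue // skip out-of-range values and jobs without weight
		}
		idx := int((j.Avg - lo) / width)
		if idx >= bins {
			idx = bins - 1 // the maximum value belongs to the last bin
		}
		hist[idx] += j.NodeHours
		total += j.NodeHours
	}
	if total > 0 {
		for i := range hist {
			hist[i] /= total
		}
	}
	return hist
}

func main() {
	jobs := []JobAvg{{Avg: 10, NodeHours: 1}, {Avg: 90, NodeHours: 3}}
	hist := weightedHistogram(jobs, 0, 100, 50)
	fmt.Println(len(hist), hist[5], hist[45])
}
```

For metrics with native core or accelerator scope, the same sketch would apply with core hours or accelerator hours as the weight.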

Please note: Metrics that are disabled for specific subclusters in the metric configuration file will be returned as null values if data is requested for the whole cluster, which can affect the rendered distributions. Select a specific partition using the cluster filter to avoid this artifact.

User Defined Scatterplots

Analysis View Scatter Plots — Three user defined scatter plots.

Bubble scatter plots show the position of the averages of two selected metrics in relation to each other.

Each circle represents one job, while the size of a circle is proportional to its node hours. Darker circles mean multiple jobs have the same averages for the respective metric selection.

8.5.12 - Status

Hardware Usage Information

The status view is always called in respect to one specified cluster. It displays the current state of utilization of the respective cluster’s resources, as well as user and project top lists and distribution histograms of the allocated resources per job.

Please note: By default, the periodic reload function is set to 2 Minutes.

Utilization Information

For each subcluster, utilization is displayed in two parts rendered in one row.

Gauges

Simple gauge representation of the current utilization of available resources

Field	Description	Note
Allocated Nodes	Number of nodes currently allocated in respect to maximum available	-
Flop Rate (Any)	Currently achieved flop rate in respect to theoretical maximum	Floprate calculated as `f_any = (f_double x 2) + f_single`
MemBW Rate	Currently achieved memory bandwidth in respect to technical maximum	-
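The flop rate formula from the table can be written out as a short calculation. In this Go sketch the function names `flopRateAny` and `utilization` as well as the sample values are hypothetical; only the formula `f_any = (f_double x 2) + f_single` comes from the table above.

```go
package main

import "fmt"

// flopRateAny combines double- and single-precision rates as
// f_any = (f_double x 2) + f_single.
func flopRateAny(fDouble, fSingle float64) float64 {
	return fDouble*2 + fSingle
}

// utilization returns the achieved rate as a fraction of the peak rate.
// It returns 0 if the peak is not positive.
func utilization(achieved, peak float64) float64 {
	if peak <= 0 {
		return 0
	}
	return achieved / peak
}

func main() {
	// Hypothetical values in GFlop/s.
	fAny := flopRateAny(150.0, 100.0)
	fmt.Printf("f_any = %.1f GFlop/s, gauge = %.0f%%\n", fAny, 100*utilization(fAny, 1000.0))
}
```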

Roofline

A roofline plot representing the utilization of available resources as the relation between computation and memory for each currently allocated, running job at the time of the latest data retrieval. Therefore, no time information is represented (all dots in blue, representing one job each).

Top Users and Projects

The ten most active users or projects are rendered in a combination of pie chart and tabular legend. By default, the top ten users or projects with the most allocated, running jobs are listed.

The selection can be changed directly in the table’s header at Number of ..., and can be changed to

Jobs (Default)
Nodes
Cores
Accelerators

The selection is saved for each user and cluster, and will select the last chosen type of list as default the next time this view is rendered.

Hovering over one of the pie chart fractions will display a legend featuring the identifier and value of the selected parameter.

“User Names” and “Project Codes” are rendered as links, leading to user job lists or project job lists with preset filters for cluster, entity ID, and state == running.

Statistic Histograms

Several histograms depicting the utilization of the cluster’s resources, based on all currently running jobs, are rendered here:

Duration Distribution
Number of Nodes Distribution
Number of Cores Distribution
Number of Accelerators Distribution

Additional Histograms showing specified footprint metrics across all systems can be selected via the “Select histograms” menu next to the refresher tool.

Please note: The metric statistics available for selection here depend on the configuration. All metrics for which the footprint flag is set in the respective metric’s configuration will be shown.

9 - Contribution Guidelines

Articles related to developers

You can find a list with articles related to contributing to ClusterCockpit here.

9.1 - Commit message naming conventions

Special keywords to reference tickets and control release notes

Introduction

ClusterCockpit uses goreleaser for building and uploading releases. In this process the release notes including all notable changes are automatically generated based on special commit message tags. Moreover GitHub will parse special characters and words to link and close issues.

Reference issue tickets

It is good practice to always create a ticket for notable changes. This makes it possible to comment on and discuss source code changes. Any commit that contributes to the ticket should reference the ticket id (in the commit message or description). This is achieved in GitHub by prefixing the ticket id with a number sign character (#):

This change contributes to #235

GitHub will detect if a pull request or commit uses special keywords to close a ticket:

close, closes, closed
fix, fixes, fixed
resolve, resolves, resolved

The ticket will not be closed before the commit appears on the main branch. Example:

This change fixes #423
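To check commit messages locally, for example in a hook, the keyword rule can be approximated with a regular expression. The Go sketch below is an approximation and not GitHub's own parser; the names `closingRef` and `closedIssues` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// closingRef matches one of the nine closing keywords followed by
// "#<number>", ignoring case.
var closingRef = regexp.MustCompile(`(?i)\b(?:close[sd]?|fix(?:es|ed)?|resolve[sd]?)\s+#(\d+)`)

// closedIssues returns the issue numbers (as strings) that a commit
// message would close according to the pattern above.
func closedIssues(msg string) []string {
	var ids []string
	for _, m := range closingRef.FindAllStringSubmatch(msg, -1) {
		ids = append(ids, m[1])
	}
	return ids
}

func main() {
	fmt.Println(closedIssues("This change fixes #423"))
	fmt.Println(closedIssues("This change contributes to #235"))
}
```

The first message contains a closing keyword and yields the ticket id, while the second only references the ticket and yields an empty list.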

Control release notes with preconfigured commit message prefixes

Commits with one of the following prefixes will appear in the release notes:

feat: Mark a commit to contain changes related to new features
fix: Mark a commit to contain changes related to bug fixes
sec: Mark a commit to contain changes related to security fixes
doc: Mark a commit to contain changes related to documentation updates
[feat|fix] dep: Mark a commit that is related to a dependency introduction or change
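A simple prefix check shows which commit subjects would be picked up. This Go sketch is not the goreleaser configuration used by the project; the function `releaseNoteCategory` is hypothetical, and reading `[feat|fix] dep:` as the two literal prefixes `feat dep:` and `fix dep:` is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// releaseNotePrefixes lists the prefixes named in the text. The spelling of
// the dependency prefixes ("feat dep:", "fix dep:") is an assumption.
var releaseNotePrefixes = []string{"feat dep:", "fix dep:", "feat:", "fix:", "sec:", "doc:"}

// releaseNoteCategory returns the matching prefix (without the colon) and
// true if the commit subject would appear in the release notes.
func releaseNoteCategory(subject string) (string, bool) {
	for _, p := range releaseNotePrefixes {
		if strings.HasPrefix(subject, p) {
			return strings.TrimSuffix(p, ":"), true
		}
	}
	return "", false
}

func main() {
	for _, s := range []string{
		"fix: handle empty job list",
		"feat dep: update plotting library",
		"refactor: tidy up handlers",
	} {
		category, ok := releaseNoteCategory(s)
		fmt.Printf("%q -> %q %v\n", s, category, ok)
	}
}
```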

9.2 - Contribute documentation

How to contribute to the documentation website

We use Hugo to format and generate our website, and the Docsy theme for styling and site structure. Hugo is an open-source static site generator that provides us with templates, content organisation in a standard directory structure, and a website generation engine. You write the pages in Markdown (or HTML if you want), and Hugo wraps them up into a website.

All submissions, including submissions by project members, require review. We use GitHub pull requests for this purpose. Consult GitHub Help for more information on using pull requests.

Quick start

Here’s a quick guide to updating the docs. It assumes you’re familiar with the GitHub workflow and you’re happy to use the automated preview of your doc updates:

Fork the cc-docs repo on GitHub.
Make your changes and send a pull request (PR).
If you’re not yet ready for a review, add “WIP” to the PR name to indicate it’s a work in progress.
Preview the website locally as described below.
Continue updating your doc and pushing your changes until you’re happy with the content.
When you’re ready for a review, add a comment to the PR, and remove any “WIP” markers.

Updating a single page

If you’ve just spotted something you’d like to change while using the docs, Docsy has a shortcut for you:

Click Edit this page in the top right hand corner of the page.
If you don’t already have an up to date fork of the project repo, you are prompted to get one - click Fork this repository and propose changes or Update your Fork to get an up to date version of the project to edit. The appropriate page in your fork is displayed in edit mode.

Previewing your changes locally

If you want to run your own local Hugo server to preview your changes as you work:

Follow the instructions in Getting started to install Hugo and any other tools you need. You’ll need at least Hugo version 0.45 (we recommend using the most recent available version), and it must be the extended version, which supports SCSS.
Fork the cc-docs repo into your own project, then create a local copy using git clone. Don’t forget to use --recurse-submodules or you won’t pull down some of the code you need to generate a working site.

git clone --recurse-submodules --depth 1 https://github.com/ClusterCockpit/cc-doc.git

Run hugo server in the site root directory. By default your site will be available at http://localhost:1313/. Now that you’re serving your site locally, Hugo will watch for changes to the content and automatically refresh your site.
Continue with the usual GitHub workflow to edit files, commit them, push the changes up to your fork, and create a pull request.

Creating an issue

If you’ve found a problem in the docs, but you’re not sure how to fix it yourself, please create an issue in the cc-docs repo. You can also create an issue about a specific page by clicking the Create Issue button in the top right hand corner of the page.

Useful resources

Docsy user guide: All about Docsy, including how it manages navigation, look and feel, and multi-language support.
Hugo documentation: Comprehensive reference for Hugo.
GitHub Hello World!: A basic introduction to GitHub concepts and workflow.

9.3 - Docsy example page

Example page to showcase formatting options for docsy.

This is a placeholder page. Replace it with your own content.

Text can be bold, italic, or ~~strikethrough~~. Links should be blue with no underlines (unless hovered over).

There should be whitespace between paragraphs. Vape migas chillwave sriracha poutine try-hard distillery. Tattooed shabby chic small batch, pabst art party heirloom letterpress air plant pop-up. Sustainable chia skateboard art party banjo cardigan normcore affogato vexillologist quinoa meggings man bun master cleanse shoreditch readymade. Yuccie prism four dollar toast tbh cardigan iPhone, tumblr listicle live-edge VHS. Pug lyft normcore hot chicken biodiesel, actually keffiyeh thundercats photo booth pour-over twee fam food truck microdosing banh mi. Vice activated charcoal raclette unicorn live-edge post-ironic. Heirloom vexillologist coloring book, beard deep v letterpress echo park humblebrag tilde.

90’s four loko seitan photo booth gochujang freegan tumeric listicle fam ugh humblebrag. Bespoke leggings gastropub, biodiesel brunch pug fashion axe meh swag art party neutra deep v chia. Enamel pin fanny pack knausgaard tofu, artisan cronut hammock meditation occupy master cleanse chartreuse lumbersexual. Kombucha kogi viral truffaut synth distillery single-origin coffee ugh slow-carb marfa selfies. Pitchfork schlitz semiotics fanny pack, ugh artisan vegan vaporware hexagon. Polaroid fixie post-ironic venmo wolf ramps kale chips.

There should be no margin above this first sentence.
Blockquotes should be a lighter gray with a border along the left side in the secondary color.
There should be no margin below this final sentence.

First Header 2

This is a normal paragraph following a header. Knausgaard kale chips snackwave microdosing cronut copper mug swag synth bitters letterpress glossier craft beer. Mumblecore bushwick authentic gochujang vegan chambray meditation jean shorts irony. Viral farm-to-table kale chips, pork belly palo santo distillery activated charcoal aesthetic jianbing air plant woke lomo VHS organic. Tattooed locavore succulents heirloom, small batch sriracha echo park DIY af. Shaman you probably haven’t heard of them copper mug, crucifix green juice vape single-origin coffee brunch actually. Mustache etsy vexillologist raclette authentic fam. Tousled beard humblebrag asymmetrical. I love turkey, I love my job, I love my friends, I love Chardonnay!

Deae legum paulatimque terra, non vos mutata tacet: dic. Vocant docuique me plumas fila quin afuerunt copia haec o neque.

On big screens, paragraphs and headings should not take up the full container width, but we want tables, code blocks and similar to take the full width.

Scenester tumeric pickled, authentic crucifix post-ironic fam freegan VHS pork belly 8-bit yuccie PBR&B. I love this life we live in.

Second Header 2

This is a blockquote following a header. Bacon ipsum dolor sit amet t-bone doner shank drumstick, pork belly porchetta chuck sausage brisket ham hock rump pig. Chuck kielbasa leberkas, pork bresaola ham hock filet mignon cow shoulder short ribs biltong.

Header 3

This is a code block following a header.

Next level leggings before they sold out, PBR&B church-key shaman echo park. Kale chips occupy godard whatever pop-up freegan pork belly selfies. Gastropub Belinda subway tile woke post-ironic seitan. Shabby chic man bun semiotics vape, chia messenger bag plaid cardigan.

Header 4

This is an unordered list following a header.
This is an unordered list following a header.
This is an unordered list following a header.

Header 5

This is an ordered list following a header.
This is an ordered list following a header.
This is an ordered list following a header.

Header 6

What	Follows
A table	A header
A table	A header
A table	A header

There’s a horizontal rule above and below this.

Here is an unordered list:

Liverpool F.C.
Chelsea F.C.
Manchester United F.C.

And an ordered list:

Michael Brecker
Seamus Blake
Branford Marsalis

And an unordered task list:

Create a Hugo theme
Add task lists to it
Take a vacation

And a “mixed” task list:

Pack bags
?
Travel!

And a nested list:

Jackson 5
- Michael
- Tito
- Jackie
- Marlon
- Jermaine
TMNT
- Leonardo
- Michelangelo
- Donatello
- Raphael

Definition lists can be used with Markdown syntax. Definition headers are bold.

Name: Godzilla
Born: 1952
Birthplace: Japan
Color: Green

Tables should have bold headings and alternating shaded rows.

Artist	Album	Year
Michael Jackson	Thriller	1982
Prince	Purple Rain	1984
Beastie Boys	License to Ill	1986

If a table is too wide, it should scroll horizontally.

Artist	Album	Year	Label	Awards	Songs
Michael Jackson	Thriller	1982	Epic Records	Grammy Award for Album of the Year, American Music Award for Favorite Pop/Rock Album, American Music Award for Favorite Soul/R&B Album, Brit Award for Best Selling Album, Grammy Award for Best Engineered Album, Non-Classical	Wanna Be Startin’ Somethin’, Baby Be Mine, The Girl Is Mine, Thriller, Beat It, Billie Jean, Human Nature, P.Y.T. (Pretty Young Thing), The Lady in My Life
Prince	Purple Rain	1984	Warner Brothers Records	Grammy Award for Best Score Soundtrack for Visual Media, American Music Award for Favorite Pop/Rock Album, American Music Award for Favorite Soul/R&B Album, Brit Award for Best Soundtrack/Cast Recording, Grammy Award for Best Rock Performance by a Duo or Group with Vocal	Let’s Go Crazy, Take Me With U, The Beautiful Ones, Computer Blue, Darling Nikki, When Doves Cry, I Would Die 4 U, Baby I’m a Star, Purple Rain
Beastie Boys	License to Ill	1986	Mercury Records	noawardsbutthistablecelliswide	Rhymin & Stealin, The New Style, She’s Crafty, Posse in Effect, Slow Ride, Girls, (You Gotta) Fight for Your Right, No Sleep Till Brooklyn, Paul Revere, Hold It Now, Hit It, Brass Monkey, Slow and Low, Time to Get Ill

Code snippets like var foo = "bar"; can be shown inline.

Also, this should vertically align ~~with this~~ ~~and this~~.

Code can also be shown in a block element.

foo := "bar";
bar := "foo";

Code can also use syntax highlighting.

func main() {
  input := `var foo = "bar";`

  lexer := lexers.Get("javascript")
  iterator, _ := lexer.Tokenise(nil, input)
  style := styles.Get("github")
  formatter := html.New(html.WithLineNumbers())

  var buff bytes.Buffer
  formatter.Format(&buff, style, iterator)

  fmt.Println(buff.String())
}

Long, single-line code blocks should not wrap. They should horizontally scroll if they are too long. This line should be long enough to demonstrate this.

Inline code inside table cells should still be distinguishable.

Language	Code
Javascript	`var foo = "bar";`
Ruby	`foo = "bar"`

Small images should be shown at their actual size.

Large images should always scale down and fit in the content container.

The photo above of the Spruce Picea abies shoot with foliage buds: Bjørn Erik Pedersen, CC-BY-SA.

Components

Alerts

This is an alert.

Note

This is an alert with a title.

Note

This is an alert with a title and Markdown.

This is a successful alert.

This is a warning.

Warning

This is a warning with a title.

Another Heading

Add some sections here to see how the ToC looks like. Bacon ipsum dolor sit amet t-bone doner shank drumstick, pork belly porchetta chuck sausage brisket ham hock rump pig. Chuck kielbasa leberkas, pork bresaola ham hock filet mignon cow shoulder short ribs biltong.

This Document

Inguina genus: Anaphen post: lingua violente voce suae meus aetate diversi. Orbis unam nec flammaeque status deam Silenum erat et a ferrea. Excitus rigidum ait: vestro et Herculis convicia: nitidae deseruit coniuge Proteaque adiciam eripitur? Sitim noceat signa probat quidem. Sua longis fugatis quidem genae.

Pixel Count

Tilde photo booth wayfarers cliche lomo intelligentsia man braid kombucha vaporware farm-to-table mixtape portland. PBR&B pickled cornhole ugh try-hard ethical subway tile. Fixie paleo intelligentsia pabst. Ennui waistcoat vinyl gochujang. Poutine salvia authentic affogato, chambray lumbersexual shabby chic.

Contact Info

Plaid hell of cred microdosing, succulents tilde pour-over. Offal shabby chic 3 wolf moon blue bottle raw denim normcore poutine pork belly.

External Links

Stumptown PBR&B keytar plaid street art, forage XOXO pitchfork selvage affogato green juice listicle pickled everyday carry hashtag. Organic sustainable letterpress sartorial scenester intelligentsia swag bushwick. Put a bird on it stumptown neutra locavore. IPhone typewriter messenger bag narwhal. Ennui cold-pressed seitan flannel keytar, single-origin coffee adaptogen occupy yuccie williamsburg chillwave shoreditch forage waistcoat.

This is the final element on the page and there should be no margin below this.

9.4 - Tips for cc-backend frontend development

How to set up cc-backend for easier frontend development

ClusterCockpit web frontend

The frontend assets, including the Svelte JS files, are embedded in the Go binary by default. To enable a quick turnaround cycle for web development of the frontend, disable embedding of static assets in config.json:

"embed-static-files": false,
"static-files": "./web/frontend/public/",

Start the node build process (in directory ./web/frontend) in development mode:

> npm run dev

This will start the build process in listen mode. Whenever you change a source file, the dependent JavaScript targets will be rebuilt automatically. In case the JavaScript files are minified, you may need to set the production flag to false by hand in ./web/frontend/rollup.config.mjs:

const production = false

Usually this should work automatically.

Because the files are still served by ./cc-backend you have to reload the view explicitly in your browser.

A common setup is to have three terminals open:

One running cc-backend (working directory repository root): ./cc-backend -server -dev
Another running npm in developer mode (working directory ./web/frontend): npm run dev
And the last one editing the frontend source files

9.5 - How to prepare a new release

Steps to prepare releases with goreleaser

Steps to prepare a release

On hotfix branch:
- Update ReleaseNotes.md
- Update version in Makefile
- Commit, push, and pull request
- Merge into master
On Linux host:
- Pull master
- Ensure that the GitHub token environment variable GITHUB_TOKEN is set
- Create release tag: git tag v1.1.0 -m release
- Execute goreleaser release
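
Two of the steps above (the GITHUB_TOKEN check and the tag name) are easy to get wrong, so a small pre-flight check can help. The following Go sketch is only an illustration and not part of the ClusterCockpit tooling: the function checkRelease is hypothetical, and the vMAJOR.MINOR.PATCH tag format is an assumption derived from the example tag v1.1.0.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// tagPattern accepts tags such as v1.1.0. The exact format is an assumption
// based on the example tag in the release steps.
var tagPattern = regexp.MustCompile(`^v\d+\.\d+\.\d+$`)

// checkRelease verifies the preconditions before tagging and running goreleaser.
// getenv is injected (normally os.Getenv) so that the check can be tested
// without touching the real environment.
func checkRelease(tag string, getenv func(string) string) error {
	if !tagPattern.MatchString(tag) {
		return fmt.Errorf("tag %q does not look like vMAJOR.MINOR.PATCH", tag)
	}
	if getenv("GITHUB_TOKEN") == "" {
		return errors.New("GITHUB_TOKEN is not set")
	}
	return nil
}
```

Calling checkRelease("v1.1.0", os.Getenv) before creating the tag reports a missing token or a malformed tag name early, before goreleaser is started.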

9.6 - Unit tests

How to do software testing

Overview

We use the standard Go testing environment.

The following conventions are used:

White box unit tests: Tests for internal functionality are placed in files
Black box unit tests: Tests for public interfaces are placed in files with <package name>_test.go and belong to the package <package_name>_test. There is only one package test file per package.
Integration tests: Tests that use multiple components are placed in a package test file. These are named <package name>_test.go and belong to the package <package_name>_test.
Test assets: Any required files are placed in a directory ./testdata within each package directory.
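
As an illustration of what such a test can look like, here is a minimal table-driven test that uses only the standard testing package. The function normalizeMetricName is a hypothetical example and not part of cc-backend. For brevity the function and its test share one file and one package. A black box test following the conventions above would instead live in <package name>_test.go, declare the package <package_name>_test, and call only exported identifiers.

```go
package main

import (
	"strings"
	"testing"
)

// normalizeMetricName is a hypothetical function under test;
// it trims surrounding whitespace and lower-cases the name.
func normalizeMetricName(s string) string {
	return strings.ToLower(strings.TrimSpace(s))
}

// TestNormalizeMetricName runs one subtest per table entry, so a failure
// is reported together with the name of the failing case.
func TestNormalizeMetricName(t *testing.T) {
	cases := []struct {
		name, in, want string
	}{
		{"already normalized", "flops_any", "flops_any"},
		{"upper case", "MEM_BW", "mem_bw"},
		{"surrounding spaces", "  cpu_load ", "cpu_load"},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := normalizeMetricName(tc.in); got != tc.want {
				t.Errorf("normalizeMetricName(%q) = %q, want %q", tc.in, got, tc.want)
			}
		})
	}
}
```

Placed in a file whose name ends in _test.go, the test is picked up by go test ./... together with all other tests.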

Executing tests

Visual Studio Code has very good Go test integration and is the recommended solution for debugging a test.

The provided Makefile has a test target that executes:

> go clean -testcache
> go build ./...
> go vet ./...
> go test ./...

The commands can also be run directly on the command line. For details about Go testing, refer to the standard documentation of the testing package and the go test command.
