Tutorials

Detailed step-by-step lessons on how to configure and deploy ClusterCockpit

1 - Plan overall ClusterCockpit architecture

How to decide on communication and data flows

Introduction

When deploying ClusterCockpit in production, two key architectural decisions need to be made:

  1. Transport mechanism: How metrics flow from collectors to the metric store (REST API vs NATS)
  2. Metric store deployment: Where the metric store runs (internal to cc-backend vs external standalone)

This guide helps you understand the trade-offs to make informed decisions based on your cluster size, administrative capabilities, and requirements.

Transport: REST API vs NATS

The cc-metric-collector can send metrics to cc-metric-store using either direct HTTP REST API calls or via NATS messaging.

REST API Transport

With REST transport, each collector node sends HTTP POST requests directly to the metric store endpoint.

┌─────────────┐     HTTP POST      ┌──────────────────┐
│  Collector  │ ─────────────────► │  cc-metric-store │
│   (Node 1)  │                    │                  │
└─────────────┘                    │                  │
┌─────────────┐     HTTP POST      │                  │
│  Collector  │ ─────────────────► │                  │
│   (Node 2)  │                    └──────────────────┘
└─────────────┘
      ...
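
For illustration, a single metric write via the REST API might look like the following sketch. Host, port, cluster name, and token are placeholders; the /api/write endpoint with a cluster query parameter matches the sink configuration shown later in the cc-metric-collector setup section, and the payload uses InfluxDB line protocol, on which the ccMessage format is based:

# Hedged example: host, port, cluster name and JWT are placeholders
curl -X POST "http://cc-metric-store:8082/api/write?cluster=mycluster" \
  -H "Authorization: Bearer $JWT" \
  --data-binary "cpu_load,hostname=node01,type=node value=1.23"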

Advantages:

  • Simple setup with no additional infrastructure
  • Direct point-to-point communication
  • Easy to debug and monitor
  • Works well for smaller clusters (< 500 nodes)

Disadvantages:

  • Each collector needs direct network access to metric store
  • No buffering: if metric store is unavailable, metrics are lost
  • Scales linearly with node count (many concurrent connections)
  • Higher load on metric store during burst scenarios

NATS Transport

With NATS, collectors publish metrics to a NATS server, and the metric store subscribes to receive them.

┌─────────────┐                    ┌─────────────┐
│  Collector  │ ──► publish ──►    │             │
│   (Node 1)  │                    │             │
└─────────────┘                    │    NATS     │     subscribe     ┌──────────────────┐
┌─────────────┐                    │   Server    │ ◄───────────────► │  cc-metric-store │
│  Collector  │ ──► publish ──►    │             │                   └──────────────────┘
│   (Node 2)  │                    │             │
└─────────────┘                    └─────────────┘
      ...

Advantages:

  • Decoupled architecture: collectors and metric store are independent
  • Built-in buffering and message persistence (with JetStream)
  • Better scalability for large clusters (1000+ nodes)
  • Supports multiple subscribers (e.g., external metric store for redundancy)
  • Collectors continue working even if metric store is temporarily down
  • Lower connection overhead (single connection per collector to NATS)
  • Integrated key management via NKeys (Ed25519-based authentication):
    • No need to generate and distribute JWT tokens to each collector
    • Centralized credential management in NATS server configuration
    • Support for accounts with fine-grained publish/subscribe permissions
    • Credential revocation without redeploying collectors
    • Simpler key rotation compared to JWT token redistribution

Disadvantages:

  • Additional infrastructure component to deploy and maintain
  • More complex initial setup and configuration
  • Additional point of failure (NATS server)
  • Requires NATS expertise for troubleshooting

Recommendation

Cluster Size     Recommended Transport
< 200 nodes      REST API
200-500 nodes    Either (based on preference)
> 500 nodes      NATS

For large clusters or environments requiring high availability, NATS provides better resilience and scalability. For smaller deployments or when minimizing complexity is important, REST API is sufficient.

Metric Store: Internal vs External

The cc-metric-store storage engine can run either integrated within cc-backend (internal) or as a separate standalone service (external).

Internal Metric Store

The metric store runs as part of the cc-backend process, sharing the same configuration and lifecycle.

┌────────────────────────────────────────┐
│              cc-backend                │
│  ┌──────────────┐  ┌────────────────┐  │
│  │   Web UI &   │  │  metric-store  │  │
│  │   GraphQL    │  │    (internal)  │  │
│  └──────────────┘  └────────────────┘  │
└────────────────────────────────────────┘
         │                    ▲
         ▼                    │
    ┌─────────┐          ┌─────────┐
    │ Browser │          │Collector│
    └─────────┘          └─────────┘

Advantages:

  • Single process to deploy and manage
  • Unified configuration
  • Simplified networking (metrics received on same endpoint)
  • Lower resource overhead
  • Easier initial setup

Disadvantages:

  • Metric store restart requires cc-backend restart
  • Resource contention between web serving and metric ingestion
  • No horizontal scaling of metric ingestion
  • Single point of failure for entire system

External Metric Store

The metric store runs as a separate process, communicating with cc-backend via its REST API.

┌──────────────────┐         ┌──────────────────┐
│    cc-backend    │ ◄─────► │  cc-metric-store │
│   (Web UI/API)   │  query  │    (external)    │
└──────────────────┘         └──────────────────┘
         │                            ▲
         ▼                            │
    ┌─────────┐                  ┌─────────┐
    │ Browser │                  │Collector│
    └─────────┘                  └─────────┘

Advantages:

  • Independent scaling and resource allocation
  • Can restart metric store without affecting web interface
  • Enables redundancy with multiple metric store instances
  • Better isolation for security and resource management
  • Can run on dedicated hardware optimized for in-memory workloads

Disadvantages:

  • Two processes to deploy and manage
  • Separate configuration files
  • Additional network communication between components
  • More complex setup and monitoring

Recommendation

Scenario                              Recommended Deployment
Development/Testing                   Internal
Small production (< 200 nodes)        Internal
Medium production (200-1000 nodes)    Either
Large production (> 1000 nodes)       External
Resource-constrained head node        External (on dedicated host)

Security Considerations

Network Exposure

Component       REST Transport                    NATS Transport
Metric Store    Exposed to all collector nodes    Only exposed to NATS server
NATS Server     N/A                               Exposed to all collectors and metric stores
cc-backend      Exposed to users                  Exposed to users

With NATS, the metric store can be isolated from the compute network, reducing the attack surface. The NATS server becomes the single point of ingress for metrics. Another option to isolate the web backend from the compute network is to set up cc-metric-collector proxies.

Authentication

  • REST API: Uses JWT tokens (Ed25519 signed). Each collector needs a valid token configured and distributed to it.
  • NATS: Supports multiple authentication methods:
    • Username/password (simple, suitable for smaller deployments)
    • NKeys (Ed25519 key pairs managed centrally in NATS server)
    • Credential files (.creds) for decentralized authentication
    • Accounts for multi-tenancy with isolated namespaces

NKeys Advantage: With NATS NKeys, authentication keys are managed in the NATS server configuration rather than distributed to each collector. This simplifies credential management significantly:

  • Add/remove collectors by editing NATS server config
  • Revoke access instantly without touching collector nodes
  • No JWT token expiration to manage
  • Keys can be scoped to specific subjects (publish-only for collectors)
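
As a sketch of this workflow (assuming the nk key generation tool from the nats-io project; the public key, subject, and permissions are placeholders):

# Generate an Ed25519 user NKey pair; the seed (private key) stays on the collector
nk -gen user -pubout

# nats-server.conf: authorize the collector's public NKey for publishing only
authorization {
  users = [
    { nkey: "UA...COLLECTOR_PUBLIC_KEY...", permissions: { publish: ["hpc-metrics.>"] } }
  ]
}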

For both transports, ensure:

  • Keys are properly generated and securely stored
  • TLS is enabled for production deployments
  • Network segmentation isolates monitoring traffic

Privilege Separation

Both cc-backend and the external cc-metric-store support dropping privileges after binding to privileged ports (via user and group configuration). This limits the impact of potential vulnerabilities.

Performance Considerations

Memory Usage

The metric store keeps data in memory based on retention-in-memory. Memory usage scales with:

  • Number of nodes
  • Number of metrics per node
  • Number of hardware scopes (cores, sockets, accelerators)
  • Retention duration
  • Metric frequency

For a 1000-node cluster with 20 metrics at 60-second intervals and 48-hour retention, expect approximately 10-20 GB of memory usage. For larger setups with many core-level metrics this can grow to 100 GB, which must fit into main memory.
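
As a back-of-envelope estimate for these numbers (assuming 8 bytes per sample and an average fan-out of roughly 32 series per node-level metric once core and socket scopes are included; both values are assumptions):

samples per series = retention / frequency = 48 h / 60 s = 2880
series             = nodes × metrics × scope fan-out = 1000 × 20 × 32 = 640,000
memory             ≈ 640,000 series × 2880 samples × 8 B ≈ 15 GB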

CPU Usage

  • Internal: Competes with cc-backend web serving
  • External: Dedicated resources for metric processing

For clusters with high query load (many users viewing job details), external deployment prevents metric ingestion from impacting user experience.

Disk I/O

Checkpoints are written periodically. For large deployments:

  • Use fast storage (SSD) for checkpoint directory
  • Consider separate disks for checkpoints and archives
  • Monitor disk space for archive growth

Example Configurations

Small Cluster (Internal + REST)

Single cc-backend with internal metric store, collectors using REST:

// cc-backend config
{
  "metricstore": {
    "enabled": true,
    "checkpoints": {
      "interval": "12h",
      "directory": "./var/checkpoints"
    },
    "retention-in-memory": "48h"
  }
}
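
A matching cc-metric-collector HTTP sink might look like the following sketch. Host, port, cluster name, and token are placeholders; with the internal metric store, collectors send their metrics to the cc-backend endpoint:

// collector sink config (sketch)
{
  "mystore": {
    "type": "http",
    "url": "http://cc-backend-host:8080/api/write?cluster=mycluster",
    "jwt": "XYZ"
  }
}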

Large Cluster (External + NATS)

Separate cc-metric-store with NATS transport:

// cc-metric-store config
{
  "main": {
    "addr": "0.0.0.0:8080",
    "jwt-public-key": "..."
  },
  "nats": {
    "address": "nats://nats-server:4222",
    "username": "ccms",
    "password": "..."
  },
  "metric-store": {
    "retention-in-memory": "48h",
    "memory-cap": 80,
    "checkpoints": {
      "interval": "12h",
      "directory": "/data/checkpoints"
    },
    "cleanup": {
      "mode": "archive",
      "interval": "48h",
      "directory": "/data/archive"
    },
    "nats-subscriptions": [
      {
        "subscribe-to": "hpc-metrics",
        "cluster-tag": "mycluster"
      }
    ]
  }
}

Decision Checklist

Use this checklist to guide your architecture decision:

  • Cluster size: How many nodes need monitoring?
  • Availability requirements: Is downtime acceptable?
  • Administrative capacity: Can you manage additional services?
  • Network topology: Can collectors reach the metric store directly?
  • Resource constraints: Is the head node resource-limited?
  • Security requirements: Do you need network isolation?
  • Growth plans: Will the cluster expand significantly?

For most new deployments, starting with internal metric store and REST transport is recommended. You can migrate to external deployment and/or NATS later as needs evolve.

2 - ClusterCockpit installation manual

How to plan and configure a ClusterCockpit installation

Introduction

ClusterCockpit requires the following components:

  • A node agent running on all compute nodes that measures the required metrics and forwards all data to a time series metrics database. ClusterCockpit provides its own node agent, cc-metric-collector. This is the recommended setup, but ClusterCockpit can also be integrated with other node agents, e.g. collectd, Prometheus or Telegraf. In this case you have to use the agent together with its accompanying time series database and ensure the metric data is sent or forwarded to cc-backend.
  • The API and web interface backend cc-backend. Only one instance of cc-backend is required. It provides the HTTP server at the desired monitoring URL for serving the web interface and also integrates an in-memory metric store.
  • A SQL database. The only supported option is the built-in SQLite database of ClusterCockpit. You can set up Litestream as a service that performs continuous replication of the SQLite database to multiple storage backends (a minimal replication sketch follows this list).
  • (Optional) Metric store: One or more cc-metric-store instances. Advantages of using an external cc-metric-store are:
    • Independent scaling and resource allocation
    • Can restart the metric store without affecting the web interface, and vice versa
    • Enables redundancy with multiple metric store instances
    • Better isolation for security and resource management
    • Can run on dedicated hardware optimized for in-memory workloads
  • (Optional) NATS message broker: Apart from REST APIs, ClusterCockpit also supports NATS as a way to connect components. Using NATS brings a number of advantages:
    • More flexible deployment and testing: instances can have different URLs or IP addresses, and test instances are easy to deploy in parallel without touching the configuration.
    • NATS comes with sophisticated built-in key management. This also makes it possible to restrict authorization to specific subjects.
    • NATS may provide a larger message throughput compared to REST over HTTP.
    • Upcoming ClusterCockpit components such as the Energy Manager require NATS.
  • A batch job scheduler adapter that provides the job meta information to cc-backend. This is done by using the provided REST or NATS API for starting and stopping jobs. Currently available adapters:
    • Slurm: Golang based solution (cc-slurm-adapter) maintained by NHR@FAU. This is the recommended option in case you use Slurm. All options in cc-backend are supported.
    • Slurm: Python based solution (cc-slurm-sync) maintained by PC2 Paderborn
    • HTCondor: cc-condor-sync maintained by Saarland University
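
If you use Litestream, a minimal configuration for continuous SQLite replication might look like the following sketch. The database path and bucket URL are placeholders:

# litestream.yml (sketch)
dbs:
  - path: /opt/monitoring/cc-backend/var/job.db
    replicas:
      - url: s3://my-backup-bucket/cc-backend-db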

Server Hardware

cc-backend is threaded and therefore profits from multiple cores. Enough memory is required to hold the metric data cache; for most setups 128 GB should be enough. You can set an upper limit for the memory capacity used by the internal in-memory metric cache. It is possible to run cc-backend in a virtual machine. For best performance the ./var folder of cc-backend, which contains the sqlite database file and the file-based job archive, should be located on a fast storage device, ideally an NVMe SSD. The sqlite database file and the job archive grow over time (unless you remove old jobs using a retention policy). Our setup covering multiple clusters over 5 years takes 75 GB for the sqlite database and around 1.4 TB for the job archive. In case you have very high job counts, we recommend using a retention policy to keep the database and the job archive at a manageable size. If you archive old jobs, the database can easily be restored using cc-backend. We run cc-backend as a systemd service.

Planning and initial configuration

We recommend the following order for planning and configuring a ClusterCockpit installation:

  1. Decide on overall setup: Initially you have to decide on some fundamental design options about how the components communicate with each other and how the data flows from the compute nodes to the backend.
  2. Setup your metric list: With two exceptions you are in general free to choose which metrics you want. Those exceptions are mem_bw for main memory bandwidth and flops_any for flop throughput (double precision flops are upscaled to single precision rates). The metric list is an integral component of the configuration of all ClusterCockpit components.
  3. Planning of deployment
  4. Configure and deploy cc-metric-collector
  5. Configure and deploy cc-backend
  6. Configure and deploy cc-slurm-adapter or another job scheduler adapter of your choice

You can find complete example production configurations in the cc-examples repository.

Common problems

Up front, here is a list of common issues people face when installing ClusterCockpit for the first time.

Inconsistent metric names across components

At the moment you need to configure the metric list in every component separately. In cc-metric-collector the metrics that are sent to cc-backend are determined by the collector configuration and possible renaming in the router configuration. In cc-backend you need to create a cluster.json configuration in the job archive for every cluster. There you set up which metrics are shown in the web frontend, including many additional properties for the metrics. For running jobs cc-backend will query the internal metric store for exactly those metric names, and if there is no match there will be an error.

We provide a JSON schema based specification as part of the job meta and metric schema. This specification recommends a minimal set of metrics, and we suggest using the metric names provided there. While it is up to you whether you adhere to the metric names suggested in the schema, there are two exceptions: mem_bw (main memory bandwidth) and flops_any (total flop rate with DP flops scaled to SP flops) are required for the roofline plots to work.

Inconsistent device naming between cc-metric-collector and batch job scheduler adapter

The batch job scheduler adapter (e.g. cc-slurm-adapter) provides the list of resources that are used by a job. cc-backend will query the internal metric store with exactly those resource IDs to get all metrics for a job. As a consequence, if cc-metric-collector uses a different naming scheme, the metrics will not be found.

If you have GPU accelerators, cc-slurm-adapter should use the PCI-E device addresses as IDs. The option gpuPciAddrs for the nvidia and rocm-smi collectors in the collector configuration must be configured. To validate and debug problems you can use the cc-backend debug endpoint:

curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/debug"

This will return the current state of cc-metric-store. You can search for a hostname and then inspect all topology leaf nodes that are available for it.

Missing nodes in subcluster node lists

ClusterCockpit supports multiple subclusters as part of a cluster. A subcluster in this context is a homogeneous hardware partition with a dedicated metric and device configuration. cc-backend dynamically matches the nodes a job runs on against the subcluster node lists to figure out on which subcluster the job is running. If nodes are missing from a subcluster node list, this matching fails and the metric list used may be wrong.
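
For example, a subcluster entry in cluster.json pins down its node list with a node-range expression. The following abbreviated sketch uses placeholder names and ranges and omits the other required subcluster attributes:

// excerpt from cluster.json (sketch)
{
  "subClusters": [
    {
      "name": "main",
      "nodes": "node[0001-0964]"
    }
  ]
}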

3 - Decide on metric list

Planning and naming the metrics

Introduction

Deciding on a sensible and meaningful set of metrics is a deciding factor for how useful the monitoring will be. As part of a collaborative project, several academic HPC centers came up with a minimal set of metrics including their naming. Using consistent naming is crucial for establishing what metrics mean, and we urge you to adhere to the metric names suggested there. You can find this list as part of the ClusterCockpit job data structure JSON schemas.

ClusterCockpit supports multiple clusters within one instance of cc-backend. You have to create a separate metric list for each of them. In cc-backend the metric lists are provided as part of the cluster configuration. Every cluster is configured as part of the job archive using one cluster.json file per cluster. This how-to describes in detail how to create a cluster.json file.

Required Metrics

Flop throughput rate: flops_any

Memory bandwidth: mem_bw

Memory capacity used: mem_used

Requested CPU core utilization: cpu_load

Total fast network bandwidth: net_bw

Total file IO bandwidth: file_bw

Instruction throughput per cycle: ipc

User active CPU core utilization: cpu_user

Double precision flop throughput rate: flops_dp

Single precision flop throughput rate: flops_sp

Average core frequency: clock

CPU power consumption: rapl_power

GPU utilization: acc_used

GPU memory capacity used: acc_mem_used

GPU power consumption: acc_power

Ethernet read bandwidth: eth_read_bw

Ethernet write bandwidth: eth_write_bw

Fast network read bandwidth: ic_read_bw

Fast network write bandwidth: ic_write_bw
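
In cluster.json each of these metrics becomes an entry in the cluster's metric list. A single entry might look like the following sketch; the unit, scope, and threshold values (peak, normal, caution, alert) are site-specific placeholders:

// excerpt from cluster.json (sketch)
{
  "name": "flops_any",
  "unit": { "base": "F/s", "prefix": "G" },
  "scope": "node",
  "timestep": 60,
  "aggregation": "sum",
  "peak": 5600,
  "normal": 1000,
  "caution": 100,
  "alert": 10
}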

File system metrics

In the schema a tree of file system metrics is suggested. This allows providing a similar set of metrics for the different file systems used in a cluster. The suggested file system type names are:

  • nfs
  • lustre
  • gpfs
  • nvme
  • ssd
  • hdd
  • beegfs

File system read bandwidth: read_bw

File system write bandwidth: write_bw

File system read requests: read_req

File system write requests: write_req

File system inodes used: inodes

File system open and close: accesses

File system file syncs: fsync

File system file creates: create

File system file open: open

File system file close: close

File system file seeks: seek

4 - Deployment

Plan and implement a deployment workflow

Deployment

It is recommended to install all ClusterCockpit components in a common directory, e.g. /opt/monitoring, /var/monitoring or /var/clustercockpit. In the following we use /opt/monitoring.

Two Systemd services run on the central monitoring server:

  • clustercockpit: binary cc-backend in /opt/monitoring/cc-backend.
  • cc-metric-store: binary cc-metric-store in /opt/monitoring/cc-metric-store.

ClusterCockpit is deployed as a single binary that embeds all static assets. We recommend keeping all cc-backend binary versions in a folder archive and linking the currently active one from the cc-backend root. This allows for easy roll-back in case something doesn’t work.

Workflow to deploy new version

This example assumes the DB and job archive versions did not change.

  • Stop systemd service:
sudo systemctl stop clustercockpit.service
  • Back up the sqlite DB file! This is as simple as copying it.
  • Copy new cc-backend binary to /opt/monitoring/cc-backend/archive (Tip: Use a date tag like YYYYMMDD-cc-backend). Here is an example:
cp ~/cc-backend /opt/monitoring/cc-backend/archive/20231124-cc-backend
  • Link from cc-backend root to current version
ln -s /opt/monitoring/cc-backend/archive/20231124-cc-backend /opt/monitoring/cc-backend/cc-backend
  • Start systemd service:
sudo systemctl start clustercockpit.service
  • Check if everything is ok:
sudo systemctl status clustercockpit.service
  • Check log for issues:
sudo journalctl -u clustercockpit.service
  • Check the ClusterCockpit web frontend and your Slurm adapters if anything is broken!
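
A minimal unit file for the clustercockpit service might look like the following sketch. Paths and the -server flag reflect a typical setup and should be checked against your installation:

# /etc/systemd/system/clustercockpit.service (sketch)
[Unit]
Description=ClusterCockpit web backend (cc-backend)
After=network.target

[Service]
User=clustercockpit
Group=clustercockpit
WorkingDirectory=/opt/monitoring/cc-backend
ExecStart=/opt/monitoring/cc-backend/cc-backend -server
Restart=on-failure

[Install]
WantedBy=multi-user.target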

5 - Setup of cc-metric-store

How to configure and deploy cc-metric-store

Introduction

The cc-metric-store provides an in-memory metric time-series database. It is configured via a JSON configuration file (config.json). Metrics are received via messages using the ClusterCockpit ccMessage protocol. It can receive messages via an HTTP REST API or by subscribing to a NATS subject. Requesting data is possible via an HTTP REST API.

Configuration

For a complete list of configuration options see here. The configuration is organized into four main sections: main, metrics, nats, and metric-store.

Minimal example of a configuration file:

{
  "main": {
    "addr": "localhost:8080",
    "jwt-public-key": "kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0="
  },
  "metrics": {
    "clock": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "mem_bw": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "flops_any": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "flops_dp": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "flops_sp": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "mem_used": {
      "frequency": 60,
      "aggregation": null
    }
  },
  "metric-store": {
    "checkpoints": {
      "interval": "12h",
      "directory": "./var/checkpoints"
    },
    "memory-cap": 100,
    "retention-in-memory": "48h",
    "cleanup": {
      "mode": "archive",
      "interval": "48h",
      "directory": "./var/archive"
    }
  }
}

Main Section

The main section specifies the address and port on which the server should listen (addr). Optionally, for HTTPS, paths to TLS cert and key files can be specified via https-cert-file and https-key-file. If using a privileged port (e.g., 443), you can specify user and group to drop root permissions after binding. The backend-url option allows connecting to cc-backend for querying job information. The REST API uses JWT token based authentication. The option jwt-public-key provides the Ed25519 public key to verify signed JWT tokens.

Metrics Section

The cc-metric-store will only accept metrics that are specified in its metric list. The metric names must match exactly! The frequency for a metric specifies how incoming values are binned. If multiple values are received in the same interval, older values are overwritten; if no value is received in an interval, there is a gap. cc-metric-store can aggregate metrics across topological entities, e.g., to compute an aggregate node scope value from core scope metrics. The aggregation attribute specifies how the aggregate value is computed. Resource metrics usually require sum, whereas diagnostic metrics (e.g., clock) require avg; for clock a sum would obviously make no sense. Metrics that are only available at node scope should set aggregation to null.

Metric-Store Section

The most important configuration option is the retention-in-memory setting. It specifies how far back in time metrics are kept available. This should be long enough to cover common job durations plus a safety margin. The memory-cap option sets the maximum memory percentage to use. The memory footprint scales with the number of nodes, the number of native metric scopes (cores, sockets), the number of metrics, and the memory retention time divided by the frequency.

The cc-metric-store supports checkpoints and cleanup/archiving. Checkpoints are always performed on shutdown. To avoid losing data in a crash or other failure, checkpoints are also written regularly at fixed intervals. Checkpoints that are no longer needed can either be archived (moved and compressed) or deleted, controlled by the cleanup.mode setting (archive or delete). The cleanup happens at the interval specified in cleanup.interval. You may want to set up a cron job to delete older archive files.
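
Such a cron job might look like the following sketch; the archive path and retention age are placeholders:

# crontab entry (sketch): every Sunday at 03:00, delete archive files older than one year
0 3 * * 0  find ./var/archive -type f -mtime +365 -delete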

Authentication

For authentication, signed (but unencrypted) JWT tokens are used. Only Ed25519/EdDSA cryptographic key pairs are supported. A client has to sign the token with its private key; on the server side it is checked that the configured public key matches the private key the token was signed with, that the token was not altered after signing, and that it has not expired. All other token attributes are ignored.

We provide an article on how to generate JWTs. There is also a background article on JWT usage in ClusterCockpit. Tokens are cached in cc-metric-store to minimize overhead.

NATS

As an alternative to HTTP REST, cc-metric-store can also receive metrics via NATS. You can find more information about NATS in this background article.

To enable NATS in cc-metric-store add the nats section for the connection and nats-subscriptions in the metric-store section:

{
  "nats": {
    "address": "nats://localhost:4222",
    "username": "user",
    "password": "password"
  },
  "metric-store": {
    "nats-subscriptions": [
      {
        "subscribe-to": "hpc-nats",
        "cluster-tag": "fritz"
      }
    ]
  }
}

The nats section configures the NATS server connection with address and credentials. The nats-subscriptions within metric-store define which subjects to subscribe to and how to tag incoming metrics with cluster information.

6 - Setup of cc-metric-collector

How to configure and deploy cc-metric-collector

Introduction

cc-metric-collector is a node agent for measuring, processing and forwarding node level metrics. It is currently mostly documented via Markdown documents in its GitHub repository. The configuration consists of the following parts:

  • collectors: Metric sources. There is a large number of collectors available. Important and also most demanding to configure is the likwid collector for measuring hardware performance counter metrics.
  • router: Rename, drop and modify metrics.
  • sinks: Configuration where to send the metrics.
  • receivers: Receive metrics. Useful as a proxy to connect different metric sinks. Can be left empty in most cases.

Build and deploy

Since cc-metric-collector needs to be installed on every compute node and requires configuration specific to the node hardware, it is demanding to install and configure. The Makefile supports generating RPM and DEB packages. There is also a Systemd service file included, which you may take as a blueprint. More information on deployment is available here.

Collectors

You may want to have a look at our collector configuration, which includes configurations for many different systems with Intel and AMD CPUs and NVIDIA GPUs. The general recommendation is to first decide on the metrics you need and then figure out which collectors are required. For hardware performance counter metrics you may want to have a look at the likwid-perfctr performance groups for inspiration on how to compute the required derived metrics on your target processor architecture.

Router

The router makes it possible to rename, drop, and modify metrics. Top-level configuration attributes (these can usually be left at their defaults):

  • interval_timestamp: If true, metrics received within the same interval get an identical timestamp. Default is true.
  • num_cache_intervals: Number of intervals that are cached in router. Default is 1. Set to 0 to disable router cache.
  • hostname_tag: Set a host name different from what is returned by hostname.
  • max_forward: Number of metrics read at once from a Golang channel. Default is 50. The value has to be larger than 1. Recommendation: leave at default!

Below you find the operations that are supported by the message processor.

Rename metrics

To rename metric names add a rename_messages section mapping the old metric name to the new name.

"process_messages" : {
    "rename_messages" : {
        "load_one" : "cpu_load",
        "net_bytes_in_bw" : "net_bytes_in",
        "net_bytes_out_bw" : "net_bytes_out",
        "net_pkts_in_bw" : "net_pkts_in",
        "net_pkts_out_bw" : "net_pkts_out",
        "ib_recv_bw" : "ib_recv",
        "ib_xmit_bw" : "ib_xmit",
        "lustre_read_bytes_diff" : "lustre_read_bytes",
        "lustre_read_requests_diff" : "lustre_read_requests",
        "lustre_write_bytes_diff" : "lustre_write_bytes",
        "lustre_write_requests_diff" : "lustre_write_requests",
}

Drop metrics

Sometimes collectors provide a lot of metrics that are not needed. To save data volume, metrics can be dropped in the router. Some collectors also support excluding metrics at the collector level using the exclude_metrics option; a sketch of that variant follows the router example below.

"process_messages" : {
   "drop_messages" : [
       "load_five",
       "load_fifteen",
       "proc_run",
       "proc_total"
   ],
}
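
A sketch of the collector-level variant, assuming the cpustat collector and typical metric names:

// excerpt from the collector configuration (sketch)
"cpustat": {
    "exclude_metrics": ["cpu_idle", "cpu_iowait"]
}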

Normalize unit naming

Enforces consistent naming of units in metrics. This option should always be set to true, which is the default. The metric value itself is not altered!

"process_messages" : {
   "normalize_units": true
}

Change metric unit

The collectors usually do not alter the unit of a metric. To change the unit, set the change_unit_prefix key. The value is automatically scaled correctly, depending on the old unit prefix.

"process_messages" : {
   "change_unit_prefix": {
       "name == 'mem_used'": "G",
       "name == 'swap_used'": "G",
       "name == 'mem_total'": "G",
       "name == 'swap_total'": "G",
       "name == 'cpufreq'": "M"
   }
}

Add tags

To add tags set the add_tags_if configuration attribute. The following statement unconditionally sets a cluster name tag for all metrics.

"process_messages" : {
    "add_tags_if": [
      {
        "key": "cluster",
        "value": "alex",
        "if": "true"
      }
    ],
}

Sinks

A simple example configuration for two sinks: HTTP cc-metric-store and NATS:

{
  "fritzstore": {
    "type": "http",
    "url": "http://monitoring.nhr.fau.de:8082/api/write?cluster=fritz",
    "jwt": "XYZ",
    "idle_connection_timeout": "60s"
  },
  "fritznats": {
    "type": "nats",
    "host": "monitoring.nhr.fau.de",
    "database": "fritz",
    "nkey_file": "/etc/cc-metric-collector/nats.nkey",
  }
}

All metrics are concurrently sent to all configured sinks.

7 - Setup of cc-backend

How to configure and deploy cc-backend

Introduction

cc-backend is the main hub within the ClusterCockpit framework. Its configuration consists of the general part in config.json and the cluster configurations in cluster.json files, which are part of the job archive. The job archive is a long-term persistent storage for all job meta and metric data. The job meta data, including job statistics, as well as the user data are stored in a SQL database. Secrets such as passwords and tokens are provided as environment variables. Environment variables can be initialized using a .env file residing in the same directory as cc-backend. If using a .env file, environment variables that are already set take precedence.

Configuration

cc-backend provides a command line switch to generate an initial template for all required configuration files apart from the job archive:

./cc-backend -init

This will create the ./var folder, generate initial versions of the config.json and .env files, and initialize a sqlite database file.

config.json

Below is a production configuration enabling the following functionality:

  • Use HTTPS only
  • Mark jobs as short jobs if shorter than 5 minutes
  • Enable authentication and user syncing via an LDAP directory
  • Enable initiating a user session via a JWT token, e.g. by an IdM portal
  • Drop permissions after the privileged port is bound
  • Enable re-sampling of time-series metric data for long jobs
  • Enable NATS for job and metric store APIs
  • Set in-memory metric retention to 48h
  • Cap the memory used by the internal metric store at 100 GB
  • Enable archiving of metric data
  • Use S3 as the job archive backend. Note: the file-based archive in ./var/job-archive is the default.

Not included below but set by the default settings:

  • Use compression for metric data files in job archive
  • Allow access to the REST API from all IPs

{
  "main": {
    "addr": "0.0.0.0:443",
    "https-cert-file": "/etc/letsencrypt/live/url/fullchain.pem",
    "https-key-file": "/etc/letsencrypt/live/url/privkey.pem",
    "user": "clustercockpit",
    "group": "clustercockpit",
    "short-running-jobs-duration": 300,
    "enable-job-taggers": true,
    "resampling": {
      "minimum-points": 600,
      "trigger": 180,
      "resolutions": [240, 60]
    },
    "api-subjects": {
      "subject-job-event": "cc.job.event",
      "subject-node-state": "cc.node.state"
    }
  },
  "nats": {
    "address": "nats://x.x.x.x:4222",
    "username": "root",
    "password": "root"
  },
  "auth": {
    "jwts": {
      "max-age": "2000h"
    },
    "ldap": {
      "url": "ldaps://hpcldap.rrze.uni-erlangen.de",
      "user_base": "ou=people,ou=hpc,dc=rz,dc=uni,dc=de",
      "search_dn": "cn=hpcmonitoring,ou=roadm,ou=profile,ou=hpc,dc=rz,dc=uni,dc=de",
      "user_bind": "uid={username},ou=people,ou=hpc,dc=rrze,dc=uni,dc=de",
      "user_filter": "(&(objectclass=posixAccount))",
      "sync_interval": "24h"
    }
  },
  "cron": {
    "commit-job-worker": "1m",
    "duration-worker": "5m",
    "footprint-worker": "10m"
  },
  "archive": {
    "kind": "s3",
    "endpoint": "http://x.x.x.x",
    "bucket": "jobarchive",
    "access-key": "xx",
    "secret-key": "xx",
    "retention": {
      "policy": "move",
      "age": 365,
      "location": "./var/archive"
    }
  },
  "metric-store": {
    "memory-cap": 100,
    "retention-in-memory": "48h",
    "cleanup": {
      "mode": "archive",
      "directory": "./var/archive"
    },
    "nats-subscriptions": [
      {
        "subscribe-to": "hpc-nats",
        "cluster-tag": "fritz"
      },
      {
        "subscribe-to": "hpc-nats",
        "cluster-tag": "alex"
      }
    ]
  },
  "ui-file": "ui-config.json"
}

Environment variables

Secrets are provided in terms of environment variables. The only two required secrets are JWT_PUBLIC_KEY and JWT_PRIVATE_KEY, used for signing generated JWT tokens and validating JWT authentication.
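
A minimal .env file therefore contains just this key pair; the values below are placeholders (the public key reuses the sample from the cc-metric-store section):

# .env (sketch)
JWT_PUBLIC_KEY="kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0="
JWT_PRIVATE_KEY="<base64-encoded Ed25519 private key>"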

Please refer to the environment reference for details.