1 - Configure retention policies

Managing database and job archive size with retention policies

Overview

Over time, the ClusterCockpit database and job archive can grow significantly, especially in production environments with high job counts. Retention policies help keep your storage at a manageable size by automatically removing or archiving old jobs.

Why use retention policies?

Without retention policies:

  • The SQLite database file can grow to tens of gigabytes
  • The job archive can reach terabytes in size
  • Storage requirements increase indefinitely
  • System performance may degrade

A typical multi-cluster setup over 5 years can accumulate:

  • 75 GB for the SQLite database
  • 1.4 TB for the job archive

Retention policies allow you to balance data retention needs with storage capacity.

Retention policy options

ClusterCockpit supports three retention policies:

None (default)

No automatic cleanup. Jobs are kept indefinitely.

{
  "archive": {
    "kind": "file",
    "path": "./var/job-archive"
  }
}

Delete

Permanently removes jobs older than the specified age from both the job archive and the database.

Use when:

  • Storage space is limited
  • You don’t need long-term job data
  • You have external backups or data exports

Configuration example:

{
  "archive": {
    "kind": "file",
    "path": "./var/job-archive",
    "retention": {
      "policy": "delete",
      "age": 365,
      "includeDB": true
    }
  }
}

This configuration will:

  • Delete jobs older than 365 days
  • Remove them from both the job archive and database
  • Run automatically based on the cleanup interval

Move

Moves old jobs to a separate location for long-term archival while removing them from the active database.

Use when:

  • You need to preserve historical data
  • You want to reduce active database size
  • You can store archived data on cheaper, slower storage

Configuration example:

{
  "archive": {
    "kind": "file",
    "path": "./var/job-archive",
    "retention": {
      "policy": "move",
      "age": 365,
      "location": "/mnt/archive/old-jobs",
      "includeDB": true
    }
  }
}

This configuration will:

  • Move jobs older than 365 days to /mnt/archive/old-jobs
  • Remove them from the active database
  • Preserve the data for potential future analysis

Configuration parameters

archive.retention section

Parameter   Type      Required   Default   Description
policy      string    Yes        -         Retention policy: none, delete, or move
age         integer   No         7         Age threshold in days. Jobs older than this are affected
includeDB   boolean   No         true      Also remove jobs from the database (not just archive)
location    string    For move   -         Target directory for moved jobs (only for move policy)

Complete configuration examples

Example 1: One-year retention with deletion

Suitable for environments with limited storage:

{
  "archive": {
    "kind": "file",
    "path": "./var/job-archive",
    "retention": {
      "policy": "delete",
      "age": 365,
      "includeDB": true
    }
  }
}

Example 2: Two-tier archival system

Keep 6 months active, move older data to long-term storage:

{
  "archive": {
    "kind": "file",
    "path": "./var/job-archive",
    "retention": {
      "policy": "move",
      "age": 180,
      "location": "/mnt/slow-storage/archive",
      "includeDB": true
    }
  }
}

Example 3: S3 backend with retention

Using S3 object storage with one-year retention:

{
  "archive": {
    "kind": "s3",
    "endpoint": "https://s3.example.com",
    "bucket": "clustercockpit-jobs",
    "access-key": "your-access-key",
    "secret-key": "your-secret-key",
    "retention": {
      "policy": "delete",
      "age": 365,
      "includeDB": true
    }
  }
}

How retention policies work

  1. Automatic execution: Retention policies run automatically based on the configured interval
  2. Age calculation: Jobs are evaluated based on their startTime field
  3. Batch processing: All jobs older than the specified age are processed in one operation
  4. Database cleanup: When includeDB: true, corresponding database entries are removed
  5. Archive handling: Based on policy (delete removes, move relocates)
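
Before enabling delete or move, it can help to estimate how many jobs a given age threshold would affect. This is a minimal sketch assuming the file-based archive layout used in the examples above, where every job directory contains a meta.json file; it uses file modification time as a rough stand-in for the job startTime, so treat the numbers as estimates only.

# Count jobs whose meta.json is older than 365 days (approximation)
find ./var/job-archive -name meta.json -mtime +365 | wc -l

# Sum up the disk space used by those job directories (GNU find/du)
find ./var/job-archive -name meta.json -mtime +365 -printf '%h\0' \
  | du -sch --files0-from=- | tail -n 1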

Best practices

Planning retention periods

Consider these factors when setting the age parameter:

  • Accounting requirements: Some organizations require job data for billing/auditing
  • Research needs: Longer retention for research clusters where users may need historical data
  • Storage capacity: Available disk space and growth rate
  • Compliance: Legal or institutional data retention policies

Recommended retention periods:

Use Case                        Suggested Age
Development/testing             30-90 days
Production (limited storage)    180-365 days
Production (ample storage)      365-730 days
Research/archival               730+ days or use move policy

Storage considerations

For move policy

  • Mount the target location on slower, cheaper storage (e.g., spinning disks, network storage)
  • Ensure sufficient space at the target location
  • Consider periodic backups of the moved archive
  • Document the archive structure for future retrieval
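
Before enabling the move policy, a quick check of the target location can prevent silent failures. A minimal sketch, using the target path from the example above and assuming a hypothetical service user clustercockpit for cc-backend:

# Check free space at the move target
df -h /mnt/archive/old-jobs

# Verify the service user can write there (user name is an assumption)
sudo -u clustercockpit touch /mnt/archive/old-jobs/.write-test
sudo -u clustercockpit rm /mnt/archive/old-jobs/.write-test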

For delete policy

  • Create backups first: Always backup your database and job archive before enabling deletion
  • Test on a copy: Verify the retention policy works as expected on test data
  • Export important data: Consider exporting summary statistics or critical job data before deletion

Monitoring and maintenance

  1. Track archive size: Monitor growth to adjust retention periods

    du -sh /var/job-archive
    du -sh /path/to/database.db
    
  2. Verify retention execution: Check logs for retention policy runs

    grep -i retention /var/log/cc-backend.log
    
  3. Regular backups: Backup before changing retention settings

    cp -r /var/job-archive /backup/job-archive-$(date +%Y%m%d)
    cp /var/clustercockpit.db /backup/clustercockpit-$(date +%Y%m%d).db
    

Restoring deleted jobs

If using move policy

Jobs moved to the retention location can be restored:

  1. Stop cc-backend

  2. Use the archive-manager tool to import jobs back:

    cd tools/archive-manager
    go build
    ./archive-manager -import \
      -src-config '{"kind":"file","path":"/mnt/archive/old-jobs"}' \
      -dst-config '{"kind":"file","path":"./var/job-archive"}'
    
  3. Rebuild database from archive:

    ./cc-backend -init-db
    
  4. Restart cc-backend

If using delete policy

Jobs cannot be restored unless you have external backups. This is why backups are critical before enabling deletion.

Troubleshooting

Retention policy not running

Check:

  1. Verify archive.retention is properly configured in config.json
  2. Ensure cc-backend was restarted after configuration changes
  3. Check logs for errors: grep -i retention /var/log/cc-backend.log

Database size not decreasing

Possible causes:

  • includeDB: false - Database entries are not being removed

  • SQLite doesn’t automatically reclaim space - run VACUUM:

    sqlite3 /var/clustercockpit.db "VACUUM;"
    

Jobs not being moved to target location

Check:

  1. Target directory exists and is writable
  2. Sufficient disk space at target location
  3. File permissions allow cc-backend to write to location
  4. Path in location is absolute, not relative

Performance impact

If retention policy execution causes performance issues:

  • Consider running during off-peak hours (feature may require manual execution)
  • Reduce the number of old jobs by running retention more frequently with shorter age periods
  • Use more powerful hardware for the database operations


2 - How to set up hierarchical metric collection

Configure multiple cc-metric-collector instances to receive metrics from compute nodes and forward them to the backend

Overview

In large HPC clusters, it’s often impractical or undesirable to have every compute node connect directly to the central database. A hierarchical collection setup allows you to:

  • Reduce database connections: Instead of hundreds of nodes connecting directly, use aggregation nodes as intermediaries
  • Improve network efficiency: Aggregate metrics at rack or partition level before forwarding
  • Add processing layers: Filter, transform, or enrich metrics at intermediate collection points
  • Increase resilience: Buffer metrics during temporary database outages

This guide shows how to configure multiple cc-metric-collector instances where compute nodes send metrics to aggregation nodes, which then forward them to the backend database.

Architecture

flowchart TD
  subgraph Rack1 ["Rack 1 - Compute Nodes"]
    direction LR
    node1["Node 1<br/>cc-metric-collector"]
    node2["Node 2<br/>cc-metric-collector"]
    node3["Node 3<br/>cc-metric-collector"]
  end

  subgraph Rack2 ["Rack 2 - Compute Nodes"]
    direction LR
    node4["Node 4<br/>cc-metric-collector"]
    node5["Node 5<br/>cc-metric-collector"]
    node6["Node 6<br/>cc-metric-collector"]
  end

  subgraph Aggregator ["Aggregation Node"]
    ccrecv["cc-metric-collector<br/>(with receivers)"]
  end

  subgraph Backend ["Backend Server"]
    ccms[("cc-metric-store")]
    ccweb["cc-backend<br/>(Web Frontend)"]
  end
  
  node1 --> ccrecv
  node2 --> ccrecv
  node3 --> ccrecv
  node4 --> ccrecv
  node5 --> ccrecv
  node6 --> ccrecv
  
  ccrecv --> ccms
  ccms <--> ccweb

Components

  1. Compute Node Collectors: Run on each compute node, collect local metrics, forward to aggregation node
  2. Aggregation Node: Receives metrics from multiple compute nodes, optionally processes them, forwards to cc-metric-store
  3. cc-metric-store: In-memory time-series database for metric storage and retrieval
  4. cc-backend: Web frontend that queries cc-metric-store and visualizes metrics

Configuration

Step 1: Configure Compute Nodes

Compute nodes collect local metrics and send them to the aggregation node using a network sink (NATS or HTTP).

NATS provides better performance, reliability, and built-in clustering support.

config.json:

{
  "sinks-file": "/etc/cc-metric-collector/sinks.json",
  "collectors-file": "/etc/cc-metric-collector/collectors.json",
  "receivers-file": "/etc/cc-metric-collector/receivers.json",
  "router-file": "/etc/cc-metric-collector/router.json",
  "main": {
    "interval": "10s",
    "duration": "1s"
  }
}

sinks.json:

{
  "nats_aggregator": {
    "type": "nats",
    "host": "aggregator.example.org",
    "port": "4222",
    "subject": "metrics.rack1"
  }
}

collectors.json (enable metrics you need):

{
  "cpustat": {},
  "memstat": {},
  "diskstat": {},
  "netstat": {},
  "loadavg": {},
  "tempstat": {}
}

router.json (add identifying tags):

{
  "interval_timestamp": true,
  "process_messages": {
    "manipulate_messages": [
      {
        "add_base_tags": {
          "cluster": "mycluster",
          "rack": "rack1"
        }
      }
    ]
  }
}

receivers.json (empty for compute nodes):

{}

Using HTTP

HTTP is simpler but less efficient for high-frequency metrics.

sinks.json (HTTP alternative):

{
  "http_aggregator": {
    "type": "http",
    "host": "aggregator.example.org",
    "port": "8080",
    "path": "/api/write",
    "idle_connection_timeout": "5s",
    "timeout": "3s"
  }
}

Step 2: Configure Aggregation Node

The aggregation node receives metrics from compute nodes via receivers and forwards them to the backend database.

config.json:

{
  "sinks-file": "/etc/cc-metric-collector/sinks.json",
  "collectors-file": "/etc/cc-metric-collector/collectors.json",
  "receivers-file": "/etc/cc-metric-collector/receivers.json",
  "router-file": "/etc/cc-metric-collector/router.json",
  "main": {
    "interval": "10s",
    "duration": "1s"
  }
}

receivers.json (receive from compute nodes):

{
  "nats_rack1": {
    "type": "nats",
    "address": "localhost",
    "port": "4222",
    "subject": "metrics.rack1"
  },
  "nats_rack2": {
    "type": "nats",
    "address": "localhost",
    "port": "4222",
    "subject": "metrics.rack2"
  }
}

sinks.json (forward to cc-metric-store):

{
  "metricstore": {
    "type": "http",
    "host": "backend.example.org",
    "port": "8082",
    "path": "/api/write",
    "idle_connection_timeout": "5s",
    "timeout": "5s",
    "jwt": "eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDbVXKrQr4jNiQV-B_1-uaL_lW8d8gGb-TSAG9KdMg"
  }
}

Note: The jwt token must be signed with the private key corresponding to the public key configured in cc-metric-store. See JWT generation guide for details.
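
cc-metric-store expects Ed25519 (EdDSA) keys; the jwt-public-key value in the Step 3 example below is a base64-encoded raw 32-byte public key. As a sketch, such a key pair can be created with OpenSSL (1.1.1 or newer); the file name is a placeholder and the authoritative steps are in the JWT generation guide:

# Generate an Ed25519 private key (keep it for signing JWTs)
openssl genpkey -algorithm ed25519 -out ed25519-private.pem

# Extract the raw public key, base64-encoded, for use as "jwt-public-key"
openssl pkey -in ed25519-private.pem -pubout -outform DER | tail -c 32 | base64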

collectors.json (optionally collect local metrics):

{
  "cpustat": {},
  "memstat": {},
  "loadavg": {}
}

router.json (optionally process metrics):

{
  "interval_timestamp": false,
  "num_cache_intervals": 0,
  "process_messages": {
    "manipulate_messages": [
      {
        "add_base_tags": {
          "datacenter": "dc1"
        }
      }
    ]
  }
}

Step 3: Set Up cc-metric-store

The backend server needs cc-metric-store to receive and store metrics from the aggregation node.

config.json (/etc/cc-metric-store/config.json):

{
  "metrics": {
    "cpu_user": {
      "frequency": 10,
      "aggregation": "avg"
    },
    "cpu_system": {
      "frequency": 10,
      "aggregation": "avg"
    },
    "mem_used": {
      "frequency": 10,
      "aggregation": null
    },
    "mem_total": {
      "frequency": 10,
      "aggregation": null
    },
    "net_bw": {
      "frequency": 10,
      "aggregation": "sum"
    },
    "flops_any": {
      "frequency": 10,
      "aggregation": "sum"
    },
    "mem_bw": {
      "frequency": 10,
      "aggregation": "sum"
    }
  },
  "http-api": {
    "address": "0.0.0.0:8082"
  },
  "jwt-public-key": "kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0=",
  "retention-in-memory": "48h",
  "checkpoints": {
    "interval": "12h",
    "directory": "/var/lib/cc-metric-store/checkpoints",
    "restore": "48h"
  },
  "archive": {
    "interval": "24h",
    "directory": "/var/lib/cc-metric-store/archive"
  }
}

Important configuration notes:

  • metrics: Must list ALL metrics you want to store. Only configured metrics are accepted.
  • frequency: Must match the collection interval from cc-metric-collector (in seconds)
  • aggregation: "sum" for resource metrics (bandwidth, FLOPS), "avg" for diagnostic metrics (CPU %), null for node-only metrics
  • jwt-public-key: Must correspond to the private key used to sign JWT tokens in the aggregation node sink configuration
  • retention-in-memory: How long to keep metrics in memory (should cover typical job durations)

Install cc-metric-store:

# Download binary
wget https://github.com/ClusterCockpit/cc-metric-store/releases/latest/download/cc-metric-store

# Install
sudo mkdir -p /opt/monitoring/cc-metric-store
sudo mv cc-metric-store /opt/monitoring/cc-metric-store/
sudo chmod +x /opt/monitoring/cc-metric-store/cc-metric-store

# Create directories
sudo mkdir -p /var/lib/cc-metric-store/checkpoints
sudo mkdir -p /var/lib/cc-metric-store/archive
sudo mkdir -p /etc/cc-metric-store

Create systemd service (/etc/systemd/system/cc-metric-store.service):

[Unit]
Description=ClusterCockpit Metric Store
After=network.target

[Service]
Type=simple
User=cc-metricstore
Group=cc-metricstore
WorkingDirectory=/opt/monitoring/cc-metric-store
ExecStart=/opt/monitoring/cc-metric-store/cc-metric-store -config /etc/cc-metric-store/config.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Start cc-metric-store:

# Create user
sudo useradd -r -s /bin/false cc-metricstore
sudo chown -R cc-metricstore:cc-metricstore /var/lib/cc-metric-store

# Start service
sudo systemctl daemon-reload
sudo systemctl start cc-metric-store
sudo systemctl enable cc-metric-store

# Check status
sudo systemctl status cc-metric-store

Step 4: Set Up NATS Server

The aggregation node needs a NATS server to receive metrics from compute nodes.

Install NATS:

# Using Docker
docker run -d --name nats -p 4222:4222 nats:latest

# Using package manager (example for Ubuntu/Debian)
curl -L https://github.com/nats-io/nats-server/releases/download/v2.10.5/nats-server-v2.10.5-linux-amd64.zip -o nats-server.zip
unzip nats-server.zip
sudo mv nats-server-v2.10.5-linux-amd64/nats-server /usr/local/bin/

NATS Configuration (/etc/nats/nats-server.conf):

listen: 0.0.0.0:4222
max_payload: 10MB
max_connections: 1000

# Optional: Enable authentication
authorization {
  user: collector
  password: secure_password
}

# Optional: Enable clustering for HA
cluster {
  name: metrics-cluster
  listen: 0.0.0.0:6222
}

Start NATS:

# Systemd
sudo systemctl start nats
sudo systemctl enable nats

# Or directly
nats-server -c /etc/nats/nats-server.conf

Advanced Configurations

Multiple Aggregation Levels

For very large clusters, you can create multiple aggregation levels:

flowchart TD
  subgraph Compute ["Compute Nodes"]
    node1["Node 1-100"]
  end

  subgraph Rack ["Rack Aggregators"]
    agg1["Aggregator<br/>Rack 1-10"]
  end

  subgraph Cluster ["Cluster Aggregator"]
    agg_main["Main Aggregator"]
  end

  subgraph Backend ["Backend"]
    ccms[("cc-metric-store")]
  end
  
  node1 --> agg1
  agg1 --> agg_main
  agg_main --> ccms

Rack-level aggregator sinks.json:

{
  "cluster_aggregator": {
    "type": "nats",
    "host": "main-aggregator.example.org",
    "port": "4222",
    "subject": "metrics.cluster"
  }
}

Cluster-level aggregator receivers.json:

{
  "all_racks": {
    "type": "nats",
    "address": "localhost",
    "port": "4222",
    "subject": "metrics.cluster"
  }
}

Load Balancing with Multiple Aggregators

Use NATS queue groups to distribute load across multiple aggregation nodes:

Compute node sinks.json:

{
  "nats_aggregator": {
    "type": "nats",
    "host": "nats-cluster.example.org",
    "port": "4222",
    "subject": "metrics.loadbalanced"
  }
}

Aggregator 1 and 2 receivers.json (identical configuration):

{
  "nats_with_queue": {
    "type": "nats",
    "address": "localhost",
    "port": "4222",
    "subject": "metrics.loadbalanced",
    "queue_group": "aggregators"
  }
}

With queue_group configured, NATS automatically distributes messages across all aggregators in the group.

Filtering at Aggregation Level

Reduce cc-metric-store load by filtering metrics at the aggregation node:

Aggregator router.json:

{
  "interval_timestamp": false,
  "process_messages": {
    "manipulate_messages": [
      {
        "drop_by_name": ["cpu_idle", "cpu_guest", "cpu_guest_nice"]
      },
      {
        "drop_by": "value == 0 && match('temp_', name)"
      },
      {
        "add_base_tags": {
          "aggregated": "true"
        }
      }
    ]
  }
}

Metric Transformation

Aggregate or transform metrics before forwarding:

Aggregator router.json:

{
  "interval_timestamp": false,
  "num_cache_intervals": 1,
  "interval_aggregates": [
    {
      "name": "rack_avg_temp",
      "if": "name == 'temp_package_id_0'",
      "function": "avg(values)",
      "tags": {
        "type": "rack",
        "rack": "<copy>"
      },
      "meta": {
        "unit": "degC",
        "source": "aggregated"
      }
    }
  ]
}

High Availability Setup

Use multiple NATS servers in cluster mode:

NATS server 1 config:

cluster {
  name: metrics-cluster
  listen: 0.0.0.0:6222
  routes: [
    nats://nats2.example.org:6222
    nats://nats3.example.org:6222
  ]
}

Compute node sinks.json (with failover):

{
  "nats_ha": {
    "type": "nats",
    "host": "nats1.example.org,nats2.example.org,nats3.example.org",
    "port": "4222",
    "subject": "metrics.rack1"
  }
}

Deployment

1. Install cc-metric-collector

On all nodes (compute and aggregation):

# Download binary
wget https://github.com/ClusterCockpit/cc-metric-collector/releases/latest/download/cc-metric-collector

# Install
sudo mkdir -p /opt/monitoring/cc-metric-collector
sudo mv cc-metric-collector /opt/monitoring/cc-metric-collector/
sudo chmod +x /opt/monitoring/cc-metric-collector/cc-metric-collector

2. Deploy Configuration Files

Compute nodes:

sudo mkdir -p /etc/cc-metric-collector
sudo cp config.json /etc/cc-metric-collector/
sudo cp sinks.json /etc/cc-metric-collector/
sudo cp collectors.json /etc/cc-metric-collector/
sudo cp receivers.json /etc/cc-metric-collector/
sudo cp router.json /etc/cc-metric-collector/

Aggregation node:

sudo mkdir -p /etc/cc-metric-collector
# Deploy aggregator-specific configs
sudo cp aggregator-config.json /etc/cc-metric-collector/config.json
sudo cp aggregator-sinks.json /etc/cc-metric-collector/sinks.json
sudo cp aggregator-receivers.json /etc/cc-metric-collector/receivers.json
# etc...

3. Create Systemd Service

On all nodes (/etc/systemd/system/cc-metric-collector.service):

[Unit]
Description=ClusterCockpit Metric Collector
After=network.target

[Service]
Type=simple
User=cc-collector
Group=cc-collector
ExecStart=/opt/monitoring/cc-metric-collector/cc-metric-collector -config /etc/cc-metric-collector/config.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

4. Start Services

Order of startup:

  1. Start cc-metric-store on backend server
  2. Start NATS server on aggregation node
  3. Start cc-metric-collector on aggregation node
  4. Start cc-metric-collector on compute nodes

# On backend server
sudo systemctl start cc-metric-store

# On aggregation node
sudo systemctl start nats
sudo systemctl start cc-metric-collector

# On compute nodes
sudo systemctl start cc-metric-collector

# Enable on boot (on all nodes)
sudo systemctl enable cc-metric-store  # backend only
sudo systemctl enable nats              # aggregator only
sudo systemctl enable cc-metric-collector

Testing and Validation

Test Compute Node → Aggregator

On compute node, run once to verify metrics are collected:

cc-metric-collector -config /etc/cc-metric-collector/config.json -once

On aggregation node, check NATS for incoming metrics:

# Subscribe to see messages
nats sub 'metrics.>'
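
You can also publish a hand-crafted test metric on the rack subject to verify the receive path without waiting for the collectors. cc-metric-collector exchanges metrics in InfluxDB line protocol; the metric name, tags, and value below are made up for this test:

# Publish one fake node-level metric on the rack 1 subject (NATS CLI)
nats pub 'metrics.rack1' \
  "cpu_load,cluster=mycluster,hostname=testnode,type=node value=1.23 $(date +%s%N)"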

Test Aggregator → cc-metric-store

On aggregation node, verify metrics are forwarded:

# Check logs
journalctl -u cc-metric-collector -f

On backend server, verify cc-metric-store is receiving data:

# Check cc-metric-store logs
journalctl -u cc-metric-store -f

# Query metrics via REST API (requires valid JWT token)
curl -H "Authorization: Bearer $JWT_TOKEN" \
  "http://backend.example.org:8082/api/query?cluster=mycluster&from=$(date -d '5 minutes ago' +%s)&to=$(date +%s)"

Validate End-to-End

Check cc-backend to see if metrics appear for all nodes:

  1. Open cc-backend web interface
  2. Navigate to node view
  3. Verify metrics are displayed for compute nodes
  4. Check that tags (cluster, rack, etc.) are present

Monitoring and Troubleshooting

Check Collection Pipeline

# Compute node: metrics are being sent
journalctl -u cc-metric-collector -n 100 | grep -i "sent\|error"

# Aggregator: metrics are being received
journalctl -u cc-metric-collector -n 100 | grep -i "received\|error"

# NATS: check connections
nats server info
nats server list

Common Issues

Metrics not appearing in cc-metric-store:

  1. Check compute node → NATS connection
  2. Verify NATS → aggregator reception
  3. Check aggregator → cc-metric-store sink (verify JWT token is valid)
  4. Verify metrics are configured in cc-metric-store’s config.json
  5. Examine router filters (may be dropping metrics)

High latency:

  1. Reduce metric collection interval on compute nodes
  2. Increase batch size in aggregator sinks
  3. Add more aggregation nodes with load balancing
  4. Check network bandwidth between tiers

Memory growth on aggregator:

  1. Reduce num_cache_intervals in router
  2. Check sink write performance
  3. Verify cc-metric-store is accepting writes
  4. Monitor NATS queue depth

Memory growth on cc-metric-store:

  1. Reduce retention-in-memory setting
  2. Increase checkpoint frequency
  3. Verify archive cleanup is working

Connection failures:

  1. Verify firewall rules allow NATS port (4222)
  2. Check NATS server is running and accessible
  3. Test network connectivity: telnet aggregator.example.org 4222
  4. Review NATS server logs: journalctl -u nats -f

Performance Tuning

Compute nodes (reduce overhead):

{
  "main": {
    "interval": "30s",
    "duration": "1s"
  }
}

Aggregator (increase throughput):

{
  "metricstore": {
    "type": "http",
    "host": "backend.example.org",
    "port": "8082",
    "path": "/api/write",
    "timeout": "10s",
    "idle_connection_timeout": "10s"
  }
}

NATS server (handle more connections):

max_connections: 10000
max_payload: 10MB
write_deadline: "10s"

Security Considerations

NATS Authentication

NATS server config:

authorization {
  users = [
    {
      user: "collector"
      password: "$2a$11$..."  # bcrypt hash
    }
  ]
}

Compute node sinks.json:

{
  "nats_aggregator": {
    "type": "nats",
    "host": "aggregator.example.org",
    "port": "4222",
    "subject": "metrics.rack1",
    "username": "collector",
    "password": "secure_password"
  }
}

TLS Encryption

NATS server config:

tls {
  cert_file: "/etc/nats/certs/server-cert.pem"
  key_file: "/etc/nats/certs/server-key.pem"
  ca_file: "/etc/nats/certs/ca.pem"
  verify: true
}

Compute node sinks.json:

{
  "nats_aggregator": {
    "type": "nats",
    "host": "aggregator.example.org",
    "port": "4222",
    "subject": "metrics.rack1",
    "ssl": true,
    "ssl_cert": "/etc/cc-metric-collector/certs/client-cert.pem",
    "ssl_key": "/etc/cc-metric-collector/certs/client-key.pem"
  }
}

Firewall Rules

On aggregation node:

# Allow NATS from compute network
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/8" port protocol="tcp" port="4222" accept'

sudo firewall-cmd --reload

On backend server:

# Allow HTTP from aggregation node to cc-metric-store
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="aggregator.example.org" port protocol="tcp" port="8082" accept'

sudo firewall-cmd --reload

Alternative: Using NATS for cc-metric-store

Instead of HTTP, you can also use NATS to send metrics from the aggregation node to cc-metric-store.

Aggregation node sinks.json:

{
  "nats_metricstore": {
    "type": "nats",
    "host": "backend.example.org",
    "port": "4222",
    "subject": "metrics.store"
  }
}

cc-metric-store config.json (add NATS section):

{
  "metrics": { ... },
  "nats": {
    "address": "nats://0.0.0.0:4222",
    "subscriptions": [
      {
        "subscribe-to": "metrics.store",
        "cluster-tag": "mycluster"
      }
    ]
  },
  "http-api": { ... },
  "jwt-public-key": "...",
  "retention-in-memory": "48h",
  "checkpoints": { ... },
  "archive": { ... }
}

Benefits of NATS:

  • Better performance for high-frequency metrics
  • Built-in message buffering
  • No need for JWT tokens in sink configuration
  • Easier to scale with multiple aggregators

Trade-offs:

  • Requires running NATS server on backend
  • More complex infrastructure


3 - Database migrations

Database migrations

Introduction

In general, an upgrade is nothing more than a replacement of the binary file. All necessary files, except the database file, the configuration file and the job archive, are embedded in the binary. It is recommended to keep the binaries in a directory where their file names carry a version indicator, for example the date or the Unix epoch time. A symbolic link points to the version currently in use. This makes it easy to switch back to earlier versions.

The database and the job archive are versioned. Each release binary supports specific versions of the database and job archive. If a version mismatch is detected, the application is terminated and migration is required.

Migrating the database

After you have backed up the database, run the following command to migrate the database to the latest version:

> ./cc-backend -migrate-db

The migration files are embedded in the binary and can also be viewed in the cc-backend source tree. We use the migrate library.

If something goes wrong, you can check the status and get the current schema (here for sqlite):

> sqlite3 var/job.db

In the sqlite console execute:

.schema

to get the current database schema. You can query the current version and whether the migration failed with:

SELECT * FROM schema_migrations;

The first column indicates the current database version and the second column is a dirty flag indicating whether the migration was successful.
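
For a quick non-interactive check, the same information can be queried in one line (column names as created by the migrate library; the output shown is only illustrative):

sqlite3 var/job.db "SELECT version, dirty FROM schema_migrations;"
# e.g. 8|0  -> schema version 8, last migration completed cleanly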

4 - Job archive migrations

Job archive migrations

Introduction

In general, an upgrade is nothing more than a replacement of the binary file. All necessary files, except the database file, the configuration file and the job archive, are embedded in the binary. It is recommended to keep the binaries in a directory where their file names carry a version indicator, for example the date or the Unix epoch time. A symbolic link points to the version currently in use. This makes it easy to switch back to earlier versions.

Migrating the job archive

Job archive migration requires a separate tool (archive-migration), which is part of the cc-backend source tree (build with go build ./tools/archive-migration) and is also provided as part of the releases.

Migration is supported only between two successive releases. You can find details on how to use the archive-migration tool in its reference documentation.

The cluster.json files in job-archive-new must be checked for errors, especially whether the aggregation attribute is set correctly for all metrics.

Migration takes a few hours for large job archives (several hundred GB). A versioned job archive contains a version.txt file in the root directory of the job archive. This file contains the version as an unsigned integer.
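
You can check the archive version directly before and after a migration; the path assumes the default job archive location used elsewhere in this documentation:

cat ./var/job-archive/version.txt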

6 - Hands-On Demo

Hands-On Demo for a basic ClusterCockpit setup and API usage (without Docker)

Prerequisites

  • perl
  • go
  • npm
  • Optional: curl
  • Script migrateTimestamps.pl

Documentation

You find READMEs or api docs in

  • ./cc-backend/configs
  • ./cc-backend/init
  • ./cc-backend/api

ClusterCockpit configuration files

cc-backend

  • ./.env Passwords and Tokens set in the environment
  • ./config.json Configuration options for cc-backend

cc-metric-store

  • ./config.json Optional to overwrite configuration options

cc-metric-collector

Not yet included in the hands-on setup.

Setup Components

Start by creating a base folder for all of the following steps.

  • mkdir clustercockpit
  • cd clustercockpit

Setup cc-backend

  • Clone Repository
    • git clone https://github.com/ClusterCockpit/cc-backend.git
    • cd cc-backend
  • Build
    • make
  • Activate & configure environment for cc-backend
    • cp configs/env-template.txt .env
    • Optional: Have a look via vim .env
    • Copy the config.json file included in this tarball into the root directory of cc-backend: cp ../../config.json ./
  • Back to toplevel clustercockpit
    • cd ..
  • Prepare Datafolder and Database file
    • mkdir var
    • ./cc-backend -migrate-db

Setup cc-metric-store

  • Clone Repository
    • git clone https://github.com/ClusterCockpit/cc-metric-store.git
    • cd cc-metric-store
  • Build Go Executable
    • go get
    • go build
  • Prepare Datafolders
    • mkdir -p var/checkpoints
    • mkdir -p var/archive
  • Update Config
    • vim config.json
    • Exchange existing setting in metrics with the following:
"clock":      { "frequency": 60, "aggregation": null },
"cpi":        { "frequency": 60, "aggregation": null },
"cpu_load":   { "frequency": 60, "aggregation": null },
"flops_any":  { "frequency": 60, "aggregation": null },
"flops_dp":   { "frequency": 60, "aggregation": null },
"flops_sp":   { "frequency": 60, "aggregation": null },
"ib_bw":      { "frequency": 60, "aggregation": null },
"lustre_bw":  { "frequency": 60, "aggregation": null },
"mem_bw":     { "frequency": 60, "aggregation": null },
"mem_used":   { "frequency": 60, "aggregation": null },
"rapl_power": { "frequency": 60, "aggregation": null }
  • Back to toplevel clustercockpit
    • cd ..

Setup Demo Data

  • mkdir source-data
  • cd source-data
  • Download JobArchive-Source:
    • wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/job-archive-dev.tar.xz
    • tar xJf job-archive-dev.tar.xz
    • mv ./job-archive ./job-archive-source
    • rm ./job-archive-dev.tar.xz
  • Download CC-Metric-Store Checkpoints:
    • mkdir -p cc-metric-store-source/checkpoints
    • cd cc-metric-store-source/checkpoints
    • wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/cc-metric-store-checkpoints.tar.xz
    • tar xf cc-metric-store-checkpoints.tar.xz
    • rm cc-metric-store-checkpoints.tar.xz
  • Back to source-data
    • cd ../..
  • Run timestamp migration script. This may take tens of minutes!
    • cp ../migrateTimestamps.pl .
    • ./migrateTimestamps.pl
    • Expected output:
Starting to update start- and stoptimes in job-archive for emmy
Starting to update start- and stoptimes in job-archive for woody
Done for job-archive
Starting to update checkpoint filenames and data starttimes for emmy
Starting to update checkpoint filenames and data starttimes for woody
Done for checkpoints
  • Copy cluster.json files from source to migrated folders
    • cp source-data/job-archive-source/emmy/cluster.json cc-backend/var/job-archive/emmy/
    • cp source-data/job-archive-source/woody/cluster.json cc-backend/var/job-archive/woody/
  • Initialize Job-Archive in SQLite3 job.db and add demo user
    • cd cc-backend
    • ./cc-backend -init-db -add-user demo:admin:demo
    • Expected output:
<6>[INFO]    new user "demo" created (roles: ["admin"], auth-source: 0)
<6>[INFO]    Building job table...
<6>[INFO]    A total of 3936 jobs have been registered in 1.791 seconds.
  • Back to toplevel clustercockpit
    • cd ..

Startup both Apps

  • In cc-backend root: $./cc-backend -server -dev
    • Starts ClusterCockpit at http://localhost:8080
      • Log: <6>[INFO] HTTP server listening at :8080...
    • Use local internet browser to access interface
      • You should see and be able to browse finished Jobs
      • Metadata is read from SQLite3 database
      • Metricdata is read from job-archive/JSON-Files
    • Create User in settings (top-right corner)
      • Name apiuser
      • Username apiuser
      • Role API
      • Submit & Refresh Page
    • Create JWT for apiuser
      • In Userlist, press Gen. JWT for apiuser
      • Save JWT for later use
  • In cc-metric-store root: $./cc-metric-store
    • Start the cc-metric-store on http://localhost:8081, Log:
2022/07/15 17:17:42 Loading checkpoints newer than 2022-07-13T17:17:42+02:00
2022/07/15 17:17:45 Checkpoints loaded (5621 files, 319 MB, that took 3.034652s)
2022/07/15 17:17:45 API http endpoint listening on '0.0.0.0:8081'
  • Does not have a graphical interface
  • Optional: Test function by executing:
$ curl -H "Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw" -D - "http://localhost:8081/api/query" -d "{ \"cluster\": \"emmy\", \"from\": $(expr $(date +%s) - 60), \"to\": $(date +%s), \"queries\": [{
  \"metric\": \"flops_any\",
  \"host\": \"e1111\"
}] }"

HTTP/1.1 200 OK
Content-Type: application/json
Date: Fri, 15 Jul 2022 13:57:22 GMT
Content-Length: 119
{"results":[[JSON-DATA-ARRAY]]}

Development API web interfaces

The -dev flag enables web interfaces to document and test the apis:

  • Local GQL Playground - A GraphQL playground. To use it you must have an authenticated session in the same browser.
  • Local Swagger Docs - A Swagger UI. To use it you have to be logged out, so no user session in the same browser. Use the JWT token with role API generated previously to authenticate via HTTP header.

Use cc-backend API to start job

  • Enter the URL http://localhost:8080/swagger/index.html in your browser.

  • Enter the JWT token you generated for the API user after clicking the green Authorize button in the upper right part of the window.

  • Click the /job/start_job endpoint and click the Try it out button.

  • Enter the following JSON into the request body text area and fill in a recent start timestamp obtained by executing date +%s:

{
    "jobId":         100000,
    "arrayJobId":    0,
    "user":          "ccdemouser",
    "subCluster":    "main",
    "cluster":       "emmy",
    "startTime":    <date +%s>,
    "project":       "ccdemoproject",
    "resources":  [
        {"hostname":  "e0601"},
        {"hostname":  "e0823"},
        {"hostname":  "e0337"},
        {"hostname": "e1111"}],
    "numNodes":      4,
    "numHwthreads":  80,
    "walltime":      86400
}
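
  • If you prefer the command line over the Swagger UI, the same request can be sent with curl. The endpoint path /api/jobs/start_job/ and the payload file name are assumptions here; use the exact path shown in Swagger and the JWT you generated for apiuser:

curl -X POST "http://localhost:8080/api/jobs/start_job/" \
  -H "Authorization: Bearer $JWT" \
  -H "Content-Type: application/json" \
  -d @start_job.json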
  • The response body should be the database id of the started job, for example:
{
  "id": 3937
}
  • Check in ClusterCockpit
    • User ccdemouser should appear in Users-Tab with one running job
    • It could take up to 5 Minutes until the Job is displayed with some current data (5 Min Short-Job Filter)
    • Job then is marked with a green running tag
    • Metricdata displayed is read from cc-metric-store!

Use cc-backend API to stop job

  • Enter the URL http://localhost:8080/swagger/index.html in your browser.
  • Enter the JWT token you generated for the API user after clicking the green Authorize button in the upper right part of the window.
  • Click the /job/stop_job/{id} endpoint and click the Try it out button.
  • Enter the database id returned by start_job in the id field and copy the following into the request body. Replace the timestamp with a recent one:
{
  "cluster": "emmy",
  "jobState": "completed",
  "stopTime": <RECENT TS>
}
  • On success a json document with the job meta data is returned.

  • Check in ClusterCockpit

    • User ccdemouser should appear in Users-Tab with one completed job
    • Job is no longer marked with a green running tag -> Completed!
    • Metricdata displayed is now read from job-archive!
  • Check in job-archive

    • cd ./cc-backend/var/job-archive/emmy/100/000
    • cd $STARTTIME
    • Inspect meta.json and data.json

Helper scripts

  • In this tarball you can find the perl script generate_subcluster.pl that helps to generate the subcluster section for your system. Usage:
  • Log into an exclusive cluster node.
  • The LIKWID tools likwid-topology and likwid-bench must be in the PATH!
  • $./generate_subcluster.pl outputs the subcluster section on stdout

Please be aware that

  • You have to enter the name and node list for the subCluster manually.
  • GPU detection only works if LIKWID was built with CUDA available and you also run likwid-topology with CUDA loaded.
  • Do not blindly trust the measured peakflops values.
  • Because the script blindly relies on the CSV output format of likwid-topology, this is a fragile undertaking!

7 - How to add a MOD notification banner

Add a message of the day banner on homepage

Overview

To add a notification banner you can add a file notice.txt to the ./var directory of the cc-backend server. As long as this file is present, its contents are shown in an info banner on the homepage.
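
For example, a one-line maintenance notice could be created like this (text and relative path are illustrative; run it from the cc-backend root):

echo "Scheduled maintenance on Friday, 08:00-12:00. Jobs may start delayed." > ./var/notice.txt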

Add notification banner in web interface

As an alternative the admin role can also add and edit the notification banner from the settings view.

8 - How to create a `cluster.json` file

How to initially create a cluster configuration

Overview

Every cluster is configured using a dedicated cluster.json file, which is part of the job archive. You can find the JSON schema for it here. This file provides information about the homogeneous hardware partitions within the cluster, including the node topology and the metric list. A real production configuration is provided as part of cc-examples.

cluster.json: Basics

The cluster.json file contains three top level parts: the name of the cluster, the metric configuration, and the subcluster list. You find the latest cluster.json schema here. Basic layout of cluster.json files:

{
  "name": "fritz",
  "metricConfig": [
    {
      "name": "cpu_load",
      ...
    },
    {
      "name": "mem_used",
      ...
    }
  ],
  "subClusters": [
    {
      "name": "main",
      ...
    },
    {
      "name": "spr",
      ...
    }
  ]
}

cluster.json: Metric configuration

There is one metric list per cluster. You can find a list of recommended metrics and their naming here. Example for a metric list entry with only the required attributes:

"metricConfig": [
    {
        "name": "flops_sp",
        "unit": {
            "base": "Flops/s",
            "prefix": "G"
        },
        "scope": "hwthread",
        "timestep": 60,
        "aggregation": "sum",
        "peak": 5600,
        "normal": 1000,
        "caution": 200,
        "alert": 50
    }
]

Explanation of required attributes:

  • name: The metric name.
  • unit: The metric's unit. Base can be: B (for bytes), F (for flops), B/s, F/s, Flops (for floating point operations), Flops/s (for FLOP rate), CPI (for cycles per instruction), IPC (for instructions per cycle), Hz, W (for Watts), °C, or empty string for no unit. Prefix can be: K, M, G, T, P, or E.
  • scope: The native metric measurement resolution. Can be node, socket, memoryDomain, core, hwthread, or accelerator.
  • timestep: The measurement frequency in seconds
  • aggregation: How the metric is aggregated within the node topology. Can be one of sum, avg, or empty string for no aggregation (node-level metrics).
  • Metric thresholds. Whether a threshold applies to larger or smaller values depends on the optional attribute lowerIsBetter (default false).
    • peak: The maximum possible metric value
    • normal: A common metric value level
    • caution: Metric value requires attention
    • alert: Metric value requiring immediate attention

Optional attributes:

  • footprint: Is this a job footprint metric. Set it to how the footprint is aggregated: can be avg, min, or max. Footprint metrics are shown in the footprint UI component and the job view polar plot.
  • energy: Should the metric be used to calculate the job energy. Can be power (metric has unit Watts) or energy (metric has unit Joules).
  • lowerIsBetter: Is lower better. Influences frontend UI and evaluation of metric thresholds. Default is false.
  • restrict: Whether to restrict visibility of this metric to non-user roles (admin, support, manager). Default is false. When set to true, regular users cannot view this metric.
  • subClusters (Type: array of objects): Overwrites for specific subClusters. By default the metrics are valid for all subClusters. It is possible to overwrite or remove metrics for specific subClusters. If a metric is overwritten for a subCluster, all attributes have to be set; partial overwrites are not supported. Example for a metric overwrite:
{
    "name": "mem_used",
    "unit": {
        "base": "B",
        "prefix": "G"
    },
    "scope": "node",
    "aggregation": "sum",
    "footprint": "max",
    "timestep": 60,
    "lowerIsBetter": true,
    "peak": 256,
    "normal": 128,
    "caution": 200,
    "alert": 240,
    "subClusters": [
        {
            "name": "spr1tb",
            "footprint": "max",
            "peak": 1024,
            "normal": 512,
            "caution": 900,
            "alert": 1000
        },
        {
            "name": "spr2tb",
            "footprint": "max",
            "peak": 2048,
            "normal": 1024,
            "caution": 1800,
            "alert": 2000
        }
    ]
},

This metric characterizes the memory capacity used by a job. The aggregation for a job is the sum of all node values. As footprint the largest allocated memory capacity is used. For this configuration lowerIsBetter is set, so jobs that exceed the metric thresholds are marked. There are two subClusters with 1 TB and 2 TB memory capacity, compared to the default of 256 GB.

Example for removing metrics for a subcluster:

{
  "name": "vectorization_ratio",
  "unit": {
    "base": ""
  },
  "scope": "hwthread",
  "aggregation": "avg",
  "timestep": 60,
  "peak": 100,
  "normal": 60,
  "caution": 40,
  "alert": 10,
  "subClusters": [
    {
      "name": "icelake",
      "remove": true
    }
  ]
}

cluster.json: subcluster configuration

SubClusters in ClusterCockpit are subsets of a cluster with homogeneous hardware. The subCluster part specifies the node topology, a list of nodes that are part of the subCluster, and the node capabilities that are used to draw the roofline diagrams.

Topology Structure

The topology section defines the hardware topology using nested arrays that map relationships between hardware threads, cores, sockets, memory domains, and dies:

  • node: Flat list of all hardware thread IDs on the node
  • socket: Hardware threads grouped by physical CPU socket (2D array)
  • memoryDomain: Hardware threads grouped by NUMA domain (2D array)
  • die: Optional grouping by CPU die within sockets (2D array). This is used for multi-die processors where each socket contains multiple dies. If not applicable, use an empty array []
  • core: Hardware threads grouped by physical core (2D array)
  • accelerators: Optional list of attached accelerators (GPUs, FPGAs, etc.)

The resource ID for CPU cores is the OS processor ID. For GPUs we recommend using the PCI-E address as resource ID.

Here is an example:

{
  "name": "icelake",
  "nodes": "w22[01-35],w23[01-35],w24[01-20],w25[01-20]",
  "processorType": "Intel Xeon Gold 6326",
  "socketsPerNode": 2,
  "coresPerSocket": 16,
  "threadsPerCore": 1,
  "flopRateScalar": {
    "unit": {
      "base": "F/s",
      "prefix": "G"
    },
    "value": 432
  },
  "flopRateSimd": {
    "unit": {
      "base": "F/s",
      "prefix": "G"
    },
    "value": 9216
  },
  "memoryBandwidth": {
    "unit": {
      "base": "B/s",
      "prefix": "G"
    },
    "value": 350
  },
  "topology": {
    "node": [
      0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
      21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
      39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56,
      57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71
    ],
    "socket": [
      [
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35
      ],
      [
        36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
        54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71
      ]
    ],
    "memoryDomain": [
      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
      [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35],
      [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53],
      [54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71]
    ],
    "die": [],
    "core": [
      [0],
      [1],
      [2],
      [3],
      [4],
      [5],
      [6],
      [7],
      [8],
      [9],
      [10],
      [11],
      [12],
      [13],
      [14],
      [15],
      [16],
      [17],
      [18],
      [19],
      [20],
      [21],
      [22],
      [23],
      [24],
      [25],
      [26],
      [27],
      [28],
      [29],
      [30],
      [31],
      [32],
      [33],
      [34],
      [35],
      [36],
      [37],
      [38],
      [39],
      [40],
      [41],
      [42],
      [43],
      [44],
      [45],
      [46],
      [47],
      [48],
      [49],
      [50],
      [51],
      [52],
      [53],
      [54],
      [55],
      [56],
      [57],
      [58],
      [59],
      [60],
      [61],
      [62],
      [63],
      [64],
      [65],
      [66],
      [67],
      [68],
      [69],
      [70],
      [71]
    ]
  }
}

Since it is tedious to write this by hand, we provide a Perl script as part of cc-backend that generates a subCluster template. This script only works if the LIKWID tools are installed and in the PATH. You also need the LIKWID library for cc-metric-collector. You can find instructions on how to install LIKWID here.

Example: SubCluster with GPU Accelerators

Here is an example for a subCluster with GPU accelerators:

{
  "name": "a100m80",
  "nodes": "a[0531-0537],a[0631-0633],a0731,a[0831-0833],a[0931-0934]",
  "processorType": "AMD Milan",
  "socketsPerNode": 2,
  "coresPerSocket": 64,
  "threadsPerCore": 1,
  "flopRateScalar": {
    "unit": {
      "base": "F/s",
      "prefix": "G"
    },
    "value": 432
  },
  "flopRateSimd": {
    "unit": {
      "base": "F/s",
      "prefix": "G"
    },
    "value": 9216
  },
  "memoryBandwidth": {
    "unit": {
      "base": "B/s",
      "prefix": "G"
    },
    "value": 400
  },
  "topology": {
    "node": [
      0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
      21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
      39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56,
      57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
      75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
      93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108,
      109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123,
      124, 125, 126, 127
    ],
    "socket": [
      [
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37,
        38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
        56, 57, 58, 59, 60, 61, 62, 63
      ],
      [
        64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81,
        82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,
        100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
        114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127
      ]
    ],
    "memoryDomain": [
      [
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37,
        38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
        56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73,
        74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91,
        92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,
        108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121,
        122, 123, 124, 125, 126, 127
      ]
    ],
    "core": [
      [0],
      [1],
      [2],
      [3],
      [4],
      [5],
      [6],
      [7],
      [8],
      [9],
      [10],
      [11],
      [12],
      [13],
      [14],
      [15],
      [16],
      [17],
      [18],
      [19],
      [20],
      [21],
      [22],
      [23],
      [24],
      [25],
      [26],
      [27],
      [28],
      [29],
      [30],
      [31],
      [32],
      [33],
      [34],
      [35],
      [36],
      [37],
      [38],
      [39],
      [40],
      [41],
      [42],
      [43],
      [44],
      [45],
      [46],
      [47],
      [48],
      [49],
      [50],
      [51],
      [52],
      [53],
      [54],
      [55],
      [56],
      [57],
      [58],
      [59],
      [60],
      [61],
      [62],
      [63],
      [64],
      [65],
      [66],
      [67],
      [68],
      [69],
      [70],
      [71],
      [72],
      [73],
      [74],
      [75],
      [76],
      [77],
      [78],
      [79],
      [80],
      [81],
      [82],
      [83],
      [84],
      [85],
      [86],
      [87],
      [88],
      [89],
      [90],
      [91],
      [92],
      [93],
      [94],
      [95],
      [96],
      [97],
      [98],
      [99],
      [100],
      [101],
      [102],
      [103],
      [104],
      [105],
      [106],
      [107],
      [108],
      [109],
      [110],
      [111],
      [112],
      [113],
      [114],
      [115],
      [116],
      [117],
      [118],
      [119],
      [120],
      [121],
      [122],
      [123],
      [124],
      [125],
      [126],
      [127]
    ],
    "accelerators": [
      {
        "id": "00000000:0E:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:13:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:49:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:4F:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:90:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:96:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:CC:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:D1:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      }
    ]
  }
}

Important: Each accelerator requires three fields:

  • id: Unique identifier (PCI-E address recommended, e.g., “00000000:0E:00.0”)
  • type: Type of accelerator. Valid values are: "Nvidia GPU", "AMD GPU", "Intel GPU"
  • model: Specific model name (e.g., “A100”, “MI100”)

You must ensure that the metric collector as well as the Slurm adapter use the same identifier format (PCI-E address) for the accelerator resource ID, for consistency.
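
On NVIDIA systems, one way to obtain the PCI-E addresses for the accelerators list is nvidia-smi; this is a sketch, and the reported bus IDs should be used verbatim as the id values:

# List GPU PCI-E bus IDs (and models) to use as accelerator ids
nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv,noheader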

9 - How to customize cc-backend

Add legal texts, modify login page, and add custom logo.

Overview

Customizing cc-backend means changing the logo, legal texts, and the login template instead of the placeholders. You can also place a text file in ./var to add dynamic status or notification messages to the ClusterCockpit homepage.

To replace the imprint.tmpl and privacy.tmpl legal texts, you can place your version in ./var/. At startup cc-backend will check if ./var/imprint.tmpl and/or ./var/privacy.tmpl exist and use them instead of the built-in placeholders. You can use the placeholders in web/templates as a blueprint.
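
A minimal sketch of the copy step, assuming you run it from the cc-backend root directory and that the placeholder files carry the same names as their targets:

cp web/templates/imprint.tmpl ./var/imprint.tmpl
cp web/templates/privacy.tmpl ./var/privacy.tmpl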

Replace login template

To replace the default login layout and styling, you can place your version in ./var/. At startup cc-backend will check if ./var/login.tmpl exists and use it instead of the built-in placeholder. You can use the default template web/templates/login.tmpl as a blueprint.

To change the logo displayed in the navigation bar, you can provide the file logo.png in the folder ./var/img/. On startup cc-backend will check if the folder exists and use the images provided there instead of the built-in images. You may also place additional images there you use in a custom login template.

Add notification banner on homepage

To add a notification banner you can add a file notice.txt to ./var. As long as this file is present, its contents are shown in an info banner on the homepage.

10 - How to deploy and update cc-backend

Recommended deployment and update workflow for production use

Workflow for deployment

It is recommended to install all ClusterCockpit components in a common directory, e.g. /opt/monitoring, /var/monitoring or /var/clustercockpit. In the following we use /opt/monitoring.

Two systemd services run on the central monitoring server:

  • clustercockpit: the cc-backend binary in /opt/monitoring/cc-backend.
  • cc-metric-store: the cc-metric-store binary in /opt/monitoring/cc-metric-store.

ClusterCockpit is deployed as a single binary that embeds all static assets. We recommend keeping all cc-backend binary versions in a folder archive and linking the currently active one from the cc-backend root. This allows for easy roll-back in case something doesn’t work.

Workflow to update

This example assumes the DB and job archive versions did not change. In case the new binary requires a newer database or job archive version read here how to migrate to newer versions.

  • Stop systemd service:
sudo systemctl stop clustercockpit.service
  • Backup the sqlite DB file! This is as simple as to copy it.
  • Copy new cc-backend binary to /opt/monitoring/cc-backend/archive (Tip: Use a date tag like YYYYMMDD-cc-backend). Here is an example:
cp ~/cc-backend /opt/monitoring/cc-backend/archive/20231124-cc-backend
  • Link from cc-backend root to current version
ln -s  /opt/monitoring/cc-backend/archive/20231124-cc-backend /opt/monitoring/cc-backend/cc-backend
  • Start systemd service:
sudo systemctl start clustercockpit.service
  • Check if everything is ok:
sudo systemctl status clustercockpit.service
  • Check log for issues:
sudo journalctl -u clustercockpit.service
  • Check the ClusterCockpit web frontend and your Slurm adapters if anything is broken!

11 - How to enable and configure auto-tagging

Enable automatic job tagging for application detection and job classification

Overview

ClusterCockpit provides automatic job tagging to classify and categorize jobs based on configurable rules. The tagging system consists of two components:

  1. Application Detection - Identifies which application a job is running by matching patterns in the job script
  2. Job Classification - Analyzes job performance metrics to identify performance issues or characteristics

Tags are automatically applied when jobs start or stop, and can also be applied retroactively to existing jobs. This feature is disabled by default and must be explicitly enabled in the configuration.

Enable auto-tagging

Step 1: Copy configuration files

The tagging system requires configuration files to define application patterns and classification rules. Example configurations are provided in the cc-backend repository at configs/tagger/.

From the cc-backend root directory, copy the configuration files to the var directory:

mkdir -p var/tagger
cp -r configs/tagger/apps var/tagger/
cp -r configs/tagger/jobclasses var/tagger/

This copies:

  • Application patterns (var/tagger/apps/) - Text files containing regex patterns to match application names in job scripts (16 example applications)
  • Job classification rules (var/tagger/jobclasses/) - JSON files defining rules to classify jobs based on metrics (3 example rules)
  • Shared parameters (var/tagger/jobclasses/parameters.json) - Common threshold values used across multiple classification rules

Step 2: Enable in configuration

Add or set the enable-job-taggers configuration option in your config.json:

{
  "enable-job-taggers": true
}

Important: Automatic tagging is disabled by default. Setting this to true activates automatic tagging for jobs that start or stop after cc-backend is restarted.

Step 3: Restart cc-backend

The tagging system loads configuration from ./var/tagger/ at startup:

./cc-backend -server

Step 4: Verify configuration loaded

Check the logs for messages indicating successful initialization:

[INFO] Setup file watch for ./var/tagger/apps
[INFO] Setup file watch for ./var/tagger/jobclasses

These messages confirm the tagging system is active and watching for configuration changes.

How auto-tagging works

Automatic tagging

When enable-job-taggers is set to true, tags are automatically applied at two points in the job lifecycle:

  • Job Start - Application detection runs immediately when a job starts, analyzing the job script to identify the application
  • Job Stop - Job classification runs when a job completes, analyzing metrics to identify performance characteristics

Note: Only jobs that start or stop after enabling the feature are automatically tagged. Existing jobs require manual tagging (see below).

Manual tagging (retroactive)

To apply tags to existing jobs in the database, use the -apply-tags command line option:

./cc-backend -apply-tags

This processes all jobs in the database and applies current tagging rules. This is useful when:

  • You have existing jobs created before tagging was enabled
  • You’ve added new tagging rules and want to apply them to historical data
  • You’ve modified existing rules and want to re-evaluate all jobs

The -apply-tags option works independently of the enable-job-taggers configuration setting.

Hot reload

The tagging system watches configuration directories for changes. You can modify or add rules without restarting cc-backend:

  • Changes to var/tagger/apps/* are detected automatically
  • Changes to var/tagger/jobclasses/* are detected automatically

Simply edit the files and the new rules will be applied to subsequent jobs.

Application detection

Application detection identifies which software a job is running by matching patterns in the job script.

Configuration format

Application patterns are stored in text files under var/tagger/apps/. Each file represents one application, and the filename (without .txt extension) becomes the tag name.

Each file contains one or more regular expression patterns, one per line:

Example: var/tagger/apps/vasp.txt

vasp
VASP

Example: var/tagger/apps/python.txt

python
pip
anaconda
conda

How it works

  1. When a job starts, the system retrieves the job script from metadata
  2. Each line in the app configuration files is treated as a regex pattern
  3. Patterns are matched case-insensitively against the lowercased job script
  4. If a match is found, a tag of type app with the filename as tag name is applied
  5. Only the first matching application is tagged

Adding new applications

To add detection for a new application:

  1. Create a new file in var/tagger/apps/ (e.g., tensorflow.txt)

  2. Add regex patterns, one per line:

    tensorflow
    tf\.keras
    import tensorflow
    
  3. The file is automatically detected and loaded (no restart required)

The tag name will be the filename without the .txt extension (e.g., tensorflow).

Provided application patterns

The example configuration includes patterns for 16 common HPC applications:

  • vasp
  • python
  • gromacs
  • lammps
  • openfoam
  • starccm
  • matlab
  • julia
  • cp2k
  • cpmd
  • chroma
  • flame
  • caracal
  • turbomole
  • orca
  • alf

Job classification

Job classification analyzes completed jobs based on their metrics and properties to identify performance issues or characteristics.

Configuration format

Job classification rules are defined in JSON files under var/tagger/jobclasses/. Each rule file contains:

  • Metrics required - Which job metrics to analyze
  • Requirements - Pre-conditions that must be met
  • Variables - Computed values used in the rule
  • Rule expression - Boolean expression that determines if the rule matches
  • Hint template - Message displayed when the rule matches

Shared parameters

The file var/tagger/jobclasses/parameters.json defines threshold values used across multiple rules:

{
  "lowcpuload_threshold_factor": 0.9,
  "excessivecpuload_threshold_factor": 1.1,
  "job_min_duration_seconds": 600.0,
  "sampling_interval_seconds": 30.0
}

These parameters can be referenced in rule expressions and make it easy to maintain consistent thresholds across multiple rules.

Rule file structure

Each classification rule is a JSON file with the following structure:

Example: var/tagger/jobclasses/lowload.json

{
  "name": "Low CPU load",
  "tag": "lowload",
  "parameters": ["lowcpuload_threshold_factor", "job_min_duration_seconds"],
  "metrics": ["cpu_load"],
  "requirements": [
    "job.shared == \"none\"",
    "job.duration > job_min_duration_seconds"
  ],
  "variables": [
    {
      "name": "load_threshold",
      "expr": "job.numCores * lowcpuload_threshold_factor"
    }
  ],
  "rule": "cpu_load.avg < cpu_load.limits.caution",
  "hint": "Average CPU load {{.cpu_load.avg}} falls below threshold {{.cpu_load.limits.caution}}"
}

Field descriptions

Field | Description
name | Human-readable description of the rule
tag | Tag identifier applied when the rule matches
parameters | List of parameter names from parameters.json to include in the rule environment
metrics | List of metrics required for evaluation (must be present in job data)
requirements | Boolean expressions that must all be true for the rule to be evaluated
variables | Named expressions computed before evaluating the main rule
rule | Boolean expression that determines if the job matches this classification
hint | Go template string for generating a user-visible message

Expression environment

Expressions in requirements, variables, and rule have access to:

Job properties:

  • job.shared - Shared node allocation type
  • job.duration - Job runtime in seconds
  • job.numCores - Number of CPU cores
  • job.numNodes - Number of nodes
  • job.jobState - Job completion state
  • job.numAcc - Number of accelerators
  • job.smt - SMT setting

Metric statistics (for each metric in metrics):

  • <metric>.min - Minimum value
  • <metric>.max - Maximum value
  • <metric>.avg - Average value
  • <metric>.limits.peak - Peak limit from cluster config
  • <metric>.limits.normal - Normal threshold
  • <metric>.limits.caution - Caution threshold
  • <metric>.limits.alert - Alert threshold

Parameters:

  • All parameters listed in the parameters field

Variables:

  • All variables defined in the variables array

Expression language

Rules use the expr language for expressions. Supported operations:

  • Arithmetic: +, -, *, /, %, ^
  • Comparison: ==, !=, <, <=, >, >=
  • Logical: &&, ||, !
  • Functions: Standard math functions (see expr documentation)

Hint templates

Hints use Go’s text/template syntax. Variables from the evaluation environment are accessible:

{{.cpu_load.avg}}     # Access metric average
{{.job.duration}}     # Access job property
{{.load_threshold}}   # Access computed variable

Adding new classification rules

To add a new classification rule:

  1. Create a new JSON file in var/tagger/jobclasses/ (e.g., memoryLeak.json)
  2. Define the rule structure following the format above
  3. Add any new parameters to parameters.json if needed
  4. The file is automatically detected and loaded (no restart required)

Example: Detecting memory leaks

{
  "name": "Memory Leak Detection",
  "tag": "memory_leak",
  "parameters": ["memory_leak_slope_threshold"],
  "metrics": ["mem_used"],
  "requirements": ["job.duration > 3600"],
  "variables": [
    {
      "name": "mem_growth",
      "expr": "(mem_used.max - mem_used.min) / job.duration"
    }
  ],
  "rule": "mem_growth > memory_leak_slope_threshold",
  "hint": "Memory usage grew by {{.mem_growth}} bytes per second"
}

Don’t forget to add memory_leak_slope_threshold to parameters.json.
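
The shared parameters file would then contain the new threshold alongside the existing values. The concrete value below is purely illustrative (roughly 1 MiB of growth per second) and must be tuned for your environment:

{
  "lowcpuload_threshold_factor": 0.9,
  "excessivecpuload_threshold_factor": 1.1,
  "job_min_duration_seconds": 600.0,
  "sampling_interval_seconds": 30.0,
  "memory_leak_slope_threshold": 1048576.0
}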

Provided classification rules

The example configuration includes 3 classification rules:

  • lowload - Detects jobs with low CPU load (avg CPU load below caution threshold)
  • excessiveload - Detects jobs with excessive CPU load (avg CPU load above peak × threshold factor)
  • lowutilization - Detects jobs with low resource utilization (flop rate below alert threshold)

Troubleshooting

Tags not applied

  1. Check tagging is enabled: Verify enable-job-taggers: true is set in config.json

  2. Check configuration exists:

    ls -la var/tagger/apps
    ls -la var/tagger/jobclasses
    
  3. Check logs for errors:

    ./cc-backend -server -loglevel debug
    
  4. Verify file permissions: Ensure cc-backend can read the configuration files

  5. For existing jobs: Use ./cc-backend -apply-tags to retroactively tag jobs

Rules not matching

  1. Enable debug logging: Set log level to debug to see detailed rule evaluation:

    ./cc-backend -server -loglevel debug
    
  2. Check requirements: Ensure all requirements in the rule are satisfied

  3. Verify metrics exist: Classification rules require job metrics to be available in the job data

  4. Check metric names: Ensure metric names in rules match those in your cluster configuration

File watch not working

If changes to configuration files aren’t detected automatically:

  1. Restart cc-backend to reload all configuration
  2. Check filesystem supports file watching (some network filesystems may not support inotify)
  3. Check logs for file watch setup messages

Best practices

  1. Start simple: Begin with basic rules and refine based on results
  2. Use requirements: Filter out irrelevant jobs early with requirements to avoid unnecessary metric processing
  3. Test incrementally: Add one rule at a time and verify behavior before adding more
  4. Document rules: Use descriptive names and clear hint messages
  5. Share parameters: Define common thresholds in parameters.json for consistency
  6. Version control: Keep your var/tagger/ configuration in version control to track changes
  7. Backup before changes: Test new rules on a development instance before deploying to production

Tag types and usage

The tagging system creates two types of tags:

  • app - Application tags (e.g., “vasp”, “gromacs”, “python”)
  • jobClass - Classification tags (e.g., “lowload”, “excessiveload”, “lowutilization”)

Tags can be:

  • Queried and filtered in the ClusterCockpit UI
  • Used in API queries to find jobs with specific characteristics
  • Referenced in reports and analytics

Tags are stored in the database and appear in the job details view, making it easy to identify application usage and performance patterns across your cluster.

12 - How to generate JWT tokens

Overview

ClusterCockpit uses JSON Web Tokens (JWT) for authorization of its APIs. JWTs are the industry standard for securing APIs and are also used, for example, in OAuth2. For details on JWTs refer to the JWT article in the Concepts section.

JWT tokens for cc-backend login and REST API

When a user logs in via the /login page using a browser, a session cookie (secured using the random bytes in the SESSION_KEY environment variable, which you should also change in production) is used for all requests after the successful login. JWTs make it easier to use the ClusterCockpit APIs from scripts or other external programs. The token is specified in the Authorization HTTP header using the Bearer scheme (there is an example below). Tokens can be issued to users from the configuration view in the Web-UI or on the command line (using the -jwt <username> option). In order to use the token for API endpoints such as /api/jobs/start_job/, the user that executes it needs to have the api role. Regular users can only perform read-only queries and only look at data connected to jobs they started themselves.

There are two usage scenarios:

  • The APIs are used during a browser session. API accesses are authorized with the active session.
  • The REST API is used outside a browser session, e.g. by scripts. In this case you have to issue a token manually. This is possible from within the configuration view or on the command line. It is recommended to issue the JWT token for a dedicated user that only has the api role. By using different users for different purposes, fine-grained access control and access revocation management are possible.

The token is commonly specified in the Authorization HTTP header using the Bearer scheme. ClusterCockpit uses an Ed25519 private/public key pair to sign and verify its tokens. You can use cc-backend to generate new JWT tokens.

Create a new ECDSA Public/private key pair for signing and validating tokens

We provide a small utility tool as part of cc-backend:

go build ./cmd/gen-keypair/
./gen-keypair

Add key pair in your .env file for cc-backend

An env file template can be found in ./configs. cc-backend requires the private key to sign newly generated JWT tokens and the public key to validate tokens used to authenticate in its REST APIs.
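
As a rough sketch, the two entries could look like the lines below; check the env template in ./configs for the exact variable names (the names here are an assumption), and treat the values as placeholders produced by gen-keypair:

JWT_PUBLIC_KEY="<base64 encoded Ed25519 public key>"
JWT_PRIVATE_KEY="<base64 encoded Ed25519 private key>"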

Generate new JWT token

Every user with the admin role can create or change a user in the configuration view of the web interface. To generate a new JWT for a user, press the GenJWT button next to the user name in the user list.

A new api user and corresponding JWT keys can also be generated from the command line.

Create new API user with admin and api role:

./cc-backend -add-user myapiuser:admin,api:<password>

Create a new JWT token for this user:

./cc-backend -jwt myapiuser

Use the issued token on the client side

curl -X GET "<API ENDPOINT>" -H "accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer <JWT TOKEN>"

This token can be used for the cc-backend REST API as well as for the cc-metric-store. If you use the token for cc-metric-store, you have to configure cc-metric-store to use the corresponding public key for validation in its config.json.

Of course, the JWT token can also be generated by other means, as long as it is signed with an Ed25519 private key and the corresponding public key is configured in cc-backend or cc-metric-store. For the claims that are set and used by ClusterCockpit refer to the JWT article.

cc-metric-store

The cc-metric-store also uses JWTs for authentication. As it does not issue new tokens, it does not need to know the private key. The public key of the keypair that is used to generate the JWTs that grant access to the cc-metric-store can be specified in its config.json. When configuring the metricDataRepository object in the cluster.json file of the job-archive, you can put a token issued by cc-backend itself.

Other tools to generate signed tokens

The golang-jwt project provides a small command line tool to sign and verify tokens. You can install it with:

 go install github.com/golang-jwt/jwt/v5/cmd/jwt

OpenSSL can be used to generate ED25519 key-pairs:

# Generate ed25519 private key
openssl genpkey -algorithm ed25519 -out privkey.pem
# export its public key
openssl pkey -in privkey.pem -pubout -out pubkey.pem

13 - How to plan and configure resampling

Configure metric resampling

Enable timeseries resampling

ClusterCockpit now supports resampling of time series data to a lower frequency. This dramatically improves load times for very large or very long jobs, and we recommend enabling it. Resampling is supported for running as well as for finished jobs.

Note: For running jobs, this currently only works with the newest version of cc-metric-store. Resampling support for the Prometheus time series database will be added in the future.

Resampling Algorithm

To preserve visual accuracy while reducing data points, ClusterCockpit utilizes the Largest-Triangle-Three-Buckets (LTTB) algorithm.

Standard downsampling methods often fail to represent data accurately:

  • Averaging: Tends to smooth out important peaks and valleys, hiding critical performance spikes.
  • Decimation (Step sampling): Simply skips points, which can lead to random data loss and missed outliers.

In contrast, LTTB uses a geometric approach: within each bucket it selects the data point that forms the largest triangle with its neighboring selections. This technique creates a downsampled representation that retains the perceptual shape of the original line graph, ensuring that significant extrema and performance trends remain visible even at lower resolutions.

Configuration

To enable resampling, you must add the following top-level configuration key:

"resampling": {
  "minimum-points": 300,
  "trigger": 30,
  "resolutions": [
    600,
    300,
    120,
    60
  ]
}

Configuration Parameters

The resampling object is optional. If configured, it enables dynamic downsampling of metric data using the following properties:

  • minimum-points (Integer) Specifies the minimum number of data points required to trigger resampling. This ensures short jobs are not unnecessarily downsampled.

    • Example: If minimum-points is set to 300 and the native frequency is 60 seconds, resampling will only trigger for jobs longer than 5 hours (300 points × 60 seconds = 18,000 seconds = 5 hours).
  • resolutions (Array [Integer]) An array of target resampling resolutions in seconds.

    • Example: [600, 300, 120, 60]
    • Note: The finest resolution in this list must match the native resolution of your metrics. If you have different native resolutions across your metric configuration, use the finest available resolution here. The implementation will automatically fall back to the finest available resolution if an exact match isn’t found.
  • trigger (Integer) Controls the zoom behavior. It specifies the threshold of visible data points required to trigger the next zoom level. When the number of visible points in the plot window drops below this value (due to zooming in), the backend loads the next finer resolution level.

Example view of resampling in graphs

The following examples demonstrate how the configuration above (minimum-points: 300, trigger: 30) affects the visualization of a 16-hour job.

1. Initial Overview (Coarse Resolution)

Because the job duration (~16 hours) would require more than 300 points at native resolution, the system automatically loads the 600 s resolution. This provides a fast overview load without fetching high-frequency data. The tooltip in this example shows data points every 10 minutes, corresponding to the 600 s resolution.

Initial overview at 600s resolution

2. Zooming without Triggering

When the user zooms in, the system checks whether the number of visible data points in the new view is less than the trigger value (30). In the example below, the zoomed window still contains enough points, so the resolution remains at 600 s. As the tooltip shows, we still see data points every 10 minutes.

Zoom action that does not trigger update

3. Zooming and Triggering Detail

As the user zooms in deeper, the number of visible points drops below the trigger threshold of 30. This signals the backend to fetch the next finer resolution (e.g., 120 s or 60 s). The graph updates dynamically to show the high-frequency peaks that were previously smoothed out. In this example, the backend detects that the number of selected data points is below the trigger threshold and loads the second-to-last resampling level with a resolution of 120 s. With a native frequency of 60 s, a resolution of 120 s corresponds to one data point every 2 minutes, as seen in the tooltip.

Zoom action triggering finer resolution

4. Visual Comparison

The animation below highlights the difference in visual density and performance between the raw data and the optimized resampled view. With minimum-points set to 300, resampling only triggers for jobs longer than 5 hours (assuming a native frequency of 60 seconds).

Comparison of resampling

Suggestion on configuring the resampling

Based on our experiments and the performance we have observed, we recommend to:

  1. Configure "minimum-points": 900. Assuming a native frequency of 60 seconds, resampling will then only trigger for jobs longer than 15 hours.
  2. Configure "resolutions" with only 2 or 3 levels, the last level being the native frequency. A resampling resolution of 600 is only recommended for jobs longer than 24 hours.
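
Putting this together, a possible configuration following these suggestions could look like the sketch below. The intermediate resolution of 120 and the trigger value of 30 are assumptions you should adapt to your native metric frequency and plotting behavior:

"resampling": {
  "minimum-points": 900,
  "trigger": 30,
  "resolutions": [
    120,
    60
  ]
}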

14 - How to regenerate the Swagger UI documentation

Overview

This project integrates Swagger UI to document and test its REST API. The Swagger documentation files can be found in ./api/.

Generate Swagger UI files

You can generate the swagger-ui configuration by running the following command from the cc-backend root directory:

go run github.com/swaggo/swag/cmd/swag init -d ./internal/api,./pkg/schema -g rest.go -o ./api

You need to move one generated file:

mv ./api/docs.go ./internal/api/docs.go

Finally rebuild cc-backend:

make

Use the Swagger UI web interface

If you start cc-backend with the -dev flag, the Swagger web interface is available at http://localhost:8080/swagger/. To use the Try Out functionality, e.g. to test the REST API, you must enter a JWT key for a user with the API role.

15 - How to setup a systemd service

Run ClusterCockpit components as systemd services

How to run as a systemd service.

The files in this directory assume that you install ClusterCockpit to /opt/monitoring/cc-backend. Of course you can choose any other location, but make sure you replace all paths starting with /opt/monitoring/cc-backend in the clustercockpit.service file!

The config.json may contain the optional fields user and group. If specified, the application will call setuid and setgid after reading the config file and binding to a TCP port (so it can use a privileged port), but before it starts accepting any connections. This is good for security, but it also means that the var/ directory must be readable and writeable by this user. The .env and config.json files may contain secrets and should not be readable by this user, as they are read before privileges are dropped. If these files are changed, the server must be restarted.
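
A minimal sketch of the two relevant config.json fields, assuming a dedicated unprivileged system user and group both named clustercockpit (the account names are an assumption; the rest of your configuration is omitted here):

{
  "user": "clustercockpit",
  "group": "clustercockpit"
}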

  1. Clone this repository somewhere in your home
git clone git@github.com:ClusterCockpit/cc-backend.git
  2. (Optional) Install dependencies and build. In general it is recommended to use the provided release binaries.
cd cc-backend && make

Copy the binary to the target folder (adapt if necessary):

sudo mkdir -p /opt/monitoring/cc-backend/
cp ./cc-backend /opt/monitoring/cc-backend/
  3. Modify the config.json and env-template.txt files from the configs directory to your liking and put them in the target directory
cp ./configs/config.json /opt/monitoring/config.json && cp ./configs/env-template.txt /opt/monitoring/.env
vim /opt/monitoring/config.json # do your thing...
vim /opt/monitoring/.env # do your thing...
  4. (Optional) Customization: Add your versions of the login view, legal texts, and logo image. You may use the templates in ./web/templates as a blueprint. Each override works independently.
cp login.tmpl /opt/monitoring/cc-backend/var/
cp imprint.tmpl /opt/monitoring/cc-backend/var/
cp privacy.tmpl /opt/monitoring/cc-backend/var/
# Ensure your logo and any images you use in your login template have a suitable size.
cp -R img /opt/monitoring/cc-backend/var/img
  5. Copy the systemd service unit file. You may adapt it to your needs.
sudo cp ./init/clustercockpit.service /etc/systemd/system/clustercockpit.service
  6. Enable and start the server
sudo systemctl enable clustercockpit.service # optional: if enabled, the service is (re-)started automatically at boot
sudo systemctl start clustercockpit.service

Check what's going on:

sudo systemctl status clustercockpit.service
sudo journalctl -u clustercockpit.service

16 - How to use the REST API Endpoints

Overview

ClusterCockpit offers several REST API endpoints. While some are an integral part of the ClusterCockpit stack workflow (such as start_job), others are optional. These optional endpoints supplement the functionality of the web interface with information reachable from scripts or the command line. For example, job metrics can be requested for specific jobs and processed in external statistics programs.

All of the endpoints listed for both administrators and users are secured by JWT authentication. As such, all prerequisites applicable to JSON Web Tokens apply in this case as well, e.g. private and public key setup.

See also the Swagger Reference for more detailed information on each endpoint and the payloads.

Admin Accessible REST API

Admin API Prerequisites

  1. The JWT has to be generated either by a dedicated API user (which has only the api role) or by an administrator with both the admin and api roles.
  2. JWTs have a limited lifetime, i.e. they become invalid after a configurable amount of time (see the auth.jwt.max-age config option).
  3. Administrator endpoints are additionally subject to a configurable IP whitelist (see the api-allowed-ips config option). By default there is no restriction on the IPs that can access the endpoints.

Admin API Endpoints and Functions

Endpoint | Method | Request Payload(s) | Description
/api/users/ | GET | - | Lists all users
/api/clusters/ | GET | - | Lists all clusters
/api/tags/ | DELETE | JSON payload | Removes the payload array of tags specified with Type, Name, Scope from the DB. Private tags cannot be removed.
/api/jobs/start_job/ | POST, PUT | JSON payload | Starts a job
/api/jobs/stop_job/ | POST, PUT | JSON payload | Stops a job
/api/jobs/ | GET | URL query params | Lists jobs
/api/jobs/{id} | POST | $id, JSON payload | Loads the specified job's metadata
/api/jobs/{id} | GET | $id | Loads the specified job with metrics
/api/jobs/tag_job/{id} | POST, PATCH | $id, JSON payload | Adds the payload array of tags specified with Type, Name, Scope to the job with $id. Tags are created in the DB.
/api/jobs/tag_job/{id} | POST, PATCH | $id, JSON payload | Removes the payload array of tags specified with Type, Name, Scope from the job with $id. Tags remain in the DB.
/api/jobs/edit_meta/{id} | POST, PATCH | $id, JSON payload | Edits the meta_data DB column info
/api/jobs/metrics/{id} | GET | $id, URL query params | Loads the specified job metrics for the given metric and scope params
/api/jobs/delete_job/ | DELETE | JSON payload | Deletes the job specified in the payload
/api/jobs/delete_job/{id} | DELETE | $id, JSON payload | Deletes the job specified by DB id
/api/jobs/delete_job_before/{ts} | DELETE | $ts | Deletes all jobs before the specified Unix timestamp
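
As a usage sketch, listing jobs via the admin REST API could look like the following; the host name is a placeholder, and any filter query parameters are omitted:

curl -X GET "https://monitoring.example.com/api/jobs/" \
  -H "accept: application/json" \
  -H "Authorization: Bearer <JWT TOKEN>"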

User Accessible REST API

User API Prerequisites

  1. The JWT has to be generated either by a dedicated API user (which has only the api role) or by a user with an additional api role.
  2. JWTs have a limited lifetime, i.e. they become invalid after a configurable amount of time (see the jwt.max-age config option).

User API Endpoints and Functions

Endpoint | Method | Request | Description
/userapi/jobs/ | GET | URL query params | Lists jobs
/userapi/jobs/{id} | POST | $id, JSON payload | Loads the specified job's metadata
/userapi/jobs/{id} | GET | $id | Loads the specified job with metrics
/userapi/jobs/metrics/{id} | GET | $id, URL query params | Loads the specified job metrics for the given metric and scope params

17 - How to use the Swagger UI documentation

Overview

This project integrates Swagger UI to document and test its REST API. The Swagger documentation files can be found in ./api/.

Access the Swagger UI web interface

If you start cc-backend with the -dev flag, the Swagger web interface is available at http://localhost:8080/swagger/. To use the Try Out functionality, e.g. to test the REST API, you must enter a JWT key for a user with the API role.