1 - Configure retention policies

Managing database and job archive size with retention policies

Overview

Over time, the ClusterCockpit database and job archive can grow significantly, especially in production environments with high job counts. Retention policies help keep your storage at a manageable size by automatically removing or archiving old jobs.

Why use retention policies?

Without retention policies:

  • The SQLite database file can grow to tens of gigabytes
  • The job archive can reach terabytes in size
  • Storage requirements increase indefinitely
  • System performance may degrade

A typical multi-cluster setup over 5 years can accumulate:

  • 75 GB for the SQLite database
  • 1.4 TB for the job archive

Retention policies allow you to balance data retention needs with storage capacity.

Retention policy options

ClusterCockpit supports three retention policies:

None (default)

No automatic cleanup. Jobs are kept indefinitely.

{
  "archive": {
    "kind": "file",
    "path": "./var/job-archive"
  }
}

Delete

Permanently removes jobs older than the specified age from both the job archive and the database.

Use when:

  • Storage space is limited
  • You don’t need long-term job data
  • You have external backups or data exports

Configuration example:

{
  "archive": {
    "kind": "file",
    "path": "./var/job-archive",
    "retention": {
      "policy": "delete",
      "age": 365,
      "includeDB": true
    }
  }
}

This configuration will:

  • Delete jobs older than 365 days
  • Remove them from both the job archive and database
  • Run automatically based on the cleanup interval

Move

Moves old jobs to a separate location for long-term archival while removing them from the active database.

Use when:

  • You need to preserve historical data
  • You want to reduce active database size
  • You can store archived data on cheaper, slower storage

Configuration example:

{
  "archive": {
    "kind": "file",
    "path": "./var/job-archive",
    "retention": {
      "policy": "move",
      "age": 365,
      "location": "/mnt/archive/old-jobs",
      "includeDB": true
    }
  }
}

This configuration will:

  • Move jobs older than 365 days to /mnt/archive/old-jobs
  • Remove them from the active database
  • Preserve the data for potential future analysis

Configuration parameters

archive.retention section

Parameter   Type      Required   Default   Description
policy      string    Yes        -         Retention policy: none, delete, or move
age         integer   No         7         Age threshold in days. Jobs older than this are affected
includeDB   boolean   No         true      Also remove jobs from the database (not just archive)
location    string    For move   -         Target directory for moved jobs (only for move policy)

Complete configuration examples

Example 1: One-year retention with deletion

Suitable for environments with limited storage:

{
  "archive": {
    "kind": "file",
    "path": "./var/job-archive",
    "retention": {
      "policy": "delete",
      "age": 365,
      "includeDB": true
    }
  }
}

Example 2: Two-tier archival system

Keep 6 months active, move older data to long-term storage:

{
  "archive": {
    "kind": "file",
    "path": "./var/job-archive",
    "retention": {
      "policy": "move",
      "age": 180,
      "location": "/mnt/slow-storage/archive",
      "includeDB": true
    }
  }
}

Example 3: S3 backend with retention

Using S3 object storage with one-year retention:

{
  "archive": {
    "kind": "s3",
    "endpoint": "https://s3.example.com",
    "bucket": "clustercockpit-jobs",
    "access-key": "your-access-key",
    "secret-key": "your-secret-key",
    "retention": {
      "policy": "delete",
      "age": 365,
      "includeDB": true
    }
  }
}

How retention policies work

  1. Automatic execution: Retention policies run automatically based on the configured interval
  2. Age calculation: Jobs are evaluated based on their startTime field
  3. Batch processing: All jobs older than the specified age are processed in one operation
  4. Database cleanup: When includeDB: true, corresponding database entries are removed
  5. Archive handling: Based on policy (delete removes, move relocates)
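
Before enabling delete or move, it can help to estimate how many jobs a given age threshold would affect. This is a minimal sketch assuming the file-based archive layout used in the examples above, where every job directory contains a meta.json file; it uses file modification time as a rough stand-in for the job startTime, so treat the numbers as estimates only.

# Count jobs whose meta.json is older than 365 days (approximation)
find ./var/job-archive -name meta.json -mtime +365 | wc -l

# Sum up the disk space used by those job directories (GNU find/du)
find ./var/job-archive -name meta.json -mtime +365 -printf '%h\0' \
  | du -sch --files0-from=- | tail -n 1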

Best practices

Planning retention periods

Consider these factors when setting the age parameter:

  • Accounting requirements: Some organizations require job data for billing/auditing
  • Research needs: Longer retention for research clusters where users may need historical data
  • Storage capacity: Available disk space and growth rate
  • Compliance: Legal or institutional data retention policies

Recommended retention periods:

Use Case                        Suggested Age
Development/testing             30-90 days
Production (limited storage)    180-365 days
Production (ample storage)      365-730 days
Research/archival               730+ days or use move policy

Storage considerations

For move policy

  • Mount the target location on slower, cheaper storage (e.g., spinning disks, network storage)
  • Ensure sufficient space at the target location
  • Consider periodic backups of the moved archive
  • Document the archive structure for future retrieval
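
Before enabling the move policy, a quick check of the target location can prevent silent failures. A minimal sketch, using the target path from the example above and assuming a hypothetical service user clustercockpit for cc-backend:

# Check free space at the move target
df -h /mnt/archive/old-jobs

# Verify the service user can write there (user name is an assumption)
sudo -u clustercockpit touch /mnt/archive/old-jobs/.write-test
sudo -u clustercockpit rm /mnt/archive/old-jobs/.write-test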

For delete policy

  • Create backups first: Always backup your database and job archive before enabling deletion
  • Test on a copy: Verify the retention policy works as expected on test data
  • Export important data: Consider exporting summary statistics or critical job data before deletion

Monitoring and maintenance

  1. Track archive size: Monitor growth to adjust retention periods

    du -sh /var/job-archive
    du -sh /path/to/database.db
    
  2. Verify retention execution: Check logs for retention policy runs

    grep -i retention /var/log/cc-backend.log
    
  3. Regular backups: Backup before changing retention settings

    cp -r /var/job-archive /backup/job-archive-$(date +%Y%m%d)
    cp /var/clustercockpit.db /backup/clustercockpit-$(date +%Y%m%d).db
    

Restoring deleted jobs

If using move policy

Jobs moved to the retention location can be restored:

  1. Stop cc-backend

  2. Use the archive-manager tool to import jobs back:

    cd tools/archive-manager
    go build
    ./archive-manager -import \
      -src-config '{"kind":"file","path":"/mnt/archive/old-jobs"}' \
      -dst-config '{"kind":"file","path":"./var/job-archive"}'
    
  3. Rebuild database from archive:

    ./cc-backend -init-db
    
  4. Restart cc-backend

If using delete policy

Jobs cannot be restored unless you have external backups. This is why backups are critical before enabling deletion.

Troubleshooting

Retention policy not running

Check:

  1. Verify archive.retention is properly configured in config.json
  2. Ensure cc-backend was restarted after configuration changes
  3. Check logs for errors: grep -i retention /var/log/cc-backend.log

Database size not decreasing

Possible causes:

  • includeDB: false - Database entries are not being removed

  • SQLite doesn’t automatically reclaim space - run VACUUM:

    sqlite3 /var/clustercockpit.db "VACUUM;"
    

Jobs not being moved to target location

Check:

  1. Target directory exists and is writable
  2. Sufficient disk space at target location
  3. File permissions allow cc-backend to write to location
  4. Path in location is absolute, not relative

Performance impact

If retention policy execution causes performance issues:

  • Consider running during off-peak hours (feature may require manual execution)
  • Reduce the number of old jobs by running retention more frequently with shorter age periods
  • Use more powerful hardware for the database operations


2 - How to set up hierarchical metric collection

Configure multiple cc-metric-collector instances to receive metrics from compute nodes and forward them to the backend

Overview

In large HPC clusters, it’s often impractical or undesirable to have every compute node connect directly to the central database. A hierarchical collection setup allows you to:

  • Reduce database connections: Instead of hundreds of nodes connecting directly, use aggregation nodes as intermediaries
  • Improve network efficiency: Aggregate metrics at rack or partition level before forwarding
  • Add processing layers: Filter, transform, or enrich metrics at intermediate collection points
  • Increase resilience: Buffer metrics during temporary database outages

This guide shows how to configure multiple cc-metric-collector instances where compute nodes send metrics to aggregation nodes, which then forward them to the backend database.

Architecture

flowchart TD
  subgraph Rack1 ["Rack 1 - Compute Nodes"]
    direction LR
    node1["Node 1<br/>cc-metric-collector"]
    node2["Node 2<br/>cc-metric-collector"]
    node3["Node 3<br/>cc-metric-collector"]
  end

  subgraph Rack2 ["Rack 2 - Compute Nodes"]
    direction LR
    node4["Node 4<br/>cc-metric-collector"]
    node5["Node 5<br/>cc-metric-collector"]
    node6["Node 6<br/>cc-metric-collector"]
  end

  subgraph Aggregator ["Aggregation Node"]
    ccrecv["cc-metric-collector<br/>(with receivers)"]
  end

  subgraph Backend ["Backend Server"]
    ccms[("cc-metric-store")]
    ccweb["cc-backend<br/>(Web Frontend)"]
  end
  
  node1 --> ccrecv
  node2 --> ccrecv
  node3 --> ccrecv
  node4 --> ccrecv
  node5 --> ccrecv
  node6 --> ccrecv
  
  ccrecv --> ccms
  ccms <--> ccweb

Components

  1. Compute Node Collectors: Run on each compute node, collect local metrics, forward to aggregation node
  2. Aggregation Node: Receives metrics from multiple compute nodes, optionally processes them, forwards to cc-metric-store
  3. cc-metric-store: In-memory time-series database for metric storage and retrieval
  4. cc-backend: Web frontend that queries cc-metric-store and visualizes metrics

Configuration

Step 1: Configure Compute Nodes

Compute nodes collect local metrics and send them to the aggregation node using a network sink (NATS or HTTP).

NATS provides better performance, reliability, and built-in clustering support.

config.json:

{
  "sinks-file": "/etc/cc-metric-collector/sinks.json",
  "collectors-file": "/etc/cc-metric-collector/collectors.json",
  "receivers-file": "/etc/cc-metric-collector/receivers.json",
  "router-file": "/etc/cc-metric-collector/router.json",
  "main": {
    "interval": "10s",
    "duration": "1s"
  }
}

sinks.json:

{
  "nats_aggregator": {
    "type": "nats",
    "host": "aggregator.example.org",
    "port": "4222",
    "subject": "metrics.rack1"
  }
}

collectors.json (enable metrics you need):

{
  "cpustat": {},
  "memstat": {},
  "diskstat": {},
  "netstat": {},
  "loadavg": {},
  "tempstat": {}
}

router.json (add identifying tags):

{
  "interval_timestamp": true,
  "process_messages": {
    "manipulate_messages": [
      {
        "add_base_tags": {
          "cluster": "mycluster",
          "rack": "rack1"
        }
      }
    ]
  }
}

receivers.json (empty for compute nodes):

{}

Using HTTP

HTTP is simpler but less efficient for high-frequency metrics.

sinks.json (HTTP alternative):

{
  "http_aggregator": {
    "type": "http",
    "host": "aggregator.example.org",
    "port": "8080",
    "path": "/api/write",
    "idle_connection_timeout": "5s",
    "timeout": "3s"
  }
}

Step 2: Configure Aggregation Node

The aggregation node receives metrics from compute nodes via receivers and forwards them to the backend database.

config.json:

{
  "sinks-file": "/etc/cc-metric-collector/sinks.json",
  "collectors-file": "/etc/cc-metric-collector/collectors.json",
  "receivers-file": "/etc/cc-metric-collector/receivers.json",
  "router-file": "/etc/cc-metric-collector/router.json",
  "main": {
    "interval": "10s",
    "duration": "1s"
  }
}

receivers.json (receive from compute nodes):

{
  "nats_rack1": {
    "type": "nats",
    "address": "localhost",
    "port": "4222",
    "subject": "metrics.rack1"
  },
  "nats_rack2": {
    "type": "nats",
    "address": "localhost",
    "port": "4222",
    "subject": "metrics.rack2"
  }
}

sinks.json (forward to cc-metric-store):

{
  "metricstore": {
    "type": "http",
    "host": "backend.example.org",
    "port": "8082",
    "path": "/api/write",
    "idle_connection_timeout": "5s",
    "timeout": "5s",
    "jwt": "eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDbVXKrQr4jNiQV-B_1-uaL_lW8d8gGb-TSAG9KdMg"
  }
}

Note: The jwt token must be signed with the private key corresponding to the public key configured in cc-metric-store. See JWT generation guide for details.
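
cc-metric-store expects Ed25519 (EdDSA) keys; the jwt-public-key value in the Step 3 example below is a base64-encoded raw 32-byte public key. As a sketch, such a key pair can be created with OpenSSL (1.1.1 or newer); the file name is a placeholder and the authoritative steps are in the JWT generation guide:

# Generate an Ed25519 private key (keep it for signing JWTs)
openssl genpkey -algorithm ed25519 -out ed25519-private.pem

# Extract the raw public key, base64-encoded, for use as "jwt-public-key"
openssl pkey -in ed25519-private.pem -pubout -outform DER | tail -c 32 | base64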

collectors.json (optionally collect local metrics):

{
  "cpustat": {},
  "memstat": {},
  "loadavg": {}
}

router.json (optionally process metrics):

{
  "interval_timestamp": false,
  "num_cache_intervals": 0,
  "process_messages": {
    "manipulate_messages": [
      {
        "add_base_tags": {
          "datacenter": "dc1"
        }
      }
    ]
  }
}

Step 3: Set Up cc-metric-store

The backend server needs cc-metric-store to receive and store metrics from the aggregation node.

config.json (/etc/cc-metric-store/config.json):

{
  "metrics": {
    "cpu_user": {
      "frequency": 10,
      "aggregation": "avg"
    },
    "cpu_system": {
      "frequency": 10,
      "aggregation": "avg"
    },
    "mem_used": {
      "frequency": 10,
      "aggregation": null
    },
    "mem_total": {
      "frequency": 10,
      "aggregation": null
    },
    "net_bw": {
      "frequency": 10,
      "aggregation": "sum"
    },
    "flops_any": {
      "frequency": 10,
      "aggregation": "sum"
    },
    "mem_bw": {
      "frequency": 10,
      "aggregation": "sum"
    }
  },
  "http-api": {
    "address": "0.0.0.0:8082"
  },
  "jwt-public-key": "kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0=",
  "retention-in-memory": "48h",
  "checkpoints": {
    "interval": "12h",
    "directory": "/var/lib/cc-metric-store/checkpoints",
    "restore": "48h"
  },
  "archive": {
    "interval": "24h",
    "directory": "/var/lib/cc-metric-store/archive"
  }
}

Important configuration notes:

  • metrics: Must list ALL metrics you want to store. Only configured metrics are accepted.
  • frequency: Must match the collection interval from cc-metric-collector (in seconds)
  • aggregation: "sum" for resource metrics (bandwidth, FLOPS), "avg" for diagnostic metrics (CPU %), null for node-only metrics
  • jwt-public-key: Must correspond to the private key used to sign JWT tokens in the aggregation node sink configuration
  • retention-in-memory: How long to keep metrics in memory (should cover typical job durations)

Install cc-metric-store:

# Download binary
wget https://github.com/ClusterCockpit/cc-metric-store/releases/latest/download/cc-metric-store

# Install
sudo mkdir -p /opt/monitoring/cc-metric-store
sudo mv cc-metric-store /opt/monitoring/cc-metric-store/
sudo chmod +x /opt/monitoring/cc-metric-store/cc-metric-store

# Create directories
sudo mkdir -p /var/lib/cc-metric-store/checkpoints
sudo mkdir -p /var/lib/cc-metric-store/archive
sudo mkdir -p /etc/cc-metric-store

Create systemd service (/etc/systemd/system/cc-metric-store.service):

[Unit]
Description=ClusterCockpit Metric Store
After=network.target

[Service]
Type=simple
User=cc-metricstore
Group=cc-metricstore
WorkingDirectory=/opt/monitoring/cc-metric-store
ExecStart=/opt/monitoring/cc-metric-store/cc-metric-store -config /etc/cc-metric-store/config.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Start cc-metric-store:

# Create user
sudo useradd -r -s /bin/false cc-metricstore
sudo chown -R cc-metricstore:cc-metricstore /var/lib/cc-metric-store

# Start service
sudo systemctl daemon-reload
sudo systemctl start cc-metric-store
sudo systemctl enable cc-metric-store

# Check status
sudo systemctl status cc-metric-store

Step 4: Set Up NATS Server

The aggregation node needs a NATS server to receive metrics from compute nodes.

Install NATS:

# Using Docker
docker run -d --name nats -p 4222:4222 nats:latest

# Using package manager (example for Ubuntu/Debian)
curl -L https://github.com/nats-io/nats-server/releases/download/v2.10.5/nats-server-v2.10.5-linux-amd64.zip -o nats-server.zip
unzip nats-server.zip
sudo mv nats-server-v2.10.5-linux-amd64/nats-server /usr/local/bin/

NATS Configuration (/etc/nats/nats-server.conf):

listen: 0.0.0.0:4222
max_payload: 10MB
max_connections: 1000

# Optional: Enable authentication
authorization {
  user: collector
  password: secure_password
}

# Optional: Enable clustering for HA
cluster {
  name: metrics-cluster
  listen: 0.0.0.0:6222
}

Start NATS:

# Systemd
sudo systemctl start nats
sudo systemctl enable nats

# Or directly
nats-server -c /etc/nats/nats-server.conf

Advanced Configurations

Multiple Aggregation Levels

For very large clusters, you can create multiple aggregation levels:

flowchart TD
  subgraph Compute ["Compute Nodes"]
    node1["Node 1-100"]
  end

  subgraph Rack ["Rack Aggregators"]
    agg1["Aggregator<br/>Rack 1-10"]
  end

  subgraph Cluster ["Cluster Aggregator"]
    agg_main["Main Aggregator"]
  end

  subgraph Backend ["Backend"]
    ccms[("cc-metric-store")]
  end
  
  node1 --> agg1
  agg1 --> agg_main
  agg_main --> ccms

Rack-level aggregator sinks.json:

{
  "cluster_aggregator": {
    "type": "nats",
    "host": "main-aggregator.example.org",
    "port": "4222",
    "subject": "metrics.cluster"
  }
}

Cluster-level aggregator receivers.json:

{
  "all_racks": {
    "type": "nats",
    "address": "localhost",
    "port": "4222",
    "subject": "metrics.cluster"
  }
}

Load Balancing with Multiple Aggregators

Use NATS queue groups to distribute load across multiple aggregation nodes:

Compute node sinks.json:

{
  "nats_aggregator": {
    "type": "nats",
    "host": "nats-cluster.example.org",
    "port": "4222",
    "subject": "metrics.loadbalanced"
  }
}

Aggregator 1 and 2 receivers.json (identical configuration):

{
  "nats_with_queue": {
    "type": "nats",
    "address": "localhost",
    "port": "4222",
    "subject": "metrics.loadbalanced",
    "queue_group": "aggregators"
  }
}

With queue_group configured, NATS automatically distributes messages across all aggregators in the group.

Filtering at Aggregation Level

Reduce cc-metric-store load by filtering metrics at the aggregation node:

Aggregator router.json:

{
  "interval_timestamp": false,
  "process_messages": {
    "manipulate_messages": [
      {
        "drop_by_name": ["cpu_idle", "cpu_guest", "cpu_guest_nice"]
      },
      {
        "drop_by": "value == 0 && match('temp_', name)"
      },
      {
        "add_base_tags": {
          "aggregated": "true"
        }
      }
    ]
  }
}

Metric Transformation

Aggregate or transform metrics before forwarding:

Aggregator router.json:

{
  "interval_timestamp": false,
  "num_cache_intervals": 1,
  "interval_aggregates": [
    {
      "name": "rack_avg_temp",
      "if": "name == 'temp_package_id_0'",
      "function": "avg(values)",
      "tags": {
        "type": "rack",
        "rack": "<copy>"
      },
      "meta": {
        "unit": "degC",
        "source": "aggregated"
      }
    }
  ]
}

High Availability Setup

Use multiple NATS servers in cluster mode:

NATS server 1 config:

cluster {
  name: metrics-cluster
  listen: 0.0.0.0:6222
  routes: [
    nats://nats2.example.org:6222
    nats://nats3.example.org:6222
  ]
}

Compute node sinks.json (with failover):

{
  "nats_ha": {
    "type": "nats",
    "host": "nats1.example.org,nats2.example.org,nats3.example.org",
    "port": "4222",
    "subject": "metrics.rack1"
  }
}

Deployment

1. Install cc-metric-collector

On all nodes (compute and aggregation):

# Download binary
wget https://github.com/ClusterCockpit/cc-metric-collector/releases/latest/download/cc-metric-collector

# Install
sudo mkdir -p /opt/monitoring/cc-metric-collector
sudo mv cc-metric-collector /opt/monitoring/cc-metric-collector/
sudo chmod +x /opt/monitoring/cc-metric-collector/cc-metric-collector

2. Deploy Configuration Files

Compute nodes:

sudo mkdir -p /etc/cc-metric-collector
sudo cp config.json /etc/cc-metric-collector/
sudo cp sinks.json /etc/cc-metric-collector/
sudo cp collectors.json /etc/cc-metric-collector/
sudo cp receivers.json /etc/cc-metric-collector/
sudo cp router.json /etc/cc-metric-collector/

Aggregation node:

sudo mkdir -p /etc/cc-metric-collector
# Deploy aggregator-specific configs
sudo cp aggregator-config.json /etc/cc-metric-collector/config.json
sudo cp aggregator-sinks.json /etc/cc-metric-collector/sinks.json
sudo cp aggregator-receivers.json /etc/cc-metric-collector/receivers.json
# etc...

3. Create Systemd Service

On all nodes (/etc/systemd/system/cc-metric-collector.service):

[Unit]
Description=ClusterCockpit Metric Collector
After=network.target

[Service]
Type=simple
User=cc-collector
Group=cc-collector
ExecStart=/opt/monitoring/cc-metric-collector/cc-metric-collector -config /etc/cc-metric-collector/config.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

4. Start Services

Order of startup:

  1. Start cc-metric-store on backend server
  2. Start NATS server on aggregation node
  3. Start cc-metric-collector on aggregation node
  4. Start cc-metric-collector on compute nodes

# On backend server
sudo systemctl start cc-metric-store

# On aggregation node
sudo systemctl start nats
sudo systemctl start cc-metric-collector

# On compute nodes
sudo systemctl start cc-metric-collector

# Enable on boot (on all nodes)
sudo systemctl enable cc-metric-store  # backend only
sudo systemctl enable nats              # aggregator only
sudo systemctl enable cc-metric-collector

Testing and Validation

Test Compute Node → Aggregator

On compute node, run once to verify metrics are collected:

cc-metric-collector -config /etc/cc-metric-collector/config.json -once

On aggregation node, check NATS for incoming metrics:

# Subscribe to see messages
nats sub 'metrics.>'
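
You can also publish a hand-crafted test metric on the rack subject to verify the receive path without waiting for the collectors. cc-metric-collector exchanges metrics in InfluxDB line protocol; the metric name, tags, and value below are made up for this test:

# Publish one fake node-level metric on the rack 1 subject (NATS CLI)
nats pub 'metrics.rack1' \
  "cpu_load,cluster=mycluster,hostname=testnode,type=node value=1.23 $(date +%s%N)"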

Test Aggregator → cc-metric-store

On aggregation node, verify metrics are forwarded:

# Check logs
journalctl -u cc-metric-collector -f

On backend server, verify cc-metric-store is receiving data:

# Check cc-metric-store logs
journalctl -u cc-metric-store -f

# Query metrics via REST API (requires valid JWT token)
curl -H "Authorization: Bearer $JWT_TOKEN" \
  "http://backend.example.org:8082/api/query?cluster=mycluster&from=$(date -d '5 minutes ago' +%s)&to=$(date +%s)"

Validate End-to-End

Check cc-backend to see if metrics appear for all nodes:

  1. Open cc-backend web interface
  2. Navigate to node view
  3. Verify metrics are displayed for compute nodes
  4. Check that tags (cluster, rack, etc.) are present

Monitoring and Troubleshooting

Check Collection Pipeline

# Compute node: metrics are being sent
journalctl -u cc-metric-collector -n 100 | grep -i "sent\|error"

# Aggregator: metrics are being received
journalctl -u cc-metric-collector -n 100 | grep -i "received\|error"

# NATS: check connections
nats server info
nats server list

Common Issues

Metrics not appearing in cc-metric-store:

  1. Check compute node → NATS connection
  2. Verify NATS → aggregator reception
  3. Check aggregator → cc-metric-store sink (verify JWT token is valid)
  4. Verify metrics are configured in cc-metric-store’s config.json
  5. Examine router filters (may be dropping metrics)

High latency:

  1. Reduce metric collection interval on compute nodes
  2. Increase batch size in aggregator sinks
  3. Add more aggregation nodes with load balancing
  4. Check network bandwidth between tiers

Memory growth on aggregator:

  1. Reduce num_cache_intervals in router
  2. Check sink write performance
  3. Verify cc-metric-store is accepting writes
  4. Monitor NATS queue depth

Memory growth on cc-metric-store:

  1. Reduce retention-in-memory setting
  2. Increase checkpoint frequency
  3. Verify archive cleanup is working

Connection failures:

  1. Verify firewall rules allow NATS port (4222)
  2. Check NATS server is running and accessible
  3. Test network connectivity: telnet aggregator.example.org 4222
  4. Review NATS server logs: journalctl -u nats -f

Performance Tuning

Compute nodes (reduce overhead):

{
  "main": {
    "interval": "30s",
    "duration": "1s"
  }
}

Aggregator (increase throughput):

{
  "metricstore": {
    "type": "http",
    "host": "backend.example.org",
    "port": "8082",
    "path": "/api/write",
    "timeout": "10s",
    "idle_connection_timeout": "10s"
  }
}

NATS server (handle more connections):

max_connections: 10000
max_payload: 10MB
write_deadline: "10s"

Security Considerations

NATS Authentication

NATS server config:

authorization {
  users = [
    {
      user: "collector"
      password: "$2a$11$..."  # bcrypt hash
    }
  ]
}

Compute node sinks.json:

{
  "nats_aggregator": {
    "type": "nats",
    "host": "aggregator.example.org",
    "port": "4222",
    "subject": "metrics.rack1",
    "username": "collector",
    "password": "secure_password"
  }
}

TLS Encryption

NATS server config:

tls {
  cert_file: "/etc/nats/certs/server-cert.pem"
  key_file: "/etc/nats/certs/server-key.pem"
  ca_file: "/etc/nats/certs/ca.pem"
  verify: true
}

Compute node sinks.json:

{
  "nats_aggregator": {
    "type": "nats",
    "host": "aggregator.example.org",
    "port": "4222",
    "subject": "metrics.rack1",
    "ssl": true,
    "ssl_cert": "/etc/cc-metric-collector/certs/client-cert.pem",
    "ssl_key": "/etc/cc-metric-collector/certs/client-key.pem"
  }
}

Firewall Rules

On aggregation node:

# Allow NATS from compute network
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/8" port protocol="tcp" port="4222" accept'

sudo firewall-cmd --reload

On backend server:

# Allow HTTP from aggregation node to cc-metric-store
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="aggregator.example.org" port protocol="tcp" port="8082" accept'

sudo firewall-cmd --reload

Alternative: Using NATS for cc-metric-store

Instead of HTTP, you can also use NATS to send metrics from the aggregation node to cc-metric-store.

Aggregation node sinks.json:

{
  "nats_metricstore": {
    "type": "nats",
    "host": "backend.example.org",
    "port": "4222",
    "subject": "metrics.store"
  }
}

cc-metric-store config.json (add NATS section):

{
  "metrics": { ... },
  "nats": {
    "address": "nats://0.0.0.0:4222",
    "subscriptions": [
      {
        "subscribe-to": "metrics.store",
        "cluster-tag": "mycluster"
      }
    ]
  },
  "http-api": { ... },
  "jwt-public-key": "...",
  "retention-in-memory": "48h",
  "checkpoints": { ... },
  "archive": { ... }
}

Benefits of NATS:

  • Better performance for high-frequency metrics
  • Built-in message buffering
  • No need for JWT tokens in sink configuration
  • Easier to scale with multiple aggregators

Trade-offs:

  • Requires running NATS server on backend
  • More complex infrastructure


3 - Database migrations

Database migrations

Introduction

In general, an upgrade is nothing more than a replacement of the binary file. All necessary files, except the database file, the configuration file and the job archive, are embedded in the binary. It is recommended to keep the binaries in a directory where their file names carry a version indicator, for example the date or the Unix epoch time. A symbolic link points to the version currently in use. This makes it easy to switch back to earlier versions.

The database and the job archive are versioned. Each release binary supports specific versions of the database and job archive. If a version mismatch is detected, the application is terminated and migration is required.

Migrating the database

After you have backed up the database, run the following command to migrate the database to the latest version:

> ./cc-backend -migrate-db

The migration files are embedded in the binary and can also be viewed in the cc-backend source tree. We use the migrate library.

If something goes wrong, you can check the status and get the current schema (here for sqlite):

> sqlite3 var/job.db

In the sqlite console execute:

.schema

to get the current database schema. You can query the current version and whether the migration failed with:

SELECT * FROM schema_migrations;

The first column indicates the current database version and the second column is a dirty flag indicating whether the migration was successful.
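
For a quick non-interactive check, the same information can be queried in one line (column names as created by the migrate library; the output shown is only illustrative):

sqlite3 var/job.db "SELECT version, dirty FROM schema_migrations;"
# e.g. 8|0  -> schema version 8, last migration completed cleanly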

4 - Job archive migrations

Job archive migrations

Introduction

In general, an upgrade is nothing more than a replacement of the binary file. All necessary files, except the database file, the configuration file and the job archive, are embedded in the binary. It is recommended to keep the binaries in a directory where their file names carry a version indicator, for example the date or the Unix epoch time. A symbolic link points to the version currently in use. This makes it easy to switch back to earlier versions.

Migrating the job archive

Job archive migration requires a separate tool (archive-migration), which is part of the cc-backend source tree (build with go build ./tools/archive-migration) and is also provided as part of the releases.

Migration is supported only between two successive releases. You can find details on how to use the archive-migration tool in its reference documentation.

The cluster.json files in job-archive-new must be checked for errors, especially whether the aggregation attribute is set correctly for all metrics.

Migration takes a few hours for large job archives (several hundred GB). A versioned job archive contains a version.txt file in the root directory of the job archive. This file contains the version as an unsigned integer.
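
You can check the archive version directly before and after a migration; the path assumes the default job archive location used elsewhere in this documentation:

cat ./var/job-archive/version.txt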

6 - Hands-On Demo

Hands-On Demo for a basic ClusterCockpit setup and API usage (without Docker)

Prerequisites

  • perl
  • go
  • npm
  • Optional: curl
  • Script migrateTimestamps.pl

Documentation

You find READMEs or api docs in

  • ./cc-backend/configs
  • ./cc-backend/init
  • ./cc-backend/api

ClusterCockpit configuration files

cc-backend

  • ./.env Passwords and Tokens set in the environment
  • ./config.json Configuration options for cc-backend

cc-metric-store

  • ./config.json Optional to overwrite configuration options

cc-metric-collector

Not yet included in the hands-on setup.

Setup Components

Start by creating a base folder for all of the following steps.

  • mkdir clustercockpit
  • cd clustercockpit

Setup cc-backend

  • Clone Repository
    • git clone https://github.com/ClusterCockpit/cc-backend.git
    • cd cc-backend
  • Build
    • make
  • Activate & configure environment for cc-backend
    • cp configs/env-template.txt .env
    • Optional: Have a look via vim .env
    • Copy the config.json file included in this tarball into the root directory of cc-backend: cp ../../config.json ./
  • Back to toplevel clustercockpit
    • cd ..
  • Prepare Datafolder and Database file
    • mkdir var
    • ./cc-backend -migrate-db

Setup cc-metric-store

  • Clone Repository
    • git clone https://github.com/ClusterCockpit/cc-metric-store.git
    • cd cc-metric-store
  • Build Go Executable
    • go get
    • go build
  • Prepare Datafolders
    • mkdir -p var/checkpoints
    • mkdir -p var/archive
  • Update Config
    • vim config.json
    • Exchange existing setting in metrics with the following:
"clock":      { "frequency": 60, "aggregation": null },
"cpi":        { "frequency": 60, "aggregation": null },
"cpu_load":   { "frequency": 60, "aggregation": null },
"flops_any":  { "frequency": 60, "aggregation": null },
"flops_dp":   { "frequency": 60, "aggregation": null },
"flops_sp":   { "frequency": 60, "aggregation": null },
"ib_bw":      { "frequency": 60, "aggregation": null },
"lustre_bw":  { "frequency": 60, "aggregation": null },
"mem_bw":     { "frequency": 60, "aggregation": null },
"mem_used":   { "frequency": 60, "aggregation": null },
"rapl_power": { "frequency": 60, "aggregation": null }
  • Back to toplevel clustercockpit
    • cd ..

Setup Demo Data

  • mkdir source-data
  • cd source-data
  • Download JobArchive-Source:
    • wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/job-archive-dev.tar.xz
    • tar xJf job-archive-dev.tar.xz
    • mv ./job-archive ./job-archive-source
    • rm ./job-archive-dev.tar.xz
  • Download CC-Metric-Store Checkpoints:
    • mkdir -p cc-metric-store-source/checkpoints
    • cd cc-metric-store-source/checkpoints
    • wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/cc-metric-store-checkpoints.tar.xz
    • tar xf cc-metric-store-checkpoints.tar.xz
    • rm cc-metric-store-checkpoints.tar.xz
  • Back to source-data
    • cd ../..
  • Run timestamp migration script. This may take tens of minutes!
    • cp ../migrateTimestamps.pl .
    • ./migrateTimestamps.pl
    • Expected output:
Starting to update start- and stoptimes in job-archive for emmy
Starting to update start- and stoptimes in job-archive for woody
Done for job-archive
Starting to update checkpoint filenames and data starttimes for emmy
Starting to update checkpoint filenames and data starttimes for woody
Done for checkpoints
  • Copy cluster.json files from source to migrated folders
    • cp source-data/job-archive-source/emmy/cluster.json cc-backend/var/job-archive/emmy/
    • cp source-data/job-archive-source/woody/cluster.json cc-backend/var/job-archive/woody/
  • Initialize Job-Archive in SQLite3 job.db and add demo user
    • cd cc-backend
    • ./cc-backend -init-db -add-user demo:admin:demo
    • Expected output:
<6>[INFO]    new user "demo" created (roles: ["admin"], auth-source: 0)
<6>[INFO]    Building job table...
<6>[INFO]    A total of 3936 jobs have been registered in 1.791 seconds.
  • Back to toplevel clustercockpit
    • cd ..

Startup both Apps

  • In cc-backend root: $./cc-backend -server -dev
    • Starts ClusterCockpit at http://localhost:8080
      • Log: <6>[INFO] HTTP server listening at :8080...
    • Use local internet browser to access interface
      • You should see and be able to browse finished Jobs
      • Metadata is read from SQLite3 database
      • Metricdata is read from job-archive/JSON-Files
    • Create User in settings (top-right corner)
      • Name apiuser
      • Username apiuser
      • Role API
      • Submit & Refresh Page
    • Create JWT for apiuser
      • In Userlist, press Gen. JWT for apiuser
      • Save JWT for later use
  • In cc-metric-store root: $./cc-metric-store
    • Start the cc-metric-store on http://localhost:8081, Log:
2022/07/15 17:17:42 Loading checkpoints newer than 2022-07-13T17:17:42+02:00
2022/07/15 17:17:45 Checkpoints loaded (5621 files, 319 MB, that took 3.034652s)
2022/07/15 17:17:45 API http endpoint listening on '0.0.0.0:8081'
  • Does not have a graphical interface
  • Optional: Test function by executing:
$ curl -H "Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw" -D - "http://localhost:8081/api/query" -d "{ \"cluster\": \"emmy\", \"from\": $(expr $(date +%s) - 60), \"to\": $(date +%s), \"queries\": [{
  \"metric\": \"flops_any\",
  \"host\": \"e1111\"
}] }"

HTTP/1.1 200 OK
Content-Type: application/json
Date: Fri, 15 Jul 2022 13:57:22 GMT
Content-Length: 119
{"results":[[JSON-DATA-ARRAY]]}

Development API web interfaces

The -dev flag enables web interfaces to document and test the apis:

  • Local GQL Playground - A GraphQL playground. To use it you must have an authenticated session in the same browser.
  • Local Swagger Docs - A Swagger UI. To use it you have to be logged out, so no user session in the same browser. Use the JWT token with role API generated previously to authenticate via HTTP header.

Use cc-backend API to start job

  • Enter the URL http://localhost:8080/swagger/index.html in your browser.

  • Enter the JWT token you generated for the API user after clicking the green Authorize button in the upper right part of the window.

  • Click the /job/start_job endpoint and click the Try it out button.

  • Enter the following JSON into the request body text area and fill in a recent start timestamp obtained by executing date +%s:

{
    "jobId":         100000,
    "arrayJobId":    0,
    "user":          "ccdemouser",
    "subCluster":    "main",
    "cluster":       "emmy",
    "startTime":    <date +%s>,
    "project":       "ccdemoproject",
    "resources":  [
        {"hostname":  "e0601"},
        {"hostname":  "e0823"},
        {"hostname":  "e0337"},
        {"hostname": "e1111"}],
    "numNodes":      4,
    "numHwthreads":  80,
    "walltime":      86400
}
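
  • If you prefer the command line over the Swagger UI, the same request can be sent with curl. The endpoint path /api/jobs/start_job/ and the payload file name are assumptions here; use the exact path shown in Swagger and the JWT you generated for apiuser:

curl -X POST "http://localhost:8080/api/jobs/start_job/" \
  -H "Authorization: Bearer $JWT" \
  -H "Content-Type: application/json" \
  -d @start_job.json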
  • The response body should be the database id of the started job, for example:
{
  "id": 3937
}
  • Check in ClusterCockpit
    • User ccdemouser should appear in Users-Tab with one running job
    • It could take up to 5 Minutes until the Job is displayed with some current data (5 Min Short-Job Filter)
    • Job then is marked with a green running tag
    • Metricdata displayed is read from cc-metric-store!

Use cc-backend API to stop job

  • Enter the URL http://localhost:8080/swagger/index.html in your browser.
  • Enter the JWT token you generated for the API user after clicking the green Authorize button in the upper right part of the window.
  • Click the /job/stop_job/{id} endpoint and click the Try it out button.
  • Enter the database id returned by start_job in the id field and copy the following into the request body. Replace the timestamp with a recent one:
{
  "cluster": "emmy",
  "jobState": "completed",
  "stopTime": <RECENT TS>
}
  • On success a json document with the job meta data is returned.

  • Check in ClusterCockpit

    • User ccdemouser should appear in Users-Tab with one completed job
    • Job is no longer marked with a green running tag -> Completed!
    • Metricdata displayed is now read from job-archive!
  • Check in job-archive

    • cd ./cc-backend/var/job-archive/emmy/100/000
    • cd $STARTTIME
    • Inspect meta.json and data.json

Helper scripts

  • In this tarball you can find the perl script generate_subcluster.pl that helps to generate the subcluster section for your system. Usage:
  • Log into an exclusive cluster node.
  • The LIKWID tools likwid-topology and likwid-bench must be in the PATH!
  • $./generate_subcluster.pl outputs the subcluster section on stdout

Please be aware that

  • You have to enter the name and node list for the subCluster manually.
  • GPU detection only works if LIKWID was built with CUDA available and you also run likwid-topology with CUDA loaded.
  • Do not blindly trust the measured peakflops values.
  • Because the script blindly relies on the CSV output format of likwid-topology, this is a fragile undertaking!

7 - How to add a MOD notification banner

Add a message of the day banner on homepage

Overview

To add a notification banner you can add a file notice.txt to the ./var directory of the cc-backend server. As long as this file is present, its contents are shown in an info banner on the homepage.
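
For example, a one-line maintenance notice could be created like this (text and relative path are illustrative; run it from the cc-backend root):

echo "Scheduled maintenance on Friday, 08:00-12:00. Jobs may start delayed." > ./var/notice.txt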

Add notification banner in web interface

As an alternative the admin role can also add and edit the notification banner from the settings view.

8 - How to create a `cluster.json` file

How to initially create a cluster configuration

Overview

Every cluster is configured using a dedicated cluster.json file, which is part of the job archive. You can find the JSON schema for it here. This file provides information about the homogeneous hardware partitions within the cluster, including the node topology and the metric list. A real production configuration is provided as part of cc-examples.

cluster.json: Basics

The cluster.json file contains three top level parts: the name of the cluster, the metric configuration, and the subcluster list. You find the latest cluster.json schema here. Basic layout of cluster.json files:

{
  "name": "fritz",
  "metricConfig": [
    {
      "name": "cpu_load",
      ...
    },
    {
      "name": "mem_used",
      ...
    }
  ],
  "subClusters": [
    {
      "name": "main",
      ...
    },
    {
      "name": "spr",
      ...
    }
  ]
}

cluster.json: Metric configuration

There is one metric list per cluster. You can find a list of recommended metrics and their naming here. Example for a metric list entry with only the required attributes:

"metricConfig": [
    {
        "name": "flops_sp",
        "unit": {
            "base": "Flops/s",
            "prefix": "G"
        },
        "scope": "hwthread",
        "timestep": 60,
        "aggregation": "sum",
        "peak": 5600,
        "normal": 1000,
        "caution": 200,
        "alert": 50
    }
]

Explanation of required attributes:

  • name: The metric name.
  • unit: The metric's unit. Base can be: B (for bytes), F (for flops), B/s, F/s, Flops (for floating point operations), Flops/s (for FLOP rate), CPI (for cycles per instruction), IPC (for instructions per cycle), Hz, W (for Watts), °C, or empty string for no unit. Prefix can be: K, M, G, T, P, or E.
  • scope: The native metric measurement resolution. Can be node, socket, memoryDomain, core, hwthread, or accelerator.
  • timestep: The measurement frequency in seconds
  • aggregation: How the metric is aggregated within the node topology. Can be one of sum, avg, or empty string for no aggregation (node-level metrics).
  • Metric thresholds. Whether a threshold applies to larger or smaller values depends on the optional attribute lowerIsBetter (default false).
    • peak: The maximum possible metric value
    • normal: A common metric value level
    • caution: Metric value requires attention
    • alert: Metric value requiring immediate attention

Optional attributes:

  • footprint: Is this a job footprint metric. Set it to how the footprint is aggregated: can be avg, min, or max. Footprint metrics are shown in the footprint UI component and the job view polar plot.
  • energy: Should the metric be used to calculate the job energy. Can be power (metric has unit Watts) or energy (metric has unit Joules).
  • lowerIsBetter: Is lower better. Influences frontend UI and evaluation of metric thresholds. Default is false.
  • restrict: Whether to restrict visibility of this metric to non-user roles (admin, support, manager). Default is false. When set to true, regular users cannot view this metric.
  • subClusters (Type: array of objects): Overwrites for specific subClusters. By default the metrics are valid for all subClusters. It is possible to overwrite or remove metrics for specific subClusters. If a metric is overwritten for a subCluster, all attributes have to be set; partial overwrites are not supported. Example for a metric overwrite:
{
    "name": "mem_used",
    "unit": {
        "base": "B",
        "prefix": "G"
    },
    "scope": "node",
    "aggregation": "sum",
    "footprint": "max",
    "timestep": 60,
    "lowerIsBetter": true,
    "peak": 256,
    "normal": 128,
    "caution": 200,
    "alert": 240,
    "subClusters": [
        {
            "name": "spr1tb",
            "footprint": "max",
            "peak": 1024,
            "normal": 512,
            "caution": 900,
            "alert": 1000
        },
        {
            "name": "spr2tb",
            "footprint": "max",
            "peak": 2048,
            "normal": 1024,
            "caution": 1800,
            "alert": 2000
        }
    ]
},

This metric characterizes the memory capacity used by a job. The aggregation for a job is the sum of all node values. As footprint the largest allocated memory capacity is used. For this configuration lowerIsBetter is set, so jobs that exceed the metric thresholds are marked. There are two subClusters with 1 TB and 2 TB memory capacity, compared to the default of 256 GB.

Example for removing metrics for a subcluster:

{
  "name": "vectorization_ratio",
  "unit": {
    "base": ""
  },
  "scope": "hwthread",
  "aggregation": "avg",
  "timestep": 60,
  "peak": 100,
  "normal": 60,
  "caution": 40,
  "alert": 10,
  "subClusters": [
    {
      "name": "icelake",
      "remove": true
    }
  ]
}

cluster.json: subcluster configuration

SubClusters in ClusterCockpit are subsets of a cluster with homogeneous hardware. The subCluster part specifies the node topology, a list of nodes that are part of the subCluster, and the node capabilities that are used to draw the roofline diagrams.

Topology Structure

The topology section defines the hardware topology using nested arrays that map relationships between hardware threads, cores, sockets, memory domains, and dies:

  • node: Flat list of all hardware thread IDs on the node
  • socket: Hardware threads grouped by physical CPU socket (2D array)
  • memoryDomain: Hardware threads grouped by NUMA domain (2D array)
  • die: Optional grouping by CPU die within sockets (2D array). This is used for multi-die processors where each socket contains multiple dies. If not applicable, use an empty array []
  • core: Hardware threads grouped by physical core (2D array)
  • accelerators: Optional list of attached accelerators (GPUs, FPGAs, etc.)

The resource ID for CPU cores is the OS processor ID. For GPUs we recommend using the PCI-E address as resource ID.

Here is an example:

{
  "name": "icelake",
  "nodes": "w22[01-35],w23[01-35],w24[01-20],w25[01-20]",
  "processorType": "Intel Xeon Gold 6326",
  "socketsPerNode": 2,
  "coresPerSocket": 16,
  "threadsPerCore": 1,
  "flopRateScalar": {
    "unit": {
      "base": "F/s",
      "prefix": "G"
    },
    "value": 432
  },
  "flopRateSimd": {
    "unit": {
      "base": "F/s",
      "prefix": "G"
    },
    "value": 9216
  },
  "memoryBandwidth": {
    "unit": {
      "base": "B/s",
      "prefix": "G"
    },
    "value": 350
  },
  "topology": {
    "node": [
      0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
      21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
      39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56,
      57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71
    ],
    "socket": [
      [
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35
      ],
      [
        36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
        54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71
      ]
    ],
    "memoryDomain": [
      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
      [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35],
      [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53],
      [54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71]
    ],
    "die": [],
    "core": [
      [0],
      [1],
      [2],
      [3],
      [4],
      [5],
      [6],
      [7],
      [8],
      [9],
      [10],
      [11],
      [12],
      [13],
      [14],
      [15],
      [16],
      [17],
      [18],
      [19],
      [20],
      [21],
      [22],
      [23],
      [24],
      [25],
      [26],
      [27],
      [28],
      [29],
      [30],
      [31],
      [32],
      [33],
      [34],
      [35],
      [36],
      [37],
      [38],
      [39],
      [40],
      [41],
      [42],
      [43],
      [44],
      [45],
      [46],
      [47],
      [48],
      [49],
      [50],
      [51],
      [52],
      [53],
      [54],
      [55],
      [56],
      [57],
      [58],
      [59],
      [60],
      [61],
      [62],
      [63],
      [64],
      [65],
      [66],
      [67],
      [68],
      [69],
      [70],
      [71]
    ]
  }
}

Since it is tedious to write this by hand, we provide a Perl script as part of cc-backend that generates a subCluster template. This script only works if the LIKWID tools are installed and in the PATH. You also need the LIKWID library for cc-metric-collector. You can find instructions on how to install LIKWID here.

Example: SubCluster with GPU Accelerators

Here is an example for a subCluster with GPU accelerators:

{
  "name": "a100m80",
  "nodes": "a[0531-0537],a[0631-0633],a0731,a[0831-0833],a[0931-0934]",
  "processorType": "AMD Milan",
  "socketsPerNode": 2,
  "coresPerSocket": 64,
  "threadsPerCore": 1,
  "flopRateScalar": {
    "unit": {
      "base": "F/s",
      "prefix": "G"
    },
    "value": 432
  },
  "flopRateSimd": {
    "unit": {
      "base": "F/s",
      "prefix": "G"
    },
    "value": 9216
  },
  "memoryBandwidth": {
    "unit": {
      "base": "B/s",
      "prefix": "G"
    },
    "value": 400
  },
  "topology": {
    "node": [
      0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
      21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
      39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56,
      57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
      75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
      93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108,
      109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123,
      124, 125, 126, 127
    ],
    "socket": [
      [
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37,
        38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
        56, 57, 58, 59, 60, 61, 62, 63
      ],
      [
        64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81,
        82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,
        100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
        114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127
      ]
    ],
    "memoryDomain": [
      [
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37,
        38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
        56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73,
        74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91,
        92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,
        108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121,
        122, 123, 124, 125, 126, 127
      ]
    ],
    "core": [
      [0],
      [1],
      [2],
      [3],
      [4],
      [5],
      [6],
      [7],
      [8],
      [9],
      [10],
      [11],
      [12],
      [13],
      [14],
      [15],
      [16],
      [17],
      [18],
      [19],
      [20],
      [21],
      [22],
      [23],
      [24],
      [25],
      [26],
      [27],
      [28],
      [29],
      [30],
      [31],
      [32],
      [33],
      [34],
      [35],
      [36],
      [37],
      [38],
      [39],
      [40],
      [41],
      [42],
      [43],
      [44],
      [45],
      [46],
      [47],
      [48],
      [49],
      [50],
      [51],
      [52],
      [53],
      [54],
      [55],
      [56],
      [57],
      [58],
      [59],
      [60],
      [61],
      [62],
      [63],
      [64],
      [65],
      [66],
      [67],
      [68],
      [69],
      [70],
      [71],
      [72],
      [73],
      [74],
      [75],
      [76],
      [77],
      [78],
      [79],
      [80],
      [81],
      [82],
      [83],
      [84],
      [85],
      [86],
      [87],
      [88],
      [89],
      [90],
      [91],
      [92],
      [93],
      [94],
      [95],
      [96],
      [97],
      [98],
      [99],
      [100],
      [101],
      [102],
      [103],
      [104],
      [105],
      [106],
      [107],
      [108],
      [109],
      [110],
      [111],
      [112],
      [113],
      [114],
      [115],
      [116],
      [117],
      [118],
      [119],
      [120],
      [121],
      [122],
      [123],
      [124],
      [125],
      [126],
      [127]
    ],
    "accelerators": [
      {
        "id": "00000000:0E:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:13:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:49:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:4F:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:90:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:96:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:CC:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      },
      {
        "id": "00000000:D1:00.0",
        "type": "Nvidia GPU",
        "model": "A100"
      }
    ]
  }
}

Important: Each accelerator requires three fields:

  • id: Unique identifier (PCI-E address recommended, e.g., “00000000:0E:00.0”)
  • type: Type of accelerator. Valid values are: "Nvidia GPU", "AMD GPU", "Intel GPU"
  • model: Specific model name (e.g., “A100”, “MI100”)

You must ensure that the metric collector as well as the Slurm adapter use the same identifier format (PCI-E address) for the accelerator resource ID, for consistency.
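
On NVIDIA systems, one way to obtain the PCI-E addresses for the accelerators list is nvidia-smi; this is a sketch, and the reported bus IDs should be used verbatim as the id values:

# List GPU PCI-E bus IDs (and models) to use as accelerator ids
nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv,noheader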

9 - How to customize cc-backend

Add legal texts, modify login page, and add custom logo.

Overview

Customizing cc-backend means changing the logo, legal texts, and the login template instead of the placeholders. You can also place a text file in ./var to add dynamic status or notification messages to the ClusterCockpit homepage.

To replace the imprint.tmpl and privacy.tmpl legal texts, you can place your version in ./var/. At startup cc-backend will check if ./var/imprint.tmpl and/or ./var/privacy.tmpl exist and use them instead of the built-in placeholders. You can use the placeholders in web/templates as a blueprint.
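
A minimal sketch of the copy step, assuming you run it from the cc-backend root directory and that the placeholder files carry the same names as their targets:

cp web/templates/imprint.tmpl ./var/imprint.tmpl
cp web/templates/privacy.tmpl ./var/privacy.tmpl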

Replace login template

To replace the default login layout and styling, you can place your version in ./var/. At startup cc-backend will check if ./var/login.tmpl exists and use it instead of the built-in placeholder. You can use the default template web/templates/login.tmpl as a blueprint.

To change the logo displayed in the navigation bar, you can provide the file logo.png in the folder ./var/img/. On startup cc-backend will check if the folder exists and use the images provided there instead of the built-in images. You may also place additional images there you use in a custom login template.

Add notification banner on homepage

To add a notification banner you can add a file notice.txt to ./var. As long as this file is present, its contents are shown in an info banner on the homepage.

10 - How to deploy and update cc-backend

Recommended deployment and update workflow for production use

Workflow for deployment

It is recommended to install all ClusterCockpit components in a common directory, e.g. /opt/monitoring, /var/monitoring or /var/clustercockpit. In the following we use /opt/monitoring.

Two systemd services run on the central monitoring server:

  • clustercockpit: the cc-backend binary in /opt/monitoring/cc-backend.
  • cc-metric-store: the cc-metric-store binary in /opt/monitoring/cc-metric-store.

ClusterCockpit is deployed as a single binary that embeds all static assets. We recommend keeping all cc-backend binary versions in a folder archive and linking the currently active one from the cc-backend root. This allows for easy roll-back in case something doesn’t work.

Workflow to update

This example assumes the DB and job archive versions did not change. In case the new binary requires a newer database or job archive version read here how to migrate to newer versions.

  • Stop systemd service:
sudo systemctl stop clustercockpit.service
  • Backup the sqlite DB file! This is as simple as to copy it.
  • Copy new cc-backend binary to /opt/monitoring/cc-backend/archive (Tip: Use a date tag like YYYYMMDD-cc-backend). Here is an example:
cp ~/cc-backend /opt/monitoring/cc-backend/archive/20231124-cc-backend
  • Link from cc-backend root to current version
ln -s  /opt/monitoring/cc-backend/archive/20231124-cc-backend /opt/monitoring/cc-backend/cc-backend
  • Start systemd service:
sudo systemctl start clustercockpit.service
  • Check if everything is ok:
sudo systemctl status clustercockpit.service
  • Check log for issues:
sudo journalctl -u clustercockpit.service
  • Check the ClusterCockpit web frontend and your Slurm adapters if anything is broken!

11 - How to enable and configure auto-tagging

Enable automatic job tagging for application detection and job classification

Overview

ClusterCockpit provides automatic job tagging to classify and categorize jobs based on configurable rules. The tagging system consists of two components:

  1. Application Detection - Identifies which application a job is running by matching patterns in the job script
  2. Job Classification - Analyzes job performance metrics to identify performance issues or characteristics

Tags are automatically applied when jobs start or stop, and can also be applied retroactively to existing jobs. This feature is disabled by default and must be explicitly enabled in the configuration.

Enable auto-tagging

Step 1: Copy configuration files

The tagging system requires configuration files to define application patterns and classification rules. Example configurations are provided in the cc-backend repository at configs/tagger/.

From the cc-backend root directory, copy the configuration files to the var directory:

mkdir -p var/tagger
cp -r configs/tagger/apps var/tagger/
cp -r configs/tagger/jobclasses var/tagger/

This copies:

  • Application patterns (var/tagger/apps/) - Text files containing regex patterns to match application names in job scripts (16 example applications)
  • Job classification rules (var/tagger/jobclasses/) - JSON files defining rules to classify jobs based on metrics (3 example rules)
  • Shared parameters (var/tagger/jobclasses/parameters.json) - Common threshold values used across multiple classification rules

Step 2: Enable in configuration

Add or set the enable-job-taggers configuration option in your config.json:

{
  "enable-job-taggers": true
}

Important: Automatic tagging is disabled by default. Setting this to true activates automatic tagging for jobs that start or stop after cc-backend is restarted.

Step 3: Restart cc-backend

The tagging system loads configuration from ./var/tagger/ at startup:

./cc-backend -server

Step 4: Verify configuration loaded

Check the logs for messages indicating successful initialization:

[INFO] Setup file watch for ./var/tagger/apps
[INFO] Setup file watch for ./var/tagger/jobclasses

These messages confirm the tagging system is active and watching for configuration changes.

How auto-tagging works

Automatic tagging

When enable-job-taggers is set to true, tags are automatically applied at two points in the job lifecycle:

  • Job Start - Application detection runs immediately when a job starts, analyzing the job script to identify the application
  • Job Stop - Job classification runs when a job completes, analyzing metrics to identify performance characteristics

Note: Only jobs that start or stop after enabling the feature are automatically tagged. Existing jobs require manual tagging (see below).

Manual tagging (retroactive)

To apply tags to existing jobs in the database, use the -apply-tags command line option:

./cc-backend -apply-tags

This processes all jobs in the database and applies current tagging rules. This is useful when:

  • You have existing jobs created before tagging was enabled
  • You’ve added new tagging rules and want to apply them to historical data
  • You’ve modified existing rules and want to re-evaluate all jobs

The -apply-tags option works independently of the enable-job-taggers configuration setting.

Hot reload

The tagging system watches configuration directories for changes. You can modify or add rules without restarting cc-backend:

  • Changes to var/tagger/apps/* are detected automatically
  • Changes to var/tagger/jobclasses/* are detected automatically

Simply edit the files and the new rules will be applied to subsequent jobs.

Application detection

Application detection identifies which software a job is running by matching patterns in the job script.

Configuration format

Application patterns are stored in text files under var/tagger/apps/. Each file represents one application, and the filename (without .txt extension) becomes the tag name.

Each file contains one or more regular expression patterns, one per line:

Example: var/tagger/apps/vasp.txt

vasp
VASP

Example: var/tagger/apps/python.txt

python
pip
anaconda
conda

How it works

  1. When a job starts, the system retrieves the job script from metadata
  2. Each line in the app configuration files is treated as a regex pattern
  3. Patterns are matched case-insensitively against the lowercased job script
  4. If a match is found, a tag of type app with the filename as tag name is applied
  5. Only the first matching application is tagged

Adding new applications

To add detection for a new application:

  1. Create a new file in var/tagger/apps/ (e.g., tensorflow.txt)

  2. Add regex patterns, one per line:

    tensorflow
    tf\.keras
    import tensorflow
    
  3. The file is automatically detected and loaded (no restart required)

The tag name will be the filename without the .txt extension (e.g., tensorflow).

Provided application patterns

The example configuration includes patterns for 16 common HPC applications:

  • vasp
  • python
  • gromacs
  • lammps
  • openfoam
  • starccm
  • matlab
  • julia
  • cp2k
  • cpmd
  • chroma
  • flame
  • caracal
  • turbomole
  • orca
  • alf

Job classification

Job classification analyzes completed jobs based on their metrics and properties to identify performance issues or characteristics.

Configuration format

Job classification rules are defined in JSON files under var/tagger/jobclasses/. Each rule file contains:

  • Metrics required - Which job metrics to analyze
  • Requirements - Pre-conditions that must be met
  • Variables - Computed values used in the rule
  • Rule expression - Boolean expression that determines if the rule matches
  • Hint template - Message displayed when the rule matches

Shared parameters

The file var/tagger/jobclasses/parameters.json defines threshold values used across multiple rules:

{
  "lowcpuload_threshold_factor": 0.9,
  "excessivecpuload_threshold_factor": 1.1,
  "job_min_duration_seconds": 600.0,
  "sampling_interval_seconds": 30.0
}

These parameters can be referenced in rule expressions and make it easy to maintain consistent thresholds across multiple rules.

Rule file structure

Each classification rule is a JSON file with the following structure:

Example: var/tagger/jobclasses/lowload.json

{
  "name": "Low CPU load",
  "tag": "lowload",
  "parameters": ["lowcpuload_threshold_factor", "job_min_duration_seconds"],
  "metrics": ["cpu_load"],
  "requirements": [
    "job.shared == \"none\"",
    "job.duration > job_min_duration_seconds"
  ],
  "variables": [
    {
      "name": "load_threshold",
      "expr": "job.numCores * lowcpuload_threshold_factor"
    }
  ],
  "rule": "cpu_load.avg < cpu_load.limits.caution",
  "hint": "Average CPU load {{.cpu_load.avg}} falls below threshold {{.cpu_load.limits.caution}}"
}

Field descriptions

Field | Description
name | Human-readable description of the rule
tag | Tag identifier applied when the rule matches
parameters | List of parameter names from parameters.json to include in the rule environment
metrics | List of metrics required for evaluation (must be present in job data)
requirements | Boolean expressions that must all be true for the rule to be evaluated
variables | Named expressions computed before evaluating the main rule
rule | Boolean expression that determines if the job matches this classification
hint | Go template string for generating a user-visible message

Expression environment

Expressions in requirements, variables, and rule have access to:

Job properties:

  • job.shared - Shared node allocation type
  • job.duration - Job runtime in seconds
  • job.numCores - Number of CPU cores
  • job.numNodes - Number of nodes
  • job.jobState - Job completion state
  • job.numAcc - Number of accelerators
  • job.smt - SMT setting

Metric statistics (for each metric in metrics):

  • <metric>.min - Minimum value
  • <metric>.max - Maximum value
  • <metric>.avg - Average value
  • <metric>.limits.peak - Peak limit from cluster config
  • <metric>.limits.normal - Normal threshold
  • <metric>.limits.caution - Caution threshold
  • <metric>.limits.alert - Alert threshold

Parameters:

  • All parameters listed in the parameters field

Variables:

  • All variables defined in the variables array

Expression language

Rules use the expr language for expressions. Supported operations:

  • Arithmetic: +, -, *, /, %, ^
  • Comparison: ==, !=, <, <=, >, >=
  • Logical: &&, ||, !
  • Functions: Standard math functions (see expr documentation)

Hint templates

Hints use Go’s text/template syntax. Variables from the evaluation environment are accessible:

{{.cpu_load.avg}}     # Access metric average
{{.job.duration}}     # Access job property
{{.load_threshold}}   # Access computed variable

Adding new classification rules

To add a new classification rule:

  1. Create a new JSON file in var/tagger/jobclasses/ (e.g., memoryLeak.json)
  2. Define the rule structure following the format above
  3. Add any new parameters to parameters.json if needed
  4. The file is automatically detected and loaded (no restart required)

Example: Detecting memory leaks

{
  "name": "Memory Leak Detection",
  "tag": "memory_leak",
  "parameters": ["memory_leak_slope_threshold"],
  "metrics": ["mem_used"],
  "requirements": ["job.duration > 3600"],
  "variables": [
    {
      "name": "mem_growth",
      "expr": "(mem_used.max - mem_used.min) / job.duration"
    }
  ],
  "rule": "mem_growth > memory_leak_slope_threshold",
  "hint": "Memory usage grew by {{.mem_growth}} bytes per second"
}

Don’t forget to add memory_leak_slope_threshold to parameters.json.
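
The shared parameters file would then contain the new threshold alongside the existing values. The concrete value below is purely illustrative (roughly 1 MiB of growth per second) and must be tuned for your environment:

{
  "lowcpuload_threshold_factor": 0.9,
  "excessivecpuload_threshold_factor": 1.1,
  "job_min_duration_seconds": 600.0,
  "sampling_interval_seconds": 30.0,
  "memory_leak_slope_threshold": 1048576.0
}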

Provided classification rules

The example configuration includes 3 classification rules:

  • lowload - Detects jobs with low CPU load (avg CPU load below caution threshold)
  • excessiveload - Detects jobs with excessive CPU load (avg CPU load above peak × threshold factor)
  • lowutilization - Detects jobs with low resource utilization (flop rate below alert threshold)

Troubleshooting

Tags not applied

  1. Check tagging is enabled: Verify enable-job-taggers: true is set in config.json

  2. Check configuration exists:

    ls -la var/tagger/apps
    ls -la var/tagger/jobclasses
    
  3. Check logs for errors:

    ./cc-backend -server -loglevel debug
    
  4. Verify file permissions: Ensure cc-backend can read the configuration files

  5. For existing jobs: Use ./cc-backend -apply-tags to retroactively tag jobs

Rules not matching

  1. Enable debug logging: Set log level to debug to see detailed rule evaluation:

    ./cc-backend -server -loglevel debug
    
  2. Check requirements: Ensure all requirements in the rule are satisfied

  3. Verify metrics exist: Classification rules require job metrics to be available in the job data

  4. Check metric names: Ensure metric names in rules match those in your cluster configuration

File watch not working

If changes to configuration files aren’t detected automatically:

  1. Restart cc-backend to reload all configuration
  2. Check filesystem supports file watching (some network filesystems may not support inotify)
  3. Check logs for file watch setup messages

Best practices

  1. Start simple: Begin with basic rules and refine based on results
  2. Use requirements: Filter out irrelevant jobs early with requirements to avoid unnecessary metric processing
  3. Test incrementally: Add one rule at a time and verify behavior before adding more
  4. Document rules: Use descriptive names and clear hint messages
  5. Share parameters: Define common thresholds in parameters.json for consistency
  6. Version control: Keep your var/tagger/ configuration in version control to track changes
  7. Backup before changes: Test new rules on a development instance before deploying to production

Tag types and usage

The tagging system creates two types of tags:

  • app - Application tags (e.g., “vasp”, “gromacs”, “python”)
  • jobClass - Classification tags (e.g., “lowload”, “excessiveload”, “lowutilization”)

Tags can be:

  • Queried and filtered in the ClusterCockpit UI
  • Used in API queries to find jobs with specific characteristics
  • Referenced in reports and analytics

Tags are stored in the database and appear in the job details view, making it easy to identify application usage and performance patterns across your cluster.

12 - How to generate JWT tokens

Overview

ClusterCockpit uses JSON Web Tokens (JWT) for authorization of its APIs. JWTs are the industry standard for securing APIs and are also used, for example, in OAuth2. For details on JWTs refer to the JWT article in the Concepts section.

JWT tokens for cc-backend login and REST API

When a user logs in via the /login page using a browser, a session cookie (secured using the random bytes in the SESSION_KEY environment variable, which you should also change in production) is used for all requests after the successful login. JWTs make it easier to use the ClusterCockpit APIs from scripts or other external programs. The token is specified in the Authorization HTTP header using the Bearer scheme (there is an example below). Tokens can be issued to users from the configuration view in the Web-UI or on the command line (using the -jwt <username> option). In order to use the token for API endpoints such as /api/jobs/start_job/, the user that executes it needs to have the api role. Regular users can only perform read-only queries and only look at data connected to jobs they started themselves.

There are two usage scenarios:

  • The APIs are used during a browser session. API accesses are authorized with the active session.
  • The REST API is used outside a browser session, e.g. by scripts. In this case you have to issue a token manually. This is possible from within the configuration view or on the command line. It is recommended to issue the JWT token for a dedicated user that only has the api role. By using different users for different purposes, fine-grained access control and access revocation management are possible.

The token is commonly specified in the Authorization HTTP header using the Bearer scheme. ClusterCockpit uses an Ed25519 private/public key pair to sign and verify its tokens. You can use cc-backend to generate new JWT tokens.

Create a new ECDSA Public/private key pair for signing and validating tokens

We provide a small utility tool as part of cc-backend:

go build ./cmd/gen-keypair/
./gen-keypair

Add key pair in your .env file for cc-backend

An env file template can be found in ./configs. cc-backend requires the private key to sign newly generated JWT tokens and the public key to validate tokens used to authenticate in its REST APIs.
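
As a rough sketch, the two entries could look like the lines below; check the env template in ./configs for the exact variable names (the names here are an assumption), and treat the values as placeholders produced by gen-keypair:

JWT_PUBLIC_KEY="<base64 encoded Ed25519 public key>"
JWT_PRIVATE_KEY="<base64 encoded Ed25519 private key>"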

Generate new JWT token

Every user with the admin role can create or change a user in the configuration view of the web interface. To generate a new JWT for a user, press the GenJWT button next to the user name in the user list.

A new api user and corresponding JWT keys can also be generated from the command line.

Create new API user with admin and api role:

./cc-backend -add-user myapiuser:admin,api:<password>

Create a new JWT token for this user:

./cc-backend -jwt myapiuser

Use the issued token on the client side

curl -X GET "<API ENDPOINT>" -H "accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer <JWT TOKEN>"

This token can be used for the cc-backend REST API as well as for the cc-metric-store. If you use the token for cc-metric-store, you have to configure cc-metric-store to use the corresponding public key for validation in its config.json.

Of course, the JWT token can also be generated by other means, as long as it is signed with an Ed25519 private key and the corresponding public key is configured in cc-backend or cc-metric-store. For the claims that are set and used by ClusterCockpit refer to the JWT article.

cc-metric-store

The cc-metric-store also uses JWTs for authentication. As it does not issue new tokens, it does not need to know the private key. The public key of the keypair that is used to generate the JWTs that grant access to the cc-metric-store can be specified in its config.json. When configuring the metricDataRepository object in the cluster.json file of the job-archive, you can put a token issued by cc-backend itself.

Other tools to generate signed tokens

The golang-jwt project provides a small command line tool to sign and verify tokens. You can install it with:

 go install github.com/golang-jwt/jwt/v5/cmd/jwt

OpenSSL can be used to generate ED25519 key-pairs:

# Generate ed25519 private key
openssl genpkey -algorithm ed25519 -out privkey.pem
# export its public key
openssl pkey -in privkey.pem -pubout -out pubkey.pem

13 - How to plan and configure resampling

Configure metric resampling

Enable timeseries resampling

ClusterCockpit now supports resampling of time series data to a lower frequency. This dramatically improves load times for very large or very long jobs, and we recommend enabling it. Resampling is supported for running as well as for finished jobs.

Note: For running jobs, this currently only works with the newest version of cc-metric-store. Resampling support for the Prometheus time series database will be added in the future.

Resampling Algorithm

To preserve visual accuracy while reducing data points, ClusterCockpit utilizes the Largest-Triangle-Three-Buckets (LTTB) algorithm.

Standard downsampling methods often fail to represent data accurately:

  • Averaging: Tends to smooth out important peaks and valleys, hiding critical performance spikes.
  • Decimation (Step sampling): Simply skips points, which can lead to random data loss and missed outliers.

In contrast, LTTB uses a geometric approach: within each bucket it selects the data point that forms the largest triangle with its neighboring selections. This technique creates a downsampled representation that retains the perceptual shape of the original line graph, ensuring that significant extrema and performance trends remain visible even at lower resolutions.

Configuration

To enable resampling, you must add the following top-level configuration key:

"resampling": {
  "minimum-points": 300,
  "trigger": 30,
  "resolutions": [
    600,
    300,
    120,
    60
  ]
}

Configuration Parameters

The resampling object is optional. If configured, it enables dynamic downsampling of metric data using the following properties:

  • minimum-points (Integer) Specifies the minimum number of data points required to trigger resampling. This ensures short jobs are not unnecessarily downsampled.

    • Example: If minimum-points is set to 300 and the native frequency is 60 seconds, resampling will only trigger for jobs longer than 5 hours (300 points × 60 seconds = 18,000 seconds = 5 hours).
  • resolutions (Array [Integer]) An array of target resampling resolutions in seconds.

    • Example: [600, 300, 120, 60]
    • Note: The finest resolution in this list must match the native resolution of your metrics. If you have different native resolutions across your metric configuration, use the finest available resolution here. The implementation will automatically fall back to the finest available resolution if an exact match isn’t found.
  • trigger (Integer) Controls the zoom behavior. It specifies the threshold of visible data points required to trigger the next zoom level. When the number of visible points in the plot window drops below this value (due to zooming in), the backend loads the next finer resolution level.

Example view of resampling in graphs

The following examples demonstrate how the configuration above (minimum-points: 300, trigger: 30) affects the visualization of a 16-hour job.

1. Initial Overview (Coarse Resolution)

Because the job duration (~16 hours) would require more than 300 points at native resolution, the system automatically loads the 600 s resolution. This provides a fast overview load without fetching high-frequency data. The tooltip in this example shows data points every 10 minutes, corresponding to the 600 s resolution.

Initial overview at 600s resolution

2. Zooming without Triggering

When the user zooms in, the system checks whether the number of visible data points in the new view is less than the trigger value (30). In the example below, the zoomed window still contains enough points, so the resolution remains at 600 s. As the tooltip shows, we still see data points every 10 minutes.

Zoom action that does not trigger update

3. Zooming and Triggering Detail

As the user zooms in deeper, the number of visible points drops below the trigger threshold of 30. This signals the backend to fetch the next finer resolution (e.g., 120 s or 60 s). The graph updates dynamically to show the high-frequency peaks that were previously smoothed out. In this example, the backend detects that the number of selected data points is below the trigger threshold and loads the second-to-last resampling level with a resolution of 120 s. With a native frequency of 60 s, a resolution of 120 s corresponds to one data point every 2 minutes, as seen in the tooltip.

Zoom action triggering finer resolution

4. Visual Comparison

The animation below highlights the difference in visual density and performance between the raw data and the optimized resampled view. With minimum-points set to 300, resampling only triggers for jobs longer than 5 hours (assuming a native frequency of 60 seconds).

Comparison of resampling

Suggestion on configuring the resampling

Based on our experiments and the performance we have observed, we recommend to:

  1. Configure "minimum-points": 900. Assuming a native frequency of 60 seconds, resampling will then only trigger for jobs longer than 15 hours.
  2. Configure "resolutions" with only 2 or 3 levels, the last level being the native frequency. A resampling resolution of 600 is only recommended for jobs longer than 24 hours.
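
Putting this together, a possible configuration following these suggestions could look like the sketch below. The intermediate resolution of 120 and the trigger value of 30 are assumptions you should adapt to your native metric frequency and plotting behavior:

"resampling": {
  "minimum-points": 900,
  "trigger": 30,
  "resolutions": [
    120,
    60
  ]
}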

14 - How to regenerate the Swagger UI documentation

Overview

This project integrates Swagger UI to document and test its REST API. The Swagger documentation files can be found in ./api/.

Generate Swagger UI files

You can generate the swagger-ui configuration by running the following command from the cc-backend root directory:

go run github.com/swaggo/swag/cmd/swag init -d ./internal/api,./pkg/schema -g rest.go -o ./api

You need to move one generated file:

mv ./api/docs.go ./internal/api/docs.go

Finally rebuild cc-backend:

make

Use the Swagger UI web interface

If you start cc-backend with the -dev flag, the Swagger web interface is available at http://localhost:8080/swagger/. To use the Try Out functionality, e.g. to test the REST API, you must enter a JWT key for a user with the API role.

15 - How to setup a systemd service

Run ClusterCockpit components as systemd services

How to run as a systemd service.

The files in this directory assume that you install ClusterCockpit to /opt/monitoring/cc-backend. Of course you can choose any other location, but make sure you replace all paths starting with /opt/monitoring/cc-backend in the clustercockpit.service file!

The config.json may contain the optional fields user and group. If specified, the application will call setuid and setgid after reading the config file and binding to a TCP port (so it can use a privileged port), but before it starts accepting any connections. This is good for security, but it also means that the var/ directory must be readable and writeable by this user. The .env and config.json files may contain secrets and should not be readable by this user, as they are read before privileges are dropped. If these files are changed, the server must be restarted.
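
A minimal sketch of the two relevant config.json fields, assuming a dedicated unprivileged system user and group both named clustercockpit (the account names are an assumption; the rest of your configuration is omitted here):

{
  "user": "clustercockpit",
  "group": "clustercockpit"
}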

  1. Clone this repository somewhere in your home
git clone git@github.com:ClusterCockpit/cc-backend.git
  2. (Optional) Install dependencies and build. In general it is recommended to use the provided release binaries.
cd cc-backend && make

Copy the binary to the target folder (adapt if necessary):

sudo mkdir -p /opt/monitoring/cc-backend/
cp ./cc-backend /opt/monitoring/cc-backend/
  3. Modify the config.json and env-template.txt files from the configs directory to your liking and put them in the target directory
cp ./configs/config.json /opt/monitoring/config.json && cp ./configs/env-template.txt /opt/monitoring/.env
vim /opt/monitoring/config.json # do your thing...
vim /opt/monitoring/.env # do your thing...
  4. (Optional) Customization: Add your versions of the login view, legal texts, and logo image. You may use the templates in ./web/templates as a blueprint. Each override works independently.
cp login.tmpl /opt/monitoring/cc-backend/var/
cp imprint.tmpl /opt/monitoring/cc-backend/var/
cp privacy.tmpl /opt/monitoring/cc-backend/var/
# Ensure your logo and any images you use in your login template have a suitable size.
cp -R img /opt/monitoring/cc-backend/var/img
  5. Copy the systemd service unit file. You may adapt it to your needs.
sudo cp ./init/clustercockpit.service /etc/systemd/system/clustercockpit.service
  6. Enable and start the server
sudo systemctl enable clustercockpit.service # optional: if enabled, the service is (re-)started automatically at boot
sudo systemctl start clustercockpit.service

Check what's going on:

sudo systemctl status clustercockpit.service
sudo journalctl -u clustercockpit.service

16 - How to use the REST API Endpoints

Overview

ClusterCockpit offers several REST API endpoints. While some are an integral part of the ClusterCockpit stack workflow (such as start_job), others are optional. These optional endpoints supplement the functionality of the web interface with information reachable from scripts or the command line. For example, job metrics can be requested for specific jobs and processed in external statistics programs.

All of the endpoints listed for both administrators and users are secured by JWT authentication. As such, all prerequisites applicable to JSON Web Tokens apply in this case as well, e.g. private and public key setup.

See also the Swagger Reference for more detailed information on each endpoint and the payloads.

Admin Accessible REST API

Admin API Prerequisites

  1. The JWT has to be generated either by a dedicated API user (which has only the api role) or by an administrator with both the admin and api roles.
  2. JWTs have a limited lifetime, i.e. they become invalid after a configurable amount of time (see the auth.jwt.max-age config option).
  3. Administrator endpoints are additionally subject to a configurable IP whitelist (see the api-allowed-ips config option). By default there is no restriction on the IPs that can access the endpoints.

Admin API Endpoints and Functions

Endpoint | Method | Request Payload(s) | Description
/api/users/ | GET | - | Lists all users
/api/clusters/ | GET | - | Lists all clusters
/api/tags/ | DELETE | JSON payload | Removes the payload array of tags specified with Type, Name, Scope from the DB. Private tags cannot be removed.
/api/jobs/start_job/ | POST, PUT | JSON payload | Starts a job
/api/jobs/stop_job/ | POST, PUT | JSON payload | Stops a job
/api/jobs/ | GET | URL query params | Lists jobs
/api/jobs/{id} | POST | $id, JSON payload | Loads the specified job's metadata
/api/jobs/{id} | GET | $id | Loads the specified job with metrics
/api/jobs/tag_job/{id} | POST, PATCH | $id, JSON payload | Adds the payload array of tags specified with Type, Name, Scope to the job with $id. Tags are created in the DB.
/api/jobs/tag_job/{id} | POST, PATCH | $id, JSON payload | Removes the payload array of tags specified with Type, Name, Scope from the job with $id. Tags remain in the DB.
/api/jobs/edit_meta/{id} | POST, PATCH | $id, JSON payload | Edits the meta_data DB column info
/api/jobs/metrics/{id} | GET | $id, URL query params | Loads the specified job metrics for the given metric and scope params
/api/jobs/delete_job/ | DELETE | JSON payload | Deletes the job specified in the payload
/api/jobs/delete_job/{id} | DELETE | $id, JSON payload | Deletes the job specified by DB id
/api/jobs/delete_job_before/{ts} | DELETE | $ts | Deletes all jobs before the specified Unix timestamp
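
As a usage sketch, listing jobs via the admin REST API could look like the following; the host name is a placeholder, and any filter query parameters are omitted:

curl -X GET "https://monitoring.example.com/api/jobs/" \
  -H "accept: application/json" \
  -H "Authorization: Bearer <JWT TOKEN>"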

User Accessible REST API

User API Prerequisites

  1. The JWT has to be generated either by a dedicated API user (which has only the api role) or by a user with an additional api role.
  2. JWTs have a limited lifetime, i.e. they become invalid after a configurable amount of time (see the jwt.max-age config option).

User API Endpoints and Functions

Endpoint | Method | Request | Description
/userapi/jobs/ | GET | URL query params | Lists jobs
/userapi/jobs/{id} | POST | $id, JSON payload | Loads the specified job's metadata
/userapi/jobs/{id} | GET | $id | Loads the specified job with metrics
/userapi/jobs/metrics/{id} | GET | $id, URL query params | Loads the specified job metrics for the given metric and scope params

17 - How to use the Swagger UI documentation

Overview

This project integrates Swagger UI to document and test its REST API. The Swagger documentation files can be found in ./api/.

Access the Swagger UI web interface

If you start cc-backend with the -dev flag, the Swagger web interface is available at http://localhost:8080/swagger/. To use the Try Out functionality, e.g. to test the REST API, you must enter a JWT key for a user with the API role.