> For the complete documentation index, see [llms.txt](https://docs.computle.com/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.computle.com/service-delivery/service-delivery-architecture/telemetry-and-monitoring-at-computle.md).

# Telemetry and Monitoring at Computle

At Computle, we maintain continuous **24/7 telemetry** across our entire infrastructure and Computle Machines, ensuring that every aspect of performance, security, and uptime is closely monitored. Our system tracks anomalies, machine issues, outages, and performance trends to keep operations running smoothly.

## **Real-Time Monitoring**

**System Information Monitored:**

* **Memory Usage**: Total, free, and used memory, helping identify potential memory bottlenecks.
* **CPU**: CPUs and their usage to monitor performance under load.
* **Disk**: Disk partitions, usage statistics, and file system health to detect potential storage issues.
* **GPU Information**: Active monitoring of GPU utilisation, thermal performance, power draw, and memory usage.
* **Network**: Interface details, traffic statistics, and potential anomalies like high latency or packet loss.
* **Disk Status**: Continuous health checks.

This allows us to quickly identify and resolve potential issues before they affect performance.

***

## **Anomaly Detection**

Our telemetry system uses advanced algorithms to detect anomalies across the estate, such as:

* **Unusual spikes in resource usage** (CPU, memory, disk).
* **Sudden drops in network throughput**.
* **Hardware failures**, such as disk read/write errors or network interface degradation.

When an anomaly is detected, alerts are immediately generated for our engineering team to assess and resolve the issue. We monitor overall **machine health** by looking at historical data trends, identifying patterns of performance degradation, and ensuring proactive maintenance.

***

## **Global Endpoint Monitoring**

We monitor the **global network infrastructure**, including switches, routers, and other critical endpoints, ensuring high availability across regions. This includes tracking uptime, bandwidth usage, and hardware status at each site. If any hardware, such as switches or network interfaces, shows signs of failure, immediate action is taken to prevent service disruptions.

***

## **Outage Detection and Response**

We continuously monitor for machine outages or downtime. If a machine goes offline or experiences reduced availability, our system detects this automatically and begins the remediation process. Failover mechanisms are also in place to ensure high availability across all virtual machines.

Automated responses and notifications are sent to the relevant engineers and customers in case of significant performance or hardware issues. Through this proactive approach, we ensure that the Computle environment runs smoothly 24/7, delivering optimal performance for customers.