While countless companies are starting to use Kubernetes to offer their services and applications, many are running into the classic problems of “Day-2 operations”: the systems are often not optimized, maintenance aspects are missing, or observability concepts are implemented insufficiently or incorrectly. A common problem is that existing monitoring and logging solutions cannot be reused, because they do not fit the requirements of distributed systems and multi-cluster environments.
Problems of monitoring distributed systems
But why is it so hard to monitor distributed systems? Is it not enough to ping our services and collect our logs? The answer is no. You cannot monitor a Kubernetes cluster by just pinging services and storing the logs of some containers in your existing logging solution. There are three main reasons:
The infrastructure and the applications are able to scale up and down dynamically
This is the big advantage of systems like Kubernetes. You deploy your application to the cluster and the system manages it for you: it scales the application up when the load rises and back down when the load falls. The same kind of autoscaling can apply to the cluster itself, so that nodes are added or removed as more or less computing power is needed. The flip side is that we have to monitor and collect logs from a dynamic system in which servers and applications scale without any manual input.
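To make "scale up if the load rises" concrete, here is a minimal sketch of the scaling rule that the Kubernetes Horizontal Pod Autoscaler documents: the desired replica count is the current count multiplied by the ratio of observed to target metric, rounded up. The function name, the bounds, and the numbers are assumptions for illustration, and the sketch leaves out the tolerance band and stabilization windows that a real autoscaler applies.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Scale the replica count by the ratio of observed to target metric.

    Simplified: no tolerance band, no stabilization window.
    min_replicas/max_replicas are hypothetical bounds for this example.
    """
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


# Load above target: scale up. Load below target: scale down.
print(desired_replicas(4, current_metric=90, target_metric=60))
print(desired_replicas(4, current_metric=30, target_metric=60))
```

The point for monitoring is that the number of things to observe is an output of this calculation, not something an operator configures by hand.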
Traditional monitoring systems rely on hostnames or IP addresses
For this reason we cannot simply pick a traditional monitoring system and collect everything we need. In a Kubernetes environment, IP addresses and hostnames are reassigned very quickly. We never know whether the IP address of our application is still in use, so we cannot rely on these properties. We have to find other ways to identify what we are monitoring.
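One common alternative is to identify workloads by labels instead of addresses. The following is a minimal sketch of that idea in plain Python; the `Target` class, the label keys, and the addresses are hypothetical and only illustrate the principle, not any particular monitoring system's API.

```python
from dataclasses import dataclass


@dataclass
class Target:
    address: str   # changes whenever the pod is rescheduled
    labels: dict   # stays stable across reschedules


def select(targets, selector):
    """Return all targets whose labels contain every key/value pair in selector."""
    return [t for t in targets
            if all(t.labels.get(k) == v for k, v in selector.items())]


# The same workload before and after a reschedule: the address differs,
# the labels do not.
before = [Target("10.0.1.17:8080", {"app": "checkout", "env": "prod"}),
          Target("10.0.2.41:8080", {"app": "search", "env": "prod"})]
after = [Target("10.0.5.3:8080", {"app": "checkout", "env": "prod"}),
         Target("10.0.2.41:8080", {"app": "search", "env": "prod"})]

selector = {"app": "checkout"}
print([t.address for t in select(before, selector)])
print([t.address for t in select(after, selector)])
```

A rule written against the label `app=checkout` keeps working after the reschedule, whereas a rule written against `10.0.1.17` would silently stop matching.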
The monitoring system should scale automatically
While our applications and services scale automatically, we have to enable our monitoring system to do the same. Otherwise, a growing number of clusters and workloads can overload the monitoring system, in effect DDoS-ing it ourselves. If it crashes, we lose observability completely.
One customer asked for our support in implementing a multi-cluster environment on AWS. They wanted us to build a central solution that collects metrics and logs from all their other Kubernetes clusters. In short, they gave us a few requirements:
- One consistent dashboard solution
- A similar way to access metrics and logs
- No vendor lock-in (it should be possible to collect logs and metrics from outside AWS as well)
- Customers and developers should be able to set up their own log streams and forward their logs to their own logging solution if necessary
In addition to these requirements, we had to connect various networks to be able to collect logs and metrics, so we started by building the network connectivity. We solved this with a VPC Endpoint Service on the monitoring cluster's side and a VPC Endpoint on the side of each monitored cluster (called the customer cluster). Every cluster has to be accepted manually by the VPC Endpoint Service. The customer appreciated this, because they can now control which clusters are allowed to push logs and metrics to the central monitoring cluster. The VPC Endpoint lets us connect applications on a customer cluster with applications in our monitoring cluster.
Prometheus is a prominent solution for monitoring Kubernetes clusters. An easy way to set up and manage Prometheus is the Prometheus Operator by CoreOS, so this is how we installed it. Because we wanted to collect our metrics in a central cluster, we used Prometheus' remote write feature, which lets each Prometheus instance write the metrics it collects to our central instance. For the central side we chose Thanos, because we had used it in an earlier project and had good experience with it. Thanos works like an extension of Prometheus and enables us to store long-term metrics. All Thanos components are set up with autoscaling, so that our monitoring cluster can handle the load from multiple Prometheus instances. AWS S3 is used to store all the metrics, because it is cheap and fast.
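As an illustration of the remote write setup, the sketch below builds a `Prometheus` custom resource for the Prometheus Operator as a Python dictionary and prints it as JSON (which is also valid YAML input for `kubectl`). The fields `spec.remoteWrite[].url` and `spec.externalLabels` are part of the operator's `Prometheus` resource; everything else is an assumption made for this example: the resource name, the namespace, the `cluster` label, and the URL, which stands in for whatever remote-write receiver the central cluster exposes (with Thanos this is typically the Receive component).

```python
import json


def prometheus_manifest(cluster_name: str, remote_write_url: str) -> dict:
    """Build a Prometheus custom resource that forwards metrics centrally.

    The metadata values are hypothetical placeholders.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "Prometheus",
        "metadata": {"name": "customer-prometheus", "namespace": "monitoring"},
        "spec": {
            # Attached to every series so the central store can tell
            # the customer clusters apart.
            "externalLabels": {"cluster": cluster_name},
            # Where the collected samples are pushed.
            "remoteWrite": [{"url": remote_write_url}],
        },
    }


manifest = prometheus_manifest(
    cluster_name="customer-a",
    # Hypothetical address reachable through the VPC Endpoint.
    remote_write_url="https://metrics.monitoring.example/api/v1/receive",
)
print(json.dumps(manifest, indent=2))
```

Setting a distinct external label per customer cluster matters here, because otherwise series from different clusters would be indistinguishable once they arrive in the central store.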
The logging solution consists of two parts. First, we had to find a way to ship all the logs from the containers to a central place, and second, we had to find a central logging solution to store and query them.
For the shipping part we decided to use the logging-operator by Banzai Cloud. The logging-operator uses fluentd and fluent-bit to collect logs. The fluent-bit instances run as a DaemonSet and are small and fast; they collect the logs of the containers on each node and forward them to the fluentd instance. Fluentd supports many plugins to modify the logs and forward them to our central logging solution. The logging-operator can also set up different log streams, so we can collect the logs of the Kubernetes system and forward them to one system, while the application developers forward their logs to another system of their choice.
For the storage part we evaluated two solutions. We would have considered a third option, Amazon CloudWatch, but it was ruled out because the customer asked us to avoid vendor lock-in. The two remaining candidates were the ELK stack and Loki. In the end we decided to use Loki, because it is very lightweight, scales horizontally, and is compatible with S3. We do not need a separate database, which is a nice feature, and logs can be queried in a way that is very similar to how we query metrics from Prometheus. For our requirements, Loki was the best match.
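The similarity between querying metrics and querying logs comes from the shared label selector syntax. The following sketch builds a PromQL and a LogQL query from the same set of labels; the helper names, the metric `http_requests_total`, and the label values are assumptions for illustration, and the helpers do not escape quotes inside label values.

```python
def label_selector(labels: dict) -> str:
    """Render labels as a {key="value", ...} selector, sorted for stable output."""
    parts = [f'{key}="{value}"' for key, value in sorted(labels.items())]
    return "{" + ", ".join(parts) + "}"


def promql_rate(metric: str, labels: dict, window: str = "5m") -> str:
    """PromQL: per-second rate of a counter over the given window."""
    return f"rate({metric}{label_selector(labels)}[{window}])"


def logql_filter(labels: dict, needle: str) -> str:
    """LogQL: log lines from the selected streams that contain `needle`."""
    return f'{label_selector(labels)} |= "{needle}"'


labels = {"cluster": "customer-a", "namespace": "shop"}
print(promql_rate("http_requests_total", labels))
print(logql_filter(labels, "error"))
```

Both queries select by the same `cluster` and `namespace` labels, so a Grafana dashboard can move from a metric panel to the matching log lines without translating between two different addressing schemes.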
Conclusion and outlook
With this setup we can query logs and metrics from all clusters through our central Grafana instance. The data of Thanos and Loki is stored in S3 buckets, which is cheap and easy to manage, because we do not need to take care of things like Persistent Volumes for long-term data. The components in our monitoring cluster are set up with autoscaling so that they can handle the load of multiple customer clusters.
In the near future, we will have to make a few small changes to connect various on-prem clusters to this solution. It has not yet been decided whether we will also connect clusters running on other cloud providers (e.g. Azure or GCP), but we expect this to require little additional effort. All in all, we can use this solution for every Kubernetes cluster the customer wants us to observe.