Stats and alerts through CloudWatch Logs analysis

TL;DR

If your logs are already managed by AWS CloudWatch, you can use Insights to easily run queries, summarise data, or even discover anomalous conditions in your systems. A short piece of code listed below shows how this can be done quickly.

Running queries over logs using Insights

Monitoring is a must nowadays, and general-purpose solutions in this field usually work out-of-the-box with some quick and simple configuration. Tools like New Relic or Datadog, for instance, let you get critical information about user experience, such as error rates and response times. However, when it comes to capturing and analysing more detailed, domain-related data to find issues tied to your concrete business, we have to look for other ways to collect important events, take metrics, define alerts, and so on. A very common stack that has gained a lot of popularity in this field is ELK. But if your infrastructure is already set up in AWS, with Dockerized containers and logs managed through CloudWatch as ours is, you have a powerful tool already available: AWS CloudWatch Logs Insights. With Insights, you can write queries to filter, project and group logs to get the information that you want. For instance, in the example below, we’ve written a very simple query that filters audio ad requests for one of our projects, showing their occurrence minute-by-minute.
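
A query along these lines produces that minute-by-minute count. The filter pattern below is only illustrative, since it depends on how the audio ad requests are actually logged in your log group:

```
fields @timestamp, @message
| filter @message like /audio ad request/
| stats count(*) as requests by bin(1m)
```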

This is great for looking at live stats during working hours, but in some cases we may also want automatic alerts when some domain-related situation occurs. For instance, in this case, if we are not getting audio ad requests for, let’s say, 15 minutes, it means that something is not working as expected and, consequently, we have to take immediate action.

To have full flexibility over how to process the data and determine alerting conditions, we decided to write a simple, reusable script that runs the query and fetches the result set using the AWS SDK and Node.js. The resulting script is listed below; it:

  1. Builds the CloudWatch request and executes the query, returning a Promise
  2. Once the query has been accepted for processing, polls CloudWatch until its results are ready
  3. Optionally applies a set of data pre-processing functions passed as a parameter
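
A minimal sketch of that flow is shown here, using the CloudWatchLogs client from the AWS SDK for JavaScript (v2). The helper names (runInsightsQuery, waitForResults) and the one-second polling interval are our own choices, not anything prescribed by the SDK:

```javascript
// Sketch of a reusable Insights query runner (AWS SDK for JavaScript v2).
const AWS = require('aws-sdk');

const cloudWatchLogs = new AWS.CloudWatchLogs({ region: 'us-east-1' });

// Polls getQueryResults until the query leaves the Scheduled/Running states.
const waitForResults = (queryId) =>
  cloudWatchLogs
    .getQueryResults({ queryId })
    .promise()
    .then((response) => {
      if (response.status === 'Scheduled' || response.status === 'Running') {
        return new Promise((resolve) =>
          setTimeout(() => resolve(waitForResults(queryId)), 1000)
        );
      }
      if (response.status !== 'Complete') {
        throw new Error(`Query ended with status ${response.status}`);
      }
      return response.results;
    });

// 1. Builds the CloudWatch request and starts the query, returning a Promise.
// 2. Once the query is accepted, polls CloudWatch until the results are ready.
// 3. Applies an optional list of pre-processing functions to the result set.
const runInsightsQuery = ({ logGroupName, queryString, startTime, endTime, preprocessors = [] }) =>
  cloudWatchLogs
    .startQuery({ logGroupName, queryString, startTime, endTime })
    .promise()
    .then(({ queryId }) => waitForResults(queryId))
    .then((results) => preprocessors.reduce((acc, fn) => fn(acc), results));

module.exports = { runInsightsQuery };
```

Keeping everything Promise-based makes the helper easy to compose with whatever rendering or alerting code ends up consuming the results.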

This script can be used either to group and render the data in a graph or to detect an anomalous condition that should trigger an alert, for example by integrating it with monitoring tools like Nagios.
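
As a rough example of the alerting side, the 15-minute check described earlier could look like this. The log group name and message pattern are placeholders, and the exit codes follow the Nagios plugin convention:

```javascript
// check_audio_ads.js — a hypothetical Nagios-style check built on runInsightsQuery.
const { runInsightsQuery } = require('./runInsightsQuery');

const FIFTEEN_MINUTES = 15 * 60;
const now = Math.floor(Date.now() / 1000);

runInsightsQuery({
  logGroupName: '/ecs/audio-ads-service', // placeholder: use your own log group
  queryString: 'filter @message like /audio ad request/ | stats count(*) as requests',
  startTime: now - FIFTEEN_MINUTES,
  endTime: now,
})
  .then((results) => {
    // Each result row is a list of { field, value } pairs.
    const row = results[0] || [];
    const hit = row.find((f) => f.field === 'requests');
    const requests = hit ? Number(hit.value) : 0;
    if (requests === 0) {
      console.log('CRITICAL - no audio ad requests in the last 15 minutes');
      process.exit(2); // Nagios CRITICAL
    }
    console.log(`OK - ${requests} audio ad requests in the last 15 minutes`);
    process.exit(0); // Nagios OK
  })
  .catch((err) => {
    console.log(`UNKNOWN - query failed: ${err.message}`);
    process.exit(3); // Nagios UNKNOWN
  });
```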

Costs involved

CloudWatch Logs Insights is basically an on-demand stream processing service and, like almost every managed cloud service, it comes with a cost, so before we start using it we should consider the costs involved. The Insights pricing scheme is based on the amount of log data processed per query; at the time this blog post was written, it was $0.005 per GB processed.

So, let’s say that our app generates 2 MB of logs every 5 minutes and we need to run an Insights query once per minute to determine whether the number of errors in that 5-minute window is higher than a specific threshold. Let’s do a cost-per-day calculation (the same arithmetic is generalised in the small helper after the list):

  • 60 minutes * 24 hours = 1,440 minutes in a day = 1,440 queries run per day
  • 1,440 times per day * 2 MB per query (last 5 min) = 2,880 MB or ~3 GB of log processing per day
  • 3 * 0.005 USD = 0.015 USD per day
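
The calculation above, generalised so you can plug in your own volumes; the price is hard-coded to the figure quoted earlier, so check current AWS pricing for your region:

```javascript
// Rough daily-cost estimate for scheduled Insights queries.
// 0.005 USD/GB is the price quoted above; verify against current AWS pricing.
const estimateDailyCostUSD = ({ mbScannedPerQuery, queriesPerDay, pricePerGB = 0.005 }) =>
  (mbScannedPerQuery * queriesPerDay / 1024) * pricePerGB;

// The scenario above: 2 MB scanned per query, one query per minute.
console.log(estimateDailyCostUSD({ mbScannedPerQuery: 2, queriesPerDay: 60 * 24 }).toFixed(3));
// → "0.014" USD per day (the post rounds 2,880 MB up to ~3 GB, hence 0.015)
```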

Thus, this cloud-provided log processing costs us a cent and a half per day. Of course, 2 MB every 5 minutes is an extremely conservative log volume, and apps may easily multiply this number thousands of times. In those cases, custom stream processing with frameworks like Apache Storm, running on provisioned compute available 24×7, will be a more cost-effective solution over time. So, the approach described in this post is better suited to getting a quick, 100% hosted log processing solution that can be implemented in an hour or two.