Realscale is a philosophy for architecting cloud native applications to scale up to 1M visits/day and 1000 requests/sec, typically requiring 10-200 server instances and a variety of internal services.

Full-Stack Cloud Native Monitoring Strategies

It is the weekend, late at night, or first thing Monday morning and you receive the dreaded phone call or SMS: “The application is down! We need all hands to figure this out!” After spending hours tracking down the issue by reviewing logs, inspecting the currently deployed code, and trading hundreds of instant messages, you finally find it: someone deployed a patch to production late Friday and it broke an important part of your application.

You have probably been there. While tracking down issues can be difficult for any application, it is particularly difficult for cloud native applications. Applications are deployed to multiple servers across any number of zones. Servers are created and destroyed over time. And the volume of traffic you must handle makes log files hard to read (even centrally managed log files).

So, how do you solve the need for monitoring in a cloud native architecture? While every application’s needs are different, we must first understand the options available to us. This article provides an overview of the goals of monitoring and the types of cloud native monitoring available, to help provide some insight into monitoring your Realscale applications.

Monitoring Goals

Cloud monitoring seeks to achieve 3 key goals:

  1. User success monitoring, which ensures that users can perform the necessary functions within the application and, when they cannot, that they can obtain the technical support needed to get back on track. This should be the primary goal of monitoring a cloud application and is driven from a KPI perspective.
  2. System performance monitoring, which focuses on the overall throughput of the system under the current and anticipated future load. If this is not monitored, users may experience slow performance or outages.
  3. Troubleshooting and error handling, which helps internal teams isolate and resolve issues quickly.

Each of these goals contributes to the overall health and success of the business. Understanding the different types of monitoring used to achieve these goals is our next topic.

The Types of Monitoring

Cloud native applications are built using a variety of solutions and layers. Therefore, no single monitoring technique can support today’s complex applications. Instead, we must install a variety of monitoring solutions that provide insight into the different aspects of a cloud application. Below are details on how each type of monitoring solution provides insight into the key monitoring goals:

Cloud Provider Monitoring

When troubleshooting poor performance or errors, the first step is to verify that all cloud infrastructure services your application depends on are functioning. Commonly used infrastructure services include servers, network infrastructure, databases, and shared file systems.

While each cloud infrastructure provider (e.g. AWS, Google Cloud, and Azure) publishes its own status page, services such as Cloud66 Birdseye offer a high-level rollup of a cloud vendor’s status at a glance. No matter how you decide to gather cloud infrastructure status, it is critical to review it, manually or through automation, and to respond to issues that emerge for the specific resources your application depends upon.

Security Monitoring

The recent Shellshock vulnerability in bash and the Heartbleed vulnerability in OpenSSL were found to affect many servers around the world. In addition, security vulnerabilities in the third-party libraries used by your application must be tracked so that they do not expose it to risk. Security monitoring is responsible for generating alerts when known security vulnerabilities emerge in the operating system and third-party libraries currently in use. Once these vulnerabilities are discovered, remediation can be carried out.
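As a rough illustration of the alerting step, the sketch below compares installed package versions against a list of advisories. The package names, version numbers, and the `find_vulnerable` helper are hypothetical, and the code assumes purely numeric dotted versions; a real scanner would consume a maintained vulnerability feed and handle richer version schemes.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.2.10' into (1, 2, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def find_vulnerable(
    installed: Dict[str, str], advisories: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (name, installed_version, fixed_in) for every package
    that is older than the first fixed version named in an advisory."""
    alerts = []
    for name, version in sorted(installed.items()):
        fixed_in = advisories.get(name)
        if fixed_in and parse_version(version) < parse_version(fixed_in):
            alerts.append((name, version, fixed_in))
    return alerts


# Hypothetical inventory and advisory data, for illustration only.
installed = {"examplelib": "1.2.3", "otherlib": "2.0.0"}
advisories = {"examplelib": "1.2.10", "otherlib": "1.9.0"}
alerts = find_vulnerable(installed, advisories)
```

Each entry in `alerts` would then feed whatever alerting channel the team already uses.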

Business Transaction Monitoring

While some applications perform simple tasks, most require complex workflows. Even today’s ecommerce websites focus on a complex series of steps, including: list available items, add an item to a cart, remove an item from the cart, update quantity, checkout, and order confirmation. If the end-user cannot perform any of these steps successfully and within a specific threshold of time then the result is lost revenue.

Business transaction monitoring focuses on verifying that expected workflows can be carried out within the application. This type of monitoring typically requires training and/or coding to teach the monitoring tool how to perform each step through the website and verify that it succeeded. When a step fails, the error is reported and the appropriate action is taken.
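One way to picture such a scripted check is a runner that executes each workflow step in order and records whether it succeeded within a time budget. This is a minimal sketch, not how any particular monitoring product works: `run_workflow`, `StepResult`, and the step names are hypothetical, and the step functions here are stand-ins for real HTTP or browser-driven actions.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    error: Optional[str] = None


def run_workflow(
    steps: List[Tuple[str, Callable[[], None]]],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[StepResult]:
    """Run each (name, action) step in order and stop at the first failure.

    A step fails if it raises an exception or takes longer than the budget.
    """
    results = []
    for name, action in steps:
        start = clock()
        try:
            action()
            elapsed = clock() - start
            ok = elapsed <= budget_seconds
            error = None if ok else f"exceeded {budget_seconds}s budget"
        except Exception as exc:
            elapsed = clock() - start
            ok, error = False, repr(exc)
        results.append(StepResult(name, ok, elapsed, error))
        if not ok:
            break  # later steps depend on this one, so do not continue
    return results


def failing_add_to_cart() -> None:
    raise RuntimeError("cart unavailable")


results = run_workflow(
    [
        ("list_items", lambda: None),
        ("add_to_cart", failing_add_to_cart),
        ("checkout", lambda: None),
    ],
    budget_seconds=2.0,
)
```

Stopping at the first failed step keeps the report focused on the step that actually broke the workflow.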

Performance Monitoring

Performance monitoring is responsible for tracking KPIs around API and website response times, throughput, traffic (e.g. concurrent users or requests), and error response rates. Performance monitoring provides insight into the overall user experience (UX) to diagnose externally facing issues. It can also be used to monitor internal business transactions, providing metrics on areas of the application that require additional resources or tuning. Performance monitoring tools may be used externally to track overall response time, or in combination with on-premises tools to track internal application performance metrics to assist in diagnosing issues.
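To make these KPIs concrete, the sketch below derives throughput, latency percentiles, and error rate from a batch of request samples. The `summarize` helper, the sample format, and the choice to count only 5xx responses as errors are assumptions made for illustration; the percentile uses the simple nearest-rank method.

```python
import math
from typing import Dict, List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(
    samples: List[Tuple[float, int]], window_seconds: float
) -> Dict[str, float]:
    """Summarize (latency_ms, http_status) samples collected over a window."""
    if not samples:
        raise ValueError("no samples collected in this window")
    latencies = [latency for latency, _ in samples]
    errors = sum(1 for _, status in samples if status >= 500)
    return {
        "throughput_rps": len(samples) / window_seconds,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "error_rate": errors / len(samples),
    }


# Four made-up requests observed during a two-second window.
samples = [(100.0, 200), (200.0, 200), (300.0, 500), (400.0, 200)]
summary = summarize(samples, window_seconds=2.0)
```

Percentiles are used here rather than an average because a handful of very slow requests can hide behind a healthy-looking mean.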

Log Monitoring

There are a wide variety of processes and logging solutions that scatter various files across a server. Miss one and you’ll lose important insight into your application and server health. Log monitoring is responsible for looking at all captured log files and assessing any errors or security threats. We have covered distributed logging, including the types of logs generated and the benefits of distributed logging, in a previous article.
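A very small sketch of the idea follows: scan the lines of every collected log file and count those that look like errors. The `scan_logs` helper, the file paths, and the assumption that severe entries contain the words ERROR, CRITICAL, or FATAL are all illustrative; real log formats vary widely.

```python
import re
from collections import Counter
from typing import Dict, List

# Assumed convention: severe entries carry one of these level keywords.
LEVEL_PATTERN = re.compile(r"\b(ERROR|CRITICAL|FATAL)\b")


def scan_logs(lines_by_file: Dict[str, List[str]]) -> Counter:
    """Count error-level lines per log file."""
    counts: Counter = Counter()
    for path, lines in lines_by_file.items():
        for line in lines:
            if LEVEL_PATTERN.search(line):
                counts[path] += 1
    return counts


# Made-up log content keyed by hypothetical file paths.
collected = {
    "/var/log/app.log": [
        "INFO started",
        "ERROR db timeout",
        "FATAL out of memory",
    ],
    "/var/log/worker.log": ["INFO job finished"],
}
error_counts = scan_logs(collected)
```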

Database/Query Monitoring

Among database logs, slow query logs are the most useful because they record queries whose execution time exceeds a specific threshold. Such queries may indicate data structures that are no longer optimal at the current scale of stored data, data structures that were never optimized, or poorly performing SQL. We discussed this in the previous article on distributed logging.

However, slow query logs do not provide insight into the overall query usage for an application. Database monitoring tools provide deeper insights into all executed queries, cache hits and misses, and query result size. These metrics provide details into how the application is performing at run-time with the data set and application load. Developers and DBAs can then work to optimize hotspots and introduce caching techniques to reduce the load on the database as the application load increases.
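The difference between a slow query log and broader query monitoring can be sketched with a thin wrapper around a database connection. The `QueryMonitor` class is hypothetical and uses an in-memory SQLite database purely so the example is self-contained; dedicated database monitoring tools gather far richer data, such as cache hits and result sizes.

```python
import sqlite3
import time
from collections import Counter
from typing import Callable, List, Tuple


class QueryMonitor:
    """Counts every executed statement and records the slow ones."""

    def __init__(
        self,
        conn: sqlite3.Connection,
        slow_threshold_seconds: float,
        clock: Callable[[], float] = time.perf_counter,
    ) -> None:
        self.conn = conn
        self.slow_threshold_seconds = slow_threshold_seconds
        self.clock = clock
        self.executions: Counter = Counter()  # overall query usage
        self.slow_queries: List[Tuple[str, float]] = []  # slow query log

    def query(self, sql: str, params: tuple = ()) -> list:
        start = self.clock()
        rows = self.conn.execute(sql, params).fetchall()
        elapsed = self.clock() - start
        self.executions[sql] += 1
        if elapsed >= self.slow_threshold_seconds:
            self.slow_queries.append((sql, elapsed))
        return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
conn.execute("INSERT INTO items VALUES (1, 'widget')")

# A one-hour threshold means nothing in this demo counts as slow.
monitor = QueryMonitor(conn, slow_threshold_seconds=3600.0)
select_sql = "SELECT name FROM items WHERE id = ?"
rows = monitor.query(select_sql, (1,))
monitor.query(select_sql, (1,))
```

The `executions` counter shows which statements run most often even when none of them is individually slow, which is the kind of hotspot a slow query log alone would miss.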

Resource Health Monitoring

Cloud native applications must deal with a variety of resources, from server instances to virtual networks and filesystems. It is important to monitor the health of each resource instance that has been created to support the application. While there exist a number of health monitoring solutions, not all of them support the addition and removal of resources at regular intervals as is common with cloud native applications. Therefore, it is important to locate a monitoring solution that is capable of understanding that a deleted resource should not result in an alert (but a resource not requested for deletion should).
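The distinction between an expected deletion and a failure can be expressed as a simple set comparison. In this sketch the `unexpected_losses` helper and the resource names are hypothetical; it assumes the monitor keeps the previous inventory and has access to the list of resources that were deliberately scheduled for removal.

```python
from typing import Iterable, List


def unexpected_losses(
    previous: Iterable[str],
    current: Iterable[str],
    pending_deletion: Iterable[str],
) -> List[str]:
    """Return resources that disappeared without a deletion request."""
    vanished = set(previous) - set(current)
    return sorted(vanished - set(pending_deletion))


# web-2 was scaled down on purpose; worker-1 simply disappeared.
to_alert = unexpected_losses(
    previous={"web-1", "web-2", "worker-1"},
    current={"web-1"},
    pending_deletion={"web-2"},
)
```

Only `worker-1` should raise an alert here, because the removal of `web-2` was requested.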

Most cloud infrastructure providers build at least a basic set of monitoring features into their platform. Some vendors even offer deep customization for alerting and include an awareness of the difference between resource deletion and failure. Third-party vendors also exist to extend and further customize these vendor-based monitoring tools.

System and Process Monitoring

Server instances have important resources such as CPU, memory, and local disk storage that must be monitored. Additionally, multiple processes are typically running at one time, including web servers, application servers, and background workers. System monitoring ensures that a server has enough available resources and that all monitored processes are healthy and remain running. If a process fails, the process monitor is responsible for taking the appropriate action, which may include restarting the process and generating an alert. Servers that experience low resource availability or recurring process failures should be marked as unhealthy (i.e. at the resource health level) and replaced.
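The decision logic described above can be sketched as a pure function over a snapshot of a server. Everything here is an assumption made for illustration: the `ServerSnapshot` fields, the 90% usage limit, the limit of three restarts, and the action strings. Collecting the snapshot and actually restarting processes are left to whatever tooling is in use.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class ServerSnapshot:
    cpu_percent: float
    memory_percent: float
    disk_percent: float
    running: Set[str]  # names of processes currently running


def evaluate(
    snapshot: ServerSnapshot,
    required_processes: Set[str],
    restart_counts: Dict[str, int],
    max_percent: float = 90.0,
    max_restarts: int = 3,
) -> List[str]:
    """Return the actions a monitor should take for this snapshot.

    restart_counts is updated in place so recurring failures accumulate
    across calls.
    """
    actions = []
    for name in sorted(set(required_processes) - snapshot.running):
        restart_counts[name] = restart_counts.get(name, 0) + 1
        actions.append(f"restart:{name}")
    usage = max(
        snapshot.cpu_percent, snapshot.memory_percent, snapshot.disk_percent
    )
    low_resources = usage >= max_percent
    recurring = any(count > max_restarts for count in restart_counts.values())
    if low_resources or recurring:
        actions.append("mark_unhealthy")
    return actions


counts: Dict[str, int] = {}
snapshot = ServerSnapshot(35.0, 60.0, 40.0, running={"nginx"})
actions = evaluate(snapshot, {"nginx", "worker"}, counts)
```

Keeping the restart counts outside the function lets the same check run on every polling interval while still detecting a process that keeps failing.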

A Note About Application Performance Management (APM)

As you have likely realized, many of the monitoring types listed above overlap in the kinds of insight they offer. The last few years have seen an increase in the number of monitoring solutions categorized as Application Performance Management (APM). The APM market centers around monitoring and management of applications across the full spectrum of monitoring layers: user, resource, process, database, and transactional.

Leaders in the APM market feature a robust set of options beyond simple monitoring, including bottom-up and top-down monitoring, alerting, analytics, and reporting. Businesses choosing to roll their own solutions using a combination of open source tools and scripting often find themselves building a custom APM solution. Use caution when embarking on custom monitoring solutions, as they require specialized knowledge and significant resources. Instead, we recommend finding an APM solution (either SaaS or on-premises) that provides the level of monitoring support and customization necessary for your cloud native application.
