Virtualization Featured Article

Why Service Monitoring and Testing Matter

July 18, 2017
By Special Guest
Craig Gulliver, Vice President of Support at xMatters

A major part of running a cloud service is ensuring that the system is healthy and performing as expected. A solid monitoring approach should provide transparency across all aspects of your infrastructure. These might include operating systems, applications, build and deployment pipelines, web traffic, sales pipeline, and so on. Monitoring services allow your teams to understand the health of all the components required to deliver your service to clients.

As a cloud service provider, you learn quickly that service downtime impacts your business in many critical ways. Without the right level of transparency across your technology stack, troubleshooting during an incident can become a lengthy investigative process that eats up valuable time and resources. That's why it's important to employ different techniques and use various levels of monitoring that will not only detect but also prevent issues, and help you solve problems quickly before they impact your clients.

Monitoring is based on the premise that the application will detect when an error has occurred, and generate a message that can be acted upon. While this works on a basic level, you’re waiting until something goes wrong before you can take action, at which point the error may have already impacted your clients. In conjunction with the monitors you have in place, you need to consider different ways of testing your service to proactively identify potential problems. There are many forms of service testing, such as integration testing, component testing, black-box testing, white-box testing, and system testing. White-box testing, for example, can help you monitor the internal structures or architecture of your service.
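The difference between waiting for an error message and proactively exercising the service can be sketched as a small synthetic check. This is a minimal illustration, not a description of any particular product: the names `CheckResult` and `run_check`, and the idea of passing in a `probe` callable, are assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    ok: bool
    elapsed_s: float
    detail: str


def run_check(name: str, probe: Callable[[], bool], max_elapsed_s: float) -> CheckResult:
    """Run one proactive check and flag failures, crashes, and slow responses."""
    start = time.monotonic()
    try:
        passed = probe()
        detail = "probe passed" if passed else "probe returned a failure"
    except Exception as exc:  # a crashing probe is itself a finding
        passed = False
        detail = f"probe raised {type(exc).__name__}: {exc}"
    elapsed = time.monotonic() - start
    # A correct but slow answer is treated as a problem too.
    if passed and elapsed > max_elapsed_s:
        passed = False
        detail = f"probe passed but took {elapsed:.3f}s (limit {max_elapsed_s}s)"
    return CheckResult(name, passed, elapsed, detail)
```

A scheduler could call `run_check` at a fixed interval with a probe that performs a real user action, and forward any result where `ok` is false to the same alerting path the monitors use.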

Automation is King
Many repeatable processes can be automated with the right tools, and in this case automation is clearly preferable to manual processes. Automation frees operators and support agents to focus on higher-level tasks instead of running or coordinating groups of commands and checking that they worked as expected. Like all code, automations must be maintained and tested constantly to ensure they are reliable and correct when needed.
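One way to picture an automation that both runs a group of commands and confirms they worked is a runbook in which every step carries its own verification. The sketch below is hypothetical: `run_runbook` and the (name, action, verify) step layout are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, verify); all three are supplied by the caller.
Step = Tuple[str, Callable[[], None], Callable[[], bool]]


def run_runbook(steps: List[Step]) -> List[str]:
    """Run each step, verify it, and stop at the first failed verification."""
    log: List[str] = []
    for name, action, verify in steps:
        action()
        if not verify():
            log.append(f"{name}: FAILED verification, stopping")
            break
        log.append(f"{name}: ok")
    return log
```

Because the runbook is ordinary code, it can be exercised in a test suite with stand-in actions, which is one way to keep it reliable for the moment it is needed.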

While automation is not a silver bullet, it can certainly change the way your teams do business when it matters most. The focus it provides is invaluable to an incident team that is resolving incidents or restoring service levels to normal.

Understanding the variables
Testing should account for the different environmental variables of notification delivery. For instance, email notifications rely on internet connectivity and email relays while SMS messages rely on the availability of mobile networks. Obviously, these are outside our private cloud infrastructure and out of our control.

To compound these issues, testing in a non-production or staging environment is entirely different from testing in production. Even when the infrastructure is an exact mirror of production, we have found that you cannot duplicate the traffic and transaction volume found in production. This makes each production environment unique, which affects the baseline benchmark for tests.

A simplified example would be testing notification delivery and user responses. In a quiet system, the response can be quick because there is little activity and more resources are available for processing. In a busy production system, however, the response time may be longer depending on the levels of traffic. This is tricky for monitoring and testing, since production systems typically carry heavier traffic than testing or staging environments.
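Because each environment has its own normal, a response-time test is easier to trust when it compares a measurement against that environment's own baseline rather than a single fixed number. The following is a rough sketch under stated assumptions: the function names, the choice of the 95th percentile, and the 1.5x tolerance are all illustrative, not values taken from the article.

```python
import statistics
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """Estimate the 95th percentile of response times (needs at least 2 samples)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples, n=100)[94]


def is_regression(latest_s: float, baseline_samples: Sequence[float],
                  tolerance: float = 1.5) -> bool:
    """True when the latest response time exceeds tolerance x the baseline p95."""
    return latest_s > tolerance * p95(baseline_samples)
```

With this approach, the same 0.9 second response could be flagged in a quiet staging environment and accepted in a busy production one, because each is judged against its own recent history.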

Testing and Monitoring Recommendations

Employ different techniques to detect and prevent issues

Test your service in conjunction with the monitors you have in place

Automate repeatable processes

Account for the environmental variables of notification delivery

Service Health:

Exercise your services in different ways to gain a holistic view

Be transparent and honest with your customers

Incorporating system testing
Many cloud providers have shared how they test their services; Netflix's Chaos Monkey is a well-known example. Ideally, you should exercise your services in different ways to have a complete, holistic view of service health. Rigorous testing lets you identify potential issues proactively and address them before they have any business impact.

For SMS delivery, you might consider testing using a global network of real SMS devices. This will allow you to test SMS delivery across various regions around the world, and even across different carriers within a single region. You can measure when a message was sent to the vendor, how long it took to reach the device, and whether the content received matches the original message, among other things.

This information can then be fed into your central monitoring system, which can also be configured to detect various types of failure conditions such as internal component failures, performance issues, or upstream carrier issues. This kind of testing not only exercises your own cloud infrastructure and components, but also the communication networks required to reach end user devices: it’s full stack testing.
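The measurements described above (send time, delivery time, and content match) can be reduced to a single status before they are fed into a monitoring system. The sketch below is an assumption-laden illustration: the `SmsProbe` fields, the status strings, and the 30 second latency limit are hypothetical and do not describe any vendor's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SmsProbe:
    region: str
    carrier: str
    sent_at: float                  # epoch seconds when handed to the vendor
    delivered_at: Optional[float]   # epoch seconds on the test device, None if never received
    sent_body: str
    received_body: Optional[str]


def classify(probe: SmsProbe, max_latency_s: float = 30.0) -> str:
    """Map one end-to-end SMS test to a status a monitoring system can alert on."""
    if probe.delivered_at is None:
        return "not_delivered"
    if probe.received_body != probe.sent_body:
        return "content_mismatch"
    if probe.delivered_at - probe.sent_at > max_latency_s:
        return "slow_delivery"
    return "ok"
```

Grouping these statuses by `region` and `carrier` is one way to tell an upstream carrier issue (failures concentrated on one carrier) from an internal problem (failures everywhere).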

Being transparent
When things go sideways, it's important to be transparent and honest. Make sure you provide all of the details that matter to your customers, so that you can demonstrate that you are working towards improving your services and tackling issues head-on.

Being honest is much easier when you can demonstrate that you responded to an issue quickly and responsibly. Responding appropriately requires planning and processes. Demonstrating that you responded appropriately requires preserving issues and conversations for postmortems.

Doing so manually is time-consuming and error-prone, so focus on tools and approaches that will allow you to integrate critical systems for a streamlined approach.

In a 2017 survey of more than 1,000 DevOps organizations, half of all respondents said they lack a consistent process for responding to a major incident. The greatest delay is the time a ticket sits in the queue before an engineer touches it. You want to make sure you resolve the issue before a customer reports it. This is the essence of proactive customer service.

About the Author

Craig Gulliver is Vice President of Support at xMatters, where previously he was Engineering Manager.

