In this post we explore how to identify and resolve application performance issues over the network. We highlight monitoring tools and the triage process, then walk through a common situation, users complaining of poor performance, and show how the same user experience might be traced to an issue with the WAN, the LAN, a server, the application or the database.
We have found that up to 31% of mid-market businesses believe they have a problem with their application performance.
Life can become uncomfortable for an IT team whose users are complaining about slow or unreliable applications. Users who can’t work will quickly create a lot of pressure, but diagnosing and fixing application performance can be very tricky despite that pressure. As applications move increasingly to SaaS models or cloud platforms, performance problems can become both more acute and more difficult to diagnose.
How have people traditionally addressed application issues?
In the past, when an application started to run slowly, people tended to do one of two things: they either threw more bandwidth at the problem or threw more computing power at it.
That doesn’t work so well these days. Web and cloud applications have become commonplace, the application environment is more complex and there is often sufficient bandwidth already.
People often assume that an application performance issue is caused by the network and start by calling their network provider. However, when we analysed application performance issues suffered by our customers, we identified that the network was to blame only around 30% of the time.
So, what else causes performance problems?
- It turned out that 30% of problems were down to infrastructure: things like the server operating system, the server hardware, the configuration and the storage.
- A massive 40% sat in the application space: either the performance of the application itself (perhaps badly coded or badly implemented applications) or, quite often, the way that the database had been configured.
When the network causes an application performance issue, it’s often fairly easy to fix. If a site goes down, a router fails or someone pulls the plug on it, those things are easy to spot and fix. When the infrastructure or application causes the problem, it’s not always straightforward to identify the root cause.
The presence of multiple resolver groups (such as the internal IT team, and third parties who supply the network, the data centre, the server farm, the database and the application) makes it harder still to resolve issues. How do you know who to kick?
Triaging an application performance issue
We need to identify what’s causing an application performance issue before we can fix it. When responding to an issue, we generally refer to this first stage as Triage. Triage will identify whether an issue originates within the network, the infrastructure or the application.
- A network issue might be found in the WAN, the LAN, the internet, or perhaps the cabling.
- An infrastructure issue might be found in the hardware, operating system or configuration of a server, or in storage.
- An application issue might be found in the design of the application, the database, or the way that the application is using the database.
If this is not the only issue within the network, Triage will also help to prioritise the response.
The need for monitoring tools to trace performance problems
A good monitoring tool is required in order to identify an issue quickly and with minimum effort; ideally one that covers the whole of the IP path. You’ll also need someone with the skills to use it.
An effective monitoring tool will provide insight into your IT infrastructure along the whole IP path. It will provide a wide range of information such as whether a device is up or down, the utilisation of a link, the application traffic on the link, the memory utilisation of a server or photographs of each local installation. It may be possible to construct a representation of each critical application’s IP path, with the ability to drill down into lower levels of detail in order to trace and identify issues.
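To illustrate how a path representation of this kind might work, here is a minimal sketch in which the overall status of an application’s IP path rolls up the worst status of its components, as a traffic-light dashboard would. The component names and the three-colour scheme are invented for this example, not taken from any particular product:

```python
# Severity ordering for the roll-up: the path shows its worst component status.
SEVERITY = {"GREEN": 0, "AMBER": 1, "RED": 2}

def path_status(components: dict) -> str:
    """Roll up per-component statuses into one traffic-light status for the path."""
    return max(components.values(), key=lambda status: SEVERITY[status])

# Hypothetical components along one application's IP path
path = {
    "LAN switch": "GREEN",
    "WAN uplink": "RED",   # e.g. utilisation over threshold
    "App server": "GREEN",
}
# path_status(path) == "RED", so the dashboard flags this application
```

Drilling down is then simply a matter of listing which components carry the worst status.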
If you’re facing application issues but your network service does not include a detailed monitoring facility, then you may need to buy one. Alternatively, you might buy an audit, in which someone installs the tools temporarily in order to help you identify the problem.
These options might seem a little scary, but they don’t need to be. For example, we run monitoring as a service, either with or without a network, and we set up the monitoring remotely using a secure connection to a customer’s network and a list of IP addresses and permissions that we build up with the customer. This can then be left in place for the future.
So, with monitoring in place, let’s illustrate how you might then use it to go about triaging a performance issue. We’ll assume that the user has reported poor performance and consider several hypothetical causes.
For these illustrations, we’ll use the SAS Next Generation Monitoring tool. We’ll also use the Critical Path Monitoring module. This visually highlights the IP path for a critical application, helping you see the location of problems at a glance and drill down to see more detail.
Let’s start with a WAN fault.
Let’s imagine that some Home Workers have reported that an application they use is running slow. You don’t yet know the cause of the issue, so you turn to your monitoring tool and start with the dashboard. The dashboard shows that there is a problem with one of your applications; the same one, in fact, that has been reported by your users.
If we click on the application’s red traffic light, we can see an IP path map for the application. From the map you can clearly see that the WAN uplink is highlighted as RED.
We can click to drill down to the interface details, where we see that the outbound utilisation has breached the 85% threshold that has been set.
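The threshold logic behind an alert like this is simple to sketch. The following is a minimal illustration assuming a byte counter polled at a fixed interval; the function names are invented, and the 85% figure simply mirrors this example:

```python
def utilisation_pct(bytes_sent: int, interval_s: float, link_capacity_bps: int) -> float:
    """Convert a byte-counter delta over a polling interval into % link utilisation."""
    bits_per_second = (bytes_sent * 8) / interval_s
    return 100.0 * bits_per_second / link_capacity_bps

def check_wan_uplink(bytes_sent: int, interval_s: float,
                     link_capacity_bps: int, threshold_pct: float = 85.0) -> str:
    """Return a RED/GREEN status for the uplink, as a dashboard might."""
    pct = utilisation_pct(bytes_sent, interval_s, link_capacity_bps)
    return "RED" if pct >= threshold_pct else "GREEN"

# A 100 Mbit/s uplink that sent ~690 MB in a 60-second polling interval
# works out at 92% utilisation, so check_wan_uplink(...) returns "RED".
status = check_wan_uplink(bytes_sent=690_000_000, interval_s=60.0,
                          link_capacity_bps=100_000_000)
```

In practice a monitoring tool would read the counters from the router (for example via SNMP) rather than being handed them directly, but the threshold comparison is the same.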
As we mentioned earlier, a WAN problem is often fairly straightforward to identify and fix. In this case, for a traditional network the fix is likely to involve increasing bandwidth. This might not always be something you can achieve quickly, but at least you can stop looking for other causes, and take other remedial action while you wait. In practice, it’s likely that this sort of issue would have been experienced regularly. Thus, with ongoing monitoring in place, the IT manager might already have had a quote to upgrade the bandwidth.
How about a LAN or Server fault?
The same user experience might have been caused by a problem with the LAN or the server, rather than the WAN.
Let’s imagine a scenario in which office-based users have reported that the application they use is slow, and we’ll start by looking at the dashboard.
Again, we can see that there is, indeed, a problem with this application, so we’ll click onto the red traffic light to drill down for more information.
This brings up the IP Path map, on which you can clearly see that the Application Server and Application are highlighted as RED.
If we click to drill down further, we might find that memory use breached a set threshold of 85%.
Drilling down further, we find that the Cluster Service is leaking memory.
To resolve this, we might need to do nothing more than restart the service. If this were an SAS monitoring service, we might set it up to do this automatically.
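An automatic restart of this kind is straightforward to script. The sketch below assumes a Linux host running systemd; the service name `cluster-service` and the injectable `runner` parameter are inventions for this illustration, and the 85% threshold matches the example above:

```python
import subprocess

def memory_used_pct(used_bytes: int, total_bytes: int) -> float:
    """Memory in use as a percentage of total."""
    return 100.0 * used_bytes / total_bytes

def restart_if_leaking(service: str, used_bytes: int, total_bytes: int,
                       threshold_pct: float = 85.0, runner=subprocess.run) -> bool:
    """Restart the given systemd service when memory use breaches the threshold.

    Returns True if a restart was triggered. `runner` is injectable so the
    logic can be exercised without touching a real service.
    """
    if memory_used_pct(used_bytes, total_bytes) >= threshold_pct:
        runner(["systemctl", "restart", service], check=True)
        return True
    return False
```

A real monitoring service would add safeguards, for example rate-limiting restarts and alerting when the same service keeps breaching the threshold, since restarting only treats the symptom of a leak.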
Alternatively, a slow application might be caused by the application itself. Let’s assume a scenario where a slow application has been reported and we’ve seen from the dashboard that there is indeed a problem with that application.
If we drill into the application, then we see on the IP Path that Application is highlighted as RED.
When you click on the red Application icon you will drill down to the application information. From here you can see that “Job Engine v2: Jobs Queued” is in a critical state because it has breached the threshold of zero that we’ve set for the purpose of this illustration.
Resolving a problem like this might involve rebuilding the job engine and clearing the queued jobs, after which performance improves for all users.
Or maybe our slow application has actually been caused by a database issue.
So let’s start with the dashboard again, and see that there is a problem with the application.
This time, when we click on the traffic light, we can clearly see that the Database Application and Database are highlighted as RED.
When we click on the icon and drill down into the Database information, we can clearly see that the Average Write Latency is exceeding the pre-defined threshold of 20ms. You can also see the most Expensive Queries that are causing the issues.
A fault like this might clear when the query is finished. However, a redesign of the query by a DBA might be required to permanently resolve the issue.
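To show what “average write latency” and “expensive queries” mean in practice, here is a minimal sketch that averages per-write latency samples and flags the slowest queries, worst first. The query names and sample values are invented, and the 20ms threshold matches the example above:

```python
def average_latency_ms(samples: list) -> float:
    """Mean of a list of per-write latency samples, in milliseconds."""
    return sum(samples) / len(samples)

def expensive_queries(query_latencies: dict, threshold_ms: float = 20.0) -> list:
    """Return (query, avg latency) pairs exceeding the threshold, worst first."""
    flagged = {query: average_latency_ms(samples)
               for query, samples in query_latencies.items()
               if average_latency_ms(samples) > threshold_ms}
    return sorted(flagged.items(), key=lambda item: item[1], reverse=True)

# Invented query names with latency samples in ms
samples = {
    "report_rollup": [45.0, 38.0, 52.0],
    "insert_order":  [8.0, 9.5, 7.0],
    "rebuild_index": [120.0, 95.0],
}
worst = expensive_queries(samples)
# worst == [("rebuild_index", 107.5), ("report_rollup", 45.0)]
```

A real database exposes this data through its own instrumentation (for example, query statistics views), but the ranking logic a monitoring tool applies is essentially this.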
So an application running over the network might run slowly for lots of reasons, with the symptoms being similar for each. Triaging and resolving the problem can be difficult and time-consuming (especially with multiple resolver groups). However, it can be made much quicker and easier if you have a monitoring service installed, or can have one set up temporarily to help with the issue.
If you find yourself in this situation, we would be happy to discuss it with you. You can download our Application Audit brochure to find out more about temporary support for a particular issue.
You can also watch our video, showing the dashboard in a bit more detail here: Introduction to the SAS Unified Dashboard