How to troubleshoot Problems in Server Setups, Rails Apps or any other Config or Code Problem

This post was first published on the imedo.de devblog. The blog went offline in 2012, so I migrated the posts to my personal blog.

This post might be interesting for all people who are faced with strange problems like this: "Yesterday it worked. Now it’s broken" or "It works on my machine (and it does not in production)".

I’m sure that all programmers and sysadmins have had an incident like this in their lives. I’ve had a lot of these problems and found that in the end there is always an explanation. Very rarely is it some quantum mechanics effect that caused the problem. In most cases there is a really simple explanation, even if it was hard to find. Examples include "the cause of a crashing Perl script was the Java version of the app starting the script" and "failing tests that were caused by a minor version difference in a testing library, where the error message pointed to something completely different".

The process described below was used in all cases to find the root cause of the problem which then was solved very easily. We weren’t aware that we used this process but rather did it intuitively. After talking about the process and writing it down, we were able to find more and more problems by following these steps and also to transfer the knowledge to other people so that they will build up the intuition to find root causes of problems as well.

General idea

Systems that work and systems that don’t work differ.

If you make the non-working system equal to the working system, it will work.

That’s all there is to troubleshooting (basically).

Process to find out the difference

The hard part is to find out where the working and the non-working system differ.

The general process is really simple though:

1. List all items that can differ
2. Check if they differ
3. Make them equal (one at a time!)
4. Repeat 3 until finished. If still broken, think harder about 1 and start again

The optimized version of this is to start with the things that are most likely to cause the problem.

The term "most likely" is based on

your own experience
information found on the web: blog posts, Google searches, etc.
experiences of your co-workers

It is fundamental that you make each step consciously (writing the step/change down helps with that). If one step doesn’t yield the desired outcome, revert it immediately. Again, having written it down helps you not to forget anything. Forgetting steps may make the situation even worse.
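The process above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: both systems are described as plain dictionaries, `is_working` stands in for whatever check tells you the problem is gone, and the names `find_culprit`, `priority` and `journal` are made up for this sketch.

```python
from typing import Any, Callable, Dict, List, Optional

_MISSING = object()  # marks "this item does not exist on that system"


def _set(state: Dict[str, Any], key: str, value: Any) -> None:
    """Set or remove a single item in the state under test."""
    if value is _MISSING:
        state.pop(key, None)
    else:
        state[key] = value


def find_culprit(
    working: Dict[str, Any],
    broken: Dict[str, Any],
    is_working: Callable[[Dict[str, Any]], bool],
    priority: List[str],
    journal: List[str],
) -> Optional[str]:
    # Steps 1 and 2: list the items and keep only those that differ.
    differing = [
        key
        for key in set(working) | set(broken)
        if working.get(key, _MISSING) != broken.get(key, _MISSING)
    ]
    # Optimization: most likely suspects first, everything else afterwards.
    rank = {key: pos for pos, key in enumerate(priority)}
    differing.sort(key=lambda key: (rank.get(key, len(priority)), key))

    state = dict(broken)
    for key in differing:
        before = state.get(key, _MISSING)
        # Step 3: make exactly one item equal and write it down.
        _set(state, key, working.get(key, _MISSING))
        journal.append(f"set {key} to match the working system")
        if is_working(state):
            return key
        # No effect: revert immediately and write that down as well.
        _set(state, key, before)
        journal.append(f"reverted {key}")
    return None  # still broken: think harder about step 1
```

If the function returns `None`, either the list of items was incomplete or more than one difference is needed to fix the problem. The journal then shows exactly what has been tried and reverted so far.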

How can systems differ?

The main questions are:

What changed since it worked (if it is the same system)?
What is different or changed on the non-working system compared to the working system (if it is a different system)?

The latter is a lot easier because you have something to compare to. In the former case you have to create the "working system" again, which in itself may be the solution to the problem.

If the answer is "nothing", think again! Time has progressed, so at least the time has changed.

Possible effects of changed time

File system full
weird time-dependent behavior of applications
system/application restart occurred
data changes happened
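Two of these effects are easy to check with a script. The following is a minimal sketch using only the Python standard library. The function names and the one-hour window in the usage note are arbitrary choices for illustration.

```python
import os
import shutil
import time
from typing import List


def disk_usage_percent(path: str = "/") -> float:
    """Return how full the file system containing `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def modified_since(root: str, seconds: float) -> List[str]:
    """List files below `root` that were modified within the last `seconds`."""
    cutoff = time.time() - seconds
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                if os.lstat(full).st_mtime >= cutoff:
                    changed.append(full)
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(changed)
```

Calling `modified_since("/etc", 3600)`, for example, would list the files below `/etc` that changed during the last hour, which is a quick way to challenge the claim that "nothing" has changed.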

Other things that may have changed:

software versions through package updates – Minor Changes are important!
- OS Kernel
- OS packages
- application libraries (Ruby gems, jars)
Database schemas
Database content
Filesystem content of any kind (That includes timestamps of a file that is only read!)
- Location of files
- symlink vs. real files
- timestamps
Hardware
Increased load
- Network I/O
- Disk I/O
- CPU
- Exceeded RAM -> Swapping
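For the file system items in this list, one possible approach is to take a snapshot of the same directory on both systems and compare the snapshots. The sketch below assumes both trees are reachable from one machine (for example via a copy or a mount). `snapshot_tree` and `diff_snapshots` are hypothetical helper names, and files are read into memory completely, which is fine for configuration directories but not for large data.

```python
import hashlib
import os
from typing import Dict, List, Tuple

# relative path -> (kind, content fingerprint, modification time)
Snapshot = Dict[str, Tuple[str, str, int]]


def snapshot_tree(root: str) -> Snapshot:
    """Record kind, content hash (or link target) and mtime of every entry."""
    snapshot: Snapshot = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            mtime = int(os.lstat(full).st_mtime)
            if os.path.islink(full):
                # Symlinks are recorded by their target, not followed.
                snapshot[rel] = ("symlink", os.readlink(full), mtime)
            elif os.path.isfile(full):
                with open(full, "rb") as handle:
                    digest = hashlib.sha256(handle.read()).hexdigest()
                snapshot[rel] = ("file", digest, mtime)
    return snapshot


def diff_snapshots(working: Snapshot, broken: Snapshot) -> Dict[str, List[str]]:
    """Report, per path, which aspects differ between the two systems."""
    labels = ("kind", "content", "mtime")
    report: Dict[str, List[str]] = {}
    for path in sorted(set(working) | set(broken)):
        if path not in broken:
            report[path] = ["missing on broken system"]
        elif path not in working:
            report[path] = ["only on broken system"]
        else:
            changed = [
                label
                for label, a, b in zip(labels, working[path], broken[path])
                if a != b
            ]
            if changed:
                report[path] = changed
    return report
```

The same comparison idea works for package lists or gem versions: build one dictionary per system and look at the keys whose values differ.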

Some things will be straightforward, and it is obvious why something breaks something else. Some things are not as obvious (at least not while you are trying to find them; it’s always obvious afterwards!). Don’t jump to conclusions about cause and effect while you debug. If you think "I won’t try X because X has nothing to do with Y", try X! Maybe it has something to do with Y. You don’t know before you try. Revert (or bring into the same state as on the working system) even the "most obvious" things that "can’t possibly interfere with the problem". That includes

Comments in Source or Configuration files
Whitespaces
Trivial Code/Configuration changes
minor version changes in Packages

The "most likely" rule applies here, too. Don’t start with whitespace if there are other, less subtle differences left. Don’t look at access timestamps if the files on one system are in completely different locations than on the other system. This requires some experience, but with time you’ll learn which things to look for first.
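To decide which differing files to look at first, it can help to classify each difference roughly. The sketch below is one possible way to do that. It assumes comments are whole lines starting with `#` (adjust `comment_prefix` for other formats), and the function name `classify_difference` is made up for this example.

```python
from typing import List


def _normalized(text: str, drop_comments: bool, comment_prefix: str) -> List[str]:
    """Collapse whitespace, drop blank lines and optionally comment lines."""
    lines = []
    for raw in text.splitlines():
        line = " ".join(raw.split())  # collapse runs of whitespace
        if not line:
            continue
        if drop_comments and line.startswith(comment_prefix):
            continue
        lines.append(line)
    return lines


def classify_difference(working: str, broken: str, comment_prefix: str = "#") -> str:
    """Return 'identical', 'whitespace only', 'comments only' or 'substantive'."""
    if working == broken:
        return "identical"
    if _normalized(working, False, comment_prefix) == _normalized(
        broken, False, comment_prefix
    ):
        return "whitespace only"
    if _normalized(working, True, comment_prefix) == _normalized(
        broken, True, comment_prefix
    ):
        return "comments only"
    return "substantive"
```

Files classified as "substantive" are the ones to make equal first. The "comments only" and "whitespace only" ones still belong on the list, just further down.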

Tools

Filesystem-Analysis: df, ls, find
Application-Behavior: strace (dtruss on Solaris and Mac OS), lsof, netstat
Databases: For mysql: mysql, innotop
Packages: Debian: apt-get, dpkg
Finding differences/problem causes in running vs. not running code: Binary Search (e.g. via git bisect, debugger or just plain “print”-Debugging).
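The binary search mentioned in the last item can be sketched as follows. It assumes an ordered list of changes where the system works with none of them applied, is broken with all of them applied, and stays broken once the faulty change is in. `works_with` is a placeholder for your own check (running a test, restarting the app and so on).

```python
from typing import Callable, List


def first_bad_change(changes: List[str], works_with: Callable[[int], bool]) -> str:
    """Find the change that broke the system.

    works_with(n) must return True if the system works with the
    first n changes applied.
    """
    good, bad = 0, len(changes)  # invariant: works_with(good), not works_with(bad)
    while bad - good > 1:
        middle = (good + bad) // 2
        if works_with(middle):
            good = middle
        else:
            bad = middle
    return changes[bad - 1]
```

With eight changes this needs three checks instead of up to eight, which is why the technique pays off quickly on long histories.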

We hope these thoughts help you to debug and troubleshoot strange problems. Feel free to post additions, comments, tools or experiences with troubleshooting.

