How to troubleshoot Problems in Server Setups, Rails Apps or any other Config or Code Problem
2009-11-20 07:47:45 +0000
This post was first published on the imedo.de devblog. The blog went offline in 2012, so I migrated the posts to my personal blog.
This post might be interesting for all people who are faced with strange problems like this: "Yesterday it worked. Now it’s broken" or "It works on my machine (and it does not in production)".
I’m sure that all programmers and sysadmins have had an incident like this in their lives. I’ve had a lot of these problems and found out that in the end there’s always an explanation for the problem. Very rarely is it some quantum mechanics effect that caused the problem. In most cases there is a really simple explanation, even if it was hard to find. Examples include "the cause of a crashing Perl script is the Java version of the app starting the script" and "failing tests that were caused by a minor version difference of a testing library, where the error message led to something completely different".
The process described below was used in all cases to find the root cause of the problem which then was solved very easily. We weren’t aware that we used this process but rather did it intuitively. After talking about the process and writing it down, we were able to find more and more problems by following these steps and also to transfer the knowledge to other people so that they will build up the intuition to find root causes of problems as well.
Systems that work and systems that don’t work differ.
If you make the not working system equal to the working system, it will work.
That’s all there is to Troubleshooting (basically).
Process to find out the difference
The hard part is to find out where the working and not working systems differ.
The general process is really simple though:
1. List all items that can differ
2. Check if they differ
3. Make them equal (one at a time!)
4. Repeat step 3 until finished. If still broken, think harder about step 1 and start again
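The first two steps above can be sketched in a few lines of Python. This is only an illustration, not a tool we actually used: the two dictionaries and their keys (`ruby`, `rails_gem`, `db_schema`) are invented examples of "items that can differ", and in practice you would fill them from the real systems.

```python
def find_differences(working, broken):
    """Return {item: (working_value, broken_value)} for every item that differs."""
    differences = {}
    # Look at every item that is known on either system.
    for item in sorted(set(working) | set(broken)):
        w = working.get(item, "<missing>")
        b = broken.get(item, "<missing>")
        if w != b:
            differences[item] = (w, b)
    return differences


# Hypothetical snapshots of two systems (all values are invented).
working_system = {"ruby": "1.8.7", "rails_gem": "2.3.4", "db_schema": "20091101"}
broken_system = {"ruby": "1.8.7", "rails_gem": "2.3.5", "db_schema": "20091101"}

for item, (w, b) in find_differences(working_system, broken_system).items():
    print(f"{item}: working={w} broken={b}")
```

Each entry in the result is a candidate for step 3: make it equal, test, and move on to the next one.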
The optimized version of this is to start with the things that are most likely to cause the problem.
The term "most likely" is based on
- your own experience
- information found on the web: blog posts, Google searches, etc.
- experiences of your co-workers
It is fundamental that you make each step consciously (writing the step/change down helps to do that). If a step doesn’t yield the desired outcome, revert it immediately. Again: having written it down helps you not to forget anything. Forgetting steps may make the situation even worse.
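Writing changes down can be as simple as a text file. As a sketch of the idea in code, the hypothetical `ChangeJournal` below stores an undo action next to every change, so that a step that did not help can be reverted immediately. The class, its method names and the `config` dictionary are assumptions made up for this example.

```python
class ChangeJournal:
    """Records each change together with the action that undoes it."""

    def __init__(self):
        self._entries = []  # list of (description, undo_callable)

    def apply(self, description, do, undo):
        """Perform a change and remember how to revert it."""
        do()
        self._entries.append((description, undo))
        print(f"applied: {description}")

    def revert_last(self):
        """Undo the most recent change that has not been reverted yet."""
        description, undo = self._entries.pop()
        undo()
        print(f"reverted: {description}")

    def pending(self):
        """Descriptions of all changes that are still in effect."""
        return [description for description, _ in self._entries]


# Hypothetical example: the "system" is just a dict of settings.
config = {"log_level": "info"}
journal = ChangeJournal()

journal.apply(
    "set log_level to debug",
    do=lambda: config.update(log_level="debug"),
    undo=lambda: config.update(log_level="info"),
)
# The change did not fix the problem, so undo it right away.
journal.revert_last()
```

After the revert, `journal.pending()` is empty again, which is exactly the state you want before trying the next candidate.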
How can systems differ?
The main questions are:
- What changed since it worked (if it is the same system)?
- What is different or changed on the not working system compared to the working system (if it is a different system)?
The latter is a lot easier because you have something to compare to. In the former case you have to create the "working system" again, which in itself may be the solution to the problem.
If the answer is "nothing", think again! Time has progressed, so at the very least the time has changed.
Possible effects of changed time
- File system full
- weird time-dependent behavior of applications
- system/application restart occurred
- data changes happened
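The first effect in this list, a full file system, is quick to rule out. Here is a minimal sketch using `shutil.disk_usage` from the Python standard library. The 95% threshold is an arbitrary assumption, and on a real server `df` gives you the same answer.

```python
import shutil


def disk_usage_fraction(path="."):
    """Return the used fraction (0.0 to 1.0) of the file system containing path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_nearly_full(path=".", threshold=0.95):
    """True if the file system containing path is at or above the threshold."""
    return disk_usage_fraction(path) >= threshold


print(f"used: {disk_usage_fraction('.'):.0%}, nearly full: {is_nearly_full('.')}")
```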
Other things that may have changed:
- software versions through package updates (minor changes are important!)
  - OS kernel
  - OS packages
  - application libraries (Ruby gems, jars)
- Database schemas
- Database content
- Filesystem content of any kind (That includes timestamps of a file that is only read!)
- Location of files
- symlink vs. real files
- Increased load
  - Network I/O
  - Disk I/O
  - Exceeded RAM -> Swapping
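Several items in this list (file system content, location of files, symlink vs. real file) can be checked mechanically. The following sketch compares two directory trees by content hash and by whether an entry is a symlink. It is an illustration under simplifying assumptions: the function names are made up, timestamps and permissions are ignored, symlinked directories are not followed, and files are read completely into memory.

```python
import hashlib
import os
import tempfile


def snapshot(root):
    """Map each relative file path under root to (kind, hash or link target)."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if os.path.islink(full):
                # Record where the link points instead of the target's content.
                result[rel] = ("symlink", os.readlink(full))
            else:
                with open(full, "rb") as handle:
                    digest = hashlib.sha256(handle.read()).hexdigest()
                result[rel] = ("file", digest)
    return result


def compare_trees(working_root, broken_root):
    """Return the sorted relative paths that are missing or different."""
    working = snapshot(working_root)
    broken = snapshot(broken_root)
    return sorted(
        path
        for path in set(working) | set(broken)
        if working.get(path) != broken.get(path)
    )


# Small demo with two temporary directories standing in for two systems.
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    for root, text in ((a, "timeout = 30\n"), (b, "timeout = 60\n")):
        with open(os.path.join(root, "app.conf"), "w") as handle:
            handle.write(text)
    changed = compare_trees(a, b)
    print(changed)
```

A file that moved to a different location shows up twice in the result: once as missing at the old path and once as unexpected at the new one.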
Some things will be straightforward, and it is obvious why something breaks something else. Some things are not as obvious (at least not at the time when you try to find them; it’s always obvious afterwards!). Don’t jump to conclusions about cause and effect while you debug. If you think "I won’t try X because X has nothing to do with Y", try X! Maybe it has something to do with Y. You don’t know before you try. Revert (or bring to the same state as on the working system) even the "most obvious" things that "can’t possibly interfere with the problem". That includes
- Comments in Source or Configuration files
- Trivial Code/Configuration changes
- minor version changes in Packages
The "most likely" rule applies here, too. Don’t start with whitespace if there are other, less subtle differences left. Don’t look at access timestamps if the files on one system are in completely different locations than on the other system. This requires some experience, but with time you’ll find which things to look for first.
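To make "trivial" differences such as comments or trailing whitespace visible at all, a plain text diff is usually enough. Below is a small sketch with `difflib` from the Python standard library. The two configuration snippets are invented, and `repr` is used when printing so that the trailing space in the last line can actually be seen.

```python
import difflib

# Invented example configs: a changed comment and one trailing space.
working_conf = "# cache settings\ncache_size = 64\ntimeout = 30\n"
broken_conf = "# cache settings (tuned)\ncache_size = 64\ntimeout = 30 \n"


def config_diff(working_text, broken_text):
    """Return the unified diff between two config texts as a list of lines."""
    return list(
        difflib.unified_diff(
            working_text.splitlines(),
            broken_text.splitlines(),
            fromfile="working",
            tofile="broken",
            lineterm="",
        )
    )


for line in config_diff(working_conf, broken_conf):
    print(repr(line))
```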
- Filesystem-Analysis: df, ls, find
- Application-Behavior: strace (dtruss on Solaris and Mac OS), lsof, netstat
- Databases: MySQL: mysql, innotop
- Packages: Debian: apt-get, dpkg
- Finding differences/problem causes in running vs. not running code: Binary Search (e.g. via git bisect, debugger or just plain “print”-Debugging).
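The binary search mentioned in the last item works on any ordered list of changes, not only on git commits. Here is a minimal sketch of the idea. It assumes that the state before the first change is good, that the state after all changes is bad, and that the system stays broken once the culprit has been applied. The change names and the `is_good` check are hypothetical stand-ins for "apply these changes and run the test".

```python
def first_bad_index(changes, is_good):
    """Return the index of the first change after which the system is broken.

    is_good(applied) receives the list of changes applied so far and
    returns True if the system still works with exactly those changes.
    """
    low, high = 0, len(changes) - 1  # the culprit is somewhere in [low, high]
    while low < high:
        mid = (low + high) // 2
        if is_good(changes[: mid + 1]):
            low = mid + 1  # still working: the culprit comes later
        else:
            high = mid  # already broken: the culprit is here or earlier
    return low


# Hypothetical history of eight changes; "c5" is the one that breaks things.
changes = [f"c{i}" for i in range(8)]


def is_good(applied):
    return "c5" not in applied


culprit = changes[first_bad_index(changes, is_good)]
print(culprit)
```

With eight changes this needs three checks instead of up to eight, which matters when every check means a deploy or a long test run.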
We hope these thoughts help you to debug and troubleshoot strange problems. Feel free to post additions, comments, tools, or experiences with troubleshooting.