Monday, December 5, 2011

Helpful log parsing tips

Most programs and services produce logs. When a user visits an Apache web server, the service will most likely keep a log of that request, including the date and the requester's IP address. Other details might be logged as well. Here is an example of some entries in a logfile:

192.168.1.20 - - [21/Sep/2011:11:04:40 +1000] "GET / HTTP/1.0" 200 468
192.168.1.20 - - [21/Sep/2011:11:07:48 +1000] "GET /login.php HTTP/1.0" 200 6433

Log files usually contain hundreds of such entries, most, if not all, of which are important to us. If there is an issue with a service, perhaps there is an entry in the logfile that can tell us why. Another scenario is where management requires some statistical information: for example, how many unique IP addresses visited their website in the past hour and which pages they requested, or which web pages are the most frequently visited.

If you look around on the web, you will find tools that can retrieve most of this information for you. However, some of these tools may not have the functionality built in to retrieve all the data you require. Hence, knowing how to do things yourself might come in handy.

Here are some examples, all of which use the following log entries:

192.168.1.20 - - [21/Sep/2011:11:04:40 -0500] "GET / HTTP/1.0" 200 6443
192.168.1.25 - - [21/Sep/2011:11:07:48 -0500] "GET /logo.gif HTTP/1.0" 200 4006
192.168.1.22 - - [21/Sep/2011:11:08:40 -0500] "GET /forum.php HTTP/1.0" 200 468
192.168.1.20 - - [21/Sep/2011:11:08:48 -0500] "GET /sports.php HTTP/1.0" 200 98002
192.168.1.22 - - [21/Sep/2011:11:09:42 -0500] "GET /basketball.htm HTTP/1.0" 200 45869
192.168.1.22 - - [21/Sep/2011:11:09:48 -0500] "POST /login.php HTTP/1.0" 404 501
192.168.1.25 - - [21/Sep/2011:11:09:50 -0500] "POST /login.php HTTP/1.0" 404 501
192.168.1.20 - - [21/Sep/2011:11:09:55 -0500] "GET / HTTP/1.0" 200 6433

We can parse out the unique IP addresses that visited our Apache website. We can take this a step further and sort these IP addresses by number of requests. This gives you an idea of which IP addresses chatted the most with the server and which were the least talkative.

awk '{print $1}' access.log | sort | uniq -c | sort -nr
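
Run against the sample entries above, this should produce something like the following (the order of addresses with equal counts may vary):

      3 192.168.1.22
      3 192.168.1.20
      2 192.168.1.25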

We can also generate statistics based on HTTP status codes. From there, you can see how many successful requests were made, as well as how many requests were made for non-existent pages or files. Based on our example log entries above, we should get 6 successful requests and 2 "file not found" requests.

awk '{print $9}' access.log | sort | uniq -c | sort -rn
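
Against the sample entries, that prints:

      6 200
      2 404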

To see only the log entries that generated an HTTP status code of 404 ("file not found"):

awk '$9 == "404"' access.log
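
With the sample log, only the two POST requests to /login.php match:

192.168.1.22 - - [21/Sep/2011:11:09:48 -0500] "POST /login.php HTTP/1.0" 404 501
192.168.1.25 - - [21/Sep/2011:11:09:50 -0500] "POST /login.php HTTP/1.0" 404 501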

Continuing from the last example, if we only wanted to see the IP address and the request that triggered the 404 status code:

awk '$9 == "404"{print $1, $7}' access.log
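
Which trims the output down to:

192.168.1.22 /login.php
192.168.1.25 /login.php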

Let's take the last example a little further. Say management wanted to know the number of requests that resulted in a "404" status code between 11 AM and 12 PM on 21/Sep/2011. The fourth field of each entry looks like "[21/Sep/2011:11:04:40", so characters 14 and 15 hold the hour; that is what the substr() call below extracts.

awk '$9 == "404"' access.log | egrep "21/Sep/2011" | awk 'substr($4,14,2) == "11"'
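
That prints the matching entries themselves; since management asked for a count, pipe the result through wc -l. With the sample log, both 404s fall in the 11 o'clock hour, so this prints 2:

awk '$9 == "404"' access.log | egrep "21/Sep/2011" | awk 'substr($4,14,2) == "11"' | wc -l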

Lastly, if you want to check how many bytes your server served up on a particular day, awk and grep can help us again. The number after the status code in each log entry is the size (in bytes) of the object returned to the client.
In my example, I omit the entries with status code "304". I do this because an intelligent user agent (browser) may already have the object in its cache. A 304 indicates that the cached version has the same timestamp as the 'live' version of the file, so the client does not need to download it again. If the 'live' file were newer, the response would instead be a 200.

egrep "21\/Sep\/2011" | access.log | awk '$9 != "304"{sum+=$10}END{print sum}'
The result of this is in bytes. To convert to kilobytes and or megabytes, the print statement at the end would be {print sum/1024/1024}.
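
Against the sample entries, none of the requests returned a 304, so every size is counted and the pipeline prints 162223 bytes, or roughly 158 KB. If you want tidier output, printf can do the rounding (a small variation on the command above):

egrep "21/Sep/2011" access.log | awk '$9 != "304"{sum+=$10}END{printf "%.1f KB\n", sum/1024}'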

It's important to know that these techniques are not limited to Apache log files. Any log file can be parsed using a combination of grep, awk, sort, uniq and even sed (sed can be used to clean up the output). On Linux systems, the log files under /var/log can be parsed in a similar fashion with slight modifications to the parameters passed to the programs. If you want to get fancy, you can output the data to a text file and run a PHP script that reads this file and outputs a nicely formatted HTML report page, or generate the HTML directly from the shell, as sketched below.
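
As a sketch of that last idea, here is one way to skip the PHP step entirely and let the shell write a simple requests-per-IP report (report.html is just a name picked for this example):

#!/bin/sh
# Build a tiny HTML report of requests per IP address from access.log.
{
  echo "<html><body><h1>Requests per IP</h1><table>"
  awk '{print $1}' access.log | sort | uniq -c | sort -nr |
    awk '{printf "<tr><td>%s</td><td>%s</td></tr>\n", $2, $1}'
  echo "</table></body></html>"
} > report.html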

Resources / Good Reading:
www.the-art-of-web.com
