Last Updated: February 25, 2016
· timmoore

Parsing logs with awk

I had to filter some Apache access logs and Rails server logs to find out which requests caused an enormous spike in memory usage on one of our production servers. Here are a few one-liners that helped me sift through the data:

Print lines within a particular time range

awk '/01:05:/,/01:20:/' access.log

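A range pattern of the form /start/,/end/ makes awk skip every line until it finds one that matches the first pattern, then print every line up to and including the first one that matches the second pattern. Because the range closes at that first match, /01:20:/ includes only the first line logged at 01:20, not the whole minute.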

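Note that, because this is doing simple text matching, it only works if you actually have accesses falling on those minutes. If no line matches the start pattern, nothing is printed; if no line matches the end pattern, awk keeps printing to the end of the file. If you have a low-traffic site that isn't getting hits every minute, you'll need to adjust your patterns to ensure that they match.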
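If the log has gaps, one alternative is to compare the timestamp itself instead of waiting for a line that happens to match. Here is a minimal sketch, assuming Apache's common log format, where the fourth whitespace-separated field looks like [25/Feb/2016:01:07:13 and the time of day starts at character 14. The file sample_access.log and its contents are made up for illustration, and the comparison only makes sense within a single day.

```shell
# Hypothetical sample data: note there are no hits at 01:05 or 01:20.
cat > sample_access.log <<'EOF'
10.0.0.1 - - [25/Feb/2016:01:04:59 +0000] "GET /a HTTP/1.1" 200 512
10.0.0.2 - - [25/Feb/2016:01:07:13 +0000] "POST /b HTTP/1.1" 200 2048
10.0.0.3 - - [25/Feb/2016:01:19:40 +0000] "GET /c HTTP/1.1" 404 128
10.0.0.4 - - [25/Feb/2016:01:23:02 +0000] "GET /d HTTP/1.1" 200 4096
EOF

# Pull HH:MM:SS out of field 4 and compare it as a string.
# Zero-padded times sort lexically in chronological order.
in_range=$(awk '{ t = substr($4, 14, 8) } t >= "01:05:00" && t < "01:21:00"' sample_access.log)
printf '%s\n' "$in_range"
```

With the sample data this prints the 01:07:13 and 01:19:40 lines, even though nothing was logged at 01:05 or 01:20.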

Print lines within a particular time range that match a pattern

awk '/01:05:/,/01:20:/ { if (/POST/) print }' access.log

This narrows the previous command to lines inside the time range that also contain particular text (here, POST).
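To see how the range and the filter interact, here is a small sketch run against made-up data (the file name sample_range.log and the requests in it are hypothetical):

```shell
# Hypothetical sample data with two hits during minute 01:20.
cat > sample_range.log <<'EOF'
10.0.0.1 - - [25/Feb/2016:01:04:59 +0000] "POST /early HTTP/1.1" 200 512
10.0.0.2 - - [25/Feb/2016:01:05:10 +0000] "POST /login HTTP/1.1" 200 2048
10.0.0.3 - - [25/Feb/2016:01:12:30 +0000] "GET /b HTTP/1.1" 200 128
10.0.0.4 - - [25/Feb/2016:01:20:05 +0000] "POST /upload HTTP/1.1" 200 4096
10.0.0.5 - - [25/Feb/2016:01:20:45 +0000] "POST /late HTTP/1.1" 200 4096
10.0.0.6 - - [25/Feb/2016:01:25:00 +0000] "POST /after HTTP/1.1" 200 4096
EOF

# The range opens at the first 01:05 line and closes at the first 01:20 line;
# inside it, only lines containing POST are printed.
posts=$(awk '/01:05:/,/01:20:/ { if (/POST/) print }' sample_range.log)
printf '%s\n' "$posts"
```

Only the /login and /upload requests come out. The /late request at 01:20:45 is dropped because the range already closed on the first 01:20 line, so pick an end pattern one minute later if you want the whole minute.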

Sort access log by response size (increasing)

awk --re-interval '{ match($0, /(([^[:space:]]+|\[[^\]]+\]|"[^"]+")[[:space:]]+){7}/, m); print m[2], $0 }' access.log | sort -nk 1

The specifics of this will depend on your access log format, but the basic idea is:

  1. Parse the response size field from the line.
  2. Prepend it to the line (separated by a space).
  3. Sort by the prepended size.

The sort command can sort by space-separated field, so in a perfect world, we could just pass the right sort field to one command and be done with it. Unfortunately, many common access log formats include fields that can have spaces in them, so we have to break out some more complex regular expressions.

  • [^[:space:]]+ matches one or more non-whitespace characters.
  • \[[^\]]+\] matches one or more characters (including spaces) enclosed in square brackets.
  • "[^"]+" matches one or more characters (including spaces) enclosed in double quotes.
  • ([^[:space:]]+|\[[^\]]+\]|"[^"]+") ORs those together and matches any one of the above, which covers all of the field formats in my access log.
  • [[:space:]]+ matches one or more whitespace characters.
  • (([^[:space:]]+|\[[^\]]+\]|"[^"]+")[[:space:]]+) concatenates those, so that we end up with a pattern that matches a field and then a space.
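  • Adding {7} to the end requires exactly seven consecutive fields, each followed by whitespace, so the last field matched is the seventh on the line. Interval expressions like {7} need the --re-interval flag in gawk versions before 4.0; later versions enable them by default. You'll probably need to adjust the count to select the correct field for your log format.
  • match tests its first argument ($0, the current line) against the regular expression in the second argument, and stores the captured groups in an array named by the third argument (m). The three-argument form is a gawk extension, so this one-liner needs gawk rather than a strictly POSIX awk.
  • m[2] holds the second captured group, counting left parentheses. Because that group sits inside the repeated section, it keeps the text from the last (seventh) repetition: the matched field, without the trailing spaces.
  • sort -nk 1 sorts numerically on the first field of the input, in ascending order, so the biggest responses are at the end.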
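If gawk isn't available, a different technique can get at the same field with plain awk: split each line on double quotes instead of matching fields with a regular expression. This is only a sketch, assuming Apache's combined log format, where the status and size sit between the quoted request and the quoted referer, and assuming the request itself contains no double quotes. The file sample_sizes.log and its contents are made up for illustration.

```shell
# Hypothetical sample data in combined log format.
cat > sample_sizes.log <<'EOF'
10.0.0.1 - - [25/Feb/2016:01:07:13 +0000] "POST /b HTTP/1.1" 200 2048 "-" "curl/7.43.0"
10.0.0.2 - - [25/Feb/2016:01:04:59 +0000] "GET /a HTTP/1.1" 200 512 "http://example.com/" "Mozilla/5.0 (X11; Linux)"
10.0.0.3 - - [25/Feb/2016:01:23:02 +0000] "GET /d HTTP/1.1" 200 40960 "-" "Mozilla/5.0 (X11; Linux)"
EOF

# With " as the field separator, $3 is the piece between the request and the
# referer, e.g. " 200 2048 ". Split it on whitespace: f[1] is the status and
# f[2] is the size. Prepend the size, then sort numerically on it.
sorted=$(awk -F'"' '{ split($3, f, " "); print f[2], $0 }' sample_sizes.log | sort -n -k 1,1)
printf '%s\n' "$sorted"
```

The lines come out ordered by size (512, 2048, 40960), with the size prepended just as in the gawk version.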

Depending on what you have in your access log, you can use this on other values, too, such as response time.
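For example, if the value you care about is the last field on the line, awk's $NF saves you the regular expression entirely. The sketch below assumes a hypothetical log format in which the response time in microseconds (Apache's %D) has been appended as the final field; the file sample_times.log and its contents are made up.

```shell
# Hypothetical sample data: the last field is the response time in microseconds.
cat > sample_times.log <<'EOF'
10.0.0.1 - - [25/Feb/2016:01:04:59 +0000] "GET /a HTTP/1.1" 200 512 1500
10.0.0.2 - - [25/Feb/2016:01:07:13 +0000] "POST /b HTTP/1.1" 200 2048 930000
10.0.0.3 - - [25/Feb/2016:01:19:40 +0000] "GET /c HTTP/1.1" 404 128 42000
EOF

# Prepend the last field ($NF) to each line, sort numerically on it,
# and keep the final line, which is the slowest request.
slowest=$(awk '{ print $NF, $0 }' sample_times.log | sort -n -k 1,1 | tail -n 1)
printf '%s\n' "$slowest"
```

Here the slowest request is the POST to /b at 930000 microseconds.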