
I don't know about you, but in almost a decade of software development, a good chunk of my job has been weeding through log files.

I regularly need to pull summarized metrics out of log files. There are some great tools you probably already know for the common search/discovery tasks (grep, the silver searcher ag, awk, etc.). Lately, I've needed things like request timings broken down by route, or average time spent calling the database, broken down by query parameters.

I can do this with existing tools (even something as simple as zcat | grep | sort | uniq, and derivations thereof), but I have to wait for the entire set to be computed before seeing results. Something like awk is more powerful, but it's a lot to write for such a relatively common use-case.
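To make that concrete, here's a sketch of a status-code histogram in plain awk, with a few sample log lines inlined. Note that nothing prints until the END block, i.e. until all input has been consumed:

```shell
# Tally HTTP status codes from combined-format log lines (sample data inlined)
printf '%s\n' \
  '1.2.3.4 - - [14/Apr/2016:18:13:29 +0200] "GET /a HTTP/1.1" 200 101' \
  '1.2.3.4 - - [14/Apr/2016:18:13:38 +0200] "GET /b HTTP/1.1" 404 100' \
  '1.2.3.4 - - [14/Apr/2016:18:15:46 +0200] "GET /c HTTP/1.1" 200 512' |
awk 'match($0, /" [0-9][0-9][0-9] /) { n[substr($0, RSTART + 2, 3)]++ }
     END { for (c in n) print n[c], c }' | sort -rn
# prints:
# 2 200
# 1 404
```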

This is where rare comes in.

Github: https://github.com/zix99/rare

Binary releases: https://github.com/zix99/rare/releases

What is Rare?

Rare's name is a play on regex and more/less. Rare does fast, line-based, realtime regex extraction and aggregation into common formats such as histograms, numerical summaries, tables, and more!

Rare Usage

Once a regex extracts the match groups, a handlebars-like expression syntax allows grouping, readability enhancements, and comparisons.

Features

  • Multiple summary formats including: filter (like grep), histogram, tables, and numerical analysis
  • File glob expansion (e.g. /var/log/* or /var/log/*/*.log) and -R for recursion
  • Optional gzip decompression (with -z)
  • Following with -f, or re-open-style following with -F (use --poll to poll)
  • Ignoring lines that match an expression with -i
  • Realtime aggregation and summaries (no need to wait for all the data to be scanned)
  • Multi-threaded reading, parsing, and aggregation
  • Optional color-coded output (--color to force, --nocolor to disable)
  • Pipe support (stdin for reading; piping stdout disables color), e.g. tail -f | rare ...
  • Extracting values from JSON with the {json} expression

Example: Parsing nginx logs

Let's take a look at an example.

Let's say we have a traditional nginx access log. It probably looks something like this:

209.15.211.12 - - [14/Apr/2016:18:13:29 +0200] "GET /mirror/server.cfg HTTP/1.1" 206 101 "-" "curl/7.43.0"
209.15.211.12 - - [14/Apr/2016:18:13:38 +0200] "GET /mirror/server.cfg HTTP/1.1" 206 100 "-" "curl/7.43.0"
107.182.132.11 - - [14/Apr/2016:18:15:46 +0200] "GET / HTTP/1.1" 200 1021 "-" "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/98 Safari/537.4 (StatusCake)"

There's some good information in there. Let's say we want to better understand the distribution of status codes. We can write a regex like this to extract it: " (\d{3}). Let's apply that to rare's matcher -m.
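(If you want to sanity-check a pattern before handing it to rare, plain GNU grep with -P works; note that the leading quote is part of the match:)

```shell
echo '209.15.211.12 - - [14/Apr/2016:18:13:29 +0200] "GET /mirror/server.cfg HTTP/1.1" 206 101 "-" "curl/7.43.0"' |
  grep -oP '" \d{3}'
# prints: " 206
```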

$ rare histo -m '" (\d{3})' access.log
" 200               160,681   
" 404               860       
" 304               53        
" 403               14        
" 301               12        

Matched: 161,622 / 161,622
Groups:  6

That's great, but since part of the regex matched the leading ", it ended up in the result. This is where expressions come in. We can select just the first match group as our output with -e.

$ rare histo -m '" (\d{3})' -e '{1}' access.log
200                 160,681   
404                 860       
304                 53        
403                 14        
301                 12        

Matched: 161,622 / 161,622
Groups:  6

That's better. Now let's remove the 200 response codes, because I want to focus on the other types. We'll use --ignore (or -i) for this: it ignores a line when the expression evaluates to non-empty (truthy). I've also thrown on -b to get a bar visualization of the results, to quickly understand the distribution.

$ rare histo -m '" (\d{3})' -e '{1}' -i '{eq {1} 200}' -b access.log
404                 860        ||||||||||||||||||||||||||||||||||||||||||||||||||
304                 53         |||
403                 14         
301                 12         
206                 2          

Matched: 941 / 161,622 (Ignored: 160,681)
Groups:  5

Great! Now let's really focus in on this. Let's pull out the method and path with another regex: "(\w{3,4}) ([A-Za-z/.]+).*" (\d{3})

You'll also note I changed the extraction expression -e and the ignore expression -i to use the appropriate match-group indexes.

$ rare histo -m '"(\w{3,4}) ([A-Za-z/.]+).*" (\d{3})' -e '{3} {1} {2}' -i '{eq {3} 200}' -b access.log 
404 GET /robots.txt                728        ||||||||||||||||||||||||||||||||||||||||||||||||||
404 GET /favicon.ico               109        |||||||
304 GET /                          52         |||
301 GET /mirror                    11         
403 GET /mirror/distros/cup/       7          

Matched: 941 / 161,622 (Ignored: 160,681)
Groups:  23

Looks like robots.txt and favicon.ico are our main culprits for missing files.

Now let's have some fun: let's see which IPs get the most non-200 status codes. (IPs tweaked to preserve anonymity.)

$ rare table -m '([0-9\.]+).*"(\w{3,4}) ([A-Za-z/.]+).*" (\d{3})' -e '{tab {4} {1}}' -i '{eq {4} 200}' access.log
               404            304            403            301            206            
77.88.47.6     397            14             0              0              0                             
5.255.250.94   50             1              0              0              0                             
100.43.81.133  33             0              0              0              0                             
208.58.208.13  23             0              3              1              0                             
77.88.47.89    18             0              0              0              0                             
77.88.5.9      17             0              0              0              0                             
209.190.211.28 7              0              0              5              2                             
180.76.15.147  9              3              0              0              0                             
84.201.133.42  10             0              0              0              0                             
93.158.161.28  9              0              0              0              0                             
180.76.15.11   8              0              0              0              0                             
73.149.131.189 6              0              0              2              0                             
180.76.15.143  5              2              0              0              0                             
180.76.15.145  6              1              0              0              0                             
180.76.15.6    7              0              0              0              0                             
173.48.198.132 0              0              6              0              0                             
180.76.15.140  5              1              0              0              0                             
180.76.15.148  6              0              0              0              0                             
180.76.15.163  5              1              0              0              0                             
Matched: 941 / 161,622 (Ignored: 160,681)
Rows: 127; Cols: 5

Expressions

There's a lot more documentation on expressions on GitHub, or via rare docs expressions, but needless to say they're the easiest way to summarize and aggregate the output.

It's handlebars-like (just one brace instead of two), and has many built-in helpers for arithmetic, equality checking, JSON parsing, and some nicer things like bytesize formatting.

For example, {bytesize 1024} will output 1 KB. You can take this further, like {bytesize {add {1} {2}}}, to add two match groups together and output the sum as a byte size.
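Under the hood that's just integer arithmetic plus unit formatting. A rough awk equivalent of {bytesize {add {1} {2}}}, with made-up input values (rare's exact rounding may differ), might be:

```shell
# Add two byte counts, then format with a unit: 512 + 1536 = 2048 -> "2 KB"
echo '512 1536' | awk '{
  b = $1 + $2
  if (b >= 1024 * 1024)  printf "%d MB\n", b / (1024 * 1024)
  else if (b >= 1024)    printf "%d KB\n", b / 1024
  else                   printf "%d B\n", b
}'
# prints: 2 KB
```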

Bucketing

For histograms, a common desire might be to bucket a given value. This is useful, for instance, to get a sense of what file sizes are being returned in the logs.
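As I understand the bucket helper, it floors each value down to a multiple of the bucket size, so that nearby values collapse into a single histogram row. In awk terms:

```shell
# Floor each value to a 1 MB boundary: 500000 -> 0, 1500000 -> 1048576, 2500000 -> 2097152
printf '%s\n' 500000 1500000 2500000 |
  awk '{ size = 1024 * 1024; print int($1 / size) * size }'
```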

Let's take a look at another example:

Look at the size of every request, ignore the 0 bucket (my log has a disproportionate number of 0s in it), and format with bytesize.

$ rare histo -m '" (\d{3}) (\d+)' -e '{bytesize {bucket {2} {multi 1024 1024}}}' -n 20 -b -i '{eq {bucket {2} {multi 1024 1024}} 0}' access.log 
1 MB                796        ||||||||||||||||||||||||||||||||||||||||||||||||||
2 MB                540        |||||||||||||||||||||||||||||||||
4 MB                300        ||||||||||||||||||
6 MB                251        |||||||||||||||
8 MB                231        ||||||||||||||
5 MB                229        ||||||||||||||
3 MB                192        ||||||||||||
9 MB                192        ||||||||||||
11 MB               179        |||||||||||
14 MB               172        ||||||||||
7 MB                163        ||||||||||
16 MB               158        |||||||||
10 MB               150        |||||||||
20 MB               142        ||||||||
21 MB               123        |||||||
32 MB               94         |||||
13 MB               88         |||||
35 MB               86         |||||
59 MB               79         ||||
17 MB               76         ||||

Matched: 7,395 / 161,622 (Ignored: 154,227)
Groups:  210

Performance

One of the goals from the start was for this tool to be comparable in performance to the likes of silver-searcher ag and grep. Understanding that this tool does a somewhat different job than those, some leeway was given. Go has an awesome parallelization toolkit (channels, goroutines, etc.), so that was taken advantage of, with further optimizations on top.

By default, multiple workers (go-routines) are responsible for reading (and optionally decompressing) the logs, and then more workers are responsible for running the regex and expression-builder on top of the batch of results.
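You can approximate the fan-out half of this with standard tools, e.g. one worker per file via xargs -P (the tiny gzip'd logs below are generated on the fly purely for illustration), though you lose rare's shared, realtime aggregation:

```shell
# Build two small gzip'd logs, run one grep worker per file, then merge the counts
dir=$(mktemp -d)
printf '%s\n' 'a" 200 1' 'b" 404 2' | gzip > "$dir/one.log.gz"
printf '%s\n' 'c" 200 3' 'd" 200 4' | gzip > "$dir/two.log.gz"
ls "$dir"/*.log.gz | xargs -P 2 -I{} sh -c 'zcat "{}" | grep -c " 200"' |
  awk '{ total += $1 } END { print total }'   # prints 3
rm -r "$dir"
```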

Benchmarking

It's worth noting that in many of these results rare is just as fast, but partly because it consumes CPU more efficiently (Go is great at parallelization). So take that into account, for better or worse.

All tests were done on ~200MB of gzip'd nginx logs spread across 10 files.

Each program was run 3 times and the last run was recorded (to make sure things were cached equally).

Profiling

As part of optimization, I ran rare with --profile to hook in Go's pprof profiler. I ended up optimizing some IO by leveraging channel batching and implementing a read-ahead buffer, along with using byte arrays rather than strings almost everywhere for the data (less memory copying). Overall, I was able to get rare faster than ag, and darn close to grep.

CPU Flame Chart

About 40% of the time is spent on things out of our control (regex evaluation and file IO/decompression). Work has been done to optimize heap allocation, though that did shift some responsibility into the hot functions (which is why things like processLineSync take some time).

Comparing Results

I've tried to make each set of commands equivalent to the others. Each will output the same results and have the same meaning.

zcat & grep

$ time zcat testdata/* | grep -Poa '" (\d{3})' | wc -l
1131354

real    0m0.990s
user    0m1.480s
sys 0m0.080s

$ time zcat testdata/* | grep -Poa '" 200' > /dev/null

real    0m1.136s
user    0m1.644s
sys 0m0.044s

I believe the largest holdup here is that zcat passes all the data to grep via a synchronous pipe, whereas rare can process everything in async batches. Using pigz instead didn't yield different results, though on single-file tests they did perform comparably.

Silver Searcher (ag)

$ ag --version
ag version 0.31.0

Features:
  +jit +lzma +zlib

$ time ag -z '" (\d{3})' testdata/* | wc -l
1131354

real    0m3.944s
user    0m3.904s
sys 0m0.152s

rare

$ rare -v
rare version 0.1.16, 11ca2bfc4ad35683c59929a74ad023cc762a29ae

$ time rare filter -m '" (\d{3})' -e "{1}" -z testdata/* | wc -l
Matched: 1,131,354 / 3,638,594
1131354

real    0m0.927s
user    0m1.764s
sys 0m1.144s

# A slightly different command to show the power of parallel aggregation
$ time rare histo -m '" (\d{3})' -e "{1}" -z testdata/*
200                 1,124,767 
404                 6,020     
304                 371       
403                 98        
301                 84        

Matched: 1,131,354 / 3,638,594
Groups:  6

real    0m0.284s
user    0m1.648s
sys 0m0.048s

Histo is significantly faster, mostly because less synchronous IO (wc -l) is occurring. The main takeaway is that filter (which is most comparable to grep) takes similar real time, and slightly more CPU time.

And don't forget: it was giving you results in your terminal as it was building them! Unlike the other commands, you didn't have to wait until completion.

Summary

I hope you find this tool, and what it's trying to accomplish, beneficial to your workflow. I'm always happy to field questions (either below, or on its GitHub page).

Christopher LaPointe

