Parsing Traefik Access Logs with Python & Pandas

Today we’re exploring how to parse Traefik access logs using Python and Pandas.

Reading the Traefik Access Logs docs, we see that Traefik (by default) writes its logs in Common Log Format.

A number of other applications write logs in a similar way, so existing examples give us a good starting point that needs only slight modification to suit our uses:

- Read Apache HTTP server access log with Pandas
- Read Nginx access log (multiple quotechars)

import pandas as pd

from datetime import datetime
import pytz

def parse_str(string):
    """
    Returns the string delimited by two `"` characters.

    Example:
        `>>> parse_str('"my string"')`
        `'my string'`
    """
    return string.strip('"')

def parse_datetime(x):
    """
    Parses datetime with timezone formatted as:
        `[day/month/year:hour:minute:second zone]`

    Example:
        `>>> parse_datetime('[13/Nov/2015:11:45:42 +0000]')`
        `datetime.datetime(2015, 11, 13, 11, 45, 42, tzinfo=<UTC>)`

    Due to problems parsing the timezone (`%z`) with `datetime.strptime`, the
    timezone will be obtained using the `pytz` library.
    """
    # x[1:-7] drops the leading '[' and the trailing ' +0000]'
    dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
    # Offset in minutes; the sign applies to both the hours and the minutes
    sign = -1 if x[-6] == '-' else 1
    dt_tz = sign * (int(x[-5:-3]) * 60 + int(x[-3:-1]))
    return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))
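
On Python 3, `datetime.strptime` accepts `%z` for offsets such as `+0000`, so the timezone can also be parsed without `pytz`. A minimal sketch of that alternative, where `parse_datetime_stdlib` is a hypothetical name and not one of the original helpers:

```python
from datetime import datetime


def parse_datetime_stdlib(x):
    """Parse '[day/month/year:hour:minute:second zone]' using only the stdlib."""
    # The literal brackets are part of the format string, so no slicing is needed.
    return datetime.strptime(x, '[%d/%b/%Y:%H:%M:%S %z]')


parsed = parse_datetime_stdlib('[13/Nov/2015:11:45:42 +0000]')
negative = parse_datetime_stdlib('[13/Nov/2015:11:45:42 -0530]')
print(parsed.isoformat())
print(negative.isoformat())
```

This version returns a timezone-aware `datetime` carrying a fixed offset and could be passed as the `timestamp` converter in the same way. Note that `%b` depends on the locale, so this assumes English month abbreviations.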

We need an extra converter to cover the format Traefik uses for the duration, e.g. 321ms

def parse_duration(duration):
    return int(duration.strip('ms'))

Now that we have all the helper functions we need, we can read the data into a Pandas DataFrame

# <remote_IP_address> - <client_user_name_if_available> [<timestamp>] "<request_method> <request_path> <request_protocol>" <origin_server_HTTP_status> <origin_server_content_size> "<request_referrer>" "<request_user_agent>" <number_of_requests_received_since_Traefik_started> "<Traefik_router_name>" "<Traefik_server_URL>" <request_duration_in_ms>ms
log_file = '/path/to/my/access.log'
df = pd.read_csv(
    log_file,
    sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
    engine='python',
    usecols=[0, 3, 4, 5, 6, 7, 8, 10, 11, 12],
    names=['ip', 'timestamp', 'request', 'status', 'size', 'referer', 'user_agent', 'Traefik_router_name', 'Traefik_server_URL', 'request_duration_in_ms'],
    na_values='-',
    header=None,
    dtype={'status': 'category'},
    converters={
        'request_duration_in_ms': parse_duration,
        'timestamp': parse_datetime,
        'request': parse_str,
        'referer': parse_str,
        'user_agent': parse_str,
        'Traefik_router_name': parse_str,
        'Traefik_server_URL': parse_str,
    },
)
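
To see what the `sep` pattern does, it can help to apply it to a single line with `re.split`. The sketch below uses a hypothetical log line, hand-written to match the field layout in the comment above rather than taken from a real server. The first lookahead only allows a split where an even number of `"` characters follows (so not inside a quoted field), and the negative lookahead blocks a split where a `]` follows before any `[` (so not inside the bracketed timestamp).

```python
import re

# Same separator pattern that is passed to pd.read_csv above.
SEP = r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])'

# Hypothetical log line, hand-written to match the documented field layout.
sample = (
    '10.0.0.1 - - [13/Nov/2015:11:45:42 +0000] '
    '"GET /api/items HTTP/1.1" 200 1234 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64)" 42 '
    '"my-router@docker" "http://172.17.0.2:80" 321ms'
)

fields = re.split(SEP, sample)
for index, field in enumerate(fields):
    print(index, field)
```

The line splits into 13 fields, and the printed indices line up with the `usecols` list: columns 1, 2 and 9 (the two placeholder fields and the request counter) are the ones being skipped.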

From here we’d like to unpack the request section "<request_method> <request_path> <request_protocol>". This is easily done by splitting each request string into a list and unpacking the lists into Pandas columns

df[['request_method', 'request_path', 'request_protocol']] = pd.DataFrame(df['request'].str.split().to_list(), index=df.index) 
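
The list-based split may not line up with exactly three columns when a request is malformed, for example an empty or truncated request line. A possible alternative, sketched here on made-up request strings, is to let `str.split` expand directly into columns and cap the number of pieces:

```python
import pandas as pd

# Made-up request strings; the last one stands in for a malformed entry.
df = pd.DataFrame({'request': ['GET /a HTTP/1.1', 'POST /b?x=1 HTTP/2.0', 'GET']})

columns = ['request_method', 'request_path', 'request_protocol']
parts = (
    df['request']
    .str.split(n=2, expand=True)   # at most three pieces, one column each
    .reindex(columns=range(3))     # guarantee three columns even if every row is short
    .set_axis(columns, axis=1)
)
df = df.join(parts)
print(df)
```

Rows with fewer than three pieces end up with missing values in the trailing columns instead of raising, which makes them easy to find later with `df[df['request_protocol'].isna()]`.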

Now that we have the data in a DataFrame, we’re on familiar ground and can use lots of common tools to do some exploratory data analysis (EDA)

# Note: the pandas_profiling package has since been renamed to ydata_profiling
import pandas_profiling
profile = df.profile_report(title='Pandas Profiling Report')
profile.to_file(output_file="access_log_pandas_profiling.html")

We can easily plot charts to show how quickly we’re responding to requests

(
    df.groupby(df['timestamp'].dt.floor('h'))['request_duration_in_ms']
    .quantile([0.25, 0.5, 0.75, 0.95, 1])
    .unstack(level=-1)
    .plot()
)
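
To make the shape of the plotted table concrete, here is the same groupby, quantile and unstack chain on a tiny synthetic frame, stopping before `.plot()`. The timestamps and durations are invented for illustration and only two quantiles are used to keep the output small.

```python
import pandas as pd

# Synthetic stand-in for the parsed log: two requests in each of two hours.
df = pd.DataFrame({
    'timestamp': pd.to_datetime(
        [
            '2015-11-13 11:05:00',
            '2015-11-13 11:35:00',
            '2015-11-13 12:10:00',
            '2015-11-13 12:50:00',
        ],
        utc=True,
    ),
    'request_duration_in_ms': [100, 300, 20, 40],
})

summary = (
    df.groupby(df['timestamp'].dt.floor('h'))['request_duration_in_ms']
    .quantile([0.5, 1])   # median and maximum per hour
    .unstack(level=-1)    # one column per quantile
)
print(summary)
```

Each row is an hour and each column a quantile, so `.plot()` draws one line per quantile over time. If the `timestamp` column comes back as `object` dtype, which can happen when the log mixes timezone offsets, converting it with `pd.to_datetime(df['timestamp'], utc=True)` first should make the `.dt` accessor available.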