Parsing Traefik Access Logs with Python & Pandas
Today we’re exploring how to parse Traefik access logs using Python and Pandas.
Reading the Traefik Access Logs docs, we see that Traefik writes its logs in Common Log Format (CLF) by default.
There are a number of other applications that write logs in a similar way, so existing examples give us a good starting point that can be slightly modified to suit our needs:
- Read Apache HTTP server access log with Pandas
- Read Nginx access log (multiple quotechars)
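For reference, a single line in this format looks roughly like the following (the values here are entirely made up, but follow the default layout):
203.0.113.7 - - [13/Nov/2023:11:45:42 +0000] "GET /api/health HTTP/1.1" 200 154 "-" "curl/7.68.0" 42 "my-router@docker" "http://172.17.0.3:8080" 3ms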
import pandas as pd
from datetime import datetime
import pytz
def parse_str(string):
"""
Returns the string delimited by two `"` characters.
Example:
`>>> parse_str('"my string"')`
`'my string'`
"""
return string.strip('"')
def parse_datetime(x):
"""
Parses datetime with timezone formatted as:
`[day/month/year:hour:minute:second zone]`
Example:
`>>> parse_datetime('[13/Nov/2015:11:45:42 +0000]')`
`datetime.datetime(2015, 11, 13, 11, 45, 42, tzinfo=pytz.FixedOffset(0))`
Due to problems parsing the timezone (`%z`) with `datetime.strptime`, the
timezone will be obtained using the `pytz` library.
"""
dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
dt_tz = int(x[-6:-3])*60+int(x[-3:-1])
return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))
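A quick sanity check of the helper (note the surrounding square brackets, which the slicing above relies on):
print(parse_datetime('[13/Nov/2015:11:45:42 +0000]'))
# 2015-11-13 11:45:42+00:00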
We need an extra converter to cover the format Traefik uses for the duration, e.g. 321ms
def parse_duration(duration):
return int(duration.strip('ms'))
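Note that str.strip('ms') removes leading/trailing 'm' and 's' characters rather than a literal suffix, which is fine here because the field always ends in ms:
parse_duration('321ms')  # -> 321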
Now that we have all the helper functions we need, we can read the data into a Pandas DataFrame. The comment below spells out the field layout of each log line:
# <remote_IP_address> - <client_user_name_if_available> [<timestamp>] "<request_method> <request_path> <request_protocol>" <origin_server_HTTP_status> <origin_server_content_size> "<request_referrer>" "<request_user_agent>" <number_of_requests_received_since_Traefik_started> "<Traefik_router_name>" "<Traefik_server_URL>" <request_duration_in_ms>ms
log_file = '/path/to/my/access.log'
df = pd.read_csv(
log_file,
sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
engine='python',
usecols=[0, 3, 4, 5, 6, 7, 8, 10, 11, 12],
names=['ip', 'timestamp', 'request', 'status', 'size', 'referer', 'user_agent', 'Traefik_router_name', 'Traefik_server_URL', 'request_duration_in_ms'],
na_values='-',
header=None,
dtype={'status': 'category'},
converters={
'request_duration_in_ms': parse_duration,
'timestamp': parse_datetime,
'request': parse_str,
'referer': parse_str,
'user_agent': parse_str,
'Traefik_router_name': parse_str,
'Traefik_server_URL': parse_str,
},
)
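At this point it’s worth a quick look at the result to confirm the converters and dtypes behaved as expected:
print(df.dtypes)
print(df.head())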
From here we’d like to unpack the request section, "<request_method> <request_path> <request_protocol>", which is simply a matter of unpacking a list into Pandas columns:
df[['request_method', 'request_path', 'request_protocol']] = pd.DataFrame(df['request'].str.split().to_list(), index=df.index)
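An equivalent alternative (not what’s used above, just another option) is to let str.split expand straight into columns:
df[['request_method', 'request_path', 'request_protocol']] = df['request'].str.split(' ', n=2, expand=True)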
Now that we have the data in a DataFrame, we’re on familiar ground and can use lots of common tools to do some EDA:
import pandas_profiling
profile = df.profile_report(title='Pandas Profiling Report')
profile.to_file(output_file="access_log_pandas_profiling.html")
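If the profile_report accessor isn’t available in your installed version of pandas-profiling, the explicit constructor form should do the same job:
from pandas_profiling import ProfileReport
profile = ProfileReport(df, title='Pandas Profiling Report')
profile.to_file('access_log_pandas_profiling.html')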
We can simply plot charts to show how quickly we’re responding to requests:
df.groupby(df['timestamp'].dt.floor('h'))['request_duration_in_ms'] \
    .quantile([0.25, 0.5, 0.75, 0.95, 1]) \
    .unstack(level=-1) \
    .plot()
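Along the same lines, a simple requests-per-hour chart is one more example of what’s easy from here:
df.groupby(df['timestamp'].dt.floor('h')) \
    .size() \
    .plot(title='Requests per hour')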