How to Display Realtime Traffic Analytics
Posted on September 2nd, 2009 by Greg Allard in Django, Programming, Server Administration
Users of Presskit’n have been asking for traffic statistics on their press releases, so I decided to get them the most recent data possible. At first I parsed the access log once a minute, but while testing that I decided it wasn’t updating fast enough. I’ve gotten used to everything being instant on the internet, and I didn’t want to wait a minute to see how many more views there were. In this post I show how I got it to update on page load using Apache, Python, Django, and memcached.
Apache Access Logs
Apache is installed with rotatelogs, a program that can rotate the logs after they get too large. However, I wanted a few more features. Cronolog will update a symlink every time it creates a new log file, so you can always find the most recent stats at the same path.
CustomLog "|/usr/bin/cronolog --symlink=/path/to/access /path/to/%Y/%m/%d/access.log" combined
ErrorLog "|/usr/bin/cronolog --symlink=/path/to/error /path/to/%Y/%m/%d/error.log"
The CustomLog and ErrorLog directives in Apache will let you pipe output to a command. So I put the full path to cronolog and then specified the parameters to cronolog. --symlink will point the named symlink to the most recent log created with cronolog. After the options, the path to the log location is specified, and date formats can be used in it. I decided to break mine up by day.
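To see which file a given day's requests end up in, the date template can be expanded by hand. This is only a sketch: it assumes the template specifiers behave like strftime (which holds for %Y, %m and %d), and log_path_for is a made-up helper name, not part of cronolog.

```python
from datetime import date

# Same placeholder template as in the CustomLog line above.
TEMPLATE = '/path/to/%Y/%m/%d/access.log'


def log_path_for(day, template=TEMPLATE):
    """Expand the date specifiers in the template for the given day."""
    return day.strftime(template)


# The post date used as an example day.
print(log_path_for(date(2009, 9, 2)))
```

The same expansion is handy later when a script needs to open yesterday's log directly instead of following the symlink.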
Piping Apache Log info to a Python Script
Apache can have multiple log locations and log each request multiple times. So I wrote my own logging script in Python that inserts into memcached. Here is the extra line I added to the Apache configuration:
CustomLog "|/path/to/python /path/to/log_cache.py" combined
And this is log_cache.py:
#!/usr/bin/env python
import os
import sys
import re
from datetime import date

sys.path = ['/path/to/project',] + sys.path
os.environ['DJANGO_SETTINGS_MODULE'] = 'myproject.settings'

from django.core.cache import cache

# capture the requested path from the quoted request, e.g. "GET /some/page/ HTTP/1.1"
r = re.compile(r'"GET (?P<url>\S+) ')

def be_parent(child_pid):
    # os.waitpid returns (pid, status); only the status says how the child ended
    _, exit_status = os.waitpid(child_pid, 0)
    if exit_status:
        # if there's an error, restart the child.
        pid = os.fork()
        if not pid:
            be_child()
        else:
            be_parent(pid)
    return

def be_child():
    while True:
        line = sys.stdin.readline() # wait for apache log data
        if not line:
            return # without error code so everything stops
        log_data(line)

def log_data(data):
    page = r.search(data)
    if page:
        key = '%s%s' % (date.today(), page.group('url'))
        try:
            cache.incr(key)
        except ValueError:
            # key doesn't exist yet: add it to the cache for 24 hours
            cache.set(key, 1, 24*60*60)
    return

pid = os.fork()
if not pid:
    be_child()
else:
    be_parent(pid)
A blog post about using Python to store access records in Postgres helped me out a lot. The parent/child processing came from that and fixed a lot of problems I was having before.
The page views are being added to memcached (with cache.incr(), which is new in Django 1.1) for quick retrieval, and the logs will still be created by cronolog, so no data will be lost when the cache expires. Those logs are used in the next part.
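The increment-or-create step can be tried without Apache or memcached. The sketch below is not the production script: FakeCache is a hypothetical dict-backed stand-in that only mimics the one behavior the pattern relies on (incr() raising ValueError for a missing key), and count_hit and the sample log line are made up for illustration.

```python
import re
from datetime import date

# The pattern from log_cache.py: capture the path of a GET request.
REQUEST_RE = re.compile(r'"GET (?P<url>\S+) ')


class FakeCache(object):
    """Hypothetical stand-in for the Django cache, backed by a dict."""

    def __init__(self):
        self.data = {}

    def incr(self, key):
        # Like the real cache, refuse to increment a key that is not there.
        if key not in self.data:
            raise ValueError('Key %r not found' % key)
        self.data[key] += 1
        return self.data[key]

    def set(self, key, value, timeout=None):
        # The timeout is accepted but ignored in this stand-in.
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


def count_hit(cache, line, today=None):
    """Count one log line; return the cache key used, or None if not a GET."""
    match = REQUEST_RE.search(line)
    if not match:
        return None
    key = '%s%s' % (today or date.today(), match.group('url'))
    try:
        cache.incr(key)
    except ValueError:
        # First hit of the day for this URL: create the counter.
        cache.set(key, 1, 24 * 60 * 60)
    return key


SAMPLE = ('127.0.0.1 - - [02/Sep/2009:10:00:00 -0400] '
          '"GET /releases/1/ HTTP/1.1" 200 2326 "-" "Mozilla/5.0"')

cache = FakeCache()
count_hit(cache, SAMPLE, today=date(2009, 9, 2))
key = count_hit(cache, SAMPLE, today=date(2009, 9, 2))
print(key, cache.get(key))
```

Because the date is part of the key, each day starts a fresh counter for every URL, and yesterday's keys simply age out of the cache.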
Parsing the Logs
The hit counts will expire from the cache after 24 hours, so I parse the logs once a day and put that information into my database. For this I wrote a Django management command (I didn’t use a management command for the logging script because I wasn’t sure how it would handle the parent and child processes). This command is called by ./manage.py parse_log
from django.conf import settings
from django.contrib.contenttypes.models import ContentType
from django.core.cache import cache
from django.core.management.base import BaseCommand
from django.core.urlresolvers import resolve, Resolver404
import datetime
# found on page linked above
from apachelogs import ApacheLogFile
from app.models import Model_being_hit
from metrics.models import Hits

def save_log(alf, date):
    hits = {}
    # loop to sum hits
    for log_line in alf:
        request = log_line.request_line
        request_parts = request.split(' ')
        if len(request_parts) < 2:
            # malformed request line with no path
            continue
        hits[request_parts[1]] = hits.get(request_parts[1], 0) + 1
    for page, views in hits.iteritems():
        try:
            view, args, kwargs = resolve(page)
            # I check kwargs for something only passed to one app
            if 'param' in kwargs:
                a = Model_being_hit.objects.get(id=kwargs['id'])
                try:
                    content_type = ContentType.objects.get_for_model(a)
                    hit = Hits.objects.get(
                        date=date,
                        content_type=content_type,
                        object_id=a.id,
                    )
                    hit.views = views
                except Hits.DoesNotExist:
                    hit = Hits(date=date, views=views, content_object=a)
                hit.save()
        except (Resolver404, Model_being_hit.DoesNotExist):
            # something not in urls file like static files,
            # or a url for an object that no longer exists
            pass

class Command(BaseCommand):
    def handle(self, *args, **options):
        # parse yesterday's log
        day = datetime.date.today()
        day = day - datetime.timedelta(days=1)
        alf = ApacheLogFile('%s/%s/%s/%s/access.log' % (
            settings.ACCESS_LOG_LOCATION,
            day.year,
            day.strftime('%m'), #month
            day.strftime('%d'), #day
        ))
        save_log(alf, day)
I use django.core.urlresolvers.resolve so that I can reuse my urls file and don’t have to repeat myself.
Hits is a Django model I created with a few fields for storing date and views. It uses the content types framework so that it can be tied to any of my Django models.
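The tallying loop at the top of save_log can be tried on its own, before any URL resolving happens. This is a sketch under assumptions: count_paths is a hypothetical helper, and it swaps the plain dict for collections.Counter, which does the same job with less bookkeeping.

```python
from collections import Counter


def count_paths(request_lines):
    """Tally hits per path from request lines such as 'GET /a/ HTTP/1.1'."""
    hits = Counter()
    for request in request_lines:
        parts = request.split(' ')
        if len(parts) < 2:
            # Malformed request line (for example a bare '-'): nothing to count.
            continue
        hits[parts[1]] += 1
    return hits


requests = [
    'GET /releases/1/ HTTP/1.1',
    'GET /releases/2/ HTTP/1.1',
    'GET /releases/1/ HTTP/1.1',
    '-',
]
for path, views in sorted(count_paths(requests).items()):
    print(path, views)
```

Summing first means each page needs only one database write per day, no matter how many times it was requested.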
from django.contrib.contenttypes import generic
from django.contrib.contenttypes.models import ContentType
from django.db import models

class Hits(models.Model):
    date = models.DateField()
    views = models.IntegerField()
    # to add to any model
    content_type = models.ForeignKey(ContentType)
    object_id = models.PositiveIntegerField()
    content_object = generic.GenericForeignKey('content_type', 'object_id')

    def __unicode__(self):
        return "%s hits on %s" % (self.views, self.date)
This was added to my cron with crontab -e
# every morning on the first minute
1 0 * * * /path/to/python /path/to/manage.py parse_log > /dev/null
Displaying the Hits
On my models I added a couple of methods that look up the info in the cache or database.
@property
def hits_today(self):
    from datetime import date
    from django.core.cache import cache
    key = '%s%s' % (date.today(), self.get_absolute_url())
    return cache.get(key)

@property
def hits(self):
    from metrics.models import Hits
    from django.contrib.contenttypes.models import ContentType
    content_type = ContentType.objects.get_for_model(self)
    hits = Hits.objects.filter(
        content_type=content_type,
        object_id=self.id,
    ).order_by('-date')
    return hits
The hits_today method requires that you define get_absolute_url, which is useful in other places as well. @property is a decorator that makes it possible to access the data with object.hits and leave off the parentheses.
The hits method uses the content type framework again to look up the hits in the database.
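One detail worth seeing in isolation: cache.get returns None when a page has had no hits yet today, so a caller may want to fall back to 0. The sketch below is illustrative only. DictCache and Release are hypothetical stand-ins for the Django cache and a model, and the "or 0" fallback is an addition, not something the methods above do.

```python
from datetime import date


class DictCache(object):
    """Hypothetical stand-in for the Django cache: get() gives None if missing."""

    def __init__(self, data=None):
        self.data = dict(data or {})

    def get(self, key, default=None):
        return self.data.get(key, default)


class Release(object):
    """Hypothetical model-like object; only get_absolute_url matters here."""

    def __init__(self, pk, cache):
        self.pk = pk
        self.cache = cache

    def get_absolute_url(self):
        return '/releases/%d/' % self.pk

    @property
    def hits_today(self):
        # Must build exactly the key the logging script wrote: date + URL path.
        key = '%s%s' % (date.today(), self.get_absolute_url())
        return self.cache.get(key) or 0


seen_key = '%s%s' % (date.today(), '/releases/1/')
cache = DictCache({seen_key: 5})
print(Release(1, cache).hits_today)
print(Release(2, cache).hits_today)
```

Since the logged key is the raw requested path, a request such as /releases/1/?ref=x is counted under a different key than /releases/1/ and will not show up in this lookup.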
Just the Basics
There is a lot more that can be done with this, and it barely touches the raw data available in the logs. A few improvements I’ve already started on are excluding known bots from the hit counts, checking the referrer to see where traffic is coming from, and saving the keywords used in search engines.
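As a starting point for the bot and referrer ideas, here is a rough sketch that pulls the referrer and user agent out of a line in Apache's combined format. Everything named here is an assumption for illustration: COMBINED_RE is a simplified pattern that does not handle escaped quotes inside fields, and BOT_MARKERS is a deliberately tiny, made-up list rather than a real bot database.

```python
import re

# combined format: host ident user [time] "request" status bytes "referer" "agent"
COMBINED_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical, short list of substrings that identify crawlers.
BOT_MARKERS = ('googlebot', 'bingbot', 'slurp')


def parse_line(line):
    """Return the named fields of a combined-format line, or None."""
    match = COMBINED_RE.match(line)
    return match.groupdict() if match else None


def is_bot(agent):
    """True if the user agent contains any of the known bot markers."""
    agent = agent.lower()
    return any(marker in agent for marker in BOT_MARKERS)


HUMAN = ('127.0.0.1 - - [02/Sep/2009:10:00:00 -0400] '
         '"GET /releases/1/ HTTP/1.1" 200 2326 '
         '"http://example.com/" "Mozilla/5.0 (X11; Linux)"')
BOT = ('66.249.0.1 - - [02/Sep/2009:10:00:05 -0400] '
       '"GET /releases/1/ HTTP/1.1" 200 2326 '
       '"-" "Mozilla/5.0 (compatible; Googlebot/2.1)"')

for line in (HUMAN, BOT):
    fields = parse_line(line)
    print(fields['referer'], is_bot(fields['agent']))
```

A check like is_bot could run before the counter is incremented, so crawler traffic never reaches the cache in the first place.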