X-CSRFToken Header Resets Session Object

Scenario:
I am using the jQuery code from the Django documentation to send POST requests via AJAX. When a link on a page is clicked, it opens another page in a new tab and at the same time sends an AJAX request.

Problem:
The AJAX request, for some reason, resets the session object. The effect is that any new data added to the session (in the non-AJAX request) is lost.

Solution (or rather “workaround”):
After some investigation, the problem appears to lie somewhere in the CSRF middleware. I’m still unable to find exactly where within the middleware, but to patch the issue, I modified the JavaScript code to send a null X-CSRFToken header for CSRF-safe (non-POST) requests. The new code now looks like this:

$.ajaxSetup({
    beforeSend: function(xhr, settings) {
        if (!csrfSafeMethod(settings.type) && !this.crossDomain) {
            // unsafe method (e.g. POST): attach the CSRF token from the cookie
            xhr.setRequestHeader("X-CSRFToken", getCookie('csrftoken'));
        } else {
            // safe method (GET, HEAD, etc.): send a null token instead
            xhr.setRequestHeader("X-CSRFToken", null);
        }
    }
});

Running Scrapy on Django-Chronograph

Most of the scheduled jobs I run are set up in django-chronograph. It makes jobs easier to manage and allows my clients to control the scheduling. It also lets them check the logs in a user-friendly environment.

So after creating my Scrapy scraper that runs as a Django management command (as described here), I tried to deploy it using chronograph.

The first problem is that the run_from_argv method is bypassed by chronograph, so I modified my management command like this:

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    def run_from_argv(self, argv):
        # keep the original argv so handle() can pass it on to Scrapy
        self._argv = argv
        self.execute()

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        try:
            execute(self._argv[1:-1])
        except AttributeError:  # when running from django-chronograph
            execute(list(args))

Then, in the arguments, I added the Django management command. In my case, my management command is “scrape”.

scrapy

The second problem is that Scrapy’s execute command calls sys.exit, which stops execution. This means django-chronograph also stops and cannot perform necessary tasks like saving the logs to the database and changing the status of the job from “running” to “not running”.
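To make the problem concrete, here is a minimal sketch (the spider name is just a placeholder, not from my project) of what happens when execute is called directly:

from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'myspider'])  # 'myspider' is a placeholder spider name

# this line is never reached: execute() ends by calling sys.exit(),
# which raises SystemExit and terminates the process
print('post-crawl cleanup would go here')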

The first workaround I tried was to create a separate thread to run Scrapy’s “execute” command. However, Scrapy threw this error: signal only works in main thread.
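For reference, the threaded attempt looked roughly like this (a sketch with a placeholder spider name); it fails because Scrapy tries to install signal handlers, and Python only allows signal handling from the main thread:

import threading

from scrapy.cmdline import execute

# placeholder argv -- the real spider name depends on your project
crawler_thread = threading.Thread(target=execute,
                                  args=(['scrapy', 'crawl', 'myspider'],))
crawler_thread.start()
crawler_thread.join()
# the crawl dies inside the thread with "signal only works in main thread"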

After some reading, I found that the Python docs say sys.exit simply raises a SystemExit exception. This allows us to do some cleanup using a finally block, like this:

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    def run_from_argv(self, argv):
        self._argv = argv
        self.execute()

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        try:
            execute(self._argv[1:-1])
        except AttributeError:  # when running from django-chronograph
            execute(list(args))
        finally:
            # returning from the finally block swallows the SystemExit raised
            # by sys.exit, letting django-chronograph do its cleanup
            return

And if you’re running django-chronograph’s cron as root, you have to create a symlink in the root directory pointing to scrapy.cfg in your project folder. This enables Scrapy to locate your crawler settings.

Optimization Pitfall: Memcached Memory Limit

Google describes Python as a language that is easy to develop with. It can greatly improve developer productivity and code readability. But this strength of Python comes at the price of slow performance; interpreted high-level languages are inherently slow.

Fortunately, Django comes with great caching support, which can bypass all the calculations at the Python level. Memcached is the fastest backend of them all, as described in the Django documentation. In order to use this feature effectively, you should understand how Memcached works, or else you will run into the same problem I encountered this week.
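For context, switching to Memcached is just a settings change; a minimal sketch using the old-style CACHE_BACKEND setting (the host and port assume a local memcached daemon on its default port):

# settings.py -- point Django's cache framework at a local memcached instance
CACHE_BACKEND = 'memcached://127.0.0.1:11211/'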

I spent so much time figuring out why Memcached suddenly stopped caching the data. Here is the code:

from django.core.cache import cache
from django.shortcuts import render_to_response


def my_django_view(request, template, page):
    # try the cache first; fall back to the expensive query on a miss
    articles = cache.get('all_popular_articles')
    if articles is None:
        articles = some_complex_query()
        cache.set('all_popular_articles', articles, 60 * 60)  # cache for one hour
    return render_to_response(template,
                              {'articles': paginate_func(articles, page)})

Of course this worked during development and even in production at first. But as the data grew, caching suddenly stopped. When I ran it using other Django cache backends like local memory (CACHE_BACKEND = ‘locmem://’), it worked perfectly.

I realized later that Memcached has a default limit of 1 MB per stored value. So caching all these articles under a single key eventually exceeded the limit as the data grew.
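You can see the failure mode with a quick sketch like this (not from the original code, and assuming a memcached backend with the default item size limit); an oversized value is silently rejected, so the following get simply returns None:

from django.core.cache import cache

big_value = 'x' * (2 * 1024 * 1024)  # about 2 MB, over the default item size limit
cache.set('big_key', big_value, 60)  # silently fails against memcached
print(cache.get('big_key'))          # prints None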

In order to work around this limit, I grouped the articles into chunks of a size Memcached can manage.
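A rough sketch of that chunking approach (the chunk size, key names, and helpers below are my illustration, not the exact code I used):

from django.core.cache import cache

CHUNK_SIZE = 200  # articles per key; tune so each cached value stays under 1 MB

def cache_articles(articles, timeout=60 * 60):
    # split the list into chunks and store each one under its own key
    chunks = [articles[i:i + CHUNK_SIZE]
              for i in range(0, len(articles), CHUNK_SIZE)]
    cache.set('popular_articles_chunk_count', len(chunks), timeout)
    for index, chunk in enumerate(chunks):
        cache.set('popular_articles_chunk_%d' % index, chunk, timeout)

def get_cached_articles():
    # reassemble the chunks; any missing piece counts as a full cache miss
    count = cache.get('popular_articles_chunk_count')
    if count is None:
        return None
    articles = []
    for index in range(count):
        chunk = cache.get('popular_articles_chunk_%d' % index)
        if chunk is None:
            return None
        articles.extend(chunk)
    return articles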

Update (Mar 7, 2012): This technique proved to be worse than not using Memcached at all. :(