5 Posts

Scenario:
I am using the jQuery code from the Django documentation to send POST requests via ajax. When a link on a page is clicked, it opens another page in a new tab and, at the same time, sends an ajax request.

Problem:
The ajax request, for some reason, is resetting the Session object. The effect is that any new data added to the session (in the non-ajax request) is lost.

Solution (or rather “workaround”):
After some investigation, I found that the problem lies somewhere in the csrf middleware. I'm still unable to pinpoint where in the middleware the problem is, but to patch the issue, I modified the javascript code to send a null X-CSRFToken header for safe (non-POST) requests. The new code now looks like this:

$.ajaxSetup({
    beforeSend: function(xhr, settings) {
        if (!csrfSafeMethod(settings.type) && !this.crossDomain) {
            // unsafe methods (e.g. POST) still need the real CSRF token
            xhr.setRequestHeader("X-CSRFToken", getCookie('csrftoken'));
        } else {
            // safe methods (GET, HEAD, OPTIONS, TRACE) get a null token
            xhr.setRequestHeader("X-CSRFToken", null);
        }
    }
});
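
For completeness, the snippet above assumes the two helper functions from the Django documentation are already defined; they look like this (essentially as given in the docs):

function getCookie(name) {
    var cookieValue = null;
    if (document.cookie && document.cookie !== '') {
        var cookies = document.cookie.split(';');
        for (var i = 0; i < cookies.length; i++) {
            var cookie = jQuery.trim(cookies[i]);
            // does this cookie string begin with the name we want?
            if (cookie.substring(0, name.length + 1) === (name + '=')) {
                cookieValue = decodeURIComponent(cookie.substring(name.length + 1));
                break;
            }
        }
    }
    return cookieValue;
}

function csrfSafeMethod(method) {
    // these HTTP methods do not require CSRF protection
    return (/^(GET|HEAD|OPTIONS|TRACE)$/.test(method));
}
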
scrapy

Most of the scheduled jobs I run are set up in django-chronograph. It makes the jobs easier to manage and allows my clients to control the scheduling. It also lets them check the logs in a user-friendly environment.

So after creating my scraper with Scrapy, run from a django management command (as described here), I tried to deploy it using django-chronograph.

The first problem is that the run_from_argv method is bypassed by django-chronograph. So I modified my management command like this:

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    def run_from_argv(self, argv):
        self._argv = argv  # stash the raw CLI arguments for handle()
        self.execute()

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        try:
            execute(self._argv[1:-1])
        except AttributeError:  # _argv is missing when run from django-chronograph
            execute(list(args))

Then, in the job's arguments in django-chronograph, I added the django management command. In my case, my management command is “scrape”.


The second problem is that Scrapy's execute command calls sys.exit, which halts execution. This means django-chronograph also stops and, in effect, cannot do necessary tasks like saving the logs to the database and changing the status of the job from “running” to “not running”.

The first workaround I tried was to run Scrapy's execute command in a separate thread. However, Scrapy threw this error: “signal only works in main thread”.

After some reading, I found that the Python documentation says sys.exit simply raises a SystemExit exception. This lets us do some cleanup using a finally block, like this:

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    def run_from_argv(self, argv):
        self._argv = argv
        self.execute()

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        try:
            execute(self._argv[1:-1])
        except AttributeError:  # when running from django-chronograph
            execute(list(args))
        finally:
            # returning from finally swallows the in-flight SystemExit,
            # so django-chronograph can do its cleanup
            return
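
To see why the bare return works, here is a minimal standalone sketch of the mechanism (separate from the actual command):

import sys

def run():
    try:
        sys.exit(1)  # raises SystemExit
    finally:
        return  # returning from finally suppresses the in-flight exception

run()
print("still alive")  # reached: the exit never propagated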

And if you're running django-chronograph's cron job as root, you have to create a symlink in the root directory pointing to the scrapy.cfg in your project folder (something like ln -s /path/to/project/scrapy.cfg /scrapy.cfg, with the real project path). This enables Scrapy to locate your crawler settings.

Today I successfully migrated this blog to Google App Engine (GAE), with only a change in the database settings, by taking advantage of the new Google Cloud SQL.

Here are the steps I took:

  1. I followed this guide to set up my database in Google Cloud SQL.
  2. I uploaded the database backup file to Google Cloud Storage so I could import it into Cloud SQL.
  3. I created another app in GAE to serve my static files, then modified my STATIC_URL setting. Before deploying the files, I ran the django-staticfiles command “collectstatic” locally to collect the static files from my Django app.
  4. To deploy the environment's packages, I created a symlink to the virtualenv's site-packages inside the project folder (named "env" here) and added this in the settings:
    if not DEBUG:
        # assumes sys and os are imported at the top of settings.py;
        # make the symlinked site-packages importable in production
        sys.path.insert(0, os.path.join(PROJECT_DIR, "env"))
  5. To minimize cost, I modified the code to use GAE Memcache as described here (see the sketch after this list).
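
A minimal sketch of what step 5 amounts to (my own approximation of the approach, not the exact code from the linked article; the helper name is mine):

    # Use GAE's bundled Memcache service with the same get/compute/set
    # pattern the Django views already follow.
    from google.appengine.api import memcache

    def get_or_compute(key, compute_func, timeout=60 * 60):
        """Return the cached value for key, computing and caching it on a miss."""
        value = memcache.get(key)
        if value is None:
            value = compute_func()
            memcache.set(key, value, time=timeout)  # 'time' is expiry in seconds
        return value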

Problems I encountered:

  1. I noticed that the django comments app isn't working. I will update this blog once I get it working. If you need to comment on this blog, please email me instead until I fix the problem.
  2. I'm still trying to figure out how to use Django's model FileField. I read some articles mentioning Google Cloud Storage, but I wonder how the native django code could work with it.
  3. The application cannot send email alerts when an error is triggered.
  4. The application loads slowly because of the delay in loading the static files and in starting the database instance on Cloud SQL.
  5. I cannot use Django South to manage changes in the database structure.

Update (2012-06-12): I eventually moved this blog to Amazon EC2 on a micro reserved instance. The price is only $0.008/hour (about $6/month), compared to Google Cloud SQL at $0.1/hour (about $73/month). This also gives me flexibility and freedom in how I deploy the static files and use python-virtualenv.

Google describes Python as a language that is easy to develop with, one that can greatly improve developer productivity and code readability. This strength comes at the price of slow performance. Well, interpreted high-level languages are inherently slow.

Fortunately, Django comes with great caching support, which can bypass all the computation at the Python level. Memcached is the fastest of them all, as described in the Django documentation. To use this feature effectively, you should understand how Memcached works, or else you will fall into the same problem I encountered this week.
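
Enabling Memcached is a one-line settings change (in the pre-1.3 CACHE_BACKEND style these settings use); a minimal example, assuming a memcached daemon on the default local port:

    # settings.py: point Django's cache at a local memcached instance
    CACHE_BACKEND = 'memcached://127.0.0.1:11211/'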

I spent a lot of time figuring out why Memcached suddenly stopped caching the data. Here is the code:

    from django.core.cache import cache
    from django.shortcuts import render_to_response

    def my_django_view(request, template, page):
        articles = cache.get('all_popular_articles')
        if articles is None:
            # cache miss: run the expensive query and cache it for an hour
            articles = some_complex_query()
            cache.set('all_popular_articles', articles, 60 * 60)
        return render_to_response(template,
                                  {'articles': paginate_func(articles, page)})

Of course this worked during development and even in production. But as the data grew, caching suddenly stopped. When I ran it using other Django cache backends, such as local memory (CACHE_BACKEND = 'locmem://'), it worked perfectly.

I realized later that Memcached has a default limit of 1MB per stored value. So caching all these articles under a single key eventually exceeded the limit as the data grew.

To overcome this limit, I grouped the articles into chunks of a size Memcached can manage.
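
Here is a minimal sketch of the grouping idea (the key names and chunk size are mine, not the production code):

    from django.core.cache import cache

    CHUNK_SIZE = 50  # assumed group size; tune it so each chunk stays under 1MB

    def cache_articles(articles, timeout=60 * 60):
        # split the list into fixed-size groups, one Memcached key per group
        chunks = [articles[i:i + CHUNK_SIZE]
                  for i in range(0, len(articles), CHUNK_SIZE)]
        cache.set('popular_articles_count', len(chunks), timeout)
        for index, chunk in enumerate(chunks):
            cache.set('popular_articles_%d' % index, chunk, timeout)

    def get_cached_articles():
        count = cache.get('popular_articles_count')
        if count is None:
            return None
        chunks = [cache.get('popular_articles_%d' % i) for i in range(count)]
        if any(chunk is None for chunk in chunks):
            return None  # a chunk expired or was evicted; treat as a miss
        return [article for chunk in chunks for article in chunk]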

Update (Mar 7, 2012): This technique proved to be worse than not using Memcached at all. :(

Our company stores data from Avaya in a MySQL database, while our applications use MSSQL for storage.

We have the two databases connected, with MySQL set up as a Linked Server in MSSQL.
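
For reference, registering a MySQL linked server looks roughly like this (a sketch; the provider and DSN names are assumptions, not our actual configuration):

    -- Register MySQL as a linked server through the OLE DB provider for ODBC.
    -- 'MySQL_DSN' is an assumed ODBC system DSN pointing at the MySQL server.
    EXEC sp_addlinkedserver
        @server = 'MySQL_DB',
        @srvproduct = 'MySQL',
        @provider = 'MSDASQL',
        @datasrc = 'MySQL_DSN';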

From MSSQL, we tried:

    SELECT *
    FROM OPENQUERY([MySQL_DB], 
        'SELECT unhex(column1)
        FROM table1')

The MySQL query works if I run it in MSSQL Query Browser, but it doesn't work inside stored procedures. This is why I created this user-defined function in MSSQL to replicate MySQL's unhex() function.

Here is the code:

    CREATE FUNCTION [dbo].[unhex] (@input_text varchar(255))
    RETURNS varchar(255)
    AS
    BEGIN
        declare @unhex varchar(255)
        declare @position int, @length int
        declare @pair varchar(2), @equivalent varchar(1)

        set @position = -1
        set @length = len(@input_text)
        set @unhex = ''

        -- walk the input two characters at a time
        WHILE (@position + 2 < @length)
        BEGIN
            set @position = @position + 2
            set @pair = SUBSTRING(@input_text, @position, 2)

            -- convert the hex pair to its character via XQuery's xs:hexBinary;
            -- t.pos makes the conversion skip the pair when it is a '0x' marker
            SELECT @equivalent = char(cast('' as xml).value('xs:hexBinary(
                substring(sql:variable("@pair"),
                sql:column("t.pos")) )', 'varbinary(max)'))
            FROM (select case substring(@pair, 1, 2)
                         when '0x' then 3
                         else 0
                         end) AS t(pos)

            set @unhex = @unhex + @equivalent
        END

        RETURN @unhex
    END
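
A quick sanity check (my own example value):

    -- the pairs 48 65 6C 6C 6F spell 'Hello'
    SELECT dbo.unhex('48656C6C6F');  -- returns 'Hello'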

Once the user-defined function was created, I modified my query to apply the function on the MSSQL side, since the string passed to OPENQUERY is executed on the MySQL server and cannot call dbo.unhex:

    SELECT dbo.unhex(column1)
    FROM OPENQUERY([MySQL_DB],
        'SELECT column1
        FROM table1')