Running Scrapy on Django-Chronograph

Most of the scheduled jobs I run are set up on django-chronograph. It makes jobs easier to manage, lets my clients control the scheduling, and lets them check the logs in a user-friendly environment.

So after building my scraper with Scrapy, wrapped in a Django management command (as described here), I tried to deploy it with chronograph.

The first problem: the run_from_argv method is bypassed by chronograph. So I modified my management command like this:

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    def run_from_argv(self, argv):
        # Only reached when invoked from the shell; django-chronograph
        # bypasses this method and calls handle() directly.
        self._argv = argv
        self.execute()

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        try:
            # Shell invocation: pass the relevant slice of the captured argv to Scrapy.
            execute(self._argv[1:-1])
        except AttributeError:  # _argv was never set: we're running from django-chronograph
            execute(list(args))

Then I pointed the chronograph job at my management command (“scrape” in my case) and, in the job’s arguments, prepended a placeholder program name for Scrapy, since scrapy.cmdline.execute treats the first element of the list as the script name:

scrapy
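
For illustration, suppose the arguments field reads scrapy crawl myspider, where the spider name is made up. django-chronograph splits the string and hands the pieces to handle() as positional arguments, which is what the except branch above relies on:

# What handle() receives when the job's arguments field holds
# "scrapy crawl myspider" (the spider name is hypothetical):
args = ('scrapy', 'crawl', 'myspider')

# execute(list(args)) then gives Scrapy an argv whose first element,
# 'scrapy', stands in for the program name, so 'crawl' is parsed as
# the Scrapy subcommand.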

The second problem: Scrapy’s execute calls sys.exit, which halts execution. This means django-chronograph also stops, and in effect cannot do its necessary housekeeping, like saving the logs to the database and changing the job’s status from “running” to “not running”.

The first workaround I tried was to run Scrapy’s execute in a separate thread. However, Scrapy threw this error: signal only works in main thread.
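
A minimal sketch of that failed attempt (the spider name is illustrative). Scrapy installs its shutdown signal handlers during startup, and Python only allows that from the main thread:

import threading
from scrapy.cmdline import execute

# Running execute() off the main thread fails: Scrapy calls
# signal.signal() while starting up, which Python only permits
# from the main thread.
t = threading.Thread(target=execute, args=(['scrapy', 'crawl', 'myspider'],))
t.start()
t.join()  # the worker dies with "ValueError: signal only works in main thread"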

After some reading, it turns out the Python docs say sys.exit simply raises a SystemExit exception. That lets us do some cleanup using a finally block, like this:

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    def run_from_argv(self, argv):
        self._argv = argv
        self.execute()

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        try:
            execute(self._argv[1:-1])
        except AttributeError:  # _argv was never set: we're running from django-chronograph
            execute(list(args))
        finally:
            # Returning from the finally block discards the in-flight
            # SystemExit, so control returns to django-chronograph and
            # it can do its cleanup.
            return
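
The trick is a Python subtlety: a return inside a finally clause discards whatever exception is in flight, including the SystemExit that Scrapy raises. A quick demonstration:

import sys

def run():
    try:
        sys.exit(1)  # raises SystemExit, just like Scrapy's execute
    finally:
        return 'cleanup done'  # returning here swallows the SystemExit

print(run())  # prints 'cleanup done'; the process keeps running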

And if you’re running django-chronograph’s cron as root, you have to create a symlink in the root directory pointing to the scrapy.cfg in your project folder. Scrapy searches the working directory and its parents for scrapy.cfg, and the symlink lets it locate your crawler settings.
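
A one-off way to create that link (the paths here are illustrative; point it at wherever your project’s scrapy.cfg actually lives):

import os

# Link /scrapy.cfg to the real config inside the project, so Scrapy,
# started by root's cron from the root directory, can find its settings.
os.symlink('/path/to/your/project/scrapy.cfg', '/scrapy.cfg')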