Gerapy is a full-stack solution for Scrapy deployment. If you look at the commit history, it’s received some dependency bumps but hasn’t really been updated since 2022. Getting Gerapy started can be a difficult process, often filled with trial and error.
This guide exists to make Gerapy easier. By the end of this guide, you’ll be able to answer the following questions.
- Why doesn’t Gerapy work with my standard Python installation?
- How can I configure Python and pip for Gerapy?
- How do I create an admin account?
- How do I write my first scraper?
- How do I troubleshoot my scraper?
- How do I test and deploy my scraper?
Introduction to Gerapy
Let’s get a better understanding of what Gerapy actually is and what makes it unique.
What is Gerapy?
Gerapy provides us with a Django management dashboard and the Scrapyd API. These services give you a simple yet powerful interface to manage your stack. At this point, it’s a legacy program but it still improves workflow and speeds up deployment. Gerapy makes web scraping more accessible to DevOps and management-oriented teams.
- GUI dashboard for creating and monitoring scrapers.
- Deploy a scraper with the click of a button.
- Get real-time visibility into logs and errors as they occur.
What Makes Gerapy Unique?
Gerapy gives you a one-stop shop for scraper management. Getting up and running with Gerapy is a tedious process due to its legacy code and dependencies. However, once you’ve got it working, you unlock a full toolset tailored for handling scrapers at scale.
- Build your scrapers from inside the browser.
- Deploy them to Scrapyd without touching the command line.
- Centralized management for all of your crawlers and scrapers.
- Frontend built on Django for spider management.
- Backend powered by Scrapyd for easy building and deployment.
- Built-in scheduler for task automation.
How To Scrape the Web With Gerapy
Gerapy’s setup process is laborious. You need to address technical debt and perform software maintenance. After much trial and error, we learned that Gerapy isn’t even compatible with more modern versions of Python. We started with a modern installation of Python 3.13. It was too modern for Gerapy’s dependencies. We tried 3.12 — still no luck — just more dependency issues.
As it turned out, we needed Python 3.10. On top of that, we needed to alter some of Gerapy’s actual code to fix a deprecated class — and then we needed to manually downgrade almost every dependency in Gerapy. Python has undergone significant changes in the last three years and Gerapy’s development hasn’t kept pace. We need to recreate Gerapy’s ideal conditions from three years ago.
Project Setup
Python 3.10
To start, we need to install Python 3.10. This version isn’t extinct, but it’s no longer widely available. On native Ubuntu and Windows WSL with Ubuntu, it can be installed with apt.
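On Ubuntu (and WSL with Ubuntu), the simplest route is the `deadsnakes` PPA covered later in this guide. The commands below assume that repository is available to your system:

```bash
# Add the deadsnakes PPA, then install Python 3.10 and its venv module
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.10 python3.10-venv
```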
You can then check to make sure it’s installed with the `--version` flag.
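Assuming the interpreter was installed as `python3.10`:

```bash
python3.10 --version
```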
If all goes well, you should see output similar to the output below.
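The patch number will vary with your distribution, but it should report a 3.10 release:

```
Python 3.10.12
```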
Creating a Project Folder
First, make a new folder.
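The folder name below is just an example; call yours whatever you like:

```bash
mkdir gerapy-tutorial
```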
Next, we need to `cd` into our new project folder and set up a virtual environment.
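Assuming the example folder name above and Python 3.10’s built-in `venv` module:

```bash
cd gerapy-tutorial
python3.10 -m venv venv
```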
Activate the environment.
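On Linux (including WSL), activation looks like this:

```bash
source venv/bin/activate
```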
Once your environment is active, you can check the active version of Python.
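With the environment active, `python` should now point at the 3.10 interpreter:

```bash
python --version
# Expected output: Python 3.10.x
```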
As you can see, `python` now defaults to our 3.10 installation from within the virtual environment.
Installing Dependencies
The command below installs Gerapy and its required dependency versions. As you can see, we need to manually pin many legacy packages using pip’s `==` version syntax.
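The exact pins were worked out by trial and error, so treat the versions below as placeholders rather than the definitive working set:

```bash
# Install Gerapy itself
pip install gerapy

# Pin any legacy dependency that breaks, using pip's == syntax.
# The packages and versions here are illustrative placeholders only.
pip install "django==2.2.28" "scrapyd==1.4.3"
```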
We’ll now create an actual Gerapy project with the `init` command.
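This creates a `gerapy` workspace folder inside the current directory:

```bash
gerapy init
```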
Next, we’ll `cd` into our `gerapy` folder and run `migrate` to create our database.
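Both commands are part of Gerapy’s CLI:

```bash
cd gerapy
gerapy migrate
```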
Now, it’s time to create an admin account. This command gives you administrator privileges by default.
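Gerapy ships an `initadmin` command that creates a default superuser:

```bash
gerapy initadmin
```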
Finally, we start the Gerapy server.
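By default, this serves the dashboard on 127.0.0.1:8000:

```bash
gerapy runserver
```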
You should see an output like this.
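The exact lines vary by version, but the tail end should look roughly like the standard Django development server banner:

```
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.
```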
Using the Dashboard
If you visit http://127.0.0.1:8000/, you’ll be prompted to log in. Your default account name is `admin`, and so is your password. After logging in, you’ll be taken to Gerapy’s dashboard.
Click on the “Projects” tab and create a new project. We’ll call this one `quotes`.
Getting the Target Site
Now, we’ll create a new spider. From within your new project, click the “add spider” button. In the “Start Urls” section, add https://quotes.toscrape.com. Under “Domains”, enter quotes.toscrape.com.
Extraction Logic
Next, we’ll add our extraction logic. The `parse()` function below uses CSS selectors to extract quotes from the page. You can learn more about selectors here.
Scroll down to the “Inner Code” section and add your parsing function.
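Here is a minimal version of the parsing logic, assuming the standard markup of quotes.toscrape.com; adjust the selectors if you want different fields:

```python
def parse(self, response):
    # Each quote sits inside a div with the "quote" class
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('small.author::text').get(),
            'tags': quote.css('div.tags a.tag::text').getall(),
        }
    # Follow the pagination link if one exists
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)
```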
Now, click the “Save” button located in the bottom right-hand corner of the screen. If you run the spider now, you’ll hit a critical error: Gerapy is trying to import `BaseItem` from Scrapy, but `BaseItem` was removed from Scrapy several years ago.
Fixing the BaseItem Error
To solve this error, we actually need to edit Gerapy’s internal code inside our virtual environment. You can do this from the command line, but it’s much easier from a GUI text editor with search features. `cd` into the source files for your virtual environment.
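The path depends on where you created the virtual environment and what you named it; with the layout used in this guide, it looks something like this:

```bash
cd ~/gerapy-tutorial/venv/lib/python3.10/site-packages
```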
To open the folder in VSCode, you can use the command below.
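Assuming VS Code’s `code` command is on your PATH:

```bash
code .
```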
Open up `parser.py`, and you’ll find our culprit.
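The offending line is an import of the long-removed class; it looks something like this:

```python
from scrapy.item import BaseItem
```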
We need to replace this line with the following.
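Scrapy’s `Item` class is still available and works as a drop-in replacement here:

```python
from scrapy.item import Item
```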
Now that we’ve removed the `BaseItem` import, we need to replace all remaining instances of `BaseItem` with `Item`. Our only instance is in the `run_callback()` function. Once you’ve saved your changes, close the editor.
If you run your spider, you’ll now receive a new error.
Fixing REQUEST_FINGERPRINTER_IMPLEMENTATION Deprecation
It’s not apparent, but Gerapy actually injects our settings directly into our spider. `cd` out of our current folder and then into the `projects` folder.
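With the folder names used in this guide, the project created from the dashboard lives under the workspace’s `projects` directory:

```bash
cd ~/gerapy-tutorial/gerapy/projects/quotes
```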
Once again, open up your text editor.
Now open up your spider. It should be titled `quotes.py`, and it’s located inside the `spiders` folder. You should see your `parse()` function inside the spider class. At the bottom of the file, you should see a dictionary called `custom_settings`. Our settings have literally been injected into the spider by Gerapy.
We need to add one new setting: `REQUEST_FINGERPRINTER_IMPLEMENTATION`. You need to use `2.7`; `2.6` will continue to throw the error. We discovered this after numerous rounds of trial and error.
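Inside `custom_settings`, add the fingerprinter setting alongside whatever Gerapy has already injected:

```python
custom_settings = {
    # ...settings injected by Gerapy stay as they are...
    'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
}
```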
Now, when you run the spider using Gerapy’s play button, all errors are resolved. As you can see below, instead of an error message, we just see a “Follow Request”.
Putting Everything Together
Building the Scraper
If you go back to your “Projects” tab in Gerapy, you’ll see an “X” in the “Built” column for the project. This means that our scraper hasn’t been built into an executable file for deployment.
Click the “deploy” button. Now, click “build”.
Using The Scheduler
To schedule your scraper to run at a specific time or interval, click “Tasks” and then create a new task. Then, select your desired settings for the schedule.
Once finished, click the “create” button.
Limitations When Scraping With Gerapy
Dependencies
Its legacy code introduces many limitations that we’ve addressed head-on in this article. Just to get Gerapy running, we needed to go in and edit its internal source code. If you’re not comfortable touching a system’s internals, Gerapy is not for you. Remember the `BaseItem` error?
As Gerapy’s dependencies continue to evolve, Gerapy remains frozen in time. To continue using it, you’ll need to maintain your installation personally. This adds technical debt in the form of maintenance and a very real process of trial and error.
Recall the dependency installation command from earlier. Each of those version numbers was discovered through a meticulous process of trial and error. When dependencies break, you need to keep trying different version numbers until you find one that works. In this tutorial alone, we had to use trial and error to find working versions of 10 dependencies. As time goes on, this will only get worse.
Operating System Limitations
When we attempted this tutorial initially, we tried using native Windows. This was how we discovered the first limitation: Python versions. At the time of writing, the stable Python releases with readily available installers were limited to 3.9, 3.11, and 3.13. Managing multiple versions of Python is difficult regardless of OS. However, Ubuntu gives us the `deadsnakes` PPA repository.
Without `deadsnakes`, it is possible to find a compatible version of Python, but even then, you need to handle PATH issues and differentiate between `python` (your default installation) and `python3.10`. It’s likely possible to handle this natively on Windows and macOS, but you will need to find a different workaround. With Ubuntu and other apt-based Linux distros, you at least get a reproducible environment with quick access to older versions of Python installed directly into your PATH.
Proxy Integration With Gerapy
As with vanilla Scrapy itself, proxy integration is easily done. In the true spirit of Gerapy’s settings injection, we can inject a proxy directly into the spider. In the example below, we add the `HTTPPROXY_ENABLED` and `HTTPPROXY_PROXY` settings to connect using Web Unlocker.
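Here is a sketch of the two settings inside `custom_settings`. The endpoint shown is a typical Web Unlocker proxy URL, and the bracketed values are placeholders:

```python
custom_settings = {
    # ...settings injected by Gerapy stay as they are...
    'HTTPPROXY_ENABLED': True,
    # Swap in your own Web Unlocker username, zone, and password
    'HTTPPROXY_PROXY': 'http://brd-customer-<username>-zone-<zone_name>:<password>@brd.superproxy.io:33335',
}
```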
Here’s the full spider after proxy integration. Remember to swap the username, zone and password with your own.
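Below is a sketch of how the finished spider might look using the names from this guide; the settings Gerapy injects on your machine will differ:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        # Extract each quote block with CSS selectors
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Follow pagination if a "next" link exists
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    # Gerapy appends its injected settings at the bottom of the file;
    # the proxy values are placeholders to swap with your own
    custom_settings = {
        'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
        'HTTPPROXY_ENABLED': True,
        'HTTPPROXY_PROXY': 'http://brd-customer-<username>-zone-<zone_name>:<password>@brd.superproxy.io:33335',
    }
```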
Viable Alternatives to Gerapy
- Scrapyd: This is the actual backbone behind Gerapy and just about any other Scrapy stack. With Scrapyd, you can manage everything through plain old HTTP requests and build a dashboard if you so choose.
- Scraping Functions: Our scraping functions allow you to deploy your scrapers directly to the cloud and edit them from an online IDE — with a dashboard like Gerapy but more flexible and modern.
Conclusion
Gerapy is a legacy product in our rapidly changing world. It requires real maintenance and you’ll need to get your hands dirty. Tools like Gerapy allow you to centralize your scraping environment and monitor everything from a single dashboard. In DevOps circles, Gerapy provides real utility and value.
If Scrapy isn’t your thing, we offer many viable alternatives to meet your need for data collection. The products below are just a few.
- Custom Scraper: Create scrapers with no code required and deploy them to our cloud infrastructure.
- Datasets: Access historical datasets updated daily from all over the web. A library of internet history right at your fingertips.
- Residential Proxies: Whether you prefer to write code yourself or scrape with AI, our proxies give you access to the internet with geotargeting on a real residential internet connection.
Sign up for a free trial today and take your data collection to the next level!