Apache Superset from Scratch: Day 1 (Python Setup)
December 23, 2021
I'm on a quest to understand and map out as much of the Apache Superset code base as I can. In my day job, I use Superset daily, but I'm not intimately familiar with the code paths themselves. This series follows the process on an M1 MacBook Air, but should generalize to most *nix systems.
My goal is to make noticeable progress on a daily basis. With the preamble out of the way, let's start!
Contributing.md
The Superset codebase is large; where does one even begin? For new code bases, I generally like alternating between:
- breadth: starting with an overview of the development / contributor's guide
- depth: recursively going through each component & sub-component
For breadth, I'll start with the Setup Local Environment for Development section from CONTRIBUTING.md.
Python 3.8
Python 3.7.x or 3.8.x are recommended for running the Superset backend. I'm on a Mac, and prefer to leave the default python that ships with the operating system at 2.7.x. Instead, I'll use Homebrew to install Python 3.8:
brew install python@3.8
Now, both the python3 and pip3 commands work as expected (independent of the python and pip commands)!
- python3 --version returns Python 3.8.12
- pip3 --version returns pip 21.2.4 from /opt/homebrew/lib/python3.8/site-packages/pip (python 3.8)
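The same version check can be done from inside Python. This is a minimal sketch: the RECOMMENDED set simply mirrors the 3.7.x / 3.8.x recommendation quoted above, and is_recommended is a hypothetical helper name, not something from the Superset repo.

```python
import sys

# Major/minor pairs the post quotes as recommended for the Superset backend.
RECOMMENDED = {(3, 7), (3, 8)}


def is_recommended(version_info=sys.version_info):
    """Return True if the major.minor pair is one of the recommended versions."""
    return tuple(version_info[:2]) in RECOMMENDED


if __name__ == "__main__":
    # sys.version starts with the dotted version string, e.g. "3.8.12 (default, ...)".
    print(sys.version.split()[0], "recommended:", is_recommended())
```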
Virtualenv
Now it's time to create a Python virtual environment. A virtual environment is really a sandbox for your Python libraries that lives within a specific folder / project. This workflow gives you a few benefits:
- The virtual environment lives completely independent of the global Python installation
- It's super quick and easy to delete all of the project specific Python libraries and re-install, as an escape hatch
- Less time wasted (not zero sadly) dealing with version / dependency conflicts
Are there any downsides?
- The main one is increased storage requirements, because every Python project on your computer has its own copies of similar libraries
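To see the sandbox idea in practice, a script can ask the interpreter whether it is running inside a virtual environment. This is a small sketch that relies on the standard sys.prefix and sys.base_prefix attributes; in_virtual_environment is a name made up for this example.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside an environment, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the Python it was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("prefix:     ", sys.prefix)
    print("in a venv?: ", in_virtual_environment())
```

Run with the Homebrew python3 directly, this should report False; run with the interpreter inside an environment folder, it should report True.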
First, let me install virtualenv:
pip3 install virtualenv
Next, let's give our virtual environment a name. virtualenv creates a folder within your project folder and stuffs all of the Python libraries you install there. So we're really trying to decide on the name of this folder.
The CONTRIBUTING.md file in the Superset repo suggests naming it venv:
python3 -m venv venv
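- The first venv is the name of Python's built-in venv module, which the -m flag runs as a script. It ships with Python 3 and is separate from the virtualenv package installed above, although both create the same kind of environment
- The second venv refers to the name of the folder we're creating (../superset/venv/)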
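The command above can also be expressed through the standard library's venv module directly, which makes the "module name vs. folder name" split easier to see. This is a sketch only: create_env is a hypothetical helper, and it passes with_pip=False to stay fast, whereas the command-line form installs pip into the environment by default.

```python
import os
import tempfile
import venv
from pathlib import Path


def create_env(project_dir, name="venv"):
    """Create a virtual environment folder called `name` inside project_dir."""
    env_dir = Path(project_dir) / name
    # symlinks mirrors the command-line default (symlinks everywhere but Windows).
    venv.create(env_dir, with_pip=False, symlinks=(os.name != "nt"))
    return env_dir


if __name__ == "__main__":
    # Use a throwaway directory so the sketch does not touch a real checkout.
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = create_env(tmp)
        print(sorted(p.name for p in env_dir.iterdir()))
```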
Why should we name it venv/? One hint is in the .gitignore file, which specifies files & folder paths to ignore in version control. This means that each user can have their own local state and those details won't get checked into version control.
The .gitignore file itself is version controlled though. So this file provides a "universal" agreement between all of the contributors to Superset that these files should not be checked into version control. Let's search for any lines containing "env" in the .gitignore:
cat .gitignore | grep 'env'
This returns:
.env
.envrc
env
venv*
env_py3
envpy3
env36
venv
While some open source projects use the .venv/ convention for virtualenv, Superset uses venv, it seems. So this means:
- we can party in our local venv/ and none of those changes will make it into any code PRs we may want to make
- if we want to use .venv/ instead, git will pick up the folder as an untracked change, because none of the patterns above match it
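The matching can be approximated in a few lines to check that reasoning. This is a rough sketch using shell-style wildcards from the standard fnmatch module; real gitignore rules also cover negation, slashes, and directory-only patterns, which are ignored here. The PATTERNS list is copied from the grep output above, and is_ignored is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Patterns copied from the grep output above.
PATTERNS = [".env", ".envrc", "env", "venv*", "env_py3", "envpy3", "env36", "venv"]


def is_ignored(folder_name, patterns=PATTERNS):
    """Approximate gitignore matching for a single top-level folder name."""
    return any(fnmatchcase(folder_name, pattern) for pattern in patterns)


for name in ("venv", "venv38", ".venv"):
    print(name, is_ignored(name))
# venv True
# venv38 True
# .venv False
```

For the authoritative answer, git check-ignore -v followed by a path reports which rule (if any) matches it in a real checkout.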
Let's stick to the community convention, and run the suggested command:
python3 -m venv venv
If we run ls while within the superset/ folder, we'll see venv listed as a folder. Success!
Python Dependencies
Usually, the Python requirements are specified in a requirements.txt file. In the case of Superset, we're blessed with a folder of .in and .txt files. There's a lot we could explore and unpack here, but I'm going to focus on getting everything set up first.
If we look to CONTRIBUTING.md, we see:
pip install -r requirements/testing.txt
If we open that file, we see something that resembles a standard requirements.txt file, but with this header:
# This file is autogenerated by pip-compile-multi
I've made a mental note to come back to pip-compile-multi, a library for compiling multiple requirements files. For now, let's run the following command to install the dependencies:
pip3 install -r requirements/testing.txt
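Because the compiled file pins exact versions, it is easy to inspect before installing. The sketch below pulls name==version pairs out of requirements-style text. SAMPLE is illustrative rather than the real contents of requirements/testing.txt: only the header line and the mysqlclient pin appear in this post, the other entry is made up, and parse_pins is a hypothetical helper.

```python
# Illustrative text only; "example-package" is a made-up entry.
SAMPLE = """\
# This file is autogenerated by pip-compile-multi
-r development.txt
example-package==1.2.3
    # via -r requirements/testing.in
mysqlclient==2.1.0
"""


def parse_pins(text):
    """Return a {package: version} dict for every `name==version` line."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and options like -r
            continue
        if "==" in line:
            name, version = line.split("==", 1)
            # Drop any environment marker that follows the version.
            pins[name.strip().lower()] = version.split(";", 1)[0].strip()
    return pins


print(parse_pins(SAMPLE))
# {'example-package': '1.2.3', 'mysqlclient': '2.1.0'}
```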
Error 1: MySQL
I ran into this issue, with scary red error text, on my M1 MacBook:
Collecting mysqlclient==2.1.0
Using cached mysqlclient-2.1.0.tar.gz (87 kB)
ERROR: Command errored out with exit status 1:
command: /opt/homebrew/opt/python@3.8/bin/python3.8 -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/6d/f0fzvlyn6sd58q5rmx6s6df00000gn/T/pip-install-6c548wua/mysqlclient_a8c054d3233d4d00acb42d6a6bf2a562/setup.py'"'"'; __file__='"'"'/private/var/folders/6d/f0fzvlyn6sd58q5rmx6s6df00000gn/T/pip-install-6c548wua/mysqlclient_a8c054d3233d4d00acb42d6a6bf2a562/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/6d/f0fzvlyn6sd58q5rmx6s6df00000gn/T/pip-pip-egg-info-0735tk4h
WARNING: Discarding https://files.pythonhosted.org/packages/de/79/d02be3cb942afda6c99ca207858847572e38146eb73a7c4bfe3bdf154626/mysqlclient-2.1.0.tar.gz#sha256=973235686f1b720536d417bf0a0d39b4ab3d5086b2b6ad5e6752393428c02b12 (from https://pypi.org/simple/mysqlclient/) (requires-python:>=3.5). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement mysqlclient==2.1.0 (from versions: 1.3.0, 1.3.1, 1.3.2, 1.3.3, 1.3.4, 1.3.5, 1.3.6, 1.3.7, 1.3.8, 1.3.9, 1.3.10, 1.3.11rc1, 1.3.11, 1.3.12, 1.3.13, 1.3.14, 1.4.0rc1, 1.4.0rc2, 1.4.0rc3, 1.4.0, 1.4.1, 1.4.2, 1.4.2.post1, 1.4.3, 1.4.4, 1.4.5, 1.4.6, 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.1.0rc1, 2.1.0)
ERROR: No matching distribution found for mysqlclient==2.1.0
Some Stack Overflow sleuthing suggested that I needed to install MySQL via Homebrew so that the installation process for the Python client library would work. So this may not be an M1-related issue after all:
brew install mysql
Error 2: Postgres
While mysqlclient succeeded, pip now got stuck on Postgres:
Error: pg_config executable not found.
pg_config is required to build psycopg2 from source. Please add the directory
containing pg_config to the $PATH or specify the full executable path with the
option:
python setup.py build_ext --pg-config /path/to/pg_config build ...
or with the pg_config option in 'setup.cfg'.
If you prefer to avoid building psycopg2 from source, please install the PyPI
'psycopg2-binary' package instead.
Let's check out Stack Overflow again. I like using the Postgres Mac app, which contains a pg_config executable. So I'm going to find the path to the pg_config file and add it to my PATH. I'll first crack open the Postgres.app folder.
After jumping through folders, I found the pg_config executable. As suggested on Stack Overflow, I'm going to add that executable's folder to my PATH:
export PATH=$PATH:/Applications/Postgres.app/Contents/Versions/14/bin
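Before re-running pip, it can help to confirm that the build tool is actually discoverable. Here is a small sketch using shutil.which from the standard library. find_tool is a hypothetical helper, and the Postgres.app folder is the one from the export line above, so adjust it for a different install.

```python
import os
import shutil


def find_tool(name, extra_dirs=()):
    """Look for an executable on PATH plus any extra directories."""
    parts = [os.environ.get("PATH", ""), *extra_dirs]
    search_path = os.pathsep.join(p for p in parts if p)
    # shutil.which returns the full path of the first match, or None.
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    # Folder taken from the export line above; it only exists with Postgres.app 14.
    postgres_bin = "/Applications/Postgres.app/Contents/Versions/14/bin"
    print("pg_config:", find_tool("pg_config", [postgres_bin]))
```

A result of None means the psycopg2 build would hit the same "pg_config executable not found" error.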
Now when I run pip3 install -r requirements/testing.txt again, everything works beautifully!
Editable Superset
Now, we're ready to install Superset in "editable" mode. Editable mode lets us modify and test code changes in Superset quickly, which is ideal when developing features or fixing bugs.
pip3 install -e .
To test the installation, run the superset command and the Superset CLI should appear.
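Another quick check for an editable install is to ask Python where it would import the package from. With an editable install, that location should sit inside the checked-out repo rather than under site-packages. This is a sketch using importlib.util.find_spec from the standard library; module_location is a hypothetical helper.

```python
import importlib.util


def module_location(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Prints None if superset is not installed for this interpreter.
    print(module_location("superset"))
```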
Next Up
That's it for Day 1. In Day 2, I'll play with setting up the metadata database, creating roles & permissions, loading example data, and starting the backend server.
If you want to follow along, use the RSS feed. Stay tuned! 📺