Python development environment, 2020 edition

It’s 2020 and I’m still using Python for most of my stuff. I’ve been keeping an eye on Julia because of its roots in numerical computing, but Python is still my workhorse. Below I’ll detail my views on and tools for Python development, mostly for data science and backend projects related to data products.

Python interpreter

I used pyenv for a while, but the shim links and the compilation options were too fragile for me. Now I either use the system Python or, when I really need to compile locally, I run an Ansible playbook I created to do that for me. It’s Debian-only for now, but I’m working on a more general one.
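The playbook essentially automates the usual build-from-source steps, roughly along these lines (the Python version and install prefix here are just examples, not what the playbook necessarily uses):

# Build dependencies on Debian (the playbook takes care of this part)
sudo apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev \
    libreadline-dev libsqlite3-dev libffi-dev liblzma-dev

# Download, compile and install under /opt, leaving the system Python untouched
wget https://www.python.org/ftp/python/3.8.3/Python-3.8.3.tgz
tar xzf Python-3.8.3.tgz && cd Python-3.8.3
./configure --prefix=/opt/python3.8 --enable-optimizations
make -j"$(nproc)"
sudo make altinstall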

Editors

I started programming in Python using Eclipse during my PhD, and I have to say, it was not pleasant. I tried a bunch of editors, from vim to Emacs, and after a while I settled on PyCharm. The main change since then is that I now use Visual Studio Code for 99% of my coding instead of PyCharm, although I still can’t refactor code properly in VSCode. I also rely heavily on Jupyter Notebooks and JupyterLab, depending on where and what I’m working on.

Remote development

I have to do a lot of my coding on remote systems for different reasons: data privacy, performance or customer restrictions. I have two main workflows: Jupyter, or git plus Visual Studio Code. If the code is not too long or I’m doing data science work, it’s usually Jupyter notebooks in JupyterLab, sometimes with the embedded editor. For EDAs and prototyping this is usually enough. When working on backend code, or when the models need to be productized, I create a git repository and clone it locally to work in VSCode: I commit the code to the repository, pull it on the remote machine, and iterate on it there in IPython or directly on the shell. It’s a bit cumbersome to keep committing the code, but the practice of having all changes under version control, combined with IPython, makes the code much better.
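In practice the loop looks roughly like this (the file name and commit message are just placeholders):

# Locally, after editing in VSCode
git commit -am "Tweak feature extraction"
git push

# On the remote machine
git pull
ipython -i train.py    # run the freshly pulled code and keep iterating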

Virtual environments

Of all the tools we hear about for managing virtualenvs in Python, in the end I use what I think is the simplest one, the one that ships with Python itself: venv. I create the virtualenv with

python3 -m venv <path/to/venv>
source <path/to/venv>/bin/activate

and done. All other tools are either too heavy-handed, need several steps to be installed, or are very confusing. I also keep all my virtualenvs in the same folder: $HOME/lib/venv for user projects and /opt/venv for system-wide ones. Even though virtualenvs, for me, usually map one-to-one to projects, I sometimes use them for more than one project. Also, the path is standard and the name of the virtualenv is the name of the project by default, making deployment and management more straightforward.
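Concretely, for a hypothetical project called project-name, that looks like:

python3 -m venv $HOME/lib/venv/project-name
source $HOME/lib/venv/project-name/bin/activate
pip install -r requirements.txt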

Project structure

No news here. I follow the usual folder structure for Python packages:

project-name
|-- project-name
|   |-- main.py
|   |-- module.py
|   |-- func-name
|   |   `-- func-name.py
|   `-- tests
|       `-- test_all.py
|-- doc
|   |-- project-name.md
|   |-- goals.md
|   `-- other-doc.md
|-- notebooks
|   |-- eda-20200515.ipynb
|   |-- modelproto-20200517.ipynb
|   `-- data-extract-20200516.ipynb
|-- data
|   |-- raw
|   |   |-- raw_data.csv
|   |   `-- raw_source.prq
|   |-- intermediate
|   |   `-- transformed_20200516.csv
|   |-- processed
|   |   `-- model_ready_20200617.prq
|   `-- results
|       `-- model_trained.pkl
|-- README.md
|-- .gitignore
|-- requirements.txt
`-- LICENSE

and so on. I keep my data folder out of the git repository to reduce the risk of publishing private or proprietary data. The rest should be self-explanatory. I have a cookiecutter template to deploy this structure, and I try to follow the reproducibility guidelines.
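Spinning up a new project from that template is then a one-liner (the template URL below is just a placeholder for wherever the template lives):

pip install cookiecutter
cookiecutter https://github.com/<user>/<project-template>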

Deploy to production

Here it depends on what production means. For example, sometimes what I have to deploy is just my model in pickle form, so the production infrastructure can run it. Sometimes I have to deploy my package as a library, so it’s either an Ansible script that fetches the repository from git and installs it directly with pip, or, also with Ansible, a deploy to a folder that installs the dependencies and starts the service with supervisord. It really depends on the existing devops pipeline I have to integrate with, and on whether my package is a library, a service or just a model.
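As a rough sketch, the two cases boil down to something like this on the target machine (the repository URL, version tag and service name are placeholders):

# Library case: install straight from the git repository
pip install git+https://git.example.com/team/project-name.git@v1.2.0

# Service case: deploy to a folder, install dependencies, restart the service
git clone https://git.example.com/team/project-name.git /opt/project-name
python3 -m venv /opt/venv/project-name
/opt/venv/project-name/bin/pip install -r /opt/project-name/requirements.txt
supervisorctl restart project-name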

Conclusion

So, nothing too different from what I’ve seen around, just a collection of good practices that I’ve seen work in my day-to-day. Any suggestions for improvement are welcome!