To pin or not to pin dependencies: reproducible vs. reusable software
We recently had a very interesting conversation in our lab about how to describe software dependencies (the libraries one needs to install) for a software project in the context of research. One camp proposed explicitly listing which version of each dependency is required (a scheme also referred to as “pinning”), while the other camp favored either not specifying a version at all or specifying only the minimal required version. Luckily both camps agreed on the importance of specifying dependencies, but what’s the big deal about pinning vs. not pinning?
Advantages of pinning dependencies
When you pin a dependency (for example by saying “numpy==1.1.3”) you explicitly point to a version of a library that a) you know works with your script and b) was used to generate the results you present in your paper. This is very useful for people trying to replicate your results using your code, and also for you when you revisit a project that was put aside for a while (for example when reviewers come back after 3 months). Knowing which versions of the libraries worked last time and generated the results in question, one can quickly run and modify the code.
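When coming back to a pinned project, it can help to confirm that the current environment still matches the pins. The following is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+). The helper names `parse_pins` and `find_mismatches` are hypothetical, and the parser only understands plain “name==version” lines, not the full requirements file syntax.

```python
from importlib import metadata


def parse_pins(lines):
    """Turn 'name==version' lines into a {name: version} dict.

    Comments and blank lines are skipped; lines without '==' are ignored.
    """
    pins = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {name: (wanted, found)} for every pin the environment violates.

    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Assuming the pins live in a file called requirements.txt, a call such as `find_mismatches(parse_pins(open("requirements.txt")))` would list every package whose installed version differs from the one used for the paper.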
Advantages of not pinning dependencies
On the other hand, when you opt not to pin dependencies (for example by specifying “numpy” or “numpy>=1.1.3”) you place fewer constraints on your software and make it easier to incorporate into an existing software stack. Let’s say you specified “numpy==1.1.3” in your setup.py and your friend Jack, who has numpy 1.1.4 installed, tries to install your package. Jack will have to downgrade numpy to meet the package dependencies or create a new Python environment just to use your package. However, if you specify “numpy>=1.1.3”, Jack will be able to install your package without problems.
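Jack’s situation can be illustrated with a small sketch. This is a deliberately simplified comparison that assumes purely numeric, dot-separated versions and only the “==” and “>=” operators; it is not how pip resolves requirements, and the function names are made up for this example.

```python
def version_tuple(version):
    """Convert '1.1.3' into (1, 1, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(installed, requirement):
    """Check an installed version against 'name', 'name==X' or 'name>=X'."""
    for op in ("==", ">="):
        if op in requirement:
            _, wanted = requirement.split(op, 1)
            have = version_tuple(installed)
            want = version_tuple(wanted.strip())
            return have == want if op == "==" else have >= want
    return True  # a bare package name accepts any version


jack_has = "1.1.4"
print(satisfies(jack_has, "numpy==1.1.3"))  # pinned requirement
print(satisfies(jack_has, "numpy>=1.1.3"))  # minimal-version requirement
print(satisfies(jack_has, "numpy"))         # no version constraint
```

With numpy 1.1.4 installed, only the pinned requirement is rejected; the other two are satisfied by what Jack already has.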
When to pin dependencies
- If you are not developing a reusable package, but distributing a set of scripts to perform a particular analysis (for example to accompany a manuscript). Pinning will increase the reproducibility of your work. Additionally, since you are not building a package, the dependencies will only be specified in requirements.txt or a Dockerfile, allowing people to ignore them if they wish. Note that you do not need to create an installable package (with setup.py) in this scenario (see for example this repository).
- In Dockerfiles. This way each time someone builds the image using this Dockerfile they will get the same result.
- If you are developing a reusable package, but only for specifying testing environments for continuous integration. If you don’t pin dependencies, a release of a broken (or backward-incompatible) library could break your tests. This can happen at any moment, causing Pull Requests to fail for no apparent reason and leaving contributors perplexed about what is wrong with their code.
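For any of the cases above, the pinned list has to come from somewhere. One possible approach, sketched here with the standard library’s `importlib.metadata` (Python 3.8+), is to record the versions installed in the environment that produced the results, which is similar in spirit to the output of `pip freeze`. The function names are hypothetical.

```python
from importlib import metadata


def installed_packages():
    """Yield (name, version) for every distribution in the current environment."""
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken metadata
            yield name, dist.version


def format_pins(packages):
    """Render (name, version) pairs as sorted, de-duplicated 'name==version' lines."""
    unique = {name.lower(): (name, version) for name, version in packages}
    return [f"{name}=={version}" for _, (name, version) in sorted(unique.items())]


if __name__ == "__main__":
    # Print pins for the current environment, e.g. to paste into requirements.txt.
    print("\n".join(format_pins(installed_packages())))
```

This records everything that is installed, including packages the analysis never imports, so trimming the list by hand to the libraries the scripts actually use is probably worthwhile.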
When not to pin dependencies
- Inside setup.py, when developing a reusable package (library, framework, tool etc.). You want your package to be easy to integrate with other people’s software stacks, so you should specify only the minimal required dependency versions, and only if you know that your package will not work with older versions.
At the end of the day, this is a good example of the difference between reusability and reproducibility. Reusable software is much more valuable: it can be integrated into many different software stacks and become a building block of other tools. It is also hard to make and maintain; for one, you need to validate that it runs with every version of its constantly updated dependencies. Mere reproducibility is much easier and still has its place in the context of one-off analyses that produce results for a scientific paper.
Originally published at blog.chrisgorgolewski.org.