Data Driven Decisions Using PyPI Download Statistics
What platform do my users run on? How many users do I have? Is it okay to drop support for something?
Open source library developers have an analytics problem. How do you answer questions like this when people are embedding your software as a component of their own product?
For example, you, the developer of the library frinkiac-iac
would like to drop support for Python 2.6 because it is old, unsupported by upstream, and restricting your use of newer Python features. Is it okay to do this? Since you don’t have any data you may attempt to compensate for this lack of insight in a variety of ways:
-
Reach out through your community (mailing lists, bug trackers, IRC channels) to tentatively suggest the idea.
-
Do a release with a deprecation warning with a timeline for removal that will potentially be massively pushed back as users complain.1
-
Just remove it and see if anybody complains.
-
Wait for a major package to do it and follow their lead.
Each of these choices can work, but frequently result in major breakage for users or an unacceptably long timeline for removal resulting in significant maintenance burden2. You need data, but it’s so hard to collect it. Can’t someone else do it?3
BigQuery
Quietly announced back in May 2016, PyPI download statistics are the most powerful tool a Python developer can use to inform their decision making. To unleash its awesome power start by loading the dataset. Preemptive apology: If you are a new Google Cloud user you may be confronted with an unpleasant set of modals asking you to agree to various things and create a new project. What you actually need to do is get to billing and start a free trial4. Once you’ve got the BigQuery console available you should be able to click that link again and access the page. Now you can click Compose Query!
BigQuery uses a SQL-like language. A query reference is available to help you write your own queries. Additionally, you can see the set of available fields you can query against by clicking downloads
on the left pane and looking at the schema. One note: file.project
is the normalized name of the project.
With this robust data set we can make substantially more informed decisions, but here are some examples to help you out. All the examples here use what Google references as “legacy SQL”. Despite the name this is the default for queries, but feel free to use the standard SQL if you’d like!
Query: Python Versions
In the original scenario we were curious if we could drop support for Python 2.6. What does that query look like? For the following examples we’ll be using cryptography
since that’s my primary project and the one I run queries on most commonly.
SELECT
REGEXP_EXTRACT(details.python, r"^([^\.]+\.[^\.]+)") as python_version,
COUNT(*) as download_count,
FROM
TABLE_DATE_RANGE(
[the-psf:pypi.downloads],
DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
)
WHERE
file.project = 'cryptography'
GROUP BY
python_version,
ORDER BY
download_count DESC
LIMIT 100
And our results (as of 2016-12-09):
Row | python_version | download_count |
---|---|---|
1 | 2.7 | 2208835 |
2 | 3.5 | 202515 |
3 | 2.6 | 109134 |
4 | null | 91076 |
5 | 3.4 | 87837 |
6 | 3.3 | 5623 |
7 | 3.6 | 1354 |
8 | 3.7 | 643 |
9 | 1.17 | 341 |
10 | 3.2 | 270 |
11 | 3.1 | 2 |
12 | 2.4 | 1 |
13 | 2.5 | 1 |
As you can see, Python 2.6 makes up 109,134 out of 2,707,632 downloads in the past 30 days. This represents roughly 4% of downloads. Is that low enough to drop support?5.
null
also makes up approximately 3.4% of downloads. These
are downloads from PyPI using clients that do not support sending the statistics
we’re querying against. This can be an older version of pip
or alternate
clients. You also see 341 downloads from 1.17, which is…who knows!
When making maintenance decisions you should factor these unknowns as you
feel appropriate.
Query: OpenSSL versions
cryptography
supports a wide variety of OpenSSL versions. However, supporting
0.9.8 and 1.0.0 are a significant challenge since they are missing many of the
features we need (and are no longer supported by upstream). Let’s craft a
query to see what versions of OpenSSL are in use:
SELECT
details.system.name,
REGEXP_EXTRACT(details.openssl_version, r"^OpenSSL ([^ ]+) ") as openssl_version,
COUNT(*) as download_count,
FROM
TABLE_DATE_RANGE(
[the-psf:pypi.downloads],
DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
)
WHERE
details.openssl_version IS NOT NULL
GROUP BY
details.system.name,
openssl_version,
HAVING
download_count >= 100
ORDER BY
download_count DESC
LIMIT 100
This gives us a view of what OpenSSL is linked against Python broken down by platform:
Row | details_system_name | openssl_version | download_count |
---|---|---|---|
1 | Linux | 1.0.1f | 56152115 |
2 | Linux | 1.0.1t | 26301089 |
3 | Linux | 1.0.2g | 25892721 |
4 | Linux | 1.0.1e-fips | 24082526 |
5 | Linux | 1.0.1 | 19199858 |
6 | Darwin | 0.9.8zh | 15528451 |
7 | Linux | 1.0.1k-fips | 9029119 |
8 | Linux | 1.0.2j | 8499043 |
9 | Windows | 1.0.2h | 5037984 |
10 | Darwin | 1.0.2j | 3077958 |
While I haven’t provided the entire set of results it turns out less than 100,000 downloads out of 210,063,137 were made using OpenSSL 1.0.0. 0.9.8 holds a much greater share, but only due to Darwin (aka macOS…aka OS X). In cryptography’s case we statically link wheels on Mac and Windows so we can ignore the OpenSSL version on those platforms. Looks like dropping 0.9.8 and 1.0.0 is probably safe!6
Query: Most Popular Projects
Maybe you just want to know how popular your package is relative to others in the past 30 days.
SELECT
file.project,
COUNT(*) as total_downloads,
FROM
TABLE_DATE_RANGE(
[the-psf:pypi.downloads],
DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
)
GROUP BY
file.project
ORDER BY
total_downloads DESC
LIMIT 100
Query: How Many Downloads Did My Project Get?
Or maybe just your package:
SELECT
COUNT(*) as total_downloads,
FROM
TABLE_DATE_RANGE(
[the-psf:pypi.downloads],
DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
)
WHERE
file.project = 'cryptography'
Query: Downloads By Filename Filtered By Project and Version
If you ship multiple release artifacts (platform/version specific wheels as well as sdist) this query can show you their popularity.
SELECT
file.filename,
COUNT(*) as total_downloads,
FROM
TABLE_DATE_RANGE(
[the-psf:pypi.downloads],
DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
)
WHERE
file.project = 'cryptography'
AND
file.filename like '%1.6%'
GROUP BY
file.filename
ORDER BY
total_downloads DESC
LIMIT 100
Query: Python 2 vs 3 For A Single Project
We broke it down by Python release previously, but what if you just want to know 2 vs 3?
SELECT
ROUND(100 * SUM(CASE WHEN REGEXP_EXTRACT(details.python, r"^([^\.]+)") = "3" THEN 1 ELSE 0 END) / COUNT(*), 1) AS percent_3,
COUNT(*) as download_count,
FROM
TABLE_DATE_RANGE(
[the-psf:pypi.downloads],
DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
)
WHERE
file.project = 'cryptography'
ORDER BY
download_count DESC
LIMIT 100
In cryptography’s case Python 3 currently makes up 11% of downloads7.
Query: Percentage of Downloads By Python 3 In The Top 100
Of course, a trivial modification and we can see the Python 3 percentage in the top 100 packages:
SELECT
file.project,
ROUND(100 * SUM(CASE WHEN REGEXP_EXTRACT(details.python, r"^([^\.]+)") = "3" THEN 1 ELSE 0 END) / COUNT(*), 1) AS percent_3,
COUNT(*) as download_count,
FROM
TABLE_DATE_RANGE(
[the-psf:pypi.downloads],
DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
)
group by
file.project
ORDER BY
download_count DESC
LIMIT 100
Query: Highest Python 3 Usage With More Than 100,000 Downloads Per 30 Days
SELECT
file.project,
ROUND(100 * SUM(CASE WHEN REGEXP_EXTRACT(details.python, r"^([^\.]+)") = "3" THEN 1 ELSE 0 END) / COUNT(*), 1) AS percent_3,
COUNT(*) as download_count,
FROM
TABLE_DATE_RANGE(
[the-psf:pypi.downloads],
DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
)
group by
file.project
HAVING
download_count > 100000
ORDER BY
percent_3 DESC
LIMIT 100
The top 10 is interesting here:
Row | Date | Percentage | Total Downloads |
---|---|---|---|
1 | async-timeout | 98.8 | 117724 |
2 | multidict | 89.4 | 157661 |
3 | yarl | 82.1 | 118081 |
4 | aiohttp | 74.6 | 209035 |
5 | azure-servicebus | 72.3 | 151706 |
6 | plumbum | 60.1 | 131295 |
7 | mysqlclient | 57.9 | 140905 |
8 | pkginfo | 56.2 | 102214 |
9 | azure-nspkg | 55.2 | 214821 |
10 | azure-storage | 54.7 | 213129 |
Query: Which Packages By Percentage Are Downloaded Most Often Via Python 2.6?
SELECT
file.project,
ROUND(100 * SUM(CASE WHEN REGEXP_EXTRACT(details.python, r"^([^\.]+\.[^\.]+)") = "2.6" THEN 1 ELSE 0 END) / COUNT(*), 1) AS percent_26,
SUM(CASE WHEN REGEXP_EXTRACT(details.python, r"^([^\.]+\.[^\.]+)") = "2.6" THEN 1 ELSE 0 END) as total_26_downloads,
COUNT(*) as total_downloads,
FROM
TABLE_DATE_RANGE(
[the-psf:pypi.downloads],
DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
)
WHERE
details.python IS NOT NULL AND
# Exclude things which are stdlib backports *for* Python 2.6
file.project NOT IN ("argparse", "ordereddict")
GROUP BY
file.project,
HAVING
total_downloads > 5000
ORDER BY
percent_26 DESC
LIMIT 250
Query: Most Used Installers/Versions
SELECT
details.installer.name,
details.installer.version,
COUNT(*) as total_downloads
FROM
TABLE_DATE_RANGE(
[the-psf:pypi.downloads],
DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
)
GROUP BY
details.installer.name,
details.installer.version
ORDER BY
total_downloads DESC
LIMIT 100
Query: Downloads By Country
SELECT
country_code,
COUNT(*) as downloads,
FROM
TABLE_DATE_RANGE(
[the-psf:pypi.downloads],
DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
)
GROUP BY
country_code
ORDER BY
downloads DESC
LIMIT 100
Query: Python 3 Download Percentage Across PyPI Grouped By Month
SELECT
STRFTIME_UTC_USEC(timestamp, "%Y-%m") AS yyyymm,
ROUND(100 * SUM(CASE WHEN REGEXP_EXTRACT(details.python, r"^([^\.]+)") = "3" THEN 1 ELSE 0 END) / COUNT(*), 1) AS percent_3,
COUNT(*) as download_count
FROM
TABLE_DATE_RANGE(
[the-psf:pypi.downloads],
DATE_ADD(CURRENT_TIMESTAMP(), -1, "year"),
CURRENT_TIMESTAMP()
)
group by
yyyymm
ORDER BY
yyyymm DESC
LIMIT 100
This is an interesting query for obvious reasons8, but also one that processes quite a bit of data. Here are the results through January 2017:
Date | Percentage | Total Downloads |
---|---|---|
2017-01 | 10.3 | 619254519 |
2016-12 | 9.6 | 537529176 |
2016-11 | 9.2 | 563208796 |
2016-10 | 8.9 | 528112583 |
2016-09 | 7.9 | 538323554 |
2016-08 | 7.4 | 542956355 |
2016-07 | 7.1 | 483143293 |
2016-06 | 7.1 | 428402893 |
2016-05 | 7.1 | 137138123 |
2016-03 | 6.6 | 63615708 |
2016-02 | 6.5 | 314275153 |
2016-01 | 6.4 | 98290315 |
Data ingestion into the BigQuery data set was spotty prior to June 20169, but you can see a significant uptick in Python 3 based downloads over 2016. If these trends continue…
Decisions
These queries just scratch the surface of the data you can get about the Python ecosystem at large and your project in particular. Write your own queries and make your decisions about platform, version, and implementation with real information.
Of course, this doesn’t address more granular usage, exception tracking, and other analytics web and mobile devs frequently take for granted, but that’s a separate blog post.
Thanks to Donald Stufft for his tireless efforts on PyPI, Alex Gaynor for several of the queries in this blog post, and Google for generously donating capacity on BigQuery for PyPI data.
-
Or worse, don’t complain until after you’ve done the removal. ↩
-
…which can contribute significantly to project burnout. ↩
-
I am legally obligated to link Frinkiac in every blog post. ↩
-
Queries are charged against your account, but you get 1TB free per month and cached queries won’t count against it. ↩
-
It’s not for cryptography. We have a deprecation warning in place but an indefinite timeline for removal of support. ↩
-
Issues for 1.0.0 removal and 0.9.8 removal. ↩
-
requests is at 14%, bcrypt 29%, django 32.9%, and jupyter 33.3%. Packages can vary wildly! ↩
-
To my knowledge no one has posted this information previously, so we’re accidentally breaking new ground with our queries! ↩
-
But it shouldn’t be biased, so these percentages are likely to be accurate. ↩