What platform do my users run on? How many users do I have? Is it okay to drop support for something?

Open source library developers have an analytics problem. How do you answer questions like this when people are embedding your software as a component of their own product?

For example, you, the developer of the library frinkiac-iac would like to drop support for Python 2.6 because it is old, unsupported by upstream, and restricting your use of newer Python features. Is it okay to do this? Since you don’t have any data you may attempt to compensate for this lack of insight in a variety of ways:

  • Reach out through your community (mailing lists, bug trackers, IRC channels) to tentatively suggest the idea.

  • Do a release with a deprecation warning with a timeline for removal that will potentially be massively pushed back as users complain.1

  • Just remove it and see if anybody complains.

  • Wait for a major package to do it and follow their lead.

Each of these choices can work, but frequently result in major breakage for users or an unacceptably long timeline for removal resulting in significant maintenance burden2. You need data, but it’s so hard to collect it. Can’t someone else do it?3

BigQuery

Quietly announced back in May 2016, PyPI download statistics are the most powerful tool a Python developer can use to inform their decision making. To unleash its awesome power start by loading the dataset. Preemptive apology: If you are a new Google Cloud user you may be confronted with an unpleasant set of modals asking you to agree to various things and create a new project. What you actually need to do is get to billing and start a free trial4. Once you’ve got the BigQuery console available you should be able to click that link again and access the page. Now you can click Compose Query!

BigQuery uses a SQL-like language. A query reference is available to help you write your own queries. Additionally, you can see the set of available fields you can query against by clicking downloads on the left pane and looking at the schema. One note: file.project is the normalized name of the project.

With this robust data set we can make substantially more informed decisions, but here are some examples to help you out. All the examples here use what Google references as “legacy SQL”. Despite the name this is the default for queries, but feel free to use the standard SQL if you’d like!

Query: Python Versions

In the original scenario we were curious if we could drop support for Python 2.6. What does that query look like? For the following examples we’ll be using cryptography since that’s my primary project and the one I run queries on most commonly.

SELECT
  REGEXP_EXTRACT(details.python, r"^([^\.]+\.[^\.]+)") as python_version,
  COUNT(*) as download_count,
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
WHERE
  file.project = 'cryptography'
GROUP BY
  python_version,
ORDER BY
  download_count DESC
LIMIT 100

And our results (as of 2016-12-09):

Rowpython_versiondownload_count
12.72208835
23.5202515
32.6109134
4null91076
53.487837
63.35623
73.61354
83.7643
91.17341
103.2270
113.12
122.41
132.51

As you can see, Python 2.6 makes up 109,134 out of 2,707,632 downloads in the past 30 days. This represents roughly 4% of downloads. Is that low enough to drop support?5.

null also makes up approximately 3.4% of downloads. These are downloads from PyPI using clients that do not support sending the statistics we’re querying against. This can be an older version of pip or alternate clients. You also see 341 downloads from 1.17, which is…who knows! When making maintenance decisions you should factor these unknowns as you feel appropriate.

Query: OpenSSL versions

cryptography supports a wide variety of OpenSSL versions. However, supporting 0.9.8 and 1.0.0 are a significant challenge since they are missing many of the features we need (and are no longer supported by upstream). Let’s craft a query to see what versions of OpenSSL are in use:

SELECT
  details.system.name,
  REGEXP_EXTRACT(details.openssl_version, r"^OpenSSL ([^ ]+) ") as openssl_version,
  COUNT(*) as download_count,
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
WHERE
  details.openssl_version IS NOT NULL
GROUP BY
  details.system.name,
  openssl_version,
HAVING
  download_count >= 100
ORDER BY
  download_count DESC
LIMIT 100

This gives us a view of what OpenSSL is linked against Python broken down by platform:

Rowdetails_system_nameopenssl_versiondownload_count
1Linux1.0.1f56152115
2Linux1.0.1t26301089
3Linux1.0.2g25892721
4Linux1.0.1e-fips24082526
5Linux1.0.119199858
6Darwin0.9.8zh15528451
7Linux1.0.1k-fips9029119
8Linux1.0.2j8499043
9Windows1.0.2h5037984
10Darwin1.0.2j3077958

While I haven’t provided the entire set of results it turns out less than 100,000 downloads out of 210,063,137 were made using OpenSSL 1.0.0. 0.9.8 holds a much greater share, but only due to Darwin (aka macOS…aka OS X). In cryptography’s case we statically link wheels on Mac and Windows so we can ignore the OpenSSL version on those platforms. Looks like dropping 0.9.8 and 1.0.0 is probably safe!6

Maybe you just want to know how popular your package is relative to others in the past 30 days.

SELECT
  file.project,
  COUNT(*) as total_downloads,
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
GROUP BY
  file.project
ORDER BY
  total_downloads DESC
LIMIT 100

Query: How Many Downloads Did My Project Get?

Or maybe just your package:

SELECT
  COUNT(*) as total_downloads,
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
WHERE
  file.project = 'cryptography'

Query: Downloads By Filename Filtered By Project and Version

If you ship multiple release artifacts (platform/version specific wheels as well as sdist) this query can show you their popularity.

SELECT
  file.filename,
  COUNT(*) as total_downloads,
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
WHERE
  file.project = 'cryptography'
AND
  file.filename like '%1.6%'
GROUP BY
  file.filename
ORDER BY
  total_downloads DESC
LIMIT 100

Query: Python 2 vs 3 For A Single Project

We broke it down by Python release previously, but what if you just want to know 2 vs 3?

SELECT
  ROUND(100 * SUM(CASE WHEN REGEXP_EXTRACT(details.python, r"^([^\.]+)") = "3" THEN 1 ELSE 0 END) / COUNT(*), 1) AS percent_3,
  COUNT(*) as download_count,
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
WHERE
  file.project = 'cryptography'
ORDER BY
  download_count DESC
LIMIT 100

In cryptography’s case Python 3 currently makes up 11% of downloads7.

Query: Percentage of Downloads By Python 3 In The Top 100

Of course, a trivial modification and we can see the Python 3 percentage in the top 100 packages:

SELECT
  file.project,
  ROUND(100 * SUM(CASE WHEN REGEXP_EXTRACT(details.python, r"^([^\.]+)") = "3" THEN 1 ELSE 0 END) / COUNT(*), 1) AS percent_3,
  COUNT(*) as download_count,
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
group by
  file.project
ORDER BY
  download_count DESC
LIMIT 100

Query: Highest Python 3 Usage With More Than 100,000 Downloads Per 30 Days

SELECT
  file.project,
  ROUND(100 * SUM(CASE WHEN REGEXP_EXTRACT(details.python, r"^([^\.]+)") = "3" THEN 1 ELSE 0 END) / COUNT(*), 1) AS percent_3,
  COUNT(*) as download_count,
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
group by
  file.project
HAVING
  download_count > 100000
ORDER BY
  percent_3 DESC
LIMIT 100

The top 10 is interesting here:

RowDatePercentageTotal Downloads
1async-timeout98.8117724
2multidict89.4157661
3yarl82.1118081
4aiohttp74.6209035
5azure-servicebus72.3151706
6plumbum60.1131295
7mysqlclient57.9140905
8pkginfo56.2102214
9azure-nspkg55.2214821
10azure-storage54.7213129

Query: Which Packages By Percentage Are Downloaded Most Often Via Python 2.6?

SELECT
  file.project,
  ROUND(100 * SUM(CASE WHEN REGEXP_EXTRACT(details.python, r"^([^\.]+\.[^\.]+)") = "2.6" THEN 1 ELSE 0 END) / COUNT(*), 1) AS percent_26,
  SUM(CASE WHEN REGEXP_EXTRACT(details.python, r"^([^\.]+\.[^\.]+)") = "2.6" THEN 1 ELSE 0 END) as total_26_downloads,
  COUNT(*) as total_downloads,
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
WHERE
  details.python IS NOT NULL AND
  # Exclude things which are stdlib backports *for* Python 2.6
  file.project NOT IN ("argparse", "ordereddict")
GROUP BY
  file.project,
HAVING
  total_downloads > 5000
ORDER BY
  percent_26 DESC
LIMIT 250

Query: Most Used Installers/Versions

SELECT
  details.installer.name,
  details.installer.version,
  COUNT(*) as total_downloads
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
GROUP BY
  details.installer.name,
  details.installer.version
ORDER BY
  total_downloads DESC
LIMIT 100

Query: Downloads By Country

SELECT
  country_code,
  COUNT(*) as downloads,
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -31, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
GROUP BY
  country_code
ORDER BY
  downloads DESC
LIMIT 100

Query: Python 3 Download Percentage Across PyPI Grouped By Month

SELECT
  STRFTIME_UTC_USEC(timestamp, "%Y-%m") AS yyyymm,
  ROUND(100 * SUM(CASE WHEN REGEXP_EXTRACT(details.python, r"^([^\.]+)") = "3" THEN 1 ELSE 0 END) / COUNT(*), 1) AS percent_3,
  COUNT(*) as download_count
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "year"),
    CURRENT_TIMESTAMP()
  )
group by
  yyyymm
ORDER BY
  yyyymm DESC
LIMIT 100

This is an interesting query for obvious reasons8, but also one that processes quite a bit of data. Here are the results through January 2017:

DatePercentageTotal Downloads
2017-0110.3619254519
2016-129.6537529176
2016-119.2563208796
2016-108.9528112583
2016-097.9538323554
2016-087.4542956355
2016-077.1483143293
2016-067.1428402893
2016-057.1137138123
2016-036.663615708
2016-026.5314275153
2016-016.498290315

Data ingestion into the BigQuery data set was spotty prior to June 20169, but you can see a significant uptick in Python 3 based downloads over 2016. If these trends continue…

Decisions

These queries just scratch the surface of the data you can get about the Python ecosystem at large and your project in particular. Write your own queries and make your decisions about platform, version, and implementation with real information.

Of course, this doesn’t address more granular usage, exception tracking, and other analytics web and mobile devs frequently take for granted, but that’s a separate blog post.

Thanks to Donald Stufft for his tireless efforts on PyPI, Alex Gaynor for several of the queries in this blog post, and Google for generously donating capacity on BigQuery for PyPI data.

  1. Or worse, don’t complain until after you’ve done the removal. 

  2. …which can contribute significantly to project burnout. 

  3. I am legally obligated to link Frinkiac in every blog post. 

  4. Queries are charged against your account, but you get 1TB free per month and cached queries won’t count against it. 

  5. It’s not for cryptography. We have a deprecation warning in place but an indefinite timeline for removal of support. 

  6. Issues for 1.0.0 removal and 0.9.8 removal

  7. requests is at 14%, bcrypt 29%, django 32.9%, and jupyter 33.3%. Packages can vary wildly! 

  8. To my knowledge no one has posted this information previously, so we’re accidentally breaking new ground with our queries! 

  9. But it shouldn’t be biased, so these percentages are likely to be accurate.