The Hub’s Database#

JupyterHub uses a database to store information about users, services, and other data needed for operating the Hub. This is the state of the Hub.

Why does JupyterHub have a database?#

JupyterHub is a stateful application (more on that ‘state’ later). Updating JupyterHub’s configuration or upgrading the version of JupyterHub requires restarting the JupyterHub process to apply the changes. We want to minimize the disruption caused by restarting the Hub process, so it can be a mundane, frequent, routine activity. Storing state information outside the process for later retrieval is necessary for this, and one of the main thing databases are for.

A lot of the operations in JupyterHub are also relationships, which is exactly what SQL databases are great at. For example:

  • Given an API token, what user is making the request?

  • Which users don’t have running servers?

  • Which servers belong to user X?

  • Which users have not been active in the last 24 hours?

Finally, a database allows us to have more information stored without needing it all loaded in memory, e.g. supporting a large number (several thousands) of inactive users.

What’s in the database?#

The short answer of what’s in the JupyterHub database is “everything.” JupyterHub’s state lives in the database. That is, everything JupyterHub needs to be aware of to function that doesn’t come from the configuration files, such as

  • users, roles, role assignments

  • state, urls of running servers

  • Hashed API tokens

  • Short-lived state related to OAuth flow

  • Timestamps for when users, tokens, and servers were last used

What’s not in the database#

Not quite all of JupyterHub’s state is in the database. This mostly involves transient state, such as the ‘pending’ transitions of Spawners (starting, stopping, etc.). Anything not in the database must be reconstructed on Hub restart, and the only sources of information to do that are the database and JupyterHub configuration file(s).

How does JupyterHub use the database?#

JupyterHub makes some unusual choices in how it connects to the database. These choices represent trade-offs favoring single-process simplicity and performance at the expense of horizontal scalability (multiple Hub instances).

We often say that the Hub ‘owns’ the database. This ownership means that we assume the Hub is the only process that will talk to the database. This assumption enables us to make several caching optimizations that dramatically improve JupyterHub’s performance (i.e. data written recently to the database can be read from memory instead of fetched again from the database) that would not work if multiple processes could be interacting with the database at the same time.

Database operations are also synchronous, so while JupyterHub is waiting on a database operation, it cannot respond to other requests. This allows us to avoid complex locking mechanisms, because transaction races can only occur during an await, so we only need to make sure we’ve completed any given transaction before the next await in a given request.

Note

We are slowly working to remove these assumptions, and moving to a more traditional db session per-request pattern. This will enable multiple Hub instances and enable scaling JupyterHub, but will significantly reduce the number of active users a single Hub instance can serve.

Database performance in a typical request#

Most authenticated requests to JupyterHub involve a few database transactions:

  1. look up the authenticated user (e.g. look up token by hash, then resolve owner and permissions)

  2. record activity

  3. perform any relevant changes involved in processing the request (e.g. create the records for a running server when starting one)

This means that the database is involved in almost every request, but only in quite small, simple queries, e.g.:

  • lookup one token by hash

  • lookup one user by name

  • list tokens or servers for one user (typically 1-10)

  • etc.

The database as a limiting factor#

As a result of the above transactions in most requests, database performance is the leading factor in JupyterHub’s baseline requests-per-second performance, but that cost does not scale significantly with the number of users, active or otherwise. However, the database is rarely a limiting factor in JupyterHub performance in a practical sense, because the main thing JupyterHub does is start, stop, and monitor whole servers, which take far more time than any small database transaction, no matter how many records you have or how slow your database is (within reason). Additionally, there is usually very little load on the database itself.

By far the most taxing activity on the database is the ‘list all users’ endpoint, primarily used by the idle-culling service. Database-based optimizations have been added to make even these operations feasible for large numbers of users:

  1. State filtering on GET /hub/api/users?state=active, which limits the number of results in the query to only the relevant subset (added in JupyterHub 1.3), rather than all users.

  2. Pagination of all list endpoints, allowing the request of a large number of resources to be more fairly balanced with other Hub activities across multiple requests (added in 2.0).

Note

It’s important to note when discussing performance and limiting factors and that all of this only applies to requests to /hub/.... The Hub and its database are not involved in most requests to single-user servers (/user/...), which is by design, and largely motivated by the fact that the Hub itself doesn’t need to be fast because its operations are infrequent and large.

Database backends#

JupyterHub supports a variety of database backends via SQLAlchemy. The default is sqlite, which works great for many cases, but you should be able to use many backends supported by SQLAlchemy. Usually, this will mean PostgreSQL or MySQL, both of which are officially supported and well tested with JupyterHub, but others may work as well. See SQLAlchemy’s docs for how to connect to different database backends. Doing so generally involves:

  1. installing a Python package that provides a client implementation, and

  2. setting JupyterHub.db_url to connect to your database with the specified implementation

Default backend: SQLite#

The default database backend for JupyterHub is SQLite. We have chosen SQLite as JupyterHub’s default because it’s simple (the ‘database’ is a single file) and ubiquitous (it is in the Python standard library). It works very well for testing, small deployments, and workshops.

For production systems, SQLite has some disadvantages when used with JupyterHub:

  • upgrade-db may not always work, and you may need to start with a fresh database

  • downgrade-db will not work if you want to rollback to an earlier version, so backup the jupyterhub.sqlite file before upgrading (JupyterHub automatically creates a date-stamped backup file when upgrading sqlite)

The sqlite documentation provides a helpful page about when to use SQLite and where traditional RDBMS may be a better choice.

Picking your database backend (PostgreSQL, MySQL)#

When running a long term deployment or a production system, we recommend using a full-fledged relational database, such as PostgreSQL or MySQL, that supports the SQL ALTER TABLE statement, which is used in some database upgrade steps.

In general, you select your database backend with JupyterHub.db_url, and can further configure it (usually not necessary) with JupyterHub.db_kwargs.

Notes and Tips#

SQLite#

The SQLite database should not be used on NFS. SQLite uses reader/writer locks to control access to the database. This locking mechanism might not work correctly if the database file is kept on an NFS filesystem. This is because fcntl() file locking is broken on many NFS implementations. Therefore, you should avoid putting SQLite database files on NFS since it will not handle well multiple processes which might try to access the file at the same time.

PostgreSQL#

We recommend using PostgreSQL for production if you are unsure whether to use MySQL or PostgreSQL or if you do not have a strong preference. There is additional configuration required for MySQL that is not needed for PostgreSQL.

For example, to connect to a postgres database with psycopg2:

  1. install psycopg2: pip instal psycopg2 (or psycopg2-binary to avoid compilation, which is not recommended for production)

  2. set authentication via environment variables PGUSER and PGPASSWORD

  3. configure JupyterHub.db_url:

    c.JupyterHub.db_url = "postgres+psycopg2://my-postgres-server:5432/my-db-name"
    

MySQL / MariaDB#

  • You should probably use the pymysql or mysqlclient sqlalchemy provider, or another backend recommended by sqlalchemy

  • You also need to set pool_recycle to some value (typically 60 - 300, JupyterHub will default to 60) which depends on your MySQL setup. This is necessary since MySQL kills connections serverside if they’ve been idle for a while, and the connection from the hub will be idle for longer than most connections. This behavior will lead to frustrating ‘the connection has gone away’ errors from sqlalchemy if pool_recycle is not set.

  • If you use utf8mb4 collation with MySQL earlier than 5.7.7 or MariaDB earlier than 10.2.1 you may get an 1709, Index column size too large error. To fix this you need to set innodb_large_prefix to enabled and innodb_file_format to Barracuda to allow for the index sizes jupyterhub uses. row_format will be set to DYNAMIC as long as those options are set correctly. Later versions of MariaDB and MySQL should set these values by default, as well as have a default DYNAMIC row_format and pose no trouble to users.

For example, to connect to a mysql database with mysqlclient:

  1. install mysqlclient: pip install mysqlclient

  2. configure JupyterHub.db_url:

    c.JupyterHub.db_url = "mysql+mysqldb://myuser:mypassword@my-sql-server:3306/my-db-name"