
Overall architecture

Overview

Our World in Data is a statically rendered ("JAMstack") publication with no public-facing backend. It has two primary codebases:

  • owid-grapher: a TypeScript project used to build the overall publication and its admin tools
  • etl: a Python data-pipeline used to build the datasets that power our site

There are four different types of users of our infrastructure: readers, authors, data scientists, and engineers.

```mermaid
graph TB

reader(["👤 reader"]) -->|reads| owid[Our World in Data site]
author(["👤 author"]) -->|writes articles| owid
data(["👤 data scientist"]) -->|manages data| owid
engineer(["👤 engineer"]) -->|deploys changes| owid
```

If we break it down one layer further, we can see two main workspaces for staff:

  • An internal admin site for managing the site and its data, used by authors and data scientists
  • A data pipeline used by data scientists to import and transform datasets

The output of these two workspaces is our live site that readers access.

Authors

Data scientists

  • Data scientists are responsible for finding and importing data from upstream providers, and putting it through our editorial process
  • They may use the grapher admin for small datasets, or our ETL for larger datasets
  • Our ETL is a custom Python project designed to be executable by the public
  • It defines the following stages of data processing (a toy sketch of the middle stages follows the diagram below):
    • upstream: the original data, as published by the provider
    • snapshot: a copy of the original data, for reproducibility
    • meadow: the data reformatted to a standard format
    • garden: the data harmonized to use reference names for countries and regions
    • mysql: the data transformed for our charting tool Grapher
  • A common challenge is updating existing charts to use newly available data
    • The grapher admin supports a chart revision tool that helps with this process
```mermaid
graph TB

subgraph etl["ETL"]
    direction LR
    upstream -.->|save| snapshot
    snapshot -->|reformat| meadow -->|harmonize| garden
end

grapher[grapher admin]
grapher --> mysql

dm(["👤 data scientist"]) -->|manage large datasets| etl
dm -->|manage charts & small data| grapher
etl --> mysql
```
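
To make the stages more concrete, here is a minimal sketch of the kind of transformation the meadow and garden steps perform. It is a toy example in pandas, not code from our ETL; the columns and country mapping are invented for illustration.

```python
import pandas as pd

# --- meadow: reformat the raw snapshot into a standard tidy format ---
# (hypothetical columns; real snapshots vary by provider)
raw = pd.DataFrame({
    "Country Name": ["USA", "Deutschland"],
    "2019": [78.8, 81.2],
    "2020": [77.0, 81.1],
})
meadow = raw.melt(id_vars="Country Name", var_name="year", value_name="life_expectancy")
meadow["year"] = meadow["year"].astype(int)

# --- garden: harmonize entity names against a reference country list ---
# (toy mapping; the real pipeline uses a maintained reference set)
COUNTRY_MAPPING = {"USA": "United States", "Deutschland": "Germany"}
garden = meadow.rename(columns={"Country Name": "country"})
garden["country"] = garden["country"].map(COUNTRY_MAPPING)

print(garden)
```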

Documentation on our data pipeline is available here.

Engineers

  • Engineers are responsible for deploying changes to the site
  • They tend to work either in the TypeScript owid-grapher codebase or the Python etl codebase
    • Each of these repos describes its own dev setup
  • Developers have access to staging environments
  • Operationally, internal tools run on a number of small hand-managed servers in DigitalOcean, and on a high-capacity dedicated Hetzner server called foundation
    • foundation is used as a host for LXC containers
  • Most development is done on local machines, using Docker for supporting services
  • We track our work using GitHub issues and GitHub Projects
  • We are increasingly using Buildkite for CI, and LXC containers for new services we need

A map of our infrastructure is as follows:

```mermaid
graph TB

subgraph Hetzner
    subgraph FSN1[FSN1 region]
        subgraph foundation[foundation LXC host]
            subgraph covid-cron
                covid-19
            end
            subgraph etl-build
                buildkite-agent
            end
            subgraph etl-prod
                etl
            end
        end
    end
end

subgraph digitalocean[Digital Ocean]
    subgraph LON1[LON1 region]
        subgraph owid-live
            grapher-admin
            wordpress-admin
            baker
        end
        subgraph owid-live-db
            mysql
        end
        grapher-admin --> mysql
        wordpress-admin --> mysql
        baker --> mysql
    end

    subgraph NYC1[NYC1 region]
        subgraph s3
            covid-19-data
            data-catalog
        end
    end
end

subgraph github
    covid-19-repo
    etl-repo
end

etl --> data-catalog
etl --> mysql
covid-19 --> mysql
covid-19 --> covid-19-data
covid-19 --> covid-19-repo
buildkite-agent --> etl-repo
buildkite-agent --> data-catalog
```

Transitions underway

Google Docs and ArchieML

  • We are currently moving away from WordPress and towards Google Docs with ArchieML markup for authoring (a toy illustration of the format follows this list)
    • This balances ease of authoring and ease of review in the editorial process
  • Our goal in 2023 is to move off WordPress entirely and convert all existing content to Google Docs
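
For background, ArchieML is a plain-text markup designed so that structured content can be written comfortably in a word processor like Google Docs. Below is a toy illustration of its basic key: value syntax with a minimal hand-rolled parser; this is not our production parser, and the field names are invented.

```python
# Toy ArchieML-style document (invented fields, illustration only).
# Real ArchieML also supports arrays, nested scopes, and multiline values.
DOC = """\
title: Life expectancy
subtitle: How long do people live, and how has this changed?
byline: Our World in Data team
"""

def parse_flat_archieml(text: str) -> dict:
    """Parse top-level `key: value` lines only."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            result[key.strip()] = value.strip()
    return result

print(parse_flat_archieml(DOC))
```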

New fast track for small data

  • Part of the grapher admin is dedicated to importing small datasets
  • We nickname it the "fast track"
  • We are building a new fast track on top of the ETL, using Google Sheets as the authoring tool (sketched below)
  • This will make the ETL authoritative for data values, but non-authoritative for metadata
  • Data curation tools on top of MySQL may layer metadata edits on top, making MySQL authoritative for metadata
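
As a rough sketch of how a Google Sheets-based fast track can work: any sheet shared by link can be read via its CSV export URL, so the ETL can ingest it directly. The sheet ID and columns below are placeholders, not a real OWID sheet.

```python
import pandas as pd

# Hypothetical sheet ID; a sheet shared "anyone with the link" exposes a CSV export.
SHEET_ID = "YOUR_SHEET_ID"
csv_url = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv"

# Data values read this way would be authoritative in the ETL;
# metadata may later be layered on top in MySQL by curation tools.
df = pd.read_csv(csv_url)
print(df.head())
```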

Last update: February 7, 2023