Statistics Calculations

Underpass processes two input streams of data, but only one is used for the initial statistics calculations. ChangeSet data is used to populate some of the columns in the database, but isn't used during statistics calculations by the backend. Only the OsmChange file is used, as it contains the data that changed.

The OsmChange data file contains 3 categories of data: what was created, modified, or deleted. Currently only created and modified data are used for statistics calculations.

All of the changed data is parsed into a data structure that can be passed between classes. This data structure contains all the data as well as the associated action, create or modify. The parsed data is then passed to the OsmChangeFile::collectStats() method to do the calculations. While currently part of the core code, in the future this will be a plugin, allowing others to create different statistics calculations without modifying the core code.

Priority boundary

Currently, changes to be processed are filtered by ChangeSetFile::areaFilter() and OsmChangeFile::areaFilter(), using a boundary polygon. In some cases it is not possible to tell whether an OsmChange is inside the priority boundary. To disable the filtering in the replication process, you can add the argument --osmnoboundary for OsmChanges or --oscnoboundary for Changesets.
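
As an illustration of what filtering by a boundary polygon involves, here is a minimal Python sketch using a ray-casting point-in-polygon test. It is not the Underpass implementation: the names point_in_polygon and area_filter, and the no_boundary flag that stands in for the effect of --osmnoboundary or --oscnoboundary, are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (lon, lat)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count the polygon edges crossed by a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def area_filter(nodes: List[Point], boundary: List[Point],
                no_boundary: bool = False) -> List[Point]:
    """Keep only nodes inside the boundary; keep everything when filtering is disabled."""
    if no_boundary:
        return list(nodes)
    return [p for p in nodes if point_in_polygon(p, boundary)]
```

With a square boundary from (0, 0) to (10, 10), area_filter keeps the node (5, 5) and drops (15, 5), unless no_boundary is set. A point lying exactly on an edge is the kind of case where the answer is not clear-cut.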

What Is Collected

The original statistics counted buildings, waterways, and POIs. The new statistics break this down into two levels: an accumulated total for each kind of feature, such as buildings, and a more detailed breakdown, such as the type of building. The list of values is configurable through a YAML file.

OpenStreetMap features are tagged with keyword and value pairs. The keywords are loosely defined, and over time the definitions of some have changed, improved, or been extended. New mappers often get confused, so they may use keywords and values inconsistently.

For example, let's look at schools. There is a variety of ways school buildings are tagged. Sometimes it's building=school, with school as the value. Sometimes school is the keyword, and the value is the type of school. In this case, the type of school is accumulated, as well as a generic school count. Also if building=school isn't used, then every school also increments the count of buildings.

Each category of data has the total accumulated value, as well as the more detailed breakdown. For example, it's possible to extract statistics for only hospitals, which is a subset of all the buildings. As keywords and values can be in different feature categories, some are checked for in multiple ways. Some keywords and values may be spelled differently than the default, so variations are also looked for to be complete.

To find the common tags, several continents' worth of data was analyzed to find the most common patterns for the features we want to collect statistics for. Mixed in with the inconsistent tagging schemes are random capitalization, misspellings, and international spellings. An attempt is made to catch all reasonable variations. TagInfo was used to find totals for some variations; unusual tagging that is not very common is ignored to avoid performance impacts and data bloat.
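
The idea of keeping a category total alongside a detailed breakdown, while tolerating inconsistent capitalization, can be sketched in Python as follows. This is only an illustration under assumptions: STATS_CONFIG is a hard-coded stand-in whose shape is invented here (it is not the format of the real YAML file), and normalize and scan_tags are hypothetical names.

```python
from collections import Counter
from typing import Dict

# Hypothetical stand-in for the YAML configuration:
# category -> keyword -> accepted values (all lowercase).
STATS_CONFIG = {
    "buildings": {"building": {"yes", "house", "school", "hospital"}},
    "schools": {
        "building": {"school"},
        "amenity": {"school"},
        "school": {"primary", "secondary", "kindergarten"},
    },
}


def normalize(text: str) -> str:
    """Fold capitalization and stray whitespace so 'School ' matches 'school'."""
    return text.strip().lower()


def scan_tags(tags: Dict[str, str]) -> Counter:
    """Return what one feature contributes: one count per matched category plus detail."""
    stats: Counter = Counter()
    matched = set()
    for key, value in tags.items():
        key, value = normalize(key), normalize(value)
        for category, keywords in STATS_CONFIG.items():
            if value in keywords.get(key, ()):
                matched.add(category)
                stats[f"{key}={value}"] = 1  # detailed breakdown
    for category in matched:
        stats[category] = 1  # a feature counts once per category
    return stats
```

In this sketch, a feature tagged Building=School counts once toward buildings, once toward schools, and once toward the building=school detail, while a feature tagged amenity=school and school=primary counts as a single school.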

Building Types

Most buildings added by remote tracing of satellite imagery lack any metadata tags beyond building=yes. When local mappers import more detailed data, or update the existing metadata, those values get added. This is a common set of building values.

  • yes
  • house
  • residential
  • commercial
  • retail
  • commercial;residential
  • apartments
  • kitchen
  • roof
  • construction
  • school
  • clinic
  • hospital
  • office
  • public
  • church
  • mosque
  • temple
  • service
  • warehouse
  • industrial
  • kiosk
  • abandoned
  • cabin
  • bungalow
  • hotel
  • farm
  • hut
  • train_station
  • house_boat
  • barn
  • historic
  • latrine
  • latrines
  • toilet
  • toilets

Amenity Types

Most amenities are added by local mappers or through a data import. Not all amenities are buildings, but for our use case that's all that is analyzed. These are the common values for the amenity keyword.

  • hospital
  • school
  • clinic
  • kindergarten
  • drinking_water
  • health_facility
  • health_center
  • healthcare

Places Types

Places contain multiple features, and are mostly used to determine local metadata improvements.

  • village
  • hamlet
  • neighborhood
  • city
  • town

Highway Types

Highways traced from satellite imagery often lack metadata beyond the functional type. For example, a highway that connects two villages is easy to determine. The total number of highways and their total length in kilometers are stored as aggregates. More detailed statistics are also kept for when more detail is needed.

  • trunk
  • tertiary
  • secondary
  • unclassified
  • track
  • residential
  • path
  • service
  • bridge
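
To make the aggregate and per-type highway totals concrete, here is a rough Python sketch that sums way lengths in kilometers. It assumes great-circle (haversine) distances between consecutive nodes, which may differ from how Underpass measures length, and the key names such as highway_km are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (lon, lat) in degrees
EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def way_length_km(coords: List[Point]) -> float:
    """Sum the distances between consecutive nodes of a way."""
    return sum(haversine_km(p, q) for p, q in zip(coords, coords[1:]))


def highway_stats(ways: List[Tuple[str, List[Point]]]) -> Dict[str, float]:
    """ways holds (highway value, node coordinates); returns aggregate and per-type totals."""
    stats: Dict[str, float] = {"highway": 0, "highway_km": 0.0}
    for kind, coords in ways:
        km = way_length_km(coords)
        stats["highway"] += 1
        stats["highway_km"] += km
        stats[kind] = stats.get(kind, 0) + 1
        stats[f"{kind}_km"] = stats.get(f"{kind}_km", 0.0) + km
    return stats
```

Passing a track and a path to highway_stats yields a highway count of 2 with the combined length, plus separate counts and lengths for track and path.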

School Types

When school is a keyword, there are several values for the type of school. An aggregate total of schools can be calculated, as well as detail for the type of school.

  • primary
  • secondary
  • kindergarten

Calculation Data flow

The data is processed by Underpass. Underpass downloads the changeset and the OsmChange files from the OpenStreetMap planet server every minute. Downloaded files are also cached on disk, so it's also possible to process data without a network connection. Once the data is parsed from the respective data formats, it gets passed to the OsmChangeFile::collectStats() method. That method loops through the data structure containing the changes to the map data. Within that loop, it calls ChangeSetFile::scanTags(), which does all the real work. The scanTags() method uses StatsConfigSearch::search() to search the lists of keywords and values configured in the stats configuration file. The scanTags() method returns an array of statistics for the desired features. That array is then converted by collectStats() into the statistics data structure, and control returns to the processing thread.

The processing thread then passes the statistics data to osmstats::applyChange(), to insert them into the database.
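
The flow described above can be approximated by the Python sketch below. It only mirrors the shape of the loop: the Change and ChangeStats classes, the simplified scan_tags stand-in, and collect_stats are hypothetical and are not the Underpass C++ API.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class Change:
    """Hypothetical parsed OsmChange entry: one feature plus the action applied to it."""
    changeset_id: int
    action: str  # "create", "modify" or "delete"
    tags: Dict[str, str]


@dataclass
class ChangeStats:
    """Hypothetical per-changeset result, mirroring the added and modified columns."""
    added: Counter = field(default_factory=Counter)
    modified: Counter = field(default_factory=Counter)


def scan_tags(tags: Dict[str, str]) -> Counter:
    """Simplified stand-in for the tag scan: count only the keywords of interest."""
    wanted = {"building", "highway", "amenity"}
    return Counter(key for key in tags if key in wanted)


def collect_stats(changes: Iterable[Change]) -> Dict[int, ChangeStats]:
    """Loop over the parsed changes and accumulate statistics per changeset."""
    stats: Dict[int, ChangeStats] = defaultdict(ChangeStats)
    for change in changes:
        if change.action == "create":
            stats[change.changeset_id].added += scan_tags(change.tags)
        elif change.action == "modify":
            stats[change.changeset_id].modified += scan_tags(change.tags)
        # Deleted data is not used for the statistics.
    return dict(stats)
```

The returned mapping, keyed by changeset ID, plays the role of the statistics data structure that the processing thread would then hand on for insertion into the database.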

changesets table

This is the primary table used to contain the data for each changeset. All of the data is stored in a single table for better query performance on large data sets. It also reduces the need for SQL subqueries or a JOIN between tables, reducing complexity.

An hstore is used to store the statistics, instead of a separate table for each feature, to allow for more flexibility. An hstore holds key and value pairs, so multiple data items can be stored in a single column, indexed by the key. This allows more features to be added by the backend and the frontend without having to modify the database schema.

An example query to count the total number of buildings added by the user 4321 for a Tasking Manager project 1234 would be this:

SELECT SUM(CAST(added::hstore->'building' AS DOUBLE precision)) FROM changesets WHERE 'hotosm-project-1234' = ANY(hashtags) AND uid=4321;
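
When a query like this is issued from application code, the feature name, project number, and user ID are usually passed as parameters rather than pasted into the SQL text. The Python sketch below builds such a query. It assumes a DB-API driver that uses %s placeholders (psycopg2 is one such driver), and feature_total_query is a hypothetical helper, not part of Underpass.

```python
from typing import Tuple


def feature_total_query(feature: str, project_id: int, uid: int) -> Tuple[str, tuple]:
    """Build the SQL text and parameters for summing one key of the added hstore column."""
    sql = (
        "SELECT SUM(CAST(added::hstore->%s AS DOUBLE PRECISION)) "
        "FROM changesets WHERE %s = ANY(hashtags) AND uid = %s;"
    )
    # The hashtag follows the 'hotosm-project-<id>' pattern used in the example query.
    params = (feature, f"hotosm-project-{project_id}", uid)
    return sql, params
```

A caller would then run something like cursor.execute(*feature_total_query("building", 1234, 4321)) on an open database cursor.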

The source is the satellite imagery used for remote mapping.


Keyword       Description
id            The ID of this changeset
editor        The editor used for this changeset
uid           The OSM User ID of the mapper
created_at    The timestamp when this change was uploaded
closed_at     The timestamp when this uploaded change completed processing
updated_at    The timestamp when this last had data updated
added         An hstore array of the added map features
modified      An hstore array of the modified map features
deleted       An hstore array of the deleted map features
hashtags      An array of the hashtags used for this changeset
source        The imagery source used for this changeset
bbox          The bounding box of this changeset