OpenStreetMap (OSM) is arguably the most extensive open data project for geospatial data. It holds rich information on points of interest (POIs), such as apartments, shops, and offices, across the globe. You can truly call it a data treasure. Big players like Apple, Amazon, and Facebook have realized that as well and build heavily upon this global community project.
To also enable smaller data science teams to benefit from this resource, we created an ETL pipeline to directly download the raw OSM data in the .pbf file format and store the transformed objects in a database. Leveraging Uber’s H3 indexing, you can then easily query those objects with radius or polygon queries.
In the rest of this article, I will show you how we approached the transformation of the raw data using the power of streams in Node.js. All of the code snippets are open-source, like everything else at Kuwala. You can find them on our GitHub.
Streams in Node.js are an elegant way to process big files without using much memory. Instead of loading the entire file into memory to transform or work with the data, a stream processes it in small batches. That way, even on files with millions of rows, memory consumption stays consistently low. Of course, this only works if you write the transformed data directly to a database or file system, or perform requests based on the data in the current batch, rather than accumulating everything in intermediate in-memory objects.
We can already use streams when downloading the OSM files since, depending on the region you want to download, the file can be several gigabytes in size. You first create a write stream that writes the data to a specified file path. You then pipe the data stream of the request-response object into it and continue once the stream has finished.
Now that we have downloaded the file, let's parse it into manageable objects and write them to a database. First, we create a read stream that reads the file. For parsing the OSM protocol buffer into JSON objects, we use the osm-pbf-parser npm package. With the pipe command, you can connect a readable stream to a writable stream. There is a special class, the Transform stream, which implements both the Readable and the Writable interface. Using it, we can break the entire filtering and transformation down into semantic streams and pipe them together.
After parsing the .pbf format to JSON, we want to create more concise objects based on the OSM tags, which are simply key-value pairs. The objects in the OSM file are ordered by type: first nodes, then ways, and lastly relations. This ordering makes a lot of sense since ways consist of nodes and relations consist of ways.
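A tag-to-POI mapping could look like this sketch (the selected keys and the shape of the output object are illustrative, not Kuwala's actual schema):

```javascript
// Build a concise POI object from an OSM element's tags, which are
// plain key-value pairs such as { amenity: 'cafe', name: 'Coffee Corner' }.
function toPoi(osmElement) {
  const tags = osmElement.tags || {};
  return {
    osmId: osmElement.id,
    name: tags.name || null,
    // Common top-level OSM keys used for POI classification.
    category: tags.amenity || tags.shop || tags.office || null,
    website: tags.website || null,
  };
}
```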
With our first transformer, we store all the nodes in a temporary database. We use LevelDB here, a fast key-value store developed by Google. Second, we filter out all objects that do not contain tags relevant for classifying them as a POI, e.g., nodes that only exist to represent a point in a building footprint. Third, we create building footprints (i.e., the outline of a building) for ways and relations. The OSM objects only store references to the nodes that form the footprint, so we retrieve the relevant nodes from LevelDB and generate the geometries in GeoJSON format. Next, we compute the centroid of each geometry, which another transformer converts to an H3 index. H3 is a spatial index developed by Uber; it is a great way to connect several geospatial data sources because you can quickly traverse from one resolution to another. In the next pipe, we aggregate tags relevant to POI properties such as categories, addresses, and contact details. Lastly, the transformed objects are wrapped in Mongoose objects to validate the schema before the last pipe writes them to the database.
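The footprint and centroid step can be sketched roughly as follows. A plain Map stands in for LevelDB, and the centroid is a simple vertex average rather than a true area-weighted polygon centroid; in the real pipeline, the centroid is then converted to an H3 index (e.g., with the h3-js package).

```javascript
// Resolve a way's node references to coordinates and build a GeoJSON
// Polygon plus a (simplified) centroid. nodeStore is a Map from node
// id to [lon, lat], standing in for the LevelDB lookup.
function buildFootprint(way, nodeStore) {
  const ring = way.refs.map((ref) => nodeStore.get(ref));
  const geometry = { type: 'Polygon', coordinates: [ring] };
  const centroid = ring
    .reduce(([lonSum, latSum], [lon, lat]) => [lonSum + lon, latSum + lat], [0, 0])
    .map((sum) => sum / ring.length);
  return { geometry, centroid };
}
```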
Having passed through these transformations, we now have handy objects that you can either load into your data warehouse or query over an API, for example, with radius and geometry queries.
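A radius query over the stored objects could look like the sketch below. For simplicity, it filters by haversine distance over the centroids instead of using H3 cell lookups, and all names are illustrative.

```javascript
// Great-circle distance in kilometers between two [lon, lat] points.
function haversineKm([lon1, lat1], [lon2, lat2]) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * 6371 * Math.asin(Math.sqrt(a));
}

// Return all POIs whose centroid lies within radiusKm of center.
function radiusQuery(pois, center, radiusKm) {
  return pois.filter((poi) => haversineKm(poi.centroid, center) <= radiusKm);
}
```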
At Kuwala, we want to unify geospatial data and make it easy to integrate and connect. To have a significant impact, we are entirely open-source and believe in the power of the community. If you work with data, make sure to star us on GitHub and join our Slack community.
UPDATE: We switched to Python for all of our pipelines since it has the highest adoption in our community of data scientists and engineers. However, the JS pipeline can still be found in the GitHub history, and we hope this post provides value for other projects as well. Because one thing is for sure: Node.js streams are extremely powerful.