Cleaning OpenStreetMap Data

Although maps can carry an awful lot of information, they might seem at first to be a relatively simple data structure: identify a point in space and then associate a bunch of values to it – street number, land cover type, public water fountain, stoplight… you name it.

But once you start to pull on that thread, you realize it’s a much more complicated tapestry. Certain points need to be associated with one another so that you know they all belong to a pathway like a road or a river bank, or to the outline of your house’s foundation. Not just that, but they may need to be associated in a particular order so that you know what direction the road is going or where the riverbank meanders. And now it’s not just those individual points that need attached data: the paths do too, so that we can have road names and building names and administrative boundaries.

A few of Pittsburgh’s bridges as seen on

Pull on the thread even more, and soon you realize that often those individual paths also need to be grouped together. A bus route runs over a multitude of different roads, a university campus comprises the outlines of many disconnected buildings, an archipelago holds a series of small islands, each with its own coastline. And now these relationships too need names and values and keys.

This was the complicated information landscape of points, paths, and relationships that we encountered when we turned to OpenStreetMap (or OSM) for data about the roads and bridges of Pittsburgh. In their own words, OSM “is built by a community of mappers that contribute and maintain data about roads, trails, cafés, railway stations, and much more, all over the world.” It’s not inaccurate to call it the Wikipedia of mapping. Right now, today, you can make an account and start editing and adding to the global map wherever you like. And adding elements is surprisingly simple, as OSM is made of only three core elements:

  • nodes (defining points in space),
  • ways (defining linear features and area boundaries), and
  • relations (which are sometimes used to explain how other elements work together).

OSM image of Pittsburgh with all the underlying map features displayed
The same view of Pittsburgh, now showing all the underlying nodes and ways
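
To make the three element types concrete, here is a minimal Python sketch of how nodes, ways, and relations fit together. The class names and fields are our own simplification for illustration, not an official OSM schema or library, and the IDs, coordinates, and tag values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A single point in space, optionally carrying tags."""
    id: int
    lat: float
    lon: float
    tags: dict[str, str] = field(default_factory=dict)


@dataclass
class Way:
    """An ordered list of node IDs: a road, a riverbank, a building outline."""
    id: int
    node_ids: list[int]
    tags: dict[str, str] = field(default_factory=dict)


@dataclass
class Member:
    """One entry in a relation: which element it points to, and in what role."""
    type: str  # "node", "way", or "relation"
    ref: int
    role: str = ""


@dataclass
class Relation:
    """A group of other elements, such as a bus route or a multi-lane bridge."""
    id: int
    members: list[Member]
    tags: dict[str, str] = field(default_factory=dict)


# Made-up example data: three points, one road through them, one bus route.
nodes = {
    1: Node(1, 40.4460, -80.0030),
    2: Node(2, 40.4475, -80.0041),
    3: Node(3, 40.4490, -80.0052, {"highway": "traffic_signals"}),
}
road = Way(10, [1, 2, 3], {"highway": "primary", "name": "Example Street"})
route = Relation(100, [Member("way", 10)], {"type": "route", "route": "bus"})


def way_coordinates(way: Way, nodes: dict[int, Node]) -> list[tuple[float, float]]:
    """Resolve a way's node IDs into (lat, lon) pairs, preserving their order."""
    return [(nodes[node_id].lat, nodes[node_id].lon) for node_id in way.node_ids]
```

Note that a way stores only node IDs, not coordinates, so the order of `node_ids` is what tells you which direction the road runs.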

How do you get from just these three geometric primitives to being able to describe canyons, DMZs, funiculars, dedicated bike lanes, and, most importantly for us, bridges? As with Wikipedia, the community of users who author and edit OSM have gradually developed their own loose “free tagging” system, so you can mark any given road as a motorway versus a residential street, whether foot traffic is allowed on it, what the speed limit is, and any number of other possible attributes. Lucky for us, there’s a pretty straightforward “bridge” attribute. Because the tags are all free text, we encountered some interesting values for that tag: “yes”, “no”, and (for at least one pathway) “viaduct” (!). But at least this gave us a start at identifying bridges.
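
As a rough sketch of what that first pass can look like, the snippet below filters ways on their “bridge” tag. The function name and the sample ways are invented for illustration, and treating every value other than “no” as a bridge is an assumption that fits the values we saw, not a rule taken from OSM.

```python
def is_bridge(tags: dict[str, str]) -> bool:
    """Treat a way as a bridge if it has a bridge tag with any value except 'no'."""
    value = tags.get("bridge", "").strip().lower()
    return value not in ("", "no")


# Made-up sample ways, each reduced to an ID and its free-text tags.
ways = [
    {"id": 1, "tags": {"highway": "motorway", "bridge": "yes"}},
    {"id": 2, "tags": {"highway": "residential"}},
    {"id": 3, "tags": {"highway": "primary", "bridge": "viaduct"}},
    {"id": 4, "tags": {"highway": "secondary", "bridge": "no"}},
]

bridge_ids = [way["id"] for way in ways if is_bridge(way["tags"])]
print(bridge_ids)  # [1, 3]
```

Matching on “not no” rather than “equals yes” is what lets the lone “viaduct” through; with free-text tags it is worth listing the distinct values in your data before settling on a rule.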

A more complex problem reared its head when we got to multi-lane bridges, however. Prominent crossings like Veterans Bridge, which need multiple OSM “ways” to represent their many lanes and on/off-ramps, had overarching “relations” marking all of their components as belonging to the same bridge. That way, we would know when charting a path that even if we crossed just one lane of that bridge, we could check off all the rest as “crossed” and so keep our path from doubling back. But many more minor multi-lane bridges, like highway overpasses, didn’t have this extra metadata, so our pathfinding algorithm treated each lane as a separate bridge to be crossed and doubled back and forth quite a lot.

OSM view of Veterans Bridge in Pittsburgh
Veterans Bridge is a particularly complex bridge in Pittsburgh
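
The bookkeeping described above can be sketched in a few lines of Python. This is an illustration under our own assumptions rather than our actual pathfinding code: the function names and IDs are hypothetical, each relation is reduced to a list of member way IDs, and each way is assumed to belong to at most one bridge relation.

```python
def build_bridge_groups(
    bridge_way_ids: list[int], relations: dict[int, list[int]]
) -> dict[int, set[int]]:
    """Map each bridge way to the set of ways that make up the same bridge.

    Ways that appear in no relation end up in a group of their own.
    Assumes a way belongs to at most one bridge relation.
    """
    groups = {way_id: {way_id} for way_id in bridge_way_ids}
    for member_way_ids in relations.values():
        related = {w for w in member_way_ids if w in groups}
        for way_id in related:
            groups[way_id] = related
    return groups


def mark_crossed(way_id: int, groups: dict[int, set[int]], crossed: set[int]) -> None:
    """Crossing one way checks off every way grouped with it."""
    crossed.update(groups[way_id])


# Ways 11-13 are lanes of one large bridge tied together by a relation.
# Ways 21 and 22 are two lanes of an overpass with no relation at all.
bridge_way_ids = [11, 12, 13, 21, 22]
relations = {900: [11, 12, 13]}

groups = build_bridge_groups(bridge_way_ids, relations)
crossed: set[int] = set()
mark_crossed(11, groups, crossed)

remaining = sorted(set(bridge_way_ids) - crossed)
print(remaining)  # [21, 22]
```

Crossing a single lane of the large bridge clears all three of its ways, while the overpass still counts as two separate bridges to visit, which is exactly the doubling back we ran into.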

We had hoped there would be a way to automatically detect which separate bridges were close enough or parallel enough that they should be related together during our pathfinding search. But for every automated solution we thought we found, we’d quickly find an exception – a very short bridge with three lanes that was wider than it was long; a long curving bridge whose exit ramps fanned out in all directions; a bridge that intersected with another one; two different bridges that ran parallel before heading in different directions.

So we took a different tack: if we couldn’t automate it, why not just use OpenStreetMap the way it was meant to be used, by editing it by hand? We made up an old-fashioned paper map of all the bridges in our data and flagged the ones that looked suspiciously close to one another, which were our candidates for merging. Then, as part of a CMU Libraries workshop on how to edit OSM, we worked with our attendees to add “relations” over all these small bridges, cleaning up data for our analysis while also making sure that those small changes were upstreamed back to the entire OSM community.

Photograph of a printed-out poster-sized version of the Pittsburgh bridge map with handwritten annotations
One of our printed maps with marker annotations where we found problematic bridges that needed to be fixed.

Like almost all data-driven research, this time spent understanding our source data and reshaping it to fit our needs took up the vast majority of the project. Once we had this pipeline in place, we could finally move on to our original problem: how do we find a path, any path, that will cross through every bridge in the city?