Analysing Southwark’s natural geography

Following my map of London’s green and blue infrastructure, I have been working on some analysis of the land uses.

I was inspired and encouraged to try this by Liliana’s interesting work called “imagining all of Southwark”. Lili and Ari have managed to get the council to release lots of data on properties and car parking, and they are producing analysis of this data by postal code area and by street. They haven’t managed to get anything on land uses, so I thought, why not produce this with OpenStreetMap data?

A few evenings later, here is the result, shared on Google Docs (direct link), covering the eight postal code areas that between them cover most of the borough (SE1, SE5, SE15, SE16, SE17, SE21, SE22, SE24).

What the data means

The “summary” worksheet shows the total land area, expressed in hectares (10,000 m²), for various types of land coverage. I have also calculated the percentage of each postal code area that each land use represents, which gives an interesting insight into the differences between the areas.

Some of the land uses will overlap, for example miscellaneous bits of green space are often mapped on top of residential areas. So the numbers aren’t supposed to add up to anything like 100%.
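
As a minimal sketch of the arithmetic behind the summary, the snippet below converts square metres to hectares and works out percentages. The figures are made up for illustration and are not taken from the spreadsheet; they are chosen so that overlapping land uses push the total past 100%.

```python
SQUARE_METRES_PER_HECTARE = 10_000


def to_hectares(area_m2):
    """Convert an area in square metres to hectares."""
    return area_m2 / SQUARE_METRES_PER_HECTARE


def percent_of(part_m2, whole_m2):
    """Express one area as a percentage of another."""
    return 100 * part_m2 / whole_m2


# Hypothetical figures for one postal code area (not real Southwark data).
postcode_area_m2 = 2_000_000
land_uses_m2 = {
    "residential": 1_500_000,
    "parks and commons": 400_000,
    # Green space mapped on top of residential land is counted twice.
    "misc green spaces": 300_000,
}

for name, area_m2 in land_uses_m2.items():
    share = percent_of(area_m2, postcode_area_m2)
    print(f"{name}: {to_hectares(area_m2):.1f} ha, {share:.1f}%")

total_percent = sum(percent_of(a, postcode_area_m2) for a in land_uses_m2.values())
print(f"total: {total_percent:.1f}%")
```

With these invented numbers the three land uses sum to more than the area of the postal code, which is exactly the effect of overlapping polygons.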

The spreadsheet also contains worksheets for each postal code area. These contain a dump of all the objects in OpenStreetMap in those postal code areas, and this is the raw data the summary spreadsheet uses to get the totals.

Flaws in the data

You should use this data with a large spoonful of salt. Here are the significant flaws I have noticed:

Postal code areas are approximate. For example, the boundary between SE15 and SE22 should mark the boundary between Peckham Rye Common (SE15) and Peckham Rye Park (SE22), but in my data both the park and the common show up in both postal codes because the inferred boundary isn’t quite right. Read down to my method to see why. The errors are pretty tiny in most places (plus or minus a few meters along the full boundary) and probably cancel out for big land uses like residential, but they probably introduce some significant errors for parks, where the boundaries go awry by 20-30m in places. Sadly there aren’t any accurate open data polygons I can use.

Data is missing because OpenStreetMap contributors haven’t mapped it. Of course the easy solution here is to get more of it mapped and up to date! My estimate of the different types is as follows:

  • Allotments: complete for the whole borough.
  • Parks and commons: all major and district parks complete.
  • Misc green spaces: very poor coverage of, for example, large areas of grass on estates, especially in SE5, the north part of SE15 and SE17.
  • Woods/forest: all major woods complete, coverage of big clumps of trees e.g. on a housing estate or in a park is very uneven.
  • Residential: complete except for SE16.
  • Industrial, retail, commercial: large areas are complete, but small shopping parades, industrial parks and rows of offices are very patchy.
  • Brownfield/construction: patchy across the borough and sometimes out of date as sites are built on.

Data is also sometimes missing because of flaws in the Geofabrik shapefiles, not all of which I have corrected. For example, I noticed they were missing commons so I manually added those in, but I may have missed other land uses. One major omission, a shame given the interest in them, is the humble sports pitch/playing field.

How I produced this

After a lot of experimentation – I’ve never been trained to use GIS tools – I worked out this method. If you know of an easier way I’d love to hear about it.

  1. Prepare the boundary data:
    1. Extract a polygon for the London Borough of Southwark from the OS Boundary-Line data.
    2. Download the OS Code-Point-Open data, open the spreadsheet for the SE area in QGIS and use the ftools ‘Voronoi polygons’ plugin to infer polygons for the postal codes from the centroids. Postal code centroids are very dense in the middle of residential areas, so the boundary between SE15 4HR and SE22 9BD is only going to be out by a few meters, but they are quite far apart around large parks and commons, so the inferred boundaries get less accurate in those areas. See this map for an illustration of the Peckham Rye Park / Common problem mentioned above.
    3. Merge together postal codes into the areas (e.g. SE22 9QF, SE22 4DU etc. into SE22) by querying the shapefile for all objects with postal codes starting with SE22, then using the mmqgis merge tool to merge them into single polygons. Clean up the attributes so the shapefile just has one attribute for the correct postal code area.
    4. Clip the postal codes by the Southwark polygon and save the result – finally – as the postal codes shapefile for Southwark.
  2. Prepare the land use data:
    1. Download the OpenStreetMap shapefiles from Geofabrik for Greater London.
    2. Download common and marsh ways/relations using the Overpass API (with the meta flag on), import the data into QGIS using the OpenStreetMap plugin, and save the data as a Shapefile.
    3. Merge together the Geofabrik natural and landuse shapefiles with my Overpass-derived shapefile into one land use shape file using the mmqgis plugin.
    4. Clip the land use file by the Southwark polygon and save the result – finally – as the land uses shapefile for Southwark.
  3. Produce the postal code stats; for each postal code:
    1. Select the postal code, and clip the land use layer to that selected code, saving it as a new shapefile.
    2. Open that shapefile, then save it in a new projection that will be in meters rather than degrees (I used EPSG:32631 – WGS 84 / UTM zone 31N).
    3. Open the new shapefile, then run the ftools ‘Export/add geometry columns’ tool (in Vector/Geometry Tools) to add two attributes to the objects for the area and perimeter.
    4. Save the layer again as a CSV file.
  4. Produce the stats for the area of each postal code so we can calculate % of the area as well as ha for each land use:
    1. Save the Southwark postal codes polygon in the meters projection, add the geometry columns, and save as a CSV file.
  5. Collate all the data:
    1. Tidy up and copy the data from each CSV file into a spreadsheet, then add in the formulae to tot everything up. You’re done!
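
The collation in step five was done by hand in a spreadsheet, but the same totals could be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names `type` and `AREA` are guesses at what the exported CSV contains, the sample rows are invented, and the area values are assumed to be in square metres from the projected layer.

```python
import csv
import io
from collections import defaultdict


def total_hectares_by_type(csv_file, type_column="type", area_column="AREA"):
    """Sum the area column per land use type and return totals in hectares.

    The column names are assumptions; adjust them to match the real export.
    """
    totals_m2 = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals_m2[row[type_column]] += float(row[area_column])
    # 1 hectare = 10,000 square metres.
    return {land_use: area / 10_000 for land_use, area in totals_m2.items()}


# Invented rows standing in for one postal code area's CSV export.
sample_csv = io.StringIO(
    "osm_id,name,type,AREA,PERIMETER\n"
    "1,Example Park,park,25000.0,640.0\n"
    "2,,residential,120000.0,1500.0\n"
    "3,Another Park,park,5000.0,300.0\n"
)

totals = total_hectares_by_type(sample_csv)
for land_use, hectares in sorted(totals.items()):
    print(f"{land_use}: {hectares:.1f} ha")
```

To use it on a real export, pass an open file object instead of the `io.StringIO` sample, for example `open("se22.csv", newline="")` (a hypothetical filename).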

For reference, some of the totals in the summary work off more than one land use type so here are the categories and the corresponding OpenStreetMap tags:

  • Allotments – landuse=allotments
  • Parks and commons – leisure=park / leisure=common
  • Misc green spaces – landuse=conservation / landuse=farm / leisure=garden / landuse=grass / landuse=greenfield / landuse=greenspace / landuse=meadow / landuse=orchard / landuse=recreation_ground
  • Woods and forest – landuse=forest / natural=wood
  • Residential, industrial, retail, commercial, brownfield, construction – corresponding landuse tags
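
The grouping above can be written down as a lookup table, which makes it easy to see which tag feeds which summary row. This is a sketch of the mapping only; the category names follow the list above, and the function name `categorise` is made up for illustration.

```python
# Summary categories and the OpenStreetMap tags that feed them, as (key, value) pairs.
CATEGORY_TAGS = {
    "Allotments": [("landuse", "allotments")],
    "Parks and commons": [("leisure", "park"), ("leisure", "common")],
    "Misc green spaces": [
        ("landuse", "conservation"),
        ("landuse", "farm"),
        ("leisure", "garden"),
        ("landuse", "grass"),
        ("landuse", "greenfield"),
        ("landuse", "greenspace"),
        ("landuse", "meadow"),
        ("landuse", "orchard"),
        ("landuse", "recreation_ground"),
    ],
    "Woods and forest": [("landuse", "forest"), ("natural", "wood")],
}

# The remaining categories each map to the landuse tag of the same name.
for value in ("residential", "industrial", "retail", "commercial",
              "brownfield", "construction"):
    CATEGORY_TAGS[value.capitalize()] = [("landuse", value)]

# Invert the table so a tag can be looked up directly.
TAG_TO_CATEGORY = {
    tag: category for category, tags in CATEGORY_TAGS.items() for tag in tags
}


def categorise(key, value):
    """Return the summary category for a tag, or None if it is not counted."""
    return TAG_TO_CATEGORY.get((key, value))


print(categorise("leisure", "common"))
print(categorise("natural", "wood"))
print(categorise("highway", "primary"))
```

Tags that are not in the table, such as roads, return `None` and would simply be left out of the totals.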

Future ideas

One obvious improvement would be to get more data in. Perhaps this first analysis will encourage people to help out with that? I have also emailed Geofabrik about the flaws I have discovered in their shapefiles, so I hope those get fixed.

Another thought is to produce the stats by council ward. But given that there are far more wards, I’d like to find a quicker way of producing the stats for each ward (step three above) first.

It would also be interesting to do it by town/suburb, for example comparing Peckham to East Dulwich. But we don’t have any meaningful boundaries for those natural areas. It would be really interesting to do a mass version of “this isn’t fucking Dalston” for a whole borough, using the Voronoi polygons method to infer areas from surveys at thousands of locations around the borough. One day…
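
To illustrate the idea, a Voronoi polygon is just the set of places that are closer to one point than to any other, so the survey version boils down to asking "which surveyed location is nearest?". The sketch below assumes hypothetical survey points with coordinates in meters and made-up answers; it is a nearest-point lookup, not the QGIS plugin itself.

```python
import math


def nearest_label(point, labelled_points):
    """Return the label of the survey point closest to `point`.

    All locations that share the same nearest survey point make up that
    point's Voronoi cell, so this gives the same answer as looking up
    which Voronoi polygon the location falls in.
    """
    closest = min(labelled_points, key=lambda item: math.dist(point, item[0]))
    return closest[1]


# Hypothetical survey results: (x, y) in meters, and what people called the place.
surveys = [
    ((0.0, 0.0), "Peckham"),
    ((1000.0, 0.0), "East Dulwich"),
    ((0.0, 1000.0), "Peckham"),
]

print(nearest_label((200.0, 100.0), surveys))
print(nearest_label((900.0, 50.0), surveys))
```

Cells whose survey points gave the same answer would then be merged into one area, in the same way that the individual postal codes were merged into SE22 and so on. The sparser the survey points, the rougher the inferred boundary, which is the same weakness seen around Peckham Rye.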

10 Comments

  1. chrisosm said:

    A hectare is 10,000 square metres.

    2nd November 2012
    • Tom Chance said:

      Oops, missing zero, thanks. Got it right in the spreadsheet, wrong on the blog post!

      2nd November 2012
  2. Pretty cool. One day I should try to follow these steps for Islington, if only to learn some QGIS tricks.

    Landuse data is a bit of a menace in terms of cluttering up the data and being difficult to untangle within the editors, not-to-mention cluttering up the ‘standard’ rendering. Plus there’s lots of open questions about how we should actually be doing it …but it becomes quite interesting when you do analysis like this.

    Presumably you could easily add a couple of rows to this spreadsheet to show “total area” that you’re basing the percentages off, and then also the area/percent of “unknown” as in unmapped landuse. I wonder how much of that would be made up of just thin roads and slivers of land area in between landuse polygons. Some sort of buffering trick would reveal just the larger blobs of unmapped landuse. …but maybe there aren’t any in Southwark. I notice somebody’s done a lot of landuse east of this line, but not west.

    What happens if there’s a patch of woodland drawn on top of a patch of park? I suppose it should count towards just the woodland total.

    2nd November 2012
    • Tom Chance said:

      Woodland on top of parks, unless there’s some clever multipolygon work, will double count the area. So the total is, as I explained in my original post, a bit meaningless.

      In Southwark I noticed a lot of gaps for railway embankments and sidings and objects not captured by Geofabrik’s shapefiles (like sports pitches as mentioned). The residential/retail land uses are pretty tight for most of the borough so one would only need a small buffer to run up the edge of the area on the other side of the road. Then again, how much of the land is taken up by roads? A personal bugbear of mine, but not something we measure. Most landuse areas go right up to the edge of the way (i.e. within a meter or less of the centre of the road), but if you look at aerial imagery there’s an argument for stopping them at the edge of pavements. It looks worse when rendered, to my mind, but is more accurate.

      I’m slightly relaxed about these flaws. The biggest flaw in our data is its incomplete-ness, because I suppose nobody has cared that much about getting every last brownfield site in, and our old problem with deprived areas being poorly covered means loads of bits of grass on housing estates are missing. And so on.

      As you say, I usually find flaws in our data through actually using it, more often than poking around in JOSM. Give it a go, it’s quite interesting what you can do!

      2nd November 2012
  3. lili said:

    fantastic stuff, thank you so much!

    2nd November 2012
  4. Alex said:

    “this isn’t fucking Dalston” reminds me of some analysis I have seen in NYC where they found that you could break up Manhattan into neighborhoods based on where the cabs picked up and dropped off. It showed that, for the most part, there was socially a type of person that lived in each of the five areas of Manhattan (at least in the cab-using population), and those areas actually corresponded to the common-sense understanding that everyone in the city had about each part of town rather than where the technical boundaries were. The data seemed to show there really was a sense that the people who lived in the Village went to certain haunts, and did the same kinds of things that people who lived around them did. Cool stuff.

    2nd November 2012
    • Tom Chance said:

      That’s great. The classic complication in London is the estate agent definition, closely resembling the social climber definition. I also wonder what you’d get if you worked out the definition of, say, Peckham by separately asking arty graduates, settled wealthy families and lifelong Peckham-residing council tenants. Parallel lives, and all that.

      2nd November 2012
  5. Ian Babb said:

    The landuse profile is certainly useful, but I would agree insomuch that some of the data is grossly generalised! A fascinating read though, Tom 🙂

    19th November 2012
