Property Listings

Showcasing the raw data that I scraped. We can find the number of times a property has been listed and all previous sales and rental listings. In the second half of the post, I found that number of properties selling in Central London has been declining since 2014, even though the number of listings have remained constant.

This is a post forming the Central London Property analysis I am doing using data from various listing and government websites.

In today’s post, I will be looking at the number of properties listed for sale and sold over the last few years in Central London. As before, I define Central London as postcodes EC1, EC2, EC3, EC4 and WC1, WC2.

I would expect that COVID-19 and work from home culture must have lead to many more listings this year. It would be interesting to see if these were followed by sales. I will do an analysis of rentals and lets separately.

The data for listing and sales comes from an online property listing service. Listings are under a property id which is great for confirming repeat sales. However, sometimes there are extra listings within a listings, for example - if there has been a price reduction or a re listing. So, I spent 80% of the time cleaning and making sense of this data 😭

The full code for this blog post can be found on my GitHub.

Intro

I have 243k listings from 40k properties in Central London.

Lets look at all the different types of listings we have, denoted by listing status.

A listing status are ’tags’ on the listing page. They tell us if a property has been sold/let. Alternatively we also get status updates from the property page. For e.g Listedforsale means that a property was historically listed and Sold means a property was sold according to land registry.

Status Freq Meaning
sold 148581 landregistry.gov.uk data until 1st June 2020
missing 81077 no status found, usually some old listings and also brand new listings
listedforsale 4055 historic listings on website
sold stc 3228 tag on the listing on the website, not from official source.
let 1651 a property has been rented out succesfully
let agreed 1160 same as above
under offer 1113 an offer has been made on the property but still for sale
new 1081 a ’new’ listing, not really important for us
listing expired 922 listing was removed from the website
just added 218 same as new
sold subject to contract 188 same as sold stc
new build 188 new build property
investment 116 tag on the listing
new home 7 tag on the listing

A listing can be relisted (!), so we have a info_type attribute for each listing. This is parsed from a listing page when we have the history of the listing itself.

Type Freq Meaning
previous_sale_prices 121823 previous prices of the property from other listings
main_listing 109989 actual listing
first listed 5739 first listing of the listing
price reduction 3925 price reductions on the listing
last sold 1708 same as previous_sale_prices
price increase 641 price increases

Surprised to see price increases for a listing! Although they are 4x less common than a price reduction.

Data Cleaning: A property has multiple listings in our table which mean the same thing, leading to duplicates. We could also have multiple entries in our table for the same “listing”. Eg price changes, status changes. To analyse sales, we want to have a unique row per “real” status. We are only really interested in finding out how many properties are sold or rented. However, given the different number of statuses it is not trivial to get “unique” listings. Below is my attempt in code.

Cleaning data

To clean our data, I create a clean_status column which ignores some statues and groups other statuses. I also remove listings with no prices and dates. The code should explain what I am doing.

R seems to be becoming less favourable than python with the advent of ML. However dplyr is one of my favourite R packages for data science. The syntax feels very natural and it has powerful features for data wrangling. Pandas/Python has a similar API but I find it a bit clunky. Additionally, the %>% (pipe) operator in R sends the output from left to right, which makes chaining functions very readable. I would highly recommend giving it a try!

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
library(dplyr)
prices_data_sum <- listings %>% # listings is our table of 240k listings

  # keep only residential listings
  filter(is.na(listing_url) | (!str_detect(listing_url,"commercial"))  ) %>% 
  
  mutate(
    
    # this is the information in the table that comes up on the website, for a listing
    # it sometimes tell us previous listings of the property
    info_type = ifelse(str_detect(info_type, "Price reduced"),"Price reduction",info_type),
    info_type = ifelse(str_detect(info_type, "Price increased"),"Price increase",info_type),
    # create factors, keeping the more important levels first, 
    # since we will use this as the order to de-duplicate
    info_type = factor(str_replace_all(str_trim(info_type)," ","_") %>% tolower, 
                       levels = c("price_reduction","price_increase","first_listed","last_sold","main_listing","previous_sale_prices")),

    status_clean = case_when(
      str_detect(tolower(status),"sold") ~ "Sold", # put under offer same as sold for reducing number of cats
      info_type=="last_sold" ~ "Sold",
      str_detect(tolower(status),"under offer") ~ "For Sale", # put under offer same as sold for reducing number of cats
      str_detect(tolower(status),"let") ~ "Rented",
      str_detect(tolower(status),"listedforsale") ~ "For Sale", # this is the 'archived' info on zoopla, displaying when something was listed for sale historically
      str_detect(listing_url,"to-rent") & status == "Listing expired" ~ "Rent Expired",
      str_detect(listing_url,"for-sale") & status == "Listing expired" ~ "Sale Expired",
      str_detect(listing_url,"to-rent") ~ "To Rent",
      str_detect(listing_url,"for-sale") ~ "For Sale",
      str_detect(price,"pcm") ~ "Rent NA", # price in pcm but no other status info..
      str_detect(listing_url,"property-history")  ~ "For Sale" # if not status given, this url is the archived url template, and assume its for sale
    ),
    
    status_clean = factor(status_clean, levels=c("Sold","For Sale","Rented","To Rent","Rent NA","Rent Expired","Sale Expired")),
    
    # clean prices
    price_num = str_replace_all(price,"[£,]","") %>% str_extract("\\d+") %>% as.numeric,
    
    # clean dates
    saledate_date = lubridate::ymd_hms(saledate) %>% as.Date,
    saledate_date = as.Date(ifelse(is.na(saledate_date),ymd(saledate), saledate_date), origin='1970-01-01'),
    
    # get unique id / key for listing.
    listing_key = ifelse(is.na(listing_id), paste0("p",property_id,year(saledate_date)), paste0("l",listing_id)),
    listing_key = as.factor((as.character(listing_key))),
  ) %>% 
  # remove missing saledates, approx 20% of the data is lost here because these are properties which have
  # never been sold
  filter(saledate != "None" & !is.na(saledate_date) & !is.na(price_num)) %>% 
  left_join(properties %>% select(area, beds, property_id, zest, postalCode), by="property_id") %>% 
  select(-saledate,-price,-listing_url) %>% arrange(saledate_date)

Summary of our data (188k rows and 15k properties)

saledate_date          status_clean      price_num                       info_type     
 Min.   :1995-01-01   Sold    :153316   Min.   :     173   price_reduction     :  3925  
 1st Qu.:2002-08-01   For Sale: 15466   1st Qu.:  184000   price_increase      :   641  
 Median :2010-04-08   To Rent : 10740   Median :  340000   first_listed        :  5739  
 Mean   :2009-03-26   Rent NA :  5374   Mean   :  455937   last_sold           :  1708  
 3rd Qu.:2015-06-01   Rented  :  2683   3rd Qu.:  585000   main_listing        : 54870  
 Max.   :2020-09-05   (Other) :   906   Max.   :73106139   previous_sale_prices:121617  
                      NA's    :    15                                                

As you can see I managed to find status updates for all but 15 listings. I also have prices for them. I find that our listing data ranges from 2000s - today.

Property histories

The data I have gathered is very rich on the property level. See below all the transactions I know of a certain property.

1
2
3
4
prices_data_sum %>% filter(property_id %in% 7971826) %>% 
  arrange(year(saledate_date),month(saledate_date),status_clean) %>% 
  distinct(property_id, status_clean, year(saledate_date), .keep_all = T) %>% # only show distinct listings
  select(listing_id, property_id, info_type, status_clean, price_num, saledate_date) %>% to_markdown

listing_id property_id info_type status_clean price_num saledate_date
NA 7971826 previous_sale_prices Sold 150000 1996-09-01
NA 7971826 previous_sale_prices Sold 200000 1997-02-01
NA 7971826 previous_sale_prices Sold 247000 1999-11-01
NA 7971826 previous_sale_prices Sold 495000 2002-02-01
28210585 7971826 main_listing For Sale 899950 2013-03-01
NA 7971826 previous_sale_prices Sold 899950 2013-08-01
49364002 7971826 main_listing For Sale 1300000 2018-09-01
53235585 7971826 main_listing Rent NA 2773 2019-11-04
54907582 7971826 main_listing Rent NA 3467 2020-05-13
55730087 7971826 first_listed To Rent 4160 2020-08-03
55730087 7971826 main_listing Rent Expired 2947 2020-09-02

You can see here that the property has been sold 4 times between 1996 and 2002.

  1. It came on for sale in March 2013 and was then listed as sold in August 2013 for £900k.
  2. It was then listed for sale in September 2018 for £1.3m (a 6.9% increase per annum!). Since there is no sold listing after the date, we can assume that the property did not sell.
  3. Instead the owners decided to put it out for rent for £2,773 per month which is just a 3.7% yield on their initial purchase price of £900k (and a 2.5% yield on the)
  4. Finally, it came on for rent at £4.2k and went down to £2.9k, with the listing no longer available.

The above is a short history of 11 transactions because we have chosen a unique status per year. Actually, there are 31 transactions in total!

The above example is an outlier, the mean number of transactions in our dataset is 3 and the median is 2. We have 3.7k/15k properties with more than 3 transactions.

I am looking forward to analysing this data more in future posts! 🤓

Properties sold per year

To calculate the number of properties for sale, I first deduplicate the data by keeping a single ‘For Sale’ or ‘Sold’ record per property per year.

I am experimenting with svg plots since ggplot2 in R produces them natively. They are 85% smaller in size compared to png and scale perfectly with window size, retaining the detail as you zoom in.

The peak in properties sold was 2013, with 824 properties sold. 2019 saw the lowest number of properties sold in central london since 2005 - a 14 year stretch! The number of properties sold has collapsed by almost 50% since 2005 and 75% since the peak.

The number of properties listed for sale has remained steady in the last 10 years. This implies the collapse in properties sold seems to be led by weaker demand than supply. This year has seen the 3rd highest # of properties offered for sale which is not surprising given the stamp duty discount and COVID causing people to leave for the countryside.

Properties rented per year

Analysing rentals is a bit more difficult. Firstly, there is no central registry which holds a record of a successful let (unlike sales). Secondly,estate agents remove rental listings more often without updating the status. There could be several reasons, including that the property has been successfully rented.

Because of the above reasons, we do not have good historical data on lets. The chart below has two categories.

  1. Rented - when a successful let is confirmed on the listing.
  2. Rent NA - when the listing has been taken off but no indication has been made if let or not. It is highly probable here that these are as good as rented.

This leads to the chart below:

The number of successful lets in 2020 is surprising. Total confirmed lettings Rented are above the 2019 level, with 4 months still to go. This does not fit the COVID narrative - everyone fleeing for the countryside.

An important caveat here is that we do not know of listings that were removed prior to 1 August 2020. Hence, the total lettings before 2020 could be understated. Based on listings in August, this effect could understate the lettings in the previous years by 50 - 75%. I will be monitoring this further going forward.

updatedupdated2023-04-092023-04-09