Rents in San Francsisco

Jian Shun | Sep 19, 2022

Rents in San Francisco 2000-2018

In this case, we have a clean(ish) dataset with about 200K rows that corresponds to Craigslist listings for renting properties in the greater SF area. The data dictionary is as follows

variable class description
post_id character Unique ID
date double date
year double year
nhood character neighborhood
city character city
county character county
price double price in USD
beds double n of beds
baths double n of baths
sqft double square feet of rental
room_in_apt double room in apartment
address character address
lat double latitude
lon double longitude
title character title of listing
descr character description
details character additional details

The dataset was used in a recent tidyTuesday project.

# download directly off tidytuesdaygithub repo

rent <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-05/rent.csv')

Cursory examination of data

The variable types are either character or double. On cursory glance, the variable types all tally. We sieved out all observations that had complete information to better check if the types correspond. From observation, the one variable type that seems not optimal is “date”, which is input as a “dbl”. This could (not should) be changed to the “date” type for heightened accuracy. The “chr” variables are all descriptors or names while the “dbl” variables are all numeric values; this checks out. The interesting thing to note is that the variable “address” is registered as “chr” type despite it being numeric; this is logical since the address is essentially a descriptor that serves as a categorical variable. The numeric aspect is not relevant for analysis. The top 5 variables that have the most missing values are “descr” (197542), “address” (196888), “lon” (196484), “lat” (193145) and “details” (192780).

glimpse(rent)
## Rows: 200,796
## Columns: 17
## $ post_id     <chr> "pre2013_134138", "pre2013_135669", "pre2013_127127", "pre…
## $ date        <dbl> 20050111, 20050126, 20041017, 20120601, 20041021, 20060411…
## $ year        <dbl> 2005, 2005, 2004, 2012, 2004, 2006, 2007, 2017, 2009, 2006…
## $ nhood       <chr> "alameda", "alameda", "alameda", "alameda", "alameda", "al…
## $ city        <chr> "alameda", "alameda", "alameda", "alameda", "alameda", "al…
## $ county      <chr> "alameda", "alameda", "alameda", "alameda", "alameda", "al…
## $ price       <dbl> 1250, 1295, 1100, 1425, 890, 825, 1500, 2925, 450, 1395, 1…
## $ beds        <dbl> 2, 2, 2, 1, 1, 1, 1, 3, NA, 2, 2, 5, 4, 0, 4, 1, 3, 3, 1, …
## $ baths       <dbl> 2, NA, NA, NA, NA, NA, 1, NA, 1, NA, NA, NA, 3, NA, NA, NA…
## $ sqft        <dbl> NA, NA, NA, 735, NA, NA, NA, NA, NA, NA, NA, 2581, 1756, N…
## $ room_in_apt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ address     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ lat         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 37.5, NA, …
## $ lon         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ title       <chr> "$1250 / 2br - 2BR/2BA   1145 ALAMEDA DE LAS PULGAS", "$12…
## $ descr       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ details     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "<p class=…
skimr::skim(rent)
(#tab:skim_data)Data summary
Name rent
Number of rows 200796
Number of columns 17
_______________________
Column type frequency:
character 8
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
post_id 0 1.00 9 14 0 200796 0
nhood 0 1.00 4 43 0 167 0
city 0 1.00 5 19 0 104 0
county 1394 0.99 4 13 0 10 0
address 196888 0.02 1 38 0 2869 0
title 2517 0.99 2 298 0 184961 0
descr 197542 0.02 13 16975 0 3025 0
details 192780 0.04 4 595 0 7667 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
date 0 1.00 2.01e+07 44694.07 2.00e+07 2.01e+07 2.01e+07 2.01e+07 2.02e+07 ▁▇▁▆▃
year 0 1.00 2.01e+03 4.48 2.00e+03 2.00e+03 2.01e+03 2.01e+03 2.02e+03 ▁▇▁▆▃
price 0 1.00 2.14e+03 1427.75 2.20e+02 1.30e+03 1.80e+03 2.50e+03 4.00e+04 ▇▁▁▁▁
beds 6608 0.97 1.89e+00 1.08 0.00e+00 1.00e+00 2.00e+00 3.00e+00 1.20e+01 ▇▂▁▁▁
baths 158121 0.21 1.68e+00 0.69 1.00e+00 1.00e+00 2.00e+00 2.00e+00 8.00e+00 ▇▁▁▁▁
sqft 136117 0.32 1.20e+03 5000.22 8.00e+01 7.50e+02 1.00e+03 1.36e+03 9.00e+05 ▇▁▁▁▁
room_in_apt 0 1.00 0.00e+00 0.04 0.00e+00 0.00e+00 0.00e+00 0.00e+00 1.00e+00 ▇▁▁▁▁
lat 193145 0.04 3.77e+01 0.35 3.36e+01 3.74e+01 3.78e+01 3.78e+01 4.04e+01 ▁▁▅▇▁
lon 196484 0.02 -1.22e+02 0.78 -1.23e+02 -1.22e+02 -1.22e+02 -1.22e+02 -7.42e+01 ▇▁▁▁▁
rent_nomissing <- rent %>% 
  drop_na()
head(rent_nomissing)
## # A tibble: 6 × 17
##   post_id        date  year nhood   city  county price  beds baths  sqft room_…¹
##   <chr>         <dbl> <dbl> <chr>   <chr> <chr>  <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 4710888130 20141012  2014 alameda alam… alame…  2250     2     1  1080       0
## 2 4988581576 20150421  2015 alameda alam… alame…  2650     2     1   950       0
## 3 4988561264 20150421  2015 alameda alam… alame…  1950     2     1   800       0
## 4 4855533017 20150120  2015 alameda alam… alame…  2650     2     1   950       0
## 5 4631188738 20140824  2014 alameda alam… alame…  3295     4     1  1716       0
## 6 4794051100 20141209  2014 alameda alam… alame…  1860     1     1   705       0
## # … with 6 more variables: address <chr>, lat <dbl>, lon <dbl>, title <chr>,
## #   descr <chr>, details <chr>, and abbreviated variable name ¹​room_in_apt
## # ℹ Use `colnames()` to see all variable names
# Observe observations with complete information for better understanding of context

rent %>%
  summarise_all(~sum(is.na(.))) %>% 
  gather() %>% 
  arrange(desc(value))
## # A tibble: 17 × 2
##    key          value
##    <chr>        <int>
##  1 descr       197542
##  2 address     196888
##  3 lon         196484
##  4 lat         193145
##  5 details     192780
##  6 baths       158121
##  7 sqft        136117
##  8 beds          6608
##  9 title         2517
## 10 county        1394
## 11 post_id          0
## 12 date             0
## 13 year             0
## 14 nhood            0
## 15 city             0
## 16 price            0
## 17 room_in_apt      0
# Check which variables have the most missing values

Plot top 20 cities in terms of percent of total listings between 2000 and 2018

# Creating dataset with top 20 cities
top_cities <- rent %>% 
  group_by(city) %>% 
  summarize(count_city = n()) %>% 
  arrange(desc(count_city)) %>% 
  mutate(frequency = count_city / sum(count_city), frequency = parse_number(scales::percent(frequency)))%>%
  top_n(20)

# Plotting top 20 cities by % listings
ggplot(data = top_cities, mapping = aes(x = frequency, y = fct_reorder(city, frequency))) +
  geom_col() +
  labs(title = "Top 20 cities in terms of % of classifieds from 2000-2018", x = "Frequency (%)", y = "City") +
  theme(axis.title = element_text(size = 12), axis.text = element_text(size = 10))

Plot the evolution of median prices of 0, 1, 2 and 3 bedroom-flats within San Francisco

# Created a table for the median price of flats with 3 or fewer bedrooms
median_price_beds <- rent %>% 
  filter(beds <= 3 & city == "san francisco") %>% 
  group_by(year, beds) %>% 
  summarise(median_price = median(price))
median_price_beds
## # A tibble: 75 × 3
## # Groups:   year [19]
##     year  beds median_price
##    <dbl> <dbl>        <dbl>
##  1  2000     0        1100 
##  2  2000     1        2175 
##  3  2000     2        2098.
##  4  2001     0        1250 
##  5  2001     1        1725 
##  6  2001     2        2250 
##  7  2001     3        2995 
##  8  2002     0        1125 
##  9  2002     1        1532.
## 10  2002     2        1972.
## # … with 65 more rows
## # ℹ Use `print(n = ...)` to see more rows
# Plotted the median price of flats with 3 or fewer bedrooms from 2000-2018
ggplot(data = median_price_beds, mapping = aes(year, median_price, colour = factor(beds))) +
  facet_wrap(~ beds, ncol = 4) +
  geom_line() +
  scale_y_continuous(name = "Median price", breaks = seq(1000, 6000, 1000)) +
  labs(title = str_wrap("Evolution of median prices in San Francisco for different number of beds", 60), x = "Year", y = "Median Price") +
  theme(plot.title = element_text(size = 14, hjust = 0.5),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        strip.text.x = element_text(size = 5)) +
  theme(legend.key.size = unit(0.5, 'cm'), #change legend key size
        legend.key.height = unit(0.5, 'cm'), #change legend key height
        legend.key.width = unit(0.5, 'cm'), #change legend key width
        legend.title = element_text(size=10), #change legend title font size
        legend.text = element_text(size=8),
        legend.position = "bottom") #change legend text font size

Plot the median price per city per year for flats in the top 12 cities (in terms of percent listings) from 2000 to 2018

# We used the description provided rather than what was shown in the plots; i.e. we took data from all number of bedrooms rather than just 1-bedroom like shown in the plots.

# Extract city names for top 12 cities into new dataframe (took a subset from an earlier section)
top_12 <- top_cities$city[1:12]

# Summarised median prices per city per year for the top 12 cities
median_top12_cities <- rent %>% 
  filter(city %in% top_12) %>% 
  group_by(city, year) %>% 
  summarise(median_price = median(price))

# Plotted median prices per city per year for the top 12 cities from 2000-2018
ggplot(data = median_top12_cities, mapping = aes(year, median_price, colour = factor(city))) +
  facet_wrap(~ city, ncol = 4) +
  geom_line() +
  scale_y_continuous(name = "Median price", breaks = seq(1000, 6000, 1000)) +
  labs(title = "Median rental price for top 12 cities", x = "Year", y = "Median Price") +
  theme(plot.title = element_text(size = 14, hjust = 0.5),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        axis.text.x = element_text(angle = 90),
        strip.text.x = element_text(size = 10)) +
  theme(legend.key.size = unit(0.5, 'cm'), #change legend key size
        legend.key.height = unit(0.5, 'cm'), #change legend key height
        legend.key.width = unit(0.5, 'cm'), #change legend key width
        legend.title = element_text(size=10), #change legend title font size
        legend.text = element_text(size=8), #change legend text font size
        legend.position = "right")

Inference and Analysis

We see that the median price of all flats with 3 or fewer bedrooms in San Francisco rose over the 15 years. It also seemed that the more bedrooms a flat had, the more severe the increase in price across years; 1-bed flats increased by about 100% of their original price after 15 years whereas 3-bed flats increased by 150%. We infer that a process of gentrification is at work here in the San Francisco region; wealthier individuals not only prefer, but also have the means to rent apartments with more luxurious amenities, e.g. more bedrooms. These individuals may engage in an auction process to out-bid their competitors to move into such apartments and thus prices rise more aggressively. One could also imagine that these apartments are within neighbourhoods with similar apartments; a process of gentrification then occurs since more wealthier individuals are attracted to these areas, thus fueling another cycle of competition and deepening the upward price spiral. While this cannot be seen from the plots, one could conduct spatial analysis/regression to test this narrative.

When the plots were faceted by cities, we see that the upward trend in prices still holds for all top 12 cities. However, the extent of overall increase in price within each city from 2000 to 2018 differs widely, with cities that experienced the highest price change suffering an almost 300% jump in price (Palo Alto) while other cities only saw a 100% increase in price. This suggests that some areas were more preferred than the others, perhaps due to better regional amenities maybe because they belonged to a more matured estate; the heightened demand for those areas caused an upward pressure on prices.

Plot cumulative percent change of median prices of 0, 1 and 2 bedroom flats for the top 12 cities in Bay Area

# Summarise popularity of cities in terms of percent of ads that appeared
# Find the top 12 cities (We reused code from the section above since they are linked)
top_12_cities <- rent %>% 
  group_by(city) %>% 
  summarize(count_city = n()) %>% 
  arrange(desc(count_city)) %>% 
  mutate(frequency = count_city / sum(count_city)) %>% 
  top_n(12, frequency)

# View top 12 cities
top_12_cities
## # A tibble: 12 × 3
##    city          count_city frequency
##    <chr>              <int>     <dbl>
##  1 san francisco      55730    0.278 
##  2 san jose           13733    0.0684
##  3 oakland             9443    0.0470
##  4 santa rosa          6230    0.0310
##  5 santa cruz          5464    0.0272
##  6 san mateo           5127    0.0255
##  7 sunnyvale           4526    0.0225
##  8 mountain view       4414    0.0220
##  9 berkeley            4201    0.0209
## 10 santa clara         4171    0.0208
## 11 palo alto           3916    0.0195
## 12 union city          3451    0.0172
# Based on the cumulative_percent_change code/formula that you sent us via Slack, we focused on the cumulative median price change
cumulative_percent_change <- rent %>% 
  filter(city %in% top_12_cities$city & beds <= 2) %>% 
  group_by(city, beds, year) %>%
  summarise(median_price = median(price)) %>%
  ungroup() %>% 
  mutate(pct_change = (median_price/lag(median_price))) %>% 
  mutate(pct_change = ifelse(is.na(pct_change), 1, pct_change)) %>% 
  mutate(percent_change = cumprod(pct_change), percent_change = parse_number(scales::percent(percent_change)))

# Plot cumulative percent median price change across time for each of the unique bed-city combination  
ggplot(data = cumulative_percent_change, mapping = aes(year, percent_change, colour = factor(city))) +
    facet_grid(vars(beds), vars(city)) +
    geom_line() +
    labs(title = "Cumulative Percentage Change for flats", x = "Year", y = "Cumulative Percent Change (%)", ) +
    theme(axis.title = element_text(size = 14),
          axis.text = element_text(size = 8),
          axis.text.x = element_text(angle = 90),
          strip.text = element_text(size = 5),
          strip.background = element_rect(fill="lightblue", colour="black", size=1),
          legend.key.size = unit(0.5, 'cm'), #change legend key size
          legend.key.height = unit(0.5, 'cm'), #change legend key height
          legend.key.width = unit(0.5, 'cm'), #change legend key width
          legend.title = element_text(size=10), #change legend title font size
          legend.text = element_text(size=8),
          legend.position = "bottom") +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = "Cities", breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = "Beds", breaks = NULL, labels = NULL))

comments powered by Disqus