Project: Building a Database of Home Sales in St. Louis


As a long-time St. Louis resident, serial home buyer, and short-lived real estate entrepreneur, I love digging into real estate data sets to identify emerging trends.

I'm also a millennial, so commiserating about home prices comes naturally.

While housing might be in short supply, there's a surplus of high-quality real estate data. Sources like the St. Louis Federal Reserve's housing data sets and Redfin's "Data Center" provide great high-level data sets for understanding national, regional, and metro-specific housing trends.

These data sets are useful to grok the big picture, but they often fall short if you want to dig into more granular housing data. And in a city like St. Louis, where housing stock's quality varies "block to block," it's helpful to understand the nuance of the neighborhood. Housing prices can differ considerably from one street to the next.

Viewing a home's previous sale history on Zillow or Redfin might satisfy the average home buyer. They might also find their real estate agent's "newsworthy" market analyses and clunky MLS interface "good enough."

But, I'm not the average home buyer. I have bought & sold more homes than I care to admit over the past decade, and making data-informed decisions has helped keep my deals profitable.

It is difficult to find a comprehensive set of sale record data for the properties within the St. Louis City limits.

Sites like Zillow & Redfin provide past sale history for individual homes, but what if you want to know the sale history for every home in an entire city?

How would you answer questions like, "What is the median turnover for single-family homes in a neighborhood?"

Or a question like, "Which streets are seeing the greatest increases in home values?"


Building a Database of St. Louis Residential Sale Data

St. Louis City doesn't have a single data source publicly exposing every parcel's sale data. But this problem isn't novel, and the data already exists -- you just have to know where to look.

I built a proof-of-concept project that leverages a publicly available list of basic parcel metadata to programmatically find and persist detailed sale records and other property characteristics from the City of St. Louis' Address Search. That combination isn't possible with the city's open data resources alone.

Check out the project's repository here: GitHub Repository

Data Pipeline Architecture

Here's a visual representation of the project's logic flow & architecture:

Project Architecture Diagram, visual representation of the data pipeline

Depending on your familiarity with data pipelines and the St. Louis parcel data sets, the diagram may or may not be self-explanatory, so the sections below walk through it piece by piece.

Project Overview

At a high level,

This pipeline builds a database of sale records by programmatically fetching sale data from the City's Address Search interface for each row in the City's "Parcel Joining Data Dataset."

In slightly more detail,

The pipeline iterates over each parcel in the City's "Parcel Joining Data Dataset." If the parcel's sale data has already been added to the database, we skip it. If it hasn't, we check the City's Address Search interface for sale info, then normalize that info and add it to the database.
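
Here's a minimal sketch of that loop. The helper callables and the CSV column name are stand-ins for the package's real modules rather than its actual code:

```python
import csv

def run_pipeline(input_csv_path, already_ingested, fetch_sales, normalize, persist):
    """Iterate the parcel CSV and ingest sale data for parcels we haven't seen.

    The callables stand in for the package's real modules:
      already_ingested(parcel_id) -> bool   # does the database already have this parcel?
      fetch_sales(parcel) -> raw results    # query the City's Address Search
      normalize(parcel, raw) -> (parcel_row, sale_rows)
      persist(parcel_row, sale_rows)        # write to PostgreSQL
    """
    with open(input_csv_path, newline="") as f:
        for parcel in csv.DictReader(f):
            parcel_id = parcel["ParcelId"]  # column name is an assumption
            if already_ingested(parcel_id):
                continue  # sale data already in the database; skip it
            raw_results = fetch_sales(parcel)
            parcel_row, sale_rows = normalize(parcel, raw_results)
            persist(parcel_row, sale_rows)
```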

Technical Pipeline Overview

The data pipeline lives in a Python package, stl_city_parcel_ingestion.

The program is initialized in stl_city_parcel_ingestion/__main__.py (link).

The bulk of the pipeline's logic lives in stl_city_parcel_ingestion/parcel_processor (link).

This parcel_processor works by:

  1. Opening the input data file located in /csv_reader/data/stl-city-parcel-data.csv (link)
  2. Searching the STL City Address portal
  3. Parsing the STL City Address portal results
  4. Normalizing the input data and STL City Address portal's parcel results
  5. Persisting the parcel and sale data into a database

The parcel_processor is initialized here.
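
To give a concrete feel for steps 2 and 3, here's a rough sketch of searching for one address and scraping its sale history with requests and BeautifulSoup. The URL, query parameter, and table markup are placeholders; the real Address Search endpoints and page structure are what the parcel_processor source actually targets.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder endpoint; the real Address Search URL and query parameters differ.
SEARCH_URL = "https://example.com/address-search"

def fetch_sale_rows(street_address):
    """Search one address and scrape its sale-history table (hypothetical markup)."""
    response = requests.get(SEARCH_URL, params={"address": street_address}, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    table = soup.find("table", class_="sales-history")  # class name is an assumption
    if table is None:
        return []  # no sale history found for this address

    sale_rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 2:
            sale_rows.append({"sale_date": cells[0], "sale_price": cells[1]})
    return sale_rows
```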

Database Overview

The initial database schema for the proof of concept contains two tables: one holds the parcel metadata, while the other contains the sale records. There is a one-to-many relationship between parcel records and sale records (one parcel can have many sales).

Here's an illustration of the proof-of-concept schema:

Database schema diagram
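
If you prefer DDL to a diagram, the shape is roughly the sketch below. The table names match the migrations described in the next section, but the column names and types are illustrative assumptions, not the project's actual schema.

```python
# Hypothetical DDL capturing the two-table, one-to-many shape described above.
CREATE_MAIN_PARCEL_DATA = """
CREATE TABLE IF NOT EXISTS main_parcel_data (
    parcel_id    TEXT PRIMARY KEY,   -- assumed identifier from the input CSV
    address      TEXT,
    neighborhood TEXT
);
"""

CREATE_PARCEL_SALE_RECORDS = """
CREATE TABLE IF NOT EXISTS parcel_sale_records (
    sale_id    SERIAL PRIMARY KEY,
    parcel_id  TEXT NOT NULL REFERENCES main_parcel_data (parcel_id),  -- many sales per parcel
    sale_date  DATE,
    sale_price NUMERIC(12, 2)
);
"""
```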

The database was initially developed against an AWS RDS PostgreSQL instance. However, the program can connect to and persist data in any PostgreSQL database, provided the connection details are specified in database.ini.
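
Connecting from Python can be as simple as reading that file with configparser and handing the values to psycopg2. This sketch assumes the common "[postgresql]" section convention; the repository's example database.ini is the source of truth for the actual section and key names.

```python
from configparser import ConfigParser

import psycopg2

def connect(ini_path="database.ini", section="postgresql"):
    """Open a PostgreSQL connection using settings read from database.ini."""
    parser = ConfigParser()
    parser.read(ini_path)
    if not parser.has_section(section):
        raise KeyError(f"Section '{section}' not found in {ini_path}")
    # Expects keys psycopg2 understands, e.g. host, port, dbname, user, password.
    return psycopg2.connect(**dict(parser.items(section)))
```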

Database Migrations

The main_parcel_data and parcel_sale_records table creation SQL queries are stored in the db_migrations package.

This package supports database migrations for the parent stl_city_parcel_ingestion package. These migrations allow for the creation of the package's expected database structure across different environments.

The package can be extended to support additional database needs as new functionality is added to stl_city_parcel_ingestion.
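
As a rough sketch of how such a migrations package can be applied, here's one approach under the assumption that the CREATE TABLE statements are stored as .sql files (the real db_migrations package may organize them differently):

```python
import glob

import psycopg2

def apply_migrations(conn_params, migrations_dir="db_migrations"):
    """Execute every .sql file in the migrations directory, in filename order."""
    conn = psycopg2.connect(**conn_params)
    try:
        # `with conn` commits the transaction on success and rolls back on error.
        with conn, conn.cursor() as cur:
            for path in sorted(glob.glob(f"{migrations_dir}/*.sql")):
                with open(path) as f:
                    cur.execute(f.read())
    finally:
        conn.close()
```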

Database Resources

Check out these two resources for more information on setting up the database:

Next steps

This project isn't complete - it's a proof of concept and isn't ready for production use. It's a jumping-off point for exploring property sale information in St. Louis City.

I recommend using a self-imposed rate limit to keep the City's services from falling over. I like erring on the side of caution: something like 2-3 seconds per request.
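
In practice, that can be as simple as sleeping after every request. A minimal sketch:

```python
import time

import requests

REQUEST_DELAY_SECONDS = 3  # err on the side of caution

def polite_get(url, **kwargs):
    """GET a page, then pause so this process makes at most one request
    every REQUEST_DELAY_SECONDS seconds."""
    response = requests.get(url, timeout=30, **kwargs)
    time.sleep(REQUEST_DELAY_SECONDS)
    return response
```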

With a 2-3 second delay between requests, building a complete database this way will take at least 9 days. Since that's impractical for most use cases, I recommend trimming the "Parcel Joining Data Dataset" down to the areas of the city that matter for your analysis.
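
A quick way to do that trimming is to filter the input CSV before running the pipeline. The neighborhood column name below is an assumption; check the dataset's header for the actual field name.

```python
import csv

# Example neighborhoods; swap in the areas relevant to your analysis.
TARGET_NEIGHBORHOODS = {"Shaw", "Tower Grove South"}

with open("csv_reader/data/stl-city-parcel-data.csv", newline="") as src, \
        open("trimmed-parcel-data.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row.get("Neighborhood") in TARGET_NEIGHBORHOODS:  # column name is an assumption
            writer.writerow(row)
```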

Get Started - Build Your Own Database

Check out the project's setup instructions to get started.