Will Gagne-Maynard


Craigslist Apt Scraper

The goal of this project is to build a quick tool that will automatically scrape housing data from Craigslist with a defined set of features. This tool will have two purposes:
1. Automatically update the user when new apartments matching the filters are found
2. Add to a database of all Craigslist housing to analyze longitudinal trends
import pandas as pd
import os
import time
from craigslist import CraigslistHousing
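# SQLAlchemy is imported here for the planned move from daily .csv files to a real database (see Pathway #1 below)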
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, DateTime, Float, Boolean
from sqlalchemy.orm import sessionmaker


cl = CraigslistHousing(site='seattle', area='see', category='apa', filters={'max_price': 3000, 'min_price': 800})

# get_results returns a generator; geotagged=True adds coordinates and the results are limited to the newest 50
gen = cl.get_results(sort_by='newest', geotagged=True, limit=50)
t = []
while True:
    try:
        result = next(gen)
    except StopIteration:
        break
    except Exception:
        continue
    t.append(result)
df = pd.DataFrame(t) 


Right now I have a quick scraper built on top of the python-craigslist package. It returns a results generator, which I loop through and collect into a DataFrame. Each result contains the following fields (a quick sketch of pulling out individual fields follows the list):

 datetime        time at which the listing was posted
 has_image       whether there are images attached
 where           user-defined value for where the listing is
 geotag          geotagged coordinates
 has_map         whether there's a map associatedated with the listing
 name            user-defined name of the posting
 price           price of the listing
 url             URL of the full listing
 id              craigslist defined ID for the listing
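Each result comes back from the generator as a plain dict, so these fields can be pulled out by key. A minimal sketch, assuming the list t from the loop above is non-empty:

if t:
    sample = t[0]
    print(sample['name'], sample['price'])
    print(sample['url'])
    print(sample['geotag'])  # (lat, lon) tuple when geotagged=True and a geotag was found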

I can further limit my search using the various filters within the python-craigslist package (shown below).

CraigslistHousing.show_filters()
Base filters:
* posted_today = True/False
* search_titles = True/False
* has_image = True/False
* query = ...
Section specific filters:
* min_price = ...
* max_ft2 = ...
* laundry_in_unit = True/False
* bathrooms = ...
* min_ft2 = ...
* max_price = ...
* bedrooms = ...
* cats_ok = True/False
* dogs_ok = True/False
* no_smoking = True/False
* private_bath = True/False
* zip_code = ...
* private_room = True/False
* search_distance = ...

For now, I want to stick with 1-bedroom apartments. Unfortunately, Craigslist only filters based on 1+ bedrooms, so I’ll have to do some cleaning later on to remove extra hits.

cl = CraigslistHousing(site='seattle', area='see', category='apa', filters={'max_price': 2000, 'min_price': 800, 'bedrooms' : 1})

# get_results returns a generator; geotagged=True adds coordinates and the results are limited to the newest 50
gen = cl.get_results(sort_by='newest', geotagged=True, limit=50)
t = []
while True:
    try:
        result = next(gen)
    except StopIteration:
        break
    except Exception:
        continue
    t.append(result)
df = pd.DataFrame(t) 
df.head()


|   | datetime | geotag | has_image | has_map | id | name | price | url | where |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-02-01 10:46 | (47.556376, -122.386932) | True | True | 5985135361 | 1 Bedroom With Den and Amazing Water Views | $1943 | http://seattle.craigslist.org/see/apa/59851353... | 5020 California Ave SW Seattle, WA |
| 1 | 2017-02-01 10:45 | (47.395329, -122.300863) | True | True | 5971957026 | 2 Bedroom Apartment in Des Moines | $1219 | http://seattle.craigslist.org/see/apa/59719570... | Des Moines |
| 2 | 2017-02-01 10:42 | (47.615191, -122.31126) | True | True | 5985127894 | Huge One bedroom, Dining space, 6 months free ... | $1950 | http://seattle.craigslist.org/see/apa/59851278... | Capitol Hill |
| 3 | 2017-02-01 10:42 | (47.680426, -122.32406) | True | True | 5985127842 | Open 1 bedroom priced to lease today - 1 month... | $1760 | http://seattle.craigslist.org/see/apa/59851278... | Green Lake |
| 4 | 2017-02-01 10:42 | (47.685651, -122.365776) | True | True | 5985127286 | North Ballard, Hardwood, Bright, Garage! | $1700 | http://seattle.craigslist.org/see/apa/59851272... | Ballard, Greenwood |
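The 2-bedroom Des Moines listing above is exactly the kind of extra hit the 1+ bedroom filter lets through. A rough clean-up sketch based on the listing titles; the regex is an assumption and only catches listings that advertise their bedroom count in the name:

# Drop listings whose title advertises 2+ bedrooms; titles without a bedroom
# count slip through and need a closer look at the full posting
multi_bed = df['name'].str.contains(r'\b[2-9]\s*(?:bed(?:room)?s?|br|bd)\b',
                                    case=False, na=False)
one_bed = df[~multi_bed]
print(len(df), 'listings scraped,', len(one_bed), 'after dropping obvious multi-bedroom titles')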

Now that I have this data, there are two separate pipelines:
1. Add to previously scraped Craigslist data for future analysis
2. Filter this data with user-defined filters (a rough sketch of that filtering step is below)
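Pathway #2 isn't built out yet, but its filtering half could look something like the sketch below. The price cap and neighborhood names are placeholders, and the price column has to be converted from strings like "$1943" before comparing:

# Placeholder user filters
user_max_price = 1800
wanted_areas = ['Ballard', 'Capitol Hill', 'Green Lake']

matches = df.copy()
# Price comes back as a string, so strip '$' (and any commas) before converting
matches['price_num'] = (matches['price']
                        .str.replace('$', '', regex=False)
                        .str.replace(',', '', regex=False)
                        .astype(float))
matches = matches[(matches['price_num'] <= user_max_price) &
                  (matches['where'].str.contains('|'.join(wanted_areas), case=False, na=False))]
print(matches[['name', 'price', 'where', 'url']])

The notification piece (updating the user when new matches show up) can sit on top of this filter later.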

Pathway #1

For now, I’m just going to save to a .csv file. In the future, I’d like to use a SQL database, but for now the data is small enough to be manageable with just a .csv.

cl = CraigslistHousing(site='seattle', area='see', category='apa', filters={'max_price': 2000, 'min_price': 800, 'bedrooms': 1})

# get_results returns a generator; geotagged=True adds coordinates and the results are limited to the newest 2000
gen = cl.get_results(sort_by='newest', geotagged=True, limit=2000)
t = []
while True:
    try:
        result = next(gen)
    except StopIteration:
        break
    except Exception:
        continue
    t.append(result)
df = pd.DataFrame(t) 
date = time.strftime("%m%d%Y")
# Append to today's file if I've already run the scraper today; otherwise create a new csv
if os.path.isfile(date+'.csv'):
    df.to_csv(date+'.csv', mode='a', header=False)
else:
    df.to_csv(date+'.csv')

Now I have the data in a daily .csv file. I want to play around with creating a pipeline for data analysis.
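The SQLAlchemy imports at the top are there for that eventual move off daily .csv files. A minimal sketch of what the table might look like, assuming a local SQLite file; the table name, column choices, and file name are placeholders rather than a final design:

from sqlalchemy import create_engine, Column, String, Float, Boolean
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Listing(Base):
    __tablename__ = 'listings'

    id = Column(String, primary_key=True)    # craigslist-defined listing id
    name = Column(String)                     # user-defined name of the posting
    price = Column(String)                    # raw price string, e.g. "$1943"
    url = Column(String)
    where = Column(String)
    datetime = Column(String)                 # posting time as scraped
    has_image = Column(Boolean)
    has_map = Column(Boolean)
    lat = Column(Float)                       # split out from the geotag tuple
    lon = Column(Float)

engine = create_engine('sqlite:///listings.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

With something like this in place, each day's scrape could be written straight into the database instead of appended to a dated .csv.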