Will Gagne-Maynard


Craigslist Apt Scraper

The goal of this project is to build a quick tool that will automatically scrape housing data from Craigslist with a defined set of features. This tool will have two purposes:
1. Automatically update the user when new apartments matching the filters are found
2. Add to a database of all Craigslist housing to analyze longitudinal trends
import pandas as pd
import os
import time
from craigslist import CraigslistHousing
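# SQLAlchemy is imported here for the planned move from daily .csv files to a real database (see Pathway #1 below)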
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, DateTime, Float, Boolean
from sqlalchemy.orm import sessionmaker


cl = CraigslistHousing(site='seattle', area='see', category='apa', filters={'max_price': 3000, 'min_price': 800})

# get_results returns a generator; geotagged=True adds coordinates and the results are limited to the newest 50
gen = cl.get_results(sort_by='newest', geotagged=True, limit=50)
t = []
while True:
    try:
        result = next(gen)
    except StopIteration:
        break
    except Exception:
        continue
    t.append(result)
df = pd.DataFrame(t) 


Right now I have a quick scraper built on top of the python-craigslist package. It returns a results generator, which I loop through and collect into a DataFrame. Each result contains the following fields (a quick sketch of pulling out individual fields follows the list):

 datetime        time at which the listing was posted
 has_image       whether there are images attached
 where           user-defined value for where the listing is
 geotag          geotagged coordinates
 has_map         whether there's a map associatedated with the listing
 name            user-defined name of the posting
 price           price of the listing
 url             URL of the full listing
 id              craigslist defined ID for the listing
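Each result comes back from the generator as a plain dict, so these fields can be pulled out by key. A minimal sketch, assuming the list t from the loop above is non-empty:

if t:
    sample = t[0]
    print(sample['name'], sample['price'])
    print(sample['url'])
    print(sample['geotag'])  # (lat, lon) tuple when geotagged=True and a geotag was found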

I can further limit my search using the various filters within the python-craigslist package (shown below).

CraigslistHousing.show_filters()
Base filters:
* posted_today = True/False
* search_titles = True/False
* has_image = True/False
* query = ...
Section specific filters:
* min_price = ...
* max_ft2 = ...
* laundry_in_unit = True/False
* bathrooms = ...
* min_ft2 = ...
* max_price = ...
* bedrooms = ...
* cats_ok = True/False
* dogs_ok = True/False
* no_smoking = True/False
* private_bath = True/False
* zip_code = ...
* private_room = True/False
* search_distance = ...

For now, I want to stick with 1-bedroom apartments. Unfortunately, Craigslist only filters based on 1+ bedrooms, so I’ll have to do some cleaning later on to remove extra hits.

cl = CraigslistHousing(site='seattle', area='see', category='apa', filters={'max_price': 2000, 'min_price': 800, 'bedrooms' : 1})

# get_results returns a generator; geotagged=True adds coordinates and the results are limited to the newest 50
gen = cl.get_results(sort_by='newest', geotagged=True, limit=50)
t = []
while True:
    try:
        result = next(gen)
    except StopIteration:
        break
    except Exception:
        continue
    t.append(result)
df = pd.DataFrame(t) 
df.head()


|   | datetime | geotag | has_image | has_map | id | name | price | url | where |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-02-01 10:46 | (47.556376, -122.386932) | True | True | 5985135361 | 1 Bedroom With Den and Amazing Water Views | $1943 | http://seattle.craigslist.org/see/apa/59851353... | 5020 California Ave SW Seattle, WA |
| 1 | 2017-02-01 10:45 | (47.395329, -122.300863) | True | True | 5971957026 | 2 Bedroom Apartment in Des Moines | $1219 | http://seattle.craigslist.org/see/apa/59719570... | Des Moines |
| 2 | 2017-02-01 10:42 | (47.615191, -122.31126) | True | True | 5985127894 | Huge One bedroom, Dining space, 6 months free ... | $1950 | http://seattle.craigslist.org/see/apa/59851278... | Capitol Hill |
| 3 | 2017-02-01 10:42 | (47.680426, -122.32406) | True | True | 5985127842 | Open 1 bedroom priced to lease today - 1 month... | $1760 | http://seattle.craigslist.org/see/apa/59851278... | Green Lake |
| 4 | 2017-02-01 10:42 | (47.685651, -122.365776) | True | True | 5985127286 | North Ballard, Hardwood, Bright, Garage! | $1700 | http://seattle.craigslist.org/see/apa/59851272... | Ballard, Greenwood |
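The 2-bedroom Des Moines listing above is exactly the kind of extra hit the 1+ bedroom filter lets through. A rough clean-up sketch based on the listing titles; the regex is an assumption and only catches listings that advertise their bedroom count in the name:

# Drop listings whose title advertises 2+ bedrooms; titles without a bedroom
# count slip through and need a closer look at the full posting
multi_bed = df['name'].str.contains(r'\b[2-9]\s*(?:bed(?:room)?s?|br|bd)\b',
                                    case=False, na=False)
one_bed = df[~multi_bed]
print(len(df), 'listings scraped,', len(one_bed), 'after dropping obvious multi-bedroom titles')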

Now that I have this data, there are two separate pipelines:
1. Add to previously scraped Craigslist data for future analysis
2. Filter this data with user-defined filters (a rough sketch of that filtering step is below)
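Pathway #2 isn't built out yet, but its filtering half could look something like the sketch below. The price cap and neighborhood names are placeholders, and the price column has to be converted from strings like "$1943" before comparing:

# Placeholder user filters
user_max_price = 1800
wanted_areas = ['Ballard', 'Capitol Hill', 'Green Lake']

matches = df.copy()
# Price comes back as a string, so strip '$' (and any commas) before converting
matches['price_num'] = (matches['price']
                        .str.replace('$', '', regex=False)
                        .str.replace(',', '', regex=False)
                        .astype(float))
matches = matches[(matches['price_num'] <= user_max_price) &
                  (matches['where'].str.contains('|'.join(wanted_areas), case=False, na=False))]
print(matches[['name', 'price', 'where', 'url']])

The notification piece (updating the user when new matches show up) can sit on top of this filter later.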

Pathway #1

For now, I’m just going to save to a .csv file. In the future, I’d like to use a SQL database, but for now the data is small enough to be manageable with just a .csv.

cl = CraigslistHousing(site='seattle', area='see', category='apa', filters={'max_price': 2000, 'min_price': 800, 'bedrooms': 1})

# get_results returns a generator; geotagged=True adds coordinates and the results are limited to the newest 2000
gen = cl.get_results(sort_by='newest', geotagged=True, limit=2000)
t = []
while True:
    try:
        result = next(gen)
    except StopIteration:
        break
    except Exception:
        continue
    t.append(result)
df = pd.DataFrame(t) 
date = time.strftime("%m%d%Y")
# Append to today's file if I've already run the scraper today; otherwise create a new csv
if os.path.isfile(date+'.csv'):
    df.to_csv(date+'.csv', mode='a', header=False)
else:
    df.to_csv(date+'.csv')

Now I have the data in a daily .csv file. I want to play around with creating a pipeline for data analysis.
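The SQLAlchemy imports at the top are there for that eventual move off daily .csv files. A minimal sketch of what the table might look like, assuming a local SQLite file; the table name, column choices, and file name are placeholders rather than a final design:

from sqlalchemy import create_engine, Column, String, Float, Boolean
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Listing(Base):
    __tablename__ = 'listings'

    id = Column(String, primary_key=True)    # craigslist-defined listing id
    name = Column(String)                     # user-defined name of the posting
    price = Column(String)                    # raw price string, e.g. "$1943"
    url = Column(String)
    where = Column(String)
    datetime = Column(String)                 # posting time as scraped
    has_image = Column(Boolean)
    has_map = Column(Boolean)
    lat = Column(Float)                       # split out from the geotag tuple
    lon = Column(Float)

engine = create_engine('sqlite:///listings.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

With something like this in place, each day's scrape could be written straight into the database instead of appended to a dated .csv.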