Content Warning: python, open data, data for good, pandas, geospatial data
While looking at various data sources from the Boston area, I ran across the lovely Analyze Boston portal and decided to spin up a quick map of drug-related incidents via the current incident reporting portal. Working with these types of portals (Socrata or OData) makes this process rather straightforward, and we can get to an interactive map quite quickly with a bit of Pandas and some Folium-powered Leaflet.js. If you'd like to just see the map, skip to the end of the page.
Let's set up our notebook environment:
from IPython.core.interactiveshell import InteractiveShell # makes all cells allow multiple outputs
InteractiveShell.ast_node_interactivity = "all"
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
plt.style.use('bmh') # prettier plots
plt.rcParams['figure.figsize'] = (8, 6)
pd.options.display.float_format = '{:,.2f}'.format
pd.options.display.max_rows = 10
And import the libraries we'll need:
import re
import json
import folium
import requests
For a project like this that is essentially a script, defining these steps as functions might be a bit of overkill, but I try to practice Safe Data Science© and write functions around the main points in a program. A related project will be shared with other data-savvy folks, and functions like this are far more readable to others, as they allow you to assign a name to a section of code.
Small functions for parts of your data-gathering-and-cleaning steps also let you make quick changes to those composable sections (and boy howdy are data cleaning operations composable...). When I was wrapping up this blog post, I noticed that I was using the soon-to-be-deprecated Boston Open Data portal. In the interest of reproducibility, it was easy to modify the following (old API) function from:
def get_bpd_data(n=200000):
    url = ("https://data.cityofboston.gov/resource/29yf-ye7n.json" +
           "?$limit={}".format(n))
    data = pd.read_json(url)
    return data
to the following:
def get_bpd_data(n=200000):
    url = ("https://data.boston.gov/datastore/odata3.0/" +
           "12cb3883-56f5-47de-afa5-3b1cf61b257b?$top={}&$format=json".format(n))
    r = requests.get(url)
    data = pd.DataFrame(r.json()["value"])
    data = data.rename(columns={k: k.lower() for k in data.columns})
    data = data.assign(lat=data["lat"].astype(float))
    data = data.assign(long=data["long"].astype(float))

    def convert_loc(x):
        split = x.split(',')
        split = [re.sub("[()]", '', i).strip() for i in split]
        return float(split[0]), float(split[1])

    data = data.assign(location=data["location"].apply(convert_loc))
    return data
The change took only a few minutes, especially given the ease of the notebook environment and the lack of a companion source file to keep in sync.
I also find it easier to build test or prototype scripts with this type of method, allowing for a variable-sized pull from the source during quick development. Also, if you are going to put your notebook code into a standalone file or Python library, pasting ready-made functions is rather convenient.
Anyway, as you can see, the basic flow is to grab the data from the Boston API, pull the JSON values into a pandas DataFrame, lowercase the column names (honestly a compatibility step with the old method, which had all-lowercase column names), and ensure we parse the JSON coordinate data as two floats, not as strings.
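As a toy illustration of that flow, here is the same rename-and-parse dance on a hypothetical two-row payload (the real OData response carries many more fields per record):

```python
import pandas as pd

# hypothetical payload standing in for the real API response
payload = {"value": [
    {"Lat": "42.35", "Long": "-71.06", "Location": "(42.35, -71.06)"},
    {"Lat": "42.36", "Long": "-71.05", "Location": "(42.36, -71.05)"},
]}

data = pd.DataFrame(payload["value"])
# lowercase the column names for compatibility with the old API
data = data.rename(columns={k: k.lower() for k in data.columns})
# the coordinates arrive as strings; parse them into floats
data = data.assign(lat=data["lat"].astype(float),
                   long=data["long"].astype(float))
```

The same pattern scales unchanged from two rows to the full 200,000-row pull.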
I always begin with a bit of exploration to see what we have in the data.
data = get_bpd_data(250000)
data.head()
data.shape
(data
.groupby("offense_code_group")
.size()
.sort_values()
.plot
.barh(title="Number of Incidents by offense code group")
)
It's good to see things like arson, home invasions, and homicide toward the bottom of the list...
Since we're focusing on drug-related violations, let's sort out what the code groups look like within the Drug Violation category.
(data
.assign(occurred_on_date=lambda df: pd.to_datetime(df["occurred_on_date"]))
.set_index("occurred_on_date")
.sort_index()
.query("offense_code_group == 'Drug Violation'")
["offense_description"]
.value_counts()
.sort_index()
.index
.tolist()
)
You might also note that I am using the method-chaining / functional style in pandas, which is not what you typically see. I find it great for several reasons:
- You keep your damn namespace clean:
This is crucial overall, but even more so in the interactive environment. How many times have you used quick names like df, tmp, or tmp_with_something? Things get cluttered quickly, and you wind up typing a ton and putting more and more on your mental stack to keep up.
- Rapid changes to what you want to see from a data operation:
In the notebook environment, you can rapidly copy and paste a cell, and with method chaining, alter a single part of the chain to achieve similar or extended results. For example, now that we have a raw list of the codes, from which we can figure out what they mean, it might be nice to see a visualization or counts of them. We'll do that below with minimal effort:
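To make the trade-off between the two styles concrete, here is a hypothetical toy frame processed both ways; the throwaway names (tmp, tmp2) are exactly the clutter the chained version avoids:

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "n": [1, 2, 3]})

# temp-variable style: each step leaves a name behind in the namespace
tmp = df.query("n > 1")
tmp2 = tmp.groupby("group").size()
result_a = tmp2.sort_values()

# chained style: one expression, no intermediate names to track
result_b = (df
            .query("n > 1")
            .groupby("group")
            .size()
            .sort_values())

assert result_a.equals(result_b)
```

Swapping what you inspect is then a one-line change anywhere in the chain.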
(data
.assign(occurred_on_date=lambda df: pd.to_datetime(df["occurred_on_date"]))
.set_index("occurred_on_date")
.sort_index()
.query("offense_code_group == 'Drug Violation'")
["offense_description"]
.value_counts()
# .sort_index()
# .index
# .tolist()
.plot.barh()
)
Note that I am intentionally leaving the commented lines to explicitly point out how quick it is to modify the previous cell.
This is a powerful way of interacting with the notebook, particularly when your data is "small", i.e., results from your processing chunks are delivered in near real time or within some tolerable delay (for me, usually a few seconds). If the results of a step, be it aggregation or cleaning, take too long or will be reused a lot downstream, it's probably worth assigning the results to a variable and/or wrapping the processing chunk in a function. I don't normally do this until I get past a particular exploratory stage, where "particular" is some function of the project at hand, how much time I have, and whether these stages are going to be used in a library or bigger project.
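A minimal sketch of that habit (the function and names below are my own invention for illustration, not part of this post's pipeline):

```python
import pandas as pd

def expensive_clean(df):
    """Stand-in for a slow cleaning/aggregation step worth caching."""
    return df.dropna().assign(n=lambda d: d["n"].astype(int))

raw = pd.DataFrame({"n": [1.0, None, 3.0]})

# once this step gets slow or is reused downstream, assign it once...
clean = expensive_clean(raw)

# ...and chain freely off the cached result in later cells
summary = clean["n"].sum()
```

The chain stays exploratory; only the expensive step gets pinned to a name.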
Anyway... regarding the drug offense codes, it took a minute to find a good explanation for them and to know what to work with. I am not a drug policy, legal, or policing expert, and for any legal / policy project to which I might contribute, I would certainly work with domain experts to understand those codes, particularly the nuances I might miss when finding explanations online. It took a bit of googling to find a good set of explanations for these codes from here, but it seems that I can divide them into two major groups for visualization:
- general possession of drugs
- intent to distribute: far more serious than mere possession
With that being said, on to processing. The following lambda functions ease the data extraction; if this were going into a library or source file, I would likely put them into a proper function and add some documentation about my choices, maybe even with references or logic as to why. Policy-affecting projects are probably a bit more sensitive to design reasoning than hacking away at personal things, but for now...
def preprocess_bpd_data(df):
    """
    Converts the raw response from the boston data api to the drug-related incidents we want,
    marking fields for possession types.
    """
    all_possessions = (lambda df: df["offense_description"]
                       .apply(lambda x: 1 if x.startswith("DRUGS - POSS") else 0))
    distribute = (lambda df: df["offense_description"]
                  .apply(lambda x: 1 if re.match(".*INTENT.*", x) else 0))
    return (df
            .assign(occurred_on_date=lambda df: pd.to_datetime(df["occurred_on_date"]))
            .set_index("occurred_on_date")
            .sort_index()
            .query("offense_code_group == 'Drug Violation'")
            .assign(all_possession=all_possessions)
            .assign(distribute=distribute)
            .assign(poss_only=lambda df: df["all_possession"] - df["distribute"])
            )
bpd = preprocess_bpd_data(data)
bpd.head()
bpd.shape
From here, we could branch off into looking at things like BPD district-area stats. E.g., let's look at the percentage of Intent-to-Distribute possession incidents by reporting area:
(bpd
.groupby("district")
.agg({"all_possession": np.count_nonzero,
"distribute": np.count_nonzero,
"poss_only": np.count_nonzero,
})
.assign(dist_perc=lambda df: df["distribute"] / df["all_possession"])
.fillna(0)
.sort_values("dist_perc")
["dist_perc"]
.plot
.barh(title="Percentage of intent-to-distribute incidents per BPD precinct")
)
Let's also look over time to see what the frequency of incidents is like:
(bpd
.resample("D")
.size()
.rolling(7).mean()
.plot(title="all BPD drug crimes, 7-day rolling mean")
)
Mapping with Folium
Without diverging into more interesting stats, let's march forward to mapmaking. I would like to build a choropleth with some type of markers on it, and to do so we'll use the nifty MarkerCluster feature. Let's clean some of the bad values in the data and only take incidents for which we have valid geo data.
drugs = (bpd
.dropna(subset=["lat", "long", "district", "location"])
.query("lat > 0 and long != -1")
)
drugs
Following the examples from Folium, we'll set up our maps as follows:
# helper function to get the geo-coded values. the lat / lon columns are invalid
get_coords = lambda x: (x[0], x[1])
lats, lons = zip(*drugs["location"].apply(get_coords))
locations = list(zip(lats, lons))
from IPython.core.interactiveshell import InteractiveShell # really annoying output for the blog
# so lets set this for a single cell
InteractiveShell.ast_node_interactivity = "none"
mapa = folium.Map(location=[np.mean(lats), np.mean(lons)],
tiles='Cartodb Positron', zoom_start=12)
mc = folium.MarkerCluster().add_to(mapa)
for marker in drugs.to_records(): # too many samples
color = 'red' if marker["distribute"] == 1 else "blue"
loc = get_coords(marker["location"])
popup = "\n".join([str(marker["offense_description"]),
str(marker["street"])
])
(folium
.Marker(location=loc,
popup=popup,
icon=folium.Icon(color=color))
.add_to(mc)
);
mapa.save("/resources/base_map.html")
from IPython.display import IFrame
IFrame('/resources/base_map.html', width=700, height=450)
from IPython.core.interactiveshell import InteractiveShell # makes all cells allow multiple outputs
InteractiveShell.ast_node_interactivity = "all"
And now we have our base map! The marker cluster allows us to bin together a lot of markers.
It's lacking a bit though - let's make this into some type of choropleth map. Thankfully, BPD has their precinct boundaries available online, so we can grab those and easily join them with the existing pandas data. See this folium example for other details.
bpd_geojson_url = ("http://bostonopendata-boston.opendata.arcgis.com" +
"/datasets/9a3a8c427add450eaf45a470245680fc_5.geojson")
bpd_geojson = requests.get(bpd_geojson_url).json()
The geojson has several nested levels, but let's see what is in the properties attribute for a single precinct:
bpd_geojson["features"][0]["properties"]
Awesome, we can easily join on the district number. Let's aggregate the districts and calculate the proportion of incidents occurring in each precinct:
district_counts = (drugs
.groupby("district")
.size()
.to_frame(name="number")
.reset_index()
.assign(percentage=lambda df: df["number"] / df["number"].sum())
)
district_counts.sort_values("percentage")
Now let's redo our map with the new values and keep the markers as well:
# to see what the color bins will be like
list(np.linspace(0, district_counts["percentage"].max(), num=6))
lats, longs = zip(*drugs["location"].apply(get_coords))
locations = list(zip(lats, longs))
from IPython.core.interactiveshell import InteractiveShell # really annoying output for the blog
# so lets set this for a single cell
InteractiveShell.ast_node_interactivity = "none"
precinct_map = folium.Map(location=[np.median(lats), np.median(longs)],
tiles='Cartodb Positron',
zoom_start=11)
precinct_map.choropleth(geo_str=bpd_geojson, data=district_counts,
columns=["district", "percentage"],
key_on = 'feature.properties.DISTRICT',
threshold_scale=list(np.linspace(0, district_counts["percentage"].max(), num=6)),
fill_color = 'OrRd',
fill_opacity = 0.5,
line_opacity = 0.2,
legend_name="Percent of drug-related incidents per BPD District"
)
mc = folium.MarkerCluster().add_to(precinct_map)
for marker in drugs.to_records():
color = 'red' if marker["distribute"] == 1 else "blue"
loc = get_coords(marker["location"])
popup = "\n".join([str(marker["offense_description"]),
str(marker["street"])
])
(folium
.Marker(location=loc,
popup=popup,
icon=folium.Icon(color=color))
.add_to(mc)
);
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
precinct_map.save("/resources/precinct_map.html")
from IPython.display import IFrame
IFrame("/resources/precinct_map.html", width=700, height=450)
And here is a decent map. We could go a lot further, adding drop-downs and other interactivity, but I'll call it a day for now.
Perhaps this is no shock to anyone who lives here, but the incidents overwhelmingly come from the not-so-nice areas or areas with many people of color. Soon, I will refine this map to the zip-code level or use something like datashading to look for more nuanced patterns in the data.