Mapping Boston Drug Incidents with Folium

Content Warning: python, open data, data for good, pandas, geospatial data

While look at various data sources from the Boston area, I ran across the lovely Analyze Boston portal and decided to spin up a quick map of drug-related incidents via the current incident reporting portal. Working with these types of portals (socrata or OData) makes this process rather straighforward and we can get to an interactive map quick quickly with a bit of Pandas and some Folium-powered Leaflet.js. If you'd like to just see the map, skip to the end of the page.

Let's set up our notebook environment:

In [2]:
from IPython.core.interactiveshell import InteractiveShell # makes all cells allow multiple outputs
InteractiveShell.ast_node_interactivity = "all"
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
plt.style.use('bmh') # prettier plots
plt.rcParams['figure.figsize'] = (8, 6)
pd.options.display.float_format = '{:,.2f}'.format
pd.options.display.max_rows = 10


And import the libraries we'll need:

In [3]:
import re
import json
import folium
import requests


For a project like this that is essentially a script, defining these points as functions might be a hair bit of overkill, but I try to practice Safe Data Science© and write functions around main points in programs. A related project will be shared with other data-savvy folks and functions like this are far more readable to others, as they allow you to assign a name to a section of code.

Small functions for parts of your data-gathering-and-cleaning steps also allow you to make quick changes to those composable sections (and boy howdy are data cleaning operations composable...). When I was wrapping up this blog post, I noticed that I was using the soon-to-be deprecated Boston Open Data portal. In the interest of easy reproducibility, it was very easy to modify the following (old API) function from:


def get_bpd_data(n=200000):
url = ("https://data.cityofboston.gov/resource/29yf-ye7n.json" +
"?$limit={}".format(n)) data = pd.read_json(url) return data to the following: In [4]: def get_bpd_data(n=200000): url = ("https://data.boston.gov/datastore/odata3.0/" + "12cb3883-56f5-47de-afa5-3b1cf61b257b?$top={}&\$format=json".format(n))
r = requests.get(url)
data = pd.DataFrame(r.json()["value"])
data = data.rename(columns={k: k.lower() for k in data.columns})
data = data.assign(lat=data["lat"].astype(float))
data = data.assign(long=data["lat"].astype(float))
def convert_loc(x):
split = x.split(',')
split = [re.sub("[()]", '', i).strip() for i in split]
return float(split[0]), float(split[1])

data = data.assign(location=data["location"].apply(convert_loc))
return data


Rather quickly, especially given the lack of companion source file and ease of the notebook environment.

I also find it easier to build test or prototype scripts with this type of method, allowing for a variable pull from the source for quick development. Also,If you are going to put your notebook code into a standalone file or python library, pasting ready-made functions is rather convienent.

Anyway, as you can see, the basic flow is to grab the data from the Boston API, pull the json values into a pandas dataframe, alter the column names (this was honestly a compatibility step with the old method, which had all-lowercased column names), and ensure we parse the json coordinate data as two floats, not as strings.

I always begin with a bit of exploration and seeing what we have in the data.

In [5]:
data = get_bpd_data(250000)
data.shape
(data
.groupby("offense_code_group")
.size()
.sort_values()
.plot
.barh(title="Number of Incidents by offense code group")
)

Out[5]:
day_of_week district hour incident_number lat location long month occurred_on_date offense_code offense_code_group offense_description reporting_area shooting street ucr_part year _id
0 Friday B2 21 I172037475 42.33 (42.33248759, -71.08365801) 42.33 5 2017-05-12T21:19:21 3108 Fire Related Reports FIRE REPORT - HOUSE, BUILDING, ETC. 276 SHAWMUT AVE Part Three 2017 1
1 Friday 17 I172037474 42.30 (42.30490895, -71.06878406) 42.30 5 2017-05-12T17:34:00 3114 Investigate Property INVESTIGATE PROPERTY Part Three 2017 2
2 Friday B2 21 I172037473 42.32 (42.31847226, -71.08409289) 42.32 5 2017-05-12T21:42:00 802 Simple Assault ASSAULT SIMPLE - BATTERY 310 MARTIN LUTHER KING JR BLV Part Two 2017 3
3 Friday 21 I172037469 42.28 (42.28429865, -71.09154342) 42.28 5 2017-05-12T21:10:00 1848 Drug Violation DRUGS - POSS CLASS D - INTENT TO MFR DIST DISP Part Two 2017 4
4 Friday C11 19 I172037467 42.30 (42.30375506, -71.06839375) 42.30 5 2017-05-12T19:42:00 1402 Vandalism VANDALISM 341 TOPLIFF ST Part Two 2017 5
Out[5]:
(186495, 18)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x135d7c9e8>

It's good to see things like arson, home invasions, and homicide toward the bottom of the list...

Since we're focusing on drug-related violations, let's sort out what the code groups look like within the Drug Violation category.

In [6]:
(data
.assign(occurred_on_date=lambda df: pd.to_datetime(df["occurred_on_date"]))
.set_index("occurred_on_date")
.sort_index()
.query("offense_code_group == 'Drug Violation'")
["offense_description"]
.value_counts()
.sort_index()
.index
.tolist()
)

Out[6]:
['CONSPIRACY EXCEPT DRUG LAW',
'DRUGS - CLASS A TRAFFICKING OVER 18 GRAMS',
'DRUGS - CLASS B TRAFFICKING OVER 18 GRAMS',
'DRUGS - CLASS D TRAFFICKING OVER 50 GRAMS',
'DRUGS - CONSP TO VIOL CONTROLLED SUBSTANCE',
'DRUGS - GLUE INHALATION',
'DRUGS - OTHER',
'DRUGS - POSS CLASS A - HEROIN, ETC.',
'DRUGS - POSS CLASS A - HEROIN, ETC. ',
'DRUGS - POSS CLASS A - INTENT TO MFR DIST DISP',
'DRUGS - POSS CLASS B - COCAINE, ETC.',
'DRUGS - POSS CLASS B - INTENT TO MFR DIST DISP',
'DRUGS - POSS CLASS C',
'DRUGS - POSS CLASS C - INTENT TO MFR DIST DISP',
'DRUGS - POSS CLASS D',
'DRUGS - POSS CLASS D - INTENT TO MFR DIST DISP',
'DRUGS - POSS CLASS D - MARIJUANA, ETC.',
'DRUGS - POSS CLASS E',
'DRUGS - POSS CLASS E - INTENT TO MFR DIST DISP',
'DRUGS - POSS CLASS E INTENT TO MF DIST DISP',
'DRUGS - POSSESSION',
'DRUGS - POSSESSION OF DRUG PARAPHANALIA',
'DRUGS - SALE / MANUFACTURING',
'DRUGS - SICK ASSIST - HEROIN',
'DRUGS - SICK ASSIST - OTHER HARMFUL DRUG',
'DRUGS - SICK ASSIST - OTHER NARCOTIC']

You might also note that I am using the method-chaining / functional style in pandas, which is not always what you tend to see. I find it great for several reasons:

• You keep your damn namespace clean:

This is crucial overall, but even more so for the interactive environment. How many times have you used quick names like df, tmp, or tmp_with_something? You can get cluttered quickly, as well as wind up typing a ton and having to put more and more on your mental stack to keep up.

• Rapid changes to what you want to see from a data operation:

In the notebook environment, you can rapidly copy and paste a cell, and with method chaining, you can alter a single part of the chain to achieve similar or extend the results. For example, now that we have a raw list of the codes from which we can figure out what they mean, it might be nice to see a visualization or counts of them. We'll do that below with minimal efforts:

It took a bit of googling to find a good set of explainations for these codes from here.

In [7]:
(data
.assign(occurred_on_date=lambda df: pd.to_datetime(df["occurred_on_date"]))
.set_index("occurred_on_date")
.sort_index()
.query("offense_code_group == 'Drug Violation'")
["offense_description"]
.value_counts()
# .sort_index()
# .index
# .tolist()
.plot.barh()
)

Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x117b6b4e0>

Note that I am intentionally leaving the commented lines to explicitly point out how quick it is to modify the previous cell.

This is a powerful way of interacting with the notebook, particularly when your data is "small", e.g., results from your processing chunks are delivered in near realtime or within some tolerable delay (for me, usually a few seonds). If the results of a step, be it aggregation or cleaning, take too long or will be reused a lot downstream, it's probably worth assigning the results to a variable and/or wrapping the processing chunk in a function. I don't normally do this until I get past a particular exploratory stage, where "particular" is some function of the project at hand, how much time I have, and if these stages are going to be used in a library or bigger project.

Anway... regarding the drug offense codes, it took a minute to find a good explaination for them and to know what to work with. I am not a drug policy, legal, or policing expert, and for any legal / policy project to which I might contribute I would certainly work with domain experts in understanding those codes, particularly the nuances I might miss when finding explainations online. It took a bit of googling to find a good set of explainations for these codes from here, but it seems that I can divde them into two major groups for visualization:

• general possession of drugs
• intent to distribute: far more serious than mere possession

With that being said, on to processing. The following lambda functions are to ease the data extaction; if this was going to go into a library or source file, I would likely put them into a proper function and add some documentation as to my choices, mabye even with reference to some documentation or logic as to why. Policy-affecting projects are probably a bit more sensitive to design reasoning than hacking away at personal things, but for now...

In [8]:
def preprocess_bpd_data(df):
"""
Converts the raw response from the boston data api to the drug-related incidents we want,
marking fields for possession types.
"""
all_possessions = (lambda df: df["offense_description"]
.apply(lambda x: 1 if x.startswith("DRUGS - POSS") else 0))
distribute = (lambda df: df["offense_description"]
.apply(lambda x: 1 if re.match(".*INTENT.*", x) else 0))
return (df
.assign(occurred_on_date=lambda df: pd.to_datetime(df["occurred_on_date"]))
.set_index("occurred_on_date")
.sort_index()
.query("offense_code_group == 'Drug Violation'")
.assign(all_possession=all_possessions)
.assign(distribute=distribute)
.assign(poss_only=lambda df: df["all_possession"] - df["distribute"])
)

In [9]:
bpd = preprocess_bpd_data(data)
bpd.shape

Out[9]:
day_of_week district hour incident_number lat location long month offense_code offense_code_group offense_description reporting_area shooting street ucr_part year _id all_possession distribute poss_only
occurred_on_date
2015-06-15 00:01:00 Monday C11 0 I152049594 42.31 (42.310434, -71.0613401) 42.31 6 1874 Drug Violation DRUGS - OTHER 289 HANCOCK ST Part Two 2015 186057 0 0 0
2015-06-15 00:01:00 Monday C11 0 I152049593 42.31 (42.310434, -71.0613401) 42.31 6 1874 Drug Violation DRUGS - OTHER 289 HANCOCK ST Part Two 2015 186058 0 0 0
2015-06-15 07:30:00 Monday B2 7 I152049702 42.33 (42.32866284, -71.08563401) 42.33 6 1874 Drug Violation DRUGS - OTHER 282 WASHINGTON ST Part Two 2015 185985 0 0 0
2015-06-15 11:09:00 Monday C11 11 I152049550 42.30 (42.30462277, -71.07958027) 42.30 6 1849 Drug Violation DRUGS - POSS CLASS B - COCAINE, ETC. 460 GLENARM ST Part Two 2015 186082 1 0 1
2015-06-15 15:00:00 Monday E13 15 I152049634 42.30 (42.30481623, -71.1091185) 42.30 6 1842 Drug Violation DRUGS - POSS CLASS A - HEROIN, ETC. 569 BROOKLEY RD Part Two 2015 186030 1 0 1
Out[9]:
(10102, 20)

From here, we could branch off into looking at things like BPD district-area stats. E.g., let's look at the percentage of Intent-to-Distribute possession incidents by reporting area:

In [10]:
(bpd
.groupby("district")
.agg({"all_possession": np.count_nonzero,
"distribute": np.count_nonzero,
"poss_only": np.count_nonzero,
})
.assign(dist_perc=lambda df: df["distribute"] / df["all_possession"])
.fillna(0)
.sort_values("dist_perc")
["dist_perc"]
.plot
.barh(title="Percentage of intent-to-distribute incidents per BPD precinct")
)

Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x140551c18>

Let's also look over time to see what the frequency of incidents is like:

In [10]:
(bpd
.resample("D")
.size()
.rolling(7).mean()
.plot(title="all BPD drug crimes, 7-day rolling mean")
)

Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x116d49630>

Without diverging into some more interesting stats, let's march forward to the map. Let's clean some of the bad values in the data and only take incidents for which we have valid geo data.

Mapping with Folium¶

Without diverging into some more interesting stats, let's march forward mapmaking. I would like to build a choropleth with some type of markers on it, and to do so, we'll use the nifty MarkerCluster feature. Let's clean some of the bad values in the data and only take incidents for which we have valid geo data.

In [11]:
drugs = (bpd
.dropna(subset=["lat", "long", "district", "location"])
.query("lat > 0 and long != -1")
)

drugs

Out[11]:
day_of_week district hour incident_number lat location long month offense_code offense_code_group offense_description reporting_area shooting street ucr_part year _id all_possession distribute poss_only
occurred_on_date
2015-06-15 00:01:00 Monday C11 0 I152049594 42.31 (42.310434, -71.0613401) 42.31 6 1874 Drug Violation DRUGS - OTHER 289 HANCOCK ST Part Two 2015 186057 0 0 0
2015-06-15 00:01:00 Monday C11 0 I152049593 42.31 (42.310434, -71.0613401) 42.31 6 1874 Drug Violation DRUGS - OTHER 289 HANCOCK ST Part Two 2015 186058 0 0 0
2015-06-15 07:30:00 Monday B2 7 I152049702 42.33 (42.32866284, -71.08563401) 42.33 6 1874 Drug Violation DRUGS - OTHER 282 WASHINGTON ST Part Two 2015 185985 0 0 0
2015-06-15 11:09:00 Monday C11 11 I152049550 42.30 (42.30462277, -71.07958027) 42.30 6 1849 Drug Violation DRUGS - POSS CLASS B - COCAINE, ETC. 460 GLENARM ST Part Two 2015 186082 1 0 1
2015-06-15 15:00:00 Monday E13 15 I152049634 42.30 (42.30481623, -71.1091185) 42.30 6 1842 Drug Violation DRUGS - POSS CLASS A - HEROIN, ETC. 569 BROOKLEY RD Part Two 2015 186030 1 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2017-05-12 18:00:00 Friday E18 18 I172037420 42.28 (42.28446743, -71.11183089) 42.28 5 1849 Drug Violation DRUGS - POSS CLASS B - COCAINE, ETC. 503 AMERICAN LEGION HWY Part Two 2017 36 1 0 1
2017-05-12 19:00:00 Friday C6 19 I172037445 42.34 (42.33670628, -71.05330809) 42.34 5 1842 Drug Violation DRUGS - POSS CLASS A - HEROIN, ETC. 936 D ST Part Two 2017 14 1 0 1
2017-05-12 20:16:00 Friday B2 20 I172037451 42.32 (42.32224255, -71.08277356) 42.32 5 1832 Drug Violation DRUGS - SICK ASSIST - OTHER HARMFUL DRUG 297 KENSINGTON PARK Part Two 2017 11 0 0 0
2017-05-12 21:05:00 Friday C6 21 I172037463 42.33 (42.32894903, -71.05739663) 42.33 5 1830 Drug Violation DRUGS - SICK ASSIST - HEROIN 191 BOSTON ST Part Two 2017 6 0 0 0
2017-05-12 21:10:00 Friday 21 I172037469 42.28 (42.28429865, -71.09154342) 42.28 5 1848 Drug Violation DRUGS - POSS CLASS D - INTENT TO MFR DIST DISP Part Two 2017 4 1 1 0

9030 rows × 20 columns

Following the examples from Folium, we'll set up our maps as follows:

In [12]:
# helper function to get the geo-coded values. the lat / lon columns are invalid
get_coords = lambda x: (x[0], x[1])
lats, lons = zip(*drugs["location"].apply(get_coords))
locations = list(zip(lats, lons))

In [13]:
from IPython.core.interactiveshell import InteractiveShell # really annoying output for the blog
# so lets set this for a single cell
InteractiveShell.ast_node_interactivity = "none"

In [14]:
mapa = folium.Map(location=[np.mean(lats), np.mean(lons)],
tiles='Cartodb Positron', zoom_start=12)

for marker in drugs.to_records(): # too many samples
color = 'red' if marker["distribute"] == 1 else "blue"

loc = get_coords(marker["location"])
popup = "\n".join([str(marker["offense_description"]),
str(marker["street"])
])

(folium
.Marker(location=loc,
popup=popup,
icon=folium.Icon(color=color))
);

In [15]:
mapa.save("/resources/base_map.html")

In [32]:
from IPython.display import IFrame
IFrame('/resources/base_map.html', width=700, height=450)

Out[32]:
In [30]:
from IPython.core.interactiveshell import InteractiveShell # makes all cells allow multiple outputs
InteractiveShell.ast_node_interactivity = "all"


And now we have our base map! The marker cluster allows us to bin together a lot of markers.

It's lacking a bit though - let's make this into some type of choropleth map. Thankfully, BPD has their precinct boundaries available online, so we can grab those and easily join them with the existing pandas data. See this folium example for other details.

In [19]:
bpd_geojson_url = ("http://bostonopendata-boston.opendata.arcgis.com" +
bpd_geojson = requests.get(bpd_geojson_url).json()


The geojson has several nested levels, but let's see what is in the properties attribute for a single precinct:

In [20]:
bpd_geojson["features"][0]["properties"]

Out[20]:
{'BPDGIS_GIS': 'A',
'DISTRICT': 'A15',
'DISTRICT_': 15,
'DISTRICT__': '15',
'ID': 'A15',
'OBJECTID': 1,
'ShapeSTArea': 37982842.515625,
'ShapeSTLength': 57556.859964083764}

Awesome, we can easily join on the district number. Let's aggregate the districts and calculate the proportions of incidents occuring in each precinct:

In [21]:
district_counts = (drugs
.groupby("district")
.size()
.to_frame(name="number")
.reset_index()
.assign(percentage=lambda df: df["number"] / df["number"].sum())
)

In [22]:
district_counts.sort_values("percentage")

Out[22]:
district number percentage
0 59 0.01
2 A15 184 0.02
8 D14 331 0.04
12 E5 421 0.05
11 E18 510 0.06
... ... ... ...
7 C6 912 0.10
9 D4 1037 0.11
1 A1 1188 0.13
4 B2 1206 0.13
6 C11 1358 0.15

13 rows × 3 columns

now let's redo our map with the new values and keep the markers as well:

In [23]:
#  to see what the color bins will be like
list(np.linspace(0, district_counts["percentage"].max(), num=6)),

Out[23]:
([0.0,
0.030077519379844965,
0.06015503875968993,
0.090232558139534902,
0.12031007751937986,
0.15038759689922482],)
In [24]:
lats, longs = zip(*drugs["location"].apply(get_coords))
locations = list(zip(lats, longs))

In [33]:
from IPython.core.interactiveshell import InteractiveShell # really annoying output for the blog
# so lets set this for a single cell
InteractiveShell.ast_node_interactivity = "none"

In [26]:
precinct_map = folium.Map(location=[np.median(lats), np.median(longs)],
tiles='Cartodb Positron',
zoom_start=11)

precinct_map.choropleth(geo_str=bpd_geojson,  data=district_counts,
columns=["district", "percentage"],
key_on = 'feature.properties.DISTRICT',
threshold_scale=list(np.linspace(0, district_counts["percentage"].max(), num=6)),
fill_color = 'OrRd',
fill_opacity = 0.5,
line_opacity = 0.2,
legend_name="Percent of drug-related incidents per BPD District"
)

In [34]:
mc = folium.MarkerCluster().add_to(precinct_map)
for marker in drugs.to_records():
color = 'red' if marker["distribute"] == 1 else "blue"

loc = get_coords(marker["location"])
popup = "\n".join([str(marker["offense_description"]),
str(marker["street"])
])

(folium
.Marker(location=loc,
popup=popup,
icon=folium.Icon(color=color))
);

In [35]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [36]:
precinct_map.save("/resources/precinct_map.html")

In [37]:
from IPython.display import IFrame
IFrame("/resources/precinct_map.html", width=700, height=450)

Out[37]:

And here is a decent map. We can go a lot farther, in adding drop-downs and other interactivity, but I'll call it a day for now.

Perhaps this is no shocker to anyone who lives here, but overwhelmingly the incidents come from the not-so-nice areas or areas with many people of color. Soon, I will refine this map to zip-code level or use something like datashading to look for more nuanced patterns in the data.

Written by Aaron Gonzales

I'm a typically atypical guy.

Updated