Visualizing incident report patterns from the Boston Police Department

Recently, I wrote a post about mapping incidents from the Boston Police Department. While the resultant choropleth wasn't a bad start, it does leave a bit left to be desired, and I didn't spend much time looking at the rest of the dataset.

I recall seeing Datashader before and thought this might be a nice opportunity to use it. The process of working through examples led me to see both Geoviews and Holoviews, which led me down a bit of a GIS / visualization rabbit hole... Which leads me here.

needless commentary about plotting in python

The Python data ecosystem (python, numpy, scipy, pandas, etc.) is rich and wonderful, though historically visualizations have been a pain point. There is no shortage of posts covering comparisons between R/Ggplot2 and python's varying visualization frameworks (matplotlib, seaborn, pandas built-ins, plotly, yhat's ggplot, altair, pygal, folium, bokeh). (My favorite post is this dramatic tour.)

As a data scientist who regularly works on implementing projects into production, I appreciate the flexibility that lower-level frameworks (or frameworks that expose lower-level primitives ) give in allowing custom code.

This is seldom the case with visualizations for in-notebook dataset exploration. I cannot stand plotting things in raw matplotlib and would love to have a robust declarative library that gives me access to lower-level plot objects for customization when you need it, with a nice functional syntax. Altair is probably the most promising of the bunch for rapid ease of use and semantics, but it's difficult to work with in-notebook plots without futzing around with giant notebook sizes (Altair embeds all plot data as JSON in the page and uses Vega to render it.) Bokeh is great for interactivity and I think as the Chart API smooths out a bit it will really start to shine even more.

Anyway, it was fun to learn a bit about holoviews and I will keep it in mind for certain types of data when needed. I like the API and ease of combining plot types, as we'll see later.

moving on

As before, I'm using the Analyze Boston portal to get geo-coded incidents via the current incident reporting page.

Let's set up our notebook environment:

from IPython.core.interactiveshell import InteractiveShell # makes all cells allow multiple outputs
from IPython.core.display import HTML, display
from IPython.display import IFrame
InteractiveShell.ast_node_interactivity = "all"
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
plt.style.use('bmh') # prettier plots
plt.rcParams['figure.figsize'] = (8, 6)
pd.options.display.float_format = '{:,.2f}'.format
pd.options.display.max_rows = 10
display(HTML("<style>.container { width:90% !important; }</style>"))

There are a lot of imports here. We'll also define some useful constants and settings, including a set of tiles for geographic projection later.

import itertools as it
from functools import partial
import re
import json
import folium
import requests

import numpy as np
import pandas as pd
import holoviews as hv
from holoviews.operation.datashader import aggregate, shade, datashade, dynspread

import datashader as ds
from datashader.utils import export_image
from datashader.colors import colormap_select, Greys9, inferno
import datashader.transfer_functions as tf

import geoviews as gv
import cartopy
from cartopy import crs
from cartopy import feature as cf
from geoviews import feature as gf
from bokeh import palettes
from bokeh.tile_providers import STAMEN_TONER
from bokeh.models import WMTSTileSource

background = "black"
export = partial(export_image, background = background, export_path="export")
cm = partial(colormap_select, reverse=(background!="black"))
dynspread.max_px=20
dynspread.threshold=0.5
hv.notebook_extension('mpl', 'bokeh')
tiles = {'OpenMap': WMTSTileSource(url='http://c.tile.openstreetmap.org/{Z}/{X}/{Y}.png'),
         'ESRI': WMTSTileSource(url='https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.jpg'),
         'Wikipedia': WMTSTileSource(url='https://maps.wikimedia.org/osm-intl/{Z}/{X}/{Y}@2x.png'),
         'Stamen Toner': STAMEN_TONER}

As mentioned in the previous post about boston crime data, here is our main entry point to read the data. I've added some conversion options here, making use of the pandas Categorical data type. We also have a method to convert lat/lon coordinates to web mercator format.

def get_bpd_data(n=200000):
    dow = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
    url = ("https://data.boston.gov/datastore/odata3.0/"
           "12cb3883-56f5-47de-afa5-3b1cf61b257b?$top={}&$format=json".format(n))
    r = requests.get(url)
    data = pd.DataFrame(r.json()["value"])
    data = data.rename(columns={k: k.lower() for k in data.columns})
    # data = data.assign(lat=data["lat"].astype(float))
    # data = data.assign(long=data["lat"].astype(float))
    def convert_loc(x):
        split = x.split(',')
        split = [re.sub("[()]", '', i).strip() for i in split]
        return float(split[0]), float(split[1])
    
    data = (data
            .assign(location=data["location"].apply(convert_loc))
            .assign(lat=lambda df: df["location"].apply(lambda x: x[0]))
            .assign(long=lambda df: df["location"].apply(lambda x: x[1]))
            .assign(hour=lambda df: df['hour'].astype(int))
            .assign(offense_code_group=lambda x: pd.Categorical(x["offense_code_group"]))
            .assign(day_of_week=lambda x: pd.Categorical(x["day_of_week"], categories=dow))
            .assign(month=lambda x: pd.Categorical(x["month"].astype("int")))
           )
    return data

def latlng_to_meters(df, lat_name, lng_name):
    """
    Taken and modified from the datashader notebooks 
    """
    lat = df[lat_name]
    lng = df[lng_name]
    origin_shift = 2 * np.pi * 6378137 / 2.0
    mx = lng * origin_shift / 180.0
    my = np.log(np.tan((90 + lat) * np.pi / 360.0)) / (np.pi / 180.0)
    my = my * origin_shift / 180.0
    return df.assign(mx=mx).assign(my=my)

So let's read some data and get on to plotting.

data = get_bpd_data(250000)

For this post, we'll want to clean up the geo data and redfine a few categorical variables. It's a bit script-y, but this almost never needs to be tinkered with.

drops = ["offense_description", "occurred_on_date", "shooting",
         "reporting_area", "street", "ucr_part", "_id", "incident_number"]

dow = [ "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",]
tod = ["late_night", "morning", "midday", "drive_home", "evening", "night"]
top_cats = 'Drug Violation', 'Investigate Person', 'Other', 'Larceny', 'Motor Vehicle Accident Response'
pdata = (data
         .drop(drops, axis=1)
         .query("offense_code_group in @top_cats")
         .query("lat > 40 and long < -70")
         .pipe(latlng_to_meters, "lat", "long")
         .assign(offense_code_group=lambda x: pd.Categorical(x["offense_code_group"].astype("str")))
         .assign(tod=lambda x: pd.cut(x.hour,
                                      [0, 6, 10, 15, 18, 21,  24],
                                      include_lowest=True,
                                      labels=tod))
        )
boston_coords = (pdata["lat"].min() , pdata["lat"].max()), (pdata["long"].min(), pdata["long"].max())
lx_range = pdata["lat"].min(), pdata["long"].max()
ly_range = pdata["lat"].min(), pdata["long"].max()
x_range = pdata["mx"].min(), pdata["mx"].max()
y_range = pdata["my"].min(), pdata["my"].max()

# I left these in here for posterity. I had wanted to explore other aspects. 
tod_colors = dict(zip(tod,  palettes.Category10[len(tod)] ))
dow_colors = dict(zip(dow,  palettes.Category10[len(dow)]))
offense_colors = dict(zip(top_cats,  palettes.Category10[len(top_cats)] )) 

Using Holoviews

Holoviews is a radically different way of working with data. You define a container for the data and use that to define operations. Perhaps the most simple format is the Table container.

While holoviews does support aggregation within a dataset, in this case it was a bit more clear to do it up front. We'll take the "log" or "melted" format of our data (one row per observation) and aggregate down to some counts per category. From there, we can create a Table and pass the grouping columns as indexers, with the count value as the ... value dimension.

Note the inclusion of the IFrame loading the chart back into the notebook. Holoviews does not support exporting from the notebook the plots correctly via Bokeh as a backend. I'm sure I might be able to figure out a more elegant workaround, but saving the charts to a file, using shell tools to add in the bokeh JS headers, and loading it into an iframe for blogging seems to have worked for now. If you happen to copy this code, you'll get live plots within the notebook.

It's quite inelegant, but the plots look great.

tdata = (pdata
         .groupby(["day_of_week", "hour", "offense_code_group"])
         .size()
         .to_frame("count")
         .reset_index()
        )
tdata = hv.Table(tdata, kdims=["day_of_week", "hour", "offense_code_group"], vdims=[ "count"])

From here, we can rapidly visualize the data, though the holoviews semantics are a bit odd to learn. In a notebook, you can specify your chart options up front via the %%opts syntax.

From what I can tell, when you specify a transformation to a chart type, holoviews computes everything up front, and if you have multiple dimensions you can slice one out like this heatmap of drug violations by day and hour.

%%output filename="bpd_crimemap_advanced_hm_dv_simp"
%%opts HeatMap [width=500, height=500 colorbar=True tools=["hover"]] (cmap="Blues")
(tdata.to.heatmap(["day_of_week", "hour"], ["count"], )
 .select(offense_code_group="Drug Violation")
 .relabel("Drug Violations") # change the title
)
IFrame('./bpd_crimemap_advanced_hm_dv_simp.html', width=750, height=500)

Now, if we want to collapse across the offense_code_group category to see the pattern of all violations, we can aggregate as such within our Table. (It's essentially the same as a groupby in pandas but the aggregation function is passed explicitly.)

%%output filename="bpd_crimemap_advanced_hm_all_agg"
%%opts HeatMap [width=600, height=400 colorbar=True, tools=["hover"] toolbar="above"]  (cmap="Blues")
(tdata
 .aggregate(["hour", "day_of_week"], function=np.sum)
 .to.heatmap(["hour", "day_of_week"], ["count"], )
).relabel("Violations by day and hour")
IFrame('./bpd_crimemap_advanced_hm_all_agg.html', width=750, height=500)

We can see that overwhelmingly, violations tend to happen during the day, particularly in the later afternoon. There is a bit of a shift in this pattern on weekends.

Let's break out a similar chart by the type of offense. Now for a bit more of a fancy layout, we can generate an aptly-named Layout, which takes a list of charts. Note the explict selection and relabeling operations with the offense codes. Unfortunetely, the color bars ranges are not uniform.

So it goes.

%%output filename="bpd_crimemap_advanced_hm_iter"
%%opts HeatMap [width=420, height=275 colorbar=True tools=["hover"] show_title=True]  (cmap="Blues")
%%opts Layout [aspect_weight=1 fig_size=150 sublabel_position=(-0.2, 1.)]
hv.Layout([tdata.to.heatmap(["hour", "day_of_week"], ["count"])
           .select(offense_code_group=ofg)
           .relabel(ofg)
           for ofg in offense_colors.keys()]).cols(2)
WARNING:root:LayoutPlot14100: None is empty, skipping subplot.
WARNING:root:LayoutPlot14280: None is empty, skipping subplot.
IFrame('bpd_crimemap_advanced_hm_iter.html', width=800, height=800)

Now we can see some details:

  • Drug violations have an extremely tight pattern, overwhelmingly occuring during the early evening hours during the week. Apparently people don't seek drugs on Sunday all that much.
  • Car accidents occur mostly during the afternoon rush hour, but also have a slight increase in the morning rush hour period. This is shifted wildly on the weekends, where many late-night and early-morning accidents occur.

Holoviews gives the option of passing the full dataset with your key dimensions, adding interactivity with the remaining key dimensions. We rearrange the chart here to look at hourly variations among offense types.

%%output filename="bpd_crimemap_advanced_hm_iter_interact"
%%opts HeatMap [width=600, height=400 tools=["hover"]]  (cmap="Blues")
tdata.to.heatmap([ "hour", "offense_code_group",], ["count"])
IFrame('bpd_crimemap_advanced_hm_iter_interact.html', width=900, height=500)

Geo level

So, the heatmaps are really great ways of visualizing patterns across categories, but with this dataset, we do have the geospatial information associated with each stop. It makes sense, like in the previous post, to look at this within the context of the city itself. Geoviews, a companion to Holoviews, combined with Datashader, can simplify some of that work.

We do not need to aggregate down to the level of a choropleth, we want the data to speak for itself without enforcing a form on to it first. This roughly follows the data-cenric trend of machine learning overall, learning from the data directly vs assuming or forcing it to follow your favorite distribution.

Note that this part is a bit funky to work with at times and the plots probably won't work as smoothly in your browser as they do in a notebook. If you want, download the notebook do this yourself.

I have heavily leaned on samples from the following sources to build this:

I already loaded the relevant libraries in the top of the notebook; I dislike the clutter of having imports scattered around. :)

This function helps generate named color points for each category in a set of categories.

def gen_col_points(categories, colormap):
    inv_cats = {k: k for k in categories}
    color_points = hv.NdOverlay({inv_cats[k]: gv.Points([0,0],
                                                        crs=crs.PlateCarree(),
                                                        label=inv_cats[k])
                                 (style=dict(color=v))
                                 for k, v in colormap.items()})
    return color_points

Now, the method for generating the plotting data follows a few conventions.

  • since we'll use datashading, we want to set those options here.
  • we make a Points dataset, keying on the actual web mercator coordinates mx, my
  • use datashader to aggregate down to the categorical level you want
  • overlay the points and legend on a set of tiles (defined earlier).

The options help define our basic styling elements, and bokeh gives us interactivity.

%%output filename="bpd_datashaded_points"
%%opts Overlay [width=800 height=600 xaxis=None yaxis=None show_grid=False ] (background_alpha=0.1) 
%%opts Shape (fill_color=None line_width=1.5) [apply_ranges=False] 
%%opts Points [apply_ranges=False tools=[]]
%%opts WMTS (alpha=0.25)

shade_defaults = dict(x_range=(x_range[0] + 10000, x_range[1] - 5000),
                      y_range=(y_range[0] + 8500, y_range[1] - 6000),
                      width=1200,
                      height=660)

shaded_points = datashade(hv.Points(gv.Dataset(pdata, kdims=["mx", "my"], vdims=["offense_code_group"])),
                          cmap=offense_colors,
                          element_type=gv.Image,
                          aggregator=ds.count_cat("offense_code_group"),
                          **shade_defaults
                         )

color_points = gen_col_points(offense_colors.keys(), offense_colors)

map_ = gv.WMTS(tiles["Stamen Toner"]) * dynspread(shaded_points, max_px=10, threshold=0.8) * color_points
map_
IFrame('bpd_datashaded_points.html', width=900, height=650)

The datashaded version only works well in the live notebook, but you can get a sense of what is happening from the high level.

To see the points explicitly plotted without shading (which might not really be needed in this case as we don't have too many points), we can use this method.

%%output filename="bpd_nonshaded_points"
%%opts Overlay [width=800 height=700 xaxis=None yaxis=None show_grid=False] 
%%opts Points [apply_ranges=True color_index=2 tools=['tap', "resize", "hover", "box_zoom"]]
%%opts Points (cmap="gist_rainbow_r" alpha=0.5 size=1.25)
%%opts WMTS (alpha=0.4)

kdims=[ "lat", "long", ]
vdims = ["offense_code_group", "day_of_week", "hour"]
gvdata = gv.Dataset(pdata,
                    kdims=kdims,
                    vdims=vdims,
                   )

points = gvdata.to(gv.Points, 
                   kdims=['long', 'lat'],
                   vdims=["offense_code_group", "day_of_week"],
                   crs=crs.PlateCarree()
                  )

gv.WMTS(tiles["Stamen Toner"]) * points
IFrame('bpd_nonshaded_points.html', width=800, height=800)

I cannot for the life of me figure out how to pass a custom color mapping to the bokeh backend. Anyway, zoom around in the map to see some minor patterns in the violations data - drug incidents overwhelmingly occur in a few concentrated areas, accidents occur along major commuting routes, etc.

I'll revisit this data again and wrap up some observations in a future post, aimed at a broader audience.