Gender, Pay, UNM, Bokeh

Content Warning: python, bokeh, data science, data visualization, feminism

The glass ceiling is a well-studied and oft-commented issue in and out of academia. Just today, the New York Times published a long article about gender pay discrimination.

The University of New Mexico (UNM) publishes a list of employee salaries and grants on Sunshine@UNM and I decided to grab it and do some custom plotting as an excuse to learn Bokeh, as data visualization is probably my weakest data science skillset. I will refrain from much commentary on pay issues for the time being and focus more on the analysis process and visualization.

Preprocessing

We'll be using the python data stack -- pandas, numpy, etc.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import itertools as it
import glob

from IPython.display import HTML, display
from matplotlib.colors import LinearSegmentedColormap
from matplotlib.lines import Line2D
plt.rcParams['figure.figsize'] = (20, 7)


pd.options.display.max_rows = 20
plt.style.use('bmh')
%matplotlib inline

From Sunshine@UNM, I gathered a csv version of the salary book and did very minor preprocessing (removing empty lines, adding quotes for ease of reading, removing '$' from the salary field, etc.) in my shell, saving the results to the file being read into Pandas below.

For most of the post, i'm not going to publically reveal names with salary, but some of these are pretty easy to figure out.

In [2]:
# a function to remove names from printing
private_printer = lambda x: x.head(10).ix[:, 1:]

salary = pd.read_csv("./unm_salary.csv")
salary.columns = ['name', 'department', 'grade', 'job_title', 'type', 'appt', 'annual']
private_printer(salary.sort_values(by='annual', ascending=False))
Out[2]:
department grade job_title type appt annual
9281 VP Health Sciences Office 99 Dean Primary 100 656500
7356 Radiology General 99 Chair Primary 100 599000
9365 SOM Surgery Administration 99 Chairperson Primary 100 534040
3155 OB GYN Reproductive Hlth & Fam Pln 99 Chairperson Primary 100 485810
1747 SOM Neurosurgery 99 Assistant Professor Primary 100 484800
6650 SOM Neurosurgery 99 Clinician Ed-Assoc Prof Primary 100 478240
12019 SOM Neurosurgery 99 Chairperson Primary 100 454667
10010 SOM Neurosurgery 99 Assistant Professor Primary 100 454500
8949 Surgery Thoracic 99 Professor Primary 100 441986
10701 SOM Neurosurgery 99 Clinician Ed-Assoc Prof Primary 100 429512

UNM's system has all professor and academic admin jobs in salary grade 99, but this also includes lots of residents, postdocs, and so forth that will distort our findings.

In [3]:
private_printer(salary[salary.grade == '99'])
Out[3]:
department grade job_title type appt annual
0 Electrical Computer Engineering 99 Research Asst Professor Primary 100 87500
2 Psych Geriatrics 99 Associate Professor Primary 100 167074
3 Provost Office Staff 99 Provost/Exec VP for AA Primary 100 286443
5 Grad Med Education Housestaff 99 Resident Physicians I Primary 100 49556
12 Gallup Branch 99 Assistant Professor Primary 100 48000
14 Accessibility Resource Center 99 Pre-Cert Signed Lang Interp Primary 25 10400
15 Taos Branch 99 Temp Parttime Faculty Primary 40 13259
21 School of Law 99 Non Credit Teacher Primary 8 0
23 Dental Hygiene 99 Assistant Professor Primary 100 60274
24 Grad Med Education Housestaff 99 Resident Physicians I Primary 100 49556

We'll begin by cleaning up the data a bit more.

What I'd like to have in the end of this process is:

  • a dataframe of non-adjuct professors (assistant, associate, full, etc).
  • who have full appointments
  • with any obvious erros fixed

Let's take a look at basic departmental summary stats, filtering out professors that do not have full appointments (salaries). I was going to estimate the 100% appointment rate for every professor, but some may have appointments in different departments or in some cases the value given would not be representative of a 100% appointment, so I skipped it and am only looking at those professors with 100% appointments. For those profs who have a secondary appointment, I'll get their total salary and assign it to the "primary" appointed department and ignore the secondary spot, as it would create big outliers down the way.

I'll also include only departments with four or more profs.

In [4]:
# clinical appointments often say "clinician ed-assoc prof"
clin = salary[salary.job_title.str.contains('[Pp]rof$')]
nonclin = salary[salary.job_title.str.contains('[Pp]rofessor')]
profs = pd.concat([clin, nonclin])
# profs = salary[salary.job_title.str.contains('[Pp]rof')]
profs = profs[profs.appt == 100]

# Identify profs that have multiple appointments and combine salary into one total
# because this really creates some outliers with small salaries
sec = list(profs[profs['type'] == "Secondary"].name)
double = lambda x: profs[profs.name == x]
_sum = lambda x: double(x).annual.sum()
newsal = [_sum(x) for x in sec]
profs.loc[(profs.name.isin(sec)) & (profs['type'] == 'Primary'), 'annual'] = newsal
profs = profs[profs['type'] == 'Primary'].copy()
In [5]:
# aggregate some summary stats
gbdept = profs.groupby('department')['annual']
aggfuncs = [np.mean, np.median, np.std, np.count_nonzero, np.min, np.max]
gbdept = gbdept.agg(aggfuncs)

print(gbdept.count_nonzero.describe())
gbdept.sort_values(by='count_nonzero', ascending=False)
# take only those dept with a few profs
# gbdept = gbdept[gbdept.count_nonzero >= 3]
count    172.000000
mean       9.877907
std       10.346747
min        1.000000
25%        2.750000
50%        6.000000
75%       14.000000
max       59.000000
Name: count_nonzero, dtype: float64
Out[5]:
mean median std count_nonzero amin amax
department
Family Community Medicine FCM 146225.389831 150913.0 37479.262229 59 75035 265937
Pathology 177280.723404 173417.0 63674.969255 47 54000 372553
AS Biology General Administrative 88220.454545 84210.0 20874.310967 44 44500 153434
Emergency Medicine 196615.756757 179780.0 34703.803546 37 179780 344000
Art Art History 69705.583333 63139.5 18571.645985 36 45000 122154
English Department 71101.647059 66805.0 13569.978524 34 52500 106712
College of Nursing 113291.090909 109606.0 19572.267546 33 91809 165870
School of Law 117848.696970 120581.0 28495.728559 33 50000 197273
Physics Astronomy Department 93865.656250 86904.5 25957.668303 32 55000 153053
Gallup Branch 57258.468750 52581.5 14805.748443 32 45000 123691
... ... ... ... ... ... ...
History 52500.000000 52500.0 NaN 1 52500 52500
SAAP Landscape Architecture Program 65460.000000 65460.0 NaN 1 65460 65460
IM Administration 207327.000000 207327.0 NaN 1 207327 207327
Pediatrics 300000.000000 300000.0 NaN 1 300000 300000
Psychiatry Psych 85850.000000 85850.0 NaN 1 85850 85850
Museum Studies 71431.000000 71431.0 NaN 1 71431 71431
Neonatology Developmental Care 92829.000000 92829.0 NaN 1 92829 92829
Presidents Office 188747.000000 188747.0 NaN 1 188747 188747
Political Science Department 70094.000000 70094.0 NaN 1 70094 70094
Orthopedics Shoulder & Elbow 126836.000000 126836.0 NaN 1 126836 126836

172 rows × 6 columns

I inspected the departments that had overlapping names and combined them. I'll spare you the output in this post, but the following cell printed out the groups i wanted to see. Clearly, a different strategy would be needed for larger datasets, but this was more than sufficient here.

In [6]:
small_dept = gbdept[gbdept['count_nonzero'] <= 3].index.tolist()
# print(small_dept)

# for redundancy
small_dept = set([i.split()[0] for i in small_dept])

_gbdept = gbdept.reset_index()
# for d in small_dept:
# display(_gbdept[_gbdept.department.str.contains(d)])

Using that above output, it's clear that some departments will be easy to manage, so here's some convenience to do so.

In [7]:
# helpful to keep from having to type this all the time
dept_saver = lambda x, y: profs.loc[profs.department == x, 'department']


# deparments to reassign due to typos or oddities in the data
redo = [("History", "History Department"),
("Neurology Adult", "Neurology"),
("Neurology Child", "Neurology"),
("Political Science Department", "Political Science"),
("Sch Arch Planning Gen Admin", "SAAP"),
("Sociology Department", "Sociology"),
("Speech Hearing Sciences Gen Admin", "Speech and Hearing Sciences"),
("Valencia Social Cultural Studies", "Valencia County Branch")
]


# little function to help
def reassign_departments(tups):
for old, new in tups:
profs.loc[profs.department == old, 'department'] = new

reassign_departments(redo)

Psychiatry, Ortho, Pediatrics, Architecture & planning, and Anesthesiology have a large number of small subdepartments and require something a bit more efficient.

Anesthesiology and Architecture are easy.

In [8]:
# i love the hell out of lambdas like this
x = lambda x: profs.department.str.contains(x)

profs.loc[x("SAAP"), 'department'] = 'Architecture and Planning'
profs.loc[x('Anesthe'), 'department'] = "Anesthesiology"

We don't want to lump psychologists or neuropsychologists in with the psychiatrists for multiple reasons:

  • They have vastly different pay and credentials
  • They would be horribly offended
In [9]:
# captures both 'psychology' and 'neuropsychology'
profs.loc[x("Psych") & (~x("chology")), 'department'] = 'Psychiatry Combined'

The ortho departments have a lot, but we don't want to include occupational or physical therapy in there.

In [10]:
profs.loc[x("Ortho") & ~x("Therapy"), 'department'] = "Orthopaedics Combined"

For pediatrics, I'll take the departments that have 3 or less members and lump them into general peds. Luckily, this leaves out a few large and lower paying specialties: cards, neonatology, and OT.

In [11]:
_gb = profs.loc[x("Pediatrics")].groupby('department')['annual'].agg(aggfuncs)
display(_gb)
peds = list(_gb[_gb['count_nonzero'] <= 3].reset_index().department)

reassign_departments(zip(peds, (it.repeat('Pediatrics General' , len(peds)))))
mean median std count_nonzero amin amax
department
Pediatrics 300000 300000 NaN 1 300000 300000
Pediatrics Adolescent Medicine 151253 149480 25693.937755 3 126492 177788
Pediatrics Cardiology 231083 233600 29171.573313 4 193045 264090
Pediatrics Center for Development 85268 74824 20584.955936 12 71000 131593
Pediatrics Critical Care 178878 157030 44946.202758 18 136000 267450
Pediatrics Endocrinology 140000 140000 NaN 1 140000 140000
Pediatrics Gastroenterology 189325 189325 32067.292527 2 166650 212000
Pediatrics General 142403 135881 19826.288426 12 126250 187660
Pediatrics Genetics Dysmorphology 160565 160565 22818.335829 2 144430 176700
Pediatrics Hematology Oncology 172664 164344 40573.796296 6 133320 219020
Pediatrics Infectious Disease 148992 144360 14513.382514 3 137360 165256
Pediatrics Neonatology Division 224448 219650 23694.451179 6 196900 261440
Pediatrics Nephrology 175855 172560 34373.435964 4 137310 220990
Pediatrics Occupational Therapy 98485 99490 18905.235568 6 77528 122328
Pediatrics Para Los Ninos 154968 154968 35614.140141 2 129785 180151
Pediatrics Pulmonary 146860 128150 52716.935609 3 106050 206380
Pediatrics Rehab Physical Med 166300 166300 65209.387361 2 120190 212410
In [12]:
gbdept = profs.groupby('department')['annual'].agg(aggfuncs)

print(gbdept.count_nonzero.describe())
# take only those dept with at least a few profs
gbdept = gbdept[gbdept.count_nonzero >= 4]
gbdept.sort_values(by='mean', ascending=False)
count    137.000000
mean      12.401460
std       11.572317
min        1.000000
25%        3.000000
50%       10.000000
75%       17.000000
max       59.000000
Name: count_nonzero, dtype: float64
Out[12]:
mean median std count_nonzero amin amax
department
Radiation Oncology 323302.750000 325096 18652.953321 4 302900 340118
Radiology General 312001.115385 312090 34304.697216 26 208060 368998
Surgery Urology 277433.333333 269969 18836.236500 6 267852 315696
Medical Oncology 268891.411765 262095 53737.335164 17 202859 386486
Surgery Vascular 268751.000000 269334 39558.816051 4 233335 303000
Surgery Div of Plastic Surgery 262114.500000 236946 51912.781172 4 234600 339966
Surgery Head Neck Reconstructive 259951.272727 246946 32138.963257 11 226644 310963
SOM Neurosurgery 257467.071429 231391 154775.692116 14 83224 484800
Anesthesiology 252862.800000 249753 33707.754869 35 176750 364105
IM Div of Cardiology 252059.428571 239269 42250.692036 14 202000 348540
... ... ... ... ... ... ...
Communication Journalism 69295.466667 65810 12352.498277 15 59352 100010
Foreign Languages Literatures 67135.000000 63595 11078.179399 16 54000 92986
Spanish Portuguese 64979.000000 66066 3569.649206 12 58526 70000
Music 63224.535714 61683 12158.008455 28 48500 101183
LosAlamos Branch 62145.000000 51170 25037.317362 4 46680 99560
Cinematic Arts Department 59296.333333 57257 19915.701661 6 32754 94037
Theatre and Dance 58887.500000 52350 11671.761547 10 50000 80167
Taos Branch 57631.000000 55808 5225.263400 6 53495 67761
Gallup Branch 57258.468750 52581 14805.748443 32 45000 123691
Valencia County Branch 49252.235294 46680 7974.972504 17 44000 76381

101 rows × 6 columns

Aggregating by job title here allows us to do some quality checks. There are a few data entry errors and we should ignore prof types that have nearly no samples, as well as visiting professors.

In [13]:
ranks = profs.groupby("job_title")['annual'].agg(aggfuncs)
ranks
Out[13]:
mean median std count_nonzero amin amax
job_title
Assistant Professor 106075.117347 76970 69097.115159 392 44000 484800
Associate Professor 86908.610778 78782 30508.130742 334 44188 262095
Clincian Ed-Assoc Prof 306486.000000 306486 NaN 1 306486 306486
Clinician Ed - Assist Prof 157859.500000 157859 31000.268394 2 135939 179780
Clinician Ed - Professor 226128.775510 219196 66620.038126 98 115150 407650
Clinician Ed-Assist Prof 163634.683673 154379 55980.878006 196 71000 383296
Clinician Ed-Assoc Prof 197103.305344 191415 72043.408846 131 75861 478240
Distinguished Professor 139882.240000 141249 22884.355889 25 96000 176459
Professor 133021.684636 115038 60580.567097 371 50000 441986
Professor of Practice 49785.000000 45000 19860.627407 3 32754 71601
Research Assoc Professor 91289.680000 83252 26810.499613 25 55064 188747
Research Asst Professor 70651.100000 68372 16795.164727 80 38000 131789
Research Professor 130926.733333 106050 52930.345379 15 81870 287100
Visiting Assoc Professor 90000.000000 90000 NaN 1 90000 90000
Visiting Asst Professor 67589.833333 53250 29202.858914 18 43000 140000
Visiting Professor 98105.571429 85000 33336.723004 7 50000 140000
In [14]:
profs.loc[profs.job_title == "Clinician Ed - Assist Prof", 'job_title'] = "Clinician Ed-Assist Prof"
profs.loc[profs.job_title == "Clincian Ed-Assoc Prof", 'job_title'] = "Clinician Ed-Assoc Prof"

ignore = ['Professor of Practice',
'Visiting Asst Professor',
'Visiting Assoc Professor',
'Visiting Professor'
]
profs = profs[~profs.job_title.isin(ignore)].copy()
ranks = profs.groupby("job_title")['annual'].agg(aggfuncs)
ranks
Out[14]:
mean median std count_nonzero amin amax
job_title
Assistant Professor 106075 76970 69097.115157 392 44000 484800
Associate Professor 86908 78782 30508.130736 334 44188 262095
Clinician Ed - Professor 226128 219196 66620.038119 98 115150 407650
Clinician Ed-Assist Prof 163576 154379 55742.769226 198 71000 383296
Clinician Ed-Assoc Prof 197931 191415 72396.638175 132 75861 478240
Distinguished Professor 139882 141249 22884.355879 25 96000 176459
Professor 133021 115038 60580.567090 371 50000 441986
Research Assoc Professor 91289 83252 26810.499604 25 55064 188747
Research Asst Professor 70651 68372 16795.164721 80 38000 131789
Research Professor 130926 106050 52930.345370 15 81870 287100

Adding gender information

Clearly, this data set does not have gender associated to each person, so we'll have to infer that. I used the SSA's name popularity through time dataset to help with that. It comes as a set of .txt files with name, gender, and frequency information for all years from 1880 onward. I saved that in a folder and read them in with glob's help.

In [15]:
fn = glob.glob('/dropbox/random_notebooks/unm_salary/yob*')

names = []
for file in fn:
with open(file, 'r') as f:
l = [line.strip().lower().split(',') for line in f.readlines()]
names.extend(l)

print(len(names))
1825433

Each name is stored as such:

In [16]:
names[0:10]
Out[16]:
[['mary', 'f', '7065'],
['anna', 'f', '2604'],
['emma', 'f', '2003'],
['elizabeth', 'f', '1939'],
['minnie', 'f', '1746'],
['margaret', 'f', '1578'],
['ida', 'f', '1472'],
['alice', 'f', '1414'],
['bertha', 'f', '1320'],
['sarah', 'f', '1288']]

As the dataset contains duplicate names for males and females, i decided to break them up by gender and use the name with the greater number of occurances to determine gender. Pandas makes this process rather easy, giving us in the end a dataframe with no duplicate names and a sum of the number of times the name has been recorded.

In [17]:
# turn the list into a dataframe
nameframe = pd.DataFrame(names, columns=['name', 'gender', 'num'])
nameframe.num = nameframe.num.astype('int')

# get unique dataframes of just the male and female nmaes
female = nameframe[nameframe['gender'] == 'f'].sort_values('name').copy()
male = nameframe[nameframe['gender'] == 'm'].sort_values('name').copy()

# get unique names with total counts
male = male.groupby('name')['num'].sum().reset_index()
female = female.groupby('name')['num'].sum().reset_index()
male['gender'] = 'm'
female['gender'] = 'f'

# make these into a set for fast lookups later
mname = set(male.name.values)
fname = set(female.name.values)
nameframe = []
#check results
print('Number of male names: ' +  str(len(mname)))
print('Number of female names: ' +  str(len(fname)))
Number of male names: 39199
Number of female names: 64911

Notice how my name appears with far greater frequency as a boy's name than a girl's name and how the inverse is true with the typical female spelling of the same name.

In [18]:
display(pd.concat([male[male.name == 'aaron'], female[female.name == 'aaron']]))
display(pd.concat([male[male.name == 'erin'], female[female.name == 'erin']]))
name num gender
83 aaron 553615 m
166 aaron 4228 f
name num gender
11667 erin 9071 m
19058 erin 309817 f

Now a bit of a hacky way of assigning gender in our dataframe. We'll default to an 'x' for names that are not found in the set.

In [19]:
# convenience lambdas
fgetval = lambda x: female.loc[female['name'] == x, 'num'].values[0]
mgetval = lambda x: male.loc[male['name'] == x, 'num'].values[0]

# i could rework the logic here to make it faster but this is pretty clear.
def namer(name):
_m = name in mname
_f = name in fname
if _m and not _f:
return 'm'
elif _f and not _m:
return 'f'
elif _m and _f:
return 'f' if fgetval(name) > mgetval(name) else 'm'

return 'x'

Testing it out:

In [20]:
print(namer('aaron'))
print(namer('erin'))
m
f

now we have to extract the first name of everyone in the dataset...

In [21]:
profs.head().name
Out[21]:
26     Abraham, Shirley M.
29        Abrams, Swala K.
38    Achrekar, Abinash P.
47       Acosta, Sylvia J.
74       Adler, William M.
Name: name, dtype: object
In [22]:
prof_names = [line.lower().split() for line in profs.name.tolist()]

def get_first_names(names):
# this function annoys me
first_names = []
for line in names:
if len(line) == 2:
first_names.append(line[1])
elif len(line) == 3:
first_names.append(line[1])
elif len(line) == 4:
first_names.append(line[2])
else:
first_names.append(line[1])
return first_names
In [23]:
first_names = get_first_names(prof_names)
profs['gender'] = list(map(namer, first_names))
# profs.gender = profs.gender.astype('category')

We can see that we have ~160 names that were not assigned correctly, with nearly all of those names being non-american names, but overall the naming strategy seems to have worked reasonably well.

In [24]:
print(profs.gender.value_counts())

display(profs[profs['gender'] == 'x'].head()[['name', 'department', 'gender']])
display(profs.head()[['name', 'department', 'gender']])
m    802
f    713
x    155
Name: gender, dtype: int64
name department gender
29 Abrams, Swala K. Psychiatry Combined x
38 Achrekar, Abinash P. IM Div of Cardiology x
108 Ahmed, Shozab IM Div of Pulmonary CC and Sleep x
536 Ayoubieh, Houriya S. IM Division of Hospital Medicine x
2160 Cline Parhamovich, Karen Pathology x
name department gender
26 Abraham, Shirley M. Pediatrics Hematology Oncology f
29 Abrams, Swala K. Psychiatry Combined x
38 Achrekar, Abinash P. IM Div of Cardiology x
47 Acosta, Sylvia J. Pediatrics Center for Development f
74 Adler, William M. Medical Oncology m

Here's where things get fun: let's scrape the web a bit to help narrow down the lack of data for foreign names. I found a nice site -- geoff's baby name guesser -- that can help with this. Using beautifulsoup, we can get the name: gender mapping we need. The webpage returns a result like:

It's a girl!

And we can easily pop out the gender from the resulting webpage to get what we need.

The beautifulsoup code is a bit dense, but i find using lots of lambdas, maps, and list comprehensions when using it to be really helpful.

In [25]:
ungendered = profs[profs.gender == 'x'].name
# this is nasty
ungen_first_names = [x.split(',')[1].lower().strip().split()[0] for x in ungendered]
list(zip(ungendered, ungen_first_names))[0:10]
Out[25]:
[('Abrams, Swala K.', 'swala'),
('Achrekar, Abinash P.', 'abinash'),
('Ahmed, Shozab', 'shozab'),
('Ayoubieh, Houriya S.', 'houriya'),
('Cline Parhamovich, Karen', 'karen'),
('Dayao, Zoneddy R.', 'zoneddy'),
('Del Campo de Gonzalez, Sara M.', 'sara'),
('Del Fabbro, Anilla', 'anilla'),
('Hedna, Vishnumurthy S.', 'vishnumurthy'),
('Howard Moran, Rebecca', 'rebecca')]
In [26]:
from bs4 import BeautifulSoup
import urllib.request as req
import requests
In [27]:
# this is dense but it does the job

ungendered = list(profs[profs.gender == 'x'].name)

# helper functions to find and clean what we need from the html results
gender = lambda x: x.findAll("td", {'valign': 'top'})[2].find('b').text.rstrip('!').split()[2]
# nice helper to get correct gender
g = lambda x: 'm' if gender(x) == 'boy' else 'f' if gender(x) == 'girl' else 'x'

url_base = 'http://www.gpeters.com/names/baby-names.php?name='
url_end = '&button=Go'
urls = [url_base + name + url_end for name in ungen_first_names]

soup_getter = lambda x:  BeautifulSoup(requests.get(x).content, 'lxml')

# scrape
pages = list(map(soup_getter, urls))

# test our results
list(zip(ungen_first_names, [gender(page) for page in pages], [g(page) for page in pages]))[0:10]
Out[27]:
[('swala', 'Guesser', 'x'),
('abinash', 'boy', 'm'),
('shozab', 'boy', 'm'),
('houriya', 'girl', 'f'),
('karen', 'girl', 'f'),
('zoneddy', 'Guesser', 'x'),
('sara', 'girl', 'f'),
('anilla', 'girl', 'f'),
('vishnumurthy', 'boy', 'm'),
('rebecca', 'girl', 'f')]
In [28]:
new_genders = [g(page) for page in pages]
profs.loc[profs['gender'] == 'x', 'gender'] = new_genders
print(profs.gender.value_counts())
m    872
f    761
x     37
Name: gender, dtype: int64

It worked, but let's examine a department I know pretty well.

In [29]:
profs[profs.department == "Computer Science"][["name", "department", "job_title", "gender"]]
Out[29]:
name department job_title gender
42 Ackley, David H. Computer Science Associate Professor m
442 Arnold, Dorian Computer Science Associate Professor m
1334 Bridges, Patrick G. Computer Science Associate Professor m
2407 Crandall, Jedidiah R. Computer Science Associate Professor m
3181 Estrada-Piedra, Trilce P. Computer Science Assistant Professor f
3454 Forrest, Stephanie Computer Science Distinguished Professor f
4579 Hayes, Thomas Computer Science Associate Professor m
5448 Kapur, Deepak Computer Science Distinguished Professor m
5523 Kelley, Patrick Computer Science Assistant Professor m
6372 Luan, Shuang Computer Science Associate Professor f
7576 Moses, Melanie Computer Science Associate Professor f
7600 Mueen, Abdullah Computer Science Assistant Professor m
9192 Roman, Gruia-Catalin Computer Science Professor x
10414 Stefanovic, Darko Computer Science Professor m
10669 Tapia, Lydia E. Computer Science Assistant Professor f
11743 Williams, Lance R. Computer Science Associate Professor m

There are two errors here: Shaung and Gruia are both men. That isn't the best success rate, but let's forge onward. I'll correct these two people, but be wary of the future results.

In [30]:
profs.loc[profs.name == 'Roman, Gruia-Catalin', 'gender'] = 'm'
profs.loc[profs.name == 'Luan, Shuang', 'gender'] = 'm'

Onward to Plotting

Now let's group by estimated gender and department and aggregate some stats.

In [31]:
genderdf = profs[profs.gender != 'x'].groupby('gender')['annual'].agg(aggfuncs)
error = genderdf['std']
genderdf[['mean']].plot.bar(yerr=error, figsize=(6, 6), title="Annual Salary by gender")
genderdf
Out[31]:
mean median std count_nonzero amin amax
gender
f 121001 99526 62788.035668 760 38000 407650
m 134459 111523 75760.686903 874 44000 484800

Overall, there is about a \$13,000 difference between men and women's pay, with clearly very wide ranges in pay across both genders.

playing with bokeh

It took a bit to get used to Bokeh's syntax and how to manipulate plot objects, but moving down to a low-level plotting option gave me the control that i was looking for (more for the 2nd example that will come soon, but this was a good way of trying it out).

In the interest of saving space, I'll make all of the plots and combine them into an interactive set of Tabs that I will show at the bottom.

On to the code.

In [32]:
from bokeh.io import output_notebook, show
from bokeh.models.glyphs import Circle, Text
from bokeh.models import ColumnDataSource, Range1d, DataRange1d, Plot
from bokeh.plotting import figure
from bokeh.models import Grid, LinearAxis, SingleIntervalTicker
from bokeh.models import NumeralTickFormatter
from bokeh.models import PrintfTickFormatter
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.charts import Bar, output_file
# this allows the plot to be displayed in the notebook
output_notebook()
BokehJS successfully loaded.

Defining a few functions that make life easier:

In [33]:
PLOT_HEIGHT = 1500
PLOT_WIDTH = 1000
RESPONSIVE = False

PLOT_OPTIONS = dict(plot_width=PLOT_WIDTH,
plot_height=PLOT_HEIGHT,
responsive=RESPONSIVE,
title = "PLOT",
)

def make_error_bars(df, plot=None, **kwargs):
"""Returns data to plot error bars
    Args:
        df: pandas dataframe with columns "index", "mean", and "std"
        plot: the plot on which to draw the bars
    """

bar_args = dict(color='black', alpha=0.55, line_cap='round')
if kwargs:
bar_args = kwargs

### error bars ###
# http://stackoverflow.com/questions/29166353/how-do-you-add-error-bars-to-bokeh-plots-in-python
err_x = []
err_y = []

# yay for python
for x, y, yerr in zip(df['index'] + 1, df['mean'], df['std']):
err_x.append((x, x))
err_y.append((y - yerr, y + yerr))

if plot:
plot.multi_line(err_y, err_x, **bar_args)
return
return err_x, err_y



def add_formatting_opts(plot):
# strictly convenience
# in radians
plot.xaxis.major_label_orientation = 1.0
# format the x axis labels
plot.xaxis[0].formatter = NumeralTickFormatter(format='$ 0,0[.]00')

plot.yaxis.axis_label_text_alpha = 0.75
plot.yaxis.axis_label_text_font_size = '12pt'
# remove gridlines 
plot.ygrid.grid_line_color = None
plot.xgrid.grid_line_alpha = 0.4
# make a lightly dashed line
plot.xgrid.grid_line_dash = [6, 4]
plot.yaxis.axis_line_alpha = 0.2
plot.yaxis.major_tick_line_alpha = 0
plot.yaxis.major_label_text_font_size = '10pt'
plot.legend.location= "top_right"
In [34]:
df = gbdept.sort_values(by='mean', ascending=False)

# this is a hack to help index the dataframe for sorting data
# in the plot. could be made a bit easier, i suppose.
df = df.reset_index().reset_index()

The code for the first plot, displaying pay by department.

In [35]:
# defines the type of tips we want to have on the chart
hover = HoverTool(
tooltips=[
("Department", "@department"),
("Salary Mean", "@mean{$int:,}"),
("Salary STD ", "@std{$int:,}"),
("Salary Min ", "@amin{$int:,}"),
("Salary Max ", "@amax{$int:,}"),
("Dept Size", "@count_nonzero")
]
)

tools = [hover, "pan", "wheel_zoom", "reset", "resize"]

# allows us to get hovertips easily
source = ColumnDataSource(df)

PLOT_OPTIONS['tools'] = tools
# make the figure
dept_plot = figure(y_range=list(df['department']), **PLOT_OPTIONS)
#dept_plot.title = "Professor Pay at UNM by department"
# draw bars first to have circle on top
make_error_bars(df, plot=dept_plot)

# draw means
dept_plot.circle(x='mean',
y=df['index'] + 1,
size=np.sqrt(df['count_nonzero']) + 5,
fill_alpha=1,
legend='size increases with department size',
source=source)

dept_plot.circle(x='amin',
y=df['index'] + 1,
size=2,
color='grey',
fill_alpha=1,
legend="Max and Min pay",
source=source)

dept_plot.circle(x='amax',
y=df['index'] + 1,
size=2,
color='grey',
fill_alpha=1,
source=source)

dept_plot.title = "UNM Professor Pay per Department"

add_formatting_opts(dept_plot)

And now the code for a similar plot, but using rank instead of department.

In [36]:
# defines the type of tips we want to have on the chart
ranks = profs.groupby("job_title")['annual'].agg(aggfuncs)
ranks = ranks.sort_values('mean')
ranks = ranks.reset_index().reset_index()

hover = HoverTool(
tooltips=[
("Title", "@job_title"),
("Salary Mean", "@mean{$int:,}"),
("Salary STD ", "@std{$int:,}"),
("Total Number", "@count_nonzero")
]
)

tools = [hover, "pan", "wheel_zoom", "reset", "resize"]

# allows us to get hovertips easily
source = ColumnDataSource(ranks)

PLOT_OPTIONS['tools'] = tools


# make the figure
rank_plot = figure(y_range=list(ranks['job_title']), **PLOT_OPTIONS)

# draw bars first to have circle on top
make_error_bars(ranks, plot=rank_plot)

# draw means
rank_plot.circle(x='mean',
y=ranks['index'] + 1,
size=np.sqrt(ranks['count_nonzero']) + 5,
fill_alpha=1,
legend='point size increases with sample size',
source=source)

# draw min 
rank_plot.circle(x='amin',
y=ranks['index'] + 1,
size=2,
fill_alpha=1,
color='grey',
source=source)

# draw max
rank_plot.circle(x='amax',
y=ranks['index'] + 1,
size=2,
fill_alpha=1,
color='grey',
legend='Max and Min Salary',
source=source)


add_formatting_opts(rank_plot)
rank_plot.legend.location= "bottom_right"
rank_plot.title = "UNM Professor Pay By Rank"

rank_plot.plot_height = 600

And let's look at pay overall in departments by gender. We'll include only departments that have three or more professors of each gender. I realize this is a small sample.

In [37]:
df = profs[profs.gender != 'x'].groupby(['department', 'gender'])['annual'].agg(aggfuncs).reset_index()

df = df[df.count_nonzero >= 3]
df = df.groupby('department').count().reset_index()
depts = list(df.loc[df.gender ==2, 'department'].values)

gb = profs[profs.gender != 'x'].groupby(['department', 'gender'])['annual'].agg(aggfuncs)

df = gb.ix[depts].copy()
df.head(10)
Out[37]:
mean median std count_nonzero amin amax
department gender
AS Biology General Administrative f 83524.090909 80240.0 14564.160295 11 65000 117478
m 89785.909091 87719.0 22563.834262 33 44500 153434
ASM Department of Accounting f 129213.500000 124895.5 15709.406704 4 116031 151032
m 130948.166667 126777.5 18784.059864 6 112146 165948
ASM Finance Intl Tech Mngt FIT f 136278.000000 137248.0 1952.906552 3 134030 137556
m 131192.500000 130616.0 12075.174284 8 117436 153599
ASM Mrkting Info Decision Sci MIDS f 121107.500000 119981.5 6462.042118 4 115128 129339
m 123455.666667 116200.0 21499.969680 9 102468 176459
ASM Organizational Studies f 129259.666667 130000.0 12619.544693 9 114665 159219
m 125647.666667 124056.0 5955.224121 3 120650 132237

Let's introduce some logic to find the difference between aggregates for each gender in each department. As there are plenty of departments that do not meet our critiera or have only one gender, those will be discarded in the analysis and given a null value. Pandas cross-section:

df.xs()

is a beautiful, beautiful thing.

In [38]:
dept_diffs = (df.xs('m', level='gender') -  df.xs('f', level='gender'))
# rename columns 
dept_diffs.columns = [i + 'diff' for i in dept_diffs.columns]
display(dept_diffs.head(10))

# this left join will spread the diffs out to the right department 
_df = pd.merge(left=df.reset_index(), right=dept_diffs.reset_index(),
on='department',
how='left')

_df = _df.sort_values(['meandiff', 'gender']).reset_index()
# hack to get a nicely labeled tick marker
_df['dept'] = _df['department'] + ', ' + _df['gender']
_df['index'] = _df.reset_index()
display(_df.head(10))
meandiff mediandiff stddiff count_nonzerodiff amindiff amaxdiff
department
AS Biology General Administrative 6261.818182 7479.0 7999.673967 22 -20500 35956
ASM Department of Accounting 1734.666667 1882.0 3074.653160 2 -3885 14916
ASM Finance Intl Tech Mngt FIT -5085.500000 -6632.0 10122.267733 5 -16594 16043
ASM Mrkting Info Decision Sci MIDS 2348.166667 -3781.5 15037.927562 5 -12660 47120
ASM Organizational Studies -3612.000000 -5944.0 -6664.320571 -6 5985 -26982
American Studies Department -11361.933333 -5962.0 -16134.322266 -1 7554 -39931
Anesthesiology 2538.895105 7600.5 9419.315746 9 -15857 58619
Anthropology Department 10004.113636 16413.5 -1862.220402 1 -59 -12950
Architecture and Planning 10052.214286 3672.0 11147.243376 7 -1682 47154
Art Art History -1956.159091 -6625.0 5298.976508 -13 0 11841
index department gender mean median std count_nonzero amin amax meandiff mediandiff stddiff count_nonzerodiff amindiff amaxdiff dept
0 0 Surgical Oncology f 260012.500000 262600 21136.988709 4 232300 282550 -51255.500000 -5050 113372.230072 -1 -175639 29510 Surgical Oncology, f
1 1 Surgical Oncology m 208757.000000 257550 134509.218781 3 56661 312060 -51255.500000 -5050 113372.230072 -1 -175639 29510 Surgical Oncology, m
2 2 IM Div of Epidemiology f 150271.071429 150490 48652.293915 14 84820 248766 -30166.871429 -35108 -15552.820925 -9 20 -83289 IM Div of Epidemiology, f
3 3 IM Div of Epidemiology m 120104.200000 115382 33099.472990 5 84840 165477 -30166.871429 -35108 -15552.820925 -9 20 -83289 IM Div of Epidemiology, m
4 4 Neurosciences f 142949.666667 159297 28665.867444 3 109850 159702 -29018.095238 -51778 310.855075 4 -29850 -5561 Neurosciences, f
5 5 Neurosciences m 113931.571429 107519 28976.722519 7 80000 154141 -29018.095238 -51778 310.855075 4 -29850 -5561 Neurosciences, m
6 6 Pathology f 187313.269231 190353 70434.711250 26 54000 372553 -22698.690283 -18653 -17119.528719 -7 1233 -117553 Pathology, f
7 7 Pathology m 164614.578947 171700 53315.182530 19 55233 255000 -22698.690283 -18653 -17119.528719 -7 1233 -117553 Pathology, m
8 8 Organization Info Learning Sci-OILS f 89500.000000 97303 19318.923598 3 67500 103697 -16021.250000 -28365 -7187.904257 1 -2500 -12658 Organization Info Learning Sci-OILS, f
9 9 Organization Info Learning Sci-OILS m 73478.750000 68938 12131.019341 4 65000 91039 -16021.250000 -28365 -7187.904257 1 -2500 -12658 Organization Info Learning Sci-OILS, m

Pay by department and estimated gender

And now the fun...

In [39]:
hover = HoverTool(
tooltips=[
("Department", "@dept"),
("Estimated Gender", "@gender"),
("Mean Annual Salary", "@mean{$int:,}"),
("Mean Annual Salary STD", "@std{$int:,}"),
("Median Annual Salary", "@median{$int:,}"),
("# in department", "@count_nonzero")
]
)
tools = [hover, "pan", "wheel_zoom", "reset" ,"resize"]
PLOT_OPTIONS['tools'] = tools

dept_gender = figure(y_range=list(_df['dept'].values), **PLOT_OPTIONS)

# easier to plot colors by separating these out
boys = _df[_df['gender'] == 'm'].copy()
girls = _df[_df['gender'] == 'f'].copy()

# set the column data sources for the hovertips
_boys = ColumnDataSource(boys)
_girls = ColumnDataSource(girls)
bar_width = 0.3

# manually draw the bars on the plot

## boys
dept_gender.quad(right=boys['mean'],
left=0,
bottom= boys['index'] + 1 - bar_width ,
top= boys['index'] + 1 + bar_width,
fill_color='blue',
fill_alpha=0.6,
legend="Men's pay",
source=_boys)

## girls
dept_gender.quad(right=girls['mean'],
left=0,
bottom= girls['index'] + 1  - bar_width ,
top= girls['index'] + 1 + bar_width ,
fill_color='red',
fill_alpha=0.6,
legend="Women's pay",
source=_girls)

add_formatting_opts(dept_gender)
make_error_bars(_df, plot=dept_gender)
dept_gender.plot_height = 2300
dept_gender.title = "UNM Professor Pay by Department and Estimated Gender"

The plot is sorted by the greatest difference in pay between men and women in a department. In some departments, women earn more than men.

Pay by rank and estimated gender

Now let's do the same for rank and gender.

In [40]:
gender_rank = profs.loc[~(profs.gender == 'x')].groupby(['job_title', 'gender'])['annual'].agg(aggfuncs)
gender_rank.head(10)

g_diffs = (gender_rank.xs('m', level='gender') -  gender_rank.xs('f', level='gender'))
# rename columns 
g_diffs.columns = [i + 'diff' for i in g_diffs.columns]
g_diffs

# this left join will spread the diffs out to the right department 
gender_rank = pd.merge(left=gender_rank.reset_index(), right=g_diffs.reset_index(),
on='job_title',
how='left')

gender_rank = gender_rank.sort_values(['meandiff', 'gender']).reset_index()
# hack to get a nicely labeled tick marker
gender_rank['rank'] = gender_rank['job_title'] + ', ' + gender_rank['gender']

gender_rank['index'] = gender_rank.reset_index()
gender_rank.head()
Out[40]:
index job_title gender mean median std count_nonzero amin amax meandiff mediandiff stddiff count_nonzerodiff amindiff amaxdiff rank
0 0 Research Professor f 142329.333333 139481 37784.105971 3 106050 181457 -14253.250000 -36242 19328.104799 9 -24180 105643 Research Professor, f
1 1 Research Professor m 128076.083333 103239 57112.210770 12 81870 287100 -14253.250000 -36242 19328.104799 9 -24180 105643 Research Professor, m
2 2 Research Assoc Professor f 98995.777778 93566 18885.104501 9 71054 124836 -12609.177778 -18134 12151.125957 6 -15990 63911 Research Assoc Professor, f
3 3 Research Assoc Professor m 86386.600000 75432 31036.230457 15 55064 188747 -12609.177778 -18134 12151.125957 6 -15990 63911 Research Assoc Professor, m
4 4 Distinguished Professor f 138184.750000 136381 25385.266396 8 103697 174776 2496.308824 9115 -2991.255009 9 -7697 1683 Distinguished Professor, f
In [41]:
# hovertools are cool -- you have to define a "source" for the data and can access the columns with @
hover = HoverTool(
tooltips=[
("Title", "@job_title"),
("Estimated Gender", "@gender"),
("Mean Annual Salary", "@mean{$int:,}"),
("Annual Salary STD", "@std{$int:,}"),
("Median Annual Salary", "@median{$int:,}"),
("Amount men make over women", "@meandiff{$int:,}"),
("Sample Size", "@count_nonzero")
]
)
tools = [hover, "pan", "wheel_zoom", "reset" ,"resize"]

PLOT_OPTIONS['tools'] = tools
## make the figure
p_rank_gender = figure(y_range=list(gender_rank['rank'].values),
**PLOT_OPTIONS)


# easier to plot colors by separating these out
boys = gender_rank[gender_rank['gender'] == 'm'].copy()
girls = gender_rank[gender_rank['gender'] == 'f'].copy()

# set the column data sources for the hovertips
_boys = ColumnDataSource(boys)
_girls = ColumnDataSource(girls)
bar_width = 0.3

# manually draw the bars on the plot

## boys
p_rank_gender.quad(right=boys['mean'],
left=0,
bottom= boys['index'] + 1 - bar_width ,
top= boys['index'] + 1 + bar_width,
fill_color='blue',
fill_alpha=0.6,
legend="Men's pay",
source=_boys)

## girls
p_rank_gender.quad(right=girls['mean'],
left=0,
bottom= girls['index'] + 1  - bar_width ,
top= girls['index'] + 1 + bar_width ,
fill_color='red',
fill_alpha=0.6,
legend="Women's pay",
source=_girls)

add_formatting_opts(p_rank_gender)
make_error_bars(gender_rank, plot=p_rank_gender)
p_rank_gender.plot_height = 600
p_rank_gender.plot_width = 800
p_rank_gender.title = "UNM Professor Pay By Rank and Estimated Gender"

Deducing Just the Differences In Pay

We can plot the data to better show the differences in pay between our estimated genders, but these plots will obscure some information and perhaps exaggerates the differences between departments more than the above plots. The code to manipulate the data is a bit ugly.

Estimated Gender Difference in pay per department

In [42]:
diffdf = _df.sort_values('meandiff', ascending=False).copy()
diffdf = diffdf[diffdf.gender == 'm'].reset_index()

diffdf.drop('level_0', inplace=True, axis=1)
diffdf = diffdf[~diffdf.meandiff.isnull()]
diffdf['index'] = diffdf.reset_index()


# I really don't like this, but put up with it so we can get the correct department size info from
# the original data frame
tmp = gbdept.reset_index()
tmp = pd.merge(tmp, diffdf[['department', 'meandiff']], how='inner', on="department",
sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False)

diffdf = tmp.sort_values('meandiff').copy()
diffdf = diffdf.reset_index()
diffdf.drop('index', inplace=True, axis=1)

diffdf = diffdf.reset_index()

...and the plot

In [43]:
hover = HoverTool(
tooltips=[
("Department", "@department"),
("Difference in Pay", "@meandiff{$int:,}"),
("Dept Size", "@count_nonzero")
]
)

tools = [hover, "pan", "wheel_zoom", "reset", "resize"]
source = ColumnDataSource(diffdf)

PLOT_OPTIONS['tools'] = tools
# make the figure
diffplot = figure(y_range=list(diffdf['department']), **PLOT_OPTIONS)


women = diffdf[diffdf.meandiff < 0]
men = diffdf[diffdf.meandiff > 0]

diffplot.quad(right=men['meandiff'],
left = 0,
bottom = men['index'] + 1 - 0.2 ,
top = men['index'] + 1 + 0.2,
fill_color='blue',
fill_alpha=0.6,
legend="Men make more",
source=source)

diffplot.quad(right=0,
left =women['meandiff'],
bottom = women['index'] + 1 - 0.2 ,
top = women['index'] + 1 + 0.2,
fill_color='red',
fill_alpha=0.6,
legend="Women make more",
source=source)

## I honestly have no damn clue why i have to plot the whole thing to get it to display properly.
## I will bug the bokeh people about this.
diffplot.quad(right=diffdf['meandiff'],
left = 0,
bottom = diffdf['index'] + 1 - 0.2 ,
top = diffdf['index'] + 1 + 0.2,
fill_color='red',
fill_alpha=0.0,
source=source)

add_formatting_opts(diffplot)
diffplot.legend.location = 'top_left'
diffplot.title = "Estimated Gender Difference In Pay by Department"

Estimated gender difference in pay per rank

In [44]:
# I really don't like this, but put up with it so we can get the correct department size info from
# the original data frame
tmp = ranks.reset_index()
tmp = pd.merge(tmp, g_diffs.reset_index()[['job_title', 'meandiff']], how='inner', on="job_title",
sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False)

rdiffdf = tmp.sort_values('meandiff').copy()

rdiffdf.drop('level_0', inplace=True, axis=1)
rdiffdf.drop('index', inplace=True, axis=1)

rdiffdf = rdiffdf.reset_index()
rdiffdf.drop('index', inplace=True, axis=1)
rdiffdf = rdiffdf.reset_index()

Plotting the rank differences in pay:

In [45]:
hover = HoverTool(
tooltips=[
("Title", "@job_title"),
("Difference in Pay", "@meandiff{$int:,}"),
("Sample Size", "@count_nonzero")
]
)

tools = [hover, "pan", "wheel_zoom", "reset", "resize"]
source = ColumnDataSource(rdiffdf)

PLOT_OPTIONS['tools'] = tools
# make the figure
rdiffplot = figure(y_range=list(rdiffdf['job_title']), **PLOT_OPTIONS)


women = rdiffdf[rdiffdf.meandiff < 0]
men = rdiffdf[rdiffdf.meandiff > 0]

rdiffplot.quad(right=men['meandiff'],
left = 0,
bottom = men['index'] + 1 - 0.2 ,
top = men['index'] + 1 + 0.2,
fill_color='blue',
fill_alpha=0.6,
legend="Men make more",
source=source)

rdiffplot.quad(right=0,
left =women['meandiff'],
bottom = women['index'] + 1 - 0.2 ,
top = women['index'] + 1 + 0.2,
fill_color='red',
fill_alpha=0.6,
legend="Women make more",
source=source)

## I honestly have no damn clue why i have to plot the whole thing to get it to display properly.
## I will bug the bokeh people about this.
rdiffplot.quad(right=rdiffdf['meandiff'],
left = 0,
bottom = rdiffdf['index'] + 1 - 0.2 ,
top = rdiffdf['index'] + 1 + 0.2,
fill_color='red',
fill_alpha=0.0,
source=source)

add_formatting_opts(rdiffplot)
rdiffplot.plot_height = 700
rdiffplot.legend.location = 'top_left'
rdiffplot.title = "Estimated Gender Difference In Pay by Rank"

Now let's put the plots into one nice bundle using bokeh's Tabs functionality. This is a really cool feature.

In [46]:
from bokeh.models.widgets import Panel, Tabs
In [47]:
def setter(plots):
"""Removing responsiveness saves the constraint issues
    inherent with the plots."""
def close(x):
x.responsive = False if x.responsive is True else True
return x
return [close(p) for p in plots]


plots = [dept_plot, rank_plot, dept_gender, p_rank_gender, diffplot, rdiffplot]
responsive = (p.responsive for p in plots)
if True in responsive:
print("setting plots to be non responsive")
setter(plots)


tabs = Tabs(tabs=[Panel(child=p, title=p.title) for p in plots])

Results

Here's the plots.

In [48]:
show(tabs)
Out[48]:
<bokeh.io._CommsHandle at 0x140736978>

Thoughts

Pay by department

Clearly, the data still includes department heads that are classified as academic roles, though we did explicitly exclude deans, provosts, chairs, etc. I won't omit the highest and lowest paid people from the data for the time being. It's frankly amazing how much higher some people are paid than the rest of their department:

  • University Libraries has a salary mean of $85,000 and max of $270,000

  • CS has a mean of \$113,000 and a max of $256,000, though the person earning that was the School of Engineering Dean for some time.

In [49]:
print("-" * 5 + " Max minus mean pay" + "-" * 5)
print((gbdept.amax - gbdept['mean']).sort_values(ascending=False).head())
print("-" * 5 + " Max minus median pay" + "-" * 5)
print((gbdept.amax - gbdept['median']).sort_values(ascending=False).head())
----- Max minus mean pay-----
department
SOM Neurosurgery              227332.928571
Pathology                     195272.276596
IM Div of Gastroenterology    186814.000000
University Libraries          185446.421053
Experimental Therapeutics     151072.333333
dtype: float64
----- Max minus median pay-----
department
SOM Neurosurgery             253409
Pathology                    199136
University Libraries         196672
Experimental Therapeutics    181247
Emergency Medicine           164220
dtype: int64

Pay by rank

Assistant and Associate profs have a huge range of pay, but it's a bit surprising that the mean for associates is still lower than assistants. Clearly, there are a few huge outliers in here somewhere, likely medical folks, dominating the data.

Pay by department and gender

The chart is sorted in descending order by departments with men making more than women.

Pay by rank and gender

The chart is sorted in descending order by ranks with men making more than women.

Gender difference in pay by department and rank

As i said before, these plots do a great job of highlighting differences in pay but may exaggerate some of those differenes -- no measures of deviation or range are included.

Wrapup

Investigating all of this more deeply would take a bit more effort, but let's examine a department with wildly high deviations in pay to see if there is something going on here.

In [50]:
profs.loc[profs.department == 'SOM Neurosurgery'][['job_title', 'annual', 'gender']]
Out[50]:
job_title annual gender
5474 Clinician Ed-Assoc Prof 232988 f
6650 Clinician Ed-Assoc Prof 478240 m
8757 Clinician Ed-Assist Prof 121200 f
9064 Clinician Ed-Assist Prof 202000 m
10701 Clinician Ed-Assoc Prof 429512 m
210 Assistant Professor 229795 m
1285 Research Asst Professor 85850 m
1747 Assistant Professor 484800 m
2065 Assistant Professor 350000 m
5793 Research Asst Professor 83224 m
7771 Research Professor 101000 m
9186 Research Assoc Professor 94030 f
10010 Assistant Professor 454500 m
10950 Assistant Professor 257400 m

As i suspected, it looks like the research profs in the neurosurgery department are wildly throwing off our pay data. The female clinical assistant prof making $121,000 is a DO, not a surgeon, which explains some of the huge difference in pay compared to the surgeons.

Fin

There is plenty more to play with here, but that's all for the time being. Let me know if you see something idiotic. As I mentioned, take the plots with a grain of salt:

  • I am not sure how well the gender identification worked out
  • Some departments have very small samples
  • The difference plots exaggerate the differences in pay a bit by not showing standard deviations or other quantifiers of the mean
  • Clinical departments are difficult due to non MD people being classified as a "Clinician Prof", which brings the pay ranges down.
  • Some clinicial professors are classed as non-clinical profs, which will also skew some payscales compared to the non-clinical groups.

When I get a bit of spare time, I'll come back to this data and fit a few models to try and quantify just how much gender influences pay in this dataset while accounting for rank and department.