Decision Tree Project

Content Warning: Machine Learning, Python

This was the report that I turned in for our first machine learning “mini” project. All code can be found here.

Implementation strategy

The project is broken into several major files:

  • classify.py: functionality for predicting class labels and plotting results
  • dna.py: class to hold the gene sequences and promoter values
  • feature.py: group of classes that help organize splitting the data
  • id3.py: main implementation of the decision tree
  • main.py: main point of entry
  • metrics.py: implementations of information gain, entropy, and data splitting (the first two are sketched just below the list)
  • README.md: detailed description of running the program

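For reference, the two metrics at the heart of ID3 fit in a few lines. This is not metrics.py verbatim, only a minimal sketch of the idea; the names entropy and information_gain are illustrative.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent_labels, child_groups):
    # entropy of the parent minus the weighted entropy of the child groups
    total = len(parent_labels)
    remainder = sum(len(g) / total * entropy(g) for g in child_groups)
    return entropy(parent_labels) - remainder

# a split that separates the classes perfectly gains a full bit
print(information_gain(['+', '+', '-', '-'], [['+', '+'], ['-', '-']]))  # 1.0
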
The program’s main entry point is main.py. The decision tree was implemented using a dynamic breadth-first strategy, calculating a root node first and then building down from there, each node working on a subset of the data. Each node has four children, corresponding to the nucleotides adenine, cytosine, guanine, and thymine (a, c, g, t). Each vector of nucleotides was checked for information gain, and the resulting child node was assigned either (a simplified sketch of this loop follows the list):

  • as a leaf node, labeled by majority vote or because all of its members belong to the same class, or
  • as an internal node, used to grow another layer of the tree.

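The build loop in id3.py is more involved than this, but a simplified sketch of the breadth-first strategy could look like the following. It reuses the information_gain helper sketched above; build_tree and best_position are hypothetical names, and the plain dictionaries stand in for the NetworkX graph.

from collections import Counter, deque

BASES = 'acgt'

def best_position(samples):
    # pick the sequence position with the highest information gain
    labels = [label for _, label in samples]
    def gain(pos):
        groups = [[l for s, l in samples if s[pos] == b] for b in BASES]
        return information_gain(labels, [g for g in groups if g])
    return max(range(len(samples[0][0])), key=gain)

def build_tree(samples):
    # grow the tree breadth-first: each internal node splits on one position
    # and has up to four children, one per nucleotide
    root = {'samples': samples}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        labels = [l for _, l in node['samples']]
        if len(set(labels)) == 1:
            node['label'] = labels[0]  # pure leaf
            continue
        node['position'] = best_position(node['samples'])
        node['children'] = {}
        for b in BASES:
            subset = [(s, l) for s, l in node['samples'] if s[node['position']] == b]
            if not subset:
                continue
            if len(subset) == len(node['samples']):
                # the best split separates nothing: label by majority vote instead
                node['label'] = Counter(labels).most_common(1)[0][0]
                node['children'] = {}
                break
            child = {'samples': subset}
            node['children'][b] = child
            queue.append(child)
    return root

samples = [('acgt', '+'), ('acct', '+'), ('tggt', '-'), ('tgct', '-')]
tree = build_tree(samples)
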
I used NetworkX as the backing structure for the tree, as it allows anything to be used as nodes or edges and provides a great plotting interface via Matplotlib. Once the tree is built, it is visualized (saved to a file or drawn live; the choice is made via a command-line argument) and the specified validation data is classified against it. This is done with another dynamic breadth-first search, following an observation's nucleotide sequence down the tree until it reaches a leaf node, where the observation is labeled. The set of predictions is then visualized as a confusion matrix, showing how well the algorithm worked.
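
Classification in classify.py follows the same idea; a minimal sketch of that walk over the dictionary tree built above might look like this (classify is an illustrative name, not necessarily the real signature):

from collections import Counter

def classify(tree, sequence):
    # follow the nucleotide at each internal node's split position until a leaf
    node = tree
    while 'label' not in node:
        child = node['children'].get(sequence[node['position']])
        if child is None:
            # unseen branch: fall back to a majority vote over this node's samples
            labels = [l for _, l in node['samples']]
            return Counter(labels).most_common(1)[0][0]
        node = child
    return node['label']

print(classify(tree, 'aggt'))  # '+', using the tiny tree from the sketch above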

The program overall is efficient and provides a clear visualization of the generated decision tree and classification accuracy. The only major modification I would make is to use NumPy as the backend for the data structures instead of functional Python list and dictionary comprehensions; it would be a more efficient tool for the job, both in development and in optimization.
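
As a small example of what that change would look like, the entropy calculation above could be vectorized roughly as follows (a sketch only, not how metrics.py is currently written):

import numpy as np

def entropy_np(labels):
    # vectorized Shannon entropy over an array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

print(entropy_np(['+', '+', '-', '-']))  # 1.0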

Results

The program can be run inside an IPython Notebook and display its results live. The following cell sets the flag that allows the plots to be embedded here:

In [1]:

%matplotlib inline
import matplotlib.pyplot as plt
import pylab as pylab
# sets the size of figures

And my program is run as follows:

%run main.py --train ../data/training.txt --validation ../data/validation.txt --confidence 0

with the arguments as follows:

In [8]:

%run main.py -h
pylab.rcParams['figure.figsize'] = (14,9)

usage: main.py [-h] -t TRAIN -v VALIDATION [--ipython] -x CONFIDENCE

Implements the classic ID3 algorithm for classifying a set of dna promoters.

optional arguments:
  -h, --help            show this help message and exit
  -t TRAIN, --train TRAIN
                        the data on which you wish to train e.g.
                        "../data/training.txt"
  -v VALIDATION, --validation VALIDATION
                        the validation data
  --ipython             this is an ipython session and we want to draw the
                        figs, not save them
  -x CONFIDENCE, --confidence CONFIDENCE
                        threshold confidence level for growing the decision
                        tree. Can either be (0, 95, 99)

Running the program with a confidence of 0 lets us see the full tree (the figures are larger here due to the difficulty of drawing that many nodes in a small space).

In [9]:

%run main.py --train ../data/training.txt --validation ../data/validation.txt --confidence 0 --ipython 

0
Build_Tree: parent_f is 16 
Build feature: Attempting to build a node from 9 values
Build feature: Attempting to build a node from 15 values
Build feature: Attempting to build a node from 38 values
Build_Tree: parent_f is 34 
Build_Tree: parent_f is 31 
Build feature: Attempting to build a node from 2 values
Build feature: Attempting to build a node from 4 values
Build_Tree: parent_f is 38 
Build feature: Attempting to build a node from 15 values
Build feature: Attempting to build a node from 6 values
Build feature: Attempting to build a node from 3 values
Build_Tree: parent_f is 1 
Build_Tree: parent_f is 1 
Build_Tree: parent_f is 14 
Build_Tree: parent_f is 9 
Build_Tree: parent_f is 1 
[[ 0.875       0.125     ]
 [ 0.26315789  0.73684211]]

<matplotlib.figure.Figure at 0x7f19fb593290>

Growing the full tree gives us 87.5% accuracy on classifying true promoters correctly and 73.7% accuracy on classifying false promoters correctly.

Trying again with 95% confidence gives us these results:

In [11]:

%run main.py --train ../data/training.txt --validation ../data/validation.txt --confidence 95 --ipython 
pylab.rcParams['figure.figsize'] = (4.5,4.5)

95
Build_Tree: parent_f is 16 
Build feature: Attempting to build a node from 9 values
Build feature: Attempting to build a node from 15 values
Build feature: Attempting to build a node from 38 values
Build_Tree: parent_f is 38 
Build feature: Attempting to build a node from 15 values
Build feature: Attempting to build a node from 6 values
Build feature: Attempting to build a node from 3 values
Build_Tree: parent_f is 14 
[[ 0.6875  0.3125]
 [ 0.      1.    ]]

<matplotlib.figure.Figure at 0x7f1a00a1a490>

Using $\chi^2$ to decide when to split the tree, at a typical 95% confidence level (a rough sketch of the test follows these results), gives us

  • 68.8% accuracy on classifying true promoters correctly
  • 100% accuracy on classifying false promoters correctly

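For context, the stopping rule is essentially a chi-squared independence test on the class counts of a candidate split. A rough sketch, assuming SciPy is available and the standard degrees of freedom; the details in id3.py may differ:

import numpy as np
from scipy.stats import chi2

def should_split(child_class_counts, confidence):
    # chi-squared pre-pruning: only split when the class counts in the child
    # branches deviate from the parent's proportions more than chance allows
    # at the given confidence level; confidence == 0 always splits
    if confidence == 0:
        return True
    observed = np.asarray(child_class_counts, dtype=float)  # (branches, classes)
    expected = (observed.sum(axis=1, keepdims=True) *
                observed.sum(axis=0, keepdims=True) / observed.sum())
    stat = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return stat > chi2.ppf(confidence / 100.0, dof)

# e.g. four nucleotide branches, two classes (promoter / non-promoter)
print(should_split([[8, 1], [1, 7], [3, 3], [0, 2]], 95))  # True: split looks significant
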
Trying again with 99% confidence gives us these results:

In [5]:

%run main.py --train ../data/training.txt --validation ../data/validation.txt --confidence 99 --ipython 

99
Build_Tree: parent_f is 16 
Build feature: Attempting to build a node from 9 values
Build feature: Attempting to build a node from 15 values
Build feature: Attempting to build a node from 38 values
Build_Tree: parent_f is 38 
Build feature: Attempting to build a node from 15 values
Build feature: Attempting to build a node from 6 values
Build feature: Attempting to build a node from 3 values
Build_Tree: parent_f is 14 
[[ 0.6875  0.3125]
 [ 0.      1.    ]]

<matplotlib.figure.Figure at 0x7f19fb69a390>

The increased confidence level gains us nothing on this data. I suspect it would help only if we had far more data, or if many more features gave stronger information. We still have:

  • 68.8% accuracy on classifying true promoters correctly
  • 100% accuracy on classifying false promoters correctly

Summary

Even with this limited dataset, reasonable predictive power could be achieved by training a decision tree classifier using Quinlan’s ID3 method on the provided DNA promoter region data.