Using Gradient Boosting (with Early Stopping)


Whenever training a machine learning algorithm, you need to balance learning from the data against overfitting it. Overfitting means that the algorithm will perform well on the training data, but will not generalize well to new data. For many algorithms, the standard method to avoid this problem is regularization. For gradient boosting, the most straightforward way is to use cross-validation to select the learning rate and the number of trees to train.

In this post, I will go over the most important gradient boosting parameters, and describe how to implement a technique called early stopping to avoid overfitting, using scikit-learn.
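As a quick taste of what the post covers, here is a minimal sketch of early stopping with scikit-learn's GradientBoostingClassifier, using its built-in validation_fraction and n_iter_no_change parameters on a synthetic dataset (the numbers here are illustrative, not from the post):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, just for illustration
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Allow up to 500 trees, but stop once the held-out validation score
# has not improved for 10 consecutive boosting iterations
clf = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.1,
    validation_fraction=0.2,
    n_iter_no_change=10,
    random_state=0,
)
clf.fit(X_train, y_train)
print(clf.n_estimators_)          # trees actually trained (often far below 500)
print(clf.score(X_test, y_test))  # accuracy on held-out data
```

The `n_estimators_` attribute reports how many trees were actually fit before validation stopped improving.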

Read more…

YouTube Chord OCR

I am working on building a chord classifier using neural networks. In order to train a machine learning algorithm, I need training data, i.e. audio snippets labeled with the correct chord. I thought of two sources:

  • Tape myself.
  • Use YouTube play-along videos. These videos, intended for people to play along with, display the chords in real time. Using OCR, I can read these chords in order to correctly label the audio.

Here's an example of the type of video I'm referring to:

This post describes the process of taking YouTube videos and creating a training set. The code can be found on GitHub. You can also view the code in an IPython notebook. I also explain how I used profiling to optimize my code.

Read more…

Chord Classification using Neural Networks

I'm currently working on classifying chords from audio using neural networks. This post gives an overview of the project, explains how it works, and will (soon) showcase the final product.

This is a work in progress, and will be updated regularly.

Read more…

Streaming Microphone Input With Flask

This post describes how to set up a microphone stream, using the HTML5 API for handling the input and WebSockets for handling communication with the Flask application.

I use this for my project on chord classification with neural networks.

Read more…

Choosing A Job - Data Science New York

I have just started looking for a Data Scientist position in New York City (email me: jobs at …). I decided to have fun with it, and tackle this problem from a data science perspective, by:

  • Scraping data science job offers and company reviews.
  • Visualizing company review scores.
  • Topic modeling: seeing which topics come up in the reviews, and whether some topics show up disproportionately at certain companies.

Read more…

Dioxin Emissions


During my Master's degree at Columbia University, I studied dioxin emissions in the United States.

Dioxins are a group of related chemical compounds. They are persistent environmental pollutants of major concern due to their high toxicity. Dioxins are created by a wide range of processes, including industrial processes and natural processes, and are found throughout the world.

The Environmental Protection Agency conducted a national inventory of dioxin emissions for the years 1987, 1995, and 2000. Part of my research involved performing a new inventory for the year 2012, and updating the older values using more recent data.

This post shows the results of this research, and showcases several visualizations.

Read more…

Hollywood Social Network

This project aims to create a social network of Hollywood's actors, directors, and producers. Then, by extracting features from this network, I will predict movies' box office results.
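One way features could be extracted from such a network is with graph centrality measures; here is a minimal sketch using networkx on a made-up collaboration graph (the names and edges are purely illustrative):

```python
import networkx as nx

# Toy collaboration graph: an edge means two people worked on the same movie
G = nx.Graph()
G.add_edges_from([
    ("Director A", "Actor B"),
    ("Director A", "Actor C"),
    ("Actor B", "Producer D"),
])

# One candidate network feature: degree centrality of each person,
# i.e. the fraction of other nodes they are directly connected to
centrality = nx.degree_centrality(G)
print(centrality["Director A"])  # 2 connections out of 3 possible -> ~0.667
```

Centralities like this (degree, betweenness, PageRank) could then feed into a regression model as per-person or per-cast features.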

Read more…

New York Subway Traffic Data Part 2

Interesting Results

In NYC Traffic Data Part 1, I explained where to obtain MTA subway traffic data and how to transform it into something easy to analyze.

This post provides some interesting insights from the data:

  1. Peak Times
  2. Stations with highest peak times
  3. Stations with highest daily peak
  4. Average daily and weekly riders
  5. Total Riders
In [1]:
import pickle
station_to_traffic_daily, station_to_traffic_hourly = {}, {}

with open("station_to_traffic_daily.pkl", "rb") as station_traffic_pickle:
    station_to_traffic_daily = pickle.load(station_traffic_pickle)

with open("station_to_traffic_hourly.pkl", "rb") as station_traffic_pickle:
    station_to_traffic_hourly = pickle.load(station_traffic_pickle)

Peak Times

One limitation of the data is that traffic is often recorded only once every several hours. As a result, several hours end up with the same average hourly traffic.

In [2]:
def station_to_peaks(station_to_traffic_interval):
    station_to_peak_entry_time_entries_exit_time_exits = {}
    for station in station_to_traffic_interval:
        peak_entries, peak_exits = 0, 0
        peak_entry_time, peak_exit_time = [], []
        for date_time, entries, exits in station_to_traffic_interval[station]:
            if peak_entries < entries:
                peak_entries = entries
                peak_entry_time = [date_time]
            elif peak_entries == entries:
                # Several intervals can tie for the peak, so keep all of them
                peak_entry_time.append(date_time)
            if peak_exits < exits:
                peak_exits = exits
                peak_exit_time = [date_time]
            elif peak_exits == exits:
                peak_exit_time.append(date_time)
        station_to_peak_entry_time_entries_exit_time_exits[station] = \
                [peak_entry_time, peak_entries, peak_exit_time, peak_exits]
    return station_to_peak_entry_time_entries_exit_time_exits

station_to_peaks_dic = station_to_peaks(station_to_traffic_hourly)
In [3]:
%matplotlib inline
import matplotlib.pylab as plt

stations, values = zip(*station_to_peaks_dic.items())
peak_entry_times, peak_entries, peak_exit_times, peak_exits = zip(*values)
peak_entry_hours = [date_time.hour for sublist in peak_entry_times for date_time in sublist]
peak_exit_hours = [date_time.hour for sublist in peak_exit_times for date_time in sublist]

counts, bins, patches = plt.hist([peak_entry_hours, peak_exit_hours], 23)
plt.title("Station Peak Entry and Exit Times", size=18)

New York Subway Traffic Data Part 1

Formatting the MTA Turnstile Data

During the Metis Data Science bootcamp, I looked at New York subway traffic using MTA turnstile data. This post is about obtaining and cleaning the data, with a first look at what it actually contains. For a more in-depth analysis, check out NYC Traffic Data Part 2.

The data structure changed in October, so I only used data published after 10/18/14, in the latest format. I want to identify the stations with the most traffic, and find the highest-traffic time for each station.

I realized that although this data seemed high quality, I still needed to do serious cleaning and checks.

  1. The raw data
  2. Cleaning the data
  3. Visualizing hourly entries
  4. Saving the data

Downloading the data

The files are named turnstile_yymmdd.txt, with one published each week on Saturday.

In [1]:
import datetime

# Format the date like the file name
def date_format(date):
    return date.strftime("%y%m%d")

# Download a couple of weeks of data
start_date =, 10, 18)
end_date =
link_prefix = ""
file_addresses = []

this_date = start_date
while this_date <= end_date:
    file_addresses.append(link_prefix + date_format(this_date) + ".txt")
    this_date += datetime.timedelta(days=7)

# for f in file_addresses:
#     !wget {f}

A first look at the raw data

In [2]:
files = !ls | grep turnstile.*txt

!head {files[0]}
A060,R001,00-00-00,WHITEHALL ST,R1,BMT,10/11/2014,01:00:00,REGULAR,0000805439,0001141080                                          
A060,R001,00-00-00,WHITEHALL ST,R1,BMT,10/11/2014,05:00:00,REGULAR,0000805459,0001141141                                          
A060,R001,00-00-00,WHITEHALL ST,R1,BMT,10/11/2014,09:00:00,REGULAR,0000805589,0001141257                                          
A060,R001,00-00-00,WHITEHALL ST,R1,BMT,10/11/2014,13:00:00,REGULAR,0000805834,0001141512                                          
A060,R001,00-00-00,WHITEHALL ST,R1,BMT,10/11/2014,17:00:00,REGULAR,0000806150,0001141903                                          
A060,R001,00-00-00,WHITEHALL ST,R1,BMT,10/11/2014,21:00:00,REGULAR,0000806431,0001142305                                          
A060,R001,00-00-00,WHITEHALL ST,R1,BMT,10/12/2014,01:00:00,REGULAR,0000806591,0001142537                                          
A060,R001,00-00-00,WHITEHALL ST,R1,BMT,10/12/2014,05:00:00,REGULAR,0000806609,0001142618                                          
A060,R001,00-00-00,WHITEHALL ST,R1,BMT,10/12/2014,09:00:00,REGULAR,0000806670,0001142994                                          
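The last two columns above appear to be cumulative register counts, so per-interval traffic comes from differencing consecutive readings. A minimal sketch with pandas, using the entry counter from the first three sample rows:

```python
import pandas as pd

# Cumulative entry counts from the first three WHITEHALL ST rows above
df = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2014-10-11 01:00", "2014-10-11 05:00", "2014-10-11 09:00"]),
    "entries": [805439, 805459, 805589],
})

# Entries during each interval = difference between consecutive readings
df["interval_entries"] = df["entries"].diff()
print(df["interval_entries"].tolist())  # [nan, 20.0, 130.0]
```

The first interval has no previous reading, so its value is NaN; in practice that is one of the things the cleaning step has to handle, along with counters that reset or run backwards.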

Hello World

Hello world!

Welcome to my blog!

This is my first post. Just checking things out.