Choosing A Job - Data Science New York


I have just started looking for a Data Scientist position in New York City (email me: jobs at I decided to have fun with it, and tackle this problem from a data science perspective, by:

  • Scraping data science job offers and company reviews.
  • Visualizing company review scores.
  • Topic Modeling: which topics come up in the reviews, and whether some topics are disproportionately common for certain companies.

The code for this is available on my GitHub, along with 3 IPython notebooks that show the various steps: Scraping, Ratings and Topic Modeling. Try it with other keywords or locations!

Scraping The Data

First, I searched on the job posting website for "Data Science" in "New York City". For each company that showed up, I scraped all the reviews on Indeed and on Glassdoor, a website with information and reviews about companies.

The scraping takes a while (about 2 hours with my parameters and internet connection). This gave me about 200,000 reviews, which I stored in MongoDB. A review on Indeed looks like this:

{u'_id': ObjectId('54ee381dbcccd95833015574'),
u'company_name': u'Milliman',
u'date': datetime.datetime(2014, 8, 17, 0, 0),
u'employment_status': u'\xa0(Current Employee),\xa0',
u'job_title': u'Benefits Analyst II, QKA',
u'location': u'Woodland Park, NJ',
u'rating': u'4.0',
u'review_cons': u'',
u'review_pros': u'Pros: no commute to work since i work few minutes from my house.',
u'review_text': u'Handle various of different tasks from reviewing: new pension payment, or payment changes, or address/direct deposit/Fed & State tax updating, or perform monthly payment reconciliations, or review and authorize communication with participants of 28 clients..etc. The day is field with so many tasks so it flies very quick.From my daily work I have gained an extensive knowledge in pension administration for a full spectrum of pension plans such as: financial, government, union, education, and nonprofit organization.Work with vary diverse culture co-workers and have learned a lot about different cultures.I would say the hardest part of the job is when you have a participant that is in need for the payment immediately but we are not able to process since he/she has not returned everything or time wise we have to follow business guideline processing.I would say the most enjoyable part of the job is helping participants, making sure they receive what they have worked for and the diversity of daily tasks, learning new things everyday and enjoying the people I work with.',
u'review_title': u'Productive and friendly workplace',
u'stars': {u'Compensation/Benefits': 3, u'Job Culture': 4,
   u'Job Security/Advancement': 3, u'Job Work/Life Balance': 3,
   u'Management': 4}}

Ratings Histograms

First, let's take a look at the overall ratings (when missing, I replaced them with the average of the available ratings). Here are all of the ratings put together:


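The replacement of missing ratings mentioned above can be sketched as follows. This is a minimal illustration, not the code from the notebooks: it assumes that "missing" means a category absent from a review's `stars` field, and that it is filled with the mean of the categories that review does have. The function name `fill_missing_ratings` is made up for this example; the category names come from the sample review shown earlier.

```python
# Category names taken from the sample Indeed review above.
CATEGORIES = [
    "Compensation/Benefits",
    "Job Culture",
    "Job Security/Advancement",
    "Job Work/Life Balance",
    "Management",
]

def fill_missing_ratings(stars, categories=CATEGORIES):
    """Return a copy of `stars` where missing categories get the mean
    of the ratings that are present (assumed interpretation)."""
    available = [stars[c] for c in categories if stars.get(c) is not None]
    if not available:
        # Nothing to average: leave every category as None.
        return {c: None for c in categories}
    mean = sum(available) / len(available)
    return {c: stars[c] if stars.get(c) is not None else mean
            for c in categories}

# Hypothetical review with two categories missing.
partial = {"Compensation/Benefits": 3, "Job Culture": 4, "Management": 4}
filled = fill_missing_ratings(partial)
```

Existing ratings are left untouched; only the gaps are filled, so a review never ends up with a rating outside the range it already expressed.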
Most people like their company! I thought the overall ratings might be driven by a few large companies with good reviews, but it turns out that the per-company average ratings look similar. The most noticeable difference is in the management category, which is fairly uniform when averaged by company. Here you can see all overall ratings and averages:

Of course not all companies are equal. Some are actually pretty bad, for instance:


Here you can see any company's overall ratings:

Topic Modeling

If you want to read a detailed explanation of topic modeling, a good place to start is this review article (PDF) by one of the inventors of today's most common topic model, David M. Blei. For a brief introduction, read on.

Introduction to LDA

In latent Dirichlet allocation (LDA) topic modeling, each document is modeled as a distribution over topics, and each topic as a distribution over words.

In this model, a document is generated one word at a time: first choose a topic (based on the document's topic distribution), then pick a word based on the chosen topic's word distribution. Repeat this process for each word in the document.

What we're trying to do is the opposite: based on the documents (which we assume were generated using the above method), we try to infer the topics they originated from. This is a Bayesian inference problem.
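To make the generative process concrete, here is a small sketch using only the Python standard library. The topic names, words and probabilities are invented for illustration. Full LDA also draws the distributions themselves from Dirichlet priors; this sketch assumes they are already given.

```python
import random

def generate_document(doc_topic_dist, topic_word_dists, n_words, rng=None):
    """Generate a document word by word, following the process above."""
    rng = rng or random.Random()
    topics = list(doc_topic_dist)
    topic_weights = [doc_topic_dist[t] for t in topics]
    words = []
    for _ in range(n_words):
        # Step 1: choose a topic from the document's topic distribution.
        topic = rng.choices(topics, weights=topic_weights, k=1)[0]
        # Step 2: choose a word from that topic's word distribution.
        word_dist = topic_word_dists[topic]
        vocab = list(word_dist)
        word_weights = [word_dist[w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words

# Toy distributions (made up for this example).
topic_word_dists = {
    "work_life": {"balance": 0.4, "hours": 0.3, "flexible": 0.3},
    "pay": {"compensation": 0.5, "salary": 0.3, "benefits": 0.2},
}
doc_topic_dist = {"work_life": 0.7, "pay": 0.3}

doc = generate_document(doc_topic_dist, topic_word_dists,
                        n_words=8, rng=random.Random(0))
print(doc)
```

Word order in the output carries no information, which is why the model only needs word counts per document.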

To find the topics, I used an extension of LDA called HDP (hierarchical Dirichlet process), which infers the number of topics from the data along with the words each topic contains. The code can be found on GitHub; it is based on the HDP library found on David M. Blei's topic modeling page.

Corpus Preparation

In LDA, only the words matter (not their order, punctuation, or any other structure). The following steps generally improve the results:

  • Word tokenization: replace each document with its list of word frequencies.
  • Remove stop words: many words do not really hold meaning (e.g. "the") and add complexity. They can and should be removed before running LDA.
  • Stemming: this optional step involves reducing words to their stemmed version (e.g. if the word ends in 'ed', remove the 'ed'). I found this step unnecessary for my corpus.
  • Remove outliers: remove words that appear very rarely in the corpus. In my case, most of these were typos.
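The steps above (without stemming) could look roughly like this. It is a simplified sketch rather than the code from the notebooks: the stop-word list is a tiny illustrative one, and the `min_count` threshold and function names are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "are", "of", "to", "is", "in", "it", "i"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def prepare_corpus(documents, stop_words=STOP_WORDS, min_count=2):
    """Turn raw documents into bags of words (word -> count)."""
    # Tokenize and drop stop words.
    tokenized = [[w for w in tokenize(doc) if w not in stop_words]
                 for doc in documents]
    # Count each word over the whole corpus and drop the rare ones.
    corpus_counts = Counter(w for doc in tokenized for w in doc)
    vocabulary = {w for w, c in corpus_counts.items() if c >= min_count}
    return [Counter(w for w in doc if w in vocabulary) for doc in tokenized]

reviews = [
    "The hours are long and the pay is good.",
    "Good work/life balance, flexible hours.",
    "Managment is good.",  # typo on purpose: it appears once and is dropped
]
bags = prepare_corpus(reviews)
```

On a corpus this small the rare-word filter removes almost everything; on a large corpus it mostly removes typos and one-off words while keeping the vocabulary manageable.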

In the end, I had 183,789 documents, 10,673 unique words, and 170 topics.


A topic is a distribution over words, which you can think of as a weighted list of words. The words that best define a topic are generally the ones with the highest weights. Here is an example of a topic, with its 12 most important words:

[u'balance', u'life', u'good', u'work', u'hours', u'place', u'flexible', u'long', u'nice', u'lots', u'compensation', u'pay']

A topic has several dimensions: first the words, second the weight of each word, and third, how unique each word is to that topic within the corpus. In the example above, the word good would be highly weighted but not very unique, while flexible is less weighted but more unique. An easy way to visualize all these aspects is with a word cloud: the size of a word indicates how common it is in that topic (bigger -> more common), and the color indicates how unique the word is to that topic (redder -> more unique).
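As a rough illustration of weight versus uniqueness, here is a small sketch. The post does not say how uniqueness is computed, so the measure below (the share of a word's total weight across all topics that falls in one topic) is only an assumption, and the topics and weights are made up.

```python
def top_words(topic, n=10):
    """Return the n most heavily weighted words of a topic."""
    return sorted(topic, key=topic.get, reverse=True)[:n]

def uniqueness(word, topic_name, topics):
    """Share of the word's weight, summed over all topics, that belongs
    to the given topic (assumed measure, between 0 and 1)."""
    total = sum(t.get(word, 0.0) for t in topics.values())
    if total == 0:
        return 0.0
    return topics[topic_name].get(word, 0.0) / total

# Toy topics: word -> weight (hypothetical values).
topics = {
    "work_life": {"good": 0.30, "work": 0.25, "balance": 0.20,
                  "hours": 0.15, "flexible": 0.10},
    "management": {"good": 0.35, "management": 0.30, "work": 0.25,
                   "team": 0.09, "flexible": 0.01},
}

for word in top_words(topics["work_life"], n=5):
    weight = topics["work_life"][word]
    print(word, weight, round(uniqueness(word, "work_life", topics), 2))
```

With these toy numbers, good has the largest weight in the work_life topic but is shared with the other topic, while flexible has a small weight that sits almost entirely in work_life, which matches the distinction described above.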

Below, you can visualize, for each company, the topics that are most important among all of their reviews. To see the most common topics overall, select "Main Topics".