Hollywood Social Network


This project aims to create a social network of Hollywood's actors, directors and producers. Then, by extracting features from this network, I will predict movies' box office results.

Creating the social network

A network is composed of nodes - in this case people - and edges - in this case movies in which those people worked together. Various websites can be used to gather this data.

One example is Box Office Mojo, which has a page like this for most people:


By scraping each person's page, it's possible to create a list of the movies they worked in. These per-person lists can then be inverted into a list of movies, each with the people who worked on it (a minimal sketch of this inversion follows the graph code below). Once we have the list of movies, we can create the network using the NetworkX package. Specifically:

  • A node is created for every person, with their name as the key and their job as an attribute.
  • When two people are in the same movie, an edge is created between their nodes. If they already had a connection, the weight of that connection is increased by one.
import networkx as nx

def generate_graph(movie_list):
    G = nx.Graph()
    for movie in movie_list:
        for i, p1 in enumerate(movie.people):
            for p2 in movie.people[i+1:]:
                p1name, p2name = p1.name, p2.name
                # create the nodes (if missing), storing each person's job as an attribute
                G.add_node(p1name, job=p1.job)
                G.add_node(p2name, job=p2.job)
                # create the edge, or increase its weight if the connection already exists
                if G.has_edge(p1name, p2name):
                    G[p1name][p2name]["weight"] += 1
                else:
                    G.add_edge(p1name, p2name, weight=1)
    return G

hollywood_graph = generate_graph(movie_list)
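
The inversion step itself isn't shown above; as a minimal sketch, assuming the scraper produces a dictionary mapping each person to the list of movie titles found on their page (the name person_filmographies is only illustrative), it could look like this:

from collections import defaultdict

# Sketch only: person_filmographies is assumed to map each person to the movie
# titles scraped from their page; invert it into a mapping from movie to people.
def invert_to_movies(person_filmographies):
    people_by_movie = defaultdict(list)
    for person, titles in person_filmographies.items():
        for title in titles:
            people_by_movie[title].append(person)
    return people_by_movie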

Visualizing the graph

Now I want to visualize this graph. For that, I'll use the force-directed graph layout from the D3 JavaScript library <https://github.com/mbostock/d3/wiki/Force-Layout>. First I'll export the data to a suitable format:

# relabel nodes with integer ids, keeping the original name as a node attribute
hollywood_graph_int = nx.convert_node_labels_to_integers(hollywood_graph, label_attribute='name')

import json
from networkx.readwrite import json_graph

data = json_graph.node_link_data(hollywood_graph_int)
with open('graph.json', 'w') as f:
    json.dump(data, f, indent=4)

Finally, I can use the visualization library to create the output. Below is a graph built from a random subset of 200 movies. The full graph is here (Warning: the full graph can take several seconds to load). You can search for specific people to see their connections, and double-click on nodes.
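
The sampling code for the demo graph isn't shown; as a small sketch, it could be as simple as building the graph from a random sample of the movie list:

import random

# illustrative only: build the smaller demo graph from 200 randomly chosen movies
demo_movies = random.sample(movie_list, 200)
demo_graph = generate_graph(demo_movies)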

Graph Features

In order to predict movie box office using this graph, I'll need to extract features. I'm looking to quantify how well connected and important a person is. I selected the following features:

  • degree = number of connections
  • average neighborhood degree = average number of connections of the connections
  • sum of neighborhood degree = sum of the number of connections of the connections
  • node centrality = how many shortest paths between nodes pass through this node
  • node importance = the node's pagerank (algorithm initially used to rank websites by order of importance)

Most of these features are already implemented in the NetworkX package:

G = hollywood_graph

# PageRank
pagerank = nx.pagerank_numpy(G, alpha=0.85, weight='weight')

# Degree
degree = nx.degree(G, nbunch=None, weight='weight')

# Average neighborhood degree
avg_neighbor_degree = nx.average_neighbor_degree(G, source='out', target='out', nodes=None, weight='weight')

# Total neighborhood degree
total_neighbor_degree = {}
for person, deg in avg_neighbor_degree.items():
   total_neighbor_degree[person] = deg*degree[person]

# load centrality
centrality = nx.load_centrality(G, cutoff=10)

I'm trying to predict movie box office, so I need to combine these features for people into features for movies. For this, I used three approaches:

  • the highest value of that feature among all the people involved in the movie
  • the sum
  • the average

Putting this together, we can build the features for the movies:

feature_names = ['budget',
                 'degree sum', 'degree highest', 'degree average',
                 'average neighborhood degree sum', 'average neighborhood degree highest', 'average neighborhood degree average',
                 'total neighborhood degree sum', 'total neighborhood degree highest', 'total neighborhood degree average',
                 'load centrality sum', 'load centrality highest', 'load centrality average',
                 'pagerank sum', 'pagerank highest', 'pagerank average']

# returns the combined value of this feature for the given movie
# param : 0 - sum, 1 - highest, 2 - average
def movie_feat(movie, feat, param):
    value = 0
    count = 0
    for person in movie.people:
        try:
            current = feat[person.name]
        except KeyError:
            continue # this person has no connections :(
        if param == 0:   # sum
            value += current
        elif param == 1: # highest
            value = max(value, current)
        elif param == 2: # average
            value += current
            count += 1
    if param == 2:
        return value / count if count else 0
    return value

# returns 3 lists (sum, highest, average) for the given feature,
# each with one value per movie
def movies_feat(movies_list, feat):
   return [[movie_feat(mov, feat, param) for mov in movies_list] for param in [0,1,2]]

import numpy as np

node_features = [degree, avg_neighbor_degree, total_neighbor_degree, centrality, pagerank]

# this is a list (1 per feature) of lists (3 for each feature) of lists (1 value per movie)
movie_features = [movies_feat(movie_list, feat) for feat in node_features]
# I'll flatten the list to have one list of values per (feature, aggregation) pair
movie_features = [mov_ft for feat in movie_features for mov_ft in feat]
# add budget

# Now I'll create the feature matrix:
movie_features = np.array(movie_features)

np.shape(movie_features) # = (16, 2014): 5 graph features x 3 aggregations, plus budget, across 2014 movies

Who's the most important in Hollywood?

The following lists show the top 5 people according to each feature (using only actors in the graph); a sketch of how such a ranking can be computed follows the lists.

  • Degree:

    • Samuel L. Jackson = 192
    • Jon Favreau = 184
    • Seth Rogen = 173
    • Stanley Tucci = 169
    • Bruce Willis = 169
  • Average Neighborhood Degree:

    • Dan Stevens = 75.8
    • Tobey Maguire = 74.0
    • Barbra Streisand = 73.9
    • Ben Barnes = 72.7
    • Hailee Steinfeld = 71.9
  • Total Neighborhood Degree:

    • Jon Favreau = 12217
    • Samuel L. Jackson = 12175
    • Stanley Tucci = 10058
    • Bruce Willis = 9703
    • Seth Rogen = 9610
  • Load Centrality:

    • Dennis Quaid = 0.0140
    • Bruce Willis = 0.0129
    • Stanley Tucci = 0.0117
    • Samuel L. Jackson = 0.0114
    • Liam Neeson = 0.0109
  • Pagerank:

    • Samuel L. Jackson = 0.00510
    • Jon Favreau = 0.00472
    • Seth Rogen = 0.00467
    • Bruce Willis = 0.00462
    • Stanley Tucci = 0.00458
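
The ranking code isn't reproduced above; as a minimal sketch, assuming the features are plain dictionaries keyed by name (as in the code above) and that the 'job' attribute stored in generate_graph uses an 'actor' label, a top-5 list could be computed like this:

# Sketch only: top-n people for a node feature, restricted to a given job
def top_people(G, feature, job='actor', n=5):
    jobs = nx.get_node_attributes(G, 'job')
    people = [p for p in feature if jobs.get(p) == job]
    return sorted(people, key=feature.get, reverse=True)[:n]

for name in top_people(hollywood_graph, pagerank):
    print(name, pagerank[name])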

The same people rank highly across several features, but fortunately the rankings are not identical.
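
One quick way to check how similar two of these rankings really are (this check is not part of the original analysis, and assumes SciPy is available) is a rank correlation over the people both dictionaries share:

from scipy.stats import spearmanr

# illustrative check: rank-correlate two node features over their common people
common = sorted(set(total_neighbor_degree) & set(pagerank))
rho, _ = spearmanr([total_neighbor_degree[p] for p in common],
                   [pagerank[p] for p in common])
print('Spearman correlation between total neighborhood degree and pagerank: %.3f' % rho)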

Predicting Movie Box Office

I'm going to use a gradient boosting regressor from scikit-learn. First I'll split the data into a test and training set:

import random

from sklearn import ensemble
from sklearn.metrics import mean_squared_error

Y = np.array([movie.lifetime_gross for movie in movie_list])
X = np.transpose(movie_features) # the features should be columns

# hold out a random quarter of the movies as a test set
test_index = random.sample(range(len(Y)), len(Y) // 4)
X_test, Y_test = X[test_index, :], Y[test_index]
X_train, Y_train = np.delete(X, test_index, axis=0), np.delete(Y, test_index)

params = {'n_estimators': 30000, 'max_depth': 1, 'min_samples_split': 1, 'alpha':0.5,
       'learning_rate': 0.001, 'loss': 'huber', 'subsample':0.5, 'random_state':1}
clf = ensemble.GradientBoostingRegressor(**params)
clf.fit(X_train, Y_train)

The parameters were chosen using cross-validation, and limited by the power of the laptop I ran the code on.
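
The cross-validation itself isn't shown; a minimal sketch using scikit-learn's GridSearchCV (the grid and the smaller n_estimators below are illustrative, not the values actually searched) would look like this:

from sklearn.model_selection import GridSearchCV

# illustrative grid only; the values actually tried are not recorded in this post
param_grid = {
    'learning_rate': [0.001, 0.01],
    'max_depth': [1, 2],
    'subsample': [0.5, 1.0],
}
search = GridSearchCV(
    ensemble.GradientBoostingRegressor(n_estimators=1000, loss='huber', random_state=1),
    param_grid, scoring='neg_mean_squared_error', cv=3)
search.fit(X_train, Y_train)
print(search.best_params_)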

One advantage of tree-based models is that they are somewhat interpretable. It's not quite as clear for ensembles, but you can still plot the variable importances:
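
As a sketch, assuming matplotlib is available, the importances stored in clf.feature_importances_ can be plotted against the feature_names list defined earlier:

import matplotlib.pyplot as plt

# plot the fitted model's relative feature importances, sorted for readability
importances = clf.feature_importances_
order = np.argsort(importances)
plt.barh(np.arange(len(order)), importances[order])
plt.yticks(np.arange(len(order)), [feature_names[i] for i in order])
plt.xlabel('relative importance')
plt.tight_layout()
plt.show()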


The regression results will soon be posted.