Hollywood Social Network
This project aims to create a social network of Hollywood's actors, directors and producers. Then, by extracting features from this network, I will predict movies' box office results.
Contents
Creating the social network
A network is composed of nodes (in this case, people) and edges (in this case, movies in which those people worked together). Various websites can be used to gather this data.
One example is Box Office Mojo, which has a page like this for most people:
By scraping each person's page, it's possible to build a list of the movies they worked on. These lists can then be inverted into a list of movies, each with the people who worked on it. Once we have the list of movies, we can create the network using the NetworkX package. Specifically:
- A node is created for every person, with their name as the key and their job as an attribute.
- When two people are in the same movie, an edge is created between their nodes. If they already had a connection, the weight of that connection is increased by one.
```python
import networkx as nx

def generate_graph(movie_list):
    G = nx.Graph()
    for movie in movie_list:
        for i, p1 in enumerate(movie.people):
            for p2 in movie.people[i + 1:]:
                # Increment the weight if this pair already shares an edge
                if G.has_edge(p1.name, p2.name):
                    G[p1.name][p2.name]["weight"] += 1
                else:
                    G.add_node(p1.name, job=p1.job)
                    G.add_node(p2.name, job=p2.job)
                    G.add_edge(p1.name, p2.name, weight=1)
    return G

hollywood_graph = generate_graph(movie_list)
```
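As a quick sanity check, the weight-increment logic can be exercised on a couple of hypothetical casts (the names below are placeholders, not scraped data); two shared movies should yield an edge of weight 2:

```python
import networkx as nx

# Hypothetical casts: Alice and Bob share two movies, Alice and Carol share one
casts = [["Alice", "Bob"], ["Alice", "Bob"], ["Alice", "Carol"]]

G = nx.Graph()
for cast in casts:
    for i, p1 in enumerate(cast):
        for p2 in cast[i + 1:]:
            if G.has_edge(p1, p2):
                G[p1][p2]["weight"] += 1
            else:
                G.add_edge(p1, p2, weight=1)

print(G["Alice"]["Bob"]["weight"])    # 2 shared movies -> weight 2
print(G["Alice"]["Carol"]["weight"])  # 1 shared movie  -> weight 1
```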
Visualizing the graph
Now I want to visualize this graph. For that, I'll use the D3 JavaScript library's force-directed graph layout <https://github.com/mbostock/d3/wiki/Force-Layout>. First, I'll export the data to a suitable format:
```python
import json
from networkx.readwrite import json_graph

hollywood_graph_int = nx.convert_node_labels_to_integers(hollywood_graph, label_attribute='name')
data = json_graph.node_link_data(hollywood_graph_int)
with open('graph.json', 'w') as f:
    json.dump(data, f, indent=4)
```
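To get a feel for the exported format, here is `node_link_data` applied to a hypothetical two-person graph; the output has a `"nodes"` list (with the original names preserved under the `"name"` attribute) and a `"links"` list with integer source/target ids:

```python
import json
import networkx as nx
from networkx.readwrite import json_graph

# Hypothetical two-person graph
G = nx.Graph()
G.add_edge("Alice", "Bob", weight=2)

G_int = nx.convert_node_labels_to_integers(G, label_attribute="name")
data = json_graph.node_link_data(G_int)
print(json.dumps(data, indent=2))
```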
Finally, I can use the visualization library to create the output. Below is a graph with a random subset of 200 movies. The full graph is here (warning: the full graph can take several seconds to load). You can search for specific people to see their connections, as well as double-click on nodes.
Graph Features
In order to predict movie box office results using this graph, I'll need to extract features that quantify how well connected and important a person is. I selected the following features:
- degree = number of connections
- average neighborhood degree = average number of connections of a node's connections
- sum of neighborhood degree = total number of connections of a node's connections
- node centrality (load centrality) = how many shortest paths between pairs of nodes pass through this node
- node importance = the node's PageRank (the algorithm originally used to rank websites by order of importance)
Most of these features are already implemented in the NetworkX package:
```python
G = hollywood_graph

# PageRank
pagerank = nx.pagerank_numpy(G, alpha=0.85, weight='weight')

# Degree
degree = nx.degree(G, nbunch=None, weight='weight')

# Average neighborhood degree
avg_neighbor_degree = nx.average_neighbor_degree(G, source='out', target='out',
                                                 nodes=None, weight='weight')

# Total neighborhood degree
total_neighbor_degree = {}
for person, deg in avg_neighbor_degree.items():
    total_neighbor_degree[person] = deg * degree[person]

# Load centrality
centrality = nx.load_centrality(G, cutoff=10)
```
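On a tiny hypothetical graph, these calls behave as expected. (I use `nx.pagerank` in this sketch, since `pagerank_numpy` has been removed in recent NetworkX releases; the results are equivalent.)

```python
import networkx as nx

# A hypothetical four-person graph: "hub" worked with everyone,
# while "a" and "b" also share one movie between themselves
G = nx.Graph()
G.add_edge("hub", "a", weight=1)
G.add_edge("hub", "b", weight=1)
G.add_edge("hub", "c", weight=1)
G.add_edge("a", "b", weight=1)

degree = dict(nx.degree(G, weight="weight"))
avg_nd = nx.average_neighbor_degree(G, weight="weight")
pagerank = nx.pagerank(G, alpha=0.85, weight="weight")
centrality = nx.load_centrality(G)

print(degree["hub"])                     # 3: connected to the three others
print(max(pagerank, key=pagerank.get))   # "hub" is the most important node
```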
I'm trying to predict movie box office results, so I need to combine these per-person features into per-movie features. For this, I used three approaches:
- the highest value of that feature among all the people involved in the movie
- the sum of the values
- the average of the values
Combining these together, we can build the features for movies:
```python
feature_names = ['budget',
                 'degree sum', 'degree highest', 'degree average',
                 'average neighborhood degree sum', 'average neighborhood degree highest',
                 'average neighborhood degree average',
                 'total neighborhood degree sum', 'total neighborhood degree highest',
                 'total neighborhood degree average',
                 'load centrality sum', 'load centrality highest', 'load centrality average',
                 'pagerank sum', 'pagerank highest', 'pagerank average']

# Returns the aggregated value of this feature for the given movie and param
# param: 0 = sum, 1 = highest, 2 = average
def movie_feat(movie, feat, param):
    value = 0
    count = 0
    for person in movie.people:
        try:
            current = feat[person.name]
            if param == 0:  # sum
                value += current
            if param == 1:  # highest
                value = max(value, current)
            if param == 2:  # average
                value += current
                count += 1
        except KeyError:
            pass  # this person has no connections :(
    if param == 2:
        return value / count if count else 0
    return value

# Returns the 3 aggregations (sum, highest, average) of the given feature,
# each as a list with one value per movie
def movies_feat(movies_list, feat):
    return [[movie_feat(mov, feat, param) for mov in movies_list]
            for param in [0, 1, 2]]

node_features = [degree, avg_neighbor_degree, total_neighbor_degree,
                 centrality, pagerank]

# A list (1 per feature) of lists (3 per feature) of lists (1 value per movie)
movie_features = [movies_feat(movie_list, feat) for feat in node_features]

# Flatten the list to have a list of feature value lists
movie_features = [mov_ft for feat in movie_features for mov_ft in feat]

# Add budget
movie_features.append(movie_budgets)

# Now I'll create the feature matrix:
movie_features = np.array(movie_features)
np.shape(movie_features)  # = (16, 2014)
```
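The aggregation logic itself is easy to check on hypothetical values; people missing from the feature dictionaries are simply skipped:

```python
# Hypothetical per-person feature values; "Unknown" has no graph features
feat = {"Alice": 10.0, "Bob": 30.0}
cast = ["Alice", "Bob", "Unknown"]

values = [feat[p] for p in cast if p in feat]
agg = {"sum": sum(values),
       "highest": max(values),
       "average": sum(values) / len(values)}
print(agg)  # {'sum': 40.0, 'highest': 30.0, 'average': 20.0}
```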
Who's the most important in Hollywood?
The following lists show the top 5 people according to each feature (using only actors in the graph).

Degree:
- Samuel L. Jackson = 192
- Jon Favreau = 184
- Seth Rogen = 173
- Stanley Tucci = 169
- Bruce Willis = 169

Average Neighborhood Degree:
- Dan Stevens = 75.8
- Tobey Maguire = 74.0
- Barbra Streisand = 73.9
- Ben Barnes = 72.7
- Hailee Steinfeld = 71.9

Total Neighborhood Degree:
- Jon Favreau = 12217
- Samuel L. Jackson = 12175
- Stanley Tucci = 10058
- Bruce Willis = 9703
- Seth Rogen = 9610

Load Centrality:
- Dennis Quaid = 0.0140
- Bruce Willis = 0.0129
- Stanley Tucci = 0.0117
- Samuel L. Jackson = 0.0114
- Liam Neeson = 0.0109

PageRank:
- Samuel L. Jackson = 0.00510
- Jon Favreau = 0.00472
- Seth Rogen = 0.00467
- Bruce Willis = 0.00462
- Stanley Tucci = 0.00458
The same people rank highly on several features, but fortunately the features don't appear to be simple duplicates of one another.
Predicting Movie Box Office
I'm going to use a gradient boosting regressor from scikit-learn. First, I'll split the data into training and test sets:
```python
import random
import numpy as np
from sklearn import ensemble
from sklearn.metrics import mean_squared_error

Y = np.array([movie.lifetime_gross for movie in movie_list])
X = np.transpose(movie_features)  # the features should be columns

test_index = random.sample(range(len(Y)), len(Y) // 4)
X_test, Y_test = X[test_index, :], Y[test_index]
X_train = np.delete(X, test_index, axis=0)
Y_train = np.delete(Y, test_index)

params = {'n_estimators': 30000, 'max_depth': 1, 'min_samples_split': 2,
          'alpha': 0.5, 'learning_rate': 0.001, 'loss': 'huber',
          'subsample': 0.5, 'random_state': 1}
clf = ensemble.GradientBoostingRegressor(**params)
clf.fit(X_train, Y_train)
```
The parameters were chosen using cross-validation, and limited by the computing power of the laptop I ran the code on.
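With a fitted model, the held-out error can be measured with `mean_squared_error`. Sketched below on synthetic data (the scraped feature matrix isn't included in this post), with a much smaller `n_estimators` to keep the run fast:

```python
import numpy as np
from sklearn import ensemble
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data: 200 movies x 16 features, made-up "box office" target
rng = np.random.RandomState(1)
X = rng.rand(200, 16)
Y = X @ rng.rand(16) * 1e8

split = 3 * len(Y) // 4
X_train, Y_train = X[:split], Y[:split]
X_test, Y_test = X[split:], Y[split:]

clf = ensemble.GradientBoostingRegressor(
    n_estimators=300, max_depth=1, learning_rate=0.05,
    loss="huber", alpha=0.5, subsample=0.5, random_state=1)
clf.fit(X_train, Y_train)

mse = mean_squared_error(Y_test, clf.predict(X_test))
print(mse)
```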
One advantage of tree-based models is that they are somewhat interpretable. It's not quite as clear-cut for ensembles, but you can still plot the feature importances:
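A minimal sketch of such a plot, using scikit-learn's `feature_importances_` attribute (fitted here on synthetic data so the block runs standalone; on the real data you would plot `clf` from above against `feature_names`):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn import ensemble

# Synthetic stand-in: 16 features, only the first one actually matters
rng = np.random.RandomState(1)
X = rng.rand(100, 16)
Y = 5 * X[:, 0] + rng.randn(100)
clf = ensemble.GradientBoostingRegressor(n_estimators=100, max_depth=1,
                                         random_state=1)
clf.fit(X, Y)

importances = clf.feature_importances_  # normalized to sum to 1
order = np.argsort(importances)
plt.barh(range(len(order)), importances[order])
plt.xlabel("relative importance")
plt.tight_layout()
plt.savefig("feature_importance.png")
```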
The regression results will soon be posted.