This webinar explores how to use Python for dealing specifically with networks. It is designed as an introduction to network analysis using Python and Game of Thrones data.
This webinar uses the NetworkX, Pandas and Matplotlib Python packages. This is not strictly an interactive webinar, but if you would like to run the code and receive outputs on your computer as I go through it, you should have Anaconda for Python 3.5 downloaded and be comfortable using Jupyter Notebook (Josiah Davis’ previous webinar covers this).
Agenda
- What is network analysis? (Hint: It is just like any other data analysis!)
- How do I visualize a network in Python?
- Who is the main character of Game of Thrones? A simple network analysis exercise
About the Instructor
Ben Gillman works as a data science consultant in New York City. Ben's professional experience spans Machine Learning, Natural Language Processing, General Linear Modeling, Forecasting, and Data Visualization. Previously, Ben was a data scientist at J.Crew where his algorithms are used for things like sending the J.Crew catalog. He graduated from the George Washington University with a Bachelor's degree in Economics.
There are many tools for developing python scripts. For this webinar, we will be executing everything using the Jupyter notebook. For those of you who are new to the Jupyter project, here are some items that you should know about it.
(This is a quick recap of what Josiah covered in the last webinar)
Item #0: Jupyter notebooks are a great way to develop and share code
Here is the definition from the Jupyter project website.
The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.
Item #1: Jupyter notebooks are organized into cells, which can contain markdown or code. This is a cell with markdown
# This is a cell with code
def simple_calculator(first_number, second_number):
return first_number + second_number
simple_calculator(5, 6)
11
Item #3: There are many useful keyboard shortcuts which are commonly used
Here are some shortcuts I use a lot. To get the full list go to Help -> Keyboard Shortcuts.
Mac | PC | |
---|---|---|
Command Mode | esc | esc |
Delete | d, d | d, d |
Markdown | m | m |
Run Cell and move to next cell | shift, return | shift, enter |
Run Cell and Insert Below | option, return | alt, enter |
Insert Above | a | a |
Insert Below | b | b |
Edit Mode | return | enter |
Run Cell and move to next cell | shift, return | shift, enter |
Run Cell and Insert Below | option, return | alt, enter |
A quick Google search will tell you network analysis is "the mathematical analysis of complex working procedures in terms of a network of related activities."
In simpler terms, network analysis is the analysis of data that represents a group of connected data points. It is different then typical tabular data analysis because data points are explicitly connected to each other.
For easier research via Google: Network analysis is under the academic umbrella of Graph Theory, the study of graphical data. Network Analysis can also be called Network Science and Graph Analysis.
- Network (aka Graph): Set of objects where at least one of the objects have a connective relationship with another
- Node (aka vertice): Singular point in the network. A node can represent any entity that has relationship to others.
- Edge (aka link): Connection between two nodes in a network
- Directed Edge: Connection where information flows from one node to another
- Un-Directed Edge: Connection where information flows bilaterally between two nodes
- Edge Weight: A value for each edge denoting a numerical value associated with the relationship between nodes
- Clique: Group of nodes where each node is connected to every other node
- Attributes: Data points associated with nodes or edges
*Typically networks are either considered un-directed or directed; they either consist of only directed edges or only undirected edges
There are many areas of network analysis. Here are some major areas:
Centrality - Finding the most important nodes in a network; the most centrally located
Clustering - Grouping nodes together
Clique Structure - Finding groups of nodes based strictly on cycles of edges
Connectivity - Given edge connections, calculating how well connected nodes are in the network
Distance - Calculating how far nodes are from others and get an idea of how large your network is
Probabilistic Analysis - Uncovering the probabilities in your network. For example, calculating the probability of reaching nodes from other nodes
We will be covering centrality and hope to give you the foundation on-which you can build skills in other areas of network analysis
The popular HBO series Game of Thrones involves a dense social network between characters. These networks are often used to manipulate one another to gain political power. But with so many characters and story lines, who are the main characters of the show; Who has the best social positioning?
This exercise is based of work done at Macalester College. The report and full data set can be found here.
Let's begin by pulling the network data into python using the Pandas package.
import pandas as pd
data = pd.read_csv('https://www.macalester.edu/~abeverid/data/stormofswords.csv')
data.head(10)
Source | Target | Weight | |
---|---|---|---|
0 | Aemon | Grenn | 5 |
1 | Aemon | Samwell | 31 |
2 | Aerys | Jaime | 18 |
3 | Aerys | Robert | 6 |
4 | Aerys | Tyrion | 5 |
5 | Aerys | Tywin | 8 |
6 | Alliser | Mance | 5 |
7 | Amory | Oberyn | 5 |
8 | Arya | Anguy | 11 |
9 | Arya | Beric | 23 |
As you can see, the raw data represents a weighted network, where weights are the number of times each character interacted in "A Storm of Swords", the third book in the series. Interaction is undirected and is defined as names appearing within 15 words of each other.
Now that we have the edges in tabular form, let's create our network using NetworkX.
import networkx as nx
edges = map(list,data.values)
G = nx.Graph()
G.add_weighted_edges_from(edges)
G
<networkx.classes.graph.Graph at 0x2552b820278>
G.number_of_nodes()
107
G.number_of_edges()
352
With NetworkX we can see we have a network with 107 nodes and 353 edges. This means we have 107 characters in our network and 353 different relationships between them.
It's nice to know these statistics but its generally helpful to visualize the network if possible. Let's do this using the MatplotLib package.
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(30,30))
#edge size given weights
esmall=[(u,v) for (u,v,d) in G.edges(data=True) if d['weight'] <7]
emed=[(u,v) for (u,v,d) in G.edges(data=True) if 7 <= d['weight'] <=14]
elarge=[(u,v) for (u,v,d) in G.edges(data=True) if d['weight'] >14]
# positions algorithm for nodes
pos=nx.spring_layout(G, k = .2, iterations = 300)
# nodes
nx.draw_networkx_nodes(G,pos,node_size=10)
# edges
nx.draw_networkx_edges(G,pos,edgelist=esmall,width = .7,alpha=0.3,edge_color='b')
nx.draw_networkx_edges(G,pos,edgelist=emed, width=1,alpha=0.4,edge_color='b')
nx.draw_networkx_edges(G,pos,edgelist=elarge, width=1.5,alpha=0.5,edge_color='b')
# labels
nx.draw_networkx_labels(G,pos,font_size=24,font_family='sans-serif')
plt.axis('off')
#plt.savefig("weighted_graph.png") # save as png
plt.show() # display
plt.clf() # clear
<matplotlib.figure.Figure at 0x2552c892358>
The network graph above is more for a data scientist to get an understanding of the network. With the stand-alone softwares Gephi or Cytoscape, more interactive and visually pleasing plots can be made.
The plot below was made using Gephi.
Now that we have a general idea of what our network looks like, and what it consists of, let's run an algorithm on the data to answer our initial question; who are the main characters?
NetworkX has a suite of centrality algorithms that can used for different interpretations of calculating how central a node is in the network.
The most fitting centrality metric for this exercise is Betweenness Centrality: how often you lie on shortest paths between two other characters, making you a broker of information.
#calculate centrality
centrality = nx.betweenness_centrality(G, weight='weight',normalized = False, seed=10)
top_characters = sorted(centrality, key=centrality.get, reverse=True)[:10]
#plot centrality for top characters
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(15,5))
plt.bar(range(len(top_characters)), [centrality[x] / 100 for x in top_characters], align='center', color = 'g')
plt.xticks(range(len(top_characters)), top_characters)
plt.title('GoT Main Characters: Betweenness Centrality Scores')
plt.show()
True Game of Thrones fans can argue who the main characters of the book are but from our data, we can see that Robert and Tyrion are the main two characters, with Jon following close behind. After these three, it seems to tail off.
-
Network analysis is a distinct lense from which to view data analysis
Some unconventional Network Analysis use-cases:
- Customer purchase recommendations
- Employee career-path prediction
- Tracking migratory patterns of birds
-
NetworkX has a large tool-box for answering many different questions, but lacks visualization capabilties
-
Python + NetworkX are powerful and easy-to-use tools for analyzing networks
Continue to play around in NetworkX with the GoT data. Try different centrality metrics and other functions like clustering or clique structure. NetworkX functions are very well documented here. Learning about algorithms by example in NetworkX is very effective!
Also check out Graph-Tool for faster algorithms, and Cytoscape and/or Gephi for stand-out network visualizations!