Social network data permeates our world, yet we often don't know what to do with it. In this tutorial, I will introduce both the theory and practice of Social Network Analysis (SNA): gathering, analyzing, and visualizing data using Python and other open-source tools. I will walk attendees through an entire project, from gathering and cleaning data to presenting results.
SNA techniques are derived from sociological and social-psychological theories and take into account the whole network (or, in the case of very large networks such as Twitter, a large segment of the network). Thus, we may arrive at results that seem counter-intuitive: for example, that Justin Bieber (7.5 million followers) and Lady Gaga (7.2 million followers) have relatively little actual influence despite their celebrity status, while a middle-of-the-road blogger with 30K followers is able to generate tweets that "go viral" and result in millions of impressions.
In this tutorial, we will conduct social network analysis of a real dataset, from gathering and cleaning data to analysis and visualization of results. We will use Python and a set of open-source libraries, including NetworkX, NumPy and Matplotlib.
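To make the difference between follower count and influence concrete, here is a minimal sketch using NetworkX. The graph, the node names, and the `potential_reach` helper are hypothetical toy constructs, not data or methods from the tutorial. Reach is approximated here as the number of accounts a message could get to through chains of followers, which is only one of many possible influence measures.

```python
import networkx as nx

# Toy graph (an assumption for illustration): an edge (a, b) means
# "information flows from a to b", i.e. b follows a.
flow = nx.DiGraph()

# Hypothetical celebrity: 8 followers, none of whom has followers of their own.
flow.add_edges_from(("celebrity", f"fan{i}") for i in range(8))

# Hypothetical blogger: 3 followers, each of whom has 4 followers.
for i in range(3):
    flow.add_edge("blogger", f"reader{i}")
    flow.add_edges_from((f"reader{i}", f"reader{i}_friend{j}") for j in range(4))


def follower_count(graph, node):
    """Direct followers: the number of outgoing information-flow edges."""
    return graph.out_degree(node)


def potential_reach(graph, node):
    """Everyone a message could reach if each recipient passed it on."""
    return len(nx.descendants(graph, node))


followers = {n: follower_count(flow, n) for n in ("celebrity", "blogger")}
reach = {n: potential_reach(flow, n) for n in ("celebrity", "blogger")}

print(followers)  # {'celebrity': 8, 'blogger': 3}
print(reach)      # {'celebrity': 8, 'blogger': 15}
```

In this toy network the blogger has fewer direct followers but a larger potential reach, because the blogger's followers have audiences of their own.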
Outline:
Learn the basics of natural language processing with NLTK, the Natural Language ToolKit. First we'll cover tokenization, stemming and WordNet. Next we'll get into part-of-speech tagging, chunking & named entity recognition. Then we'll close with text classification and sentiment analysis. You'll walk out with new superpowers and an appreciation of the difficulties of analyzing human language.
This tutorial takes a hands-on approach to learning natural language processing using NLTK, the Natural Language ToolKit. We will cover everything from tokenizing sentences to phrase extraction, and from splitting words to training your own text classifiers for sentiment analysis. Please come prepared with NLTK already installed so we can dive into the code & data immediately.
Hour 1: Tokenization, Stemming & Corpora
Tokenization and familiarity with corpus readers & models are prerequisites for the more interesting aspects of NLTK. This first hour will include:
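As a rough illustration of tokenization and stemming, the sketch below uses NLTK's regex-based `wordpunct_tokenize` and `PorterStemmer`. These two were chosen here because they need no downloaded corpora or models; the tutorial itself may use other tokenizers and stemmers. The sample sentence is made up.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Made-up sample sentence (an assumption for illustration).
text = "The runners were running quickly."

# Split into runs of word characters and runs of punctuation.
tokens = wordpunct_tokenize(text)

# Reduce each token to its Porter stem; stems are not always real words.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

print(tokens)  # ['The', 'runners', 'were', 'running', 'quickly', '.']
for token, stem in zip(tokens, stems):
    print(token, "->", stem)
```

Note that a stemmer strips suffixes by rule, so "running" becomes "run" while some outputs will not be dictionary words.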
Hour 2: Part-of-Speech Tagging & Chunking/NER
Using tokenization and a working knowledge of corpus readers & pickled models, we'll dive into part-of-speech tagging and chunking/NER, including:
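A minimal sketch of tagging followed by chunking might look like the following. To stay self-contained it trains a `UnigramTagger` on a single hand-tagged sentence instead of loading a pickled model or corpus, and the noun-phrase grammar is a common textbook pattern. The training data, grammar, and test sentence are all assumptions for illustration, not the tutorial's own material.

```python
from nltk.chunk import RegexpParser
from nltk.tag import DefaultTagger, UnigramTagger

# Tiny hand-tagged training set (an assumption; real taggers are
# trained on full corpora such as the Penn Treebank sample).
train_sents = [[
    ("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
    ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]]

# Look up each word's most frequent tag; unknown words fall back to NN.
tagger = UnigramTagger(train_sents, backoff=DefaultTagger("NN"))
tagged = tagger.tag(["the", "lazy", "dog", "jumps", "over", "the", "quick", "fox"])

# Chunk noun phrases: optional determiner, any adjectives, one or more nouns.
chunker = RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(tagged)

# Collect the words of every NP subtree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]

print(tagged)
print(noun_phrases)  # ['the lazy dog', 'the quick fox']
```

The chunker works purely on the tag sequence, so its output is only as good as the tags it receives.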
Hour 3: Text Classification & Sentiment Analysis
After using classifiers for training part-of-speech taggers and chunkers, this final hour will explain text classification in greater detail with:
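For a sense of what training a sentiment classifier involves, here is a small sketch using NLTK's `NaiveBayesClassifier` with bag-of-words features. The four training snippets and the `bag_of_words` helper are hypothetical; a real classifier would be trained on a labeled corpus with far more examples.

```python
from nltk.classify import NaiveBayesClassifier


def bag_of_words(text):
    """Hypothetical feature extractor: mark each lowercased word as present."""
    return {word: True for word in text.lower().split()}


# Made-up labeled examples (an assumption for illustration only).
train_set = [
    (bag_of_words("great fun loved it"), "pos"),
    (bag_of_words("wonderful great acting"), "pos"),
    (bag_of_words("terrible boring hated it"), "neg"),
    (bag_of_words("awful boring plot"), "neg"),
]

classifier = NaiveBayesClassifier.train(train_set)

label_pos = classifier.classify(bag_of_words("great wonderful fun"))
label_neg = classifier.classify(bag_of_words("boring awful"))

print(label_pos)  # pos
print(label_neg)  # neg
```

With so little training data the classifier simply keys on words it has seen under one label, which is why feature extraction and corpus size matter so much in practice.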
Wrapping Up
Now that you know how to use NLTK to process some of the included English corpora, we'll wrap up by covering: