Data Science Projects

The Data Incubator is an intensive 8-week data science training fellowship for academic researchers, with a 2% acceptance rate (out of 3000 applicants).

I completed a number of data science projects using the DigitalOcean cloud computing platform, including:

1. NYC Social Network Analysis – web scraping and network graph analysis

– Created a social network graph by extracting and parsing over 100,000 photo captions from photo albums on a New York socialite blog.
– Analyzed the structure of the social network using node degree, node PageRank, and the highest-weighted edges of the graph.
– You can find the code on GitHub.

Tools: lxml, BeautifulSoup, regular expressions, pandas, networkx
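
As a rough illustration of the caption-to-graph step in project 1, here is a minimal sketch using only the Python standard library. The captions, the `parse_names` helper, and its splitting rule are all hypothetical stand-ins; the real captions are messier and the original parsing code is not reproduced here.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical captions standing in for the scraped blog data.
captions = [
    "Alice Smith, Bob Jones, and Carol White",
    "Alice Smith and Bob Jones",
    "Carol White with Dan Brown",
]

def parse_names(caption):
    """Split a caption into names on commas, 'and', and 'with' (a naive rule)."""
    parts = re.split(r",|\band\b|\bwith\b", caption)
    return sorted({part.strip() for part in parts if part.strip()})

# Edge weight = number of captions in which two people appear together.
# Names are sorted, so each pair always has the same (a, b) orientation.
edge_weights = Counter()
for caption in captions:
    edge_weights.update(combinations(parse_names(caption), 2))

# Node degree = number of distinct people someone was photographed with.
neighbors = {}
for a, b in edge_weights:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)
degree = {name: len(others) for name, others in neighbors.items()}

print(edge_weights.most_common(1))
print(sorted(degree.items(), key=lambda item: -item[1]))
```

The same weighted edges can be loaded into a networkx `Graph` with `add_edge(a, b, weight=w)`, after which `networkx.pagerank` gives the PageRank ranking mentioned above.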

2. NYC restaurant inspection database analysis

– Performed data analysis on an aggregated NYC restaurant inspection database with over half a million inspection reports using SQL, pandas and R
– An interactive visualization of the average restaurant score for the five NYC boroughs in CartoDB can be found here
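
The per-borough average behind that visualization comes down to a single SQL aggregation. The sketch below uses Python's built-in `sqlite3` module with an in-memory table; the table name, column names, and rows are made up for illustration and do not reflect the actual inspection database schema.

```python
import sqlite3

# Hypothetical rows: (restaurant, borough, inspection score).
rows = [
    ("Cafe A", "Manhattan", 10),
    ("Cafe B", "Manhattan", 20),
    ("Diner C", "Brooklyn", 12),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inspections (restaurant TEXT, boro TEXT, score INTEGER)")
conn.executemany("INSERT INTO inspections VALUES (?, ?, ?)", rows)

# Average inspection score per borough.
query = "SELECT boro, AVG(score) FROM inspections GROUP BY boro"
avg_score = dict(conn.execute(query))
conn.close()

print(avg_score)
```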

3. NLP analysis on Yelp reviews

– Used natural language processing (NLP) to perform sentiment extraction from over 1 million Yelp reviews (1GB JSON file).

4. Yelp review predictions with scikit-learn

– Used machine learning models in scikit-learn to predict a new venue’s popularity from the metadata available when the venue opens, e.g., its location and the type of food served.

5. MapReduce in the Cloud

– Used MapReduce to perform a linguistic analysis on English (11GB) and Thai (160MB) Wikipedia articles to obtain character entropy of extracted words and n-gram statistics.
– You can find the code on GitHub.
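
To give a feel for the computation in project 5, here is a small single-machine sketch of the map and reduce steps. It takes "character entropy" to mean the Shannon entropy of the character distribution over the extracted words, which is one plausible reading; the `mapper`, `reducer`, and `entropy` names and the toy articles are illustrative, not taken from the project code.

```python
import math
from collections import Counter
from functools import reduce

def mapper(text):
    """Map step: count characters and word bigrams within one article."""
    words = text.lower().split()
    chars = Counter(c for word in words for c in word)
    bigrams = Counter(zip(words, words[1:]))
    return chars, bigrams

def reducer(left, right):
    """Reduce step: merge the partial counts from two articles."""
    return left[0] + right[0], left[1] + right[1]

def entropy(counts):
    """Shannon entropy in bits of a count distribution."""
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())

# Toy "articles"; the real input was the Wikipedia dumps.
articles = ["ab ab", "ba ab"]
chars, bigrams = reduce(reducer, map(mapper, articles))

print(entropy(chars))
print(bigrams.most_common())
```

Because the reducer only adds counters, the merge is associative, which is what lets a MapReduce framework combine partial results from many workers in any order.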

6. Time Series Analysis

– Developed a model to predict the temperature in major US cities using Fourier analysis of over 500,000 data points
– Developed classification models to recognize the genre of a musical piece, first from pre-computed features and then from the raw waveform.
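
For the temperature model in project 6, the idea behind a Fourier-based forecast can be sketched by fitting only the annual harmonic to a daily series. This is a simplified illustration on synthetic data, not the original model: the 365-day period, the function names, and the requirement that the series span a whole number of years are all assumptions.

```python
import math

PERIOD = 365  # days per year, ignoring leap years (an assumption)

def fit_annual_harmonic(temps):
    """Estimate the mean and first-harmonic Fourier coefficients.

    Assumes len(temps) is a whole number of periods, so the cosine and
    sine terms are orthogonal and can be estimated independently.
    """
    n = len(temps)
    mean = sum(temps) / n
    a = 2 / n * sum(y * math.cos(2 * math.pi * t / PERIOD) for t, y in enumerate(temps))
    b = 2 / n * sum(y * math.sin(2 * math.pi * t / PERIOD) for t, y in enumerate(temps))
    return mean, a, b

def predict(day, mean, a, b):
    """Extrapolate the fitted seasonal curve to any day index."""
    angle = 2 * math.pi * day / PERIOD
    return mean + a * math.cos(angle) + b * math.sin(angle)

# Synthetic two-year series: a mean of 15 degrees with a 10 degree seasonal swing.
temps = [15 + 10 * math.cos(2 * math.pi * t / PERIOD) for t in range(2 * PERIOD)]
mean, a, b = fit_annual_harmonic(temps)

print(round(predict(800, mean, a, b), 2))
```

A fuller model would add further harmonics (for example a daily cycle) and a trend term; the fitting step stays the same for each additional frequency.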