Learning Data Science

Proper data generation, collection, storage and analysis are expected, yet mostly untaught, skills in the undergraduate experience. One bottleneck to a field's growth is the handling and interpretation of the data it creates and gathers. While most, if not all, disciplines struggle with this lack of data stewardship, traditionally quantitative disciplines must also interpret and analyze data earlier in their undergraduate curricula.

The private sector's open challenge with data understanding, processing and interpretation impacts the educational realm. According to the McKinsey Report, there will be almost 200,000 positions requiring "deep analytical skills". The report further outlines the additional job responsibilities of nearly 1.5 million managers and analysts who will need to analyze and interpret data findings in order to make decisions. Although data science degree programs exist at the graduate level, society, both nationally and internationally, cannot wait for these advanced degrees to be awarded. Undergraduate programs, concentrations and certificates have emerged to address the workforce shortage. Since data are at the center of many disciplines and business operations, data science is often partnered with another field to provide sufficient and appropriate context. Cross-disciplinary themes are emerging: how data can translate into actionable knowledge, which data analysis techniques are useful, which programming languages and tools support ease of interpretation, and which methods for data cleaning, data summarization, statistical inference and regression modeling are commonly used.

Since Fall 2015, we have been developing the EvergreenLP framework, a student-coded, data-centric learning environment. Our framework allows the student coders to explore data science principles in the context of a current event or a topic trending on Twitter. EvergreenLP is constructed as a content management system to leverage the creative (front-end) and technical (back-end) aspects of software development. By using Tweepy, the student coders see how data flow over a network and are introduced to key data science fundamentals. We are analyzing Oscars 2016 data gathered using the following hashtags and keywords: #Oscars, #OscarsSoWhite, Chris Rock, @chrisrock. The limitations of MS Excel for ~150MB files quickly became apparent to the student coders. In Spring 2017, we are exploring the Python Data Analysis Library (pandas).
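
As an illustration of why pandas suits files of this size, the following is a minimal sketch, not the EvergreenLP code itself. It assumes the gathered tweets have been saved as a CSV file with a column named "text"; that file layout, the column name, the chunk size and the function name count_tracked_terms are all hypothetical. The file is read in chunks so that it never has to fit in memory at once, and the sketch counts how many tweets mention each tracked term.

```python
import io
import re
from collections import Counter

import pandas as pd

# Tracked terms taken from the text above; matching is case-insensitive.
TRACKED_TERMS = ["#Oscars", "#OscarsSoWhite", "Chris Rock", "@chrisrock"]


def count_tracked_terms(csv_source, text_column="text", chunksize=50_000):
    """Count the tweets mentioning each tracked term, reading the CSV in chunks.

    The column name and chunk size are assumptions for illustration only.
    """
    counts = Counter({term: 0 for term in TRACKED_TERMS})
    # A term must not be followed by another word character, so that
    # "#Oscars" is not also counted inside "#OscarsSoWhite".
    patterns = {
        term: re.compile(re.escape(term) + r"(?!\w)", re.IGNORECASE)
        for term in TRACKED_TERMS
    }
    # chunksize makes read_csv yield DataFrames of at most that many rows.
    for chunk in pd.read_csv(csv_source, usecols=[text_column], chunksize=chunksize):
        texts = chunk[text_column].fillna("").astype(str)
        for term, pattern in patterns.items():
            counts[term] += sum(1 for tweet in texts if pattern.search(tweet))
    return dict(counts)


if __name__ == "__main__":
    # Tiny made-up sample standing in for the real (much larger) file.
    sample = io.StringIO(
        "id,text\n"
        '1,"Watching the #Oscars tonight"\n'
        '2,"#OscarsSoWhite is trending"\n'
    )
    print(count_tracked_terms(sample, chunksize=1))
```

In practice the first argument would be a path to the saved tweet file rather than an in-memory sample. Reading in chunks trades a little speed for a bounded memory footprint, which is the main difference from opening the whole file in a spreadsheet.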