Google BigQuery for Education: Framework for Parsing and Analyzing edX MOOC Data
Glenn Lopez, Daniel T. Seaton, Andrew Ang, Dustin Tingley, and Isaac Chuang
The size and complexity of MOOC data have overwhelmed many institutions. The size of the datasets demands infrastructure capable of parsing millions of events each week, while their complexity demands that teams of researchers, instructors, and administrators with widely varying data science skills be able to analyze the parsed data. This paper details the functionality of edx2bigquery, an open source Python package developed by Harvard and MIT to ingest and report on hundreds of MITx and HarvardX course datasets from edX, using Google BigQuery to handle multiple terabytes of data and thousands of tables. Google BigQuery provides 1) ease of use in loading complex datasets, 2) near real-time interactive querying of all loaded data, and 3) through Google Cloud Platform, a flexible facility for research and reporting dashboards that visualize and aggregate data. These frameworks make it feasible for edx2bigquery to be open source, following standards that emphasize data products that transcend any particular data science platform and allow teams with diverse backgrounds to interact with data. edx2bigquery is now being adopted by other institutions, with an aim toward future collaboration.
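To make the parsing step concrete, the following is a minimal sketch of flattening newline-delimited JSON event records into flat rows that a tabular store could load, then aggregating them. It is not taken from edx2bigquery: the function names, the sample records, and the field names (time, username, event_type, context.course_id) are assumptions chosen for illustration, and the sketch uses only the Python standard library rather than any BigQuery client.

```python
import json
from collections import Counter


def flatten_event(line):
    """Parse one JSON log line into a flat row; return None if it is malformed.

    The field names used here are illustrative assumptions, not a documented schema.
    """
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None
    context = event.get("context")
    if not isinstance(context, dict):
        context = {}
    return {
        "time": event.get("time"),
        "username": event.get("username"),
        "event_type": event.get("event_type"),
        "course_id": context.get("course_id"),
    }


def count_events_by_course(lines):
    """Count events per (course_id, event_type), skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        row = flatten_event(line)
        if row is not None and row["course_id"]:
            counts[(row["course_id"], row["event_type"])] += 1
    return counts


# Hypothetical sample records, including one malformed line.
sample_lines = [
    '{"time": "2015-01-01T00:00:00", "username": "u1", '
    '"event_type": "play_video", "context": {"course_id": "DemoX/101/2015"}}',
    '{"time": "2015-01-01T00:01:00", "username": "u2", '
    '"event_type": "play_video", "context": {"course_id": "DemoX/101/2015"}}',
    'not json',
    '{"time": "2015-01-01T00:02:00", "username": "u1", '
    '"event_type": "problem_check", "context": {"course_id": "DemoX/101/2015"}}',
]

counts = count_events_by_course(sample_lines)
for (course_id, event_type), n in sorted(counts.items()):
    print(course_id, event_type, n)
```

Skipping malformed lines rather than raising is one possible design choice for weekly batch processing at this scale; a production pipeline would likely also record how many lines were rejected.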