Apache Pig: A MapReduce Alternative for Data Processing

Friday, September 15, 2017 - 1 min

In my final year of computer science study at Smith College, I took a seminar on parallel and distributed computing. For the course, I read a variety of scholarly and magazine articles on topics such as GPUs, cloud computing, and machine learning. My favorite one was on Cloud Spanner. The final project for the course required us to pick a technology and design a presentation, a mini-lab, and a research project around it. We wrote up our findings in LaTeX.

My partner, Angie Dinh, and I chose to do our final project on Apache Pig. We gave a presentation about Apache Pig and also led a mini-lab.

The research project we designed compares the run time of Apache Pig against Hadoop MapReduce (Java) on different Amazon EMR cluster sizes. Each program processed 1.06 GB of text, in the form of 1,697,533 movie reviews from Amazon, to produce word counts. The Pig script we used can be found here. For the full report on the project, you can read our final paper, “Apache Pig: A MapReduce Alternative for Data Processing”.
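
For readers who have not seen the word count task before, here is a minimal sketch in Python of the map, shuffle, and reduce steps that a word count job goes through. This is not the script from the project: the function names and the sample reviews are made up for illustration, and everything runs in memory on one machine rather than on a cluster.

```python
import re
from collections import defaultdict


def map_phase(review):
    """Emit a (word, 1) pair for every word in one review."""
    for word in re.findall(r"[a-z']+", review.lower()):
        yield (word, 1)


def shuffle_phase(pairs):
    """Group the emitted counts by word, as happens between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped counts to get one total per word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(reviews):
    """Run the three phases in order over a list of review strings."""
    pairs = (pair for review in reviews for pair in map_phase(review))
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical sample data, not taken from the Amazon review set.
sample_reviews = [
    "A great movie with a great cast",
    "Not a great movie",
]
counts = word_count(sample_reviews)
```

In Pig Latin, the same steps are usually written as a handful of statements using built-ins such as TOKENIZE, GROUP, and COUNT, which Pig then compiles into MapReduce jobs.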

If you prefer to download the full project file, you can do so by clicking here.

Samantha Louise Causey

Junior Rails dev @ Annkissam. BA in Computer Science from Smith College, 2017.