Over the course of my internship, I built a three-node Hadoop cluster and tested ETL with Hive, Spark SQL, and PySpark. My goals were to document the installation, test and review the technologies, and compare them to the current data warehousing solution.
All development now happens at https://github.com/cwensel/cascading. Cascading is a feature-rich API for defining and executing complex, fault-tolerant data processing workflows on various cluster computing platforms.