Due: Friday, November 12, 2021
For this assignment you will write a Python script that uses Hadoop on the Lemuria cluster to process the Twitter customer support database.
What we want to find out is how customer support activity varies depending on the time of day. Thus, we are only interested in messages being sent from customer support accounts (inbound = False) to avoid considering when the customers themselves are active. Since we are using Hadoop Map/Reduce, there are two programs to write: a mapper and a reducer.
The mapper, named homework-04_m.py, should create a "bucket" for each hour of the day, 24 buckets in all. For tweets coming from a customer support account, the tweet author (i.e., the name of the customer support account) should be placed into the proper bucket. No other details need to be saved.
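A mapper along these lines might look like the following sketch. It reads CSV records on standard input and emits tab-separated "hour, author" pairs, which is the usual convention for Hadoop Streaming. The column positions and the timestamp format ("Tue Oct 31 22:10:47 +0000 2017") are assumptions about the dataset's layout; verify them against the actual file before relying on them.

```python
import csv
import sys

# Assumed column positions in the CSV (check against the real file):
# 0: tweet_id, 1: author_id, 2: inbound, 3: created_at, 4: text, ...
AUTHOR, INBOUND, CREATED_AT = 1, 2, 3

def map_row(row):
    """Return an 'hour<TAB>author' pair for a support-account tweet, else None."""
    if len(row) <= CREATED_AT or row[INBOUND] != "False":
        return None
    # Assumed timestamp format: 'Tue Oct 31 22:10:47 +0000 2017'
    try:
        hour = row[CREATED_AT].split()[3].split(":")[0]
    except IndexError:
        return None
    return f"{hour}\t{row[AUTHOR]}"

if __name__ == "__main__":
    for row in csv.reader(sys.stdin):
        pair = map_row(row)
        if pair is not None:
            print(pair)
```

Using the csv module, rather than splitting on commas by hand, matters here because the tweet text column can itself contain commas inside quoted fields.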
The reducer, named homework-04_r.py, should reduce each bucket and produce the following output: the bucket identifier (the hour of the day), followed by a total count of the number of tweets in that bucket, followed by a list of customer support accounts that tweeted during that hour with individual counts for each account. For example:
10 1234,{'VirginTrains':978, 'comcastcares':222, 'AppleSupport':34}
As suggested by the output above, you can use a dictionary to store the association between a customer support account and the number of tweets made by that account. Simply printing the dictionary will produce essentially the format above (Python's dictionary output adds a space after each colon, which is fine).
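A reducer built on this idea might look like the sketch below. It relies on the fact that Hadoop Streaming delivers the mapper's output to the reducer sorted by key, so all pairs for a given hour arrive together; the function and variable names are illustrative, not required.

```python
import sys

def reduce_pairs(pairs):
    """Given (hour, author) pairs grouped by hour, yield one summary line per hour."""
    current_hour = None
    counts = {}
    for hour, author in pairs:
        if hour != current_hour:
            if current_hour is not None:
                # Emit: hour, total tweet count, then the per-account dictionary.
                yield f"{current_hour} {sum(counts.values())},{counts}"
            current_hour = hour
            counts = {}
        counts[author] = counts.get(author, 0) + 1
    if current_hour is not None:
        yield f"{current_hour} {sum(counts.values())},{counts}"

if __name__ == "__main__":
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for output_line in reduce_pairs(pairs):
        print(output_line)
```

Note that the counts dictionary is reset whenever the key changes; this "detect key change" pattern is the standard way to group records in a streaming reducer, since the reducer sees one long sorted stream rather than one call per key.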
Note that the times in the database are all in UTC (time zone of +0000). In theory, you could check the time zone and apply a correction to the hours if necessary. For example, a tweet shown as being created at 10:00:00 with a time zone of -0500 (EST) was really created at the UTC hour of 15. However, I don't believe there are any database records with a time zone other than UTC, so this correction is likely unnecessary.
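If you did want to apply that correction, a minimal helper might look like this (a sketch; the function name is hypothetical, and it deliberately ignores sub-hour offsets such as +0530 since the database is not expected to contain them):

```python
def to_utc_hour(local_hour, offset):
    """Convert a local hour plus an offset string like '-0500' to the UTC hour 0..23.

    Only the hour digits of the offset are used; sub-hour offsets are ignored.
    """
    sign = -1 if offset[0] == "-" else 1
    return (local_hour - sign * int(offset[1:3])) % 24
```

For example, a tweet at hour 10 with offset -0500 maps to UTC hour 15, matching the example in the text, while a +0000 offset leaves the hour unchanged.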
Proceed as follows:
Be sure you can log into Lemuria.
Download the Hadoop sample program that analyzes weather station data and verify that you can get it to work on Lemuria.
Modify the runPython.sh script from the weather station sample as needed. There is a small sample database file in HDFS:
/user/hadoop/Twitter-Customer-Support-sample.csv
You can use this small sample to exercise and debug your program. Once you are satisfied that it is working, try it on the main data file:
/user/hadoop/Twitter-Customer-Support.csv
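Before running on the cluster, you can exercise both programs locally by simulating the streaming pipeline with a shell pipe; sort stands in for Hadoop's shuffle phase. This is a sketch: it assumes you have pulled a local copy of the sample file out of HDFS and that python3 is the interpreter on Lemuria.

```shell
# Fetch a local copy of the sample data from HDFS (assumed path from above).
hdfs dfs -get /user/hadoop/Twitter-Customer-Support-sample.csv .

# Simulate the MapReduce pipeline: map, shuffle (sort), reduce.
cat Twitter-Customer-Support-sample.csv \
    | python3 homework-04_m.py \
    | sort \
    | python3 homework-04_r.py
```

If the local pipeline produces sensible output, the same mapper and reducer should behave the same way when launched through your modified runPython.sh.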
Using the weather station sample as a template, write your mapper and reducer programs.
Create a zip archive of your two programs homework-04_m.py and homework-04_r.py, and submit your archive to Canvas.
Last Revised: 2025-01-09
© Copyright 2025 by Peter Chapin <peter.chapin@vermontstate.edu>