PROGRAM

PROGRAM #5

Fall 2012

Due: November 28, 2012, 11:59pm on Angel Dropbox

General Guidelines:

· All source code must be in C++. You can use Windows or Unix environments at your discretion.

· Each program project should be accompanied by a COVER SHEET (see details below). Assignments without cover page will NOT be graded.

· For this programming assignment, you have the option of either work as a TEAM if you prefer, or as an individual. A TEAM should contain at most 2 students including yourself. Grading will *not* differentiate between the two - i.e., if you are working as a team of two people, then the same grade will be applied to both of you.

· If working as a team, submit only one copy of your assignment into the Drop box and indicate in the text box of the drop-box while submission, both names in the project. If you specify only one name then only that person will get the grade and the other will not. So please make sure you fill in both names.

· If working as a team, remember to include both your last names in the zip file of your submission.

· In the rest of this document, the term “YOU” is used either to refer to you (if you are working individually) or your team (if you work as a team).

· The final material for submission should be entirely written by you. If you decide to consult with others or refer materials online, you MUST give due credits to these sources (people, books, webpages, etc.) by listing them on the cover sheet mentioned below. Note that no points will be deducted for referencing these sources. However, your discussion/consultation should be limited to the initial design level. Sharing or even showing your source code/assignment verbiage to anyone else in the class, or direct reproduction of source code/verbiage from online resources, will all be considered plagiarism, and therefore will be awarded ZERO points and subject to the WSU Academic Dishonesty policy. (Reproducing from the Weiss textbook is an exception to this rule, and such reproduction is encouraged wherever possible.)

· Grading will be based on correctness, coding style, implementation efficiency, exception handling, source code documentation, and the written report. Try to also follow the good coding practices discussed in class during the C++ review lecture.

How to submit using Angel?
All assignments should be submitted to the PA3 Dropbox within Angel.
1) Once you login to Angel, click on the "Lessons" tab.

2) You will see "Programming Assignment #5".
    Click on that and you will see the submission website.
    Under Title, just write your first and last name and say PA 5 - e.g., "Ananth Kalyanaraman, PA5".
    If you are submitting as a team, only one of you should submit, and put both your names on the Title:
                          e.g., "Firstname1 Lastname1 & Firstname2 Lastname2, PA5"
    Leave the message textbox empty, unless you want to add a special comment to us.
    Then click on the "Attachments" button and attach your ZIP file there.     And then press the SUBMIT button
EMAIL METHOD (backup only):
   Only if dropbox doesn't work for some reason, then you can submit by email to "All course faculty" on Angel. Same time deadlines apply.
Late submission policy: A late penalty of 10% will be assessed for late submissions within the next 24-hour.
Any submission sent by any other means than specified above will be discarded and not graded.

PROBLEM Hashing and Hash tables:

For this assignment you will be comparing the performance of different hash functions and different collision resolution implementations. More specifically, you will be testing hash table implementations to store and retrieve a set of strings. The input is a set of n strings, and for all practical purposes, you can assume that the length of each string can be bound by a small constant. You need to build and maintain a searchable hash table for this input. Obviously there are many ways you can design the hash table, depending on what hash function you pick, and what collision resolution strategy you choose.

Aim 1. Comparing different collision resolution strategies keeping the hash function fixed

For this aim, you will be implementing and experimentally comparing three collision resolution mechanisms: chaining, linear probing, and quadratic probing. For doing this, you don't have to write the code from scratch. The source code for chaining and quadratic probing are already downloadable from the Weiss book's webpage: http://users.cis.fiu.edu/~weiss/dsaa_c++/Code/. tables. For your convenience, I have created zipped archive of the main relevant files in this zip file (note: this code may *not* be STL compliant).

In fact I modified the Weiss source code to be STL compliant. Here is the ZIP file. If you use this version, you don't need the prior zip file.
Disclaimer: I spent only about 30 minutes working on this modification and so haven't had a chance to test it thoroughly. All I can guarantee is that the code compiles and runs the test driver code (TestHashTable.cpp) successfully on my Linux box. So what I suggest you do is eye-ball every line of this code (its not very long!) and make sure it looks correct before you modify it for your assignment. You might even want to think of adding your own lines of code to the test driver code for some additional testing. In the process if you notice any issues with the code, let me know so that I can fix it.

Notice that you need to implement the code to do linear probing. This should be a very simple modification of the quadratic probing code (more specifically, to the function that finds the next position of probe). Call the resulting source files for linear probing as LinearProbing.h and LinearProbing.cpp.

You will also notice that the hash function for converting a string key into a hash table index used by all these three implementations is the same, and is defined "int hash( const string & key, int tableSize )" inside the three .cpp files. For this aim, we will keep this hash function as-is, and just experiment by varying the collision resolution implementations, viz. chaining, linear probing and quadratic probing.

You will also notice that the quadratic probing code initializes the hash tablesize to a default size of 101, and then does a rehash anytime the number of elements in the hash table reaches half of the hash tablesize. And rehashing doubles the tablesize. Leave these things as they are and keep the same setup for linear probing as well.

You will also notice that the chaining implementation also initializes the hash tablesize to a default size of 101. However it does NOT do rehashing. Again, leave this as is.

Experimental plan:

Input files: There are two input files provided for your testing and performance measurements.

("Data file") The first file is OHenry.txt. This is a file that contains exactly one word per line. The "word's" definition includes all alphanumeric and special characters except the whitespace character. (This text file is actually a set of O Henry's short stories that I downloaded and transformed it into this single word per line format so that it is easy for you to read.) There are a total of 10,.377 words in this file and represents the input strings to be inserted into the hash table. The number of input strings is denoted by n (i.e., n=10377 in this assignment).
("Query file") The second file is queries.txt. This file contains a list of 1,500 words which should be used as "queries" to search for in the hash table. The number of queries is denoted by m (i.e., m=1500 in this assignment).

For the sake of simplicity of your code, we will assume that the names of these two input files won't change and that these two input files are always made available in the current working directory of execution. Of course you need to ensure this when you run and test the code.

Write a test driver function, call it TestAim1.cpp, whose main() function performs the following sequence of operations in order:

Open the input data file (OHenry.txt) and load the contents into a vector of strings. Let us call this vector object "DataArray".
Open the input query file (queries.txt) and load the contents into another vector of strings. Let us call this vector object "QueryArray".
Instantiate all of the above three hash table (HT) implementations. Let us denote the corresponding objects as "ChainingHT", "LinearProbingHT" and "QuadraticProbingHT".
(Analysis of Chaining)
Call a function "InsertIntoChainingHT(DataArray)". This function should insert each word in DataArray, one by one, into ChainingHT. Of course, recall that no duplicates are allowed in the hash table.
Initialize a separate timer called InsertionTimerChainingHT and record the time taken to do all the insertions into this variable. Basically, this timer should contain the sum of the time taken to do all n inserts. From this you can calculate the average time per insertion. Make sure you include only the time to do the insert, which means only the time the code spends inside the insert() function call. The way to do this would be set a timer t1 at entrance of the insert() function and set a timer t2 just before the exit of the insert() function (doesn't matter whether insertion was successful or not), and then add [ t2-t1 ] to the global timer variable InsertionTimerChainingHT.
Also, initialize and keep track of a variable called CollisionsChainingHT to count the total number of collisions across all inserts. Note that this is same as counting the total number of times you are inside the while loop within function findPos() across all insertions.
Next, we will do searches and time the searches. To do this, call a function "SearchChainingHT(QueryArray)" which does a find for every query in the list of queries. For analyzing this, just keep track a global timer variable called SearchTimerChainingHT which should contain the total time to do searches for all the m queries. From this you can calculate the average time per search.
(Analysis of Linear Probing and Quadratic Probing)
Apply the same procedure as described above to analyze the linear probing and quadratic probing implementations.

At the end of this experiment, you should generate the following table:

Collision strategy	Insert			Search
Collision strategy	Total time	Avg. time	Total number of collisions	Total time	Avg. time
Chaining
Linear Probing
Quadratic Probing

Make sure you specify the unit for the run-times - i.e., hours or minutes or seconds or milliseconds or microseconds, as appropriate.

Aim 2. Comparing different hash functions keeping the collision strategy fixed

    For this aim, you will compare three different string hashing functions (given in the book) keeping the collision strategy fixed at Quadratic Probing. The three hash functions to use are all given in the book, Figure 5.2, Figure 5.3 and Figure 5.4. Note that Figure 5.4 is same as the default hash function used in the source code for Aim 1. For the purpose of analysis, let us refer to these three hash functions as follows:
            1) Figure 5.2: "Simple hash function"
            2) Figure 5.3:    "Prefix hash function"
            3) Figure 5.4:    "Full-length hash function"

For comparing these three hash functions, first add the other two hash functions into the source code for Quadratic Probing (obviously using different names to refer to those functions). Then, repeat the same procedure that you used to analyze Quadratic Probing method in Aim 1, with the only exception that now you should keep track of separate timer variables and collision variables for the three different hash functions. As a result of this experiment, you should generate the following table:

Hash function	Insert			Search
Hash function	Total time	Avg. time	Total number of collisions	Total time	Avg. time
Simple
Prefix
Full-length

Note that you don't have to repeat the experiment for the row corresponding to the Full-length. The results should already be available from the corresponding entry in Aim 1.

REPORT:

In a separate document (Word or PDF), not more than 2 pages, compile the following sections:

· A: Problem statement. In 1-2 sentences state the goal(s) of this exercise.

B: Experimental setup.
        - Specify the machine architecture (CPU, clock speed, RAM) where all the testing was conducted.
        - Mention whether you used Windows or Unix or Mac OS X for your testing.
        - How many experiments were performed and averaged, to determine each point in your plot?

· C: Experimental Results: In this section, include the following:

o The two tables as described above.

o Are the observations made in the above plots as per your theoretical expectations? Justify.

FINAL CHECKLIST FOR SUBMISSION:

___ Cover sheet

___ A separate folder containing all your source code (including the main function you used for testing)

___ Report

___ Both the above zipped into another folder archive called Program5<YourLastName(s)>.zip

GRADING RUBRICS:

This assignment is mainly about empirical testing and analysis. There is some design component but not much. So for grading this PA, we will primarily look at how well you have designed experiments, what the plots look like, and have you offered satisfactory/convincing rationale to explain your observations. Therefore the points during grading will be distributed as follows:

The whole assignment is worth a total of 100 points.

----------------------------------------

CODING (45 pts):

(10 pts): Are all versions of the hash tables and functions implemnted correctly?

(10 pts): Is the code implemented in an efficient way? i.e., are there parts in the code that appear redundant, or implemented in ways that can be easily improved? Does the code conform to good coding practices of Objected Oriented programming?

(15 pts): Does the code compile and run successfully on a couple of test cases?

(10 pts): Is the code documented well and generally easy to read (with helpful comments and pointers)?

REPORT (55 pts):

(15 pts): Is the experimental plan technically sound? Is the experimental setup specified clearly?

(10 pts): Are results shown as plots in a well annotated manner and are general trends in the results visible?

(30 pts): Are the justifications/reasons provided to explain the observations analytically sound? Is there a reasonable attempt to explain anomalies (i.e., results that go against analytical expectations), if any?

----------------------------------------

Obviously to come up with the above evaluation, the TAs are going to both read and run your code.