PROGRAM

PROGRAM #5

Fall 2011

Due: December 5, 2011, 11:59pm on Angel

General Guidelines:

All source code must be in C++. You can use Windows or Unix environments at your discretion.
Each program project should be accompanied by a COVER SHEET (see details below). Assignments without cover page will NOT be graded.
For this programming assignment, you have the option of either work as a TEAM if you prefer, or as an individual. A TEAM should contain at most 2 students including yourself.
In the rest of this document, the term “YOU” is used either to refer to you (if you are working individually) or your team (if you work as a team).
The final material for submission should be entirely written by you. If you decide to consult with others or refer materials online, you MUST give due credits to these sources (people, books, webpages, etc.) by listing them on the cover sheet mentioned below. Note that no points will be deducted for referencing these sources. However, your discussion/consultation should be limited to the initial design level. Sharing or even showing your source code/assignment verbiage to anyone else in the class, or direct reproduction of source code/verbiage from online resources, will all be considered plagiarism, and therefore will be awarded ZERO points and subject to the WSU Academic Dishonesty policy. (Reproducing from the Weiss textbook is an exception to this rule, and such reproduction is encouraged wherever possible.)
Grading will be based on correctness, coding style, implementation efficiency, exception handling, source code documentation, and the written report. Try to also follow the good coding practices discussed in class during the C++ review lecture.
If working as a team, the same grade will be given to both members of a team.
If working as a team, submit only one copy of your assignment into the Drop box and indicate in the text box of the drop-box while submission, both names in the project. If you specify only one name then only that person will get the grade and the other will not. So please make sure you fill in both names.
If working as a team, remember to include both your last names in the zip file of your submission.
Angel DropBox submission:
We will use DropBox.
All assignments with cover sheet should be zipped or tarred into one archive folder (named after your last name).
Instructions for DropBox submission: I have set up a have set up a new DropBox called “Program project 5” inside Angel. The way to submit to the dropbox is as follows:

1) Once you login to Angel, click on the "Lessons" tab.

2) You will see "Program project 5".
Click on that and you will see the submission website.
Under Title, just write your first and last name and say Program 5 - e.g., "Ananth Kalyanaraman, Program 5".
If you are submitting as a team, type in both your names in the textbox. Otherwise leave it empty.
Then click on the "Attachments" button and attach your ZIP file there.
And then press the SUBMIT button.

PROBLEM Hashing and Hash tables:

For this assignment you will be comparing the performance of different hash functions and different collision resolution implementations. More specifically, you will be testing hash table implementations to store and retrieve a set of strings. The input is a set of n strings, and for all practical purposes, you can assume that the length of each string can be bound by a small constant. You need to build and maintain a searchable hash table for this input. Obviously there are many ways you can design the hash table, depending on what hash function you pick, and what collision resolution strategy you choose.

Aim 1. Comparing different collision resolution strategies keeping the hash function fixed

For this aim, you will be implementing and experimentally comparing three collision resolution mechanisms: chaining, linear probing, and quadratic probing. For doing this, you don't have to write the code from scratch. The source code for chaining and quadratic probing are already downloadable from the Weiss book's webpage: http://users.cis.fiu.edu/~weiss/dsaa_c++/Code/. tables. For your convenience, I have created zipped archive of the main relevant files in this zip file. Starting with this code as a template, you need to do the following:

You need to implement the code to do linear probing. This should be a very simple modification of the quadratic probing code (more specifically, to the function that finds the next position of probe). Call the resulting source files for linear probing as LinearProbing.h and LinearProbing.cpp.

You will notice that the SeparateChaining.h and QuadraticProbing.h files include custom written header files for vector, string and linked list instead of STL. The author has done this assuming your system doesn't support STL. And their API is largely similar to, but also slightly different from the STL's own API for vector, string, and list. So you have two options. You can either stick to this version of string, vector, and linked list, which means you need to also include those source files (downloaded and made available in this zip file for your convenience) as part of your project.
Disclaimer: Note that I have not had the chance to compile or test this code. So if you use this version then it is your responsibility to get it working.

Alternatively, if you want to use STL, then that's okay too but for that you will have to include the corresponding STL header files instead of these custom files, and also go through the entire SeparateChaining.cpp and QuadraticProbing.cpp source codes and make them compliant with STL's API. I leave the choice to you.

Postscript: After posting the assignment I modified the Weiss source code to be STL compliant. Here is the ZIP file. If you use this version, you don't need any of the above zip files.
Disclaimer: I spent only about 30 minutes working on this modification and so haven't had a chance to test it thoroughly. All I can guarantee is that the code compiles and runs the test driver code (TestHashTable.cpp) successfully on my Linux box. So what I suggest you do is eye-ball every line of this code (its not very long!) and make sure it looks correct before you modify it for your assignment. You might even want to think of adding your own lines of code to the test driver code for some additional testing. In the process if you notice any issues with the code, let me know so that I can fix it.

You will notice that the hash function for converting a string key into a hash table index used by all these three implementations is the same, and is defined "int hash( const string & key, int tableSize )" inside the three .cpp files. For this aim, we will keep this hash function as-is, and just experiment by varying the collision resolution implementations, viz. chaining, linear probing and quadratic probing.

You will also notice that the quadratic probing code initializes the hash tablesize to a default size of 101, and then does a rehash anytime the number of elements in the hash table reaches half of the hash tablesize. And rehashing doubles the tablesize. Leave these things as they are and keep the same setup for linear probing as well.

You will also notice that the chaining implementation also initializes the hash tablesize to a default size of 101. However it does NOT do rehashing. Again, leave this as is.

Experimental plan:

Input files: There are two input files provided for your testing and performance measurements.

("Data file") The first file is OHenry.txt. This is a file that contains exactly one word per line. The "word's" definition includes all alphanumeric and special characters except the whitespace character. (This text file is actually a set of O Henry's short stories that I downloaded and transformed it into this single word per line format so that it is easy for you to read.) There are a total of 10,.377 words in this file and represents the input strings to be inserted into the hash table. The number of input strings is denoted by n (i.e., n=10377 in this assignment).

("Query file") The second file is queries.txt. This file contains a list of 1,500 words which should be used as "queries" to search for in the hash table. The number of queries is denoted by m (i.e., m=1500 in this assignment).

For the sake of simplicity of your code, we will assume that the names of these two input files won't change and that these two input files are always made available in the current working directory of execution. Of course you need to ensure this when you run and test the code.

Write a test driver function, call it TestAim1.cpp, whose main() function performs the following sequence of operations in order:

Open the input data file (OHenry.txt) and load the contents into a vector of strings. Let us call this vector object "DataArray".

Open the input query file (queries.txt) and load the contents into another vector of strings. Let us call this vector object "QueryArray".

Instantiate all of the above three hash table (HT) implementations. Let us denote the corresponding objects as "ChainingHT", "LinearProbingHT" and "QuadraticProbingHT".

(Analysis of Chaining)
Call a function "InsertIntoChainingHT(DataArray)". This function should insert each word in DataArray, one by one, into ChainingHT. Of course, recall that no duplicates are allowed in the hash table.
Initialize a separate timer called InsertionTimerChainingHT and record the time taken to do all the insertions into this variable. Basically, this timer should contain the sum of the time taken to do all n inserts. From this you can calculate the average time per insertion. Make sure you include only the time to do the insert, which means only the time the code spends inside the insert() function call. The way to do this would be set a timer t1 at entrance of the insert() function and set a timer t2 just before the exit of the insert() function (doesn't matter whether insertion was successful or not), and then add [ t2-t1 ] to the global timer variable InsertionTimerChainingHT.
Also, initialize and keep track of a variable called CollisionsChainingHT to count the total number of collisions across all inserts. Note that this is same as counting the total number of times you are inside the while loop within function findPos() across all insertions.
Next, we will do searches and time the searches. To do this, call a function "SearchChainingHT(QueryArray)" which does a find for every query in the list of queries. For analyzing this, just keep track a global timer variable called SearchTimerChainingHT which should contain the total time to do searches for all the m queries. From this you can calculate the average time per search.

(Analysis of Linear Probing and Quadratic Probing)
Apply the same procedure as described above to analyze the linear probing and quadratic probing implementations.

At the end of this experiment, you should generate the following table:

Collision strategy Insert Search

Total time Avg. time Total number of collisions Total time Avg. time

Chaining

Linear Probing

Quadratic Probing

Make sure you specify the unit for the run-times - i.e., hours or minutes or seconds or milliseconds or microseconds, as appropriate.

Aim 2. Comparing different hash functions keeping the collision strategy fixed

    For this aim, you will compare three different string hashing functions (given in the book) keeping the collision strategy fixed at Quadratic Probing. The three hash functions to use are all given in the book, Figure 5.2, Figure 5.3 and Figure 5.4. Note that Figure 5.4 is same as the default hash function used in the source code for Aim 1. For the purpose of analysis, let us refer to these three hash functions as follows:
            1) Figure 5.2: "Simple hash function"
            2) Figure 5.3:    "Prefix hash function"
            3) Figure 5.4:    "Full-length hash function"

For comparing these three hash functions, first add the other two hash functions into the source code for Quadratic Probing (obviously using different names to refer to those functions). Then, repeat the same procedure that you used to analyze Quadratic Probing method in Aim 1, with the only exception that now you should keep track of separate timer variables and collision variables for the three different hash functions. As a result of this experiment, you should generate the following table:

Hash function	Insert			Search
Hash function	Total time	Avg. time	Total number of collisions	Total time	Avg. time
Simple
Prefix
Full-length

Note that you don't have to repeat the experiment for the row corresponding to the Full-length. The results should already be available from the corresponding entry in Aim 1.

REPORT:

In a separate document (Word or PDF), not more than 2 pages, compile the following sections:

· A: Problem statement. In 1-2 sentences state the goal(s) of this exercise.

B: Experimental setup.
        - Specify the machine architecture (CPU, clock speed, RAM) where all the testing was conducted.
        - Mention whether you used Windows or Unix or Mac OS X for your testing.
        - How many experiments were performed and averaged, to determine each point in your plot?

· C: Experimental Results: In this section, include the following:

o The two tables as described above.

o Are the observations made in the above plots as per your theoretical expectations? Justify.

FINAL CHECKLIST FOR SUBMISSION:

___ Cover sheet

___ A separate folder containing all your source code (including the main function you used for testing)

___ Report

___ Both the above zipped into another folder archive called Program5<YourLastName(s)>.zip

Collision strategy	Insert			Search
Collision strategy	Total time	Avg. time	Total number of collisions	Total time	Avg. time
Chaining
Linear Probing
Quadratic Probing