PROGRAM #5

Fall 2012

Due: November 28, 2012, 11:59pm on Angel Dropbox


General Guidelines:

·         All source code must be in C++. You can use Windows or Unix environments at your discretion.

·         Each programming project should be accompanied by a COVER SHEET (see details below). Assignments without a cover sheet will NOT be graded.

·         For this programming assignment, you have the option of working either as a TEAM or as an individual, whichever you prefer. A TEAM should contain at most 2 students, including yourself. Grading will *not* differentiate between the two - i.e., if you are working as a team of two people, then the same grade will be applied to both of you.

·         If working as a team, submit only one copy of your assignment to the drop box, and list both names in the drop box's text box when you submit. If you specify only one name, then only that person will get the grade and the other will not. So please make sure you fill in both names.

·         If working as a team, remember to include both your last names in the zip file of your submission.

·         In the rest of this document, the term “YOU” is used either to refer to you (if you are working individually) or your team (if you work as a team).

·         The final material for submission should be entirely written by you. If you decide to consult with others or refer to materials online, you MUST give due credit to these sources (people, books, webpages, etc.) by listing them on the cover sheet mentioned below. Note that no points will be deducted for referencing these sources. However, your discussion/consultation should be limited to the initial design level. Sharing or even showing your source code/assignment verbiage to anyone else in the class, or direct reproduction of source code/verbiage from online resources, will all be considered plagiarism, and therefore will be awarded ZERO points and subject to the WSU Academic Dishonesty policy. (Reproducing from the Weiss textbook is an exception to this rule, and such reproduction is encouraged wherever possible.)

·         Grading will be based on correctness, coding style, implementation efficiency, exception handling, source code documentation, and the written report. Try to also follow the good coding practices discussed in class during the C++ review lecture.

 


PROBLEM    Hashing and Hash tables:

For this assignment you will be comparing the performance of different hash functions and different collision resolution implementations. More specifically, you will be testing hash table implementations that store and retrieve a set of strings. The input is a set of n strings, and for all practical purposes, you can assume that the length of each string is bounded by a small constant. You need to build and maintain a searchable hash table for this input. Obviously there are many ways you can design the hash table, depending on which hash function you pick and which collision resolution strategy you choose.

Aim 1.    Comparing different collision resolution strategies keeping the hash function fixed

For this aim, you will be implementing and experimentally comparing three collision resolution mechanisms: chaining, linear probing, and quadratic probing. You don't have to write the code from scratch: the source code for chaining and quadratic probing is already downloadable from the Weiss book's webpage: http://users.cis.fiu.edu/~weiss/dsaa_c++/Code/. For your convenience, I have created a zipped archive of the main relevant files in this zip file (note: this code may *not* be STL compliant).

In fact I modified the Weiss source code to be STL compliant. Here is the ZIP file. If you use this version, you don't need the prior zip file.
Disclaimer: I spent only about 30 minutes working on this modification, so I haven't had a chance to test it thoroughly. All I can guarantee is that the code compiles and runs the test driver code (TestHashTable.cpp) successfully on my Linux box. So what I suggest you do is eyeball every line of this code (it's not very long!) and make sure it looks correct before you modify it for your assignment. You might even want to add your own lines of code to the test driver for some additional testing. If, in the process, you notice any issues with the code, let me know so that I can fix it.

 
Notice that you need to implement the code for linear probing yourself. This should be a very simple modification of the quadratic probing code (more specifically, of the function that finds the next probe position). Name the resulting source files for linear probing LinearProbing.h and LinearProbing.cpp.
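For reference, a minimal sketch of what the modified probe loop might look like is shown below. It is adapted from the quadratic-probing findPos() in the Weiss code as I recall it; the member names (array, info, element, myhash) are assumptions about that code, so verify them against the files you actually download. The collisions counter shown here is the one Aim 1 asks you to track:

    // Sketch only: linear-probing variant of findPos(), adapted from the
    // quadratic-probing version in the Weiss code. The only change is
    // that the probe step stays at 1 instead of growing by 2 each time.
    int findPos( const string & x ) const
    {
        int currentPos = myhash( x );

        while( array[ currentPos ].info != EMPTY &&
               array[ currentPos ].element != x )
        {
            ++collisions;       // Aim 1: one collision per loop iteration
                                // (declare collisions as a mutable member,
                                //  since findPos is const)
            ++currentPos;       // linear probe: just move to the next slot
            if( currentPos >= (int) array.size( ) )
                currentPos -= array.size( );
        }
        return currentPos;
    }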

You will also notice that the hash function used by all three of these implementations to convert a string key into a hash table index is the same; it is defined as "int hash( const string & key, int tableSize )" inside the three .cpp files. For this aim, we will keep this hash function as-is, and experiment only by varying the collision resolution implementation, viz. chaining, linear probing, and quadratic probing.
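For reference, that shared function has roughly the following shape (reproduced from memory of the Weiss code and the book's Figure 5.4; copy the exact version from the .cpp files rather than from here):

    // Full-length hash: folds every character of the key into the hash
    // value using the multiplier 37, then reduces modulo the table size.
    int hash( const string & key, int tableSize )
    {
        int hashVal = 0;

        for( size_t i = 0; i < key.length( ); i++ )
            hashVal = 37 * hashVal + key[ i ];

        hashVal %= tableSize;
        if( hashVal < 0 )        // overflow can make hashVal negative
            hashVal += tableSize;

        return hashVal;
    }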

You will also notice that the quadratic probing code initializes the hash table size to a default of 101, and then rehashes anytime the number of elements in the hash table reaches half the table size; rehashing doubles the table size. Leave these settings as they are, and keep the same setup for linear probing as well.
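In the Weiss probing code, this policy corresponds to a check of roughly the following shape inside insert() (a sketch; the exact member names may differ in your copy):

    // Inside insert( ), after the element is placed:
    if( ++currentSize > (int) array.size( ) / 2 )
        rehash( );   // grows the table to roughly twice its size and
                     // reinserts every active element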

You will also notice that the chaining implementation likewise initializes the hash table size to a default of 101. However, it does NOT do rehashing. Again, leave this as is.

Experimental plan:

Input files: There are two input files provided for your testing and performance measurements.

For the sake of simplicity of your code, we will assume that the names of these two input files won't change and that these two input files are always made available in the current working directory of execution. Of course you need to ensure this when you run and test the code.

Write a test driver function, call it TestAim1.cpp, whose main() function performs the following sequence of operations in order:

  1. Open the input data file (OHenry.txt) and load its contents into a vector of strings. Let us call this vector object "DataArray". (A sketch for loading both files is given after this list.)
  2. Open the input query file (queries.txt) and load the contents into another vector of strings. Let us call this vector object "QueryArray".
  3. Instantiate all three of the above hash table (HT) implementations. Let us denote the corresponding objects as "ChainingHT", "LinearProbingHT" and "QuadraticProbingHT".
  4. (Analysis of Chaining)
    Call a function "InsertIntoChainingHT(DataArray)". This function should insert each word in DataArray, one by one, into ChainingHT. Of course, recall that no duplicates are allowed in the hash table.
    Initialize a separate timer called InsertionTimerChainingHT and record the time taken to do all the insertions into this variable. Basically, this timer should contain the sum of the time taken to do all n inserts; from this you can calculate the average time per insertion. Make sure you include only the time to do the insert, which means only the time the code spends inside the insert() function call. The way to do this is to record a timestamp t1 at the entrance of the insert() function and a timestamp t2 just before the exit of the insert() function (whether or not the insertion was successful), and then add (t2 - t1) to the global timer variable InsertionTimerChainingHT. (A timing sketch is given after this list.)
    Also, initialize and keep track of a variable called CollisionsChainingHT to count the total number of collisions across all inserts. Note that this is the same as counting the total number of times you are inside the while loop within the function findPos(), across all insertions.
    Next, we will do searches and time them. To do this, call a function "SearchChainingHT(QueryArray)" which does a find for every query in the list of queries. For analyzing this, just keep track of a global timer variable called SearchTimerChainingHT, which should contain the total time to do searches for all m queries. From this you can calculate the average time per search.
  5. (Analysis of Linear Probing and Quadratic Probing)
    Apply the same procedure as described above to analyze the linear probing and quadratic probing implementations.
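The loading in steps 1 and 2 could be done along the following lines. This is a sketch, not required code: the helper name loadWords is made up here, and reading whitespace-delimited tokens is an assumption about the input format (decide for yourself whether to strip punctuation or fold case, and say so in your report):

    #include <fstream>
    #include <stdexcept>
    #include <string>
    #include <vector>
    using namespace std;

    // Read every whitespace-separated token from fileName into a vector.
    vector<string> loadWords( const string & fileName )
    {
        vector<string> words;
        ifstream in( fileName.c_str( ) );
        if( !in )
            throw runtime_error( "cannot open " + fileName );

        string word;
        while( in >> word )
            words.push_back( word );
        return words;
    }

    // In TestAim1.cpp's main( ):
    //   vector<string> DataArray  = loadWords( "OHenry.txt" );
    //   vector<string> QueryArray = loadWords( "queries.txt" );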

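For the timing in step 4, one minimal sketch is shown below. It uses clock() from <ctime> for portability; if your compiler supports C++11, <chrono> gives finer resolution. The step above asks for the timestamps inside insert() itself; wrapping the call as shown here measures essentially the same span, so pick one placement and state it in your report. "HashTable<string>" is an assumption about the class name in the Weiss chaining code; adjust the type, header, and constructor to whatever your copy declares:

    #include <ctime>
    #include <string>
    #include <vector>
    // #include "SeparateChaining.h"  // or whatever your chaining header is called
    using namespace std;

    HashTable<string> ChainingHT;           // the object from step 3
    double InsertionTimerChainingHT = 0.0;  // total seconds over all inserts

    void InsertIntoChainingHT( const vector<string> & DataArray )
    {
        for( size_t i = 0; i < DataArray.size( ); i++ )
        {
            clock_t t1 = clock( );                 // entry timestamp
            ChainingHT.insert( DataArray[ i ] );   // duplicates are no-ops
            clock_t t2 = clock( );                 // exit timestamp
            InsertionTimerChainingHT += double( t2 - t1 ) / CLOCKS_PER_SEC;
        }
        // Avg. time per insert = InsertionTimerChainingHT / DataArray.size( )
    }

SearchChainingHT(QueryArray) follows the same pattern, wrapping find() and accumulating into SearchTimerChainingHT.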
At the end of this experiment, you should generate the following table:

                    |                   Insert                   |        Search
Collision strategy  | Total time | Avg. time | Total # collisions | Total time | Avg. time
--------------------+------------+-----------+--------------------+------------+----------
Chaining            |            |           |                    |            |
Linear Probing      |            |           |                    |            |
Quadratic Probing   |            |           |                    |            |

Make sure you specify the unit for the run-times - i.e., hours or minutes or seconds or milliseconds or microseconds, as appropriate.

 

Aim 2.    Comparing different hash functions keeping the collision strategy fixed

    For this aim, you will compare three different string hashing functions (given in the book), keeping the collision strategy fixed at Quadratic Probing. The three hash functions to use are all given in the book, in Figures 5.2, 5.3, and 5.4. Note that Figure 5.4 is the same as the default hash function used in the source code for Aim 1. For the purpose of analysis, let us refer to these three hash functions as follows:
            1) Figure 5.2:  "Simple hash function"
            2) Figure 5.3:  "Prefix hash function"
            3) Figure 5.4:  "Full-length hash function"
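For reference, the two functions you are adding have roughly the following shapes (reproduced from memory of the textbook figures; copy the exact versions from the book, and give them names of your own, e.g. simpleHash and prefixHash as used here, since "hash" is already taken):

    // Figure 5.2 style "simple" hash: just sums the characters.
    int simpleHash( const string & key, int tableSize )
    {
        int hashVal = 0;
        for( size_t i = 0; i < key.length( ); i++ )
            hashVal += key[ i ];
        return hashVal % tableSize;
    }

    // Figure 5.3 style "prefix" hash: uses only the first three
    // characters. The book's version assumes the key has length >= 3;
    // the guards below keep shorter words from indexing past the string.
    int prefixHash( const string & key, int tableSize )
    {
        int c0 = key.length( ) > 0 ? key[ 0 ] : 0;
        int c1 = key.length( ) > 1 ? key[ 1 ] : 0;
        int c2 = key.length( ) > 2 ? key[ 2 ] : 0;
        return ( c0 + 27 * c1 + 729 * c2 ) % tableSize;
    }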

      To compare these three hash functions, first add the other two hash functions to the source code for Quadratic Probing (obviously using different names to refer to those functions). Then, repeat the same procedure that you used to analyze the Quadratic Probing method in Aim 1, with the only exception that you should now keep track of separate timer variables and collision variables for the three different hash functions. As a result of this experiment, you should generate the following table:

 

                    |                   Insert                   |        Search
Hash function       | Total time | Avg. time | Total # collisions | Total time | Avg. time
--------------------+------------+-----------+--------------------+------------+----------
Simple              |            |           |                    |            |
Prefix              |            |           |                    |            |
Full-length         |            |           |                    |            |

Note that you don't have to repeat the experiment for the row corresponding to the Full-length hash function; those results should already be available from the corresponding entry in Aim 1.

REPORT:

In a separate document (Word or PDF), no more than 2 pages, compile the following sections:

·         A: Problem statement. In 1-2 sentences, state the goal(s) of this exercise.

·         B: Experimental Results. In this section, include the following:

o       The two tables as described above.

o       Are the observations made in the above tables in line with your theoretical expectations? Justify.


FINAL CHECKLIST FOR SUBMISSION:

    ___  Cover sheet

    ___  A separate folder containing all your source code (including the main function you used for testing)

    ___  Report

    ___  Both of the above zipped into a single archive called Program5<YourLastName(s)>.zip

 

GRADING RUBRICS:

This assignment is mainly about empirical testing and analysis. There is some design component, but not much. So for grading this PA, we will primarily look at how well you have designed the experiments, what your results look like, and whether you have offered a satisfactory/convincing rationale to explain your observations. Therefore the points during grading will be distributed as follows:

The whole assignment is worth a total of 100 points.

----------------------------------------

CODING (45 pts):

(10 pts): Are all versions of the hash tables and hash functions implemented correctly?

(10 pts): Is the code implemented in an efficient way? I.e., are there parts of the code that appear redundant, or that are implemented in ways that can be easily improved? Does the code conform to good object-oriented programming practices?

(15 pts): Does the code compile and run successfully on a couple of test cases?

(10 pts): Is the code documented well and generally easy to read (with helpful comments and pointers)?

REPORT (55 pts):

(15 pts): Is the experimental plan technically sound? Is the experimental setup specified clearly?

(10 pts): Are results shown in a well-annotated manner (the tables above), and are general trends in the results visible?

(30 pts): Are the justifications/reasons provided to explain the observations analytically sound? Is there a reasonable attempt to explain anomalies (i.e., results that go against analytical expectations), if any?

----------------------------------------

Obviously, to come up with the above evaluation, the TAs are going to both read and run your code.