CPT S 411: Final White Paper (instructions and guidelines)
When to submit?
11:59pm PDT, December 11, 2017 (Monday
of Finals week)
Where to submit?
Upload on OSBLE+ dropbox AND
email ananth@eecs.wsu.edu (do both before
deadline)
Will there be any extension? NO; this is a non-negotiable absolute deadline
Assignment type:
Individual
Submission file type:
Each submission will be accepted
only as a single ZIP file
This course has a white paper required for all students, instead of a final
exam. The paper is due at or before 11:59pm on December 11, 2017 (Monday of
the Finals week). This is an individual assignment (i..e., not
team-based).
How it works?
- I give you a problem statement (see below).
- You design algorithms (consistent with the problem
specification), provide implementations, design and perform
experiments, analyze your results, and prepare a detailed report
and a quad chart (adhering to the instructions below).
- You make the submission electronically by the deadline mentioned above.
- Go home and enjoy your Xmas break!
Collaboration rules:
- You are allowed to discuss with one another (or with me) during
the design phase and experimental phase. Discussions should be
restricted to understanding the problem parameters better including any
assumptions, obtain a high level of understanding of the
approach, and figuring out your experimental design/plan. Acknowledge
these people in the Acknowledgments section of your white paper.
- You are *NOT* allowed to share or show any part of your solution (code, specific algorithms, report) to anyone in the class.
- You are *NOT* allowed to refer or use materials available online.
- The entire implementation and entire report should be 100% yours.
- If any deviation from the above rules is observed, it will be considered cheating, and an automatic F grade will be assigned.
- If you are unsure about something please discuss it with me prior to contacting anyone.
Problem Statement: Parallel Pagerank Estimator
In this project, you will design, implement and test a parallel estimator for PAGERANK of all nodes in a graph.
The Pagerank of
a node (v) in a graph is an indicator of the node's importance. Take
for instance, a webgraph, where nodes are webpages and edges are
crosslinks (directional; incoming or outgoing) between two webpages. In
this context, the Pagerank of a node is the relative importance of that
webpage on the internet (higher the better).
The original Pagerank paper was written by Larry Page, Sergey Brin and others (@Stanford, and later Google):
Page,
L., Brin, S., Motwani, R. and Winograd, T., 1999. The PageRank citation
ranking: Bringing order to the web. Stanford InfoLab.
For the purpose of this white paper, you will design, implement and test a parallel estimator for Pagerank calculation.
Design:
Questions:
More specifically, in your white paper you are expected to answer these two questions:
1) Multithreaded version using OpenMP: For this you will design, implement and test.
2) Distributed
memory version using MPI: For this, you will design an algorithm and
present the pseudocode. I don't expect you to implement this part as
part of this white paper. But I need your design details including your
algorithm and related challenges.
High-level Approach:
The input is a directed graph G(V,E) with n vertices and m edges. Vertices
represent webpages. Edge (u,v) means there is a hyperlink from webpage u to
webpage v (and hence this edge is directed u->v).
The output should be the pagerank that you estimate for every vertex.
The approach that you need to use to estimate pagerank is as follows:
Starting from every node in the graph, perform a random walk of a
user-specified length K. Keep track of how many times the random
walks visit each node. The "estimate of pagerank" for a
given node u is then given by the number of times u was visited (as
part of any random walk) divided by the total visits across all vertices.
To determine each step of a random walk, let us say a given random walk is
at node u at any given stage. And let the out-degree (i.e., the
number of 1-hop neighbors) of u be denoted by d(u). Then to
determine the "target" node to jump to for the next step, follow
this procedure:
a. Toss a coin which has a probability D of
landing heads and a probability of 1-D of landing tails. The parameter D
is called Damping Ratio and is a user-specified probability term
(obviously, D should be a term between 0 and 1).
b. IF the toss = tail
THEN target = pick a node among
the pool of u's neighbors with equal probability (i.e., 1/(d(u)+1) )
ELSE (// toss = head) then: target = pick a
node among the entire set of n nodes with equal probability (i.e.,
1/n)
Implementation:
As pointed out above, you are required to implement and test only the OpenMP version of your algorithm.
Testing:
Input Data Sets:
Posted below are different real-world networks of varying sizes and shapes:
web-Google.txt
web-BerkStan.txt
web-Notredame.txt
facebook_combined.txt
These networks have been downloaded from the Stanford
Large Network Dataset (SNAP). Descriptions about these networks can be
found there. You are welcome to download and test with other networks from this site.
Experimental design:
Your code will need to accept two parameters {K: length of random
walk} {D: damping ratio}
In your output, include the top 5 nodes (with the top 5 highest Pageranks).
The
remaining parts of your experimental plan is something that I would
like you to design. This is a critical part of your project.
White paper:
The
white paper should be NOT EXCEED 8 pages including any required
references. The
document should use 1.5 line spacing, 11pt
Times New Roman or 10pt Arial font, and 1" page margin on all
sides. Adherence to this format is mandatory. Reports that do not
adhere to this format will *not* be graded.
The report itself should be written like a short scientific paper or technical report. The recommended format is as follows (some deviations from this format will be tolerated):
Section 1) Introduction
- Provide problem background,
with some motivation for parallelization as you see it.
Section 2) Problem Statement
- Problem definition (what is the
input? what is the output?), use a formal definition to the extent
possible. We used formal ways to define a problem in the class. Look at
parallel prefix sum or other problems for examples.
Section 3) Key Challenges in Parallelization
- State briefly why parallelization
poses challenges for this problem
Section 4) Proposed Approach(es)
- Propose your key ideas and
approach elements here. It is encouraged that you describe your algorithms
precisely in the form of a pseudocode. Use figures to help illustrate and
articulate the main ideas clearly. This section should also do a
complexity analysis of your algorithm - space and run-time
complexity.
Section 5) Experimental Results and Discussion
- Please report your
experimental results and your evaluation of those results in this section.
Also add your discussion in an objective manner.
Section 6) Acknowledgments
- Acknowledge any people you would like to as appropriate.
Section 7) References
- This section should consist of
citations to any literature that you have cited in the main text (any of
the above sections). The bibliography format should be that of IEEE format
(other similar ones okay), which means that each reference will be
numbered and the numbered entries would appear like this:
[1] [author1_initial}. author1_Lastname, [author2_initial}.
author2_Lastname. "Paper title", {Conference or Journal name},
{page and volume numbers}, {Year of publication}.
If you look up papers on Google Scholar there will be a CITE (") link next to
each paper. If you click that it will show you different formats. You can
use one of them.
Quad chart:
I would like you to use this Quad Chart (single slide)
format and populate it for your solution. The idea of such a quad chart
is to quickly convey the main highlights of everything that you have
done (problem, solution, results/findings) in a single slide - think of
a 60-second elevator speech. Okay - you are allowed to think of it as a
3 minute Pep Talk :)
What you need to turn in? (all bundled up in one ZIP file)
a) White paper (PDF)
b) QUAD chart (PPT or PDF)
c)
Your source code folder. Please don't include test data sets. I have
those. Just include your source code. If you used any additional inputs
beyond the above list, please provide references to them in your report
so that I know where to get them.
My Rubric for Evaluation: *IMPORTANT*
In my
evaluation of your white paper I will be looking for details related to
the following set of broad questions (not meant to be exhaustive):
Design (for both versions):
- Are the algorithms well described?
- Are the challenges pertaining to the design of those algorithms adequately related to?
- How well conceived are the solutions? Are the solutions technically sound and efficient?
- Have any potential pitfalls or limitations been identified?
- Are the assumptions (if any) spelled out clearly? Are there are any unwarranted/restrictive assumptions made?
Implementation and Testing (only for the OpenMP code):
- Is the code consistent with the algorithm presented in the design section?
- Are the results and observations adequately justified and/or explained?
- Have performance characteristics of the multithreaded code appropriately analyzed?
- Does the paper have an overall sound experimental design?
- Have the effect of different scheduling schemes and synchronization schemes for the multithreaded code appropriately analyzed?
- Has the effect of changing parameters like K and D appropriately analyzed?
Presentation and style:
- Is the white paper easy to read and understand?
- Is the writing style clear and scientific/technical?
- Are there helpful figures and/or illustrations to facilitate reading of the paper?
- Is there any sloppiness (e.g., typos, inconsistent formatting, etc.) in the writing?
- Is the Quad Chart effective (e.g., does it do a good job of
motivating the reader/listener to work or get more curious about the
problem? is it effective in showing the main highlights of the work?)