Programming Projects and Related Materials
Experimental platforms:
The following compute platforms are available for use by this class:
- A 5-node cluster ("glx nodes") which you will need to access from
within the EECS firewall. Each of these nodes contains a quad-core
Intel(R) Core(TM) i5-3330 CPU running at 3GHz, and 4GB RAM (shared by
the cores). The glx nodes are labeled glx1.eecs.wsu.edu through
glx5.eecs.wsu.edu. They are connected by a gigabit ethernet switch, so
you can use this system as a "cluster". There is currently no job queue
system. Instructions on how to run your parallel jobs are given under
Setup.
- It is expected that the students will also have restricted access to
another (larger) cluster within VCEA. More details will be posted here as
that arrangement is finalized.
The above will be the only dedicated platforms made available for this
course. Students are welcome to use other clusters (for MPI codes) or
laptops (for OpenMP/multithreaded programs) that they may have access
to. If you use a different platform, its details must be mentioned in
the experimental setup section of each programming project report.
Setup:
- Instructions to set up the MPI daemon on the cluster (PDF)
- Example machine file: machinefile.txt
- Commands to run MPI jobs on the cluster:
> To compile, use the command: mpicc -o <exec name> <sourcefiles>
> To run your jobs, use the command: mpiexec -machinefile <path>/machinefile.txt -n <number of procs> <path>/executable-name <args if any>
OpenMP instructions:
- To compile your OpenMP program you need to use the -fopenmp flag. For
instance, to compile a C/OpenMP code on the glx nodes you would use
"gcc -fopenmp -o <execname> <sourcefile(s)>".
- To run the executable, simply invoke it directly with its arguments,
as you would a serial code; nothing different is needed. It is good
practice to have the user specify the number of threads as one of the
executable's arguments.
Examples:
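For instance, here is a minimal sketch of a C/OpenMP program following
the above conventions (the program name and the default of 1 thread are
illustrative choices, not requirements):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* hello.c - takes the desired number of threads as its first argument.
   Compile on the glx nodes with: gcc -fopenmp -o hello hello.c */
int main(int argc, char *argv[]) {
    int nthreads = (argc > 1) ? atoi(argv[1]) : 1;
    omp_set_num_threads(nthreads);
    #pragma omp parallel
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}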
Programming Projects:
Project 1 - Due date: 9/17/2015 (Thursday, 11:59pm PT) via OSBLE dropbox
Assignment type: Individual
The goal of this project is to empirically estimate the network
parameters (latency and bandwidth constants) for the network connecting
the nodes of the GLX compute cluster. To derive this estimate, write a
simple MPI send-receive program involving only two processors (one
sends and the other receives). Each MPI_Send should send a message of
size m bytes to the other processor. By increasing the message size m
(1, 2, 4, 8, and so on), you are expected to plot a runtime curve with
the runtime for each send-recv communication on the Y-axis and the
message size m on the X-axis. For the message size, you may have to go
up to 1MB or so to observe a meaningful trend. Make sure you double m
at each step.
From the curve, derive the values for latency and bandwidth. To make
sure the observed numbers are accurate, perform the send-recv inside a
large loop and average the time; a sketch of such a timing loop appears
at the end of this project description.
Deliverables (zipped into one zip file - with your name on it):
i) Source code with timing functions,
ii) Report in PDF or Word that shows your tables and
charts followed by your derivation for the network parameter estimates.
(As an alternative to MPI_Send and MPI_Recv, you are also allowed to use
MPI_Sendrecv. Please look at the API for MPI routines for help.)
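To make the measurement concrete, here is a minimal sketch of a
ping-pong timing loop (the iteration count NITER and the 1MB cap are
assumptions; run it on two glx nodes with mpiexec -n 2):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

#define NITER 1000             /* average over many iterations */
#define MAXBYTES (1 << 20)     /* go up to 1 MB */

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(MAXBYTES);
    memset(buf, 0, MAXBYTES);

    for (int m = 1; m <= MAXBYTES; m *= 2) {   /* double m each step */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < NITER; i++) {
            if (rank == 0) {
                MPI_Send(buf, m, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, m, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, m, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, m, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)  /* halve the round trip: one-way send-recv time */
            printf("m = %7d bytes  avg one-way time = %g s\n",
                   m, (t1 - t0) / (2.0 * NITER));
    }
    free(buf);
    MPI_Finalize();
    return 0;
}

Under the usual model t(m) = latency + m/bandwidth, the latency is
roughly the time observed for very small m, and the bandwidth is the
reciprocal of the curve's slope at large m.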
Project 2 - Due date: 9/29/2015 (Tuesday, 11:59pm PT) via OSBLE dropbox
Assignment type: Individual or Teams of 2
In this project you will implement Conway's Game of Life in MPI. Please
see the PDF for details.
Project 3 - Due date: 10/27/2015 (Tuesday, 11:59pm PT) via OSBLE dropbox
Assignment type: Individuals or Teams of 2
In this project you will implement a parallel random number series
generator, using the Linear Congruential generating model. Here, the
ith random number, denoted by x_i, is given by:
x_i = (a*x_{i-1} + b) mod P
where a and b are some positive constants and P is a big constant
(typically a large prime).
Your goal is to implement an MPI program for generating a random series
up to the nth random number of the linear congruential series (i.e., we
need all n numbers of the series, not just the nth number). We
discussed an algorithm that uses parallel prefix in the class on
10/13/2015. Further explanation of the parallel algorithm can be found
in the notes by Prof. Aluru (chapter 1). I expect you to implement this
algorithm. Operate under the assumption that n >> p. Your code should
have its own explicit implementation of the parallel prefix operation.
Your code should also get the parameter values {a,b,P} and the random
seed to use (same as x_0) from the user.
It should be easy to test your code by writing your own simple serial
code and comparing your output against it. A sketch of the parallel
prefix idea follows.
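The sketch below treats each LCG step as an affine map f(x) = (a*x + b)
mod P, so that composing maps is the (non-commutative) prefix operator.
It assumes p is a power of 2, p divides n, and P < 2^32 so that
products fit in 64 bits; it is a starting point, not the full algorithm
from the notes:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

typedef unsigned long long u64;

/* An affine map f(x) = (a*x + b) mod P, stored as the pair {a, b}. */
typedef struct { u64 a, b; } Affine;

static u64 P;

/* Compose: apply f first, then g: (g(f(x)) = (g.a*(f.a*x+f.b)+g.b) mod P. */
static Affine compose(Affine f, Affine g) {
    Affine h = { (g.a * f.a) % P, (g.a * f.b + g.b) % P };
    return h;
}

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    if (argc < 6) {
        if (rank == 0) fprintf(stderr, "usage: %s a b P x0 n\n", argv[0]);
        MPI_Finalize();
        return 1;
    }
    u64 a  = strtoull(argv[1], NULL, 10);
    u64 b  = strtoull(argv[2], NULL, 10);
    P      = strtoull(argv[3], NULL, 10);
    u64 x0 = strtoull(argv[4], NULL, 10);
    long n = atol(argv[5]);
    long local_n = n / p;                /* assumes p divides n */

    /* One LCG step, and the map for this rank's whole block of steps. */
    Affine step = { a % P, b % P };
    Affine block = { 1 % P, 0 };
    for (long i = 0; i < local_n; i++) block = compose(block, step);

    /* Explicit exclusive parallel prefix over the block maps
       (hypercube exchange pattern; assumes p is a power of 2). */
    Affine pre = { 1 % P, 0 };           /* identity map */
    Affine tot = block;
    for (int d = 1; d < p; d <<= 1) {
        int partner = rank ^ d;
        Affine recv;
        MPI_Sendrecv(&tot, 2, MPI_UNSIGNED_LONG_LONG, partner, 0,
                     &recv, 2, MPI_UNSIGNED_LONG_LONG, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (partner < rank) {            /* partner's group precedes ours */
            pre = compose(recv, pre);
            tot = compose(recv, tot);
        } else {
            tot = compose(tot, recv);
        }
    }

    /* Seed this rank's block with pre(x0), then generate local_n numbers. */
    u64 x = (pre.a * x0 + pre.b) % P;
    for (long i = 0; i < local_n; i++)
        x = (a * x + b) % P;             /* x_(rank*local_n + i + 1) */
    printf("rank %d: last number in my block = %llu\n", rank, x);

    MPI_Finalize();
    return 0;
}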
Performance analysis:
a) Generate speedup charts (speedups calculated over your serial
implementation), fixing n to a large number such as a million and varying
the number of processors from 2 to 16.
b) Study total runtime as a function of n. Vary n from a small number such
as 16 and keep doubling until it reaches a large value (e.g., 1
million).
Compile a brief report that presents and discusses your results/findings.
Deliverables (zipped into one zip file - with your name on it):
i) Source code,
ii) Report in PDF or Word
Project 4 - Due date: 11/10/2015 (Tuesday, 11:59pm PT) via OSBLE dropbox
Assignment type: Individuals or Teams of 2
In this project you will implement the parallel sample sort algorithm
discussed in class. For a detailed description of the sample sort algorithm,
please refer to the lecture notes section (Parallel sorting chapter).
Your code should use Alltoallv (for step S4) and Allgather (for step
S2). You may also need an Alltoall to relay all the receive count
values prior to performing the Alltoallv in step S4.
For help on how to write an MPI code using Alltoall and Alltoallv, you
can refer to code examples on pages like these:
- MPI_Alltoall example code
- MPI_Alltoallv example
Make sure you are familiar with the syntax by referring to the MPI
Routines API page (also linked from lecture notes).
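Here is a minimal, self-contained sketch of the count exchange followed
by the variable-size exchange (the payload is dummy data; in your
sample sort, the buckets produced by the splitters play this role):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *sendcounts = malloc(p * sizeof(int));
    int *recvcounts = malloc(p * sizeof(int));
    int *sdispls    = malloc(p * sizeof(int));
    int *rdispls    = malloc(p * sizeof(int));

    /* Dummy bucket sizes: rank r sends r+1 integers to every rank. */
    for (int i = 0; i < p; i++) sendcounts[i] = rank + 1;

    /* Relay the counts so every rank learns how much it will receive. */
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT,
                 MPI_COMM_WORLD);

    /* Displacements are prefix sums of the counts. */
    sdispls[0] = rdispls[0] = 0;
    for (int i = 1; i < p; i++) {
        sdispls[i] = sdispls[i-1] + sendcounts[i-1];
        rdispls[i] = rdispls[i-1] + recvcounts[i-1];
    }
    int sendtotal = sdispls[p-1] + sendcounts[p-1];
    int recvtotal = rdispls[p-1] + recvcounts[p-1];

    int *sendbuf = malloc(sendtotal * sizeof(int));
    int *recvbuf = malloc(recvtotal * sizeof(int));
    for (int i = 0; i < sendtotal; i++) sendbuf[i] = rank; /* placeholder */

    /* The variable-size exchange (in step S4 this moves each bucket to
       its owner rank). */
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                  recvbuf, recvcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d received %d values\n", rank, recvtotal);
    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}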
When clocking communication time, be sure to include the time taken to
prepare the send/receive buffers and send/receive counts, in addition to the
times taken to do Alltoallv and Alltoall.
For performing the local sorts in steps S1 and S5, you are allowed to reuse
existing library functions (e.g., qsort in C).
For generating the input array, you are required to use the parallel random
number series generator function you implemented in Project 3. You can use
the same set of parameters (a=7, b=19, P=9973) for your experiments. This
would imply that the values are bounded within the range [0,9972]. However,
treat this as a comparison-based sort problem. That is, don't use any
integer sorting methods like counting sort.
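For instance, a comparison-based local sort with qsort needs only a
comparator such as the following sketch (explicit comparisons avoid the
overflow that the common "return a - b;" idiom can suffer):

#include <stdlib.h>

static int cmp_int(const void *pa, const void *pb) {
    int a = *(const int *)pa, b = *(const int *)pb;
    return (a > b) - (a < b);   /* -1, 0, or 1; purely comparison-based */
}

/* usage: qsort(local_array, local_n, sizeof(int), cmp_int); */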
Your test code (e.g., main.c) should first call the random generator
function to generate the input array in a distributed manner and then it
should call your sample sort function to sort the array.
Performance analysis:
a) Runtime table that shows the input sizes (n) along rows and the
number of processors used along columns. Vary the number of processors
from 1 to 16. Vary the input size from a small size (e.g., 128) to
something on the order of a million or more (stop when the parallel
time-to-solution exceeds roughly 15-20 mins).
b) Generate speedup charts (speedups calculated over your parallel code's
single processor run) using the runtime data in the above table. Note, your
speedup charts should have speedup along the y-axis and number of processors
along the x-axis. Use different curves for different input sizes.
c) In addition to the speedup chart, also plot another chart that shows
the total runtime broken down into two parts: total computation time
and total communication time. Express this in percentages, so the
y-axis shows 100% to mean the total runtime, split into two segments -
one for computation and another for communication. The x-axis is the
number of processors. Show one such plot for each of the doubling input
sizes you tested.
Compile a brief report that presents and discusses your results/findings.
Deliverables (zipped into one zip file - with your name on it):
i) Source code,
ii) Report in PDF or Word
Project 5 - Due date: 11/17/2015 (Tuesday, 11:59pm PT) via OSBLE dropbox
Assignment type: Individual
In this project you will implement a simple PI value estimator using
the algorithm discussed in class. This algorithm essentially throws a
dart n times into a unit square (1x1) and computes the fraction of
times the dart falls inside the embedded quarter circle (the quarter of
the unit circle that fits inside the 1x1 square). This fraction
multiplied by 4 gives an estimate for PI.
Your code should expect two arguments: <n> <number of threads>.
Your code should output the PI value estimated at the end. Note that
the precision of the PI estimate could potentially vary with the number of
threads (assuming n is fixed).
Your code should also output the total time (calculated using the
omp_get_wtime function).
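A minimal sketch of such an estimator follows (the per-thread seeding
via rand_r is one reasonable choice for thread-safe random numbers, not
a required one):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <n> <number of threads>\n", argv[0]);
        return 1;
    }
    long n = atol(argv[1]);
    omp_set_num_threads(atoi(argv[2]));

    long hits = 0;
    double t0 = omp_get_wtime();
    #pragma omp parallel reduction(+:hits)
    {
        /* Give each thread its own RNG state so rand_r stays thread-safe. */
        unsigned int seed = 1234u + omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0) hits++;  /* inside the quarter circle */
        }
    }
    double t1 = omp_get_wtime();

    printf("PI estimate = %.20f\n", 4.0 * (double)hits / (double)n);
    printf("time = %f s\n", t1 - t0);
    return 0;
}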
Experiment with different values of n (starting from 1024 and going all
the way up to a billion or more) and p (1, 2, 4, ...).
Please do two sets of experiments as discussed in class earlier, and
report them as instructed below:
1) For speedup - keep n fixed and increase p (1, 2, 4). You may have to
do this for large values of n to observe meaningful speedup. Note that
the glx nodes have only 4 cores per node, so there is no need to
increase the number of threads beyond 4. In your report show the
runtime table for this testing and also the speedup chart.
2) For precision testing - keep n/p fixed, and increase p (1, 2, ... up
to 16 or 32). For this you will have to start with a good granularity
(n/p) value that gave you some meaningful speedup in experiment set 1.
The goal of this testing is to see if the PI value estimated by your
code improves in precision as n increases. Therefore, in your report
make a table that shows the PI values estimated (up to some 20-odd
decimal places) for each value of n tested.
Project 6 - Due date: 12/8/2015 (Tuesday, 11:59pm PT) via OSBLE dropbox
Assignment type: Individuals or Teams of 2
In this project, you will implement an estimator for pagerank. This is the
same algorithm we discussed in class (lecture on November 12).
The input is a directed graph G(V,E) with n vertices and m edges. Vertices
represent webpages. Edge (u,v) means there is a hyperlink from webpage u to
webpage v (and hence this edge is directed u->v).
The output should be the pagerank that you estimate for every vertex.
The algorithm that you need to use to estimate pagerank is as follows:
Starting from every node in the graph, perform a random walk of a
user-specified length K. Keep track of how many times the random
walks visit each node. The "estimate of pagerank" for a
given node u is then given by the number of times u was visited (as
part of any random walk) divided by the total visits across all vertices.
To determine each step of a random walk, suppose a given random walk is
at node u at some stage, and let the out-degree (i.e., the number of
1-hop neighbors) of u be denoted by d(u). Then, to determine the
"target" node to jump to for the next step, follow this procedure:
a. Toss a coin which has a probability D of landing heads and a
probability 1-D of landing tails. The parameter D is called the Damping
Ratio and is a user-specified probability term (obviously, D should be
a value between 0 and 1).
b. IF the toss = tails THEN target = pick a node from the pool
consisting of u's neighbors and u itself, each with equal probability
(i.e., 1/(d(u)+1))
ELSE (toss = heads) target = pick a node among the entire set of n
nodes with equal probability (i.e., 1/n)
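A sketch of this target selection (adj/deg are a placeholder
adjacency-list representation; the pool of size d(u)+1 is read here as
u's neighbors plus u itself, matching the probability stated above, and
rand_r keeps the RNG thread-safe under OpenMP):

#include <stdlib.h>

/* One step of the random walk from node u, following the coin-toss rule. */
int next_node(int u, int n, double D, int **adj, const int *deg,
              unsigned int *seed) {
    double toss = (double)rand_r(seed) / RAND_MAX;
    if (toss < D)                  /* heads (probability D): jump anywhere */
        return rand_r(seed) % n;
    int pool = deg[u] + 1;         /* tails: u's neighbors plus u itself */
    int pick = rand_r(seed) % pool;
    return (pick == deg[u]) ? u : adj[u][pick];
}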
High-level pseudocode:
******************
0) Init the visit counter at every vertex to 0; init any other locking
data structures as appropriate
1) From every node u:
   a. node = u
   b. For step = 1 to K:
      - visit_counter[node]++ // this should be a low-cost synchronized
        update; I would like you to compare "omp atomic" against OpenMP
        locking as discussed in class (see the sketch below)
      - determine the "Target" for the next step (this implements the
        probabilistic approach discussed above)
      - node = Target
2) Compute the estimate of pagerank at every node using the
normalization procedure described above.
3) Output the pageranks (do printfs only for debugging and testing; do
not include this step in the timing).
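A minimal sketch of the two synchronized-update variants for step 1b
(the vertex count, per-vertex lock array, and schedule clauses are
illustrative choices; the walk itself is elided):

#include <stdio.h>
#include <omp.h>

#define N_VERTICES 1000                 /* placeholder graph size */

long visit_counter[N_VERTICES];
omp_lock_t locks[N_VERTICES];           /* one lock per vertex */

int main(void) {
    for (int v = 0; v < N_VERTICES; v++) omp_init_lock(&locks[v]);

    /* Variant 1: omp atomic (each "walk" here just touches its start node). */
    #pragma omp parallel for schedule(static)
    for (int u = 0; u < N_VERTICES; u++) {
        int node = u;                   /* ... walk steps elided ... */
        #pragma omp atomic
        visit_counter[node]++;
    }

    /* Variant 2: per-vertex OpenMP locks. */
    #pragma omp parallel for schedule(dynamic, 64)
    for (int u = 0; u < N_VERTICES; u++) {
        int node = u;
        omp_set_lock(&locks[node]);
        visit_counter[node]++;
        omp_unset_lock(&locks[node]);
    }

    printf("node 0 visited %ld times\n", visit_counter[0]);
    for (int v = 0; v < N_VERTICES; v++) omp_destroy_lock(&locks[v]);
    return 0;
}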
I have posted three different real world network inputs here:
web-Google.txt
web-BerkStan.txt
web-Notredame.txt
These networks were downloaded from the Stanford Large Network Dataset
Collection (SNAP). Descriptions of these networks can be found there.
Experimental plan:
Your code will need to accept two parameters: {K: length of random
walk} and {D: damping ratio}.
For testing you can try K beginning from a small value (e.g., 128, 256,
etc.) up to larger values (1024, 2048, etc.).
You are also welcome to try out different values for D, but for
performance testing it's okay to fix it, say at D=0.2.
You will need to implement and compare two variants of the
above method - one using "omp atomic" and another using OpenMP locks
(omp_set_lock and related functions).
You are also required to experiment with different OpenMP schedule
settings - namely, schedule(static), schedule(dynamic) and
schedule(guided) - using appropriate chunk sizes. Please report your
findings on this comparison in the form of a simple runtime table
comparing the three schemes. It's sufficient to do this experiment for
just one of the three inputs.
Record the time it takes for steps (1) and (2) of the pseudocode and report
that for thread counts 1, 2 and 4.
Your report should include:
a) Three runtime tables, one for each of the three inputs. Each table
should have K along columns and thread count (p) along rows.
b) Three corresponding speedup charts.
c) A table comparing the different OpenMP schedule schemes.
d) Your observations and justification of the results.