Programming Projects and Related Materials
Experimental platforms:
Two experimental platforms are now available for use in this class: the Pleiades
cluster and the GLX cluster.
The Pleiades cluster is an 8-node, 64-core cluster. Each
compute node has 8 Intel Xeon cores and 4GB of shared RAM.
The GLX cluster is a 5-node, 20-core cluster. Each compute
node has 4 Intel i5 cores and 4GB of shared RAM.
Pleiades cluster: Please
follow these instructions while
using the Pleiades cluster.
GLX cluster: Please follow these instructions
while using the GLX cluster. Note that you will need to set up the
environment before you can run MPI jobs.
Recommended usage: Please use the Pleiades cluster as your first
preference, and use GLX as a backup. You are likely to find your experience
using the Pleiades cluster a lot more pleasant (!), as the right
MPI environments are already set up, and also because there is a job
scheduler.
Examples:
PS: To compile an OpenMP code, use:
gcc -o {execname} -fopenmp {source file names}
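For example, assuming a source file named pi.c and an executable named pi (both names are just placeholders):
gcc -o pi -fopenmp pi.c
For the MPI projects, the MPI wrapper compiler is typically used instead, e.g. mpicc -o prog prog.c, and the program is launched with mpirun or through the SLURM scheduler on Pleiades; the exact setup and launch commands depend on the cluster, so follow the cluster instructions linked above.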
COVER PAGE PDF,
WORD
(to be included and submitted as part of all project and homework
submissions)
Programming Projects:
Program 1 - Due date: 9/21/2017 (Thursday, 11:59pm PT) via OSBLE+ dropbox
Assignment type: Individual
The goal of this project is to empirically estimate the network
parameters (latency and bandwidth constants) for the network
connecting the nodes of the compute cluster. To derive this estimate, write a
simple MPI send-receive program involving only two processes (one sends and
the other receives). Each MPI_Send should send a message of size m bytes
to the other process. By increasing the message size m as 1, 2,
4, 8, and so on, you are expected to plot a runtime curve with the runtime
for each send-recv communication on the Y-axis and the message size (m) on the
X-axis. For the message size, you may have to go up to 1MB or so to
observe a meaningful trend. Make sure you double m at each step.
From the curve derive the values for latency and bandwidth. To make
sure the observed numbers are as precise and reliable as they can be,
be sure to take an average over multiple iterations (at least 10)
before reporting.
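As a rough guide, the measured time for a message of m bytes should behave
like T(m) = latency + m/bandwidth, so the intercept of your curve estimates the
latency and the slope estimates 1/bandwidth. Below is a minimal sketch of the
kind of timing loop intended, assuming a ping-pong between ranks 0 and 1; the
variable names and the iteration count are illustrative, not prescribed.

/* Sketch: ping-pong timing between ranks 0 and 1 (illustrative only) */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int m = 1; m <= 1 << 20; m *= 2) {              /* message sizes 1 byte ... ~1MB */
        char *buf = malloc(m);
        int iters = 10;                                  /* average over at least 10 iterations */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, m, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, m, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, m, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, m, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double avg = (MPI_Wtime() - t0) / (2.0 * iters); /* average one-way time per message */
        if (rank == 0) printf("%d bytes: %e seconds\n", m, avg);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}

This sketch times the round trip and divides by two; you are free to structure
the timing differently (for example, timing just the MPI_Recv side as mentioned
below).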
Deliverables (zipped into one zip file - with your name on it):
i) Source code with timing functions,
ii) Report in PDF that shows your tables and charts
followed by your derivation for the network parameter estimates.
(As an alternative to MPI_Send and MPI_Recv, you are also allowed to
use MPI_Sendrecv for further testing purposes, but for the main results
time your MPI_Recv. Please look at the API documentation for the MPI routines for help.)
Program 2 - Due date: 10/3/2017 (Tuesday, 11:59pm PT) via OSBLE+ dropbox
Assignment type: Individual
The goal of this programming project is for you to implement, test, and
compare the shift, ring, and hypercubic permutations and their
effectiveness in solving the "sum of n numbers" problem.
Let p denote the number of
processes. For this project, we will assume that p
is a power of 2.
You will write three functions in your code:
Sum_byShift()
Sum_byRing()
Sum_byHypercube()
The input to all three functions is a set of p numbers, and the expected
output is the sum of those p numbers, which should be available at all
processes at the end of the function call. How the sum is computed is
what differs between these functions. Here is the pseudocode:
Step 1. Each proc. generates a single random number in the interval
[0, 99]. You can use your favorite random number generating function.
Step 2. Each proc. calls the Sum_{X} function, where {X} is one of "byShift"
or "byRing" or "byHypercube".
Sum_byShift():
This function will be a two-pass code. First, it does a right-shift
permutation of the values so that the final sum gets accumulated at
process rank p-1. This means the final sum will be available only at
rank p-1 at the end of this pass. To get the sum to all the other
processes, a second pass (left shift) is necessary.
Sum_byRing():
This function does a ring permutation of the values so that the final
sum gets accumulated at all processes. It should be fairly easy to see
how this can be done in just one pass.
Sum_byHypercube():
This function does a hypercubic permutation of the values so that the final
sum gets accumulated at all processes. It should be fairly easy to
see how to do this based on the lecture materials.
In all your implementations, you are only allowed to use point-to-point
MPI communication primitives (e.g., MPI_Sendrecv, MPI_Send, MPI_Recv,
and the nonblocking variants). I will let you pick which ones to use.
You are not allowed to use any
collective primitives such as Reduce or Allreduce.
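For orientation, here is a minimal sketch of what the hypercubic variant might
look like using only MPI_Sendrecv, as permitted above. The function signature
and variable names are illustrative, and p is assumed to be a power of 2 as
stated earlier.

/* Sketch: sum via hypercubic (recursive doubling) exchange -- illustrative only */
int Sum_byHypercube(int myvalue, MPI_Comm comm) {
    int rank, p, recvd, sum = myvalue;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);                 /* assumed to be a power of 2 */
    for (int d = 1; d < p; d <<= 1) {        /* one exchange per hypercube dimension */
        int partner = rank ^ d;              /* neighbor differing in bit d */
        MPI_Sendrecv(&sum, 1, MPI_INT, partner, 0,
                     &recvd, 1, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        sum += recvd;
    }
    return sum;                              /* after log2(p) steps every rank holds the total */
}

After log2(p) exchange steps every process holds the complete sum, which is why
this variant needs only a single pass.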
Report: For your report, please experiment with varying values of p
from 1 through 64 on the cluster. Record the timings for all three
variants of your Sum_{X} function calls (excluding any print or I/O
statements). Tabulate your results and plot them on a runtime
chart. Also, compute and plot the speedup and efficiency
achieved.
Deliverables (zipped into one zip file - with your name on it):
i) Source code with timing functions,
ii) Report in PDF that shows your tables and charts
followed by a brief discussion of your results (are they as per your
expectations?).
Program 3 - Due date: 10/19/2017 (Thursday, 11:59pm PT) via OSBLE+ dropbox
Assignment type: Individual or Team of size 2
The goal of this project is to do a parallel implementation of
Conway's Game of Life. Here are the detailed
instructions.
Project 4 - Due date: 11/7/2017 (Tuesday, 11:59pm PT) via OSBLE+ dropbox
Assignment type: Individual or Team of size 2
In this project you will implement a parallel random
number series generator, using the Linear Congruential generating
model. Here, the ith random number, denoted by x_i, is given by:
x_i = (a * x_(i-1) + b) mod P
where a and b are some positive constants and P is a big constant
(typically a large prime).
Your goal is to implement an MPI program for generating a random series up
to the nth random number of the linear congruential series (i.e.,
we need all n numbers of the series, and not just the nth
number). We discussed an algorithm that uses parallel prefix in
class. You are expected to implement this algorithm. Further
explanation of the parallel algorithm can be found in the notes by
Prof. Aluru (Chapter 1). Operate under the assumption that n >> p.
Your code should have your own explicit implementation of the parallel
prefix operation. Your code should also get the parameter values {a, b, P}
and the random seed to use (i.e., x_0) from the user.
It should be easy to test your code by writing your own simple serial code
and comparing your output.
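The key observation behind the parallel prefix formulation is that each step of
the recurrence is an affine map x -> (a*x + b) mod P, and composing two such
maps gives another affine map. Below is a minimal sketch of that composition,
the building block your own prefix implementation would combine; the names are
illustrative, and for simplicity it assumes P is small enough that the products
fit in 64-bit arithmetic.

/* Sketch: composing affine maps of the form x -> (a*x + b) mod P -- illustrative only */
typedef struct { unsigned long long a, b; } Affine;

/* Applying f first and then g gives g(f(x)) = (g.a*f.a)*x + (g.a*f.b + g.b), all mod P */
Affine compose(Affine f, Affine g, unsigned long long P) {
    Affine h;
    h.a = (g.a * f.a) % P;
    h.b = ((g.a * f.b) % P + g.b) % P;
    return h;
}

One common structure for the overall program: each rank first composes the map
(a, b) with itself n/p times to obtain its block's map, an exclusive prefix of
these block maps across the ranks (implemented with your own point-to-point
exchanges) gives each rank the map that carries x_0 to the element just before
its block, and each rank then generates its n/p values serially.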
Performance analysis:
a) Generate speedup charts (speedups calculated over your serial
implementation), fixing n to a large number such as a million and varying
the number of processors from 2 to 64.
b) Study total runtime as a function of n. Vary n from a small number such
as 16 and keep doubling until it reaches a large value (e.g., 1
million).
Compile a brief report that presents and discusses your results/findings.
Deliverables (zipped into one zip file - with your name on it):
i) Source code,
ii) Report in PDF or Word
Project 5 - Due date: 11/16/2017 (Thursday, 11:59pm PT) via OSBLE+ dropbox
Assignment type: Individual
In this project you will implement an OpenMP multithreaded PI value
estimator using the algorithm discussed in class. This algorithm essentially
throws a dart n times into a unit square and computes the fraction
of times the dart falls inside the embedded unit circle. This fraction
multiplied by 4 gives an estimate for PI.
Here is the classroom scribe for further details: PDF
Your code should expect two arguments: <n> <number of threads>.
Your code should output the PI value estimated at the end. Note that
the precision of the PI estimate could potentially vary with the number of
threads (assuming n is fixed).
Your code should also output the total time (calculated using the omp_get_wtime
function).
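Below is a minimal sketch of the kind of program intended, assuming rand_r with
per-thread seeds for random number generation; that choice, and the names used,
are illustrative only.

/* Sketch: Monte Carlo PI estimation with OpenMP -- illustrative only */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    if (argc < 3) { fprintf(stderr, "usage: %s <n> <number of threads>\n", argv[0]); return 1; }
    long n = atol(argv[1]);
    int nthreads = atoi(argv[2]);
    omp_set_num_threads(nthreads);
    long hits = 0;
    double t0 = omp_get_wtime();
    #pragma omp parallel reduction(+:hits)
    {
        unsigned int seed = 1234u + omp_get_thread_num();    /* per-thread seed */
        #pragma omp for
        for (long i = 0; i < n; i++) {
            double x = (double) rand_r(&seed) / RAND_MAX;    /* dart in [0,1] x [0,1] */
            double y = (double) rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0) hits++;                /* dart landed inside the circle */
        }
    }
    double elapsed = omp_get_wtime() - t0;
    printf("PI estimate: %.20f\n", 4.0 * (double) hits / (double) n);
    printf("Time: %f seconds\n", elapsed);
    return 0;
}

Compile it with the -fopenmp flag shown above.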
Experiment with different values of n (starting from 1024 and going all the
way up to a billion or more) and p (1, 2, 4, ...).
Please do two sets of experiments as instructed below:
1) For speedup - keep n fixed and increase p (1, 2, 4, 8). You may have to
do this for large values of n to observe meaningful speedup.
Calculate the relative speedup. Note that the Pleiades nodes
have 8 cores per node, so there is no need to increase the number of threads
beyond 8. In your report, show the run-time table for this testing and also
the speedup chart.
PS: If you are using Pleiades (which is what I recommend) you should still
use the queue system (SLURM) to make sure you get exclusive access to a
node. For this you need to run sbatch with the -N1 option.
2) For precision testing - keep n/p fixed and increase p (1, 2, ... up to 16 or
32). For this you will have to start with a good granularity (n/p) value
that gave you some meaningful speedup in experiment set 1.
The goal of this testing is to see whether the PI value estimated
by your code improves in precision as n increases.
Therefore, in your report make a table that shows the PI values
estimated (up to, say, 20-odd decimal places) for each value of n
tested.