Programming Projects and Related Materials

Experimental platforms:

The main experimental platform for this class is the Pleiades cluster.

The Pleiades cluster is an 8-node, 64-core cluster. Each compute node has 8 Intel Xeon cores and 4 GB of shared RAM.

Pleiades cluster: Please follow these instructions while using the cluster.

Recommended usage: Please use the Pleiades cluster as your first preference. If you want to use a different cluster that you already have access to, that is fine as long as: a) the cluster is comparable (if not larger) in its configuration compared to Pleiades; and b) you record the system configuration of that cluster in your project report (under the experimental setup section).

Examples: 

      PS: To compile an OpenMP code use:
                gcc -o {execname} -fopenmp {source file names}
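
      PS: For the MPI projects, the compiler wrapper and launcher are typically the following (assuming a standard MPI installation such as MPICH or OpenMPI):
                mpicc -o {execname} {source file names}
                mpirun -np {number of processes} ./{execname}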

COVER PAGE:  PDF   Word
(please include this with every project and homework submission)

PROGRAMMING PROJECTS

Assignment type: Team of size up to 2 encouraged
Where to submit? Blackboard dropbox for Programming Project #1

The goal of this project is to empirically estimate the network parameters latency and bandwidth, and the network buffer size, for the network connecting the nodes of the compute cluster. Note that these were described in the class in detail (refer to the lectures on August 27th and September 3rd).

You are expected to use the Pleiades cluster for this project.

To derive the estimates, write a simple MPI send-receive program involving only two processes (one sends and the other receives). Each MPI send should transmit a message of size m bytes to the other process (which invokes the receive). By increasing the message size m as 1, 2, 4, 8, ... and so on, you are expected to plot two runtime curves: one for send and another for receive. Plot the communication time on the Y-axis and the message size m on the X-axis. You may have to go up to 1 MB or more to observe a meaningful trend. Make sure you double m at each step.

You will need to test using two implementations:

Blocking test: For this, you will implement the sends and receives using the blocking calls MPI_Send and MPI_Recv. As an alternative to MPI_Send and MPI_Recv, you are also allowed to use MPI_Sendrecv. This is blocking as well.

Nonblocking test: For this, you will implement the receives using the nonblocking call MPI_Irecv. For the send, you may use either MPI_Send or MPI_Isend; as explained in class, the choice should not matter much.

Please refer to the MPI API documentation for function syntax and semantics. An example code that does MPI_Send and MPI_Recv along with timing functions is given above (send_recv_test.c). You can reuse parts of this code as you see fit; obviously, modify it to fit your needs.
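
If it helps, here is a minimal sketch of the blocking test loop (the buffer limit, iteration count, and output format are illustrative; run it with two processes, e.g., mpirun -np 2 ./{execname}):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_M (1 << 20)   /* largest message size: 1 MB */
    #define ITERS 10          /* average over at least 10 iterations */

    int main(int argc, char *argv[]) {
        int rank;
        char *buf = malloc(MAX_M);
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 'a', MAX_M);
        for (int m = 1; m <= MAX_M; m *= 2) {      /* double m each step */
            double t0 = MPI_Wtime();
            for (int it = 0; it < ITERS; it++) {
                if (rank == 0)        /* rank 0 times its sends ... */
                    MPI_Send(buf, m, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                else if (rank == 1)   /* ... rank 1 times its receives */
                    MPI_Recv(buf, m, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
            }
            double avg = (MPI_Wtime() - t0) / ITERS;
            printf("rank %d, m = %7d bytes: avg time = %e s\n",
                   rank, m, avg);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }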

Deriving the estimates: From the curves, derive the values for latency, bandwidth, and network buffer size. To make your estimates as precise and reliable as possible, take an average over multiple iterations (at least 10) before reporting. Report your estimated values, one set for the blocking test and another for the nonblocking test. (You may get slightly different values from the two tests.)
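
One way to read the estimates off the curves, assuming the linear communication cost model from the lectures, is:

        t(m) ≈ latency + m / bandwidth

That is, the y-intercept of the curve (small m) estimates the latency, the inverse of the slope in the large-m regime estimates the bandwidth, and a sharp jump in the blocking send curve marks the message size at which the network buffer is exceeded (beyond it, a blocking send can no longer complete locally). Treat this as a reading guide; the derivation itself was covered in the lectures.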


Deliverables (zipped into one zip file - with your names on it):
Note, for those of you who worked in a team of two, both of you should submit: one partner submits the full assignment (source code, report, and a cover page stating who the other partner was), and the other partner submits just the cover page.
    i) Source code with timing functions,
    ii) Report in PDF that shows your tables and charts, followed by your derivation of the network parameter estimates. Make sure to add a justification/explanation of your results. Don't dump the raw data or results; present your results in a professional manner.

Assignment type: Team of size up to 2 encouraged
Where to submit? Blackboard dropbox for Programming Project #2

For a detailed description of this project, please refer to this PDF.


(Total points: 20)
Assignment type: Team of size up to 2 encouraged
Where to submit? Blackboard dropbox for Programming Project #3


In this project you will implement a parallel pseudo-random number series generator, using the linear congruential generator model we discussed in class. Here, the ith random number, denoted by x_i, is given by:
        x_i = (A * x_{i-1} + B) mod P,         where A and B are some positive constants and P is a big constant (typically a large prime); all three parameters {A, B, P} are user-defined.
Your goal is to implement an MPI program for generating the random series up to the nth random number of the linear congruential series (i.e., we need all n numbers of the series, not just the nth number). We discussed an algorithm that uses parallel prefix in class; you are expected to implement this algorithm. Refer to the lecture notes website for more references if you need them.
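For reference, the recurrence can be cast in matrix form (this is the standard construction behind the matrix formulation mentioned below; check the lecture notes for the exact version used in class). Each step applies the same 2x2 matrix to the previous state:

        [ x_i ]   [ A  B ]   [ x_{i-1} ]
        [  1  ] = [ 0  1 ] * [    1    ]    (mod P)

Computing the prefix products M, M^2, ..., M^n (mod P) and applying each one to (x_0, 1) yields the entire series; this prefix computation is exactly what the parallel prefix operation parallelizes.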
Operate under the assumption that n >> p. Your code should have your own explicit implementation of the parallel prefix operation. Your code should also get the parameter values {A, B, P} and the random seed to use (i.e., x_0) from the user.
All the logic in your code for doing parallel prefix should be written from scratch. Use of MPI_Scan is *not* allowed for doing parallel prefix; write your own parallel prefix function. I have already provided a lot of tips in class.
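As a starting point, here is a minimal sketch of the associative operator that a parallel prefix over the matrix formulation would combine: 2x2 matrix multiplication mod P. The struct name and the no-overflow assumption (P < 2^32, so every product fits in an unsigned 64-bit integer) are mine, not part of the assignment:

    /* 2x2 matrix over the integers mod P. The prefix operator is
       matrix multiplication mod P; it is associative, which is what
       parallel prefix requires. Assumes P < 2^32 so that each product
       below fits in an unsigned 64-bit integer. */
    typedef struct { unsigned long long m[2][2]; } Mat2;

    Mat2 matmul_mod(Mat2 a, Mat2 b, unsigned long long P) {
        Mat2 c;
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                c.m[i][j] = (a.m[i][0] * b.m[0][j]
                           + a.m[i][1] * b.m[1][j]) % P;
        return c;
    }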
It should be easy to test your code by writing your own simple serial code and comparing your output. If your parallel implementation is right, its output should be identical to that of the serial output (for the same parameter setting).
It is expected that your code has two serial implementations:
i) serial_baseline: a function that is the most straightforward serial implementation, directly implementing the mathematical recurrence using a simple for loop.
ii) serial_matrix: a function that is also a for loop but uses the matrix formulation (in other words, this is equivalent to your parallel algorithm on a single process).

Both of these serial baseline pseudocodes are provided in the lecture notes. The above two versions are important for correctness checking.
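
For concreteness, hedged sketches of the two serial versions follow (they reuse the Mat2/matmul_mod helpers sketched above; the function signatures and the P < 2^32 no-overflow assumption are mine, and the authoritative pseudocode is in the lecture notes):

    /* serial_baseline: directly iterate x_i = (A*x_{i-1} + B) mod P. */
    void serial_baseline(unsigned long long *x, long long n,
                         unsigned long long A, unsigned long long B,
                         unsigned long long P, unsigned long long x0) {
        unsigned long long prev = x0;
        for (long long i = 0; i < n; i++) {
            prev = (A * prev + B) % P;   /* the recurrence itself */
            x[i] = prev;
        }
    }

    /* serial_matrix: same series via the matrix formulation; keeps a
       running product acc = M^(i+1) mod P, where M = [[A,B],[0,1]],
       and applies it to the state (x0, 1). Equivalent to the parallel
       algorithm run on a single process. */
    void serial_matrix(unsigned long long *x, long long n,
                       unsigned long long A, unsigned long long B,
                       unsigned long long P, unsigned long long x0) {
        Mat2 M   = {{{A, B}, {0, 1}}};   /* per-step matrix */
        Mat2 acc = {{{1, 0}, {0, 1}}};   /* identity */
        for (long long i = 0; i < n; i++) {
            acc  = matmul_mod(acc, M, P);             /* acc = M^(i+1) */
            x[i] = (acc.m[0][0] * x0 + acc.m[0][1]) % P;
        }
    }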

Performance analysis:

a) Study total runtime as a function of n. Vary n from a small number such as 16 and keep doubling until it reaches a large value (e.g., 1 million).   
b) Generate speedup charts (speedups calculated over your serial_baseline implementation), fixing n to a large number such as a million and varying the number of processors from 2 to 64.
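
Here, relative speedup is computed against the runtime of your serial_baseline implementation on the same n:

        speedup(p) = T(serial_baseline) / T(parallel with p processes)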

Compile a brief report that presents and discusses your results/findings. Quality of the presentation style in your report is important.

Deliverables (zipped into one zip file - with your names on it):
Note, for those of you who worked in a team of two, both of you should submit: one partner submits the full assignment (source code, report, and a cover page stating who the other partner was), and the other partner submits just the cover page.
    i) Cover page
    ii) Source code,
    iii) Report in PDF
Name your zip folders as Project3_MemberNames.zip. No whitespace allowed in the folder or file names. Follow the naming convention strictly. Otherwise you stand to lose points.
Rough grading rubric: code (30%), testing (40%), reporting (30%)

Total points: 10
Assignment type: Team of size up to 2 encouraged
In this project you will implement an OpenMP multithreaded PI value estimator using the algorithm discussed in class. This algorithm essentially throws a dart n times into a unit square and computes the fraction of throws that fall inside the embedded quarter circle of unit radius. This fraction multiplied by 4 gives an estimate for PI.
Here is the generation approach that you need to implement: PDF

Your code should expect two arguments: <n> <number of threads>.
Your code should output the PI value estimated at the end.  Note that the precision of the PI estimate could potentially vary with the number of threads (assuming n is fixed).
Your code should also output the total time (calculated using omp_get_wtime function).
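
A minimal sketch of this structure is given below. Note the actual point-generation approach must follow the linked PDF; rand_r and the per-thread seeding here are placeholders, and the output format is illustrative:

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
        long long n = atoll(argv[1]);        /* number of darts */
        int nthreads = atoi(argv[2]);        /* number of threads */
        long long hits = 0;
        double t0 = omp_get_wtime();
        #pragma omp parallel num_threads(nthreads) reduction(+:hits)
        {
            /* each thread needs its own random state/seed */
            unsigned int seed = 1234u + 777u * omp_get_thread_num();
            #pragma omp for
            for (long long i = 0; i < n; i++) {
                double x = (double)rand_r(&seed) / RAND_MAX;
                double y = (double)rand_r(&seed) / RAND_MAX;
                if (x * x + y * y <= 1.0)    /* inside quarter circle */
                    hits++;
            }
        }
        double elapsed = omp_get_wtime() - t0;
        printf("PI estimate = %.17g\n", 4.0 * (double)hits / (double)n);
        printf("total time  = %g s\n", elapsed);
        return 0;
    }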

Experiment with different values of n (starting from 1024 and going all the way up to a billion or more) and p (1, 2, 4, ...).

Please do two sets of experiments as instructed below:

1) For speedup - keep n fixed and increase p (1, 2, 4, 8). You may have to do this for large values of n to observe meaningful speedup. Calculate relative speedup. Note that the Pleiades nodes have 8 cores per node, so there is no point in increasing the number of threads beyond 8. In your report, show the run-time table for this testing and also the speedup chart.
PS: If you are using Pleiades (which is what I recommend), you should still use the queue system (SLURM) to make sure you get exclusive access to a node. For this, use the "sbatch -N1" option (i.e., run the code on a single node).
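For example, a one-line submission could look like the following (the executable name and arguments are illustrative; exact flags can vary with the cluster's SLURM setup):
          sbatch -N1 --wrap="./pi_estimate 1000000000 8"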

2) For precision testing - keep n/p fixed and increase p (1, 2, ... up to 16 or 32). For this, start with a granularity (n/p) value that gave you some meaningful speedup in experiment set 1. The goal of this testing is to see if the PI value estimated by your code improves in precision as n increases. Therefore, in your report make a table that shows the PI value estimated (up to, say, 20-odd decimal places) for each value of n tested.


Deliverables (zipped into one zip file - with your names on it):
Note, for those of you who worked in a team of two, both of you should submit: one partner submits the full assignment (source code, report, and a cover page stating who the other partner was), and the other partner submits just the cover page.
    i) Cover page
    ii) Source code,
    iii) Report in PDF
Name your zip folders as Project4_MemberNames.zip. No whitespace allowed in the folder or file names. Follow the naming convention strictly. Otherwise you stand to lose points.
Rough grading rubric: code (40%), testing (20%), reporting (40%)