Programming Projects and Related

Programming Projects and Related Materials

Experimental platforms:

Two experimental platforms are now available for this class use: the Pleiades cluster and the GLX cluster.

The Pleiades cluster is an 8-node, 64-core cluster. Each compute node has 8 Intel Xeon cores and 4GB of shared RAM.

The GLX cluster is a 5-node, 20-core cluster. Each compute node has 4 Intel i5 cores and 4GB of shared RAM.

Pleiades cluster: Please follow these instructions while using the Pleiades cluster.

GLX cluster: Please follow these instructions while using the GLX cluster. Note, you need to do a bunch of setup for the environment before you can run MPI jobs.

Recommended usage: Please use the Pleiades cluster as your first preference, and use GLX as backup. You are likely to find your experience using the Pleiades cluster a lot more pleasant (!) as the right MPI enviroments are already setup, and also because there is a job scheduler.

Examples:

MPI Hello World helloworld.c
MPI Send Receive test send_recv_test.c
OpenMP: Simple for loop parallelization: loop.c
OpenMP: Sum of n numbers using p threads: sumcomp.c
OpenMP: Matrix vector parallelization using p threads: matrix_vector.c
OpenMP: Synchronization primitives (critical section, atomic, locks): sync.c

Programming Projects:

Program 1 - Due date: 9/15/2016 (Thursday, 11:59pm PT) via OSBLE+ dropbox
Assignment type: Individual
The goal of this project is to empirically estimate the network parameters (latency and bandwidth constants) for the network connecting the nodes of the GLX compute cluster. To derive this estimate write a simple MPI send receive program involving only two processors (one sends and the other receives). Each MPI_Send should send a message of size m bytes to the other processor. By increasing the message size m from 1, 2, 4, 8, ... and so on, you are expected to plot a runtime curve with runtime for each send-recv communication on the Y-axis and message size (m) on the X-axis. For the message size, you may have to go on up to 1MB or so to observe a meaningful trend. Make sure you double m for each step.
From the curve derive the values for latency and bandwidth. To make sure the observed numbers are accurate, please do the send-recv inside a large loop and average the time out.
Deliverables (zipped into one zip file - with your name on it):
    i) Source code with timing functions,
    ii) Report in PDF that shows your tables and charts followed by your derivation for the network parameter estimates.
(As an alternative to MPI_Send and MPI_Recv, you are also allowed to use MPI_Sendrecv. Please look at the API for MPI routines for help.)

Program 2 - Due date: 10/11/2016 (Tuesday, 11:59pm PT) via OSBLE+ dropbox
Assignment type: Individual or Team of size 2
The goal of this is project is to do a parallel implementation of the Conway's Game of Life. Here are the detailed instructions.

Project 3 - Due date: 10/31/2016 (Monday, 11:59pm PT)via OSBLE dropbox
Assignment type: Individuals or Teams of 2

In this project you will implement a parallel random number series generator, using the Linear Congruential generating model. Here, the ith random number, denoted by x_i, is given by:
        x_i = (a*x_i-1 + b) mod P,         where, a and b are some positive constants and P is a big constant (typically a large prime).
Your goal is to implement an MPI program for generating a random series up to the n^th random number of the linear congruential series (i.e., we need all n numbers of the series, and not just the n^th number). We discussed an algorithm that uses parallel prefix in the class on 10/13/2016. You are expected to implement this algorithm. Further explanation of the parallel algorithm can be found in the notes by Prof. Aluru (chapter 1). Operate under the assumption that n>>p. Your code should have your own explicit implementation the parallel prefix operation. Your code should also get parameter values {a,b,P} and the random seed to use (same as x₀), from the user.
It should be easy to test your code by writing your own simple serial code and comparing your output.
Performance analysis:
a) Generate speedup charts (speedups calculated over your serial implementation), fixing n to a large number such as a million and varying the number of processors from 2 to 64.
b) Study total runtime as a function of n. Vary n from a small number such as 16 and keep doubling until it reaches a large value (e.g., 1 million).
Compile a brief report that presents and discusses your results/findings.
Deliverables (zipped into one zip file - with your name on it):
    i) Source code,
    ii) Report in PDF or Word

Project 4 - Due date: 11/15/2016 (Tuesday, 11:59pm PT) via OSBLE+ dropbox
Assignment type: Individual

In this project you will implement a simple PI value estimator using the algorithm discussed in class. This algorithm essentially throws a dart n times into a unit square (1x1) and computes the fraction of times that dart falls into the embedded unit circle. This fraction multiplied by 4 gives an estimate for PI.

Your code should expect two arguments: <n> <number of threads>.
Your code should output the PI value estimated at the end. Note that the precision of the PI estimate could potentially vary with the number of threads (assuming n is fixed).
Your code should also output the total time (calculated using omp_get_wtime function).

Experiment for different values of n (starting from 1024 and going all the way up to a billion or more) and p (1,2,4..).

Please do two sets of experiments as instructed below:

1) For speedup - keeping n fixed and increase p (1,2,4, 8). You may have to do this for large values of n to observe meaningful speedup. Note that the Pleiades nodes have 8 cores per node while the GLX cluster nodes have only 4 cores each. So there is no need to increase the number of threads beyond 8 or 4 depending on what system you are using. In your report show the run-time table for this testing and also the speedup chart.
PS: If you are using Pleiades (which is what I recommend) you should still use the Queue system (SLURM) to make sure you get exclusive access to a node.

2) For precision testing - keep n/p fixed, and increase p (1,2,.. up to 16 or 32). For this you will have to start with a good granularity (n/p) value which gave you some meaningful speedup from experiment set 1. The goal of this testing is to see if the PI value estimated by your code improves in precision with increase in n. Therefore, in your report make a table that shows the PI values estimated (up to some 20-odd decimal places) with each value of n tested.