Tuesday, May 24, 2011

Implementation of the parallel odd-even transposition sort algorithm with Open MPI


Today I want to share with you my experiments with Open MPI, an open-source implementation of the MPI interface. Open MPI is a C library that allows you to write parallel programs and run them on a computer cluster.
I started these experiments during the labs of the parallel programming course at ESIAL, where we implemented several algorithms, including the parallel odd-even transposition sort algorithm I will show today.




Here is a quick explanation of the algorithm:
  1. It starts by distributing n/p sub-lists to the processors (p is the number of processors and n the size of the array to sort).
  2. Each processor then sorts its sub-list sequentially.
  3. The algorithm then alternates between an even phase and an odd phase:
    1. In the even phase, even numbered processors (processor i) communicate with the next odd numbered processor (processor i+1). The two sub-lists of each communicating pair are merged; the upper half of the merged list is kept by the higher numbered processor and the lower half by the lower numbered processor.
    2. In the odd phase, odd numbered processors (processor i) communicate with the next even numbered processor (processor i+1) in exactly the same way as in the even phase.
This algorithm is not efficient (its average complexity is O(n²)), but the aim isn't to write a fast program, it is to experiment with Open MPI. A small example with 4 processors is traced below.
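To make the phases concrete, here is a small hand-traced run (my own illustration) with p = 4 processors and the array [7, 5, 2, 6, 8, 1, 4, 3]:

  Distribution:  P0=[7,5]  P1=[2,6]  P2=[8,1]  P3=[4,3]
  Local sort:    P0=[5,7]  P1=[2,6]  P2=[1,8]  P3=[3,4]
  Even phase:    P0=[2,5]  P1=[6,7]  P2=[1,3]  P3=[4,8]   (pairs P0-P1 and P2-P3)
  Odd phase:     P0=[2,5]  P1=[1,3]  P2=[6,7]  P3=[4,8]   (pair P1-P2)
  Even phase:    P0=[1,2]  P1=[3,5]  P2=[4,6]  P3=[7,8]
  Odd phase:     P0=[1,2]  P1=[3,4]  P2=[5,6]  P3=[7,8]

After p = 4 phases the distributed array is sorted.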

Implementation
The first step is to distribute a sub-list to each process: the master process (rank 0) sends a sub-array of N/hostCount elements to every process. To do that we use the MPI_Scatterv() function, a collective call that every process executes (since all the sub-arrays have the same size here, the simpler MPI_Scatter() would work as well). MPI_Scatterv() needs two extra arrays: displs, which specifies the displacement of each sub-array relative to arrayToSort, and sendcnts, which specifies the number of elements to send to each host.
int *subArray = malloc(N/hostCount * sizeof(int));

/* displacement of each sub-array relative to arrayToSort */
int *displs = malloc(hostCount * sizeof(int));
/* number of elements to send to each host */
int *sendcnts = malloc(hostCount * sizeof(int));
int i;
for (i = 0; i < hostCount; i++) {
     displs[i] = i * (N/hostCount);
     sendcnts[i] = N/hostCount;
}

/* every process receives its sub-array from rank 0 */
MPI_Scatterv(arrayToSort, sendcnts, displs, MPI_INT, subArray, N/hostCount, MPI_INT, 0, MPI_COMM_WORLD);

free(displs);
free(sendcnts);
The next step is to sort each sub-list. I chose the sequential odd-even sort algorithm. Here is my C function:
void sequentialSort(int *arrayToSort, int size) {
     int sorted = 0;
     while (sorted == 0) {
          sorted = 1;
          int i;
          /* compare and swap the pairs (1,2), (3,4), ... */
          for (i = 1; i < size-1; i += 2) {
               if (arrayToSort[i] > arrayToSort[i+1]) {
                    int temp = arrayToSort[i+1];
                    arrayToSort[i+1] = arrayToSort[i];
                    arrayToSort[i] = temp;
                    sorted = 0;
               }
          }
          /* compare and swap the pairs (0,1), (2,3), ... */
          for (i = 0; i < size-1; i += 2) {
               if (arrayToSort[i] > arrayToSort[i+1]) {
                    int temp = arrayToSort[i+1];
                    arrayToSort[i+1] = arrayToSort[i];
                    arrayToSort[i] = temp;
                    sorted = 0;
               }
          }
     }
}
The third step is the odd and even phases. For this step we need two functions: one that exchanges data with the next host and keeps the lower half of the two arrays, and one that exchanges data with the previous host and keeps the higher half.
/* Parameters:
 *   subArray : an integer array
 *   size     : the number of elements in subArray
 *   rank     : the rank of this host
 * Exchange sub-arrays with the next host (rank+1) and keep
 * the lower half of the two arrays in subArray.
 */
void exchangeWithNext(int *subArray, int size, int rank)
{
     int *nextArray = malloc(size * sizeof(int));
     MPI_Status stat;
     /* send our sub-array to the next host and receive its sub-array;
      * MPI_Sendrecv does both in one call, so two neighbours cannot
      * block each other the way two facing blocking MPI_Send calls can */
     MPI_Sendrecv(subArray, size, MPI_INT, rank+1, 0,
                  nextArray, size, MPI_INT, rank+1, 0,
                  MPI_COMM_WORLD, &stat);
     /* keep the lower half of subArray and nextArray */
     lower(subArray, nextArray, size);
     free(nextArray);
}

/* Parameters:
 *   subArray : an integer array
 *   size     : the number of elements in subArray
 *   rank     : the rank of this host
 * Exchange sub-arrays with the previous host (rank-1) and keep
 * the higher half of the two arrays in subArray.
 */
void exchangeWithPrevious(int *subArray, int size, int rank)
{
     int *previousArray = malloc(size * sizeof(int));
     MPI_Status stat;
     /* send our sub-array to the previous host and receive its sub-array */
     MPI_Sendrecv(subArray, size, MPI_INT, rank-1, 0,
                  previousArray, size, MPI_INT, rank-1, 0,
                  MPI_COMM_WORLD, &stat);
     /* keep the higher half of subArray and previousArray */
     higher(subArray, previousArray, size);
     free(previousArray);
}
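The lower() and higher() helpers are not listed in the post (two readers ask for them in the comments below). Here is a minimal sketch of how they could be written, assuming both sub-arrays are already sorted; the mergeAndKeep() helper is my own naming, not part of the original program:
/* Merge two sorted arrays of `size` elements and keep either the
 * `size` smallest (keepLower != 0) or the `size` largest elements
 * in subArray. */
static void mergeAndKeep(int *subArray, int *otherArray, int size, int keepLower)
{
     int *merged = malloc(2 * size * sizeof(int));
     int i = 0, j = 0, k;
     for (k = 0; k < 2 * size; k++) {
          if (j >= size || (i < size && subArray[i] <= otherArray[j]))
               merged[k] = subArray[i++];
          else
               merged[k] = otherArray[j++];
     }
     int offset = keepLower ? 0 : size;
     for (k = 0; k < size; k++)
          subArray[k] = merged[k + offset];
     free(merged);
}

void lower(int *subArray, int *otherArray, int size)
{
     mergeAndKeep(subArray, otherArray, size, 1);
}

void higher(int *subArray, int *otherArray, int size)
{
     mergeAndKeep(subArray, otherArray, size, 0);
}
This merge-based approach works because each sub-array is sorted once by sequentialSort() and every exchange keeps it sorted.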
We can then easily write the odd and even phases:
int i;
for (i = 0; i < hostCount; i++) {
     /* even phase */
     if (i % 2 == 0) {
          if (rank % 2 == 0) {
               /* even numbered hosts communicate with the next odd numbered host */
               /* make sure the next odd numbered host exists */
               if (rank < hostCount-1) {
                    /* exchange sub-arrays with the next odd numbered host
                     * and keep the lower half of the two arrays */
                    exchangeWithNext(subArray, N/hostCount, rank);
               }
          } else {
               /* odd numbered hosts communicate with the previous even numbered host */
               /* make sure the previous even numbered host exists */
               if (rank-1 >= 0) {
                    /* exchange sub-arrays with the previous even numbered host
                     * and keep the higher half of the two arrays */
                    exchangeWithPrevious(subArray, N/hostCount, rank);
               }
          }
     }
     /* odd phase */
     else {
          if (rank % 2 != 0) {
               /* odd numbered hosts communicate with the next even numbered host */
               /* make sure the next even numbered host exists */
               if (rank < hostCount-1) {
                    /* exchange sub-arrays with the next even numbered host
                     * and keep the lower half of the two arrays */
                    exchangeWithNext(subArray, N/hostCount, rank);
               }
          } else {
               /* even numbered hosts communicate with the previous odd numbered host */
               /* make sure the previous odd numbered host exists */
               if (rank-1 >= 0) {
                    /* exchange sub-arrays with the previous odd numbered host
                     * and keep the higher half of the two arrays */
                    exchangeWithPrevious(subArray, N/hostCount, rank);
               }
          }
     }
}
Now our array is sorted: the lowest elements belong to the host with rank 0 and the highest to the host with rank p-1. We just need to gather the data with MPI_Gather():
 MPI_Gather(subArray,N/hostCount,MPI_INT,arrayToSort,N/hostCount,MPI_INT,0,MPI_COMM_WORLD);  
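For reference, the variables rank, hostCount, N and arrayToSort used in the snippets above come from the usual MPI setup. Here is a minimal skeleton of how they could be declared (the value of N and the random initialization are my assumptions, not taken from the original program):
#include <stdlib.h>
#include <mpi.h>

#define N 128000   /* must be a multiple of the number of processes */

int main(int argc, char **argv)
{
     int rank, hostCount;
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &hostCount);

     int *arrayToSort = NULL;
     if (rank == 0) {
          /* the master fills the array to sort with random values */
          arrayToSort = malloc(N * sizeof(int));
          int i;
          for (i = 0; i < N; i++)
               arrayToSort[i] = rand();
     }

     /* 1. distribute the sub-arrays with MPI_Scatterv (code above)  */
     /* 2. sort locally: sequentialSort(subArray, N/hostCount);      */
     /* 3. run the odd/even phases loop shown above                  */
     /* 4. gather the sorted sub-arrays on rank 0 with MPI_Gather()  */

     if (rank == 0)
          free(arrayToSort);
     MPI_Finalize();
     return 0;
}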

Performance
The question is: how does the execution time of the parallel algorithm compare to the sequential one? It is pretty good: on average, on a cluster of 8 machines and for an array of 128,000 elements, the parallel algorithm runs in 1.50 s, while the sequential one runs in 81.71 s on a single machine.
This is a case of super-linear speedup (81.71 / 1.50 ≈ 54 with only 8 machines), which is expected here: since the sort is O(n²), sorting a sub-array 8 times smaller runs roughly 8² = 64 times faster.

Feel free to reproduce this experiment: you can use my C program for the parallel version as well as my sequential one.
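If you want to build and run the parallel version yourself, a typical Open MPI command sequence looks like this (the source and host file names are just examples, adapt them to your setup):

mpicc odd-even.c -o odd-even
mpirun -np 8 --hostfile hosts ./odd-even

On a single machine you can drop --hostfile; depending on your Open MPI version you may need --oversubscribe if you start more processes than you have cores.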

18 comments:

  1. Good tutorial and well documented. I have tried to use your code and test it, but it runs for over 30 min when I use 2 processors and 100,000 numbers. Is this normal for your program? Note: I am new to MPI programming.

    Replies
    1. Thanks for your support :)

      I don't think there is a problem with this huge execution time. In this algorithm, each process has to sequentially sort an array of N/hostCount elements (in your case 50,000). The sort algorithm is totally inefficient, so I am not surprised.

      If you have enough time you can try to sort a 100,000-element array with the sequential version and compare that result to the time you get with the parallel version.
      Or you can just try with a smaller N, 20,000 for example.

    2. Thanks for your reply. Unfortunately the server I am running the program on has a max execution time of 30 mins, but what puzzles me is that, e.g., 2 processors can parallel-sort 30,000 numbers in 3.5 s, yet for 50,000 the program runs up to the max execution time without finishing the parallel sort.

      Furthermore, 6 processors can sort 50,000 numbers in 1.09 s but cannot sort 100,000; this huge change in time caused my confusion. After doing some analysis I have observed that if each processor has about 15,000 numbers to sort, the parallel sorting fails. I am still looking into it, but since you are the author you might have a better idea about the way it works.

      The sequential version works just fine.

    3. Sorting ~15,000 is fine with the sequential version, but sorting ~15,000 x hostCount with the parallel version isn't?
      If that's the case, there is probably a problem in the exchange step.

      I am not an expert in Open MPI or parallel programming. I'm a student and I posted this code to help others get started with MPI, so it might contain some errors :)

    4. Exactly, that's the case. I will check it more to see if I can find a solution. Thanks for your time and the hint about where the problem might be.

  2. Hello
    I compiled the program and ran it with the following commands:
    mpicc -c odd-even.c -o a
    mpiexec -comm mpich-gm ./a

    and I have this result:
    1-no mpd is running on this host
    2-an mpd is running but was started without a "console" (-n option)

    How can I solve this problem?

  3. Hi,

    Are you aware of an implementation that would also work with limited RAM and sort a file of strings? I have a 21 GB file with 2 billion lines of strings that I want to sort. I can find all these MPI sorting algorithms on the net, but none of them have memory management implemented.

  4. Hi Sergej,
    This is a very good use case for Big Data technology.
    In my current company we use Hadoop with Pig Latin for this kind of task. You can start very quickly by installing a local version of Pig: http://pig.apache.org/ (currently down). If it matches your needs you can then set up a Hadoop cluster (or use Amazon's service) and run your script in a distributed context (without changing the code you used in the local environment).
    Hope this helps :)

  5. Please, where are the definitions of the higher and lower functions?

  6. Hi, I want to thank you for giving an explanation and sharing the source code with the public.

    When I ran the program with 4044 numbers, using "mpirun -np 4 oddeven", the program got stuck somewhere in the code and all my CPUs stayed at 100% usage (I'm using a laptop with an i3 processor). In my analysis, there is a communication problem between the hosts. If I use 4043 numbers with -np 4, the program works. And if I use -np 5 instead of 4 (for >4044), the program also finishes successfully. But if I use a larger amount of data with a small number of hosts, it always gets stuck. It's like each of my CPUs cannot handle sending a large amount of data. That doesn't make much sense, since it's only n/p numbers: 4044/4 is 1011 numbers per host. Fortunately, that is not the problem. I found a little mistake in your code where each host communicates with the next and previous host in the exchangeWithNext and exchangeWithPrevious functions. Based on Barry Wilkinson and Michael Allen's book on Parallel Programming, in parallel odd-even transposition, for both phases (odd and even), even numbered hosts must call MPI_Recv() first and then MPI_Send(), and odd numbered hosts must call MPI_Send() first and then MPI_Recv(). Based on this information, I modified the code, and now my problems are gone. I can sort any amount of data with any number of processes.

    As SHERIFFO stated in the first comment, I tried to sort 10,000 numbers with 2 processors and it ran for 15 seconds.

    And I want to ask: why are you using MPI_Scatter and MPI_Scatterv in the same code? I tried to use MPI_Scatterv only and it also works.
    I hope this can be helpful, and I'm sorry for my English. :)

    Replies
    1. I'm sorry, that is 100,000 numbers, not 10,000. :D

  7. How do I compile and run this?

  8. Can you provide the functions higher and lower, please?
