多线程入门之矩阵乘法

问题：矩阵相乘一系列优化。下面我们以c[2000][2000] = a[2000][2000] * b[2000][2000] 为例来进行优化。
环境：CPU intel 9600K 6 核心 6 线程 3.7 GHz

多线程

To optimize the performance of this program, we have to use the advantage of multi-cores. We will create a thread for each row in a matrix that does the multiplication in parellel and reduce the processing time.

Let us create a Thread class that implements Runnable Interface.

RowMultiplyWorker.class :

public class RowMultiplyWorker implements Runnable /* extends Thread */ {
    public RowMultiplyWorker(int[][] result, int[][] matrix1, int[][] matrix2, int row) {
        this.result = result;
        this.matrix1 = matrix1;
        this.matrix2 = matrix2;
        this.row = row;
    }

    @Override
    public void run() {
        for (int i = 0; i < matrix2[0].length; i++) {
            result[row][i] = 0;
            for (int j = 0; j < matrix1[row].length; j++) {
                result[row][i] += matrix1[row][j] * matrix2[j][i];
            }
        }
    }

    private int[][] matrix1;
    private int[][] matrix2;
    private final int[][] result;
    private final int row;
}

Next, create a class to create 10 threads at a time because if we create 2000 threads for 2000 x 2000 matrix then the application gets to hang up. So we will be using the 10 threads as a group and let them complete then again initiate the next 10 threads untill complete each row multiplication.

ParallelThreadCreator.class :

import java.util.List;
import java.util.ArrayList;

public class ParallelThreadCreator {

    // creating 10 threads and waiting for them to complete then again repeat steps.
    public static void multiply(int[][] matrix1, int[][] matrix2, int[][] result) {
        List<Thread> threads = new ArrayList<>();
        int rows1 = matrix1.length;
        for (int i = 0; i < rows1; i++) {
            // 方式一：直接创建线程（RowMultiplyWorker implements Runnable）
            RowMultiplyWorker task = new RowMultiplyWorker(result, matrix1, matrix2, i);
            Thread thread = new Thread(task);
            thread.start();
            threads.add(thread);

//            // 方式二：以子类的方式创建线程（RowMultiplyWorker extends Thread）
//            Thread thread = new RowMultiplyWorker(result, matrix1, matrix2, i);
//            thread.start();
//            threads.add(thread);

            if (threads.size() % 10 == 0) {
                waitForThreads(threads);
            }
        }
    }

    private static void waitForThreads(List<Thread> threads) {
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        threads.clear();
    }
}

Let us create the main class to test the time taking using this approach.

Main.class :

import java.util.Random;

public class Main {
    public static void main(String[] args) {
        // 初始化a,b数组
        init();

        // 无任何优化
        int t1 = get1();
        System.out.println("单线程串行：" + t1 + "毫秒，约" + t1 / 1000 + "秒");
        System.out.println("==================================================");

        // 利用 Cache：改变循环的嵌套顺序
        int t2 = get2();
        System.out.println("单线程串行（利用 Cache：改变循环的嵌套顺序）：" + t2 + "毫秒，约" + t2 / 1000 + "秒");
        // 检查get2()计算的d数组，结果是否正确
        if (check(c, d)) System.out.println("正确");
        else System.out.println("错误");
        System.out.println("==================================================");

        // 多线程
        long start = System.currentTimeMillis();
        ParallelThreadCreator.multiply(a, b, e);
        long end = System.currentTimeMillis();
        int t3 = (int) (end - start);
        System.out.println("多线程并行：" + t3 + "毫秒，约" + t3 / 1000 + "秒");
        // 检查多线程计算的e数组，结果是否正确
        if (check(c, e)) System.out.println("正确");
        else System.out.println("错误");
    }

    // 返回一次矩阵乘法运算所需时间（ms）
    private static int get1() {
        long start = System.currentTimeMillis();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
        long end = System.currentTimeMillis();
        return (int) (end - start);
    }

    // 返回一次矩阵乘法运算所需时间（ms）无任何优化
    private static int get2() {
        long start = System.currentTimeMillis();
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++)
                for (int j = 0; j < N; j++)
                    d[i][j] += a[i][k] * b[k][j];
        long end = System.currentTimeMillis();
        return (int) (end - start);
    }

    private static void init() {
        // a，b矩阵赋值 [1, 100] 范围内的随机数
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                // 元素值为[1, 100]的随机数
                a[i][j] = rd.nextInt(seed) + 1;
                b[i][j] = rd.nextInt(seed) + 1;
            }
    }

    private static boolean check(int[][] matrix1, int[][] matrix2) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (matrix1[i][j] != matrix2[i][j])
                    return false;

        return true;
    }

    private static final int N = 2000, seed = 100;
    private static int[][] a = new int[N][N];
    private static int[][] b = new int[N][N];
    private static int[][] c = new int[N][N];   // 无任何优化计算的结果
    private static int[][] d = new int[N][N];   // 利用 Cache：改变循环的嵌套顺序计算的结果
    private static int[][] e = new int[N][N];   // 多线程计算的结果
    private static Random rd = new Random();
}

In this article, we have seen how to build the concurrent application for Matrix Multiplication. Now compare the time taken for 2000 x 2000 sized matrix is 65 seconds in the normal program and when we applied the multiple threads then it took only 17 seconds to complete this.

Still we can improve parallel version using Runtime.getRuntime().availableProcessors(). This returns the threads available to use and this count can be instead of 10 threads.

So, It is better to use the multiple threads efficiently for large scale applications.

实验结果

因为我们是把N行分组，然后把该组交由多线程计算，所以组内行数必须能整除N。否则最后会有几行没有计算。

当我们设置矩阵维度N = 2000，组内行数（线程数量）设置为20时，结果如下：

单线程串行：48698毫秒，约48秒
==================================================
单线程串行（利用 Cache：改变循环的嵌套顺序）：5859毫秒，约5秒
正确
==================================================
多线程并行：8210毫秒，约8秒
正确

当我们设置矩阵维度N = 2000，组内行数（线程数量）设置为10时，结果如下：

单线程串行：48904毫秒，约48秒
==================================================
单线程串行（利用 Cache：改变循环的嵌套顺序）：5864毫秒，约5秒
正确
==================================================
多线程并行：8887毫秒，约8秒
正确

当我们设置矩阵维度N = 2000，组内行数（线程数量）设置为8时，结果如下：

单线程串行：48263毫秒，约48秒
==================================================
单线程串行（利用 Cache：改变循环的嵌套顺序）：5871毫秒，约5秒
正确
==================================================
多线程并行：8934毫秒，约8秒
正确

当我们设置矩阵维度N = 2000，组内行数（线程数量）设置为4时，结果如下：

单线程串行：49045毫秒，约49秒
==================================================
单线程串行（利用 Cache：改变循环的嵌套顺序）：6003毫秒，约6秒
正确
==================================================
多线程并行：11993毫秒，约11秒
正确

当我们设置矩阵维度N = 2000，组内行数设置为2时，结果如下：

单线程串行：50119毫秒，约50秒
==================================================
单线程串行（利用 Cache：改变循环的嵌套顺序）：5861毫秒，约5秒
正确
==================================================
多线程并行：24061毫秒，约24秒
正确

当我们设置矩阵维度N = 2000，组内行数（线程数量）设置为1时，结果如下：

单线程串行：51972毫秒，约51秒
==================================================
单线程串行（利用 Cache：改变循环的嵌套顺序）：6111毫秒，约6秒
正确
==================================================
多线程并行：49762毫秒，约49秒
正确

我们可以发现设置过多（超过硬件本身线程规格）或过少（没有充分利用硬件本身线程规格）的线程数都不合适，线程数量的设置应该与硬件提供的线程数相当。当然，我们也不能一味的盲目推崇多线程，而不注重程序的优化，这一点从 利用 Cache：改变循环的嵌套顺序 数据就可以看出。它的用时更短，当然这也和我们硬件本身提供的线程数量有限有关，毕竟只有 6 线程。

目录CONTENT

多线程入门之矩阵乘法

多线程

实验结果

Preference

评论区