High-Performance Python Extensions (Part 3)


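Introduction

This is the third article in a series on writing high-performance C extension modules for Python using the NumPy API. In this article, we use OpenMP to parallelize the implementation from Part 2.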

Review

World is a class that stores the state of the N bodies. Our simulation evolves that state over a series of time steps.

    import numpy as np


    class World(object):
        """World is a structure that holds the state of N bodies and
        additional variables.

        threads : (int) The number of threads to use for multithreaded
                  implementations.
        dt      : (float) The time-step.

        STATE OF THE WORLD:

        N : (int) The number of bodies in the simulation.
        m : (1D ndarray) The mass of each body.
        r : (2D ndarray) The position of each body.
        v : (2D ndarray) The velocity of each body.
        F : (2D ndarray) The force on each body.

        TEMPORARY VARIABLES:

        Ft : (3D ndarray) A 2D force array for each thread's local storage.
        s  : (2D ndarray) The vectors from one body to all others.
        s3 : (1D ndarray) The norm of each s vector.

        NOTE: Ft is used by parallel algorithms for thread-local
              storage. s and s3 are only used by the Python
              implementation.
        """
        def __init__(self, N, threads=1,
                     m_min=1, m_max=30.0, r_max=50.0, v_max=4.0, dt=1e-3):
            self.threads = threads
            self.N = N
            # Random initial masses, positions and velocities.
            self.m = np.random.uniform(m_min, m_max, N)
            self.r = np.random.uniform(-r_max, r_max, (N, 2))
            self.v = np.random.uniform(-v_max, v_max, (N, 2))
            self.F = np.zeros_like(self.r)
            # One (N, 2) force array per thread.
            self.Ft = np.zeros((threads, N, 2))
            self.s = np.zeros_like(self.r)
            self.s3 = np.zeros_like(self.m)
            self.dt = dt
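The docstring describes Ft as holding one 2D force array per thread. As a rough illustration of why such thread-local buffers are useful, the sketch below sums the per-thread buffers into a single force array once every thread has finished. The function name reduce_thread_forces and the use of plain Python lists instead of NumPy arrays are assumptions made for this example. It is not the C/OpenMP code from this series.

```python
def reduce_thread_forces(Ft):
    """Sum per-thread force buffers, shape (threads, N, 2), into one (N, 2) list.

    Hypothetical helper: each thread is assumed to have written only to its
    own buffer, so no locking is needed until this final reduction.
    """
    n_bodies = len(Ft[0])
    F = [[0.0, 0.0] for _ in range(n_bodies)]
    for thread_buffer in Ft:  # one (N, 2) buffer per thread
        for i, (fx, fy) in enumerate(thread_buffer):
            F[i][0] += fx
            F[i][1] += fy
    return F


# Two hypothetical threads, three bodies.
Ft = [
    [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]],
    [[0.5, 0.5], [0.0, -2.0], [3.0, 1.0]],
]
F = reduce_thread_forces(Ft)
```

Giving each thread its own buffer trades extra memory (threads x N x 2 values) for the freedom to accumulate forces without synchronization.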

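At the start of the simulation, each of the N bodies is randomly assigned a mass m, a position r, and a velocity v. Each time step then computes the following:

The net force F on each body, computed from all the other bodies.
The velocity v of each body, updated according to the force acting on it.
The position r of each body, updated according to its velocity.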
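The last two steps can be sketched as follows, assuming the forces F have already been computed. The update rule (velocity first, then position using the new velocity) and the function name step are assumptions for illustration; the text only says that force changes velocity and velocity changes position. Plain lists stand in for the NumPy arrays so that the sketch runs without dependencies.

```python
def step(m, r, v, F, dt):
    """Advance velocities and positions in place by one time step dt.

    Assumed integrator: v += F / m * dt, then r += v * dt.
    """
    for i in range(len(m)):
        for k in range(2):  # x and y components
            v[i][k] += F[i][k] / m[i] * dt
            r[i][k] += v[i][k] * dt


# One body of mass 2 pushed along x by a force of 4 for half a time unit.
m = [2.0]
r = [[0.0, 0.0]]
v = [[1.0, 0.0]]
F = [[4.0, 0.0]]
step(m, r, v, F, dt=0.5)
```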

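Computing the forces: serial code

Below is the compute_F function from the previous article's implementation (the full source code is available here). It computes the interaction force between every pair of bodies in the simulation, with O(N^2) complexity.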

    static inline void compute_F(npy_int64 N,
                                 npy_float64 *m,
                                 __m128d *r,
                                 __m128d *F) {
      npy_int64 i, j;
      __m128d s, s2, tmp;
      npy_float64 s3;

      // Set all forces to zero.
      for(i = 0; i < N; ++i) {
        F[i] = _mm_set1_pd(0);
      }

      // Compute forces between pairs of bodies.
      for(i = 0; i < N; ++i) {
        for(j = i + 1; j < N; ++j) {
          s = r[j] - r[i];
          s2 = ...
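The listing breaks off inside the inner loop. The pairwise pattern it starts (visit each pair i < j once, then apply equal and opposite forces to both bodies) can be sketched in plain Python as below. The force law used here, m[i] * m[j] * s / |s|^3 with the gravitational constant set to 1, is an assumption inferred from the variable names s and s3. It is not taken from the truncated listing, and the sketch uses lists rather than the __m128d vectors of the C code.

```python
import math


def compute_F(m, r):
    """Return the net force on each body from all pairwise interactions.

    Assumed force law: F_ij = m[i] * m[j] * s / |s|**3, where s = r[j] - r[i].
    """
    n = len(m)
    F = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair is visited once
            sx = r[j][0] - r[i][0]
            sy = r[j][1] - r[i][1]
            s3 = math.sqrt(sx * sx + sy * sy) ** 3  # |s| cubed
            scale = m[i] * m[j] / s3
            # Equal and opposite forces on the two bodies.
            F[i][0] += scale * sx
            F[i][1] += scale * sy
            F[j][0] -= scale * sx
            F[j][1] -= scale * sy
    return F


# Two bodies two units apart on the x axis attract each other.
forces = compute_F([1.0, 2.0], [[0.0, 0.0], [2.0, 0.0]])
```

Visiting only pairs with j > i halves the work compared with a full double loop, but it also means each iteration writes to two different rows of F. A parallel version has to take care of that.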
