Categories
Programming

Accelerating Python Code With ctypes – An Example

The aim of this post is to demonstrate the use of the ctypes package to accelerate the solution of a problem by invoking a C function for a computationally intensive task. I will not discuss ctypes in excruciating detail (one should consult the documentation for in-depth information about the package), but instead will provide the basic introduction needed for a user to begin experimenting with ctypes. I selected the problem for this example in the hopes that it would be complex enough to expose some key components of ctypes, but not so complex that it distracts from the main point.

In this post I will demonstrate how to write a function in C, compile it into a shared library, and invoke it from Python using ctypes. The function in question will take multiple parameters, including pointers. It will not return any value, instead relying on the user to provide a pointer to pre-allocated memory for the output. One may be tempted to allocate memory in the C function, and return a pointer to this memory instead, but due to the way the memory is managed, this can cause problems. This is not to say that it is impossible to do so, (after all, often times you don’t know the size of the output of a function before invoking it) but it should be done carefully to avoid memory leaks. I will cover how to do this in a future post.

I will use gcc as my compiler in this example.

My work often requires me to process large datasets using methods that are not easily “vectorized” (i.e. implemented completely by NumPy operations to maximize speed) using NumPy. For instance I have worked on multiple projects that have a common sub-problem: Given a set rays \(\{r_i\}_{i=1}^N \subset \mathbb{R}^6\) (That is base points, with the direction of the ray), and a regular voxel grid \(\{v_{ijk}\}_{(i,j,k)}^{(N_x,N_y,N_z)}\), count how many rays intersect each voxel.

It may be possible to vectorize parts of this sub-problem, but the implementation quickly gets convoluted, and may not be particularly efficient. An algorithm to solve this problem is simple enough to formulate and implement. For the sake of simplicity I will use a simple, yet inaccurate algorithm (A more accurate, yet slightly more complicated algorithm can be found at here).

I will assume that the size of the voxel grid is known beforehand, and that the origins of all the rays are close to the voxel grid. Under these assumptions the problem can be solved approximately by simply marching along each ray, one-by-one, and checking which voxel we fall in at each step. This is inaccurate because it is possible that if the step size is too large as we march along a ray, we can entirely miss voxels. For the purposes of this post I am more focused on demonstrating how ctypes can speedup the implementation over Python however, so I will leave the reader to think about how this issue can be rectified (or follow the paper linked previously).

The psuedocode for the algorithm is given below:

Input: voxel width, voxel grid dimensions, voxel_grid_origin, rays
For each ray:
     t=0
     old_voxel_idx = null
     While t < L:
          position = ray_origin+t*ray_direction
          voxel_idx = floor((position-voxel_grid_origin)/voxel_width)
          t+=step_size
          if(voxel_idx != old_voxel_idx)
               counter[voxel_idx]++
               old_voxel_idx=voxel_idx

Those with some familiarity working with larger computational problems in Python may have already identified an issue. Namely, this approach includes a nested loop. Python is inherently slow since it is an interpreted language (every line is converted into machine code once it is encountered, which adds overhead). Loops, especially nested loops, are very slow as a result. Modern computers are so fast that the lack of speed may not be observable to a human for relatively small problems, but for very large loops one may find themselves waiting several minutes for a computation to complete.

This is an ideal candidate for utilizing ctypes. The ctypes Python package allows for interface between Python and shared C libraries. With ctypes you can write a function in C, compile it into a shared library, and then invoke the function in your Python code.

For simplicity I will discuss 2D implementations of this pseudocode. I will start with the python implementation, given in the listing below.

import numpy as np
def RayVoxelCount(voxel_width, voxel_grid_dimensions, voxel_grid_origin,rays,voxel_counts,L=2,step_size=0.01):
	N = int(L/step_size) #Precompute number of steps so a for loop can be used instead of a while loop (for loops are faster in Python)
	for ray in rays:
		old_voxel_idx=None
		for i in range(N):
			t=i*step_size
			position = ray[0:2]+t*ray[2:4]
			voxel_idx = ((position-voxel_grid_origin)/voxel_width).astype(np.int)
			if (voxel_idx[0] > -1 and voxel_idx[0] < voxel_grid_dimensions[0] and voxel_idx[1] > -1 and voxel_idx[1] < voxel_grid_dimensions[1]):
				voxel_idx_linear = voxel_grid_dimensions[0]*voxel_idx[1]+voxel_idx[0]
				if (not (voxel_idx_linear == old_voxel_idx)):
					voxel_counts[voxel_idx_linear] = voxel_counts[voxel_idx_linear] + 1
					old_voxel_idx = voxel_idx_linear

This implementation is pretty straight-forward. I save this code in a file called RayVoxel.py.

Next, I will discuss how to implement this in a C function. Although it is possible to use data structures such as classes, and even other libraries in C with ctypes, to keep this simple and self-contained, I will not use any of those features. For single row-vector parameters such as the voxel grid dimensions I could opt to pass a pointer to this data, but I chose to break it into separate parameters component by component. You can tinker with this to your hearts content, and adopt the scheme you prefer. The C code is shown in the listing below.

void RayVoxelCount(double voxel_width, int voxel_grid_dimensions_x, int voxel_grid_dimensions_y, double voxel_grid_origin_x, double voxel_grid_origin_y, double* rays, int num_rays, int* voxel_counts, double L, double step_size)
{
	int N = L/step_size;
	for (int i = 0; i < num_rays; i++)
	{
		double* ray = rays+4*i;
		int old_voxel_idx = -1;
		for (int j = 0; j<N;j++)
		{
			double t = j*step_size;
			double position_x = ray[0]+t*ray[2];
			double position_y = ray[1]+t*ray[3];
			int voxel_idx_x = (position_x-voxel_grid_origin_x)/voxel_width;
			int voxel_idx_y = (position_y-voxel_grid_origin_y)/voxel_width;
			if (voxel_idx_x > -1 && voxel_idx_x < voxel_grid_dimensions_x && voxel_idx_y > -1 && voxel_idx_y < voxel_grid_dimensions_y)
			{
				int voxel_idx_linear = voxel_idx_y*voxel_grid_dimensions_x+voxel_idx_x;
				if (voxel_idx_linear != old_voxel_idx)
				{
					voxel_counts[voxel_idx_linear]++;
					old_voxel_idx = voxel_idx_linear;
				}
			}
		}
	}
}

I save this function in a file called RayVoxel.c.

Now in order to be able to use this function in a shared library, I will need to add a header file that will declare this function to the C compiler. The header file is shown is the listing below.

//extern keyword will ensure this function gets exposed in the shared library after compilation
extern "C"
{
	void RayVoxelCount(double voxel_width, int voxel_grid_dimensions_x, int voxel_grid_dimensions_y, double voxel_grid_origin_x, double voxel_grid_origin_y, double* rays, int num_rays, int* voxel_counts, double L, double step_size);
	
}

This code is saved in a file called RayVoxel.h. This header only declares that a function called RayVoxelCount exists, and that it should be accessible in the shared library, but the implementation of function is defined in RayVoxel.c, and that needs to be compiled before it can be used by the computer. So in order to make this into a shared library we can use, we need to compile the code as follows.

gcc -c -Wall -Werror -fpic RayVoxel.c
gcc -shared -o libRayVoxel.so RayVoxel.o

These commands should be run in a terminal in the directory where you saved RayVoxel.c and RayVoxel.h. The first command compiles the implementation of the function into machine code our machine can use. The -c flag tells the compiler to just turn our code into machine code without linking it to a program (otherwise it will complain that it can’t find a ‘main’ function i.e. an entrypoint for a program). The -Wall and -Werror flags just set the verbosity of the info gcc prints to the console during compilation. The -fpic flag will instruct the compiler to produce position independent code. This means that multiple programs can use the code at the same time, each with their own memory spaces. Finally, RayVoxel.c is just the name of the source code file.

The output of the first line will be an object file with a .o extension containing the machine code for our function. The second line will construct a shared library which links to that machine code. The -shared flag tells the compiler to produce a shared library, the -o flag allows us to name the library with the following argument (in our case libRayVoxel.so) and finally the last argument is the name of the object file, in our case RayVoxel.o. If everything was done correctly there should now be a file named libRayVoxel.so in the directory you are working in.

Now I can make a “C version” of the RayVoxelCount function that I can use as a drop in replacement in Python using ctypes. I can put this in its own file if I want, but I will just append the following snippet of code to RayVoxel.py

import ctypes
lib = ctypes.cdll.LoadLibrary('./libRayVoxel.so')

def RayVoxelCountC(voxel_width, voxel_grid_dimensions, voxel_grid_origin,rays,voxel_counts,L=2,step_size=0.01):
	lib.RayVoxelCount(ctypes.c_double(voxel_width),ctypes.c_int(voxel_grid_dimensions[0]),ctypes.c_int(voxel_grid_dimensions[1]),ctypes.c_double(voxel_grid_origin[0]),ctypes.c_double(voxel_grid_origin[1]),rays.ctypes.data,ctypes.c_int(rays.shape[0]),voxel_counts.ctypes.data,ctypes.c_double(L),ctypes.c_double(step_size))

The second line of the snippet above just loads the shared library I just made. The last line invokes the function RayVoxelCount in that library. This setup of having a Python function immediately (or with minimal boiler plate setup) invoke a C function is often called a wrapper.

Since the C code I wrote knows nothing about Python or NumPy I need to make sure it is receiving the data in the way it expects. This is why I have wrapped each individual argument in a ctypes.c_int or ctypes.c_double invocation. For the pointer arguments, NumPy arrays have a built-in method to get a ctypes pointer to the data by just calling MyNumpyArray.ctypes.data. In Python classes often implement some sort of size or length function so you can see e.g. how many entries are in an array. In C I would need to either explicitly know the size beforehand (which is how I set the code up here) or implement some other function to calculate the size. Just keep that in mind as you look at the code.

Now I can compare the two different implementations with the following code.

import numpy as np

num_rays = 1000
num_rows = 100
num_cols = num_rows
voxel_width = 0.02
N = 10


voxel_grid_origin = np.array([-1.0, -1.0])
voxel_grid_dimensions = np.array([num_rows, num_cols],dtype=int)

voxel_counts = np.zeros(np.prod(voxel_grid_dimensions))

import time
from RayVoxel import RayVoxelCount
from RayVoxel import RayVoxelCountC

tic = time.time()
for i in range(N):

	ray_origins = np.random.normal(0,1,size=(num_rays,2))
	ray_directions = np.random.normal(0,1,size=(num_rays,2))
	ray_direction_norms = np.linalg.norm(ray_directions,axis=1)
	ray_directions[:,0] = ray_directions[:,0]/ray_direction_norms
	ray_directions[:,1] = ray_directions[:,1]/ray_direction_norms
	ray_directions[ray_direction_norms == 0,:] = 0

	rays = np.hstack((ray_origins,ray_directions))

	RayVoxelCount(voxel_width, voxel_grid_dimensions, voxel_grid_origin,rays,voxel_counts)

toc = time.time()
print('Average run time with Python: ', (toc-tic)/N)

tic = time.time()
for i in range(N):

	ray_origins = np.random.normal(0,1,size=(num_rays,2))
	ray_directions = np.random.normal(0,1,size=(num_rays,2))
	ray_direction_norms = np.linalg.norm(ray_directions,axis=1)
	ray_directions[:,0] = ray_directions[:,0]/ray_direction_norms
	ray_directions[:,1] = ray_directions[:,1]/ray_direction_norms
	ray_directions[ray_direction_norms == 0,:] = 0

	rays = np.hstack((ray_origins,ray_directions))

	RayVoxelCountC(voxel_width, voxel_grid_dimensions, voxel_grid_origin,rays,voxel_counts)

toc = time.time()
print('Average run time with ctypes: ', (toc-tic)/N)

This code simply sets up a 100×100 voxel grid and then for each function generates 1000 random rays 10 times and calculates the average run time (You can play with different parameters if you are interested). On my desktop which sports an AMD Ryzen 9 3950X, the native Python implementation averages 1.239 seconds, and the C implementation averages 0.002 seconds. That is a dramatic speedup of over 600x!

You probably shouldn’t expect such a dramatic speedup in most cases, but I have found that it is often times worthwhile to do even a naive implementation of a function in C and then use ctypes to integrate it into my Python code.

Leave a Reply

Your email address will not be published. Required fields are marked *