Let's write up a little program (time_calc0.py) that will do a speed comparison between a scalar multiplication on the CPU and the same operation on the GPU. We will then use NumPy's allclose function to compare the two output values. We will generate an array of 50 million random 32-bit floating point values (at 4 bytes per value, this amounts to roughly 200 megabytes of data, so it should be entirely feasible on any somewhat modern host and GPU device with a few gigabytes of memory), and then we will time how long it takes to scalar-multiply the array by two on each device. Finally, we will compare the output values to ensure that they are equal. Here's how it's done:
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
...
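The listing above is truncated. As a minimal sketch of how the full time_calc0.py might look, following the description above (the variable names and the use of Python's time.time for the timing are my assumptions; the imports are repeated so the sketch is self-contained):

import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from time import time

# Generate 50 million random 32-bit floats (roughly 200 MB of data).
host_data = np.float32(np.random.random(5 * 10**7))

# Time the scalar multiplication by two on the CPU.
t1 = time()
host_data_2x = host_data * np.float32(2)
t2 = time()
print('total time to compute on CPU: %f' % (t2 - t1))

# Copy the array onto the GPU, then time the same operation there.
device_data = gpuarray.to_gpu(host_data)
t1 = time()
device_data_2x = device_data * np.float32(2)
t2 = time()
print('total time to compute on GPU: %f' % (t2 - t1))

# Copy the result back to the host and verify it against the CPU result.
from_device = device_data_2x.get()
print('Is the host computation the same as the GPU computation? : {}'.format(
    np.allclose(from_device, host_data_2x)))

Note that PyCUDA compiles the elementwise kernel behind device_data * np.float32(2) the first time it is used, so on a first run the GPU timing includes this one-time overhead; running the script a second time gives a fairer comparison.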