前面三节已经对CUDA做了一个简单的介绍,这一节开始真正进入编程环节。
首先,初学者应该对自己使用的设备有较为扎实的理解和掌握,这样对后面学习并行程序优化很有帮助,了解硬件详细参数可以通过上节介绍的几本书和官方资料获得,但如果仍然觉得不够直观,那么我们可以自己动手获得这些内容。
以第二节例程为模板,我们稍加改动的部分代码如下:
// Add vectors in parallel. cudaError_t cudaStatus; int num = 0; cudaDeviceProp prop; cudaStatus = cudaGetDeviceCount(&num); for(int i = 0;i<num;i++) { cudaGetDeviceProperties(&prop,i); } cudaStatus = addWithCuda(c, a, b, arraySize);
这个改动的目的是让我们的程序自动通过调用cuda API函数获得设备数目和属性,所谓“知己知彼,百战不殆”。
cudaError_t 是cuda错误类型,取值为整数。
cudaDeviceProp为设备属性结构体,其定义可以从cuda Toolkit安装目录中找到,我的路径为:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include\driver_types.h,找到定义为:
/** * CUDA device properties */ struct __device_builtin__ cudaDeviceProp { char name[256]; /**< ASCII string identifying device */ size_t totalGlobalMem; /**< Global memory available on device in bytes */ size_t sharedMemPerBlock; /**< Shared memory available per block in bytes */ int regsPerBlock; /**< 32-bit registers available per block */ int warpSize; /**< Warp size in threads */ size_t memPitch; /**< Maximum pitch in bytes allowed by memory copies */ int maxThreadsPerBlock; /**< Maximum number of threads per block */ int maxThreadsDim[3]; /**< Maximum size of each dimension of a block */ int maxGridSize[3]; /**< Maximum size of each dimension of a grid */ int clockRate; /**< Clock frequency in kilohertz */ size_t totalConstMem; /**< Constant memory available on device in bytes */ int major; /**< Major compute capability */ int minor; /**< Minor compute capability */ size_t textureAlignment; /**< Alignment requirement for textures */ size_t texturePitchAlignment; /**< Pitch alignment requirement for texture references bound to pitched memory */ int deviceOverlap; /**< Device can concurrently copy memory and execute a kernel. Deprecated. Use instead asyncEngineCount. */ int multiProcessorCount; /**< Number of multiprocessors on device */ int kernelExecTimeoutEnabled; /**< Specified whether there is a run time limit on kernels */ int integrated; /**< Device is integrated as opposed to discrete */ int canMapHostMemory; /**< Device can map host memory with cudaHostAlloc/cudaHostGetDevicePointer */ int computeMode; /**< Compute mode (See ::cudaComputeMode) */ int maxTexture1D; /**< Maximum 1D texture size */ int maxTexture1DMipmap; /**< Maximum 1D mipmapped texture size */ int maxTexture1DLinear; /**< Maximum size for 1D textures bound to linear memory */ int maxTexture2D[2]; /**< Maximum 2D texture dimensions */ int maxTexture2DMipmap[2]; /**< Maximum 2D mipmapped texture dimensions */ int maxTexture2DLinear[3]; /**< Maximum dimensions (width, height, pitch) for 2D textures bound to pitched memory */ int maxTexture2DGather[2]; /**< Maximum 2D texture dimensions if texture gather operations have to be performed */ int maxTexture3D[3]; /**< Maximum 3D texture dimensions */ int maxTextureCubemap; /**< Maximum Cubemap texture dimensions */ int maxTexture1DLayered[2]; /**< Maximum 1D layered texture dimensions */ int maxTexture2DLayered[3]; /**< Maximum 2D layered texture dimensions */ int maxTextureCubemapLayered[2];/**< Maximum Cubemap layered texture dimensions */ int maxSurface1D; /**< Maximum 1D surface size */ int maxSurface2D[2]; /**< Maximum 2D surface dimensions */ int maxSurface3D[3]; /**< Maximum 3D surface dimensions */ int maxSurface1DLayered[2]; /**< Maximum 1D layered surface dimensions */ int maxSurface2DLayered[3]; /**< Maximum 2D layered surface dimensions */ int maxSurfaceCubemap; /**< Maximum Cubemap surface dimensions */ int maxSurfaceCubemapLayered[2];/**< Maximum Cubemap layered surface dimensions */ size_t surfaceAlignment; /**< Alignment requirements for surfaces */ int concurrentKernels; /**< Device can possibly execute multiple kernels concurrently */ int ECCEnabled; /**< Device has ECC support enabled */ int pciBusID; /**< PCI bus ID of the device */ int pciDeviceID; /**< PCI device ID of the device */ int pciDomainID; /**< PCI domain ID of the device */ int tccDriver; /**< 1 if device is a Tesla device using TCC driver, 0 otherwise */ int asyncEngineCount; /**< Number of asynchronous engines */ int unifiedAddressing; /**< Device shares a unified address space with the host */ int memoryClockRate; /**< Peak memory clock frequency in kilohertz */ int memoryBusWidth; /**< Global memory bus width in bits */ int l2CacheSize; /**< Size of L2 cache in bytes */ int maxThreadsPerMultiProcessor;/**< Maximum resident threads per multiprocessor */ };
后面的注释已经说明了其字段代表意义,可能有些术语对于初学者理解起来还是有一定困难,没关系,我们现在只需要关注以下几个指标:
name:就是设备名称;
totalGlobalMem:就是显存大小;
major,minor:CUDA设备版本号,有1.1, 1.2, 1.3, 2.0, 2.1等多个版本;
clockRate:GPU时钟频率;
multiProcessorCount:GPU大核数,一个大核(专业点称为流多处理器,SM,Stream-Multiprocessor)包含多个小核(流处理器,SP,Stream-Processor)
编译,运行,我们在VS2008工程的cudaGetDeviceProperties()函数处放一个断点,单步执行这一函数,然后用Watch窗口,切换到Auto页,展开+,在我的笔记本上得到如下结果:
可以看到,设备名为GeForce 610M,显存1GB,设备版本2.1(比较高端了,哈哈),时钟频率为950MHz(注意950000单位为kHz),大核数为1。在一些高性能GPU上(如Tesla,Kepler系列),大核数可能达到几十甚至上百,可以做更大规模的并行处理。
PS:今天看SDK代码时发现在help_cuda.h中有个函数实现从CUDA设备版本查询相应大核中小核的数目,觉得很有用,以后编程序可以借鉴,摘抄如下:
// Beginning of GPU Architecture definitions inline int _ConvertSMVer2Cores(int major, int minor) { // Defines for GPU Architecture types (using the SM version to determine the # of cores per SM typedef struct { int SM; // 0xMm (hexidecimal notation), M = SM Major version, and m = SM minor version int Cores; } sSMtoCores; sSMtoCores nGpuArchCoresPerSM[] = { { 0x10, 8 }, // Tesla Generation (SM 1.0) G80 class { 0x11, 8 }, // Tesla Generation (SM 1.1) G8x class { 0x12, 8 }, // Tesla Generation (SM 1.2) G9x class { 0x13, 8 }, // Tesla Generation (SM 1.3) GT200 class { 0x20, 32 }, // Fermi Generation (SM 2.0) GF100 class { 0x21, 48 }, // Fermi Generation (SM 2.1) GF10x class { 0x30, 192}, // Kepler Generation (SM 3.0) GK10x class { 0x35, 192}, // Kepler Generation (SM 3.5) GK11x class { -1, -1 } }; int index = 0; while (nGpuArchCoresPerSM[index].SM != -1) { if (nGpuArchCoresPerSM[index].SM == ((major << 4) + minor)) { return nGpuArchCoresPerSM[index].Cores; } index++; } // If we don't find the values, we default use the previous one to run properly printf("MapSMtoCores for SM %d.%d is undefined. Default to use %d Cores/SM\n", major, minor, nGpuArchCoresPerSM[7].Cores); return nGpuArchCoresPerSM[7].Cores; } // end of GPU Architecture definitions
可见,设备版本2.1的一个大核有48个小核,而版本3.0以上的一个大核有192个小核!
前文说到过,当我们用的电脑上有多个显卡支持CUDA时,怎么来区分在哪个上运行呢?这里我们看一下addWithCuda这个函数是怎么做的。
cudaError_t cudaStatus; // Choose which GPU to run on, change this on a multi-GPU system. cudaStatus = cudaSetDevice(0); if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?"); goto Error; }
使用了cudaSetDevice(0)这个操作,0表示能搜索到的第一个设备号,如果有多个设备,则编号为0,1,2...。
再看我们本节添加的代码,有个函数cudaGetDeviceCount(&num),这个函数用来获取设备总数,这样我们选择运行CUDA程序的设备号取值就是0,1,...num-1,于是可以一个个枚举设备,利用cudaGetDeviceProperties(&prop)获得其属性,然后利用一定排序、筛选算法,找到最符合我们应用的那个设备号opt,然后调用cudaSetDevice(opt)即可选择该设备。选择标准可以从处理能力、版本控制、名称等各个角度出发。后面讲述流并发过程时,还要用到这些API。
如果希望了解更多硬件内容可以结合http://www.geforce.cn/hardware获取。