Revision as of 19:54, 23 March 2007
CUDA ("Compute Unified Device Architecture") is a GPGPU technology that allows a programmer to use the C programming language to code algorithms for execution on the GPU. CUDA was developed by NVIDIA, and using it requires an NVIDIA GPU and special stream processing drivers. CUDA only works with the new GeForce 8 Series, featuring G8X GPUs; NVIDIA guarantees that programs developed for the GeForce 8 Series will also work without modification on all future NVIDIA video cards. CUDA gives developers unfettered access to the native instruction set and memory of the massively parallel computational elements in CUDA GPUs. Using CUDA, NVIDIA GeForce-based GPUs effectively become powerful, programmable open architectures like today's CPUs (central processing units). By opening up the architecture, CUDA provides developers with the low-level, deterministic, and repeatable access to hardware that is necessary to develop essential high-level programming tools such as compilers, debuggers, math libraries, and application platforms.
The initial CUDA SDK was made public on 15 February 2007.[1]
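The C-with-extensions model described above can be illustrated with a minimal sketch (hypothetical kernel and names, not from the SDK; building it requires the nvcc compiler and a CUDA-capable GPU):

```
// A kernel is ordinary C extended with the __global__ qualifier;
// each GPU thread adds one element of the two input vectors.
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's index
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host code launches the kernel with CUDA's <<<blocks, threads>>> syntax:
//   vecAdd<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);
```

The launch configuration (number of blocks and threads per block) is how the programmer maps the problem onto the GPU's parallel computational elements.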
NVIDIA 8-Series GeForce-based GPU
The 8-Series (G8X) GeForce-based GPU from NVIDIA is the first series of GPUs to support the CUDA SDK. The 8-Series (G8X) GPUs feature hardware support for 32-bit (single precision) floating point vector processors, using the CUDA SDK as the API. (CUDA supports the C "double" data type; however, on G8X-series GPUs these values are demoted to 32-bit floats.) Due to the highly parallel nature of vector processors, GPU-assisted hardware stream processing can have a huge impact in specific data processing applications. It is anticipated in the computer gaming industry that graphics cards may be used for future game physics calculations (physical effects such as debris, smoke, fire, and fluids).
Advantages
CUDA has several advantages over traditional general purpose computation on GPUs (GPGPU) using graphics APIs.
- It uses the standard C language, with some simple extensions.
- Scattered writes - code can write to arbitrary addresses in memory.
- Shared memory - CUDA exposes a fast shared memory region (16KB in size) that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups.
- Faster downloads and readbacks to and from the GPU.
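The shared memory advantage above can be sketched as a kernel that stages data in the fast 16 KB on-chip region before reusing it (hypothetical kernel; requires nvcc and a CUDA-capable GPU):

```
// Each block copies a tile of the input into shared memory, then all
// threads in the block read from it, avoiding repeated device-memory
// or texture fetches. Here the tile is written back reversed.
__global__ void reverseTile(const float *in, float *out) {
    __shared__ float tile[256];               // per-block shared memory
    int i = threadIdx.x;
    tile[i] = in[blockIdx.x * blockDim.x + i];
    __syncthreads();                          // wait until the tile is loaded
    out[blockIdx.x * blockDim.x + i] = tile[blockDim.x - 1 - i];
}
```

The __syncthreads() barrier is what makes the region safe to use as a user-managed cache: no thread reads the tile until every thread has finished filling it.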
Limitations
- Compared with traditional floating point accelerators, such as the 64-bit floating point (FP64) CSX600 boards from ClearSpeed that are used in today's supercomputers, current GPUs from NVIDIA (and AMD/ATI) operate only on 32-bit values, providing only single precision data capability[1] instead of the double precision (64-bit) capability of supercomputers.[2] NVIDIA stated in the CUDA Release Notes Version 0.8 that NVIDIA GPUs supporting 64-bit double precision floating point arithmetic in hardware will become available in late 2007.[3]
- Only bilinear texture filtering is supported; mipmapped textures and anisotropic filtering are not supported at this time.
- Recursive functions are not supported.
- Various deviations from the IEEE 754 standard. Denormals and signalling NaNs are not supported; the rounding mode cannot be changed, and the precision of division/square root is slightly lower than single precision.
- The bus bandwidth and latency between the CPU and the GPU may be a bottleneck.
See also
- NVIDIA Corporation
- CTM (Close To Metal), AMD/ATI's competing GPGPU technology for ATI Radeon-based GPUs
- Graphics Processing Unit (GPU)
- Stream Processor
- Shader
- HLSL2GLSL
- Sh, a GPGPU library for C++
- Stream programming
- GPGPU (General-Purpose Computing on Graphics Processing Units)
- Physics Processing Unit (PPU)
- Audio Processing Unit (APU) (GPGPU technology can be used to emulate DSP for audio processing)
References
- ^ Dominik Goddeke, Robert Strzodka, and Stefan Turek. "Accelerating Double Precision FEM Simulations with GPUs". Proceedings of ASIM 2005 - 18th Symposium on Simulation Technique, 2005.
- ^ http://www.tgdaily.com/2007/02/16/nvidia_cuda/
- ^ http://developer.download.nvidia.com/compute/cuda/0_8/NVIDIA_CUDA_SDK_releasenotes_readme_win32_linux.zip
External links
- NVIDIA CUDA Homepage
- NVIDIA CUDA GPU Computing developer forums
- NVIDIA CUDA developer registration for professional developers and researchers
- Beyond3D - Introducing CUDA NVIDIA's Vision for GPU Computing (16th February 2007)