docs/NVPTXUsage.rst


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276

=============================
User Guide for NVPTX Back-end
=============================

.. contents::
   :local:
   :depth: 3


Introduction
============

To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
along with a defined set of conventions used to represent GPU programming
concepts. This document provides an overview of the general usage of the back-
end, including a description of the conventions used and the set of accepted
LLVM IR.

.. note:: 
   
   This document assumes a basic familiarity with CUDA and the PTX
   assembly language. Information about the CUDA Driver API and the PTX assembly
   language can be found in the `CUDA documentation
   <http://docs.nvidia.com/cuda/index.html>`_.


Conventions
===========

Marking Functions as Kernels
----------------------------

In PTX, there are two types of functions: *device functions*, which are only
callable by device code, and *kernel functions*, which are callable by host
code. By default, the back-end will emit device functions. Metadata is used to
declare a function as a kernel function. This metadata is attached to the
``nvvm.annotations`` named metadata object, and has the following format:

.. code-block:: llvm

   !0 = metadata !{<function-ref>, metadata !"kernel", i32 1}

The first parameter is a reference to the kernel function. The following
example shows a kernel function calling a device function in LLVM IR. The
function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.

.. code-block:: llvm

    define float @my_fmad(float %x, float %y, float %z) {
      %mul = fmul float %x, %y
      %add = fadd float %mul, %z
      ret float %add
    }

    define void @my_kernel(float* %ptr) {
      %val = load float* %ptr
      %ret = call float @my_fmad(float %val, float %val, float %val)
      store float %ret, float* %ptr
      ret void
    }

    !nvvm.annotations = !{!1}
    !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1}

When compiled, the PTX kernel functions are callable by host-side code.


Address Spaces
--------------

The NVPTX back-end uses the following address space mapping:

   ============= ======================
   Address Space Memory Space
   ============= ======================
   0             Generic
   1             Global
   2             Internal Use
   3             Shared
   4             Constant
   5             Local
   ============= ======================

Every global variable and pointer type is assigned to one of these address
spaces, with 0 being the default address space. Intrinsics are provided which
can be used to convert pointers between the generic and non-generic address
spaces.

As an example, the following IR will define an array ``@g`` that resides in
global device memory.

.. code-block:: llvm

    @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

LLVM IR functions can read and write to this array, and host-side code can
copy data to it by name with the CUDA Driver API.

Note that since address space 0 is the generic space, it is illegal to have
global variables in address space 0.  Address space 0 is the default address
space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
variables.


NVPTX Intrinsics
================

Address Space Conversion
------------------------

'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics.  You can use these on any pointer types.

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
    declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
    declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
address space to a generic address space pointer.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid generic address space
pointer.


'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics.  You can use these on any pointer types.

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8 addrspace(1)*)
    declare i8* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8 addrspace(3)*)
    declare i8* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8 addrspace(4)*)
    declare i8* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8 addrspace(5)*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
address space to a pointer in the target address space.  Note that these
intrinsics are only useful if the address space of the target address space of
the pointer is known.  It is not legal to use address space conversion
intrinsics to convert a pointer from one non-generic address space to another
non-generic address space.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid pointer in the target
non-generic address space.


Reading PTX Special Registers
-----------------------------

'``llvm.nvvm.read.ptx.sreg.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()

Overview:
"""""""""

The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
special registers, in particular the kernel launch bounds.  These registers
map in the following way to CUDA builtins:

   ============ =====================================
   CUDA Builtin PTX Special Register Intrinsic
   ============ =====================================
   ``threadId`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
   ``blockIdx`` ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
   ``blockDim`` ``@llvm.nvvm.read.ptx.sreg.ntid.*``
   ``gridDim``  ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
   ============ =====================================


Barriers
--------

'``llvm.nvvm.barrier0``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.barrier0()

Overview:
"""""""""

The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
instruction, equivalent to the ``__syncthreads()`` call in CUDA.


Other Intrinsics
----------------

For the full set of NVPTX intrinsics, please see the
``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.


Executing PTX
=============

The most common way to execute PTX assembly on a GPU device is to use the CUDA
Driver API. This API is a low-level interface to the GPU driver and allows for
JIT compilation of PTX code to native GPU machine code.

Initializing the Driver API:

.. code-block:: c++

    CUdevice device;
    CUcontext context;

    // Initialize the driver API
    cuInit(0);
    // Get a handle to the first compute device
    cuDeviceGet(&device, 0);
    // Create a compute device context
    cuCtxCreate(&context, 0, device);

JIT compiling a PTX string to a device binary:

.. code-block:: c++

    CUmodule module;
    CUfunction funcion;

    // JIT compile a null-terminated PTX string
    cuModuleLoadData(&module, (void*)PTXString);

    // Get a handle to the "myfunction" kernel function
    cuModuleGetFunction(&function, module, "myfunction");

For full examples of executing PTX assembly, please see the `CUDA Samples
<https://developer.nvidia.com/cuda-downloads>`_ distribution.