Profiling
There are a number of reasons we might want to profile our GPU code, including:
- To work out which bits of our program take the most time to run.
- To check that our kernel is executing on as many threads and blocks as we expect.
- To check that our kernel is executing on as many streams as expected.
We can use a tool from the CUDA toolkit called nvprof
for profiling. The following command can be used to call nvprof to profile a specific Julia GPU program from the command line:
$ nvprof --profile-from-start off /path/to/julia /path/to/julia/script
Note that we use --profile-from-start off
to tell nvprof
not to start profiling until we tell it to. There are several reasons we might not want to have nvprof
start profiling straight away. One reason is that Julia is a Just-In-Time (JIT) compiled language. We often do not want to include the time taken to compile our scripts in our profiling estimates. Another reason is that GPU software usually executes some commands on CPU as well as GPU, and we may only want to profile commands executed on GPU. Finally, we might only be interested in a particular bit of our script, such as the kernel or the time taken to copy data from host to device. We can turn --profile-from-start
off to enable us to profile just the bit we are interested in.
nvprof
Let's use our vector addition kernel streaming example to investigate how we can use nvprof
for profiling. We have modified the example so it can be profiled with nvprof
:
using CuArrays, CUDAnative, CUDAdrv
using Test
# Kernel
function add!(a,b,c, index)
c[index] = a[index] + b[index]
return nothing
end
function main()
# Initialise a, b and c
a = CuArrays.CuArray(fill(0, 10))
b = CuArrays.CuArray(fill(0, 10))
c = CuArrays.CuArray(fill(0, 10))
# Put values in a and b
for i in 1:10
a[i] = i
b[i] = i * 2
end
# Create two streams
s1 = CuStream()
s2 = CuStream()
# Call add! asynchronously in two streams
for i in 1:2:min(length(a), length(b), length(c))
@cuda threads = 1 stream = s1 add!(a,b,c, i)
@cuda threads = 1 stream = s2 add!(a,b,c, i+1)
end
# Copy arrays back to host (CPU)
a=Array(a)
b=Array(b)
c=Array(c)
# Check the addition worked
for i in 1:length(a)
@test a[i] + b[i] โ c[i]
end
end
main()
CUDAdrv.@profile main()
The only change we have made to this example is to add this line at the end:
CUDAdrv.@profile main()
This line executes main()
whilst activating the CUDA profiler. This signals to nvprof
the point at which to start profiling. Notice that we have already called main()
once in the line before we call CUDAdrv.@profile
- this is to force main()
to compile before we start profiling. If we do not call the function we want to profile once before we start profiling then our profiling estimates will include the time spent compiling.
Having modified our example script to invoke the nvprof
profiler, we can now get profiling information by executing the following on the command line:
$ nvprof --profile-from-start off julia gpu_vector_sums_streams.jl
Where gpu_vector_sums_streams.jl
is the name of our example script. The output of this command should look something like this:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 50.18% 21.632us 23 940ns 928ns 1.2160us [CUDA memcpy HtoD]
39.64% 17.088us 10 1.7080us 1.5680us 2.1440us ptxcall_add__1
10.17% 4.3850us 3 1.4610us 1.3440us 1.5680us [CUDA memcpy DtoH]
API calls: 82.95% 1.5925ms 10 159.25us 5.0680us 1.5383ms cuLaunchKernel
10.24% 196.67us 23 8.5500us 6.6030us 26.618us cuMemcpyHtoD
2.34% 44.868us 3 14.956us 8.3740us 26.289us cuMemAlloc
2.25% 43.195us 3 14.398us 12.429us 17.862us cuMemcpyDtoH
1.70% 32.641us 2 16.320us 8.3680us 24.273us cuStreamCreate
0.39% 7.4110us 15 494ns 426ns 672ns cuCtxGetCurrent
0.13% 2.4880us 1 2.4880us 2.4880us 2.4880us cuDeviceGetCount
There is some pretty interesting information in this output - for example, the first line of the output tells us that 50.18% of GPU time is spent copying data from host to device. This is a good illustration of how slow copying data between CPU and GPU often is. We can get some more interesting information by adding an extra flag to our nvprof
command:
$ nvprof --profile-from-start off --print-gpu-trace julia gpu_vector_sums_streams.jl
The output of this command should look something like this:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
15.1992s 1.2480us - - - - - 80B 61.133MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1992s 1.1200us - - - - - 80B 68.120MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1992s 928ns - - - - - 80B 82.213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1992s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1992s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1992s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1992s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1992s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1993s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1993s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1993s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1993s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1993s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1993s 960ns - - - - - 8B 7.9473MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1993s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1993s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1993s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1993s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1993s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1993s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1994s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1994s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.1994s 928ns - - - - - 8B 8.2213MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
15.2010s 2.2720us (1 1 1) (1 1 1) 32 0B 0B - - - - Quadro M1000M ( 1 16 ptxcall_add__1 [108]
15.2010s 1.9520us (1 1 1) (1 1 1) 32 0B 0B - - - - Quadro M1000M ( 1 17 ptxcall_add__1 [110]
15.2010s 1.5680us (1 1 1) (1 1 1) 32 0B 0B - - - - Quadro M1000M ( 1 16 ptxcall_add__1 [112]
15.2010s 1.5680us (1 1 1) (1 1 1) 32 0B 0B - - - - Quadro M1000M ( 1 17 ptxcall_add__1 [114]
15.2010s 1.5680us (1 1 1) (1 1 1) 32 0B 0B - - - - Quadro M1000M ( 1 16 ptxcall_add__1 [116]
15.2010s 1.5680us (1 1 1) (1 1 1) 32 0B 0B - - - - Quadro M1000M ( 1 17 ptxcall_add__1 [118]
15.2010s 1.5680us (1 1 1) (1 1 1) 32 0B 0B - - - - Quadro M1000M ( 1 16 ptxcall_add__1 [120]
15.2010s 1.6000us (1 1 1) (1 1 1) 32 0B 0B - - - - Quadro M1000M ( 1 17 ptxcall_add__1 [122]
15.2010s 1.8880us (1 1 1) (1 1 1) 32 0B 0B - - - - Quadro M1000M ( 1 16 ptxcall_add__1 [124]
15.2010s 1.5680us (1 1 1) (1 1 1) 32 0B 0B - - - - Quadro M1000M ( 1 17 ptxcall_add__1 [126]
15.2010s 1.1840us - - - - - 80B 64.437MB/s Device Pageable Quadro M1000M ( 1 7 [CUDA memcpy DtoH]
15.2011s 1.1840us - - - - - 80B 64.437MB/s Device Pageable Quadro M1000M ( 1 7 [CUDA memcpy DtoH]
15.2011s 1.3120us - - - - - 80B 58.151MB/s Device Pageable Quadro M1000M ( 1 7 [CUDA memcpy DtoH]
Let's talk through what this output means. First there are a series of lines that look something like this:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
15.1992s 1.2480us - - - - - 80B 61.133MB/s Pageable Device Quadro M1000M ( 1 7 [CUDA memcpy HtoD]
These lines are describing the process of copying data from host to device ([CUDA memcpy HtoD]
). We can see this is happening in stream 7. Next we see lines that look like these:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
15.2010s 2.2720us (1 1 1) (1 1 1) 32 0B 0B - - - - Quadro M1000M ( 1 16 ptxcall_add__1 [108]
15.2010s 1.9520us (1 1 1) (1 1 1) 32 0B 0B - - - - Quadro M1000M ( 1 17 ptxcall_add__1 [110]
These lines are executing the kernel. There are two things to notice here:
The Grid Size and Block Size are both
(1 1 1)
, indicating that the kernel is executing sequentially. If you look back at the example script you will see that this is as expected - we always call the kernel on 1 thread and 1 block (the default number of threads and blocks). If you specified more threads or blocks to@cuda
these fields would change.The first line executes in stream 16 and the second executes in stream 17. Again, if you look back at the example script this is as expected - we create two streams then execute our kernel alternately in each stream.
Finally, there are some lines copying data back from device to host:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
15.2010s 1.1840us - - - - - 80B 64.437MB/s Device Pageable Quadro M1000M ( 1 7 [CUDA memcpy DtoH]
This data transfer takes place in stream 7 again.
Zoom in on the kernel
One of the main take home points of the profiling above is that most of the execution time is taken up copying data. Let's imagine that we decided we were not interested in profiling data transfer and just wanted to profile the kernel. We could modify our example script so it looked like below:
using CuArrays, CUDAnative, CUDAdrv
using Test
# Kernel
function add!(a,b,c, index)
c[index] = a[index] + b[index]
return nothing
end
function main()
# Initialise a, b and c
a = CuArrays.CuArray(fill(0, 10))
b = CuArrays.CuArray(fill(0, 10))
c = CuArrays.CuArray(fill(0, 10))
# Put values in a and b
for i in 1:10
a[i] = -i
b[i] = i * i
end
# Create two streams
s1 = CuStream()
s2 = CuStream()
# Call add! asynchronously in two streams
for i in 1:2:min(length(a), length(b), length(c))
@cuda threads = 1 stream = s1 add!(a,b,c, i)
CUDAdrv.@profile @cuda threads = 1 stream = s1 add!(a,b,c, i)
@cuda threads = 1 stream = s2 add!(a,b,c, i+1)
end
# Copy arrays back to host (CPU)
a=Array(a)
b=Array(b)
c=Array(c)
# Check the addition worked
for i in 1:length(a)
@test a[i] + b[i] โ c[i]
end
end
main()
Notice that we no longer use CUDAdrv.@profile
to profile main()
, but we now use it to profile our call to @cuda threads = 1 stream = s1 add!(a,b,c, i)
. Now we call
$ nvprof --profile-from-start off julia gpu_vector_sums_streams.jl
from the command line, and get the following output:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 10.304us 5 2.0600us 1.8560us 2.3040us ptxcall_add__1
API calls: 75.82% 11.778ms 5 2.3556ms 1.4952ms 2.6888ms cuLaunchKernel
24.17% 3.7540ms 5 750.79us 864ns 961.91us cuCtxGetCurrent
0.01% 2.1640us 1 2.1640us 2.1640us 2.1640us cuDeviceGetCount
Since we are just profiling the kernel now, unsurprisingly we find that 100% of the profiled GPU time was spent executing the kernel. The output to
$ nvprof --profile-from-start off --print-gpu-trace julia gpu_vector_sums_streams.jl
is shown below:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Device Context Stream Name
14.9673s 2.0800us (1 1 1) (1 1 1) 32 0B 0B Quadro M1000M ( 1 14 ptxcall_add__1 [54]
16.9861s 1.7920us (1 1 1) (1 1 1) 32 0B 0B Quadro M1000M ( 1 14 ptxcall_add__1 [62]
19.0047s 2.0480us (1 1 1) (1 1 1) 32 0B 0B Quadro M1000M ( 1 14 ptxcall_add__1 [70]
21.0233s 1.8560us (1 1 1) (1 1 1) 32 0B 0B Quadro M1000M ( 1 14 ptxcall_add__1 [78]
23.0424s 1.8880us (1 1 1) (1 1 1) 32 0B 0B Quadro M1000M ( 1 14 ptxcall_add__1 [86]
Again, all of the data transfer steps have vanished from the profiling output, since we have only profiled kernel execution. Notice that we see 5 lines of output here, corresponding to 5 calls of the kernel, and that the kernel is always called in the same stream, because we only profiled calls to the kernel in stream s1
.
This section has shown some of the methods you can use to profile GPU software that you write in Julia. In the next section, we will discuss warps and branching and their impact on performance.