Improving the performance of Video reading in Python

Introduction

Reading from a video stream in a performant way is key when working with video processing algorithms, especially with live video streams over RTSP. At Ensolvers, we have been working in this area since we started the development of our key project: “Please smile”.

Initially, we used OpenCV’s out-of-the-box VideoCapture class, which is acceptable for low-quality videos, e.g. 720p. However, when we switched to using several 4K cameras in parallel, we needed to take a different approach, as the CPU usage of our box was too high. We tried several of the approaches found on the Web [1, 2, 3] to leverage an NVIDIA GPU’s hardware decoder (NVDEC), without success.

What has worked for us is a custom OpenCV build:

  1. Download OpenCV and OpenCV contrib snapshots from Git (we have tested with 4.0.1).
  2. Make a change in modules/videoio/src/cap_ffmpeg_impl.hpp so that FFmpeg’s avcodec uses TCP as the transport for RTSP. If you don’t do this, you will likely lose packets. Check my custom branch.
  3. If you are using Anaconda, switch to your environment.
  4. Open a terminal and move to the OpenCV/build directory.
  5. Run the CMake command. Check that FFMPEG shows as enabled in the output.
  6. Run make -j8
  7. Find the file pyopencv_generated_types.h inside the build directory and search for pyopencv_cv_cudacodec_cudacodec_VideoReader_nextFrame. Manually modify the method so that:
    1. using namespace cv::cuda;
      is included before
      using namespace cv::cudacodec;
    2. The declaration of frame is replaced with GpuMat frame;
  8. Run sudo make install
  9. Run sudo ldconfig

Results

Box:

  • i7 7700 processor
  • 8GB RAM
  • NVIDIA GTX 1050 Ti card

The results are just amazing!

In the following pictures you can see OpenCV’s VideoCapture performance for 8 videos of 4K quality each, using CPU decoding. As you can see, the CPU is at 100%.

Next, you can see the same 8 videos being read by our custom build on the GPU and used inside Please smile.

CPU and memory usage are much lower, delegating all the heavy lifting to the NVIDIA GPU.

Caveats

Not everything is perfect; a few things we have to mention:

  • The performance gains come from making better use of the CPU resources in a multi-core machine and using the NVIDIA card for video decoding.
  • We had several issues with OpenCV using FFMPEG as backend. Our logs were full of messages like:

    [rtsp @ 0x7f268c5e9220] max delay reached. need to consume packet
    [rtsp @ 0x7f268c5e9220] RTP: missed 40 packets
    [h264 @ 0x55b1e115d400] left block unavailable for requested intra mode
    [h264 @ 0x55b1e115d400] error while decoding MB 0 12, bytestream 114567
    [h264 @ 0x55b1e115d400] concealing 3889 DC, 3889 AC, 3889 MV errors in I frame

The OPENCV_FFMPEG_CAPTURE_OPTIONS environment variable DOESN’T work. We found this out by looking at OpenCV’s code. We had to manually add the TCP protocol to the avcodec in step 2 of the process.

  • Video reading may randomly fail. What we did was create a watchdog service that looks at the last time a frame was read and, if it was more than 5 seconds ago, restarts the CUDA capture process.
  • FFMPEG still uses a bit of CPU; it is not 1% as we initially thought, but more like 12% on an i7 7700 processor for 8 4K videos.
  • FFMPEG processes use memory on the NVIDIA card, so if you are running several decoders plus TensorFlow or PyTorch you may run out of GPU memory. To avoid this, we moved from a GTX 1050 Ti to an RTX 2070, basically doubling the GPU memory from 4GB to 8GB.
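
The watchdog mentioned above can be sketched in plain Python (a minimal sketch; the restart callback and injectable clock are our assumptions, not a specific library API):

```python
import time

class FrameWatchdog:
    """Restarts capture when no frame has been seen for `timeout` seconds.

    `restart` is a callback that tears down and recreates the CUDA capture
    process (hypothetical hook); `clock` is injectable for testing.
    """

    def __init__(self, timeout=5.0, restart=lambda: None, clock=time.monotonic):
        self.timeout = timeout
        self.restart = restart
        self.clock = clock
        self.last_frame = clock()

    def frame_received(self):
        # Call this every time a frame is successfully read.
        self.last_frame = self.clock()

    def check(self):
        # Call periodically (e.g. once per second) from a monitoring thread.
        if self.clock() - self.last_frame > self.timeout:
            self.restart()
            self.last_frame = self.clock()
            return True
        return False
```

Keeping the watchdog outside the reading loop means a hung nextFrame() call cannot block its own recovery.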

References

  1. https://www.pyimagesearch.com/2016/07/11/compiling-opencv-with-cuda-support/
  2. https://pterneas.com/2018/11/02/opencv-cuda/
  3. https://jamesbowley.co.uk/accelerating-opencv-4-build-with-cuda-intel-mkl-tbb-and-python-bindings/