
ncnn | high-performance neural network inference framework | Machine Learning library

by Tencent | C++ | Version: 20220216 | License: Non-SPDX

kandi X-RAY | ncnn Summary

ncnn is a C++ library typically used in Artificial Intelligence, Machine Learning, Deep Learning, PyTorch, and TensorFlow applications. ncnn has no reported bugs, no reported vulnerabilities, and medium support. However, it has a Non-SPDX license. You can download it from GitHub.
ncnn is a high-performance neural network inference computing framework optimized for mobile platforms. Deployment and use on mobile phones were considered from the beginning of its design: ncnn has no third-party dependencies, is cross-platform, and runs faster on mobile-phone CPUs than all known open-source frameworks. Using ncnn's efficient implementation, developers can easily deploy deep-learning models to mobile platforms, create intelligent apps, and bring artificial intelligence to your fingertips. ncnn is currently used in many Tencent applications, such as QQ, Qzone, WeChat, and Pitu.
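
For orientation, a minimal inference sketch with the ncnn C++ API might look like the following. The Net/Extractor calls are ncnn's public API; the SqueezeNet file names and the "data"/"prob" blob names are assumptions for illustration, so substitute your own converted model:

#include <stdio.h>
#include <vector>
#include "net.h"   // ncnn::Net, ncnn::Mat, ncnn::Extractor

int main()
{
    ncnn::Net net;
    net.opt.use_vulkan_compute = true; // optional: enable the Vulkan GPU path

    // Hypothetical model files produced by one of ncnn's converters.
    if (net.load_param("squeezenet_v1.1.param") != 0) return -1;
    if (net.load_model("squeezenet_v1.1.bin") != 0) return -1;

    // Dummy 227x227 BGR input; in practice this comes from an image decoder.
    const int w = 227, h = 227;
    std::vector<unsigned char> bgr(w * h * 3, 128);
    ncnn::Mat in = ncnn::Mat::from_pixels(bgr.data(), ncnn::Mat::PIXEL_BGR, w, h);

    const float mean_vals[3] = {104.f, 117.f, 123.f};
    in.substract_mean_normalize(mean_vals, 0); // (sic: ncnn's spelling)

    ncnn::Extractor ex = net.create_extractor();
    ex.input("data", in);

    ncnn::Mat out;
    ex.extract("prob", out);

    printf("output: %d x %d x %d\n", out.w, out.h, out.c);
    return 0;
}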

Support

  • ncnn has a medium active ecosystem.
  • It has 14305 stars, 3418 forks, and 564 watchers.
  • There were 3 major releases in the last 12 months.
  • There are 639 open issues and 1851 closed ones; on average, issues are closed in 25 days. There are 23 open pull requests and 0 closed requests.
  • It has a neutral sentiment in the developer community.
  • The latest version of ncnn is 20220216.

Quality

  • ncnn has 0 bugs and 0 code smells.

Security

  • ncnn has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
  • ncnn code analysis shows 0 unresolved vulnerabilities.
  • There are 0 security hotspots that need review.

License

  • ncnn has a Non-SPDX License.
  • Non-SPDX licenses can be open-source licenses that are not SPDX-compliant, or non-open-source licenses; review them closely before use.

Reuse

  • ncnn releases are available to install and integrate.
  • It has 16722 lines of code, 1143 functions and 387 files.
  • It has medium code complexity. Code complexity directly impacts maintainability of the code.

ncnn Key Features

Supports convolutional neural networks with multiple inputs and multi-branch structures, and can compute only part of the branches

No third-party library dependencies; does not rely on BLAS/NNPACK or any other computing framework

Pure C++ implementation; cross-platform, with support for Android, iOS, and more

Careful ARM NEON assembly-level optimization for extremely fast computation

Sophisticated memory management and data-structure design for a very low memory footprint

Supports multi-core parallel computing acceleration and ARM big.LITTLE CPU scheduling optimization

Supports GPU acceleration via the next-generation low-overhead Vulkan API

Extensible model design; supports 8-bit quantization and half-precision floating-point storage; can import caffe/pytorch/mxnet/onnx/darknet/keras/tensorflow (MLIR) models

Supports zero-copy loading of network models from memory by direct reference

Custom layer implementations can be registered to extend the framework (see the sketch after this list)

And it is strong: not afraid of being stuffed with 卷 (convolutions) QvQ
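
Since extensibility comes up often, here is a minimal sketch of the custom-layer registration pattern. The Layer interface, the DEFINE_LAYER_CREATOR macro, and register_custom_layer are ncnn's API; the MySwish layer itself and its activation are illustrative assumptions:

#include <math.h>
#include "layer.h"   // ncnn::Layer
#include "net.h"     // ncnn::Net

// Hypothetical in-place activation layer: y = x * sigmoid(x).
class MySwish : public ncnn::Layer
{
public:
    MySwish() { one_blob_only = true; support_inplace = true; }

    virtual int forward_inplace(ncnn::Mat& bottom_top_blob, const ncnn::Option& /*opt*/) const
    {
        for (int q = 0; q < bottom_top_blob.c; q++)
        {
            float* ptr = bottom_top_blob.channel(q);
            const int size = bottom_top_blob.w * bottom_top_blob.h;
            for (int i = 0; i < size; i++)
                ptr[i] = ptr[i] / (1.f + expf(-ptr[i]));
        }
        return 0;
    }
};

DEFINE_LAYER_CREATOR(MySwish)

// Register before load_param() so param files referencing "MySwish" resolve:
//     ncnn::Net net;
//     net.register_custom_layer("MySwish", MySwish_layer_creator);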

Why does cv::parallel_for_ run faster than my own implementation?

# Link OpenMP into the target; older FindOpenMP modules may not define the
# imported target, hence the explicit flag fallback.
find_package(OpenMP)
if(NOT TARGET OpenMP::OpenMP_CXX AND (OpenMP_CXX_FOUND OR OPENMP_FOUND))
    target_compile_options(plain PRIVATE ${OpenMP_CXX_FLAGS})
endif()

if(OpenMP_CXX_FOUND OR OPENMP_FOUND)
    if(ANDROID_NDK_MAJOR AND (ANDROID_NDK_MAJOR GREATER 20))
        # Android NDK 21+ ships a static OpenMP runtime that must be linked explicitly.
        target_compile_options(plain PRIVATE -fopenmp)
        target_link_libraries(plain PUBLIC -fopenmp -static-openmp)
    elseif(OpenMP_CXX_FOUND)
        target_link_libraries(plain PUBLIC OpenMP::OpenMP_CXX)
    else()
        target_link_libraries(plain PRIVATE "${OpenMP_CXX_FLAGS}")
    endif()
endif()
target_link_libraries(plain PUBLIC ${OpenCV_LIBS})

# Same pattern with PUBLIC propagation, for when dependents also need OpenMP.
find_package(OpenMP)
if(NOT TARGET OpenMP::OpenMP_CXX AND (OpenMP_CXX_FOUND OR OPENMP_FOUND))
    target_compile_options(plain PUBLIC ${OpenMP_CXX_FLAGS})
endif()

if(OpenMP_CXX_FOUND OR OPENMP_FOUND)
    if(ANDROID_NDK_MAJOR AND (ANDROID_NDK_MAJOR GREATER 20))
        target_compile_options(plain PUBLIC -fopenmp)
        target_link_libraries(plain PUBLIC -fopenmp -static-openmp)
    elseif(OpenMP_CXX_FOUND)
        target_link_libraries(plain PUBLIC OpenMP::OpenMP_CXX)
    else()
        target_link_libraries(plain PUBLIC "${OpenMP_CXX_FLAGS}")
    endif()
endif()

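Related to the snippets above: to confirm which parallel framework an OpenCV build actually selected, the build information dump is a quick runtime check (cv::getBuildInformation and cv::getNumThreads are standard OpenCV core API):

#include <iostream>
#include <opencv2/core.hpp>

int main()
{
    // The dump contains a "Parallel framework:" line (TBB, OpenMP, pthreads, ...).
    std::cout << cv::getBuildInformation() << std::endl;
    std::cout << "cv::getNumThreads() = " << cv::getNumThreads() << std::endl;
    return 0;
}
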
How can I get the Vulkan API working in Google Colab

# Fedora / RHEL
dnf install vulkan-headers vulkan-loader-devel
# Debian / Ubuntu
apt-get install libvulkan-dev
# Arch Linux
pacman -S vulkan-headers vulkan-icd-loader
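
To verify afterwards that the loader and an ICD are visible, running vulkaninfo is a quick sanity check; a sketch assuming a Debian/Ubuntu environment such as Colab (the vulkan-tools package name differs on other distros):

apt-get install vulkan-tools
vulkaninfo | head -n 20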

Community Discussions

Trending Discussions on ncnn
  • Why does cv::parallel_for_ run faster than my own implementation?
  • Create video from single images
  • How can I get the Vulkan API working in Google Colab

QUESTION

Why does cv::parallel_for_ run faster than my own implementation?

Asked 2021-Aug-14 at 11:01

I am implementing a nearest-neighbor resizing algorithm for RGB images (unsigned char type). Comparing its speed with OpenCV's on the Android ARMv8 platform, I found that OpenCV uses cv::parallel_for_ for multi-threaded speed-up.

So I dove into the corresponding source code of OpenCV's cv::resize(), and copied and pasted the code that actually runs into my main.cpp. It contains a functor resizeNNInvoker and a cv::parallel_for_ call that performs the multi-threaded calculation on this functor.

What confuses me is that the cv::parallel_for_ version runs faster than my_parallel_for_, whose code is identical to OpenCV's.

To make it clearer:

  • Tested on the Android ARMv8 platform
  • Compiled OpenCV with OpenMP multithreading, turning off the other parallel frameworks
  • Changed the source code of OpenCV's cv::parallel_for_ to be the same as my_parallel_for_ (see below)
  • Used 4 threads via cv::setNumThreads(4) and bound them to 4 big CPU cores (using the ncnn API)
  • All code compiled in Release mode (via CMake)
  • Test input image: width=7680, height=4320; target size: 7680/3 x 4320/3

Time cost is as follows:

method               time cost
cv::parallel_for_    3.24 ms
my_parallel_for_     7.67 ms
inplace openmp       7.75 ms
// my own implementation of parallel_for_, copied from OpenCV source code
void my_parallel_for_(const cv::Range& range, const cv::ParallelLoopBody& body)
{
    #pragma omp parallel for schedule(dynamic) num_threads(4)
    for (int i = range.start; i < range.end; ++i)
        body(cv::Range(i, i + 1));
}

// The functor that performs nearest neighbor resizing, copied from opencv source
class resizeNNInvoker : public cv::ParallelLoopBody
{
public:
    resizeNNInvoker(const cv::Mat& _src, cv::Mat &_dst, int *_x_ofs, double _ify) :
        ParallelLoopBody(), src(_src), dst(_dst), x_ofs(_x_ofs),
        ify(_ify)
    {
    }

    virtual void operator() (const cv::Range& range) const CV_OVERRIDE
    {
        //printf("--- resizeNNInvoker get called\n");
        cv::Size ssize = src.size(), dsize = dst.size();
        int y, x, pix_size = (int)src.elemSize();

        for( y = range.start; y < range.end; y++ )
        {
            uchar* D = dst.data + dst.step*y;
            int sy = std::min(cvFloor(y*ify), ssize.height-1);
            const uchar* S = src.ptr(sy);

            switch( pix_size )
            {
            case 1:
                for( x = 0; x <= dsize.width - 2; x += 2 )
                {
                    uchar t0 = S[x_ofs[x]];
                    uchar t1 = S[x_ofs[x+1]];
                    D[x] = t0;
                    D[x+1] = t1;
                }

                for( ; x < dsize.width; x++ )
                    D[x] = S[x_ofs[x]];
                break;
            case 2:
                for( x = 0; x < dsize.width; x++ )
                    *(ushort*)(D + x*2) = *(ushort*)(S + x_ofs[x]);
                break;
            case 3:
                for( x = 0; x < dsize.width; x++, D += 3 )
                {
                    const uchar* _tS = S + x_ofs[x];
                    D[0] = _tS[0]; D[1] = _tS[1]; D[2] = _tS[2];
                }
                break;
            case 4:
                for( x = 0; x < dsize.width; x++ )
                    *(int*)(D + x*4) = *(int*)(S + x_ofs[x]);
                break;
            case 6:
                for( x = 0; x < dsize.width; x++, D += 6 )
                {
                    const ushort* _tS = (const ushort*)(S + x_ofs[x]);
                    ushort* _tD = (ushort*)D;
                    _tD[0] = _tS[0]; _tD[1] = _tS[1]; _tD[2] = _tS[2];
                }
                break;
            case 8:
                for( x = 0; x < dsize.width; x++, D += 8 )
                {
                    const int* _tS = (const int*)(S + x_ofs[x]);
                    int* _tD = (int*)D;
                    _tD[0] = _tS[0]; _tD[1] = _tS[1];
                }
                break;
            case 12:
                for( x = 0; x < dsize.width; x++, D += 12 )
                {
                    const int* _tS = (const int*)(S + x_ofs[x]);
                    int* _tD = (int*)D;
                    _tD[0] = _tS[0]; _tD[1] = _tS[1]; _tD[2] = _tS[2];
                }
                break;
            default:
                for( x = 0; x < dsize.width; x++, D += pix_size )
                {
                    const uchar* _tS = S + x_ofs[x];
                    for (int k = 0; k < pix_size; k++)
                        D[k] = _tS[k];
                }
            }
        }
    }

private:
    const cv::Mat& src;
    cv::Mat& dst;
    int* x_ofs;
    double ify;

    resizeNNInvoker(const resizeNNInvoker&);
    resizeNNInvoker& operator=(const resizeNNInvoker&);
};

// The entry function that calls nearest neighbor resizing with openmp multi-thread
void resize_nearest(const uchar* src_buf, int src_height, int src_width, int src_linebytes, uchar* dst_buf, int dst_height, int dst_width, int dst_linebytes, const Option& opt)
{
    cv::Size src_size;
    src_size.height = src_height;
    src_size.width = src_width;
    cv::Mat src(src_size, CV_8UC3, const_cast<uchar*>(src_buf));

    cv::Size dst_size;
    dst_size.height = dst_height;
    dst_size.width = dst_width;
    cv::Mat dst(dst_size, CV_8UC3, dst_buf);

    cv::Size ssize = src.size(), dsize = dst.size();

    double inv_scale_x = (double)dsize.width/ssize.width;
    double inv_scale_y = (double)dsize.height/ssize.height;
    double fx = inv_scale_x;
    double fy = inv_scale_y;

    cv::AutoBuffer<int> _x_ofs(dsize.width);
    int* x_ofs = _x_ofs.data();
    int pix_size = (int)src.elemSize();
    double ifx = 1./fx, ify = 1./fy;
    int x;

    for( x = 0; x < dsize.width; x++ )
    {
        int sx = cvFloor(x*ifx);
        x_ofs[x] = std::min(sx, ssize.width-1)*pix_size;
    }

    cv::Range range(0, dsize.height);

    // !! define the instance of resizeNNInvoker functor.
    resizeNNInvoker invoker(src, dst, x_ofs, ify);

#if 0
    cv::parallel_for_(range, invoker);   //!! use opencv's, cost 3.24 ms
#elif 0
    my_parallel_for_(range, invoker);    //!! use own implementation, cost 7.67 ms
#else
    set_omp_dynamic(1);    //!! use inplace-implementation, cost 7.75 ms
    cv::Range stripeRange = range;
    #pragma omp parallel for schedule(dynamic) num_threads(4)
    for (int i = stripeRange.start; i < stripeRange.end; ++i)
        invoker(cv::Range(i, i + 1));
#endif
}

ANSWER

Answered 2021-Aug-11 at 06:54

This is the code from OpenCV to select the actual threading framework:

#ifdef CV_PARALLEL_FRAMEWORK
#if defined HAVE_TBB

#if TBB_INTERFACE_VERSION >= 8000
        tbbArena.execute(pbody);
#else
        pbody();
#endif

#elif defined HAVE_HPX
        pbody();

#elif defined HAVE_OPENMP

        #pragma omp parallel for schedule(dynamic) num_threads(numThreads > 0 ? numThreads : numThreadsMax)
        for (int i = stripeRange.start; i < stripeRange.end; ++i)
            pbody(Range(i, i + 1));

#elif defined HAVE_GCD

        dispatch_queue_t concurrent_queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
        dispatch_apply_f(stripeRange.end - stripeRange.start, concurrent_queue, &pbody, block_function);

#elif defined WINRT

        Concurrency::parallel_for(stripeRange.start, stripeRange.end, pbody);

#elif defined HAVE_CONCURRENCY

        if(!pplScheduler || pplScheduler->Id() == Concurrency::CurrentScheduler::Id())
        {
            Concurrency::parallel_for(stripeRange.start, stripeRange.end, pbody);
        }
        else
        {
            pplScheduler->Attach();
            Concurrency::parallel_for(stripeRange.start, stripeRange.end, pbody);
            Concurrency::CurrentScheduler::Detach();
        }

#elif defined HAVE_PTHREADS_PF

        parallel_for_pthreads(pbody.stripeRange(), pbody, pbody.stripeRange().size());

#else

#error You have hacked and compiling with unsupported parallel framework

#endif

        ctx.finalize();  // propagate exceptions if exists
        return;
#endif // CV_PARALLEL_FRAMEWORK

So this is the order of prioritization:

  1. TBB task arena
  2. TBB
  3. HPX
  4. OPENMP
  5. Apple GCD
  6. WINRT concurrency
  7. Windows Concurrency
  8. PThread

Maybe your OpenCV parallel_for_ uses TBB while your code uses OpenMP?

I am not sure whether it is possible, but you could try to use OpenMP from OpenCV explicitly, e.g. cv::parallel::openmp::parallel_for in C++.
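
For reference, OpenCV 4.5.2 and later also expose a runtime backend selector. A minimal sketch, assuming the OpenCV build actually ships the OpenMP backend:

#include <opencv2/core.hpp>
#include <opencv2/core/parallel/parallel_backend.hpp>

int main()
{
    // Returns false if this OpenCV build has no OpenMP backend available.
    bool ok = cv::parallel::setParallelForBackend("openmp");
    cv::setNumThreads(4);
    // ... run the cv::parallel_for_ workload here ...
    return ok ? 0 : 1;
}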

Source https://stackoverflow.com/questions/68727478

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

Vulnerabilities

No vulnerabilities reported

Install ncnn

You can download it from GitHub.
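
A typical from-source build on Linux might look like the following sketch, based on the project's standard CMake workflow (-DNCNN_VULKAN=ON is optional and assumes the Vulkan SDK is installed; the submodules are needed for the Vulkan path):

# clone ncnn together with its submodules (glslang is needed when Vulkan is enabled)
git clone https://github.com/Tencent/ncnn.git
cd ncnn
git submodule update --init

# configure and build a Release library, then install it
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DNCNN_VULKAN=ON ..
make -j$(nproc)
make install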

Support

✅ = known to work and runs fast with good optimization
✔️ = known to work, but speed may not be fast enough
❔ = should work, but not confirmed
/ = not applicable
