face-alignment | 7000FPS face alignment | Graphics library
kandi X-RAY | face-alignment Summary
7000+FPS face alignment
Community Discussions
Trending Discussions on face-alignment
QUESTION
I am very new to Torch/CUDA, and I'm trying to test the small binary network (~1.5 MB) from https://github.com/1adrianb/binary-face-alignment, but I keep running into 'out of memory' errors.
I am using a relatively weak GPU (NVIDIA Quadro K600) with ~900 MB of graphics memory on Ubuntu 16.04 with CUDA 10.0 and cuDNN 5.1. I don't really care about performance, but I thought I would at least be able to run a small network for prediction, one image at a time (especially one that is supposedly aimed at those "with Limited Resources").
I managed to run the code in headless mode and measured the memory consumption at around 700 MB, which would explain why it fails immediately when an X server is running, since that alone takes around 250 MB of GPU memory.
I also added some logs to see how far along main.lua I get, and it's the call output:copy(model:forward(img)) on the very first image that runs out of memory.
For reference, here's the main.lua code up until the crash:
...
ANSWER
Answered 2019-Apr-11 at 20:18
What usually consumes most of the memory are the activation maps (and gradients, when training). I am not familiar with this particular model and implementation, but I would say that you are using a "fake" binary network; by fake I mean they still use floating-point numbers to represent the binary values, since most users are going to run the code on GPUs that do not fully support real binary operations. The authors even write in Section 5:
Performance. In theory, by replacing all floating-point multiplications with bitwise XOR and making use of the SWAR (Single instruction, multiple data within a register) [5], [6], the number of operations can be reduced up to 32x when compared against the multiplication-based convolution. However, in our tests, we observed speedups of up to 3.5x, when compared against cuBLAS, for matrix multiplications, a result being in accordance with those reported in [6]. We note that we did not conduct experiments on CPUs. However, given the fact that we used the same method for binarization as in [5], similar improvements in terms of speed, of the order of 58x, are to be expected: as the real-valued network takes 0.67 seconds to do a forward pass on a i7-3820 using a single core, a speedup close to x58 will allow the system to run in real-time. In terms of memory compression, by removing the biases, which have minimum impact (or no impact at all) on performance, and by grouping and storing every 32 weights in one variable, we can achieve a compression rate of 39x when compared against the single precision counterpart of Torch.
In this context, a small model (w.r.t. number of parameters or model size in MiB) does not necessarily mean low memory footprint. It is likely that all this memory is being used to store the activation maps in single- or double-precision.
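To make the point concrete, here is a rough back-of-the-envelope sketch in Python. The layer shapes are hypothetical and are not taken from the binary-face-alignment code; the sketch also assumes every layer's output is kept in single precision for the whole forward pass, which a real framework may or may not do. It only illustrates how a model whose bit-packed weights fit in about 1.5 MB can still need far more memory for its activations.

```python
# Back-of-the-envelope comparison of weight memory vs. activation memory.
# All layer shapes are hypothetical, NOT taken from binary-face-alignment.

BYTES_FP32 = 4  # single-precision float


def conv_param_count(c_in, c_out, k):
    """Number of weights in a k x k convolution without biases."""
    return c_in * c_out * k * k


def activation_bytes(batch, channels, height, width, bytes_per_value=BYTES_FP32):
    """Bytes needed to hold one output feature map."""
    return batch * channels * height * width * bytes_per_value


# Hypothetical stack: 20 conv layers, 256 -> 256 channels,
# 3x3 kernels, 64x64 feature maps, batch size 1.
layers = [(256, 256, 3, 64, 64)] * 20

params = sum(conv_param_count(ci, co, k) for ci, co, k, _, _ in layers)

# On disk the weights can be packed at 1 bit each ...
weights_packed_bytes = params // 8
# ... but a "fake" binary network stores them as floats at run time.
weights_fp32_bytes = params * BYTES_FP32

# Assumption: every layer's output stays allocated in single precision.
acts_bytes = sum(activation_bytes(1, co, h, w) for _, co, _, h, w in layers)

MIB = 1024 * 1024
print(f"packed weights: {weights_packed_bytes / MIB:.2f} MiB")
print(f"fp32 weights:   {weights_fp32_bytes / MIB:.2f} MiB")
print(f"activations:    {acts_bytes / MIB:.2f} MiB")
```

The numbers for the actual model will differ, and the framework adds its own overhead (CUDA context, cuDNN workspaces, cached buffers), so treat this as an order-of-magnitude argument only.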
Community Discussions, Code Snippets contain sources that include Stack Exchange Network