xpdf | xpdf with local changes | Document Editor library
kandi X-RAY | xpdf Summary
Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are sometimes also called Acrobat files, from the name of Adobe’s PDF software.) The Xpdf project also includes a PDF text extractor, a PDF-to-PostScript converter, and various other utilities. Xpdf runs under the X Window System on UNIX, VMS, and OS/2. The non-X components (pdftops, pdftotext, etc.) also run on Windows and Mac OS X systems and should run on pretty much any system with a decent C++ compiler. Xpdf runs on both 32-bit and 64-bit machines.
Community Discussions
Trending Discussions on xpdf
QUESTION
Goal: to run this Auto Labelling Notebook on AWS SageMaker Jupyter Labs.
Kernels tried: conda_pytorch_p36, conda_python3, conda_amazonei_mxnet_p27.
ANSWER
Answered 2022-Feb-03 at 09:29
I would recommend downgrading your Milvus version to one from before the 2.0 release, which came out just a week ago. Here is a discussion on that topic: https://github.com/deepset-ai/haystack/issues/2081
QUESTION
I have a Ruby on Rails app that uses pandoc-ruby to convert markdown files into PDF.
pandoc-ruby requires a pandoc installation. To successfully convert to PDF, pdflatex needs to be present as well. Locally (tested on Mac and Ubuntu 18.04) everything works if the pandoc, texlive-latex-recommended and texlive-fonts-recommended packages are installed. Things get a little bit tricky when deploying to Heroku.
To install all the packages on Heroku I've used the Aptfile approach, but I have not been able to get it to work.
Approach 1: Aptfile
I've specified this Aptfile:
...ANSWER
Answered 2021-Jan-25 at 19:29
After quite a bit of trial and error, I have found a solution that works.
As @mb21 mentioned, a Docker image would probably be the best option long term. Docker images are supported on Heroku. However, I wanted to avoid dockerizing the whole application to solve this issue.
After finding a TeX Live buildpack for Heroku that supports adding custom TeX Live packages (one example of such a buildpack), the error on conversion was ! LaTeX Error: File 'xcolor.sty' not found.
I used tlmgr to get some info on the missing file. Running tlmgr search --global --file xcolor.sty does the trick and reveals that there is a package called xcolor. After installing that we come to the next error, and the next, and the next. In the end I installed 2 collections that are small enough for Heroku (mind the 500MB slug size limit) and contain everything pandoc needs for a successful conversion. Those 2 are collection-fontsrecommended and collection-latexrecommended.
Adding a texlive.packages file to the root of the application does the trick. It is recognized by the buildpack, which installs all the specified packages for you using tlmgr.
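As a rough illustration of that last step, the sketch below writes a texlive.packages file containing the two collections named above. The one-package-per-line layout and the helper name write_texlive_packages are assumptions made for this example; check the buildpack's own README for the format it expects.

```python
from pathlib import Path
from typing import Iterable

# The two TeX Live collections named in the answer above.
COLLECTIONS = ["collection-fontsrecommended", "collection-latexrecommended"]


def write_texlive_packages(app_root: str, packages: Iterable[str] = COLLECTIONS) -> Path:
    """Write texlive.packages into app_root.

    Assumption: the buildpack reads one package name per line.
    """
    target = Path(app_root) / "texlive.packages"
    target.write_text("\n".join(packages) + "\n", encoding="utf-8")
    return target
```

Calling write_texlive_packages(".") from the application root would create the file next to the Gemfile; in practice the file is small enough to write by hand and commit.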
QUESTION
I am trying to build a minimalistic Docker image for one of my applications.
In my "usual" builds I do not rely on third-party applications. This time I need to include a precompiled executable (xpdf) in the build. My Go applications are prebuilt in a builder Docker image and then copied over (no dependencies).
My current Dockerimage file looks like this (working, the application launches):
ANSWER
Answered 2021-Jan-19 at 09:19
Create a Docker image from which you can grab all the required libraries.
QUESTION
I need help for the following problem: I'd like to kill all instances of a program, let's say xpdf. At the prompt the following works as intended:
...ANSWER
Answered 2021-Jan-02 at 12:47
This answer is for the case where killall or pkill (suggested in this answer) is not enough for you: for example, if you really want to print "xpdf läuft nicht" (German for "xpdf is not running") when there is no PID to kill, or if you want to apply kill -SIGTERM to be sure of the signal you send to your PIDs.
You could use a bash loop instead of xargs and sed. It's pretty simple to iterate over CSV/column outputs:
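The answer's bash loop is not reproduced on this page. As a stand-in, here is a minimal Python sketch of the same idea: read the two-column output of ps line by line, pick out the matching PIDs, and send SIGTERM to each. The function names pids_of and terminate_all are made up for this example, and the ps invocation is an assumption about how the process list is obtained.

```python
import os
import signal
import subprocess
from typing import List


def pids_of(ps_output: str, program: str) -> List[int]:
    """Return the PIDs whose command name equals `program`.

    Expects two-column text: a PID, whitespace, then the command name.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 1)  # split into PID and the rest of the line
        if len(fields) == 2 and fields[1].strip() == program:
            pids.append(int(fields[0]))
    return pids


def terminate_all(program: str) -> None:
    """Send SIGTERM to every running instance of `program`."""
    # Empty headers ("pid=", "comm=") suppress the header line.
    listing = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    pids = pids_of(listing, program)
    if not pids:
        print(f"{program} läuft nicht")  # message taken from the question
        return
    for pid in pids:
        os.kill(pid, signal.SIGTERM)
```

Calling terminate_all("xpdf") would then either terminate every xpdf process or print the message. Keeping the parsing in its own function makes it possible to check it against sample text without touching any real process.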
QUESTION
In the Ghostscript documentation I did not find arguments to query the paper sizes of a PDF document.
I read about a pdf_info.ps file in the lib subdirectory.
I tried this code:
...ANSWER
Answered 2020-Jun-23 at 16:08
Recent versions of Ghostscript run in SAFER mode by default, which prevents PostScript programs (like pdf_info.ps) from accessing files in the file system.
In general Ghostscript will try and infer from the command line when files should be permitted (such as the input filename, in the case above pdf_info.ps) but it can't know that -sFile= should be permitted, because that part of the command simply ends up in the PostScript interpreter.
So to use pdf_info.ps you will either have to set -dNOSAFER or add --permit-file-read= to your command line. -dNOSAFER turns off all protection, so you may not want to do that; --permit-file-read allows the PostScript program to read the specified directory only. I'd recommend the latter.
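Since the question's original command is elided here, the following is only a sketch of what a --permit-file-read invocation might look like when assembled from Python. The helper name pdf_info_command, the script location, and the choice to permit exactly the input PDF are all assumptions for illustration; the gs executable name also differs on Windows.

```python
from typing import List


def pdf_info_command(pdf_path: str, script: str = "pdf_info.ps") -> List[str]:
    """Build a Ghostscript command that runs pdf_info.ps on pdf_path.

    SAFER mode stays on; only the named PDF is added to the read-permit list.
    """
    return [
        "gs",
        "-q",                                # suppress the startup banner
        "-dNODISPLAY",                       # no output device is needed
        f"--permit-file-read={pdf_path}",    # let the PostScript program read this file
        f"-sFile={pdf_path}",                # the parameter pdf_info.ps reads
        script,                              # assumed location of pdf_info.ps
    ]
```

The resulting list can be passed to subprocess.run once it has been verified by hand on the command line, which keeps the experimentation step suggested in the answer separate from the application code.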
I'd also suggest you experiment from the command line using the usual Ghostscript executable and only move to your application when you have it correct.
If you are planning to distribute this application, please have a look at the license file.
QUESTION
For machine learning purposes (scikit-learn), I need to extract the raw text from lots of PDF files. First off, I was using xpdf pdftotext to do this task:
...ANSWER
Answered 2020-May-18 at 17:50
There are two fairly simple techniques you can use.
1) Google's "Tesseract" open source OCR (optical character recognition). You could apply this evenly to all PDFs, though converting all that data into pixels and then working magic upon them is going to be more computationally expensive. Which is more important, engineer time or CPU time? There's a pytesseract module. Note that this tool works on image formats, so you'd have to use something like Ghostscript (another open source project) to convert all of a PDF's pages to images, then run [py]tesseract on those images.
2) pyPDF can get each page and programmatically extract any text draw operations in the order they were drawn onto the page. This may be nothing like the logical reading order of the page. While a PDF could draw all the 'a's and then all the 'b's (and so forth), it's actually more efficient to draw everything in "font a", then everything in "font b". It's important to note that "font b" might just be the italic version of "font a". This produces a shorter, more efficient stream of drawing commands, though probably not by enough to make it a worthwhile business decision.
The kicker here is that a random pile of PDF files might require you to do some OCR. A poorly assembled PDF (one with a font subset that has no "to unicode" data) can't be properly mined for text even though it has nothing but text drawing operations. "Draw glyphs one through five from font C" doesn't mean much if you don't know that those first five glyphs are "g-l-y-p-h", because that's the order they were used in.
On the other hand, if you've got home-grown PDFs or all your pdfs are from some known source (Word's pdf converter for example), you'll know what to expect in advance.
Note that the only thing mentioned above that I've actually used is Ghostscript. I remember it having a solid command line interface we used to generate images for some online PDF viewer Many Years Ago.
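To make the first technique concrete, here is a minimal sketch of the rasterize-then-OCR route. It assumes Ghostscript is installed as gs and that the third-party pytesseract and Pillow packages are available; the function names, the 300 dpi resolution, and the output file pattern are choices made for this example, not anything the answer specifies.

```python
from typing import List


def rasterize_command(pdf_path: str, out_pattern: str = "page-%03d.png",
                      dpi: int = 300) -> List[str]:
    """Build a Ghostscript command that renders each PDF page to a PNG file."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=png16m",                  # 24-bit colour PNG output
        f"-r{dpi}",                         # rendering resolution
        f"-sOutputFile={out_pattern}",      # %03d expands to the page number
        pdf_path,
    ]


def ocr_image(image_path: str) -> str:
    """Run Tesseract on one rendered page and return the recognised text."""
    # Imported lazily so the command builder works without these packages.
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(image_path))
```

One would run the command with subprocess.run(rasterize_command("input.pdf"), check=True) and then call ocr_image on each generated page file. Higher resolutions tend to help recognition at the cost of more CPU time, which is the trade-off the answer points out.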
QUESTION
I have the following code, which counts the number of PDFs in specific folders, and counts the number of sheets in those specific PDFs, and sends an email with this data.
I've anonymised part of the script.
...ANSWER
Answered 2020-May-08 at 12:18
It's an HTML encoding issue. I think you need to use the following code.
QUESTION
I'm generating PDF files using PDFsharp, and I need to overlay the PDF I'm generating with a specific page from another PDF.
I've created this method:
...ANSWER
Answered 2020-Mar-17 at 12:32
You can append the page number to the name of the PDF file, separated with a hash sign ("#").
To get page 7 of "sample.pdf", use the filename "sample.pdf#6" (zero-based page numbers).
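The off-by-one between human page numbers and the zero-based suffix is easy to get wrong, so here is a tiny sketch of the string construction. PDFsharp itself is a .NET library; this Python helper, including its name page_reference, is purely illustrative and only shows how the filename is formed.

```python
def page_reference(pdf_path: str, page_number: int) -> str:
    """Return "path#index" for a 1-based page number, using a 0-based index."""
    if page_number < 1:
        raise ValueError("page_number is 1-based and must be at least 1")
    return f"{pdf_path}#{page_number - 1}"
```

With this, page_reference("sample.pdf", 7) yields "sample.pdf#6", matching the example above.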
QUESTION
I'm trying to use pdftotext, but it won't import.
I'm running Windows 10 (64 bit) on a Lenovo IdeaPad S340, a work laptop.
Following the directions here and here (which were super helpful), I:
- Installed Microsoft Visual C++ Build Tools.
- Installed Anaconda.
- Got the latest version of Anaconda and updated it, using separate Anaconda3 commands for each of these steps. I don't recall the commands, and haven't found them again.
- Updated Microsoft Visual 14.
- Used conda to install poppler via Anaconda3 command:
conda install -c conda-forge poppler
- Used pip to install pdftotext via Anaconda3 command:
pip install pdftotext
After that:
This happens in the Python 3.8 (32 bit) command prompt:
...ANSWER
Answered 2020-Feb-11 at 09:20
Okay, I figured it out! If you install pdftotext using Anaconda and conda, then importing it seems to only work when you run it in the Python interpreter from within the Anaconda3 shell.
So, I had to switch to the Python interpreter mode in the Anaconda3 PowerShell first:
python
Then, I could import pdftotext with no error:
import pdftotext
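A quick way to see this kind of mismatch is to ask the running interpreter where it lives and whether it can locate the module at all. The sketch below uses only the standard library; the helper names module_visible and describe_environment are invented for this example.

```python
import importlib.util
import sys


def module_visible(name: str) -> bool:
    """Return True if this interpreter can locate the named top-level module."""
    return importlib.util.find_spec(name) is not None


def describe_environment(name: str = "pdftotext") -> str:
    """Report which interpreter is running and whether it can see the module."""
    status = "importable" if module_visible(name) else "NOT importable"
    return f"{sys.executable}: {name} is {status}"
```

Running print(describe_environment()) in both the plain Python 3.8 prompt and the Anaconda3 shell should show two different interpreter paths, with the module visible only in the environment where pip installed it.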
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported