pdfx | Extract text, metadata and references (pdf, url, doi) | Document Editor library

by metachris | Python | Version: 1.4.1 | License: Apache-2.0

kandi X-RAY | pdfx Summary

pdfx is a Python library typically used in Editor and Document Editor applications. It has no bugs, no vulnerabilities, an available build file, a permissive license, and medium support. You can install it with 'pip install pdfx' or download it from GitHub or PyPI.

Extract references (pdf, url, doi, arxiv) and metadata from a PDF. Optionally download all referenced PDFs and check for broken links.

            Support

              pdfx has a medium active ecosystem.
              It has 966 star(s) with 112 fork(s). There are 38 watchers for this library.
              It had no major release in the last 12 months.
              There are 20 open issues and 23 have been closed. On average issues are closed in 713 days. There are 5 open pull requests and 0 closed requests.
              It has a neutral sentiment in the developer community.
              The latest version of pdfx is 1.4.1

            Quality

              pdfx has 0 bugs and 0 code smells.

            Security

              pdfx has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              pdfx code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            License

              pdfx is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            Reuse

              pdfx releases are available to install and integrate.
              Deployable package is available in PyPI.
              Build file is available. You can build the component from source.
              Installation instructions, examples and code snippets are available.
              It has 796 lines of code, 54 functions and 13 files.
              It has medium code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi has reviewed pdfx and identified the functions below as its top functions. This is intended to give you instant insight into the functionality pdfx implements and to help you decide whether it suits your requirements.
            • Create argument parser
            • Check URLs
            • Return the status code of a given URL
            • Sanitize url
            • Wait for all tasks to finish
            • Color print a string
            • Download PDFs to specified directory
            • Download one or more URLs
            • Return a list of all references
            • Return the text output of a PDF
            • Return the number of references in the file
            • Return a dictionary containing the meta-information of the description
            • Parses the value element
            • Parse tag element
            • Print text to the console
            • Extract the arXiv
            • Parse a requirements file
            • Cleanup metadata for all keys
            • Remove trailing whitespace
            • Wait for completion
            • Print an error code to stderr
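
Several of the functions listed above (returning references, returning metadata, downloading PDFs) correspond to methods on the library's `PDFx` class. The following is a minimal sketch rather than an authoritative example: the helper `summarize_references` and the file names are hypothetical, and it assumes `get_references_as_dict()` returns a mapping from reference type (such as `"pdf"` or `"url"`) to a list of references.

```python
def summarize_references(references_by_type):
    """Count references per type, e.g. {'pdf': 2, 'url': 1}.

    Assumes a mapping of reference type -> list of references, which is
    the shape pdfx's get_references_as_dict() is expected to return.
    """
    return {ref_type: len(items) for ref_type, items in references_by_type.items()}


# Typical use with pdfx (requires `pip install pdfx`; paths are placeholders):
#
#   import pdfx
#   pdf = pdfx.PDFx("paper.pdf")          # local path or URL
#   print(pdf.get_metadata())             # dict of document metadata
#   print(summarize_references(pdf.get_references_as_dict()))
#   pdf.download_pdfs("downloads/")       # optionally fetch referenced PDFs
```

The counting helper is kept separate from the pdfx calls so it can be tried on a plain dictionary without a PDF at hand.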

            Community Discussions

            QUESTION

            Extract Hyperlink from a spool pdf file in Python
            Asked 2021-Nov-13 at 19:39

            I am getting my form data from frontend and reading it using fast api as shown below:

            ...

            ANSWER

            Answered 2021-Nov-13 at 19:39

            FastAPI gives you a SpooledTemporaryFile. You may be able to use that file object directly if pdfx has an API that works on a file object rather than a str representing a path. Otherwise, write it to a new temporary file on disk and work with that.
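
The fallback described in this answer, copying the spooled upload to a real file on disk, might look like the sketch below. The helper name `save_upload_to_disk` is hypothetical; it uses only the standard library, and the pdfx call in the trailing comment assumes `pdfx.PDFx` accepts a file path.

```python
import shutil
import tempfile


def save_upload_to_disk(fileobj, suffix=".pdf"):
    """Copy a file-like object (e.g. a SpooledTemporaryFile) to a named
    temporary file and return its path. The caller deletes the file."""
    fileobj.seek(0)  # rewind in case the upload was already read
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        shutil.copyfileobj(fileobj, tmp)
        return tmp.name


# Hypothetical use inside a FastAPI handler, where `upload.file` is the
# SpooledTemporaryFile:
#
#   path = save_upload_to_disk(upload.file)
#   pdf = pdfx.PDFx(path)
#   references = pdf.get_references_as_dict()
#   os.remove(path)
```

`delete=False` keeps the file around after the `with` block so a path-based library can open it; the caller is then responsible for removing it.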

            Source https://stackoverflow.com/questions/69956921

            QUESTION

            PDF to XML conversion using PDFX http://pdfx.cs.man.ac.uk/
            Asked 2021-Sep-15 at 15:21

            I'm aware that PDFX is a rule-based system designed to reconstruct the logical structure of scholarly articles in PDF form, regardless of their formatting style. The system's output is an XML document that describes the input article's logical structure in terms of title, sections, tables, references, etc.

            I've been trying to convert some PDF files into XML using PDFX on python but http://pdfx.cs.man.ac.uk/ is not responding.

            The code I use for the conversion is:

            response = requests.post('http://pdfx.cs.man.ac.uk/', headers=headers, data=data)

            Is it still available? Is there any other option to convert the documents reconstructing the structure of scholarly articles?

            Thanks in advance!

            ...

            ANSWER

            Answered 2021-Sep-15 at 15:21

            From the research I've been doing these days, I found a similar tool called GROBID.

            Home page: https://grobid.readthedocs.io/en/latest/

            GitHub: https://github.com/kermitt2/grobid

            It is machine learning software for extracting information from scholarly documents.

            Source https://stackoverflow.com/questions/68942805

            QUESTION

            Convert in R a XML with ASCII Entity Names to a basic XML
            Asked 2021-May-30 at 21:50

            I have the following XML file:

            ...

            ANSWER

            Answered 2021-May-30 at 16:16

            The solution is a multi-step process: extract the database node, convert it to text, clean it up, and then convert it back to XML with the read_xml() function.

            Source https://stackoverflow.com/questions/67762928

            QUESTION

            Getting an error from cete Dynamic pdf: ceTe.DynamicPDF.Imaging.ImageParsingException: TIFF Compression value 8 (Flate/Deflate) is not supported
            Asked 2021-Apr-26 at 15:14

            I'm getting this error from cete DynamicPdf, any help would be appreciated, I have posted it on their forum but not getting a response. It happens when I call the draw method to create a Pdf after merging several documents into one. Is there a workaround for it?

            Stacktrace follows:

            Traceback (most recent call last):
              File "(unknown)", line 235, in ceTe.DynamicPDF.Imaging.TiffImageData.a()
              File "(unknown)", line 19, in ceTe.DynamicPDF.Imaging.TiffImageData.c()
              File "(unknown)", line unknown, in ceTe.DynamicPDF.Imaging.TiffImageData.Draw(ceTe.DynamicPDF.IO.OperatorWriter writer, System.Single pdfX, System.Single pdfY, System.Single width, System.Single height)
              File "(unknown)", line 617, in ceTe.DynamicPDF.PageElements.Image.DrawRotated(ceTe.DynamicPDF.IO.PageWriter writer)
              File "(unknown)", line 103, in ceTe.DynamicPDF.PageElements.RotatingPageElement.Draw(ceTe.DynamicPDF.IO.PageWriter writer)
              File "(unknown)", line unknown, in ceTe.DynamicPDF.PageElements.Group.Draw(ceTe.DynamicPDF.IO.PageWriter writer)
              File "(unknown)", line 14, in ceTe.DynamicPDF.Page.b(ceTe.DynamicPDF.IO.PageWriter A_0)
              File "(unknown)", line 136, in ceTe.DynamicPDF.Page.fd(ceTe.DynamicPDF.IO.DocumentWriter A_0, System.Int32 A_1, System.Int32 A_2)
              File "(unknown)", line 178, in ceTe.DynamicPDF.Page.a(ceTe.DynamicPDF.IO.DocumentWriter A_0, System.Int32 A_1, System.Int32 A_2, System.Int32 A_3)
              File "(unknown)", line 166, in zz93.b1.f()
              File "(unknown)", line 419, in zz93.b1.Draw()
              File "(unknown)", line 1, in ceTe.DynamicPDF.Document.Draw(System.IO.Stream stream)
              File "(unknown)", line 15, in ceTe.DynamicPDF.Document.Draw()
              File "C:\Users\Simon\Source\Workspaces\PubManager\PubManager\PdfManager\Printing.cs", line 1505, in PubManager.PdfManager.Printing+d__3.MoveNext()
              File "(unknown)", line 12, in System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
              File "(unknown)", line 40, in System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task task)
              File "C:\Users\Simon\Source\Workspaces\PubManager\PubManager\PubManager\Controllers\PageLayoutController.cs", line 598, in PubManager.Controllers.PageLayoutController+d__24.MoveNext()
              File "(unknown)", line 12, in System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
              File "(unknown)", line 40, in System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task task)
              File "(unknown)", line unknown, in System.Web.Mvc.Async.TaskAsyncActionDescriptor.EndExecute(System.IAsyncResult asyncResult)
              File "(unknown)", line unknown, in System.Web.Mvc.Async.AsyncControllerActionInvoker+<>c__DisplayClass37.b__36(System.IAsyncResult asyncResult)
              File "(unknown)", line unknown, in System.Web.Mvc.Async.AsyncControllerActionInvoker.EndInvokeActionMethod(System.IAsyncResult asyncResult)
              File "(unknown)", line 20, in System.Web.Mvc.Async.AsyncControllerActionInvoker+AsyncInvocationWithFilters.b__3d()
              File "(unknown)", line 134, in System.Web.Mvc.Async.AsyncControllerActionInvoker+AsyncInvocationWithFilters+<>c__DisplayClass46.b__3f()
              File "(unknown)", line 134, in System.Web.Mvc.Async.AsyncControllerActionInvoker+AsyncInvocationWithFilters+<>c__DisplayClass46.b__3f()
              File "(unknown)", line unknown, in System.Web.Mvc.Async.AsyncControllerActionInvoker.EndInvokeActionMethodWithFilters(System.IAsyncResult asyncResult)
              File "(unknown)", line unknown, in System.Web.Mvc.Async.AsyncControllerActionInvoker+<>c__DisplayClass21+<>c__DisplayClass2b.b__1c()
              File "(unknown)", line unknown, in System.Web.Mvc.Async.AsyncControllerActionInvoker+<>c__DisplayClass21.b__1e(System.IAsyncResult asyncResult)
            ceTe.DynamicPDF.Imaging.ImageParsingException: TIFF Compression value 8 (Flate/Deflate) is not supported.

            ...

            ANSWER

            Answered 2021-Apr-26 at 15:14

            It looks like TIFF images with compression value 7 or above are not supported. Have a look at this thread: https://www.dynamicpdf.com/forums/core-suite-for-net-v9/tiff-compression-value-7-extended-jpeg-not-supported

            Source https://stackoverflow.com/questions/67268426

            QUESTION

            How do you skip over files with no extension when downloading them?
            Asked 2020-Jul-17 at 19:58

            My code is working correctly to scour a directory of PDFs, download weblinks embedded within those PDFs, and sequentially name them with appropriate file extension.

            That being said - I am getting a few random files that download but DON'T have an extension associated with them. In doing quality checks, I have all the attachments that matter - these extra files are truly garbage.

            Is there a way to not download them or build in a check in the code so that I don't end up with these phantom files?

            ...

            ANSWER

            Answered 2020-Jul-17 at 19:58

            It's not clear which part of your code produces these "phantom" files, but anyplace you want to avoid downloading a file which doesn't have an extension, you can make the download conditional. If the component after the last slash doesn't contain a dot, do nothing.
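
The check described in this answer can be sketched as follows. The function name `has_file_extension` and the example URLs are hypothetical; the query string is dropped first so that a dot in a parameter value does not count as an extension.

```python
import posixpath
from urllib.parse import urlparse


def has_file_extension(url):
    """Return True if the last path component of the URL contains a dot."""
    path = urlparse(url).path          # drops the query string and fragment
    name = posixpath.basename(path)    # component after the last slash
    return "." in name


links = [
    "https://example.com/files/report.pdf",
    "https://example.com/files/report",
]
# Keep only links whose final path component looks like a file name.
to_download = [link for link in links if has_file_extension(link)]
```

Filtering the list before the download loop keeps the download code itself unchanged.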

            Source https://stackoverflow.com/questions/62957712

            QUESTION

            How do you correctly parse web links to avoid a 403 error when using Wget?
            Asked 2020-Jul-17 at 15:50

            I just started learning python yesterday and have VERY minimal coding skill. I am trying to write a python script that will process a folder of PDFs. Each PDF contains at least 1, and maybe as many as 15 or more, web links to supplemental documents. I think I'm off to a good start, but I'm having consistent "HTTP Error 403: Forbidden" errors when trying to use the wget function. I believe I'm just not parsing the web links correctly. I think the main issue is coming in because the web links are mostly "s3.amazonaws.com" links that are SUPER long.

            For reference:

            Link copied directly from PDF (works to download): https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG

            Link as it appears after trying to parse it in my code (doesn't work, gives "unknown url type" when trying to download): https%3A//s3.amazonaws.com/os_uploads/2169504_DFA%2520train%2520pass.PNG%3FAWSAccessKeyId%3DAKIAIPCTK7BDMEW7SP4Q%26Expires%3D1909634500%26Signature%3DaQlQXVR8UuYLtkzjvcKJ5tiVrZQ%253D%26response-content-disposition%3Dattachment%253B%2520filename%252A%253Dutf-8%2527%2527DFA%252520train%252520pass.PNG

            Additionally if people want to weigh in on how I'm doing this in a stupid way. Each PDF starts with a string of 6 digits, and once I download supplemental documents I want to auto save and name them as XXXXXX_attachY.* Where X is the identifying string of digits and Y just increases for each attachment. I haven't gotten my code to work enough to test that, but I'm fairly certain I don't have it correct either.

            Help!

            ...

            ANSWER

            Answered 2020-Jul-17 at 15:06

            I prefer to use requests (https://requests.readthedocs.io/en/master/) when trying to grab text or files online. I tried it quickly with wget and I got the same error (might be linked to user-agent HTTP headers used by wget).

            The good thing with requests is that it lets you modify HTTP headers the way you want (https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers).
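
A minimal sketch of this approach, under stated assumptions: the `User-Agent` value and the helper names `restore_url` and `download` are hypothetical. `restore_url` addresses the "unknown url type" symptom from the question by undoing one layer of percent-encoding when the scheme separator itself was escaped (`https%3A//`); it may not restore every link exactly.

```python
from urllib.parse import unquote

# Hypothetical browser-like User-Agent; some servers reject default client strings.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; pdf-link-fetcher/0.1)"}


def restore_url(link):
    """Undo one layer of percent-encoding if the scheme separator was escaped."""
    if link.lower().startswith(("http%3a", "https%3a")):
        return unquote(link)
    return link


def download(link, dest_path, headers=None, timeout=30):
    """Fetch a link with custom headers and write the body to dest_path."""
    import requests  # third-party (pip install requests); imported on first use

    response = requests.get(
        restore_url(link), headers=headers or DEFAULT_HEADERS, timeout=timeout
    )
    response.raise_for_status()  # raises on 403 and other HTTP error codes
    with open(dest_path, "wb") as fh:
        fh.write(response.content)
    return dest_path
```

Passing the original, unescaped link straight to `requests.get` is the simplest path; `restore_url` only matters when an earlier step has already quoted the whole URL.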

            Source https://stackoverflow.com/questions/62955392

            QUESTION

            Formatting a table based on Criteria in R
            Asked 2020-May-18 at 09:54

            I have a dataset which is generated using this

            ...

            ANSWER

            Answered 2020-May-18 at 07:59

            We can get the data in long format, remove rows with 0 values, count number of rows for each day and class and get data in wide format again.

            Source https://stackoverflow.com/questions/61864274

            QUESTION

            Ghostscript pdf conversion makes ligatures unable to copy & paste
            Asked 2020-Jan-07 at 19:27

            I have a pdf (created with latex with \usepackage[a-2b]{pdfx}) where I am able to correctly copy & paste ligatures, i.e., "fi" gets pasted in my text editor as "fi". The pdf is quite large, so I'm trying to reduce its size with this ghostscript command:

            ...

            ANSWER

            Answered 2020-Jan-07 at 19:27

            Without seeing your original file, so that I can see the way the text is encoded, it's not possible to be definitive. It certainly is not possible to have the pdfwrite device 'preserve the information', for an explanation, see here.

            If your original PDF file has a ToUnicode CMap then the pdfwrite device should use that to generate a new ToUnicode CMap in the output file, maintaining cut&paste/search. If it doesn't then the conversion process will destroy the encoding. You might be able to get an improvement in results by setting SubsetFonts to false, but it's just a guess without seeing an example.

            My guess is that your original file doesn't have a ToUnicode CMap, which means that it's essentially only working by luck.

            Source https://stackoverflow.com/questions/59632621

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install pdfx

            Grab a copy of the code with easy_install or pip, and run it.

            Support

            For new features, suggestions, and bugs, create an issue on GitHub. If you have questions, check and ask on the Stack Overflow community page.

            Install
          • PyPI

            pip install pdfx

          • CLONE
          • HTTPS

            https://github.com/metachris/pdfx.git

          • CLI

            gh repo clone metachris/pdfx

          • sshUrl

            git@github.com:metachris/pdfx.git
