agrep | approximate GREP for fast fuzzy string

by Wikinaut | C | Version: Current | License: Non-SPDX

kandi X-RAY | agrep Summary

agrep is a C library. It has no reported bugs or vulnerabilities, but it has low support and a Non-SPDX license. You can download it from GitHub.

AGREP - an approximate GREP. It searches files quickly for a string or regular expression, with approximate matching capabilities and user-definable records. Developed 1989-1991 by Udi Manber, Sun Wu et al. at the University of Arizona. AGREP is an essential part of Glimpse and WebGlimpse.
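
To make "approximate matching" concrete, here is a minimal Python sketch that reports the lines of a text containing a substring within a given edit distance of a pattern. It uses a plain dynamic-programming scan, which is an assumption chosen for readability and not necessarily the algorithm agrep implements; the function names `min_edit_distance_in` and `approx_grep` are hypothetical.

```python
def min_edit_distance_in(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    # prev[i] = cost of matching pattern[:i] against the best substring
    # that ends at the previous text position.
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for ch in text:
        curr = [0]  # a match may start anywhere, so the empty prefix is free
        for i, p in enumerate(pattern, start=1):
            curr.append(min(
                prev[i - 1] + (p != ch),  # exact match or substitution
                prev[i] + 1,              # extra character in the text
                curr[i - 1] + 1,          # pattern character missing from the text
            ))
        prev = curr
        best = min(best, prev[-1])
    return best


def approx_grep(pattern, lines, max_errors):
    """Return the lines that contain pattern with at most max_errors edits."""
    return [line for line in lines
            if min_edit_distance_in(pattern, line) <= max_errors]
```

For example, `approx_grep("approximate", ["an aproximate grep", "exact grep"], 1)` keeps only the first line, because "aproximate" is one deletion away from the pattern.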

            kandi-support Support

              agrep has a low active ecosystem.
              It has 245 star(s) with 43 fork(s). There are 14 watchers for this library.
              It had no major release in the last 6 months.
              There are 11 open issues and 5 have been closed. On average, issues are closed in 317 days. There is 1 open pull request and there are 0 closed pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of agrep is current.

            kandi-Quality Quality

              agrep has 0 bugs and 0 code smells.

            kandi-Security Security

              agrep has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              agrep code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            kandi-License License

              agrep has a Non-SPDX License.
              A Non-SPDX license can be an open source license that is not SPDX-compliant, or a non-open-source license, so you need to review it closely before use.

            kandi-Reuse Reuse

              agrep releases are not available. You will need to build from source code and install.
              Installation instructions are not available. Examples and code snippets are available.

            agrep Key Features

            No Key Features are available at this moment for agrep.

            agrep Examples and Code Snippets

            No Code Snippets are available at this moment for agrep.

            Community Discussions

            QUESTION

            Mean columns with almost the same name
            Asked 2021-Jun-25 at 14:46

            I have a data frame containing only one row with named columns. The data frame looks somewhat like this:

            ...

            ANSWER

            Answered 2021-Jun-23 at 16:47

            We could use split.default to split the data frame into a list based on a substring of the column names, then loop over the list with sapply and get the rowMeans in base R.
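
The same grouping idea can be sketched in Python for a single row stored as a dictionary. This is an illustrative analogue, not the R code from the answer; the rule for deriving a group name (dropping trailing digits) and the function name `mean_by_name_stem` are assumptions, since the original column names are not shown.

```python
import re
from collections import defaultdict
from statistics import mean


def mean_by_name_stem(row):
    """Average the values of columns whose names share the same stem."""
    groups = defaultdict(list)
    for name, value in row.items():
        # Hypothetical rule: columns differ only by a trailing number.
        stem = re.sub(r"\d+$", "", name)
        groups[stem].append(value)
    return {stem: mean(values) for stem, values in groups.items()}
```

Calling `mean_by_name_stem({"height1": 10, "height2": 20, "weight1": 3})` groups the two height columns together and leaves the weight column on its own.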

            Source https://stackoverflow.com/questions/68103538

            QUESTION

            Understanding constraints in agrep fuzzy matching in R
            Asked 2021-Jun-05 at 19:14

            This seems really simple but for some reason, I don't understand the behavior of agrep fuzzy matching involving substitutions. Two substitutions produce a match as expected when all=2 is specified, but not when substitutions=2. Why is this?

            ...

            ANSWER

            Answered 2021-Jun-05 at 19:14

            all is an upper limit which always applies, regardless of other max.distance controls (other than cost). It defaults to 10%.
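
The arithmetic behind this can be sketched in Python. The sketch assumes, following the description of max.distance in R's agrep help page, that a fractional limit is multiplied by the pattern length and rounded up to an integer; transformation costs are ignored, and the function names are hypothetical.

```python
import math


def effective_limit(limit, pattern_length, max_cost=1):
    """Turn a limit into an integer bound.

    Assumption: values below 1 are read as a fraction of
    pattern_length * max_cost and rounded up.
    """
    if limit < 1:
        return math.ceil(limit * pattern_length * max_cost)
    return int(limit)


def substitutions_allowed(n_subs, pattern_length, substitutions=None, all_limit=0.1):
    """True if n_subs substitutions fit under both the overall limit and
    the substitution limit. Costs are ignored in this sketch."""
    if n_subs > effective_limit(all_limit, pattern_length):
        return False
    if substitutions is not None and n_subs > effective_limit(substitutions, pattern_length):
        return False
    return True
```

With a five-character pattern, the default overall limit of 10% rounds up to one edit, so `substitutions_allowed(2, 5, substitutions=2)` is false, while `substitutions_allowed(2, 5, all_limit=2)` is true.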

            Source https://stackoverflow.com/questions/67828545

            QUESTION

            match two vectors by similar characters/strings in R
            Asked 2021-Jun-02 at 20:06

            I have two vectors, like

            ...

            ANSWER

            Answered 2021-Jun-02 at 19:59

            Maybe the following can solve your problem. It uses stringdistmatrix in package stringdist, which can become a memory problem if the vectors v1 and v2 are larger.
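
The memory concern is easy to see in a small Python sketch of the same idea: build the full matrix of pairwise distances, then pick the closest entry of v2 for each entry of v1. The sketch uses plain Levenshtein distance, which may differ from the metric used in the original answer, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def closest_matches(v1, v2):
    """Map each string in v1 to its nearest string in v2."""
    # The matrix has len(v1) * len(v2) entries, which is what grows
    # quickly when both vectors are large.
    matrix = [[levenshtein(a, b) for b in v2] for a in v1]
    return {a: v2[min(range(len(v2)), key=row.__getitem__)]
            for a, row in zip(v1, matrix)}
```

For two vectors of a million strings each, the matrix would hold 10^12 distances, so a row-at-a-time variant that keeps only the best match per row is the usual workaround.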

            Source https://stackoverflow.com/questions/67811356

            QUESTION

            awk unix - match regex - regex string size limit | ideas?
            Asked 2021-May-24 at 16:40

            The following code works as a minimal example. It searches a regular expression with one mismatch inside a text (later a large DNA file).

            ...

            ANSWER

            Answered 2021-May-20 at 07:31

            Is there another way to approach this?

            Looking for fuzzy matches is easy with Python. You just need to install the PyPi regex module by running the following in the terminal:
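
For comparison, a one-mismatch search can also be sketched with the standard library alone. This is not the code from the answer, which relies on the third-party regex module: the sketch below handles only a literal pattern (no regular expression syntax), counts substitutions only, and uses a hypothetical function name.

```python
def find_with_mismatches(pattern, text, max_mismatches=1):
    """Return (start, window, mismatches) for every window of text that
    differs from pattern in at most max_mismatches positions."""
    hits = []
    width = len(pattern)
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        mismatches = sum(p != c for p, c in zip(pattern, window))
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits
```

On a DNA-like string, `find_with_mismatches("ACGT", "TTACGTTTAGGTAA")` finds the exact occurrence at offset 2 and the one-mismatch occurrence "AGGT" at offset 8. The scan is linear in the text length times the pattern length, so a very large file is better processed in chunks.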

            Source https://stackoverflow.com/questions/67467829

            QUESTION

            Fuzzy-match and extract words repeated across turns in conversation
            Asked 2021-Apr-22 at 11:17

            I'm working on speech in conversational speaking turns and want to extract words that are repeated across turns. The task I'm grappling with is to extract words that are inexactly repeated.

            Data:

            ...

            ANSWER

            Answered 2021-Apr-22 at 11:17

            I think this can be done really well using a tidy approach. The problem you already solved can be done (probably much quicker) using tidytext:

            Source https://stackoverflow.com/questions/67209674

            QUESTION

            Extract a string of words between two specific words, but allow for mismatches in R
            Asked 2021-Apr-12 at 16:33

            I have the following string.

            ...

            ANSWER

            Answered 2021-Apr-12 at 16:33

            If I understood you correctly, you want to extract from your vector the verbs (i.e., the middle substring) iff the words on the left and on the right of it are at most 2 insertions/deletions etc. distant from the "today \\w+ Oscar" pattern.

            If that premise is correct you can first subset your vector on those strings that meet that condition using agrep (or agrepl) and second capture the substring in the middle in a capturing group (...) and refer to it using backreference \\1 in sub's replacement argument:

            Source https://stackoverflow.com/questions/67060015

            QUESTION

            Fuzzy matching strings within a single column and documenting possible matches
            Asked 2021-Mar-23 at 07:45

            I have a relatively large dataset of ~ 5k rows containing titles of journal/research papers. Here is a small sample of the dataset:

            ...

            ANSWER

            Answered 2021-Mar-23 at 02:31

            This isn't base R or data.table, but here's one way using tidyverse to detect duplicates:

            Source https://stackoverflow.com/questions/66756085

            QUESTION

            Calling the agrep .Internal C function from Rcpp
            Asked 2021-Feb-17 at 08:46

            In short: How can I call, from within Rcpp C++ code, the agrep C internal function that gets called when users use the regular agrep function from base R?

            In long: I have found multiple questions here about how to invoke, from within Rcpp, a C or C++ function created for another package (e.g. using C function from other package in Rcpp and Rcpp: Call C function from a package within Rcpp).

            The thing that I am trying to achieve, however, is at the same time simpler but also far less documented: it is to directly call, from within Rcpp, a .Internal C function that comes with base R rather than another package, without interfacing with R (that is, without doing what is said in Call R functions in Rcpp). How could I do that for the .Internal C function that lies underneath base R's agrep wrapper?

            The specific function I am trying to call here is the agrep internal C function. And for context, what I am ultimately trying to achieve is to speed-up a call to agrep for when millions of patterns must be each checked against each of millions of x targets.

            ...

            ANSWER

            Answered 2021-Feb-17 at 08:46

            Great question. The long and short of it is "You can't" (in many cases) unless the function is visible in one of the header files in "src/include/". At least not that easily.

            Not long ago I had a similar fun challenge, where I tried to get access to the do_docall function (called by do.call), and it is not a simple task. First of all, it is not directly possible to just #include (or something similar). That file simply isn't available for inclusion, as it is not a part of the "src/include". It is compiled and the uncompiled file is removed (not to mention that one should never "include" a .c file).

            If one is willing to go the mile, then the next step one could look at is "copying" and "altering" the source code. Basically find the function in "src/main/agrep.c", copy it into your package and then fix any errors you find.

            Problems with this approach:

            1. As documented in R-exts, the internal structures of sexprec_info are not made public (this is the base structure for all objects in R). Many internal functions use the fields within this structure, so one has to "copy" the structure into one's own source code to make it visible to that code specifically.
            2. If you ever #include prior to this file, you will need to go through each and every call to internal functions and likely add either R_ or Rf_.
            3. The function may contain calls to other "internal" functions, which also need to be copied and altered for it to work.
            4. You will also need to get a clear understanding of what CDR, CAR and similar do. The internal functions have a documented structure, where the first argument contains the full call passed to the function, and functions like those two are used to access parts of the call. I did myself a solid and rewrote do_docall, changing the input format, to avoid having to consider this. But this takes time. The alternative is to create a pairlist according to the documentation, set its type as a call-sexp (the exact name is lost to me at the moment) and pass the appropriate arguments for op, args and env.
            5. And lastly, if you go through the steps and find that it is necessary to copy the internal structures of sexprec_info (as described later), then you will need to be very careful about when you include Rinternals and Rcpp, as either of these causes your code to crash and burn in the most beautiful and silent way if you include your header and these in the wrong order! Note that this even goes for [[Rcpp::export]], which may indeed turn out to include them in the wrong arbitrary order!

            If you are willing to go this far down the drainage, I would suggest carefully reading adv-R "R's C interface" and Chapter 2, 5 and 6 of R-ext and maybe even the R internal manual, and finally once that is done take a look at do_docall from src/main/coerce.c and compare it to the implementation in my repository cmdline.arguments/src/utils/{cmd_coerce.h, cmd_coerce.c}. In this version I have

            1. Added all the internal structures that are not public, so that I can access their unmodified form (unmodified by the current session).
              • This includes the table used to store the currently used SEXP's, that was used as a lookup. This caused a problem as I can't access the modified version, so my code is slightly altered with the old code blocked by the macro #if --- defined(CMDLINE_ARGUMENTS_MAYBE_IN_THE_FUTURE). Luckily the code causing a problem had a static answer, so I could work around this (but this might not always be the case).
            2. I added quite a few Rf_s as their macro version is not available (since I #include at some point)
            3. The code has been split into smaller functions to make it more readable (for my own sake).
            4. The function has one additional argument (name), that is not used in the internal function, with some added errors (for my specific need).

            This implementation will be frozen "for all time to come" as I've moved on to another branch (and this one is frozen for my own future benefit, if I ever want to walk down this path again).

            I spent a few days scouring the internet for information on this and found 2 different posts talking about how this could be achieved, and my approach basically copies this. Whether this is actually allowed in a CRAN package is a whole other question (and not one that I will be testing out).

            The same approach applies if you want to use non-public code from other packages, although there it is often as simple as copying their files into your repository.

            As a final side note, you mention the intent is to "speed up" your code for when you have to perform millions upon millions of calls to agrep. It seems that this is a case where one should consider performing the task in parallel. Even after going through the steps outlined above, creating N parallel sessions to take care of K evaluations each (say 100,000) would be the first step to reduce computing time. Of course, each session should be given a batch and not a single call to agrep.
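
The batching idea can be sketched in Python as an analogue of running parallel R sessions. Everything here is an assumption for illustration: `fuzzy_match` is a placeholder that only tolerates substituted characters and is not agrep, and the batch size and worker count are arbitrary defaults.

```python
from concurrent.futures import ProcessPoolExecutor


def fuzzy_match(pattern, target, max_mismatches=1):
    """Placeholder matcher (not agrep): equal-length strings that differ
    in at most max_mismatches positions."""
    if len(pattern) != len(target):
        return False
    return sum(p != t for p, t in zip(pattern, target)) <= max_mismatches


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_batch(batch):
    """Worker: evaluate a whole batch of (pattern, target) pairs in one call."""
    return [fuzzy_match(pattern, target) for pattern, target in batch]


def run_parallel(pairs, batch_size=100_000, workers=4):
    """Distribute batches over worker processes and flatten the results,
    preserving the input order."""
    batches = make_batches(pairs, batch_size)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return [hit for result in pool.map(process_batch, batches) for hit in result]
```

Each worker receives a full batch, so the cost of starting a task and transferring data is paid once per batch and not once per comparison. On platforms that start workers by spawning a fresh interpreter, `run_parallel` should be called from under an `if __name__ == "__main__":` guard.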

            Source https://stackoverflow.com/questions/66235330

            QUESTION

            Does R have a distinction between character vectors and strings?
            Asked 2021-Feb-12 at 14:25

            As far as I know, what most languages call a string, R calls a character vector. For example, "Alice" is not a string, it's a character vector of length 1. Similarly, c("Alice", "Bob") is a character vector of length 2. I cannot recall my IDE or any of my work with R's type system telling me that R has any internal concept of "strings".

            Despite this, R's documentation frequently uses the word "string":

            • ?paste and ?nchar frequently talk of "character strings".
            • Many "See Also" sections mention strings without any qualifier, e.g. ?paste, ?chartr, and ?agrep.
            • ?strsplit mentions "substrings".
            • ?agrep, ?toString, and ?adist talk about strings both in their titles and "Description" sections.
            • strsplit, strwidth, and toString have string or a shorthand for it in their names.

            So does R actually have a concept of strings, or does it always mean exactly the same thing as "character vector"?

            ...

            ANSWER

            Answered 2021-Feb-12 at 14:25

            Converting my comment to an answer.

            A description of character and string can be found in the R Language Definition:

            R has six basic (‘atomic’) vector types: logical, integer, real, complex, string (or character) and raw. The modes and storage modes for the different vector types are listed in the following table.

            typeof      mode        storage.mode
            logical     logical     logical
            integer     numeric     integer
            double      numeric     double
            complex     complex     complex
            character   character   character
            raw         raw         raw

            [...]

            String vectors have mode and storage mode "character". A single element of a character vector is often referred to as a character string.

            Source https://stackoverflow.com/questions/66105136

            QUESTION

            R Add column to a list of data frame using for loop
            Asked 2020-Aug-14 at 06:18

            I have the following list of data frames.

            ...

            ANSWER

            Answered 2020-Aug-14 at 06:18

            The e in your for loop has no connection with the original ern list; hence, it is not possible to add any new information to the list through it. You should iterate over the index of the list instead.

            Source https://stackoverflow.com/questions/63407502

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install agrep

            You can download it from GitHub.

            Support

            For any new features, suggestions and bugs, create an issue on GitHub. If you have any questions, check the Stack Overflow community page and ask there.
            CLONE
          • HTTPS

            https://github.com/Wikinaut/agrep.git

          • CLI

            gh repo clone Wikinaut/agrep

          • sshUrl

            git@github.com:Wikinaut/agrep.git
