
OpenRefine | open source power tool for working with messy data

 by   OpenRefine Java Version: 3.5.2 License: BSD-3-Clause



kandi X-RAY | OpenRefine Summary

OpenRefine is a Java library typically used in data science applications. It has no reported bugs or vulnerabilities, ships with a build file, carries a permissive license, and has medium community support. You can download it from GitHub.
OpenRefine is a Java-based power tool that lets you load data, understand it, clean it up, reconcile it, and augment it with data from the web, all from a web browser and with the comfort and privacy of your own computer.

kandi-support Support

  • OpenRefine has a medium-active ecosystem.
  • It has 8,767 stars and 1,667 forks. There are 488 watchers for this library.
  • There were 5 major releases in the last 12 months.
  • There are 627 open issues and 1,883 closed issues; on average, issues are closed in 591 days. There are 39 open pull requests and 0 closed pull requests.
  • It has a neutral sentiment in the developer community.
  • The latest version of OpenRefine is 3.5.2.

kandi-quality Quality

  • OpenRefine has 0 bugs and 0 code smells.

kandi-security Security

  • OpenRefine has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
  • OpenRefine code analysis shows 0 unresolved vulnerabilities.
  • There are 0 security hotspots that need review.

kandi-license License

  • OpenRefine is licensed under the BSD-3-Clause License. This license is Permissive.
  • Permissive licenses have the least restrictions, and you can use them in most projects.

kandi-reuse Reuse

  • OpenRefine releases are available to install and integrate.
  • Build file is available. You can build the component from source.
  • Installation instructions are available. Examples and code snippets are not available.
  • OpenRefine saves you 87,305 person-hours of effort in developing the same functionality from scratch.
  • It has 100,859 lines of code, 5,388 functions and 1,368 files.
  • It has medium code complexity. Code complexity directly impacts the maintainability of the code.
Top functions reviewed by kandi - BETA

kandi has reviewed OpenRefine and identified the functions below as its top functions. This is intended to give you an instant insight into the functionality OpenRefine implements, and to help you decide whether it suits your requirements.

  • Parse a numeric token.
  • Retrieve data from a POST request.
  • Return the next token.
  • Encode the main loop.
  • Parse a factor.
  • Get the insert SQL.
  • Get the create SQL.
  • Export rows.
  • Retrieve the data directory.
  • Generate a serializable log event.

OpenRefine Key Features

OpenRefine is a free, open-source power tool for working with messy data and improving it.

OpenRefine sample extension not building

<version>3.6-SNAPSHOT</version>

Does OpenRefine support Python3?

# This Jython 2.7 script has to be executed as Jython, not GREL.
# It allows you to execute a command (CLI) in the terminal and retrieve the result.

# import basic libraries
import time
import commands
import random
# get status and output of the command
status, output = commands.getstatusoutput(value)
# add a random 2-5 s pause to avoid DDoSing servers... Be kind to APIs!
time.sleep(random.randint(2, 5))
# return the result of the command
return output.decode("utf-8")
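OpenRefine's embedded Jython is Python 2.7, so the `commands` module above has no drop-in replacement inside OpenRefine itself. Outside OpenRefine, the same idea can be sketched in Python 3 with `subprocess` (a standalone sketch, not something OpenRefine runs):

```python
import random
import subprocess
import time

def run_command(value):
    """Run a shell command and return its output, as the Jython snippet does."""
    # subprocess.getstatusoutput replaces the removed Python 2 commands module
    status, output = subprocess.getstatusoutput(value)
    # pause 2-5 s between calls to be kind to remote APIs, as in the snippet above
    time.sleep(random.randint(2, 5))
    return output
```

For example, `run_command("echo hello")` returns `"hello"` (trailing newline stripped, matching `commands.getstatusoutput`).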

OpenRefine: swapping order strings within a column of values

result = re.sub('(\d+).(\w+).(\d+)', r'\2 \1 \3', input) 
\d+ matches 14
\w+ matches October 
\d+ matches 2021
-----------------------
value.toDate("dd MMMM yyyy").toString("MMMM dd yyyy")
-----------------------
cells['Month'].value+cells['DD'].value+cells['YYYY'].value
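The regex-based answer above uses Jython's `re.sub`; the same swap runs unchanged under plain Python 3 (a standalone sketch with a sample date):

```python
import re

value = "14 October 2021"
# capture day, month, year and emit them as "month day year"
result = re.sub(r'(\d+).(\w+).(\d+)', r'\2 \1 \3', value)
print(result)  # → October 14 2021
```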

Extract text using GREL in OpenRefine

"("+value.partition(" (")[2]
-----------------------
value.split(" ").slice(2).join(" ")
value.match(/\S+\s\S+\s(.+)/)[0]
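For comparison, the two GREL approaches above can be mirrored in plain Python (a standalone sketch, assuming the same "drop the first two tokens" intent; the sample value is invented):

```python
import re

value = "John Ronald Reuel Tolkien"
# drop the first two whitespace-separated tokens and rejoin the rest
tail = " ".join(value.split(" ")[2:])
# equivalently, capture everything after the second space
tail_re = re.match(r'\S+\s\S+\s(.+)', value).group(1)
print(tail)     # → Reuel Tolkien
print(tail_re)  # → Reuel Tolkien
```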

How to get CSV with header from XML

  <xsl:param name="columns" as="xs:string*" select="'forename', 'surname', 'linksurname'"/>

  <xsl:template match="/">
      <xsl:value-of select="$columns" separator=";"/>
      <xsl:text>&#10;</xsl:text>
      <xsl:apply-templates select="//person/persName"/>
  </xsl:template>
  
  <xsl:template match="person/persName">
      <xsl:value-of select="forename, surname, @ref" separator=";"/>
      <xsl:text>&#10;</xsl:text>
  </xsl:template>
  
  <xsl:param name="columns" as="xs:string*" select="'forename', 'nameLink', 'surname', 'roleName', 'gnd', 'note'"/>
  
  <xsl:template match="/">
      <xsl:value-of select="$columns" separator=";"/>
      <xsl:text>&#10;</xsl:text>
      <xsl:apply-templates select="//person/persName"/>
  </xsl:template>
  
  <xsl:template match="person/persName">
      <xsl:value-of select="(string(forename), string(nameLink), string(surname), string(roleName), string(@ref), string(../note)) ! normalize-space()" separator=";"/>
      <xsl:text>&#10;</xsl:text>
  </xsl:template>
-----------------------
<xsl:stylesheet version="2.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="UTF-8" />
<xsl:strip-space elements="*"/>

<xsl:template match="/listPerson">
    <!-- header -->
    <xsl:text>forename;nameLink;surname;roleName;gnd;note&#10;</xsl:text>
    <!-- data -->
    <xsl:for-each select="person/persName">
        <xsl:value-of select="forename"/>
        <xsl:text>;</xsl:text>
        <xsl:value-of select="nameLink"/>
        <xsl:text>;</xsl:text>
        <xsl:value-of select="surname"/>
        <xsl:text>;</xsl:text>
        <xsl:value-of select="roleName"/>
        <xsl:text>;</xsl:text>
        <xsl:value-of select="@ref"/>
        <xsl:text>;</xsl:text>
        <xsl:value-of select="../note"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
</xsl:template>

</xsl:stylesheet>
forename;nameLink;surname;roleName;gnd;note
Jacques;;Abbadie;;http://d-nb.info/gnd/100002307;Prediger der französisch-reformierten Gemeinde in Berlin
Johann Jakob;;Achermann;;http://d-nb.info/gnd/1072413450;
Philipp III.;von;Aarschot;Herzog;http://d-nb.info/gnd/132281007;Philippe de Croy
Barbara;von;Aham;;;Äbtissin Barbara II. des Niedermünsters zu Regensburg
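The XSLT above flattens a TEI-style `listPerson` into semicolon-separated rows. The same extraction can be sketched in Python with the standard-library `xml.etree.ElementTree`; the sample input below is a minimal stand-in modeled on the output shown above, not the asker's actual file:

```python
import csv
import io
import xml.etree.ElementTree as ET

# minimal stand-in for the TEI-style input implied by the XSLT above
xml_data = """<listPerson>
  <person>
    <persName ref="http://d-nb.info/gnd/100002307">
      <forename>Jacques</forename><surname>Abbadie</surname>
    </persName>
    <note>Prediger der franzoesisch-reformierten Gemeinde in Berlin</note>
  </person>
</listPerson>"""

def child_text(parent, tag):
    """Text of a child element, or '' if the child is absent or empty."""
    el = parent.find(tag)
    return el.text.strip() if el is not None and el.text else ""

root = ET.fromstring(xml_data)
out = io.StringIO()
writer = csv.writer(out, delimiter=";")
writer.writerow(["forename", "nameLink", "surname", "roleName", "gnd", "note"])
for person in root.findall("person"):
    pers_name = person.find("persName")
    writer.writerow([
        child_text(pers_name, "forename"),
        child_text(pers_name, "nameLink"),
        child_text(pers_name, "surname"),
        child_text(pers_name, "roleName"),
        pers_name.get("ref", ""),   # the gnd column comes from persName/@ref
        child_text(person, "note"),
    ])
print(out.getvalue())
```

This prints the same header plus one `Jacques;;Abbadie;;…` data row as the XSLT output above.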

Regex to delete all caps letters and following comma

replace(value, /, *[A-Z]+\b/, '')
--------------------------------------------------------------------------------
  ,                        ','
--------------------------------------------------------------------------------
   *                       ' ' (0 or more times (matching the most
                           amount possible))
--------------------------------------------------------------------------------
  [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
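The GREL `replace` above uses Java-style regular expressions; the same pattern works in Python's `re` module (a standalone sketch with an invented sample value):

```python
import re

value = "Paris, FRANCE, capital"
# remove a comma followed by an all-caps word, as in the GREL expression above;
# lowercase words after a comma are left alone
cleaned = re.sub(r', *[A-Z]+\b', '', value)
print(cleaned)  # → Paris, capital
```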

Pattern Matching in OpenRefine JSON

value.replace(/.*[Tt]est.*/,'Test Titles')
value.replace(/.*test.*/i,'Test Titles')
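The second GREL expression uses an inline `i` flag for case-insensitive matching; in Python the equivalent is `re.IGNORECASE` (a standalone sketch with an invented sample value):

```python
import re

value = "Unit TEST results"
# any cell mentioning "test" (in any case) becomes the fixed label
normalized = re.sub(r'.*test.*', 'Test Titles', value, flags=re.IGNORECASE)
print(normalized)  # → Test Titles
```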

Average of split values in an OpenRefine formula

forEach(value.split(','),v,v.toNumber()).sum() / value.split(',').length()
forEach(value.replace('%','').split(','),v,v.toNumber()).sum() / value.replace('%','').split(',').length() + '%'
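The GREL `forEach` expressions above compute a mean over the comma-split values; the same arithmetic in plain Python (a standalone sketch with invented sample values):

```python
# plain numeric cell: split, convert, average
value = "10,20,60"
numbers = [float(v) for v in value.split(',')]
print(sum(numbers) / len(numbers))  # → 30.0

# percentage cell: strip '%', average, re-append the unit
pct = "10%,20%,60%"
numbers_pct = [float(v) for v in pct.replace('%', '').split(',')]
print(str(sum(numbers_pct) / len(numbers_pct)) + '%')  # → 30.0%
```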

identifying near duplicate names in a dataset

import re, string
import pandas as pd
from unidecode import unidecode
from collections import defaultdict

# clean the text before processing
def cleansing_special_characters(txt):
    seps = [' ',';',':','.','`','~',',','*','#','@','|','\\','-','_','?','%','!','^','(',')','[',']','{','}','$','=','+','"','<','>',"'",' AND ', ' and ']
    default_sep = seps[0]
    txt = str(txt)
    for sep in seps[1:]:
        if sep == " AND " or sep == " and ":
            txt = txt.upper()
            txt = txt.replace(sep, ' & ')
        else:
            txt = txt.upper()
            txt = txt.replace(sep, default_sep)
    try:
        list(map(int,txt.split()))
        txt = 'NUMBERS'
    except:
        pass
    txt = re.sub(' +', ' ', txt)
    temp_list = [i.strip() for i in txt.split(default_sep)]
    temp_list = [i for i in temp_list if i]
    return " ".join(temp_list)


punctuation = re.compile('[%s]' % re.escape(string.punctuation))

class fingerprinter(object):
    
    # __init__ function
    def __init__(self, string):
        self.string = self._preprocess(string)
        
    
    # strip leading, trailing spaces and to lower case
    def _preprocess(self, string):
        return punctuation.sub('',string.strip().lower())
    
        
    def _latinize(self, string):
        return unidecode(string)
#         return unidecode(string.decode('utf-8'))
    
    def _unique_preserve_order(self,seq):
        seen = set()
        seen_add = seen.add
        return [x for x in seq if not (x in seen or seen_add(x))]

    
    #-####################################################
    def get_fingerprint(self):
        return self._latinize(' '.join(self._unique_preserve_order(sorted(self.string.split()))))
    
    
    def get_ngram_fingerprint(self, n=1):
        return self._latinize(''.join(self._unique_preserve_order(sorted([self.string[i:i + n] for i in range(len(self.string) - n +1)]))))
    
    

# read excel file
df = pd.read_excel('Input_File.xlsx')

#preprocess the column
df['Clean'] = df['SUPPLIER_NAME'].apply(cleansing_special_characters)


# step 1: cleaning

# ##for n_gram fingerprint algorithm
###########################################################################################

df['n_gram_fingerprint_n2'] = df['Clean'].apply(lambda x : fingerprinter(x.replace(" ","")).get_ngram_fingerprint(n=2))


## generate tag_id for every unique generated n_gram_fingerprint
d = defaultdict(lambda: len(d))
df['tag_idn']=[d[x] for x in df['n_gram_fingerprint_n2']]

###########################################################################################

#drop n_gram column
df.drop(columns=['n_gram_fingerprint_n2'], inplace=True)

# make copy to create group of tag_id
df1 = df[['SUPPLIER_NAME','tag_idn']]


# drop SUPPLIER_NAME column , we have tag_id's now
df.drop(columns=['SUPPLIER_NAME'], inplace=True)

# group df with tag_id with selecting minimum 
#group = df.groupby('tag_id').min().reset_index()
group = df.loc[df["Clean"].str.len().groupby(df["tag_idn"]).idxmax()]

# join both the data frames group(unique) and main data
df_merge = pd.merge(df1,group, on=['tag_idn'])


# # output excel file
df_merge.to_excel('Output_File.xlsx', index = False)

remove stop words from a string in order to create clusters

String[] alsoReplace = {"and", "the", "&"};
for (String str : alsoReplace) {
    // case-insensitive removal of the word plus any whitespace that follows it;
    // note this also matches inside longer words such as "Brand"
    s = s.replaceAll("(?i)" + str + "(\\s+)?", "");
}
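The same token-stripping idea can be sketched in Python; like the Java version above, the pattern also matches inside longer words (e.g. "Brand" would lose its "and"), so treat this as a rough normalization step before clustering:

```python
import re

def strip_stop_words(s, stop_words=("and", "the", "&")):
    # case-insensitive removal of each stop word and any whitespace after it,
    # mirroring the Java replaceAll loop above
    for word in stop_words:
        s = re.sub('(?i)' + re.escape(word) + r'(\s+)?', '', s)
    return s

print(strip_stop_words("Johnson and Johnson"))  # → Johnson Johnson
```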

Community Discussions

Trending Discussions on OpenRefine
  • Storing RDF to Triple Store as input: Conversion from CSV to RDF
  • OpenRefine sample extension not building
  • Does OpenRefine support Python3?
  • OpenRefine: swapping order strings within a column of values
  • Extract text using GREL in OpenRefine
  • How to get csv with header from xml
  • OpenRefine: How can I offset values? (preceding row to the following row)
  • OpenRefine: How to create a unique row for each input in a column (delineated by comma)
  • Regex to delete all caps letters and following comma
  • Pattern Matching in OpenRefine JSON

QUESTION

Storing RDF to Triple Store as input: Conversion from CSV to RDF

Asked 2022-Feb-05 at 16:46

I am using a triple store, Apache Jena Fuseki, to store RDF as input, but my data is in CSV format. I researched a lot but didn't find a direct way to convert CSV to RDF. There is the tarql command-line tool that can do the job, but I need a Python script that directly converts my CSV to RDF.

I have used tools like OpenRefine and tarql, but I need a Python script to do this job. I have read somewhere that the owlready2 tool is also used to convert CSV to RDF, but when I visited the official site I found that they use an OWL file for this work.

Thanks!

ANSWER

Answered 2022-Feb-05 at 16:46

CSVW (CSV on the Web) is a W3C Recommendation for this, and there is a Python implementation.

Alternatively, you can run tarql from Python by forking a subprocess.

Source https://stackoverflow.com/questions/70997605
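For a dependency-free starting point, a minimal CSV-to-RDF conversion can also be sketched in plain Python by emitting N-Triples directly. The column names and the `http://example.org/` namespace below are invented for illustration; for real data, a CSVW implementation or a library such as rdflib is the better choice:

```python
import csv
import io

# hypothetical CSV input; in practice this would be read from a file
csv_data = "id,name\n1,Alice\n2,Bob\n"

EX = "http://example.org/"  # assumed namespace for this sketch

def escape_literal(s):
    # minimal N-Triples string escaping (backslash and double quote)
    return s.replace("\\", "\\\\").replace('"', '\\"')

triples = []
for row in csv.DictReader(io.StringIO(csv_data)):
    subject = f"<{EX}person/{row['id']}>"
    triples.append(f'{subject} <{EX}name> "{escape_literal(row["name"])}" .')

print("\n".join(triples))
```

The resulting N-Triples file can be loaded into Fuseki directly or via its upload endpoint.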

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install OpenRefine

OpenRefine Releases

Support

User Manual, FAQ, Official Website and tutorial videos


  • © 2022 Open Weaver Inc.