anonymization | Text anonymization in many languages using Faker | Web Framework library
kandi X-RAY | anonymization Summary
Text anonymization in many languages for Python 3.6+ using Faker.
Top functions reviewed by kandi - BETA
- Anonymize a text
- Replace all matches in a text
- Get a fake value for a given provider
- Anonymize text matched by a regular expression
- Anonymize a text
- Perform anonymization of a text
- Anonymize a given text
- Anonymize a string
anonymization Key Features
anonymization Examples and Code Snippets
class CustomAnonymizer():
    def __init__(self, anonymization: Anonymization):
        self.anonymization = anonymization

    def anonymize(self, text: str) -> str:
        modified_text = text  # build the anonymized text here
        return modified_text
    # or replace by regex patterns in text using a Faker provider
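A self-contained sketch of the custom-anonymizer pattern above. The `Anonymization` class here is a minimal stand-in for the library's class (the real one wraps Faker for a locale), and the phone-number regex is an illustrative example:

```python
# Sketch of the custom-anonymizer pattern; Anonymization is stubbed out here,
# not the library's real implementation.
import re

class Anonymization:
    """Stub: the real class wraps Faker for a given locale."""
    def __init__(self, locale: str):
        self.locale = locale

    def regex_anonymizer(self, text: str, pattern: str, replacement: str) -> str:
        # The real library substitutes Faker-generated values;
        # this stub uses a fixed replacement string.
        return re.sub(pattern, replacement, text)

class CustomAnonymizer:
    def __init__(self, anonymization: Anonymization):
        self.anonymization = anonymization

    def anonymize(self, text: str) -> str:
        # Mask anything that looks like a 10-digit phone number.
        return self.anonymization.regex_anonymizer(text, r"\b\d{10}\b", "XXXXXXXXXX")

anonymizer = CustomAnonymizer(Anonymization("en_US"))
print(anonymizer.anonymize("Call me at 0611223344"))  # prints: Call me at XXXXXXXXXX
```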
pip install spacy
python -m spacy download en
>>> from anonymization import Anonymization, AnonymizerChain, EmailAnonymizer, NamedEntitiesAnonymizer
>>> text = "Hi John,\nthanks for you for subscribing to Superprogram, feel free t
>>> from anonymization import Anonymization, PhoneNumberAnonymizer
>>>
>>> text = "C'est bien le 0611223344 ton numéro ?"
>>> anon = Anonymization('fr_FR')
>>> phoneAnonymizer = PhoneNumberAnonymizer(anon)
>>> phoneAnonymizer.anonymize(text)
Community Discussions
Trending Discussions on anonymization
QUESTION
I'm currently working on a case where I have a model with read-only attributes, and I'm trying to anonymize those attributes (i.e. update them with unrecognizable values) for GDPR compliance. It seems that Rails doesn't allow this (even when overriding #readonly? to return false), which makes sense: why update an attribute that isn't supposed to be updated? But has anyone had a similar case? If so, how did you go about it?
Model
...ANSWER
Answered 2021-Feb-09 at 07:25
I eventually used the line below. It's probably not ideal since I want to update a single record, but it doesn't check for read-only attributes and works for now.
QUESTION
I am trying to mask data in such a way that referential integrity is not compromised.
My table Customer has this data:
Customer table
...ANSWER
Answered 2020-Sep-01 at 18:32
There are two things that seem important for you:
- anonymity
- referential integrity
For both of your requirements, the solution from the blog article you linked is a bad choice.
Anonymity
Just hashing does not provide anonymity. The article also mentions (though it is not in its code) that you probably at least want to add a salt.
Just an example: a number like 211 will become af9fad5f as a CRC32 hash. If the person you share your data with sees this 8-character (32-bit) alphanumeric string, they will probably assume it might be a CRC32 hash. The good thing about hashes is that you cannot easily calculate back from af9fad5f to 211. The bad thing is that most short words/hashes are already precalculated and easy to look up in what is called a rainbow table (e.g. https://md5hashing.net/hash/crc32/af9fad5f).
This basically means everybody could just look up the "clear text" behind the CRC32 hashes (the same goes for all other hashes). Adding a salt prevents this (the salt must of course be kept secret!).
Referential Integrity
Referential integrity is kept: 211 will always hash to af9fad5f under CRC32; hashing is deterministic, with no random element. So the Product_ID would stay the same across all your tables, which is what you need.
But just to be sure, I would use SHA-256 instead of CRC32. CRC32 maps everything to an 8-character alphanumeric value (32 bits), so if you have quite a lot of data there is some chance of hash collisions, i.e. two different numbers/IDs in the same table ending up with the same hash. With SHA-256 this is next to impossible.
Overall I think using the anonymizer package seems ok. (It is not actively maintained, but the functionality seems fine.)
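The salted, deterministic hashing described above can be sketched as follows. The salt value and the 16-character truncation are illustrative choices, not part of the original answer:

```python
# Salted, deterministic pseudonymization: equal inputs always map to equal
# tokens, so referential integrity across tables is preserved.
import hashlib

SALT = b"keep-this-secret"  # must stay secret and identical across all tables

def pseudonymize(value: str) -> str:
    """Map a value to a salted SHA-256 token, deterministically."""
    digest = hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()
    return digest[:16]  # truncated for readability; keep more bits to avoid collisions

# Same input, same token, so joins on Product_ID still work:
print(pseudonymize("211") == pseudonymize("211"))  # prints: True
```

For production use, HMAC-SHA256 (`hmac.new(SALT, value, hashlib.sha256)`) is the more standard keyed construction than plain concatenation.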
QUESTION
Encryption at rest means storing data inside your storage/database in encrypted form. During processing you need to decrypt the data every time, compute something, and then encrypt everything back (the encryption is managed by the storage).
Does encryption at rest resolve the "right to be forgotten" issue? When can you not go with encryption at rest, and when should you choose data lookup tables and pseudonymization instead?
Unlike data lookup tables, encryption at rest is much easier to implement. It can affect your performance, though, and maybe billing.
AFAIK, under GDPR you shouldn't stop processing or remove anonymized data. On the other hand, ETL jobs must have permissions to decrypt data, which means everyone who has privileges to run a job (i.e. a developer, data scientist, or QA) will still be able to decrypt (de-anonymize) the data with the encryption key.
...ANSWER
Answered 2020-Jun-18 at 11:38
If encryption is occurring at the storage layer then it does not help with the right to be forgotten. If you want to use encryption to solve the right-to-be-forgotten challenge, then I would suggest using a unique encryption key per data subject. If a data subject needs to be forgotten, you can then delete your copy of the encryption key and you have effectively "crypto-shredded" all the data that is protected by that key. For this to work best you would need to carefully design your architecture (e.g. can you keep the key separate from the data, so that it isn't backed up with it, and find another way to ensure availability of current keys in a DR scenario, etc.).
A data lookup table is the equivalent of a tokenization service, where you're replacing a data subject's name or other details with a token. By deleting (or altering) the token in the data lookup table you have removed the ability to resolve the token back to the actual data subject. This would provide a lesser degree of assurance as to the level of "forgotten-ness" that had been achieved as you might still be able to identify a data subject indirectly through other information about them. Have a look at https://en.wikipedia.org/wiki/K-anonymity to understand this concept in-depth.
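The crypto-shredding idea described in the answer can be sketched as a toy. The XOR keystream "cipher" below is a stand-in for a real cipher such as AES-GCM, and all names are illustrative; the point is only that deleting a subject's key makes their records unrecoverable:

```python
# Toy crypto-shredding: one key per data subject; deleting the key "forgets"
# every record encrypted under it. Do NOT use this cipher in production.
import hashlib
import secrets

keystore = {}  # subject_id -> key; in practice kept separate from the data

def _keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudo-random bytes from the key (counter-mode hashing)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def encrypt_for(subject_id: str, plaintext: bytes) -> bytes:
    key = keystore.setdefault(subject_id, secrets.token_bytes(32))
    ks = _keystream(key, len(plaintext))
    return bytes(a ^ b for a, b in zip(plaintext, ks))

def decrypt_for(subject_id: str, ciphertext: bytes) -> bytes:
    key = keystore[subject_id]  # raises KeyError once the key is shredded
    ks = _keystream(key, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, ks))

record = encrypt_for("subject-42", b"John Doe, john@example.com")
assert decrypt_for("subject-42", record) == b"John Doe, john@example.com"

del keystore["subject-42"]  # right to be forgotten: shred the key
# decrypt_for("subject-42", record) would now raise KeyError
```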
QUESTION
I am trying to build an infrastructure in which I need to forward messages from one kafka topic to elasticsearch and postgresql. My infrastructure looks like in the picture below, and it all runs on the same host. Logstash is making some anonymization and some mutates, and sends the document back to kafka as json. Kafka should then forward the message to PostgreSQL and Elasticsearch
Everything works fine except the connection to PostgreSQL, with which I'm having some trouble.
My config files looks like follows:
sink-quickstart-sqlite.properties
...ANSWER
Answered 2020-Apr-17 at 16:03
Your error is here:
QUESTION
I want to classify ~1M+ documents and have a version control system for the inputs and outputs of the corresponding model.
The data changes over time:
- sample size increases over time
- new features might appear
- the anonymization procedure might change over time
So basically "everything" might change: the number of observations, the features, and the values. We are interested in making the ML model building reproducible without using 10/100+ GB of disk volume, because we save all updated versions of the input data. Currently the volume of the data is ~700 MB.
The most promising tool I found is https://github.com/iterative/dvc. Currently the data is stored in a database and loaded into R/Python from there.
Question:
How much disk volume can (very approximately) be saved by using DVC?
I tried to find out whether only the "diffs" of the data are saved, but didn't find much by reading through https://github.com/iterative/dvc#how-dvc-works or other documentation.
I am aware that this is a very vague question, and it will depend highly on the dataset. However, I would still be interested in a very approximate idea.
...ANSWER
Answered 2020-Feb-23 at 19:57
Let me try to summarize how DVC stores data, and I hope you'll be able to figure out from this how much space will be saved/consumed in your specific scenario.
DVC stores and deduplicates data at the individual file level. So, what does that usually mean from a practical perspective?
I will use dvc add as an example, but the same logic applies to all commands that save data files or directories into the DVC cache: dvc add, dvc run, etc.
Let's imagine I have a single 1GB XML file. I start tracking it with DVC:
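The file-level deduplication described above rests on a content-addressed cache. A minimal Python sketch of the idea (an illustration, not DVC's actual code; file names are made up):

```python
# Content-addressed cache: a file is stored once under the hash of its bytes,
# so identical files (or unchanged versions) cost no extra space.
import hashlib
import os
import tempfile

def cache_file(path: str, cache_dir: str) -> str:
    """Copy a file into a content-addressed cache; return its content hash."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()  # DVC historically used MD5
    dest = os.path.join(cache_dir, digest[:2], digest[2:])
    if not os.path.exists(dest):  # already cached -> nothing new is written
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with open(path, "rb") as src, open(dest, "wb") as out:
            out.write(src.read())
    return digest

cache = tempfile.mkdtemp()
work = tempfile.mkdtemp()
p1 = os.path.join(work, "data_v1.xml")
p2 = os.path.join(work, "data_copy.xml")
for p in (p1, p2):
    with open(p, "w") as f:
        f.write("<rows>...</rows>")

h1 = cache_file(p1, cache)
h2 = cache_file(p2, cache)
assert h1 == h2  # identical content is cached only once
```

Note the flip side for the question above: deduplication is per file, so if the 1GB file changes even slightly, a second full copy lands in the cache; DVC does not store intra-file diffs.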
QUESTION
Let's say I have a pandas dataframe and a column 'name'. I want to anonymize the column and hide the identities. I can do something like,
...ANSWER
Answered 2020-Feb-05 at 13:50
You can use the Faker package for this, which generates a dummy name for you.
Installation:
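Faker is installed from PyPI (pip install Faker). The same idea can be sketched with only the standard library, mapping each distinct name to a consistent placeholder; in practice you would swap the placeholder for Faker's fake.name(). The sample names below are made up:

```python
# Replace each distinct name with a consistent placeholder.
# With Faker installed, use fake.name() instead of the numbered placeholder.
names = ["Alice Smith", "Bob Jones", "Alice Smith"]  # stand-in for df['name']

mapping = {}

def pseudonym(name: str) -> str:
    """Return the same placeholder every time the same name appears."""
    if name not in mapping:
        mapping[name] = f"Person_{len(mapping) + 1}"
    return mapping[name]

anonymized = [pseudonym(n) for n in names]
print(anonymized)  # prints: ['Person_1', 'Person_2', 'Person_1']
```

With a pandas column this becomes df['name'].map(pseudonym), and identities stay hidden while repeated names remain linkable.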
QUESTION
I have a data anonymization process that takes a production copy of a database and turns it into an anonymized copy by UPDATE-ing some columns.
Some of the tables contain several million rows, so instead of UPDATE-ing the columns, which is very log-intensive, I went down the route of
...ANSWER
Answered 2020-Feb-04 at 21:51
QUESTION
I'm setting this property to src/test/java in the root pom file. The project consists of 7 modules; 5 of them have test cases in the mentioned location, while the others do not have tests, so the Sonar scan fails as shown:
ANSWER
Answered 2020-Jan-15 at 12:21
As already mentioned in the comments: in the modules without a test folder, override the property to remove the test-folder setting.
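Assuming the inherited property is sonar.tests (the exact property name is not shown in the question), a module without tests could override it with an empty value in its own pom.xml:

```xml
<!-- In the child module's pom.xml; "sonar.tests" is an assumed property name -->
<properties>
  <sonar.tests></sonar.tests>
</properties>
```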
QUESTION
I am capturing lessons learned about data I/O from a data analyst's perspective, without the benefit of data engineering expertise (and being quite explicit about that shortcoming). To give context to the various alternatives, taking into consideration the constraints within my shop, I've experimented briefly with XML import/export and done some online reading about schemas. One thing I noticed about an open source utility for a 4th-generation-language environment is that it seems to use a default (I haven't specified one):
...ANSWER
Answered 2019-Dec-13 at 20:59
There is no default XML schema for an arbitrary XML file. There are rules of well-formedness given by the W3C XML Recommendation, but those define XML itself rather than the vocabulary and grammar of any given XML schema.
Identifying an XSD when none is specified
- When schemaLocation is specified in the XML, see the XSD specified there. For more on schemaLocation, see How to link XML to XSD using schemaLocation or noNamespaceSchemaLocation?
- When only a namespace is used, see How to locate an XML Schema (XSD) by namespace?
- When the provider of the XML is available, ask or inspect the source/documentation.
- When relatively unique/informative element names are used, or if you know the sector/industry, google the element names or the sector/industry and "xml schema".
If none of the above work, go schema-less, or write your own to fit the data.
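For reference, a sketch of how an instance document points at its XSD (the element and file names here are made-up examples):

```xml
<!-- Instance document referencing a no-namespace schema; order.xsd is illustrative -->
<order xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:noNamespaceSchemaLocation="order.xsd">
  <item>Widget</item>
</order>
```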
More on XML design
In the comments, @user2153235 asks:
Is there a prevailing practice (or even a universal, minimal "base" scheme that is defaulted to in the absence of an explicit schema) wherein the atomic element is "item", and any other tag represents an element that is either a string or a structure composed of subordinate elements?
Yes, there is a prevailing practice (descriptive, meaningful element names, as discussed below). But no, there is no universal, minimal "base" schema: just the rules of well-formedness for XML itself.
The XML in your post is poorly designed:
- Naming is terrible:
  - The root element is named y, yet the content is clearly not a simple y-coordinate or anything else that could reasonably be described as y.
  - DataFrame-based names have C character suffixes followed by _FieldN numeric suffixes. Unless the C character is meaningful in some domain, the abbreviation ought to be expanded. Hard-wired numerical suffixes on list members are better left implied by position, so that the name can lexically signal type without having to decompose it.
- Substructure is left unmarked up: generally, structure shouldn't be buried in micro-formats within strings; mark-up should be imposed so that the XML parser can be leveraged rather than having to implement micro-parsers within an application.
QUESTION
Currently I am setting up Kubernetes in a 1-master, 2-node environment.
I successfully initialized the master and added the nodes to the cluster.
When I joined the nodes to the cluster, the kube-proxy pod started successfully, but the kube-flannel pod gets an error and runs into a CrashLoopBackOff.
flannel-pod.log:
...ANSWER
Answered 2018-Jun-14 at 11:03
According to the Flannel documentation:
At the bare minimum, you must tell flannel an IP range (subnet) that it should use for the overlay. Here is an example of the minimum flannel configuration:
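The minimum configuration referred to above is a small JSON document; flannel typically reads it from the net-conf.json key of its ConfigMap. The subnet below is an illustrative value and must match your cluster's pod CIDR:

```json
{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "vxlan"
  }
}
```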
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install anonymization
You can use anonymization like any standard Python library. You will need a development environment consisting of a Python distribution including header files, a compiler, pip, and git. Make sure that your pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changes to the system.