pyreadstat | Python package to read sas | Data Manipulation library
kandi X-RAY | pyreadstat Summary
A Python package to read and write SAS (sas7bdat, sas7bcat, xport), SPSS (sav, zsav, por) and Stata (dta) data files into/from pandas DataFrames. This module is a wrapper around the excellent Readstat C library by Evan Miller. Readstat is the library underlying the R package Haven, meaning pyreadstat is a Python equivalent to R Haven. Detailed documentation on all available methods is in the Module documentation. If you would like to read R RData and Rds files into Python in an easy way, take a look at pyreadr, a wrapper around the C library librdata.
Trending Discussions on pyreadstat
QUESTION
I want to read data from http://fmwww.bc.edu/ec-p/data/wooldridge/401k.dta. I tried the following:
ANSWER
Answered 2022-Mar-20 at 18:47

import requests
import pyreadstat

url = 'http://fmwww.bc.edu/ec-p/data/wooldridge/401k.dta'

def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return local_filename

df, meta = pyreadstat.read_dta(download_file(url))
QUESTION
I have a sav file with a datetime column in %m/%d/%Y string format. When I read it in with pd.read_spss(), which doesn't seem to have any datetime-related arguments, it ends up in what looks like unix time, except that the time would be a few centuries from now, with unique values including 13778726400, 13841884800, etc.
When I feed the read column into pd.to_datetime, however, it's not interpreted as the date I would expect, but rather as a few seconds after the original unix date in 1970:
ANSWER
Answered 2022-Mar-16 at 10:05
Dates, times and datetimes are always stored in SPSS as a number, and then a format is added for display. SPSS continuously adds new formats while removing others. New formats have to be added manually to the pyreadstat code, while old formats stay in the code for backward compatibility. So the problem is that you have found a new date/datetime/time format that is not registered in pyreadstat.
Another workaround would be to open the file in SPSS and store the column as a date/datetime/time with a different format that pyreadstat would recognise, for example DATE11, DATETIME20, etc. (the current list of formats pyreadstat accepts is at https://github.com/Roche/pyreadstat/blob/master/pyreadstat/_readstat_parser.pyx#L52-L54).
The best thing when this happens is to submit a GitHub issue describing the new format so that it can be added. I just added a few I found in the most recent SPSS documentation, and hopefully your problem will be solved in the next release (already available on dev). If not, please submit an issue with a reproducible example.
The numbers SPSS uses to store dates are not unix time, but either the number of seconds (in the case of datetimes or times) or days (in the case of dates) since 1582-10-14 (the start of the Gregorian calendar). So you would need something like this to calculate it manually:
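A minimal sketch of that conversion (not pyreadstat's internal code), assuming the values are seconds since the SPSS epoch of 1582-10-14:

```python
import datetime

# SPSS stores datetimes as seconds since its epoch, 1582-10-14.
SPSS_EPOCH = datetime.datetime(1582, 10, 14)

def spss_seconds_to_datetime(seconds):
    """Convert an SPSS datetime number (seconds since 1582-10-14)."""
    return SPSS_EPOCH + datetime.timedelta(seconds=seconds)

# One of the values from the question lands in the recent past/near
# future rather than "a few centuries from now".
print(spss_seconds_to_datetime(13778726400))
```

For date (rather than datetime) columns, the same idea applies with `timedelta(days=...)` instead of seconds.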
QUESTION
I read a sav file using this code:
df_file, meta_data = pyreadstat.read_sav('path')
It returns df_file as a pandas DataFrame but returns meta_data as a metadata_container object. I need to share the meta_data object with a colleague who is not a programmer. How do I export it? I can easily export df_file because it is a DataFrame, but I can't export meta_data to something like JSON because it is not a DataFrame.
ANSWER
Answered 2021-Jun-28 at 19:35
Not perfect, but it's a first step: you can use meta.__dict__.
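A sketch of that idea: since the container's attributes are plain Python lists, dicts and strings, its __dict__ can be dumped to JSON. The SimpleNamespace below only stands in for the real object returned by pyreadstat.read_sav; the attribute names mirror real ones but the values are invented.

```python
import json
from types import SimpleNamespace

def meta_to_json(meta, path):
    # meta.__dict__ exposes the container's attributes as a plain dict;
    # default=str covers any value that is not natively JSON-serializable.
    with open(path, 'w') as f:
        json.dump(meta.__dict__, f, default=str, indent=2)

# Stand-in for the metadata object (assumption: real attribute values
# are lists/dicts of strings and numbers).
meta = SimpleNamespace(column_names=['age', 'income'],
                       column_labels=['Age in years', 'Yearly income'])
meta_to_json(meta, 'meta.json')
```

The resulting meta.json opens in any text editor or browser, which makes it easy to hand to a non-programmer.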
QUESTION
ANSWER
Answered 2021-Mar-18 at 19:54
It is not the encoding; it comes from a change in Readstat (the C library behind pyreadstat) where only files with purely numeric versions are admitted, while some files have letters in them (V for Viya or Visual). Versions of pyreadstat prior to 0.3 would work.
The issue is tracked here in pyreadstat and here.
Once the issue is repaired in Readstat, it will be fixed in Pyreadstat.
QUESTION
import pyreadstat
df, meta = pyreadstat.read_sas7bdat('c:/ae.sas7bdat')
print(meta.original_variable_types)
ANSWER
Answered 2021-Mar-02 at 17:34
If you only need the type, then it is easy: in pyreadstat, if the format starts with $ it is character; if not, it is numeric.
What you are seeing in pyreadstat is what you have in the format column of SAS, without the variable width (which is stored separately in pyreadstat in meta.variable_display_width). You will observe in your screenshot that all character variables have a format that starts with $; the number that comes next is the variable width.
SAS has only two types: character and number, therefore if it is not a character it is a number. The format tells SAS how to display the variable. For characters it just means display the character ($) with a certain width, as there are no more alternatives. Numbers can be displayed in different ways, with formats like BEST, but also as DATE if they represent the number of days since Jan 1st 1960, as DATETIME if they represent the number of seconds since Jan 1st 1960, etc.
In case formats are missing, you can check whether the data in a column is a string, in which case the type in SAS was character. Anything else was numeric:
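For example, a small sketch (the helper name is invented, not pyreadstat API): infer the SAS type from the pandas dtype of each column pyreadstat returned.

```python
import pandas as pd

def sas_types(df):
    # object-dtype columns came from SAS character variables;
    # everything else was a SAS numeric variable.
    return {col: 'character' if df[col].dtype == object else 'numeric'
            for col in df.columns}

# Hypothetical data standing in for a DataFrame read with read_sas7bdat.
df = pd.DataFrame({'name': ['a', 'b'], 'value': [1.0, 2.0]})
print(sas_types(df))
```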
QUESTION
I am converting each SAS dataset from a directory listing into an individual DataFrame in pandas.
ANSWER
Answered 2020-Oct-09 at 06:28
I would do it like this:
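A sketch of that approach (the function name and the dict-of-frames layout are my own choices; the reader parameter defaults to pyreadstat.read_sas7bdat and exists only to make the helper easy to swap out or test):

```python
import glob
import os

def read_sas_dir(directory, reader=None):
    """Read every .sas7bdat file in a directory into a dict of
    DataFrames keyed by file name (without extension)."""
    if reader is None:
        import pyreadstat
        reader = pyreadstat.read_sas7bdat
    frames = {}
    for path in sorted(glob.glob(os.path.join(directory, '*.sas7bdat'))):
        name = os.path.splitext(os.path.basename(path))[0]
        df, meta = reader(path)
        frames[name] = df
    return frames
```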
QUESTION
I'm using pyreadstat to open an SPSS (.sav) file; think of a data frame where the columns are questions from a survey. I'd like to calculate frequencies for each column grouped by some other columns, by first melting the data and then calculating the frequencies.
ANSWER
Answered 2020-Sep-17 at 17:32
Let's try reindex:
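Without the original data, here is a hedged sketch of the melt-then-count idea; the survey values and the 'yes'/'no' answer ordering passed to reindex are invented:

```python
import pandas as pd

# Hypothetical survey data standing in for the .sav file.
df = pd.DataFrame({'group': ['a', 'a', 'b'],
                   'q1': ['yes', 'no', 'yes'],
                   'q2': ['no', 'no', 'yes']})

# Melt the question columns, count answers per group and question,
# then reindex so every answer category appears as a column.
melted = df.melt(id_vars='group', var_name='question', value_name='answer')
freq = (melted.groupby(['group', 'question'])['answer']
              .value_counts()
              .unstack(fill_value=0)
              .reindex(columns=['yes', 'no'], fill_value=0))
print(freq)
```

reindex guarantees that an answer category appears as a column (filled with 0) even when no respondent in the data chose it.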
QUESTION
I am trying to read .sav files using pyreadstat in Python, but in some rare scenarios I get a UnicodeDecodeError because a string variable has special characters.
To handle this, I think instead of loading the entire variable set I will load only the variables which do not produce this error.
Below is the pseudo-code I have. It is not very efficient, since I check for the error on each item of the list using try and except.
ANSWER
Answered 2020-Aug-18 at 05:39
For this specific case I would suggest a different approach: you can pass an "encoding" argument to pyreadstat.read_sav to set the encoding manually. If you don't know which one it is, you can iterate over the list of encodings here: https://gist.github.com/hakre/4188459 to find out which one makes sense. For example:
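A sketch of that trial loop; read_with_fallback is a helper name I made up, and the candidate list in the usage note is an assumption you would extend from the gist above:

```python
def read_with_fallback(read_func, path, encodings):
    """Try read_func(path, encoding=enc) for each candidate encoding
    and return (df, meta, encoding) for the first one that succeeds."""
    last_err = None
    for enc in encodings:
        try:
            df, meta = read_func(path, encoding=enc)
            return df, meta, enc
        except Exception as err:
            last_err = err
    raise last_err

# Usage (assumes pyreadstat is installed):
# import pyreadstat
# df, meta, enc = read_with_fallback(pyreadstat.read_sav, 'file.sav',
#                                    ['UTF-8', 'LATIN1', 'CP1252'])
```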
QUESTION
When I've used XGBoost for regression in the past, I've gotten differentiated predictions, but using an XGBClassifier on this dataset is resulting in all cases being predicted to have the same value. The true values of the test data are that 221 cases are a 0, and 49 cases are a 1. XGBoost seems to be latching onto that imbalance and predicting all 0's. I'm trying to figure out what I might need to adjust in the model's parameters to fix that.
Here is the code I'm running:
ANSWER
Answered 2020-Aug-13 at 15:34
Found an answer on the stats section: https://stats.stackexchange.com/questions/243207/what-is-the-proper-usage-of-scale-pos-weight-in-xgboost-for-imbalanced-datasets
scale_pos_weight seems to be a parameter you can adjust to deal with class imbalances like this. Mine was set to the default, 1, which means that negative (0) and positive (1) cases are assumed to show up evenly. If I change this to 4, which is my ratio of negatives to positives, I start seeing cases predicted into class 1.
My accuracy score goes down, but this makes sense: you get a higher % accuracy on this data by predicting everyone to be 0, since the vast majority of cases are 0. But I want to run this model not for accuracy but for information on the importances/contributions of each predictor, so I want differing predictions.
One answer in the link also suggested being more conservative by setting scale_pos_weight to the sqrt of the ratio, which would be 2 in this case. I got a higher accuracy with 2 than with 4, so that's what I'm going with, and I plan to look into this parameter in future classification models.
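The arithmetic above, as a sketch (class counts taken from the question; the XGBClassifier line is commented out and assumes xgboost is installed):

```python
import math

neg, pos = 221, 49               # class counts from the test data
ratio = neg / pos                # ~4.5: fully rebalances the classes
conservative = math.sqrt(ratio)  # ~2.1: the damped value suggested above

# model = xgboost.XGBClassifier(scale_pos_weight=conservative)
print(ratio, conservative)
```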
For a multi-class model, it looks like you're better off adjusting the case-level weights to bring your classes to even representation, as outlined here: https://datascience.stackexchange.com/questions/16342/unbalanced-multiclass-data-with-xgboost
QUESTION
I have an AI Platform VM instance set up with a Python3 notebook. I also have a Google Cloud Storage bucket that contains numerous .CSV and .SAV files. I have no difficulties using standard python packages likes Pandas to read in data from the CSV files, but my notebook appears unable to locate my .SAV files in my storage bucket.
Does anyone know what is going on here and/or how I can resolve this issue?
ANSWER
Answered 2020-Jul-30 at 20:52
The read_spss function can only read from a local file path:

path : str or Path
    File path.

Compare that with the read_csv function:

filepath_or_buffer : str, path object or file-like object
    Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected.
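One workaround, sketched under assumptions (download_to_local is an invented helper; `blob` can be any object with a download_to_filename method, such as a google.cloud.storage Blob): copy the object to a local temp file first, then hand that path to pd.read_spss.

```python
import os
import tempfile

def download_to_local(blob, suffix='.sav'):
    """Copy a cloud object to a local temp file and return the path, so
    that readers which only accept local paths (like pd.read_spss)
    can open it."""
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    blob.download_to_filename(local_path)
    return local_path

# Usage (assumes google-cloud-storage, pandas and pyreadstat installed):
# blob = storage.Client().bucket('my-bucket').blob('data/file.sav')
# df = pd.read_spss(download_to_local(blob))
```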
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported