pyreadstat | Python package to read sas | Data Manipulation library
kandi X-RAY | pyreadstat Summary
A Python package to read and write SAS (sas7bdat, sas7bcat, xport), SPSS (sav, zsav, por) and Stata (dta) data files into/from pandas DataFrames. This module is a wrapper around the excellent Readstat C library by Evan Miller. Readstat is the library underlying the R package Haven, meaning pyreadstat is a Python equivalent to R Haven. Detailed documentation on all available methods is in the Module documentation. If you would like to read R RData and Rds files into Python in an easy way, take a look at pyreadr, a wrapper around the C library librdata.
Trending Discussions on pyreadstat
QUESTION
I want to read data from http://fmwww.bc.edu/ec-p/data/wooldridge/401k.dta. I tried the following:
ANSWER
Answered 2022-Mar-20 at 18:47

import requests
import pyreadstat

url = 'http://fmwww.bc.edu/ec-p/data/wooldridge/401k.dta'

def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return local_filename

df, meta = pyreadstat.read_dta(download_file(url))
QUESTION
I have a sav file with a datetime column in %m/%d/%Y string format. When I read it in with pd.read_spss(), which doesn't seem to have any datetime-related arguments, it ends up in what looks like unix time, except that the time would be a few centuries from now, with unique values including 13778726400, 13841884800, etc.
When I feed the read column into pd.to_datetime, however, it's not interpreted as the date I would expect, but rather as a few seconds after the original unix date in 1970:
ANSWER
Answered 2022-Mar-16 at 10:05
Dates, times and datetimes are always stored in SPSS as a number, and then a format is added for display. SPSS continuously adds new formats while removing others. New formats have to be added manually to the pyreadstat code, while old formats stay in the code for backward compatibility. So the problem is that you have found a new date/datetime/time format that is not registered in pyreadstat.
Another workaround would be to open the file in SPSS and store the column as a date/datetime/time with a different format that pyreadstat would recognise, for example DATE11, DATETIME20, etc. (the current list of formats pyreadstat accepts is at https://github.com/Roche/pyreadstat/blob/master/pyreadstat/_readstat_parser.pyx#L52-L54).
The best thing when this happens is to submit a GitHub issue describing the new format so that it can be added. I just added a few I found in the most recent SPSS documentation, and hopefully your problem will be solved in the next release (already available on dev). If not, please submit an issue with a reproducible example.
The numbers SPSS uses to store dates are not unix time, but either the number of seconds (in the case of datetimes or times) or days (in the case of dates) since 1582-10-14 (the start of the Gregorian calendar). So you would need something like this to calculate it manually:
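A minimal sketch of that conversion (not pyreadstat's internal code), assuming the values are seconds since the SPSS epoch of 1582-10-14:

```python
import datetime

# SPSS stores datetimes as seconds since its epoch, 1582-10-14.
SPSS_EPOCH = datetime.datetime(1582, 10, 14)

def spss_seconds_to_datetime(seconds):
    """Convert an SPSS datetime number (seconds since 1582-10-14)."""
    return SPSS_EPOCH + datetime.timedelta(seconds=seconds)

# One of the values from the question lands in the recent past/near
# future rather than "a few centuries from now".
print(spss_seconds_to_datetime(13778726400))
```

For date (rather than datetime) columns, the same idea applies with `timedelta(days=...)` instead of seconds.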
QUESTION
I read a sav file using this code:
df_file, meta_data = pyreadstat.read_sav('path')
It returns df_file as a pandas DataFrame but returns meta_data as a metadata_container object. I need to share the meta_data object with a colleague who is not a programmer. How do I export it? I can easily export df_file because it is a DataFrame, but I can't export meta_data to something like JSON because it is not a DataFrame.
ANSWER
Answered 2021-Jun-28 at 19:35
Not perfect, but it's a first step: you can use meta.__dict__.
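A sketch of that idea: since the container's attributes are plain Python lists, dicts and strings, its __dict__ can be dumped to JSON. The SimpleNamespace below only stands in for the real object returned by pyreadstat.read_sav; the attribute names mirror real ones but the values are invented.

```python
import json
from types import SimpleNamespace

def meta_to_json(meta, path):
    # meta.__dict__ exposes the container's attributes as a plain dict;
    # default=str covers any value that is not natively JSON-serializable.
    with open(path, 'w') as f:
        json.dump(meta.__dict__, f, default=str, indent=2)

# Stand-in for the metadata object (assumption: real attribute values
# are lists/dicts of strings and numbers).
meta = SimpleNamespace(column_names=['age', 'income'],
                       column_labels=['Age in years', 'Yearly income'])
meta_to_json(meta, 'meta.json')
```

The resulting meta.json opens in any text editor or browser, which makes it easy to hand to a non-programmer.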
QUESTION
ANSWER
Answered 2021-Mar-18 at 19:54
It is not the encoding; it comes from a change in Readstat (the C library behind pyreadstat) where only files with purely numeric versions are admitted, while some files have letters in them (V for Viya or Visual). Versions of pyreadstat prior to 0.3 would work.
The issue is tracked here in pyreadstat and here.
Once the issue is repaired in Readstat, it will be fixed in Pyreadstat.
QUESTION
import pyreadstat
df, meta = pyreadstat.read_sas7bdat('c:/ae.sas7bdat')
print(meta.original_variable_types)
ANSWER
Answered 2021-Mar-02 at 17:34
If you only need the type, then it is easy: in pyreadstat, if the format starts with $ it is character; if not, it is numeric.
What you are seeing in pyreadstat is what you have in the format column of SAS, without the variable width (which is stored separately in pyreadstat in meta.variable_display_width). You will observe in your screenshot that all character variables have a format that starts with $; the number that comes next is the variable width.
SAS has only two types: character and number, therefore if it is not a character it is a number. The format tells SAS how to display the variable. For characters it just means display the character ($) with a certain width, as there are no more alternatives. Numbers can be displayed in different ways, with formats like BEST, but also as DATE if they represent the number of days since Jan 1st 1960, as DATETIME if they represent the number of seconds since Jan 1st 1960, etc.
In case formats are missing, you can check whether the data in a column is a string, in which case the type in SAS was character. Anything else was numeric:
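For example, a small sketch (the helper name is invented, not pyreadstat API): infer the SAS type from the pandas dtype of each column pyreadstat returned.

```python
import pandas as pd

def sas_types(df):
    # object-dtype columns came from SAS character variables;
    # everything else was a SAS numeric variable.
    return {col: 'character' if df[col].dtype == object else 'numeric'
            for col in df.columns}

# Hypothetical data standing in for a DataFrame read with read_sas7bdat.
df = pd.DataFrame({'name': ['a', 'b'], 'value': [1.0, 2.0]})
print(sas_types(df))
```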
QUESTION
I am converting each SAS dataset from a directory listing into an individual DataFrame in pandas.
ANSWER
Answered 2020-Oct-09 at 06:28
I would do it like this:
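A sketch of that approach (the function name and the dict-of-frames layout are my own choices; the reader parameter defaults to pyreadstat.read_sas7bdat and exists only to make the helper easy to swap out or test):

```python
import glob
import os

def read_sas_dir(directory, reader=None):
    """Read every .sas7bdat file in a directory into a dict of
    DataFrames keyed by file name (without extension)."""
    if reader is None:
        import pyreadstat
        reader = pyreadstat.read_sas7bdat
    frames = {}
    for path in sorted(glob.glob(os.path.join(directory, '*.sas7bdat'))):
        name = os.path.splitext(os.path.basename(path))[0]
        df, meta = reader(path)
        frames[name] = df
    return frames
```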
QUESTION
I'm using pyreadstat to open an SPSS (.sav) file; think of a data frame where the columns are questions from a survey. I'd like to calculate frequencies for each column grouped by some other columns, by first melting the data and then calculating the frequencies.
ANSWER
Answered 2020-Sep-17 at 17:32
Let's try reindex:
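Without the original data, here is a hedged sketch of the melt-then-count idea; the survey values and the 'yes'/'no' answer ordering passed to reindex are invented:

```python
import pandas as pd

# Hypothetical survey data standing in for the .sav file.
df = pd.DataFrame({'group': ['a', 'a', 'b'],
                   'q1': ['yes', 'no', 'yes'],
                   'q2': ['no', 'no', 'yes']})

# Melt the question columns, count answers per group and question,
# then reindex so every answer category appears as a column.
melted = df.melt(id_vars='group', var_name='question', value_name='answer')
freq = (melted.groupby(['group', 'question'])['answer']
              .value_counts()
              .unstack(fill_value=0)
              .reindex(columns=['yes', 'no'], fill_value=0))
print(freq)
```

reindex guarantees that an answer category appears as a column (filled with 0) even when no respondent in the data chose it.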
QUESTION
I am trying to read .sav files using pyreadstat in Python, but in some rare scenarios I get a UnicodeDecodeError because a string variable has special characters.
To handle this, I think instead of loading the entire variable set I will load only the variables which do not produce this error.
Below is the pseudo-code I have. It is not very efficient, since I check for the error on each item of the list using try and except.
ANSWER
Answered 2020-Aug-18 at 05:39
For this specific case I would suggest a different approach: you can pass an "encoding" argument to pyreadstat.read_sav to set the encoding manually. If you don't know which one it is, you can iterate over the list of encodings here: https://gist.github.com/hakre/4188459 to find out which one makes sense. For example:
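A sketch of that trial loop; read_with_fallback is a helper name I made up, and the candidate list in the usage note is an assumption you would extend from the gist above:

```python
def read_with_fallback(read_func, path, encodings):
    """Try read_func(path, encoding=enc) for each candidate encoding
    and return (df, meta, encoding) for the first one that succeeds."""
    last_err = None
    for enc in encodings:
        try:
            df, meta = read_func(path, encoding=enc)
            return df, meta, enc
        except Exception as err:
            last_err = err
    raise last_err

# Usage (assumes pyreadstat is installed):
# import pyreadstat
# df, meta, enc = read_with_fallback(pyreadstat.read_sav, 'file.sav',
#                                    ['UTF-8', 'LATIN1', 'CP1252'])
```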
QUESTION
When I've used XGBoost for regression in the past, I've gotten differentiated predictions, but using an XGBClassifier on this dataset is resulting in all cases being predicted to have the same value. The true values of the test data are that 221 cases are a 0, and 49 cases are a 1. XGBoost seems to be latching onto that imbalance and predicting all 0's. I'm trying to figure out what I might need to adjust in the model's parameters to fix that.
Here is the code I'm running:
ANSWER
Answered 2020-Aug-13 at 15:34
Found an answer on the stats section: https://stats.stackexchange.com/questions/243207/what-is-the-proper-usage-of-scale-pos-weight-in-xgboost-for-imbalanced-datasets
scale_pos_weight seems to be a parameter you can adjust to deal with class imbalances like this. Mine was set to the default, 1, which means that negative (0) and positive (1) cases are assumed to show up evenly. If I change this to 4, which is my ratio of negatives to positives, I start seeing cases predicted into class 1.
My accuracy score goes down, but this makes sense: you get a higher % accuracy on this data by predicting everyone to be 0, since the vast majority of cases are 0. But I want to run this model not for accuracy but for information on the importances/contributions of each predictor, so I want differing predictions.
One answer in the link also suggested being more conservative by setting scale_pos_weight to the sqrt of the ratio, which would be 2 in this case. I got a higher accuracy with 2 than with 4, so that's what I'm going with, and I plan to look into this parameter in future classification models.
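The arithmetic above, as a sketch (class counts taken from the question; the XGBClassifier line is commented out and assumes xgboost is installed):

```python
import math

neg, pos = 221, 49               # class counts from the test data
ratio = neg / pos                # ~4.5: fully rebalances the classes
conservative = math.sqrt(ratio)  # ~2.1: the damped value suggested above

# model = xgboost.XGBClassifier(scale_pos_weight=conservative)
print(ratio, conservative)
```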
For a multi-class model, it looks like you're better off adjusting the case-level weights to bring your classes to even representation, as outlined here: https://datascience.stackexchange.com/questions/16342/unbalanced-multiclass-data-with-xgboost
QUESTION
I have an AI Platform VM instance set up with a Python3 notebook. I also have a Google Cloud Storage bucket that contains numerous .CSV and .SAV files. I have no difficulties using standard python packages likes Pandas to read in data from the CSV files, but my notebook appears unable to locate my .SAV files in my storage bucket.
Does anyone know what is going on here and/or how I can resolve this issue?
ANSWER
Answered 2020-Jul-30 at 20:52
The read_spss function can only read from a local file path:

path : str or Path
    File path.

Compare that with the read_csv function:

filepath_or_buffer : str, path object or file-like object
    Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected.
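One workaround, sketched under assumptions (download_to_local is an invented helper; `blob` can be any object with a download_to_filename method, such as a google.cloud.storage Blob): copy the object to a local temp file first, then hand that path to pd.read_spss.

```python
import os
import tempfile

def download_to_local(blob, suffix='.sav'):
    """Copy a cloud object to a local temp file and return the path, so
    that readers which only accept local paths (like pd.read_spss)
    can open it."""
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    blob.download_to_filename(local_path)
    return local_path

# Usage (assumes google-cloud-storage, pandas and pyreadstat installed):
# blob = storage.Client().bucket('my-bucket').blob('data/file.sav')
# df = pd.read_spss(download_to_local(blob))
```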
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported