Explore all Data Manipulation open source software, libraries, packages, source code, cloud functions and APIs.

Popular New Releases in Data Manipulation

numpy

did_you_mean

v1.5.0

numexpr

NumExpr 2.7.2

nbconvert

6.5.0

hapi-fhir

HAPI FHIR 5.2.0 (Numbat)

Popular Libraries in Data Manipulation

numpy

by numpy (Python)

20101 stars, BSD-3-Clause

The fundamental package for scientific computing with Python.

BullshitGenerator

by menzi11 (JavaScript)

15078 stars, NOASSERTION

Needs to generate some texts to test if my GUI rendering codes good or not. so I made this.

cleartext-mac

by mortenjust (Swift)

3291 stars, GPL-3.0

A text editor that will help you write clearer and simpler

shave

by dollarshaveclub (JavaScript)

2113 stars, MIT

💈 Shave is a 0 dep JS plugin that truncates text to fit within an element based on a set max-height ✁

did_you_mean

by ruby (Ruby)

1756 stars, MIT

The gem that has been saving people from typos since 2014

numexpr

by pydata (Python)

1687 stars, MIT

Fast numerical array expression evaluator for Python, NumPy, PyTables, pandas, bcolz and more

react-native-masked-text

by benhurott (JavaScript)

1514 stars, MIT

A pure javascript masked text and input text component for React-Native.

100-pandas-puzzles

by ajcr (Jupyter Notebook)

1499 stars, MIT

100 data puzzles for pandas, ranging from short and simple to super tricky (60% complete)

nbconvert

by jupyter (Python)

1293 stars, NOASSERTION

Jupyter Notebook Conversion

Trending New libraries in Data Manipulation

dalle-mini

by borisdayma (Python)

830 stars, Apache-2.0

DALL·E Mini - Generate images from a text prompt

capsize

by seek-oss (TypeScript)

815 stars, MIT

Flipping how we define typography in CSS.

Fast-F1

by theOehrly (Python)

764 stars, MIT

FastF1 is a python package for accessing and analyzing Formula 1 results, schedules, timing data and telemetry

color2k

by ricokahler (TypeScript)

417 stars, MIT

a color parsing and manipulation lib served in roughly 2kB

react-native-see-more-inline

by kashishgrover (JavaScript)

307 stars

Show a "read more", "see more", "read less", "see less" inline with your text in React Native

riptable

by rtosholdings (Python)

305 stars, NOASSERTION

64bit multithreaded python data analytics tools for numpy arrays and datasets

Options_Data_Science

by yugedata (Python)

246 stars, BSD-3-Clause

Collecting, analyzing, visualizing & paper trading options market data

go-financial

by razorpay (Go)

209 stars, MIT

A go port of numpy-financial functions and more.

alldocs.app

by ueberdosis (PHP)

159 stars

Online text file converter

Top Authors in Data Manipulation

  1. ruby-rdf: 12 Libraries, 645 stars
  2. w3c: 8 Libraries, 263 stars
  3. dbpedia: 6 Libraries, 81 stars
  4. GeoKnow: 5 Libraries, 169 stars
  5. manjaro: 5 Libraries, 112 stars
  6. numpy: 5 Libraries, 20596 stars
  7. ldodds: 5 Libraries, 23 stars
  8. dice-group: 5 Libraries, 47 stars
  9. zazuko: 5 Libraries, 24 stars
  10. enthought: 5 Libraries, 154 stars


Trending Kits in Data Manipulation

OpenCV is a library of programming functions mainly aimed at real-time computer vision. It is written mainly in C++, offers bindings for languages such as Python and Java, and runs on Windows, Linux, Android, and macOS. OpenCV is widely used in the field of computer vision for tasks such as object recognition, face detection, and image and video analysis. It has a large community of developers and users and is continuously updated and improved.


OpenCV provides a large collection of algorithms and functions for image and video processing, including:

  • Image processing operations like filtering, morphological transformations, thresholding, etc.
  • Object detection and recognition, including face detection and recognition, object tracking, etc.
  • Image and video analysis, including edge detection, feature extraction, and optical flow.
  • Camera calibration and 3D reconstruction.
  • Machine learning algorithms, including support for deep learning frameworks like TensorFlow and Caffe.


You can divide an image into two equal parts vertically or horizontally using OpenCV by simply slicing the image array. Here's an example of how you could divide an image into two equal parts horizontally in Python using OpenCV:


This code splits the image into two equal parts, a top half and a bottom half. It first retrieves the shape of the image to get its height and width. It then calculates the starting and ending row and column pixel coordinates for the top and bottom halves. The image is then sliced, and each half is stored in the cropped_top and cropped_bot variables. Finally, each of the two cropped images is displayed using the OpenCV function cv2.imshow() and stays on screen until a key is pressed, using cv2.waitKey(0).
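The slicing step can be sketched as follows. This is a minimal illustration rather than the kit's original code: the helper name split_top_bottom and the synthetic input array are assumptions, and a NumPy array stands in for the image that cv2.imread would return.

```python
import numpy as np


def split_top_bottom(image):
    """Split an image array of shape (H, W) or (H, W, C) into top and bottom halves."""
    height = image.shape[0]
    mid = height // 2  # with an odd height, the bottom half gets the extra row
    cropped_top = image[:mid]   # rows 0 .. mid-1, all columns
    cropped_bot = image[mid:]   # rows mid .. H-1, all columns
    return cropped_top, cropped_bot


# Synthetic 4x6 three-channel image. With OpenCV installed, cv2.imread("photo.jpg")
# (a hypothetical path) would return an array of the same kind.
image = np.arange(4 * 6 * 3, dtype=np.uint8).reshape(4, 6, 3)
cropped_top, cropped_bot = split_top_bottom(image)
print(cropped_top.shape, cropped_bot.shape)
```

Slicing along the second axis instead (image[:, :mid] and image[:, mid:]) would give left and right halves. Both slices are views on the original array, so no pixel data is copied.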


Here is an example of how you can divide an image into two equal parts using OpenCV.

Preview of the output that you will get on running this code from your IDE

CODE

In this solution we use the imread function of OpenCV.

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Modify the name and location of the image in the code.
  3. Run the file to divide the image into top and bottom halves.


I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.


I found this code snippet by searching for "divide image into two equal parts python opencv" in kandi. You can try any such use case!

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created and executed in Python 3.7.15.
  2. The solution is tested on OpenCV 4.6.0.
  3. The solution is tested on numpy 1.21.6.


Using this solution, we are able to divide an image using the OpenCV library in Python with simple steps. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code that divides an image in Python.

Dependent Library

If you do not have OpenCV and numpy, which are required to run this code, you can install them by copying the pip install command from their pages in kandi.

You can search for any dependent library on kandi, like OpenCV and numpy.

Support

  1. For any support on kandi solution kits, please use the chat.
  2. For further learning resources, visit the Open Weaver Community learning page.

OpenCV is a computer vision library written in C++ and widely used for image and video processing. It offers a range of features for working with images and videos, including the ability to load and save images, apply filters, detect edges, and detect and track objects. Applications involving image and video processing are frequently built with Python and OpenCV together. This combination enables you to develop robust and adaptable programs that can address a variety of computer vision problems.


As developers, we frequently need to read and rotate the images in our applications to complete image processing activities such as recognition, upload, augmentation, and training. Numerous Python libraries enable working with images, with features for manipulating, enhancing, and creating images. With OpenCV you can set the angle of rotation and the image's size to get the desired effect, and use other OpenCV functions to apply further transformations such as scaling, cropping, and filtering.


Here is an example of how we can draw a line beyond the second point using OpenCV.
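The geometry behind drawing past the second point can be sketched without OpenCV. This is a hedged illustration, not the kit's original code: the function name extend_beyond and the parameter extra are assumptions introduced here. The idea is to scale the direction vector from the first point to the second so that the endpoint lands a chosen number of pixels past the second point.

```python
import math


def extend_beyond(p1, p2, extra):
    """Return the point `extra` pixels past p2 on the ray from p1 through p2."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("p1 and p2 must be distinct points")
    scale = (length + extra) / length
    # Round to integers because pixel coordinates must be whole numbers.
    return (round(p1[0] + dx * scale), round(p1[1] + dy * scale))


p1, p2 = (10, 10), (40, 50)
end = extend_beyond(p1, p2, 50)
print(end)
```

With OpenCV installed, the extended segment could then be drawn with cv2.line(img, p1, end, color, thickness) on an image array img.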


Preview of the output that you will get on running this code from your IDE

CODE

In this solution we use the numpy and OpenCV libraries.

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Import the OpenCV and numpy libraries.
  3. Modify the coordinates of the points and the length of the line.
  4. Run the file to draw a line.


I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.


I found this code snippet by searching for "Draw a line in open cv and python beyond given points" in kandi. You can try any such use case!

Dependent Library

If you do not have OpenCV and numpy, which are required to run this code, you can install them by copying the pip install command from their pages in kandi.

You can search for any dependent library on kandi, like OpenCV and numpy.

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created and executed in Python 3.7.15.
  2. The solution is tested on OpenCV 4.6.0.
  3. The solution is tested on numpy 1.21.6.


Using this solution, we are going to draw a line beyond the second given point using the OpenCV and numpy libraries in Python with simple steps. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code that draws a line in Python.



Precision and recall are two commonly used metrics for evaluating the performance of a classification model. Precision measures the accuracy of the positive predictions, while recall measures the ability of the model to identify all relevant positive samples. In the code, y_true is the list of true labels and y_pred is the list of predicted labels. The precision_score and recall_score functions calculate the precision and recall, respectively.


Precision is the fraction of true positive predictions out of all positive predictions made. It measures the accuracy of the positive predictions.

Recall is the fraction of true positive predictions out of all actual positive cases. It measures the completeness of the positive predictions.


  • confusion_matrix: This function generates a confusion matrix given true labels and predicted labels.
  • precision_score: This function calculates the precision score of a classification model given true labels and predicted labels.
  • recall_score: This function calculates the recall score of a classification model given true labels and predicted labels.

These functions, all from scikit-learn's sklearn.metrics module, can be used to evaluate the performance of a classification model.
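To make the two definitions concrete, here is a dependency-free sketch that counts true positives, false positives, and false negatives by hand. It is not the kit's original code: the helper name precision_recall and the sample labels are assumptions. For binary labels like these, scikit-learn's precision_score(y_true, y_pred) and recall_score(y_true, y_pred) compute the same two ratios.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for one positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when there are no positive predictions or cases.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

In this sample there are 3 true positives, 1 false positive, and 1 false negative, so both ratios are 3 out of 4.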


Here is an example of how we can find the precision score and recall score using scikit-learn.

Preview of the output that you will get on running this code from your IDE

Code

In this solution we have used scikit-learn.

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Run the file to get the output.


I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.


I found this code snippet by searching for "Need help finding the precision and recall for a confusion matrix" in kandi. You can try any such use case!

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15.
  2. The solution is tested on scikit-learn 1.0.2.


Using this solution, we are going to learn how to find the precision and recall for a confusion matrix using the scikit-learn library in Python with simple steps. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code that finds the precision and recall for a confusion matrix in Python.

Dependent Library

If you do not have scikit-learn and numpy, which are required to run this code, you can install them by copying the pip install command from their pages in kandi.

You can search for any dependent library on kandi, like scikit-learn and numpy.



Converting RGB to YCbCr can provide better results for image and video compression, color space conversions, and HDR processing. There are several reasons why we might need to convert RGB to YCbCr:


  • Compression efficiency: YCbCr provides better compression results compared to RGB, especially in preserving image quality after compression. This is because the human visual system is more sensitive to changes in brightness (luma, Y) than to changes in color (chroma, Cb and Cr), so the chroma channels can be stored at lower resolution.
  • Color space conversion: Some image processing tasks, such as color correction, may require transforming the image from one color space to another. For example, many cameras and video pipelines deliver images in the YCbCr color space, and it may be necessary to convert them to RGB for display purposes.


OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine learning software library. It is written in C++ and widely used for image and video processing. OpenCV provides a vast array of image and video processing functions that can be used in various domains such as:


  • Object detection and recognition
  • Image and video segmentation
  • Face and feature detection
  • Object tracking
  • Image restoration and enhancement
  • Stereoscopic vision
  • Motion analysis
  • 3D reconstruction


RGB and YCbCr are color spaces used in digital image processing.


BGR stands for Blue, Green, Red and is a channel ordering of the RGB (Red, Green, Blue) color space. BGR is used in computer vision and image processing applications and is the default channel order for images loaded by the OpenCV library in Python.


YCbCr, on the other hand, stands for Luma (Y) and Chrominance (Cb, Cr), and is a color space used in digital video processing. YCbCr separates the brightness information (luma) from the color information (chroma), which allows for more efficient compression. YCbCr is used in many image and video compression standards, such as JPEG and MPEG. In summary, BGR is used in computer vision and image processing, while YCbCr is used in video processing and compression.
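The separation of luma from chroma can be sketched directly in NumPy. This is an illustration under stated assumptions, not the kit's original code: the function name bgr_to_ycrcb is hypothetical, and the weights are the full-range BT.601 coefficients that OpenCV documents for its 8-bit cv2.COLOR_BGR2YCrCb conversion, which should give closely matching output. Note the channel order Y, Cr, Cb.

```python
import numpy as np


def bgr_to_ycrcb(image_bgr):
    """Convert a uint8 BGR image to Y, Cr, Cb using full-range BT.601 weights."""
    bgr = image_bgr.astype(np.float64)
    b, g, r = bgr[..., 0], bgr[..., 1], bgr[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luma: weighted brightness
    cr = (r - y) * 0.713 + 128.0            # red-difference chroma, offset to mid-range
    cb = (b - y) * 0.564 + 128.0            # blue-difference chroma, offset to mid-range
    out = np.stack([y, cr, cb], axis=-1)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


# A single white pixel: maximum luma, neutral chroma.
white = np.full((1, 1, 3), 255, dtype=np.uint8)
print(bgr_to_ycrcb(white)[0, 0])
```

Gray pixels of any brightness map to chroma values of 128, which is why the Cr and Cb planes of typical photographs compress well.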


In this solution, we are going to learn how to convert an RGB image to YCbCr using OpenCV.

Preview of the output that you will get on running this code from your IDE

CODE

In this solution we use the imread function of OpenCV.


  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Import the OpenCV and numpy libraries.
  3. Modify the name and location of the image in the code.
  4. Run the file to get the output.


I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.


I found this code snippet by searching for "OpenCV Python converting color-space image to YCbCr" in kandi. You can try any such use case!


Note


To display the output, use these commands:

cv2.imshow('after', YCrbCrImage)

cv2.waitKey(0)

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created and executed in Python 3.7.15.
  2. The solution is tested on OpenCV 4.6.0.
  3. The solution is tested on numpy 1.21.6.


Using this solution, we are going to convert a BGR image to YCbCr using the OpenCV library in Python with simple steps. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code that converts BGR to YCbCr in Python.

Dependent Library

If you do not have OpenCV and numpy, which are required to run this code, you can install them by copying the pip install command from their pages in kandi.

You can search for any dependent library on kandi, like OpenCV and numpy.


In pandas and NumPy, a "where" condition is expressed through Boolean indexing to filter the elements of an array, Series, or DataFrame based on a specific condition. The condition is specified as a Boolean expression, and the elements that satisfy the condition are kept while the elements that do not are removed.


In SQL, you can fetch the value of a particular column with a WHERE condition using a SELECT statement.

  • SQL SELECT: Using the SQL SELECT command, you can query a database and get specified data from one or more of its tables.

In the WHERE clause, you can also include several criteria by using logical operators like AND and OR.

  • AND: In a WHERE clause, several criteria can be combined using the SQL AND operator. A row is returned only when all of the combined criteria are true.
  • OR: In a WHERE clause, multiple conditions can be combined using the SQL OR operator. A row is returned when at least one of the conditions is true.


To better understand fetching the value of a particular column with a where condition, have a look at the code below.
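As a small sketch of the pandas counterpart to SELECT ... WHERE, the example below filters rows with a Boolean mask and picks one column with DataFrame.loc. The table contents and column names are made up for illustration and are not taken from the original kit.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["Ada", "Ben", "Cleo", "Dev"],
    "dept": ["eng", "ops", "eng", "eng"],
    "salary": [120, 90, 105, 80],
})

# Like: SELECT name FROM df WHERE dept = 'eng' AND salary > 100
# Each comparison needs its own parentheses because & binds tighter than == and >.
eng_high = df.loc[(df["dept"] == "eng") & (df["salary"] > 100), "name"]

# Like: SELECT name FROM df WHERE dept = 'ops' OR salary > 110
ops_or_top = df.loc[(df["dept"] == "ops") | (df["salary"] > 110), "name"]

print(eng_high.tolist())
print(ops_or_top.tolist())
```

The first argument to loc selects rows through the mask, and the second names the column to return, so the result is a Series holding only the matching values.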

Fig: Preview of the output that you will get on running this code from your IDE.

Code

In this solution we're using the pandas and NumPy libraries.

Instructions

Follow the steps carefully to get the output easily.

  1. Install pandas in the Python environment used by your IDE (any IDE of your choice).
  2. Copy the snippet using the 'Copy' button and paste it into your IDE.
  3. Add the required dependencies and import them in the Python file.
  4. Run the file to generate the output.


I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.


I found this code snippet by searching for 'How to fetch value of particular column with where condition in pandas' in kandi. You can try any such use case!

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.

  1. The solution is created in PyCharm 2021.3.
  2. The solution is tested on Python 3.9.7.
  3. The solution is tested on pandas 1.5.2.
  4. The solution is tested on NumPy 1.24.0.


Using this solution, we are able to fetch the value of a particular column with a where condition in pandas with simple steps. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code that fetches the value of a particular column with a where condition in pandas.

Dependent Libraries

You can also search for any dependent libraries on kandi like 'pandas' and 'numpy'.



We will locate a specific group of words in a text using the spaCy library, then replace those words with an empty string to remove them from the text.


Removing the words within a specific span from a text with spaCy is useful in several situations:

  • Text pre-processing: Removing specific words or phrases from text can be a useful step in pre-processing text data for NLP tasks such as text classification, sentiment analysis, and language translation.
  • Document summarization: Removing less important words or phrases, so that only the most crucial information remains, can help construct a summary of a lengthy text.
  • Data cleaning: Anonymization and data cleaning can both benefit from removing sensitive or useless text information, such as names and addresses.
  • Text generation: Deleting specific words or phrases can help shape the context or meaning of newly generated text.
  • Text augmentation: Removing specific words or phrases and replacing them with new variations is a text augmentation technique in NLP.


Here is how you can remove words in a span using spaCy:
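One way to sketch this, assuming only that spaCy is installed: the example uses spacy.blank("en"), a tokenizer-only pipeline, so that it runs without downloading en_core_web_sm. The sample sentence and the chosen span are illustrative and not the kit's original code.

```python
import spacy

# Tokenizer-only English pipeline; no trained model download is required.
nlp = spacy.blank("en")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# A Span is a slice of the Doc; this one covers the tokens "quick" and "brown".
span = doc[1:3]

# Keep every token outside the span. text_with_ws carries each token's
# trailing whitespace, so the remaining text keeps its original spacing.
kept = [tok.text_with_ws for tok in doc if not (span.start <= tok.i < span.end)]
result = "".join(kept)
print(result)
```

Rebuilding the text from tokens avoids the pitfalls of a plain string replace, which could also delete the same words where they appear outside the intended span.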

Preview of the output that you will get on running this code from your IDE

Code

In this solution we have used the spaCy library for Python.

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Enter the text.
  3. Run the code to remove the specific words from the text.


I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.


I found this code snippet by searching for "Remove words in span from spacy" in kandi. You can try any such use case!


Note


In this snippet we are using a language model (en_core_web_sm).

  1. Copy the command python -m spacy download en_core_web_sm.
  2. Paste it in your terminal and run it to download the model.


Check your spaCy version using the pip show spacy command in your terminal.

  1. If the version is 3.0 or above, you will need to load the model using nlp = spacy.load("en_core_web_sm")
  2. If the version is less than 3.0, you will need to load the model using nlp = spacy.load("en")

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15.
  2. The solution is tested on spaCy 3.4.3.


Using this solution, we can remove the words in a span from a text with the help of spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code that removes specific words from a text in Python.

Dependent Library

If you do not have spaCy and numpy, which are required to run this code, you can install them by copying the pip install command from their pages in kandi.

You can search for any dependent library on kandi, like spaCy and numpy.


spaCy is an open-source software library for advanced natural language processing. Because it was created expressly for use in production environments, it helps you build programs that process and "understand" massive amounts of text. Quick and effective tokenization is one of its main advantages. spaCy is frequently used for tasks including information extraction, machine translation, named entity recognition, part-of-speech tagging, and text summarization in business, academia, and government research projects.


Additionally, spaCy offers tools for standard tasks like text classification, language recognition, working with word vectors and similarity, and more. You can use the token attributes that spaCy sets during tokenization to remove certain types of tokens from a text, including symbols, punctuation, and numerals. Some examples include:

  • Eliminating common stop words: spaCy has a built-in list of stop words, like "and," "or," and "the," that you can eliminate from your text; the Token.is_stop attribute marks them.
  • Eliminating punctuation: You can check whether a token is punctuation using the Token.is_punct attribute and then delete it from the text.
  • Removing numbers: To determine whether a token resembles a number and delete it from the text, use the Token.like_num attribute.
  • Removing symbols: To determine whether a token is a symbol and delete it from the text, use the Token.is_alpha and Token.is_digit attributes; a token that is neither alphabetic nor numeric is a symbol or punctuation mark.


Here is how you can remove tokens like symbols, punctuation, and numbers in SpaCy:
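The attribute checks described above can be combined into one filter. This is a minimal sketch rather than the kit's original code: the helper name is_noise and the sample sentence are assumptions, and spacy.blank("en") is used so the example runs without a downloaded model.

```python
import spacy

# Tokenizer-only English pipeline; lexical attributes such as is_punct,
# like_num, and is_alpha are available without a trained model.
nlp = spacy.blank("en")
doc = nlp("Hello, world! I bought 3 apples + 42 pears.")


def is_noise(tok):
    """True for punctuation, number-like tokens, and leftover symbols."""
    # like_num also catches spelled-out numbers; `not is_alpha` catches
    # symbols such as "+" that are neither punctuation nor numbers.
    return tok.is_punct or tok.like_num or not tok.is_alpha


kept = [tok.text for tok in doc if not is_noise(tok)]
print(kept)
```

Adding `or tok.is_stop` to the condition would also drop stop words, at the cost of removing common words such as "I".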

Preview of the output that you will get on running this code from your IDE

Code

In this solution we use the token attributes of the spaCy library.

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Enter the text.
  3. Run the file to remove symbols, numbers, and punctuation.


I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.


I found this code snippet by searching for "How to filter tokens from spacy Document" in kandi. You can try any such use case!


Note


In this snippet we are using a language model (en_core_web_sm).

  1. Copy the command python -m spacy download en_core_web_sm.
  2. Paste it in your terminal and run it to download the model.


Check your spaCy version using the pip show spacy command in your terminal.

  1. If the version is 3.0 or above, you will need to load the model using nlp = spacy.load("en_core_web_sm")
  2. If the version is less than 3.0, you will need to load the model using nlp = spacy.load("en")

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15.
  2. The solution is tested on spaCy 3.4.3.
  3. The solution is tested on numpy 1.21.6.


Using this solution, we are able to remove symbols, punctuation, and numbers from text in Python with the help of the spaCy library. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code that removes such tokens in Python.

Dependent Library

If you do not have spaCy and numpy, which are required to run this code, you can install them by copying the pip install command from their pages in kandi.

You can search for any dependent library on kandi, like spaCy and numpy.


This kit creates a new image from an input image by swapping the positions of the rows and columns using nested loops, and then writes the resulting image to a file using the OpenCV library.


OpenCV (Open Source Computer Vision) and NumPy are two powerful Python libraries that are widely used in computer vision, image processing, and machine learning applications. Here's a brief overview of OpenCV, which provides a variety of computer vision algorithms and functions for image and video processing.


  • These functions range from basic image filtering, resizing, and rotation to advanced feature detection, object recognition, and video analysis.
  • OpenCV can read and write a variety of image and video formats, making it easy to work with different types of media.
  • OpenCV has interfaces for several programming languages, including


In the code, h, w, and c represent the height, width, and number of color channels of the new array. A nested loop is a loop inside another loop. It is a common programming construct used to iterate over multiple levels of data, such as two-dimensional arrays or matrices. cv2.imwrite is a function provided by the OpenCV library that writes an image to a file on disk. The function takes two required arguments: the filename of the image to be saved, and the image data to be written.
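The row and column swap with nested loops can be sketched as below. This is an illustration under assumptions, not the kit's original code: the function name swap_rows_and_columns and the synthetic input are hypothetical, and the file-writing step is only mentioned in a comment so that the example needs NumPy alone.

```python
import numpy as np


def swap_rows_and_columns(image):
    """Build a new image whose pixel (col, row) is the input's pixel (row, col)."""
    h, w, c = image.shape
    # The output has width and height exchanged.
    out = np.zeros((w, h, c), dtype=image.dtype)
    for row in range(h):        # outer loop over input rows
        for col in range(w):    # inner loop over input columns
            out[col, row] = image[row, col]
    return out


# Synthetic 2x3 three-channel image in place of a file read with cv2.imread.
image = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
swapped = swap_rows_and_columns(image)
print(swapped.shape)
# With OpenCV installed, cv2.imwrite("swapped.png", swapped) would save the result
# ("swapped.png" is a hypothetical filename).
```

Swapping rows and columns is a transpose, which equals a 90 degree rotation combined with a mirror flip. The loop-free equivalent is np.transpose(image, (1, 0, 2)), which is much faster on large images.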


Here is an example of how to rotate the image:

Preview of the output that you will get on running this code from your IDE


CODE

In this solution we use the imread function of OpenCV.

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Import the OpenCV and numpy libraries.
  3. Modify the name and location of the image to be rotated in the code.
  4. Run the file to rotate the image.


I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.


I found this code snippet by searching for "Image rotation using OpenCV" in kandi. You can try any such use case!

Dependent Libraries

If you do not have OpenCV, which is required to run this code, you can install it by copying the pip install command from the OpenCV page in kandi.

You can search for any dependent library on kandi, like OpenCV and numpy.

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created and executed in Python 3.7.15.
  2. The solution is tested on OpenCV 4.6.0.


Using this solution, we are able to rotate an image using the OpenCV library in Python with simple steps. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code that rotates an image in Python.


Indexing and slicing a tensor in PyTorch means selecting a specific part of a tensor using a combination of indices and slices. This is useful for selecting a subset of rows or columns, or a certain number of elements along a given dimension. The selected parts can then be used for various operations, such as creating sub-tensors from a larger tensor or applying an operation to only a subset of the elements.


A tensor in PyTorch is a multi-dimensional array used to store numerical data. It is a fundamental data structure in deep learning models like convolutional neural networks (CNNs). A two-dimensional tensor can be viewed as a matrix of numbers, and tensors can be manipulated using operations such as addition, multiplication, and division.


Indexing and slicing tensors in PyTorch follow the same conventions as indexing and slicing lists in Python, extended to multiple dimensions as in NumPy.

  • To retrieve a single tensor element, use the indexing operator [] with the corresponding indices.
  • To slice a tensor, use the slicing operator : with the corresponding indices.


Here is an example of indexing and slicing a tensor in PyTorch. 
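A short sketch of both operations on a small 3x3 tensor, assuming PyTorch is installed. The tensor values and variable names are illustrative and not the kit's original code.

```python
import torch

t = torch.tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

# Indexing: one index per dimension selects a single element (a 0-dim tensor).
element = t[1, 2]

# Indexing with fewer indices than dimensions selects a whole row.
first_row = t[0]

# Slicing: ":" alone takes everything along that dimension.
second_col = t[:, 1]

# Slices combine across dimensions: rows 0-1 and columns 1-2.
block = t[:2, 1:]

print(element.item(), first_row.tolist(), second_col.tolist(), block.tolist())
```

Basic slices like these return views that share memory with the original tensor, so writing into block would also change t; call .clone() when an independent copy is needed.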



Fig 1: Preview of the output that you will get on indexing a tensor in PyTorch.



Fig 2: Preview of the output that you will get on slicing a tensor in PyTorch.

Codes


In this solution, we use the torch.tensor function of the PyTorch library.

Instructions

Follow the steps carefully to get the output easily.

  1. Install Jupyter Notebook on your computer.
  2. Open a terminal and install the required libraries with the following command.
  3. Install PyTorch: pip install torch.
  4. Copy the code using the "Copy" button above, and paste it into a Python file in your IDE.
  5. Print the result of the slicing.
  6. Run the file to perform indexing and slicing of a tensor in PyTorch.


I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.


I found this code snippet by searching for "Indexing and slicing a tensor in PyTorch" in kandi. You can try any such use case!

Dependent Libraries


If you do not have PyTorch that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the PyTorch page in kandi.


You can search for any dependent library on kandi like PyTorch

Environment Tested


I tested this solution in the following versions. Be mindful of changes when working with other versions.

  1. The solution is created in Python 3.9.6
  2. The solution is tested on PyTorch 2.0.0+cpu version.


Using this solution, we are able to index and slice a tensor in PyTorch with a few simple steps. PyTorch is also used in computer vision and for generative adversarial networks.

Support


  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page.


This code demonstrates how a simple linear regression model can be trained and used to make predictions in Python using the scikit-learn library. The LinearRegression class from the sklearn.linear_model module is used to build and train linear regression models.


Linear Regression is a supervised machine learning algorithm used for regression problems. In regression problems, the goal is to predict a continuous target variable based on one or more input variables. The linear regression algorithm fits a linear equation to the observed data between the dependent (target) and independent (predictor) variables. The equation is represented by a line that best captures the relationship between the variables.


The predict() method of scikit-learn's LinearRegression class makes predictions for new data from a model that has already been fitted with fit(). It expects a 2-D array of shape (n_samples, n_features) and returns one predicted value per sample.
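As a rough illustration of the fit-then-predict flow, here is a minimal sketch assuming scikit-learn and NumPy are installed. The training data is made up for the example (it follows y = 2x + 1 exactly) and is not the data used in the original solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative training data that follows y = 2x + 1 exactly.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])  # shape (n_samples, n_features)
y_train = np.array([3.0, 5.0, 7.0, 9.0])

# Fit the model, which estimates the slope (coef_) and intercept (intercept_).
model = LinearRegression()
model.fit(X_train, y_train)

# Predict for inputs the model has not seen; the input must also be 2-D.
X_new = np.array([[5.0], [6.0]])
predictions = model.predict(X_new)

print(model.coef_, model.intercept_)
print(predictions)  # approximately [11. 13.]
```

Passing a 1-D array to predict() raises an error, so a single value is typically wrapped as `[[value]]` or reshaped with `reshape(-1, 1)`.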


Linear Regression is widely used for many applications, including forecasting, modeling, and understanding the relationship between variables.

Preview of the output that you will get on running this code from your IDE

Code

In this solution we have used LinearRegression

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Run the file to get the output


I hope you found this useful. Links to the dependent libraries and version information are in the following sections.


I found this code snippet by searching for "use .predict() method in python for Linear regression" in kandi. You can try any such use case!

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution was created with Python 3.7.15.
  2. The solution was tested with scikit-learn 1.0.2.
  3. The solution was tested with numpy 1.21.6.


Using this solution, we learn how to make predictions with a simple linear regression model using the scikit-learn library in Python with a few simple steps. This process also facilitates an easy-to-use, hassle-free way to create a hands-on working version of code that uses the .predict() method for linear regression in Python.

Dependent Library

If you do not have scikit-learn and numpy, which are required to run this code, you can install them by clicking on the above links and copying the pip install command from the corresponding page in kandi.

You can search for any dependent library on kandi, such as scikit-learn and numpy.



Creating a Pandas DataFrame with a unique index can provide several benefits, including: 

  • Uniqueness: A unique index ensures that a unique label can identify each row in the DataFrame. This helps avoid issues when dealing with duplicate rows or merging data from multiple sources. 
  • Data Integrity: By using a unique index, you can help maintain the integrity of your data. This can make performing operations such as filtering, sorting, and aggregating data easier without affecting the underlying data structure. 
  • Efficiency: Using a unique index can make certain operations more efficient when working with large datasets. For example, when performing joins or merges between dataframes, using a unique index can speed up the process by allowing the data to be aligned more quickly. 


In Python, NumPy is a library for numerical computing. It provides a powerful N-dimensional array object, as well as a variety of functions for performing mathematical operations on arrays. NumPy arrays are efficient and fast and can be used for various data analysis tasks, such as filtering, sorting, and aggregating data. pandas is built on top of NumPy and provides a higher-level Python interface for data manipulation and analysis. The DataFrame.append() method adds rows of data to an existing DataFrame and returns a new DataFrame containing the original rows followed by the appended rows. Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat() is the replacement in newer versions.


Creating a Pandas DataFrame with a unique index can help ensure data integrity, improve efficiency, and make data analysis and manipulation easier and more intuitive. 
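The original snippet is not shown on this page, so the following is a minimal sketch of one way to keep an index unique while adding rows, assuming pandas is installed. It uses pd.concat() rather than append() so that it also runs on pandas 2.x; the column name and labels are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20]}, index=["a", "b"])
new_rows = pd.DataFrame({"value": [30]}, index=["c"])

# verify_integrity=True raises ValueError if the combined index has duplicates.
combined = pd.concat([df, new_rows], verify_integrity=True)

# When labels might collide, discard them and build a fresh 0..n-1 index.
clashing = pd.DataFrame({"value": [40]}, index=["a"])
renumbered = pd.concat([combined, clashing], ignore_index=True)

print(combined.index.is_unique)  # True
print(list(renumbered.index))    # [0, 1, 2, 3]
```

`verify_integrity=True` is the stricter choice when the labels carry meaning, while `ignore_index=True` suits cases where only row order matters.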

Preview of the output that you will get on running this code.

Code

In this solution we have used the append() method of the pandas DataFrame.

  1. Copy the code using the "Copy" button above and paste it into your Python IDE.
  2. Import the pandas and NumPy libraries.
  3. Run the code to get a DataFrame with a unique index.


I hope you have found this useful. The dependent library and version information are in the following sections.


I found this code snippet by searching for "Create pandas dataFrame with unique index" in kandi. You can try any such use case.

Dependent Library

If you do not have pandas, which is required to run this code, you can install it by clicking on the above link and copying the pip install command from the pandas page in kandi. You can search for any dependent library in kandi, such as pandas.

Environment Tested

In this solution we have used the following versions. Be mindful of changes when working with other versions.


  1. This solution was created using Python 3.7.15.
  2. This solution was tested using pandas 1.5.2.


Using this solution, we are able to create a DataFrame with a unique index using the pandas library in Python with a few simple steps. This process also facilitates an easy-to-use, hassle-free way to create a hands-on working version of code that builds a DataFrame with a unique index in Python.


A multi-label confusion matrix is a useful tool for evaluating the performance of multi-label classification models. It provides a detailed view of the true positive (TP), false positive (FP), false negative (FN), and true negative (TN) predictions made by the classifier for each label. This information can be used to evaluate several aspects of the classifier's performance, including:


  • Accuracy: The accuracy for a label is the ratio of correct predictions to total predictions, (TP + TN) / (TP + TN + FP + FN).
  • Precision: Precision is the fraction of positive predictions that are actually correct, TP / (TP + FP). It measures the quality of the positive predictions made by the classifier.
  • Recall: Recall is the fraction of actual positive instances that the classifier correctly identifies, TP / (TP + FN). It measures the completeness of the positive predictions made by the classifier.
  • F1-Score: The F1-score is the harmonic mean of precision and recall, 2 × precision × recall / (precision + recall), and balances the two.
  • Support: The support is the number of actual instances of each label in the data.


These performance metrics can be computed for each label and averaged across labels to give an overall view of the classifier's performance.


In addition to these performance metrics, the multi-label confusion matrix can also help identify specific areas for improvement in the classifier. For example, if the classifier has low precision for a particular label, it is making too many false positive predictions for that label. On the other hand, if the classifier has low recall for a particular label, it is missing many actual positive instances of that label (false negatives). By identifying these specific areas for improvement, the multi-label confusion matrix can help guide further development and refinement of the classifier.
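To make the per-label counts concrete, here is a minimal sketch using scikit-learn's multilabel_confusion_matrix, assuming scikit-learn and NumPy are installed. The label arrays are invented for illustration and are not the data from the original solution. Each 2x2 matrix is laid out as [[TN, FP], [FN, TP]].

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Illustrative data: 4 samples, 3 labels, one column per label.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0],
                   [0, 0, 1]])

# Result has shape (n_labels, 2, 2); each entry is [[TN, FP], [FN, TP]].
mcm = multilabel_confusion_matrix(y_true, y_pred)

# Pull the four counts out per label and derive precision and recall.
tn, fp, fn, tp = mcm[:, 0, 0], mcm[:, 0, 1], mcm[:, 1, 0], mcm[:, 1, 1]
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(mcm)
print(precision)  # [1.  1.  0.5]
print(recall)     # [1.  0.5 0.5]
```

In this made-up data the second label has perfect precision but a recall of 0.5, which is the "missing actual positives" pattern described above. With real data, a label that is never predicted gives tp + fp = 0, so the division needs guarding.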

Preview of the output that you will get on running this code from your IDE

Code

In this solution we have used the scikit-learn (sklearn) library.

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Run the file to create a multi-label confusion matrix.


I hope you found this useful. Links to the dependent libraries and version information are in the following sections.


I found this code snippet by searching for "Multi-label confusion matrix" in kandi. You can try any such use case!

Dependent Library

If you do not have scikit-learn, which is required to run this code, you can install it by clicking on the above link and copying the pip install command from the scikit-learn page in kandi.

You can search for any dependent library on kandi, such as scikit-learn.

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution was created with Python 3.7.15.
  2. The solution was tested with scikit-learn 1.0.2.
  3. The solution was tested with numpy 1.21.6.


Using this solution, we are able to create a multi-label confusion matrix using the scikit-learn library in Python with a few simple steps. This process also facilitates an easy-to-use, hassle-free way to create a hands-on working version of code that builds a confusion matrix for each label in Python.



Rendering text is an important part of web design and typography, as it allows text to be displayed in a way that is visually appealing and easy to read. Rendering text means turning stored characters into pixels on a screen, usually with formatting such as font size, font type, and color. This is typically done by a program such as a word processor or web browser.



Pygame is a set of Python modules designed for writing video games. It is free and open source, designed to make it easy to write fun games. It includes functions for creating graphics, playing sounds, handling mouse and keyboard input, and much more. 



Rendering text with Pygame involves three steps. First, create a font object, which fixes the typeface and size. Second, call the font's render() method with the text and a color; this returns a new Surface containing the text rather than drawing to the screen directly. Third, copy that Surface onto the display with blit() at any position.



Here is an example of rendering text with Pygame 



Fig1: Preview of Code



Fig2: Preview of the Output

Code


In this solution, we use the Pygame library.
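The original snippet appears only as a figure, so the following is a minimal sketch of the font, render, and blit steps, assuming Pygame is installed. To stay runnable without opening a window, it draws onto an off-screen Surface; in a real program that canvas would be the Surface returned by pygame.display.set_mode(). The text, colors, and sizes are illustrative.

```python
import pygame

# Only the font module is initialized, so no window or display is needed.
pygame.font.init()

# None selects Pygame's built-in default font; 36 is the font size.
font = pygame.font.Font(None, 36)

# render(text, antialias, color) returns a new Surface containing the text.
text_surface = font.render("Hello, Pygame!", True, (255, 255, 255))

# Off-screen canvas standing in for the window Surface (an assumption here).
canvas = pygame.Surface((400, 200))
canvas.fill((0, 0, 0))

# blit() copies the text Surface onto the canvas at the given top-left position.
canvas.blit(text_surface, (20, 80))

print(text_surface.get_size())
```

In an interactive program the blit would happen inside the game loop, followed by pygame.display.flip() to show the updated frame.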

Instructions


  1. Install Jupyter Notebook on your computer.
  2. Open a terminal and install the required library: pip install pygame.
  3. Copy the snippet using the "Copy" button and paste it into a new Python file or notebook cell.
  4. Run the file using the Run button.


I hope you found this useful. Links to the dependent libraries and version information are in the following sections.


I found this code snippet by searching for "Rendering text with Pygame" in kandi. You can try any such use case!

Dependent Libraries

If you do not have Pygame that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Pygame page in kandi.


You can search for any dependent library on kandi like Pygame.

Environment Tested


I tested this solution in the following versions. Be mindful of changes when working with other versions.

  1. The solution was created with Python 3.9.6.
  2. The solution was tested with Pygame 2.3.0.


Using this solution, we are able to render text with Pygame. This process also facilitates an easy-to-use, hassle-free way to create a hands-on working version of the code.



Finding the size of each sublist in a list tells you how the data is structured before you process it. This is useful whenever you work with sublists of varying sizes. For example, you can check that a list of data has the expected structure, identify the smallest sublist, filter out sublists that are too small to be useful, or apply calculations only to sublists of a particular size.


NumPy (short for "Numerical Python") is a powerful Python library for working with multi-dimensional arrays and matrices. It provides various mathematical functions for working with these arrays, including linear algebra, Fourier transforms, and random number generation. 


hasattr is a built-in Python function that takes an object and a string. It returns True if the object has an attribute with the given name and False otherwise. hasattr is often used in combination with other built-in functions, such as getattr and setattr, which retrieve and set the value of an attribute on an object. It is useful when a list mixes different kinds of elements and we want to check whether an element supports an operation, for example whether it has a length, before attempting it. Note that hasattr checks attributes only; to test whether a dictionary contains a key, use the in operator instead.


This is a useful tool for working with lists of sublists and can save you time and effort when you need to analyze or manipulate data in this format. 
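The original snippet is not reproduced here, so the following is a minimal sketch of the idea in plain Python, combining len() and hasattr as described above. It assumes that "neglect empty and single elements" means skipping empty sublists and bare scalars; the function name and sample data are hypothetical.

```python
def sublist_sizes(items):
    """Return the length of each non-empty sized element in items."""
    sizes = []
    for item in items:
        # Scalars such as 7 have no __len__, so hasattr filters them out.
        if hasattr(item, "__len__") and len(item) > 0:
            sizes.append(len(item))
    return sizes


data = [[1, 2, 3], [4, 5], [], 7, [8]]
print(sublist_sizes(data))  # [3, 2, 1]
```

Strings also define `__len__`, so a list containing strings would need an extra `isinstance(item, str)` check if strings should be treated as single elements.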

Preview of the output that you will get on running this code.

Code

In this solution we have used the built-in len() function in Python.

  1. Copy the code using the "Copy" button above and paste it into your Python IDE.
  2. Run the code to get the size of each sublist, skipping empty and single elements.


I hope you have found this useful. The version information is in the following section.


I found this code snippet by searching for "Python find size of each sublist in a list" in kandi. You can try any such use case.

Environment Tested

In this solution we have used the following versions. Be mindful of changes when working with other versions.


  • This solution was created using Python 3.7.15.


Using this solution, we are able to get the size of each sublist, skipping empty and single elements, in Python with a few simple steps. This process also facilitates an easy-to-use, hassle-free way to create a hands-on working version of code that gets the size of each sublist in Python.

Dependent Library

If you do not have numpy, which is required to run this code, you can install it by clicking on the above link and copying the pip install command from the numpy page in kandi.

You can search for any dependent library on kandi, such as numpy.



Trending Discussions on Data Manipulation

Vue - Wait for for loop to fetch all items asynchronously

How to get file name which have error using AWK command?

Declaring variable in R for DBI query to MS SQL

Check if azure databricks mount point exists from .NET

R basics: working with multiple variables at once and their output

I can't read excel file using dt.fread from datatable AttributeError

How can I use data.table in a package without importing all functions?

Conflicting object names within a solution

How to update shiny module with reactive dataframe from another module

R command `group_by`

QUESTION

Vue - Wait for for loop to fetch all items asynchronously

Asked 2022-Apr-17 at 17:32

I have an array of data to be fetched, so I have to use a for loop to fetch all the data, but I want to do it asynchronously (multiple calls at the same time). After having fetched the data I also want to do some data manipulation, so I need to run code AFTER all the data has been fetched.

for (var e in this.dataTofetch) {
  axios
    .get("https://www.example.com/api/" + e)
    .then((response) => this.fetchedData.push(response.data));
}
this.manipulateData();

The problem is that whenever I reach the manipulateData function, fetchedData is empty.

I also tried doing it synchronously using await, and it works, but it becomes very slow when making multiple calls.

ANSWER

Answered 2022-Apr-17 at 17:21

The best approach I can think of is to use Promise.all(). You will leave out the .then-handler, because axios.get() returns you a promise.

An exact implementation example can be found here at StackOverflow: Promise All with Axios.

Source https://stackoverflow.com/questions/71903882

QUESTION

How to get file name which have error using AWK command?

Asked 2022-Mar-12 at 17:18

I am using the SAC tool to read the header information, but some files have no header information and it prints an error. Is there any way to use AWK to print those files if they do not have a header or produce an error? I have often used AWK for data manipulation but failed this time.

Here is my try:

saclst a f *2020-05*BHZ*

This is the output

GS.GS043.2020-05-18T03:52.BHZ.sac         3.37
GS.GS043.2020-05-18T09:28.BHZ.sac         3.64
GS.GS043.2020-05-18T12:09.BHZ.sac         3.42
saclst: Error determining SAC header: GS.GS043.2020-05-18T14:36.BHZ.sac
GS.GS043.2020-05-18T16:25.BHZ.sac         2.92
GS.GS043.2020-05-18T18:51.BHZ.sac         3.66

Now I want to get the file name and print it but seems like AWK does not help;

saclst a f *2020-05*BHZ* | awk '{if ($2<0) print $1;}' > ../test.dat

My output file is empty and the terminal shows this error:

saclst: Error determining SAC header: SC.LZB.2020-05-21T10:46.BHZ.sac
saclst: Error determining SAC header: SC.LZB.2020-05-21T11:57.BHZ.sac
saclst: Error determining SAC header: SC.LZB.2020-05-26T11:23.BHZ.sac
saclst: Error determining SAC header: SC.LZB.2020-05-28T10:44.BHZ.sac
saclst: Error determining SAC header: SC.QSC.2020-05-12T06:49.BHZ.sac

Is there any way to save this error so I can later modify it?

ANSWER

Answered 2022-Mar-12 at 09:06

Here's what I think you are looking for:

# just for demo, pipe SAC tool to awk for your actual use case
$ cat ip.txt
GS.GS043.2020-05-18T03:52.BHZ.sac         3.37
GS.GS043.2020-05-18T09:28.BHZ.sac         3.64
GS.GS043.2020-05-18T12:09.BHZ.sac         3.42
saclst: Error determining SAC header: GS.GS043.2020-05-18T14:36.BHZ.sac
GS.GS043.2020-05-18T16:25.BHZ.sac         2.92
GS.GS043.2020-05-18T18:51.BHZ.sac         3.66

# filter lines with Error based on number of fields or `Error` in 2nd field
$ awk 'NF != 2' ip.txt
saclst: Error determining SAC header: GS.GS043.2020-05-18T14:36.BHZ.sac
$ awk '$2 == "Error"' ip.txt
saclst: Error determining SAC header: GS.GS043.2020-05-18T14:36.BHZ.sac

# print only last field
$ awk '$2 == "Error"{print $NF}' ip.txt
GS.GS043.2020-05-18T14:36.BHZ.sac
If the saclst command puts the lines with Error on stderr, you can use this:

$ saclst a f *2020-05*BHZ* 2> error.log

Source https://stackoverflow.com/questions/71448054

QUESTION

Declaring variable in R for DBI query to MS SQL

Asked 2022-Mar-08 at 14:11

I'm writing an R query that runs several SQL queries using the DBI package to create reports. To make this work, I need to be able to declare a variable in R (such as a Period End Date) that is then called from within the SQL query. When I run my query, I get the following error:

If I simply use the field name (PeriodEndDate), I get the following error:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘dbGetQuery’ for signature ‘"Microsoft SQL Server", "character"’

If I use @ to access the field name (@PeriodEndDate), I get the following error:

Error: nanodbc/nanodbc.cpp:1655: 42000: [Microsoft][ODBC SQL Server Driver][SQL Server]Must declare the scalar variable "@PeriodEndDate". [Microsoft][ODBC SQL Server Driver][SQL Server]Statement(s) could not be prepared. '

An example query might look like this:

library(DBI)  # Used for connecting to SQL server and submitting SQL queries.
library(tidyverse)  # Used for data manipulation and creating/saving CSV files.
library(lubridate) # Used to calculate end of month, start of month in queries

# Define time periods for queries.
PeriodEndDate <<- ceiling_date(as.Date('2021-10-31'),'month')  # Enter Period End Date on this line.
PeriodStartDate <<- floor_date(PeriodEndDate, 'month')

# Connect to SQL Server.
con <- dbConnect(
  odbc::odbc(),
  driver = "SQL Server",
  server = "SERVERNAME",
  trusted_connection = TRUE,
  timeout = 5,
  encoding = "Latin1")

samplequery <- dbGetQuery(con, "
     SELECT * FROM [TableName]
     WHERE OrderDate <= @PeriodEndDate
")

I believe one way might be to use the paste function, like this:

samplequery <- dbGetQuery(con, paste("
     SELECT * FROM [TableName]
     WHERE OrderDate <=", PeriodEndDate))

However, that can get unwieldy if it involves several variables being referenced outside the query or in several places within the query.

Is there a relatively straightforward way to do this?

Thanks in advance for any thoughts you might have!

ANSWER

Answered 2022-Mar-08 at 14:11

The mechanism in most DBI-based connections is to use ?-placeholders[1] in the query and params= in the call to DBI::dbGetQuery or DBI::dbExecute.

Perhaps this:

samplequery <- dbGetQuery(con, "
     SELECT * FROM [TableName]
     WHERE OrderDate <= ?
", params = list(PeriodEndDate))

In general the mechanisms for including an R object as a data-item are enumerated well in https://db.rstudio.com/best-practices/run-queries-safely/. In the order of my recommendation,

  1. Parameterized queries (as shown above);
  2. glue::glue_sql;
  3. sqlInterpolate (which uses the same ?-placeholders as #1);
  4. The link also mentions "manual escaping" using dbQuoteString.

Anything else is in my mind more risky due to inadvertent SQL corruption/injection.

I've seen many questions here on SO that try to use one of the following techniques: paste and/or sprintf using sQuote or hard-coded paste0("'", PeriodEndDate, "'"). These are too fragile in my mind and should be avoided.

My preference for parameterized queries extends beyond this usability, it also can have non-insignificant impacts on repeated use of the same query, since DBMSes tend to analyze/optimize the query and cache this for the next use. Consider this:

### parameterized queries
DBI::dbGetQuery(con, "select ... where OrderDate >= ?", params=list("2020-02-02"))
DBI::dbGetQuery(con, "select ... where OrderDate >= ?", params=list("2020-02-03"))

### glue_sql
PeriodEndDate <- as.Date("2020-02-02")
qry <- glue::glue_sql("select ... where OrderDate >= {PeriodEndDate}", .con=con)
# <SQL> select ... where OrderDate >= '2020-02-02'
DBI::dbGetQuery(con, qry)
PeriodEndDate <- as.Date("2021-12-22")
qry <- glue::glue_sql("select ... where OrderDate >= {PeriodEndDate}", .con=con)
# <SQL> select ... where OrderDate >= '2021-12-22'
DBI::dbGetQuery(con, qry)

In the case of parameterized queries, the "query" itself never changes, so its optimized query (internal to the server) can be reused.

In the case of the glue_sql queries, the query itself changes (albeit just a handful of character), so most (all?) DBMSes will re-analyze and re-optimize the query. While they tend to do it quickly, and most analysts' queries are not complex, it is still unnecessary overhead, and missing an opportunity in cases where your query and/or the indices require a little more work to optimize well.


Notes:

  1. ? is used by most DBMSes but not all. Others use $name or $1 or such. With odbc::odbc(), however, it is always ? (no name, no number), regardless of the actual DBMS.

  2. Not sure if you are using this elsewhere, but the use of <<- (vice <- or =) can encourage bad habits and/or unreliable/unexpected results.

  3. It is not uncommon to use the same variable multiple times in a query. Unfortunately, you will need to include the variable multiple times, and order is important. For example,

samplequery <- dbGetQuery(con, "
     SELECT * FROM [TableName]
     WHERE OrderDate <= ?
       or (SomethingElse = ? and OrderDate > ?)
", params = list(PeriodEndDate, 99, PeriodEndDate))

  • If you have a list/vector of values and want to use SQL's IN operator, then you have two options, my preference being the first (for the reasons stated above):

    1. Create a string of question marks and paste it into the query. (Yes, this is pasting into the query, but we are not dealing with the risk of incorrectly single-quoting or double-quoting values. Since DBI does not support any other mechanism, this is what we have.)

    MyDates <- c(..., ...)
    qmarks <- paste(rep("?", length(MyDates)), collapse=",")
    samplequery <- dbGetQuery(con, sprintf("
         SELECT * FROM [TableName]
         WHERE OrderDate IN (%s)
    ", qmarks), params = as.list(MyDates))

    2. glue_sql supports expanding a vector internally:

    MyDates <- c(..., ...)
    qry <- glue::glue_sql("
         SELECT * FROM [TableName]
         WHERE OrderDate IN ({MyDates*})", .con=con)
    DBI::dbGetQuery(con, qry)
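
    For readers more at home outside R, the "one question mark per value" idea from option 1 can be sketched in Python with the standard-library sqlite3 module. This is only an illustration under assumptions: the Orders table, its columns, and the helper name select_in are hypothetical, and SQLite stands in for SQL Server.

```python
import sqlite3
from datetime import date


def select_in(con, dates):
    """Return rows whose OrderDate is one of `dates`, binding one '?' per value."""
    # Only the placeholders are pasted into the SQL text; the values
    # themselves are passed separately and bound by the driver.
    qmarks = ",".join("?" for _ in dates)
    sql = (
        "SELECT OrderDate, Amount FROM Orders "
        f"WHERE OrderDate IN ({qmarks}) ORDER BY OrderDate"
    )
    return con.execute(sql, [d.isoformat() for d in dates]).fetchall()


# Hypothetical in-memory table used only for this sketch.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Orders (OrderDate TEXT, Amount REAL)")
con.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [("2021-10-01", 10.0), ("2021-10-15", 20.0), ("2021-11-01", 30.0)],
)

rows = select_in(con, [date(2021, 10, 1), date(2021, 11, 1)])
```

    As in the R version, the query text varies only in the number of placeholders, so quoting of the values is left entirely to the database driver.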

    Source https://stackoverflow.com/questions/71394206

    QUESTION

    Check if azure databricks mount point exists from .NET

    Asked 2021-Dec-14 at 08:44

    I work on an app that does some data engineering; we use Azure ADLS for data storage and Databricks for data manipulation. There are two approaches to retrieving the data: the first uses the storage account and its secret key, and the other uses a mount point. With the first approach, I can successfully check, from .NET, whether the storage account and its secret key correspond to each other and return a message saying whether the credentials are right or not. However, I need to do the same thing with the mount point, i.e. determine whether the mount point exists in dbutils.fs.mounts() or anywhere in the storage (I don't know exactly how a mount point works or whether it stores data in blob storage).

    The flow for Storage account and Secret key is the following:

    1. Try to connect using the BlobServiceClient API from Microsoft;
    2. If it fails, return a message to the user that the credentials are invalid;
    3. If it doesn't fail, proceed further.

    I'm not that familiar with /mnt/ and the like because I mostly do .NET, but is there a way to check from .NET whether a mount point exists?

    ANSWER

    Answered 2021-Dec-14 at 08:44

    A mount point is just a kind of reference to the underlying cloud storage. The dbutils.fs.mounts() command needs to be executed on a cluster: it's doable, but it's slow and cumbersome.

    The simplest way to check is to use the List command of the DBFS REST API, passing the mount point name /mnt/<something> as the path parameter. If it doesn't exist, you'll get the error RESOURCE_DOES_NOT_EXIST:

    {
      "error_code": "RESOURCE_DOES_NOT_EXIST",
      "message": "No file or directory exists on path /mnt/test22/."
    }
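
    The check described above can be sketched as follows. The question is about .NET, so this Python version is only meant to show the shape of the request and how the error payload might be interpreted. The function names, the host and token arguments, and the decision to raise on any other error are assumptions of this sketch; the endpoint is the DBFS list call at /api/2.0/dbfs/list with a path query parameter.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def interpret_list_response(status, body):
    """Map a DBFS list response to True/False; raise on anything unexpected."""
    payload = json.loads(body) if body else {}
    if status == 200:
        return True
    if payload.get("error_code") == "RESOURCE_DOES_NOT_EXIST":
        return False
    raise RuntimeError(f"unexpected DBFS response {status}: {payload}")


def mount_point_exists(host, token, mount_path):
    """Ask the workspace at `host` whether `mount_path` (e.g. /mnt/test22) exists.

    `host` and `token` are placeholders for a workspace URL and an access token.
    """
    query = urllib.parse.urlencode({"path": mount_path})
    req = urllib.request.Request(
        f"{host}/api/2.0/dbfs/list?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return interpret_list_response(resp.status, resp.read().decode())
    except urllib.error.HTTPError as err:
        # Error responses carry the JSON payload shown above in their body.
        return interpret_list_response(err.code, err.read().decode())
```

    The same flow maps onto the storage-account check from the question: try the call, and turn a RESOURCE_DOES_NOT_EXIST answer into a message for the user.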

    Source https://stackoverflow.com/questions/70345499

    QUESTION

    R basics: working with multiple variables at once and their output

    Asked 2021-Nov-29 at 19:49

    I have a survey dataset with 40 ordered factor variables. The variables are transformed into characters when the data is imported. Please correct me if I am wrong, as I am thinking of using the apply function here.

    Below is my data manipulation:

    ### data
    v1 <- as.character(c(1,4,2,4,3,1,3,4,5,2,2,3,6,5,4,6,5,4,5,6,6,2,4,3,4,5,6,1,6,3,5,6,3,2,4,5,3,2,4,5,3,2,4))
    v2 <- as.character(c(3,4,1,4,5,1,3,1,5,6,4,3,4,5,6,3,3,5,4,3,3,5,6,3,4,3,4,6,3,1,1,3,4,5,6,1,3,6,4,3,1,6,5))

    df <- data.frame(v1,v2)

    ### transform into ordered factor

    df$v1.f <- as.factor(df$v1)
    df$v1.f <- ordered(df$v1.f, levels = c("1", "2", "3", "4", "5", "6"))

    The real levels are unsorted characters, which is why I included the step. I don't mind typing this for all variables, but it seems redundant.

    My second issue is with the output. I would like to create a fancy report and know how to generate the numbers for it:

    v1.freq <- table(df$v1.f)
    v1.perc <- round(prop.table(v1.freq),2)*100
    v1.med <- median(df$v1)

    How can I print a single table that contains all of this information for multiple variables at once, especially when a level has no answers (see v2, where there is no response for level 2; table() simply skips over the level)?

    How do I turn the R output into a table that has the levels as headers and frequencies and percentages as rows for multiple variables?

    Copy/pasting the numbers into an Excel Sheet seems - again - unnecessary and prone to errors.

    ANSWER

    Answered 2021-Nov-29 at 10:57

    First, you might want to check whether your data import function has a stringsAsFactors option.

    Then, as I understand it, you want to transform all of your variables into ordered factors. You can wrap this into a dplyr pipeline, and use forcats to handle factors. Let's take your data:

    library(tidyverse)
    df %>% 
      mutate(across(1:2, ~factor(.))) %>% 
      mutate(across(1:2,~ordered(.))) %>% 
      str()

    Output:

    'data.frame':   43 obs. of  2 variables:
     $ v1: Ord.factor w/ 6 levels "1"<"2"<"3"<"4"<..: 1 4 2 4 3 1 3 4 5 2 ...
     $ v2: Ord.factor w/ 5 levels "1"<"3"<"4"<"5"<..: 2 3 1 3 4 1 2 1 4 5 ...

    As you can see, the variables are converted to ordered factors, with levels sorted alphabetically. To explain: mutate alters your variables, and across specifies which variables to change and how. Here, we mutate columns 1 to 2 and apply the functions factor and then ordered to them. If the alphabetical ordering isn't the one you want, you can still mutate a column by itself and supply the levels argument.

    For the second question: since there is no level "2" for v2, unlike v1, you cannot merge the two variables unless you add that level to v2 with NA. You can still use janitor::tabyl to get frequencies, and create one table per variable:

    library(janitor)
    df2 <- df %>% 
      mutate(across(1:2, ~factor(.))) %>% 
      mutate(across(1:2,~ordered(.)))

    map(df2, tabyl)

    Output:

    $v1
     .x[[i]]  n    percent
           1  3 0.06976744
           2  7 0.16279070
           3  8 0.18604651
           4 10 0.23255814
           5  8 0.18604651
           6  7 0.16279070

    $v2
     .x[[i]]  n   percent
           1  7 0.1627907
           3 13 0.3023256
           4  9 0.2093023
           5  7 0.1627907
           6  7 0.1627907
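
    The skipped level in the v2 table comes from the levels being inferred from the observed values. A language-neutral way to see the fix is to count against a level set that is fixed up front. The sketch below does this in plain Python rather than R; the LEVELS list and the freq_table helper are assumptions made for illustration, and the data is the v2 vector from the question.

```python
from collections import Counter

# Assumed full set of answer levels, declared up front rather than inferred.
LEVELS = ["1", "2", "3", "4", "5", "6"]


def freq_table(values, levels=LEVELS):
    """Return {level: (count, percent)} for every level, including empty ones."""
    counts = Counter(values)
    total = len(values)
    return {
        lvl: (counts.get(lvl, 0), round(100 * counts.get(lvl, 0) / total))
        for lvl in levels
    }


# The v2 responses from the question, as strings.
v2 = [str(x) for x in [
    3, 4, 1, 4, 5, 1, 3, 1, 5, 6, 4, 3, 4, 5, 6, 3, 3, 5, 4, 3, 3, 5,
    6, 3, 4, 3, 4, 6, 3, 1, 1, 3, 4, 5, 6, 1, 3, 6, 4, 3, 1, 6, 5,
]]

table_v2 = freq_table(v2)
```

    Because every variable is counted against the same levels, the per-variable results line up column by column and can be stacked into one report table.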

    Source https://stackoverflow.com/questions/70152969

    QUESTION

    I can't read excel file using dt.fread from datatable AttributeError

    Asked 2021-Nov-19 at 15:28

    Hello, I'm trying to read an Excel file 'myFile.xlsx' using the datatable.fread function (version 1.0.0) to speed up data manipulation.

    The problem is I had an AttributeError: module 'xlrd' has no attribute 'xlsx'.

    The command I used is:

    import datatable as dt
    DT = dt.fread("myFile.xlsx")

    I checked, and the error occurs in the xls module of the datatable package:

    def read_xls_workbook(filename, subpath):
        try:
            import xlrd
            # Fixes the warning
            # "PendingDeprecationWarning: This method will be removed in future
            #  versions.  Use 'tree.iter()' or 'list(tree.iter())' instead."
            xlrd.xlsx.ensure_elementtree_imported(False, None) # Here
            xlrd.xlsx.Element_has_iter = True # and Here

    Is there any way to fix this issue, please?

    ANSWER

    Answered 2021-Nov-19 at 15:28

    The issue is that the datatable package has not yet been updated to work with xlrd > 1.2.0, so to make it work you have to install xlrd 1.2.0:

    pip install xlrd==1.2.0

    I hope it helped.
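
    Before calling fread on an .xlsx file, it can help to check whether the installed xlrd still exposes the xlsx attribute that the traceback complains about. The helper below is a small sketch of that check; the function name xlrd_supports_xlsx is made up for this example, and it only inspects the installed package without reading any file.

```python
import importlib


def xlrd_supports_xlsx():
    """Return True if xlrd is installed and still exposes `xlrd.xlsx`."""
    try:
        xlrd = importlib.import_module("xlrd")
    except ImportError:
        # xlrd is not installed at all.
        return False
    # The AttributeError in the question is raised exactly when this is missing.
    return hasattr(xlrd, "xlsx")


if not xlrd_supports_xlsx():
    print("xlrd with xlsx support not found; try: pip install xlrd==1.2.0")
```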

    Source https://stackoverflow.com/questions/70035997

    QUESTION

    How can I use data.table in a package without importing all functions?

    Asked 2021-Oct-27 at 22:46

    I'm building an R package in which I would like to use dtplyr to perform various bits of data manipulation. My issue is that dtplyr seems to only work if I import the whole of data.table (i.e. using the roxygen #' @import data.table). Without this I get errors like:

    Error in .(x = sum(x), y = sum(y),  : 
      could not find function "." 

    If I can solve this problem by only importing certain functions from data.table that would be great, but there seems to be no function .() in the package. My knowledge of data.table is limited, but I can only assume it uses .() to edit parsed code (similar to the base R bquote()), but that dtplyr for some reason needs data.table to be loaded for this to work.

    I've tried various things such as withr::with_package("data.table", code) and requireNamespace("data.table"), but so far importing the whole package is the only thing that seems to work. This is not a viable solution because it completely ruins the well-maintained namespace in the package I'm working on by importing so many functions from data.table.

    NB, this package houses a project which will be worked on by many other analysts well into the future. While simply writing data.table code may be preferable in terms of performance and general good-practice, using dtplyr to translate dplyr code gives a boost in readability and ease-of-use that is far more important in this context.

    ANSWER

    Answered 2021-Oct-27 at 22:46

    The (documented) solution I found is to set .datatable.aware <- TRUE somewhere in the package source code. According to the documentation, if you're using data.table in a package without importing the whole thing, you should do this so that [.data.table() does not revert to calling [.data.frame(). From the docs:

    ...please define .datatable.aware = TRUE anywhere in your R source code (no need to export). This tells data.table that you as a package developer have designed your code to intentionally rely on data.table functionality even though it may not be obvious from inspecting your NAMESPACE file.

    Source https://stackoverflow.com/questions/69544896

    QUESTION

    Conflicting object names within a solution

    Asked 2021-Oct-13 at 19:08

    I have a project that does some file and data manipulation using several classes generated from elsewhere. I'm trying to use those generated classes in one place, but I'm running into issues when I add references in ProcessorProject to more than one of the "Item" projects because the object names conflict with each other.

    I know that this could be easily solved by wrapping the generated code within the "Item" classes in their projects' namespace, but I'm trying to avoid modifying those generated files in any way.

    Is there any other way around this that I'm not thinking of? A way to add that generated code to the project namespace without actually modifying the files themselves? Something else?

    Very simplified model:

    ProcessorProject
      Processor.cs
           switch (color)
               case "Blue":
                   BlueUtility.DoSomething();
                   break;
               case "Red":
                   RedUtility.DoSomething();
                   break;

    BlueItemProject
      BlueUtility.cs
         namespace BlueItem
            class BlueUtility
      BlueItem.cs [generated]
         partial class BlueItemInfo
             public ItemInfo Information
             public SomeOtherInformation MoreInformation
         partial class ItemInfo
         partial class SomeOtherInformation

    RedItemProject
      RedUtility.cs
         namespace RedItem
            class RedUtility
      RedItem.cs [generated]
         partial class RedItemInfo
             public ItemInfo Information
             public SomeOtherInformation MoreInformation
         partial class ItemInfo
         partial class SomeOtherInformation

    ANSWER

    Answered 2021-Oct-13 at 19:08

    Create an alias for each reference in the References properties window. Then, at the top of the file where you use them, write something like this:

    extern alias NewAliasOfProject;
    using NewAliasOfProject::NamespaceName;

    Source https://stackoverflow.com/questions/69559941

    QUESTION

    How to update shiny module with reactive dataframe from another module

    Asked 2021-Sep-27 at 09:22

    The goal of this module is to create a reactive barplot that changes based on the output of a data selector module. Unfortunately the barplot does not update. It's stuck at the first variable that's selected.

    I've tried creating observer functions to update the barplot, to no avail. I've also tried nesting the selector server module within the barplot module, but I get the error: Warning: Error in UseMethod: no applicable method for 'mutate' applied to an object of class "c('reactiveExpr', 'reactive', 'function')"

    I just need some way to tell the barplot module to update whenever the data it's fed changes.

    Barplot Module:

    #UI

    barplotUI <- function(id) {
      tagList(plotlyOutput(NS(id, "barplot"), height = "300px"))
    }

    #Server
    #' @param data Reactive element from another module: reactive(dplyr::filter(austin_map, var == input$var)) 
    barplotServer <- function(id, data) {
      moduleServer(id, function(input, output, session) {
        #Data Manipulation
        bardata <- reactive({
          bar <-
            data  |>
            mutate(
              `> 50% People of Color` = if_else(`% people of color` >= 0.5, 1, 0),
              `> 50% Low Income` = if_else(`% low-income` >= 0.5, 1, 0)
            )

          total_av <- mean(bar$value)
          poc <- bar |> filter(`> 50% People of Color` == 1)
          poc_av <- mean(poc$value)
          lowincome <- bar |> filter(`> 50% Low Income` == 1)
          lowincome_av <- mean(lowincome$value)
          bar_to_plotly <-
            data.frame(
              y = c(total_av, poc_av, lowincome_av),
              x = c("Austin Average",
                    "> 50% People of Color",
                    "> 50% Low Income")
            )

          return(bar_to_plotly)
        })

        #Plotly Barplot
        output$barplot <- renderPlotly({
          plot_ly(
            x = bardata()$x,
            y = bardata()$y,
            color = I("#00a65a"),
            type = 'bar'

          ) |>
            config(displayModeBar = FALSE)

        })
      })
    }

    EDIT : Data Selector Module

    dataInput <- function(id) {
      tagList(
        pickerInput(
          NS(id, "var"),
          label = NULL,
          width = '100%',
          inline = FALSE,
          options = list(`actions-box` = TRUE,
                         size = 10),
          choices =list(
                "O3",
                "Ozone - CAPCOG",
                "Percentile for Ozone level in air",
                "PM2.5",
                "PM2.5 - CAPCOG",
                "Percentile for PM2.5 level in air")
        )
      )
    }

    dataServer <- function(id) {
      moduleServer(id, function(input, output, session) {
        austin_map <- readRDS("./data/austin_composite.rds")
        austin_map <- as.data.frame(austin_map)
        austin_map$value <- as.numeric(austin_map$value)

        list(
          var = reactive(input$var),
          df = reactive(austin_map |> dplyr::filter(var == input$var))
        )

      })
    }

    Simplified App

    library(shiny)
    library(tidyverse)
    library(plotly)

    source("barplot.r")
    source("datamod.r")


    ui = fluidPage(
      fluidRow(
        dataInput("data"),
        barplotUI("barplot")
        )
      )

    server <- function(input, output, session) {
      data <- dataServer("data")
      variable <- data$df


      barplotServer("barplot", data = variable())

    }

    shinyApp(ui, server)

    ANSWER

    Answered 2021-Sep-27 at 09:22

    As I wrote in my comment, passing a reactive dataset as an argument to a module server is no different to passing an argument of any other type.

    Here's a MWE that illustrates the concept, passing either mtcars or a data frame of random values between a selection module and a display module.

    The critical point is that the selection module returns the reactive [data], not the reactive's value [data()], to the main server function. In turn, the main server function passes the reactive, not its value, as a parameter to the plot module.

    library(shiny)
    library(ggplot2)
    library(dplyr)  # provides tibble() and %>%, both used below

    # Select module
    selectUI <- function(id) {
        ns <- NS(id)
        selectInput(ns("select"), "Select a dataset", c("mtcars", "random"))
    }

    selectServer <- function(id) {
        moduleServer(
            id,
            function(input, output, session) {
                data <- reactive({
                    if (input$select == "mtcars") {
                        mtcars
                    } else {
                        tibble(x=runif(10), y=rnorm(10), z=rbinom(n=10, size=20, prob=0.3))
                    }
                })

                # Return the reactive itself, not its value
                return(data)
            }
        )
    }

    # Barplot module
    barplotUI <- function(id) {
        ns <- NS(id)

        tagList(
            selectInput(ns("variable"), "Select variable:", choices=c()),
            plotOutput(ns("plot"))
        )
    }

    barplotServer <- function(id, plotData) {
        moduleServer(
            id,
            function(input, output, session) {
                # Refresh the variable choices whenever the dataset changes
                observeEvent(plotData(), {
                    updateSelectInput(
                        session,
                        "variable",
                        choices=names(plotData()),
                        selected=names(plotData())[1]
                    )
                })

                output$plot <- renderPlot({
                    # There's an irritating transient error as the dataset
                    # changes, but handling it would
                    # detract from the purpose of this answer
                    plotData() %>%
                        ggplot() + geom_bar(aes_string(x=input$variable))

                })
            }
        )
    }

    # Main UI
    ui <- fluidPage(
        selectUI("select"),
        barplotUI("plot")
    )

    # Main server
    server <- function(input, output, session) {
        selectedData <- selectServer("select")
        # Pass the reactive (selectedData), not its value (selectedData())
        barplotServer("plot", plotData=selectedData)
    }

    # Run the application
    shinyApp(ui = ui, server = server)
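
    As a minimal sketch of the same pattern in a smaller setting (meanServer and the stand-in data frame are hypothetical names, not part of the question's code): the main server hands over the reactive itself, and the module calls it only inside a reactive context. Calling variable() directly in the main server function, as the question does, tries to evaluate the reactive outside any reactive context, which Shiny does not allow.

```r
library(shiny)

# Hypothetical module: takes a reactive data frame and exposes its mean.
meanServer <- function(id, data) {
  # Fail early if a plain value was passed instead of a reactive
  stopifnot(is.reactive(data))

  moduleServer(id, function(input, output, session) {
    # data() is evaluated here, inside a reactive context
    avg <- reactive(mean(data()$value))
    avg  # return the reactive so the caller can use it
  })
}

server <- function(input, output, session) {
  # Stand-in for the reactive returned by a data module (assumption)
  df <- reactive(data.frame(value = c(1, 2, 3)))

  # Pass df, not df()
  meanServer("mean", data = df)
}
```

    Under the same assumption, the question's call would become barplotServer("barplot", data = variable), with data() used inside the bardata reactive.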

    Source https://stackoverflow.com/questions/68584478

    QUESTION

    R command `group_by`

    Asked 2021-Sep-22 at 09:13

    I am not able to understand exactly how this code works. I found it in a tutorial guide:

    Data manipulation in R - Steph Locke

    On page 133 there is an example that I can only partially understand.

    library(tidyverse)
    library(nycflights13)

    flights %>%
      group_by(month, carrier) %>%
      summarise(n=n()) %>%  ##sum of items;
      group_by(month) %>%
      mutate(prop=scales::percent(n/sum(n)), n=NULL) %>%
      spread(month, prop)

    flights %>%
      group_by(month, carrier) %>%    ## This is grouping by months and within the months by carrier;
      summarise(n=n()) %>%        ## It is summing the items, giving for each month and each carrier the sum of items;

    At this point there is another group_by(); it looks like it is nested inside group_by(month, carrier).

    Then:

    mutate(prop=scales::percent(n/sum(n)), n=NULL) %>%  ## Calculates the percentage of items over the total and store them in "prop"

    The last line creates the matrix, with the months as columns and the values from prop as the cells.

    I would like to understand better what exactly the second group_by(month) %>% is doing.

    Thank you in advance for every reply.

    ANSWER

    Answered 2021-Sep-22 at 09:04

    The second group_by is not needed here because, by default, summarise uses .groups = "drop_last" in this case. Therefore, after the first summarise, only a single grouping column, 'month', remains. We can change the code to

    flights %>%
      group_by(month, carrier) %>%
      summarise(n=n()) %>%
      mutate(prop=scales::percent(n/sum(n)), n=NULL)
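
    To see the "drop_last" behaviour directly, here is a small sketch on made-up data (the toy tibble is an assumption standing in for flights). group_vars() reports which grouping columns remain after summarise, and the proportions computed by mutate add up to 1 within each month.

```r
library(dplyr)

# Toy data standing in for nycflights13::flights (hypothetical values)
toy <- tibble(
  month   = c(1, 1, 1, 2, 2),
  carrier = c("AA", "AA", "UA", "AA", "UA")
)

counts <- toy %>%
  group_by(month, carrier) %>%
  summarise(n = n())   # one row per group, so the last grouping level is dropped

# Only the first grouping column is left
group_vars(counts)

# sum(n) is therefore computed within each month, without a second group_by()
props <- counts %>%
  mutate(prop = n / sum(n))
```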

    Suppose we change the value of .groups from the default to "drop". It will then drop all the grouping variables, so a new group_by statement is needed. Also, mutate after the last grouping statement does not drop the group attributes, so ungroup is useful at the end.

    flights %>%
      group_by(month, carrier) %>%
      summarise(n=n(), .groups = "drop") %>%
      group_by(month) %>%
      mutate(prop=scales::percent(n/sum(n)), n=NULL) %>%
      ungroup()
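
    The effect of .groups = "drop" can be checked the same way. This is a sketch on made-up data (the toy tibble is an assumption, not the flights data): after the drop no grouping is left, so sum(n) would be the grand total unless the data is regrouped, and mutate keeps whatever grouping it is given until ungroup() removes it.

```r
library(dplyr)

# Toy data standing in for nycflights13::flights (hypothetical values)
toy <- tibble(
  month   = c(1, 1, 1, 2, 2),
  carrier = c("AA", "AA", "UA", "AA", "UA")
)

flat <- toy %>%
  group_by(month, carrier) %>%
  summarise(n = n(), .groups = "drop")

# No grouping columns remain
group_vars(flat)

# Without regrouping, proportions are taken over all rows
overall <- flat %>%
  mutate(prop = n / sum(n))

# Regrouping by month gives per-month proportions; mutate keeps the grouping
by_month <- flat %>%
  group_by(month) %>%
  mutate(prop = n / sum(n))
group_vars(by_month)

# ungroup() returns a plain, ungrouped tibble for later steps
result <- ungroup(by_month)
group_vars(result)
```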

    Source https://stackoverflow.com/questions/69281011

    Community Discussions contain sources that include Stack Exchange Network

