textflowcpp | A simple, single-header-only library, for wrapping text | Data Manipulation library
Trending Discussions on Data Manipulation
QUESTION
I am working with the R programming language.
I have the following dataset:
v <- c(1,2,3,4,5,6,7,8,9,10)
var_1 <- as.factor(sample(v, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
var_2 <- as.factor(sample(v, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
var_3 <- as.factor(sample(v, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
var_4 <- as.factor(sample(v, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
var_5 <- as.factor(sample(v, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
my_data = data.frame(var_1, var_2, var_3, var_4, var_5)
I also have another dataset of "conditions" that will be used for querying this data frame:
conditions = data.frame(cond_1 = c("1,3,4", "4,5,6"), cond_2 = c("5,6", "7,8,9"))
My Question: I tried to run the following command to select rows from "my_data" based on the first row of "conditions" - but this returns an empty result:
my_data[my_data$var_1 %in% unlist(conditions[1,1]) &
my_data$var_2 %in% unlist(conditions[1,2]), ]
[1] var_1 var_2 var_3 var_4 var_5
<0 rows> (or 0-length row.names)
I tried to look more into this by "inspecting" these conditions:
class(conditions[1,1])
[1] "character"
This makes me think that the "unlist()" command is not working because the conditions themselves are a "character" instead of a "list".
Is there an equivalent command that can be used here that plays the same role as the "unlist()" command so that the above statement can be run?
In general, I am trying to produce the same results as I would have gotten from this code - but keeping the format I was using above:
my_data[my_data$var_1 %in% c("1", "3", "4") &
my_data$var_2 %in% c("5", "6"), ]
Thanks!
Reference: Selecting Rows of Data Based on Multiple Conditions
ANSWER
Answered 2022-Apr-10 at 05:36
Up front, "1,3,4" != 1. It seems you should look to split the strings using strsplit(., ",").
expected <- my_data[my_data$var_1 %in% c("1", "3", "4") & my_data$var_2 %in% c("5", "6"), ]
head(expected)
# var_1 var_2 var_3 var_4 var_5
# 18 3 6 2 2 9
# 129 3 5 3 2 8
# 133 4 5 6 5 8
# 186 1 6 6 10 10
# 204 4 6 4 2 6
# 207 1 5 3 2 9
out <- my_data[do.call(`&`,
Map(`%in%`,
lapply(my_data[,1:2], as.character),
lapply(conditions, function(z) strsplit(z, ",")[[1]]))),]
head(out)
# var_1 var_2 var_3 var_4 var_5
# 18 3 6 2 2 9
# 129 3 5 3 2 8
# 133 4 5 6 5 8
# 186 1 6 6 10 10
# 204 4 6 4 2 6
# 207 1 5 3 2 9
Edit: update for new conditions: change do.call to Reduce:
conditions = data.frame(cond_1 = c("1,3,4", "4,5,6"), cond_2 = c("5,6", "7,8,9"), cond_3 = c("4,6", "9"))
out <- my_data[Reduce(`&`,
Map(`%in%`,
lapply(my_data[,1:3], as.character),
lapply(conditions, function(z) strsplit(z, ",")[[1]]))),]
head(out)
# var_1 var_2 var_3 var_4 var_5
# 133 4 5 6 5 8
# 186 1 6 6 10 10
# 204 4 6 4 2 6
# 232 1 5 6 5 8
# 332 3 6 6 5 10
# 338 1 5 6 3 6
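The core of the fix, splitting each comma-separated condition string into individual values before testing membership, does not depend on R. A minimal sketch of the same idea in Python, assuming a tiny hand-made data set instead of the sampled one; the names `rows`, `conditions` and `select_rows` are hypothetical, and conditions are keyed by column name here rather than matched by position as in the answer:

```python
# Hypothetical miniature of my_data: each row maps a column name to a string value.
rows = [
    {"var_1": "3", "var_2": "6", "var_3": "2"},
    {"var_1": "2", "var_2": "5", "var_3": "6"},
    {"var_1": "4", "var_2": "5", "var_3": "6"},
    {"var_1": "1", "var_2": "9", "var_3": "4"},
]

# First row of the conditions table: one comma-separated string per column.
conditions = {"var_1": "1,3,4", "var_2": "5,6"}


def select_rows(rows, conditions):
    """Keep rows whose value in every conditioned column is an allowed value."""
    # "1,3,4" is a single string, so split it into {"1", "3", "4"} first.
    allowed = {col: set(spec.split(",")) for col, spec in conditions.items()}
    return [
        row for row in rows
        if all(row[col] in values for col, values in allowed.items())
    ]


selected = select_rows(rows, conditions)
print(selected)
```

Adding a third condition only means adding another key to `conditions`; the `all(...)` test plays the role that Reduce with `&` plays in the R answer.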
QUESTION
I've the following table
Owner  Pet              Housing_Type
A      Cats;Dog;Rabbit  3
B      Dog;Rabbit       2
C      Cats             2
D      Cats;Rabbit      3
E      Cats;Fish        1
The code is as follows:
Data_Pets = structure(list(Owner = structure(1:5, .Label = c("A", "B", "C", "D",
"E"), class = "factor"), Pets = structure(c(2L, 5L, 1L,4L, 3L), .Label = c("Cats ",
"Cats;Dog;Rabbit", "Cats;Fish","Cats;Rabbit", "Dog;Rabbit"), class = "factor"),
House_Type = c(3L,2L, 2L, 3L, 1L)), class = "data.frame", row.names = c(NA, -5L))
Can anyone advise me how I can create new columns based on the data in Pet column by creating a new column for each animal separated by ; to look like the following table?
Owner  Cats  Dog  Rabbit  Fish  Housing_Type
A      Y     Y    Y       N     3
B      N     Y    Y       N     2
C      N     Y    N       N     2
D      Y     N    Y       N     3
E      Y     N    N       Y     1
Thanks!
ANSWER
Answered 2022-Mar-15 at 08:48
One approach is to define a helper function that matches for a specific animal, then bind the columns to the original frame.
Note that some wrangling is done to get rid of whitespace to identify the unique animals to query.
f <- Vectorize(function(string, match) {
ifelse(grepl(match, string), "Y", "N")
}, c("match"))
library(dplyr)
df <- Data_Pets  # the data frame defined in the question
df %>%
  bind_cols(
    f(df$Pets, unique(unlist(strsplit(trimws(as.character(df$Pets)), ";"))))
  )
Owner Pets House_Type Cats Dog Rabbit Fish
1 A Cats;Dog;Rabbit 3 Y Y Y N
2 B Dog;Rabbit 2 N Y Y N
3 C Cats 2 Y N N N
4 D Cats;Rabbit 3 Y N Y N
5 E Cats;Fish 1 Y N N Y
Or, more generally, if you don't know for sure that the separator is ; and whitespace may be present, stringi is useful:
dplyr::bind_cols(
df,
f(df$Pets, unique(unlist(stringi::stri_extract_all_words(df$Pets))))
)
QUESTION
I have this data frame:
color <- c("AKZ", "ZZA", "KAK")
color_1 <- sample(color, 100, replace=TRUE, prob=c(0.4, 0.3, 0.3))
id = 1:100
sample_data = data.frame(id, color_1)
id color_1
1 1 KAK
2 2 AKZ
3 3 KAK
4 4 KAK
5 5 AKZ
6 6 ZZA
Suppose there is a legend:
- K = 3
- A = 4
- Z = 6
I want to add two columns to the above data frame:
- sample_data$add_score : e.g. KAK = K + A + K = 3 + 4 + 3 = 10
- sample_data$multiply_score : e.g. KAK = K * A * K = 3 * 4 * 3 = 36
I thought of solving the problem like this:
sample_data$first = substr(color_1,1,1)
sample_data$second = substr(color_1,2,2)
sample_data$third = substr(color_1,3,3)
sample_data$first_score = ifelse(sample_data$first == "K", 3, ifelse(sample_data$first == "A", 4, 6))
sample_data$second_score = ifelse(sample_data$second == "K", 3, ifelse(sample_data$second == "A", 4, 6))
sample_data$third_score = ifelse(sample_data$third == "K", 3, ifelse(sample_data$third == "A", 4, 6))
sample_data$add_score = sample_data$first_score + sample_data$second_score + sample_data$third_score
sample_data$multiply_score = sample_data$first_score * sample_data$second_score * sample_data$third_score
But I think this way would take a long time if the length of "color_1" was longer. Given a scoring legend, is there a faster way to do this?
Thank you!
ANSWER
Answered 2022-Mar-10 at 04:12
We can use stri_replace_all_regex to replace the letters in your color_1 with integers together with the arithmetic operator.
Here I've stored your values in a vector color_1_convert. We can use this as the input to stri_replace_all_regex for better management of the values.
library(dplyr)
library(stringi)
color_1_convert <- c("K" = "3", "A" = "4", "Z" = "6")
sample_data %>%
group_by(id) %>%
mutate(add_score = eval(parse(text = gsub("\\+$", "", stri_replace_all_regex(color_1, names(color_1_convert), paste0(color_1_convert, "+"), vectorize_all = F)))),
multiply_score = eval(parse(text = gsub("\\*$", "", stri_replace_all_regex(color_1, names(color_1_convert), paste0(color_1_convert, "*"), vectorize_all = F)))))
# A tibble: 100 × 4
# Groups: id [100]
id color_1 add_score multiply_score
1 1 KAK 10 36
2 2 ZZA 16 144
3 3 AKZ 13 72
4 4 ZZA 16 144
5 5 AKZ 13 72
6 6 AKZ 13 72
7 7 AKZ 13 72
8 8 KAK 10 36
9 9 ZZA 16 144
10 10 AKZ 13 72
# … with 90 more rows
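The legend can also be applied character by character with a plain lookup, which avoids building and evaluating expression strings. A minimal sketch of that alternative in Python (not the answer's R approach); the function name `scores` is hypothetical, and it assumes every letter of a code appears in the legend:

```python
from math import prod

# Scoring legend from the question.
legend = {"K": 3, "A": 4, "Z": 6}


def scores(code, legend):
    """Return (sum, product) of the legend values of each character in `code`."""
    values = [legend[ch] for ch in code]  # raises KeyError for an unknown letter
    return sum(values), prod(values)


for code in ["KAK", "ZZA", "AKZ"]:
    print(code, scores(code, legend))
```

Because the lookup walks the string itself, the same function handles codes of any length without extra columns.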
QUESTION
I have a database with columns M1, M2 and M3. These M values correspond to the values obtained by each method. My idea is now to make a rank column for each of them. For M1 and M2, the rank will be from the highest value to the lowest value, and for M3 in reverse. I made the output table for you to see.
df1<-structure(list(M1 = c(400,300, 200, 50), M2 = c(500,200, 10, 100), M3 = c(420,330, 230, 51)), class = "data.frame", row.names = c(NA,-4L))
> df1
M1 M2 M3
1 400 500 420
2 300 200 330
3 200 10 230
4 50 100 51
Output
> df1
M1 rank M2 rank M3 rank
1 400 1 500 1 420 4
2 300 2 200 2 330 3
3 200 3 10 4 230 2
4 50 4 100 3 51 1
ANSWER
Answered 2022-Mar-07 at 14:15
Using rank and relocate:
library(dplyr)
df1 %>%
mutate(across(M1:M2, ~ rank(-.x), .names = "{.col}_rank"),
M3_rank = rank(M3)) %>%
relocate(order(colnames(.)))
M1 M1_rank M2 M2_rank M3 M3_rank
1 400 1 500 1 420 4
2 300 2 200 2 330 3
3 200 3 10 4 230 2
4 50 4 100 3 51 1
If you have duplicate values in your vector, then you have to choose a method for ties. By default, you get the average rank, but you can choose "first".
Another possibility, which I think is what you want, is to convert the ranks to a factor and then to numeric, so that you get only whole-number ranks (not the average) and tied values share the same rank.
df1 <- data.frame(M1 = c(400,300, 50, 300))
df1 %>%
mutate(M1_rankAverage = rank(-M1),
M1_rankFirst = rank(-M1, ties.method = "first"),
M1_unique = as.numeric(as.factor(rank(-M1))))
M1 M1_rankAverage M1_rankFirst M1_unique
1 400 1.0 1 1
2 300 2.5 2 2
3 50 4.0 4 3
4 300 2.5 3 2
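The M1_unique column is what is often called a dense rank: tied values share a rank and no rank is skipped. A minimal sketch of that ranking in Python, assuming descending order as in the example; the function name `dense_rank_desc` is hypothetical:

```python
def dense_rank_desc(values):
    """Rank values from largest (1) to smallest; ties share a rank, no gaps."""
    distinct = sorted(set(values), reverse=True)          # e.g. [400, 300, 50]
    position = {v: i + 1 for i, v in enumerate(distinct)}  # value -> rank
    return [position[v] for v in values]


m1 = [400, 300, 50, 300]
m1_unique = dense_rank_desc(m1)
print(m1_unique)
```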
QUESTION
I am working on a Python project that has a DataFrame like this:
data = {'AAA': [3, 8, 2, 1],
'BBB': [5, 4, 7, 2],
'CCC': [2, 5, 6, 4]}
df = pd.DataFrame(data)
which leads to:
   AAA  BBB  CCC
0    3    5    2
1    8    4    5
2    2    7    6
3    1    2    4
And the task consists of generating the following DataFrame:
   AAA  BBB  CCC Role
0    3    5    2  BBB
1    8    4    5  AAA
2    2    7    6  BBB
3    1    2    4  CCC
Where the "Role" column elements are the headers of the columns that hold the highest value in each row.
Could you please help me by suggesting a code that solves this task?
ANSWER
Answered 2022-Feb-24 at 20:48
You could use the idxmax method with axis=1:
df['Role'] = df.idxmax(axis=1)
Output:
AAA BBB CCC Role
0 3 5 2 BBB
1 8 4 5 AAA
2 2 7 6 BBB
3 1 2 4 CCC
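To make explicit what the row-wise idxmax computes, here is a dependency-free sketch in plain Python on the same data. It is an illustration only, not how pandas implements it; the variable names are hypothetical:

```python
data = {"AAA": [3, 8, 2, 1],
        "BBB": [5, 4, 7, 2],
        "CCC": [2, 5, 6, 4]}

columns = list(data)
n_rows = len(data[columns[0]])

# For each row, pick the column label whose value is largest.
# max() returns the first maximal item, so ties go to the leftmost column.
role = [max(columns, key=lambda col: data[col][i]) for i in range(n_rows)]
print(role)
```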
QUESTION
I would like to know of a fast/efficient way in any program (awk/perl/python) to split a csv file (say 10k columns) into multiple small files each containing 2 columns. I would be doing this on a unix machine.
#contents of large_file.csv
1,2,3,4,5,6,7,8
a,b,c,d,e,f,g,h
q,w,e,r,t,y,u,i
a,s,d,f,g,h,j,k
z,x,c,v,b,n,m,z
I now want multiple files like this:
# contents of 1.csv
1,2
a,b
q,w
a,s
z,x
# contents of 2.csv
1,3
a,c
q,e
a,d
z,c
# contents of 3.csv
1,4
a,d
q,r
a,f
z,v
and so on...
I can do this currently with awk on small files (say 30 columns) like this:
awk -F, 'BEGIN{OFS=",";} {for (i=1; i < NF; i++) print $1, $(i+1) > i ".csv"}' large_file.csv
The above takes a very long time with large files and I was wondering if there is a faster and more efficient way of doing the same.
Thanks in advance.
ANSWER
Answered 2021-Dec-12 at 05:22
Based on your samples and attempt, please try the following awk code. Because your version keeps every output file open at once, it may fail with the infamous "too many open files" error. To avoid that, collect all values in an array, then in the END block of this awk code write them out one file at a time, closing each output file as soon as its contents have been printed.
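awk '
BEGIN{ FS=OFS="," }
{
  # buffer "$1,$(i+1)" per output file; loop body restored from the description above
  for(i=1;i<NF;i++){
    value[i]=(value[i]?value[i] ORS:"") $1 OFS $(i+1)
  }
}
END{
  for(i=1;i<NF;i++){
    outFile=i".csv"
    print value[i] > (outFile)
    close(outFile)
  }
}
' large_file.csv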
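Since the question also allows Python, the same buffer-then-write idea can be sketched with the standard csv module. This is a sketch under assumptions, not a benchmarked replacement: it reads the whole file into memory, expects every row to have as many fields as the first, and the function name `split_pairs` is hypothetical. The demo at the end runs on part of the question's sample in a temporary directory:

```python
import csv
import tempfile
from pathlib import Path


def split_pairs(src, out_dir="."):
    """Write i.csv (i = 1, 2, ...) pairing column 1 with column i+1 of `src`."""
    with open(src, newline="") as fh:
        rows = list(csv.reader(fh))  # the whole file is held in memory
    if not rows:
        return 0
    n_cols = len(rows[0])
    out_dir = Path(out_dir)
    for i in range(1, n_cols):
        # Only one output file is open at a time, so the open-file limit is not hit.
        with open(out_dir / f"{i}.csv", "w", newline="") as out:
            writer = csv.writer(out, lineterminator="\n")
            writer.writerows([row[0], row[i]] for row in rows)
    return n_cols - 1


# Demo on the first rows of the question's sample.
sample = "1,2,3,4,5,6,7,8\na,b,c,d,e,f,g,h\nq,w,e,r,t,y,u,i\n"
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "large_file.csv"
    src.write_text(sample)
    n_written = split_pairs(src, tmp)
    first = (Path(tmp) / "1.csv").read_text()
    second = (Path(tmp) / "2.csv").read_text()
print(n_written)
print(first)
```

For inputs too large to hold in memory, the columns could be processed in batches, at the cost of reading the source file once per batch.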
QUESTION
Good afternoon, friends!
I'm currently performing some calculations in R (df is displayed below). My goal is to display in a new column the first non-null value from selected cells for each row.
My df is:
MD <- c(100, 200, 300, 400, 500)
liv <- c(0, 0, 1, 3, 4)
liv2 <- c(6, 2, 0, 4, 5)
liv3 <- c(1, 1, 1, 1, 1)
liv4 <- c(1, 0, 0, 3, 5)
liv5 <- c(0, 2, 7, 9, 10)
df <- data.frame(MD, liv, liv2, liv3, liv4, liv5)
I want to display (in a column called "liv6") the first non-null (that is, non-zero) value from the 5 cells in each row. For the first row, liv = 0, liv2 = 6, liv3 = 1, liv4 = 1 and liv5 = 0, so the result should be 6. This calculation should be repeated for each row in my data frame.
I do know how to do this in Python, but not in R.
Any help is highly appreciated!
ANSWER
Answered 2022-Feb-03 at 11:16
One option with dplyr could be:
df %>%
rowwise() %>%
mutate(liv6 = with(rle(c_across(liv:liv5)), values[which.max(values != 0)]))
MD liv liv2 liv3 liv4 liv5 liv6
1 100 0 6 1 1 0 6
2 200 0 2 1 0 2 2
3 300 1 0 1 0 7 1
4 400 3 4 1 3 9 3
5 500 4 5 1 5 10 4
QUESTION
I am again struggling with transforming a wide df into a long one using pivot_longer.
The data frame is a result of power analysis for different effect sizes and sample sizes, this is how the original df looks like:
es_issue_owner es_independence es_party pwr_issue_owner_1200 pwr_independence_1200 pwr_party_1200 pwr_issue_owner_2400 pwr_independence_2400 pwr_party_2400
1 0.1 0.1 0.1 0.087 0.080 0.081 0.130 0.163 0.102
2 0.2 0.2 0.2 0.235 0.273 0.157 0.406 0.513 0.267
Or with dput:
example <- structure(list(es_issue_owner = c(0.1, 0.2), es_independence = c(0.1,
0.2), es_party = c(0.1, 0.2), pwr_issue_owner_1200 = c(0.087,
0.235), pwr_independence_1200 = c(0.08, 0.273), pwr_party_1200 = c(0.081,
0.157), pwr_issue_owner_2400 = c(0.13, 0.406), pwr_independence_2400 = c(0.163,
0.513), pwr_party_2400 = c(0.102, 0.267)), row.names = 1:2, class = "data.frame")
Each effect size (es) for three measures ("independence", "issue_owner", "party") is paired with a power calculation for a sample size of 1200 and of 2400. This is what the output I want would look like, based on the example above:
type es pwr value
1 independence 0.1 1200 0.080
2 issue_owner 0.1 1200 0.087
3 party 0.1 1200 0.081
4 independence 0.2 1200 0.273
5 issue_owner 0.2 1200 0.235
6 party 0.2 1200 0.157
7 independence 0.1 2400 0.163
8 issue_owner 0.1 2400 0.130
9 party 0.1 2400 0.102
10 independence 0.2 2400 0.513
11 issue_owner 0.2 2400 0.406
12 party 0.2 2400 0.267
or, with dput:
output <- structure(list(type = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L), .Label = c("independence", "issueowner",
"party"), class = "factor"), es = c(0.1, 0.1, 0.1, 0.2, 0.2,
0.2, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2), pwr = c(1200, 1200, 1200,
1200, 1200, 1200, 2400, 2400, 2400, 2400, 2400, 2400), value = c("0.080",
"0.087", "0.081", "0.273", "0.235", "0.157", "0.163", "0.130",
"0.102", "0.513", "0.406", "0.267")), out.attrs = list(dim = c(type = 3L,
es = 2L, pwr = 2L, value = 1L), dimnames = list(type = c("type=independence",
"type=issueowner", "type=party"), es = c("es=0.1", "es=0.2"),
pwr = c("pwr=1200", "pwr=2400"), value = "value=NA")), class = "data.frame", row.names = c(NA,
-12L))
As a start I tried experimenting with this:
example %>%
pivot_longer(cols = everything(),
names_pattern = "(es_[A-Za-z]+)(pwr_[A-Za-z]+_1200)(pwr_[A-Za-z]+_2400)",
# names_sep = "(?=\\d)_(?=\\d)",
names_to = c("es", "pwr_1200", "pwr_2400"),
values_to = "value")
But it did not work, so I tried it in two steps, which sort of works, but the "pairing" gets messed up:
example %>%
# pivot_longer(cols = everything(),
# names_pattern = "(es_[A-Za-z]+)(pwr_[A-Za-z]+_1200)(pwr_[A-Za-z]+_2400)",
# # names_sep = "(?=\\d)_(?=\\d)",
# names_to = c("es", "pwr_1200", "pwr_2400"),
# values_to = "value")
pivot_longer(cols = contains("pwr_"),
# names_pattern = "es_pwr(.*)1200_pwr(.*)2400",
names_sep = "_(?=\\d)",
names_to = c("pwr_type", "pwr_sample"), values_to = "value") %>%
pivot_longer(cols = contains("es_"),
# names_pattern = "es_pwr(.*)1200_pwr(.*)2400",
# names_sep = "_(?=\\d)",
names_to = "es_type", values_to = "es")
I would appreciate any help!
ANSWER
Answered 2022-Feb-03 at 10:59
library(tidyverse)
example %>%
pivot_longer(cols = starts_with("es"), names_to = "type", names_prefix = "es_", values_to = "es") %>%
pivot_longer(cols = starts_with("pwr"), names_to = "pwr", names_prefix = "pwr_") %>%
filter(substr(type, 1, 3) == substr(pwr, 1, 3)) %>%
mutate(pwr = parse_number(pwr)) %>%
arrange(pwr, es, type)
output
type es pwr value
1 independence 0.1 1200 0.08
2 issue_owner 0.1 1200 0.087
3 party 0.1 1200 0.081
4 independence 0.2 1200 0.273
5 issue_owner 0.2 1200 0.235
6 party 0.2 1200 0.157
7 independence 0.1 2400 0.163
8 issue_owner 0.1 2400 0.13
9 party 0.1 2400 0.102
10 independence 0.2 2400 0.513
11 issue_owner 0.2 2400 0.406
12 party 0.2 2400 0.267
QUESTION
Suppose I have the following 10 variables (num_var_1, num_var_2, num_var_3, num_var_4, num_var_5, factor_var_1, factor_var_2, factor_var_3, factor_var_4, factor_var_5):
set.seed(123)
num_var_1 <- rnorm(1000, 10, 1)
num_var_2 <- rnorm(1000, 10, 5)
num_var_3 <- rnorm(1000, 10, 10)
num_var_4 <- rnorm(1000, 10, 10)
num_var_5 <- rnorm(1000, 10, 10)
factor_1 <- c("A","B", "C")
factor_2 <- c("AA","BB", "CC")
factor_3 <- c("AAA","BBB", "CCC", "DDD")
factor_4 <- c("AAAA","BBBB", "CCCC", "DDDD", "EEEE")
factor_5 <- c("AAAAA","BBBBB", "CCCCC", "DDDDD", "EEEEE", "FFFFFF")
factor_var_1 <- as.factor(sample(factor_1, 1000, replace=TRUE, prob=c(0.3, 0.5, 0.2)))
factor_var_2 <- as.factor(sample(factor_2, 1000, replace=TRUE, prob=c(0.5, 0.3, 0.2)))
factor_var_3 <- as.factor(sample(factor_3, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.2, 0.1)))
factor_var_4 <- as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.1, 0.1, 0.1)))
factor_var_5 <- as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.3, 0.2, 0.1, 0.1, 0.1)))
id = 1:1000
my_data = data.frame(id,num_var_1, num_var_2, num_var_3, num_var_4, num_var_5, factor_var_1, factor_var_2, factor_var_3, factor_var_4, factor_var_5)
> head(my_data)
id num_var_1 num_var_2 num_var_3 num_var_4 num_var_5 factor_var_1 factor_var_2 factor_var_3 factor_var_4 factor_var_5
1 1 9.439524 5.021006 4.883963 8.496925 11.965498 B AA AAA CCCC AAAA
2 2 9.769823 4.800225 12.369379 6.722429 16.501132 B AA AAA AAAA AAAA
3 3 11.558708 9.910099 4.584108 -4.481653 16.710042 C AA BBB AAAA CCCC
4 4 10.070508 9.339124 22.192276 3.027154 -2.841578 B CC DDD BBBB AAAA
5 5 10.129288 -2.746714 11.741359 35.984902 -10.261096 B AA AAA DDDD DDDD
6 6 11.715065 15.202867 3.847317 9.625850 32.053261 B AA CCC BBBB EEEE
My Question: I am interested in selecting a random number of variables from this data - and taking random subsets from these variables. (And then repeating this process many times). For example - I would like to record such a randomly generated list:
Iteration 1: num_var_2 > 12, factor_var_1 = "A, C", factor_var_4 = "BBBB, DDDD, EEEE"
Iteration 2: num_var_1 >0, num_var_3 <10, factor_var_2 = "AA, BB, CC", factor_var_3 = "AAA", factor_var_5 = "CCCCC, DDDDD"
Iteration 3: num_var_2 <5, num_var_5 <10, factor_var_1 = "B", factor_var_3 = "AAA"
Iteration 4 : factor_var_4 = "BBBB"
etc.
I can perform the above manually, but this would take a long time (e.g. 10 iterations). Is there a way to automate this process and in the end, just output this kind of list (10 rows × 2 columns) :
Iteration Condition
1 num_var_2 > 12, factor_var_1 = A, C, factor_var_4 = BBBB, DDDD, EEEE
2 num_var_1 >0, num_var_3 <10, factor_var_2 = AA, BB, CC, factor_var_3 = AAA, factor_var_5 = CCCCC, DDDDD
3 num_var_2 <5, num_var_5 <10, factor_var_1 = B, factor_var_3 = AAA
4 factor_var_4 = BBBB
Can someone please show me how to do this?
ANSWER
Answered 2021-Dec-26 at 10:11
You may define a function FUN(n) that creates a data set as shown in the OP.
FUN <- function(n=1e3) {
num_var_1 <- rnorm(n, 10, 1)
num_var_2 <- rnorm(n, 10, 5)
num_var_3 <- rnorm(n, 10, 10)
num_var_4 <- rnorm(n, 10, 10)
num_var_5 <- rnorm(n, 10, 10)
factor_1 <- c("A", "B", "C")
factor_2 <- c("AA", "BB", "CC")
factor_3 <- c("AAA", "BBB", "CCC", "DDD")
factor_4 <- c("AAAA", "BBBB", "CCCC", "DDDD", "EEEE")
factor_5 <- c("AAAAA", "BBBBB", "CCCCC", "DDDDD", "EEEEE", "FFFFFF")
factor_var_1 <- as.factor(sample(factor_1, n, replace=TRUE,
prob=c(0.3, 0.5, 0.2)))
factor_var_2 <- as.factor(sample(factor_2, n, replace=TRUE,
prob=c(0.5, 0.3, 0.2)))
factor_var_3 <- as.factor(sample(factor_3, n, replace=TRUE,
prob=c(0.5, 0.2, 0.2, 0.1)))
factor_var_4 <- as.factor(sample(factor_4, n, replace=TRUE,
prob=c(0.5, 0.2, 0.1, 0.1, 0.1)))
factor_var_5 <- as.factor(sample(factor_5, n, replace=TRUE,
prob=c(0.3, 0.2, 0.1, 0.1, 0.1, .2)))
id <- 1:n
return(data.frame(id, num_var_1, num_var_2, num_var_3, num_var_4,
num_var_5, factor_var_1, factor_var_2, factor_var_3,
factor_var_4, factor_var_5))
}
Next, define (appropriate) expressions as strings in a list evl.
evl <- list(
c('num_var_2 > 12', 'factor_var_1 %in% c("A", "C")',
'factor_var_4 %in% c("BBBB", "DDDD", "EEEE")'),
c('num_var_1 > 0', 'num_var_3 < 10', 'factor_var_2 %in% c("AA", "BB", "CC")',
'factor_var_3 %in% "AAA"', 'factor_var_5 %in% c("CCCCC", "DDDDD")'),
c('num_var_2 < 5', 'num_var_5 < 10', 'factor_var_1 %in% "B"',
'factor_var_3 %in% "AAA"'),
c('factor_var_4 %in% "BBBB"')
)
Finally, in Map define a function that subsets the data of one replication according to the respective expressions using eval(parse(text=)). Use set.seed() outside the function to prevent the same data from being generated on each iteration.
set.seed(42)
result <- Map(\(x, y) x[with(x, eval(parse(text=paste(y, collapse=' & ')))), ],
replicate(length(evl), FUN(), simplify=FALSE),
evl)
Note: R version 4.1.2 (2021-11-01)
str(result)
# List of 4
# $ :'data.frame': 59 obs. of 11 variables:
# ..$ id : int [1:59] 3 6 25 29 32 34 52 54 58 93 ...
# ..$ num_var_1 : num [1:59] 9.99 10.95 9.38 8.53 9.65 ...
# ..$ num_var_2 : num [1:59] 13.6 17.4 20.3 19.3 16.1 ...
# ..$ num_var_3 : num [1:59] 9.42 18.67 6.1 25.71 -2.73 ...
# ..$ num_var_4 : num [1:59] 6.29 9.22 3.68 16.27 15.77 ...
# ..$ num_var_5 : num [1:59] 13.37 18.86 4.89 24.18 26.11 ...
# ..$ factor_var_1: Factor w/ 3 levels "A","B","C": 3 1 3 1 3 3 1 3 1 1 ...
# ..$ factor_var_2: Factor w/ 3 levels "AA","BB","CC": 3 3 1 1 1 2 3 3 1 3 ...
# ..$ factor_var_3: Factor w/ 4 levels "AAA","BBB","CCC",..: 1 1 2 1 1 4 2 1 3 2 ...
# ..$ factor_var_4: Factor w/ 5 levels "AAAA","BBBB",..: 5 2 2 2 2 2 5 2 4 4 ...
# ..$ factor_var_5: Factor w/ 6 levels "AAAAA","BBBBB",..: 3 5 2 3 5 4 4 6 1 6 ...
# $ :'data.frame': 53 obs. of 11 variables:
# ..$ id : int [1:53] 2 14 28 36 49 59 75 103 134 137 ...
# ..$ num_var_1 : num [1:53] 9.67 11.61 11.22 10.14 10.5 ...
# ..$ num_var_2 : num [1:53] 10.89 7.12 2.38 13.28 10.88 ...
# ..$ num_var_3 : num [1:53] 5.87 7.33 2.88 -10.78 4.09 ...
# ..$ num_var_4 : num [1:53] 19.239 6.261 -0.158 14.586 -0.544 ...
# ..$ num_var_5 : num [1:53] -5.1 21.04 2.81 1.76 27.19 ...
# ..$ factor_var_1: Factor w/ 3 levels "A","B","C": 1 1 1 2 3 2 3 3 2 3 ...
# ..$ factor_var_2: Factor w/ 3 levels "AA","BB","CC": 2 2 2 3 3 3 3 2 1 1 ...
# ..$ factor_var_3: Factor w/ 4 levels "AAA","BBB","CCC",..: 1 1 1 1 1 1 1 1 1 1 ...
# ..$ factor_var_4: Factor w/ 5 levels "AAAA","BBBB",..: 1 5 5 1 4 4 4 4 1 4 ...
# ..$ factor_var_5: Factor w/ 6 levels "AAAAA","BBBBB",..: 3 4 4 3 3 4 4 4 4 3 ...
# $ :'data.frame': 20 obs. of 11 variables:
# ..$ id : int [1:20] 3 44 91 181 222 233 241 287 293 302 ...
# ..$ num_var_1 : num [1:20] 12 10.26 9.65 8.48 12.1 ...
# ..$ num_var_2 : num [1:20] 3.68 3.61 3.28 4.01 1.78 ...
# ..$ num_var_3 : num [1:20] 4.113 -3.481 17.654 0.496 5.457 ...
# ..$ num_var_4 : num [1:20] 9.25 19.79 17.15 -4.72 22.16 ...
# ..$ num_var_5 : num [1:20] 6 8.49 4.31 4.67 1.96 ...
# ..$ factor_var_1: Factor w/ 3 levels "A","B","C": 2 2 2 2 2 2 2 2 2 2 ...
# ..$ factor_var_2: Factor w/ 3 levels "AA","BB","CC": 2 1 3 1 1 1 1 3 2 1 ...
# ..$ factor_var_3: Factor w/ 4 levels "AAA","BBB","CCC",..: 1 1 1 1 1 1 1 1 1 1 ...
# ..$ factor_var_4: Factor w/ 5 levels "AAAA","BBBB",..: 3 1 1 1 1 1 1 1 1 1 ...
# ..$ factor_var_5: Factor w/ 6 levels "AAAAA","BBBBB",..: 3 5 5 1 1 1 2 6 1 2 ...
# $ :'data.frame': 205 obs. of 11 variables:
# ..$ id : int [1:205] 7 10 23 24 27 29 31 33 38 40 ...
# ..$ num_var_1 : num [1:205] 10.23 9.78 8.92 10.16 9.93 ...
# ..$ num_var_2 : num [1:205] 23.49 13.06 12.17 16.88 7.93 ...
# ..$ num_var_3 : num [1:205] 6.33 9.33 14.04 21.66 28.56 ...
# ..$ num_var_4 : num [1:205] 16.33 -1.805 0.509 21.2 15.158 ...
# ..$ num_var_5 : num [1:205] 8.48 -1.31 5.03 15.07 19.48 ...
# ..$ factor_var_1: Factor w/ 3 levels "A","B","C": 1 1 2 1 2 1 2 2 3 2 ...
# ..$ factor_var_2: Factor w/ 3 levels "AA","BB","CC": 3 1 1 2 1 1 1 2 1 3 ...
# ..$ factor_var_3: Factor w/ 4 levels "AAA","BBB","CCC",..: 1 2 3 1 3 4 3 1 3 2 ...
# ..$ factor_var_4: Factor w/ 5 levels "AAAA","BBBB",..: 2 2 2 2 2 2 2 2 2 2 ...
# ..$ factor_var_5: Factor w/ 6 levels "AAAAA","BBBBB",..: 3 5 2 6 6 2 6 1 2 2 ...
QUESTION
I am trying to tidy up some data that is all contained in 1 column called "game_info" as a string. This data contains college basketball upcoming game data, with the Date, Time, Team IDs, Team Names, etc. Ideally each one of those would be their own column. I have tried separating with a space delimiter, but that has not worked well since there are teams such as "Duke" with 1 part to their name, and teams with 2 to 3 parts to their name (Michigan State, South Dakota State, etc). There also teams with "-" dashes in their name.
Here is my data:
df <- data.frame(list(
game_info = c(
"12/16 7:00 PM 751 Appalachian State 752 Duke",
"12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue",
"12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts",
"12/16 10:00 PM 757 Dartmouth 758 Stanford"
)
))
Desired output:
date time away_team_id away_team_name home_team_id home_team_name
12/16 7:00 PM 751 Appalachian State 752 Duke
12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue
12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts
12/16 10:00 PM 757 Dartmouth 758 Stanford
ANSWER
Answered 2021-Dec-16 at 15:25
Here's one with regex. See the regex101 link for the regex explanations.
regex <- "^(\\d{2}\\/\\d{2})\\s*(\\d{1,2}:\\d{2}\\s*(PM|AM))\\s*(\\d+)\\s*([^\\d.]+)(\\d+)\\s*([^\\d.]+)$"
data <- data.frame(game_info = c(
"12/16 7:00 PM 751 Appalachian State 752 Duke"
,"12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue"
,"12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts"
,"12/16 10:00 AM 757 Dartmouth 758 Stanford"
))
library(stringr)
# match against the character column, not the data frame itself
out <- do.call(rbind, str_match_all(data$game_info, regex))
out <- as.data.frame(out)
# remove full string & AM/PM
out$V1 <- NULL
out$V4 <- NULL
names(out) <- c("date", "time", "away_team_id", "away_team_name",
"home_team_id", "home_team_name")
# remove white space from end
out$away_team_name <- trimws(out$away_team_name)
out$home_team_name <- trimws(out$home_team_name)
out
Explanation:
^(\d{2}/\d{2}) - starts with 2 digits/2 digits like 12/16. ^ is a start anchor and () are used to say we want to capture this group for plucking out
\s* - 0 or more spaces between our first group and the next
(\d{1,2}:\d{2}\s*(PM|AM)) - want 1 or 2 digits : 2 digits, then possibly a space and PM or AM
\s*(\d+)\s* - spaces around any number of digits, the first id
([^\d.]+) - all non numeric characters. This will fall down if there are ever numbers in your team names. If so, find some examples and we can improve it. White space is captured afterwards so is removed later with trimws
(\d+)\s* - second id and spaces
([^\d.]+)$ - finally the other team name and the end sentence anchor
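The pattern itself is not tied to R. A minimal sketch applying it with Python's re module, with one deliberate change: the inner AM/PM group is made non-capturing, so no column has to be dropped afterwards. The names `pattern`, `fields` and `parse_game` are hypothetical, and the same caveat applies: team names containing digits or dots will not match.

```python
import re

# The answer's pattern as a Python raw string; (?:PM|AM) is non-capturing here.
pattern = re.compile(
    r"^(\d{2}/\d{2})\s*(\d{1,2}:\d{2}\s*(?:PM|AM))\s*(\d+)\s*"
    r"([^\d.]+)(\d+)\s*([^\d.]+)$"
)
fields = ["date", "time", "away_team_id", "away_team_name",
          "home_team_id", "home_team_name"]


def parse_game(line):
    """Split one game_info string into named fields, or return None if no match."""
    m = pattern.match(line)
    if m is None:
        return None
    # strip() removes the trailing space captured after the away team name
    return {name: value.strip() for name, value in zip(fields, m.groups())}


game = parse_game("12/16 7:00 PM 751 Appalachian State 752 Duke")
print(game)
```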
Community Discussions, Code Snippets contain sources that include Stack Exchange Network