Explore all Big Data open source software, libraries, packages, source code, cloud functions and APIs.


Trending Discussions on Big Data

Visualise missing values in a time series heatmap

The minimum number of rows that sum to a given number in Python

pandas fill NA but not all based on recent past record

Delete and replace Nan values with mean of the rows in pandas dataframe

How to decode column value from rare label by matching column names

How do I copy a big database table to another in ABAP?

Remove all rows between two sentinel strings in a column using pandas (but not the sentinel strings)

Faster for loop with only if in python

determine the range of a value using a look up table

How to use multiprocessing in a chronical order?

QUESTION

Visualise missing values in a time series heatmap

Asked 2022-Mar-28 at 19:27

I am really new to big data analysis. Let's say I have big data with the following features. I want to visualise the percentage of missing values (None values) of the fuel parameter for every id in a specific hour. I want to draw a chart where the x-axis is the time series (time column), the y-axis is the 'id', and the colour indicates the missing-fuel percentage. I grouped the data based on 'id' and 'hour'.

I don't know how to visualise missing values in a good way for all ids. For example, if the percentage of missing fuel values for a specific id in a specific hour is 100%, then the colour for that time and that 'id' can be gray. If the percentage of missing values in fuel is 50%, the colour can be light green. If the percentage of missing values in fuel is 0%, the colour can be dark green. The colour must be based on the percentage of missing values in fuel, after grouping by id and time.

   id    time                   fuel
0  1     2022-02-26 19:08:33    100
1  1     2022-02-26 20:09:35    None
2  2     2022-02-26 21:09:35    70
3  3     2022-02-26 21:10:55    60
4  4     2022-02-26 21:10:55    None
5  5     2022-02-26 22:12:43    50
6  6     2022-02-26 23:10:50    None

So for example, in the following code I computed the percentage of missing values for every hour for a specific id:

df.set_index('time').groupby(['id', pd.Grouper(freq='H')])['fuel'].apply(lambda x: x.isnull().mean() * 100)

Is there any solution?

ANSWER

Answered 2022-Mar-25 at 09:39

There is no single right answer concerning missing-value visualization; I guess it depends on your uses, habits, etc.

But first, to make it work, we need to preprocess your dataframe and make it analyzable, i.e. ensure its dtypes.

First, let's build our data:

import pandas as pd
from io import StringIO

csvfile = StringIO(
"""id	time	fuel
1	2022-02-26 19:08:33	100
2	2022-02-26 19:09:35	70
3	2022-02-26 19:10:55	60
4	2022-02-26 20:10:55	None
5	2022-02-26 21:12:43	50
6	2022-02-26 22:10:50	None""")
df = pd.read_csv(csvfile, sep='\t', engine='python')

df
Out[65]: 
   id                 time  fuel
0   1  2022-02-26 19:08:33   100
1   2  2022-02-26 19:09:35    70
2   3  2022-02-26 19:10:55    60
3   4  2022-02-26 20:10:55  None
4   5  2022-02-26 21:12:43    50
5   6  2022-02-26 22:10:50  None

At this stage almost all the data in our dataframe is string-related; you need to convert fuel and time into non-object dtypes:

df.dtypes
Out[66]: 
id       int64
time    object
fuel    object
dtype: object

time should be converted to datetime, id to int and fuel to float. Indeed, None should be converted to np.nan for numeric values, which requires the float dtype.

With a map, we can easily change all 'None' values into np.nan. I won't go deeper here, but for simplicity's sake I'll use a custom subclass of dict with a __missing__ implementation:

import numpy as np

df.time = pd.to_datetime(df.time, format="%Y-%m-%d %H:%M:%S")

class dict_with_missing(dict):
    def __missing__(self, key):
        return key

map_dict = dict_with_missing({'None': np.nan})
df.fuel = df.fuel.map(map_dict).astype(np.float32)
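Why the subclass rather than a plain dict: Series.map with a plain dict sends every unmapped value to NaN, while a mapping that defines __missing__ (like collections.defaultdict, or the subclass above) passes unknown keys through unchanged. A minimal sketch:

```python
import numpy as np
import pandas as pd

class dict_with_missing(dict):
    def __missing__(self, key):
        return key  # unknown keys map to themselves

s = pd.Series(['100', 'None', '70'])

# A plain dict wipes out the real values too: '100' and '70'
# are not keys, so they become NaN along with 'None'.
all_nan = s.map({'None': np.nan})

# The subclass only rewrites 'None' and passes everything else through.
cleaned = s.map(dict_with_missing({'None': np.nan})).astype(np.float32)
```

Here `all_nan` is entirely NaN, while `cleaned` keeps 100.0 and 70.0 and only turns the literal string 'None' into NaN.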

Then we have a clean dataframe:

df
Out[68]: 
   id                time   fuel
0   1 2022-02-26 19:08:33  100.0
1   2 2022-02-26 19:09:35   70.0
2   3 2022-02-26 19:10:55   60.0
3   4 2022-02-26 20:10:55    NaN
4   5 2022-02-26 21:12:43   50.0
5   6 2022-02-26 22:10:50    NaN

df.dtypes
Out[69]: 
id               int64
time    datetime64[ns]
fuel           float32
dtype: object
61

Then you can easily use bar, matrix or heatmap from the missingno module (imported as msno):

import missingno as msno

msno.bar(df)
msno.matrix(df, sparkline=False)
msno.heatmap(df, cmap="RdYlGn")

A side note: heatmap is of limited use here, since it compares missingness between columns, and you only have one column with missing values. For a bigger dataframe (roughly 5-6 columns with missing values) it can be useful.

For a quick and dirty overview, you can also print the number of missing values (i.e. np.nan, in pandas/numpy terms):

df.isna().sum()
Out[72]: 
id      0
time    0
fuel    2
dtype: int64
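To get the per-id, per-hour percentages the question actually asks to plot, the groupby from the question can be reshaped into an id × hour grid with unstack, and that grid can then be fed to any heatmap plotter (e.g. matplotlib's imshow or seaborn.heatmap). A minimal pandas-only sketch, using the question's sample data (the plotting line is indicative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 2, 3, 4, 5, 6],
    'time': pd.to_datetime(['2022-02-26 19:08:33', '2022-02-26 20:09:35',
                            '2022-02-26 21:09:35', '2022-02-26 21:10:55',
                            '2022-02-26 21:10:55', '2022-02-26 22:12:43',
                            '2022-02-26 23:10:50']),
    'fuel': [100, np.nan, 70, 60, np.nan, 50, np.nan],
})

# percentage of missing fuel values per id per hour
pct = (df.set_index('time')
         .groupby(['id', pd.Grouper(freq='H')])['fuel']
         .apply(lambda x: x.isnull().mean() * 100))

# reshape to an id x hour grid; (id, hour) cells with no data stay NaN
grid = pct.unstack(level=1)

# e.g.: plt.imshow(grid.to_numpy(), cmap='Greens_r', vmin=0, vmax=100)
# dark green = 0% missing, lighter = more missing, blank = no data
```

Rows of `grid` are ids, columns are hour timestamps, and values are the percentage of missing fuel readings, exactly the three dimensions the asker wants on the chart.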

Source https://stackoverflow.com/questions/71610279

QUESTION

The minimum number of rows that sum to a given number in Python

Asked 2022-Mar-14 at 09:34

I have a DataFrame with cash, inflows and outflows. I need to create a feature survival that is the maximum number of periods the cash is enough to pay the projection of outflows (excluding the inflows from the computation).

Let's take an example from the table below.

(Again, the inflows do not count in this exercise).
In t=1, from the starting cash = 100, I can add the outflows -20, -50, -10, -10 and still have positive cash (100-20-50-10-10 = 10 > 0), while adding the outflow in t=5 would make the cash negative. So, since I can "survive" 4 periods in t=1, survival = 4.
In t=2, survival = 3, and so on.

As it is a big DataFrame, how can I do it efficiently with Pandas?

t  cash  outflow  inflow  survival
1   100      -20      10         4
2    90      -50      10         3
3    50      -10      80         2
4   120      -10      70       ...
5    40      -50      60       ...

ANSWER

Answered 2022-Mar-14 at 09:08

I would do it like this:

df['survival'] = [(cash + df.iloc[i:].outflow.cumsum() > 0).sum() for i, cash in enumerate(df.cash)]

Output:

   t  cash  outflow  survival
0  1   100      -20         4
1  2    90      -50         3
2  3    50      -10         2
3  4   120      -10         2
4  5    40      -50         0

Explanation: I loop over the cash values, also keeping track of the row number using enumerate. I use the row number to select only the portion of the dataframe from the current row downwards. On this portion I take a cumulative sum of the outflows and add it to the cash. This yields a series that turns negative once the cash is smaller than the sum of the outflows so far. Comparing it with > 0 gives True where it is positive and False where it is not. Summing the whole series counts each True as 1, so the result is the survival number you are looking for. Hope it makes sense.
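A self-contained version of the steps above, with the data taken from the question's table, can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({'t': [1, 2, 3, 4, 5],
                   'cash': [100, 90, 50, 120, 40],
                   'outflow': [-20, -50, -10, -10, -50]})

# For each row i: cumulative outflows from row i onward, added to that
# row's cash; count how many periods the running balance stays positive.
df['survival'] = [(cash + df.iloc[i:].outflow.cumsum() > 0).sum()
                  for i, cash in enumerate(df.cash)]
```

This reproduces the survival column 4, 3, 2, 2, 0 shown in the answer's output. Note the list comprehension is still O(n²) in the number of rows, which may matter on a truly big frame.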

Source https://stackoverflow.com/questions/71464759

QUESTION

pandas fill NA but not all based on recent past record

Asked 2022-Mar-01 at 13:19

I have a dataframe like as shown below

  stud_name act_qtr  year  yr_qty  qtr  mov_avg_full  mov_avg_2qtr_min_period
0       ABC      Q2  2014  2014Q2  NaN           NaN                      NaN
1       ABC      Q1  2016  2016Q1   Q1          13.0                     14.5
2       ABC      Q4  2016  2016Q4  NaN           NaN                      NaN
3       ABC      Q4  2017  2017Q4  NaN           NaN                      NaN
4       ABC      Q4  2020  2020Q4  NaN           NaN                      NaN

OP = pd.read_clipboard()

  stud_name qtr  year  t_score  p_score  yr_qty  mov_avg_full  mov_avg_2qtr_min_period
0       ABC  Q1  2014       10       11  2014Q1     10.000000                     10.0
1       ABC  Q1  2015       11       32  2015Q1     10.500000                     10.5
2       ABC  Q2  2015       13       45  2015Q2     11.333333                     12.0
3       ABC  Q3  2015       15       32  2015Q3     12.250000                     14.0
4       ABC  Q4  2015       17       21  2015Q4     13.200000                     16.0
5       ABC  Q1  2016       12       56  2016Q1     13.000000                     14.5
6       ABC  Q2  2017      312       87  2017Q2     55.714286                    162.0
7       ABC  Q3  2018       24       90  2018Q3     51.750000                    168.0

df = pd.read_clipboard()
21

I would like to fillna() based on the logic below:

For example, let's take stud_name = ABC. He has multiple NA records. Take his NA for 2020Q4. To fill it, we pick the latest record from df for stud_name = ABC before 2020Q4 (which is 2018Q3). Similarly, his other NA record is for 2014Q2; we pick the latest (prior) record from df for stud_name = ABC before 2014Q2 (which is 2014Q1). We need to sort based on the yr_qty values to get the latest (prior) record correctly.

We need to do this for each stud_name and for a big dataset

So, we fillna in mov_avg_full and mov_avg_2qtr_min_period

If there are no previous records to look at in df dataframe, leave NA as it is

I was trying something like the below, but it doesn't work and is incorrect:

Filled = OP.merge(df,on=['stud_name'],how='left')
filled.sort_values(['year','Qty'],inplace=True)
filled['mov_avg_full'].fillna(Filled.groupby('stud_name']['mov_avg_full'].shift())
filled['mov_avg_2qtr_min_period'].fillna(Filled.groupby('stud_name']['mov_avg_2qtr_min_period'].shift())
25

I expect my output to fill the NA values as described above (the expected output was posted as an image in the original question).

ANSWER

Answered 2022-Mar-01 at 13:19

In this case, you might want to concatenate vertically (stack the rows of df below OP) instead of merging horizontally. Then, after sorting the DataFrame by stud_name and yr_qty, you can group by stud_name and forward-fill.

Code:
import pandas as pd
import numpy as np

# Create the sample dataframes
op = pd.DataFrame({'stud_name': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC'}, 'act_qtr': {0: 'Q2', 1: 'Q1', 2: 'Q4', 3: 'Q4', 4: 'Q4'}, 'year': {0: 2014, 1: 2016, 2: 2016, 3: 2017, 4: 2020}, 'yr_qty': {0: '2014Q2', 1: '2016Q1', 2: '2016Q4', 3: '2017Q4', 4: '2020Q4'}, 'qtr': {0: np.nan, 1: 'Q1', 2: np.nan, 3: np.nan, 4: np.nan}, 'mov_avg_full': {0: np.nan, 1: 13.0, 2: np.nan, 3: np.nan, 4: np.nan}, 'mov_avg_2qtr_min_period': {0: np.nan, 1: 14.5, 2: np.nan, 3: np.nan, 4: np.nan}})
df = pd.DataFrame({'stud_name': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'ABC', 6: 'ABC', 7: 'ABC'}, 'qtr': {0: 'Q1', 1: 'Q1', 2: 'Q2', 3: 'Q3', 4: 'Q4', 5: 'Q1', 6: 'Q2', 7: 'Q3'}, 'year': {0: 2014, 1: 2015, 2: 2015, 3: 2015, 4: 2015, 5: 2016, 6: 2017, 7: 2018}, 't_score': {0: 10, 1: 11, 2: 13, 3: 15, 4: 17, 5: 12, 6: 312, 7: 24}, 'p_score': {0: 11, 1: 32, 2: 45, 3: 32, 4: 21, 5: 56, 6: 87, 7: 90}, 'yr_qty': {0: '2014Q1', 1: '2015Q1', 2: '2015Q2', 3: '2015Q3', 4: '2015Q4', 5: '2016Q1', 6: '2017Q2', 7: '2018Q3'}, 'mov_avg_full': {0: 10.0, 1: 10.5, 2: 11.333333, 3: 12.25, 4: 13.2, 5: 13.0, 6: 55.714286, 7: 51.75}, 'mov_avg_2qtr_min_period': {0: 10.0, 1: 10.5, 2: 12.0, 3: 14.0, 4: 16.0, 5: 14.5, 6: 162.0, 7: 168.0}})

# Stack df below op (DataFrame.append is removed in pandas 2.x, so use pd.concat)
dfa = pd.concat([op, df[['stud_name', 'yr_qty', 'mov_avg_full', 'mov_avg_2qtr_min_period']]])

# Sort before filling
dfa = dfa.sort_values(['stud_name', 'yr_qty'])

# Group by stud_name and forward-fill
dfa[['mov_avg_full', 'mov_avg_2qtr_min_period']] = dfa.groupby('stud_name')[['mov_avg_full', 'mov_avg_2qtr_min_period']].ffill()

# Extract the original rows of op and drop the helper column
dfa = dfa[dfa.act_qtr.notna()].drop('qtr', axis=1)

print(dfa)
45
Output:
  stud_name act_qtr  year  yr_qty  mov_avg_full  mov_avg_2qtr_min_period
0       ABC      Q2  2014  2014Q2       10.0000                     10.0
1       ABC      Q1  2016  2016Q1       13.0000                     14.5
2       ABC      Q4  2016  2016Q4       13.0000                     14.5
3       ABC      Q4  2017  2017Q4       55.7143                    162.0
4       ABC      Q4  2020  2020Q4       51.7500                    168.0
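An alternative sketch for this "latest prior record" lookup is pd.merge_asof, which performs exactly this kind of backward as-of join. It needs an ordered key, so the yr_qty strings are first parsed into quarterly Periods and converted to timestamps (the frames below are trimmed to mov_avg_full for brevity):

```python
import pandas as pd

op = pd.DataFrame({'stud_name': ['ABC'] * 5,
                   'yr_qty': ['2014Q2', '2016Q1', '2016Q4', '2017Q4', '2020Q4']})
df = pd.DataFrame({'stud_name': ['ABC'] * 8,
                   'yr_qty': ['2014Q1', '2015Q1', '2015Q2', '2015Q3',
                              '2015Q4', '2016Q1', '2017Q2', '2018Q3'],
                   'mov_avg_full': [10.0, 10.5, 11.333333, 12.25,
                                    13.2, 13.0, 55.714286, 51.75]})

# Strings like '2014Q2' parse directly into quarterly pandas Periods
for frame in (op, df):
    frame['ts'] = pd.PeriodIndex(frame['yr_qty'], freq='Q').to_timestamp()

# For each op row, take the last df row (per student) with ts <= op.ts
out = pd.merge_asof(op.sort_values('ts'), df.sort_values('ts'),
                    on='ts', by='stud_name', suffixes=('', '_prior'))
```

This avoids the concat/sort/ffill round-trip and scales well on big, sorted data; the trade-off is that the key must be sortable as numbers or timestamps, not plain strings.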

Source https://stackoverflow.com/questions/71305279

QUESTION

Delete and replace Nan values with mean of the rows in pandas dataframe

Asked 2022-Feb-08 at 22:54

I have a very big DataFrame that looks like:

    c1   c2    c3
0  NaN  1.0   NaN
1  NaN  NaN   NaN
2  3.0  6.0   9.0
3  NaN  7.0  10.0
...

I want to:

1- Delete the rows whose values are all NaN, like the second row in the sample.

2- Replace the NaN values in the other rows with the mean of the row.

Note: different rows have different numbers of NaN values. Could you please help me with that? Thanks.

Also, this link does not solve my question: Pandas Dataframe: Replacing NaN with row average

Here is a sample of my DataFrame:

import pandas as pd
import numpy as np


df = pd.DataFrame()
df['c1'] = [np.nan, np.nan, 3, np.nan]
df['c2'] = [1, np.nan, 6, 7]
df['c3'] = [np.nan, np.nan, 9, 10]

Update: the case where some columns (here id) should not be included in the row mean. Sample dataframe:

import pandas as pd
import numpy as np


df = pd.DataFrame()
df['id'] = [1, 2, 3, 4, 5]
df['c1'] = [np.nan, np.nan, 3, np.nan, 5]
df['c2'] = [1, np.nan, 3, 11, 5]
df['c3'] = [1, np.nan, 3, 11, np.nan]
df['c4'] = [3, np.nan, 3, 11, 5]

output: 
df = pd.DataFrame()
df['id'] = [1, 3, 4, 5]
df['c1'] = [5/3, 3, 11, 5]
df['c2'] = [1, 3, 11, 5]
df['c3'] = [1, 3, 11, 5]
df['c4'] = [3, 3, 11, 5]
df

For this part, I don't want the value of id to be considered when calculating the mean of the row.

ANSWER

Answered 2022-Feb-08 at 21:44

How about this:

df = df.T.fillna(df.mean(axis=1)).T.dropna()
print(df)

output:

    c1   c2    c3
0  1.0  1.0   1.0
2  3.0  6.0   9.0
3  8.5  7.0  10.0
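The double transpose works because DataFrame.fillna with a Series fills per column, aligning the Series index with the column labels; transposing turns each row into a column, so every original row gets filled with its own mean. A self-contained sketch on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [np.nan, np.nan, 3, np.nan],
                   'c2': [1, np.nan, 6, 7],
                   'c3': [np.nan, np.nan, 9, 10]})

row_means = df.mean(axis=1)     # indexed by row label: 1.0, NaN, 6.0, 8.5
out = df.T.fillna(row_means).T  # fill each row with its own mean
out = out.dropna()              # the all-NaN row keeps NaN (its mean is NaN) and is dropped
```

Note that the all-NaN row needs no special handling: its mean is NaN, so fillna leaves it untouched and dropna removes it in the same pass.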

Source https://stackoverflow.com/questions/71040598

QUESTION

How to decode column value from rare label by matching column names

Asked 2022-Feb-07 at 14:57

I have two dataframes like as shown below

import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cdf = pd.DataFrame({'Id':[1,2,3,4,5],
                   'grade': rng.choice(list('ACD'),size=(5)),
                   'dash': rng.choice(list('PQRS'),size=(5)),
                   'dumeel': rng.choice(list('QWER'),size=(5)),
                   'dumma': rng.choice((1234),size=(5)),
                   'target': rng.choice([0,1],size=(5))
})

tdf = pd.DataFrame({'Id': [1,1,1,1,3,3,3],
                   'feature': ['grade=Rare','dash=Q','dumma=rare','dumeel=R','dash=Rare','dumma=rare','grade=D'],
                   'value': [0.2,0.45,-0.32,0.56,1.3,1.5,3.7]})
16

My objective is to

a) Replace the Rare/rare values in the feature column of the tdf dataframe with the original value from the cdf dataframe.

b) To identify the original value, we can use the string before the = in Rare entries like =Rare, =rare, etc. That string is the column name in the cdf dataframe from which the original value can be looked up.

I was trying something like the below, but I am not sure how to go on from here:

replace_df = cdf.merge(tdf,how='inner',on='Id')
replace_df["replaced_feature"] = np.where(((replace_df["feature"].str.contains('rare',regex=True)]) & (replace_df["feature"].str.split('='))])
18

I have to apply this to big data, where I have a million rows and more than 1000 replacements to make like this.

I expect my output to decode the rare labels as described above (the expected output was posted as an image in the original question).

ANSWER

Answered 2022-Feb-07 at 14:40
# list comprehension to find where rare is in the feature col
tdf['feature'] = [x if y.lower()=='rare' else x+'='+y for x,y in tdf['feature'].str.split('=')]
# create a mask where feature is in columns of cdf
mask = tdf['feature'].isin(cdf.columns)
# use loc to filter your frame and use merge to join cdf on the id and feature column - after you use stack
tdf.loc[mask, 'feature'] = tdf.loc[mask, 'feature']+'='+tdf.loc[mask].merge(cdf.set_index('Id').stack().to_frame(),
                                                                            right_index=True, left_on=['Id', 'feature'])[0].astype(str)

   Id     feature  value
0   1     grade=D   0.20
1   1      dash=Q   0.45
2   1  dumma=1123  -0.32
3   1    dumeel=R   0.56
4   3      dash=P   1.30
5   3   dumma=849   1.50
6   3     grade=D   3.70
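The moving parts are easier to see on a tiny deterministic input (the values below are hypothetical, since the rng-generated frames above vary): stack turns cdf into a Series keyed by (Id, column name), and the merge looks each masked row up by its (Id, feature) pair:

```python
import pandas as pd

cdf = pd.DataFrame({'Id': [1, 3], 'grade': ['A', 'D'], 'dash': ['Q', 'P']})
tdf = pd.DataFrame({'Id': [1, 3], 'feature': ['grade=Rare', 'dash=rare']})

# strip the '=Rare' suffix so the feature name matches a cdf column
tdf['feature'] = [x if y.lower() == 'rare' else x + '=' + y
                  for x, y in tdf['feature'].str.split('=')]
mask = tdf['feature'].isin(cdf.columns)

# cdf stacked: a Series indexed by (Id, column name) -> original value
lookup = cdf.set_index('Id').stack().to_frame()

tdf.loc[mask, 'feature'] = (tdf.loc[mask, 'feature'] + '=' +
    tdf.loc[mask].merge(lookup, right_index=True,
                        left_on=['Id', 'feature'])[0].astype(str))
```

After the merge, 'grade=Rare' for Id 1 becomes 'grade=A' and 'dash=rare' for Id 3 becomes 'dash=P', i.e. each rare label is decoded from the matching cdf cell.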

Source https://stackoverflow.com/questions/71020053

QUESTION

How do I copy a big database table to another in ABAP?

Asked 2022-Feb-03 at 09:29

I want to copy one big database table to another. This is my current approach:

OPEN CURSOR WITH HOLD lv_db_cursor FOR
  SELECT * FROM zcustomers.

DO.
  REFRESH gt_custom.
  FETCH NEXT CURSOR lv_db_cursor
    INTO TABLE gt_custom
    PACKAGE SIZE lv_package_size.

  IF sy-subrc NE 0.
    CLOSE CURSOR lv_db_cursor.
    EXIT.
  ENDIF.

  INSERT zcustomers1 FROM TABLE gt_custom.

  " Write code to modify your custom table from gt_custom.
ENDDO.
19

But the problem is that I get the error "ASE has run out of LOCKS". I tried to use a COMMIT statement after inserting each batch of records, but it closes the cursor. I don't want to increase the maximum number of locks in the database settings or make the copy at the database level. I want to understand how I can copy with the best performance and low memory usage in ABAP... Thank you.

ANSWER

Answered 2022-Jan-20 at 12:45

You can also "copy on database level" from within ABAP SQL using a combined INSERT and SELECT:

INSERT zcustomers1 FROM ( SELECT * FROM zcustomers ).

Unlike the other solution, this runs in one single transaction (no inconsistency on the database) and avoids moving the data between the database and the ABAP server, so it should be faster by orders of magnitude. However, like the code in the question, it might still run into database limits due to opening many locks during the insert (though it avoids other problems). That should be solved on the database side and is not a limitation of ABAP.

Source https://stackoverflow.com/questions/70763128

QUESTION

Remove all rows between two sentinel strings in a column using pandas (but not the sentinel strings)

Asked 2022-Jan-21 at 06:47

Given a big dataframe, how can I remove everything between two specific strings, here "start" and "end"? Yesterday I asked a similar question about removing the two sentinel strings as well (I post my solution for that below). Now I would like to see how I can also keep the sentinel strings. I have been trying to get the next row using itertuples, but I still haven't figured out how to do it.

df1:

   c1     c2
0   1      1
1   2  start
2   3      3
3   4    end
4   5      5
5   6  start
6   7    end
7   8      0

Code to remove the sentinel strings and everything in between:

df = pd.DataFrame(
    {'c1': [1, 2, 3, 4, 5, 6, 7, 8],
     'c2': ['1', 'start', '3', 'end',
            '5', 'start', 'end', 0]})

df2 = copy.copy(df)
flag = False
list_a = []

for j in df.itertuples():
    if j.c2 == 'start':
        flag = True
        list_a.append(j)
    elif j.c2 == 'end':
        flag = False
        list_a.append(j)
    elif flag:
        list_a.append(j)

list_a = tuple(list_a)
to_remove_df = pd.DataFrame(list_a, columns=['index', 'c1', 'c2'])
to_remove_df = to_remove_df["c2"]
removed_df = pd.merge(df, to_remove_df, on=["c2"], how="outer", indicator=True).query('_merge != "both"').drop('_merge', axis=1)

Output (removed_df):

   c1 c2
0   1  1
4   5  5
7   8  0

Wanted output:

   c1     c2
0   1      1
1   2  start
3   4    end
4   5      5
5   6  start
6   7    end
7   8      0

ANSWER

Answered 2022-Jan-20 at 15:08

You just need to not add the rows at the start and end locations to list_a.

Change this:

for j in df.itertuples():
    if j.c2 == 'start':
        flag = True
        list_a.append(j)
    elif j.c2 == 'end':
        flag = False
        list_a.append(j)
    elif flag:
        list_a.append(j)

to this:

for j in df.itertuples():
    if j.c2 == 'start':
        flag = True
    elif j.c2 == 'end':
        flag = False
    elif flag:
        list_a.append(j)
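As an alternative sketch, the whole loop and merge can be replaced by a vectorized mask built from cumulative counts of the sentinels; this assumes every 'start' is eventually followed by a matching 'end':

```python
import pandas as pd

df = pd.DataFrame(
    {'c1': [1, 2, 3, 4, 5, 6, 7, 8],
     'c2': ['1', 'start', '3', 'end', '5', 'start', 'end', 0]})

is_start = df['c2'].eq('start')
is_end = df['c2'].eq('end')
# a row is strictly between a 'start' and its matching 'end' when the
# running count of starts exceeds the running count of ends and the row
# itself is not a 'start'
inside = (is_start.cumsum() - is_end.cumsum()).gt(0) & ~is_start
result = df[~inside]
```

This keeps both sentinels and drops only the rows between them, matching the wanted output above.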

Source https://stackoverflow.com/questions/70788404

QUESTION

Faster for loop with only if in python

Asked 2021-Dec-23 at 07:22

I'm dealing with a big dataset and basically want to do this:

import numpy as np

test = np.random.rand(int(1e7))-0.5
def test0(test):
    return [0 if c<0 else c for c in test]

which is doing this:

def test1(test):
    for i,dat in enumerate(test):
        if dat<0:
            test[i] = 0
        else:
            test[i] = dat
    return test

Is there a way to modify test0 to skip the else branch so it works like this:

def test1(test):
    for i,dat in enumerate(test):
        if dat<0: test[i] = 0
    return test

Thanks in advance!

ANSWER

Answered 2021-Dec-23 at 07:22

You could try

np.maximum(test, 0)

But np.where is the fastest on my machine:


https://gist.github.com/axil/af6c4adb8c5634ff39ed9f3da1efaa90

Actually it depends on the number of negative values in the array:


https://gist.github.com/axil/ce4ecdf1cb0446db47b979c37ed5fba3

Results:
    – where is the fastest in most cases and is the only one with the flat curve
    – putmask is #2
    – where is only faster than the others when there's almost nothing to be done (≤10%)
    – maximum and clip are (surprisingly) slower than the others in the whole range and obviously share the implementation.

The size of the array generally does not matter: https://gist.github.com/axil/2241e62977f46753caac7005268d5b28
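The vectorized variants compared in the benchmarks above can be sketched side by side; all clamp negatives to zero and produce identical results (array size reduced here for illustration):

```python
import numpy as np

test = np.random.rand(1000) - 0.5    # smaller array for illustration

a = np.where(test < 0, 0, test)      # fastest in most cases per the plots
b = np.maximum(test, 0)
c = test.clip(min=0)
d = test.copy()
np.putmask(d, d < 0, 0)              # in-place variant

# all four variants agree
assert np.array_equal(a, b) and np.array_equal(b, c) and np.array_equal(c, d)
```

Unlike the list comprehension in test0, every variant here returns a numpy array, so downstream vectorized code keeps working without a conversion.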

Source https://stackoverflow.com/questions/70447547

QUESTION

determine the range of a value using a look up table

Asked 2021-Oct-06 at 08:46

I have a df with numbers:

numbers = pd.DataFrame(columns=['number'], data=[
    50,
    65,
    75,
    85,
    90
])

and a df with ranges (look up table):

ranges = pd.DataFrame(
    columns=['range','range_min','range_max'],
    data=[
        ['A',90,100],
        ['B',85,95],
        ['C',70,80]
    ]
)

I want to determine which range (in the second table) each value (in the first table) falls in. Please note that the ranges overlap and the limits are inclusive. Also note that the vanilla dataframe above has 3 ranges, but this dataframe is generated dynamically; it could have anywhere from 2 to 7 ranges.

Desired result:

numbers = pd.DataFrame(columns=['number','detected_range'], data=[
    [50,'out_of_range'],
    [65,'out_of_range'],
    [75,'C'],
    [85,'B'],
    [90,'overlap']  # could be A or B
])

I solved this with a for loop, but it doesn't scale well to the big dataset I am using, and the code is long and inelegant. See below:

numbers['detected_range'] = nan
for i, row1 in numbers.iterrows():
    for j, row2 in ranges.iterrows():
        if row1.number<row2.range_min and row1.number>row2.range_max:
             numbers.loc[i,'detected_range'] = row1.loc[j,'range']
        else if (other cases...):
              ...and so on...

How could I do this?

ANSWER

Answered 2021-Oct-06 at 08:46

You can use a bit of numpy vectorized operations to generate masks, and use them to select your labels:

import numpy as np

a = numbers['number'].values   # numpy array of numbers
r = ranges.set_index('range')  # dataframe of min/max with labels as index

m1 = (a>=r['range_min'].values[:,None]).T  # is number above each min
m2 = (a<r['range_max'].values[:,None]).T   # is number below each max
m3 = (m1&m2)                               # combine both conditions above
# NB. the two operations could be done without the intermediate variables m1/m2

m4 = m3.sum(1)                             # how many matches?
                                           # 0 -> out_of_range
                                           # 2 -> overlap
                                           # 1 -> get column name

# now we select the label according to the conditions
numbers['detected_range'] = np.select([m4==0, m4==2], # out_of_range and overlap
                                      ['out_of_range', 'overlap'],
                                      # otherwise get column name
                                      default=np.take(r.index, m3.argmax(1))
                                     )

output:

   number detected_range
0      50   out_of_range
1      65   out_of_range
2      75              C
3      85              B
4      90        overlap
edit:

It works with any number of intervals in ranges.

Example output with an extra ['D',50,51] range:

   number detected_range
0      50              D
1      65   out_of_range
2      75              C
3      85              B
4      90        overlap
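For reference, a self-contained runnable sketch of the same approach, with <= on the upper bound so that both limits are inclusive as the question requires (the data is the question's sample):

```python
import numpy as np
import pandas as pd

numbers = pd.DataFrame({'number': [50, 65, 75, 85, 90]})
ranges = pd.DataFrame({'range': ['A', 'B', 'C'],
                       'range_min': [90, 85, 70],
                       'range_max': [100, 95, 80]})

a = numbers['number'].to_numpy()
r = ranges.set_index('range')

# one row per number, one column per range: True where the number is inside
m = ((a >= r['range_min'].to_numpy()[:, None]) &
     (a <= r['range_max'].to_numpy()[:, None])).T
hits = m.sum(1)          # 0 -> out_of_range, 2+ -> overlap, 1 -> range label

numbers['detected_range'] = np.select(
    [hits == 0, hits >= 2],
    ['out_of_range', 'overlap'],
    default=np.take(r.index, m.argmax(1)))
```

Using hits >= 2 (rather than == 2) also covers the dynamic case where three or more of the 2-7 ranges overlap at the same value.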

Source https://stackoverflow.com/questions/69462119

QUESTION

How to use multiprocessing in chronological order?

Asked 2021-Sep-07 at 12:23

I have a python script which exists of 2 processes:

  1. process 1: Loading and unzipping files
  2. process 2: Processing the files, doing some stuff with it.

Before implementing multiprocessing the software seemed to do its work in chronological order. Loading all the zipped files, unzipping them, then opening them to do some stuff with it.

So I brought multiprocessing into the game, and now it seems that while the files are being loaded and unzipped, the process of opening and working on them has already started. So there are multiple processes doing things at the same time. The problem is that when I run this code on big data (more than 100 files) I get problems with concurrent file access. This results in PermissionError: [WinError 32] The process cannot access the file because it is being used by another process. When I run the snippet on a small data set (approx. 30 files) it seems to go okay, because the files are unzipped fast enough just before process 2 starts.

What I want: I want to keep the multiprocessing because it is speeding up things but I want process 2 only to start if all the files have been unzipped (e.g. process 1 is done).

This is my snippet:

import os
import csv
import collections
import datetime
import zipfile
import re
import shutil
import fnmatch
from pathlib import Path
import ntpath
import configparser
from multiprocessing import Pool

def generate_file_lists():
    # Change the following line to a real path
    data_files = 'c:\desktop\DataEnergy'
    pattern = '*.zip'
    last_root = None
    args = []
    for root, dirs, files in os.walk(data_files):
        for filename in fnmatch.filter(files, pattern):
            if root != last_root:
                last_root = root
                if args:
                    yield args
                    args = []
            args.append((root, filename))
    if args:
        yield args

def unzip(file_list):
    """
    file_list is a list of (root, filename) tuples where
    root is the same for all tuples.
    """
    # Change the following line to a real path:
    counter_part = 'c:\desktop\CounterPart'
    for root, filename in file_list:
        path = os.path.join(root, filename)
        date_zipped_file_s = re.search('-(.\d+)-', filename).group(1)
        date_zipped_file = datetime.datetime.strptime(date_zipped_file_s, '%Y%m%d').date()

        #Create the new directory location
        new_dir = os.path.normpath(os.path.join(os.path.relpath(path, start='c:\desktop\DataEnergy'), ".."))

        #Join the directory names counter_part and create their paths.
        new = os.path.join(counter_part, new_dir)

        #Create the directories
        if (not os.path.exists(new)):
            os.makedirs(new)
        zipfile.ZipFile(path).extractall(new)

        #Get all the unzipped files
        files = os.listdir(new)

        #Rename all the files in the created directories
        for file in files:
            filesplit = os.path.splitext(os.path.basename(file))
            if not re.search(r'_\d{8}.', file):
                os.rename(os.path.join(new, file), os.path.join(new, filesplit[0]+'_'+date_zipped_file_s+filesplit[1]))

# Required for Windows:
if __name__ == '__main__':
    pool = Pool(13)
    pool.map(unzip, generate_file_lists())
    print('the files have been unzipped!')


#Start process 2
all_missing_dates = ['20210701', '20210702']
missing_dates = [datetime.datetime.strptime(i, "%Y%m%d").date() for i in all_missing_dates]

dates_to_process = []
root = Path('.\middle_stage').resolve()

at_set = {'Audi', 'Mercedes', 'Volkswagen'}

#Only read the rows if they fulfill the following conditions.
def filter_row(r, missing_date):
    if set(r).intersection(at_set):   # note: was `set(row)`, but the parameter is `r`
        if len(r) > 24 and r[24].isdigit():
            aantal_pplgs = int(r[24])
            date_time = datetime.datetime.fromisoformat(r[0]) if len(r) > 3 else True
            condition_3 = date_time.date() == missing_date if len(r) > 3 else True

            return condition_3
    return False

#Open the files and read the rows
print("Start reading data")
data_per_date = dict()
for missing_date in missing_dates:
    print("\tReading missing date: ", missing_date)
    files = [fn for fn in (e for e in root.glob(f"**/*_{missing_date:%Y%m%d}.txt") if e.is_file())]
    if len(files) != 13:
        continue
    dates_to_process.append(missing_date)

    vehicle_loc_dict = collections.defaultdict(list)
    for file in files:
        with open(file, 'r') as log_file:
            reader = csv.reader(log_file, delimiter=',')
            next(reader) # skip header
            for row in reader:
                if filter_row(row, missing_date):
                    print('filter_row has been executed!')

    data_per_date[missing_date] = vehicle_loc_dict

ANSWER

Answered 2021-Sep-07 at 09:01
Main Thread

In the main thread we need to set up the queues and add the zipped files to the zipped queue:

import threading
import queue

zippedQueue = queue.Queue()
unzippedQueue = queue.Queue()
zippedLock = threading.Lock()

# add all your zipped files to the zippedQueue
for file in files:
    zippedQueue.put(file)
Worker Thread
class ChiSquaredThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        unzipFile()

def unzipFile():
    zippedLock.acquire()
    if not zippedQueue.empty():
        zipped = zippedQueue.get()
        zippedLock.release()
        # unzip the zipped file here
        # add the unzipped file to the queue
        unzippedQueue.put(unzippedFile)
    else:
        zippedLock.release()

Then create a similar block to the worker thread block above that takes files from the unzippedQueue and processes them in the same fashion. The sample worker thread block should guide you on that.
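The overall pattern can be condensed into a minimal, self-contained sketch (hypothetical names, threads standing in for processes): phase-1 workers drain a queue of inputs, and phase 2 starts only after join() confirms every phase-1 task has finished.

```python
import queue
import threading

def two_phase(items, work, n_workers=4):
    """Run work() on every item concurrently (phase 1) and return the
    results only once all workers have finished, so phase-2 code that
    consumes the results cannot start early."""
    in_q = queue.Queue()
    results = []
    results_lock = threading.Lock()

    for item in items:
        in_q.put(item)

    def worker():
        while True:
            try:
                item = in_q.get_nowait()
            except queue.Empty:
                return                 # queue drained: this worker is done
            r = work(item)
            with results_lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                       # blocks until phase 1 is completely done
    return results

# hypothetical usage with the question's code:
# unzipped = two_phase(zip_file_paths, unzip_one_file)
```

Results arrive in completion order, not input order; sort them afterwards if the order matters.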

Source https://stackoverflow.com/questions/69084919

Community Discussions contain sources that include Stack Exchange Network
