QUESTION
Visualise missing values in a time series heatmap
Asked 2022-Mar-28 at 19:27
I am really new to big data analysis. Let's say I have a large dataset with the following features. I want to visualise the percentage of missing values (None values) of the fuel parameter for every id in a specific hour. I want to draw a chart where the x-axis is the time series (time column), the y-axis is the 'id', and the colour indicates the percentage of missing fuel values. I grouped the data based on 'id' and 'hour'.
I don't know how to visualise missing values in a good way for all ids. For example, if the percentage of missing fuel values for a specific id in a specific hour is 100%, then the colour at that time and for that 'id' can be gray. If the percentage of missing fuel values is 50%, the colour can be light green. If it is 0%, the colour can be dark green. The colour must be based on the percentage of missing fuel values, after grouping by id and time.
       id                 time  fuel
    0   1  2022-02-26 19:08:33   100
    2   1  2022-02-26 20:09:35  None
    3   2  2022-02-26 21:09:35    70
    4   3  2022-02-26 21:10:55    60
    5   4  2022-02-26 21:10:55  None
    6   5  2022-02-26 22:12:43    50
    7   6  2022-02-26 23:10:50  None
So for example, with the following code I computed the percentage of missing values for every hour for each id:
    # Column names aligned with the table above ('time' and 'fuel').
    df.set_index('time').groupby(['id', pd.Grouper(freq='H')])['fuel'].apply(lambda x: x.isnull().mean() * 100)
Is there any solution?
ANSWER
Answered 2022-Mar-25 at 09:39
There is no single right answer for visualizing missing values; I guess it depends on your use case and habits.
But first, to make it work, we need to preprocess your dataframe and make it analyzable, i.e. ensure its dtypes.
First, let's build our data:
    import pandas as pd
    from io import StringIO

    # Fields are tab-separated (written as \t) because the timestamps contain a space.
    csvfile = StringIO(
    """id\ttime\tfuel
    1\t2022-02-26 19:08:33\t100
    2\t2022-02-26 19:09:35\t70
    3\t2022-02-26 19:10:55\t60
    4\t2022-02-26 20:10:55\tNone
    5\t2022-02-26 21:12:43\t50
    6\t2022-02-26 22:10:50\tNone""")
    df = pd.read_csv(csvfile, sep='\t', engine='python')

    df
    Out[65]:
       id                 time  fuel
    0   1  2022-02-26 19:08:33   100
    1   2  2022-02-26 19:09:35    70
    2   3  2022-02-26 19:10:55    60
    3   4  2022-02-26 20:10:55  None
    4   5  2022-02-26 21:12:43    50
    5   6  2022-02-26 22:10:50  None
At this stage, almost all the data in our dataframe is stored as strings; you need to convert fuel and time to non-object dtypes.
    df.dtypes
    Out[66]:
    id       int64
    time    object
    fuel    object
    dtype: object
time should be converted to datetime, id to int and fuel to float. Indeed, None should be converted to np.nan for numeric values, which requires a float dtype.
With a map, we can easily change all 'None' values into np.nan. I won't go deeper here, but for simplicity's sake, I'll use a custom subclass of dict with a __missing__ implementation, so that any value not found in the mapping is returned unchanged:
    import numpy as np

    # The timestamps use dashes, so the format string must use dashes too.
    df.time = pd.to_datetime(df.time, format="%Y-%m-%d %H:%M:%S")

    class dict_with_missing(dict):
        # Series.map falls back to __missing__ for values absent from the dict.
        def __missing__(self, key):
            return key

    map_dict = dict_with_missing({'None': np.nan})
    df.fuel = df.fuel.map(map_dict).astype(np.float32)
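As an alternative to the dict subclass, the string-to-float conversion can probably be done in one call with `pd.to_numeric`. This is only a sketch, not the method used in the answer: the `fuel_raw` series below is a hypothetical stand-in for the `fuel` column, and note that `errors="coerce"` turns every unparseable string into NaN, not just `'None'`.

```python
import pandas as pd

# Hypothetical stand-in for the raw 'fuel' column as read from the file.
fuel_raw = pd.Series(["100", "70", "60", "None", "50", "None"], name="fuel")

# errors="coerce" replaces anything that is not a number with NaN.
fuel = pd.to_numeric(fuel_raw, errors="coerce")

print(fuel.isna().sum())  # number of missing fuel values
```

The trade-off is that a typo such as `"7O"` would silently become NaN as well, whereas the mapping approach would raise when casting to float.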
Then we have a clean dataframe:
    df
    Out[68]:
       id                time   fuel
    0   1 2022-02-26 19:08:33  100.0
    1   2 2022-02-26 19:09:35   70.0
    2   3 2022-02-26 19:10:55   60.0
    3   4 2022-02-26 20:10:55    NaN
    4   5 2022-02-26 21:12:43   50.0
    5   6 2022-02-26 22:10:50    NaN

    df.dtypes
    Out[69]:
    id               int64
    time    datetime64[ns]
    fuel           float32
    dtype: object
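If the data comes from a file, similar dtypes can likely be obtained directly at load time by telling `read_csv` which token marks a missing value and which column holds dates. A minimal sketch, assuming tab-separated input like the sample above; the variable names `raw` and `df_loaded` are illustrative, and `fuel` ends up as float64 rather than float32.

```python
from io import StringIO

import pandas as pd

# Same sample rows as above, tab-separated.
raw = StringIO(
    "id\ttime\tfuel\n"
    "1\t2022-02-26 19:08:33\t100\n"
    "2\t2022-02-26 19:09:35\t70\n"
    "3\t2022-02-26 19:10:55\t60\n"
    "4\t2022-02-26 20:10:55\tNone\n"
    "5\t2022-02-26 21:12:43\t50\n"
    "6\t2022-02-26 22:10:50\tNone\n"
)

# na_values marks 'None' as missing; parse_dates converts 'time' to datetime.
df_loaded = pd.read_csv(raw, sep="\t", na_values=["None"], parse_dates=["time"])

print(df_loaded.dtypes)
```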
Then, you can easily use bar, matrix or heatmap from the missingno module:
    import missingno as msno

    msno.bar(df)
    msno.matrix(df, sparkline=False)
    msno.heatmap(df, cmap="RdYlGn")
A side note: heatmap is not very useful here, since it shows how the missingness of one column correlates with that of the others, and you only have one column with missing values. For a bigger dataframe (around 5 or 6 columns with missing values) it can be useful.
For a quick and dirty check, you can also print the number of missing values (i.e. np.nan, in pandas/numpy terms) per column:
    df.isna().sum()
    Out[72]:
    id      0
    time    0
    fuel    2
    dtype: int64
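To get closer to the chart described in the question (ids on the y-axis, hours on the x-axis, colour showing the percentage of missing fuel values), one possible approach is to pivot the per-hour missing rate into an id-by-hour grid and draw it with matplotlib. This is a sketch under assumptions rather than part of the original answer: the helper `plot_missing_heatmap`, the `hour` and `missing` columns, and the three-colour map are all illustrative choices.

```python
import numpy as np
import pandas as pd

# Same six rows as the sample above, already typed (NaN marks a missing fuel value).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "time": pd.to_datetime([
        "2022-02-26 19:08:33", "2022-02-26 19:09:35", "2022-02-26 19:10:55",
        "2022-02-26 20:10:55", "2022-02-26 21:12:43", "2022-02-26 22:10:50",
    ]),
    "fuel": [100.0, 70.0, 60.0, np.nan, 50.0, np.nan],
})

# Truncate each timestamp to the start of its hour, then average the
# "is missing" flag per (id, hour) and scale it to a percentage.
df["hour"] = df["time"].dt.floor("h")
df["missing"] = df["fuel"].isna().astype(float)
missing_pct = df.pivot_table(
    index="id", columns="hour", values="missing", aggfunc="mean"
) * 100


def plot_missing_heatmap(pct):
    """Hypothetical helper: ids on the y-axis, hours on the x-axis, colour = % missing."""
    import matplotlib.pyplot as plt
    from matplotlib.colors import LinearSegmentedColormap

    # 0 % -> dark green, 50 % -> light green, 100 % -> gray, as asked in the question.
    cmap = LinearSegmentedColormap.from_list(
        "missing", ["darkgreen", "lightgreen", "gray"]
    )
    fig, ax = plt.subplots()
    image = ax.imshow(pct.to_numpy(), cmap=cmap, vmin=0, vmax=100, aspect="auto")
    ax.set_xticks(range(len(pct.columns)))
    ax.set_xticklabels([ts.strftime("%m-%d %H:00") for ts in pct.columns], rotation=45)
    ax.set_yticks(range(len(pct.index)))
    ax.set_yticklabels(list(pct.index))
    fig.colorbar(image, ax=ax, label="missing fuel (%)")
    return fig
```

Calling `plot_missing_heatmap(missing_pct)` and then `matplotlib.pyplot.show()` should draw the grid. Cells where an id has no rows at all in a given hour are NaN in `missing_pct` and are left blank, which keeps "no data" visually distinct from "0% missing".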
Community Discussions contain sources that include Stack Exchange Network