Explore all Data Processing open source software, libraries, packages, source code, cloud functions and APIs.

Popular New Releases in Data Processing

mackup

sharelist (v0.3.15)

gnome-shell-extension-gsconnect (v53)

gnome-shell-extension-gsconnect (v44)

git-sync (v3.5.0)

Popular Libraries in Data Processing

mackup
by lra · Python · 11523 stars · GPL-3.0
Keep your application settings in sync (OS X/Linux)

cnpmjs.org
by cnpm · JavaScript · 3568 stars · NOASSERTION
Private npm registry and web for Enterprise

go-mysql-elasticsearch
by go-mysql-org · Go · 3488 stars · MIT
Sync MySQL data into Elasticsearch

sharelist
by reruin · JavaScript · 2308 stars · MIT
Quickly share files from Google Drive and OneDrive

gnome-shell-extension-gsconnect
by GSConnect · JavaScript · 2265 stars · GPL-2.0
KDE Connect implementation for GNOME

geeknote
by VitaliyRodnenko · Python · 2084 stars
Console client for Evernote.

gitfs
by presslabs · Python · 1699 stars · Apache-2.0
Version controlled file system

gnome-shell-extension-gsconnect
by andyholmes · JavaScript · 1675 stars · GPL-2.0
KDE Connect implementation for GNOME

syncserver
by mozilla-services · Python · 1586 stars · MPL-2.0
Run-Your-Own Firefox Sync Server

Trending New libraries in Data Processing

redissyncer-server
by TraceNature · Java · 501 stars · Apache-2.0
RedisSyncer is a multi-task Redis data synchronization tool that flexibly covers synchronization and migration needs between Redis instances, supporting both single-instance and cluster setups.

nebula
by hubastard · C# · 473 stars · GPL-3.0
A multiplayer mod for the game Dyson Sphere Program

jellyfin-android
by jellyfin · Kotlin · 379 stars · GPL-2.0
Android Client for Jellyfin

rmfakecloud
by ddvk · Go · 338 stars · AGPL-3.0
Host your own cloud for the reMarkable

icloud-drive-docker
by mandarons · Python · 208 stars · BSD-3-Clause
Dockerized iCloud (drive and photos)

JustList
by txperl · Python · 175 stars · MIT
A multi-user file listing tool that supports Tianyi Cloud Drive, OneDrive, and local directory indexing

convtools
by iTechArt · Python · 173 stars · MIT
convtools is a Python library to declaratively define conversions for processing collections, doing complex aggregations and joins.

local-first
by jaredly · JavaScript · 162 stars
Data syncing, storage, and collaboration. That works.

anki-devops-services
by ankicommunity · Python · 137 stars · GPL-3.0
Anki Sync Server with Docker - and it works!

Top Authors in Data Processing

1. mozilla-services · 11 Libraries · 2178 stars

2. kindy · 4 Libraries · 170 stars

3. grrrr · 4 Libraries · 271 stars

4. agoragames · 3 Libraries · 260 stars

5. jellyfin · 3 Libraries · 467 stars

6. 39aldo39 · 3 Libraries · 195 stars

7. LedgerSync · 3 Libraries · 44 stars

8. putdotio · 3 Libraries · 66 stars

9. linonetwo · 2 Libraries · 42 stars

10. dearplain · 2 Libraries · 7 stars


Trending Kits in Data Processing

No Trending Kits are available at this moment for Data Processing

Trending Discussions on Data Processing

Create dictionary from the position of elements in nested lists

Flutter How to show splash screen only while loading flutter data

Why can Java HotSpot not optimize an array instance after one-time resizing (leading to massive performance loss)?

How to predict actual future values after testing the trained LSTM model?

How to deploy sagemaker.workflow.pipeline.Pipeline?

How reproducible / deterministic is Parquet format?

Why does KNeighborsClassifier always predict the same number?

Is there a way to accomplish multithreading or parallel processes in a batch file?

Simple calculation for all combination (brute force) of elements in two arrays, for better performance, in Julia

Remove strange character from tokenization array

QUESTION

Create dictionary from the position of elements in nested lists

Asked 2022-Feb-27 at 15:36

I want to create a dictionary using the position of elements in each list of lists. The order of each nested list is very important and must remain the same.

Original nested lists and desired dictionary keys:

L_original = [[1, 1, 3], [2, 3, 8]]
keys = ["POS1", "POS2", "POS3"]

Desired dictionary created from L_original:

L_original = [[1, 1, 3], [2, 3, 8]]
keys = ["POS1", "POS2", "POS3"]
L_dictionary = {"POS1": [1, 2], "POS2": [1, 3], "POS3": [3, 8]}

The code I have so far fails the conditionals and ends on the else statement for each iteration.

L_original = [[1, 1, 3], [2, 3, 8]]
keys = ["POS1", "POS2", "POS3"]
L_dictionary = {}
for i in L_original:
    for key, value in enumerate(i):
        if key == 0:
            L_dictionary[keys[0]] = value
        if key == 1:
            L_dictionary[keys[1]] = value
        if key == 2:
            L_dictionary[keys[2]] = value
        else:
            print(f"Error in positional data processing...{key}: {value} in {i}")

ANSWER

Answered 2022-Feb-27 at 14:42

I believe there are cleaner ways to solve this with some fancy Python API, but one of the straightforward solutions might be the following:

For each key in keys, we take from each of L_original's nested lists the number at the same index as the key, namely idx:

L_original = [[1, 1, 3], [2, 3, 8]]
keys = ["POS1", "POS2", "POS3"]
L_dictionary = {}

for (idx, key) in enumerate(keys):
    L_dictionary[key] = []
    for items in L_original:
        L_dictionary[key].append(items[idx])
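As for the "fancy Python API" hinted at above, a compact sketch using zip and a dict comprehension (my own illustration, not part of the original answer) could look like this:

L_original = [[1, 1, 3], [2, 3, 8]]
keys = ["POS1", "POS2", "POS3"]

# zip(*L_original) groups the values that share the same position across the nested lists
L_dictionary = {key: list(column) for key, column in zip(keys, zip(*L_original))}
print(L_dictionary)  # {'POS1': [1, 2], 'POS2': [1, 3], 'POS3': [3, 8]}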

Your code goes to the else because that else is attached only to the if key == 2 check, not to the whole chain of ifs. So if the key is, for example, 0, the flow goes to the else, because 0 != 2. To fix this, the second and subsequent ifs should be replaced with elif, which relates the else to the whole chain:

if key == 0:
    # only when key is 0
    L_dictionary[keys[0]] = value
elif key == 1:
    # only when key is 1
    L_dictionary[keys[1]] = value
elif key == 2:
    # only when key is 2
    L_dictionary[keys[2]] = value
else:
    # otherwise (not 0, not 1, not 2)
    print(f"Error in positional data processing...{key}: {value} in {i}")

Source https://stackoverflow.com/questions/71285445

QUESTION

Flutter How to show splash screen only while loading flutter data

Asked 2022-Feb-24 at 12:53

While the app's splash screen is displayed, it needs to download files from an FTP server and process the data. I implemented the splash screen in Flutter like this:

class Home extends StatelessWidget {
  const Home({Key? key}) : super(key: key);

  @override
  Widget build(BuildContext context) {

    return FutureBuilder(
      future: Future.delayed(Duration(seconds: 3)),
      builder: (BuildContext context, AsyncSnapshot snapshot){
        if(snapshot.connectionState == ConnectionState.waiting)
          return SplashUI();    ///Splash Screen
        else
          return MainUI();       ///Main Screen
      },
    );
  }
}

Right now, with Future.delayed, the splash screen is shown for a fixed 3 seconds, during which the file is downloaded from FTP and the data is processed. I want to keep the splash screen visible until data processing completes, rather than for a fixed duration.

Splash Screen

Widget _splashUI(Size size){
    return SafeArea(
      child: Center(
        child: Container(
          width: size.width * 0.5,
          height: size.height * 0.1,
          child: Image(
            fit: BoxFit.fill,
            image: AssetImage('assets/images/elf_logo.png'),
          ),
        ),
      ),
    );
  }

 Widget build(BuildContext context) {

 getFtpFile();
 dataProgress();

 return Platform.isAndroid ?
    MaterialApp(
      debugShowCheckedModeBanner: false,
      home: Scaffold(
        body: _splashUI(_size),
      ),
    ) :
    CupertinoApp(
      debugShowCheckedModeBanner: false,
      home: CupertinoPageScaffold(
        child: _splashUI(_size),
      ),
    );
 }

I want to know how to keep the splash screen up while the data is being processed, rather than driving it with Future.delayed. Thank you.

ANSWER

Answered 2022-Feb-24 at 02:35

You could do what others have done in the past: make both of your methods, getFTPFile and dataProgress, return a Future, then wait for both Futures using Future.wait, as in this answer: https://stackoverflow.com/a/54465973/871364

Future.wait([
  getFTPFile(),
  dataProgress(),
]).then((_) {
  // once all Futures have completed, navigate to another page here
});

Source https://stackoverflow.com/questions/71246161

QUESTION

Why can Java HotSpot not optimize an array instance after one-time resizing (leading to massive performance loss)?

Asked 2022-Feb-04 at 18:19
Question

Why is the use of fBuffer1 in the attached code example (SELECT_QUICK = true) twice as fast as the other variant, where fBuffer2 is resized only once at the beginning (SELECT_QUICK = false)?

The code path is absolutely identical, but even after 10 minutes the throughput of fBuffer2 does not rise to the level of fBuffer1.

Background:

We have a generic data processing framework that collects thousands of Java primitive values in different subclasses (one subclass for each primitive type). These values are stored internally in arrays, which we originally sized sufficiently large. To save heap memory, we have now switched these arrays to dynamic resizing (arrays grow only if needed). As expected, this change has massively reduced the heap memory. On the other hand, however, the performance has unfortunately degraded significantly. Our processing jobs now take 2-3 times longer than before (e.g. 6 min instead of 2 min).

I have reduced our problem to a minimum working example and attached it. You can choose with SELECT_QUICK which buffer should be used. I see the same effect with jdk-1.8.0_202-x64 as well as with openjdk-17.0.1-x64.

Buffer 1 (which is not resized) shows the following numbers:
duration buf1: 8,890.551ms (8.9s)
duration buf1: 8,339.755ms (8.3s)
duration buf1: 8,620.633ms (8.6s)
duration buf1: 8,682.809ms (8.7s)
...
Buffer 2 (which is resized exactly once at the beginning) shows the following numbers:
make buffer 2 larger
duration buf2 (resized): 19,542.750ms (19.5s)
duration buf2 (resized): 22,423.529ms (22.4s)
duration buf2 (resized): 22,413.364ms (22.4s)
duration buf2 (resized): 22,219.383ms (22.2s)
...

I would really appreciate some hints on how I can change the code so that fBuffer2 (after resizing) works as fast as fBuffer1. The other way round (making fBuffer1 as slow as fBuffer2) is pretty easy. ;-) Since this problem sits in a framework-like component, I would prefer to change the code rather than tune HotSpot with external arguments. But of course, suggestions in both directions are very welcome.

Source Code
import java.util.Locale;

public final class Collector {

    private static final boolean SELECT_QUICK = true;

    private static final long LOOP_COUNT = 50_000L;
    private static final int VALUE_COUNT = 150_000;
    private static final int BUFFER_LENGTH = 100_000;

    private final Buffer fBuffer = new Buffer();

    public void reset() {fBuffer.reset();}
    public void addValueBuf1(long val) {fBuffer.add1(val);}
    public void addValueBuf2(long val) {fBuffer.add2(val);}

    public static final class Buffer {

        private int fIdx = 0;
        private long[] fBuffer1 = new long[BUFFER_LENGTH * 2];
        private long[] fBuffer2 = new long[BUFFER_LENGTH];

        public void reset() {fIdx = 0;}

        public void add1(long value) {
            ensureRemainingBuf1(1);
            fBuffer1[fIdx++] = value;
        }

        public void add2(long value) {
            ensureRemainingBuf2(1);
            fBuffer2[fIdx++] = value;
        }

        private void ensureRemainingBuf1(int remaining) {
            if (remaining > fBuffer1.length - fIdx) {
                System.out.println("make buffer 1 larger");
                fBuffer1 = new long[(fIdx + remaining) << 1];
            }
        }

        private void ensureRemainingBuf2(int remaining) {
            if (remaining > fBuffer2.length - fIdx) {
                System.out.println("make buffer 2 larger");
                fBuffer2 = new long[(fIdx + remaining) << 1];
            }
        }

    }

    public static void main(String[] args) {
        Locale.setDefault(Locale.ENGLISH);
        final Collector collector = new Collector();
        if (SELECT_QUICK) {
            while (true) {
                final long start = System.nanoTime();
                for (long j = 0L; j < LOOP_COUNT; j++) {
                    collector.reset();
                    for (int k = 0; k < VALUE_COUNT; k++) {
                        collector.addValueBuf1(k);
                    }
                }
                final long nanos = System.nanoTime() - start;
                System.out.printf("duration buf1: %1$,.3fms (%2$,.1fs)%n",
                    nanos / 1_000_000d, nanos / 1_000_000_000d);
            }
        } else {
            while (true) {
                final long start = System.nanoTime();
                for (long j = 0L; j < LOOP_COUNT; j++) {
                    collector.reset();
                    for (int k = 0; k < VALUE_COUNT; k++) {
                        collector.addValueBuf2(k);
                    }
                }
                final long nanos = System.nanoTime() - start;
                System.out.printf("duration buf2 (resized): %1$,.3fms (%2$,.1fs)%n",
                    nanos / 1_000_000d, nanos / 1_000_000_000d);
            }
        }
    }

}

ANSWER

Answered 2022-Feb-04 at 18:19

JIT compilation in the HotSpot JVM 1) is based on runtime profile data and 2) uses speculative optimizations.

Once the method is compiled at the maximum optimization level, HotSpot stops profiling this code, so it is never recompiled afterwards, no matter how long the code runs. (The exception is when the method needs to be deoptimized or unloaded, but it's not your case).

In the first case (SELECT_QUICK == true), the condition remaining > fBuffer1.length - fIdx is never met, and the HotSpot JVM is aware of that from profiling data collected at lower tiers. So it speculatively hoists the check out of the loop and compiles the loop body with the assumption that the array index is always within bounds. After the optimization, the loop is compiled like this (in pseudocode):

if (VALUE_COUNT > collector.fBuffer.fBuffer1.length) {
    uncommon_trap();
}
for (int k = 0; k < VALUE_COUNT; k++) {
    collector.fBuffer.fBuffer1[k] = k;  // no bounds check
}

In the second case (SELECT_QUICK == false), on the contrary, HotSpot knows that condition remaining > fBuffer2.length - fIdx is sometimes met, so it cannot eliminate the check.

Since fIdx is not the loop counter, HotSpot is apparently not smart enough to split the loop into two parts (with and without the bounds check). However, you can help the JIT compiler by splitting the loop manually:

for (long j = 0L; j < LOOP_COUNT; j++) {
    collector.reset();

    int fastCount = Math.min(collector.fBuffer.fBuffer2.length, VALUE_COUNT);
    for (int k = 0; k < fastCount; k++) {
        collector.addValueBuf2Fast(k);
    }

    for (int k = fastCount; k < VALUE_COUNT; k++) {
        collector.addValueBuf2(k);
    }
}

where addValueBuf2Fast inserts a value without a bounds check:

    public void addValueBuf2Fast(long val) {fBuffer.add2Fast(val);}

    public static final class Buffer {
        ...
        public void add2Fast(long value) {
            fBuffer2[fIdx++] = value;
        }
    }

This should dramatically improve the performance of the loop:

make buffer 2 larger
duration buf2 (resized): 5,537.681ms (5.5s)
duration buf2 (resized): 5,461.519ms (5.5s)
duration buf2 (resized): 5,450.445ms (5.5s)

Source https://stackoverflow.com/questions/70986856

QUESTION

How to predict actual future values after testing the trained LSTM model?

Asked 2021-Dec-22 at 10:12

I have trained my stock price prediction model by splitting the dataset into train & test. I have also tested the predictions by comparing the valid data with the predicted data, and the model works fine. But I want to predict actual future values.

What do I need to change in my code below?

How can I make predictions up to a specific date in the actual future?


Code (in a Jupyter Notebook):

(To run the code, please try it with a similar CSV file you have, or install the nsepy Python library using the command pip install nsepy.)

# imports
import pandas as pd  # data processing
import numpy as np  # linear algebra
import matplotlib.pyplot as plt  # plotting
from datetime import date  # date
from nsepy import get_history  # NSE historical data
from keras.models import Sequential  # neural network
from keras.layers import LSTM, Dropout, Dense  # LSTM layer
from sklearn.preprocessing import MinMaxScaler  # scaling

nseCode = 'TCS'
stockTitle = 'Tata Consultancy Services'

# API call
apiData = get_history(symbol = nseCode, start = date(2017,1,1), end = date(2021,12,19))
data = apiData  # copy the dataframe (not necessary)

# remove columns you don't need
del data['Symbol']
del data['Series']
del data['Prev Close']
del data['Volume']
del data['Turnover']
del data['Trades']
del data['Deliverable Volume']
del data['%Deliverble']

# store the data in a csv file
data.to_csv('infy2.csv')

# Read the csv file
data = pd.read_csv('infy2.csv')

# convert the date column to datetime; if you read data from csv, do this. Otherwise, no need if you read data from API
data['Date'] = pd.to_datetime(data['Date'], format = '%Y-%m-%d')
data.index = data['Date']

# plot
plt.xlabel('Date')
plt.ylabel('Close Price (Rs.)')
data['Close'].plot(legend = True, figsize = (10,6), title = stockTitle, grid = True, color = 'blue')

# Sort data into Date and Close columns
data2 = data.sort_index(ascending = True, axis = 0)

newData = pd.DataFrame(index = range(0,len(data2)), columns = ['Date', 'Close'])

for i in range(0, len(data2)):  # only if you read data from csv
    newData['Date'][i] = data2['Date'][i]
    newData['Close'][i] = data2['Close'][i]

# Calculate the row number to split the dataset into train and test
split = len(newData) - 100

# normalize the new dataset
scaler = MinMaxScaler(feature_range = (0, 1))
finalData = newData.values

trainData = finalData[0:split, :]
validData = finalData[split:, :]

newData.index = newData.Date
newData.drop('Date', axis = 1, inplace = True)
scaler = MinMaxScaler(feature_range = (0, 1))
scaledData = scaler.fit_transform(newData)

xTrainData, yTrainData = [], []

for i in range(60, len(trainData)):  # data-flair has used 60 instead of 30
    xTrainData.append(scaledData[i-60:i, 0])
    yTrainData.append(scaledData[i, 0])

xTrainData, yTrainData = np.array(xTrainData), np.array(yTrainData)

xTrainData = np.reshape(xTrainData, (xTrainData.shape[0], xTrainData.shape[1], 1))

# build and train the LSTM model
lstmModel = Sequential()
lstmModel.add(LSTM(units = 50, return_sequences = True, input_shape = (xTrainData.shape[1], 1)))
lstmModel.add(LSTM(units = 50))
lstmModel.add(Dense(units = 1))

inputsData = newData[len(newData) - len(validData) - 60:].values
inputsData = inputsData.reshape(-1,1)
inputsData = scaler.transform(inputsData)

lstmModel.compile(loss = 'mean_squared_error', optimizer = 'adam')
lstmModel.fit(xTrainData, yTrainData, epochs = 1, batch_size = 1, verbose = 2)

# Take a sample of a dataset to make predictions
xTestData = []

for i in range(60, inputsData.shape[0]):
    xTestData.append(inputsData[i-60:i, 0])

xTestData = np.array(xTestData)

xTestData = np.reshape(xTestData, (xTestData.shape[0], xTestData.shape[1], 1))

predictedClosingPrice = lstmModel.predict(xTestData)
predictedClosingPrice = scaler.inverse_transform(predictedClosingPrice)

# visualize the results
trainData = newData[:split]
validData = newData[split:]

validData['Predictions'] = predictedClosingPrice

plt.xlabel('Date')
plt.ylabel('Close Price (Rs.)')

trainData['Close'].plot(legend = True, color = 'blue', label = 'Train Data')
validData['Close'].plot(legend = True, color = 'green', label = 'Valid Data')
validData['Predictions'].plot(legend = True, figsize = (12,7), grid = True, color = 'orange', label = 'Predicted Data', title = stockTitle)

ANSWER

Answered 2021-Dec-22 at 10:12

Below is an example of how you could implement this approach for your model:

import pandas as pd
import numpy as np
from datetime import date
from nsepy import get_history
from keras.models import Sequential
from keras.layers import LSTM, Dense
from sklearn.preprocessing import MinMaxScaler
pd.options.mode.chained_assignment = None

# load the data
stock_ticker = 'TCS'
stock_name = 'Tata Consultancy Services'
train_start = date(2017, 1, 1)
train_end = date.today()
data = get_history(symbol=stock_ticker, start=train_start, end=train_end)
data.index = pd.DatetimeIndex(data.index)
data = data[['Close']]

# scale the data
scaler = MinMaxScaler(feature_range=(0, 1)).fit(data)
z = scaler.transform(data)

# extract the input sequences and target values
window_size = 60

x, y = [], []

for i in range(window_size, len(z)):
    x.append(z[i - window_size: i])
    y.append(z[i])

x, y = np.array(x), np.array(y)

# build and train the model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=x.shape[1:]))
model.add(LSTM(units=50))
model.add(Dense(units=1))
model.compile(loss='mse', optimizer='adam')
model.fit(x, y, epochs=100, batch_size=128, verbose=1)

# generate the multi-step forecasts
def multi_step_forecasts(n_past, n_future):

    x_past = x[- n_past - 1:, :, :][:1]  # last observed input sequence
    y_past = y[- n_past - 1]             # last observed target value
    y_future = []                        # predicted target values

    for i in range(n_past + n_future):

        # feed the last forecast back to the model as an input
        x_past = np.append(x_past[:, 1:, :], y_past.reshape(1, 1, 1), axis=1)

        # generate the next forecast
        y_past = model.predict(x_past)

        # save the forecast
        y_future.append(y_past.flatten()[0])

    # transform the forecasts back to the original scale
    y_future = scaler.inverse_transform(np.array(y_future).reshape(-1, 1)).flatten()

    # add the forecasts to the data frame
    df_past = data.rename(columns={'Close': 'Actual'}).copy()

    df_future = pd.DataFrame(
        index=pd.bdate_range(start=data.index[- n_past - 1] + pd.Timedelta(days=1), periods=n_past + n_future),
        columns=['Forecast'],
        data=y_future
    )

    return df_past.join(df_future, how='outer')

# forecast the next 30 days
df1 = multi_step_forecasts(n_past=0, n_future=30)
df1.plot(title=stock_name)

# forecast the last 20 days and the next 30 days
df2 = multi_step_forecasts(n_past=20, n_future=30)
df2.plot(title=stock_name)
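Since the question asks for predictions up to a specific date, one way to use the function above (my own illustration, not from the original answer; the target date is a placeholder) is to derive n_future from the number of business days between the last observed trading day and the target date:

import pandas as pd

target_date = pd.Timestamp('2022-03-31')  # placeholder target date
last_date = data.index[-1]                # last observed trading day
# count business days after last_date up to and including target_date
n_future = len(pd.bdate_range(start=last_date, end=target_date)) - 1

df = multi_step_forecasts(n_past=0, n_future=n_future)
df.plot(title=stock_name)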

Source https://stackoverflow.com/questions/70420155

QUESTION

How to deploy sagemaker.workflow.pipeline.Pipeline?

Asked 2021-Dec-09 at 18:06

I have a sagemaker.workflow.pipeline.Pipeline which contains multiple sagemaker.workflow.steps.ProcessingStep objects, and each ProcessingStep contains a sagemaker.processing.ScriptProcessor.

The current pipeline graph looks like the image shown below. It takes data from multiple sources in S3, processes it, and creates a final dataset using the data from the previous steps.

[Image: pipeline graph]

As the Pipeline object doesn't support a .deploy method, how do I deploy this pipeline?

During inference/scoring, when we receive raw data (a single row from each source), how do we trigger the pipeline?

Or is a SageMaker Pipeline designed only for data processing and model training on huge/batch data, and not for inference on a single data point?

ANSWER

Answered 2021-Dec-09 at 18:06

As the Pipeline object doesn't support a .deploy method, how do I deploy this pipeline?

Pipeline does not have a .deploy() method, no

Use pipeline.upsert(role_arn='...') to create/update the pipeline definition in SageMaker, then call pipeline.start(). Docs here.
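For illustration, a minimal sketch of registering and starting a model building pipeline (the pipeline name, the steps list, and the role ARN below are placeholders, not taken from the question):

import sagemaker
from sagemaker.workflow.pipeline import Pipeline

# `processing_steps` is assumed to be the list of ProcessingStep objects already defined
pipeline = Pipeline(
    name="data-processing-pipeline",       # placeholder name
    steps=processing_steps,
    sagemaker_session=sagemaker.Session(),
)

# Create or update the pipeline definition in SageMaker, then kick off an execution
pipeline.upsert(role_arn="arn:aws:iam::<account-id>:role/<sagemaker-role>")  # placeholder ARN
execution = pipeline.start()
execution.wait()  # optionally block until the run finishes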

During inference/scoring, when we receive raw data (a single row from each source), how do we trigger the pipeline?

There are actually two types of pipelines in SageMaker: Model Building Pipelines (which you have in your question) and Serial Inference Pipelines, which are used for inference. AWS definitely should have called the former "workflows".

You can use a model building pipeline to set up a serial inference pipeline.

To do pre-processing in a serial inference pipeline, you want to train an encoder/estimator (such as SKLearn) and save its model, then train a learning algorithm and save its model, and finally create a PipelineModel using both models.
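A rough sketch of that last step, combining the two saved models into a serial inference pipeline behind a single endpoint (the model variables, endpoint settings, and role ARN are placeholders):

from sagemaker.pipeline import PipelineModel

# `preprocessor_model` and `estimator_model` are assumed to be sagemaker.model.Model
# objects built from the two training jobs described above (placeholders here)
pipeline_model = PipelineModel(
    name="serial-inference-pipeline",                        # placeholder name
    role="arn:aws:iam::<account-id>:role/<sagemaker-role>",  # placeholder ARN
    models=[preprocessor_model, estimator_model],
)

# Deploying the PipelineModel creates one endpoint that runs the containers in sequence,
# so a single raw record sent to the endpoint is pre-processed and then scored
predictor = pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)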

Source https://stackoverflow.com/questions/70287087

QUESTION

How reproducible / deterministic is Parquet format?

Asked 2021-Dec-09 at 03:55

I'm seeking advice from people deeply familiar with the binary layout of Apache Parquet:

Having a data transformation F(a) = b, where F is fully deterministic and the same exact versions of the entire software stack (framework, Arrow and Parquet libraries) are used, how likely am I to get an identical binary representation of dataframe b on different hosts every time b is saved into Parquet?

In other words, how reproducible is Parquet at the binary level? When the data is logically the same, what can cause binary differences?

  • Can there be some uninitialized memory in between values due to alignment?
  • Assuming all serialization settings (compression, chunking, use of dictionaries, etc.) are the same, can the result still drift?
Context

I'm working on a system for fully reproducible and deterministic data processing and computing dataset hashes to assert these guarantees.

My key goal has been to ensure that dataset b contains an identical set of records to dataset b'. This is of course very different from hashing a binary representation of Arrow/Parquet. Not wanting to deal with the reproducibility of storage formats, I've been computing logical data hashes in memory. This is slow but flexible, e.g. my hash stays the same even if records are re-ordered (which I consider an equivalent dataset).
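To make "logical data hash" concrete, here is a toy sketch of an order-insensitive hash over a pyarrow Table (my own illustration; it is not the arrow-digest algorithm mentioned below, and it assumes all columns are sortable):

import hashlib
import pyarrow as pa

def logical_hash(table: pa.Table) -> str:
    # Sort by every column so that record order does not affect the digest
    table = table.sort_by([(name, "ascending") for name in table.column_names])
    digest = hashlib.sha256()
    for name in table.column_names:
        digest.update(name.encode())
        for value in table.column(name).to_pylist():
            digest.update(repr(value).encode())
    return digest.hexdigest()

# Two logically equal tables with different row order hash to the same value
t1 = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})
t2 = pa.table({"x": [3, 1, 2], "y": ["c", "a", "b"]})
assert logical_hash(t1) == logical_hash(t2)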

But when thinking about integrating with IPFS and other content-addressable storages that rely on hashes of files, it would simplify the design a lot to have just one hash (physical) instead of two (logical + physical). This, however, means I have to guarantee that Parquet files are reproducible.


Update

I decided to continue using logical hashing for now.

I've created a new Rust crate arrow-digest that implements the stable hashing for Arrow arrays and record batches and tries hard to hide the encoding-related differences. The crate's README describes the hashing algorithm if someone finds it useful and wants to implement it in another language.

I'll continue to expand the set of supported types as I'm integrating it into the decentralized data processing tool I'm working on.

In the long term, I'm not sure logical hashing is the best way forward - a subset of Parquet that makes some efficiency sacrifices just to make file layout deterministic might be a better choice for content-addressability.

ANSWER

Answered 2021-Dec-05 at 04:30

At least in Arrow's implementation, I would expect, but haven't verified, that the exact same input (including identical metadata), written in the same order and with the same configuration, yields deterministic output (we try not to leave uninitialized values, for security reasons), assuming the chosen compression algorithm also makes that determinism guarantee. It is possible there is some hash-map iteration for metadata or elsewhere that might also break this assumption.

As @Pace pointed out, I would not rely on this (and recommend against relying on it). There is nothing in the spec that guarantees this, and since the writer version is persisted when writing a file, you are guaranteed a breakage if you ever decide to upgrade. Things will also break if additional metadata is added or removed (I believe in the past there have been some bug fixes for round-tripping datasets that would have caused non-determinism).

So, in summary, this might or might not work today, but even if it does, I would expect it to be very brittle.

Source https://stackoverflow.com/questions/70220970

QUESTION

Why does KNeighborsClassifier always predict the same number?

Asked 2021-Oct-17 at 14:23

Why does knn always predict the same number? How can I solve this? The dataset is here.

Code:

import numpy as np
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import scipy.io
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from torch.utils.data import Dataset, DataLoader
from sklearn import preprocessing
import torch
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

def load_mat_data(path):
    mat = scipy.io.loadmat(DATA_PATH)
    x,y = mat['data'], mat['class']
    x = x.astype('float32')
    # stadardize values
    standardizer = preprocessing.StandardScaler()
    x = standardizer.fit_transform(x)
    return x, standardizer, y

def numpyToTensor(x):
    x_train = torch.from_numpy(x)
    return x_train

class DataBuilder(Dataset):
    def __init__(self, path):
        self.x, self.standardizer, self.y = load_mat_data(DATA_PATH)
        self.x = numpyToTensor(self.x)
        self.len=self.x.shape[0]
        self.y = numpyToTensor(self.y)
    def __getitem__(self,index):
        return (self.x[index], self.y[index])
    def __len__(self):
        return self.len

datasets = ['/home/katerina/Desktop/datasets/GSE75110.mat']

for DATA_PATH in datasets:

    print(DATA_PATH)
    data_set=DataBuilder(DATA_PATH)

    pred_rpknn = [0] * len(data_set.y)
    kf = KFold(n_splits=10, shuffle = True, random_state=7)

    for train_index, test_index in kf.split(data_set.x):
        #Create KNN Classifier
        knn = KNeighborsClassifier(n_neighbors=5)
        #print("TRAIN:", train_index, "TEST:", test_index)
        x_train, x_test = data_set.x[train_index], data_set.x[test_index]
        y_train, y_test = data_set.y[train_index], data_set.y[test_index]
        #Train the model using the training sets
        y1_train = y_train.ravel()
        knn.fit(x_train, y1_train)
        #Predict the response for test dataset
        y_pred = knn.predict(x_test)
        #print(y_pred)
        # Model Accuracy, how often is the classifier correct?
        print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
        c = 0
        for idx in test_index:
            pred_rpknn[idx] = y_pred[c]
            c +=1
    print("Accuracy:",metrics.accuracy_score(data_set.y, pred_rpknn))
    print(pred_rpknn, data_set.y.reshape(1,-1))

Output:

/home/katerina/Desktop/datasets/GSE75110.mat
Accuracy: 0.2857142857142857
Accuracy: 0.38095238095238093
Accuracy: 0.14285714285714285
Accuracy: 0.4
Accuracy: 0.3
Accuracy: 0.25
Accuracy: 0.3
Accuracy: 0.6
Accuracy: 0.25
Accuracy: 0.45
Accuracy: 0.33497536945812806
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ..., 3]  (class 3 is predicted for every one of the 203 samples)

I am trying to combine KNN with k-fold cross-validation in order to evaluate the whole dataset using 10 folds. The problem is that KNN always predicts class 3 for every sample in every fold. The classes I want to predict are these:

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3,3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]]

ANSWER

Answered 2021-Oct-17 at 07:36

TL;DR
It has to do with the StandardScaler; change it to a simple normalization.
e.g.

from sklearn import preprocessing

...

x = preprocessing.normalize(x)

Explanation:

StandardScaler, as you use it, will do:

The standard score of a sample `x` is calculated as:

    z = (x - u) / s

where `u` is the mean of the training samples or zero if `with_mean=False`,
and `s` is the standard deviation of the training samples or one if
`with_std=False`.

This standardization is applied per feature, across all samples, whereas you actually want these features to help KNN decide which vectors are close to each other.

With normalize, the scaling happens for each vector (sample) separately, so it does not distort the distances between samples and even helps KNN differentiate the vectors.

With KNN, StandardScaler can therefore actually harm your predictions; it is better suited to other kinds of data.
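
To make the difference concrete, here is a small illustration that is not part of the original answer; the matrix and its values are made up purely to show how the two transforms treat the same data.

import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Toy matrix: 2 samples (rows) x 3 features (columns); values are made up.
x = np.array([[1.0, 10.0, 100.0],
              [3.0, 60.0, 200.0]])

# StandardScaler standardizes each *column* across samples (z = (x - u) / s).
# With two samples, every feature collapses to -1 / +1, so the original
# magnitudes no longer influence the distance between the rows.
print(StandardScaler().fit_transform(x))   # [[-1. -1. -1.]  [ 1.  1.  1.]]

# normalize rescales each *row* to unit L2 norm independently, so the
# proportions among a sample's own features (what KNN compares) survive.
print(normalize(x))                        # two distinct unit-norm rows

The full corrected script from the answer, using preprocessing.normalize, follows, together with its output: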

import scipy.io
from torch.utils.data import Dataset
from sklearn import preprocessing
import torch
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

def load_mat_data(path):
    mat = scipy.io.loadmat(path)
    x, y = mat['data'], mat['class']
    x = x.astype('float32')
    # normalize each sample (row) to unit norm
    x = preprocessing.normalize(x)
    return x, y

def numpyToTensor(x):
    x_train = torch.from_numpy(x)
    return x_train

class DataBuilder(Dataset):
    def __init__(self, path):
        self.x, self.y = load_mat_data(path)
        self.x = numpyToTensor(self.x)
        self.len = self.x.shape[0]
        self.y = numpyToTensor(self.y)
    def __getitem__(self, index):
        return (self.x[index], self.y[index])
    def __len__(self):
        return self.len

datasets = ['/home/katerina/Desktop/datasets/GSE75110.mat']

for DATA_PATH in datasets:

    print(DATA_PATH)
    data_set = DataBuilder(DATA_PATH)

    pred_rpknn = [0] * len(data_set.y)
    kf = KFold(n_splits=10, shuffle=True, random_state=7)

    for train_index, test_index in kf.split(data_set.x):
        # Create KNN classifier
        knn = KNeighborsClassifier(n_neighbors=5)
        #print("TRAIN:", train_index, "TEST:", test_index)
        x_train, x_test = data_set.x[train_index], data_set.x[test_index]
        y_train, y_test = data_set.y[train_index], data_set.y[test_index]
        # Train the model using the training set
        y1_train = y_train.view(-1)
        knn.fit(x_train, y1_train)
        # Predict the response for the test fold
        y_pred = knn.predict(x_test)
        #print(y_pred)
        # Model accuracy: how often is the classifier correct?
        print("Accuracy in loop:", metrics.accuracy_score(y_test, y_pred))
        c = 0
        for idx in test_index:
            pred_rpknn[idx] = y_pred[c]
            c += 1
    print("Accuracy:", metrics.accuracy_score(data_set.y, pred_rpknn))
    print(pred_rpknn, data_set.y.reshape(1,-1))


Accuracy in loop: 1.0
Accuracy in loop: 0.8571428571428571
Accuracy in loop: 0.8571428571428571
Accuracy in loop: 1.0
Accuracy in loop: 0.9
Accuracy in loop: 0.9
Accuracy in loop: 0.95
Accuracy in loop: 1.0
Accuracy in loop: 0.9
Accuracy in loop: 1.0
Accuracy: 0.9359605911330049
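
As an aside (not part of the original answer), scikit-learn can express the same 10-fold, out-of-fold evaluation more compactly with cross_val_predict. A minimal sketch, reusing load_mat_data from the listing above:

from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# x, y as produced by load_mat_data() above (x already normalized row-wise).
x, y = load_mat_data('/home/katerina/Desktop/datasets/GSE75110.mat')

kf = KFold(n_splits=10, shuffle=True, random_state=7)
knn = KNeighborsClassifier(n_neighbors=5)

# cross_val_predict returns one out-of-fold prediction per sample,
# which is what pred_rpknn collects manually in the loop above.
pred = cross_val_predict(knn, x, y.ravel(), cv=kf)
print("Accuracy:", metrics.accuracy_score(y.ravel(), pred))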

Source https://stackoverflow.com/questions/69599448

QUESTION

Is there a way to accomplish multithreading or parallel processes in a batch file?

Asked 2021-Oct-14 at 17:38

So I have a batch file that executes a simulation for some input parameters and then processes the output data via awk, R, and Python. At the moment the input parameters are passed into the simulation through nested for loops, and each simulation run is executed one after the other. I would like the simulation runs to be done in parallel, because there are 1,000+ cases; in my mind, core 1 could handle sims 1-250, core 2 could handle sims 251-500, and so on.

In essence what I would like to do is this:

  1. Run every case of the simulation across multiple cores
  2. Once every simulation has been completed, start the output data processing

I've tried using start /affinity n simulation.exe, but the issue is that all of the simulations are launched simultaneously, so when the script gets to the post-processing calls it errors out because the data hasn't been generated yet. There is also the start /w command, but I'm not sure that would improve things. One idea I've thought of is updating a counter once each simulation has completed and only starting the post-processing once the counter reaches n runs.

Here is an excerpt of what I am doing right now:

    for %%f in (1 2 3) do (
            for %%a in (4 5 6) do (
                for %%b in (7 8 9) do (
                    call :mission %%f %%a %%b
                )
            )
         )
    some gawk scripts
    some python scripts
    some r scripts
    goto :exit

:mission
   sed -e 's/text1/%1/' -e 's/text2/%2/' -e 's/text3/%3/'
   simulation.exe
   goto :exit

:exit

And here's what I was playing around with to test out some parallel processing:

start /affinity 1 C:\Users\614890\R-4.1.1\bin\Rscript.exe test1.R
start /affinity 2 C:\Users\614890\R-4.1.1\bin\Rscript.exe test2.R
start /affinity 3 C:\Users\614890\R-4.1.1\bin\Rscript.exe test3.R
start /affinity 4 C:\Users\614890\R-4.1.1\bin\Rscript.exe test4.R

C:\Users\614890\R-4.1.1\bin\Rscript.exe plotting.R

ANSWER

Answered 2021-Oct-14 at 17:38

I was actually able to accomplish this by doing the following:

setlocal
set "lock=%temp%\wait%random%.lock"

:: Launch processes asynchronously, with stream 9 redirected to a lock file.
:: The lock file will remain locked until the script ends
start /affinity 1 9>"%lock%1" Rscript test1.R
start /affinity 2 9>"%lock%2" Rscript test2.R
start /affinity 4 9>"%lock%3" Rscript test3.R
start /affinity 8 9>"%lock%4" Rscript test4.R

:Wait for all processes to finish
1>nul 2>nul ping /n 2 ::1
for %%F in ("%lock%*") do (
 (call ) 9>"%%F" || goto :Wait
) 2>nul

del "%lock%*"

Rscript plotting.R
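
As an aside, not part of the accepted answer: if stepping outside pure batch is acceptable, the same launch-in-parallel-then-wait pattern can be sketched with Python's standard subprocess module. The sketch assumes Rscript is on the PATH, reuses the script names from above, and does not pin CPU affinity the way start /affinity does.

import subprocess

# Launch the four R jobs in parallel, then block until all have finished,
# mirroring the lock-file wait in the batch script above.
scripts = ["test1.R", "test2.R", "test3.R", "test4.R"]
procs = [subprocess.Popen(["Rscript", s]) for s in scripts]

for p in procs:
    p.wait()   # returns when the corresponding Rscript process exits

# Only runs once every parallel job is done.
subprocess.run(["Rscript", "plotting.R"], check=True)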

Source https://stackoverflow.com/questions/69381811

QUESTION

Simple calculation for all combination (brute force) of elements in two arrays, for better performance, in Julia

Asked 2021-Oct-06 at 15:49

I am new to Julia (some experience with Python). The main reason I am starting to use Julia is better performance for large scale data processing.

I want to get the differences of the (integer) values for all possible combinations of elements in the two arrays.

Say I have two arrays.

a = [5,4]
b = [2,1,3]

Then I want the differences for all combinations, like a[1] - b[1], a[1] - b[2], ..., a[2] - b[2], a[2] - b[3].

The result will be the 3×2 array [3 2; 4 3; 2 1].

Then something I came up with is:

a = [5,4]
b = [2,1,3]
diff_matrix = zeros(Int8, size(b)[1], size(a)[1])
for ia in eachindex(a)
    for ib in eachindex(b)
        diff_matrix[ib,ia] = a[ia] - b[ib]
    end
end
println(diff_matrix)

It works, but it uses nested iteration and I assume the performance will not be great. In the real application the arrays will be long (a few hundred elements), and this process needs to be done for millions of combinations of arrays.

Is there a better approach (better performance, simpler code) for this task?

ANSWER

Answered 2021-Oct-06 at 15:49

If you wrap the code in a function, it will already be reasonably fast.

This is exactly the power of Julia: loops are fast. The only thing you need to avoid is using global variables in computations, as they lead to code that is not type stable.

I say the code would be "reasonably fast" because it could still be made faster with some low-level tricks. However, in this case you could just write:

julia> a = [5,4]
2-element Vector{Int64}:
 5
 4

julia> b = [2,1,3]
3-element Vector{Int64}:
 2
 1
 3

julia> permutedims(a) .- b
3×2 Matrix{Int64}:
 3  2
 4  3
 2  1

and this code will be fast (and much simpler as a bonus).
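
As an aside for readers coming from Python (the question mentions that background), the same broadcasting idea in NumPy would look like the sketch below; this is illustrative only and not part of the original answer.

import numpy as np

a = np.array([5, 4])
b = np.array([2, 1, 3])

# Reshape a into a row vector and b into a column vector, then broadcast the
# subtraction: result[i, j] = a[j] - b[i], giving the same 3x2 matrix.
diff_matrix = a.reshape(1, -1) - b.reshape(-1, 1)
print(diff_matrix)   # [[3 2]
                     #  [4 3]
                     #  [2 1]]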

Source https://stackoverflow.com/questions/69468471

QUESTION

Remove strange character from tokenization array

Asked 2021-Sep-03 at 19:39

I have a very dirty PySpark DataFrame, i.e. one full of weird characters like:
  • ɴɪᴄᴇ ᴏɴᴇ ᴀᴩᴩ
  • பரமசிவம்
  • and many others

I'm doing the data processing and cleaning (tokenization, stopword removal, ...) and this is my dataframe:

content score label classWeigth words filtered terms_stemmed
absolutely love d... 5 1 0.48 [absolutely, love... [absolutely, love... [absolut, love, d...
absolutely love t... 5 1 0.48 [absolutely, love... [absolutely, love... [absolut, love, g...
absolutely phenom... 5 1 0.48 [absolutely, phen... [absolutely, phen... [absolut, phenome...
absolutely shocki... 1 0 0.52 [absolutely, shoc... [absolutely, shoc... [absolut, shock, ...
accept the phone ... 1 0 0.52 [accept, the, pho... [accept, phone, n... [accept, phone, n...

How can I access the words column and remove all the weird characters, like the ones mentioned above?

ANSWER

Answered 2021-Sep-03 at 19:39

Try this UDF.

>>> @udf('array<string>')
... def filter_udf(a):
...     from builtins import filter
...     return list(filter(lambda s: s.isascii(), a))
...

>>> df = spark.createDataFrame([(['pyspark','பரமசிவம்'],)])
>>> df.select(filter_udf('_1')).show()
+--------------+
|filter_udf(_1)|
+--------------+
|     [pyspark]|
+--------------+
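
Applied to the DataFrame in the question, usage would presumably look like the snippet below. This is an assumption rather than part of the original answer: it takes the tokenized column to be named words, as the schema shown above suggests, and reuses the filter_udf just defined.

# Hypothetical usage on the question's DataFrame (column name assumed):
# replace the tokenized column with only the tokens filter_udf keeps.
df_clean = df.withColumn("words", filter_udf("words"))
df_clean.select("words").show(5, truncate=False)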

Source https://stackoverflow.com/questions/69046942

Community Discussions contain sources that include Stack Exchange Network

Tutorials and Learning Resources in Data Processing

Tutorials and Learning Resources are not available at this moment for Data Processing
