Explore all Data Processing open source software, libraries, packages, source code, cloud functions and APIs.

Popular New Releases in Data Processing

webtorrent v1.8.14

aria2 1.36.0

lens v5.4.6

mackup

resque 2.2.0

Popular Libraries in Data Processing

gulp

by gulpjs | JavaScript | 32240 stars | MIT

A toolkit to automate & enhance your workflow

webtorrent

by webtorrent | JavaScript | 26250 stars | MIT

⚡️ Streaming torrent client for the web

aria2

by aria2 | C++ | 26036 stars | NOASSERTION

aria2 is a lightweight multi-protocol & multi-source, cross platform download utility operated in command-line. It supports HTTP/HTTPS, FTP, SFTP, BitTorrent and Metalink.

lens

by lensapp | TypeScript | 17415 stars | NOASSERTION

Lens - The way the world runs Kubernetes

ZeroNet

by HelloZeroNet | JavaScript | 16995 stars | NOASSERTION

ZeroNet - Decentralized websites using Bitcoin crypto and BitTorrent network

qBittorrent

by qbittorrent | C++ | 14553 stars | NOASSERTION

qBittorrent BitTorrent client

sidekiq

by mperham | Ruby | 11744 stars | NOASSERTION

Simple, efficient background processing for Ruby

mackup

by lra | Python | 11523 stars | GPL-3.0

Keep your application settings in sync (OS X/Linux)

dataloader

by graphql | JavaScript | 11321 stars | MIT

DataLoader is a generic utility to be used as part of your application's data fetching layer to provide a consistent API over various backends and reduce requests to those backends via batching and caching.

Trending New libraries in Data Processing

PowerJob

by PowerJob | Java | 3412 stars | Apache-2.0

Enterprise job scheduling middleware with distributed computing ability.

PowerJob

by KFCFans | Java | 2357 stars | Apache-2.0

Enterprise job scheduling middleware with distributed computing ability.

exatorrent

by varbhat | Go | 1428 stars | GPL-3.0

Easy to Use Torrent Client. Can be hosted in Cloud. Files can be streamed in Browser/Media Player.

bobarr

by iam4x | TypeScript | 955 stars | MIT

🍿 The all-in-one alternative for Sonarr, Radarr, Jackett... with a VPN and running in docker

jobrunr

by jobrunr | Java | 864 stars | NOASSERTION

An extremely easy way to perform background processing in Java. Backed by persistent storage. Open and free for commercial use.

engine

by Monibuca | Go | 542 stars | MIT

The Monibuca core engine, containing the core streaming-media forwarding logic; it needs to run in combination with feature plugins.

redissyncer-server

by TraceNature | Java | 501 stars | Apache-2.0

RedisSyncer is a multi-task Redis data synchronization tool that flexibly meets data synchronization and migration needs between Redis instances; it supports both single-instance and cluster synchronization.

timetrace

by dominikbraun | Go | 487 stars | Apache-2.0

timetrace is a simple CLI for tracking your working time.

nebula

by hubastard | C# | 473 stars | GPL-3.0

A multiplayer mod for the game Dyson Sphere Program

Top Authors in Data Processing

1. mafintosh | 30 Libraries | 13100 stars
2. webtorrent | 24 Libraries | 43720 stars
3. bartbutenaers | 15 Libraries | 87 stars
4. node-red | 11 Libraries | 1659 stars
5. substack | 11 Libraries | 716 stars
6. mozilla-services | 11 Libraries | 2178 stars
7. DeanCording | 11 Libraries | 118 stars
8. Raynos | 11 Libraries | 159 stars
9. hkjang | 10 Libraries | 24 stars
10. pull-stream | 10 Libraries | 99 stars


Trending Kits in Data Processing

Today, data is generated constantly, and businesses need the latest data to drive decisions made via intelligent applications. This requires processing data continuously, in a streaming fashion, to achieve lower latency. Streaming also allows optimal use of resources and keeps up-to-date data loaded into downstream systems.

Stream processing involves multiple processing steps performed in near real time as the data is produced, transported, and received at the target location. Examples of such data-in-motion processing include continuous streams from sensors in IT infrastructure, machine sensors, health sensors, stock trading activity, and more.

To create an end-to-end stream processing solution, you will need components performing different tasks, stitched together into a pipeline and workflow.

Streaming

Using the libraries below, you can build your own correct, concurrent, and scalable streaming applications.

Stream processing engine

The open-source stream processing frameworks below provide you with stream processing capabilities.

Data Pipeline

The libraries below help in defining both batch and parallel processing pipelines that run on distributed processing backends.

A Node.js scheduling library is a collection of functions and tools that enables developers to create and manage automated tasks within an application. These libraries use JavaScript to schedule and execute tasks at designated times. A scheduling library lets developers plan tasks in advance, run them on a specific date and time, find suitable times for running tasks, and configure parameters such as delays, intervals, and repetition. With a Node.js scheduling library, developers can create efficient applications and minimize resource usage.

Different types of Node.js scheduling libraries are available (a minimal usage sketch follows the list):

  • node-schedule: A fast, flexible, and lightweight job scheduling library for Node.js.  
  • Agenda: Agenda is a lightweight library for scheduling jobs in Node.js. It allows you to schedule jobs, run them on a specific schedule, and manage them.  
  • Cron-job.org: A cloud-based scheduler for Node.js. It is designed to make scheduling and running tasks in Node.js as easy and efficient as possible.  
  • Axiom Scheduler: A powerful library to schedule tasks both on the client and server side. It works by applying a rule-based approach to scheduling tasks.  
  • Timers.js: A library for setTimeout and setInterval functions. This is a vital choice for those who need a simpler approach to scheduling tasks in Node.js.  
  • Node-cron: A library that allows you to schedule tasks in Node.js using cron syntax. It is an excellent option when you are familiar with the cron syntax and want to use it in Node.  
  • Nodetime: A library for scheduling jobs in Node.js. It is designed for multi-threaded applications. It provides several features to make it easy to manage many jobs.  
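
As a minimal sketch of how one of these libraries is typically used (assuming node-schedule is installed; the cron rule, job body, and shutdown handling are illustrative only):

// Schedule a hypothetical daily job with node-schedule.
const schedule = require('node-schedule');

// '30 7 * * *' -> every day at 07:30 (minute 30, hour 7, any day/month/weekday).
const job = schedule.scheduleJob('30 7 * * *', () => {
  console.log('Running the daily report job at', new Date().toISOString());
});

// Jobs can be cancelled later, e.g. during application shutdown.
process.on('SIGINT', () => {
  job.cancel();
  process.exit(0);
});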

Different Node.js scheduling libraries offer different features:

Time-Based Scheduling:  

  • Cron Syntax:  

It helps execute tasks based on date and time.  

  • Schedule Once:  

It helps execute a task at a specific point.  

  • Recurring:  

It helps execute a task regularly (daily, weekly, etc.).  

Event-Based Scheduling:  

  • Event-Based Actions:  

It helps execute an action upon the occurrence of the specified event.  

  • Signal-Based Scheduling:  

It helps execute tasks based on certain signals.  

  • System-Based Actions:  

It helps execute tasks based on system events.  

When evaluating a Node.js scheduling library, one should:

  • Evaluate the specific features of each library. Consider which library offers the functionality you need, whether concurrency, retrying, etc.
  • Test out each library before committing to it. Many libraries provide sample code or demos; running these and comparing their performance is useful.
  • Research the community and resources around each library. This can confirm whether the library is maintained and has good-quality support.
  • Explore the configuration options. Many libraries allow customization, so you can tailor how they behave to fit your specific needs.
  • Consider the performance of the library. Make sure that any library you choose performs well and can handle the scale at which you plan to use it.
  • Consider how each library handles errors and timeouts. Errors should be handled and timeouts enforced.
  • Seek good documentation. Libraries that provide good-quality documentation are easier to integrate and configure for your use case.

A Node.js scheduling library can be used in different ways:

  • Automating Tasks:  

A Node.js scheduling library can create, execute, and manage scheduled tasks. For instance, you can use the scheduler to automate routine work such as sending out emails, running reports, or sending notifications to users.

  • Job Scheduling:  

The scheduling library can schedule jobs for specific times and recurring tasks. This is useful when dealing with complex processes like backups or database updates.  

  • Process Control:  

Node.js scheduling libraries also allow you to monitor process details. You can attach extra information, such as timing data and user-defined notes, to process records, which is useful for better process control and auditing.

  • Event-Based Processing:  

The scalability and event-based processing capabilities of Node.js scheduling libraries make them great for event-driven tasks, such as polling MySQL or applying updates to a NoSQL database. You can use the library's events to trigger scheduled work.

To use a nodejs scheduling library:  

  • Understand How Node Scheduling Libraries Work:  

Before using a Node scheduling library, understand how it works. A Node scheduling library allows you to schedule tasks triggered at pre-determined intervals; this is useful for creating backups, running reports, cleaning up, or setting reminders. Familiarize yourself with the basic concepts, such as creating tasks/timers, scheduling executions, and scheduling recurring events.

  • Set Up Your Task Executors:  

You must also set up your task executors when adding tasks to your library. A task executor handles running a particular task when subjected to a timer. You can choose from several task executor types, such as a function, a set of commands, a task or file, or a URL.  

  • Manage Your Timers:  

You can manage your timers once you've set up your tasks and task executors. This includes setting up one-time jobs, recurring jobs, or creating cron-like schedules. You can also adjust your timing parameters. It includes an interval of execution, maximum number of executions, and maximum execution duration.  

  • Take Advantage of Advanced Features:  

Many Node scheduling libraries offer advanced features such as task retries, priority settings, event handlers, and groupings. Take time to understand these features and take advantage of them; they help ensure the reliability and accuracy of your scheduled tasks.

  • Continuously Check Performance:  

Finally, take time to monitor the performance of your task scheduler. This includes monitoring your tasks' accuracy, ensuring that your scheduled processes are running as expected, and making sure that errors or missed executions are addressed.

Conclusion  

Using a Node.js scheduling library is an attractive way for developers to integrate scheduling capabilities into their applications. These libraries provide an intuitive and powerful API for scheduling tasks, handling and triggering external events, and executing scheduled work.

 

This functionality is ideal for scaling and making applications more powerful. Coupling external services and APIs with scheduling libraries gives developers the power to orchestrate complex, distributed events and to handle both local and remote triggers.

 

Node.js scheduling libraries are a powerful tool for developers to incorporate scheduling into their applications. With the right library in your toolbelt, you can bring your applications to life with reliable and powerful scheduling capabilities.

cron  

  • It supports scheduling many jobs to run at the same time using wildcards.  
  • Can run jobs at any frequency, from minutes to years.  
  • Jobs can be set up and managed centrally, allowing users to manage jobs in different environments from one location (a short usage sketch follows).
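
A small sketch using the cron package's CronJob class (the schedule and job body are illustrative only):

const { CronJob } = require('cron');

// Run at second 0 of every 5th minute.
const job = new CronJob('0 */5 * * * *', () => {
  console.log('Runs every 5 minutes');
});

job.start();   // begin scheduling
// job.stop(); // pause the schedule when it is no longer needed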

agenda  

  • Allows developers to add unique attributes to each job. It makes it easy for the application to customize jobs and data further.  
  • Backed by an active community of support, making it easy for developers to get help if needed.  
  • Built to be a data store. It means developers can store job-related information in a single, pre-configured database.  

node-schedule  

  • Flexibility by enabling users to control scheduling parameters via code.  
  • Offers built-in real-time monitoring tools and an API for collecting job execution metrics.  
  • Allow you to set recurrences, create custom timeouts, and even run tasks in parallel.  

node-cron  

  • A sophisticated runtime engine. It makes it easy to store past scheduled times. It helps execute tasks at the correct time even if the system clock is adjusted.  
  • Supports various scheduling modes, including intervals and cron expressions.  
  • Allows you to specify a callback for each job, allowing you to customize the jobs to different tasks.  

Haraka  

  • Event-driven task scheduler. It allows developers to trigger tasks at a specified time or in intervals.  
  • Supports custom task parameters so that jobs can be customized to specific needs.  
  • It has a built-in logging infrastructure to track tasks.  

buzzer  

  • It helps trigger node services. It allows you to link different components within your application and schedule related tasks.  
  • Provides an extensive API to customize tasks. This allows you to customize timing, retry settings, execution logic, and more.  
  • Includes scheduling features such as cron scheduling, creating on-demand tasks, and recurring schedules.  

ontime  

  • Easy integration with external services through webhooks and plugins.  
  • Provides real-time updates on task progress. It allows tasks to be cron-style scheduled and triggered.  
  • Strong logging capabilities provide an audit trail of scheduled tasks, a feature differentiating it.  

FAQ  

What is the cron syntax for scheduling jobs with a node job scheduling library?  

In general, the cron syntax for node job scheduling libraries is written in six fields:  


[SECONDS] [MINUTES] [HOURS] [DAYS OF MONTH] [MONTH] [DAYS OF WEEK]  


Each field can contain an asterisk (*), which matches any valid value; a comma-separated list of values; or a range of numbers written with a hyphen. Additionally, a forward slash (/) can define step values.
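
For example, assuming node-cron (which accepts an optional leading seconds field), the following expressions illustrate step values, ranges, and lists; the handlers are placeholders:

const cron = require('node-cron');

// '*/10 * * * * *' -> every 10 seconds (step value with '/').
cron.schedule('*/10 * * * * *', () => console.log('every 10 seconds'));

// '0 0 9-17 * * 1-5' -> top of every hour from 09:00 to 17:00, Monday to Friday (range with '-').
cron.schedule('0 0 9-17 * * 1-5', () => console.log('hourly during business hours'));

// '0 30 8 1,15 * *' -> 08:30 on the 1st and 15th of each month (list with ',').
cron.schedule('0 30 8 1,15 * *', () => console.log('twice a month'));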

 

How does a nodejs developer implement a job scheduling library into their roadmap?  

A Node.js developer can implement a job scheduling library into their roadmap in a few steps. The first step is to pick the appropriate library; popular scheduling libraries for Node.js include Agenda and Later.js.


Once it is selected, the developer must install the library using a package manager. Then, depending on its options, they must set up the library's configuration. This may include setting up the job queue and configuring the scheduling interval.  


The developer may need to create an interface to abstract the library and build methods to work with it. Once the interface is set up, the developer can create new jobs and schedule/execute tasks.


Finally, the developer may need to add monitoring capabilities and error handling. This will allow them to debug any issues with the job scheduling library.  
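
As a rough sketch of those steps using Agenda (the MongoDB address, job names, and payloads below are hypothetical):

const Agenda = require('agenda');

// Configure the library with its persistent store.
const agenda = new Agenda({ db: { address: 'mongodb://127.0.0.1/agenda-jobs' } });

// Define the work once...
agenda.define('generate nightly report', async (job) => {
  console.log('Generating report for', job.attrs.data.team);
});

// ...then start the scheduler and enqueue recurring and one-off jobs.
(async () => {
  await agenda.start();
  await agenda.every('24 hours', 'generate nightly report', { team: 'sales' });
  await agenda.schedule('in 5 minutes', 'generate nightly report', { team: 'ops' });
})();

// Basic monitoring and error handling via Agenda's events.
agenda.on('fail', (err, job) => {
  console.error('Job failed:', job.attrs.name, err.message);
});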

 

What are the main features of Node scheduler, and how do they differ from Google Cloud Scheduler?  

Node Scheduler is an npm package that allows you to add job scheduling in your Node.js applications. It supports various scheduling functions. It includes scheduling tasks in intervals, on specific dates, or at specified times. It can also perform tasks on a cron schedule and can be used to manage @reboot jobs.  


The main features of the Node scheduler include the following:  

  • Flexible Scheduling:  

It supports a variety of scheduling functions. It lets you define jobs as recurring tasks, perform tasks on a cron schedule, or execute @reboot jobs.  

  • High Availability:  

The node scheduler is designed to be reliable and tolerant of system or server failures. Jobs will remain pending until they succeed.  

  • Ready API:  

Node scheduler enables you to use its API to define, store, and manage jobs. It also allows you to access data about past runs. It enables easy integration with other services.  

  • Advanced Features:  

It supports recurrent jobs, delay scheduling, time zone support, and automatic retries.  


Google Cloud Scheduler, by contrast, is a fully managed service that allows you to schedule jobs to run on Google Compute Engine or App Engine instances. Its main differentiators are the managed infrastructure and the integration with other Google Cloud services.

 

Are there any other cron implementations that are available for use in nodejs?  

Yes. In addition to the standard cron library, several alternatives provide similar functionality. These include Agenda.js, node-cron, and node-schedule.

 

How can MongoDB be used to store data related to scheduled jobs?  

MongoDB can store data related to scheduled jobs using its built-in scheduling mechanisms. This enables developers to store data related to events within the database. It can also track the progress and history of jobs. It allows developers to query and report on job performance and success. MongoDB can build efficient and distributed job queues. It will further allow developers to monitor and control the execution of jobs.  

 

Is there a straightforward way to pause or stop a scheduling method?  

Yes, pausing or stopping a scheduling method is possible. In Node.js, most scheduling libraries expose a method for this; for example, node-schedule jobs can be cancelled with job.cancel(). If you use a language such as Java, you can use the Timer and TimerTask classes to pause or stop a scheduled task.

 

What advantages does time-based scheduling have over the task triggering in node runtime?  

Time-based scheduling has several advantages over the manual triggering of tasks in the Node runtime. First, it reduces human error by automating task execution instead of relying on someone to remember to run a task on a specific day or time; tasks are triggered exactly when they need to be. Time-based scheduling also helps optimize system resources by running tasks at times when demand is expected to be lighter and CPU use more optimal. It reduces maintenance time by eliminating the need to trigger tasks manually, and it improves predictability and reliability since tasks always occur when scheduled.

 

Are there any security considerations when leveraging job scheduling libraries with my codebase?  

Yes, there are a few security considerations to keep in mind when leveraging these libraries in your project's codebase:

  • Ensure that you understand the library's authentication and authorization systems. Then, audit them for security vulnerabilities.  
  • Review the library for secure storage and authentication of parameters, credentials, and API keys.  
  • Ensure that appropriate logging is in place for any job-related activity.  
  • Set up regular vulnerability scanning of both the libraries and the codebase. It will ensure that no new vulnerabilities are introduced.  
  • Ensure that users and processes using job scheduling libraries have only the minimum privileges required to perform their activities.
  • Follow best practices and use secure coding techniques to prevent malicious code injections.  

 

What extra benefits does a job scheduling library provide compared to writing custom code for task execution within the runtime environment?

These libraries can provide various benefits over writing custom code for task execution. These benefits include the following:  

  • Automation:  

A job scheduling library can manage task execution according to predefined schedules. This reduces the time spent coding and overseeing the execution of tasks.  

  • Scalability:  

A job scheduling library can scale with your task load. It provides flexibility when dealing with different types of tasks and workloads.  

  • Efficiency:  

A job scheduling library can execute many tasks, saving time and resources.  

  • Fault Tolerance:  

A job scheduling library is designed to be robust in the face of failures and downtimes. The library can detect and recover from errors and failures. It will help reduce the consequence of any potential downtime.  

  • Cost Savings:  

A job scheduling library can reduce the cost of labor. It will reduce the cost associated with manual coding and overseeing task execution. 

Developers widely use Python stream processing to query ongoing data streams and respond to important events in timeframes ranging from milliseconds to minutes. Complex event processing, real-time analytics, and streaming analytics are all closely linked to stream processing, which is now the primary framework for executing these use cases.

 

Stream processing engines are runtime libraries that let developers write code to process streaming data without having to deal with low-level streaming mechanics. Data was traditionally processed in batches based on a schedule or a predefined trigger (for instance, each night at 1 am, every hundred rows, or every time the volume reached two megabytes). However, as data volumes and speeds have increased, batch processing alone is no longer sufficient for many applications, and Python stream processing has evolved into a must-have capability for modern applications. For various use cases and applications, enterprises have turned to technologies that respond to data as it is created. Stream processing enables applications to respond to new data events as they happen: unlike batch processing, which groups data and collects it at predetermined intervals, stream processing applications collect and process data as it is generated.

 

Python stream processing is most commonly used with data generated as a series of events, such as IoT sensor data, payment processing systems, servers, and application logs. The two common paradigms are publisher/subscriber (also known as pub/sub) and source/sink. A publisher or source generates data and events, which are then delivered to a stream processing application, where the data might be augmented, tested against fraud detection algorithms, or otherwise transformed before being sent to a subscriber or sink. Furthermore, major cloud platforms provide native services that simplify stream processing development, and Python libraries such as TensorFlow, NumPy, and PyTorch can be used within these pipelines for analytics and machine learning.


Check out the list below to find more popular Python stream-processing libraries for your applications: 

A job queue is used for managing and executing asynchronous tasks, whether processing large data volumes, handling time-consuming work, or distributing workloads across workers. A job queue provides an efficient way to organize and prioritize jobs; acting as a buffer between task producers and consumers, it ensures tasks are executed in a controlled manner. With Bull or Bee-Queue, developers can implement a queue system to improve workflow. Job queues come in different types, each catering to specific needs. Task queues focus on managing individual jobs or tasks within the system; these task-oriented queues allow for job execution by assigning work to available workers. Message queues enable asynchronous communication by facilitating the exchange of messages between components; these queues are useful in event-driven microservices architectures, where different services must communicate.

 

A wide range of features can make job queues more useful. The central element is the queue itself, which permits tasks to be added for later execution. Job queues have notification features to send alerts when a job finishes or fails. Retry mechanisms are essential for fault tolerance because they retry failed jobs. With such built-in support, BullMQ and Bee-Queue offer developers a robust job queue solution.

 

To use job queues effectively, you must understand the configuration options relevant to your business requirements. The concurrency parameter, which regulates how many jobs execute at once, is one consideration: adjusting it ensures optimal performance and balances resource use. Using dedicated queue instances helps organize job types and prioritize tasks. Monitoring and handling stalled jobs is essential to maintain the queue system's stability. With the right configuration, job queues can boost workflow efficiency and make the software resilient.

bull:  

  • Used for creating fast and robust job queues in Node.js applications.  
  • It supports advanced features like queuing, notification, and retry mechanisms.  
  • Offers many queues and job events for efficient job management.  
  • Provides concurrency control and adjustable concurrency for optimal resource use (see the sketch below).
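
A minimal sketch of a Bull producer and worker, assuming a local Redis instance on the default port (the queue name and job payload are hypothetical):

const Queue = require('bull');

// Connect the queue to a local Redis instance.
const emailQueue = new Queue('email', 'redis://127.0.0.1:6379');

// Worker: process up to 5 jobs concurrently.
emailQueue.process(5, async (job) => {
  console.log('Sending email to', job.data.to);
  // actual send logic would go here
});

// Producer: enqueue a job with retry and backoff options.
emailQueue.add(
  { to: 'user@example.com', subject: 'Report ready' },
  { attempts: 3, backoff: { type: 'exponential', delay: 1000 } }
);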

pg-boss:  

  • Used for building job queues and task scheduling with PostgreSQL as the backend.  
  • Supports job dependencies and job prioritization for managing complex workflows.  
  • Offers advanced job scheduling options and persistence for jobs.  
  • Provides fault tolerance and transaction support for reliable job execution.  

bee-queue:  

  • Used for creating lightweight and scalable job queues in Node.js applications.  
  • Supports delayed and repeatable jobs for flexible task scheduling.  
  • Offers job prioritization and throttling for efficient resource management.  
  • Provides built-in support for rate limiting and job events.  

agenda:  

  • Used for job scheduling and task management in Node.js applications.  
  • Supports various scheduling options, including recurring and delayed jobs.  
  • Offers job persistence and event-driven job execution.  
  • Provides concurrency control and job dependency management.  

kue:  

  • Used for building feature-rich job queues with Redis as the backend.  
  • Supports job throttling and rate limiting for efficient resource use.  
  • Offers a user-friendly web interface for managing jobs and monitoring queues.  
  • Provides advanced features like job priority, job events, and job search.  

resque:  

  • Used for creating job queues inspired by the Ruby library with the same name.  
  • Supports Redis as the backend for efficient job processing.  
  • Offers job scheduling, delayed jobs, and monitoring capabilities.  
  • Provides a clean and straightforward interface for managing jobs and workers. 

FAQ: 

1. What is a queue job manager, and what options are available for NodeJS applications?  

A queue job manager in Node.js is a tool or library that helps manage and execute jobs or tasks in sequential order. Some application options include Bull queues, Bee-Queue, Agenda, and pg-boss.  

 

2. Can any fast and robust queue system libraries be used with NodeJS?  

Yes, Bull and Bee-Queue are fast and robust queue system libraries that can be used with Node.js applications. They provide features like queuing, notifications, retry mechanisms, and customizable job controls.

 

3. How does an asynchronous function queue work in the context of NodeJS applications?  

An asynchronous function queue in Node.js is useful for handling asynchronous operations in applications: it enables the sequential execution of asynchronous tasks, ensuring order and control over the execution flow.
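
A minimal, library-agnostic sketch of such a queue in plain Node.js; the class and task bodies are illustrative only:

// Tasks pushed onto the queue run strictly one after another.
class AsyncQueue {
  constructor() {
    this.tail = Promise.resolve();
  }
  push(task) {
    // Chain each task onto the previous one so execution order is preserved.
    this.tail = this.tail
      .then(() => task())
      .catch((err) => console.error('task failed:', err));
    return this.tail;
  }
}

const q = new AsyncQueue();
q.push(async () => console.log('first'));
q.push(async () => console.log('second')); // always runs after 'first' settles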

 

4. What are the advantages of Redis Queue versus other queues for NodeJS apps?  

  • It supports many queues and job events, persistence, and advanced features. 
  • It includes notification, retry mechanisms, and adjustable concurrency.  

 

5. How do you adjust concurrency when using a job processor on Nodejs apps?  

Concurrency in job processors is adjusted by controlling the number of concurrent job executions. This parameter helps optimize resource use and balance the workload, ensuring optimal performance and efficient job processing.
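
For example, with Bee-Queue (one of the libraries above), the concurrency is passed to process(); the queue name and handler below are hypothetical:

const Queue = require('bee-queue');

const imports = new Queue('import-rows');

// Process up to 10 jobs in parallel; tune this number against CPU, memory,
// and downstream rate limits.
imports.process(10, async (job) => {
  console.log('processing batch', job.data.batchId);
});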

 

6. What are some key features of Bull queues when building web applications with nodejs?  

When building web applications, Bull queues provide robust job queuing and advanced job management, with features like job dependencies and prioritization. Built-in support for job events and scalability helps web applications handle heavy workloads.


Trending Discussions on Data Processing

Create dictionary from the position of elements in nested lists

Flutter How to show splash screen only while loading flutter data

Why does Java HotSpot can not optimize array instance after one-time resizing (leads to massive performance loss)?

How to predict actual future values after testing the trained LSTM model?

How to deploy sagemaker.workflow.pipeline.Pipeline?

How reproducible / deterministic is Parquet format?

Why does KNeighborsClassifier always predict the same number?

Is there a way to accomplish multithreading or parallel processes in a batch file?

Simple calculation for all combination (brute force) of elements in two arrays, for better performance, in Julia

Remove strange character from tokenization array

QUESTION

Create dictionary from the position of elements in nested lists

Asked 2022-Feb-27 at 15:36

I want to create a dictionary using the position of elements in each list of lists. The order of each nested list is very important and must remain the same.

Original nested lists and desired dictionary keys:

L_original = [[1, 1, 3], [2, 3, 8]]
keys = ["POS1", "POS2", "POS3"]

Desired dictionary created from L_original:

L_dictionary = {"POS1": [1, 2], "POS2": [1, 3], "POS3": [3, 8]}

The code I have so far fails the conditionals and ends on the else statement for each iteration.

L_original = [[1, 1, 3], [2, 3, 8]]
keys = ["POS1", "POS2", "POS3"]
L_dictionary = {}
for i in L_original:
    for key, value in enumerate(i):
        if key == 0:
            L_dictionary[keys[0]] = value
        if key == 1:
            L_dictionary[keys[1]] = value
        if key == 2:
            L_dictionary[keys[2]] = value
        else:
            print(f"Error in positional data processing...{key}: {value} in {i}")

ANSWER

Answered 2022-Feb-27 at 14:42

I believe there are more clean ways to solve this with some fancy python API, but one of the straightforward solutions might be the following:

For each key from keys we take those numbers from L_original's nested arrays which have the same index as key has, namely idx

L_original = [[1, 1, 3], [2, 3, 8]]
keys = ["POS1", "POS2", "POS3"]
L_dictionary = {}

for (idx, key) in enumerate(keys):
    L_dictionary[key] = []
    for items in L_original:
        L_dictionary[key].append(items[idx])

Your code goes to else, because this else is related to the if key == 2, not to the whole chain of ifs. So if the key is, for example, 0 the flow goes to else, because 0 != 2. To fix this, the second and subsequent ifs should be replaced with elif. This relates the else to the whole chain:

if key == 0:
  # only when key is 0
elif key == 1:
  # only when key is 1
elif key == 2:
  # only when key is 2
else:
  # otherwise (not 0, not 1, not 2)

Source https://stackoverflow.com/questions/71285445

QUESTION

Flutter How to show splash screen only while loading flutter data

Asked 2022-Feb-24 at 12:53

While the app's splash screen is displayed, it needs to download files from the FTP server and process data. I implemented the splash screen for Flutter as follows:

class Home extends StatelessWidget {
  const Home({Key? key}) : super(key: key);

  @override
  Widget build(BuildContext context) {

    return FutureBuilder(
      future: Future.delayed(Duration(seconds: 3)),
      builder: (BuildContext context, AsyncSnapshot snapshot){
        if(snapshot.connectionState == ConnectionState.waiting)
          return SplashUI();    ///Splash Screen
        else
          return MainUI();       ///Main Screen
      },
    );
  }
}

Now, with a delay of 3 seconds, the startup screen is displayed for 3 seconds, during which time the file is downloaded from FTP and data is processed. I want to keep the splash screen until the completion of data processing rather than the specified time.

Splash Screen

Widget _splashUI(Size size){
    return SafeArea(
      child: Center(
        child: Container(
          width: size.width * 0.5,
          height: size.height * 0.1,
          child: Image(
            fit: BoxFit.fill,
            image: AssetImage('assets/images/elf_logo.png'),
          ),
        ),
      ),
    );
  }

 Widget build(BuildContext context) {

 getFtpFile();
 dataProgress();

 return Platform.isAndroid ?
    MaterialApp(
      debugShowCheckedModeBanner: false,
      home: Scaffold(
        body: _splashUI(_size),
      ),
    ) :
    CupertinoApp(
      debugShowCheckedModeBanner: false,
      home: CupertinoPageScaffold(
        child: _splashUI(_size),
      ),
    );
 }

I want to know how to keep the splash screen visible while processing data rather than handling it with a fixed delay. Thank you.

ANSWER

Answered 2022-Feb-24 at 02:35

You could do what others have done in the past: make both of your methods getFTPFile and dataProgress return a Future, then wait for both Futures using Future.wait, as in this answer: https://stackoverflow.com/a/54465973/871364

Future.wait([
  getFTPFile(),
  dataProgress(),
]).then((_) {
  // once all Futures have completed, navigate to another page here
});

Source https://stackoverflow.com/questions/71246161

QUESTION

Why does Java HotSpot can not optimize array instance after one-time resizing (leads to massive performance loss)?

Asked 2022-Feb-04 at 18:19
Question

Why is the use of fBuffer1 in the attached code example (SELECT_QUICK = true) twice as fast as the other variant, where fBuffer2 is resized only once at the beginning (SELECT_QUICK = false)?

The code path is absolutely identical, but even after 10 minutes the throughput of fBuffer2 does not increase to the level of fBuffer1.

Background:

We have a generic data processing framework that collects thousands of Java primitive values in different subclasses (one subclass for each primitive type). These values are stored internally in arrays, which we originally sized sufficiently large. To save heap memory, we have now switched these arrays to dynamic resizing (arrays grow only if needed). As expected, this change has massively reduced heap memory. However, on the other hand, performance has unfortunately degraded significantly: our processing jobs now take 2-3 times longer than before (e.g. 6 min instead of 2 min).

I have reduced our problem to a minimum working example and attached it. You can choose with SELECT_QUICK which buffer should be used. I see the same effect with jdk-1.8.0_202-x64 as well as with openjdk-17.0.1-x64.

Buffer 1 (is not resized) shows the following numbers:
duration buf1: 8,890.551ms (8.9s)
duration buf1: 8,339.755ms (8.3s)
duration buf1: 8,620.633ms (8.6s)
duration buf1: 8,682.809ms (8.7s)
...
Buffer 2 (is resized exactly 1 time at the beginning) shows the following numbers:
make buffer 2 larger
duration buf2 (resized): 19,542.750ms (19.5s)
duration buf2 (resized): 22,423.529ms (22.4s)
duration buf2 (resized): 22,413.364ms (22.4s)
duration buf2 (resized): 22,219.383ms (22.2s)
...
12

I would really appreciate some hints on how I can change the code so that fBuffer2 (after resizing) works as fast as fBuffer1. The other way round (making fBuffer1 as slow as fBuffer2) is pretty easy. ;-) Since this problem sits in a framework-like component, I would prefer to change the code instead of tuning HotSpot (with external arguments). But of course, suggestions in both directions are very welcome.

Source Code
import java.util.Locale;

public final class Collector {

    private static final boolean SELECT_QUICK = true;

    private static final long LOOP_COUNT = 50_000L;
    private static final int VALUE_COUNT = 150_000;
    private static final int BUFFER_LENGTH = 100_000;

    private final Buffer fBuffer = new Buffer();

    public void reset() {fBuffer.reset();}
    public void addValueBuf1(long val) {fBuffer.add1(val);}
    public void addValueBuf2(long val) {fBuffer.add2(val);}

    public static final class Buffer {

        private int fIdx = 0;
        private long[] fBuffer1 = new long[BUFFER_LENGTH * 2];
        private long[] fBuffer2 = new long[BUFFER_LENGTH];

        public void reset() {fIdx = 0;}

        public void add1(long value) {
            ensureRemainingBuf1(1);
            fBuffer1[fIdx++] = value;
        }

        public void add2(long value) {
            ensureRemainingBuf2(1);
            fBuffer2[fIdx++] = value;
        }

        private void ensureRemainingBuf1(int remaining) {
            if (remaining > fBuffer1.length - fIdx) {
                System.out.println("make buffer 1 larger");
                fBuffer1 = new long[(fIdx + remaining) << 1];
            }
        }

        private void ensureRemainingBuf2(int remaining) {
            if (remaining > fBuffer2.length - fIdx) {
                System.out.println("make buffer 2 larger");
                fBuffer2 = new long[(fIdx + remaining) << 1];
            }
        }

    }

    public static void main(String[] args) {
        Locale.setDefault(Locale.ENGLISH);
        final Collector collector = new Collector();
        if (SELECT_QUICK) {
            while (true) {
                final long start = System.nanoTime();
                for (long j = 0L; j < LOOP_COUNT; j++) {
                    collector.reset();
                    for (int k = 0; k < VALUE_COUNT; k++) {
                        collector.addValueBuf1(k);
                    }
                }
                final long nanos = System.nanoTime() - start;
                System.out.printf("duration buf1: %1$,.3fms (%2$,.1fs)%n",
                    nanos / 1_000_000d, nanos / 1_000_000_000d);
            }
        } else {
            while (true) {
                final long start = System.nanoTime();
                for (long j = 0L; j < LOOP_COUNT; j++) {
                    collector.reset();
                    for (int k = 0; k < VALUE_COUNT; k++) {
                        collector.addValueBuf2(k);
                    }
                }
                final long nanos = System.nanoTime() - start;
                System.out.printf("duration buf2 (resized): %1$,.3fms (%2$,.1fs)%n",
                    nanos / 1_000_000d, nanos / 1_000_000_000d);
            }
        }
    }

}

ANSWER

Answered 2022-Feb-04 at 18:19

JIT compilation in HotSpot JVM is 1) based on runtime profile data; 2) uses speculative optimizations.

Once the method is compiled at the maximum optimization level, HotSpot stops profiling this code, so it is never recompiled afterwards, no matter how long the code runs. (The exception is when the method needs to be deoptimized or unloaded, but it's not your case).

In the first case (SELECT_QUICK == true), the condition remaining > fBuffer1.length - fIdx is never met, and HotSpot JVM is aware of that from profiling data collected at lower tiers. So it speculatively hoists the check out of the loop, and compiles the loop body with the assumption that array index is always within bounds. After the optimization, the loop is compiled like this (in pseudocode):

if (VALUE_COUNT > collector.fBuffer.fBuffer1.length) {
    uncommon_trap();
}
for (int k = 0; k < VALUE_COUNT; k++) {
    collector.fBuffer.fBuffer1[k] = k;  // no bounds check
}

In the second case (SELECT_QUICK == false), on the contrary, HotSpot knows that condition remaining > fBuffer2.length - fIdx is sometimes met, so it cannot eliminate the check.

Since fIdx is not the loop counter, HotSpot is apparently not smart enough to split the loop into two parts (with and without bounds check). However, you can help the JIT compiler by splitting the loop manually:

for (long j = 0L; j < LOOP_COUNT; j++) {
    collector.reset();

    int fastCount = Math.min(collector.fBuffer.fBuffer2.length, VALUE_COUNT);
    for (int k = 0; k < fastCount; k++) {
        collector.addValueBuf2Fast(k);
    }

    for (int k = fastCount; k < VALUE_COUNT; k++) {
        collector.addValueBuf2(k);
    }
}

where addValueBuf2Fast inserts a value without bounds check:

    public void addValueBuf2Fast(long val) {fBuffer.add2Fast(val);}

    public static final class Buffer {
        ...
        public void add2Fast(long value) {
            fBuffer2[fIdx++] = value;
        }
    }

This should dramatically improve performance of the loop:

1duration buf1: 8,890.551ms (8.9s)
2duration buf1: 8,339.755ms (8.3s)
3duration buf1: 8,620.633ms (8.6s)
4duration buf1: 8,682.809ms (8.7s)
5...
6make buffer 2 larger
7duration buf2 (resized): 19,542.750ms (19.5s)
8duration buf2 (resized): 22,423.529ms (22.4s)
9duration buf2 (resized): 22,413.364ms (22.4s)
10duration buf2 (resized): 22,219.383ms (22.2s)
11...
make buffer 2 larger
duration buf2 (resized): 5,537.681ms (5.5s)
duration buf2 (resized): 5,461.519ms (5.5s)
duration buf2 (resized): 5,450.445ms (5.5s)

Source https://stackoverflow.com/questions/70986856

QUESTION

How to predict actual future values after testing the trained LSTM model?

Asked 2021-Dec-22 at 10:12

I have trained my stock price prediction model by splitting the dataset into train & test. I have also tested the predictions by comparing the valid data with the predicted data, and the model works fine. But I want to predict actual future values.

What do I need to change in my code below?

How can I make predictions up to a specific date in the actual future?


Code (in a Jupyter Notebook):

(To run the code, try it with a similar CSV file of your own, or install the nsepy Python library with pip install nsepy.)

# imports
import pandas as pd  # data processing
import numpy as np  # linear algebra
import matplotlib.pyplot as plt  # plotting
from datetime import date  # date
from nsepy import get_history  # NSE historical data
from keras.models import Sequential  # neural network
from keras.layers import LSTM, Dropout, Dense  # LSTM layer
from sklearn.preprocessing import MinMaxScaler  # scaling

nseCode = 'TCS'
stockTitle = 'Tata Consultancy Services'

# API call
apiData = get_history(symbol = nseCode, start = date(2017,1,1), end = date(2021,12,19))
data = apiData  # copy the dataframe (not necessary)

# remove columns you don't need
del data['Symbol']
del data['Series']
del data['Prev Close']
del data['Volume']
del data['Turnover']
del data['Trades']
del data['Deliverable Volume']
del data['%Deliverble']

# store the data in a csv file
data.to_csv('infy2.csv')

# Read the csv file
data = pd.read_csv('infy2.csv')

# convert the date column to datetime; if you read data from csv, do this. Otherwise, no need if you read data from API
data['Date'] = pd.to_datetime(data['Date'], format = '%Y-%m-%d')
data.index = data['Date']

# plot
plt.xlabel('Date')
plt.ylabel('Close Price (Rs.)')
data['Close'].plot(legend = True, figsize = (10,6), title = stockTitle, grid = True, color = 'blue')

# Sort data into Date and Close columns
data2 = data.sort_index(ascending = True, axis = 0)

newData = pd.DataFrame(index = range(0,len(data2)), columns = ['Date', 'Close'])

for i in range(0, len(data2)):  # only if you read data from csv
    newData['Date'][i] = data2['Date'][i]
    newData['Close'][i] = data2['Close'][i]

# Calculate the row number to split the dataset into train and test
split = len(newData) - 100

# normalize the new dataset
scaler = MinMaxScaler(feature_range = (0, 1))
finalData = newData.values

trainData = finalData[0:split, :]
validData = finalData[split:, :]

newData.index = newData.Date
newData.drop('Date', axis = 1, inplace = True)
scaler = MinMaxScaler(feature_range = (0, 1))
scaledData = scaler.fit_transform(newData)

xTrainData, yTrainData = [], []

for i in range(60, len(trainData)):  # data-flair has used 60 instead of 30
    xTrainData.append(scaledData[i-60:i, 0])
    yTrainData.append(scaledData[i, 0])

xTrainData, yTrainData = np.array(xTrainData), np.array(yTrainData)

xTrainData = np.reshape(xTrainData, (xTrainData.shape[0], xTrainData.shape[1], 1))

# build and train the LSTM model
lstmModel = Sequential()
lstmModel.add(LSTM(units = 50, return_sequences = True, input_shape = (xTrainData.shape[1], 1)))
lstmModel.add(LSTM(units = 50))
lstmModel.add(Dense(units = 1))

inputsData = newData[len(newData) - len(validData) - 60:].values
inputsData = inputsData.reshape(-1,1)
inputsData = scaler.transform(inputsData)

lstmModel.compile(loss = 'mean_squared_error', optimizer = 'adam')
lstmModel.fit(xTrainData, yTrainData, epochs = 1, batch_size = 1, verbose = 2)

# Take a sample of a dataset to make predictions
xTestData = []

for i in range(60, inputsData.shape[0]):
    xTestData.append(inputsData[i-60:i, 0])

xTestData = np.array(xTestData)

xTestData = np.reshape(xTestData, (xTestData.shape[0], xTestData.shape[1], 1))

predictedClosingPrice = lstmModel.predict(xTestData)
predictedClosingPrice = scaler.inverse_transform(predictedClosingPrice)

# visualize the results
trainData = newData[:split]
validData = newData[split:]

validData['Predictions'] = predictedClosingPrice

plt.xlabel('Date')
plt.ylabel('Close Price (Rs.)')

trainData['Close'].plot(legend = True, color = 'blue', label = 'Train Data')
validData['Close'].plot(legend = True, color = 'green', label = 'Valid Data')
validData['Predictions'].plot(legend = True, figsize = (12,7), grid = True, color = 'orange', label = 'Predicted Data', title = stockTitle)

ANSWER

Answered 2021-Dec-22 at 10:12

The idea is to generate forecasts beyond the last observed date by recursively feeding each prediction back into the model as the next input. Below is an example of how you could implement this approach for your model:

import pandas as pd
import numpy as np
from datetime import date
from nsepy import get_history
from keras.models import Sequential
from keras.layers import LSTM, Dense
from sklearn.preprocessing import MinMaxScaler
pd.options.mode.chained_assignment = None

# load the data
stock_ticker = 'TCS'
stock_name = 'Tata Consultancy Services'
train_start = date(2017, 1, 1)
train_end = date.today()
data = get_history(symbol=stock_ticker, start=train_start, end=train_end)
data.index = pd.DatetimeIndex(data.index)
data = data[['Close']]

# scale the data
scaler = MinMaxScaler(feature_range=(0, 1)).fit(data)
z = scaler.transform(data)

# extract the input sequences and target values
window_size = 60

x, y = [], []

for i in range(window_size, len(z)):
    x.append(z[i - window_size: i])
    y.append(z[i])

x, y = np.array(x), np.array(y)

# build and train the model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=x.shape[1:]))
model.add(LSTM(units=50))
model.add(Dense(units=1))
model.compile(loss='mse', optimizer='adam')
model.fit(x, y, epochs=100, batch_size=128, verbose=1)

# generate the multi-step forecasts
def multi_step_forecasts(n_past, n_future):

    x_past = x[- n_past - 1:, :, :][:1]  # last observed input sequence
    y_past = y[- n_past - 1]             # last observed target value
    y_future = []                        # predicted target values

    for i in range(n_past + n_future):

        # feed the last forecast back to the model as an input
        x_past = np.append(x_past[:, 1:, :], y_past.reshape(1, 1, 1), axis=1)

        # generate the next forecast
        y_past = model.predict(x_past)

        # save the forecast
        y_future.append(y_past.flatten()[0])

    # transform the forecasts back to the original scale
    y_future = scaler.inverse_transform(np.array(y_future).reshape(-1, 1)).flatten()

    # add the forecasts to the data frame
    df_past = data.rename(columns={'Close': 'Actual'}).copy()

    df_future = pd.DataFrame(
        index=pd.bdate_range(start=data.index[- n_past - 1] + pd.Timedelta(days=1), periods=n_past + n_future),
        columns=['Forecast'],
        data=y_future
    )

    return df_past.join(df_future, how='outer')

# forecast the next 30 days
df1 = multi_step_forecasts(n_past=0, n_future=30)
df1.plot(title=stock_name)

# forecast the last 20 days and the next 30 days
df2 = multi_step_forecasts(n_past=20, n_future=30)
df2.plot(title=stock_name)
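
To forecast up to a specific calendar date rather than a fixed number of steps, n_future can be derived from the number of business days between the last observed date and the target date. This is only a sketch reusing data, multi_step_forecasts and stock_name from the code above; the target date is a hypothetical example, and it assumes one forecast step per business day, as in the data frame built above.

import numpy as np
import pandas as pd

target_date = pd.Timestamp('2022-03-31')   # hypothetical target date
last_date = data.index[-1]                 # last observed trading day
n_future = int(np.busday_count(last_date.date(), target_date.date()))

df_to_date = multi_step_forecasts(n_past=0, n_future=n_future)
df_to_date.plot(title=stock_name)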

Source https://stackoverflow.com/questions/70420155

QUESTION

How to deploy sagemaker.workflow.pipeline.Pipeline?

Asked 2021-Dec-09 at 18:06

I have a sagemaker.workflow.pipeline.Pipeline which contains multiple sagemaker.workflow.steps.ProcessingStep and each ProcessingStep contains sagemaker.processing.ScriptProcessor.

The current pipeline graph (image omitted here) takes data from multiple sources in S3, processes it, and creates a final dataset using the data from the previous steps.

As the Pipeline object doesn't support a .deploy() method, how do I deploy this pipeline?

At inference/scoring time, when we receive raw data (a single row for each source), how do we trigger the pipeline?

Or is a SageMaker Pipeline designed only for data processing and model training on huge/batch data, and not for inference on a single data point?

ANSWER

Answered 2021-Dec-09 at 18:06

As the Pipeline object doesn't support a .deploy() method, how do I deploy this pipeline?

Pipeline does not have a .deploy() method, no.

Use pipeline.upsert(role_arn='...') to create/update the pipeline definition in SageMaker, then call pipeline.start(). Docs here.
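
A minimal sketch of that flow (assuming `pipeline` is the sagemaker.workflow.pipeline.Pipeline object you already built from your ProcessingSteps, and that the role ARN below is a placeholder):

# `pipeline` is the sagemaker.workflow.pipeline.Pipeline defined earlier in your code
role_arn = "arn:aws:iam::111122223333:role/MySageMakerExecutionRole"  # hypothetical ARN

pipeline.upsert(role_arn=role_arn)   # create or update the pipeline definition in SageMaker
execution = pipeline.start()         # start an execution of the workflow
execution.wait()                     # optionally block until the execution finishes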

At inference/scoring time, when we receive raw data (a single row for each source), how do we trigger the pipeline?

There are actually two types of pipelines in SageMaker: Model Building Pipelines (which is what you have in your question) and Serial Inference Pipelines, which are used for inference. AWS definitely should have called the former "workflows".

You can use a model building pipeline to set up a serial inference pipeline.

To do pre-processing in a serial inference pipeline, you want to train an encoder/estimator (such as SKLearn) and save its model. Then train a learning algorithm and save its model, and finally create a PipelineModel using both models.
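
A hedged sketch of that serial inference setup (the model artifact paths, inference script, image URI and role below are hypothetical placeholders, not values from the question):

from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel
from sagemaker.sklearn.model import SKLearnModel

role = "arn:aws:iam::111122223333:role/MySageMakerExecutionRole"  # hypothetical role

# Pre-processing step: an SKLearn model whose inference script transforms raw rows
preprocessor = SKLearnModel(
    model_data="s3://my-bucket/preprocessor/model.tar.gz",  # hypothetical artifact
    role=role,
    entry_point="preprocess.py",                            # hypothetical script
    framework_version="0.23-1",
)

# Learning-algorithm step: the trained estimator's model artifact
estimator_model = Model(
    image_uri="<algorithm-image-uri>",                      # hypothetical image URI
    model_data="s3://my-bucket/estimator/model.tar.gz",     # hypothetical artifact
    role=role,
)

# Chain both models behind a single real-time endpoint
serial_pipeline = PipelineModel(
    name="my-serial-inference-pipeline",
    role=role,
    models=[preprocessor, estimator_model],
)
serial_pipeline.deploy(initial_instance_count=1, instance_type="ml.m5.large")

A single invocation of the resulting endpoint then runs the pre-processing model and the estimator in sequence on one data point.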

Source https://stackoverflow.com/questions/70287087

QUESTION

How reproducible / deterministic is Parquet format?

Asked 2021-Dec-09 at 03:55

I'm seeking advice from people deeply familiar with the binary layout of Apache Parquet:

Given a data transformation F(a) = b, where F is fully deterministic and the exact same versions of the entire software stack (framework, Arrow & Parquet libraries) are used, how likely am I to get an identical binary representation of dataframe b on different hosts every time b is saved into Parquet?

In other words, how reproducible is Parquet at the binary level? When the data is logically the same, what can cause binary differences?

  • Can there be some uninitialized memory in between values due to alignment?
  • Assuming all serialization settings (compression, chunking, use of dictionaries, etc.) are the same, can the result still drift?
Context

I'm working on a system for fully reproducible and deterministic data processing and computing dataset hashes to assert these guarantees.

My key goal has been to ensure that dataset b contains an identical set of records to dataset b' - this is of course very different from hashing a binary representation of Arrow/Parquet. Not wanting to deal with the reproducibility of storage formats, I've been computing logical data hashes in memory. This is slow but flexible; e.g. my hash stays the same even if records are re-ordered (which I consider an equivalent dataset).
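
As a toy illustration of such an order-independent logical hash (this is not the algorithm used by arrow-digest; the record representation, canonicalization and combining function here are simplifying assumptions):

import hashlib

def record_digest(record: dict) -> int:
    # Canonicalize the record (sorted field names, stable textual form), then hash it.
    canonical = repr(sorted(record.items())).encode("utf-8")
    return int.from_bytes(hashlib.sha256(canonical).digest(), "big")

def logical_dataset_hash(records) -> str:
    # Combine per-record digests with modular addition so record order does not matter.
    combined = sum(record_digest(r) for r in records) % (1 << 256)
    return combined.to_bytes(32, "big").hex()

a = [{"id": 1, "x": 2.5}, {"id": 2, "x": 7.0}]
b = list(reversed(a))
assert logical_dataset_hash(a) == logical_dataset_hash(b)  # reordering leaves the hash unchanged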

But when thinking about integrating with IPFS and other content-addressable storages that rely on hashes of files - it would simplify the design a lot to have just one hash (physical) instead of two (logical + physical), but this means I have to guarantee that Parquet files are reproducible.


Update

I decided to continue using logical hashing for now.

I've created a new Rust crate arrow-digest that implements the stable hashing for Arrow arrays and record batches and tries hard to hide the encoding-related differences. The crate's README describes the hashing algorithm if someone finds it useful and wants to implement it in another language.

I'll continue to expand the set of supported types as I'm integrating it into the decentralized data processing tool I'm working on.

In the long term, I'm not sure logical hashing is the best way forward - a subset of Parquet that makes some efficiency sacrifices just to make file layout deterministic might be a better choice for content-addressability.

ANSWER

Answered 2021-Dec-05 at 04:30

At least for Arrow's implementation, I would expect (but haven't verified) that the exact same input (including identical metadata), written in the same order and with the same configuration, yields deterministic output (we try not to leave uninitialized values, for security reasons), assuming the chosen compression algorithm also makes that determinism guarantee. It is possible there is some hash-map iteration for metadata or elsewhere that might also break this assumption.

As @Pace pointed out, I would not rely on this and recommend against relying on it. There is nothing in the spec that guarantees it, and since the writer version is persisted when writing a file, you are guaranteed a breakage if you ever decide to upgrade. Things will also break if additional metadata is added or removed (I believe in the past there have been some bug fixes for round-tripping data sets that would have caused non-determinism).

So, in summary, this might or might not work today, but even if it does, I would expect it to be very brittle.

Source https://stackoverflow.com/questions/70220970

QUESTION

Why does KNeighborsClassifier always predict the same number?

Asked 2021-Oct-17 at 14:23

Why does knn always predict the same number? How can I solve this? The dataset is here.

Code:

import numpy as np
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import scipy.io
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from torch.utils.data import Dataset, DataLoader
from sklearn import preprocessing
import torch
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

def load_mat_data(path):
    mat = scipy.io.loadmat(DATA_PATH)
    x,y = mat['data'], mat['class']
    x = x.astype('float32')
    # standardize values
    standardizer = preprocessing.StandardScaler()
    x = standardizer.fit_transform(x)
    return x, standardizer, y

def numpyToTensor(x):
    x_train = torch.from_numpy(x)
    return x_train

class DataBuilder(Dataset):
    def __init__(self, path):
        self.x, self.standardizer, self.y = load_mat_data(DATA_PATH)
        self.x = numpyToTensor(self.x)
        self.len=self.x.shape[0]
        self.y = numpyToTensor(self.y)
    def __getitem__(self,index):
        return (self.x[index], self.y[index])
    def __len__(self):
        return self.len

datasets = ['/home/katerina/Desktop/datasets/GSE75110.mat']

for DATA_PATH in datasets:

    print(DATA_PATH)
    data_set=DataBuilder(DATA_PATH)

    pred_rpknn = [0] * len(data_set.y)
    kf = KFold(n_splits=10, shuffle = True, random_state=7)

    for train_index, test_index in kf.split(data_set.x):
        #Create KNN Classifier
        knn = KNeighborsClassifier(n_neighbors=5)
        #print("TRAIN:", train_index, "TEST:", test_index)
        x_train, x_test = data_set.x[train_index], data_set.x[test_index]
        y_train, y_test = data_set.y[train_index], data_set.y[test_index]
        #Train the model using the training sets
        y1_train = y_train.ravel()
        knn.fit(x_train, y1_train)
        #Predict the response for test dataset
        y_pred = knn.predict(x_test)
        #print(y_pred)
        # Model Accuracy, how often is the classifier correct?
        print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
        c = 0
        for idx in test_index:
            pred_rpknn[idx] = y_pred[c]
            c +=1
    print("Accuracy:",metrics.accuracy_score(data_set.y, pred_rpknn))
    print(pred_rpknn, data_set.y.reshape(1,-1))

Output:

/home/katerina/Desktop/datasets/GSE75110.mat
Accuracy: 0.2857142857142857
Accuracy: 0.38095238095238093
Accuracy: 0.14285714285714285
Accuracy: 0.4
Accuracy: 0.3
Accuracy: 0.25
Accuracy: 0.3
Accuracy: 0.6
Accuracy: 0.25
Accuracy: 0.45
Accuracy: 0.33497536945812806
[3, 3, 3, ..., 3]  (every predicted label is 3)

I am trying to combine KNN with k-fold in order to test the whole dataset using 10 folds. The problem is that KNN always predicts arrays of 3's for each fold. The classes I want to predict are these:

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3,3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]]

ANSWER

Answered 2021-Oct-17 at 07:36

TL;DR

It has to do with the StandardScaler; change it to a simple normalization, e.g.:

from sklearn import preprocessing

...

x = preprocessing.normalize(x)

Explanation:

StandardScaler, as you use it, will do:

The standard score of a sample `x` is calculated as:

    z = (x - u) / s

where `u` is the mean of the training samples or zero if `with_mean=False`,
and `s` is the standard deviation of the training samples or one if
`with_std=False`.

But you actually want these features to help KNN decide which vectors are closer to each other.

With normalize, the normalization happens for each vector separately, so it doesn't hurt, and even helps, KNN to differentiate the vectors.

With KNN, StandardScaler can actually harm your predictions; it is better suited to other kinds of data.
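
To make the difference concrete, here is a small self-contained comparison on toy data (not the GSE75110 dataset): StandardScaler standardizes each feature column across samples, while normalize rescales each sample row to unit norm and therefore preserves the relative profile of each vector.

import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

x = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0]])

# Column-wise standardization: every feature gets zero mean and unit variance.
print(StandardScaler().fit_transform(x))

# Row-wise normalization (default L2): each sample is scaled to unit length,
# so the shape of each individual vector is preserved.
print(normalize(x))

Here is the full corrected code and the resulting output: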

import scipy.io
from torch.utils.data import Dataset
from sklearn import preprocessing
import torch
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

def load_mat_data(path):
    mat = scipy.io.loadmat(DATA_PATH)
    x, y = mat['data'], mat['class']
    x = x.astype('float32')
    # normalize values
    x = preprocessing.normalize(x)
    return x, y

def numpyToTensor(x):
    x_train = torch.from_numpy(x)
    return x_train

class DataBuilder(Dataset):
    def __init__(self, path):
        self.x, self.y = load_mat_data(DATA_PATH)
        self.x = numpyToTensor(self.x)
        self.len=self.x.shape[0]
        self.y = numpyToTensor(self.y)
    def __getitem__(self,index):
        return (self.x[index], self.y[index])
    def __len__(self):
        return self.len

datasets = ['/home/katerina/Desktop/datasets/GSE75110.mat']

for DATA_PATH in datasets:

    print(DATA_PATH)
    data_set=DataBuilder(DATA_PATH)

    pred_rpknn = [0] * len(data_set.y)
    kf = KFold(n_splits=10, shuffle = True, random_state=7)

    for train_index, test_index in kf.split(data_set.x):
        #Create KNN Classifier
        knn = KNeighborsClassifier(n_neighbors=5)
        #print("TRAIN:", train_index, "TEST:", test_index)
        x_train, x_test = data_set.x[train_index], data_set.x[test_index]
        y_train, y_test = data_set.y[train_index], data_set.y[test_index]
        #Train the model using the training sets
        y1_train = y_train.view(-1)
        knn.fit(x_train, y1_train)
        #Predict the response for test dataset
        y_pred = knn.predict(x_test)
        #print(y_pred)
        # Model Accuracy, how often is the classifier correct?
        print("Accuracy in loop:", metrics.accuracy_score(y_test, y_pred))
        c = 0
        for idx in test_index:
            pred_rpknn[idx] = y_pred[c]
            c +=1
    print("Accuracy:",metrics.accuracy_score(data_set.y, pred_rpknn))
    print(pred_rpknn, data_set.y.reshape(1,-1))


Accuracy in loop: 1.0
Accuracy in loop: 0.8571428571428571
Accuracy in loop: 0.8571428571428571
Accuracy in loop: 1.0
Accuracy in loop: 0.9
Accuracy in loop: 0.9
Accuracy in loop: 0.95
Accuracy in loop: 1.0
Accuracy in loop: 0.9
Accuracy in loop: 1.0
Accuracy: 0.9359605911330049

Source https://stackoverflow.com/questions/69599448

QUESTION

Is there a way to accomplish multithreading or parallel processes in a batch file?

Asked 2021-Oct-14 at 17:38

So I have a batch file that is executing a simulation given some input parameters and then processing the output data via awk, R, and Python. At the moment the input parameters are passed into the simulation through some nested for loops and each iteration of the simulation will be run one after the other. I would like for the execution of the simulation to be done in parallel because at the moment there are 1,000+ cases so in my mind I could have core 1 handle sims 1-250, core 2 handle sims 251-500, etc.

In essence what I would like to do is this:

  1. Run every case of the simulation across multiple cores
  2. Once every simulation has been completed, start the output data processing

I've tried using start /affinity n simulation.exe, but the issue here is that all of the simulations are executed simultaneously, so when it gets to the post-processing calls, it errors out because the data hasn't been generated yet. There is the start /w command, but I'm not sure that improves things. One idea I've thought of is updating a variable once each simulation has completed, and only starting the post-processing once the variable reaches n runs.

Here is an excerpt of what I am doing right now:

    for %%f in (1 2 3) do (
            for %%a in (4 5 6) do (
                for %%b in (7 8 9) do (
                    call :mission %%f %%a %%b
                )
            )
         )
    some gawk scripts
    some python scripts
    some r scripts
    goto :exit

:mission
   sed -e 's/text1/%1/' -e 's/text2/%2/' -e 's/text3/%3/'
   simulation.exe
   goto :exit

:exit

And here's what I was playing around with to test out some parallel processing:

start /affinity 1 C:\Users\614890\R-4.1.1\bin\Rscript.exe test1.R
start /affinity 2 C:\Users\614890\R-4.1.1\bin\Rscript.exe test2.R
start /affinity 3 C:\Users\614890\R-4.1.1\bin\Rscript.exe test3.R
start /affinity 4 C:\Users\614890\R-4.1.1\bin\Rscript.exe test4.R

C:\Users\614890\R-4.1.1\bin\Rscript.exe plotting.R

ANSWER

Answered 2021-Oct-14 at 17:38

I was actually able to accomplish this by doing the following:

setlocal
set "lock=%temp%\wait%random%.lock"

:: Launch processes asynchronously, with stream 9 redirected to a lock file.
:: The lock file will remain locked until the script ends
start /affinity 1 9>"%lock%1" Rscript test1.R
start /affinity 2 9>"%lock%2" Rscript test2.R
start /affinity 4 9>"%lock%3" Rscript test3.R
start /affinity 8 9>"%lock%4" Rscript test4.R

:Wait for all processes to finish
1>nul 2>nul ping /n 2 ::1
for %%F in ("%lock%*") do (
 (call ) 9>"%%F" || goto :Wait
) 2>nul

del "%lock%*"

Rscript plotting.R

Source https://stackoverflow.com/questions/69381811

QUESTION

Simple calculation for all combination (brute force) of elements in two arrays, for better performance, in Julia

Asked 2021-Oct-06 at 15:49

I am new to Julia (some experience with Python). The main reason I am starting to use Julia is better performance for large scale data processing.

I want to get differences of values (int) of all possible combinations in two arrays.

Say I have two arrays.

a = [5,4]
b = [2,1,3]

Then I want to have the differences of all combinations, like a[1] - b[1], a[1] - b[2], ..., a[2] - b[2], a[2] - b[3].

The result will be a 3x2 array: [3 2; 4 3; 2 1]

Then something I came up is

a = [5,4]
b = [2,1,3]
diff_matrix = zeros(Int8, size(b)[1], size(a)[1])
for ia in eachindex(a)
    for ib in eachindex(b)
        diff_matrix[ib,ia]= a[ia] - b[ib]
    end
end
println(diff_matrix)

It works, but it uses nested iteration and I assume the performance will not be great. In the real application the arrays will be longer (a few hundred elements), and this process needs to be done for millions of combinations of arrays.

Is there any better (better performance, simpler code) approach for this task?

ANSWER

Answered 2021-Oct-06 at 15:49

If you wrapped the code in a function, your code would already be reasonably fast.

This is exactly the power of Julia: loops are fast. The only thing you need to avoid is using global variables in computations (as they lead to code that is not type stable).

I write that the code would be "reasonably fast", as it could be made faster by some low-level tricks. However, in this case you could just write:

julia> a = [5,4]
2-element Vector{Int64}:
 5
 4

julia> b = [2,1,3]
3-element Vector{Int64}:
 2
 1
 3

julia> permutedims(a) .- b
3×2 Matrix{Int64}:
 3  2
 4  3
 2  1

and this code will be fast (and much simpler as a bonus).

Source https://stackoverflow.com/questions/69468471

QUESTION

Remove strange character from tokenization array

Asked 2021-Sep-03 at 19:39

I have a very dirty Pyspark dataframe, i.e. full of weird characters like:
  • ɴɪᴄᴇ ᴏɴᴇ ᴀᴩᴩ
  • பரமசிவம்
  • and many others

I'm doing the data processing and cleaning (tokenization, stopword removal, ...) and this is my dataframe:

content               score  label  classWeigth  words                 filtered              terms_stemmed
absolutely love d...  5      1      0.48         [absolutely, love...  [absolutely, love...  [absolut, love, d...
absolutely love t...  5      1      0.48         [absolutely, love...  [absolutely, love...  [absolut, love, g...
absolutely phenom...  5      1      0.48         [absolutely, phen...  [absolutely, phen...  [absolut, phenome...
absolutely shocki...  1      0      0.52         [absolutely, shoc...  [absolutely, shoc...  [absolut, shock, ...
accept the phone ...  1      0      0.52         [accept, the, pho...  [accept, phone, n...  [accept, phone, n...

How can I access the words column and remove all the weird characters, like the ones mentioned above?

ANSWER

Answered 2021-Sep-03 at 19:39

Try this UDF.

>>> @udf('array<string>')
... def filter_udf(a):
...     from builtins import filter
...     return list(filter(lambda s: s.isascii(), a))
...

>>> df = spark.createDataFrame([(['pyspark','பரமசிவம்'],)])
>>> df.select(filter_udf('_1')).show()
+--------------+
|filter_udf(_1)|
+--------------+
|     [pyspark]|
+--------------+
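
The same UDF can then be applied to the tokenized column of the dataframe from the question (its words column is shown in the schema above; df_tokens below is a hypothetical name for that dataframe). Note that s.isascii() requires Python 3.7+ and drops any token containing non-ASCII characters, including accented words.

>>> df_tokens = df_tokens.withColumn('words', filter_udf('words'))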

Source https://stackoverflow.com/questions/69046942

Community Discussions contain sources that include Stack Exchange Network

Tutorials and Learning Resources in Data Processing

Tutorials and Learning Resources are not available at this moment for Data Processing
