Explore open source data mining software, libraries, packages, source code, cloud functions, and APIs.

Data mining is the process of discovering patterns in large data sets using methods at the intersection of machine learning, statistics, and database systems.

Popular New Releases in Data Mining

bulk-downloader-for-reddit: Bulk Downloader for Reddit 2.2

pipeline: Pipeline v1.6

striplog: v0.9.2

arxiv-miner: Bug Fixes

Snippext_public: Snippext for Rotom

Popular Libraries in Data Mining

snap
by snap-stanford · C++ · 1835 stars · NOASSERTION
Stanford Network Analysis Platform (SNAP) is a general-purpose network analysis and graph mining library.

bulk-downloader-for-reddit
by aliparlakci · Python · 986 stars · GPL-3.0
Downloads and archives content from Reddit.

Apriori
by asaini · Python · 653 stars · MIT
Python implementation of the Apriori algorithm for finding frequent itemsets and association rules (see the sketch after this list).

Book-SocialMediaMiningPython
by bonzanini · Python · 487 stars
Companion code for the book "Mastering Social Media Mining with Python".

pymining
by bartdag · Python · 432 stars · NOASSERTION
A few data mining algorithms in pure Python.

PrefixSpan-py
by chuanconggao · Python · 248 stars · MIT
A short yet efficient Python implementation of the sequential pattern mining algorithm PrefixSpan, the closed sequential pattern mining algorithm BIDE, and the generator sequential pattern mining algorithm FEAT.

chirp
by 9b · Python · 231 stars · MIT
Interface to manage and centralize Google Alert information.

streaminer
by mayconbordin · Java · 181 stars · Apache-2.0
A collection of algorithms for mining data streams.

2017-CCF-BDCI-AIJudge
by ShawnyXiao · Jupyter Notebook · 176 stars · MIT
2017 CCF-BDCI "Let AI Be the Judge" competition (preliminary round): 7th/415 (top 1.68%).
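As referenced in the Apriori entry above, here is a minimal sketch of the counting step at the heart of frequent itemset mining (the transactions and min_support threshold are made up, and this brute-force version skips Apriori's level-wise pruning; the listed libraries expose richer APIs):

from itertools import combinations
from collections import Counter

transactions = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}]
min_support = 2  # an itemset is "frequent" if it occurs in at least 2 transactions

counts = Counter()
for t in transactions:
    for size in (1, 2):                      # count every 1- and 2-item subset
        counts.update(combinations(sorted(t), size))

frequent = {itemset: n for itemset, n in counts.items() if n >= min_support}
print(frequent)  # e.g. ('bread',): 3, ('bread', 'milk'): 2, ...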

Trending New libraries in Data Mining

arxiv-miner
by valayDave · Python · 84 stars · MIT
arxiv_miner is a toolkit for mining research papers on CS arXiv.

Snippext_public
by rit-git · Python · 42 stars · BSD-3-Clause
Snippext: Semi-supervised Opinion Mining with Augmented Data.

Cyber-FastTrack-Spring-2021
by Alic3C · Python · 40 stars
A collection of write-ups and solutions for Cyber FastTrack Spring 2021.

learntidytext
by juliasilge · CSS · 38 stars · CC-BY-4.0
Learn about text mining 📄 with tidy data principles.

VoterFraud2020
by sTechLab · Jupyter Notebook · 38 stars
A multi-modal Twitter dataset with 7.6M tweets and 25.6M retweets related to voter fraud claims.

game-boy-ntgbtminer
by ghidraninja · Python · 27 stars · MIT
The (Python-based) mining software required for the Game Boy mining project.

apriori_python
by chonyy · Python · 22 stars · MIT
🔨 A new and simple Python implementation of the Apriori algorithm.

tweetsOLAPing
by MohamedHmini · Python · 22 stars
An end-to-end tweets ETL/analysis pipeline.

mat_discover
by sparks-baird · Python · 19 stars · MIT
A materials discovery algorithm geared towards exploring high-performance candidates in new chemical spaces.

Top Authors in Data Mining

1. PeerChristensen · 3 libraries · 10 stars
2. zakimjz · 3 libraries · 11 stars
3. ShawnyXiao · 3 libraries · 300 stars
4. juliasilge · 3 libraries · 60 stars
5. gmggroup · 2 libraries · 61 stars
6. lucasxlu · 2 libraries · 42 stars
7. kwartler · 2 libraries · 59 stars
8. kbalog · 2 libraries · 23 stars
9. vigna · 2 libraries · 147 stars
10. kvandake · 2 libraries · 49 stars


Trending Kits in Data Mining

No Trending Kits are available at this moment for Data Mining

Trending Discussions on Data Mining

Unable to install ray[tune] tune-sklearn

Get total no of classes of each subject within a semester using pandas

How to create a frequency table of each subject from a given timetable using pandas?

Regenerate SSAS multidimensional partition files from the database

Counting repeated pairs in a list

Python KeyError: 0 when I use if/elif

React JS floated tag around a component, is position absolute a prudent idea?

Gensim doc2vec's d2v.wv.most_similar() gives irrelevant words with high similarity scores

Creating a CSV file from Python Script

Which is the best Data Mining model to extrapolate known values to missing values in a table? (General question)

QUESTION

Unable to install ray[tune] tune-sklearn

Asked 2022-Mar-14 at 20:10

I'm trying to install ray[tune] and tune-sklearn on my machine, but it keeps failing. I'm using a MacBook Pro 2019 with Big Sur 11.6 and Python 3.9.7 (default, Sep 16 2021, 08:50:36) [Clang 10.0.0] :: Anaconda, Inc. on darwin. All other packages I've tried have installed fine with either conda install or pip install, except for this one, and I'm struggling to find an answer online myself. I was on Python 3.8, but I removed it and installed 3.9 as I thought that was the problem. Apologies in advance; I'm new to data mining and still don't know a great deal.

I tried

conda install -c conda-forge -y ray-tune tune-sklearn

But got back this:

conda install -c conda-forge -y ray-tune tune-sklearn
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - ray-tune

Current channels:

  - https://conda.anaconda.org/conda-forge/osx-64
  - https://conda.anaconda.org/conda-forge/noarch
  - https://repo.anaconda.com/pkgs/main/osx-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/osx-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

I also tried

pip install ray[tune] tune-sklearn

But got back

pip install ray[tune] tune-sklearn
zsh: no matches found: ray[tune]

Any help would be greatly appreciated, thank you.

Update:

I also tried

pip install 'ray[tune]'

And got back

pip install 'ray[tune]'
ERROR: Could not find a version that satisfies the requirement ray[tune] (from versions: none)
ERROR: No matching distribution found for ray[tune]

ANSWER

Answered 2022-Mar-14 at 20:10

Tune is the scalable hyperparameter tuning library within the Ray distributed compute project; ray[tune] is an "extra" of the ray package, not a stand-alone conda package (which is why conda cannot find ray-tune). You should be able to install Ray with the proper dependencies using the command below; quoting the requirement stops zsh from expanding the square brackets as a glob, which is what caused the earlier "no matches found" error:

pip install "ray[tune]"

After Ray has been installed, you can reference it within your Python project using either:

import ray

or

from ray import tune
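Once installed, a minimal sanity-check sketch (the objective function and search space here are invented for illustration; tune.run, tune.report, and tune.grid_search are the Ray 1.x Tune API):

from ray import tune

def trainable(config):
    # toy objective: pretend a learning rate of 0.01 is optimal
    tune.report(score=(config["lr"] - 0.01) ** 2)

analysis = tune.run(trainable, config={"lr": tune.grid_search([0.001, 0.01, 0.1])})
print(analysis.get_best_config(metric="score", mode="min"))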

Source https://stackoverflow.com/questions/71257435

QUESTION

Get total no of classes of each subject within a semester using pandas

Asked 2022-Mar-06 at 08:58

Timetable: columns = hour, rows = weekday, data = subject.

[weekday x hour]

                               1                      2                 3             4                 5                      6                      7
Name
Monday                   Project                Project           Project  Data Science  Embedded Systems            Data Mining  Industrial Psychology
Tuesday                  Project                Project           Project       Project      Data Science  Industrial Psychology       Embedded Systems
Wednesday           Data Science                Project           Project       Project           Project                Project                Project
Thursday             Data Mining  Industrial Psychology  Embedded Systems   Data Mining           Project                Project                Project
Friday     Industrial Psychology       Embedded Systems      Data Science   Data Mining           Project                Project                Project

Frequency table: rows = weekday, columns = subject, data = how often the subject occurs on that weekday.

[weekday x subject]

Data       Data Mining  Data Science  Embedded Systems  Industrial Psychology  Project
Name
Friday               1             1                 1                      1        3
Monday               1             1                 1                      1        3
Thursday             2             0                 1                      1        3
Tuesday              0             1                 1                      1        4
Wednesday            0             1                 0                      0        6

Code

self.start = datetime(2022, 1, 1)
self.end = datetime(2022, 3, 31)

self.file = 'timetable.csv'
self.sdf = pd.read_csv(self.file, header=0, index_col="Name")
self.subject_frequency = self.sdf.apply(pd.value_counts).fillna(0)
print(self.subject_frequency.to_string())
self.subject_frequency["sum"] = self.subject_frequency.sum(axis=1)

self.p = self.sdf.melt(var_name='Freq', value_name='Data', ignore_index=False).assign(variable=1)\
            .pivot_table('Freq', 'Name', 'Data', fill_value=0, aggfunc='count')
print(self.p.to_string())

Required Table

                       classes ...
Data Mining            32
Data Science           32
Embedded Systems       32
Industrial Psychology  32
Project                146

I will be adding more columns later, like current attendance percentage, the percentage drop for each class missed, the percentage lost by taking leave on Monday, Tuesday, etc., so that I can subtract them from the attendance percentage.

The end goal is to analyse which day is safest for taking leave, and to monitor my attendance percentage. If my direction could be better, please advise me.

ANSWER

Answered 2022-Mar-06 at 07:11

select_rows = [date.strftime("%A") for date in pd.bdate_range(self.start, self.end)]
r = self.p.loc[select_rows, :]
print(r.to_string())
print(r.sum())

Please feel free to add simpler code; design advice is also appreciated!
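A possibly simpler alternative (a sketch, assuming self.sdf is the weekday-by-hour timetable and self.start/self.end are the semester bounds from the snippet above): count subjects per weekday row, then weight by how often each weekday occurs in the date range:

freq = self.sdf.apply(pd.Series.value_counts, axis=1).fillna(0).astype(int)       # weekday x subject
days = pd.Series(pd.bdate_range(self.start, self.end).strftime("%A")).value_counts()
totals = freq.mul(days, axis=0).sum().astype(int)                                 # classes per subject
print(totals.to_string())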

Source https://stackoverflow.com/questions/71364481

QUESTION

How to create a frequency table of each subject from a given timetable using pandas?

Asked 2022-Mar-05 at 16:06

This is a timetable: columns = hour, rows = weekday, data = subject. [weekday x hour]

                               1                      2                 3             4                 5                      6                      7
Name
Monday                   Project                Project           Project  Data Science  Embedded Systems            Data Mining  Industrial Psychology
Tuesday                  Project                Project           Project       Project      Data Science  Industrial Psychology       Embedded Systems
Wednesday           Data Science                Project           Project       Project           Project                Project                Project
Thursday             Data Mining  Industrial Psychology  Embedded Systems   Data Mining           Project                Project                Project
Friday     Industrial Psychology       Embedded Systems      Data Science   Data Mining           Project                Project                Project

How do you generate a pandas.DataFrame where rows = weekday, columns = subject, and data = the frequency of the subject on that weekday?

Required table: [weekday x subject]

              Data Mining, Data Science, Embedded Systems, Industrial Psychology, Project
Name
Monday           1          1            1                 1                      3
Tuesday          ...
Wednesday
Thursday
Friday
My attempt so far:

        self.file = 'timetable.csv'
        self.sdf = pd.read_csv(self.file, header=0, index_col="Name")
        print(self.sdf.to_string())
        self.subject_frequency = self.sdf.apply(pd.value_counts)
        print(self.subject_frequency.to_string())
        self.subject_frequency["sum"] = self.subject_frequency.sum(axis=1)

ANSWER

Answered 2022-Mar-05 at 16:06

Use melt to flatten your dataframe, then pivot_table to reshape it:

out = (
  df.melt(var_name='Freq', value_name='Data', ignore_index=False).assign(variable=1)
    .pivot_table('Freq', 'Name', 'Data', fill_value=0, aggfunc='count')
    .loc[df.index]  # sort by the original index: Monday > Tuesday > ...
)

Output:

>>> out
Data       Data Mining  Data Science  Embedded Systems  Industrial Psychology  Project
Name
Monday               1             1                 1                      1        3
Tuesday              0             1                 1                      1        4
Wednesday            0             1                 0                      0        6
Thursday             2             0                 1                      1        3
Friday               1             1                 1                      1        3
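An equivalent route (a sketch over the same df) is pd.crosstab on the melted frame:

m = df.melt(value_name='Data', ignore_index=False).reset_index()
out = pd.crosstab(m['Name'], m['Data']).loc[df.index]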

Source https://stackoverflow.com/questions/71363338

QUESTION

Regenerate SSAS multidimensional partition files from the database

Asked 2022-Feb-25 at 05:11

I have an SSAS cube and I want to create the solution with SSDT in Visual Studio. I need to generate the .partitions files of the cube.

When I do New Project -> Import from Server (Multidimensional and Data Mining), the project is created, but the .partitions files are empty (2 KB).

I tried with VS 2019, VS 2017, and BIDS 2008 R2; it's always the same problem.

Any idea about this issue?

ANSWER

Answered 2022-Feb-25 at 05:11

This is a known issue when importing from an SSAS database containing custom partitions.

To get the correct partitions, just open the cube (in the Visual Studio solution) and navigate to the Partitions tab.

The moment you select the Partitions tab, you will notice the "star" symbol in the tab, denoting that the project has been updated.

The partitions file is now updated with the latest partitions.


Source https://stackoverflow.com/questions/71255552

QUESTION

Counting repeated pairs in a list

Asked 2022-Feb-15 at 03:11

I have an assignment that has a data mining element. I need to find which authors collaborate the most across several publication webpages.

I've scraped the webpages and compiled the author text into a list.

My current output looks like this:

for author in list:
   print(author)

## output:
['Author 1', 'Author 2', 'Author 3']
['Author 2', 'Author 4', 'Author 1']
['Author 1', 'Author 5', 'Author 6', 'Author 7', 'Author 4']

and so on for ~100 more rows.

My idea is, for each section of the list, to produce another list that contains each of the unique pairs in that section. E.g. the third demo row would give 'Author 1 + Author 5', 'Author 1 + Author 6', 'Author 1 + Author 7', 'Author 1 + Author 4', 'Author 5 + Author 6', 'Author 5 + Author 7', 'Author 5 + Author 4', 'Author 6 + Author 7', 'Author 6 + Author 4', 'Author 7 + Author 4'. Then I'd append these pair lists to one large list and put it through a counter to see which pairs came up the most.

The problem is I'm just not sure how to actually implement that pair matcher, so if anyone has any pointers that would be great. I'm sure it can't be that complicated an answer, but I've been unable to find it. Alternative ideas on how to measure collaboration would be good too.

ANSWER

Answered 2022-Feb-14 at 21:36

You could use a dictionary where the pair is the key and the number of times it occurs is the value. You'll need to make sure that you always generate the same key for (Author1, Author2) and (Author2, Author1), but you could use alphabetical ordering to deal with that.

Then you simply increment the number stored for the pair whenever you encounter it.
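A minimal sketch of that idea (author_lists stands in for your scraped rows; collections.Counter plays the role of the dictionary):

from itertools import combinations
from collections import Counter

author_lists = [
    ['Author 1', 'Author 2', 'Author 3'],
    ['Author 2', 'Author 4', 'Author 1'],
    ['Author 1', 'Author 5', 'Author 6', 'Author 7', 'Author 4'],
]

pair_counts = Counter()
for authors in author_lists:
    # sorted() canonicalises each pair, so (A, B) and (B, A) share one key
    pair_counts.update(combinations(sorted(set(authors)), 2))

print(pair_counts.most_common(3))  # the most frequent collaborations first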

Source https://stackoverflow.com/questions/71118511

QUESTION

Python KeyError: 0 when I use if/elif

Asked 2021-Dec-24 at 12:26

I am using Python to make a simple application for data mining, coded in Google Colab. I use elif in my function; here is the code:

def data_pred(data):
  # split(data)
  X_train, y_train, X_test, y_test = split(data)

  linreg = LinearRegression()
  linreg.fit(X_train, y_train)
  y_preds = linreg.predict(X_test)

  for x in range(17):
    y_test = np.insert(y_test, len(y_test), y_preds[len(y_preds)-1])
    X_test = np.insert(X_test, len(X_test), y_test[len(X_test)-1])
    X_test = np.array(X_test).reshape(X_test.size, 1)
    y_preds = linreg.predict(X_test)

  plt.scatter(X_test, y_test)
  plt.scatter(X_test, y_preds, color='green')
  plt.plot(X_test, y_preds, color="red")
  plt.xlabel("X axis")
  plt.ylabel("Y axis")

  plt.show()

  print("nilai slope/koef/a:",linreg.coef_)
  print("nilai intercept/b :",linreg.intercept_)
  print('Data hasil prediksi :', y_preds)
  print('Data aktual :',y_test)
  print()
  print('MAPE : ', mape(y_test, y_preds))

  if data["Nama Golongan"][0] == "INDUSTRI":
    golongan = data.loc[0:23, "Nama Golongan"]
  elif data["Nama Golongan"][44] == "INSTANSI PEMERINTAH":
    golongan = data.loc[44:67, "Nama Golongan"]
  elif data["Nama Golongan"][88] == "NIAGA KECIL":
    golongan = data.loc[88:111, "Nama Golongan"]
  elif data["Nama Golongan"][132] == "RUMAH MENENGAH":
    golongan = data.loc[132:155, "Nama Golongan"]
  elif data["Nama Golongan"][176] == "RUMAH MEWAH":
    golongan = data.loc[176:119, "Nama Golongan"]
  elif data["Nama Golongan"][220] == "SOSIAL KHUSUS":
    golongan = data.loc[220:243, "Nama Golongan"]
  elif data["Nama Golongan"][264] == "TOTAL PERBULAN":
    golongan = data.loc[264:287, "Nama Golongan"]

  more code...

When I run:

a = this[this['Nama Golongan'] == 'INDUSTRI']
data_pred(a)

I get the plot and the result without error. But when I run this code:

b = this[this['Nama Golongan'] == 'INSTANSI PEMERINTAH']
data_pred(b)

I get this:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
     2897             try:
  -> 2898                 return self._engine.get_loc(casted_key)
     2899             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

I thought it was caused by the elif code, but I don't know why. Can anyone tell me why this happens and how to fix it? Thanks.

ANSWER

Answered 2021-Dec-24 at 03:27

OK, I finally see the problem. You are extracting a subset of a dataframe and passing it to this function. So data["Nama Golongan"][44] refers to index label 44, because the indices are carried through with the subset; the subset b no longer contains the label 0, which is why the first if line raises KeyError: 0.

Note that data.loc is label-based too, so it also sees those carried-over labels. If you only want the first 24 rows of whichever subset is passed in, you don't need your if sequence at all: reset the index and replace the whole thing with this:

data = data.reset_index(drop=True)
golongan = data.loc[0:23, "Nama Golongan"]

After the reset, the first row label is always 0.
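A minimal reproduction of the underlying behaviour (the toy frame below is made up; only the "Nama Golongan" column name comes from the question):

import pandas as pd

df = pd.DataFrame({"Nama Golongan": ["INDUSTRI"] * 3 + ["INSTANSI PEMERINTAH"] * 3})
b = df[df["Nama Golongan"] == "INSTANSI PEMERINTAH"]

print(b.index.tolist())       # [3, 4, 5]: the original labels survive the filter
# b["Nama Golongan"][0]       # would raise KeyError: 0

b = b.reset_index(drop=True)
print(b["Nama Golongan"][0])  # INSTANSI PEMERINTAH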

Source https://stackoverflow.com/questions/70469250

QUESTION

React JS floated tag around a component, is position absolute a prudent idea?

Asked 2021-Dec-15 at 22:21

I would like to attempt something perhaps complicated: creating the following render (see image below) with React JS. My first thought was to use position: absolute and reposition my divs accordingly. However, that seems difficult at first glance, given the number of tags I want floated around the main component, the responsive aspect, and the fact that nudging each of them by some number of pixels would be an endless task. So I was wondering whether there is a plug-in, or whether you have any suggestions for resolving this particular aspect. Also, if you would like to respond, it is fine to do so with basic coloured squares and rectangles; I am looking forward to learning how to apply such a thing, not the specific design.

Today I have the following, but it would be unmanageable to do this for each tag and hope for the best during responsive resizing.

My current code:

React JS divs:

profile_picture = () => {
return (
  <div className="profilepicturetechstack">
    <div className="home">
      <div className="frame-1-3">
        <img src="./resources/simon-provost-02-min.jpg"  alt="profile_pic"/>
      </div>
      <div className="photo--wrapper--ellipse">
        <p className="text-4">ML/RESEARCH</p>
      </div>
      <p className="text-1">Simon provost</p>
      <p className="text-2">Paris, France</p>
    </div>
    <div className="frame-1-4">
      <p className="text-7">⚙️ Machine Learning</p>
    </div>
    {/*<div className="frame-1-9">
      <p className="text-8">💡 AutoML</p>
    </div>
    <div className="frame-1-5">
      <p className="text-9">⛏ Data Mining</p>
    </div>
    <div className="frame-1-6">
      <p className="text-1-0">🎨  UI.UX</p>
    </div>
    <div className="frame-1-8">
      <p className="text-1-1">🔬 Research</p>
    </div>
    <div className="frame-1-2">
      <img src="" />
    </div>
    <div className="frame-1-7">
      <p className="text-1-3">🌤  MLOps</p>
    </div>*/}
  </div>
)
}

The associated CSS classes:

.profilepicturetechstack {
  display: flex;
  flex-direction: row;
  justify-content: center;
  padding-right: 10%;
}
.home {
  display: flex;
  position: relative;
  flex-direction: column;
  align-items: center;
  justify-content: center;
  margin-right: 100px;

  border-radius: 13px;
  height: 300px;
  width: 350px;
  background-color: #ffffff;
  box-shadow: 0 40px 30px rgba(25, 25, 46, 0.06);
}
.text-1 {
  text-align: center;
  vertical-align: top;
  font-size: 16px;
  font-family: Roboto, serif;

  color: #25323c;
}
.text-2 {
  text-align: left;
  vertical-align: top;
  font-size: 14px;
  margin-top: -15px;
  font-family: Roboto, serif;

  color: #859fb3;
}
.photo--wrapper--ellipse {
  display: flex;
  justify-content: center;
  align-items: center;
  text-align: center;
  margin-top: -15px;
  width: 96px;
  height: 25px;

  background: linear-gradient(135deg, #FF26B2 0%, #851BD9 80%, #3F0FFF 100%);
  opacity: 0.8;
  box-shadow: 0 5px 20px rgba(250, 118, 96, 0.2);
  border-radius: 66px;
}
.img-3 {
  height: 84px;
  width: 84px;
}
.component-/points-/-m {
  opacity: 0.80;
  border-radius: 66px;
  display: flex;
  flex-direction: row;
  justify-content: flex-start;
  align-items: center;
  padding: 6px 10px;
  gap: 7px;
  background-color: red;
}
.text-4 {
  text-align: center;
  vertical-align: top;
  font-size: 11px;
  font-family: Roboto, serif;

  color: #ffffff;
}
.frame-1-3 {
  height: 120px;
  width: 120px;
}

.frame-1-3 img {
  object-fit: contain;
  border-radius: 62px;
  height: 100%;
  width: 100%;
}

.frame-1-1 {
  border-radius: 25px;
  height: 61px;
  width: 61px;
  background-color: rgba(36, 150, 237, 0.5);
}
.img-6 {
  height: 35px;
  width: 37px;
}
.frame-1-4 {
  display: flex;
  position: absolute;
  flex-direction: row;
  justify-content: flex-start;
  align-items: center;
  padding: 16px 24px;
  gap: 10px;
  right: 5%;
  box-shadow: 0 40px 30px rgba(25, 25, 46, 0.04);
  border-radius: 16px;
  background-color: #ffffff;
}
.text-7 {
  text-align: left;
  vertical-align: top;
  font-size: 16px;
  font-family: 'Poppins', serif;
  letter-spacing: 3px;

  color: #5d86a7;
}
.frame-1-9 {
  border-radius: 16px;
  display: flex;
  flex-direction: row;
  justify-content: flex-start;
  align-items: center;
  padding: 16px 24px;
  gap: 10px;
  background-color: #ffffff;
}
.text-8 {
  text-align: left;
  vertical-align: top;
  font-size: 16px;
  font-family: 'Poppins', serif;
  letter-spacing: 3px;

  color: #5d86a7;
}
.frame-1-5 {
  border-radius: 16px;
  display: flex;
  flex-direction: row;
  justify-content: flex-start;
  align-items: center;
  padding: 16px 24px;
  gap: 10px;
  background-color: #ffffff;
}
.text-9 {
  text-align: left;
  vertical-align: top;
  font-size: 16px;
  font-family: 'Poppins', serif;
  letter-spacing: 3px;

  color: #5d86a7;
}
.frame-1-6 {
  border-radius: 16px;
  display: flex;
  flex-direction: row;
  justify-content: flex-start;
  align-items: center;
  padding: 16px 24px;
  gap: 10px;
  background-color: #ffffff;
}
.text-1-0 {
  text-align: left;
  vertical-align: top;
  font-size: 16px;
  font-family: 'Poppins', serif;
  letter-spacing: 3px;

  color: #5d86a7;
}
.frame-1-8 {
  border-radius: 16px;
  display: flex;
  flex-direction: row;
  justify-content: flex-start;
  align-items: center;
  padding: 16px 24px;
  gap: 10px;
  background-color: #ffffff;
}
.text-1-1 {
  text-align: left;
  vertical-align: top;
  font-size: 16px;
  font-family: 'Poppins', serif;
  letter-spacing: 3px;

  color: #5d86a7;
}
.frame-1-2 {
  border-radius: 25px;
  height: 61px;
  width: 61px;
  background-color: #f3eefa;
}
.img-1-2 {
  height: 29px;
  width: 29px;
}
.frame-1-7 {
  border-radius: 16px;
  display: flex;
  flex-direction: row;
  justify-content: flex-start;
  align-items: center;
  padding: 16px 24px;
  gap: 10px;
  background-color: #ffffff;
}
.text-1-3 {
  text-align: left;
  vertical-align: top;
  font-size: 16px;
  font-family: 'Poppins', serif;
  letter-spacing: 3px;

  color: #5d86a7;
}

I am open to learning more tips and best practices; you may remove my code and provide a solution that focuses on the purpose rather than this particular design; that is fine. I am a little befuddled. Many thanks.

Figma screenshot I wish to reproduce: [Figma design screenshot]. Note that I could share the link to the Figma file if needed.

ANSWER

Answered 2021-Dec-15 at 22:21

As Ramesh mentioned in the comments, absolute positioning is needed for the list items surrounding the main div.

  • Create a container div surrounding the list items, with the same width and height as the home class (see the sketch after this list). This will ensure that the list items are not affected by flexbox.
  • I would remove all flex containers inside the classNames for the list items. Instead, use position: absolute so you can use the right, left, bottom, and top properties. From here, you can test different values using percentages or pixels to get the placements you want. For more information on choosing between pixels and percentages, this article helps clarify it: https://www.hongkiat.com/blog/css-units/
  • As for responsive resizing: use media queries. It is also important to use the !important flag, as it gives more weight to the appropriate value for each screen size. For more information on media queries, visit https://css-tricks.com/a-complete-guide-to-css-media-queries/
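
For the first bullet, a minimal sketch of what the wrapper could look like (the .items-container class name is hypothetical, the dimensions are copied from the .home class, and the offsets on the list item are illustrative):

/* Hypothetical wrapper: same dimensions as .home, with position: relative
   so absolutely positioned list items are placed relative to this box. */
.items-container {
  position: relative;
  height: 300px;
  width: 350px;
}

/* A list item opts out of the flex flow and is placed directly;
   tune the offsets to match the Figma design. */
.items-container .frame-1-9 {
  position: absolute;
  top: 10%;
  right: -20%;
}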

One of the list items for responsive resizing should look something like this:

.frame-1-6 {
  position: absolute;
  padding: 16px 24px;
  left: -200px;   /* over 900px */
  bottom: 115%;   /* over 900px */
  border: 1px solid black;
  width: 200px;
  box-shadow: 0 40px 30px rgba(25, 25, 46, 0.04);
  border-radius: 16px;
  background-color: #ffffff;
}

@media screen and (max-width: 900px) {
  .frame-1-6 {
    left: -150px !important;   /* under 900px */
    bottom: 100% !important;   /* under 900px */
  }
}

In the live example, I have gone ahead and placed some of your list items in the desired areas in order to showcase how it works.

Live Example: https://jsfiddle.net/t3qry2oa/286/

Source https://stackoverflow.com/questions/70365768

QUESTION

Gensim doc2vec's d2v.wv.most_similar() gives not relevant words with high similarity scores

Asked 2021-Dec-14 at 20:14

I've got a dataset of job listings with about 150,000 records. I extracted skills from the descriptions with NER, using a dictionary of 30,000 skills. Every skill is represented as a unique identifier.

My data example:

   job_title          job_id  skills
1  business manager        4  12 13 873 4811 482 2384 48 293 48
2  java developer         55  48 2838 291 37 484 192 92 485 17 23 299 23...
3  data scientist         21  383 48 587 475 2394 5716 293 585 1923 494 3

Then, I train a doc2vec model using these data where job titles (their ids to be precise) are used as tags and skills vectors as word vectors.

import gensim  # needed for the classes below; missing from the original snippet

def tagged_document(df):
    for index, row in df.iterrows():
        yield gensim.models.doc2vec.TaggedDocument(row['skills'].split(), [str(row['job_id'])])

data_for_training = list(tagged_document(data[['job_id', 'skills']]))

model_d2v = gensim.models.doc2vec.Doc2Vec(dm=0, dbow_words=1, vector_size=80, min_count=3, epochs=100, window=100000)

model_d2v.build_vocab(data_for_training)

model_d2v.train(data_for_training, total_examples=model_d2v.corpus_count, epochs=model_d2v.epochs)

It works mostly okay, but I have issues with some job titles. I tried to collect more data for them, but I still see unpredictable behavior.

For example, I have a job title "Director Of Commercial Operations", represented by 41 data records with between 11 and 96 skills each (mean 32). When I get the most similar words (skills, in my case) for its document vector, I get the following:

docvec = model_d2v.docvecs[id_]
model_d2v.wv.most_similar(positive=[docvec], topn=5)
capacity utilization                0.5729076266288757
process optimization                0.5405482649803162
goal setting                        0.5288119316101074
aeration                            0.5124399662017822
supplier relationship management    0.5117508172988892

These are the top 5 skills, and 3 of them look relevant. However, the top one doesn't look valid, nor does "aeration". The problem is that none of the records for this job title contain these skills at all. It looks like noise in the output, but why does it get one of the highest similarity scores (even though the scores are generally not high)? Does it mean that the model can't pick out very specific skills for this kind of job title? Can the number of "noisy" skills be reduced? Sometimes I see much more relevant skills with lower similarity scores, but they are often below 0.5.

One more example, of correct behavior with a similar amount of data: BI Analyst, 29 records, with between 4 and 48 skills each (mean 21). The top skills look alright.

business intelligence                 0.6986587047576904
business intelligence development     0.6861011981964111
power bi                              0.6589289903640747
tableau                               0.6500121355056763
qlikview (data analytics software)    0.6307920217514038
business intelligence tools           0.6143202781677246
dimensional modeling                  0.6032138466835022
exploratory data analysis             0.6005223989486694
marketing analytics                   0.5737696886062622
data mining                           0.5734485387802124
data quality                          0.5729933977127075
data visualization                    0.5691111087799072
microstrategy                         0.5566076636314392
business analytics                    0.5535123348236084
etl                                   0.5516749620437622
data modeling                         0.5512707233428955
data profiling                        0.5495884418487549

ANSWER

Answered 2021-Dec-14 at 20:14

If your gold standard of what the model should report is skills that appeared in the training data, are you sure you don't want a simple count-based solution? For example, just provide a ranked list of the skills that appear most often in Director Of Commercial Operations listings.
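
As a concrete illustration, a count-based baseline could be as small as the sketch below (the column names follow the question's dataframe; the rest is an assumption):

from collections import Counter

# Count raw skill frequencies across all listings for one job title.
rows = data.loc[data["job_title"] == "Director Of Commercial Operations", "skills"]
skill_counts = Counter(skill for row in rows for skill in row.split())
print(skill_counts.most_common(5))  # top-5 skills by raw frequency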

On the other hand, the essence of compressing N job titles, and 30,000 skills, into a smaller (in this case vector_size=80) coordinate-space model is to force some non-intuitive (but perhaps real) relationships to be reflected in the model.

Might there be some real pattern in the model – even if, perhaps, just some idiosyncrasies in the appearance of less-common skills – that makes aeration necessarily slot near those other skills? (Maybe it's a rare skill whose few contextual appearances co-occur with skills very close to 'capacity utilization', meaning that with the tiny amount of data available, and the tiny amount of overall attention given to this skill, there's no better place for it.)

Taking note of whether your 'anomalies' tend to involve low-frequency skills, or lower-frequency job-ids, might enable a closer look at the data causes, or some disclaimering/filtering of most_similar() results. (The most_similar() method can limit its returned rankings to the more frequent range of the known vocabulary, for cases when long-tail or rare words, with their rougher vectors, intrude on the higher-quality results from better-represented words. See the restrict_vocab parameter.)
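
A minimal sketch of that filtering, reusing the question's variables (the cutoff of 10,000 is an arbitrary assumption; gensim keeps its vocabulary sorted by descending frequency, which is what makes restrict_vocab meaningful):

# Rank candidates only among the 10,000 most frequent skills.
docvec = model_d2v.docvecs[id_]
model_d2v.wv.most_similar(positive=[docvec], topn=5, restrict_vocab=10000)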

That said, tinkering with training parameters may result in rankings that better reflect your intent. A larger min_count might remove more tokens that, lacking sufficient varied examples, mostly just inject noise into the rest of training. A different vector_size, smaller or larger, might better capture the relationships you're looking for. A more-aggressive (smaller) sample could discard more high-frequency words that might be starving more-interesting less-frequent words of a chance to influence the model.
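
A hedged sketch of that kind of re-tuning, starting from the question's own parameters (the specific values below are illustrative guesses, not recommendations):

# Illustrative re-parameterization of the question's model:
model_d2v = gensim.models.doc2vec.Doc2Vec(
    dm=0, dbow_words=1,
    vector_size=40,   # try smaller (or larger) than the original 80
    min_count=10,     # was 3; drops skills lacking varied examples
    sample=1e-5,      # more aggressive downsampling of very frequent skills
    epochs=100, window=100000)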

Note that with dbow_words=1 and a large window, and records with (perhaps?) dozens of skills each, the words have a much more neighborly effect on each other, in the model, than the tag<->word correlations. That might be good or bad.

Source https://stackoverflow.com/questions/70350954

QUESTION

Creating a CSV file from Python Script

Asked 2021-Nov-30 at 09:42

I am learning data mining from a book, and I am trying to write my first script to gather info from YouTube's API and feed it into a new .csv file. For some reason, it isn't working. I tried inputting the script line by line in a CLI, and the script eventually creates an empty .csv file, but the information is never written to it. Here is my code; it's basically copied line by line from the book:

import csv
import json
import requests

api_url = "https://www.googleapis.com/youtube/v3/search?part=snippet&channelId=UCJFp8uSYCjXOMnkUyb3CQ3Q&key=AIzaSyDaMzUYRFzDfjMq-bTm38Y_1swWDMfg03E"
api_response = requests.get(api_url)
videos = json.loads(api_response.text)

with open("C:\Users\jacks\Documents\PythonScripts\youtube_videos.csv", "w", encoding="utf-8") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["publishedAt",
                 "title",
                 "description",
                 "thumbnailurl"])
    if videos.get("items") is not None:
        for video in videos.get("items"):
            videos_data_row = [
                video["snippet"]["publishedAt"],
                video["snippet"]["title"],
                video["snippet"]["description"],
                video["snippet"]["thumbnails"]["default"]["url"]
                ]
            csv_writer.writerow(video_data_row)

ANSWER

Answered 2021-Nov-30 at 09:42

I ran your code & the only problem I found was in csv_writer.writerow(video_data_row)

You're missing an s

Replace with:

csv_writer.writerow(videos_data_row)
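
One caveat beyond the original answer: the Windows path in the open() call uses unescaped backslashes ("\U" even triggers a syntax error in Python 3), so a raw string is safer; adding newline="" also follows the csv module's documented recommendation for writers. A minimal sketch:

# Raw string keeps the backslashes literal; newline="" prevents the blank
# rows the csv module can otherwise produce on Windows.
with open(r"C:\Users\jacks\Documents\PythonScripts\youtube_videos.csv",
          "w", newline="", encoding="utf-8") as csv_file:
    csv_writer = csv.writer(csv_file)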

Source https://stackoverflow.com/questions/70167096

QUESTION

Which is the best Data Mining model to extrapolate known values to missing values in a table? (General question)

Asked 2021-Oct-27 at 21:21

I am working on a little data mining project (I am still a Data Science student, not a professional). Maybe you can help me to choose a proper model for my task.

So, let's say we have a table with three columns and around 4000 rows:

YEAR COLOR NAME
1900 Green David
1901 Yellow Sarah
1902 Green ???
1902 Red Sarah
2020 Purple John

Any value for any field can be repeated in the dataset (also Year values).

In the first two columns we don't have missing values, but in the third column we only have Name values for around 20% of the rows. The Name value depends somewhat on the first two columns (an association, not a causal relation).

My goal is to extrapolate the available Name values to the whole table and get a range of occurrences for each name value (for example in a boxplot)

I have imagined a process like the following, although I am not very sure whether it makes sense statistically (any objections and suggestions are appreciated):

  1. For every unknown NAME value, the algorithm randomly chooses one of the already-known NAME values. The odds of a particular NAME value being chosen depend on the variables YEAR and COLOR. For instance, if 'David' values tend to be correlated with low Year values AND with 'Green' or 'Purple' values for Color, the algorithm gives 'David' a higher probability of being chosen if the input values for Year and Color are "1900, Purple".

  2. When the above process ends, the number of occurrences for each name is counted.

  3. The above process is applied 30 times, and the results for each name are displayed in a boxplot.

However, I don't know which model is best for implementing an idea like this. I have drawn the process in a simple Paint drawing:

[Image: possible output for the task]

What do you think could be a good approach to this task? I appreciate any help.

ANSWER

Answered 2021-Oct-27 at 21:21

I think you have the process down; it's converting the data that may be the first hurdle.

I would look at using from sklearn.preprocessing import OrdinalEncoder to encode the data, converting it from categorical to numeric.

You could then use a random number generator to produce a number within the range defined by the encoding which would randomly select a name.

Loop through this 30 times with a for loop to achieve the result.

It also looks like you will need to provide the ranking values for year and colour prior to building out your code. From there you would just provide bands within your for loop to specify the names, for example, if year > 1985, etc.
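
Pulling these steps together, here is a hedged sketch of the whole procedure (the column names, the toy data, and the distance-based weighting are illustrative assumptions, not a definitive implementation):

import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "YEAR":  [1900, 1901, 1902, 1902, 2020],
    "COLOR": ["Green", "Yellow", "Green", "Red", "Purple"],
    "NAME":  ["David", "Sarah", None, "Sarah", "John"],
})

# Encode the categorical COLOR column to numeric codes.
df["COLOR_CODE"] = OrdinalEncoder().fit_transform(df[["COLOR"]]).ravel()

known = df[df["NAME"].notna()]
rng = np.random.default_rng(0)

def sample_name(year, color_code, year_scale=20.0):
    # Weight each known row by closeness in YEAR and COLOR_CODE, then draw
    # one NAME with those weights (a simple similarity-weighted draw).
    dist = (known["YEAR"] - year).abs() / year_scale + (known["COLOR_CODE"] - color_code).abs()
    weights = 1.0 / (1.0 + dist)
    return rng.choice(known["NAME"].to_numpy(), p=(weights / weights.sum()).to_numpy())

# Repeat the imputation 30 times, counting occurrences per name in each run.
counts = []
for _ in range(30):
    filled = df["NAME"].copy()
    for i in df.index[df["NAME"].isna()]:
        filled[i] = sample_name(df.at[i, "YEAR"], df.at[i, "COLOR_CODE"])
    counts.append(filled.value_counts())

results = pd.DataFrame(counts).fillna(0)  # rows: runs; columns: names
results.plot(kind="box")                  # per-name range across the 30 runs (needs matplotlib)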

Source https://stackoverflow.com/questions/69720864

Community Discussions contain sources that include Stack Exchange Network

Tutorials and Learning Resources in Data Mining

Tutorials and Learning Resources are not available at this moment for Data Mining
