Popular New Releases in Topic Modeling
gensim
BERTopic: v0.9.4
Top2Vec: Phrases and new embedding options
tomotopy: 0.12.0
Palmetto: Version 0.1.3
Popular Libraries in Topic Modeling
by RaRe-Technologies (Python, 13112 stars, LGPL-2.1): Topic Modelling for Humans
by baidu (C++, 2420 stars, BSD-3-Clause): A Toolkit for Industrial Topic Modeling
by MaartenGr (Python, 2187 stars, MIT): Leveraging BERT and c-TF-IDF to create easily interpretable topics.
by ddangelov (Python, 1605 stars, BSD-3-Clause): Top2Vec learns jointly embedded topic, document and word vectors.
by lda-project (Python, 1003 stars, MPL-2.0): Topic modeling with latent Dirichlet allocation using Gibbs sampling
by cpsievert (JavaScript, 476 stars, NOASSERTION): R package for web-based interactive topic model visualization.
by gregversteeg (Python, 462 stars, Apache-2.0): Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
by zake7749 (Python, 434 stars, MIT): A tutorial on training Chinese word vectors
by adjidieng (Python, 422 stars, MIT): Topic Modeling in Embedding Spaces
Trending New libraries in Topic Modeling
by MaartenGr (Python, 2187 stars, MIT): Leveraging BERT and c-TF-IDF to create easily interpretable topics.
by ddangelov (Python, 1605 stars, BSD-3-Clause): Top2Vec learns jointly embedded topic, document and word vectors.
by sunyilgdx (Python, 206 stars): A Chinese keyphrase extraction method based on pre-trained language models (the Chinese-language code for the paper "SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-trained Language Model")
by jdegoes (Scala, 54 stars): The exercises for the Functional Data Modeling workshop.
by ankane (C++, 38 stars, MIT): High performance topic modeling for Ruby
by yumeng5 (C, 33 stars, Apache-2.0): [WWW 2020] Discriminative Topic Mining via Category-Name Guided Text Embedding
by yumeng5 (C, 31 stars, Apache-2.0): [KDD 2020] Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding
by Eligijus112 (Python, 29 stars): A project to create your very own word embeddings
by MaartenGr (Python, 17 stars, MIT): Concept Modeling: Topic Modeling on Images and Text
Top Authors in Topic Modeling
1: 8 Libraries, 811
2: 8 Libraries, 762
3: 6 Libraries, 92
4: 3 Libraries, 45
5: 3 Libraries, 458
6: 3 Libraries, 72
7: 3 Libraries, 246
8: 3 Libraries, 111
9: 2 Libraries, 21
10: 2 Libraries, 9
Trending Discussions on Topic Modeling
Display document to topic mapping after LSI using Gensim
My main.py script runs in the PyCharm IDE but not from the terminal. Why is this so?
How to get list of words for each topic for a specific relevance metric value (lambda) in pyLDAvis?
Should bi-gram and tri-gram be used in LDA topic modeling?
How can encoded text be converted back to the original text (without the special characters created by encoding)?
Memory problems when using lapply for corpus creation
How can I replace emojis with text and treat them as single words?
Specify the output per topic to a specific number of words
Name topics in lda topic modeling based on beta values
Calculating optimal number of topics for topic modeling (LDA)
QUESTION
Display document to topic mapping after LSI using Gensim
Asked 2022-Feb-22 at 19:27
I am new to using LSI with Python and the Gensim + Scikit-learn tools. I was able to achieve topic modeling on a corpus using LSI from both the Scikit-learn and Gensim libraries; however, when using the Gensim approach I was not able to display the document-to-topic mapping.
Here is my work using Scikit-learn LSI where I successfully displayed document to topic mapping:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD

# TF-IDF weighting followed by truncated SVD (LSI)
tfidf_transformer = TfidfTransformer()
transformed_vector = tfidf_transformer.fit_transform(transformed_vector)
NUM_TOPICS = 14
lsi_model = TruncatedSVD(n_components=NUM_TOPICS)
lsi = lsi_model.fit_transform(transformed_vector)

topic_to_doc_mapping = {}
topic_list = []
topic_names = []

# assign each document to its highest-weighted topic
for i in range(len(dbpedia_df.index)):
    most_likely_topic = lsi[i].argmax()

    if most_likely_topic not in topic_to_doc_mapping:
        topic_to_doc_mapping[most_likely_topic] = []

    topic_to_doc_mapping[most_likely_topic].append(i)

    topic_list.append(most_likely_topic)
    topic_names.append(topic_id_topic_mapping[most_likely_topic])

dbpedia_df['Most_Likely_Topic'] = topic_list
dbpedia_df['Most_Likely_Topic_Names'] = topic_names

print(topic_to_doc_mapping[0][:100])

topic_of_interest = 1
doc_ids = topic_to_doc_mapping[topic_of_interest][:4]
for doc_index in doc_ids:
    print(X.iloc[doc_index])
And here is my work using the Gensim approach:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary
from gensim.models import LsiModel

processed_list = []
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# tokenize, remove stopwords and lemmatize each document
for doc in documents_list:
    tokens = word_tokenize(doc.lower())
    stopped_tokens = [token for token in tokens if token not in stop_words]
    lemmatized_tokens = [lemmatizer.lemmatize(i, pos="n") for i in stopped_tokens]
    processed_list.append(lemmatized_tokens)

term_dictionary = Dictionary(processed_list)
document_term_matrix = [term_dictionary.doc2bow(document) for document in processed_list]

NUM_TOPICS = 14
model = LsiModel(corpus=document_term_matrix, num_topics=NUM_TOPICS, id2word=term_dictionary)
lsi_topics = model.show_topics(num_topics=NUM_TOPICS, formatted=False)
lsi_topics
How can I display the document to topic mapping here?
ANSWER
Answered 2022-Feb-22 at 19:27
In order to get the representation of a document (represented as a bag-of-words) from a trained LsiModel as a vector of topics, you use Python dict-style bracket-accessing (model[bow]).
For example, to get the topics for the 1st item in your training data, you can use:
first_doc = document_term_matrix[0]
first_doc_lsi_topics = model[first_doc]
You can also supply a list of docs, as in training, to get the LSI topics for an entire batch at once. EG:
all_doc_lsi_topics = model[document_term_matrix]
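If you also want the document-to-topic mapping itself, as in the scikit-learn version, a minimal sketch along these lines should work. It is not part of the original answer and assumes the model and document_term_matrix objects defined above; since LSI weights can be negative, the dominant topic is taken here as the one with the largest absolute weight.

# Sketch (assumption, not from the answer): build a document -> topic mapping
# from the gensim LSI output, analogous to the scikit-learn loop in the question.
topic_to_doc_mapping_lsi = {}

for doc_id, doc_topics in enumerate(model[document_term_matrix]):
    if not doc_topics:  # documents with no known terms get an empty topic vector
        continue
    dominant_topic = max(doc_topics, key=lambda pair: abs(pair[1]))[0]
    topic_to_doc_mapping_lsi.setdefault(dominant_topic, []).append(doc_id)

print(topic_to_doc_mapping_lsi.get(0, [])[:10])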
QUESTION
My main.py script runs in the PyCharm IDE but not from the terminal. Why is this so?
Asked 2022-Feb-14 at 10:56
When I want to run my main.py script in the terminal, it raises the following exception:
Traceback (most recent call last):
  File "main.py", line 14, in <module>
    from typeform_api import get_data # Step 1)
  File "/Users/philipp/Documents/PycharmProjects/Wegekompass/typeform_api.py", line 9, in <module>
    import pandas as pd # For creating a dataframe for the data
ModuleNotFoundError: No module named 'pandas'
I already read the answers of those two questions, but couldn't find any solutions.
This is my main.py script:
"""
Logic Flow of main.py:
1) Fetch Data with API
2) Preprocess the Data
3) Start topic modeling
4) Create report and include data
5) Send report via e-mail to user
"""

import os
import sys
from typeform_api import get_data              # Step 1)
from preprocessing import preprocess_data      # Step 2)
from topic_modeling import create_topic_model  # Step 3)
from create_report import *                    # Step 4)
from send_report import send_report            # Step 5)

if __name__ == '__main__':

    # 1) Fetch Data with Typeform API:
    alle_antworten_txt = get_data()[3]
    projektname = get_data()[0]
    projekt_id = get_data()[1]
    user_email = get_data()[2]
    heute_txt_short = get_data()[3][4]
    heute_txt_long = get_data()[3][1:3]
    heute_txt_long_joined = ". ".join(heute_txt_long)
    morgen_txt_short = get_data()[3][8]
    morgen_txt_long = get_data()[3][5:7]
    morgen_txt_long_joined = ". ".join(morgen_txt_long)
    erfolgshebel_txt_short = get_data()[3][17]
    erfolgshebel_txt_long = get_data()[3][14:16]
    erfolgshebel_txt_long_joined = ". ".join(erfolgshebel_txt_long)
    staerken_txt = get_data()[3][12:13]
    staerken_txt_joined = ". ".join(staerken_txt)
    schwaechen_txt = get_data()[3][10:11]
    schwaechen_txt_joined = ". ".join(schwaechen_txt)
    reflexion_txt = get_data()[3][9]
    etappe1_fachlich_txt = get_data()[3][18]
    etappe1_kulturell_txt = get_data()[3][19]
    etappe2_fachlich_txt = get_data()[3][20]
    etappe2_kulturell_txt = get_data()[3][21]
    etappe3_fachlich_txt = get_data()[3][22]
    etappe3_kulturell_txt = get_data()[3][23]
    weg_von_energie = get_data()[4][0]
    hin_zu_energie = get_data()[4][1]
    umsetzungsenergie = get_data()[4][2]

    # 2) Preprocess and tokenize fetched Data:
    alle_antworten_txt_tokens = preprocess_data(alle_antworten_txt)[1]
    heute_txt_long_tokens = preprocess_data(heute_txt_long)[1]
    morgen_txt_long_tokens = preprocess_data(morgen_txt_long)[1]

    # 3) Start NLP analysis:
    # answer_analysis = create_topic_model(alle_antworten_txt_tokens)[1]
    try:
        heute_wortwolke = create_topic_model(heute_txt_long_tokens)[2]
        heute_wortwolke.to_file(os.getcwd() + "/Grafiken/Wortwolken/heute_wortwolke" + projekt_id + ".png")
        morgen_wortwolke = create_topic_model(morgen_txt_long_tokens)[2]
        morgen_wortwolke.to_file(os.getcwd() + "/Grafiken/Wortwolken/morgen_wortwolke" + projekt_id + ".png")

    except ValueError:
        print("There is not enough word input, LDA analysis raises ValueError")
        sys.exit(0)

    # 4) Create final report:
    try:
        final_report = create_final_report(heute_txt_short,
                                           morgen_txt_short,
                                           erfolgshebel_txt_short,
                                           heute_txt_long_joined,
                                           morgen_txt_long_joined,
                                           erfolgshebel_txt_long_joined,
                                           staerken_txt_joined,
                                           schwaechen_txt_joined,
                                           reflexion_txt,
                                           os.getcwd() + "/Grafiken/Wortwolken/heute_wortwolke" + projekt_id + ".png",
                                           os.getcwd() + "/Grafiken/Wortwolken/morgen_wortwolke" + projekt_id + ".png",
                                           etappe1_fachlich_txt,
                                           etappe1_kulturell_txt,
                                           etappe2_fachlich_txt,
                                           etappe2_kulturell_txt,
                                           etappe3_fachlich_txt,
                                           etappe3_kulturell_txt,
                                           weg_von_energie,
                                           hin_zu_energie,
                                           umsetzungsenergie,
                                           projektname,
                                           projekt_id)

    except FileNotFoundError:
        print("There is not enough word input, the wordcloud can't be computed, so FileNotFoundError is raised")
        sys.exit(0)

    # 5) Send final report:
    try:
        send_report(final_report, user_email, projektname, projekt_id)
    except NameError:
        print("variable final_report not defined due to missing wordclouds")
        sys.exit(0)

    # 6) Check if program reached its end:
    print("If this gets printed, the report was sent successfully.")
This is the structure of my project folder:
├── Dockerfile
├── Error-Handling.md
├── Grafiken
│   ├── 1Etappe_Card-min.png
│   ├── 2Etappe_Card-min.png
│   ├── 3Etappe_Card-min.png
│   ├── AD_Logo-min.png
│   ├── Dynamik_Abflug-min.png
│   ├── Dynamik_Anflug-min.png
│   ├── Dynamik_Hoehenflug-min.png
│   ├── Dynamik_Parabelflug-min.png
│   ├── Dynamik_Sinkflug-min.png
│   ├── Dynamik_Steigflug-min.png
│   ├── Dynamik_Talflug-min.png
│   ├── Dynamik_Tiefflug-min.png
│   ├── Entwicklung_Card-min.png
│   ├── Erfolgshebel_Card-min.png
│   ├── Framework_Abflug-min.png
│   ├── Framework_Anflug-min.png
│   ├── Framework_Hoehenflug-min.png
│   ├── Framework_Parabelflug-min.png
│   ├── Framework_Sinkflug-min.png
│   ├── Framework_Steigflug-min.png
│   ├── Framework_Talflug-min.png
│   ├── Framework_Tiefflug-min.png
│   ├── Heute_Card-min.png
│   ├── Morgen_Card-min.png
│   ├── Reflexion_Card-min.png
│   ├── Strength_Card-min.png
│   ├── Weakness_Card-min.png
│   ├── Wegekompass_Hero-min.png
│   └── Wortwolken
├── PDF_Reports
├── README.md
├── __pycache__
│   ├── config.cpython-38.pyc
│   ├── create_report.cpython-38.pyc
│   ├── main.cpython-38.pyc
│   ├── preprocessing.cpython-38.pyc
│   ├── send_report.cpython-38.pyc
│   ├── topic_modeling.cpython-38.pyc
│   └── typeform_api.cpython-38.pyc
├── config.py
├── create_report.py
├── html_files
│   └── E_Mail_Template.html
├── main.py
├── preprocessing.py
├── requirements.txt
├── send_report.py
├── topic_modeling.py
├── typeform_api.py
└── venv
    ├── (...)
As I said, the main.py script runs in the PyCharm IDE, but not from the terminal.
How can I run my script from the terminal?
ANSWER
Answered 2022-Feb-14 at 09:31
Looks like you are using venv. Did you activate it before running your script?
For Linux/Mac you can do the following:
. venv/bin/activate
For Windows you should use:
source venv/Scripts/activate
After activating the virtual environment, you can install the packages with:
pip install -r requirements.txt
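A quick diagnostic, not part of the original answer, is to print which interpreter actually runs the script; if the path points outside the venv where pandas was installed, that explains the ModuleNotFoundError.

import sys

# Sketch: confirm which Python interpreter and environment are running main.py.
print(sys.executable)  # path of the interpreter in use
print(sys.prefix)      # root of the active (virtual) environment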
QUESTION
How to get list of words for each topic for a specific relevance metric value (lambda) in pyLDAvis?
Asked 2021-Nov-24 at 10:43
I am using pyLDAvis along with gensim.models.LdaMulticore for topic modeling. I have 10 topics in total. When I visualize the results using pyLDAvis, there is a slider called lambda with this explanation: "Slide to adjust relevance metric". I am interested in extracting the list of words for each topic separately for lambda = 0.1. I cannot find a way to adjust lambda in the documentation for extracting keywords.
I am using these lines:
import pyLDAvis.gensim_models

LDAvis_prepared = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, lambda_step=0.1)
LDAvis_prepared.topic_info
And these are the results:
       Term         Freq        Total Category  logprob  loglift
321      ra  2336.000000  2336.000000  Default  30.0000  30.0000
146     may  1741.000000  1741.000000  Default  29.0000  29.0000
66   doctor  1310.000000  1310.000000  Default  28.0000  28.0000
First of all, these results are not related to what I observe with a lambda of 0.1 in the visualization. Secondly, I cannot see the results separated by topic.
ANSWER
Answered 2021-Nov-24 at 10:43
You may want to read this github page: https://nicharuc.github.io/topic_modeling/
According to this example, your code could go like this:
import pandas as pd

lambd = 0.6  # a specific relevance metric value

all_topics = {}
num_topics = lda_model.num_topics
num_terms = 10

# pyLDAvis labels topic categories 'Topic1' .. 'TopicN', so include all of them
for i in range(1, num_topics + 1):
    topic = LDAvis_prepared.topic_info[LDAvis_prepared.topic_info.Category == 'Topic' + str(i)].copy()
    topic['relevance'] = topic['loglift'] * (1 - lambd) + topic['logprob'] * lambd
    all_topics['Topic ' + str(i)] = topic.sort_values(by='relevance', ascending=False).Term[:num_terms].values

pd.DataFrame(all_topics).T
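As a hedged convenience on top of the answer above (not part of the original answer), the same relevance computation can be packaged as a function so that lambda can be set to 0.1, the value asked about, or any other value in [0, 1]; the function name and signature below are illustrative only.

import pandas as pd

def top_terms_by_relevance(prepared, num_topics, lambd=0.1, num_terms=10):
    """Return the top terms per topic for a given relevance weight lambda."""
    all_topics = {}
    for i in range(1, num_topics + 1):
        topic = prepared.topic_info[prepared.topic_info.Category == 'Topic' + str(i)].copy()
        topic['relevance'] = topic['loglift'] * (1 - lambd) + topic['logprob'] * lambd
        all_topics['Topic ' + str(i)] = (
            topic.sort_values(by='relevance', ascending=False).Term[:num_terms].values
        )
    return pd.DataFrame(all_topics).T

# e.g. top_terms_by_relevance(LDAvis_prepared, lda_model.num_topics, lambd=0.1)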
QUESTION
Should bi-gram and tri-gram be used in LDA topic modeling?
Asked 2021-Sep-13 at 21:11
I read several posts (here and here) online about LDA topic modeling. All of them only use uni-grams. I would like to know why bi-grams and tri-grams are not used for LDA topic modeling.
ANSWER
Answered 2021-Sep-13 at 08:30
It's a matter of scale. If you have 1,000 types (i.e. "dictionary words"), you might end up, in the worst case (which is not going to happen), with 1,000,000 bigrams and 1,000,000,000 trigrams. These numbers are hard to manage, especially as you will have a lot more types in a realistic text.
The gains in accuracy/performance don't outweigh the computational cost here.
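To make the scale argument concrete, here is a small illustration (not from the original answer) using scikit-learn's CountVectorizer on a toy corpus; the corpus is made up, but the blow-up pattern is the same one a realistic corpus shows at a much larger scale.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "topic modeling with latent dirichlet allocation",
    "latent semantic indexing is another topic model",
    "word embeddings complement topic models",
]

# Count distinct features when unigrams, then bigrams, then trigrams are included.
for n in (1, 2, 3):
    vocab_size = len(CountVectorizer(ngram_range=(1, n)).fit(corpus).vocabulary_)
    print(f"up to {n}-grams: {vocab_size} features")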
QUESTION
How can encoded text be converted back to the original text (without the special characters created by encoding)?
Asked 2021-Jun-30 at 11:18
I am going to extract text from a series of PDF files to do topic modeling. After extracting the text, I am going to save the text of each PDF file in a .txt or .doc file. Doing this, I got an error telling me I should add .encode('utf-8') when saving the extracted text to a .txt file, so I added txt = str(txt.encode('utf-8')). The problem is reading the .txt files: when I read them, they contain special characters due to the UTF-8 encoding, and I don't know how to recover the original text without those characters. I tried to decode, but it didn't work.
I also tried another approach to avoid saving in .txt format: saving the extracted text in a data frame, but I found that only the first few pages were saved in the data frame.
I would appreciate it if you could share your solutions for reading from the .txt file without the characters introduced by the UTF-8 encoding, and for saving the extracted text in a data frame.
import pdfplumber
import pandas as pd
import codecs

txt = ''

with pdfplumber.open(r'C:\Users\thmag\3rdPaperLDA\A1.pdf') as pdf:
    pages = pdf.pages
    for i, pg in enumerate(pages):
        txt += pages[i].extract_text()

print(txt)

data = {'text': [txt]}
df = pd.DataFrame(data)


#### write in .txt file
text_file = open("Test.txt", "wt")
txt = str(txt.encode('utf-8'))
n = text_file.write(txt)
text_file.close()

#### read from .txt file
with codecs.open('Test.txt', 'r', 'utf-8') as f:
    for line in f:
        print(line)
ANSWER
Answered 2021-Jun-30 at 11:18
You are writing the file incorrectly. Rather than encoding the text, declare an encoding when you open the file, and write the text without encoding - Python will automatically encode it.
It should be
#### write in .txt file
with open("Test.txt", "wt", encoding='utf-8') as text_file:
    n = text_file.write(txt)
Unless you are using Python 2, you don't need to use codecs to open encoded files; again, you can declare the encoding in the open function:
with open("Test.txt", "rt", encoding='utf-8') as f:
    for line in f:
        print(line)
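For the data-frame part of the question, which the answer does not address directly, a hedged sketch along the following lines collects one row per PDF page instead of one concatenated string and saves it with an explicit UTF-8 encoding; it reuses the pdfplumber calls from the question, and the output file name is a placeholder.

import pdfplumber
import pandas as pd

pages_text = []
with pdfplumber.open(r'C:\Users\thmag\3rdPaperLDA\A1.pdf') as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        # extract_text() can return None for empty pages, hence the fallback
        pages_text.append({'page': page_number, 'text': page.extract_text() or ''})

df = pd.DataFrame(pages_text)
df.to_csv('extracted_text.csv', index=False, encoding='utf-8')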
QUESTION
Memory problems when using lapply for corpus creation
Asked 2021-Jun-05 at 05:53
My eventual goal is to transform thousands of pdfs into a corpus / document term matrix to conduct some topic modeling. I am using the pdftools package to import my pdfs and work with the tm package for preparing my data for text mining. I managed to import and transform one individual pdf, like this:
library(pdftools)
library(tm)

txt <- pdf_text("pdfexample.pdf")

# create corpus
txt_corpus <- Corpus(VectorSource(txt))

# Some basic text prep, with tm_map(), like:
txt_corpus <- tm_map(txt_corpus, tolower)

# create document term matrix
dtm <- DocumentTermMatrix(txt_corpus)
However, I am completely stuck with automating this process, and I have only limited experience with either loops or apply functions. My approach ran into memory problems when converting the raw pdf_text() output into a corpus, even though I tested my code with only 5 pdf files (total: 1.5 MB). R tried to allocate a vector of more than half a GB, which seems absolutely not right to me. My attempt looks like this:
# Create a list of all pdf paths
file_list <- list.files(path = "mydirectory",
                        full.names = TRUE,
                        pattern = "name*", # to import only specific pdfs
                        ignore.case = FALSE)

# Run a function that reads the pdf of each of those files:
all_files <- lapply(file_list, FUN = function(files) {
  pdf_text(files)
})

all_files_corpus = lapply(all_files,
                          FUN = Corpus(DirSource())) # That's where I run into memory issues
Am I doing something fundamentally wrong? I am not sure whether it is just a memory issue or whether there are easier approaches to my problem. At least, from what I gathered, lapply should be a lot more memory efficient than looping. But maybe there is more to it. I've tried to solve it on my own for days now, but nothing worked.
Grateful for any advice/hint on how to proceed!
Edit: I tried to execute the lapply with only one pdf, and R crashed again, even though I have no capacity problems at all when using the code mentioned first.
ANSWER
Answered 2021-Jun-05 at 05:52
You can write a function which has the series of steps that you want to execute on each pdf.
pdf_to_dtm <- function(file) {
  txt <- pdf_text(file)
  # create corpus
  txt_corpus <- Corpus(VectorSource(txt))
  # Some basic text prep, with tm_map(), like:
  txt_corpus <- tm_map(txt_corpus, tolower)
  # create document term matrix
  dtm <- DocumentTermMatrix(txt_corpus)
  dtm
}
Using lapply, apply the function to each file:
file_list <- list.files(path = "mydirectory",
                        full.names = TRUE,
                        pattern = "name*", # to import only specific pdfs
                        ignore.case = FALSE)

all_files_corpus <- lapply(file_list, pdf_to_dtm)
QUESTION
How can I replace emojis with text and treat them as single words?
Asked 2021-May-18 at 15:56
I have to do topic modeling in R on pieces of text containing emojis. Using the replace_emoji() and replace_emoticon() functions lets me analyze them, but there is a problem with the results.
A red heart emoji is translated as "red heart ufef". These words are then treated separately during the analysis and compromise the results. Terms like "heart" can have a very different meaning, as can be seen with "red heart ufef" and "broken heart".
The function replace_emoji_identifier() doesn't help either, as the identifiers make an analysis hard.
A dummy data set, reproducible by using dput() (including the step of forcing to lowercase):
Emoji_struct <- c(
  list(content = "🔥🔥 wow", "😮 look at that", "😤this makes me angry😤", "😍❤\ufe0f, i love it!"),
  list(content = "😍😍", "😊 thanks for helping", "😢 oh no, why? 😢", "careful, challenging ❌❌❌")
)
Current coding (data_orig is a list of several files):
library(textclean)
# The rest should be standard r packages for pre-processing

# pre-processing:
data <- gsub("'", "", data)
data <- replace_contraction(data)
data <- replace_emoji(data) # replace emoji with words
data <- replace_emoticon(data) # replace emoticon with words
data <- replace_hash(data, replacement = "")
data <- replace_word_elongation(data)
data <- gsub("[[:punct:]]", " ", data) # replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data)
data <- gsub("[[:digit:]]", "", data) # remove digits
data <- gsub("^[[:space:]]+", "", data) # remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) # remove whitespace at end of documents
data <- stripWhitespace(data)
Desired output:
[1] list(content = c("fire fire wow",
                     "facewithopenmouth look at that",
                     "facewithsteamfromnose this makes me angry facewithsteamfromnose",
                     "smilingfacewithhearteyes redheart \ufe0f, i love it!"),
         content = c("smilingfacewithhearteyes smilingfacewithhearteyes",
                     "smilingfacewithsmilingeyes thanks for helping",
                     "cryingface oh no, why? cryingface",
                     "careful, challenging crossmark crossmark crossmark"))
Any ideas? Lower cases would work, too. Best regards. Stay safe. Stay healthy.
ANSWER
Answered 2021-May-18 at 15:56
Answer
Replace the default conversion table in replace_emoji with a version where the spaces/punctuation is removed:
hash2 <- lexicon::hash_emojis
hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)

replace_emoji(Emoji_struct[,1], emoji_dt = hash2)
Example
Single character string:
replace_emoji("wow!😮 that is cool!", emoji_dt = hash2)
#[1] "wow! facewithopenmouth that is cool!"
Character vector:
replace_emoji(c("1: 😊", "2: 😍"), emoji_dt = hash2)
#[1] "1: smilingfacewithsmilingeyes "
#[2] "2: smilingfacewithhearteyes "
List:
library(magrittr)  # for the %>% pipe (assumption: not loaded by the snippets above)

list("list_element_1: 🔥", "list_element_2: ❌") %>%
  lapply(replace_emoji, emoji_dt = hash2)
#[[1]]
#[1] "list_element_1: fire "
#
#[[2]]
#[1] "list_element_2: crossmark "
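Beyond single strings, character vectors, and lists, the same replacement is often applied to a data-frame column. Here is a minimal sketch of that (my addition, not from the original answer), assuming the text lives in a column named content:

library(dplyr)   # for mutate() and the %>% pipe
df <- tibble::tibble(content = c("🔥🔥 wow", "😮 look at that"))
df %>% mutate(content = replace_emoji(content, emoji_dt = hash2))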
Rationale
To convert emojis to text, replace_emoji uses lexicon::hash_emojis as a conversion table (a hash table):
head(lexicon::hash_emojis)
#              x                         y
#1: <e2><86><95>             up-down arrow
#2: <e2><86><99>           down-left arrow
#3: <e2><86><a9>  right arrow curving left
#4: <e2><86><aa>  left arrow curving right
#5: <e2><8c><9a>                     watch
#6: <e2><8c><9b>            hourglass done
This is an object of class data.table. We can simply modify the y column of this hash table to remove all spaces and punctuation. Note that this also allows you to add new ASCII byte representations and an accompanying replacement string, as sketched below.
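For instance, here is a minimal sketch (my addition, not part of the original answer) of overriding one mapping in the modified table. The byte string below is assumed to be how lexicon::hash_emojis stores the UTF-8 bytes of U+1F525 (🔥), and "flame" is just an illustrative replacement:

hash2 <- data.table::copy(lexicon::hash_emojis)  # copy so the package data is left untouched
hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)
# Override the replacement text for the fire emoji (assumed byte representation).
hash2$y[hash2$x == "<f0><9f><94><a5>"] <- "flame"
replace_emoji("🔥 wow", emoji_dt = hash2)
# Expected, if the byte string matches the table entry: something like "flame  wow"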
QUESTION
Specify the output per topic to a specific number of words
Asked 2021-May-13 at 16:25
After conducting LDA topic modeling in R, some words end up with the same beta value. They are therefore listed together when plotting the results, which leads to overlapping and sometimes unreadable plots.
Is there a way to limit the number of words displayed per topic to a specific number? In my dummy data set, some words have the same beta values. I would like to tell R to display only 3 words per topic, or any other number as needed.
Currently, the code I am using to plot the results looks like this:
top_terms %>% # take the top terms
  group_by(topic) %>%
  mutate(top_term = term[which.max(beta)]) %>%
  mutate(term = reorder(term, beta)) %>%
  head(3) %>% # I tried this but that only works for the first topic
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ top_term, scales = "free") +
  labs(x = NULL, y = "Beta") + # no x label, change y label
  coord_flip() # turn bars sideways
I tried to solve the issue with head(3), which worked, but only for the first topic. What I would need is something similar that doesn't ignore all the other topics.
Best regards. Stay safe, stay healthy.
Note: top_terms is a tibble.
Sample data:
topic term      beta
(int) (chr)     (dbl)
1     book      0.9876
1     page      0.9765
1     chapter   0.9654
1     author    0.9654
2     sports    0.8765
2     soccer    0.8654
2     champions 0.8543
2     victory   0.8543
3     music     0.9543
3     song      0.8678
3     artist    0.7231
3     concert   0.7231
4     movie     0.9846
4     cinema    0.9647
4     cast      0.8878
4     story     0.8878
dput of sample data:
top_terms <- structure(list(topic = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
  3L, 3L, 3L, 4L, 4L, 4L, 4L), term = c("book", "page", "chapter",
  "author", "sports", "soccer", "champions", "victory", "music",
  "song", "artist", "concert", "movie", "cinema", "cast", "story"
  ), beta = c(0.9876, 0.9765, 0.9654, 0.9654, 0.8765, 0.8654, 0.8543,
  0.8543, 0.9543, 0.8678, 0.7231, 0.7231, 0.9846, 0.9647, 0.8878,
  0.8878)), row.names = c(NA, -16L), class = "data.frame")
ANSWER
Answered 2021-May-13 at 16:14
Here is what you can do:
library(dplyr)
library(ggplot2)

# take the top terms
graph_data <- top_terms %>%
  group_by(topic) %>%
  mutate(top_term = term[which.max(beta)]) %>%
  mutate(term = reorder(term, beta),
         # populate an index column that runs from 1 to the number of records in each topic
         index = seq_len(n())) %>%
  # keep at most the first three terms per topic
  filter(index <= 3)

graph_data
#> # A tibble: 12 x 5
#> # Groups:   topic [4]
#>    topic term       beta top_term index
#>    <int> <fct>     <dbl> <chr>    <int>
#>  1     1 book      0.988 book         1
#>  2     1 page      0.976 book         2
#>  3     1 chapter   0.965 book         3
#>  4     2 sports    0.876 sports       1
#>  5     2 soccer    0.865 sports       2
#>  6     2 champions 0.854 sports       3
#>  7     3 music     0.954 music        1
#>  8     3 song      0.868 music        2
#>  9     3 artist    0.723 music        3
#> 10     4 movie     0.985 movie        1
#> 11     4 cinema    0.965 movie        2
#> 12     4 cast      0.888 movie        3

graph_data %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ top_term, scales = "free") +
  labs(x = NULL, y = "Beta") + # no x label, change y label
  coord_flip() # turn bars sideways
Created on 2021-05-13 by the reprex package (v2.0.0)
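Not part of the original answer: if dplyr >= 1.0.0 is available (an assumption about your setup), slice_max() with with_ties = FALSE is a more compact way to cap the number of terms per topic, and it also sidesteps the tied-beta issue raised in the question:

library(dplyr)
library(ggplot2)

top_terms %>%
  group_by(topic) %>%
  mutate(top_term = term[which.max(beta)]) %>%
  # keep at most 3 terms per topic; ties in beta are broken arbitrarily
  slice_max(beta, n = 3, with_ties = FALSE) %>%
  ggplot(aes(reorder(term, beta), beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ top_term, scales = "free") +
  labs(x = NULL, y = "Beta") +
  coord_flip()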
QUESTION
Name topics in lda topic modeling based on beta values
Asked 2021-May-05 at 19:26
I'm currently trying to develop code for a paper I have to write. I want to conduct LDA-based topic modeling. I found some code repositories on GitHub and was able to combine them, adapting them slightly where necessary. Now I would like to add something that names each identified topic after the word with the highest beta value assigned to that topic. Any ideas? It's the first time I'm coding anything, so my expertise is quite limited.
Here's the section of the code where I wanted to insert the "naming part":
# get the top ten terms for each topic
top_terms <- topics %>%
  group_by(topic) %>% # treat each topic as a different group
  top_n(10, beta) %>% # get top 10 words
  ungroup() %>%
  arrange(topic, -beta) # arrange words in descending informativeness

# plot the top ten terms for each topic in order
top_terms %>%
  mutate(term = reorder(term, beta)) %>% # sort terms by beta value
  ggplot(aes(term, beta, fill = factor(topic))) + # plot beta by theme
  geom_col(show.legend = FALSE) + # as bar plot
  facet_wrap(~ topic, scales = "free") + # separate plot for each topic
  labs(x = NULL, y = "Beta") + # no x label, change y label
  coord_flip() # turn bars sideways
I tried to insert it in this section of the code, but that didn't work. I found this: R topic modeling: lda model labeling function, but that didn't work for me, or I didn't understand it.
I can't disclose more of the code because there is some sensitive data in there, but some expertise from the community would be highly appreciated nonetheless.
Best regards and stay safe.
Note: top_terms is a tibble. I tried to come up with some data off the top of my head. The data in top_terms are structured exactly like this:
topic term      beta
(int) (chr)     (dbl)
1     book      0.9876
1     page      0.9765
1     chapter   0.9654
2     sports    0.8765
2     soccer    0.8654
2     champions 0.8543
3     music     0.9543
3     song      0.8678
3     artist    0.7231
4     movie     0.9846
4     cinema    0.9647
4     cast      0.8878
ANSWER
Answered 2021-May-05 at 19:26
You can make an additional column in your data that, after grouping by topic, takes the name of the term with the highest beta.
suppressPackageStartupMessages({
  library(ggplot2)
  library(tibble)
  library(dplyr)
})

# Just replicating example data
top_terms <- tibble(
  topic = rep(1:4, each = 3),
  term = c("book", "page", "chapter",
           "sports", "soccer", "champions",
           "music", "song", "artist",
           "movie", "cinema", "cast"),
  beta = c(0.9876, 0.9765, 0.9654,
           0.8765, 0.8654, 0.8543,
           0.9543, 0.8678, 0.7231,
           0.9846, 0.9647, 0.8878)
)

top_terms %>%
  group_by(topic) %>%
  mutate(top_term = term[which.max(beta)]) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ top_term, scales = "free") +
  labs(x = NULL, y = "Beta") +
  coord_flip()
Created on 2021-05-05 by the reprex package (v1.0.0)
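A small extension (my own sketch, not from the original post): if you would rather have each facet titled with the topic number plus its naming word, e.g. "1: book", build the label with paste0():

top_terms %>%
  group_by(topic) %>%
  # label each topic with its number and its highest-beta term
  mutate(topic_label = paste0(topic, ": ", term[which.max(beta)])) %>%
  ggplot(aes(reorder(term, beta), beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic_label, scales = "free") +
  labs(x = NULL, y = "Beta") +
  coord_flip()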
QUESTION
Calculating optimal number of topics for topic modeling (LDA)
Asked 2021-Apr-27 at 16:56
I am going to do topic modeling via LDA. I ran my commands to see the optimal number of topics, and the resulting plot (not shown here) is a bit different from any other plot I have ever seen. Do you think it is okay, or is it better to use algorithms other than LDA? It is worth mentioning that when I ran my commands to visualize the topic keywords for 10 topics, the plot showed 2 main topics and the others overlapped strongly. Is there any valid range for coherence?
Many thanks for sharing your comments, as I am a beginner in topic modeling.
ANSWER
Answered 2021-Apr-27 at 16:56
Shameless self-promotion: I suggest you use the OCTIS library: https://github.com/mind-Lab/octis. It allows you to run different topic models and optimize their hyperparameters (including the number of topics) in order to select the best result. (An R sketch of comparing candidate topic counts follows the references below.)
There might be many reasons why you get those results, but here are some hints and observations:
- Make sure that you've preprocessed the text appropriately. This usually includes removing punctuation and numbers, removing stopwords and words that are too frequent or rare, (optionally) lemmatizing the text. Preprocessing is dependent on the language and the domain of the texts.
- LDA is a probabilistic model, which means that if you re-train it with the same hyperparameters, you will get different results each time. A good practice is to run the model with the same number of topics multiple times and then average the topic coherence.
- There are a lot of topic models, and LDA usually works fine. The choice of topic model depends on the data you have. For example, if you are working with tweets (i.e. short texts), I wouldn't recommend LDA because it cannot handle sparse texts well.
- Check how you set the hyperparameters. They may have a huge impact on the performance of the topic model.
- The range for coherence (I assume you used NPMI, which is the best known) is between -1 and 1, but values very close to the upper and lower bounds are quite rare.
References: https://www.aclweb.org/anthology/2021.eacl-demos.31/
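To make the number-of-topics comparison concrete in R (the language of the other examples on this page), here is a hedged sketch using the ldatuning package. This is my own illustration rather than part of the answer above, and it assumes dtm is an existing DocumentTermMatrix built from your preprocessed corpus:

library(ldatuning)   # compares model-selection metrics across candidate topic counts
library(topicmodels) # LDA implementation used under the hood

result <- FindTopicsNumber(
  dtm,                                        # assumed DocumentTermMatrix
  topics  = seq(from = 2, to = 20, by = 1),   # candidate numbers of topics
  metrics = c("Griffiths2004", "CaoJuan2009",
              "Arun2010", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 77),                  # fix the seed; re-run with several seeds and average, as suggested above
  verbose = TRUE
)

FindTopicsNumber_plot(result)  # choose a number of topics where the metrics level off or peak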
Community Discussions contain sources that include Stack Exchange Network
Tutorials and Learning Resources in Topic Modeling
Tutorials and Learning Resources are not available at this moment for Topic Modeling