Popular New Releases in Topic Modeling
gensim
BERTopic: v0.9.4
Top2Vec: Phrases and new embedding options
tomotopy: 0.12.0
Palmetto: Version 0.1.3
Popular Libraries in Topic Modeling
by RaRe-Technologies (Python, 13112 stars, LGPL-2.1): Topic Modelling for Humans
by baidu (C++, 2420 stars, BSD-3-Clause): A Toolkit for Industrial Topic Modeling
by MaartenGr (Python, 2187 stars, MIT): Leveraging BERT and c-TF-IDF to create easily interpretable topics.
by ddangelov (Python, 1605 stars, BSD-3-Clause): Top2Vec learns jointly embedded topic, document and word vectors.
by lda-project (Python, 1003 stars, MPL-2.0): Topic modeling with latent Dirichlet allocation using Gibbs sampling
by cpsievert (JavaScript, 476 stars, NOASSERTION): R package for web-based interactive topic model visualization.
by gregversteeg (Python, 462 stars, Apache-2.0): Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
by zake7749 (Python, 434 stars, MIT): A tutorial on training Chinese word vectors
by adjidieng (Python, 422 stars, MIT): Topic Modeling in Embedding Spaces
Trending New libraries in Topic Modeling
by MaartenGr (Python, 2187 stars, MIT): Leveraging BERT and c-TF-IDF to create easily interpretable topics.
by ddangelov (Python, 1605 stars, BSD-3-Clause): Top2Vec learns jointly embedded topic, document and word vectors.
by sunyilgdx (Python, 206 stars): A Chinese keyphrase extraction method based on pre-trained language models (the Chinese-language code for the paper "SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-trained Language Model")
by jdegoes (Scala, 54 stars): The exercises for the Functional Data Modeling workshop.
by ankane (C++, 38 stars, MIT): High performance topic modeling for Ruby
by yumeng5 (C, 33 stars, Apache-2.0): [WWW 2020] Discriminative Topic Mining via Category-Name Guided Text Embedding
by yumeng5 (C, 31 stars, Apache-2.0): [KDD 2020] Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding
by Eligijus112 (Python, 29 stars): A project to create your very own word embeddings
by MaartenGr (Python, 17 stars, MIT): Concept Modeling: Topic Modeling on Images and Text
Top Authors in Topic Modeling
1: 8 Libraries, 811
2: 8 Libraries, 762
3: 6 Libraries, 92
4: 3 Libraries, 45
5: 3 Libraries, 458
6: 3 Libraries, 72
7: 3 Libraries, 246
8: 3 Libraries, 111
9: 2 Libraries, 21
10: 2 Libraries, 9
Trending Discussions on Topic Modeling
Display document to topic mapping after LSI using Gensim
My main.py script runs in the PyCharm IDE but not from the terminal. Why is this so?
How to get list of words for each topic for a specific relevance metric value (lambda) in pyLDAvis?
Should bi-gram and tri-gram be used in LDA topic modeling?
How can encoded text be converted back to the original text (without the special characters created by encoding)?
Memory problems when using lapply for corpus creation
How can I replace emojis with text and treat them as single words?
Specify the output per topic to a specific number of words
Name topics in lda topic modeling based on beta values
Calculating optimal number of topics for topic modeling (LDA)
QUESTION
Display document to topic mapping after LSI using Gensim
Asked 2022-Feb-22 at 19:27
I am new to using LSI with Python and the Gensim + Scikit-learn tools. I was able to achieve topic modeling on a corpus using LSI from both the Scikit-learn and Gensim libraries; however, when using the Gensim approach I was not able to display the document-to-topic mapping.
Here is my work using Scikit-learn LSI where I successfully displayed document to topic mapping:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD

# TF-IDF weighting followed by truncated SVD (LSI)
tfidf_transformer = TfidfTransformer()
transformed_vector = tfidf_transformer.fit_transform(transformed_vector)
NUM_TOPICS = 14
lsi_model = TruncatedSVD(n_components=NUM_TOPICS)
lsi = lsi_model.fit_transform(transformed_vector)

topic_to_doc_mapping = {}
topic_list = []
topic_names = []

# assign each document to its highest-weighted topic
for i in range(len(dbpedia_df.index)):
    most_likely_topic = lsi[i].argmax()

    if most_likely_topic not in topic_to_doc_mapping:
        topic_to_doc_mapping[most_likely_topic] = []

    topic_to_doc_mapping[most_likely_topic].append(i)

    topic_list.append(most_likely_topic)
    topic_names.append(topic_id_topic_mapping[most_likely_topic])

dbpedia_df['Most_Likely_Topic'] = topic_list
dbpedia_df['Most_Likely_Topic_Names'] = topic_names

print(topic_to_doc_mapping[0][:100])

topic_of_interest = 1
doc_ids = topic_to_doc_mapping[topic_of_interest][:4]
for doc_index in doc_ids:
    print(X.iloc[doc_index])
And here is my work using the Gensim approach:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary
from gensim.models import LsiModel

processed_list = []
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# tokenize, remove stopwords and lemmatize each document
for doc in documents_list:
    tokens = word_tokenize(doc.lower())
    stopped_tokens = [token for token in tokens if token not in stop_words]
    lemmatized_tokens = [lemmatizer.lemmatize(i, pos="n") for i in stopped_tokens]
    processed_list.append(lemmatized_tokens)

term_dictionary = Dictionary(processed_list)
document_term_matrix = [term_dictionary.doc2bow(document) for document in processed_list]

NUM_TOPICS = 14
model = LsiModel(corpus=document_term_matrix, num_topics=NUM_TOPICS, id2word=term_dictionary)
lsi_topics = model.show_topics(num_topics=NUM_TOPICS, formatted=False)
lsi_topics
How can I display the document to topic mapping here?
ANSWER
Answered 2022-Feb-22 at 19:27
In order to get the representation of a document (represented as a bag-of-words) from a trained LsiModel as a vector of topics, you use Python dict-style bracket-accessing (model[bow]).
For example, to get the topics for the 1st item in your training data, you can use:
first_doc = document_term_matrix[0]
first_doc_lsi_topics = model[first_doc]
You can also supply a list of docs, as in training, to get the LSI topics for an entire batch at once. EG:
all_doc_lsi_topics = model[document_term_matrix]
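If you also want the document-to-topic mapping itself, as in the scikit-learn version, a minimal sketch along these lines should work. It is not part of the original answer and assumes the model and document_term_matrix objects defined above; since LSI weights can be negative, the dominant topic is taken here as the one with the largest absolute weight.

# Sketch (assumption, not from the answer): build a document -> topic mapping
# from the gensim LSI output, analogous to the scikit-learn loop in the question.
topic_to_doc_mapping_lsi = {}

for doc_id, doc_topics in enumerate(model[document_term_matrix]):
    if not doc_topics:  # documents with no known terms get an empty topic vector
        continue
    dominant_topic = max(doc_topics, key=lambda pair: abs(pair[1]))[0]
    topic_to_doc_mapping_lsi.setdefault(dominant_topic, []).append(doc_id)

print(topic_to_doc_mapping_lsi.get(0, [])[:10])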
QUESTION
My main.py script runs in the PyCharm IDE but not from the terminal. Why is this so?
Asked 2022-Feb-14 at 10:56
When I want to run my main.py script in the terminal, it raises the following exception:
Traceback (most recent call last):
  File "main.py", line 14, in <module>
    from typeform_api import get_data # Step 1)
  File "/Users/philipp/Documents/PycharmProjects/Wegekompass/typeform_api.py", line 9, in <module>
    import pandas as pd # For creating a dataframe for the data
ModuleNotFoundError: No module named 'pandas'
I already read the answers of those two questions, but couldn't find any solutions.
This is my main.py script:
"""
Logic Flow of main.py:
1) Fetch Data with API
2) Preprocess the Data
3) Start topic modeling
4) Create report and include data
5) Send report via e-mail to user
"""

import os
import sys
from typeform_api import get_data              # Step 1)
from preprocessing import preprocess_data      # Step 2)
from topic_modeling import create_topic_model  # Step 3)
from create_report import *                    # Step 4)
from send_report import send_report            # Step 5)

if __name__ == '__main__':

    # 1) Fetch Data with Typeform API:
    alle_antworten_txt = get_data()[3]
    projektname = get_data()[0]
    projekt_id = get_data()[1]
    user_email = get_data()[2]
    heute_txt_short = get_data()[3][4]
    heute_txt_long = get_data()[3][1:3]
    heute_txt_long_joined = ". ".join(heute_txt_long)
    morgen_txt_short = get_data()[3][8]
    morgen_txt_long = get_data()[3][5:7]
    morgen_txt_long_joined = ". ".join(morgen_txt_long)
    erfolgshebel_txt_short = get_data()[3][17]
    erfolgshebel_txt_long = get_data()[3][14:16]
    erfolgshebel_txt_long_joined = ". ".join(erfolgshebel_txt_long)
    staerken_txt = get_data()[3][12:13]
    staerken_txt_joined = ". ".join(staerken_txt)
    schwaechen_txt = get_data()[3][10:11]
    schwaechen_txt_joined = ". ".join(schwaechen_txt)
    reflexion_txt = get_data()[3][9]
    etappe1_fachlich_txt = get_data()[3][18]
    etappe1_kulturell_txt = get_data()[3][19]
    etappe2_fachlich_txt = get_data()[3][20]
    etappe2_kulturell_txt = get_data()[3][21]
    etappe3_fachlich_txt = get_data()[3][22]
    etappe3_kulturell_txt = get_data()[3][23]
    weg_von_energie = get_data()[4][0]
    hin_zu_energie = get_data()[4][1]
    umsetzungsenergie = get_data()[4][2]

    # 2) Preprocess and tokenize fetched Data:
    alle_antworten_txt_tokens = preprocess_data(alle_antworten_txt)[1]
    heute_txt_long_tokens = preprocess_data(heute_txt_long)[1]
    morgen_txt_long_tokens = preprocess_data(morgen_txt_long)[1]

    # 3) Start NLP analysis:
    # answer_analysis = create_topic_model(alle_antworten_txt_tokens)[1]
    try:
        heute_wortwolke = create_topic_model(heute_txt_long_tokens)[2]
        heute_wortwolke.to_file(os.getcwd() + "/Grafiken/Wortwolken/heute_wortwolke" + projekt_id + ".png")
        morgen_wortwolke = create_topic_model(morgen_txt_long_tokens)[2]
        morgen_wortwolke.to_file(os.getcwd() + "/Grafiken/Wortwolken/morgen_wortwolke" + projekt_id + ".png")

    except ValueError:
        print("There is not enough word input, LDA analysis raises ValueError")
        sys.exit(0)

    # 4) Create final report:
    try:
        final_report = create_final_report(heute_txt_short,
                                           morgen_txt_short,
                                           erfolgshebel_txt_short,
                                           heute_txt_long_joined,
                                           morgen_txt_long_joined,
                                           erfolgshebel_txt_long_joined,
                                           staerken_txt_joined,
                                           schwaechen_txt_joined,
                                           reflexion_txt,
                                           os.getcwd() + "/Grafiken/Wortwolken/heute_wortwolke" + projekt_id + ".png",
                                           os.getcwd() + "/Grafiken/Wortwolken/morgen_wortwolke" + projekt_id + ".png",
                                           etappe1_fachlich_txt,
                                           etappe1_kulturell_txt,
                                           etappe2_fachlich_txt,
                                           etappe2_kulturell_txt,
                                           etappe3_fachlich_txt,
                                           etappe3_kulturell_txt,
                                           weg_von_energie,
                                           hin_zu_energie,
                                           umsetzungsenergie,
                                           projektname,
                                           projekt_id)

    except FileNotFoundError:
        print("There is not enough word input, the wordcloud can't be computed, so FileNotFoundError is raised")
        sys.exit(0)

    # 5) Send final report:
    try:
        send_report(final_report, user_email, projektname, projekt_id)
    except NameError:
        print("variable final_report not defined due to missing wordclouds")
        sys.exit(0)

    # 6) Check if program reached its end:
    print("If this gets printed, the report was sent successfully.")
This is the structure of my project folder:
├── Dockerfile
├── Error-Handling.md
├── Grafiken
│   ├── 1Etappe_Card-min.png
│   ├── 2Etappe_Card-min.png
│   ├── 3Etappe_Card-min.png
│   ├── AD_Logo-min.png
│   ├── Dynamik_Abflug-min.png
│   ├── Dynamik_Anflug-min.png
│   ├── Dynamik_Hoehenflug-min.png
│   ├── Dynamik_Parabelflug-min.png
│   ├── Dynamik_Sinkflug-min.png
│   ├── Dynamik_Steigflug-min.png
│   ├── Dynamik_Talflug-min.png
│   ├── Dynamik_Tiefflug-min.png
│   ├── Entwicklung_Card-min.png
│   ├── Erfolgshebel_Card-min.png
│   ├── Framework_Abflug-min.png
│   ├── Framework_Anflug-min.png
│   ├── Framework_Hoehenflug-min.png
│   ├── Framework_Parabelflug-min.png
│   ├── Framework_Sinkflug-min.png
│   ├── Framework_Steigflug-min.png
│   ├── Framework_Talflug-min.png
│   ├── Framework_Tiefflug-min.png
│   ├── Heute_Card-min.png
│   ├── Morgen_Card-min.png
│   ├── Reflexion_Card-min.png
│   ├── Strength_Card-min.png
│   ├── Weakness_Card-min.png
│   ├── Wegekompass_Hero-min.png
│   └── Wortwolken
├── PDF_Reports
├── README.md
├── __pycache__
│   ├── config.cpython-38.pyc
│   ├── create_report.cpython-38.pyc
│   ├── main.cpython-38.pyc
│   ├── preprocessing.cpython-38.pyc
│   ├── send_report.cpython-38.pyc
│   ├── topic_modeling.cpython-38.pyc
│   └── typeform_api.cpython-38.pyc
├── config.py
├── create_report.py
├── html_files
│   └── E_Mail_Template.html
├── main.py
├── preprocessing.py
├── requirements.txt
├── send_report.py
├── topic_modeling.py
├── typeform_api.py
└── venv
    ├── (...)
As I said, the main.py script runs in the PyCharm IDE, but not from the terminal.
How can I run my script from the terminal?
ANSWER
Answered 2022-Feb-14 at 09:31
Looks like you are using venv. Did you activate it before running your script?
For Linux/Mac you can do the following:
. venv/bin/activate
For Windows you should use:
source venv/Scripts/activate
After activating the virtual environment, you can install the packages with:
pip install -r requirements.txt
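A quick diagnostic, not part of the original answer, is to print which interpreter actually runs the script; if the path points outside the venv where pandas was installed, that explains the ModuleNotFoundError.

import sys

# Sketch: confirm which Python interpreter and environment are running main.py.
print(sys.executable)  # path of the interpreter in use
print(sys.prefix)      # root of the active (virtual) environment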
QUESTION
How to get list of words for each topic for a specific relevance metric value (lambda) in pyLDAvis?
Asked 2021-Nov-24 at 10:43
I am using pyLDAvis along with gensim.models.LdaMulticore for topic modeling. I have 10 topics in total. When I visualize the results using pyLDAvis, there is a slider called lambda with this explanation: "Slide to adjust relevance metric". I am interested in extracting the list of words for each topic separately for lambda = 0.1. I cannot find a way to adjust lambda in the documentation for extracting keywords.
I am using these lines:
import pyLDAvis.gensim_models

LDAvis_prepared = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, lambda_step=0.1)
LDAvis_prepared.topic_info
And these are the results:
       Term         Freq        Total Category  logprob  loglift
321      ra  2336.000000  2336.000000  Default  30.0000  30.0000
146     may  1741.000000  1741.000000  Default  29.0000  29.0000
66   doctor  1310.000000  1310.000000  Default  28.0000  28.0000
First of all, these results are not related to what I observe with a lambda of 0.1 in the visualization. Secondly, I cannot see the results separated by topic.
ANSWER
Answered 2021-Nov-24 at 10:43
You may want to read this github page: https://nicharuc.github.io/topic_modeling/
According to this example, your code could go like this:
import pandas as pd

lambd = 0.6  # a specific relevance metric value

all_topics = {}
num_topics = lda_model.num_topics
num_terms = 10

# pyLDAvis labels topic categories 'Topic1' .. 'TopicN', so include all of them
for i in range(1, num_topics + 1):
    topic = LDAvis_prepared.topic_info[LDAvis_prepared.topic_info.Category == 'Topic' + str(i)].copy()
    topic['relevance'] = topic['loglift'] * (1 - lambd) + topic['logprob'] * lambd
    all_topics['Topic ' + str(i)] = topic.sort_values(by='relevance', ascending=False).Term[:num_terms].values

pd.DataFrame(all_topics).T
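As a hedged convenience on top of the answer above (not part of the original answer), the same relevance computation can be packaged as a function so that lambda can be set to 0.1, the value asked about, or any other value in [0, 1]; the function name and signature below are illustrative only.

import pandas as pd

def top_terms_by_relevance(prepared, num_topics, lambd=0.1, num_terms=10):
    """Return the top terms per topic for a given relevance weight lambda."""
    all_topics = {}
    for i in range(1, num_topics + 1):
        topic = prepared.topic_info[prepared.topic_info.Category == 'Topic' + str(i)].copy()
        topic['relevance'] = topic['loglift'] * (1 - lambd) + topic['logprob'] * lambd
        all_topics['Topic ' + str(i)] = (
            topic.sort_values(by='relevance', ascending=False).Term[:num_terms].values
        )
    return pd.DataFrame(all_topics).T

# e.g. top_terms_by_relevance(LDAvis_prepared, lda_model.num_topics, lambd=0.1)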
QUESTION
Should bi-gram and tri-gram be used in LDA topic modeling?
Asked 2021-Sep-13 at 21:11
I read several posts (here and here) online about LDA topic modeling. All of them only use uni-grams. I would like to know why bi-grams and tri-grams are not used for LDA topic modeling.
ANSWER
Answered 2021-Sep-13 at 08:30
It's a matter of scale. If you have 1,000 types (i.e. "dictionary words"), you might end up, in the worst case (which is not going to happen), with 1,000,000 bigrams and 1,000,000,000 trigrams. These numbers are hard to manage, especially as you will have a lot more types in a realistic text.
The gains in accuracy/performance don't outweigh the computational cost here.
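To make the scale argument concrete, here is a small illustration (not from the original answer) using scikit-learn's CountVectorizer on a toy corpus; the corpus is made up, but the blow-up pattern is the same one a realistic corpus shows at a much larger scale.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "topic modeling with latent dirichlet allocation",
    "latent semantic indexing is another topic model",
    "word embeddings complement topic models",
]

# Count distinct features when unigrams, then bigrams, then trigrams are included.
for n in (1, 2, 3):
    vocab_size = len(CountVectorizer(ngram_range=(1, n)).fit(corpus).vocabulary_)
    print(f"up to {n}-grams: {vocab_size} features")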
QUESTION
How can encoded text be converted back to the original text (without the special characters created by encoding)?
Asked 2021-Jun-30 at 11:18
I am going to extract text from a series of PDF files to do topic modeling. After extracting the text, I am going to save the text of each PDF file in a .txt or .doc file. Doing this, I got an error telling me I should add .encode('utf-8') when saving the extracted text to a .txt file, so I added txt = str(txt.encode('utf-8')). The problem is reading the .txt files: when I read them, they contain special characters due to the UTF-8 encoding, and I don't know how to recover the original text without those characters. I tried to decode, but it didn't work.
I also tried another approach to avoid saving in .txt format: saving the extracted text in a data frame, but I found that only the first few pages were saved in the data frame.
I would appreciate it if you could share your solutions for reading from the .txt file without the characters introduced by the UTF-8 encoding, and for saving the extracted text in a data frame.
import pdfplumber
import pandas as pd
import codecs

txt = ''

with pdfplumber.open(r'C:\Users\thmag\3rdPaperLDA\A1.pdf') as pdf:
    pages = pdf.pages
    for i, pg in enumerate(pages):
        txt += pages[i].extract_text()

print(txt)

data = {'text': [txt]}
df = pd.DataFrame(data)


#### write in .txt file
text_file = open("Test.txt", "wt")
txt = str(txt.encode('utf-8'))
n = text_file.write(txt)
text_file.close()

#### read from .txt file
with codecs.open('Test.txt', 'r', 'utf-8') as f:
    for line in f:
        print(line)
ANSWER
Answered 2021-Jun-30 at 11:18
You are writing the file incorrectly. Rather than encoding the text, declare an encoding when you open the file, and write the text without encoding - Python will automatically encode it.
It should be
#### write in .txt file
with open("Test.txt", "wt", encoding='utf-8') as text_file:
    n = text_file.write(txt)
Unless you are using Python 2, you don't need to use codecs to open encoded files; again, you can declare the encoding in the open function:
with open("Test.txt", "rt", encoding='utf-8') as f:
    for line in f:
        print(line)
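For the data-frame part of the question, which the answer does not address directly, a hedged sketch along the following lines collects one row per PDF page instead of one concatenated string and saves it with an explicit UTF-8 encoding; it reuses the pdfplumber calls from the question, and the output file name is a placeholder.

import pdfplumber
import pandas as pd

pages_text = []
with pdfplumber.open(r'C:\Users\thmag\3rdPaperLDA\A1.pdf') as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        # extract_text() can return None for empty pages, hence the fallback
        pages_text.append({'page': page_number, 'text': page.extract_text() or ''})

df = pd.DataFrame(pages_text)
df.to_csv('extracted_text.csv', index=False, encoding='utf-8')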
QUESTION
Memory problems when using lapply for corpus creation
Asked 2021-Jun-05 at 05:53
My eventual goal is to transform thousands of pdfs into a corpus / document term matrix to conduct some topic modeling. I am using the pdftools package to import my pdfs and work with the tm package for preparing my data for text mining. I managed to import and transform one individual pdf, like this:
library(pdftools)
library(tm)

txt <- pdf_text("pdfexample.pdf")

# create corpus
txt_corpus <- Corpus(VectorSource(txt))

# Some basic text prep, with tm_map(), like:
txt_corpus <- tm_map(txt_corpus, tolower)

# create document term matrix
dtm <- DocumentTermMatrix(txt_corpus)
However, I am completely stuck with automating this process, and I have only limited experience with either loops or apply functions. My approach ran into memory problems when converting the raw pdf_text() output into a corpus, even though I tested my code with only 5 pdf files (total: 1.5 MB). R tried to allocate a vector of more than half a GB, which seems absolutely not right to me. My attempt looks like this:
# Create a list of all pdf paths
file_list <- list.files(path = "mydirectory",
                        full.names = TRUE,
                        pattern = "name*", # to import only specific pdfs
                        ignore.case = FALSE)

# Run a function that reads the pdf of each of those files:
all_files <- lapply(file_list, FUN = function(files) {
  pdf_text(files)
})

all_files_corpus = lapply(all_files,
                          FUN = Corpus(DirSource())) # That's where I run into memory issues
Am I doing something fundamentally wrong? I am not sure whether it is just a memory issue or whether there are easier approaches to my problem. At least, from what I gathered, lapply should be a lot more memory efficient than looping. But maybe there is more to it. I've tried to solve it on my own for days now, but nothing worked.
Grateful for any advice/hint on how to proceed!
Edit: I tried to execute the lapply with only one pdf, and R crashed again, even though I have no capacity problems at all when using the code mentioned first.
ANSWER
Answered 2021-Jun-05 at 05:52
You can write a function which has the series of steps that you want to execute on each pdf.
pdf_to_dtm <- function(file) {
  txt <- pdf_text(file)
  # create corpus
  txt_corpus <- Corpus(VectorSource(txt))
  # Some basic text prep, with tm_map(), like:
  txt_corpus <- tm_map(txt_corpus, tolower)
  # create document term matrix
  dtm <- DocumentTermMatrix(txt_corpus)
  dtm
}
Using lapply, apply the function to each file:
file_list <- list.files(path = "mydirectory",
                        full.names = TRUE,
                        pattern = "name*", # to import only specific pdfs
                        ignore.case = FALSE)

all_files_corpus <- lapply(file_list, pdf_to_dtm)
QUESTION
How can I replace emojis with text and treat them as single words?
Asked 2021-May-18 at 15:56
I have to do topic modeling in R on pieces of text containing emojis. Using the replace_emoji() and replace_emoticon() functions lets me analyze them, but there is a problem with the results.
A red heart emoji is translated as "red heart ufef". These words are then treated separately during the analysis and compromise the results. Terms like "heart" can have a very different meaning, as can be seen with "red heart ufef" and "broken heart".
The function replace_emoji_identifier() doesn't help either, as the identifiers make an analysis hard.
A dummy data set, reproducible by using dput() (including the step of forcing to lowercase):
Emoji_struct <- c(
  list(content = "🔥🔥 wow", "😮 look at that", "😤this makes me angry😤", "😍❤\ufe0f, i love it!"),
  list(content = "😍😍", "😊 thanks for helping", "😢 oh no, why? 😢", "careful, challenging ❌❌❌")
)
Current coding (data_orig is a list of several files):
library(textclean)
# The rest should be standard r packages for pre-processing

# pre-processing:
data <- gsub("'", "", data)
data <- replace_contraction(data)
data <- replace_emoji(data) # replace emoji with words
data <- replace_emoticon(data) # replace emoticon with words
data <- replace_hash(data, replacement = "")
data <- replace_word_elongation(data)
data <- gsub("[[:punct:]]", " ", data) # replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data)
data <- gsub("[[:digit:]]", "", data) # remove digits
data <- gsub("^[[:space:]]+", "", data) # remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) # remove whitespace at end of documents
data <- stripWhitespace(data)
Desired output:
[1] list(content = c("fire fire wow",
                     "facewithopenmouth look at that",
                     "facewithsteamfromnose this makes me angry facewithsteamfromnose",
                     "smilingfacewithhearteyes redheart \ufe0f, i love it!"),
         content = c("smilingfacewithhearteyes smilingfacewithhearteyes",
                     "smilingfacewithsmilingeyes thanks for helping",
                     "cryingface oh no, why? cryingface",
                     "careful, challenging crossmark crossmark crossmark"))
Any ideas? Lower cases would work, too. Best regards. Stay safe. Stay healthy.
ANSWER
Answered 2021-May-18 at 15:56
Answer
Replace the default conversion table in replace_emoji with a version where the spaces/punctuation is removed:
hash2 <- lexicon::hash_emojis
hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)

replace_emoji(Emoji_struct[,1], emoji_dt = hash2)
Example
Single character string:
replace_emoji("wow!😮 that is cool!", emoji_dt = hash2)
#[1] "wow! facewithopenmouth that is cool!"
Character vector:
replace_emoji(c("1: 😊", "2: 😍"), emoji_dt = hash2)
#[1] "1: smilingfacewithsmilingeyes "
#[2] "2: smilingfacewithhearteyes "
List:
library(magrittr)  # for the %>% pipe (assumption: not loaded by the snippets above)

list("list_element_1: 🔥", "list_element_2: ❌") %>%
  lapply(replace_emoji, emoji_dt = hash2)
#[[1]]
#[1] "list_element_1: fire "
#
#[[2]]
#[1] "list_element_2: crossmark "
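Beyond single strings, character vectors, and lists, the same replacement is often applied to a data-frame column. Here is a minimal sketch of that (my addition, not from the original answer), assuming the text lives in a column named content:

library(dplyr)   # for mutate() and the %>% pipe
df <- tibble::tibble(content = c("🔥🔥 wow", "😮 look at that"))
df %>% mutate(content = replace_emoji(content, emoji_dt = hash2))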
Rationale
To convert emojis to text, replace_emoji uses lexicon::hash_emojis as a conversion table (a hash table):
head(lexicon::hash_emojis)
#              x                         y
#1: <e2><86><95>             up-down arrow
#2: <e2><86><99>           down-left arrow
#3: <e2><86><a9>  right arrow curving left
#4: <e2><86><aa>  left arrow curving right
#5: <e2><8c><9a>                     watch
#6: <e2><8c><9b>            hourglass done
This is an object of class data.table. We can simply modify the y column of this hash table to remove all spaces and punctuation. Note that this also allows you to add new ASCII byte representations and an accompanying replacement string, as sketched below.
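For instance, here is a minimal sketch (my addition, not part of the original answer) of overriding one mapping in the modified table. The byte string below is assumed to be how lexicon::hash_emojis stores the UTF-8 bytes of U+1F525 (🔥), and "flame" is just an illustrative replacement:

hash2 <- data.table::copy(lexicon::hash_emojis)  # copy so the package data is left untouched
hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)
# Override the replacement text for the fire emoji (assumed byte representation).
hash2$y[hash2$x == "<f0><9f><94><a5>"] <- "flame"
replace_emoji("🔥 wow", emoji_dt = hash2)
# Expected, if the byte string matches the table entry: something like "flame  wow"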
QUESTION
Specify the output per topic to a specific number of words
Asked 2021-May-13 at 16:25
After conducting LDA topic modeling in R, some words end up with the same beta value. They are therefore listed together when plotting the results, which leads to overlapping and sometimes unreadable plots.
Is there a way to limit the number of words displayed per topic to a specific number? In my dummy data set, some words have the same beta values. I would like to tell R to display only 3 words per topic, or any other number as needed.
Currently, the code I am using to plot the results looks like this:
top_terms %>% # take the top terms
  group_by(topic) %>%
  mutate(top_term = term[which.max(beta)]) %>%
  mutate(term = reorder(term, beta)) %>%
  head(3) %>% # I tried this but that only works for the first topic
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ top_term, scales = "free") +
  labs(x = NULL, y = "Beta") + # no x label, change y label
  coord_flip() # turn bars sideways
I tried to solve the issue with head(3), which worked, but only for the first topic. What I would need is something similar that doesn't ignore all the other topics.
Best regards. Stay safe, stay healthy.
Note: top_terms is a tibble.
Sample data:
topic term      beta
(int) (chr)     (dbl)
1     book      0.9876
1     page      0.9765
1     chapter   0.9654
1     author    0.9654
2     sports    0.8765
2     soccer    0.8654
2     champions 0.8543
2     victory   0.8543
3     music     0.9543
3     song      0.8678
3     artist    0.7231
3     concert   0.7231
4     movie     0.9846
4     cinema    0.9647
4     cast      0.8878
4     story     0.8878
dput of sample data:
top_terms <- structure(list(topic = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
  3L, 3L, 3L, 4L, 4L, 4L, 4L), term = c("book", "page", "chapter",
  "author", "sports", "soccer", "champions", "victory", "music",
  "song", "artist", "concert", "movie", "cinema", "cast", "story"
  ), beta = c(0.9876, 0.9765, 0.9654, 0.9654, 0.8765, 0.8654, 0.8543,
  0.8543, 0.9543, 0.8678, 0.7231, 0.7231, 0.9846, 0.9647, 0.8878,
  0.8878)), row.names = c(NA, -16L), class = "data.frame")
ANSWER
Answered 2021-May-13 at 16:14
Here is what you can do:
library(dplyr)
library(ggplot2)

# take the top terms
graph_data <- top_terms %>%
  group_by(topic) %>%
  mutate(top_term = term[which.max(beta)]) %>%
  mutate(term = reorder(term, beta),
         # populate an index column that runs from 1 to the number of records in each topic
         index = seq_len(n())) %>%
  # keep at most the first three terms per topic
  filter(index <= 3)

graph_data
#> # A tibble: 12 x 5
#> # Groups:   topic [4]
#>    topic term       beta top_term index
#>    <int> <fct>     <dbl> <chr>    <int>
#>  1     1 book      0.988 book         1
#>  2     1 page      0.976 book         2
#>  3     1 chapter   0.965 book         3
#>  4     2 sports    0.876 sports       1
#>  5     2 soccer    0.865 sports       2
#>  6     2 champions 0.854 sports       3
#>  7     3 music     0.954 music        1
#>  8     3 song      0.868 music        2
#>  9     3 artist    0.723 music        3
#> 10     4 movie     0.985 movie        1
#> 11     4 cinema    0.965 movie        2
#> 12     4 cast      0.888 movie        3

graph_data %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ top_term, scales = "free") +
  labs(x = NULL, y = "Beta") + # no x label, change y label
  coord_flip() # turn bars sideways
Created on 2021-05-13 by the reprex package (v2.0.0)
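Not part of the original answer: if dplyr >= 1.0.0 is available (an assumption about your setup), slice_max() with with_ties = FALSE is a more compact way to cap the number of terms per topic, and it also sidesteps the tied-beta issue raised in the question:

library(dplyr)
library(ggplot2)

top_terms %>%
  group_by(topic) %>%
  mutate(top_term = term[which.max(beta)]) %>%
  # keep at most 3 terms per topic; ties in beta are broken arbitrarily
  slice_max(beta, n = 3, with_ties = FALSE) %>%
  ggplot(aes(reorder(term, beta), beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ top_term, scales = "free") +
  labs(x = NULL, y = "Beta") +
  coord_flip()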
QUESTION
Name topics in lda topic modeling based on beta values
Asked 2021-May-05 at 19:26
I'm currently trying to develop code for a paper I have to write. I want to conduct LDA-based topic modeling. I found some code repositories on GitHub and was able to combine them, adapting them slightly where necessary. Now I would like to add something that names each identified topic after the word with the highest beta value assigned to that topic. Any ideas? It's the first time I'm coding anything, so my expertise is quite limited.
Here's the section of the code where I wanted to insert the "naming part":
# get the top ten terms for each topic
top_terms <- topics %>%
  group_by(topic) %>% # treat each topic as a different group
  top_n(10, beta) %>% # get top 10 words
  ungroup() %>%
  arrange(topic, -beta) # arrange words in descending informativeness

# plot the top ten terms for each topic in order
top_terms %>%
  mutate(term = reorder(term, beta)) %>% # sort terms by beta value
  ggplot(aes(term, beta, fill = factor(topic))) + # plot beta by theme
  geom_col(show.legend = FALSE) + # as bar plot
  facet_wrap(~ topic, scales = "free") + # separate plot for each topic
  labs(x = NULL, y = "Beta") + # no x label, change y label
  coord_flip() # turn bars sideways
I tried to insert it in this section of the code, but that didn't work. I found this: R topic modeling: lda model labeling function, but that didn't work for me, or I didn't understand it.
I can't disclose more of the code because there is some sensitive data in there, but some expertise from the community would be highly appreciated nonetheless.
Best regards and stay safe.
Note: top_terms is a tibble. I tried to come up with some data off the top of my head. The data in top_terms are structured exactly like this:
topic term      beta
(int) (chr)     (dbl)
1     book      0.9876
1     page      0.9765
1     chapter   0.9654
2     sports    0.8765
2     soccer    0.8654
2     champions 0.8543
3     music     0.9543
3     song      0.8678
3     artist    0.7231
4     movie     0.9846
4     cinema    0.9647
4     cast      0.8878
ANSWER
Answered 2021-May-05 at 19:26
You can make an additional column in your data that, after grouping by topic, takes the name of the term with the highest beta.
suppressPackageStartupMessages({
  library(ggplot2)
  library(tibble)
  library(dplyr)
})

# Just replicating example data
top_terms <- tibble(
  topic = rep(1:4, each = 3),
  term = c("book", "page", "chapter",
           "sports", "soccer", "champions",
           "music", "song", "artist",
           "movie", "cinema", "cast"),
  beta = c(0.9876, 0.9765, 0.9654,
           0.8765, 0.8654, 0.8543,
           0.9543, 0.8678, 0.7231,
           0.9846, 0.9647, 0.8878)
)

top_terms %>%
  group_by(topic) %>%
  mutate(top_term = term[which.max(beta)]) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ top_term, scales = "free") +
  labs(x = NULL, y = "Beta") +
  coord_flip()
Created on 2021-05-05 by the reprex package (v1.0.0)
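A small extension (my own sketch, not from the original post): if you would rather have each facet titled with the topic number plus its naming word, e.g. "1: book", build the label with paste0():

top_terms %>%
  group_by(topic) %>%
  # label each topic with its number and its highest-beta term
  mutate(topic_label = paste0(topic, ": ", term[which.max(beta)])) %>%
  ggplot(aes(reorder(term, beta), beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic_label, scales = "free") +
  labs(x = NULL, y = "Beta") +
  coord_flip()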
QUESTION
Calculating optimal number of topics for topic modeling (LDA)
Asked 2021-Apr-27 at 16:56
I am going to do topic modeling via LDA. I ran my commands to see the optimal number of topics, and the resulting plot (not shown here) is a bit different from any other plot I have ever seen. Do you think it is okay, or is it better to use algorithms other than LDA? It is worth mentioning that when I ran my commands to visualize the topic keywords for 10 topics, the plot showed 2 main topics and the others overlapped strongly. Is there any valid range for coherence?
Many thanks for sharing your comments, as I am a beginner in topic modeling.
ANSWER
Answered 2021-Apr-27 at 16:56
Shameless self-promotion: I suggest you use the OCTIS library: https://github.com/mind-Lab/octis. It allows you to run different topic models and optimize their hyperparameters (including the number of topics) in order to select the best result. (An R sketch of comparing candidate topic counts follows the references below.)
There might be many reasons why you get those results, but here are some hints and observations:
- Make sure that you've preprocessed the text appropriately. This usually includes removing punctuation and numbers, removing stopwords and words that are too frequent or rare, (optionally) lemmatizing the text. Preprocessing is dependent on the language and the domain of the texts.
- LDA is a probabilistic model, which means that if you re-train it with the same hyperparameters, you will get different results each time. A good practice is to run the model with the same number of topics multiple times and then average the topic coherence.
- There are a lot of topic models, and LDA usually works fine. The choice of topic model depends on the data you have. For example, if you are working with tweets (i.e. short texts), I wouldn't recommend LDA because it cannot handle sparse texts well.
- Check how you set the hyperparameters. They may have a huge impact on the performance of the topic model.
- The range for coherence (I assume you used NPMI, which is the best known) is between -1 and 1, but values very close to the upper and lower bounds are quite rare.
References: https://www.aclweb.org/anthology/2021.eacl-demos.31/
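To make the number-of-topics comparison concrete in R (the language of the other examples on this page), here is a hedged sketch using the ldatuning package. This is my own illustration rather than part of the answer above, and it assumes dtm is an existing DocumentTermMatrix built from your preprocessed corpus:

library(ldatuning)   # compares model-selection metrics across candidate topic counts
library(topicmodels) # LDA implementation used under the hood

result <- FindTopicsNumber(
  dtm,                                        # assumed DocumentTermMatrix
  topics  = seq(from = 2, to = 20, by = 1),   # candidate numbers of topics
  metrics = c("Griffiths2004", "CaoJuan2009",
              "Arun2010", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 77),                  # fix the seed; re-run with several seeds and average, as suggested above
  verbose = TRUE
)

FindTopicsNumber_plot(result)  # choose a number of topics where the metrics level off or peak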
Community Discussions contain sources that include Stack Exchange Network
Tutorials and Learning Resources in Topic Modeling
Tutorials and Learning Resources are not available at this moment for Topic Modeling