ansj_fast_lda | A Java implementation of LDA | Topic Modeling library
kandi X-RAY | ansj_fast_lda Summary
Example usage, from the project README:

```java
File[] files = new File("/users/ansj/desktop/搜索组分享/文本分类语料库").listFiles();
Analysis dicAnalysis = DicAnalysis.getInstance(new File("library/result_1_3.dic"), "utf-8");
LDA lda = new LDA(dicAnalysis, new LDAGibbsModel(10, 5, 0.1, 100, Integer.MAX_VALUE, Integer.MAX_VALUE));
BufferedReader newReader = Files.newReader(new File("/users/ansj/documents/temp/computer_300000.txt"), Charsets.UTF_8);
```

Sample output (top words per topic with their probabilities):

```
Topic 0 : 教育 0.024076469669325036 学校 0.016010479660942534 学生 0.014581063710089938 工作 0.010251975401793508 教师 0.010160084376381556 发展 0.009863991072276375 社会 0.007607555892716207 教学 0.006249610739406241 建设 0.005902466865627754 提高 0.0053613308270907 学习 0.004871245358226952 孩子 0.0041769576106699775 培养 0.004136117154931332 实施 0.003727712597544876 进行 0.003697082255740892 管理 0.0036358215721329235 思想 0.003349938381962404 国家 0.003268257470485113 改革 0.003237627128681129 活动 0.003186576559007822
Topic 1 : 环境 0.016325714890308193 光华 0.01205996899776804 日月 0.011455375091738728 污染 0.005594173058287891 城市 0.005510201682450487 环保 0.004536133722736594 垃圾 0.004485750897234152 文章 0.004401779521396747 信区 0.004267425320056899 来源 0.004149865393884533 发信人 0.004133071118717052 阅读 0.004133071118717052 fudan 0.00409948256838209 edu 0.004082688293214609 环境保护 0.003998716917377205 发信站 0.0039147455415398 cn 0.003897951266372319 返回 0.003864362716037357 讨论区 0.003847568440869876 药物 0.0037971856153674335
Topic 2 : 网络 0.009624681710355278 规定 0.008204120848487257 病毒 0.007501477841541783 管理 0.006096191827650836 安全 0.00591289365192593 软件 0.005576846996430269 计算机 0.005240800340934607 使用 0.0051033267091409274 文件 0.004874203989484795 光华 0.004843654293530644 日月 0.0045839818779203605 用户 0.004538157333989134 进行 0.004477057942080832 windows 0.004171560982539322 单位 0.0041410112865851705 应当 0.004125736438608095 微软 0.0040799118946768685 信息 0.0040188125027685664 程序 0.003896613718951962 机动车 0.00383551432704366
Topic 3 : 新华社 0.0126322893839167 中国 0.012225714250864987 主席 0.010409678656567339 问题 0.00985402597473 国家 0.008756273115490376 总统 0.008580090557834635 今天 0.007685625265120868 访问 0.00730615514093927 人民 0.007238392618763984 举行 0.0071435250877185845 会议 0.006533662388141017 表示 0.0065065573792709025 苏联 0.006289717308309989 合作 0.006127087255089304 两国 0.006032219724043905 记者 0.005978009706303676 关系 0.005842484661953105 发展 0.005747617130907706 美国 …
```
Top functions reviewed by kandi - BETA
- Trains the model with the given charset
- Gets the name
- Fills the full topic vector
- Trains the model
- Saves the model to a file
- Samples a topic
- Adds a vector to the model
- Updates the topic for a vector
- Removes a vector
- Gets a list of words from a reader
- Filters a term
- Main entry point
- Reads all words from the reader into a list of words
- Gets an analysis object for a specified file
- Initializes the forest
- Main entry point for testing
- Updates the phi parameters
- Reads a system filter from the system file
ansj_fast_lda Key Features
ansj_fast_lda Examples and Code Snippets
Community Discussions
Trending Discussions on Topic Modeling
QUESTION
I am trying to use a pre-trained model from TensorFlow Hub instead of frequency-based vectorization techniques for word embedding before passing the resulting feature vector to the LDA model.
I followed the steps for the TensorFlow model, but I got this error upon passing the resultant feature vector to the LDA model:
...ANSWER
Answered 2022-Feb-24 at 09:31
Since the `fit` function of `LatentDirichletAllocation` does not accept an array with negative values, I recommend applying softplus to the embeddings.
Here is the code snippet:
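A minimal sketch of that approach, assuming a TF-Hub sentence-embedding module and scikit-learn's `LatentDirichletAllocation`; the module URL and document list are illustrative:

```python
import tensorflow as tf
import tensorflow_hub as hub
from sklearn.decomposition import LatentDirichletAllocation

# Illustrative TF-Hub embedding module; any sentence encoder works the same way.
embed = hub.load("https://tfhub.dev/google/nnlm-en-dim50/2")
docs = ["topic models find latent themes", "word embeddings capture meaning"]

embeddings = embed(docs)                         # may contain negative values
features = tf.math.softplus(embeddings).numpy()  # softplus makes every entry positive

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(features)         # no longer rejects a negative array
print(doc_topics)
```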
QUESTION
I am new to using LSI with Python and the Gensim + Scikit-learn tools. I was able to perform topic modeling on a corpus using LSI from both the Scikit-learn and Gensim libraries; however, with the Gensim approach I was not able to display the document-to-topic mapping.
Here is my work using Scikit-learn LSI, where I successfully displayed the document-to-topic mapping:
...ANSWER
Answered 2022-Feb-22 at 19:27
To get the representation of a document (as a bag-of-words) from a trained `LsiModel` as a vector of topics, use Python dict-style bracket access (`model[bow]`).
For example, to get the topics for the 1st item in your training data, you can use:
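A minimal sketch with a toy corpus (the documents and model parameters are illustrative):

```python
from gensim.corpora import Dictionary
from gensim.models import LsiModel

# Toy tokenized corpus; replace with your own documents.
texts = [["human", "computer", "interaction"],
         ["graph", "trees", "minors"],
         ["graph", "minors", "survey"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = LsiModel(corpus, id2word=dictionary, num_topics=2)

# Bracket access maps one bag-of-words document to its topic vector.
print(lsi[corpus[0]])

# Document-to-topic mapping for the whole training corpus:
for i, bow in enumerate(corpus):
    print(i, lsi[bow])
```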
QUESTION
I am trying to understand how Top2Vec works. I have some questions about the code that I could not find an answer for in the paper. A summary of what the algorithm does is that it:
- embeds words and documents in the same semantic space and normalizes them. This space usually has more than 300 dimensions.
- projects them into a 5-dimensional space using UMAP with the cosine metric.
- creates topics as centroids of clusters found by HDBSCAN with the Euclidean metric on the projected data.
What troubles me is that they normalize the topic vectors. The output of UMAP is not normalized, however, and normalizing the topic vectors will probably move them out of their clusters. This seems inconsistent with the paper, which describes the topic vectors as the arithmetic mean of all document vectors that belong to the same topic.
This leads to two questions:
How are they going to calculate the nearest words to find the keywords of each topic, given that they altered the topic vectors by normalization?
After creating the topics as clusters, they deduplicate very similar topics using cosine similarity. That makes sense with normalized topic vectors, but at the same time it extends the inconsistency that normalizing the topic vectors introduced. Am I missing something here?
...ANSWER
Answered 2022-Feb-16 at 16:13
I got the answer to my questions from the source code. I was going to delete the question, but I will leave the answer anyway.
Here is the part I missed and got wrong in my question: topic vectors are the arithmetic mean of all document vectors that belong to the same topic, and they live in the same semantic space as the word and document vectors.
That is why it makes sense to normalize them, since all word and document vectors are normalized, and to use the cosine metric when looking for duplicated topics in the original higher-dimensional semantic space.
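A small numpy sketch of why this is consistent: when the document vectors are unit-normalized, normalizing their arithmetic mean only rescales it, and cosine similarity against unit vectors reduces to a dot product (the dimensions and data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Document vectors on the unit sphere, as in the shared semantic space.
docs = rng.normal(size=(5, 300))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# Topic vector: arithmetic mean of its documents, then re-normalized.
topic = docs.mean(axis=0)
topic /= np.linalg.norm(topic)

# With unit vectors, cosine similarity is a plain dot product, so the
# nearest words or documents can be ranked by a matrix-vector product.
print(docs @ topic)
```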
QUESTION
I am trying to extract topic scores for documents in my dataset after using an LDA model. Specifically, I have followed most of the code from here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
I have completed the topic model and have the results I want, but the provided code only gives the most dominant topic for each document. Is there a simple way to modify the following code to give me the scores for, say, the 5 most dominant topics?
...ANSWER
Answered 2021-Dec-10 at 10:33
Right, this is a crusty example because you haven't provided data to reproduce it, but using gensim's testing corpus, texts, and dictionary we can do:
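A sketch of what that could look like with gensim's bundled test corpus, `common_texts` (the model parameters are illustrative):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.test.utils import common_texts  # gensim's small testing corpus

dictionary = Dictionary(common_texts)
corpus = [dictionary.doc2bow(text) for text in common_texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=10, random_state=0)

# For each document, fetch all topic scores and keep the 5 largest.
for i, bow in enumerate(corpus):
    topics = lda.get_document_topics(bow, minimum_probability=0.0)
    top5 = sorted(topics, key=lambda t: t[1], reverse=True)[:5]
    print(i, top5)
```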
QUESTION
I am using pyLDAvis along with gensim.models.LdaMulticore for topic modeling, with 10 topics in total. When I visualize the results with pyLDAvis, there is a slider called lambda with the explanation "Slide to adjust relevance metric". I would like to extract the list of words for each topic separately at lambda = 0.1, but I cannot find a way to adjust lambda for keyword extraction in the documentation.
I am using these lines:
...ANSWER
Answered 2021-Nov-24 at 10:43
You may want to read this GitHub page: https://nicharuc.github.io/topic_modeling/
According to this example, your code could go like this:
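A sketch following that example: pyLDAvis exposes per-term `logprob` and `loglift` values in the prepared data, from which the relevance at any lambda can be computed (`lda_model`, `corpus`, and `dictionary` are assumed to exist as in the question):

```python
import pyLDAvis.gensim_models as gensimvis

# Assumes lda_model, corpus and dictionary already exist, as in the question.
prepared = gensimvis.prepare(lda_model, corpus, dictionary)

lam = 0.1
info = prepared.topic_info
# relevance(w, t | lambda) = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w))
info["relevance"] = lam * info["logprob"] + (1 - lam) * info["loglift"]

# Ten most relevant words per topic at lambda = 0.1.
for topic in info["Category"].unique():
    if topic == "Default":  # skip the corpus-wide default panel
        continue
    words = (info[info["Category"] == topic]
             .sort_values("relevance", ascending=False)
             .head(10)["Term"]
             .tolist())
    print(topic, words)
```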
QUESTION
Working with the OCTIS package, I am running a CTM topic model on the BBC (default) dataset.
...ANSWER
Answered 2021-Oct-11 at 15:19
I'm one of the developers of OCTIS.
Short answer:
If I understood your problem correctly, you can fix this issue by modifying the parameter `bert_path` of CTM and making it dataset-specific, e.g. `CTM(bert_path="path/to/store/the/files/" + data)`.
TL;DR: I think the problem is that CTM generates and stores the document representations in files with a default name. If those files already exist, it reuses them without generating new representations, even if the dataset has changed in the meantime. CTM then raises this issue because it is combining the bag-of-words representation of one dataset with the contextualized representations of another, and the two have different dimensions. Naming the files after the dataset allows the model to retrieve the correct representations.
If you have other issues, please open a GitHub issue in the repo. I've found out about this issue by chance.
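A sketch of the suggested fix, assuming OCTIS's bundled datasets (the dataset names and storage path are illustrative):

```python
from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM

for data in ["BBC_News", "20NewsGroup"]:  # illustrative dataset names
    dataset = Dataset()
    dataset.fetch_dataset(data)

    # Make the cached contextualized representations dataset-specific, so CTM
    # never reuses embeddings that were generated for a different corpus.
    model = CTM(num_topics=10, bert_path="path/to/store/the/files/" + data)
    output = model.train_model(dataset)
```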
QUESTION
ANSWER
Answered 2021-Sep-20 at 01:19
You should pass a column of data to the `fit_transform` function. Here is an example:
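The question body is truncated above, but a sketch of the likely pattern, assuming a scikit-learn text vectorizer and a pandas DataFrame (the column name is illustrative):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"text": ["topic models are useful",
                            "gensim and sklearn both implement LDA"]})

vectorizer = CountVectorizer()

# Pass the column (a 1-D Series of strings), not the whole DataFrame:
X = vectorizer.fit_transform(df["text"])  # correct
# vectorizer.fit_transform(df)            # wrong: iterates over column names
print(X.shape)
```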
QUESTION
ANSWER
Answered 2021-Sep-13 at 08:30
It's a matter of scale. If you have 1,000 types (i.e., distinct dictionary words), you might in the worst case (which will not actually happen) end up with 1,000,000 bigrams and 1,000,000,000 trigrams. These numbers are hard to manage, especially as a realistic text will contain far more types.
The gains in accuracy/performance don't outweigh the computational cost here.
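The arithmetic behind those numbers, as a quick check:

```python
# Worst-case n-gram vocabulary growth over V = 1000 types.
V = 1000
print(V ** 2)  # 1,000,000 possible bigrams
print(V ** 3)  # 1,000,000,000 possible trigrams
```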
QUESTION
Starting from the following example
...ANSWER
Answered 2021-Sep-08 at 11:20
You can compute the explained variance for a range of possible numbers of components. The maximum number of components is the size of your vocabulary.
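A sketch of that sweep, assuming scikit-learn's `TruncatedSVD` as the LSI implementation (the toy documents are illustrative):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["topic models find latent themes",
        "singular value decomposition underlies LSI",
        "explained variance grows with the number of components"]
X = TfidfVectorizer().fit_transform(docs)

# In principle k can grow to the vocabulary size; on a tiny corpus the
# matrix rank (min of n_docs and n_terms) is the practical ceiling.
max_k = min(X.shape) - 1
for k in range(1, max_k + 1):
    svd = TruncatedSVD(n_components=k, random_state=0)
    svd.fit(X)
    print(k, svd.explained_variance_ratio_.sum())
```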
QUESTION
I am doing a topic modelling task with LDA, and I am getting 10 components with 15 top words each:
...ANSWER
Answered 2021-Jun-23 at 08:01
If I understand correctly, you have a dataframe with all the values and you want to keep the top 10 in each row, setting the remaining values to 0.
Here we transform each row by:
- taking its 10 highest values
- reindexing to the row's original index (the columns of the dataframe) and filling the gaps with 0s:
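A sketch of those two steps using `DataFrame.apply` (the data and shape are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Illustrative weight matrix: 4 rows (documents) x 15 columns (words).
df = pd.DataFrame(rng.random((4, 15)), columns=[f"w{i}" for i in range(15)])

# Keep each row's 10 largest values, reindex back to all columns, fill with 0s.
top10 = df.apply(lambda row: row.nlargest(10).reindex(row.index, fill_value=0),
                 axis=1)
print(top10)
```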
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install ansj_fast_lda
You can use ansj_fast_lda like any standard Java library. Include the jar files in your classpath. You can also use any IDE to run and debug the ansj_fast_lda component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org; for Gradle installation, please refer to gradle.org.