quanteda | An R package for the Quantitative Analysis of Textual Data
kandi X-RAY | quanteda Summary
An R package for managing and analyzing text, created by Kenneth Benoit. Supported by the European Research Council grant ERC-2011-StG 283794-QUANTESS. quanteda 3.0 is a major release that improves functionality, completes the modularisation of the package begun in v2.0, further improves function consistency by removing previously deprecated functions, and enhances workflow stability and consistency by deprecating some shortcut steps built into some functions. See the changelog for a full list of the changes.
Community Discussions
Trending Discussions on quanteda
QUESTION
I am running a simple unsupervised learning model on an Arabic text corpus, and the model is running well. However, I am having an issue with the plots: they print the Arabic characters from left to right, rather than in the correct right-to-left order.
Here are the packages I am using:
...ANSWER
Answered 2022-Feb-24 at 02:07
If you're using an old version of R (3.2 or earlier), those versions do not handle Unicode properly. Try installing the latest version of R from https://cran.r-project.org/ and, if required, reinstall all packages.
QUESTION
I'd like to examine the Psychological Capital (a construct consisting of four dimensions, namely hope, optimism, efficacy and resiliency) of founders using computer-aided text analysis in R. So far I have pulled tweets from various users into R. The data frame consists of 2,130 tweets from 5 different users in different periods. The data frame is called before_failure. [Picture of original data frame]
I have then used the quanteda package to create a corpus, performed tokenization on it and removed redundant punctuation/numbers/symbols:
...ANSWER
Answered 2022-Feb-01 at 17:16
The easiest way to do this is to use tokens_lookup() with a category for tokens not matched, then to compile this into a dfm that you then convert to term proportions within each document.
To use a reproducible example from built-in quanteda objects, the process would be the following. (You can substitute your own corpus and dictionary and the code should work fine.)
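A minimal sketch of that process, using the built-in `data_corpus_inaugural` corpus; the dictionary keys and patterns here are purely illustrative stand-ins for the asker's PsyCap dictionary:

```r
library(quanteda)

# Illustrative dictionary; substitute your own PsyCap dictionary
dict <- dictionary(list(
  hope   = c("hope*", "aspir*"),
  effort = c("effort*", "work*")
))

# Tokenize a built-in corpus, look up dictionary categories,
# and send unmatched tokens to an "other" key
toks <- tokens(data_corpus_inaugural[1:3], remove_punct = TRUE)
toks_dict <- tokens_lookup(toks, dictionary = dict, nomatch = "other")

# Compile into a dfm, then convert counts to within-document proportions
dfmat <- dfm(toks_dict)
dfm_weight(dfmat, scheme = "prop")
```

The `nomatch` key is what lets the category proportions sum to 1 within each document, since unmatched tokens still contribute to the denominator.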
QUESTION
I have created a quanteda corpus called readtext_corpus with 190 texts. I would like to count the total number of tokens or words in the corpus. I tried the function ntoken, which gives the number of words per text, not the total number of words across all 190 texts.
...ANSWER
Answered 2022-Feb-01 at 00:26
You can simply use the sum() function. Here is an example:
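A one-line sketch, assuming `readtext_corpus` is the asker's corpus object:

```r
library(quanteda)

# ntoken() returns a named numeric vector with one count per document;
# summing that vector gives the corpus-wide total
total_tokens <- sum(ntoken(readtext_corpus))
total_tokens
```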
QUESTION
required_packs <- c("pdftools", "readxl", "pdfsearch", "tidyverse", "data.table", "stringr", "tidytext", "dplyr", "igraph", "NLP", "tm", "quanteda", "ggraph", "topicmodels", "lasso2", "reshape2", "FSelector")
new_packs <- required_packs[!(required_packs %in% installed.packages()[, "Package"])]
if (length(new_packs)) install.packages(new_packs)
for (i in 1:length(required_packs)) {
  sapply(required_packs[i], require, character.only = T)
}
...ANSWER
Answered 2021-Dec-27 at 20:12
I think the problem is that you used T when you meant TRUE. For example,
QUESTION
I have 6 different data frames, each holding a cosine similarity between a set of documents. I have already calculated the cosine similarities; I just need to pull out the right variable from each of the six and save it. The code to do this looks like this:
...ANSWER
Answered 2021-Dec-02 at 16:02
You can use get(object_name) to retrieve an object by name.
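A sketch of the pattern, with hypothetical object names `cos_sim_1` through `cos_sim_6` (each assumed to be a data frame with a `similarity` column; the asker's actual names and column will differ):

```r
# Build each object's name as a string, fetch it with get(),
# then pull out the variable of interest
extracted <- lapply(paste0("cos_sim_", 1:6), function(nm) {
  get(nm)$similarity
})
names(extracted) <- paste0("cos_sim_", 1:6)
```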
QUESTION
I am struggling to 'translate' a regex expression from stringi/stringr to quanteda's kwic() function.
How can I get all instances of "Jane Mayer", regardless of whether she has a middle name or not. Note that I don't have a list of all existing middle names in the data. So defining multiple patterns (one for each middle name) wouldn't be possible.
Many thanks!
...ANSWER
Answered 2021-Nov-28 at 23:11
It seems you need to pass another pattern to match exactly Jane Mayer:
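A minimal sketch of the idea (example sentences invented for illustration): with kwic()'s default glob matching, `"*"` inside `phrase()` stands for any single token, so supplying both patterns should catch the name with and without a middle name:

```r
library(quanteda)

toks <- tokens(c("I spoke with Jane Mayer today.",
                 "Jane Ridgeway Mayer wrote the piece."))

# Two phrase patterns: one for the bare name, one with a wildcard
# middle token ("*" matches exactly one token under glob matching)
kwic(toks, pattern = phrase(c("Jane Mayer", "Jane * Mayer")))
```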
QUESTION
I have a corpus object that I converted into a tokens object. I then filtered this object to remove words and unify their spelling. For my further workflow, I again need a corpus object. How can I construct this from the tokens object?
...ANSWER
Answered 2021-Oct-17 at 08:53
You could paste the tokens together to return a new corpus. (Although this may not be the best approach if your goal is to get back to a corpus so that you can use corpus_reshape().)
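A sketch of that pasting approach, using a built-in corpus in place of the asker's filtered tokens object; note that it discards the original spacing and punctuation:

```r
library(quanteda)

toks <- tokens(data_corpus_inaugural[1:2], remove_punct = TRUE)

# Paste each document's tokens back into one string per document,
# then rebuild a corpus from the resulting character vector
corp_new <- corpus(vapply(as.list(toks), paste, character(1), collapse = " "))
```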
QUESTION
I am using the quanteda suite of packages to preprocess some text data. I want to incorporate collocations as features and decided to use the textstat_collocations function. According to the documentation, and I quote:
"The tokens object . . . . While identifying collocations for tokens objects is supported, you will get better results with character or corpus objects due to relatively imperfect detection of sentence boundaries from texts already tokenized."
This makes perfect sense, so here goes:
...ANSWER
Answered 2021-Sep-04 at 09:21
The problem is that you have already compounded the elements of the collocations into a single "token" containing a space, but by supplying the phrase() wrapper in tokens_compound(), you are telling tokens_replace() to look for two sequential tokens, not one token containing a space.
The way to get what you want is to make the lemmatised replacement match the collocation.
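A toy sketch of the distinction (the collocation and replacement are invented for illustration): once the collocation is compounded into one space-containing token, the replacement pattern must also be that single token, not a phrase() of two tokens:

```r
library(quanteda)

toks <- tokens("the united states of america")

# Compound the collocation using a space as the concatenator,
# producing the single token "united states"
toks <- tokens_compound(toks, phrase("united states"), concatenator = " ")

# Replace the space-containing token directly -- no phrase() wrapper,
# since the pattern is now one token, not a two-token sequence
tokens_replace(toks, "united states", "usa")
```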
QUESTION
I am trying to measure the number of times that different words co-occur with a particular term in collections of Chinese newspaper articles from each quarter of a year. To do this, I have been using Quanteda and written several R functions to run on each group of articles. My work steps are:
- Group the articles by quarter.
- Produce a feature co-occurrence matrix (FCM) for the articles in each quarter (Function 1).
- Take the column from this matrix for the 'term' I am interested in and convert this to a data.frame (Function 2)
- Merge the data.frames for each quarter together, then produce a large csv file with a column for each quarter and a row for each co-occurring term.
This seems to work okay. But I wondered if anybody more skilled in R might be able to check what I am doing is correct, or might suggest a more efficient way of doing it?
Thanks for any help!
...ANSWER
Answered 2021-Aug-13 at 09:28
If you are interested in counting co-occurrences within a window for specific target terms, a better way is to use the window argument of tokens_select(), and then to count occurrences from a dfm built on the window-selected tokens.
QUESTION
My ultimate goal is to create a quanteda dictionary to use for topic classification on text data.
However, my topic keywords are stored in a somewhat different format: I have a column of about 4000 keywords and a second column that specifies the topic each keyword belongs to. Note that the topics do not contain equal numbers of keywords. My data looks like this:
...ANSWER
Answered 2021-Aug-12 at 15:39
If your data is in a data.frame like topics (see the data section), you can quickly get the data into the list format you want using the function split().
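A sketch with a hypothetical data.frame mirroring the question's two-column layout (keyword and topic names invented): split() groups the keyword column into a named list by topic, which dictionary() accepts directly:

```r
library(quanteda)

# Hypothetical stand-in for the asker's ~4000-row keyword table
topics <- data.frame(
  keyword = c("vote", "ballot", "tax", "budget"),
  topic   = c("elections", "elections", "economy", "economy")
)

# split() returns a named list of character vectors, one per topic,
# with no requirement that the topics be equal in size
dict <- dictionary(split(topics$keyword, topics$topic))
```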
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.