lucene | A Chinese-search demo and study notes based on Lucene and IKAnalyzer | Search Engine library
kandi X-RAY | lucene Summary
IKAnalyzer is an open-source, lightweight Chinese word-segmentation toolkit developed in Java. It began as a Chinese word-segmentation component built around the open-source Lucene project, combining dictionary-based segmentation with grammar-analysis algorithms. The newer IKAnalyzer 3.0 has evolved into a general-purpose word-segmentation component for Java, independent of the Lucene project, while still providing a default optimized implementation for Lucene.
Community Discussions
Trending Discussions on lucene
QUESTION
I tried the official Lucene demo by running IndexFiles with the arguments -index . -docs ., and the console output shows that files including pom.xml, *.java, and *.class are added to the index.
Then I tried SearchFiles with the arguments -index . -query "lucene AND main", and the console prints only IndexFiles.class, SearchFiles.class, and IndexFiles.java, but not SearchFiles.java (which I think should be one of the search results).
ANSWER
Answered 2021-Jun-09 at 12:00
Your search results are correct (for the .java files, at least). The sample code uses the StandardAnalyzer which, in turn, uses the StandardTokenizer. The StandardTokenizer splits input text into tokens using the rules described in this document; see, for example, section 4 of that document.
When you have text such as the following in the source files
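If you want to check which tokens the StandardAnalyzer actually produces for a given string, a small sketch like the one below can help (the class name, field name, and sample input are illustrative, not part of the demo):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TokenDebug {
        public static void main(String[] args) throws Exception {
            try (Analyzer analyzer = new StandardAnalyzer()) {
                // Tokenize a sample line the way the demo would before indexing it
                TokenStream ts = analyzer.tokenStream("contents", "org.apache.lucene.demo.SearchFiles");
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term.toString()); // one token per line
                }
                ts.end();
                ts.close();
            }
        }
    }

Running this over the exact lines of SearchFiles.java shows which terms actually made it into the index, which is usually the quickest way to explain a surprising hit or miss.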
QUESTION
I am trying to run Solr 8.5.2 locally. When starting it, I get the following error:
...ANSWER
Answered 2021-Jun-02 at 08:08
The issue was with web.xml: it had servlet entries for ZookeeperInfoServlet. Removing those fixed the issue.
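For illustration, the entries to delete would look roughly like the following; this is an assumed reconstruction (the servlet-name and url-pattern are guesses), not the poster's actual web.xml:

    <!-- Assumed shape of the offending web.xml entries; remove both elements -->
    <servlet>
      <servlet-name>Zookeeper</servlet-name>
      <servlet-class>org.apache.solr.servlet.ZookeeperInfoServlet</servlet-class>
    </servlet>
    <servlet-mapping>
      <servlet-name>Zookeeper</servlet-name>
      <url-pattern>/zookeeper</url-pattern>
    </servlet-mapping>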
QUESTION
I use the following code to create an edge
...ANSWER
Answered 2021-May-31 at 15:47
Edge IDs in JanusGraph are stored using a special class called RelationIdentifier, which contains a lot more information than just the ID itself. Its string form is a "UUID-like" identifier, and you can get the other information from the class. Below is an example using a simple 'inmemory' JanusGraph from the Gremlin Console.
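The console session itself is not reproduced above; as a rough Java analogue of the same inspection (the vertex and edge labels, and the 'inmemory' backend setting, are assumptions), something like this shows what an edge ID carries:

    import org.apache.tinkerpop.gremlin.structure.Edge;
    import org.apache.tinkerpop.gremlin.structure.Vertex;
    import org.janusgraph.core.JanusGraph;
    import org.janusgraph.core.JanusGraphFactory;

    public class EdgeIdDemo {
        public static void main(String[] args) {
            // Open a throwaway in-memory graph, as in the answer's console example
            JanusGraph graph = JanusGraphFactory.build()
                    .set("storage.backend", "inmemory")
                    .open();
            Vertex a = graph.addVertex("person");
            Vertex b = graph.addVertex("person");
            Edge e = a.addEdge("knows", b);

            // The edge ID is a RelationIdentifier, not a plain number
            Object id = e.id();
            System.out.println(id);                      // the "UUID-like" string form
            System.out.println(id.getClass().getName()); // ...graphdb.relations.RelationIdentifier

            // The string form round-trips, so it can be used to look the edge up again
            System.out.println(graph.traversal().E(id.toString()).next());
            graph.close();
        }
    }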
QUESTION
I am trying to get the second-to-last value in each row of a data frame, meaning the first job a person has had (Job1_latest is the most recent job; people have had different numbers of jobs in the past, and I want to get the first one). I managed to get the last value per row with the code below:
first_job <- function(x) tail(x[!is.na(x)], 1)  # last non-NA value in the row
first_job <- apply(data, 1, first_job)          # apply row-wise
...ANSWER
Answered 2021-May-11 at 13:56
You can take the value that is next to the last non-NA value.
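The answer's code is not shown above; a minimal sketch of the described approach (keeping the second-to-last non-NA value; the helper name is mine) would be:

    # Keep the last two non-NA values, then take the earlier of the two
    second_last <- function(x) {
      v <- x[!is.na(x)]
      tail(v, 2)[1]
    }
    first_job <- apply(data, 1, second_last)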
QUESTION
I have seen that every time I have a high-CPU problem with ES, it's always the Lucene Merge Thread.
From what I understand, segments are already sorted, so you are just merging two sorted segments each time, à la merge sort's merge step. Why would merging be so costly? Or am I missing something?
...ANSWER
Answered 2021-May-12 at 09:42
There are a couple of factors that make the merge process costly in IO and CPU:
- Lucene uses the skip list data structure, and skip lists can be costly to merge; there is a good article about skip-list merging here.
- Several merges can happen in parallel at the same time.
- Lucene needs to create a third segment and merge both source segments into it, so you need enough disk space for that.
There is a good blog post about segment merging here.
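If merge load is the bottleneck, the relevant knobs live on the IndexWriterConfig. A hedged sketch of where they are set (the values are illustrative, not recommendations):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.ConcurrentMergeScheduler;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.store.FSDirectory;

    public class MergeTuning {
        public static void main(String[] args) throws Exception {
            TieredMergePolicy mergePolicy = new TieredMergePolicy();
            mergePolicy.setSegmentsPerTier(10.0);      // tolerate more segments before merging
            mergePolicy.setMaxMergedSegmentMB(1024.0); // cap the size of merged segments

            ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
            scheduler.setMaxMergesAndThreads(2, 1);    // limit concurrent merges and merge threads

            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            config.setMergePolicy(mergePolicy);
            config.setMergeScheduler(scheduler);

            try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index")), config)) {
                writer.commit(); // the writer now merges under the limits above
            }
        }
    }

In Elasticsearch itself these are exposed as index.merge.* index settings rather than set in code, but the underlying machinery is the same.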
QUESTION
Basically, how is each field inside a document stored in the inverted index? Does Lucene internally create a separate index for each field? Also, suppose a query targets a specific field; how does the search work for it internally?
I know how inverted indices work, but how do you store multiple fields in a single index, and how do you search only particular fields when requested?
...ANSWER
Answered 2021-May-11 at 19:22
As I mentioned in my comment, if you want to see how Lucene stores indexed data, you can use the SimpleTextCodec. See this answer, How to view Lucene Index, for more details and some sample code. Basically, this generates human-readable index files (as opposed to the usual binary compressed formats). Below is a sample of what you can expect to see when you use the SimpleTextCodec.
How do you store multiple fields in a single index?
To show a basic example, assume we have a Lucene text field defined as follows:
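The answer's own field definition and sample output are not reproduced above; a minimal sketch of the setup it describes (field names and the index path are mine; requires the lucene-codecs artifact) looks like this:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class SimpleTextDemo {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            config.setCodec(new SimpleTextCodec()); // plain-text index files instead of the binary default

            try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("text-index")), config)) {
                Document doc = new Document();
                // Each field name becomes its own posting space in the inverted index
                doc.add(new TextField("title", "lucene in action", Field.Store.YES));
                doc.add(new TextField("body", "inverted index demo", Field.Store.YES));
                writer.addDocument(doc);
            }
            // Inspect the files under text-index/ with any text editor
        }
    }

Each distinct field name gets its own term dictionary inside the same index, which is also how a field-scoped query knows to consult only that field's postings.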
QUESTION
I have JSON documents that represent database rows.
...ANSWER
Answered 2021-May-09 at 16:17
OK, I think I figured this out. The problem is that you can't create a valid Term by using GetStringValue to convert the integer ID to a string (e.g. "3122"). Instead you have to create the term from the ID's raw bytes (e.g. [60 8 0 0 18 31]), like this:
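The poster's snippet itself is not shown above. In Java Lucene the analogous point is that a numeric field is indexed in an encoded binary form, so an exact match must be built from the same encoding rather than from the decimal string; a sketch (the field name is assumed):

    import org.apache.lucene.document.IntPoint;
    import org.apache.lucene.search.Query;

    public class NumericIdQuery {
        public static void main(String[] args) {
            // new Term("id", "3122") would never match: the indexed bytes are the
            // numeric encoding of 3122, not the string "3122".
            // IntPoint.newExactQuery builds the query from the same encoded bytes.
            Query byId = IntPoint.newExactQuery("id", 3122);
            System.out.println(byId);
        }
    }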
QUESTION
I'm attempting to use StormCrawler to crawl a set of pages on our website, and while it is able to retrieve and index some of the pages' text, it's not capturing a large amount of other text on the page.
I've installed Zookeeper, Apache Storm, and StormCrawler using the Ansible playbooks provided here (thank you a million for those!) on a server running Ubuntu 18.04, along with Elasticsearch and Kibana. For the most part I'm using the configuration defaults, but have made the following changes:
- For the Elasticsearch index mappings, I've enabled _source: true and turned on indexing and storing for all properties (content, host, title, url).
- In the crawler-conf.yaml configuration, I've commented out all textextractor.include.pattern and textextractor.exclude.tags settings, to force capturing the whole page.
After re-creating fresh ES indices, running mvn clean package, and then starting the crawler topology, StormCrawler begins doing its thing and content starts appearing in Elasticsearch. However, for many pages the content that's retrieved and indexed is only a subset of all the text on the page, and it usually excludes the main page text we are interested in.
For example, the text in the following XML path is not returned/indexed:
(text)
While the text in this path is returned:
Are there any additional configuration changes that need to be made beyond commenting out all the specific tag include and exclude patterns? From my reading of the documentation, the defaults for those options should cause the whole page to be indexed.
I would greatly appreciate any help. Thank you for the excellent software.
Below are my configuration files:
crawler-conf.yaml
...
ANSWER
Answered 2021-Apr-27 at 08:07
IIRC you need to set some additional config to work with ChromeDriver.
Alternatively (I haven't tried it yet), https://hub.docker.com/r/browserless/chrome would be a nice way of handling Chrome in a Docker container.
QUESTION
I'm trying to back up a Solr core (Solr 8.1.1 in standalone mode). I added the replication requestHandler as per https://solr.apache.org/guide/8_1/index-replication.html#configuring-the-replicationhandler
When I run /solr/core/replication?command=backup it returns:
...ANSWER
Answered 2021-Apr-23 at 01:31
It looks like this is a regression of a bug from earlier Solr that was re-introduced in 8.0.0. It was apparently fixed in 8.4.0.
QUESTION
I am using Lucene for the autocomplete mechanism of a text field supporting multiple languages and multiple groups of options. Each group has about 2k to 5k different values.
Currently I query all hits and sort them by an integer value by hand. Since this is inefficient, I want to build the index using doc values. I understand the theory, but I can't find a good code snippet to make it work. I bought and read two books, and the topic is either not covered or poorly covered (one small section with one line of code).
My goal is to index an integer value per document and sort in descending order.
I would also like to ask whether I am missing a major documentation source. The Lucene documentation is neither very comprehensive nor very accessible. I used to use Lucene in Action, but that book is a decade old, and the most recent changes in Lucene are quite dramatic in terms of API.
As an example:
- {name:"A1", number:1000}
- {name:"A2", number:1001}
- {name:"A3", number:990}
- {name:"B1", number:300}
= Query: A* + sorted by number + top2 => A3, A1
Summary: I currently fetch all the documents and do the sorting and trimming (limit) in code, and I would rather have Lucene do it.
The implementation uses Java. Since I use only a small set of information, but in multiple languages, I create an index using RAMDirectory (yes, I know it's deprecated, but it works) and add each document to a standard index writer using a standard analyzer.
As far as I understand the requirements, I need to define and use a column-stored field to allow sorting with Lucene. I tried for multiple hours and just gave up; fetching all the information, looking up the data in memory, and sorting and trimming it there did the trick, but it is unsatisfying.
So all that is needed is to add an integer field to the index that allows sorting in Lucene.
...ANSWER
Answered 2021-Mar-23 at 12:52
Use a SortedNumericDocValuesField to add the field to your documents. Use a SortedNumericSortField with the same name in your search query:
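The answer's own snippet is elided above; a self-contained sketch of that recipe against the example data (field names match the example; the reverse flag follows the stated goal of descending order, so the top 2 here are A2 and A1, while an ascending sort would return A3 and A1 as in the example):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.SortedNumericDocValuesField;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.SortedNumericSortField;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class DocValuesSortDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory(); // non-deprecated replacement for RAMDirectory
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                addDoc(writer, "A1", 1000);
                addDoc(writer, "A2", 1001);
                addDoc(writer, "A3", 990);
                addDoc(writer, "B1", 300);
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // true = descending; Lucene sorts on the doc values, not in application code
                Sort sort = new Sort(new SortedNumericSortField("number", SortField.Type.LONG, true));
                ScoreDoc[] hits = searcher.search(new PrefixQuery(new Term("name", "a")), 2, sort).scoreDocs;
                for (ScoreDoc hit : hits) {
                    System.out.println(searcher.doc(hit.doc).get("name")); // prints A2, then A1
                }
            }
        }

        private static void addDoc(IndexWriter writer, String name, long number) throws Exception {
            Document doc = new Document();
            doc.add(new TextField("name", name, Field.Store.YES));
            doc.add(new SortedNumericDocValuesField("number", number)); // doc values drive the sort
            doc.add(new StoredField("number", number));                 // optional: retrievable copy
            writer.addDocument(doc);
        }
    }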
Community Discussions and Code Snippets contain sources from the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install lucene
You can use lucene like any standard Java library: include the jar files in your classpath. You can also use any IDE to run and debug the lucene component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, refer to maven.apache.org; for Gradle installation, refer to gradle.org.
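For example, a Gradle (Kotlin DSL) dependency block for this demo's two main libraries might look like the following; the artifact coordinates and versions are assumptions, not taken from the project's build file:

    // build.gradle.kts: assumed coordinates for Lucene and the IKAnalyzer toolkit
    dependencies {
        implementation("org.apache.lucene:lucene-core:8.11.2")
        implementation("org.apache.lucene:lucene-analyzers-common:8.11.2")
        implementation("com.janeluo:ikanalyzer:2012_u6")
    }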