lucene | A Chinese-search demo and study notes based on Lucene and IKAnalyzer | Search Engine library
kandi X-RAY | lucene Summary
IKAnalyzer is an open-source, lightweight Chinese word-segmentation toolkit developed in Java. It began as a Chinese word-segmentation component built around the open-source Lucene project, combining dictionary-based segmentation with grammar-analysis algorithms. The newer IKAnalyzer 3.0 has evolved into a general-purpose word-segmentation component for Java, independent of the Lucene project, while still providing a default optimized implementation for Lucene.
Community Discussions
Trending Discussions on lucene
QUESTION
I tried the official Lucene demo by running IndexFiles with the arguments -index . -docs ., and the console output shows that files including pom.xml, *.java, and *.class are added to the index.
Then I tried SearchFiles with the arguments -index . -query "lucene AND main", and the console prints only IndexFiles.class, SearchFiles.class, and IndexFiles.java, but not SearchFiles.java (which I think should be one of the search results).
ANSWER
Answered 2021-Jun-09 at 12:00
Your search results are correct (for the .java files, at least). The sample code uses the StandardAnalyzer which, in turn, uses the StandardTokenizer. The StandardTokenizer splits input text into tokens using the rules described in this document; see, for example, section 4 of that document.
When you have text such as the following in the source files
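If you want to check which tokens the StandardAnalyzer actually produces for a given string, a small sketch like the one below can help (the class name, field name, and sample input are illustrative, not part of the demo):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TokenDebug {
        public static void main(String[] args) throws Exception {
            try (Analyzer analyzer = new StandardAnalyzer()) {
                // Tokenize a sample line the way the demo would before indexing it
                TokenStream ts = analyzer.tokenStream("contents", "org.apache.lucene.demo.SearchFiles");
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term.toString()); // one token per line
                }
                ts.end();
                ts.close();
            }
        }
    }

Running this over the exact lines of SearchFiles.java shows which terms actually made it into the index, which is usually the quickest way to explain a surprising hit or miss.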
QUESTION
I am trying to run Solr 8.5.2 locally. When starting it, I get the following error:
...ANSWER
Answered 2021-Jun-02 at 08:08
The issue was with web.xml: it had servlet entries for ZookeeperInfoServlet. Removing those fixed the issue.
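For illustration, the entries to delete would look roughly like the following; this is an assumed reconstruction (the servlet-name and url-pattern are guesses), not the poster's actual web.xml:

    <!-- Assumed shape of the offending web.xml entries; remove both elements -->
    <servlet>
      <servlet-name>Zookeeper</servlet-name>
      <servlet-class>org.apache.solr.servlet.ZookeeperInfoServlet</servlet-class>
    </servlet>
    <servlet-mapping>
      <servlet-name>Zookeeper</servlet-name>
      <url-pattern>/zookeeper</url-pattern>
    </servlet-mapping>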
QUESTION
I use the following code to create an edge
...ANSWER
Answered 2021-May-31 at 15:47
Edge IDs in JanusGraph are stored using a special class called RelationIdentifier, which contains a lot more information than just the ID itself. Its string form is a "UUID-like" identifier, and you can get the other information from the class. Below is an example using a simple 'inmemory' JanusGraph from the Gremlin Console.
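The console session itself is not reproduced above; as a rough Java analogue of the same inspection (the vertex and edge labels, and the 'inmemory' backend setting, are assumptions), something like this shows what an edge ID carries:

    import org.apache.tinkerpop.gremlin.structure.Edge;
    import org.apache.tinkerpop.gremlin.structure.Vertex;
    import org.janusgraph.core.JanusGraph;
    import org.janusgraph.core.JanusGraphFactory;

    public class EdgeIdDemo {
        public static void main(String[] args) {
            // Open a throwaway in-memory graph, as in the answer's console example
            JanusGraph graph = JanusGraphFactory.build()
                    .set("storage.backend", "inmemory")
                    .open();
            Vertex a = graph.addVertex("person");
            Vertex b = graph.addVertex("person");
            Edge e = a.addEdge("knows", b);

            // The edge ID is a RelationIdentifier, not a plain number
            Object id = e.id();
            System.out.println(id);                      // the "UUID-like" string form
            System.out.println(id.getClass().getName()); // ...graphdb.relations.RelationIdentifier

            // The string form round-trips, so it can be used to look the edge up again
            System.out.println(graph.traversal().E(id.toString()).next());
            graph.close();
        }
    }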
QUESTION
I am trying to get the second-to-last value in each row of a data frame, meaning the first job a person has had (Job1_latest is the most recent job; people have had different numbers of jobs in the past, and I want to get the first one). I managed to get the last value per row with the code below:
first_job <- function(x) tail(x[!is.na(x)], 1)  # last non-NA value in the row
first_job <- apply(data, 1, first_job)          # apply row-wise
...ANSWER
Answered 2021-May-11 at 13:56
You can take the value that is next to the last non-NA value.
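The answer's code is not shown above; a minimal sketch of the described approach (keeping the second-to-last non-NA value; the helper name is mine) would be:

    # Keep the last two non-NA values, then take the earlier of the two
    second_last <- function(x) {
      v <- x[!is.na(x)]
      tail(v, 2)[1]
    }
    first_job <- apply(data, 1, second_last)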
QUESTION
I have seen that every time I have a high-CPU problem with ES, it's always the Lucene Merge Thread.
From what I understand, segments are already sorted, so you are just merging two sorted segments each time, à la merge sort's merge step. Why would merging be so costly? Or am I missing something?
...ANSWER
Answered 2021-May-12 at 09:42
There are a couple of factors that make the merge process costly in IO and CPU:
- Lucene uses the skip list data structure, and skip lists can be costly to merge; there is a good article about skip-list merging here.
- Several merges can happen in parallel at the same time.
- Lucene needs to create a third segment and merge both source segments into it, so you need enough disk space for that.
There is a good blog post about segment merging here.
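If merge load is the bottleneck, the relevant knobs live on the IndexWriterConfig. A hedged sketch of where they are set (the values are illustrative, not recommendations):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.ConcurrentMergeScheduler;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.store.FSDirectory;

    public class MergeTuning {
        public static void main(String[] args) throws Exception {
            TieredMergePolicy mergePolicy = new TieredMergePolicy();
            mergePolicy.setSegmentsPerTier(10.0);      // tolerate more segments before merging
            mergePolicy.setMaxMergedSegmentMB(1024.0); // cap the size of merged segments

            ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
            scheduler.setMaxMergesAndThreads(2, 1);    // limit concurrent merges and merge threads

            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            config.setMergePolicy(mergePolicy);
            config.setMergeScheduler(scheduler);

            try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index")), config)) {
                writer.commit(); // the writer now merges under the limits above
            }
        }
    }

In Elasticsearch itself these are exposed as index.merge.* index settings rather than set in code, but the underlying machinery is the same.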
QUESTION
Basically, how is each field inside a document stored in the inverted index? Does Lucene internally create a separate index for each field? Also, suppose a query targets a specific field; how does the search work for it internally?
I know how inverted indices work, but how do you store multiple fields in a single index, and how do you search only particular fields when requested?
...ANSWER
Answered 2021-May-11 at 19:22
As I mentioned in my comment, if you want to see how Lucene stores indexed data, you can use the SimpleTextCodec. See this answer, How to view Lucene Index, for more details and some sample code. Basically, this generates human-readable index files (as opposed to the usual binary compressed formats). Below is a sample of what you can expect to see when you use the SimpleTextCodec.
How do you store multiple fields in a single index?
To show a basic example, assume we have a Lucene text field defined as follows:
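The answer's own field definition and sample output are not reproduced above; a minimal sketch of the setup it describes (field names and the index path are mine; requires the lucene-codecs artifact) looks like this:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class SimpleTextDemo {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            config.setCodec(new SimpleTextCodec()); // plain-text index files instead of the binary default

            try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("text-index")), config)) {
                Document doc = new Document();
                // Each field name becomes its own posting space in the inverted index
                doc.add(new TextField("title", "lucene in action", Field.Store.YES));
                doc.add(new TextField("body", "inverted index demo", Field.Store.YES));
                writer.addDocument(doc);
            }
            // Inspect the files under text-index/ with any text editor
        }
    }

Each distinct field name gets its own term dictionary inside the same index, which is also how a field-scoped query knows to consult only that field's postings.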
QUESTION
I have JSON documents that represent database rows.
...ANSWER
Answered 2021-May-09 at 16:17
OK, I think I figured this out. The problem is that you can't create a valid Term by using GetStringValue to convert the integer ID to a string (e.g. "3122"). Instead you have to create the term from the ID's raw bytes (e.g. [60 8 0 0 18 31]), like this:
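The poster's snippet itself is not shown above. In Java Lucene the analogous point is that a numeric field is indexed in an encoded binary form, so an exact match must be built from the same encoding rather than from the decimal string; a sketch (the field name is assumed):

    import org.apache.lucene.document.IntPoint;
    import org.apache.lucene.search.Query;

    public class NumericIdQuery {
        public static void main(String[] args) {
            // new Term("id", "3122") would never match: the indexed bytes are the
            // numeric encoding of 3122, not the string "3122".
            // IntPoint.newExactQuery builds the query from the same encoded bytes.
            Query byId = IntPoint.newExactQuery("id", 3122);
            System.out.println(byId);
        }
    }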
QUESTION
I'm attempting to use StormCrawler to crawl a set of pages on our website, and while it is able to retrieve and index some of the pages' text, it's not capturing a large amount of other text on the page.
I've installed Zookeeper, Apache Storm, and StormCrawler using the Ansible playbooks provided here (thank you a million for those!) on a server running Ubuntu 18.04, along with Elasticsearch and Kibana. For the most part I'm using the configuration defaults, but have made the following changes:
- For the Elasticsearch index mappings, I've enabled _source: true and turned on indexing and storing for all properties (content, host, title, url).
- In the crawler-conf.yaml configuration, I've commented out all textextractor.include.pattern and textextractor.exclude.tags settings, to force capturing the whole page.
After re-creating fresh ES indices, running mvn clean package, and then starting the crawler topology, StormCrawler begins doing its thing and content starts appearing in Elasticsearch. However, for many pages the content that's retrieved and indexed is only a subset of all the text on the page, and it usually excludes the main page text we are interested in.
For example, the text in the following XML path is not returned/indexed:
(text)
While the text in this path is returned:
Are there any additional configuration changes that need to be made beyond commenting out all the specific tag include and exclude patterns? From my reading of the documentation, the defaults for those options should cause the whole page to be indexed.
I would greatly appreciate any help. Thank you for the excellent software.
Below are my configuration files:
crawler-conf.yaml
...
ANSWER
Answered 2021-Apr-27 at 08:07
IIRC you need to set some additional config to work with ChromeDriver.
Alternatively (I haven't tried it yet), https://hub.docker.com/r/browserless/chrome would be a nice way of handling Chrome in a Docker container.
QUESTION
I'm trying to back up a Solr core (Solr 8.1.1 in standalone mode). I added the replication requestHandler as per https://solr.apache.org/guide/8_1/index-replication.html#configuring-the-replicationhandler
When I run /solr/core/replication?command=backup it returns:
...ANSWER
Answered 2021-Apr-23 at 01:31
It looks like this is a regression of a bug from earlier Solr that was re-introduced in 8.0.0. It was apparently fixed in 8.4.0.
QUESTION
I am using Lucene for the autocomplete mechanism of a text field supporting multiple languages and multiple groups of options. Each group has about 2k to 5k different values.
Currently I query all hits and sort them by an integer value by hand. Since this is inefficient, I want to build the index using doc values. I understand the theory, but I can't find a good code snippet to make it work. I bought and read two books, and the topic is either not covered or poorly covered (one small section with one line of code).
My goal is to index an integer value per document and sort in descending order.
I would also like to ask whether I am missing a major documentation source. The Lucene documentation is neither very comprehensive nor very accessible. I used to use Lucene in Action, but that book is a decade old, and the most recent changes in Lucene are quite dramatic in terms of API.
As an example:
- {name:"A1", number:1000}
- {name:"A2", number:1001}
- {name:"A3", number:990}
- {name:"B1", number:300}
= Query: A* + sorted by number + top2 => A3, A1
Summary: I currently fetch all the documents and do the sorting and trimming (limit) in code, and I would rather have Lucene do it.
The implementation uses Java. Since I use only a small set of information, but in multiple languages, I create an index using RAMDirectory (yes, I know it's deprecated, but it works) and add each document to a standard index writer using a standard analyzer.
As far as I understand the requirements, I need to define and use a column-stored field to allow sorting with Lucene. I tried for multiple hours and just gave up; fetching all the information, looking up the data in memory, and sorting and trimming it there did the trick, but it is unsatisfying.
So all that is needed is to add an integer field to the index that allows sorting in Lucene.
...ANSWER
Answered 2021-Mar-23 at 12:52
Use a SortedNumericDocValuesField to add the field to your documents. Use a SortedNumericSortField with the same name in your search query:
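The answer's own snippet is elided above; a self-contained sketch of that recipe against the example data (field names match the example; the reverse flag follows the stated goal of descending order, so the top 2 here are A2 and A1, while an ascending sort would return A3 and A1 as in the example):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.SortedNumericDocValuesField;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.SortedNumericSortField;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class DocValuesSortDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory(); // non-deprecated replacement for RAMDirectory
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                addDoc(writer, "A1", 1000);
                addDoc(writer, "A2", 1001);
                addDoc(writer, "A3", 990);
                addDoc(writer, "B1", 300);
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // true = descending; Lucene sorts on the doc values, not in application code
                Sort sort = new Sort(new SortedNumericSortField("number", SortField.Type.LONG, true));
                ScoreDoc[] hits = searcher.search(new PrefixQuery(new Term("name", "a")), 2, sort).scoreDocs;
                for (ScoreDoc hit : hits) {
                    System.out.println(searcher.doc(hit.doc).get("name")); // prints A2, then A1
                }
            }
        }

        private static void addDoc(IndexWriter writer, String name, long number) throws Exception {
            Document doc = new Document();
            doc.add(new TextField("name", name, Field.Store.YES));
            doc.add(new SortedNumericDocValuesField("number", number)); // doc values drive the sort
            doc.add(new StoredField("number", number));                 // optional: retrievable copy
            writer.addDocument(doc);
        }
    }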
Community Discussions and Code Snippets contain sources from the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install lucene
You can use lucene like any standard Java library: include the jar files in your classpath. You can also use any IDE to run and debug the lucene component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, refer to maven.apache.org; for Gradle installation, refer to gradle.org.
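For example, a Gradle (Kotlin DSL) dependency block for this demo's two main libraries might look like the following; the artifact coordinates and versions are assumptions, not taken from the project's build file:

    // build.gradle.kts: assumed coordinates for Lucene and the IKAnalyzer toolkit
    dependencies {
        implementation("org.apache.lucene:lucene-core:8.11.2")
        implementation("org.apache.lucene:lucene-analyzers-common:8.11.2")
        implementation("com.janeluo:ikanalyzer:2012_u6")
    }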