lucene | Chinese search demo and study notes based on Lucene and IKAnalyzer | Search Engine library

 by suxiongwei | Java | Version: Current | License: No License

kandi X-RAY | lucene Summary

lucene is a Java library typically used in Database, Search Engine, Spring Boot, and Spring applications. It has no reported bugs or vulnerabilities, a build file is available, and it has high support. You can download it from GitHub.

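IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. It started as a Chinese word segmentation component built mainly for the open-source project Lucene, combining dictionary-based segmentation with grammar analysis algorithms. The newer IKAnalyzer 3.0 has developed into a general-purpose word segmentation component for Java that is independent of the Lucene project, while still providing a default optimized implementation for Lucene.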

            kandi-support Support

              lucene has a highly active ecosystem.
              It has 35 star(s) with 25 fork(s). There are 2 watchers for this library.
              It had no major release in the last 6 months.
              There are 0 open issues and 2 have been closed. On average issues are closed in 5 days. There are no pull requests.
              It has a positive sentiment in the developer community.
              The latest version of lucene is current.

            kandi-Quality Quality

              lucene has no bugs reported.

            kandi-Security Security

              lucene has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

            kandi-License License

              lucene does not have a standard license declared.
              Check the repository for any license declaration and review the terms closely.
              Without a license, all rights are reserved, and you cannot use the library in your applications.

            kandi-Reuse Reuse

              lucene releases are not available. You will need to build from source code and install.
              Build file is available. You can build the component from source.
              Installation instructions are not available. Examples and code snippets are available.

            lucene Key Features

            No Key Features are available at this moment for lucene.

            lucene Examples and Code Snippets

            No Code Snippets are available at this moment for lucene.

            Community Discussions

            QUESTION

            Lucene query result is not correct when running official demo
            Asked 2021-Jun-09 at 12:00

            I tried the official Lucene demo by running IndexFiles with the arguments -index . -docs . and the console output shows that pom.xml and the *.java and *.class files were added to the index.

            Then I ran SearchFiles with the arguments -index . -query "lucene AND main". The console prints only IndexFiles.class, SearchFiles.class, and IndexFiles.java, but not SearchFiles.java (which I think should be one of the search results).

            ...

            ANSWER

            Answered 2021-Jun-09 at 12:00

            Your search results are correct (for the .java files, at least).

            The sample code uses the StandardAnalyzer which, in turn, uses the StandardTokenizer.

            The StandardTokenizer splits input text into tokens using the rules described in this document. For example, from section 4 of that document:

            When you have text such as the following, in the source files

            Source https://stackoverflow.com/questions/67880602

            QUESTION

            Apache Solr 8.5.2: Solr not starting due to ClassNotFoundException
            Asked 2021-Jun-02 at 08:08

            I am trying to run Solr 8.5.2 locally. When starting it, I get the following error:

            ...

            ANSWER

            Answered 2021-Jun-02 at 08:08

            The issue was with web.xml: it had servlet entries for ZookeeperInfoServlet. Removing those fixed the issue.

            Source https://stackoverflow.com/questions/67800510

            QUESTION

            edgeID is returned as alpha numeric instead of long in Janusgraph
            Asked 2021-May-31 at 15:47

            I use the following code to create an edge

            ...

            ANSWER

            Answered 2021-May-31 at 15:47

            Edge IDs in JanusGraph are stored using a special class called RelationIdentifier that contains a lot more information than just the ID itself. The ID of that class is a "UUID like" identifier. You can get other information from the class. Below is an example using a simple 'inmemory' JanusGraph from the Gremlin Console.

            Source https://stackoverflow.com/questions/67759285

            QUESTION

            Get second last value in each row of dataframe, R
            Asked 2021-May-14 at 14:45

            I am trying to get the second last value in each row of a data frame, meaning the first job a person has had. (Job1_latest is the most recent job and people had a different number of jobs in the past and I want to get the first one). I managed to get the last value per row with the code below:

            first_job <- function(x) tail(x[!is.na(x)], 1)

            first_job <- apply(data, 1, first_job)

            ...

            ANSWER

            Answered 2021-May-11 at 13:56

            You can get the value that is next to the last non-NA value.

            Source https://stackoverflow.com/questions/67486393

            QUESTION

            Why is merging of segments costly CPU-wise in Elasticsearch?
            Asked 2021-May-12 at 09:42

            I have seen that every time I have a high CPU problem with ES, it is always the Lucene Merge Thread.

            From what I understand, segments are already sorted, so you are just merging two sorted segments every time, à la merge sort's merge process. Why would merging be so costly? Or am I missing something?

            ...

            ANSWER

            Answered 2021-May-12 at 09:42

            There are a couple of reasons why the merging process is costly in terms of I/O and CPU:

            1. Lucene uses a skip list data structure, which can be costly to merge; there is a good article about skip list merging here.
            2. Several merges can run in parallel at the same time.
            3. Lucene needs to create a third segment and merge both segments into it, so you need enough disk space for it.

            There is a good blog post about segment merging here.

            Source https://stackoverflow.com/questions/67497626
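
The third point in the answer above can be made more concrete with a toy model. The sketch below is not Lucene's merge code: it assumes a segment is just a sorted map from term to ascending document IDs, and the class name `SegmentMergeSketch`, the `merge` method, and the way deletions are passed in are all hypothetical. It only illustrates that a merge writes a new segment in which every surviving posting of the second segment is renumbered, which is more work than interleaving two sorted lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy model of merging two index segments; not Lucene's actual implementation. */
final class SegmentMergeSketch {

    /**
     * Merges two segments (term -> ascending doc IDs) into a new, third segment.
     * Doc IDs from the second segment are shifted by firstDocCount. Deleted
     * documents are given in the combined numbering and are dropped; for
     * simplicity the remaining IDs are not compacted.
     */
    static TreeMap<String, List<Integer>> merge(
            TreeMap<String, List<Integer>> first,
            int firstDocCount,
            TreeMap<String, List<Integer>> second,
            Set<Integer> deleted) {
        TreeMap<String, List<Integer>> merged = new TreeMap<>();
        // Postings from the first segment keep their doc IDs.
        for (Map.Entry<String, List<Integer>> e : first.entrySet()) {
            copy(e.getKey(), e.getValue(), 0, deleted, merged);
        }
        // Postings from the second segment are renumbered one by one.
        for (Map.Entry<String, List<Integer>> e : second.entrySet()) {
            copy(e.getKey(), e.getValue(), firstDocCount, deleted, merged);
        }
        return merged;
    }

    /** Rewrites one posting list into the output segment, skipping deleted docs. */
    private static void copy(String term, List<Integer> postings, int offset,
                             Set<Integer> deleted,
                             TreeMap<String, List<Integer>> out) {
        for (int docId : postings) {
            int newId = docId + offset;
            if (!deleted.contains(newId)) {
                out.computeIfAbsent(term, t -> new ArrayList<>()).add(newId);
            }
        }
    }

    public static void main(String[] args) {
        TreeMap<String, List<Integer>> first = new TreeMap<>();
        first.put("lucene", Arrays.asList(0, 2));
        first.put("merge", Arrays.asList(1));

        TreeMap<String, List<Integer>> second = new TreeMap<>();
        second.put("lucene", Arrays.asList(0));
        second.put("segment", Arrays.asList(1));

        // Document 2 of the first segment has been deleted.
        Set<Integer> deleted = new HashSet<>(Arrays.asList(2));

        // prints {lucene=[0, 3], merge=[1], segment=[4]}
        System.out.println(merge(first, 3, second, deleted));
    }
}
```

Even in this simplified form, each posting is read, checked against the deletions, and written again. A real segment also carries stored fields, doc values, and compressed on-disk encodings, which this sketch leaves out.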

            QUESTION

            How does lucene store a document?
            Asked 2021-May-11 at 19:22

            Basically, how is each field inside a document stored in the inverted index? Does Lucene internally create a separate index for each field? Also, suppose a query is on a specific field: how does search work for it internally?

            I know how inverted indices work. But how do you store multiple fields in a single index, and how do you restrict a search to particular fields when requested?

            ...

            ANSWER

            Answered 2021-May-11 at 19:22

            As I mentioned in my comment, if you want to see how Lucene stores indexed data, you can use the SimpleTextCodec. See the answer to "How to view Lucene Index" for more details and some sample code. Basically, this generates human-readable index files (as opposed to the usual compressed binary formats).

            Below is a sample of what you can expect to see when you use the SimpleTextCodec.

            How do you store multiple fields in a single index?

            To show a basic example, assume we have a Lucene text field defined as follows:

            Source https://stackoverflow.com/questions/67491033
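
As a rough mental model for the question above, a term in Lucene is identified by its field together with its text, so a single index can hold postings for many fields without mixing them up. The following sketch is a simplified illustration in plain Java, not Lucene code: the class `FieldedIndexSketch`, its methods, and the naive lowercase-and-split tokenization are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index that keeps a separate term dictionary per field. */
final class FieldedIndexSketch {

    // field name -> (term -> ascending doc IDs)
    private final Map<String, TreeMap<String, List<Integer>>> postingsByField = new TreeMap<>();
    private int nextDocId = 0;

    /** Indexes one document given as field name -> text; returns its doc ID. */
    int addDocument(Map<String, String> fields) {
        int docId = nextDocId++;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            TreeMap<String, List<Integer>> terms =
                    postingsByField.computeIfAbsent(field.getKey(), k -> new TreeMap<>());
            // Naive analysis: lowercase and split on non-word characters.
            for (String token : field.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (token.isEmpty()) {
                    continue;
                }
                List<Integer> postings = terms.computeIfAbsent(token, t -> new ArrayList<>());
                // Record each document at most once per term.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return docId;
    }

    /** Looks up a term in one field only; other fields are never consulted. */
    List<Integer> search(String field, String term) {
        TreeMap<String, List<Integer>> terms = postingsByField.get(field);
        if (terms == null) {
            return Collections.<Integer>emptyList();
        }
        return terms.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.<Integer>emptyList());
    }

    public static void main(String[] args) {
        FieldedIndexSketch index = new FieldedIndexSketch();

        Map<String, String> doc0 = new HashMap<>();
        doc0.put("title", "Lucene in Action");
        doc0.put("body", "Search with Java");
        index.addDocument(doc0);

        Map<String, String> doc1 = new HashMap<>();
        doc1.put("title", "Java basics");
        doc1.put("body", "Lucene is a search library");
        index.addDocument(doc1);

        System.out.println(index.search("title", "lucene")); // prints [0]
        System.out.println(index.search("body", "lucene"));  // prints [1]
    }
}
```

A query restricted to one field only ever touches that field's term dictionary, which is why the same word can match different documents depending on the field being searched.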

            QUESTION

            How to re-index documents with integer id?
            Asked 2021-May-09 at 16:17

            I have JSON documents that represent database rows.

            ...

            ANSWER

            Answered 2021-May-09 at 16:17

            OK, I think I figured this out. The problem is that you can't create a valid Term by using GetStringValue to convert the integer ID to a string (e.g. "3122"). Instead you have to create the term from the ID's raw bytes (e.g. [60 8 0 0 18 31]), like this:

            Source https://stackoverflow.com/questions/67445658

            QUESTION

            Stormcrawler not retrieving all text content from web page
            Asked 2021-Apr-27 at 08:07

            I'm attempting to use Stormcrawler to crawl a set of pages on our website, and while it is able to retrieve and index some of the page's text, it's not capturing a large amount of other text on the page.

            I've installed Zookeeper, Apache Storm, and Stormcrawler using the Ansible playbooks provided here (thank you a million for those!) on a server running Ubuntu 18.04, along with Elasticsearch and Kibana. For the most part, I'm using the configuration defaults, but have made the following changes:

            • For the Elastic index mappings, I've enabled _source: true, and turned on indexing and storing for all properties (content, host, title, url)
            • In the crawler-conf.yaml configuration, I've commented out all textextractor.include.pattern and textextractor.exclude.tags settings, to enforce capturing the whole page

            After re-creating fresh ES indices, running mvn clean package, and then starting the crawler topology, stormcrawler begins doing its thing and content starts appearing in Elasticsearch. However, for many pages, the content that's retrieved and indexed is only a subset of all the text on the page, and usually excludes the main page text we are interested in.

            For example, the text in the following XML path is not returned/indexed:

            (text)

            While the text in this path is returned:

            Are there any additional configuration changes that need to be made beyond commenting out all specific tag include and exclude patterns? From my understanding of the documentation, the default settings for those options are to enforce the whole page to be indexed.

            I would greatly appreciate any help. Thank you for the excellent software.

            Below are my configuration files:

            crawler-conf.yaml

            ...

            ANSWER

            Answered 2021-Apr-27 at 08:07

            IIRC you need to set some additional config to work with ChromeDriver.

            Alternatively (haven't tried yet) https://hub.docker.com/r/browserless/chrome would be a nice way of handling Chrome in a Docker container.

            Source https://stackoverflow.com/questions/67129360

            QUESTION

            Solr not backing up core
            Asked 2021-Apr-23 at 01:31

            I'm trying to back up a Solr core (Solr 8.1.1 in standalone mode). I added the replication requestHandler as per https://solr.apache.org/guide/8_1/index-replication.html#configuring-the-replicationhandler

            When I run /solr/core/replication?command=backup it returns:

            ...

            ANSWER

            Answered 2021-Apr-23 at 01:31

            It looks like this is a regression: a bug from an earlier Solr version that was re-introduced in 8.0.0. It was apparently fixed in 8.4.0.

            See https://issues.apache.org/jira/browse/SOLR-13872

            Source https://stackoverflow.com/questions/67207332

            QUESTION

            Sort Index using DocValues for integers?
            Asked 2021-Apr-20 at 19:18

            I am using Lucene for an autocomplete mechanism for a text field that supports multiple languages and multiple groups of options. Each group has about 2k to 5k different values.

            Currently I query all hits and sort them by an integer value by hand. Since this is inefficient, I need to create an index using doc values. I understand the theory, but I can't find a good code snippet to make it work. I bought and read two books, and the topic is either not covered or poorly covered (one small section with one line of code).

            My goal is to index an integer value per document and sort in descending order.

            I would also like to ask whether I am missing a major documentation source. The Lucene documentation is neither comprehensive nor accessible. I used to use Lucene in Action, but that book is a decade old and the most recent changes in Lucene are quite dramatic in terms of API.

            As an example:

            • {name:"A1", number:1000}
            • {name:"A2", number:1001}
            • {name:"A3", number:990}
            • {name:"B1", number:300}

            = Query: A* + sorted by number + top2 => A3, A1

            Summary: I currently fetch all the documents and do the sorting and trimming (limit) in code, and would rather have Lucene do it.

            The implementation uses Java. Since I use only a small set of information but in multiple languages, I create an index using RAMDirectory (yes, I know it's deprecated, but it works) and add each document to a standard index writer using a standard analyzer.

            As far as I understand the requirements, I need to define and use a field stored in a column to allow sorting with Lucene. I tried for multiple hours and then gave up: fetching all the information, looking up the data in memory, and sorting and trimming it there did the trick, but it is unsatisfactory.

            So all that is needed is to add an integer field to the index that allows sorting in Lucene.

            ...

            ANSWER

            Answered 2021-Mar-23 at 12:52

            Use a SortedNumericDocValuesField to add the field to your documents.

            Use a SortedNumericSortField with the same name in your search query:

            Source https://stackoverflow.com/questions/66492614

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install lucene

            You can download it from GitHub.
            You can use lucene like any standard Java library. Include the jar files in your classpath. You can also use any IDE to run and debug the lucene component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, refer to maven.apache.org. For Gradle installation, refer to gradle.org.

            Support

            For new features, suggestions, and bugs, create an issue on GitHub. If you have questions, search for answers or ask on the Stack Overflow community page.

            CLONE
          • HTTPS

            https://github.com/suxiongwei/lucene.git

          • CLI

            gh repo clone suxiongwei/lucene

          • sshUrl

            git@github.com:suxiongwei/lucene.git
