
lucene-solr | Mirror of Apache Lucene + Solr | Search Engine library

by cloudera | Java | Version: Current | License: No License


kandi X-RAY | lucene-solr Summary

lucene-solr is a Java library typically used in Database, Search Engine, and Maven applications. lucene-solr has no bugs and no reported vulnerabilities, and it has low support. However, a build file is not available. You can download it from GitHub.
lucene/ is a search engine library; solr/ is a search engine server that uses Lucene.

  • To compile the sources, run 'ant compile'
  • To run all the tests, run 'ant test'
  • To set up your IDE, run 'ant idea' or 'ant eclipse'
  • For Maven info, see dev-tools/maven/README.maven
  • For more information on how to contribute, see http://wiki.apache.org/lucene-java/HowToContribute and http://wiki.apache.org/solr/HowToContribute

Support

  • lucene-solr has a low-activity ecosystem.
  • It has 16 stars and 19 forks. There are 25 watchers for this library.
  • It had no major release in the last 12 months.
  • There are 3 open issues and 0 closed issues. On average, issues are closed in 1,110 days. There are no pull requests.
  • It has a neutral sentiment in the developer community.
  • The latest version of lucene-solr is current.

Quality

  • lucene-solr has 0 bugs and 0 code smells.

Security

  • lucene-solr has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
  • lucene-solr code analysis shows 0 unresolved vulnerabilities.
  • There are 0 security hotspots that need review.

License

  • lucene-solr does not have a standard license declared.
  • Check the repository for any license declaration and review the terms closely.
  • Without a license, all rights are reserved, and you cannot use the library in your applications.

Reuse

  • lucene-solr releases are not available. You will need to build from source code and install.
  • lucene-solr has no build file. You will need to create the build yourself to build the component from source.
Top functions reviewed by kandi - BETA

kandi's functional review helps you automatically verify the functionalities of the libraries and avoid rework.
Currently covering the most popular Java, JavaScript and Python libraries. See a Sample Here

Get all kandi verified functions for this library.


lucene-solr Key Features

Mirror of Apache Lucene + Solr https://github.com/apache/lucene-solr


Community Discussions

Trending Discussions on lucene-solr
  • Using default and custom stop words with Apache's Lucene (weird output)
  • Spring Integration MDC for Async Flow and Task Executors

QUESTION

Using default and custom stop words with Apache's Lucene (weird output)

Asked 2020-Oct-13 at 12:31

I'm removing stop words from a String, using Apache's Lucene (8.6.3) and the following Java 8 code:

import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

...

private static final String CONTENTS = "contents";
final String text = "This is a short test! Bla!";
final List<String> stopWords = Arrays.asList("short", "test");
final CharArraySet stopSet = new CharArraySet(stopWords, true);

try {
    Analyzer analyzer = new StandardAnalyzer(stopSet);
    TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();

    while (tokenStream.incrementToken()) {
        System.out.print("[" + term.toString() + "] ");
    }

    tokenStream.close();
    analyzer.close();
} catch (IOException e) {
    System.out.println("Exception:\n");
    e.printStackTrace();
}

This outputs the desired result:

[this] [is] [a] [bla]

Now I want to use both the default English stop set, which should also remove "this", "is" and "a" (according to github) AND the custom stop set above (the actual one I'm going to use is a lot longer), so I tried this:

Analyzer analyzer = new EnglishAnalyzer(stopSet);

The output is:

[thi] [is] [a] [bla]

Yes, the "s" in "this" is missing. What's causing this? It also didn't use the default stop set.

The following changes remove both the default and the custom stop words:

Analyzer analyzer = new EnglishAnalyzer();
TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
tokenStream = new StopFilter(tokenStream, stopSet);

Question: What is the "right" way to do this? Is using the tokenStream within itself (see code above) going to cause problems?

Bonus question: How do I output the remaining words with the right upper/lower case, hence what they use in the original text?

ANSWER

Answered 2020-Oct-13 at 12:31

I will tackle this in two parts:

  • stop-words
  • preserving original case

Handling the Combined Stop Words

To handle the combination of Lucene's English stop word list, plus your own custom list, you can create a merged list as follows:

import org.apache.lucene.analysis.en.EnglishAnalyzer;

...

final List<String> stopWords = Arrays.asList("short", "test");
final CharArraySet stopSet = new CharArraySet(stopWords, true);

CharArraySet enStopSet = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
stopSet.addAll(enStopSet);

The above code simply takes the English stop words bundled with Lucene and merges them with your list.

That gives the following output:

[bla]

Handling Word Case

This is a bit more involved. As you have noticed, the StandardAnalyzer includes a step in which all words are converted to lower case - so we can't use that.

Also, if you want to maintain your own custom list of stop words, and if that list is of any size, I would recommend storing it in its own text file, rather than embedding the list into your code.

So, let's assume you have a file called stopwords.txt. In this file, there will be one word per line - and the file will already contain the merged list of your custom stop words and the official list of English stop words.

You will need to prepare this file manually yourself (i.e. ignore the notes in part 1 of this answer).

My test file is just this:

short
this
is
a
test
the
him
it

I also prefer to use the CustomAnalyzer for something like this, as it lets me build an analyzer very simply.

import org.apache.lucene.analysis.custom.CustomAnalyzer;

...

Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("stop",
                "ignoreCase", "true",
                "words", "stopwords.txt",
                "format", "wordset")
        .build();

This does the following:

  1. It uses the "icu" tokenizer org.apache.lucene.analysis.icu.segmentation.ICUTokenizer, which takes care of tokenizing on Unicode whitespace, and handling punctuation.

  2. It applies the stopword list. Note the use of true for the ignoreCase attribute, and the reference to the stop-word file. The format of wordset means "one word per line" (there are other formats, also).

The key here is that there is nothing in the above chain which changes word case.

So, now, using this new analyzer, the output is as follows:

[Bla]
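The merge-then-filter behaviour, and the case-preserving property of the custom chain, can be sketched without Lucene using plain JDK collections. This is a toy illustration only: the hard-coded English words below stand in for Lucene's much larger ENGLISH_STOP_WORDS_SET, and real tokenization is reduced to a pre-split token list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {

    // Drop tokens whose lower-cased form is in the stop set,
    // while keeping the surviving tokens in their original case.
    static List<String> filter(List<String> tokens, Set<String> stopSet) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopSet.contains(t.toLowerCase(Locale.ROOT))) {
                kept.add(t); // original case preserved
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // custom list, merged with a tiny stand-in for the English list
        Set<String> stopSet = new HashSet<>(Arrays.asList("short", "test"));
        stopSet.addAll(Arrays.asList("this", "is", "a", "the"));

        List<String> tokens = Arrays.asList("This", "is", "a", "short", "test", "Bla");
        System.out.println(filter(tokens, stopSet)); // prints [Bla]
    }
}
```

The point is the same as in the analyzer chain above: compare case-insensitively, but never rewrite the token itself.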

Final Notes

Where do you put the stop list file? By default, Lucene expects to find it on the classpath of your application. So, for example, you can put it in the default package.

But remember that the file needs to be handled by your build process, so that it ends up alongside the application's class files (not left behind with the source code).

I mostly use Maven - and therefore I have this in my POM to ensure the ".txt" file gets deployed as needed:

    <build>  
        <resources>  
            <resource>  
                <directory>src/main/java</directory>  
                <excludes>  
                    <exclude>**/*.java</exclude>  
                </excludes>  
            </resource>  
        </resources>  
    </build> 

This tells Maven to copy files (except Java source files) to the build target - thus ensuring the text file gets copied.
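Alternatively, assuming the default Maven project layout, you can avoid overriding the resources configuration entirely by keeping the file in the standard resources directory, which Maven copies to the build target automatically:

```
src/
  main/
    java/                 (Java sources, compiled as usual)
    resources/
      stopwords.txt       (copied to target/classes by default)
```

Either approach works; the POM snippet above is only needed because the file lives alongside the .java sources.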

Final note - I did not investigate why you were getting that truncated [thi] token. If I get a chance I will take a closer look.


Follow-Up Questions

After combining I have to use the StandardAnalyzer, right?

Yes, that is correct. The notes I provided in part 1 of the answer relate directly to the code in your question, and to the StandardAnalyzer you use.

I want to keep the stop word file on a specific non-imported path - how to do that?

You can tell the CustomAnalyzer to look in a "resources" directory for the stop-words file. That directory can be anywhere on the file system (for easy maintenance, as you noted):

import java.nio.file.Path;
import java.nio.file.Paths;

...

Path resources = Paths.get("/path/to/resources/directory");

Analyzer analyzer = CustomAnalyzer.builder(resources)
        .withTokenizer("icu")
        .addTokenFilter("stop",
                "ignoreCase", "true",
                "words", "stopwords.txt",
                "format", "wordset")
        .build();

Instead of using .builder() we now use .builder(resources).

Source https://stackoverflow.com/questions/64321901

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install lucene-solr

You can download it from GitHub.
You can use lucene-solr like any standard Java library. Please include the jar files in your classpath. You can also use any IDE to run and debug the lucene-solr component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.
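If you manage dependencies with Maven, the Lucene modules used in the snippets above can be declared roughly as follows. The coordinates are the standard org.apache.lucene artifacts for the 8.x line; the version shown matches the 8.6.3 release mentioned in the discussion, so adjust it to your needs:

```xml
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>8.6.3</version>
</dependency>
<!-- StandardAnalyzer, EnglishAnalyzer, CustomAnalyzer, stop filter -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>8.6.3</version>
</dependency>
<!-- only needed for the "icu" tokenizer -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-icu</artifactId>
    <version>8.6.3</version>
</dependency>
```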

Support

For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.

  • © 2022 Open Weaver Inc.