cf4j | CF4J : Collaborative Filtering for Java | Recommender System library

by ferortega | Java | Version: v2.2.7 | License: Apache-2.0

kandi X-RAY | cf4j Summary

cf4j is a Java library typically used in Artificial Intelligence, Recommender System, Deep Learning, and Spark applications. cf4j has no reported bugs or vulnerabilities, a build file is available, it has a permissive license, and it has low support. You can download it from GitHub or Maven.

A Java Collaborative Filtering library for carrying out research experiments on Collaborative Filtering based Recommender Systems. The library has been designed by researchers for researchers.

            Support

              cf4j has a low active ecosystem.
              It has 39 star(s) with 19 fork(s). There are 4 watchers for this library.
              It had no major release in the last 12 months.
              There is 1 open issue and 9 have been closed. On average, issues are closed in 217 days. There is 1 open pull request and 0 closed pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of cf4j is v2.2.7

            Quality

              cf4j has no bugs reported.

            Security

              cf4j has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

            License

              cf4j is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            Reuse

              cf4j releases are available to install and integrate.
              Deployable package is available in Maven.
              Build file is available. You can build the component from source.
              Installation instructions, examples and code snippets are available.

            Top functions reviewed by kandi - BETA

            kandi has reviewed cf4j and identified the functions below as its top functions. This is intended to give you instant insight into the functionality cf4j implements and help you decide whether it suits your requirements.
            • Get the similarity between the two users
            • Returns the similarity between the two users
            • Iterates over the items in the data model
            • Iterates over the list of relevant items
            • Main program of the benchmark
            • Set a single value for a series
            • Computes the similarity between two items
            • Computes the similarity between two users
            • Returns the similarity between two items
            • Returns the similarity between two items
            • Get the similarity between two users
            • Returns the sum similarity between two items
            • Main benchmark
            • Computes the distance between two items
            • Get the histogram content
            • Fit the model
            • Fit the model
            • Places recommendations
            • Create XY plot with data
            • Places the RMSE measure
            • Creates the XY plot
            • Creates and returns a bar plot for histogram
            • Create aral plot for the series
            • Fits the model
            • Gets the bar plot
            • Demonstrates how to draw the recommendations

            cf4j Key Features

            No Key Features are available at this moment for cf4j.

            cf4j Examples and Code Snippets

            CF4J: Collaborative Filtering for Java, Installation
            Java | Lines of Code: 13 | License: Permissive (Apache-2.0)
            
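            Maven:

              <dependency>
                <groupId>es.upm.etsisi</groupId>
                <artifactId>cf4j</artifactId>
                <version>2.2.2</version>
              </dependency>

            Gradle:

              compile group: 'es.upm.etsisi', name: 'cf4j', version: '2.2.2'

            SBT:

              libraryDependencies += "es.upm.etsisi" % "cf4j" % "2.2.2"

            Grape:

              @Grapes(
                  @Grab(group='es.upm.etsisi', module='cf4j', version='2.2.2')
              )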
            
            [es.upm.etsisi/  

            Community Discussions

            QUESTION

            Dataframe users who did not purchase item for user-item collaborative filtering
            Asked 2022-Mar-05 at 12:35

            I intend to use hybrid user-item collaborative filtering to build a Top-N recommender system with TensorFlow Keras.

            Currently my dataframe consists of the columns user_id, article_id, and purchase.

            purchase is always TRUE because the dataset is a history of user-article purchases.

            This dataset has 800,000 rows and 3 columns

            2 Questions
            1. How do I process it such that I will have 20% purchase = true and 80% purchase = false to train the model?

            2. Is a 20%, 80% true:false ratio good for this use case?

            ...

            ANSWER

            Answered 2022-Mar-05 at 12:35
            1. How do I process it such that I will have 20% purchase = true and 80% purchase = false to train the model?

            Since you only have True values, it means that you'll have to generate the False values. The only False that you know of are the user-item interactions that are not present in your table. If your known interactions can be represented as a sparse matrix (meaning, a low percentage of the possible interactions, N_ITEMS x N_USER, is present) then you can do this:

            1. Generate a random user-item combination
            2. If the user-item interaction exists (meaning it is True), repeat step 1.
            3. If the user-item interaction does not exist, you can consider it a False interaction.

            Now, to complete your 20%/80% part, just define the size N of the sample that you'll take from your ground truth data (True values) and take 4*N False values using the previous steps. Remember to keep some ground truth values for your test and evaluation steps.
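The three steps above amount to rejection sampling. The following is a minimal Java sketch of that idea, using only the standard library. The class name `NegativeSampler`, the `key` encoding of a pair as a single `long`, and the fixed seed are illustrative assumptions and not part of the original answer. It also assumes every positive pair lies inside the `numUsers` x `numItems` grid and that the matrix is sparse, so rejections are rare.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

/** Rejection-samples user-item pairs that are absent from the purchase history. */
final class NegativeSampler {

    /** Encodes a (user, item) pair as one long so it can be stored in a HashSet. */
    static long key(int user, int item, int numItems) {
        return (long) user * numItems + item;
    }

    /**
     * Draws `count` distinct pairs that are not in `positives`.
     * Hypothetical helper: assumes all positives fall inside the user x item grid.
     */
    static Set<Long> sample(Set<Long> positives, int numUsers, int numItems,
                            int count, long seed) {
        long free = (long) numUsers * numItems - positives.size();
        if (count > free) {
            throw new IllegalArgumentException("not enough unobserved pairs");
        }
        Random rng = new Random(seed);
        Set<Long> negatives = new HashSet<>();
        while (negatives.size() < count) {
            // Step 1: generate a random user-item combination.
            long candidate = key(rng.nextInt(numUsers), rng.nextInt(numItems), numItems);
            // Step 2: an existing interaction is True, so draw again.
            if (positives.contains(candidate)) {
                continue;
            }
            // Step 3: an unobserved pair is treated as a False interaction.
            negatives.add(candidate);
        }
        return negatives;
    }

    public static void main(String[] args) {
        int numUsers = 100;
        int numItems = 50;
        Set<Long> positives = new HashSet<>();
        positives.add(key(0, 1, numItems));
        positives.add(key(3, 7, numItems));
        // 20% True / 80% False: take 4 negatives per sampled positive.
        Set<Long> negatives = sample(positives, numUsers, numItems, 4 * positives.size(), 42L);
        System.out.println(negatives.size() + " negatives for " + positives.size() + " positives");
    }
}
```

Changing the multiplier `4` in `main` is enough to try other True/False ratios, such as 1 for a 50/50 split.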

            2. Is a 20%, 80% true:false ratio good for this use case?

            In this case, since you only have True values in your ground truth dataset, I think the best you can do is to try out different ratios. Your real world data only contains True values, but you could also generate all of the False values. The important part to consider is that some of the values that you'll consider False while training might actually be True values in your test and validation data. Just don't use all of your ground truth data, and don't generate a large portion of the possible combinations.

            I think a good start could be 50/50, then try 60/40 and so on. Evaluate using multiple metrics and see how they change with the proportion of True/False values (some proportions might reach higher true positive rates, others will perform worse, etc.). In the end, you'll have to select one model and one training procedure according to the metrics that matter the most to you.

            Source https://stackoverflow.com/questions/71359291

            QUESTION

            How to Deploy ML Recommender System on AWS
            Asked 2021-Nov-05 at 01:24

            I'm dabbling with ML and was able to take a tutorial and get it to work for my needs. It's a simple recommender system using TfidfVectorizer and linear_kernel. I run into a problem with how I go about deploying it through Sagemaker with an end point.

            ...

            ANSWER

            Answered 2021-Nov-05 at 01:24

            I came to the conclusion that I didn't need to deploy this through SageMaker. Since the final linear_kernel output was a Dictionary I could do quick ID lookups to find correlations.

            I have it working on AWS with API Gateway/Lambda, DynamoDB and an EC2 server to collect, process and plug the data into DynamoDB for fast lookups. No expensive SageMaker endpoint needed.

            Source https://stackoverflow.com/questions/69693666

            QUESTION

            What does .nonzero()[0] mean when we want to compute the sparsity of a matrix?
            Asked 2021-Oct-18 at 17:43

            I am trying to learn about recommender systems in Python by reading a blog that contains a great example of creating a recommender system of repositories in GitHub.

            Once the dataset is loaded with read_csv(), the person who wrote the code decided to convert that data into a pandas pivot_table to visualize it more simply (that part of the code was shown as an image in the original post).

            In that table, rows are the users and columns are the repositories. The intersection of a row and a column is the score that a user gives to a particular repository.

            Due to the fact that many of the elements of that table are null (we can say that we are having a sparse matrix, very typical in machine learning), he decided to study the level of sparsity of the matrix by means of this code:

            ...

            ANSWER

            Answered 2021-Oct-18 at 17:43

            By default, nonzero will return a tuple of the form (row_idxs, col_idxs). If you hand it a one-dimensional array (like a pandas series), then it will still return a tuple, (row_idxs,). To access this first array, we still must index ratings.nonzero()[0] to get the first-dimension index of nonzero elements.

            More information is available in the NumPy documentation for nonzero, as both pandas and NumPy use the same implementation.

            Source https://stackoverflow.com/questions/69620304

            QUESTION

            how to make an integer index corresponding to a string value?
            Asked 2021-Jul-25 at 05:41

            I'm currently building a recommender system using Goodreads data.

            I want to change string user ids into integers. Current user ids are like this: '0d688fe079530ee1fe6fa85eab10ec5c'

            I want to change them into integers (e.g. 1, 2, 3, ...), so that rows sharing the same string id get the same integer id. I've considered using df.groupby('user_id'), but I couldn't figure out how to do this.

            I would be very thankful if anybody let me know how to change.

            ...

            ANSWER

            Answered 2021-Jul-25 at 04:52

            Use pd.factorize as suggested by @AsishM.

            Input data:

            Source https://stackoverflow.com/questions/68515268
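Since the answer's pandas code is elided, here is the idea behind pd.factorize sketched in Java with the standard library only: each distinct string receives the next free integer in order of first appearance, and repeated strings reuse their code. The class name `IdIndexer` and both method names are hypothetical. Codes start at 0 here, which matches pd.factorize's default; add 1 if ids should start at 1.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Maps string ids to dense integer codes in order of first appearance. */
final class IdIndexer {

    /** Builds the string -> code lookup table. */
    static Map<String, Integer> buildIndex(List<String> ids) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String id : ids) {
            // index.size() is the next unused code; existing ids keep their code.
            index.putIfAbsent(id, index.size());
        }
        return index;
    }

    /** Replaces every id by its integer code. */
    static int[] encode(List<String> ids, Map<String, Integer> index) {
        int[] codes = new int[ids.size()];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = index.get(ids.get(i));
        }
        return codes;
    }

    public static void main(String[] args) {
        // The second and third values are made-up placeholders for illustration.
        List<String> userIds = List.of(
                "0d688fe079530ee1fe6fa85eab10ec5c", "aaaa", "0d688fe079530ee1fe6fa85eab10ec5c", "bbbb");
        Map<String, Integer> index = buildIndex(userIds);
        System.out.println(Arrays.toString(encode(userIds, index)));
    }
}
```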

            QUESTION

            How can I ensure that all users and all items appear in the training set of my recommender system?
            Asked 2021-Jun-11 at 20:37

            I am building a recommender system in Python using the MovieLens dataset (https://grouplens.org/datasets/movielens/latest/). In order for my system to work correctly, I need all the users and all the items to appear in the training set. However, I have not found a way to do that yet. I tried using sklearn.model_selection.train_test_split on the partition of the dataset relevant to each user and then concatenated the results, thus succeeding in creating training and test datasets that contain at least one rating given by each user. What I need now is to find a way to create training and test datasets that also contain at least one rating for each movie.

            ...

            ANSWER

            Answered 2021-Jun-11 at 20:37

            This requirement is quite reasonable, but is not supported by the data ingestion routines for any framework I know. Most training paradigms presume that your data set is populated sufficiently that there is a negligible chance of missing any one input or output.

            Since you need to guarantee this, you need to switch to an algorithmic solution, rather than a probabilistic one. I suggest that you tag each observation with the input and output, and then apply the "set coverage problem" to the data set.

            You can continue with as many distinct covering sets as needed to populate your training set (which I recommend). Alternately, you can set a lower threshold of requirement -- say get three sets of total coverage -- and then revert to random methods for the remainder.
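One way to picture the "cover first, randomize the rest" approach is the Java sketch below. It is a simple greedy pass, not an optimal set-cover solver, and the names `CoverageSplit`, `Split`, and `split` are hypothetical. Each rating is reduced to an `int[]{user, item}` pair for brevity. Any rating that introduces a new user or a new item is forced into the training set; the remaining ratings are split at random.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Train/test split in which every user and every item appears in training. */
final class CoverageSplit {

    /** Holds the two partitions; each rating is an int[]{user, item}. */
    static final class Split {
        final List<int[]> train = new ArrayList<>();
        final List<int[]> test = new ArrayList<>();
    }

    static Split split(List<int[]> ratings, double testFraction, long seed) {
        List<int[]> shuffled = new ArrayList<>(ratings);
        Collections.shuffle(shuffled, new Random(seed));

        Split result = new Split();
        Set<Integer> coveredUsers = new HashSet<>();
        Set<Integer> coveredItems = new HashSet<>();
        List<int[]> pool = new ArrayList<>();

        // Deterministic pass: a rating that covers a new user or item goes to training.
        for (int[] r : shuffled) {
            boolean newUser = coveredUsers.add(r[0]);
            boolean newItem = coveredItems.add(r[1]);
            if (newUser || newItem) {
                result.train.add(r);
            } else {
                pool.add(r);
            }
        }

        // Random pass: the pool is already shuffled, so its first chunk becomes the test set.
        int testSize = (int) Math.min(pool.size(), Math.round(testFraction * ratings.size()));
        result.test.addAll(pool.subList(0, testSize));
        result.train.addAll(pool.subList(testSize, pool.size()));
        return result;
    }

    public static void main(String[] args) {
        List<int[]> ratings = new ArrayList<>();
        for (int user = 0; user < 5; user++) {
            for (int item = 0; item < 4; item++) {
                ratings.add(new int[]{user, item});
            }
        }
        Split s = split(ratings, 0.3, 7L);
        System.out.println("train=" + s.train.size() + " test=" + s.test.size());
    }
}
```

If the data is very sparse, the covering pass may consume most ratings and the test set will end up smaller than the requested fraction; the sketch caps the test size rather than breaking coverage.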

            Source https://stackoverflow.com/questions/67942944

            QUESTION

            LensKit Recommender only returns results for some users, otherwise returns empty DataFrame. Why is this happening?
            Asked 2021-May-23 at 02:53

            I am trying to implement a group recommender system with the Django framework, using the LensKit tools for Python (specifically a Recommender object which adapts the UserUser algorithm). However, it only returns individual recommendations in some cases (for some specific users), but it always returns recommendations for groups of users (I create a hybrid user whose scores are the average of group members' scores and request recommendations for it). Below is my implementation for requesting recommendations for an individual user, as well as for a group:

            ...

            ANSWER

            Answered 2021-May-23 at 02:53

            The most likely cause of this problem is that the user-user recommender cannot build enough viable neighborhoods to provide recommendations. This is a downside to neighborhood-based recommendations.

            The solutions are to either switch to an algorithm that can always recommend for a user with some ratings (e.g. one of the matrix factorization algorithms), and/or use a fallback algorithm such as Popular to recommend when the personalized collaborative filter cannot recommend.

            (Another solution would be to implement one of the various cold-start recommenders or a content-based recommender for LensKit, but none are currently provided by the project.)

            Source https://stackoverflow.com/questions/67647854

            QUESTION

            How to get similarity score for unseen documents using Gensim Doc2Vec model?
            Asked 2021-May-19 at 09:07

            I have trained a gensim doc2vec model for an English news recommender system. The model was trained with 40K news data. I am using the code below to recommend the top 5 most similar news items, e.g. for news_1:

            ...

            ANSWER

            Answered 2021-May-19 at 09:07

            There's a bulk contiguous vector structure initially created by training, for the initial known set of vectors. It's amenable to the every-candidate bulk vector calculation at the heart of most_similar() - so that operation goes about as fast as it can, with the right vector libraries for your OS/processor.

            But, that structure wasn't originally designed with incremental expansion in mind. Indeed, if you have 1 million vectors in a dense array, then want to add 1 to the end, the straightforward approach requires you to allocate a new 1-million-and-1 long array, bulk copy over the 1 million, then add the last 1. That works, but what seems like a "tiny" operation then takes a while, and ever-longer as the structure grows. And, each add more-than-doubles the temporary memory usage, for the bulk copy. So, the naive pattern of adding a whole bunch of new items individually in a loop can be really slow & memory-intensive.

            So, Gensim hasn't yet focused on providing a set-of-vectors that's easy & efficient to incrementally grow with new vectors. But, it's still indirectly possible, if you understand the caveats.

            Especially in gensim-4.0.0 & above, the .dv set of doc-vectors is an instance of KeyedVectors with all that class's standard functions. Those include the add_vector() and add_vectors() methods:

            https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.add_vector

            https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.add_vectors

            You can try these methods to add your new inferred vectors to the model.dv object - and then they'll also be included in follow-up most_similar() results.

            But keep in mind:

            1. The above caveats about performance & memory-usage - which may be minor concerns as long as your dataset isn't too large, or manageable if you do additions in occasional larger batches.

            2. The containing Doc2Vec model generally isn't expecting its internal .dv to be arbitrarily modified or expanded by other code. So, once you start doing that, parts of the model may not behave as expected. If you have problems with this, you could consider saving-aside the full Doc2Vec model before any direct-tampering with its .dv, and/or only expanding a completely separate instance of the doc-vectors, for example by saving them aside (eg: model.dv.save(DOC_VECS_FILENAME)) & reloading them into a separate KeyedVectors (eg: growing_docvecs = KeyedVectors.load(DOC_VECS_FILENAME)).

            Source https://stackoverflow.com/questions/67596945

            QUESTION

            Unable to create dataframe from RDD
            Asked 2021-May-10 at 14:34

            I am trying to create a recommender system from this kaggle dataset: f7a1f242-c

            https://www.kaggle.com/kerneler/starter-user-artist-playcount-dataset-f7a1f242-c

            the file is called: "user_artist_data_small.txt"

            The data looks like this:

            1059637 1000010 238

            1059637 1000049 1

            1059637 1000056 1

            1059637 1000062 11

            1059637 1000094 1

            I'm getting an error on the third last line of code.

            ...

            ANSWER

            Answered 2021-May-10 at 14:09

            Just create a dataframe using the CSV reader (with a space delimiter) instead of creating an RDD:

            Source https://stackoverflow.com/questions/67471938

            QUESTION

            Combining output in pandas?
            Asked 2021-Apr-07 at 23:00

            I have a movie recommender system I have been working on and currently it is printing two different sets of output because I have two different types of recommendation engines. Code is like this:

            ...

            ANSWER

            Answered 2021-Apr-07 at 23:00

            If the return type of get_input_movie() is a Pandas DataFrame or a Pandas Series, you can try:

            Replace the following 2 lines:

            Source https://stackoverflow.com/questions/66994657

            QUESTION

            How to get a while loop to start over after error?
            Asked 2021-Mar-23 at 21:34

            I would like to say I still feel fairly new to Python in general. But I have a movie recommender system that I have been working on, and the way I have it set up is for the user to enter a movie in the console; it then prints 10 recommendations and asks for another movie. When a misspelled movie is entered, it gives the error message KeyError: 'Goodfellas' and stops running. I would like it to just start the loop over, until the user ends the loop using my break word. Here is my code for reference.

            ...

            ANSWER

            Answered 2021-Mar-23 at 21:19

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install cf4j

            CF4J is available through the most popular dependency management tools for Java. To add it to your project, add the dependency declaration for your build tool (see the Installation snippet above). You can find additional information about these dependencies at https://mvnrepository.com/artifact/es.upm.etsisi/cf4j.

            If you prefer to use the library without a dependency management tool, add the packaged CF4J jar to your project's classpath. For example, in IntelliJ IDEA, copy the file into your project's directory, right-click the jar file, and select Add as Library. The packaged jar is available in the releases section of the GitHub repository. You can also package your own jar file: clone the repository using git clone git@github.com:ferortega/cf4j.git and package it with mvn package.

            Let's code our first experiment with CF4J. You can find the full code of this example in GettingStartedExample.

            First of all, we need to load MovieLens's ratings. CF4J includes preloaded versions of the most popular ratings databases. You can retrieve them using the BenchmarkDataModels class. In this experiment we will load the MovieLens 100k dataset:

                DataModel datamodel = BenchmarkDataModels.MovieLens100K();

            As you can see, the MovieLens dataset has been loaded into a DataModel. A DataModel is a high-level in-memory representation of the data structure required by collaborative filtering algorithms.

            Now, we need to create an object to store the results of our experiment. CF4J includes tools to analyze experimental results. You can find them in the es.upm.etsisi.plot package. In this case, we want to analyze how the Mean Squared Error (MSE) varies with the value of the regularization term in the Probabilistic Matrix Factorization (PMF) recommender, so we will use a LinePlot:

                double[] regValues = {0.000, 0.025, 0.05, 0.075, 0.100, 0.125, 0.150, 0.175, 0.200, 0.225, 0.250};
                LinePlot plot = new LinePlot(regValues, "regularization", "MSE");

            At this point everything is ready to perform the experiment. We add a new empty series to the plot:

                plot.addSeries("PMF");

            Then we iterate over the regularization values, fitting a new instance of the PMF recommender for each one, computing the MSE of the fitted recommender's predictions, and adding the MSE score to the plot data. Note that the remaining hyper-parameters of the model have been fixed for this experiment (numFactors=6, numIters=50, gamma=0.01 and seed=43):

                for (double reg : regValues) {
                    PMF pmf = new PMF(datamodel, 6, 50, reg, 0.01, 43);
                    pmf.fit();

                    QualityMeasure mse = new MSE(pmf);
                    double mseScore = mse.getScore();

                    plot.setValue("PMF", reg, mseScore);
                }

            Finally, we visualize the experimental results. To draw the plot we use:

                plot.draw();

            This produces a line chart of MSE against the regularization value. To print the plot data to the standard output console we use:

                plot.printData("0.000");

            And we obtain the following output:

                +----------------+-------+
                | regularization | PMF   |
                +----------------+-------+
                | 0.000          | 1.150 |
                | 0.025          | 1.070 |
                | 0.050          | 1.021 |
                | 0.075          | 0.990 |
                | 0.100          | 0.972 |
                | 0.125          | 0.966 |
                | 0.150          | 0.969 |
                | 0.175          | 0.979 |
                | 0.200          | 0.993 |
                | 0.225          | 1.009 |
                | 0.250          | 1.027 |
                +----------------+-------+
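To read the result off the printed data programmatically, a small stand-alone sketch (plain Java, no CF4J classes) can locate the regularization value with the lowest MSE. The class `BestRegularization` and its `argMin` helper are hypothetical names introduced here; the two arrays simply copy the values from the output above, where the minimum MSE of 0.966 occurs at a regularization of 0.125.

```java
import java.util.Locale;

/** Picks the regularization value with the lowest MSE from the printed results. */
final class BestRegularization {

    /** Returns the index of the smallest element (first one on ties). */
    static int argMin(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("empty array");
        }
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] < values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Values copied from the experiment output; not recomputed here.
        double[] regValues = {0.000, 0.025, 0.050, 0.075, 0.100, 0.125,
                              0.150, 0.175, 0.200, 0.225, 0.250};
        double[] mseScores = {1.150, 1.070, 1.021, 0.990, 0.972, 0.966,
                              0.969, 0.979, 0.993, 1.009, 1.027};

        int best = argMin(mseScores);
        System.out.println(String.format(Locale.ROOT,
                "best regularization = %.3f (MSE %.3f)", regValues[best], mseScores[best]));
    }
}
```

Because these scores come from a single run with a fixed seed, the selected value is best treated as a starting point for a finer search rather than a definitive optimum.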

            Support

            For new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check the Stack Overflow community page and ask there.

            CLONE
          • HTTPS

            https://github.com/ferortega/cf4j.git

          • CLI

            gh repo clone ferortega/cf4j

          • sshUrl

            git@github.com:ferortega/cf4j.git
