How to use isolation forest for anomaly detection in scikit-learn Python

by Dejaswarooba Updated: Jul 10, 2023

Solution Kit

The Isolation Forest algorithm is also known as Isolation-Based Anomaly Detection. It is a powerful method for detecting anomalies in each dataset. Isolation Forest model leverages the concept of isolation trees to isolate individual observations.

The algorithm partitions the data by constructing decision trees using random splits. I can also be done using selected features to create a tree structure. This tree-based model is known as the Isolation Forest. It separates data points from anomalous ones by assigning partitions of the latter. It is particularly effective in detecting anomalies in high-dimensional datasets. Also, in time series data and even for credit card fraud detection.

To implement Isolation Forest, you can use the IsolationForest function provided by scikit-learn. It allows you to isolate and score observations based on their anomaly status. The algorithm assigns an anomaly score to each data point. It indicates its abnormality level. You can identify and flag anomalous observations by comparing scores to the threshold. This threshold value can be adjusted based on the specific requirements.

The Isolation Forest algorithm is an unsupervised outlier detection method. It makes it suitable for scenarios where labeled data is limited. It stands out due to its ability to handle large datasets. It also helps with its capability to handle both numeric and categorical features. The Isolation Forest complements Local Outlier Factor. It offers various tools for anomaly detection tasks.

The Isolation Forest algorithm is a prominent approach for anomaly detection. It uses the isolation trees concept to identify anomalous behavior within a dataset. The isolation tree is built by data partitioning through random splits. This process isolates individual observations as "isolates" in the form of binary trees. Unlike normal points, anomalous ones need fewer random partitions before being isolated.

Implementing the Isolation Forest allows applying the trained model to new data points. It helps determine their anomaly status. This can be particularly useful for real-time anomaly detection in various domains. The Isolation Forest algorithm helps identify anomalous behavior and uncover insights. It may need to be evident through traditional data analysis techniques.

Scatter and box plots can understand the distribution of normal and anomalous observations. These visualizations can help interpret the model's output and aid in decision-making.

How to use isolation forest for anomaly detection in scikit-learn Python

You must import the appropriate libraries to use Isolation Forest for anomaly identification. It includes the IsolationForest class and numpy.
Isolation Forest only works with numerical data, so ensure it's in the right format. You can use one-hot or label encoding if you have categorical data. It helps transform them into numerical variables.
Following that, you create an instance of the IsolationForest class. Define the tree count (n estimators) and the estimated outlier fraction (contamination). We should define a random state for repeatability.
After constructing the instance, use the fit() method to fit the model to your data. Once the model has been trained, you may use the predict() method to predict abnormalities.
This method returns an array of -1's and 1's, where -1 represents an anomaly and 1 represents a non-anomaly.
Finally, you may extract the anomalous data points by using numpy. It filters out the data points with a corresponding -1 value.

Preview of the output obtained

Code

The fit() method is used to fit the model to the training set. The predict() method is used after the model has been trained to forecast the anomalies in the training, test, and outlier datasets.
print(y_pred_test) and print(y_pred_outliers) print the projected values for the test and outlier datasets, respectively.
Because all of the projected values are -1, the Isolation Forest method successfully found all of the outliers in the X outliers dataset. According to the model, the expected values for the test dataset are all 1, indicating that there are no outliers in this dataset.

Isolation Forest

PythonLines of Code : 44License : Strong Copyleft (CC BY-SA 4.0)

Dependent Libraries :

# fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers) 

print(y_pred_outliers)

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
data = load_iris()

X=data.data
y=data.target
X_outliers = rng.uniform(low=-4, high=4, size=(X.shape[0], X.shape[1]))

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=0)

clf = IsolationForest()
clf.fit(X_train)

y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

print(y_pred_test)
print(y_pred_outliers)

[ 1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1]

[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]

Follow the steps carefully to get the output easily.

Install Visual Studio Code in your computer.
import the required libraries using the commands -

pip install scikit-learn

pip install numpy

Open the folder in the code editor, copy and paste the above kandi code snippet in the python file.
Remove the below mentioned parts of the code for better understanding of Isolation forest.

Run the code using the run command.

I hope you found this useful. I have added version information and depending libraries in the following sections.

I found this code snippet by searching for "isolation forest for anomaly detection in scikit-learn Python" in kandi. You can try any such use case!

Dependent libraries

numpyby numpy

Python

23755

Version:v1.25.0rc1

License: Permissive (BSD-3-Clause)

The fundamental package for scientific computing with Python.

Support

Quality

Security

License

Reuse

numpyby numpy

Python 23755 Version:v1.25.0rc1 License: Permissive (BSD-3-Clause)

The fundamental package for scientific computing with Python.

Support

Quality

Security

License

Reuse

scikit-learnby scikit-learn

Python

54584

Version:1.2.2

License: Permissive (BSD-3-Clause)

scikit-learn: machine learning in Python

Support

Quality

Security

License

Reuse

scikit-learnby scikit-learn

Python 54584 Version:1.2.2 License: Permissive (BSD-3-Clause)

scikit-learn: machine learning in Python

Support

Quality

Security

License

Reuse

If you do not have scikit-learn and numpy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the page in kandi.

You can search for any dependent library on kandi like scikit-learn.

Environment tested

This code had been tested using python version 3.8.0
scikit-learn version 1.2.2 has been used.
numpy version 1.24.2 has been used.

Support

For any support on kandi solution kits, please use the chat
For further learning resources, visit the Open Weaver Community learning page.

FAQ

1. What is Isolation Forest for Anomaly Detection?

Isolation Forest is an unsupervised anomaly detection algorithm. It uses isolation trees to isolate anomalous observations. It leverages the concept of fewer random partitions needed to isolate anomalies. It makes it efficient for identifying anomalies in large datasets.

2. How does the Isolation Forest Algorithm work to detect anomalies?

The Isolation Forest Algorithm detects anomalies by isolating observations. We can do so using a series of random splits and feature selections. It is where anomalous data points need random partitions. It can be isolated compared to normal points.

3. What are isolates, and how do they help with credit card fraud detection?

Isolates are individual observations. They are separated and treated as individual trees in the Isolation Forest algorithm. They help with credit card fraud detection. You can isolate and flag fraudulent credit card transactions as anomalous points. We can do it based on the algorithm's partitioning process and anomaly scoring.

4. Are series data effective when using an Isolation Forest for anomaly detection?

Yes, series data is effective when using an Isolation Forest for anomaly detection. It captures temporal patterns and dependencies. It enables better identification of anomalous behavior over time.

5. How are binary decision trees implemented in the isolation forest model?

Binary decision trees are implemented in the construction of an isolation forest model. You can do so by partitioning the data using random splits on selected features. It creates a tree structure where each internal node represents a binary decision. It is based on a feature and split value.

See similar Kits and Libraries

Open Weaver – Develop Applications Faster with Open Source

Terms
Privacy policy

How to use isolation forest for anomaly detection in scikit-learn Python

How to use isolation forest for anomaly detection in scikit-learn Python

Code

Dependent libraries

Environment tested

Support

FAQ

Open Weaver – Develop Applications Faster with Open Source

kandi

Community and Support

Company

Follow