Commit bb4e77f

docs: reorganize README with table of contents and clearer usage sections
- Added a Table of Contents for easier navigation
- Moved Introduction and Features into their own sections
- Created separate “Getting Started” and “Usage examples” sections
- Added a note on running tests (`./gradlew test`)
- Renamed “Validation” section to “Performance and benchmarks”
- Moved the copyright/license info to the bottom
- Incorporated minor editorial fixes and improved headings
1 parent e2d5585 commit bb4e77f

1 file changed (+50, -24 lines changed)

README.md

Lines changed: 50 additions & 24 deletions
Original file line number | Diff line number | Diff line change
@@ -5,30 +5,40 @@
55
[![Release](https://img.shields.io/github/v/release/linkedin/isolation-forest)](https://github.com/linkedin/isolation-forest/releases/)
66
[![License](https://img.shields.io/badge/License-BSD%202--Clause-orange.svg)](LICENSE)
77

8-
## Introduction
9-
10-
This is a Scala/Spark implementation of the Isolation Forest unsupervised outlier detection
11-
algorithm. This library was created by [James Verbus](https://www.linkedin.com/in/jamesverbus/) from
12-
the LinkedIn Anti-Abuse AI team.
13-
14-
The `isolation-forest` module supports distributed training and scoring in Scala using Spark data structures.
15-
It inherits from the `Estimator` and `Model` classes in [Spark's ML library](https://spark.apache.org/mllib/)
16-
in order to take advantage of machinery such as `Pipeline`s. Model persistence on HDFS is
17-
supported.
8+
## Table of contents
9+
- [Introduction](#introduction)
10+
- [Features](#features)
11+
- [Getting started](#getting-started)
12+
- [Building the library](#building-the-library)
13+
- [Add an isolation-forest dependency to your project](#add-an-isolation-forest-dependency-to-your-project)
14+
- [Usage examples](#usage-examples)
15+
- [Model parameters](#model-parameters)
16+
- [Training and scoring](#training-and-scoring)
17+
- [Saving and loading a trained model](#saving-and-loading-a-trained-model)
18+
- [ONNX conversion for portable inference](#onnx-conversion-for-portable-inference)
19+
- [Converting a trained model to ONNX](#converting-a-trained-model-to-onnx)
20+
- [Using the ONNX model for inference (example in Python)](#using-the-onnx-model-for-inference-example-in-python)
21+
- [Performance and benchmarks](#performance-and-benchmarks)
22+
- [Copyright and license](#copyright-and-license)
23+
- [Contributing](#contributing)
24+
- [References](#references)
1825

19-
The `isolation-forest-onnx` module provides Python-based converter to convert a trained model to ONNX format for broad
20-
portability across platforms and languages. [ONNX](https://onnx.ai/) is an open format built to represent machine
21-
learning models.
26+
## Introduction
2227

23-
## Copyright
28+
This is a distributed Scala/Spark implementation of the Isolation Forest unsupervised outlier detection
29+
algorithm. It features support for ONNX export for easy cross-platform inference. This library was created
30+
by [James Verbus](https://www.linkedin.com/in/jamesverbus/) from the LinkedIn Anti-Abuse AI team.
2431

25-
Copyright 2019 LinkedIn Corporation
26-
All Rights Reserved.
32+
## Features
2733

28-
Licensed under the BSD 2-Clause License (the "License").
29-
See [License](LICENSE) in the project root for license information.
34+
* **Distributed training and scoring:** The `isolation-forest` module supports distributed training and scoring in Scala
35+
using Spark data structures. It inherits from the `Estimator` and `Model` classes in [Spark's ML library](https://spark.apache.org/mllib/) in
36+
order to take advantage of machinery such as `Pipeline`s. Model persistence on HDFS is supported.
37+
* **Broad portability via ONNX:** The `isolation-forest-onnx` module provides a Python-based converter to convert a
38+
trained model to ONNX format for broad portability across platforms and languages. [ONNX](https://onnx.ai/) is an open format built
39+
to represent machine learning models.
3040

31-
## How to use
41+
## Getting started
3242

3343
### Building the library
3444

@@ -51,6 +61,11 @@ To force a rebuild of the library, you can use:
5161
./gradlew clean build --no-build-cache
5262
```
5363

64+
To just run the tests:
65+
```bash
66+
./gradlew test
67+
```
68+
5469
### Add an isolation-forest dependency to your project
5570

5671
Please check [Maven Central](https://repo.maven.apache.org/maven2/com/linkedin/isolation-forest/) for the latest
@@ -89,6 +104,8 @@ Here is an example for a recent Spark/Scala version combination.
89104
</dependency>
90105
```
91106

107+
## Usage examples
108+
92109
### Model parameters
93110

94111
| Parameter | Default Value | Description |
@@ -104,6 +121,7 @@ Here is an example for a recent Spark/Scala version combination.
104121
| predictionCol | "predictedLabel" | The predicted label. This column is appended to the input DataFrame upon scoring. |
105122
| scoreCol | "outlierScore" | The outlier score. This column is appended to the input DataFrame upon scoring. |
106123

124+
107125
### Training and scoring
108126

109127
Here is an example demonstrating how to import the library, create a new `IsolationForest`
@@ -203,7 +221,7 @@ isolationForestModel.write.overwrite.save(path)
203221
val isolationForestModel2 = IsolationForestModel.load(path)
204222
```
205223

206-
## ONNX model conversion and inference
224+
## ONNX conversion for portable inference
207225

208226
### Converting a trained model to ONNX
209227

@@ -276,7 +294,7 @@ print('ONNX Converter outlier scores:')
276294
print(np.transpose(actual_outlier_scores[:num_examples_to_print])[0])
277295
```
278296

279-
## Validation
297+
## Performance and benchmarks
280298

281299
The original 2008 "Isolation forest" paper by Liu et al. published the AUROC results obtained by
282300
applying the algorithm to 12 benchmark outlier detection datasets. We applied our implementation of
@@ -299,11 +317,19 @@ result. The quoted uncertainty is the one-sigma error on the mean.
299317
| [Arrhythmia](http://odds.cs.stonybrook.edu/arrhythmia-dataset/) | 0.80 | 0.804 &plusmn; 0.002 |
300318
| [Ionosphere](http://odds.cs.stonybrook.edu/ionosphere-dataset/) | 0.85 | 0.8481 &plusmn; 0.0002 |
301319

302-
Our implementation provides AUROC values that are in very good agreement the results in the original
303-
Liu et al. publication. There are a few very small discrepancies that are likely due the limited
320+
Our implementation provides AUROC values that are in very good agreement with the results in the original
321+
Liu et al. publication. There are a few very small discrepancies that are likely due to the limited
304322
precision of the AUROC values reported in Liu et al.
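
For readers who want to reproduce this kind of comparison, here is a minimal sketch of how an AUROC value and a one-sigma error on the mean could be computed from outlier scores. It is not the evaluation code used for the table above. The function names `auroc` and `mean_with_standard_error` and all sample numbers are hypothetical and chosen only for illustration. The sketch assumes that a higher outlier score means "more anomalous" and that labels are 1 for outliers and 0 for inliers.

```python
from math import sqrt
from statistics import mean, stdev


def auroc(labels, scores):
    """AUROC computed as the fraction of (outlier, inlier) pairs that are
    ranked correctly, i.e. the normalized Mann-Whitney U statistic."""
    outliers = [s for label, s in zip(labels, scores) if label == 1]
    inliers = [s for label, s in zip(labels, scores) if label == 0]
    if not outliers or not inliers:
        raise ValueError("need at least one outlier and one inlier")
    # A pair counts 1 if the outlier scores higher, 0.5 on a tie, 0 otherwise.
    wins = sum(
        1.0 if o > i else 0.5 if o == i else 0.0
        for o in outliers
        for i in inliers
    )
    return wins / (len(outliers) * len(inliers))


def mean_with_standard_error(values):
    """Return (mean, standard error of the mean) over repeated trials."""
    return mean(values), stdev(values) / sqrt(len(values))


# Hypothetical labels and scores for a single trial (made-up numbers).
labels = [0, 0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.30]
print("AUROC for one trial:", auroc(labels, scores))

# Hypothetical AUROC values from repeated trials (made-up numbers).
trial_aurocs = [0.80, 0.81, 0.79]
avg, err = mean_with_standard_error(trial_aurocs)
print(f"AUROC over trials: {avg:.3f} +/- {err:.3f}")
```

The pairwise comparison is quadratic in the number of rows, which is fine for a small illustration. For large datasets a rank-based implementation or an existing metrics library would be the more practical choice.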
305323

306-
## Contributions
324+
## Copyright and license
325+
326+
Copyright 2019 LinkedIn Corporation
327+
All Rights Reserved.
328+
329+
Licensed under the BSD 2-Clause License (the "License").
330+
See [License](LICENSE) in the project root for license information.
331+
332+
## Contributing
307333

308334
If you would like to contribute to this project, please review the instructions [here](CONTRIBUTING.md).
309335
