TopClassifier

Unsupervised Machine Learning to Classify LHC Jets from Top Quark Decays

PLEASE SCROLL TO Learning Points FOR POINTS GOING FORWARD

This script was produced in 2018 as part of a summer studentship supervised by Dr. Jonas Lindert and Prof. Frank Krauss at the IPPP.

It analyses data from SHERPA simulations of the LHC, in which jets originating from the decay of top quarks are tagged using the HEPTopTagger.

Data Format

The data format in image.dat and the script image_processing.py were developed together. The data themselves were produced by a custom analysis implemented in Rivet.

The data consist of newline-separated lists, one per event. Each item in a list is a triple of (pseudorapidity/eta, azimuthal angle/phi, transverse momentum/pT) for one constituent of the identified jet in that event. The final element of each list is a tag identifying whether the jet originated from a top quark decay in the simulation (1 or 0).
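As an illustration, one record could be read as sketched below. This assumes each line of image.dat is written in Python literal syntax, i.e. `[(eta, phi, pT), ..., tag]`; the exact on-disk syntax is not spelled out above, and `parse_jet_line` is a hypothetical helper rather than a function from image_processing.py.

```python
import ast


def parse_jet_line(line):
    """Split one record into (constituents, is_top).

    Assumption: the line is a Python-literal list whose last element is
    the 0/1 tag and whose other elements are (eta, phi, pT) triples.
    """
    record = ast.literal_eval(line.strip())
    *triples, tag = record
    constituents = [tuple(float(x) for x in triple) for triple in triples]
    return constituents, int(tag)


if __name__ == "__main__":
    example = "[(0.1, -0.2, 35.0), (0.3, 0.4, 12.5), 1]"
    print(parse_jet_line(example))
```

If the real file uses a different delimiter or bracket style, only the `ast.literal_eval` line would need to change.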

Preprocessing and Classification

The script uses logistic regression on images of the jet constituents in the phi-eta plane to identify whether the jets originated from a top decay. Tests with multiple runs found an average of 90.12% successful identifications, with a false-top identification rate of 9.47%. The preprocessing of these images was performed as follows:

The hardest (largest pT) constituent is placed at the centre (0,0).
The image is rotated such that the second hardest constituent has phi = 0.
The image is flipped horizontally if the third hardest constituent has phi < 0.
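The three steps above can be sketched as follows. This is not the code from image_processing.py: `preprocess` is a hypothetical function, and it assumes that phi differences are wrapped into (-pi, pi], that the rotation treats the (eta, phi) plane as Euclidean, and that the flip negates phi so the third hardest constituent ends up with phi >= 0.

```python
import math


def preprocess(constituents):
    """Centre, rotate and flip one jet given as (eta, phi, pT) triples."""
    # Sort by descending pT so index 0 is the hardest constituent.
    ordered = sorted(constituents, key=lambda c: c[2], reverse=True)
    eta0, phi0, _ = ordered[0]

    # Step 1: translate so the hardest constituent sits at (0, 0).
    # Assumption: the phi difference is wrapped into (-pi, pi].
    shifted = []
    for eta, phi, pt in ordered:
        dphi = math.atan2(math.sin(phi - phi0), math.cos(phi - phi0))
        shifted.append((eta - eta0, dphi, pt))

    # Step 2: rotate about the origin so the second hardest has phi = 0.
    if len(shifted) > 1:
        angle = math.atan2(shifted[1][1], shifted[1][0])
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        shifted = [
            (eta * cos_a + phi * sin_a, -eta * sin_a + phi * cos_a, pt)
            for eta, phi, pt in shifted
        ]

    # Step 3: mirror phi if the third hardest constituent has phi < 0.
    if len(shifted) > 2 and shifted[2][1] < 0:
        shifted = [(eta, -phi, pt) for eta, phi, pt in shifted]

    return shifted


if __name__ == "__main__":
    jet = [(0.5, 1.0, 100.0), (0.8, 1.4, 50.0), (0.7, 1.0, 20.0)]
    for eta, phi, pt in preprocess(jet):
        print(f"eta={eta:+.3f} phi={phi:+.3f} pT={pt}")
```

The output is still a list of constituents; binning it into pixels is a separate step.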

Averaged Images

top_jets.pdf and non_top_jets.pdf show the average of the normalised jet images for jets that do and do not originate from a top quark decay, respectively. These are not the images as generated by the classification, though modifying the code to produce them is relatively simple.

One can see that the energy in top jets is distributed visibly differently from that in non-top jets, where the energy is more concentrated at the centre. The preprocessing removes the degrees of freedom associated with the differing locations of the jets at production, and so ensures that the classifier works on the profiles of the jets.
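A rough sketch of how such an averaged image could be built from preprocessed constituents is given below. The grid size, the image half-width, the mapping of phi to rows and eta to columns, and normalising each image to unit total pT are all assumptions made for illustration; `pixelate` and `average_images` are hypothetical names, not functions from the script.

```python
def pixelate(constituents, n_pixels=5, half_width=1.0):
    """Bin (eta, phi, pT) triples into an n_pixels x n_pixels image.

    Assumptions: phi indexes rows, eta indexes columns, and each image
    is normalised so that its pixel values sum to 1.
    """
    image = [[0.0] * n_pixels for _ in range(n_pixels)]
    pixel_size = 2.0 * half_width / n_pixels
    for eta, phi, pt in constituents:
        col = int((eta + half_width) // pixel_size)
        row = int((phi + half_width) // pixel_size)
        # Constituents outside the image window are dropped.
        if 0 <= row < n_pixels and 0 <= col < n_pixels:
            image[row][col] += pt
    total = sum(sum(row) for row in image)
    if total > 0:
        image = [[value / total for value in row] for row in image]
    return image


def average_images(images):
    """Pixel-wise mean of a non-empty list of equally sized square images."""
    n_images = len(images)
    size = len(images[0])
    return [
        [sum(img[r][c] for img in images) / n_images for c in range(size)]
        for r in range(size)
    ]


if __name__ == "__main__":
    jets = [[(0.0, 0.0, 30.0), (0.5, 0.0, 10.0)], [(0.0, 0.0, 10.0)]]
    for row in average_images([pixelate(jet) for jet in jets]):
        print(" ".join(f"{value:.3f}" for value in row))
```

Averaging top-tagged and untagged jets separately would give one such grid per class.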

Learning Points

I uploaded this project as it was originally created, bar a few (mostly stylistic) modifications to ensure compatibility with newer Python versions. I was a less experienced programmer and data scientist at the time, and thought it would be interesting to upload the project unchanged and highlight the things I would now do differently.

One of the most inconvenient aspects is the specific implementation of the shuffling of the test/train dataset. Each time the script is run, a different split is used, and that split cannot be traced or reproduced. To counter this I would shuffle the dataset with a pseudorandom number generator whose seed can either be fixed, to reproduce a given split, or varied to generate multiple runs over which the results are averaged.
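A minimal sketch of such a reproducible split, using only the standard library, might look like this. `split_train_test`, its default test fraction and its default seed are illustrative choices, not part of the original script.

```python
import random


def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of records with a seeded generator, then split it.

    The same seed always yields the same split; different seeds give
    independent splits whose scores can be averaged.
    """
    rng = random.Random(seed)  # local generator, no global state touched
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    data = list(range(10))
    for seed in (1, 2, 3):
        train, test = split_train_test(data, seed=seed)
        print(seed, test)
```

Recording the seed alongside each reported accuracy makes any individual run traceable.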

In producing the data, I would perhaps keep a similar format for the variables (being well suited to the purposes of the script) but store them in a dynamic database (such as an SQL database) which can be read by the script. The data would then no longer be stored in a human-readable format; however, they would be much quicker to produce, require less storage space and be quicker to read. For sufficiently high statistics, this script will take too long to process the data and produce the necessary output. Dynamic data storage would overcome these hurdles.
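One possible shape for this, sketched with Python's built-in sqlite3 module, is shown below. SQLite is only one option among many, and the two-table schema and the names `store_jets` and `load_jets` are hypothetical choices made for this example.

```python
import sqlite3


def store_jets(conn, jets):
    """Write (constituents, is_top) pairs into two related tables."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jets ("
        "jet_id INTEGER PRIMARY KEY, is_top INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS constituents ("
        "jet_id INTEGER NOT NULL REFERENCES jets(jet_id), "
        "eta REAL, phi REAL, pt REAL)"
    )
    for constituents, is_top in jets:
        cursor = conn.execute("INSERT INTO jets (is_top) VALUES (?)", (is_top,))
        jet_id = cursor.lastrowid
        conn.executemany(
            "INSERT INTO constituents VALUES (?, ?, ?, ?)",
            [(jet_id, eta, phi, pt) for eta, phi, pt in constituents],
        )
    conn.commit()


def load_jets(conn):
    """Read every jet back as a (constituents, is_top) pair."""
    jets = []
    jet_rows = conn.execute(
        "SELECT jet_id, is_top FROM jets ORDER BY jet_id"
    ).fetchall()
    for jet_id, is_top in jet_rows:
        constituents = conn.execute(
            "SELECT eta, phi, pt FROM constituents "
            "WHERE jet_id = ? ORDER BY rowid",
            (jet_id,),
        ).fetchall()
        jets.append((constituents, is_top))
    return jets


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    store_jets(connection, [([(0.1, -0.2, 35.0), (0.3, 0.4, 12.5)], 1)])
    print(load_jets(connection))
```

Replacing `":memory:"` with a file path would persist the data between runs.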

The machine learning model implemented is very rudimentary and there has been no attempt to test different ML architectures. If repeating this project I would classify the data with several different neural networks, of varying depths and parameters, so that the most suitable architecture can be chosen after comparing their performance.
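Whatever the candidate models are, comparing them needs a common scoring step on the same held-out data. The sketch below computes the two figures quoted earlier for each candidate. It assumes that the false-top rate means the fraction of non-top jets labelled as top, and that each model is exposed as a plain function returning 0 or 1; `evaluate` and `compare_models` are hypothetical names.

```python
def evaluate(predictions, labels):
    """Return (accuracy, false_top_rate) for 0/1 predictions and labels.

    Assumption: false_top_rate is the fraction of non-top jets (label 0)
    that were predicted as top (prediction 1).
    """
    pairs = list(zip(predictions, labels))
    correct = sum(1 for p, y in pairs if p == y)
    non_top = sum(1 for _, y in pairs if y == 0)
    false_top = sum(1 for p, y in pairs if p == 1 and y == 0)
    accuracy = correct / len(pairs)
    false_top_rate = false_top / non_top if non_top else 0.0
    return accuracy, false_top_rate


def compare_models(models, images, labels):
    """Score each named predict function on the same test set."""
    return {
        name: evaluate([predict(image) for image in images], labels)
        for name, predict in models.items()
    }


if __name__ == "__main__":
    # Toy stand-ins for trained classifiers; each "image" is one number.
    candidates = {
        "always_top": lambda image: 1,
        "threshold": lambda image: 1 if image > 0.5 else 0,
    }
    scores = compare_models(candidates, [0.9, 0.2, 0.7, 0.1], [1, 0, 0, 0])
    for name, (accuracy, false_top_rate) in scores.items():
        print(name, accuracy, false_top_rate)
```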

There is much more that could be discussed here (e.g. avoiding global variables). Nevertheless, this was an exciting project that enhanced my enthusiasm for complex data analysis. It was hugely satisfying to see the results of the simulations, and of the deliberation over specific data structures, in a visual format, and it paved the way to the path I now follow as a PhD candidate.
