Commit 506334a1 authored by Simon Klüttermann
added linkability, wrote some tables in 3.4

parent de8b7121
<section title="Introduction and Literature" label="secintro">
<subsection title="Introduction" label="intro">
This chapter will be written as the last one.
<ignore>
<subsubsection TESTING GROUND>
<table c="4" cap="An examplary testing table" label="test1" mode="full">
<hline>
<tline topic 1~#1+x#~#sin(x)#~12>
<hline>
<tline a~b~c~d>
<tline e~f~g~h>
<hline>
</table>
<reft test1>
</ignore>
Current State of this Thesis
<subsection title="New Physics" label="physics">
This chapter is not yet written, since I want to do it justice.
<subsection title="Autoencoder" label="ae">
Autoencoders are a kind of neural network in which the function to be learned is fixed to the identity. To make the network learn more than a trivial function, a compressed state is introduced. This compressed state has a lower dimension than the input space; it is generated from the input by a learnable function called the encoder, and serves as input to another learnable function called the decoder. Together, both try to reconstruct the input. This reconstruction is usually not perfect, since the reduced dimension does not allow every possible value to be encoded exactly, but it can be quite good as long as the data contains some patterns. Consider the following 2-dimensional data:
As you can see, completely encoding the data would still require 2 dimensions, but you can approximate it quite well in 1 dimension by using one value as the encoding and reconstructing the second one, in the decoder, as a linear function of it<note Since the number of training samples is finite, you could map every sample onto an index and map those indices back onto the inputs, reaching zero loss for any input with a compression size of 1. But not only is such a function quite hard for a neural network to find, it would also not be useful at all, since the network would fail completely on any new data. This is why these kinds of functions are part of what is called overfitting for autoencoders>.
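To make this concrete, a minimal sketch in Python with Keras is given below; the toy data, layer sizes and training settings are chosen only for illustration and are not the setup used in this thesis. It compresses 2-dimensional inputs into a 1-dimensional compressed state and trains on reconstructing the input:
import numpy as np
from tensorflow import keras

x = np.random.uniform(0, 1, size=(10000, 1))
data = np.concatenate([x, 2 * x + 0.1 * np.random.normal(size=x.shape)], axis=1)  # roughly 1-dimensional 2d data

encoder = keras.Sequential([keras.layers.Dense(8, activation="relu", input_shape=(2,)),
                            keras.layers.Dense(1)])               # 2 dimensions -> 1 compressed value
decoder = keras.Sequential([keras.layers.Dense(8, activation="relu", input_shape=(1,)),
                            keras.layers.Dense(2)])               # 1 compressed value -> 2 dimensions
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")                 # reconstruction (L2) loss
autoencoder.fit(data, data, epochs=10, batch_size=64, verbose=0)  # the target is the input itself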
This combination of a compressor and a decompressor can be useful in multiple ways. Ignoring the obvious task of compressing data (see (ENTER REFERENCE) for an example of an autoencoder outperforming a classical compressor), you can feed the decompressor noise to generate new versions of an already known kind of data (see (ENTER REFERENCE)). Even though GANs are nowadays used for this task, autoencoders still have some benefits, allowing more control over the generated data. This works because (in good autoencoders<note This works better with a special way of training an autoencoder, called a variational autoencoder.>) similarity in the compressed space represents similarity of the inputs. So not only can you reproduce similar inputs by changing the compressed version of an input slightly, but by identifying features in the input space you can change just one attribute of an input, or merge the features of two inputs into one; see (ENTER REFERENCE) and (ENTER REFERENCE) for these applications.
<i f="none" wmode="True">sample combinations of two image Inputs (something like cats and dogs)</i> <i f="none" wmode="True">sample combinations of two image Inputs (something like cats and dogs)</i>
That being said, the application this work focuses on is the detection of anomalies, as introduced on the informatics side by (ENTER REFERENCE) and a bit later for particle physics by (ENTER REFERENCE). Since a well trained autoencoder should only be able to reconstruct well the features it is trained on, you can use its reconstruction loss<note The difference between input and output of the autoencoder, measured in a way discussed in Chapter (ENTER CHAPTER)> to find events that are not of the same type as the data the autoencoder was trained on.
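Continuing the sketch above (all names are again only illustrative, and the cut at the 99th percentile is an arbitrary example, not a choice made in this thesis), the reconstruction loss can be turned into a per-event anomaly score:
reconstruction = autoencoder.predict(test_data)             # test_data: events of the same shape as the training data
score = np.mean((test_data - reconstruction) ** 2, axis=1)  # one reconstruction loss per event
is_anomalous = score > np.quantile(score, 0.99)             # e.g. flag the worst reconstructed 1% as anomalous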
<subsection title="Anomaly Detection" label="anoma">
Anomaly detection, as the task of finding abnormal events, has a plenitude of use cases: from improving the purity of a dataset (ENTER REFERENCE) to fraud and fault detection (see (ENTER REFERENCE) and (ENTER REFERENCE)).
To achieve this, there is a multitude of algorithms, which will be introduced and applied in Chapter (ENTER CHAPTER), with applications for example in motor failure detection and nuclear safety: problems that can be of unknown form and in which there are few examples of interest that have to be found with high accuracy, not too different from the task of finding new physics in an abundance of noise.
<subsection title="Graphs" label="graphs">
A graph is a mathematical and computational concept that allows you to store a more general form of data than just that encoded in vectors. Namely, graphs allow storing relational information about an unbounded<note For computational reasons, the graphs in the following chapters are not completely unbounded, but have a maximum size> number of objects. This is done by defining two kinds of objects: nodes, which are the objects of interest and can be described mathematically by vectors<note In theory the objects would not need to be representable as vectors, but for practical applications this is quite useful>, and edges, which are pairs of connected node indices and thus encode the relation between those nodes.
There is a plenitude of extensions of this simple graph. For example, there are directed graphs, in which the edges gain a direction, so that a connection from node #i# to #j# does not automatically imply a connection from #j# to #i#. There are also weighted graphs, in which each edge gains an additional value that encodes how strong the connection between two nodes is.
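As a simple illustration (not a description of the implementation used later), such a graph could be stored in Python as a node feature matrix plus an edge list; the numbers are arbitrary:
import numpy as np

node_features = np.array([[0.5, 1.2],      # node 0
                          [0.1, 0.4],      # node 1
                          [0.9, 0.3]])     # node 2: one feature vector per node
edges = [(0, 1), (1, 2)]                   # undirected graph: each connected pair stored once
directed_edges = [(0, 1), (2, 1)]          # directed graph: (i, j) does not imply (j, i)
edge_weights = {(0, 1): 0.7, (1, 2): 0.2}  # weighted graph: connection strength per edge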
<subsection title="Graph Autoencoder" label="gae">
The main problem of ParticleNet for finding new physics is its supervised approach. This means that each new physics model can only be detected if you train a dedicated network for it. Not only would this require a lot of networks, with a correspondingly high number of false positives, it also limits their effectiveness, as you can only find new physics that has already been suggested. This is why you could ask yourself whether you could not combine the graph approach of ParticleNet with the autoencoder idea of QCDorWhat.
This is the main idea this thesis tries to implement, and this task is definitely not trivial: creating something like a graph autoencoder has some problems, namely the fact that a compression is usually not local<note Graph pooling operations are quite common, since the output of a graph network usually has a different format than its input. This is usually done by applying a function (mean or max, for example) to each node. ParticleNet, for example, uses a GlobalAveragePooling (ENTER REFERENCE?), so it calculates the average over the nodes for each feature. This kind of pooling works quite well, but is sadly not really applicable to autoencoders, since those functions are not really invertible>. That does not mean there are no approaches, just that most authors shy away from any approach that changes the graph size<note see (ENTER REFERENCE) or (ENTER REFERENCE)>. The first paper you find by just searching for a graph autoencoder is one by Kipf et al. (ENTER REFERENCE), plus a lot of papers referencing it. The main problem here is that Kipf et al. use one fixed adjacency matrix, and thus one identical graph setup, for every input. This allows neither the learnable meaning of similarity that apparently makes ParticleNet so good, nor the variable input size discussed in (ENTER LINK), nor, probably worst, any structural difference between different jets. Other approaches come from the problem of graph pooling operations, meaning the definition of some kind of layer that takes a graph as input and returns a smaller graph in a learnable manner<note This is not an entirely solved problem, but it would be quite useful, since it allows for hierarchical learning, similar to the use of pooling layers in convolutional networks>. DiffPool (ENTER REFERENCE) and MinCutPool (ENTER REFERENCE) might be good examples of this, but Graph U-Nets (ENTER REFERENCE) stand out, since they also provide an implementation of an anti-pooling layer and thus allow for a graph autoencoder in the way we require it here, which is why the first approach we tried is based on theirs. See Chapter (ENTER CHAPTER) for this.<ignore> and even though it does not work very well for us, it still creates the basis for every other approach.</ignore>
<section title="Definitions" label="secdef">
<subsection title="Binary Classification" label="binclass">
The task of finding the difference between background and signal data has been studied a lot as a boolean decision problem. Notable use cases include (ENTER REFERENCE) and (ENTER REFERENCE). In general, they consider 4 fractions: the fractions of events that are of type background or signal and that are classified either as background or as signal.<ignore>, and try to minimize the fraction of events that are classified wrongly.</ignore>
<i f="defttetc" wmode="True">(redoo yourself) Definitions of the 4 Fractions used to evaluate binary classification</i> <i f="defttetc" wmode="True">(redoo yourself) Definitions of the 4 Fractions used to evaluate binary classification</i>
<subsubsection title="ROC curve" label="classroc">
For most decision problems, these fractions are functions of some parameter. Consider the following output of a classifier:
<i f="add3" wmode="True">(mmt/add3, prob not perfect)some recqual with decision parameter implemented</i> <i f="add3" wmode="True">(mmt/add3, prob not perfect)some recqual with decision parameter implemented</i>
Here this parameter is the point at which to cut the distribution, such that everything above it is classified as signal, while everything below is classified as background. Since the choice of this cut is quite arbitrary, we evaluate every possible parameter and plot two fractions against each other.
These two fractions are plotted against each other, either in a way showing the
or, to focus on the error rate
<i f="none" wmode="True">The other way of plotting a roc curve</i> <i f="none" wmode="True">The other way of plotting a roc curve</i>
<subsubsection Area Under the Curve> <subsubsection title="Area Under the Curve" label="classauc">
To simplify comparing ROC curves, you can use the AUC score to summarise them. The AUC score is defined as the integral of the true positive rate over the false positive rate. Obviously this is not perfect, since you reduce a function to only one number, but it is fairly widely accepted and easy to interpret: a perfect classifier results in an AUC score of 1, a classifier that just guesses results in an AUC score of 0.5, and a perfect anticlassifier results in 0. This reduction to a single number can also make the AUC score less error-prone than other values. On the other hand, not every part of the ROC curve is equally important for a given problem (if you want to test whether somebody is ill, you might prefer more false positives over more false negatives). This can result in networks improving the AUC score by just changing unimportant parts of the ROC curve.
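As a sketch of how these quantities can be computed in practice, for example with scikit-learn (the arrays labels and score are placeholders for the true classes and, for example, the per-event reconstruction losses):
from sklearn.metrics import roc_curve, roc_auc_score

# labels: 1 for signal, 0 for background; score: higher values mean "more signal-like"
fpr, tpr, thresholds = roc_curve(labels, score)   # one (false positive rate, true positive rate) point per possible cut
auc = roc_auc_score(labels, score)                # 1: perfect, 0.5: guessing, 0: perfect anticlassifier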
<subsubsection title="other Measures" label="classother">
This problem of focus is addressed by the e30 measure. This measure is the inverse of the false positive rate at a true positive rate of 0.3. This means that the e30 score only cares about the accuracy at an acceptable true positive rate, but it also means that this score is a bit more random, which is why it is not used very often in the following.
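A sketch of this measure, reusing the fpr and tpr arrays from the roc_curve call in the previous sketch; it is only meaningful if the false positive rate at that point is nonzero:
import numpy as np

idx = np.argmax(tpr >= 0.3)   # first cut reaching a true positive rate of 0.3
e30 = 1.0 / fpr[idx]          # inverse false positive rate at that cut (assumes fpr[idx] > 0)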
<ignore>Another Score that is not used is the F1 score (ENTER DELETE?)</ignore>
<subsection title="Problems in Evaluating a Model" label="eval prob">
Even though we can evaluate a binary classification problem<note see Chapter (ENTER CHAPTER)>, this does not mean that evaluating an autoencoder is easy. This is a problem because we basically want to do 2 things at the same time, create an autoencoder and create a classifier, and there are situations in which the autoencoder is good but the classifier is bad, and situations in which the classifier might be good but the autoencoder is useless.
<subsubsection title="AUC scores" label="evalauc">
Your first idea for how to evaluate an autoencoder might be to simply use the quality of the classifier (the AUC score, see Chapter (ENTER CHAPTER)), since the classifier works through the autoencoder understanding the data, and thus should only be good if the autoencoder is also good. In most cases this works; there is a clear relation between the quality of the autoencoder and the quality of the classifier (see Chapter (ENTER CHAPTER multi model Plot AUC loss)), but in general it is simply not true, as Chapter (ENTER CHAPTER AUC through zero) shows. And even if you are working in a region where this relation holds, classifier evaluation methods<note AUC scores even have one of the lower uncertainties> usually have a much higher uncertainty than other methods, which is why, in the regions with a strong correlation, it was more useful to use the loss of the network and to assume that the AUC score correlates with it.
<subsubsection title="Losses" label="evalloss">
So why not always use the loss: look at the quality of the autoencoder and try to optimize only that. This again has problems: not only does it still require a strong relation between AUC and loss (which is not necessarily given; consider the problem of finding the best compression size: the loss will usually fall with increasing compression size, but at some point the autoencoder can just reconstruct everything perfectly and thus has no more classification potential), but the loss also relies heavily on the definition of the network and the normalization of the input data<note see Chapter (ENTER CHAPTER)>, which makes comparing different networks only possible if you alter neither the loss nor the normalization.
<subsubsection title="Images" label="evalimg">
This cross-comparison problem can easily be solved by simply looking at the images behind the losses<note the jet image showing input and output of the autoencoder, see (ENTER CHAPTER) for an example>. While this is certainly very useful, as it also allows you to understand more about your network (for example, there are networks that simply ignore some parameters and thus have their whole loss in those parameters (see Chapter (ENTER CHAPTER)); this is most easily seen by looking at the images), it still relies on the relation between AUC and loss and, more importantly, is less quantitative: given 2 images, finding out which autoencoder is better is not always an easy task, especially since the problems you might see in those images do not necessarily correspond to what the network fails at. Most notably, you tend to focus on angular differences and mostly neglect differences in #lp_T#.
<i f="none" wmode="True">some image, with nothing learned in pt</i> <i f="none" wmode="True">some image, with nothing learned in pt</i>
Sure, you can also look at the #p_T# reproduction, but this requires weighting their importance and does not make evaluating images any easier.
<subsubsection title="Oneoff width" label="evaloow">
<ignore>The probably best solution to this problem is sadly also the least applicable, and requires that you have read Chapter (ENTER CHAPTER) and maybe (ENTER CHAPTER) before understanding it</ignore> The final solution, and the one that currently seems to be the best, is based on the concepts introduced in Chapters (ENTER CHAPTER) and (ENTER CHAPTER). Because of this, it will be explained in Chapter (ENTER CHAPTER). It seems to help with all the given problems: there is a strong relation between this width and the AUC score<note or we at least did not yet find any exception to this relation; it might well be that this relation is also just wrong>, and it is independent of the loss function and the normalization, and thus easily comparable, while also being exactly what the network cares about.
<subsection title="Data Preparation" label="data">
Most of the time (always except in Chapter (ENTER CHAPTER)), the jet data provided by (ENTER REFERENCE) is used, mostly because this makes it easier to compare to their results. These jets have a transverse momentum between #550*Gev# and #650*Gev# and a maximum radius of #LessThan(R,0.8)#.
These jets take the form of lists containing 200 momentum 4-vectors sorted by their transverse momentum, so taking only the first #n# vectors for each jet gives you a list of the #n# particles carrying the most transverse momentum, and thus probably the most important ones. If a jet is defined by fewer than 200 final particles, the remaining 4-vectors are set to 0.
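A minimal sketch of this preprocessing step (the file name is only a placeholder):
import numpy as np

def leading_particles(jets, n):
    """jets: array of shape (n_jets, 200, 4), already sorted by transverse momentum and zero padded."""
    return jets[:, :n, :]   # keep only the n highest-p_T four-vectors of each jet

# example: jets = np.load("jets.npy"); reduced = leading_particles(jets, 20)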
<subsection title="Explaining Graphics" label="graphics">
<subsubsection title="Output Images" label="imgout">
<i f="none" wmode="True">a sample output input image</i> <i f="none" wmode="True">a sample output input image</i>
In these images, you see each particle's #phi# and #eta#, including any normalisation, plotted once for the input jet in ??? and once for the output jet in ???. For a perfect network, both jets would overlap. Zero particles are not shown, and the transverse momentum is indicated by the size of the dots, which is proportional to #1+9/(1+lp_T)#<note an inverse function, since higher #p_T# results in lower #lp_T#, and some constants to keep the radius finite>. Sadly this does not let you see differences in #lp_T# very well, so you can also look at #lp_T# alone. We show it here as a function of the index.
<i f="none" wmode="True">sample lpt image</i> <i f="none" wmode="True">sample lpt image</i>
This way of looking at the network performance is quite useful for finding patterns in the data. There seem to be networks that show a high correlation between the angles (see (ENTER CHAPTER)), and it is quite common for the reproduced values to have less spread than the input ones (see (ENTER CHAPTER)). A problem here is that you can only look at a few images, and finding one good reproduction for each network is not that hard. To combat this, we always use the same event<note the same event for each training set, to be precise>.
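For illustration, such an input/output image could be drawn roughly as follows; the colors, the overall marker scale and the variable names are only assumptions, but the dot size follows the #1+9/(1+lp_T)# rule quoted above:
import matplotlib.pyplot as plt

def draw_jet(eta, phi, lpt, color, label):
    nonzero = lpt != 0                        # zero (padded) particles are not shown
    size = 1 + 9 / (1 + lpt[nonzero])         # higher p_T (lower lp_T) gives a larger dot
    plt.scatter(eta[nonzero], phi[nonzero], s=20 * size, c=color, alpha=0.6, label=label)

# draw_jet(eta_in, phi_in, lpt_in, "blue", "input"); draw_jet(eta_out, phi_out, lpt_out, "red", "output")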
<subsubsection title="AUC Feature Maps" label="imgmaps">
<i f="none" wmode="True">a sample Featuremap</i>
Since an L2 loss is just a mean over many L2 losses, one for each feature and particle, you can use these original losses to also get an AUC score for each feature and particle. These AUC scores are shown in feature maps, displaying the quality of each combination of feature (on the horizontal axis) and particle (on the vertical axis) in the form of pixels. A perfect classifier (#Eq(AUC,1)#) results in a dark blue pixel, a perfect anticlassifier (#Eq(AUC,0)#) is represented by a dark red pixel, and a useless classifier that guesses or always returns the same class (#Eq(AUC,1/2)#) appears as a white pixel. In short: the more colorful a pixel is, the better, and an autoencoder trained on QCD events should be blue, while an autoencoder trained on top events has to be red.
The nice thing about these maps is that they can show the focus of the network as well as its problems. Since a perfect reconstruction has no decision power, the same as a terrible reconstruction, a network that has focus problems<note reconstructs some things much better than others, making both parts worse> can be clearly seen in these maps (see (ENTER CHAPTER)). It is also fairly common to get one feature and particle that has more decision power than the whole combined network (see (ENTER CHAPTER) for an example and (ENTER CHAPTER) for the explanation). Finally, an AUC map that is completely blue or red is quite uncommon; more probably some features are red and some are blue, and you get an indication of which features are useful for the current task (see (ENTER CHAPTER)).
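A sketch of how such a feature map can be computed, using the per-element squared errors instead of their mean (scikit-learn again provides the AUC scores; the array shapes are stated in the docstring):
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_map(inputs, outputs, labels):
    """inputs, outputs: (n_events, n_particles, n_features); labels: 1 for signal, 0 for background."""
    per_element_loss = (inputs - outputs) ** 2               # the losses the L2 loss averages over
    n_particles, n_features = inputs.shape[1], inputs.shape[2]
    aucs = np.zeros((n_particles, n_features))
    for p in range(n_particles):
        for f in range(n_features):
            aucs[p, f] = roc_auc_score(labels, per_element_loss[:, p, f])
    return aucs   # plotted as pixels: 1 dark blue, 0.5 white, 0 dark red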
<subsubsection title="Network Setups" label="imgsetups">
<i f="none" wmode="True">a sample Network Image, see below or probably just 900</i>
We show the setup of each network as images similar to this one<note These images are not meant to enable you to rewrite the whole network yourself, but to understand some of the parts that are switched. For a more technical model description, look at the images keras generates; see (ENTER CHAPTER) for this, or in the code at createmodel.py or at encoder.png and decoder.png>. The information travels from the left side to the right side, through a number of layers, which each have their own symbol. Some symbols are explained at the point where they are used, but here are those shown in Image (ENTER IMAGE), from left to right:
<list>
<subsection title="Graph Neural Networks" label="gnn">
Graph neural networks are defined by a graph update layer. This layer takes all feature vectors of some graph, as well as their corresponding graph connections, and returns updated feature vectors. To do this, the layer is built from two different interactions: the update step of each node itself, which is called the self interaction term here, and the update step of a node with respect to its neighbouring nodes in the graph, which is called the neighbour interaction term.
There are multiple ways of implementing such a layer; a notable one is used by ParticleNet (ENTER REFERENCE): their graph connectivity is implemented by storing, for each given vector, all neighbouring vectors in a set of vectors, which means they can implement the update procedure as a function of the original and the neighbour vectors<note This function is actually a bit complicated, involving not only convolutions but also normalisations between them, and they end by concatenating the updated vector to the original one, which is not very useful when you want to reduce the size of your graph>. This is not exactly what we do here, mostly because implementing the graph as just a corresponding set of neighbour vectors demands, for computational reasons, that each node is connected to the same number of other nodes, and also requires relearning the graph after each step, which might not be the best idea, as explained in Chapter (ENTER CHAPTER). It would also make this less of a graph autoencoder and more of an autoencoder with some graph update layers in front of it (which might also not be a good idea, see Chapter (ENTER CHAPTER)), since there is no way to reduce the number of nodes in such an implementation without completely ignoring the graph structure.
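To illustrate just the two interaction terms (a plain numpy sketch with given weight matrices, not the ParticleNet implementation and not the exact layer used later), one graph update step could look like this:
import numpy as np

def graph_update(features, adjacency, W_self, W_neigh):
    """features: (n_nodes, n_in); adjacency: (n_nodes, n_nodes) 0/1 matrix; W_self, W_neigh: (n_in, n_out)."""
    degree = adjacency.sum(axis=1, keepdims=True).clip(min=1)  # avoid dividing by zero for isolated nodes
    neighbour_mean = (adjacency @ features) / degree           # neighbour interaction: average over connected nodes
    updated = features @ W_self + neighbour_mean @ W_neigh     # self interaction term + neighbour interaction term
    return np.maximum(updated, 0)                              # simple ReLU nonlinearity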
<section title="Making Graph Autoencoder work" label="secgae">
<subsection title="Failed Approaches" label="failed">
In this chapter we will quickly go over some ideas you could have for how to implement a graph autoencoder, and finish with the first implementation that could be considered working.
These implementations are usually defined by an encoding and a decoding algorithm, so basically something to go from a big graph to a small graph and something to reverse this again. In addition to this, the graph update and the graph construction stay mostly the same as explained in Chapter (ENTER CHAPTERLINK).
<subsubsection title="trivial models" label="failedtrivial">
Let us start with probably the most simple autoencoder algorithm: to turn an #n# node graph into an #m# node graph, we just cut away the last nodes until only #m# nodes are left<note Please note the importance of the #p_T# ordering here: cutting the last particles means cutting the particles with the lowest #p_T#, and thus probably the least important particles> to reduce the graph size, and then add zero valued particles to it again. One difficulty lies in the fact that those added particles have no graph connections anymore; we solved this by just keeping the original graph connections stored. Sadly, those networks still just do not work: even when we set the compression size above the input size, the reproduced jets hardly bear any resemblance to the input jets. This is the first example of the central problem of graph autoencoding: permutation invariance. Consider the following encoder: two numbers #a# and #b# where #Eq(a,b+1)#. This would be trivial to compress into one number for a normal<note dense> autoencoder (maybe just take #a#), but here we have to respect permutation symmetry, so basically we do not know which is the first and which is the second particle, and how do we decompress now? In this context you could keep one of the parameters and try to encode whether the other one is bigger or smaller than it; maybe you also know that #LessThan(0,a)# and you could multiply it by #-1# if it is the smaller one, but this is less than trivial, and with an increasing number of parameters it gets even more complicated.
This is a problem that mostly appears as the inability of even a "good" autoencoder to work with a compression size that is equal to the input size, building an identity (see Chapter (ENTER CHAPTER)). Next to the loss from the compression, there still seems to be a certain loss from the graph structure, at least partially coming from permutation invariance.
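For completeness, the trivial compression and decompression described above amount to something like the following sketch (plain numpy, with the graph connections handled separately as described):
import numpy as np

def trivial_encode(nodes, m):
    return nodes[:m]                                   # keep only the m leading (highest-p_T) nodes

def trivial_decode(compressed, n):
    pad = np.zeros((n - len(compressed), compressed.shape[1]))
    return np.concatenate([compressed, pad], axis=0)   # re-add zero valued particles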
That being said, permutation invariance can also be a benefit, especially for permutation invariant input data; more on this in Chapter (ENTER CHAPTERLINK).
<subsubsection title="minimal models" label="failedminimal">
To improve this model, we started working with smaller graph sizes (mostly the first 4 particles), making the structure less complicated and allowing for more experimentation thanks to the lower time cost. Notable improvements include replacing the added zeros by a learnable function of the remaining parameters, relearning the graph on the new parameter space, and adding some dense layers after the graph interactions, but the most important improvement was achieved by making the compression and decompression local along some learned axis. Instead of just removing parameters in an arbitrary way or by physical intuition, we demand that particles which are similar in some way are compressed together: this is achieved by creating a function that compresses a set of particles into one particle, while allowing the network to learn what similarity means<note In the compression step, we define a new feature for each node, by which we sort the set of nodes; afterwards we build sets of n particles from this ordering and compress them using a linear function (it might be interesting to look at nonlinear functions, but we generally see worse results by adding a nonlinearity). Please note that since we use a feature to sort the elements, and the graph update step contains neighbour terms that generally increase similarity, connected particles are more likely to be compressed together.>.
These networks still have problems, as we will discuss in the following, but they generally produce respectable decision qualities and show ("sometimes", see Chapter (ENTER CHAPTERLINK)) similarities between input and output image. These networks are discussed in the next subchapter.
<subsection title="An explicit look at the first working graph autoencoder" label="firstworking">
<i f="none" wmode="True">modeldraw of b1/00 (achte auf dense und das batchnorm vor compare ist)</i> <i f="none" wmode="True">modeldraw of b1/00 (achte auf dense und das batchnorm vor compare ist)</i>
As you can see, this autoencoder takes QCD jets and transforms them as introduced in (ENTER CHAPTER), does some additional preprocessing, after which a graph is constructed and a graph update step is run; then a graph compression algorithm is applied, just to be reversed afterwards to reconstruct a new graph, which is again used to update the feature vectors, after which (and after a sorter) the current graph is compared to the input. There are some layers that have not been explicitly explained before, so this is what we want to do in this chapter.
<subsubsection title="topK" label="quicktopK">
The probably most commonly used algorithm to construct a set of graph connections from a list of vectors, topK, seems quite easy to understand: you connect each vector to the #K# vectors that are most similar to it. The difficulty lies in the word similar: here, two vectors are the more similar, the smaller their L2 difference is. In an attempt to make this more powerful, we also use a learnable metric in this L2 difference. Even though this might not be strictly necessary, since the network can change its parameters to accommodate its sense of similarity, it still allows the network to better choose what to focus on in each topK layer. This can be quite useful for an autoencoder, since for example ignoring a parameter could otherwise only be done by decreasing its size relative to the other parameters, which might not be optimal when you want an accurate reproduction. It also allows you to create a graph before having any learnable layers. On the other hand, this metric can complicate the calculation of the adjacency matrix, which we solved by demanding that the metric is entirely diagonal, and the parameters of the metric can increase the occurrence of divergences in training, since even a small change of those parameters can affect the network output in huge ways. That being said, having a humanly understandable metric can lead to interesting insights (see Appendix (ENTER APPENDIX)).
You could ask yourself whether a topK algorithm is the best choice, since the number of possible adjacency matrices it can produce is quite low; see (ENTER APPENDIX) for this.
Finally, it should be noted that the topK layer can increase the size of the feature vectors, which is useful for the compression algorithm, even though this is not used in this specific example.
<subsubsection title="Compression" label="quickcompression">
This layer, too, is defined by a learnable sense of locality, but instead of defining a metric, we simply use the last parameter of the input feature vectors<note This is essentially equivalent, since it is preceded by a graph update layer, which redefines each feature with its self interaction matrix, but it has one definite benefit: the neighbour interaction matrix tends to average graph-connected feature vectors, so two nodes that are connected in the adjacency matrix are more similar> to sort the feature vectors and to split the list of feature vectors into sublists that are compressed together<note In this example, the 4 input feature vectors are compressed by sorting into sets of 4, so you could ask yourself whether sorting by a learnable sense of locality actually does anything for minimal models, but it at least helps to keep the permutation invariance>, each of which is then transformed into a new feature vector using a simple dense layer.
<i f="none" wmode="True">compression pictograms</i> <i f="none" wmode="True">compression pictograms</i>
After this graph compression, which reduces #4# times #3+flag# parameters into one vector with #10# parameters, we use 3 dense layers to reduce the parameter count down to #6#.
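As a rough sketch of this compression step (with a random matrix standing in for the learnable dense layer, and a tanh activation and grouping by the last feature assumed), this could look as follows:
<code lang="python">
import numpy as np

def compress_nodes(x, group_size=4, out_dim=10, seed=0):
    """Sort the nodes by their last feature, split them into groups and map each
    group through one dense layer (here a fixed random matrix stands in for the
    learnable weights)."""
    rng = np.random.default_rng(seed)
    n_nodes, n_feat = x.shape
    w = rng.normal(scale=0.1, size=(group_size * n_feat, out_dim))
    b = np.zeros(out_dim)
    order = np.argsort(x[:, -1])                          # learnable sense of locality: last feature
    groups = x[order].reshape(n_nodes // group_size, group_size * n_feat)
    return np.tanh(groups @ w + b)                        # one compressed vector per group

x = np.random.rand(4, 4)                                  # 4 nodes, 3+flag features each
z = compress_nodes(x)                                     # shape (1, 10); 3 dense layers then reduce 10 -> 6
</code>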
<subsubsection title="Decompression" label="quickdecompression">
After inverting the dense layers, we again have a compressed vector with 10 features.
As each list of vectors was compressed into only one feature vector, the inverse procedure transforms each vector back into a list of feature vectors. Again we use a simple dense layer.
<i f="none" wmode="True">decompression pictograms</i> <i f="none" wmode="True">decompression pictograms</i>
<subsubsection title="Sorting" label="quicksort">
Since the compression stage reorders the feature vectors, while the decompression algorithm does not reorder them, comparing lists of vectors is not that easy. It should be clear that a vector that is perfectly reconstructed, but shuffled in a random way, can have a nearly arbitrarily big loss. This is solved by the initial and final sorting. These layers simply order the feature vectors by their #lp_T# value, so that at least perfectly reconstructed feature vectors are compared the right way. It should be noted that this sorting is not strictly necessary, as the network will learn to work even when you just compare vectors in the order the nodes are given, but sorting makes this task easier, allowing the network to focus on more important things and thus generally working better<note The central problem might be that the sorting kind of breaks the graph permutation symmetry>. See Chapter (ENTER CHAPTER) for an experimental comparison on a more advanced graph autoencoder.
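The sorting layer itself is nothing more than an argsort; in this minimal sketch it is assumed that the #lp_T# value is stored in the first feature and that the nodes are ordered descendingly, which might differ from the actual convention:
<code lang="python">
import numpy as np

def sort_by_pt(x, pt_index=0):
    """Order the feature vectors by their (log) transverse momentum, so that input
    and reconstruction are compared node by node in a fixed order."""
    return x[np.argsort(-x[:, pt_index])]

x, x_hat = np.random.rand(4, 4), np.random.rand(4, 4)
loss = np.mean((sort_by_pt(x) - sort_by_pt(x_hat)) ** 2)  # compare only after both are sorted
</code>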
<subsubsection title="Training Setup" label="quicktrain">
Another thing that has to be clarified concerning this model is the training procedure. We use the Adam optimizer with a learning rate of #0.001# and a batch size of #200#, and train the network, using an EarlyStopping callback, until it no longer improves its validation loss for #10# epochs, and afterwards use the epoch with the minimal validation loss. We use #600000# training and #200000# validation jets and plot here the loss for each epoch.
<i f="none" wmode="True">training for b1/00</i> <i f="none" wmode="True">training for b1/00</i>
As you see, there is not really any progress made in the training<note except for maybe the first epoch, which is not shown in these kinds of plots>, but you already see one fact that will be quite common in the following: the validation loss is not (much) bigger than the training loss, neither at the end nor anywhere else. This is fairly uncommon, as usually EarlyStopping is used to combat overfitting and validation losses tend to increase at some point, but it is also easily explained, since encoder and decoder only amount to a total of 840 trainable parameters, which is not enough to store information about #O(100000)# events. Interestingly, this seems to be a clear benefit of graph autoencoders, as even bigger networks with similar amounts of parameters, trained on less data, do not seem to show any tendency to overfit. This allows us to reduce the training size by at least 2 orders of magnitude without any quality loss (see Chapter (ENTER CHAPTER)), and you could even ask yourself whether it would be possible to remove the whole need of splitting your data into training and validation data. That being said, this data separation is maintained for the rest of the thesis, and this overfitting safety comes at a price: the validation loss might not increase in relation to the training loss, but that does not mean that both cannot increase in parallel. This, and the fact that graph training curves fluctuate much more than usual training curves, keeps EarlyStopping a viable training callback, and it is the reason most trainings stop.
<subsubsection title="Results" label="quickres1">
So why do we consider this the first working model?
This might be the first model that can show some resemblance between the input and the output.
<i f="none" wmode="True"> b1/00/simpledraw ml, auf cherrypick hinweise</i> <i f="none" wmode="True"> b1/00/simpledraw ml, auf cherrypick hinweise</i>
...
<subsection title="Improving Autoencoder" label="secondworking">
Given this good AUC score, it looks like the only thing we now need to do is to increase the size, and we probably have an autoencoder that can rival the best other autoencoders. But before we try, and fail<note see Chapter (ENTER CHAPTER)>, at this, let us improve our autoencoder first. As you might agree, the training curve does not look very impressive, and the reconstruction is also not very good.
<i f="none" wmode="True"> network plot for c1/200</i> <i f="none" wmode="True"> network plot for c1/200</i>
<subsubsection title="Training Setup" label="quick2setup">
Here we still train on the first #4# nodes and with the same batch size of #200#, but with a lower learning rate of #0.0003# and with a higher patience, stopping only after the network does not improve for 30 epochs. We also increase the compression size from 6 to 7.
<subsubsection title="Results" label="quick2results">
<i f="none" wmode="True"> history for c1/200</i> <i f="none" wmode="True"> history for c1/200</i>
...
<subsection title="Making Autoencoder good" label="thirdworking">
<subsubsection title="better encoding" label="encoding">
The current encoding basically ignores any graph information completely. After each compression stage, the whole graph has to be relearned, and connections only indirectly<note through the preceding graph update steps> affect the corresponding feature vectors. Why not use the graph a bit more? Here we suggest that using a function of the original graph as the compressed graph might be a good idea: when compressing #n# vectors, you can see the adjacency matrix as a matrix of submatrices, and the only task you need to solve is how to extract some form of the global matrix. This is done here by applying a function to each submatrix. We try out setting this function to be the mean, the maximum or the minimum of the original connections, and compare them, with or without rounding each entry to one or zero, to the usual graph compression. With the rounding, you can interpret these options as setting a connection to exist when more original connections exist than don't, when at least one connection exists, or when all connections exist.
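A minimal numpy sketch of this kind of graph compression (the group size of 4 and the rounding threshold of 0.5 are assumptions made for the illustration) could look as follows:
<code lang="python">
import numpy as np

def compress_adjacency(A, group_size=4, mode="mean", rounded=False):
    """View A as a matrix of (group_size x group_size) submatrices and reduce every
    submatrix with a fixed function to get the adjacency matrix of the compressed graph."""
    n = A.shape[0] // group_size
    blocks = A.reshape(n, group_size, n, group_size).swapaxes(1, 2)   # (n, n, g, g)
    reduce = {"mean": np.mean, "max": np.max, "min": np.min}[mode]
    A_small = reduce(blocks, axis=(2, 3))
    if rounded:
        A_small = (A_small >= 0.5).astype(float)   # mean: more connections than not, max: at least one, min: all
    return A_small

A = (np.random.rand(16, 16) > 0.5).astype(float)            # toy adjacency matrix of 16 nodes
A_small = compress_adjacency(A, mode="max", rounded=True)    # compressed graph of 4 nodes
</code>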
<table caption="Quality differences for different encoders with a learnable handling of the feature vectors" label="encode1" c="6">
<hline>
<tline " ~mean loss~loss std~n~auc~oneoff auc">
<hline>
<tline Rounded Min~0.366~0.037~5~0.533~0.623>
<tline Rounded Max~0.388~0.042~6~0.563~0.484>
<tline Rounded Mean~0.342~0.036~5~0.560~0.556>
<hline>
<tline Min~0.371~0.025~6~0.559~0.562>
<tline Max~0.372~0.025~6~0.565~0.541>
<tline Mean~0.351~0.04~5~0.540~0.486>
<hline>
</table>
We also test how well a learnable parameter transformation, as used before, works compared to also applying a fixed function (mean, max, min) to each feature vector.
(ENTER RESULTS)
<table caption="Quality differences for different encoders with a fixed function" label="encode2" c="7" mode="classic2">
<hline>
<tline " ~function~mean loss~loss std~n~auc~oneoff auc">
<hline>
<tline Rounded Min~mean~0.366~0.016~4~0.656~0.472>
<tline Rounded Max~mean~0.333~-1~1~0.656~0.549>
<tline Rounded Mean~mean~0.372~0.016~4~0.656~0.542>
<hline>
<tline Min~mean~0.370~0.007~4~0.656~0.503>
<tline Max~mean~0.362~0.002~3~0.654~0.550>
<tline Mean~mean~0.363~0.010~4~0.655~0.505>
<hline>
<tline Rounded Mean~min~0.296~0.022~5~0.579~0.543>
<tline Rounded Mean~max~-1~-1~-1~-1~-1>
<hline>
</table>
<subsubsection title="better decoding" label="decoding">
The decoder, too, does not use the graph structure completely. So we try to replace the abstraction with a constant learnable graph by an abstraction with a graph that is not constant. The problem here is that the tensor product introduced in (ENTER CHAPTER) does not work for a product of one graph with multiple graphs. The main difficulty lies in finding out how to work with the non-diagonal terms: consider again adjacency matrices of adjacency matrices: when each feature vector becomes a vector of feature vectors, also each entry in the adjacency matrix becomes a new matrix. These matrices, multiplied with the original entry, would result in a tensor product if the new matrices were always the same, but this is exactly what we want to change. Finding the diagonal matrices can be left to a learnable function of the feature vector, but for the off-diagonal matrices we have two suggestions. The first, the graphlike decompressor, defines those matrices as functions of the two corresponding diagonal matrices; here we compare a product, a sum and their rounded versions ('and' and 'or') not only to the abstraction with a constant graph, but also to the second suggestion, the paramlike decompressor: instead of the diagonal matrices being functions of a feature vector, every submatrix is a learnable function of its two corresponding original feature vectors.
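The following sketch shows how such a decompressed adjacency matrix could be assembled from diagonal and off-diagonal blocks; the concrete functions used here (an outer product for the diagonal blocks, an elementwise product for the graphlike and an outer product for the paramlike off-diagonal blocks) are only placeholders for the learnable functions of the actual network.
<code lang="python">
import numpy as np

def build_subgraph_adjacency(x, diag_fn, off_fn, mode="graphlike"):
    """Assemble the decompressed adjacency matrix block by block.

    x       : (n, f) compressed feature vectors
    diag_fn : one feature vector -> (g, g) diagonal block
    off_fn  : graphlike: combines the two diagonal blocks,
              paramlike: combines the two original feature vectors
    """
    diag = [diag_fn(v) for v in x]
    n, g = len(diag), diag[0].shape[0]
    A = np.zeros((n * g, n * g))
    for i in range(n):
        for j in range(n):
            if i == j:
                block = diag[i]
            elif mode == "graphlike":
                block = off_fn(diag[i], diag[j])
            else:                                          # "paramlike"
                block = off_fn(x[i], x[j])
            A[i * g:(i + 1) * g, j * g:(j + 1) * g] = block
    return A

x = np.random.rand(6, 7)                                   # 6 compressed nodes with 7 features
diag_fn = lambda v: np.outer(v[:4], v[:4])                 # placeholder for the learnable diagonal block
A_graph = build_subgraph_adjacency(x, diag_fn, lambda a, b: a * b, mode="graphlike")
A_param = build_subgraph_adjacency(x, diag_fn, lambda a, b: np.outer(a[:4], b[:4]), mode="paramlike")
</code>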
(ENTER RESULTS)
<table caption="Quality differences for different graph like decoder" label="decode1" c=6>
<hline>
<tline " ~mean loss~loss std~n~auc~oneoff auc">
<hline>
<tline Product~0.265~0.019~4~0.568~0.557>
<tline Sum~0.305~0.026~5~0.566~0.514>
<tline Or~0.404~0.248~18~0.562~0.502>
<tline And~0.354~0.074~22~0.587~-1>
<hline>
</table>
We can also look at the way the original graph is combined with the newly generated graph. Instead of using a product, we can also use a sum, or again round the result to an 'or'<note In practice we expect the product to be virtually identical to the 'or', since the inputs are either 1 or 0> or an 'and'. Since the combination with a constant graph is not very interesting, we use paramlike decompression for the practical results:
(ENTER RESULTS)
<table caption="Quality differences for different param like decoder" label="decode2" c=6>
<hline>
<tline " ~mean loss~loss std~n~auc~oneoff auc">
<hline>
<tline Product~0.280~0.037~4~0.553~0.522>
<tline Sum~0.300~0.024~4~0.549~0.526>
<tline Or~-1~-1~10~-1~-1>
<tline And~0.348~0.038~16~0.631~0.546>
<hline>
</table>
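For clarity, the four combination modes compared in <reft decode2> could be sketched as follows; the inputs are assumed to be (approximately) binary, and the rounding thresholds are assumptions of this sketch.
<code lang="python">
import numpy as np

def combine_graphs(A_original, A_generated, mode="product"):
    """Combine the (expanded) original adjacency matrix with the newly generated one."""
    if mode == "product":
        return A_original * A_generated
    if mode == "sum":
        return A_original + A_generated                    # whether this gets normalised is left open
    if mode == "or":                                       # rounded: at least one of the two connections
        return np.clip(np.round(A_original + A_generated), 0.0, 1.0)
    if mode == "and":                                      # rounded: both connections have to exist
        return np.round(A_original * A_generated)
    raise ValueError(mode)
</code>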
Finally, since we now have a little adjacency matrix and a list of feature vectors for each original feature vector, we can apply a graph update step on those subgraphs, to hopefully enhance the decompression by mixing the decompression of the adjacency matrix and the feature vectors. This was already enabled in all previous compression and decompression tests, but is tested here again on paramlike decompression:
(ENTER RESULTS)
<table caption="Quality difference for either running learnable sub graph updates or not" label="decode3" c=6>
<hline>
<tline " ~mean loss~loss std~n~auc~oneoff auc">
<hline>
<tline yes~0.280~0.037~4~0.553~0.522>
<tline no~0.277~0.0335~3~0.571~0.528>
<hline>
</table>
<subsection title="Building Identities out of Graphs" label="identities">
An optimal autoencoder should be equivalent to the network with the compression size set to the input size. The problem here is that this trivial model does not necessarily reproduce its input perfectly. As described in Chapter (ENTER CHAPTER), the graph update step is given by
##f(x_i*s_j+x_i*A_k**i*n_j**k)##
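Reading this formula as #out = f(X S + A X N)#, with #S# the self interaction matrix, #N# the neighbour interaction matrix and #A# the adjacency matrix (this reading is an assumption made for the sketch), the update step and the trivial identity case can be written as:
<code lang="python">
import numpy as np

def graph_update(X, A, S, N, f=np.tanh):
    """One graph update step: out = f(X S + A X N)."""
    return f(X @ S + A @ X @ N)

# with a linear f, S set to the identity and N set to zero the step reproduces
# its input exactly; the interesting question is how close a trained, nonlinear
# network can come to this identity
X = np.random.rand(4, 4)
A = (np.random.rand(4, 4) > 0.5).astype(float)
out = graph_update(X, A, S=np.eye(4), N=np.zeros((4, 4)), f=lambda z: z)
assert np.allclose(out, X)
</code>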
...
<subsection title="Scaling" label="scaling">
NOT AT THE RIGHT POSITION
Given working small models, you might think that creating a bigger model is fairly easy.
<i f="trivscale" wmode="True">(mmt/trivscale)scaling plot for non norms</i> <i f="trivscale" wmode="True">(mmt/trivscale)scaling plot for non norms</i>
Why is that? C addition (see Chapter (ENTER CHAPTER)): the network focusses on each part of the jet equally, but the particles with higher #p_T# carry more information, which gets watered down by adding less accurate information to it.
<subsubsection title="Combination for Scaling" label="scalcomb">
The easiest way to understand this is by splitting up the network into small networks. Here we use #n# #4#-particle networks instead of one #4*n#-particle network. Given just the sum of those networks, scaling is still a problem.
<i f="none" wmode="True">no norm split scaling, but no c add</i> <i f="none" wmode="True">no norm split scaling, but no c add</i>
but we can now use C addition to fix the focus of the network: we can approximate that the mean of the signal is proportional to the width of the background distributions, and thus divide each partial network by its loss to the power #-3#, which results in a network that actually improves when adding more particles
<i f="splitscale" wmode="True">(mmt/splitscale, not perfect since shape and triv compare already) no norm split scaling with c add</i> <i f="splitscale" wmode="True">(mmt/splitscale, not perfect since shape and triv compare already) no norm split scaling with c add</i>
<subsubsection title="Redefining losses" label="scalloss">
One thing to remember here is that these networks are still just a combination of trivial models. So the obvious question is how to apply this to bigger networks. The first idea might be to redefine the loss in such a way that the loss of each particle gets multiplied by the factors used in the split networks. The problem here is that this changes the focus of the autoencoder again: since it can then make more errors in the later particles, it will, making the later particles less useful for classification, but also making the first particles more useless, since the focus now lies on them, making their reconstruction better, up to a point at which their reconstruction is too good to find any difference between background and signal.
Another idea might be to apply this loss weighting only in the evaluation phase and not in training. This definitely helps, but in my tries it does not seem to be enough, since the same effect as before now works in the opposite direction: particles with lower #p_T# have a higher inaccuracy, which translates to a higher loss for them and to the autoencoder focussing more on them, making both the lower and the higher particles less useful.
So it seems that this is an optimization problem: there still seems to be an optimal loss weighting that gives an optimal contribution to each part, but finding this combination is not trivial. We tried multiple functions, including a loss that uses the index of the particle and a loss that is weighted directly by the transverse momentum, but we have not found anything that really works, and weighted networks always seem to result in worse looking reconstructions.
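To illustrate what such a weighted loss could look like, here is a small sketch of a per-particle weighting by the transverse momentum or by the particle index; the exact functional forms are not taken from the text and are only placeholders.
<code lang="python">
import numpy as np

def weighted_reconstruction_loss(x, x_hat, mode="pt", pt_index=0):
    """Per-jet reconstruction loss with one weight per particle."""
    per_particle = ((x - x_hat) ** 2).mean(axis=-1)         # plain per-particle squared error
    if mode == "pt":                                         # weight directly by the transverse momentum
        weights = x[:, pt_index]
    elif mode == "index":                                    # weight by the particle index
        weights = 1.0 / (1.0 + np.arange(x.shape[0]))
    else:
        weights = np.ones(x.shape[0])
    return float((weights * per_particle).sum() / weights.sum())

x, x_hat = np.random.rand(16, 4), np.random.rand(16, 4)
loss_pt = weighted_reconstruction_loss(x, x_hat, mode="pt")
</code>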
...