Commit d1520b1f authored by Simon Klüttermann
<subsection title="Introduction" label="intro">
In this thesis, we implement a new kind of neural network: an autoencoder that is able to work on data represented by graphs, and use it for the task of anomaly detection, or more specifically for detecting new physics events in jet physics. While ours is not the first anomaly detection algorithm used to search for new physics, we extend the applicability of this idea and show that this algorithm can find anomalies representing nearly any kind of new physics object.
We also provide our graph autoencoder code in a way that should make other applications fairly easy (ENTER LINK).
Experimental version of this thesis, to test different orderings; the actual text passages will be written soon.
<subsection title="Compression Algorithm" label="encode0">
Compression is the first algorithmic problem we have to solve: find an algorithm that transforms a graph with #n# nodes into one with #LessThan(m,n)# nodes in a learnable way, without removing information<note as a trivial algorithm that just cuts away nodes would do>, while keeping permutation invariance<note which would not be the case when, for example, applying a dense network to the collection of variables>, while also being structurally invertible later on<note this would not be kept by most graph pooling operations. Consider for example DiffPool (ENTER REFERENCE): when you transform an arbitrary number of nodes into one, we would also have to implement a transformation that turns one node back into an arbitrary number of nodes, which is not something we can easily implement> and while being implementable in the branchless programming style of tensorflow<note consider the algorithm explained in <ref aultcode>, which is not implementable, at least not in a reasonable time>.
Our algorithm works as follows:
We sort the nodes by their last value. This last value is usually not given initially, but is a learnable output of the network. It is also usually only used for one compression<note we cannot test this, but this might be a good intuition to have>, as each compression stage adds more parameters to each node. After sorting, each set of #c# neighbouring nodes is compressed into one output node (using a simple dense layer<note it might be interesting to look at more complicated functions, but we usually saw worse networks when employing more advanced functions here>). This means that each compression step reduces an initial number #n# of nodes to #n/c# nodes, and that #c# has to be a factor of #n#. We also simply ignore the edges of the graph here, since we can relearn them in the next stage of the network. This does not mean that connected nodes are not compressed together: graph update steps make connected nodes more similar, so they are more likely to be compressed together.
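One compression stage can be sketched as follows (a minimal numpy sketch; the thesis implements this as learnable TensorFlow layers, and the function and variable names here are ours):

```python
import numpy as np

def compress(nodes, weight, bias, c):
    """One compression stage (sketch): sort the nodes by their last
    (learned) feature, then merge each group of c neighbouring nodes
    into one output node with a shared dense layer."""
    n, f = nodes.shape
    assert n % c == 0, "c has to be a factor of n"
    order = np.argsort(nodes[:, -1])               # sort by the comparison value
    grouped = nodes[order].reshape(n // c, c * f)  # concatenate each group of c nodes
    return grouped @ weight + bias                 # dense layer applied per group

# toy example: 4 nodes with 3 features, compressed by c=2 into 2 nodes
rng = np.random.default_rng(0)
nodes = rng.normal(size=(4, 3))
weight = rng.normal(size=(2 * 3, 5))               # each output node gets 5 features
bias = np.zeros(5)
out = compress(nodes, weight, bias, 2)
print(out.shape)                                   # -> (2, 5)
```

Because the nodes are sorted before grouping, the result is invariant under any permutation of the input nodes, as required.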
Finally, there is an appendix giving some physical intuition about this algorithm (Appendix <ref intuitiveecode>) and one suggesting a more complicated algorithm (Appendix <ref encoding>), mostly to show that it might not be a great idea.
<subsection title="Decompression Algorithm" label="decode0">
Finding a decoding algorithm is the true challenge in writing a graph autoencoder. Luckily, we wrote an encoding algorithm that can be easily inverted, and thus our decoding algorithm works as follows:
Define a learnable transformation (implemented as a simple dense network) that maps a single node onto #c# nodes, and apply it to each node. The graph connections could be relearned after this step, but it seems to be a good idea to use a more complicated function, and so we use the tensor product introduced in Chapter <ref tensorproduct> to combine the graph before the decompression stage with a graph of #c# nodes. This graph is learnable, but constant with respect to the nodes we train on. Before the first decompression stage we use a fully connected graph, since there is no graph yet.
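One decompression stage can be sketched in the same style (a numpy sketch with made-up names; the Kronecker product is one plausible reading of the tensor product from Chapter <ref tensorproduct>, used here only for illustration):

```python
import numpy as np

def decompress(nodes, weight, bias, c):
    """One decompression stage (sketch): a shared dense layer maps each
    of the m nodes onto c new nodes, inverting the compression step."""
    m, f = nodes.shape
    f_out = weight.shape[1] // c
    expanded = nodes @ weight + bias        # (m, c * f_out)
    return expanded.reshape(m * c, f_out)   # (m * c, f_out)

def combined_adjacency(prev_adj, small_adj):
    """Combine the pre-decompression graph with a learnable constant
    graph on c nodes (both given as adjacency matrices)."""
    return np.kron(prev_adj, small_adj)

rng = np.random.default_rng(1)
latent = rng.normal(size=(1, 9))            # e.g. 4 node network: 1 node, 9 features
weight = rng.normal(size=(9, 4 * 3))        # maps it onto 4 nodes of 3 features each
nodes = decompress(latent, weight, np.zeros(12), 4)
print(nodes.shape)                          # -> (4, 3)
```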
As this handling of graphs is not very powerful, we also provide a more complicated version, which is discussed in Appendix <ref decoding>. In contrast to the improved encoding algorithm, this one might actually be worth considering, and it is used fairly commonly in Appendix <ref secuse>.
<subsection title="Model Setup" label="setup">
After transforming our input 4-vectors as described in Chapter <ref data>, we sort them by their #lp_T# value to get our initial comparison value. This value is then subject to a BatchNormalization (ENTER REFERENCE) layer, which helps the network converge<note see Appendix <ref abatchnorm>>. After generating a graph between the nodes (see Appendix <ref atopkhow>) and applying 3 graph update stages, we apply a compression stage. This is where the two networks we set up here, one working on the 4 particles with the highest transverse momentum and one working on the first 9, show their first difference: the 4 node network simply compresses all 4 nodes into one, while the 9 node network is compressed by a factor of 3, followed by 3 graph update layers, and then compressed again by a factor of 3<note it generally seems to be a good idea to compress into only one node>. Each compression stage adds additional parameters, until the 4 node network has #9# variables on its only node, while the 9 node network has #20# parameters on its node. This stage is what we call the latent space, and the following layers are no longer part of the encoder but of the decoder. The decoder is built basically in reverse to the encoder: we start by decompressing the latent space once (or twice, with 3 update steps in between, for the 9 node network), then apply 3 more graph update steps, cut excess parameters and sort each node by its last value<note more about why this sorting is a good idea in Appendix <ref asort>>. Now we have an input and an output value to define the loss of our network in a way described in Chapter <ref losses>.
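The node and feature counts stated above can be traced through both architectures as a small bookkeeping sketch (sizes the text does not state explicitly are marked with None):

```python
# Node/feature counts through the two architectures described above.

def four_node_shapes():
    return [
        (4, None),  # input: 4 highest-p_T particles as preprocessed 4-vectors
        (4, None),  # 3 graph update stages keep the node count
        (1, 9),     # compression by c=4: latent space with 9 variables
        (4, None),  # decompression back to 4 nodes, 3 more update steps
    ]

def nine_node_shapes():
    return [
        (9, None),  # input: first 9 particles
        (3, None),  # compression by c=3, followed by 3 graph update layers
        (1, 20),    # second compression by c=3: latent space, 20 parameters
        (3, None),  # first decompression, 3 update steps
        (9, None),  # second decompression, cut excess parameters and sort
    ]

print(four_node_shapes()[2], nine_node_shapes()[2])  # the two latent spaces
```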
<subsection title="Evaluating the Autoencoder" label="evalae">
<subsubsection title="4 nodes" label="ae4">
For 4 nodes, the number of connections per node does not really matter, which is why we simply use a fully connected graph for these networks.
<i f="none" wmode="True">train 4 node nonorm</i>
As you can see, the training curve converges nicely to one value, but one thing to notice here: the validation loss does not behave worse than the training loss. Usually you would expect it to, as that is a sign of overfitting, but this behaviour is very common for the graph autoencoders in this thesis: it is basically impossible for these networks to overfit. In fact, we can reduce the number of training samples drastically without the network overfitting (see Appendix <ref asize>).
<i f="none" wmode="True"> simpledraw for 4nodes</i>
The reconstruction images are also nearly perfect, and thus we can say that this network is a good autoencoder.
<subsubsection title="9 nodes" label="ae9">
<subsection title="Evaluating the Classifier" label="evalclass">
DATA IS PLACEHOLDER DATA
<subsubsection title="4 nodes" label="class4">
<i f="none" wmode="True"> roc 4 nodes</i>
As you can see, these 4 particle networks already reach an AUC score of #0.8#, which is quite good.
<i f="none" wmode="True">auc map for 4 nodes</i>
Interestingly, this good AUC score is mostly a product of the angular parts, as using them alone already reaches an AUC value of #0.8#.
<subsubsection title="9 nodes" label="class9">
<i f="none" wmode="True"> roc 9 nodes</i>
This might be the point at which you understand why the next chapter is called "Problems": using more particles, and thus more information, you would expect the network to improve, but we get a worse AUC score of #0.79#.
<i f="none" wmode="True">auc map for 9 nodes</i>
Again you see that the part most important for the AUC score is contained in the angular information. On its own it would reach an AUC score of #0.80#, ignoring the rest.
<subsection title="Scaling the Network Size" label="probscale">
MAYBE STILL NOT AT THE RIGHT POSITION
Given working small models, you might think that creating a bigger model is fairly easy, but sadly it is not: training a bigger model might result in a worse autoencoder, but basically always results in a worse classifier.
<i f="trivscale" wmode="True">(mmt/trivscale)scaling plot for non norms</i>
Why is that? The main reason is C addition (see Chapter <ref caddition>): the network focuses on each part of the jet equally, but the particles with higher #p_T# carry more information, which gets watered down by adding less accurate information to it, resulting in a worse classifier.
<subsection title="C addition" label="caddition">
PROBABLY NOT AT THE CORRECT POSITION
Since in the loss each particle is generally considered with the same importance, you could ask yourself whether this is an optimal approach, and we will use this chapter to show that it is absolutely not.
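The difference between equal weighting and an importance-based weighting can be illustrated with a toy sketch (the weight choice and numbers here are made up; the weighting actually used in this thesis is the subject of this chapter):

```python
import numpy as np

def reconstruction_loss(x, y, weights=None):
    """Per-particle squared error, averaged over features. With
    weights=None every particle counts equally (the behaviour discussed
    above); a weight vector, e.g. proportional to p_T, would instead
    focus the loss on the better-measured hard particles."""
    per_particle = np.mean((x - y) ** 2, axis=-1)   # one value per particle
    if weights is None:
        weights = np.ones(per_particle.shape[-1])
    return float(np.sum(weights * per_particle) / np.sum(weights))

# example: the same reconstruction is punished less on the soft particle
x = np.array([[1.0, 0.0], [0.1, 0.0]])   # two particles: hard, then soft
y = np.array([[1.0, 0.0], [0.3, 0.0]])   # only the soft particle is reconstructed badly
equal = reconstruction_loss(x, y)                                  # equal weighting
pt_weighted = reconstruction_loss(x, y, weights=np.array([10.0, 1.0]))
print(equal, pt_weighted)
```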
<subsection title="Simplicity" label="simplicity">
NOT THE BEST POS WS
One thing you can do, by comparing values directly, is to look only at parts of the loss. This allows you to define a quality for each part of the input space. As a reminder, we show them here as AUC maps: each AUC value is one coloured box, which is deep blue for an AUC of #1#, completely red for an AUC of #0# and white for an AUC of #1/2#.
<i f="aucmap200" wmode="True">AUC map for a simple Network</i>
As you can see, the classification quality mostly resides in the angular part, and this is even a fairly positive example; there are images in which the nonangular parts are sometimes even red.
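Such an AUC map can be computed directly from the per-component losses; a small self-contained sketch (with made-up data, and the AUC computed via the rank-sum statistic rather than any particular library):

```python
import numpy as np

def auc(background, signal):
    """AUC from the rank-sum (Mann-Whitney) statistic."""
    scores = np.concatenate([background, signal])
    ranks = scores.argsort().argsort() + 1
    n_bg, n_sig = len(background), len(signal)
    return (ranks[n_bg:].sum() - n_sig * (n_sig + 1) / 2) / (n_bg * n_sig)

def auc_map(loss_bg, loss_sig):
    """One AUC per component of the input space, computed from the
    per-component losses; these are the coloured boxes described above."""
    return np.array([auc(loss_bg[:, j], loss_sig[:, j])
                     for j in range(loss_bg.shape[1])])

# made-up example: only the first component separates the two classes
rng = np.random.default_rng(2)
loss_bg = np.abs(rng.normal(size=(500, 3)))
loss_sig = loss_bg.copy()
loss_sig[:, 0] += 2.0
m = auc_map(loss_bg, loss_sig)
print(np.round(m, 2))   # component 0 near 1, the others near 0.5
```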
<subsection title="Problems in Invertibility" label="inv">
Given a classifier that, if trained on QCD, finds top jets different, you could (and should) ask yourself whether this is just a feature of your data, as shown in Chapter <ref simplicity>, or whether it is more general. The easiest way to test this is to switch the meaning of signal and background: can an autoencoder trained on top jets classify QCD jets as anomalies? To keep the usual plots easily readable, we keep QCD as background, which means an AUC score of 0 is optimal for these switched networks. This just does not work at all:
<i f="recinv" wmode="True">(from mmt)invertibility of a simple model</i>
This could actually have been expected: as shown in Chapter <ref simplicity>, each network that has a good AUC focuses mostly on the difference between the angles and 0, and since this does not depend on the attributes of the training data, changing the training data neither changes these differences nor affects their classification. In fact, by choosing a model that focuses even more on the angular part, you can create a model that is basically completely independent of the training data. This can go so far that a model trained on top jets can have a lower loss evaluated on QCD jets than on the top jets it was trained on.
This apparently unused potential led us to try them out on more classical evaluation tasks.
<i f="examples" wmode="True">(752/draw) On the top: the 5 least 7-like 7s in the training set. On the bottom: the 5 most 7-like non-7s in the evaluation sample</i>
<subsubsection title="Some physical interpretability for OneOff Networks" label="oometrik">
Another example of why OneOff networks might be quite useful comes from our experiments to understand them better <ignore>(see Chapter (ENTER CHAPTER))</ignore>. Instead of constructing arbitrary features by utilising deep networks, the algorithm used here only combines input features in a linear way. The data we work on is provided by CERN Open Data (ENTER REFERENCE) as two lepton events from the 2010 datasets: momentum 4-vectors of muons (ENTER REFERENCE) as background and of electrons as signal. These 4-vectors are multiplied with a linear metric, reducing them to one dimension, which is trained to minimize #(abs(p^mu * g_mu_nu * p^nu) - 1)^2#. This results in the network learning the following metric
<table caption="Learned metric values of a OneOff network trained on muon events" label="oneoffmuon" c="5" mode="classic">
<hline>
<tline " ~#E#~#p_1#~#p_2#~#p_3#">
<hline>
<tline #E#~-0.4997~0.0011~-0.0002~0.0002>
<tline #p_1#~0.0011~0.5069~0.0014~-0.0008>
<tline #p_2#~-0.0002~0.0014~0.4930~-0.0006>
<tline #p_3#~0.0002~-0.0008~-0.0006~0.4998>
<hline>
</table>
As you can see, the result is very similar to a Minkowski metric: the nondiagonal parts are zero within numerical uncertainty (and symmetric to 5 decimal places), the overall sign is random because of the absolute value in the loss function, and the absolute value of the diagonal parts sets the scale of the expected output of #1# in the loss. Other than this, this simple network is able to work out by itself that characterising a particle is best done through what we call its mass. That being said, the AUC score is not optimal, only reaching #0.5988#, even though we can improve this by assuming the metric to be strictly diagonal, which results in a learned metric of
<table caption="Learned metric values of a diagonal-metric OneOff network trained on muon events" label="oneoffmuondiag" c="4" mode="free">
<hline>
<tline "#E#~#p_1#~#p_2#~#p_3#">
<hline>
<tline 1.4198~-1.4130~-1.4151~-1.4197>
<hline>
</table>
As you can see, this still results in a Minkowski-like metric, this time with a flipped sign and a different scale, which is just a feature of the implementation. Most importantly, this simplified metric definition, containing less noise, results in a much higher AUC value of #0.8007#<note you could ask yourself why we use muons as background events: this is because the relative uncertainty of each electron mass value is much bigger, since the mass is more than two orders of magnitude smaller. Training a (diagonal only) network like this still results in a Minkowski-like metric (#-0.0058#,#0.0043#,#0.0043#,#0.0058#), but the AUC value is way worse, reaching only #0.5003#, as the expected mean value has much less physical meaning>
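The diagonal case above can be reproduced in a few lines: since #p^mu g_mu_mu p^mu# is linear in the diagonal entries, minimising the squared deviation from #1# is a linear least-squares problem. A sketch (synthetic muon-like four-vectors stand in for the CERN Open Data events, and the thesis trains the metric with a network and an absolute value in the loss instead of this closed form):

```python
import numpy as np

rng = np.random.default_rng(3)
m_mu = 0.10566                                  # muon mass in GeV
p3 = rng.normal(0.0, 5.0, size=(2000, 3))       # spatial momenta
E = np.sqrt(m_mu**2 + (p3**2).sum(axis=1))      # on-shell energies
P = np.column_stack([E, p3])                    # (N, 4) four-vectors

# p^T diag(g) p = sum_mu g_mu * p_mu^2, so fit g by least squares
A = P**2                                        # features p_mu^2 per event
g, *_ = np.linalg.lstsq(A, np.ones(len(A)), rcond=None)
print(np.round(g * m_mu**2, 3))                 # about [1, -1, -1, -1]
```

Up to the overall scale #1/m^2#, the fitted diagonal is the Minkowski metric: the only combination of #p_mu^2# that is constant across events is the invariant mass.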
<subsection title="Treating OneOff Networks as Observables" label="observable">
Using a very simple OneOff network on momentum 4-vectors (see Appendix <ref oometrik>) results in this network learning the mass of the particles.
This is interesting, since this is the same thing we would do to compare different particles by their 4-momenta, and we now have an algorithm to automatically extract this feature just from data. So it might be fair to assume that by applying this algorithm to more complicated data we still extract some feature, and that we can use this feature to find anomalies. But given a feature, you can do more: you can look at statistical information. If you only produce electrons in your detector, but you measure masses that are on average a bit higher than #500*keV#, you can conclude that something else is produced besides electrons. The benefit here is that you can combine multiple events to get lower uncertainties on the variable you care about, and thus detect irregularities in your dataset more easily. So given an automatic feature extractor, it might be interesting to see if you can differentiate between datasets using this feature.
We use here #1000000# jets, of which a fraction of #0.01# are not QCD but top jets<note this is about the most we can do given our dataset>. This is enough to reach a significance of #4.6# sigma on a single OneOff network<note no combination of multiple runs>. So OneOff features might be applicable to finding new physics. Most interestingly, this can probably be applied to any dataset, so you could define detector-level features to directly compare your data to the expectation, assuming your simulations are good enough.
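The counting argument behind such a significance can be sketched with a toy version (the feature distribution, the signal shift, and the resulting significance are all made up; the thesis value of #4.6# sigma comes from the actual OneOff feature):

```python
import numpy as np

rng = np.random.default_rng(4)
n, signal_fraction = 1_000_000, 0.01
bg_mean, bg_std = 0.0, 1.0                 # assumed known from simulation
feature = rng.normal(bg_mean, bg_std, size=n)
n_sig = int(n * signal_fraction)
feature[:n_sig] += 1.5                     # anomalous jets shift the feature

# significance of the observed mean against the background-only expectation
z = (feature.mean() - bg_mean) / (bg_std / np.sqrt(n))
print(f"{z:.1f} sigma")
```

The point is that the uncertainty on the mean shrinks with #1/sqrt(n)#, so even a small admixture of anomalous jets becomes visible once enough events are combined.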
<subsection title="Comparison" label="compare">
This chapter tries to draw conclusions from both our suggested solutions to the problems stated in Chapter <ref secinv>. In practice, there is not much reason not to use both, as this produces the best results, while also being the most stable. (NEED DATA)
<subsection title="Other datasets" label="otherdata">
<subsubsection title="Quark v Gluon" label="qg">
Quark and gluon data is generated by Madgraph (ENTER REFERENCE), Pythia (ENTER REFERENCE) and Delphes (ENTER REFERENCE). One set is generated as parton parton to gluon gluon collisions and another as parton parton to two partons without gluons; their jets are used if the transverse jet momentum is between 550 and 650 GeV. This data was originally used to see if a QCD trained classifier makes an easily accessible difference between quarks and gluons<note you could interpret this as another form of complexity: while top jets are all the result of top quarks, with QCD jets there are multiple options; we thought this could explain why QCD trained encoders are generally worse, but this is just not the case>, but even though this seems not to be the case, we can still use this dataset to test our algorithm a bit further. Again we use 4 particle networks, with a compression size of #9# and only negligible hyperparameter optimization, reaching the quality of
<i f="drquarkgluon" f2="dsquarkgluon" wmode="True">double roc curve for quark gluon</i>
As you can see, these are invertible networks, and even though they are not very good ones, as described in the previous Chapter <ref ldm>, this does not really matter, since optimization has the potential to improve them quite a lot. (ENTER REFERENCE https://arxiv.org/abs/1712.03634) could be seen as a reference paper for this process: even though they use a supervised approach and high level input data on different transverse momentum ranges, their achieved AUC values below #0.9# suggest that this tagging job is more complicated than the usual top tagging. Chapter <ref crossdata> will also support this hypothesis.
<subsubsection title="leptons" label="leptons">
This dataset is not very useful physically, and is more interesting from an anomaly detection standpoint: we again generate particle collisions using Madgraph, Pythia and Delphes, but instead of partons colliding into partons, we use colliding leptons producing partons. For the first set, we use any combination of electrons and muons with arbitrary charge, and for the second one we only use tau leptons. We also use a fairly big transverse momentum range of #20*GeV# to #5000*GeV# to see how this works when we vary another parameter.
<i f="drleptons" f2="dsleptons" wmode="True">lepton double roc curve</i>
Again you see clear invertibility, helping to support the suggested generality.
<subsubsection title="bosons" label="bosons">
Our final dataset consists of parton parton to parton parton boson events. This is interesting, since we expect a lot of the jets to look like QCD jets, and thus the interesting part to be suppressed by a lot of noise. We generate datasets with a transverse momentum between #550*GeV# and #650*GeV#, with the boson being either any W boson or, in the other dataset, a Z boson.
<i f="none" f2="none" wmode="True">boson double roc curve</i>
Again, this shows clear invertibility.