Commit c3e20bd7 authored by Simon Klüttermann
q
parent 1adc6801
Showing with 114 additions and 46 deletions
@misc{jetdm, title={Jetting into the dark side: a precision search for dark matter}, url={https://atlas.cern/updates/physics-briefing/precision-search-dark-matter}, journal={ATLAS Experiment at CERN}, year={2020}, month={Aug}}
@misc{ostagram, url={https://www.ostagram.me/static_pages/lenta?last_days=1000&locale=en}, journal={Ostagram}}
@misc{kaliningrad, title={Main}, url={https://visit-kaliningrad.ru/en/travel-tools/print-take/}, journal={Информационный центр туризма}}
@misc{kaliningradmaps, url={https://www.google.de/maps/@54.7134816,20.5119317,4270m/data=!3m1!1e3}, journal={Google maps}}
@misc{setelectron, url={http://opendata.cern.ch/record/304}, doi={10.7483/OPENDATA.CMS.PCSW.AHVG}, author = {Thomas McCauley}, title = {Events with two electrons from 2010}}
@misc{setmuon, url={http://opendata.cern.ch/record/303}, doi={10.7483/OPENDATA.CMS.4M97.3SQ9}, author = {Thomas McCauley}, title = {Events with two muons from 2010}}
@misc{nopool,
title={Image Classification with Hierarchical Multigraph Networks},
author={Boris Knyazev and Xiao Lin and Mohamed R. Amer and Graham W. Taylor},
year={2019},
eprint={1907.09000},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@InProceedings{deepoc,
title = {Deep One-Class Classification},
author = {Ruff, Lukas and Vandermeulen, Robert and Goernitz, Nico and Deecke, Lucas and Siddiqui, Shoaib Ahmed and Binder, Alexander and M{\"u}ller, Emmanuel and Kloft, Marius},
pages = {4393--4402},
year = {2018},
editor = {Jennifer Dy and Andreas Krause},
volume = {80},
series = {Proceedings of Machine Learning Research},
address = {Stockholmsmässan, Stockholm Sweden},
month = {10--15 Jul},
publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v80/ruff18a/ruff18a.pdf},
url = {http://proceedings.mlr.press/v80/ruff18a.html}}
@article{high1,
author = "Gianotti, F.",
title = "{Physics at the LHC}",
......@@ -115,7 +167,6 @@
commit = {c923bda1fd75067898faa2ffc099bd1b8e2bef88}
}
@INPROCEEDINGS{aeforanomaly,
author={B. B. {Thompson} and R. J. {Marks} and J. J. {Choi} and M. A. {El-Sharkawi} and {Ming-Yuh Huang} and C. {Bunje}},
......@@ -434,23 +485,11 @@ doi = {10.1109/ICDM.2008.17}
year = 2010
}
@misc{diagramms, title={List of Feynman diagrams}, url={https://www.physik.uzh.ch/~che/FeynDiag/Listing.php}}
@misc{feyngen, author={Aivazis, Alec}, title={Draw Feynman Diagram Online}, url={https://feynman.aivazis.com/}}
@article{topquark,
title = {The discovery of the top quark},
......
......@@ -6,7 +6,7 @@ A graph<cite graphbasic> is a mathematical<ignore>/informatical</ignore> concept
<i f="kaliningrad_again" wmode="True" label="kaliningrad">A representation of the city regions as a graph: You can understand a map as a graph, where each region becomes a node and bridges between them represent edges. Here using a map from <cite kaliningradmaps></i>
Mathematically, these nodes and relations are defined by a list of feature vectors<note technically this is equivalent to a matrix, but a list of vectors is more intuitive> #X# that stores the features of each node, and an adjacency matrix #A#, whose components #A_i**j# are #1# if the nodes #i# and #j# are connected, and #0# if not. This graph is usually invariant under permutation of the node indices. You achieve this by permuting<note with permuting we here mean switching two indices, or more generally multiplying with a permutation matrix> the adjacency matrix in the same way the feature vectors are permuted<note this is the reason why we don't call the list of feature vectors a matrix: while a matrix permutation requires permutation matrices on each side (#p*A*p#), the feature vector "matrix" only requires one permutation matrix (#X*p#)>, and by requiring any action on the graph to be permutation invariant. This action is usually also local, and thus only acts on each node and the mean<note you also need to require each local action to be symmetric under changing the input ordering, since otherwise the output can depend on the order of the nodes. The usual way this is achieved is by applying a function like the mean to all neighbours> of the nodes that are connected to the current node<note this works since permuting indices does not change which nodes are connected to each other>. This has the benefit of making graphs ideal for modeling interactions between high numbers of objects, as the functions don't change as you add more nodes to the graph. In informatics, this is useful for example for social networks<cite graphsoc>: data that consists of a huge number of nodes in which mostly only connected nodes (friends) affect each other is a perfect application for graphs, since otherwise you would need to update your model every time a new user joins. 
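The permutation behaviour described above can be checked in a few lines. The following is a hypothetical sketch (not the thesis implementation), assuming a simple mean-over-neighbours update: relabeling the nodes permutes #X# once and #A# on both sides, and a permutation-symmetric readout is unchanged.

```python
import numpy as np

def local_update(X, A):
    # each node receives the mean of its neighbours' features
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1  # avoid division by zero for isolated nodes
    return X + (A @ X) / deg

def readout(X, A):
    # permutation-invariant summary of the whole graph
    return local_update(X, A).mean(axis=0)

rng = np.random.default_rng(0)
n = 5
X = rng.normal(size=(n, 3))
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)           # undirected graph
np.fill_diagonal(A, 0)

P = np.eye(n)[rng.permutation(n)]  # random permutation matrix
X_p, A_p = P @ X, P @ A @ P.T      # relabel the nodes consistently

# the readout does not notice the relabeling
assert np.allclose(readout(X, A), readout(X_p, A_p))
```

Note how #X# is only multiplied by one permutation matrix while #A# needs two, exactly as in the note above.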
In physics, this is reminiscent of nuclear science and the approximation of pair interaction potentials<cite pairpot>, and so there are applications using this kind of molecule encoding for chemical feature extraction (for a simple example look at appendix <ref mol>) <cite gnnforchemistry> and medicine <cite gnnformedicine>.
Next to those relational applications, there are also applications that do not utilize an existing relation, but use the locality of the graph structure to encode the similarity of given data. This is done by letting the sense of similarity between nodes be a learnable function. For example, by using a topK algorithm (each node is connected to its nearest #K# neighbours, see <ref atopkhow>), you can implement a learnable version of whatever distance means. This allows networks like particleNet <cite particeNet>, which uses a special kind of neural network that is able to work on graphs, to separate top and QCD jets<note with QCD we mean jets that are generated by gluons or by quarks other than top quarks> in a supervised way. They use the graph structure to define and redefine multiple times which detected particles (nodes) should be considered close to each other. This results in particleNet being a quite good classifier (see <cite toptagref>).
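The topK construction can be sketched in a few lines; this is a hedged illustration with fixed coordinates (in particleNet the coordinates are themselves network outputs, so the distance is relearned at every layer), and `knn_adjacency` is an illustrative name, not taken from any particleNet code:

```python
import numpy as np

def knn_adjacency(coords, K):
    # connect every node to its K nearest neighbours in coordinate space
    n = coords.shape[0]
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)         # a node is not its own neighbour
    A = np.zeros((n, n))
    idx = np.argsort(d2, axis=1)[:, :K]  # indices of the K closest nodes
    rows = np.repeat(np.arange(n), K)
    A[rows, idx.ravel()] = 1.0
    return A

# two well separated pairs of points
coords = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
A = knn_adjacency(coords, K=1)
# each point links to its obvious partner
assert A[0, 1] == 1 and A[1, 0] == 1 and A[2, 3] == 1 and A[3, 2] == 1
```

Making `coords` a learned function of the node features is what turns this fixed neighbourhood into a learnable notion of similarity.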
......
......@@ -4,6 +4,6 @@
ParticleNet might be well suited for classifying jets, but when you want to use this for finding new physics, its supervised approach is still problematic. Supervised training means that each new physics model can only be detected if you train a special network just for it. Not only would this need a lot of networks, with a correspondingly high number of false positives, but it also limits their effectiveness, as you can only find new physics that has already been thought of before.
But maybe you could use the graph structure that makes particleNet so great and combine it with the unsupervised approach of QCDorWhat.
This is the main idea that is implemented in this thesis.
What you require is an autoencoder that can utilize graphs, which is not a trivial task: Creating something like a graph autoencoder has some problems, namely the fact that a compression is usually not local<note graph pooling operations are quite common, since the output of a graph network usually has a different format than its input. The way this is usually done is by applying a function (mean or max, for example) to each node. ParticleNet for example uses a globalAveragePooling <cite globalavpool>, so it calculates the average over the nodes for each feature. This kind of pooling works quite well, but is sadly not really applicable to autoencoders, since those functions are definitely not invertible>. That does not mean that there are no approaches, just that most authors shy away from any approach that changes the graph size (see for example <cite nopool>). The first approach you find by just searching for a graph autoencoder is a paper <cite kipfetal> and a lot of papers referencing it. The main problem here is that this paper uses one fixed adjacency matrix, and thus one identical graph setup, for any input and at any point in the network. This allows neither for the learnable meaning of similarity that apparently makes particleNet so good, nor for the variable input size discussed in chapter <ref graphs>, nor, probably worst of all, for any structural difference between different jets. Other approaches come from the problem of graph pooling operations, meaning the definition of some kind of layer that takes a graph as input and returns a smaller graph as the result of some learnable function<note this is not an entirely solved problem. If solved, it allows for hierarchical learning, similar to the use of pooling layers in convolutional networks. See appendix <ref mol> for an application of our algorithm as a pooling layer>. 
DiffPool <cite diffpool> and mincutpool <cite mincutpool> might be good examples for this, but graph u Nets <cite graphunets> stands out, since it also gives a suggestion on how to implement an anti pooling layer, and thus allows for a graph autoencoder in the way we require it here, which is why the first approach we tried is based on theirs. See for this appendix <refs failed>.<ignore> and even though it does not work very well for us, it still creates the basis for every other approach.</ignore>
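The non-invertibility of global average pooling mentioned in the note above can be seen in a purely illustrative toy example: two clearly different node sets pool to exactly the same vector, so no decoder can undo the compression.

```python
import numpy as np

# two different "graphs" (node feature lists)
X1 = np.array([[1.0, 0.0], [0.0, 1.0]])               # two distinct nodes
X2 = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])   # three equal nodes

pool = lambda X: X.mean(axis=0)  # global average pooling over the nodes

# both collapse to the same pooled representation [0.5, 0.5]
assert np.allclose(pool(X1), pool(X2))
```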
......@@ -10,7 +10,7 @@ Every network has a fixed maximum number of particles that can be put into it, f
<e>flag: a constant #1# for each particle, but #0# if the 4-vector is zero. This input replaces the biases of our update steps, since adding a constant bias would not differentiate between those vectors that represent particles and those that are just filler zeros, and thus would not leave the network invariant under concatenating zero vectors to the input (increasing the graph size)<note by adding biases to zero vectors, you get vectors that are not necessarily zero. But since the effect of a vector in the graph update step is proportional to its size (see chapter <ref gnn>), this means that zero vectors can have a measurable effect on non-zero vectors>. More information about why we do this can be found in appendix <ref nobias>.</e>
<e>#Delta_eta#: #Eq(eta,ln((p+p_3)/(p-p_3))/2)# (with $p=|\vec{p}|$), which is shifted in such a way that the mean of #Delta_eta# is #0#, since the position of the jet should not have any meaning: #Eq(Delta_eta,eta-mean(eta))#.</e>
<e>#Delta_phi#: #Eq(phi,arctan2(p2,p1))#<note the function #atan2(y,x)# is an extension of #arctan(y/x)# that is able to map to the full #2*pi# output space>, which is again shifted in such a way that the mean of #Delta_phi# is #0#: #Eq(Delta_phi,phi-mean(phi))#<note the shift is actually not that easy to implement here, since you have to consider the difference in a modular space, see appendix <ref aimplementationphi> or the implementation <link w="https://github.com/psorus/grapa/blob/master/grapa/layers.pyHASHTAGL6488">https://github.com/psorus/grapa/blob/master/grapa/layers.pyHASHTAGL6488</link> for more information>, since this position of the jet should also not have any meaning.</e>
<e>#lp_T#: #Eq(p_T**2,p_1**2+p_2**2)#, and #Eq(lp_T,-ln(p_T/p_T**jet))#. This logarithm is needed to keep each value at about the same order of magnitude, which makes the training more stable. We also divide by the total jet transverse momentum to make every jet look more similar (see appendix <ref alpt> for the effects of changing this). Finally, the sign is used to keep the values positive.<note a consequence is that higher transverse momenta have lower values. Since the alternative in appendix <ref alpt> changes this, we can say that this does not matter too much></e>
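Assuming plain numpy and no zero-padded particles, the inputs above could be sketched as follows. This is a hedged approximation, not the actual implementation: the modular centering of #phi# is replaced by a naive mean shift, and the jet #p_T# is taken as the scalar sum of the particle #p_T#s.

```python
import numpy as np

def jet_features(p):
    # p: array of particle momenta (p1, p2, p3), one row per particle
    absp = np.linalg.norm(p, axis=1)
    flag = (absp > 0).astype(float)                 # 1 per real particle
    eta = 0.5 * np.log((absp + p[:, 2]) / (absp - p[:, 2]))
    d_eta = eta - eta.mean()                        # remove the jet position
    phi = np.arctan2(p[:, 1], p[:, 0])
    d_phi = phi - phi.mean()                        # naive, no mod-2pi handling
    pt = np.hypot(p[:, 0], p[:, 1])
    lpt = -np.log(pt / pt.sum())                    # minus log of the pT fraction
    return np.stack([flag, d_eta, d_phi, lpt], axis=1)

p = np.array([[10.0, 0.0, 5.0], [0.0, 8.0, -2.0]])
F = jet_features(p)
assert F.shape == (2, 4)
assert np.allclose(F[:, 0], 1.0)        # flags
assert abs(F[:, 1].mean()) < 1e-12      # Delta_eta centred at zero
```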
......
......@@ -5,7 +5,7 @@ Our graph update layer consists out of two matrices, a self interaction matrix,
So written as a formula, the new vector equals (with the original feature vector #x_i#, the learnable self and neighbour matrices #s_i**j# and #n_i**j#, as well as the adjacency matrix #A_i**j# and the activation #f#)
##f(x_i*s_j**i+x_i*A_k**i*n_j**k)##
(FORMULA GETS REORDERED LATER!)
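One possible reading of the index formula above is the matrix form #f(X*S+A*X*N)#, with #X# the node features, #S# the self matrix and #N# the neighbour matrix. The index convention is an assumption here, and this is not the thesis code, just a sketch of the layer's shape and its permutation equivariance:

```python
import numpy as np

def graph_update(X, A, S, N, f=np.tanh):
    # self interaction (X @ S) plus aggregated neighbour interaction (A @ X @ N)
    return f(X @ S + A @ X @ N)

rng = np.random.default_rng(1)
n, d, d_out = 4, 3, 5
X = rng.normal(size=(n, d))
A = np.ones((n, n)) - np.eye(n)   # fully connected toy graph
S = rng.normal(size=(d, d_out))
N = rng.normal(size=(d, d_out))

H = graph_update(X, A, S, N)
assert H.shape == (n, d_out)

# permuting the node order commutes with the update (equivariance)
P = np.eye(n)[rng.permutation(n)]
assert np.allclose(graph_update(P @ X, P @ A @ P.T, S, N), P @ H)
```

The equivariance check is exactly the property demanded of graph actions in chapter <ref graphs>: relabeling the nodes before or after the update gives the same result.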
It should be noted that this implementation has one central problem: It is a bit slower than the usual approach (for a reasoning on why we cannot use the more usual approach of particleNet, see appendix <ref acomparepnet>)<note especially since the usual approach can utilise GPUs better>, and even though we don't think the implementation (see git <link w="https://grapa.readthedocs.io/en/latest/">https://grapa.readthedocs.io/en/latest/</link>) is as fast as possible, this is something that could be improved a lot <ignore>this speed deficit might be a price to pay for the higher number of possible actions you can apply to the graph, and thus for making graph autoencoders possible</ignore>.
<ignore
finally the activation function (here labeled #f#) might be interesting: Some inspiration in writing this layer was taken from a paper (ENTER PAPERLINK), which does graph updating a bit differently compared to many other papers. Instead of having multiple different update layers, they use only one layer that gets called multiple times, until it converges<note converging neural networks are not very easy to implement, since you cannot easily execute a function if and only if a certain condition is met; in practice, converging networks are just executed a certain number of times, until you assume that anything that should converge has converged>. This seemed to be quite physical<note consider a graph describing a system of coupled harmonic oscillators; each update step might then simulate one timestep, and a converging output would be the stable state. In fact, this whole idea is a bit reminiscent of hopfield networks, which use some clever math to set the convergent state to something fixed, and are based on ising models (ENTER REFERENCE)>, but this opens the question of how to be sure that your update step actually converges: This is mostly a question of the activation, since the update step itself is learnable, and thus will generally not create any matrices that have determinants too far away from 1<note we simplify here a bit, since what would actually matter would be the combination of two different matrices, where it is not trivial to say if the result has a determinant of 1, especially since one depends on the adjacency matrix, but the intuition might be the same>, but even when the matrices are convergent, an activation like a sigmoid, which for each (positive) input is smaller than its input, would go to zero if applied infinitely many times, so we demand that the activation fulfills<note in theory, this might not be necessary, since the matrices could compensate for the activation, but since this would still demand that the network learns to compensate for it, we keep this assumption>
##Eq(f(f(x)),f(x))##
......
......@@ -4,10 +4,10 @@ Before we can look at the results of our network, we have to look at how to judg
We might be able to evaluate a binary classification problem (see chapter <refs binclass>), but evaluating a network is a bit more difficult, since we basically want to do two things at the same time: creating an autoencoder and creating a classifier. So there might be situations in which the autoencoder is good but the classifier is bad, and situations in which the classifier might be good but the autoencoder is basically useless.
<subsubsection title="AUC scores" label="evalauc">
If you want to evaluate a network, you might simply use the quality of the classifier (the AUC score, see chapter <refs classauc>), since the classifier should work by the autoencoder understanding the data, and thus should only be good if the autoencoder is also good. In most cases this works; there is a clear relation between the quality of the autoencoder and the quality of the classifier (see chapter <ref normalization>), but in general this is simply not true, as for example chapter <ref simplicity> shows. And even if you are working in a region where this relation holds, classifier evaluation methods<note AUC scores even have one of the lower uncertainties> usually have a much higher uncertainty<note uncertainty in the sense that even a well trained network can change its AUC score by a couple of percent after retraining, even if it has the same loss> than other methods, which is why, in the regions in which there is a strong correlation, it was more useful to use the loss of the network to assert that the network improves, and to simply know that the AUC score will correlate.
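As a hedged illustration of how such an AUC score judges an autoencoder-based classifier: the per-event reconstruction loss is used as an anomaly score, and the AUC is the probability that a random anomaly scores higher than a random background event (the rank-based formula below is the standard Mann-Whitney construction, not code from this thesis).

```python
import numpy as np

def auc(scores_bg, scores_anom):
    # probability that an anomaly outscores a background event,
    # counting ties as half a win (Mann-Whitney U / AUC)
    wins = (scores_anom[:, None] > scores_bg[None, :]).mean()
    ties = (scores_anom[:, None] == scores_bg[None, :]).mean()
    return wins + 0.5 * ties

rng = np.random.default_rng(2)
bg = rng.normal(0.0, 1.0, size=1000)    # background reconstruction losses
anom = rng.normal(1.0, 1.0, size=1000)  # anomalies reconstruct a bit worse
score = auc(bg, anom)
# two unit gaussians one sigma apart give an AUC of about 0.76
assert 0.7 < score < 0.8
```

A perfect separation gives #1#, a useless one #0.5#, matching the description in chapter <refs classauc>.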
<subsubsection title="Losses" label="evalloss">
<ignore>So why not do this all the time: Look at the quality of the autoencoder and try to optimize only it.</ignore> Using only the quality of your autoencoder and trying to optimize this would be conceptually great, as you only need to use your anomalous data once<note usual machine learning has a problem in which your network can learn even data that it is not trained on, simply by you comparing networks on it (this is why there is test data). The same can happen here, by you often comparing qualities on your anomalous data, and since finding new test data would require you to have completely different anomalous systems, this can be difficult to do (even though we try this in chapter <ref secdata>), which is why choosing to ignore your anomalies in training would be great>, but this again has problems: Not only does this still require a strong relation between AUC and loss (which is given here even less; consider the problem of finding the best compression size: The loss will usually<note always, except for noise and random chance> fall by increasing the compression size, but at some point the autoencoder can just reconstruct everything perfectly, and thus has no more classification potential), but the loss also relies heavily on the definition of the network and the normalization of the input data (see chapter <ref data>), which makes comparing different networks only possible if you alter neither the loss nor the normalization.
<subsubsection title="Images" label="evalimg">
This cross comparison problem can be easily solved by simply looking at the reconstruction images instead of the losses<note the jet image showing input and output of the autoencoder, see for an example <ref imgout>>. But while this is certainly very useful, as it also allows you to understand more about your network (for example, there are networks that simply ignore some parameters, and thus have their whole loss in those parameters; this is most easily seen by looking at the images), it still relies on the relation between AUC and loss and, more importantly, is less quantitative: Given two images, finding out which autoencoder is better is not always an easy task, especially since the differences you might see in those images do not necessarily correspond to differences the network sees (see for this chapter <ref losses>). Most notably, you usually care more about angular differences, and mostly neglect differences in #lp_T#, while sorting by the transverse momentum introduces a slight preference for #lp_T#.
......
......@@ -12,7 +12,7 @@ We justify this idea mathematically in appendix <ref oomath> and <ref impro>
<subsubsection title="oneoff quality" label="ooquality">
A simple dense network with just an output that should be one sadly still has a lot of problems.
First: the loss can go to basically zero(#10**(-12)#), which is a bit unphysical, since the loss, as a distance to one, is basically the variance of the used feature, and you would not expect there to be any physically significant feature of this accuracy in 4 particles<note>Especially, since the lowest difference there can be in the used float32 implementation is bigger than #10**(-8)# and thus, since the final loss is the mean of each loss, this would mean, that at least #0.9999# of each event reproduce exactly 1</note>. So there are features that are more trivial to learn, and make any decision process meaningless. And it is not neccesarily trivial to find those, there migth be those features that are just input variables of one (for example an input that would be set to flag), but not all of them are that easy to find. <note>A notable example migth be the preprocessing of #lp_T#. As descibed in chapter <ref data>, we used a preprocessing similar to that of particleNet: #Eq(x,ln(p_Tjet/p_T))#, but this means (because of the implementation), that a sum over #exp(-x)# is always #1#. This migth be a good time to talk about functions in those kind of networks. Since we have to forbidden any biases (a bias would just result in the network learning a zero and adding a one as bias), the usual reason for a network to learn any function has to be modified a bit. Think about taylor approximations: A function like #exp(x)# could be written as #1+x+O(x**2)# (with as many term as the networks needs), but for a network to learn #1#, the input of #exp(x)# would then be learned to zero, the network would be one and it is basically the same as adding a constant bias. But adding a bias is not allowed, and thus the network can not learn #exp(x)#, but the network can learn #Eq(exp(x)-1,x+O(x**2))#, and, when #Eq(sum(exp(-x_i),i),1)# then is #Eq(sum(exp(-x_i)-1),-3)# for 4 nodes, and thus the network can learn this, without having learned anything physically useful</note>. 
This means, that training an oneoff network is a bit like outsmarting your algorithm. One thing that we found quite useful, is letting the network not only learn a one on the data that you are interrested in, but also zero on other random data.<note>We choose here random events with the same mean and standart deviation in each feature, as the original data, that still goes through the same preprocessing</note>. When we use relu<note A relu activation can be defined as #x+abs(x)#. See Appendix <ref arelu> for why this is useful> activations here<note>Activations are another thing where those networks can become trivial, think of a sigmoid and a network just learning infinite values before activation</note>, learning values to be zero, means learning them just to be negative, and is thus way easier. This can demand that the network does not fixate on trivial features in the networksetup and preprocessing<note> later on, in chapter <ref mixedidea>, this is no longer needed, and just complicates the training</note>.
First: the loss can go to basically zero(#10**(-12)#), which is a bit unphysical, since the loss, as a distance to one, is basically the variance of the used feature, and you would not expect there to be any physically significant feature of this accuracy in 4 particles<note>Especially, since the lowest difference there can be in the used float32 implementation is bigger than #10**(-8)# and thus, since the final loss is the mean of each loss, this would mean, that at least #0.9999# of each event reproduce exactly 1</note>. So there are features that are more trivial to learn, and make any decision process meaningless. And it is not neccesarily trivial to find those, there migth be those features that are just input variables of one (for example an input that would be set to flag), but not all of them are that easy to find. <note>A notable example migth be the preprocessing of #lp_T#. As descibed in chapter <ref data>, we used a preprocessing similar to that of particleNet: #Eq(x,ln(p_Tjet/p_T))#, but this means (because of the implementation), that a sum over #exp(-x)# is always #1#. This migth be a good time to talk about functions in those kind of networks. Since we have to forbidden any biases (a bias would just result in the network learning a zero and adding a one as bias), the usual reason for a network to learn any function has to be modified a bit. Think about taylor approximations: A function like #exp(x)# could be written as #1+x+O(x**2)# (with as many term as the networks needs), but for a network to learn #1#, the input of #exp(x)# would then be learned to zero, the network would be one and it is basically the same as adding a constant bias. But adding a bias is not allowed, and thus the network can not learn #exp(x)#, but the network can learn #Eq(exp(x)-1,x+O(x**2))#, and, when #Eq(sum(exp(-x_i),i),1)# then is #Eq(sum(exp(-x_i)-1),-3)# for 4 nodes, and thus the network can learn this, without having learned anything physically useful</note>. 
This means that training a oneoff network is a bit like outsmarting your own algorithm. One thing we found quite useful is letting the network not only learn a one on the data you are interested in, but also a zero on other random data<note>We choose random events with the same mean and standard deviation in each feature as the original data, which still go through the same preprocessing</note>. When we use relu<note A relu activation can be defined as #(x+abs(x))/2#. See Appendix <ref relu> for why this is useful> activations here<note>Activations are another place where these networks can become trivial: think of a sigmoid and a network simply learning infinite values before the activation</note>, learning values to be zero only means learning them to be negative, which is far easier. This helps ensure that the network does not fixate on trivial features of the network setup and preprocessing<note>later on, in chapter <ref mixedidea>, this is no longer needed and just complicates the training</note>.
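The construction of this random "zero-target" sample can be sketched as follows (a minimal numpy illustration; `real` is a stand-in for the preprocessed jet features, and the event counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 16))   # stand-in for the preprocessed jets

# Random counterexamples: same per-feature mean and standard deviation
# as the real data, but without any of its structure.
noise = rng.normal(real.mean(axis=0), real.std(axis=0), size=real.shape)

x = np.concatenate([real, noise])
y = np.concatenate([np.ones(len(real)), np.zeros(len(noise))])
# A bias-free relu network is then fit to (x, y): hitting the 0 target only
# requires a negative pre-activation, while the 1 target stays hard.
```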
A simple oneoff network usually reaches an AUC of at best #0.6# for the task of finding top jets, which is not too impressive. But if you look at the classification power as a function of the training epoch, you see that this is only so bad because those AUC scores are much better at earlier epochs (see figure <refi mabe2>).
<i f="mabe2" wmode="True" label="mabe2">AUC score as a function of the epoch, trained on QCD, once for a graph oneoff and once for a dense oneoff. As you see, both relations show a maximum before the training ends, but the graph network is way more continuous</i>
Sadly, this observation is not really useful, since stopping the training at the optimal epoch would not be unsupervised. It is still quite interesting, since it shows that there is some potential in these kinds of networks which is just not utilised well enough<note>this will be solved in chapter <ref mixedidea></note>.
<subsection title="A good classifier" label="finalae">
With the same setup as before (see chapter <ref setup>) and normalization, as well as after training 25 oneoff networks on each latent space, we obtain the final top tagger for this thesis.
<subsubsection title="Trained on QCD" label="classQCD">
<i f="sephist928" wmode="True" label="sephist928">Oneoff loss distribution for a network trained on top jets</i>
<i f="seproc928" wmode="True" label="seproc928">Oneoff Roc curve for a network trained on top jets</i>
In figures <refi sephist928> and <refi seproc928> you see AUCs worse than in chapter <ref secgae>, but consistently better than with normalization alone.
Interestingly, this also helps the reconstruction quality (see figures <refi ang928> and <refi pt928>).
<i f="simpledraw928" wmode="True" label="ang928">Angular reconstruction images for a normalized network trained on QCD</i>
<subsection title="Scaling for oneoff networks" label="scale3">
Oneoff networks still do not solve the problem of different parts of the network being combined suboptimally. You can see this when you consider a 9 node network trained on top jets in figure <refi sd1404>.
<i f="simpledraw1404" wmode="True" label="sd1404">Angular reconstruction image for a 9 node network on top</i>
Even though its reconstruction is much better than those from chapter <ref secgae>, we also see that its classification quality drops compared to the 4 node alternative: it reaches only an AUC of #0.34#, compared to #0.177# for the 4 node network.
<subsubsection title="In batches" label="ooscalebatch">
The batches considered in chapter <ref normimpro> are now all invertible, as figure <refi ooinv> shows.
<i f="m4scalesep" wmode="True" label="ooinv"> Invertibility of batches in oneoff networks</i>
Here you see a much more interesting relation than before. The variance grows with the batch index, which is expected, but some networks actually beat the AUC score of the first batch (batch 3 has an event below #0.15#). This is a result of the number of particles in each jet becoming a feature at some point. You can see this by noticing that the relation between AUC and batch number is not monotonic: the AUCs for the second batch might even be some of the worst, even though it should hold the second most information, next to the first batch. This also makes combining those batches hard.
<subsubsection title="Without batches" label="finalscale">
From a technical standpoint, bigger networks do not train as well, since their loss becomes nan at some point. For now we can fix this by giving up two things: we cannot use a learnable graph anymore, and we train on less data. Using a fixed fully connected graph is usually not a good idea, as it seems to slow down the training, but it also removes a lot of nans<note it is still possible for the network to nan, which makes debugging harder>. Using less data should not matter too much, since for 4 nodes appendix <ref asize> shows that reducing the number of training samples to #5000# qcd jets does not change anything. This removes fewer nans, but has the added benefit of accelerating the training a bit. It is also useful to apply the normalization from chapter <ref normalization>, as this also seems to remove nans.
We train an autoencoder compressing 16 nodes twice by a factor of 4, until the latent space has dimension 36, with a batch size of #100# and a learning rate of #0.003#, for at least #500# epochs and afterwards with a patience of #100# epochs. Between each compression step there are 3 graph update steps. This results in the training history shown in figure <refi hist1583>. This training took more than 58 hours on a cpu<note Training on a gpu would accelerate this quite a lot. We expect a factor between 3 and 5, but since this still would not make gpus possible in our computation quota, we use cpus>.
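The stopping rule ("at least 500 epochs, afterwards patience 100") can be sketched framework-independently; the helper name and the toy loss history are made up:

```python
def should_stop(losses, min_epochs=500, patience=100):
    """Stop only after min_epochs, and only once the best loss is `patience` epochs old."""
    epoch = len(losses)
    if epoch <= min_epochs:
        return False
    best = min(range(epoch), key=losses.__getitem__)  # epoch of the best loss so far
    return epoch - 1 - best >= patience

# Toy history: improves for 600 epochs, then plateaus for 150 more.
hist = [1.0 / (e + 1) for e in range(600)] + [1.0 / 600] * 150
assert not should_stop(hist[:400])   # still inside the guaranteed 500 epochs
assert should_stop(hist)             # plateau longer than the patience
```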
<i f="history1583" wmode="True" label="hist1583">Training history for a 16 node network trained on qcd</i>
More importantly, the reconstruction works quite well, as figures <refi sd1583> and <refi pt1583> show.
<i f="simpledraw1583" wmode="True" label="sd1583">Angular reconstruction for a 16 node network trained on qcd</i>
<i f="ptdraw1583" wmode="True" label="pt1583">Momentum reconstruction for a 16 node network trained on qcd</i>
Without oneoff networks, the classification quality is also better than our best AUC on 4 nodes (#0.635#). We move figures <refi roc1583> and <refi rec1583> to the appendix, but they show an AUC of about #0.7#.
Using oneoff networks, this AUC falls to #0.55# (see figure <refi seproc1583>). This might seem to show that oneoff networks are not as good as we assumed before, but this is not the case. Consider the same training, now done on top jets. The training history (figure <refi hist1590>) and the reconstructions (figures <refi sd1590> and <refi pt1590>) look very similar, which is why you find them in the appendix.
The problem lies in the ROC curves (figures <refi roc1590> and <refi rec1590>). They reach an AUC score of #0.64#, meaning we would have networks that are not invertible.
We think that these networks are not invertible (even though we use normalization) because the normalization has a much smaller effect here: on 4 nodes, removing two values means removing #1/2# of all values, while on 16 nodes it only means removing #1/8# of them. So, since the trivial difference is contained in each particle, but differently for each particle, removing 2 values might remove some of the width, but the substructure still contains enough for the network to rely only on a triviality.
Luckily we still have a way to handle this: using oneoff networks here results in an AUC score of #0.48# (see figure <refi seproc1590>), at least making this network invertible.
These terrible AUC scores show that simply solving the computational challenges of networks with more nodes is not enough: we think that by adding nodes that are more and more random, the autoencoder focuses more on reconstructing them than on the first nodes. But since most of the classification power is contained in these initial nodes, this just weakens the classifier. So you would also need to keep the focus of the network right.
One way of doing this would be weighting your loss function, but our experiments with losses that are functions of the node index or the transverse momentum only worsened the reconstruction quality.
<subsection title="Acknowledgements">
I would like to thank Prof. Dr. Michael Krämer for allowing me to write this thesis, Dr. Alexander Mück for being pretty much the perfect supervisor, as well as Thorben Finke for his help, and especially for generating the data used in chapter <ref ldm> and for sharing his computation resources.
I would also like to thank Yuriy Popovich for proofreading this thesis, and my friends and family for supporting me while having to listen to too many pointless thoughts about graphs.
Simulations were performed with computing resources granted by RWTH Aachen
University under project thes0678.
Quick answer: No.
Long answer: Probably not, but not because the quality is necessarily worse; rather, the number of nans (appendix <ref nans>) increases a lot, making long trainings very hard and thus resulting in worse classifiers.
That being said, this still means that if you could handle the nans, you might profit from more gtopk layers, but we are not able to test this at the moment. And even though multiple different graphs allow you to see them as something like activations, there is no physically useful definition of similarity in angles and momenta but the angles themselves, so changing the graph setup in the middle of the layers might not have any effect at all.
If you are familiar with image based neural networks, our choice of momentum preprocessing might seem a bit strange. Since #-ln(x)# is monotonically falling, low momenta correspond to high #lp_T# values. And since image based networks weight each pixel with the absolute value of this value, this seems like a bad idea (to understand this further, see chapter <ref imageloss>). But graph neural networks do not weight the loss with the transverse momentum, so we do not expect this to be a problem.
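The inverted ordering is easy to verify numerically (made-up constituent momenta; #lp_T# as defined in chapter <ref data>):

```python
import numpy as np

pt = np.array([100.0, 10.0, 1.0])   # hypothetical constituent momenta, hardest first
pt_jet = pt.sum()

lpt = np.log(pt_jet / pt)            # the lp_T preprocessing used in the thesis
assert lpt.argmax() == 2             # the SOFTEST particle gets the largest value

alt = np.log(pt + 1.0)               # the alternative tested here
assert alt.argmax() == 0             # here the hardest particle is largest
```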
To test this, we train a network similar to those from chapter <ref finalae> with #ln(p_T+1)# instead of #lp_T#. This would also include information about the total jet momentum, but since we still use normalization, this gets filtered out automatically.
Comparing both networks is not easy, as the loss is defined differently and the reconstruction is nearly perfect. So we simply use the AUC (with oneoffs):
With #lp_T# we reach an AUC score of #0.635# and with #ln(p_T+1)# we get #0.621# on the same model. So #ln(p_T+1)# is very slightly worse, and we keep #lp_T#. It should be noted that this is not the most rigorous test, as hyperparameters or simple repetition could change the outcome completely, but the choice does not seem to have a big effect anyway.
<subsection title="Comparing our graph update layer to particleNet" label="acomparepnet">
There are multiple ways of implementing such a layer; a notable one is used by particleNet <cite particleNet>: their graph connectivity is implemented by storing, for each vector, all its neighbouring vectors in a set, which means they can implement the update procedure as a function of the original and the neighbour vectors<note this function is actually a bit complicated, involving not only convolutions but also normalisations between them, and they end by concatenating the updated vector to the original one, which is not very useful when you want to reduce the size of your graph>. This is not exactly what we do here, mostly because implementing the graph as a set of neighbour vectors demands, for computational reasons, that each node is connected to the same number of other nodes. It also requires relearning the graph after each step, which we do not want to force our network to do, as explained in appendix <ref arelearn>, and it would turn this from a graph autoencoder into an autoencoder with some graph update layers in front of it, since there is no way to reduce the number of nodes in such an implementation without completely ignoring the graph structure.
Please note the difference: since we use the adjacency matrix itself to define the graph (instead of calculating some derived representation of it), you not only have complete control over the graph, which can be used to shrink the graph structure along with the number of feature vectors, but you also allow an arbitrary number of connections for each node<note this is mostly interesting since it extends the number of possible compression algorithms: they no longer have to keep the number of connections constant. The number of possible graphs with #n# nodes is #2**(n*(n-1)/2)# (ignoring permutation invariance, self connectivity and directed graphs); for #Eq(n,4)# this results in #64# possible graphs, of which only #6# are of this kind. This means that far fewer compressed graphs are possible, and that finding an algorithm that can pick only those graphs is much more complicated (see appendix <ref atopkwhy> for more)><ignore><note you might also have noticed that there is another difference, since my implementation does not allow for cross terms between the self and neighbour terms. This is a minor difference, since the following graph update step still provides them, and you would expect them to take a less important role anyway></ignore>.
<subsubsection title="overflow in angular differences, and how to solve it" label="aimplementationphi">
Our input data contains a #phi# that is centered around its mean: the most simple implementation would just subtract the mean of #phi# from each #phi# value. This can lead to overflow problems, since #Eq(phi,2*pi)# is equivalent to #Eq(phi,0)#, and thus a mean of about #6# could be subtracted from a tiny value<note or the inverse>, resulting in weird phi distributions. Not solving this results in a loss distribution with a small peak at very high losses.
<ignore><i f="none" wmode="True">loss distribution with angular overflow</i></ignore>
So how do we solve this? First we need the true mean value (#2*pi-0.1# and #0.1# have mean #0#, not #pi#), and then we use a modulo operator to restrict every difference (the difference #2*pi-0.1# modulo #2*pi# equals the true difference of #-0.1#). For finding this mean value we can cheat a little by calculating the mean 4 vector and reading off its #phi# value. For a more in-depth look at our solution, you can also take a look at the actual implementation <link w="https://github.com/psorus/grapa/blob/master/grapa/layers.pyHASHTAGL6488">https://github.com/psorus/grapa/blob/master/grapa/layers.pyHASHTAGL6488</link>.
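The recipe can be sketched in a few lines of numpy (this is an illustration, not the actual grapa implementation; the "mean 4 vector" trick is reduced here to its transverse part, i.e. a #p_T# weighted mean direction):

```python
import numpy as np

def center_phi(phi, pt):
    """Center phi on the pT-weighted jet axis, avoiding the 2*pi overflow."""
    # "cheat a little": get the mean angle from the summed transverse vector
    mean_phi = np.arctan2((pt * np.sin(phi)).sum(), (pt * np.cos(phi)).sum())
    # wrap every difference into [-pi, pi) with a modulo
    return (phi - mean_phi + np.pi) % (2 * np.pi) - np.pi

# the example from the text: 2*pi - 0.1 and 0.1 have mean 0, not pi
phi = np.array([2 * np.pi - 0.1, 0.1])
centered = center_phi(phi, np.ones_like(phi))
assert np.allclose(centered, [-0.1, 0.1])
```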
<subsection title="Metrik analysis" label="ametrikana">
As explained in appendix <ref atopkhow>, our topK algorithm, on which all graphs are based, uses a learnable diagonal metric to define similarity in the network. This metric can be extracted to understand this sense of similarity. Figure <refi metrik00> shows that unnormalized networks use the angular differences between nodes to define similarity. Interestingly, figure <refi metrik725> suggests that using a normalization changes this: now networks use only one angle, with a negative metric value for the other one. This means that two nodes count as more similar the closer they are in one angle, but also the more they differ in the other.
<i f="metrik00" wmode="True" label="metrik00">Typical metrik of unnormalized networks</i>
<i f="metrik725" wmode="True" label="metrik725">Typical metrik of normalized networks</i>
In this chapter we will quickly go over some bad ideas for implementing a graph autoencoder, and finish with the first implementation that could be considered working.
These implementations are usually defined by an encoding and a decoding algorithm: something to go from a big graph to a small graph, and something to reverse this again. Apart from this, the graph update and the graph construction stay mostly as explained in chapters <ref gnn> and <ref setup>.
<subsubsection title="trivial models" label="failedtrivial">
Let us start with probably the most simple autoencoder algorithm: to turn an #n# node graph into an #m# node graph, we just cut away the last nodes until only #m# nodes are left<note please note the importance of the #p_T# ordering here: cutting the last particles means cutting the particles with the lowest #p_T#, and thus probably the least important ones>, and add zero valued particles to decompress again. One difficulty lies in the fact that those added particles have no graph connections anymore; we solved this by simply keeping the original graph connections stored. Sadly, those networks just do not work: even when we set the compression size above the input size, the reproduced jets hardly bear any resemblance to the input jets. This is the first example of the central problem of graph autoencoding: permutation invariance. Consider the following encoder: two numbers #a# and #b# with #Eq(a,b+1)#. This would be trivial to compress into one number for a normal (dense) autoencoder (maybe just take #a#), but here we have to respect permutation symmetry, so we do not know which is the first and which is the second particle; how do we decompress now? You could keep one of the parameters and try to encode whether the other one is bigger or smaller; maybe you also know that #LessThan(0,a)# and you could multiply it by #-1# if it is the smaller one, but this is less than trivial, and with more parameters it gets even more complicated. This problem mostly appears as the inability of even a "good" autoencoder to build an identity when the compression size equals the input size (see appendix <ref identities>). <ignore>Next to the loss from the compression, there seems to still be a certain loss from the graph structure, at least partially coming from permutation invariance.</ignore>
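To make the permutation problem concrete: a reconstruction loss that respects permutation symmetry has to treat any reordering of the output as equally good. A brute-force sketch (factorial cost, so only usable for tiny graphs; not claimed to be the loss actually used in this thesis):

```python
import numpy as np
from itertools import permutations

def perm_min_mse(true, pred):
    """MSE minimized over node orderings: the output only has to match the SET."""
    return min(
        np.mean((true[list(p)] - pred) ** 2)
        for p in permutations(range(len(true)))
    )

true = np.array([[2.0], [3.0]])            # the two numbers a and b with a = b + 1
swapped = true[::-1].copy()
assert perm_min_mse(true, swapped) == 0.0  # a permuted output counts as perfect
```

Since the decoder gets no credit for reproducing the order, it cannot rely on node positions to store information, which is exactly what makes naive decompression fail.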
<subsubsection title="minimal models" label="failedminimal">
To improve this model, we started working with smaller graph sizes (mostly the first 4 particles), making the structure less complicated and allowing for more experimentation thanks to the lower time cost. Notable improvements include replacing the added zeros by a learnable function of the remaining parameters, relearning the graph on the new parameter space, and adding some dense layers after the graph interactions. But the most important improvement was achieved by making the compression and decompression local along some learned axis. Instead of just removing parameters in an arbitrary way or by physical intuition, we demand that particles which are similar in some way are compressed together: we create a function that compresses a set of particles into one particle, and let the network learn what similarity means<note in the compression step, we define a new feature for each node, by which we sort the set of nodes; afterwards we build sets of n particles from this ordering and compress them using a linear function (it might be interesting to look at nonlinear functions, but we generally see worse results by adding a nonlinearity). Please note that since we use a feature to sort the elements, and the graph update step includes neighbour terms that generally increase similarity, connected particles are more likely to be compressed together, even though we do not demand this>.
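The sort-and-compress step from the note can be sketched as follows (a linear merge of groups of two, with fixed random weights standing in for the learned ones):

```python
import numpy as np

def compress(nodes, score_w, mix_w, group=2):
    """Order nodes by a learned scalar feature, then merge each group linearly."""
    order = np.argsort(nodes @ score_w)          # learned "similarity" ordering
    grouped = nodes[order].reshape(-1, group, nodes.shape[1])
    # one linear map turns every group of `group` nodes into a single node
    return np.einsum("ngf,g->nf", grouped, mix_w)

rng = np.random.default_rng(0)
nodes = rng.normal(size=(4, 3))                  # 4 particles, 3 features
out = compress(nodes, score_w=rng.normal(size=3), mix_w=np.array([0.7, 0.3]))
assert out.shape == (2, 3)                       # 4 nodes compressed to 2
```

Because the grouping happens after sorting, the operation is invariant under permutations of the input nodes, which is the property the dense cut-away approach was missing.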
These networks still have problems, as we will discuss in the following, but they generally produce respectable decision qualities and show similarities between input and output image. They are discussed in the next subchapter.
<subsection title="Improving autoencoder" label="secondworking">
Given the fairly good AUC score, it looks like the only thing we still need to do is increase the size of this autoencoder, and we probably have a really great anomaly detection algorithm. But before we try, and fail, at this, let us improve our autoencoder first. As you might agree, the training curve does not look very impressive, and the reconstruction is also not very good. That is why we suggest some changed models<note we alter models iteratively, but since we do not want to show thousands of models here, you only see summaries, which is why the changes may seem a bit random>.
<subsubsection title="Training setup" label="quick2setup">
<subsubsection title="better decoding" label="decoding">
Also the decoder, does not use the graph structure completely. So we try to replace the abstraction with a constant learnable graph, by an abstraction with a graph that is not constant. The problem here, is that the tensorproduct introduced in <ref tensorproduct> does not work for a product of one graph with multiple graphs. The main difficulty lies in finding out how to work with the nondiagonal terms: Consider again adjacency matrices of adjacency matrices: When each feature vector becomes a vector of feature vectors, also each entry in the adjacency matrix becomes a new matrix. These matrices, multiplied with the original entry would result in a tensorproduct, when the new matrices would always be the same, but this is what we want to change. Finding now the diagonal matrices can be left to a learnable function of the feature vector, but for the offdiagonal matrices, we have two suggestions: The first, graphlike decompresser, define those matrices as functions of the two corresponding diagonal matrices. Here we compare a product, a sum and those rounded versions and and or not only to the abstraction with a constant graph, but also to the second suggestion: paramlike decompresser: instead of the diagonal matrices beeing functions of a feature vector, every submatrix is a learnable function of its two corresponding original feature vectors.
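To make the two suggestions concrete, the following numpy sketch builds the off-diagonal submatrices both ways; the random linear maps stand in for whatever learnable functions the model actually uses, and all names and sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, f, k = 4, 8, 3                       # nodes, feature size, sub-block size
X = rng.normal(size=(n, f))             # node feature vectors
W_diag = rng.normal(size=(f, k * k))    # stand-in for a learned map

# Diagonal blocks: one k-by-k matrix per node, from its own feature vector.
D = (X @ W_diag).reshape(n, k, k)

# Graphlike decompresser: off-diagonal block (i, j) is a function of the
# two diagonal blocks, e.g. their elementwise product or their sum.
B_prod = np.einsum('iab,jab->ijab', D, D)   # product variant
B_sum = D[:, None] + D[None, :]             # sum variant

# Paramlike decompresser: block (i, j) comes directly from the pair of
# original feature vectors (x_i, x_j) through its own learned map.
W_pair = rng.normal(size=(2 * f, k * k))
pairs = np.concatenate(
    [np.repeat(X, n, axis=0), np.tile(X, (n, 1))], axis=1)
B_param = (pairs @ W_pair).reshape(n, n, k, k)
```

Either way, the result is an n-by-n grid of k-by-k blocks, i.e. the adjacency-matrix-of-adjacency-matrices structure described above, just with the off-diagonal blocks no longer tied to a single constant graph.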
<table caption="Quality differences for different graphlike decoders" label="decode1" c=6>
In the current algorithm, we would try to separate this 6 node graph into, for ex...
<i f="mincut2" wmode="True" label="mincut2">A sample 6 node graph, split the way we would like to split it</i>
To create this graph, you would need a more graph-based algorithm. The algorithm minCut<note from <cite mincutpool>; minCut tries to separate a graph by cutting the fewest edges> would be really hard to write branchless, and since we would want it to produce subgraphs of different sizes, whatever algorithm is then applied to compress the subgraphs would have to handle different numbers of nodes. This also means that the decompression algorithm would either return graphs of different sizes, which would not only be hard to write but also difficult to handle, considering that this could not work in a graphlike manner and might at best be limited in a paramlike manner, or would combine into a graph bigger than the original one (while also limiting the subgraph sizes). And considering that even sorting graph nodes is worthy of discussion (see appendix <ref asort>), and sorting is much easier than cutting your graph into usable pieces, this is definitely not trivial to write.
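As a concrete illustration (not the code used here), the following numpy sketch evaluates the soft minCut objective from <cite mincutpool> on a 6 node toy graph of two triangles joined by one edge: the assignment that cuts only the bridging edge gets a lower cut loss than a mixed assignment. The graph and the assignment matrices are made up for illustration:

```python
import numpy as np

# Toy 6-node graph: two triangles joined by a single edge (2, 3), the
# case where cutting along that one edge is the split we would like.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
Dg = np.diag(A.sum(1))  # degree matrix


def mincut_losses(S):
    """Cut and orthogonality losses of minCut pooling for a soft
    cluster-assignment matrix S (nodes x clusters)."""
    cut = -np.trace(S.T @ A @ S) / np.trace(S.T @ Dg @ S)
    StS = S.T @ S
    K = S.shape[1]
    ortho = np.linalg.norm(StS / np.linalg.norm(StS) - np.eye(K) / np.sqrt(K))
    return cut, ortho


good = np.array([[1, 0]] * 3 + [[0, 1]] * 3, dtype=float)  # triangle split
bad = np.array([[1, 0], [0, 1]] * 3, dtype=float)          # mixed split

# The desired split cuts only one edge, so its cut loss is lower
# (more negative) than that of the mixed assignment.
assert mincut_losses(good)[0] < mincut_losses(bad)[0]
```

Note that this relaxation sidesteps the branching problem by keeping the assignments soft and differentiable, which is exactly what makes it attractive compared with a hard, branchy cut.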
So why do it? Appendix <ref intuitivecode> gives you some more physical intuition for why this might work better on jets, but even if it did not, a better encoding and decoding algorithm might still be very useful as a graph pooling and a graph generating algorithm (see appendices <ref mol> and <ref tipsy> respectively).