Commit b47f8d94 authored by Simon Klüttermann

worked a lot

parent 4544cf3b
Showing with 87 additions and 73 deletions
@misc{jetdm, url={https://atlas.cern/updates/physics-briefing/precision-search-dark-matter}, journal={ATLAS Experiment at CERN}, year={2020}, month={Aug}}
@misc{kaliningrad, title={Main}, url={https://visit-kaliningrad.ru/en/travel-tools/print-take/}, journal={Tourism Information Center}}
@article{high1,
author = "Gianotti, F.",
......
<subsection title="Graphs" label="graphs">
A graph<cite graphbasic> is a mathematical<ignore>/informatical</ignore> concept that allows one to define a more general form of data than just encoding it in vectors. Namely, graphs allow storing relational information about an arbitrary<note for computational reasons, the graphs in the following chapters are not completely unbounded, but have a maximum size, up to which their size is arbitrary.> number of objects. This is done by defining two kinds of objects: nodes, which are the objects of interest and can be mathematically described by vectors<note in theory these objects would not need to be definable as vectors alone, but for practical applications this is quite useful. Chapter <ref thirdworking> could be interpreted as using graphs themselves as the information encoded in those nodes.>, and edges, which are pairs of connected nodes and thus encode the relation between those objects of interest. See <refi kaliningrad> for a simple example.
<note there are multiple extensions of this simple graph. The two most important ones are directed graphs, in which the edges gain a direction, so that a connection between node #i# and #j# does not automatically imply a connection between #j# and #i#, and weighted graphs, in which each edge gains an additional value that encodes how strong the connection between two nodes is.>
<i f="dia3" wmode="True" wid="0.7">(mmt/dia3) A simple graph explaining nodes and edges</i>
<ignore><i f="dia3" wmode="True" wid="0.7" label="kaliningrad">(mmt/dia3) A simple graph explaining nodes and edges</i></ignore>
<i f="kaliningrad" f2="kaliningraph" wmode="True" label="kaliningrad">On the left: tourist map<cite kaliningrad>, On the rigth: the same map as graph of city regions: You can understand for example a map as a graph. Each region becomes a node and bridges between them represent edges. </i>
Mathematically, these nodes and relations are defined by a list of feature vectors<note technically this is equivalent to a matrix, but a list of vectors is more intuitive> #X#, which stores the features of each node, and an adjacency matrix #A#, whose components #A_i**j# are #1# if the nodes #i# and #j# are connected, and #0# if not. The graph is usually taken to be invariant under permutations of the node indices. You achieve this by permuting<note by permuting we here mean switching two indices, or more generally multiplying with a permutation matrix> the adjacency matrix in the same way the feature vectors are permuted<note this is the reason why we don't call the list of feature vectors a matrix: while permuting a matrix requires permutation matrices on both sides (#p*A*p#), the feature vector "matrix" only requires one permutation matrix (#X*p#)>, and by requiring any action on the graph to be permutation invariant. This action is usually also local, and thus only acts on each node and the mean<note each local action also has to be symmetric under changes of the input ordering, since otherwise the output could depend on the order of the nodes. This is usually achieved by applying a function like the mean to all neighbours> of the nodes that are connected to the current node<note this works since permuting indices does not change which nodes are connected to each other>. This makes graphs ideal for modelling interactions between high numbers of objects, as the functions don't change as you add more nodes to the graph. In informatics, this is useful for example for social networks<cite graphsoc>: data consisting of a huge number of nodes in which mostly only connected nodes (friends) affect each other is a perfect application for graphs, since otherwise you would need to update your model every time a new user joins. In physics, this is reminiscent of nuclear science and the approximation of pair interaction potentials<cite pairpot>, and so there are applications using this kind of molecule encoding for chemical feature extraction (for a simple example look at chapter <ref mol>)<cite gnnforchemistry> and medicine<cite gnnformedicine>.
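As a minimal sketch of this invariance (plain numpy, with all names hypothetical): permuting the node indices permutes the feature list once and the adjacency matrix on both sides, and a local action built from neighbour means commutes with the permutation.

```python
import numpy as np

# toy graph: 4 nodes with 3 features each, and a symmetric adjacency matrix
X = np.random.rand(4, 3)
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 1.],
              [0., 1., 0., 0.],
              [0., 1., 0., 0.]])

P = np.eye(4)[[2, 0, 3, 1]]   # a permutation matrix

X_p = P @ X                   # the feature list is permuted once
A_p = P @ A @ P.T             # the adjacency matrix is permuted on both sides

# a local, permutation-invariant action: the mean over all connected neighbours
def neighbour_mean(X, A):
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)
    return (A @ X) / deg

# permuting first and acting, or acting first and permuting, gives the same result
assert np.allclose(neighbour_mean(X_p, A_p), P @ neighbour_mean(X, A))
```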
Besides those relational applications, there are also applications that do not utilize an existing relation, but use the locality of the graph structure to encode the similarity of given data. This is done by letting the sense of similarity between nodes be a learnable function. For example, by using a topK algorithm (each node is connected to its nearest #K# neighbours, see <ref atopk>), you can implement a learnable version of whatever distance means. This allows networks like ParticleNet<cite particeNet>, which uses a special kind of neural network that is able to work on graphs, to separate top and QCD jets<note by QCD jets we mean jets that are generated by gluons or by quarks other than top quarks> in a supervised way. They use the graph structure to define and redefine, multiple times, which detected particles (nodes) should be considered close to each other. This results in ParticleNet being a quite good classifier (see <cite toptagref>).
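A hedged sketch of such a topK adjacency (the function name and the choice of a plain euclidean distance are mine; a network like ParticleNet would instead use learned embeddings and rebuild the adjacency in every layer):

```python
import numpy as np

def topk_adjacency(emb, K):
    """Connect every node to its K nearest neighbours in some embedding space."""
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # no self connections
    nearest = np.argsort(d, axis=1)[:, :K]    # indices of the K closest nodes
    A = np.zeros_like(d)
    np.put_along_axis(A, nearest, 1.0, axis=1)
    return A

# e.g. 10 detected particles embedded by their (eta, phi) coordinates
A = topk_adjacency(np.random.rand(10, 2), K=3)
```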
......
......@@ -18,9 +18,9 @@ Evaluating the difference between background<note here background data is the da
</table>
<subsubsection title="ROC curve" label="classroc">
For most decision problems, these fractions are functions of some parameter. Consider the loss function in image <refi exampleloss>.
<i f="add3" wmode="True" label="exampleloss">(c4/20) A sample loss distribution to highlight the calculation of a ROC curve. To get the ROC curve, you have to consider every possible position of the cut parameter. Given one parameter value, everything with a loss higher than the parameter is classified as signal, and everything below as background. The ROC curve is then given by the true and false positive rates for each possible parameter value.</i>
Here, this parameter is the point at which to cut the distribution such that everything above it is classified as signal and everything below as background. Since the choice of this cut is quite arbitrary, we evaluate every possible parameter and plot two fractions from table <reft fractions> against each other:
<list>
<e>
The first fraction, called the false positive rate, is given by the number of events that are wrongly classified as signal divided by the number of events that truly are background.
</e>
<e>
The second fraction, called the true positive rate, is defined as the number of events that are correctly classified as signal divided by the number of events that truly are signal.
</e>
</list>
These two fractions can be plotted against each other in a way that shows the fraction of correctly classified events and the AUC score (see chapter <ref classauc>), as in figure <refi roca>.
<i f="roca" wmode="True" label="roca">(test/rocdraw) A sample ROC curve plotted in a way that easily shows the AUC score</i>
Or, to focus on the fraction of events falsely classified as signal, one can plot what is usually called the background rejection rate (see figure <refi rocb>).
<i f="rocb" wmode="True" label="rocb">(test/rocdraw) A sample ROC curve focussing on the background rejection</i>
The resulting curve is called the ROC curve (Receiver operating characteristic).
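As a minimal sketch of this construction (plain numpy; the function name and the assumption of an array of per-event losses plus truth labels are mine):

```python
import numpy as np

def roc_points(loss, is_signal):
    """Sweep the cut over every possible loss value and collect both rates."""
    fpr, tpr = [], []
    n_bkg = np.sum(~is_signal)   # is_signal: boolean truth labels
    n_sig = np.sum(is_signal)
    for cut in np.sort(loss):
        above = loss > cut       # everything above the cut is called signal
        fpr.append(np.sum(above & ~is_signal) / n_bkg)
        tpr.append(np.sum(above & is_signal) / n_sig)
    return np.array(fpr), np.array(tpr)
```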
<subsubsection title="Area under the curve" label="classauc">
To simplify comparing ROC curves, you can use an AUC (area under the curve) score to summarise them. This AUC score is defined as the integral of the true positive rate over the false positive rate. This simplification is not perfect, since you reduce a function to only one number, but it is fairly widely accepted, as it is much easier to interpret: a perfect classifier would result in an AUC score of 1, a classifier that just guesses randomly results in an AUC score of 0.5, and a perfect anticlassifier would result in 0. <ignore>Also this reduction into only one number can make the AUC score less errorprone than other values.</ignore> On the other hand, not every part of the ROC curve is equally important for a given problem (if you want to test whether somebody is ill, you might prefer more false positives over more overlooked illnesses). This could result in networks improving the AUC score by just changing unimportant parts of the ROC curve.
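Continuing the sketch above, the AUC is then just a numerical integral (here with numpy's trapezoidal rule; `loss` and `is_signal` are the same assumed arrays):

```python
import numpy as np

fpr, tpr = roc_points(loss, is_signal)   # from the sketch above
order = np.argsort(fpr)                  # sort, since fpr falls as the cut rises
auc = np.trapz(tpr[order], fpr[order])   # 1: perfect, 0.5: random, 0: anticlassifier
```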
......
<subsection title="Explaining graphics" label="graphics">
<subsubsection title="Output images" label="imgout">
<i f="simpledraw7638_200" wmode="True" wid="0.6">a sample reconstuction image</i>
In those images, you see the #phi# and #eta# value of each particle, including any normalisation, plottet once for the input jet in red, and once for the output jet in blue. This means that a perfect network would show both jets overlapping in violet. Zero particles are not shown and there is some indication of the transverse momentum in the size of the dots, which is given proportional to #1+9/(1+lp_T)#<note inverse function, since higher #p_T# result in lower #lp_T#, and some constants to keep the radius finite>. This sadly does allow you to see differences in #lp_T# very well, so we also have to look only at #lp_T#. We show those here as a function of the index.
<i f="ptdraw200_7638" wmode="True" wid="0.6">A sample momentum reconstruction image</i>
<i f="simpledraw200" wmode="True" label="explsimpledraw">A sample reconstuction image</i>
In images like <refi explsimpledraw>, you see the #phi# and #eta# value of each particle, including any normalisation, plottet once for the input jet in red, and once for the output jet in blue. This means that a perfect network would show both jets overlapping in violet. Zero particles are not shown and there is some indication of the transverse momentum in the size of the dots, which is given proportional to #1+9/(1+lp_T)#<note inverse function, since higher #p_T# result in lower #lp_T#, and some constants to keep the radius finite>. This sadly does allow you to see differences in #lp_T# very well, so we also have to look only at #lp_T#. We show those in images like <refi explptdraw> here as a function of the index.
<i f="ptdraw928" wmode="True" label="explptdraw">A sample momentum reconstruction image</i>
This way of looking at the network performance is quite useful for finding patterns in the data. There seem to be networks that show a high correlation between the angles (see for example appendix <ref firstworking>), and it is quite common for the reproduced values to have less spread than the input ones (see appendix <ref secondworking>). A problem here is that you can only ever look at a few images, and finding one nice-looking reproduction for each network is not that hard. To tackle this, we always use the same event (for each training set).
<subsubsection title="AUC Feature maps" label="imgmaps">
<i f="aucmap200" wmode="True">A sample AUC Featuremap</i>
A loss is usually just a mean over a lot of losses for each feature and particle. So you could just not average them, to also be able to calculate an AUC score for each feature and particle. These AUC scores are shown in feature maps, showing the quality of each combination of feature on the horizontal axis and particle on the verticle axis in the form of pixels. A perfect classifier(#Eq(AUC,1)#) would result in a dark blue pixel, a perfect anti classifier(#EQ(AUC,0)#) would be represented by a dark red pixel. Finally a useless classifier, that guesses if a jet is part of background or signal(or one that always results in the same value)(#Eq(AUC,1/2)#) would be a white classifier. In short: The more colorful a pixel is, the better it is, and an autoencoder trained on QCD events should have a featuremap that is blue, while an autoencoder trained on top events should be represented in red consistently.
<i f="aucmap979" wmode="True" label="explaucmap" wid="0.8">A sample AUC Featuremap</i>
A loss is usually just a mean over a lot of losses for each feature and particle. So you could just not average them, to also be able to calculate an AUC score for each feature and particle. These AUC scores are shown in feature maps like <refi explaucmap>, showing the quality of each combination of feature on the horizontal axis and particle on the verticle axis in the form of pixels. A perfect classifier(#Eq(AUC,1)#) would result in a dark blue pixel, a perfect anti classifier(#EQ(AUC,0)#) would be represented by a dark red pixel. Finally a useless classifier, that guesses if a jet is part of background or signal(or one that always results in the same value)(#Eq(AUC,1/2)#) would be a white classifier. In short: The more colorful a pixel is, the better it is, and an autoencoder trained on QCD events should have a featuremap that is blue, while an autoencoder trained on top events should be represented in red consistently.
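A hedged sketch of how such a map could be computed (the array layout and all names are assumptions; only the idea of not averaging the per-feature losses is from the text):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_feature_map(per_losses, is_signal):
    """per_losses: un-averaged losses of shape (events, particles, features)."""
    n_ev, n_part, n_feat = per_losses.shape
    aucs = np.empty((n_part, n_feat))
    for p in range(n_part):
        for f in range(n_feat):
            aucs[p, f] = roc_auc_score(is_signal, per_losses[:, p, f])
    return aucs   # plotted with 1 as dark blue, 1/2 as white, 0 as dark red
```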
The useful thing about those maps is that they can show the focus of the network, as well as its problems. Since a perfect reconstruction, just like a terrible one, has no decision power, a network that has focus problems (meaning it reconstructs some things much better than others, making both parts worse) can be clearly seen in those maps. Also, it is fairly common to get one feature and particle that has more decision power than the whole combined network (see <ref oneoff> for an example and <ref caddition> for the explanation). Finally, an AUC map that is completely blue or red is quite uncommon; more probably some features are red and some are blue, and you get an indication of which features are useful for the current task (see <ref firstworking>).
<ignore>
......
......@@ -7,8 +7,8 @@ This focus can be influenced in two ways: The inital normalisation, and the loss
Being probably the most obvious choice, setting the loss function to the quadratic difference between input and output is still not the worst idea:
##Eq(loss,mean((x-f(x))**2))##
(with the input #x# and the autoencoder #f(x)#)
Not only is this easy to implement and fast to compute, but it also punishes one big difference more than multiple small differences (two acceptable reconstructions are preferred over one good and one bad one), which we generally prefer over the alternative (see subsubsection <ref ln>). But this also results in autoencoders that learn mean values: if the network needs to choose either #a# or #-a#, an l2 normalised network that does not know the right choice will choose #0#, since sometimes guessing the wrong result is punished more than not choosing at all<note you might ask why this is a problem, since a network that does not know anything probably should not choose one of the results, but there is a similar effect for imperfectly guessing networks: say the network guesses right #alpha# of the time; then for a prediction of #LessThan(b,a)#, the loss is given by #alpha*(a-b)**2+(1-alpha)*(a+b)**2#, which is minimal for #Eq(b,a*(2*alpha-1))#>. This results in the output of an l2 autoencoder always having a lower width than its input (see for example image <refi sd979>). This, you could say, is one of the central problems we want to solve by looking at different losses<note otherwise each event has a certain base loss, even when the network guesses everything as well as possible, just because the network is not completely sure>.
<i f="simpledraw979" wmode="True" label="sd979">An L2 reconstruction image, showing how the reconstruction has a lower width than the input. Here you see this best in #phi#.</i>
<ignore>
MAYBE DELETE THIS SUBSUBSECTION ENTIRELY?
......@@ -22,9 +22,9 @@ This value is generally between 0 and 1. A Value of #0# would mean, that the net
<subsubsection title="#L_n# loss" label="ln">
The first other kind of loss you can look at would be an ln loss. For every #LessThan(2,n)#, this loss still has the same problem of uncertain losses, but for smaller #n# it does not. L1 does not prefer lower predictions<note the loss described above would now look like #Eq(alpha*(a-b)+(1-alpha)*(a+b),a+b*(1-2*alpha))#, which is minimal (for every #LessThan(0.5,alpha)#) for #b# being as big as possible (this loss still assumes #LessThan(b,a)#), and anything guessing right less than half of the time would still result in the network learning not to guess, as we would want>, and an #n# lower than 1 would reverse the effect entirely<note we tried #Eq(n,1/2)#, but this results in NaNs (see chapter <ref nans>), so we had to tweak #sqrt(abs(a-b))# into #sqrt(1+abs(a-b))-1#, since the NaNs are a result of the square root not being differentiable at 0. <ignore>Using this tweak, the same loss looks like #alpha*sqrt(1+a-b)+(1-alpha)*sqrt(1+a+b)-2# (ENTER MINIMIZATION).</ignore>>. And in a sense this works: those networks have the same output width as input width. But the different loss has another effect: since now one big loss is as bad as two small ones, networks that remember some values exactly while guessing the remaining ones are very common, since this is a much easier thing to learn.
<i f="simpledraw1093" wmode="True" wid="0.6">(c3p10/1093)Reconstruction image for a model that simply remembers 3 nodes completely</i>
Since these networks are not just setting the remaining values to some constant, but try to guess them rigth, we have the same situation as in the normal case: On the known data, this guessing works, and on abnormal data this works less good. But not only do less accurate guesses have way less decision power<note you can model this, as a fixed distance between two gaussian peaks with variying width, a plot of the relation between the width and the AUC is shown in <ref caddition>>, this also ignores the remaining values completely: Since copying is easily done even if the data is abnormal, there is no information gained here. And even if it were, this difference would be tiny compared to the loss of the guessed values, so the whole loss is dominated by only some inaccurate values, ignoring everything else. This does not mean, that LN losses are useless, since there are networks without this problem, but retraining each network, until it does not do this, makes this loss much less desirable
<i f="simpledraw899" wmode="True" wid="0.6">A L1 reconstruction image, working nontrivally</i>
<i f="simpledraw1093" wmode="True" label="sd1093">(c3p10/1093)Reconstruction image for a model that simply remembers 3 nodes perfectly</i>
Since networks like for example in <refi sd1093>, are not just setting the remaining values to some constant, but try to guess them rigth, we have the same situation as in the normal case: On the known data, this guessing works, and on abnormal data this works less good. But not only do less accurate guesses have way less decision power<note you can model this, as a fixed distance between two gaussian peaks with variying width, a plot of the relation between the width and the AUC is shown in <ref caddition>>, this also ignores the remaining values completely: Since copying is easily done even if the data is abnormal, there is no information gained here. And even if it were, this difference would be tiny compared to the loss of the guessed values, so the whole loss is dominated by only some inaccurate values, ignoring everything else. This does not mean, that LN losses are useless, since there are networks (like image <refi sd899>) without this problem, but retraining each network, until it does not do this, makes this loss much less desirable
<i f="simpledraw899" wmode="True" label="sd899">A L1 reconstruction image, working nontrivally</i>
......@@ -33,8 +33,8 @@ One thing to consider, is that image like networks usually do not have these pro
##f(d)*(p_T**a-p_T**b)+(1-f(d))*c(x)##
where #f(d)# is some function of the angular difference, for example a step function (#Eq(f(d),1)# for #LessThan(d,d_0)# and #Eq(f(d),0)# otherwise), and #c(x)# is some alternative loss, for example #Eq(c(x),abs(p_T**a+p_T**b))#. This extension offers a lot more flexibility and is symmetric as long as #c(x)# is symmetric.
From our (limited) experimentation, choosing a continuous function #f(d)# seems to be a good idea; #Eq(f(d),exp(-alpha*d))# with some #Eq(O(alpha),1)# works well. Choosing #p_T# as the alternative loss #c(x)# is not such a good idea, though. The point here is that those networks heavily prefer angular information<note this should be expected, since if #p_T# is not correct, there is some loss, but if the angles are not correct, the accuracy in #p_T# simply does not matter>, so ideally you would choose an alternative loss that shifts the focus more onto the transverse momentum. This is not so easy, since simply choosing the difference between input and output #p_T# results in the network not learning any angular information (since the angles then don't have any effect), but you can set #c(x)# to be a higher multiple of this difference. That being said, simply setting #Eq(c(x),c)# to a constant works best in practice and is what is used in the following. Another thing used there is a slightly different #Eq(f(d),(exp(-alpha*d)+k)/(1+k))# with #Eq(k,1/10)#. This is used to balance the network focus more onto #p_T#<note this can be understood as follows: introducing #k# introduces a lower bound on the effect the #p_T# comparison has on one particle: even if the distance is huge, a fraction #k# of the loss of this particle is still given by the momentum difference>. This combination works quite well and can reach reconstructions with nearly exactly the same width for 4 particle networks<ignore>, and also ???? for 9 particle networks. Here it should be noted that since #varfrac# only cares about the worst feature, those values show the focus loss in #p_T#, and if you would only consider angles, they would look even better</ignore>.
<i f="simpledraw677" wmode="True" wid="0.6">Reconstruction image of an image like loss working well</i>
As you see, this reconstruction matches the input quite well, and thus we will be using image like losses in the following.
<i f="simpledraw677" wmode="True" label="sd677">Reconstruction image of an image like loss working well</i>
As you see in image <refi sd677>, this reconstruction matches the input quite well, and thus we will be using image like losses in the following.
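A hedged sketch of this loss per input/output particle pair (the angular distance definition and the squared momentum difference are my reading of the formula above; #Eq(alpha,1)#, #Eq(k,1/10)# and a constant #c# follow the text):

```python
import numpy as np

def imagelike_loss(eta_in, phi_in, pt_in, eta_out, phi_out, pt_out,
                   alpha=1.0, k=0.1, c=1.0):
    """f(d)*(pT difference) + (1-f(d))*c with f(d) = (exp(-alpha*d)+k)/(1+k)."""
    d = np.sqrt((eta_in - eta_out) ** 2 + (phi_in - phi_out) ** 2)
    f = (np.exp(-alpha * d) + k) / (1.0 + k)
    return np.mean(f * (pt_in - pt_out) ** 2 + (1.0 - f) * c)
```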
<ignore>
</ignore>
......
......@@ -5,20 +5,22 @@
<subsubsection title="4 nodes" label="ae4">
For 4 nodes, the number of connections per node does not really matter, which is why we simply use a fully connected graph for those networks.
<i f="history979" wmode="True" wid="0.6">training history for a 4 node network</i>
As you see, the training curve converges nicely to one value, but one thing to notice here: the validation loss does not behave worse than the training loss. This is something that you would usually expect, as it is a sign of overfitting, but is something that is very common for the graph autoencoder in this thesis: It is basically impossible for those networks to overfit, in fact we can reduce the number of training values drastically without letting the network overfit (see appendix <ref asize>).
<i f="history979" wmode="True" label="h979">training history for a 4 node network</i>
In image <refi h979>, the training curve converges nicely to one value, but one thing to notice here: the validation loss does not behave worse than the training loss. This is something that you would usually expect, as it is a sign of overfitting, but is something that is very common for the graph autoencoder in this thesis: It is basically impossible for those networks to overfit, in fact we can reduce the number of training values drastically without letting the network overfit (see appendix <ref asize>).
<i f="simpledraw979" f2="ptdraw979" wmode="True">Reconstruction images for a 4 node image. On the left for angles, and on the rigth for the momentum.</i>
Also the reconstruction images show good resemplence between the input and the output. The only problem is, that these networks seem to not care about #phi# as much as it does about the other variables.
<i f="simpledraw979" wmode="True" label="sd979">Angular reconstruction images for a 4 node image. </i>
<i f="ptdraw979" wmode="True" label="pt979">Momentum reconstruction images for a 4 node image.</i>
The momentum reconstruction in images like <refi pt979> is neirly perfect and also the angular reconstruction images (see <refi sd979>) show some resemplence between the input and the output. The only problem is, that these networks seem to not care about #phi# as much as it does about the other variables.
<subsubsection title="9 nodes" label="ae9">
If we increase the size of the network to 9 particles, many more trainings fail because of NaNs.
<i f="history1416" wmode="True" wid="0.6">example training history for a 9 node network showing nan losses</i>
<i f="history1416" wmode="True">example training history for a 9 node network showing nan losses</i>
Since the training now effectively stops after only a few epochs, the reconstruction is also much worse (see images <refi sd1416> and <refi pt1416>).
<i f="simpledraw1416" wmode="True" label="sd1416">Angular reconstruction image for a 9 node network</i>
<i f="ptdraw1416" wmode="True" label="pt1416">Momentum reconstruction image for a 9 node network</i>
......
......@@ -4,21 +4,23 @@
<subsubsection title="4 nodes" label="class4">
As you see in image <refi rec4>, these 4 particle networks already separate QCD from top jets quite well, reaching an AUC score of over #0.81# in <refi roc4>, which is quite good considering we only use 4 particles. By changing the network parameters you can reach AUCs upwards of #0.85#.
<i f="recqual979" f2="roc979" wmode="True" label="roc4">Loss distribution and roc curve for the 4 particle network</i>
<i f="recqual979" wmode="True" label="rec4">Loss distribution the 4 particle network</i>
<i f="roc979" wmode="True" label="roc4">Roc curve for the 4 particle network</i>
<i f="aucmap979" wmode="True">AZC feature map for 4 nodes</i>
Interrestingly this good AUC score is basicly only a product of the angular parts, as only using them reaches already an AUC value of #0.78#
<i f="aucmap979" wmode="True" label="map979">AUC feature map for 4 nodes</i>
Figure <refi map979> shows that this good AUC score is basically only a product of the angular parts, as using them alone already reaches an AUC value of #0.78#.
<subsubsection title="9 nodes" label="class9">
<i f="recqual1416" f2="roc1416" wmode="True">Loss distribution and roc curve for the 4 particle network</i>
<i f="recqual1416" wmode="True" label="rec1416">Loss distribution for the 9 particle network</i>
<i f="roc1416" wmode="True" label="roc1416">Roc curve for the 9 particle network</i>
Notice that, by using more particles and thus more information, you would expect your network to improve; but in figures <refi rec1416> and <refi roc1416> we get a worse AUC score of a bit over #0.72#. This will be addressed in the next chapter <ref probscale>.
<i f="aucmap1416" wmode="True">auc feature map for 9 nodes</i>
Again, the most important feature are the angles, they alone would reach an auc of #0.63# here.
<i f="aucmap1416" wmode="True" label="map1416">Auc feature map for 9 nodes</i>
Also for the 9 node network, figure <refi map1416> shows that the most important features are the angles; they alone would reach an AUC of #0.63# here.
......@@ -5,8 +5,8 @@ To understand why more information can lead to worse classifiers, consider the f
Intuitively you can understand this as follows: particles with lower #lp_T# are more random, but more random parts have a higher loss in the autoencoder, and thus matter more in the classifier.
Please note that this is not necessarily the same in image-like networks. Since the loss there is weighted by the transverse momentum itself, parts of the network with higher randomness automatically have lower weight in the loss function, since their momentum is smaller. This is another reason why we tried to make our loss more image-like in chapter <ref losses>.
Mathematically you can model this by considering features of the following kind: given two gaussian distributions like in figure <refi gaussexamp>, with variable overlap, where the gaussian peaks describe background and signal respectively, the quality of the described feature can be defined as the inverse of the overlap fraction. This is basically what an AUC score calculates<note the AUC score is a monotonically falling function of the overlap fraction>, and so we can optimize the combination of two features by combining two different double gaussian peaks in a way that minimizes their overlap fraction.
<i f="dist1" wmode="True" label="gaussexamp">(mmt/dist1) A simple example of how we model the AUC as a function of the overlap of two gaussian peaks</i>
To do this, first notice that the quality of one of those double peaks is translation invariant as well as scale invariant<note to be more precise, invariant under any monotonous transformation #f(x)#>. This means we can fix two values: we choose the mean values of those peaks to be #Eq(mu_0,0)# and #Eq(mu_1,1)#, and for simplicity we also set the widths of both peaks to be the same (#Eq(sigma_0,sigma_1)#)<note this assumption does not really affect the final result, but simplifies the calculation. You get the exact result when you set the new sigma to the quadratic sum of the original widths: #Eq(sigma**2,sigma_0**2+sigma_1**2)#>.
So let's add both double peaks with some constant relative factor<note you might ask if this is the most general approach, and in general it is not, but the invariance of those features under fairly general transformations makes most other transformations useless, and something like a factor that depends on the current position would just break the assumption of gaussian peaks, and thus complicate the calculation for probably quite low gains>:
......@@ -23,7 +23,7 @@ and which is minimal at
So, as expected, the bigger the width of one feature (the more random it is), the less it should contribute.
We should note that this exact relation is only true when the mean of the signal is constantly 1, and that if this is not the case, one has to add another factor #mu_1/mu_2# to the relation.
Finally, let's test this relation, first on random data.
<i f="abc" wmode="True" wid="0.8">(from mmt)AUC as function of #c# for two random gaussian double peaks that would reach AUCs alone represented by the horizontal lines</i>
<i f="abc" wmode="True">(from mmt)AUC as function of #c# for two random gaussian double peaks that would reach AUCs alone represented by the horizontal lines</i>
There is an optimal #c# value at which you can combine two features as well as possible, but as long as you are close to this optimal value, the resulting quality is still better than either single classifier. This is no longer the case when you go too far away from the optimum: then adding more information can actually hurt the original quality. This might already explain why scaling seems to not work here, but let's see. To test this on jets, we use the batchlike autoencoder from chapter <ref scalebatch>. Instead of simply adding these batches together, we use the derived formula for adding distributions (this is actually done in an unsupervised way: the problem here would normally be the mean values of the signal distribution, but we approximate them, as well as the widths of the distributions, by the network loss, and thus multiply each feature by the inverse of the third power of this loss (#loss**(-3)#)<note to be more precise, its square root (since the loss is squared): this can work since the l2 loss of the one-sided training is the variance of the first peak, since both widths are not too different in practice, and since the higher the loss, the further away the second peak seems to move too. We test this assumption more in a later chapter <ref impro>>). This improves things quite a bit:
<i f="powerauc6" wmode="True" label="firstcadd" wid="0.6">(tcroc/imgs/powerauc) Auc as function of the loss power (-3) for parts of a QCD jet, using #10# #4# node networks combined in a way defined by the loss power</i>
You migth seem some sligth deviations from the expected result, but generally using the derived formula, seems to result in the best AUC value, proving our method. On the other hand, if you assume that the difference is not just statistical, the difference migth be explained by carefully considering our quite extensive assumptions. Consider for this appendix <ref awhycaddfail>.
<i f="powerauc4" wmode="True" label="firstcadd">(tcroc/imgs/powerauc) Auc as function of the loss power (-3) for parts of a QCD jet, using #5# #4# node networks combined in a way defined by the loss power</i>
You migth seem some sligth deviations from the expected result in figure <refi firstcadd>, but generally using the derived formula, a power of #3# seems to result in a good AUC value, proving our method. On the other hand, if you assume that the difference is not just statistical, the difference migth be explained by carefully considering our quite extensive assumptions. Consider for this appendix <ref awhycaddfail>.
<subsubsection title="Simplicity" label="simplicity">
One thing you can do by comparing values directly is to look only at parts of the loss. This allows you to define qualities for each part of the input space. As a reminder of chapter <ref imgmaps>, we show them here as AUC maps: each AUC value is one colored box, which is deep blue for an AUC of #1#, completely red for an AUC of #0#, and white for an AUC of #1/2#.
<i f="aucmap200" wmode="True" wid="0.8">AUC map for a simple network</i>
<i f="aucmap979" wmode="True">AUC map for a simple network</i>
As you see, the classification quality sits mostly in the angular part. This is fairly common, and there are even images in which the non-angular parts are partially red.
<ignore>When you also look at the output distributions, you see that most angular outputs are way smaller than their inputs<note smaller in the sense, that they reproduce means, see chapter <ref ameans> for an explanaiton on why this is happening> compared to the momentum reconstruction. This can be seen as the network not beeing very accurate</ignore>
On the other hand, if you look at the loss distribution, the angles are the variables the network knows least about.
<i f="losspos" wmode="True" wid="0.8">(c1/200/analosspos.py)Average relative error by feature (#mean(abs(x-f(x)))/mean(abs(x))#) </i>
<i f="losspos" wmode="True" wid="0.8">(c1/200/analosspos.py..STILL 200)Average relative error by feature (#mean(abs(x-f(x)))/mean(abs(x))#) </i>
This is a bit strange: the thing the network seems to care about the least is the thing that the classifier considers most useful.
But this can be easily understood: if you look at the 2d histogram of the angular distribution, there is a clear difference between top and QCD events.
<i f="meanangle3" wmode="True">(from mmt)2d histogram of angles comparing QCD vs top, here for example for the particle with the third highest transverse momentum</i>
As you see, the width of top jets is much higher than the width of QCD jets, so by comparing both angles to zero, the top jets statistically have a higher loss than the QCD jets, and this can be used to differentiate between them easily.
<i f="meanangle4" wmode="True" label="meanangle">(c1/200/)2d histogram of angles comparing QCD vs top, here for example for the particle with the fourth highest transverse momentum</i>
As you can see in figure <refi meanangle>, the width of top jets is much higher than the width of QCD jets, so by comparing both angles to zero, top jets statistically have a higher loss than QCD jets, and this can be used to differentiate between them easily.
So how useful is this difference, and how much better does the network do than this relatively trivial separator? First, a model that only uses its angles to classify jets does better than a model that also adds the loss in #p_T# (and flag), so it truly is a problem of the angles alone.
Now, given a model that just outputs 0 for the angles and only considers the angles in the loss, you get the scaling shown in figure <refi tscaleno>.
<i f="simponez0.0" wmode="True" label="tscaleno">(tcroc/onez0.0) Trivial comparing angular scaling. Please note, that we do not split each network in batches of 4, but simply calculate the AUC for each number of nodes up to #60#.</i>
As you see, for a low number of particles this works fairly well. But at some point, more information no longer means better classification, and the quality drops. This is a problem we already know about (see chapter <ref caddition>) and can simply solve through c addition. So when you add each particle together, weighted according to its loss (its angular difference to 0), you gain better scaling behaviour (see figure <refi tscale>).
<i f="simponez" wmode="True" label="tscale">(tcroc/simpledraw onez) Trivial comparing angular scaling with c addition. The reason for the falloff at the end migth be the different datasetup of missing particles and the assumptions tested in chapter <ref impro></i>
This better classifier reaches an AUC of over #0.915#, which is comparable to the best anomaly detection networks: for example, QCDorWhat<cite QCDorWhat> reaches #0.93#<note #0.9255# when you calculate the AUC from their ROC curve>, but on slightly different data, while the work of Thorben Finke reached #0.908# on the same data<cite thorb>. <ignore>Not to discredit them, but you could ask yourself, if</ignore> So what is the value of those complicated models, if they only improve the AUC by at most single percentage point differences<ignore>, why use a complicated model at all</ignore>?
That being said, you also cannot assume that new physics has the same angular distribution difference as QCD compared to top, making this alternative model useless for the task of finding new physics<note or at least useless unless you search for one specific kind of abnormal data; there are some examples showing other kinds of abnormal data behaving completely differently in <ref secdata>>. So the question of interest is just: do complicated models contain something more than this trivial difference? And unfortunately, this is very hard to test. C addition allows you to estimate the effect any small additional AUC would have: an AUC of about #0.6#, optimally combined, would only improve an AUC score of #0.9# to #0.904#, while #0.7# would improve it up to #0.917#, so both improvements would probably be nearly unmeasurable. So there might be some hidden effect in such a model that allows it to find new physics<note please note that we assume a lot here: first, that in #p_T# there is potential to differentiate all kinds of new physics; second, that this potential is used perfectly by an algorithm that did not work as expected for angles; and also, maybe most improbably, that the loss is used perfectly, and there is no confusion from the angular part at all>.
......
<subsubsection title="Invertibility" label="invertibility">
Given a classifier that, if trained on QCD, finds top jets as anomalies, a question arising from this is whether this is just a feature of the data, as shown in chapter <ref simplicity>, or whether it is more general. One easy way to test this is to switch the meaning of signal and background: can an autoencoder trained on top jets classify QCD jets as anomalies? (To keep the usual plots easily readable, we keep QCD as background, which means an AUC score of 0 would be optimal for those switched networks.) This does not yield the desired results at all: as you see in figure <refi recinv>, a model trained on top jets is still a valid classifier for QCD jets.
<i f="reccinv-1" wmode="True" label="recinv">Invertibility of a 4 node model</i>
This could actually have been expected: as shown in the last chapter <ref simplicity>, our networks that have a good AUC focus mostly on the difference between the angles and 0, and since this measure does not depend on the attributes of the training data, changing the training data does not alter it, nor does it affect the classification. In fact, by choosing a model that focusses even more on the angular size, you can create a model that is completely independent of the training data. This can go so far that a model trained on top jets can have a lower loss evaluated on QCD jets than on the top jets it was trained on.
All of this is obviously quite problematic, which is why the next two chapters (<ref secnorm> and <ref secmixed>) suggest solutions; after doing so, chapter <ref secdata> focusses entirely on other datasets and their invertibility to make sure this is general.
......@@ -10,8 +10,8 @@ This method has one obvious drawback: Not only do we actively remove information
<subsubsection title="How to normalise an autoencoder" label="normprobs">
One thing we did not realise before trying to normalise the input datapoints is that simply demanding that the mean be zero and the standard deviation one just does not work. This might be an effect that is most important for small networks<note networks with a low number of input particles>, but it is still somewhat of an effect in every network, and it becomes important in chapters <ref oneoff> and <ref secmixed>. The problem is that by demanding a value to be fixed, we reduce the size of the input space, and for an autoencoder that only reduces 12+flag inputs onto 9 values, this means we allow the network to trivially learn 3 values per fixed feature<note 3, since there are 3 variables whose mean and/or standard deviation we fix>. So by fixing the standard deviation and the mean, the autoencoder can trivially learn to compress 12+flag onto 6 values<note ignoring flag for now; three values are always enough to encode 4 flag values, since the first three flag values are nearly always one (the jet with the lowest number of particles has 3 particles in our training set)>, which is below the size of the compression space. In practice this is not so easy, as there is no guarantee that this minimum is found, since the graph structure does not necessarily help for this kind of transformation (see appendix <ref identities>), but training this kind of network definitely does not result in the model gaining any classification power. This can be seen in the corresponding feature map (figure <refi map423>).
<i f="aucmap423" wmode="True" label="map423">(1011..maybe not the best) AUC map for conventionally normalized networks, showing that nothing useful is being learned</i>
This suggests a trivial solution: just reduce the compression size accordingly. But this has three problems:
<list>
<e>First, it is not completely trivial to misuse the normalization (think of the standard deviation: there is a formula giving you the 4th value, given the first three. But even if we take the mean to be 0, this formula still involves squares and roots, which the network has to learn, and even then, there are always two possibilities for the resulting value). So assuming that this is trivial, and that the network is always guaranteed to learn it, would be wrong.</e>
......@@ -39,10 +39,12 @@ Here the definitions of #y# and #n# assert translation and scale invariance resp
<ignore><i f="aucmap928" wmode="True">(928..definitely take some top one)An AUC map for a better normalized network</i></ignore>
<subsubsection title="Using this normalization" label="usenorm">
Using this kind of normalization, 4 node networks are invertible. And not only this, but also most features are invertible (compare figures <refi 4nodeinv1> and <refi 4nodeinv2>).
<i f="aucmap677" wmode="True" label="4nodeinv1" wid="0.6">Invertible 4 node network AUC map achieved by a better normalization, here trained on top jets</i>
<i f="aucmap928" wmode="True" label="4nodeinv2" wid="0.6">Invertible 4 node network AUC map achieved by a better normalization, here trained on QCD jets</i>
But figure <refi invtt> shows that the quality suffers.
<i f="drtoptagging" wmode="True" label="invtt">(generate later for computational reasons)double roc curve for invertibility of normalized networks</i>
There are also other consequences of the fact that this network actually has to learn something nontrivial.
First, we were forced to increase the size of the compressed feature space from 5 to 9. This makes sense, as a network that compares angles to zero only has to reconstruct zeros in each angle, and thus only has to save #p_T#<note and maybe flag, but as seen later in this chapter and more in chapter <ref oneoff>, this is usually not actually the case>, needing only a smaller latent space.
Also, networks that before were very reproducible in their training<note which makes sense, as they always just needed to learn to ignore the angles> are now less stable, and often vary their loss<note remember that the l2 loss is quadratic in changes of the inputs> over about one order of magnitude. Interestingly, this variation shows a clear relation between the loss and the classification quality.
......
......@@ -4,9 +4,9 @@ These initial normalized networks are not very good. This migth be what we expec
Using this, we are able to improve the network trained on top up to #0.377#.
<subsubsection title="Scaling in normalized networks" label="scalenorm">
Sadly, this normalization does not change the scaling problems too much. Bigger networks still contain more trivial information, since the number of fixed parameters is constant, and even when using batches to scale, the invertibility is just a feature of the first batch, as figure <refi batchroc> suggests.
<i f="m4scaleroc" wmode="True">(multi4scale roc) AUC values for higher normalized batches by their training data</i>
<i f="m4scaleroc" wmode="True" label="batchroc">(multi4scale roc) AUC values for higher normalized batches by their training data</i>
<ignore>
......
<subsubsection title="Improving the normalization even further" label="normplus">
After seeing what effect some kind of normalization can have, we are no longer completely satisfied with normalized feature maps like the one in figure <refi map928>.
<i f="aucmap928" wmode="True" wid="0.8" label="map928">AUC feature map for an even better norm</i>
Consider the highest #p_T# value (the lower right corner): while generally being the most interesting particle, there is no classification power in it at all, and by looking at its distribution (figure <refi pt0>) it becomes clear why.
<i f="pt0draw928" wmode="True" wid="0.8" label="pt0">(928/drawp0.py)distribution of the transverse momentum of the first particle</i>
These values are basically constant, so this input is the same as the flag values (first column), from which we don't expect any physically useful information.
So let's solve this: since #lp_T# mostly has the same structure<note to be more precise, the difference between the first and the second particle is higher than the difference between the last two>, most jets' transverse momenta get divided by the first one, resulting in it always having the same value. We solve this by replacing the definition of #n# from chapter <ref normalization> by:
##Eq(n,2*z/(max(abs(z))+mean(abs(z))))##
removing the need to set one value to either positive or negative one, and thus making the highest value in #lp_T# actually useful. As you see in figure <refi map534>, this removes the difference in the #lp_T# AUC.
<i f="aucmap534" wmode="True" wid="0.8" label="map534">AUC feature map for a well normated network</i>
But, as you also see in figure <refi map534>, the whole classification power now lies in flag, and this should be quite confusing: something having no physical meaning being more useful than everything else, not too different from chapter <ref simplicity>. We will explain this in chapter <ref oneoff>.
<subsection title="oneoff networks" label="oneoff">
<subsection title="Oneoff networks" label="oneoff">
Consider the following feature map of a well normalized network in figure <refi map534t2>.
<i f="aucmapb" wmode="True" wid="0.8" label="map534t2">(534)AUC Feature map for an on top trained autoencoder, using a good normalization</i>
You see that most of the decision power is in the first feature, but the first feature, flag, is basically just one<note>flag is 1 as long as the current event does not contain fewer particles than the network demands, and since this is a network with only 4 nodes and there are very few jets with only 3 particles or fewer, saying flag is a constant (#Eq(flag,1)#) is quite a good approximation</note>. This might seem a bit counterintuitive or unphysical at first: how can a variable without any physical meaning be a better separator than those variables with physical meaning?
To explain this, we need to take a closer look at what the network is doing. First, just because the output has no physical meaning does not mean that no physical variables are used in its calculation. In fact, before this we always just assumed that there is one parameter in the latent space that is learned to be just a one from the input space<note>This is a bit of a simplification; most importantly it would be untestable, since instead of learning a constant, the network could learn a constant as a function of multiple parameters (for a simple example consider #Eq(x_2,x_1+1)#: both variables are not constant, making this harder to find, but #x_2# still carries no additional information with respect to #x_1#, and a one is learned as #x_2-x_1#)</note>, but this distribution of decision power implies that this is not the case: if there were a constant feature in the compression space, the constant output would be a trivial copy of this constant and thus have no physical meaning. More likely is the following: the network is able to reconstruct a #1# from all the other parameters. This makes sense, since we got this AUC distribution by changing the normalization in a way that made trivial ones in the input space much less likely<note>Since we stopped dividing by #max(abs(x))# and started dividing by #(max(abs(x))+mean(abs(x)))/2#, it is no longer the case that there is either a #-1# or a #1# in each feature</note>, and it also explains how an unphysical output can be physically useful: since it utilizes physical inputs, the resulting constant has to be a function of the inputs. And when you change the inputs, the constant also changes, and this change we can use to differentiate signal and background events.
And since this quality is better than every other autoencoder decision quality, it might be useful to use this: if apparently nonphysical outputs can be at least as good as physical outputs, why not just use outputs that are nonphysical (outputs that are one)? This is what we call oneoff networks<note>Since the distance off 1 is the deciding quality indicator and it is a one-class algorithm</note><ignore>, and on paper it seems like a great idea</ignore>: as shown before (see chapter <ref simplicity>), complexity is to a big part just width. You may be able to solve this by normalization, but that removes information, and oneoff networks would not require it<note>Since their output, #1#, is obviously automatically normalized</note><note>In practice it still seems to be a good idea to also normalize oneoff networks; this might be because the normalization makes the features the oneoff network focuses on more similar and thus easier to combine, or because similarly sized inputs are easier to train on</note>. Also there might be a certain kind of complexity benefit, since the whole network is made to just minimize one distance<note>Actually, in practice it seems to simplify the training if you use not only one output but multiple ones, which are all compared to 1 and whose mean is used. This results in very high correlations between the outputs, but seems to help the convergence of the network</note> that is always the same, instead of optimizing some feature that might be useful for some events but weakening it while considering other events in which this feature plays a less important role. This should result in the network being able to learn more complicated functions.
We justify this idea mathematically in appendix <ref oomath> and <ref impro>.
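To make this concrete, here is a minimal sketch of such a oneoff network in Keras; the layer sizes, data shapes and training settings are placeholders rather than the setup of this thesis. A bias-free dense network is trained so that its output is 1 on the background data, and the distance of the output from 1 then serves as the anomaly score.
```python
import numpy as np
import tensorflow as tf

# hypothetical inputs: 4 particles x 4 features, flattened
inp = tf.keras.Input(shape=(16,))
h = tf.keras.layers.Dense(32, use_bias=False, activation="relu")(inp)
h = tf.keras.layers.Dense(32, use_bias=False, activation="relu")(h)
out = tf.keras.layers.Dense(1, use_bias=False)(h)
oneoff = tf.keras.Model(inp, out)
oneoff.compile(optimizer="adam", loss="mse")

x_bg = np.random.randn(10000, 16).astype("float32")  # stand-in for QCD jets
oneoff.fit(x_bg, np.ones((len(x_bg), 1), dtype="float32"),
           epochs=10, batch_size=256, verbose=0)

# anomaly score: how far the output is from 1
score = np.abs(oneoff.predict(x_bg, verbose=0) - 1.0)
```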
A simple dense network with just an output that should be one sadly still has a lot of problems.
First: the loss can go to basically zero (#10**(-12)#), which is a bit unphysical, since the loss, as a distance to one, is basically the variance of the used feature, and you would not expect there to be any physically significant feature of this accuracy in 4 particles<note>Especially since the smallest difference representable in the used float32 implementation is bigger than #10**(-8)#, and thus, since the final loss is the mean over the per-event losses, this would mean that at least a fraction #0.9999# of the events reproduce exactly 1</note>. So there are features that are more trivial to learn, and these make any decision process meaningless. And it is not necessarily trivial to find them: there are features that are just input variables equal to one (for example an input that would be set to flag), but not all of them are that easy to find<note>A notable example might be the preprocessing of #lp_T#. As described in chapter <ref data>, we used a preprocessing similar to that of particleNet: #Eq(x,ln(p_Tjet/p_T))#, but this means (because of the implementation) that a sum over #exp(-x)# is always #1#. This might be a good time to talk about functions in these kinds of networks. Since we have to forbid any biases (a bias would just result in the network learning a zero and adding a one as bias), the usual way for a network to learn a function has to be modified a bit. Think about Taylor approximations: a function like #exp(x)# could be written as #1+x+O(x**2)# (with as many terms as the network needs), but for the network to learn a #1# this way, the input of #exp(x)# would have to be learned to be zero, the output would be one, and this would be basically the same as adding a constant bias. But adding a bias is not allowed, and thus the network cannot learn #exp(x)#; it can however learn #Eq(exp(x)-1,x+O(x**2))#, and when #Eq(sum(exp(-x_i),i),1)#, then #Eq(sum(exp(-x_i)-1,i),-3)# for 4 nodes, and thus the network can learn this without having learned anything physically useful</note>. This means that training a oneoff network is a bit like outsmarting your own algorithm. One thing that we found quite useful is letting the network not only learn a one on the data that you are interested in, but also a zero on other random data<note>We choose here random events with the same mean and standard deviation in each feature as the original data, which still go through the same preprocessing</note>. When we use relu<note A relu activation can be defined as #x+abs(x)#. See Appendix <ref arelu> for why this is useful> activations here<note>Activations are another place where those networks can become trivial; think of a sigmoid and a network just learning infinite values before the activation</note>, learning values to be zero means learning them just to be negative, which is way easier. This can force the network not to fixate on trivial features of the network setup and preprocessing<note>later on, in chapter <ref mixedidea>, this is no longer needed and just complicates the training</note>.
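The "one on the data, zero on random data" idea from the last paragraph could look roughly as follows; this is again a sketch with assumed shapes and sizes. The reference sample reuses each feature's mean and standard deviation, and the relu output lets the network hit the zero target with any negative pre-activation.
```python
import numpy as np
import tensorflow as tf

def make_oneoff(n_in):
    # bias-free network with a relu output: the 0 target is reached by
    # any negative pre-activation, which is much easier to learn
    inp = tf.keras.Input(shape=(n_in,))
    h = tf.keras.layers.Dense(32, use_bias=False, activation="relu")(inp)
    out = tf.keras.layers.Dense(1, use_bias=False, activation="relu")(h)
    return tf.keras.Model(inp, out)

x = np.random.randn(10000, 16).astype("float32")  # stand-in for real jets
# random reference events with matched per-feature mean and std
noise = np.random.normal(x.mean(axis=0), x.std(axis=0),
                         size=x.shape).astype("float32")

x_train = np.concatenate([x, noise])
y_train = np.concatenate([np.ones(len(x)),
                          np.zeros(len(noise))]).astype("float32")

model = make_oneoff(16)
model.compile(optimizer="adam", loss="mse")
model.fit(x_train, y_train, epochs=10, batch_size=256, verbose=0)
```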
A simple oneoff network usually reaches an AUC of at best #0.6# for the task of finding top jets, which is not too impressive. But if you look at the classification power as a function of the training epoch, you see that it is only this bad because the AUC scores are way better at earlier epochs (see figure <refi mabe2>).
<i f="mabe2" wmode="True" label="mabe2">(multiabe 3)Auc as function of the epoch, trained on QCD, once for a graph oneoff and once for a dense oneoff. As you see, both relations show a maximum before the training ends, but the graph network is way more continuous</i>
Sadly, this observation is not really useful, since stopping the training at the optimal epoch would not be unsupervised. It is still quite interesting, since it shows that there is some potential in these kinds of networks that is just not utilized well enough<note>this will be solved in chapter <ref mixedidea></note>.
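For illustration, a curve like the one in figure <refi mabe2> can be produced with a simple monitoring callback such as the following sketch. Note that it needs labels, so it is purely diagnostic, which is exactly the supervision that is ruled out above for stopping the training.
```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import roc_auc_score

class AUCHistory(tf.keras.callbacks.Callback):
    """Record the AUC of the oneoff score |output - 1| after every epoch."""
    def __init__(self, x_val, y_val):
        super().__init__()
        self.x_val, self.y_val = x_val, y_val
        self.aucs = []

    def on_epoch_end(self, epoch, logs=None):
        score = np.abs(self.model.predict(self.x_val, verbose=0) - 1.0).ravel()
        self.aucs.append(roc_auc_score(self.y_val, score))

# usage: model.fit(x, y, epochs=50, callbacks=[AUCHistory(x_val, y_val)])
```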
Another problem is again invertibility: it is possible to create an invertible oneoff network, but it is not trivially given. This becomes easier when you use a lot of parameters, for which a graph network is less useful than just a simple dense network.
This dataset implies an unsupervised classification task that is way more difficult.
The first thing that makes this dataset so much more complicated is the angular distribution: while you can use this distribution alone to differentiate top jets from their QCD counterparts, and quite well at that (see chapter <ref simplicity>), here both angular distributions are basically the same
<i f="angulardistLDM" f2="altangulardistLDM" wmode="True">angular distribution of ldm jets, on the left as 2d histogram and on the rigth as 2 1d histograms (THE SECOND PEAK IS JUST NUMERICS, THAT I WILL STILL FILTER OUT)(i guess newdata/imgs und both (alt)angular)</i>
and the momentum distribution is not much better
<i f="pthistLDM" wmode="True" wid="0.8">(newdata3/histpt) Momentum distribution of ldm vs lQCD jets</i>
<i f="pthistLDM" wmode="True" >(newdata3/histpt) Momentum distribution of ldm vs lQCD jets</i>
That being said, there is one easily understandable parameter that can be used to differentiate both datasets: the number of particles in the jet
<i f="nhistldm" wmode="True" wid="0.8">Particle number distribution of ldm jets (i guess newdata3/histn.py)</i>
<section title="appendix" label="appendix">
<ignore><subsection title="Choosing the compression size" label="</ignore>
<section title="Appendix" label="appendix">
"Beware ye who reads here"
The following chapters might provide precision on some interesting points, but they have either gotten less attention than the previous ones or consist of deep dives that do not really affect the content of this thesis. This does not mean that I think anything in the following is wrong, but grammatically/visually these chapters may not be perfect.
<section title="Understanding specific choices" label="aecunderstand">