<?xml version="1.0" encoding="utf-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"> <head> <!-- 2023-09-19 Tue 08:29 --> <meta http-equiv="Content-Type" content="text/html;charset=utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>Data Processing, Code Documentation and Beyond (Emacs and org-mode)</title> <meta name="author" content="Jonathan A. Hartman | Lukas C. Bossert" /> <meta name="generator" content="Org Mode" /> <link rel="stylesheet" type="text/css" href="https://fniessen.github.io/org-html-themes/src/readtheorg_theme/css/htmlize.css"/> <link rel="stylesheet" type="text/css" href="https://fniessen.github.io/org-html-themes/src/readtheorg_theme/css/readtheorg.css"/> <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script> <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/js/bootstrap.min.js"></script> <script type="text/javascript" src="https://fniessen.github.io/org-html-themes/src/lib/js/jquery.stickytableheaders.min.js"></script> <script type="text/javascript" src="https://fniessen.github.io/org-html-themes/src/readtheorg_theme/js/readtheorg.js"></script> <style> #content{max-width:1800px;}</style> </head> <body> <div id="content" class="content"> <h1 class="title">Data Processing, Code Documentation and Beyond <br /> (Emacs and org-mode)</h1> <div id="table-of-contents" role="doc-toc"> <h2>Table of Contents</h2> <div id="text-table-of-contents" role="doc-toc"> <ul> <li><a href="#orgaaa978e">1. Overview</a></li> <li><a href="#org2afd9fe">2. Introduce</a></li> <li><a href="#org800c288">3. Prepare</a> <ul> <li><a href="#org217abe3">3.1. Data retrieval using SPARQL</a></li> <li><a href="#org9e8e511">3.2. Data cleaning using shell</a></li> </ul> </li> <li><a href="#orgb18a4a2">4. 
Process</a> <ul> <li><a href="#orgb0d6568">4.1. Data Aggregation with Python</a></li> <li><a href="#orgdff0aad">4.2. Counting Elements with awk</a></li> <li><a href="#org085bd77">4.3. Network Display with R</a></li> </ul> </li> <li><a href="#orga92537a">5. Preserve</a> <ul> <li><a href="#org5d7fafe">5.1. Manual export</a></li> <li><a href="#org4b3d26d">5.2. Automatic batch process</a></li> </ul> </li> </ul> </div> </div> <div id="outline-container-orgaaa978e" class="outline-2"> <h2 id="orgaaa978e"><span class="section-number-2">1.</span> Overview</h2> <div class="outline-text-2" id="text-1"> <p> This document provides insights into an efficient way of handling data. We show not only how to retrieve data from a publicly accessible web page but also how the data can be processed afterwards. We admit that the examples shown below definitely draw on the full range of available tools, but we consider this a proof of concept for how, in our modern technological world, plain text is still a great way of processing and documenting data workflows and analyses. </p> <p> The paper is divided into three main steps: first preparing, second processing, and last preserving the data and its documentation (fig. <a href="#org52a07e7">1</a>). </p> <div id="org52a07e7" class="figure"> <p><img src="img/nfdi-in-emacs-best-practice-overview.png" alt="nfdi-in-emacs-best-practice-overview.png" width="100%" /> </p> <p><span class="figure-number">Figure 1: </span>Workflow of the document. Source <a href="https://excalidraw.com/#room=8617c3374a9c2c2c895b,a_SoKClI-tyAxWfSgzThWQ">Excalidraw</a>.</p> </div> </div> </div> <div id="outline-container-org2afd9fe" class="outline-2"> <h2 id="org2afd9fe"><span class="section-number-2">2.</span> Introduce</h2> <div class="outline-text-2" id="text-2"> <p> What are Emacs and <b>org-mode</b>? Well, where to start? You may not have heard of Emacs or org-mode yet. Usually it is considered to be a tool for geeks, … 
this might be kind of true, but once you have noticed the myriad ways of using Emacs (<a href="#citeproc_bib_item_2">Hahn 2016</a>; <a href="#citeproc_bib_item_3">Kitchin, Gulick, and Zilinski 2016</a>; <a href="#citeproc_bib_item_5">Strobel and Uhl 1996</a>) and especially its module org-mode, you will never want to use anything else.<sup><a id="fnr.1" class="footref" href="#fn.1" role="doc-backlink">1</a></sup> Emacs has been around for decades (no kidding) and is free software. </p> <p> Org-mode is considerably younger, but it is the killer feature of Emacs. Or, to express it in the words of its original creator, Carsten Dominik: </p> <blockquote> <p> Org-mode does outlining, note-taking, hyperlinks, spreadsheets, TODO lists, project planning, GTD, HTML and LaTeX authoring, all with plain text files in Emacs. </p> </blockquote> <p> or in a nutshell: </p> <blockquote> <p> Back to the future for plain text<br /> (Carsten Dominik) </p> </blockquote> <p> Here is an executive summary of org-mode: </p> <ul class="org-ul"> <li>Module for <a href="https://emacs.org">Emacs</a></li> <li>Plain text based</li> <li>Around since 2003</li> <li>Meant for (scientific) text production and organisation <ul class="org-ul"> <li>project management</li> <li>agenda, diary, journaling</li> <li>personal knowledge management</li> <li>presentation</li> <li>single-source-publishing</li> <li>literate programming</li> </ul></li> <li><p> Extensible and fully customizable </p> <p> Org-mode is a magnificent tool when it comes to reproducible research (<a href="#citeproc_bib_item_4">Stanisic and Legrand 2014</a>), since it combines the analysis of a data set with thorough documentation of every step. 
</p></li> </ul> </div> </div> <div id="outline-container-org800c288" class="outline-2"> <h2 id="org800c288"><span class="section-number-2">3.</span> Prepare</h2> <div class="outline-text-2" id="text-3"> <p> For our demonstration, we are going to create a dataset from openly available data on the German National Research Data Infrastructure (<b>NFDI</b>) and perform some simple analysis tasks on it. </p> </div> <div id="outline-container-org217abe3" class="outline-3"> <h3 id="org217abe3"><span class="section-number-3">3.1.</span> Data retrieval using SPARQL</h3> <div class="outline-text-3" id="text-3-1"> <p> The data we are interested in exists on Wikidata. Wikidata is similar to Wikipedia, but rather than long-form articles, it stores structured data. This allows machines to easily access and traverse these pages with query languages. Here, we are going to submit a <code>SPARQL</code> query to the Wikidata query endpoint. </p> <p> SPARQL will look familiar to anyone familiar with SQL, although it is slightly more cryptic at first glance. Take a look at the query below: identifiers like “Q98270496” refer to specific items in Wikidata, whereas identifiers like “P31” refer to properties that describe an item. In English, this query translates to something like </p> <blockquote> <p> Give me the names of all items that are an instance of (P31) an NFDI consortium (Q98270496), and return everything you find on each of those items under the property “affiliations” (P1416). </p> </blockquote> <p> If you would like to learn how to do this in more detail, have a look at (<a href="#citeproc_bib_item_1">Bossert et al. 2023</a>). 
</p> <div class="org-src-container"> <label class="org-src-name"><span class="listing-number">Listing 1: </span>Retrieving the dataset from wikidata</label><pre class="src src-sparql" id="org0bd386c"><span class="linenr">1: </span><span class="org-keyword">SELECT</span> <span class="org-variable-name">?wLabel</span> <span class="org-variable-name">?pLabel</span> <span class="linenr">2: </span><span class="org-keyword">WHERE</span> <span class="linenr">3: </span>{ <span id="coderef-consortium" class="coderef-off"><span class="linenr">4: </span> <span class="org-variable-name">?p</span> wdt:P31 wd:Q98270496 . (consortium)</span> <span id="coderef-affiliations" class="coderef-off"><span class="linenr">5: </span> <span class="org-variable-name">?p</span> wdt:P1416 <span class="org-variable-name">?w</span> . (affiliations)</span> <span class="linenr">6: </span> <span class="org-keyword">SERVICE</span> wikibase:label { bd:serviceParam wikibase:language <span class="org-string">"en"</span> . } <span class="linenr">7: </span>} <span class="linenr">8: </span><span class="org-keyword">ORDER</span> <span class="org-keyword">BY</span> <span class="org-keyword">ASC</span>(<span class="org-variable-name">?wLabel</span>) <span class="org-keyword">ASC</span>(<span class="org-variable-name">?pLabel</span>) <span class="linenr">9: </span><span class="org-keyword">LIMIT</span> <span class="org-highlight-numbers-number">50</span> </pre> </div> <table id="orgca87794" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides"> <caption class="t-above"><span class="table-number">Table 1:</span> Result of the query for NFDI consortia and their institutions.</caption> <colgroup> <col class="org-left" /> <col class="org-left" /> </colgroup> <thead> <tr> <th scope="col" class="org-left">wLabel</th> <th scope="col" class="org-left">pLabel</th> </tr> </thead> <tbody> <tr> <td class="org-left">Q105775472</td> <td class="org-left">NFDI4Health</td> </tr> <tr> <td 
class="org-left">Q1117007</td> <td class="org-left">NFDI4Health</td> </tr> <tr> <td class="org-left">Q115254989</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Q1205424</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Q17575706</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Academy of Sciences and Humanities in Hamburg</td> <td class="org-left">Text+</td> </tr> <tr> <td class="org-left">Academy of Sciences and Literature Mainz</td> <td class="org-left">NFDI4Culture</td> </tr> <tr> <td class="org-left">Academy of Sciences and Literature Mainz</td> <td class="org-left">NFDI4Memory</td> </tr> <tr> <td class="org-left">Academy of Sciences and Literature Mainz</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Academy of Sciences and Literature Mainz</td> <td class="org-left">Text+</td> </tr> <tr> <td class="org-left">Alfred Wegener Institute for Polar and Marine Research</td> <td class="org-left">NFDI4Biodiversity</td> </tr> <tr> <td class="org-left">Alfred Wegener Institute for Polar and Marine Research</td> <td class="org-left">NFDI4DataScience</td> </tr> <tr> <td class="org-left">Alfred Wegener Institute for Polar and Marine Research</td> <td class="org-left">NFDI4Earth</td> </tr> <tr> <td class="org-left">Anthropological Society (Munich)</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Arachnologische Gesellschaft</td> <td class="org-left">NFDI4Biodiversity</td> </tr> <tr> <td class="org-left">Arbeitskreis Provenienzforschung e.V.</td> <td class="org-left">NFDI4Memory</td> </tr> <tr> <td class="org-left">Archivschule Marburg</td> <td class="org-left">NFDI4Memory</td> </tr> <tr> <td class="org-left">Archäologische Kommission für Niedersachsen</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Archäologisches Museum Hamburg und Stadtmuseum Harburg</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td 
class="org-left">Arthistoricum</td> <td class="org-left">NFDI4Culture</td> </tr> <tr> <td class="org-left">Association for Data-Intensive Radio Astronomy</td> <td class="org-left">PUNCH4NFDI</td> </tr> <tr> <td class="org-left">Association for Technology and Construction in Agriculture</td> <td class="org-left">FAIRAgro</td> </tr> <tr> <td class="org-left">Association of German Architects</td> <td class="org-left">NFDI4Culture</td> </tr> <tr> <td class="org-left">Association of Population Based Cancer Registries in Germany</td> <td class="org-left">NFDI4Health</td> </tr> <tr> <td class="org-left">Association of states archaeologists</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">BERD@NFDI</td> <td class="org-left">Base4NFDI</td> </tr> <tr> <td class="org-left">Bach-Archiv Leipzig</td> <td class="org-left">NFDI4Culture</td> </tr> <tr> <td class="org-left">Bauhaus-Universität Weimar</td> <td class="org-left">NFDI4Ing</td> </tr> <tr> <td class="org-left">Bavarian Academy of Sciences and Humanities</td> <td class="org-left">BERD@NFDI</td> </tr> <tr> <td class="org-left">Bavarian Academy of Sciences and Humanities</td> <td class="org-left">NFDI4Earth</td> </tr> <tr> <td class="org-left">Bavarian Academy of Sciences and Humanities</td> <td class="org-left">NFDI4Memory</td> </tr> <tr> <td class="org-left">Bavarian Academy of Sciences and Humanities</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Bavarian Academy of Sciences and Humanities</td> <td class="org-left">NFDIxCS</td> </tr> <tr> <td class="org-left">Bavarian Academy of Sciences and Humanities</td> <td class="org-left">PUNCH4NFDI</td> </tr> <tr> <td class="org-left">Bavarian Academy of Sciences and Humanities</td> <td class="org-left">Text+</td> </tr> <tr> <td class="org-left">Bavarian Forest National Park</td> <td class="org-left">NFDI4Biodiversity</td> </tr> <tr> <td class="org-left">Bavarian Natural History Collections</td> <td 
class="org-left">NFDI4Biodiversity</td> </tr> <tr> <td class="org-left">Bavarian Natural History Collections</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Bavarian State Archaeological Collection</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Bavarian State Archives</td> <td class="org-left">FAIRAgro</td> </tr> <tr> <td class="org-left">Bavarian State Archives</td> <td class="org-left">NFDI4Biodiversity</td> </tr> <tr> <td class="org-left">Bavarian State Archives</td> <td class="org-left">NFDI4Earth</td> </tr> <tr> <td class="org-left">Bavarian State Archives</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Bavarian State Library</td> <td class="org-left">NFDI4Culture</td> </tr> <tr> <td class="org-left">Bavarian State Library</td> <td class="org-left">NFDI4Memory</td> </tr> <tr> <td class="org-left">Bavarian State Research Center for Agriculture</td> <td class="org-left">FAIRAgro</td> </tr> <tr> <td class="org-left">Beethoven House</td> <td class="org-left">NFDI4Culture</td> </tr> <tr> <td class="org-left">Beilstein Institute for the Advancement of Chemical Sciences</td> <td class="org-left">NFDI4Chem</td> </tr> <tr> <td class="org-left">Berlin State Library</td> <td class="org-left">Base4NFDI</td> </tr> <tr> <td class="org-left">Berlin State Library</td> <td class="org-left">NFDI4Memory</td> </tr> </tbody> </table> </div> </div> <div id="outline-container-org9e8e511" class="outline-3"> <h3 id="org9e8e511"><span class="section-number-3">3.2.</span> Data cleaning using shell</h3> <div class="outline-text-3" id="text-3-2"> <p> The data returned by listing <a href="#org0bd386c">1</a> is a good start, but it needs further cleaning. </p> <p> We can see several entries in our data that look like “Q1234567”. These are Q-IDs of items for which no label has been defined. Let’s remove those from our dataset. 
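</p>
<p>
The cleaning rule just described can also be sketched in Python. This is only an illustration under stated assumptions, not the method used in this document: the four rows are copied from tab. <a href="#orgca87794">1</a>, while the names <code>raw_rows</code>, <code>BARE_QID</code> and <code>drop_unlabelled</code> are hypothetical. A row is dropped when its first column consists of nothing but a Q-ID.
</p>
<div class="org-src-container">
<pre class="src src-python">import re

# Hypothetical excerpt of the query result as [wLabel, pLabel] rows;
# the values are copied from Table 1.
raw_rows = [
    ["Q105775472", "NFDI4Health"],
    ["Q1117007", "NFDI4Health"],
    ["Academy of Sciences and Humanities in Hamburg", "Text+"],
    ["Arthistoricum", "NFDI4Culture"],
]

# A bare Q-ID is the letter Q followed only by digits.
BARE_QID = re.compile(r"Q[0-9]+")


def drop_unlabelled(rows):
    """Keep only rows whose first column is not a bare Q-ID."""
    return [row for row in rows if not BARE_QID.fullmatch(row[0])]


clean_rows = drop_unlabelled(raw_rows)
for row in clean_rows:
    print("\t".join(row))
</pre>
</div>
<p>
Because only the first column is tested, and against the whole pattern, a label that merely contains a “Q” followed by digits would be kept in this sketch.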
</p> <p> We’re going to include the output from the previous cell, where we executed the SPARQL query, as an input variable to this cell (<code>:var input=raw-dataset</code>). </p> <div class="org-src-container"> <label class="org-src-name"><span class="listing-number">Listing 2: </span>Cleaning the raw data using good old <code>sed</code> and a regex pattern.</label><pre class="src src-sh" id="org3e7cf90"><span class="org-type">echo</span> <span class="org-string">"</span><span class="org-string"><span class="org-constant">$</span></span><span class="org-string"><span class="org-variable-name">input</span></span><span class="org-string">"</span> | sed -E <span class="org-string">'/Q[0-9]+/d'</span> </pre> </div> <table id="org54060d3" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides"> <caption class="t-above"><span class="table-number">Table 2:</span> Cleaned data set which will be used for further processing.</caption> <colgroup> <col class="org-left" /> <col class="org-left" /> </colgroup> <thead> <tr> <th scope="col" class="org-left">wLabel</th> <th scope="col" class="org-left">pLabel</th> </tr> </thead> <tbody> <tr> <td class="org-left">Academy of Sciences and Humanities in Hamburg</td> <td class="org-left">Text+</td> </tr> <tr> <td class="org-left">Academy of Sciences and Literature Mainz</td> <td class="org-left">NFDI4Culture</td> </tr> <tr> <td class="org-left">Academy of Sciences and Literature Mainz</td> <td class="org-left">NFDI4Memory</td> </tr> <tr> <td class="org-left">Academy of Sciences and Literature Mainz</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Academy of Sciences and Literature Mainz</td> <td class="org-left">Text+</td> </tr> <tr> <td class="org-left">Alfred Wegener Institute for Polar and Marine Research</td> <td class="org-left">NFDI4Biodiversity</td> </tr> <tr> <td class="org-left">Alfred Wegener Institute for Polar and Marine Research</td> <td class="org-left">NFDI4DataScience</td> 
</tr> <tr> <td class="org-left">Alfred Wegener Institute for Polar and Marine Research</td> <td class="org-left">NFDI4Earth</td> </tr> <tr> <td class="org-left">Anthropological Society (Munich)</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Arachnologische Gesellschaft</td> <td class="org-left">NFDI4Biodiversity</td> </tr> <tr> <td class="org-left">Arbeitskreis Provenienzforschung e.V.</td> <td class="org-left">NFDI4Memory</td> </tr> <tr> <td class="org-left">Archivschule Marburg</td> <td class="org-left">NFDI4Memory</td> </tr> <tr> <td class="org-left">Archäologische Kommission für Niedersachsen</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Archäologisches Museum Hamburg und Stadtmuseum Harburg</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Arthistoricum</td> <td class="org-left">NFDI4Culture</td> </tr> <tr> <td class="org-left">Association for Data-Intensive Radio Astronomy</td> <td class="org-left">PUNCH4NFDI</td> </tr> <tr> <td class="org-left">Association for Technology and Construction in Agriculture</td> <td class="org-left">FAIRAgro</td> </tr> <tr> <td class="org-left">Association of German Architects</td> <td class="org-left">NFDI4Culture</td> </tr> <tr> <td class="org-left">Association of Population Based Cancer Registries in Germany</td> <td class="org-left">NFDI4Health</td> </tr> <tr> <td class="org-left">Association of states archaeologists</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">BERD@NFDI</td> <td class="org-left">Base4NFDI</td> </tr> <tr> <td class="org-left">Bach-Archiv Leipzig</td> <td class="org-left">NFDI4Culture</td> </tr> <tr> <td class="org-left">Bauhaus-Universität Weimar</td> <td class="org-left">NFDI4Ing</td> </tr> <tr> <td class="org-left">Bavarian Academy of Sciences and Humanities</td> <td class="org-left">BERD@NFDI</td> </tr> <tr> <td class="org-left">Bavarian Academy of Sciences and Humanities</td> <td 
class="org-left">NFDI4Earth</td> </tr> <tr> <td class="org-left">Bavarian Academy of Sciences and Humanities</td> <td class="org-left">NFDI4Memory</td> </tr> <tr> <td class="org-left">Bavarian Academy of Sciences and Humanities</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Bavarian Academy of Sciences and Humanities</td> <td class="org-left">NFDIxCS</td> </tr> <tr> <td class="org-left">Bavarian Academy of Sciences and Humanities</td> <td class="org-left">PUNCH4NFDI</td> </tr> <tr> <td class="org-left">Bavarian Academy of Sciences and Humanities</td> <td class="org-left">Text+</td> </tr> <tr> <td class="org-left">Bavarian Forest National Park</td> <td class="org-left">NFDI4Biodiversity</td> </tr> <tr> <td class="org-left">Bavarian Natural History Collections</td> <td class="org-left">NFDI4Biodiversity</td> </tr> <tr> <td class="org-left">Bavarian Natural History Collections</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Bavarian State Archaeological Collection</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Bavarian State Archives</td> <td class="org-left">FAIRAgro</td> </tr> <tr> <td class="org-left">Bavarian State Archives</td> <td class="org-left">NFDI4Biodiversity</td> </tr> <tr> <td class="org-left">Bavarian State Archives</td> <td class="org-left">NFDI4Earth</td> </tr> <tr> <td class="org-left">Bavarian State Archives</td> <td class="org-left">NFDI4Objects</td> </tr> <tr> <td class="org-left">Bavarian State Library</td> <td class="org-left">NFDI4Culture</td> </tr> <tr> <td class="org-left">Bavarian State Library</td> <td class="org-left">NFDI4Memory</td> </tr> <tr> <td class="org-left">Bavarian State Research Center for Agriculture</td> <td class="org-left">FAIRAgro</td> </tr> <tr> <td class="org-left">Beethoven House</td> <td class="org-left">NFDI4Culture</td> </tr> <tr> <td class="org-left">Beilstein Institute for the Advancement of Chemical Sciences</td> <td 
class="org-left">NFDI4Chem</td> </tr> <tr> <td class="org-left">Berlin State Library</td> <td class="org-left">Base4NFDI</td> </tr> <tr> <td class="org-left">Berlin State Library</td> <td class="org-left">NFDI4Memory</td> </tr> </tbody> </table> </div> </div> </div> <div id="outline-container-orgb18a4a2" class="outline-2"> <h2 id="orgb18a4a2"><span class="section-number-2">4.</span> Process</h2> <div class="outline-text-2" id="text-4"> </div> <div id="outline-container-orgb0d6568" class="outline-3"> <h3 id="orgb0d6568"><span class="section-number-3">4.1.</span> Data Aggregation with Python</h3> <div class="outline-text-3" id="text-4-1"> <p> The great thing about org-mode is that we can seamlessly switch between languages! Our original query (listing <a href="#org0bd386c">1</a>) was written in SPARQL, which returned a kind of table (tab. <a href="#orgca87794">1</a>). We then took that table and ran a shell command on it. Now, we’re going to take the output of that shell command (cf. tab. <a href="#org54060d3">2</a>) and run some Python code on it. 
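</p>
<p>
Before turning to pandas, the hand-over between blocks can be sketched with the standard library alone. The following is a minimal sketch, assuming the table arrives as a list of rows with the header first, which is how listing <a href="#org2e1face">3</a> treats its input. The three data rows are copied from tab. <a href="#org54060d3">2</a>; <code>header</code>, <code>rows</code>, <code>counts</code> and <code>result</code> are names chosen for illustration.
</p>
<div class="org-src-container">
<pre class="src src-python">from collections import Counter

# Hypothetical excerpt of the cleaned table, header row first,
# mirroring the list-of-lists shape that Listing 3 works with.
clean_df = [
    ["wLabel", "pLabel"],
    ["Bavarian State Library", "NFDI4Culture"],
    ["Bavarian State Library", "NFDI4Memory"],
    ["Beethoven House", "NFDI4Culture"],
]

header, rows = clean_df[0], clean_df[1:]

# Count how often each institution (first column) occurs.
counts = Counter(row[0] for row in rows)

# most_common() sorts by descending count. The None after the
# header row follows the return value used in Listing 3.
result = [[header[0], "Count"], None, *map(list, counts.most_common())]
print(result)
</pre>
</div>
<p>
This variant avoids the pandas dependency for small tables; for anything beyond simple counting, the DataFrame approach of listing <a href="#org2e1face">3</a> is likely the more convenient choice.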
</p> <div class="org-src-container"> <pre class="src src-sh">python -m pip install pandas --user </pre> </div> <div class="org-src-container"> <label class="org-src-name"><span class="listing-number">Listing 3: </span>Counting the number of consortia each institution is involved in.</label><pre class="src src-python" id="org2e1face"><span class="linenr"> 1: </span><span class="org-keyword">import</span> pandas <span class="org-keyword">as</span> pd <span class="linenr"> 2: </span> <span class="linenr"> 3: </span><span class="org-comment-delimiter"># </span><span class="org-comment">The data comes into the cell as a list of lists.</span> <span class="linenr"> 4: </span><span class="org-comment-delimiter"># </span><span class="org-comment">The first row holds the column names, the rest is data.</span> <span class="linenr"> 5: </span><span class="org-variable-name">df</span> <span class="org-operator">=</span> pd.DataFrame(clean_df[<span class="org-highlight-numbers-number">1</span>:], columns<span class="org-operator">=</span>clean_df[<span class="org-highlight-numbers-number">0</span>]) <span class="linenr"> 6: </span> <span class="linenr"> 7: </span><span class="org-comment-delimiter"># </span><span class="org-comment">Group by wLabel (institution), count the rows per group,</span> <span class="linenr"> 8: </span><span class="org-comment-delimiter"># </span><span class="org-comment">sort descending and name the new column "Count"</span> <span class="linenr"> 9: </span><span class="org-variable-name">institutions_by_consortia</span> <span class="org-operator">=</span> ( <span class="linenr">10: </span> df <span class="linenr">11: </span> .groupby(<span class="org-string">"wLabel"</span>) <span class="linenr">12: </span> .size() <span class="linenr">13: </span> .sort_values(ascending<span class="org-operator">=</span><span class="org-constant">False</span>) <span class="linenr">14: </span> .reset_index(name<span class="org-operator">=</span><span class="org-string">"Count"</span>)) <span 
class="linenr">15: </span> <span class="linenr">16: </span><span class="org-comment-delimiter"># </span><span class="org-comment">Return our dataframe in a way that org will</span> <span class="linenr">17: </span><span class="org-comment-delimiter"># </span><span class="org-comment">display it as an org table</span> <span class="linenr">18: </span><span class="org-keyword">return</span> [<span class="org-builtin">list</span>(institutions_by_consortia.columns), <span class="linenr">19: </span> <span class="org-constant">None</span>, <span class="org-operator">*</span><span class="org-builtin">map</span>(<span class="org-builtin">list</span>, institutions_by_consortia.values)] </pre> </div> <table id="orgd2906a2" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides"> <caption class="t-above"><span class="table-number">Table 3:</span> Overview of institutions and the count of their associated consortia.</caption> <colgroup> <col class="org-left" /> <col class="org-right" /> </colgroup> <thead> <tr> <th scope="col" class="org-left">wLabel</th> <th scope="col" class="org-right">Count</th> </tr> </thead> <tbody> <tr> <td class="org-left">Bavarian Academy of Sciences and Humanities</td> <td class="org-right">7</td> </tr> <tr> <td class="org-left">Bavarian State Archives</td> <td class="org-right">4</td> </tr> <tr> <td class="org-left">Academy of Sciences and Literature Mainz</td> <td class="org-right">4</td> </tr> <tr> <td class="org-left">Alfred Wegener Institute for Polar and Marine Research</td> <td class="org-right">3</td> </tr> <tr> <td class="org-left">Berlin State Library</td> <td class="org-right">2</td> </tr> <tr> <td class="org-left">Bavarian State Library</td> <td class="org-right">2</td> </tr> <tr> <td class="org-left">Bavarian Natural History Collections</td> <td class="org-right">2</td> </tr> <tr> <td class="org-left">BERD@NFDI</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Beilstein Institute for the Advancement of 
Chemical Sciences</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Beethoven House</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Bavarian State Research Center for Agriculture</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Bavarian State Archaeological Collection</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Bavarian Forest National Park</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Bauhaus-Universität Weimar</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Bach-Archiv Leipzig</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Academy of Sciences and Humanities in Hamburg</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Association of Population Based Cancer Registries in Germany</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Association of German Architects</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Association for Technology and Construction in Agriculture</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Association for Data-Intensive Radio Astronomy</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Arthistoricum</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Archäologisches Museum Hamburg und Stadtmuseum Harburg</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Archäologische Kommission für Niedersachsen</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Archivschule Marburg</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Arbeitskreis Provenienzforschung e.V.</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Arachnologische Gesellschaft</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Anthropological Society (Munich)</td> <td class="org-right">1</td> </tr> <tr> <td class="org-left">Association of states archaeologists</td> <td 
class="org-right">1</td> </tr> </tbody> </table> <p> There is also a “native way” of getting the counting done, using the package <code>org-aggregate</code><sup><a id="fnr.2" class="footref" href="#fn.2" role="doc-backlink">2</a></sup>. </p> </div> </div> <div id="outline-container-orgdff0aad" class="outline-3"> <h3 id="orgdff0aad"><span class="section-number-3">4.2.</span> Counting Elements with awk</h3> <div class="outline-text-3" id="text-4-2"> <p> We’re not limited to Python though. Here we’re going to perform a very similar aggregation, but grouping by consortium to get the number of institutions in each. Like listing <a href="#org2e1face">3</a> above, we are going to use the output of listing <a href="#org3e7cf90">2</a> (cf. tab. <a href="#org54060d3">2</a>) to perform this operation. Instead of Python, we’re going to use <code>awk</code> for our data processing. </p> <p> As an additional bonus, we’re going to parameterize this cell by defining a variable called <code>consortium</code>. With this we can reuse the code in this cell over and over, changing the consortium name to show only the desired results. 
</p> <div class="org-src-container"> <label class="org-src-name"><span class="listing-number">Listing 4: </span>Calculating the number of institutions involved in one specific consortium.</label><pre class="src src-awk" id="orgd933405"><span class="linenr"> 1: </span><span class="org-keyword">BEGIN</span> { <span class="linenr"> 2: </span><span class="org-comment-delimiter"># </span><span class="org-comment">this block runs once,</span> <span class="linenr"> 3: </span><span class="org-comment-delimiter"># </span><span class="org-comment">before any input is read</span> <span class="linenr"> 4: </span><span class="org-comment-delimiter"># </span><span class="org-comment">set the field separator to tab</span> <span class="linenr"> 5: </span> <span class="org-variable-name">FS</span> = <span class="org-string">"\t"</span> <span class="linenr"> 6: </span>} <span class="linenr"> 7: </span> <span class="org-comment-delimiter"># </span><span class="org-comment">MAIN section, evaluated for every input row</span> <span class="linenr"> 8: </span> <span class="org-comment-delimiter">#</span><span class="org-comment">----------------------------------------</span> <span class="linenr"> 9: </span> <span class="org-comment-delimiter"># </span><span class="org-comment">while going through the rows of the input</span> <span class="linenr">10: </span> <span class="org-comment-delimiter"># </span><span class="org-comment">look only at the second column; if it equals 'consortium',</span> <span class="linenr">11: </span> <span class="org-comment-delimiter"># </span><span class="org-comment">increment the matching counter stored in 'counts'</span> <span class="linenr">12: </span> $2 == consortium { ++counts[$2] } <span class="linenr">13: </span><span class="org-keyword">END</span> { <span class="linenr">14: </span> <span class="org-comment-delimiter"># </span><span class="org-comment">final part, run after all input has been read:</span> <span class="linenr">15: </span> <span 
class="org-comment-delimiter"># </span><span class="org-comment">only collecting and printing results</span> <span class="linenr">16: </span> <span class="org-comment-delimiter"># </span><span class="org-comment">going through the counts from above</span> <span class="linenr">17: </span> <span class="org-keyword">for</span> (k <span class="org-keyword">in</span> counts) <span class="linenr">18: </span> <span class="org-comment-delimiter"># </span><span class="org-comment">check the number of associated institutions</span> <span id="coderef-singular" class="coderef-off"><span class="linenr">19: </span> <span class="org-keyword">if</span> (counts[k] == <span class="org-highlight-numbers-number">1</span>) (singular)</span> <span class="linenr">20: </span> <span class="org-comment-delimiter"># </span><span class="org-comment">if there is only one institution, use the singular form</span> <span class="linenr">21: </span> <span class="org-preprocessor">print</span> consortium <span class="org-string">" ("</span> counts[k] <span class="org-string">" institution)"</span>; <span class="linenr">22: </span> <span class="org-comment-delimiter"># </span><span class="org-comment">otherwise we need the plural form.</span> <span class="linenr">23: </span> <span class="org-keyword">else</span> <span class="org-preprocessor">print</span> consortium <span class="org-string">" ("</span> counts[k] <span class="org-string">" institutions)"</span> <span class="linenr">24: </span>} </pre> </div> <p> Having created the source block, we can also use it in our text by executing the function <code>call_institutions-count('NFDI4Objects')</code>. The result blends smoothly into the text and is updated automatically whenever the initial data set changes. </p> <p> Back to our example: now we know how many institutions are involved in NFDI4Objects (9 institutions) or in NFDI4Earth (3 institutions). 
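</p>
<p>
For readers more at home in Python, the same parameterized count can be sketched as follows. This is an illustration under stated assumptions rather than part of the workflow: the rows are an excerpt of tab. <a href="#org54060d3">2</a>, and the function name <code>institutions_count</code> is hypothetical, chosen to echo the name of the source block above.
</p>
<div class="org-src-container">
<pre class="src src-python"># Hypothetical excerpt of Table 2 as (institution, consortium) pairs.
rows = [
    ("Alfred Wegener Institute for Polar and Marine Research", "NFDI4Earth"),
    ("Bavarian Academy of Sciences and Humanities", "NFDI4Earth"),
    ("Bavarian State Archives", "NFDI4Earth"),
    ("Beilstein Institute for the Advancement of Chemical Sciences", "NFDI4Chem"),
]


def institutions_count(rows, consortium):
    """Return e.g. 'NFDI4Earth (3 institutions)' for one consortium."""
    # Count the rows whose second column equals the requested consortium.
    count = sum(1 for _, label in rows if label == consortium)
    # Pick the singular or plural noun, as the awk listing does.
    noun = "institution" if count == 1 else "institutions"
    return f"{consortium} ({count} {noun})"


print(institutions_count(rows, "NFDI4Earth"))
print(institutions_count(rows, "NFDI4Chem"))
</pre>
</div>
<p>
One difference to listing <a href="#orgd933405">4</a>: the awk version prints nothing when no row matches, whereas this sketch reports a count of zero.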
</p> </div> </div> <div id="outline-container-org085bd77" class="outline-3"> <h3 id="org085bd77"><span class="section-number-3">4.3.</span> Network Display with R</h3> <div class="outline-text-3" id="text-4-3"> <p> How about something a little more visual than some tables? We can also create plots and visuals, generating them with the code contained in the document and embedding the results in the output. </p> <p> And while we’re at it, how about another language? This time we’ll use R to make a simple network plot of our data. Again, we’re still using the output from listing <a href="#org3e7cf90">2</a> (which is tab. <a href="#org54060d3">2</a>) to do this. </p> <p> The result is a nice visualization of a network (fig. <a href="#orge5cf2dc">2</a>). Such a visualization can help to detect outliers faster. </p> <div class="org-src-container"> <label class="org-src-name"><span class="listing-number">Listing 5: </span>Network of all institutions and their related consortia.</label><pre class="src src-R" id="org5e90e20"><span class="linenr"> 1: </span><span class="org-comment-delimiter"># </span><span class="org-comment">making sure the required package is installed</span> <span class="linenr"> 2: </span><span class="org-ess-keyword">if</span> (!<span class="org-ess-modifiers">require</span>(<span class="org-string">"igraph"</span>)) install.packages(<span class="org-string">"igraph"</span>) <span class="linenr"> 3: </span><span class="org-ess-modifiers">library</span>(<span class="org-string">"igraph"</span>) <span class="linenr"> 4: </span><span class="org-comment-delimiter"># </span><span class="org-comment">making the layout reproducible by setting a seed number</span> <span class="linenr"> 5: </span>set.seed(<span class="org-highlight-numbers-number">123456789</span>) <span class="linenr"> 6: </span><span class="org-comment-delimiter"># </span><span class="org-comment">convert the tabular data into a data frame which is required</span> <span class="linenr"> 7: 
</span><span class="org-comment-delimiter"># </span><span class="org-comment">for creating a network</span> <span class="linenr"> 8: </span>NFDI_network <span class="org-ess-assignment">&lt;-</span> graph_from_data_frame(NFDI_edges, <span class="linenr"> 9: </span> directed = <span class="org-ess-constant">FALSE</span>) <span class="linenr">10: </span>plot(NFDI_network, <span class="org-comment-delimiter"># </span><span class="org-comment">the network object created above</span> <span class="linenr">11: </span> main = <span class="org-string">"NFDI Network"</span>, <span class="org-comment-delimiter"># </span><span class="org-comment">adding a title</span> <span class="linenr">12: </span> <span class="org-comment-delimiter"># </span><span class="org-comment">color nodes from the second column red, all other nodes blue</span> <span class="linenr">13: </span> vertex.color = c(<span class="org-string">"blue"</span>, <span class="org-string">"red"</span>)<span class="org-comment-delimiter">#</span> <span class="linenr">14: </span> [<span class="org-highlight-numbers-number">1</span> + names(V(NFDI_network)) <span class="org-ess-XopX">%in%</span> NFDI_edges[,<span class="org-highlight-numbers-number">2</span>]], <span class="linenr">15: </span> vertex.size = <span class="org-highlight-numbers-number">4</span>, <span class="org-comment-delimiter"># </span><span class="org-comment">size of the node</span> <span class="linenr">16: </span> vertex.frame.color = <span class="org-ess-constant">NA</span>, <span class="org-comment-delimiter"># </span><span class="org-comment">no frame for nodes</span> <span class="linenr">17: </span> vertex.label = <span class="org-ess-constant">NA</span>, <span class="org-comment-delimiter"># </span><span class="org-comment">no labels on the nodes</span> <span class="linenr">18: </span> edge.curved = <span class="org-highlight-numbers-number">0.2</span> <span class="org-comment-delimiter"># </span><span class="org-comment">curvature of the edges</span> <span
class="linenr">19: </span> ) </pre> </div> <div id="orge5cf2dc" class="figure"> <p><img src="img/nfdi-network.png" alt="nfdi-network.png" width="70%" /> </p> <p><span class="figure-number">Figure 2: </span>Network of NFDI consortia (red) and institutions (blue).</p> </div> </div> </div> </div> <div id="outline-container-orga92537a" class="outline-2"> <h2 id="orga92537a"><span class="section-number-2">5.</span> Preserve</h2> <div class="outline-text-2" id="text-5"> <p> There are two ways of exporting this document into multiple formats. The underlying concept is called “single-source publishing”: we have one document, our org file, and we export it into different formats, each suited to a different occasion. </p> </div> <div id="outline-container-org5d7fafe" class="outline-3"> <h3 id="org5d7fafe"><span class="section-number-3">5.1.</span> Manual export</h3> <div class="outline-text-3" id="text-5-1"> <p> The common approach is to invoke the export command for each format individually and by hand. Org-mode has a great built-in export mechanism which converts the document into the most commonly used formats. You reach the export menu by calling <code>SPC m e</code> or <code>C-c C-e</code> and then select the export format you would like to have. </p> <p> In tab. <a href="#org76dd757">4</a> you find a quick overview of some basic formats.
</p> <table id="org76dd757" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides"> <caption class="t-above"><span class="table-number">Table 4:</span> Overview of various individual export functions.</caption> <colgroup> <col class="org-left" /> <col class="org-left" /> <col class="org-left" /> </colgroup> <thead> <tr> <th scope="col" class="org-left">Format</th> <th scope="col" class="org-left">evil keys</th> <th scope="col" class="org-left">normal keys</th> </tr> </thead> <tbody> <tr> <td class="org-left">PDF</td> <td class="org-left"><code>SPC m e l o</code></td> <td class="org-left"><code>C-c C-e l o</code></td> </tr> <tr> <td class="org-left">HTML</td> <td class="org-left"><code>SPC m e h o</code></td> <td class="org-left"><code>C-c C-e h o</code></td> </tr> <tr> <td class="org-left">ASCII</td> <td class="org-left"><code>SPC m e t a</code></td> <td class="org-left"><code>C-c C-e t a</code></td> </tr> </tbody> </table> </div> </div> <div id="outline-container-org4b3d26d" class="outline-3"> <h3 id="org4b3d26d"><span class="section-number-3">5.2.</span> Automatic batch process</h3> <div class="outline-text-3" id="text-5-2"> <p> In a batch process the file is opened in a clean and neutral Emacs session and exported without any manual interaction (see listing <a href="#orgc45f15d">6</a>).
</p> <div class="org-src-container"> <label class="org-src-name"><span class="listing-number">Listing 6: </span>Exporting file into various formats</label><pre class="src src-emacs-lisp" id="orgc45f15d"><span class="linenr"> 1: </span><span class="org-rainbow-delimiters-depth-1">(</span><span class="org-keyword">let</span> <span class="org-rainbow-delimiters-depth-2">(</span><span class="org-rainbow-delimiters-depth-3">(</span>org-file <span class="org-rainbow-delimiters-depth-4">(</span><span class="org-constant">find-file-noselect</span> filename<span class="org-rainbow-delimiters-depth-4">)</span><span class="org-rainbow-delimiters-depth-3">)</span><span class="org-rainbow-delimiters-depth-2">)</span> <span class="linenr"> 2: </span> <span class="org-rainbow-delimiters-depth-2">(</span><span class="org-keyword">with-current-buffer</span> org-file <span class="linenr"> 3: </span> <span class="org-rainbow-delimiters-depth-3">(</span><span class="org-function-name">org-html-export-to-html</span><span class="org-rainbow-delimiters-depth-3">)</span> <span class="linenr"> 4: </span> <span class="org-rainbow-delimiters-depth-3">(</span><span class="org-constant">message</span> <span class="org-string">"HTML export successful."</span><span class="org-rainbow-delimiters-depth-3">)</span> <span class="linenr"> 5: </span> <span class="org-rainbow-delimiters-depth-2">)</span> <span class="linenr"> 6: </span> <span class="org-rainbow-delimiters-depth-2">(</span><span class="org-keyword">with-current-buffer</span> org-file <span class="linenr"> 7: </span> <span class="org-rainbow-delimiters-depth-3">(</span><span class="org-function-name">org-ascii-export-to-ascii</span><span class="org-rainbow-delimiters-depth-3">)</span> <span class="linenr"> 8: </span> <span class="org-rainbow-delimiters-depth-3">(</span><span class="org-constant">message</span> <span class="org-string">"ASCII export successful."</span><span class="org-rainbow-delimiters-depth-3">)</span> <span 
class="linenr"> 9: </span> <span class="org-rainbow-delimiters-depth-2">)</span> <span class="linenr">10: </span> <span class="org-rainbow-delimiters-depth-2">(</span><span class="org-keyword">with-current-buffer</span> org-file <span class="linenr">11: </span> <span class="org-rainbow-delimiters-depth-3">(</span><span class="org-function-name">org-latex-export-to-pdf</span><span class="org-rainbow-delimiters-depth-3">)</span> <span class="linenr">12: </span> <span class="org-rainbow-delimiters-depth-3">(</span><span class="org-constant">message</span> <span class="org-string">"PDF export successful."</span><span class="org-rainbow-delimiters-depth-3">)</span> <span class="linenr">13: </span> <span class="org-rainbow-delimiters-depth-2">)</span><span class="org-rainbow-delimiters-depth-1">)</span> </pre> </div> <style>.csl-entry{text-indent: -1.5em; margin-left: 1.5em;}</style><div class="csl-bib-body"> <div class="csl-entry"><a id="citeproc_bib_item_1"></a>Bossert, Lukas C., Magdalene Cyra, Évariste Demandt, Matthias Fingerhuth, and Ceren Yildiz. 2023. “Das Muss Noch in Wikidata Rein.” <i>Bausteine Fdm</i>, September, 2–18. <a href="https://doi.org/10.17192/bfdm.2023.5.8580">https://doi.org/10.17192/bfdm.2023.5.8580</a>.</div> <div class="csl-entry"><a id="citeproc_bib_item_2"></a>Hahn, Harley. 2016. <i>Harley Hahn’s Emacs Field Guide</i>. Apress. <a href="https://doi.org/10.1007/978-1-4842-1703-0">https://doi.org/10.1007/978-1-4842-1703-0</a>.</div> <div class="csl-entry"><a id="citeproc_bib_item_3"></a>Kitchin, John R., Ana E. Van Gulick, and Lisa D. Zilinski. 2016. “Automating Data Sharing through Authoring Tools.” <i>International Journal on Digital Libraries</i> 18 (2): 93–98. <a href="https://doi.org/10.1007/s00799-016-0173-7">https://doi.org/10.1007/s00799-016-0173-7</a>.</div> <div class="csl-entry"><a id="citeproc_bib_item_4"></a>Stanisic, Luka, and Arnaud Legrand. 2014. 
“Effective Reproducible Research with Org-Mode and Git.” In <i>Euro-Par 2014: Parallel Processing Workshops</i>, edited by Luís Lopes, Julius Žilinskas, Alexandru Costan, Roberto G. Cascella, Gabor Kecskemeti, Emmanuel Jeannot, Mario Cannataro, et al., 475–86. Cham: Springer International Publishing.</div> <div class="csl-entry"><a id="citeproc_bib_item_5"></a>Strobel, Stefan, and Thomas Uhl. 1996. “GNU Emacs.” In <i>Linux Unleashing the Workstation in Your PC</i>, 287–324. Springer US. <a href="https://doi.org/10.1007/978-1-4684-0247-6_13">https://doi.org/10.1007/978-1-4684-0247-6_13</a>.</div> </div> </div> </div> </div> <div id="footnotes"> <h2 class="footnotes">Footnotes: </h2> <div id="text-footnotes"> <div class="footdef"><sup><a id="fn.1" class="footnum" href="#fnr.1" role="doc-backlink">1</a></sup> <div class="footpara" role="doc-footnote"><p class="footpara"> There might be people having a different opinion. </p></div></div> <div class="footdef"><sup><a id="fn.2" class="footnum" href="#fnr.2" role="doc-backlink">2</a></sup> <div class="footpara" role="doc-footnote"><p class="footpara"> <a href="https://github.com/tbanel/orgaggregate">https://github.com/tbanel/orgaggregate</a> </p></div></div> </div> </div></div> <div id="postamble" class="status"> <p class="author">Author: Jonathan A. Hartman | Lukas C. Bossert</p> <p class="date">Created: 2023-09-19 Tue 08:29</p> </div> </body> </html>