expoweb/handbook/computing/logbooks-parsing.html

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>CUCC Expedition Handbook: Logbook import</title>
<link rel="stylesheet" type="text/css" href="../../css/main2.css" />
</head>
<body><style>body { background: #fff url(/images/style/bg-system.png) repeat-x 0 0 }</style>
<h2 id="tophead">CUCC Expedition Handbook</h2>
<h1>Logbooks Import</h1>

    <!-- Yes we need some proper context-marking here, breadcrumb trails or something.
        Maybe a colour scheme for just this sequence of pages
    -->


<h3 id="import">Importing the logbook into troggle</h3>
<p>This is usually done after expo, but it is an excellent idea to have a nerd do this a couple of times during expo to discover problems while the people are still around to ask.

<p>The nerd needs to log in to the expo server using <em>their own userid</em>, not the 'expo' userid. The nerd also needs to be in the group that is allowed to use 'sudo'.

<h4>Ideal situation</h4>
<p>Ideally this would all be done on a stand-alone laptop to get the bugs in the logbook parsing sorted out before we upload the corrected file to the server. Unfortunately this requires a full troggle software development laptop as the parser is built into troggle. The <var>expo laptop</var> in the potato hut is not set up to do this (yet - 2022).
<p>However, the <var>expo laptop</var> (or any 'bulk update' laptop) is configured to allow an authorized user to log in to the server itself and to run the import process directly on the server.

<h4>Importing the Blog</h4>
<p>During expo lots of people post text and photos to the UK Caving (rope competition) website. During the winter after expo, an extra nerd task is to fold all those entries into the main logbook so that
the trips are indexed and we can see who was doing what where.
<p>This is sufficiently complicated that it is documented
<a href="log-blog-parsing.html">in another page</a>. But read this page first.

<h4>Current situation</h4>
<p>The nerd needs to do this:
<ol>
<li>Look at the list of pre-existing old import errors at <a href="/dataissues">Data Issues</a> <br />

<li>You need to get the list of people on expo sorted out first. <br />
This is documented in the <a href="folkupdate.html">Folk Update</a> process.
<li>Log in to the expo server and run the update script (see below for details)
<li>Watch the error messages scroll by: they are more detailed than the messages archived in the old import errors list
<li>Edit the logbook.html file to fix the errors. These are usually typos, non-unique tripdate ids or unrecognised people. Some unrecognised people will mean that you have to fix them using the  <a href="folkupdate.html">Folk Update</a> process first.
<li>Re-run the import script until you have got rid of all the import errors.
<li>Pat self on back. Future data managers and people trying to find missing surveys will worship you.
</ol>
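<p>The non-unique ids mentioned in the list above can be looked for before re-running the import. The following is a minimal Python sketch, not part of troggle, which assumes that the tripdate ids are ordinary HTML <var>id</var> attributes in logbook.html. The names <var>IdCollector</var> and <var>duplicate_ids</var> are invented for this illustration.</p>

<pre><code>from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Count every id attribute seen in the document (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lower-cased
        for name, value in attrs:
            if name == "id" and value:
                self.ids[value] += 1


def duplicate_ids(html_text):
    """Return the sorted id values that occur more than once."""
    collector = IdCollector()
    collector.feed(html_text)
    collector.close()
    # every counted id occurs at least once, so "not 1" means "repeated"
    return sorted(i for i, n in collector.ids.items() if n != 1)
</code></pre>

<p>Calling <var>duplicate_ids</var> on the text of logbook.html returns the ids that need to be made unique. An empty list only means that this check found no clash: the import may still complain about other things.</p>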

<p>The procedure is like this. It will be familiar to you because
you will have already done most of this for the <a href="folkupdate.html">Folk Update</a> process.

<pre><code>ssh  expo@expo.survex.com
cd troggle
python databaseReset.py logbooks
</code></pre>

<p>It will produce a list of errors like those below, starting with the most recent logbook, which will be the one for the expo you are working on.
You can abort the script (Ctrl-C) once you have got the errors for the current expo that you are going to fix.
<pre><code>Loading Logbook for: 2017
 - Parsing logbook: 2017/logbook.html
 - Using parser: Parseloghtmltxt
Calculating GetPersonExpeditionNameLookup for 2017
   - No name match for: 'Phil'
   - No name match for: 'everyone'
   - No name match for: 'et al.'
("can't parse: ", u'\n\n&lt;img src="logbkimg5.jpg" alt="New Topo" /&gt;\n\n')
   - No name match for: 'Goulash Regurgitation'
   - Skipping logentry: Via Ferata: Intersport - Klettersteig - no author for entry
   - No name match for: 'mike'
   - No name match for: 'Mike'</code></pre>

<p>Errors are usually caused by misplaced or duplicated &lt;hr /&gt; tags; by names which are missing or are not specific enough to be recognised by the parser (though it tries hard), such as "everyone" or "et al."; or by a bit of description which has been put into the names section, such as "Goulash Regurgitation".
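<p>When the output is long it can help to tally the unrecognised names so that the most frequent ones get fixed first. This is a minimal Python sketch, not part of troggle. It assumes that the import output has been saved to a text file and that the message wording matches the sample above, which may differ between troggle versions. The function name <var>unmatched_names</var> is invented for this illustration.</p>

<pre><code>import re
from collections import Counter

# Wording copied from the sample output above (an assumption about other runs)
NO_MATCH = re.compile(r"No name match for: '(.*)'")


def unmatched_names(log_text):
    """Count how often each unrecognised name is reported in the output."""
    counts = Counter()
    for line in log_text.splitlines():
        found = NO_MATCH.search(line)
        if found:
            counts[found.group(1)] += 1
    return counts
</code></pre>

<p>The result is a <var>Counter</var>, so <var>unmatched_names(text).most_common()</var> lists the names with the most complaints first. Note that 'mike' and 'Mike' are counted separately, exactly as they appear in the output.</p>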

<p>When you have sorted out the logbook formatting and the import is no longer complaining,
you will need to do a full database reset, because the logbook import will have trashed the online database and none of the troggle webpages will be working:
<pre><code>ssh  expo@expo.survex.com
cd troggle
python databaseReset.py reset
</code></pre>
<p>This full reset takes between 5 and 15 minutes on the server.
<h3 id="format">The logbook format</h3>
<p>This is documented on the <a href="../logbooks.html#format">logbook user-documentation page</a> as even expoers who can do nothing else technical can at least write up their logbook entries.

<h3 id="history">Historical logbooks format</h3>
<p>Older logbooks (prior to 2007) were stored as logbook.txt with just a bit of consistent markup to allow troggle parsing.</p>

<p>The formatting was largely freeform, with a bit of markup ('===' around the header; bars separating the date, &lt;place&gt; - &lt;description&gt;, and who) which allowed the troggle import script to read it correctly. The underlines show who wrote the entry. </p>
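<p>To make the old header markup concrete, here is a minimal Python sketch of how a header line such as <var>===2009-07-21|204 - Rigging entrance series| Becka Lawson, Emma Wilson ===</var> splits into its parts. It is an illustration only, not the code troggle used; the function name <var>parse_header</var> is invented, and the underlining that marked the author is not handled.</p>

<pre><code>def parse_header(line):
    """Split '===date|place - description|who===' into its parts."""
    # drop surrounding whitespace, then the '=' markers at both ends
    body = line.strip().strip("=")
    # exactly three bar-separated fields are expected (ValueError otherwise)
    date, title, who = (part.strip() for part in body.split("|"))
    # the place comes before the first ' - ', the description after it
    place, _, description = title.partition(" - ")
    people = [name.strip() for name in who.split(",") if name.strip()]
    return date, place, description, people
</code></pre>

<p>If the middle field has no ' - ' separator, the whole field is returned as the place and the description is empty.</p>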
<p>There were also several previous (different) styles of using HTML. The one we are using now is the 5th variant. These older variants were eventually all reformatted into the current HTML format so that now (Jan. 2023) we only need to maintain the code for one parser.

<!--
<p>So the format should be:</p>

<code>
===2009-07-21|204 - Rigging entrance series| Becka Lawson, Emma Wilson ===
</br>
&#123;Text of logbook entry&#125;
</br>
T/U: Jess 1 hr, Emma 0.5 hr
</code>
-->
<hr />
<p>
Back to <a href="../logbooks.html">Logbooks for Cavers</a> documentation.<br>
Forward to <a href="log-blog-parsing.html">Importing the UK Caving Blog</a>.
<hr />

</body>
</html>