
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>CUCC Expedition Handbook: Logbook import</title>
<link rel="stylesheet" type="text/css" href="../../css/main2.css" />
</head>
<body>
<h2 id="tophead">CUCC Expedition Handbook</h2>
<h1>Logbooks Import</h1>

    <!-- Yes we need some proper context-marking here, breadcrumb trails or something.
        Maybe a colour scheme for just this sequence of pages
    -->


<h3 id="import">Importing the logbook into troggle</h3>
<p>This is usually done after expo, but it is an excellent idea to have a nerd do this a couple of times during expo to discover problems while the people are still around to ask.

<p>The nerd needs to log in to the expo server using <em>their own userid</em>, not the 'expo' userid. The nerd also needs to be in the group that is allowed to do 'sudo'.

<p>The nerd needs to do this:
<ol>
<li>Look at the list of pre-existing old import errors at <br /> <a href="http://expo.survex.com/admin/core/dataissue/">http://expo.survex.com/admin/core/dataissue/</a> <br />
The nerd will have to log in to the troggle management console to do this, not just the usual troggle login.
<li>You need to get the list of people on expo sorted out first. <br />
This is documented in the <a href="folkupdate.html">Folk Update</a> process.
<li>Log in to the expo server and run the update script (see below for details)
<li>Watch the error messages scroll by. They are more detailed than the messages archived in the old import errors list.
<li>Edit the logbook.html file to fix the errors. These are usually typos, non-unique tripdate ids or unrecognised people. Some unrecognised people will mean that you have to fix them using the <a href="folkupdate.html">Folk Update</a> process first.
<li>Re-run the import script until you have got rid of all the import errors.
<li>Pat self on back. Future data managers and people trying to find missing surveys will worship you.
</ol>

<p>The procedure is like this. It will be familiar to you because
you will have already done most of this for the <a href="folkupdate.html">Folk Update</a> process.

<pre><code>ssh  {youruserid}@expo.survex.com
cd ~expo
cd troggle
sudo python databaseReset.py logbooks
</code></pre>

<p>It will produce a list of errors like those below, starting with the most recent logbook, which will be the one for the expo you are working on.
You can abort the script (Ctrl-C) once you have got the errors for the current expo that you are going to fix.
<pre><code>Loading Logbook for: 2017
 - Parsing logbook: 2017/logbook.html
 - Using parser: Parseloghtmltxt
Calculating GetPersonExpeditionNameLookup for 2017
   - No name match for: 'Phil'
   - No name match for: 'everyone'
   - No name match for: 'et al.'
("can't parse: ", u'\n\n&lt;img src="logbkimg5.jpg" alt="New Topo" /&gt;\n\n')
   - No name match for: 'Goulash Regurgitation'
   - Skipping logentry: Via Ferata: Intersport - Klettersteig - no author for entry
   - No name match for: 'mike'
   - No name match for: 'Mike'</code></pre>

<p>Errors are usually one of the following: misplaced or duplicated &lt;hr /&gt; tags; names which are missing or not specific enough to be recognised by the parser (though it tries hard), such as "everyone" or "et al."; or a bit of description which has been put into the names section, such as "Goulash Regurgitation".
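<p>When the output is long, it can help to tally the messages rather than read them as they scroll by. The following is a minimal sketch, not part of troggle: the function name <code>summarise_import_log</code> is hypothetical, and it recognises only the two message shapes shown in the sample output above ("No name match for" and "Skipping logentry").</p>

```python
import re
from collections import Counter

# Patterns copied from the sample import output; any other message
# format (an assumption: there may be more) is simply ignored.
NO_MATCH = re.compile(r"No name match for: '(.*)'")
SKIPPED = re.compile(r"Skipping logentry: (.*)")


def summarise_import_log(text):
    """Return (Counter of unmatched names, list of skipped-entry messages)."""
    unmatched = Counter()
    skipped = []
    for line in text.splitlines():
        m = NO_MATCH.search(line)
        if m:
            unmatched[m.group(1)] += 1
            continue
        m = SKIPPED.search(line)
        if m:
            skipped.append(m.group(1).strip())
    return unmatched, skipped


# Example using three lines in the same shape as the sample output.
sample = "\n".join([
    "   - No name match for: 'Phil'",
    "   - No name match for: 'everyone'",
    "   - Skipping logentry: Via Ferata: Intersport - Klettersteig - no author for entry",
])
names, skipped = summarise_import_log(sample)
for name, count in names.most_common():
    print(count, name)
for entry in skipped:
    print("skipped:", entry)
```

<p>Names are counted case-sensitively, so 'mike' and 'Mike' are reported separately, just as the importer reports them.</p>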

<h3 id="history">The logbooks format</h3>
<p>This is documented on the <a href="../logbooks.html#format">logbook user-documentation page</a>, as even expoers who can do nothing else technical can at least write up their logbook entries.

<p>[ Yes, this format needs to be re-done using a proper structure:</p>
<pre><code>&lt;div class="logentry"&gt;
<span style="text-decoration: line-through wavy red;">&nbsp;&nbsp;&nbsp;&nbsp;</span>
&lt;/div&gt;</code></pre>
<p>It's on the to-do list...]


<h3 id="history">Historical logbooks format</h3>
<p>Older logbooks (prior to 2007) were stored as logbook.txt with just a bit of consistent markup to allow troggle parsing.</p>

<p>The formatting was largely freeform, with a bit of markup ('===' around the header, and bars separating the date, &lt;place&gt; - &lt;description&gt;, and who) which allows the troggle import script to read it correctly. The underlines show who wrote the entry. There is also a format for time-underground info so it can be automagically tabulated.</p>

<p>So the format should be:</p>

<pre><code>===2009-07-21|204 - Rigging entrance series| Becka Lawson, Emma Wilson ===
&#123;Text of logbook entry&#125;
T/U: Jess 1 hr, Emma 0.5 hr
</code></pre>
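<p>To make the layout concrete, here is a minimal sketch of how the header line and the T/U line could be split into fields. It is an illustration only, not troggle's actual parser: the names <code>parse_header</code> and <code>parse_tu</code> are hypothetical, and the regular expressions assume exactly the layout of the example above.</p>

```python
import re

# Layout assumed from the example: ===date|place - description|people===
HEADER = re.compile(r"^===\s*(\d{4}-\d{2}-\d{2})\s*\|(.*)\|(.*?)===\s*$")
# One time-underground item, e.g. "Jess 1 hr" or "Emma 0.5 hr"
TU_ENTRY = re.compile(r"^\s*(.+?)\s+(\d+(?:\.\d+)?)\s*hrs?\s*$")


def parse_header(line):
    """Split a header line into a dict of fields, or return None if it does not match."""
    m = HEADER.match(line.strip())
    if m is None:
        return None
    date, title, people = m.groups()
    # The middle field is "place - description"; split on the first " - ".
    place, _sep, description = title.partition(" - ")
    return {
        "date": date,
        "place": place.strip(),
        "description": description.strip(),
        "people": [p.strip() for p in people.split(",") if p.strip()],
    }


def parse_tu(line):
    """Parse a 'T/U: name hours hr, ...' line into {name: hours}; {} if not a T/U line."""
    if not line.strip().startswith("T/U:"):
        return {}
    hours = {}
    for part in line.split(":", 1)[1].split(","):
        m = TU_ENTRY.match(part)
        if m:
            hours[m.group(1)] = float(m.group(2))
    return hours


print(parse_header("===2009-07-21|204 - Rigging entrance series| Becka Lawson, Emma Wilson ==="))
print(parse_tu("T/U: Jess 1 hr, Emma 0.5 hr"))
```

<p>A header that lacks one of the bars, or has a date in another form, makes <code>parse_header</code> return <code>None</code>, which is a cheap way to spot malformed entries before running the real import.</p>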
<p>
<a href="../logbooks.html">Back to Logbooks for Cavers</a> documentation.
<hr />

</body>
</html>