expoweb/handbook/computing/logbooks-parsing.html

89 lines
5.5 KiB
HTML

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>CUCC Expedition Handbook: Logbook import</title>
<link rel="stylesheet" type="text/css" href="../../css/main2.css" />
</head>
<body>
<style>body { background: #fff url(/images/style/bg-system.png) repeat-x 0 0 }</style>
<h2 id="tophead">CUCC Expedition Handbook</h2>
<h1>Logbooks Import</h1>
<h3 id="import">Importing the logbook into troggle</a></h3>
<p>This is usually done after expo but it is in excellent idea to have a nerd do this a couple of times during expo to discover problems while the people are still around to ask.
<p>The nerd needs to login to the expo server using <em>their own userid</em>, not the 'expo' userid. The nerd also needs to be in the group that is allowed to do 'sudo'.
<h4>The 'parser'</h4>
<p>This is rather a grand word for the hacked about spaghetti of regexes in troggle/parsers/logbooks.py . It is not a proper parser, just a phrase recognizer, and is horribly, horribly fragile. On the brightside, we now only have one of these instead of 5.
<h4>Ideal situation</h4>
<p>Ideally this would all be done on a stand-alone laptop to get the bugs in the logbook parsing sorted out before we upload the corrected file to the server. Unfortunately this requires a full troggle software development laptop as the parser is built into troggle. The <var>expo laptop</var> in the potato hut is set up to do this (2023) but requires more nouse than is convenient to describe here.
<p>However, the <var>expo laptop</var> (or any 'bulk update' laptop) is configured to allow an authorized user to log in to the server itself and to run the import process directly on the server. DON'T DO THIS. The slightest mistake in formatting will killl logbook functionality on the server for everyone.
<h4>Importing the Blog</h4>
<p>During expo lots of people post text and photos to the UK Caving (rope competition) website. During the winter after expo, an extra nerd task is to fold in all those entries into the main logbook so that
the trips are indexed and we can see who was doing what where.
<p>This is sufficiently complicated that it is documented
<a href="log-blog-parsing.html">in another page</a>. But read this page first.
<h4>Current situation</h4>
<p>With the new data entry form we should have far fewer problems with inventive hacks trying to do clever thngs with HTML, but it is entirely possible that the form can be used to input text which will then break the parser, most obviously by putting in a
<var><span style="color:red">&lt;hr /&gt;</span></var> which is the separator between entries. This is not clever.
<p>The nerd needs to do this:
<ol>
<li>Look at the list of pre-existing old import errors at <a href="/dataissues">Data Issues</a> </br>
<li>You need to get the list of people on expo sorted out first. </br>
This is documented in the <a href="folkupdate.html">Folk Update</a> process.
<li>Log in to the expo server and run the update script (see below for details)
<li>Watch the error messages scroll by, they are more detailed than the messages archived in the old import errors list
<li>Edit the logbook.html file to fix the errors. These are usually typos, too-clever HTML or unrecognised people. Some unrecognised people will mean that you have to fix them using the <a href="folkupdate.html">Folk Update</a> process first.
<li>Re-run the import script until you have got rid of all the import errors.
<li>Pat self on back. Future data managers and people trying to find missing surveys will worship you.
</ol>
<p>The procedure is like this. It will be familiar to you because
you will have already done most of this for the <a href="folkupdate.html">Folk Update</a> process.
<pre><code>ssh expo@expo.survex.com
cd troggle
python databaseReset.py logbooks
</code></pre>
<p>It will produce a list of errors like these below, starting with the most recent logbook which will be the one for the expo you are working on.
You can abort the script (Ctrl-C) when you have got the errors for the current expo that you are going to fix
<pre><code>Loading Logbook for: 2017
- Parsing logbook: 2017/logbook.html
- Using parser: Parseloghtmltxt
Calculating GetPersonExpeditionNameLookup for 2017
- No name match for: 'Phil'
- No name match for: 'everyone'
- No name match for: 'et al.'
("can't parse: ", u'\n\n&lt;img src="logbkimg5.jpg" alt="New Topo" /&gt;\n\n')
- No name match for: 'Goulash Regurgitation'
- Skipping logentry: Via Ferata: Intersport - Klettersteig - no author for entry
- No name match for: 'mike'
- No name match for: 'Mike'</code></pre>
<p>Errors are usually misplaced or duplicated &lt;hr /&gt; tags, names which are not specific enough to be recognised by the parser (though it tries hard) such as "everyone" or "et al." or are simply missing, or a bit of description which has been put into the names section such as "Goulash Regurgitation".
<p>When you have sorted out the logbooks formatting and it is no longer complaining,
you will need to do a full database reset as this will have trashed the online database and none of the troggle webpages will be working:
<pre><code>ssh expo@expo.survex.com
cd troggle
python databaseReset.py reset
</code></pre>
which takes between 300s and 15 minutes on the server.
<hr />
<p>
Back to <a href="../logbooks.html">Logbooks for Cavers</a> documentation.<br>
Forward to <a href="logbooks-format.html">Logbook internal format</a> documentation.<br>
Forward to <a href="log-blog-parsing.html">Importing the UK Caving Blog</a>.
<hr /></body>
</html>