<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>CUCC Expedition Handbook: Logbook import</title>
<link rel="stylesheet" type="text/css" href="../../css/main2.css" />
</head>
<body><style>body { background: #fff url(/images/style/bg-system.png) repeat-x 0 0 }</style>
<h2 id="tophead">CUCC Expedition Handbook</h2>
<h1>Logbooks Import</h1>
<!-- Yes we need some proper context-marking here, breadcrumb trails or something.
Maybe a colour scheme for just this sequence of pages
-->
<h3 id="import">Importing the logbook into troggle</h3>
<p>This is usually done after expo, but it is an excellent idea to have a nerd do this a couple of times during expo to discover problems while the people who wrote the entries are still around to ask.
<p>The nerd needs to log in to the expo server using <em>their own userid</em>, not the 'expo' userid. The nerd also needs to be in the group that is allowed to do 'sudo'.
<h4>Ideal situation</h4>
<p>Ideally this would all be done on a stand-alone laptop to get the bugs in the logbook parsing sorted out before we upload the corrected file to the server. Unfortunately this requires a full troggle software development laptop as the parser is built into troggle. The <var>expo laptop</var> in the potato hut is not set up to do this (yet - 2022).
<p>However, the <var>expo laptop</var> (or any 'bulk update' laptop) is configured to allow an authorized user to log in to the server itself and to run the import process directly on the server.
<h4>Current situation</h4>
<p>The nerd needs to do this:
<ol>
<li>Look at the list of pre-existing old import errors at <a href="/dataissues">Data Issues</a>.<br />
<li>You need to get the list of people on expo sorted out first.<br />
This is documented in the <a href="folkupdate.html">Folk Update</a> process.
<li>Log in to the expo server and run the update script (see below for details)
<li>Watch the error messages scroll by; they are more detailed than the messages archived in the old import errors list.
<li>Edit the logbook.html file to fix the errors. These are usually typos, non-unique tripdate ids, or unrecognised people. Some unrecognised people will mean that you first have to fix the names using the <a href="folkupdate.html">Folk Update</a> process.
<li>Re-run the import script until you have got rid of all the import errors.
<li>Pat self on back. Future data managers and people trying to find missing surveys will worship you.
</ol>
<p>The procedure is like this. It will be familiar to you because
you will have already done most of this for the <a href="folkupdate.html">Folk Update</a> process.
<pre><code>ssh expo@expo.survex.com
cd troggle
python databaseReset.py logbooks
</code></pre>
<p>It will produce a list of errors like those below, starting with the most recent logbook, which will be the one for the expo you are working on.
You can abort the script (Ctrl-C) once you have the errors for the current expo that you are going to fix.
<pre><code>Loading Logbook for: 2017
- Parsing logbook: 2017/logbook.html
- Using parser: Parseloghtmltxt
Calculating GetPersonExpeditionNameLookup for 2017
- No name match for: 'Phil'
- No name match for: 'everyone'
- No name match for: 'et al.'
("can't parse: ", u'\n\n&lt;img src="logbkimg5.jpg" alt="New Topo" /&gt;\n\n')
- No name match for: 'Goulash Regurgitation'
- Skipping logentry: Via Ferata: Intersport - Klettersteig - no author for entry
- No name match for: 'mike'
- No name match for: 'Mike'</code></pre>
<p>Errors are usually: misplaced or duplicated &lt;hr /&gt; tags; names which are not specific enough for the parser to recognise (though it tries hard), such as "everyone" or "et al.", or which are simply missing; or a bit of description which has been put into the names section, such as "Goulash Regurgitation".
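<p>This is not the troggle parser, but a minimal standalone Python sketch of the kind of pre-import check you can run on a logbook file yourself, assuming tripdate anchors use ordinary id="..." attributes (that id format is an assumption here):

```python
import re

def find_logbook_problems(html: str):
    """Flag two problems the logbook import commonly complains about:
    non-unique tripdate ids and doubled-up <hr /> entry separators."""
    problems = []
    # Duplicate id="..." attributes - tripdate ids must be unique.
    ids = re.findall(r'id="([^"]+)"', html)
    seen = set()
    for i in ids:
        if i in seen:
            problems.append('duplicate id: ' + i)
        seen.add(i)
    # Two <hr /> separators with nothing but whitespace between them.
    if re.search(r'<hr\s*/?>\s*<hr\s*/?>', html):
        problems.append('adjacent <hr /> tags')
    return problems

sample = '<div id="2017-07-21a">trip</div><hr /><hr /><div id="2017-07-21a">again</div>'
print(find_logbook_problems(sample))
# → ['duplicate id: 2017-07-21a', 'adjacent <hr /> tags']
```

Running a quick scan like this on your laptop can catch the obvious typos before you bother the server with a full import run.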
<p>When you have sorted out the logbook formatting and the import no longer complains,
you will need to do a full database reset, as the logbook import will have trashed the online database and none of the troggle webpages will be working:
<pre><code>ssh expo@expo.survex.com
cd troggle
python databaseReset.py reset
</code></pre>
which takes between 5 and 15 minutes on the server.
<h3 id="format">The logbooks format</h3>
<p>This is documented on the <a href="../logbooks.html#format">logbook user-documentation page</a> as even expoers who can do nothing else technical can at least write up their logbook entries.
<h3 id="history">Historical logbooks format</h3>
<p>Older logbooks (prior to 2007) were stored as logbook.txt with just a bit of consistent markup to allow troggle parsing.</p>
<p>The formatting was largely freeform, with a bit of markup ('===' around the header; bars separating the date, &lt;place&gt; - &lt;description&gt;, and who was on the trip) which allows the troggle import script to read it correctly. The underlines show who wrote the entry. There is also a format for time-underground info so it can be automagically tabulated.</p>
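<p>As an illustration (this is not the actual troggle parser), a header line in this old format could be split apart like this, assuming the date | place - description | people layout described above:

```python
def parse_header(line: str):
    """Split an old-style '===date|place - description|people===' header
    into its fields. Illustrative sketch only, not troggle code."""
    body = line.strip().strip('=').strip()
    date, place_desc, people = [f.strip() for f in body.split('|')]
    # The place and description are separated by ' - ' within one field.
    place, _, description = [s.strip() for s in place_desc.partition(' - ')]
    return {
        'date': date,
        'place': place,
        'description': description,
        'people': [p.strip() for p in people.split(',')],
    }

print(parse_header('===2009-07-21|204 - Rigging entrance series| Becka Lawson, Emma Wilson ==='))
# → {'date': '2009-07-21', 'place': '204', 'description': 'Rigging entrance series',
#    'people': ['Becka Lawson', 'Emma Wilson']}
```

The real parser is more forgiving than this sketch, but the sketch shows why consistent bars and '===' markers matter: each field is located purely by those separators.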
<!--
<p>So the format should be:</p>
<code>
===2009-07-21|204 - Rigging entrance series| Becka Lawson, Emma Wilson ===
<br />
&#123;Text of logbook entry&#125;
<br />
T/U: Jess 1 hr, Emma 0.5 hr
</code>
-->
<p>
<a href="../logbooks.html">Back to Logbooks for Cavers</a> documentation.
<hr />
</body>
</html>