mirror of
https://expo.survex.com/repositories/expoweb/.git/
synced 2025-12-08 23:04:35 +00:00
Updated to use the new form
@@ -26,16 +26,24 @@ the python programe <var>databaseReset.py</var> at the command line. (The <var>e
|
||||
<ol>
|
||||
<li>Use a web browser to save the UK Caving blog to a file and sub-folder holding the images.
|
||||
<li>Go to the image folder and make all photos smaller, and convert .png files to .jpg. Delete all non-image files.
|
||||
<li>Edit the troggle import parser <var>troggle/parsers/logbooks.py</var> to include a line of code to do the import.
|
||||
<li>Run <var>python databaseReset.py <b>logbooks</b></var> to import all the logbooks including the blog.
|
||||
<li>Edit the troggle import parser <var>troggle/parsers/logbooks.py</var> to include a line of code to do the import:<br/>
|
||||
<var>"2024": ("ukcavingblog.html", "parser_blog"),</var><br/>
|
||||
in the obvious place near the top of the file (see below too)
|
||||
<li>Run <var>python databaseReset.py <b>logbooks</b></var> to import all the logbooks including the blog
|
||||
<li>Run <var>python databaseReset.py <b>logbook</b></var> to import just the current year's logbook.
|
||||
<li>Export all the logbook entries for the year to a single file <var>logbook-new-format.html</var> using the <var>expoadmin</var> control panel on troggle running locally on your machine.
|
||||
<li>Rename the existing <var>logbook.html</var> as <var>logbook-original.html</var> and rename <var>logbook-new-format.html</var> as <var>logbook.html</var>.
|
||||
<li>Comment out the additional line you put into <var>troggle/parsers/logbooks.py</var> to import the blog.
|
||||
<li>Re-import all the logbooks.
|
||||
<li>From now on you can use the logbook entry editor to make small changes, which will re-export everything and re-create the 'logbook.html' file. It will also do a git commit on your local machine, so you will need to clean these out and not push them to the server.
|
||||
<li>Tidy up oddities by hand-editing <var>logbook.html</var>: e.g. &amp; incidental decodings, delete blog entry comments, fix blog post author names.
|
||||
<li>Re-import all the logbooks to check that it all looks good. (Several times in practice.)
|
||||
<li>Commit and push the changes you made to the :expoweb: and :troggle: git repos.
|
||||
<li>Log on to the server and do a complete database reset online.
|
||||
<li>Curse mightily as MariaDB crashes on the server, because all the prep work was done using sqlite. The problem
|
||||
is because the blog software now (2023 onwards) uses 4-byte UTF-8 sequences for emojis and MariaDB by default only stores UTF-8 characters of up to 3 bytes
|
||||
(see <a href="https://stackoverflow.com/questions/20411440/incorrect-string-value-xf0-x9f-x8e-xb6-xf0-x9f-mysql">utf8mb4</a>).
|
||||
<li>Manually go through the imported HTML using an editor which displays these 4-byte UTF-8 characters (e.g. Notepad++) and delete them. Try again.
|
||||
</ol>
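<p>The last step above can be partly automated. The following is only a sketch, not part of troggle: the function names are invented for illustration. It relies on the fact that any character with a code point above U+FFFF needs 4 bytes in UTF-8, which is exactly the set that a 3-byte-only database column rejects.

```python
def find_four_byte_chars(text):
    """Return (line_number, character) pairs for every character
    that needs 4 bytes in UTF-8 (code point above U+FFFF)."""
    found = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ord(ch) > 0xFFFF:
                found.append((lineno, ch))
    return found


def strip_four_byte_chars(text):
    """Return the text with every 4-byte UTF-8 character removed.
    Accented letters and other characters up to U+FFFF are kept."""
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)
```

<p>Run <var>find_four_byte_chars()</var> first to see what would be removed, and only then write the stripped text back to the file (opened with <var>encoding="utf-8"</var>).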
|
||||
|
||||
<p>It's a bit easy to get lost in this process and forget where you were, especially if you are interrupted. So
|
||||
@@ -45,7 +53,7 @@ it is handy to print out this page and tick off the steps as you do them.
|
||||
|
||||
<h3 id="gotcha">Future Gotcha</h3>
|
||||
<p>The UK Caving Blog regularly upgrades its software which completely changes the hidden structure of the posts. They did this sometime between the 2017 and 2018 expos. When they do it again, the function
|
||||
<var>parser_blog(year, expedition, txt, sq="")</var> in <var>troggle/parsers/logbooks.py</var> will need to be completely re-written. It is currently 70 lines long and uses several regular expression recognizers.
|
||||
<var>parser_blog(year, expedition, txt, sq="")</var> in <var>troggle/parsers/logbooks.py</var> will need to be completely re-written. It is currently 102 lines long and uses several regular expression recognizers.
|
||||
|
||||
<h3 id="save">Saving the Blog</h3>
|
||||
<img src="blog-pages.jpg" hspace="20" align="right">
|
||||
@@ -55,7 +63,11 @@ it is handy to print out this page and tick off the steps as you do them.
|
||||
<li>Now for 2022, the blog split the posts onto two pages (see image), so if that is the case with the year you are dealing with, you will need to navigate to the next page and save again, this time with the filename "ukcavingblog2.html". Our existing troggle code handles up to 4 of these, numbered sequentially.
|
||||
</ul>
|
||||
<p>Now delete all the non-image files in the "ukcavingblog_files/" and "ukcavingblog2_files/" folders.
|
||||
<p>Now use your favourite photo editor (e.g. Irfanview on Windows) or a command-line tool to resize all the photos. A maximum of 600 pixels wide or high, or 400 or 300 pixels wide if the image quality is poor. Keep the same filename then you don't have to try to edit the horrendously horrible HTML which was generated by the blog software. If there are any .png files, convert them to .jpg.
|
||||
<p>Now use your favourite photo editor (e.g. Irfanview on Windows) or a command-line tool to resize all the photos. A maximum of 800 pixels wide or high, or 400 or 300 pixels wide if the image quality is poor. Keep the same filename then you don't have to try to edit the horrendously horrible HTML which was generated by the blog software. If there are any .png files, convert them to .jpg.
|
||||
<p>There will be a lot. So install imagemagick and use the
|
||||
<a href="https://imagemagick.org/script/mogrify.php">mogrify tool</a>:
|
||||
<br />mogrify -resize "800x800>" *.jpg
|
||||
<br /> will resize in-place, overwriting the files, and make the maximum dimension 800 pixels.
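<p>To make the "maximum dimension 800 pixels" rule concrete, here is a small sketch of the arithmetic. The function name is invented for illustration and the rounding is an assumption, so ImageMagick may differ by a pixel. Images already small enough are left alone and the aspect ratio is preserved.

```python
def shrink_to_fit(width, height, max_side=800):
    """Return (width, height) scaled so that neither side exceeds
    max_side. Images that already fit are returned unchanged."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Never let a very thin image collapse to zero pixels.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

<p>For example, a 1600 x 1200 photo becomes 800 x 600, while a 640 x 480 photo is not touched.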
|
||||
<p>Look at all the photos in the file browser set to show thumbnails and delete all advertising logos etc., and delete the UK Caving header image which will be of random people not us.
|
||||
|
||||
<h3 id="code1">Edit logbooks.py</h3>
|
||||
@@ -109,7 +121,7 @@ is somewhere in the settings for rendering a dictionary using a Django template,
|
||||
<h4>Fixing dates and trips</h4>
|
||||
<p>It is noticeable that a single blog post may cover several trips, and that the blog post date may be several days after the trip(s). So you need to manually find out the exact date of the trip (from the other trip records and particularly from the Bier Book) and change the date on the entry.
|
||||
|
||||
<p>One blog post may also need to be split into several entries - in which case be careful with the 'id=' string as this needs to be unique for each entry.
|
||||
<p>One blog post may also need to be split into several entries - don't worry about the 'id=' string as the parser now rewrites these uniquely.
|
||||
|
||||
When you split a blog into different entries the quickest way to re-order everything in date-order is to export the logbook and re-import it.
|
||||
|
||||
@@ -119,6 +131,7 @@ When you split a blog into different entries the quickest way to re-order everyt
|
||||
<hr />
|
||||
<p>
|
||||
Back to <a href="logbooks-parsing.html">Logbooks Import for Nerds</a> documentation.<br>
|
||||
Back to <a href="logbooks-format.html">Logbook internal format</a> documentation.<br>
|
||||
Back to <a href="../logbooks.html">Logbooks for Cavers</a> documentation.
|
||||
<hr />
|
||||
|
||||
|
||||
136
handbook/computing/logbooks-format.html
Normal file
@@ -0,0 +1,136 @@
|
||||
<!DOCTYPE html>
|
||||
<html>
|
||||
<head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
|
||||
<title>CUCC Expedition Handbook: Logbook internal format</title>
|
||||
<link rel="stylesheet" type="text/css" href="../../css/main2.css" />
|
||||
</head>
|
||||
<body>
|
||||
<style>body { background: #fff url(/images/style/bg-system.png) repeat-x 0 0 }</style>
|
||||
<h2 id="tophead">CUCC Expedition Handbook</h2>
|
||||
<h1>Logbooks internal format</h1>
|
||||
|
||||
<h3 id="format">Format</h3>
|
||||
|
||||
<p>All these scanned handwritten logbook entries are typed into a laptop (often the expo laptop)
|
||||
which is then synchronised with the version control system.
|
||||
|
||||
<h3 id="format">Format of the online logbooks</h3>
|
||||
|
||||
|
||||
<p>Do whatever you like to try and represent the logbook in HTML but do keep it <em>simple</em>. Don't try any clever HTML stuff. See the <a href="hbmanual1.html#images">"Edit this Page"</a> instructions for how to insert images and figures.
|
||||
|
||||
<p>Logbooks are typed up and kept in the [expoweb]/years/[nnnn]/ directory as 'logbook.html'.</p>
|
||||
|
||||
<p>When writing logbook entries, just use <a href="hbmanual1.html#images">relative URLs</a> to the same folder as your text, e.g. <var>href="mynicepic.jpg"</var> and the image and the logbook HTML will, for a 2017 expo, be put into <var>/years/2017/</var>.
|
||||
|
||||
<p>One special suggestion: do not use <P> paragraph tags. Well, you can if you like, but they will be stripped out and replaced by double-newlines when the file is parsed. This is because <P> paragraph tags cannot be nested - that is not allowed in HTML - and the fragment you are writing will be merged with other fragments and may be put inside a higher-level paragraph. [This is also true for Cave Description text in <a href="survey/caveentry.html">"Edit this Cave"</a>.]
|
||||
|
||||
<h3>How it all works - editing and archiving</h3>
|
||||
<p>When you use the online form to create a new logbook entry or to edit an old one, when you click the button the changes are made immediately to the online database on the server and you can see the results immediately (except for the list of logbook entry titles in the Expo webpage). Also, when you click the button the entire database of logbook entries is written out to disc, with your new entry in the right place by date, and this file 'logbook.html' is registered with the version control system (git add and git commit).
|
||||
<p>So when you click on any of the links to see the whole logbook, your edited entry will be there for all to see.
|
||||
|
||||
<p>
|
||||
<em>Implementation note</em>: the logbook.html file is not, at that time, re-parsed and re-imported into the database. This is unnecessary and would also expose us to potential loss of data if two people were editing the logbook of the same year at the same time. So the software doesn't do that.
|
||||
|
||||
<h4> Logbook-specific HTML</h4>
|
||||
<p>
|
||||
The only rigid structure is the markup to allow troggle to parse the logbook files into 'trips':</p>
|
||||
<code><pre>
|
||||
<hr />
|
||||
<div class="tripdate" id="2007-07-12b">2007-07-12</div>
|
||||
<div class="trippeople"><u>Jenny Black</u>, Olly Betts</div>
|
||||
<div class="triptitle">Top Camp - Setting up 76 bivi</div>
|
||||
...text of the logbook entry...
|
||||
<div class="timeug">T/U 0.2 hrs</div>
|
||||
<div class="editentry"><br /><a href="/logbookedit/2007-07-12b">Edit this entry</a><br /></div></pre></code>
|
||||
<p>When using the online form all this complexity is handled automatically:
|
||||
<ul>
|
||||
<li>The IDs are generated automatically and guaranteed to be unique, and they are rechecked each time there is a database reset. (They are now the Django slugs which uniquely identify each entry.) You never need to touch them. If you are hand-editing a logbook.html file just leave them alone and everything will get fixed up after a reset and the first edit/save.
|
||||
<li>The entire <var><span style="color:red">editentry</span></var> line is generated automatically for all the entries when any entry is saved, and is carefully ignored by the parser when there is next a database reset. However it does need to be there for the parser to work properly during a database reset.
|
||||
<li>The identification of the author as the person between underline tags is done by the form.
|
||||
<li>The identification of the place, as the first item in the title, is done by the form.
|
||||
<li>The entry separator <var><span style="color:red"><hr /></span></var> is inserted correctly.
|
||||
</ul>
|
||||
<p>Note: the IDs must be unique, so are generated from the trip date plus a,b,c etc.
|
||||
when there is more than one trip on a day (if more than 26 on one day, then it uses a cryptographic hash of the content as a suffix).</p>
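<p>As an illustration only, the ID scheme described in this note could be sketched as follows. This is not the troggle code: the function name, the choice of SHA-256, the 8-character suffix, the underscore separator and the assumption that the first trip of a day gets the letter "a" are all invented for the example.

```python
import hashlib
import string


def make_entry_id(date, index, content):
    """Build an entry ID from the trip date ('YYYY-MM-DD') and the
    0-based position of the entry among that day's trips.

    The first 26 entries get a letter suffix; after that a short
    hash of the entry content is used instead (hypothetical details).
    """
    if index < len(string.ascii_lowercase):
        return date + string.ascii_lowercase[index]
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"{date}_{digest[:8]}"
```

<p>With these assumptions the second trip on 12 July 2007 gets the ID <var>2007-07-12b</var>, matching the example entry above.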
|
||||
<p>Note: <var><span style="color:red">T/U</span></var> stands for "Time Underground" in decimal hours, e.g. "0.2" for 12 minutes. We do not parse or collate this information currently.
|
||||
<p>Note: the <var><span style="color:red"><hr /></span></var> is significant and used in parsing, it is not just prettiness.
|
||||
<p>Note: follow this format exactly. No HTML comments or tabs or newlines.
|
||||
|
||||
<p>Note this special format <var>"<span style="color:red">Top Camp - </span>"</var> in the triptitle line:
|
||||
<code><pre><div class="triptitle"><span style="color:red">Top Camp - </span>Setting up 76 bivi</div></pre></code>
|
||||
It denotes the <var>cave</var> or <var>area</var> the trip or activity happened in. It is a word or two separated from the rest of the triptitle with "<var> - </var>" (space-dash-space). Usual values
|
||||
for this are "Plateau", "Base camp", "264", "Balkon", "Tunnocks", "Travel" etc.
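<p>A minimal sketch of how the place can be separated from the rest of the title, assuming only the space-dash-space rule described above. The function name is invented for illustration and this is not the troggle parser itself.

```python
def split_triptitle(title):
    """Split 'Place - Description' at the first ' - '.

    Returns (place, description). If there is no ' - ' separator,
    place is None and the whole title is the description.
    """
    place, separator, description = title.partition(" - ")
    if not separator:
        return None, title.strip()
    return place.strip(), description.strip()
```

<p>Splitting only at the first separator means that a description which itself contains " - " is left intact.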
|
||||
|
||||
<p>Note this special format <var>"<span style="color:red"><u>Jenny Black</u></span>"</var> in the trip-people line:
|
||||
<code><pre><div class="trippeople"><span style="color:red"><u>Jenny Black</u></span>, Olly Betts</div>
|
||||
</pre></code>
|
||||
It is necessary that one (and only one) of the people on the trip is set in <span style="color:red"><u></u></span> underline format. This is interpreted to mean that this is the author of the logbook entry. If there is no author set, then this is an error and the entry is ignored.
|
||||
|
||||
<p>If you like, you can put non-expo people in the trip-people line: <var>"<span style="color:red">*Ol's Mum</span>"</var> with a <span style="color:red">*</span> prefix and they will be totally ignored by troggle:
|
||||
<code><pre><div class="trippeople"><u>Jenny Black</u>, Olly Betts, <span style="color:red">*Ol's Mum</span></div>
|
||||
</pre></code>
|
||||
or
|
||||
<code><pre><div class="trippeople"><u>Jenny Black</u>, Olly Betts, <span style="color:red">*4 Hungarian Cavers</span></div>
|
||||
</pre></code>
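<p>The rules for the trip-people line (exactly one underlined author, starred names ignored) can be sketched like this. It is an illustration under stated assumptions, not the troggle code: the function name is invented, names are assumed to be comma-separated, and raising <var>ValueError</var> stands in for whatever troggle does when it rejects an entry.

```python
import re

UNDERLINED = re.compile(r"<u>(.*?)</u>", re.IGNORECASE)


def parse_trippeople(inner_html):
    """Return (author, people) from the contents of the trippeople div.

    The author is the one name wrapped in <u>...</u>. Names starting
    with '*' are non-expo people and are skipped entirely.
    """
    author = None
    people = []
    for raw in inner_html.split(","):
        name = raw.strip()
        if not name or name.startswith("*"):
            continue
        match = UNDERLINED.fullmatch(name)
        if match:
            if author is not None:
                raise ValueError("more than one author underlined")
            name = match.group(1).strip()
            author = name
        people.append(name)
    if author is None:
        raise ValueError("no author underlined")
    return author, people
```

<p>For the example above this gives Jenny Black as the author and a people list of Jenny Black and Olly Betts; "*Ol's Mum" is dropped.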
|
||||
|
||||
|
||||
<h3 id="history">The logbooks format over the years</h3>
|
||||
|
||||
<p>Very old logbooks were simply typed up text documents with no formatting.
|
||||
|
||||
<p>Old logbooks (prior to 2007) were stored as logbook.txt with just a bit of consistent markup to allow troggle parsing.</p>
|
||||
|
||||
<p>The formatting was largely freeform, with a bit of markup ('===' around header, bars separating date, <place> - <description>, and who) which (later) allowed the troggle import script to read it correctly. The underlines show who wrote the entry. </p>
|
||||
<p>There were also several previous (different) styles of using HTML. The one we are using now is the 5th variant. These older variants were eventually all reformatted into the current HTML format so that now (Jan. 2023) we only need to maintain the code for one parser.
|
||||
<p>However, we missed one. The logbook for 1979 needs to be hand-edited to use the new format.
|
||||
|
||||
<!--
|
||||
<p>So the format should be:</p>
|
||||
|
||||
<code>
|
||||
===2009-07-21|204 - Rigging entrance series| Becka Lawson, Emma Wilson ===
|
||||
</br>
|
||||
{Text of logbook entry}
|
||||
</br>
|
||||
T/U: Jess 1 hr, Emma 0.5 hr
|
||||
</code>
|
||||
-->
|
||||
|
||||
<!--
|
||||
<h4>How we used to do it: Adding your trip to the logbook online file</h4>
|
||||
<p><b>DO NOT DO ANY OF THIS ANYMORE</b>. Left here pending archiving.
|
||||
|
||||
<p>If you are using the <em>expo laptop</em> just edit this file (if you are on expo in 2019):
|
||||
|
||||
<code>
|
||||
/home/expo/expoweb/years/2019/logbook.html
|
||||
</code>
|
||||
|
||||
copy the format you can see other people have used; <em>tell a nerd that you have done this</em>
|
||||
so that they can take care of synchronising it with the version control system.
|
||||
|
||||
<p>
|
||||
DO NOT take a copy of the logbook.html file from the expo laptop,
|
||||
copy it by email or USB stick to another laptop, edit it there and then copy it back. That will
|
||||
<em>delete other people's work</em>.
|
||||
|
||||
<p>If you are using your own laptop then you will need to either:
|
||||
<ul>
|
||||
<li>Just type up your trip as a separate file with a useful filename e.g. "logbook-myname-2018-08-03.txt", or just write it in an email, and send it to someone nerdish, or
|
||||
<li>Install and learn how to use the version control software. (This requires a <var><a href="bulkupdatelaptop.html">Bulk Update Laptop</a></var>).
|
||||
And you will need to synchronise regularly (every day) to
|
||||
ensure that the updates from all the people entering trip data are OK and don't get overwritten by ignorant use of this software. Not recommended until you have been on a previous expo and have helped do the post-expo data tidy afterwards.
|
||||
</ul>
|
||||
-->
|
||||
|
||||
|
||||
<hr />
|
||||
<p>
|
||||
Back to <a href="../logbooks.html">Logbooks for Cavers</a> documentation.<br>
|
||||
Go on to <a href="logbooks-parsing.html">Importing logbooks into troggle</a>.<br>
|
||||
Go on to <a href="log-blog-parsing.html">Importing the UK Caving Blog</a>.
|
||||
<hr /></body>
|
||||
</html>
|
||||
|
||||
@@ -10,16 +10,14 @@
|
||||
<h2 id="tophead">CUCC Expedition Handbook</h2>
|
||||
<h1>Logbooks Import</h1>
|
||||
|
||||
<!-- Yes we need some proper context-marking here, breadcrumb trails or something.
|
||||
Maybe a colour scheme for just this sequence of pages
|
||||
-->
|
||||
|
||||
|
||||
<h3 id="import">Importing the logbook into troggle</h3>
|
||||
<p>This is usually done after expo but it is an excellent idea to have a nerd do this a couple of times during expo to discover problems while the people are still around to ask.
|
||||
|
||||
<p>The nerd needs to log in to the expo server using <em>their own userid</em>, not the 'expo' userid. The nerd also needs to be in the group that is allowed to do 'sudo'.
|
||||
|
||||
<h4>The 'parser'</h4>
|
||||
<p>This is rather a grand word for the hacked-about spaghetti of regexes in troggle/parsers/logbooks.py. It is not a proper parser, just a phrase recognizer, and is horribly, horribly fragile. On the bright side, we now only have one of these instead of 5.
|
||||
|
||||
<h4>Ideal situation</h4>
|
||||
<p>Ideally this would all be done on a stand-alone laptop to get the bugs in the logbook parsing sorted out before we upload the corrected file to the server. Unfortunately this requires a full troggle software development laptop as the parser is built into troggle. The <var>expo laptop</var> in the potato hut is set up to do this (2023) but requires more nous than is convenient to describe here.
|
||||
<p>However, the <var>expo laptop</var> (or any 'bulk update' laptop) is configured to allow an authorized user to log in to the server itself and to run the import process directly on the server. DON'T DO THIS. The slightest mistake in formatting will kill logbook functionality on the server for everyone.
|
||||
@@ -31,6 +29,8 @@ the trips are indexed and we can see who was doing what where.
|
||||
<a href="log-blog-parsing.html">in another page</a>. But read this page first.
|
||||
|
||||
<h4>Current situation</h4>
|
||||
<p>With the new data entry form we should have far fewer problems with inventive hacks trying to do clever things with HTML, but it is entirely possible that the form can be used to input text which will then break the parser, most obviously by putting in a
|
||||
<var><span style="color:red"><hr /></span></var> which is the separator between entries. This is not clever.
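<p>A quick way to spot this particular problem before an import is to scan the text of each entry for a stray separator tag. This is only a sketch with an invented function name, not something troggle does today as far as this page documents.

```python
import re

# Matches <hr>, <hr/>, <hr /> and <HR ...> but not tags such as <href>.
HR_TAG = re.compile(r"<\s*hr\b[^>]*>", re.IGNORECASE)


def contains_separator(entry_text):
    """Return True if the body of a logbook entry contains an <hr>
    tag, which the parser would treat as the end of the entry."""
    return HR_TAG.search(entry_text) is not None
```

<p>Any entry for which this returns True should have the tag removed from its text by hand.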
|
||||
<p>The nerd needs to do this:
|
||||
<ol>
|
||||
<li>Look at the list of pre-existing old import errors at <a href="/dataissues">Data Issues</a> <br />
|
||||
@@ -39,7 +39,7 @@ the trips are indexed and we can see who was doing what where.
|
||||
This is documented in the <a href="folkupdate.html">Folk Update</a> process.
|
||||
<li>Log in to the expo server and run the update script (see below for details)
|
||||
<li>Watch the error messages scroll by, they are more detailed than the messages archived in the old import errors list
|
||||
<li>Edit the logbook.html file to fix the errors. These are usually typos, non-unique tripdate ids or unrecognised people. Some unrecognised people will mean that you have to fix them using the <a href="folkupdate.html">Folk Update</a> process first.
|
||||
<li>Edit the logbook.html file to fix the errors. These are usually typos, too-clever HTML or unrecognised people. Some unrecognised people will mean that you have to fix them using the <a href="folkupdate.html">Folk Update</a> process first.
|
||||
<li>Re-run the import script until you have got rid of all the import errors.
|
||||
<li>Pat self on back. Future data managers and people trying to find missing surveys will worship you.
|
||||
</ol>
|
||||
@@ -76,30 +76,12 @@ cd troggle
|
||||
python databaseReset.py reset
|
||||
</code></pre>
|
||||
which takes between 5 and 15 minutes on the server.
|
||||
<h3 id="history">The logbooks format</h3>
|
||||
<p>This is documented on the <a href="../logbooks.html#format">logbook user-documentation page</a> as even expoers who can do nothing else technical can at least write up their logbook entries.
|
||||
|
||||
<h3 id="history">Historical logbooks format</h3>
|
||||
<p>Older logbooks (prior to 2007) were stored as logbook.txt with just a bit of consistent markup to allow troggle parsing.</p>
|
||||
|
||||
<p>The formatting was largely freeform, with a bit of markup ('===' around header, bars separating date, <place> - <description>, and who) which allows the troggle import script to read it correctly. The underlines show who wrote the entry. </p>
|
||||
<p>There were also several previous (different) styles of using HTML. The one we are using now is the 5th variant. These older variants were eventually all reformatted into the current HTML format so that now (Jan. 2023) we only need to maintain the code for one parser.
|
||||
<p>However, we missed one. The logbook for 1979 needs to be hand-edited to use the new format.
|
||||
|
||||
<!--
|
||||
<p>So the format should be:</p>
|
||||
|
||||
<code>
|
||||
===2009-07-21|204 - Rigging entrance series| Becka Lawson, Emma Wilson ===
|
||||
</br>
|
||||
{Text of logbook entry}
|
||||
</br>
|
||||
T/U: Jess 1 hr, Emma 0.5 hr
|
||||
</code>
|
||||
-->
|
||||
<hr />
|
||||
<p>
|
||||
Back to <a href="../logbooks.html">Logbooks for Cavers</a> documentation.<br>
|
||||
Forward to <a href="logbooks-format.html">Logbook internal format</a> documentation.<br>
|
||||
Forward to <a href="log-blog-parsing.html">Importing the UK Caving Blog</a>.
|
||||
<hr /></body>
|
||||
</html>
|
||||
|
||||