expoweb/handbook/computing/log-blog-parsing.html

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>CUCC Expedition Handbook: Logbook - importing Blog posts</title>
<link rel="stylesheet" type="text/css" href="/css/main2.css" />
</head>
<body><style>body { background: #fff url(/images/style/bg-system.png) repeat-x 0 0 }</style>
<h2 id="tophead">CUCC Expedition Handbook</h2>
<h1>Blog Import</h1>

    <!-- Yes we need some proper context-marking here, breadcrumb trails or something.
        Maybe a colour scheme for just this sequence of pages
    -->


<h2 id="import">Importing the UK Caving Blog into troggle</h2>
<p>This is straightforward but a bit time-consuming. You need a
<a href="../troggle/troglaptop.html">Troggle software development machine</a> and must be happy running
the Python program <var>databaseReset.py</var> at the command line. (The <var>expo laptop</var> is <em>not</em> a Troggle software development machine.)

<p><b>Simply:</b> we import all the logbook entries and blog posts for an expo into the database, then export them to a single file. This file is then used for future database resets.

<p>This is the online <a href="https://ukcaving.com/board/index.php?threads/cucc-austria-expedition-2022-blog.29712/">UK Caving Blog for Expo 2022</a>

<ol>
<li>Use a web browser to save the UK Caving blog to a file and sub-folder holding the images.
<li>Go to the image folder and make all photos smaller, and convert .png files to .jpg. Delete all non-image files.
<li>Edit the troggle import parser <var>troggle/parsers/logbooks.py</var> to include a line of code to do the import:<br/>
    <var>"2024": ("ukcavingblog.html", "parser_blog"),</var><br/>
    in the obvious place near the top of the file (see below too)
<li>Run <var>python databaseReset.py <b>logbooks</b></var> to import all the logbooks including the blog
<li>Run <var>python databaseReset.py <b>logbook</b></var> to import just the current year's logbook.
<li>Export all the logbook entries for the year to a single file <var>logbook-new-format.html</var>, using the <var>expoadmin</var> control panel on troggle running locally on your machine.
<li>Rename the existing <var>logbook.html</var> as <var>logbook-original.html</var> and rename <var>logbook-new-format.html</var> as <var>logbook.html</var>.
<li>Comment out the additional line you put into <var>troggle/parsers/logbooks.py</var> to import the blog.
<li>Re-import all the logbooks.
<li>From now on you can use the logbook entry editor to make small changes, which will re-export everything and re-create the 'logbook.html' file. It will also do a git commit on your local machine, so you will need to clean these out and not push them to the server.
<li>Tidy up oddities by hand-editing <var>logbook.html</var>: e.g. fix stray &amp;amp; encodings, delete blog entry comments, fix blog post author names.
<li>Re-import all the logbooks to check that it all looks good. (Several times in practice.)
<li>Commit and push the changes you made to the :expoweb: and :troggle: git repos.
<li>Log on to the server and do a complete database reset online.
<li>Curse mightily as MariaDB crashes on the server, because all the prep work was done using sqlite. The problem
is that the blog software now (2023 onwards) uses 4-byte UTF-8 characters for emojis, and MariaDB by default only stores UTF-8 characters of up to 3 bytes
(see <a href="https://stackoverflow.com/questions/20411440/incorrect-string-value-xf0-x9f-x8e-xb6-xf0-x9f-mysql">utf8mb4</a>).
<li>Manually go through the imported HTML using an editor which displays these 4-byte characters (e.g. Notepad++) and delete them. Try again.
</ol>
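<p>The hunt for emojis in the last step above can be partly automated. The following is a minimal sketch, not part of troggle: the function names are hypothetical, and it assumes that the characters MariaDB rejects are exactly those outside the Basic Multilingual Plane, which are the ones that need 4 bytes in UTF-8.

```python
import re

# Code points above U+FFFF are exactly the characters that take 4 bytes in UTF-8
# (most emojis live here). Accented letters and symbols such as the euro sign
# take 2 or 3 bytes and are left alone.
FOUR_BYTE_CHARS = re.compile(r"[\U00010000-\U0010FFFF]")


def find_four_byte_chars(text):
    """Return (line_number, character) pairs for every 4-byte UTF-8 character."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in FOUR_BYTE_CHARS.finditer(line):
            hits.append((lineno, match.group()))
    return hits


def strip_four_byte_chars(text):
    """Return the text with every 4-byte UTF-8 character removed."""
    return FOUR_BYTE_CHARS.sub("", text)
```

<p>Run <var>find_four_byte_chars()</var> on the contents of <var>logbook.html</var> (read with <var>encoding="utf-8"</var>) to see which lines are affected before deciding whether to delete the characters by hand or with <var>strip_four_byte_chars()</var>.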

<p>It is easy to get lost in this process and forget where you were, especially if you are interrupted. So
it is handy to print out this page and tick off the steps as you do them.

<p>After step 4, the blog posts appear in the list of logbook entries in the troggle Expo page for the year, correctly dated, and with titles such as "Expo - UK Caving Blog post 3".

<h3 id="gotcha">Future Gotcha</h3>
<p>The UK Caving Blog regularly upgrades its software which completely changes the hidden structure of the posts. They did this sometime between the 2017 and 2018 expos. When they do it again, the function
<var>parser_blog(year, expedition, txt, sq="")</var> in <var>troggle/parsers/logbooks.py</var> will need to be completely re-written. It is currently 102 lines long and uses several regular expression recognizers.

<h3 id="save">Saving the Blog</h3>
 <img src="blog-pages.jpg" hspace="20" align="right">
<ul>
<li>With your browser (this example uses Chrome), go to <a href="https://ukcaving.com/board/index.php?threads/cucc-austria-expedition-2022-blog.29712/">UK Caving Blog for Expo 2022</a>.
<li>Press <var>ctrl-S</var> and save as filename "ukcavingblog.html" in <var>:expoweb:/years/2022/</var> where <var>:expoweb:</var> is where you keep your copy of the <var>:expoweb:</var> repo on your <a href="../troggle/troglaptop.html">Troggle software development laptop</a>.
<li>Now for 2022, the blog split the posts onto two pages (see image), so if that is the case with the year you are dealing with, you will need to navigate to the next page and save again, this time with the filename "ukcavingblog2.html". Our existing troggle code handles up to 4 of these, numbered sequentially.
</ul>
<p>Now delete all the non-image files in the "ukcavingblog_files/" and "ukcavingblog2_files/" folders.
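<p>If you would rather not pick through the folders by hand, the clean-up can be sketched in Python. This is an illustration only: the function name and the list of suffixes treated as images are assumptions, so check the dry-run output before deleting anything.

```python
from pathlib import Path

# Assumption: these are the only suffixes the browser uses for saved images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def delete_non_images(folder, dry_run=True):
    """List the non-image files in folder and, unless dry_run, delete them.

    Returns the names of the files that were (or would be) deleted.
    Sub-folders are ignored.
    """
    removed = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() not in IMAGE_SUFFIXES:
            removed.append(path.name)
            if not dry_run:
                path.unlink()
    return removed
```

<p>Call <var>delete_non_images("ukcavingblog_files")</var> first to see what would go, then repeat with <var>dry_run=False</var> to actually delete.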
<p>Now use your favourite photo editor (e.g. Irfanview on Windows) or a command-line tool to resize all the photos. Aim for a maximum of 800 pixels wide or high, or 400 or 300 pixels wide if the image quality is poor. Keep the same filenames so that you don't have to edit the horrendously horrible HTML which was generated by the blog software. If there are any .png files, convert them to .jpg.
<p>There will be a lot of photos, so install ImageMagick and use the
<a href="https://imagemagick.org/script/mogrify.php">mogrify tool</a>:
<br /><var>mogrify -resize '800x800&gt;' *.jpg</var>
<br /> will resize in place, overwriting the files, and shrink any image larger than 800 pixels so that its maximum dimension is 800 pixels. The quotes stop the shell from treating &gt; as an output redirect.
<p>Look at all the photos in the file browser set to show thumbnails and delete all advertising logos etc., and delete the UK Caving header image which will be of random people not us.

<h3 id="code1">Edit logbooks.py</h3>
<p>Edit this bit in the obvious manner to add a line for the year you want to add:
<code><pre>BLOG_PARSER_SETTINGS = {
                "2017": ("ukcavingblog.html", "parser_blog"),
                "2018": ("ukcavingblog.html", "parser_blog"),
                "2019": ("ukcavingblog.html", "parser_blog"),
                "2022": ("ukcavingblog.html", "parser_blog"),
            }
</pre></code>
<p>If there are 2nd or 3rd pages within the same year, these will be detected automatically. But you have to tell it about the first one.
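<p>To check which saved pages are present for a year before running the import, something like the following can be used. It is a sketch of the naming convention described on this page ("ukcavingblog.html", "ukcavingblog2.html" and so on, up to 4 pages), not a copy of the detection code in <var>troggle/parsers/logbooks.py</var>; the function name and constant are hypothetical.

```python
from pathlib import Path

MAX_PAGES = 4  # assumption taken from this page: up to 4 saved pages are handled


def blog_page_files(year_dir, first_page="ukcavingblog.html"):
    """Return the names of the saved blog pages that exist in year_dir.

    The first page keeps its plain name; later pages have 2, 3, 4 appended
    to the stem, e.g. ukcavingblog2.html.
    """
    year_dir = Path(year_dir)
    stem = Path(first_page).stem
    suffix = Path(first_page).suffix
    candidates = [first_page]
    candidates += [f"{stem}{n}{suffix}" for n in range(2, MAX_PAGES + 1)]
    return [name for name in candidates if (year_dir / name).is_file()]
```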

<img src="export-dialog.jpg" hspace="20" align="right">
<h3 id="export">Exporting all entries to a new file</h3>
<p>Run Troggle locally, and navigate your browser to
<var>http://localhost:8000/controlpanel</var>. Select the drop-down for the year you are working on (2017 in this example).
<p>You need to login as the "expoadmin" user id, not the usual "expo" id. This has a different password but you already know what it is because you set up your local copy of Troggle.
<p>There is only one export format: "HTML 2005 style". This uses the Django template
<var>troggle/templates/logbook2005style.html</var>.
<p>All entries for the year will be exported in date order, which may not be the order they were originally written in the paper logbook.
<p>Some logbooks have "front matter": text and images which are not part of any trip entry. This front matter is copied out whenever the logbook.html file is parsed, and the copy from the most recent parse is placed at the front of the generated export file. The export file is always called "logbook-new-format.html" and is located in the same folder as "logbook.html". If there is a file of that name already there it is overwritten without warning.
<h3 id="code2">Edit logbooks.py again</h3>
<p>Edit this bit in the obvious manner to show that you have done all the work for 2017. Don't just delete the line, make it obvious that the importing job was done:
<code><pre>BLOG_PARSER_SETTINGS = {
                # "2017": ("ukcavingblog.html", "parser_blog"), # now folded in to logbook.html
                "2018": ("ukcavingblog.html", "parser_blog"),
                "2019": ("ukcavingblog.html", "parser_blog"),
                "2022": ("ukcavingblog.html", "parser_blog"),
            }
</pre></code>
<h3 id="reimport">Rename and reimport</h3>
<p>Go to page http://expo.survex.com/controlpanel and use the "Export logbook to a different format" section to export
the year you are working on.
<p>
In expoweb/years/<em>current year</em>/ :
<ol>
<li>Rename logbook.html as logbook-original.html
<li>Rename logbook-new-format.html as logbook.html
</ol>
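<p>The two renames can also be scripted, which guards against clobbering an earlier "logbook-original.html" when you repeat the process. This is a minimal sketch with a hypothetical function name; it assumes all three files live directly in the year's folder.

```python
from pathlib import Path


def swap_in_exported_logbook(year_dir):
    """Keep logbook.html as logbook-original.html and promote the export."""
    year_dir = Path(year_dir)
    current = year_dir / "logbook.html"
    exported = year_dir / "logbook-new-format.html"
    original = year_dir / "logbook-original.html"

    # Check both preconditions before touching anything.
    if not exported.is_file():
        raise FileNotFoundError(f"no export found at {exported}")
    if original.exists():
        raise FileExistsError(f"{original} already exists; refusing to overwrite it")

    current.rename(original)   # logbook.html -> logbook-original.html
    exported.rename(current)   # logbook-new-format.html -> logbook.html
```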
<p>At the command line, re-import the logbook using <var>python databaseReset.py logbooks</var> and look for errors in the terminal as it does it.
<p>You have now consolidated the blog into the logbook, and put all the entries in date order too.

<p>[ The true expert will edit parsers/imports.py to make the databaseReset option 'logbook' just do the year you are working on.]

<h3 id="tidy">Oddities</h3>
<p>With the blog, we have well known expoers labelled as unrecognized because while they posted to the blog, they were not actually on expo in that year. This is not a bug, but don't be confused by it.
<p>Somewhere the encode/decode process of exporting the content of the trip writeups is turning quote marks into question marks, and &gt; into &amp;gt;. Currently these are all being hand-edited to fix. The fault
is somewhere in the settings for rendering a dictionary using a Django template, and hard to find and fix.
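<p>Until the underlying fault is found, the double-escaped entities at least can be repaired mechanically. The sketch below is a stopgap, not a fix for the export itself: the function name is hypothetical and the list of entities it touches is an assumption. It cannot recover quote marks that have already become question marks.

```python
import re

# Matches an entity whose leading ampersand has itself been escaped,
# e.g. the 9 characters  & a m p ; g t ;  written without spaces.
# Only these four entity names are handled; extend the list with care.
DOUBLE_ESCAPED = re.compile(r"&amp;(gt|lt|quot|amp);")


def undo_double_escape(text):
    """Remove one level of escaping from double-escaped entities."""
    return DOUBLE_ESCAPED.sub(r"&\1;", text)
```

<p>A correctly escaped ampersand on its own is left untouched, so the function is safe to run more than once only if the text contains no deliberately double-escaped entities.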

<h4>Fixing dates and trips</h4>
<p>It is noticeable that a single blog post may cover several trips, and that the blog post date may be several days after the trip(s). So you need to manually find out the exact date of the trip (from the other trip records and particularly from the Bier Book) and change the date on the entry.

<p>One blog post may also need to be split into several entries - don't worry about the 'id=' string as the parser now rewrites these uniquely.

<p>When you split a blog post into different entries, the quickest way to put everything back in date order is to export the logbook and re-import it.

<h3 id="final">Finally: Separate out Training Weeks</h3>
<p>The very last thing to do is to edit 'logbook.html' to remove the pre-expo training events and to put them into a file 'training-weekends.html', and to edit the 'index.html' file to link to that as well as to the logbook itself. This is necessary for the pre-expo material to get indexed by the server free-text indexer and search engine.

<hr />
<p>
Back to <a href="logbooks-parsing.html">Logbooks Import for Nerds</a> documentation.<br>
Back to <a href="logbooks-format.html">Logbook internal format</a> documentation.<br>
Back to <a href="../logbooks.html">Logbooks for Cavers</a> documentation.
<hr />

</body>
</html>