
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>CUCC Expedition Handbook: Logbook - importing Blog posts</title>
<link rel="stylesheet" type="text/css" href="../../css/main2.css" />
</head>
<body><style>body { background: #fff url(/images/style/bg-system.png) repeat-x 0 0 }</style>
<h2 id="tophead">CUCC Expedition Handbook</h2>
<h1>Blog Import</h1>

    <!-- Yes we need some proper context-marking here, breadcrumb trails or something. 
        Maybe a colour scheme for just this sequence of pages
    -->


<h2 id="import">Importing the UK Caving Blog into troggle</h2>
<p>This is straightforward but a bit time-consuming. You need a 
<a href="../troggle/troglaptop.html">Troggle software development machine</a> and need to be happy running
the Python program <var>databaseReset.py</var> at the command line. (The <var>expo laptop</var> is <em>not</em> a Troggle software development machine.)

<p><b>Simply:</b> we import all the logbook entries and blog posts for an expo into the database, then export them to a single file. This file is then used for future database resets.

<p>This is the online <a href="https://ukcaving.com/board/index.php?threads/cucc-austria-expedition-2022-blog.29712/">UK Caving Blog for Expo 2022</a>

<ol>
<li>Use a web browser to save the UK Caving blog to a file and sub-folder holding the images.
<li>Go to the image folder and make all photos smaller, and convert .png files to .jpg. Delete all non-image files.
<li>Edit the troggle import parser <var>troggle/parsers/logbooks.py</var> to include a line of code to do the import.
<li>Run <var>python databaseReset.py <b>logbooks</b></var> to import all the logbooks including the blog.
<li>Export all the logbook entries for the year to a single file <var>logbook-new-format.html</var> using the <var>expoadmin</var> control panel on troggle running locally on your machine.
<li>Rename the existing <var>logbook.html</var> as <var>logbook-original.html</var> and rename <var>logbook-new-format.html</var> as <var>logbook.html</var>.
<li>Comment out the additional line you put into <var>troggle/parsers/logbooks.py</var> to import the blog.
<li>Re-import all the logbooks.
<li>Tidy up oddities by hand-editing <var>logbook.html</var>: e.g. fix incidental &amp;amp; decodings, delete blog entry comments, and fix blog post author names.
<li>Re-import all the logbooks to check that it all looks good. (Several times in practice.)
<li>Commit and push the changes you made to the :expoweb: and :troggle: git repos.
<li>Log on to the server and do a complete database reset online.
</ol>

<p>It is easy to get lost in this process and forget where you were, especially if you are interrupted, so
it is handy to print out this page and tick off the steps as you do them.

<p>After step 4, the blog posts appear in the list of logbook entries in the troggle Expo page for the year, correctly dated, and with titles such as "Expo - UK Caving Blog post 3". 

<h3 id="gotcha">Future Gotcha</h3>
<p>The UK Caving Blog regularly upgrades its software, which completely changes the hidden structure of the posts. This last happened sometime between the 2017 and 2018 expos. When it happens again, the function 
<var>parser_blog(year, expedition, txt, sq="")</var> in <var>troggle/parsers/logbooks.py</var> will need to be completely rewritten. It is currently 70 lines long and uses several regular expression recognizers.

<h3 id="save">Saving the Blog</h3>
 <img src="blog-pages.jpg" hspace="20" align="right">
<ul>
<li>With your browser (this example uses Chrome), go to <a href="https://ukcaving.com/board/index.php?threads/cucc-austria-expedition-2022-blog.29712/">UK Caving Blog for Expo 2022</a>.
<li>Press <var>ctrl-S</var> and save as filename "ukcavingblog.html" in <var>:expoweb:/years/2022/</var>, where <var>:expoweb:</var> is where you keep your copy of the <var>:expoweb:</var> repo on your <a href="../troggle/troglaptop.html">Troggle software development laptop</a>.
<li>For 2022, the blog split the posts across two pages (see image). If that is the case for the year you are dealing with, navigate to the next page and save again, this time with the filename "ukcavingblog2.html". Our existing troggle code handles up to 4 of these, numbered sequentially.
</ul>
<p>Now delete all the non-image files in the "ukcavingblog_files/" and "ukcavingblog2_files/" folders.
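<p>If you would rather script that clean-up than do it by hand, the following is a minimal sketch rather than anything that exists in troggle. The function names and the list of suffixes treated as images are assumptions for illustration. It defaults to a dry run so that you can check the list before anything is deleted.

```python
from pathlib import Path

# Assumption: these are the only suffixes worth keeping in the saved folders.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"}


def non_image_files(folder: Path) -> list[Path]:
    """Return the files directly inside folder that are not images, sorted by name."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() not in IMAGE_SUFFIXES
    )


def delete_non_images(folder: Path, dry_run: bool = True) -> list[Path]:
    """List the non-image files, and delete them unless dry_run is True."""
    doomed = non_image_files(folder)
    if not dry_run:
        for p in doomed:
            p.unlink()
    return doomed
```

<p>For example, <var>delete_non_images(Path("ukcavingblog_files"))</var> only reports what would go; pass <var>dry_run=False</var> once the list looks right, and repeat for "ukcavingblog2_files/" if it exists.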
<p>Now use your favourite photo editor (e.g. Irfanview on Windows) or a command-line tool to resize all the photos to a maximum of 600 pixels wide or high, or to 400 or 300 pixels wide if the image quality is poor. Keep the same filename so that you don't have to try to edit the horrendously horrible HTML which was generated by the blog software. If there are any .png files, convert them to .jpg.
<p>Look at all the photos in the file browser set to show thumbnails and delete all advertising logos etc., and delete the UK Caving header image, which will show random people, not us.
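<p>The size limit is just a proportional scaling of the longer side. As a worked illustration only, the hypothetical helper below computes the target dimensions for a photo. The resizing itself is still done in your photo editor or command-line tool.

```python
def fit_within(width: int, height: int, max_side: int = 600) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_side pixels.

    Images that are already small enough are returned unchanged,
    and the aspect ratio is preserved.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    # Multiply before dividing to keep the arithmetic as exact as possible.
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height
```

<p>So a 4000 x 3000 photo becomes 600 x 450, and <var>fit_within(1000, 2000, max_side=400)</var> gives 200 x 400 for a poor-quality portrait image.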

<h3 id="code1">Edit logbooks.py</h3>
<h3 id="export">Exporting all entries to a new file</h3>
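<p>The renaming in step 6 of the list above is easy to get backwards. A minimal sketch of it follows, assuming that both <var>logbook.html</var> and the exported <var>logbook-new-format.html</var> sit in the same year folder. The function name is hypothetical and this is not part of troggle.

```python
from pathlib import Path


def swap_in_new_logbook(year_dir: Path) -> None:
    """Keep the old logbook as logbook-original.html and promote the export."""
    current = year_dir / "logbook.html"
    exported = year_dir / "logbook-new-format.html"
    original = year_dir / "logbook-original.html"

    if not exported.exists():
        raise FileNotFoundError(f"No exported logbook found at {exported}")
    if original.exists():
        # Refuse to overwrite a backup made on an earlier attempt.
        raise FileExistsError(f"{original} already exists")

    if current.exists():
        current.rename(original)
    exported.rename(current)
```

<p>The two checks at the top mean that running it twice by accident fails loudly instead of overwriting the original hand-written logbook.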
<h3 id="code2">Edit logbooks.py again</h3>
<h3 id="tidy">Tidy oddities</h3>
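<p>The tidying is done by hand, but a script can help you find where to look. The sketch below assumes that the oddities in question include doubly-encoded HTML entities (for example an ampersand that has been encoded twice). That is a guess at what the export produces, and the function name is hypothetical. It only reports suspicious lines and does not change the file.

```python
import re

# Assumption: a doubly-encoded entity looks like "&amp;" followed directly
# by the rest of another entity, e.g. "&amp;amp;" or "&amp;quot;".
DOUBLE_ENCODED = re.compile(r"&amp;(amp|lt|gt|quot|#\d+);")


def find_double_encoded(text: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each line with a double encoding."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if DOUBLE_ENCODED.search(line):
            hits.append((lineno, line.strip()))
    return hits
```

<p>Feed it the contents of <var>logbook.html</var> and work through the reported line numbers in your editor.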
<hr />
<p>
Back to <a href="logbooks-parsing.html">Logbooks Import for Nerds</a> documentation.<br>
Back to <a href="../logbooks.html">Logbooks for Cavers</a> documentation.
<hr />

</body>
</html>
New docum. and logbook/blog update 2022-12-17 18:34:39 +00:00