<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>CUCC Expedition Handbook: Website History</title>
<link rel="stylesheet" type="text/css" href="../css/main2.css" />
</head>
<body>
<h2 id="tophead">CUCC Expedition Handbook</h2>
<h1>EXPO Data Management History</h1>

<div style="text-align:left">
<!-- Comment
Justified text is hard to read:
https://designshack.net/articles/mobile/the-importance-of-designing-for-readability/
https://designforhackers.com/blog/justify-text-html-css/
-->
<style>
summary::-webkit-details-marker {
    color: red;
    font-size: 200%;
}
summary {
    color: #005151;
    font-family: Tahoma,'Trebuchet MS','Lucida Grande',Verdana,Arial,Sans-Serif;
    font-size: 12pt;
    margin-block-start: 1.33em;
    margin-block-end: 1.33em;
    margin-inline-start: 0px;
    margin-inline-end: 0px;
    font-weight: bold;
}
</style>
<h2>Early history</h2>
<p>
Over more than four decades, CUCC has developed methods for handling cave survey data and related expedition information. Refinements in data
management were made necessary by improved quantity and quality of survey; but refinements in
data management also helped to drive those improvements. The first CUCC Austria expedition, in
1976, produced only Grade 1 survey for the most part (ref <a href="http://expo.survex.com/years/1977/report.htm">
Cambridge Underground 1977 report</a>).
<p>In
the 1980s, the use of programmable calculators to calculate survey point position from compass,
tape, and clinometer values helped convince expedition members to conduct precise surveys of
every cave encountered. Previously, such work required hours of slide rule or log table work. On
several expeditions, such processing was completed after the expedition by a FORTRAN program
running on shared mainframe time. BASIC programs running on personal computers took over with
the release of the BBC Micro and then the Acorn A4. A full history of this period is described in
<a href="c21bs.html">Taking Expo Bullshit into the 21st Century</a> - a story of the data management system up to Spring 1996. [This was less than five years after Tim Berners-Lee published the world's very first web page on 6th August 1991. So the expo website is nearly as old as the web itself.]

<h3>Survex - cave surveying</h3>
<p>In the 1990s, Olly Betts and Wookey began
work on "<a href="computing/getsurvex.html">Survex</a>", a
program in C for the calculation and 3-D visualization of centerlines, with
intelligent loop closure processing. Julian Todd's Java program "Tunnel" facilitated the
production of attractive, computerized passage sketches from Survex centerline data and scanned
hand-drawn notes.
A <a href="survexhistory96.htm">history of survex</a> article covering the period 1988-1996 was published in Cambridge Underground 1996.

<h3>Initial cave data management</h3>
<p>Along with centrelines and sketches, descriptions of caves were also affected by improvements
in data management. In a crucial breakthrough, Andrew Waddington introduced the use of the
nascent markup language HTML to create an interlinked, navigable system of descriptions (see <a href="c21bs.html">"Expo Bullshit"</a>). Links
in HTML documents could mimic the branched and often circular structure of the caves themselves.
For example, the reader could now follow a link out of the main passage into a side passage, and
then be linked back into the main passage description at the point where the side passage
rejoined the main passage. This elegant use of technology enabled and encouraged expedition
members to better document their exploration.

<p>To organize all other data, such as lists of caves and their explorers, expedition members
eventually wrote a number of scripts which took spreadsheets (or comma-separated value
files, .CSV) of information and produced webpages in HTML. Other scripts also used information
from Survex data files. Web pages for each cave as well as the indexes which listed all of the
caves were generated by one particularly powerful script, <em>make-indxal4.pl</em>. The same data was
used to generate a prospecting map as a JPEG image. The system of automatically generating
webpages from data files reduced the need for repetitive manual HTML coding. Centralized storage
of all caves in a large .CSV file with a cave on each row made the storage of new information
more straightforward.
<hr />

<h2>HTML and the website</h2>

<a href="https://en.wikipedia.org/wiki/RISC_OS"><img class="onright" src="t/riscos.jpg" width="100px"/></a>
<p>From having a set of HTML files, it was a small step to publish a website. The CUCC Expo Website, which publishes the cave data, was originally created by
Andy Waddington in the early 1990s and was hosted by Wookey.
<details>
<summary>1999 scripts and spreadsheets</summary>
<p>The version control system was <a href="https://www.nongnu.org/cvs/">CVS</a>. The whole site was just static HTML, carefully
designed to be RISCOS-compatible (hence the short 10-character filenames)
as both Wadders and Wookey were <a href="https://en.wikipedia.org/wiki/RISC_OS">RISCOS</a> people then (in the early 1990s).
Wadders wrote a huge amount of info collecting expo history, photos, cave data etc.</p>

<p>Martin Green added the <em>survtab.csv</em> file to contain tabulated data for many caves around 1999, and a
script to generate the index pages from it. Dave Loeffler added scripts and programs to generate the
prospecting maps in 2004. The server moved to Mark Shinwell's machine in the early
2000s, and the version control system was updated to <a href="https://subversion.apache.org/">subversion</a>.</p>
</details>

<details>
<summary>Breaking out of spreadsheet cells into HTML</summary>
<p>Not all types of data could be stored in spreadsheets or survey files. In order to
display descriptions on the webpage for an individual cave, the entire description, written in
HTML, had to be typed into a spreadsheet cell. A spreadsheet cell makes for an extremely awkward
HTML editing environment. To work around this problem, descriptions for large caves were written
manually as a tree of HTML pages and then the main cave page only contained a link to them.


<p>A less obvious but more deeply rooted problem was the lack of relational information. One
table named <em>folk.csv</em> stored names of all expedition members, the years in which they were
present, and a link to a biography page. This was great for displaying a table of members by
expedition year, but what if you wanted to display a list of people who wrote in the logbook
about a certain cave in a certain expedition year? Theoretically, all of the necessary
information to produce that list had been recorded in the logbook, but there was no way to access
it because there was no connection between the person's name in <em>folk.csv</em> and the entries they wrote
in the logbook.


<p>The only way that relational information was stored in our csv files was by putting
references to other files into spreadsheet cells. For example, there was a column in the main
cave spreadsheet, <em>cavetab2.csv</em>, which contained the path to the QM list for each cave. The
haphazard nature of the development of the "script and spreadsheet" method meant that every cave
had an individual system for storing QMs. Without a standard system, it was sometimes unclear
how to correctly enter data.

<p><em>From "<a href="/expofiles/documents/troggle/troggle2020.pdf" download>
Troggle: a revised system for cave data management</a>", by Philip Sargent and Aaron Curtis, CUCC [with some additions]</em>.

<em>Original (2006) paper: "<a href="/expofiles/documents/troggle/troggle_paper.pdf" download>
Troggle: a novel system for cave exploration information management</a>", by Aaron Curtis</em>.
</details>

<details>
<summary>HTML and browser incompatibilities</summary>
<p>This was a much bigger problem in the past than it is possible to imagine now
(most of the early editing was done on an Acorn platform using !Zap). So much so
that we have a separate page all about it: "<a href="computing/useany.html">Platform portability</a> - making the website work widely".
</details>

<details>
<summary>Version control</summary>
<p>Another important element of this system was <a href="computing/repos.html">version control</a>. The entire data structure was
stored initially in a <a href="https://en.wikipedia.org/wiki/Concurrent_Versions_System">Concurrent Versions System</a> repository, and later migrated to
<a href="https://en.wikipedia.org/wiki/Apache_Subversion">Subversion</a> [<em>now using <a href="computing/repos.html">git</a> in 2020</em>].
Any edits to the spreadsheets which caused the scripts to fail, breaking the
website, could be easily reversed.
</details>


<details>
<summary>2006 and troggle</summary>
<p>In 2006 Aaron Curtis decided that a more modern set of generated, database-backed pages
made sense, and so wrote Troggle.
Troggle uses Django to generate pages: it reads in all the logbooks and surveys and provides
a convenient way to access them and to enter new data.
It ran separately for a while until Martin Green added code to merge the old static pages and
new troggle dynamic pages into the same site. This is now the live system running everything (in 2019). Work on developing Troggle further still continues (see <a href="troggle/trogintro.html">Troggle intro</a>).</p>

<p>After Expo 2009 the version control system was updated to a <a href="computing/onlinesystems.html#mercurial">DVCS</a> (Mercurial, aka 'hg'),
because a distributed version control system makes a great deal of sense for expo
(which is offline for a month or two, just when nearly all the year's edits happen).</p>

<p>The site was moved to Julian Todd's seagrass server (in 2010),
but the change from a 32-bit to 64-bit machine broke the website autogeneration code,
which was only fixed in early 2011, allowing the move to complete. The
data was split into separate repositories: the website,
troggle, the survey data, the tunnel data. Seagrass was turned off at
the end of 2013, and the site has been hosted by Sam Wenham at the
university since Feb 2014.

<p><em>From "<a href="/expofiles/documents/troggle/troggle2020.pdf" download>
Troggle: a revised system for cave data management</a>", by Philip Sargent and Aaron Curtis, CUCC [with some additions]</em>.

<em>Original (2006) paper: "<a href="/expofiles/documents/troggle/troggle_paper.pdf" download>
Troggle: a novel system for cave exploration information management</a>", by Aaron Curtis</em>.
</details>


<details>
<summary>2018 Four repositories</summary>
<p>In 2018 we had 4 repositories, 2 mercurial and 2 git:

<ul>
 <li><a href="/hgrepositories/home/expo/loser/graph/">loser</a> - the survex cave survey data (hg)</li>
 <li><a href="/repositories/drawings/.git/log">drawings</a> - the tunnel and therion cave data and drawings (git)</li>
 <li><a href="/hgrepositories/home/expo/expoweb/graph">expoweb</a> - the website pages, handbook, generation scripts (hg)</li>
 <li><a href="/repositories/troggle/.git/log">troggle</a> - the database/software part of the survey data management system - see <a href="troggle/trogintro.html">notes on troggle</a> for further explanation (git)</li>
</ul>

<p>In spring 2018 Sam, Wookey and Paul Fox updated the  Linux version and the Django version (i.e. troggle) to
something vaguely acceptable to the university computing service and fixed all the problems that were then observed.
</details>


<details>
<summary>Early 2019 - Leaving the Computing service</summary>
<p>In early 2019 the university computing service upgraded its firewall rules which took the
server offline completely.
<p>
Wookey eventually managed to find us free space (a virtual machine)
on a debian mirror server somewhere in Leicestershire (we think).
This move to a different secure server means that all ssh to the server now needs to use cryptographic keys tied to individual machines. There is an expo-nerds email list (all mailing lists are now hosted on wookware.org as the university list system restricted what non-Raven-users could do) to coordinate server fettling.

<p>At the beginning of the 2019 expo two repos had been moved from mercurial to git: troggle and drawings (formerly called tunneldata).

</details>

<details>
<summary>Wookey: July 2019</summary>

<p>The troggle software has been migrated to git, and the old erebus and cvs branches (pre 2010) removed. Some decrufting was done to get rid of log files, old copies of embedded javascript (codemirror, jquery etc) and some fat images no longer used.
<p>
The tunneldata repo has also been migrated to git, and renamed 'drawings' as it includes therion data too these days.
<p>
The loser repo and expoweb repo need more care in hg->git migration (expoweb is the website content - which is published by troggle). Loser should have the old 1999-2004 CVS history restored, and maybe Tom's annual snapshots from before that, so that ancient history can be researched (which is sometimes useful). It's also a good idea to add the 2015, 2016 and 2017 ARGE data we got (in 2017) in the correct years so that it's possible to go back to an 'end of this year' checkout and get an accurate view of what was found (for making plots and length stats). All of that requires some history rewriting, which is best done at the time of conversion.
<p>
Similarly expoweb is full of bloat from fat images and surveys and one 82MB thesis that got checked in and then removed. Clearing that out is a good idea. I have a set of 'unused fat blob' lists which can be stripped out with git-filter. It's not hard to make a 'do the conversion' script, ready for sometime after expo 2019 has calmed down.
</details>

<details>
<summary id='#may2020'>May 2020 and django</summary>
<p>
Wookey has now moved 'expoweb' from mercurial to git largely "as-is". Mark Shinwell has said that he will help on the loser (survex files) migration to git.
<p>In May we were on django 1.7 and python 2.7.17. Sam continued to work on upgrading django from v1.7. We wanted to upgrade django as quickly as possible because old versions of django had unpatched security issues.
[Upgrading to later django versions <a href="troggle/trogdjangup.html">is a real pig</a> - not helped by the fact that all the tools to help do it are now out of date for these very old django releases.]
<ul>
<li>"Django 1.11 is the last version to support Python 2.7. Support for Django 1.11 ends in 2020." see: <a href="https://docs.djangoproject.com/en/3.0/faq/install/">django versions</a>. You will notice that we are really outstaying our welcome here, especially as python2.7 was <a href="https://python-release-cycle.glitch.me/">declared dead in January</a> this year.

<li>For a table displaying the various versions of django and support expiry dates
see <a href="https://www.djangoproject.com/download/">the django download</a> page.
Django 1.7 expired in December 2015.
Django: <a href="https://www.djangoproject.com/download/#supported-versions">full deprecation timeline</a>.

<li>Ubuntu 20.04 came out on 23rd April but it does not support python2 at all. So we cannot use it for software maintenance (well, we can, but only using non-recommended software, which is what we are trying to get away from).
</ul>
<p>We planned to upgrade from django 1.7 to django 1.11, then port from python2 to python3 on
the same version of django, then upgrade to as recent a version of django as we could. But we have
discovered that django1.7 works just fine with  <a href="https://docs.djangoproject.com/en/1.10/topics/python3/">python3</a>, so we will move the development version to python3 during June and
then upgrade the server operating system from Debian <var>stretch</var> to <var>buster</var> before
tackling the next step: thinking deeply about when we migrate from django
<a href="troggle/trogdesign.html">to something else</a>.
</p>
    <p>
Enforced time at home under covid lockdown is giving us new impetus to write and restructure the documentation for everything.
</details>

<h3>More recent</h3>
<p>
For the current situation see <a href="troggle/trogstatus.html">expo systems status</a>.
<hr />
</div>
Return to<br />
<a href="computing/onlinesystems.html">expo online systems overview</a><br />

<hr />
</body>
</html>