Data management updates

This commit is contained in:
Philip Sargent 2018-08-03 22:41:20 +02:00
parent 4762f57d53
commit 4bb4d41537
4 changed files with 165 additions and 10 deletions

handbook/datamgt.html Normal file
View File

@@ -0,0 +1,50 @@
<html>
<head>
<title>CUCC Expo Handbook - Data Management</title>
<link rel="stylesheet" type="text/css" href="../css/main2.css" />
</head>
<body>
<h2 id="tophead">CUCC Expedition Handbook - Data Management</h2>
<h1>Why cavers need effective data management</h1>
<p>
Cave exploration is more data-intensive than any other sport. The only way to "win" at this
sport is to bring back large quantities of interesting survey, and possibly photos or scientific
data. Aside from the data collection requirements of the game itself, setting up a game (an
expedition) of cave exploration often involves collection of personal information ranging from
dates available to medical information to the desire to purchase an expedition t-shirt.
<p>
If an expedition will only happen once, low-tech methods are usually adequate to record
information. Any events that need to be recorded can go in a logbook. Survey notes must be
turned into finished cave sketches, without undue concern for the future expansion of those sketches.
<p>
However, many caving expeditions are recurring, and managing their data is a more challenging
task. For example, let us discuss annual expeditions. Every year, for each cave explored, a list
of unfinished leads (which will be called "Question Marks" or "QMs" here) must be maintained to
record what has and has not been investigated. Each QM must have a unique id, and information
stored about it must be easily accessible to future explorers of the same area. Similarly, on
the surface, a "prospecting map" showing which entrances have been investigated needs to be
produced and updated at least after every expedition, if not more frequently.
<p>
These are only the minimum requirements for systematic cave exploration on an annual expedition.
There is no limit to the set of data that would be "nice" to have collected and organized
centrally. An expedition might collect descriptions of every cave and every passage within every
cave. Digital photos of cave entrances could be useful for re-finding those entrances. Scans of
notes and sketches provide good backup references in case a question arises about a finished
survey product, and recording who did which survey work when can greatly assist the workflow,
for example enabling the production of a list of unfinished work for each expedition member. The
expedition might keep an inventory of their equipment or a catalog of their library. Entering
the realm of the frivolous, an expedition might store mugshots and biographies of its members,
or even useful recipes for locally available food. The more of this information the expedition
wishes to keep, the more valuable an effective and user-friendly system of data management becomes.
<p><em>From "<a href="../../troggle/docsEtc/troggle_paper.odt" download>
Troggle: a novel system for cave exploration information management</a>", by Aaron Curtis, CUCC.</em>
<hr />
</body>
</html>

View File

@@ -58,6 +58,9 @@
<dt><a href="meteo.htm">Weather</a></dt>
<dd>Unpredictable in the mountains. Local thunderstorms with rapid run-off are the biggest danger.</dd>
<dt><a href="datamgt.html">Cave data management</a></dt>
<dd>The biggest surprise for new people on expo is the intense effort that goes into recording and managing cave data. This page tells you why.</dd>
<dt><a href="look4.htm">Prospecting</a></dt>
<dd>The printable <a href="/prospecting_guide/">new prospecting guide (slow to load)</a> is a list of all known cave entrances and is essential reading before you wander the plateau stumbling across holes of potential interest. <br><br>
Now read <a href="look4.htm">how to do plateau prospecting</a>.<br><br>

View File

@@ -1,12 +1,17 @@
<html>
<head>
<title>CUCC Expedition Handbook: The Website</title>
<link rel="stylesheet" type="text/css" href="../../css/main2.css" />
<link rel="stylesheet" type="text/css" href="../css/main2.css" />
</head>
<body>
<h2 id="tophead">CUCC Expedition Handbook</h2>
<h1>Expo Website Manual</h1>
<p>The website is now large and complicated with a lot of aspects.
This handbook section contains info at various levels:
simple 'How to add stuff' information for the typical expoer,
more detailed info for cloning it onto your own machine for more significant edits,
and structural info on how it's all put together for people who want/need to change things.
[This manual is now so big that it is being restructured and split up. Much of it is obsolete.]</p>
<p>We have <a href="http://wookware.org/talks/expocomputer/#/">an Overview Presentation</a> on how the cave data, handbook and website are constructed and managed. It contains material which will be merged into this website manual.

View File

@@ -5,22 +5,119 @@
</head>
<body>
<h2 id="tophead">CUCC Expedition Handbook</h2>
<h1>EXPO Data Management History</h1>
<h2>History in review</h2>
<p>
Over 32 years, CUCC has developed methods for handling such information. Refinements in data
management were made necessary by improved quantity and quality of survey; but refinements in
data management also helped to drive those improvements. The first CUCC Austria expedition, in
1976, produced only Grade 1 survey for the most part (see the <a href="http://expo.survex.com/years/1977/report.htm">
Cambridge Underground 1977 report</a>). In
the 1980s, the use of programmable calculators to calculate survey point position from compass,
tape, and clinometer values helped convince expedition members to conduct precise surveys of
every cave encountered. Previously, such calculations required hours with slide rules or log tables. On
several expeditions, such processing was completed after the expedition by a FORTRAN program
running on shared mainframe time. BASIC programs running on personal computers took over with
the release of the BBC Micro and then the Acorn A4.
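<p>For the curious, the per-leg arithmetic these programs automated is a simple
polar-to-Cartesian conversion. Below is a minimal sketch in Python (a modern rendering
for illustration only; the function name is ours, not from any historical program):
<pre>
import math

def leg_offsets(tape, compass, clino):
    # One survey leg: tape length in metres, compass bearing and
    # clinometer angle in degrees. Returns (east, north, up) offsets.
    horizontal = tape * math.cos(math.radians(clino))
    east = horizontal * math.sin(math.radians(compass))
    north = horizontal * math.cos(math.radians(compass))
    up = tape * math.sin(math.radians(clino))
    return east, north, up

# Example: a 5.23 m leg on bearing 087, inclined at -3 degrees
print(leg_offsets(5.23, 87.0, -3.0))
</pre>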
<p>In the 1990s, Olly Betts and Wookey began
work on "<a href="http://www.survex.com">Survex</a>", a
program in C for the calculation and 3-D visualization of centrelines, with
intelligent loop closure processing. Julian Todd's Java program "Tunnel" facilitated the
production of attractive, computerized passage sketches from Survex centreline data and scanned
hand-drawn notes.
<p>Along with centrelines and sketches, descriptions of caves were also affected by improvements
in data management. In a crucial breakthrough, Andrew Waddington introduced the use of the
nascent markup language HTML to create an interlinked, navigable system of descriptions. Links
in HTML documents could mimic the branched and often circular structure of the caves themselves.
For example, the reader could now follow a link out of the main passage into a side passage, and
then be linked back into the main passage description at the point where the side passage
rejoined the main passage. This elegant use of technology enabled and encouraged expedition
members to better document their exploration.
<p>To organize all other data, such as lists of caves and their explorers, expedition members
eventually wrote a number of scripts which took spreadsheets (or comma-separated value
files, .CSV) of information and produced webpages in HTML. Other scripts also used information
from Survex data files. Web pages for each cave as well as the indexes which listed all of the
caves were generated by one particularly powerful script, <em>make-indxal4.pl</em>. The same data was
used to generate a prospecting map as a JPEG image. The system of automatically generating
webpages from data files reduced the need for repetitive manual HTML coding. Centralized storage
of all caves in a large .CSV file with a cave on each row made the storage of new information
more straightforward.
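<p>As a rough illustration (not the actual <em>make-indxal4.pl</em>, which was Perl, and
with hypothetical file and column names), such a generator amounts to a loop of this
shape, sketched here in Python:
<pre>
import csv

# One cave per CSV row; emit a page per cave and a single index page.
with open("caves.csv", newline="") as f:
    rows = list(csv.DictReader(f))

with open("index.html", "w") as index:
    index.write("&lt;ul&gt;\n")
    for row in rows:
        with open("cave-%s.html" % row["number"], "w") as page:
            page.write("&lt;h1&gt;%s %s&lt;/h1&gt;\n" % (row["number"], row["name"]))
            page.write("&lt;p&gt;%s\n" % row["description"])
        index.write('&lt;li&gt;&lt;a href="cave-%s.html"&gt;%s&lt;/a&gt;\n'
                    % (row["number"], row["name"]))
    index.write("&lt;/ul&gt;\n")
</pre>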
<p>Another important element of this system was version control. The entire data structure was
stored initially in a Concurrent Versions System (CVS) repository, and later migrated to
Subversion. Any edits to the spreadsheets which caused the scripts to fail, breaking the
website, could be easily reversed.
<p>However, not all types of data could be stored in spreadsheets or survey files. In order to
display descriptions on the webpage for an individual cave, the entire description, written in
HTML, had to be typed into a spreadsheet cell. A spreadsheet cell makes for an extremely awkward
HTML editing environment. To work around this problem, descriptions for large caves were written
manually as a tree of HTML pages, and the main cave page then contained only a link to them.
<p>A less obvious but more deeply rooted problem was the lack of relational information. One
table named <em>folk.csv</em> stored names of all expedition members, the years in which they were
present, and a link to a biography page. This was great for displaying a table of members by
expedition year, but what if you wanted to display a list of people who wrote in the logbook
about a certain cave in a certain expedition year? Theoretically, all of the necessary
information to produce that list has been recorded in the logbook, but there is no way to access
it because there is no connection between the person's name in <em>folk.csv</em> and the entries they wrote
in the logbook.
<p>The only way that relational information was stored in our csv files was by putting
references to other files into spreadsheet cells. For example, there was a column in the main
cave spreadsheet, <em>cavetab2.csv</em>, which contained the path to the QM list for each cave. The
haphazard nature of the development of the "script and spreadsheet" method meant that every cave
had an individual system for storing QMs. Without a standard system, it was sometimes unclear
how to correctly enter data.
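<p>The logbook-authors question above is exactly what a relational store answers. As a
hedged sketch in Django (model and field names are illustrative, not Troggle's actual
schema), an explicit link makes it a one-line query:
<pre>
from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=100)

class LogbookEntry(models.Model):
    # The explicit cross-references that folk.csv could not express.
    author = models.ForeignKey(Person, on_delete=models.CASCADE)
    cave = models.CharField(max_length=20)
    year = models.IntegerField()

# Who wrote logbook entries about cave 161 in 1999?
# Person.objects.filter(logbookentry__cave="161", logbookentry__year=1999)
</pre>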
<p><em>From "<a href="../../troggle/docsEtc/troggle_paper.odt" download>
Troggle: a novel system for cave exploration information management</a>", by Aaron Curtis, CUCC.</em>
<hr />
<h2>History in summary</h2>
<p>The CUCC Website, which publishes the cave data, was originally created by
Andy Waddington in the early 1990s and was hosted by Wookey.
The version control system was <a href="https://www.nongnu.org/cvs/">CVS</a>. The whole site was just static HTML, carefully
designed to be RISCOS-compatible (hence the short 10-character filenames)
as both Wadders and Wookey were <a href="https://en.wikipedia.org/wiki/RISC_OS">RISCOS</a> people then.
Wadders wrote a huge amount of info collecting expo history, photos, cave data etc.</p>
<p>Martin Green added the <em>survtab.csv</em> file to contain tabulated data for many caves around 1999, and a
script to generate the index pages from it. Dave Loeffler added scripts and programs to generate the
prospecting maps in 2004. The server moved to Mark Shinwell's machine in the early
2000s, and the version control system was updated to <a href="https://subversion.apache.org/">subversion</a>.</p>
<p>In 2006 Aaron Curtis decided that a more modern set of generated, database-based pages
made sense, and so wrote Troggle.
Troggle uses Django to generate pages: it reads in all the logbooks and surveys,
and provides a nice way to access them and to enter new data.
It was separate for a while until Martin Green added code to merge the old static pages and
new troggle dynamic pages into the same site. Work on Troggle still continues sporadically.</p>
<p>After Expo 2009 the version control system was updated to hg (Mercurial),
because a distributed version control system makes a great deal of sense for expo
(the expedition is offline for a month or two each year, during which nearly all the year's edits happen).</p>
<p>The site was moved to Julian Todd's seagrass server (in 2010),
but the change from a 32-bit to 64-bit machine broke the website autogeneration code,
which was only fixed in early 2011, allowing the move to complete. The
data was split into separate repositories: the website,
troggle, the survey data, and the tunnel data. Seagrass was turned off at
the end of 2013, and the site has been hosted by Sam Wenham at the
university since Feb 2014. In 2018 we have 4 repositories; see <a href="update.htm">the website manual</a>.</p>
<p>In spring 2018 Sam, Wookey and Paul Fox updated the Linux version and the Django version to
something vaguely acceptable to the university computing service, and fixed all the problems that were then observed.</p>
<p>Return to:<br>
<a href="update.html">Website update</a><br>
<a href="expodata.html">Website developer information</a><br>