updated re recent changes

Philip Sargent 2023-10-04 23:21:14 +03:00
parent 573c6d3f3b
commit 1750c5a623

@ -19,30 +19,43 @@
</ul> </ul>
<h2 id="why">Names: Why we need a change</h2> <h2 id="why">Names: Why it is a problem</h2>
<p>The <a href="#whatold">current system</a> completely fails with names which are in any way "non standard". <p>The <a href="#whatold">former system</a> completely failed with names which are in any way "non standard".
Troggle can't cope with a name not structured as Troggle couldn't cope with a name not structured as
"Forename Surname": where it is only two words and each begins with a capital letter (with no other punctuation, "Forename Surname": where it is only two words and each begins with a capital letter (with no other punctuation,
capital letters or other names or initials). capital letters or other names or initials).
<p>There are 19 people for whom the troggle name parsing and the separate <a href="scriptscurrent.html#folk">folklist script</a> parsing <p>There were 19 people for whom the troggle name parsing and the separate <a href="scriptscurrent.html#folk">folklist script</a> parsing
are different. Reconciling these (find easily using a link checker scanner on the were different.
folk/.index.htm file) is a job that needs to be done. Every name in the generated
index.htm now has a hyperlink which goes to the troggle page about that person. Except
for those 19 people.
This has to be fixed as it affects ~5% of our expoers.
<p><em>[This document originally written 31 August 2022]</em>
<h2 id="maint">Names: Maintenance constraints</h2> <h2 id="maint">Names: Maintenance constraints</h2>
<p>We have special code scattered across troggle to cope with "Wookey", "Wiggy" and "Mike the Animal". This is a pain to maintain. <p>We have special code scattered across troggle to cope with "Wookey", "Wiggy" and "Mike the Animal". This is a pain to maintain.
<h2 id="whatold">Names: How it works now</h2> <h2 id="whatold">Names: How it works</h2>
<p>Fundamentally we have regexes detecting whether something is a name or not - in several places. These should all be replaced by properly delimited strings. <p>Fundamentally we have regexes detecting whether something is a name or not - in several places in the different types of raw data. However we do now use unique 'slugs' for the references between pages (since Sept. 2023).
<h4>Four different bits</h4> <h4>Four different bits</h4>
<ul> <ul>
<li>In <var>urls.py</var> we have <li>We have the <a href="scriptscurrent.html#folk">folklist script</a> holding "Forename Surname (nickname)" and "Surname" as the first two columns in the CSV file.
These are used by the standalone script to produce the <var>/folk/index.html</var> which is run manually, and which is also parsed by troggle (by a regex in <var>
parsers/people.py</var>) only when a full data import is done. Which is a problem for people like <var>Lydia-Clare Leather</var> and various 'von' and 'de' middle
'names', McLean, MacLeod and McAdam.
<li>We have the <var>*team notes Becka Lawson</var> lines in all our survex files which are parsed (by regexes in <var> parsers/survex.py</var>) when a full data import is done (or when a survex file is edited online).
<li>We have the <var>&lt;div class="trippeople"&gt;&lt;u&gt;Luke&lt;/u&gt;, Hannah&lt;/div&gt;</var> trip people line in each logbook entry.
These are recognised by a regex in <var>parsers/logbooks.py</var> when a full data import is done (or when a logbook entry is edited online).
<li>We have the names of people in a list on a wallet: which is necessary when the wallet has no attached survex file. But even when there are (one or more) attached survex files, there is a place to input a list of people's names as well. This is parsed by <var>parsers/scans.py</var>.
</ul>
<p>Frankly it's amazing it even appears to work at all.
<p>
In <var>urls.py</var> we used to have
<code> <code>
re_path(r'^person/(?P<first_name>[A-Z]*[a-z\-\'&;]*)[^a-zA-Z]*(?P<last_name>[a-z\-\']*[^a-zA-Z]*[\-]*[A-Z]*[a-zA-Z\-&;]*)/?', person, name="person"), re_path(r'^person/(?P<first_name>[A-Z]*[a-z\-\'&;]*)[^a-zA-Z]*(?P<last_name>[a-z\-\']*[^a-zA-Z]*[\-]*[A-Z]*[a-zA-Z\-&;]*)/?', person, name="person"),
@ -50,23 +63,19 @@ This has to be fixed as it affects ~5% of our expoers.
re_path('wallets/person/(?P<first_name>[A-Z]*[a-z\-\'&;]*)[^a-zA-Z]*(?P<last_name>[a-z\-\']*[^a-zA-Z]*[\-]*[A-Z]*[a-zA-Z\-&;]*)/?', walletslistperson, name="walletslistperson"), re_path('wallets/person/(?P<first_name>[A-Z]*[a-z\-\'&;]*)[^a-zA-Z]*(?P<last_name>[a-z\-\']*[^a-zA-Z]*[\-]*[A-Z]*[a-zA-Z\-&;]*)/?', walletslistperson, name="walletslistperson"),
</code> </code>
where the transmission noise is attempting to recognise a name and split it into &lt;first_name&gt; and &lt;last_name&gt;.
Naturally this fails horribly even for relatively straightforward names such as <em>Ruairidh MacLeod</em>.
<li>We have the <a href="scriptscurrent.html#folk">folklist script</a> holding "Forename Surname (nickname)" and "Surname" as the first two columns in the CSV file. where the 'transmission noise' is attempting to recognise a name and split it into &lt;first_name&gt; and &lt;last_name&gt;.
These are used by the standalone script to produce the <var>/folk/index.html</var> which is run manually, and which is also parsed by troggle (by a regex in <var> Naturally this failed horribly even for relatively straightforward names such as <em>Ruairidh MacLeod</em>.
parsers/people.py</var>) only when a full data import is done. Which it gets wrong for people like <var>Lydia-Clare Leather</var> and various 'von' and 'de' middle <p>
'names', McLean, MacLeod and McAdam. We now [October 2023] have
<code>
<li>We have the <var>*team notes Becka Lawson</var> lines in all our survex files which are parsed (by regexes in <var> parsers/survex.py</var>) only when a full data path('person/&lt;slug:slug&gt;', person, name="person"),<br />
import is done.
path('personexpedition/&lt;slug:slug&gt;/&lt;int:year&gt;', personexpedition, name="personexpedition"),<br />
path('wallets/person/&lt;slug:slug&gt;', walletslistperson, name="walletslistperson"),
<li>We have the <var>&lt;div class="trippeople"&gt;&lt;u&gt;Luke&lt;/u&gt;, Hannah&lt;/div&gt;</var> trip people line in each logbook entry. </code>
These are recognised by a regex in <var>parsers/logbooks.py</var> only when a full data import is done. which is a lot easier to maintain.
</ul>
<p>Frankly it's amazing it even appears to work at all.
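<p>To make the splitting behaviour of the old <var>urls.py</var> pattern concrete, here is a minimal sketch using only Python's <var>re</var> module. The pattern is the first one quoted above; the helper name <var>split_name</var> and the assumption that the URL path carries the name with a space in it are illustrative only and are not taken from troggle.

```python
import re

# The old 'person' pattern quoted above, copied unchanged.
OLD_PERSON_RE = re.compile(
    r"^person/(?P<first_name>[A-Z]*[a-z\-\'&;]*)[^a-zA-Z]*"
    r"(?P<last_name>[a-z\-\']*[^a-zA-Z]*[\-]*[A-Z]*[a-zA-Z\-&;]*)/?"
)


def split_name(path):
    """Hypothetical helper: return (first_name, last_name) as the old pattern splits them."""
    m = OLD_PERSON_RE.search(path)
    if m is None:
        return None
    return (m.group("first_name"), m.group("last_name"))


# A hyphenated forename is cut at the second capital letter,
# and the real surname is never captured at all.
print(split_name("person/Lydia-Clare Leather"))  # ('Lydia-', 'Clare')

# A single-word name leaves last_name empty.
print(split_name("person/Wookey"))  # ('Wookey', '')
```

<p>Neither result is a usable forename and surname pair, which is the sort of case the special-case code had to paper over.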
<h4>Troggle folk data importing</h4> <h4>Troggle folk data importing</h4>
<p> <p>
@ -91,11 +100,6 @@ and trying to fix this breaks something else (weirdly, not fully investigated).
There seems to be a problem with importing blurbs with more than one image file, even though the code There seems to be a problem with importing blurbs with more than one image file, even though the code
in people.py only looks for the first image file but then fails to use it.]</ul> in people.py only looks for the first image file but then fails to use it.]</ul>
<h4>Proposal</h4>
<p>I would start by replacing the recognisers in <var>urls.py</var> with a slug for an arbitrary text string, and interpreting it in the python code handling the page.
This would entail replacing all the database parsing bits to produce the same slug in the same way.
<p>At that point we should get the 19 people into the system even if all the other crumdph is still there.
Then we take a deep breath and look at it all again.
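<p>The requirement that every parser should "produce the same slug in the same way" can be sketched as one shared helper that all the parsers call. This is a sketch under stated assumptions, not troggle's actual code: the function name <var>name_to_slug</var> is hypothetical, and the only fixed constraint assumed is that the result must be accepted by Django's built-in <var>slug</var> path converter (ASCII letters, digits, hyphens and underscores).

```python
import re
import unicodedata

# Characters accepted by Django's built-in 'slug' path converter.
SLUG_CONVERTER_RE = re.compile(r"[-a-zA-Z0-9_]+")


def name_to_slug(name):
    """Hypothetical shared helper: turn a free-text name into a URL-safe slug."""
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Lowercase and remove punctuation other than hyphens and whitespace.
    cleaned = re.sub(r"[^\w\s-]", "", ascii_name.lower())
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", cleaned).strip("-_")


for name in ("Ruairidh MacLeod", "Lydia-Clare Leather", "Mike the Animal"):
    slug = name_to_slug(name)
    # Every slug must be routable by a path such as 'person/<slug:slug>'.
    assert SLUG_CONVERTER_RE.fullmatch(slug)
    print(name, "->", slug)
```

<p>A helper like this does not by itself guarantee that slugs are unique: two people whose names reduce to the same slug would still need to be told apart, for example by a check at import time.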
<h2 id="otherfolk">Folk: pending possible improvements</h2> <h2 id="otherfolk">Folk: pending possible improvements</h2>