switch to html2text() instead of strip_tags() when preparing FTS index

Andrew Dolgov
2023-10-21 10:51:24 +03:00
parent 2b61052e87
commit 03e956132d
73 changed files with 27833 additions and 17 deletions
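For illustration only, here is a minimal Python sketch of why the switch matters for a full-text-search index. It is not the project's code: the commit itself only names `strip_tags()` and `html2text()`, and the `strip_tags` and `html_to_text` functions below are hypothetical stand-ins written for this example. The sketch assumes that tag stripping simply deletes markup, while an HTML-to-text conversion keeps block boundaries and renders links as `[text](url)`, which is the shape of the expected outputs in the fixtures that follow.

```python
import re
from html.parser import HTMLParser


def strip_tags(html: str) -> str:
    """Hypothetical stand-in for tag stripping: delete anything tag-shaped."""
    return re.sub(r"<[^>]*>", "", html)


class _LinkKeepingText(HTMLParser):
    """Collects text, breaks lines on block tags, keeps link targets."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.in_link = False
        self.href = None
        self.link_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.href = dict(attrs).get("href")
            self.link_text = []
        elif tag in ("p", "div", "br"):
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            text = "".join(self.link_text).strip()
            if self.href and text == self.href:
                self.parts.append(self.href)              # bare URL, no brackets
            elif self.href:
                self.parts.append(f"[{text}]({self.href})")
            else:
                self.parts.append(f"[{text}]")            # anchor without href
            self.in_link = False
        elif tag in ("p", "div"):
            self.parts.append("\n")

    def handle_data(self, data):
        (self.link_text if self.in_link else self.parts).append(data)


def html_to_text(html: str) -> str:
    """Hypothetical stand-in for an HTML-to-text conversion (assumed rules)."""
    parser = _LinkKeepingText()
    parser.feed(html)
    parser.close()
    lines = (re.sub(r"[ \t]+", " ", line).strip()
             for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line)


sample = '<p>Hello, <b>World</b>!</p><p><a href="http://foo.com">A link</a></p>'
print(strip_tags(sample))    # words from adjacent blocks run together, URL is lost
print(html_to_text(sample))  # blocks stay on separate lines, URL is kept
```

With plain tag stripping, the two paragraphs fuse into `World!A link` and the link target disappears, so neither the word boundary nor the URL would reach the index. The conversion-style output keeps both, at the cost of a slightly longer indexed text.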
@@ -0,0 +1,5 @@
A document without any HTML open/closing tags.
---------------------------------------------------------------
We try and use the representation given by common browsers of the HTML document, so that it looks similar when converted to plain text. visit foo.com - or http://www.foo.com link
An anchor which will not appear
@@ -0,0 +1,5 @@
A document without any HTML open/closing tags.
---------------------------------------------------------------
We try and use the representation given by common browsers of the HTML document, so that it looks similar when converted to plain text. [visit foo.com](http://foo.com) - or http://www.foo.com [link](http://foo.com)
[An anchor which will not appear]
@@ -0,0 +1,15 @@
Hello, World!
This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.
Even mismatched tags.
A div
Another div
A div
within a div
Another line
Yet another line
A link
@@ -0,0 +1,15 @@
Hello, World!
This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.
Even mismatched tags.
A div
Another div
A div
within a div
Another line
Yet another line
[A link](http://foo.com)
@@ -0,0 +1,44 @@
Hello
> Nest some block quotes with preformated text
>
>> Here is the code
>>
>> #include <stdlib.h>
>> #include <stdio.h>
>>
>> int main(){
>> return 0;
>> };
>>
>> Put some tags at the end
>
> Some text and tags here
>
>> First line
>>
>> Header 1
>>
>> Some text
>> ---------------------------------------------------------------
>> Some more text
>>
>> Paragraph tag!
>>
>> Header 2
>>
>> ---------------------------------------------------------------
>>
>> Header 3
>>
>> Some text
>>
>> Header 4
>>
>>> More quoted text!
>>
>> Paragraph tag!
>>
>> Final line
Some ending text just to make sure
@@ -0,0 +1 @@
Hello
@@ -0,0 +1,53 @@
http://localhost/home 16 December 2015
Account 123
Hi Susan
Here is your cat report.
You have found 5 cats less than anyone else
[Find more cats](http://localhost/cats)
Down the road
Across the hall
Your achievements
You're currently finding about
12 cats
per day
[Number of cats found]
---------------------------------------------------------------
Your last cat was found two days ago.
One type of cat is a kitten.
Special account A1
12.345
http://localhost/logout
How can you find more cats?
Look in trash cans
Start meowing
Eat cat food
Some cats like to hang out in trash cans. Some cats do not. Some cats are attracted to similar tones. So one day your tears may smell like cat food, attracting more cats.
https://localhost/about https://localhost/about https://localhost/about
[Cats are great.](https://github.com/soundasleep/html2text_ruby) [Find more cats.](https://github.com/soundasleep/html2text_ruby) [Do more things.](https://github.com/soundasleep/html2text_ruby)
[Contact us](http://localhost/contact)
cats@cats.com
Monday and Friday
https://github.com/soundasleep/html2text https://github.com/soundasleep/html2text_ruby
Having trouble seeing this email? [View it online](http://localhost/view_it_online).
File diff suppressed because it is too large
@@ -0,0 +1,27 @@
One:
Two: [two]
Three: [three]
Four: [four]
With links
One: http://localhost
Two: [two](http://localhost)
Three: [three](http://localhost)
Four: [four](http://localhost)
With links with titles
One: [one link](http://localhost)
Two: [two link](http://localhost)
Three: [three link](http://localhost)
Four: [four link](http://localhost)
@@ -0,0 +1 @@
Hello &nbsnbsp; world
@@ -0,0 +1,17 @@
List tests
Add some lists.
- one
- two
- three
An unordered list
- one
- two
- three
- one
- two
- three
@@ -0,0 +1,7 @@
Anchor tests
Visit http://openiaml.org or openiaml.org or http://openiaml.org.
To visit with SSL, visit https://openiaml.org or openiaml.org or https://openiaml.org.
To mail, email support@openiaml.org or mailto:support@openiaml.org or support@openiaml.org or mailto:support@openiaml.org.
@@ -0,0 +1,12 @@
Dear html2text,
This is an example email that can be used to test html2text conversion of outlook / exchange emails.
The addition of <o:p> tags is very annoying!
This is a single line return
This is bold
This is italic
This is underline
Andrew
@@ -0,0 +1 @@
hello world & people < > &NBSP;
@@ -0,0 +1,12 @@
Just two divs
Hanging out
Nested divs and line breaks
Nested divs and line breaks
More text
Just text
Just text
Just text
This is the end!
@@ -0,0 +1,35 @@
Hello
How are you?
How are you?
How are you?
Just two divs
Hanging out
This is not the end!
How are you again?
This is the end!
Just kidding
Header 1
Some text
---------------------------------------------------------------
Some more text
Paragraph tag!
Header 2
---------------------------------------------------------------
Header 3
Some text
Header 4
Paragraph tag!
Final line
@@ -0,0 +1 @@
these spaces are non-breaking
@@ -0,0 +1,8 @@
Here is the code
#include <stdlib.h>
#include <stdio.h>
int main(){
return 0;
};
@@ -0,0 +1,7 @@
Hello, World!
Col A Col B
Data A1 Data B1
Data A2 Data B2
Data A3 Data B4
Total A Total B
@@ -0,0 +1,2 @@
test one
test two
@@ -0,0 +1,5 @@
1
2
3
4
5 < 6
@@ -0,0 +1,2 @@
- ÅÄÖ
- åäö
@@ -0,0 +1,2 @@
- ÅÄÖ
- åäö
@@ -0,0 +1 @@
foobar