Store message content as HTML #50

faraazb · 2022-02-05T19:34:51Z

Fixes #43

Messages are stored in the database as HTML. This preserves formatting such as bold, italic, underline, strikethrough, monospace and inline links.
Telegram links in a message to other messages (t.me/group/message_id) are replaced with their archival site version.
For example, t.me/example_group/12 becomes example_group/site/2022-02.html#12.
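
A minimal sketch of the link rewrite described above, not the PR's actual code. The regex, the function name `replace_msg_links`, and the `page_for_id` lookup (message ID to monthly page) are assumptions made for illustration.

```python
import re

# Matches t.me/<group>/<message_id>, with an optional scheme.
_TG_LINK = re.compile(r"(?:https?://)?t\.me/(?P<group>[A-Za-z0-9_]+)/(?P<id>\d+)")


def replace_msg_links(text, group, page_for_id, site_root="site"):
    """Rewrite links to messages of `group` into archive-site links.

    `page_for_id` is a hypothetical mapping from message ID to the monthly
    page (e.g. "2022-02") on which that message is published.
    """
    def _sub(match):
        msg_id = int(match.group("id"))
        # Leave links to other groups, or to unarchived messages, untouched.
        if match.group("group") != group or msg_id not in page_for_id:
            return match.group(0)
        return f"{group}/{site_root}/{page_for_id[msg_id]}.html#{msg_id}"

    return _TG_LINK.sub(_sub, text)
```

Links to other groups are deliberately left alone, since only messages of the archived group or channel have a page on the site.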

Store the messages as HTML so that all formatting is preserved. Telegram links in a message to other messages of the group or channel are replaced with site links.

knadh · 2022-02-07T10:27:10Z

Thanks. Will test this soon.

Farzat07 · 2022-02-12T10:39:15Z

I tried this and it works, but I think the html template file should be edited to reflect the changes, as the html elements are not rendered.

The rss template though seems to be working just fine for now.

Farzat07 · 2022-02-12T18:21:54Z

Actually never mind - I was using the old template for the html website. The new template does work just fine.

knadh · 2022-02-13T07:44:18Z

Sorry, just got a chance to look at this. URLs aren't being rendered as hyperlinks anymore.

Fresh site created using --new with this PR: [screenshot]

Current master: [screenshot]

Farzat07 · 2022-02-13T08:08:19Z

Are you sure you deleted the database and then synced again? Because otherwise you would be just applying the new code/template on the old raw text messages.

knadh · 2022-02-13T09:12:00Z

I used an existing database, but that shouldn't break existing links on existing installations. Re-syncing large channels may be impractical.

replace_msg_link() can be renamed to urlize() (like in the current version) and it can continue to convert non-<a> URLs to links along with replacing Telegram group links like it is doing right now.
This PR also involves changing the template, which means all existing installations will break after an upgrade, which isn't ideal. Have to come up with a way to avoid this.
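
A rough sketch of what such a urlize() could look like, assuming the stored content is either plain text or already-sanitized HTML. The regexes and the splitting approach are illustrative assumptions, not the project's implementation.

```python
import re

# Split the input into existing <a>...</a> elements, other tags, and text.
_TOKEN = re.compile(r"(<a\b[^>]*>.*?</a>|<[^>]+>)", re.IGNORECASE | re.DOTALL)
_URL = re.compile(r"https?://[^\s<>\"']+")


def _link(match):
    url = match.group(0)
    return f'<a href="{url}">{url}</a>'


def urlize(s):
    """Wrap bare URLs in <a> tags, leaving existing anchors and tags as-is."""
    parts = _TOKEN.split(s)
    # With one capturing group, even indexes hold the text between matches
    # and odd indexes hold the anchors/tags themselves.
    for i in range(0, len(parts), 2):
        parts[i] = _URL.sub(_link, parts[i])
    return "".join(parts)
```

Only the text between tags is touched, so a URL that is already the target or label of an `<a>` element is not wrapped a second time.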

faraazb · 2022-02-14T20:55:08Z

I agree, starting over with large channels seems impractical. I could be missing something, but as I understand it, raw text must not be rendered without escaping, HTML cannot be escaped, and I don't think it is possible to tell raw text and HTML apart. So I am unable to write a generalized urlize(). It would also lead to both HTML and raw text being stored in the database, which doesn't sound nice to me.
I think we can have a 'formatted-message' config which is True by default, so that new sites preserve message formatting and the existing ones do not break. I will make the change and test it out. What do you guys think?

Farzat07 · 2022-02-15T02:56:06Z

If we make such a setting, I believe it should be set to True by default in NEW configurations by adding it to the example config.yaml file. However, if the setting does not exist in the config.yaml file (i.e. the site was started with an old version), it should be assumed to be False.

knadh · 2022-02-15T08:21:49Z

Yeah, an html_messages: true which is by default turned on for new setups should be fine.
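
The defaulting rule agreed on here could be sketched as follows, assuming config.yaml has already been parsed into a dict. `load_config` and `_DEFAULTS` are hypothetical names used only for this illustration.

```python
# Defaults applied when a key is missing from an older config.yaml.
_DEFAULTS = {"html_messages": False}


def load_config(user_config):
    """Merge a parsed config.yaml (a dict) over backward-compatible defaults."""
    return {**_DEFAULTS, **(user_config or {})}
```

A new site gets `html_messages: true` from the example config file, while an old config that lacks the key falls back to False and keeps rendering escaped text.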

New sites will preserve message formatting by default. Fix hyperlinks not rendering on existing sites.

faraazb · 2022-02-17T07:36:22Z

Thanks for the feedback! Have made the change.

knadh · 2022-02-19T06:55:50Z

Almost works! One last quirk. Syncing with html_messages: True saves HTML in the DB. If you then set it to False and rebuild the site, the HTML tags render as plaintext.
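
The quirk can be reproduced with a small sketch. The `render` helper is hypothetical and only assumes that text mode escapes the stored content while HTML mode emits it as-is.

```python
from html import escape


def render(content, html_messages):
    """Return the markup to place in the page for one stored message."""
    if content is None:
        return ""
    # HTML mode trusts the stored markup; text mode escapes everything.
    return content if html_messages else escape(content)


stored = "This is <em>html</em>"  # synced while html_messages was True
print(render(stored, html_messages=True))   # This is <em>html</em>
print(render(stored, html_messages=False))  # This is &lt;em&gt;html&lt;/em&gt;
```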

Farzat07 · 2022-02-19T08:20:01Z

Well that makes sense because the setting is meant to be constant; otherwise normally all projects should be set to use the html one. The real point of the setting is to not break previous setups by preventing a mix of text and html messages.

If this behaviour is confusing, one solution would be to add a description next to the option explaining its nature and that it should not be changed after the first sync.

Another solution would be to remove the option entirely, and pull all new messages as html, regardless of the previous ones. Then, when generating the html/rss templates, check each message to see if it is text or html, and handle it accordingly. I believe a good check would be to check the datatype, but really any method should suffice.

Actually now that I think about it, the second solution makes way more sense but for some reason I didn't think of it before.

knadh · 2022-02-19T09:37:28Z

Yep, the second option is better. The True/False should only affect rendering HTML or escaped text.

faraazb · 2022-02-19T12:48:39Z

Thanks @farzat, but with the option you suggest we need to determine whether a message is raw text or HTML.
Consider these example messages:

Example 1:

<script src="main.js"/>

msg.raw_text - <script src="main.js"/> (requires escaping)
msg.text - &lt;script src="main.js"/&gt; (already escaped, by Telethon I guess)

Example 2:

This is html (with "html" in italics)

msg.raw_text - This is html
msg.text - This is <em>html</em>

I think we can't reliably differentiate (as existing raw text messages could be like example 1) without storing some information:

1. Currently, it is the html_messages config. Having instructions about its usage in the config file is definitely required as @farzat said. For switching, a user has to start over. This means a complete site in one format.
2. A new message type, html_message, alongside message in the messages table. The user can change the html_messages config before syncing the site. This means mixed formats.

From these options, 1 seemed better to me because it is consistent. However, 2 is more flexible, especially for existing sites.

@knadh If we want to also have the ability to switch after syncing the site and then build, we will have to store both raw text (content) and HTML (content_html), right? This could increase the database size. Handling this for existing sites adds some cases as well.
Storing HTML from now on and getting rid of tags is another way that came to my mind, but it is not a reliable option: consider the raw text from example 1 being synced before this change.
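
To make the ambiguity concrete, here is a small sketch with a deliberately naive detector. `looks_like_html` and its regex are assumptions for illustration; the point is that any content-based check of this kind misclassifies raw text like example 1.

```python
import re

# Anything that looks like an opening or closing tag.
_TAG = re.compile(r"</?[a-zA-Z][^>]*>")


def looks_like_html(content):
    """Naive guess: treat content containing a tag-like token as HTML."""
    return bool(_TAG.search(content))


# Stored before the change: unescaped raw text typed by a user.
old_raw = '<script src="main.js"/>'
# Stored after the change: HTML generated from the message entities.
new_html = "This is <em>html</em>"

print(looks_like_html(old_raw))   # True, although this row is raw text
print(looks_like_html(new_html))  # True
```

Both rows look the same to the detector, so the old raw text would be emitted unescaped unless the format is recorded somewhere.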

knadh · 2022-02-19T12:51:16Z

<script src="main.js"/>

I don't think this can be the case. If you type out HTML tags, Telegram encodes them. The above message will come as &lt;script src="main.js"/&gt;. Basically, the only valid HTML that we get should be HTML that Telegram has sanitized and generated (bold, italics etc.).

faraazb · 2022-02-19T13:23:50Z

I guess Telegram returns MessageEntity objects which is left for the client (Telethon) to handle and Telethon can give us raw text, HTML and Markdown using those objects. The raw_text from Telethon for both the below messages is unescaped, whereas with text it gets escaped.

knadh · 2022-02-19T13:32:43Z

This is with html_messages: True

faraazb · 2022-02-19T14:32:36Z

Yes, as expected, html_messages: True leads to message.text being stored, which is already sanitized by Telethon. However, with html_messages: False or in the absence of the config, unescaped raw text is stored and escaped during the build process, which is identical to the current master branch's behaviour. So, existing unescaped raw text messages cannot be differentiated from the new HTML messages that this change will start storing unless we store some additional information like I described.

Farzat07 · 2022-02-20T07:43:02Z

tgarchive/build.py

-        return _NL2BR.sub("\n\n", s).replace("\n", "\n<br />")
+        return _NL2BR.sub("\n\n", str(s)).replace("\n", "\n<br />")


Can't we use this? Whether the s variable is a string or not?

faraazb · 2022-02-20

Yes, it can be removed now as urlize() does a cast already.
The re.sub() method expects a string and there are NULL/None messages which need to be cast to string. This was indirectly taken care of by the escape filter in the master branch. Since I changed the way filters work in the first commit, I had to make this cast.
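
A sketch of the filter with the NULL case handled explicitly. The definition of `_NL2BR` is assumed here, and unlike the diff above this version returns an empty string for None, because `str(None)` would otherwise produce the literal text "None".

```python
import re

# Assumed definition: collapse runs of blank lines into a single blank line.
_NL2BR = re.compile(r"\n\n+")


def nl2br(s):
    """Convert newlines to <br /> tags, tolerating missing (NULL) content."""
    if s is None:
        # Avoid rendering the word "None" for messages without text.
        return ""
    return _NL2BR.sub("\n\n", str(s)).replace("\n", "\n<br />")
```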

Farzat07 · 2022-02-20T10:16:39Z

Ok I guess the way we could do this is just adding a new field in the database, for example specifying which version of tg-archive was used to sync this message. In our case, the type or content of the field doesn't matter, but its very existence suggests that this message was synced after this pull request, which means that the content is html, or otherwise plain text. This way, the option is also removed (all new messages are html), simplifying the config file.

This of course stems from my philosophy of limiting options to the user. Raw text support was brought up here simply for the sake of backwards compatibility, so as long as we can keep backwards compatibility without adding the option then the option shouldn't exist.

Farzat07 · 2022-02-20T10:21:32Z

I think the example I chose of version number is especially useful because it will also be helpful in similar cases in the future. Empty fields represent versions older than whatever the current version is. However, as I mentioned in the last comment, for this very particular case, any field with whatever random content should also do the job.
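
A sketch of the marker-column idea using SQLite. The reduced table, the column name `synced_version`, and the placeholder version string are all hypothetical; only the rule "NULL means raw text, anything else means HTML" comes from the proposal.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# A reduced, hypothetical messages table standing in for the real schema.
db.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, content TEXT)")
db.execute("INSERT INTO messages (id, content) VALUES (1, '<b>typed by a user</b>')")

# Migration: rows synced before the change keep NULL in the new column.
db.execute("ALTER TABLE messages ADD COLUMN synced_version TEXT")
db.execute(
    "INSERT INTO messages (id, content, synced_version) VALUES (?, ?, ?)",
    (2, "This is <em>html</em>", "placeholder-version"),
)


def is_html(row_version):
    """A message is HTML exactly when it carries a version marker."""
    return row_version is not None


rows = db.execute("SELECT id, synced_version FROM messages ORDER BY id").fetchall()
formats = {msg_id: is_html(version) for msg_id, version in rows}
print(formats)  # {1: False, 2: True}
```

Existing rows need no backfill, since adding the column leaves them NULL, which is exactly the "synced before this change" signal.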

faraazb · 2022-02-20T12:25:11Z

I agree. Some information has to be stored somewhere which is ultimately due to the Telethon raw_text behaviour that was discussed above.

If we store version number for future purposes other than this, it will require tracking which version introduced what feature/change. We have to check against a message's tg-archive version before doing something. I see how it can be used more generally, though.

Can you @knadh please check and confirm the message.raw_text behaviour? It returns the content unescaped for me.

faraazb · 2022-02-20T12:28:41Z

To summarize the options and their impact, we can have

1. A config option (html_messages: True), which enforces a format at the site level.
   - New messages are in HTML if the config is present and True. Cannot be changed for a site without starting over.
2. A message type in the messages table (html_message), which enforces a format at the message level.
   - We can choose to allow switching before a sync using the html_messages config.
   - We can also choose to enforce HTML for existing sites or let them keep raw text unless they add the config manually.
3. A tg-archive version for each message in the messages table, which enforces a format at the message level.
   - New messages are in HTML by default for existing and new sites. No overrides.
4. Start storing both.
   - Possible to switch anytime instead of just before a sync. Larger databases.
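
For the last option, the build step could pick a column per message roughly like this. It is only a sketch: `pick_content` is a hypothetical helper, and rows are assumed to be dicts with the content and content_html fields mentioned earlier in the thread.

```python
from html import escape


def pick_content(row, html_messages):
    """Choose the markup for one message when both formats are stored."""
    raw, html = row.get("content"), row.get("content_html")
    # Rows synced before the change have no HTML version, so they always
    # fall back to the escaped raw text.
    if html_messages and html is not None:
        return html
    return escape(raw or "")
```

Because the raw text is never discarded, the setting can be flipped at build time without a re-sync, at the cost of storing each message twice.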

Farzat07 · 2022-02-20T13:23:50Z

The fourth option also has the advantage of keeping the raw text available in case we wanted to, for example, search the database. Personally I prefer the 3rd option the most (it's my idea after all), and then the 4th option.

faraazb · 2022-06-10T12:42:08Z

Sorry, this discussion has been idle for a few months now. I came back to this a few days ago and thought of a different method: instead of the entire HTML message, store only the message entities as JSON (e.g. {"bold": [[0, 5], [9, 2]]}, where 0 is the offset, 5 is the length, and so on), then reconstruct the entity objects and pass them to html.unparse() to get the HTML message. But different entities such as URLs can have more attributes than just offset and length, making the resulting JSON complex. Also, this would be more useful when the messages are known to be very long; in most cases storing both raw text and the HTML version is simpler and better.
So, option 4 is the best in my opinion. We can close this in case it's not required anymore.
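
For reference, the entity-JSON idea could look roughly like this. It is a pure-Python stand-in rather than a call to Telethon's html.unparse(), and it makes simplifying assumptions: spans do not overlap, only two entity kinds are mapped (the tag names are chosen for illustration), and offsets count Python characters, whereas Telegram counts UTF-16 code units.

```python
import json
from html import escape

# Illustrative mapping from entity kind to HTML tag.
_TAGS = {"bold": "strong", "italic": "em"}


def entities_to_html(raw_text, entities_json):
    """Rebuild HTML from raw text plus {"kind": [[offset, length], ...]}."""
    spans = sorted(
        (offset, length, _TAGS[kind])
        for kind, ranges in json.loads(entities_json).items()
        for offset, length in ranges
    )
    out, pos = [], 0
    for offset, length, tag in spans:
        # Escape the plain text before the span, then the span itself.
        out.append(escape(raw_text[pos:offset]))
        out.append(f"<{tag}>{escape(raw_text[offset:offset + length])}</{tag}>")
        pos = offset + length
    out.append(escape(raw_text[pos:]))
    return "".join(out)


print(entities_to_html("Hello to me", '{"bold": [[0, 5], [9, 2]]}'))
# <strong>Hello</strong> to <strong>me</strong>
```

Entities with extra attributes, such as text links with a URL, would need a richer JSON shape than offset/length pairs, which is the complexity mentioned above.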

Store message content as HTML

811f3cd

faraazb and others added 3 commits February 17, 2022 01:28

Add 'html_messages' config. option

ad060e1

Merge branch 'knadh:master' into master

27c5dfe

Resolve conflict

24abc2b

Farzat07 reviewed Feb 20, 2022
