Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store message content as HTML #50

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

faraazb
Copy link
Contributor

@faraazb faraazb commented Feb 5, 2022

Fixes #43

Messages are stored in the database as HTML. This preserves formatting such as bold, italic, underline, strikethrough, monospace and inline links.
Telegram links in a message to other messages (t.me/group/message_id) are replaced with their archival site version.
For example, t.me/example_group/12 becomes example_group/site/2022-02.html#12.

Store the messages as HTML so that all formatting is preserved.
Telegram links in a message to other messages of the group or channel are replaced with site links.
@knadh
Copy link
Owner

knadh commented Feb 7, 2022

Thanks. Will test this soon.

@Farzat07
Copy link
Contributor

I tried this and it works, but I think the html template file should be edited to reflect the changes, as the html elements are not rendered.

The rss template though seems to be working just fine for now.

@Farzat07
Copy link
Contributor

Farzat07 commented Feb 12, 2022

Actually nevermind - I was using the old template for the html website. The new template actually does work just fine.

@knadh
Copy link
Owner

knadh commented Feb 13, 2022

Sorry, just got a chance to look at this. URLs aren't being rendered as hyerplinks anymore.

Fresh site created using --new with this PR:
image

Current master:

image

@Farzat07
Copy link
Contributor

Are you sure you deleted the database and then synced again? Because otherwise you would be just applying the new code/template on the old raw text messages.

@knadh
Copy link
Owner

knadh commented Feb 13, 2022

I used an existing database, but that shouldn't break existing links on existing installations. Re-syncing large channels may be impractical.

  1. replace_msg_link() can be renamed to urlize() (like in the current version) and it can continue to convert non-<a> URLs to links along with replacing Telegram group links like it is doing right now.

  2. This PR also involves changing the template, which means all existing installations will break after an upgrade, which isn't ideal. Have to come up with a way to avoid this.

@faraazb
Copy link
Contributor Author

faraazb commented Feb 14, 2022

I agree, starting over with large channels seems impractical. I could be missing something but what I understand is raw text should not be rendered without escaping and HTML cannot be escaped and I don't think it is possible to differentiate between raw text and HTML. I am unable to have a generalized urlize(). This would also lead to both HTML and raw text being stored in the database, which doesn't sound nice to me.
I think we can have a 'formatted-message' config which is True by default, so that new sites preserve message formatting and the existing ones do not break. I will make the change and test it out. What do you guys think?

@Farzat07
Copy link
Contributor

IF we make such a setting, I believe it should be set to True by default in NEW configurations by adding it to the example config.yaml file. However, if the setting does not exist in the config.yaml file (i.e. started with an old version) it should assume it is False.

@knadh
Copy link
Owner

knadh commented Feb 15, 2022

Yeah, an html_messages: true which is by default turned on for new setups should be fine.

faraazb and others added 3 commits February 17, 2022 01:28
New sites will preserve message formatting by default. Fix hyperlinks not rendering on existing sites.
@faraazb
Copy link
Contributor Author

faraazb commented Feb 17, 2022

Thanks for the feedback! Have made the change.

@knadh
Copy link
Owner

knadh commented Feb 19, 2022

Almost works! One last quirk. Syncing with html_messages: True saves HTML in the DB. If you then set it to False and rebuild the site, the HTML tags render as plaintext.

image

@Farzat07
Copy link
Contributor

Well that makes sense because the setting is meant to be constant; otherwise normally all projects should be set to use the html one. The real point of the setting is to not break previous setups by preventing a mix of text and html messages.

If this behaviour is confusing, one solution would be to add description next to the option about its nature and that it should not be changed after the first sync.

Another solution would be to remove the option entirely, and pull all new messages as html, regardless of the previous ones. Then, when generating the html/rss templates, check each message to see if it is text or html, and handle it accordingly. I believe a good check would be to check the datatype, but really any method should suffice.

Actually now that I think about it, the second solution makes way more sense but for some reason I didn't think of it before.

@knadh
Copy link
Owner

knadh commented Feb 19, 2022

Yep, the second option is better. The True/False should only affect rendering HTML or escaped text.

@faraazb
Copy link
Contributor Author

faraazb commented Feb 19, 2022

Thanks @farzat, but with the option you suggest we need to determine whether a message is raw text or HTML.
Consider these example messages,

  1. <script src="main.js"/>

msg.raw_text - <script src="main.js"/> (requires escaping)
msg.text - &lt;script src=&quot;main.js&quot;/&gt; (already escaped, by Telethon I guess)

  1. This is html

msg.raw_text - This is html
msg.text - This is <em>html</em>

I think we can't reliably differentiate (as existing raw text messages could be like example 1) without storing some information:

  1. Currently, it is the html_messages config. Having instructions about its usage in the config file is definitely required as @farzat said. For switching, a user has to start over. This means complete site in one format.
  2. A new message type - html_message alongside message in the messages table. The user can change html_messages config. before syncing the site. This means mixed formats.

From these options, 1 seemed better to me because it is consistent. However, 2 is more flexible, especially for existing sites.

@knadh If we want to also have the ability to switch after syncing the site and then build, we will have to store both raw text (content) and HTML (content_html), right? This could increase the database size. Handling this for existing sites adds some cases as well.
Storing HTML from now on and getting rid of tags is another way that came to my mind, but it is not a reliable option, consider example 1 raw text being synced before this change.

@knadh
Copy link
Owner

knadh commented Feb 19, 2022

<script src="main.js"/>

I don't think this can be the case. If you type out HTML tags, Telegram encodes it. The above message will come as &lt;script src=&quot;main.js&quot;/&gt;. Basically, the only valid HTML that we get should be HTML that Telegram has sanitized and generated (bold, italics etc.).

@faraazb
Copy link
Contributor Author

faraazb commented Feb 19, 2022

I guess Telegram returns MessageEntity objects which is left for the client (Telethon) to handle and Telethon can give us raw text, HTML and Markdown using those objects. The raw_text from Telethon for both the below messages is unescaped, whereas with text it gets escaped.
image

@knadh
Copy link
Owner

knadh commented Feb 19, 2022

This is with html_messages: True

image

@faraazb
Copy link
Contributor Author

faraazb commented Feb 19, 2022

Yes, as expected, html_messages: True leads to message.text being stored, which is already sanitized by Telethon. However, with html_messages: False or in the absence of the config., unescaped raw text is stored and escaped during the build process, this is identical to the current master branch's behaviour. So, existing unescaped raw text messages cannot be differentiated from new HTML messages that this change will start storing unless we store some additional information like I described.

Comment on lines -181 to +186
return _NL2BR.sub("\n\n", s).replace("\n", "\n<br />")
return _NL2BR.sub("\n\n", str(s)).replace("\n", "\n<br />")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we use this? Whether the s variable is a string or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can be removed now as urlize() does a cast already.
The re.sub() method expects a string and there are NULL/None messages which need to be cast to string. This was indirectly taken care of by the escape filter in the master branch. Since I changed the way filters work in the first commit, I had to make this cast.

@Farzat07
Copy link
Contributor

Ok I guess the way we could do this is just adding a new field in the database, for example specifying which version of tg-archive was used to sync this message. In our case, the type or content of the field doesn't matter, but its very existence suggests that this message was synced after this pull request, which means that the content is html, or otherwise plain text. This way, the option is also removed (all new messages are html), simplifying the config file.

This of course stems from my philosophy of limiting options to the user. Raw text support was brought up here simply for the sake of backwards compatability, so as long as we can keep backwards compatability without adding the option then the option shouldn't exist.

@Farzat07
Copy link
Contributor

I think the example I chose of version number is especially useful because it will also be helpful in similar cases in the future as well. Empty fields represent versions older than whatever the current version is. However, as I mentioned in the last comment, for this very paritcular case, any field with whatever random content should also do the job.

@faraazb
Copy link
Contributor Author

faraazb commented Feb 20, 2022

I agree. Some information has to be stored somewhere which is ultimately due to the Telethon raw_text behaviour that was discussed above.

If we store version number for future purposes other than this, it will require tracking which version introduced what feature/change. We have to check against a message's tg-archive version before doing something. I see how it can be used more generally, though.

Can you @knadh please check and confirm the message.raw_text behaviour? It returns the content unescaped for me.

@faraazb
Copy link
Contributor Author

faraazb commented Feb 20, 2022

To summarize the options and their impact, we can have

  1. A config. option (html_messages: True), enforces a format on site level.
    New messages are in HTML if config. is present and True. Cannot be changed for a site without starting over.
  2. A message type in messages table: html_message, enforces a format on message level.
    We can choose to allow swicthing before a sync using html_messages config.
    We can also choose to enforce HTML for existing sites or let them keep raw text unless they add the config manually.
  3. tg-archive version for each message in the messages table, enforces a format on message level.
    New messages are in HTML by default for existing and new sites. No overrides.
  4. Start storing both.
    Possible to switch anytime instead of just before a sync. Larger databases.

@Farzat07
Copy link
Contributor

The fourth option also has the advantage of keeping the raw text available in case we wanted to, for example, search the database. Personally I prefer the 3rd option the most (it's my idea after all), and then the 4th option.

@faraazb
Copy link
Contributor Author

faraazb commented Jun 10, 2022

Sorry, this discussion has been idle for a few months now. But I came back to this, a few days ago, thought of a different method - instead of the entire HTML message, storing only the message entities as JSON (e.g. {"bold": [[0, 5], [9, 2]]} where 0 is the offset and 5 is the length and so on) and then reconstructing the entity objects and passing them to html.unparse() to get the HTML message. But different entities such as URLs can have more attributes than just offset and length, making the resulting JSON complex. Also, this would be more useful when the messages are known to be very long, otherwise in maximum cases storing both raw text and the HTML version is simple and better.
So, option 4 is the best in my opinion. We can close this in case it's not required anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Links in message text
3 participants