Proposal to facilitate localization #109

Open
SxN02 opened this issue Sep 23, 2021 · 3 comments

@SxN02

SxN02 commented Sep 23, 2021

I would like to suggest an addition to pcapng: an IETF language tag, wherever it is applicable. My first thought was to associate it with the opt_comment option (section 3.5), but then I realized that it may add value elsewhere too, so perhaps it should be part of the block headers of those blocks where it is relevant.

The reason for this addition is to record the original language, giving applications that render pcapng the information they need for on-the-fly translation. Language tags can be recognized as strings that start with a letter, end with a letter and, optionally, have letters and/or dashes in between, but no consecutive dashes. A "smart" application could index and search in both the original language and the target language.
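
The recognition rule described above could be sketched roughly as follows. This is only an illustration of that rule, not a full IETF language tag parser: real tags may also contain digits (for example "es-419"), which this simplified check rejects. The function name `looks_like_language_tag` is made up for this sketch.

```python
import re

# Rule as stated in the comment: starts with a letter, ends with a letter,
# and has only letters or single dashes in between (no consecutive dashes).
# Assumption: ASCII letters only. Real IETF tags also permit digits.
_TAG_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def looks_like_language_tag(text: str) -> bool:
    """Return True if text matches the simplified rule above."""
    return _TAG_RE.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("en", "zh-Hant", "en--US", "-en", "en-", "es-419"):
        print(candidate, looks_like_language_tag(candidate))
```

A reader that needs to accept every valid tag would have to follow the full grammar rather than this shortcut.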

I believe it is a very small addition that would nevertheless contribute significantly to future-proofing the format. Please share your thoughts on it.

@guyharris
Collaborator

This would not apply to packet data: from the point of view of pcapng, the "language" that packet data is in is "raw binary".

Some data might happen to be text, but that data might carry its own language tags, such as HTML language tags. Those tags, not any tag in the capture file, should indicate the language from which to translate.

So this would apply only to data defined as text in the pcapng specification itself. Thus, it would currently apply to the opt_comment, custom string, shb_hardware, shb_os, shb_userappl, if_description, if_os, and if_hardware options. It would not apply to:

  • the if_name option, as those are identifiers provided by or to the software to indicate an interface ("en0" is barely even in English - yes, "en" stands for "Ethernet", but I'm not even sure whether "Ethernet" is written in other languages as anything other than "Ethernet");
  • the if_filter option, as, if it's a pcap filter expression, it's a string supplied to libpcap, which requires "port" and "host" and "and" and "or" and... to be in English;
  • the ns_dnsname option, as it might not be in any human natural language either ("foobar.dyndns.net"?).
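
As a rough sketch, the split above could be written down as two sets that a reader or writer consults before attaching a language tag. The option names are the ones listed in this comment; the `"custom string"` entry and the helper name `takes_language_tag` are labels made up for this illustration.

```python
# Options whose values are human-language text, per the list above.
# "custom string" is only a label here for the string variants of
# custom options; it is not an option name from the specification.
LOCALIZABLE_OPTIONS = {
    "opt_comment",
    "custom string",
    "shb_hardware",
    "shb_os",
    "shb_userappl",
    "if_description",
    "if_os",
    "if_hardware",
}

# Options that are strings but not natural-language text.
NON_LOCALIZABLE_OPTIONS = {"if_name", "if_filter", "ns_dnsname"}


def takes_language_tag(option_name: str) -> bool:
    """Return True if a language tag would be meaningful for this option."""
    return option_name in LOCALIZABLE_OPTIONS


if __name__ == "__main__":
    print(takes_language_tag("opt_comment"))  # text written by a person
    print(takes_language_tag("if_filter"))    # filter syntax, not prose
```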

Note also that there is no guarantee that all options in a block are in the same language; you might have an interface whose description was written in Simplified Chinese, with a hardware description in Traditional Chinese, about which comments in the capture file have been written in English and Russian.

So this might take the form of an option that, if it appears before another option, indicates the language of that option (doing nothing if the option is not one that's in a language). I.e., it's a non-locking shift; a locking shift is another possibility.
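
To make the difference between the two kinds of shift concrete, here is a minimal sketch that works on already-decoded (name, value) option pairs. The `opt_lang` option does not exist in pcapng; it and the function `tag_options` are hypothetical names used only for illustration, and the behaviour shown is one possible reading of the idea above.

```python
from typing import Iterable, Iterator, Optional, Tuple

OPT_LANG = "opt_lang"  # hypothetical option name, not part of pcapng

# Options that carry human-language text (subset, for illustration).
TEXT_OPTIONS = {"opt_comment", "shb_hardware", "shb_os", "shb_userappl",
                "if_description", "if_os", "if_hardware"}


def tag_options(
    options: Iterable[Tuple[str, str]], locking: bool = False
) -> Iterator[Tuple[str, str, Optional[str]]]:
    """Yield (name, value, language) for every option except opt_lang.

    Non-locking: a language applies only to the next option, and is
    dropped if that option is not a text option.
    Locking: a language stays in effect until the next opt_lang.
    """
    current: Optional[str] = None
    for name, value in options:
        if name == OPT_LANG:
            current = value
            continue
        yield name, value, current if name in TEXT_OPTIONS else None
        if not locking:
            current = None  # the shift is used up after one option


if __name__ == "__main__":
    opts = [
        (OPT_LANG, "zh-Hans"), ("if_description", "desc"),
        ("if_name", "en0"),
        (OPT_LANG, "en"), ("opt_comment", "first"), ("opt_comment", "second"),
    ]
    for row in tag_options(opts):
        print(row)
```

With the non-locking reading, the second comment in the example has no language unless it gets its own preceding language option; with `locking=True` it inherits "en".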

@SxN02
Author

SxN02 commented Sep 24, 2021

As for which text fields are good candidates for localization and which are not, as far as I understand this format I agree with the list above. Where I think we see it differently is in the language declaration via an HTML tag, which is useful in the complex example provided, but otherwise largely optional. Having a language declaration as a field would force writers to honour it accordingly, so what a Canadian writer writes as "honours", an American reader can render as "honors", reliably, on the fly. HTML tags, of course, can override the field.

@guyharris
Collaborator

> Where I think we see it differently is in the language declaration via an HTML tag, which is useful in the complex example provided, but otherwise largely optional.

I didn't propose HTML tags for anything other than HTML data in packets in the capture. All I noted there is that it makes no sense to have a pcapng language tag for packet data, as the packet data is what it is. Either the language is identified in the data, in which case that is what an application should use if it translates HTML text in packets, or it is not identified there, in which case it is not clear how options in the packet block could identify it, especially given that a given Web page in a capture might be in more than one language.

HTML tags simply wouldn't apply to text options in a pcapng block; there's no HTML there to even contain them, unless there's HTML in the capture, and even there, whatever random Web traffic you might have captured should not have any effect on a pcapng reader's notion of what language a comment attached to a packet is in.
