Implement Tree.pformat_latex_forest #2956

Open: wants to merge 4 commits into base: develop
24 changes: 24 additions & 0 deletions nltk/test/tree.doctest
@@ -39,6 +39,30 @@ tree object to one of several standard tree encodings:
[.dp [.d the ] [.np dog ] ]
[.vp [.v chased ] [.dp [.d the ] [.np cat ] ] ] ]

>>> print(tree.pformat_latex_forest())
\begin{forest}
[s
[dp [d [the] ] [np [dog] ] ]
[vp [v [chased] ] [dp [d [the] ] [np [cat] ] ] ] ]
\end{forest}

More flexibility in output is available via the `pformat` method:

>>> print(tree.pformat(indent=2, nodesep="=", parens=("[[","]]"), quotes=True))
[[s=
[[dp= [[d= 'the']] [[np= 'dog']]]]
[[vp= [[v= 'chased']] [[dp= [[d= 'the']] [[np= 'cat']]]]]]]]

>>> print(tree.pformat(indent=3, nodesep="=", parens="{}", quotes="[]"))
{s=
{dp= {d= [the]} {np= [dog]}}
{vp= {v= [chased]} {dp= {d= [the]} {np= [cat]}}}}

The default output format uses parentheses, no quoting, and no separator characters:

>>> print(tree.pformat(quotes=False))
(s (dp (d the) (np dog)) (vp (v chased) (dp (d the) (np cat))))

There is also a fancy ASCII art representation:

>>> tree.pretty_print()
74 changes: 49 additions & 25 deletions nltk/tree/tree.py
@@ -799,7 +799,7 @@ def pprint(self, **kwargs):
             stream = None
         print(self.pformat(**kwargs), file=stream)

-    def pformat(self, margin=70, indent=0, nodesep="", parens="()", quotes=False):
+    def pformat(self, margin=70, indent=0, nodesep="", parens="()", quotes=("", "")):
         """
         :return: A pretty-printed string representation of this tree.
         :rtype: str
@@ -810,10 +810,18 @@ def pformat(self, margin=70, indent=0, nodesep="", parens="()", quotes=False):
             subsequent lines.
         :type indent: int
         :param nodesep: A string that is used to separate the node
-            from the children. E.g., the default value ``':'`` gives
+            from the children. E.g., the value ``':'`` gives
             trees like ``(S: (NP: I) (VP: (V: saw) (NP: it)))``.
         :type nodesep: str
+        :param parens: Two-element iterable to surround non-leaf nodes.
+        :param quotes: Two-element iterable to surround leaf nodes,
+            or True to quote leaf nodes as Python strings.

Member

To the best of my knowledge we've avoided this kind of polymorphism in NLTK. I think it needs some consideration if we're going to start doing this.

Author

> To the best of my knowledge we've avoided this kind of polymorphism in NLTK. I think it needs some consideration if we're going to start doing this.

Any alternative suggestions?

Member

Sorry for the delay...

Member

These changes by @tyomitch make quotes work the same way as parens, which is a logical move. Beyond that, he keeps backwards compatibility, so True and False can still be used. I don't necessarily see an issue with that, especially because True means using repr, which picks the right quote characters for each string.

That said, at a glance the implementation looks buggy. On line 841: what if the child is a string and quotes=False? Then the condition (quotes is not True) holds, and "\n" + indent_str + quotes[0] + str(child) + quotes[1] is added to s. However, quotes[0] raises a TypeError, because a bool is not subscriptable.

An alternative implementation that avoids the polymorphism reverts quotes to what it was and adds a separate parameter, e.g. quote_strings=None. The last else branch is where the quotes should be added, so then we can do e.g.:

            elif isinstance(child, str) and not quotes:
                s += "\n" + " " * (indent + 2) + "%s" % child
            else:
                if not quote_strings:
                    s += "\n" + " " * (indent + 2) + repr(child)
                else:
                    s += "\n" + " " * (indent + 2) + quote_strings[0] + str(child) + quote_strings[1]
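
A minimal standalone sketch of the failure mode described above. The function `leaf_branch` is a hypothetical stand-in that copies only the leaf-handling condition from the diff; it is not NLTK code and assumes nothing about the rest of `pformat`.

```python
def leaf_branch(child, quotes):
    """Hypothetical stand-in for the leaf branch under review (not NLTK code)."""
    if isinstance(child, str) and quotes is not True:
        # With quotes=False this indexes a bool, which raises TypeError.
        return quotes[0] + str(child) + quotes[1]
    return repr(child)


print(leaf_branch("dog", ("[", "]")))  # [dog]
print(leaf_branch("dog", True))        # 'dog'

try:
    leaf_branch("dog", False)
except TypeError as exc:
    print("TypeError:", exc)
```

Any caller that still passes quotes=False explicitly would hit the exception unless the boolean is normalized before this branch runs.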

Author

@tomaarsen thank you for the detailed comment.

My expectation is that pre-existing callers either pass quotes=True or don't pass quotes at all; that's why the updated docstring doesn't list False as an acceptable input.

With the code you suggest, when quotes is False, quote_strings gets ignored, which may be confusing to the user. To apply custom quote characters, they'd need to pass both parameters, which is less user-friendly. I have no problem with that -- just flagging it up.

Member

I agree that that would be confusing. I don't dislike your current implementation per se, although I would probably want to see support for quotes=False (i.e. have it be equivalent to quotes=("", "")).

Author

@tomaarsen I've added a commit to support quotes=False
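
One way to picture the backwards-compatible behaviour discussed in this thread is the sketch below. The names `normalize_quotes` and `format_leaf` are hypothetical helpers written for illustration; the actual commit may be structured differently.

```python
def normalize_quotes(quotes):
    """Hypothetical helper: map the legacy quotes=False to an empty delimiter pair."""
    if quotes is False:
        return ("", "")
    # True (repr-style quoting) and two-element iterables pass through unchanged.
    return quotes


def format_leaf(child, quotes):
    """Format a string leaf using the normalized quotes argument."""
    quotes = normalize_quotes(quotes)
    if quotes is True:
        return repr(child)
    return f"{quotes[0]}{child}{quotes[1]}"


print(format_leaf("cat", False))  # cat
print(format_leaf("cat", "[]"))   # [cat]
print(format_leaf("cat", True))   # 'cat'
```

Normalizing once at the entry point keeps the leaf branch free of boolean special cases other than True.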

"""

+        # For backwards compatibility
+        if quotes is False:
+            quotes = ("", "")
+
         # Try writing it on one line.
         s = self._pformat_flat(nodesep, parens, quotes)
         if len(s) + indent < margin:
@@ -824,19 +832,20 @@ def pformat(self, margin=70, indent=0, nodesep="", parens="()", quotes=False):
             s = f"{parens[0]}{self._label}{nodesep}"
         else:
             s = f"{parens[0]}{repr(self._label)}{nodesep}"
+        indent_str = " " * (indent + 2)
         for child in self:
             if isinstance(child, Tree):
                 s += (
                     "\n"
-                    + " " * (indent + 2)
+                    + indent_str
                     + child.pformat(margin, indent + 2, nodesep, parens, quotes)
                 )
             elif isinstance(child, tuple):
-                s += "\n" + " " * (indent + 2) + "/".join(child)
-            elif isinstance(child, str) and not quotes:
-                s += "\n" + " " * (indent + 2) + "%s" % child
+                s += "\n" + indent_str + "/".join(child)
+            elif isinstance(child, str) and quotes is not True:
+                s += "\n" + indent_str + quotes[0] + str(child) + quotes[1]
             else:
-                s += "\n" + " " * (indent + 2) + repr(child)
+                s += "\n" + indent_str + repr(child)
         return s + parens[1]

     def pformat_latex_qtree(self):
@@ -862,33 +871,48 @@ def pformat_latex_qtree(self):
         pformat = self.pformat(indent=6, nodesep="", parens=("[.", " ]"))
         return r"\Tree " + re.sub(reserved_chars, r"\\\1", pformat)

+    def pformat_latex_forest(self):
+        r"""
+        Returns a representation of the tree compatible with the
+        LaTeX forest package. This consists of the tree represented
+        in bracketed notation, wrapped in a forest environment.
+
+        For example, the following result was generated from a parse
+        tree of the sentence ``the dog chased the cat``::
+
+            \begin{forest}
+              [S
+                [NP [D [the] ] [N [dog] ] ]
+                [VP [V [chased] ] [NP [D [the] ] [N [cat] ] ] ] ]
+            \end{forest}
+
+        :return: A latex forest representation of this tree.
+        :rtype: str
+        """
+        reserved_chars = re.compile(r"([#\$%&~_\{\}])")
+
+        pformat = self.pformat(indent=2, parens=("[", " ]"), quotes=("[", "]"))
+        pformat = re.sub(reserved_chars, r"\\\1", pformat)
+        return "\\begin{forest}\n " + pformat + "\n\\end{forest}"
+
     def _pformat_flat(self, nodesep, parens, quotes):
         childstrs = []
         for child in self:
             if isinstance(child, Tree):
                 childstrs.append(child._pformat_flat(nodesep, parens, quotes))
             elif isinstance(child, tuple):
                 childstrs.append("/".join(child))
-            elif isinstance(child, str) and not quotes:
-                childstrs.append("%s" % child)
+            elif isinstance(child, str) and quotes is not True:
+                childstrs.append(f"{quotes[0]}{child}{quotes[1]}")
             else:
                 childstrs.append(repr(child))
-        if isinstance(self._label, str):
-            return "{}{}{} {}{}".format(
-                parens[0],
-                self._label,
-                nodesep,
-                " ".join(childstrs),
-                parens[1],
-            )
-        else:
-            return "{}{}{} {}{}".format(
-                parens[0],
-                repr(self._label),
-                nodesep,
-                " ".join(childstrs),
-                parens[1],
-            )
+        return "{}{}{} {}{}".format(
+            parens[0],
+            self._label if isinstance(self._label, str) else repr(self._label),
+            nodesep,
+            " ".join(childstrs),
+            parens[1],
+        )


 def _child_names(tree):
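
The escaping and wrapping step in pformat_latex_forest can be tried in isolation. The sketch below reuses the regular expression from the diff; `wrap_forest` is a hypothetical helper that takes an already bracketed string, so it runs without NLTK and does not reproduce the PR's indentation.

```python
import re

# Same character class as in the diff: LaTeX characters to backslash-escape.
RESERVED_CHARS = re.compile(r"([#\$%&~_\{\}])")


def wrap_forest(bracketed):
    """Hypothetical helper: escape reserved characters and wrap in a forest environment."""
    escaped = RESERVED_CHARS.sub(r"\\\1", bracketed)
    return "\\begin{forest}\n" + escaped + "\n\\end{forest}"


print(wrap_forest("[np [d [the] ] [n [50%_off] ] ]"))
```

This sketch, like the regular expression it borrows, only escapes the listed LaTeX characters. It does not handle characters that forest itself treats as structure inside a node, such as square brackets in a leaf.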