[BUGFIX] Respect language based style names on reading Word files #2597

@sbuerk sbuerk commented Apr 2, 2024

[BUGFIX] Respect language based style names on reading Word files

Microsoft Office saves documents with language-dependent style
mappings for the default styles. For example, a German version
of Word writes the following to word/styles.xml in the
container archive (*.docx):

<w:style w:type="paragraph" w:styleId="berschrift1">
  <w:name w:val="heading 1"/>
  ....
</w:style>

whereas an English version would write:

<w:style w:type="paragraph" w:styleId="Heading1">
  <w:name w:val="heading 1"/>
  ...
</w:style>

The value of <w:name /> is the internal, language-independent
identifier of the style, whereas the w:styleId attribute on the
outer <w:style /> tag holds the language-specific alias.
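
To illustrate the relationship between the two values, here is a
minimal Python sketch (not the PHPWord reader code) that collects
the w:styleId to <w:name /> pairs from the content of a
word/styles.xml. The function name read_style_names and the
wrapping sample document are assumptions made for this example;
only the style element itself is taken from the snippet above.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w:" prefix in styles.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Sample input: the German style from above, wrapped in a root element
# so that it parses as a standalone document (the wrapper is an assumption).
STYLES_XML = f"""<w:styles xmlns:w="{W_NS}">
  <w:style w:type="paragraph" w:styleId="berschrift1">
    <w:name w:val="heading 1"/>
  </w:style>
</w:styles>"""


def read_style_names(xml_text):
    """Return {styleId: name} for every <w:style> that has a <w:name>."""
    root = ET.fromstring(xml_text)
    names = {}
    for style in root.iter(f"{{{W_NS}}}style"):
        style_id = style.get(f"{{{W_NS}}}styleId")
        name = style.find(f"{{{W_NS}}}name")
        # Compare with None explicitly: childless elements are falsy.
        if style_id is not None and name is not None:
            names[style_id] = name.get(f"{{{W_NS}}}val")
    return names


style_names = read_style_names(STYLES_XML)
```

With the sample input, style_names maps the alias "berschrift1" to
the internal name "heading 1".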

Later parts of the document structure, for example the
paragraphs, reference a style by its alias (w:styleId). The
reader code matches these references against hardcoded,
case-insensitive RegEx patterns written for the English
variant (Header\s+d), which do not match the language-specific
aliases at all.

Therefore, this change contains several parts:

  • An alias map is implemented and used to register title
    aliases, along with a corresponding lookup method.
  • The lookup method is used to resolve aliases wherever the
    hardcoded English RegEx is applied.
  • All style alias names are gathered for all possible styles
    while reading the style settings of the Word file.
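
The interplay of these parts can be sketched roughly as follows.
This is a Python illustration under stated assumptions, not the
PHP code of this change: the class StyleAliasMap, its methods,
the function heading_depth, and the pattern HEADING_RE are
hypothetical names chosen for the example.

```python
import re


class StyleAliasMap:
    """Maps language-specific style ids to internal style names."""

    def __init__(self):
        self._aliases = {}

    def register(self, style_id, name):
        # Store ids lowercased so that lookups are case-insensitive.
        self._aliases[style_id.lower()] = name

    def resolve(self, style_id):
        # Fall back to the id itself when no alias is registered.
        return self._aliases.get(style_id.lower(), style_id)


# Hypothetical stand-in for the reader's hardcoded English pattern.
HEADING_RE = re.compile(r"heading\s*(\d+)", re.IGNORECASE)


def heading_depth(style_id, aliases):
    """Return the heading level for a referenced style id, or None."""
    match = HEADING_RE.fullmatch(aliases.resolve(style_id))
    return int(match.group(1)) if match else None


# Registration would happen while reading the styles of the Word file.
aliases = StyleAliasMap()
aliases.register("berschrift1", "heading 1")

german_depth = heading_depth("berschrift1", aliases)
english_depth = heading_depth("Heading1", aliases)
unresolved_depth = heading_depth("berschrift1", StyleAliasMap())
```

With the alias registered, both the German and the English id
resolve to heading level 1. Without it, the German id does not
match the pattern and unresolved_depth is None, which mirrors the
behaviour described above.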

coveralls commented Apr 2, 2024

Coverage Status

coverage: 97.171% (-0.05%) from 97.217%
when pulling 13a5d65 on sbuerk:stefan-1
into 8b891bb on PHPOffice:master.
