decode() to XML (argument to="xml") #255

Open
ablaette opened this issue Jun 19, 2023 · 0 comments
At present, the decode() method can return either a data.table or an AnnotatedPlainTextDocument. Many users would likely find an option to decode to XML (to = "xml") useful. Below is a function I wrote for this purpose in a specific context; it would need to be adapted to be generic.

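# Uses polmineR (corpus, split, decode), data.table, pbapply (pblapply) and the magrittr pipe.
# `matching` (a data.table) and `corpus_id` are taken from the calling environment.
make_collection_xml <- function(newspaper, year){
  matching_min <- matching[corpus == newspaper][grepl(year, article_date)]
  articles <- corpus(corpus_id) %>% 
    subset(article_id %in% matching_min[["article_id"]]) %>% 
    split(s_attribute = "article_id", progress = interactive(), verbose = FALSE)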
  
  articles_dt <- pblapply(
    articles,
    function(article)
      decode(
        article,
        to = "data.table",
        p_attributes = "word",
        s_attributes = "s",
        verbose = FALSE
      ),
    cl = NULL
  )
  
  articles_xml <- pblapply(
    articles_dt,
    function(dt){
      dt[, "tab" := sprintf('<w cpos="%s">%s</w>', dt[["cpos"]], dt[["word"]])]
      
      sentences <- lapply(
        split(dt, by = "s"),
        function(s)
          sprintf(
            '<s struc="%d">\n%s\n</s>',
            s[["s"]][1],
            paste(s[["tab"]], collapse = "\n")
          )
      )
      
      sprintf(
        "<article>\n%s\n</article>",
        paste(unlist(sentences), collapse = "\n")
      )
    },
    cl = parallel::detectCores() - 2L
  )
  sprintf(
    '<collection newspaper="%s" year="%s">\n%s\n</collection>\n',
    newspaper,
    year,
    paste(unlist(articles_xml), collapse = "\n")
  )
}
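
To make the adaptation more concrete, here is a minimal sketch of what the generic core might look like: turning one decoded token table into XML, independent of newspaper and year. The names `xml_escape()` and `tokens_to_xml()` are hypothetical and not part of polmineR. The sketch assumes a table with the columns "cpos", "word" and "s", as produced by the decode() call above, and it drops tokens that have no sentence id. Unlike the function above, it escapes XML special characters in tokens, since a raw "&" or "<" would make the output ill-formed.

```r
# Hypothetical helper (not part of polmineR): escape XML special characters.
# "&" has to be replaced first so that already escaped entities stay intact.
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x
}

# Hypothetical generic converter: expects a data.frame or data.table with the
# columns "cpos", "word" and "s" and returns a single XML string.
tokens_to_xml <- function(tokens, element = "article") {
  w <- sprintf(
    '<w cpos="%d">%s</w>',
    as.integer(tokens[["cpos"]]),
    xml_escape(tokens[["word"]])
  )
  s_col <- tokens[["s"]]
  # Sentence ids in order of first appearance; tokens without an id are skipped.
  s_ids <- unique(s_col[!is.na(s_col)])
  sentences <- vapply(
    s_ids,
    function(id) {
      sprintf(
        '<s struc="%d">\n%s\n</s>',
        as.integer(id),
        paste(w[which(s_col == id)], collapse = "\n")
      )
    },
    character(1)
  )
  sprintf("<%s>\n%s\n</%s>", element, paste(sentences, collapse = "\n"), element)
}

# Toy input standing in for the output of decode(..., to = "data.table").
tokens <- data.frame(
  cpos = 0:3,
  word = c("A", "&", "B", "<"),
  s = c(0L, 0L, 1L, 1L),
  stringsAsFactors = FALSE
)
xml <- tokens_to_xml(tokens)
cat(xml, "\n")
```

A generic to = "xml" option would presumably also need to let users choose which s-attributes become elements and which p-attributes become XML attributes; the sketch hard-codes "s" and "word" only to stay close to the function above.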
@ablaette ablaette self-assigned this Jul 13, 2023