NASA data.json has changed #72

juliasilge · 2019-12-09T21:38:35Z

The data.json made available by NASA has changed its schema so we likely want to update the analysis at some point.

tmasjc · 2020-03-17T07:05:53Z

The NASA case study requires just a little correction to work, primarily the dataset ids. Please refer to the comments in the codeblock below,

library(tidyverse)
library(tidytext)
library(jsonlite)
library(widyr)
library(igraph)
library(ggraph)
set.seed(1234)

metadata <- fromJSON("https://data.nasa.gov/data.json")
names(metadata$dataset)

# previously metadata$dataset$`_id`$`$oid`
ids = metadata$dataset$identifier

nasa_title <- tibble(id = ids, title = metadata$dataset$title)
nasa_title <- nasa_title %>% 
    unnest_tokens(word, title) %>% 
    anti_join(stop_words, by = "word") %>%
    # remove terms v1.0, l2, 0.500, i, ii, ...
    filter(!str_detect(word, "^[v|l][0-9]?[\\.[0-9]?]"), 
           !str_detect(word, "^[0-9]+[\\.[0-9]+]*$"),
           !str_detect(word, "^[i]+$"))

# sample outcome
nasa_title %>%
    pairwise_count(word, id, sort = TRUE, upper = FALSE) %>%
    # reduce threshold from 250 to 150
    filter(n > 150) %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(edge_alpha = n, edge_width = n),
                   edge_colour = "navyblue",
                   show.legend = FALSE) +
    geom_node_point(size = 3, col = "darkblue") +
    geom_node_text(
        aes(label = name),
        repel = TRUE,
        family = "Menlo",
        size = 3,
        point.padding = unit(0.2, "lines")
    ) +
    theme_void()

juliasilge · 2020-03-17T14:26:31Z

Thanks so much @tmasjc!

walinchus · 2021-01-18T20:48:06Z

I'm guessing this is related to the JSON change, but I'm not sure. The map of the title_word_pairs works fine, but the map of the desc_word_pairs does not. I can't figure out why.

All of this code up to this point works:

##NASA Datamining

library(jsonlite)
metadata <- fromJSON("https://data.nasa.gov/data.json")
names(metadata$dataset)

class(metadata$dataset$title)
class(metadata$dataset$description)
class(metadata$dataset$keyword)

library(dplyr)

nasa_title <- tibble(id = metadata$dataset$identifier, 
                     title = metadata$dataset$title)
nasa_title

nasa_desc <- tibble(id = metadata$dataset$identifier, 
                     desc = metadata$dataset$description)

nasa_desc %>% 
  select(desc) %>% 
  sample_n(5)

library(tidyr)

nasa_keyword <- tibble(id = metadata$dataset$identifier, 
                       keyword = metadata$dataset$keyword) %>%
  unnest(keyword)

nasa_keyword

library(tidytext)

nasa_title <- nasa_title %>% 
  unnest_tokens(word, title) %>% 
  anti_join(stop_words)

nasa_desc <- nasa_desc %>% 
  unnest_tokens(word, desc) %>% 
  anti_join(stop_words)

nasa_title
nasa_desc

nasa_title %>%
  count(word, sort = TRUE)

nasa_desc %>% 
  count(word, sort = TRUE)

my_stopwords <- tibble(word = c(as.character(1:10), 
                                "v1", "v03", "l2", "l3", "l4", "v5.2.0", 
                                "v003", "v004", "v005", "v006", "v7"))
nasa_title <- nasa_title %>% 
  anti_join(my_stopwords)
nasa_desc <- nasa_desc %>% 
  anti_join(my_stopwords)

nasa_keyword %>% 
  group_by(keyword) %>% 
  count(sort = TRUE)

nasa_keyword <- nasa_keyword %>% 
  mutate(keyword = toupper(keyword))

library(widyr)

title_word_pairs <- nasa_title %>% 
  pairwise_count(word, id, sort = TRUE, upper = FALSE)

title_word_pairs

desc_word_pairs <- nasa_desc %>% 
  pairwise_count(word, id, sort = TRUE, upper = FALSE)

desc_word_pairs

library(ggplot2)
library(igraph)
library(ggraph)

set.seed(1234)
title_word_pairs %>%
  filter(n >= 250) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void()

But then when I go for the second graph:

set.seed(1234)
desc_word_pairs %>%
  filter(n >= 5000) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width= n), edge_colour = "darkred") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()

It gives this error: Error: Aesthetics must be valid data columns. Problematic aesthetic(s): edge_alpha = n, edge_width = n. Did you mistype the name of a data column or forget to add after_stat()?

I tried seeing if something was different about the description tibble versus the other tibble but they look identical to me pretty much:

summary(desc_word_pairs)
item1 item2 n
Length:19925275 Length:19925275 Min. : 1.000
Class :character Class :character 1st Qu.: 1.000
Mode :character Mode :character Median : 1.000
Mean : 4.205
3rd Qu.: 2.000
Max. :4537.000

summary(title_word_pairs)
item1 item2 n
Length:310608 Length:310608 Min. : 1.000
Class :character Class :character 1st Qu.: 1.000
Mode :character Mode :character Median : 1.000
Mean : 2.754
3rd Qu.: 2.000
Max. :2498.000

juliasilge · 2021-01-22T18:50:28Z

@walinchus it's because in the new version of the JSON available from NASA's website, the variable is now called description instead of desc. You should be able to do something like this:

library(tidyverse)
library(tidytext)
library(jsonlite)
#> 
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#> 
#>     flatten
metadata <- fromJSON("https://data.nasa.gov/data.json")
## notice `description` now!!
names(metadata$dataset)
#>  [1] "accessLevel"                 "landingPage"                
#>  [3] "bureauCode"                  "issued"                     
#>  [5] "@type"                       "modified"                   
#>  [7] "references"                  "keyword"                    
#>  [9] "contactPoint"                "publisher"                  
#> [11] "identifier"                  "description"                
#> [13] "title"                       "programCode"                
#> [15] "distribution"                "accrualPeriodicity"         
#> [17] "theme"                       "citation"                   
#> [19] "temporal"                    "spatial"                    
#> [21] "language"                    "data-presentation-form"     
#> [23] "release-place"               "series-name"                
#> [25] "creator"                     "graphic-preview-description"
#> [27] "graphic-preview-file"        "editor"                     
#> [29] "issue-identification"        "describedBy"                
#> [31] "dataQuality"                 "describedByType"            
#> [33] "license"                     "rights"

metadata_wrangled <- as_tibble(metadata$dataset) %>%
    select(title, description, keyword) %>% 
    mutate(id = row_number())

library(widyr)
desc_word_pairs <- metadata_wrangled %>% 
    unnest_tokens(word, description) %>% 
    anti_join(get_stopwords()) %>%
    pairwise_count(word, id, sort = TRUE, upper = FALSE)
#> Warning: `distinct_()` is deprecated as of dplyr 0.7.0.
#> Please use `distinct()` instead.
#> See vignette('programming') for more help
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#> Joining, by = "word"

desc_word_pairs
#> # A tibble: 23,412,441 x 3
#>    item1    item2          n
#>    <chr>    <chr>      <dbl>
#>  1 data     set         4982
#>  2 contains data        4414
#>  3 data     2           4394
#>  4 data     system      4219
#>  5 data     product     4132
#>  6 data     using       4122
#>  7 data     1           4039
#>  8 data     used        3899
#>  9 data     resolution  3889
#> 10 data     instrument  3725
#> # … with 23,412,431 more rows

^{Created on 2021-01-22 by the reprex package (v0.3.0)}

walinchus · 2021-01-22T18:54:34Z

Ah great thanks.

walinchus · 2021-01-22T19:10:21Z

Oh no wait I already switched out "description" for "desc." (See my first post). Hmm any other ideas?

walinchus · 2021-01-22T19:38:13Z

Okay progress! I set the filter to >=250. After a VERY long time, it worked. And it's a huge unreadable mess. BUT! That must mean that the code worked, it's just that in the new dataset there aren't enough descriptions over 5000. So you can mess with the filter to make ones people can see in the book.

juliasilge · 2021-01-22T20:36:28Z

Ah, I apologize; it wasn't quite clear where things were going wrong. The key to finding where things were going wrong is to look at desc_word_pairs; notice that no values of n are higher than 5000 in the new version so if you filter for n >= 5000, you will filter everything out!

If you instead filter down to things above 2000, you get a more reasonable plot:

library(tidyverse)
library(tidytext)
library(jsonlite)
#> 
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#> 
#>     flatten

metadata <- fromJSON("https://data.nasa.gov/data.json")

metadata_wrangled <- as_tibble(metadata$dataset) %>%
  select(title, description, keyword) %>% 
  mutate(id = row_number())

library(widyr)
desc_word_pairs <- metadata_wrangled %>% 
  unnest_tokens(word, description) %>% 
  anti_join(get_stopwords()) %>%
  pairwise_count(word, id, sort = TRUE, upper = FALSE)
#> Warning: `distinct_()` is deprecated as of dplyr 0.7.0.
#> Please use `distinct()` instead.
#> See vignette('programming') for more help
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#> Joining, by = "word"

library(igraph)
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:dplyr':
#> 
#>     as_data_frame, groups, union
#> The following objects are masked from 'package:purrr':
#> 
#>     compose, simplify
#> The following object is masked from 'package:tidyr':
#> 
#>     crossing
#> The following object is masked from 'package:tibble':
#> 
#>     as_data_frame
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union
library(ggraph)

set.seed(1234)
desc_word_pairs %>%
  filter(n >= 2000) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "darkred") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()

^{Created on 2021-01-22 by the reprex package (v0.3.0)}

In the future, it would be great to create a reprex (a minimal reproducible example) for something like this. The goal of a reprex is to make it easier for someone to recreate your problem so that they/I can understand it and/or fix it.

If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:

install.packages("reprex")

Thanks! 🙌

walinchus · 2021-01-22T21:07:12Z

Will do thanks. I haven't heard of reprex before but it sounds great. I am still learning R thanks to the help of great books like this one!

Update id definition to reflect current NASA metadata (metadata$dataset$identifier) as per issue dgrtwo#72

Oleh-Zaritskyi · 2024-05-15T19:03:05Z

Good afternoon,
thank you for your work, a very useful book.
I use it for learning. Help with this error.
After executing the code
desc_tf_idf <- full_join(desc_tf_idf, nasa_keyword, by = "id")
Gives an error message:
Detected an unexpected many-to-many relationship between x and y.
ℹ Row 1 of x matches multiple rows in y.
ℹ Row 47289 of y matches multiple rows in x.
ℹ If a many-to-many relationship is expected, set relationship = "many-to-many" to silence this warning.
Next code doesn't work:
desc_tf_idf %>%

filter(!near(tf, 1)) %>%
filter(keyword %in% c("SOLAR ACTIVITY", "CLOUDS",

                    "SEISMOLOGY", "ASTROPHYSICS",

                    "HUMAN HEALTH", "BUDGET")) %>%

arrange(desc(tf_idf)) %>%
group_by(keyword) %>%
distinct(word, keyword, .keep_all = TRUE) %>%
slice_max(tf_idf, n = 15, with_ties = FALSE) %>%
ungroup() %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
ggplot(aes(tf_idf, word, fill = keyword)) +
geom_col(show.legend = FALSE) +
facet_wrap(~keyword, ncol = 3, scales = "free") +
labs(title = "Highest tf-idf words in NASA metadata description fields",

   caption = "NASA metadata from https://data.nasa.gov/data.json",

```
   x = "tf-idf", y = NULL)
```

Mistake:
Error in combine_vars():
! Faceting variables must have at least one value.
Run rlang::last_trace() to see where the error occurred.

It seems to me that problem with changing _id with identifier is haunting us

juliasilge · 2024-05-16T02:24:44Z

@Oleh-Zaritskyi I believe a many-to-many relationship here is expected, so you will want to specify that. Also, note that the NASA data.json file has changed, so you'll need to update the wrangling code:

library(tidyverse)
library(tidytext)
library(jsonlite)
#> 
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#> 
#>     flatten

metadata <- fromJSON("https://data.nasa.gov/data.json")

metadata_wrangled <- as_tibble(metadata$dataset) |> 
  select(title, description, keyword) |> 
  mutate(id = row_number())

desc_tf_idf <- metadata_wrangled |> 
  unnest_tokens(word, description) |> 
  anti_join(stop_words) |> 
  count(id, word, sort = TRUE) |> 
  bind_tf_idf(word, id, n)
#> Joining with `by = join_by(word)`

nasa_keyword <- metadata_wrangled |> 
  unnest(keyword) |> 
  select(id, keyword)

full_join(desc_tf_idf, nasa_keyword, relationship = "many-to-many")
#> Joining with `by = join_by(id)`
#> # A tibble: 6,194,658 × 7
#>       id word      n    tf   idf tf_idf keyword                    
#>    <int> <chr> <int> <dbl> <dbl>  <dbl> <chr>                      
#>  1  9987 gt       96 0.201  5.29  1.06  active                     
#>  2  9987 gt       96 0.201  5.29  1.06  gmat                       
#>  3  9987 gt       96 0.201  5.29  1.06  goddard space flight center
#>  4  9987 gt       96 0.201  5.29  1.06  project                    
#>  5  9987 lt       96 0.201  5.33  1.07  active                     
#>  6  9987 lt       96 0.201  5.33  1.07  gmat                       
#>  7  9987 lt       96 0.201  5.33  1.07  goddard space flight center
#>  8  9987 lt       96 0.201  5.33  1.07  project                    
#>  9 16591 gt       94 0.188  5.29  0.997 sbir/sttr                  
#> 10 16591 gt       94 0.188  5.29  0.997 nasa headquarters          
#> # ℹ 6,194,648 more rows

^{Created on 2024-05-15 with reprex v2.1.0}

Oleh-Zaritskyi · 2024-05-17T06:12:10Z

@juliasilge Thank you very much. I'l try . Thank for you greate work, sorry for disturbing you

Oleh-Zaritskyi · 2024-05-17T19:05:09Z

@juliasilge Thank you, almost get the final topic.
One small error
lda_gamma <- full_join(lda_gamma, nasa_keyword, by = c("document" = "id"))
Error in full_join():
! Can't join x$document with y$id due to incompatible types.
ℹ x$document is a .
ℹ y$id is a .

juliasilge · 2024-05-17T20:06:25Z

@Oleh-Zaritskyi You need to convert one of those columns to be the same type as the other one, using mutate(). Then you can match them up during full_join().

Oleh-Zaritskyi · 2024-05-18T17:28:12Z

@juliasilge Thank you, everything works correctly

enzedonline added a commit to enzedonline/tidy-text-mining that referenced this issue Apr 12, 2022

Update 08-nasa-metadata.Rmd

2d08135

Update id definition to reflect current NASA metadata (metadata$dataset$identifier) as per issue dgrtwo#72

enzedonline mentioned this issue Apr 12, 2022

Update 08-nasa-metadata.Rmd #107

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NASA data.json has changed #72

NASA data.json has changed #72

juliasilge commented Dec 9, 2019

tmasjc commented Mar 17, 2020

juliasilge commented Mar 17, 2020

walinchus commented Jan 18, 2021

juliasilge commented Jan 22, 2021

walinchus commented Jan 22, 2021

walinchus commented Jan 22, 2021

walinchus commented Jan 22, 2021

juliasilge commented Jan 22, 2021

walinchus commented Jan 22, 2021

Oleh-Zaritskyi commented May 15, 2024

juliasilge commented May 16, 2024

Oleh-Zaritskyi commented May 17, 2024

Oleh-Zaritskyi commented May 17, 2024

juliasilge commented May 17, 2024

Oleh-Zaritskyi commented May 18, 2024

NASA data.json has changed #72

NASA data.json has changed #72

Comments

juliasilge commented Dec 9, 2019

tmasjc commented Mar 17, 2020

juliasilge commented Mar 17, 2020

walinchus commented Jan 18, 2021

juliasilge commented Jan 22, 2021

walinchus commented Jan 22, 2021

walinchus commented Jan 22, 2021

walinchus commented Jan 22, 2021

juliasilge commented Jan 22, 2021

walinchus commented Jan 22, 2021

Oleh-Zaritskyi commented May 15, 2024

juliasilge commented May 16, 2024

Oleh-Zaritskyi commented May 17, 2024

Oleh-Zaritskyi commented May 17, 2024

juliasilge commented May 17, 2024

Oleh-Zaritskyi commented May 18, 2024