
Delays in stations: lexical error #403

Open · Steven-UA opened this issue Feb 12, 2020 · 1 comment

Labels: Needs information, Question

Comments

@Steven-UA

My code giving me all the delays in the stations worked perfectly a couple of months ago. The last few days I've tried to rerun the same script, but I always get the same error. I've changed my script and updated the packages since, but I'm unable to make it work again. More advanced programmers I asked said the error looks like something is wrong with the data coming from the website I'm scraping, not with your script. Has the format changed or something?

The error/traceback:

 Error: lexical error: invalid char in json text.
                                       <br /> <b>Fatal error</b>:  Unc
                     (right here) ------^ 
5.
parse_string(txt, bigint_as_char) 
4.
parseJSON(txt, bigint_as_char) 
3.
parse_and_simplify(txt = txt, simplifyVector = simplifyVector, 
    simplifyDataFrame = simplifyDataFrame, simplifyMatrix = simplifyMatrix, 
    flatten = flatten, ...) 
2.
jsonlite::fromJSON(content(c, "text"), flatten = TRUE) 
1.
loop.scraper(12) 
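The "<br /> <b>Fatal error</b>" fragment looks like the start of an HTML error page, which suggests the server sometimes returns HTML instead of JSON. A minimal guard before parsing (a sketch reusing the script's httr calls; the station id is just an example) would be:

library(httr)
library(jsonlite)

url <- "https://api.irail.be/liveboard/?format=json&id=BE.NMBS.008812005" #example station id
resp <- GET(url)
body <- content(resp, "text")
if (http_error(resp) || http_type(resp) != "application/json") {
  #surface the start of the HTML error page instead of crashing fromJSON()
  warning("Non-JSON response from ", url, ": ", substr(body, 1, 120))
} else {
  parsed <- fromJSON(body, flatten = TRUE)
}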

My code (split over two scripts: one with the functions, one with the scraper loop):

library(httr)
library(jsonlite)
library(tidyverse)

load.stations <- function(){
  a <- GET("https://api.irail.be/stations/?format=json") #get command for all stations from irail api
  parsed <- jsonlite::fromJSON(content(a, "text"), flatten=TRUE) #parse json into r
  stations <- parsed$station %>% 
    filter(grepl("^BE.NMBS.0088",id)) #keep only stations in Belgium; the regex anchor ^ means 'begins with'
  return(stations)
}

get.time <- function(){
  time <- paste(format(Sys.time(),"%d/%m/%y %H:%M:%S")) #formats system time as a dd/mm/yy hh:mm:ss string
  strpt <- strptime(time,"%d/%m/%y %H:%M:%S") #parses the string back into an interpretable date-time
  return(strpt)
}

get.temp_df <- function(stations, i){
  goget <- paste0("https://api.irail.be/liveboard/?format=json&id=",stations$id[i]) #http for get command, get liveboard (similar to screens in station i)
  c <- GET(goget) #get the data
  parsed_c <- jsonlite::fromJSON(content(c, "text"), flatten=TRUE) #parse from json
  temp_df <- parsed_c$departures$departure #get the dataframe with departures from the parsed json
  return(temp_df)
}

add.to.all <- function(all_df, temp_df){
  all_df <- rbind(all_df, temp_df) %>% #add temporary dataframe to master dataframe
    group_by(stationneke, time, vehicle) %>% #group departures by station, time and vehicle to spot duplicates
    top_n(1, importtime) %>% #keep only the most recent observation per group (removes duplicates)
    ungroup() #lift grouping
  return(all_df)
}

save.day <- function(all_df){
  strpt <- get.time()
  saveRDS(all_df, file = paste(strpt$mday, strpt$mon+1, strpt$year+1900, "Punct.rds", sep = "-")) #.rds is the conventional extension for saveRDS output
  Sys.sleep(time = 3600-(strpt$min*60+strpt$sec)) #sleep the remainder of the current hour
  return(data.frame())
}
The second script (the scraper itself):

library(httr)
library(jsonlite)
library(tidyverse)

## all departures - scraper

loop.scraper <- function(hour_of_pause = 3){
  source("NMBS-punctuality-functions.R")
  all_df <- data.frame() #empty dataframe
  stations <- load.stations()
  while (TRUE) { #infinite loop
    strpt <- get.time()
    while(strpt$hour != hour_of_pause){ #enters loop when hour is not "hour_of_pause"
      # startloop <- (strpt$min*60 + strpt$sec)
      for (i in 1:nrow(stations)) { #second loop through the stations
        temp_df <- get.temp_df(stations, i)
        if(is.null(temp_df)) next #skip if dataframe is empty (some stations have been closed in recent years)
        temp_df$stationneke <- stations$name[i] #add departure station name i to the dataframe
        temp_df$importtime <- Sys.time() # add variable with the time of import of the observation
        all_df <- add.to.all(all_df, temp_df)
        strpt <- get.time()
      } #end of loop through stations
      # stoploop <- (strpt$min*60 + strpt$sec)
    } #end of hour-check loop, code below only executed when no trains active (at night)
    all_df <- save.day(all_df) #saves file and returns empty dataframe
  }
}
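One way to keep the loop alive when a single station intermittently returns HTML instead of JSON would be to wrap the fetch in tryCatch; a sketch (safe.temp_df is a hypothetical helper, not part of the script above):

safe.temp_df <- function(stations, i){
  tryCatch(get.temp_df(stations, i),
           error = function(e){
             message("Station ", stations$name[i], " failed: ", conditionMessage(e))
             NULL #loop.scraper already skips NULL dataframes via its is.null() check
           })
}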
@Bertware (Member)

Hi Steven,

In order to find the root cause we need a bit more information:

  • What is the URL of the api page that can't be parsed?
  • What is the response on the page? "Fatal error: Unc" is the start of an error message, but the important part is cut off.

In general I'd also say you're better off using another data format instead of scraping data from all stations for analytics. GTFS-RT is a way better fit, but it is hard to "quickly use" as it needs a lot of preprocessing.

We're working on a new "graph" API, which is based on this GTFS-RT data with the preprocessing already done for you: https://graph.irail.be/sncb/connections . This is a list of all departing trains, paginated by their departure time. If you get the pages for the upcoming hour, you have all departures and arrivals in all Belgian stations for the upcoming hour. This might be interesting for your use case, as this API is built to handle lots of requests and allows you to run analytics and reuse data client-side for different questions. See https://linkedconnections.org/ for more information; Pieter will also gladly tell you more. Ping @pietercolpaert.
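For instance, fetching and following a few pages from R could look like this (a sketch; the @graph and hydra:next field names follow the Linked Connections vocabulary and should be checked against the actual response):

library(httr)
library(jsonlite)

url <- "https://graph.irail.be/sncb/connections"
for (i in 1:3) { #walk three pages as a demo
  resp <- GET(url)
  page <- fromJSON(content(resp, "text"), flatten = TRUE)
  connections <- page[["@graph"]] #one row per departing connection on this page
  url <- page[["hydra:next"]] #link to the next page (later departures)
  if (is.null(url)) break
}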

This isn't a "stop using this API" thing, it's just something to consider in the future as it might make things easier for you ;) .

As a small footnote, I'd recommend setting a user-agent header when making requests to our API (this might be hard in R, but if it's possible, do it). This way we can contact you if we notice strange things on our side, such as invalid requests, or see who is getting rate limited in order to resolve it together.
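For what it's worth, httr supports this directly; a sketch with a placeholder agent string and contact address:

library(httr)

#per request:
resp <- GET("https://api.irail.be/stations/?format=json",
            user_agent("NMBS-punctuality-scraper/1.0 (you@example.com)"))

#or once for the whole session:
set_config(user_agent("NMBS-punctuality-scraper/1.0 (you@example.com)"))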

@Bertware added the Needs information and Question labels on Feb 16, 2020