FEATURE REQUEST: Extract Highlights #330

Open
pr0fsmith opened this issue Apr 23, 2024 · 18 comments
Labels
enhancement New feature or request

Comments

@pr0fsmith

pr0fsmith commented Apr 23, 2024

It would be great to be able to extract highlights, along with the document title and page number, to a text file that I can then copy to my computer. I found the script below, if it helps. I just can't figure out how to get the page number.

#!/bin/bash

# getPDFHighlights.sh
#    fetches Remarkable PDF text highlights, stores all data in the directory it is called from.

# 2022-09-18, NickK: some clean ups
# 2022-09-17, NickK: initial version

# Dear gods, this is one horrible, horrible result of bored saturday night
# hacking, more to prove a point and to keep me away from rage-buying another
# device that would allow me not only to do PDF text highlights, but also to extract them.

# use at your own risk, it might fry your flux compensators.

# Works under Windows WSL Ubuntu, device software 2.14. Let me know about your setup.

# dependencies:
#    jq
#    sshpass
#    scp
#    ssh
#    ls
#    find
#    cat

#   reminder: sudo apt-get install <dependency>

# ---------------------------------------------------------------------
# edit this section for your device

USERNAME="root"
HOST="my.ip.ad.ress"
PASS="mypass"

# ---------------------------------------------------------------------
# nothing to do below here

PATH_XOCHITL='/home/root/.local/share/remarkable/xochitl/'

SCRIPT="
  cd ${PATH_XOCHITL}
  ls -1d *.highlights"

# get a list of all existing highlight directories (i.e. for all documents that contain any highlights)
echo "logging in to device in order to fetch a list of all highlighted files (PDF and non-PDF alike)"

filesHighlighted=$(sshpass -p ${PASS} ssh -l ${USERNAME} ${HOST} "${SCRIPT}")

# narrow down to highlight directories for PDF files only and fetch their .content file, containing meta data
myHighlights=($filesHighlighted)
for (( i=0; i<${#myHighlights[@]}; i++ )); do
    bareName=${myHighlights[$i]%.highlights}
    contentName="${bareName}.content"
    echo "  ${contentName}"
    sshpass -p ${PASS} ssh -l ${USERNAME} ${HOST} "grep -q \"pdf\" ${PATH_XOCHITL}${contentName} && cat ${PATH_XOCHITL}${contentName} " > $contentName
done

echo -e "\nremoving non pdf-associated .content files\n"
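# they are all size 0; the tests come before -delete so that only empty
# .content files in this directory are removed
find . -maxdepth 1 -type f -name '*.content' -size 0 -delete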

# the last couple of steps could really need some refactoring love

# walk through local pdf content files
filesPdfHighlighted=$(ls -1 *.content)
myPdfHighlights=($filesPdfHighlighted)

# fetch all directories for PDFs highlighted
echo -e "copying PDF highlight directories\n"
for (( i=0; i<${#myPdfHighlights[@]}; i++ )); do
    bareName=${myPdfHighlights[$i]%.content}
    sshpass -p ${PASS} scp -r ${USERNAME}@${HOST}:${PATH_XOCHITL}${bareName}.highlights .
done

echo -e "walking through meta data for PDF files, getting highlight data files for individual pages\n"

let pdfFileCount=0;
for (( i=0; i<${#myPdfHighlights[@]}; i++ )); do
  let pdfFileCount++
  echo "File: ${myPdfHighlights[$i]}"

  # get pages (i.e. page IDs) out of content metadata file
  pages=$(cat ${myPdfHighlights[$i]}| jq .pages[] -r)
  myPages=($pages)

  # walk through the pages
  let pageCount=0
  for (( j=0; j<${#myPages[@]}; j++ )); do

      # check if there is a highlight (JSON file) for a given page ID
      if test -f "${myPdfHighlights[i]%.content}.highlights/${myPages[$j]}.json"; then
              let pageCount++

              # extract highlighted content from JSON files to text files
              # naming the latter to the original document's page number

              # - first page in Remarkable is 0, so we +1 to have the right offset
              # - left-pad page number with 0. 1 becomes 001. 53 becomes 053.
              k=$(printf "%03d" $((j + 1)))

              echo "  extracting page ${k}"

              # put page number in page file
              echo -e ">>>> highlights on page ${k}\n" > "${myPdfHighlights[i]%.content}.highlights/${k}.txt"

              # using jq to extract text, sorting for highlight position instead of keeping the default order.
              #   Remarkable's devices store highlights in the order you actually did them.

              #   ATT: you can highlight the same text more than once, and consequently it will be extracted more than once

              cat "${myPdfHighlights[i]%.content}.highlights/${myPages[$j]}.json" | jq ".highlights[]|=sort_by(.start)|.highlights[][].text" -r >> "${myPdfHighlights[i]%.content}.highlights/${k}.txt"

              # delete local highlight JSON
              rm "${myPdfHighlights[i]%.content}.highlights/${myPages[$j]}.json"
      fi
  done
  echo -e "Extracted highlights on ${pageCount} page(s).\n"

done

echo -e "There were ${pdfFileCount} PDF file(s) with highlights.\n"
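
On the page-number question: in the script above the page number is not read from the highlight file itself. It comes from the position of the page ID in the .content file's page list (the loop index j, plus 1). Below is a minimal sketch of that lookup in plain bash. The helper name `page_number_for_id` and the IDs are made up for illustration, and it assumes the page list is still in the PDF's original order.

```bash
#!/bin/bash
# Hypothetical helper: print the 1-based, zero-padded page number of a page ID,
# given the ordered list of page IDs taken from the document's .content file.
page_number_for_id() {
    local wanted="$1"; shift
    local index=0 id
    for id in "$@"; do
        if [[ "$id" == "$wanted" ]]; then
            # the first page has index 0, so add 1; pad to three digits like the script
            printf '%03d\n' "$((index + 1))"
            return 0
        fi
        index=$((index + 1))
    done
    return 1   # page ID is not listed in this document
}

# Placeholder IDs standing in for the real UUID page names
pages=("id-a" "id-b" "id-c")
page_number_for_id "id-b" "${pages[@]}"
```

If pages were inserted, deleted or reordered on the tablet, the index may no longer match the printed page number of the PDF.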

@anbzerc

anbzerc commented Apr 24, 2024

+1 I spent a lot of time trying to do it, without success.

@pr0fsmith
Author

pr0fsmith commented Apr 24, 2024 via email

@Eeems
Contributor

Eeems commented Apr 24, 2024

@pr0fsmith you may want to edit your first post to wrap the code with ``` so that it's properly formatted as code.

@pr0fsmith
Author

pr0fsmith commented Apr 24, 2024 via email

@pr0fsmith
Author

Here's my finished code:

#!/bin/bash

# getPDFHighlights.sh
#    fetches Remarkable PDF text highlights, stores all data in the directory it is called from.

# 2022-09-18, NickK: some clean ups
# 2022-09-17, NickK: initial version

# Dear gods, this is one horrible, horrible result of bored saturday night
# hacking, more to prove a point and to keep me away from rage-buying another
# device that would allow me not only to do PDF text highlights, but also to extract them.

# use at your own risk, it might fry your flux compensators.

# Works under Windows WSL Ubuntu, device software 2.14. Let me know about your setup.

# dependencies:
#    jq
#    sshpass
#    scp
#    ssh
#    ls
#    find
#    cat
#    rsync

#   reminder: sudo apt-get install <dependency>

# ---------------------------------------------------------------------
# edit this section for your device

USERNAME="root"
HOST="enter ip of your rM here"
PASS="enter the password of your rM here"
PATH_XOCHITL='/home/root/.local/share/remarkable/xochitl/' # you may or may not need to change the directory path here

# ---------------------------------------------------------------------
# nothing to do below here

SCRIPT="
  cd ${PATH_XOCHITL}
  ls -1d *.highlights"

# get a list of all existing highlight directories (i.e. for all documents that contain any highlights)
echo "logging in to device in order to fetch a list of all highlighted files (PDF and non-PDF alike)"

filesHighlighted=$(sshpass -p ${PASS} ssh -l ${USERNAME} ${HOST} "${SCRIPT}")

# narrow down to highlight directories for PDF files only and fetch their .content file, containing meta data
myHighlights=($filesHighlighted)
for (( i=0; i<${#myHighlights[@]}; i++ )); do
    bareName=${myHighlights[$i]%.highlights}
    contentName="${bareName}.content"
    echo "  ${contentName}"
    sshpass -p ${PASS} ssh -l ${USERNAME} ${HOST} "grep -q \"pdf\" ${PATH_XOCHITL}${contentName} && cat ${PATH_XOCHITL}${contentName} " > $contentName
done

echo -e "\nremoving non pdf-associated .content files\n"
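# they are all size 0; the tests come before -delete so that only empty
# .content files in this directory are removed
find . -maxdepth 1 -type f -name '*.content' -size 0 -delete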

# the last couple of steps could really need some refactoring love

# walk through local pdf content files
filesPdfHighlighted=$(ls -1 *.content)
myPdfHighlights=($filesPdfHighlighted)

# fetch all directories for PDFs highlighted
echo -e "copying PDF highlight directories and metadata files\n"
for (( i=0; i<${#myPdfHighlights[@]}; i++ )); do
    bareName=${myPdfHighlights[$i]%.content}
    sshpass -p ${PASS} rsync -a ${USERNAME}@${HOST}:${PATH_XOCHITL}${bareName}.highlights . --info=progress2
    sshpass -p ${PASS} rsync ${USERNAME}@${HOST}:${PATH_XOCHITL}${bareName}.metadata . --info=progress2
done

echo -e "walking through meta data for PDF files, getting highlight data files for individual pages\n"

let pdfFileCount=0;
for (( i=0; i<${#myPdfHighlights[@]}; i++ )); do
  let pdfFileCount++
  echo "File: ${myPdfHighlights[$i]}"

  # Check the structure of the .content file
  if jq -e '.cPages.pages[].id' "${myPdfHighlights[$i]}" >/dev/null 2>&1; then
    # If the structure matches the nested page ID, extract page IDs accordingly
    pages=$(cat ${myPdfHighlights[$i]}| jq .cPages.pages[].id -r)
  else
    # Otherwise, assume the existing structure and extract page IDs directly
    pages=$(cat ${myPdfHighlights[$i]}| jq .pages[] -r)
  fi
  myPages=($pages)

# Walk through the pages
let pageCount=0
for (( j=0; j<${#myPages[@]}; j++ )); do

    # Check if there is a highlight (JSON file) for a given page ID
    if test -f "${myPdfHighlights[i]%.content}.highlights/${myPages[$j]}.json"; then
            let pageCount++

            # Extract visibleName from metadata file
            visibleName=$(jq -r '.visibleName' "${myPdfHighlights[i]%.content}.metadata")

            # Create the 'md' directory if it does not exist
            mkdir -p md

            # Extract highlighted content from JSON files to Markdown files
            # Naming the latter to the original document's page number

            # - First page in Remarkable is 0, so we +1 to have the right offset
            # - Left-pad page number with 0. 1 becomes 001. 53 becomes 053.
            k=$(printf "%03d" $((j + 1)))

            echo "  Extracting page ${k}" 

            # Using jq to extract the text. Highlights come out in stored order:
            # Remarkable's devices store highlights in the order you actually did them,
            # and the sort_by(.start) step of the original script is not applied here.
            echo -e "\n [[${visibleName}]]  Highlights on page ${k}" >> "md/${visibleName}.md"
            # ATT: You can highlight the same text more than once, and consequently it will be extracted more than once
            jq -r '.highlights[] | .[].text' "${myPdfHighlights[i]%.content}.highlights/${myPages[$j]}.json" | while IFS= read -r line; do
                echo -n "$line " >> "md/${visibleName}.md"
            done

            # Add an empty line to separate contents of different .json files
            # Also, append the page number and the name of the pdf file wrapped with 2 square brackets on each side
          
            echo "" >> "md/${visibleName}.md"

            # Delete local highlight JSON
            rm "${myPdfHighlights[i]%.content}.highlights/${myPages[$j]}.json"
    fi
done


done

echo -e "There were ${pdfFileCount} PDF file(s) with highlights.\n"
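
The two .content layouts that the script checks for can also be handled in a single jq filter. This is only a sketch under the assumption that those are the only two layouts (page IDs either under `.cPages.pages[].id` or directly in `.pages[]`, as in the script above). `list_page_ids` is a hypothetical helper name, and jq must be installed.

```bash
#!/bin/bash
# Hypothetical helper: list page IDs from a .content file in either layout.
#   nested layout: {"cPages": {"pages": [{"id": "..."}]}}
#   flat layout:   {"pages": ["...", "..."]}
list_page_ids() {
    # .cPages.pages evaluates to null (falsy) when the nested layout is absent,
    # so the filter falls back to the flat .pages array
    jq -r 'if .cPages.pages then .cPages.pages[].id else .pages[] end' "$1"
}

# Usage (requires jq):
#   list_page_ids some-document.content
```

If neither key is present, jq exits with an error, which is probably what you want to see rather than an empty page list.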

@pr0fsmith
Author

Updated the script again.

  1. The user is prompted to enter the user name and IP address
  2. The user can now select a single file to have highlights extracted from
  3. The user can optionally choose to extract all highlights from all files
  4. Source files are deleted from the computer when finished
  5. Highlights from other file types are included
  6. The password was removed from the script to encourage the use of an SSH key
#!/bin/bash

# getPDFHighlights.sh
#    fetches Remarkable PDF text highlights, stores all data in the directory it is called from.

# 2022-09-18, NickK: some clean ups
# 2022-09-17, NickK: initial version

# Dear gods, this is one horrible, horrible result of bored saturday night
# hacking, more to prove a point and to keep me away from rage-buying another
# device that would allow me not only to do PDF text highlights, but also to extract them.

# use at your own risk, it might fry your flux compensators.

# Works under Windows WSL Ubuntu, device software 2.14. Let me know about your setup.

# dependencies:
#    jq
#    rsync
#    ssh
#    ls
#    find
#    cat

#   reminder: sudo apt-get install <dependency>

# It is recommended you setup an SSH Key https://remarkable.guide/guide/access/ssh.html#setting-up-a-ssh-key

# nothing to do below here

echo "Enter your tablet's user name. This is usually 'root'"
read -r USERNAME
echo "Enter your tablet's IP address"
read -r HOST
PATH_XOCHITL='/home/root/.local/share/remarkable/xochitl/'

# Function to fetch visible names from metadata files
get_visible_name() {
    local filename="$1"
    local metadata_file="${filename%.highlights}.metadata"
    ssh -l ${USERNAME} ${HOST} "cat ${PATH_XOCHITL}$metadata_file" | jq -r '.visibleName'
}

# Function to convert highlight data to a Markdown file
highlights_to_markdown() {
echo -e "walking through meta data for PDF files, getting highlight data files for individual pages\n"

let pdfFileCount=0;
for (( i=0; i<${#myHighlights[@]}; i++ )); do
  let pdfFileCount++
  echo "File: ${myHighlights[$i]}"

  # Check the structure of the .content file
  if jq -e '.cPages.pages[].id' "${myHighlights[$i]}" >/dev/null 2>&1; then
    # If the structure matches the nested page ID, extract page IDs accordingly
    pages=$(cat ${myHighlights[$i]}| jq .cPages.pages[].id -r)
  else
    # Otherwise, assume the existing structure and extract page IDs directly
    pages=$(cat ${myHighlights[$i]}| jq .pages[] -r)
  fi
  myPages=($pages)

# Walk through the pages
let pageCount=0
for (( j=0; j<${#myPages[@]}; j++ )); do

    # Check if there is a highlight (JSON file) for a given page ID
    if test -f "${myHighlights[i]%.content}.highlights/${myPages[$j]}.json"; then
            let pageCount++

            # Extract visibleName from metadata file
            visibleName=$(jq -r '.visibleName' "${myHighlights[i]%.content}.metadata")

            # Extract highlighted content from JSON files to Markdown files
            # Naming the latter to the original document's page number

            # - First page in Remarkable is 0, so we +1 to have the right offset
            # - Left-pad page number with 0. 1 becomes 001. 53 becomes 053.
            k=$(printf "%03d" $((j + 1)))

            echo "  Extracting page ${k}" 

            # Using jq to extract the text. Highlights come out in stored order:
            # Remarkable's devices store highlights in the order you actually did them,
            # and the sort_by(.start) step of the original script is not applied here.
            echo -e "\n [[${visibleName}]]  Highlights on page ${k}" >> "${visibleName}.md"
            # ATT: You can highlight the same text more than once, and consequently it will be extracted more than once
            jq -r '.highlights[] | .[].text' "${myHighlights[i]%.content}.highlights/${myPages[$j]}.json" | while IFS= read -r line; do
                echo -n "$line " >> "${visibleName}.md"
            done

            # Add an empty line to separate contents of different .json files
            # Also, append the page number and the name of the pdf file wrapped with 2 square brackets on each side
          
            echo "" >> "${visibleName}.md"

            # Delete local highlight JSON
            rm "${myHighlights[i]%.content}.highlights/${myPages[$j]}.json"
    fi
done


done

}


# Fetch list of highlighted files
echo "Choose an option:"
i=1
filesHighlighted=$(ssh -l ${USERNAME} ${HOST} "cd ${PATH_XOCHITL} && ls -1d *.highlights")
for option in $filesHighlighted; do
    visible_name=$(get_visible_name "$option")
    echo "$i. $option ($visible_name)"
    ((i++))
done

# Prompt user for selection
read -p "Enter the number of the option you want to choose (or 'a' to choose all): " choice_num

# Check if the choice is 'a'
if [ "$choice_num" = "a" ]; then
    echo "You have chosen to copy all files."
    rsync -a -v ${USERNAME}@${HOST}:${PATH_XOCHITL}*.metadata ${USERNAME}@${HOST}:${PATH_XOCHITL}*.content ${USERNAME}@${HOST}:${PATH_XOCHITL}*.highlights . --progress
    filesHighlighted=$(ls -1 *.content)
    myHighlights=($filesHighlighted)
    highlights_to_markdown
    echo "Deleting source files from computer leaving only the markdown file(s)"
    rm -R *.metadata *.content *.highlights
    exit 0
fi

# Validate user input
if ! [[ "$choice_num" =~ ^[0-9]+$ ]]; then
    echo "Error: Please enter a valid number or 'a' for all."
    exit 1
fi

# Check if the selected number is within the valid range
# ($i ends one past the last listed option, so the upper bound is exclusive)
if [ "$choice_num" -ge 1 ] && [ "$choice_num" -lt "$i" ]; then
    # Get the corresponding option
    chosen_option=$(echo "$filesHighlighted" | sed -n "${choice_num}p")
    echo "You have chosen: $(get_visible_name "$chosen_option")"
    
    # Assign the name of the chosen .highlights file to chosenfile without the extension
    chosenfile=$(basename "$chosen_option" .highlights)
    echo "Copying files"
    rsync -a -v ${USERNAME}@${HOST}:${PATH_XOCHITL}"$chosenfile".metadata ${USERNAME}@${HOST}:${PATH_XOCHITL}"$chosenfile".highlights ${USERNAME}@${HOST}:${PATH_XOCHITL}"$chosenfile".content . --progress
    filesHighlighted=$(ls -1 *.content)
    myHighlights=($filesHighlighted)    
    highlights_to_markdown
    echo "Deleting source files from computer leaving only the markdown file(s)"
    rm -R "$chosenfile"*
else
    echo "Error: Please enter a number within the valid range or 'a' for all."
fi
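
The menu counter in the script ends one past the last listed entry, so it may be safer to check the choice against the number of entries instead. Here is a minimal sketch in plain bash. `pick_option` is a hypothetical helper name, and the option names are placeholders rather than real directory names.

```bash
#!/bin/bash
# Hypothetical helper: map a 1-based menu choice onto a list of options.
# Prints the chosen option, or returns non-zero for anything out of range.
pick_option() {
    local choice="$1"; shift
    local options=("$@")
    # digits only; rejects empty input, letters and negative numbers
    [[ "$choice" =~ ^[0-9]+$ ]] || return 1
    # force base 10 so input such as "08" is not parsed as octal
    choice=$((10#$choice))
    (( choice >= 1 && choice <= ${#options[@]} )) || return 1
    printf '%s\n' "${options[choice - 1]}"
}

# Placeholder names standing in for the *.highlights directories on the tablet
options=("doc-one.highlights" "doc-two.highlights")
pick_option 2 "${options[@]}"
```

Keeping the options in an array also avoids re-parsing the listing with sed to find the chosen line.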

@beelux added the enhancement (New feature or request) label Apr 30, 2024
@beelux
Collaborator

beelux commented Apr 30, 2024

Pretty cool script, but if it's run remotely (from a computer), should this really be within rm-hacks?
If so, how do you see this being implemented? As a button that extracts the highlights in the current document?

@pr0fsmith
Author

> Pretty cool script, but if it's run remotely (from a computer), should this really be within rm-hacks? If so, how do you see this being implemented? As a button that extracts the highlights in the current document?

I think there's benefit to it being an rm-hack for less advanced users. On the rM itself, I imagine the implementation would be a button that extracts the highlights and either emails them to you or creates a page at the end of the document with the highlights on it. In either case, the user would have the means to import those highlights into a program of their choosing.

@folofjc

folofjc commented May 9, 2024

Please see remarks

I was using it a lot when on 2.x. I pushed some updates to follow the way reMarkable changed things. The problem is that it is always a moving target, since rM changes how they do highlights quite a bit.

It isn't updated to 3.x yet (see here).

Here is something that parses the new format: parser

@free-da

free-da commented May 16, 2024

Hi, I was looking for this, and I was indeed thinking about an integrated button that exports the highlighted lines into a new notebook (page). Each highlighted section would be annotated with the paper's name and page number, with expandable space for writing more annotations. This would be incredibly helpful for academic work. Cheers!

@pr0fsmith
Author

pr0fsmith commented May 16, 2024 via email

@thomasartopoulos

Hello!
I'm actually trying to run your script, but apparently no .highlights files are found on my device. Is this available for versions above 3?

@pr0fsmith
Author

> Hello! I'm actually trying to run your script, but apparently no .highlights files are found on my device. Is this available for versions above 3?

Double check that the path in the script matches your device. PATH_XOCHITL value may need to be changed.

@thomasartopoulos

thomasartopoulos commented Jun 3, 2024

> > Hello! I'm actually trying to run your script, but apparently no .highlights files are found on my device. Is this available for versions above 3?
>
> Double check that the path in the script matches your device. PATH_XOCHITL value may need to be changed.

Thanks for your answer!

Actually the path is correct.

This is the error:
ls: *.highlights: No such file or directory

There is no .highlights file in my folder, only these files:

ecf7a0cd-753f-4505-b2ae-3b5534b6bbeb
ecf7a0cd-753f-4505-b2ae-3b5534b6bbeb.content
ecf7a0cd-753f-4505-b2ae-3b5534b6bbeb.metadata
ecf7a0cd-753f-4505-b2ae-3b5534b6bbeb.pagedata
ecf7a0cd-753f-4505-b2ae-3b5534b6bbeb.pdf
ecf7a0cd-753f-4505-b2ae-3b5534b6bbeb.thumbnails

@Eeems
Contributor

Eeems commented Jun 3, 2024

I would recommend using find . -maxdepth 1 -name '*.highlights' -type d instead of ls -1d *.highlights. This will not fail if there are no highlights folders.
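
As a rough illustration of that suggestion, the find command can feed an array directly, so an empty result is simply an empty array rather than an ls error. This is a sketch that runs against any local directory. `collect_highlight_dirs` and `HIGHLIGHT_DIRS` are hypothetical names, and it assumes bash plus a find that supports -maxdepth and -print0.

```bash
#!/bin/bash
# Hypothetical helper: collect *.highlights directories directly under a directory
# into the global array HIGHLIGHT_DIRS, leaving it empty when none exist.
collect_highlight_dirs() {
    local dir="$1"
    local path
    HIGHLIGHT_DIRS=()
    # -print0 with read -d '' keeps names intact even if they contain spaces
    while IFS= read -r -d '' path; do
        HIGHLIGHT_DIRS+=("${path##*/}")   # keep only the directory name
    done < <(find "$dir" -maxdepth 1 -name '*.highlights' -type d -print0)
}

collect_highlight_dirs "."
echo "found ${#HIGHLIGHT_DIRS[@]} highlight directories"
```

A script can then test whether the array is empty and stop with a clear message before trying to copy anything.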

@thomasartopoulos

> I would recommend using find . -maxdepth 1 -name '*.highlights' -type d instead of ls -1d *.highlights. This will not fail if there are no highlights folders.

Still facing issues. Export attached, produced with the original code.
logs.txt

The .highlights directories are missing from my folder. I searched the whole device but found nothing. I'm running version 3.11.2.5.

@pr0fsmith
Author

> > I would recommend using find . -maxdepth 1 -name '*.highlights' -type d instead of ls -1d *.highlights. This will not fail if there are no highlights folders.
>
> Still facing issues. Export attached with original code. logs.txt
>
> .highlights are missing on my folder. Searched on the whole device but nothing found. I'm actually running Version 3.11.2.5.

I'm running 3.9 right now. I wonder if rM changed the way they save highlights. I don't think I'm going to be able to help since I'm not running the same software version.

@thomasartopoulos

> > > I would recommend using find . -maxdepth 1 -name '*.highlights' -type d instead of ls -1d *.highlights. This will not fail if there are no highlights folders.
> >
> > Still facing issues. Export attached with original code. logs.txt
> > .highlights are missing on my folder. Searched on the whole device but nothing found. I'm actually running Version 3.11.2.5.
>
> I'm running 3.9 right now. I wonder if rM changed the way they save highlights. I don't think I'm going to be able to help since I'm not running the same software version.

Oh. Yeah, there might be something different. I will try to upgrade to your version or just wait for an official solution. Thanks a lot.


7 participants