GitHub - arpitjh001/Supercool-projects: Extracting Trademark Applications and image-trademarks from Indian Trademark Journals and renaming according to their application_number,class and journal number

-Steps to follow :

Unzip all three zips
Place the journal_split folder under your eclipse-workspace folder
Include the jars available in jars.zip
Change line no. 20 in split_and_conquer.java to the location where you extracted tools.zip, i.e. the folder that holds both pdftohtml.exe and convert.exe. On my system it is:

bw.write("cd C:\\Users\\welcome\\Desktop"+"\n"); //location of pdftohtml.exe and convert.exe file

Change the path in the location string on line no. 16 in split_and_conquer.java to the folder where you have kept
the sample input files.

 String location = "C:\\Users\\welcome\\Desktop\\pdf"; //for testing
             ---------change to------
 String location = "location/of/your/sample input files";

Uncomment line 68 for getting images from pdf :
```
 c.executebatch(batchfile);
```
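What `executebatch` does internally is not shown here. As a rough illustration of the image conversion step only, the sketch below builds a call to convert.exe with `ProcessBuilder` instead of writing a `cd` line into a batch file. The class name `PpmToJpg`, the method names, and the `tools`/`page.ppm`/`page.jpg` paths are hypothetical; the only thing assumed about the tool is ImageMagick's basic `convert <input> <output>` form.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of split_and_conquer.java.
final class PpmToJpg {

    // Builds the command line: <toolsDir>/convert.exe <ppm> <jpg>
    static List<String> convertCommand(File toolsDir, File ppm, File jpg) {
        String exe = new File(toolsDir, "convert.exe").getPath();
        return Arrays.asList(exe, ppm.getPath(), jpg.getPath());
    }

    // Prepares (but does not start) the process, using toolsDir as the
    // working directory, which replaces the "cd ..." line of a batch file.
    static ProcessBuilder prepare(File toolsDir, File ppm, File jpg) {
        return new ProcessBuilder(convertCommand(toolsDir, ppm, jpg))
                .directory(toolsDir)
                .inheritIO();
    }

    public static void main(String[] args) {
        File tools = new File("tools"); // assumed location of the extracted tools.zip
        ProcessBuilder pb = prepare(tools, new File("page.ppm"), new File("page.jpg"));
        System.out.println(pb.command());
        // To actually run the conversion on a machine that has convert.exe:
        // int exitCode = pb.start().waitFor();
    }
}
```

Passing the arguments as a list avoids the backslash escaping and quoting problems that come with writing paths into a batch file.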

-Short Description :

This Java code extracts the application number, journal number and page number from published Indian and international trademarks. The Office of the Controller General of Patents, Designs and Trademarks of India publishes trademarks in journals on a weekly basis. Publication of a trademark application in the trademark journal gives the public a chance to raise a trademark opposition, if necessary. The code splits each journal (a PDF file containing trademark applications) into separate PDFs and renames each PDF in the format page-number_trademark-class_application-number_journal-number. This helps a data expert index the documents in a database such as MySQL or in a Solr/Elasticsearch platform. It also extracts images using the pdfimage.exe tool, which writes them in .ppm format, and then converts them to .jpg/.png using the convert tool.
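To make the naming format concrete, here is a minimal sketch that builds and parses such a file name. The class `SplitPageName`, its methods, the `.pdf` suffix, the all-digit fields and the sample numbers are assumptions for illustration, not the code in split_and_conquer.java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the format
// page-number_trademark-class_application-number_journal-number
final class SplitPageName {

    // Assumes every field is made of digits only and the file ends in .pdf.
    private static final Pattern NAME =
            Pattern.compile("(\\d+)_(\\d+)_(\\d+)_(\\d+)\\.pdf");

    static String build(int page, int tmClass, String applicationNumber, String journalNumber) {
        return page + "_" + tmClass + "_" + applicationNumber + "_" + journalNumber + ".pdf";
    }

    // Returns {page, class, application number, journal number}.
    static String[] parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a split page name: " + fileName);
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        // Made-up sample values.
        String name = build(12, 25, "1234567", "1800");
        System.out.println(name); // prints 12_25_1234567_1800.pdf
        System.out.println(String.join(", ", parse(name)));
    }
}
```

Because the fields sit at fixed positions in the name, an indexer can recover them from the file name alone without reopening the PDF.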

-Challenges Faced :

A trademark application inside the journal may span two or more pages, so it is difficult to split the journal in such a way that a multipage application ends up as a single PDF file. Some journals also begin with a cover page and an index, which are of no use. To handle both situations I developed the concept of a useless page. First, I split the journal PDF into single pages blindly, without caring whether an application spans several pages or whether a page belongs to the index. Then I rename each page either in the format described above or as 'useless' if no regex match for a class number or application number is found. Next, consecutive pages are merged so that the first page of each merged group is not a useless page and the pages after it are useless pages. In short, every useless page is merged into the 'not useless' page above it, and the separate useless page file is then deleted. Finally, the index pages at the start are deleted, if they exist.
Writing the regex for extracting the application number is itself a challenging task, as there are many different scenarios.
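The useless-page merge can be sketched as a grouping pass over the page names in journal order. This is a simplified illustration under assumptions: it works on names only (no PDF merging), treats any name starting with `useless` as a useless page, and drops the leading useless pages in the same pass rather than deleting them at the end. The class `UselessPageMerger` and the sample names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the grouping logic; the real code also merges the PDFs.
final class UselessPageMerger {

    static final String USELESS = "useless";

    // Each returned group is one application: a 'not useless' page
    // followed by the useless pages that directly follow it.
    static List<List<String>> group(List<String> pageNames) {
        List<List<String>> groups = new ArrayList<>();
        for (String name : pageNames) {
            if (!name.startsWith(USELESS)) {
                // A page with a class/application number starts a new group.
                List<String> g = new ArrayList<>();
                g.add(name);
                groups.add(g);
            } else if (!groups.isEmpty()) {
                // Continuation page: attach it to the page above it.
                groups.get(groups.size() - 1).add(name);
            }
            // Otherwise it is a leading useless page (cover or index) and is skipped.
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up page names in journal order.
        List<String> pages = Arrays.asList(
                "useless_1", "useless_2",
                "3_25_1111111_1800", "useless_4",
                "5_9_2222222_1800");
        for (List<String> g : group(pages)) {
            System.out.println(g);
        }
    }
}
```

Each group would then be merged into one PDF that keeps the name of its first page.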

I have provided sample_input_files.zip. You can also download more journals from the following link: http://www.ipindia.gov.in/journal-tm.htm

