A Neural Approach for Text Extraction from Scholarly Figures Dataset

abstract

# A Neural Approach for Text Extraction from Scholarly Figures

This is the readme for the supplemental data for our ICDAR 2019 paper.

## Datasets

We used different sources of data for testing, validation, and training. Our test set was assembled from the datasets published by Böschen et al., whose work we cite. We excluded the DeGruyter dataset from testing and used it as our validation set instead.

### Testing

These datasets contain a readme with license information. Further information about the associated project can be found on the authors' [project page](http://www.kd.informatik.uni-kiel.de/en/research/software/text-extraction).

- [EconBiz](http://www.kd.informatik.uni-kiel.de/en/research/software/text-extraction-files/econbiz-dataset)
- [CHIME-R](http://www.kd.informatik.uni-kiel.de/en/research/software/text-extraction-files/chime-r-dataset)
- [CHIME-S](http://www.kd.informatik.uni-kiel.de/en/research/software/text-extraction-files/chime-s-dataset)

### Validation

The [DeGruyter dataset](http://www.kd.informatik.uni-kiel.de/en/research/software/text-extraction-files/degruyter-dataset) does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on which program you use to extract the images from the PDFs they are provided in, you may have to re-number the images (see the re-numbering sketch at the end of this readme).

### Training

We used the dataset generated by [label_generator](https://github.com/domoritz/label_generator), which the author made available in a requester-pays [Amazon S3 bucket](s3://escience.washington.edu.viziometrics) (see the download sketch below).

We also used the Multi-Type Web Images dataset, which is mirrored [here](https://tianchi.aliyun.com/competition/introduction.htm?spm=5176.100066.0.0.3bcad780oQ9Ce4&raceId=231651).

## Code

We have made our code available in `code.zip`. We will upload code, announce further news, and field questions via the [GitHub repo](https://github.com/david-morris/Neural-Figure-Text).

Our text detection network is adapted from [Argman's EAST implementation](https://github.com/argman/EAST). The `EAST/checkpoints/ours` subdirectory contains the trained weights we used in the paper (see the detection sketch below).

We used a Tesseract script to run text extraction on the detected text rows. It is included in `code.tar` as `text_recognition_multipro.py` (see the recognition sketch below).

We used a Java evaluation tool provided by Falk Böschen, which we adapted to our file structure. We include it as `evaluator.jar`.

Parameter sweeps are automated by `param_sweep.rb`. This file also shows how to invoke all of these components (a hedged example invocation appears below).
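
### Sketch: extracting and re-numbering the DeGruyter images

The following Python sketch shows one way to extract and re-number the DeGruyter images. It is an illustration under assumptions, not part of the released code: it relies on Poppler's `pdfimages` tool, and the file names and directory layout are placeholders.

```python
import subprocess
from pathlib import Path

# Extract the embedded figures from one downloaded article as PNGs
# using Poppler's pdfimages CLI. The input/output names are placeholders.
pdf_path = "degruyter_article.pdf"
out_dir = Path("degruyter_images")
out_dir.mkdir(exist_ok=True)
subprocess.run(["pdfimages", "-png", pdf_path, str(out_dir / "img")], check=True)

# pdfimages names its output img-000.png, img-001.png, ... in PDF order.
# Re-number the files sequentially to match the numbering your labels expect.
for i, path in enumerate(sorted(out_dir.glob("img-*.png"))):
    path.rename(out_dir / f"{i:03d}.png")
```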
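### Sketch: downloading from the requester-pays bucket

Because the S3 bucket is requester-pays, anonymous downloads fail and the transfer is billed to you. A minimal boto3 sketch, with a placeholder object key:

```python
import boto3

# Uses your locally configured AWS credentials; since the requester pays
# for the transfer, every call must set RequestPayer="requester".
s3 = boto3.client("s3")

# List what is in the bucket.
resp = s3.list_objects_v2(
    Bucket="escience.washington.edu.viziometrics",
    RequestPayer="requester",
)
for obj in resp.get("Contents", []):
    print(obj["Key"])

# Download one object; the key below is hypothetical.
s3.download_file(
    Bucket="escience.washington.edu.viziometrics",
    Key="path/to/labels.tar.gz",
    Filename="labels.tar.gz",
    ExtraArgs={"RequestPayer": "requester"},
)
```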
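### Sketch: running detection with the trained weights

Detection can be run through the upstream `eval.py` entry point of Argman's EAST. The sketch below assumes the flag names documented in the upstream README at the time of writing, and the data paths are placeholders.

```python
import subprocess

# Run EAST detection over a directory of figure images using the weights
# in EAST/checkpoints/ours. Flag names follow argman/EAST's README and
# may differ in other revisions of that code.
subprocess.run(
    [
        "python", "eval.py",
        "--test_data_path=../figures/",        # placeholder input directory
        "--checkpoint_path=checkpoints/ours/",
        "--output_dir=../detections/",         # placeholder output directory
        "--gpu_list=0",
    ],
    cwd="EAST",
    check=True,
)
```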
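### Sketch: recognizing a detected text row

The released `text_recognition_multipro.py` is the authoritative version. As a rough stand-in, recognizing a single detected text row with pytesseract looks like the sketch below; the crop coordinates stand in for one EAST detection, and the page-segmentation setting is our assumption.

```python
from PIL import Image
import pytesseract

# Crop one detected text row out of a figure and run Tesseract on it.
# The box below is a placeholder for a single EAST detection.
figure = Image.open("figure.png")
row = figure.crop((40, 120, 400, 150))  # (left, top, right, bottom)

# --psm 7 tells Tesseract to treat the image as a single text line,
# which matches row-level recognition.
print(pytesseract.image_to_string(row, config="--psm 7"))
```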
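### Sketch: invoking the evaluator

As noted above, `param_sweep.rb` shows the real invocations of all components. The sketch below only illustrates the general shape of calling the evaluator from a script; the argument order is a guess, not the tool's actual interface.

```python
import subprocess

# Hypothetical invocation of the evaluation tool: the actual arguments
# are defined by evaluator.jar and demonstrated in param_sweep.rb.
subprocess.run(
    ["java", "-jar", "evaluator.jar", "ground_truth/", "detections/"],
    check=True,
)
```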

publication date

  • 2019

has restriction

  • https://creativecommons.org/licenses/by/3.0/