Preliminary entity topic model implementation + Migration to scala 2.10 + Factorie NER spotters (trainable + trained models) #306
* Spark training for entity topic model
* increased performance for parsing wikipages into entity topic models
* bug fixes in training
* initial implementation of entity topic disambiguator
…raining and disambiguation works, but it requires a lot of memory at the moment. Probabilistic count stores might help here.
* Serialization and deserialization working
* Integration in SpotlightModel
* New spotter for pretrained ner models from factorie: see PretrainedFactorieNerSpotter, SimpleNerSpotter, BilouConllNerSpotter
…ailes to build spotlight, because the log grows too big
@dirkweissenborn would love to give this a try, any chance you can upload
@dav009 if you want to use the pretrained models, just uncomment the factorie dependencies in the core pom.xml (see below). This should be enough to use either one of the following spotters:
/**
* Fast, but not as sophisticated as BilouConllPretrainedCRFSpotter
*/
object SimpleNerSpotter extends PretrainedFactorieNerSpotter(DocumentAnnotatorPipeline.apply[ner.NerTag])
/**
* Slow but uses many features for NERTagging and is thus much more sophisticated compared to the SimpleNerSpotter
*/
object BilouConllNerSpotter extends PretrainedFactorieNerSpotter(DocumentAnnotatorPipeline.apply[ner.BilouConllNerTag])

<!-- don't download if not used, these are models. Uncomment these dependencies if you use a PretrainedFactorieNerSpotter -->
<!--dependency>
<groupId>cc.factorie.app.nlp</groupId>
<artifactId>ner</artifactId>
</dependency>
<dependency>
<groupId>cc.factorie.app.nlp</groupId>
<artifactId>pos</artifactId>
</dependency-->
Hi @dirkweissenborn, we have been trying this locally and evaluating the use of the new spotters. We have uncommented these dependencies and set
Also, when we set
@tgalery yeah, the predefined spotters are not yet integrated within the SpotlightModel. The SpotlightModel integration only works for trained models from our own trainable model implementation. The pretrained model implementation is just a wrapper around factorie NER models. You can easily integrate the pretrained models, though. Just change the spotter property to something else (e.g., pretrained-ner) and add a case to the SpotlightModel for that.
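For illustration, the "add a case" suggestion could look roughly like the sketch below. Everything in it is a hypothetical stand-in (the `Spotter` trait, the two spotter objects, and `SpotterSelection.select` are not the actual Spotlight classes); only the property value `pretrained-ner` comes from the comment above.

```scala
// Hypothetical stand-ins: the real Spotlight Spotter trait and SpotlightModel look different.
trait Spotter { def name: String }

object TrainedModelSpotter extends Spotter { val name = "trained" }
object PretrainedNerSpotter extends Spotter { val name = "pretrained-ner" }

object SpotterSelection {
  // Maps the value of an assumed "spotter" property to a spotter implementation.
  def select(property: String): Spotter = property match {
    case "pretrained-ner" => PretrainedNerSpotter // the newly added case
    case _                => TrainedModelSpotter  // fall back to the existing behaviour
  }
}
```

In the real code, the new case would presumably return one of the pretrained spotters such as SimpleNerSpotter or BilouConllNerSpotter.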
Cool, we'll do that then.
There is a trained entity topic model here, for whoever is interested in testing. This is a compressed file containing both the model and the stores needed to create the EntityTopicDisambiguator. The model can be loaded through SimpleEntityTopicModel.fromFile(file). The stores can be loaded the same way as in the SpotlightModel. It should all be fairly easy.
Hey all, I would merge this as soon as the Scala version upgrade is tested. Let me know if any of you have the time to test this. The raw counts to try are here.
The entity topic model implementation is very simple and preliminary, but it is fast and works at the moment. It relies on the statistical backend for training (spotter + stores). It currently needs a lot of memory, though. Probabilistic count stores might be a nice way to reduce those requirements.
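As an illustration of what a probabilistic count store could be, here is a minimal count-min sketch. This is one well-known technique for approximate counting in fixed memory, not something this PR implements; the class name and parameters are assumptions made for the example.

```scala
import scala.util.hashing.MurmurHash3

// Count-min sketch: approximate counts in fixed memory.
// With non-negative increments, estimates may overcount but never undercount.
class CountMinSketch(depth: Int, width: Int) {
  require(depth > 0 && width > 0, "depth and width must be positive")
  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, obtained by seeding MurmurHash3 with the row index.
  private def bucket(key: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(key, row)
    ((h % width) + width) % width // map possibly negative hash into [0, width)
  }

  def add(key: String, count: Long = 1L): Unit = {
    var row = 0
    while (row < depth) {
      table(row)(bucket(key, row)) += count
      row += 1
    }
  }

  // The minimum over all rows is the estimate least inflated by collisions.
  def estimate(key: String): Long =
    (0 until depth).map(row => table(row)(bucket(key, row))).min
}
```

Memory use is fixed at depth × width counters regardless of how many distinct keys are added, at the cost of occasionally overestimating counts when keys collide.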
I also migrated the whole project to Scala 2.10. Further testing would be nice, e.g., creating a Spotlight model (I don't have the raw counts).
Currently, disambiguation on CSAW reaches a precision of around 0.81 after 150 iterations of training. That is slightly worse than our common-sense baseline at 0.83, which means that the context model actually hurts performance compared to using only surface-form-to-resource probabilities.
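To make the baseline concrete, a prior-only disambiguator can be sketched as below: it picks the resource most often seen for a surface form and ignores context. This is an assumption about what the baseline does, not the project's actual code; `PriorBaseline`, its methods, and the precision definition (correct answers over answered mentions) are hypothetical.

```scala
object PriorBaseline {
  // counts(surfaceForm)(resource) = how often the surface form was annotated with that resource.
  // Returns the most frequent resource, i.e. the argmax of P(resource | surface form).
  def disambiguate(counts: Map[String, Map[String, Int]], surfaceForm: String): Option[String] =
    counts.get(surfaceForm).filter(_.nonEmpty).map(_.maxBy(_._2)._1)

  // Precision over answered mentions: correct answers / answers given.
  def precision(predicted: Seq[Option[String]], gold: Seq[String]): Double = {
    val answered = predicted.zip(gold).collect { case (Some(p), g) => p == g }
    if (answered.isEmpty) 0.0 else answered.count(x => x).toDouble / answered.size
  }
}
```

A context model only helps if it corrects more of the prior's mistakes than it introduces, which is the comparison the 0.81 versus 0.83 figures are making.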
Edit: Now also includes a new LinearChainCRFSpotter and a new spotter for pretrained factorie NER models: see PretrainedFactorieNerSpotter (e.g., SimpleNerSpotter, BilouConllNerSpotter)