This repository has been archived by the owner on Feb 19, 2020. It is now read-only.

Documentation and Knowledge base is so incredibly poor #18

Open
MarcusSjolin opened this issue Dec 20, 2014 · 4 comments

Comments

@MarcusSjolin

Can there be some more ways of gaining knowledge please?

At least documented code?

More examples?

The introduction clip on YouTube (https://www.youtube.com/watch?v=rpfVtRqQ4_o) explains that you can create your own features, but not how to use them. For this project to gain traction there needs to be more information. I would help write guides etc., but I just can't get going with it; there's nothing to reference.

/Marcus

@dlwh
Owner

dlwh commented Dec 21, 2014

Sorry about that. I agree the documentation is pretty shoddy. What would you like to be able to do?

Did you look at https://github.com/dlwh/epic-demo ?

@MarcusSjolin
Author

My biggest problem, I guess, is knowing what I can combine, what goes where, and how things integrate with each other.

I'd like to know how to implement a simple feature to use when going through a text.

I'd like to know how to use multiple custom ones.

I've seen the Epic demos, and they all work.

What do these represent?
preprocess?

  • Do something with the data before running something on it, but what can be achieved here?

slab?

  • A data source that you can do something with?

models?

  • Reference to a set of features that can pick out certain things in a text? (pre-built ones are language feature detectors?)

parser?

  • Something that goes through the text to work out what is necessary?

trees?

  • A representation of what words are, like noun and after that there's a verb etc?

sequences?

  • Segment data to pick up if it is a set of two words or one?

I think it would be much easier to get started if some of these concepts were explained: why they are there, and what I can do with them. If I'm looking for a certain feature, where should I look?

Might be a lot to answer, but I do think you've got something useful here and I'd like to see it developed further!

/Marcus

@dlwh
Owner

dlwh commented Dec 22, 2014

Thanks. That is helpful.

At the moment, the internals of Epic (making features, etc.) are mostly targeted at people with a good bit of NLP/ML expertise. Really, some of the external bits are too. I would like to make it more friendly, but it's a long way from that, obviously.

On Sun, Dec 21, 2014 at 3:16 PM, Marcus Sjölin notifications@github.com wrote:

My biggest problem, I guess, is knowing what I can combine, what goes where, and how things integrate with each other.

I'd like to know how to implement a simple feature to use when going through a text.

I'm not sure what you mean here?

I'd like to know how to use multiple custom ones.

Featurizers in Epic can be added together with the "+" operator to create composite featurizers. "Featurizers" turn a sentence into a set of features. I think you might have a misconception about what I mean by features (it's the standard ML terminology): a feature is a property of (part of) an input data point (like a sentence) that can be used to predict the appropriate output.
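
To make the "+" part concrete, here is a rough sketch of composing two word-level featurizers from epic.features. Treat the class names and constructor arguments below as my best recollection rather than a guarantee; check the scaladoc for the exact signatures.

    import breeze.linalg.Counter
    import epic.features.{WordShapeFeaturizer, WordPropertyFeaturizer}

    // Counts of each word in your training data (a toy counter here).
    // Featurizers typically use these counts to decide which words are rare.
    val wordCounts = Counter("the" -> 2.0, "dog" -> 1.0, "barked" -> 1.0)

    // Two word-level featurizers; the constructor arguments are my best guess.
    val shapes     = new WordShapeFeaturizer(wordCounts)
    val properties = new WordPropertyFeaturizer(wordCounts)

    // "+" builds a composite featurizer that emits the union of both
    // featurizers' features at every word position.
    val featurizer = shapes + properties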

I've seen the Epic demos, and they all work.

What do these represent?
preprocess?

  • Do something with the data before running something on it, but what can be achieved here?

preprocess can:

  1. segment sentences
    val segmenter = MLSentenceSegmenter.bundled().get
    segmenter.segment(text)

  2. Tokenize sentences into words and punctuation.
    epic.preprocess.tokenize(sentence)

  3. Do both at once (epic.preprocess.preprocess) as demonstrated in the demo.

  4. Extract content from arbitrary files or URLs using Apache Tika
    (epic.extractText(url))
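
Stitched together, a minimal end-to-end preprocessing run might look roughly like this. I'm assuming segment returns the sentences as strings and tokenize returns one sentence's tokens; double-check the actual return types in the scaladoc.

    import epic.preprocess.MLSentenceSegmenter

    val text = "Epic is an NLP library written in Scala. It ships with pre-built models."

    // 1. Split the raw text into sentences.
    val segmenter = MLSentenceSegmenter.bundled().get
    val sentences = segmenter.segment(text)

    // 2. Tokenize each sentence into words and punctuation.
    val tokenized = sentences.map(epic.preprocess.tokenize)

    // Or do steps 1 and 2 in one call, as in the demo:
    // val tokenized = epic.preprocess.preprocess(text)

    tokenized.foreach(println)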

slab?

  • A data source that you can do something with?

Slabs hold annotations (parse trees, named entities, etc.) for a text in a uniform way. We're actually reworking them, so don't put a lot of effort into learning them.

models?

  • Reference to a set of features that can pick out certain things in a text? (pre-built ones are language feature detectors?)

Something like that. Models refer to the result of a machine learning algorithm: a featurizer, some weights, and a dynamic program that can build structures over a text. (I overload terminology and sometimes use "model" to mean everything except the weights.)

parser?

  • Something that goes through the text to work out what is necessary?

Parsers produce parse trees, as below.
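
For example, with one of the pre-built English parser model jars on the classpath, usage looks roughly like this. The ParserSelector loader below is how I remember the model jars being exposed; if it doesn't compile, the epic-demo project shows the current incantation.

    import epic.models.ParserSelector

    // Load a pre-built English parser (requires the corresponding model jar).
    // The exact loader name/package is my best recollection, not a guarantee.
    val parser = ParserSelector.loadParser("en").get

    // Parsers take an already-tokenized sentence and return a parse tree.
    val sentence = IndexedSeq("The", "dog", "chased", "the", "cat", ".")
    val tree = parser(sentence)

    println(tree)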

trees?

  • A representation of what words are, like noun and after that there's a verb etc?

That and how the words are related to one another: what are the noun phrases in a sentence, what verb has what object, etc. http://en.wikipedia.org/wiki/Parse_tree

If you didn't know what these were going in, they will probably not be useful to you. I'm working in the background on a format that's more useful to laymen, but it will be some time.

sequences?

  • Segment data to pick up if it is a set of two words or one?

There are two kinds of predictions under sequences: models that assign a label to every word (e.g. part-of-speech tags like noun, verb, etc.), and models that assign a label to disjoint contiguous sequences of words (e.g. which phrases are people, places, or things).
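
A sketch of both kinds, again assuming the pre-built English model jars and the loader names I remember (PosTagSelector / NerSelector); treat the exact calls as assumptions rather than documentation.

    import epic.models.{PosTagSelector, NerSelector}

    val tokens = IndexedSeq("Marcus", "asked", "David", "about", "Epic", ".")

    // Per-word labels: a part-of-speech tagger assigns one tag to every token.
    val tagger = PosTagSelector.loadTagger("en").get
    val tags = tagger.bestSequence(tokens)  // method name is my best guess

    // Span labels: a named-entity recognizer marks which contiguous chunks
    // of the sentence are people, places, organizations, etc.
    val ner = NerSelector.loadNer("en").get
    val entities = ner.bestSequence(tokens) // likewise an assumption

    println(tags)
    println(entities)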

I think it would be much easier to get started if some of these concepts were explained: why they are there, and what I can do with them. If I'm looking for a certain feature, where should I look?

Might be a lot to answer, but I do think you've got something useful here and I'd like to see it developed further!

/Marcus



@MarcusSjolin
Author

Thanks! That was really helpful; I think these answers were what I needed to grasp how things are connected. I now see more clearly how the process from input to output should be formed and what I can use in between. Thanks a lot!

Good going with the library as well; there seems to be a lot of work put into this.

/Marcus
