Added support for parallel executions. Also fixed a bug in enhanced semantic roles. #422

Draft
wants to merge 24 commits into master

Conversation

MihaiSurdeanu
Contributor

No description provided.

@MihaiSurdeanu
Contributor Author

@kwalcock: can you please take a look at this PR? I tried to implement transparent support for parallel executions. You may start with the Metal.getInferenceModel() method, and double check the Layers.clone() method next.
Thank you!

@kwalcock
Member

Thank you for the use case. I keep updating my example with your good ideas, and the documentation is still catching up. In a later version of fatdynet than the one you probably have, one should be able to do

  def cloneBuilder(builder: RnnBuilder): RnnBuilder = {
    val newBuilder = builder.clone()
    newBuilder.newGraph()
    newBuilder
  }

The newBuilder.newGraph() call is needed, or else a runtime exception is thrown. I'm still working on the theory as to why. The code looks good. I will try out Metal.main as soon as I can.
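For what it's worth, here is a minimal sketch of how a per-thread clone might be used; sharedBuilder, sentences, and predict are hypothetical stand-ins, not names from this PR:

  // Sketch only: each worker gets its own cloned builder so that no two
  // threads share the same underlying DyNet builder state.
  val sharedBuilder: RnnBuilder = ???               // the model's builder
  val sentences: Seq[String] = ???                  // the inputs to process
  def predict(s: String, b: RnnBuilder): Unit = ??? // stand-in for inference
  sentences.par.foreach { s => predict(s, cloneBuilder(sharedBuilder)) }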

@MihaiSurdeanu
Contributor Author

Thanks @kwalcock!
Some unit tests are failing here. Can you please look into this?

@kwalcock
Member

Since this depends on a locally published snapshot, it's not going to work on Jenkins. I will test locally for now. So far I've made just one change to the C++ code that would necessitate recompiling. Things can't be removed from Maven Central, so I'm not really enthusiastic about sending it (fatdynet) there yet.

@kwalcock
Member

The tests do still pass locally. If I understand correctly, the parallel ability is not yet tested.

@MihaiSurdeanu
Contributor Author

Not yet. A test of the parallel execution would be awesome :)

@kwalcock
Member

In cases where clone calls copy without changing any of the values, like

override def clone(): GreedyForwardLayer = copy()

one could just return the original:

override def clone(): GreedyForwardLayer = this

It would save some memory.

@kwalcock
Member

kwalcock commented Sep 16, 2020

Someone left helpful hints in the C++ code that explain how newGraph fits in:

// CURRENT STATE | ACTION              | NEXT STATE
// --------------+---------------------+-----------------
// CREATED       | new_graph           | GRAPH_READY
// GRAPH_READY   | start_new_sequence  | READING_INPUT
// READING_INPUT | add_input           | READING_INPUT
// READING_INPUT | start_new_sequence  | READING_INPUT
// READING_INPUT | new_graph           | GRAPH_READY

Not following these rules results in messages like

State transition error: currently in state 0 but received operation 1
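As a hedged illustration of that call order, using names from the public DyNet Scala bindings rather than this PR's code:

  import edu.cmu.dynet._

  Initialize.initialize()                            // assumed default initialization
  val model   = new ParameterCollection()
  val builder = new LstmBuilder(1, 10, 20, model)    // state: CREATED
  val p       = model.addParameters(Dim(10))

  ComputationGraph.renew()
  builder.newGraph()                                 // CREATED       -> GRAPH_READY
  builder.startNewSequence()                         // GRAPH_READY   -> READING_INPUT
  val h = builder.addInput(Expression.parameter(p))  // READING_INPUT -> READING_INPUT
  builder.startNewSequence()                         // next sequence, same graph
  ComputationGraph.renew()                           // after renewing the graph,
  builder.newGraph()                                 // newGraph() is required again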

@MihaiSurdeanu
Contributor Author

Thanks for catching the clone() issue. I pushed the fix.

@kwalcock
Member

ViterbiForwardLayer is another opportunity. That's the only other one I noticed.

@MihaiSurdeanu
Contributor Author

Another good catch. Fixed and pushed. Thanks!

@kwalcock
Member

For the ParallelProcessorExample, I get

Serial: 57 seconds
Parallel: 17 seconds

on 8 threads.

@MihaiSurdeanu
Contributor Author

MihaiSurdeanu commented Sep 16, 2020 via email

@kwalcock
Member

kwalcock commented Sep 16, 2020

Thanks for the reminder. I was just curious about whether it was working or not. Here are the times for the different processors on a collection of just 10 documents. Each time the serial and parallel versions got the same answer, but the output of the various processors always differed:

Processor                         | Threading | Time (s)
----------------------------------|-----------|---------
CluProcessor                      | Serial    | 57
CluProcessor                      | Parallel  | 17
FastNLPProcessor                  | Serial    | 31
FastNLPProcessor                  | Parallel  | 13
FastNLPProcessorWithSemanticRoles | Serial    | 52
FastNLPProcessorWithSemanticRoles | Parallel  | 20

@MihaiSurdeanu
Contributor Author

Thanks! I think the correct comparison is between CluProcessor and FastNLPProcessorWithSemanticRoles. Good to see the times are about the same.

@MihaiSurdeanu
Contributor Author

So, are we ready to release fatdynet and then merge this into master for processors?

@MihaiSurdeanu
Contributor Author

I checked all these methods, and none are synchronized. I do use one synchronized block in ConstEmbeddingsGlove.apply(), but that one is called just once.

I have an idea on how to debug this issue, based on the observation that if we sum up the runtimes of the CluProcessor methods across threads during the parallel execution, the sum should not be higher than the sequential times. If it is, it means the threads are blocking each other somewhere, and we have to dig deeper (see the sketch at the end of this comment).
In CluProcessor, the key methods are:
tagSentence()
nerSentence()
parseSentence()
srlSentence()

Thanks!!
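A minimal sketch of that measurement, assuming a hypothetical timed helper wrapped around each of the four methods (none of these names are in the codebase):

  import java.util.concurrent.atomic.AtomicLong
  import scala.collection.concurrent.TrieMap

  // Accumulate per-method wall-clock time across all threads.
  val totals = TrieMap.empty[String, AtomicLong]

  def timed[T](name: String)(block: => T): T = {
    val start  = System.nanoTime()
    val result = block
    totals.getOrElseUpdate(name, new AtomicLong).addAndGet(System.nanoTime() - start)
    result
  }

  // Inside CluProcessor one would wrap the calls, e.g.
  //   timed("tagSentence") { tagSentence(sentence) }
  // and likewise for nerSentence(), parseSentence(), and srlSentence().
  // If the parallel totals are much higher than the sequential ones,
  // the threads are blocking each other somewhere.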

@kwalcock
Member

I uploaded some Flight Recorder information, but so far it has not been especially helpful to me: https://drive.google.com/file/d/1oviO6z2RgtjDPa5tRYFLOEDrMHVErSoD/view?usp=sharing
This is for FastNLPProcessorWithSemanticRoles in the parallel branch with just 10 files of varying sizes, so it is only for a short time that 8 files can be processed in parallel. After that, individual threads don't have anything to do and the efficiency goes down. With bad luck, the last thread processes some really big file and the other 7 are idle for a long time. The most recent timings were for 52 files in an attempt to avoid that phenomenon. It would help to sort the files by size and process them from longest to shortest, and I'll try that, but it's not what I was hoping for. For testing I could even process the exact same file in different threads to probe the limits of parallelization.
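The longest-first scheduling mentioned above could look roughly like this; process is a stand-in for the real per-file call:

  import java.io.File

  def process(file: File): Unit = ???   // stand-in for the real per-file work
  val files: Seq[File] = ???            // the input documents

  // Biggest files first, so one large file at the end of the list cannot
  // leave the other threads idle while it finishes alone. (.par is built
  // into Scala 2.12; on 2.13+ it needs the scala-parallel-collections module.)
  files.sortBy(-_.length()).par.foreach(process)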

@MihaiSurdeanu
Contributor Author

Hmm, good point. Maybe we should try 8 files of the same size, so we know that the maximum speedup is 8? If these files are dominated by a big one, parallelism is limited...

@kwalcock
Member

FWIW I'm reading up on Scala parallelization at places like https://docs.scala-lang.org/overviews/parallel-collections/overview.html and https://docs.scala-lang.org/overviews/parallel-collections/performance.html#how-big-should-a-collection-be-to-go-parallel. So far I've been using the easiest thing that works and compiles, nothing chosen for measured effectiveness.
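One knob from those pages worth noting: the thread pool behind .par can be set explicitly, so the measurements are not at the mercy of the default pool size. A sketch, assuming Scala 2.12-style parallel collections:

  import java.util.concurrent.ForkJoinPool
  import scala.collection.parallel.ForkJoinTaskSupport

  val items: Seq[String] = ???   // whatever is being processed
  val parItems = items.par
  // Pin the parallelism level (here 8) instead of relying on the default pool.
  parItems.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
  parItems.foreach(println)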

@MihaiSurdeanu
Contributor Author

I suspect you called .par on a list of files, which is OK. But because the whole collection is joined at the end, all the short jobs have to wait on the big one (if there is one).

@MihaiSurdeanu
Contributor Author

stating the obvious here :)

@kwalcock
Member

I have all sorts of numbers, but have not found an explanation or workaround for this kind of limitation. I've almost decided it's something like memory contention or garbage collection overhead.

[graph: measured parallel speedup for the processors test program, topping out around 6x]

@MihaiSurdeanu
Contributor Author

Thanks!

Have you tried using the same number of threads, but increasing the amount of RAM available to DyNet?
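If fatdynet passes the standard DyNet options through at initialization (an assumption; the exact initializer and key names may differ), enlarging the pre-allocated memory pool might look like:

  import edu.cmu.dynet.Initialize

  // Assumption: "dynet-mem" is honored here and is the size of DyNet's
  // pre-allocated memory pool in MB.
  Initialize.initialize(Map("dynet-mem" -> "4096"))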

@kwalcock
Member

No. One thing that had occurred to me is that the memory in DyNet could be fragmenting so that malloc takes longer to find an appropriate space and free takes longer to combine regions. I doubt that's the problem, though. I'll see if more memory can bend the line up.

@MihaiSurdeanu
Contributor Author

@kwalcock: In any case, should we merge this into master? I am not sure we should wait.

@kwalcock
Member

kwalcock commented Oct 2, 2020

While observing execution related to that crash issue, I noticed in the IntelliJ debugger a lot of threads in the MONITOR state. Something isn't quite right, and it seems to be a clue.

@MihaiSurdeanu
Contributor Author

MihaiSurdeanu commented Oct 2, 2020 via email

@kwalcock
Member

kwalcock commented Oct 2, 2020

It looks like a false alarm. It went away on the next run. I think something was stale. Still making sure.

@kwalcock
Member

kwalcock commented Oct 6, 2020

I've been trying to run this in an Ubuntu 18 virtual machine with 16 GB of memory. It will run serially but not in parallel. The error message seems almost unrelated, but it must be some side effect of the memory shortage.

[error] (run-main-4) java.lang.RuntimeException: ERROR: parameter glove.matrixResourceName must be defined!
[error] java.lang.RuntimeException: ERROR: parameter glove.matrixResourceName must be defined!

This doesn't look familiar, does it?
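In case it is a config-loading side effect, here is a quick hedged check (assuming processors reads this key through Typesafe Config; the path is taken from the error message):

  import com.typesafe.config.ConfigFactory

  val config = ConfigFactory.load()
  // Verify that the key the error complains about is actually visible
  // to the running process.
  if (config.hasPath("glove.matrixResourceName"))
    println(config.getString("glove.matrixResourceName"))
  else
    println("glove.matrixResourceName is not defined in the loaded config")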

@MihaiSurdeanu
Contributor Author

@kwalcock
Member

kwalcock commented Oct 6, 2020

The test program I'm running in processors seems to max out at a 6x speedup or so (graph above). I wanted to make sure that isn't some general limit based on something like the memory bus bandwidth in the servers. This memory-intensive C++ program was able to get 20x. I'm now suspicious of the malloc that is part of DyNet's forward pass and am trying to get a profiler to show some insight.

[graph: roughly 20x speedup for a memory-intensive C++ test program]

@MihaiSurdeanu
Contributor Author

This is encouraging! Are you running the C++ and Scala tests on the same machine?

@kwalcock
Member

kwalcock commented Oct 6, 2020

The two graphs above are both from Jenny. I'm not able to do the C-level profiling there, though, and that's the reason for yesterday's failed attempt to get processors to run locally on Ubuntu 18. It was probably just something I ate/installed. Processors just started working on Ubuntu 20, though, so I'll see if the running dynet can explain itself. It wasn't anything obvious to me at the Java level.

@kwalcock
Member

BTW the graph above showing the 20x speedup is for a C++ program unrelated to DyNet, and that was probably responsible for the confusion this morning. I was about to compare it to the speedup of a C++ DyNet program and may see something closer to the Java 6x speedup than to this 20x. That would mean that the problem isn't with Java but is already there in the C++ version, possibly in the synchronized memory allocation. Details would be investigated initially with perf tools, which aren't installed on the servers but are running locally now.

@MihaiSurdeanu
Contributor Author

MihaiSurdeanu commented Oct 12, 2020 via email

@kwalcock
Member

This PR will be abandoned when it is finally replaced by updated parallel processing code.

@kwalcock
Member

@MihaiSurdeanu, there are graphs for TorchScript multi-threaded performance at kwalcock/torchscript#1.
