Added support for parallel executions. Also fixed a bug in enhanced semantic roles. #422

Draft
wants to merge 24 commits into master

Conversation

MihaiSurdeanu
Contributor

No description provided.

@MihaiSurdeanu
Contributor Author

@kwalcock: can you please take a look at this PR? I tried to implement transparent support for parallel executions. You may start with the Metal.getInferenceModel() method, and double check the Layers.clone() method next.
Thank you!

@kwalcock
Member

Thank you for the use case. I keep updating my example with your good ideas, and the documentation is still catching up. In a later version of fatdynet than the one you probably have, one should be able to do

  def cloneBuilder(builder: RnnBuilder): RnnBuilder = {
    val newBuilder = builder.clone()
    newBuilder.newGraph()
    newBuilder
  }

The newBuilder.newGraph() call is needed, or else a runtime exception is thrown. I'm still working on the theory as to why. The code looks good. I will try out Metal.main as soon as I can.
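For what it's worth, here is a minimal sketch of how a per-thread clone might be used; sharedBuilder, sentences, and predict are hypothetical stand-ins, not names from this PR:

  // Sketch only: each worker gets its own cloned builder so that no two
  // threads share the same underlying DyNet builder state.
  val sharedBuilder: RnnBuilder = ???               // the model's builder
  val sentences: Seq[String] = ???                  // the inputs to process
  def predict(s: String, b: RnnBuilder): Unit = ??? // stand-in for inference
  sentences.par.foreach { s => predict(s, cloneBuilder(sharedBuilder)) }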

@MihaiSurdeanu
Contributor Author

Thanks @kwalcock!
Some unit tests are failing here. Can you please look into this?

@kwalcock
Member

Since this depends on a locally published snapshot, it's not going to work on Jenkins. I will test locally for now. So far I've made just one change to the C++ code that would necessitate recompiling. Things can't be removed from Maven Central, so I'm not really enthusiastic about sending it (fatdynet) there yet.

@kwalcock
Member

The tests do still pass locally. If I understand correctly, the parallel ability is not yet tested.

@MihaiSurdeanu
Contributor Author

Not yet. A test of the parallel execution would be awesome :)

@kwalcock
Member

In cases where clone calls copy without changing any of the values, like

override def clone(): GreedyForwardLayer = copy()

one could just return the original:

override def clone(): GreedyForwardLayer = this

It would save some memory.

@kwalcock
Member

kwalcock commented Sep 16, 2020

Someone left helpful hints in the C++ code that explain how newGraph fits in:

// CURRENT STATE | ACTION              | NEXT STATE
// --------------+---------------------+-----------------
// CREATED       | new_graph           | GRAPH_READY
// GRAPH_READY   | start_new_sequence  | READING_INPUT
// READING_INPUT | add_input           | READING_INPUT
// READING_INPUT | start_new_sequence  | READING_INPUT
// READING_INPUT | new_graph           | GRAPH_READY

Not following these rules results in messages like

State transition error: currently in state 0 but received operation 1
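As a hedged illustration of that call order, using names from the public DyNet Scala bindings rather than this PR's code:

  import edu.cmu.dynet._

  Initialize.initialize()                            // assumed default initialization
  val model   = new ParameterCollection()
  val builder = new LstmBuilder(1, 10, 20, model)    // state: CREATED
  val p       = model.addParameters(Dim(10))

  ComputationGraph.renew()
  builder.newGraph()                                 // CREATED       -> GRAPH_READY
  builder.startNewSequence()                         // GRAPH_READY   -> READING_INPUT
  val h = builder.addInput(Expression.parameter(p))  // READING_INPUT -> READING_INPUT
  builder.startNewSequence()                         // next sequence, same graph
  ComputationGraph.renew()                           // after renewing the graph,
  builder.newGraph()                                 // newGraph() is required again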

@MihaiSurdeanu
Contributor Author

Thanks for catching the clone() issue. I pushed the fix.

@kwalcock
Member

ViterbiForwardLayer is another opportunity. That's the only other one I noticed.

@MihaiSurdeanu
Contributor Author

Another good catch. Fixed and pushed. Thanks!

@kwalcock
Member

For the ParallelProcessorExample, I get

Serial: 57 seconds
Parallel: 17 seconds

on 8 threads.

@MihaiSurdeanu
Contributor Author

MihaiSurdeanu commented Sep 16, 2020 via email

@kwalcock
Member

kwalcock commented Sep 16, 2020

Thanks for the reminder. I was just curious about whether it was working or not. Here are the times for the different processors on a collection of just 10 documents. Each time the serial and parallel versions got the same answer, but the output of the various processors always differed:

Processor                         | Threading | Time (s)
----------------------------------|-----------|---------
CluProcessor                      | Serial    | 57
CluProcessor                      | Parallel  | 17
FastNLPProcessor                  | Serial    | 31
FastNLPProcessor                  | Parallel  | 13
FastNLPProcessorWithSemanticRoles | Serial    | 52
FastNLPProcessorWithSemanticRoles | Parallel  | 20

@MihaiSurdeanu
Contributor Author

Thanks! I think the correct comparison is between CluProcessor and FastNLPProcessorWithSemanticRoles. Good to see the times are about the same.

@MihaiSurdeanu
Contributor Author

So, are we ready to release fatdynet and then merge this into master for processors?

@MihaiSurdeanu
Contributor Author

I checked all these methods, and none are synchronized. I do use one synchronized block in ConstEmbeddingsGlove.apply(), but that one is called just once.

I have an idea on how to debug this issue, based on the observation that if we sum up the runtimes of the CluProcessor methods across threads during the parallel execution, the sum should not be higher than the sequential times. If it is, it means the threads are blocking each other somewhere, and we have to dig deeper (see the sketch at the end of this comment).
In CluProcessor, the key methods are:
tagSentence()
nerSentence()
parseSentence()
srlSentence()

Thanks!!
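A minimal sketch of that measurement, assuming a hypothetical timed helper wrapped around each of the four methods (none of these names are in the codebase):

  import java.util.concurrent.atomic.AtomicLong
  import scala.collection.concurrent.TrieMap

  // Accumulate per-method wall-clock time across all threads.
  val totals = TrieMap.empty[String, AtomicLong]

  def timed[T](name: String)(block: => T): T = {
    val start  = System.nanoTime()
    val result = block
    totals.getOrElseUpdate(name, new AtomicLong).addAndGet(System.nanoTime() - start)
    result
  }

  // Inside CluProcessor one would wrap the calls, e.g.
  //   timed("tagSentence") { tagSentence(sentence) }
  // and likewise for nerSentence(), parseSentence(), and srlSentence().
  // If the parallel totals are much higher than the sequential ones,
  // the threads are blocking each other somewhere.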

@kwalcock
Member

I uploaded some Flight Recorder information, but so far it has not been especially helpful to me: https://drive.google.com/file/d/1oviO6z2RgtjDPa5tRYFLOEDrMHVErSoD/view?usp=sharing
This is for FastNLPProcessorWithSemanticRoles in the parallel branch with just 10 files of varying sizes, so it is only for a short time that 8 files can be processed in parallel. After that, individual threads don't have anything to do and the efficiency goes down. With bad luck, the last thread processes some really big file and the other 7 are idle for a long time. The most recent timings were for 52 files in an attempt to avoid that phenomenon. It would help to sort the files by size and process them from longest to shortest, and I'll try that, but it's not what I was hoping for. For testing I could even process the exact same file in different threads to probe the limits of parallelization.
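The longest-first scheduling mentioned above could look roughly like this; process is a stand-in for the real per-file call:

  import java.io.File

  def process(file: File): Unit = ???   // stand-in for the real per-file work
  val files: Seq[File] = ???            // the input documents

  // Biggest files first, so one large file at the end of the list cannot
  // leave the other threads idle while it finishes alone. (.par is built
  // into Scala 2.12; on 2.13+ it needs the scala-parallel-collections module.)
  files.sortBy(-_.length()).par.foreach(process)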

@MihaiSurdeanu
Contributor Author

Hmm, good point. Maybe we should try 8 files of the same size, so we know that the maximum speedup is 8? If these files are dominated by a big one, parallelism is limited...

@kwalcock
Member

FWIW I'm reading up on Scala parallelization at places like https://docs.scala-lang.org/overviews/parallel-collections/overview.html and https://docs.scala-lang.org/overviews/parallel-collections/performance.html#how-big-should-a-collection-be-to-go-parallel. So far I've been using the easiest thing that works and compiles, nothing chosen for measured effectiveness.
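One knob from those pages worth noting: the thread pool behind .par can be set explicitly, so the measurements are not at the mercy of the default pool size. A sketch, assuming Scala 2.12-style parallel collections:

  import java.util.concurrent.ForkJoinPool
  import scala.collection.parallel.ForkJoinTaskSupport

  val items: Seq[String] = ???   // whatever is being processed
  val parItems = items.par
  // Pin the parallelism level (here 8) instead of relying on the default pool.
  parItems.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
  parItems.foreach(println)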

@MihaiSurdeanu
Contributor Author

I suspect you called .par on a list of files, which is OK. But because the whole collection is joined at the end, all the short jobs have to wait on the big one (if there is one).

@MihaiSurdeanu
Contributor Author

stating the obvious here :)

@kwalcock
Member

I have all sorts of numbers, but have not found an explanation or workaround for this kind of limitation. I've almost decided it's something like memory contention or garbage collection overhead.

[graph: measured parallel speedup for the processors test program, topping out around 6x]

@MihaiSurdeanu
Contributor Author

Thanks!

Have you tried using the same number of threads, but increasing the amount of RAM available to DyNet?
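If fatdynet passes the standard DyNet options through at initialization (an assumption; the exact initializer and key names may differ), enlarging the pre-allocated memory pool might look like:

  import edu.cmu.dynet.Initialize

  // Assumption: "dynet-mem" is honored here and is the size of DyNet's
  // pre-allocated memory pool in MB.
  Initialize.initialize(Map("dynet-mem" -> "4096"))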

@kwalcock
Member

No. One thing that had occurred to me is that the memory in DyNet could be fragmenting so that malloc takes longer to find an appropriate space and free takes longer to combine regions. I doubt that's the problem, though. I'll see if more memory can bend the line up.

@MihaiSurdeanu
Contributor Author

@kwalcock: In any case, should we merge this into master? I am not sure we should wait.

@kwalcock
Member

kwalcock commented Oct 2, 2020

While observing execution related to that crash issue, I noticed in the IntelliJ debugger a lot of threads in the MONITOR state. Something isn't quite right, and it seems to be a clue.

@MihaiSurdeanu
Contributor Author

MihaiSurdeanu commented Oct 2, 2020 via email

@kwalcock
Member

kwalcock commented Oct 2, 2020

It looks like a false alarm. It went away on the next run. I think something was stale. Still making sure.

@kwalcock
Member

kwalcock commented Oct 6, 2020

I've been trying to run this in an Ubuntu 18 virtual machine with 16 GB of memory. It will run serially but not in parallel. The error message seems almost unrelated, but it must be some side effect of the memory shortage.

[error] (run-main-4) java.lang.RuntimeException: ERROR: parameter glove.matrixResourceName must be defined!
[error] java.lang.RuntimeException: ERROR: parameter glove.matrixResourceName must be defined!

This doesn't look familiar, does it?
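In case it is a config-loading side effect, here is a quick hedged check (assuming processors reads this key through Typesafe Config; the path is taken from the error message):

  import com.typesafe.config.ConfigFactory

  val config = ConfigFactory.load()
  // Verify that the key the error complains about is actually visible
  // to the running process.
  if (config.hasPath("glove.matrixResourceName"))
    println(config.getString("glove.matrixResourceName"))
  else
    println("glove.matrixResourceName is not defined in the loaded config")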

@MihaiSurdeanu
Contributor Author

@kwalcock
Member

kwalcock commented Oct 6, 2020

The test program I'm running in processors seems to max out at a 6x speedup or so (graph above). I wanted to make sure that isn't some general limit based on something like the memory bus bandwidth in the servers. This memory-intensive C++ program was able to get 20x. I'm now suspicious of the malloc that is part of DyNet's forward pass and am trying to get a profiler to show some insight.

[graph: roughly 20x speedup for a memory-intensive C++ test program]

@MihaiSurdeanu
Contributor Author

This is encouraging! Are you running the C++ and Scala tests on the same machine?

@kwalcock
Member

kwalcock commented Oct 6, 2020

The two graphs above are both from Jenny. I'm not able to do the C-level profiling there, though, and that's the reason for yesterday's failed attempt to get processors to run locally on Ubuntu 18. It was probably just something I ate/installed. Processors just started working on Ubuntu 20, though, so I'll see if the running dynet can explain itself. It wasn't anything obvious to me at the Java level.

@kwalcock
Member

BTW the graph above showing the 20x speedup is for a C++ program unrelated to DyNet, and that was probably responsible for the confusion this morning. I was about to compare it to the speedup of a C++ DyNet program and may see something closer to the Java 6x speedup than to this 20x. That would mean that the problem isn't with Java but is already there in the C++ version, possibly in the synchronized memory allocation. Details would be investigated initially with perf tools, which aren't installed on the servers but are running locally now.

@MihaiSurdeanu
Contributor Author

MihaiSurdeanu commented Oct 12, 2020 via email

@kwalcock
Member

This PR will be abandoned when it is finally replaced by updated parallel processing code.

@kwalcock
Member

@MihaiSurdeanu, there are graphs for TorchScript multi-threaded performance at kwalcock/torchscript#1.
