T2T batching #786

varisd · 2019-01-30T12:20:08Z

Includes the batching scheme methods taken from:
https://github.com/tensorflow/tensor2tensor/blob/415585f40d9f21c56df7bda35033bc915d82321e/tensor2tensor/utils/data_reader.py

jindrahelcl · 2019-02-22T13:47:54Z

neuralmonkey/readers/string_vector_reader.py

@@ -13,7 +13,7 @@ def process_line(line: str, lineno: int, path: str) -> np.ndarray:

        return np.array(numbers, dtype=dtype)

-    def reader(files: List[str])-> Iterable[List[np.ndarray]]:
+    def reader(files: List[str]) -> Iterable[List[np.ndarray]]:


This is unrelated to the change; it will only introduce a conflict into the branch with the tf dataset.

But if it won't pass Travis otherwise, leave it here.

jindrahelcl · 2019-02-22T13:48:33Z

tests/hier-multiattention.ini

@@ -4,6 +4,7 @@ tf_manager=<tf_manager>
 output="tests/outputs/hier-multiattention"
 overwrite_output_dir=True
 epochs=1
+batch_size=1


Batch size should not be mandatory just because there is a workaround somewhere.

jindrahelcl · 2019-02-22T13:49:21Z

neuralmonkey/learning_utils.py

@@ -85,6 +85,9 @@ def training_loop(cfg: Namespace) -> None:
                    trainer_result = cfg.tf_manager.execute(
                        batch, feedables, cfg.trainers, train=True,
                        summaries=True)
+                    # workaround: we need to use validation batching scheme
+                    #             during evaluation
+                    batch.batching = BatchingScheme(batch_size=cfg.batch_size)


This is not a validation batching scheme. Drop this change; it already works correctly in my refactor, and this would needlessly introduce a conflict.

jindrahelcl · 2019-02-22T13:49:46Z

neuralmonkey/dataset.py

+            batch sizes and sequence length tolerance.
+        min_length: int, sequences shorter than this will be skipped.
+    Return:
+         A dictionary with parameters that can be passed to input_pipeline:


This is not true.

jindrahelcl · 2019-02-22T13:50:55Z

neuralmonkey/dataset.py

@@ -95,6 +95,84 @@ def __init__(self,
 # pylint: enable=too-few-public-methods


+def _bucket_boundaries(max_length, min_length=8, length_bucket_step=1.1):
+    """Create a default set of length-bucket boundaries."""


I would add an example of the input and output; I don't really understand why length_bucket_step is a float.

jindrahelcl · 2019-02-22T13:51:02Z

neuralmonkey/dataset.py

@@ -95,6 +95,84 @@ def __init__(self,
 # pylint: enable=too-few-public-methods


+def _bucket_boundaries(max_length, min_length=8, length_bucket_step=1.1):


Type annotations are missing.

jindrahelcl · 2019-02-22T13:52:42Z

neuralmonkey/dataset.py

+    max_length = max_length or batch_size
+    if max_length < min_length:
+        raise ValueError("max_length must be greater or equal to min_length")
+


This should check that length_bucket_step is > 1.0 and raise a ValueError with a message, rather than leaving it to an assert in the helper function.
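A sketch of what the requested check could look like. The function name `check_batching_args` and its signature are hypothetical; only the `max_length` fallback and the first error message come from the diff above.

```python
from typing import Optional


def check_batching_args(batch_size: int,
                        max_length: Optional[int] = None,
                        min_length: int = 8,
                        length_bucket_step: float = 1.1) -> int:
    """Validate batching arguments up front (hypothetical helper).

    Returns the effective max_length so the caller can keep using it.
    """
    max_length = max_length or batch_size
    if max_length < min_length:
        raise ValueError("max_length must be greater or equal to min_length")
    # Fail early with a readable message instead of relying on an
    # assert inside the bucket-boundary helper.
    if length_bucket_step <= 1.0:
        raise ValueError(
            "length_bucket_step must be greater than 1.0, got {}"
            .format(length_bucket_step))
    return max_length
```

Raising ValueError here keeps the failure close to the user-facing call, and it still fires when Python runs with assertions disabled.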

jindrahelcl · 2019-02-22T13:57:36Z

Re the workaround: it already has its own pull request, so why is it here as well?

varisd · 2019-02-25T12:07:51Z

Workaround == it is broken (meaning it crashes in normal scenarios), so I need a quick fix in order to work on other things.

Most of these changes depend on each other; on the other hand, they can be split semantically, which is what I did with the pull requests. Next time I can just make one big PR and we won't have to deal with dependencies.

jlibovicky · 2019-03-18T16:47:42Z

Do I understand correctly that this needs to be merged first? What exactly is it stuck on?

varisd · 2019-03-18T17:43:47Z

> Do I understand correctly that this needs to be merged first? What exactly is it stuck on?

Yes, because this PR brings a humane way of creating schemes for bucketed token-level batching (which is de facto essential for transformers).

The documentation in the dataset.* methods lifted from t2t needs to be fixed (and it should state that we took them from there). The annotations also need to be added... As I said, it was practically copy-paste, so that I would not have to compute bucket_batch_sizes and bucket_boundaries by hand every time.

Of course the other PRs should work without this one too, but you will have to rebase them :)
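To illustrate the hand computation this PR is meant to avoid, here is a simplified sketch of deriving per-bucket batch sizes from a token budget. It assumes `batch_size` is counted in tokens and that each bucket holds as many sequences as fit at its upper length bound. The function name and signature are hypothetical, and any further adjustments made by the tensor2tensor code are omitted.

```python
from typing import List


def bucket_batch_sizes(boundaries: List[int],
                       max_length: int,
                       batch_size: int) -> List[int]:
    """Sequences per batch for each length bucket (simplified sketch).

    Assumption: batch_size is a token budget, so longer buckets get
    proportionally fewer sequences.
    """
    # One bucket per boundary, plus a final bucket up to max_length.
    upper_bounds = boundaries + [max_length]
    return [max(1, batch_size // length) for length in upper_bounds]


# Illustrative values: a budget of 4096 tokens per batch.
print(bucket_batch_sizes([8, 16, 32], max_length=64, batch_size=4096))
```

For these illustrative values the result is [512, 256, 128, 64]: every bucket carries roughly the same number of tokens regardless of sequence length.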

jindrahelcl · 2019-05-09T13:04:38Z

So is this part of #802? If so, please close it.

varisd · 2019-05-09T13:37:01Z

It is not. I rebased it incorrectly.

varisd added 2 commits January 9, 2019 16:29

workaround for train_set batching during inference time

299c1bc

added batching schemes from tensor2tensor

7a62312

varisd mentioned this pull request Feb 6, 2019

added simplified BERT support #791

Closed

fixing failed travis tests

1d968b5

varisd force-pushed the t2t_batching branch from a97affc to 1d968b5 Compare February 6, 2019 15:42

jindrahelcl requested changes Feb 22, 2019


jindrahelcl mentioned this pull request Mar 20, 2019

Attentive interface #802

Open
