data/acl_2017/train/reviews/318.json

{"reviews": [{"IMPACT": "3", "SUBSTANCE": "4", "APPROPRIATENESS": "5", "MEANINGFUL_COMPARISON": "3", "PRESENTATION_FORMAT": "Oral Presentation", "comments": "This work showed that word representation learning can benefit from sememes\nwhen used in an appropriate attention scheme. The authors hypothesized that sememes\ncan act as an essential regularizer for WRL and WSI tasks and proposed the SE-WRL\nmodel, which detects word senses and learns representations simultaneously.\nThough experimental results indicate that WRL benefits, exact gains for WSI are\nunclear since only a qualitative case study of a couple of examples has been\ndone. Overall, the paper is well-written and well-structured.\n\nIn the last paragraph of the introduction section, the authors tried to state three\ncontributions of this work. (1) and (2) are more novelties of the work\nrather than contributions. I see the main contribution of the work to be the\nresults which show that we can learn better word representations (unsure about\nWSI) by modeling sememe information than other competitive baselines. (3) is\nneither a contribution nor a novelty.\n\nThe three strategies tried for SE-WRL modeling make sense and can be\nintuitively ranked in terms of how well they will work. The authors did a good job\nexplaining that, and experimental results supported the intuition, but the\nreviewer also sees MST as a fourth strategy rather than a baseline inspired by\nChen et al. 2014 (many WSI systems assume one sense per word given a context).\nMST often performed better than SSA and SAC. Unless the authors failed to\nclarify otherwise, MST seems to be exactly like SAT, with the difference that\nthe target word is represented by the most probable sense rather than by an\nattention-weighted average over all its senses. 
MST is still an attention-based\nscheme where the sense with the maximum attention weight is chosen, though it has not\nbeen clearly stated whether the target word is represented by the chosen sense embedding\nor some function of it.\n\nThe authors did not explain the selection of datasets for the training and evaluation\ntasks. The reference page for the Sogou-T text corpus did not help, as the reviewer does not\nknow Chinese. It was unclear which exact dataset was used, as there are\nseveral datasets mentioned on that page. Why were two word similarity datasets\nused, and how do they differ (e.g., does one have more rare words than\nthe other), given that different models performed differently on these datasets? The\nchoice of these datasets did not allow evaluating against results of other\nworks, which makes the reviewer wonder about the next question.\n\nAre the proposed SAT model results state of the art for Chinese word similarity? \nE.g. Schnabel et al. (2015) report a score of 0.640 on WordSim-353 data by\nusing CBOW word embeddings.\n\nThe reviewer needs clarification on some model parameters, such as vocabulary sizes for\nwords (does Sogou-T contain 2.7 billion unique words?) and word senses (how\nmany word types from HowNet?). Because of the notation used, it is not clear whether\nembeddings for senses and sememes were shared across different words. The reviewer\nhopes that is the case, but then why were 200-dimensional embeddings used for\nonly 1889 sememes? It would be better if the complexity of the model parameters could\nalso be discussed.\n\nPerhaps due to lack of space, the discussion of experimental results lacks insight into\nobservations other than SAT performing the best. 
Also, the authors claimed that\nwords with lower frequency were learned better with sememes without evaluating\non a rare-words dataset.\n\nI have read the authors' response.", "SOUNDNESS_CORRECTNESS": "5", "ORIGINALITY": "5", "is_meta_review": null, "RECOMMENDATION": "4", "CLARITY": "5", "REVIEWER_CONFIDENCE": "5"}, {"IMPACT": "3", "SUBSTANCE": "3", "APPROPRIATENESS": "5", "MEANINGFUL_COMPARISON": "3", "PRESENTATION_FORMAT": "Poster", "comments": "- Strengths:\n\nThis paper proposes the use of HowNet to enrich embeddings. The idea is\ninteresting and gives good results.\n\n- Weaknesses:\nThe paper is interesting, but I am not sure the contribution is important enough\nfor a long paper. Also, the comparison with other works may not be fair:\nthe authors should compare to other systems that use manually developed resources.\n\nThe paper is understandable, but the English would benefit from some improvement.\n\n- General Discussion:", "SOUNDNESS_CORRECTNESS": "5", "ORIGINALITY": "5", "is_meta_review": null, "RECOMMENDATION": "3", "CLARITY": "4", "REVIEWER_CONFIDENCE": "3"}, {"IMPACT": "3", "SUBSTANCE": "4", "APPROPRIATENESS": "5", "MEANINGFUL_COMPARISON": "3", "PRESENTATION_FORMAT": "Oral Presentation", "comments": "- Strengths:\n\n1. The proposed models are shown to lead to rather substantial and consistent\nimprovements over reasonable baselines on two different tasks (word similarity\nand word analogy), which not only serves to demonstrate the effectiveness of\nthe models but also highlights the potential utility of incorporating sememe\ninformation from available knowledge resources for improving word\nrepresentation learning.\n2. The paper contributes to ongoing efforts in the community to account for\npolysemy in word representation learning. 
It builds nicely on previous work and\nproposes some new ideas and improvements that could be of interest to the\ncommunity, such as applying an attention scheme to incorporate a form of soft\nword sense disambiguation into the learning procedure.\n\n- Weaknesses:\n\n1. Presentation and clarity: important details with respect to the proposed\nmodels are left out or poorly described (more details below). Otherwise, the\npaper generally reads fairly well; however, the manuscript would need to be\nimproved if accepted.\n2. The evaluation on the word analogy task seems a bit unfair given that the\nsemantic relations are explicitly encoded by the sememes, as the authors\nthemselves point out (more details below).\n\n- General Discussion:\n\n1. The authors stress the importance of accounting for polysemy and learning\nsense-specific representations. While polysemy is taken into account by\ncalculating sense distributions for words in particular contexts in the\nlearning procedure, the evaluation tasks are entirely context-independent,\nwhich means that, ultimately, there is only one vector per word -- or at least\nthis is what is evaluated. Instead, word sense disambiguation and sememe\ninformation are used for improving the learning of word representations. This\nneeds to be clarified in the paper.\n2. It is not clear how the sememe embeddings are learned and the description of\nthe SSA model seems to assume the pre-existence of sememe embeddings. This is\nimportant for understanding the subsequent models. Do the SAC and SAT models\nrequire pre-training of sememe embeddings?\n3. It is unclear how the proposed models compare to models that only consider\ndifferent senses but not sememes. Perhaps the MST baseline is an example of\nsuch a model? If so, this is not sufficiently described (emphasis is instead\nput on soft vs. hard word sense disambiguation). The paper would be stronger\nwith the inclusion of more baselines based on related work.\n4. 
A reasonable argument is made that the proposed models are particularly\nuseful for learning representations for low-frequency words (by mapping words\nto a smaller set of sememes that are shared by sets of words). Unfortunately,\nno empirical evidence is provided to test the hypothesis. It would have been\ninteresting for the authors to look deeper into this. This aspect also does not\nseem to explain the improvements much since, e.g., the word similarity data\nsets contain frequent word pairs.\n5. Related to the above point, the improvement gains seem more attributable to\nthe incorporation of sememe information than word sense disambiguation in the\nlearning procedure. As mentioned earlier, the evaluation involves only the use\nof context-independent word representations. Even if the method allows for\nlearning sememe- and sense-specific representations, they would have to be\naggregated to carry out the evaluation task.\n6. The example illustrating HowNet (Figure 1) is not entirely clear, especially\nthe modifiers of \"computer\".\n7. It says that the models are trained using their best parameters. How exactly\nare these determined? It is also unclear how K is set -- is it optimized for\neach model or is it randomly chosen for each target word observation? Finally,\nwhat is the motivation for setting K' to 2?", "SOUNDNESS_CORRECTNESS": "5", "ORIGINALITY": "5", "is_meta_review": null, "RECOMMENDATION": "4", "CLARITY": "2", "REVIEWER_CONFIDENCE": "4"}], "abstract": "Sememes are minimum semantic units of word meanings, and the meaning of each word sense is typically composed by several sememes. Since sememes are not explicit for each word, people manually annotate word sememes and form linguistic common-sense knowledge bases. In this paper, we present that, word sememe information can improve word representation learning (WRL), which maps words into a low-dimensional semantic space and serves as a fundamental step for many NLP tasks. 
The key idea is to utilize word sememes to capture exact meanings of a word within specific contexts accurately. More specifically, we follow the framework of Skip-gram and present three sememe-encoded models to learn representations of sememes, senses and words, where we apply the attention scheme to detect word senses in various contexts. We conduct experiments on two tasks including word similarity and word analogy, and our models significantly outperform baselines. The results indicate that WRL can benefit from sememes via the attention scheme, and also confirm our models being capable of correctly modeling sememe information.", "histories": [], "id": "318", "title": "Improved Word Representation Learning with Sememes"}