
Notice & Warning on lines 216, 217, 219 WordnetCorpus.php #72

Open
mehroz1 opened this issue Oct 5, 2021 · 4 comments

Comments


mehroz1 commented Oct 5, 2021

I am trying out your awesome library and found notices and warnings on lines 216, 217, and 219 of php-text-analysis/src/corpus/WordnetCorpus.php.

They occur when you call stem() with the MorphStemmer class, which uses the WordNet corpus:
$stemmedTokens = stem($top_keywords, \TextAnalysis\Stemmers\MorphStemmer::class);

yooper (Owner) commented Oct 5, 2021 via email

yooper (Owner) commented Oct 10, 2021

@mehroz1, can you please provide a test case for me to recreate the issue?

Thanks,

mehroz1 (Author) commented Oct 11, 2021

@yooper Here, run this test file. It also uses PHP-ML; you can remove those lines and provide your own tokens on line 60:

<?php
ini_set("memory_limit", "-1");
set_time_limit(0);
require_once __DIR__ . '/vendor/autoload.php';


use Phpml\Tokenization\WordTokenizer;
use Phpml\FeatureExtraction\StopWords\English;
use Phpml\FeatureExtraction\TfIdfTransformer;
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Preprocessing\Normalizer;
use TextAnalysis\Tokenizers\GeneralTokenizer;

function getWikipediaPage($page, $save_page=false) {
    global $data_set_dir;
    ini_set('user_agent', 'NlpToolsTest/1.0 (tests@php-nlp-tools.com)');
    if($save_page){
        file_put_contents($data_set_dir."/".$page."/".$page.".txt", file_get_contents("http://en.wikipedia.org/w/api.php?format=json&action=parse&page=".urlencode($page)));
    }
    $page = json_decode(file_get_contents("http://en.wikipedia.org/w/api.php?format=json&action=parse&page=".urlencode($page)),true);
    return preg_replace('/\s+/',' ',strip_tags($page['parse']['text']['*']));
}
function getDataFromFile($file_name = "./sample-data.txt", $processed = true){
    if($processed==true){
        $page = json_decode(file_get_contents("$file_name"),true);
        return preg_replace('/\s+/',' ',strip_tags($page['parse']['text']['*']));
    }else{
        return file_get_contents($file_name);
    }

}

global $page_name, $data_set_dir;
$page_name = "Aristotle";
$data_set_dir = "./data-sets"; # without trailing slash
if(!is_dir($data_set_dir)){
    mkdir($data_set_dir);
}
if(!is_dir($data_set_dir."/".$page_name)){
    mkdir($data_set_dir."/".$page_name);
}

$sample_text = $sample_text_ori = getWikipediaPage($page_name, true);
# $sample_text = $sample_text_ori = getDataFromFile($data_set_dir."/".$page_name."/".$page_name.".txt",true);
//print("<pre>".print_r($sample_text,true)."</pre>");


$tokenizer = new WordTokenizer();
$tokenized_sample_text = $tokenizer->tokenize($sample_text);
$vectorizer = new TokenCountVectorizer(new WordTokenizer, new English());

$vectorizer->fit($tokenized_sample_text);

$vectorized_text = $vectorizer->getVocabulary();

#print("<pre>".print_r($vectorized_text,true)."</pre>");
#exit();

# $tokens = tokenize($sample_text_ori); Text Analysis tokenization
$normalizedTokens = normalize_tokens($vectorized_text);
# print("<pre>".print_r($normalizedTokens,true)."</pre>");

$stopWords = [
    'a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', 'aren\'t', 'as', 'at', 'be', 'because',
    'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can\'t', 'cannot', 'could', 'couldn\'t', 'did', 'didn\'t',
    'do', 'does', 'doesn\'t', 'doing', 'don\'t', 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn\'t', 'has',
    'hasn\'t', 'have', 'haven\'t', 'having', 'he', 'he\'d', 'he\'ll', 'he\'s', 'her', 'here', 'here\'s', 'hers', 'herself', 'him',
    'himself', 'his', 'how', 'how\'s', 'i', 'i\'d', 'i\'ll', 'i\'m', 'i\'ve', 'if', 'in', 'into', 'is', 'isn\'t', 'it', 'it\'s', 'its',
    'itself', 'let\'s', 'me', 'more', 'most', 'mustn\'t', 'my', 'myself', 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or',
    'other', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'same', 'shan\'t', 'she', 'she\'d', 'she\'ll', 'she\'s', 'should',
    'shouldn\'t', 'so', 'some', 'such', 'than', 'that', 'that\'s', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there',
    'there\'s', 'these', 'they', 'they\'d', 'they\'ll', 'they\'re', 'they\'ve', 'this', 'those', 'through', 'to', 'too', 'under',
    'until', 'up', 'very', 'was', 'wasn\'t', 'we', 'we\'d', 'we\'ll', 'we\'re', 'we\'ve', 'were', 'weren\'t', 'what', 'what\'s',
    'when', 'when\'s', 'where', 'where\'s', 'which', 'while', 'who', 'who\'s', 'whom', 'why', 'why\'s', 'with', 'won\'t', 'would',
    'wouldn\'t', 'you', 'you\'d', 'you\'ll', 'you\'re', 'you\'ve', 'your', 'yours', 'yourself', 'yourselves', 'a', 'abbr', 'b', 'bdi', 'br', 'col', 'dd', 'del', 'dfn', 'div', 'dl', 'dt', 'em', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 'ins', 'kbd', 'li', 'ol', 'p', 'q', 'rb', 'rp', 'rt', 'rtc', 's', 'sup', 'td', 'th', 'tr', 'u', 'ul', 'li', 'var', 'wbr', 'px', 'st', 'a', 'able', 'about', 'above', 'abst', 'accordance', 'according', 'accordingly', 'across', 'act', 'actually', 'added', 'adj', 'affected', 'affecting', 'affects', 'after', 'afterwards', 'again', 'against', 'ah', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'announce', 'another', 'any', 'anybody', 'anyhow', 'anymore', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apparently', 'approximately', 'are', 'aren', 'arent', 'arise', 'around', 'as', 'aside', 'ask', 'asking', 'at', 'auth', 'available', 'away', 'awfully', 'b', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'beginning', 'beginnings', 'begins', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'between', 'beyond', 'biol', 'both', 'brief', 'briefly', 'but', 'by', 'c', 'ca', 'came', 'can', 'cannot', 'can\'t', 'cause', 'causes', 'certain', 'certainly', 'co', 'com', 'come', 'comes', 'contain', 'containing', 'contains', 'could','couldnt','couldn\'t', 'd', 'date', 'did', 'didn\'t', 'different', 'do', 'does', 'doesn\'t', 'doing', 'done', 'don\'t', 'down', 'downwards', 'due', 'during', 'e', 'each', 'ed', 'edu', 'effect', 'eg', 'eight', 'eighty', 'either', 'else', 'elsewhere', 'end', 'ending', 'enough', 'especially', 'et', 'et-al', 'etc', 'even', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'except', 'f', 'far', 'few', 'ff', 'fifth', 'first', 'five', 'fix', 'followed', 'following', 'follows', 'for', 'former', 'formerly', 'forth', 'found', 'four', 'from', 'further', 
'furthermore', 'g', 'gave', 'get', 'gets', 'getting', 'give', 'given', 'gives', 'giving', 'go', 'goes', 'gone', 'got', 'gotten', 'h', 'had', 'happens', 'hardly', 'has', 'hasn\'t', 'have', 'haven\'t', 'having', 'he', 'hed', 'hence', 'her', 'here','hereafter', 'hereby', 'herein', 'heres', 'hereupon', 'hers', 'herself', 'hes', 'hi', 'hid', 'him', 'himself', 'his', 'hither', 'home', 'how', 'howbeit', 'however', 'hundred', 'i', 'id', 'ie', 'if', 'i\'ll', 'im', 'immediate', 'immediately', 'importance', 'important', 'in', 'inc', 'indeed', 'index', 'information', 'instead', 'into', 'invention', 'inward', 'is', 'isn\'t', 'it', 'itd', 'it\'ll', 'its', 'itself', 'i\'ve' , 'j', 'just', 'k', 'keep', 'keeps', 'kept', 'kg', 'km', 'know', 'known', 'knows', 'l', 'largely', 'last', 'lately', 'later', 'latter', 'latterly', 'least', 'less', 'lest', 'let', 'lets', 'like', 'liked', 'likely', 'line', 'little', '\'ll', 'look', 'looking', 'looks', 'ltd', 'm', 'made', 'mainly', 'make', 'makes', 'many', 'may', 'maybe', 'me', 'mean', 'means', 'meantime', 'meanwhile', 'merely', 'mg', 'might', 'million', 'miss', 'ml', 'more', 'moreover', 'most', 'mostly', 'mr', 'mrs', 'much', 'mug', 'must', 'my', 'myself', 'n', 'na', 'name', 'namely', 'nay', 'nd', 'near', 'nearly', 'necessarily', 'necessary', 'need', 'needs', 'neither', 'never', 'nevertheless', 'new', 'next', 'nine', 'ninety', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'nor', 'normally', 'nos', 'not', 'noted', 'nothing', 'now', 'nowhere', 'o', 'obtain', 'obtained', 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 'omitted', 'on','once', 'one', 'ones', 'only', 'onto', 'or', 'ord', 'other', 'others', 'otherwise', 'ought', 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'owing', 'own', 'p', 'page', 'pages', 'part', 'particular', 'particularly', 'past', 'per', 'perhaps', 'placed', 'please', 'plus', 'poorly', 'possible', 'possibly', 'potentially', 'pp', 'predominantly', 'present', 'previously', 'primarily', 
'probably', 'promptly', 'proud', 'provides', 'put', 'q', 'que', 'quickly', 'quite', 'qv', 'r', 'ran', 'rather', 'rd', 're', 'readily', 'really', 'recent', 'recently', 'ref', 'refs', 'regarding', 'regardless', 'regards', 'related', 'relatively', 'research', 'respectively', 'resulted', 'resulting', 'results', 'right', 'run', 's', 'said', 'same', 'saw', 'say', 'saying', 'says', 'sec', 'section', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sent', 'seven', 'several', 'shall', 'she', 'shed', 'she\'ll', 'shes', 'should', 'shouldn\'t', 'show', 'showed', 'shown', 'showns', 'shows', 'significant', 'significantly', 'similar', 'similarly', 'since', 'six', 'slightly', 'so', 'some', 'somebody', 'somehow', 'someone', 'somethan', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'specifically', 'specified', 'specify', 'specifying', 'still', 'stop', 'strongly', 'sub', 'substantially', 'successfully', 'such', 'sufficiently', 'suggest', 'sup', 'sure', 't', 'take', 'taken', 'taking', 'tell', 'tends', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', 'that\'ll', 'thats', 'that\'ve', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'thered', 'therefore', 'therein', 'there\'ll', 'thereof', 'therere', 'theres', 'thereto', 'thereupon', 'there\'ve', 'these', 'they', 'theyd', 'they\'ll', 'theyre', 'they\'ve', 'think', 'this', 'those', 'thou', 'though', 'thoughh', 'thousand', 'throug', 'through', 'throughout', 'thru', 'thus', 'til', 'tip', 'to', 'together', 'too', 'took', 'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying', 'ts', 'twice', 'two', 'u', 'un', 'under','unfortunately', 'unless', 'unlike', 'unlikely', 'until', 'unto', 'up', 'upon', 'ups', 'us', 'use', 'used', 'useful', 'usefully', 'usefulness', 'uses', 'using', 'usually', 'v', 'value', 'various', '\'ve', 'very', 'via', 'viz', 'vol', 'vols', 'vs', 'w', 'want', 'wants', 'was', 'wasnt', 'wasn\'t', 'way', 
'we', 'wed', 'welcome', 'we\'ll', 'went', 'were', 'werent', 'weren\'t', 'we\'ve', 'what', 'whatever', 'what\'ll', 'whats', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'wheres', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whim', 'whither', 'who', 'whod', 'whoever', 'whole', 'who\'ll', 'whom', 'whomever', 'whos', 'whose', 'why', 'widely', 'willing', 'wish', 'with', 'within', 'without', 'wont', 'words', 'world', 'would', 'wouldnt','wouldn\'t', 'www', 'x', 'y', 'yes', 'yet', 'you', 'youd', 'you\'ll', 'your', 'youre', 'yours', 'yourself', 'yourselves', 'you\'ve', 'z', 'zero'
];

$filters = array(
    new \TextAnalysis\Filters\LowerCaseFilter(),
    new \TextAnalysis\Filters\QuotesFilter(),
    new \TextAnalysis\Filters\StripTagsFilter(),
    new \TextAnalysis\Filters\TrimFilter(),
    new \TextAnalysis\Filters\PunctuationFilter(),
    new \TextAnalysis\Filters\QuotesFilter(),
    new \TextAnalysis\Filters\SpacePunctuationFilter(),
    new \TextAnalysis\Filters\WhitespaceFilter(),
    new \TextAnalysis\Filters\NumbersFilter(),
    new \TextAnalysis\Filters\DomainFilter(),
    new \TextAnalysis\Filters\EmailFilter(),
    new \TextAnalysis\Filters\CharFilter(),
    new \TextAnalysis\Filters\StopWordsFilter($stopWords)
);

$document = new \TextAnalysis\Documents\TokensDocument($normalizedTokens);
$docCollection = new \TextAnalysis\Collections\DocumentArrayCollection(array($document));
$docCollection->applyTransformations($filters);

//print("<pre>".print_r($docCollection[0]->getDocumentData(),true)."</pre>");

$freqDist = new \TextAnalysis\Analysis\FreqDist($docCollection[0]->getDocumentData());
$frequency_keywords = $freqDist->getKeyValuesByFrequency();
file_put_contents($data_set_dir."/".$page_name."/".$page_name."-frequency-keywords.txt", json_encode($frequency_keywords));

//$top1000 = array_splice($frequency_keywords, 0, 1000);

# print("<pre>".print_r($top10,true)."</pre>");
$top_keywords = []; // initialize before appending to avoid an undefined-variable warning
foreach($frequency_keywords as $key => $single_keyword){
    $top_keywords[] = (string)$key;
}
file_put_contents($data_set_dir."/".$page_name."/".$page_name."-keywords.txt", json_encode($top_keywords));
//print("<pre>".print_r($top_keywords,true)."</pre>");

$stemmedTokens = stem($top_keywords, \TextAnalysis\Stemmers\MorphStemmer::class);
file_put_contents($data_set_dir."/".$page_name."/".$page_name."-stemmed-tokens.txt", json_encode($stemmedTokens));

print("<pre>".print_r(array_filter( $stemmedTokens),true)."</pre>");

mehroz1 (Author) commented Oct 11, 2021

I am testing this library on PHP 8.0.11, and I resolved the issue by adding (int) casts on lines 216, 217, and 219 of WordnetCorpus.php.
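For anyone hitting the same warnings: the snippet below is not the library's actual code, just a minimal sketch of the PHP 8 behaviour that makes an explicit (int) cast helpful. Since PHP 8, arithmetic on a fully non-numeric string throws a TypeError, and a leading-numeric string (for example, a fixed-width field read from a WordNet data file that carries stray characters) emits "A non-numeric value encountered". An explicit cast converts silently, which is presumably why it fixes the notices here.

```php
<?php
// Hypothetical field value; WordnetCorpus.php parses similar string fields
// out of the WordNet data files.
$raw = "123abc";

// Under PHP 8 this line emits "Warning: A non-numeric value encountered"
// (the leading digits are still used):
// $n = $raw * 2;

// An explicit cast takes the leading digits without any warning:
$n = (int) $raw * 2;
echo $n; // 246
```

Casting with (int) is also forward-compatible: unlike implicit coercion, a cast never warns, regardless of PHP version.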
