Changelog

aut-1.2.0 (2022-11-17)

Full Changelog

Closed issues:

  • Include last modified date for a resource #546

Merged pull requests:

aut-1.1.1 (2022-10-31)

Full Changelog

Fixed bugs:

  • DomainGraph should use YYYYMMDD not YYYYMMDDHHMMSS #544

Merged pull requests:

aut-1.1.0 (2022-06-17)

Full Changelog

Fixed bugs:

  • org.apache.tika.mime.MimeTypeException: Invalid media type name: application/rss+xml lang=utf-8 #542

Closed issues:

  • Add ARCH text files derivatives #540

Merged pull requests:

aut-1.0.0 (2022-06-10)

Full Changelog

Implemented enhancements:

  • Remove http headers, and html on webpages() #538
  • Add domain column to webpages() #534
  • Replace Java ARC/WARC record processing library #494
  • Method to perform finer-grained selection of ARCs and WARCs #247
  • Unnecessary buffer copying #18

Fixed bugs:

  • Discard date RDD filter only takes a single string, not a list of strings. #532
  • Extract gzip data from transfer-encoded WARC #493
  • ARC reader string vs int error on record length #492

Closed issues:

  • java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.Set$Set1 Set(liberal.ca) #529
  • Improve CommandLineApp.scala test coverage #262
  • Improve ExtractBoilerpipeText.scala test coverage #261
  • Improve ArchiveRecord.scala test coverage #260
  • Unit testing for RecordLoader #182
  • Improve ArchiveRecordWritable.java test coverage #76
  • Improve WarcRecordUtils.java test coverage #74
  • Improve ArcRecordUtils.java test coverage #73
  • Improve ExtractDate.scala test coverage #64
  • Remove org.apache.commons.httpclient #23

Merged pull requests:

aut-0.91.0 (2022-01-21)

Full Changelog

Implemented enhancements:

  • Include timestamp in crawl date #525

Merged pull requests:

  • Change crawl_date format to YYYYMMDDHHMMSS, update hasDate filter. #526 (ruebot)

aut-0.90.4 (2021-11-01)

Full Changelog

Implemented enhancements:

  • Replace scala-uri library from ExtractDomain and just parse public_suffix_list.dat #521

Fixed bugs:

  • Scaladocs haven't been created since 0.90.0 release #522

Merged pull requests:

aut-0.90.3 (2021-10-22)

Full Changelog

Fixed bugs:

  • ExtractDomains returns non-Apex Domains #519

Merged pull requests:

aut-0.90.2 (2021-05-12)

Full Changelog

Fixed bugs:

  • ARC file name appearing in url list #516
  • WARC-Target-URI in Wget warc files is not parsed properly #514

Merged pull requests:

  • Filter out filedesc and dns records from arcs. #517 (ruebot)
  • Handle wget WARC-Target-URI formatting. #515 (ruebot)

aut-0.90.1 (2021-04-29)

Full Changelog

Fixed bugs:

  • crawl_date is not included on binary information jobs when documentation says it is #512

Merged pull requests:

  • Add missing crawl_date column to binary information jobs. #513 (ruebot)
  • Update jsoup to 1.13.1 #511 (ruebot)

aut-0.90.0 (2021-01-27)

Full Changelog

Fixed bugs:

  • Python implementation of .all() has .keepValidPages() incorrectly applied to it #502
  • Extract hyperlinks from wayback machine #501
  • Release 0.80.0 JAR produces error; built 0.80.1 fatjar built on repo works #495

Closed issues:

  • Migrate CI infrastructure from TravisCI to GitHub Action #506
  • Split tf into its own repo #498
  • Change master branch to main branch #490
  • GitHub action - Run isort and black on Python code #488
  • Add scalafmt GitHub action #486
  • Add Google Java Formatter as a GitHub action #484
  • Packages build is often broken - should we support it? #483
  • Implement SaveToDisk in Python #478
  • Java 11 support #356

Merged pull requests:

  • ars-cloud compatibility with aut and Java 11 #510 (ruebot)
  • Update to Spark 3.0.1 #508 (ruebot)
  • Replace TravisCI with GitHub Actions. #507 (ruebot)
  • Bump junit from 4.12 to 4.13.1 #505 (dependabot[bot])
  • Fix relative links extraction #504 (yxzhu16)
  • Remove .keepValidPages() on .all() Python implementation. #503 (ruebot)
  • Updates read.me to include citation section #500 (SamFritz)
  • Remove tf project; resolves #498. #499 (ruebot)
  • Add Python formatter GitHub Action. #489 (ruebot)
  • Add scalafmt GitHub action and apply it to scala code. #487 (ruebot)
  • Add Google Java Formatter as an action, and apply it. #485 (ruebot)
  • Add Python implementation of SaveBytes. #482 (ruebot)
  • Bump xercesImpl from 2.11.0 to 2.12.0 #481 (dependabot[bot])
  • [Skip Travis] Trim README down given aut.docs.archivesunleashed.org #480 (ruebot)
  • Spark 3.0.0 + Java 11 support. #375 (ruebot)

aut-0.80.0 (2020-06-03)

Full Changelog

Closed issues:

  • Broken link in documentation #476
  • Improve udfs/package.scala test coverage #473
  • Remove tabDelimit #471
  • Remove Extract Entities #469
  • PEP8 Naming - UDFs, App method names, DataFrame names, and filters. #468
  • Python UDFs - class or not? #467
  • Remove ExtractImageDetailsDF.scala #464
  • github-site-deploy uses password-based authentication which is being deprecated by GitHub #461
  • Implement Python versions of Serializable APIs #410
  • Implement Python versions of App utilities #409
  • Implement Python versions of Matchbox utilities #408
  • Improve TupleFormatter.scala test coverage #59
  • Create tests for NERCombinedJson.scala #53
  • Create tests for NER3Classifier.scala #52
  • Create tests for ExtractEntities.scala #48

Merged pull requests:

  • Remove RDD suffixes on file, class, and object names. #479 (ruebot)
  • PEP8 Python app method names. #477 (ruebot)
  • Move Python UDF methods out of their own class. #475 (ruebot)
  • Add DataFrame udf tests. #474 (ruebot)
  • Remove tabDelimit. #472 (ruebot)
  • Remove NER functionality. #470 (ruebot)
  • Add ExtractPopularImages, WriteGEXF, and WriteGraphML to Python. #466 (ruebot)
  • Remove ExtractImageDetailsDF; resolves #464. #465 (ruebot)
  • Implement Scala Matchbox UDFs in Python. #463 (ruebot)
  • Import clean-up for df package. #462 (ruebot)

aut-0.70.0 (2020-05-04)

Full Changelog

Implemented enhancements:

  • Update PlainTextExtractor to just extract text #452
  • Migration of all RDD functionality over to DataFrames #223

Fixed bugs:

  • DomainFrequencyExtractor should remove WWW prefix #456

Closed issues:

  • For extractor (spark-submit) job, set Spark app name to be the extractor job name. #458
  • Remove RDD options from app #449
  • Add parquet as an app format option #448
  • Add datathon derivatives to app (binary info, web pages, web graph) #447
  • Update Java 8 instructions for MacOS #445
  • Add spark-submit to README #444

Merged pull requests:

  • [skip travis] README updates #460 (ruebot)
  • Set spark-submit app name to be "aut - extractorName". #459 (ruebot)
  • Add RemovePrefixWWWDF to DomainFrequencyExtractor. #457 (ruebot)
  • Updating Java install instructions for MacOS, resolves #445 #455 (ianmilligan1)
  • Add option to save to Parquet for app. #454 (ruebot)
  • Update PlainTextExtractor to output a single column; text. #453 (ruebot)
  • Add a number of additional app extractors. #451 (ruebot)
  • Remove RDD option in app; DataFrame only now. #450 (ruebot)
  • [skip-travis] Add spark-submit option to README; resolves #444. #446 (ruebot)

aut-0.60.0 (2020-04-15)

Full Changelog

Implemented enhancements:

  • Discussion: Restyle UDFs in the context of DataFrames #425
  • Add alt text column to imageGraph (imageLinks) #420
  • UDFs that filter on url should also filter on src #418

Fixed bugs:

  • CommandLineApp DomainGraphExtractor Uses Different Node IDs than WriteGraph #439
  • DomainGraphExtractor produces different output in RDD vs DF #436
  • Command line app fails because of missing log4j configuration #433

Closed issues:

  • Remove GraphXML and ExtractGraphX #442
  • Use Monochromatic Ids instead of hash to produce network identifiers. #440
  • Add graphml output to DomainGraphExtractor #435
  • Add webgraph, imagegraph, webpages, etc. to command line app #431
  • Rename imageLinks to imageGraph #419

Merged pull requests:

  • Remove GraphX support; resolves #442. #443 (ruebot)
  • Remove WriteGraph; resolves #439. #441 (ruebot)
  • Add graphml output to CommandLineApp and DomainGraphExtractor. #438 (ruebot)
  • Align RDD and DF output for DomainGraphExtractor. #437 (ruebot)
  • Update log4j configuration to resolve #433. #434 (ruebot)
  • Add imagegraph, and webgraph to command line app. #432 (ruebot)
  • Tweak hasDate to handle Seq. #430 (ruebot)
  • Restyle keep/discard filter UDFs in the context of DataFrames #429 (ruebot)
  • Update Spark and Hadoop versions. #426 (ruebot)
  • update for 'src' column #424 (SinghGursimran)
  • [skip travis] Add pre-print link to README. #423 (ruebot)
  • Add img alt text to imagegraph(); resolves #420. #422 (ruebot)
  • Rename imageLinks to imageGraph; resolves #419 #421 (ruebot)
  • Need --repositories flag with --packages. #417 (ruebot)

aut-0.50.0 (2020-02-05)

Full Changelog

Implemented enhancements:

  • Add crawl_date to binary DataFrames and imageLinks #413

Fixed bugs:

  • 0.18.0 with --packages is broken #407

Closed issues:

  • .webpages() additional tokenized columns? #402
  • Test and documentation inventory #372

Merged pull requests:

aut-0.18.1 (2020-01-17)

Full Changelog

Implemented enhancements:

  • Enhance keepValidPages #359
  • Add discardLanguage filter #352

Fixed bugs:

  • textFiles does not filter properly #390
  • DataFrame error with text files: java.net.MalformedURLException: unknown protocol: filedesc #362

Closed issues:

  • Missing doc comments #392
  • Bug in ArcTest? Why run RemoveHTML? #369
  • UDF CaMeL cASe consistency issues #368
  • ExtractDomain or ExtractBaseDomain? #367
  • Align DataFrame boilerplate in Python and Scala #366
  • Create a ComputeSHA1 method #363
  • Discussion: Should we align our Named Entity Recognition output with WANE format? #297
  • DataFrame discussion: open thread #190

aut-0.18.0 (2019-08-21)

Full Changelog

Implemented enhancements:

  • Add method for unknown extensions in binary extractions #343
  • Use Tika's detected MIME type instead of ArchiveRecord getMimeType? #342
  • Add filter/keep by http status to RecordLoader class #315
  • Audio binary object extraction #307
  • Video binary object extraction #306
  • Powerpoint binary object extraction #305
  • Doc binary object extraction #304
  • Spreadsheet binary object extraction #303
  • PDF binary object extraction #302
  • Test aut with Apache Spark 2.4.0 #295
  • Replace hashing of unique ids with .zipWithUniqueId() #243
  • Integration of neural network models for image analysis #240
  • More complete Twitter Ingestion #194
  • Image Search Functionality #165
  • feature request: log when loadArchives opens and closes warc files in a dir #156

Fixed bugs:

  • DataFrame commands throwing java.lang.NullPointerException on example data #320
  • Class issues when using aut-0.17.0-fatjar.jar #313
  • Image extraction does not scale with number of WARCs #298
  • ExtractDomain mistakenly checks source first then url #277
  • Improve ExtractDomain to Better Isolate Domains #269

Security fixes:

  • CVE-2017-7525 -- com.fasterxml.jackson.core:jackson-databind #279

Closed issues:

  • Inconsistency in ArchiveRecord.getContentBytes #334
  • Rationalize computeHash and ComputeMD5 #333
  • Test additional Java versions with TravisCI #324
  • Remove Twitter/tweet analysis #322
  • Trouble testing s3 connectivity #319
  • Depfu Error: No dependency files found #309
  • Strategy to deal with conflict between application and Spark distribution dependencies #308
  • SaveImageTest.scala should delete saved image file #299
  • Remove Deprecated ExtractGraph.scala file for next release. #291
  • DetectLanguage.scala: class LanguageIdentifier in package language is deprecated #286
  • Maven build warning during release #273
  • Improve DataFrameLoader.scala test coverage #265
  • Improve package.scala test coverage #263
  • Discussion: Idiom for loading DataFrames #231
  • DataFrame field names: open thread #229
  • DataFrame performance comparison: Scala vs. Python #215
  • TweetUtilsTest.scala doesn't test Spark, only underlying json4s library #206
  • feature request: ArchiveRecord.archiveFile #164
  • feature request: possibility to query about the progress #162
  • Update to Apache Tika 1.19.1; security vulnerabilities in 1.12 #131
  • Create tests for ExtractGraph.scala #49
  • Setup Victims #5

Merged pull requests:

  • Update LICENSE and license headers. #351 (ruebot)
  • Add binary extraction DataFrames to PySpark. #350 (ruebot)
  • Add method for determining binary file extension #349 (jrwiebe)
  • Add keep and discard by http status. #347 (ruebot)
  • Add office document binary extraction. #346 (ruebot)
  • Use version of tika-parsers without a classifier #345 (jrwiebe)
  • Use Tika's detected MIME type instead of ArchiveRecord getMimeType. #344 (ruebot)
  • Add Audio & Video binary extraction #341 (ruebot)
  • Extract PDF #340 (jrwiebe)
  • More scalastyle work; addresses #196. #339 (ruebot)
  • Replace computeHash with ComputeMD5; resolves #333. #338 (ruebot)
  • Update Tika to 1.22; address security alerts. #337 (ruebot)
  • Tests #336 (ruebot)
  • Make ArchiveRecord.getContentBytes consistent, Resolve #334 #335 (ianmilligan1)
  • Enable S3 access #332 (jrwiebe)
  • Updates to pom following 0e701b271e04e60c6fa89f39299dae7142d700b8 #328 (ruebot)
  • Move data frame fields names to snake_case. #327 (ruebot)
  • Python formatting, and gitignore additions. #326 (ruebot)
  • Test Java 8 & 11, and remove OracleJDK; resolves #324. #325 (ruebot)
  • Remove Tweet utils. #323 (ruebot)
  • Update to Spark 2.4.3 and update Tika to 1.20. #321 (ruebot)
  • add image analysis w/ tensorflow #318 (h324yang)
  • Makes ArchiveRecordImpl serializable #316 (jrwiebe)
  • Resolve cobertura-maven-plugin class issue; resolves #313. #314 (ruebot)
  • Update spark-core_2.11 to 2.3.1. #312 (ruebot)
  • Log closing of ARC and WARC files, per #156 #301 (jrwiebe)
  • Delete saved image file; resolves #299 #300 (jrwiebe)
  • Remove Deprecated ExtractGraph app #293 (greebie)
  • Add .getHttpStatus and .getFilename to ArchiveRecordImpl class #198 & #164 #292 (greebie)
  • Update license headers for #208. #290 (ruebot)
  • Change Id generation for graphs from using hashes for urls to using .zipWithUniqueIds() #289 (greebie)
  • CVE-2018-11771 update #288 (ruebot)
  • CVE-2017-17485 update; follow-on to #281. #287 (ruebot)
  • Update Apache Tika - security vulnerabilities; resolves #131. #285 (ruebot)
  • [skip travis] Update README #284 (ruebot)
  • Only trigger TravisCI on master. #283 (ruebot)
  • Missed something for #208. #282 (ruebot)
  • CVE-2018-7489 fix. #281 (ruebot)
  • Update jackson-databind version; resolves #279. #280 (ruebot)
  • Patch for #277: Fix bug and unit test for ExtractDomain #278 (borislin)
  • Patch for #269: Replace backslash with forward slash in URL #276 (borislin)
  • Clean-up pom.xml to remove plugin warnings; resolves #273. #274 (ruebot)

aut-0.17.0 (2018-10-04)

Full Changelog

Implemented enhancements:

  • Add EscapeHTML Function for ExtractLinks #266
  • PySpark support #12

Fixed bugs:

  • AUT exits/dies on java.util.zip.ZipException: too many length or distance symbols #271
  • AUT exits/dies on java.util.zip.ZipException: invalid distance too far back #246
  • Improve ExtractDomain Normalization #239
  • Twitter analysis is broken; see also: json4s/json4s#496 #197
  • Prevent encoding errors in PySpark #122

Closed issues:

  • Cannot skip bad record while reading warc file #267
  • Why did Scalastyle not reject null values in TweetUtilTest #255
  • Create UDF to combine basic text filtering features #253
  • spark-shell --packages "io.archivesunleashed:aut:0.16.0" fails with not_found dependencies #242
  • CommandLineAppRunner.scala produces output per WARC instead of combined result. #235
  • Extract images out of images DataFrame and store to disk #232
  • Before the next release, make sure docker-aut builds on master... or make sure --packages works #227
  • DataFrames for image analysis #220
  • The attempt to upgrade Spark version to 2.3.0 is not successful #218
  • Convert nulls to Option(T) #212
  • Bringing Scala DataFrames into PySpark #209
  • What is AUT? #208
  • Refactor ExtractGraph and assess value of GraphX for producing network graphs #203
  • Codify creation of standard derivatives into apps #195
  • TweetUtils - support fulltext #192
  • Combine UDFs into appropriate objects #187
  • Register Scala functions for use in Pyspark #148
  • PySpark performance bottlenecks: counting values #130
  • Redesign of PySpark DataFrame interface for filtering #120
  • Improve RecordLoader.scala test coverage #60

Merged pull requests:

  • Patch for #246 & #271: Fix exception error when processing corrupted ARC files #272 (borislin)
  • Update Bug report template. #268 (ruebot)
  • ExtractBoilerpipeText to remove headers as well. #253 #256 (greebie)
  • Add additional tweet fields to TweetUtils; partially address #194. #254 (ruebot)
  • Add support for full_text in tweets; resolve #192. #252 (ruebot)
  • Get rid of 'filesystem-root relative reference' warning. #251 (ruebot)
  • Remove stray characters from example commands. #250 (ruebot)
  • Deal with final scalastyle assessments: Issue 212 #249 (greebie)
  • Address main scalastyle errors - #196 #248 (greebie)
  • Add ExtractGraphX including algorithms for PageRank and Components. Issue 203 #245 (greebie)
  • Travis build fixes #244 (ruebot)
  • Data frame implementation of extractors. Also added cmd arguments to resolve #235 #236 (TitusAn)
  • Save images from dataframe to disk #234 (jwli229)
  • Add missing dependencies in; addresses #227. #233 (ruebot)
  • Code cleanup: ArchiveRecord + impl moved into same Scala file #230 (lintool)
  • Add Extract Image Details API #226 (jwli229)
  • Implement DomainFrequency, DomainGraph and PlainText extractor that can be run from command line #225 (TitusAn)
  • Remove duplicate call of keepValidPages #224 (jwli229)
  • Extract Image Links DF API + Test #221 (jwli229)
  • Update Apache Spark to 2.3.0; resolves #218 #219 (ruebot)
  • Resolve archivesunleashed/docker-aut#17 #217 (ruebot)
  • Create issue templates #216 (ruebot)
  • Exposing Scala DataFrames in PySpark #214 (lintool)
  • Update project description; resolves #208. #211 (ruebot)
  • Initial DataFrames merge #210 (lintool)
  • Add more instructions on how to use things to the README. #207 (ruebot)

aut-0.16.0 (2018-04-26)

Full Changelog

Implemented enhancements:

  • Revisit approach to .keepValidPages() #177

Closed issues:

  • keepValidPages incorrectly filters out pages with mime-type text/html followed by charset #199

Merged pull requests:

aut-0.15.0 (2018-04-11)

Full Changelog

Implemented enhancements:

  • Clean-up scaladoc comments #184

Closed issues:

  • Rename package io.archivesunleashed.io #188
  • Major Refactoring: RecordRDD #180
  • Major refactoring: matchbox cleanup #179
  • Major refactoring: io.archivesunleashed.spark -> io.archivesunleashed #178

Merged pull requests:

aut-0.14.0 (2018-03-20)

Full Changelog

Closed issues:

  • Incorporate Scala UDFs into Auto-documentation #176

Merged pull requests:

  • Resolve #176; setup scaladocs. #183 (ruebot)
  • Revert "make ArchiveRecord a trait (#175)" #181 (ruebot)

aut-0.13.0 (2018-03-07)

Full Changelog

Merged pull requests:

aut-0.12.2 (2018-02-28)

Full Changelog

Implemented enhancements:

  • ArchiveRecord.warcFile #171
  • Better approach to ids in WriteGraphML & WriteGEXF #168
  • Build pre-filtered networks #109
  • KeepDate UDF should support date range #108
  • Changing keepDate to allow multiple dates, would close #108 #161 (ianmilligan1)

Fixed bugs:

  • Broken GEXF Files Due to < and > characters in node id fields #172
  • There is insufficient memory for the Java Runtime Environment to continue #159
  • AUT Fails on Extracting Text from WARCs #158

Closed issues:

  • RecordLoader.loadArchives fails with nested dirs #169
  • Unparseable date error #163
  • remove angle brackets from ArchiveRecord.getUrl #157
  • Benchmarking Scala vs Python #121
  • Improve WacArcInputFormat.java test coverage #80
  • Improve WacWarcInputFormat.java test coverage #78
  • Improve WarcRecordWritable.java test coverage #77
  • Improve ArcRecordWritable.java test coverage #75
  • Improve ArcRecord.scala test coverage #69
  • Improve RemoveHttpHeader.scala test coverage #57
  • Investigate Jupyter notebooks on Altiscale #37

Merged pull requests:

aut-0.12.1 (2017-12-15)

Full Changelog

Fixed bugs:

  • ARC Handling Bug in 0.12.0 when Extracting Links #154
  • Changes jsoup version in pom.xml (#154) #155 (ianmilligan1)

aut-0.12.0 (2017-12-11)

Full Changelog

Implemented enhancements:

  • Add GraphML UDF #142
  • GEXF Output #103
  • Native notebook support #14
  • DataFrames support #13

Fixed bugs:

  • NullPointerException error during build #124
  • Resolves Issue #128: Uses new getOrigins method #136 (ianmilligan1)

Closed issues:

  • Create tests for WriteGEXF.scala #138
  • ERROR ArcRecordUtils - Read 1224 bytes but expected 1300 bytes #128
  • WarcRecordUtils.java uses or overrides a deprecated API #127
  • class LanguageIdentifier in package language is deprecated #126
  • multiple versions of scala #125
  • ExtractLinks running slowly #123
  • com.cloudera.cdh:hadoop-ant:pom:0.20.2-cdh3u4 -- errors #118

Merged pull requests:

  • Too many JUNITs #152 (ruebot)
  • Add more packages and exclusions for #113 #150 (ruebot)
  • Tuple Formatter Test Improvement #145 (greebie)
  • Check to replace partial coverage for ExtractDate. #144 (greebie)
  • Add GraphML UDF #143 (greebie)
  • Remove stackTrace output on caught error. #141 (greebie)
  • Add deprecation warnings to outmoded Arc and Warc formats. #140 (greebie)
  • Tests for WriteGEXF Issue #138 #139 (greebie)
  • Include script to write to GEXF. (#103) #137 (greebie)
  • Use correct import for WARCConstants; Resolves #127. #133 (ruebot)
  • Downgrade Tika to 1.12. Resolves #126. #132 (ruebot)
  • Pin everything to Scala 2.11.8; Resolves #125. #129 (ruebot)
  • Exclude old version of Hadoop. Resolves #118. #119 (ruebot)

aut-0.11.0 (2017-11-22)

Full Changelog

Implemented enhancements:

  • GetCrawlYear to accompany GetCrawlMonth #104
  • Refactor RecordLoader classes #102
  • Adding getCrawlYear in ArchiveRecords, resolves #104 #105 (ianmilligan1)

Closed issues:

  • spark-shell --packages "io.archivesunleashed:aut:0.10.0" fails with not_found dependencies #113
  • update the version of the dependencies not available on the central maven repository #111
  • Bake keepValidPages() into RecordLoader #101
  • Create tests for JsonUtil.scala #66
  • Improve ExtractDomain.scala test coverage #63
  • Improve ExtractImageLinks.scala test coverage #62
  • Improve ExtractLinks.scala test coverage #61
  • Improve StringUtils.scala test coverage #58
  • Improve RemoveHTML.scala test coverage #56
  • Create tests for TweetUtils.scala #54
  • Create tests for ExtractTextFromPDFs.scala #51
  • Create tests for ExtractPopularImages.scala #50
  • Create tests for ExtractBoilerpipeText.scala #47
  • Create tests for ComputeMD5.scala #46
  • Create tests for ComputeImageSize.scala #45

Merged pull requests:

  • This needs to hold steady. #117 (ruebot)
  • Update all dependencies, and add missing dependencies to resolve #113. #116 (ruebot)
  • Updated documentation links; link to project page #115 (ianmilligan1)
  • Remove pom.xml cruft; Partially resolves #111. #112 (ruebot)
  • Created Code of Conduct file #110 (SamFritz)
  • Refactor ArchiveRecord classes; addresses #101 and #102 #107 (MapleOx)
  • Improve coverage for issue-67 (RecordRDD.scala) #99 (greebie)
  • Minor fix to improve coverage. #55 #98 (greebie)
  • Test ExtractTextFromPDFs. #51 #97 (greebie)
  • Remove example scripts. Resolves #95, #70, #71, #72. #96 (ruebot)
  • Setup cobertura better so we have local html reports. #94 (ruebot)
  • Create unit tests for Issue #50 (ExtractPopularImages) #93 (greebie)
  • Add ExtractGraphTest; lint fixes on RemoveHttpHeaderTest. #92 (greebie)
  • Improve coverage for Issue #80 #91 (greebie)
  • Improve coverage for TweetUtils #90 (greebie)
  • Increase coverage for ComputeImageSize. #45 #89 (greebie)
  • Complete coverage for #66 #88 (greebie)
  • Improve Test Coverage for #55, #56, #57, #58, #59, #60, #61, #62, #63, #64 & #66 #87 (greebie)
  • Add PR template. #85 (ruebot)
  • First round of unit tests #84 (greebie)
  • Use Scala 2.11.8; Align further with Altiscale. #83 (ruebot)

aut-0.10.0 (2017-10-02)

Full Changelog

Fixed bugs:

  • NER breaks for WARC files? #41

Closed issues:

  • Do we need pythonconverters/ArcRecordConverter.scala? If so, tests. If not, delete it. #65
  • Upgrade to Spark 2 on Altiscale #43
  • Investigate our test coverage according to codecov.io #36
  • Update Scala version #35
  • Update to use Java 8 #32
  • Migrate warcbase-resources to aut-resources #30
  • mvn site-deploy -DskipTests is still failing #27
  • Retarget Hadoop #9

Merged pull requests:

  • Update to Apache Spark 2.1.1; resolves #43. #82 (ruebot)
  • Remove unused file; resolves #65. #81 (ruebot)
  • Removed inaccurate information from README.md #44 (lintool)
  • Add WARC support for ExtractEntities; Resolve #41. #42 (ruebot)
  • Add OpenJDK8 and remove OracleJDK7 so we can use trusty. #39 (ruebot)
  • Link to aut-docs in README #38 (ianmilligan1)
  • Resolve #32; Update to Java 8 #34 (ruebot)
  • Resolve #9; Update Hadoop and Spark versions. #33 (ruebot)
  • Added reference to the releases #31 (ianmilligan1)
  • Resolve #27 - Deploy javadocs to gh-pages #29 (ruebot)
  • Add Maven Central badge. #28 (ruebot)

aut-0.9.0 (2017-08-24)

Full Changelog

Closed issues:

  • More work needs to be done on the pom.xml to get us to a release. #25
  • Is src/main/java/io/archivesunleashed/demo required? #17
  • Visualization Repo (aut-viz) #16
  • Remove src/main/python #10
  • What do we do with all the documentation at docs.warcbase.org? #8
  • Setup to publish javadocs on ghpages #7
  • Get a project setup on sonatype #6
  • Setup license headers and mycila #4
  • Setup checkstyle #3
  • Setup codecov.io #1

Merged pull requests:

  • Resolve #25 update pom.xml to do a release #26 (ruebot)
  • Resolve #7 #24 (ruebot)
  • Add Slack integration for TravisCI #21 (ruebot)
  • Setup mycila plugin, and normalize all license headers; Resolves #4. #20 (ruebot)
  • Add checkstyle plugin, and remove demo; resolves #3 #17. #19 (ruebot)
  • Updating README #15 (ianmilligan1)
  • Remove dir; resolves #10 #11 (ruebot)
  • Setup codecov.io integration; resolves #1 #2 (ruebot)

* This Changelog was automatically generated by github_changelog_generator