Releases: salesforce/TransmogrifAI

0.7.0

11 Jun 23:58
036d1fc

Bug fixes:

  • Fix flaky ModelInsight tests #407
  • Remove logging of tokens of text fields #420, #438, #447, #474
  • Add validation prepare call before model selection when no DAG is passed #424, #429
  • Fix Days.daysBetween int overflow #471

New features / updates:

  • Downsample the number of training samples to maxTrainingSample for regression #413 and multi-class classification #414
  • Refactor InsightLOCOTest #412
  • Enable more loss types for OpLinearRegression #421
  • Add property-based tests for regression model selection #427
  • Add option to calculate LOCO for dates/texts by leaving out their entire vector #418
  • Add Chinese and Korean examples to TextTokenizerTest #442
  • Add support for ignoring text that looks like IDs in SmartTextVectorizer #448, #455
  • Add a unary estimator for detecting names in text fields and transforming to likely gender #445
  • Allow result features to be removed by raw feature filter #458
  • Metadata changes for sensitive feature information #457
  • Add MinVarianceFilter which checks that computed features have a minimum variance #463, #465
  • Allow TextStats length distribution to be token-based and refactor for testability #464
  • Use Spark job grouping to distinguish steps of the machine learning flow #467, #468, #470
  • Make categorical detection coverage-based in addition to unique-count-based #473
  • Remove duplicate features using sanity checker feature to feature correlations #476, #479
  • Lift the upper bound on number of hash features #477
  • Enable HTML stripping on text-like features #478
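
The idea behind a minimum-variance check (see MinVarianceFilter above) can be sketched in plain Scala. This is an illustration only, not TransmogrifAI's implementation: the object name, the method names, and the choice of population variance are all assumptions made for the example.

```scala
// Illustrative sketch only: NOT TransmogrifAI's MinVarianceFilter.
// All names here are hypothetical.
object MinVarianceSketch {
  /** Population variance of one column of values. */
  def variance(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "column must not be empty")
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum / xs.size
  }

  /** Names of the columns whose variance is at least minVariance. */
  def keepColumns(columns: Map[String, Seq[Double]], minVariance: Double): Set[String] =
    columns.collect { case (name, values) if variance(values) >= minVariance => name }.toSet
}
```

With a constant column (1.0, 1.0, 1.0) and a varying column (0.0, 1.0, 2.0), a threshold of 1e-5 keeps only the varying column, since a constant column has zero variance.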

Dependency updates (#402, #466):

  • Update Apache Spark version to 2.4.5
  • Avro is a built-in data source in Spark 2.4, so no longer using the spark-avro package
  • Avro to 1.8.2
  • XGBoost to 0.90
  • MLeap to 0.14.0
  • json4s to 3.5.3
  • JUnit to 4.12
  • chill to 0.9.3
  • gradle-avro-plugin to 0.16.0

Miscellaneous:

  • Add ROADMAP.md #394

0.6.1

12 Sep 00:30
f4b6af3

Bug fixes:

  • Ensure correct metrics despite model failures on some CV folds #404
  • Fix flaky ModelInsight tests #395
  • Avoid creating SparseVectors for LOCO #377

New features / updates:

  • Model combiner #385
  • Added new sample for HousingPrices #365
  • Test to verify that custom metrics appear in model insight metrics #387
  • Add FeatureDistribution to SerializationFormats #383
  • Add metadata to OpScalarStandardScaler to allow for descaling #378
  • Improve json serde error in evalMetFromJson #380
  • Track mean & standard deviation as metrics for numeric features and for text length of text features #354
  • Making model selectors robust to failing models #372
  • Use compact and compressed model json by default #375
  • Descale feature contribution for Linear Regression & Logistic Regression #345

Dependency updates:

  • Update tika version #382

0.6.0

12 Jul 22:57

Bug fixes:

  • Quick Fix Alias Type Names #346
  • Forecast Evaluator - fixes SMAPE, adds MASE and Seasonal Error metrics #342
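
For readers unfamiliar with SMAPE, one common definition is the mean of 2·|forecast − actual| / (|actual| + |forecast|), which ranges from 0 to 2. These notes do not state which variant the forecast evaluator uses, so the sketch below is an assumption for illustration only; the object and method names are hypothetical.

```scala
// Illustrative sketch only: one common SMAPE definition, not necessarily
// the variant computed by TransmogrifAI's forecast evaluator.
object SmapeSketch {
  /** Mean of 2 * |forecast - actual| / (|actual| + |forecast|). */
  def smape(actual: Seq[Double], forecast: Seq[Double]): Double = {
    require(actual.nonEmpty && actual.size == forecast.size,
      "inputs must be non-empty and of equal length")
    val terms = actual.zip(forecast).map { case (a, f) =>
      val denom = math.abs(a) + math.abs(f)
      // Treat a pair of zeros as a perfect forecast to avoid dividing by zero.
      if (denom == 0.0) 0.0 else 2.0 * math.abs(f - a) / denom
    }
    terms.sum / terms.size
  }
}
```

For actual values (100, 200) and forecasts (110, 180), the two terms are 20/210 and 40/380, giving a SMAPE of roughly 0.1003.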

New features / updates:

  • Aggregate LOCOs of DateToUnitCircleTransformer #349
  • Convert lambda functions into concrete classes to allow compatibility with Scala 2.12 #357
  • Replace mapValues with immutable Map where applicable #363
  • Aggregate spark metrics during run time instead of post processing by default #358
  • Allow customizing serialization for FeatureGenerator extract function #352
  • Update helloworld examples to be simple #351
  • Adding key ctor field in all RawFeatureFilter results #348
  • Forecast evaluator + SMAPE metric #337
  • Local scoring for model with features of all types #340
  • Remove local runner + update docs #335
  • Added missing test for java conversions #334
  • Get rid of scalaj-collections #333
  • Workflow independent model loading #274
  • Aggregated LOCOs of SmartTextVectorizer outputs #308
  • Added community projects docs section #326
  • Add FeatureBuilder.fromSchema #325
  • Improve WeekOfMonth in date transformers #323
  • Improved datetime unit transformer shortcuts - Part 2 #319
  • Correctly pass main class for CLI sub project #321
  • Serialize blacklisted map keys with the model + updated access on workflow/model members #320
  • Improved datetime unit transformer shortcuts #316
  • Improved OpScalarStandardScalerTest #317
  • Improved PercentileCalibratorTest #318
  • Added concrete wrappers for HashingTF, NGram and StopWordsRemover #314
  • Avoid singleton random generators #312
  • Remove free function aggregation with feature builders #311
  • Added util methods to create class/object by name + retrieve type tag by type name #310

Dependency updates:

  • Bump shadowjar plugin to 5.0.0 #306
  • Bump Apache Tika to 1.21 #331
  • Enable CircleCI version 2.1 #353

0.5.3

08 May 21:16
8d2e819

Bug fixes:

  • Threshold metrics calculation fix when unseen labels are present #293
  • DataCutter-related fixes for multiclass #263
  • Fixed onSetInput so it is always called with new input #280

New features / updates:

  • Improved test SmartTextMapVectorizerTest #296
  • Add check to raw feature filter for removing all features #303
  • Spec-ifying ngram similarity tests #299
  • Add random test feature generator to generate datasets with features of all types #298
  • Spec-ifying NGramTest #297
  • Added base spec for testing Spark wrapping transformers #295
  • Add/upgrade string indexing tests #294
  • Improved multi pick list map vectorizer test #292
  • Improvements of Vectorizer tests #291
  • Updated TextMapPivotVectorizerTest to use OpEstimatorSpec #290
  • Update TextTokenizerTest to use OpTransformerSpec #289
  • Add test for RealNNVectorizer #288
  • Improved OPCollectionHashingVectorizerTest test #286
  • Created new tests for OpCollection #285
  • Update names of transformer tests and files to match class names #284
  • Improved test by extending OpTransformerSpec #283
  • Skip writing empty stages & skip loading stages without uid-s #282
  • Skip serializing estimators + fix test + added empty data transform test #281

Dependency updates:
N/A

0.5.2

11 Apr 03:05

Bug fixes:

  • Fixed local scoring with multipicklist features #243
  • Fixed error messages in DataCutter and DataBalancer #256
  • Fixed bug in model selector fit method #251
  • Fixed some Transmogrifier defaults to be modifiable / exposed #232
  • Fixed bug in OpXGBoostClassificationModel #229
  • Minor fixes / cleanup on notebooks, Helloworld examples, and developer guide #226, #230, #240, #259

New features / updates:

  • Added transformer classes for common math operations #255, #257
  • Added string transformers for substring search and valid email #265
  • Added scaler and descaler transformers #223
  • Added Raw Feature Filter results (e.g., metrics, exclusion reasons) to serialization and to ModelInsights #237, #252, #258, #276
  • Changed OpBinScoreEvaluator to allow for lift analysis #233
  • Added random param builder for random hyperparameter search in model selectors #238
  • Added option to return top K positives and top K negatives for LOCO #264
  • Added a max cardinality percentage that can be set for pivot #241
  • Added minimum rows for scoring set in RawFeatureFilter #250
  • Allowed copying model instances across multiple threads #270
  • Added stub to allow loading models without workflow #269, #272
  • Made decision tree numeric bucketizer tests less flaky #225
  • Added Jupyter notebooks for samples #231
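
Random hyperparameter search, mentioned above, replaces an exhaustive grid with a fixed number of randomly drawn parameter combinations. The sketch below shows the general idea in plain Scala with uniform sampling. It does not use or reproduce the library's random param builder; every name in it is hypothetical.

```scala
import scala.util.Random

// Illustrative sketch only: NOT the TransmogrifAI random param builder API.
// All names here are hypothetical.
object RandomSearchSketch {
  /** Draws n parameter maps, sampling each parameter uniformly from its (min, max) range. */
  def sample(ranges: Map[String, (Double, Double)], n: Int, seed: Long): Seq[Map[String, Double]] = {
    val rng = new Random(seed) // fixed seed keeps the draw reproducible
    Seq.fill(n) {
      ranges.map { case (name, (min, max)) => name -> (min + rng.nextDouble() * (max - min)) }
    }
  }
}
```

Calling `RandomSearchSketch.sample(Map("regParam" -> (0.0, 0.1)), n = 5, seed = 42L)` yields five candidate settings, each with a regParam between 0.0 and 0.1. The same seed always produces the same candidates.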

Dependency updates:

  • Switched to MLeap runtime from Aardpfark for local scoring #249, #261

0.5.1

09 Feb 04:42

Bug fixes:

  • Fix indices in LOCO for record-level insights and add more robust tests #216
  • Fix sorting in Prediction type for multiclass classification and add stronger tests #213
  • Fixing code generation bug with underscores in names #208
  • Correct some syntax/compilation errors in Titanic Binary Classification Docs Example #202

New features / updates:

  • Make some tests a little less flaky #221
  • Integrate helloworld project with Travis CI #210, #212
  • Use ParamGridBuilder in model selector grids to allow modifications #206
  • Use class.getName & update splitter meta parsing #204
  • Export model selector defaults + metadata fixes #199
  • Use OS specific path separator #193
  • Add transformer / estimator for text length calculation and options for using this as default behavior #190, #195
  • Allow conversion from Date and Timestamp Spark types to Date and DateTime TransmogrifAI types #188

Dependency updates:

  • Upgrade to Gradle 5.2 #218
  • Upgrade shadowjar plugin to 4.0.4 #220

0.5.0

22 Nov 21:31
078c8a0

New features and bug fixes:

  • XGBoost classification & regression models - EXPERIMENTAL #44
  • Add default param grid for xgboost #175
  • Fix ModelInsights for xgboost #170
  • Added Parquet reader #169
  • Added aggregate & conditional readers for Parquet #172
  • Evaluators check for empty data #178
  • Refactored splitter tests #176
  • Return scoring feature distributions from RawFeatureFilter #171
  • Using MapReduce Api for Avro Read Write #150
  • Improve test coverage for VectorsCombiner and make vector aggregator efficient #168
  • Time based aggregators #167
  • Ignore null values in meta + support floats #166
  • CLI command name fix + bump shadow plugin version + cleanup #164
  • Fix build.sbt example in readme #165
  • Removed an old test I added to check if Spark ran out of memory when calculating a correlation matrix (this is unnecessary and unhelpful) #160
  • Replace assert with require #159
  • Streaming histogram implementation #152
  • Added test and removed dead code for Sanity Checker dealing with map with same key #153
  • Fixed model insights exception when features are excluded from sanity checker correlation calculations #147
  • Added logging of response distribution to RFF #146
  • Use proper test ranges in feature converter test #143
  • Added support for DateType and TimestampType primitive spark types #135
  • Standardizing timezone to UTC #138

Dependency upgrades & misc:

  • XGBoost 0.81 #180
  • Spark 2.3.2 #44
  • Gradle 4.10.2 #142
  • Use OpenJDK8 for CircleCI builds + refactor build config #140

0.4.0

23 Sep 06:35
62aed6e

New features and bug fixes:

  • Allow specifying the formula used to compute the text features bin size for RawFeatureFilter (see RawFeatureFilter.textBinsFormula argument) #99
  • Fixed metadata on Geolocation and GeolocationMap so that they keep the name of the column in descriptorValue #100
  • Local scoring (aka Sparkless) using Aardpfark. Models can be loaded and scored locally, without a Spark context, using the Aardpfark (PFA for Spark) and Hadrian libraries instead. Scoring times are orders of magnitude faster than with Spark. #41
  • Add distributions calculated in RawFeatureFilter to ModelInsights #103
  • Added binary sequence transformer & estimator: BinarySequenceTransformer and BinarySequenceEstimator, plus the associated base traits #84
  • Added StringIndexerHandleInvalid.Keep option into OpStringIndexer (same as in underlying Spark estimator) #93
  • Allow numbers and underscores in feature names #92
  • Stable key order for map vectorizers #88
  • Keep raw feature distributions calculated in raw feature filter #76
  • Transmogrify to use smart text vectorizer for text types: Text, TextArea, TextMap and TextAreaMap #63
  • Transmogrify circular date representations for date feature types: Date, DateTime, DateMap and DateTimeMap #100
  • Improved test coverage for utils and other modules #50, #53, #67, #69, #70, #71, #72, #73
  • Match feature type map hierarchy with regular feature types #49
  • Redundant and deadlock-prone end listener removal #52
  • OS-neutral filesystem path creation #51
  • Make Feature class public and hide its ctor instead #45
  • Specify categorical variables in metadata #120
  • Fix fill geo location vectorizer values #132
  • Adding feature importance for new model types #128
  • Adding binaryclassification bin score evaluator #119
  • Apply DateToUnitCircleTransformer logic in raw feature filter transformations #130
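
A circular date representation maps a periodic value, such as the hour of the day, onto the unit circle so that values at the two ends of the period (23:00 and 00:00) end up close together. The sketch below shows the general technique in plain Scala. It is not the DateToUnitCircleTransformer code, and its names are hypothetical.

```scala
// Illustrative sketch only: the general unit-circle encoding,
// NOT TransmogrifAI's DateToUnitCircleTransformer. Names are hypothetical.
object UnitCircleSketch {
  /** Maps a value in [0, period) to (cos, sin) coordinates on the unit circle. */
  def toUnitCircle(value: Double, period: Double): (Double, Double) = {
    require(period > 0, "period must be positive")
    val angle = 2.0 * math.Pi * value / period
    (math.cos(angle), math.sin(angle))
  }

  /** Euclidean distance between two encoded points. */
  def distance(a: (Double, Double), b: (Double, Double)): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)
}
```

With a period of 24, hour 23 is as close to hour 0 as hour 1 is, while hour 12 sits at the maximum distance of 2 from hour 0. A raw integer encoding would instead place 23 and 0 at opposite extremes.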

Breaking changes:

  • Made case class to deal with model selector metadata #39
  • Made FileOutputCommitter a default and got rid of DirectMapreduceOutputCommitter and DirectOutputCommitter #86
  • Refactored OpVectorColumnMetadata to allow numeric column descriptors #89
  • Renaming JaccardDistance to JaccardSimilarity #80
  • New model selector interface #55. The breaking changes concern the return type and the way parameters are passed into model selectors. Starting with this version, model selectors return a single result feature of type Prediction (instead of a variable number of features: (pred, raw, prob)). Example:
val (pred, raw, prob) = MultiClassificationModelSelector() // won't compile anymore
val prediction = MultiClassificationModelSelector() // ok!

Another change is the way parameters are passed into model selectors. Example:

BinaryClassificationModelSelector
  .withCrossValidation()
  .setLogisticRegressionRegParam(0.05, 0.1) // won't compile anymore

Instead one should do:

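import com.salesforce.op.stages.impl.classification.{BinaryClassificationModelSelector, OpLogisticRegression}
import org.apache.spark.ml.tuning.ParamGridBuilder

// Pair each model with its own hyperparameter grid
val lr = new OpLogisticRegression()
val models = Seq(lr -> new ParamGridBuilder().addGrid(lr.regParam, Array(0.05, 0.1)).build())
BinaryClassificationModelSelector
  .withCrossValidation(modelsAndParameters = models)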

For more examples of how to use the new model selectors, please refer to our documentation and helloworld examples.

Dependency upgrades & misc:

  • CI/CD runtime improvements for CircleCI and TravisCI
  • Updated Gradle to 4.10
  • Updated scala-graph to 1.12.5
  • Updated scalafmt to 1.5.1
  • New transmogrifai-local subproject #41 introduces aardpfark and hadrian dependencies.

0.3.4

22 Aug 23:50

Performance improvements:

  • Added featureLabelCorrOnly parameter in SanityChecker to only compute correlations between features and label (defaults to false)
  • Added ignoreHashCorrelations parameter in SanityChecker that ignores correlations from hashed text features (defaults to false)
  • Parallelize OP cross validation and set default validation parallelism to 8
  • Added warmup in concurrent checks

New features and bug fixes:

  • Replace deprecated 'forceSharedHashSpace' param with HashingStrategy
  • Added explicit annotations for all classes with generic collections that use JsonUtils
  • Added .transmogrify shortcut for arrays of features
  • Removed referencing UID from a case object
  • DecisionTree & DropIndices stages tests now use the OP spec base classes
  • Added map features removed by RFF to model insights
  • Pretty print model summaries
  • Ensure OP Models are portable across environments
  • Ignore _ in simple streaming avro file reader
  • Updated evaluators so they can work with either a Prediction type feature or three input features
  • Added Algebird kryo registrar
  • Make sure that SmartTextVectorizerModel can be serialized to/from json

Dependency upgrades:

  • Upgraded to Scala 2.11.12
  • Updated Gradle to 4.9 & bump Scalastyle plugin to 1.0.1

Released to Bintray - https://bintray.com/salesforce/maven/TransmogrifAI