Version history
- 3.1.1.beta
- Fixed a bug in Sparse ARFF filter that prevented correct Sparse ARFF files from being read.
- Standard ARFF filter is now less sensitive to header formatting and accepts more ARFF files straight away.
Remark: FST 3.1.0 and 3.1.1 differ only in the ARFF filter (_src_dataio/data_file_ARFF.cpp) and the Reuters ARFF sample data.
- 3.1.0.beta
- optimal Branch & Bound methods
- BBB, Basic Branch & Bound
- IBB, Improved Branch & Bound
- BBPP, Branch & Bound with Partial Prediction (averaging predictor)
- FBB, Fast Branch & Bound (averaging predictor)
- DAF, Dependency-Aware Feature Ranking, is a new highly efficient method for very-high-dimensional FS; unlike BIF it does not ignore contextual information and, consequently, is capable of yielding considerably better results (enables wrapper-based feature selection with dimensionality on the order of 10^5 to 10^6; works with an arbitrary wrapper)
- DAF0 (standard)
- DAF1 (normalized)
- SFRS/SBRS, the Sequential Retreating Search algorithms, are related to Floating Search but more thorough; also suitable for use with a secondary criterion (result regularization)
- 'generalized' variants of all sequential methods, enabling a more thorough search by testing feature g-tuples instead of just single features per step (see the 1982 book by Devijver and Kittler)
- (G)SFS, (G)SBS,
- (G)SFFS, (G)SBFS,
- (G)OS,
- (G)DOS,
- (G)SFRS, (G)SBRS
- all sequential methods now allow start from arbitrary subset (useful for tuning of results using several different methods)
- threaded implementation of individual feature ranking (BIF), handy in very-high-dimensional tasks
- Monte Carlo and threaded Monte Carlo methods select the best from a random sequence of feature subsets (see the first sketch below the version list)
- SFS/SBS, SFFS/SBFS, and SFRS/SBRS now enable post-search retrieval of the best result of each subset size as observed in the course of the search
- modified SFFS implementation to fit the original definition more closely (now runs faster)
- re-implemented threading in sequential methods, now more efficient due to a reduced number of thread creations/destructions
- search method output is now redirectable to arbitrary output stream
- search method output can be switched off (introduced output levels SILENT, NORMAL, DETAILED)
- improved result trackers (cloning, joining, etc.)
- arbitrary data part access substitution (TEST for TRAIN, etc.) to enable bias estimation
- bias estimating wrapper
- cleaner stopwatch implementation
- now permits missing values in data; such values are substituted per feature by the mean over that feature's valid values (see the second sketch below the version list)
- classifiers now implement method classify(), enabling classification of an arbitrary sample
- refactored directory structure
- lots of new demos showing broader variety of usage scenarios
- demos grouped according to purpose (for easier orientation especially of novice users)
- various minor improvements and additions (e.g., alternative random initialization of subsets, etc.)
- corrected several bugs and minor issues
- 3.0.2.beta
- added Exhaustive Search procedure in both sequential and threaded implementations to enable optimal feature selection
- corrected minor issues to support LibSVM 3.0
- result trackers now support cloning and memory usage limits
- added logfile with captured output of all demos for verification purposes (rundemos.log)
- corrected several minor issues
- 3.0.1.beta
- added support for reading ARFF (Waikato Weka) data files
- corrected minor issues to enable compilation in Visual C++
- 3.0.0.beta
- initial public release
- templated C++ code, using Boost library
- feature selection criteria
- classification accuracy estimation based (wrappers); see the data access options below
- normal Bayes classifier
- k-Nearest Neighbor classifier (based on various L-distances)
- Support Vector Machine (optional, depends on the external LibSVM library)
- normal model based (filter)
- Bhattacharyya distance
- Divergence
- Generalized Mahalanobis distance
- multinomial model based (filter) - Bhattacharyya, Mutual Information
- criteria ensembles
- hybrids
- feature selection methods
- ranking (BIF, best individual features)
- sequential search (hill-climbing)
- sequential selection (SFS/SBS, restricted/unrestricted)
- floating search (SFFS/SBFS, restricted/unrestricted)
- oscillating search (OS, deterministic, randomized, restricted/unrestricted)
- dynamic oscillating search (DOS, deterministic, randomized, restricted/unrestricted)
- in any of the above: threaded, sequential, hybrid, or ensemble-based feature preference evaluation
- supporting techniques (freely combinable with methods above)
- subset size optimization vs. subset size as user parameter
- result regularization (preference of solutions with slightly lower criterion value to counter over-fitting; see the third sketch below the version list)
- feature acquisition cost minimization
- feature selection process stability evaluation
- two-process similarity evaluation (to determine the impact of a parameter change, etc.)
- flexible data processing
- nested multi-level sampling (splitting into training, validation, test, and possibly other data parts)
- sampling through extendable objects (includes re-substitution, cross-validation, hold-out, leave-one-out, random sampling, etc.)
- normalization through extendable objects (interval shrinking, whitening)
- support for textual flat data format TRN (see FST1)
- pre-3.0.0
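
First sketch: a minimal illustration of the Monte Carlo idea from the 3.1.0 entry above. It draws a random sequence of feature subsets, evaluates each with the chosen criterion, and keeps the best. The function name, the criterion callback, and the 0.5 inclusion probability are assumptions made for this sketch, not FST3 interfaces.

    // Sketch only: evaluate random feature subsets, keep the best.
    #include <cstddef>
    #include <functional>
    #include <limits>
    #include <random>
    #include <vector>

    std::vector<bool> monte_carlo_search(
        std::size_t dim,     // number of features
        std::size_t trials,  // number of random subsets to evaluate
        const std::function<double(const std::vector<bool>&)>& criterion,
        unsigned seed = 1)
    {
        std::mt19937 rng(seed);
        std::bernoulli_distribution include(0.5);  // each feature in/out with p = 0.5
        std::vector<bool> best(dim, false);
        double best_value = -std::numeric_limits<double>::infinity();  // assumes higher = better
        for (std::size_t t = 0; t < trials; ++t) {
            std::vector<bool> subset(dim);
            for (std::size_t f = 0; f < dim; ++f) subset[f] = include(rng);
            const double value = criterion(subset);
            if (value > best_value) { best_value = value; best = subset; }
        }
        return best;
    }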
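
Second sketch: a minimal illustration of the per-feature mean substitution for missing values (3.1.0 entry above). Each missing entry is replaced by the mean over the valid values of the same feature. Marking missing values as NaN and the row-major data layout are assumptions of this sketch; FST3's data structures differ.

    // Sketch only: per-feature mean imputation of missing (NaN) values.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    void impute_missing_by_mean(std::vector<std::vector<double>>& data)  // data[sample][feature]
    {
        if (data.empty()) return;
        const std::size_t dim = data.front().size();
        for (std::size_t f = 0; f < dim; ++f) {
            double sum = 0.0;
            std::size_t valid = 0;
            for (const auto& sample : data)
                if (!std::isnan(sample[f])) { sum += sample[f]; ++valid; }
            if (valid == 0) continue;  // no valid values for this feature; leave it untouched
            const double mean = sum / static_cast<double>(valid);
            for (auto& sample : data)
                if (std::isnan(sample[f])) sample[f] = mean;
        }
    }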
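
Third sketch: one possible reading of result regularization (supporting techniques under 3.0.0 above). Among candidate subsets, it prefers a smaller one whose criterion value lies within a tolerance of the best found, trading a slightly lower criterion value for robustness against over-fitting. The Candidate structure, the epsilon tolerance rule, and the smaller-is-preferred tie-break are assumptions of this sketch, not FST3's exact formulation.

    // Sketch only: prefer the smallest subset within epsilon of the best criterion value.
    #include <cstddef>
    #include <vector>

    struct Candidate {
        std::vector<int> features;  // indices of the selected features
        double value;               // criterion value (higher assumed better)
    };

    const Candidate* regularized_pick(const std::vector<Candidate>& results, double epsilon)
    {
        if (results.empty()) return nullptr;
        double best = results.front().value;
        for (const auto& c : results)
            if (c.value > best) best = c.value;
        const Candidate* pick = nullptr;
        for (const auto& c : results)
            if (c.value >= best - epsilon &&
                (pick == nullptr || c.features.size() < pick->features.size()))
                pick = &c;
        return pick;
    }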