data influence

  • Introducing Data Dowsing for Dataset Optimization


    New Tool for Finding Training Datasets

    A tool called "Data Dowsing" recommends open-source datasets so that training can be optimized when data resources are limited. It prioritizes data collection by approximating the influence of training data on specific concepts, with the aim of improving model robustness and performance without the unsustainable practice of indiscriminately gathering vast amounts of internet data. By analyzing subspaces and applying certain constraints, the method provides a practical, albeit imprecise, signal to guide data filtering, prioritization, and adversarial training. The approach rests on the premise that calculating influence directly is too costly, so it uses perplexity as a cheaper signal that captures differences in training procedures. This matters because it offers a more sustainable and efficient way to improve machine learning models, especially in resource-constrained environments.
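To make the idea of a perplexity-based signal concrete, here is a toy sketch. It is not the Data Dowsing implementation, whose details the summary does not give. It assumes a simple stand-in: fit an add-one-smoothed unigram model on each candidate dataset, then rank the candidates by how low their perplexity is on a short text describing the target concept. The function names `unigram_perplexity` and `rank_candidates` and the whitespace tokenization are hypothetical choices made only for illustration.

```python
import math
from collections import Counter


def unigram_perplexity(train_tokens, eval_tokens, vocab):
    """Perplexity of eval_tokens under an add-one-smoothed unigram model.

    The model is fit on train_tokens; vocab is the shared vocabulary used
    for smoothing, so scores are comparable across candidate datasets.
    """
    if not eval_tokens:
        raise ValueError("eval_tokens must not be empty")
    counts = Counter(train_tokens)
    total = len(train_tokens) + len(vocab)  # add-one smoothing denominator
    log_prob = sum(math.log((counts[tok] + 1) / total) for tok in eval_tokens)
    return math.exp(-log_prob / len(eval_tokens))


def rank_candidates(candidates, concept_text):
    """Rank candidate datasets by perplexity on the concept text (lowest first).

    candidates maps a dataset name to its raw text. This is a crude proxy:
    lower perplexity is taken as a hint that the dataset is closer to the
    concept, not as a measurement of training-data influence.
    """
    eval_tokens = concept_text.lower().split()
    tokenized = {name: text.lower().split() for name, text in candidates.items()}
    vocab = set(eval_tokens)
    for tokens in tokenized.values():
        vocab.update(tokens)
    scores = {
        name: unigram_perplexity(tokens, eval_tokens, vocab)
        for name, tokens in tokenized.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])
```

Calling `rank_candidates({"medical": "...", "sports": "..."}, "doctor treats patient")` returns `(name, perplexity)` pairs with the closest-looking dataset first. A real system would presumably replace the unigram model with a trained language model, but the ranking logic would have the same shape.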
