Machine Learning: Sudoku Benchmark - Dataset size

Context

The main requirement of machine learning is data: supposedly, the more you have, the better your model's predictions will be. Mathematically, a Sudoku puzzle needs at least 17 revealed numbers to admit a unique solution, and the number of essentially different solution grids (up to symmetry) is known to be about 5.47 billion. Therefore, the dataset required to learn the game logic shouldn't need to be too large.

Observation: What is the minimal dataset size to reach the best quality?

We iterate over different dataset sizes and find the best one by comparing dataset size against inference score.
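The sweep described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the names `sweep`, `minimal_size` and the `tolerance` parameter are hypothetical, and the training step is replaced by a lookup of the inference scores reported in this section's chart.

```python
from typing import Callable, Dict, Iterable


def sweep(sizes: Iterable[int], train_and_score: Callable[[int], float]) -> Dict[int, float]:
    """Run one training per dataset size and collect its inference score."""
    return {size: train_and_score(size) for size in sizes}


def minimal_size(scores: Dict[int, float], tolerance: float = 0.01) -> int:
    """Smallest dataset size whose score is within `tolerance` of the best score."""
    best = max(scores.values())
    return min(size for size, score in scores.items() if score >= best - tolerance)


# Stand-in for real training (assumption): reuse the scores reported in the chart.
REPORTED = {
    1_000: 0.111,
    10_000: 0.122,
    100_000: 0.125,
    1_000_000: 0.9986,
    10_000_000: 1.0,
}

scores = sweep(REPORTED, REPORTED.__getitem__)
print(minimal_size(scores))  # 1000000
```

The tolerance is a design choice: with `tolerance=0.0` the largest dataset wins because only it reaches a score of exactly 1.0, while a small tolerance selects the 1M dataset.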

Chart: inference score, loss and train speed by train dataset size

Train dataset size   Inference score   Loss     Train speed
1,000                0.111             3.7244   140.34
10,000               0.122             2.2935   1256.21
100,000              0.125             1.3532   12415.86
1,000,000            0.9986            0.3079   38571.15
10,000,000           1.0               0.2479   48659.29

Assertions

Training speed increases consistently with dataset size (in spite of data generation time).
The model starts to understand the game at a loss of about 0.31 and/or a dataset size of 1M.
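
The second assertion can be checked against the values reported in the chart by locating the largest jump in inference score between consecutive dataset sizes. This is only a sketch: the variable names are hypothetical and the loss values are rounded to four decimals.

```python
# Values reported in the chart (loss rounded to four decimals).
sizes = [1_000, 10_000, 100_000, 1_000_000, 10_000_000]
inference_scores = [0.111, 0.122, 0.125, 0.9986, 1.0]
losses = [3.7244, 2.2935, 1.3532, 0.3079, 0.2479]

# Score gain between each pair of consecutive dataset sizes.
gains = [after - before for before, after in zip(inference_scores, inference_scores[1:])]

# Index of the largest gain; the threshold is the size reached after that jump.
jump = max(range(len(gains)), key=gains.__getitem__)
threshold_size = sizes[jump + 1]
threshold_loss = losses[jump + 1]

print(f"largest jump: {sizes[jump]} -> {threshold_size}, loss {threshold_loss:.2f}")
```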