Machine Learning: Sudoku Benchmark - Dataset size


Context

The main requirement of machine learning is data: supposedly, the more you have, the better your model's predictions will be. Mathematically, a Sudoku puzzle needs at least 17 revealed numbers to admit a single solution, and about 5.47 billion essentially different solution grids exist (grids that differ only by symmetry are counted once). The game is also governed by a handful of simple rules. Therefore, the dataset required to learn the game logic shouldn't need to be very large.

Observation: What is the minimal dataset size to reach the best quality?

We iterate over different dataset sizes and pick the best one by comparing dataset size against inference score.

Chart: inference score, loss and train speed by train dataset size

Train dataset size   Inference score   Loss     Train speed
1000                 0.111             3.7244   140.34
10000                0.122             2.2935   1256.21
100000               0.125             1.3532   12415.86
1000000              0.9986            0.3079   38571.15
10000000             1.0               0.2479   48659.29
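
The sweep described above could be sketched as follows. This is a minimal illustration, not the benchmark's actual code: `sweep`, `smallest_sufficient_size`, `replay_reported_score` and the `tolerance` parameter are hypothetical names, and the training step is replaced by a stand-in that simply replays the inference scores reported in the chart.

```python
from typing import Callable, Dict, Iterable


def sweep(sizes: Iterable[int],
          train_and_evaluate: Callable[[int], float]) -> Dict[int, float]:
    """Run one training per dataset size and record its inference score."""
    return {size: train_and_evaluate(size) for size in sizes}


def smallest_sufficient_size(scores: Dict[int, float],
                             tolerance: float = 0.01) -> int:
    """Return the smallest size whose score is within `tolerance` of the best."""
    best = max(scores.values())
    return min(size for size, score in scores.items()
               if score >= best - tolerance)


# Inference scores reported in the chart, keyed by train dataset size.
REPORTED_SCORES = {
    1_000: 0.111,
    10_000: 0.122,
    100_000: 0.125,
    1_000_000: 0.9986,
    10_000_000: 1.0,
}


def replay_reported_score(size: int) -> float:
    """Stand-in for a real training run (assumption): look up the reported score."""
    return REPORTED_SCORES[size]


scores = sweep(REPORTED_SCORES, replay_reported_score)
chosen = smallest_sufficient_size(scores)
print(chosen)
```

With a tolerance of 0.01 this selects 1,000,000 samples. Requiring the exact best score (`tolerance=0.0`) would select 10,000,000 instead, so the answer depends on how much quality loss is considered acceptable.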

Assertions

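  • Training speed keeps increasing with dataset size (in spite of data generation time).
  • The model starts to understand the game at a loss of about 0.31 and/or a dataset size of 1M: the inference score jumps from 0.125 at 100K to 0.9986 at 1M.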
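
Both assertions can be checked directly against the reported numbers. The snippet below is only a sketch for reading the results: the variable names are illustrative, the values are rounded from the chart data, and the unit of train speed is not stated in the source.

```python
# Values reported in the chart (train speed rounded to two decimals).
sizes = [1_000, 10_000, 100_000, 1_000_000, 10_000_000]
scores = [0.111, 0.122, 0.125, 0.9986, 1.0]
speeds = [140.34, 1256.21, 12415.86, 38571.15, 48659.29]

# Assertion 1: train speed increases at every step of the sweep.
speed_increases = all(b > a for a, b in zip(speeds, speeds[1:]))

# Growth factor of train speed for each tenfold increase in dataset size.
speed_factors = [b / a for a, b in zip(speeds, speeds[1:])]

# Assertion 2: locate the largest jump in inference score between
# two consecutive dataset sizes.
jumps = [b - a for a, b in zip(scores, scores[1:])]
k = max(range(len(jumps)), key=jumps.__getitem__)
jump_from, jump_to = sizes[k], sizes[k + 1]

print(speed_increases)
print([round(f, 2) for f in speed_factors])
print(jump_from, jump_to)
```

The largest jump in score sits between 100,000 and 1,000,000 samples. Train speed rises at every step, but its growth factor per tenfold increase in data shrinks from roughly 9x to 10x on the first two steps to about 1.3x on the last one.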