Wednesday, May 31, 2017

Machine Learning and Lots of Data

At the beginning of the month I wrote about artificial intelligence and how it is not going to take over the world any time soon. I have continued to play with it and have been working on a branch of it called supervised learning. My basic example, or use case, is to feed my simple program a bunch of training sentences, each categorized as a question or a statement. Then I run a test group of sentences through it to see how well my program has learned.

I started with a very small set of training sentences, thinking that a person would be able to distinguish between a question and a statement fairly easily using just these examples. My training set began with only 30 sentences and my test set had 20. After training, my program correctly classified 16 of the 20 test sentences. That sounds pretty good at 80%, but I really need to get closer to 100%. So I added more training data: a list of 800 random questions I found, plus several pages of text from two popular books I found online, Uncle Tom's Cabin and The Old Man and the Sea. That brought me closer, with 18 of the 20 test sentences correctly identified as statements or questions (90%).
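The post doesn't say which algorithm my program uses, but here is a minimal sketch of this kind of setup, assuming a bag-of-words naive Bayes classifier: train on labeled sentences, then score accuracy on a held-out test set. All of the sentences and names below are made up for illustration.

```python
# A minimal sketch of supervised question/statement classification,
# assuming a bag-of-words naive Bayes model (the post does not say
# which algorithm the real program uses).
from collections import Counter
import math

def tokenize(sentence):
    # Keep "?" and "." as their own tokens; they are strong signals.
    return sentence.lower().replace("?", " ?").replace(".", " .").split()

def train(labeled_sentences):
    """labeled_sentences: list of (sentence, label) pairs."""
    word_counts = {}          # label -> Counter of word frequencies
    label_counts = Counter()  # label -> number of training sentences
    for sentence, label in labeled_sentences:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(sentence))
    return word_counts, label_counts

def classify(sentence, word_counts, label_counts):
    total = sum(label_counts.values())
    vocab = {w for counts in word_counts.values() for w in counts}
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        # Log prior plus log likelihood with add-one smoothing.
        score = math.log(label_counts[label] / total)
        denom = sum(counts.values()) + len(vocab)
        for word in tokenize(sentence):
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set (hypothetical examples).
training = [
    ("What time is it?", "question"),
    ("Where do you live?", "question"),
    ("Who wrote this book?", "question"),
    ("I like to ski.", "statement"),
    ("The sun rises in the east.", "statement"),
    ("She reads every night.", "statement"),
]
wc, lc = train(training)

# Held-out test set, scored the same way the post describes.
test = [("Is it raining?", "question"), ("The dog is asleep.", "statement")]
correct = sum(classify(s, wc, lc) == label for s, label in test)
print(f"accuracy: {correct}/{len(test)}")  # prints "accuracy: 2/2"
```

With a training set this tiny, the model leans almost entirely on the punctuation tokens, which is exactly why more (and messier) training data changes the results.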

The statement "I like to ski" was wrongly classified as a question, while the question "Who is your favorite actor?" was classified as a statement. So I kept adding training data until I got 100% correct classification. My training set, which started at 30 sentences, is now close to 1,000. That seems like a lot of extra work.

Now it is time to tune my algorithm. There are some things I can do to get better results with less data. However, if you plan to embark on your own supervised learning project, be prepared to collect a lot of training data. You will need it.
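One way to get better results with less data, as an illustration rather than a description of my actual tuning plan, is to hand-engineer a couple of high-signal features instead of relying on raw word counts. The word list and rules below are assumptions, not the program's real logic.

```python
# Hypothetical feature-based tuning: two hand-picked rules instead of
# learned word counts. This is an illustration of the idea, not the
# actual algorithm from the post.
QUESTION_STARTERS = {"who", "what", "when", "where", "why", "how",
                     "is", "are", "do", "does", "did", "can", "will"}

def classify(sentence):
    words = sentence.lower().rstrip("?.! ").split()
    if sentence.rstrip().endswith("?"):
        return "question"           # trailing "?" is decisive
    if words and words[0] in QUESTION_STARTERS:
        return "question"           # covers unpunctuated questions
    return "statement"

print(classify("Who is your favorite actor"))  # question
print(classify("I like to ski"))               # statement
```

Rules like these need no training data at all, which is the trade-off: less data collection up front, but every new failure case means editing the rules by hand instead of just adding more examples.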
