QUOTE(iSean @ Jan 15 2021, 09:42 PM)
Hopefully I'm not wasting your breath. I really appreciate your time explaining all these to me, as I don't have someone to guide me through it all....
[If you don't mind guiding, I think I can ask my supervisor to add your name into my thesis if you wanted.]
Back to the topic:
Problem is they always have the online dataset stored inside TensorFlow itself.
Then there's this weird line called "(X_train, y_train), (X_test, y_test) = mnist.load_data()" which makes my life miserable when splitting the data, as I have no idea how they actually split it.
If I'm not mistaken, "y" are the labels/names and "x" are the images.
===================================================
Also, TensorFlow normally uses an ImageDataGenerator, so I also don't know which data it takes.
The terminology between "testing" and "validation" also confuses me from time to time... So let me get this straight: when people mention using 80% as training data, that 80% includes the validation data (say 20% of it) used while training the model, correct?

So the model basically fine-tunes itself with that 80% of training data, from the automatic splitting of the ImageDataGenerator?
Then the testing data is data the model has "never" seen before, and it is fed into the model afterwards to see how well the model is performing?
Meaning, I should technically export out a "Model", then manually feed images into the Model to get the testing results (predicted accuracy etc.)?

No problem lah, we learn together. Knowledge is meant to be shared.
You need to learn how to google when doing programming
Unsure about anything? Just paste that code into google
https://stackoverflow.com/questions/5806426...in-and-test-set

mnist is the dataset module, so it contains a function called load_data()
So what this code does
CODE
def load_data(path='mnist.npz'):
    # download (or reuse the cached) mnist.npz and verify its hash
    path = get_file(path,
                    origin='https://s3.amazonaws.com/img-datasets/mnist.npz',
                    file_hash='8a61469f7ea1b51cbae51d4f78837e45')
    # the archive already contains the four pre-split arrays
    with np.load(path, allow_pickle=True) as f:
        x_train, y_train = f['x_train'], f['y_train']
        x_test, y_test = f['x_test'], f['y_test']
    return (x_train, y_train), (x_test, y_test)
It separates out the dataset that has already been split for you
So your call (X_train, y_train), (X_test, y_test) = mnist.load_data()
Will automatically define the variables X_train .. y_test to the appropriate sets
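To see the same pattern in miniature, here's a self-contained sketch that mimics what load_data() does, using a throwaway .npz file with fake arrays instead of the real MNIST download (the file name and array shapes here are made up for illustration):

```python
import numpy as np
import tempfile, os

# build a tiny fake "mnist.npz" holding the same four pre-split arrays
path = os.path.join(tempfile.mkdtemp(), 'fake_mnist.npz')
np.savez(path,
         x_train=np.zeros((60, 28, 28)), y_train=np.zeros(60),
         x_test=np.zeros((10, 28, 28)),  y_test=np.zeros(10))

def load_data(path):
    # same shape of logic as keras' mnist.load_data(): the split
    # already exists inside the file, we just read the four arrays out
    with np.load(path, allow_pickle=True) as f:
        return (f['x_train'], f['y_train']), (f['x_test'], f['y_test'])

# the tuple unpacking just names the four arrays that were already split
(X_train, y_train), (X_test, y_test) = load_data(path)
print(X_train.shape, X_test.shape)  # (60, 28, 28) (10, 28, 28)
```

So there is no splitting happening in your script at all; the train/test division was baked into the file before you ever called load_data().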
In your case, you need to manually shuffle and subset your data
I am not sure of the exact code to do this in python, but one way you can do it is:
1. Number your images from 0-100, e.g. your COVID-positive images
2. Randomly draw 80 of those numbers
3. Subset your image dataset based on those 80 numbers
I believe there should be a function that helps you to do this. Do your homework lol. Come back to me with the code and I'll check for you.
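As a hedged starting point, the three steps above can be sketched with plain numpy (the array of zeros just stands in for your real images):

```python
import numpy as np

rng = np.random.default_rng(42)

# step 1: pretend these are 100 COVID-positive images (28x28 each)
images = np.zeros((100, 28, 28))
indices = np.arange(len(images))  # image numbers 0..99

# step 2: shuffle the numbers and take the first 80 as the training draw
shuffled = rng.permutation(indices)
train_idx, test_idx = shuffled[:80], shuffled[80:]

# step 3: subset the image array with those numbers
train_images = images[train_idx]
test_images = images[test_idx]
print(train_images.shape, test_images.shape)  # (80, 28, 28) (20, 28, 28)
```

(The ready-made function hinted at above is sklearn.model_selection.train_test_split, which does this shuffle-and-subset in one call.)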
To answer your second question:
That method is kinda dated
We have what we call cross-validation / out-of-bag sampling methods
You can read up on it, but it does not involve another separate hold-out set, which your limited dataset would suffer from
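For flavour, here is a minimal numpy sketch of k-fold cross-validation (the helper name kfold_indices is made up; scikit-learn's KFold does this for real):

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    # hypothetical helper: shuffle the sample indices, cut them into
    # k folds, and yield (train, validation) index pairs. Each sample
    # serves as validation exactly once, so no permanent hold-out set
    # is sacrificed from a small dataset.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# 100 samples, 5 folds: each pass trains on 80 and validates on 20
for train_idx, val_idx in kfold_indices(100, k=5):
    print(len(train_idx), len(val_idx))  # 80 20, five times
```

You'd fit the model once per fold and average the validation scores, instead of trusting a single 80/20 split.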
This post has been edited by pipedream: Jan 15 2021, 10:15 PM