Deep learning / machine learning related: multivariate regression prediction

TSweisinx7
post Nov 25 2019, 04:53 PM, updated 5y ago

Getting Started
**
Junior Member
212 posts

Joined: Oct 2019
Hi,

I'm currently working on a supervised regression problem with multiple outputs.

I have around 8k observations, each provided with its output.

For example:
observation 1 = 7890ABV890
outputs = 3020
observation 2 = 912ABNMU89
outputs = 389QV


I split the 8k observations into training, validation and test sets using a 0.7/0.2/0.1 split.
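
A minimal sketch of such a 0.7/0.2/0.1 split, using only the Python standard library. The function name `split_dataset`, the fixed seed and the placeholder pairs are assumptions made for illustration; the thread does not say how the split was actually done.

```python
import random

def split_dataset(pairs, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle (input, output) pairs and cut them into train/validation/test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * fractions[0])
    n_val = round(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left goes to the test set
    return train, val, test

# Placeholder data standing in for the 8k logged pairs (not the real data).
pairs = [(f"input{i}", f"output{i}") for i in range(8000)]
train, val, test = split_dataset(pairs)
print(len(train), len(val), len(test))
```

Shuffling before cutting matters if the log is ordered in time, because otherwise the three sets may not come from the same distribution.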

In my initial testing, each input seems to show its own unique pattern: I always get almost 0% accuracy on the validation set, but over 90% accuracy on the training set.

Has anyone solved a similar problem before? I'd be glad to discuss it further with anyone who has.

So far I have tried a random forest regressor, an SVM regressor, a KNN regressor, a 1D DenseNet and ResNet+biLSTM, but none of them gives good results.

Thanks

alexa
post Nov 25 2019, 11:32 PM

Big Boss
******
Senior Member
1,456 posts

Joined: Jan 2009
From: mont kiara, kuala lumpur



Can you clarify more on this?

observation 1 = 7890ABV890
outputs = 3020
observation 2 = 912ABNMU89
outputs = 389QV

What's your TP rate, FP rate, Precision, Recall, F-Measure & ROC?
TSweisinx7
post Nov 26 2019, 09:06 AM

Hi alexa,

The observations and outputs are the training data, and I have 8k such pairs. Since this is regression, I'm using RMSE as the evaluation metric, which I think is quite different from the evaluation metrics used in classification problems.
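
For reference, RMSE is the square root of the mean of the squared differences between predictions and targets. A minimal sketch of the calculation follows; the numbers are made up and only show the arithmetic, they are not from the dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("sequences must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# One prediction out of four is off by 4, so the mean squared error is 16 / 4 = 4.
print(rmse([3.0, 0.0, 2.0, 0.0], [3.0, 0.0, 2.0, 4.0]))  # prints 2.0
```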

Have you come across a problem where the training data shows no pattern at all, or where each data point shows its own unique pattern?
alexa
post Nov 26 2019, 10:26 AM

So what's your RMSE value?
TSweisinx7
post Nov 27 2019, 09:36 AM

The RMSE loss is high, from a few hundred to a few thousand; ideally it should be close to zero.

I do have some experience with ordinary regression problems. In this case, however, the observations don't show any pattern, so I'm looking for further ideas on how to proceed with the testing.

thanks
alexa
post Nov 27 2019, 09:41 AM

How did you collect the data? Have you already cleaned it?
TSweisinx7
post Nov 27 2019, 12:10 PM

The data comes from an old piece of computer software, and we have no idea what that software is. It is no longer operating, and we only have the inputs and outputs from its old logs, so basically all we have is data like the examples I gave earlier. What we want to do is model a system that gives a similar response.

The data is pretty straightforward: 17-character inputs and 4~7-character outputs.

The data may be encrypted, or maybe it is some sort of assembly language, but I can't be sure yet.

I tried a bit here to check whether it is encrypted, but it doesn't look helpful enough: https://gchq.github.io/CyberChef/
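
One rough, complementary check is to measure how evenly the characters are used. The sketch below computes the Shannon entropy of the pooled characters; the function name `char_entropy` is made up here, and the two sample strings are the example inputs from the first post. An entropy close to the maximum only says the characters look evenly spread. It does not prove the data is encrypted, and two short strings are far too few to conclude anything, so the full log would have to be used.

```python
import math
from collections import Counter

def char_entropy(strings):
    """Shannon entropy, in bits per character, of the pooled character counts."""
    counts = Counter("".join(strings))
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

samples = ["7890ABV890", "912ABNMU89"]  # example inputs quoted in this thread
max_bits = math.log2(36)  # upper bound for an alphabet of 0-9 plus A-Z
print(f"{char_entropy(samples):.3f} bits per character (maximum {max_bits:.3f})")
```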





moltenx
post Dec 5 2019, 04:36 PM

New Member
*
Newbie
3 posts

Joined: May 2014
How sure are you that this is a regression problem? I see that one of the outputs contains a letter. For this problem, I would say you have to do feature engineering, such as calculating:

- how many characters there are
- how many numbers there are
- the sum of all the numbers
- the average of all the numbers
- a numeric representation of the characters, summed up

There are a lot of things you can do to create features. Then try again.
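
The features listed above could be sketched roughly as follows. This reads "characters" as letters and "numbers" as single digits, which is an interpretation rather than something the post states, and the A=1 ... Z=26 mapping is just one arbitrary numeric representation. The function name `basic_features` is made up for illustration.

```python
def basic_features(text):
    """Hand-made features for one alphanumeric string (digits 0-9, letters A-Z)."""
    digits = [int(c) for c in text if c.isdigit()]
    letters = [c for c in text if c.isalpha()]
    return {
        "n_letters": len(letters),
        "n_digits": len(digits),
        "digit_sum": sum(digits),
        "digit_mean": sum(digits) / len(digits) if digits else 0.0,
        # A=1, B=2, ..., Z=26 is an assumed mapping; any other would do as well.
        "letter_sum": sum(ord(c.upper()) - ord("A") + 1 for c in letters),
    }

# Example input taken from the first post.
print(basic_features("7890ABV890"))
```

Note that all of these are order-insensitive summaries, so two different inputs with the same characters in a different order get identical features.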
TSweisinx7
post Dec 6 2019, 10:46 AM

Hi moltenx,

Yup, this can actually be treated as a classification problem as well.

I'm not sure about the maximum number of output characters (maybe 3 to 8), but they are always 0-9 and A-Z. I also tried the approach you mentioned and included features such as the sum, mean, std, var and first derivative, but none of them gave any distinct improvement.

So far I have tried 3 methods:

1st
Conventional regression (random forest, KNN, SVM, DenseNet-1D): convert the inputs and outputs to binary or uint8, then treat it as a multi-output regression problem.

2nd
Classification (random forest, SVM, KNN, DenseNet-1D): treat it as a multi-output label problem where each output position has 36 categories (0-9, A-Z).

3rd
An OCR-like approach: convert the 1D inputs to 2D (something like a Gram matrix), use ResNet to extract features, and then use a biLSTM to learn the characters from the image.
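
For the 2nd method, the label encoding could look roughly like the sketch below: each output position becomes one class index out of 36. The extra padding class for short outputs and the fixed length of 8 are assumptions added here so that outputs of different lengths fit the same shape; the post does not say how variable lengths were handled.

```python
import string

ALPHABET = string.digits + string.ascii_uppercase  # "0".."9" then "A".."Z", 36 symbols
PAD = len(ALPHABET)  # hypothetical extra class (index 36) meaning "no character here"

def encode(text, length):
    """Map a string to `length` class indices, padding short strings with PAD."""
    if len(text) > length:
        raise ValueError("text is longer than the target length")
    indices = [ALPHABET.index(c) for c in text]
    return indices + [PAD] * (length - len(text))

def decode(indices):
    """Inverse of encode: drop PAD entries and map indices back to characters."""
    return "".join(ALPHABET[i] for i in indices if i != PAD)

labels = encode("389QV", length=8)  # example output from the first post
print(labels)          # [3, 8, 9, 26, 31, 36, 36, 36]
print(decode(labels))  # 389QV
```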


None of the methods works. From my observation, though, KNN with 1 neighbour gets almost 100% training accuracy but 0% on the test set. This makes me think that each input is unique in its own way, since only 1 neighbour is used and using more than 1 neighbour gives poorer results. I'm trying to find the pattern in the data, but it looks like even the deep learning methods haven't been able to learn any pattern so far.
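
A note on reading the 1-neighbour result: with k=1, every training point is its own nearest neighbour, so near-100% training accuracy is expected on any dataset and says nothing about whether a pattern exists. The sketch below illustrates this on purely synthetic random strings with no input/output relationship at all. The Hamming distance and the dataset sizes are arbitrary choices for the illustration, not what was used in the thread.

```python
import random
import string

ALPHABET = string.digits + string.ascii_uppercase
rng = random.Random(0)

def random_string(n):
    """Random string of n characters drawn from 0-9 and A-Z."""
    return "".join(rng.choice(ALPHABET) for _ in range(n))

def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def predict_1nn(train, query):
    """Return the output paired with the training input closest to `query`."""
    _, nearest_output = min(train, key=lambda pair: hamming(pair[0], query))
    return nearest_output

def accuracy(train, data):
    """Fraction of pairs in `data` whose output is predicted exactly."""
    hits = sum(predict_1nn(train, x) == y for x, y in data)
    return hits / len(data)

# Synthetic pairs with NO relationship between input and output.
train = [(random_string(17), random_string(4)) for _ in range(300)]
test = [(random_string(17), random_string(4)) for _ in range(100)]

train_acc = accuracy(train, train)  # each query finds itself at distance 0
test_acc = accuracy(train, test)    # an exact 4-character match by chance is very unlikely
print(train_acc, test_acc)
```

The same "perfect on train, nothing on test" picture therefore also appears when there is nothing to learn, which is consistent with the suspicion that the mapping is encrypted or hashed.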
moltenx
post Dec 6 2019, 12:12 PM

Sounds more like reverse engineering (RE)? But I'm not sure how machine learning can be combined with RE, especially since you said that each input and output is unique. Is some encryption method involved?
TSweisinx7
post Dec 6 2019, 12:26 PM

I'm not sure about the encryption part; maybe, maybe not.

Maybe it is in hex or assembly language, but I have no idea since I'm not very proficient in this area.

From my testing with CyberChef so far >> https://gchq.github.io/CyberChef/

it shows no correlation with almost any kind of encryption.
Mussel
post Dec 9 2019, 07:49 PM

Casual
***
Junior Member
433 posts

Joined: Jun 2016


QUOTE(weisinx7 @ Nov 27 2019, 12:10 PM)
.....The data is pretty straight forward, 17 character inputs and 4~7 character outputs......The data may encrypted, or maybe it is some sort of assembly language but i cannot sure yet.
*
QUOTE(weisinx7 @ Dec 6 2019, 12:26 PM)
......maybe it is in hex or assembly language but i have no idea on this part since i'm not so proficient in this section.
*
Hi there. I'm not into deep learning or machine learning, but I'd like to give some opinions on the hexadecimal or assembly language part.

Yes, any hexadecimal code could be assembler opcodes:
https://en.wikipedia.org/wiki/X86_instruction_listings

But you have to use a disassembler (or a legacy tool like the DOS debugger) to dump all the data and see whether it forms a series of meaningful assembly code.

Maybe @cikelempadey can elaborate more on the "hex or asm" part.

EDIT: Online disassembler: https://onlinedisassembler.com/odaweb/ (just paste your hexadecimal values into it, and you'll see whether they are assembly language)

A meaningful asm code section might begin with PUSH xxx and end with POP xxx and RET.
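
Before pasting anything into a disassembler, it may be worth checking whether the strings are valid hexadecimal at all, since hex only uses 0-9 and A-F. A minimal sketch, run on the four example strings quoted earlier in the thread (the helper names are made up for illustration):

```python
import string

def is_hex(text):
    """True if the string is non-empty and every character is a hex digit."""
    return len(text) > 0 and all(c in string.hexdigits for c in text)

def non_hex_chars(text):
    """Sorted characters that rule out a plain hexadecimal reading."""
    return sorted({c for c in text if c not in string.hexdigits})

for sample in ["7890ABV890", "912ABNMU89", "3020", "389QV"]:
    print(sample, is_hex(sample), non_hex_chars(sample))
```

Three of the four examples contain letters outside A-F (V, N, M, U, Q), so at least those cannot be plain hex dumps, although they could still be some other encoding.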

This post has been edited by Mussel: Dec 9 2019, 07:59 PM
TSweisinx7
post Dec 10 2019, 11:14 AM

Thanks, I will have a look at that, but I think it is less likely to be assembly language.

 
