I need to find the most predictive keywords and/or phrases to accurately classify the dating advice and relationship advice subreddit pages, so that we can use them to determine which ads should populate on each page. Because this is a classification problem, we'll use Logistic Regression & Bayes models. Misclassifications in this case will be fairly harmless, so I will use the accuracy score and a baseline of 63.3% to rate success. Using TfidfVectorizer, I'll get the feature importance to determine which words have the highest prediction power for the target variables. If successful, this model could also be used to target other pages that have a similar frequency of the same words and phrases.
See the dating-advice-scrape and relationship-advice-scrape notebooks for this part.
After turning all of the scrapes into DataFrames, I saved them as csvs, which you can find in the dataset folder of this repo.
Data Cleaning and EDA
- dropped rows with a null selftext column because those rows are useless to me.
- combined the title and selftext columns into one new all_text column
- examined distributions of word counts for the title and selftext columns per post and compared the two subreddit pages.
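The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook code: the sample rows are made up, and the `subreddit` column name and the `clean_posts` helper are assumptions. Only `title`, `selftext`, and `all_text` come from the write-up.

```python
import pandas as pd

# Tiny made-up sample standing in for the scraped posts (hypothetical data).
posts = pd.DataFrame({
    "title": ["First date tips", "Need advice", "Is this normal?"],
    "selftext": ["Where should we go?", None, "We argue a lot lately."],
    # Assumed label column; the real column name may differ.
    "subreddit": ["dating_advice", "dating_advice", "relationship_advice"],
})


def clean_posts(df: pd.DataFrame) -> pd.DataFrame:
    """Drop posts without selftext and build the combined all_text column."""
    cleaned = df.dropna(subset=["selftext"]).copy()
    cleaned["all_text"] = cleaned["title"] + " " + cleaned["selftext"]
    # Word counts per post, used to compare distributions between subreddits.
    cleaned["title_word_count"] = cleaned["title"].str.split().str.len()
    cleaned["selftext_word_count"] = cleaned["selftext"].str.split().str.len()
    return cleaned


cleaned = clean_posts(posts)
# Average word counts per subreddit, as a quick numeric view of the distributions.
summary = cleaned.groupby("subreddit")[
    ["title_word_count", "selftext_word_count"]
].mean()
print(summary)
```

Plotting histograms of the two word-count columns per subreddit would give the fuller picture; the grouped mean is only a quick check.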
Preprocessing and Modeling
Found the baseline accuracy score of 0.633, which means that if I always choose the value that occurs most often, I will be right 63.3% of the time.
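As a small illustration of how a majority-class baseline like this is computed, here is a sketch with made-up labels. The class counts, and which subreddit is the majority class, are assumptions chosen only to reproduce the 63.3% figure.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# Hypothetical label counts: 633 posts from one subreddit, 367 from the other.
labels = ["relationship_advice"] * 633 + ["dating_advice"] * 367
print(round(baseline_accuracy(labels), 3))  # 0.633
```

Any model has to beat this number on held-out data to be worth using.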
First attempt: logistic regression model with default CountVectorizer parameters. Train score: 99% | test: 75% | cross val: 74%
Second attempt: tried CountVectorizer with stemmer preprocessing on the first set of scraping, pretty bad score with high variance. Train: 99% | test: 72%
- tried to decrease max features and the score got a lot worse
- tried with lemmatizer preprocessing instead and the test score went up to 74%
Just increasing the data and stratifying y in my train/test split increased my cvec test score to 81 and cross val to 80. Adding 2 parameters to my CountVectorizer helped a lot. A min_df of 3 and ngram_range of (1,2) increased my test score to 83.2 and cross val to 82.3. However, these scores later disappeared.
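A rough sketch of that setup (stratified split, then CountVectorizer with min_df=3 and ngram_range=(1,2) feeding a logistic regression) might look like the following. The toy corpus, the 0/1 label encoding, `test_size`, `random_state`, and `max_iter` are all assumptions for illustration; scores on this toy data say nothing about the real results.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Made-up stand-ins for the all_text column of each subreddit.
dating_texts = [
    "how do i ask her out on a first date",
    "she matched with me on a dating app",
    "what should i text after a first date",
] * 10
relationship_texts = [
    "my boyfriend and i keep arguing about money",
    "my wife and i have been married for years",
    "my girlfriend and i moved in together",
] * 10

X = dating_texts + relationship_texts
y = [0] * len(dating_texts) + [1] * len(relationship_texts)  # 1 = relationship advice

# stratify=y keeps the class proportions the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

pipe = Pipeline([
    # min_df=3 drops terms seen in fewer than 3 posts; (1, 2) adds bigrams.
    ("cvec", CountVectorizer(min_df=3, ngram_range=(1, 2))),
    ("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

train_score = pipe.score(X_train, y_train)
test_score = pipe.score(X_test, y_test)
cv_score = cross_val_score(pipe, X_train, y_train, cv=5).mean()
print(f"train {train_score:.3f} | test {test_score:.3f} | cross val {cv_score:.3f}")
```

Keeping the vectorizer inside the pipeline means the vocabulary is refit on each cross-validation fold, so the cross val score is not inflated by terms seen only in the held-out fold.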
I think Tfidf worked best to reduce my overfitting (a variance issue) because I customized the stop words to take out the ones that were too frequent to be predictive. This was a success; however, with more time I probably could have tweaked them further to improve all scores. Looking at both single words and words in groups of two (bigrams) was the best param that gridsearch suggested; however, all of my top most predictive words ended up being unigrams. My original list of features had plenty of gibberish words and typos. Setting the minimum number of times a word was required to show up to 2 helped get rid of these. Gridsearch also suggested a 90% max df rate, which helped to get rid of oversaturated words as well. Finally, setting max features to 5000 cut down my columns to about a quarter of what they were, to focus on just the most frequently used words of what was left.
Conclusion and Recommendations
Even though I would like to have higher train and test scores, I was able to successfully lower the variance, and there are definitely a few words that have high predictive power, so I think the model is ready to launch a test. If advertising engagement increases, the same keywords could be used to find other potentially lucrative pages. I found it interesting that taking out the overly used words helped with overfitting but brought the accuracy score down. I think there is probably still room to play around with the parameters of the TfidfVectorizer to see if different stop words make a difference.
Used Reddit's API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice & Relationship Advice, and trained a binary classification model to predict which subreddit a given post originated from.