Initial Intent The initial intent here is to take three datasets of article titles of technical articles from different sources and try to classify them using an RNN. And also, another goal is to do this in hour long bites. And going to use this resource, https://madewithml.com/courses/foundations/recurrent-neural-networks/ for inspiration and direction on how to do this. The datasets (1) Per the above lesson, adapting the article dataset, “news.csv” , https://raw.githubusercontent.com/GokuMohandas/MadeWithML/main/datasets/news.csv , ( which was originally transformed from http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html ) (2) This set of ML/data science articles from Kaggle ( https://www.kaggle.com/hsankesara/medium-articles ) (3) And the set of stackoverflow questions ( https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate) First task: make my own dataset I am going to combine these three datasets. The label will be the source I only want the titles. The first two are technical, but the news articles source has different categories, so I’ll filter by the technical category. Do this import os import pandas as pd sources = { "news": { "loc": f"{os.getenv('SOURCE_NEWS_DIR')}/news.csv", "title_col": "title", "filter": { "col": "category", "val": "Sci/Tech", } }, "medium": { "loc": f"{os.getenv('SOURCE_MEDIUM_DIR')}/articles.csv", "title_col": "title", }, "stackoverflow": { "loc": f"{os.getenv('SOURCE_STACKOVERFLOW_DIR')}/train.csv", "title_col": "Title"} } def build_my_dataset(sources, out_loc, head=False): dfs = [] for source, detail in sources.items(): title_col = detail["title_col"] df = pd.read_csv(detail["loc"]) df = df.rename(columns={title_col: "title"}) if head: df = df.head() df["source"] = source also_filter = detail.get("filter") if also_filter: df = df[df[also_filter["col"]] == also_filter["val"]].copy() dfs.append(df[["title", "source"]]) pd.concat(dfs).to_csv(out_loc, index=False) # workdir = os.getenv("WORKDIR") out_loc = f"{workdir}/data/newdata.small.csv" build_my_dataset(sources, out_loc=out_loc, head=True) Just the sample.. !cat data/newdata.small.csv title,source Hacker Cracks Apple's Streaming Technology (AP),news European Download Services Go Mobile (Reuters),news Open Source Apps Developer SugarCRM Releases Sugar.Sales 1.1 (TechWeb),news Oracle Sales Data Seen Being Released (Reuters),news Sun's Looking Glass Provides 3D View (PC World),news Chatbots were the next big thing: what happened? – The Startup – Medium,medium Python for Data Science: 8 Concepts You May Have Forgotten,medium Automated Feature Engineering in Python – Towards Data Science,medium Machine Learning: how to go from Zero to Hero – freeCodeCamp,medium Reinforcement Learning from scratch – Insight Data,medium Java: Repeat Task Every Random Seconds,stackoverflow Why are Java Optionals immutable?,stackoverflow Text Overlay Image with Darkened Opacity React Native,stackoverflow Why ternary operator in swift is so picky?,stackoverflow hide/show fab with scale animation,stackoverflow out_loc = f"{workdir}/data/newdata.csv" build_my_dataset(sources, out_loc=out_loc, head=False) Next time Following along here, https://madewithml.com/courses/foundations/recurrent-neural-networks/ I think next I should do some preprocessing, so my data is tokenized.
...