Initial Intent

  • The plan is to take three datasets of technical article titles from different sources and try to classify each title by its source using an RNN. A second goal is to do this in hour-long bites.
  • I'm going to use this resource, https://madewithml.com/courses/foundations/recurrent-neural-networks/, for inspiration and direction on how to do this.

The datasets

First task: make my own dataset

  • I am going to combine these three datasets. The label will be the source.
  • I only want the titles.
  • Two of the sources (Medium and Stack Overflow) are entirely technical, but the news source has several categories, so I’ll filter it down to the technical category.

Build the combined dataset


import os
import pandas as pd

sources = {
    "news": {
        "loc": f"{os.getenv('SOURCE_NEWS_DIR')}/news.csv",
        "title_col": "title",
        "filter": {
            "col": "category",
            "val": "Sci/Tech",
        }
    },
    "medium": {
        "loc": f"{os.getenv('SOURCE_MEDIUM_DIR')}/articles.csv",
        "title_col": "title",
    },
    "stackoverflow": {
        "loc": f"{os.getenv('SOURCE_STACKOVERFLOW_DIR')}/train.csv",
        "title_col": "Title",
    },
}

def build_my_dataset(sources, out_loc, head=False):
    """Combine the title column of each source into one CSV labeled by source."""
    dfs = []
    for source, detail in sources.items():
        df = pd.read_csv(detail["loc"])
        # Normalize the title column name across sources.
        df = df.rename(columns={detail["title_col"]: "title"})

        # Filter before sampling, so head() draws only from matching rows.
        also_filter = detail.get("filter")
        if also_filter:
            df = df[df[also_filter["col"]] == also_filter["val"]]
        if head:
            df = df.head()

        # The label is the name of the source.
        dfs.append(df[["title"]].assign(source=source))
    pd.concat(dfs, ignore_index=True).to_csv(out_loc, index=False)

# Build a small sample first (five rows per source) to check the output.
workdir = os.getenv("WORKDIR")
out_loc = f"{workdir}/data/newdata.small.csv"
build_my_dataset(sources, out_loc=out_loc, head=True)
  • Just the sample:
!cat data/newdata.small.csv

title,source
Hacker Cracks Apple's Streaming Technology (AP),news
European Download Services Go Mobile (Reuters),news
Open Source Apps Developer SugarCRM Releases Sugar.Sales 1.1 (TechWeb),news
Oracle Sales Data Seen Being Released (Reuters),news
Sun's Looking Glass Provides 3D View (PC World),news
Chatbots were the next big thing: what happened?  The Startup  Medium,medium
Python for Data Science: 8 Concepts You May Have Forgotten,medium
Automated Feature Engineering in Python  Towards Data Science,medium
Machine Learning: how to go from Zero to Hero  freeCodeCamp,medium
Reinforcement Learning from scratch  Insight Data,medium
Java: Repeat Task Every Random Seconds,stackoverflow
Why are Java Optionals immutable?,stackoverflow
Text Overlay Image with Darkened Opacity React Native,stackoverflow
Why ternary operator in swift is so picky?,stackoverflow
hide/show fab with scale animation,stackoverflow

# Now build the full dataset.
out_loc = f"{workdir}/data/newdata.csv"
build_my_dataset(sources, out_loc=out_loc, head=False)
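
Since the label is the source, it may be worth checking how many titles each source contributed before training anything. Below is a minimal sketch, assuming the combined CSV has the `title` and `source` columns written above. `summarize_dataset` is a hypothetical helper, not something from the original notebook.

```python
import pandas as pd


def summarize_dataset(csv_loc):
    """Return (rows per source, number of missing titles) for a combined CSV.

    Hypothetical helper: assumes the file has "title" and "source" columns.
    """
    df = pd.read_csv(csv_loc)
    # Rows per label, to spot a heavily imbalanced source.
    counts = df["source"].value_counts().to_dict()
    # Empty titles come back from read_csv as NaN.
    n_missing = int(df["title"].isna().sum())
    return counts, n_missing
```

Calling `summarize_dataset(out_loc)` on the full file would show whether one source dominates the others, which could matter when judging classifier accuracy later.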

Next time

Following along with https://madewithml.com/courses/foundations/recurrent-neural-networks/, I think the next step is some preprocessing, so that my data is tokenized.
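
As a rough idea of what that tokenization step might look like, here is a minimal sketch using only the standard library. It is an assumption about one possible approach, not the method from the course: the names `tokenize`, `build_vocab`, and `encode`, the regex, and the `<pad>`/`<unk>` ids are all hypothetical choices for illustration.

```python
import re
from collections import Counter


def tokenize(title):
    """Lowercase a title and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", title.lower())


def build_vocab(titles, min_freq=1):
    """Map each token to an integer id, most frequent tokens first.

    Ids 0 and 1 are reserved for padding and unknown tokens (an assumption).
    """
    counts = Counter(tok for title in titles for tok in tokenize(title))
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, n in counts.most_common():
        if n >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(title, vocab):
    """Turn a title into a list of token ids, using <unk> for unseen tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(title)]
```

This simple regex drops punctuation entirely, so a title like "C++ vs C#" would lose the distinction between the two languages. A real preprocessing pass would probably need to handle such terms more carefully.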