Mini NLP Detour Part I

Initial Intent

The initial intent here is to take three datasets of article titles of technical articles from different sources and try to classify them using an RNN. And also, another goal is to do this in hour long bites.
And going to use this resource, https://madewithml.com/courses/foundations/recurrent-neural-networks/ for inspiration and direction on how to do this.

The datasets

(1) Per the above lesson, adapting the article dataset, “news.csv” , https://raw.githubusercontent.com/GokuMohandas/MadeWithML/main/datasets/news.csv , ( which was originally transformed from http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html )
(2) This set of ML/data science articles from Kaggle ( https://www.kaggle.com/hsankesara/medium-articles )
(3) And the set of stackoverflow questions ( https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate)

First task: make my own dataset

I am going to combine these three datasets. The label will be the source
I only want the titles.
The first two are technical, but the news articles source has different categories, so I’ll filter by the technical category.

Do this


import os
import pandas as pd

sources = {
    "news": {
        "loc": f"{os.getenv('SOURCE_NEWS_DIR')}/news.csv",
        "title_col": "title",
        "filter": {
            "col": "category",
            "val": "Sci/Tech",
        }
    },
    "medium": {
        "loc": f"{os.getenv('SOURCE_MEDIUM_DIR')}/articles.csv",
        "title_col": "title",
    },
    "stackoverflow": {
        "loc": f"{os.getenv('SOURCE_STACKOVERFLOW_DIR')}/train.csv",
        "title_col": "Title"}
    }

def build_my_dataset(sources, out_loc, head=False):
    dfs = []
    for source, detail in sources.items():
        title_col = detail["title_col"]
        df = pd.read_csv(detail["loc"])
        df = df.rename(columns={title_col: "title"})
        if head:
            df = df.head()

        df["source"] = source

        also_filter = detail.get("filter")
        if also_filter:
            df = df[df[also_filter["col"]] == also_filter["val"]].copy()
        dfs.append(df[["title", "source"]])
    pd.concat(dfs).to_csv(out_loc, index=False)

#
workdir = os.getenv("WORKDIR")
out_loc = f"{workdir}/data/newdata.small.csv"
build_my_dataset(sources, out_loc=out_loc, head=True)

Just the sample..

!cat data/newdata.small.csv

title,source
Hacker Cracks Apple's Streaming Technology (AP),news
European Download Services Go Mobile (Reuters),news
Open Source Apps Developer SugarCRM Releases Sugar.Sales 1.1 (TechWeb),news
Oracle Sales Data Seen Being Released (Reuters),news
Sun's Looking Glass Provides 3D View (PC World),news
Chatbots were the next big thing: what happened? – The Startup – Medium,medium
Python for Data Science: 8 Concepts You May Have Forgotten,medium
Automated Feature Engineering in Python – Towards Data Science,medium
Machine Learning: how to go from Zero to Hero – freeCodeCamp,medium
Reinforcement Learning from scratch – Insight Data,medium
Java: Repeat Task Every Random Seconds,stackoverflow
Why are Java Optionals immutable?,stackoverflow
Text Overlay Image with Darkened Opacity React Native,stackoverflow
Why ternary operator in swift is so picky?,stackoverflow
hide/show fab with scale animation,stackoverflow

out_loc = f"{workdir}/data/newdata.csv"
build_my_dataset(sources, out_loc=out_loc, head=False)

Next time

Following along here, https://madewithml.com/courses/foundations/recurrent-neural-networks/ I think next I should do some preprocessing, so my data is tokenized.

Initial Intent#

The datasets#

First task: make my own dataset#

Do this#

Next time#

Initial Intent

The datasets

First task: make my own dataset

Do this

Next time