
 

30 Reasons You Need Google/ OpenRefine for Data Journalism

OpenRefine, formerly Google Refine and before that Freebase Gridworks, is also known as Refine. It’s a free, incredibly powerful browser-based tool with similarities to spreadsheets, but with considerably more functionality.

OpenRefine runs locally on your computer (Windows, Mac OS X, Linux), although you can set it up to run remotely on a Web server, or use a paid hosted service offered by the founders of Gridworks.

What can Refine do? Well, the list below is just for starters. If any of this means anything to you, then you’ll have a sense of how feature-rich OpenRefine is (although there’s much more not mentioned):

  1. Data import from multiple sources and file types
  2. Multiple file format export, including JSON, spreadsheets
  3. Template-based data export – ideal for custom text format output
  4. Project export – for duplication with tweaks
  5. Regular expressions, GREL (General Refine Expression Language, formerly Google Refine Expression Language), BeautifulSoup
  6. Multi-column sorting, with reverse option and partial undo
  7. Facets – for multiple levels of grouping and slicing data
  8. Column collapsing – for viewing convenience
  9. Reconciliation – for transforming one column of information into something else
  10. URL retrieval – for HTML page fetch, with custom delays for Web etiquette
  11. Bulk HTML parsing of a column with XPath
  12. Bulk JSON parsing of a column
  13. Split a column into multiple columns on several criteria
  14. Split a row into multiple rows
  15. Flag and star – for filtering rows
  16. Cell type conversions – text, boolean, numeric, date
  17. Data filtering with multiple criteria
  18. Data clustering on multiple criteria
  19. Data transformation
  20. Bulk editing
  21. Data massage – to “refine” dirty data into cleaner, more structured data
  22. Records – for grouping rows on a variable
  23. Blank down on column data – for grouping rows as records
  24. Fill-down on column data – reverse of blank-down
  25. Transaction history – for limitless undo operations
  26. Cross-join between Refine projects
  27. Row indexing
  28. Record indexing
  29. Column indexing – for complex multi-column manipulation
  30. Framework for Web crawling
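
To make items 23 and 24 a little more concrete, here is a minimal Python sketch of what fill-down and blank-down do to a single column. This is not OpenRefine's own code, just an illustration of the idea; the function names `fill_down` and `blank_down` and the sample city values are made up for this example.

```python
from typing import List, Optional


def fill_down(column: List[Optional[str]]) -> List[Optional[str]]:
    """Copy each non-blank value into the blank cells directly below it."""
    filled = []
    last_seen = None  # most recent non-blank value (hypothetical helper variable)
    for cell in column:
        if cell not in (None, ""):
            last_seen = cell
        filled.append(last_seen)
    return filled


def blank_down(column: List[Optional[str]]) -> List[Optional[str]]:
    """Blank out any cell that repeats the value in the cell directly above it."""
    blanked = []
    previous = None  # original (unblanked) value of the cell above
    for cell in column:
        is_repeat = cell is not None and cell == previous
        blanked.append(None if is_repeat else cell)
        previous = cell
    return blanked


# Sample data invented for illustration only.
cities = ["Chicago", None, None, "Boston", None]
filled = fill_down(cities)
print(filled)              # ['Chicago', 'Chicago', 'Chicago', 'Boston', 'Boston']
print(blank_down(filled))  # ['Chicago', None, None, 'Boston', None]
```

Running blank-down on the filled column gets back the original column, which is why I describe the two operations as the reverse of each other.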

An appropriate description of OpenRefine’s functionality might be “data exploration, massage and transformation.”

There is a learning curve, but Refine is the single most powerful tool in my data journalism / data science toolkit, aside from maybe R and RStudio. In fact, even after two years of regularly using OpenRefine (and four years since discovering it), I am still discovering new features. There are a lot of features that you may never use, some that you might use occasionally, and others you might even use daily, depending on the kind of data work that you do. I use Refine nearly daily, for both client and personal projects.

I’ll be preparing a series of tutorials about using OpenRefine for general data manipulation and reporting, and how the workflow can be integrated with other data tools, including data visualization.

Do I Need to Learn HTML and How to Build Web Pages?

(source: unsplash)

TL;DR: The short answer to the title of this post: a little bit of HTML Web page coding knowledge opens up writing opportunities. It is not always necessary, but it’s a valuable skill, especially if you find that you enjoy it and want to actually build Web pages for clients (or yourself) as part of your services.

I have a writing colleague whom I “met” online when we were working for the same client. We both wrote high-end content for the client, which involved research and technically required knowledge of a small amount of HTML. But my colleague refused to (or was actually afraid to) learn enough and, as a result, his content did not always render in the client’s WordPress pages with the right layout. (HTML is the computer code that powers Web pages, but it is NOT a programming language — just a collection of special “tag” instructions.)

You might be asking, “If the client is using WP (WordPress), can’t my colleague use the visual editor in WP? That would make it really easy for him, and he wouldn’t really need to know any HTML at all.”

Answer: No. The client has other people who handle content checks and publication. That means the client’s freelance writers need to know some HTML to get work, though only a few features. It’s really not that much effort to learn:

  • different heading size codes
  • how to link to a Web page
  • how to add images that center on a page
  • adding extra line breaks
  • how to produce an unnumbered (bulleted) list (like this one)
  • how to produce a numbered list

and a few other HTML features.

But my take is that my colleague is afraid of something, and fear and learning just do not play well together. His background is journalism, many years of it, so he’s a good writer. Unfortunately, not being willing to learn HTML limits the range of writing work he can take on.

HTML is not a foreign language, and I’ve known a fair number of creative types (writers, artists, photographers, etc.) who have picked it up. There are countless tutorials online. If you’re looking to learn some basics of HTML and Web pages, I suggest you start with a search for “HTML” on Coursera, where real college courses are offered free (no certification). If you want to go further and learn a bit of Web page development (actual coding/ programming), see freeCodeCamp. Of course, you can also use your favorite search engine and check on YouTube.

If there’s enough interest, I might put together a small ebook on some of the basics of HTML that Web-based writers should know and learn. What do you think? Have you had to learn HTML to supplement your writing or other creative work? Leave a comment.

Data Visualization Tools – C3.js

source: C3js.org

Previously, I introduced the D3.js data visualization library. If you’ve seen that article and felt there is too much of a learning curve for D3, you might be interested in C3.js.

C3 is also a JavaScript library and is based on D3. At the time of writing, the C3 library had over a dozen prefab chart types, as listed below:

  1. Line chart
  2. Simple XY line chart
  3. Step chart
  4. Bar chart
  5. Pie chart
  6. Combination chart
  7. Timeseries chart
  8. Multiple XY line chart
  9. Area chart
  10. Stacked bar chart
  11. Donut chart
  12. Spline chart
  13. Line chart with regions
  14. Stacked area chart
  15. Scatter plot
  16. Gauge chart

These pre-built types make charting a bit more plug-and-play — provided that you are looking to use the basic types above.

Personally, I’ve used a combination of C3 and D3, as necessary. I’ve also used C3 and customized some of the above types by writing some additional JavaScript code, as well as HTML and CSS. More datavis articles to come.
 

Data Visualization Tools – D3.js

data visualization tools
source: D3.js – Data-Driven Documents

Of all the data visualization tools available, D3.js is arguably the most powerful.

The D3 stands for “Data-Driven Documents.” While the number of types of graphs, charts and other visualizations is probably limitless, there is a fairly steep learning curve: you have to know some basic JavaScript programming for Web pages. That means a working understanding of HTML and CSS (Cascading Style Sheets) goes a long way towards successfully using D3.js.

Some examples of possible visualizations are in the banner image above. D3 was created by Mike Bostock, and there are dozens of examples on his “bl.ocks.org” site. There is also a D3 Gallery hosted on github.com, where the JavaScript library code is available.

As for the learning curve, the examples might make it easier. Fortunately, there are also several JS (JavaScript) data visualization libraries built on top of D3.js that simplify the effort even more. However, you still need a basic understanding of at least how to add a JS library to an HTML Web page and how to use the code.

This is just an introduction to D3’s existence. In later articles, I’ll get into examples using real data, as well as talk about simpler ways to create interactive charts, graphs, etc., for Web pages.

How Humans Learn – Some Ways

source: https://unsplash.com/photos/y0Fa1DEKOKs

Humans are visual, kinesthetic and aural creatures. Those are the three main ways we collectively learn. Most people prefer to learn in two of those three ways, even if they are not consciously aware of it. One will be the primary learning method, and another will be secondary.

  1. Learn from seeing, such as through notes/ written word, slideshows, videos, etc.
  2. Learn from being shown a process, such as a walkthrough training or even videos.
  3. Learn from being told, such as through verbal instruction.

This is, of course, just a nutshell discussion of learning methods. I’ll expand upon this post in the future by covering tools that cater to each learning method.

Dataset Findings – Airbnb New York City Anonymized Home-Sharing Data

by Christopher Harris
source: unsplash

The story about Airbnb and their commitment to an “open and transparent” community is in the NY Times. Unfortunately, the data is only available by appointment at Airbnb’s NY City office, so it’s not exactly open and transparent — yet. Here’s to hoping that the company will make the data available online — even if interested parties have to register for the access.

Dataset Findings – College Scorecard

college library
source: unsplash

If you’re looking for government data on colleges, the College Scorecard, published by the U.S. Department of Education, may be useful to you — whether you’re a student looking for a college, or a writer, or somehow affiliated with the education sector.

The data profiles colleges/ universities which receive federal funding and reports on many variables including, but not limited to:

  • Average annual cost, plus breakdown by family income
  • Financial aid and debt
  • Graduation rate and retention
  • Salary after attending
  • Student population and demographics
  • SAT and ACT test scores, as available
  • Most popular programs

Data is for undergraduate (associate’s, bachelor’s) degrees only, at the moment.

Link: College Scorecard

Dataset Findings – Citizens Police Report Chicago PD Data

by Matt Popovich
source: unsplash

The Invisible Institute supports a number of projects related to public policy, including the Citizens Police Data Project (CPDP). The CPDP is a collaboration with the University of Chicago Law School (specifically, the latter’s Mandel Legal Aid Clinic). Released in Nov 2015, the database contains around 56K “misconduct complaint” records for the Chicago Police force, which consists of over 8,500 officers. The data is compiled from four datasets that the CPD (Chicago Police Department) itself has provided. The timeline is 2001 to 2015, minus 2009-2010.

There are instructions on the CPDB site that explain how you can use FOIA (Freedom of Information Act) to request additional details of allegations of interest.

Links:

  1. https://invisible.institute/police-data/
  2. Citizens Police Data Project
  3. Using FOIA to request details

Free Data Journalism Ebooks for Your Digital Reader

by Alejandro Escamilla
source: unsplash

If you’re looking to increase your data journalism knowledge and need some reading material for the holidays, Online Journalism Blog has an extensive list of around 20 free ebooks for use in your Kindle or other digital reader, or in some cases, just anywhere you can view a PDF file. Data Journalist and Professor Paul Bradshaw provides a quick description of each, so that you can be selective.

Check out the list of free data journalism ebooks at Online Journalism Blog.

Why You Need the Great Suspender Extension for Chrome Web Browser

 

too-many-tabs

Previously, I wrote about the Session Buddy extension for Chrome Web browser, and why you need it if you have a tendency to have too many browser tabs open. Well, the Great Suspender is another browser extension worth installing in Chrome for a similar reason: to reduce the likelihood that Chrome will crash your computer, or even just itself.

Look at the image strip of a browser window above. For most people, that’s too many tabs. For others like myself, it’s not even close to the number of browser tabs that might be open. (I had over 300 tabs open at last count, for legitimate research and client project reasons — up to eight tabs per project and 20+ ongoing projects, at times.) While Chrome does a reasonable job in managing computer memory, there are still problems, and every tab you leave open eats up more memory — even when a tab is not actively being used/ viewed.

The Great Suspender extension does something that Chrome really should have as a native feature: it controls the memory used by open tabs by putting them in a sleep state. The amount of RAM this saves can be phenomenal, and the extension is an absolute must for me while researching online, which is actually every day.

If you use it in tandem with the Session Buddy extension in Chrome (linked above), you can improve your overall online research workflow. You can set specific sites to not be suspended, have auto-sleep by default, and more.

Of the similar Chrome extensions I’ve tried, this is the most reliable, with the best overall workflow, and I can’t do without it. I just wish it had been around years ago.

Please note: banner ads may be affiliate links.
