Information Extraction:
Get data from TechCrunch, or from your own documents or content.
Do you have a large volume of documents or content in one domain? Like the news from TechCrunch!
What if you don't already have a database, or it doesn't have all the information that you need? Imagine you don't have CrunchBase, or CrunchBase is not enough.
What if you need to extract data from them? For example, to find out funding amounts throughout the USA over the past few years.
What if exploring all the documents one by one was very expensive or just impossible? Have you tried to read the last ~100K news articles from TechCrunch? ;-)
TechCrunch News, through IEPY

We prototyped an Information Extraction system using TechCrunch News as the input and analyzed the output against CrunchBase.

Funded Companies vs Average money raised

Both the VC industry and the specialized press have been discussing trends in funding rounds in recent years.

Here are some differences.

Company fundings count

Extracted from TechCrunch News vs CrunchBase

Extracted events will always be far fewer than the real events, in part because the source (TechCrunch News) doesn't cover every funding round.

Company funding rounds money raised

Extracted from TechCrunch News vs CrunchBase

Funding amounts raised over time

Extracted with IEPY from TechCrunch News

TechCrunch coverage heatmap

How many of the funding events in CrunchBase were covered by TechCrunch?
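
One way to read a coverage figure like this: take the funding events listed in the reference database, and count how many of them also show up among the events extracted from the news. The sketch below is only an illustration of that ratio, not the method used for this demo. The `coverage` helper, the (company, month) event keys, and all the sample data are made up for the example.

```python
from typing import Iterable, Tuple

# Hypothetical event key: (company name, "YYYY-MM" of the funding round).
Event = Tuple[str, str]


def coverage(reference_events: Iterable[Event],
             extracted_events: Iterable[Event]) -> float:
    """Fraction of reference events that also appear among the extracted ones."""
    reference = set(reference_events)
    if not reference:
        return 0.0  # nothing to cover, so avoid dividing by zero
    return len(reference & set(extracted_events)) / len(reference)


# Toy data, invented for illustration only.
crunchbase = {
    ("acme", "2013-05"),
    ("globex", "2013-07"),
    ("initech", "2014-01"),
    ("umbrella", "2014-03"),
}
extracted = {
    ("acme", "2013-05"),
    ("initech", "2014-01"),
    ("hooli", "2014-02"),  # extracted from news but absent from the reference
}

ratio = coverage(crunchbase, extracted)
print(f"Covered {ratio:.0%} of the reference events")
```

In practice, matching events is the hard part: company names and dates rarely line up exactly between sources, so a real comparison would likely need some normalization before a set intersection like this is meaningful.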

Most funded locations extracted with IEPY

The location considered is that of the headquarters of the companies being funded.
Does it make sense to run IEPY to deal with...

Technical documentation with information about parts and/or processes? YES
Tons of user comments? Comments accumulated and organized by product or service, categorized by merchant, or simply uncategorized? YES
Financial reports about companies or products with georeferences, time references, references to other entities, or categorized in some way? YES
Scientific content, academic literature, forums, sport/entertainment news, undisclosed documents, government documents, legal documents, wikis or any other unstructured text sources? YES
Does it work for different languages?
Yes. Besides English, Spanish is also supported. Other languages may be supported in the future.
How much time did you invest in tuning vanilla IEPY for this demo?
A few weeks, not months, not years.
Do I need any other tools or to complete any other tasks?
It always depends on your problem. As usual in data science projects, keep in mind that you will likely need to complete a number of steps in advance to clean and/or classify your documents or data. It would be great to have people with some computer science experience and some knowledge of Machine Learning on your team, but if you don't, please don't hesitate to contact us or to ask about it on the public lists around the topic.
What is the main thing that I should keep in mind with this tool if I'm not a Machine Learning expert?
This is not magic. To succeed, you will need humans in the loop to help IEPY train its models (Active Learning). Also, to succeed with Information Extraction, you should have one or more explicit relationships among entities that you are fairly sure exist in the documents.
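
To give a rough feel for what "humans in the loop" can mean, here is a minimal sketch of uncertainty sampling, one common Active Learning strategy: the system asks a person to label the examples the current model is least sure about. This is not IEPY's API or necessarily the strategy it uses. The `pick_most_uncertain` helper, the keyword-based `toy_score` stand-in for a trained model, and the example sentences are all hypothetical.

```python
from typing import Callable, List


def pick_most_uncertain(candidates: List[str],
                        score: Callable[[str], float],
                        k: int = 2) -> List[str]:
    """Return the k candidates whose score is closest to 0.5 (least certain)."""
    return sorted(candidates, key=lambda c: abs(score(c) - 0.5))[:k]


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in for a model: crude confidence that the
    sentence describes a funding event, based on a few keywords."""
    text = sentence.lower()
    score = 0.5
    if "raised" in text:
        score += 0.3
    if "$" in text:
        score += 0.15
    if "hired" in text:
        score -= 0.4
    return min(max(score, 0.0), 1.0)


# Invented sentences, for illustration only.
candidates = [
    "Acme raised $5M in a Series A round.",
    "Acme hired a new CFO.",
    "Acme raised eyebrows with its launch.",
    "Acme is valued at $1B.",
]

# The two sentences the toy model is least sure about go to a human labeler.
to_label = pick_most_uncertain(candidates, toy_score, k=2)
for sentence in to_label:
    print("Please label:", sentence)
```

The human's answers would then be added to the training data and the model retrained, so each round of questions targets what the model still gets wrong.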