
Re: [webdesign-429] PHP Performance Boost

From: Jibao John M.
Sent on: Wednesday, 5 November 2014, 11:57 am
Thanks for sharing Eric.

On Thu, Oct 30, 2014 at 10:53 AM, Eric Worrall <[address removed]> wrote:
Hi guys,

Just thought I'd share an interesting technique for boosting PHP performance: use a C or C++ Zend extension.

A year ago I was building a mobile OCR system for a client, based on the Google Tesseract OCR library. The client had a difficult requirement - they had to read text from very dirty images, images which contained lots of visual noise - colourful swirls and decorative graphics. This is a major problem for OCR - optical character recognition systems are easily confused by noise. Thankfully there is a way of filtering out most of the noise - Tesseract comes with a library called Leptonica, which contains very powerful image manipulation algorithms, such as the Tophat filter, which allows you to filter out everything which does not fall within a narrow range of visual characteristics.

The problem is guessing the right visual characteristics to programme into Tophat. Each photograph is unique. There might be some clever way of guessing the right characteristics, based on the average brightness of the image or something similar, but I didn't find it. Instead, since the text I was reading was structured, I was able to create a fitness function, a way of calculating the likelihood that a particular OCR attempt had produced the correct result. This made a brute force OCR reader viable - instead of trying to work out the right parameters for the Tophat image filter, I could try every plausible setting over a wide range, and compare the fitness of each OCR attempt.
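As a rough illustration of the brute force idea, here is a minimal C++ sketch, not the code from the project. FilterParams, SearchResult, bruteForceSearch and the scoreAttempt callback are all hypothetical names; scoreAttempt stands in for "filter the image with these settings, run OCR, and return the fitness of the result", and the assumption that the settings are a width and a height is mine.

```cpp
#include <cassert>
#include <functional>
#include <limits>

// Hypothetical filter settings: assumed here to be the width and height
// of the filter's structuring element. Illustrative only.
struct FilterParams {
    int width;
    int height;
};

// The best settings found so far, with their fitness score.
struct SearchResult {
    FilterParams params;
    double fitness;
};

// Try every (width, height) pair in [minSize, maxSize], stepping by `step`,
// and keep the pair whose attempt scores highest.
// `scoreAttempt` is a placeholder for the real filter + OCR + fitness call.
SearchResult bruteForceSearch(
    int minSize, int maxSize, int step,
    const std::function<double(const FilterParams&)>& scoreAttempt) {
    assert(step > 0 && minSize <= maxSize);
    SearchResult best{{minSize, minSize},
                      -std::numeric_limits<double>::infinity()};
    for (int w = minSize; w <= maxSize; w += step) {
        for (int h = minSize; h <= maxSize; h += step) {
            const FilterParams candidate{w, h};
            const double score = scoreAttempt(candidate);
            if (score > best.fitness) {
                best = SearchResult{candidate, score};
            }
        }
    }
    return best;
}
```

The cost is one OCR attempt per grid point, so the number of attempts grows with the square of the range divided by the step.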

This worked, but it was painfully slow - the mobile app took over a minute to produce a result, as it had to perform thousands of OCR requests on slightly different variations of the filtered image, to read the text. I tried various seeking strategies, but there were lots of local maxima in my fitness function - the only reliable way I found of producing a high quality result was to brute force the entire landscape of possible Tophat filter settings.
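To show why a seeking strategy can fail on a landscape like this, here is a small C++ sketch on a made-up one-dimensional fitness landscape. The hillClimb and exhaustiveBest functions and the idea of storing the landscape in a vector are illustrative assumptions, not the strategies I actually tried.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy hill climb: keep moving to the better neighbour until neither
// neighbour is an improvement. Stops at the first peak it reaches,
// which may only be a local maximum.
std::size_t hillClimb(const std::vector<double>& fitness, std::size_t start) {
    assert(start < fitness.size());
    std::size_t pos = start;
    for (;;) {
        std::size_t next = pos;
        if (pos > 0 && fitness[pos - 1] > fitness[next]) {
            next = pos - 1;
        }
        if (pos + 1 < fitness.size() && fitness[pos + 1] > fitness[next]) {
            next = pos + 1;
        }
        if (next == pos) {
            return pos;  // no better neighbour: we are on a peak
        }
        pos = next;
    }
}

// Exhaustive search: look at every position, so the global maximum
// cannot be missed.
std::size_t exhaustiveBest(const std::vector<double>& fitness) {
    assert(!fitness.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < fitness.size(); ++i) {
        if (fitness[i] > fitness[best]) {
            best = i;
        }
    }
    return best;
}
```

On a landscape such as {1, 3, 2, 1, 4, 9, 5}, a climb starting at position 0 stops on the small peak at position 1, while the exhaustive search finds the real peak at position 5.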

So I set up a PHP / OCR server.

My first version was more or less an attempt to expose the Tesseract and Leptonica library functions as PHP functions, so I could orchestrate the process using PHP. This was faster than the mobile app, but it still didn't produce the performance I needed.

I then created a new version, which pulled out all the stops - a multi-threaded Zend / PHP extension, a C++ library which did most of the processing, and exposed a simple interface to PHP, to allow PHP to supply the image, and to report the result.
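The multi-threaded part can be sketched roughly as below. This is a simplified illustration using std::thread, not the extension's actual code: Scored, parallelBest and scoreCandidate are hypothetical names, candidates are identified by a plain index, and scoreCandidate is assumed to be safe to call from several threads at once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <thread>
#include <vector>

// Index of a candidate filter setting and the fitness it scored.
struct Scored {
    std::size_t index;
    double fitness;
};

// Score candidates 0..count-1 on `threadCount` worker threads and return
// the best one. `scoreCandidate` is a placeholder for the real
// filter + OCR + fitness call and must be thread-safe.
Scored parallelBest(std::size_t count, unsigned threadCount,
                    const std::function<double(std::size_t)>& scoreCandidate) {
    assert(scoreCandidate != nullptr);
    const double lowest = -std::numeric_limits<double>::infinity();
    threadCount = std::max(1u, threadCount);

    // One result slot per worker, so no locking is needed.
    std::vector<Scored> partial(threadCount, Scored{0, lowest});
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        workers.emplace_back([&, t] {
            // Worker t scores candidates t, t + threadCount, ...
            for (std::size_t i = t; i < count; i += threadCount) {
                const double score = scoreCandidate(i);
                if (score > partial[t].fitness) {
                    partial[t] = Scored{i, score};
                }
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }

    // Combine the per-worker winners into one overall winner.
    Scored best{0, lowest};
    for (const auto& p : partial) {
        if (p.fitness > best.fitness) {
            best = p;
        }
    }
    return best;
}
```

Because every OCR attempt is independent of the others, the work splits cleanly across cores, and the PHP side only needs to pass in the image and read back the single best result.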

This was blindingly fast - it produced a result in under 10 seconds, even including the latency of uploading a high resolution image of the text. The actual brute force OCR operation took less than 2 seconds.

If your PHP code is not delivering the performance you need, particularly if it is doing something processor intensive, such as image manipulation, then consider converting the most processor intensive components into C++.

Eric Worrall
http://desirableapps.com.au


--
http://www.meetup.com/The-Brisbane-Web-Design-Meetup-Group/
This message was sent by Eric Worrall ([address removed]) from The Brisbane Web Design Meetup Group.